diff options
author | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-07-24 09:54:23 +0000 |
---|---|---|
committer | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-07-24 09:54:44 +0000 |
commit | 836b47cb7e99a977c5a23b059ca1d0b5065d310e (patch) | |
tree | 1604da8f482d02effa033c94a84be42bc0c848c3 /src/health/guides/consul | |
parent | Releasing debian version 1.44.3-2. (diff) | |
download | netdata-836b47cb7e99a977c5a23b059ca1d0b5065d310e.tar.xz netdata-836b47cb7e99a977c5a23b059ca1d0b5065d310e.zip |
Merging upstream version 1.46.3.
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'src/health/guides/consul')
12 files changed, 497 insertions, 0 deletions
diff --git a/src/health/guides/consul/consul_autopilot_health_status.md b/src/health/guides/consul/consul_autopilot_health_status.md new file mode 100644 index 000000000..42ccab5a6 --- /dev/null +++ b/src/health/guides/consul/consul_autopilot_health_status.md @@ -0,0 +1,53 @@ +### Understand the alert + +This alert checks the health status of the Consul cluster regarding its autopilot functionality. If you receive this alert, it means that the Consul datacenter is experiencing issues, and its health status has been reported as `unhealthy` by the Consul server. + +### What is Consul autopilot? + +Consul's autopilot feature provides automatic management and stabilization features for Consul server clusters, ensuring that the clusters remain in a healthy state. These features include server health monitoring, automatic dead server reaping, and stable server introduction. + +### What does unhealthy mean? + +An unhealthy Consul cluster could experience issues regarding its operations, services, leader elections, and cluster consistency. In this alert scenario, the cluster health functionality is not working correctly, and it could lead to stability and performance problems. + +### Troubleshoot the alert + +Here are some steps to troubleshoot the consul_autopilot_health_status alert: + +1. Check the logs of the Consul server to identify any error messages or warning signs. The logs will often provide insights into the underlying problems. + + ``` + journalctl -u consul + ``` + +2. Inspect the Consul health status using the Consul CLI or API: + + ``` + consul operator autopilot get-config + ``` + + Using the Consul HTTP API: + ``` + curl http://<consul_server>:8500/v1/operator/autopilot/health + ``` + +3. Verify the configuration of Consul servers, check the `retry_join` and addresses of the Consul servers in the configuration file: + + ``` + cat /etc/consul.d/consul.hcl | grep retry_join + ``` + +4. Ensure that there is a sufficient number of Consul servers and that they are healthy. The `consul members` command will show the status of cluster members: + + ``` + consul members + ``` + +5. Check the network connectivity between Consul servers by running network diagnostics like ping and traceroute. + +6. Review Consul documentation to gain a deeper understanding of the autopilot health issues and potential configuration problems. + + +### Useful resources + +- [Consul CLI reference](https://www.consul.io/docs/commands) diff --git a/src/health/guides/consul/consul_autopilot_server_health_status.md b/src/health/guides/consul/consul_autopilot_server_health_status.md new file mode 100644 index 000000000..687c2bb1d --- /dev/null +++ b/src/health/guides/consul/consul_autopilot_server_health_status.md @@ -0,0 +1,48 @@ +### Understand the alert + +The `consul_autopilot_server_health_status` alert triggers when a Consul server in your service mesh is marked `unhealthy`. This can affect the overall stability and performance of the service mesh. Regular monitoring and addressing unhealthy servers are crucial in maintaining a smooth functioning environment. + +### What is Consul? + +`Consul` is a service mesh solution that provides a full-featured control plane with service discovery, configuration, and segmentation functionalities. It is used to connect, secure, and configure services across any runtime platform and public or private cloud. + +### Troubleshoot the alert + +Follow the steps below to identify and resolve the issue of an unhealthy Consul server: + +1. Check Consul server logs + + Inspect the logs of the unhealthy server to identify the root cause of the issue. You can find logs typically in `/var/log/consul` or use `journalctl` with Consul: + + ``` + journalctl -u consul + ``` + +2. Verify connectivity + + Ensure that the unhealthy server can communicate with other servers in the datacenter. Check for any misconfigurations or network issues. + +3. Review server resources + + Monitor the resource usage of the unhealthy server (CPU, memory, disk I/O, network). High resource usage can impact the server's health status. Use tools like `top`, `htop`, `iotop`, or `nload` to monitor the resources. + +4. Restart the Consul server + + If the issue persists and you cannot identify the root cause, try restarting the Consul server: + + ``` + sudo systemctl restart consul + ``` + +5. Refer to Consul's documentation + + Consult the official [Consul troubleshooting documentation](https://developer.hashicorp.com/consul/tutorials/datacenter-operations/troubleshooting) for further assistance. + +6. Inspect the Consul UI + + Check the Consul UI for the server health status and any additional information related to the unhealthy server. You can find the Consul UI at `http://<consul-server-ip>:8500/ui/`. + +### Useful resources + +1. [Consul Documentation](https://www.consul.io/docs) +2. [Running Consul as a Systemd Service](https://learn.hashicorp.com/tutorials/consul/deployment-guide#systemd-service) diff --git a/src/health/guides/consul/consul_client_rpc_requests_exceeded.md b/src/health/guides/consul/consul_client_rpc_requests_exceeded.md new file mode 100644 index 000000000..eab01e820 --- /dev/null +++ b/src/health/guides/consul/consul_client_rpc_requests_exceeded.md @@ -0,0 +1,38 @@ +### Understand the alert + +This alert triggers when the rate of rate-limited RPC (Remote Procedure Call) requests made by a Consul server within the specified datacenter has exceeded a certain threshold. If you receive this alert, it means that your Consul server is experiencing an increased number of rate-limited RPC requests, which may affect its performance and availability. + +### What is Consul? + +Consul is a service mesh solution used for service discovery, configuration, and segmentation. It provides a distributed platform to build robust, scalable, and secured services while simplifying network infrastructure. + +### What are RPC requests? + +Remote Procedure Call (RPC) is a protocol that allows a computer to execute a procedure on another computer across a network. In the context of Consul, RPC requests are used for communication between Consul servers and clients. + +### Troubleshoot the alert + +1. Check the Consul server logs for any relevant error messages or warnings. These logs can provide valuable information on the cause of the increased RPC requests. + + ``` + journalctl -u consul + ``` + +2. Monitor the Consul server's resource usage, such as CPU and memory utilization, to ensure that it is not running out of resources. High resource usage may cause an increase in rate-limited RPC requests. + + ``` + top -o +%CPU + ``` + +3. Analyze the Consul client's usage patterns and identify any misconfigured services or clients contributing to the increased RPC requests. Identify any services that may be sending a high number of requests per second or are not appropriately rate-limited. + +4. Review the Consul rate-limiting configurations to ensure that they are set appropriately based on the expected workload. Adjust the rate limits if necessary to better accommodate the workload. + +5. If the issue persists, consider scaling up the Consul server resources or deploying more Consul servers to handle increased traffic and prevent performance issues. + +### Useful resources + +1. [Consul Official Documentation](https://www.consul.io/docs/) +2. [Consul Rate Limiting Guide](https://developer.hashicorp.com/consul/docs/agent/limits) +3. [Understanding Remote Procedure Calls (RPC)](https://www.smashingmagazine.com/2016/09/understanding-rest-and-rpc-for-http-apis/) +4. [Troubleshooting Consul](https://developer.hashicorp.com/consul/tutorials/datacenter-operations/troubleshooting) diff --git a/src/health/guides/consul/consul_client_rpc_requests_failed.md b/src/health/guides/consul/consul_client_rpc_requests_failed.md new file mode 100644 index 000000000..7d8cb3311 --- /dev/null +++ b/src/health/guides/consul/consul_client_rpc_requests_failed.md @@ -0,0 +1,39 @@ +### Understand the alert + +This alert is triggered when the number of failed RPC (Remote Procedure Call) requests made by the Consul server in a datacenter surpasses a specific threshold. Consul is a service mesh solution and is responsible for discovering, configuring, and segmenting services in distributed systems. + +### What are RPC requests? + +Remote Procedure Call (RPC) is a protocol that allows one computer to execute remote procedures (subroutines) on another computer. In the context of Consul, clients make RPC requests to servers to obtain information about the service configurations or to execute actions. + +### What does it mean when RPC requests fail? + +When Consul's client RPC requests fail, it means that there is an issue in the communication between the Consul client and the server. It could be due to various reasons like network issues, incorrect configurations, high server load, or even software bugs. + +### Troubleshoot the alert + +1. Verify the connectivity between Consul clients and servers. + + Check the network connections between the Consul client and the server. Ensure that the required ports are open and the network is functioning correctly. You can use tools like `ping`, `traceroute`, and `telnet` to verify connectivity. + +2. Check Consul server logs. + + Analyze the Consul server's logs to look for any error messages or unusual patterns related to RPC requests. Server logs can be found in the default Consul log directory, usually `/var/log/consul`. + +3. Review Consul client and server configurations. + + Ensure that Consul client and server configurations are correct and in accordance with the best practices. You can find more information about Consul's configuration recommendations [here](https://learn.hashicorp.com/tutorials/consul/reference-architecture?in=consul/production-deploy). + +4. Monitor server load and resources. + + High server load or resource constraints can cause RPC request failures. Monitor your Consul servers' CPU, memory, and disk usage. If you find any resource bottlenecks, consider adjusting the server's resource allocation or scaling your Consul servers horizontally. + +5. Update Consul to the latest version. + + Software bugs can lead to RPC request failures. Ensure that your Consul clients and servers are running the latest version of Consul. Check the [Consul releases page](https://github.com/hashicorp/consul/releases) for the latest version. + +### Useful resources + +1. [Consul official documentation](https://www.consul.io/docs) +2. [Consul Reference Architecture](https://learn.hashicorp.com/tutorials/consul/reference-architecture?in=consul/production-deploy) +3. [Troubleshooting Consul guide](https://developer.hashicorp.com/consul/tutorials/datacenter-operations/troubleshooting) diff --git a/src/health/guides/consul/consul_gc_pause_time.md b/src/health/guides/consul/consul_gc_pause_time.md new file mode 100644 index 000000000..c4408234b --- /dev/null +++ b/src/health/guides/consul/consul_gc_pause_time.md @@ -0,0 +1,23 @@ +### Understand the alert + +This alert calculates the time spent in stop-the-world garbage collection (GC) pauses on a Consul server node within a one-minute interval. Consul is a distributed service mesh software providing service discovery, configuration, and segmentation functionality. If you receive this alert, it means that the Consul server is experiencing an increased amount of time in GC pauses, which may lead to performance degradation of your service mesh. + +### What are garbage collection pauses? + +Garbage collection (GC) in Consul is a mechanism to clean up unused memory resources and improve the overall system performance. During a GC pause, all running processes in Consul server are stopped to allow the garbage collection process to complete. If the duration of GC pauses is too high, it indicates that the Consul server might be under memory pressure, which can affect the overall performance of the system. + +### Troubleshoot the alert + +1. **Check the Consul server logs**: Examine the Consul server's logs for any errors or warnings related to memory pressure, increased heap usage, or GC pauses. You can typically find the logs in `/var/log/consul`. + +2. **Monitor Consul server metrics**: Check the Consul server's memory usage, heap usage and GC pause metrics using or Netdata. This can help you identify the cause of increased GC pause time. + +3. **Optimize Consul server configuration**: Ensure that your Consul server is properly configured based on your system resources and workload. Review and adjust the [Consul server configuration parameters](https://www.consul.io/docs/agent/options) as needed. + +4. **Reduce memory pressure**: If you have identified memory pressure as the root cause, consider adding more memory resources to your Consul server or adjusting the Consul server's memory limits. + +5. **Update Consul server**: Make sure that your Consul server is running the latest version, which can include optimizations and performance improvements. + +### Useful resources + +- [Consul Server Configuration Parameters](https://www.consul.io/docs/agent/options) diff --git a/src/health/guides/consul/consul_license_expiration_time.md b/src/health/guides/consul/consul_license_expiration_time.md new file mode 100644 index 000000000..3f86b0845 --- /dev/null +++ b/src/health/guides/consul/consul_license_expiration_time.md @@ -0,0 +1,50 @@ +### Understand the alert + +This alert checks the Consul Enterprise license expiration time. It triggers a warning if the license expiration time is less than 14 days, and critical if it's less than 7 days. + +_consul.license_expiration_time_: Monitors the remaining time in seconds until the Consul Enterprise license expires. + +### What is Consul? + +Consul is a service mesh solution that enables organizations to discover services and safely process network traffic across dynamic, distributed environments. + +### Troubleshoot the alert + +1. Check the current license expiration time + + You can check the remaining license expiration time for your Consul Enterprise instance using the Consul API: + + ``` + curl http://localhost:8500/v1/operator/license + ``` + + Look for the `ExpirationTime` field in the returned JSON output. + +2. Renew the license + + If your license is about to expire, you will need to acquire a new license. Contact [HashiCorp Support](https://support.hashicorp.com/) to obtain and renew the license key. + +3. Apply the new license + + You can apply the new license key either by restarting Consul with the new key specified via the `CONSUL_LICENSE` environment variable or the `license_path` configuration option, or by updating the license through the Consul API: + + ``` + curl -X PUT -d @new_license.json http://localhost:8500/v1/operator/license + ``` + + Replace `new_license.json` with the path to a file containing the new license key in JSON format. + +4. Verify the new license expiration time + + After applying the new license, you can check the new license expiration time using the Consul API again: + + ``` + curl http://localhost:8500/v1/operator/license + ``` + + Ensure that the `ExpirationTime` field shows the new expiration time. + +### Useful resources + +1. [Consul License Documentation](https://www.consul.io/docs/enterprise/license) +2. [HashiCorp Support](https://support.hashicorp.com/) diff --git a/src/health/guides/consul/consul_node_health_check_status.md b/src/health/guides/consul/consul_node_health_check_status.md new file mode 100644 index 000000000..44b431edc --- /dev/null +++ b/src/health/guides/consul/consul_node_health_check_status.md @@ -0,0 +1,34 @@ +### Understand the alert + +This alert is triggered when a Consul node health check status indicates a failure. Consul is a service mesh solution for service discovery and configuration. If you receive this alert, it means that the health check for a specific service on a node within the Consul cluster has failed. + +### What does the health check status mean? + +Consul performs health checks to ensure the services registered within the cluster are functioning as expected. The health check status represents the result of these checks, with a non-zero value indicating a failed health check. A failed health check can potentially cause downtime or degraded performance for the affected service. + +### Troubleshoot the alert + +1. Check the alert details: The alert information provided should include the `check_name`, `node_name`, and `datacenter` affected. Note these details as they will be useful in further troubleshooting. + +2. Verify the health check status in Consul: To confirm the health check failure, access the Consul UI or use the Consul command-line tool to query the health status of the affected service and node: + + ``` + consul members + ``` + + ``` + consul monitor + ``` + +3. Investigate the failed service: Once you confirm the health check failure, start investigating the specific service affected. Check logs, resource usage, configuration files, and other relevant information to identify the root cause of the failure. + +4. Fix the issue: Based on your investigation, apply the necessary fixes to the service or its configuration. This may include restarting the service, adjusting resource allocation, or fixing any configuration errors. + +5. Verify service health: After applying the required fixes, verify the health status of the service once again through the Consul UI or command-line tool. If the service health check status has returned to normal (zero value), the issue has been resolved. + +6. Monitor for any recurrence: Keep an eye on the service, node, and overall Consul cluster health to ensure the issue does not reappear and to catch any other potential problems. + +### Useful resources + +1. [Consul documentation](https://www.consul.io/docs/) +2. [Service and Node Health](https://www.consul.io/api-docs/health) diff --git a/src/health/guides/consul/consul_raft_leader_last_contact_time.md b/src/health/guides/consul/consul_raft_leader_last_contact_time.md new file mode 100644 index 000000000..baa6ed462 --- /dev/null +++ b/src/health/guides/consul/consul_raft_leader_last_contact_time.md @@ -0,0 +1,40 @@ +### Understand the alert + +This alert monitors the time since the Consul Raft leader server was last able to contact its follower nodes. If the time since the last contact exceeds the warning or critical thresholds, the alert will be triggered. High values indicate a potential issue with the Consul Raft leader's connection to its follower nodes. + +### Troubleshoot the alert + +1. Check Consul logs + +Inspect the logs of the Consul leader server and follower nodes for any errors or relevant information. You can find the logs in `/var/log/consul` by default. + +2. Verify Consul agent health + +Ensure that the Consul agents running on the leader and follower nodes are healthy. Use the following command to check the overall health: + + ``` + consul members + ``` + +3. Review networking connectivity + +Check the network connectivity between the leader and follower nodes. Verify the nodes are reachable, and there are no firewalls or security groups blocking the necessary ports. Consul uses these ports by default: + + - Server RPC (8300) + - Serf LAN (8301) + - Serf WAN (8302) + - HTTP API (8500) + - DNS Interface (8600) + +4. Monitor Consul server's resource usage + +Ensure that the Consul server isn't facing any resource constraints, such as high CPU, memory, or disk usage. Use system monitoring tools like `top`, `vmstat`, or `iotop` to observe resource usage and address bottlenecks. + +5. Verify the Consul server configuration + +Examine the Consul server's configuration file (usually located at `/etc/consul/consul.hcl`) and ensure that there are no errors, inconsistencies, or misconfigurations with server addresses, datacenter names, or communication settings. + +### Useful resources + +1. [Consul Docs: Troubleshooting](https://developer.hashicorp.com/consul/tutorials/datacenter-operations/troubleshooting) +2. [Consul Docs: Agent Configuration](https://www.consul.io/docs/agent/options) diff --git a/src/health/guides/consul/consul_raft_leadership_transitions.md b/src/health/guides/consul/consul_raft_leadership_transitions.md new file mode 100644 index 000000000..59eb3e738 --- /dev/null +++ b/src/health/guides/consul/consul_raft_leadership_transitions.md @@ -0,0 +1,54 @@ +### Understand the alert + +This alert triggers when there is a `leadership transition` in the `Consul` service mesh. If you receive this alert, it means that server `${label:node_name}` in datacenter `${label:datacenter}` has become the new leader. + +### What does consul_raft_leadership_transitions mean? + +Consul is a service mesh solution that provides service discovery, configuration, and segmentation functionality. It uses the Raft consensus algorithm to maintain a consistent data state across the cluster. A leadership transition occurs when the current leader node loses its leadership status and a different node takes over. + +### What causes leadership transitions? + +Leadership transitions in Consul can be caused by various reasons, such as: + +1. Network communication issues between the nodes. +2. High resource utilization on the leader node, causing it to miss heartbeat messages. +3. Nodes crashing or being intentionally shut down. +4. A forced leadership transition triggered by an operator. + +Frequent leadership transitions may lead to service disruptions, increased latency, and reduced availability. Therefore, it's essential to identify and resolve the root cause promptly. + +### Troubleshoot the alert + +1. Check the Consul logs for indications of network issues or node failures: + + ``` + journalctl -u consul.service + ``` + Alternatively, you can check the Consul log file, which is usually located at `/var/log/consul/consul.log`. + +2. Inspect the health and status of the Consul cluster using the `consul members` command: + + ``` + consul members + ``` + This command lists all cluster members and their roles, including the new leader node. + +3. Determine if there's high resource usage on the affected nodes by monitoring CPU, memory, and disk usage: + + ``` + top + ``` + +4. Examine network connectivity between nodes using tools like `ping`, `traceroute`, or `mtr`. + +5. If the transitions are forced by operators, review the changes made and their impact on the cluster. + +6. Consider increasing the heartbeat timeout configuration to allow the leader more time to respond, especially if high resource usage is causing frequent leadership transitions. + +7. Review Consul's documentation on [consensus and leadership](https://developer.hashicorp.com/consul/docs/architecture/consensus) and [operation and maintenance](https://developer.hashicorp.com/consul/docs/guides) to gain insights into best practices and ways to mitigate leadership transitions. + +### Useful resources + +1. [Consul: Service Mesh Overview](https://www.consul.io/docs/intro) +2. [Consul: Understanding Consensus and Leadership](https://developer.hashicorp.com/consul/docs/architecture/consensus) +3. [Consul: Installation, Configuration, and Maintenance](https://developer.hashicorp.com/consul/docs/guides) diff --git a/src/health/guides/consul/consul_raft_thread_fsm_saturation.md b/src/health/guides/consul/consul_raft_thread_fsm_saturation.md new file mode 100644 index 000000000..12c5f7df3 --- /dev/null +++ b/src/health/guides/consul/consul_raft_thread_fsm_saturation.md @@ -0,0 +1,42 @@ +### Understand the alert + +This alert monitors the `consul_raft_thread_fsm_saturation` metric, which represents the saturation of the `FSM Raft` goroutine in Consul, a service mesh. If you receive this alert, it indicates that the Raft goroutine on a specific Consul server is becoming saturated. + +### What is Consul? + +Consul is a distributed service mesh that provides a full-featured control plane with service discovery, configuration, and segmentation functionalities. It enables organizations to build and operate large-scale, dynamic, and resilient systems. The Raft FSM goroutine is responsible for executing finite state machine (FSM) operations on the Consul servers. + +### What does FSM Raft goroutine saturation mean? + +Saturation of the FSM Raft goroutine means that it is spending more time executing operations, which may cause delays in Consul's ability to process requests and manage the overall service mesh. High saturation levels can lead to performance issues, increased latency, or even downtime for your Consul deployment. + +### Troubleshoot the alert + +1. Identify the Consul server and datacenter with the high Raft goroutine saturation: + + The alert has labels `label:node_name` and `label:datacenter`, indicating the affected Consul server and its respective datacenter. + +2. Examine Consul server logs: + + Check the logs of the affected Consul server for any error messages or indications of high resource usage. This can provide valuable information on the cause of the saturation. + +3. Monitor Consul cluster performance: + + Use Consul's built-in monitoring tools to keep an eye on your Consul cluster's health and performance. For instance, you may monitor Raft metrics via the Consul `/v1/agent/metrics` API endpoint. + +4. Scale your Consul infrastructure: + + If the increased saturation is due to high demand, scaling your Consul infrastructure by adding more servers or increasing the resources available to existing servers can help mitigate the issue. + +5. Review and optimize Consul configuration: + + Review your Consul configuration and make any necessary optimizations to ensure the best performance. For instance, you could adjust the [Raft read and write timeouts](https://www.consul.io/docs/agent/options). + +6. Investigate and resolve any underlying issues causing the saturation: + + Look for any factors contributing to the increased load on the FSM Raft goroutine and address those issues. This may involve reviewing application workloads, network latency, or hardware limitations. + +### Useful resources + +1. [Consul Telemetry](https://www.consul.io/docs/agent/telemetry) +2. [Consul Configuration - Raft](https://www.consul.io/docs/agent/options#raft) diff --git a/src/health/guides/consul/consul_raft_thread_main_saturation.md b/src/health/guides/consul/consul_raft_thread_main_saturation.md new file mode 100644 index 000000000..7f33627d0 --- /dev/null +++ b/src/health/guides/consul/consul_raft_thread_main_saturation.md @@ -0,0 +1,41 @@ +### Understand the alert + +This alert triggers when the main Raft goroutine's saturation percentage reaches a certain threshold. If you receive this alert, it means that your Consul server is experiencing high utilization of the main Raft goroutine. + +### What is Consul? + +Consul is a service discovery, configuration, and orchestration solution developed by HashiCorp. It is used in microservice architectures and distributed systems to make services aware and discoverable by other services. Raft is a consensus-based algorithm used for maintaining the state of the Consul servers. + +### What is the main Raft goroutine? + +The main Raft goroutine is responsible for carrying out consensus-related tasks in the Consul server. It ensures the consistency and reliability of the server's state. High saturation of this goroutine can lead to performance issues in the Consul server cluster. + +### Troubleshoot the alert + +1. Verify the current status of the Consul server. + Check the health status and logs of the Consul server using the following command: + ``` + consul monitor + ``` + +2. Monitor Raft metrics. + Use the Consul telemetry feature to collect and analyze Raft performance metrics. Consult the [Consul official documentation](https://www.consul.io/docs/agent/telemetry) on setting up telemetry. + +3. Review the server's resources. + Confirm whether the server hosting the Consul service has enough resources (CPU, memory, and disk space) to handle the current load. Upgrade the server resources or adjust the Consul configurations accordingly. + +4. Inspect the Consul server's log files. + Analyze the log files to identify any errors or issues that could be affecting the performance of the main Raft goroutine. + +5. Monitor network latency between Consul servers. + High network latency can affect the performance of the Raft algorithm. Use monitoring tools like `ping` or `traceroute` to measure the latency between the Consul servers. + +6. Check for disruptions in the Consul cluster. + Investigate possible disruptions caused by external factors, such as server failures, network partitioning or misconfigurations in the cluster. + +### Useful resources + +1. [Consul: Service Mesh for Microservices Networking](https://www.consul.io/) +2. [Consul Documentation](https://www.consul.io/docs) +3. [Consul Telemetry](https://www.consul.io/docs/agent/telemetry) +4. [Understanding Raft Consensus Algorithm](https://raft.github.io/) diff --git a/src/health/guides/consul/consul_service_health_check_status.md b/src/health/guides/consul/consul_service_health_check_status.md new file mode 100644 index 000000000..e9da2508f --- /dev/null +++ b/src/health/guides/consul/consul_service_health_check_status.md @@ -0,0 +1,35 @@ +### Understand the alert + +This alert is triggered when the `health check status` of a service in a `Consul` service mesh changes to a `warning` or `critical` state. It occurs when a service health check for a specific service `${label:service_name}` fails on a server `${label:node_name}` in a datacenter `${label:datacenter}`. + +### What is Consul? + +`Consul` is a service mesh solution developed by HashiCorp that can be used to connect and secure services across dynamic, distributed infrastructure. It maintains a registry of service instances, performs health checks, and offers a flexible and high-performance service discovery mechanism. + +### What is a service health check? + +A service health check is a way to determine whether a particular service in a distributed system is running correctly, reachable, and responsive. It is an essential component of service discovery and can be used to assess the overall health of a distributed system. + +### Troubleshoot the alert + +1. Check the health status of the service that triggered the alert in the Consul UI. + + Access the Consul UI and navigate to the affected service's details page. Look for the health status information and the specific health check that caused the alert. + +2. Inspect the logs of the service that failed the health check. + + Access the logs of the affected service and look for any error messages or events that might have caused the health check to fail. Depending on the service, this might be application logs, system logs, or container logs (if the service is running in a container). + +3. Identify and fix the issue causing the health check failure. + + Based on the information from the logs and your knowledge of the system, address the issue that's causing the health check to fail. This might involve fixing a bug in the service, resolving a connection issue, or making a configuration change. + +4. Verify that the health check status has returned to a healthy state. + + After addressing the issue, monitor the service in the Consul UI and confirm that its health check status has returned to a healthy state. If the issue persists, continue investigating and resolving any underlying causes until the health check is successful. + +### Useful resources + +1. [Consul Introduction](https://www.consul.io/intro) +2. [Consul Health Check Documentation](https://www.consul.io/docs/discovery/checks) +3. [HashiCorp Learn: Consul Service Monitoring](https://learn.hashicorp.com/tutorials/consul/service-monitoring-and-alerting?in=consul/developer-discovery)
\ No newline at end of file |