author | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-03-09 13:19:48 +0000 |
---|---|---|
committer | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-03-09 13:20:02 +0000 |
commit | 58daab21cd043e1dc37024a7f99b396788372918 (patch) | |
tree | 96771e43bb69f7c1c2b0b4f7374cb74d7866d0cb /health/guides/httpcheck | |
parent | Releasing debian version 1.43.2-1. (diff) | |
download | netdata-58daab21cd043e1dc37024a7f99b396788372918.tar.xz netdata-58daab21cd043e1dc37024a7f99b396788372918.zip |
Merging upstream version 1.44.3.
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'health/guides/httpcheck')
7 files changed, 220 insertions, 0 deletions
diff --git a/health/guides/httpcheck/httpcheck_web_service_bad_content.md b/health/guides/httpcheck/httpcheck_web_service_bad_content.md
new file mode 100644
index 000000000..0a5961ca7
--- /dev/null
+++ b/health/guides/httpcheck/httpcheck_web_service_bad_content.md
@@ -0,0 +1,30 @@
+### Understand the alert
+
+The Netdata Agent monitors your HTTP endpoints. You can specify the endpoints that the Agent will monitor in the Agent's Go module under `go.d/httpcheck.conf`. You can also specify the expected response pattern with the `response_match` option. If the endpoint's response does not match the `response_match` pattern, then the Agent marks the response as unexpected.
+
+The Netdata Agent calculates the average ratio of HTTP responses with unexpected content over the last 5 minutes.
+
+This alert is raised to warning if the percentage of unexpected content is greater than 10% and escalated to critical if it is greater than 40%.
+
+### Troubleshoot the alert
+
+Check the actual response and the expected response.
+
+1. Send a request with verbose output:
+
+```
+curl -v <your_http_endpoint>:<port>/<path>
+```
+
+2. Compare it with the expected response.
+
+Check your configuration under `go.d/httpcheck.conf`:
+
+```
+cd /etc/netdata # Replace this path with your Netdata config directory
+sudo ./edit-config go.d/httpcheck.conf
+```
+
+### Useful resources
+
+1. [HTTP endpoint monitoring with Netdata](https://learn.netdata.cloud/docs/agent/collectors/go.d.plugin/modules/httpcheck)
\ No newline at end of file
diff --git a/health/guides/httpcheck/httpcheck_web_service_bad_status.md b/health/guides/httpcheck/httpcheck_web_service_bad_status.md
new file mode 100644
index 000000000..bd9c14341
--- /dev/null
+++ b/health/guides/httpcheck/httpcheck_web_service_bad_status.md
@@ -0,0 +1,21 @@
+### Understand the alert
+
+The `httpcheck_web_service_bad_status` alert is generated by the Netdata Agent when monitoring the status of an HTTP web service using the `httpcheck` collector. This alert is triggered when the HTTP web service returns a non-successful status code (anything other than 2xx or 3xx), indicating that an issue with the web service is preventing it from responding to requests as expected.
+
+### Troubleshoot the alert
+
+1. **Verify the target URL**: Ensure that the target URL configured in the `httpcheck` collector is correct and accessible. Check for any typos or incorrect domain names.
+
+2. **Check the actual response status and the expected response status**: Send a request with verbose output:
+
+```
+curl -v <your_http_endpoint>:<port>/<path>
+```
+
+3. **Verify server resources**: Ensure that your server has enough resources (CPU, RAM, disk space) to handle the current workload. High resource utilization can lead to web service issues. You can use Netdata's dashboard to monitor the server resources in real time.
+
+4. **Check server configuration**: Review the configuration files of the web service for any misconfigurations or settings that may be causing the issue. For example, incorrect permissions, wrong file paths, or improper configurations can lead to bad status codes.
+
+### Useful resources
+
+1. [HTTP endpoint monitoring with Netdata](https://learn.netdata.cloud/docs/agent/collectors/go.d.plugin/modules/httpcheck)
diff --git a/health/guides/httpcheck/httpcheck_web_service_no_connection.md b/health/guides/httpcheck/httpcheck_web_service_no_connection.md
new file mode 100644
index 000000000..0f36803fe
--- /dev/null
+++ b/health/guides/httpcheck/httpcheck_web_service_no_connection.md
@@ -0,0 +1,35 @@
+### Understand the alert
+
+This alert monitors the percentage of failed HTTP requests to a specific URL in the last 5 minutes. If you receive this alert, it means that your web service experienced connection issues.
+
+### Troubleshoot the alert
+
+1. Verify HTTP service status
+
+Check if the web service is running and accepting requests. If the service is down, restart it and monitor the situation.
+
+2. Review server logs
+
+Examine the logs of the web server hosting the HTTP service. Look for any errors or warning messages that may provide more information about the cause of the connection issues.
+
+3. Check network connectivity
+
+If the server hosting the HTTP service is experiencing connectivity issues, it can lead to failed requests. Ensure that the server has stable network connectivity.
+
+4. Monitor server resources
+
+Inspect the server's resource usage to check if it is running out of resources, such as CPU, memory, or disk space. If the server is running low on resources, it can cause the HTTP service to malfunction. In this case, free up resources or upgrade the server.
+
+5. Review client connections
+
+It is also possible that the clients are having connectivity issues. Make sure that the clients have a stable network connection and can connect to the server without any issues.
+
+6. Test the HTTP service
+
+Perform HTTP requests to the service manually or using monitoring tools to measure response times and verify if the issue persists.
+
+### Useful resources
+
+1. [Apache Log Files](https://httpd.apache.org/docs/2.4/logs.html)
+2. [NGINX Log Files](https://docs.nginx.com/nginx/admin-guide/monitoring/logging/)
+3. [HTTP status codes](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status)
diff --git a/health/guides/httpcheck/httpcheck_web_service_slow.md b/health/guides/httpcheck/httpcheck_web_service_slow.md
new file mode 100644
index 000000000..aad2cc8da
--- /dev/null
+++ b/health/guides/httpcheck/httpcheck_web_service_slow.md
@@ -0,0 +1,18 @@
+### Understand the alert
+
+The Netdata Agent monitors your HTTP endpoints. You can specify endpoints the Agent will monitor in the Agent's Go module under `go.d/httpcheck.conf`.
+The Agent calculates the average response time of the HTTP requests made to the monitored endpoint over the last hour. The Agent also calculates the average response time in a 3-min window.
+
+The Netdata Agent compares these two average values. It triggers a warning alert when the 3-min average response time is at least twice the 1-hour average. The alert escalates to critical when the 3-min average reaches three times the 1-hour average.
+
+### Troubleshoot the alert
+
+To troubleshoot this issue, check for:
+
+- Network congestion in your system's network and/or in the remote endpoint's network.
+- High system load on the endpoint's host, if the endpoint is managed by you.
+
+### Useful resources
+
+1. [HTTP endpoint monitoring with Netdata](https://learn.netdata.cloud/docs/agent/collectors/go.d.plugin/modules/httpcheck)
+
diff --git a/health/guides/httpcheck/httpcheck_web_service_timeouts.md b/health/guides/httpcheck/httpcheck_web_service_timeouts.md
new file mode 100644
index 000000000..03e300d1d
--- /dev/null
+++ b/health/guides/httpcheck/httpcheck_web_service_timeouts.md
@@ -0,0 +1,39 @@
+### Understand the alert
+
+This alert is triggered when the percentage of timed-out HTTP requests to a specific URL goes above a certain threshold in the last 5 minutes. The alert levels are determined by the following percentage thresholds:
+
+- Warning: 10% to 40%
+- Critical: 40% or higher
+
+The alert is designed to notify you about potential issues with the accessed HTTP endpoint.
+
+### What does HTTP request timeout mean?
+
+An HTTP request timeout occurs when a client (such as a web browser) sends a request to a web server but does not receive a response within the specified time period. This can lead to a poor user experience, as the user may be unable to access the requested content or services.
+
+### Troubleshoot the alert
+
+- Verify the issue
+
+Check the HTTP endpoint to see if it is responsive and reachable. You can use tools like `curl` or online services like [https://www.isitdownrightnow.com/](https://www.isitdownrightnow.com/) to check the availability of the website or service.
+
+- Analyze server logs
+
+Examine the server logs for any error messages or unusual patterns of behavior that may indicate a root cause for the timeout issue. For web servers such as Apache or Nginx, look for log files located in the `/var/log` directory.
+
+- Check resource usage
+
+High resource usage, such as CPU, memory, or disk I/O, can cause HTTP request timeouts. Use tools like `top`, `vmstat`, or `iotop` to identify resource-intensive processes. Address any performance bottlenecks by resizing the server, optimizing performance, or distributing the load across multiple servers.
+
+- Review server configurations
+
+Make sure your web server configurations are optimized for performance. For instance:
+
+  1. Ensure that the `KeepAlive` feature is enabled and properly configured.
+  2. Make sure that your server's timeout settings are appropriate for the type of traffic and workload it experiences.
+  3. Confirm that your server is correctly configured for the number of concurrent connections it handles.
+
+- Verify network configurations
+
+Examine the network configurations for potential issues that can lead to HTTP request timeouts. Check for misconfigured firewalls or faulty load balancers that may be interfering with traffic to the HTTP endpoint.
+
diff --git a/health/guides/httpcheck/httpcheck_web_service_unreachable.md b/health/guides/httpcheck/httpcheck_web_service_unreachable.md
new file mode 100644
index 000000000..bb6f51bf5
--- /dev/null
+++ b/health/guides/httpcheck/httpcheck_web_service_unreachable.md
@@ -0,0 +1,33 @@
+### Understand the alert
+
+The Netdata Agent monitors your HTTP endpoints. You can specify endpoints the Agent will monitor in the Agent's Go module under `go.d/httpcheck.conf`.
+
+If your system fails to connect to your endpoint, or if the request to that endpoint times out, then the Agent will mark the requests and log them as "unreachable".
+
+The Netdata Agent calculates the ratio of these requests over the last 5 minutes. This alert is raised to warning when the ratio is greater than 10% and escalated to critical when it is greater than 40%.
+
+### Troubleshoot the alert
+
+To troubleshoot this error, check the following:
+
+- Verify that your system has access to the particular endpoint.
+
+  - Check for basic connectivity to known hosts.
+  - Make sure that requests and replies both to and from the endpoint are allowed in the firewall settings. Ensure they're allowed on both your end as well as the endpoint's side.
+
+- Verify that your DNS can resolve endpoints.
+  - Check your current DNS (for example, on Linux you can use the `host` command):
+
+    ```
+    host -v <your_endpoint>
+    ```
+
+  - If the HTTP endpoint is supposed to be a public-facing endpoint, try an alternative DNS (for example Cloudflare's DNS):
+
+    ```
+    host -v <your_endpoint> 1.1.1.1
+    ```
+
+### Useful resources
+
+1. [HTTP endpoint monitoring with Netdata](https://learn.netdata.cloud/docs/agent/collectors/go.d.plugin/modules/httpcheck)
\ No newline at end of file
diff --git a/health/guides/httpcheck/httpcheck_web_service_up.md b/health/guides/httpcheck/httpcheck_web_service_up.md
new file mode 100644
index 000000000..be17fadd5
--- /dev/null
+++ b/health/guides/httpcheck/httpcheck_web_service_up.md
@@ -0,0 +1,44 @@
+### Understand the alert
+
+The `httpcheck_web_service_up` alert monitors the liveness status of an HTTP endpoint by checking its response over the past minute. If the success percentage is below 75%, this alert will trigger, indicating that the web service may be experiencing issues.
+
+### What does an HTTP endpoint liveness status mean?
+
+An HTTP endpoint is like a door where clients make requests to access web services or APIs. The liveness status reveals whether the service is available and responding to client requests. Ideally, this success percentage should be near 100%, indicating that the endpoint is consistently accessible.
+
+### Troubleshoot the alert
+
+1. Check logs for any errors or warnings related to the web server or application.
+
+   Depending on your web server or application, look for log files that may provide insights into the causes of the issues. Some common log locations are:
+
+   - Apache: `/var/log/apache2/`
+   - Nginx: `/var/log/nginx/`
+   - Node.js: Check your application-specific log location.
+
+2. Examine server resources such as CPU, memory, and disk usage.
+
+   High resource usage can cause web services to become slow or unresponsive. Use system monitoring tools like `top`, `htop`, or `free` to check the resource usage.
+
+3. Test the HTTP endpoint manually.
+
+   You can use tools like `curl`, `wget`, or `httpie` to send requests to the HTTP endpoint and inspect the responses. Examine the response codes, headers, and contents to spot any problems.
+
+   Example using `curl`:
+
+   ```
+   curl -I http://example.com/some/endpoint
+   ```
+
+4. Check for network issues between the monitoring agent and the HTTP endpoint.
+
+   Use tools like `ping`, `traceroute`, or `mtr` to check for network latency or packet loss between the monitoring agent and the HTTP endpoint.
+
+5. Review the web server or application configuration.
+
+   Ensure the web server and application configurations are correct and not causing issues. Look for misconfigurations, incorrect settings, or other issues that may affect the liveness of the HTTP endpoint.
+
+### Useful resources
+
+1. [Monitoring Linux Performance with vmstat and iostat](https://www.tecmint.com/linux-performance-monitoring-with-vmstat-and-iostat-commands/)
+2. [16 Useful Bandwidth Monitoring Tools to Analyze Network Usage in Linux](https://www.tecmint.com/linux-network-bandwidth-monitoring-tools/)
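
The guides added in this commit describe two kinds of thresholds: a failure percentage over the last 5 minutes (warning above 10%, critical above 40%) and a slowdown ratio (3-min average at least 2x or 3x the 1-hour average). The sketch below only restates that arithmetic for illustration. It is not Netdata's health engine: the function names `classify_failure_percent` and `classify_slowdown` are hypothetical, and the boundary handling (strictly greater than for percentages) follows the wording of the bad-content and unreachable guides.

```python
def classify_failure_percent(percent: float,
                             warn: float = 10.0,
                             crit: float = 40.0) -> str:
    """Map a failure percentage to an alert status.

    Hypothetical helper: thresholds default to the 10% / 40% values
    quoted in the guides; comparisons are strict ("greater than").
    """
    if percent > crit:
        return "critical"
    if percent > warn:
        return "warning"
    return "clear"


def classify_slowdown(avg_3min: float, avg_1hour: float) -> str:
    """Compare the 3-min average response time with the 1-hour average.

    Hypothetical helper: warning at 2x the hourly average or more,
    critical at 3x or more, as described in the "slow" guide.
    """
    if avg_1hour <= 0:
        # No usable baseline, so no slowdown can be computed.
        return "clear"
    ratio = avg_3min / avg_1hour
    if ratio >= 3:
        return "critical"
    if ratio >= 2:
        return "warning"
    return "clear"
```

For example, under these assumptions 25% of responses with unexpected content would map to a warning, and a 3-min average of 300 ms against a 1-hour average of 100 ms would map to critical.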