Diffstat
-rw-r--r-- | health/guides/riakkv/riakkv_1h_kv_get_mean_latency.md | 52
-rw-r--r-- | health/guides/riakkv/riakkv_1h_kv_put_mean_latency.md | 37
-rw-r--r-- | health/guides/riakkv/riakkv_kv_get_slow.md | 22
-rw-r--r-- | health/guides/riakkv/riakkv_kv_put_slow.md | 43
-rw-r--r-- | health/guides/riakkv/riakkv_list_keys_active.md | 31
-rw-r--r-- | health/guides/riakkv/riakkv_vm_high_process_count.md | 31
6 files changed, 216 insertions, 0 deletions
diff --git a/health/guides/riakkv/riakkv_1h_kv_get_mean_latency.md b/health/guides/riakkv/riakkv_1h_kv_get_mean_latency.md
new file mode 100644
index 00000000..7233423e
--- /dev/null
+++ b/health/guides/riakkv/riakkv_1h_kv_get_mean_latency.md
@@ -0,0 +1,52 @@

### Understand the alert

This alert measures the average time between the reception of client `GET` requests and the subsequent responses in a `Riak KV` cluster over the last hour. If you receive this alert, it means that the average `GET` request latency in your Riak database has increased.

### What is mean latency?

Mean latency is the average time between the start of a request and its completion; it indicates how efficiently the Riak system processes `GET` requests. High mean latency implies slower processing times, which can negatively impact your application's performance.

### Troubleshoot the alert

- Check the system resources

1. High latency might be related to resource bottlenecks on your Riak nodes. Check CPU, memory, and disk usage with the `top` or `htop` tools.
   ```
   top
   ```
   or
   ```
   htop
   ```

2. If you find a resource constraint, consider scaling your Riak cluster or optimizing resource usage by tuning your application's configuration.

- Investigate network issues

1. Networking problems between the Riak nodes, or between the clients and the nodes, can cause increased latency. Check for network performance issues with `ping` or `traceroute`.

   ```
   ping node_ip_address
   ```
   or
   ```
   traceroute node_ip_address
   ```

2. Investigate any anomalies or network congestion and address them accordingly.

- Analyze Riak KV configurations

1. Check the Riak configuration settings, such as read/write quorum parameters and anti-entropy settings, for any misconfigurations.

2. Re-evaluate and optimize these settings for performance based on your application's requirements.

- Monitor application performance

1. Analyze your application's request patterns and workload. High request rates or large amounts of data being fetched can increase latency.

2. Optimize your application workload to reduce latency and distribute requests uniformly across the Riak nodes.

### Useful resources

1. [Riak KV documentation](https://riak.com/posts/technical/official-riak-kv-documentation-2.2/)

diff --git a/health/guides/riakkv/riakkv_1h_kv_put_mean_latency.md b/health/guides/riakkv/riakkv_1h_kv_put_mean_latency.md
new file mode 100644
index 00000000..cc2cad28
--- /dev/null
+++ b/health/guides/riakkv/riakkv_1h_kv_put_mean_latency.md
@@ -0,0 +1,37 @@

### Understand the alert

The `riakkv_1h_kv_put_mean_latency` alert measures the average time (in milliseconds) between the reception of client `PUT` requests and the subsequent responses to the clients over the last hour in a Riak KV database. If you receive this alert, it means that your Riak KV database is experiencing higher than normal latency when processing `PUT` requests.

### What is Riak KV?

Riak KV is a distributed NoSQL key-value data store designed to provide high availability, fault tolerance, operational simplicity, and scalability. The primary access method is through `PUT`, `GET`, `DELETE`, and `LIST` operations on keys.

### What does `PUT` latency mean?

`PUT` latency refers to the time it takes the system to process a `PUT` request, from the moment the server receives the request until it sends a response back to the client. High `PUT` latency can impact the performance and responsiveness of applications relying on the Riak KV database.

### Troubleshoot the alert

- Check the Riak KV cluster health

  Use the `riak-admin cluster status` command to get an overview of the Riak KV cluster's health. Make sure there are no unreachable or down nodes in the cluster.

- Verify the Riak KV node performance

  Use the `riak-admin status` command to display various statistics of the Riak KV nodes. Pay attention to the `node_put_fsm_time_mean` and `node_put_fsm_time_95` metrics, as they are related to `PUT` latency.
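  As a quick check, you can filter the status output down to just the `PUT` timing statistics. A minimal sketch, assuming `riak-admin` is on the `PATH` (metric names can vary slightly between Riak versions; values are reported in microseconds):

  ```
  # Show only the PUT FSM timing statistics
  riak-admin status | grep put_fsm_time
  ```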
- Inspect network conditions

  Use networking tools (e.g., `ping`, `traceroute`, `mtr`, `iftop`) to check for potential network latency issues between the clients and the Riak KV servers.

- Evaluate the workload

  If the client application is heavily write-intensive, consider optimizing it to reduce the number of write operations, or increase the capacity of the Riak KV cluster to handle the load.

- Review Riak KV logs

  Examine the Riak KV logs (`/var/log/riak/riak_kv.log` by default) for any error messages or unusual patterns that might be related to the increased `PUT` latency.

### Useful resources

1. [Riak KV Official Documentation](https://riak.com/docs/)

diff --git a/health/guides/riakkv/riakkv_kv_get_slow.md b/health/guides/riakkv/riakkv_kv_get_slow.md
new file mode 100644
index 00000000..05fd67ce
--- /dev/null
+++ b/health/guides/riakkv/riakkv_kv_get_slow.md
@@ -0,0 +1,22 @@

### Understand the alert

The `riakkv_kv_get_slow` alert is related to Riak KV, a distributed NoSQL key-value data store. This alert is triggered when the average processing time for GET requests over the last 3 minutes increases significantly compared to the average over the last hour. If you receive this alert, it usually means that your Riak KV server is overloaded.

### Troubleshoot the alert

1. **Check the Riak KV server load**: Investigate the current load on your Riak KV server. High CPU, memory, or disk usage can contribute to slow GET request processing times. Use monitoring tools like `top`, `htop`, `vmstat`, or `iotop` to identify any processes consuming excessive resources.

2. **Analyze Riak KV logs**: Inspect the Riak KV logs for any error messages or warnings that could help identify the cause of the slow GET requests. The logs are typically located at `/var/log/riak` or `/var/log/riak_kv`. Look for messages related to timeouts, failures, or high latencies.

3. **Monitor Riak KV metrics**: Check Riak KV metrics, such as read and write latencies, vnode operations, and disk usage, to identify possible bottlenecks contributing to the slow GET requests. Use tools like `riak-admin` or the Riak HTTP API to access these metrics.
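   If the node's HTTP interface is enabled, the same statistics are exposed as JSON via the `/stats` endpoint. A minimal sketch, assuming the default HTTP listener on `127.0.0.1:8098` and that `curl` and `jq` are installed:

   ```
   # Fetch the stats endpoint and extract the GET latency fields (values in microseconds)
   curl -s http://127.0.0.1:8098/stats | jq '{mean: .node_get_fsm_time_mean, p95: .node_get_fsm_time_95, max: .node_get_fsm_time_100}'
   ```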
4. **Optimize query performance**: Analyze your application's Riak KV queries to identify any inefficient GET requests that could be contributing to slow processing times. Consider implementing caching mechanisms or adjusting Riak KV settings to improve query performance.

5. **Evaluate hardware resources**: Ensure that your hardware resources are sufficient to handle the current load on your Riak KV server. If your server has insufficient resources, consider upgrading the hardware or adding nodes to your Riak KV cluster.

### Useful resources

1. [Riak KV documentation](https://riak.com/documentation/)
2. [Monitoring Riak KV with Netdata](https://learn.netdata.cloud/docs/agent/collectors/python.d.plugin/riakkv/)
3. [Riak Control: Monitoring and Administration Interface](https://docs.riak.com/riak/kv/2.2.3/configuring/reference/riak-vars/#riak-control)
4. [Riak KV Monitoring and Metrics](https://docs.riak.com/riak/kv/2.2.3/using/performance/monitoring/index.html)

diff --git a/health/guides/riakkv/riakkv_kv_put_slow.md b/health/guides/riakkv/riakkv_kv_put_slow.md
new file mode 100644
index 00000000..9bd314e7
--- /dev/null
+++ b/health/guides/riakkv/riakkv_kv_put_slow.md
@@ -0,0 +1,43 @@

### Understand the alert

The `riakkv_kv_put_slow` alert is triggered when the average processing time for PUT requests in a Riak KV database increases significantly compared to the last hour's average, suggesting that the server is overloaded.

### What does an overloaded server mean?

An overloaded server is unable to handle incoming requests efficiently, leading to increased processing times and degraded performance. In some cases, it can result in request timeouts or even crashes.

### Troubleshoot the alert

To troubleshoot this alert, follow these steps:

1. **Check current Riak KV performance**

   Use the `riak-admin` tool's `status` command to check the current performance of the Riak KV node:

   ```
   riak-admin status
   ```

   Look for the following key performance indicators (KPIs) for PUT requests:
   - `node_put_fsm_time_95` (95th percentile processing time for PUT requests)
   - `node_put_fsm_time_99` (99th percentile processing time for PUT requests)
   - `node_put_fsm_time_100` (maximum processing time for PUT requests)

   If any of these values are significantly higher than their historical values, it may indicate an issue with the node's performance.

2. **Identify high-load operations**

   Examine the application logs or Riak KV logs for recent activity such as a high volume of PUT requests, bulk updates or deletions, or other intensive database operations that could cause the slowdown.

3. **Investigate other system performance indicators**

   Check the server's CPU, memory, and disk I/O usage to identify any resource constraints that could be affecting the performance of the Riak KV node.

4. **Review the Riak KV configuration**

   Analyze the Riak KV configuration settings to ensure that they are optimized for your specific use case. Improperly configured settings can lead to performance issues.

5. **Consider scaling the Riak KV cluster**

   If the current Riak KV cluster cannot handle the increasing workload, consider adding new nodes to the cluster to distribute the load and improve performance.

diff --git a/health/guides/riakkv/riakkv_list_keys_active.md b/health/guides/riakkv/riakkv_list_keys_active.md
new file mode 100644
index 00000000..38d42a37
--- /dev/null
+++ b/health/guides/riakkv/riakkv_list_keys_active.md
@@ -0,0 +1,31 @@

### Understand the alert

This alert indicates that there are currently active `list keys` operations running as finite state machines (FSMs) on your Riak KV database. Running `list keys` in Riak is a resource-intensive operation that can significantly affect the performance of the cluster, and it is not recommended for production use.

### What are list keys operations in Riak?

`List keys` operations in Riak iterate through all keys in a bucket to return a list of keys. This is expensive because Riak needs to traverse the entire dataset to generate the list. As the dataset grows, the operation consumes more resources and takes longer to complete, which can reduce performance and scalability.

### Troubleshoot the alert

To address the `riakkv_list_keys_active` alert, follow these steps:

1. Identify the processes and applications running `list keys` operations:

   Monitor your application logs and identify the processes or applications that are using these operations. You may need to enable additional logging to capture information related to `list keys`.

2. Evaluate the necessity of `list keys` operations:

   Work with your development team to determine whether there is a specific reason these operations are being used. If they are not necessary, consider replacing them with other, more efficient data retrieval techniques.

3. Optimize data retrieval:

   If your application does need to retrieve keys, consider an alternative strategy such as Secondary Indexes (2i), or implement a custom solution tailored to your specific use case.
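   For example, if objects are written with a secondary index, an exact-match 2i query returns only the matching keys instead of walking the whole bucket. A minimal sketch over HTTP, where the bucket name `users` and the index `status_bin` are hypothetical; it assumes the default HTTP listener on `127.0.0.1:8098` and a backend that supports 2i (such as LevelDB):

   ```
   # 2i exact-match query: return only the keys whose "status_bin" index equals "active"
   curl -s http://127.0.0.1:8098/buckets/users/index/status_bin/active
   ```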
4. Monitor the system:

   After making changes to your application, continue monitoring the active `list keys` FSMs with Netdata to ensure that the number of active `list keys` operations is reduced.

### Useful resources

1. [Riak KV Operations](https://docs.riak.com/riak/kv/latest/developing/usage/operations/index.html)

diff --git a/health/guides/riakkv/riakkv_vm_high_process_count.md b/health/guides/riakkv/riakkv_vm_high_process_count.md
new file mode 100644
index 00000000..7fd79517
--- /dev/null
+++ b/health/guides/riakkv/riakkv_vm_high_process_count.md
@@ -0,0 +1,31 @@

### Understand the alert

The `riakkv_vm_high_process_count` alert is related to the Riak KV database. It warns you when the number of processes running in the Erlang VM is high. A high process count can result in performance degradation due to scheduling overhead.

This alert is triggered in the warning state when the number of processes is greater than 10,000 and in the critical state when it is greater than 100,000.

### Troubleshoot the alert

1. Check the current number of processes in the Erlang VM. You can use the following command to see the current process count:

   ```
   riak-admin status | grep sys_process_count
   ```

2. Check the Riak KV logs (`/var/log/riak`) for any error messages or stack traces. These can help you identify issues and potential bottlenecks in your system.

3. Check the CPU, memory, and disk space usage on the system hosting the Riak KV database. High usage in any of these areas can also contribute to performance issues and a high process count. Use commands like `top`, `free`, and `df` to monitor these resources.

4. Review your Riak KV configuration settings. You may need to adjust the `+P` and `+S` flags, which control the maximum number of processes and scheduler threads (respectively) that the Erlang runtime system can create. These settings can be found in the `vm.args` file:

   ```
   vim /etc/riak/vm.args
   ```

5. If needed, optimize the Riak KV database by adjusting the configuration settings or by adding more resources to your system, such as RAM or CPU cores.

6. Ensure that your application is not creating an excessive number of processes. You may need to examine your code to see if there are ways to reduce the Riak KV process count.

### Useful resources

1. [Riak KV Documentation](http://docs.basho.com/riak/kv/2.2.3/)