author | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-03-09 13:19:22 +0000 |
---|---|---|
committer | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-03-09 13:19:22 +0000 |
commit | c21c3b0befeb46a51b6bf3758ffa30813bea0ff0 (patch) | |
tree | 9754ff1ca740f6346cf8483ec915d4054bc5da2d /health/guides/cockroachdb | |
parent | Adding upstream version 1.43.2. (diff) | |
Adding upstream version 1.44.3. (upstream/1.44.3)
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to '')
5 files changed, 258 insertions, 0 deletions
diff --git a/health/guides/cockroachdb/cockroachdb_open_file_descriptors_limit.md b/health/guides/cockroachdb/cockroachdb_open_file_descriptors_limit.md
new file mode 100644
index 000000000..ad2fa4ac7
--- /dev/null
+++ b/health/guides/cockroachdb/cockroachdb_open_file_descriptors_limit.md
@@ -0,0 +1,57 @@
+### Understand the alert
+
+This alert indicates that the usage of file descriptors in your CockroachDB is reaching a high percentage against the soft-limit. High file descriptor utilization can cause issues, such as failures to open new files or establish network connections.
+
+### Troubleshoot the alert
+
+1. Check the current file descriptor limit and usage for CockroachDB:
+
+   Use the `lsof` command to display information about all open file descriptors associated with the process running CockroachDB:
+
+   ```
+   lsof -p <PID>
+   ```
+
+   Replace `<PID>` with the process ID of CockroachDB.
+
+   To get an approximate count of open file descriptors, you can use this command (the output includes a header line and non-descriptor entries such as memory-mapped files):
+
+   ```
+   lsof -p <PID> | wc -l
+   ```
+
+2. Monitor file descriptor usage:
+
+   Regularly monitoring file descriptor usage can help you identify patterns and trends, making it easier to determine if adjustments are needed. You can use tools like `lsof` or `sar` to monitor file descriptor usage on your system.
+
+3. Adjust the file descriptors limit for the process:
+
+   You can raise the soft-limit for the CockroachDB process by modifying the `ulimit` configuration:
+
+   ```
+   ulimit -n <new_limit>
+   ```
+
+   Replace `<new_limit>` with the desired value, which must be less than or equal to the hard limit (shown by `ulimit -Hn`).
+
+   Note that changes made using `ulimit` only apply to the current shell session. To make the changes persistent, you should add the `ulimit` command to the CockroachDB service startup script or modify the system-wide limits in `/etc/security/limits.conf`.
+
+4. Adjust the system-wide file descriptors limit:
+
+   If necessary, you can also adjust the system-wide limits for file descriptors in `/etc/security/limits.conf`. Edit this file as a root user, and add or modify the following lines:
+
+   ```
+   * soft nofile <new_soft_limit>
+   * hard nofile <new_hard_limit>
+   ```
+
+   Replace `<new_soft_limit>` and `<new_hard_limit>` with the desired values. You must restart the system or CockroachDB for the changes to take effect.
+
+5. Optimize CockroachDB configuration:
+
+   Review the CockroachDB configuration and ensure that it's optimized for your workload. If appropriate, adjust settings such as cache size, query optimization, and memory usage to reduce the number of file descriptors needed.
+
+### Useful resources
+
+1. [CockroachDB recommended production settings](https://www.cockroachlabs.com/docs/v21.2/recommended-production-settings#file-descriptors-limit)
+2. [Increasing file descriptor limits on Linux](https://www.tecmint.com/increase-set-open-file-limits-in-linux/)
diff --git a/health/guides/cockroachdb/cockroachdb_unavailable_ranges.md b/health/guides/cockroachdb/cockroachdb_unavailable_ranges.md
new file mode 100644
index 000000000..ef495cb72
--- /dev/null
+++ b/health/guides/cockroachdb/cockroachdb_unavailable_ranges.md
@@ -0,0 +1,51 @@
+### Understand the alert
+
+This alert indicates that there are unavailable ranges in your CockroachDB cluster. Unavailable ranges occur when a majority of a range's replicas are on nodes that are unavailable. This can cause the entire range to be unable to process queries.
+
+### Troubleshoot the alert
+
+1. Check for dead or unavailable nodes
+
+   Use the `./cockroach node status` command to list the status of all nodes in your cluster. Look for nodes that are marked as dead or unavailable and try to bring them back online.
+
+   ```
+   ./cockroach node status --certs-dir=<your_cert_directory>
+   ```
+
+2. Inspect the logs
+
+   CockroachDB logs can provide valuable information about issues that may be affecting your cluster. Check the log directory for errors or warnings related to unavailable ranges using a recursive `grep`:
+
+   ```
+   grep -ri 'unavailable range' /path/to/cockroachdb/logs
+   ```
+
+3. Check replication factor
+
+   Make sure your cluster's replication factor is set to an appropriate value. A higher replication factor can help tolerate node failures and prevent unavailable ranges. The replication factor is the `num_replicas` value of a replication zone; you can check the default zone by running the following SQL statement:
+
+   ```
+   SHOW ZONE CONFIGURATION FROM RANGE default;
+   ```
+
+   To set the default replication factor, run the following SQL command:
+
+   ```
+   ALTER RANGE default CONFIGURE ZONE USING num_replicas = <desired_replication_factor>;
+   ```
+
+4. Investigate and resolve network issues
+
+   Network issues can cause nodes to become unavailable and lead to unavailable ranges. Check the status of your network and any firewalls, load balancers, or other network components that may be affecting connectivity between nodes.
+
+5. Monitor and manage hardware resources
+
+   Insufficient hardware resources, such as CPU, memory, or disk space, can cause nodes to become unavailable. Monitor your nodes' resource usage and ensure that they have adequate resources to handle the workload.
+
+6. Consider rebalancing the cluster
+
+   Rebalancing the cluster can help distribute the load more evenly across nodes and reduce the number of unavailable ranges. See the [CockroachDB documentation](https://www.cockroachlabs.com/docs/stable/training/manual-rebalancing.html) for more information on manual rebalancing.
+
+### Useful resources
+
+1. [CockroachDB troubleshooting guide](https://www.cockroachlabs.com/docs/stable/cluster-setup-troubleshooting.html#db-console-shows-under-replicated-unavailable-ranges)
diff --git a/health/guides/cockroachdb/cockroachdb_underreplicated_ranges.md b/health/guides/cockroachdb/cockroachdb_underreplicated_ranges.md
new file mode 100644
index 000000000..e82695993
--- /dev/null
+++ b/health/guides/cockroachdb/cockroachdb_underreplicated_ranges.md
@@ -0,0 +1,41 @@
+### Understand the alert
+
+This alert is related to CockroachDB, a scalable and distributed SQL database. When you receive this alert, it means that there are under-replicated ranges in your database cluster. Under-replicated ranges can impact the availability and fault tolerance of your database, leading to potential data loss or unavailability in case of node failures.
+
+### What are under-replicated ranges?
+
+In a CockroachDB cluster, data is split into small chunks called ranges. These ranges are then replicated across multiple nodes to ensure fault tolerance and high availability. The desired replication factor determines the number of replicas for each range.
+
+When a range has fewer replicas than the desired replication factor, it is considered "under-replicated". This situation can occur if nodes are unavailable or if the cluster is in the process of recovering from failures.
+
+### Troubleshoot the alert
+
+1. Access the CockroachDB Admin UI
+
+   Access the Admin UI by navigating to the URL `http://<any-node-ip>:8080` on any of your cluster nodes.
+
+2. Check the 'Replication Status' in the dashboard
+
+   In the Admin UI, check the 'Under-replicated Ranges' metric on the main 'Dashboard' or 'Metrics' page.
+
+3. Inspect the logs of your CockroachDB nodes
+
+   Look for any error messages or issues that could be causing under-replication. For example, you may see errors related to node failures or network issues.
+
+4. Check cluster health and capacity
+
+   Make sure that all nodes in the cluster are running and healthy. You can do this by running the command `cockroach node status`. Consider adding more nodes or increasing the capacity if your nodes are overworked.
+
+5. Verify replication factor configuration
+
+   Check your cluster's replication factor configuration to ensure it is set to an appropriate value. The default replication factor is 3, which can tolerate one failure. You can view and change it using the [`zone configurations`](https://www.cockroachlabs.com/docs/stable/configure-replication-zones.html).
+
+6. Consider decommissioning problematic nodes
+
+   If specific nodes are causing under-replication, consider decommissioning them to allow the cluster to automatically rebalance the ranges. Follow the [decommissioning guide](https://www.cockroachlabs.com/docs/stable/remove-nodes.html) in the CockroachDB documentation.
+
+### Useful resources
+
+1. [CockroachDB: Troubleshoot Under-replicated and Unavailable Ranges](https://www.cockroachlabs.com/docs/stable/cluster-setup-troubleshooting.html#db-console-shows-under-replicated-unavailable-ranges)
+2. [CockroachDB: Configuring Replication Zones](https://www.cockroachlabs.com/docs/stable/configure-replication-zones.html)
+3. [CockroachDB: Decommission a Node](https://www.cockroachlabs.com/docs/stable/remove-nodes.html)
\ No newline at end of file
diff --git a/health/guides/cockroachdb/cockroachdb_used_storage_capacity.md b/health/guides/cockroachdb/cockroachdb_used_storage_capacity.md
new file mode 100644
index 000000000..ac1bc000c
--- /dev/null
+++ b/health/guides/cockroachdb/cockroachdb_used_storage_capacity.md
@@ -0,0 +1,46 @@
+### Understand the Alert
+
+This alert indicates high storage capacity utilization in CockroachDB.
+
+### Definition of "size" on CockroachDB:
+
+The maximum size allocated to the node. When this size is reached, CockroachDB attempts to rebalance data to other nodes with available capacity. When there's no capacity elsewhere, this limit will be exceeded. Also, data may be written to the node faster than the cluster can rebalance it away; in this case, as long as capacity is available elsewhere, CockroachDB will gradually rebalance data down to the store limit.
+
+### Troubleshoot the Alert
+
+- Increase the space available for CockroachDB data
+
+If you had previously set a limit, then you can use the option `--store=path=<YOUR PATH>,size=<SIZE>` to increase the amount of available space. Make sure to replace the "YOUR PATH" with the actual store path and "SIZE" with the new size you want to set CockroachDB to.
+
+Note: If you haven't set a limit on the size, then the entire drive's size will be used. In this case, you will see that the drive is full. Clearing some space or upgrading to a drive with a larger capacity are potential solutions.
+
+- Inspect the disk usage by tables and indexes
+
+CockroachDB provides the `experimental_disk_usage` builtin SQL function that allows you to check the disk usage by tables and indexes within a given database. This can help you identify the main storage consumers in your cluster.
+
+To run this command, first connect to your CockroachDB instance with `cockroach sql`, then execute the following query:
+
+```sql
+SELECT * FROM [SHOW experimental_disk_usage('<database_name>')];
+```
+
+Make sure to replace `<database_name>` with the actual name of the database you want to inspect. This will return a list of tables and indexes with their respective disk usage.
+
+- Rebalance the cluster data to other nodes with available capacity
+
+CockroachDB automatically rebalances data across nodes by default. If the data rebalancing is not happening fast enough, you can try to speed up this process by [adjusting `zone configurations`](https://www.cockroachlabs.com/docs/stable/configure-replication-zones.html) or by [increasing the default rebalancing rate](https://www.cockroachlabs.com/docs/stable/cluster-settings.html#kv_range_replication_rate_bytes_per_second).
+
+- Purge old, unnecessary data
+
+Inspect your data and consider purging old or unnecessary data from the database. Be cautious while performing this operation and double-check the data you intend to remove.
+
+- Archive old data
+
+If the data cannot be purged, consider archiving it in a more compact format or moving it to a separate database or storage system to reduce the storage usage on the affected CockroachDB node.
+
+
+## Useful resources
+
+1. [CockroachDB Size](https://www.cockroachlabs.com/docs/v21.2/cockroach-start#store)
+2. [CockroachDB Docs](https://www.cockroachlabs.com/docs/stable/ui-storage-dashboard.html)
+
diff --git a/health/guides/cockroachdb/cockroachdb_used_usable_storage_capacity.md b/health/guides/cockroachdb/cockroachdb_used_usable_storage_capacity.md
new file mode 100644
index 000000000..ec00dbb98
--- /dev/null
+++ b/health/guides/cockroachdb/cockroachdb_used_usable_storage_capacity.md
@@ -0,0 +1,63 @@
+### Understand the alert
+
+This alert indicates that the usable storage space allocated for your CockroachDB is being highly utilized. If the percentage of used space exceeds 85%, the alert raises a warning, and if it exceeds 95%, the alert becomes critical. High storage utilization can lead to performance issues and potential data loss if not properly managed.
+
+### Troubleshoot the alert
+
+1. Check the current storage utilization
+
+To understand the current utilization, you can use SQL commands to query the `crdb_internal.kv_store_status` table.
+
+```sql
+SELECT node_id, store_id, capacity, used, available
+FROM crdb_internal.kv_store_status;
+```
+
+This query will provide information about the available and used storage capacity of each node in your CockroachDB cluster.
+
+2. Identify tables and databases with high storage usage
+
+Use the following command to list the top databases in terms of storage usage:
+
+```sql
+SELECT database_name, sum(data_size_int) as total_size
+FROM crdb_internal.tables
+WHERE database_name != 'crdb_internal'
+GROUP BY database_name
+ORDER BY total_size DESC
+LIMIT 10;
+```
+
+Additionally, you can list the top tables in terms of storage usage:
+
+```sql
+SELECT database_name, table_name, data_size
+FROM crdb_internal.tables
+WHERE database_name != 'crdb_internal'
+ORDER BY data_size_int DESC
+LIMIT 10;
+```
+
+3. Optimize storage usage
+
+Based on your findings from steps 1 and 2, consider the following actions:
+
+- Delete unneeded data from tables with high storage usage.
+- Apply data compression to reduce the overall storage consumption.
+- Archive old data or move it to external storage.
+
+4. Add more storage to the nodes
+
+If necessary, increase the storage allocated to your CockroachDB cluster by adding more space to each node.
+
+- To increase the usable storage capacity, modify the `--store` flag when restarting your CockroachDB nodes. Set the new size by replacing `<YOUR_PATH>` with the actual store path and `<SIZE>` with the desired new size:
+
+  ```
+  --store=path=<YOUR_PATH>,size=<SIZE>
+  ```
+
+5. Add more nodes to the cluster
+
+If increasing the storage capacity of your existing nodes isn't enough, consider adding more nodes to your CockroachDB cluster. By adding more nodes, you can distribute storage more evenly and prevent single points of failure due to storage limitations.
+
+Refer to the [CockroachDB documentation](https://www.cockroachlabs.com/docs/stable/start-a-node.html) on how to add a new node to a cluster.
\ No newline at end of file
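
Note on `cockroachdb_open_file_descriptors_limit.md`: the guide describes the alert as file descriptor usage measured against the soft limit. A minimal sketch of that ratio for a single process is shown below. It is not part of this commit, of Netdata, or of CockroachDB; the function name `fd_usage_percent` is hypothetical, and the sketch assumes a Linux host where `/proc/<PID>/fd` and `/proc/<PID>/limits` are readable by the calling user.

```shell
#!/bin/sh
# Hypothetical helper (an illustration only): print the open file descriptors
# of a process as a percentage of its soft limit. Assumes Linux /proc.
fd_usage_percent() {
    pid="$1"
    # Each entry under /proc/<pid>/fd is one open descriptor.
    open=$(ls "/proc/$pid/fd" 2>/dev/null | wc -l)
    # In the "Max open files" row, field 4 is the soft limit, field 5 the hard limit.
    soft=$(awk '/^Max open files/ {print $4}' "/proc/$pid/limits" 2>/dev/null)
    case "$soft" in
        ''|*[!0-9]*)
            # Missing process, unreadable file, or a non-numeric limit such as "unlimited".
            echo "cannot read a numeric soft limit for PID $pid" >&2
            return 1
            ;;
    esac
    [ "$soft" -gt 0 ] || return 1
    # Integer percentage, rounded down.
    echo $(( open * 100 / soft ))
}

# Example: inspect the current shell; replace "$$" with the CockroachDB PID.
fd_usage_percent "$$"
```

Reading `/proc/<PID>/fd` counts only real descriptors, so the result tends to be lower than a `lsof -p <PID> | wc -l` count, which also includes a header line and memory-mapped files.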