### Understand the alert
The `ceph_cluster_space_usage` alert is triggered when the percentage of used disk space in your Ceph cluster is high. Ceph is a distributed storage system designed for performance, reliability, and scalability. The alert enters the warning state when usage is between 85% and 90%, and the critical state between 90% and 98%. Running near full capacity degrades performance, and once OSDs reach their full ratio, Ceph stops accepting writes to protect existing data.
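For reference, you can compute the same ratio the alert watches from the cluster-wide totals. A minimal sketch, assuming `jq` is installed and that your Ceph release reports `total_bytes` and `total_used_bytes` under `stats` in the JSON output (field names can vary between releases):
```
# Overall raw usage percentage across the cluster (rounded)
ceph df --format json | jq -r '"\(.stats.total_used_bytes / .stats.total_bytes * 100 | round)% used"'
```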
### Troubleshoot the alert
Perform the following actions:
1. Check the Ceph cluster status
Run the following command to see the overall health of the Ceph cluster:
```
ceph status
```
Pay attention to the `health` field in the `cluster` section and to the `usage` line in the `data` section, which reports the used and total disk space.
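If the status is not `HEALTH_OK`, `ceph health detail` expands the summary into individual checks and the entities they affect (for space problems, typically checks such as `OSD_NEARFULL` or `OSD_FULL`):
```
# Expand the health summary into per-check details
ceph health detail
```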
2. Review the storage utilization for each pool
Run the following command to review the storage usage for each pool in the Ceph cluster:
```
ceph df
```
Identify the pools with high utilization and consider moving or removing data from these pools.
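`ceph df detail` adds per-pool quota and raw-usage columns, and `rados df` adds object counts and I/O statistics, which helps pinpoint the pools driving growth:
```
# Per-pool usage with quotas and raw space consumed
ceph df detail

# Per-pool object counts and I/O statistics
rados df
```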
3. Investigate clients or applications with high storage usage
Check the clients or applications that interact with the Ceph cluster and the associated file systems. You can use monitoring tools, disk usage analysis programs, or log analysis tools to identify any unusual patterns, such as excessive file creation, large file uploads, or high I/O operations.
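If your workloads use RBD, per-image usage often identifies the culprit quickly. A sketch, assuming a pool named `rbd` (substitute your own pool names):
```
# List images in the pool, then show provisioned vs. actual usage per image
rbd ls rbd
rbd du --pool rbd
```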
4. Add more storage or nodes to the cluster
If the cluster is reaching its full capacity due to normal usage, consider adding more storage or nodes to the Ceph cluster. This can help prevent the cluster from becoming overloaded and maintain its performance and reliability.
The exact commands depend on how the cluster was deployed. For example, on a cephadm-managed cluster you can add an OSD through the orchestrator, while on a manually deployed cluster `ceph-volume` prepares and activates the new device:
```
# cephadm-managed cluster: add an OSD backed by a device on a host
ceph orch daemon add osd <host>:<device>

# manual deployment: create and activate an OSD from a raw device
ceph-volume lvm create --data /dev/<device>
```
5. Optimize data replication and placement
High disk usage can also result from uneven data distribution or a higher replication factor than you need. Review the pool replication settings and CRUSH rules, and update the CRUSH map if needed so that data is spread more evenly across OSDs.
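To judge how evenly data is spread, compare per-OSD utilization. If a few OSDs are much fuller than the rest, enabling the balancer module (or a one-off reweight) usually helps; the `120` threshold below is the default and is shown only for illustration:
```
# Per-OSD utilization alongside the CRUSH tree
ceph osd df tree

# Let the balancer module even out placement automatically
ceph balancer on

# One-off alternative: reweight OSDs above 120% of the mean utilization
ceph osd reweight-by-utilization 120
```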
### Useful resources
1. [Ceph Storage Cluster](https://docs.ceph.com/en/latest/architecture/#storage-cluster)
2. [Ceph Troubleshooting Guide](https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/4/html/troubleshooting_guide/index)
3. [Managing Ceph Placement Groups](https://docs.ceph.com/en/latest/rados/operations/placement-groups/)
4. [Ceph: Adding and Removing OSDs](https://docs.ceph.com/en/latest/rados/operations/add-or-rm-osds/)