.. _placement groups:

==================
 Placement Groups
==================

.. _pg-autoscaler:

Autoscaling placement groups
============================

Placement groups (PGs) are an internal implementation detail of how Ceph
distributes data. Autoscaling provides a way to manage PGs, and especially to
manage the number of PGs present in different pools. When *pg-autoscaling* is
enabled, the cluster is allowed to make recommendations or automatic
adjustments with respect to the number of PGs for each pool (``pgp_num``) in
accordance with expected cluster utilization and expected pool utilization.

Each pool has a ``pg_autoscale_mode`` property that can be set to ``off``,
``on``, or ``warn``:

* ``off``: Disable autoscaling for this pool. It is up to the administrator to
  choose an appropriate ``pgp_num`` for each pool. For more information, see
  :ref:`choosing-number-of-placement-groups`.
* ``on``: Enable automated adjustments of the PG count for the given pool.
* ``warn``: Raise health checks when the PG count is in need of adjustment.

To set the autoscaling mode for an existing pool, run a command of the
following form:

.. prompt:: bash #

   ceph osd pool set <pool-name> pg_autoscale_mode <mode>

For example, to enable autoscaling on pool ``foo``, run the following command:

.. prompt:: bash #

   ceph osd pool set foo pg_autoscale_mode on

There is also a default ``pg_autoscale_mode`` setting that applies to any
pools created after the initial setup of the cluster. To change this setting,
run a command of the following form:

.. prompt:: bash #

   ceph config set global osd_pool_default_pg_autoscale_mode <mode>

You can disable or enable the autoscaler for all pools with the ``noautoscale``
flag. By default, this flag is set to ``off``, but you can set it to ``on`` by
running the following command:

.. prompt:: bash #

   ceph osd pool set noautoscale

To set the ``noautoscale`` flag to ``off``, run the following command:

.. prompt:: bash #

   ceph osd pool unset noautoscale

To get the value of the flag, run the following command:

.. prompt:: bash #

   ceph osd pool get noautoscale

Viewing PG scaling recommendations
----------------------------------

To view each pool, its relative utilization, and any recommended changes to the
PG count, run the following command:

.. prompt:: bash #

   ceph osd pool autoscale-status

The output will resemble the following::

   POOL  SIZE    TARGET SIZE  RATE  RAW CAPACITY  RATIO   TARGET RATIO  EFFECTIVE RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE  BULK
   a     12900M               3.0   82431M        0.4695                                       8       128         warn       True
   c     0                    3.0   82431M        0.0000  0.2000        0.9884           1.0   1       64          warn       True
   b     0       953.6M       3.0   82431M        0.0347                                       8                   warn       False

- **POOL** is the name of the pool.

- **SIZE** is the amount of data stored in the pool.

- **TARGET SIZE** (if present) is the amount of data that is expected to be
  stored in the pool, as specified by the administrator. The system uses the
  greater of the two values for its calculation.

- **RATE** is the multiplier for the pool that determines how much raw storage
  capacity is consumed. For example, a three-replica pool will have a ratio of
  3.0, and a ``k=4 m=2`` erasure-coded pool will have a ratio of 1.5.

- **RAW CAPACITY** is the total amount of raw storage capacity on the specific
  OSDs that are responsible for storing the data of the pool (and perhaps the
  data of other pools).

- **RATIO** is the ratio of (1) the storage consumed by the pool to (2) the
  total raw storage capacity. In other words, RATIO is defined as
  (SIZE * RATE) / RAW CAPACITY.

- **TARGET RATIO** (if present) is the ratio of the expected storage of this
  pool (that is, the amount of storage that this pool is expected to consume,
  as specified by the administrator) to the expected storage of all other pools
  that have target ratios set. If both ``target_size_bytes`` and
  ``target_size_ratio`` are specified, then ``target_size_ratio`` takes
  precedence.

- **EFFECTIVE RATIO** is the result of making two adjustments to the target
  ratio:

  #. Subtracting any capacity expected to be used by pools that have target
     size set.

  #. Normalizing the target ratios among pools that have target ratio set so
     that collectively they target cluster capacity. For example, four pools
     with target_ratio 1.0 would have an effective ratio of 0.25.

  The system's calculations use whichever of these two ratios (that is, the
  target ratio and the effective ratio) is greater.

- **BIAS** is used as a multiplier to manually adjust a pool's PG count in
  accordance with prior information about how many PGs a specific pool is
  expected to have.

- **PG_NUM** is either the current number of PGs associated with the pool or,
  if a ``pg_num`` change is in progress, the current number of PGs that the
  pool is working towards.

- **NEW PG_NUM** (if present) is the value to which the system recommends that
  the ``pg_num`` of the pool be changed. It is always a power of 2, and it is
  present only if the recommended value varies from the current value by more
  than the default factor of ``3``. To adjust this factor (in the following
  example, it is changed to ``2``), run the following command:

  .. prompt:: bash #

     ceph osd pool set threshold 2.0

- **AUTOSCALE** is the pool's ``pg_autoscale_mode`` and is set to ``on``,
  ``off``, or ``warn``.

- **BULK** determines whether the pool is ``bulk``. It has a value of ``True``
  or ``False``. A ``bulk`` pool is expected to be large and should initially
  have a large number of PGs so that performance does not suffer. On the other
  hand, a pool that is not ``bulk`` is expected to be small (for example, a
  ``.mgr`` pool or a meta pool).
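
The arithmetic behind the **RATIO** and **EFFECTIVE RATIO** columns can be
reproduced directly from the definitions above. The following Python sketch is
illustrative only: it is not the ``pg_autoscaler`` module's code, and the pool
names and the ``capacity_left`` parameter are our own.

.. code-block:: python

   def ratio(size_bytes: float, rate: float, raw_capacity: float) -> float:
       """RATIO = (SIZE * RATE) / RAW CAPACITY."""
       return size_bytes * rate / raw_capacity

   def effective_ratios(target_ratios: dict, capacity_left: float = 1.0) -> dict:
       """Normalize target ratios so that, collectively, the pools that have a
       target ratio set aim at the capacity not reserved via target sizes."""
       total = sum(target_ratios.values())
       return {pool: capacity_left * r / total
               for pool, r in target_ratios.items()}

   # Four pools with target_ratio 1.0 each end up with an effective ratio of 0.25:
   print(effective_ratios({"a": 1.0, "b": 1.0, "c": 1.0, "d": 1.0}))
   # {'a': 0.25, 'b': 0.25, 'c': 0.25, 'd': 0.25}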
.. note::

   If the ``ceph osd pool autoscale-status`` command returns no output at all,
   there is probably at least one pool that spans multiple CRUSH roots. This
   'spanning pool' issue can happen in scenarios like the following:
   when a new deployment auto-creates the ``.mgr`` pool on the ``default``
   CRUSH root, subsequent pools are created with rules that constrain them to a
   specific shadow CRUSH tree. For example, if you create an RBD metadata pool
   that is constrained to ``deviceclass = ssd`` and an RBD data pool that is
   constrained to ``deviceclass = hdd``, you will encounter this issue. To
   remedy this issue, constrain the spanning pool to only one device class. In
   the above scenario, there is likely to be a ``replicated-ssd`` CRUSH rule in
   effect, and the ``.mgr`` pool can be constrained to ``ssd`` devices by
   running the following command:

   .. prompt:: bash #

      ceph osd pool set .mgr crush_rule replicated-ssd

   This intervention will result in a small amount of backfill, but
   typically this traffic completes quickly.


Automated scaling
-----------------

In the simplest approach to automated scaling, the cluster is allowed to
automatically scale ``pgp_num`` in accordance with usage. Ceph considers the
total available storage and the target number of PGs for the whole system,
considers how much data is stored in each pool, and apportions PGs accordingly.
The system is conservative with its approach, making changes to a pool only
when the current number of PGs (``pg_num``) varies by more than a factor of 3
from the recommended number.

The target number of PGs per OSD is determined by the ``mon_target_pg_per_osd``
parameter (default: 100), which can be adjusted by running the following
command:

.. prompt:: bash #

   ceph config set global mon_target_pg_per_osd 100

The autoscaler analyzes pools and adjusts on a per-subtree basis. Because each
pool might map to a different CRUSH rule, and each rule might distribute data
across different devices, Ceph will consider the utilization of each subtree of
the hierarchy independently. For example, a pool that maps to OSDs of class
``ssd`` and a pool that maps to OSDs of class ``hdd`` will each have optimal PG
counts that are determined by how many of these two different device types
there are.

If a pool uses OSDs under two or more CRUSH roots (for example, shadow trees
with both ``ssd`` and ``hdd`` devices), the autoscaler issues a warning to the
user in the manager log. The warning states the name of the pool and the set of
roots that overlap each other. The autoscaler does not scale any pools with
overlapping roots because this condition can cause problems with the scaling
process. We recommend constraining each pool so that it belongs to only one
root (that is, one OSD class) to silence the warning and ensure a successful
scaling process.
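
To make the preceding description concrete, here is a simplified sketch of the
sizing logic under the stated assumptions. The ``suggest_pg_num`` helper and
its parameters are invented for illustration; this is not the
``pg_autoscaler`` module's actual implementation.

.. code-block:: python

   # Simplified sketch of the autoscaler's sizing heuristic described above.
   def next_power_of_two(n: int) -> int:
       return 1 << max(0, (n - 1).bit_length())

   def suggest_pg_num(usage_ratio: float, osd_count: int, pool_size: int,
                      current_pg_num: int, target_pg_per_osd: int = 100,
                      threshold: float = 3.0) -> int:
       """Apportion the subtree's PG budget to one pool by its usage ratio."""
       # PG replicas the subtree can carry, divided by the pool's replication
       # factor (or K+M for erasure coding), weighted by the pool's usage.
       target = usage_ratio * osd_count * target_pg_per_osd / pool_size
       suggested = next_power_of_two(round(target))
       # Be conservative: recommend a change only if the current value is off
       # by more than the threshold factor (default 3).
       if (suggested >= current_pg_num * threshold
               or suggested * threshold <= current_pg_num):
           return suggested
       return current_pg_num

   # A pool using 40% of a 10-OSD, size-3 subtree: 0.4 * 10 * 100 / 3 ~= 133,
   # rounded up to 256; recommended because 256 exceeds 8 by more than 3x.
   print(suggest_pg_num(0.4, 10, 3, current_pg_num=8))  # 256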
.. _managing_bulk_flagged_pools:

Managing pools that are flagged with ``bulk``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If a pool is flagged ``bulk``, then the autoscaler starts the pool with a full
complement of PGs and then scales down the number of PGs only if the usage
ratio across the pool is uneven. However, if a pool is not flagged ``bulk``,
then the autoscaler starts the pool with minimal PGs and creates additional PGs
only if there is more usage in the pool.

To create a pool that will be flagged ``bulk``, run a command of the following
form:

.. prompt:: bash #

   ceph osd pool create <pool-name> --bulk

To set or unset the ``bulk`` flag of an existing pool, run a command of the
following form:

.. prompt:: bash #

   ceph osd pool set <pool-name> bulk <true/false/1/0>

To get the ``bulk`` flag of an existing pool, run a command of the following
form:

.. prompt:: bash #

   ceph osd pool get <pool-name> bulk

.. _specifying_pool_target_size:

Specifying expected pool size
-----------------------------

When a cluster or pool is first created, it consumes only a small fraction of
the total cluster capacity and appears to the system as if it should need only
a small number of PGs. However, in some cases, cluster administrators know
which pools are likely to consume most of the system capacity in the long run.
When Ceph is provided with this information, a more appropriate number of PGs
can be used from the beginning, obviating subsequent changes in ``pg_num`` and
the associated overhead cost of relocating data.

The *target size* of a pool can be specified in two ways: either in relation to
the absolute size (in bytes) of the pool, or as a weight relative to all other
pools that have ``target_size_ratio`` set.

For example, to tell the system that ``mypool`` is expected to consume 100 TB,
run the following command:

.. prompt:: bash #

   ceph osd pool set mypool target_size_bytes 100T

Alternatively, to tell the system that ``mypool`` is expected to consume a
ratio of 1.0 relative to other pools that have ``target_size_ratio`` set,
adjust the ``target_size_ratio`` setting of ``mypool`` by running the
following command:

.. prompt:: bash #

   ceph osd pool set mypool target_size_ratio 1.0

If ``mypool`` is the only pool in the cluster, then it is expected to use 100%
of the total cluster capacity. However, if the cluster contains a second pool
that has ``target_size_ratio`` set to 1.0, then both pools are expected to use
50% of the total cluster capacity.

The ``ceph osd pool create`` command has two command-line options that can be
used to set the target size of a pool at creation time: ``--target-size-bytes
<bytes>`` and ``--target-size-ratio <ratio>``.

Note that if the target-size values that have been specified are impossible
(for example, a capacity larger than the total cluster capacity), then a health
check (``POOL_TARGET_SIZE_BYTES_OVERCOMMITTED``) will be raised.

If both ``target_size_ratio`` and ``target_size_bytes`` are specified for a
pool, then the latter will be ignored, the former will be used in system
calculations, and a health check (``POOL_HAS_TARGET_SIZE_BYTES_AND_RATIO``)
will be raised.
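The precedence rules above can be summarized in a short sketch. This is
illustrative only; the ``effective_target`` helper and its return format are
our own, not part of Ceph, and the overcommit check is simplified to a single
pool.

.. code-block:: python

   # Illustrative sketch of the target-size precedence rules described above.
   from typing import Optional

   def effective_target(target_size_bytes: Optional[int],
                        target_size_ratio: Optional[float],
                        cluster_capacity: int) -> tuple:
       """Return (target, health_warnings) following the documented rules."""
       warnings = []
       if target_size_bytes and target_size_ratio:
           # target_size_ratio takes precedence; the bytes value is ignored.
           warnings.append("POOL_HAS_TARGET_SIZE_BYTES_AND_RATIO")
           return ("ratio", target_size_ratio), warnings
       if target_size_bytes:
           if target_size_bytes > cluster_capacity:
               warnings.append("POOL_TARGET_SIZE_BYTES_OVERCOMMITTED")
           return ("bytes", target_size_bytes), warnings
       if target_size_ratio:
           return ("ratio", target_size_ratio), warnings
       return (None, None), warnings

   # Both set: the ratio wins and a health check is raised.
   print(effective_target(100 * 2**40, 1.0, 80 * 2**40))
   # (('ratio', 1.0), ['POOL_HAS_TARGET_SIZE_BYTES_AND_RATIO'])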
Specifying bounds on a pool's PGs
---------------------------------

It is possible to specify both the minimum number and the maximum number of PGs
for a pool.

Setting a Minimum Number of PGs and a Maximum Number of PGs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If a minimum is set, then Ceph will not itself reduce (nor recommend that you
reduce) the number of PGs to a value below the configured value. Setting a
minimum serves to establish a lower bound on the amount of parallelism enjoyed
by a client during I/O, even if a pool is mostly empty.

If a maximum is set, then Ceph will not itself increase (nor recommend that you
increase) the number of PGs to a value above the configured value.

To set the minimum number of PGs for a pool, run a command of the following
form:

.. prompt:: bash #

   ceph osd pool set <pool-name> pg_num_min <num>

To set the maximum number of PGs for a pool, run a command of the following
form:

.. prompt:: bash #

   ceph osd pool set <pool-name> pg_num_max <num>

In addition, the ``ceph osd pool create`` command has two command-line options
that can be used to specify the minimum or maximum PG count of a pool at
creation time: ``--pg-num-min <num>`` and ``--pg-num-max <num>``.

.. _preselection:

Preselecting pg_num
===================

When creating a pool with the following command, you have the option to
preselect the value of the ``pg_num`` parameter:

.. prompt:: bash #

   ceph osd pool create {pool-name} [pg_num]

If you opt not to specify ``pg_num`` in this command, the cluster uses the PG
autoscaler to automatically configure the parameter in accordance with the
amount of data that is stored in the pool (see :ref:`pg-autoscaler` above).

However, your decision of whether or not to specify ``pg_num`` at creation time
has no effect on whether the parameter will be automatically tuned by the
cluster afterwards. As seen above, autoscaling of PGs is enabled or disabled by
running a command of the following form:

.. prompt:: bash #

   ceph osd pool set {pool-name} pg_autoscale_mode (on|off|warn)

Without the balancer, the suggested target is approximately 100 PG replicas on
each OSD. With the balancer, an initial target of 50 PG replicas on each OSD is
reasonable.

The autoscaler attempts to satisfy the following conditions:

- the number of PGs per OSD should be proportional to the amount of data in the
  pool
- there should be 50-100 PGs per pool, taking into account the replication
  overhead or erasure-coding fan-out of each PG's replicas across OSDs

Use of Placement Groups
=======================

A placement group aggregates objects within a pool. The tracking of RADOS
object placement and object metadata on a per-object basis is computationally
expensive. It would be infeasible for a system with millions of RADOS
objects to efficiently track placement on a per-object basis.

.. ditaa::

              /-----\  /-----\  /-----\  /-----\  /-----\
              | obj |  | obj |  | obj |  | obj |  | obj |
              \-----/  \-----/  \-----/  \-----/  \-----/
                 |        |        |        |        |
                 +--------+--------+        +---+----+
                          |                     |
                          v                     v
            +-----------------------+  +-----------------------+
            |  Placement Group #1   |  |  Placement Group #2   |
            |                       |  |                       |
            +-----------------------+  +-----------------------+
                        |                          |
                        +------------+-------------+
                                     |
                                     v
                        +-----------------------+
                        |         Pool          |
                        |                       |
                        +-----------------------+

The Ceph client calculates which PG a RADOS object should be in. As part of
this calculation, the client hashes the object ID and performs an operation
involving both the number of PGs in the specified pool and the pool ID. For
details, see `Mapping PGs to OSDs`_.
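A simplified sketch of this calculation follows. Ceph actually uses the
rjenkins hash and a "stable mod" of ``pg_num`` rather than CRC32 and a plain
modulus, so this is a conceptual approximation only, not Ceph's algorithm.

.. code-block:: python

   # Conceptual sketch of client-side object-to-PG mapping. It illustrates
   # only that the PG is derived from the object ID, pg_num, and pool ID.
   import zlib

   def object_to_pg(pool_id: int, object_id: str, pg_num: int) -> str:
       ps = zlib.crc32(object_id.encode()) % pg_num  # placement seed in pool
       return f"{pool_id}.{ps:x}"                    # PG IDs look like "1.6c"

   print(object_to_pg(1, "rbd_data.100a56b8b4567", pg_num=128))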
The contents of a RADOS object belonging to a PG are stored in a set of OSDs.
For example, in a replicated pool of size two, each PG will store objects on
two OSDs, as shown below:

.. ditaa::

   +-----------------------+      +-----------------------+
   |  Placement Group #1   |      |  Placement Group #2   |
   |                       |      |                       |
   +-----------------------+      +-----------------------+
        |             |                |             |
        v             v                v             v
   /----------\  /----------\     /----------\  /----------\
   |          |  |          |     |          |  |          |
   |  OSD #1  |  |  OSD #2  |     |  OSD #2  |  |  OSD #3  |
   |          |  |          |     |          |  |          |
   \----------/  \----------/     \----------/  \----------/

If OSD #2 fails, another OSD will be assigned to Placement Group #1 and then
filled with copies of all objects in OSD #1. If the pool size is changed from
two to three, an additional OSD will be assigned to the PG and will receive
copies of all objects in the PG.

An OSD assigned to a PG is not owned exclusively by that PG; rather, the OSD is
shared with other PGs either from the same pool or from other pools. In our
example, OSD #2 is shared by Placement Group #1 and Placement Group #2. If OSD
#2 fails, then Placement Group #2 must restore copies of objects (by making use
of OSD #3).

When the number of PGs increases, several consequences ensue. The new PGs are
assigned OSDs. The result of the CRUSH function changes, which means that some
objects from the already-existing PGs are copied to the new PGs and removed
from the old ones.
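Using the same simplified mapping as in the sketch above, it is easy to see
why objects move when ``pg_num`` grows. Again, this is an approximation: Ceph
uses "stable mod" PG splitting rather than a plain modulus, but the principle,
that some existing objects are redistributed into the new PGs, is the same.

.. code-block:: python

   # Demonstration (with the simplified mapping above, not Ceph's actual
   # placement logic) that increasing pg_num remaps some existing objects.
   import zlib

   def pg_of(object_id: str, pg_num: int) -> int:
       return zlib.crc32(object_id.encode()) % pg_num

   objects = [f"obj-{i}" for i in range(10000)]
   moved = sum(pg_of(o, 128) != pg_of(o, 256) for o in objects)
   print(f"{moved} of {len(objects)} objects map to a different PG")
   # Roughly half the objects move when pg_num doubles under a plain modulus.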
Factors Relevant To Specifying pg_num
=====================================

On the one hand, the criteria of data durability and even distribution across
OSDs weigh in favor of a high number of PGs. On the other hand, the criteria of
saving CPU resources and minimizing memory usage weigh in favor of a low number
of PGs.

.. _data durability:

Data durability
---------------

When an OSD fails, the risk of data loss is increased until replication of the
data it hosted is restored to the configured level. To illustrate this point,
let's imagine a scenario that results in permanent data loss in a single PG:

#. The OSD fails, and the copies that it contains of all of its objects are
   lost. For each object within the PG, the number of its replicas suddenly
   drops from three to two.

#. Ceph starts recovery for this PG by choosing a new OSD on which to re-create
   the third copy of each object.

#. Another OSD within the same PG fails before the new OSD is fully populated
   with the third copy. Some objects will then only have one surviving copy.

#. Ceph selects yet another OSD and continues copying objects in order to
   restore the desired number of copies.

#. A third OSD within the same PG fails before recovery is complete. If this
   OSD happened to contain the only remaining copy of an object, the object is
   permanently lost.

In a cluster containing 10 OSDs with 512 PGs in a three-replica pool, CRUSH
will give each PG three OSDs. Ultimately, each OSD hosts
:math:`\frac{512 \times 3}{10} \approx 150` PGs. So when the first OSD fails in
the above scenario, recovery will begin for all 150 PGs at the same time.

The 150 PGs that are being recovered are likely to be homogeneously distributed
across the 9 remaining OSDs. Each remaining OSD is therefore likely to send
copies of objects to all other OSDs and also likely to receive some new objects
to be stored because it has become part of a new PG.

The amount of time it takes for this recovery to complete depends on the
architecture of the Ceph cluster. Compare two setups: (1) Each OSD is hosted by
a 1 TB SSD on a single machine, all of the OSDs are connected to a 10 Gb/s
switch, and the recovery of a single OSD completes within a certain number of
minutes. (2) There are two OSDs per machine using HDDs with no SSD WAL+DB and
a 1 Gb/s switch. In the second setup, recovery will be at least one order of
magnitude slower.

In such a cluster, the number of PGs has almost no effect on data durability.
Whether there are 128 PGs per OSD or 8192 PGs per OSD, the recovery will be no
slower or faster.

However, an increase in the number of OSDs can increase the speed of recovery.
Suppose our Ceph cluster is expanded from 10 OSDs to 20 OSDs. Each OSD now
participates in only ~75 PGs rather than ~150 PGs. All 19 remaining OSDs will
still be required to replicate the same number of objects in order to recover.
But instead of there being only 10 OSDs that have to copy ~100 GB each, there
are now 20 OSDs that have to copy only 50 GB each. If the network had
previously been a bottleneck, recovery now happens twice as fast.

Similarly, suppose that our cluster grows to 40 OSDs. Each OSD will host only
~38 PGs. And if an OSD dies, recovery will take place faster than before unless
it is blocked by another bottleneck. Now, however, suppose that our cluster
grows to 200 OSDs. Each OSD will host only ~7 PGs. And if an OSD dies, recovery
will happen across at most :math:`7 \times 3 = 21` OSDs associated with these
PGs. This means that recovery will take longer than when there were only 40
OSDs. For this reason, the number of PGs should be increased.

No matter how brief the recovery time is, there is always a chance that an
additional OSD will fail while recovery is in progress. Consider the cluster
with 10 OSDs described above: if another OSD fails while recovery from the
first failure is in progress, then :math:`\approx 17` (approximately 150
divided by 9) PGs will have only one remaining copy. And if any of the 8
remaining OSDs then fails, 2 (approximately 17 divided by 8) PGs are likely to
lose their remaining objects. This is one reason why setting ``size=2`` is
risky.

When the number of OSDs in the cluster increases to 20, the number of PGs that
would be damaged by the loss of three OSDs significantly decreases. The loss of
a second OSD degrades only approximately :math:`4` (that is,
:math:`\frac{75}{19}`) PGs rather than :math:`\approx 17` PGs, and the loss of
a third OSD results in data loss only if it is one of the 4 OSDs that contains
the remaining copy. This means -- assuming that the probability of losing one
OSD during recovery is 0.0001% -- that the probability of data loss when three
OSDs are lost is :math:`\approx 17 \times 10 \times 0.0001\%` in the cluster
with 10 OSDs, and only :math:`\approx 4 \times 20 \times 0.0001\%` in the
cluster with 20 OSDs.
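A worked version of this arithmetic, following the document's simplifying
assumptions (a uniform spread of PGs and the assumed 0.0001% per-OSD failure
probability during recovery), might look like this:

.. code-block:: python

   # Worked version of the cascading-failure arithmetic above, using the
   # text's simplifying assumptions; the variable names are our own.
   p = 0.000001  # assumed 0.0001% chance of losing an OSD during recovery

   for osds in (10, 20):
       pgs_per_osd = 512 * 3 / osds              # ~150 for 10 OSDs, ~75 for 20
       one_copy_left = pgs_per_osd / (osds - 1)  # PGs degraded by a second loss
       loss_risk = one_copy_left * osds * p      # chance a third loss hits them
       print(f"{osds} OSDs: ~{pgs_per_osd:.0f} PGs per OSD, "
             f"~{one_copy_left:.0f} PGs reduced to one copy, "
             f"loss risk ~{loss_risk:.6%}")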
In summary, the greater the number of OSDs, the faster the recovery and the
lower the risk of permanently losing a PG due to cascading failures. As far as
data durability is concerned, in a cluster with fewer than 50 OSDs, it doesn't
much matter whether there are 512 or 4096 PGs.

.. note:: It can take a long time for an OSD that has been recently added to
   the cluster to be populated with the PGs assigned to it. However, no object
   degradation or impact on data durability will result from the slowness of
   this process since Ceph populates data into the new PGs before removing it
   from the old PGs.

.. _object distribution:

Object distribution within a pool
---------------------------------

Under ideal conditions, objects are evenly distributed across PGs. Because
CRUSH computes the PG for each object but does not know how much data is stored
in each OSD associated with the PG, the ratio between the number of PGs and the
number of OSDs can have a significant influence on data distribution.

For example, suppose that there is only a single PG for ten OSDs in a
three-replica pool. In that case, only three OSDs would be used because CRUSH
would have no other option. However, if more PGs are available, RADOS objects
are more likely to be evenly distributed across OSDs. CRUSH makes every effort
to distribute OSDs evenly across all existing PGs.

As long as there are one or two orders of magnitude more PGs than OSDs, the
distribution is likely to be even. For example: 256 PGs for 3 OSDs, 512 PGs for
10 OSDs, or 1024 PGs for 10 OSDs.

However, uneven data distribution can emerge due to factors other than the
ratio of PGs to OSDs. For example, since CRUSH does not take into account the
size of the RADOS objects, the presence of a few very large RADOS objects can
create an imbalance. Suppose that one million 4 KB RADOS objects totaling 4 GB
are evenly distributed among 1024 PGs on 10 OSDs. These RADOS objects will
consume 4 GB / 10 = 400 MB on each OSD. If a single 400 MB RADOS object is then
added to the pool, the three OSDs supporting the PG in which the RADOS object
has been placed will each be filled with 400 MB + 400 MB = 800 MB, but the
seven other OSDs will still contain only 400 MB.
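This example is easy to check numerically. A minimal sketch using the numbers
from the text (with 4 GB / 10 rounded to 400 MB, as above):

.. code-block:: python

   # Numerical check of the imbalance example: one large object lands in one
   # PG, and only the three OSDs backing that PG grow.
   num_osds = 10
   osd_mb = [400.0] * num_osds  # one million 4 KB objects ~ 4 GB over 10 OSDs

   for osd in (0, 1, 2):        # three replicas of the single 400 MB object
       osd_mb[osd] += 400

   print(osd_mb)
   # [800.0, 800.0, 800.0, 400.0, 400.0, 400.0, 400.0, 400.0, 400.0, 400.0]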
.. _resource usage:

Memory, CPU and network usage
-----------------------------

Every PG in the cluster imposes memory, network, and CPU demands upon OSDs and
MONs. These needs must be met at all times and are increased during recovery.
Indeed, one of the main reasons PGs were developed was to share this overhead
by clustering objects together.

For this reason, minimizing the number of PGs saves significant resources.

.. _choosing-number-of-placement-groups:

Choosing the Number of PGs
==========================

.. note:: It is rarely necessary to do the math in this section by hand.
   Instead, use the ``ceph osd pool autoscale-status`` command in combination
   with the ``target_size_bytes`` or ``target_size_ratio`` pool properties. For
   more information, see :ref:`pg-autoscaler`.

If you have more than 50 OSDs, we recommend approximately 50-100 PGs per OSD in
order to balance resource usage, data durability, and data distribution. If you
have fewer than 50 OSDs, follow the guidance in the `preselection`_ section.
For a single pool, use the following formula to get a baseline value:

   Total PGs = :math:`\frac{OSDs \times 100}{pool \: size}`

Here **pool size** is either the number of replicas for replicated pools or the
K+M sum for erasure-coded pools. To retrieve this sum, run the command ``ceph
osd erasure-code-profile get <profile-name>``.

Next, check whether the resulting baseline value is consistent with the way you
designed your Ceph cluster to maximize `data durability`_ and `object
distribution`_ and to minimize `resource usage`_.

This value should be **rounded up to the nearest power of two**.

Each pool's ``pg_num`` should be a power of two. Other values are likely to
result in uneven distribution of data across OSDs. It is best to increase
``pg_num`` for a pool only when it is feasible and desirable to set the next
highest power of two. Note that this power-of-two rule is per-pool; it is
neither necessary nor easy to align the sum of all pools' ``pg_num`` to a power
of two.

For example, if you have a cluster with 200 OSDs and a single pool with a size
of 3 replicas, estimate the number of PGs as follows:

   :math:`\frac{200 \times 100}{3} = 6667`. Rounded up to the nearest power of
   2: 8192.
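For convenience, the baseline formula and the power-of-two rounding can be
combined into a small helper. This is our own illustration, not a Ceph tool:

.. code-block:: python

   # Baseline pg_num calculation from the formula above, rounded up to the
   # nearest power of two.
   def baseline_pg_num(osds: int, pool_size: int,
                       target_per_osd: int = 100) -> int:
       total = osds * target_per_osd / pool_size
       power = 1
       while power < total:
           power *= 2
       return power

   # The worked example from the text: 200 OSDs, 3 replicas.
   print(baseline_pg_num(200, 3))  # (200 * 100) / 3 ~= 6667 -> 8192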
When using multiple data pools to store objects, make sure that you balance the
number of PGs per pool against the number of PGs per OSD so that you arrive at
a reasonable total number of PGs. It is important to find a number that
provides reasonably low variance per OSD without taxing system resources or
making the peering process too slow.

For example, suppose you have a cluster of 10 pools, each with 512 PGs on 10
OSDs. That amounts to 5,120 PGs distributed across 10 OSDs, or 512 PGs per OSD.
This cluster will not use too many resources. However, in a cluster of 1,000
pools, each with 512 PGs on 10 OSDs, the OSDs will have to handle ~50,000 PGs
each. This cluster will require significantly more resources and significantly
more time for peering.

For determining the optimal number of PGs per OSD, we recommend the `PGCalc`_
tool.


.. _setting the number of placement groups:

Setting the Number of PGs
=========================

Setting the initial number of PGs in a pool must be done at the time you create
the pool. See `Create a Pool`_ for details.

However, even after a pool is created, if the ``pg_autoscaler`` is not being
used to manage ``pg_num`` values, you can change the number of PGs by running a
command of the following form:

.. prompt:: bash #

   ceph osd pool set {pool-name} pg_num {pg_num}

If you increase the number of PGs, your cluster will not rebalance until you
increase the number of PGs for placement (``pgp_num``). The ``pgp_num``
parameter specifies the number of PGs that are to be considered for placement
by the CRUSH algorithm. Increasing ``pg_num`` splits the PGs in your cluster,
but data will not be migrated to the newer PGs until ``pgp_num`` is increased.
The ``pgp_num`` parameter should be equal to the ``pg_num`` parameter. To
increase the number of PGs for placement, run a command of the following form:

.. prompt:: bash #

   ceph osd pool set {pool-name} pgp_num {pgp_num}

If you decrease the number of PGs, then ``pgp_num`` is adjusted automatically.
In Nautilus and later releases, when the ``pg_autoscaler`` is not used,
``pgp_num`` is automatically stepped to match ``pg_num``. This process
manifests as periods of PG remapping and backfill, which is expected and
normal behavior.

.. _rados_ops_pgs_get_pg_num:

Get the Number of PGs
=====================

To get the number of PGs in a pool, run a command of the following form:

.. prompt:: bash #

   ceph osd pool get {pool-name} pg_num


Get a Cluster's PG Statistics
=============================

To see the details of the PGs in your cluster, run a command of the following
form:

.. prompt:: bash #

   ceph pg dump [--format {format}]

Valid formats are ``plain`` (default) and ``json``.


Get Statistics for Stuck PGs
============================

To see the statistics for all PGs that are stuck in a specified state, run a
command of the following form:

.. prompt:: bash #

   ceph pg dump_stuck inactive|unclean|stale|undersized|degraded [--format <format>] [-t|--threshold <seconds>]

- **Inactive** PGs cannot process reads or writes because they are waiting for
  enough OSDs with the most up-to-date data to come ``up`` and ``in``.

- **Undersized** PGs contain objects that have not been replicated the desired
  number of times. Under normal conditions, it can be assumed that these PGs
  are recovering.

- **Stale** PGs are in an unknown state -- the OSDs that host them have not
  reported to the monitor cluster for a certain period of time (determined by
  ``mon_osd_report_timeout``).

Valid formats are ``plain`` (default) and ``json``. The threshold defines the
minimum number of seconds the PG is stuck before it is included in the returned
statistics (default: 300).


Get a PG Map
============

To get the PG map for a particular PG, run a command of the following form:

.. prompt:: bash #

   ceph pg map {pg-id}

For example:

.. prompt:: bash #

   ceph pg map 1.6c

Ceph will return the PG map, the PG, and the OSD status. The output resembles
the following::

   osdmap e13 pg 1.6c (1.6c) -> up [1,0] acting [1,0]


Get a PG's Statistics
=====================

To see statistics for a particular PG, run a command of the following form:

.. prompt:: bash #

   ceph pg {pg-id} query
Scrub a PG
==========

To scrub a PG, run a command of the following form:

.. prompt:: bash #

   ceph pg scrub {pg-id}

Ceph checks the primary and replica OSDs, generates a catalog of all objects in
the PG, and compares the objects against each other in order to ensure that no
objects are missing or mismatched and that their contents are consistent. If
the replicas all match, then a final semantic sweep takes place to ensure that
all snapshot-related object metadata is consistent. Errors are reported in
logs.

To scrub all PGs from a specific pool, run a command of the following form:

.. prompt:: bash #

   ceph osd pool scrub {pool-name}


Prioritize backfill/recovery of PG(s)
=====================================

You might encounter a situation in which multiple PGs require recovery or
backfill, but the data in some PGs is more important than the data in others
(for example, some PGs hold data for images used by running machines, while
other PGs hold less relevant data used by inactive machines). In that case, you
might want to prioritize recovery or backfill of the PGs with especially
important data so that the performance of the cluster and the availability of
their data are restored sooner. To designate specific PG(s) as prioritized
during recovery, run a command of the following form:

.. prompt:: bash #

   ceph pg force-recovery {pg-id} [{pg-id #2}] [{pg-id #3} ...]

To mark specific PG(s) as prioritized during backfill, run a command of the
following form:

.. prompt:: bash #

   ceph pg force-backfill {pg-id} [{pg-id #2}] [{pg-id #3} ...]

These commands instruct Ceph to perform recovery or backfill on the specified
PGs before processing the other PGs. Prioritization does not interrupt current
backfills or recovery, but places the specified PGs at the top of the queue so
that they will be acted upon next. If you change your mind or realize that you
have prioritized the wrong PGs, run one or both of the following commands:

.. prompt:: bash #

   ceph pg cancel-force-recovery {pg-id} [{pg-id #2}] [{pg-id #3} ...]
   ceph pg cancel-force-backfill {pg-id} [{pg-id #2}] [{pg-id #3} ...]

These commands remove the ``force`` flag from the specified PGs, so that the
PGs will be processed in their usual order. As in the case of adding the
``force`` flag, this affects only those PGs that are still queued but does not
affect PGs currently undergoing recovery.

The ``force`` flag is cleared automatically after recovery or backfill of the
PGs is complete.

Similarly, to instruct Ceph to prioritize all PGs from a specified pool (that
is, to perform recovery or backfill on those PGs first), run one or both of the
following commands:

.. prompt:: bash #

   ceph osd pool force-recovery {pool-name}
   ceph osd pool force-backfill {pool-name}

These commands can also be cancelled. To revert to the default order, run one
or both of the following commands:

.. prompt:: bash #

   ceph osd pool cancel-force-recovery {pool-name}
   ceph osd pool cancel-force-backfill {pool-name}
.. warning:: These commands can break the order of Ceph's internal priority
   computations, so use them with caution!

If you have multiple pools that are currently sharing the same underlying OSDs,
and if the data held by certain pools is more important than the data held by
other pools, then we recommend that you run a command of the following form to
arrange a custom recovery/backfill priority for all pools:

.. prompt:: bash #

   ceph osd pool set {pool-name} recovery_priority {value}

For example, if you have twenty pools, you could make the most important pool
priority ``20``, the next most important pool priority ``19``, and so on.

Another option is to set the recovery/backfill priority for only a proper
subset of pools. In such a scenario, three important pools might (all) be
assigned priority ``1`` and all other pools would be left without an assigned
recovery/backfill priority. Another possibility is to select three important
pools and set their recovery/backfill priorities to ``3``, ``2``, and ``1``
respectively.

.. important:: Numbers of greater value have higher priority than numbers of
   lesser value when using ``ceph osd pool set {pool-name} recovery_priority
   {value}`` to set the recovery/backfill priority. For example, a pool with
   the recovery/backfill priority ``30`` has a higher priority than a pool with
   the recovery/backfill priority ``15``.

Reverting Lost RADOS Objects
============================

If the cluster has lost one or more RADOS objects and you have decided to
abandon the search for the lost data, you must mark the unfound objects
``lost``.

If every possible location has been queried and all OSDs are ``up`` and ``in``,
but certain RADOS objects are still lost, you might have to give up on those
objects. This situation can arise when rare and unusual combinations of
failures allow the cluster to learn about writes that were performed before the
writes themselves were recovered.

The command to mark a RADOS object ``lost`` supports two options: ``revert``
and ``delete``. The ``revert`` option will either roll back to a previous
version of the RADOS object (if it is old enough to have a previous version) or
forget about it entirely (if it is too new to have a previous version); the
``delete`` option forgets about the object entirely. To mark the "unfound"
objects ``lost``, run a command of the following form:

.. prompt:: bash #

   ceph pg {pg-id} mark_unfound_lost revert|delete

.. important:: Use this feature with caution. It might confuse applications
   that expect the object(s) to exist.


.. toctree::
   :hidden:

   pg-states
   pg-concepts


.. _Create a Pool: ../pools#createpool
.. _Mapping PGs to OSDs: ../../../architecture#mapping-pgs-to-osds
.. _PGCalc: https://old.ceph.com/pgcalc/