.. _placement groups:

==================
 Placement Groups
==================

.. _pg-autoscaler:

Autoscaling placement groups
============================

Placement groups (PGs) are an internal implementation detail of how Ceph
distributes data. Autoscaling provides a way to manage PGs, and especially to
manage the number of PGs present in different pools.  When *pg-autoscaling* is
enabled, the cluster is allowed to make recommendations or automatic
adjustments with respect to the number of PGs for each pool (``pgp_num``) in
accordance with expected cluster utilization and expected pool utilization.

Each pool has a ``pg_autoscale_mode`` property that can be set to ``off``,
``on``, or ``warn``:

* ``off``: Disable autoscaling for this pool. It is up to the administrator to
  choose an appropriate ``pgp_num`` for each pool. For more information, see
  :ref:`choosing-number-of-placement-groups`.
* ``on``: Enable automated adjustments of the PG count for the given pool.
* ``warn``: Raise health checks when the PG count is in need of adjustment.

To set the autoscaling mode for an existing pool, run a command of the
following form:

.. prompt:: bash #

   ceph osd pool set <pool-name> pg_autoscale_mode <mode>

For example, to enable autoscaling on pool ``foo``, run the following command:

.. prompt:: bash #

   ceph osd pool set foo pg_autoscale_mode on

There is also a default ``pg_autoscale_mode`` setting that applies to any
pools created after the initial setup of the cluster. To change this setting,
run a command of the following form:

.. prompt:: bash #

   ceph config set global osd_pool_default_pg_autoscale_mode <mode>
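
For example, to make all pools created in the future default to ``warn`` mode
(so that the autoscaler reports recommendations without acting on them), and
then to confirm the setting, you might run:

.. prompt:: bash #

   ceph config set global osd_pool_default_pg_autoscale_mode warn
   ceph config get mon osd_pool_default_pg_autoscale_mode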

You can disable or enable the autoscaler for all pools with the ``noautoscale``
flag. By default, this flag is set to ``off``, but you can set it to ``on`` by
running the following command:

.. prompt:: bash #

   ceph osd pool set noautoscale

To set the ``noautoscale`` flag to ``off``, run the following command:

.. prompt:: bash #

   ceph osd pool unset noautoscale

To get the value of the flag, run the following command:

.. prompt:: bash #

   ceph osd pool get noautoscale

Viewing PG scaling recommendations
----------------------------------

To view each pool, its relative utilization, and any recommended changes to the
PG count, run the following command:

.. prompt:: bash #

   ceph osd pool autoscale-status

The output will resemble the following::

   POOL    SIZE  TARGET SIZE  RATE  RAW CAPACITY   RATIO  TARGET RATIO  EFFECTIVE RATIO BIAS PG_NUM  NEW PG_NUM  AUTOSCALE BULK
   a     12900M                3.0        82431M  0.4695                                          8         128  warn      True
   c         0                 3.0        82431M  0.0000        0.2000           0.9884  1.0      1          64  warn      True
   b         0        953.6M   3.0        82431M  0.0347                                          8              warn      False

- **POOL** is the name of the pool. 

- **SIZE** is the amount of data stored in the pool. 
  
- **TARGET SIZE** (if present) is the amount of data that is expected to be
  stored in the pool, as specified by the administrator. The system uses the
  greater of the two values for its calculation.

- **RATE** is the multiplier for the pool that determines how much raw storage
  capacity is consumed. For example, a three-replica pool will have a ratio of
  3.0, and a ``k=4 m=2`` erasure-coded pool will have a ratio of 1.5.

- **RAW CAPACITY** is the total amount of raw storage capacity on the specific
  OSDs that are responsible for storing the data of the pool (and perhaps the
  data of other pools). 

- **RATIO** is the ratio of (1) the storage consumed by the pool to (2) the
  total raw storage capacity. In other words, RATIO is defined as
  (SIZE * RATE) / RAW CAPACITY. (A worked example follows this list.)

- **TARGET RATIO** (if present) is the ratio of the expected storage of this
  pool (that is, the amount of storage that this pool is expected to consume,
  as specified by the administrator) to the expected storage of all other pools
  that have target ratios set.  If both ``target_size_bytes`` and
  ``target_size_ratio`` are specified, then ``target_size_ratio`` takes
  precedence.

- **EFFECTIVE RATIO** is the result of making two adjustments to the target
  ratio:

  #. Subtracting any capacity expected to be used by pools that have target
     size set.

  #. Normalizing the target ratios among pools that have target ratio set so
     that collectively they target cluster capacity. For example, four pools
     with ``target_size_ratio`` 1.0 would each have an effective ratio of 0.25.

  The system's calculations use whichever of these two ratios (that is, the 
  target ratio and the effective ratio) is greater.

- **BIAS** is used as a multiplier to manually adjust a pool's PG count in
  accordance with prior information about how many PGs a specific pool is
  expected to have.

- **PG_NUM** is either the current number of PGs associated with the pool or,
  if a ``pg_num`` change is in progress, the current number of PGs that the
  pool is working towards. 

- **NEW PG_NUM** (if present) is the value to which the system recommends that
  the ``pg_num`` of the pool be changed. It is always a power of 2, and it is
  present only if the recommended value varies from the current value by more
  than the default factor of ``3``. To adjust this factor (in the following
  example, it is changed to ``2``), run the following command:

  .. prompt:: bash #

     ceph osd pool set threshold 2.0

- **AUTOSCALE** is the pool's ``pg_autoscale_mode`` and is set to ``on``,
  ``off``, or ``warn``.

- **BULK** determines whether the pool is ``bulk``. It has a value of ``True``
  or ``False``. A ``bulk`` pool is expected to be large and should initially
  have a large number of PGs so that performance does not suffer. On the other
  hand, a pool that is not ``bulk`` is expected to be small (for example, a
  ``.mgr`` pool or a meta pool).
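
As a worked example of the **RATIO** calculation, consider pool ``a`` in the
sample output above: it stores 12900M of data at a RATE of 3.0 on 82431M of
raw capacity, so its RATIO is
:math:`\frac{12900 \times 3.0}{82431} \approx 0.4695`,
which matches the value shown in the RATIO column.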

.. note::

   If the ``ceph osd pool autoscale-status`` command returns no output at all,
   there is probably at least one pool that spans multiple CRUSH roots.  This
   'spanning pool' issue can happen in scenarios like the following:
   when a new deployment auto-creates the ``.mgr`` pool on the ``default``
   CRUSH root, subsequent pools are created with rules that constrain them to a
   specific shadow CRUSH tree. For example, if you create an RBD metadata pool
   that is constrained to ``deviceclass = ssd`` and an RBD data pool that is
   constrained to ``deviceclass = hdd``, you will encounter this issue. To
   remedy this issue, constrain the spanning pool to only one device class. In
   the above scenario, there is likely to be a ``replicated-ssd`` CRUSH rule in
   effect, and the ``.mgr`` pool can be constrained to ``ssd`` devices by
   running the following command:

   .. prompt:: bash #

      ceph osd pool set .mgr crush_rule replicated-ssd

   This intervention will result in a small amount of backfill, but
   typically this traffic completes quickly.
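
   To check which CRUSH rule a pool is currently assigned (for example, to
   verify the change made above), run a command of the following form:

   .. prompt:: bash #

      ceph osd pool get .mgr crush_rule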


Automated scaling
-----------------

In the simplest approach to automated scaling, the cluster is allowed to
automatically scale ``pgp_num`` in accordance with usage. Ceph considers the
total available storage and the target number of PGs for the whole system,
considers how much data is stored in each pool, and apportions PGs accordingly.
The system is conservative with its approach, making changes to a pool only
when the current number of PGs (``pg_num``) varies by more than a factor of 3
from the recommended number.

The target number of PGs per OSD is determined by the ``mon_target_pg_per_osd``
parameter (default: 100), which can be adjusted by running the following
command:

.. prompt:: bash #

   ceph config set global mon_target_pg_per_osd 100
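
One way to check the value currently in effect is to query the centralized
configuration, for example:

.. prompt:: bash #

   ceph config get mon mon_target_pg_per_osd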

The autoscaler analyzes pools and adjusts on a per-subtree basis.  Because each
pool might map to a different CRUSH rule, and each rule might distribute data
across different devices, Ceph will consider the utilization of each subtree of
the hierarchy independently. For example, a pool that maps to OSDs of class
``ssd`` and a pool that maps to OSDs of class ``hdd`` will each have optimal PG
counts that are determined by how many of these two different device types
there are.

If a pool uses OSDs under two or more CRUSH roots (for example, shadow trees
with both ``ssd`` and ``hdd`` devices), the autoscaler issues a warning to the
user in the manager log. The warning states the name of the pool and the set of
roots that overlap each other. The autoscaler does not scale any pools with
overlapping roots because this condition can cause problems with the scaling
process. We recommend constraining each pool so that it belongs to only one
root (that is, one OSD class) to silence the warning and ensure a successful
scaling process.

.. _managing_bulk_flagged_pools:

Managing pools that are flagged with ``bulk``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If a pool is flagged ``bulk``, then the autoscaler starts the pool with a full
complement of PGs and then scales down the number of PGs only if the usage
ratio across the pool is uneven.  However, if a pool is not flagged ``bulk``,
then the autoscaler starts the pool with minimal PGs and creates additional PGs
only if there is more usage in the pool.

To create a pool that will be flagged ``bulk``, run the following command:

.. prompt:: bash #

   ceph osd pool create <pool-name> --bulk

To set or unset the ``bulk`` flag of an existing pool, run the following
command:

.. prompt:: bash #

   ceph osd pool set <pool-name> bulk <true/false/1/0>

To get the ``bulk`` flag of an existing pool, run the following command:

.. prompt:: bash #

   ceph osd pool get <pool-name> bulk
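
For example, to flag the existing pool ``foo`` (used in the examples above) as
``bulk`` and then confirm the setting, you might run:

.. prompt:: bash #

   ceph osd pool set foo bulk true
   ceph osd pool get foo bulk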

.. _specifying_pool_target_size:

Specifying expected pool size
-----------------------------

When a cluster or pool is first created, it consumes only a small fraction of
the total cluster capacity and appears to the system as if it should need only
a small number of PGs. However, in some cases, cluster administrators know
which pools are likely to consume most of the system capacity in the long run.
When Ceph is provided with this information, a more appropriate number of PGs
can be used from the beginning, obviating subsequent changes in ``pg_num`` and
the associated overhead cost of relocating data.

The *target size* of a pool can be specified in two ways: either in relation to
the absolute size (in bytes) of the pool, or as a weight relative to all other
pools that have ``target_size_ratio`` set.

For example, to tell the system that ``mypool`` is expected to consume 100 TB,
run the following command:

.. prompt:: bash #

   ceph osd pool set mypool target_size_bytes 100T

Alternatively, to tell the system that ``mypool`` is expected to consume a
ratio of 1.0 relative to other pools that have ``target_size_ratio`` set,
adjust the ``target_size_ratio`` setting of ``mypool`` by running the
following command:

.. prompt:: bash # 

   ceph osd pool set mypool target_size_ratio 1.0

If ``mypool`` is the only pool in the cluster, then it is expected to use 100% of
the total cluster capacity. However, if the cluster contains a second pool that
has ``target_size_ratio`` set to 1.0, then both pools are expected to use 50%
of the total cluster capacity.

The ``ceph osd pool create`` command has two command-line options that can be
used to set the target size of a pool at creation time: ``--target-size-bytes
<bytes>`` and ``--target-size-ratio <ratio>``.
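
As a sketch of how these options might be used, a pool that is expected to hold
roughly 100 TB could be given its target size at creation time rather than
afterwards (the pool name ``mypool`` is reused from the examples above):

.. prompt:: bash #

   ceph osd pool create mypool --target-size-bytes 100T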

Note that if the target-size values that have been specified are impossible
(for example, a capacity larger than the total cluster capacity), then a health
check
(``POOL_TARGET_SIZE_BYTES_OVERCOMMITTED``) will be raised.

If both ``target_size_ratio`` and ``target_size_bytes`` are specified for a
pool, then the latter will be ignored, the former will be used in system
calculations, and a health check (``POOL_HAS_TARGET_SIZE_BYTES_AND_RATIO``)
will be raised.

Specifying bounds on a pool's PGs
---------------------------------

It is possible to specify both the minimum number and the maximum number of PGs
for a pool. 

Setting a Minimum Number of PGs and a Maximum Number of PGs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If a minimum is set, then Ceph will not itself reduce (nor recommend that you
reduce) the number of PGs to a value below the configured value. Setting a
minimum serves to establish a lower bound on the amount of parallelism enjoyed
by a client during I/O, even if a pool is mostly empty. 

If a maximum is set, then Ceph will not itself increase (or recommend that you
increase) the number of PGs to a value above the configured value.

To set the minimum number of PGs for a pool, run a command of the following
form:

.. prompt:: bash #

   ceph osd pool set <pool-name> pg_num_min <num>

To set the maximum number of PGs for a pool, run a command of the following
form:

.. prompt:: bash #

   ceph osd pool set <pool-name> pg_num_max <num>

In addition, the ``ceph osd pool create`` command has two command-line options
that can be used to specify the minimum or maximum PG count of a pool at
creation time: ``--pg-num-min <num>`` and ``--pg-num-max <num>``.
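
For example, to create a pool that is guaranteed a floor of 64 PGs even while
it is still nearly empty (the pool name here is only illustrative), you might
run:

.. prompt:: bash #

   ceph osd pool create mypool --pg-num-min 64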

.. _preselection:

Preselecting pg_num
===================

When creating a pool with the following command, you have the option to
preselect the value of the ``pg_num`` parameter:

.. prompt:: bash #

   ceph osd pool create {pool-name} [pg_num]

If you opt not to specify ``pg_num`` in this command, the cluster uses the PG
autoscaler to automatically configure the parameter in accordance with the
amount of data that is stored in the pool (see :ref:`pg-autoscaler` above).

However, your decision of whether or not to specify ``pg_num`` at creation time
has no effect on whether the parameter will be automatically tuned by the
cluster afterwards. As seen above, autoscaling of PGs is enabled or disabled by
running a command of the following form:
    
.. prompt:: bash #

   ceph osd pool set {pool-name} pg_autoscale_mode (on|off|warn)

Without the balancer, the suggested target is approximately 100 PG replicas on
each OSD. With the balancer, an initial target of 50 PG replicas on each OSD is
reasonable.

The autoscaler attempts to satisfy the following conditions:

- the number of PGs per OSD should be proportional to the amount of data in the
  pool
- there should be 50-100 PGs per pool, taking into account the replication
  overhead or erasure-coding fan-out of each PG's replicas across OSDs

Use of Placement Groups
=======================

A placement group aggregates objects within a pool. The tracking of RADOS
object placement and object metadata on a per-object basis is computationally
expensive. It would be infeasible for a system with millions of RADOS
objects to efficiently track placement on a per-object basis.

.. ditaa::
           /-----\  /-----\  /-----\  /-----\  /-----\
           | obj |  | obj |  | obj |  | obj |  | obj |
           \-----/  \-----/  \-----/  \-----/  \-----/
              |        |        |        |        |
              +--------+--------+        +---+----+
              |                              |
              v                              v
   +-----------------------+      +-----------------------+
   |  Placement Group #1   |      |  Placement Group #2   |
   |                       |      |                       |
   +-----------------------+      +-----------------------+
               |                              |
               +------------------------------+
                             |
                             v
                  +-----------------------+
                  |        Pool           |
                  |                       |
                  +-----------------------+

The Ceph client calculates which PG a RADOS object should be in. As part of
this calculation, the client hashes the object ID and performs an operation
involving both the number of PGs in the specified pool and the pool ID. For
details, see `Mapping PGs to OSDs`_.
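
To see this mapping for yourself, you can ask the cluster where a given RADOS
object in a given pool would be placed. For example (here ``mypool`` and
``myobject`` are placeholder names):

.. prompt:: bash #

   ceph osd map mypool myobject

The output reports the PG to which the object hashes and the set of OSDs that
currently serve that PG.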

The contents of a RADOS object belonging to a PG are stored in a set of OSDs.
For example, in a replicated pool of size two, each PG will store objects on
two OSDs, as shown below:

.. ditaa::
   +-----------------------+      +-----------------------+
   |  Placement Group #1   |      |  Placement Group #2   |
   |                       |      |                       |
   +-----------------------+      +-----------------------+
        |             |               |             |
        v             v               v             v
   /----------\  /----------\    /----------\  /----------\
   |          |  |          |    |          |  |          |
   |  OSD #1  |  |  OSD #2  |    |  OSD #2  |  |  OSD #3  |
   |          |  |          |    |          |  |          |
   \----------/  \----------/    \----------/  \----------/


If OSD #2 fails, another OSD will be assigned to Placement Group #1 and then
filled with copies of all objects in OSD #1. If the pool size is changed from
two to three, an additional OSD will be assigned to the PG and will receive
copies of all objects in the PG.

An OSD assigned to a PG is not owned exclusively by that PG; rather, the OSD is
shared with other PGs either from the same pool or from other pools. In our
example, OSD #2 is shared by Placement Group #1 and Placement Group #2. If OSD
#2 fails, then Placement Group #2 must restore copies of objects (by making use
of OSD #3).

When the number of PGs increases, several consequences ensue. The new PGs are
assigned OSDs. The result of the CRUSH function changes, which means that some
objects from the already-existing PGs are copied to the new PGs and removed
from the old ones.

Factors Relevant To Specifying pg_num
=====================================

On the one hand, the criteria of data durability and even distribution across
OSDs weigh in favor of a high number of PGs. On the other hand, the criteria of
saving CPU resources and minimizing memory usage weigh in favor of a low number
of PGs.

.. _data durability:

Data durability
---------------

When an OSD fails, the risk of data loss is increased until replication of the
data it hosted is restored to the configured level. To illustrate this point,
let's imagine a scenario that results in permanent data loss in a single PG:

#. The OSD fails, and the copies of objects that it stored are lost. For each
   object within the PG, the number of replicas suddenly drops from three to
   two.

#. Ceph starts recovery for this PG by choosing a new OSD on which to re-create
   the third copy of each object.

#. Another OSD within the same PG fails before the new OSD is fully populated
   with the third copy. Some objects will then only have one surviving copy.

#. Ceph selects yet another OSD and continues copying objects in order to
   restore the desired number of copies.

#. A third OSD within the same PG fails before recovery is complete. If this
   OSD happened to contain the only remaining copy of an object, the object is
   permanently lost.

In a cluster containing 10 OSDs with 512 PGs in a three-replica pool, CRUSH
will give each PG three OSDs.  Ultimately, each OSD hosts :math:`\frac{(512 *
3)}{10} = ~150` PGs. So when the first OSD fails in the above scenario,
recovery will begin for all 150 PGs at the same time.

The 150 PGs that are being recovered are likely to be homogeneously distributed
across the 9 remaining OSDs. Each remaining OSD is therefore likely to send
copies of objects to all other OSDs and also likely to receive some new objects
to be stored because it has become part of a new PG.

The amount of time it takes for this recovery to complete depends on the
architecture of the Ceph cluster. Compare two setups: (1) Each OSD is hosted by
a 1 TB SSD on a single machine, all of the OSDs are connected to a 10 Gb/s
switch, and the recovery of a single OSD completes within a certain number of
minutes. (2) There are two OSDs per machine using HDDs with no SSD WAL+DB and
a 1 Gb/s switch. In the second setup, recovery will be at least one order of
magnitude slower.

In such a cluster, the number of PGs has almost no effect on data durability.
Whether there are 128 PGs per OSD or 8192 PGs per OSD, the recovery will be no
slower or faster.

However, an increase in the number of OSDs can increase the speed of recovery.
Suppose our Ceph cluster is expanded from 10 OSDs to 20 OSDs.  Each OSD now
participates in only ~75 PGs rather than ~150 PGs. All 19 remaining OSDs will
still be required to replicate the same number of objects in order to recover.
But instead of there being only 10 OSDs that have to copy ~100 GB each, there
are now 20 OSDs that have to copy only 50 GB each. If the network had
previously been a bottleneck, recovery now happens twice as fast.

Similarly, suppose that our cluster grows to 40 OSDs. Each OSD will host only
~38 PGs. And if an OSD dies, recovery will take place faster than before unless
it is blocked by another bottleneck. Now, however, suppose that our cluster
grows to 200 OSDs. Each OSD will host only ~7 PGs. And if an OSD dies, recovery
will happen across at most :math:`\approx 21 = (7 \times 3)` OSDs
associated with these PGs. This means that recovery will take longer than when
there were only 40 OSDs. For this reason, the number of PGs should be
increased.

No matter how brief the recovery time is, there is always a chance that an
additional OSD will fail while recovery is in progress.  Consider the cluster
with 10 OSDs described above: if any of the OSDs fail, then :math:`\approx 17`
(approximately 150 divided by 9) PGs will have only one remaining copy. And if
any of the 8 remaining OSDs fail, then 2 (approximately 17 divided by 8) PGs
are likely to lose their remaining objects. This is one reason why setting
``size=2`` is risky.

When the number of OSDs in the cluster increases to 20, the number of PGs that
would be damaged by the loss of three OSDs significantly decreases. The loss of
a second OSD degrades only approximately :math:`4` (that is, :math:`\frac{75}{19}`)
PGs rather than :math:`\approx 17` PGs, and the loss of a third OSD results in
data loss only if it is one of the 4 OSDs that contains the remaining copy.
This means -- assuming that the probability of losing one OSD during recovery
is 0.0001% -- that the probability of data loss when three OSDs are lost is
:math:`\approx 17 \times 10 \times 0.0001\%` in the cluster with 10 OSDs, and
only :math:`\approx 4 \times 20 \times 0.0001\%` in the cluster with 20 OSDs.

In summary, the greater the number of OSDs, the faster the recovery and the
lower the risk of permanently losing a PG due to cascading failures. As far as
data durability is concerned, in a cluster with fewer than 50 OSDs, it doesn't
much matter whether there are 512 or 4096 PGs.

.. note::  It can take a long time for an OSD that has been recently added to
   the cluster to be populated with the PGs assigned to it.  However, no object
   degradation or impact on data durability will result from the slowness of
   this process since Ceph populates data into the new PGs before removing it
   from the old PGs.

.. _object distribution:

Object distribution within a pool
---------------------------------

Under ideal conditions, objects are evenly distributed across PGs. Because
CRUSH computes the PG for each object but does not know how much data is stored
in each OSD associated with the PG, the ratio between the number of PGs and the
number of OSDs can have a significant influence on data distribution.

For example, suppose that there is only a single PG for ten OSDs in a
three-replica pool. In that case, only three OSDs would be used because CRUSH
would have no other option. However, if more PGs are available, RADOS objects are
more likely to be evenly distributed across OSDs. CRUSH makes every effort to
distribute PGs evenly across the available OSDs.

As long as there are one or two orders of magnitude more PGs than OSDs, the
distribution is likely to be even. For example: 256 PGs for 3 OSDs, 512 PGs for
10 OSDs, or 1024 PGs for 10 OSDs.

However, uneven data distribution can emerge due to factors other than the
ratio of PGs to OSDs. For example, since CRUSH does not take into account the
size of the RADOS objects, the presence of a few very large RADOS objects can
create an imbalance. Suppose that one million 4 KB RADOS objects totaling 4 GB
are evenly distributed among 1024 PGs on 10 OSDs. These RADOS objects will
consume 4 GB / 10 = 400 MB on each OSD. If a single 400 MB RADOS object is then
added to the pool, the three OSDs supporting the PG in which the RADOS object
has been placed will each be filled with 400 MB + 400 MB = 800 MB but the seven
other OSDs will still contain only 400 MB.

.. _resource usage:

Memory, CPU and network usage
-----------------------------

Every PG in the cluster imposes memory, network, and CPU demands upon OSDs and
MONs. These needs must be met at all times and are increased during recovery.
Indeed, one of the main reasons PGs were developed was to share this overhead
by clustering objects together.

For this reason, minimizing the number of PGs saves significant resources.

.. _choosing-number-of-placement-groups:

Choosing the Number of PGs
==========================

.. note:: It is rarely necessary to do the math in this section by hand.
   Instead, use the ``ceph osd pool autoscale-status`` command in combination
   with the ``target_size_bytes`` or ``target_size_ratio`` pool properties. For
   more information, see :ref:`pg-autoscaler`.

If you have more than 50 OSDs, we recommend approximately 50-100 PGs per OSD in
order to balance resource usage, data durability, and data distribution. If you
have fewer than 50 OSDs, follow the guidance in the `preselection`_ section.
For a single pool, use the following formula to get a baseline value:

  Total PGs = :math:`\frac{OSDs \times 100}{pool \: size}`

Here **pool size** is either the number of replicas for replicated pools or the
K+M sum for erasure-coded pools. To retrieve this sum, run a command of the form
``ceph osd erasure-code-profile get <profile-name>``.
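
For example, assuming an erasure-coded pool that uses the ``default`` profile,
the K and M values can be read directly from that profile:

.. prompt:: bash #

   ceph osd erasure-code-profile get default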

Next, check whether the resulting baseline value is consistent with the way you
designed your Ceph cluster to maximize `data durability`_ and `object
distribution`_ and to minimize `resource usage`_.

This value should be **rounded up to the nearest power of two**.

Each pool's ``pg_num`` should be a power of two. Other values are likely to
result in uneven distribution of data across OSDs. It is best to increase
``pg_num`` for a pool only when it is feasible and desirable to set the next
highest power of two. Note that this power of two rule is per-pool; it is
neither necessary nor easy to align the sum of all pools' ``pg_num`` to a power
of two.

For example, if you have a cluster with 200 OSDs and a single pool with a size
of 3 replicas, estimate the number of PGs as follows:

  :math:`\frac{200 \times 100}{3} = 6667`. Rounded up to the nearest power of 2: 8192.

When using multiple data pools to store objects, make sure that you balance the
number of PGs per pool against the number of PGs per OSD so that you arrive at
a reasonable total number of PGs. It is important to find a number that
provides reasonably low variance per OSD without taxing system resources or
making the peering process too slow.

For example, suppose you have a cluster of 10 pools, each with 512 PGs on 10
OSDs. That amounts to 5,120 PGs distributed across 10 OSDs, or 512 PGs per OSD.
This cluster will not use too many resources. However, in a cluster of 1,000
pools, each with 512 PGs on 10 OSDs, the OSDs will have to handle ~50,000 PGs
each. This cluster will require significantly more resources and significantly
more time for peering.

For determining the optimal number of PGs per OSD, we recommend the `PGCalc`_
tool.


.. _setting the number of placement groups:

Setting the Number of PGs
=========================

Setting the initial number of PGs in a pool must be done at the time you create
the pool. See `Create a Pool`_ for details. 

However, even after a pool is created, if the ``pg_autoscaler`` is not being
used to manage ``pg_num`` values, you can change the number of PGs by running a
command of the following form:

.. prompt:: bash # 

   ceph osd pool set {pool-name} pg_num {pg_num}

If you increase the number of PGs, your cluster will not rebalance until you
increase the number of PGs for placement (``pgp_num``). The ``pgp_num``
parameter specifies the number of PGs that are to be considered for placement
by the CRUSH algorithm. Increasing ``pg_num`` splits the PGs in your cluster,
but data will not be migrated to the newer PGs until ``pgp_num`` is increased.
The ``pgp_num`` parameter should be equal to the ``pg_num`` parameter. To
increase the number of PGs for placement, run a command of the following form:

.. prompt:: bash #

   ceph osd pool set {pool-name} pgp_num {pgp_num}
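
For example, to split a hypothetical pool named ``mypool`` into 128 PGs and
then allow data to be rebalanced into the new PGs, you might run:

.. prompt:: bash #

   ceph osd pool set mypool pg_num 128
   ceph osd pool set mypool pgp_num 128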

If you decrease the number of PGs, then ``pgp_num`` is adjusted automatically.
In Nautilus and later releases, when the ``pg_autoscaler`` is not used,
``pgp_num`` is automatically stepped to match ``pg_num``. This process
manifests as periods of PG remapping and backfill, which is expected and
normal.

.. _rados_ops_pgs_get_pg_num:

Get the Number of PGs
=====================

To get the number of PGs in a pool, run a command of the following form:

.. prompt:: bash #

   ceph osd pool get {pool-name} pg_num


Get a Cluster's PG Statistics
=============================

To see the details of the PGs in your cluster, run a command of the following
form:

.. prompt:: bash #

   ceph pg dump [--format {format}]

Valid formats are ``plain`` (default) and ``json``.


Get Statistics for Stuck PGs
============================

To see the statistics for all PGs that are stuck in a specified state, run a
command of the following form:

.. prompt:: bash #

   ceph pg dump_stuck inactive|unclean|stale|undersized|degraded [--format <format>] [-t|--threshold <seconds>]

- **Inactive** PGs cannot process reads or writes because they are waiting for
  enough OSDs with the most up-to-date data to come ``up`` and ``in``.

- **Undersized** PGs contain objects that have not been replicated the desired
  number of times. Under normal conditions, it can be assumed that these PGs
  are recovering.

- **Stale** PGs are in an unknown state -- the OSDs that host them have not
  reported to the monitor cluster for a certain period of time (determined by
  ``mon_osd_report_timeout``).

Valid formats are ``plain`` (default) and ``json``. The threshold defines the
minimum number of seconds the PG is stuck before it is included in the returned
statistics (default: 300).
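
For example, to list only the PGs that have been stuck in the ``inactive``
state for at least ten minutes, you might run:

.. prompt:: bash #

   ceph pg dump_stuck inactive --threshold 600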


Get a PG Map
============

To get the PG map for a particular PG, run a command of the following form:

.. prompt:: bash #

   ceph pg map {pg-id}

For example: 

.. prompt:: bash #

   ceph pg map 1.6c

Ceph will return the PG map, the PG, and the OSD status. The output resembles
the following::

   osdmap e13 pg 1.6c (1.6c) -> up [1,0] acting [1,0]


Get a PG's Statistics
=====================

To see statistics for a particular PG, run a command of the following form:

.. prompt:: bash #

   ceph pg {pg-id} query


Scrub a PG
==========

To scrub a PG, run a command of the following form:

.. prompt:: bash #

   ceph pg scrub {pg-id}

Ceph checks the primary and replica OSDs, generates a catalog of all objects in
the PG, and compares the objects against each other in order to ensure that no
objects are missing or mismatched and that their contents are consistent. If
the replicas all match, then a final semantic sweep takes place to ensure that
all snapshot-related object metadata is consistent.  Errors are reported in
logs.

To scrub all PGs from a specific pool, run a command of the following form:

.. prompt:: bash #

   ceph osd pool scrub {pool-name}


Prioritize backfill/recovery of PG(s)
=====================================

You might encounter a situation in which multiple PGs require recovery or
backfill, but the data in some PGs is more important than the data in others
(for example, some PGs hold data for images that are used by running machines
and other PGs are used by inactive machines and hold data that is less
relevant). In that case, you might want to prioritize recovery or backfill of
the PGs with especially important data so that the performance of the cluster
and the availability of their data are restored sooner. To designate specific
PG(s) as prioritized during recovery, run a command of the following form:

.. prompt:: bash #

   ceph pg force-recovery {pg-id} [{pg-id #2}] [{pg-id #3} ...]

To mark specific PG(s) as prioritized during backfill, run a command of the
following form:

.. prompt:: bash #

   ceph pg force-backfill {pg-id} [{pg-id #2}] [{pg-id #3} ...]

These commands instruct Ceph to perform recovery or backfill on the specified
PGs before processing the other PGs. Prioritization does not interrupt current
backfills or recovery, but places the specified PGs at the top of the queue so
that they will be acted upon next.  If you change your mind or realize that you
have prioritized the wrong PGs, run one or both of the following commands:

.. prompt:: bash #

   ceph pg cancel-force-recovery {pg-id} [{pg-id #2}] [{pg-id #3} ...]
   ceph pg cancel-force-backfill {pg-id} [{pg-id #2}] [{pg-id #3} ...]

These commands remove the ``force`` flag from the specified PGs, so that the
PGs will be processed in their usual order. As in the case of adding the
``force`` flag, this affects only those PGs that are still queued but does not
affect PGs currently undergoing recovery.

The ``force`` flag is cleared automatically after recovery or backfill of the
PGs is complete.

Similarly, to instruct Ceph to prioritize all PGs from a specified pool (that
is, to perform recovery or backfill on those PGs first), run one or both of the
following commands:

.. prompt:: bash #

   ceph osd pool force-recovery {pool-name}
   ceph osd pool force-backfill {pool-name}

These commands can also be cancelled. To revert to the default order, run one
or both of the following commands:

.. prompt:: bash #

   ceph osd pool cancel-force-recovery {pool-name}
   ceph osd pool cancel-force-backfill {pool-name}

.. warning:: These commands can break the order of Ceph's internal priority
   computations, so use them with caution! If you have multiple pools that are
   currently sharing the same underlying OSDs, and if the data held by certain
   pools is more important than the data held by other pools, then we recommend
   that you run a command of the following form to arrange a custom
   recovery/backfill priority for all pools:

.. prompt:: bash #

   ceph osd pool set {pool-name} recovery_priority {value}

For example, if you have twenty pools, you could make the most important pool  
priority ``20``, and the next most important pool priority ``19``, and so on.  

Another option is to set the recovery/backfill priority for only a proper
subset of pools. In such a scenario, three important pools might (all) be
assigned priority ``1`` and all other pools would be left without an assigned
recovery/backfill priority.  Another possibility is to select three important
pools and set their recovery/backfill priorities to ``3``, ``2``, and ``1``
respectively.
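
As a sketch of that last arrangement, assuming three especially important pools
named ``pool-a``, ``pool-b``, and ``pool-c`` (hypothetical names), you might
run:

.. prompt:: bash #

   ceph osd pool set pool-a recovery_priority 3
   ceph osd pool set pool-b recovery_priority 2
   ceph osd pool set pool-c recovery_priority 1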

.. important:: Numbers of greater value have higher priority than numbers of
   lesser value when using ``ceph osd pool set {pool-name} recovery_priority
   {value}`` to set their recovery/backfill priority. For example, a pool with
   the recovery/backfill priority ``30`` has a higher priority than a pool with
   the recovery/backfill priority ``15``.

Reverting Lost RADOS Objects
============================

If the cluster has lost one or more RADOS objects and you have decided to
abandon the search for the lost data, you must mark the unfound objects
``lost``.

If every possible location has been queried and all OSDs are ``up`` and ``in``,
but certain RADOS objects are still lost, you might have to give up on those
objects. This situation can arise when rare and unusual combinations of
failures allow the cluster to learn about writes that were performed before the
writes themselves were recovered.

The command to mark a RADOS object ``lost`` has only one supported option:
``revert``. The ``revert`` option will either roll back to a previous version
of the RADOS object (if it is old enough to have a previous version) or forget
about it entirely (if it is too new to have a previous version). To mark the
"unfound" objects ``lost``, run a command of the following form:


.. prompt:: bash #

   ceph pg {pg-id} mark_unfound_lost revert|delete

.. important:: Use this feature with caution. It might confuse applications
   that expect the object(s) to exist.


.. toctree::
        :hidden:

        pg-states
        pg-concepts


.. _Create a Pool: ../pools#createpool
.. _Mapping PGs to OSDs: ../../../architecture#mapping-pgs-to-osds
.. _pgcalc: https://old.ceph.com/pgcalc/