==================
 Placement Groups
==================

.. _pg-autoscaler:

Autoscaling placement groups
============================

Placement groups (PGs) are an internal implementation detail of how
Ceph distributes data.  You can allow the cluster to either make
recommendations or automatically tune PGs based on how the cluster is
used by enabling *pg-autoscaling*.

Each pool in the system has a ``pg_autoscale_mode`` property that can be set to ``off``, ``on``, or ``warn``.

* ``off``: Disable autoscaling for this pool.  It is up to the administrator to choose an appropriate PG number for each pool.  Please refer to :ref:`choosing-number-of-placement-groups` for more information.
* ``on``: Enable automated adjustments of the PG count for the given pool.
* ``warn``: Raise health alerts when the PG count should be adjusted.

To set the autoscaling mode for an existing pool:

.. prompt:: bash #

   ceph osd pool set <pool-name> pg_autoscale_mode <mode>

For example, to enable autoscaling on pool ``foo``:

.. prompt:: bash #

   ceph osd pool set foo pg_autoscale_mode on

You can also configure the default ``pg_autoscale_mode`` that is
set on any pools that are subsequently created:

.. prompt:: bash #

   ceph config set global osd_pool_default_pg_autoscale_mode <mode>
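
For example, to have all subsequently created pools merely warn rather than
autoscale:

.. prompt:: bash #

   ceph config set global osd_pool_default_pg_autoscale_mode warn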

You can disable or enable the autoscaler for all pools with
the ``noautoscale`` flag. By default this flag is ``off``,
but you can turn it ``on`` by using the command:

.. prompt:: bash #

   ceph osd pool set noautoscale

You can turn it ``off`` using the command:

.. prompt:: bash #

   ceph osd pool unset noautoscale

To get the value of the ``noautoscale`` flag, use the command:

.. prompt:: bash #

   ceph osd pool get noautoscale

Viewing PG scaling recommendations
----------------------------------

You can view each pool, its relative utilization, and any suggested changes to
the PG count with this command:

.. prompt:: bash #

   ceph osd pool autoscale-status

Output will be something like::

   POOL    SIZE  TARGET SIZE  RATE  RAW CAPACITY   RATIO  TARGET RATIO  EFFECTIVE RATIO BIAS PG_NUM  NEW PG_NUM  AUTOSCALE BULK
   a     12900M                3.0        82431M  0.4695                                          8         128  warn      True
   c         0                 3.0        82431M  0.0000        0.2000           0.9884  1.0      1          64  warn      True
   b         0        953.6M   3.0        82431M  0.0347                                          8              warn      False

**SIZE** is the amount of data stored in the pool. **TARGET SIZE**, if
present, is the amount of data the administrator has specified that
they expect to eventually be stored in this pool.  The system uses
the larger of the two values for its calculation.

**RATE** is the multiplier for the pool that determines how much raw
storage capacity is consumed.  For example, a 3-replica pool will
have a rate of 3.0, while a k=4,m=2 erasure coded pool will have a
rate of 1.5.

**RAW CAPACITY** is the total amount of raw storage capacity on the
OSDs that are responsible for storing this pool's (and perhaps other
pools') data.  **RATIO** is the ratio of that total capacity that
this pool is consuming (i.e., ratio = size * rate / raw capacity).
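
For example, in the sample output above, pool ``a`` stores 12900M of data at a
rate of 3.0 on 82431M of raw capacity, so its ratio is
:math:`(12900 \times 3.0) / 82431 \approx 0.4695`, which matches the reported
**RATIO** value.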

**TARGET RATIO**, if present, is the ratio of storage that the
administrator has specified that they expect this pool to consume
relative to other pools with target ratios set.
If both target size bytes and ratio are specified, the
ratio takes precedence.

**EFFECTIVE RATIO** is the target ratio after adjusting in two ways:

1. subtracting any capacity expected to be used by pools with target size set
2. normalizing the target ratios among pools with target ratio set so
   they collectively target the rest of the space. For example, 4
   pools with target_ratio 1.0 would have an effective ratio of 0.25.

The system uses the larger of the actual ratio and the effective ratio
for its calculation.

**BIAS** is used as a multiplier to manually adjust a pool's PG count
based on prior information about how many PGs a specific pool is
expected to have.

**PG_NUM** is the current number of PGs for the pool (or the current
number of PGs that the pool is working towards, if a ``pg_num``
change is in progress).  **NEW PG_NUM**, if present, is what the
system believes the pool's ``pg_num`` should be changed to.  It is
always a power of 2, and will only be present if the "ideal" value
varies from the current value by more than a factor of 3 by default.
This factor can be adjusted with:

.. prompt:: bash #

  ceph osd pool set threshold 2.0

**AUTOSCALE** is the pool's ``pg_autoscale_mode``
and will be either ``on``, ``off``, or ``warn``.

The final column, **BULK**, indicates whether the pool is ``bulk``
and will be either ``True`` or ``False``. A ``bulk`` pool is expected
to be large and should therefore start out with a large number of PGs
for performance purposes. On the other hand, pools without the
``bulk`` flag are expected to be smaller, e.g. the ``.mgr`` pool or
meta pools.


Automated scaling
-----------------

Allowing the cluster to automatically scale PGs based on usage is the
simplest approach.  Ceph will look at the total available storage and
target number of PGs for the whole system, look at how much data is
stored in each pool, and try to apportion the PGs accordingly.  The
system is relatively conservative with its approach, making changes
to a pool only when the current number of PGs (``pg_num``) differs
from its recommended value by more than a factor of 3.

The target number of PGs per OSD is based on the
``mon_target_pg_per_osd`` configurable (default: 100), which can be
adjusted with:

.. prompt:: bash #

   ceph config set global mon_target_pg_per_osd 100

The autoscaler analyzes pools and adjusts on a per-subtree basis.
Because each pool may map to a different CRUSH rule, and each rule may
distribute data across different devices, Ceph will consider
utilization of each subtree of the hierarchy independently.  For
example, a pool that maps to OSDs of class `ssd` and a pool that maps
to OSDs of class `hdd` will each have optimal PG counts that depend on
the number of those respective device types.

In the case where a pool uses OSDs under two or more CRUSH roots (e.g., shadow
trees with both `ssd` and `hdd` devices), the autoscaler will
issue a warning to the user in the manager log stating the name of the pool
and the set of roots that overlap each other. The autoscaler will not
scale any pools with overlapping roots because this can cause problems
with the scaling process. We recommend making each pool belong to only
one root (one OSD class) to get rid of the warning and ensure a successful
scaling process.

The autoscaler uses the `bulk` flag to determine which pools
should start out with a full complement of PGs; such pools are
scaled down only when the usage ratio across the pool is not even.
Pools without the `bulk` flag start out with minimal PGs and
gain more PGs only when there is more usage in the pool.

To create a pool with the `bulk` flag:

.. prompt:: bash #

   ceph osd pool create <pool-name> --bulk

To set or unset the `bulk` flag of an existing pool:

.. prompt:: bash #

   ceph osd pool set <pool-name> bulk <true/false/1/0>
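
For example, to mark the pool ``foo`` from the earlier example as a bulk pool:

.. prompt:: bash #

   ceph osd pool set foo bulk true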

To get the `bulk` flag of an existing pool:

.. prompt:: bash #

   ceph osd pool get <pool-name> bulk

.. _specifying_pool_target_size:

Specifying expected pool size
-----------------------------

When a cluster or pool is first created, it will consume a small
fraction of the total cluster capacity and will appear to the system
as if it should only need a small number of placement groups.
However, in most cases cluster administrators have a good idea which
pools are expected to consume most of the system capacity over time.
By providing this information to Ceph, a more appropriate number of
PGs can be used from the beginning, preventing subsequent changes in
``pg_num`` and the overhead associated with moving data around when
those adjustments are made.

The *target size* of a pool can be specified in two ways: either in
terms of the absolute size of the pool (i.e., bytes), or as a weight
relative to other pools with a ``target_size_ratio`` set.

For example:

.. prompt:: bash #

   ceph osd pool set mypool target_size_bytes 100T

will tell the system that `mypool` is expected to consume 100 TiB of
space.  Alternatively:

.. prompt:: bash # 

   ceph osd pool set mypool target_size_ratio 1.0

will tell the system that `mypool` is expected to consume 1.0 relative
to the other pools with ``target_size_ratio`` set. If `mypool` is the
only pool in the cluster, this means an expected use of 100% of the
total capacity. If there is a second pool with ``target_size_ratio``
1.0, both pools would expect to use 50% of the cluster capacity.

You can also set the target size of a pool at creation time with the optional ``--target-size-bytes <bytes>`` or ``--target-size-ratio <ratio>`` arguments to the ``ceph osd pool create`` command.
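
For example, the equivalent of the ``target_size_ratio`` example above at
creation time (using the illustrative pool name ``mypool``, and letting the
autoscaler choose ``pg_num``) would be:

.. prompt:: bash #

   ceph osd pool create mypool --target-size-ratio 1.0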

Note that if impossible target size values are specified (for example,
a capacity larger than the total cluster capacity), a health warning
(``POOL_TARGET_SIZE_BYTES_OVERCOMMITTED``) will be raised.

If both ``target_size_ratio`` and ``target_size_bytes`` are specified
for a pool, only the ratio will be considered, and a health warning
(``POOL_HAS_TARGET_SIZE_BYTES_AND_RATIO``) will be issued.

Specifying bounds on a pool's PGs
---------------------------------

It is also possible to specify a minimum number of PGs for a pool.
This is useful for establishing a lower bound on the amount of
parallelism clients will see when doing IO, even when a pool is mostly
empty.  Setting the lower bound prevents Ceph from reducing (or
recommending you reduce) the PG number below the configured number.

You can set the minimum or maximum number of PGs for a pool with:

.. prompt:: bash #

   ceph osd pool set <pool-name> pg_num_min <num>
   ceph osd pool set <pool-name> pg_num_max <num>

You can also specify the minimum or maximum PG count at pool creation
time with the optional ``--pg-num-min <num>`` or ``--pg-num-max
<num>`` arguments to the ``ceph osd pool create`` command.
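
For example, to create a pool (again using the illustrative name ``mypool``)
whose PG count will not be reduced, or recommended to be reduced, below 32:

.. prompt:: bash #

   ceph osd pool create mypool --pg-num-min 32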

.. _preselection:

A preselection of pg_num
========================

When creating a new pool with:

.. prompt:: bash #

   ceph osd pool create {pool-name} [pg_num]

it is optional to choose the value of ``pg_num``.  If you do not
specify ``pg_num``, the cluster can (by default) automatically tune it
for you based on how much data is stored in the pool (see above, :ref:`pg-autoscaler`).

Alternatively, ``pg_num`` can be explicitly provided.  However,
whether you specify a ``pg_num`` value or not does not affect whether
the value is automatically tuned by the cluster after the fact.  To
enable or disable auto-tuning:

.. prompt:: bash #

   ceph osd pool set {pool-name} pg_autoscale_mode (on|off|warn)

The "rule of thumb" for PGs per OSD has traditionally be 100.  With
the additional of the balancer (which is also enabled by default), a
value of more like 50 PGs per OSD is probably reasonable.  The
challenge (which the autoscaler normally does for you), is to:

- have the PGs per pool proportional to the data in the pool, and
- end up with 50-100 PGs per OSDs, after the replication or
  erasuring-coding fan-out of each PG across OSDs is taken into
  consideration
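
As a quick worked example of this arithmetic, consider 10 OSDs and a single
replicated pool of size 3: targeting 100 PGs per OSD gives
:math:`\frac{10 \times 100}{3} \approx 333`, which would be rounded up to the
next power of two, 512 (see :ref:`choosing-number-of-placement-groups` below).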

How are Placement Groups used?
===============================

A placement group (PG) aggregates objects within a pool because
tracking object placement and object metadata on a per-object basis is
computationally expensive--i.e., a system with millions of objects
cannot realistically track placement on a per-object basis.

.. ditaa::
           /-----\  /-----\  /-----\  /-----\  /-----\
           | obj |  | obj |  | obj |  | obj |  | obj |
           \-----/  \-----/  \-----/  \-----/  \-----/
              |        |        |        |        |
              +--------+--------+        +---+----+
              |                              |
              v                              v
   +-----------------------+      +-----------------------+
   |  Placement Group #1   |      |  Placement Group #2   |
   |                       |      |                       |
   +-----------------------+      +-----------------------+
               |                              |
               +------------------------------+
                             |
                             v
                  +-----------------------+
                  |        Pool           |
                  |                       |
                  +-----------------------+

The Ceph client will calculate which placement group an object should
be in. It does this by hashing the object ID and applying an operation
based on the number of PGs in the defined pool and the ID of the pool.
See `Mapping PGs to OSDs`_ for details.
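
You can ask the cluster to show this mapping for any object name with the
``ceph osd map`` command. The pool and object names below are illustrative,
and the object does not need to exist for the mapping to be computed:

.. prompt:: bash #

   ceph osd map mypool myobject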

The object's contents within a placement group are stored in a set of
OSDs. For instance, in a replicated pool of size two, each placement
group will store objects on two OSDs, as shown below.

.. ditaa::
   +-----------------------+      +-----------------------+
   |  Placement Group #1   |      |  Placement Group #2   |
   |                       |      |                       |
   +-----------------------+      +-----------------------+
        |             |               |             |
        v             v               v             v
   /----------\  /----------\    /----------\  /----------\
   |          |  |          |    |          |  |          |
   |  OSD #1  |  |  OSD #2  |    |  OSD #2  |  |  OSD #3  |
   |          |  |          |    |          |  |          |
   \----------/  \----------/    \----------/  \----------/


Should OSD #2 fail, another will be assigned to Placement Group #1 and
will be filled with copies of all objects in OSD #1. If the pool size
is changed from two to three, an additional OSD will be assigned to
the placement group and will receive copies of all objects in the
placement group.

Placement groups do not own the OSD; they share it with other
placement groups from the same pool or even other pools. If OSD #2
fails, Placement Group #2 will also have to restore copies of
objects, using OSD #3.

When the number of placement groups increases, the new placement
groups will be assigned OSDs. The result of the CRUSH function will
also change and some objects from the former placement groups will be
copied over to the new Placement Groups and removed from the old ones.

Placement Groups Tradeoffs
==========================

Data durability and even distribution among all OSDs call for more
placement groups, but their number should be reduced to the minimum
needed in order to save CPU and memory.

.. _data durability:

Data durability
---------------

After an OSD fails, the risk of data loss increases until the data it
contained is fully recovered. Let's imagine a scenario that causes
permanent data loss in a single placement group:

- The OSD fails and all copies of the objects it contains are lost.
  For all objects within the placement group, the number of replicas
  suddenly drops from three to two.

- Ceph starts recovery for this placement group by choosing a new OSD
  to re-create the third copy of all objects.

- Another OSD, within the same placement group, fails before the new
  OSD is fully populated with the third copy. Some objects will then
  only have one surviving copy.

- Ceph picks yet another OSD and keeps copying objects to restore the
  desired number of copies.

- A third OSD, within the same placement group, fails before recovery
  is complete. If this OSD contained the only remaining copy of an
  object, it is permanently lost.

In a cluster containing 10 OSDs with 512 placement groups in a three
replica pool, CRUSH will give each placement group three OSDs. In the
end, each OSD will end up hosting (512 * 3) / 10 = ~150 placement
groups. When the first OSD fails, the above scenario will therefore
start recovery for all 150 placement groups at the same time.

The 150 placement groups being recovered are likely to be
homogeneously spread over the 9 remaining OSDs. Each remaining OSD is
therefore likely to send copies of objects to all others and also
receive some new objects to be stored because they became part of a
new placement group.

The amount of time it takes for this recovery to complete entirely
depends on the architecture of the Ceph cluster. Let's say each OSD is
hosted by a 1TB SSD on a single machine, all of them are connected
to a 10Gb/s switch, and the recovery for a single OSD completes within
M minutes. If there are two OSDs per machine using spinners with no
SSD journal and a 1Gb/s switch, it will be at least an order of
magnitude slower.

In a cluster of this size, the number of placement groups has almost
no influence on data durability. It could be 128 or 8192 and the
recovery would not be slower or faster.

However, growing the same Ceph cluster to 20 OSDs instead of 10 OSDs
is likely to speed up recovery and therefore improve data durability
significantly. Each OSD now participates in only ~75 placement groups
instead of ~150 when there were only 10 OSDs and it will still require
all 19 remaining OSDs to perform the same amount of object copies in
order to recover. But where 10 OSDs had to copy approximately 100GB
each, they now have to copy 50GB each instead. If the network was the
bottleneck, recovery will happen twice as fast. In other words,
recovery goes faster when the number of OSDs increases.

If this cluster grows to 40 OSDs, each of them will only host ~35
placement groups. If an OSD dies, recovery will keep going faster
unless it is blocked by another bottleneck. However, if this cluster
grows to 200 OSDs, each of them will only host ~7 placement groups. If
an OSD dies, recovery will happen among at most ~21 (7 * 3) OSDs
in these placement groups: recovery will take longer than when there
were 40 OSDs, meaning the number of placement groups should be
increased.

No matter how short the recovery time is, there is a chance for a
second OSD to fail while it is in progress. In the 10-OSD cluster
described above, if any of them fail, then ~17 placement groups
(i.e. ~150 / 9 placement groups being recovered) will only have one
surviving copy. And if any of the 8 remaining OSDs fail, the last
objects of two placement groups are likely to be lost (i.e. ~17 / 8
placement groups with only one remaining copy being recovered).

When the size of the cluster grows to 20 OSDs, the number of Placement
Groups damaged by the loss of three OSDs drops. The second OSD lost
will degrade ~4 (i.e. ~75 / 19 placement groups being recovered)
instead of ~17 and the third OSD lost will only lose data if it is one
of the four OSDs containing the surviving copy. In other words, if the
probability of losing one OSD is 0.0001% during the recovery time
frame, it goes from 17 * 10 * 0.0001% in the cluster with 10 OSDs to 4 * 20 *
0.0001% in the cluster with 20 OSDs.

In a nutshell, more OSDs mean faster recovery and a lower risk of
cascading failures leading to the permanent loss of a Placement
Group. Having 512 or 4096 Placement Groups is roughly equivalent in a
cluster with fewer than 50 OSDs as far as data durability is concerned.

.. note:: It may take a long time for a new OSD added to the cluster to be
   populated with the placement groups that were assigned to it. However,
   there is no degradation of any object and it has no impact on the
   durability of the data contained in the cluster.

.. _object distribution:

Object distribution within a pool
---------------------------------

Ideally objects are evenly distributed in each placement group. Since
CRUSH computes the placement group for each object, but does not
actually know how much data is stored in each OSD within this
placement group, the ratio between the number of placement groups and
the number of OSDs may influence the distribution of the data
significantly.

For instance, if there was a single placement group for ten OSDs in a
three replica pool, only three OSDs would be used because CRUSH would
have no other choice. When more placement groups are available,
objects are more likely to be evenly spread among them. CRUSH also
makes every effort to evenly spread OSDs among all existing Placement
Groups.

As long as there are one or two orders of magnitude more Placement
Groups than OSDs, the distribution should be even. For instance, 256
placement groups for 3 OSDs, 512 or 1024 placement groups for 10 OSDs
etc.

Uneven data distribution can be caused by factors other than the ratio
between OSDs and placement groups. Since CRUSH does not take into
account the size of the objects, a few very large objects may create
an imbalance. Let's say one million 4K objects totaling 4GB are evenly
spread among 1024 placement groups on 10 OSDs. They will use 4GB / 10
= 400MB on each OSD. If one 400MB object is added to the pool, the
three OSDs supporting the placement group in which the object has been
placed will be filled with 400MB + 400MB = 800MB while the seven
others will remain occupied with only 400MB.

.. _resource usage:

Memory, CPU and network usage
-----------------------------

For each placement group, OSDs and MONs need memory, network and CPU
at all times and even more during recovery. Sharing this overhead by
clustering objects within a placement group is one of the main reasons
they exist.

Minimizing the number of placement groups saves significant amounts of
resources.

.. _choosing-number-of-placement-groups:

Choosing the number of Placement Groups
=======================================

.. note:: It is rarely necessary to do this math by hand.  Instead, use the ``ceph osd pool autoscale-status`` command in combination with the ``target_size_bytes`` or ``target_size_ratio`` pool properties.  See :ref:`pg-autoscaler` for more information.

If you have more than 50 OSDs, we recommend approximately 50-100
placement groups per OSD to balance out resource usage, data
durability and distribution. If you have fewer than 50 OSDs, choosing
among the `preselection`_ above is best. For a single pool of objects,
you can use the following formula to get a baseline:

  Total PGs = :math:`\frac{OSDs \times 100}{pool \: size}`

Where **pool size** is either the number of replicas for replicated
pools or the K+M sum for erasure coded pools (as returned by **ceph
osd erasure-code-profile get**).
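
For example, you can read K and M from the erasure code profile used by a
pool (here the built-in ``default`` profile); a profile with K=4 and M=2
gives a pool size of 6 for the purposes of this formula:

.. prompt:: bash #

   ceph osd erasure-code-profile get default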

You should then check if the result makes sense with the way you
designed your Ceph cluster to maximize `data durability`_,
`object distribution`_ and minimize `resource usage`_.

The result should always be **rounded up to the nearest power of two**.

Only a power of two will evenly balance the number of objects among
placement groups. Other values will result in an uneven distribution of
data across your OSDs. Their use should be limited to incrementally
stepping from one power of two to another.

As an example, for a cluster with 200 OSDs and a pool size of 3
replicas, you would estimate your number of PGs as follows:

  :math:`\frac{200 \times 100}{3} = 6667`. Nearest power of 2: 8192

When using multiple data pools for storing objects, you need to ensure
that you balance the number of placement groups per pool with the
number of placement groups per OSD so that you arrive at a reasonable
total number of placement groups that provides reasonably low variance
per OSD without taxing system resources or making the peering process
too slow.

For instance, a cluster of 10 pools, each with 512 placement groups, on
ten OSDs amounts to a total of 5,120 placement groups spread over ten
OSDs, that is, 512 placement groups per OSD. That does not use too many
resources. However, if 1,000 pools were created with 512 placement
groups each, the OSDs would handle ~50,000 placement groups each and it
would require significantly more resources and time for peering.

You may find the `PGCalc`_ tool helpful.


.. _setting the number of placement groups:

Set the Number of Placement Groups
==================================

To set the number of placement groups in a pool, you must specify the
number of placement groups at the time you create the pool.
See `Create a Pool`_ for details.  Even after a pool is created, you can also change the number of placement groups with:

.. prompt:: bash # 

   ceph osd pool set {pool-name} pg_num {pg_num}

After you increase the number of placement groups, you must also
increase the number of placement groups for placement (``pgp_num``)
before your cluster will rebalance. The ``pgp_num`` is the number of
placement groups that will be considered for placement by the CRUSH
algorithm. Increasing ``pg_num`` splits the placement groups, but data
will not be migrated to the newer placement groups until the number of
placement groups for placement, i.e. ``pgp_num``, is increased. The
``pgp_num`` should be equal to the ``pg_num``.  To increase the number
of placement groups for placement, execute the following:

.. prompt:: bash #

   ceph osd pool set {pool-name} pgp_num {pgp_num}

When decreasing the number of PGs, ``pgp_num`` is adjusted
automatically for you.

Get the Number of Placement Groups
==================================

To get the number of placement groups in a pool, execute the following:

.. prompt:: bash #
   
   ceph osd pool get {pool-name} pg_num


Get a Cluster's PG Statistics
=============================

To get the statistics for the placement groups in your cluster, execute the following:

.. prompt:: bash #

   ceph pg dump [--format {format}]

Valid formats are ``plain`` (default) and ``json``.


Get Statistics for Stuck PGs
============================

To get the statistics for all placement groups stuck in a specified state,
execute the following:

.. prompt:: bash #

   ceph pg dump_stuck inactive|unclean|stale|undersized|degraded [--format <format>] [-t|--threshold <seconds>]

**Inactive** Placement groups cannot process reads or writes because they are waiting for an OSD
with the most up-to-date data to come up and in.

**Unclean** Placement groups contain objects that are not replicated the desired number
of times. They should be recovering.

**Stale** Placement groups are in an unknown state - the OSDs that host them have not
reported to the monitor cluster in a while (configured by ``mon_osd_report_timeout``).

Valid formats are ``plain`` (default) and ``json``. The threshold defines the minimum number
of seconds the placement group is stuck before including it in the returned statistics
(default 300 seconds).
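
For example, following the synopsis above, to list placement groups that have
been stuck in the ``stale`` state for at least ten minutes, in JSON format:

.. prompt:: bash #

   ceph pg dump_stuck stale --format json --threshold 600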


Get a PG Map
============

To get the placement group map for a particular placement group, execute the following:

.. prompt:: bash #

   ceph pg map {pg-id}

For example: 

.. prompt:: bash #

   ceph pg map 1.6c

Ceph will return the placement group map, the placement group, and the OSD status::

   osdmap e13 pg 1.6c (1.6c) -> up [1,0] acting [1,0]


Get a PG's Statistics
=====================

To retrieve statistics for a particular placement group, execute the following:

.. prompt:: bash #

   ceph pg {pg-id} query


Scrub a Placement Group
=======================

To scrub a placement group, execute the following:

.. prompt:: bash #

   ceph pg scrub {pg-id}

Ceph checks the primary and any replica nodes, generates a catalog of all objects
in the placement group, and compares them to ensure that no objects are missing
or mismatched and that their contents are consistent.  Assuming the replicas all
match, a final semantic sweep ensures that all of the snapshot-related object
metadata is consistent. Errors are reported via logs.

To scrub all placement groups from a specific pool, execute the following:

.. prompt:: bash #

   ceph osd pool scrub {pool-name}

Prioritize backfill/recovery of Placement Group(s)
====================================================

You may run into a situation where a number of placement groups require
recovery and/or backfill, and some of them hold data more important
than others (for example, some PGs may hold data for images used by running
machines while other PGs may be used by inactive machines or hold less
relevant data). In that case, you may want to prioritize recovery of those
groups so that the performance and/or availability of the data stored on them
is restored earlier. To do this (that is, to mark particular placement
group(s) as prioritized during backfill or recovery), execute the following:

.. prompt:: bash #

   ceph pg force-recovery {pg-id} [{pg-id #2}] [{pg-id #3} ...]
   ceph pg force-backfill {pg-id} [{pg-id #2}] [{pg-id #3} ...]

This will cause Ceph to perform recovery or backfill on the specified placement
groups first, before other placement groups. This does not interrupt currently
ongoing backfills or recovery, but causes the specified PGs to be processed
as soon as possible. If you change your mind or prioritized the wrong groups,
use:

.. prompt:: bash #

   ceph pg cancel-force-recovery {pg-id} [{pg-id #2}] [{pg-id #3} ...]
   ceph pg cancel-force-backfill {pg-id} [{pg-id #2}] [{pg-id #3} ...]

This will remove the "force" flag from those PGs, and they will be processed
in the default order. Again, this doesn't affect placement groups currently
being processed, only those that are still queued.

The "force" flag is cleared automatically after the recovery or backfill of a
group is done.

Similarly, you may use the following commands to force Ceph to perform recovery
or backfill on all placement groups from a specified pool first:

.. prompt:: bash #

   ceph osd pool force-recovery {pool-name}
   ceph osd pool force-backfill {pool-name}

or:

.. prompt:: bash #

   ceph osd pool cancel-force-recovery {pool-name}
   ceph osd pool cancel-force-backfill {pool-name}

to restore to the default recovery or backfill priority if you change your mind.

Note that these commands could possibly break the ordering of Ceph's internal
priority computations, so use them with caution!
In particular, if you have multiple pools that are currently sharing the same
underlying OSDs, and some particular pools hold data more important than others,
we recommend using the following command to re-arrange all pools'
recovery/backfill priorities in a better order:

.. prompt:: bash #

   ceph osd pool set {pool-name} recovery_priority {value}

For example, if you have 10 pools you could make the most important one
priority 10, the next 9, and so on. Or you could leave most pools alone and
have, say, 3 important pools with priorities 3, 2, and 1 respectively (or give
them all priority 1).
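
For example, with three hypothetical pools (the pool names are illustrative
only), you might assign descending priorities like this:

.. prompt:: bash #

   ceph osd pool set important-pool recovery_priority 3
   ceph osd pool set less-important-pool recovery_priority 2
   ceph osd pool set least-important-pool recovery_priority 1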

Revert Lost
===========

If the cluster has lost one or more objects, and you have decided to
abandon the search for the lost data, you must mark the unfound objects
as ``lost``.

If all possible locations have been queried and objects are still
lost, you may have to give up on them. This situation can arise from
unusual combinations of failures that allow the cluster to learn about
writes that were performed before the writes themselves were
recovered.

Currently the only supported option is "revert", which will either roll back to
a previous version of the object or (if it was a new object) forget about it
entirely. To mark the "unfound" objects as "lost", execute the following:

.. prompt:: bash #

   ceph pg {pg-id} mark_unfound_lost revert|delete

.. important:: Use this feature with caution, because it may confuse
   applications that expect the object(s) to exist.


.. toctree::
        :hidden:

        pg-states
        pg-concepts


.. _Create a Pool: ../pools#createpool
.. _Mapping PGs to OSDs: ../../../architecture#mapping-pgs-to-osds
.. _pgcalc: https://old.ceph.com/pgcalc/