====================
 Troubleshooting PGs
====================

Placement Groups Never Get Clean
================================

If, after you have created your cluster, any Placement Groups (PGs) remain in
the ``active`` status, the ``active+remapped`` status, or the
``active+degraded`` status and never achieve an ``active+clean`` status, you
likely have a problem with your configuration.

In such a situation, it may be necessary to review the settings in the `Pool,
PG and CRUSH Config Reference`_ and make appropriate adjustments.

As a general rule, run your cluster with more than one OSD and a pool size
greater than two object replicas.

.. _one-node-cluster:

One Node Cluster
----------------

Ceph no longer provides documentation for operating on a single node. Systems
designed for distributed computing by definition do not run on a single node.
The mounting of client kernel modules on a single node that contains a Ceph
daemon may cause a deadlock due to issues with the Linux kernel itself (unless
VMs are used as clients). You can nevertheless experiment with Ceph in a
one-node configuration, in spite of the limitations described here.

To create a cluster on a single node, you must change the
``osd_crush_chooseleaf_type`` setting from the default of ``1`` (meaning
``host`` or ``node``) to ``0`` (meaning ``osd``) in your Ceph configuration
file before you create your monitors and OSDs. This tells Ceph that an OSD is
permitted to place another OSD on the same host. If you are trying to set up a
single-node cluster and ``osd_crush_chooseleaf_type`` is greater than ``0``,
Ceph will attempt to place the PGs of one OSD with the PGs of another OSD on
another node, chassis, rack, row, or datacenter, depending on the setting.

.. tip:: DO NOT mount kernel clients directly on the same node as your Ceph
   Storage Cluster. Kernel conflicts can arise. However, you can mount kernel
   clients within virtual machines (VMs) on a single node.

If you are creating OSDs using a single disk, you must manually create
directories for the data first.


Fewer OSDs than Replicas
------------------------

If two OSDs are in an ``up`` and ``in`` state, but the placement groups are not
in an ``active + clean`` state, you may have ``osd_pool_default_size`` set to a
value greater than ``2``.

There are a few ways to address this situation. If you want to operate your
cluster in an ``active + degraded`` state with two replicas, you can set
``osd_pool_default_min_size`` to ``2`` so that you can write objects in an
``active + degraded`` state. You may also set the ``osd_pool_default_size``
setting to ``2`` so that you have only two stored replicas (the original and
one replica). In such a case, the cluster should achieve an ``active + clean``
state.

.. note:: You can make the changes while the cluster is running. If you make
   the changes in your Ceph configuration file, you might need to restart your
   cluster.

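For example, here is a minimal sketch of making these changes at runtime,
assuming a release with the centralized configuration database (Mimic or
later) and an existing pool named ``mypool`` (a hypothetical name; substitute
your own pool):

.. prompt:: bash

   ceph config set global osd_pool_default_size 2
   ceph config set global osd_pool_default_min_size 2
   ceph osd pool set mypool size 2

The first two commands affect only pools created afterwards; the last command
changes the replica count of the existing pool.
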

Pool Size = 1
-------------

If you have ``osd_pool_default_size`` set to ``1``, you will have only one copy
of each object. OSDs rely on other OSDs to tell them which objects they should
have. If one OSD has a copy of an object and there is no second copy, then
there is no second OSD to tell the first OSD that it should have that copy. For
each placement group mapped to the first OSD (see ``ceph pg dump``), you can
force the first OSD to notice the placement groups it needs by running a
command of the following form:

.. prompt:: bash

   ceph osd force-create-pg <pgid>

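To find the placement groups mapped to a given OSD without scanning the full
``ceph pg dump`` output, you can list them directly (a sketch, assuming the
OSD in question is ``osd.0``); each reported PG ID can then be passed to
``ceph osd force-create-pg``:

.. prompt:: bash

   ceph pg ls-by-osd 0
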
+ "recovery_state": [ + { "name": "Started\/Primary\/Peering\/GetInfo", + "enter_time": "2012-03-06 14:40:16.169679", + "requested_info_from": []}, + { "name": "Started\/Primary\/Peering", + "enter_time": "2012-03-06 14:40:16.169659", + "probing_osds": [ + 0, + 1], + "blocked": "peering is blocked due to down osds", + "down_osds_we_would_probe": [ + 1], + "peering_blocked_by": [ + { "osd": 1, + "current_lost_at": 0, + "comment": "starting or marking this osd lost may let us proceed"}]}, + { "name": "Started", + "enter_time": "2012-03-06 14:40:16.169513"} + ] + } + +The ``recovery_state`` section tells us that peering is blocked due to down +``ceph-osd`` daemons, specifically ``osd.1``. In this case, we can start that +particular ``ceph-osd`` and recovery will proceed. + +Alternatively, if there is a catastrophic failure of ``osd.1`` (for example, if +there has been a disk failure), the cluster can be informed that the OSD is +``lost`` and the cluster can be instructed that it must cope as best it can. + +.. important:: Informing the cluster that an OSD has been lost is dangerous + because the cluster cannot guarantee that the other copies of the data are + consistent and up to date. + +To report an OSD ``lost`` and to instruct Ceph to continue to attempt recovery +anyway, run a command of the following form: + +.. prompt:: bash + + ceph osd lost 1 + +Recovery will proceed. + + +.. _failures-osd-unfound: + +Unfound Objects +=============== + +Under certain combinations of failures, Ceph may complain about ``unfound`` +objects, as in this example: + +.. prompt:: bash + + ceph health detail + +:: + + HEALTH_WARN 1 pgs degraded; 78/3778 unfound (2.065%) + pg 2.4 is active+degraded, 78 unfound + +This means that the storage cluster knows that some objects (or newer copies of +existing objects) exist, but it hasn't found copies of them. Here is an +example of how this might come about for a PG whose data is on two OSDS, which +we will call "1" and "2": + +* 1 goes down +* 2 handles some writes, alone +* 1 comes up +* 1 and 2 re-peer, and the objects missing on 1 are queued for recovery. +* Before the new objects are copied, 2 goes down. + +At this point, 1 knows that these objects exist, but there is no live +``ceph-osd`` that has a copy of the objects. In this case, IO to those objects +will block, and the cluster will hope that the failed node comes back soon. +This is assumed to be preferable to returning an IO error to the user. + +.. note:: The situation described immediately above is one reason that setting + ``size=2`` on a replicated pool and ``m=1`` on an erasure coded pool risks + data loss. + +Identify which objects are unfound by running a command of the following form: + +.. prompt:: bash + + ceph pg 2.4 list_unfound [starting offset, in json] + +.. code-block:: javascript + + { + "num_missing": 1, + "num_unfound": 1, + "objects": [ + { + "oid": { + "oid": "object", + "key": "", + "snapid": -2, + "hash": 2249616407, + "max": 0, + "pool": 2, + "namespace": "" + }, + "need": "43'251", + "have": "0'0", + "flags": "none", + "clean_regions": "clean_offsets: [], clean_omap: 0, new_object: 1", + "locations": [ + "0(3)", + "4(2)" + ] + } + ], + "state": "NotRecovering", + "available_might_have_unfound": true, + "might_have_unfound": [ + { + "osd": "2(4)", + "status": "osd is down" + } + ], + "more": false + } + +If there are too many objects to list in a single result, the ``more`` field +will be true and you can query for more. 
Alternatively, if there is a catastrophic failure of ``osd.1`` (for example, if
there has been a disk failure), the cluster can be informed that the OSD is
``lost`` and the cluster can be instructed that it must cope as best it can.

.. important:: Informing the cluster that an OSD has been lost is dangerous
   because the cluster cannot guarantee that the other copies of the data are
   consistent and up to date.

To report an OSD ``lost`` and to instruct Ceph to continue to attempt recovery
anyway, run a command of the following form:

.. prompt:: bash

   ceph osd lost 1

Recovery will proceed.


.. _failures-osd-unfound:

Unfound Objects
===============

Under certain combinations of failures, Ceph may complain about ``unfound``
objects, as in this example:

.. prompt:: bash

   ceph health detail

::

   HEALTH_WARN 1 pgs degraded; 78/3778 unfound (2.065%)
   pg 2.4 is active+degraded, 78 unfound

This means that the storage cluster knows that some objects (or newer copies of
existing objects) exist, but it hasn't found copies of them. Here is an
example of how this might come about for a PG whose data is on two OSDs, which
we will call "1" and "2":

* 1 goes down
* 2 handles some writes, alone
* 1 comes up
* 1 and 2 re-peer, and the objects missing on 1 are queued for recovery.
* Before the new objects are copied, 2 goes down.

At this point, 1 knows that these objects exist, but there is no live
``ceph-osd`` that has a copy of the objects. In this case, IO to those objects
will block, and the cluster will hope that the failed node comes back soon.
This is assumed to be preferable to returning an IO error to the user.

.. note:: The situation described immediately above is one reason that setting
   ``size=2`` on a replicated pool and ``m=1`` on an erasure coded pool risks
   data loss.

Identify which objects are unfound by running a command of the following form:

.. prompt:: bash

   ceph pg 2.4 list_unfound [starting offset, in json]

.. code-block:: javascript

   {
       "num_missing": 1,
       "num_unfound": 1,
       "objects": [
           {
               "oid": {
                   "oid": "object",
                   "key": "",
                   "snapid": -2,
                   "hash": 2249616407,
                   "max": 0,
                   "pool": 2,
                   "namespace": ""
               },
               "need": "43'251",
               "have": "0'0",
               "flags": "none",
               "clean_regions": "clean_offsets: [], clean_omap: 0, new_object: 1",
               "locations": [
                   "0(3)",
                   "4(2)"
               ]
           }
       ],
       "state": "NotRecovering",
       "available_might_have_unfound": true,
       "might_have_unfound": [
           {
               "osd": "2(4)",
               "status": "osd is down"
           }
       ],
       "more": false
   }

If there are too many objects to list in a single result, the ``more`` field
will be true and you can query for more. (Eventually the command-line tool
will hide this from you, but not yet.)

Now you can identify which OSDs have been probed or might contain data.

At the end of the listing (before ``more: false``), ``might_have_unfound`` is
provided when ``available_might_have_unfound`` is true. This is equivalent to
the output of ``ceph pg #.# query`` and eliminates the need to run ``query``
directly. The ``might_have_unfound`` information behaves the same way as the
output of ``query``, which is described below. The only difference is that
OSDs that have the status of ``already probed`` are ignored.

Use of ``query``:

.. prompt:: bash

   ceph pg 2.4 query

.. code-block:: javascript

   "recovery_state": [
        { "name": "Started\/Primary\/Active",
          "enter_time": "2012-03-06 15:15:46.713212",
          "might_have_unfound": [
                { "osd": 1,
                  "status": "osd is down"}]},

In this case, the cluster knows that ``osd.1`` might have data, but it is
``down``. Here is the full range of possible states:

* already probed
* querying
* OSD is down
* not queried (yet)

Sometimes it simply takes some time for the cluster to query possible
locations.

It is possible that there are other locations where the object might exist that
are not listed. For example: if an OSD is stopped and taken out of the cluster
and then the cluster fully recovers, and then through a subsequent set of
failures the cluster ends up with an unfound object, the cluster will ignore
the removed OSD. (This scenario, however, is unlikely.)

If all possible locations have been queried and objects are still lost, you may
have to give up on the lost objects. This, again, is possible only when unusual
combinations of failures have occurred that allow the cluster to learn about
writes that were performed before the writes themselves were recovered. To
mark the "unfound" objects as "lost", run a command of the following form:

.. prompt:: bash

   ceph pg 2.5 mark_unfound_lost revert|delete

Here the final argument (``revert|delete``) specifies how the cluster should
deal with lost objects.

The ``delete`` option will cause the cluster to forget about them entirely.

The ``revert`` option (which is not available for erasure coded pools) will
either roll back to a previous version of the object or (if it was a new
object) forget about the object entirely. Use ``revert`` with caution, as it
may confuse applications that expect the object to exist.

Homeless Placement Groups
=========================

It is possible that every OSD that has copies of a given placement group fails.
If this happens, then the subset of the object store that contains those
placement groups becomes unavailable and the monitor will receive no status
updates for those placement groups. The monitor marks as ``stale`` any
placement group whose primary OSD has failed. For example:

.. prompt:: bash

   ceph health

::

   HEALTH_WARN 24 pgs stale; 3/300 in osds are down

Identify which placement groups are ``stale`` and which were the last OSDs to
store the ``stale`` placement groups by running the following command:

.. prompt:: bash

   ceph health detail

::

   HEALTH_WARN 24 pgs stale; 3/300 in osds are down
   ...
   pg 2.5 is stuck stale+active+remapped, last acting [2,0]
   ...
   osd.10 is down since epoch 23, last address 192.168.106.220:6800/11080
   osd.11 is down since epoch 13, last address 192.168.106.220:6803/11539
   osd.12 is down since epoch 24, last address 192.168.106.220:6806/11861

This output indicates that placement group 2.5 (``pg 2.5``) was last managed by
``osd.0`` and ``osd.2``. Restart those OSDs to allow the cluster to recover
that placement group.

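If you are not sure which OSDs are down across the cluster, you can list only
the down OSDs together with their position in the CRUSH hierarchy. A quick
sketch (the ``down`` state filter is available in recent releases):

.. prompt:: bash

   ceph osd tree down
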

Only a Few OSDs Receive Data
============================

If only a few of the nodes in the cluster are receiving data, check the number
of placement groups in the pool as instructed in the :ref:`Placement Groups
<rados_ops_pgs_get_pg_num>` documentation. Placement groups are distributed
across the OSDs of the cluster: when the number of placement groups in a pool
is small relative to the number of OSDs, the placement groups may not be
distributed across all OSDs, and some OSDs will receive little or no data. In
situations like this, create a pool with a placement group count that is a
multiple of the number of OSDs. See `Placement Groups`_ for details. See the
:ref:`Pool, PG, and CRUSH Config Reference <rados_config_pool_pg_crush_ref>`
for instructions on changing the default values used to determine how many
placement groups are assigned to each pool.

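To check and, if necessary, raise the placement group count of an existing
pool, commands of the following form can be used (a sketch, assuming a
hypothetical pool named ``mypool`` and a target of ``128`` placement groups;
on recent releases ``pgp_num`` follows ``pg_num`` automatically, while older
releases require setting it separately):

.. prompt:: bash

   ceph osd pool get mypool pg_num
   ceph osd pool set mypool pg_num 128
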

Can't Write Data
================

If the cluster is up, but some OSDs are down and you cannot write data, make
sure that you have the minimum number of OSDs running in the pool. If you don't
have the minimum number of OSDs running in the pool, Ceph will not allow you to
write data to it because there is no guarantee that Ceph can replicate your
data. See ``osd_pool_default_min_size`` in the :ref:`Pool, PG, and CRUSH
Config Reference <rados_config_pool_pg_crush_ref>` for details.

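To see the effective minimum for a specific pool, and to lower it temporarily
so that writes can proceed while OSDs are being recovered, commands of the
following form can be used (a sketch with a hypothetical pool name ``mypool``;
a ``min_size`` of ``1`` allows writes with a single surviving replica and
increases the risk of data loss, so restore the original value once the OSDs
are back):

.. prompt:: bash

   ceph osd pool get mypool min_size
   ceph osd pool set mypool min_size 1
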

PGs Inconsistent
================

If the command ``ceph health detail`` returns an ``active + clean +
inconsistent`` state, this might indicate an error during scrubbing. Identify
the inconsistent placement group or placement groups by running the following
command:

.. prompt:: bash

   ceph health detail

::

   HEALTH_ERR 1 pgs inconsistent; 2 scrub errors
   pg 0.6 is active+clean+inconsistent, acting [0,1,2]
   2 scrub errors

Alternatively, run this command if you prefer to inspect the output in a
programmatic way:

.. prompt:: bash

   rados list-inconsistent-pg rbd

::

   ["0.6"]

There is only one consistent state, but in the worst case, we could find
different kinds of inconsistencies in more than one object. If an object named
``foo`` in PG ``0.6`` is truncated, the output of ``rados list-inconsistent-obj
0.6`` will look something like this:

.. prompt:: bash

   rados list-inconsistent-obj 0.6 --format=json-pretty

.. code-block:: javascript

   {
       "epoch": 14,
       "inconsistents": [
           {
               "object": {
                   "name": "foo",
                   "nspace": "",
                   "locator": "",
                   "snap": "head",
                   "version": 1
               },
               "errors": [
                   "data_digest_mismatch",
                   "size_mismatch"
               ],
               "union_shard_errors": [
                   "data_digest_mismatch_info",
                   "size_mismatch_info"
               ],
               "selected_object_info": "0:602f83fe:::foo:head(16'1 client.4110.0:1 dirty|data_digest|omap_digest s 968 uv 1 dd e978e67f od ffffffff alloc_hint [0 0 0])",
               "shards": [
                   {
                       "osd": 0,
                       "errors": [],
                       "size": 968,
                       "omap_digest": "0xffffffff",
                       "data_digest": "0xe978e67f"
                   },
                   {
                       "osd": 1,
                       "errors": [],
                       "size": 968,
                       "omap_digest": "0xffffffff",
                       "data_digest": "0xe978e67f"
                   },
                   {
                       "osd": 2,
                       "errors": [
                           "data_digest_mismatch_info",
                           "size_mismatch_info"
                       ],
                       "size": 0,
                       "omap_digest": "0xffffffff",
                       "data_digest": "0xffffffff"
                   }
               ]
           }
       ]
   }

In this case, the output indicates the following:

* The only inconsistent object is named ``foo``, and its head has
  inconsistencies.
* The inconsistencies fall into two categories:

  #. ``errors``: these errors indicate inconsistencies between shards, without
     an indication of which shard(s) are bad. Check for the ``errors`` in the
     ``shards`` array, if available, to pinpoint the problem.

     * ``data_digest_mismatch``: the digest of the replica read from ``OSD.2``
       is different from the digests of the replicas read from ``OSD.0`` and
       ``OSD.1``.
     * ``size_mismatch``: the size of the replica read from ``OSD.2`` is ``0``,
       but the size reported by ``OSD.0`` and ``OSD.1`` is ``968``.

  #. ``union_shard_errors``: the union of all shard-specific ``errors`` in the
     ``shards`` array. The ``errors`` are set for the shard with the problem.
     These errors include ``read_error`` and other similar errors. The errors
     ending in ``_info`` indicate a comparison with ``selected_object_info``.
     Examine the ``shards`` array to determine which shard has which error or
     errors.

     * ``data_digest_mismatch_info``: the digest stored in the ``object-info``
       does not match the digest (``0xffffffff``) calculated from the shard
       read from ``OSD.2``.
     * ``size_mismatch_info``: the size stored in the ``object-info`` is
       different from the size read from ``OSD.2``. The latter is ``0``.

.. warning:: If ``read_error`` is listed in a shard's ``errors`` attribute, the
   inconsistency is likely due to physical storage errors. In cases like this,
   check the storage used by that OSD.

   Examine the output of ``dmesg`` and ``smartctl`` before attempting a drive
   repair.

To repair the inconsistent placement group, run a command of the following
form:

.. prompt:: bash

   ceph pg repair {placement-group-ID}

.. warning:: This command overwrites the "bad" copies with "authoritative"
   copies. In most cases, Ceph is able to choose authoritative copies from all
   the available replicas by using some predefined criteria. This, however,
   does not work in every case. For example, it might be the case that the
   stored data digest is missing, which means that the calculated digest is
   ignored when Ceph chooses the authoritative copies. Be aware of this, and
   use the above command with caution.


If you receive ``active + clean + inconsistent`` states periodically due to
clock skew, consider configuring the `NTP
<https://en.wikipedia.org/wiki/Network_Time_Protocol>`_ daemons on your monitor
hosts to act as peers. See `The Network Time Protocol <http://www.ntp.org>`_
and Ceph :ref:`Clock Settings <mon-config-ref-clock>` for more information.


Erasure Coded PGs are not active+clean
======================================

If CRUSH fails to find enough OSDs to map to a PG, it will report the missing
OSD as ``2147483647``, which is ``ITEM_NONE`` or ``no OSD found``. For
example::

   [2,1,6,0,5,8,2147483647,7,4]

Not enough OSDs
---------------

If the Ceph cluster has only eight OSDs and an erasure coded pool needs nine
OSDs, the cluster will not be able to map the pool's PGs completely. In this
case, you can either create another erasure coded pool that requires fewer
OSDs, by running commands of the following form:

.. prompt:: bash

   ceph osd erasure-code-profile set myprofile k=5 m=3
   ceph osd pool create erasurepool erasure myprofile

or add new OSDs, and the PG will automatically use them.

CRUSH constraints cannot be satisfied
-------------------------------------

If the cluster has enough OSDs, it is possible that the CRUSH rule is imposing
constraints that cannot be satisfied. If there are ten OSDs on two hosts and
the CRUSH rule requires that no two OSDs from the same host are used in the
same PG, the mapping may fail because only two OSDs will be found. Check the
constraint by displaying ("dumping") the rule, as shown here:

.. prompt:: bash

   ceph osd crush rule ls

::

   [
       "replicated_rule",
       "erasurepool"]
   $ ceph osd crush rule dump erasurepool
   { "rule_id": 1,
     "rule_name": "erasurepool",
     "type": 3,
     "steps": [
           { "op": "take",
             "item": -1,
             "item_name": "default"},
           { "op": "chooseleaf_indep",
             "num": 0,
             "type": "host"},
           { "op": "emit"}]}


Resolve this problem by creating a new pool in which PGs are allowed to have
OSDs residing on the same host by running the following commands:

.. prompt:: bash

   ceph osd erasure-code-profile set myprofile crush-failure-domain=osd
   ceph osd pool create erasurepool erasure myprofile

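Once the new pool exists, you can verify that its placement groups reach
``active+clean`` (a sketch, assuming the pool name ``erasurepool`` used in the
examples above):

.. prompt:: bash

   ceph pg ls-by-pool erasurepool
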

CRUSH gives up too soon
-----------------------

If the Ceph cluster has just enough OSDs to map the PG (for instance a cluster
with a total of nine OSDs and an erasure coded pool that requires nine OSDs per
PG), it is possible that CRUSH gives up before finding a mapping. This problem
can be resolved by:

* lowering the erasure coded pool requirements to use fewer OSDs per PG (this
  requires the creation of another pool, because erasure code profiles cannot
  be modified dynamically).

* adding more OSDs to the cluster (this does not require the erasure coded pool
  to be modified, because it will become clean automatically).

* using a handmade CRUSH rule that tries more times to find a good mapping.
  This can be done by setting ``set_choose_tries`` to a value greater than the
  default.

First, verify the problem by using ``crushtool`` after extracting the crushmap
from the cluster. This ensures that your experiments do not modify the Ceph
cluster and that they operate only on local files:

.. prompt:: bash

   ceph osd crush rule dump erasurepool

::

   { "rule_id": 1,
     "rule_name": "erasurepool",
     "type": 3,
     "steps": [
           { "op": "take",
             "item": -1,
             "item_name": "default"},
           { "op": "chooseleaf_indep",
             "num": 0,
             "type": "host"},
           { "op": "emit"}]}
   $ ceph osd getcrushmap > crush.map
   got crush map from osdmap epoch 13
   $ crushtool -i crush.map --test --show-bad-mappings \
      --rule 1 \
      --num-rep 9 \
      --min-x 1 --max-x $((1024 * 1024))
   bad mapping rule 8 x 43 num_rep 9 result [3,2,7,1,2147483647,8,5,6,0]
   bad mapping rule 8 x 79 num_rep 9 result [6,0,2,1,4,7,2147483647,5,8]
   bad mapping rule 8 x 173 num_rep 9 result [0,4,6,8,2,1,3,7,2147483647]

Here, ``--num-rep`` is the number of OSDs that the erasure code CRUSH rule
needs, and ``--rule`` is the value of the ``rule_id`` field that was displayed
by ``ceph osd crush rule dump``. This test will attempt to map one million
values (in this example, the range defined by ``[--min-x,--max-x]``) and should
display at least one bad mapping if the rule is the cause of the problem. If
this test outputs nothing, all mappings have been successful and you can be
assured that the problem with your cluster is not caused by bad mappings.

Changing the value of set_choose_tries
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

#. Decompile the CRUSH map to edit the CRUSH rule by running the following
   command:

   .. prompt:: bash

      crushtool --decompile crush.map > crush.txt

#. Add the following line to the rule::

      step set_choose_tries 100

   The relevant part of the ``crush.txt`` file will resemble this::

      rule erasurepool {
              id 1
              type erasure
              step set_chooseleaf_tries 5
              step set_choose_tries 100
              step take default
              step chooseleaf indep 0 type host
              step emit
      }

#. Recompile and retest the CRUSH rule:

   .. prompt:: bash

      crushtool --compile crush.txt -o better-crush.map

#. When all mappings succeed, display a histogram of the number of tries that
   were necessary to find all of the mappings by using the
   ``--show-choose-tries`` option of the ``crushtool`` command, as in the
   following example:

   .. prompt:: bash

      crushtool -i better-crush.map --test --show-bad-mappings \
         --show-choose-tries \
         --rule 1 \
         --num-rep 9 \
         --min-x 1 --max-x $((1024 * 1024))

   ::

      ...
      11: 42
      12: 44
      13: 54
      14: 45
      15: 35
      16: 34
      17: 30
      18: 25
      19: 19
      20: 22
      21: 20
      22: 17
      23: 13
      24: 16
      25: 13
      26: 11
      27: 11
      28: 13
      29: 11
      30: 10
      31: 6
      32: 5
      33: 10
      34: 3
      35: 7
      36: 5
      37: 2
      38: 5
      39: 5
      40: 2
      41: 5
      42: 4
      43: 1
      44: 2
      45: 2
      46: 3
      47: 1
      48: 0
      ...
      102: 0
      103: 1
      104: 0
      ...

   This output indicates that it took eleven tries to map forty-two PGs, twelve
   tries to map forty-four PGs, and so on. The highest number of tries is the
   minimum value of ``set_choose_tries`` that prevents bad mappings (for
   example, ``103`` in the above output, because it did not take more than 103
   tries for any PG to be mapped).

.. _check: ../../operations/placement-groups#get-the-number-of-placement-groups
.. _Placement Groups: ../../operations/placement-groups
.. _Pool, PG and CRUSH Config Reference: ../../configuration/pool-pg-config-ref
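
If the retested map no longer produces bad mappings and you decide to apply it
to the running cluster, you can inject it with a command of the following form
(a sketch, not part of the local-file test workflow above; review the edited
map carefully first, because this replaces the cluster's CRUSH map):

.. prompt:: bash

   ceph osd setcrushmap -i better-crush.map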