Diffstat (limited to 'doc/rados')
-rw-r--r-- | doc/rados/api/librados-intro.rst                   |   2
-rw-r--r-- | doc/rados/api/python.rst                           |   2
-rw-r--r-- | doc/rados/configuration/bluestore-config-ref.rst   |   2
-rw-r--r-- | doc/rados/configuration/common.rst                 |   7
-rw-r--r-- | doc/rados/configuration/osd-config-ref.rst         |   2
-rw-r--r-- | doc/rados/configuration/pool-pg-config-ref.rst     |  41
-rw-r--r-- | doc/rados/operations/add-or-rm-mons.rst            | 151
-rw-r--r-- | doc/rados/operations/control.rst                   |  10
-rw-r--r-- | doc/rados/operations/crush-map.rst                 |  49
-rw-r--r-- | doc/rados/operations/erasure-code-profile.rst      |   4
-rw-r--r-- | doc/rados/operations/erasure-code.rst              | 178
-rw-r--r-- | doc/rados/operations/index.rst                     |   1
-rw-r--r-- | doc/rados/operations/monitoring.rst                |   2
-rw-r--r-- | doc/rados/operations/pgcalc/index.rst              |  68
-rw-r--r-- | doc/rados/operations/placement-groups.rst          |  32
-rw-r--r-- | doc/rados/operations/pools.rst                     |  41
-rw-r--r-- | doc/rados/operations/stretch-mode.rst              |  10
-rw-r--r-- | doc/rados/troubleshooting/log-and-debug.rst        |  24
-rw-r--r-- | doc/rados/troubleshooting/troubleshooting-mon.rst  | 308
19 files changed, 693 insertions, 241 deletions
diff --git a/doc/rados/api/librados-intro.rst b/doc/rados/api/librados-intro.rst index 5174188b4..b863efc9e 100644 --- a/doc/rados/api/librados-intro.rst +++ b/doc/rados/api/librados-intro.rst @@ -1,3 +1,5 @@ +.. _librados-intro: + ========================== Introduction to librados ========================== diff --git a/doc/rados/api/python.rst b/doc/rados/api/python.rst index 346653a3d..60bdfa4da 100644 --- a/doc/rados/api/python.rst +++ b/doc/rados/api/python.rst @@ -1,3 +1,5 @@ +.. _librados-python: + =================== Librados (Python) =================== diff --git a/doc/rados/configuration/bluestore-config-ref.rst b/doc/rados/configuration/bluestore-config-ref.rst index 3707be1aa..4c63c1043 100644 --- a/doc/rados/configuration/bluestore-config-ref.rst +++ b/doc/rados/configuration/bluestore-config-ref.rst @@ -358,7 +358,7 @@ OSD and run the following command: ceph-bluestore-tool \ --path <data path> \ - --sharding="m(3) p(3,0-12) o(3,0-13)=block_cache={type=binned_lru} l p" \ + --sharding="m(3) p(3,0-12) O(3,0-13)=block_cache={type=binned_lru} l p" \ reshard .. confval:: bluestore_rocksdb_cf diff --git a/doc/rados/configuration/common.rst b/doc/rados/configuration/common.rst index 0b373f469..c397f4e52 100644 --- a/doc/rados/configuration/common.rst +++ b/doc/rados/configuration/common.rst @@ -123,11 +123,10 @@ OSD host, run the following commands: ssh {osd-host} sudo mkdir /var/lib/ceph/osd/ceph-{osd-number} -The ``osd_data`` path ought to lead to a mount point that has mounted on it a -device that is distinct from the device that contains the operating system and -the daemons. To use a device distinct from the device that contains the +The ``osd_data`` path must lead to a device that is not shared with the +operating system. To use a device other than the device that contains the operating system and the daemons, prepare it for use with Ceph and mount it on -the directory you just created by running the following commands: +the directory you just created by running commands of the following form: .. prompt:: bash $ diff --git a/doc/rados/configuration/osd-config-ref.rst b/doc/rados/configuration/osd-config-ref.rst index 060121200..5d59cb9f6 100644 --- a/doc/rados/configuration/osd-config-ref.rst +++ b/doc/rados/configuration/osd-config-ref.rst @@ -151,7 +151,7 @@ generates a catalog of all objects in each placement group and compares each primary object to its replicas, ensuring that no objects are missing or mismatched. Light scrubbing checks the object size and attributes, and is usually done daily. Deep scrubbing reads the data and uses checksums to ensure -data integrity, and is usually done weekly. The freqeuncies of both light +data integrity, and is usually done weekly. The frequencies of both light scrubbing and deep scrubbing are determined by the cluster's configuration, which is fully under your control and subject to the settings explained below in this section. diff --git a/doc/rados/configuration/pool-pg-config-ref.rst b/doc/rados/configuration/pool-pg-config-ref.rst index 902c80346..c3a25a3e7 100644 --- a/doc/rados/configuration/pool-pg-config-ref.rst +++ b/doc/rados/configuration/pool-pg-config-ref.rst @@ -6,12 +6,41 @@ .. index:: pools; configuration -Ceph uses default values to determine how many placement groups (PGs) will be -assigned to each pool. We recommend overriding some of the defaults. -Specifically, we recommend setting a pool's replica size and overriding the -default number of placement groups. 
You can set these values when running -`pool`_ commands. You can also override the defaults by adding new ones in the -``[global]`` section of your Ceph configuration file. +The number of placement groups that the CRUSH algorithm assigns to each pool is +determined by the values of variables in the centralized configuration database +in the monitor cluster. + +Both containerized deployments of Ceph (deployments made using ``cephadm`` or +Rook) and non-containerized deployments of Ceph rely on the values in the +central configuration database in the monitor cluster to assign placement +groups to pools. + +Example Commands +---------------- + +To see the value of the variable that governs the number of placement groups in a given pool, run a command of the following form: + +.. prompt:: bash + + ceph config get osd osd_pool_default_pg_num + +To set the value of the variable that governs the number of placement groups in a given pool, run a command of the following form: + +.. prompt:: bash + + ceph config set osd osd_pool_default_pg_num + +Manual Tuning +------------- +In some cases, it might be advisable to override some of the defaults. For +example, you might determine that it is wise to set a pool's replica size and +to override the default number of placement groups in the pool. You can set +these values when running `pool`_ commands. + +See Also +-------- + +See :ref:`pg-autoscaler`. .. literalinclude:: pool-pg.conf diff --git a/doc/rados/operations/add-or-rm-mons.rst b/doc/rados/operations/add-or-rm-mons.rst index 3688bb798..e97c0b94d 100644 --- a/doc/rados/operations/add-or-rm-mons.rst +++ b/doc/rados/operations/add-or-rm-mons.rst @@ -344,12 +344,13 @@ addresses, repeat this process. Changing a Monitor's IP address (Advanced Method) ------------------------------------------------- -There are cases in which the method outlined in :ref"`<Changing a Monitor's IP -Address (Preferred Method)> operations_add_or_rm_mons_changing_mon_ip` cannot -be used. For example, it might be necessary to move the cluster's monitors to a -different network, to a different part of the datacenter, or to a different -datacenter altogether. It is still possible to change the monitors' IP -addresses, but a different method must be used. +There are cases in which the method outlined in +:ref:`operations_add_or_rm_mons_changing_mon_ip` cannot be used. For example, +it might be necessary to move the cluster's monitors to a different network, to +a different part of the datacenter, or to a different datacenter altogether. It +is still possible to change the monitors' IP addresses, but a different method +must be used. + For such cases, a new monitor map with updated IP addresses for every monitor in the cluster must be generated and injected on each monitor. Although this @@ -357,11 +358,11 @@ method is not particularly easy, such a major migration is unlikely to be a routine task. As stated at the beginning of this section, existing monitors are not supposed to change their IP addresses. -Continue with the monitor configuration in the example from :ref"`<Changing a -Monitor's IP Address (Preferred Method)> -operations_add_or_rm_mons_changing_mon_ip` . Suppose that all of the monitors -are to be moved from the ``10.0.0.x`` range to the ``10.1.0.x`` range, and that -these networks are unable to communicate. Carry out the following procedure: +Continue with the monitor configuration in the example from +:ref:`operations_add_or_rm_mons_changing_mon_ip`. 
Suppose that all of the +monitors are to be moved from the ``10.0.0.x`` range to the ``10.1.0.x`` range, +and that these networks are unable to communicate. Carry out the following +procedure: #. Retrieve the monitor map (``{tmp}`` is the path to the retrieved monitor map, and ``{filename}`` is the name of the file that contains the retrieved @@ -448,7 +449,135 @@ and inject the modified monitor map into each new monitor. Migration to the new location is now complete. The monitors should operate successfully. +Using cephadm to change the public network +========================================== + +Overview +-------- + +The procedure in this overview section provides only the broad outlines of +using ``cephadm`` to change the public network. + +#. Create backups of all keyrings, configuration files, and the current monmap. + +#. Stop the cluster and disable ``ceph.target`` to prevent the daemons from + starting. + +#. Move the servers and power them on. + +#. Change the network setup as desired. + + +Example Procedure +----------------- + +.. note:: In this procedure, the "old network" has addresses of the form + ``10.10.10.0/24`` and the "new network" has addresses of the form + ``192.168.160.0/24``. + +#. Enter the shell of the first monitor: + + .. prompt:: bash # + + cephadm shell --name mon.reef1 + +#. Extract the current monmap from ``mon.reef1``: + + .. prompt:: bash # + + ceph-mon -i reef1 --extract-monmap monmap + +#. Print the content of the monmap: + + .. prompt:: bash # + + monmaptool --print monmap + + :: + + monmaptool: monmap file monmap + epoch 5 + fsid 2851404a-d09a-11ee-9aaa-fa163e2de51a + last_changed 2024-02-21T09:32:18.292040+0000 + created 2024-02-21T09:18:27.136371+0000 + min_mon_release 18 (reef) + election_strategy: 1 + 0: [v2:10.10.10.11:3300/0,v1:10.10.10.11:6789/0] mon.reef1 + 1: [v2:10.10.10.12:3300/0,v1:10.10.10.12:6789/0] mon.reef2 + 2: [v2:10.10.10.13:3300/0,v1:10.10.10.13:6789/0] mon.reef3 + +#. Remove monitors with old addresses: + + .. prompt:: bash # + + monmaptool --rm reef1 --rm reef2 --rm reef3 monmap + +#. Add monitors with new addresses: + + .. prompt:: bash # + + monmaptool --addv reef1 [v2:192.168.160.11:3300/0,v1:192.168.160.11:6789/0] --addv reef2 [v2:192.168.160.12:3300/0,v1:192.168.160.12:6789/0] --addv reef3 [v2:192.168.160.13:3300/0,v1:192.168.160.13:6789/0] monmap + +#. Verify that the changes to the monmap have been made successfully: + + .. prompt:: bash # + + monmaptool --print monmap + + :: + + monmaptool: monmap file monmap + epoch 4 + fsid 2851404a-d09a-11ee-9aaa-fa163e2de51a + last_changed 2024-02-21T09:32:18.292040+0000 + created 2024-02-21T09:18:27.136371+0000 + min_mon_release 18 (reef) + election_strategy: 1 + 0: [v2:192.168.160.11:3300/0,v1:192.168.160.11:6789/0] mon.reef1 + 1: [v2:192.168.160.12:3300/0,v1:192.168.160.12:6789/0] mon.reef2 + 2: [v2:192.168.160.13:3300/0,v1:192.168.160.13:6789/0] mon.reef3 + +#. Inject the new monmap into the Ceph cluster: + + .. prompt:: bash # + + ceph-mon -i reef1 --inject-monmap monmap + +#. Repeat the steps above for all other monitors in the cluster. + +#. Update ``/var/lib/ceph/{FSID}/mon.{MON}/config``. + +#. Start the monitors. + +#. Update the ceph ``public_network``: + + .. prompt:: bash # + + ceph config set mon public_network 192.168.160.0/24 + +#. Update the configuration files of the managers + (``/var/lib/ceph/{FSID}/mgr.{mgr}/config``) and start them. 
Orchestrator + will now be available, but it will attempt to connect to the old network + because the host list contains the old addresses. + +#. Update the host addresses by running commands of the following form: + + .. prompt:: bash # + + ceph orch host set-addr reef1 192.168.160.11 + ceph orch host set-addr reef2 192.168.160.12 + ceph orch host set-addr reef3 192.168.160.13 + +#. Wait a few minutes for the orchestrator to connect to each host. + +#. Reconfigure the OSDs so that their config files are automatically updated: + + .. prompt:: bash # + + ceph orch reconfig osd +*The above procedure was developed by Eugen Block and was successfully tested +in February 2024 on Ceph version 18.2.1 (Reef).* .. _Manual Deployment: ../../../install/manual-deployment .. _Monitor Bootstrap: ../../../dev/mon-bootstrap diff --git a/doc/rados/operations/control.rst b/doc/rados/operations/control.rst index 033f831cd..32d043f1f 100644 --- a/doc/rados/operations/control.rst +++ b/doc/rados/operations/control.rst @@ -474,27 +474,25 @@ following command: ceph tell mds.{mds-id} config set {setting} {value} -Example: +Example: to enable debug messages, run the following command: .. prompt:: bash $ ceph tell mds.0 config set debug_ms 1 -To enable debug messages, run the following command: +To display the status of all metadata servers, run the following command: .. prompt:: bash $ ceph mds stat -To display the status of all metadata servers, run the following command: +To mark the active metadata server as failed (and to trigger failover to a +standby if a standby is present), run the following command: .. prompt:: bash $ ceph mds fail 0 -To mark the active metadata server as failed (and to trigger failover to a -standby if a standby is present), run the following command: - .. todo:: ``ceph mds`` subcommands missing docs: set, dump, getmap, stop, setmap diff --git a/doc/rados/operations/crush-map.rst b/doc/rados/operations/crush-map.rst index 39151e6d4..18f4dcb6d 100644 --- a/doc/rados/operations/crush-map.rst +++ b/doc/rados/operations/crush-map.rst @@ -57,53 +57,62 @@ case for most clusters), its CRUSH location can be specified as follows:: ``pod``, ``pdu``, ``rack``, ``chassis``, and ``host``. These defined types suffice for nearly all clusters, but can be customized by modifying the CRUSH map. - #. Not all keys need to be specified. For example, by default, Ceph - automatically sets an ``OSD``'s location as ``root=default - host=HOSTNAME`` (as determined by the output of ``hostname -s``). -The CRUSH location for an OSD can be modified by adding the ``crush location`` -option in ``ceph.conf``. When this option has been added, every time the OSD +The CRUSH location for an OSD can be set by adding the ``crush_location`` +option in ``ceph.conf``, example: + + crush_location = root=default row=a rack=a2 chassis=a2a host=a2a1 + +When this option has been added, every time the OSD starts it verifies that it is in the correct location in the CRUSH map and moves itself if it is not. To disable this automatic CRUSH map management, add the following to the ``ceph.conf`` configuration file in the ``[osd]`` section:: - osd crush update on start = false + osd_crush_update_on_start = false Note that this action is unnecessary in most cases. +If the ``crush_location`` is not set explicitly, +a default of ``root=default host=HOSTNAME`` is used for ``OSD``s, +where the hostname is determined by the output of the ``hostname -s`` command. + +.. 
note:: If you switch from this default to an explicitly set ``crush_location``, + do not forget to include ``root=default`` because existing CRUSH rules refer to it. Custom location hooks --------------------- -A custom location hook can be used to generate a more complete CRUSH location -on startup. The CRUSH location is determined by, in order of preference: +A custom location hook can be used to generate a more complete CRUSH location, +on startup. + +This is useful when some location fields are not known at the time +``ceph.conf`` is written (for example, fields ``rack`` or ``datacenter`` +when deploying a single configuration across multiple datacenters). -#. A ``crush location`` option in ``ceph.conf`` -#. A default of ``root=default host=HOSTNAME`` where the hostname is determined - by the output of the ``hostname -s`` command +If configured, executed, and parsed successfully, the hook's output replaces +any previously set CRUSH location. -A script can be written to provide additional location fields (for example, -``rack`` or ``datacenter``) and the hook can be enabled via the following -config option:: +The hook hook can be enabled in ``ceph.conf`` by providing a path to an +executable file (often a script), example:: - crush location hook = /path/to/customized-ceph-crush-location + crush_location_hook = /path/to/customized-ceph-crush-location This hook is passed several arguments (see below). The hook outputs a single -line to ``stdout`` that contains the CRUSH location description. The output -resembles the following::: +line to ``stdout`` that contains the CRUSH location description. The arguments +resemble the following::: --cluster CLUSTER --id ID --type TYPE Here the cluster name is typically ``ceph``, the ``id`` is the daemon identifier or (in the case of OSDs) the OSD number, and the daemon type is -``osd``, ``mds, ``mgr``, or ``mon``. +``osd``, ``mds``, ``mgr``, or ``mon``. For example, a simple hook that specifies a rack location via a value in the -file ``/etc/rack`` might be as follows:: +file ``/etc/rack`` (assuming it contains no spaces) might be as follows:: #!/bin/sh - echo "host=$(hostname -s) rack=$(cat /etc/rack) root=default" + echo "root=default rack=$(cat /etc/rack) host=$(hostname -s)" CRUSH structure diff --git a/doc/rados/operations/erasure-code-profile.rst b/doc/rados/operations/erasure-code-profile.rst index 947b34c1f..a8f006398 100644 --- a/doc/rados/operations/erasure-code-profile.rst +++ b/doc/rados/operations/erasure-code-profile.rst @@ -96,7 +96,9 @@ Where: ``--force`` :Description: Override an existing profile by the same name, and allow - setting a non-4K-aligned stripe_unit. + setting a non-4K-aligned stripe_unit. Overriding an existing + profile can be dangerous, and thus ``--yes-i-really-mean-it`` + must be used as well. :Type: String :Required: No. diff --git a/doc/rados/operations/erasure-code.rst b/doc/rados/operations/erasure-code.rst index e2bd3c296..e53f348cd 100644 --- a/doc/rados/operations/erasure-code.rst +++ b/doc/rados/operations/erasure-code.rst @@ -179,6 +179,8 @@ This can be enabled only on a pool residing on BlueStore OSDs, since BlueStore's checksumming is used during deep scrubs to detect bitrot or other corruption. Using Filestore with EC overwrites is not only unsafe, but it also results in lower performance compared to BlueStore. +Moreover, Filestore is deprecated and any Filestore OSDs in your cluster +should be migrated to BlueStore. 
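As a minimal sketch of the overwrite-enablement workflow described above, assuming the pool resides on BlueStore OSDs (the profile name ``myprofile``, the pool name ``ecpool``, and the values ``k=4 m=2`` are illustrative placeholders; substitute names and values appropriate to your cluster's failure domains)::

    ceph osd erasure-code-profile set myprofile k=4 m=2 crush-failure-domain=host
    ceph osd pool create ecpool erasure myprofile
    ceph osd pool set ecpool allow_ec_overwrites true

Enabling ``allow_ec_overwrites`` in this way is what allows RBD images and CephFS data to be stored directly in an erasure-coded data pool, as described in the next paragraph.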
Erasure-coded pools do not support omap, so to use them with RBD and CephFS you must instruct them to store their data in an EC pool and @@ -192,6 +194,182 @@ erasure-coded pool as the ``--data-pool`` during image creation: For CephFS, an erasure-coded pool can be set as the default data pool during file system creation or via `file layouts <../../../cephfs/file-layouts>`_. +Erasure-coded pool overhead +--------------------------- + +The overhead factor (space amplification) of an erasure-coded pool +is `(k+m) / k`. For a 4,2 profile, the overhead is +thus 1.5, which means that 1.5 GiB of underlying storage are used to store +1 GiB of user data. Contrast with default three-way replication, with +which the overhead factor is 3.0. Do not mistake erasure coding for a free +lunch: there is a significant performance tradeoff, especially when using HDDs +and when performing cluster recovery or backfill. + +Below is a table showing the overhead factors for various values of `k` and `m`. +As `m` increases above 2, the incremental capacity overhead gain quickly +experiences diminishing returns but the performance impact grows proportionally. +We recommend that you do not choose a profile with `k` > 4 or `m` > 2 until +and unless you fully understand the ramifications, including the number of +failure domains your cluster topology must contain. If you choose `m=1`, +expect data unavailability during maintenance and data loss if component +failures overlap. + +.. list-table:: Erasure coding overhead + :widths: 4 4 4 4 4 4 4 4 4 4 4 4 + :header-rows: 1 + :stub-columns: 1 + + * - + - m=1 + - m=2 + - m=3 + - m=4 + - m=4 + - m=6 + - m=7 + - m=8 + - m=9 + - m=10 + - m=11 + * - k=1 + - 2.00 + - 3.00 + - 4.00 + - 5.00 + - 6.00 + - 7.00 + - 8.00 + - 9.00 + - 10.00 + - 11.00 + - 12.00 + * - k=2 + - 1.50 + - 2.00 + - 2.50 + - 3.00 + - 3.50 + - 4.00 + - 4.50 + - 5.00 + - 5.50 + - 6.00 + - 6.50 + * - k=3 + - 1.33 + - 1.67 + - 2.00 + - 2.33 + - 2.67 + - 3.00 + - 3.33 + - 3.67 + - 4.00 + - 4.33 + - 4.67 + * - k=4 + - 1.25 + - 1.50 + - 1.75 + - 2.00 + - 2.25 + - 2.50 + - 2.75 + - 3.00 + - 3.25 + - 3.50 + - 3.75 + * - k=5 + - 1.20 + - 1.40 + - 1.60 + - 1.80 + - 2.00 + - 2.20 + - 2.40 + - 2.60 + - 2.80 + - 3.00 + - 3.20 + * - k=6 + - 1.16 + - 1.33 + - 1.50 + - 1.66 + - 1.83 + - 2.00 + - 2.17 + - 2.33 + - 2.50 + - 2.66 + - 2.83 + * - k=7 + - 1.14 + - 1.29 + - 1.43 + - 1.58 + - 1.71 + - 1.86 + - 2.00 + - 2.14 + - 2.29 + - 2.43 + - 2.58 + * - k=8 + - 1.13 + - 1.25 + - 1.38 + - 1.50 + - 1.63 + - 1.75 + - 1.88 + - 2.00 + - 2.13 + - 2.25 + - 2.38 + * - k=9 + - 1.11 + - 1.22 + - 1.33 + - 1.44 + - 1.56 + - 1.67 + - 1.78 + - 1.88 + - 2.00 + - 2.11 + - 2.22 + * - k=10 + - 1.10 + - 1.20 + - 1.30 + - 1.40 + - 1.50 + - 1.60 + - 1.70 + - 1.80 + - 1.90 + - 2.00 + - 2.10 + * - k=11 + - 1.09 + - 1.18 + - 1.27 + - 1.36 + - 1.45 + - 1.54 + - 1.63 + - 1.72 + - 1.82 + - 1.91 + - 2.00 + + + + + + + Erasure-coded pools and cache tiering ------------------------------------- diff --git a/doc/rados/operations/index.rst b/doc/rados/operations/index.rst index 15525c1d3..91301382d 100644 --- a/doc/rados/operations/index.rst +++ b/doc/rados/operations/index.rst @@ -21,6 +21,7 @@ and, monitoring an operating cluster. monitoring-osd-pg user-management pg-repair + pgcalc/index .. 
raw:: html diff --git a/doc/rados/operations/monitoring.rst b/doc/rados/operations/monitoring.rst index a9171f2d8..d3a258f76 100644 --- a/doc/rados/operations/monitoring.rst +++ b/doc/rados/operations/monitoring.rst @@ -517,6 +517,8 @@ multiple monitors are running to ensure proper functioning of your Ceph cluster. Check monitor status regularly in order to ensure that all of the monitors are running. +.. _display-mon-map: + To display the monitor map, run the following command: .. prompt:: bash $ diff --git a/doc/rados/operations/pgcalc/index.rst b/doc/rados/operations/pgcalc/index.rst new file mode 100644 index 000000000..1aed87391 --- /dev/null +++ b/doc/rados/operations/pgcalc/index.rst @@ -0,0 +1,68 @@ +.. _pgcalc: + + +======= +PG Calc +======= + + +.. raw:: html + + + <link rel="stylesheet" id="wp-job-manager-job-listings-css" href="https://web.archive.org/web/20230614135557cs_/https://old.ceph.com/wp-content/plugins/wp-job-manager/assets/dist/css/job-listings.css" type="text/css" media="all"/> + <link rel="stylesheet" id="ceph/googlefont-css" href="https://web.archive.org/web/20230614135557cs_/https://fonts.googleapis.com/css?family=Raleway%3A300%2C400%2C700&ver=5.7.2" type="text/css" media="all"/> + <link rel="stylesheet" id="Stylesheet-css" href="https://web.archive.org/web/20230614135557cs_/https://old.ceph.com/wp-content/themes/cephTheme/Resources/Styles/style.min.css" type="text/css" media="all"/> + <link rel="stylesheet" id="tablepress-default-css" href="https://web.archive.org/web/20230614135557cs_/https://old.ceph.com/wp-content/plugins/tablepress/css/default.min.css" type="text/css" media="all"/> + <link rel="stylesheet" id="jetpack_css-css" href="https://web.archive.org/web/20230614135557cs_/https://old.ceph.com/wp-content/plugins/jetpack/css/jetpack.css" type="text/css" media="all"/> + <script type="text/javascript" src="https://web.archive.org/web/20230614135557js_/https://old.ceph.com/wp-content/themes/cephTheme/foundation_framework/js/vendor/jquery.js" id="jquery-js"></script> + + <link rel="stylesheet" href="https://web.archive.org/web/20230614135557cs_/https://ajax.googleapis.com/ajax/libs/jqueryui/1.11.2/themes/smoothness/jquery-ui.css"/> + <link rel="stylesheet" href="https://web.archive.org/web/20230614135557cs_/https://old.ceph.com/pgcalc_assets/pgcalc.css"/> + <script src="https://ajax.googleapis.com/ajax/libs/jqueryui/1.11.2/jquery-ui.min.js"></script> + + <script src="../../../_static/js/pgcalc.js"></script> + <div id="pgcalcdiv"> + <div id="instructions"> + <h2>Ceph PGs per Pool Calculator</h2><br/><fieldset><legend>Instructions</legend> + <ol> + <li>Confirm your understanding of the fields by reading through the Key below.</li> + <li>Select a <b>"Ceph Use Case"</b> from the drop down menu.</li> + <li>Adjust the values in the <span class="inputColor addBorder" style="font-weight: bold;">"Green"</span> shaded fields below.<br/> + <b>Tip:</b> Headers can be clicked to change the value throughout the table.</li> + <li>You will see the Suggested PG Count update based on your inputs.</li> + <li>Click the <b>"Add Pool"</b> button to create a new line for a new pool.</li> + <li>Click the <span class="ui-icon ui-icon-trash" style="display:inline-block;"></span> icon to delete the specific Pool.</li> + <li>For more details on the logic used and some important details, see the area below the table.</li> + <li>Once all values have been adjusted, click the <b>"Generate Commands"</b> button to get the pool creation commands.</li> + </ol></fieldset> + </div> + <div 
id="beforeTable"></div> + <br/> + <p class="validateTips"> </p> + <label for="presetType">Ceph Use Case Selector:</label><br/><select id="presetType"></select><button style="margin-left: 200px;" id="btnAddPool" type="button">Add Pool</button><button type="button" id="btnGenCommands" download="commands.txt">Generate Commands</button> + <div id="pgsPerPoolTable"> + <table id="pgsperpool"> + </table> + </div> <!-- id = pgsPerPoolTable --> + <br/> + <div id="afterTable"></div> + <div id="countLogic"><fieldset><legend>Logic behind Suggested PG Count</legend> + <br/> + <div class="upperFormula">( Target PGs per OSD ) x ( OSD # ) x ( %Data )</div> + <div class="lowerFormula">( Size )</div> + <ol id="countLogicList"> + <li>If the value of the above calculation is less than the value of <b>( OSD# ) / ( Size )</b>, then the value is updated to the value of <b>( OSD# ) / ( Size )</b>. This is to ensure even load / data distribution by allocating at least one Primary or Secondary PG to every OSD for every Pool.</li> + <li>The output value is then rounded to the <b>nearest power of 2</b>.<br/><b>Tip:</b> The nearest power of 2 provides a marginal improvement in efficiency of the <a href="https://web.archive.org/web/20230614135557/http://ceph.com/docs/master/rados/operations/crush-map/" title="CRUSH Map Details">CRUSH</a> algorithm.</li> + <li>If the nearest power of 2 is more than <b>25%</b> below the original value, the next higher power of 2 is used.</li> + </ol> + <b>Objective</b> + <ul><li>The objective of this calculation and the target ranges noted in the "Key" section above are to ensure that there are sufficient Placement Groups for even data distribution throughout the cluster, while not going high enough on the PG per OSD ratio to cause problems during Recovery and/or Backfill operations.</li></ul> + <b>Effects of enpty or non-active pools:</b> + <ul> + <li>Empty or otherwise non-active pools should not be considered helpful toward even data distribution throughout the cluster.</li> + <li>However, the PGs associated with these empty / non-active pools still consume memory and CPU overhead.</li> + </ul> + </fieldset> + </div> + <div id="commands" title="Pool Creation Commands"><code><pre id="commandCode"></pre></code></div> + </div> diff --git a/doc/rados/operations/placement-groups.rst b/doc/rados/operations/placement-groups.rst index dda4a0177..93ab1f0c0 100644 --- a/doc/rados/operations/placement-groups.rst +++ b/doc/rados/operations/placement-groups.rst @@ -4,6 +4,21 @@ Placement Groups ================== +Placement groups (PGs) are subsets of each logical Ceph pool. Placement groups +perform the function of placing objects (as a group) into OSDs. Ceph manages +data internally at placement-group granularity: this scales better than would +managing individual RADOS objects. A cluster that has a larger number of +placement groups (for example, 150 per OSD) is better balanced than an +otherwise identical cluster with a smaller number of placement groups. + +Ceph’s internal RADOS objects are each mapped to a specific placement group, +and each placement group belongs to exactly one Ceph pool. + +See Sage Weil's blog post `New in Nautilus: PG merging and autotuning +<https://ceph.io/en/news/blog/2019/new-in-nautilus-pg-merging-and-autotuning/>`_ +for more information about the relationship of placement groups to pools and to +objects. + .. 
_pg-autoscaler: Autoscaling placement groups @@ -131,11 +146,11 @@ The output will resemble the following:: if a ``pg_num`` change is in progress, the current number of PGs that the pool is working towards. -- **NEW PG_NUM** (if present) is the value that the system is recommending the - ``pg_num`` of the pool to be changed to. It is always a power of 2, and it is - present only if the recommended value varies from the current value by more - than the default factor of ``3``. To adjust this factor (in the following - example, it is changed to ``2``), run the following command: +- **NEW PG_NUM** (if present) is the value that the system recommends that the + ``pg_num`` of the pool should be. It is always a power of two, and it + is present only if the recommended value varies from the current value by + more than the default factor of ``3``. To adjust this multiple (in the + following example, it is changed to ``2``), run the following command: .. prompt:: bash # @@ -168,7 +183,6 @@ The output will resemble the following:: .. prompt:: bash # ceph osd pool set .mgr crush_rule replicated-ssd - ceph osd pool set pool 1 crush_rule to replicated-ssd This intervention will result in a small amount of backfill, but typically this traffic completes quickly. @@ -626,15 +640,14 @@ pools, each with 512 PGs on 10 OSDs, the OSDs will have to handle ~50,000 PGs each. This cluster will require significantly more resources and significantly more time for peering. -For determining the optimal number of PGs per OSD, we recommend the `PGCalc`_ -tool. - .. _setting the number of placement groups: Setting the Number of PGs ========================= +:ref:`Placement Group Link <pgcalc>` + Setting the initial number of PGs in a pool must be done at the time you create the pool. See `Create a Pool`_ for details. @@ -894,4 +907,3 @@ about it entirely (if it is too new to have a previous version). To mark the .. _Create a Pool: ../pools#createpool .. _Mapping PGs to OSDs: ../../../architecture#mapping-pgs-to-osds -.. _pgcalc: https://old.ceph.com/pgcalc/ diff --git a/doc/rados/operations/pools.rst b/doc/rados/operations/pools.rst index dda9e844e..c3fe3b7d8 100644 --- a/doc/rados/operations/pools.rst +++ b/doc/rados/operations/pools.rst @@ -18,15 +18,17 @@ Pools provide: <../erasure-code>`_, resilience is defined as the number of coding chunks (for example, ``m = 2`` in the default **erasure code profile**). -- **Placement Groups**: You can set the number of placement groups (PGs) for - the pool. In a typical configuration, the target number of PGs is - approximately one hundred PGs per OSD. This provides reasonable balancing - without consuming excessive computing resources. When setting up multiple - pools, be careful to set an appropriate number of PGs for each pool and for - the cluster as a whole. Each PG belongs to a specific pool: when multiple - pools use the same OSDs, make sure that the **sum** of PG replicas per OSD is - in the desired PG-per-OSD target range. To calculate an appropriate number of - PGs for your pools, use the `pgcalc`_ tool. +- **Placement Groups**: The :ref:`autoscaler <pg-autoscaler>` sets the number + of placement groups (PGs) for the pool. In a typical configuration, the + target number of PGs is approximately one-hundred and fifty PGs per OSD. This + provides reasonable balancing without consuming excessive computing + resources. When setting up multiple pools, set an appropriate number of PGs + for each pool and for the cluster as a whole. 
Each PG belongs to a specific + pool: when multiple pools use the same OSDs, make sure that the **sum** of PG + replicas per OSD is in the desired PG-per-OSD target range. See :ref:`Setting + the Number of Placement Groups <setting the number of placement groups>` for + instructions on how to manually set the number of placement groups per pool + (this procedure works only when the autoscaler is not used). - **CRUSH Rules**: When data is stored in a pool, the placement of the object and its replicas (or chunks, in the case of erasure-coded pools) in your @@ -94,19 +96,12 @@ To get even more information, you can execute this command with the ``--format`` Creating a Pool =============== -Before creating a pool, consult `Pool, PG and CRUSH Config Reference`_. Your -Ceph configuration file contains a setting (namely, ``pg_num``) that determines -the number of PGs. However, this setting's default value is NOT appropriate -for most systems. In most cases, you should override this default value when -creating your pool. For details on PG numbers, see `setting the number of -placement groups`_ - -For example: - -.. prompt:: bash $ - - osd_pool_default_pg_num = 128 - osd_pool_default_pgp_num = 128 +Before creating a pool, consult `Pool, PG and CRUSH Config Reference`_. The +Ceph central configuration database in the monitor cluster contains a setting +(namely, ``pg_num``) that determines the number of PGs per pool when a pool has +been created and no per-pool value has been specified. It is possible to change +this value from its default. For more on the subject of setting the number of +PGs per pool, see `setting the number of placement groups`_. .. note:: In Luminous and later releases, each pool must be associated with the application that will be using the pool. For more information, see @@ -742,8 +737,6 @@ Managing pools that are flagged with ``--bulk`` =============================================== See :ref:`managing_bulk_flagged_pools`. - -.. _pgcalc: https://old.ceph.com/pgcalc/ .. _Pool, PG and CRUSH Config Reference: ../../configuration/pool-pg-config-ref .. _Bloom Filter: https://en.wikipedia.org/wiki/Bloom_filter .. _setting the number of placement groups: ../placement-groups#set-the-number-of-placement-groups diff --git a/doc/rados/operations/stretch-mode.rst b/doc/rados/operations/stretch-mode.rst index f797b5b91..787e8cb4d 100644 --- a/doc/rados/operations/stretch-mode.rst +++ b/doc/rados/operations/stretch-mode.rst @@ -121,8 +121,6 @@ your CRUSH map. This procedure shows how to do this. rule stretch_rule { id 1 - min_size 1 - max_size 10 type replicated step take site1 step chooseleaf firstn 2 type host @@ -141,11 +139,15 @@ your CRUSH map. This procedure shows how to do this. #. Run the monitors in connectivity mode. See `Changing Monitor Elections`_. + .. prompt:: bash $ + + ceph mon set election_strategy connectivity + #. Command the cluster to enter stretch mode. In this example, ``mon.e`` is the tiebreaker monitor and we are splitting across data centers. The tiebreaker monitor must be assigned a data center that is neither ``site1`` nor - ``site2``. For this purpose you can create another data-center bucket named - ``site3`` in your CRUSH and place ``mon.e`` there: + ``site2``. This data center **should not** be defined in your CRUSH map, here + we are placing ``mon.e`` in a virtual data center called ``site3``: .. 
prompt:: bash $ diff --git a/doc/rados/troubleshooting/log-and-debug.rst b/doc/rados/troubleshooting/log-and-debug.rst index 929c3f53f..fa089338c 100644 --- a/doc/rados/troubleshooting/log-and-debug.rst +++ b/doc/rados/troubleshooting/log-and-debug.rst @@ -175,17 +175,19 @@ For each subsystem, there is a logging level for its output logs (a so-called "log level") and a logging level for its in-memory logs (a so-called "memory level"). Different values may be set for these two logging levels in each subsystem. Ceph's logging levels operate on a scale of ``1`` to ``20``, where -``1`` is terse and ``20`` is verbose [#f1]_. As a general rule, the in-memory -logs are not sent to the output log unless one or more of the following -conditions obtain: - -- a fatal signal is raised or -- an ``assert`` in source code is triggered or -- upon requested. Please consult `document on admin socket - <http://docs.ceph.com/en/latest/man/8/ceph/#daemon>`_ for more details. - -.. warning :: - .. [#f1] In certain rare cases, there are logging levels that can take a value greater than 20. The resulting logs are extremely verbose. +``1`` is terse and ``20`` is verbose. In certain rare cases, there are logging +levels that can take a value greater than 20. The resulting logs are extremely +verbose. + +The in-memory logs are not sent to the output log unless one or more of the +following conditions are true: + +- a fatal signal has been raised or +- an assertion within Ceph code has been triggered or +- the sending of in-memory logs to the output log has been manually triggered. + Consult `the portion of the "Ceph Administration Tool documentation + that provides an example of how to submit admin socket commands + <http://docs.ceph.com/en/latest/man/8/ceph/#daemon>`_ for more detail. Log levels and memory levels can be set either together or separately. If a subsystem is assigned a single value, then that value determines both the log diff --git a/doc/rados/troubleshooting/troubleshooting-mon.rst b/doc/rados/troubleshooting/troubleshooting-mon.rst index 1170da7c3..443d6c443 100644 --- a/doc/rados/troubleshooting/troubleshooting-mon.rst +++ b/doc/rados/troubleshooting/troubleshooting-mon.rst @@ -85,23 +85,27 @@ Using the monitor's admin socket ================================ A monitor's admin socket allows you to interact directly with a specific daemon -by using a Unix socket file. This file is found in the monitor's ``run`` -directory. The admin socket's default directory is -``/var/run/ceph/ceph-mon.ID.asok``, but this can be overridden and the admin -socket might be elsewhere, especially if your cluster's daemons are deployed in -containers. If you cannot find it, either check your ``ceph.conf`` for an -alternative path or run the following command: +by using a Unix socket file. This socket file is found in the monitor's ``run`` +directory. + +The admin socket's default directory is ``/var/run/ceph/ceph-mon.ID.asok``. It +is possible to override the admin socket's default location. If the default +location has been overridden, then the admin socket will be elsewhere. This is +often the case when a cluster's daemons are deployed in containers. + +To find the directory of the admin socket, check either your ``ceph.conf`` for +an alternative path or run the following command: .. prompt:: bash $ ceph-conf --name mon.ID --show-config-value admin_socket -The admin socket is available for use only when the monitor daemon is running. -Whenever the monitor has been properly shut down, the admin socket is removed. 
-However, if the monitor is not running and the admin socket persists, it is -likely that the monitor has been improperly shut down. In any case, if the -monitor is not running, it will be impossible to use the admin socket, and the -``ceph`` command is likely to return ``Error 111: Connection Refused``. +The admin socket is available for use only when the Monitor daemon is running. +Every time the Monitor is properly shut down, the admin socket is removed. If +the Monitor is not running and yet the admin socket persists, it is likely that +the Monitor has been improperly shut down. If the Monitor is not running, it +will be impossible to use the admin socket, and the ``ceph`` command is likely +to return ``Error 111: Connection Refused``. To access the admin socket, run a ``ceph tell`` command of the following form (specifying the daemon that you are interested in): @@ -110,7 +114,7 @@ To access the admin socket, run a ``ceph tell`` command of the following form ceph tell mon.<id> mon_status -This command passes a ``help`` command to the specific running monitor daemon +This command passes a ``help`` command to the specified running Monitor daemon ``<id>`` via its admin socket. If you know the full path to the admin socket file, this can be done more directly by running the following command: @@ -127,10 +131,11 @@ and ``quorum_status``. Understanding mon_status ======================== -The status of the monitor (as reported by the ``ceph tell mon.X mon_status`` -command) can always be obtained via the admin socket. This command outputs a -great deal of information about the monitor (including the information found in -the output of the ``quorum_status`` command). +The status of a Monitor (as reported by the ``ceph tell mon.X mon_status`` +command) can be obtained via the admin socket. The ``ceph tell mon.X +mon_status`` command outputs a great deal of information about the monitor +(including the information found in the output of the ``quorum_status`` +command). To understand this command's output, let us consider the following example, in which we see the output of ``ceph tell mon.c mon_status``:: @@ -160,29 +165,34 @@ which we see the output of ``ceph tell mon.c mon_status``:: "name": "c", "addr": "127.0.0.1:6795\/0"}]}} -It is clear that there are three monitors in the monmap (*a*, *b*, and *c*), -the quorum is formed by only two monitors, and *c* is in the quorum as a -*peon*. +This output reports that there are three monitors in the monmap (*a*, *b*, and +*c*), that quorum is formed by only two monitors, and that *c* is in quorum as +a *peon*. -**Which monitor is out of the quorum?** +**Which monitor is out of quorum?** - The answer is **a** (that is, ``mon.a``). + The answer is **a** (that is, ``mon.a``). ``mon.a`` is out of quorum. -**Why?** +**How do we know, in this example, that mon.a is out of quorum?** - When the ``quorum`` set is examined, there are clearly two monitors in the - set: *1* and *2*. But these are not monitor names. They are monitor ranks, as - established in the current ``monmap``. The ``quorum`` set does not include - the monitor that has rank 0, and according to the ``monmap`` that monitor is - ``mon.a``. + We know that ``mon.a`` is out of quorum because it has rank 0, and Monitors + with rank 0 are by definition out of quorum. + + If we examine the ``quorum`` set, we can see that there are clearly two + monitors in the set: *1* and *2*. But these are not monitor names. They are + monitor ranks, as established in the current ``monmap``. 
The ``quorum`` set + does not include the monitor that has rank 0, and according to the ``monmap`` + that monitor is ``mon.a``. **How are monitor ranks determined?** - Monitor ranks are calculated (or recalculated) whenever monitors are added or - removed. The calculation of ranks follows a simple rule: the **greater** the - ``IP:PORT`` combination, the **lower** the rank. In this case, because - ``127.0.0.1:6789`` is lower than the other two ``IP:PORT`` combinations, - ``mon.a`` has the highest rank: namely, rank 0. + Monitor ranks are calculated (or recalculated) whenever monitors are added to + or removed from the cluster. The calculation of ranks follows a simple rule: + the **greater** the ``IP:PORT`` combination, the **lower** the rank. In this + case, because ``127.0.0.1:6789`` (``mon.a``) is numerically less than the + other two ``IP:PORT`` combinations (which are ``127.0.0.1:6790`` for "Monitor + b" and ``127.0.0.1:6795`` for "Monitor c"), ``mon.a`` has the highest rank: + namely, rank 0. Most Common Monitor Issues @@ -250,14 +260,15 @@ detail`` returns a message similar to the following:: Monitors at a wrong address. ``mon_status`` outputs the ``monmap`` that is known to the monitor: determine whether the other Monitors' locations as specified in the ``monmap`` match the locations of the Monitors in the - network. If they do not, see `Recovering a Monitor's Broken monmap`_. - If the locations of the Monitors as specified in the ``monmap`` match the - locations of the Monitors in the network, then the persistent - ``probing`` state could be related to severe clock skews amongst the monitor - nodes. See `Clock Skews`_. If the information in `Clock Skews`_ does not - bring the Monitor out of the ``probing`` state, then prepare your system logs - and ask the Ceph community for help. See `Preparing your logs`_ for - information about the proper preparation of logs. + network. If they do not, see :ref:`Recovering a Monitor's Broken monmap + <rados_troubleshooting_troubleshooting_mon_recovering_broken_monmap>`. If + the locations of the Monitors as specified in the ``monmap`` match the + locations of the Monitors in the network, then the persistent ``probing`` + state could be related to severe clock skews among the monitor nodes. See + `Clock Skews`_. If the information in `Clock Skews`_ does not bring the + Monitor out of the ``probing`` state, then prepare your system logs and ask + the Ceph community for help. See `Preparing your logs`_ for information about + the proper preparation of logs. **What does it mean when a Monitor's state is ``electing``?** @@ -314,13 +325,16 @@ detail`` returns a message similar to the following:: substantiate it. See `Preparing your logs`_ for information about the proper preparation of logs. +.. _rados_troubleshooting_troubleshooting_mon_recovering_broken_monmap: -Recovering a Monitor's Broken ``monmap`` ----------------------------------------- +Recovering a Monitor's Broken "monmap" +-------------------------------------- -This is how a ``monmap`` usually looks, depending on the number of -monitors:: +A monmap can be retrieved by using a command of the form ``ceph tell mon.c +mon_status``, as described in :ref:`Understanding mon_status +<rados_troubleshoting_troubleshooting_mon_understanding_mon_status>`. +Here is an example of a ``monmap``:: epoch 3 fsid 5c4e9d53-e2e1-478a-8061-f543f8be4cf8 @@ -329,61 +343,64 @@ monitors:: 0: 127.0.0.1:6789/0 mon.a 1: 127.0.0.1:6790/0 mon.b 2: 127.0.0.1:6795/0 mon.c - -This may not be what you have however. 
For instance, in some versions of -early Cuttlefish there was a bug that could cause your ``monmap`` -to be nullified. Completely filled with zeros. This means that not even -``monmaptool`` would be able to make sense of cold, hard, inscrutable zeros. -It's also possible to end up with a monitor with a severely outdated monmap, -notably if the node has been down for months while you fight with your vendor's -TAC. The subject ``ceph-mon`` daemon might be unable to find the surviving -monitors (e.g., say ``mon.c`` is down; you add a new monitor ``mon.d``, -then remove ``mon.a``, then add a new monitor ``mon.e`` and remove -``mon.b``; you will end up with a totally different monmap from the one -``mon.c`` knows). -In this situation you have two possible solutions: +This ``monmap`` is in working order, but your ``monmap`` might not be in +working order. The ``monmap`` in a given node might be outdated because the +node was down for a long time, during which the cluster's Monitors changed. + +There are two ways to update a Monitor's outdated ``monmap``: + +A. **Scrap the monitor and redeploy.** + + Do this only if you are certain that you will not lose the information kept + by the Monitor that you scrap. Make sure that you have other Monitors in + good condition, so that the new Monitor will be able to synchronize with + the surviving Monitors. Remember that destroying a Monitor can lead to data + loss if there are no other copies of the Monitor's contents. + +B. **Inject a monmap into the monitor.** -Scrap the monitor and redeploy + It is possible to fix a Monitor that has an outdated ``monmap`` by + retrieving an up-to-date ``monmap`` from surviving Monitors in the cluster + and injecting it into the Monitor that has a corrupted or missing + ``monmap``. - You should only take this route if you are positive that you won't - lose the information kept by that monitor; that you have other monitors - and that they are running just fine so that your new monitor is able - to synchronize from the remaining monitors. Keep in mind that destroying - a monitor, if there are no other copies of its contents, may lead to - loss of data. + Implement this solution by carrying out the following procedure: -Inject a monmap into the monitor + #. Retrieve the ``monmap`` in one of the two following ways: - These are the basic steps: + a. **IF THERE IS A QUORUM OF MONITORS:** + + Retrieve the ``monmap`` from the quorum: - Retrieve the ``monmap`` from the surviving monitors and inject it into the - monitor whose ``monmap`` is corrupted or lost. + .. prompt:: bash - Implement this solution by carrying out the following procedure: + ceph mon getmap -o /tmp/monmap - 1. Is there a quorum of monitors? If so, retrieve the ``monmap`` from the - quorum:: + b. **IF THERE IS NO QUORUM OF MONITORS:** + + Retrieve the ``monmap`` directly from a Monitor that has been stopped + : - $ ceph mon getmap -o /tmp/monmap + .. prompt:: bash - 2. If there is no quorum, then retrieve the ``monmap`` directly from another - monitor that has been stopped (in this example, the other monitor has - the ID ``ID-FOO``):: + ceph-mon -i ID-FOO --extract-monmap /tmp/monmap - $ ceph-mon -i ID-FOO --extract-monmap /tmp/monmap + In this example, the ID of the stopped Monitor is ``ID-FOO``. - 3. Stop the monitor you are going to inject the monmap into. + #. Stop the Monitor into which the ``monmap`` will be injected. - 4. Inject the monmap:: + #. Inject the monmap into the stopped Monitor: - $ ceph-mon -i ID --inject-monmap /tmp/monmap + .. 
prompt:: bash - 5. Start the monitor + ceph-mon -i ID --inject-monmap /tmp/monmap - .. warning:: Injecting ``monmaps`` can cause serious problems because doing - so will overwrite the latest existing ``monmap`` stored on the monitor. Be - careful! + #. Start the Monitor. + + .. warning:: Injecting a ``monmap`` into a Monitor can cause serious + problems. Injecting a ``monmap`` overwrites the latest existing + ``monmap`` stored on the monitor. Be careful! Clock Skews ----------- @@ -464,12 +481,13 @@ Clock Skew Questions and Answers Client Can't Connect or Mount ----------------------------- -Check your IP tables. Some operating-system install utilities add a ``REJECT`` -rule to ``iptables``. ``iptables`` rules will reject all clients other than -``ssh`` that try to connect to the host. If your monitor host's IP tables have -a ``REJECT`` rule in place, clients that are connecting from a separate node -will fail and will raise a timeout error. Any ``iptables`` rules that reject -clients trying to connect to Ceph daemons must be addressed. For example:: +If a client can't connect to the cluster or mount, check your iptables. Some +operating-system install utilities add a ``REJECT`` rule to ``iptables``. +``iptables`` rules will reject all clients other than ``ssh`` that try to +connect to the host. If your monitor host's iptables have a ``REJECT`` rule in +place, clients that connect from a separate node will fail, and this will raise +a timeout error. Look for ``iptables`` rules that reject clients that are +trying to connect to Ceph daemons. For example:: REJECT all -- anywhere anywhere reject-with icmp-host-prohibited @@ -487,9 +505,9 @@ Monitor Store Failures Symptoms of store corruption ---------------------------- -Ceph monitors store the :term:`Cluster Map` in a key-value store. If key-value -store corruption causes a monitor to fail, then the monitor log might contain -one of the following error messages:: +Ceph Monitors maintain the :term:`Cluster Map` in a key-value store. If +key-value store corruption causes a Monitor to fail, then the Monitor log might +contain one of the following error messages:: Corruption: error in middle of record @@ -500,10 +518,10 @@ or:: Recovery using healthy monitor(s) --------------------------------- -If there are surviving monitors, we can always :ref:`replace -<adding-and-removing-monitors>` the corrupted monitor with a new one. After the -new monitor boots, it will synchronize with a healthy peer. After the new -monitor is fully synchronized, it will be able to serve clients. +If the cluster contains surviving Monitors, the corrupted Monitor can be +:ref:`replaced <adding-and-removing-monitors>` with a new Monitor. After the +new Monitor boots, it will synchronize with a healthy peer. After the new +Monitor is fully synchronized, it will be able to serve clients. .. _mon-store-recovery-using-osds: @@ -511,15 +529,14 @@ Recovery using OSDs ------------------- Even if all monitors fail at the same time, it is possible to recover the -monitor store by using information stored in OSDs. You are encouraged to deploy -at least three (and preferably five) monitors in a Ceph cluster. In such a -deployment, complete monitor failure is unlikely. However, unplanned power loss -in a data center whose disk settings or filesystem settings are improperly -configured could cause the underlying filesystem to fail and this could kill -all of the monitors. In such a case, data in the OSDs can be used to recover -the monitors. 
The following is such a script and can be used to recover the -monitors: - +Monitor store by using information that is stored in OSDs. You are encouraged +to deploy at least three (and preferably five) Monitors in a Ceph cluster. In +such a deployment, complete Monitor failure is unlikely. However, unplanned +power loss in a data center whose disk settings or filesystem settings are +improperly configured could cause the underlying filesystem to fail and this +could kill all of the monitors. In such a case, data in the OSDs can be used to +recover the Monitors. The following is a script that can be used in such a case +to recover the Monitors: .. code-block:: bash @@ -572,10 +589,10 @@ monitors: This script performs the following steps: -#. Collects the map from each OSD host. -#. Rebuilds the store. -#. Fills the entities in the keyring file with appropriate capabilities. -#. Replaces the corrupted store on ``mon.foo`` with the recovered copy. +#. Collect the map from each OSD host. +#. Rebuild the store. +#. Fill the entities in the keyring file with appropriate capabilities. +#. Replace the corrupted store on ``mon.foo`` with the recovered copy. Known limitations @@ -587,19 +604,18 @@ The above recovery tool is unable to recover the following information: auth add`` command are recovered from the OSD's copy, and the ``client.admin`` keyring is imported using ``ceph-monstore-tool``. However, the MDS keyrings and all other keyrings will be missing in the recovered - monitor store. You might need to manually re-add them. + Monitor store. It might be necessary to manually re-add them. - **Creating pools**: If any RADOS pools were in the process of being created, that state is lost. The recovery tool operates on the assumption that all pools have already been created. If there are PGs that are stuck in the - 'unknown' state after the recovery for a partially created pool, you can + ``unknown`` state after the recovery for a partially created pool, you can force creation of the *empty* PG by running the ``ceph osd force-create-pg`` - command. Note that this will create an *empty* PG, so take this action only - if you know the pool is empty. + command. This creates an *empty* PG, so take this action only if you are + certain that the pool is empty. - **MDS Maps**: The MDS maps are lost. - Everything Failed! Now What? ============================ @@ -611,16 +627,20 @@ irc.oftc.net), or at ``dev@ceph.io`` and ``ceph-users@lists.ceph.com``. Make sure that you have prepared your logs and that you have them ready upon request. -See https://ceph.io/en/community/connect/ for current (as of October 2023) -information on getting in contact with the upstream Ceph community. +The upstream Ceph Slack workspace can be joined at this address: +https://ceph-storage.slack.com/ +See https://ceph.io/en/community/connect/ for current (as of December 2023) +information on getting in contact with the upstream Ceph community. Preparing your logs ------------------- -The default location for monitor logs is ``/var/log/ceph/ceph-mon.FOO.log*``. -However, if they are not there, you can find their current location by running -the following command: +The default location for Monitor logs is ``/var/log/ceph/ceph-mon.FOO.log*``. +It is possible that the location of the Monitor logs has been changed from the +default. If the location of the Monitor logs has been changed from the default +location, find the location of the Monitor logs by running the following +command: .. 
prompt:: bash @@ -631,21 +651,21 @@ cluster's configuration files. If Ceph is using the default debug levels, then your logs might be missing important information that would help the upstream Ceph community address your issue. -To make sure your monitor logs contain relevant information, you can raise -debug levels. Here we are interested in information from the monitors. As with -other components, the monitors have different parts that output their debug +Raise debug levels to make sure that your Monitor logs contain relevant +information. Here we are interested in information from the Monitors. As with +other components, the Monitors have different parts that output their debug information on different subsystems. If you are an experienced Ceph troubleshooter, we recommend raising the debug -levels of the most relevant subsystems. Of course, this approach might not be -easy for beginners. In most cases, however, enough information to address the -issue will be secured if the following debug levels are entered:: +levels of the most relevant subsystems. This approach might not be easy for +beginners. In most cases, however, enough information to address the issue will +be logged if the following debug levels are entered:: debug_mon = 10 debug_ms = 1 Sometimes these debug levels do not yield enough information. In such cases, -members of the upstream Ceph community might ask you to make additional changes +members of the upstream Ceph community will ask you to make additional changes to these or to other debug levels. In any case, it is better for us to receive at least some useful information than to receive an empty log. @@ -653,10 +673,12 @@ at least some useful information than to receive an empty log. Do I need to restart a monitor to adjust debug levels? ------------------------------------------------------ -No, restarting a monitor is not necessary. Debug levels may be adjusted by -using two different methods, depending on whether or not there is a quorum: +No. It is not necessary to restart a Monitor when adjusting its debug levels. + +There are two different methods for adjusting debug levels. One method is used +when there is quorum. The other is used when there is no quorum. -There is a quorum +**Adjusting debug levels when there is a quorum** Either inject the debug option into the specific monitor that needs to be debugged:: @@ -668,17 +690,19 @@ There is a quorum ceph tell mon.* config set debug_mon 10/10 -There is no quorum +**Adjusting debug levels when there is no quorum** Use the admin socket of the specific monitor that needs to be debugged and directly adjust the monitor's configuration options:: ceph daemon mon.FOO config set debug_mon 10/10 +**Returning debug levels to their default values** To return the debug levels to their default values, run the above commands -using the debug level ``1/10`` rather than ``10/10``. To check a monitor's -current values, use the admin socket and run either of the following commands: +using the debug level ``1/10`` rather than the debug level ``10/10``. To check +a Monitor's current values, use the admin socket and run either of the +following commands: .. prompt:: bash @@ -695,17 +719,17 @@ or: I Reproduced the problem with appropriate debug levels. Now what? ----------------------------------------------------------------- -We prefer that you send us only the portions of your logs that are relevant to -your monitor problems. 
Of course, it might not be easy for you to determine -which portions are relevant so we are willing to accept complete and -unabridged logs. However, we request that you avoid sending logs containing -hundreds of thousands of lines with no additional clarifying information. One -common-sense way of making our task easier is to write down the current time -and date when you are reproducing the problem and then extract portions of your +Send the upstream Ceph community only the portions of your logs that are +relevant to your Monitor problems. Because it might not be easy for you to +determine which portions are relevant, the upstream Ceph community accepts +complete and unabridged logs. But don't send logs containing hundreds of +thousands of lines with no additional clarifying information. One common-sense +way to help the Ceph community help you is to write down the current time and +date when you are reproducing the problem and then extract portions of your logs based on that information. -Finally, reach out to us on the mailing lists or IRC or Slack, or by filing a -new issue on the `tracker`_. +Contact the upstream Ceph community on the mailing lists or IRC or Slack, or by +filing a new issue on the `tracker`_. .. _tracker: http://tracker.ceph.com/projects/ceph/issues/new
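To pull together the debug-level workflow described above, here is a minimal sketch, assuming a quorum exists and using the placeholder Monitor name ``mon.FOO`` (substitute your own Monitor ID, and check your cluster's current values with ``config show`` before restoring them)::

    # Raise verbosity on the Monitor being investigated.
    ceph tell mon.FOO config set debug_mon 10/10
    ceph tell mon.FOO config set debug_ms 1

    # Reproduce the problem, noting the date and time so that the relevant
    # portion of /var/log/ceph/ceph-mon.FOO.log can be extracted later.

    # Confirm the values currently in effect on that Monitor.
    ceph daemon mon.FOO config show | grep -E 'debug_(mon|ms)'

    # Return the levels to less verbose values when finished.
    ceph tell mon.FOO config set debug_mon 1/10
    ceph tell mon.FOO config set debug_ms 0/5

If there is no quorum, the same options can be applied through the admin socket with ``ceph daemon mon.FOO config set ...``, as noted earlier in this section.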