Diffstat (limited to 'doc/rados/configuration')
19 files changed, 5233 insertions, 0 deletions
diff --git a/doc/rados/configuration/auth-config-ref.rst b/doc/rados/configuration/auth-config-ref.rst new file mode 100644 index 000000000..fc14f4ee6 --- /dev/null +++ b/doc/rados/configuration/auth-config-ref.rst @@ -0,0 +1,379 @@ +.. _rados-cephx-config-ref: + +======================== + CephX Config Reference +======================== + +The CephX protocol is enabled by default. The cryptographic authentication that +CephX provides has some computational costs, though they should generally be +quite low. If the network environment connecting your client and server hosts +is very safe and you cannot afford authentication, you can disable it. +**Disabling authentication is not generally recommended**. + +.. note:: If you disable authentication, you will be at risk of a + man-in-the-middle attack that alters your client/server messages, which + could have disastrous security effects. + +For information about creating users, see `User Management`_. For details on +the architecture of CephX, see `Architecture - High Availability +Authentication`_. + + +Deployment Scenarios +==================== + +How you initially configure CephX depends on your scenario. There are two +common strategies for deploying a Ceph cluster. If you are a first-time Ceph +user, you should probably take the easiest approach: using ``cephadm`` to +deploy a cluster. But if your cluster uses other deployment tools (for example, +Ansible, Chef, Juju, or Puppet), you will need either to use the manual +deployment procedures or to configure your deployment tool so that it will +bootstrap your monitor(s). + +Manual Deployment +----------------- + +When you deploy a cluster manually, it is necessary to bootstrap the monitors +manually and to create the ``client.admin`` user and keyring. To bootstrap +monitors, follow the steps in `Monitor Bootstrapping`_. Follow these steps when +using third-party deployment tools (for example, Chef, Puppet, and Juju). + + +Enabling/Disabling CephX +======================== + +Enabling CephX is possible only if the keys for your monitors, OSDs, and +metadata servers have already been deployed. If you are simply toggling CephX +on or off, it is not necessary to repeat the bootstrapping procedures. + +Enabling CephX +-------------- + +When CephX is enabled, Ceph will look for the keyring in the default search +path: this path includes ``/etc/ceph/$cluster.$name.keyring``. It is possible +to override this search-path location by adding a ``keyring`` option in the +``[global]`` section of your `Ceph configuration`_ file, but this is not +recommended. + +To enable CephX on a cluster for which authentication has been disabled, carry +out the following procedure. If you (or your deployment utility) have already +generated the keys, you may skip the steps related to generating keys. + +#. Create a ``client.admin`` key, and save a copy of the key for your client + host: + + .. prompt:: bash $ + + ceph auth get-or-create client.admin mon 'allow *' mds 'allow *' mgr 'allow *' osd 'allow *' -o /etc/ceph/ceph.client.admin.keyring + + **Warning:** This step will clobber any existing + ``/etc/ceph/client.admin.keyring`` file. Do not perform this step if a + deployment tool has already generated a keyring file for you. Be careful! + +#. Create a monitor keyring and generate a monitor secret key: + + .. prompt:: bash $ + + ceph-authtool --create-keyring /tmp/ceph.mon.keyring --gen-key -n mon. --cap mon 'allow *' + +#. 
For each monitor, copy the monitor keyring into a ``ceph.mon.keyring`` file + in the monitor's ``mon data`` directory. For example, to copy the monitor + keyring to ``mon.a`` in a cluster called ``ceph``, run the following + command: + + .. prompt:: bash $ + + cp /tmp/ceph.mon.keyring /var/lib/ceph/mon/ceph-a/keyring + +#. Generate a secret key for every MGR, where ``{$id}`` is the MGR letter: + + .. prompt:: bash $ + + ceph auth get-or-create mgr.{$id} mon 'allow profile mgr' mds 'allow *' osd 'allow *' -o /var/lib/ceph/mgr/ceph-{$id}/keyring + +#. Generate a secret key for every OSD, where ``{$id}`` is the OSD number: + + .. prompt:: bash $ + + ceph auth get-or-create osd.{$id} mon 'allow rwx' osd 'allow *' -o /var/lib/ceph/osd/ceph-{$id}/keyring + +#. Generate a secret key for every MDS, where ``{$id}`` is the MDS letter: + + .. prompt:: bash $ + + ceph auth get-or-create mds.{$id} mon 'allow rwx' osd 'allow *' mds 'allow *' mgr 'allow profile mds' -o /var/lib/ceph/mds/ceph-{$id}/keyring + +#. Enable CephX authentication by setting the following options in the + ``[global]`` section of your `Ceph configuration`_ file: + + .. code-block:: ini + + auth_cluster_required = cephx + auth_service_required = cephx + auth_client_required = cephx + +#. Start or restart the Ceph cluster. For details, see `Operating a Cluster`_. + +For details on bootstrapping a monitor manually, see `Manual Deployment`_. + + + +Disabling CephX +--------------- + +The following procedure describes how to disable CephX. If your cluster +environment is safe, you might want to disable CephX in order to offset the +computational expense of running authentication. **We do not recommend doing +so.** However, setup and troubleshooting might be easier if authentication is +temporarily disabled and subsequently re-enabled. + +#. Disable CephX authentication by setting the following options in the + ``[global]`` section of your `Ceph configuration`_ file: + + .. code-block:: ini + + auth_cluster_required = none + auth_service_required = none + auth_client_required = none + +#. Start or restart the Ceph cluster. For details, see `Operating a Cluster`_. + + +Configuration Settings +====================== + +Enablement +---------- + + +``auth_cluster_required`` + +:Description: If this configuration setting is enabled, the Ceph Storage + Cluster daemons (that is, ``ceph-mon``, ``ceph-osd``, + ``ceph-mds``, and ``ceph-mgr``) are required to authenticate with + each other. Valid settings are ``cephx`` or ``none``. + +:Type: String +:Required: No +:Default: ``cephx``. + + +``auth_service_required`` + +:Description: If this configuration setting is enabled, then Ceph clients can + access Ceph services only if those clients authenticate with the + Ceph Storage Cluster. Valid settings are ``cephx`` or ``none``. + +:Type: String +:Required: No +:Default: ``cephx``. + + +``auth_client_required`` + +:Description: If this configuration setting is enabled, then communication + between the Ceph client and Ceph Storage Cluster can be + established only if the Ceph Storage Cluster authenticates + against the Ceph client. Valid settings are ``cephx`` or + ``none``. + +:Type: String +:Required: No +:Default: ``cephx``. + + +.. index:: keys; keyring + +Keys +---- + +When Ceph is run with authentication enabled, ``ceph`` administrative commands +and Ceph clients can access the Ceph Storage Cluster only if they use +authentication keys. 
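As an optional verification step (not part of the official procedure above), you can
confirm which keys the cluster knows about and check the stored authentication
settings; the exact output will vary with your deployment:

.. prompt:: bash $

   ceph auth ls
   ceph config get mon auth_cluster_required

The first command lists every registered key and its capabilities; the second
reports the centrally stored value (or the compiled-in default), which is
``cephx`` when authentication is enabled.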
+ +The most common way to make these keys available to ``ceph`` administrative +commands and Ceph clients is to include a Ceph keyring under the ``/etc/ceph`` +directory. For Octopus and later releases that use ``cephadm``, the filename is +usually ``ceph.client.admin.keyring``. If the keyring is included in the +``/etc/ceph`` directory, then it is unnecessary to specify a ``keyring`` entry +in the Ceph configuration file. + +Because the Ceph Storage Cluster's keyring file contains the ``client.admin`` +key, we recommend copying the keyring file to nodes from which you run +administrative commands. + +To perform this step manually, run the following command: + +.. prompt:: bash $ + + sudo scp {user}@{ceph-cluster-host}:/etc/ceph/ceph.client.admin.keyring /etc/ceph/ceph.client.admin.keyring + +.. tip:: Make sure that the ``ceph.keyring`` file has appropriate permissions + (for example, ``chmod 644``) set on your client machine. + +You can specify the key itself by using the ``key`` setting in the Ceph +configuration file (this approach is not recommended), or instead specify a +path to a keyfile by using the ``keyfile`` setting in the Ceph configuration +file. + +``keyring`` + +:Description: The path to the keyring file. +:Type: String +:Required: No +:Default: ``/etc/ceph/$cluster.$name.keyring,/etc/ceph/$cluster.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin`` + + +``keyfile`` + +:Description: The path to a keyfile (that is, a file containing only the key). +:Type: String +:Required: No +:Default: None + + +``key`` + +:Description: The key (that is, the text string of the key itself). We do not + recommend that you use this setting unless you know what you're + doing. +:Type: String +:Required: No +:Default: None + + +Daemon Keyrings +--------------- + +Administrative users or deployment tools (for example, ``cephadm``) generate +daemon keyrings in the same way that they generate user keyrings. By default, +Ceph stores the keyring of a daemon inside that daemon's data directory. The +default keyring locations and the capabilities that are necessary for the +daemon to function are shown below. + +``ceph-mon`` + +:Location: ``$mon_data/keyring`` +:Capabilities: ``mon 'allow *'`` + +``ceph-osd`` + +:Location: ``$osd_data/keyring`` +:Capabilities: ``mgr 'allow profile osd' mon 'allow profile osd' osd 'allow *'`` + +``ceph-mds`` + +:Location: ``$mds_data/keyring`` +:Capabilities: ``mds 'allow' mgr 'allow profile mds' mon 'allow profile mds' osd 'allow rwx'`` + +``ceph-mgr`` + +:Location: ``$mgr_data/keyring`` +:Capabilities: ``mon 'allow profile mgr' mds 'allow *' osd 'allow *'`` + +``radosgw`` + +:Location: ``$rgw_data/keyring`` +:Capabilities: ``mon 'allow rwx' osd 'allow rwx'`` + + +.. note:: The monitor keyring (that is, ``mon.``) contains a key but no + capabilities, and this keyring is not part of the cluster ``auth`` database. + +The daemon's data-directory locations default to directories of the form:: + + /var/lib/ceph/$type/$cluster-$id + +For example, ``osd.12`` would have the following data directory:: + + /var/lib/ceph/osd/ceph-12 + +It is possible to override these locations, but it is not recommended. + + +.. index:: signatures + +Signatures +---------- + +Ceph performs a signature check that provides some limited protection against +messages being tampered with in flight (for example, by a "man in the middle" +attack). + +As with other parts of Ceph authentication, signatures admit of fine-grained +control. 
You can enable or disable signatures for service messages between +clients and Ceph, and for messages between Ceph daemons. + +Note that even when signatures are enabled data is not encrypted in flight. + +``cephx_require_signatures`` + +:Description: If this configuration setting is set to ``true``, Ceph requires + signatures on all message traffic between the Ceph client and the + Ceph Storage Cluster, and between daemons within the Ceph Storage + Cluster. + +.. note:: + **ANTIQUATED NOTE:** + + Neither Ceph Argonaut nor Linux kernel versions prior to 3.19 + support signatures; if one of these clients is in use, ``cephx_require_signatures`` + can be disabled in order to allow the client to connect. + + +:Type: Boolean +:Required: No +:Default: ``false`` + + +``cephx_cluster_require_signatures`` + +:Description: If this configuration setting is set to ``true``, Ceph requires + signatures on all message traffic between Ceph daemons within the + Ceph Storage Cluster. + +:Type: Boolean +:Required: No +:Default: ``false`` + + +``cephx_service_require_signatures`` + +:Description: If this configuration setting is set to ``true``, Ceph requires + signatures on all message traffic between Ceph clients and the + Ceph Storage Cluster. + +:Type: Boolean +:Required: No +:Default: ``false`` + + +``cephx_sign_messages`` + +:Description: If this configuration setting is set to ``true``, and if the Ceph + version supports message signing, then Ceph will sign all + messages so that they are more difficult to spoof. + +:Type: Boolean +:Default: ``true`` + + +Time to Live +------------ + +``auth_service_ticket_ttl`` + +:Description: When the Ceph Storage Cluster sends a ticket for authentication + to a Ceph client, the Ceph Storage Cluster assigns that ticket a + Time To Live (TTL). + +:Type: Double +:Default: ``60*60`` + + +.. _Monitor Bootstrapping: ../../../install/manual-deployment#monitor-bootstrapping +.. _Operating a Cluster: ../../operations/operating +.. _Manual Deployment: ../../../install/manual-deployment +.. _Ceph configuration: ../ceph-conf +.. _Architecture - High Availability Authentication: ../../../architecture#high-availability-authentication +.. _User Management: ../../operations/user-management diff --git a/doc/rados/configuration/bluestore-config-ref.rst b/doc/rados/configuration/bluestore-config-ref.rst new file mode 100644 index 000000000..3707be1aa --- /dev/null +++ b/doc/rados/configuration/bluestore-config-ref.rst @@ -0,0 +1,552 @@ +================================== + BlueStore Configuration Reference +================================== + +Devices +======= + +BlueStore manages either one, two, or in certain cases three storage devices. +These *devices* are "devices" in the Linux/Unix sense. This means that they are +assets listed under ``/dev`` or ``/devices``. Each of these devices may be an +entire storage drive, or a partition of a storage drive, or a logical volume. +BlueStore does not create or mount a conventional file system on devices that +it uses; BlueStore reads and writes to the devices directly in a "raw" fashion. + +In the simplest case, BlueStore consumes all of a single storage device. This +device is known as the *primary device*. The primary device is identified by +the ``block`` symlink in the data directory. + +The data directory is a ``tmpfs`` mount. 
When this data directory is booted or +activated by ``ceph-volume``, it is populated with metadata files and links +that hold information about the OSD: for example, the OSD's identifier, the +name of the cluster that the OSD belongs to, and the OSD's private keyring. + +In more complicated cases, BlueStore is deployed across one or two additional +devices: + +* A *write-ahead log (WAL) device* (identified as ``block.wal`` in the data + directory) can be used to separate out BlueStore's internal journal or + write-ahead log. Using a WAL device is advantageous only if the WAL device + is faster than the primary device (for example, if the WAL device is an SSD + and the primary device is an HDD). +* A *DB device* (identified as ``block.db`` in the data directory) can be used + to store BlueStore's internal metadata. BlueStore (or more precisely, the + embedded RocksDB) will put as much metadata as it can on the DB device in + order to improve performance. If the DB device becomes full, metadata will + spill back onto the primary device (where it would have been located in the + absence of the DB device). Again, it is advantageous to provision a DB device + only if it is faster than the primary device. + +If there is only a small amount of fast storage available (for example, less +than a gigabyte), we recommend using the available space as a WAL device. But +if more fast storage is available, it makes more sense to provision a DB +device. Because the BlueStore journal is always placed on the fastest device +available, using a DB device provides the same benefit that using a WAL device +would, while *also* allowing additional metadata to be stored off the primary +device (provided that it fits). DB devices make this possible because whenever +a DB device is specified but an explicit WAL device is not, the WAL will be +implicitly colocated with the DB on the faster device. + +To provision a single-device (colocated) BlueStore OSD, run the following +command: + +.. prompt:: bash $ + + ceph-volume lvm prepare --bluestore --data <device> + +To specify a WAL device or DB device, run the following command: + +.. prompt:: bash $ + + ceph-volume lvm prepare --bluestore --data <device> --block.wal <wal-device> --block.db <db-device> + +.. note:: The option ``--data`` can take as its argument any of the the + following devices: logical volumes specified using *vg/lv* notation, + existing logical volumes, and GPT partitions. + + + +Provisioning strategies +----------------------- + +BlueStore differs from Filestore in that there are several ways to deploy a +BlueStore OSD. However, the overall deployment strategy for BlueStore can be +clarified by examining just these two common arrangements: + +.. _bluestore-single-type-device-config: + +**block (data) only** +^^^^^^^^^^^^^^^^^^^^^ +If all devices are of the same type (for example, they are all HDDs), and if +there are no fast devices available for the storage of metadata, then it makes +sense to specify the block device only and to leave ``block.db`` and +``block.wal`` unseparated. The :ref:`ceph-volume-lvm` command for a single +``/dev/sda`` device is as follows: + +.. prompt:: bash $ + + ceph-volume lvm create --bluestore --data /dev/sda + +If the devices to be used for a BlueStore OSD are pre-created logical volumes, +then the :ref:`ceph-volume-lvm` call for an logical volume named +``ceph-vg/block-lv`` is as follows: + +.. prompt:: bash $ + + ceph-volume lvm create --bluestore --data ceph-vg/block-lv + +.. 
_bluestore-mixed-device-config:

**block and block.db**
^^^^^^^^^^^^^^^^^^^^^^

If you have a mix of fast and slow devices (for example, SSD and HDD), then we
recommend placing ``block.db`` on the faster device while ``block`` (that is,
the data) is stored on the slower device (that is, the rotational drive).

You must create these volume groups and logical volumes manually, as the
``ceph-volume`` tool is currently unable to create them automatically.

The following procedure illustrates the manual creation of volume groups and
logical volumes. For this example, we shall assume four rotational drives
(``sda``, ``sdb``, ``sdc``, and ``sdd``) and one (fast) SSD (``sdx``). First,
to create the volume groups, run the following commands:

.. prompt:: bash $

   vgcreate ceph-block-0 /dev/sda
   vgcreate ceph-block-1 /dev/sdb
   vgcreate ceph-block-2 /dev/sdc
   vgcreate ceph-block-3 /dev/sdd

Next, to create the logical volumes for ``block``, run the following commands:

.. prompt:: bash $

   lvcreate -l 100%FREE -n block-0 ceph-block-0
   lvcreate -l 100%FREE -n block-1 ceph-block-1
   lvcreate -l 100%FREE -n block-2 ceph-block-2
   lvcreate -l 100%FREE -n block-3 ceph-block-3

Because there are four HDDs, there will be four OSDs. Supposing that there is a
200GB SSD in ``/dev/sdx``, we can create four 50GB logical volumes by running
the following commands:

.. prompt:: bash $

   vgcreate ceph-db-0 /dev/sdx
   lvcreate -L 50GB -n db-0 ceph-db-0
   lvcreate -L 50GB -n db-1 ceph-db-0
   lvcreate -L 50GB -n db-2 ceph-db-0
   lvcreate -L 50GB -n db-3 ceph-db-0

Finally, to create the four OSDs, run the following commands:

.. prompt:: bash $

   ceph-volume lvm create --bluestore --data ceph-block-0/block-0 --block.db ceph-db-0/db-0
   ceph-volume lvm create --bluestore --data ceph-block-1/block-1 --block.db ceph-db-0/db-1
   ceph-volume lvm create --bluestore --data ceph-block-2/block-2 --block.db ceph-db-0/db-2
   ceph-volume lvm create --bluestore --data ceph-block-3/block-3 --block.db ceph-db-0/db-3

After this procedure is finished, there should be four OSDs, ``block`` should
be on the four HDDs, and each HDD should have a 50GB logical volume
(specifically, a DB device) on the shared SSD.

Sizing
======
When using a :ref:`mixed spinning-and-solid-drive setup
<bluestore-mixed-device-config>`, it is important to make a large enough
``block.db`` logical volume for BlueStore. The logical volumes associated with
``block.db`` should be *as large as possible*.

It is generally recommended that the size of ``block.db`` be somewhere between
1% and 4% of the size of ``block``. For RGW workloads, it is recommended that
the ``block.db`` be at least 4% of the ``block`` size, because RGW makes heavy
use of ``block.db`` to store metadata (in particular, omap keys). For example,
if the ``block`` size is 1TB, then ``block.db`` should have a size of at least
40GB. For RBD workloads, however, ``block.db`` usually needs no more than 1% to
2% of the ``block`` size.

In older releases, internal level sizes are such that the DB can fully utilize
only those specific partition / logical volume sizes that correspond to sums of
L0, L0+L1, L1+L2, and so on--that is, given default settings, sizes of roughly
3GB, 30GB, 300GB, and so on. Most deployments do not substantially benefit from
sizing that accommodates L3 and higher, though DB compaction can be facilitated
by doubling these figures to 6GB, 60GB, and 600GB.
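When evaluating whether an existing ``block.db`` volume is adequately sized, it
can help to inspect the BlueFS usage counters of a running OSD: a nonzero
``slow_used_bytes`` value indicates that RocksDB data has spilled over onto the
slow (primary) device. The command below is only a suggested check; it assumes
an OSD id of ``0``, and counter names can vary slightly between releases:

.. prompt:: bash #

   ceph daemon osd.0 perf dump | egrep 'db_total_bytes|db_used_bytes|slow_used_bytes'

Recent releases also raise a ``BLUEFS_SPILLOVER`` health warning when such
spillover is detected.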
+ +Improvements in Nautilus 14.2.12, Octopus 15.2.6, and subsequent releases allow +for better utilization of arbitrarily-sized DB devices. Moreover, the Pacific +release brings experimental dynamic-level support. Because of these advances, +users of older releases might want to plan ahead by provisioning larger DB +devices today so that the benefits of scale can be realized when upgrades are +made in the future. + +When *not* using a mix of fast and slow devices, there is no requirement to +create separate logical volumes for ``block.db`` or ``block.wal``. BlueStore +will automatically colocate these devices within the space of ``block``. + +Automatic Cache Sizing +====================== + +BlueStore can be configured to automatically resize its caches, provided that +certain conditions are met: TCMalloc must be configured as the memory allocator +and the ``bluestore_cache_autotune`` configuration option must be enabled (note +that it is currently enabled by default). When automatic cache sizing is in +effect, BlueStore attempts to keep OSD heap-memory usage under a certain target +size (as determined by ``osd_memory_target``). This approach makes use of a +best-effort algorithm and caches do not shrink smaller than the size defined by +the value of ``osd_memory_cache_min``. Cache ratios are selected in accordance +with a hierarchy of priorities. But if priority information is not available, +the values specified in the ``bluestore_cache_meta_ratio`` and +``bluestore_cache_kv_ratio`` options are used as fallback cache ratios. + +.. confval:: bluestore_cache_autotune +.. confval:: osd_memory_target +.. confval:: bluestore_cache_autotune_interval +.. confval:: osd_memory_base +.. confval:: osd_memory_expected_fragmentation +.. confval:: osd_memory_cache_min +.. confval:: osd_memory_cache_resize_interval + + +Manual Cache Sizing +=================== + +The amount of memory consumed by each OSD to be used for its BlueStore cache is +determined by the ``bluestore_cache_size`` configuration option. If that option +has not been specified (that is, if it remains at 0), then Ceph uses a +different configuration option to determine the default memory budget: +``bluestore_cache_size_hdd`` if the primary device is an HDD, or +``bluestore_cache_size_ssd`` if the primary device is an SSD. + +BlueStore and the rest of the Ceph OSD daemon make every effort to work within +this memory budget. Note that in addition to the configured cache size, there +is also memory consumed by the OSD itself. There is additional utilization due +to memory fragmentation and other allocator overhead. + +The configured cache-memory budget can be used to store the following types of +things: + +* Key/Value metadata (that is, RocksDB's internal cache) +* BlueStore metadata +* BlueStore data (that is, recently read or recently written object data) + +Cache memory usage is governed by the configuration options +``bluestore_cache_meta_ratio`` and ``bluestore_cache_kv_ratio``. The fraction +of the cache that is reserved for data is governed by both the effective +BlueStore cache size (which depends on the relevant +``bluestore_cache_size[_ssd|_hdd]`` option and the device class of the primary +device) and the "meta" and "kv" ratios. This data fraction can be calculated +with the following formula: ``<effective_cache_size> * (1 - +bluestore_cache_meta_ratio - bluestore_cache_kv_ratio)``. + +.. confval:: bluestore_cache_size +.. confval:: bluestore_cache_size_hdd +.. confval:: bluestore_cache_size_ssd +.. 
confval:: bluestore_cache_meta_ratio +.. confval:: bluestore_cache_kv_ratio + +Checksums +========= + +BlueStore checksums all metadata and all data written to disk. Metadata +checksumming is handled by RocksDB and uses the `crc32c` algorithm. By +contrast, data checksumming is handled by BlueStore and can use either +`crc32c`, `xxhash32`, or `xxhash64`. Nonetheless, `crc32c` is the default +checksum algorithm and it is suitable for most purposes. + +Full data checksumming increases the amount of metadata that BlueStore must +store and manage. Whenever possible (for example, when clients hint that data +is written and read sequentially), BlueStore will checksum larger blocks. In +many cases, however, it must store a checksum value (usually 4 bytes) for every +4 KB block of data. + +It is possible to obtain a smaller checksum value by truncating the checksum to +one or two bytes and reducing the metadata overhead. A drawback of this +approach is that it increases the probability of a random error going +undetected: about one in four billion given a 32-bit (4 byte) checksum, 1 in +65,536 given a 16-bit (2 byte) checksum, and 1 in 256 given an 8-bit (1 byte) +checksum. To use the smaller checksum values, select `crc32c_16` or `crc32c_8` +as the checksum algorithm. + +The *checksum algorithm* can be specified either via a per-pool ``csum_type`` +configuration option or via the global configuration option. For example: + +.. prompt:: bash $ + + ceph osd pool set <pool-name> csum_type <algorithm> + +.. confval:: bluestore_csum_type + +Inline Compression +================== + +BlueStore supports inline compression using `snappy`, `zlib`, `lz4`, or `zstd`. + +Whether data in BlueStore is compressed is determined by two factors: (1) the +*compression mode* and (2) any client hints associated with a write operation. +The compression modes are as follows: + +* **none**: Never compress data. +* **passive**: Do not compress data unless the write operation has a + *compressible* hint set. +* **aggressive**: Do compress data unless the write operation has an + *incompressible* hint set. +* **force**: Try to compress data no matter what. + +For more information about the *compressible* and *incompressible* I/O hints, +see :c:func:`rados_set_alloc_hint`. + +Note that data in Bluestore will be compressed only if the data chunk will be +sufficiently reduced in size (as determined by the ``bluestore compression +required ratio`` setting). No matter which compression modes have been used, if +the data chunk is too big, then it will be discarded and the original +(uncompressed) data will be stored instead. For example, if ``bluestore +compression required ratio`` is set to ``.7``, then data compression will take +place only if the size of the compressed data is no more than 70% of the size +of the original data. + +The *compression mode*, *compression algorithm*, *compression required ratio*, +*min blob size*, and *max blob size* settings can be specified either via a +per-pool property or via a global config option. To specify pool properties, +run the following commands: + +.. prompt:: bash $ + + ceph osd pool set <pool-name> compression_algorithm <algorithm> + ceph osd pool set <pool-name> compression_mode <mode> + ceph osd pool set <pool-name> compression_required_ratio <ratio> + ceph osd pool set <pool-name> compression_min_blob_size <size> + ceph osd pool set <pool-name> compression_max_blob_size <size> + +.. confval:: bluestore_compression_algorithm +.. confval:: bluestore_compression_mode +.. 
confval:: bluestore_compression_required_ratio +.. confval:: bluestore_compression_min_blob_size +.. confval:: bluestore_compression_min_blob_size_hdd +.. confval:: bluestore_compression_min_blob_size_ssd +.. confval:: bluestore_compression_max_blob_size +.. confval:: bluestore_compression_max_blob_size_hdd +.. confval:: bluestore_compression_max_blob_size_ssd + +.. _bluestore-rocksdb-sharding: + +RocksDB Sharding +================ + +BlueStore maintains several types of internal key-value data, all of which are +stored in RocksDB. Each data type in BlueStore is assigned a unique prefix. +Prior to the Pacific release, all key-value data was stored in a single RocksDB +column family: 'default'. In Pacific and later releases, however, BlueStore can +divide key-value data into several RocksDB column families. BlueStore achieves +better caching and more precise compaction when keys are similar: specifically, +when keys have similar access frequency, similar modification frequency, and a +similar lifetime. Under such conditions, performance is improved and less disk +space is required during compaction (because each column family is smaller and +is able to compact independently of the others). + +OSDs deployed in Pacific or later releases use RocksDB sharding by default. +However, if Ceph has been upgraded to Pacific or a later version from a +previous version, sharding is disabled on any OSDs that were created before +Pacific. + +To enable sharding and apply the Pacific defaults to a specific OSD, stop the +OSD and run the following command: + + .. prompt:: bash # + + ceph-bluestore-tool \ + --path <data path> \ + --sharding="m(3) p(3,0-12) o(3,0-13)=block_cache={type=binned_lru} l p" \ + reshard + +.. confval:: bluestore_rocksdb_cf +.. confval:: bluestore_rocksdb_cfs + +Throttling +========== + +.. confval:: bluestore_throttle_bytes +.. confval:: bluestore_throttle_deferred_bytes +.. confval:: bluestore_throttle_cost_per_io +.. confval:: bluestore_throttle_cost_per_io_hdd +.. confval:: bluestore_throttle_cost_per_io_ssd + +SPDK Usage +========== + +To use the SPDK driver for NVMe devices, you must first prepare your system. +See `SPDK document`__. + +.. __: http://www.spdk.io/doc/getting_started.html#getting_started_examples + +SPDK offers a script that will configure the device automatically. Run this +script with root permissions: + +.. prompt:: bash $ + + sudo src/spdk/scripts/setup.sh + +You will need to specify the subject NVMe device's device selector with the +"spdk:" prefix for ``bluestore_block_path``. + +In the following example, you first find the device selector of an Intel NVMe +SSD by running the following command: + +.. prompt:: bash $ + + lspci -mm -n -d -d 8086:0953 + +The form of the device selector is either ``DDDD:BB:DD.FF`` or +``DDDD.BB.DD.FF``. + +Next, supposing that ``0000:01:00.0`` is the device selector found in the +output of the ``lspci`` command, you can specify the device selector by running +the following command:: + + bluestore_block_path = "spdk:trtype:pcie traddr:0000:01:00.0" + +You may also specify a remote NVMeoF target over the TCP transport, as in the +following example:: + + bluestore_block_path = "spdk:trtype:tcp traddr:10.67.110.197 trsvcid:4420 subnqn:nqn.2019-02.io.spdk:cnode1" + +To run multiple SPDK instances per node, you must make sure each instance uses +its own DPDK memory by specifying for each instance the amount of DPDK memory +(in MB) that the instance will use. + +In most cases, a single device can be used for data, DB, and WAL. 
We describe +this strategy as *colocating* these components. Be sure to enter the below +settings to ensure that all I/Os are issued through SPDK:: + + bluestore_block_db_path = "" + bluestore_block_db_size = 0 + bluestore_block_wal_path = "" + bluestore_block_wal_size = 0 + +If these settings are not entered, then the current implementation will +populate the SPDK map files with kernel file system symbols and will use the +kernel driver to issue DB/WAL I/Os. + +Minimum Allocation Size +======================= + +There is a configured minimum amount of storage that BlueStore allocates on an +underlying storage device. In practice, this is the least amount of capacity +that even a tiny RADOS object can consume on each OSD's primary device. The +configuration option in question--:confval:`bluestore_min_alloc_size`--derives +its value from the value of either :confval:`bluestore_min_alloc_size_hdd` or +:confval:`bluestore_min_alloc_size_ssd`, depending on the OSD's ``rotational`` +attribute. Thus if an OSD is created on an HDD, BlueStore is initialized with +the current value of :confval:`bluestore_min_alloc_size_hdd`; but with SSD OSDs +(including NVMe devices), Bluestore is initialized with the current value of +:confval:`bluestore_min_alloc_size_ssd`. + +In Mimic and earlier releases, the default values were 64KB for rotational +media (HDD) and 16KB for non-rotational media (SSD). The Octopus release +changed the the default value for non-rotational media (SSD) to 4KB, and the +Pacific release changed the default value for rotational media (HDD) to 4KB. + +These changes were driven by space amplification that was experienced by Ceph +RADOS GateWay (RGW) deployments that hosted large numbers of small files +(S3/Swift objects). + +For example, when an RGW client stores a 1 KB S3 object, that object is written +to a single RADOS object. In accordance with the default +:confval:`min_alloc_size` value, 4 KB of underlying drive space is allocated. +This means that roughly 3 KB (that is, 4 KB minus 1 KB) is allocated but never +used: this corresponds to 300% overhead or 25% efficiency. Similarly, a 5 KB +user object will be stored as two RADOS objects, a 4 KB RADOS object and a 1 KB +RADOS object, with the result that 4KB of device capacity is stranded. In this +case, however, the overhead percentage is much smaller. Think of this in terms +of the remainder from a modulus operation. The overhead *percentage* thus +decreases rapidly as object size increases. + +There is an additional subtlety that is easily missed: the amplification +phenomenon just described takes place for *each* replica. For example, when +using the default of three copies of data (3R), a 1 KB S3 object actually +strands roughly 9 KB of storage device capacity. If erasure coding (EC) is used +instead of replication, the amplification might be even higher: for a ``k=4, +m=2`` pool, our 1 KB S3 object allocates 24 KB (that is, 4 KB multiplied by 6) +of device capacity. + +When an RGW bucket pool contains many relatively large user objects, the effect +of this phenomenon is often negligible. However, with deployments that can +expect a significant fraction of relatively small user objects, the effect +should be taken into consideration. + +The 4KB default value aligns well with conventional HDD and SSD devices. 
+However, certain novel coarse-IU (Indirection Unit) QLC SSDs perform and wear +best when :confval:`bluestore_min_alloc_size_ssd` is specified at OSD creation +to match the device's IU: this might be 8KB, 16KB, or even 64KB. These novel +storage drives can achieve read performance that is competitive with that of +conventional TLC SSDs and write performance that is faster than that of HDDs, +with higher density and lower cost than TLC SSDs. + +Note that when creating OSDs on these novel devices, one must be careful to +apply the non-default value only to appropriate devices, and not to +conventional HDD and SSD devices. Error can be avoided through careful ordering +of OSD creation, with custom OSD device classes, and especially by the use of +central configuration *masks*. + +In Quincy and later releases, you can use the +:confval:`bluestore_use_optimal_io_size_for_min_alloc_size` option to allow +automatic discovery of the correct value as each OSD is created. Note that the +use of ``bcache``, ``OpenCAS``, ``dmcrypt``, ``ATA over Ethernet``, `iSCSI`, or +other device-layering and abstraction technologies might confound the +determination of correct values. Moreover, OSDs deployed on top of VMware +storage have sometimes been found to report a ``rotational`` attribute that +does not match the underlying hardware. + +We suggest inspecting such OSDs at startup via logs and admin sockets in order +to ensure that their behavior is correct. Be aware that this kind of inspection +might not work as expected with older kernels. To check for this issue, +examine the presence and value of ``/sys/block/<drive>/queue/optimal_io_size``. + +.. note:: When running Reef or a later Ceph release, the ``min_alloc_size`` + baked into each OSD is conveniently reported by ``ceph osd metadata``. + +To inspect a specific OSD, run the following command: + +.. prompt:: bash # + + ceph osd metadata osd.1701 | egrep rotational\|alloc + +This space amplification might manifest as an unusually high ratio of raw to +stored data as reported by ``ceph df``. There might also be ``%USE`` / ``VAR`` +values reported by ``ceph osd df`` that are unusually high in comparison to +other, ostensibly identical, OSDs. Finally, there might be unexpected balancer +behavior in pools that use OSDs that have mismatched ``min_alloc_size`` values. + +This BlueStore attribute takes effect *only* at OSD creation; if the attribute +is changed later, a specific OSD's behavior will not change unless and until +the OSD is destroyed and redeployed with the appropriate option value(s). +Upgrading to a later Ceph release will *not* change the value used by OSDs that +were deployed under older releases or with other settings. + +.. confval:: bluestore_min_alloc_size +.. confval:: bluestore_min_alloc_size_hdd +.. confval:: bluestore_min_alloc_size_ssd +.. confval:: bluestore_use_optimal_io_size_for_min_alloc_size + +DSA (Data Streaming Accelerator) Usage +====================================== + +If you want to use the DML library to drive the DSA device for offloading +read/write operations on persistent memory (PMEM) in BlueStore, you need to +install `DML`_ and the `idxd-config`_ library. This will work only on machines +that have a SPR (Sapphire Rapids) CPU. + +.. _dml: https://github.com/intel/dml +.. _idxd-config: https://github.com/intel/idxd-config + +After installing the DML software, configure the shared work queues (WQs) with +reference to the following WQ configuration example: + +.. 
prompt:: bash $ + + accel-config config-wq --group-id=1 --mode=shared --wq-size=16 --threshold=15 --type=user --name="myapp1" --priority=10 --block-on-fault=1 dsa0/wq0.1 + accel-config config-engine dsa0/engine0.1 --group-id=1 + accel-config enable-device dsa0 + accel-config enable-wq dsa0/wq0.1 diff --git a/doc/rados/configuration/ceph-conf.rst b/doc/rados/configuration/ceph-conf.rst new file mode 100644 index 000000000..d8d5c9d03 --- /dev/null +++ b/doc/rados/configuration/ceph-conf.rst @@ -0,0 +1,715 @@ +.. _configuring-ceph: + +================== + Configuring Ceph +================== + +When Ceph services start, the initialization process activates a set of +daemons that run in the background. A :term:`Ceph Storage Cluster` runs at +least three types of daemons: + +- :term:`Ceph Monitor` (``ceph-mon``) +- :term:`Ceph Manager` (``ceph-mgr``) +- :term:`Ceph OSD Daemon` (``ceph-osd``) + +Any Ceph Storage Cluster that supports the :term:`Ceph File System` also runs +at least one :term:`Ceph Metadata Server` (``ceph-mds``). Any Cluster that +supports :term:`Ceph Object Storage` runs Ceph RADOS Gateway daemons +(``radosgw``). + +Each daemon has a number of configuration options, and each of those options +has a default value. Adjust the behavior of the system by changing these +configuration options. Make sure to understand the consequences before +overriding the default values, as it is possible to significantly degrade the +performance and stability of your cluster. Remember that default values +sometimes change between releases. For this reason, it is best to review the +version of this documentation that applies to your Ceph release. + +Option names +============ + +Each of the Ceph configuration options has a unique name that consists of words +formed with lowercase characters and connected with underscore characters +(``_``). + +When option names are specified on the command line, underscore (``_``) and +dash (``-``) characters can be used interchangeably (for example, +``--mon-host`` is equivalent to ``--mon_host``). + +When option names appear in configuration files, spaces can also be used in +place of underscores or dashes. However, for the sake of clarity and +convenience, we suggest that you consistently use underscores, as we do +throughout this documentation. + +Config sources +============== + +Each Ceph daemon, process, and library pulls its configuration from one or more +of the several sources listed below. Sources that occur later in the list +override those that occur earlier in the list (when both are present). + +- the compiled-in default value +- the monitor cluster's centralized configuration database +- a configuration file stored on the local host +- environment variables +- command-line arguments +- runtime overrides that are set by an administrator + +One of the first things a Ceph process does on startup is parse the +configuration options provided via the command line, via the environment, and +via the local configuration file. Next, the process contacts the monitor +cluster to retrieve centrally-stored configuration for the entire cluster. +After a complete view of the configuration is available, the startup of the +daemon or process will commence. + +.. _bootstrap-options: + +Bootstrap options +----------------- + +Bootstrap options are configuration options that affect the process's ability +to contact the monitors, to authenticate, and to retrieve the cluster-stored +configuration. 
For this reason, these options might need to be stored locally +on the node, and set by means of a local configuration file. These options +include the following: + +.. confval:: mon_host +.. confval:: mon_host_override + +- :confval:`mon_dns_srv_name` +- :confval:`mon_data`, :confval:`osd_data`, :confval:`mds_data`, + :confval:`mgr_data`, and similar options that define which local directory + the daemon stores its data in. +- :confval:`keyring`, :confval:`keyfile`, and/or :confval:`key`, which can be + used to specify the authentication credential to use to authenticate with the + monitor. Note that in most cases the default keyring location is in the data + directory specified above. + +In most cases, there is no reason to modify the default values of these +options. However, there is one exception to this: the :confval:`mon_host` +option that identifies the addresses of the cluster's monitors. But when +:ref:`DNS is used to identify monitors<mon-dns-lookup>`, a local Ceph +configuration file can be avoided entirely. + + +Skipping monitor config +----------------------- + +The option ``--no-mon-config`` can be passed in any command in order to skip +the step that retrieves configuration information from the cluster's monitors. +Skipping this retrieval step can be useful in cases where configuration is +managed entirely via configuration files, or when maintenance activity needs to +be done but the monitor cluster is down. + +.. _ceph-conf-file: + +Configuration sections +====================== + +Each of the configuration options associated with a single process or daemon +has a single value. However, the values for a configuration option can vary +across daemon types, and can vary even across different daemons of the same +type. Ceph options that are stored in the monitor configuration database or in +local configuration files are grouped into sections |---| so-called "configuration +sections" |---| to indicate which daemons or clients they apply to. + + +These sections include the following: + +.. confsec:: global + + Settings under ``global`` affect all daemons and clients + in a Ceph Storage Cluster. + + :example: ``log_file = /var/log/ceph/$cluster-$type.$id.log`` + +.. confsec:: mon + + Settings under ``mon`` affect all ``ceph-mon`` daemons in + the Ceph Storage Cluster, and override the same setting in + ``global``. + + :example: ``mon_cluster_log_to_syslog = true`` + +.. confsec:: mgr + + Settings in the ``mgr`` section affect all ``ceph-mgr`` daemons in + the Ceph Storage Cluster, and override the same setting in + ``global``. + + :example: ``mgr_stats_period = 10`` + +.. confsec:: osd + + Settings under ``osd`` affect all ``ceph-osd`` daemons in + the Ceph Storage Cluster, and override the same setting in + ``global``. + + :example: ``osd_op_queue = wpq`` + +.. confsec:: mds + + Settings in the ``mds`` section affect all ``ceph-mds`` daemons in + the Ceph Storage Cluster, and override the same setting in + ``global``. + + :example: ``mds_cache_memory_limit = 10G`` + +.. confsec:: client + + Settings under ``client`` affect all Ceph clients + (for example, mounted Ceph File Systems, mounted Ceph Block Devices) + as well as RADOS Gateway (RGW) daemons. + + :example: ``objecter_inflight_ops = 512`` + + +Configuration sections can also specify an individual daemon or client name. For example, +``mon.foo``, ``osd.123``, and ``client.smith`` are all valid section names. 
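As an illustration only (the option values here are arbitrary examples), a
configuration file that combines these section types might look like the
following; the more specific sections override the type-wide and global
sections, as described below:

.. code-block:: ini

   [global]
   # applies to every daemon and client
   log_file = /var/log/ceph/$cluster-$name.log

   [osd]
   # applies to all ceph-osd daemons
   osd_op_queue = wpq

   [osd.123]
   # applies only to the daemon osd.123
   debug_ms = 10

   [client.smith]
   # applies only to the client that authenticates as client.smith
   keyring = /etc/ceph/ceph.client.smith.keyring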
Any given daemon will draw its settings from the global section, the daemon- or
client-type section, and the section sharing its name. Settings in the
most-specific section take precedence: for example, if the same option is
specified in :confsec:`global`, :confsec:`mon`, and ``mon.foo`` in the same
source (that is, in the same configuration file), the ``mon.foo`` setting will
be used.

If multiple values of the same configuration option are specified in the same
section, the last value specified takes precedence.

Note that values from the local configuration file always take precedence over
values from the monitor configuration database, regardless of the section in
which they appear.

.. _ceph-metavariables:

Metavariables
=============

Metavariables dramatically simplify Ceph storage cluster configuration. When a
metavariable is set in a configuration value, Ceph expands the metavariable at
the time the configuration value is used. In this way, Ceph metavariables
behave similarly to the way that variable expansion works in the Bash shell.

Ceph supports the following metavariables:

.. describe:: $cluster

   Expands to the Ceph Storage Cluster name. Useful when running
   multiple Ceph Storage Clusters on the same hardware.

   :example: ``/etc/ceph/$cluster.keyring``
   :default: ``ceph``

.. describe:: $type

   Expands to a daemon or process type (for example, ``mds``, ``osd``, or
   ``mon``).

   :example: ``/var/lib/ceph/$type``

.. describe:: $id

   Expands to the daemon or client identifier. For
   ``osd.0``, this would be ``0``; for ``mds.a``, it would
   be ``a``.

   :example: ``/var/lib/ceph/$type/$cluster-$id``

.. describe:: $host

   Expands to the host name where the process is running.

.. describe:: $name

   Expands to ``$type.$id``.

   :example: ``/var/run/ceph/$cluster-$name.asok``

.. describe:: $pid

   Expands to the daemon's PID.

   :example: ``/var/run/ceph/$cluster-$name-$pid.asok``


Ceph configuration file
=======================

On startup, Ceph processes search for a configuration file in the
following locations:

#. ``$CEPH_CONF`` (that is, the path following the ``$CEPH_CONF``
   environment variable)
#. ``-c path/path`` (that is, the ``-c`` command line argument)
#. ``/etc/ceph/$cluster.conf``
#. ``~/.ceph/$cluster.conf``
#. ``./$cluster.conf`` (that is, in the current working directory)
#. On FreeBSD systems only, ``/usr/local/etc/ceph/$cluster.conf``

Here ``$cluster`` is the cluster's name (default: ``ceph``).

The Ceph configuration file uses an ``ini`` style syntax. You can add "comment
text" after a pound sign (#) or a semicolon (;). For example:

.. code-block:: ini

   # <--A number sign (#) precedes a comment.
   ; A comment may be anything.
   # Comments always follow a semicolon (;) or a pound sign (#) on each line.
   # The end of the line terminates a comment.
   # We recommend that you provide comments in your configuration file(s).


.. _ceph-conf-settings:

Config file section names
-------------------------

The configuration file is divided into sections. Each section must begin with a
valid configuration section name (see `Configuration sections`_, above) that is
surrounded by square brackets. For example:

..
code-block:: ini + + [global] + debug_ms = 0 + + [osd] + debug_ms = 1 + + [osd.1] + debug_ms = 10 + + [osd.2] + debug_ms = 10 + +Config file option values +------------------------- + +The value of a configuration option is a string. If the string is too long to +fit on a single line, you can put a backslash (``\``) at the end of the line +and the backslash will act as a line continuation marker. In such a case, the +value of the option will be the string after ``=`` in the current line, +combined with the string in the next line. Here is an example:: + + [global] + foo = long long ago\ + long ago + +In this example, the value of the "``foo``" option is "``long long ago long +ago``". + +An option value typically ends with either a newline or a comment. For +example: + +.. code-block:: ini + + [global] + obscure_one = difficult to explain # I will try harder in next release + simpler_one = nothing to explain + +In this example, the value of the "``obscure one``" option is "``difficult to +explain``" and the value of the "``simpler one`` options is "``nothing to +explain``". + +When an option value contains spaces, it can be enclosed within single quotes +or double quotes in order to make its scope clear and in order to make sure +that the first space in the value is not interpreted as the end of the value. +For example: + +.. code-block:: ini + + [global] + line = "to be, or not to be" + +In option values, there are four characters that are treated as escape +characters: ``=``, ``#``, ``;`` and ``[``. They are permitted to occur in an +option value only if they are immediately preceded by the backslash character +(``\``). For example: + +.. code-block:: ini + + [global] + secret = "i love \# and \[" + +Each configuration option falls under one of the following types: + +.. describe:: int + + 64-bit signed integer. Some SI suffixes are supported, such as "K", "M", + "G", "T", "P", and "E" (meaning, respectively, 10\ :sup:`3`, 10\ :sup:`6`, + 10\ :sup:`9`, etc.). "B" is the only supported unit string. Thus "1K", "1M", + "128B" and "-1" are all valid option values. When a negative value is + assigned to a threshold option, this can indicate that the option is + "unlimited" -- that is, that there is no threshold or limit in effect. + + :example: ``42``, ``-1`` + +.. describe:: uint + + This differs from ``integer`` only in that negative values are not + permitted. + + :example: ``256``, ``0`` + +.. describe:: str + + A string encoded in UTF-8. Certain characters are not permitted. Reference + the above notes for the details. + + :example: ``"hello world"``, ``"i love \#"``, ``yet-another-name`` + +.. describe:: boolean + + Typically either of the two values ``true`` or ``false``. However, any + integer is permitted: "0" implies ``false``, and any non-zero value implies + ``true``. + + :example: ``true``, ``false``, ``1``, ``0`` + +.. describe:: addr + + A single address, optionally prefixed with ``v1``, ``v2`` or ``any`` for the + messenger protocol. If no prefix is specified, the ``v2`` protocol is used. + For more details, see :ref:`address_formats`. + + :example: ``v1:1.2.3.4:567``, ``v2:1.2.3.4:567``, ``1.2.3.4:567``, ``2409:8a1e:8fb6:aa20:1260:4bff:fe92:18f5::567``, ``[::1]:6789`` + +.. describe:: addrvec + + A set of addresses separated by ",". The addresses can be optionally quoted + with ``[`` and ``]``. 
+ + :example: ``[v1:1.2.3.4:567,v2:1.2.3.4:568]``, ``v1:1.2.3.4:567,v1:1.2.3.14:567`` ``[2409:8a1e:8fb6:aa20:1260:4bff:fe92:18f5::567], [2409:8a1e:8fb6:aa20:1260:4bff:fe92:18f5::568]`` + +.. describe:: uuid + + The string format of a uuid defined by `RFC4122 + <https://www.ietf.org/rfc/rfc4122.txt>`_. Certain variants are also + supported: for more details, see `Boost document + <https://www.boost.org/doc/libs/1_74_0/libs/uuid/doc/uuid.html#String%20Generator>`_. + + :example: ``f81d4fae-7dec-11d0-a765-00a0c91e6bf6`` + +.. describe:: size + + 64-bit unsigned integer. Both SI prefixes and IEC prefixes are supported. + "B" is the only supported unit string. Negative values are not permitted. + + :example: ``1Ki``, ``1K``, ``1KiB`` and ``1B``. + +.. describe:: secs + + Denotes a duration of time. The default unit of time is the second. + The following units of time are supported: + + * second: ``s``, ``sec``, ``second``, ``seconds`` + * minute: ``m``, ``min``, ``minute``, ``minutes`` + * hour: ``hs``, ``hr``, ``hour``, ``hours`` + * day: ``d``, ``day``, ``days`` + * week: ``w``, ``wk``, ``week``, ``weeks`` + * month: ``mo``, ``month``, ``months`` + * year: ``y``, ``yr``, ``year``, ``years`` + + :example: ``1 m``, ``1m`` and ``1 week`` + +.. _ceph-conf-database: + +Monitor configuration database +============================== + +The monitor cluster manages a database of configuration options that can be +consumed by the entire cluster. This allows for streamlined central +configuration management of the entire system. For ease of administration and +transparency, the vast majority of configuration options can and should be +stored in this database. + +Some settings might need to be stored in local configuration files because they +affect the ability of the process to connect to the monitors, to authenticate, +and to fetch configuration information. In most cases this applies only to the +``mon_host`` option. This issue can be avoided by using :ref:`DNS SRV +records<mon-dns-lookup>`. + +Sections and masks +------------------ + +Configuration options stored by the monitor can be stored in a global section, +in a daemon-type section, or in a specific daemon section. In this, they are +no different from the options in a configuration file. + +In addition, options may have a *mask* associated with them to further restrict +which daemons or clients the option applies to. Masks take two forms: + +#. ``type:location`` where ``type`` is a CRUSH property like ``rack`` or + ``host``, and ``location`` is a value for that property. For example, + ``host:foo`` would limit the option only to daemons or clients + running on a particular host. +#. ``class:device-class`` where ``device-class`` is the name of a CRUSH + device class (for example, ``hdd`` or ``ssd``). For example, + ``class:ssd`` would limit the option only to OSDs backed by SSDs. + (This mask has no effect on non-OSD daemons or clients.) + +In commands that specify a configuration option, the argument of the option (in +the following examples, this is the "who" string) may be a section name, a +mask, or a combination of both separated by a slash character (``/``). For +example, ``osd/rack:foo`` would refer to all OSD daemons in the ``foo`` rack. + +When configuration options are shown, the section name and mask are presented +in separate fields or columns to make them more readable. 
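To give a concrete (and purely illustrative) example of masks in the "who"
argument, the following commands restrict an option to SSD-backed OSDs and to
OSDs on a particular host, respectively; the option names, values, and the host
name ``foo`` are arbitrary:

.. prompt:: bash $

   ceph config set osd/class:ssd osd_max_backfills 4
   ceph config set osd/host:foo debug_ms 10
   ceph config dump

``ceph config dump`` (described in the next section) then shows the mask in its
own column alongside the section name.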
+ +Commands +-------- + +The following CLI commands are used to configure the cluster: + +* ``ceph config dump`` dumps the entire monitor configuration + database for the cluster. + +* ``ceph config get <who>`` dumps the configuration options stored in + the monitor configuration database for a specific daemon or client + (for example, ``mds.a``). + +* ``ceph config get <who> <option>`` shows either a configuration value + stored in the monitor configuration database for a specific daemon or client + (for example, ``mds.a``), or, if that value is not present in the monitor + configuration database, the compiled-in default value. + +* ``ceph config set <who> <option> <value>`` specifies a configuration + option in the monitor configuration database. + +* ``ceph config show <who>`` shows the configuration for a running daemon. + These settings might differ from those stored by the monitors if there are + also local configuration files in use or if options have been overridden on + the command line or at run time. The source of the values of the options is + displayed in the output. + +* ``ceph config assimilate-conf -i <input file> -o <output file>`` ingests a + configuration file from *input file* and moves any valid options into the + monitor configuration database. Any settings that are unrecognized, are + invalid, or cannot be controlled by the monitor will be returned in an + abbreviated configuration file stored in *output file*. This command is + useful for transitioning from legacy configuration files to centralized + monitor-based configuration. + +Note that ``ceph config set <who> <option> <value>`` and ``ceph config get +<who> <option>`` will not necessarily return the same values. The latter +command will show compiled-in default values. In order to determine whether a +configuration option is present in the monitor configuration database, run +``ceph config dump``. + +Help +==== + +To get help for a particular option, run the following command: + +.. prompt:: bash $ + + ceph config help <option> + +For example: + +.. prompt:: bash $ + + ceph config help log_file + +:: + + log_file - path to log file + (std::string, basic) + Default (non-daemon): + Default (daemon): /var/log/ceph/$cluster-$name.log + Can update at runtime: false + See also: [log_to_stderr,err_to_stderr,log_to_syslog,err_to_syslog] + +or: + +.. prompt:: bash $ + + ceph config help log_file -f json-pretty + +:: + + { + "name": "log_file", + "type": "std::string", + "level": "basic", + "desc": "path to log file", + "long_desc": "", + "default": "", + "daemon_default": "/var/log/ceph/$cluster-$name.log", + "tags": [], + "services": [], + "see_also": [ + "log_to_stderr", + "err_to_stderr", + "log_to_syslog", + "err_to_syslog" + ], + "enum_values": [], + "min": "", + "max": "", + "can_update_at_runtime": false + } + +The ``level`` property can be ``basic``, ``advanced``, or ``dev``. The `dev` +options are intended for use by developers, generally for testing purposes, and +are not recommended for use by operators. + +.. note:: This command uses the configuration schema that is compiled into the + running monitors. If you have a mixed-version cluster (as might exist, for + example, during an upgrade), you might want to query the option schema from + a specific running daemon by running a command of the following form: + +.. prompt:: bash $ + + ceph daemon <name> config help [option] + +Runtime Changes +=============== + +In most cases, Ceph permits changes to the configuration of a daemon at +run time. 
This can be used for increasing or decreasing the amount of logging +output, for enabling or disabling debug settings, and for runtime optimization. + +Use the ``ceph config set`` command to update configuration options. For +example, to enable the most verbose debug log level on a specific OSD, run a +command of the following form: + +.. prompt:: bash $ + + ceph config set osd.123 debug_ms 20 + +.. note:: If an option has been customized in a local configuration file, the + `central config + <https://ceph.io/en/news/blog/2018/new-mimic-centralized-configuration-management/>`_ + setting will be ignored because it has a lower priority than the local + configuration file. + +.. note:: Log levels range from 0 to 20. + +Override values +--------------- + +Options can be set temporarily by using the Ceph CLI ``tell`` or ``daemon`` +interfaces on the Ceph CLI. These *override* values are ephemeral, which means +that they affect only the current instance of the daemon and revert to +persistently configured values when the daemon restarts. + +Override values can be set in two ways: + +#. From any host, send a message to a daemon with a command of the following + form: + + .. prompt:: bash $ + + ceph tell <name> config set <option> <value> + + For example: + + .. prompt:: bash $ + + ceph tell osd.123 config set debug_osd 20 + + The ``tell`` command can also accept a wildcard as the daemon identifier. + For example, to adjust the debug level on all OSD daemons, run a command of + the following form: + + .. prompt:: bash $ + + ceph tell osd.* config set debug_osd 20 + +#. On the host where the daemon is running, connect to the daemon via a socket + in ``/var/run/ceph`` by running a command of the following form: + + .. prompt:: bash $ + + ceph daemon <name> config set <option> <value> + + For example: + + .. prompt:: bash $ + + ceph daemon osd.4 config set debug_osd 20 + +.. note:: In the output of the ``ceph config show`` command, these temporary + values are shown to have a source of ``override``. + + +Viewing runtime settings +======================== + +You can see the current settings specified for a running daemon with the ``ceph +config show`` command. For example, to see the (non-default) settings for the +daemon ``osd.0``, run the following command: + +.. prompt:: bash $ + + ceph config show osd.0 + +To see a specific setting, run the following command: + +.. prompt:: bash $ + + ceph config show osd.0 debug_osd + +To see all settings (including those with default values), run the following +command: + +.. prompt:: bash $ + + ceph config show-with-defaults osd.0 + +You can see all settings for a daemon that is currently running by connecting +to it on the local host via the admin socket. For example, to dump all +current settings, run the following command: + +.. prompt:: bash $ + + ceph daemon osd.0 config show + +To see non-default settings and to see where each value came from (for example, +a config file, the monitor, or an override), run the following command: + +.. prompt:: bash $ + + ceph daemon osd.0 config diff + +To see the value of a single setting, run the following command: + +.. prompt:: bash $ + + ceph daemon osd.0 config get debug_osd + + +Changes introduced in Octopus +============================= + +The Octopus release changed the way the configuration file is parsed. +These changes are as follows: + +- Repeated configuration options are allowed, and no warnings will be + displayed. This means that the setting that comes last in the file is the one + that takes effect. 
Prior to this change, Ceph displayed warning messages when + lines containing duplicate options were encountered, such as:: + + warning line 42: 'foo' in section 'bar' redefined +- Prior to Octopus, options containing invalid UTF-8 characters were ignored + with warning messages. But in Octopus, they are treated as fatal errors. +- The backslash character ``\`` is used as the line-continuation marker that + combines the next line with the current one. Prior to Octopus, there was a + requirement that any end-of-line backslash be followed by a non-empty line. + But in Octopus, an empty line following a backslash is allowed. +- In the configuration file, each line specifies an individual configuration + option. The option's name and its value are separated with ``=``, and the + value may be enclosed within single or double quotes. If an invalid + configuration is specified, we will treat it as an invalid configuration + file:: + + bad option ==== bad value +- Prior to Octopus, if no section name was specified in the configuration file, + all options would be set as though they were within the :confsec:`global` + section. This approach is discouraged. Since Octopus, any configuration + file that has no section name must contain only a single option. + +.. |---| unicode:: U+2014 .. EM DASH :trim: diff --git a/doc/rados/configuration/common.rst b/doc/rados/configuration/common.rst new file mode 100644 index 000000000..0b373f469 --- /dev/null +++ b/doc/rados/configuration/common.rst @@ -0,0 +1,207 @@ +.. _ceph-conf-common-settings: + +Common Settings +=============== + +The `Hardware Recommendations`_ section provides some hardware guidelines for +configuring a Ceph Storage Cluster. It is possible for a single :term:`Ceph +Node` to run multiple daemons. For example, a single node with multiple drives +ususally runs one ``ceph-osd`` for each drive. Ideally, each node will be +assigned to a particular type of process. For example, some nodes might run +``ceph-osd`` daemons, other nodes might run ``ceph-mds`` daemons, and still +other nodes might run ``ceph-mon`` daemons. + +Each node has a name. The name of a node can be found in its ``host`` setting. +Monitors also specify a network address and port (that is, a domain name or IP +address) that can be found in the ``addr`` setting. A basic configuration file +typically specifies only minimal settings for each instance of monitor daemons. +For example: + + + + +.. code-block:: ini + + [global] + mon_initial_members = ceph1 + mon_host = 10.0.0.1 + +.. important:: The ``host`` setting's value is the short name of the node. It + is not an FQDN. It is **NOT** an IP address. To retrieve the name of the + node, enter ``hostname -s`` on the command line. Unless you are deploying + Ceph manually, do not use ``host`` settings for anything other than initial + monitor setup. **DO NOT** specify the ``host`` setting under individual + daemons when using deployment tools like ``chef`` or ``cephadm``. Such tools + are designed to enter the appropriate values for you in the cluster map. + + +.. _ceph-network-config: + +Networks +======== + +For more about configuring a network for use with Ceph, see the `Network +Configuration Reference`_ . + + +Monitors +======== + +Ceph production clusters typically provision at least three :term:`Ceph +Monitor` daemons to ensure availability in the event of a monitor instance +crash. 
A minimum of three :term:`Ceph Monitor` daemons ensures that the Paxos +algorithm is able to determine which version of the :term:`Ceph Cluster Map` is +the most recent. It makes this determination by consulting a majority of Ceph +Monitors in the quorum. + +.. note:: You may deploy Ceph with a single monitor, but if the instance fails, + the lack of other monitors might interrupt data-service availability. + +Ceph Monitors normally listen on port ``3300`` for the new v2 protocol, and on +port ``6789`` for the old v1 protocol. + +By default, Ceph expects to store monitor data on the following path:: + + /var/lib/ceph/mon/$cluster-$id + +You or a deployment tool (for example, ``cephadm``) must create the +corresponding directory. With metavariables fully expressed and a cluster named +"ceph", the path specified in the above example evaluates to:: + + /var/lib/ceph/mon/ceph-a + +For additional details, see the `Monitor Config Reference`_. + +.. _Monitor Config Reference: ../mon-config-ref + + +.. _ceph-osd-config: + +Authentication +============== + +.. versionadded:: Bobtail 0.56 + +Authentication is explicitly enabled or disabled in the ``[global]`` section of +the Ceph configuration file, as shown here: + +.. code-block:: ini + + auth_cluster_required = cephx + auth_service_required = cephx + auth_client_required = cephx + +In addition, you should enable message signing. For details, see `Cephx Config +Reference`_. + +.. _Cephx Config Reference: ../auth-config-ref + + +.. _ceph-monitor-config: + + +OSDs +==== + +By default, Ceph expects to store a Ceph OSD Daemon's data on the following +path:: + + /var/lib/ceph/osd/$cluster-$id + +You or a deployment tool (for example, ``cephadm``) must create the +corresponding directory. With metavariables fully expressed and a cluster named +"ceph", the path specified in the above example evaluates to:: + + /var/lib/ceph/osd/ceph-0 + +You can override this path using the ``osd_data`` setting. We recommend that +you do not change the default location. To create the default directory on your +OSD host, run the following commands: + +.. prompt:: bash $ + + ssh {osd-host} + sudo mkdir /var/lib/ceph/osd/ceph-{osd-number} + +The ``osd_data`` path ought to lead to a mount point that has mounted on it a +device that is distinct from the device that contains the operating system and +the daemons. To use a device distinct from the device that contains the +operating system and the daemons, prepare it for use with Ceph and mount it on +the directory you just created by running the following commands: + +.. prompt:: bash $ + + ssh {new-osd-host} + sudo mkfs -t {fstype} /dev/{disk} + sudo mount -o user_xattr /dev/{disk} /var/lib/ceph/osd/ceph-{osd-number} + +We recommend using the ``xfs`` file system when running :command:`mkfs`. (The +``btrfs`` and ``ext4`` file systems are not recommended and are no longer +tested.) + +For additional configuration details, see `OSD Config Reference`_. + + +Heartbeats +========== + +During runtime operations, Ceph OSD Daemons check up on other Ceph OSD Daemons +and report their findings to the Ceph Monitor. This process does not require +you to provide any settings. However, if you have network latency issues, you +might want to modify the default settings. + +For additional details, see `Configuring Monitor/OSD Interaction`_. + + +.. _ceph-logging-and-debugging: + +Logs / Debugging +================ + +You might sometimes encounter issues with Ceph that require you to use Ceph's +logging and debugging features. 
For details on log rotation, see `Debugging and +Logging`_. + +.. _Debugging and Logging: ../../troubleshooting/log-and-debug + + +Example ceph.conf +================= + +.. literalinclude:: demo-ceph.conf + :language: ini + +.. _ceph-runtime-config: + + + +Naming Clusters (deprecated) +============================ + +Each Ceph cluster has an internal name. This internal name is used as part of +configuration, and as part of "log file" names as well as part of directory +names and as part of mountpoint names. This name defaults to "ceph". Previous +releases of Ceph allowed one to specify a custom name instead, for example +"ceph2". This option was intended to facilitate the running of multiple logical +clusters on the same physical hardware, but in practice it was rarely +exploited. Custom cluster names should no longer be attempted. Old +documentation might lead readers to wrongly think that unique cluster names are +required to use ``rbd-mirror``. They are not required. + +Custom cluster names are now considered deprecated and the ability to deploy +them has already been removed from some tools, although existing custom-name +deployments continue to operate. The ability to run and manage clusters with +custom names might be progressively removed by future Ceph releases, so **it is +strongly recommended to deploy all new clusters with the default name "ceph"**. + +Some Ceph CLI commands accept a ``--cluster`` (cluster name) option. This +option is present only for the sake of backward compatibility. New tools and +deployments cannot be relied upon to accommodate this option. + +If you need to allow multiple clusters to exist on the same host, use +:ref:`cephadm`, which uses containers to fully isolate each cluster. + +.. _Hardware Recommendations: ../../../start/hardware-recommendations +.. _Network Configuration Reference: ../network-config-ref +.. _OSD Config Reference: ../osd-config-ref +.. _Configuring Monitor/OSD Interaction: ../mon-osd-interaction diff --git a/doc/rados/configuration/demo-ceph.conf b/doc/rados/configuration/demo-ceph.conf new file mode 100644 index 000000000..8ba285a42 --- /dev/null +++ b/doc/rados/configuration/demo-ceph.conf @@ -0,0 +1,31 @@ +[global] +fsid = {cluster-id} +mon_initial_members = {hostname}[, {hostname}] +mon_host = {ip-address}[, {ip-address}] + +#All clusters have a front-side public network. +#If you have two network interfaces, you can configure a private / cluster +#network for RADOS object replication, heartbeats, backfill, +#recovery, etc. +public_network = {network}[, {network}] +#cluster_network = {network}[, {network}] + +#Clusters require authentication by default. +auth_cluster_required = cephx +auth_service_required = cephx +auth_client_required = cephx + +#Choose reasonable number of replicas and placement groups. +osd_journal_size = {n} +osd_pool_default_size = {n} # Write an object n times. +osd_pool_default_min_size = {n} # Allow writing n copies in a degraded state. +osd_pool_default_pg_autoscale_mode = {mode} # on, off, or warn +# Only used if autoscaling is off or warn: +osd_pool_default_pg_num = {n} + +#Choose a reasonable crush leaf type. +#0 for a 1-node cluster. +#1 for a multi node cluster in a single rack +#2 for a multi node, multi chassis cluster with multiple hosts in a chassis +#3 for a multi node cluster with hosts across racks, etc. 
+osd_crush_chooseleaf_type = {n} diff --git a/doc/rados/configuration/filestore-config-ref.rst b/doc/rados/configuration/filestore-config-ref.rst new file mode 100644 index 000000000..7aefe26b3 --- /dev/null +++ b/doc/rados/configuration/filestore-config-ref.rst @@ -0,0 +1,377 @@ +============================ + Filestore Config Reference +============================ + +.. note:: Since the Luminous release of Ceph, Filestore has not been Ceph's + default storage back end. Since the Luminous release of Ceph, BlueStore has + been Ceph's default storage back end. However, Filestore OSDs are still + supported up to Quincy. Filestore OSDs are not supported in Reef. See + :ref:`OSD Back Ends <rados_config_storage_devices_osd_backends>`. See + :ref:`BlueStore Migration <rados_operations_bluestore_migration>` for + instructions explaining how to replace an existing Filestore back end with a + BlueStore back end. + + +``filestore_debug_omap_check`` + +:Description: Debugging check on synchronization. Expensive. For debugging only. +:Type: Boolean +:Required: No +:Default: ``false`` + + +.. index:: filestore; extended attributes + +Extended Attributes +=================== + +Extended Attributes (XATTRs) are important for Filestore OSDs. However, Certain +disadvantages can occur when the underlying file system is used for the storage +of XATTRs: some file systems have limits on the number of bytes that can be +stored in XATTRs, and your file system might in some cases therefore run slower +than would an alternative method of storing XATTRs. For this reason, a method +of storing XATTRs extrinsic to the underlying file system might improve +performance. To implement such an extrinsic method, refer to the following +settings. + +If the underlying file system has no size limit, then Ceph XATTRs are stored as +``inline xattr``, using the XATTRs provided by the file system. But if there is +a size limit (for example, ext4 imposes a limit of 4 KB total), then some Ceph +XATTRs will be stored in a key/value database when the limit is reached. More +precisely, this begins to occur when either the +``filestore_max_inline_xattr_size`` or ``filestore_max_inline_xattrs`` +threshold is reached. + + +``filestore_max_inline_xattr_size`` + +:Description: Defines the maximum size per object of an XATTR that can be + stored in the file system (for example, XFS, Btrfs, ext4). The + specified size should not be larger than the file system can + handle. Using the default value of 0 instructs Filestore to use + the value specific to the file system. +:Type: Unsigned 32-bit Integer +:Required: No +:Default: ``0`` + + +``filestore_max_inline_xattr_size_xfs`` + +:Description: Defines the maximum size of an XATTR that can be stored in the + XFS file system. This setting is used only if + ``filestore_max_inline_xattr_size`` == 0. +:Type: Unsigned 32-bit Integer +:Required: No +:Default: ``65536`` + + +``filestore_max_inline_xattr_size_btrfs`` + +:Description: Defines the maximum size of an XATTR that can be stored in the + Btrfs file system. This setting is used only if + ``filestore_max_inline_xattr_size`` == 0. +:Type: Unsigned 32-bit Integer +:Required: No +:Default: ``2048`` + + +``filestore_max_inline_xattr_size_other`` + +:Description: Defines the maximum size of an XATTR that can be stored in other file systems. + This setting is used only if ``filestore_max_inline_xattr_size`` == 0. 
+:Type: Unsigned 32-bit Integer +:Required: No +:Default: ``512`` + + +``filestore_max_inline_xattrs`` + +:Description: Defines the maximum number of XATTRs per object that can be stored in the file system. + Using the default value of 0 instructs Filestore to use the value specific to the file system. +:Type: 32-bit Integer +:Required: No +:Default: ``0`` + + +``filestore_max_inline_xattrs_xfs`` + +:Description: Defines the maximum number of XATTRs per object that can be stored in the XFS file system. + This setting is used only if ``filestore_max_inline_xattrs`` == 0. +:Type: 32-bit Integer +:Required: No +:Default: ``10`` + + +``filestore_max_inline_xattrs_btrfs`` + +:Description: Defines the maximum number of XATTRs per object that can be stored in the Btrfs file system. + This setting is used only if ``filestore_max_inline_xattrs`` == 0. +:Type: 32-bit Integer +:Required: No +:Default: ``10`` + + +``filestore_max_inline_xattrs_other`` + +:Description: Defines the maximum number of XATTRs per object that can be stored in other file systems. + This setting is used only if ``filestore_max_inline_xattrs`` == 0. +:Type: 32-bit Integer +:Required: No +:Default: ``2`` + +.. index:: filestore; synchronization + +Synchronization Intervals +========================= + +Filestore must periodically quiesce writes and synchronize the file system. +Each synchronization creates a consistent commit point. When the commit point +is created, Filestore is able to free all journal entries up to that point. +More-frequent synchronization tends to reduce both synchronization time and +the amount of data that needs to remain in the journal. Less-frequent +synchronization allows the backing file system to coalesce small writes and +metadata updates, potentially increasing synchronization +efficiency but also potentially increasing tail latency. + + +``filestore_max_sync_interval`` + +:Description: Defines the maximum interval (in seconds) for synchronizing Filestore. +:Type: Double +:Required: No +:Default: ``5`` + + +``filestore_min_sync_interval`` + +:Description: Defines the minimum interval (in seconds) for synchronizing Filestore. +:Type: Double +:Required: No +:Default: ``.01`` + + +.. index:: filestore; flusher + +Flusher +======= + +The Filestore flusher forces data from large writes to be written out using +``sync_file_range`` prior to the synchronization. +Ideally, this action reduces the cost of the eventual synchronization. In practice, however, disabling +'filestore_flusher' seems in some cases to improve performance. + + +``filestore_flusher`` + +:Description: Enables the Filestore flusher. +:Type: Boolean +:Required: No +:Default: ``false`` + +.. deprecated:: v.65 + +``filestore_flusher_max_fds`` + +:Description: Defines the maximum number of file descriptors for the flusher. +:Type: Integer +:Required: No +:Default: ``512`` + +.. deprecated:: v.65 + +``filestore_sync_flush`` + +:Description: Enables the synchronization flusher. +:Type: Boolean +:Required: No +:Default: ``false`` + +.. deprecated:: v.65 + +``filestore_fsync_flushes_journal_data`` + +:Description: Flushes journal data during file-system synchronization. +:Type: Boolean +:Required: No +:Default: ``false`` + + +.. index:: filestore; queue + +Queue +===== + +The following settings define limits on the size of the Filestore queue: + +``filestore_queue_max_ops`` + +:Description: Defines the maximum number of in-progress operations that Filestore accepts before it blocks the queueing of any new operations. 
+:Type: Integer +:Required: No. Minimal impact on performance. +:Default: ``50`` + + +``filestore_queue_max_bytes`` + +:Description: Defines the maximum number of bytes permitted per operation. +:Type: Integer +:Required: No +:Default: ``100 << 20`` + + +.. index:: filestore; timeouts + +Timeouts +======== + +``filestore_op_threads`` + +:Description: Defines the number of file-system operation threads that execute in parallel. +:Type: Integer +:Required: No +:Default: ``2`` + + +``filestore_op_thread_timeout`` + +:Description: Defines the timeout (in seconds) for a file-system operation thread. +:Type: Integer +:Required: No +:Default: ``60`` + + +``filestore_op_thread_suicide_timeout`` + +:Description: Defines the timeout (in seconds) for a commit operation before the commit is cancelled. +:Type: Integer +:Required: No +:Default: ``180`` + + +.. index:: filestore; btrfs + +B-Tree Filesystem +================= + + +``filestore_btrfs_snap`` + +:Description: Enables snapshots for a ``btrfs`` Filestore. +:Type: Boolean +:Required: No. Used only for ``btrfs``. +:Default: ``true`` + + +``filestore_btrfs_clone_range`` + +:Description: Enables cloning ranges for a ``btrfs`` Filestore. +:Type: Boolean +:Required: No. Used only for ``btrfs``. +:Default: ``true`` + + +.. index:: filestore; journal + +Journal +======= + + +``filestore_journal_parallel`` + +:Description: Enables parallel journaling, default for ``btrfs``. +:Type: Boolean +:Required: No +:Default: ``false`` + + +``filestore_journal_writeahead`` + +:Description: Enables write-ahead journaling, default for XFS. +:Type: Boolean +:Required: No +:Default: ``false`` + + +``filestore_journal_trailing`` + +:Description: Deprecated. **Never use.** +:Type: Boolean +:Required: No +:Default: ``false`` + + +Misc +==== + + +``filestore_merge_threshold`` + +:Description: Defines the minimum number of files permitted in a subdirectory before the subdirectory is merged into its parent directory. + NOTE: A negative value means that subdirectory merging is disabled. +:Type: Integer +:Required: No +:Default: ``-10`` + + +``filestore_split_multiple`` + +:Description: ``(filestore_split_multiple * abs(filestore_merge_threshold) + (rand() % filestore_split_rand_factor)) * 16`` + is the maximum number of files permitted in a subdirectory + before the subdirectory is split into child directories. + +:Type: Integer +:Required: No +:Default: ``2`` + + +``filestore_split_rand_factor`` + +:Description: A random factor added to the split threshold to avoid + too many (expensive) Filestore splits occurring at the same time. + For details, see ``filestore_split_multiple``. + To change this setting for an existing OSD, it is necessary to take the OSD + offline before running the ``ceph-objectstore-tool apply-layout-settings`` command. + +:Type: Unsigned 32-bit Integer +:Required: No +:Default: ``20`` + + +``filestore_update_to`` + +:Description: Limits automatic upgrades to a specified version of Filestore. Useful in cases in which you want to avoid upgrading to a specific version. +:Type: Integer +:Required: No +:Default: ``1000`` + + +``filestore_blackhole`` + +:Description: Drops any new transactions on the floor, similar to redirecting to NULL. +:Type: Boolean +:Required: No +:Default: ``false`` + + +``filestore_dump_file`` + +:Description: Defines the file that transaction dumps are stored on. +:Type: Boolean +:Required: No +:Default: ``false`` + + +``filestore_kill_at`` + +:Description: Injects a failure at the *n*\th opportunity. 
+:Type: String +:Required: No +:Default: ``false`` + + +``filestore_fail_eio`` + +:Description: Fail/Crash on EIO. +:Type: Boolean +:Required: No +:Default: ``true`` diff --git a/doc/rados/configuration/general-config-ref.rst b/doc/rados/configuration/general-config-ref.rst new file mode 100644 index 000000000..f4613456a --- /dev/null +++ b/doc/rados/configuration/general-config-ref.rst @@ -0,0 +1,19 @@ +========================== + General Config Reference +========================== + +.. confval:: admin_socket + :default: /var/run/ceph/$cluster-$name.asok +.. confval:: pid_file +.. confval:: chdir +.. confval:: fatal_signal_handlers +.. describe:: max_open_files + + If set, when the :term:`Ceph Storage Cluster` starts, Ceph sets + the max open FDs at the OS level (i.e., the max # of file + descriptors). A suitably large value prevents Ceph Daemons from running out + of file descriptors. + + :Type: 64-bit Integer + :Required: No + :Default: ``0`` diff --git a/doc/rados/configuration/index.rst b/doc/rados/configuration/index.rst new file mode 100644 index 000000000..715b999d1 --- /dev/null +++ b/doc/rados/configuration/index.rst @@ -0,0 +1,53 @@ +=============== + Configuration +=============== + +Each Ceph process, daemon, or utility draws its configuration from several +sources on startup. Such sources can include (1) a local configuration, (2) the +monitors, (3) the command line, and (4) environment variables. + +Configuration options can be set globally so that they apply (1) to all +daemons, (2) to all daemons or services of a particular type, or (3) to only a +specific daemon, process, or client. + +.. raw:: html + + <table cellpadding="10"><colgroup><col width="50%"><col width="50%"></colgroup><tbody valign="top"><tr><td><h3>Configuring the Object Store</h3> + +For general object store configuration, refer to the following: + +.. toctree:: + :maxdepth: 1 + + Storage devices <storage-devices> + ceph-conf + + +.. raw:: html + + </td><td><h3>Reference</h3> + +To optimize the performance of your cluster, refer to the following: + +.. toctree:: + :maxdepth: 1 + + Common Settings <common> + Network Settings <network-config-ref> + Messenger v2 protocol <msgr2> + Auth Settings <auth-config-ref> + Monitor Settings <mon-config-ref> + mon-lookup-dns + Heartbeat Settings <mon-osd-interaction> + OSD Settings <osd-config-ref> + DmClock Settings <mclock-config-ref> + BlueStore Settings <bluestore-config-ref> + FileStore Settings <filestore-config-ref> + Journal Settings <journal-ref> + Pool, PG & CRUSH Settings <pool-pg-config-ref.rst> + General Settings <general-config-ref> + + +.. raw:: html + + </td></tr></tbody></table> diff --git a/doc/rados/configuration/journal-ref.rst b/doc/rados/configuration/journal-ref.rst new file mode 100644 index 000000000..5ce5a5e2d --- /dev/null +++ b/doc/rados/configuration/journal-ref.rst @@ -0,0 +1,39 @@ +========================== + Journal Config Reference +========================== +.. warning:: Filestore has been deprecated in the Reef release and is no longer supported. +.. index:: journal; journal configuration + +Filestore OSDs use a journal for two reasons: speed and consistency. Note +that since Luminous, the BlueStore OSD back end has been preferred and default. +This information is provided for pre-existing OSDs and for rare situations where +Filestore is preferred for new deployments. + +- **Speed:** The journal enables the Ceph OSD Daemon to commit small writes + quickly. 
Ceph writes small, random i/o to the journal sequentially, which + tends to speed up bursty workloads by allowing the backing file system more + time to coalesce writes. The Ceph OSD Daemon's journal, however, can lead + to spiky performance with short spurts of high-speed writes followed by + periods without any write progress as the file system catches up to the + journal. + +- **Consistency:** Ceph OSD Daemons require a file system interface that + guarantees atomic compound operations. Ceph OSD Daemons write a description + of the operation to the journal and apply the operation to the file system. + This enables atomic updates to an object (for example, placement group + metadata). Every few seconds--between ``filestore max sync interval`` and + ``filestore min sync interval``--the Ceph OSD Daemon stops writes and + synchronizes the journal with the file system, allowing Ceph OSD Daemons to + trim operations from the journal and reuse the space. On failure, Ceph + OSD Daemons replay the journal starting after the last synchronization + operation. + +Ceph OSD Daemons recognize the following journal settings: + +.. confval:: journal_dio +.. confval:: journal_aio +.. confval:: journal_block_align +.. confval:: journal_max_write_bytes +.. confval:: journal_max_write_entries +.. confval:: journal_align_min_size +.. confval:: journal_zero_on_create diff --git a/doc/rados/configuration/mclock-config-ref.rst b/doc/rados/configuration/mclock-config-ref.rst new file mode 100644 index 000000000..a338aa6da --- /dev/null +++ b/doc/rados/configuration/mclock-config-ref.rst @@ -0,0 +1,699 @@ +======================== + mClock Config Reference +======================== + +.. index:: mclock; configuration + +QoS support in Ceph is implemented using a queuing scheduler based on `the +dmClock algorithm`_. See :ref:`dmclock-qos` section for more details. + +To make the usage of mclock more user-friendly and intuitive, mclock config +profiles are introduced. The mclock profiles mask the low level details from +users, making it easier to configure and use mclock. + +The following input parameters are required for a mclock profile to configure +the QoS related parameters: + +* total capacity (IOPS) of each OSD (determined automatically - + See `OSD Capacity Determination (Automated)`_) + +* the max sequential bandwidth capacity (MiB/s) of each OSD - + See *osd_mclock_max_sequential_bandwidth_[hdd|ssd]* option + +* an mclock profile type to enable + +Using the settings in the specified profile, an OSD determines and applies the +lower-level mclock and Ceph parameters. The parameters applied by the mclock +profile make it possible to tune the QoS between client I/O and background +operations in the OSD. + + +.. index:: mclock; mclock clients + +mClock Client Types +=================== + +The mclock scheduler handles requests from different types of Ceph services. +Each service can be considered as a type of client from mclock's perspective. 
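+
+Before tuning any of the profiles described below, it can be useful to confirm
+that an OSD is actually using the mClock scheduler. A brief sketch (``osd.0``
+is an arbitrary example, and ``osd_op_queue`` is assumed here to be the option
+that selects the active scheduler):
+
+.. prompt:: bash #
+
+   ceph config show osd.0 osd_op_queue
+
+A value of ``mclock_scheduler`` indicates that the settings described in this
+document apply to that OSD.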
+Depending on the type of requests handled, mclock clients are classified into +the buckets as shown in the table below, + ++------------------------+--------------------------------------------------------------+ +| Client Type | Request Types | ++========================+==============================================================+ +| Client | I/O requests issued by external clients of Ceph | ++------------------------+--------------------------------------------------------------+ +| Background recovery | Internal recovery requests | ++------------------------+--------------------------------------------------------------+ +| Background best-effort | Internal backfill, scrub, snap trim and PG deletion requests | ++------------------------+--------------------------------------------------------------+ + +The mclock profiles allocate parameters like reservation, weight and limit +(see :ref:`dmclock-qos`) differently for each client type. The next sections +describe the mclock profiles in greater detail. + + +.. index:: mclock; profile definition + +mClock Profiles - Definition and Purpose +======================================== + +A mclock profile is *“a configuration setting that when applied on a running +Ceph cluster enables the throttling of the operations(IOPS) belonging to +different client classes (background recovery, scrub, snaptrim, client op, +osd subop)”*. + +The mclock profile uses the capacity limits and the mclock profile type selected +by the user to determine the low-level mclock resource control configuration +parameters and apply them transparently. Additionally, other Ceph configuration +parameters are also applied. Please see sections below for more information. + +The low-level mclock resource control parameters are the *reservation*, +*limit*, and *weight* that provide control of the resource shares, as +described in the :ref:`dmclock-qos` section. + + +.. index:: mclock; profile types + +mClock Profile Types +==================== + +mclock profiles can be broadly classified into *built-in* and *custom* profiles, + +Built-in Profiles +----------------- +Users can choose between the following built-in profile types: + +.. note:: The values mentioned in the tables below represent the proportion + of the total IOPS capacity of the OSD allocated for the service type. + +* balanced (default) +* high_client_ops +* high_recovery_ops + +balanced (*default*) +^^^^^^^^^^^^^^^^^^^^ +The *balanced* profile is the default mClock profile. This profile allocates +equal reservation/priority to client operations and background recovery +operations. Background best-effort ops are given lower reservation and therefore +take a longer time to complete when are are competing operations. This profile +helps meet the normal/steady-state requirements of the cluster. This is the +case when external client performance requirement is not critical and there are +other background operations that still need attention within the OSD. + +But there might be instances that necessitate giving higher allocations to either +client ops or recovery ops. In order to deal with such a situation, the alternate +built-in profiles may be enabled by following the steps mentioned in next sections. 
+ ++------------------------+-------------+--------+-------+ +| Service Type | Reservation | Weight | Limit | ++========================+=============+========+=======+ +| client | 50% | 1 | MAX | ++------------------------+-------------+--------+-------+ +| background recovery | 50% | 1 | MAX | ++------------------------+-------------+--------+-------+ +| background best-effort | MIN | 1 | 90% | ++------------------------+-------------+--------+-------+ + +high_client_ops +^^^^^^^^^^^^^^^ +This profile optimizes client performance over background activities by +allocating more reservation and limit to client operations as compared to +background operations in the OSD. This profile, for example, may be enabled +to provide the needed performance for I/O intensive applications for a +sustained period of time at the cost of slower recoveries. The table shows +the resource control parameters set by the profile: + ++------------------------+-------------+--------+-------+ +| Service Type | Reservation | Weight | Limit | ++========================+=============+========+=======+ +| client | 60% | 2 | MAX | ++------------------------+-------------+--------+-------+ +| background recovery | 40% | 1 | MAX | ++------------------------+-------------+--------+-------+ +| background best-effort | MIN | 1 | 70% | ++------------------------+-------------+--------+-------+ + +high_recovery_ops +^^^^^^^^^^^^^^^^^ +This profile optimizes background recovery performance as compared to external +clients and other background operations within the OSD. This profile, for +example, may be enabled by an administrator temporarily to speed-up background +recoveries during non-peak hours. The table shows the resource control +parameters set by the profile: + ++------------------------+-------------+--------+-------+ +| Service Type | Reservation | Weight | Limit | ++========================+=============+========+=======+ +| client | 30% | 1 | MAX | ++------------------------+-------------+--------+-------+ +| background recovery | 70% | 2 | MAX | ++------------------------+-------------+--------+-------+ +| background best-effort | MIN | 1 | MAX | ++------------------------+-------------+--------+-------+ + +.. note:: Across the built-in profiles, internal background best-effort clients + of mclock include "backfill", "scrub", "snap trim", and "pg deletion" + operations. + + +Custom Profile +-------------- +This profile gives users complete control over all the mclock configuration +parameters. This profile should be used with caution and is meant for advanced +users, who understand mclock and Ceph related configuration options. + + +.. index:: mclock; built-in profiles + +mClock Built-in Profiles - Locked Config Options +================================================= +The below sections describe the config options that are locked to certain values +in order to ensure mClock scheduler is able to provide predictable QoS. + +mClock Config Options +--------------------- +.. important:: These defaults cannot be changed using any of the config + subsytem commands like *config set* or via the *config daemon* or *config + tell* interfaces. Although the above command(s) report success, the mclock + QoS parameters are reverted to their respective built-in profile defaults. + +When a built-in profile is enabled, the mClock scheduler calculates the low +level mclock parameters [*reservation*, *weight*, *limit*] based on the profile +enabled for each client type. 
The mclock parameters are calculated based on +the max OSD capacity provided beforehand. As a result, the following mclock +config parameters cannot be modified when using any of the built-in profiles: + +- :confval:`osd_mclock_scheduler_client_res` +- :confval:`osd_mclock_scheduler_client_wgt` +- :confval:`osd_mclock_scheduler_client_lim` +- :confval:`osd_mclock_scheduler_background_recovery_res` +- :confval:`osd_mclock_scheduler_background_recovery_wgt` +- :confval:`osd_mclock_scheduler_background_recovery_lim` +- :confval:`osd_mclock_scheduler_background_best_effort_res` +- :confval:`osd_mclock_scheduler_background_best_effort_wgt` +- :confval:`osd_mclock_scheduler_background_best_effort_lim` + +Recovery/Backfill Options +------------------------- +.. warning:: The recommendation is to not change these options as the built-in + profiles are optimized based on them. Changing these defaults can result in + unexpected performance outcomes. + +The following recovery and backfill related Ceph options are overridden to +mClock defaults: + +- :confval:`osd_max_backfills` +- :confval:`osd_recovery_max_active` +- :confval:`osd_recovery_max_active_hdd` +- :confval:`osd_recovery_max_active_ssd` + +The following table shows the mClock defaults which is the same as the current +defaults. This is done to maximize the performance of the foreground (client) +operations: + ++----------------------------------------+------------------+----------------+ +| Config Option | Original Default | mClock Default | ++========================================+==================+================+ +| :confval:`osd_max_backfills` | 1 | 1 | ++----------------------------------------+------------------+----------------+ +| :confval:`osd_recovery_max_active` | 0 | 0 | ++----------------------------------------+------------------+----------------+ +| :confval:`osd_recovery_max_active_hdd` | 3 | 3 | ++----------------------------------------+------------------+----------------+ +| :confval:`osd_recovery_max_active_ssd` | 10 | 10 | ++----------------------------------------+------------------+----------------+ + +The above mClock defaults, can be modified only if necessary by enabling +:confval:`osd_mclock_override_recovery_settings` (default: false). The +steps for this is discussed in the +`Steps to Modify mClock Max Backfills/Recovery Limits`_ section. + +Sleep Options +------------- +If any mClock profile (including "custom") is active, the following Ceph config +sleep options are disabled (set to 0), + +- :confval:`osd_recovery_sleep` +- :confval:`osd_recovery_sleep_hdd` +- :confval:`osd_recovery_sleep_ssd` +- :confval:`osd_recovery_sleep_hybrid` +- :confval:`osd_scrub_sleep` +- :confval:`osd_delete_sleep` +- :confval:`osd_delete_sleep_hdd` +- :confval:`osd_delete_sleep_ssd` +- :confval:`osd_delete_sleep_hybrid` +- :confval:`osd_snap_trim_sleep` +- :confval:`osd_snap_trim_sleep_hdd` +- :confval:`osd_snap_trim_sleep_ssd` +- :confval:`osd_snap_trim_sleep_hybrid` + +The above sleep options are disabled to ensure that mclock scheduler is able to +determine when to pick the next op from its operation queue and transfer it to +the operation sequencer. This results in the desired QoS being provided across +all its clients. + + +.. index:: mclock; enable built-in profile + +Steps to Enable mClock Profile +============================== + +As already mentioned, the default mclock profile is set to *balanced*. +The other values for the built-in profiles include *high_client_ops* and +*high_recovery_ops*. 
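+
+The profile currently in effect on a given OSD can be checked with the
+``ceph config show`` command; a short sketch (``osd.0`` is an arbitrary
+example):
+
+.. prompt:: bash #
+
+   ceph config show osd.0 osd_mclock_profile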
+ +If there is a requirement to change the default profile, then the option +:confval:`osd_mclock_profile` may be set during runtime by using the following +command: + + .. prompt:: bash # + + ceph config set osd.N osd_mclock_profile <value> + +For example, to change the profile to allow faster recoveries on "osd.0", the +following command can be used to switch to the *high_recovery_ops* profile: + + .. prompt:: bash # + + ceph config set osd.0 osd_mclock_profile high_recovery_ops + +.. note:: The *custom* profile is not recommended unless you are an advanced + user. + +And that's it! You are ready to run workloads on the cluster and check if the +QoS requirements are being met. + + +Switching Between Built-in and Custom Profiles +============================================== + +There may be situations requiring switching from a built-in profile to the +*custom* profile and vice-versa. The following sections outline the steps to +accomplish this. + +Steps to Switch From a Built-in to the Custom Profile +----------------------------------------------------- + +The following command can be used to switch to the *custom* profile: + + .. prompt:: bash # + + ceph config set osd osd_mclock_profile custom + +For example, to change the profile to *custom* on all OSDs, the following +command can be used: + + .. prompt:: bash # + + ceph config set osd osd_mclock_profile custom + +After switching to the *custom* profile, the desired mClock configuration +option may be modified. For example, to change the client reservation IOPS +ratio for a specific OSD (say osd.0) to 0.5 (or 50%), the following command +can be used: + + .. prompt:: bash # + + ceph config set osd.0 osd_mclock_scheduler_client_res 0.5 + +.. important:: Care must be taken to change the reservations of other services + like recovery and background best effort accordingly to ensure that the sum + of the reservations do not exceed the maximum proportion (1.0) of the IOPS + capacity of the OSD. + +.. tip:: The reservation and limit parameter allocations are per-shard based on + the type of backing device (HDD/SSD) under the OSD. See + :confval:`osd_op_num_shards_hdd` and :confval:`osd_op_num_shards_ssd` for + more details. + +Steps to Switch From the Custom Profile to a Built-in Profile +------------------------------------------------------------- + +Switching from the *custom* profile to a built-in profile requires an +intermediate step of removing the custom settings from the central config +database for the changes to take effect. + +The following sequence of commands can be used to switch to a built-in profile: + +#. Set the desired built-in profile using: + + .. prompt:: bash # + + ceph config set osd <mClock Configuration Option> + + For example, to set the built-in profile to ``high_client_ops`` on all + OSDs, run the following command: + + .. prompt:: bash # + + ceph config set osd osd_mclock_profile high_client_ops +#. Determine the existing custom mClock configuration settings in the central + config database using the following command: + + .. prompt:: bash # + + ceph config dump +#. Remove the custom mClock configuration settings determined in the previous + step from the central config database: + + .. prompt:: bash # + + ceph config rm osd <mClock Configuration Option> + + For example, to remove the configuration option + :confval:`osd_mclock_scheduler_client_res` that was set on all OSDs, run the + following command: + + .. prompt:: bash # + + ceph config rm osd osd_mclock_scheduler_client_res +#. 
After all existing custom mClock configuration settings have been removed + from the central config database, the configuration settings pertaining to + ``high_client_ops`` will come into effect. For e.g., to verify the settings + on osd.0 use: + + .. prompt:: bash # + + ceph config show osd.0 + +Switch Temporarily Between mClock Profiles +------------------------------------------ + +To switch between mClock profiles on a temporary basis, the following commands +may be used to override the settings: + +.. warning:: This section is for advanced users or for experimental testing. The + recommendation is to not use the below commands on a running cluster as it + could have unexpected outcomes. + +.. note:: The configuration changes on an OSD using the below commands are + ephemeral and are lost when it restarts. It is also important to note that + the config options overridden using the below commands cannot be modified + further using the *ceph config set osd.N ...* command. The changes will not + take effect until a given OSD is restarted. This is intentional, as per the + config subsystem design. However, any further modification can still be made + ephemerally using the commands mentioned below. + +#. Run the *injectargs* command as shown to override the mclock settings: + + .. prompt:: bash # + + ceph tell osd.N injectargs '--<mClock Configuration Option>=<value>' + + For example, the following command overrides the + :confval:`osd_mclock_profile` option on osd.0: + + .. prompt:: bash # + + ceph tell osd.0 injectargs '--osd_mclock_profile=high_recovery_ops' + + +#. An alternate command that can be used is: + + .. prompt:: bash # + + ceph daemon osd.N config set <mClock Configuration Option> <value> + + For example, the following command overrides the + :confval:`osd_mclock_profile` option on osd.0: + + .. prompt:: bash # + + ceph daemon osd.0 config set osd_mclock_profile high_recovery_ops + +The individual QoS-related config options for the *custom* profile can also be +modified ephemerally using the above commands. + + +Steps to Modify mClock Max Backfills/Recovery Limits +==================================================== + +This section describes the steps to modify the default max backfills or recovery +limits if the need arises. + +.. warning:: This section is for advanced users or for experimental testing. The + recommendation is to retain the defaults as is on a running cluster as + modifying them could have unexpected performance outcomes. The values may + be modified only if the cluster is unable to cope/showing poor performance + with the default settings or for performing experiments on a test cluster. + +.. important:: The max backfill/recovery options that can be modified are listed + in section `Recovery/Backfill Options`_. The modification of the mClock + default backfills/recovery limit is gated by the + :confval:`osd_mclock_override_recovery_settings` option, which is set to + *false* by default. Attempting to modify any default recovery/backfill + limits without setting the gating option will reset that option back to the + mClock defaults along with a warning message logged in the cluster log. Note + that it may take a few seconds for the default value to come back into + effect. Verify the limit using the *config show* command as shown below. + +#. Set the :confval:`osd_mclock_override_recovery_settings` config option on all + osds to *true* using: + + .. prompt:: bash # + + ceph config set osd osd_mclock_override_recovery_settings true + +#. 
Set the desired max backfill/recovery option using: + + .. prompt:: bash # + + ceph config set osd osd_max_backfills <value> + + For example, the following command modifies the :confval:`osd_max_backfills` + option on all osds to 5. + + .. prompt:: bash # + + ceph config set osd osd_max_backfills 5 + +#. Wait for a few seconds and verify the running configuration for a specific + OSD using: + + .. prompt:: bash # + + ceph config show osd.N | grep osd_max_backfills + + For example, the following command shows the running configuration of + :confval:`osd_max_backfills` on osd.0. + + .. prompt:: bash # + + ceph config show osd.0 | grep osd_max_backfills + +#. Reset the :confval:`osd_mclock_override_recovery_settings` config option on + all osds to *false* using: + + .. prompt:: bash # + + ceph config set osd osd_mclock_override_recovery_settings false + + +OSD Capacity Determination (Automated) +====================================== + +The OSD capacity in terms of total IOPS is determined automatically during OSD +initialization. This is achieved by running the OSD bench tool and overriding +the default value of ``osd_mclock_max_capacity_iops_[hdd, ssd]`` option +depending on the device type. No other action/input is expected from the user +to set the OSD capacity. + +.. note:: If you wish to manually benchmark OSD(s) or manually tune the + Bluestore throttle parameters, see section + `Steps to Manually Benchmark an OSD (Optional)`_. + +You may verify the capacity of an OSD after the cluster is brought up by using +the following command: + + .. prompt:: bash # + + ceph config show osd.N osd_mclock_max_capacity_iops_[hdd, ssd] + +For example, the following command shows the max capacity for "osd.0" on a Ceph +node whose underlying device type is SSD: + + .. prompt:: bash # + + ceph config show osd.0 osd_mclock_max_capacity_iops_ssd + +Mitigation of Unrealistic OSD Capacity From Automated Test +---------------------------------------------------------- +In certain conditions, the OSD bench tool may show unrealistic/inflated result +depending on the drive configuration and other environment related conditions. +To mitigate the performance impact due to this unrealistic capacity, a couple +of threshold config options depending on the osd's device type are defined and +used: + +- :confval:`osd_mclock_iops_capacity_threshold_hdd` = 500 +- :confval:`osd_mclock_iops_capacity_threshold_ssd` = 80000 + +The following automated step is performed: + +Fallback to using default OSD capacity (automated) +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +If OSD bench reports a measurement that exceeds the above threshold values +depending on the underlying device type, the fallback mechanism reverts to the +default value of :confval:`osd_mclock_max_capacity_iops_hdd` or +:confval:`osd_mclock_max_capacity_iops_ssd`. The threshold config options +can be reconfigured based on the type of drive used. Additionally, a cluster +warning is logged in case the measurement exceeds the threshold. For example, :: + + 2022-10-27T15:30:23.270+0000 7f9b5dbe95c0 0 log_channel(cluster) log [WRN] + : OSD bench result of 39546.479392 IOPS exceeded the threshold limit of + 25000.000000 IOPS for osd.1. IOPS capacity is unchanged at 21500.000000 + IOPS. The recommendation is to establish the osd's IOPS capacity using other + benchmark tools (e.g. Fio) and then override + osd_mclock_max_capacity_iops_[hdd|ssd]. 
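+
+If the threshold itself is judged too conservative for the drives in use, it
+can be adjusted for the relevant device type; a sketch (the device type and
+value shown are purely illustrative):
+
+.. prompt:: bash #
+
+   ceph config set osd osd_mclock_iops_capacity_threshold_ssd 100000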
+ +If the default capacity doesn't accurately represent the OSD's capacity, the +following additional step is recommended to address this: + +Run custom drive benchmark if defaults are not accurate (manual) +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +If the default OSD capacity is not accurate, the recommendation is to run a +custom benchmark using your preferred tool (e.g. Fio) on the drive and then +override the ``osd_mclock_max_capacity_iops_[hdd, ssd]`` option as described +in the `Specifying Max OSD Capacity`_ section. + +This step is highly recommended until an alternate mechansim is worked upon. + +Steps to Manually Benchmark an OSD (Optional) +============================================= + +.. note:: These steps are only necessary if you want to override the OSD + capacity already determined automatically during OSD initialization. + Otherwise, you may skip this section entirely. + +.. tip:: If you have already determined the benchmark data and wish to manually + override the max osd capacity for an OSD, you may skip to section + `Specifying Max OSD Capacity`_. + + +Any existing benchmarking tool (e.g. Fio) can be used for this purpose. In this +case, the steps use the *Ceph OSD Bench* command described in the next section. +Regardless of the tool/command used, the steps outlined further below remain the +same. + +As already described in the :ref:`dmclock-qos` section, the number of +shards and the bluestore's throttle parameters have an impact on the mclock op +queues. Therefore, it is critical to set these values carefully in order to +maximize the impact of the mclock scheduler. + +:Number of Operational Shards: + We recommend using the default number of shards as defined by the + configuration options ``osd_op_num_shards``, ``osd_op_num_shards_hdd``, and + ``osd_op_num_shards_ssd``. In general, a lower number of shards will increase + the impact of the mclock queues. + +:Bluestore Throttle Parameters: + We recommend using the default values as defined by + :confval:`bluestore_throttle_bytes` and + :confval:`bluestore_throttle_deferred_bytes`. But these parameters may also be + determined during the benchmarking phase as described below. + +OSD Bench Command Syntax +------------------------ + +The :ref:`osd-subsystem` section describes the OSD bench command. The syntax +used for benchmarking is shown below : + +.. prompt:: bash # + + ceph tell osd.N bench [TOTAL_BYTES] [BYTES_PER_WRITE] [OBJ_SIZE] [NUM_OBJS] + +where, + +* ``TOTAL_BYTES``: Total number of bytes to write +* ``BYTES_PER_WRITE``: Block size per write +* ``OBJ_SIZE``: Bytes per object +* ``NUM_OBJS``: Number of objects to write + +Benchmarking Test Steps Using OSD Bench +--------------------------------------- + +The steps below use the default shards and detail the steps used to determine +the correct bluestore throttle values (optional). + +#. Bring up your Ceph cluster and login to the Ceph node hosting the OSDs that + you wish to benchmark. +#. Run a simple 4KiB random write workload on an OSD using the following + commands: + + .. note:: Note that before running the test, caches must be cleared to get an + accurate measurement. + + For example, if you are running the benchmark test on osd.0, run the following + commands: + + .. prompt:: bash # + + ceph tell osd.0 cache drop + + .. prompt:: bash # + + ceph tell osd.0 bench 12288000 4096 4194304 100 + +#. Note the overall throughput(IOPS) obtained from the output of the osd bench + command. 
This value is the baseline throughput(IOPS) when the default + bluestore throttle options are in effect. +#. If the intent is to determine the bluestore throttle values for your + environment, then set the two options, :confval:`bluestore_throttle_bytes` + and :confval:`bluestore_throttle_deferred_bytes` to 32 KiB(32768 Bytes) each + to begin with. Otherwise, you may skip to the next section. +#. Run the 4KiB random write test as before using OSD bench. +#. Note the overall throughput from the output and compare the value + against the baseline throughput recorded in step 3. +#. If the throughput doesn't match with the baseline, increment the bluestore + throttle options by 2x and repeat steps 5 through 7 until the obtained + throughput is very close to the baseline value. + +For example, during benchmarking on a machine with NVMe SSDs, a value of 256 KiB +for both bluestore throttle and deferred bytes was determined to maximize the +impact of mclock. For HDDs, the corresponding value was 40 MiB, where the +overall throughput was roughly equal to the baseline throughput. Note that in +general for HDDs, the bluestore throttle values are expected to be higher when +compared to SSDs. + + +Specifying Max OSD Capacity +---------------------------- + +The steps in this section may be performed only if you want to override the +max osd capacity automatically set during OSD initialization. The option +``osd_mclock_max_capacity_iops_[hdd, ssd]`` for an OSD can be set by running the +following command: + + .. prompt:: bash # + + ceph config set osd.N osd_mclock_max_capacity_iops_[hdd,ssd] <value> + +For example, the following command sets the max capacity for a specific OSD +(say "osd.0") whose underlying device type is HDD to 350 IOPS: + + .. prompt:: bash # + + ceph config set osd.0 osd_mclock_max_capacity_iops_hdd 350 + +Alternatively, you may specify the max capacity for OSDs within the Ceph +configuration file under the respective [osd.N] section. See +:ref:`ceph-conf-settings` for more details. + + +.. index:: mclock; config settings + +mClock Config Options +===================== + +.. confval:: osd_mclock_profile +.. confval:: osd_mclock_max_capacity_iops_hdd +.. confval:: osd_mclock_max_capacity_iops_ssd +.. confval:: osd_mclock_max_sequential_bandwidth_hdd +.. confval:: osd_mclock_max_sequential_bandwidth_ssd +.. confval:: osd_mclock_force_run_benchmark_on_init +.. confval:: osd_mclock_skip_benchmark +.. confval:: osd_mclock_override_recovery_settings +.. confval:: osd_mclock_iops_capacity_threshold_hdd +.. confval:: osd_mclock_iops_capacity_threshold_ssd + +.. _the dmClock algorithm: https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Gulati.pdf diff --git a/doc/rados/configuration/mon-config-ref.rst b/doc/rados/configuration/mon-config-ref.rst new file mode 100644 index 000000000..e0a12d093 --- /dev/null +++ b/doc/rados/configuration/mon-config-ref.rst @@ -0,0 +1,642 @@ +.. _monitor-config-reference: + +========================== + Monitor Config Reference +========================== + +Understanding how to configure a :term:`Ceph Monitor` is an important part of +building a reliable :term:`Ceph Storage Cluster`. **All Ceph Storage Clusters +have at least one monitor**. The monitor complement usually remains fairly +consistent, but you can add, remove or replace a monitor in a cluster. See +`Adding/Removing a Monitor`_ for details. + + +.. index:: Ceph Monitor; Paxos + +Background +========== + +Ceph Monitors maintain a "master copy" of the :term:`Cluster Map`. 
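+
+The maps held by the monitors can be inspected directly from the command line;
+an illustrative sketch (both commands are read-only):
+
+.. prompt:: bash $
+
+   ceph mon dump
+   ceph osd dump
+
+Here ``ceph mon dump`` prints the current monitor map and ``ceph osd dump``
+prints the current OSD map.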
+ +The :term:`Cluster Map` makes it possible for :term:`Ceph client`\s to +determine the location of all Ceph Monitors, Ceph OSD Daemons, and Ceph +Metadata Servers. Clients do this by connecting to one Ceph Monitor and +retrieving a current cluster map. Ceph clients must connect to a Ceph Monitor +before they can read from or write to Ceph OSD Daemons or Ceph Metadata +Servers. A Ceph client that has a current copy of the cluster map and the CRUSH +algorithm can compute the location of any RADOS object within the cluster. This +makes it possible for Ceph clients to talk directly to Ceph OSD Daemons. Direct +communication between clients and Ceph OSD Daemons improves upon traditional +storage architectures that required clients to communicate with a central +component. See `Scalability and High Availability`_ for more on this subject. + +The Ceph Monitor's primary function is to maintain a master copy of the cluster +map. Monitors also provide authentication and logging services. All changes in +the monitor services are written by the Ceph Monitor to a single Paxos +instance, and Paxos writes the changes to a key/value store. This provides +strong consistency. Ceph Monitors are able to query the most recent version of +the cluster map during sync operations, and they use the key/value store's +snapshots and iterators (using RocksDB) to perform store-wide synchronization. + +.. ditaa:: + /-------------\ /-------------\ + | Monitor | Write Changes | Paxos | + | cCCC +-------------->+ cCCC | + | | | | + +-------------+ \------+------/ + | Auth | | + +-------------+ | Write Changes + | Log | | + +-------------+ v + | Monitor Map | /------+------\ + +-------------+ | Key / Value | + | OSD Map | | Store | + +-------------+ | cCCC | + | PG Map | \------+------/ + +-------------+ ^ + | MDS Map | | Read Changes + +-------------+ | + | cCCC |*---------------------+ + \-------------/ + +.. index:: Ceph Monitor; cluster map + +Cluster Maps +------------ + +The cluster map is a composite of maps, including the monitor map, the OSD map, +the placement group map and the metadata server map. The cluster map tracks a +number of important things: which processes are ``in`` the Ceph Storage Cluster; +which processes that are ``in`` the Ceph Storage Cluster are ``up`` and running +or ``down``; whether, the placement groups are ``active`` or ``inactive``, and +``clean`` or in some other state; and, other details that reflect the current +state of the cluster such as the total amount of storage space, and the amount +of storage used. + +When there is a significant change in the state of the cluster--e.g., a Ceph OSD +Daemon goes down, a placement group falls into a degraded state, etc.--the +cluster map gets updated to reflect the current state of the cluster. +Additionally, the Ceph Monitor also maintains a history of the prior states of +the cluster. The monitor map, OSD map, placement group map and metadata server +map each maintain a history of their map versions. We call each version an +"epoch." + +When operating your Ceph Storage Cluster, keeping track of these states is an +important part of your system administration duties. See `Monitoring a Cluster`_ +and `Monitoring OSDs and PGs`_ for additional details. + +.. index:: high availability; quorum + +Monitor Quorum +-------------- + +Our Configuring ceph section provides a trivial `Ceph configuration file`_ that +provides for one monitor in the test cluster. 
A cluster will run fine with a +single monitor; however, **a single monitor is a single-point-of-failure**. To +ensure high availability in a production Ceph Storage Cluster, you should run +Ceph with multiple monitors so that the failure of a single monitor **WILL NOT** +bring down your entire cluster. + +When a Ceph Storage Cluster runs multiple Ceph Monitors for high availability, +Ceph Monitors use `Paxos`_ to establish consensus about the master cluster map. +A consensus requires a majority of monitors running to establish a quorum for +consensus about the cluster map (e.g., 1; 2 out of 3; 3 out of 5; 4 out of 6; +etc.). + +.. confval:: mon_force_quorum_join + +.. index:: Ceph Monitor; consistency + +Consistency +----------- + +When you add monitor settings to your Ceph configuration file, you need to be +aware of some of the architectural aspects of Ceph Monitors. **Ceph imposes +strict consistency requirements** for a Ceph monitor when discovering another +Ceph Monitor within the cluster. Whereas, Ceph Clients and other Ceph daemons +use the Ceph configuration file to discover monitors, monitors discover each +other using the monitor map (monmap), not the Ceph configuration file. + +A Ceph Monitor always refers to the local copy of the monmap when discovering +other Ceph Monitors in the Ceph Storage Cluster. Using the monmap instead of the +Ceph configuration file avoids errors that could break the cluster (e.g., typos +in ``ceph.conf`` when specifying a monitor address or port). Since monitors use +monmaps for discovery and they share monmaps with clients and other Ceph +daemons, **the monmap provides monitors with a strict guarantee that their +consensus is valid.** + +Strict consistency also applies to updates to the monmap. As with any other +updates on the Ceph Monitor, changes to the monmap always run through a +distributed consensus algorithm called `Paxos`_. The Ceph Monitors must agree on +each update to the monmap, such as adding or removing a Ceph Monitor, to ensure +that each monitor in the quorum has the same version of the monmap. Updates to +the monmap are incremental so that Ceph Monitors have the latest agreed upon +version, and a set of previous versions. Maintaining a history enables a Ceph +Monitor that has an older version of the monmap to catch up with the current +state of the Ceph Storage Cluster. + +If Ceph Monitors were to discover each other through the Ceph configuration file +instead of through the monmap, additional risks would be introduced because +Ceph configuration files are not updated and distributed automatically. Ceph +Monitors might inadvertently use an older Ceph configuration file, fail to +recognize a Ceph Monitor, fall out of a quorum, or develop a situation where +`Paxos`_ is not able to determine the current state of the system accurately. + + +.. index:: Ceph Monitor; bootstrapping monitors + +Bootstrapping Monitors +---------------------- + +In most configuration and deployment cases, tools that deploy Ceph help +bootstrap the Ceph Monitors by generating a monitor map for you (e.g., +``cephadm``, etc). A Ceph Monitor requires a few explicit +settings: + +- **Filesystem ID**: The ``fsid`` is the unique identifier for your + object store. Since you can run multiple clusters on the same + hardware, you must specify the unique ID of the object store when + bootstrapping a monitor. Deployment tools usually do this for you + (e.g., ``cephadm`` can call a tool like ``uuidgen``), but you + may specify the ``fsid`` manually too. 
+ +- **Monitor ID**: A monitor ID is a unique ID assigned to each monitor within + the cluster. It is an alphanumeric value, and by convention the identifier + usually follows an alphabetical increment (e.g., ``a``, ``b``, etc.). This + can be set in a Ceph configuration file (e.g., ``[mon.a]``, ``[mon.b]``, etc.), + by a deployment tool, or using the ``ceph`` commandline. + +- **Keys**: The monitor must have secret keys. A deployment tool such as + ``cephadm`` usually does this for you, but you may + perform this step manually too. See `Monitor Keyrings`_ for details. + +For additional details on bootstrapping, see `Bootstrapping a Monitor`_. + +.. index:: Ceph Monitor; configuring monitors + +Configuring Monitors +==================== + +To apply configuration settings to the entire cluster, enter the configuration +settings under ``[global]``. To apply configuration settings to all monitors in +your cluster, enter the configuration settings under ``[mon]``. To apply +configuration settings to specific monitors, specify the monitor instance +(e.g., ``[mon.a]``). By convention, monitor instance names use alpha notation. + +.. code-block:: ini + + [global] + + [mon] + + [mon.a] + + [mon.b] + + [mon.c] + + +Minimum Configuration +--------------------- + +The bare minimum monitor settings for a Ceph monitor via the Ceph configuration +file include a hostname and a network address for each monitor. You can configure +these under ``[mon]`` or under the entry for a specific monitor. + +.. code-block:: ini + + [global] + mon_host = 10.0.0.2,10.0.0.3,10.0.0.4 + +.. code-block:: ini + + [mon.a] + host = hostname1 + mon_addr = 10.0.0.10:6789 + +See the `Network Configuration Reference`_ for details. + +.. note:: This minimum configuration for monitors assumes that a deployment + tool generates the ``fsid`` and the ``mon.`` key for you. + +Once you deploy a Ceph cluster, you **SHOULD NOT** change the IP addresses of +monitors. However, if you decide to change the monitor's IP address, you +must follow a specific procedure. See :ref:`Changing a Monitor's IP address` for +details. + +Monitors can also be found by clients by using DNS SRV records. See `Monitor lookup through DNS`_ for details. + +Cluster ID +---------- + +Each Ceph Storage Cluster has a unique identifier (``fsid``). If specified, it +usually appears under the ``[global]`` section of the configuration file. +Deployment tools usually generate the ``fsid`` and store it in the monitor map, +so the value may not appear in a configuration file. The ``fsid`` makes it +possible to run daemons for multiple clusters on the same hardware. + +.. confval:: fsid + +.. index:: Ceph Monitor; initial members + +Initial Members +--------------- + +We recommend running a production Ceph Storage Cluster with at least three Ceph +Monitors to ensure high availability. When you run multiple monitors, you may +specify the initial monitors that must be members of the cluster in order to +establish a quorum. This may reduce the time it takes for your cluster to come +online. + +.. code-block:: ini + + [mon] + mon_initial_members = a,b,c + + +.. confval:: mon_initial_members + +.. index:: Ceph Monitor; data path + +Data +---- + +Ceph provides a default path where Ceph Monitors store data. For optimal +performance in a production Ceph Storage Cluster, we recommend running Ceph +Monitors on separate hosts and drives from Ceph OSD Daemons. 
Because leveldb uses ``mmap()`` for writing data, Ceph Monitors flush their
data from memory to disk very frequently, which can interfere with Ceph OSD
Daemon workloads if the data store is co-located with the OSD Daemons.

In Ceph versions 0.58 and earlier, Ceph Monitors stored their data in plain
files. This approach allowed users to inspect monitor data with common tools
like ``ls`` and ``cat``, but it did not provide strong consistency.

In Ceph versions 0.59 and later, Ceph Monitors store their data as key/value
pairs. Ceph Monitors require `ACID`_ transactions. Using a data store prevents
recovering Ceph Monitors from running corrupted versions through Paxos, and it
enables multiple modification operations in a single atomic batch, among other
advantages.

Generally, we do not recommend changing the default data location. If you
modify the default location, we recommend that you make it uniform across Ceph
Monitors by setting it in the ``[mon]`` section of the configuration file.


.. confval:: mon_data
.. confval:: mon_data_size_warn
.. confval:: mon_data_avail_warn
.. confval:: mon_data_avail_crit
.. confval:: mon_warn_on_crush_straw_calc_version_zero
.. confval:: mon_warn_on_legacy_crush_tunables
.. confval:: mon_crush_min_required_version
.. confval:: mon_warn_on_osd_down_out_interval_zero
.. confval:: mon_warn_on_slow_ping_ratio
.. confval:: mon_warn_on_slow_ping_time
.. confval:: mon_warn_on_pool_no_redundancy
.. confval:: mon_cache_target_full_warn_ratio
.. confval:: mon_health_to_clog
.. confval:: mon_health_to_clog_tick_interval
.. confval:: mon_health_to_clog_interval

.. index:: Ceph Storage Cluster; capacity planning, Ceph Monitor; capacity planning

.. _storage-capacity:

Storage Capacity
----------------

When a Ceph Storage Cluster gets close to its maximum capacity (see
``mon_osd_full_ratio``), Ceph prevents you from writing to or reading from OSDs
as a safety measure to prevent data loss. Therefore, letting a production Ceph
Storage Cluster approach its full ratio is not a good practice, because it
sacrifices high availability. The default full ratio is ``.95``, or 95% of
capacity. This is a very aggressive setting for a test cluster with a small
number of OSDs.

.. tip:: When monitoring your cluster, be alert to warnings related to the
   ``nearfull`` ratio. A ``nearfull`` warning means that the failure of one or
   more OSDs could result in a temporary service disruption. Consider adding
   more OSDs to increase storage capacity.

A common scenario for test clusters involves a system administrator removing an
OSD from the Ceph Storage Cluster, watching the cluster rebalance, then removing
another OSD, and another, until at least one OSD eventually reaches the full
ratio and the cluster locks up. We recommend a bit of capacity planning even
with a test cluster. Planning enables you to gauge how much spare capacity you
will need in order to maintain high availability. Ideally, you want to plan for
a series of Ceph OSD Daemon failures where the cluster can recover to an
``active+clean`` state without replacing those OSDs immediately. Cluster
operation continues in the ``active+degraded`` state, but this is not ideal for
normal operation and should be addressed promptly.

The following diagram depicts a simplified Ceph Storage Cluster containing 33
Ceph Nodes with one OSD per host, each OSD reading from and writing to a 3TB
drive.
This exemplary Ceph Storage Cluster therefore has a maximum actual capacity of
99TB. With a ``mon_osd_full_ratio`` of ``0.95``, if the Ceph Storage Cluster
falls to 5TB of remaining capacity, the cluster will not allow Ceph Clients to
read or write data. The Ceph Storage Cluster's operating capacity is therefore
roughly 94TB, not 99TB.

.. ditaa::

   +--------+  +--------+  +--------+  +--------+  +--------+  +--------+
   | Rack 1 |  | Rack 2 |  | Rack 3 |  | Rack 4 |  | Rack 5 |  | Rack 6 |
   | cCCC   |  | cF00   |  | cCCC   |  | cCCC   |  | cCCC   |  | cCCC   |
   +--------+  +--------+  +--------+  +--------+  +--------+  +--------+
   | OSD 1  |  | OSD 7  |  | OSD 13 |  | OSD 19 |  | OSD 25 |  | OSD 31 |
   +--------+  +--------+  +--------+  +--------+  +--------+  +--------+
   | OSD 2  |  | OSD 8  |  | OSD 14 |  | OSD 20 |  | OSD 26 |  | OSD 32 |
   +--------+  +--------+  +--------+  +--------+  +--------+  +--------+
   | OSD 3  |  | OSD 9  |  | OSD 15 |  | OSD 21 |  | OSD 27 |  | OSD 33 |
   +--------+  +--------+  +--------+  +--------+  +--------+  +--------+
   | OSD 4  |  | OSD 10 |  | OSD 16 |  | OSD 22 |  | OSD 28 |  | Spare  |
   +--------+  +--------+  +--------+  +--------+  +--------+  +--------+
   | OSD 5  |  | OSD 11 |  | OSD 17 |  | OSD 23 |  | OSD 29 |  | Spare  |
   +--------+  +--------+  +--------+  +--------+  +--------+  +--------+
   | OSD 6  |  | OSD 12 |  | OSD 18 |  | OSD 24 |  | OSD 30 |  | Spare  |
   +--------+  +--------+  +--------+  +--------+  +--------+  +--------+

It is normal in such a cluster for one or two OSDs to fail. A less frequent but
reasonable scenario involves a rack's router or power supply failing, which
brings down multiple OSDs simultaneously (e.g., OSDs 7-12). In such a scenario,
you should still strive for a cluster that can remain operational and achieve an
``active + clean`` state--even if that means adding a few hosts with additional
OSDs in short order. If your capacity utilization is too high, you may not lose
data, but you could still sacrifice data availability while resolving an outage
within a failure domain if capacity utilization of the cluster exceeds the full
ratio. For this reason, we recommend at least some rough capacity planning.

Identify two numbers for your cluster:

#. The number of OSDs.
#. The total capacity of the cluster.

If you divide the total capacity of your cluster by the number of OSDs in your
cluster, you will find the mean capacity of an OSD within your cluster.
Consider multiplying that number by the number of OSDs you expect will fail
simultaneously during normal operations (a relatively small number). Finally,
multiply the capacity of the cluster by the full ratio to arrive at a maximum
operating capacity; then subtract the amount of data on the OSDs you expect to
fail to arrive at a reasonable full ratio. Repeat the foregoing process with a
higher number of OSD failures (e.g., a rack of OSDs) to arrive at a reasonable
number for a near full ratio. For example, in the 33-OSD cluster above the mean
OSD capacity is 3TB; planning for two simultaneous OSD failures gives
99TB * 0.95 - (2 * 3TB), or roughly 88TB of usable capacity, which corresponds
to a full ratio of about 0.89.

The following settings only apply on cluster creation and are then stored in
the OSDMap. To clarify, in normal operation the values that are used by OSDs
are those found in the OSDMap, not those in the configuration file or central
config store.

.. code-block:: ini

   [global]
   mon_osd_full_ratio = .80
   mon_osd_backfillfull_ratio = .75
   mon_osd_nearfull_ratio = .70


``mon_osd_full_ratio``

:Description: The threshold percentage of device space utilized before an OSD is
              considered ``full``.
+ +:Type: Float +:Default: ``0.95`` + + +``mon_osd_backfillfull_ratio`` + +:Description: The threshold percentage of device space utilized before an OSD is + considered too ``full`` to backfill. + +:Type: Float +:Default: ``0.90`` + + +``mon_osd_nearfull_ratio`` + +:Description: The threshold percentage of device space used before an OSD is + considered ``nearfull``. + +:Type: Float +:Default: ``0.85`` + + +.. tip:: If some OSDs are nearfull, but others have plenty of capacity, you + may have an inaccurate CRUSH weight set for the nearfull OSDs. + +.. tip:: These settings only apply during cluster creation. Afterwards they need + to be changed in the OSDMap using ``ceph osd set-nearfull-ratio`` and + ``ceph osd set-full-ratio`` + +.. index:: heartbeat + +Heartbeat +--------- + +Ceph monitors know about the cluster by requiring reports from each OSD, and by +receiving reports from OSDs about the status of their neighboring OSDs. Ceph +provides reasonable default settings for monitor/OSD interaction; however, you +may modify them as needed. See `Monitor/OSD Interaction`_ for details. + + +.. index:: Ceph Monitor; leader, Ceph Monitor; provider, Ceph Monitor; requester, Ceph Monitor; synchronization + +Monitor Store Synchronization +----------------------------- + +When you run a production cluster with multiple monitors (recommended), each +monitor checks to see if a neighboring monitor has a more recent version of the +cluster map (e.g., a map in a neighboring monitor with one or more epoch numbers +higher than the most current epoch in the map of the instant monitor). +Periodically, one monitor in the cluster may fall behind the other monitors to +the point where it must leave the quorum, synchronize to retrieve the most +current information about the cluster, and then rejoin the quorum. For the +purposes of synchronization, monitors may assume one of three roles: + +#. **Leader**: The `Leader` is the first monitor to achieve the most recent + Paxos version of the cluster map. + +#. **Provider**: The `Provider` is a monitor that has the most recent version + of the cluster map, but wasn't the first to achieve the most recent version. + +#. **Requester:** A `Requester` is a monitor that has fallen behind the leader + and must synchronize in order to retrieve the most recent information about + the cluster before it can rejoin the quorum. + +These roles enable a leader to delegate synchronization duties to a provider, +which prevents synchronization requests from overloading the leader--improving +performance. In the following diagram, the requester has learned that it has +fallen behind the other monitors. The requester asks the leader to synchronize, +and the leader tells the requester to synchronize with a provider. + + +.. ditaa:: + +-----------+ +---------+ +----------+ + | Requester | | Leader | | Provider | + +-----------+ +---------+ +----------+ + | | | + | | | + | Ask to Synchronize | | + |------------------->| | + | | | + |<-------------------| | + | Tell Requester to | | + | Sync with Provider | | + | | | + | Synchronize | + |--------------------+-------------------->| + | | | + |<-------------------+---------------------| + | Send Chunk to Requester | + | (repeat as necessary) | + | Requester Acks Chuck to Provider | + |--------------------+-------------------->| + | | + | Sync Complete | + | Notification | + |------------------->| + | | + |<-------------------| + | Ack | + | | + + +Synchronization always occurs when a new monitor joins the cluster. 
During +runtime operations, monitors may receive updates to the cluster map at different +times. This means the leader and provider roles may migrate from one monitor to +another. If this happens while synchronizing (e.g., a provider falls behind the +leader), the provider can terminate synchronization with a requester. + +Once synchronization is complete, Ceph performs trimming across the cluster. +Trimming requires that the placement groups are ``active+clean``. + + +.. confval:: mon_sync_timeout +.. confval:: mon_sync_max_payload_size +.. confval:: paxos_max_join_drift +.. confval:: paxos_stash_full_interval +.. confval:: paxos_propose_interval +.. confval:: paxos_min +.. confval:: paxos_min_wait +.. confval:: paxos_trim_min +.. confval:: paxos_trim_max +.. confval:: paxos_service_trim_min +.. confval:: paxos_service_trim_max +.. confval:: paxos_service_trim_max_multiplier +.. confval:: mon_mds_force_trim_to +.. confval:: mon_osd_force_trim_to +.. confval:: mon_osd_cache_size +.. confval:: mon_election_timeout +.. confval:: mon_lease +.. confval:: mon_lease_renew_interval_factor +.. confval:: mon_lease_ack_timeout_factor +.. confval:: mon_accept_timeout_factor +.. confval:: mon_min_osdmap_epochs +.. confval:: mon_max_log_epochs + + +.. index:: Ceph Monitor; clock + +.. _mon-config-ref-clock: + +Clock +----- + +Ceph daemons pass critical messages to each other, which must be processed +before daemons reach a timeout threshold. If the clocks in Ceph monitors +are not synchronized, it can lead to a number of anomalies. For example: + +- Daemons ignoring received messages (e.g., timestamps outdated) +- Timeouts triggered too soon/late when a message wasn't received in time. + +See `Monitor Store Synchronization`_ for details. + + +.. tip:: You must configure NTP or PTP daemons on your Ceph monitor hosts to + ensure that the monitor cluster operates with synchronized clocks. + It can be advantageous to have monitor hosts sync with each other + as well as with multiple quality upstream time sources. + +Clock drift may still be noticeable with NTP even though the discrepancy is not +yet harmful. Ceph's clock drift / clock skew warnings may get triggered even +though NTP maintains a reasonable level of synchronization. Increasing your +clock drift may be tolerable under such circumstances; however, a number of +factors such as workload, network latency, configuring overrides to default +timeouts and the `Monitor Store Synchronization`_ settings may influence +the level of acceptable clock drift without compromising Paxos guarantees. + +Ceph provides the following tunable options to allow you to find +acceptable values. + +.. confval:: mon_tick_interval +.. confval:: mon_clock_drift_allowed +.. confval:: mon_clock_drift_warn_backoff +.. confval:: mon_timecheck_interval +.. confval:: mon_timecheck_skew_interval + +Client +------ + +.. confval:: mon_client_hunt_interval +.. confval:: mon_client_ping_interval +.. confval:: mon_client_max_log_entries_per_message +.. confval:: mon_client_bytes + +.. _pool-settings: + +Pool settings +============= + +Since version v0.94 there is support for pool flags which allow or disallow changes to be made to pools. +Monitors can also disallow removal of pools if appropriately configured. The inconvenience of this guardrail +is far outweighed by the number of accidental pool (and thus data) deletions it prevents. + +.. confval:: mon_allow_pool_delete +.. confval:: osd_pool_default_ec_fast_read +.. confval:: osd_pool_default_flag_hashpspool +.. 
confval:: osd_pool_default_flag_nodelete +.. confval:: osd_pool_default_flag_nopgchange +.. confval:: osd_pool_default_flag_nosizechange + +For more information about the pool flags see :ref:`Pool values <setpoolvalues>`. + +Miscellaneous +============= + +.. confval:: mon_max_osd +.. confval:: mon_globalid_prealloc +.. confval:: mon_subscribe_interval +.. confval:: mon_stat_smooth_intervals +.. confval:: mon_probe_timeout +.. confval:: mon_daemon_bytes +.. confval:: mon_max_log_entries_per_event +.. confval:: mon_osd_prime_pg_temp +.. confval:: mon_osd_prime_pg_temp_max_time +.. confval:: mon_osd_prime_pg_temp_max_estimate +.. confval:: mon_mds_skip_sanity +.. confval:: mon_max_mdsmap_epochs +.. confval:: mon_config_key_max_entry_size +.. confval:: mon_scrub_interval +.. confval:: mon_scrub_max_keys +.. confval:: mon_compact_on_start +.. confval:: mon_compact_on_bootstrap +.. confval:: mon_compact_on_trim +.. confval:: mon_cpu_threads +.. confval:: mon_osd_mapping_pgs_per_chunk +.. confval:: mon_session_timeout +.. confval:: mon_osd_cache_size_min +.. confval:: mon_memory_target +.. confval:: mon_memory_autotune + +.. _Paxos: https://en.wikipedia.org/wiki/Paxos_(computer_science) +.. _Monitor Keyrings: ../../../dev/mon-bootstrap#secret-keys +.. _Ceph configuration file: ../ceph-conf/#monitors +.. _Network Configuration Reference: ../network-config-ref +.. _Monitor lookup through DNS: ../mon-lookup-dns +.. _ACID: https://en.wikipedia.org/wiki/ACID +.. _Adding/Removing a Monitor: ../../operations/add-or-rm-mons +.. _Monitoring a Cluster: ../../operations/monitoring +.. _Monitoring OSDs and PGs: ../../operations/monitoring-osd-pg +.. _Bootstrapping a Monitor: ../../../dev/mon-bootstrap +.. _Monitor/OSD Interaction: ../mon-osd-interaction +.. _Scalability and High Availability: ../../../architecture#scalability-and-high-availability diff --git a/doc/rados/configuration/mon-lookup-dns.rst b/doc/rados/configuration/mon-lookup-dns.rst new file mode 100644 index 000000000..129a083c4 --- /dev/null +++ b/doc/rados/configuration/mon-lookup-dns.rst @@ -0,0 +1,58 @@ +.. _mon-dns-lookup: + +=============================== +Looking up Monitors through DNS +=============================== + +Since Ceph version 11.0.0 (Kraken), RADOS has supported looking up monitors +through DNS. + +The addition of the ability to look up monitors through DNS means that daemons +and clients do not require a *mon host* configuration directive in their +``ceph.conf`` configuration file. + +With a DNS update, clients and daemons can be made aware of changes +in the monitor topology. To be more precise and technical, clients look up the +monitors by using ``DNS SRV TCP`` records. + +By default, clients and daemons look for the TCP service called *ceph-mon*, +which is configured by the *mon_dns_srv_name* configuration directive. + + +.. confval:: mon_dns_srv_name + +Example +------- +When the DNS search domain is set to *example.com* a DNS zone file might contain the following elements. + +First, create records for the Monitors, either IPv4 (A) or IPv6 (AAAA). + +:: + + mon1.example.com. AAAA 2001:db8::100 + mon2.example.com. AAAA 2001:db8::200 + mon3.example.com. AAAA 2001:db8::300 + +:: + + mon1.example.com. A 192.168.0.1 + mon2.example.com. A 192.168.0.2 + mon3.example.com. A 192.168.0.3 + + +With those records now existing we can create the SRV TCP records with the name *ceph-mon* pointing to the three Monitors. + +:: + + _ceph-mon._tcp.example.com. 60 IN SRV 10 20 6789 mon1.example.com. + _ceph-mon._tcp.example.com. 
60 IN SRV 10 30 6789 mon2.example.com. + _ceph-mon._tcp.example.com. 60 IN SRV 20 50 6789 mon3.example.com. + +Now all Monitors are running on port *6789*, with priorities 10, 10, 20 and weights 20, 30, 50 respectively. + +Monitor clients choose monitor by referencing the SRV records. If a cluster has multiple Monitor SRV records +with the same priority value, clients and daemons will load balance the connections to Monitors in proportion +to the values of the SRV weight fields. + +For the above example, this will result in approximate 40% of the clients and daemons connecting to mon1, +60% of them connecting to mon2. However, if neither of them is reachable, then mon3 will be reconsidered as a fallback. diff --git a/doc/rados/configuration/mon-osd-interaction.rst b/doc/rados/configuration/mon-osd-interaction.rst new file mode 100644 index 000000000..8cf09707d --- /dev/null +++ b/doc/rados/configuration/mon-osd-interaction.rst @@ -0,0 +1,245 @@ +===================================== + Configuring Monitor/OSD Interaction +===================================== + +.. index:: heartbeat + +After you have completed your initial Ceph configuration, you may deploy and run +Ceph. When you execute a command such as ``ceph health`` or ``ceph -s``, the +:term:`Ceph Monitor` reports on the current state of the :term:`Ceph Storage +Cluster`. The Ceph Monitor knows about the Ceph Storage Cluster by requiring +reports from each :term:`Ceph OSD Daemon`, and by receiving reports from Ceph +OSD Daemons about the status of their neighboring Ceph OSD Daemons. If the Ceph +Monitor doesn't receive reports, or if it receives reports of changes in the +Ceph Storage Cluster, the Ceph Monitor updates the status of the :term:`Ceph +Cluster Map`. + +Ceph provides reasonable default settings for Ceph Monitor/Ceph OSD Daemon +interaction. However, you may override the defaults. The following sections +describe how Ceph Monitors and Ceph OSD Daemons interact for the purposes of +monitoring the Ceph Storage Cluster. + +.. index:: heartbeat interval + +OSDs Check Heartbeats +===================== + +Each Ceph OSD Daemon checks the heartbeat of other Ceph OSD Daemons at random +intervals less than every 6 seconds. If a neighboring Ceph OSD Daemon doesn't +show a heartbeat within a 20 second grace period, the Ceph OSD Daemon may +consider the neighboring Ceph OSD Daemon ``down`` and report it back to a Ceph +Monitor, which will update the Ceph Cluster Map. You may change this grace +period by adding an ``osd heartbeat grace`` setting under the ``[mon]`` +and ``[osd]`` or ``[global]`` section of your Ceph configuration file, +or by setting the value at runtime. + + +.. ditaa:: + +---------+ +---------+ + | OSD 1 | | OSD 2 | + +---------+ +---------+ + | | + |----+ Heartbeat | + | | Interval | + |<---+ Exceeded | + | | + | Check | + | Heartbeat | + |------------------->| + | | + |<-------------------| + | Heart Beating | + | | + |----+ Heartbeat | + | | Interval | + |<---+ Exceeded | + | | + | Check | + | Heartbeat | + |------------------->| + | | + |----+ Grace | + | | Period | + |<---+ Exceeded | + | | + |----+ Mark | + | | OSD 2 | + |<---+ Down | + + +.. index:: OSD down report + +OSDs Report Down OSDs +===================== + +By default, two Ceph OSD Daemons from different hosts must report to the Ceph +Monitors that another Ceph OSD Daemon is ``down`` before the Ceph Monitors +acknowledge that the reported Ceph OSD Daemon is ``down``. 
But there is chance +that all the OSDs reporting the failure are hosted in a rack with a bad switch +which has trouble connecting to another OSD. To avoid this sort of false alarm, +we consider the peers reporting a failure a proxy for a potential "subcluster" +over the overall cluster that is similarly laggy. This is clearly not true in +all cases, but will sometimes help us localize the grace correction to a subset +of the system that is unhappy. ``mon osd reporter subtree level`` is used to +group the peers into the "subcluster" by their common ancestor type in CRUSH +map. By default, only two reports from different subtree are required to report +another Ceph OSD Daemon ``down``. You can change the number of reporters from +unique subtrees and the common ancestor type required to report a Ceph OSD +Daemon ``down`` to a Ceph Monitor by adding an ``mon osd min down reporters`` +and ``mon osd reporter subtree level`` settings under the ``[mon]`` section of +your Ceph configuration file, or by setting the value at runtime. + + +.. ditaa:: + + +---------+ +---------+ +---------+ + | OSD 1 | | OSD 2 | | Monitor | + +---------+ +---------+ +---------+ + | | | + | OSD 3 Is Down | | + |---------------+--------------->| + | | | + | | | + | | OSD 3 Is Down | + | |--------------->| + | | | + | | | + | | |---------+ Mark + | | | | OSD 3 + | | |<--------+ Down + + +.. index:: peering failure + +OSDs Report Peering Failure +=========================== + +If a Ceph OSD Daemon cannot peer with any of the Ceph OSD Daemons defined in its +Ceph configuration file (or the cluster map), it will ping a Ceph Monitor for +the most recent copy of the cluster map every 30 seconds. You can change the +Ceph Monitor heartbeat interval by adding an ``osd mon heartbeat interval`` +setting under the ``[osd]`` section of your Ceph configuration file, or by +setting the value at runtime. + +.. ditaa:: + + +---------+ +---------+ +-------+ +---------+ + | OSD 1 | | OSD 2 | | OSD 3 | | Monitor | + +---------+ +---------+ +-------+ +---------+ + | | | | + | Request To | | | + | Peer | | | + |-------------->| | | + |<--------------| | | + | Peering | | + | | | + | Request To | | + | Peer | | + |----------------------------->| | + | | + |----+ OSD Monitor | + | | Heartbeat | + |<---+ Interval Exceeded | + | | + | Failed to Peer with OSD 3 | + |-------------------------------------------->| + |<--------------------------------------------| + | Receive New Cluster Map | + + +.. index:: OSD status + +OSDs Report Their Status +======================== + +If an Ceph OSD Daemon doesn't report to a Ceph Monitor, the Ceph Monitor will +consider the Ceph OSD Daemon ``down`` after the ``mon osd report timeout`` +elapses. A Ceph OSD Daemon sends a report to a Ceph Monitor when a reportable +event such as a failure, a change in placement group stats, a change in +``up_thru`` or when it boots within 5 seconds. You can change the Ceph OSD +Daemon minimum report interval by adding an ``osd mon report interval`` +setting under the ``[osd]`` section of your Ceph configuration file, or by +setting the value at runtime. A Ceph OSD Daemon sends a report to a Ceph +Monitor every 120 seconds irrespective of whether any notable changes occur. +You can change the Ceph Monitor report interval by adding an ``osd mon report +interval max`` setting under the ``[osd]`` section of your Ceph configuration +file, or by setting the value at runtime. + + +.. 
ditaa:: + + +---------+ +---------+ + | OSD 1 | | Monitor | + +---------+ +---------+ + | | + |----+ Report Min | + | | Interval | + |<---+ Exceeded | + | | + |----+ Reportable | + | | Event | + |<---+ Occurs | + | | + | Report To | + | Monitor | + |------------------->| + | | + |----+ Report Max | + | | Interval | + |<---+ Exceeded | + | | + | Report To | + | Monitor | + |------------------->| + | | + |----+ Monitor | + | | Fails | + |<---+ | + +----+ Monitor OSD + | | Report Timeout + |<---+ Exceeded + | + +----+ Mark + | | OSD 1 + |<---+ Down + + + + +Configuration Settings +====================== + +When modifying heartbeat settings, you should include them in the ``[global]`` +section of your configuration file. + +.. index:: monitor heartbeat + +Monitor Settings +---------------- + +.. confval:: mon_osd_min_up_ratio +.. confval:: mon_osd_min_in_ratio +.. confval:: mon_osd_laggy_halflife +.. confval:: mon_osd_laggy_weight +.. confval:: mon_osd_laggy_max_interval +.. confval:: mon_osd_adjust_heartbeat_grace +.. confval:: mon_osd_adjust_down_out_interval +.. confval:: mon_osd_auto_mark_in +.. confval:: mon_osd_auto_mark_auto_out_in +.. confval:: mon_osd_auto_mark_new_in +.. confval:: mon_osd_down_out_interval +.. confval:: mon_osd_down_out_subtree_limit +.. confval:: mon_osd_report_timeout +.. confval:: mon_osd_min_down_reporters +.. confval:: mon_osd_reporter_subtree_level + +.. index:: OSD heartbeat + +OSD Settings +------------ + +.. confval:: osd_heartbeat_interval +.. confval:: osd_heartbeat_grace +.. confval:: osd_mon_heartbeat_interval +.. confval:: osd_mon_heartbeat_stat_stale +.. confval:: osd_mon_report_interval diff --git a/doc/rados/configuration/msgr2.rst b/doc/rados/configuration/msgr2.rst new file mode 100644 index 000000000..33fe4e022 --- /dev/null +++ b/doc/rados/configuration/msgr2.rst @@ -0,0 +1,257 @@ +.. _msgr2: + +Messenger v2 +============ + +What is it +---------- + +The messenger v2 protocol, or msgr2, is the second major revision on +Ceph's on-wire protocol. It brings with it several key features: + +* A *secure* mode that encrypts all data passing over the network +* Improved encapsulation of authentication payloads, enabling future + integration of new authentication modes like Kerberos +* Improved earlier feature advertisement and negotiation, enabling + future protocol revisions + +Ceph daemons can now bind to multiple ports, allowing both legacy Ceph +clients and new v2-capable clients to connect to the same cluster. + +By default, monitors now bind to the new IANA-assigned port ``3300`` +(ce4h or 0xce4) for the new v2 protocol, while also binding to the +old default port ``6789`` for the legacy v1 protocol. + +.. _address_formats: + +Address formats +--------------- + +Prior to Nautilus, all network addresses were rendered like +``1.2.3.4:567/89012`` where there was an IP address, a port, and a +nonce to uniquely identify a client or daemon on the network. +Starting with Nautilus, we now have three different address types: + +* **v2**: ``v2:1.2.3.4:578/89012`` identifies a daemon binding to a + port speaking the new v2 protocol +* **v1**: ``v1:1.2.3.4:578/89012`` identifies a daemon binding to a + port speaking the legacy v1 protocol. Any address that was + previously shown with any prefix is now shown as a ``v1:`` address. +* **TYPE_ANY** ``any:1.2.3.4:578/89012`` identifies a client that can + speak either version of the protocol. 
Prior to Nautilus, clients would appear as
  ``1.2.3.4:0/123456``, where the port of ``0`` indicates that they are clients
  and do not accept incoming connections. Starting with Nautilus, these
  clients are internally represented by a **TYPE_ANY** address, and are still
  shown with no prefix, because they may connect to daemons using the v2 or v1
  protocol, depending on what protocol(s) the daemons are using.

Because daemons now bind to multiple ports, they are described by a vector of
addresses instead of a single address. For example, dumping the monitor map on
a Nautilus cluster now includes lines like::

  epoch 1
  fsid 50fcf227-be32-4bcb-8b41-34ca8370bd16
  last_changed 2019-02-25 11:10:46.700821
  created 2019-02-25 11:10:46.700821
  min_mon_release 14 (nautilus)
  0: [v2:10.0.0.10:3300/0,v1:10.0.0.10:6789/0] mon.foo
  1: [v2:10.0.0.11:3300/0,v1:10.0.0.11:6789/0] mon.bar
  2: [v2:10.0.0.12:3300/0,v1:10.0.0.12:6789/0] mon.baz

The bracketed list or vector of addresses means that the same daemon can be
reached on multiple ports (and protocols). Any client or other daemon
connecting to that daemon will use the v2 protocol (listed first) if possible;
otherwise it will fall back to the legacy v1 protocol. Legacy clients will only
see the v1 addresses and will continue to connect as they did before, with the
v1 protocol.

Starting in Nautilus, the ``mon_host`` configuration option and the ``-m
<mon-host>`` command-line option support the same bracketed address vector
syntax.


Bind configuration options
^^^^^^^^^^^^^^^^^^^^^^^^^^

Two new configuration options control whether the v1 and/or v2 protocol is
used:

* :confval:`ms_bind_msgr1` [default: true] controls whether a daemon binds
  to a port speaking the v1 protocol
* :confval:`ms_bind_msgr2` [default: true] controls whether a daemon binds
  to a port speaking the v2 protocol

Similarly, two options control whether IPv4 and IPv6 addresses are used:

* :confval:`ms_bind_ipv4` [default: true] controls whether a daemon binds
  to an IPv4 address
* :confval:`ms_bind_ipv6` [default: false] controls whether a daemon binds
  to an IPv6 address

.. note:: The ability to bind to multiple ports has paved the way for
   dual-stack IPv4 and IPv6 support. That said, dual-stack operation is
   not yet supported as of Quincy v17.2.0.

Connection modes
----------------

The v2 protocol supports two connection modes:

* *crc* mode provides:

  - a strong initial authentication when the connection is established
    (with cephx, mutual authentication of both parties with protection
    from a man-in-the-middle or eavesdropper), and
  - a crc32c integrity check to protect against bit flips due to flaky
    hardware or cosmic rays

  *crc* mode does *not* provide:

  - secrecy (an eavesdropper on the network can see all
    post-authentication traffic as it goes by) or
  - protection from a malicious man-in-the-middle (who can deliberately
    modify traffic as it goes by, as long as they are careful to
    adjust the crc32c values to match)

* *secure* mode provides:

  - a strong initial authentication when the connection is established
    (with cephx, mutual authentication of both parties with protection
    from a man-in-the-middle or eavesdropper), and
  - full encryption of all post-authentication traffic, including a
    cryptographic integrity check.
+ + In Nautilus, secure mode uses the `AES-GCM + <https://en.wikipedia.org/wiki/Galois/Counter_Mode>`_ stream cipher, + which is generally very fast on modern processors (e.g., faster than + a SHA-256 cryptographic hash). + +Connection mode configuration options +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +For most connections, there are options that control which modes are used: + +.. confval:: ms_cluster_mode +.. confval:: ms_service_mode +.. confval:: ms_client_mode + +There are a parallel set of options that apply specifically to +monitors, allowing administrators to set different (usually more +secure) requirements on communication with the monitors. + +.. confval:: ms_mon_cluster_mode +.. confval:: ms_mon_service_mode +.. confval:: ms_mon_client_mode + + +Compression modes +----------------- + +The v2 protocol supports two compression modes: + +* *force* mode provides: + + - In multi-availability zones deployment, compressing replication messages between OSDs saves latency. + - In the public cloud, inter-AZ communications are expensive. Thus, minimizing message + size reduces network costs to cloud provider. + - When using instance storage on AWS (probably other public clouds as well) the instances with NVMe + provide low network bandwidth relative to the device bandwidth. + In this case, NW compression can improve the overall performance since this is clearly + the bottleneck. + +* *none* mode provides: + + - messages are transmitted without compression. + + +Compression mode configuration options +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +For all connections, there is an option that controls compression usage in secure mode + +.. confval:: ms_compress_secure + +There is a parallel set of options that apply specifically to OSDs, +allowing administrators to set different requirements on communication between OSDs. + +.. confval:: ms_osd_compress_mode +.. confval:: ms_osd_compress_min_size +.. confval:: ms_osd_compression_algorithm + +Transitioning from v1-only to v2-plus-v1 +---------------------------------------- + +By default, ``ms_bind_msgr2`` is true starting with Nautilus 14.2.z. +However, until the monitors start using v2, only limited services will +start advertising v2 addresses. + +For most users, the monitors are binding to the default legacy port ``6789`` +for the v1 protocol. When this is the case, enabling v2 is as simple as: + +.. prompt:: bash $ + + ceph mon enable-msgr2 + +If the monitors are bound to non-standard ports, you will need to +specify an additional port for v2 explicitly. For example, if your +monitor ``mon.a`` binds to ``1.2.3.4:1111``, and you want to add v2 on +port ``1112``: + +.. prompt:: bash $ + + ceph mon set-addrs a [v2:1.2.3.4:1112,v1:1.2.3.4:1111] + +Once the monitors bind to v2, each daemon will start advertising a v2 +address when it is next restarted. + + +.. _msgr2_ceph_conf: + +Updating ceph.conf and mon_host +------------------------------- + +Prior to Nautilus, a CLI user or daemon will normally discover the +monitors via the ``mon_host`` option in ``/etc/ceph/ceph.conf``. The +syntax for this option has expanded starting with Nautilus to allow +support the new bracketed list format. 
For example, an old line
like::

  mon_host = 10.0.0.1:6789,10.0.0.2:6789,10.0.0.3:6789

can be changed to::

  mon_host = [v2:10.0.0.1:3300/0,v1:10.0.0.1:6789/0],[v2:10.0.0.2:3300/0,v1:10.0.0.2:6789/0],[v2:10.0.0.3:3300/0,v1:10.0.0.3:6789/0]

However, when default ports are used (``3300`` and ``6789``), they can
be omitted::

  mon_host = 10.0.0.1,10.0.0.2,10.0.0.3

Once v2 has been enabled on the monitors, ``ceph.conf`` may need to be
updated to either specify no ports (this is usually simplest), or
explicitly specify both the v2 and v1 addresses. Note, however, that
the new bracketed syntax is only understood by Nautilus and later, so
do not make that change on hosts that have not yet had their ceph
packages upgraded.

When you are updating ``ceph.conf``, note that the new ``ceph config
generate-minimal-conf`` command (which generates a barebones config
file with just enough information to reach the monitors) and the
``ceph config assimilate-conf`` command (which moves config file options
into the monitors' configuration database) may be helpful. For example::

  # ceph config assimilate-conf < /etc/ceph/ceph.conf
  # ceph config generate-minimal-conf > /etc/ceph/ceph.conf.new
  # cat /etc/ceph/ceph.conf.new
  # minimal ceph.conf for 0e5a806b-0ce5-4bc6-b949-aa6f68f5c2a3
  [global]
          fsid = 0e5a806b-0ce5-4bc6-b949-aa6f68f5c2a3
          mon_host = [v2:10.0.0.1:3300/0,v1:10.0.0.1:6789/0]
  # mv /etc/ceph/ceph.conf.new /etc/ceph/ceph.conf

Protocol
--------

For a detailed description of the v2 wire protocol, see :ref:`msgr2-protocol`.
diff --git a/doc/rados/configuration/network-config-ref.rst b/doc/rados/configuration/network-config-ref.rst new file mode 100644 index 000000000..81e85c5d1 --- /dev/null +++ b/doc/rados/configuration/network-config-ref.rst @@ -0,0 +1,355 @@
=================================
 Network Configuration Reference
=================================

Network configuration is critical for building a high-performance :term:`Ceph
Storage Cluster`. The Ceph Storage Cluster does not perform request routing or
dispatching on behalf of the :term:`Ceph Client`. Instead, Ceph Clients make
requests directly to Ceph OSD Daemons. Ceph OSD Daemons perform data replication
on behalf of Ceph Clients, which means replication and other factors impose
additional loads on Ceph Storage Cluster networks.

Our Quick Start configurations provide a trivial Ceph configuration file that
sets monitor IP addresses and daemon host names only. Unless you specify a
cluster network, Ceph assumes a single "public" network. Ceph functions just
fine with a public network only, but you may see significant performance
improvement with a second "cluster" network in a large cluster.

It is possible to run a Ceph Storage Cluster with two networks: a public
(client, front-side) network and a cluster (private, replication, back-side)
network. However, this approach complicates network configuration (both
hardware and software) and does not usually have a significant impact on
overall performance. For this reason, we recommend that, for resilience and
capacity, dual-NIC systems either bond these interfaces in active/active mode
or implement a layer 3 multipath strategy with, for example, FRR.

If, despite the complexity, one still wishes to use two networks, each
:term:`Ceph Node` will need to have more than one network interface or VLAN. See `Hardware
Recommendations - Networks`_ for additional details.

..
ditaa:: + +-------------+ + | Ceph Client | + +----*--*-----+ + | ^ + Request | : Response + v | + /----------------------------------*--*-------------------------------------\ + | Public Network | + \---*--*------------*--*-------------*--*------------*--*------------*--*---/ + ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ + | | | | | | | | | | + | : | : | : | : | : + v v v v v v v v v v + +---*--*---+ +---*--*---+ +---*--*---+ +---*--*---+ +---*--*---+ + | Ceph MON | | Ceph MDS | | Ceph OSD | | Ceph OSD | | Ceph OSD | + +----------+ +----------+ +---*--*---+ +---*--*---+ +---*--*---+ + ^ ^ ^ ^ ^ ^ + The cluster network relieves | | | | | | + OSD replication and heartbeat | : | : | : + traffic from the public network. v v v v v v + /------------------------------------*--*------------*--*------------*--*---\ + | cCCC Cluster Network | + \---------------------------------------------------------------------------/ + + +IP Tables +========= + +By default, daemons `bind`_ to ports within the ``6800:7300`` range. You may +configure this range at your discretion. Before configuring your IP tables, +check the default ``iptables`` configuration. + +.. prompt:: bash $ + + sudo iptables -L + +Some Linux distributions include rules that reject all inbound requests +except SSH from all network interfaces. For example:: + + REJECT all -- anywhere anywhere reject-with icmp-host-prohibited + +You will need to delete these rules on both your public and cluster networks +initially, and replace them with appropriate rules when you are ready to +harden the ports on your Ceph Nodes. + + +Monitor IP Tables +----------------- + +Ceph Monitors listen on ports ``3300`` and ``6789`` by +default. Additionally, Ceph Monitors always operate on the public +network. When you add the rule using the example below, make sure you +replace ``{iface}`` with the public network interface (e.g., ``eth0``, +``eth1``, etc.), ``{ip-address}`` with the IP address of the public +network and ``{netmask}`` with the netmask for the public network. : + +.. prompt:: bash $ + + sudo iptables -A INPUT -i {iface} -p tcp -s {ip-address}/{netmask} --dport 6789 -j ACCEPT + + +MDS and Manager IP Tables +------------------------- + +A :term:`Ceph Metadata Server` or :term:`Ceph Manager` listens on the first +available port on the public network beginning at port 6800. Note that this +behavior is not deterministic, so if you are running more than one OSD or MDS +on the same host, or if you restart the daemons within a short window of time, +the daemons will bind to higher ports. You should open the entire 6800-7300 +range by default. When you add the rule using the example below, make sure +you replace ``{iface}`` with the public network interface (e.g., ``eth0``, +``eth1``, etc.), ``{ip-address}`` with the IP address of the public network +and ``{netmask}`` with the netmask of the public network. + +For example: + +.. prompt:: bash $ + + sudo iptables -A INPUT -i {iface} -m multiport -p tcp -s {ip-address}/{netmask} --dports 6800:7300 -j ACCEPT + + +OSD IP Tables +------------- + +By default, Ceph OSD Daemons `bind`_ to the first available ports on a Ceph Node +beginning at port 6800. Note that this behavior is not deterministic, so if you +are running more than one OSD or MDS on the same host, or if you restart the +daemons within a short window of time, the daemons will bind to higher ports. +Each Ceph OSD Daemon on a Ceph Node may use up to four ports: + +#. One for talking to clients and monitors. +#. One for sending data to other OSDs. +#. 
Two for heartbeating on each interface. + +.. ditaa:: + /---------------\ + | OSD | + | +---+----------------+-----------+ + | | Clients & Monitors | Heartbeat | + | +---+----------------+-----------+ + | | + | +---+----------------+-----------+ + | | Data Replication | Heartbeat | + | +---+----------------+-----------+ + | cCCC | + \---------------/ + +When a daemon fails and restarts without letting go of the port, the restarted +daemon will bind to a new port. You should open the entire 6800-7300 port range +to handle this possibility. + +If you set up separate public and cluster networks, you must add rules for both +the public network and the cluster network, because clients will connect using +the public network and other Ceph OSD Daemons will connect using the cluster +network. When you add the rule using the example below, make sure you replace +``{iface}`` with the network interface (e.g., ``eth0``, ``eth1``, etc.), +``{ip-address}`` with the IP address and ``{netmask}`` with the netmask of the +public or cluster network. For example: + +.. prompt:: bash $ + + sudo iptables -A INPUT -i {iface} -m multiport -p tcp -s {ip-address}/{netmask} --dports 6800:7300 -j ACCEPT + +.. tip:: If you run Ceph Metadata Servers on the same Ceph Node as the + Ceph OSD Daemons, you can consolidate the public network configuration step. + + +Ceph Networks +============= + +To configure Ceph networks, you must add a network configuration to the +``[global]`` section of the configuration file. Our 5-minute Quick Start +provides a trivial Ceph configuration file that assumes one public network +with client and server on the same network and subnet. Ceph functions just fine +with a public network only. However, Ceph allows you to establish much more +specific criteria, including multiple IP network and subnet masks for your +public network. You can also establish a separate cluster network to handle OSD +heartbeat, object replication and recovery traffic. Don't confuse the IP +addresses you set in your configuration with the public-facing IP addresses +network clients may use to access your service. Typical internal IP networks are +often ``192.168.0.0`` or ``10.0.0.0``. + +.. tip:: If you specify more than one IP address and subnet mask for + either the public or the cluster network, the subnets within the network + must be capable of routing to each other. Additionally, make sure you + include each IP address/subnet in your IP tables and open ports for them + as necessary. + +.. note:: Ceph uses `CIDR`_ notation for subnets (e.g., ``10.0.0.0/24``). + +When you have configured your networks, you may restart your cluster or restart +each daemon. Ceph daemons bind dynamically, so you do not have to restart the +entire cluster at once if you change your network configuration. + + +Public Network +-------------- + +To configure a public network, add the following option to the ``[global]`` +section of your Ceph configuration file. + +.. code-block:: ini + + [global] + # ... elided configuration + public_network = {public-network/netmask} + +.. _cluster-network: + +Cluster Network +--------------- + +If you declare a cluster network, OSDs will route heartbeat, object replication +and recovery traffic over the cluster network. This may improve performance +compared to using a single network. To configure a cluster network, add the +following option to the ``[global]`` section of your Ceph configuration file. + +.. code-block:: ini + + [global] + # ... 
elided configuration + cluster_network = {cluster-network/netmask} + +We prefer that the cluster network is **NOT** reachable from the public network +or the Internet for added security. + +IPv4/IPv6 Dual Stack Mode +------------------------- + +If you want to run in an IPv4/IPv6 dual stack mode and want to define your public and/or +cluster networks, then you need to specify both your IPv4 and IPv6 networks for each: + +.. code-block:: ini + + [global] + # ... elided configuration + public_network = {IPv4 public-network/netmask}, {IPv6 public-network/netmask} + +This is so that Ceph can find a valid IP address for both address families. + +If you want just an IPv4 or an IPv6 stack environment, then make sure you set the `ms bind` +options correctly. + +.. note:: + Binding to IPv4 is enabled by default, so if you just add the option to bind to IPv6 + you'll actually put yourself into dual stack mode. If you want just IPv6, then disable IPv4 and + enable IPv6. See `Bind`_ below. + +Ceph Daemons +============ + +Monitor daemons are each configured to bind to a specific IP address. These +addresses are normally configured by your deployment tool. Other components +in the Ceph cluster discover the monitors via the ``mon host`` configuration +option, normally specified in the ``[global]`` section of the ``ceph.conf`` file. + +.. code-block:: ini + + [global] + mon_host = 10.0.0.2, 10.0.0.3, 10.0.0.4 + +The ``mon_host`` value can be a list of IP addresses or a name that is +looked up via DNS. In the case of a DNS name with multiple A or AAAA +records, all records are probed in order to discover a monitor. Once +one monitor is reached, all other current monitors are discovered, so +the ``mon host`` configuration option only needs to be sufficiently up +to date such that a client can reach one monitor that is currently online. + +The MGR, OSD, and MDS daemons will bind to any available address and +do not require any special configuration. However, it is possible to +specify a specific IP address for them to bind to with the ``public +addr`` (and/or, in the case of OSD daemons, the ``cluster addr``) +configuration option. For example, + +.. code-block:: ini + + [osd.0] + public_addr = {host-public-ip-address} + cluster_addr = {host-cluster-ip-address} + +.. topic:: One NIC OSD in a Two Network Cluster + + Generally, we do not recommend deploying an OSD host with a single network interface in a + cluster with two networks. However, you may accomplish this by forcing the + OSD host to operate on the public network by adding a ``public_addr`` entry + to the ``[osd.n]`` section of the Ceph configuration file, where ``n`` + refers to the ID of the OSD with one network interface. Additionally, the public + network and cluster network must be able to route traffic to each other, + which we don't recommend for security reasons. + + +Network Config Settings +======================= + +Network configuration settings are not required. Ceph assumes a public network +with all hosts operating on it unless you specifically configure a cluster +network. + + +Public Network +-------------- + +The public network configuration allows you specifically define IP addresses +and subnets for the public network. You may specifically assign static IP +addresses or override ``public_network`` settings using the ``public_addr`` +setting for a specific daemon. + +.. confval:: public_network +.. 
confval:: public_addr + +Cluster Network +--------------- + +The cluster network configuration allows you to declare a cluster network, and +specifically define IP addresses and subnets for the cluster network. You may +specifically assign static IP addresses or override ``cluster_network`` +settings using the ``cluster_addr`` setting for specific OSD daemons. + + +.. confval:: cluster_network +.. confval:: cluster_addr + +Bind +---- + +Bind settings set the default port ranges Ceph OSD and MDS daemons use. The +default range is ``6800:7300``. Ensure that your `IP Tables`_ configuration +allows you to use the configured port range. + +You may also enable Ceph daemons to bind to IPv6 addresses instead of IPv4 +addresses. + +.. confval:: ms_bind_port_min +.. confval:: ms_bind_port_max +.. confval:: ms_bind_ipv4 +.. confval:: ms_bind_ipv6 +.. confval:: public_bind_addr + +TCP +--- + +Ceph disables TCP buffering by default. + +.. confval:: ms_tcp_nodelay +.. confval:: ms_tcp_rcvbuf + +General Settings +---------------- + +.. confval:: ms_type +.. confval:: ms_async_op_threads +.. confval:: ms_initial_backoff +.. confval:: ms_max_backoff +.. confval:: ms_die_on_bad_msg +.. confval:: ms_dispatch_throttle_bytes +.. confval:: ms_inject_socket_failures + + +.. _Scalability and High Availability: ../../../architecture#scalability-and-high-availability +.. _Hardware Recommendations - Networks: ../../../start/hardware-recommendations#networks +.. _hardware recommendations: ../../../start/hardware-recommendations +.. _Monitor / OSD Interaction: ../mon-osd-interaction +.. _Message Signatures: ../auth-config-ref#signatures +.. _CIDR: https://en.wikipedia.org/wiki/Classless_Inter-Domain_Routing +.. _Nagle's Algorithm: https://en.wikipedia.org/wiki/Nagle's_algorithm diff --git a/doc/rados/configuration/osd-config-ref.rst b/doc/rados/configuration/osd-config-ref.rst new file mode 100644 index 000000000..060121200 --- /dev/null +++ b/doc/rados/configuration/osd-config-ref.rst @@ -0,0 +1,445 @@ +====================== + OSD Config Reference +====================== + +.. index:: OSD; configuration + +You can configure Ceph OSD Daemons in the Ceph configuration file (or in recent +releases, the central config store), but Ceph OSD +Daemons can use the default values and a very minimal configuration. A minimal +Ceph OSD Daemon configuration sets ``host`` and +uses default values for nearly everything else. + +Ceph OSD Daemons are numerically identified in incremental fashion, beginning +with ``0`` using the following convention. :: + + osd.0 + osd.1 + osd.2 + +In a configuration file, you may specify settings for all Ceph OSD Daemons in +the cluster by adding configuration settings to the ``[osd]`` section of your +configuration file. To add settings directly to a specific Ceph OSD Daemon +(e.g., ``host``), enter it in an OSD-specific section of your configuration +file. For example: + +.. code-block:: ini + + [osd] + osd_journal_size = 5120 + + [osd.0] + host = osd-host-a + + [osd.1] + host = osd-host-b + + +.. index:: OSD; config settings + +General Settings +================ + +The following settings provide a Ceph OSD Daemon's ID, and determine paths to +data and journals. Ceph deployment scripts typically generate the UUID +automatically. + +.. warning:: **DO NOT** change the default paths for data or journals, as it + makes it more problematic to troubleshoot Ceph later. 
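+
+If you manage settings through the central config store mentioned at the top
+of this reference rather than by editing ``ceph.conf``, you can confirm the
+values a running OSD is actually using before changing anything. This is only
+a sketch (``osd.0`` is a placeholder daemon ID; both options are documented
+below):
+
+.. prompt:: bash $
+
+   ceph config show osd.0 osd_data
+   ceph config get osd osd_max_object_size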
+
+When using Filestore, the journal size should be at least twice the product of
+the expected drive speed and ``filestore_max_sync_interval``. For example,
+with an expected throughput of 200 MB/s and a five-second
+``filestore_max_sync_interval``, the journal should be at least 2 GB. However,
+the most common practice is to partition the journal drive (often an SSD) and
+mount it such that Ceph uses the entire partition for the journal.
+
+.. confval:: osd_uuid
+.. confval:: osd_data
+.. confval:: osd_max_write_size
+.. confval:: osd_max_object_size
+.. confval:: osd_client_message_size_cap
+.. confval:: osd_class_dir
+   :default: $libdir/rados-classes
+
+.. index:: OSD; file system
+
+File System Settings
+====================
+Ceph builds and mounts file systems which are used for Ceph OSDs.
+
+``osd_mkfs_options {fs-type}``
+
+:Description: Options used when creating a new Ceph Filestore OSD of type {fs-type}.
+
+:Type: String
+:Default for xfs: ``-f -i 2048``
+:Default for other file systems: {empty string}
+
+For example::
+
+  osd_mkfs_options_xfs = -f -d agcount=24
+
+``osd_mount_options {fs-type}``
+
+:Description: Options used when mounting a Ceph Filestore OSD of type {fs-type}.
+
+:Type: String
+:Default for xfs: ``rw,noatime,inode64``
+:Default for other file systems: ``rw, noatime``
+
+For example::
+
+  osd_mount_options_xfs = rw, noatime, inode64, logbufs=8
+
+
+.. index:: OSD; journal settings
+
+Journal Settings
+================
+
+This section applies only to the older Filestore OSD back end. Since the
+Luminous release, BlueStore has been the default and preferred back end.
+
+By default, Ceph expects that you will provision a Ceph OSD Daemon's journal
+at the following path, which is usually a symlink to a device or partition::
+
+    /var/lib/ceph/osd/$cluster-$id/journal
+
+When using a single device type (for example, spinning drives), the journals
+should be *colocated*: the logical volume (or partition) should be in the same
+device as the ``data`` logical volume.
+
+When using a mix of fast devices (SSD, NVMe) and slower devices (like spinning
+drives), it makes sense to place the journal on the faster device, while
+``data`` fully occupies the slower device.
+
+The default ``osd_journal_size`` value is 5120 (5 gigabytes), but it can be
+larger, in which case it will need to be set in the ``ceph.conf`` file.
+A value of 10 gigabytes is common in practice::
+
+    osd_journal_size = 10240
+
+.. confval:: osd_journal
+.. confval:: osd_journal_size
+
+See `Journal Config Reference`_ for additional details.
+
+
+Monitor OSD Interaction
+=======================
+
+Ceph OSD Daemons check each other's heartbeats and report to monitors
+periodically. Ceph can use default values in many cases. However, if your
+network has latency issues, you may need to adopt longer intervals. See
+`Configuring Monitor/OSD Interaction`_ for a detailed discussion of heartbeats.
+
+
+Data Placement
+==============
+
+See `Pool & PG Config Reference`_ for details.
+
+
+.. index:: OSD; scrubbing
+
+.. _rados_config_scrubbing:
+
+Scrubbing
+=========
+
+One way that Ceph ensures data integrity is by "scrubbing" placement groups.
+Ceph scrubbing is analogous to ``fsck`` on the object storage layer. Ceph
+generates a catalog of all objects in each placement group and compares each
+primary object to its replicas, ensuring that no objects are missing or
+mismatched. Light scrubbing checks the object size and attributes, and is
+usually done daily. Deep scrubbing reads the data and uses checksums to ensure
+data integrity, and is usually done weekly. The frequencies of both light
+scrubbing and deep scrubbing are determined by the cluster's configuration,
+which is fully under your control and subject to the settings explained below
+in this section.
+
+Although scrubbing is important for maintaining data integrity, it can reduce
+the performance of the Ceph cluster. You can adjust the following settings to
+increase or decrease the frequency and depth of scrubbing operations.
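+
+For example, a cluster that should scrub only overnight might restrict the
+scrub window using two of the options listed below. This is only a sketch;
+the hours shown are illustrative, not recommendations:
+
+.. prompt:: bash $
+
+   ceph config set osd osd_scrub_begin_hour 23
+   ceph config set osd osd_scrub_end_hour 6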
+
+.. confval:: osd_max_scrubs
+.. confval:: osd_scrub_begin_hour
+.. confval:: osd_scrub_end_hour
+.. confval:: osd_scrub_begin_week_day
+.. confval:: osd_scrub_end_week_day
+.. confval:: osd_scrub_during_recovery
+.. confval:: osd_scrub_load_threshold
+.. confval:: osd_scrub_min_interval
+.. confval:: osd_scrub_max_interval
+.. confval:: osd_scrub_chunk_min
+.. confval:: osd_scrub_chunk_max
+.. confval:: osd_scrub_sleep
+.. confval:: osd_deep_scrub_interval
+.. confval:: osd_scrub_interval_randomize_ratio
+.. confval:: osd_deep_scrub_stride
+.. confval:: osd_scrub_auto_repair
+.. confval:: osd_scrub_auto_repair_num_errors
+
+.. index:: OSD; operations settings
+
+Operations
+==========
+
+.. confval:: osd_op_num_shards
+.. confval:: osd_op_num_shards_hdd
+.. confval:: osd_op_num_shards_ssd
+.. confval:: osd_op_queue
+.. confval:: osd_op_queue_cut_off
+.. confval:: osd_client_op_priority
+.. confval:: osd_recovery_op_priority
+.. confval:: osd_scrub_priority
+.. confval:: osd_requested_scrub_priority
+.. confval:: osd_snap_trim_priority
+.. confval:: osd_snap_trim_sleep
+.. confval:: osd_snap_trim_sleep_hdd
+.. confval:: osd_snap_trim_sleep_ssd
+.. confval:: osd_snap_trim_sleep_hybrid
+.. confval:: osd_op_thread_timeout
+.. confval:: osd_op_complaint_time
+.. confval:: osd_op_history_size
+.. confval:: osd_op_history_duration
+.. confval:: osd_op_log_threshold
+.. confval:: osd_op_thread_suicide_timeout
+
+.. note:: See https://old.ceph.com/planet/dealing-with-some-osd-timeouts/ for
+   more on ``osd_op_thread_suicide_timeout``. Be aware that this is a link to a
+   reworking of a blog post from 2017, and that its conclusion will direct you
+   back to this page "for more information".
+
+.. _dmclock-qos:
+
+QoS Based on mClock
+-------------------
+
+Ceph's use of mClock is now more refined and can be used by following the
+steps described in `mClock Config Reference`_.
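+
+To check which operation queue an OSD is using, and which mClock profile
+applies to it, you can query the configuration database. This is only a
+sketch; ``osd_mclock_profile`` is documented in the `mClock Config Reference`_
+rather than on this page:
+
+.. prompt:: bash $
+
+   ceph config get osd osd_op_queue
+   ceph config get osd osd_mclock_profile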
+
+Core Concepts
+`````````````
+
+Ceph's QoS support is implemented using a queueing scheduler based on `the
+dmClock algorithm`_. This algorithm allocates the I/O resources of the Ceph
+cluster in proportion to weights, and enforces the constraints of minimum
+reservation and maximum limitation, so that the services can compete for the
+resources fairly. Currently the *mclock_scheduler* operation queue divides the
+Ceph services involving I/O resources into the following buckets:
+
+- client op: the IOPS issued by a client
+- osd subop: the IOPS issued by a primary OSD
+- snap trim: requests related to snap trimming
+- pg recovery: requests related to recovery
+- pg scrub: requests related to scrubbing
+
+The resources are partitioned using the following three sets of tags. In other
+words, the share of each type of service is controlled by three tags:
+
+#. reservation: the minimum IOPS allocated for the service.
+#. limitation: the maximum IOPS allocated for the service.
+#. weight: the proportional share of capacity if extra capacity is available
+   or the system is oversubscribed.
+
+In Ceph, operations are graded with a "cost", and the resources allocated for
+serving the various services are consumed by these "costs". So, for example,
+the more reservation a service has, the more resources it is guaranteed to
+possess, as long as it requires them. Assume there are two services, recovery
+and client ops:
+
+- recovery: (r:1, l:5, w:1)
+- client ops: (r:2, l:0, w:9)
+
+The settings above ensure that recovery won't get more than 5 requests per
+second serviced, even if it requires that many (see the CURRENT IMPLEMENTATION
+NOTE below) and no other services are competing with it. But if the clients
+start to issue a large number of I/O requests, they will not exhaust all of
+the I/O resources either: 1 request per second is always allocated for
+recovery jobs as long as there are any such requests, so recovery jobs won't
+be starved even in a cluster with high load. In the meantime, the client ops
+can enjoy a larger portion of the I/O resources, because their weight is "9"
+while their competitor's is "1". In the case of client ops, the limit setting
+does not clamp them, so they can make use of all the resources if there is no
+recovery ongoing.
+
+CURRENT IMPLEMENTATION NOTE: the current implementation enforces the limit
+values. Therefore, if a service crosses the enforced limit, the op remains
+in the operation queue until the limit is restored.
+
+Subtleties of mClock
+````````````````````
+
+The reservation and limit values have a unit of requests per second. The
+weight, however, does not technically have a unit, and the weights are
+relative to one another. So if one class of requests has a weight of 1 and
+another a weight of 9, then the latter class of requests should get executed
+at a 9 to 1 ratio relative to the first class. However, that will only happen
+once the reservations are met, and those values include the operations
+executed under the reservation phase.
+
+Even though the weights do not have units, one must be careful in choosing
+their values due to how the algorithm assigns weight tags to requests. If the
+weight is *W*, then for a given class of requests, the next one that comes in
+will have a weight tag of *1/W* plus the previous weight tag or the current
+time, whichever is larger. That means that if *W* is sufficiently large and
+therefore *1/W* is sufficiently small, the calculated tag may never be
+assigned, as it will get the value of the current time. The ultimate lesson is
+that values for weight should not be too large. They should be kept under the
+number of requests one expects to be serviced each second.
+
+Caveats
+```````
+
+There are some factors that can reduce the impact of the mClock op queues
+within Ceph. First, requests to an OSD are sharded by their placement group
+identifier. Each shard has its own mClock queue, and these queues neither
+interact with nor share information with one another. The number of shards
+can be controlled with the configuration options :confval:`osd_op_num_shards`,
+:confval:`osd_op_num_shards_hdd`, and :confval:`osd_op_num_shards_ssd`. A
+lower number of shards will increase the impact of the mClock queues, but may
+have other deleterious effects.
+
+Second, requests are transferred from the operation queue to the operation
+sequencer, in which they go through the phases of execution. The operation
+queue is where mClock resides and where mClock determines the next op to
+transfer to the operation sequencer. The number of operations allowed in the
+operation sequencer is a complex issue. In general we want to keep enough
+operations in the sequencer so that it is always getting work done on some
+operations while it is waiting for disk and network access to complete on
+other operations. On the other hand, once an operation is transferred to the
+operation sequencer, mClock no longer has control over it. Therefore, to
+maximize the impact of mClock, we want to keep as few operations in the
+operation sequencer as possible. So we have an inherent tension.
+
+The configuration options that influence the number of operations in the
+operation sequencer are :confval:`bluestore_throttle_bytes`,
+:confval:`bluestore_throttle_deferred_bytes`,
+:confval:`bluestore_throttle_cost_per_io`,
+:confval:`bluestore_throttle_cost_per_io_hdd`, and
+:confval:`bluestore_throttle_cost_per_io_ssd`.
+
+A third factor that affects the impact of the mClock algorithm is that we're
+using a distributed system, where requests are made to multiple OSDs and each
+OSD has (or can have) multiple shards. Yet we're currently using the mClock
+algorithm, which is not distributed (note: dmClock is the distributed version
+of mClock).
+
+Various organizations and individuals are currently experimenting with mClock
+as it exists in this code base, along with their modifications to the code
+base. We hope you'll share your experiences with your mClock and dmClock
+experiments on the ``ceph-devel`` mailing list.
+
+.. confval:: osd_async_recovery_min_cost
+.. confval:: osd_push_per_object_cost
+.. confval:: osd_mclock_scheduler_client_res
+.. confval:: osd_mclock_scheduler_client_wgt
+.. confval:: osd_mclock_scheduler_client_lim
+.. confval:: osd_mclock_scheduler_background_recovery_res
+.. confval:: osd_mclock_scheduler_background_recovery_wgt
+.. confval:: osd_mclock_scheduler_background_recovery_lim
+.. confval:: osd_mclock_scheduler_background_best_effort_res
+.. confval:: osd_mclock_scheduler_background_best_effort_wgt
+.. confval:: osd_mclock_scheduler_background_best_effort_lim
+
+.. _the dmClock algorithm: https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Gulati.pdf
+
+.. index:: OSD; backfilling
+
+Backfilling
+===========
+
+When you add Ceph OSD Daemons to a cluster or remove them from it, CRUSH will
+rebalance the cluster by moving placement groups to or from Ceph OSDs to
+restore balanced utilization. The process of migrating placement groups and
+the objects they contain can reduce the cluster's operational performance
+considerably. To maintain operational performance, Ceph performs this
+migration with 'backfilling', which allows Ceph to set backfill operations to
+a lower priority than requests to read or write data.
+
+.. confval:: osd_max_backfills
+.. confval:: osd_backfill_scan_min
+.. confval:: osd_backfill_scan_max
+.. confval:: osd_backfill_retry_interval
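+
+As a practical sketch, operators who want rebalancing to intrude less on
+client I/O sometimes lower the backfill concurrency and restore the default
+once data movement has finished. Note that when the mClock scheduler is
+active, backfill and recovery limits may instead be governed by the active
+mClock profile (see the `mClock Config Reference`_):
+
+.. prompt:: bash $
+
+   ceph config set osd osd_max_backfills 1
+   # later, return to the default
+   ceph config rm osd osd_max_backfills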
+
+.. index:: OSD; osdmap
+
+OSD Map
+=======
+
+OSD maps reflect the OSD daemons operating in the cluster. Over time, the
+number of map epochs increases. Ceph provides some settings to ensure that
+Ceph performs well as the OSD map grows larger.
+
+.. confval:: osd_map_dedup
+.. confval:: osd_map_cache_size
+.. confval:: osd_map_message_max
+
+.. index:: OSD; recovery
+
+Recovery
+========
+
+When the cluster starts, or when a Ceph OSD Daemon crashes and restarts, the
+OSD begins peering with other Ceph OSD Daemons before writes can occur. See
+`Monitoring OSDs and PGs`_ for details.
+
+If a Ceph OSD Daemon crashes and comes back online, usually it will be out of
+sync with other Ceph OSD Daemons containing more recent versions of objects
+in the placement groups. When this happens, the Ceph OSD Daemon goes into
+recovery mode and seeks to get the latest copy of the data and bring its map
+back up to date. Depending upon how long the Ceph OSD Daemon was down, the
+OSD's objects and placement groups may be significantly out of date. Also, if
+a failure domain went down (e.g., a rack), more than one Ceph OSD Daemon may
+come back online at the same time. This can make the recovery process time
+consuming and resource intensive.
+
+To maintain operational performance, Ceph performs recovery with limitations
+on the number of recovery requests, threads, and object chunk sizes, which
+allows Ceph to perform well in a degraded state.
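+
+For instance, recovery on spinning media might be slowed during busy hours
+with something like the following sketch (the values are illustrative; as with
+backfill, an active mClock profile may override these options, so consult the
+`mClock Config Reference`_ first):
+
+.. prompt:: bash $
+
+   ceph config set osd osd_recovery_max_active_hdd 1
+   ceph config set osd osd_recovery_sleep_hdd 0.2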
+
+.. confval:: osd_recovery_delay_start
+.. confval:: osd_recovery_max_active
+.. confval:: osd_recovery_max_active_hdd
+.. confval:: osd_recovery_max_active_ssd
+.. confval:: osd_recovery_max_chunk
+.. confval:: osd_recovery_max_single_start
+.. confval:: osd_recover_clone_overlap
+.. confval:: osd_recovery_sleep
+.. confval:: osd_recovery_sleep_hdd
+.. confval:: osd_recovery_sleep_ssd
+.. confval:: osd_recovery_sleep_hybrid
+.. confval:: osd_recovery_priority
+
+Tiering
+=======
+
+.. confval:: osd_agent_max_ops
+.. confval:: osd_agent_max_low_ops
+
+See `cache target dirty high ratio`_ for when the tiering agent flushes dirty
+objects within the high speed mode.
+
+Miscellaneous
+=============
+
+.. confval:: osd_default_notify_timeout
+.. confval:: osd_check_for_log_corruption
+.. confval:: osd_delete_sleep
+.. confval:: osd_delete_sleep_hdd
+.. confval:: osd_delete_sleep_ssd
+.. confval:: osd_delete_sleep_hybrid
+.. confval:: osd_command_max_records
+.. confval:: osd_fast_fail_on_connection_refused
+
+.. _pool: ../../operations/pools
+.. _Configuring Monitor/OSD Interaction: ../mon-osd-interaction
+.. _Monitoring OSDs and PGs: ../../operations/monitoring-osd-pg#peering
+.. _Pool & PG Config Reference: ../pool-pg-config-ref
+.. _Journal Config Reference: ../journal-ref
+.. _cache target dirty high ratio: ../../operations/pools#cache-target-dirty-high-ratio
+.. _mClock Config Reference: ../mclock-config-ref
diff --git a/doc/rados/configuration/pool-pg-config-ref.rst b/doc/rados/configuration/pool-pg-config-ref.rst
new file mode 100644
index 000000000..902c80346
--- /dev/null
+++ b/doc/rados/configuration/pool-pg-config-ref.rst
@@ -0,0 +1,46 @@
+.. _rados_config_pool_pg_crush_ref:
+
+======================================
+ Pool, PG and CRUSH Config Reference
+======================================
+
+.. index:: pools; configuration
+
+Ceph uses default values to determine how many placement groups (PGs) will be
+assigned to each pool. We recommend overriding some of the defaults.
+Specifically, we recommend setting a pool's replica size and overriding the
+default number of placement groups. You can set these values when running
+`pool`_ commands. You can also override the defaults by adding new ones in the
+``[global]`` section of your Ceph configuration file, as in the sample
+configuration below.
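+
+As a sketch of doing so with pool commands (``mypool`` is a placeholder pool
+name and the numbers are only illustrative):
+
+.. prompt:: bash $
+
+   ceph osd pool create mypool 128
+   ceph osd pool set mypool size 3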
+
+.. literalinclude:: pool-pg.conf
+   :language: ini
+
+.. confval:: mon_max_pool_pg_num
+.. confval:: mon_pg_stuck_threshold
+.. confval:: mon_pg_warn_min_per_osd
+.. confval:: mon_pg_warn_min_objects
+.. confval:: mon_pg_warn_min_pool_objects
+.. confval:: mon_pg_check_down_all_threshold
+.. confval:: mon_pg_warn_max_object_skew
+.. confval:: mon_delta_reset_interval
+.. confval:: osd_crush_chooseleaf_type
+.. confval:: osd_crush_initial_weight
+.. confval:: osd_pool_default_crush_rule
+.. confval:: osd_pool_erasure_code_stripe_unit
+.. confval:: osd_pool_default_size
+.. confval:: osd_pool_default_min_size
+.. confval:: osd_pool_default_pg_num
+.. confval:: osd_pool_default_pgp_num
+.. confval:: osd_pool_default_pg_autoscale_mode
+.. confval:: osd_pool_default_flags
+.. confval:: osd_max_pgls
+.. confval:: osd_min_pg_log_entries
+.. confval:: osd_max_pg_log_entries
+.. confval:: osd_default_data_pool_replay_window
+.. confval:: osd_max_pg_per_osd_hard_ratio
+
+.. _pool: ../../operations/pools
+.. _Monitoring OSDs and PGs: ../../operations/monitoring-osd-pg#peering
+.. _Weighting Bucket Items: ../../operations/crush-map#weightingbucketitems
diff --git a/doc/rados/configuration/pool-pg.conf b/doc/rados/configuration/pool-pg.conf
new file mode 100644
index 000000000..6765d37df
--- /dev/null
+++ b/doc/rados/configuration/pool-pg.conf
@@ -0,0 +1,21 @@
+[global]
+
+    # By default, Ceph makes three replicas of RADOS objects. If you want to
+    # maintain four copies of an object instead--a primary copy and three
+    # replica copies--reset the default value as shown in
+    # 'osd_pool_default_size'. If you want to allow Ceph to accept an I/O
+    # operation to a degraded PG, set 'osd_pool_default_min_size' to a number
+    # less than the 'osd_pool_default_size' value.
+
+    osd_pool_default_size = 3      # Write an object three times.
+    osd_pool_default_min_size = 2  # Accept an I/O operation to a PG that has two copies of an object.
+
+    # Note: by default, PG autoscaling is enabled and this value is used only
+    # in specific circumstances. It is however still recommended to set it.
+    # Ensure you have a realistic number of placement groups. We recommend
+    # approximately 100 per OSD: that is, the total number of OSDs multiplied
+    # by 100 and divided by the number of replicas (i.e.,
+    # 'osd_pool_default_size'). So for 10 OSDs and 'osd_pool_default_size' = 4,
+    # we'd recommend approximately (100 * 10) / 4 = 250. Round to the nearest
+    # power of two.
+    osd_pool_default_pg_num = 256
diff --git a/doc/rados/configuration/storage-devices.rst b/doc/rados/configuration/storage-devices.rst
new file mode 100644
index 000000000..c83e87da7
--- /dev/null
+++ b/doc/rados/configuration/storage-devices.rst
@@ -0,0 +1,93 @@
+=================
+ Storage Devices
+=================
+
+There are several Ceph daemons in a storage cluster:
+
+.. _rados_configuration_storage-devices_ceph_osd:
+
+* **Ceph OSDs** (Object Storage Daemons) store most of the data
+  in Ceph. Usually each OSD is backed by a single storage device.
+  This can be a traditional hard disk (HDD) or a solid state disk
+  (SSD). OSDs can also be backed by a combination of devices: for
+  example, an HDD for most data and an SSD (or partition of an
+  SSD) for some metadata. The number of OSDs in a cluster is
+  usually a function of the amount of data to be stored, the size
+  of each storage device, and the level and type of redundancy
+  specified (replication or erasure coding).
+* **Ceph Monitor** daemons manage critical cluster state. This
+  includes cluster membership and authentication information.
+  Small clusters require only a few gigabytes of storage to hold
+  the monitor database. In large clusters, however, the monitor
+  database can reach sizes of tens of gigabytes to hundreds of
+  gigabytes.
+* **Ceph Manager** daemons run alongside monitor daemons, providing
+  additional monitoring and providing interfaces to external
+  monitoring and management systems.
+
+.. _rados_config_storage_devices_osd_backends:
+
+OSD Back Ends
+=============
+
+There are two ways that OSDs manage the data they store. As of the Luminous
+12.2.z release, the default (and recommended) back end is *BlueStore*.
Prior +to the Luminous release, the default (and only) back end was *Filestore*. + +.. _rados_config_storage_devices_bluestore: + +BlueStore +--------- + +BlueStore is a special-purpose storage back end designed specifically for +managing data on disk for Ceph OSD workloads. BlueStore's design is based on +a decade of experience of supporting and managing Filestore OSDs. + +Key BlueStore features include: + +* Direct management of storage devices. BlueStore consumes raw block devices or + partitions. This avoids intervening layers of abstraction (such as local file + systems like XFS) that can limit performance or add complexity. +* Metadata management with RocksDB. RocksDB's key/value database is embedded + in order to manage internal metadata, including the mapping of object + names to block locations on disk. +* Full data and metadata checksumming. By default, all data and + metadata written to BlueStore is protected by one or more + checksums. No data or metadata is read from disk or returned + to the user without being verified. +* Inline compression. Data can be optionally compressed before being written + to disk. +* Multi-device metadata tiering. BlueStore allows its internal + journal (write-ahead log) to be written to a separate, high-speed + device (like an SSD, NVMe, or NVDIMM) for increased performance. If + a significant amount of faster storage is available, internal + metadata can be stored on the faster device. +* Efficient copy-on-write. RBD and CephFS snapshots rely on a + copy-on-write *clone* mechanism that is implemented efficiently in + BlueStore. This results in efficient I/O both for regular snapshots + and for erasure-coded pools (which rely on cloning to implement + efficient two-phase commits). + +For more information, see :doc:`bluestore-config-ref` and :doc:`/rados/operations/bluestore-migration`. + +FileStore +--------- +.. warning:: Filestore has been deprecated in the Reef release and is no longer supported. + + +FileStore is the legacy approach to storing objects in Ceph. It +relies on a standard file system (normally XFS) in combination with a +key/value database (traditionally LevelDB, now RocksDB) for some +metadata. + +FileStore is well-tested and widely used in production. However, it +suffers from many performance deficiencies due to its overall design +and its reliance on a traditional file system for object data storage. + +Although FileStore is capable of functioning on most POSIX-compatible +file systems (including btrfs and ext4), we recommend that only the +XFS file system be used with Ceph. Both btrfs and ext4 have known bugs and +deficiencies and their use may lead to data loss. By default, all Ceph +provisioning tools use XFS. + +For more information, see :doc:`filestore-config-ref`. |