Diffstat (limited to 'doc/rados/configuration')
20 files changed, 6824 insertions, 0 deletions
diff --git a/doc/rados/configuration/auth-config-ref.rst b/doc/rados/configuration/auth-config-ref.rst new file mode 100644 index 000000000..5cc13ff6a --- /dev/null +++ b/doc/rados/configuration/auth-config-ref.rst @@ -0,0 +1,362 @@ +======================== + Cephx Config Reference +======================== + +The ``cephx`` protocol is enabled by default. Cryptographic authentication has +some computational costs, though they should generally be quite low. If the +network environment connecting your client and server hosts is very safe and +you cannot afford authentication, you can turn it off. **This is not generally +recommended**. + +.. note:: If you disable authentication, you are at risk of a man-in-the-middle + attack altering your client/server messages, which could lead to disastrous + security effects. + +For creating users, see `User Management`_. For details on the architecture +of Cephx, see `Architecture - High Availability Authentication`_. + + +Deployment Scenarios +==================== + +There are two main scenarios for deploying a Ceph cluster, which impact +how you initially configure Cephx. Most first time Ceph users use +``cephadm`` to create a cluster (easiest). For clusters using +other deployment tools (e.g., Chef, Juju, Puppet, etc.), you will need +to use the manual procedures or configure your deployment tool to +bootstrap your monitor(s). + +Manual Deployment +----------------- + +When you deploy a cluster manually, you have to bootstrap the monitor manually +and create the ``client.admin`` user and keyring. To bootstrap monitors, follow +the steps in `Monitor Bootstrapping`_. The steps for monitor bootstrapping are +the logical steps you must perform when using third party deployment tools like +Chef, Puppet, Juju, etc. + + +Enabling/Disabling Cephx +======================== + +Enabling Cephx requires that you have deployed keys for your monitors, +OSDs and metadata servers. If you are simply toggling Cephx on / off, +you do not have to repeat the bootstrapping procedures. + + +Enabling Cephx +-------------- + +When ``cephx`` is enabled, Ceph will look for the keyring in the default search +path, which includes ``/etc/ceph/$cluster.$name.keyring``. You can override +this location by adding a ``keyring`` option in the ``[global]`` section of +your `Ceph configuration`_ file, but this is not recommended. + +Execute the following procedures to enable ``cephx`` on a cluster with +authentication disabled. If you (or your deployment utility) have already +generated the keys, you may skip the steps related to generating keys. + +#. Create a ``client.admin`` key, and save a copy of the key for your client + host + + .. prompt:: bash $ + + ceph auth get-or-create client.admin mon 'allow *' mds 'allow *' mgr 'allow *' osd 'allow *' -o /etc/ceph/ceph.client.admin.keyring + + **Warning:** This will clobber any existing + ``/etc/ceph/client.admin.keyring`` file. Do not perform this step if a + deployment tool has already done it for you. Be careful! + +#. Create a keyring for your monitor cluster and generate a monitor + secret key. + + .. prompt:: bash $ + + ceph-authtool --create-keyring /tmp/ceph.mon.keyring --gen-key -n mon. --cap mon 'allow *' + +#. Copy the monitor keyring into a ``ceph.mon.keyring`` file in every monitor's + ``mon data`` directory. For example, to copy it to ``mon.a`` in cluster ``ceph``, + use the following + + .. prompt:: bash $ + + cp /tmp/ceph.mon.keyring /var/lib/ceph/mon/ceph-a/keyring + +#. 
Generate a secret key for every MGR, where ``{$id}`` is the MGR letter + + .. prompt:: bash $ + + ceph auth get-or-create mgr.{$id} mon 'allow profile mgr' mds 'allow *' osd 'allow *' -o /var/lib/ceph/mgr/ceph-{$id}/keyring + +#. Generate a secret key for every OSD, where ``{$id}`` is the OSD number + + .. prompt:: bash $ + + ceph auth get-or-create osd.{$id} mon 'allow rwx' osd 'allow *' -o /var/lib/ceph/osd/ceph-{$id}/keyring + +#. Generate a secret key for every MDS, where ``{$id}`` is the MDS letter + + .. prompt:: bash $ + + ceph auth get-or-create mds.{$id} mon 'allow rwx' osd 'allow *' mds 'allow *' mgr 'allow profile mds' -o /var/lib/ceph/mds/ceph-{$id}/keyring + +#. Enable ``cephx`` authentication by setting the following options in the + ``[global]`` section of your `Ceph configuration`_ file + + .. code-block:: ini + + auth_cluster_required = cephx + auth_service_required = cephx + auth_client_required = cephx + + +#. Start or restart the Ceph cluster. See `Operating a Cluster`_ for details. + +For details on bootstrapping a monitor manually, see `Manual Deployment`_. + + + +Disabling Cephx +--------------- + +The following procedure describes how to disable Cephx. If your cluster +environment is relatively safe, you can offset the computation expense of +running authentication. **We do not recommend it.** However, it may be easier +during setup and/or troubleshooting to temporarily disable authentication. + +#. Disable ``cephx`` authentication by setting the following options in the + ``[global]`` section of your `Ceph configuration`_ file + + .. code-block:: ini + + auth_cluster_required = none + auth_service_required = none + auth_client_required = none + + +#. Start or restart the Ceph cluster. See `Operating a Cluster`_ for details. + + +Configuration Settings +====================== + +Enablement +---------- + + +``auth_cluster_required`` + +:Description: If enabled, the Ceph Storage Cluster daemons (i.e., ``ceph-mon``, + ``ceph-osd``, ``ceph-mds`` and ``ceph-mgr``) must authenticate with + each other. Valid settings are ``cephx`` or ``none``. + +:Type: String +:Required: No +:Default: ``cephx``. + + +``auth_service_required`` + +:Description: If enabled, the Ceph Storage Cluster daemons require Ceph Clients + to authenticate with the Ceph Storage Cluster in order to access + Ceph services. Valid settings are ``cephx`` or ``none``. + +:Type: String +:Required: No +:Default: ``cephx``. + + +``auth_client_required`` + +:Description: If enabled, the Ceph Client requires the Ceph Storage Cluster to + authenticate with the Ceph Client. Valid settings are ``cephx`` + or ``none``. + +:Type: String +:Required: No +:Default: ``cephx``. + + +.. index:: keys; keyring + +Keys +---- + +When you run Ceph with authentication enabled, ``ceph`` administrative commands +and Ceph Clients require authentication keys to access the Ceph Storage Cluster. + +The most common way to provide these keys to the ``ceph`` administrative +commands and clients is to include a Ceph keyring under the ``/etc/ceph`` +directory. For Octopus and later releases using ``cephadm``, the filename +is usually ``ceph.client.admin.keyring`` (or ``$cluster.client.admin.keyring``). +If you include the keyring under the ``/etc/ceph`` directory, you don't need to +specify a ``keyring`` entry in your Ceph configuration file. + +We recommend copying the Ceph Storage Cluster's keyring file to nodes where you +will run administrative commands, because it contains the ``client.admin`` key. 
+ +To perform this step manually, execute the following: + +.. prompt:: bash $ + + sudo scp {user}@{ceph-cluster-host}:/etc/ceph/ceph.client.admin.keyring /etc/ceph/ceph.client.admin.keyring + +.. tip:: Ensure the ``ceph.keyring`` file has appropriate permissions set + (e.g., ``chmod 644``) on your client machine. + +You may specify the key itself in the Ceph configuration file using the ``key`` +setting (not recommended), or a path to a keyfile using the ``keyfile`` setting. + + +``keyring`` + +:Description: The path to the keyring file. +:Type: String +:Required: No +:Default: ``/etc/ceph/$cluster.$name.keyring,/etc/ceph/$cluster.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin`` + + +``keyfile`` + +:Description: The path to a key file (i.e,. a file containing only the key). +:Type: String +:Required: No +:Default: None + + +``key`` + +:Description: The key (i.e., the text string of the key itself). Not recommended. +:Type: String +:Required: No +:Default: None + + +Daemon Keyrings +--------------- + +Administrative users or deployment tools (e.g., ``cephadm``) may generate +daemon keyrings in the same way as generating user keyrings. By default, Ceph +stores daemons keyrings inside their data directory. The default keyring +locations, and the capabilities necessary for the daemon to function, are shown +below. + +``ceph-mon`` + +:Location: ``$mon_data/keyring`` +:Capabilities: ``mon 'allow *'`` + +``ceph-osd`` + +:Location: ``$osd_data/keyring`` +:Capabilities: ``mgr 'allow profile osd' mon 'allow profile osd' osd 'allow *'`` + +``ceph-mds`` + +:Location: ``$mds_data/keyring`` +:Capabilities: ``mds 'allow' mgr 'allow profile mds' mon 'allow profile mds' osd 'allow rwx'`` + +``ceph-mgr`` + +:Location: ``$mgr_data/keyring`` +:Capabilities: ``mon 'allow profile mgr' mds 'allow *' osd 'allow *'`` + +``radosgw`` + +:Location: ``$rgw_data/keyring`` +:Capabilities: ``mon 'allow rwx' osd 'allow rwx'`` + + +.. note:: The monitor keyring (i.e., ``mon.``) contains a key but no + capabilities, and is not part of the cluster ``auth`` database. + +The daemon data directory locations default to directories of the form:: + + /var/lib/ceph/$type/$cluster-$id + +For example, ``osd.12`` would be:: + + /var/lib/ceph/osd/ceph-12 + +You can override these locations, but it is not recommended. + + +.. index:: signatures + +Signatures +---------- + +Ceph performs a signature check that provides some limited protection +against messages being tampered with in flight (e.g., by a "man in the +middle" attack). + +Like other parts of Ceph authentication, Ceph provides fine-grained control so +you can enable/disable signatures for service messages between clients and +Ceph, and so you can enable/disable signatures for messages between Ceph daemons. + +Note that even with signatures enabled data is not encrypted in +flight. + +``cephx_require_signatures`` + +:Description: If set to ``true``, Ceph requires signatures on all message + traffic between the Ceph Client and the Ceph Storage Cluster, and + between daemons comprising the Ceph Storage Cluster. + + Ceph Argonaut and Linux kernel versions prior to 3.19 do + not support signatures; if such clients are in use this + option can be turned off to allow them to connect. + +:Type: Boolean +:Required: No +:Default: ``false`` + + +``cephx_cluster_require_signatures`` + +:Description: If set to ``true``, Ceph requires signatures on all message + traffic between Ceph daemons comprising the Ceph Storage Cluster. 
+ +:Type: Boolean +:Required: No +:Default: ``false`` + + +``cephx_service_require_signatures`` + +:Description: If set to ``true``, Ceph requires signatures on all message + traffic between Ceph Clients and the Ceph Storage Cluster. + +:Type: Boolean +:Required: No +:Default: ``false`` + + +``cephx_sign_messages`` + +:Description: If the Ceph version supports message signing, Ceph will sign + all messages so they are more difficult to spoof. + +:Type: Boolean +:Default: ``true`` + + +Time to Live +------------ + +``auth_service_ticket_ttl`` + +:Description: When the Ceph Storage Cluster sends a Ceph Client a ticket for + authentication, the Ceph Storage Cluster assigns the ticket a + time to live. + +:Type: Double +:Default: ``60*60`` + + +.. _Monitor Bootstrapping: ../../../install/manual-deployment#monitor-bootstrapping +.. _Operating a Cluster: ../../operations/operating +.. _Manual Deployment: ../../../install/manual-deployment +.. _Ceph configuration: ../ceph-conf +.. _Architecture - High Availability Authentication: ../../../architecture#high-availability-authentication +.. _User Management: ../../operations/user-management diff --git a/doc/rados/configuration/bluestore-config-ref.rst b/doc/rados/configuration/bluestore-config-ref.rst new file mode 100644 index 000000000..3bfc8e295 --- /dev/null +++ b/doc/rados/configuration/bluestore-config-ref.rst @@ -0,0 +1,482 @@ +========================== +BlueStore Config Reference +========================== + +Devices +======= + +BlueStore manages either one, two, or (in certain cases) three storage +devices. + +In the simplest case, BlueStore consumes a single (primary) storage device. +The storage device is normally used as a whole, occupying the full device that +is managed directly by BlueStore. This *primary device* is normally identified +by a ``block`` symlink in the data directory. + +The data directory is a ``tmpfs`` mount which gets populated (at boot time, or +when ``ceph-volume`` activates it) with all the common OSD files that hold +information about the OSD, like: its identifier, which cluster it belongs to, +and its private keyring. + +It is also possible to deploy BlueStore across one or two additional devices: + +* A *write-ahead log (WAL) device* (identified as ``block.wal`` in the data directory) can be + used for BlueStore's internal journal or write-ahead log. It is only useful + to use a WAL device if the device is faster than the primary device (e.g., + when it is on an SSD and the primary device is an HDD). +* A *DB device* (identified as ``block.db`` in the data directory) can be used + for storing BlueStore's internal metadata. BlueStore (or rather, the + embedded RocksDB) will put as much metadata as it can on the DB device to + improve performance. If the DB device fills up, metadata will spill back + onto the primary device (where it would have been otherwise). Again, it is + only helpful to provision a DB device if it is faster than the primary + device. + +If there is only a small amount of fast storage available (e.g., less +than a gigabyte), we recommend using it as a WAL device. If there is +more, provisioning a DB device makes more sense. The BlueStore +journal will always be placed on the fastest device available, so +using a DB device will provide the same benefit that the WAL device +would while *also* allowing additional metadata to be stored there (if +it will fit). 
This means that if a DB device is specified but an explicit +WAL device is not, the WAL will be implicitly colocated with the DB on the faster +device. + +A single-device (colocated) BlueStore OSD can be provisioned with: + +.. prompt:: bash $ + + ceph-volume lvm prepare --bluestore --data <device> + +To specify a WAL device and/or DB device: + +.. prompt:: bash $ + + ceph-volume lvm prepare --bluestore --data <device> --block.wal <wal-device> --block.db <db-device> + +.. note:: ``--data`` can be a Logical Volume using *vg/lv* notation. Other + devices can be existing logical volumes or GPT partitions. + +Provisioning strategies +----------------------- +Although there are multiple ways to deploy a BlueStore OSD (unlike Filestore +which had just one), there are two common arrangements that should help clarify +the deployment strategy: + +.. _bluestore-single-type-device-config: + +**block (data) only** +^^^^^^^^^^^^^^^^^^^^^ +If all devices are the same type, for example all rotational drives, and +there are no fast devices to use for metadata, it makes sense to specify the +block device only and to not separate ``block.db`` or ``block.wal``. The +:ref:`ceph-volume-lvm` command for a single ``/dev/sda`` device looks like: + +.. prompt:: bash $ + + ceph-volume lvm create --bluestore --data /dev/sda + +If logical volumes have already been created for each device, (a single LV +using 100% of the device), then the :ref:`ceph-volume-lvm` call for an LV named +``ceph-vg/block-lv`` would look like: + +.. prompt:: bash $ + + ceph-volume lvm create --bluestore --data ceph-vg/block-lv + +.. _bluestore-mixed-device-config: + +**block and block.db** +^^^^^^^^^^^^^^^^^^^^^^ +If you have a mix of fast and slow devices (SSD / NVMe and rotational), +it is recommended to place ``block.db`` on the faster device while ``block`` +(data) lives on the slower (spinning drive). + +You must create these volume groups and logical volumes manually as +the ``ceph-volume`` tool is currently not able to do so automatically. + +For the below example, let us assume four rotational (``sda``, ``sdb``, ``sdc``, and ``sdd``) +and one (fast) solid state drive (``sdx``). First create the volume groups: + +.. prompt:: bash $ + + vgcreate ceph-block-0 /dev/sda + vgcreate ceph-block-1 /dev/sdb + vgcreate ceph-block-2 /dev/sdc + vgcreate ceph-block-3 /dev/sdd + +Now create the logical volumes for ``block``: + +.. prompt:: bash $ + + lvcreate -l 100%FREE -n block-0 ceph-block-0 + lvcreate -l 100%FREE -n block-1 ceph-block-1 + lvcreate -l 100%FREE -n block-2 ceph-block-2 + lvcreate -l 100%FREE -n block-3 ceph-block-3 + +We are creating 4 OSDs for the four slow spinning devices, so assuming a 200GB +SSD in ``/dev/sdx`` we will create 4 logical volumes, each of 50GB: + +.. prompt:: bash $ + + vgcreate ceph-db-0 /dev/sdx + lvcreate -L 50GB -n db-0 ceph-db-0 + lvcreate -L 50GB -n db-1 ceph-db-0 + lvcreate -L 50GB -n db-2 ceph-db-0 + lvcreate -L 50GB -n db-3 ceph-db-0 + +Finally, create the 4 OSDs with ``ceph-volume``: + +.. 
prompt:: bash $ + + ceph-volume lvm create --bluestore --data ceph-block-0/block-0 --block.db ceph-db-0/db-0 + ceph-volume lvm create --bluestore --data ceph-block-1/block-1 --block.db ceph-db-0/db-1 + ceph-volume lvm create --bluestore --data ceph-block-2/block-2 --block.db ceph-db-0/db-2 + ceph-volume lvm create --bluestore --data ceph-block-3/block-3 --block.db ceph-db-0/db-3 + +These operations should end up creating four OSDs, with ``block`` on the slower +rotational drives with a 50 GB logical volume (DB) for each on the solid state +drive. + +Sizing +====== +When using a :ref:`mixed spinning and solid drive setup +<bluestore-mixed-device-config>` it is important to make a large enough +``block.db`` logical volume for BlueStore. Generally, ``block.db`` should have +*as large as possible* logical volumes. + +The general recommendation is to have ``block.db`` size in between 1% to 4% +of ``block`` size. For RGW workloads, it is recommended that the ``block.db`` +size isn't smaller than 4% of ``block``, because RGW heavily uses it to store +metadata (omap keys). For example, if the ``block`` size is 1TB, then ``block.db`` shouldn't +be less than 40GB. For RBD workloads, 1% to 2% of ``block`` size is usually enough. + +In older releases, internal level sizes mean that the DB can fully utilize only +specific partition / LV sizes that correspond to sums of L0, L0+L1, L1+L2, +etc. sizes, which with default settings means roughly 3 GB, 30 GB, 300 GB, and +so forth. Most deployments will not substantially benefit from sizing to +accommodate L3 and higher, though DB compaction can be facilitated by doubling +these figures to 6GB, 60GB, and 600GB. + +Improvements in releases beginning with Nautilus 14.2.12 and Octopus 15.2.6 +enable better utilization of arbitrary DB device sizes, and the Pacific +release brings experimental dynamic level support. Users of older releases may +thus wish to plan ahead by provisioning larger DB devices today so that their +benefits may be realized with future upgrades. + +When *not* using a mix of fast and slow devices, it isn't required to create +separate logical volumes for ``block.db`` (or ``block.wal``). BlueStore will +automatically colocate these within the space of ``block``. + + +Automatic Cache Sizing +====================== + +BlueStore can be configured to automatically resize its caches when TCMalloc +is configured as the memory allocator and the ``bluestore_cache_autotune`` +setting is enabled. This option is currently enabled by default. BlueStore +will attempt to keep OSD heap memory usage under a designated target size via +the ``osd_memory_target`` configuration option. This is a best effort +algorithm and caches will not shrink smaller than the amount specified by +``osd_memory_cache_min``. Cache ratios will be chosen based on a hierarchy +of priorities. If priority information is not available, the +``bluestore_cache_meta_ratio`` and ``bluestore_cache_kv_ratio`` options are +used as fallbacks. + +Manual Cache Sizing +=================== + +The amount of memory consumed by each OSD for BlueStore caches is +determined by the ``bluestore_cache_size`` configuration option. If +that config option is not set (i.e., remains at 0), there is a +different default value that is used depending on whether an HDD or +SSD is used for the primary device (set by the +``bluestore_cache_size_ssd`` and ``bluestore_cache_size_hdd`` config +options). + +BlueStore and the rest of the Ceph OSD daemon do the best they can +to work within this memory budget. 
Note that on top of the configured +cache size, there is also memory consumed by the OSD itself, and +some additional utilization due to memory fragmentation and other +allocator overhead. + +The configured cache memory budget can be used in a few different ways: + +* Key/Value metadata (i.e., RocksDB's internal cache) +* BlueStore metadata +* BlueStore data (i.e., recently read or written object data) + +Cache memory usage is governed by the following options: +``bluestore_cache_meta_ratio`` and ``bluestore_cache_kv_ratio``. +The fraction of the cache devoted to data +is governed by the effective bluestore cache size (depending on +``bluestore_cache_size[_ssd|_hdd]`` settings and the device class of the primary +device) as well as the meta and kv ratios. +The data fraction can be calculated by +``<effective_cache_size> * (1 - bluestore_cache_meta_ratio - bluestore_cache_kv_ratio)`` + +Checksums +========= + +BlueStore checksums all metadata and data written to disk. Metadata +checksumming is handled by RocksDB and uses `crc32c`. Data +checksumming is done by BlueStore and can make use of `crc32c`, +`xxhash32`, or `xxhash64`. The default is `crc32c` and should be +suitable for most purposes. + +Full data checksumming does increase the amount of metadata that +BlueStore must store and manage. When possible, e.g., when clients +hint that data is written and read sequentially, BlueStore will +checksum larger blocks, but in many cases it must store a checksum +value (usually 4 bytes) for every 4 kilobyte block of data. + +It is possible to use a smaller checksum value by truncating the +checksum to two or one byte, reducing the metadata overhead. The +trade-off is that the probability that a random error will not be +detected is higher with a smaller checksum, going from about one in +four billion with a 32-bit (4 byte) checksum to one in 65,536 for a +16-bit (2 byte) checksum or one in 256 for an 8-bit (1 byte) checksum. +The smaller checksum values can be used by selecting `crc32c_16` or +`crc32c_8` as the checksum algorithm. + +The *checksum algorithm* can be set either via a per-pool +``csum_type`` property or the global config option. For example: + +.. prompt:: bash $ + + ceph osd pool set <pool-name> csum_type <algorithm> + +Inline Compression +================== + +BlueStore supports inline compression using `snappy`, `zlib`, or +`lz4`. Please note that the `lz4` compression plugin is not +distributed in the official release. + +Whether data in BlueStore is compressed is determined by a combination +of the *compression mode* and any hints associated with a write +operation. The modes are: + +* **none**: Never compress data. +* **passive**: Do not compress data unless the write operation has a + *compressible* hint set. +* **aggressive**: Compress data unless the write operation has an + *incompressible* hint set. +* **force**: Try to compress data no matter what. + +For more information about the *compressible* and *incompressible* IO +hints, see :c:func:`rados_set_alloc_hint`. + +Note that regardless of the mode, if the size of the data chunk is not +reduced sufficiently it will not be used and the original +(uncompressed) data will be stored. For example, if the ``bluestore +compression required ratio`` is set to ``.7`` then the compressed data +must be 70% of the size of the original (or smaller). 
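As a quick illustration of that threshold (the numbers below are made up for the
example and are not taken from a real pool)::

    compression_required_ratio = .7
    original chunk size        = 64 KiB
    largest acceptable result  = 0.7 * 64 KiB = 44.8 KiB
    # a chunk that only compresses to, say, 50 KiB is stored uncompressed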
+ +The *compression mode*, *compression algorithm*, *compression required +ratio*, *min blob size*, and *max blob size* can be set either via a +per-pool property or a global config option. Pool properties can be +set with: + +.. prompt:: bash $ + + ceph osd pool set <pool-name> compression_algorithm <algorithm> + ceph osd pool set <pool-name> compression_mode <mode> + ceph osd pool set <pool-name> compression_required_ratio <ratio> + ceph osd pool set <pool-name> compression_min_blob_size <size> + ceph osd pool set <pool-name> compression_max_blob_size <size> + +.. _bluestore-rocksdb-sharding: + +RocksDB Sharding +================ + +Internally BlueStore uses multiple types of key-value data, +stored in RocksDB. Each data type in BlueStore is assigned a +unique prefix. Until Pacific all key-value data was stored in +single RocksDB column family: 'default'. Since Pacific, +BlueStore can divide this data into multiple RocksDB column +families. When keys have similar access frequency, modification +frequency and lifetime, BlueStore benefits from better caching +and more precise compaction. This improves performance, and also +requires less disk space during compaction, since each column +family is smaller and can compact independent of others. + +OSDs deployed in Pacific or later use RocksDB sharding by default. +If Ceph is upgraded to Pacific from a previous version, sharding is off. + +To enable sharding and apply the Pacific defaults, stop an OSD and run + + .. prompt:: bash # + + ceph-bluestore-tool \ + --path <data path> \ + --sharding="m(3) p(3,0-12) O(3,0-13)=block_cache={type=binned_lru} L P" \ + reshard + + +Throttling +========== + +SPDK Usage +================== + +If you want to use the SPDK driver for NVMe devices, you must prepare your system. +Refer to `SPDK document`__ for more details. + +.. __: http://www.spdk.io/doc/getting_started.html#getting_started_examples + +SPDK offers a script to configure the device automatically. Users can run the +script as root: + +.. prompt:: bash $ + + sudo src/spdk/scripts/setup.sh + +You will need to specify the subject NVMe device's device selector with +the "spdk:" prefix for ``bluestore_block_path``. + +For example, you can find the device selector of an Intel PCIe SSD with: + +.. prompt:: bash $ + + lspci -mm -n -D -d 8086:0953 + +The device selector always has the form of ``DDDD:BB:DD.FF`` or ``DDDD.BB.DD.FF``. + +and then set:: + + bluestore_block_path = "spdk:trtype:PCIe traddr:0000:01:00.0" + +Where ``0000:01:00.0`` is the device selector found in the output of ``lspci`` +command above. + +You may also specify a remote NVMeoF target over the TCP transport as in the +following example:: + + bluestore_block_path = "spdk:trtype:TCP traddr:10.67.110.197 trsvcid:4420 subnqn:nqn.2019-02.io.spdk:cnode1" + +To run multiple SPDK instances per node, you must specify the +amount of dpdk memory in MB that each instance will use, to make sure each +instance uses its own DPDK memory. + +In most cases, a single device can be used for data, DB, and WAL. We describe +this strategy as *colocating* these components. Be sure to enter the below +settings to ensure that all IOs are issued through SPDK.:: + + bluestore_block_db_path = "" + bluestore_block_db_size = 0 + bluestore_block_wal_path = "" + bluestore_block_wal_size = 0 + +Otherwise, the current implementation will populate the SPDK map files with +kernel file system symbols and will use the kernel driver to issue DB/WAL IO. 
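Putting the SPDK-related settings above together, a minimal illustrative
snippet for a fully colocated SPDK OSD might look like the following (the
``[osd.0]`` section name and the PCIe address are placeholders rather than
values from a real deployment)::

    [osd.0]
    bluestore_block_path = "spdk:trtype:PCIe traddr:0000:01:00.0"
    bluestore_block_db_path = ""
    bluestore_block_db_size = 0
    bluestore_block_wal_path = ""
    bluestore_block_wal_size = 0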
+ +Minimum Allocation Size +======================== +
+There is a configured minimum amount of storage that BlueStore will allocate on +an OSD. In practice, this is the least amount of capacity that a RADOS object +can consume. The value of `bluestore_min_alloc_size` is derived from the +value of `bluestore_min_alloc_size_hdd` or `bluestore_min_alloc_size_ssd` +depending on the OSD's ``rotational`` attribute. This means that when an OSD +is created on an HDD, BlueStore will be initialized with the current value +of `bluestore_min_alloc_size_hdd`, and SSD OSDs (including NVMe devices) +with the value of `bluestore_min_alloc_size_ssd`. +
+Through the Mimic release, the default values were 64KB and 16KB for rotational +(HDD) and non-rotational (SSD) media respectively. Octopus changed the default +for SSD (non-rotational) media to 4KB, and Pacific changed the default for HDD +(rotational) media to 4KB as well. +
+These changes were driven by space amplification experienced by Ceph RADOS +Gateway (RGW) deployments that host large numbers of small files +(S3/Swift objects). +
+For example, when an RGW client stores a 1KB S3 object, it is written to a +single RADOS object. With the default `min_alloc_size` value, 4KB of +underlying drive space is allocated. This means that roughly +(4KB - 1KB) == 3KB is allocated but never used, which corresponds to 300% +overhead or 25% efficiency. Similarly, a 5KB user object will be stored +as one 4KB and one 1KB RADOS object, again stranding 4KB of device capacity, +though in this case the overhead is a much smaller percentage. Think of this +in terms of the remainder from a modulus operation. The overhead *percentage* +thus decreases rapidly as user object size increases. +
+An easily missed additional subtlety is that this +takes place for *each* replica. So when using the default three copies of +data (3R), a 1KB S3 object actually consumes roughly 9KB of storage device +capacity. If erasure coding (EC) is used instead of replication, the +amplification may be even higher: for a ``k=4,m=2`` pool, our 1KB S3 object +will allocate (6 * 4KB) = 24KB of device capacity. +
+When an RGW bucket pool contains many relatively large user objects, the effect +of this phenomenon is often negligible, but should be considered for deployments +that expect a significant fraction of relatively small objects. +
+The 4KB default value aligns well with conventional HDD and SSD devices. Some +new coarse-IU (Indirection Unit) QLC SSDs however perform and wear best +when `bluestore_min_alloc_size_ssd` +is set at OSD creation to match the device's IU: 8KB, 16KB, or even 64KB. +These novel storage drives allow one to achieve read performance competitive +with conventional TLC SSDs and write performance faster than HDDs, with +high density and lower cost than TLC SSDs. +
+Note that when creating OSDs on these devices, one must carefully apply the +non-default value only to appropriate devices, and not to conventional SSD and +HDD devices. This may be done through careful ordering of OSD creation, custom +OSD device classes, and especially by the use of central configuration *masks*. +
+Quincy and later releases add +the `bluestore_use_optimal_io_size_for_min_alloc_size` +option that enables automatic discovery of the appropriate value as each OSD is +created. Note that the use of ``bcache``, ``OpenCAS``, ``dmcrypt``, +``ATA over Ethernet``, `iSCSI`, or other device layering / abstraction +technologies may confound the determination of appropriate values. 
OSDs +deployed on top of VMware storage have also been observed to report a +``rotational`` attribute that does not match the underlying +hardware. +
+We suggest inspecting such OSDs at startup via logs and admin sockets to ensure that +behavior is appropriate. Note that this also may not work as desired with +older kernels. You can check for this by examining the presence and value +of ``/sys/block/<drive>/queue/optimal_io_size``. +
+You may also inspect a given OSD: +
+.. prompt:: bash # +
+   ceph osd metadata osd.1701 | grep rotational +
+This space amplification may manifest as an unusually high ratio of raw to +stored data reported by ``ceph df``. ``ceph osd df`` may also report +anomalously high ``%USE`` / ``VAR`` values when +compared to other, ostensibly identical OSDs. A pool using OSDs with +mismatched ``min_alloc_size`` values may experience unexpected balancer +behavior as well. +
+Note that this BlueStore attribute takes effect *only* at OSD creation; if +changed later, a given OSD's behavior will not change unless / until it is +destroyed and redeployed with the appropriate option value(s). Upgrading +to a later Ceph release will *not* change the value used by OSDs deployed +under older releases or with other settings. +
+DSA (Data Streaming Accelerator Usage) +====================================== +
+If you want to use the DML library to drive a DSA device for offloading +read/write operations on persistent memory in BlueStore, you need to install +the `DML`_ and `idxd-config`_ libraries on a machine with an SPR (Sapphire Rapids) CPU. +
+.. _DML: https://github.com/intel/DML +.. _idxd-config: https://github.com/intel/idxd-config +
+After installing the DML software, configure the shared +work queues (WQs) using the ``accel-config`` tool, as in the following example: +
+.. prompt:: bash $ +
+   accel-config config-wq --group-id=1 --mode=shared --wq-size=16 --threshold=15 --type=user --name="MyApp1" --priority=10 --block-on-fault=1 dsa0/wq0.1 +   accel-config config-engine dsa0/engine0.1 --group-id=1 +   accel-config enable-device dsa0 +   accel-config enable-wq dsa0/wq0.1 diff --git a/doc/rados/configuration/ceph-conf.rst b/doc/rados/configuration/ceph-conf.rst new file mode 100644 index 000000000..ad93598de --- /dev/null +++ b/doc/rados/configuration/ceph-conf.rst @@ -0,0 +1,689 @@ +.. _configuring-ceph: +
+================== + Configuring Ceph +================== +
+When Ceph services start, the initialization process activates a series +of daemons that run in the background. A :term:`Ceph Storage Cluster` runs +at a minimum three types of daemons: +
+- :term:`Ceph Monitor` (``ceph-mon``) +- :term:`Ceph Manager` (``ceph-mgr``) +- :term:`Ceph OSD Daemon` (``ceph-osd``) +
+Ceph Storage Clusters that support the :term:`Ceph File System` also run at +least one :term:`Ceph Metadata Server` (``ceph-mds``). Clusters that +support :term:`Ceph Object Storage` run Ceph RADOS Gateway daemons +(``radosgw``) as well. +
+Each daemon has a number of configuration options, each of which has a +default value. You may adjust the behavior of the system by changing these +configuration options. Be careful to understand the consequences before +overriding default values, as it is possible to significantly degrade the +performance and stability of your cluster. Also note that default values +sometimes change between releases, so it is best to review the version of +this documentation that aligns with your Ceph release. 
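For example, to check an option's compiled-in default, type, and whether it can
be changed at runtime before overriding it, you can query the configuration
schema (``osd_pool_default_size`` is just an arbitrary option chosen for
illustration; the ``ceph config help`` command is covered in more detail in the
Help section below):

.. prompt:: bash $

   ceph config help osd_pool_default_size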
+ +Option names +============ + +All Ceph configuration options have a unique name consisting of words +formed with lower-case characters and connected with underscore +(``_``) characters. + +When option names are specified on the command line, either underscore +(``_``) or dash (``-``) characters can be used interchangeable (e.g., +``--mon-host`` is equivalent to ``--mon_host``). + +When option names appear in configuration files, spaces can also be +used in place of underscore or dash. We suggest, though, that for +clarity and convenience you consistently use underscores, as we do +throughout this documentation. + +Config sources +============== + +Each Ceph daemon, process, and library will pull its configuration +from several sources, listed below. Sources later in the list will +override those earlier in the list when both are present. + +- the compiled-in default value +- the monitor cluster's centralized configuration database +- a configuration file stored on the local host +- environment variables +- command line arguments +- runtime overrides set by an administrator + +One of the first things a Ceph process does on startup is parse the +configuration options provided via the command line, environment, and +local configuration file. The process will then contact the monitor +cluster to retrieve configuration stored centrally for the entire +cluster. Once a complete view of the configuration is available, the +daemon or process startup will proceed. + +.. _bootstrap-options: + +Bootstrap options +----------------- + +Because some configuration options affect the process's ability to +contact the monitors, authenticate, and retrieve the cluster-stored +configuration, they may need to be stored locally on the node and set +in a local configuration file. These options include: + + - ``mon_host``, the list of monitors for the cluster + - ``mon_host_override``, the list of monitors for the cluster to + **initially** contact when beginning a new instance of communication with the + Ceph cluster. This overrides the known monitor list derived from MonMap + updates sent to older Ceph instances (like librados cluster handles). It is + expected this option is primarily useful for debugging. + - ``mon_dns_srv_name`` (default: `ceph-mon`), the name of the DNS + SRV record to check to identify the cluster monitors via DNS + - ``mon_data``, ``osd_data``, ``mds_data``, ``mgr_data``, and + similar options that define which local directory the daemon + stores its data in. + - ``keyring``, ``keyfile``, and/or ``key``, which can be used to + specify the authentication credential to use to authenticate with + the monitor. Note that in most cases the default keyring location + is in the data directory specified above. + +In the vast majority of cases the default values of these are +appropriate, with the exception of the ``mon_host`` option that +identifies the addresses of the cluster's monitors. When DNS is used +to identify monitors a local ceph configuration file can be avoided +entirely. + +Skipping monitor config +----------------------- + +Any process may be passed the option ``--no-mon-config`` to skip the +step that retrieves configuration from the cluster monitors. This is +useful in cases where configuration is managed entirely via +configuration files or where the monitor cluster is currently down but +some maintenance activity needs to be done. + + +.. 
_ceph-conf-file: + + +Configuration sections +====================== + +Any given process or daemon has a single value for each configuration +option. However, values for an option may vary across different +daemon types even daemons of the same type. Ceph options that are +stored in the monitor configuration database or in local configuration +files are grouped into sections to indicate which daemons or clients +they apply to. + +These sections include: + +``global`` + +:Description: Settings under ``global`` affect all daemons and clients + in a Ceph Storage Cluster. + +:Example: ``log_file = /var/log/ceph/$cluster-$type.$id.log`` + +``mon`` + +:Description: Settings under ``mon`` affect all ``ceph-mon`` daemons in + the Ceph Storage Cluster, and override the same setting in + ``global``. + +:Example: ``mon_cluster_log_to_syslog = true`` + + +``mgr`` + +:Description: Settings in the ``mgr`` section affect all ``ceph-mgr`` daemons in + the Ceph Storage Cluster, and override the same setting in + ``global``. + +:Example: ``mgr_stats_period = 10`` + +``osd`` + +:Description: Settings under ``osd`` affect all ``ceph-osd`` daemons in + the Ceph Storage Cluster, and override the same setting in + ``global``. + +:Example: ``osd_op_queue = wpq`` + +``mds`` + +:Description: Settings in the ``mds`` section affect all ``ceph-mds`` daemons in + the Ceph Storage Cluster, and override the same setting in + ``global``. + +:Example: ``mds_cache_memory_limit = 10G`` + +``client`` + +:Description: Settings under ``client`` affect all Ceph Clients + (e.g., mounted Ceph File Systems, mounted Ceph Block Devices, + etc.) as well as Rados Gateway (RGW) daemons. + +:Example: ``objecter_inflight_ops = 512`` + + +Sections may also specify an individual daemon or client name. For example, +``mon.foo``, ``osd.123``, and ``client.smith`` are all valid section names. + + +Any given daemon will draw its settings from the global section, the +daemon or client type section, and the section sharing its name. +Settings in the most-specific section take precedence, so for example +if the same option is specified in both ``global``, ``mon``, and +``mon.foo`` on the same source (i.e., in the same configurationfile), +the ``mon.foo`` value will be used. + +If multiple values of the same configuration option are specified in the same +section, the last value wins. + +Note that values from the local configuration file always take +precedence over values from the monitor configuration database, +regardless of which section they appear in. + + +.. _ceph-metavariables: + +Metavariables +============= + +Metavariables simplify Ceph Storage Cluster configuration +dramatically. When a metavariable is set in a configuration value, +Ceph expands the metavariable into a concrete value at the time the +configuration value is used. Ceph metavariables are similar to variable expansion in the Bash shell. + +Ceph supports the following metavariables: + +``$cluster`` + +:Description: Expands to the Ceph Storage Cluster name. Useful when running + multiple Ceph Storage Clusters on the same hardware. + +:Example: ``/etc/ceph/$cluster.keyring`` +:Default: ``ceph`` + + +``$type`` + +:Description: Expands to a daemon or process type (e.g., ``mds``, ``osd``, or ``mon``) + +:Example: ``/var/lib/ceph/$type`` + + +``$id`` + +:Description: Expands to the daemon or client identifier. For + ``osd.0``, this would be ``0``; for ``mds.a``, it would + be ``a``. 
+ +:Example: ``/var/lib/ceph/$type/$cluster-$id`` + + +``$host`` + +:Description: Expands to the host name where the process is running. + + +``$name`` + +:Description: Expands to ``$type.$id``. +:Example: ``/var/run/ceph/$cluster-$name.asok`` + +``$pid`` + +:Description: Expands to daemon pid. +:Example: ``/var/run/ceph/$cluster-$name-$pid.asok`` + + + +The Configuration File +====================== + +On startup, Ceph processes search for a configuration file in the +following locations: + +#. ``$CEPH_CONF`` (*i.e.,* the path following the ``$CEPH_CONF`` + environment variable) +#. ``-c path/path`` (*i.e.,* the ``-c`` command line argument) +#. ``/etc/ceph/$cluster.conf`` +#. ``~/.ceph/$cluster.conf`` +#. ``./$cluster.conf`` (*i.e.,* in the current working directory) +#. On FreeBSD systems only, ``/usr/local/etc/ceph/$cluster.conf`` + +where ``$cluster`` is the cluster's name (default ``ceph``). + +The Ceph configuration file uses an *ini* style syntax. You can add comment +text after a pound sign (#) or a semi-colon (;). For example: + +.. code-block:: ini + + # <--A number (#) sign precedes a comment. + ; A comment may be anything. + # Comments always follow a semi-colon (;) or a pound (#) on each line. + # The end of the line terminates a comment. + # We recommend that you provide comments in your configuration file(s). + + +.. _ceph-conf-settings: + +Config file section names +------------------------- + +The configuration file is divided into sections. Each section must begin with a +valid configuration section name (see `Configuration sections`_, above) +surrounded by square brackets. For example, + +.. code-block:: ini + + [global] + debug_ms = 0 + + [osd] + debug_ms = 1 + + [osd.1] + debug_ms = 10 + + [osd.2] + debug_ms = 10 + + +Config file option values +------------------------- + +The value of a configuration option is a string. If it is too long to +fit in a single line, you can put a backslash (``\``) at the end of line +as the line continuation marker, so the value of the option will be +the string after ``=`` in current line combined with the string in the next +line:: + + [global] + foo = long long ago\ + long ago + +In the example above, the value of "``foo``" would be "``long long ago long ago``". + +Normally, the option value ends with a new line, or a comment, like + +.. code-block:: ini + + [global] + obscure_one = difficult to explain # I will try harder in next release + simpler_one = nothing to explain + +In the example above, the value of "``obscure one``" would be "``difficult to explain``"; +and the value of "``simpler one`` would be "``nothing to explain``". + +If an option value contains spaces, and we want to make it explicit, we +could quote the value using single or double quotes, like + +.. code-block:: ini + + [global] + line = "to be, or not to be" + +Certain characters are not allowed to be present in the option values directly. +They are ``=``, ``#``, ``;`` and ``[``. If we have to, we need to escape them, +like + +.. code-block:: ini + + [global] + secret = "i love \# and \[" + +Every configuration option is typed with one of the types below: + +``int`` + +:Description: 64-bit signed integer, Some SI prefixes are supported, like "K", "M", "G", + "T", "P", "E", meaning, respectively, 10\ :sup:`3`, 10\ :sup:`6`, + 10\ :sup:`9`, etc. And "B" is the only supported unit. So, "1K", "1M", "128B" and "-1" are all valid + option values. Some times, a negative value implies "unlimited" when it comes to + an option for threshold or limit. 
+:Example: ``42``, ``-1`` + +``uint`` + +:Description: It is almost identical to ``integer``. But a negative value will be rejected. +:Example: ``256``, ``0`` + +``str`` + +:Description: Free style strings encoded in UTF-8, but some characters are not allowed. Please + reference the above notes for the details. +:Example: ``"hello world"``, ``"i love \#"``, ``yet-another-name`` + +``boolean`` + +:Description: one of the two values ``true`` or ``false``. But an integer is also accepted, + where "0" implies ``false``, and any non-zero values imply ``true``. +:Example: ``true``, ``false``, ``1``, ``0`` + +``addr`` + +:Description: a single address optionally prefixed with ``v1``, ``v2`` or ``any`` for the messenger + protocol. If the prefix is not specified, ``v2`` protocol is used. Please see + :ref:`address_formats` for more details. +:Example: ``v1:1.2.3.4:567``, ``v2:1.2.3.4:567``, ``1.2.3.4:567``, ``2409:8a1e:8fb6:aa20:1260:4bff:fe92:18f5::567``, ``[::1]:6789`` + +``addrvec`` + +:Description: a set of addresses separated by ",". The addresses can be optionally quoted with ``[`` and ``]``. +:Example: ``[v1:1.2.3.4:567,v2:1.2.3.4:568]``, ``v1:1.2.3.4:567,v1:1.2.3.14:567`` ``[2409:8a1e:8fb6:aa20:1260:4bff:fe92:18f5::567], [2409:8a1e:8fb6:aa20:1260:4bff:fe92:18f5::568]`` + +``uuid`` + +:Description: the string format of a uuid defined by `RFC4122 <https://www.ietf.org/rfc/rfc4122.txt>`_. + And some variants are also supported, for more details, see + `Boost document <https://www.boost.org/doc/libs/1_74_0/libs/uuid/doc/uuid.html#String%20Generator>`_. +:Example: ``f81d4fae-7dec-11d0-a765-00a0c91e6bf6`` + +``size`` + +:Description: denotes a 64-bit unsigned integer. Both SI prefixes and IEC prefixes are + supported. And "B" is the only supported unit. A negative value will be + rejected. +:Example: ``1Ki``, ``1K``, ``1KiB`` and ``1B``. + +``secs`` + +:Description: denotes a duration of time. By default the unit is second if not specified. + Following units of time are supported: + + * second: "s", "sec", "second", "seconds" + * minute: "m", "min", "minute", "minutes" + * hour: "hs", "hr", "hour", "hours" + * day: "d", "day", "days" + * week: "w", "wk", "week", "weeks" + * month: "mo", "month", "months" + * year: "y", "yr", "year", "years" +:Example: ``1 m``, ``1m`` and ``1 week`` + +.. _ceph-conf-database: + +Monitor configuration database +============================== + +The monitor cluster manages a database of configuration options that +can be consumed by the entire cluster, enabling streamlined central +configuration management for the entire system. The vast majority of +configuration options can and should be stored here for ease of +administration and transparency. + +A handful of settings may still need to be stored in local +configuration files because they affect the ability to connect to the +monitors, authenticate, and fetch configuration information. In most +cases this is limited to the ``mon_host`` option, although this can +also be avoided through the use of DNS SRV records. + +Sections and masks +------------------ + +Configuration options stored by the monitor can live in a global +section, daemon type section, or specific daemon section, just like +options in a configuration file can. + +In addition, options may also have a *mask* associated with them to +further restrict which daemons or clients the option applies to. +Masks take two forms: + +#. ``type:location`` where *type* is a CRUSH property like `rack` or + `host`, and *location* is a value for that property. 
For example, + ``host:foo`` would limit the option only to daemons or clients + running on a particular host. +#. ``class:device-class`` where *device-class* is the name of a CRUSH + device class (e.g., ``hdd`` or ``ssd``). For example, + ``class:ssd`` would limit the option only to OSDs backed by SSDs. + (This mask has no effect for non-OSD daemons or clients.) + +When setting a configuration option, the `who` may be a section name, +a mask, or a combination of both separated by a slash (``/``) +character. For example, ``osd/rack:foo`` would mean all OSD daemons +in the ``foo`` rack. + +When viewing configuration options, the section name and mask are +generally separated out into separate fields or columns to ease readability. + + +Commands +-------- + +The following CLI commands are used to configure the cluster: + +* ``ceph config dump`` will dump the entire configuration database for + the cluster. + +* ``ceph config get <who>`` will dump the configuration for a specific + daemon or client (e.g., ``mds.a``), as stored in the monitors' + configuration database. + +* ``ceph config set <who> <option> <value>`` will set a configuration + option in the monitors' configuration database. + +* ``ceph config show <who>`` will show the reported running + configuration for a running daemon. These settings may differ from + those stored by the monitors if there are also local configuration + files in use or options have been overridden on the command line or + at run time. The source of the option values is reported as part + of the output. + +* ``ceph config assimilate-conf -i <input file> -o <output file>`` + will ingest a configuration file from *input file* and move any + valid options into the monitors' configuration database. Any + settings that are unrecognized, invalid, or cannot be controlled by + the monitor will be returned in an abbreviated config file stored in + *output file*. This command is useful for transitioning from legacy + configuration files to centralized monitor-based configuration. + + +Help +==== + +You can get help for a particular option with: + +.. prompt:: bash $ + + ceph config help <option> + +Note that this will use the configuration schema that is compiled into the running monitors. If you have a mixed-version cluster (e.g., during an upgrade), you might also want to query the option schema from a specific running daemon: + +.. prompt:: bash $ + + ceph daemon <name> config help [option] + +For example: + +.. prompt:: bash $ + + ceph config help log_file + +:: + + log_file - path to log file + (std::string, basic) + Default (non-daemon): + Default (daemon): /var/log/ceph/$cluster-$name.log + Can update at runtime: false + See also: [log_to_stderr,err_to_stderr,log_to_syslog,err_to_syslog] + +or: + +.. prompt:: bash $ + + ceph config help log_file -f json-pretty + +:: + + { + "name": "log_file", + "type": "std::string", + "level": "basic", + "desc": "path to log file", + "long_desc": "", + "default": "", + "daemon_default": "/var/log/ceph/$cluster-$name.log", + "tags": [], + "services": [], + "see_also": [ + "log_to_stderr", + "err_to_stderr", + "log_to_syslog", + "err_to_syslog" + ], + "enum_values": [], + "min": "", + "max": "", + "can_update_at_runtime": false + } + +The ``level`` property can be any of `basic`, `advanced`, or `dev`. +The `dev` options are intended for use by developers, generally for +testing purposes, and are not recommended for use by operators. 
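If you do not remember an option's exact name, one approach (assuming the
``ceph config ls`` subcommand is available in your release) is to list all
known option names, filter them, and then ask for help on the match:

.. prompt:: bash $

   ceph config ls | grep compression_required
   ceph config help bluestore_compression_required_ratio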
+ + +Runtime Changes +=============== + +In most cases, Ceph allows you to make changes to the configuration of +a daemon at runtime. This capability is quite useful for +increasing/decreasing logging output, enabling/disabling debug +settings, and even for runtime optimization. + +Generally speaking, configuration options can be updated in the usual +way via the ``ceph config set`` command. For example, do enable the debug log level on a specific OSD: + +.. prompt:: bash $ + + ceph config set osd.123 debug_ms 20 + +Note that if the same option is also customized in a local +configuration file, the monitor setting will be ignored (it has a +lower priority than the local config file). + +Override values +--------------- + +You can also temporarily set an option using the `tell` or `daemon` +interfaces on the Ceph CLI. These *override* values are ephemeral in +that they only affect the running process and are discarded/lost if +the daemon or process restarts. + +Override values can be set in two ways: + +#. From any host, we can send a message to a daemon over the network with: + + .. prompt:: bash $ + + ceph tell <name> config set <option> <value> + + For example: + + .. prompt:: bash $ + + ceph tell osd.123 config set debug_osd 20 + + The `tell` command can also accept a wildcard for the daemon + identifier. For example, to adjust the debug level on all OSD + daemons: + + .. prompt:: bash $ + + ceph tell osd.* config set debug_osd 20 + +#. From the host the process is running on, we can connect directly to + the process via a socket in ``/var/run/ceph`` with: + + .. prompt:: bash $ + + ceph daemon <name> config set <option> <value> + + For example: + + .. prompt:: bash $ + + ceph daemon osd.4 config set debug_osd 20 + +Note that in the ``ceph config show`` command output these temporary +values will be shown with a source of ``override``. + + +Viewing runtime settings +======================== + +You can see the current options set for a running daemon with the ``ceph config show`` command. For example: + +.. prompt:: bash $ + + ceph config show osd.0 + +will show you the (non-default) options for that daemon. You can also look at a specific option with: + +.. prompt:: bash $ + + ceph config show osd.0 debug_osd + +or view all options (even those with default values) with: + +.. prompt:: bash $ + + ceph config show-with-defaults osd.0 + +You can also observe settings for a running daemon by connecting to it from the local host via the admin socket. For example: + +.. prompt:: bash $ + + ceph daemon osd.0 config show + +will dump all current settings: + +.. prompt:: bash $ + + ceph daemon osd.0 config diff + +will show only non-default settings (as well as where the value came from: a config file, the monitor, an override, etc.), and: + +.. prompt:: bash $ + + ceph daemon osd.0 config get debug_osd + +will report the value of a single option. + + + +Changes since Nautilus +====================== + +With the Octopus release We changed the way the configuration file is parsed. +These changes are as follows: + +- Repeated configuration options are allowed, and no warnings will be printed. + The value of the last one is used, which means that the setting last in the file + is the one that takes effect. Before this change, we would print warning messages + when lines with duplicated options were encountered, like:: + + warning line 42: 'foo' in section 'bar' redefined + +- Invalid UTF-8 options were ignored with warning messages. But since Octopus, + they are treated as fatal errors. 
+ +- Backslash ``\`` is used as the line continuation marker to combine the next + line with current one. Before Octopus, it was required to follow a backslash with + a non-empty line. But in Octopus, an empty line following a backslash is now allowed. + +- In the configuration file, each line specifies an individual configuration + option. The option's name and its value are separated with ``=``, and the + value may be quoted using single or double quotes. If an invalid + configuration is specified, we will treat it as an invalid configuration + file :: + + bad option ==== bad value + +- Before Octopus, if no section name was specified in the configuration file, + all options would be set as though they were within the ``global`` section. This is + now discouraged. Since Octopus, only a single option is allowed for + configuration files without a section name. diff --git a/doc/rados/configuration/common.rst b/doc/rados/configuration/common.rst new file mode 100644 index 000000000..709c8bce2 --- /dev/null +++ b/doc/rados/configuration/common.rst @@ -0,0 +1,218 @@ + +.. _ceph-conf-common-settings: + +Common Settings +=============== + +The `Hardware Recommendations`_ section provides some hardware guidelines for +configuring a Ceph Storage Cluster. It is possible for a single :term:`Ceph +Node` to run multiple daemons. For example, a single node with multiple drives +may run one ``ceph-osd`` for each drive. Ideally, you will have a node for a +particular type of process. For example, some nodes may run ``ceph-osd`` +daemons, other nodes may run ``ceph-mds`` daemons, and still other nodes may +run ``ceph-mon`` daemons. + +Each node has a name identified by the ``host`` setting. Monitors also specify +a network address and port (i.e., domain name or IP address) identified by the +``addr`` setting. A basic configuration file will typically specify only +minimal settings for each instance of monitor daemons. For example: + +.. code-block:: ini + + [global] + mon_initial_members = ceph1 + mon_host = 10.0.0.1 + + +.. important:: The ``host`` setting is the short name of the node (i.e., not + an fqdn). It is **NOT** an IP address either. Enter ``hostname -s`` on + the command line to retrieve the name of the node. Do not use ``host`` + settings for anything other than initial monitors unless you are deploying + Ceph manually. You **MUST NOT** specify ``host`` under individual daemons + when using deployment tools like ``chef`` or ``cephadm``, as those tools + will enter the appropriate values for you in the cluster map. + + +.. _ceph-network-config: + +Networks +======== + +See the `Network Configuration Reference`_ for a detailed discussion about +configuring a network for use with Ceph. + + +Monitors +======== + +Production Ceph clusters typically provision a minimum of three :term:`Ceph Monitor` +daemons to ensure availability should a monitor instance crash. A minimum of +three ensures that the Paxos algorithm can determine which version +of the :term:`Ceph Cluster Map` is the most recent from a majority of Ceph +Monitors in the quorum. + +.. note:: You may deploy Ceph with a single monitor, but if the instance fails, + the lack of other monitors may interrupt data service availability. + +Ceph Monitors normally listen on port ``3300`` for the new v2 protocol, and ``6789`` for the old v1 protocol. 
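For example, a ``mon_host`` entry that lists both protocol endpoints explicitly
(the address below is a placeholder) might look like this:

.. code-block:: ini

    [global]
    mon_host = [v2:10.0.0.1:3300,v1:10.0.0.1:6789]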
+ +By default, Ceph expects to store monitor data under the +following path:: + + /var/lib/ceph/mon/$cluster-$id + +You or a deployment tool (e.g., ``cephadm``) must create the corresponding +directory. With metavariables fully expressed and a cluster named "ceph", the +foregoing directory would evaluate to:: + + /var/lib/ceph/mon/ceph-a + +For additional details, see the `Monitor Config Reference`_. + +.. _Monitor Config Reference: ../mon-config-ref + + +.. _ceph-osd-config: + + +Authentication +============== + +.. versionadded:: Bobtail 0.56 + +For Bobtail (v 0.56) and beyond, you should expressly enable or disable +authentication in the ``[global]`` section of your Ceph configuration file. + +.. code-block:: ini + + auth_cluster_required = cephx + auth_service_required = cephx + auth_client_required = cephx + +Additionally, you should enable message signing. See `Cephx Config Reference`_ for details. + +.. _Cephx Config Reference: ../auth-config-ref + + +.. _ceph-monitor-config: + + +OSDs +==== + +Ceph production clusters typically deploy :term:`Ceph OSD Daemons` where one node +has one OSD daemon running a Filestore on one storage device. The BlueStore back +end is now default, but when using Filestore you specify a journal size. For example: + +.. code-block:: ini + + [osd] + osd_journal_size = 10000 + + [osd.0] + host = {hostname} #manual deployments only. + + +By default, Ceph expects to store a Ceph OSD Daemon's data at the +following path:: + + /var/lib/ceph/osd/$cluster-$id + +You or a deployment tool (e.g., ``cephadm``) must create the corresponding +directory. With metavariables fully expressed and a cluster named "ceph", this +example would evaluate to:: + + /var/lib/ceph/osd/ceph-0 + +You may override this path using the ``osd_data`` setting. We recommend not +changing the default location. Create the default directory on your OSD host. + +.. prompt:: bash $ + + ssh {osd-host} + sudo mkdir /var/lib/ceph/osd/ceph-{osd-number} + +The ``osd_data`` path ideally leads to a mount point with a device that is +separate from the device that contains the operating system and +daemons. If an OSD is to use a device other than the OS device, prepare it for +use with Ceph, and mount it to the directory you just created + +.. prompt:: bash $ + + ssh {new-osd-host} + sudo mkfs -t {fstype} /dev/{disk} + sudo mount -o user_xattr /dev/{hdd} /var/lib/ceph/osd/ceph-{osd-number} + +We recommend using the ``xfs`` file system when running +:command:`mkfs`. (``btrfs`` and ``ext4`` are not recommended and are no +longer tested.) + +See the `OSD Config Reference`_ for additional configuration details. + + +Heartbeats +========== + +During runtime operations, Ceph OSD Daemons check up on other Ceph OSD Daemons +and report their findings to the Ceph Monitor. You do not have to provide any +settings. However, if you have network latency issues, you may wish to modify +the settings. + +See `Configuring Monitor/OSD Interaction`_ for additional details. + + +.. _ceph-logging-and-debugging: + +Logs / Debugging +================ + +Sometimes you may encounter issues with Ceph that require +modifying logging output and using Ceph's debugging. See `Debugging and +Logging`_ for details on log rotation. + +.. _Debugging and Logging: ../../troubleshooting/log-and-debug + + +Example ceph.conf +================= + +.. literalinclude:: demo-ceph.conf + :language: ini + +.. 
_ceph-runtime-config:



Running Multiple Clusters (DEPRECATED)
======================================

Each Ceph cluster has an internal name that is used as part of configuration
and log file names as well as directory and mountpoint names. This name
defaults to "ceph". Previous releases of Ceph allowed one to specify a custom
name instead, for example "ceph2". This was intended to facilitate running
multiple logical clusters on the same physical hardware, but in practice this
was rarely exploited and should no longer be attempted. Prior documentation
could also be misinterpreted as requiring unique cluster names in order to
use ``rbd-mirror``.

Custom cluster names are now considered deprecated and the ability to deploy
them has already been removed from some tools, though existing custom name
deployments continue to operate. The ability to run and manage clusters with
custom names may be progressively removed by future Ceph releases, so it is
strongly recommended to deploy all new clusters with the default name "ceph".

Some Ceph CLI commands accept an optional ``--cluster`` (cluster name) option. This
option is present purely for backward compatibility and need not be accommodated
by new tools and deployments.

If you do need to allow multiple clusters to exist on the same host, please use
:ref:`cephadm`, which uses containers to fully isolate each cluster.




.. _Hardware Recommendations: ../../../start/hardware-recommendations
.. _Network Configuration Reference: ../network-config-ref
.. _OSD Config Reference: ../osd-config-ref
.. _Configuring Monitor/OSD Interaction: ../mon-osd-interaction
diff --git a/doc/rados/configuration/demo-ceph.conf b/doc/rados/configuration/demo-ceph.conf
new file mode 100644
index 000000000..58bb7061f
--- /dev/null
+++ b/doc/rados/configuration/demo-ceph.conf
@@ -0,0 +1,31 @@
[global]
fsid = {cluster-id}
mon_initial_members = {hostname}[, {hostname}]
mon_host = {ip-address}[, {ip-address}]

#All clusters have a front-side public network.
#If you have two network interfaces, you can configure a private / cluster
#network for RADOS object replication, heartbeats, backfill,
#recovery, etc.
public_network = {network}[, {network}]
#cluster_network = {network}[, {network}]

#Clusters require authentication by default.
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx

#Choose reasonable numbers for journals, number of replicas
#and placement groups.
osd_journal_size = {n}
osd_pool_default_size = {n}  # Write an object n times.
osd_pool_default_min_size = {n}  # Allow writing n copies in a degraded state.
osd_pool_default_pg_num = {n}
osd_pool_default_pgp_num = {n}

#Choose a reasonable crush leaf type.
#0 for a 1-node cluster.
#1 for a multi node cluster in a single rack
#2 for a multi node, multi chassis cluster with multiple hosts in a chassis
#3 for a multi node cluster with hosts across racks, etc.
osd_crush_chooseleaf_type = {n}
\ No newline at end of file diff --git a/doc/rados/configuration/filestore-config-ref.rst b/doc/rados/configuration/filestore-config-ref.rst new file mode 100644 index 000000000..435a800a8 --- /dev/null +++ b/doc/rados/configuration/filestore-config-ref.rst @@ -0,0 +1,367 @@ +============================ + Filestore Config Reference +============================ + +The Filestore back end is no longer the default when creating new OSDs, +though Filestore OSDs are still supported. + +``filestore debug omap check`` + +:Description: Debugging check on synchronization. Expensive. For debugging only. +:Type: Boolean +:Required: No +:Default: ``false`` + + +.. index:: filestore; extended attributes + +Extended Attributes +=================== + +Extended Attributes (XATTRs) are important for Filestore OSDs. +Some file systems have limits on the number of bytes that can be stored in XATTRs. +Additionally, in some cases, the file system may not be as fast as an alternative +method of storing XATTRs. The following settings may help improve performance +by using a method of storing XATTRs that is extrinsic to the underlying file system. + +Ceph XATTRs are stored as ``inline xattr``, using the XATTRs provided +by the underlying file system, if it does not impose a size limit. If +there is a size limit (4KB total on ext4, for instance), some Ceph +XATTRs will be stored in a key/value database when either the +``filestore_max_inline_xattr_size`` or ``filestore_max_inline_xattrs`` +threshold is reached. + + +``filestore_max_inline_xattr_size`` + +:Description: The maximum size of an XATTR stored in the file system (i.e., XFS, + Btrfs, EXT4, etc.) per object. Should not be larger than the + file system can handle. Default value of 0 means to use the value + specific to the underlying file system. +:Type: Unsigned 32-bit Integer +:Required: No +:Default: ``0`` + + +``filestore_max_inline_xattr_size_xfs`` + +:Description: The maximum size of an XATTR stored in the XFS file system. + Only used if ``filestore_max_inline_xattr_size`` == 0. +:Type: Unsigned 32-bit Integer +:Required: No +:Default: ``65536`` + + +``filestore_max_inline_xattr_size_btrfs`` + +:Description: The maximum size of an XATTR stored in the Btrfs file system. + Only used if ``filestore_max_inline_xattr_size`` == 0. +:Type: Unsigned 32-bit Integer +:Required: No +:Default: ``2048`` + + +``filestore_max_inline_xattr_size_other`` + +:Description: The maximum size of an XATTR stored in other file systems. + Only used if ``filestore_max_inline_xattr_size`` == 0. +:Type: Unsigned 32-bit Integer +:Required: No +:Default: ``512`` + + +``filestore_max_inline_xattrs`` + +:Description: The maximum number of XATTRs stored in the file system per object. + Default value of 0 means to use the value specific to the + underlying file system. +:Type: 32-bit Integer +:Required: No +:Default: ``0`` + + +``filestore_max_inline_xattrs_xfs`` + +:Description: The maximum number of XATTRs stored in the XFS file system per object. + Only used if ``filestore_max_inline_xattrs`` == 0. +:Type: 32-bit Integer +:Required: No +:Default: ``10`` + + +``filestore_max_inline_xattrs_btrfs`` + +:Description: The maximum number of XATTRs stored in the Btrfs file system per object. + Only used if ``filestore_max_inline_xattrs`` == 0. +:Type: 32-bit Integer +:Required: No +:Default: ``10`` + + +``filestore_max_inline_xattrs_other`` + +:Description: The maximum number of XATTRs stored in other file systems per object. + Only used if ``filestore_max_inline_xattrs`` == 0. 
+:Type: 32-bit Integer +:Required: No +:Default: ``2`` + +.. index:: filestore; synchronization + +Synchronization Intervals +========================= + +Filestore needs to periodically quiesce writes and synchronize the +file system, which creates a consistent commit point. It can then free journal +entries up to the commit point. Synchronizing more frequently tends to reduce +the time required to perform synchronization, and reduces the amount of data +that needs to remain in the journal. Less frequent synchronization allows the +backing file system to coalesce small writes and metadata updates more +optimally, potentially resulting in more efficient synchronization at the +expense of potentially increasing tail latency. + +``filestore_max_sync_interval`` + +:Description: The maximum interval in seconds for synchronizing Filestore. +:Type: Double +:Required: No +:Default: ``5`` + + +``filestore_min_sync_interval`` + +:Description: The minimum interval in seconds for synchronizing Filestore. +:Type: Double +:Required: No +:Default: ``.01`` + + +.. index:: filestore; flusher + +Flusher +======= + +The Filestore flusher forces data from large writes to be written out using +``sync_file_range`` before the sync in order to (hopefully) reduce the cost of +the eventual sync. In practice, disabling 'filestore_flusher' seems to improve +performance in some cases. + + +``filestore_flusher`` + +:Description: Enables the filestore flusher. +:Type: Boolean +:Required: No +:Default: ``false`` + +.. deprecated:: v.65 + +``filestore_flusher_max_fds`` + +:Description: Sets the maximum number of file descriptors for the flusher. +:Type: Integer +:Required: No +:Default: ``512`` + +.. deprecated:: v.65 + +``filestore_sync_flush`` + +:Description: Enables the synchronization flusher. +:Type: Boolean +:Required: No +:Default: ``false`` + +.. deprecated:: v.65 + +``filestore_fsync_flushes_journal_data`` + +:Description: Flush journal data during file system synchronization. +:Type: Boolean +:Required: No +:Default: ``false`` + + +.. index:: filestore; queue + +Queue +===== + +The following settings provide limits on the size of the Filestore queue. + +``filestore_queue_max_ops`` + +:Description: Defines the maximum number of in progress operations the file store accepts before blocking on queuing new operations. +:Type: Integer +:Required: No. Minimal impact on performance. +:Default: ``50`` + + +``filestore_queue_max_bytes`` + +:Description: The maximum number of bytes for an operation. +:Type: Integer +:Required: No +:Default: ``100 << 20`` + + + + +.. index:: filestore; timeouts + +Timeouts +======== + + +``filestore_op_threads`` + +:Description: The number of file system operation threads that execute in parallel. +:Type: Integer +:Required: No +:Default: ``2`` + + +``filestore_op_thread_timeout`` + +:Description: The timeout for a file system operation thread (in seconds). +:Type: Integer +:Required: No +:Default: ``60`` + + +``filestore_op_thread_suicide_timeout`` + +:Description: The timeout for a commit operation before cancelling the commit (in seconds). +:Type: Integer +:Required: No +:Default: ``180`` + + +.. index:: filestore; btrfs + +B-Tree Filesystem +================= + + +``filestore_btrfs_snap`` + +:Description: Enable snapshots for a ``btrfs`` filestore. +:Type: Boolean +:Required: No. Only used for ``btrfs``. +:Default: ``true`` + + +``filestore_btrfs_clone_range`` + +:Description: Enable cloning ranges for a ``btrfs`` filestore. +:Type: Boolean +:Required: No. Only used for ``btrfs``. 
+:Default: ``true`` + + +.. index:: filestore; journal + +Journal +======= + + +``filestore_journal_parallel`` + +:Description: Enables parallel journaling, default for Btrfs. +:Type: Boolean +:Required: No +:Default: ``false`` + + +``filestore_journal_writeahead`` + +:Description: Enables writeahead journaling, default for XFS. +:Type: Boolean +:Required: No +:Default: ``false`` + + +``filestore_journal_trailing`` + +:Description: Deprecated, never use. +:Type: Boolean +:Required: No +:Default: ``false`` + + +Misc +==== + + +``filestore_merge_threshold`` + +:Description: Min number of files in a subdir before merging into parent + NOTE: A negative value means to disable subdir merging +:Type: Integer +:Required: No +:Default: ``-10`` + + +``filestore_split_multiple`` + +:Description: ``(filestore_split_multiple * abs(filestore_merge_threshold) + (rand() % filestore_split_rand_factor)) * 16`` + is the maximum number of files in a subdirectory before + splitting into child directories. + +:Type: Integer +:Required: No +:Default: ``2`` + + +``filestore_split_rand_factor`` + +:Description: A random factor added to the split threshold to avoid + too many (expensive) Filestore splits occurring at once. See + ``filestore_split_multiple`` for details. + This can only be changed offline for an existing OSD, + via the ``ceph-objectstore-tool apply-layout-settings`` command. + +:Type: Unsigned 32-bit Integer +:Required: No +:Default: ``20`` + + +``filestore_update_to`` + +:Description: Limits Filestore auto upgrade to specified version. +:Type: Integer +:Required: No +:Default: ``1000`` + + +``filestore_blackhole`` + +:Description: Drop any new transactions on the floor. +:Type: Boolean +:Required: No +:Default: ``false`` + + +``filestore_dump_file`` + +:Description: File onto which store transaction dumps. +:Type: Boolean +:Required: No +:Default: ``false`` + + +``filestore_kill_at`` + +:Description: inject a failure at the n'th opportunity +:Type: String +:Required: No +:Default: ``false`` + + +``filestore_fail_eio`` + +:Description: Fail/Crash on eio. +:Type: Boolean +:Required: No +:Default: ``true`` + diff --git a/doc/rados/configuration/general-config-ref.rst b/doc/rados/configuration/general-config-ref.rst new file mode 100644 index 000000000..fc837c395 --- /dev/null +++ b/doc/rados/configuration/general-config-ref.rst @@ -0,0 +1,66 @@ +========================== + General Config Reference +========================== + + +``fsid`` + +:Description: The file system ID. One per cluster. +:Type: UUID +:Required: No. +:Default: N/A. Usually generated by deployment tools. + + +``admin_socket`` + +:Description: The socket for executing administrative commands on a daemon, + irrespective of whether Ceph Monitors have established a quorum. + +:Type: String +:Required: No +:Default: ``/var/run/ceph/$cluster-$name.asok`` + + +``pid_file`` + +:Description: The file in which the mon, osd or mds will write its + PID. For instance, ``/var/run/$cluster/$type.$id.pid`` + will create /var/run/ceph/mon.a.pid for the ``mon`` with + id ``a`` running in the ``ceph`` cluster. The ``pid + file`` is removed when the daemon stops gracefully. If + the process is not daemonized (i.e. runs with the ``-f`` + or ``-d`` option), the ``pid file`` is not created. +:Type: String +:Required: No +:Default: No + + +``chdir`` + +:Description: The directory Ceph daemons change to once they are + up and running. Default ``/`` directory recommended. 
+ +:Type: String +:Required: No +:Default: ``/`` + + +``max_open_files`` + +:Description: If set, when the :term:`Ceph Storage Cluster` starts, Ceph sets + the max open FDs at the OS level (i.e., the max # of file + descriptors). A suitably large value prevents Ceph Daemons from running out + of file descriptors. + +:Type: 64-bit Integer +:Required: No +:Default: ``0`` + + +``fatal_signal_handlers`` + +:Description: If set, we will install signal handlers for SEGV, ABRT, BUS, ILL, + FPE, XCPU, XFSZ, SYS signals to generate a useful log message + +:Type: Boolean +:Default: ``true`` diff --git a/doc/rados/configuration/index.rst b/doc/rados/configuration/index.rst new file mode 100644 index 000000000..414fcc6fa --- /dev/null +++ b/doc/rados/configuration/index.rst @@ -0,0 +1,54 @@ +=============== + Configuration +=============== + +Each Ceph process, daemon, or utility draws its configuration from several +sources on startup. Such sources can include (1) a local configuration, (2) the +monitors, (3) the command line, and (4) environment variables. + +Configuration options can be set globally so that they apply (1) to all +daemons, (2) to all daemons or services of a particular type, or (3) to only a +specific daemon, process, or client. + +.. raw:: html + + <table cellpadding="10"><colgroup><col width="50%"><col width="50%"></colgroup><tbody valign="top"><tr><td><h3>Configuring the Object Store</h3> + +For general object store configuration, refer to the following: + +.. toctree:: + :maxdepth: 1 + + Storage devices <storage-devices> + ceph-conf + + +.. raw:: html + + </td><td><h3>Reference</h3> + +To optimize the performance of your cluster, refer to the following: + +.. toctree:: + :maxdepth: 1 + + Common Settings <common> + Network Settings <network-config-ref> + Messenger v2 protocol <msgr2> + Auth Settings <auth-config-ref> + Monitor Settings <mon-config-ref> + mon-lookup-dns + Heartbeat Settings <mon-osd-interaction> + OSD Settings <osd-config-ref> + DmClock Settings <mclock-config-ref> + BlueStore Settings <bluestore-config-ref> + FileStore Settings <filestore-config-ref> + Journal Settings <journal-ref> + Pool, PG & CRUSH Settings <pool-pg-config-ref.rst> + Messaging Settings <ms-ref> + General Settings <general-config-ref> + + +.. raw:: html + + </td></tr></tbody></table> diff --git a/doc/rados/configuration/journal-ref.rst b/doc/rados/configuration/journal-ref.rst new file mode 100644 index 000000000..71c74c606 --- /dev/null +++ b/doc/rados/configuration/journal-ref.rst @@ -0,0 +1,119 @@ +========================== + Journal Config Reference +========================== + +.. index:: journal; journal configuration + +Filestore OSDs use a journal for two reasons: speed and consistency. Note +that since Luminous, the BlueStore OSD back end has been preferred and default. +This information is provided for pre-existing OSDs and for rare situations where +Filestore is preferred for new deployments. + +- **Speed:** The journal enables the Ceph OSD Daemon to commit small writes + quickly. Ceph writes small, random i/o to the journal sequentially, which + tends to speed up bursty workloads by allowing the backing file system more + time to coalesce writes. The Ceph OSD Daemon's journal, however, can lead + to spiky performance with short spurts of high-speed writes followed by + periods without any write progress as the file system catches up to the + journal. + +- **Consistency:** Ceph OSD Daemons require a file system interface that + guarantees atomic compound operations. 
Ceph OSD Daemons write a description + of the operation to the journal and apply the operation to the file system. + This enables atomic updates to an object (for example, placement group + metadata). Every few seconds--between ``filestore max sync interval`` and + ``filestore min sync interval``--the Ceph OSD Daemon stops writes and + synchronizes the journal with the file system, allowing Ceph OSD Daemons to + trim operations from the journal and reuse the space. On failure, Ceph + OSD Daemons replay the journal starting after the last synchronization + operation. + +Ceph OSD Daemons recognize the following journal settings: + + +``journal_dio`` + +:Description: Enables direct i/o to the journal. Requires ``journal block + align`` set to ``true``. + +:Type: Boolean +:Required: Yes when using ``aio``. +:Default: ``true`` + + + +``journal_aio`` + +.. versionchanged:: 0.61 Cuttlefish + +:Description: Enables using ``libaio`` for asynchronous writes to the journal. + Requires ``journal dio`` set to ``true``. + +:Type: Boolean +:Required: No. +:Default: Version 0.61 and later, ``true``. Version 0.60 and earlier, ``false``. + + +``journal_block_align`` + +:Description: Block aligns write operations. Required for ``dio`` and ``aio``. +:Type: Boolean +:Required: Yes when using ``dio`` and ``aio``. +:Default: ``true`` + + +``journal_max_write_bytes`` + +:Description: The maximum number of bytes the journal will write at + any one time. + +:Type: Integer +:Required: No +:Default: ``10 << 20`` + + +``journal_max_write_entries`` + +:Description: The maximum number of entries the journal will write at + any one time. + +:Type: Integer +:Required: No +:Default: ``100`` + + +``journal_queue_max_ops`` + +:Description: The maximum number of operations allowed in the queue at + any one time. + +:Type: Integer +:Required: No +:Default: ``500`` + + +``journal_queue_max_bytes`` + +:Description: The maximum number of bytes allowed in the queue at + any one time. + +:Type: Integer +:Required: No +:Default: ``10 << 20`` + + +``journal_align_min_size`` + +:Description: Align data payloads greater than the specified minimum. +:Type: Integer +:Required: No +:Default: ``64 << 10`` + + +``journal_zero_on_create`` + +:Description: Causes the file store to overwrite the entire journal with + ``0``'s during ``mkfs``. +:Type: Boolean +:Required: No +:Default: ``false`` diff --git a/doc/rados/configuration/mclock-config-ref.rst b/doc/rados/configuration/mclock-config-ref.rst new file mode 100644 index 000000000..579056895 --- /dev/null +++ b/doc/rados/configuration/mclock-config-ref.rst @@ -0,0 +1,395 @@ +======================== + mClock Config Reference +======================== + +.. index:: mclock; configuration + +Mclock profiles mask the low level details from users, making it +easier for them to configure mclock. + +The following input parameters are required for a mclock profile to configure +the QoS related parameters: + +* total capacity (IOPS) of each OSD (determined automatically) + +* an mclock profile type to enable + +Using the settings in the specified profile, the OSD determines and applies the +lower-level mclock and Ceph parameters. The parameters applied by the mclock +profile make it possible to tune the QoS between client I/O, recovery/backfill +operations, and other background operations (for example, scrub, snap trim, and +PG deletion). These background activities are considered best-effort internal +clients of Ceph. + + +.. 
index:: mclock; profile definition + +mClock Profiles - Definition and Purpose +======================================== + +A mclock profile is *“a configuration setting that when applied on a running +Ceph cluster enables the throttling of the operations(IOPS) belonging to +different client classes (background recovery, scrub, snaptrim, client op, +osd subop)”*. + +The mclock profile uses the capacity limits and the mclock profile type selected +by the user to determine the low-level mclock resource control parameters. + +Depending on the profile type, lower-level mclock resource-control parameters +and some Ceph-configuration parameters are transparently applied. + +The low-level mclock resource control parameters are the *reservation*, +*limit*, and *weight* that provide control of the resource shares, as +described in the :ref:`dmclock-qos` section. + + +.. index:: mclock; profile types + +mClock Profile Types +==================== + +mclock profiles can be broadly classified into two types, + +- **Built-in**: Users can choose between the following built-in profile types: + + - **high_client_ops** (*default*): + This profile allocates more reservation and limit to external-client ops + as compared to background recoveries and other internal clients within + Ceph. This profile is enabled by default. + - **high_recovery_ops**: + This profile allocates more reservation to background recoveries as + compared to external clients and other internal clients within Ceph. For + example, an admin may enable this profile temporarily to speed-up background + recoveries during non-peak hours. + - **balanced**: + This profile allocates equal reservation to client ops and background + recovery ops. + +- **Custom**: This profile gives users complete control over all the mclock + configuration parameters. Using this profile is not recommended without + a deep understanding of mclock and related Ceph-configuration options. + +.. note:: Across the built-in profiles, internal clients of mclock (for example + "scrub", "snap trim", and "pg deletion") are given slightly lower + reservations, but higher weight and no limit. This ensures that + these operations are able to complete quickly if there are no other + competing services. + + +.. index:: mclock; built-in profiles + +mClock Built-in Profiles +======================== + +When a built-in profile is enabled, the mClock scheduler calculates the low +level mclock parameters [*reservation*, *weight*, *limit*] based on the profile +enabled for each client type. The mclock parameters are calculated based on +the max OSD capacity provided beforehand. As a result, the following mclock +config parameters cannot be modified when using any of the built-in profiles: + +- ``osd_mclock_scheduler_client_res`` +- ``osd_mclock_scheduler_client_wgt`` +- ``osd_mclock_scheduler_client_lim`` +- ``osd_mclock_scheduler_background_recovery_res`` +- ``osd_mclock_scheduler_background_recovery_wgt`` +- ``osd_mclock_scheduler_background_recovery_lim`` +- ``osd_mclock_scheduler_background_best_effort_res`` +- ``osd_mclock_scheduler_background_best_effort_wgt`` +- ``osd_mclock_scheduler_background_best_effort_lim`` + +The following Ceph options will not be modifiable by the user: + +- ``osd_max_backfills`` +- ``osd_recovery_max_active`` + +This is because the above options are internally modified by the mclock +scheduler in order to maximize the impact of the set profile. 
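If you want to see the values that the scheduler has derived for a
particular OSD, one way (a sketch, assuming an OSD with id ``0`` and an
active mclock profile) is to filter that OSD's running configuration
for the scheduler options listed above:

.. prompt:: bash #

   ceph config show osd.0 | grep osd_mclock_scheduler

The reservation, weight, and limit values reported there should be the
ones computed from the active profile and the OSD's measured capacity.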
+ +By default, the *high_client_ops* profile is enabled to ensure that a larger +chunk of the bandwidth allocation goes to client ops. Background recovery ops +are given lower allocation (and therefore take a longer time to complete). But +there might be instances that necessitate giving higher allocations to either +client ops or recovery ops. In order to deal with such a situation, you can +enable one of the alternate built-in profiles by following the steps mentioned +in the next section. + +If any mClock profile (including "custom") is active, the following Ceph config +sleep options will be disabled, + +- ``osd_recovery_sleep`` +- ``osd_recovery_sleep_hdd`` +- ``osd_recovery_sleep_ssd`` +- ``osd_recovery_sleep_hybrid`` +- ``osd_scrub_sleep`` +- ``osd_delete_sleep`` +- ``osd_delete_sleep_hdd`` +- ``osd_delete_sleep_ssd`` +- ``osd_delete_sleep_hybrid`` +- ``osd_snap_trim_sleep`` +- ``osd_snap_trim_sleep_hdd`` +- ``osd_snap_trim_sleep_ssd`` +- ``osd_snap_trim_sleep_hybrid`` + +The above sleep options are disabled to ensure that mclock scheduler is able to +determine when to pick the next op from its operation queue and transfer it to +the operation sequencer. This results in the desired QoS being provided across +all its clients. + + +.. index:: mclock; enable built-in profile + +Steps to Enable mClock Profile +============================== + +As already mentioned, the default mclock profile is set to *high_client_ops*. +The other values for the built-in profiles include *balanced* and +*high_recovery_ops*. + +If there is a requirement to change the default profile, then the option +``osd_mclock_profile`` may be set during runtime by using the following +command: + + .. prompt:: bash # + + ceph config set osd.N osd_mclock_profile <value> + +For example, to change the profile to allow faster recoveries on "osd.0", the +following command can be used to switch to the *high_recovery_ops* profile: + + .. prompt:: bash # + + ceph config set osd.0 osd_mclock_profile high_recovery_ops + +.. note:: The *custom* profile is not recommended unless you are an advanced + user. + +And that's it! You are ready to run workloads on the cluster and check if the +QoS requirements are being met. + + +OSD Capacity Determination (Automated) +====================================== + +The OSD capacity in terms of total IOPS is determined automatically during OSD +initialization. This is achieved by running the OSD bench tool and overriding +the default value of ``osd_mclock_max_capacity_iops_[hdd, ssd]`` option +depending on the device type. No other action/input is expected from the user +to set the OSD capacity. You may verify the capacity of an OSD after the +cluster is brought up by using the following command: + + .. prompt:: bash # + + ceph config show osd.N osd_mclock_max_capacity_iops_[hdd, ssd] + +For example, the following command shows the max capacity for "osd.0" on a Ceph +node whose underlying device type is SSD: + + .. prompt:: bash # + + ceph config show osd.0 osd_mclock_max_capacity_iops_ssd + + +Steps to Manually Benchmark an OSD (Optional) +============================================= + +.. note:: These steps are only necessary if you want to override the OSD + capacity already determined automatically during OSD initialization. + Otherwise, you may skip this section entirely. + +.. tip:: If you have already determined the benchmark data and wish to manually + override the max osd capacity for an OSD, you may skip to section + `Specifying Max OSD Capacity`_. 
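Before overriding anything, it can be helpful to note the capacity
values that were recorded automatically, so that you have a baseline to
compare your own benchmark results against. One way to list them for
all OSDs at once (a sketch; which ``hdd``/``ssd`` entries appear depends
on your devices and release) is to filter the central configuration
database:

.. prompt:: bash #

   ceph config dump | grep osd_mclock_max_capacity_iops

With that baseline noted, you can proceed to benchmark the OSD manually.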
+ + +Any existing benchmarking tool can be used for this purpose. In this case, the +steps use the *Ceph OSD Bench* command described in the next section. Regardless +of the tool/command used, the steps outlined further below remain the same. + +As already described in the :ref:`dmclock-qos` section, the number of +shards and the bluestore's throttle parameters have an impact on the mclock op +queues. Therefore, it is critical to set these values carefully in order to +maximize the impact of the mclock scheduler. + +:Number of Operational Shards: + We recommend using the default number of shards as defined by the + configuration options ``osd_op_num_shards``, ``osd_op_num_shards_hdd``, and + ``osd_op_num_shards_ssd``. In general, a lower number of shards will increase + the impact of the mclock queues. + +:Bluestore Throttle Parameters: + We recommend using the default values as defined by + ``bluestore_throttle_bytes`` and ``bluestore_throttle_deferred_bytes``. But + these parameters may also be determined during the benchmarking phase as + described below. + + +OSD Bench Command Syntax +```````````````````````` + +The :ref:`osd-subsystem` section describes the OSD bench command. The syntax +used for benchmarking is shown below : + +.. prompt:: bash # + + ceph tell osd.N bench [TOTAL_BYTES] [BYTES_PER_WRITE] [OBJ_SIZE] [NUM_OBJS] + +where, + +* ``TOTAL_BYTES``: Total number of bytes to write +* ``BYTES_PER_WRITE``: Block size per write +* ``OBJ_SIZE``: Bytes per object +* ``NUM_OBJS``: Number of objects to write + +Benchmarking Test Steps Using OSD Bench +``````````````````````````````````````` + +The steps below use the default shards and detail the steps used to determine +the correct bluestore throttle values (optional). + +#. Bring up your Ceph cluster and login to the Ceph node hosting the OSDs that + you wish to benchmark. +#. Run a simple 4KiB random write workload on an OSD using the following + commands: + + .. note:: Note that before running the test, caches must be cleared to get an + accurate measurement. + + For example, if you are running the benchmark test on osd.0, run the following + commands: + + .. prompt:: bash # + + ceph tell osd.0 cache drop + + .. prompt:: bash # + + ceph tell osd.0 bench 12288000 4096 4194304 100 + +#. Note the overall throughput(IOPS) obtained from the output of the osd bench + command. This value is the baseline throughput(IOPS) when the default + bluestore throttle options are in effect. +#. If the intent is to determine the bluestore throttle values for your + environment, then set the two options, ``bluestore_throttle_bytes`` + and ``bluestore_throttle_deferred_bytes`` to 32 KiB(32768 Bytes) each + to begin with. Otherwise, you may skip to the next section. +#. Run the 4KiB random write test as before using OSD bench. +#. Note the overall throughput from the output and compare the value + against the baseline throughput recorded in step 3. +#. If the throughput doesn't match with the baseline, increment the bluestore + throttle options by 2x and repeat steps 5 through 7 until the obtained + throughput is very close to the baseline value. + +For example, during benchmarking on a machine with NVMe SSDs, a value of 256 KiB +for both bluestore throttle and deferred bytes was determined to maximize the +impact of mclock. For HDDs, the corresponding value was 40 MiB, where the +overall throughput was roughly equal to the baseline throughput. 
Note that in +general for HDDs, the bluestore throttle values are expected to be higher when +compared to SSDs. + + +Specifying Max OSD Capacity +```````````````````````````` + +The steps in this section may be performed only if you want to override the +max osd capacity automatically set during OSD initialization. The option +``osd_mclock_max_capacity_iops_[hdd, ssd]`` for an OSD can be set by running the +following command: + + .. prompt:: bash # + + ceph config set osd.N osd_mclock_max_capacity_iops_[hdd,ssd] <value> + +For example, the following command sets the max capacity for a specific OSD +(say "osd.0") whose underlying device type is HDD to 350 IOPS: + + .. prompt:: bash # + + ceph config set osd.0 osd_mclock_max_capacity_iops_hdd 350 + +Alternatively, you may specify the max capacity for OSDs within the Ceph +configuration file under the respective [osd.N] section. See +:ref:`ceph-conf-settings` for more details. + + +.. index:: mclock; config settings + +mClock Config Options +===================== + +``osd_mclock_profile`` + +:Description: This sets the type of mclock profile to use for providing QoS + based on operations belonging to different classes (background + recovery, scrub, snaptrim, client op, osd subop). Once a built-in + profile is enabled, the lower level mclock resource control + parameters [*reservation, weight, limit*] and some Ceph + configuration parameters are set transparently. Note that the + above does not apply for the *custom* profile. + +:Type: String +:Valid Choices: high_client_ops, high_recovery_ops, balanced, custom +:Default: ``high_client_ops`` + +``osd_mclock_max_capacity_iops_hdd`` + +:Description: Max IOPS capacity (at 4KiB block size) to consider per OSD (for + rotational media) + +:Type: Float +:Default: ``315.0`` + +``osd_mclock_max_capacity_iops_ssd`` + +:Description: Max IOPS capacity (at 4KiB block size) to consider per OSD (for + solid state media) + +:Type: Float +:Default: ``21500.0`` + +``osd_mclock_cost_per_io_usec`` + +:Description: Cost per IO in microseconds to consider per OSD (overrides _ssd + and _hdd if non-zero) + +:Type: Float +:Default: ``0.0`` + +``osd_mclock_cost_per_io_usec_hdd`` + +:Description: Cost per IO in microseconds to consider per OSD (for rotational + media) + +:Type: Float +:Default: ``25000.0`` + +``osd_mclock_cost_per_io_usec_ssd`` + +:Description: Cost per IO in microseconds to consider per OSD (for solid state + media) + +:Type: Float +:Default: ``50.0`` + +``osd_mclock_cost_per_byte_usec`` + +:Description: Cost per byte in microseconds to consider per OSD (overrides _ssd + and _hdd if non-zero) + +:Type: Float +:Default: ``0.0`` + +``osd_mclock_cost_per_byte_usec_hdd`` + +:Description: Cost per byte in microseconds to consider per OSD (for rotational + media) + +:Type: Float +:Default: ``5.2`` + +``osd_mclock_cost_per_byte_usec_ssd`` + +:Description: Cost per byte in microseconds to consider per OSD (for solid state + media) + +:Type: Float +:Default: ``0.011`` diff --git a/doc/rados/configuration/mon-config-ref.rst b/doc/rados/configuration/mon-config-ref.rst new file mode 100644 index 000000000..3b12af43d --- /dev/null +++ b/doc/rados/configuration/mon-config-ref.rst @@ -0,0 +1,1243 @@ +.. _monitor-config-reference: + +========================== + Monitor Config Reference +========================== + +Understanding how to configure a :term:`Ceph Monitor` is an important part of +building a reliable :term:`Ceph Storage Cluster`. **All Ceph Storage Clusters +have at least one monitor**. 
The monitor complement usually remains fairly +consistent, but you can add, remove or replace a monitor in a cluster. See +`Adding/Removing a Monitor`_ for details. + + +.. index:: Ceph Monitor; Paxos + +Background +========== + +Ceph Monitors maintain a "master copy" of the :term:`Cluster Map`, which means a +:term:`Ceph Client` can determine the location of all Ceph Monitors, Ceph OSD +Daemons, and Ceph Metadata Servers just by connecting to one Ceph Monitor and +retrieving a current cluster map. Before Ceph Clients can read from or write to +Ceph OSD Daemons or Ceph Metadata Servers, they must connect to a Ceph Monitor +first. With a current copy of the cluster map and the CRUSH algorithm, a Ceph +Client can compute the location for any object. The ability to compute object +locations allows a Ceph Client to talk directly to Ceph OSD Daemons, which is a +very important aspect of Ceph's high scalability and performance. See +`Scalability and High Availability`_ for additional details. + +The primary role of the Ceph Monitor is to maintain a master copy of the cluster +map. Ceph Monitors also provide authentication and logging services. Ceph +Monitors write all changes in the monitor services to a single Paxos instance, +and Paxos writes the changes to a key/value store for strong consistency. Ceph +Monitors can query the most recent version of the cluster map during sync +operations. Ceph Monitors leverage the key/value store's snapshots and iterators +(using leveldb) to perform store-wide synchronization. + +.. ditaa:: + /-------------\ /-------------\ + | Monitor | Write Changes | Paxos | + | cCCC +-------------->+ cCCC | + | | | | + +-------------+ \------+------/ + | Auth | | + +-------------+ | Write Changes + | Log | | + +-------------+ v + | Monitor Map | /------+------\ + +-------------+ | Key / Value | + | OSD Map | | Store | + +-------------+ | cCCC | + | PG Map | \------+------/ + +-------------+ ^ + | MDS Map | | Read Changes + +-------------+ | + | cCCC |*---------------------+ + \-------------/ + + +.. deprecated:: version 0.58 + +In Ceph versions 0.58 and earlier, Ceph Monitors use a Paxos instance for +each service and store the map as a file. + +.. index:: Ceph Monitor; cluster map + +Cluster Maps +------------ + +The cluster map is a composite of maps, including the monitor map, the OSD map, +the placement group map and the metadata server map. The cluster map tracks a +number of important things: which processes are ``in`` the Ceph Storage Cluster; +which processes that are ``in`` the Ceph Storage Cluster are ``up`` and running +or ``down``; whether, the placement groups are ``active`` or ``inactive``, and +``clean`` or in some other state; and, other details that reflect the current +state of the cluster such as the total amount of storage space, and the amount +of storage used. + +When there is a significant change in the state of the cluster--e.g., a Ceph OSD +Daemon goes down, a placement group falls into a degraded state, etc.--the +cluster map gets updated to reflect the current state of the cluster. +Additionally, the Ceph Monitor also maintains a history of the prior states of +the cluster. The monitor map, OSD map, placement group map and metadata server +map each maintain a history of their map versions. We call each version an +"epoch." + +When operating your Ceph Storage Cluster, keeping track of these states is an +important part of your system administration duties. See `Monitoring a Cluster`_ +and `Monitoring OSDs and PGs`_ for additional details. + +.. 
index:: high availability; quorum + +Monitor Quorum +-------------- + +Our Configuring ceph section provides a trivial `Ceph configuration file`_ that +provides for one monitor in the test cluster. A cluster will run fine with a +single monitor; however, **a single monitor is a single-point-of-failure**. To +ensure high availability in a production Ceph Storage Cluster, you should run +Ceph with multiple monitors so that the failure of a single monitor **WILL NOT** +bring down your entire cluster. + +When a Ceph Storage Cluster runs multiple Ceph Monitors for high availability, +Ceph Monitors use `Paxos`_ to establish consensus about the master cluster map. +A consensus requires a majority of monitors running to establish a quorum for +consensus about the cluster map (e.g., 1; 2 out of 3; 3 out of 5; 4 out of 6; +etc.). + +``mon force quorum join`` + +:Description: Force monitor to join quorum even if it has been previously removed from the map +:Type: Boolean +:Default: ``False`` + +.. index:: Ceph Monitor; consistency + +Consistency +----------- + +When you add monitor settings to your Ceph configuration file, you need to be +aware of some of the architectural aspects of Ceph Monitors. **Ceph imposes +strict consistency requirements** for a Ceph monitor when discovering another +Ceph Monitor within the cluster. Whereas, Ceph Clients and other Ceph daemons +use the Ceph configuration file to discover monitors, monitors discover each +other using the monitor map (monmap), not the Ceph configuration file. + +A Ceph Monitor always refers to the local copy of the monmap when discovering +other Ceph Monitors in the Ceph Storage Cluster. Using the monmap instead of the +Ceph configuration file avoids errors that could break the cluster (e.g., typos +in ``ceph.conf`` when specifying a monitor address or port). Since monitors use +monmaps for discovery and they share monmaps with clients and other Ceph +daemons, **the monmap provides monitors with a strict guarantee that their +consensus is valid.** + +Strict consistency also applies to updates to the monmap. As with any other +updates on the Ceph Monitor, changes to the monmap always run through a +distributed consensus algorithm called `Paxos`_. The Ceph Monitors must agree on +each update to the monmap, such as adding or removing a Ceph Monitor, to ensure +that each monitor in the quorum has the same version of the monmap. Updates to +the monmap are incremental so that Ceph Monitors have the latest agreed upon +version, and a set of previous versions. Maintaining a history enables a Ceph +Monitor that has an older version of the monmap to catch up with the current +state of the Ceph Storage Cluster. + +If Ceph Monitors were to discover each other through the Ceph configuration file +instead of through the monmap, additional risks would be introduced because +Ceph configuration files are not updated and distributed automatically. Ceph +Monitors might inadvertently use an older Ceph configuration file, fail to +recognize a Ceph Monitor, fall out of a quorum, or develop a situation where +`Paxos`_ is not able to determine the current state of the system accurately. + + +.. index:: Ceph Monitor; bootstrapping monitors + +Bootstrapping Monitors +---------------------- + +In most configuration and deployment cases, tools that deploy Ceph help +bootstrap the Ceph Monitors by generating a monitor map for you (e.g., +``cephadm``, etc). 
A Ceph Monitor requires a few explicit +settings: + +- **Filesystem ID**: The ``fsid`` is the unique identifier for your + object store. Since you can run multiple clusters on the same + hardware, you must specify the unique ID of the object store when + bootstrapping a monitor. Deployment tools usually do this for you + (e.g., ``cephadm`` can call a tool like ``uuidgen``), but you + may specify the ``fsid`` manually too. + +- **Monitor ID**: A monitor ID is a unique ID assigned to each monitor within + the cluster. It is an alphanumeric value, and by convention the identifier + usually follows an alphabetical increment (e.g., ``a``, ``b``, etc.). This + can be set in a Ceph configuration file (e.g., ``[mon.a]``, ``[mon.b]``, etc.), + by a deployment tool, or using the ``ceph`` commandline. + +- **Keys**: The monitor must have secret keys. A deployment tool such as + ``cephadm`` usually does this for you, but you may + perform this step manually too. See `Monitor Keyrings`_ for details. + +For additional details on bootstrapping, see `Bootstrapping a Monitor`_. + +.. index:: Ceph Monitor; configuring monitors + +Configuring Monitors +==================== + +To apply configuration settings to the entire cluster, enter the configuration +settings under ``[global]``. To apply configuration settings to all monitors in +your cluster, enter the configuration settings under ``[mon]``. To apply +configuration settings to specific monitors, specify the monitor instance +(e.g., ``[mon.a]``). By convention, monitor instance names use alpha notation. + +.. code-block:: ini + + [global] + + [mon] + + [mon.a] + + [mon.b] + + [mon.c] + + +Minimum Configuration +--------------------- + +The bare minimum monitor settings for a Ceph monitor via the Ceph configuration +file include a hostname and a network address for each monitor. You can configure +these under ``[mon]`` or under the entry for a specific monitor. + +.. code-block:: ini + + [global] + mon host = 10.0.0.2,10.0.0.3,10.0.0.4 + +.. code-block:: ini + + [mon.a] + host = hostname1 + mon addr = 10.0.0.10:6789 + +See the `Network Configuration Reference`_ for details. + +.. note:: This minimum configuration for monitors assumes that a deployment + tool generates the ``fsid`` and the ``mon.`` key for you. + +Once you deploy a Ceph cluster, you **SHOULD NOT** change the IP addresses of +monitors. However, if you decide to change the monitor's IP address, you +must follow a specific procedure. See `Changing a Monitor's IP Address`_ for +details. + +Monitors can also be found by clients by using DNS SRV records. See `Monitor lookup through DNS`_ for details. + +Cluster ID +---------- + +Each Ceph Storage Cluster has a unique identifier (``fsid``). If specified, it +usually appears under the ``[global]`` section of the configuration file. +Deployment tools usually generate the ``fsid`` and store it in the monitor map, +so the value may not appear in a configuration file. The ``fsid`` makes it +possible to run daemons for multiple clusters on the same hardware. + +``fsid`` + +:Description: The cluster ID. One per cluster. +:Type: UUID +:Required: Yes. +:Default: N/A. May be generated by a deployment tool if not specified. + +.. note:: Do not set this value if you use a deployment tool that does + it for you. + + +.. index:: Ceph Monitor; initial members + +Initial Members +--------------- + +We recommend running a production Ceph Storage Cluster with at least three Ceph +Monitors to ensure high availability. 
When you run multiple monitors, you may +specify the initial monitors that must be members of the cluster in order to +establish a quorum. This may reduce the time it takes for your cluster to come +online. + +.. code-block:: ini + + [mon] + mon_initial_members = a,b,c + + +``mon_initial_members`` + +:Description: The IDs of initial monitors in a cluster during startup. If + specified, Ceph requires an odd number of monitors to form an + initial quorum (e.g., 3). + +:Type: String +:Default: None + +.. note:: A *majority* of monitors in your cluster must be able to reach + each other in order to establish a quorum. You can decrease the initial + number of monitors to establish a quorum with this setting. + +.. index:: Ceph Monitor; data path + +Data +---- + +Ceph provides a default path where Ceph Monitors store data. For optimal +performance in a production Ceph Storage Cluster, we recommend running Ceph +Monitors on separate hosts and drives from Ceph OSD Daemons. As leveldb uses +``mmap()`` for writing the data, Ceph Monitors flush their data from memory to disk +very often, which can interfere with Ceph OSD Daemon workloads if the data +store is co-located with the OSD Daemons. + +In Ceph versions 0.58 and earlier, Ceph Monitors store their data in plain files. This +approach allows users to inspect monitor data with common tools like ``ls`` +and ``cat``. However, this approach didn't provide strong consistency. + +In Ceph versions 0.59 and later, Ceph Monitors store their data as key/value +pairs. Ceph Monitors require `ACID`_ transactions. Using a data store prevents +recovering Ceph Monitors from running corrupted versions through Paxos, and it +enables multiple modification operations in one single atomic batch, among other +advantages. + +Generally, we do not recommend changing the default data location. If you modify +the default location, we recommend that you make it uniform across Ceph Monitors +by setting it in the ``[mon]`` section of the configuration file. + + +``mon_data`` + +:Description: The monitor's data location. +:Type: String +:Default: ``/var/lib/ceph/mon/$cluster-$id`` + + +``mon_data_size_warn`` + +:Description: Raise ``HEALTH_WARN`` status when a monitor's data + store grows to be larger than this size, 15GB by default. + +:Type: Integer +:Default: ``15*1024*1024*1024`` + + +``mon_data_avail_warn`` + +:Description: Raise ``HEALTH_WARN`` status when the filesystem that houses a + monitor's data store reports that its available capacity is + less than or equal to this percentage . + +:Type: Integer +:Default: ``30`` + + +``mon_data_avail_crit`` + +:Description: Raise ``HEALTH_ERR`` status when the filesystem that houses a + monitor's data store reports that its available capacity is + less than or equal to this percentage. + +:Type: Integer +:Default: ``5`` + +``mon_warn_on_cache_pools_without_hit_sets`` + +:Description: Raise ``HEALTH_WARN`` when a cache pool does not + have the ``hit_set_type`` value configured. + See :ref:`hit_set_type <hit_set_type>` for more + details. + +:Type: Boolean +:Default: ``True`` + +``mon_warn_on_crush_straw_calc_version_zero`` + +:Description: Raise ``HEALTH_WARN`` when the CRUSH + ``straw_calc_version`` is zero. See + :ref:`CRUSH map tunables <crush-map-tunables>` for + details. 
+ +:Type: Boolean +:Default: ``True`` + + +``mon_warn_on_legacy_crush_tunables`` + +:Description: Raise ``HEALTH_WARN`` when + CRUSH tunables are too old (older than ``mon_min_crush_required_version``) + +:Type: Boolean +:Default: ``True`` + + +``mon_crush_min_required_version`` + +:Description: The minimum tunable profile required by the cluster. + See + :ref:`CRUSH map tunables <crush-map-tunables>` for + details. + +:Type: String +:Default: ``hammer`` + + +``mon_warn_on_osd_down_out_interval_zero`` + +:Description: Raise ``HEALTH_WARN`` when + ``mon_osd_down_out_interval`` is zero. Having this option set to + zero on the leader acts much like the ``noout`` flag. It's hard + to figure out what's going wrong with clusters without the + ``noout`` flag set but acting like that just the same, so we + report a warning in this case. + +:Type: Boolean +:Default: ``True`` + + +``mon_warn_on_slow_ping_ratio`` + +:Description: Raise ``HEALTH_WARN`` when any heartbeat + between OSDs exceeds ``mon_warn_on_slow_ping_ratio`` + of ``osd_heartbeat_grace``. The default is 5%. +:Type: Float +:Default: ``0.05`` + + +``mon_warn_on_slow_ping_time`` + +:Description: Override ``mon_warn_on_slow_ping_ratio`` with a specific value. + Raise ``HEALTH_WARN`` if any heartbeat + between OSDs exceeds ``mon_warn_on_slow_ping_time`` + milliseconds. The default is 0 (disabled). +:Type: Integer +:Default: ``0`` + + +``mon_warn_on_pool_no_redundancy`` + +:Description: Raise ``HEALTH_WARN`` if any pool is + configured with no replicas. +:Type: Boolean +:Default: ``True`` + + +``mon_cache_target_full_warn_ratio`` + +:Description: Position between pool's ``cache_target_full`` and + ``target_max_object`` where we start warning + +:Type: Float +:Default: ``0.66`` + + +``mon_health_to_clog`` + +:Description: Enable sending a health summary to the cluster log periodically. +:Type: Boolean +:Default: ``True`` + + +``mon_health_to_clog_tick_interval`` + +:Description: How often (in seconds) the monitor sends a health summary to the cluster + log (a non-positive number disables). If current health summary + is empty or identical to the last time, monitor will not send it + to cluster log. + +:Type: Float +:Default: ``60.0`` + + +``mon_health_to_clog_interval`` + +:Description: How often (in seconds) the monitor sends a health summary to the cluster + log (a non-positive number disables). Monitors will always + send a summary to the cluster log whether or not it differs from + the previous summary. + +:Type: Integer +:Default: ``3600`` + + + +.. index:: Ceph Storage Cluster; capacity planning, Ceph Monitor; capacity planning + +.. _storage-capacity: + +Storage Capacity +---------------- + +When a Ceph Storage Cluster gets close to its maximum capacity +(see``mon_osd_full ratio``), Ceph prevents you from writing to or reading from OSDs +as a safety measure to prevent data loss. Therefore, letting a +production Ceph Storage Cluster approach its full ratio is not a good practice, +because it sacrifices high availability. The default full ratio is ``.95``, or +95% of capacity. This a very aggressive setting for a test cluster with a small +number of OSDs. + +.. tip:: When monitoring your cluster, be alert to warnings related to the + ``nearfull`` ratio. This means that a failure of some OSDs could result + in a temporary service disruption if one or more OSDs fails. Consider adding + more OSDs to increase storage capacity. 
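To check how close a cluster is to these thresholds, you can compare
current utilization against the ratios recorded in the OSDMap. A quick
sketch (output formats vary between releases):

.. prompt:: bash $

   ceph df
   ceph osd dump | grep -E 'full_ratio|backfillfull_ratio|nearfull_ratio'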
+ +A common scenario for test clusters involves a system administrator removing an +OSD from the Ceph Storage Cluster, watching the cluster rebalance, then removing +another OSD, and another, until at least one OSD eventually reaches the full +ratio and the cluster locks up. We recommend a bit of capacity +planning even with a test cluster. Planning enables you to gauge how much spare +capacity you will need in order to maintain high availability. Ideally, you want +to plan for a series of Ceph OSD Daemon failures where the cluster can recover +to an ``active+clean`` state without replacing those OSDs +immediately. Cluster operation continues in the ``active+degraded`` state, but this +is not ideal for normal operation and should be addressed promptly. + +The following diagram depicts a simplistic Ceph Storage Cluster containing 33 +Ceph Nodes with one OSD per host, each OSD reading from +and writing to a 3TB drive. So this exemplary Ceph Storage Cluster has a maximum +actual capacity of 99TB. With a ``mon osd full ratio`` of ``0.95``, if the Ceph +Storage Cluster falls to 5TB of remaining capacity, the cluster will not allow +Ceph Clients to read and write data. So the Ceph Storage Cluster's operating +capacity is 95TB, not 99TB. + +.. ditaa:: + +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ + | Rack 1 | | Rack 2 | | Rack 3 | | Rack 4 | | Rack 5 | | Rack 6 | + | cCCC | | cF00 | | cCCC | | cCCC | | cCCC | | cCCC | + +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ + | OSD 1 | | OSD 7 | | OSD 13 | | OSD 19 | | OSD 25 | | OSD 31 | + +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ + | OSD 2 | | OSD 8 | | OSD 14 | | OSD 20 | | OSD 26 | | OSD 32 | + +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ + | OSD 3 | | OSD 9 | | OSD 15 | | OSD 21 | | OSD 27 | | OSD 33 | + +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ + | OSD 4 | | OSD 10 | | OSD 16 | | OSD 22 | | OSD 28 | | Spare | + +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ + | OSD 5 | | OSD 11 | | OSD 17 | | OSD 23 | | OSD 29 | | Spare | + +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ + | OSD 6 | | OSD 12 | | OSD 18 | | OSD 24 | | OSD 30 | | Spare | + +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ + +It is normal in such a cluster for one or two OSDs to fail. A less frequent but +reasonable scenario involves a rack's router or power supply failing, which +brings down multiple OSDs simultaneously (e.g., OSDs 7-12). In such a scenario, +you should still strive for a cluster that can remain operational and achieve an +``active + clean`` state--even if that means adding a few hosts with additional +OSDs in short order. If your capacity utilization is too high, you may not lose +data, but you could still sacrifice data availability while resolving an outage +within a failure domain if capacity utilization of the cluster exceeds the full +ratio. For this reason, we recommend at least some rough capacity planning. + +Identify two numbers for your cluster: + +#. The number of OSDs. +#. The total capacity of the cluster + +If you divide the total capacity of your cluster by the number of OSDs in your +cluster, you will find the mean average capacity of an OSD within your cluster. +Consider multiplying that number by the number of OSDs you expect will fail +simultaneously during normal operations (a relatively small number). 
Finally +multiply the capacity of the cluster by the full ratio to arrive at a maximum +operating capacity; then, subtract the number of amount of data from the OSDs +you expect to fail to arrive at a reasonable full ratio. Repeat the foregoing +process with a higher number of OSD failures (e.g., a rack of OSDs) to arrive at +a reasonable number for a near full ratio. + +The following settings only apply on cluster creation and are then stored in +the OSDMap. To clarify, in normal operation the values that are used by OSDs +are those found in the OSDMap, not those in the configuration file or central +config store. + +.. code-block:: ini + + [global] + mon_osd_full_ratio = .80 + mon_osd_backfillfull_ratio = .75 + mon_osd_nearfull_ratio = .70 + + +``mon_osd_full_ratio`` + +:Description: The threshold percentage of device space utilized before an OSD is + considered ``full``. + +:Type: Float +:Default: ``0.95`` + + +``mon_osd_backfillfull_ratio`` + +:Description: The threshold percentage of device space utilized before an OSD is + considered too ``full`` to backfill. + +:Type: Float +:Default: ``0.90`` + + +``mon_osd_nearfull_ratio`` + +:Description: The threshold percentage of device space used before an OSD is + considered ``nearfull``. + +:Type: Float +:Default: ``0.85`` + + +.. tip:: If some OSDs are nearfull, but others have plenty of capacity, you + may have an inaccurate CRUSH weight set for the nearfull OSDs. + +.. tip:: These settings only apply during cluster creation. Afterwards they need + to be changed in the OSDMap using ``ceph osd set-nearfull-ratio`` and + ``ceph osd set-full-ratio`` + +.. index:: heartbeat + +Heartbeat +--------- + +Ceph monitors know about the cluster by requiring reports from each OSD, and by +receiving reports from OSDs about the status of their neighboring OSDs. Ceph +provides reasonable default settings for monitor/OSD interaction; however, you +may modify them as needed. See `Monitor/OSD Interaction`_ for details. + + +.. index:: Ceph Monitor; leader, Ceph Monitor; provider, Ceph Monitor; requester, Ceph Monitor; synchronization + +Monitor Store Synchronization +----------------------------- + +When you run a production cluster with multiple monitors (recommended), each +monitor checks to see if a neighboring monitor has a more recent version of the +cluster map (e.g., a map in a neighboring monitor with one or more epoch numbers +higher than the most current epoch in the map of the instant monitor). +Periodically, one monitor in the cluster may fall behind the other monitors to +the point where it must leave the quorum, synchronize to retrieve the most +current information about the cluster, and then rejoin the quorum. For the +purposes of synchronization, monitors may assume one of three roles: + +#. **Leader**: The `Leader` is the first monitor to achieve the most recent + Paxos version of the cluster map. + +#. **Provider**: The `Provider` is a monitor that has the most recent version + of the cluster map, but wasn't the first to achieve the most recent version. + +#. **Requester:** A `Requester` is a monitor that has fallen behind the leader + and must synchronize in order to retrieve the most recent information about + the cluster before it can rejoin the quorum. + +These roles enable a leader to delegate synchronization duties to a provider, +which prevents synchronization requests from overloading the leader--improving +performance. In the following diagram, the requester has learned that it has +fallen behind the other monitors. 
The requester asks the leader to synchronize, +and the leader tells the requester to synchronize with a provider. + + +.. ditaa:: + +-----------+ +---------+ +----------+ + | Requester | | Leader | | Provider | + +-----------+ +---------+ +----------+ + | | | + | | | + | Ask to Synchronize | | + |------------------->| | + | | | + |<-------------------| | + | Tell Requester to | | + | Sync with Provider | | + | | | + | Synchronize | + |--------------------+-------------------->| + | | | + |<-------------------+---------------------| + | Send Chunk to Requester | + | (repeat as necessary) | + | Requester Acks Chuck to Provider | + |--------------------+-------------------->| + | | + | Sync Complete | + | Notification | + |------------------->| + | | + |<-------------------| + | Ack | + | | + + +Synchronization always occurs when a new monitor joins the cluster. During +runtime operations, monitors may receive updates to the cluster map at different +times. This means the leader and provider roles may migrate from one monitor to +another. If this happens while synchronizing (e.g., a provider falls behind the +leader), the provider can terminate synchronization with a requester. + +Once synchronization is complete, Ceph performs trimming across the cluster. +Trimming requires that the placement groups are ``active+clean``. + + +``mon_sync_timeout`` + +:Description: Number of seconds the monitor will wait for the next update + message from its sync provider before it gives up and bootstrap + again. + +:Type: Double +:Default: ``60.0`` + + +``mon_sync_max_payload_size`` + +:Description: The maximum size for a sync payload (in bytes). +:Type: 32-bit Integer +:Default: ``1048576`` + + +``paxos_max_join_drift`` + +:Description: The maximum Paxos iterations before we must first sync the + monitor data stores. When a monitor finds that its peer is too + far ahead of it, it will first sync with data stores before moving + on. + +:Type: Integer +:Default: ``10`` + + +``paxos_stash_full_interval`` + +:Description: How often (in commits) to stash a full copy of the PaxosService state. + Current this setting only affects ``mds``, ``mon``, ``auth`` and ``mgr`` + PaxosServices. + +:Type: Integer +:Default: ``25`` + + +``paxos_propose_interval`` + +:Description: Gather updates for this time interval before proposing + a map update. + +:Type: Double +:Default: ``1.0`` + + +``paxos_min`` + +:Description: The minimum number of Paxos states to keep around +:Type: Integer +:Default: ``500`` + + +``paxos_min_wait`` + +:Description: The minimum amount of time to gather updates after a period of + inactivity. 
+ +:Type: Double +:Default: ``0.05`` + + +``paxos_trim_min`` + +:Description: Number of extra proposals tolerated before trimming +:Type: Integer +:Default: ``250`` + + +``paxos_trim_max`` + +:Description: The maximum number of extra proposals to trim at a time +:Type: Integer +:Default: ``500`` + + +``paxos_service_trim_min`` + +:Description: The minimum amount of versions to trigger a trim (0 disables it) +:Type: Integer +:Default: ``250`` + + +``paxos_service_trim_max`` + +:Description: The maximum amount of versions to trim during a single proposal (0 disables it) +:Type: Integer +:Default: ``500`` + + +``paxos service trim max multiplier`` + +:Description: The factor by which paxos service trim max will be multiplied + to get a new upper bound when trim sizes are high (0 disables it) +:Type: Integer +:Default: ``20`` + + +``mon mds force trim to`` + +:Description: Force monitor to trim mdsmaps to this point (0 disables it. + dangerous, use with care) + +:Type: Integer +:Default: ``0`` + + +``mon_osd_force_trim_to`` + +:Description: Force monitor to trim osdmaps to this point, even if there is + PGs not clean at the specified epoch (0 disables it. dangerous, + use with care) + +:Type: Integer +:Default: ``0`` + + +``mon_osd_cache_size`` + +:Description: The size of osdmaps cache, not to rely on underlying store's cache +:Type: Integer +:Default: ``500`` + + +``mon_election_timeout`` + +:Description: On election proposer, maximum waiting time for all ACKs in seconds. +:Type: Float +:Default: ``5.00`` + + +``mon_lease`` + +:Description: The length (in seconds) of the lease on the monitor's versions. +:Type: Float +:Default: ``5.00`` + + +``mon_lease_renew_interval_factor`` + +:Description: ``mon_lease`` \* ``mon_lease_renew_interval_factor`` will be the + interval for the Leader to renew the other monitor's leases. The + factor should be less than ``1.0``. + +:Type: Float +:Default: ``0.60`` + + +``mon_lease_ack_timeout_factor`` + +:Description: The Leader will wait ``mon_lease`` \* ``mon_lease_ack_timeout_factor`` + for the Providers to acknowledge the lease extension. + +:Type: Float +:Default: ``2.00`` + + +``mon_accept_timeout_factor`` + +:Description: The Leader will wait ``mon_lease`` \* ``mon_accept_timeout_factor`` + for the Requester(s) to accept a Paxos update. It is also used + during the Paxos recovery phase for similar purposes. + +:Type: Float +:Default: ``2.00`` + + +``mon_min_osdmap_epochs`` + +:Description: Minimum number of OSD map epochs to keep at all times. +:Type: 32-bit Integer +:Default: ``500`` + + +``mon_max_log_epochs`` + +:Description: Maximum number of Log epochs the monitor should keep. +:Type: 32-bit Integer +:Default: ``500`` + + + +.. index:: Ceph Monitor; clock + +Clock +----- + +Ceph daemons pass critical messages to each other, which must be processed +before daemons reach a timeout threshold. If the clocks in Ceph monitors +are not synchronized, it can lead to a number of anomalies. For example: + +- Daemons ignoring received messages (e.g., timestamps outdated) +- Timeouts triggered too soon/late when a message wasn't received in time. + +See `Monitor Store Synchronization`_ for details. + + +.. tip:: You must configure NTP or PTP daemons on your Ceph monitor hosts to + ensure that the monitor cluster operates with synchronized clocks. + It can be advantageous to have monitor hosts sync with each other + as well as with multiple quality upstream time sources. 
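+
+If the cluster reports clock skew, the following commands show the offsets the
+monitors currently measure and, where the environment genuinely warrants it,
+relax the allowed drift at runtime using the settings described below. The
+``0.1`` value is only an illustrative choice, not a recommendation:
+
+.. prompt:: bash $
+
+   ceph status
+   ceph time-sync-status
+   ceph config set mon mon_clock_drift_allowed 0.1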
+ +Clock drift may still be noticeable with NTP even though the discrepancy is not +yet harmful. Ceph's clock drift / clock skew warnings may get triggered even +though NTP maintains a reasonable level of synchronization. Increasing your +clock drift may be tolerable under such circumstances; however, a number of +factors such as workload, network latency, configuring overrides to default +timeouts and the `Monitor Store Synchronization`_ settings may influence +the level of acceptable clock drift without compromising Paxos guarantees. + +Ceph provides the following tunable options to allow you to find +acceptable values. + + +``mon_tick_interval`` + +:Description: A monitor's tick interval in seconds. +:Type: 32-bit Integer +:Default: ``5`` + + +``mon_clock_drift_allowed`` + +:Description: The clock drift in seconds allowed between monitors. +:Type: Float +:Default: ``0.05`` + + +``mon_clock_drift_warn_backoff`` + +:Description: Exponential backoff for clock drift warnings +:Type: Float +:Default: ``5.00`` + + +``mon_timecheck_interval`` + +:Description: The time check interval (clock drift check) in seconds + for the Leader. + +:Type: Float +:Default: ``300.00`` + + +``mon_timecheck_skew_interval`` + +:Description: The time check interval (clock drift check) in seconds when in + presence of a skew in seconds for the Leader. + +:Type: Float +:Default: ``30.00`` + + +Client +------ + +``mon_client_hunt_interval`` + +:Description: The client will try a new monitor every ``N`` seconds until it + establishes a connection. + +:Type: Double +:Default: ``3.00`` + + +``mon_client_ping_interval`` + +:Description: The client will ping the monitor every ``N`` seconds. +:Type: Double +:Default: ``10.00`` + + +``mon_client_max_log_entries_per_message`` + +:Description: The maximum number of log entries a monitor will generate + per client message. + +:Type: Integer +:Default: ``1000`` + + +``mon_client_bytes`` + +:Description: The amount of client message data allowed in memory (in bytes). +:Type: 64-bit Integer Unsigned +:Default: ``100ul << 20`` + +.. _pool-settings: + +Pool settings +============= + +Since version v0.94 there is support for pool flags which allow or disallow changes to be made to pools. +Monitors can also disallow removal of pools if appropriately configured. The inconvenience of this guardrail +is far outweighed by the number of accidental pool (and thus data) deletions it prevents. + +``mon_allow_pool_delete`` + +:Description: Should monitors allow pools to be removed, regardless of what the pool flags say? + +:Type: Boolean +:Default: ``false`` + + +``osd_pool_default_ec_fast_read`` + +:Description: Whether to turn on fast read on the pool or not. It will be used as + the default setting of newly created erasure coded pools if ``fast_read`` + is not specified at create time. + +:Type: Boolean +:Default: ``false`` + + +``osd_pool_default_flag_hashpspool`` + +:Description: Set the hashpspool flag on new pools +:Type: Boolean +:Default: ``true`` + + +``osd_pool_default_flag_nodelete`` + +:Description: Set the ``nodelete`` flag on new pools, which prevents pool removal. +:Type: Boolean +:Default: ``false`` + + +``osd_pool_default_flag_nopgchange`` + +:Description: Set the ``nopgchange`` flag on new pools. Does not allow the number of PGs to be changed. +:Type: Boolean +:Default: ``false`` + + +``osd_pool_default_flag_nosizechange`` + +:Description: Set the ``nosizechange`` flag on new pools. Does not allow the ``size`` to be changed. 
+:Type: Boolean +:Default: ``false`` + +For more information about the pool flags see `Pool values`_. + +Miscellaneous +============= + +``mon_max_osd`` + +:Description: The maximum number of OSDs allowed in the cluster. +:Type: 32-bit Integer +:Default: ``10000`` + + +``mon_globalid_prealloc`` + +:Description: The number of global IDs to pre-allocate for clients and daemons in the cluster. +:Type: 32-bit Integer +:Default: ``10000`` + + +``mon_subscribe_interval`` + +:Description: The refresh interval (in seconds) for subscriptions. The + subscription mechanism enables obtaining cluster maps + and log information. + +:Type: Double +:Default: ``86400.00`` + + +``mon_stat_smooth_intervals`` + +:Description: Ceph will smooth statistics over the last ``N`` PG maps. +:Type: Integer +:Default: ``6`` + + +``mon_probe_timeout`` + +:Description: Number of seconds the monitor will wait to find peers before bootstrapping. +:Type: Double +:Default: ``2.00`` + + +``mon_daemon_bytes`` + +:Description: The message memory cap for metadata server and OSD messages (in bytes). +:Type: 64-bit Integer Unsigned +:Default: ``400ul << 20`` + + +``mon_max_log_entries_per_event`` + +:Description: The maximum number of log entries per event. +:Type: Integer +:Default: ``4096`` + + +``mon_osd_prime_pg_temp`` + +:Description: Enables or disables priming the PGMap with the previous OSDs when an ``out`` + OSD comes back into the cluster. With the ``true`` setting, clients + will continue to use the previous OSDs until the newly ``in`` OSDs for + a PG have peered. + +:Type: Boolean +:Default: ``true`` + + +``mon_osd_prime pg temp max time`` + +:Description: How much time in seconds the monitor should spend trying to prime the + PGMap when an out OSD comes back into the cluster. + +:Type: Float +:Default: ``0.50`` + + +``mon_osd_prime_pg_temp_max_time_estimate`` + +:Description: Maximum estimate of time spent on each PG before we prime all PGs + in parallel. + +:Type: Float +:Default: ``0.25`` + + +``mon_mds_skip_sanity`` + +:Description: Skip safety assertions on FSMap (in case of bugs where we want to + continue anyway). Monitor terminates if the FSMap sanity check + fails, but we can disable it by enabling this option. + +:Type: Boolean +:Default: ``False`` + + +``mon_max_mdsmap_epochs`` + +:Description: The maximum number of mdsmap epochs to trim during a single proposal. +:Type: Integer +:Default: ``500`` + + +``mon_config_key_max_entry_size`` + +:Description: The maximum size of config-key entry (in bytes) +:Type: Integer +:Default: ``65536`` + + +``mon_scrub_interval`` + +:Description: How often the monitor scrubs its store by comparing + the stored checksums with the computed ones for all stored + keys. (0 disables it. dangerous, use with care) + +:Type: Seconds +:Default: ``1 day`` + + +``mon_scrub_max_keys`` + +:Description: The maximum number of keys to scrub each time. +:Type: Integer +:Default: ``100`` + + +``mon_compact_on_start`` + +:Description: Compact the database used as Ceph Monitor store on + ``ceph-mon`` start. A manual compaction helps to shrink the + monitor database and improve the performance of it if the regular + compaction fails to work. + +:Type: Boolean +:Default: ``False`` + + +``mon_compact_on_bootstrap`` + +:Description: Compact the database used as Ceph Monitor store + on bootstrap. Monitors probe each other to establish + a quorum after bootstrap. If a monitor times out before joining the + quorum, it will start over and bootstrap again. 
+ +:Type: Boolean +:Default: ``False`` + + +``mon_compact_on_trim`` + +:Description: Compact a certain prefix (including paxos) when we trim its old states. +:Type: Boolean +:Default: ``True`` + + +``mon_cpu_threads`` + +:Description: Number of threads for performing CPU intensive work on monitor. +:Type: Integer +:Default: ``4`` + + +``mon_osd_mapping_pgs_per_chunk`` + +:Description: We calculate the mapping from placement group to OSDs in chunks. + This option specifies the number of placement groups per chunk. + +:Type: Integer +:Default: ``4096`` + + +``mon_session_timeout`` + +:Description: Monitor will terminate inactive sessions stay idle over this + time limit. + +:Type: Integer +:Default: ``300`` + + +``mon_osd_cache_size_min`` + +:Description: The minimum amount of bytes to be kept mapped in memory for osd + monitor caches. + +:Type: 64-bit Integer +:Default: ``134217728`` + + +``mon_memory_target`` + +:Description: The amount of bytes pertaining to OSD monitor caches and KV cache + to be kept mapped in memory with cache auto-tuning enabled. + +:Type: 64-bit Integer +:Default: ``2147483648`` + + +``mon_memory_autotune`` + +:Description: Autotune the cache memory used for OSD monitors and KV + database. + +:Type: Boolean +:Default: ``True`` + + +.. _Paxos: https://en.wikipedia.org/wiki/Paxos_(computer_science) +.. _Monitor Keyrings: ../../../dev/mon-bootstrap#secret-keys +.. _Ceph configuration file: ../ceph-conf/#monitors +.. _Network Configuration Reference: ../network-config-ref +.. _Monitor lookup through DNS: ../mon-lookup-dns +.. _ACID: https://en.wikipedia.org/wiki/ACID +.. _Adding/Removing a Monitor: ../../operations/add-or-rm-mons +.. _Monitoring a Cluster: ../../operations/monitoring +.. _Monitoring OSDs and PGs: ../../operations/monitoring-osd-pg +.. _Bootstrapping a Monitor: ../../../dev/mon-bootstrap +.. _Changing a Monitor's IP Address: ../../operations/add-or-rm-mons#changing-a-monitor-s-ip-address +.. _Monitor/OSD Interaction: ../mon-osd-interaction +.. _Scalability and High Availability: ../../../architecture#scalability-and-high-availability +.. _Pool values: ../../operations/pools/#set-pool-values diff --git a/doc/rados/configuration/mon-lookup-dns.rst b/doc/rados/configuration/mon-lookup-dns.rst new file mode 100644 index 000000000..c9bece004 --- /dev/null +++ b/doc/rados/configuration/mon-lookup-dns.rst @@ -0,0 +1,56 @@ +=============================== +Looking up Monitors through DNS +=============================== + +Since version 11.0.0 RADOS supports looking up Monitors through DNS. + +This way daemons and clients do not require a *mon host* configuration directive in their ceph.conf configuration file. + +Using DNS SRV TCP records clients are able to look up the monitors. + +This allows for less configuration on clients and monitors. Using a DNS update clients and daemons can be made aware of changes in the monitor topology. + +By default clients and daemons will look for the TCP service called *ceph-mon* which is configured by the *mon_dns_srv_name* configuration directive. + + +``mon dns srv name`` + +:Description: the service name used querying the DNS for the monitor hosts/addresses +:Type: String +:Default: ``ceph-mon`` + +Example +------- +When the DNS search domain is set to *example.com* a DNS zone file might contain the following elements. + +First, create records for the Monitors, either IPv4 (A) or IPv6 (AAAA). + +:: + + mon1.example.com. AAAA 2001:db8::100 + mon2.example.com. AAAA 2001:db8::200 + mon3.example.com. 
AAAA 2001:db8::300 + +:: + + mon1.example.com. A 192.168.0.1 + mon2.example.com. A 192.168.0.2 + mon3.example.com. A 192.168.0.3 + + +With those records now existing we can create the SRV TCP records with the name *ceph-mon* pointing to the three Monitors. + +:: + + _ceph-mon._tcp.example.com. 60 IN SRV 10 20 6789 mon1.example.com. + _ceph-mon._tcp.example.com. 60 IN SRV 10 30 6789 mon2.example.com. + _ceph-mon._tcp.example.com. 60 IN SRV 20 50 6789 mon3.example.com. + +Now all Monitors are running on port *6789*, with priorities 10, 10, 20 and weights 20, 30, 50 respectively. + +Monitor clients choose monitor by referencing the SRV records. If a cluster has multiple Monitor SRV records +with the same priority value, clients and daemons will load balance the connections to Monitors in proportion +to the values of the SRV weight fields. + +For the above example, this will result in approximate 40% of the clients and daemons connecting to mon1, +60% of them connecting to mon2. However, if neither of them is reachable, then mon3 will be reconsidered as a fallback. diff --git a/doc/rados/configuration/mon-osd-interaction.rst b/doc/rados/configuration/mon-osd-interaction.rst new file mode 100644 index 000000000..727070491 --- /dev/null +++ b/doc/rados/configuration/mon-osd-interaction.rst @@ -0,0 +1,396 @@ +===================================== + Configuring Monitor/OSD Interaction +===================================== + +.. index:: heartbeat + +After you have completed your initial Ceph configuration, you may deploy and run +Ceph. When you execute a command such as ``ceph health`` or ``ceph -s``, the +:term:`Ceph Monitor` reports on the current state of the :term:`Ceph Storage +Cluster`. The Ceph Monitor knows about the Ceph Storage Cluster by requiring +reports from each :term:`Ceph OSD Daemon`, and by receiving reports from Ceph +OSD Daemons about the status of their neighboring Ceph OSD Daemons. If the Ceph +Monitor doesn't receive reports, or if it receives reports of changes in the +Ceph Storage Cluster, the Ceph Monitor updates the status of the :term:`Ceph +Cluster Map`. + +Ceph provides reasonable default settings for Ceph Monitor/Ceph OSD Daemon +interaction. However, you may override the defaults. The following sections +describe how Ceph Monitors and Ceph OSD Daemons interact for the purposes of +monitoring the Ceph Storage Cluster. + +.. index:: heartbeat interval + +OSDs Check Heartbeats +===================== + +Each Ceph OSD Daemon checks the heartbeat of other Ceph OSD Daemons at random +intervals less than every 6 seconds. If a neighboring Ceph OSD Daemon doesn't +show a heartbeat within a 20 second grace period, the Ceph OSD Daemon may +consider the neighboring Ceph OSD Daemon ``down`` and report it back to a Ceph +Monitor, which will update the Ceph Cluster Map. You may change this grace +period by adding an ``osd heartbeat grace`` setting under the ``[mon]`` +and ``[osd]`` or ``[global]`` section of your Ceph configuration file, +or by setting the value at runtime. + + +.. ditaa:: + +---------+ +---------+ + | OSD 1 | | OSD 2 | + +---------+ +---------+ + | | + |----+ Heartbeat | + | | Interval | + |<---+ Exceeded | + | | + | Check | + | Heartbeat | + |------------------->| + | | + |<-------------------| + | Heart Beating | + | | + |----+ Heartbeat | + | | Interval | + |<---+ Exceeded | + | | + | Check | + | Heartbeat | + |------------------->| + | | + |----+ Grace | + | | Period | + |<---+ Exceeded | + | | + |----+ Mark | + | | OSD 2 | + |<---+ Down | + + +.. 
index:: OSD down report + +OSDs Report Down OSDs +===================== + +By default, two Ceph OSD Daemons from different hosts must report to the Ceph +Monitors that another Ceph OSD Daemon is ``down`` before the Ceph Monitors +acknowledge that the reported Ceph OSD Daemon is ``down``. But there is chance +that all the OSDs reporting the failure are hosted in a rack with a bad switch +which has trouble connecting to another OSD. To avoid this sort of false alarm, +we consider the peers reporting a failure a proxy for a potential "subcluster" +over the overall cluster that is similarly laggy. This is clearly not true in +all cases, but will sometimes help us localize the grace correction to a subset +of the system that is unhappy. ``mon osd reporter subtree level`` is used to +group the peers into the "subcluster" by their common ancestor type in CRUSH +map. By default, only two reports from different subtree are required to report +another Ceph OSD Daemon ``down``. You can change the number of reporters from +unique subtrees and the common ancestor type required to report a Ceph OSD +Daemon ``down`` to a Ceph Monitor by adding an ``mon osd min down reporters`` +and ``mon osd reporter subtree level`` settings under the ``[mon]`` section of +your Ceph configuration file, or by setting the value at runtime. + + +.. ditaa:: + + +---------+ +---------+ +---------+ + | OSD 1 | | OSD 2 | | Monitor | + +---------+ +---------+ +---------+ + | | | + | OSD 3 Is Down | | + |---------------+--------------->| + | | | + | | | + | | OSD 3 Is Down | + | |--------------->| + | | | + | | | + | | |---------+ Mark + | | | | OSD 3 + | | |<--------+ Down + + +.. index:: peering failure + +OSDs Report Peering Failure +=========================== + +If a Ceph OSD Daemon cannot peer with any of the Ceph OSD Daemons defined in its +Ceph configuration file (or the cluster map), it will ping a Ceph Monitor for +the most recent copy of the cluster map every 30 seconds. You can change the +Ceph Monitor heartbeat interval by adding an ``osd mon heartbeat interval`` +setting under the ``[osd]`` section of your Ceph configuration file, or by +setting the value at runtime. + +.. ditaa:: + + +---------+ +---------+ +-------+ +---------+ + | OSD 1 | | OSD 2 | | OSD 3 | | Monitor | + +---------+ +---------+ +-------+ +---------+ + | | | | + | Request To | | | + | Peer | | | + |-------------->| | | + |<--------------| | | + | Peering | | + | | | + | Request To | | + | Peer | | + |----------------------------->| | + | | + |----+ OSD Monitor | + | | Heartbeat | + |<---+ Interval Exceeded | + | | + | Failed to Peer with OSD 3 | + |-------------------------------------------->| + |<--------------------------------------------| + | Receive New Cluster Map | + + +.. index:: OSD status + +OSDs Report Their Status +======================== + +If an Ceph OSD Daemon doesn't report to a Ceph Monitor, the Ceph Monitor will +consider the Ceph OSD Daemon ``down`` after the ``mon osd report timeout`` +elapses. A Ceph OSD Daemon sends a report to a Ceph Monitor when a reportable +event such as a failure, a change in placement group stats, a change in +``up_thru`` or when it boots within 5 seconds. You can change the Ceph OSD +Daemon minimum report interval by adding an ``osd mon report interval`` +setting under the ``[osd]`` section of your Ceph configuration file, or by +setting the value at runtime. A Ceph OSD Daemon sends a report to a Ceph +Monitor every 120 seconds irrespective of whether any notable changes occur. 
+You can change the Ceph Monitor report interval by adding an ``osd mon report +interval max`` setting under the ``[osd]`` section of your Ceph configuration +file, or by setting the value at runtime. + + +.. ditaa:: + + +---------+ +---------+ + | OSD 1 | | Monitor | + +---------+ +---------+ + | | + |----+ Report Min | + | | Interval | + |<---+ Exceeded | + | | + |----+ Reportable | + | | Event | + |<---+ Occurs | + | | + | Report To | + | Monitor | + |------------------->| + | | + |----+ Report Max | + | | Interval | + |<---+ Exceeded | + | | + | Report To | + | Monitor | + |------------------->| + | | + |----+ Monitor | + | | Fails | + |<---+ | + +----+ Monitor OSD + | | Report Timeout + |<---+ Exceeded + | + +----+ Mark + | | OSD 1 + |<---+ Down + + + + +Configuration Settings +====================== + +When modifying heartbeat settings, you should include them in the ``[global]`` +section of your configuration file. + +.. index:: monitor heartbeat + +Monitor Settings +---------------- + +``mon osd min up ratio`` + +:Description: The minimum ratio of ``up`` Ceph OSD Daemons before Ceph will + mark Ceph OSD Daemons ``down``. + +:Type: Double +:Default: ``.3`` + + +``mon osd min in ratio`` + +:Description: The minimum ratio of ``in`` Ceph OSD Daemons before Ceph will + mark Ceph OSD Daemons ``out``. + +:Type: Double +:Default: ``.75`` + + +``mon osd laggy halflife`` + +:Description: The number of seconds laggy estimates will decay. +:Type: Integer +:Default: ``60*60`` + + +``mon osd laggy weight`` + +:Description: The weight for new samples in laggy estimation decay. +:Type: Double +:Default: ``0.3`` + + + +``mon osd laggy max interval`` + +:Description: Maximum value of ``laggy_interval`` in laggy estimations (in seconds). + Monitor uses an adaptive approach to evaluate the ``laggy_interval`` of + a certain OSD. This value will be used to calculate the grace time for + that OSD. +:Type: Integer +:Default: 300 + +``mon osd adjust heartbeat grace`` + +:Description: If set to ``true``, Ceph will scale based on laggy estimations. +:Type: Boolean +:Default: ``true`` + + +``mon osd adjust down out interval`` + +:Description: If set to ``true``, Ceph will scaled based on laggy estimations. +:Type: Boolean +:Default: ``true`` + + +``mon osd auto mark in`` + +:Description: Ceph will mark any booting Ceph OSD Daemons as ``in`` + the Ceph Storage Cluster. + +:Type: Boolean +:Default: ``false`` + + +``mon osd auto mark auto out in`` + +:Description: Ceph will mark booting Ceph OSD Daemons auto marked ``out`` + of the Ceph Storage Cluster as ``in`` the cluster. + +:Type: Boolean +:Default: ``true`` + + +``mon osd auto mark new in`` + +:Description: Ceph will mark booting new Ceph OSD Daemons as ``in`` the + Ceph Storage Cluster. + +:Type: Boolean +:Default: ``true`` + + +``mon osd down out interval`` + +:Description: The number of seconds Ceph waits before marking a Ceph OSD Daemon + ``down`` and ``out`` if it doesn't respond. + +:Type: 32-bit Integer +:Default: ``600`` + + +``mon osd down out subtree limit`` + +:Description: The smallest :term:`CRUSH` unit type that Ceph will **not** + automatically mark out. For instance, if set to ``host`` and if + all OSDs of a host are down, Ceph will not automatically mark out + these OSDs. + +:Type: String +:Default: ``rack`` + + +``mon osd report timeout`` + +:Description: The grace period in seconds before declaring + unresponsive Ceph OSD Daemons ``down``. 
+
+:Type: 32-bit Integer
+:Default: ``900``
+
+``mon osd min down reporters``
+
+:Description: The minimum number of Ceph OSD Daemons required to report a
+              ``down`` Ceph OSD Daemon.
+
+:Type: 32-bit Integer
+:Default: ``2``
+
+
+``mon_osd_reporter_subtree_level``
+
+:Description: The CRUSH ancestor (bucket) level in which reporters are counted.
+              OSDs send failure reports to the monitors if they find that a
+              peer is not responsive. The monitors mark the reported OSD
+              ``down`` and, after a grace period, ``out``.
+:Type: String
+:Default: ``host``
+
+
+.. index:: OSD heartbeat
+
+OSD Settings
+------------
+
+``osd_heartbeat_interval``
+
+:Description: How often a Ceph OSD Daemon pings its peers (in seconds).
+:Type: 32-bit Integer
+:Default: ``6``
+
+
+``osd_heartbeat_grace``
+
+:Description: The number of seconds a Ceph OSD Daemon can go without showing a
+              heartbeat before the Ceph Storage Cluster considers it ``down``.
+              This setting must be set in both the ``[mon]`` and ``[osd]`` (or
+              ``[global]``) sections so that it is read by both the monitor and
+              OSD daemons.
+:Type: 32-bit Integer
+:Default: ``20``
+
+
+``osd_mon_heartbeat_interval``
+
+:Description: How often the Ceph OSD Daemon pings a Ceph Monitor if it has no
+              Ceph OSD Daemon peers.
+
+:Type: 32-bit Integer
+:Default: ``30``
+
+
+``osd_mon_heartbeat_stat_stale``
+
+:Description: Stop reporting on heartbeat ping times that haven't been updated
+              for this many seconds. Set to zero to disable this action.
+
+:Type: 32-bit Integer
+:Default: ``3600``
+
+
+``osd_mon_report_interval``
+
+:Description: The number of seconds a Ceph OSD Daemon may wait
+              from startup or another reportable event before reporting
+              to a Ceph Monitor.
+
+:Type: 32-bit Integer
+:Default: ``5``
diff --git a/doc/rados/configuration/ms-ref.rst b/doc/rados/configuration/ms-ref.rst
new file mode 100644
index 000000000..113bd0913
--- /dev/null
+++ b/doc/rados/configuration/ms-ref.rst
@@ -0,0 +1,133 @@
+===========
+ Messaging
+===========
+
+General Settings
+================
+
+``ms_tcp_nodelay``
+
+:Description: Disables Nagle's algorithm on messenger TCP sessions.
+:Type: Boolean
+:Required: No
+:Default: ``true``
+
+
+``ms_initial_backoff``
+
+:Description: The initial time to wait before reconnecting on a fault.
+:Type: Double
+:Required: No
+:Default: ``.2``
+
+
+``ms_max_backoff``
+
+:Description: The maximum time to wait before reconnecting on a fault.
+:Type: Double
+:Required: No
+:Default: ``15.0``
+
+
+``ms_nocrc``
+
+:Description: Disables CRC on network messages. May increase performance if the CPU is the limiting factor.
+:Type: Boolean
+:Required: No
+:Default: ``false``
+
+
+``ms_die_on_bad_msg``
+
+:Description: Debug option; do not configure.
+:Type: Boolean
+:Required: No
+:Default: ``false``
+
+
+``ms_dispatch_throttle_bytes``
+
+:Description: Throttles the total size of messages waiting to be dispatched.
+:Type: 64-bit Unsigned Integer
+:Required: No
+:Default: ``100 << 20``
+
+
+``ms_bind_ipv6``
+
+:Description: Enable to bind daemons to IPv6 addresses instead of IPv4. Not required if you specify a daemon or cluster IP.
+:Type: Boolean
+:Required: No
+:Default: ``false``
+
+
+``ms_rwthread_stack_bytes``
+
+:Description: Debug option for stack size; do not configure.
+:Type: 64-bit Unsigned Integer
+:Required: No
+:Default: ``1024 << 10``
+
+
+``ms_tcp_read_timeout``
+
+:Description: Controls how long (in seconds) the messenger will wait before closing an idle connection.
+:Type: 64-bit Unsigned Integer +:Required: No +:Default: ``900`` + + +``ms_inject_socket_failures`` + +:Description: Debug option; do not configure. +:Type: 64-bit Unsigned Integer +:Required: No +:Default: ``0`` + +Async messenger options +======================= + + +``ms_async_transport_type`` + +:Description: Transport type used by Async Messenger. Can be ``posix``, ``dpdk`` + or ``rdma``. Posix uses standard TCP/IP networking and is default. + Other transports may be experimental and support may be limited. +:Type: String +:Required: No +:Default: ``posix`` + + +``ms_async_op_threads`` + +:Description: Initial number of worker threads used by each Async Messenger instance. + Should be at least equal to highest number of replicas, but you can + decrease it if you are low on CPU core count and/or you host a lot of + OSDs on single server. +:Type: 64-bit Unsigned Integer +:Required: No +:Default: ``3`` + + +``ms_async_max_op_threads`` + +:Description: Maximum number of worker threads used by each Async Messenger instance. + Set to lower values when your machine has limited CPU count, and increase + when your CPUs are underutilized (i. e. one or more of CPUs are + constantly on 100% load during I/O operations). +:Type: 64-bit Unsigned Integer +:Required: No +:Default: ``5`` + + +``ms_async_send_inline`` + +:Description: Send messages directly from the thread that generated them instead of + queuing and sending from Async Messenger thread. This option is known + to decrease performance on systems with a lot of CPU cores, so it's + disabled by default. +:Type: Boolean +:Required: No +:Default: ``false`` + + diff --git a/doc/rados/configuration/msgr2.rst b/doc/rados/configuration/msgr2.rst new file mode 100644 index 000000000..3415b1f5f --- /dev/null +++ b/doc/rados/configuration/msgr2.rst @@ -0,0 +1,233 @@ +.. _msgr2: + +Messenger v2 +============ + +What is it +---------- + +The messenger v2 protocol, or msgr2, is the second major revision on +Ceph's on-wire protocol. It brings with it several key features: + +* A *secure* mode that encrypts all data passing over the network +* Improved encapsulation of authentication payloads, enabling future + integration of new authentication modes like Kerberos +* Improved earlier feature advertisement and negotiation, enabling + future protocol revisions + +Ceph daemons can now bind to multiple ports, allowing both legacy Ceph +clients and new v2-capable clients to connect to the same cluster. + +By default, monitors now bind to the new IANA-assigned port ``3300`` +(ce4h or 0xce4) for the new v2 protocol, while also binding to the +old default port ``6789`` for the legacy v1 protocol. + +.. _address_formats: + +Address formats +--------------- + +Prior to Nautilus, all network addresses were rendered like +``1.2.3.4:567/89012`` where there was an IP address, a port, and a +nonce to uniquely identify a client or daemon on the network. +Starting with Nautilus, we now have three different address types: + +* **v2**: ``v2:1.2.3.4:578/89012`` identifies a daemon binding to a + port speaking the new v2 protocol +* **v1**: ``v1:1.2.3.4:578/89012`` identifies a daemon binding to a + port speaking the legacy v1 protocol. Any address that was + previously shown with any prefix is now shown as a ``v1:`` address. +* **TYPE_ANY** ``any:1.2.3.4:578/89012`` identifies a client that can + speak either version of the protocol. 
Prior to Nautilus, clients would appear as
+  ``1.2.3.4:0/123456``, where the port of 0 indicates they are clients
+  and do not accept incoming connections.  Starting with Nautilus,
+  these clients are now internally represented by a **TYPE_ANY**
+  address, and still shown with no prefix, because they may
+  connect to daemons using the v2 or v1 protocol, depending on what
+  protocol(s) the daemons are using.
+
+Because daemons now bind to multiple ports, they are now described by
+a vector of addresses instead of a single address.  For example,
+dumping the monitor map on a Nautilus cluster now includes lines
+like::
+
+  epoch 1
+  fsid 50fcf227-be32-4bcb-8b41-34ca8370bd16
+  last_changed 2019-02-25 11:10:46.700821
+  created 2019-02-25 11:10:46.700821
+  min_mon_release 14 (nautilus)
+  0: [v2:10.0.0.10:3300/0,v1:10.0.0.10:6789/0] mon.foo
+  1: [v2:10.0.0.11:3300/0,v1:10.0.0.11:6789/0] mon.bar
+  2: [v2:10.0.0.12:3300/0,v1:10.0.0.12:6789/0] mon.baz
+
+The bracketed list or vector of addresses means that the same daemon can be
+reached on multiple ports (and protocols).  Any client or other daemon
+connecting to that daemon will use the v2 protocol (listed first) if
+possible; otherwise it will fall back to the legacy v1 protocol.  Legacy
+clients will only see the v1 addresses and will continue to connect as
+they did before, with the v1 protocol.
+
+Starting in Nautilus, the ``mon_host`` configuration option and ``-m
+<mon-host>`` command line options support the same bracketed address
+vector syntax.
+
+
+Bind configuration options
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Two new configuration options control whether the v1 and/or v2
+protocol is used:
+
+  * ``ms_bind_msgr1`` [default: true] controls whether a daemon binds
+    to a port speaking the v1 protocol
+  * ``ms_bind_msgr2`` [default: true] controls whether a daemon binds
+    to a port speaking the v2 protocol
+
+Similarly, two options control whether IPv4 and IPv6 addresses are used:
+
+  * ``ms_bind_ipv4`` [default: true] controls whether a daemon binds
+    to an IPv4 address
+  * ``ms_bind_ipv6`` [default: false] controls whether a daemon binds
+    to an IPv6 address
+
+.. note:: The ability to bind to multiple ports has paved the way for
+   dual-stack IPv4 and IPv6 support.  That said, dual-stack support is
+   not yet tested as of Nautilus v14.2.0 and likely needs some
+   additional code changes to work correctly.
+
+Connection modes
+----------------
+
+The v2 protocol supports two connection modes:
+
+* *crc* mode provides:
+
+  - a strong initial authentication when the connection is established
+    (with cephx, mutual authentication of both parties with protection
+    from a man-in-the-middle or eavesdropper), and
+  - a crc32c integrity check to protect against bit flips due to flaky
+    hardware or cosmic rays
+
+  *crc* mode does *not* provide:
+
+  - secrecy (an eavesdropper on the network can see all
+    post-authentication traffic as it goes by) or
+  - protection from a malicious man-in-the-middle (who can deliberately
+    modify traffic as it goes by, as long as they are careful to
+    adjust the crc32c values to match)
+
+* *secure* mode provides:
+
+  - a strong initial authentication when the connection is established
+    (with cephx, mutual authentication of both parties with protection
+    from a man-in-the-middle or eavesdropper), and
+  - full encryption of all post-authentication traffic, including a
+    cryptographic integrity check.
+ + In Nautilus, secure mode uses the `AES-GCM + <https://en.wikipedia.org/wiki/Galois/Counter_Mode>`_ stream cipher, + which is generally very fast on modern processors (e.g., faster than + a SHA-256 cryptographic hash). + +Connection mode configuration options +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +For most connections, there are options that control which modes are used: + +* ``ms_cluster_mode`` is the connection mode (or permitted modes) used + for intra-cluster communication between Ceph daemons. If multiple + modes are listed, the modes listed first are preferred. +* ``ms_service_mode`` is a list of permitted modes for clients to use + when connecting to the cluster. +* ``ms_client_mode`` is a list of connection modes, in order of + preference, for clients to use (or allow) when talking to a Ceph + cluster. + +There are a parallel set of options that apply specifically to +monitors, allowing administrators to set different (usually more +secure) requirements on communication with the monitors. + +* ``ms_mon_cluster_mode`` is the connection mode (or permitted modes) + to use between monitors. +* ``ms_mon_service_mode`` is a list of permitted modes for clients or + other Ceph daemons to use when connecting to monitors. +* ``ms_mon_client_mode`` is a list of connection modes, in order of + preference, for clients or non-monitor daemons to use when + connecting to monitors. + + +Transitioning from v1-only to v2-plus-v1 +---------------------------------------- + +By default, ``ms_bind_msgr2`` is true starting with Nautilus 14.2.z. +However, until the monitors start using v2, only limited services will +start advertising v2 addresses. + +For most users, the monitors are binding to the default legacy port ``6789`` +for the v1 protocol. When this is the case, enabling v2 is as simple as: + +.. prompt:: bash $ + + ceph mon enable-msgr2 + +If the monitors are bound to non-standard ports, you will need to +specify an additional port for v2 explicitly. For example, if your +monitor ``mon.a`` binds to ``1.2.3.4:1111``, and you want to add v2 on +port ``1112``: + +.. prompt:: bash $ + + ceph mon set-addrs a [v2:1.2.3.4:1112,v1:1.2.3.4:1111] + +Once the monitors bind to v2, each daemon will start advertising a v2 +address when it is next restarted. + + +.. _msgr2_ceph_conf: + +Updating ceph.conf and mon_host +------------------------------- + +Prior to Nautilus, a CLI user or daemon will normally discover the +monitors via the ``mon_host`` option in ``/etc/ceph/ceph.conf``. The +syntax for this option has expanded starting with Nautilus to allow +support the new bracketed list format. For example, an old line +like:: + + mon_host = 10.0.0.1:6789,10.0.0.2:6789,10.0.0.3:6789 + +Can be changed to:: + + mon_host = [v2:10.0.0.1:3300/0,v1:10.0.0.1:6789/0],[v2:10.0.0.2:3300/0,v1:10.0.0.2:6789/0],[v2:10.0.0.3:3300/0,v1:10.0.0.3:6789/0] + +However, when default ports are used (``3300`` and ``6789``), they can +be omitted:: + + mon_host = 10.0.0.1,10.0.0.2,10.0.0.3 + +Once v2 has been enabled on the monitors, ``ceph.conf`` may need to be +updated to either specify no ports (this is usually simplest), or +explicitly specify both the v2 and v1 addresses. Note, however, that +the new bracketed syntax is only understood by Nautilus and later, so +do not make that change on hosts that have not yet had their ceph +packages upgraded. 
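+
+To confirm what each monitor is advertising after enabling msgr2, you can dump
+the monitor map; each monitor entry should list both a ``v2:`` and a ``v1:``
+address, as in the monmap example shown earlier:
+
+.. prompt:: bash $
+
+   ceph mon dump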
+ +When you are updating ``ceph.conf``, note the new ``ceph config +generate-minimal-conf`` command (which generates a barebones config +file with just enough information to reach the monitors) and the +``ceph config assimilate-conf`` (which moves config file options into +the monitors' configuration database) may be helpful. For example,:: + + # ceph config assimilate-conf < /etc/ceph/ceph.conf + # ceph config generate-minimal-config > /etc/ceph/ceph.conf.new + # cat /etc/ceph/ceph.conf.new + # minimal ceph.conf for 0e5a806b-0ce5-4bc6-b949-aa6f68f5c2a3 + [global] + fsid = 0e5a806b-0ce5-4bc6-b949-aa6f68f5c2a3 + mon_host = [v2:10.0.0.1:3300/0,v1:10.0.0.1:6789/0] + # mv /etc/ceph/ceph.conf.new /etc/ceph/ceph.conf + +Protocol +-------- + +For a detailed description of the v2 wire protocol, see :ref:`msgr2-protocol`. diff --git a/doc/rados/configuration/network-config-ref.rst b/doc/rados/configuration/network-config-ref.rst new file mode 100644 index 000000000..97229a401 --- /dev/null +++ b/doc/rados/configuration/network-config-ref.rst @@ -0,0 +1,454 @@ +================================= + Network Configuration Reference +================================= + +Network configuration is critical for building a high performance :term:`Ceph +Storage Cluster`. The Ceph Storage Cluster does not perform request routing or +dispatching on behalf of the :term:`Ceph Client`. Instead, Ceph Clients make +requests directly to Ceph OSD Daemons. Ceph OSD Daemons perform data replication +on behalf of Ceph Clients, which means replication and other factors impose +additional loads on Ceph Storage Cluster networks. + +Our Quick Start configurations provide a trivial Ceph configuration file that +sets monitor IP addresses and daemon host names only. Unless you specify a +cluster network, Ceph assumes a single "public" network. Ceph functions just +fine with a public network only, but you may see significant performance +improvement with a second "cluster" network in a large cluster. + +It is possible to run a Ceph Storage Cluster with two networks: a public +(client, front-side) network and a cluster (private, replication, back-side) +network. However, this approach +complicates network configuration (both hardware and software) and does not usually +have a significant impact on overall performance. For this reason, we recommend +that for resilience and capacity dual-NIC systems either active/active bond +these interfaces or implemebnt a layer 3 multipath strategy with eg. FRR. + +If, despite the complexity, one still wishes to use two networks, each +:term:`Ceph Node` will need to have more than one network interface or VLAN. See `Hardware +Recommendations - Networks`_ for additional details. + +.. ditaa:: + +-------------+ + | Ceph Client | + +----*--*-----+ + | ^ + Request | : Response + v | + /----------------------------------*--*-------------------------------------\ + | Public Network | + \---*--*------------*--*-------------*--*------------*--*------------*--*---/ + ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ + | | | | | | | | | | + | : | : | : | : | : + v v v v v v v v v v + +---*--*---+ +---*--*---+ +---*--*---+ +---*--*---+ +---*--*---+ + | Ceph MON | | Ceph MDS | | Ceph OSD | | Ceph OSD | | Ceph OSD | + +----------+ +----------+ +---*--*---+ +---*--*---+ +---*--*---+ + ^ ^ ^ ^ ^ ^ + The cluster network relieves | | | | | | + OSD replication and heartbeat | : | : | : + traffic from the public network. 
v v v v v v + /------------------------------------*--*------------*--*------------*--*---\ + | cCCC Cluster Network | + \---------------------------------------------------------------------------/ + + +IP Tables +========= + +By default, daemons `bind`_ to ports within the ``6800:7300`` range. You may +configure this range at your discretion. Before configuring your IP tables, +check the default ``iptables`` configuration. + +.. prompt:: bash $ + + sudo iptables -L + +Some Linux distributions include rules that reject all inbound requests +except SSH from all network interfaces. For example:: + + REJECT all -- anywhere anywhere reject-with icmp-host-prohibited + +You will need to delete these rules on both your public and cluster networks +initially, and replace them with appropriate rules when you are ready to +harden the ports on your Ceph Nodes. + + +Monitor IP Tables +----------------- + +Ceph Monitors listen on ports ``3300`` and ``6789`` by +default. Additionally, Ceph Monitors always operate on the public +network. When you add the rule using the example below, make sure you +replace ``{iface}`` with the public network interface (e.g., ``eth0``, +``eth1``, etc.), ``{ip-address}`` with the IP address of the public +network and ``{netmask}`` with the netmask for the public network. : + +.. prompt:: bash $ + + sudo iptables -A INPUT -i {iface} -p tcp -s {ip-address}/{netmask} --dport 6789 -j ACCEPT + + +MDS and Manager IP Tables +------------------------- + +A :term:`Ceph Metadata Server` or :term:`Ceph Manager` listens on the first +available port on the public network beginning at port 6800. Note that this +behavior is not deterministic, so if you are running more than one OSD or MDS +on the same host, or if you restart the daemons within a short window of time, +the daemons will bind to higher ports. You should open the entire 6800-7300 +range by default. When you add the rule using the example below, make sure +you replace ``{iface}`` with the public network interface (e.g., ``eth0``, +``eth1``, etc.), ``{ip-address}`` with the IP address of the public network +and ``{netmask}`` with the netmask of the public network. + +For example: + +.. prompt:: bash $ + + sudo iptables -A INPUT -i {iface} -m multiport -p tcp -s {ip-address}/{netmask} --dports 6800:7300 -j ACCEPT + + +OSD IP Tables +------------- + +By default, Ceph OSD Daemons `bind`_ to the first available ports on a Ceph Node +beginning at port 6800. Note that this behavior is not deterministic, so if you +are running more than one OSD or MDS on the same host, or if you restart the +daemons within a short window of time, the daemons will bind to higher ports. +Each Ceph OSD Daemon on a Ceph Node may use up to four ports: + +#. One for talking to clients and monitors. +#. One for sending data to other OSDs. +#. Two for heartbeating on each interface. + +.. ditaa:: + /---------------\ + | OSD | + | +---+----------------+-----------+ + | | Clients & Monitors | Heartbeat | + | +---+----------------+-----------+ + | | + | +---+----------------+-----------+ + | | Data Replication | Heartbeat | + | +---+----------------+-----------+ + | cCCC | + \---------------/ + +When a daemon fails and restarts without letting go of the port, the restarted +daemon will bind to a new port. You should open the entire 6800-7300 port range +to handle this possibility. 
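+
+If your hosts use ``firewalld`` rather than raw ``iptables`` rules, an
+equivalent approach is to open the same port range in the appropriate zone.
+The ``public`` zone below is only an example; substitute the zone bound to
+your Ceph network interfaces:
+
+.. prompt:: bash $
+
+   sudo firewall-cmd --zone=public --add-port=6800-7300/tcp --permanent
+   sudo firewall-cmd --reload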
+ +If you set up separate public and cluster networks, you must add rules for both +the public network and the cluster network, because clients will connect using +the public network and other Ceph OSD Daemons will connect using the cluster +network. When you add the rule using the example below, make sure you replace +``{iface}`` with the network interface (e.g., ``eth0``, ``eth1``, etc.), +``{ip-address}`` with the IP address and ``{netmask}`` with the netmask of the +public or cluster network. For example: + +.. prompt:: bash $ + + sudo iptables -A INPUT -i {iface} -m multiport -p tcp -s {ip-address}/{netmask} --dports 6800:7300 -j ACCEPT + +.. tip:: If you run Ceph Metadata Servers on the same Ceph Node as the + Ceph OSD Daemons, you can consolidate the public network configuration step. + + +Ceph Networks +============= + +To configure Ceph networks, you must add a network configuration to the +``[global]`` section of the configuration file. Our 5-minute Quick Start +provides a trivial Ceph configuration file that assumes one public network +with client and server on the same network and subnet. Ceph functions just fine +with a public network only. However, Ceph allows you to establish much more +specific criteria, including multiple IP network and subnet masks for your +public network. You can also establish a separate cluster network to handle OSD +heartbeat, object replication and recovery traffic. Don't confuse the IP +addresses you set in your configuration with the public-facing IP addresses +network clients may use to access your service. Typical internal IP networks are +often ``192.168.0.0`` or ``10.0.0.0``. + +.. tip:: If you specify more than one IP address and subnet mask for + either the public or the cluster network, the subnets within the network + must be capable of routing to each other. Additionally, make sure you + include each IP address/subnet in your IP tables and open ports for them + as necessary. + +.. note:: Ceph uses `CIDR`_ notation for subnets (e.g., ``10.0.0.0/24``). + +When you have configured your networks, you may restart your cluster or restart +each daemon. Ceph daemons bind dynamically, so you do not have to restart the +entire cluster at once if you change your network configuration. + + +Public Network +-------------- + +To configure a public network, add the following option to the ``[global]`` +section of your Ceph configuration file. + +.. code-block:: ini + + [global] + # ... elided configuration + public_network = {public-network/netmask} + +.. _cluster-network: + +Cluster Network +--------------- + +If you declare a cluster network, OSDs will route heartbeat, object replication +and recovery traffic over the cluster network. This may improve performance +compared to using a single network. To configure a cluster network, add the +following option to the ``[global]`` section of your Ceph configuration file. + +.. code-block:: ini + + [global] + # ... elided configuration + cluster_network = {cluster-network/netmask} + +We prefer that the cluster network is **NOT** reachable from the public network +or the Internet for added security. + +IPv4/IPv6 Dual Stack Mode +------------------------- + +If you want to run in an IPv4/IPv6 dual stack mode and want to define your public and/or +cluster networks, then you need to specify both your IPv4 and IPv6 networks for each: + +.. code-block:: ini + + [global] + # ... 
elided configuration + public_network = {IPv4 public-network/netmask}, {IPv6 public-network/netmask} + +This is so that Ceph can find a valid IP address for both address families. + +If you want just an IPv4 or an IPv6 stack environment, then make sure you set the `ms bind` +options correctly. + +.. note:: + Binding to IPv4 is enabled by default, so if you just add the option to bind to IPv6 + you'll actually put yourself into dual stack mode. If you want just IPv6, then disable IPv4 and + enable IPv6. See `Bind`_ below. + +Ceph Daemons +============ + +Monitor daemons are each configured to bind to a specific IP address. These +addresses are normally configured by your deployment tool. Other components +in the Ceph cluster discover the monitors via the ``mon host`` configuration +option, normally specified in the ``[global]`` section of the ``ceph.conf`` file. + +.. code-block:: ini + + [global] + mon_host = 10.0.0.2, 10.0.0.3, 10.0.0.4 + +The ``mon_host`` value can be a list of IP addresses or a name that is +looked up via DNS. In the case of a DNS name with multiple A or AAAA +records, all records are probed in order to discover a monitor. Once +one monitor is reached, all other current monitors are discovered, so +the ``mon host`` configuration option only needs to be sufficiently up +to date such that a client can reach one monitor that is currently online. + +The MGR, OSD, and MDS daemons will bind to any available address and +do not require any special configuration. However, it is possible to +specify a specific IP address for them to bind to with the ``public +addr`` (and/or, in the case of OSD daemons, the ``cluster addr``) +configuration option. For example, + +.. code-block:: ini + + [osd.0] + public addr = {host-public-ip-address} + cluster addr = {host-cluster-ip-address} + +.. topic:: One NIC OSD in a Two Network Cluster + + Generally, we do not recommend deploying an OSD host with a single network interface in a + cluster with two networks. However, you may accomplish this by forcing the + OSD host to operate on the public network by adding a ``public_addr`` entry + to the ``[osd.n]`` section of the Ceph configuration file, where ``n`` + refers to the ID of the OSD with one network interface. Additionally, the public + network and cluster network must be able to route traffic to each other, + which we don't recommend for security reasons. + + +Network Config Settings +======================= + +Network configuration settings are not required. Ceph assumes a public network +with all hosts operating on it unless you specifically configure a cluster +network. + + +Public Network +-------------- + +The public network configuration allows you specifically define IP addresses +and subnets for the public network. You may specifically assign static IP +addresses or override ``public_network`` settings using the ``public_addr`` +setting for a specific daemon. + +``public_network`` + +:Description: The IP address and netmask of the public (front-side) network + (e.g., ``192.168.0.0/24``). Set in ``[global]``. You may specify + comma-separated subnets. + +:Type: ``{ip-address}/{netmask} [, {ip-address}/{netmask}]`` +:Required: No +:Default: N/A + + +``public_addr`` + +:Description: The IP address for the public (front-side) network. + Set for each daemon. 
+ +:Type: IP Address +:Required: No +:Default: N/A + + + +Cluster Network +--------------- + +The cluster network configuration allows you to declare a cluster network, and +specifically define IP addresses and subnets for the cluster network. You may +specifically assign static IP addresses or override ``cluster_network`` +settings using the ``cluster_addr`` setting for specific OSD daemons. + + +``cluster_network`` + +:Description: The IP address and netmask of the cluster (back-side) network + (e.g., ``10.0.0.0/24``). Set in ``[global]``. You may specify + comma-separated subnets. + +:Type: ``{ip-address}/{netmask} [, {ip-address}/{netmask}]`` +:Required: No +:Default: N/A + + +``cluster_addr`` + +:Description: The IP address for the cluster (back-side) network. + Set for each daemon. + +:Type: Address +:Required: No +:Default: N/A + + +Bind +---- + +Bind settings set the default port ranges Ceph OSD and MDS daemons use. The +default range is ``6800:7300``. Ensure that your `IP Tables`_ configuration +allows you to use the configured port range. + +You may also enable Ceph daemons to bind to IPv6 addresses instead of IPv4 +addresses. + + +``ms_bind_port_min`` + +:Description: The minimum port number to which an OSD or MDS daemon will bind. +:Type: 32-bit Integer +:Default: ``6800`` +:Required: No + + +``ms_bind_port_max`` + +:Description: The maximum port number to which an OSD or MDS daemon will bind. +:Type: 32-bit Integer +:Default: ``7300`` +:Required: No. + +``ms_bind_ipv4`` + +:Description: Enables Ceph daemons to bind to IPv4 addresses. +:Type: Boolean +:Default: ``true`` +:Required: No + +``ms_bind_ipv6`` + +:Description: Enables Ceph daemons to bind to IPv6 addresses. +:Type: Boolean +:Default: ``false`` +:Required: No + +``public_bind_addr`` + +:Description: In some dynamic deployments the Ceph MON daemon might bind + to an IP address locally that is different from the ``public_addr`` + advertised to other peers in the network. The environment must ensure + that routing rules are set correctly. If ``public_bind_addr`` is set + the Ceph Monitor daemon will bind to it locally and use ``public_addr`` + in the monmaps to advertise its address to peers. This behavior is limited + to the Monitor daemon. + +:Type: IP Address +:Required: No +:Default: N/A + + + +TCP +--- + +Ceph disables TCP buffering by default. + + +``ms_tcp_nodelay`` + +:Description: Ceph enables ``ms_tcp_nodelay`` so that each request is sent + immediately (no buffering). Disabling `Nagle's algorithm`_ + increases network traffic, which can introduce latency. If you + experience large numbers of small packets, you may try + disabling ``ms_tcp_nodelay``. + +:Type: Boolean +:Required: No +:Default: ``true`` + + +``ms_tcp_rcvbuf`` + +:Description: The size of the socket buffer on the receiving end of a network + connection. Disable by default. + +:Type: 32-bit Integer +:Required: No +:Default: ``0`` + + +``ms_tcp_read_timeout`` + +:Description: If a client or daemon makes a request to another Ceph daemon and + does not drop an unused connection, the ``ms tcp read timeout`` + defines the connection as idle after the specified number + of seconds. + +:Type: Unsigned 64-bit Integer +:Required: No +:Default: ``900`` 15 minutes. + + + +.. _Scalability and High Availability: ../../../architecture#scalability-and-high-availability +.. _Hardware Recommendations - Networks: ../../../start/hardware-recommendations#networks +.. _hardware recommendations: ../../../start/hardware-recommendations +.. 
_Monitor / OSD Interaction: ../mon-osd-interaction +.. _Message Signatures: ../auth-config-ref#signatures +.. _CIDR: https://en.wikipedia.org/wiki/Classless_Inter-Domain_Routing +.. _Nagle's Algorithm: https://en.wikipedia.org/wiki/Nagle's_algorithm diff --git a/doc/rados/configuration/osd-config-ref.rst b/doc/rados/configuration/osd-config-ref.rst new file mode 100644 index 000000000..7f69b5e80 --- /dev/null +++ b/doc/rados/configuration/osd-config-ref.rst @@ -0,0 +1,1127 @@ +====================== + OSD Config Reference +====================== + +.. index:: OSD; configuration + +You can configure Ceph OSD Daemons in the Ceph configuration file (or in recent +releases, the central config store), but Ceph OSD +Daemons can use the default values and a very minimal configuration. A minimal +Ceph OSD Daemon configuration sets ``osd journal size`` (for Filestore), ``host``, and +uses default values for nearly everything else. + +Ceph OSD Daemons are numerically identified in incremental fashion, beginning +with ``0`` using the following convention. :: + + osd.0 + osd.1 + osd.2 + +In a configuration file, you may specify settings for all Ceph OSD Daemons in +the cluster by adding configuration settings to the ``[osd]`` section of your +configuration file. To add settings directly to a specific Ceph OSD Daemon +(e.g., ``host``), enter it in an OSD-specific section of your configuration +file. For example: + +.. code-block:: ini + + [osd] + osd_journal_size = 5120 + + [osd.0] + host = osd-host-a + + [osd.1] + host = osd-host-b + + +.. index:: OSD; config settings + +General Settings +================ + +The following settings provide a Ceph OSD Daemon's ID, and determine paths to +data and journals. Ceph deployment scripts typically generate the UUID +automatically. + +.. warning:: **DO NOT** change the default paths for data or journals, as it + makes it more problematic to troubleshoot Ceph later. + +When using Filestore, the journal size should be at least twice the product of the expected drive +speed multiplied by ``filestore_max_sync_interval``. However, the most common +practice is to partition the journal drive (often an SSD), and mount it such +that Ceph uses the entire partition for the journal. + + +``osd_uuid`` + +:Description: The universally unique identifier (UUID) for the Ceph OSD Daemon. +:Type: UUID +:Default: The UUID. +:Note: The ``osd_uuid`` applies to a single Ceph OSD Daemon. The ``fsid`` + applies to the entire cluster. + + +``osd_data`` + +:Description: The path to the OSDs data. You must create the directory when + deploying Ceph. You should mount a drive for OSD data at this + mount point. We do not recommend changing the default. + +:Type: String +:Default: ``/var/lib/ceph/osd/$cluster-$id`` + + +``osd_max_write_size`` + +:Description: The maximum size of a write in megabytes. +:Type: 32-bit Integer +:Default: ``90`` + + +``osd_max_object_size`` + +:Description: The maximum size of a RADOS object in bytes. +:Type: 32-bit Unsigned Integer +:Default: 128MB + + +``osd_client_message_size_cap`` + +:Description: The largest client data message allowed in memory. +:Type: 64-bit Unsigned Integer +:Default: 500MB default. ``500*1024L*1024L`` + + +``osd_class_dir`` + +:Description: The class path for RADOS class plug-ins. +:Type: String +:Default: ``$libdir/rados-classes`` + + +.. index:: OSD; file system + +File System Settings +==================== +Ceph builds and mounts file systems which are used for Ceph OSDs. 
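+
+The settings in this section normally live in the ``[osd]`` section of
+``ceph.conf`` and apply only to Filestore OSDs. As a minimal sketch (the
+values simply restate the XFS defaults documented below; they are
+illustrative, not a tuning recommendation):
+
+.. code-block:: ini
+
+    [osd]
+    osd_mkfs_options_xfs = -f -i 2048
+    osd_mount_options_xfs = rw,noatime,inode64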
+ +``osd_mkfs_options {fs-type}`` + +:Description: Options used when creating a new Ceph Filestore OSD of type {fs-type}. + +:Type: String +:Default for xfs: ``-f -i 2048`` +:Default for other file systems: {empty string} + +For example:: + ``osd_mkfs_options_xfs = -f -d agcount=24`` + +``osd_mount_options {fs-type}`` + +:Description: Options used when mounting a Ceph Filestore OSD of type {fs-type}. + +:Type: String +:Default for xfs: ``rw,noatime,inode64`` +:Default for other file systems: ``rw, noatime`` + +For example:: + ``osd_mount_options_xfs = rw, noatime, inode64, logbufs=8`` + + +.. index:: OSD; journal settings + +Journal Settings +================ + +This section applies only to the older Filestore OSD back end. Since Luminous +BlueStore has been default and preferred. + +By default, Ceph expects that you will provision a Ceph OSD Daemon's journal at +the following path, which is usually a symlink to a device or partition:: + + /var/lib/ceph/osd/$cluster-$id/journal + +When using a single device type (for example, spinning drives), the journals +should be *colocated*: the logical volume (or partition) should be in the same +device as the ``data`` logical volume. + +When using a mix of fast (SSDs, NVMe) devices with slower ones (like spinning +drives) it makes sense to place the journal on the faster device, while +``data`` occupies the slower device fully. + +The default ``osd_journal_size`` value is 5120 (5 gigabytes), but it can be +larger, in which case it will need to be set in the ``ceph.conf`` file. +A value of 10 gigabytes is common in practice:: + + osd_journal_size = 10240 + + +``osd_journal`` + +:Description: The path to the OSD's journal. This may be a path to a file or a + block device (such as a partition of an SSD). If it is a file, + you must create the directory to contain it. We recommend using a + separate fast device when the ``osd_data`` drive is an HDD. + +:Type: String +:Default: ``/var/lib/ceph/osd/$cluster-$id/journal`` + + +``osd_journal_size`` + +:Description: The size of the journal in megabytes. + +:Type: 32-bit Integer +:Default: ``5120`` + + +See `Journal Config Reference`_ for additional details. + + +Monitor OSD Interaction +======================= + +Ceph OSD Daemons check each other's heartbeats and report to monitors +periodically. Ceph can use default values in many cases. However, if your +network has latency issues, you may need to adopt longer intervals. See +`Configuring Monitor/OSD Interaction`_ for a detailed discussion of heartbeats. + + +Data Placement +============== + +See `Pool & PG Config Reference`_ for details. + + +.. index:: OSD; scrubbing + +Scrubbing +========= + +In addition to making multiple copies of objects, Ceph ensures data integrity by +scrubbing placement groups. Ceph scrubbing is analogous to ``fsck`` on the +object storage layer. For each placement group, Ceph generates a catalog of all +objects and compares each primary object and its replicas to ensure that no +objects are missing or mismatched. Light scrubbing (daily) checks the object +size and attributes. Deep scrubbing (weekly) reads the data and uses checksums +to ensure data integrity. + +Scrubbing is important for maintaining data integrity, but it can reduce +performance. You can adjust the following settings to increase or decrease +scrubbing operations. + + +``osd_max_scrubs`` + +:Description: The maximum number of simultaneous scrub operations for + a Ceph OSD Daemon. 
+ +:Type: 32-bit Int +:Default: ``1`` + +``osd_scrub_begin_hour`` + +:Description: This restricts scrubbing to this hour of the day or later. + Use ``osd_scrub_begin_hour = 0`` and ``osd_scrub_end_hour = 0`` + to allow scrubbing the entire day. Along with ``osd_scrub_end_hour``, they define a time + window, in which the scrubs can happen. + But a scrub will be performed + no matter whether the time window allows or not, as long as the placement + group's scrub interval exceeds ``osd_scrub_max_interval``. +:Type: Integer in the range of 0 to 23 +:Default: ``0`` + + +``osd_scrub_end_hour`` + +:Description: This restricts scrubbing to the hour earlier than this. + Use ``osd_scrub_begin_hour = 0`` and ``osd_scrub_end_hour = 0`` to allow scrubbing + for the entire day. Along with ``osd_scrub_begin_hour``, they define a time + window, in which the scrubs can happen. But a scrub will be performed + no matter whether the time window allows or not, as long as the placement + group's scrub interval exceeds ``osd_scrub_max_interval``. +:Type: Integer in the range of 0 to 23 +:Default: ``0`` + + +``osd_scrub_begin_week_day`` + +:Description: This restricts scrubbing to this day of the week or later. + 0 = Sunday, 1 = Monday, etc. Use ``osd_scrub_begin_week_day = 0`` + and ``osd_scrub_end_week_day = 0`` to allow scrubbing for the entire week. + Along with ``osd_scrub_end_week_day``, they define a time window in which + scrubs can happen. But a scrub will be performed + no matter whether the time window allows or not, when the PG's + scrub interval exceeds ``osd_scrub_max_interval``. +:Type: Integer in the range of 0 to 6 +:Default: ``0`` + + +``osd_scrub_end_week_day`` + +:Description: This restricts scrubbing to days of the week earlier than this. + 0 = Sunday, 1 = Monday, etc. Use ``osd_scrub_begin_week_day = 0`` + and ``osd_scrub_end_week_day = 0`` to allow scrubbing for the entire week. + Along with ``osd_scrub_begin_week_day``, they define a time + window, in which the scrubs can happen. But a scrub will be performed + no matter whether the time window allows or not, as long as the placement + group's scrub interval exceeds ``osd_scrub_max_interval``. +:Type: Integer in the range of 0 to 6 +:Default: ``0`` + + +``osd scrub during recovery`` + +:Description: Allow scrub during recovery. Setting this to ``false`` will disable + scheduling new scrub (and deep--scrub) while there is active recovery. + Already running scrubs will be continued. This might be useful to reduce + load on busy clusters. +:Type: Boolean +:Default: ``false`` + + +``osd_scrub_thread_timeout`` + +:Description: The maximum time in seconds before timing out a scrub thread. +:Type: 32-bit Integer +:Default: ``60`` + + +``osd_scrub_finalize_thread_timeout`` + +:Description: The maximum time in seconds before timing out a scrub finalize + thread. + +:Type: 32-bit Integer +:Default: ``10*60`` + + +``osd_scrub_load_threshold`` + +:Description: The normalized maximum load. Ceph will not scrub when the system load + (as defined by ``getloadavg() / number of online CPUs``) is higher than this number. + Default is ``0.5``. + +:Type: Float +:Default: ``0.5`` + + +``osd_scrub_min_interval`` + +:Description: The minimal interval in seconds for scrubbing the Ceph OSD Daemon + when the Ceph Storage Cluster load is low. + +:Type: Float +:Default: Once per day. ``24*60*60`` + +.. 
_osd_scrub_max_interval: + +``osd_scrub_max_interval`` + +:Description: The maximum interval in seconds for scrubbing the Ceph OSD Daemon + irrespective of cluster load. + +:Type: Float +:Default: Once per week. ``7*24*60*60`` + + +``osd_scrub_chunk_min`` + +:Description: The minimal number of object store chunks to scrub during single operation. + Ceph blocks writes to single chunk during scrub. + +:Type: 32-bit Integer +:Default: 5 + + +``osd_scrub_chunk_max`` + +:Description: The maximum number of object store chunks to scrub during single operation. + +:Type: 32-bit Integer +:Default: 25 + + +``osd_scrub_sleep`` + +:Description: Time to sleep before scrubbing the next group of chunks. Increasing this value will slow + down the overall rate of scrubbing so that client operations will be less impacted. + +:Type: Float +:Default: 0 + + +``osd_deep_scrub_interval`` + +:Description: The interval for "deep" scrubbing (fully reading all data). The + ``osd_scrub_load_threshold`` does not affect this setting. + +:Type: Float +:Default: Once per week. ``7*24*60*60`` + + +``osd_scrub_interval_randomize_ratio`` + +:Description: Add a random delay to ``osd_scrub_min_interval`` when scheduling + the next scrub job for a PG. The delay is a random + value less than ``osd_scrub_min_interval`` \* + ``osd_scrub_interval_randomized_ratio``. The default setting + spreads scrubs throughout the allowed time + window of ``[1, 1.5]`` \* ``osd_scrub_min_interval``. +:Type: Float +:Default: ``0.5`` + +``osd_deep_scrub_stride`` + +:Description: Read size when doing a deep scrub. +:Type: 32-bit Integer +:Default: 512 KB. ``524288`` + + +``osd_scrub_auto_repair`` + +:Description: Setting this to ``true`` will enable automatic PG repair when errors + are found by scrubs or deep-scrubs. However, if more than + ``osd_scrub_auto_repair_num_errors`` errors are found a repair is NOT performed. +:Type: Boolean +:Default: ``false`` + + +``osd_scrub_auto_repair_num_errors`` + +:Description: Auto repair will not occur if more than this many errors are found. +:Type: 32-bit Integer +:Default: ``5`` + + +.. index:: OSD; operations settings + +Operations +========== + + ``osd_op_queue`` + +:Description: This sets the type of queue to be used for prioritizing ops + within each OSD. Both queues feature a strict sub-queue which is + dequeued before the normal queue. The normal queue is different + between implementations. The WeightedPriorityQueue (``wpq``) + dequeues operations in relation to their priorities to prevent + starvation of any queue. WPQ should help in cases where a few OSDs + are more overloaded than others. The new mClockQueue + (``mclock_scheduler``) prioritizes operations based on which class + they belong to (recovery, scrub, snaptrim, client op, osd subop). + See `QoS Based on mClock`_. Requires a restart. + +:Type: String +:Valid Choices: wpq, mclock_scheduler +:Default: ``wpq`` + + +``osd_op_queue_cut_off`` + +:Description: This selects which priority ops will be sent to the strict + queue verses the normal queue. The ``low`` setting sends all + replication ops and higher to the strict queue, while the ``high`` + option sends only replication acknowledgment ops and higher to + the strict queue. Setting this to ``high`` should help when a few + OSDs in the cluster are very busy especially when combined with + ``wpq`` in the ``osd_op_queue`` setting. OSDs that are very busy + handling replication traffic could starve primary client traffic + on these OSDs without these settings. Requires a restart. 
+ +:Type: String +:Valid Choices: low, high +:Default: ``high`` + + +``osd_client_op_priority`` + +:Description: The priority set for client operations. This value is relative + to that of ``osd_recovery_op_priority`` below. The default + strongly favors client ops over recovery. + +:Type: 32-bit Integer +:Default: ``63`` +:Valid Range: 1-63 + + +``osd_recovery_op_priority`` + +:Description: The priority of recovery operations vs client operations, if not specified by the + pool's ``recovery_op_priority``. The default value prioritizes client + ops (see above) over recovery ops. You may adjust the tradeoff of client + impact against the time to restore cluster health by lowering this value + for increased prioritization of client ops, or by increasing it to favor + recovery. + +:Type: 32-bit Integer +:Default: ``3`` +:Valid Range: 1-63 + + +``osd_scrub_priority`` + +:Description: The default work queue priority for scheduled scrubs when the + pool doesn't specify a value of ``scrub_priority``. This can be + boosted to the value of ``osd_client_op_priority`` when scrubs are + blocking client operations. + +:Type: 32-bit Integer +:Default: ``5`` +:Valid Range: 1-63 + + +``osd_requested_scrub_priority`` + +:Description: The priority set for user requested scrub on the work queue. If + this value were to be smaller than ``osd_client_op_priority`` it + can be boosted to the value of ``osd_client_op_priority`` when + scrub is blocking client operations. + +:Type: 32-bit Integer +:Default: ``120`` + + +``osd_snap_trim_priority`` + +:Description: The priority set for the snap trim work queue. + +:Type: 32-bit Integer +:Default: ``5`` +:Valid Range: 1-63 + +``osd_snap_trim_sleep`` + +:Description: Time in seconds to sleep before next snap trim op. + Increasing this value will slow down snap trimming. + This option overrides backend specific variants. + +:Type: Float +:Default: ``0`` + + +``osd_snap_trim_sleep_hdd`` + +:Description: Time in seconds to sleep before next snap trim op + for HDDs. + +:Type: Float +:Default: ``5`` + + +``osd_snap_trim_sleep_ssd`` + +:Description: Time in seconds to sleep before next snap trim op + for SSD OSDs (including NVMe). + +:Type: Float +:Default: ``0`` + + +``osd_snap_trim_sleep_hybrid`` + +:Description: Time in seconds to sleep before next snap trim op + when OSD data is on an HDD and the OSD journal or WAL+DB is on an SSD. + +:Type: Float +:Default: ``2`` + +``osd_op_thread_timeout`` + +:Description: The Ceph OSD Daemon operation thread timeout in seconds. +:Type: 32-bit Integer +:Default: ``15`` + + +``osd_op_complaint_time`` + +:Description: An operation becomes complaint worthy after the specified number + of seconds have elapsed. + +:Type: Float +:Default: ``30`` + + +``osd_op_history_size`` + +:Description: The maximum number of completed operations to track. +:Type: 32-bit Unsigned Integer +:Default: ``20`` + + +``osd_op_history_duration`` + +:Description: The oldest completed operation to track. +:Type: 32-bit Unsigned Integer +:Default: ``600`` + + +``osd_op_log_threshold`` + +:Description: How many operations logs to display at once. +:Type: 32-bit Integer +:Default: ``5`` + + +.. _dmclock-qos: + +QoS Based on mClock +------------------- + +Ceph's use of mClock is now more refined and can be used by following the +steps as described in `mClock Config Reference`_. + +Core Concepts +````````````` + +Ceph's QoS support is implemented using a queueing scheduler +based on `the dmClock algorithm`_. 
This algorithm allocates the I/O
+resources of the Ceph cluster in proportion to weights, and enforces
+the constraints of minimum reservation and maximum limitation, so that
+services can compete fairly for resources. Currently the
+*mclock_scheduler* operation queue divides the Ceph services that use I/O
+resources into the following buckets:
+
+- client op: the IOPS issued by a client
+- osd subop: the IOPS issued by a primary OSD
+- snap trim: snap trimming requests
+- pg recovery: recovery requests
+- pg scrub: scrub requests
+
+The resources are partitioned using the following three sets of tags. In other
+words, the share of each type of service is controlled by three tags:
+
+#. reservation: the minimum IOPS allocated for the service.
+#. limitation: the maximum IOPS allocated for the service.
+#. weight: the proportional share of capacity if extra capacity is available
+   or the system is oversubscribed.
+
+In Ceph, operations are graded with a "cost", and the resources allocated
+for serving the various services are consumed by these "costs". So, for
+example, the more reservation a service has, the more resources it is
+guaranteed to possess, as long as it requires them. Assume there are two
+services, recovery and client ops:
+
+- recovery: (r:1, l:5, w:1)
+- client ops: (r:2, l:0, w:9)
+
+The settings above ensure that recovery won't get more than 5 requests
+per second serviced, even if it asks for more (see CURRENT
+IMPLEMENTATION NOTE below) and no other services are competing with it.
+But if clients start to issue a large number of I/O requests, they will
+not exhaust all the I/O resources either: 1 request per second is always
+allocated for recovery jobs as long as there are any such requests, so
+recovery jobs won't be starved even in a cluster under high load. In the
+meantime, the client ops can enjoy a larger portion of the I/O resource,
+because their weight is "9" while their competitor's is "1". Client ops
+are not clamped by the limit setting, so they can make use of all the
+resources if there is no recovery ongoing.
+
+CURRENT IMPLEMENTATION NOTE: the current implementation enforces the limit
+values. Therefore, if a service crosses the enforced limit, the op remains
+in the operation queue until the limit is restored.
+
+Subtleties of mClock
+````````````````````
+
+The reservation and limit values have a unit of requests per
+second. The weight, however, does not technically have a unit and the
+weights are relative to one another. So if one class of requests has a
+weight of 1 and another a weight of 9, then requests of the latter class
+should be executed at a 9 to 1 ratio relative to the first class.
+However, that will only happen once the reservations are met, and those
+values include the operations executed under the reservation phase.
+
+Even though the weights do not have units, one must be careful in
+choosing their values due to how the algorithm assigns weight tags to
+requests. If the weight is *W*, then for a given class of requests,
+the next one that comes in will have a weight tag of *1/W* plus the
+previous weight tag or the current time, whichever is larger. That
+means if *W* is sufficiently large and therefore *1/W* is sufficiently
+small, the calculated tag may never be assigned, as it will take the
+value of the current time. The ultimate lesson is that values for weight
+should not be too large. They should stay under the number of requests
+one expects to be serviced each second.
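+
+As a concrete illustration of the reservation, weight, and limit tags
+described above, the sketch below shows how the per-class
+``osd_mclock_scheduler_*`` options (documented later in this reference)
+could express the "recovery: (r:1, l:5, w:1), client ops: (r:2, l:0, w:9)"
+example. The values are illustrative only, not recommendations, and they
+only take effect when ``osd_op_queue`` is set to ``mclock_scheduler``:
+
+.. code-block:: ini
+
+    [osd]
+    # client ops: reservation 2, weight 9, effectively no limit
+    osd_mclock_scheduler_client_res = 2
+    osd_mclock_scheduler_client_wgt = 9
+    osd_mclock_scheduler_client_lim = 999999
+    # background recovery: reservation 1, weight 1, limit 5
+    osd_mclock_scheduler_background_recovery_res = 1
+    osd_mclock_scheduler_background_recovery_wgt = 1
+    osd_mclock_scheduler_background_recovery_lim = 5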
+ +Caveats +``````` + +There are some factors that can reduce the impact of the mClock op +queues within Ceph. First, requests to an OSD are sharded by their +placement group identifier. Each shard has its own mClock queue and +these queues neither interact nor share information among them. The +number of shards can be controlled with the configuration options +``osd_op_num_shards``, ``osd_op_num_shards_hdd``, and +``osd_op_num_shards_ssd``. A lower number of shards will increase the +impact of the mClock queues, but may have other deleterious effects. + +Second, requests are transferred from the operation queue to the +operation sequencer, in which they go through the phases of +execution. The operation queue is where mClock resides and mClock +determines the next op to transfer to the operation sequencer. The +number of operations allowed in the operation sequencer is a complex +issue. In general we want to keep enough operations in the sequencer +so it's always getting work done on some operations while it's waiting +for disk and network access to complete on other operations. On the +other hand, once an operation is transferred to the operation +sequencer, mClock no longer has control over it. Therefore to maximize +the impact of mClock, we want to keep as few operations in the +operation sequencer as possible. So we have an inherent tension. + +The configuration options that influence the number of operations in +the operation sequencer are ``bluestore_throttle_bytes``, +``bluestore_throttle_deferred_bytes``, +``bluestore_throttle_cost_per_io``, +``bluestore_throttle_cost_per_io_hdd``, and +``bluestore_throttle_cost_per_io_ssd``. + +A third factor that affects the impact of the mClock algorithm is that +we're using a distributed system, where requests are made to multiple +OSDs and each OSD has (can have) multiple shards. Yet we're currently +using the mClock algorithm, which is not distributed (note: dmClock is +the distributed version of mClock). + +Various organizations and individuals are currently experimenting with +mClock as it exists in this code base along with their modifications +to the code base. We hope you'll share you're experiences with your +mClock and dmClock experiments on the ``ceph-devel`` mailing list. + + +``osd_push_per_object_cost`` + +:Description: the overhead for serving a push op + +:Type: Unsigned Integer +:Default: 1000 + + +``osd_recovery_max_chunk`` + +:Description: the maximum total size of data chunks a recovery op can carry. + +:Type: Unsigned Integer +:Default: 8 MiB + + +``osd_mclock_scheduler_client_res`` + +:Description: IO proportion reserved for each client (default). + +:Type: Unsigned Integer +:Default: 1 + + +``osd_mclock_scheduler_client_wgt`` + +:Description: IO share for each client (default) over reservation. + +:Type: Unsigned Integer +:Default: 1 + + +``osd_mclock_scheduler_client_lim`` + +:Description: IO limit for each client (default) over reservation. + +:Type: Unsigned Integer +:Default: 999999 + + +``osd_mclock_scheduler_background_recovery_res`` + +:Description: IO proportion reserved for background recovery (default). + +:Type: Unsigned Integer +:Default: 1 + + +``osd_mclock_scheduler_background_recovery_wgt`` + +:Description: IO share for each background recovery over reservation. + +:Type: Unsigned Integer +:Default: 1 + + +``osd_mclock_scheduler_background_recovery_lim`` + +:Description: IO limit for background recovery over reservation. 
+ +:Type: Unsigned Integer +:Default: 999999 + + +``osd_mclock_scheduler_background_best_effort_res`` + +:Description: IO proportion reserved for background best_effort (default). + +:Type: Unsigned Integer +:Default: 1 + + +``osd_mclock_scheduler_background_best_effort_wgt`` + +:Description: IO share for each background best_effort over reservation. + +:Type: Unsigned Integer +:Default: 1 + + +``osd_mclock_scheduler_background_best_effort_lim`` + +:Description: IO limit for background best_effort over reservation. + +:Type: Unsigned Integer +:Default: 999999 + +.. _the dmClock algorithm: https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Gulati.pdf + + +.. index:: OSD; backfilling + +Backfilling +=========== + +When you add or remove Ceph OSD Daemons to a cluster, CRUSH will +rebalance the cluster by moving placement groups to or from Ceph OSDs +to restore balanced utilization. The process of migrating placement groups and +the objects they contain can reduce the cluster's operational performance +considerably. To maintain operational performance, Ceph performs this migration +with 'backfilling', which allows Ceph to set backfill operations to a lower +priority than requests to read or write data. + + +``osd_max_backfills`` + +:Description: The maximum number of backfills allowed to or from a single OSD. + Note that this is applied separately for read and write operations. +:Type: 64-bit Unsigned Integer +:Default: ``1`` + + +``osd_backfill_scan_min`` + +:Description: The minimum number of objects per backfill scan. + +:Type: 32-bit Integer +:Default: ``64`` + + +``osd_backfill_scan_max`` + +:Description: The maximum number of objects per backfill scan. + +:Type: 32-bit Integer +:Default: ``512`` + + +``osd_backfill_retry_interval`` + +:Description: The number of seconds to wait before retrying backfill requests. +:Type: Double +:Default: ``10.0`` + +.. index:: OSD; osdmap + +OSD Map +======= + +OSD maps reflect the OSD daemons operating in the cluster. Over time, the +number of map epochs increases. Ceph provides some settings to ensure that +Ceph performs well as the OSD map grows larger. + + +``osd_map_dedup`` + +:Description: Enable removing duplicates in the OSD map. +:Type: Boolean +:Default: ``true`` + + +``osd_map_cache_size`` + +:Description: The number of OSD maps to keep cached. +:Type: 32-bit Integer +:Default: ``50`` + + +``osd_map_message_max`` + +:Description: The maximum map entries allowed per MOSDMap message. +:Type: 32-bit Integer +:Default: ``40`` + + + +.. index:: OSD; recovery + +Recovery +======== + +When the cluster starts or when a Ceph OSD Daemon crashes and restarts, the OSD +begins peering with other Ceph OSD Daemons before writes can occur. See +`Monitoring OSDs and PGs`_ for details. + +If a Ceph OSD Daemon crashes and comes back online, usually it will be out of +sync with other Ceph OSD Daemons containing more recent versions of objects in +the placement groups. When this happens, the Ceph OSD Daemon goes into recovery +mode and seeks to get the latest copy of the data and bring its map back up to +date. Depending upon how long the Ceph OSD Daemon was down, the OSD's objects +and placement groups may be significantly out of date. Also, if a failure domain +went down (e.g., a rack), more than one Ceph OSD Daemon may come back online at +the same time. This can make the recovery process time consuming and resource +intensive. 
+ +To maintain operational performance, Ceph performs recovery with limitations on +the number recovery requests, threads and object chunk sizes which allows Ceph +perform well in a degraded state. + + +``osd_recovery_delay_start`` + +:Description: After peering completes, Ceph will delay for the specified number + of seconds before starting to recover RADOS objects. + +:Type: Float +:Default: ``0`` + + +``osd_recovery_max_active`` + +:Description: The number of active recovery requests per OSD at one time. More + requests will accelerate recovery, but the requests places an + increased load on the cluster. + + This value is only used if it is non-zero. Normally it + is ``0``, which means that the ``hdd`` or ``ssd`` values + (below) are used, depending on the type of the primary + device backing the OSD. + +:Type: 32-bit Integer +:Default: ``0`` + +``osd_recovery_max_active_hdd`` + +:Description: The number of active recovery requests per OSD at one time, if the + primary device is rotational. + +:Type: 32-bit Integer +:Default: ``3`` + +``osd_recovery_max_active_ssd`` + +:Description: The number of active recovery requests per OSD at one time, if the + primary device is non-rotational (i.e., an SSD). + +:Type: 32-bit Integer +:Default: ``10`` + + +``osd_recovery_max_chunk`` + +:Description: The maximum size of a recovered chunk of data to push. +:Type: 64-bit Unsigned Integer +:Default: ``8 << 20`` + + +``osd_recovery_max_single_start`` + +:Description: The maximum number of recovery operations per OSD that will be + newly started when an OSD is recovering. +:Type: 64-bit Unsigned Integer +:Default: ``1`` + + +``osd_recovery_thread_timeout`` + +:Description: The maximum time in seconds before timing out a recovery thread. +:Type: 32-bit Integer +:Default: ``30`` + + +``osd_recover_clone_overlap`` + +:Description: Preserves clone overlap during recovery. Should always be set + to ``true``. + +:Type: Boolean +:Default: ``true`` + + +``osd_recovery_sleep`` + +:Description: Time in seconds to sleep before the next recovery or backfill op. + Increasing this value will slow down recovery operation while + client operations will be less impacted. + +:Type: Float +:Default: ``0`` + + +``osd_recovery_sleep_hdd`` + +:Description: Time in seconds to sleep before next recovery or backfill op + for HDDs. + +:Type: Float +:Default: ``0.1`` + + +``osd_recovery_sleep_ssd`` + +:Description: Time in seconds to sleep before the next recovery or backfill op + for SSDs. + +:Type: Float +:Default: ``0`` + + +``osd_recovery_sleep_hybrid`` + +:Description: Time in seconds to sleep before the next recovery or backfill op + when OSD data is on HDD and OSD journal / WAL+DB is on SSD. + +:Type: Float +:Default: ``0.025`` + + +``osd_recovery_priority`` + +:Description: The default priority set for recovery work queue. Not + related to a pool's ``recovery_priority``. + +:Type: 32-bit Integer +:Default: ``5`` + + +Tiering +======= + +``osd_agent_max_ops`` + +:Description: The maximum number of simultaneous flushing ops per tiering agent + in the high speed mode. +:Type: 32-bit Integer +:Default: ``4`` + + +``osd_agent_max_low_ops`` + +:Description: The maximum number of simultaneous flushing ops per tiering agent + in the low speed mode. +:Type: 32-bit Integer +:Default: ``2`` + +See `cache target dirty high ratio`_ for when the tiering agent flushes dirty +objects within the high speed mode. 
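+
+For example, an illustrative (not prescriptive) way to halve the tiering
+agent's default flush concurrency in both modes would be the following;
+the defaults noted in the comments are the values documented above:
+
+.. code-block:: ini
+
+    [osd]
+    # high speed mode: the documented default is 4 simultaneous flush ops
+    osd_agent_max_ops = 2
+    # low speed mode: the documented default is 2 simultaneous flush ops
+    osd_agent_max_low_ops = 1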
+ +Miscellaneous +============= + + +``osd_snap_trim_thread_timeout`` + +:Description: The maximum time in seconds before timing out a snap trim thread. +:Type: 32-bit Integer +:Default: ``1*60*60`` + + +``osd_backlog_thread_timeout`` + +:Description: The maximum time in seconds before timing out a backlog thread. +:Type: 32-bit Integer +:Default: ``1*60*60`` + + +``osd_default_notify_timeout`` + +:Description: The OSD default notification timeout (in seconds). +:Type: 32-bit Unsigned Integer +:Default: ``30`` + + +``osd_check_for_log_corruption`` + +:Description: Check log files for corruption. Can be computationally expensive. +:Type: Boolean +:Default: ``false`` + + +``osd_remove_thread_timeout`` + +:Description: The maximum time in seconds before timing out a remove OSD thread. +:Type: 32-bit Integer +:Default: ``60*60`` + + +``osd_command_thread_timeout`` + +:Description: The maximum time in seconds before timing out a command thread. +:Type: 32-bit Integer +:Default: ``10*60`` + + +``osd_delete_sleep`` + +:Description: Time in seconds to sleep before the next removal transaction. This + throttles the PG deletion process. + +:Type: Float +:Default: ``0`` + + +``osd_delete_sleep_hdd`` + +:Description: Time in seconds to sleep before the next removal transaction + for HDDs. + +:Type: Float +:Default: ``5`` + + +``osd_delete_sleep_ssd`` + +:Description: Time in seconds to sleep before the next removal transaction + for SSDs. + +:Type: Float +:Default: ``0`` + + +``osd_delete_sleep_hybrid`` + +:Description: Time in seconds to sleep before the next removal transaction + when OSD data is on HDD and OSD journal or WAL+DB is on SSD. + +:Type: Float +:Default: ``1`` + + +``osd_command_max_records`` + +:Description: Limits the number of lost objects to return. +:Type: 32-bit Integer +:Default: ``256`` + + +``osd_fast_fail_on_connection_refused`` + +:Description: If this option is enabled, crashed OSDs are marked down + immediately by connected peers and MONs (assuming that the + crashed OSD host survives). Disable it to restore old + behavior, at the expense of possible long I/O stalls when + OSDs crash in the middle of I/O operations. +:Type: Boolean +:Default: ``true`` + + + +.. _pool: ../../operations/pools +.. _Configuring Monitor/OSD Interaction: ../mon-osd-interaction +.. _Monitoring OSDs and PGs: ../../operations/monitoring-osd-pg#peering +.. _Pool & PG Config Reference: ../pool-pg-config-ref +.. _Journal Config Reference: ../journal-ref +.. _cache target dirty high ratio: ../../operations/pools#cache-target-dirty-high-ratio +.. _mClock Config Reference: ../mclock-config-ref diff --git a/doc/rados/configuration/pool-pg-config-ref.rst b/doc/rados/configuration/pool-pg-config-ref.rst new file mode 100644 index 000000000..b4b5df478 --- /dev/null +++ b/doc/rados/configuration/pool-pg-config-ref.rst @@ -0,0 +1,282 @@ +====================================== + Pool, PG and CRUSH Config Reference +====================================== + +.. index:: pools; configuration + +When you create pools and set the number of placement groups (PGs) for each, Ceph +uses default values when you don't specifically override the defaults. **We +recommend** overriding some of the defaults. Specifically, we recommend setting +a pool's replica size and overriding the default number of placement groups. You +can specifically set these values when running `pool`_ commands. You can also +override the defaults by adding new ones in the ``[global]`` section of your +Ceph configuration file. + + +.. 
literalinclude:: pool-pg.conf + :language: ini + + +``mon_max_pool_pg_num`` + +:Description: The maximum number of placement groups per pool. +:Type: Integer +:Default: ``65536`` + + +``mon_pg_create_interval`` + +:Description: Number of seconds between PG creation in the same + Ceph OSD Daemon. + +:Type: Float +:Default: ``30.0`` + + +``mon_pg_stuck_threshold`` + +:Description: Number of seconds after which PGs can be considered as + being stuck. + +:Type: 32-bit Integer +:Default: ``300`` + +``mon_pg_min_inactive`` + +:Description: Raise ``HEALTH_ERR`` if the count of PGs that have been + inactive longer than the ``mon_pg_stuck_threshold`` exceeds this + setting. A non-positive number means disabled, never go into ERR. +:Type: Integer +:Default: ``1`` + + +``mon_pg_warn_min_per_osd`` + +:Description: Raise ``HEALTH_WARN`` if the average number + of PGs per ``in`` OSD is under this number. A non-positive number + disables this. +:Type: Integer +:Default: ``30`` + + +``mon_pg_warn_min_objects`` + +:Description: Do not warn if the total number of RADOS objects in cluster is below + this number +:Type: Integer +:Default: ``1000`` + + +``mon_pg_warn_min_pool_objects`` + +:Description: Do not warn on pools whose RADOS object count is below this number +:Type: Integer +:Default: ``1000`` + + +``mon_pg_check_down_all_threshold`` + +:Description: Percentage threshold of ``down`` OSDs above which we check all PGs + for stale ones. +:Type: Float +:Default: ``0.5`` + + +``mon_pg_warn_max_object_skew`` + +:Description: Raise ``HEALTH_WARN`` if the average RADOS object count per PG + of any pool is greater than ``mon_pg_warn_max_object_skew`` times + the average RADOS object count per PG of all pools. Zero or a non-positive + number disables this. Note that this option applies to ``ceph-mgr`` daemons. +:Type: Float +:Default: ``10`` + + +``mon_delta_reset_interval`` + +:Description: Seconds of inactivity before we reset the PG delta to 0. We keep + track of the delta of the used space of each pool, so, for + example, it would be easier for us to understand the progress of + recovery or the performance of cache tier. But if there's no + activity reported for a certain pool, we just reset the history of + deltas of that pool. +:Type: Integer +:Default: ``10`` + + +``mon_osd_max_op_age`` + +:Description: Maximum op age before we get concerned (make it a power of 2). + ``HEALTH_WARN`` will be raised if a request has been blocked longer + than this limit. +:Type: Float +:Default: ``32.0`` + + +``osd_pg_bits`` + +:Description: Placement group bits per Ceph OSD Daemon. +:Type: 32-bit Integer +:Default: ``6`` + + +``osd_pgp_bits`` + +:Description: The number of bits per Ceph OSD Daemon for PGPs. +:Type: 32-bit Integer +:Default: ``6`` + + +``osd_crush_chooseleaf_type`` + +:Description: The bucket type to use for ``chooseleaf`` in a CRUSH rule. Uses + ordinal rank rather than name. + +:Type: 32-bit Integer +:Default: ``1``. Typically a host containing one or more Ceph OSD Daemons. + + +``osd_crush_initial_weight`` + +:Description: The initial CRUSH weight for newly added OSDs. + +:Type: Double +:Default: ``the size of a newly added OSD in TB``. By default, the initial CRUSH + weight for a newly added OSD is set to its device size in TB. + See `Weighting Bucket Items`_ for details. + + +``osd_pool_default_crush_rule`` + +:Description: The default CRUSH rule to use when creating a replicated pool. +:Type: 8-bit Integer +:Default: ``-1``, which means "pick the rule with the lowest numerical ID and + use that". 
This is to make pool creation work in the absence of rule 0. + + +``osd_pool_erasure_code_stripe_unit`` + +:Description: Sets the default size, in bytes, of a chunk of an object + stripe for erasure coded pools. Every object of size S + will be stored as N stripes, with each data chunk + receiving ``stripe unit`` bytes. Each stripe of ``N * + stripe unit`` bytes will be encoded/decoded + individually. This option can is overridden by the + ``stripe_unit`` setting in an erasure code profile. + +:Type: Unsigned 32-bit Integer +:Default: ``4096`` + + +``osd_pool_default_size`` + +:Description: Sets the number of replicas for objects in the pool. The default + value is the same as + ``ceph osd pool set {pool-name} size {size}``. + +:Type: 32-bit Integer +:Default: ``3`` + + +``osd_pool_default_min_size`` + +:Description: Sets the minimum number of written replicas for objects in the + pool in order to acknowledge a write operation to the client. If + minimum is not met, Ceph will not acknowledge the write to the + client, **which may result in data loss**. This setting ensures + a minimum number of replicas when operating in ``degraded`` mode. + +:Type: 32-bit Integer +:Default: ``0``, which means no particular minimum. If ``0``, + minimum is ``size - (size / 2)``. + + +``osd_pool_default_pg_num`` + +:Description: The default number of placement groups for a pool. The default + value is the same as ``pg_num`` with ``mkpool``. + +:Type: 32-bit Integer +:Default: ``32`` + + +``osd_pool_default_pgp_num`` + +:Description: The default number of placement groups for placement for a pool. + The default value is the same as ``pgp_num`` with ``mkpool``. + PG and PGP should be equal (for now). + +:Type: 32-bit Integer +:Default: ``8`` + + +``osd_pool_default_flags`` + +:Description: The default flags for new pools. +:Type: 32-bit Integer +:Default: ``0`` + + +``osd_max_pgls`` + +:Description: The maximum number of placement groups to list. A client + requesting a large number can tie up the Ceph OSD Daemon. + +:Type: Unsigned 64-bit Integer +:Default: ``1024`` +:Note: Default should be fine. + + +``osd_min_pg_log_entries`` + +:Description: The minimum number of placement group logs to maintain + when trimming log files. + +:Type: 32-bit Int Unsigned +:Default: ``250`` + + +``osd_max_pg_log_entries`` + +:Description: The maximum number of placement group logs to maintain + when trimming log files. + +:Type: 32-bit Int Unsigned +:Default: ``10000`` + + +``osd_default_data_pool_replay_window`` + +:Description: The time (in seconds) for an OSD to wait for a client to replay + a request. + +:Type: 32-bit Integer +:Default: ``45`` + +``osd_max_pg_per_osd_hard_ratio`` + +:Description: The ratio of number of PGs per OSD allowed by the cluster before the + OSD refuses to create new PGs. An OSD stops creating new PGs if the number + of PGs it serves exceeds + ``osd_max_pg_per_osd_hard_ratio`` \* ``mon_max_pg_per_osd``. + +:Type: Float +:Default: ``2`` + +``osd_recovery_priority`` + +:Description: Priority of recovery in the work queue. + +:Type: Integer +:Default: ``5`` + +``osd_recovery_op_priority`` + +:Description: Default priority used for recovery operations if pool doesn't override. + +:Type: Integer +:Default: ``3`` + +.. _pool: ../../operations/pools +.. _Monitoring OSDs and PGs: ../../operations/monitoring-osd-pg#peering +.. 
_Weighting Bucket Items: ../../operations/crush-map#weightingbucketitems
diff --git a/doc/rados/configuration/pool-pg.conf b/doc/rados/configuration/pool-pg.conf
new file mode 100644
index 000000000..34c3af9f0
--- /dev/null
+++ b/doc/rados/configuration/pool-pg.conf
@@ -0,0 +1,21 @@
+[global]
+
+    # By default, Ceph makes 3 replicas of RADOS objects. If you want to
+    # maintain four copies of an object (a primary copy and three replica
+    # copies), reset the default value as shown in 'osd_pool_default_size'.
+    # If you want to allow Ceph to write a lesser number of copies in a degraded
+    # state, set 'osd_pool_default_min_size' to a number less than the
+    # 'osd_pool_default_size' value.
+
+    osd_pool_default_size = 3  # Write an object 3 times.
+    osd_pool_default_min_size = 2 # Allow writing two copies in a degraded state.
+
+    # Ensure you have a realistic number of placement groups. We recommend
+    # approximately 100 per OSD. E.g., total number of OSDs multiplied by 100
+    # divided by the number of replicas (i.e., osd_pool_default_size). So for
+    # 10 OSDs and osd_pool_default_size = 4, we'd recommend approximately
+    # (100 * 10) / 4 = 250.
+    # Always use the nearest power of 2.
+
+    osd_pool_default_pg_num = 256
+    osd_pool_default_pgp_num = 256
diff --git a/doc/rados/configuration/storage-devices.rst b/doc/rados/configuration/storage-devices.rst
new file mode 100644
index 000000000..8536d2cfa
--- /dev/null
+++ b/doc/rados/configuration/storage-devices.rst
@@ -0,0 +1,96 @@
+=================
+ Storage Devices
+=================
+
+There are two Ceph daemons that store data on devices:
+
+.. _rados_configuration_storage-devices_ceph_osd:
+
+* **Ceph OSDs** (Object Storage Daemons) store most of the data
+  in Ceph. Usually each OSD is backed by a single storage device.
+  This can be a traditional hard disk (HDD) or a solid state disk
+  (SSD). OSDs can also be backed by a combination of devices: for
+  example, an HDD for most data and an SSD (or partition of an
+  SSD) for some metadata. The number of OSDs in a cluster is
+  usually a function of the amount of data to be stored, the size
+  of each storage device, and the level and type of redundancy
+  specified (replication or erasure coding).
+* **Ceph Monitor** daemons manage critical cluster state. This
+  includes cluster membership and authentication information.
+  Small clusters require only a few gigabytes of storage to hold
+  the monitor database. In large clusters, however, the monitor
+  database can reach sizes of tens of gigabytes to hundreds of
+  gigabytes.
+* **Ceph Manager** daemons run alongside monitor daemons, providing
+  additional monitoring and providing interfaces to external
+  monitoring and management systems.
+
+
+OSD Back Ends
+=============
+
+There are two ways that OSDs manage the data they store. As of the Luminous
+12.2.z release, the default (and recommended) back end is *BlueStore*. Prior
+to the Luminous release, the default (and only) back end was *Filestore*.
+
+.. _rados_config_storage_devices_bluestore:
+
+BlueStore
+---------
+
+BlueStore is a special-purpose storage back end designed specifically for
+managing data on disk for Ceph OSD workloads. BlueStore's design is based on
+a decade of experience of supporting and managing Filestore OSDs. Key
+BlueStore features include:
+
+* Direct management of storage devices. BlueStore consumes raw block
+  devices or partitions. This avoids any intervening layers of
+  abstraction (such as local file systems like XFS) that may limit
+  performance or add complexity.
+* Metadata management with RocksDB. We embed RocksDB's key/value database
+  in order to manage internal metadata, such as the mapping from object
+  names to block locations on disk.
+* Full data and metadata checksumming. By default all data and
+  metadata written to BlueStore is protected by one or more
+  checksums. No data or metadata will be read from disk or returned
+  to the user without being verified.
+* Inline compression. Data written may be optionally compressed
+  before being written to disk.
+* Multi-device metadata tiering. BlueStore allows its internal
+  journal (write-ahead log) to be written to a separate, high-speed
+  device (like an SSD, NVMe, or NVDIMM) to increase performance. If
+  a significant amount of faster storage is available, internal
+  metadata can also be stored on the faster device.
+* Efficient copy-on-write. RBD and CephFS snapshots rely on a
+  copy-on-write *clone* mechanism that is implemented efficiently in
+  BlueStore. This results in efficient IO both for regular snapshots
+  and for erasure coded pools (which rely on cloning to implement
+  efficient two-phase commits).
+
+For more information, see :doc:`bluestore-config-ref` and :doc:`/rados/operations/bluestore-migration`.
+
+FileStore
+---------
+
+FileStore is the legacy approach to storing objects in Ceph. It
+relies on a standard file system (normally XFS) in combination with a
+key/value database (traditionally LevelDB, now RocksDB) for some
+metadata.
+
+FileStore is well-tested and widely used in production but suffers
+from many performance deficiencies due to its overall design and
+reliance on a traditional file system for storing object data.
+
+Although FileStore is generally capable of functioning on most
+POSIX-compatible file systems (including btrfs and ext4), we only
+recommend that XFS be used. Both btrfs and ext4 have known bugs and
+deficiencies and their use may lead to data loss. By default all Ceph
+provisioning tools will use XFS.
+
+For more information, see :doc:`filestore-config-ref`.