diff options
author | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-04-27 18:24:20 +0000 |
---|---|---|
committer | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-04-27 18:24:20 +0000 |
commit | 483eb2f56657e8e7f419ab1a4fab8dce9ade8609 (patch) | |
tree | e5d88d25d870d5dedacb6bbdbe2a966086a0a5cf /doc/rados/configuration | |
parent | Initial commit. (diff) | |
download | ceph-upstream.tar.xz ceph-upstream.zip |
Adding upstream version 14.2.21.upstream/14.2.21upstream
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'doc/rados/configuration')
19 files changed, 6267 insertions, 0 deletions
diff --git a/doc/rados/configuration/auth-config-ref.rst b/doc/rados/configuration/auth-config-ref.rst new file mode 100644 index 00000000..c6816f1e --- /dev/null +++ b/doc/rados/configuration/auth-config-ref.rst @@ -0,0 +1,378 @@ +======================== + Cephx Config Reference +======================== + +The ``cephx`` protocol is enabled by default. Cryptographic authentication has +some computational costs, though they should generally be quite low. If the +network environment connecting your client and server hosts is very safe and +you cannot afford authentication, you can turn it off. **This is not generally +recommended**. + +.. note:: If you disable authentication, you are at risk of a man-in-the-middle + attack altering your client/server messages, which could lead to disastrous + security effects. + +For creating users, see `User Management`_. For details on the architecture +of Cephx, see `Architecture - High Availability Authentication`_. + + +Deployment Scenarios +==================== + +There are two main scenarios for deploying a Ceph cluster, which impact +how you initially configure Cephx. Most first time Ceph users use +``ceph-deploy`` to create a cluster (easiest). For clusters using +other deployment tools (e.g., Chef, Juju, Puppet, etc.), you will need +to use the manual procedures or configure your deployment tool to +bootstrap your monitor(s). + +ceph-deploy +----------- + +When you deploy a cluster with ``ceph-deploy``, you do not have to bootstrap the +monitor manually or create the ``client.admin`` user or keyring. The steps you +execute in the `Storage Cluster Quick Start`_ will invoke ``ceph-deploy`` to do +that for you. + +When you execute ``ceph-deploy new {initial-monitor(s)}``, Ceph will create a +monitor keyring for you (only used to bootstrap monitors), and it will generate +an initial Ceph configuration file for you, which contains the following +authentication settings, indicating that Ceph enables authentication by +default:: + + auth_cluster_required = cephx + auth_service_required = cephx + auth_client_required = cephx + +When you execute ``ceph-deploy mon create-initial``, Ceph will bootstrap the +initial monitor(s), retrieve a ``ceph.client.admin.keyring`` file containing the +key for the ``client.admin`` user. Additionally, it will also retrieve keyrings +that give ``ceph-deploy`` and ``ceph-volume`` utilities the ability to prepare and +activate OSDs and metadata servers. + +When you execute ``ceph-deploy admin {node-name}`` (**note:** Ceph must be +installed first), you are pushing a Ceph configuration file and the +``ceph.client.admin.keyring`` to the ``/etc/ceph`` directory of the node. You +will be able to execute Ceph administrative functions as ``root`` on the command +line of that node. + + +Manual Deployment +----------------- + +When you deploy a cluster manually, you have to bootstrap the monitor manually +and create the ``client.admin`` user and keyring. To bootstrap monitors, follow +the steps in `Monitor Bootstrapping`_. The steps for monitor bootstrapping are +the logical steps you must perform when using third party deployment tools like +Chef, Puppet, Juju, etc. + + +Enabling/Disabling Cephx +======================== + +Enabling Cephx requires that you have deployed keys for your monitors, +OSDs and metadata servers. If you are simply toggling Cephx on / off, +you do not have to repeat the bootstrapping procedures. + + +Enabling Cephx +-------------- + +When ``cephx`` is enabled, Ceph will look for the keyring in the default search +path, which includes ``/etc/ceph/$cluster.$name.keyring``. You can override +this location by adding a ``keyring`` option in the ``[global]`` section of +your `Ceph configuration`_ file, but this is not recommended. + +Execute the following procedures to enable ``cephx`` on a cluster with +authentication disabled. If you (or your deployment utility) have already +generated the keys, you may skip the steps related to generating keys. + +#. Create a ``client.admin`` key, and save a copy of the key for your client + host:: + + ceph auth get-or-create client.admin mon 'allow *' mds 'allow *' mgr 'allow *' osd 'allow *' -o /etc/ceph/ceph.client.admin.keyring + + **Warning:** This will clobber any existing + ``/etc/ceph/client.admin.keyring`` file. Do not perform this step if a + deployment tool has already done it for you. Be careful! + +#. Create a keyring for your monitor cluster and generate a monitor + secret key. :: + + ceph-authtool --create-keyring /tmp/ceph.mon.keyring --gen-key -n mon. --cap mon 'allow *' + +#. Copy the monitor keyring into a ``ceph.mon.keyring`` file in every monitor's + ``mon data`` directory. For example, to copy it to ``mon.a`` in cluster ``ceph``, + use the following:: + + cp /tmp/ceph.mon.keyring /var/lib/ceph/mon/ceph-a/keyring + +#. Generate a secret key for every MGR, where ``{$id}`` is the MGR letter:: + + ceph auth get-or-create mgr.{$id} mon 'allow profile mgr' mds 'allow *' osd 'allow *' -o /var/lib/ceph/mgr/ceph-{$id}/keyring + +#. Generate a secret key for every OSD, where ``{$id}`` is the OSD number:: + + ceph auth get-or-create osd.{$id} mon 'allow rwx' osd 'allow *' -o /var/lib/ceph/osd/ceph-{$id}/keyring + +#. Generate a secret key for every MDS, where ``{$id}`` is the MDS letter:: + + ceph auth get-or-create mds.{$id} mon 'allow rwx' osd 'allow *' mds 'allow *' mgr 'allow profile mds' -o /var/lib/ceph/mds/ceph-{$id}/keyring + +#. Enable ``cephx`` authentication by setting the following options in the + ``[global]`` section of your `Ceph configuration`_ file:: + + auth cluster required = cephx + auth service required = cephx + auth client required = cephx + + +#. Start or restart the Ceph cluster. See `Operating a Cluster`_ for details. + +For details on bootstrapping a monitor manually, see `Manual Deployment`_. + + + +Disabling Cephx +--------------- + +The following procedure describes how to disable Cephx. If your cluster +environment is relatively safe, you can offset the computation expense of +running authentication. **We do not recommend it.** However, it may be easier +during setup and/or troubleshooting to temporarily disable authentication. + +#. Disable ``cephx`` authentication by setting the following options in the + ``[global]`` section of your `Ceph configuration`_ file:: + + auth cluster required = none + auth service required = none + auth client required = none + + +#. Start or restart the Ceph cluster. See `Operating a Cluster`_ for details. + + +Configuration Settings +====================== + +Enablement +---------- + + +``auth cluster required`` + +:Description: If enabled, the Ceph Storage Cluster daemons (i.e., ``ceph-mon``, + ``ceph-osd``, ``ceph-mds`` and ``ceph-mgr``) must authenticate with + each other. Valid settings are ``cephx`` or ``none``. + +:Type: String +:Required: No +:Default: ``cephx``. + + +``auth service required`` + +:Description: If enabled, the Ceph Storage Cluster daemons require Ceph Clients + to authenticate with the Ceph Storage Cluster in order to access + Ceph services. Valid settings are ``cephx`` or ``none``. + +:Type: String +:Required: No +:Default: ``cephx``. + + +``auth client required`` + +:Description: If enabled, the Ceph Client requires the Ceph Storage Cluster to + authenticate with the Ceph Client. Valid settings are ``cephx`` + or ``none``. + +:Type: String +:Required: No +:Default: ``cephx``. + + +.. index:: keys; keyring + +Keys +---- + +When you run Ceph with authentication enabled, ``ceph`` administrative commands +and Ceph Clients require authentication keys to access the Ceph Storage Cluster. + +The most common way to provide these keys to the ``ceph`` administrative +commands and clients is to include a Ceph keyring under the ``/etc/ceph`` +directory. For Cuttlefish and later releases using ``ceph-deploy``, the filename +is usually ``ceph.client.admin.keyring`` (or ``$cluster.client.admin.keyring``). +If you include the keyring under the ``/etc/ceph`` directory, you don't need to +specify a ``keyring`` entry in your Ceph configuration file. + +We recommend copying the Ceph Storage Cluster's keyring file to nodes where you +will run administrative commands, because it contains the ``client.admin`` key. + +You may use ``ceph-deploy admin`` to perform this task. See `Create an Admin +Host`_ for details. To perform this step manually, execute the following:: + + sudo scp {user}@{ceph-cluster-host}:/etc/ceph/ceph.client.admin.keyring /etc/ceph/ceph.client.admin.keyring + +.. tip:: Ensure the ``ceph.keyring`` file has appropriate permissions set + (e.g., ``chmod 644``) on your client machine. + +You may specify the key itself in the Ceph configuration file using the ``key`` +setting (not recommended), or a path to a keyfile using the ``keyfile`` setting. + + +``keyring`` + +:Description: The path to the keyring file. +:Type: String +:Required: No +:Default: ``/etc/ceph/$cluster.$name.keyring,/etc/ceph/$cluster.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin`` + + +``keyfile`` + +:Description: The path to a key file (i.e,. a file containing only the key). +:Type: String +:Required: No +:Default: None + + +``key`` + +:Description: The key (i.e., the text string of the key itself). Not recommended. +:Type: String +:Required: No +:Default: None + + +Daemon Keyrings +--------------- + +Administrative users or deployment tools (e.g., ``ceph-deploy``) may generate +daemon keyrings in the same way as generating user keyrings. By default, Ceph +stores daemons keyrings inside their data directory. The default keyring +locations, and the capabilities necessary for the daemon to function, are shown +below. + +``ceph-mon`` + +:Location: ``$mon_data/keyring`` +:Capabilities: ``mon 'allow *'`` + +``ceph-osd`` + +:Location: ``$osd_data/keyring`` +:Capabilities: ``mgr 'allow profile osd' mon 'allow profile osd' osd 'allow *'`` + +``ceph-mds`` + +:Location: ``$mds_data/keyring`` +:Capabilities: ``mds 'allow' mgr 'allow profile mds' mon 'allow profile mds' osd 'allow rwx'`` + +``ceph-mgr`` + +:Location: ``$mgr_data/keyring`` +:Capabilities: ``mon 'allow profile mgr' mds 'allow *' osd 'allow *'`` + +``radosgw`` + +:Location: ``$rgw_data/keyring`` +:Capabilities: ``mon 'allow rwx' osd 'allow rwx'`` + + +.. note:: The monitor keyring (i.e., ``mon.``) contains a key but no + capabilities, and is not part of the cluster ``auth`` database. + +The daemon data directory locations default to directories of the form:: + + /var/lib/ceph/$type/$cluster-$id + +For example, ``osd.12`` would be:: + + /var/lib/ceph/osd/ceph-12 + +You can override these locations, but it is not recommended. + + +.. index:: signatures + +Signatures +---------- + +Ceph performs a signature check that provides some limited protection +against messages being tampered with in flight (e.g., by a "man in the +middle" attack). + +Like other parts of Ceph authentication, Ceph provides fine-grained control so +you can enable/disable signatures for service messages between the client and +Ceph, and you can enable/disable signatures for messages between Ceph daemons. + +Note that even with signatures enabled data is not encrypted in +flight. + +``cephx require signatures`` + +:Description: If set to ``true``, Ceph requires signatures on all message + traffic between the Ceph Client and the Ceph Storage Cluster, and + between daemons comprising the Ceph Storage Cluster. + + Ceph Argonaut and Linux kernel versions prior to 3.19 do + not support signatures; if such clients are in use this + option can be turned off to allow them to connect. + +:Type: Boolean +:Required: No +:Default: ``false`` + + +``cephx cluster require signatures`` + +:Description: If set to ``true``, Ceph requires signatures on all message + traffic between Ceph daemons comprising the Ceph Storage Cluster. + +:Type: Boolean +:Required: No +:Default: ``false`` + + +``cephx service require signatures`` + +:Description: If set to ``true``, Ceph requires signatures on all message + traffic between Ceph Clients and the Ceph Storage Cluster. + +:Type: Boolean +:Required: No +:Default: ``false`` + + +``cephx sign messages`` + +:Description: If the Ceph version supports message signing, Ceph will sign + all messages so they are more difficult to spoof. + +:Type: Boolean +:Default: ``true`` + + +Time to Live +------------ + +``auth service ticket ttl`` + +:Description: When the Ceph Storage Cluster sends a Ceph Client a ticket for + authentication, the Ceph Storage Cluster assigns the ticket a + time to live. + +:Type: Double +:Default: ``60*60`` + + +.. _Storage Cluster Quick Start: ../../../start/quick-ceph-deploy/ +.. _Monitor Bootstrapping: ../../../install/manual-deployment#monitor-bootstrapping +.. _Operating a Cluster: ../../operations/operating +.. _Manual Deployment: ../../../install/manual-deployment +.. _Ceph configuration: ../ceph-conf +.. _Create an Admin Host: ../../deployment/ceph-deploy-admin +.. _Architecture - High Availability Authentication: ../../../architecture#high-availability-authentication +.. _User Management: ../../operations/user-management diff --git a/doc/rados/configuration/bluestore-config-ref.rst b/doc/rados/configuration/bluestore-config-ref.rst new file mode 100644 index 00000000..7d1c50c9 --- /dev/null +++ b/doc/rados/configuration/bluestore-config-ref.rst @@ -0,0 +1,498 @@ +========================== +BlueStore Config Reference +========================== + +Devices +======= + +BlueStore manages either one, two, or (in certain cases) three storage +devices. + +In the simplest case, BlueStore consumes a single (primary) storage device. +The storage device is normally used as a whole, occupying the full device that +is managed directly by BlueStore. This *primary device* is normally identified +by a ``block`` symlink in the data directory. + +The data directory is a ``tmpfs`` mount which gets populated (at boot time, or +when ``ceph-volume`` activates it) with all the common OSD files that hold +information about the OSD, like: its identifier, which cluster it belongs to, +and its private keyring. + +It is also possible to deploy BlueStore across two additional devices: + +* A *WAL device* (identified as ``block.wal`` in the data directory) can be + used for BlueStore's internal journal or write-ahead log. It is only useful + to use a WAL device if the device is faster than the primary device (e.g., + when it is on an SSD and the primary device is an HDD). +* A *DB device* (identified as ``block.db`` in the data directory) can be used + for storing BlueStore's internal metadata. BlueStore (or rather, the + embedded RocksDB) will put as much metadata as it can on the DB device to + improve performance. If the DB device fills up, metadata will spill back + onto the primary device (where it would have been otherwise). Again, it is + only helpful to provision a DB device if it is faster than the primary + device. + +If there is only a small amount of fast storage available (e.g., less +than a gigabyte), we recommend using it as a WAL device. If there is +more, provisioning a DB device makes more sense. The BlueStore +journal will always be placed on the fastest device available, so +using a DB device will provide the same benefit that the WAL device +would while *also* allowing additional metadata to be stored there (if +it will fit). + +A single-device BlueStore OSD can be provisioned with:: + + ceph-volume lvm prepare --bluestore --data <device> + +To specify a WAL device and/or DB device, :: + + ceph-volume lvm prepare --bluestore --data <device> --block.wal <wal-device> --block.db <db-device> + +.. note:: --data can be a Logical Volume using the vg/lv notation. Other + devices can be existing logical volumes or GPT partitions + +Provisioning strategies +----------------------- +Although there are multiple ways to deploy a Bluestore OSD (unlike Filestore +which had 1) here are two common use cases that should help clarify the +initial deployment strategy: + +.. _bluestore-single-type-device-config: + +**block (data) only** +^^^^^^^^^^^^^^^^^^^^^ +If all the devices are the same type, for example all are spinning drives, and +there are no fast devices to combine these, it makes sense to just deploy with +block only and not try to separate ``block.db`` or ``block.wal``. The +:ref:`ceph-volume-lvm` call for a single ``/dev/sda`` device would look like:: + + ceph-volume lvm create --bluestore --data /dev/sda + +If logical volumes have already been created for each device (1 LV using 100% +of the device), then the :ref:`ceph-volume-lvm` call for an lv named +``ceph-vg/block-lv`` would look like:: + + ceph-volume lvm create --bluestore --data ceph-vg/block-lv + +.. _bluestore-mixed-device-config: + +**block and block.db** +^^^^^^^^^^^^^^^^^^^^^^ +If there is a mix of fast and slow devices (spinning and solid state), +it is recommended to place ``block.db`` on the faster device while ``block`` +(data) lives on the slower (spinning drive). Sizing for ``block.db`` should be +as large as possible to avoid performance penalties otherwise. The +``ceph-volume`` tool is currently not able to create these automatically, so +the volume groups and logical volumes need to be created manually. + +For the below example, lets assume 4 spinning drives (sda, sdb, sdc, and sdd) +and 1 solid state drive (sdx). First create the volume groups:: + + $ vgcreate ceph-block-0 /dev/sda + $ vgcreate ceph-block-1 /dev/sdb + $ vgcreate ceph-block-2 /dev/sdc + $ vgcreate ceph-block-3 /dev/sdd + +Now create the logical volumes for ``block``:: + + $ lvcreate -l 100%FREE -n block-0 ceph-block-0 + $ lvcreate -l 100%FREE -n block-1 ceph-block-1 + $ lvcreate -l 100%FREE -n block-2 ceph-block-2 + $ lvcreate -l 100%FREE -n block-3 ceph-block-3 + +We are creating 4 OSDs for the four slow spinning devices, so assuming a 200GB +SSD in ``/dev/sdx`` we will create 4 logical volumes, each of 50GB:: + + $ vgcreate ceph-db-0 /dev/sdx + $ lvcreate -L 50GB -n db-0 ceph-db-0 + $ lvcreate -L 50GB -n db-1 ceph-db-0 + $ lvcreate -L 50GB -n db-2 ceph-db-0 + $ lvcreate -L 50GB -n db-3 ceph-db-0 + +Finally, create the 4 OSDs with ``ceph-volume``:: + + $ ceph-volume lvm create --bluestore --data ceph-block-0/block-0 --block.db ceph-db-0/db-0 + $ ceph-volume lvm create --bluestore --data ceph-block-1/block-1 --block.db ceph-db-0/db-1 + $ ceph-volume lvm create --bluestore --data ceph-block-2/block-2 --block.db ceph-db-0/db-2 + $ ceph-volume lvm create --bluestore --data ceph-block-3/block-3 --block.db ceph-db-0/db-3 + +These operations should end up creating 4 OSDs, with ``block`` on the slower +spinning drives and a 50GB logical volume for each coming from the solid state +drive. + +Sizing +====== +When using a :ref:`mixed spinning and solid drive setup +<bluestore-mixed-device-config>` it is important to make a large-enough +``block.db`` logical volume for Bluestore. Generally, ``block.db`` should have +*as large as possible* logical volumes. + +It is recommended that the ``block.db`` size isn't smaller than 4% of +``block``. For example, if the ``block`` size is 1TB, then ``block.db`` +shouldn't be less than 40GB. + +If *not* using a mix of fast and slow devices, it isn't required to create +separate logical volumes for ``block.db`` (or ``block.wal``). Bluestore will +automatically manage these within the space of ``block``. + + +Automatic Cache Sizing +====================== + +Bluestore can be configured to automatically resize it's caches when tc_malloc +is configured as the memory allocator and the ``bluestore_cache_autotune`` +setting is enabled. This option is currently enabled by default. Bluestore +will attempt to keep OSD heap memory usage under a designated target size via +the ``osd_memory_target`` configuration option. This is a best effort +algorithm and caches will not shrink smaller than the amount specified by +``osd_memory_cache_min``. Cache ratios will be chosen based on a hierarchy +of priorities. If priority information is not availabe, the +``bluestore_cache_meta_ratio`` and ``bluestore_cache_kv_ratio`` options are +used as fallbacks. + +``bluestore_cache_autotune`` + +:Description: Automatically tune the ratios assigned to different bluestore caches while respecting minimum values. +:Type: Boolean +:Required: Yes +:Default: ``True`` + +``osd_memory_target`` + +:Description: When tcmalloc is available and cache autotuning is enabled, try to keep this many bytes mapped in memory. Note: This may not exactly match the RSS memory usage of the process. While the total amount of heap memory mapped by the process should generally stay close to this target, there is no guarantee that the kernel will actually reclaim memory that has been unmapped. During initial developement, it was found that some kernels result in the OSD's RSS Memory exceeding the mapped memory by up to 20%. It is hypothesised however, that the kernel generally may be more aggressive about reclaiming unmapped memory when there is a high amount of memory pressure. Your mileage may vary. +:Type: Unsigned Integer +:Required: Yes +:Default: ``4294967296`` + +``bluestore_cache_autotune_chunk_size`` + +:Description: The chunk size in bytes to allocate to caches when cache autotune is enabled. When the autotuner assigns memory to different caches, it will allocate memory in chunks. This is done to avoid evictions when there are minor fluctuations in the heap size or autotuned cache ratios. +:Type: Unsigned Integer +:Required: No +:Default: ``33554432`` + +``bluestore_cache_autotune_interval`` + +:Description: The number of seconds to wait between rebalances when cache autotune is enabled. This setting changes how quickly the ratios of the difference caches are recomputed. Note: Setting the interval too small can result in high CPU usage and lower performance. +:Type: Float +:Required: No +:Default: ``5`` + +``osd_memory_base`` + +:Description: When tcmalloc and cache autotuning is enabled, estimate the minimum amount of memory in bytes the OSD will need. This is used to help the autotuner estimate the expected aggregate memory consumption of the caches. +:Type: Unsigned Interger +:Required: No +:Default: ``805306368`` + +``osd_memory_expected_fragmentation`` + +:Description: When tcmalloc and cache autotuning is enabled, estimate the percent of memory fragmentation. This is used to help the autotuner estimate the expected aggregate memory consumption of the caches. +:Type: Float +:Required: No +:Default: ``0.15`` + +``osd_memory_cache_min`` + +:Description: When tcmalloc and cache autotuning is enabled, set the minimum amount of memory used for caches. Note: Setting this value too low can result in significant cache thrashing. +:Type: Unsigned Integer +:Required: No +:Default: ``134217728`` + +``osd_memory_cache_resize_interval`` + +:Description: When tcmalloc and cache autotuning is enabled, wait this many seconds between resizing caches. This setting changes the total amount of memory available for bluestore to use for caching. Note: Setting the interval too small can result in memory allocator thrashing and lower performance. +:Type: Float +:Required: No +:Default: ``1`` + + +Manual Cache Sizing +=================== + +The amount of memory consumed by each OSD for BlueStore's cache is +determined by the ``bluestore_cache_size`` configuration option. If +that config option is not set (i.e., remains at 0), there is a +different default value that is used depending on whether an HDD or +SSD is used for the primary device (set by the +``bluestore_cache_size_ssd`` and ``bluestore_cache_size_hdd`` config +options). + +BlueStore and the rest of the Ceph OSD does the best it can currently +to stick to the budgeted memory. Note that on top of the configured +cache size, there is also memory consumed by the OSD itself, and +generally some overhead due to memory fragmentation and other +allocator overhead. + +The configured cache memory budget can be used in a few different ways: + +* Key/Value metadata (i.e., RocksDB's internal cache) +* BlueStore metadata +* BlueStore data (i.e., recently read or written object data) + +Cache memory usage is governed by the following options: +``bluestore_cache_meta_ratio`` and ``bluestore_cache_kv_ratio``. +The fraction of the cache devoted to data +is governed by the effective bluestore cache size (depending on +``bluestore_cache_size[_ssd|_hdd]`` settings and the device class of the primary +device) as well as the meta and kv ratios. +The data fraction can be calculated by +``<effective_cache_size> * (1 - bluestore_cache_meta_ratio - bluestore_cache_kv_ratio)`` + +``bluestore_cache_size`` + +:Description: The amount of memory BlueStore will use for its cache. If zero, ``bluestore_cache_size_hdd`` or ``bluestore_cache_size_ssd`` will be used instead. +:Type: Unsigned Integer +:Required: Yes +:Default: ``0`` + +``bluestore_cache_size_hdd`` + +:Description: The default amount of memory BlueStore will use for its cache when backed by an HDD. +:Type: Unsigned Integer +:Required: Yes +:Default: ``1 * 1024 * 1024 * 1024`` (1 GB) + +``bluestore_cache_size_ssd`` + +:Description: The default amount of memory BlueStore will use for its cache when backed by an SSD. +:Type: Unsigned Integer +:Required: Yes +:Default: ``3 * 1024 * 1024 * 1024`` (3 GB) + +``bluestore_cache_meta_ratio`` + +:Description: The ratio of cache devoted to metadata. +:Type: Floating point +:Required: Yes +:Default: ``.4`` + +``bluestore_cache_kv_ratio`` + +:Description: The ratio of cache devoted to key/value data (rocksdb). +:Type: Floating point +:Required: Yes +:Default: ``.4`` + +``bluestore_cache_kv_max`` + +:Description: The maximum amount of cache devoted to key/value data (rocksdb). +:Type: Unsigned Integer +:Required: Yes +:Default: ``512 * 1024*1024`` (512 MB) + + +Checksums +========= + +BlueStore checksums all metadata and data written to disk. Metadata +checksumming is handled by RocksDB and uses `crc32c`. Data +checksumming is done by BlueStore and can make use of `crc32c`, +`xxhash32`, or `xxhash64`. The default is `crc32c` and should be +suitable for most purposes. + +Full data checksumming does increase the amount of metadata that +BlueStore must store and manage. When possible, e.g., when clients +hint that data is written and read sequentially, BlueStore will +checksum larger blocks, but in many cases it must store a checksum +value (usually 4 bytes) for every 4 kilobyte block of data. + +It is possible to use a smaller checksum value by truncating the +checksum to two or one byte, reducing the metadata overhead. The +trade-off is that the probability that a random error will not be +detected is higher with a smaller checksum, going from about one in +four billion with a 32-bit (4 byte) checksum to one in 65,536 for a +16-bit (2 byte) checksum or one in 256 for an 8-bit (1 byte) checksum. +The smaller checksum values can be used by selecting `crc32c_16` or +`crc32c_8` as the checksum algorithm. + +The *checksum algorithm* can be set either via a per-pool +``csum_type`` property or the global config option. For example, :: + + ceph osd pool set <pool-name> csum_type <algorithm> + +``bluestore_csum_type`` + +:Description: The default checksum algorithm to use. +:Type: String +:Required: Yes +:Valid Settings: ``none``, ``crc32c``, ``crc32c_16``, ``crc32c_8``, ``xxhash32``, ``xxhash64`` +:Default: ``crc32c`` + + +Inline Compression +================== + +BlueStore supports inline compression using `snappy`, `zlib`, or +`lz4`. Please note that the `lz4` compression plugin is not +distributed in the official release. + +Whether data in BlueStore is compressed is determined by a combination +of the *compression mode* and any hints associated with a write +operation. The modes are: + +* **none**: Never compress data. +* **passive**: Do not compress data unless the write operation has a + *compressible* hint set. +* **aggressive**: Compress data unless the write operation has an + *incompressible* hint set. +* **force**: Try to compress data no matter what. + +For more information about the *compressible* and *incompressible* IO +hints, see :c:func:`rados_set_alloc_hint`. + +Note that regardless of the mode, if the size of the data chunk is not +reduced sufficiently it will not be used and the original +(uncompressed) data will be stored. For example, if the ``bluestore +compression required ratio`` is set to ``.7`` then the compressed data +must be 70% of the size of the original (or smaller). + +The *compression mode*, *compression algorithm*, *compression required +ratio*, *min blob size*, and *max blob size* can be set either via a +per-pool property or a global config option. Pool properties can be +set with:: + + ceph osd pool set <pool-name> compression_algorithm <algorithm> + ceph osd pool set <pool-name> compression_mode <mode> + ceph osd pool set <pool-name> compression_required_ratio <ratio> + ceph osd pool set <pool-name> compression_min_blob_size <size> + ceph osd pool set <pool-name> compression_max_blob_size <size> + +``bluestore compression algorithm`` + +:Description: The default compressor to use (if any) if the per-pool property + ``compression_algorithm`` is not set. Note that zstd is *not* + recommended for bluestore due to high CPU overhead when + compressing small amounts of data. +:Type: String +:Required: No +:Valid Settings: ``lz4``, ``snappy``, ``zlib``, ``zstd`` +:Default: ``snappy`` + +``bluestore compression mode`` + +:Description: The default policy for using compression if the per-pool property + ``compression_mode`` is not set. ``none`` means never use + compression. ``passive`` means use compression when + :c:func:`clients hint <rados_set_alloc_hint>` that data is + compressible. ``aggressive`` means use compression unless + clients hint that data is not compressible. ``force`` means use + compression under all circumstances even if the clients hint that + the data is not compressible. +:Type: String +:Required: No +:Valid Settings: ``none``, ``passive``, ``aggressive``, ``force`` +:Default: ``none`` + +``bluestore compression required ratio`` + +:Description: The ratio of the size of the data chunk after + compression relative to the original size must be at + least this small in order to store the compressed + version. + +:Type: Floating point +:Required: No +:Default: .875 + +``bluestore compression min blob size`` + +:Description: Chunks smaller than this are never compressed. + The per-pool property ``compression_min_blob_size`` overrides + this setting. + +:Type: Unsigned Integer +:Required: No +:Default: 0 + +``bluestore compression min blob size hdd`` + +:Description: Default value of ``bluestore compression min blob size`` + for rotational media. + +:Type: Unsigned Integer +:Required: No +:Default: 128K + +``bluestore compression min blob size ssd`` + +:Description: Default value of ``bluestore compression min blob size`` + for non-rotational (solid state) media. + +:Type: Unsigned Integer +:Required: No +:Default: 8K + +``bluestore compression max blob size`` + +:Description: Chunks larger than this are broken into smaller blobs sizing + ``bluestore compression max blob size`` before being compressed. + The per-pool property ``compression_max_blob_size`` overrides + this setting. + +:Type: Unsigned Integer +:Required: No +:Default: 0 + +``bluestore compression max blob size hdd`` + +:Description: Default value of ``bluestore compression max blob size`` + for rotational media. + +:Type: Unsigned Integer +:Required: No +:Default: 512K + +``bluestore compression max blob size ssd`` + +:Description: Default value of ``bluestore compression max blob size`` + for non-rotational (solid state) media. + +:Type: Unsigned Integer +:Required: No +:Default: 64K + +SPDK Usage +================== + +If you want to use SPDK driver for NVME SSD, you need to ready your system. +Please refer to `SPDK document`__ for more details. + +.. __: http://www.spdk.io/doc/getting_started.html#getting_started_examples + +SPDK offers a script to configure the device automatically. Users can run the +script as root:: + + $ sudo src/spdk/scripts/setup.sh + +Then you need to specify NVMe device's device selector here with "spdk:" prefix for +``bluestore_block_path``. + +For example, users can find the device selector of an Intel PCIe SSD with:: + + $ lspci -mm -n -D -d 8086:0953 + +The device selector always has the form of ``DDDD:BB:DD.FF`` or ``DDDD.BB.DD.FF``. + +and then set:: + + bluestore block path = spdk:0000:01:00.0 + +Where ``0000:01:00.0`` is the device selector found in the output of ``lspci`` +command above. + +If you want to run multiple SPDK instances per node, you must specify the +amount of dpdk memory size in MB each instance will use, to make sure each +instance uses its own dpdk memory + +In most cases, we only need one device to serve as data, db, db wal purposes. +We need to make sure configurations below to make sure all IOs issued under +SPDK.:: + + bluestore_block_db_path = "" + bluestore_block_db_size = 0 + bluestore_block_wal_path = "" + bluestore_block_wal_size = 0 + +Otherwise, the current implementation will setup symbol file to kernel +filesystem location and uses kernel driver to issue DB/WAL IO. diff --git a/doc/rados/configuration/ceph-conf.rst b/doc/rados/configuration/ceph-conf.rst new file mode 100644 index 00000000..24c9afa4 --- /dev/null +++ b/doc/rados/configuration/ceph-conf.rst @@ -0,0 +1,496 @@ +.. _configuring-ceph: + +================== + Configuring Ceph +================== + +When you start the Ceph service, the initialization process activates a series +of daemons that run in the background. A :term:`Ceph Storage Cluster` runs +three types of daemons: + +- :term:`Ceph Monitor` (``ceph-mon``) +- :term:`Ceph Manager` (``ceph-mgr``) +- :term:`Ceph OSD Daemon` (``ceph-osd``) + +Ceph Storage Clusters that support the :term:`Ceph Filesystem` run at +least one :term:`Ceph Metadata Server` (``ceph-mds``). Clusters that +support :term:`Ceph Object Storage` run Ceph Gateway daemons +(``radosgw``). + +Each daemon has a series of configuration options, each of which has a +default values. You may adjust the behavior of the system by changing these +configuration options. + +Option names +============ + +All Ceph configuration options have a unique name consisting of words +formed with lower-case characters and connected with underscore +(``_``) characters. + +When option names are specified on the command line, either underscore +(``_``) or dash (``-``) characters can be used interchangeable (e.g., +``--mon-host`` is equivalent to ``--mon_host``). + +When option names appear in configuration files, spaces can also be +used in place of underscore or dash. + +Config sources +============== + +Each Ceph daemon, process, and library will pull its configuration +from several sources, listed below. Sources later in the list will +override those earlier in the list when both are present. + +- the compiled-in default value +- the monitor cluster's centralized configuration database +- a configuration file stored on the local host +- environment variables +- command line arguments +- runtime overrides set by an administrator + +One of the first things a Ceph process does on startup is parse the +configuration options provided via the command line, environment, and +local configuration file. The process will then contact the monitor +cluster to retrieve configuration stored centrally for the entire +cluster. Once a complete view of the configuration is available, the +daemon or process startup will proceed. + +Bootstrap options +----------------- + +Because some configuration options affect the process's ability to +contact the monitors, authenticate, and retrieve the cluster-stored +configuration, they may need to be stored locally on the node and set +in a local configuration file. These options include: + + - ``mon_host``, the list of monitors for the cluster + - ``mon_host_override``, the list of monitors for the cluster to + **initially** contact when beginning a new instance of communication with the + Ceph cluster. This overrides the known monitor list derived from MonMap + updates sent to older Ceph instances (like librados cluster handles). It is + expected this option is primarily useful for debugging. + - ``mon_dns_serv_name`` (default: `ceph-mon`), the name of the DNS + SRV record to check to identify the cluster monitors via DNS + - ``mon_data``, ``osd_data``, ``mds_data``, ``mgr_data``, and + similar options that define which local directory the daemon + stores its data in. + - ``keyring``, ``keyfile``, and/or ``key``, which can be used to + specify the authentication credential to use to authenticate with + the monitor. Note that in most cases the default keyring location + is in the data directory specified above. + +In the vast majority of cases the default values of these are +appropriate, with the exception of the ``mon_host`` option that +identifies the addresses of the cluster's monitors. When DNS is used +to identify monitors a local ceph configuration file can be avoided +entirely. + +Skipping monitor config +----------------------- + +Any process may be passed the option ``--no-mon-config`` to skip the +step that retrieves configuration from the cluster monitors. This is +useful in cases where configuration is managed entirely via +configuration files or where the monitor cluster is currently down but +some maintenance activity needs to be done. + + +.. _ceph-conf-file: + + +Configuration sections +====================== + +Any given process or daemon has a single value for each configuration +option. However, values for an option may vary across different +daemon types even daemons of the same type. Ceph options that are +stored in the monitor configuration database or in local configuration +files are grouped into sections to indicate which daemons or clients +they apply to. + +These sections include: + +``global`` + +:Description: Settings under ``global`` affect all daemons and clients + in a Ceph Storage Cluster. + +:Example: ``log_file = /var/log/ceph/$cluster-$type.$id.log`` + +``mon`` + +:Description: Settings under ``mon`` affect all ``ceph-mon`` daemons in + the Ceph Storage Cluster, and override the same setting in + ``global``. + +:Example: ``mon_cluster_log_to_syslog = true`` + + +``mgr`` + +:Description: Settings in the ``mgr`` section affect all ``ceph-mgr`` daemons in + the Ceph Storage Cluster, and override the same setting in + ``global``. + +:Example: ``mgr_stats_period = 10`` + +``osd`` + +:Description: Settings under ``osd`` affect all ``ceph-osd`` daemons in + the Ceph Storage Cluster, and override the same setting in + ``global``. + +:Example: ``osd_op_queue = wpq`` + +``mds`` + +:Description: Settings in the ``mds`` section affect all ``ceph-mds`` daemons in + the Ceph Storage Cluster, and override the same setting in + ``global``. + +:Example: ``mds_cache_size = 10G`` + +``client`` + +:Description: Settings under ``client`` affect all Ceph Clients + (e.g., mounted Ceph Filesystems, mounted Ceph Block Devices, + etc.) as well as Rados Gateway (RGW) daemons. + +:Example: ``objecter_inflight_ops = 512`` + + +Sections may also specify an individual daemon or client name. For example, +``mon.foo``, ``osd.123``, and ``client.smith`` are all valid section names. + + +Any given daemon will draw its settings from the global section, the +daemon or client type section, and the section sharing its name. +Settings in the most-specific section take precedence, so for example +if the same option is specified in both ``global``, ``mon``, and +``mon.foo`` on the same source (i.e., in the same configurationfile), +the ``mon.foo`` value will be used. + +Note that values from the local configuration file always take +precedence over values from the monitor configuration database, +regardless of which section they appear in. + + +.. _ceph-metavariables: + +Metavariables +============= + +Metavariables simplify Ceph Storage Cluster configuration +dramatically. When a metavariable is set in a configuration value, +Ceph expands the metavariable into a concrete value at the time the +configuration value is used. Ceph metavariables are similar to variable expansion in the Bash shell. + +Ceph supports the following metavariables: + +``$cluster`` + +:Description: Expands to the Ceph Storage Cluster name. Useful when running + multiple Ceph Storage Clusters on the same hardware. + +:Example: ``/etc/ceph/$cluster.keyring`` +:Default: ``ceph`` + + +``$type`` + +:Description: Expands to a daemon or process type (e.g., ``mds``, ``osd``, or ``mon``) + +:Example: ``/var/lib/ceph/$type`` + + +``$id`` + +:Description: Expands to the daemon or client identifier. For + ``osd.0``, this would be ``0``; for ``mds.a``, it would + be ``a``. + +:Example: ``/var/lib/ceph/$type/$cluster-$id`` + + +``$host`` + +:Description: Expands to the host name where the process is running. + + +``$name`` + +:Description: Expands to ``$type.$id``. +:Example: ``/var/run/ceph/$cluster-$name.asok`` + +``$pid`` + +:Description: Expands to daemon pid. +:Example: ``/var/run/ceph/$cluster-$name-$pid.asok`` + + + +The Configuration File +====================== + +On startup, Ceph processes search for a configuration file in the +following locations: + +#. ``$CEPH_CONF`` (*i.e.,* the path following the ``$CEPH_CONF`` + environment variable) +#. ``-c path/path`` (*i.e.,* the ``-c`` command line argument) +#. ``/etc/ceph/$cluster.conf`` +#. ``~/.ceph/$cluster.conf`` +#. ``./$cluster.conf`` (*i.e.,* in the current working directory) +#. On FreeBSD systems only, ``/usr/local/etc/ceph/$cluster.conf`` + +where ``$cluster`` is the cluster's name (default ``ceph``). + +The Ceph configuration file uses an *ini* style syntax. You can add comments +by preceding comments with a pound sign (#) or a semi-colon (;). For example: + +.. code-block:: ini + + # <--A number (#) sign precedes a comment. + ; A comment may be anything. + # Comments always follow a semi-colon (;) or a pound (#) on each line. + # The end of the line terminates a comment. + # We recommend that you provide comments in your configuration file(s). + + +.. _ceph-conf-settings: + +Config file section names +------------------------- + +The configuration file is divided into sections. Each section must begin with a +valid configuration section name (see `Configuration sections`_, above) +surrounded by square brackets. For example, + +.. code-block:: ini + + [global] + debug ms = 0 + + [osd] + debug ms = 1 + + [osd.1] + debug ms = 10 + + [osd.2] + debug ms = 10 + + + +Monitor configuration database +============================== + +The monitor cluster manages a database of configuration options that +can be consumed by the entire cluster, enabling streamlined central +configuration management for the entire system. The vast majority of +configuration options can and should be stored here for ease of +administration and transparency. + +A handful of settings may still need to be stored in local +configuration files because they affect the ability to connect to the +monitors, authenticate, and fetch configuration information. In most +cases this is limited to the ``mon_host`` option, although this can +also be avoided through the use of DNS SRV records. + +Sections and masks +------------------ + +Configuration options stored by the monitor can live in a global +section, daemon type section, or specific daemon section, just like +options in a configuration file can. + +In addition, options may also have a *mask* associated with them to +further restrict which daemons or clients the option applies to. +Masks take two forms: + +#. ``type:location`` where *type* is a CRUSH property like `rack` or + `host`, and *location* is a value for that property. For example, + ``host:foo`` would limit the option only to daemons or clients + running on a particular host. +#. ``class:device-class`` where *device-class* is the name of a CRUSH + device class (e.g., ``hdd`` or ``ssd``). For example, + ``class:ssd`` would limit the option only to OSDs backed by SSDs. + (This mask has no effect for non-OSD daemons or clients.) + +When setting a configuration option, the `who` may be a section name, +a mask, or a combination of both separated by a slash (``/``) +character. For example, ``osd/rack:foo`` would mean all OSD daemons +in the ``foo`` rack. + +When viewing configuration options, the section name and mask are +generally separated out into separate fields or columns to ease readability. + + +Commands +-------- + +The following CLI commands are used to configure the cluster: + +* ``ceph config dump`` will dump the entire configuration database for + the cluster. + +* ``ceph config get <who>`` will dump the configuration for a specific + daemon or client (e.g., ``mds.a``), as stored in the monitors' + configuration database. + +* ``ceph config set <who> <option> <value>`` will set a configuration + option in the monitors' configuration database. + +* ``ceph config show <who>`` will show the reported running + configuration for a running daemon. These settings may differ from + those stored by the monitors if there are also local configuration + files in use or options have been overridden on the command line or + at run time. The source of the option values is reported as part + of the output. + +* ``ceph config assimilate-conf -i <input file> -o <output file>`` + will ingest a configuration file from *input file* and move any + valid options into the monitors' configuration database. Any + settings that are unrecognized, invalid, or cannot be controlled by + the monitor will be returned in an abbreviated config file stored in + *output file*. This command is useful for transitioning from legacy + configuration files to centralized monitor-based configuration. + + +Help +==== + +You can get help for a particular option with:: + + ceph config help <option> + +Note that this will use the configuration schema that is compiled into the running monitors. If you have a mixed-version cluster (e.g., during an upgrade), you might also want to query the option schema from a specific running daemon:: + + ceph daemon <name> config help [option] + +For example,:: + + $ ceph config help log_file + log_file - path to log file + (std::string, basic) + Default (non-daemon): + Default (daemon): /var/log/ceph/$cluster-$name.log + Can update at runtime: false + See also: [log_to_stderr,err_to_stderr,log_to_syslog,err_to_syslog] + +or:: + + $ ceph config help log_file -f json-pretty + { + "name": "log_file", + "type": "std::string", + "level": "basic", + "desc": "path to log file", + "long_desc": "", + "default": "", + "daemon_default": "/var/log/ceph/$cluster-$name.log", + "tags": [], + "services": [], + "see_also": [ + "log_to_stderr", + "err_to_stderr", + "log_to_syslog", + "err_to_syslog" + ], + "enum_values": [], + "min": "", + "max": "", + "can_update_at_runtime": false + } + +The ``level`` property can be any of `basic`, `advanced`, or `dev`. +The `dev` options are intended for use by developers, generally for +testing purposes, and are not recommended for use by operators. + + +Runtime Changes +=============== + +In most cases, Ceph allows you to make changes to the configuration of +a daemon at runtime. This capability is quite useful for +increasing/decreasing logging output, enabling/disabling debug +settings, and even for runtime optimization. + +Generally speaking, configuration options can be updated in the usual +way via the ``ceph config set`` command. For example, do enable the debug log level on a specific OSD,:: + + ceph config set osd.123 debug_ms 20 + +Note that if the same option is also customized in a local +configuration file, the monitor setting will be ignored (it has a +lower priority than the local config file). + +Override values +--------------- + +You can also temporarily set an option using the `tell` or `daemon` +interfaces on the Ceph CLI. These *override* values are ephemeral in +that they only affect the running process and are discarded/lost if +the daemon or process restarts. + +Override values can be set in two ways: + +#. From any host, we can send a message to a daemon over the network with:: + + ceph tell <name> config set <option> <value> + + For example,:: + + ceph tell osd.123 config set debug_osd 20 + + The `tell` command can also accept a wildcard for the daemon + identifier. For example, to adjust the debug level on all OSD + daemons,:: + + ceph tell osd.* config set debug_osd 20 + +#. From the host the process is running on, we can connect directly to + the process via a socket in ``/var/run/ceph`` with:: + + ceph daemon <name> config set <option> <value> + + For example,:: + + ceph daemon osd.4 config set debug_osd 20 + +Note that in the ``ceph config show`` command output these temporary +values will be shown with a source of ``override``. + + +Viewing runtime settings +======================== + +You can see the current options set for a running daemon with the ``ceph config show`` command. For example,:: + + ceph config show osd.0 + +will show you the (non-default) options for that daemon. You can also look at a specific option with:: + + ceph config show osd.0 debug_osd + +or view all options (even those with default values) with:: + + ceph config show-with-defaults osd.0 + +You can also observe settings for a running daemon by connecting to it from the local host via the admin socket. For example,:: + + ceph daemon osd.0 config show + +will dump all current settings,:: + + ceph daemon osd.0 config diff + +will show only non-default settings (as well as where the value came from: a config file, the monitor, an override, etc.), and:: + + ceph daemon osd.0 config get debug_osd + +will report the value of a single option. + + + diff --git a/doc/rados/configuration/common.rst b/doc/rados/configuration/common.rst new file mode 100644 index 00000000..acb5d325 --- /dev/null +++ b/doc/rados/configuration/common.rst @@ -0,0 +1,258 @@ + +.. _ceph-conf-common-settings: + +Common Settings +=============== + +The `Hardware Recommendations`_ section provides some hardware guidelines for +configuring a Ceph Storage Cluster. It is possible for a single :term:`Ceph +Node` to run multiple daemons. For example, a single node with multiple drives +may run one ``ceph-osd`` for each drive. Ideally, you will have a node for a +particular type of process. For example, some nodes may run ``ceph-osd`` +daemons, other nodes may run ``ceph-mds`` daemons, and still other nodes may +run ``ceph-mon`` daemons. + +Each node has a name identified by the ``host`` setting. Monitors also specify +a network address and port (i.e., domain name or IP address) identified by the +``addr`` setting. A basic configuration file will typically specify only +minimal settings for each instance of monitor daemons. For example: + +.. code-block:: ini + + [global] + mon_initial_members = ceph1 + mon_host = 10.0.0.1 + + +.. important:: The ``host`` setting is the short name of the node (i.e., not + an fqdn). It is **NOT** an IP address either. Enter ``hostname -s`` on + the command line to retrieve the name of the node. Do not use ``host`` + settings for anything other than initial monitors unless you are deploying + Ceph manually. You **MUST NOT** specify ``host`` under individual daemons + when using deployment tools like ``chef`` or ``ceph-deploy``, as those tools + will enter the appropriate values for you in the cluster map. + + +.. _ceph-network-config: + +Networks +======== + +See the `Network Configuration Reference`_ for a detailed discussion about +configuring a network for use with Ceph. + + +Monitors +======== + +Ceph production clusters typically deploy with a minimum 3 :term:`Ceph Monitor` +daemons to ensure high availability should a monitor instance crash. At least +three (3) monitors ensures that the Paxos algorithm can determine which version +of the :term:`Ceph Cluster Map` is the most recent from a majority of Ceph +Monitors in the quorum. + +.. note:: You may deploy Ceph with a single monitor, but if the instance fails, + the lack of other monitors may interrupt data service availability. + +Ceph Monitors normally listen on port ``3300`` for the new v2 protocol, and ``6789`` for the old v1 protocol. + +By default, Ceph expects that you will store a monitor's data under the +following path:: + + /var/lib/ceph/mon/$cluster-$id + +You or a deployment tool (e.g., ``ceph-deploy``) must create the corresponding +directory. With metavariables fully expressed and a cluster named "ceph", the +foregoing directory would evaluate to:: + + /var/lib/ceph/mon/ceph-a + +For additional details, see the `Monitor Config Reference`_. + +.. _Monitor Config Reference: ../mon-config-ref + + +.. _ceph-osd-config: + + +Authentication +============== + +.. versionadded:: Bobtail 0.56 + +For Bobtail (v 0.56) and beyond, you should expressly enable or disable +authentication in the ``[global]`` section of your Ceph configuration file. :: + + auth cluster required = cephx + auth service required = cephx + auth client required = cephx + +Additionally, you should enable message signing. See `Cephx Config Reference`_ for details. + +.. important:: When upgrading, we recommend expressly disabling authentication + first, then perform the upgrade. Once the upgrade is complete, re-enable + authentication. + +.. _Cephx Config Reference: ../auth-config-ref + + +.. _ceph-monitor-config: + + +OSDs +==== + +Ceph production clusters typically deploy :term:`Ceph OSD Daemons` where one node +has one OSD daemon running a filestore on one storage drive. A typical +deployment specifies a journal size. For example: + +.. code-block:: ini + + [osd] + osd journal size = 10000 + + [osd.0] + host = {hostname} #manual deployments only. + + +By default, Ceph expects that you will store a Ceph OSD Daemon's data with the +following path:: + + /var/lib/ceph/osd/$cluster-$id + +You or a deployment tool (e.g., ``ceph-deploy``) must create the corresponding +directory. With metavariables fully expressed and a cluster named "ceph", the +foregoing directory would evaluate to:: + + /var/lib/ceph/osd/ceph-0 + +You may override this path using the ``osd data`` setting. We don't recommend +changing the default location. Create the default directory on your OSD host. + +:: + + ssh {osd-host} + sudo mkdir /var/lib/ceph/osd/ceph-{osd-number} + +The ``osd data`` path ideally leads to a mount point with a hard disk that is +separate from the hard disk storing and running the operating system and +daemons. If the OSD is for a disk other than the OS disk, prepare it for +use with Ceph, and mount it to the directory you just created:: + + ssh {new-osd-host} + sudo mkfs -t {fstype} /dev/{disk} + sudo mount -o user_xattr /dev/{hdd} /var/lib/ceph/osd/ceph-{osd-number} + +We recommend using the ``xfs`` file system when running +:command:`mkfs`. (``btrfs`` and ``ext4`` are not recommended and no +longer tested.) + +See the `OSD Config Reference`_ for additional configuration details. + + +Heartbeats +========== + +During runtime operations, Ceph OSD Daemons check up on other Ceph OSD Daemons +and report their findings to the Ceph Monitor. You do not have to provide any +settings. However, if you have network latency issues, you may wish to modify +the settings. + +See `Configuring Monitor/OSD Interaction`_ for additional details. + + +.. _ceph-logging-and-debugging: + +Logs / Debugging +================ + +Sometimes you may encounter issues with Ceph that require +modifying logging output and using Ceph's debugging. See `Debugging and +Logging`_ for details on log rotation. + +.. _Debugging and Logging: ../../troubleshooting/log-and-debug + + +Example ceph.conf +================= + +.. literalinclude:: demo-ceph.conf + :language: ini + +.. _ceph-runtime-config: + + + +Running Multiple Clusters +========================= + +With Ceph, you can run multiple Ceph Storage Clusters on the same hardware. +Running multiple clusters provides a higher level of isolation compared to +using different pools on the same cluster with different CRUSH rules. A +separate cluster will have separate monitor, OSD and metadata server processes. +When running Ceph with default settings, the default cluster name is ``ceph``, +which means you would save your Ceph configuration file with the file name +``ceph.conf`` in the ``/etc/ceph`` default directory. + +See `Create a Cluster`_ for details. + +.. _Create a Cluster: ../../deployment/ceph-deploy-new + +When you run multiple clusters, you must name your cluster and save the Ceph +configuration file with the name of the cluster. For example, a cluster named +``openstack`` will have a Ceph configuration file with the file name +``openstack.conf`` in the ``/etc/ceph`` default directory. + +.. important:: Cluster names must consist of letters a-z and digits 0-9 only. + +Separate clusters imply separate data disks and journals, which are not shared +between clusters. Referring to `Metavariables`_, the ``$cluster`` metavariable +evaluates to the cluster name (i.e., ``openstack`` in the foregoing example). +Various settings use the ``$cluster`` metavariable, including: + +.. _Metavariables: ../ceph-conf#Metavariables + +- ``keyring`` +- ``admin socket`` +- ``log file`` +- ``pid file`` +- ``mon data`` +- ``mon cluster log file`` +- ``osd data`` +- ``osd journal`` +- ``mds data`` +- ``rgw data`` + +See `General Settings`_, `OSD Settings`_, `Monitor Settings`_, `MDS Settings`_, +`RGW Settings`_ and `Log Settings`_ for relevant path defaults that use the +``$cluster`` metavariable. + +.. _General Settings: ../general-config-ref +.. _OSD Settings: ../osd-config-ref +.. _Monitor Settings: ../mon-config-ref +.. _MDS Settings: ../../../cephfs/mds-config-ref +.. _RGW Settings: ../../../radosgw/config-ref/ +.. _Log Settings: ../../troubleshooting/log-and-debug + + +When creating default directories or files, you should use the cluster +name at the appropriate places in the path. For example:: + + sudo mkdir /var/lib/ceph/osd/openstack-0 + sudo mkdir /var/lib/ceph/mon/openstack-a + +.. important:: When running monitors on the same host, you should use + different ports. By default, monitors use port 6789. If you already + have monitors using port 6789, use a different port for your other cluster(s). + +To invoke a cluster other than the default ``ceph`` cluster, use the +``-c {filename}.conf`` option with the ``ceph`` command. For example:: + + ceph -c {cluster-name}.conf health + ceph -c openstack.conf health + + +.. _Hardware Recommendations: ../../../start/hardware-recommendations +.. _Network Configuration Reference: ../network-config-ref +.. _OSD Config Reference: ../osd-config-ref +.. _Configuring Monitor/OSD Interaction: ../mon-osd-interaction diff --git a/doc/rados/configuration/demo-ceph.conf b/doc/rados/configuration/demo-ceph.conf new file mode 100644 index 00000000..ba86d531 --- /dev/null +++ b/doc/rados/configuration/demo-ceph.conf @@ -0,0 +1,31 @@ +[global] +fsid = {cluster-id} +mon initial members = {hostname}[, {hostname}] +mon host = {ip-address}[, {ip-address}] + +#All clusters have a front-side public network. +#If you have two NICs, you can configure a back side cluster +#network for OSD object replication, heart beats, backfilling, +#recovery, etc. +public network = {network}[, {network}] +#cluster network = {network}[, {network}] + +#Clusters require authentication by default. +auth cluster required = cephx +auth service required = cephx +auth client required = cephx + +#Choose reasonable numbers for your journals, number of replicas +#and placement groups. +osd journal size = {n} +osd pool default size = {n} # Write an object n times. +osd pool default min size = {n} # Allow writing n copy in a degraded state. +osd pool default pg num = {n} +osd pool default pgp num = {n} + +#Choose a reasonable crush leaf type. +#0 for a 1-node cluster. +#1 for a multi node cluster in a single rack +#2 for a multi node, multi chassis cluster with multiple hosts in a chassis +#3 for a multi node cluster with hosts across racks, etc. +osd crush chooseleaf type = {n}
\ No newline at end of file diff --git a/doc/rados/configuration/filestore-config-ref.rst b/doc/rados/configuration/filestore-config-ref.rst new file mode 100644 index 00000000..3be9c489 --- /dev/null +++ b/doc/rados/configuration/filestore-config-ref.rst @@ -0,0 +1,365 @@ +============================ + Filestore Config Reference +============================ + + +``filestore debug omap check`` + +:Description: Debugging check on synchronization. Expensive. For debugging only. +:Type: Boolean +:Required: No +:Default: ``false`` + + +.. index:: filestore; extended attributes + +Extended Attributes +=================== + +Extended Attributes (XATTRs) are an important aspect in your configuration. +Some file systems have limits on the number of bytes stored in XATTRS. +Additionally, in some cases, the filesystem may not be as fast as an alternative +method of storing XATTRs. The following settings may help improve performance +by using a method of storing XATTRs that is extrinsic to the underlying filesystem. + +Ceph XATTRs are stored as ``inline xattr``, using the XATTRs provided +by the underlying file system, if it does not impose a size limit. If +there is a size limit (4KB total on ext4, for instance), some Ceph +XATTRs will be stored in an key/value database when either the +``filestore max inline xattr size`` or ``filestore max inline +xattrs`` threshold is reached. + + +``filestore max inline xattr size`` + +:Description: The maximum size of an XATTR stored in the filesystem (i.e., XFS, + btrfs, ext4, etc.) per object. Should not be larger than the + filesystem can handle. Default value of 0 means to use the value + specific to the underlying filesystem. +:Type: Unsigned 32-bit Integer +:Required: No +:Default: ``0`` + + +``filestore max inline xattr size xfs`` + +:Description: The maximum size of an XATTR stored in the XFS filesystem. + Only used if ``filestore max inline xattr size`` == 0. +:Type: Unsigned 32-bit Integer +:Required: No +:Default: ``65536`` + + +``filestore max inline xattr size btrfs`` + +:Description: The maximum size of an XATTR stored in the btrfs filesystem. + Only used if ``filestore max inline xattr size`` == 0. +:Type: Unsigned 32-bit Integer +:Required: No +:Default: ``2048`` + + +``filestore max inline xattr size other`` + +:Description: The maximum size of an XATTR stored in other filesystems. + Only used if ``filestore max inline xattr size`` == 0. +:Type: Unsigned 32-bit Integer +:Required: No +:Default: ``512`` + + +``filestore max inline xattrs`` + +:Description: The maximum number of XATTRs stored in the filesystem per object. + Default value of 0 means to use the value specific to the + underlying filesystem. +:Type: 32-bit Integer +:Required: No +:Default: ``0`` + + +``filestore max inline xattrs xfs`` + +:Description: The maximum number of XATTRs stored in the XFS filesystem per object. + Only used if ``filestore max inline xattrs`` == 0. +:Type: 32-bit Integer +:Required: No +:Default: ``10`` + + +``filestore max inline xattrs btrfs`` + +:Description: The maximum number of XATTRs stored in the btrfs filesystem per object. + Only used if ``filestore max inline xattrs`` == 0. +:Type: 32-bit Integer +:Required: No +:Default: ``10`` + + +``filestore max inline xattrs other`` + +:Description: The maximum number of XATTRs stored in other filesystems per object. + Only used if ``filestore max inline xattrs`` == 0. +:Type: 32-bit Integer +:Required: No +:Default: ``2`` + +.. index:: filestore; synchronization + +Synchronization Intervals +========================= + +Periodically, the filestore needs to quiesce writes and synchronize the +filesystem, which creates a consistent commit point. It can then free journal +entries up to the commit point. Synchronizing more frequently tends to reduce +the time required to perform synchronization, and reduces the amount of data +that needs to remain in the journal. Less frequent synchronization allows the +backing filesystem to coalesce small writes and metadata updates more +optimally--potentially resulting in more efficient synchronization. + + +``filestore max sync interval`` + +:Description: The maximum interval in seconds for synchronizing the filestore. +:Type: Double +:Required: No +:Default: ``5`` + + +``filestore min sync interval`` + +:Description: The minimum interval in seconds for synchronizing the filestore. +:Type: Double +:Required: No +:Default: ``.01`` + + +.. index:: filestore; flusher + +Flusher +======= + +The filestore flusher forces data from large writes to be written out using +``sync file range`` before the sync in order to (hopefully) reduce the cost of +the eventual sync. In practice, disabling 'filestore flusher' seems to improve +performance in some cases. + + +``filestore flusher`` + +:Description: Enables the filestore flusher. +:Type: Boolean +:Required: No +:Default: ``false`` + +.. deprecated:: v.65 + +``filestore flusher max fds`` + +:Description: Sets the maximum number of file descriptors for the flusher. +:Type: Integer +:Required: No +:Default: ``512`` + +.. deprecated:: v.65 + +``filestore sync flush`` + +:Description: Enables the synchronization flusher. +:Type: Boolean +:Required: No +:Default: ``false`` + +.. deprecated:: v.65 + +``filestore fsync flushes journal data`` + +:Description: Flush journal data during filesystem synchronization. +:Type: Boolean +:Required: No +:Default: ``false`` + + +.. index:: filestore; queue + +Queue +===== + +The following settings provide limits on the size of filestore queue. + +``filestore queue max ops`` + +:Description: Defines the maximum number of in progress operations the file store accepts before blocking on queuing new operations. +:Type: Integer +:Required: No. Minimal impact on performance. +:Default: ``50`` + + +``filestore queue max bytes`` + +:Description: The maximum number of bytes for an operation. +:Type: Integer +:Required: No +:Default: ``100 << 20`` + + + + +.. index:: filestore; timeouts + +Timeouts +======== + + +``filestore op threads`` + +:Description: The number of filesystem operation threads that execute in parallel. +:Type: Integer +:Required: No +:Default: ``2`` + + +``filestore op thread timeout`` + +:Description: The timeout for a filesystem operation thread (in seconds). +:Type: Integer +:Required: No +:Default: ``60`` + + +``filestore op thread suicide timeout`` + +:Description: The timeout for a commit operation before cancelling the commit (in seconds). +:Type: Integer +:Required: No +:Default: ``180`` + + +.. index:: filestore; btrfs + +B-Tree Filesystem +================= + + +``filestore btrfs snap`` + +:Description: Enable snapshots for a ``btrfs`` filestore. +:Type: Boolean +:Required: No. Only used for ``btrfs``. +:Default: ``true`` + + +``filestore btrfs clone range`` + +:Description: Enable cloning ranges for a ``btrfs`` filestore. +:Type: Boolean +:Required: No. Only used for ``btrfs``. +:Default: ``true`` + + +.. index:: filestore; journal + +Journal +======= + + +``filestore journal parallel`` + +:Description: Enables parallel journaling, default for btrfs. +:Type: Boolean +:Required: No +:Default: ``false`` + + +``filestore journal writeahead`` + +:Description: Enables writeahead journaling, default for xfs. +:Type: Boolean +:Required: No +:Default: ``false`` + + +``filestore journal trailing`` + +:Description: Deprecated, never use. +:Type: Boolean +:Required: No +:Default: ``false`` + + +Misc +==== + + +``filestore merge threshold`` + +:Description: Min number of files in a subdir before merging into parent + NOTE: A negative value means to disable subdir merging +:Type: Integer +:Required: No +:Default: ``-10`` + + +``filestore split multiple`` + +:Description: ``(filestore_split_multiple * abs(filestore_merge_threshold) + (rand() % filestore_split_rand_factor)) * 16`` + is the maximum number of files in a subdirectory before + splitting into child directories. + +:Type: Integer +:Required: No +:Default: ``2`` + + +``filestore split rand factor`` + +:Description: A random factor added to the split threshold to avoid + too many filestore splits occurring at once. See + ``filestore split multiple`` for details. + This can only be changed for an existing osd offline, + via ceph-objectstore-tool's apply-layout-settings command. + +:Type: Unsigned 32-bit Integer +:Required: No +:Default: ``20`` + + +``filestore update to`` + +:Description: Limits filestore auto upgrade to specified version. +:Type: Integer +:Required: No +:Default: ``1000`` + + +``filestore blackhole`` + +:Description: Drop any new transactions on the floor. +:Type: Boolean +:Required: No +:Default: ``false`` + + +``filestore dump file`` + +:Description: File onto which store transaction dumps. +:Type: Boolean +:Required: No +:Default: ``false`` + + +``filestore kill at`` + +:Description: inject a failure at the n'th opportunity +:Type: String +:Required: No +:Default: ``false`` + + +``filestore fail eio`` + +:Description: Fail/Crash on eio. +:Type: Boolean +:Required: No +:Default: ``true`` + diff --git a/doc/rados/configuration/general-config-ref.rst b/doc/rados/configuration/general-config-ref.rst new file mode 100644 index 00000000..ca09ee5b --- /dev/null +++ b/doc/rados/configuration/general-config-ref.rst @@ -0,0 +1,66 @@ +========================== + General Config Reference +========================== + + +``fsid`` + +:Description: The filesystem ID. One per cluster. +:Type: UUID +:Required: No. +:Default: N/A. Usually generated by deployment tools. + + +``admin socket`` + +:Description: The socket for executing administrative commands on a daemon, + irrespective of whether Ceph Monitors have established a quorum. + +:Type: String +:Required: No +:Default: ``/var/run/ceph/$cluster-$name.asok`` + + +``pid file`` + +:Description: The file in which the mon, osd or mds will write its + PID. For instance, ``/var/run/$cluster/$type.$id.pid`` + will create /var/run/ceph/mon.a.pid for the ``mon`` with + id ``a`` running in the ``ceph`` cluster. The ``pid + file`` is removed when the daemon stops gracefully. If + the process is not daemonized (i.e. runs with the ``-f`` + or ``-d`` option), the ``pid file`` is not created. +:Type: String +:Required: No +:Default: No + + +``chdir`` + +:Description: The directory Ceph daemons change to once they are + up and running. Default ``/`` directory recommended. + +:Type: String +:Required: No +:Default: ``/`` + + +``max open files`` + +:Description: If set, when the :term:`Ceph Storage Cluster` starts, Ceph sets + the ``max open fds`` at the OS level (i.e., the max # of file + descriptors). It helps prevents Ceph OSD Daemons from running out + of file descriptors. + +:Type: 64-bit Integer +:Required: No +:Default: ``0`` + + +``fatal signal handlers`` + +:Description: If set, we will install signal handlers for SEGV, ABRT, BUS, ILL, + FPE, XCPU, XFSZ, SYS signals to generate a useful log message + +:Type: Boolean +:Default: ``true`` diff --git a/doc/rados/configuration/index.rst b/doc/rados/configuration/index.rst new file mode 100644 index 00000000..e0ed2581 --- /dev/null +++ b/doc/rados/configuration/index.rst @@ -0,0 +1,52 @@ +=============== + Configuration +=============== + +Each Ceph process, daemon or utility draws its configuration from +several sources on startup, include a local configuration, the +monitors, the command line, or environment variables. Configuration +options may be set globally such that they apply to all daemons, to +all daemons or services of a particular type, or only to a specific +daemon, process, or client. + +.. raw:: html + + <table cellpadding="10"><colgroup><col width="50%"><col width="50%"></colgroup><tbody valign="top"><tr><td><h3>Configuring the Object Store</h3> + +For general object store configuration, refer to the following: + +.. toctree:: + :maxdepth: 1 + + Storage devices <storage-devices> + ceph-conf + + +.. raw:: html + + </td><td><h3>Reference</h3> + +To optimize the performance of your cluster, refer to the following: + +.. toctree:: + :maxdepth: 1 + + Common Settings <common> + Network Settings <network-config-ref> + Messenger v2 protocol <msgr2> + Auth Settings <auth-config-ref> + Monitor Settings <mon-config-ref> + mon-lookup-dns + Heartbeat Settings <mon-osd-interaction> + OSD Settings <osd-config-ref> + BlueStore Settings <bluestore-config-ref> + FileStore Settings <filestore-config-ref> + Journal Settings <journal-ref> + Pool, PG & CRUSH Settings <pool-pg-config-ref.rst> + Messaging Settings <ms-ref> + General Settings <general-config-ref> + + +.. raw:: html + + </td></tr></tbody></table> diff --git a/doc/rados/configuration/journal-ref.rst b/doc/rados/configuration/journal-ref.rst new file mode 100644 index 00000000..97300f4a --- /dev/null +++ b/doc/rados/configuration/journal-ref.rst @@ -0,0 +1,116 @@ +========================== + Journal Config Reference +========================== + +.. index:: journal; journal configuration + +Ceph OSDs use a journal for two reasons: speed and consistency. + +- **Speed:** The journal enables the Ceph OSD Daemon to commit small writes + quickly. Ceph writes small, random i/o to the journal sequentially, which + tends to speed up bursty workloads by allowing the backing filesystem more + time to coalesce writes. The Ceph OSD Daemon's journal, however, can lead + to spiky performance with short spurts of high-speed writes followed by + periods without any write progress as the filesystem catches up to the + journal. + +- **Consistency:** Ceph OSD Daemons require a filesystem interface that + guarantees atomic compound operations. Ceph OSD Daemons write a description + of the operation to the journal and apply the operation to the filesystem. + This enables atomic updates to an object (for example, placement group + metadata). Every few seconds--between ``filestore max sync interval`` and + ``filestore min sync interval``--the Ceph OSD Daemon stops writes and + synchronizes the journal with the filesystem, allowing Ceph OSD Daemons to + trim operations from the journal and reuse the space. On failure, Ceph + OSD Daemons replay the journal starting after the last synchronization + operation. + +Ceph OSD Daemons support the following journal settings: + + +``journal dio`` + +:Description: Enables direct i/o to the journal. Requires ``journal block + align`` set to ``true``. + +:Type: Boolean +:Required: Yes when using ``aio``. +:Default: ``true`` + + + +``journal aio`` + +.. versionchanged:: 0.61 Cuttlefish + +:Description: Enables using ``libaio`` for asynchronous writes to the journal. + Requires ``journal dio`` set to ``true``. + +:Type: Boolean +:Required: No. +:Default: Version 0.61 and later, ``true``. Version 0.60 and earlier, ``false``. + + +``journal block align`` + +:Description: Block aligns write operations. Required for ``dio`` and ``aio``. +:Type: Boolean +:Required: Yes when using ``dio`` and ``aio``. +:Default: ``true`` + + +``journal max write bytes`` + +:Description: The maximum number of bytes the journal will write at + any one time. + +:Type: Integer +:Required: No +:Default: ``10 << 20`` + + +``journal max write entries`` + +:Description: The maximum number of entries the journal will write at + any one time. + +:Type: Integer +:Required: No +:Default: ``100`` + + +``journal queue max ops`` + +:Description: The maximum number of operations allowed in the queue at + any one time. + +:Type: Integer +:Required: No +:Default: ``500`` + + +``journal queue max bytes`` + +:Description: The maximum number of bytes allowed in the queue at + any one time. + +:Type: Integer +:Required: No +:Default: ``10 << 20`` + + +``journal align min size`` + +:Description: Align data payloads greater than the specified minimum. +:Type: Integer +:Required: No +:Default: ``64 << 10`` + + +``journal zero on create`` + +:Description: Causes the file store to overwrite the entire journal with + ``0``'s during ``mkfs``. +:Type: Boolean +:Required: No +:Default: ``false`` diff --git a/doc/rados/configuration/mon-config-ref.rst b/doc/rados/configuration/mon-config-ref.rst new file mode 100644 index 00000000..727734d0 --- /dev/null +++ b/doc/rados/configuration/mon-config-ref.rst @@ -0,0 +1,1261 @@ +========================== + Monitor Config Reference +========================== + +Understanding how to configure a :term:`Ceph Monitor` is an important part of +building a reliable :term:`Ceph Storage Cluster`. **All Ceph Storage Clusters +have at least one monitor**. A monitor configuration usually remains fairly +consistent, but you can add, remove or replace a monitor in a cluster. See +`Adding/Removing a Monitor`_ and `Add/Remove a Monitor (ceph-deploy)`_ for +details. + + +.. index:: Ceph Monitor; Paxos + +Background +========== + +Ceph Monitors maintain a "master copy" of the :term:`cluster map`, which means a +:term:`Ceph Client` can determine the location of all Ceph Monitors, Ceph OSD +Daemons, and Ceph Metadata Servers just by connecting to one Ceph Monitor and +retrieving a current cluster map. Before Ceph Clients can read from or write to +Ceph OSD Daemons or Ceph Metadata Servers, they must connect to a Ceph Monitor +first. With a current copy of the cluster map and the CRUSH algorithm, a Ceph +Client can compute the location for any object. The ability to compute object +locations allows a Ceph Client to talk directly to Ceph OSD Daemons, which is a +very important aspect of Ceph's high scalability and performance. See +`Scalability and High Availability`_ for additional details. + +The primary role of the Ceph Monitor is to maintain a master copy of the cluster +map. Ceph Monitors also provide authentication and logging services. Ceph +Monitors write all changes in the monitor services to a single Paxos instance, +and Paxos writes the changes to a key/value store for strong consistency. Ceph +Monitors can query the most recent version of the cluster map during sync +operations. Ceph Monitors leverage the key/value store's snapshots and iterators +(using leveldb) to perform store-wide synchronization. + +.. ditaa:: + /-------------\ /-------------\ + | Monitor | Write Changes | Paxos | + | cCCC +-------------->+ cCCC | + | | | | + +-------------+ \------+------/ + | Auth | | + +-------------+ | Write Changes + | Log | | + +-------------+ v + | Monitor Map | /------+------\ + +-------------+ | Key / Value | + | OSD Map | | Store | + +-------------+ | cCCC | + | PG Map | \------+------/ + +-------------+ ^ + | MDS Map | | Read Changes + +-------------+ | + | cCCC |*---------------------+ + \-------------/ + + +.. deprecated:: version 0.58 + +In Ceph versions 0.58 and earlier, Ceph Monitors use a Paxos instance for +each service and store the map as a file. + +.. index:: Ceph Monitor; cluster map + +Cluster Maps +------------ + +The cluster map is a composite of maps, including the monitor map, the OSD map, +the placement group map and the metadata server map. The cluster map tracks a +number of important things: which processes are ``in`` the Ceph Storage Cluster; +which processes that are ``in`` the Ceph Storage Cluster are ``up`` and running +or ``down``; whether, the placement groups are ``active`` or ``inactive``, and +``clean`` or in some other state; and, other details that reflect the current +state of the cluster such as the total amount of storage space, and the amount +of storage used. + +When there is a significant change in the state of the cluster--e.g., a Ceph OSD +Daemon goes down, a placement group falls into a degraded state, etc.--the +cluster map gets updated to reflect the current state of the cluster. +Additionally, the Ceph Monitor also maintains a history of the prior states of +the cluster. The monitor map, OSD map, placement group map and metadata server +map each maintain a history of their map versions. We call each version an +"epoch." + +When operating your Ceph Storage Cluster, keeping track of these states is an +important part of your system administration duties. See `Monitoring a Cluster`_ +and `Monitoring OSDs and PGs`_ for additional details. + +.. index:: high availability; quorum + +Monitor Quorum +-------------- + +Our Configuring ceph section provides a trivial `Ceph configuration file`_ that +provides for one monitor in the test cluster. A cluster will run fine with a +single monitor; however, **a single monitor is a single-point-of-failure**. To +ensure high availability in a production Ceph Storage Cluster, you should run +Ceph with multiple monitors so that the failure of a single monitor **WILL NOT** +bring down your entire cluster. + +When a Ceph Storage Cluster runs multiple Ceph Monitors for high availability, +Ceph Monitors use `Paxos`_ to establish consensus about the master cluster map. +A consensus requires a majority of monitors running to establish a quorum for +consensus about the cluster map (e.g., 1; 2 out of 3; 3 out of 5; 4 out of 6; +etc.). + +``mon force quorum join`` + +:Description: Force monitor to join quorum even if it has been previously removed from the map +:Type: Boolean +:Default: ``False`` + +.. index:: Ceph Monitor; consistency + +Consistency +----------- + +When you add monitor settings to your Ceph configuration file, you need to be +aware of some of the architectural aspects of Ceph Monitors. **Ceph imposes +strict consistency requirements** for a Ceph monitor when discovering another +Ceph Monitor within the cluster. Whereas, Ceph Clients and other Ceph daemons +use the Ceph configuration file to discover monitors, monitors discover each +other using the monitor map (monmap), not the Ceph configuration file. + +A Ceph Monitor always refers to the local copy of the monmap when discovering +other Ceph Monitors in the Ceph Storage Cluster. Using the monmap instead of the +Ceph configuration file avoids errors that could break the cluster (e.g., typos +in ``ceph.conf`` when specifying a monitor address or port). Since monitors use +monmaps for discovery and they share monmaps with clients and other Ceph +daemons, **the monmap provides monitors with a strict guarantee that their +consensus is valid.** + +Strict consistency also applies to updates to the monmap. As with any other +updates on the Ceph Monitor, changes to the monmap always run through a +distributed consensus algorithm called `Paxos`_. The Ceph Monitors must agree on +each update to the monmap, such as adding or removing a Ceph Monitor, to ensure +that each monitor in the quorum has the same version of the monmap. Updates to +the monmap are incremental so that Ceph Monitors have the latest agreed upon +version, and a set of previous versions. Maintaining a history enables a Ceph +Monitor that has an older version of the monmap to catch up with the current +state of the Ceph Storage Cluster. + +If Ceph Monitors discovered each other through the Ceph configuration file +instead of through the monmap, it would introduce additional risks because the +Ceph configuration files are not updated and distributed automatically. Ceph +Monitors might inadvertently use an older Ceph configuration file, fail to +recognize a Ceph Monitor, fall out of a quorum, or develop a situation where +`Paxos`_ is not able to determine the current state of the system accurately. + + +.. index:: Ceph Monitor; bootstrapping monitors + +Bootstrapping Monitors +---------------------- + +In most configuration and deployment cases, tools that deploy Ceph may help +bootstrap the Ceph Monitors by generating a monitor map for you (e.g., +``ceph-deploy``, etc). A Ceph Monitor requires a few explicit +settings: + +- **Filesystem ID**: The ``fsid`` is the unique identifier for your + object store. Since you can run multiple clusters on the same + hardware, you must specify the unique ID of the object store when + bootstrapping a monitor. Deployment tools usually do this for you + (e.g., ``ceph-deploy`` can call a tool like ``uuidgen``), but you + may specify the ``fsid`` manually too. + +- **Monitor ID**: A monitor ID is a unique ID assigned to each monitor within + the cluster. It is an alphanumeric value, and by convention the identifier + usually follows an alphabetical increment (e.g., ``a``, ``b``, etc.). This + can be set in a Ceph configuration file (e.g., ``[mon.a]``, ``[mon.b]``, etc.), + by a deployment tool, or using the ``ceph`` commandline. + +- **Keys**: The monitor must have secret keys. A deployment tool such as + ``ceph-deploy`` usually does this for you, but you may + perform this step manually too. See `Monitor Keyrings`_ for details. + +For additional details on bootstrapping, see `Bootstrapping a Monitor`_. + +.. index:: Ceph Monitor; configuring monitors + +Configuring Monitors +==================== + +To apply configuration settings to the entire cluster, enter the configuration +settings under ``[global]``. To apply configuration settings to all monitors in +your cluster, enter the configuration settings under ``[mon]``. To apply +configuration settings to specific monitors, specify the monitor instance +(e.g., ``[mon.a]``). By convention, monitor instance names use alpha notation. + +.. code-block:: ini + + [global] + + [mon] + + [mon.a] + + [mon.b] + + [mon.c] + + +Minimum Configuration +--------------------- + +The bare minimum monitor settings for a Ceph monitor via the Ceph configuration +file include a hostname and a monitor address for each monitor. You can configure +these under ``[mon]`` or under the entry for a specific monitor. + +.. code-block:: ini + + [global] + mon host = 10.0.0.2,10.0.0.3,10.0.0.4 + +.. code-block:: ini + + [mon.a] + host = hostname1 + mon addr = 10.0.0.10:6789 + +See the `Network Configuration Reference`_ for details. + +.. note:: This minimum configuration for monitors assumes that a deployment + tool generates the ``fsid`` and the ``mon.`` key for you. + +Once you deploy a Ceph cluster, you **SHOULD NOT** change the IP address of +the monitors. However, if you decide to change the monitor's IP address, you +must follow a specific procedure. See `Changing a Monitor's IP Address`_ for +details. + +Monitors can also be found by clients using DNS SRV records. See `Monitor lookup through DNS`_ for details. + +Cluster ID +---------- + +Each Ceph Storage Cluster has a unique identifier (``fsid``). If specified, it +usually appears under the ``[global]`` section of the configuration file. +Deployment tools usually generate the ``fsid`` and store it in the monitor map, +so the value may not appear in a configuration file. The ``fsid`` makes it +possible to run daemons for multiple clusters on the same hardware. + +``fsid`` + +:Description: The cluster ID. One per cluster. +:Type: UUID +:Required: Yes. +:Default: N/A. May be generated by a deployment tool if not specified. + +.. note:: Do not set this value if you use a deployment tool that does + it for you. + + +.. index:: Ceph Monitor; initial members + +Initial Members +--------------- + +We recommend running a production Ceph Storage Cluster with at least three Ceph +Monitors to ensure high availability. When you run multiple monitors, you may +specify the initial monitors that must be members of the cluster in order to +establish a quorum. This may reduce the time it takes for your cluster to come +online. + +.. code-block:: ini + + [mon] + mon initial members = a,b,c + + +``mon initial members`` + +:Description: The IDs of initial monitors in a cluster during startup. If + specified, Ceph requires an odd number of monitors to form an + initial quorum (e.g., 3). + +:Type: String +:Default: None + +.. note:: A *majority* of monitors in your cluster must be able to reach + each other in order to establish a quorum. You can decrease the initial + number of monitors to establish a quorum with this setting. + +.. index:: Ceph Monitor; data path + +Data +---- + +Ceph provides a default path where Ceph Monitors store data. For optimal +performance in a production Ceph Storage Cluster, we recommend running Ceph +Monitors on separate hosts and drives from Ceph OSD Daemons. As leveldb is using +``mmap()`` for writing the data, Ceph Monitors flush their data from memory to disk +very often, which can interfere with Ceph OSD Daemon workloads if the data +store is co-located with the OSD Daemons. + +In Ceph versions 0.58 and earlier, Ceph Monitors store their data in files. This +approach allows users to inspect monitor data with common tools like ``ls`` +and ``cat``. However, it doesn't provide strong consistency. + +In Ceph versions 0.59 and later, Ceph Monitors store their data as key/value +pairs. Ceph Monitors require `ACID`_ transactions. Using a data store prevents +recovering Ceph Monitors from running corrupted versions through Paxos, and it +enables multiple modification operations in one single atomic batch, among other +advantages. + +Generally, we do not recommend changing the default data location. If you modify +the default location, we recommend that you make it uniform across Ceph Monitors +by setting it in the ``[mon]`` section of the configuration file. + + +``mon data`` + +:Description: The monitor's data location. +:Type: String +:Default: ``/var/lib/ceph/mon/$cluster-$id`` + + +``mon data size warn`` + +:Description: Issue a ``HEALTH_WARN`` in cluster log when the monitor's data + store goes over 15GB. +:Type: Integer +:Default: 15*1024*1024*1024* + + +``mon data avail warn`` + +:Description: Issue a ``HEALTH_WARN`` in cluster log when the available disk + space of monitor's data store is lower or equal to this + percentage. +:Type: Integer +:Default: 30 + + +``mon data avail crit`` + +:Description: Issue a ``HEALTH_ERR`` in cluster log when the available disk + space of monitor's data store is lower or equal to this + percentage. +:Type: Integer +:Default: 5 + + +``mon warn on cache pools without hit sets`` + +:Description: Issue a ``HEALTH_WARN`` in cluster log if a cache pool does not + have the ``hit_set_type`` value configured. + See :ref:`hit_set_type <hit_set_type>` for more + details. +:Type: Boolean +:Default: True + + +``mon warn on crush straw calc version zero`` + +:Description: Issue a ``HEALTH_WARN`` in cluster log if the CRUSH's + ``straw_calc_version`` is zero. See + :ref:`CRUSH map tunables <crush-map-tunables>` for + details. +:Type: Boolean +:Default: True + + +``mon warn on legacy crush tunables`` + +:Description: Issue a ``HEALTH_WARN`` in cluster log if + CRUSH tunables are too old (older than ``mon_min_crush_required_version``) +:Type: Boolean +:Default: True + + +``mon crush min required version`` + +:Description: The minimum tunable profile version required by the cluster. + See + :ref:`CRUSH map tunables <crush-map-tunables>` for + details. +:Type: String +:Default: ``firefly`` + + +``mon warn on osd down out interval zero`` + +:Description: Issue a ``HEALTH_WARN`` in cluster log if + ``mon osd down out interval`` is zero. Having this option set to + zero on the leader acts much like the ``noout`` flag. It's hard + to figure out what's going wrong with clusters without the + ``noout`` flag set but acting like that just the same, so we + report a warning in this case. +:Type: Boolean +:Default: True + + +``mon warn on slow ping ratio`` + +:Description: Issue a ``HEALTH_WARN`` in cluster log if any heartbeat + between OSDs exceeds ``mon warn on slow ping ratio`` + of ``osd heartbeat grace``. The default is 5%. +:Type: Float +:Default: ``0.05`` + + +``mon warn on slow ping time`` + +:Description: Override ``mon warn on slow ping ratio`` with a specific value. + Issue a ``HEALTH_WARN`` in cluster log if any heartbeat + between OSDs exceeds ``mon warn on slow ping time`` + milliseconds. The default is 0 (disabled). +:Type: Integer +:Default: ``0`` + + +``mon warn on pool no redundancy`` + +:Description: Issue a ``HEALTH_WARN`` in cluster log if any pool is + configured with no replicas. +:Type: Boolean +:Default: ``True`` + + +``mon cache target full warn ratio`` + +:Description: Position between pool's ``cache_target_full`` and + ``target_max_object`` where we start warning +:Type: Float +:Default: ``0.66`` + + +``mon health data update interval`` + +:Description: How often (in seconds) the monitor in quorum shares its health + status with its peers. (negative number disables it) +:Type: Float +:Default: ``60`` + + +``mon health to clog`` + +:Description: Enable sending health summary to cluster log periodically. +:Type: Boolean +:Default: True + + +``mon health to clog tick interval`` + +:Description: How often (in seconds) the monitor send health summary to cluster + log (a non-positive number disables it). If current health summary + is empty or identical to the last time, monitor will not send it + to cluster log. +:Type: Float +:Default: 60.000000 + + +``mon health to clog interval`` + +:Description: How often (in seconds) the monitor send health summary to cluster + log (a non-positive number disables it). Monitor will always + send the summary to cluster log no matter if the summary changes + or not. +:Type: Integer +:Default: 3600 + + + +.. index:: Ceph Storage Cluster; capacity planning, Ceph Monitor; capacity planning + +Storage Capacity +---------------- + +When a Ceph Storage Cluster gets close to its maximum capacity (i.e., ``mon osd +full ratio``), Ceph prevents you from writing to or reading from Ceph OSD +Daemons as a safety measure to prevent data loss. Therefore, letting a +production Ceph Storage Cluster approach its full ratio is not a good practice, +because it sacrifices high availability. The default full ratio is ``.95``, or +95% of capacity. This a very aggressive setting for a test cluster with a small +number of OSDs. + +.. tip:: When monitoring your cluster, be alert to warnings related to the + ``nearfull`` ratio. This means that a failure of some OSDs could result + in a temporary service disruption if one or more OSDs fails. Consider adding + more OSDs to increase storage capacity. + +A common scenario for test clusters involves a system administrator removing a +Ceph OSD Daemon from the Ceph Storage Cluster to watch the cluster rebalance; +then, removing another Ceph OSD Daemon, and so on until the Ceph Storage Cluster +eventually reaches the full ratio and locks up. We recommend a bit of capacity +planning even with a test cluster. Planning enables you to gauge how much spare +capacity you will need in order to maintain high availability. Ideally, you want +to plan for a series of Ceph OSD Daemon failures where the cluster can recover +to an ``active + clean`` state without replacing those Ceph OSD Daemons +immediately. You can run a cluster in an ``active + degraded`` state, but this +is not ideal for normal operating conditions. + +The following diagram depicts a simplistic Ceph Storage Cluster containing 33 +Ceph Nodes with one Ceph OSD Daemon per host, each Ceph OSD Daemon reading from +and writing to a 3TB drive. So this exemplary Ceph Storage Cluster has a maximum +actual capacity of 99TB. With a ``mon osd full ratio`` of ``0.95``, if the Ceph +Storage Cluster falls to 5TB of remaining capacity, the cluster will not allow +Ceph Clients to read and write data. So the Ceph Storage Cluster's operating +capacity is 95TB, not 99TB. + +.. ditaa:: + +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ + | Rack 1 | | Rack 2 | | Rack 3 | | Rack 4 | | Rack 5 | | Rack 6 | + | cCCC | | cF00 | | cCCC | | cCCC | | cCCC | | cCCC | + +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ + | OSD 1 | | OSD 7 | | OSD 13 | | OSD 19 | | OSD 25 | | OSD 31 | + +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ + | OSD 2 | | OSD 8 | | OSD 14 | | OSD 20 | | OSD 26 | | OSD 32 | + +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ + | OSD 3 | | OSD 9 | | OSD 15 | | OSD 21 | | OSD 27 | | OSD 33 | + +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ + | OSD 4 | | OSD 10 | | OSD 16 | | OSD 22 | | OSD 28 | | Spare | + +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ + | OSD 5 | | OSD 11 | | OSD 17 | | OSD 23 | | OSD 29 | | Spare | + +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ + | OSD 6 | | OSD 12 | | OSD 18 | | OSD 24 | | OSD 30 | | Spare | + +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ + +It is normal in such a cluster for one or two OSDs to fail. A less frequent but +reasonable scenario involves a rack's router or power supply failing, which +brings down multiple OSDs simultaneously (e.g., OSDs 7-12). In such a scenario, +you should still strive for a cluster that can remain operational and achieve an +``active + clean`` state--even if that means adding a few hosts with additional +OSDs in short order. If your capacity utilization is too high, you may not lose +data, but you could still sacrifice data availability while resolving an outage +within a failure domain if capacity utilization of the cluster exceeds the full +ratio. For this reason, we recommend at least some rough capacity planning. + +Identify two numbers for your cluster: + +#. The number of OSDs. +#. The total capacity of the cluster + +If you divide the total capacity of your cluster by the number of OSDs in your +cluster, you will find the mean average capacity of an OSD within your cluster. +Consider multiplying that number by the number of OSDs you expect will fail +simultaneously during normal operations (a relatively small number). Finally +multiply the capacity of the cluster by the full ratio to arrive at a maximum +operating capacity; then, subtract the number of amount of data from the OSDs +you expect to fail to arrive at a reasonable full ratio. Repeat the foregoing +process with a higher number of OSD failures (e.g., a rack of OSDs) to arrive at +a reasonable number for a near full ratio. + +The following settings only apply on cluster creation and are then stored in +the OSDMap. + +.. code-block:: ini + + [global] + + mon osd full ratio = .80 + mon osd backfillfull ratio = .75 + mon osd nearfull ratio = .70 + + +``mon osd full ratio`` + +:Description: The percentage of disk space used before an OSD is + considered ``full``. + +:Type: Float +:Default: ``.95`` + + +``mon osd backfillfull ratio`` + +:Description: The percentage of disk space used before an OSD is + considered too ``full`` to backfill. + +:Type: Float +:Default: ``.90`` + + +``mon osd nearfull ratio`` + +:Description: The percentage of disk space used before an OSD is + considered ``nearfull``. + +:Type: Float +:Default: ``.85`` + + +.. tip:: If some OSDs are nearfull, but others have plenty of capacity, you + may have a problem with the CRUSH weight for the nearfull OSDs. + +.. tip:: These settings only apply during cluster creation. Afterwards they need + to be changed in the OSDMap using ``ceph osd set-nearfull-ratio`` and + ``ceph osd set-full-ratio`` + +.. index:: heartbeat + +Heartbeat +--------- + +Ceph monitors know about the cluster by requiring reports from each OSD, and by +receiving reports from OSDs about the status of their neighboring OSDs. Ceph +provides reasonable default settings for monitor/OSD interaction; however, you +may modify them as needed. See `Monitor/OSD Interaction`_ for details. + + +.. index:: Ceph Monitor; leader, Ceph Monitor; provider, Ceph Monitor; requester, Ceph Monitor; synchronization + +Monitor Store Synchronization +----------------------------- + +When you run a production cluster with multiple monitors (recommended), each +monitor checks to see if a neighboring monitor has a more recent version of the +cluster map (e.g., a map in a neighboring monitor with one or more epoch numbers +higher than the most current epoch in the map of the instant monitor). +Periodically, one monitor in the cluster may fall behind the other monitors to +the point where it must leave the quorum, synchronize to retrieve the most +current information about the cluster, and then rejoin the quorum. For the +purposes of synchronization, monitors may assume one of three roles: + +#. **Leader**: The `Leader` is the first monitor to achieve the most recent + Paxos version of the cluster map. + +#. **Provider**: The `Provider` is a monitor that has the most recent version + of the cluster map, but wasn't the first to achieve the most recent version. + +#. **Requester:** A `Requester` is a monitor that has fallen behind the leader + and must synchronize in order to retrieve the most recent information about + the cluster before it can rejoin the quorum. + +These roles enable a leader to delegate synchronization duties to a provider, +which prevents synchronization requests from overloading the leader--improving +performance. In the following diagram, the requester has learned that it has +fallen behind the other monitors. The requester asks the leader to synchronize, +and the leader tells the requester to synchronize with a provider. + + +.. ditaa:: + +-----------+ +---------+ +----------+ + | Requester | | Leader | | Provider | + +-----------+ +---------+ +----------+ + | | | + | | | + | Ask to Synchronize | | + |------------------->| | + | | | + |<-------------------| | + | Tell Requester to | | + | Sync with Provider | | + | | | + | Synchronize | + |--------------------+-------------------->| + | | | + |<-------------------+---------------------| + | Send Chunk to Requester | + | (repeat as necessary) | + | Requester Acks Chuck to Provider | + |--------------------+-------------------->| + | | + | Sync Complete | + | Notification | + |------------------->| + | | + |<-------------------| + | Ack | + | | + + +Synchronization always occurs when a new monitor joins the cluster. During +runtime operations, monitors may receive updates to the cluster map at different +times. This means the leader and provider roles may migrate from one monitor to +another. If this happens while synchronizing (e.g., a provider falls behind the +leader), the provider can terminate synchronization with a requester. + +Once synchronization is complete, Ceph requires trimming across the cluster. +Trimming requires that the placement groups are ``active + clean``. + + +``mon sync trim timeout`` + +:Description: +:Type: Double +:Default: ``30.0`` + + +``mon sync heartbeat timeout`` + +:Description: +:Type: Double +:Default: ``30.0`` + + +``mon sync heartbeat interval`` + +:Description: +:Type: Double +:Default: ``5.0`` + + +``mon sync backoff timeout`` + +:Description: +:Type: Double +:Default: ``30.0`` + + +``mon sync timeout`` + +:Description: Number of seconds the monitor will wait for the next update + message from its sync provider before it gives up and bootstrap + again. +:Type: Double +:Default: ``60.0`` + + +``mon sync max retries`` + +:Description: +:Type: Integer +:Default: ``5`` + + +``mon sync max payload size`` + +:Description: The maximum size for a sync payload (in bytes). +:Type: 32-bit Integer +:Default: ``1045676`` + + +``paxos max join drift`` + +:Description: The maximum Paxos iterations before we must first sync the + monitor data stores. When a monitor finds that its peer is too + far ahead of it, it will first sync with data stores before moving + on. +:Type: Integer +:Default: ``10`` + +``paxos stash full interval`` + +:Description: How often (in commits) to stash a full copy of the PaxosService state. + Current this setting only affects ``mds``, ``mon``, ``auth`` and ``mgr`` + PaxosServices. +:Type: Integer +:Default: 25 + +``paxos propose interval`` + +:Description: Gather updates for this time interval before proposing + a map update. +:Type: Double +:Default: ``1.0`` + + +``paxos min`` + +:Description: The minimum number of paxos states to keep around +:Type: Integer +:Default: 500 + + +``paxos min wait`` + +:Description: The minimum amount of time to gather updates after a period of + inactivity. +:Type: Double +:Default: ``0.05`` + + +``paxos trim min`` + +:Description: Number of extra proposals tolerated before trimming +:Type: Integer +:Default: 250 + + +``paxos trim max`` + +:Description: The maximum number of extra proposals to trim at a time +:Type: Integer +:Default: 500 + + +``paxos service trim min`` + +:Description: The minimum amount of versions to trigger a trim (0 disables it) +:Type: Integer +:Default: 250 + + +``paxos service trim max`` + +:Description: The maximum amount of versions to trim during a single proposal (0 disables it) +:Type: Integer +:Default: 500 + + +``mon max log epochs`` + +:Description: The maximum amount of log epochs to trim during a single proposal +:Type: Integer +:Default: 500 + + +``mon max pgmap epochs`` + +:Description: The maximum amount of pgmap epochs to trim during a single proposal +:Type: Integer +:Default: 500 + + +``mon mds force trim to`` + +:Description: Force monitor to trim mdsmaps to this point (0 disables it. + dangerous, use with care) +:Type: Integer +:Default: 0 + + +``mon osd force trim to`` + +:Description: Force monitor to trim osdmaps to this point, even if there is + PGs not clean at the specified epoch (0 disables it. dangerous, + use with care) +:Type: Integer +:Default: 0 + +``mon osd cache size`` + +:Description: The size of osdmaps cache, not to rely on underlying store's cache +:Type: Integer +:Default: 10 + + +``mon election timeout`` + +:Description: On election proposer, maximum waiting time for all ACKs in seconds. +:Type: Float +:Default: ``5`` + + +``mon lease`` + +:Description: The length (in seconds) of the lease on the monitor's versions. +:Type: Float +:Default: ``5`` + + +``mon lease renew interval factor`` + +:Description: ``mon lease`` \* ``mon lease renew interval factor`` will be the + interval for the Leader to renew the other monitor's leases. The + factor should be less than ``1.0``. +:Type: Float +:Default: ``0.6`` + + +``mon lease ack timeout factor`` + +:Description: The Leader will wait ``mon lease`` \* ``mon lease ack timeout factor`` + for the Providers to acknowledge the lease extension. +:Type: Float +:Default: ``2.0`` + + +``mon accept timeout factor`` + +:Description: The Leader will wait ``mon lease`` \* ``mon accept timeout factor`` + for the Requester(s) to accept a Paxos update. It is also used + during the Paxos recovery phase for similar purposes. +:Type: Float +:Default: ``2.0`` + + +``mon min osdmap epochs`` + +:Description: Minimum number of OSD map epochs to keep at all times. +:Type: 32-bit Integer +:Default: ``500`` + + +``mon max pgmap epochs`` + +:Description: Maximum number of PG map epochs the monitor should keep. +:Type: 32-bit Integer +:Default: ``500`` + + +``mon max log epochs`` + +:Description: Maximum number of Log epochs the monitor should keep. +:Type: 32-bit Integer +:Default: ``500`` + + + +.. index:: Ceph Monitor; clock + +Clock +----- + +Ceph daemons pass critical messages to each other, which must be processed +before daemons reach a timeout threshold. If the clocks in Ceph monitors +are not synchronized, it can lead to a number of anomalies. For example: + +- Daemons ignoring received messages (e.g., timestamps outdated) +- Timeouts triggered too soon/late when a message wasn't received in time. + +See `Monitor Store Synchronization`_ for details. + + +.. tip:: You SHOULD install NTP on your Ceph monitor hosts to + ensure that the monitor cluster operates with synchronized clocks. + +Clock drift may still be noticeable with NTP even though the discrepancy is not +yet harmful. Ceph's clock drift / clock skew warnings may get triggered even +though NTP maintains a reasonable level of synchronization. Increasing your +clock drift may be tolerable under such circumstances; however, a number of +factors such as workload, network latency, configuring overrides to default +timeouts and the `Monitor Store Synchronization`_ settings may influence +the level of acceptable clock drift without compromising Paxos guarantees. + +Ceph provides the following tunable options to allow you to find +acceptable values. + + +``clock offset`` + +:Description: How much to offset the system clock. See ``Clock.cc`` for details. +:Type: Double +:Default: ``0`` + + +.. deprecated:: 0.58 + +``mon tick interval`` + +:Description: A monitor's tick interval in seconds. +:Type: 32-bit Integer +:Default: ``5`` + + +``mon clock drift allowed`` + +:Description: The clock drift in seconds allowed between monitors. +:Type: Float +:Default: ``.050`` + + +``mon clock drift warn backoff`` + +:Description: Exponential backoff for clock drift warnings +:Type: Float +:Default: ``5`` + + +``mon timecheck interval`` + +:Description: The time check interval (clock drift check) in seconds + for the Leader. + +:Type: Float +:Default: ``300.0`` + + +``mon timecheck skew interval`` + +:Description: The time check interval (clock drift check) in seconds when in + presence of a skew in seconds for the Leader. +:Type: Float +:Default: ``30.0`` + + +Client +------ + +``mon client hunt interval`` + +:Description: The client will try a new monitor every ``N`` seconds until it + establishes a connection. + +:Type: Double +:Default: ``3.0`` + + +``mon client ping interval`` + +:Description: The client will ping the monitor every ``N`` seconds. +:Type: Double +:Default: ``10.0`` + + +``mon client max log entries per message`` + +:Description: The maximum number of log entries a monitor will generate + per client message. + +:Type: Integer +:Default: ``1000`` + + +``mon client bytes`` + +:Description: The amount of client message data allowed in memory (in bytes). +:Type: 64-bit Integer Unsigned +:Default: ``100ul << 20`` + + +Pool settings +============= +Since version v0.94 there is support for pool flags which allow or disallow changes to be made to pools. + +Monitors can also disallow removal of pools if configured that way. + +``mon allow pool delete`` + +:Description: If the monitors should allow pools to be removed. Regardless of what the pool flags say. +:Type: Boolean +:Default: ``false`` + +``osd pool default ec fast read`` + +:Description: Whether to turn on fast read on the pool or not. It will be used as + the default setting of newly created erasure coded pools if ``fast_read`` + is not specified at create time. +:Type: Boolean +:Default: ``false`` + +``osd pool default flag hashpspool`` + +:Description: Set the hashpspool flag on new pools +:Type: Boolean +:Default: ``true`` + +``osd pool default flag nodelete`` + +:Description: Set the nodelete flag on new pools. Prevents allow pool removal with this flag in any way. +:Type: Boolean +:Default: ``false`` + +``osd pool default flag nopgchange`` + +:Description: Set the nopgchange flag on new pools. Does not allow the number of PGs to be changed for a pool. +:Type: Boolean +:Default: ``false`` + +``osd pool default flag nosizechange`` + +:Description: Set the nosizechange flag on new pools. Does not allow the size to be changed of pool. +:Type: Boolean +:Default: ``false`` + +For more information about the pool flags see `Pool values`_. + +Miscellaneous +============= + + +``mon max osd`` + +:Description: The maximum number of OSDs allowed in the cluster. +:Type: 32-bit Integer +:Default: ``10000`` + +``mon globalid prealloc`` + +:Description: The number of global IDs to pre-allocate for clients and daemons in the cluster. +:Type: 32-bit Integer +:Default: ``100`` + +``mon subscribe interval`` + +:Description: The refresh interval (in seconds) for subscriptions. The + subscription mechanism enables obtaining the cluster maps + and log information. + +:Type: Double +:Default: ``86400`` + + +``mon stat smooth intervals`` + +:Description: Ceph will smooth statistics over the last ``N`` PG maps. +:Type: Integer +:Default: ``2`` + + +``mon probe timeout`` + +:Description: Number of seconds the monitor will wait to find peers before bootstrapping. +:Type: Double +:Default: ``2.0`` + + +``mon daemon bytes`` + +:Description: The message memory cap for metadata server and OSD messages (in bytes). +:Type: 64-bit Integer Unsigned +:Default: ``400ul << 20`` + + +``mon max log entries per event`` + +:Description: The maximum number of log entries per event. +:Type: Integer +:Default: ``4096`` + + +``mon osd prime pg temp`` + +:Description: Enables or disable priming the PGMap with the previous OSDs when an out + OSD comes back into the cluster. With the ``true`` setting the clients + will continue to use the previous OSDs until the newly in OSDs as that + PG peered. +:Type: Boolean +:Default: ``true`` + + +``mon osd prime pg temp max time`` + +:Description: How much time in seconds the monitor should spend trying to prime the + PGMap when an out OSD comes back into the cluster. +:Type: Float +:Default: ``0.5`` + + +``mon osd prime pg temp max time estimate`` + +:Description: Maximum estimate of time spent on each PG before we prime all PGs + in parallel. +:Type: Float +:Default: ``0.25`` + + +``mon osd allow primary affinity`` + +:Description: allow ``primary_affinity`` to be set in the osdmap. +:Type: Boolean +:Default: False + + +``mon mds skip sanity`` + +:Description: Skip safety assertions on FSMap (in case of bugs where we want to + continue anyway). Monitor terminates if the FSMap sanity check + fails, but we can disable it by enabling this option. +:Type: Boolean +:Default: False + + +``mon max mdsmap epochs`` + +:Description: The maximum amount of mdsmap epochs to trim during a single proposal. +:Type: Integer +:Default: 500 + + +``mon config key max entry size`` + +:Description: The maximum size of config-key entry (in bytes) +:Type: Integer +:Default: 4096 + + +``mon scrub interval`` + +:Description: How often (in seconds) the monitor scrub its store by comparing + the stored checksums with the computed ones of all the stored + keys. +:Type: Integer +:Default: 3600*24 + + +``mon scrub max keys`` + +:Description: The maximum number of keys to scrub each time. +:Type: Integer +:Default: 100 + + +``mon compact on start`` + +:Description: Compact the database used as Ceph Monitor store on + ``ceph-mon`` start. A manual compaction helps to shrink the + monitor database and improve the performance of it if the regular + compaction fails to work. +:Type: Boolean +:Default: False + + +``mon compact on bootstrap`` + +:Description: Compact the database used as Ceph Monitor store on + on bootstrap. Monitor starts probing each other for creating + a quorum after bootstrap. If it times out before joining the + quorum, it will start over and bootstrap itself again. +:Type: Boolean +:Default: False + + +``mon compact on trim`` + +:Description: Compact a certain prefix (including paxos) when we trim its old states. +:Type: Boolean +:Default: True + + +``mon cpu threads`` + +:Description: Number of threads for performing CPU intensive work on monitor. +:Type: Boolean +:Default: True + + +``mon osd mapping pgs per chunk`` + +:Description: We calculate the mapping from placement group to OSDs in chunks. + This option specifies the number of placement groups per chunk. +:Type: Integer +:Default: 4096 + +``mon session timeout`` + +:Description: Monitor will terminate inactive sessions stay idle over this + time limit. +:Type: Integer +:Default: 300 + +``mon osd cache size min`` + +:Description: The minimum amount of bytes to be kept mapped in memory for osd + monitor caches. +:Type: 64-bit Integer +:Default: 134217728 + +``mon memory target`` + +:Description: The amount of bytes pertaining to osd monitor caches and kv cache + to be kept mapped in memory with cache auto-tuning enabled. +:Type: 64-bit Integer +:Default: 2147483648 + +``mon memory autotune`` + +:Description: Autotune the cache memory being used for osd monitors and kv + database. +:Type: Boolean +:Default: True + + +.. _Paxos: https://en.wikipedia.org/wiki/Paxos_(computer_science) +.. _Monitor Keyrings: ../../../dev/mon-bootstrap#secret-keys +.. _Ceph configuration file: ../ceph-conf/#monitors +.. _Network Configuration Reference: ../network-config-ref +.. _Monitor lookup through DNS: ../mon-lookup-dns +.. _ACID: https://en.wikipedia.org/wiki/ACID +.. _Adding/Removing a Monitor: ../../operations/add-or-rm-mons +.. _Add/Remove a Monitor (ceph-deploy): ../../deployment/ceph-deploy-mon +.. _Monitoring a Cluster: ../../operations/monitoring +.. _Monitoring OSDs and PGs: ../../operations/monitoring-osd-pg +.. _Bootstrapping a Monitor: ../../../dev/mon-bootstrap +.. _Changing a Monitor's IP Address: ../../operations/add-or-rm-mons#changing-a-monitor-s-ip-address +.. _Monitor/OSD Interaction: ../mon-osd-interaction +.. _Scalability and High Availability: ../../../architecture#scalability-and-high-availability +.. _Pool values: ../../operations/pools/#set-pool-values diff --git a/doc/rados/configuration/mon-lookup-dns.rst b/doc/rados/configuration/mon-lookup-dns.rst new file mode 100644 index 00000000..e32b3206 --- /dev/null +++ b/doc/rados/configuration/mon-lookup-dns.rst @@ -0,0 +1,51 @@ +=============================== +Looking op Monitors through DNS +=============================== + +Since version 11.0.0 RADOS supports looking up Monitors through DNS. + +This way daemons and clients do not require a *mon host* configuration directive in their ceph.conf configuration file. + +Using DNS SRV TCP records clients are able to look up the monitors. + +This allows for less configuration on clients and monitors. Using a DNS update clients and daemons can be made aware of changes in the monitor topology. + +By default clients and daemons will look for the TCP service called *ceph-mon* which is configured by the *mon_dns_srv_name* configuration directive. + + +``mon dns srv name`` + +:Description: the service name used querying the DNS for the monitor hosts/addresses +:Type: String +:Default: ``ceph-mon`` + +Example +------- +When the DNS search domain is set to *example.com* a DNS zone file might contain the following elements. + +First, create records for the Monitors, either IPv4 (A) or IPv6 (AAAA). + +:: + + mon1.example.com. AAAA 2001:db8::100 + mon2.example.com. AAAA 2001:db8::200 + mon3.example.com. AAAA 2001:db8::300 + +:: + + mon1.example.com. A 192.168.0.1 + mon2.example.com. A 192.168.0.2 + mon3.example.com. A 192.168.0.3 + + +With those records now existing we can create the SRV TCP records with the name *ceph-mon* pointing to the three Monitors. + +:: + + _ceph-mon._tcp.example.com. 60 IN SRV 10 60 6789 mon1.example.com. + _ceph-mon._tcp.example.com. 60 IN SRV 10 60 6789 mon2.example.com. + _ceph-mon._tcp.example.com. 60 IN SRV 10 60 6789 mon3.example.com. + +In this case the Monitors are running on port *6789*, and their priority and weight are all *10* and *60* respectively. + +The current implementation in clients and daemons will *only* respect the priority set in SRV records, and they will only connect to the monitors with lowest-numbered priority. The targets with the same priority will be selected at random. diff --git a/doc/rados/configuration/mon-osd-interaction.rst b/doc/rados/configuration/mon-osd-interaction.rst new file mode 100644 index 00000000..6ef66265 --- /dev/null +++ b/doc/rados/configuration/mon-osd-interaction.rst @@ -0,0 +1,412 @@ +===================================== + Configuring Monitor/OSD Interaction +===================================== + +.. index:: heartbeat + +After you have completed your initial Ceph configuration, you may deploy and run +Ceph. When you execute a command such as ``ceph health`` or ``ceph -s``, the +:term:`Ceph Monitor` reports on the current state of the :term:`Ceph Storage +Cluster`. The Ceph Monitor knows about the Ceph Storage Cluster by requiring +reports from each :term:`Ceph OSD Daemon`, and by receiving reports from Ceph +OSD Daemons about the status of their neighboring Ceph OSD Daemons. If the Ceph +Monitor doesn't receive reports, or if it receives reports of changes in the +Ceph Storage Cluster, the Ceph Monitor updates the status of the :term:`Ceph +Cluster Map`. + +Ceph provides reasonable default settings for Ceph Monitor/Ceph OSD Daemon +interaction. However, you may override the defaults. The following sections +describe how Ceph Monitors and Ceph OSD Daemons interact for the purposes of +monitoring the Ceph Storage Cluster. + +.. index:: heartbeat interval + +OSDs Check Heartbeats +===================== + +Each Ceph OSD Daemon checks the heartbeat of other Ceph OSD Daemons at random +intervals less than every 6 seconds. If a neighboring Ceph OSD Daemon doesn't +show a heartbeat within a 20 second grace period, the Ceph OSD Daemon may +consider the neighboring Ceph OSD Daemon ``down`` and report it back to a Ceph +Monitor, which will update the Ceph Cluster Map. You may change this grace +period by adding an ``osd heartbeat grace`` setting under the ``[mon]`` +and ``[osd]`` or ``[global]`` section of your Ceph configuration file, +or by setting the value at runtime. + + +.. ditaa:: + +---------+ +---------+ + | OSD 1 | | OSD 2 | + +---------+ +---------+ + | | + |----+ Heartbeat | + | | Interval | + |<---+ Exceeded | + | | + | Check | + | Heartbeat | + |------------------->| + | | + |<-------------------| + | Heart Beating | + | | + |----+ Heartbeat | + | | Interval | + |<---+ Exceeded | + | | + | Check | + | Heartbeat | + |------------------->| + | | + |----+ Grace | + | | Period | + |<---+ Exceeded | + | | + |----+ Mark | + | | OSD 2 | + |<---+ Down | + + +.. index:: OSD down report + +OSDs Report Down OSDs +===================== + +By default, two Ceph OSD Daemons from different hosts must report to the Ceph +Monitors that another Ceph OSD Daemon is ``down`` before the Ceph Monitors +acknowledge that the reported Ceph OSD Daemon is ``down``. But there is chance +that all the OSDs reporting the failure are hosted in a rack with a bad switch +which has trouble connecting to another OSD. To avoid this sort of false alarm, +we consider the peers reporting a failure a proxy for a potential "subcluster" +over the overall cluster that is similarly laggy. This is clearly not true in +all cases, but will sometimes help us localize the grace correction to a subset +of the system that is unhappy. ``mon osd reporter subtree level`` is used to +group the peers into the "subcluster" by their common ancestor type in CRUSH +map. By default, only two reports from different subtree are required to report +another Ceph OSD Daemon ``down``. You can change the number of reporters from +unique subtrees and the common ancestor type required to report a Ceph OSD +Daemon ``down`` to a Ceph Monitor by adding an ``mon osd min down reporters`` +and ``mon osd reporter subtree level`` settings under the ``[mon]`` section of +your Ceph configuration file, or by setting the value at runtime. + + +.. ditaa:: + + +---------+ +---------+ +---------+ + | OSD 1 | | OSD 2 | | Monitor | + +---------+ +---------+ +---------+ + | | | + | OSD 3 Is Down | | + |---------------+--------------->| + | | | + | | | + | | OSD 3 Is Down | + | |--------------->| + | | | + | | | + | | |---------+ Mark + | | | | OSD 3 + | | |<--------+ Down + + +.. index:: peering failure + +OSDs Report Peering Failure +=========================== + +If a Ceph OSD Daemon cannot peer with any of the Ceph OSD Daemons defined in its +Ceph configuration file (or the cluster map), it will ping a Ceph Monitor for +the most recent copy of the cluster map every 30 seconds. You can change the +Ceph Monitor heartbeat interval by adding an ``osd mon heartbeat interval`` +setting under the ``[osd]`` section of your Ceph configuration file, or by +setting the value at runtime. + +.. ditaa:: + + +---------+ +---------+ +-------+ +---------+ + | OSD 1 | | OSD 2 | | OSD 3 | | Monitor | + +---------+ +---------+ +-------+ +---------+ + | | | | + | Request To | | | + | Peer | | | + |-------------->| | | + |<--------------| | | + | Peering | | + | | | + | Request To | | + | Peer | | + |----------------------------->| | + | | + |----+ OSD Monitor | + | | Heartbeat | + |<---+ Interval Exceeded | + | | + | Failed to Peer with OSD 3 | + |-------------------------------------------->| + |<--------------------------------------------| + | Receive New Cluster Map | + + +.. index:: OSD status + +OSDs Report Their Status +======================== + +If an Ceph OSD Daemon doesn't report to a Ceph Monitor, the Ceph Monitor will +consider the Ceph OSD Daemon ``down`` after the ``mon osd report timeout`` +elapses. A Ceph OSD Daemon sends a report to a Ceph Monitor when a reportable +event such as a failure, a change in placement group stats, a change in +``up_thru`` or when it boots within 5 seconds. You can change the Ceph OSD +Daemon minimum report interval by adding an ``osd mon report interval`` +setting under the ``[osd]`` section of your Ceph configuration file, or by +setting the value at runtime. A Ceph OSD Daemon sends a report to a Ceph +Monitor every 120 seconds irrespective of whether any notable changes occur. +You can change the Ceph Monitor report interval by adding an ``osd mon report +interval max`` setting under the ``[osd]`` section of your Ceph configuration +file, or by setting the value at runtime. + + +.. ditaa:: + + +---------+ +---------+ + | OSD 1 | | Monitor | + +---------+ +---------+ + | | + |----+ Report Min | + | | Interval | + |<---+ Exceeded | + | | + |----+ Reportable | + | | Event | + |<---+ Occurs | + | | + | Report To | + | Monitor | + |------------------->| + | | + |----+ Report Max | + | | Interval | + |<---+ Exceeded | + | | + | Report To | + | Monitor | + |------------------->| + | | + |----+ Monitor | + | | Fails | + |<---+ | + +----+ Monitor OSD + | | Report Timeout + |<---+ Exceeded + | + +----+ Mark + | | OSD 1 + |<---+ Down + + + + +Configuration Settings +====================== + +When modifying heartbeat settings, you should include them in the ``[global]`` +section of your configuration file. + +.. index:: monitor heartbeat + +Monitor Settings +---------------- + +``mon osd min up ratio`` + +:Description: The minimum ratio of ``up`` Ceph OSD Daemons before Ceph will + mark Ceph OSD Daemons ``down``. + +:Type: Double +:Default: ``.3`` + + +``mon osd min in ratio`` + +:Description: The minimum ratio of ``in`` Ceph OSD Daemons before Ceph will + mark Ceph OSD Daemons ``out``. + +:Type: Double +:Default: ``.75`` + + +``mon osd laggy halflife`` + +:Description: The number of seconds laggy estimates will decay. +:Type: Integer +:Default: ``60*60`` + + +``mon osd laggy weight`` + +:Description: The weight for new samples in laggy estimation decay. +:Type: Double +:Default: ``0.3`` + + + +``mon osd laggy max interval`` + +:Description: Maximum value of ``laggy_interval`` in laggy estimations (in seconds). + Monitor uses an adaptive approach to evaluate the ``laggy_interval`` of + a certain OSD. This value will be used to calculate the grace time for + that OSD. +:Type: Integer +:Default: 300 + +``mon osd adjust heartbeat grace`` + +:Description: If set to ``true``, Ceph will scale based on laggy estimations. +:Type: Boolean +:Default: ``true`` + + +``mon osd adjust down out interval`` + +:Description: If set to ``true``, Ceph will scaled based on laggy estimations. +:Type: Boolean +:Default: ``true`` + + +``mon osd auto mark in`` + +:Description: Ceph will mark any booting Ceph OSD Daemons as ``in`` + the Ceph Storage Cluster. + +:Type: Boolean +:Default: ``false`` + + +``mon osd auto mark auto out in`` + +:Description: Ceph will mark booting Ceph OSD Daemons auto marked ``out`` + of the Ceph Storage Cluster as ``in`` the cluster. + +:Type: Boolean +:Default: ``true`` + + +``mon osd auto mark new in`` + +:Description: Ceph will mark booting new Ceph OSD Daemons as ``in`` the + Ceph Storage Cluster. + +:Type: Boolean +:Default: ``true`` + + +``mon osd down out interval`` + +:Description: The number of seconds Ceph waits before marking a Ceph OSD Daemon + ``down`` and ``out`` if it doesn't respond. + +:Type: 32-bit Integer +:Default: ``600`` + + +``mon osd down out subtree limit`` + +:Description: The smallest :term:`CRUSH` unit type that Ceph will **not** + automatically mark out. For instance, if set to ``host`` and if + all OSDs of a host are down, Ceph will not automatically mark out + these OSDs. + +:Type: String +:Default: ``rack`` + + +``mon osd report timeout`` + +:Description: The grace period in seconds before declaring + unresponsive Ceph OSD Daemons ``down``. + +:Type: 32-bit Integer +:Default: ``900`` + +``mon osd min down reporters`` + +:Description: The minimum number of Ceph OSD Daemons required to report a + ``down`` Ceph OSD Daemon. + +:Type: 32-bit Integer +:Default: ``2`` + + +``mon osd reporter subtree level`` + +:Description: In which level of parent bucket the reporters are counted. The OSDs + send failure reports to monitor if they find its peer is not responsive. + And monitor mark the reported OSD out and then down after a grace period. +:Type: String +:Default: ``host`` + + +.. index:: OSD hearbeat + +OSD Settings +------------ + +``osd heartbeat address`` + +:Description: An Ceph OSD Daemon's network address for heartbeats. +:Type: Address +:Default: The host address. + + +``osd heartbeat interval`` + +:Description: How often an Ceph OSD Daemon pings its peers (in seconds). +:Type: 32-bit Integer +:Default: ``6`` + + +``osd heartbeat grace`` + +:Description: The elapsed time when a Ceph OSD Daemon hasn't shown a heartbeat + that the Ceph Storage Cluster considers it ``down``. + This setting has to be set in both the [mon] and [osd] or [global] + section so that it is read by both the MON and OSD daemons. +:Type: 32-bit Integer +:Default: ``20`` + + +``osd mon heartbeat interval`` + +:Description: How often the Ceph OSD Daemon pings a Ceph Monitor if it has no + Ceph OSD Daemon peers. + +:Type: 32-bit Integer +:Default: ``30`` + + +``osd mon heartbeat stat stale`` + +:Description: Stop reporting on heartbeat ping times which haven't been updated for + this many seconds. Set to zero to disable this action. + +:Type: 32-bit Integer +:Default: ``3600`` + + +``osd mon report interval`` + +:Description: The number of seconds a Ceph OSD Daemon may wait + from startup or another reportable event before reporting + to a Ceph Monitor. + +:Type: 32-bit Integer +:Default: ``5`` + + +``osd mon ack timeout`` + +:Description: The number of seconds to wait for a Ceph Monitor to acknowledge a + request for statistics. + +:Type: 32-bit Integer +:Default: ``30`` diff --git a/doc/rados/configuration/ms-ref.rst b/doc/rados/configuration/ms-ref.rst new file mode 100644 index 00000000..45c45861 --- /dev/null +++ b/doc/rados/configuration/ms-ref.rst @@ -0,0 +1,133 @@ +=========== + Messaging +=========== + +General Settings +================ + +``ms tcp nodelay`` + +:Description: Disables nagle's algorithm on messenger tcp sessions. +:Type: Boolean +:Required: No +:Default: ``true`` + + +``ms initial backoff`` + +:Description: The initial time to wait before reconnecting on a fault. +:Type: Double +:Required: No +:Default: ``.2`` + + +``ms max backoff`` + +:Description: The maximum time to wait before reconnecting on a fault. +:Type: Double +:Required: No +:Default: ``15.0`` + + +``ms nocrc`` + +:Description: Disables crc on network messages. May increase performance if cpu limited. +:Type: Boolean +:Required: No +:Default: ``false`` + + +``ms die on bad msg`` + +:Description: Debug option; do not configure. +:Type: Boolean +:Required: No +:Default: ``false`` + + +``ms dispatch throttle bytes`` + +:Description: Throttles total size of messages waiting to be dispatched. +:Type: 64-bit Unsigned Integer +:Required: No +:Default: ``100 << 20`` + + +``ms bind ipv6`` + +:Description: Enable if you want your daemons to bind to IPv6 address instead of IPv4 ones. (Not required if you specify a daemon or cluster IP.) +:Type: Boolean +:Required: No +:Default: ``false`` + + +``ms rwthread stack bytes`` + +:Description: Debug option for stack size; do not configure. +:Type: 64-bit Unsigned Integer +:Required: No +:Default: ``1024 << 10`` + + +``ms tcp read timeout`` + +:Description: Controls how long (in seconds) the messenger will wait before closing an idle connection. +:Type: 64-bit Unsigned Integer +:Required: No +:Default: ``900`` + + +``ms inject socket failures`` + +:Description: Debug option; do not configure. +:Type: 64-bit Unsigned Integer +:Required: No +:Default: ``0`` + +Async messenger options +======================= + + +``ms async transport type`` + +:Description: Transport type used by Async Messenger. Can be ``posix``, ``dpdk`` + or ``rdma``. Posix uses standard TCP/IP networking and is default. + Other transports may be experimental and support may be limited. +:Type: String +:Required: No +:Default: ``posix`` + + +``ms async op threads`` + +:Description: Initial number of worker threads used by each Async Messenger instance. + Should be at least equal to highest number of replicas, but you can + decrease it if you are low on CPU core count and/or you host a lot of + OSDs on single server. +:Type: 64-bit Unsigned Integer +:Required: No +:Default: ``3`` + + +``ms async max op threads`` + +:Description: Maximum number of worker threads used by each Async Messenger instance. + Set to lower values when your machine has limited CPU count, and increase + when your CPUs are underutilized (i. e. one or more of CPUs are + constantly on 100% load during I/O operations). +:Type: 64-bit Unsigned Integer +:Required: No +:Default: ``5`` + + +``ms async send inline`` + +:Description: Send messages directly from the thread that generated them instead of + queuing and sending from Async Messenger thread. This option is known + to decrease performance on systems with a lot of CPU cores, so it's + disabled by default. +:Type: Boolean +:Required: No +:Default: ``false`` + + diff --git a/doc/rados/configuration/msgr2.rst b/doc/rados/configuration/msgr2.rst new file mode 100644 index 00000000..cea412e7 --- /dev/null +++ b/doc/rados/configuration/msgr2.rst @@ -0,0 +1,224 @@ +.. _msgr2: + +Messenger v2 +============ + +What is it +---------- + +The messenger v2 protocol, or msgr2, is the second major revision on +Ceph's on-wire protocol. It brings with it several key features: + +* A *secure* mode that encrypts all data passing over the network +* Improved encapsulation of authentication payloads, enabling future + integration of new authentication modes like Kerberos +* Improved earlier feature advertisement and negotiation, enabling + future protocol revisions + +Ceph daemons can now bind to multiple ports, allowing both legacy Ceph +clients and new v2-capable clients to connect to the same cluster. + +By default, monitors now bind to the new IANA-assigned port ``3300`` +(ce4h or 0xce4) for the new v2 protocol, while also binding to the +old default port ``6789`` for the legacy v1 protocol. + +Address formats +--------------- + +Prior to nautilus, all network addresses were rendered like +``1.2.3.4:567/89012`` where there was an IP address, a port, and a +nonce to uniquely identify a client or daemon on the network. +Starting with nautilus, we now have three different address types: + +* **v2**: ``v2:1.2.3.4:578/89012`` identifies a daemon binding to a + port speaking the new v2 protocol +* **v1**: ``v1:1.2.3.4:578/89012`` identifies a daemon binding to a + port speaking the legacy v1 protocol. Any address that was + previously shown with any prefix is now shown as a ``v1:`` address. +* **TYPE_ANY** addresses identify a client that can speak either + version of the protocol. Prior to nautilus, clients would appear as + ``1.2.3.4:0/123456``, where the port of 0 indicates they are clients + and do not accept incoming connections. Starting with Nautilus, + these clients are now internally represented by a **TYPE_ANY** + address, and still shown with no prefix, because they may + connect to daemons using the v2 or v1 protocol, depending on what + protocol(s) the daemons are using. + +Because daemons now bind to multiple ports, they are now described by +a vector of addresses instead of a single address. For example, +dumping the monitor map on a Nautilus cluster now includes lines +like:: + + epoch 1 + fsid 50fcf227-be32-4bcb-8b41-34ca8370bd16 + last_changed 2019-02-25 11:10:46.700821 + created 2019-02-25 11:10:46.700821 + min_mon_release 14 (nautilus) + 0: [v2:10.0.0.10:3300/0,v1:10.0.0.10:6789/0] mon.foo + 1: [v2:10.0.0.11:3300/0,v1:10.0.0.11:6789/0] mon.bar + 2: [v2:10.0.0.12:3300/0,v1:10.0.0.12:6789/0] mon.baz + +The bracketed list or vector of addresses means that the same daemon can be +reached on multiple ports (and protocols). Any client or other daemon +connecting to that daemon will use the v2 protocol (listed first) if +possible; otherwise it will back to the legacy v1 protocol. Legacy +clients will only see the v1 addresses and will continue to connect as +they did before, with the v1 protocol. + +Starting in Nautilus, the ``mon_host`` configuration option and ``-m +<mon-host>`` command line options support the same bracketed address +vector syntax. + + +Bind configuration options +^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Two new configuration options control whether the v1 and/or v2 +protocol is used: + + * ``ms_bind_msgr1`` [default: true] controls whether a daemon binds + to a port speaking the v1 protocol + * ``ms_bind_msgr2`` [default: true] controls whether a daemon binds + to a port speaking the v2 protocol + +Similarly, two options control whether IPv4 and IPv6 addresses are used: + + * ``ms_bind_ipv4`` [default: true] controls whether a daemon binds + to an IPv4 address + * ``ms_bind_ipv6`` [default: false] controls whether a daemon binds + to an IPv6 address + +.. note: The ability to bind to multiple ports has paved the way for + dual-stack IPv4 and IPv6 support. That said, dual-stack support is + not yet tested as of Nautilus v14.2.0 and likely needs some + additional code changes to work correctly. + +Connection modes +---------------- + +The v2 protocol supports two connection modes: + +* *crc* mode provides: + + - a strong initial authentication when the connection is established + (with cephx, mutual authentication of both parties with protection + from a man-in-the-middle or eavesdropper), and + - a crc32c integrity check to protect against bit flips due to flaky + hardware or cosmic rays + + *crc* mode does *not* provide: + + - secrecy (an eavesdropper on the network can see all + post-authentication traffic as it goes by) or + - protection from a malicious man-in-the-middle (who can deliberate + modify traffic as it goes by, as long as they are careful to + adjust the crc32c values to match) + +* *secure* mode provides: + + - a strong initial authentication when the connection is established + (with cephx, mutual authentication of both parties with protection + from a man-in-the-middle or eavesdropper), and + - full encryption of all post-authentication traffic, including a + cryptographic integrity check. + + In Nautilus, secure mode uses the `AES-GCM + <https://en.wikipedia.org/wiki/Galois/Counter_Mode>`_ stream cipher, + which is generally very fast on modern processors (e.g., faster than + a SHA-256 cryptographic hash). + +Connection mode configuration options +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +For most connections, there are options that control which modes are used: + +* ``ms_cluster_mode`` is the connection mode (or permitted modes) used + for intra-cluster communication between Ceph daemons. If multiple + modes are listed, the modes listed first are preferred. +* ``ms_service_mode`` is a list of permitted modes for clients to use + when connecting to the cluster. +* ``ms_client_mode`` is a list of connection modes, in order of + preference, for clients to use (or allow) when talking to a Ceph + cluster. + +There are a parallel set of options that apply specifically to +monitors, allowing administrators to set different (usually more +secure) requirements on communication with the monitors. + +* ``ms_mon_cluster_mode`` is the connection mode (or permitted modes) + to use between monitors. +* ``ms_mon_service_mode`` is a list of permitted modes for clients or + other Ceph daemons to use when connecting to monitors. +* ``ms_mon_client_mode`` is a list of connection modes, in order of + preference, for clients or non-monitor daemons to use when + connecting to monitors. + + +Transitioning from v1-only to v2-plus-v1 +---------------------------------------- + +By default, ``ms_bind_msgr2`` is true starting with Nautilus 14.2.z. +However, until the monitors start using v2, only limited services will +start advertising v2 addresses. + +For most users, the monitors are binding to the default legacy port ``6789`` for the v1 protocol. When this is the case, enabling v2 is as simple as:: + + ceph mon enable-msgr2 + +If the monitors are bound to non-standard ports, you will need to +specify an additional port for v2 explicitly. For example, if your +monitor ``mon.a`` binds to ``1.2.3.4:1111``, and you want to add v2 on +port ``1112``,:: + + ceph mon set-addrs a [v2:1.2.3.4:1112,v1:1.2.3.4:1111] + +Once the monitors bind to v2, each daemon will start advertising a v2 +address when it is next restarted. + + +Updating ceph.conf and mon_host +------------------------------- + +Prior to Nautilus, a CLI user or daemon will normally discover the +monitors via the ``mon_host`` option in ``/etc/ceph/ceph.conf``. The +syntax for this option has expanded starting with Nautilus to allow +support the new bracketed list format. For example, an old line +like:: + + mon_host = 10.0.0.1:6789,10.0.0.2:6789,10.0.0.3:6789 + +Can be changed to:: + + mon_host = [v2:10.0.0.1:3300/0,v1:10.0.0.1:6789/0],[v2:10.0.0.2:3300/0,v1:10.0.0.2:6789/0],[v2:10.0.0.3:3300/0,v1:10.0.0.3:6789/0] + +However, when default ports are used (``3300`` and ``6789``), they can +be omitted:: + + mon_host = 10.0.0.1,10.0.0.2,10.0.0.3 + +Once v2 has been enabled on the monitors, ``ceph.conf`` may need to be +updated to either specify no ports (this is usually simplest), or +explicitly specify both the v2 and v1 addresses. Note, however, that +the new bracketed syntax is only understood by Nautilus and later, so +do not make that change on hosts that have not yet had their ceph +packages upgraded. + +When you are updating ``ceph.conf``, note the new ``ceph config +generate-minimal-conf`` command (which generates a barebones config +file with just enough information to reach the monitors) and the +``ceph config assimilate-conf`` (which moves config file options into +the monitors' configuration database) may be helpful. For example,:: + + # ceph config assimilate-conf < /etc/ceph/ceph.conf + # ceph config generate-minimal-config > /etc/ceph/ceph.conf.new + # cat /etc/ceph/ceph.conf.new + # minimal ceph.conf for 0e5a806b-0ce5-4bc6-b949-aa6f68f5c2a3 + [global] + fsid = 0e5a806b-0ce5-4bc6-b949-aa6f68f5c2a3 + mon_host = [v2:10.0.0.1:3300/0,v1:10.0.0.1:6789/0] + # mv /etc/ceph/ceph.conf.new /etc/ceph/ceph.conf + +Protocol +-------- + +For a detailed description of the v2 wire protocol, see :ref:`msgr2-protocol`. diff --git a/doc/rados/configuration/network-config-ref.rst b/doc/rados/configuration/network-config-ref.rst new file mode 100644 index 00000000..5a719e3c --- /dev/null +++ b/doc/rados/configuration/network-config-ref.rst @@ -0,0 +1,415 @@ +================================= + Network Configuration Reference +================================= + +Network configuration is critical for building a high performance :term:`Ceph +Storage Cluster`. The Ceph Storage Cluster does not perform request routing or +dispatching on behalf of the :term:`Ceph Client`. Instead, Ceph Clients make +requests directly to Ceph OSD Daemons. Ceph OSD Daemons perform data replication +on behalf of Ceph Clients, which means replication and other factors impose +additional loads on Ceph Storage Cluster networks. + +Our Quick Start configurations provide a trivial `Ceph configuration file`_ that +sets monitor IP addresses and daemon host names only. Unless you specify a +cluster network, Ceph assumes a single "public" network. Ceph functions just +fine with a public network only, but you may see significant performance +improvement with a second "cluster" network in a large cluster. + +It is possible to run a Ceph Storage Cluster with two networks: a public +(front-side) network and a cluster (back-side) network. However, this approach +complicates network configuration (both hardware and software) and does not usually have a significant impact on overall performance. For this reason, we generally recommend that dual-NIC systems either be configured with two IPs on the same network, or bonded. + +If, despite the complexity, one still wishes to use two networks, each +:term:`Ceph Node` will need to have more than one NIC. See `Hardware +Recommendations - Networks`_ for additional details. + +.. ditaa:: + +-------------+ + | Ceph Client | + +----*--*-----+ + | ^ + Request | : Response + v | + /----------------------------------*--*-------------------------------------\ + | Public Network | + \---*--*------------*--*-------------*--*------------*--*------------*--*---/ + ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ + | | | | | | | | | | + | : | : | : | : | : + v v v v v v v v v v + +---*--*---+ +---*--*---+ +---*--*---+ +---*--*---+ +---*--*---+ + | Ceph MON | | Ceph MDS | | Ceph OSD | | Ceph OSD | | Ceph OSD | + +----------+ +----------+ +---*--*---+ +---*--*---+ +---*--*---+ + ^ ^ ^ ^ ^ ^ + The cluster network relieves | | | | | | + OSD replication and heartbeat | : | : | : + traffic from the public network. v v v v v v + /------------------------------------*--*------------*--*------------*--*---\ + | cCCC Cluster Network | + \---------------------------------------------------------------------------/ + + +IP Tables +========= + +By default, daemons `bind`_ to ports within the ``6800:7300`` range. You may +configure this range at your discretion. Before configuring your IP tables, +check the default ``iptables`` configuration. + + sudo iptables -L + +Some Linux distributions include rules that reject all inbound requests +except SSH from all network interfaces. For example:: + + REJECT all -- anywhere anywhere reject-with icmp-host-prohibited + +You will need to delete these rules on both your public and cluster networks +initially, and replace them with appropriate rules when you are ready to +harden the ports on your Ceph Nodes. + + +Monitor IP Tables +----------------- + +Ceph Monitors listen on ports ``3300`` and ``6789`` by +default. Additionally, Ceph Monitors always operate on the public +network. When you add the rule using the example below, make sure you +replace ``{iface}`` with the public network interface (e.g., ``eth0``, +``eth1``, etc.), ``{ip-address}`` with the IP address of the public +network and ``{netmask}`` with the netmask for the public network. :: + + sudo iptables -A INPUT -i {iface} -p tcp -s {ip-address}/{netmask} --dport 6789 -j ACCEPT + + +MDS and Manager IP Tables +------------------------- + +A :term:`Ceph Metadata Server` or :term:`Ceph Manager` listens on the first +available port on the public network beginning at port 6800. Note that this +behavior is not deterministic, so if you are running more than one OSD or MDS +on the same host, or if you restart the daemons within a short window of time, +the daemons will bind to higher ports. You should open the entire 6800-7300 +range by default. When you add the rule using the example below, make sure +you replace ``{iface}`` with the public network interface (e.g., ``eth0``, +``eth1``, etc.), ``{ip-address}`` with the IP address of the public network +and ``{netmask}`` with the netmask of the public network. + +For example:: + + sudo iptables -A INPUT -i {iface} -m multiport -p tcp -s {ip-address}/{netmask} --dports 6800:7300 -j ACCEPT + + +OSD IP Tables +------------- + +By default, Ceph OSD Daemons `bind`_ to the first available ports on a Ceph Node +beginning at port 6800. Note that this behavior is not deterministic, so if you +are running more than one OSD or MDS on the same host, or if you restart the +daemons within a short window of time, the daemons will bind to higher ports. +Each Ceph OSD Daemon on a Ceph Node may use up to four ports: + +#. One for talking to clients and monitors. +#. One for sending data to other OSDs. +#. Two for heartbeating on each interface. + +.. ditaa:: + /---------------\ + | OSD | + | +---+----------------+-----------+ + | | Clients & Monitors | Heartbeat | + | +---+----------------+-----------+ + | | + | +---+----------------+-----------+ + | | Data Replication | Heartbeat | + | +---+----------------+-----------+ + | cCCC | + \---------------/ + +When a daemon fails and restarts without letting go of the port, the restarted +daemon will bind to a new port. You should open the entire 6800-7300 port range +to handle this possibility. + +If you set up separate public and cluster networks, you must add rules for both +the public network and the cluster network, because clients will connect using +the public network and other Ceph OSD Daemons will connect using the cluster +network. When you add the rule using the example below, make sure you replace +``{iface}`` with the network interface (e.g., ``eth0``, ``eth1``, etc.), +``{ip-address}`` with the IP address and ``{netmask}`` with the netmask of the +public or cluster network. For example:: + + sudo iptables -A INPUT -i {iface} -m multiport -p tcp -s {ip-address}/{netmask} --dports 6800:7300 -j ACCEPT + +.. tip:: If you run Ceph Metadata Servers on the same Ceph Node as the + Ceph OSD Daemons, you can consolidate the public network configuration step. + + +Ceph Networks +============= + +To configure Ceph networks, you must add a network configuration to the +``[global]`` section of the configuration file. Our 5-minute Quick Start +provides a trivial `Ceph configuration file`_ that assumes one public network +with client and server on the same network and subnet. Ceph functions just fine +with a public network only. However, Ceph allows you to establish much more +specific criteria, including multiple IP network and subnet masks for your +public network. You can also establish a separate cluster network to handle OSD +heartbeat, object replication and recovery traffic. Don't confuse the IP +addresses you set in your configuration with the public-facing IP addresses +network clients may use to access your service. Typical internal IP networks are +often ``192.168.0.0`` or ``10.0.0.0``. + +.. tip:: If you specify more than one IP address and subnet mask for + either the public or the cluster network, the subnets within the network + must be capable of routing to each other. Additionally, make sure you + include each IP address/subnet in your IP tables and open ports for them + as necessary. + +.. note:: Ceph uses `CIDR`_ notation for subnets (e.g., ``10.0.0.0/24``). + +When you have configured your networks, you may restart your cluster or restart +each daemon. Ceph daemons bind dynamically, so you do not have to restart the +entire cluster at once if you change your network configuration. + + +Public Network +-------------- + +To configure a public network, add the following option to the ``[global]`` +section of your Ceph configuration file. + +.. code-block:: ini + + [global] + # ... elided configuration + public network = {public-network/netmask} + + +Cluster Network +--------------- + +If you declare a cluster network, OSDs will route heartbeat, object replication +and recovery traffic over the cluster network. This may improve performance +compared to using a single network. To configure a cluster network, add the +following option to the ``[global]`` section of your Ceph configuration file. + +.. code-block:: ini + + [global] + # ... elided configuration + cluster network = {cluster-network/netmask} + +We prefer that the cluster network is **NOT** reachable from the public network +or the Internet for added security. + + +Ceph Daemons +============ + +The monitor daemons are each configured to bind to a specific IP address. These addresses are normally configured by your deployment tool. Other components in the Ceph system discover the monitors via the ``mon host`` configuration option, normally specified in the ``[global]`` section of the ``ceph.conf`` file. + +.. code-block:: ini + + [global] + mon host = 10.0.0.2, 10.0.0.3, 10.0.0.4 + +The ``mon host`` value can be a list of IP addresses or a name that is +looked up via DNS. In the case of a DNS name with multiple A or AAAA +records, all records are probed in order to discover a monitor. Once +one monitor is reached, all other current monitors are discovered, so +the ``mon host`` configuration option only needs to be sufficiently up +to date such that a client can reach one monitor that is currently online. + +The MGR, OSD, and MDS daemons will bind to any available address and +do not require any special configuration. However, it is possible to +specify a specific IP address for them to bind to with the ``public +addr`` (and/or, in the case of OSD daemons, the ``cluster addr``) +configuration option. For example, + +.. code-block:: ini + + [osd.0] + public addr = {host-public-ip-address} + cluster addr = {host-cluster-ip-address} + +.. topic:: One NIC OSD in a Two Network Cluster + + Generally, we do not recommend deploying an OSD host with a single NIC in a + cluster with two networks. However, you may accomplish this by forcing the + OSD host to operate on the public network by adding a ``public addr`` entry + to the ``[osd.n]`` section of the Ceph configuration file, where ``n`` + refers to the number of the OSD with one NIC. Additionally, the public + network and cluster network must be able to route traffic to each other, + which we don't recommend for security reasons. + + +Network Config Settings +======================= + +Network configuration settings are not required. Ceph assumes a public network +with all hosts operating on it unless you specifically configure a cluster +network. + + +Public Network +-------------- + +The public network configuration allows you specifically define IP addresses +and subnets for the public network. You may specifically assign static IP +addresses or override ``public network`` settings using the ``public addr`` +setting for a specific daemon. + +``public network`` + +:Description: The IP address and netmask of the public (front-side) network + (e.g., ``192.168.0.0/24``). Set in ``[global]``. You may specify + comma-delimited subnets. + +:Type: ``{ip-address}/{netmask} [, {ip-address}/{netmask}]`` +:Required: No +:Default: N/A + + +``public addr`` + +:Description: The IP address for the public (front-side) network. + Set for each daemon. + +:Type: IP Address +:Required: No +:Default: N/A + + + +Cluster Network +--------------- + +The cluster network configuration allows you to declare a cluster network, and +specifically define IP addresses and subnets for the cluster network. You may +specifically assign static IP addresses or override ``cluster network`` +settings using the ``cluster addr`` setting for specific OSD daemons. + + +``cluster network`` + +:Description: The IP address and netmask of the cluster (back-side) network + (e.g., ``10.0.0.0/24``). Set in ``[global]``. You may specify + comma-delimited subnets. + +:Type: ``{ip-address}/{netmask} [, {ip-address}/{netmask}]`` +:Required: No +:Default: N/A + + +``cluster addr`` + +:Description: The IP address for the cluster (back-side) network. + Set for each daemon. + +:Type: Address +:Required: No +:Default: N/A + + +Bind +---- + +Bind settings set the default port ranges Ceph OSD and MDS daemons use. The +default range is ``6800:7300``. Ensure that your `IP Tables`_ configuration +allows you to use the configured port range. + +You may also enable Ceph daemons to bind to IPv6 addresses instead of IPv4 +addresses. + + +``ms bind port min`` + +:Description: The minimum port number to which an OSD or MDS daemon will bind. +:Type: 32-bit Integer +:Default: ``6800`` +:Required: No + + +``ms bind port max`` + +:Description: The maximum port number to which an OSD or MDS daemon will bind. +:Type: 32-bit Integer +:Default: ``7300`` +:Required: No. + + +``ms bind ipv6`` + +:Description: Enables Ceph daemons to bind to IPv6 addresses. Currently the + messenger *either* uses IPv4 or IPv6, but it cannot do both. +:Type: Boolean +:Default: ``false`` +:Required: No + +``public bind addr`` + +:Description: In some dynamic deployments the Ceph MON daemon might bind + to an IP address locally that is different from the ``public addr`` + advertised to other peers in the network. The environment must ensure + that routing rules are set correctly. If ``public bind addr`` is set + the Ceph MON daemon will bind to it locally and use ``public addr`` + in the monmaps to advertise its address to peers. This behavior is limited + to the MON daemon. + +:Type: IP Address +:Required: No +:Default: N/A + + + +TCP +--- + +Ceph disables TCP buffering by default. + + +``ms tcp nodelay`` + +:Description: Ceph enables ``ms tcp nodelay`` so that each request is sent + immediately (no buffering). Disabling `Nagle's algorithm`_ + increases network traffic, which can introduce latency. If you + experience large numbers of small packets, you may try + disabling ``ms tcp nodelay``. + +:Type: Boolean +:Required: No +:Default: ``true`` + + + +``ms tcp rcvbuf`` + +:Description: The size of the socket buffer on the receiving end of a network + connection. Disable by default. + +:Type: 32-bit Integer +:Required: No +:Default: ``0`` + + + +``ms tcp read timeout`` + +:Description: If a client or daemon makes a request to another Ceph daemon and + does not drop an unused connection, the ``ms tcp read timeout`` + defines the connection as idle after the specified number + of seconds. + +:Type: Unsigned 64-bit Integer +:Required: No +:Default: ``900`` 15 minutes. + + + +.. _Scalability and High Availability: ../../../architecture#scalability-and-high-availability +.. _Hardware Recommendations - Networks: ../../../start/hardware-recommendations#networks +.. _Ceph configuration file: ../../../start/quick-ceph-deploy/#create-a-cluster +.. _hardware recommendations: ../../../start/hardware-recommendations +.. _Monitor / OSD Interaction: ../mon-osd-interaction +.. _Message Signatures: ../auth-config-ref#signatures +.. _CIDR: https://en.wikipedia.org/wiki/Classless_Inter-Domain_Routing +.. _Nagle's Algorithm: https://en.wikipedia.org/wiki/Nagle's_algorithm diff --git a/doc/rados/configuration/osd-config-ref.rst b/doc/rados/configuration/osd-config-ref.rst new file mode 100644 index 00000000..4579f1b6 --- /dev/null +++ b/doc/rados/configuration/osd-config-ref.rst @@ -0,0 +1,1134 @@ +====================== + OSD Config Reference +====================== + +.. index:: OSD; configuration + +You can configure Ceph OSD Daemons in the Ceph configuration file, but Ceph OSD +Daemons can use the default values and a very minimal configuration. A minimal +Ceph OSD Daemon configuration sets ``osd journal size`` and ``host``, and +uses default values for nearly everything else. + +Ceph OSD Daemons are numerically identified in incremental fashion, beginning +with ``0`` using the following convention. :: + + osd.0 + osd.1 + osd.2 + +In a configuration file, you may specify settings for all Ceph OSD Daemons in +the cluster by adding configuration settings to the ``[osd]`` section of your +configuration file. To add settings directly to a specific Ceph OSD Daemon +(e.g., ``host``), enter it in an OSD-specific section of your configuration +file. For example: + +.. code-block:: ini + + [osd] + osd journal size = 5120 + + [osd.0] + host = osd-host-a + + [osd.1] + host = osd-host-b + + +.. index:: OSD; config settings + +General Settings +================ + +The following settings provide an Ceph OSD Daemon's ID, and determine paths to +data and journals. Ceph deployment scripts typically generate the UUID +automatically. + +.. warning:: **DO NOT** change the default paths for data or journals, as it + makes it more problematic to troubleshoot Ceph later. + +The journal size should be at least twice the product of the expected drive +speed multiplied by ``filestore max sync interval``. However, the most common +practice is to partition the journal drive (often an SSD), and mount it such +that Ceph uses the entire partition for the journal. + + +``osd uuid`` + +:Description: The universally unique identifier (UUID) for the Ceph OSD Daemon. +:Type: UUID +:Default: The UUID. +:Note: The ``osd uuid`` applies to a single Ceph OSD Daemon. The ``fsid`` + applies to the entire cluster. + + +``osd data`` + +:Description: The path to the OSDs data. You must create the directory when + deploying Ceph. You should mount a drive for OSD data at this + mount point. We do not recommend changing the default. + +:Type: String +:Default: ``/var/lib/ceph/osd/$cluster-$id`` + + +``osd max write size`` + +:Description: The maximum size of a write in megabytes. +:Type: 32-bit Integer +:Default: ``90`` + + +``osd max object size`` + +:Description: The maximum size of a RADOS object in bytes. +:Type: 32-bit Unsigned Integer +:Default: 128MB + + +``osd client message size cap`` + +:Description: The largest client data message allowed in memory. +:Type: 64-bit Unsigned Integer +:Default: 500MB default. ``500*1024L*1024L`` + + +``osd class dir`` + +:Description: The class path for RADOS class plug-ins. +:Type: String +:Default: ``$libdir/rados-classes`` + + +.. index:: OSD; file system + +File System Settings +==================== +Ceph builds and mounts file systems which are used for Ceph OSDs. + +``osd mkfs options {fs-type}`` + +:Description: Options used when creating a new Ceph OSD of type {fs-type}. + +:Type: String +:Default for xfs: ``-f -i 2048`` +:Default for other file systems: {empty string} + +For example:: + ``osd mkfs options xfs = -f -d agcount=24`` + +``osd mount options {fs-type}`` + +:Description: Options used when mounting a Ceph OSD of type {fs-type}. + +:Type: String +:Default for xfs: ``rw,noatime,inode64`` +:Default for other file systems: ``rw, noatime`` + +For example:: + ``osd mount options xfs = rw, noatime, inode64, logbufs=8`` + + +.. index:: OSD; journal settings + +Journal Settings +================ + +By default, Ceph expects that you will store an Ceph OSD Daemons journal with +the following path:: + + /var/lib/ceph/osd/$cluster-$id/journal + +When using a single device type (for example, spinning drives), the journals +should be *colocated*: the logical volume (or partition) should be in the same +device as the ``data`` logical volume. + +When using a mix of fast (SSDs, NVMe) devices with slower ones (like spinning +drives) it makes sense to place the journal on the faster device, while +``data`` occupies the slower device fully. + +The default ``osd journal size`` value is 5120 (5 gigabytes), but it can be +larger, in which case it will need to be set in the ``ceph.conf`` file:: + + + osd journal size = 10240 + + +``osd journal`` + +:Description: The path to the OSD's journal. This may be a path to a file or a + block device (such as a partition of an SSD). If it is a file, + you must create the directory to contain it. We recommend using a + drive separate from the ``osd data`` drive. + +:Type: String +:Default: ``/var/lib/ceph/osd/$cluster-$id/journal`` + + +``osd journal size`` + +:Description: The size of the journal in megabytes. + +:Type: 32-bit Integer +:Default: ``5120`` + + +See `Journal Config Reference`_ for additional details. + + +Monitor OSD Interaction +======================= + +Ceph OSD Daemons check each other's heartbeats and report to monitors +periodically. Ceph can use default values in many cases. However, if your +network has latency issues, you may need to adopt longer intervals. See +`Configuring Monitor/OSD Interaction`_ for a detailed discussion of heartbeats. + + +Data Placement +============== + +See `Pool & PG Config Reference`_ for details. + + +.. index:: OSD; scrubbing + +Scrubbing +========= + +In addition to making multiple copies of objects, Ceph insures data integrity by +scrubbing placement groups. Ceph scrubbing is analogous to ``fsck`` on the +object storage layer. For each placement group, Ceph generates a catalog of all +objects and compares each primary object and its replicas to ensure that no +objects are missing or mismatched. Light scrubbing (daily) checks the object +size and attributes. Deep scrubbing (weekly) reads the data and uses checksums +to ensure data integrity. + +Scrubbing is important for maintaining data integrity, but it can reduce +performance. You can adjust the following settings to increase or decrease +scrubbing operations. + + +``osd max scrubs`` + +:Description: The maximum number of simultaneous scrub operations for + a Ceph OSD Daemon. + +:Type: 32-bit Int +:Default: ``1`` + +``osd scrub begin hour`` + +:Description: The time of day for the lower bound when a scheduled scrub can be + performed. +:Type: Integer in the range of 0 to 24 +:Default: ``0`` + + +``osd scrub end hour`` + +:Description: The time of day for the upper bound when a scheduled scrub can be + performed. Along with ``osd scrub begin hour``, they define a time + window, in which the scrubs can happen. But a scrub will be performed + no matter the time window allows or not, as long as the placement + group's scrub interval exceeds ``osd scrub max interval``. +:Type: Integer in the range of 0 to 24 +:Default: ``24`` + + +``osd scrub begin week day`` + +:Description: This restricts scrubbing to this day of the week or later. + 0 or 7 = Sunday, 1 = Monday, etc. +:Type: Integer in the range of 0 to 7 +:Default: ``0`` + + +``osd scrub end week day`` + +:Description: This restricts scrubbing to days of the week earlier than this. + 0 or 7 = Sunday, 1 = Monday, etc. +:Type: Integer in the range of 0 to 7 +:Default: ``7`` + + +``osd scrub during recovery`` + +:Description: Allow scrub during recovery. Setting this to ``false`` will disable + scheduling new scrub (and deep--scrub) while there is active recovery. + Already running scrubs will be continued. This might be useful to reduce + load on busy clusters. +:Type: Boolean +:Default: ``false`` + + +``osd scrub thread timeout`` + +:Description: The maximum time in seconds before timing out a scrub thread. +:Type: 32-bit Integer +:Default: ``60`` + + +``osd scrub finalize thread timeout`` + +:Description: The maximum time in seconds before timing out a scrub finalize + thread. + +:Type: 32-bit Integer +:Default: ``60*10`` + + +``osd scrub load threshold`` + +:Description: The normalized maximum load. Ceph will not scrub when the system load + (as defined by ``getloadavg() / number of online cpus``) is higher than this number. + Default is ``0.5``. + +:Type: Float +:Default: ``0.5`` + + +``osd scrub min interval`` + +:Description: The minimal interval in seconds for scrubbing the Ceph OSD Daemon + when the Ceph Storage Cluster load is low. + +:Type: Float +:Default: Once per day. ``60*60*24`` + + +``osd scrub max interval`` + +:Description: The maximum interval in seconds for scrubbing the Ceph OSD Daemon + irrespective of cluster load. + +:Type: Float +:Default: Once per week. ``7*60*60*24`` + + +``osd scrub chunk min`` + +:Description: The minimal number of object store chunks to scrub during single operation. + Ceph blocks writes to single chunk during scrub. + +:Type: 32-bit Integer +:Default: 5 + + +``osd scrub chunk max`` + +:Description: The maximum number of object store chunks to scrub during single operation. + +:Type: 32-bit Integer +:Default: 25 + + +``osd scrub sleep`` + +:Description: Time to sleep before scrubbing next group of chunks. Increasing this value will slow + down whole scrub operation while client operations will be less impacted. + +:Type: Float +:Default: 0 + + +``osd deep scrub interval`` + +:Description: The interval for "deep" scrubbing (fully reading all data). The + ``osd scrub load threshold`` does not affect this setting. + +:Type: Float +:Default: Once per week. ``60*60*24*7`` + + +``osd scrub interval randomize ratio`` + +:Description: Add a random delay to ``osd scrub min interval`` when scheduling + the next scrub job for a placement group. The delay is a random + value less than ``osd scrub min interval`` \* + ``osd scrub interval randomized ratio``. So the default setting + practically randomly spreads the scrubs out in the allowed time + window of ``[1, 1.5]`` \* ``osd scrub min interval``. +:Type: Float +:Default: ``0.5`` + +``osd deep scrub stride`` + +:Description: Read size when doing a deep scrub. +:Type: 32-bit Integer +:Default: 512 KB. ``524288`` + + +``osd scrub auto repair`` + +:Description: Setting this to ``true`` will enable automatic pg repair when errors + are found in scrub or deep-scrub. However, if more than + ``osd scrub auto repair num errors`` errors are found a repair is NOT performed. +:Type: Boolean +:Default: ``false`` + + +``osd scrub auto repair num errors`` + +:Description: Auto repair will not occur if more than this many errors are found. +:Type: 32-bit Integer +:Default: ``5`` + + +.. index:: OSD; operations settings + +Operations +========== + +``osd op queue`` + +:Description: This sets the type of queue to be used for prioritizing ops + in the OSDs. Both queues feature a strict sub-queue which is + dequeued before the normal queue. The normal queue is different + between implementations. The original PrioritizedQueue (``prio``) uses a + token bucket system which when there are sufficient tokens will + dequeue high priority queues first. If there are not enough + tokens available, queues are dequeued low priority to high priority. + The WeightedPriorityQueue (``wpq``) dequeues all priorities in + relation to their priorities to prevent starvation of any queue. + WPQ should help in cases where a few OSDs are more overloaded + than others. The new mClock based OpClassQueue + (``mclock_opclass``) prioritizes operations based on which class + they belong to (recovery, scrub, snaptrim, client op, osd subop). + And, the mClock based ClientQueue (``mclock_client``) also + incorporates the client identifier in order to promote fairness + between clients. See `QoS Based on mClock`_. Requires a restart. + +:Type: String +:Valid Choices: prio, wpq, mclock_opclass, mclock_client +:Default: ``wpq`` + + +``osd op queue cut off`` + +:Description: This selects which priority ops will be sent to the strict + queue verses the normal queue. The ``low`` setting sends all + replication ops and higher to the strict queue, while the ``high`` + option sends only replication acknowledgement ops and higher to + the strict queue. Setting this to ``high`` should help when a few + OSDs in the cluster are very busy especially when combined with + ``wpq`` in the ``osd op queue`` setting. OSDs that are very busy + handling replication traffic could starve primary client traffic + on these OSDs without these settings. Requires a restart. + +:Type: String +:Valid Choices: low, high +:Default: ``low`` + + +``osd client op priority`` + +:Description: The priority set for client operations. + +:Type: 32-bit Integer +:Default: ``63`` +:Valid Range: 1-63 + + +``osd recovery op priority`` + +:Description: The priority set for recovery operations, if not specified by the pool's ``recovery_op_priority``. + +:Type: 32-bit Integer +:Default: ``3`` +:Valid Range: 1-63 + + +``osd scrub priority`` + +:Description: The default priority set for a scheduled scrub work queue when the + pool doesn't specify a value of ``scrub_priority``. This can be + boosted to the value of ``osd client op priority`` when scrub is + blocking client operations. + +:Type: 32-bit Integer +:Default: ``5`` +:Valid Range: 1-63 + + +``osd requested scrub priority`` + +:Description: The priority set for user requested scrub on the work queue. If + this value were to be smaller than ``osd client op priority`` it + can be boosted to the value of ``osd client op priority`` when + scrub is blocking client operations. + +:Type: 32-bit Integer +:Default: ``120`` + + +``osd snap trim priority`` + +:Description: The priority set for the snap trim work queue. + +:Type: 32-bit Integer +:Default: ``5`` +:Valid Range: 1-63 + +``osd snap trim sleep`` + +:Description: Time in seconds to sleep before next snap trim op. + Increasing this value will slow down snap trimming. + This option overrides backend specific variants. + +:Type: Float +:Default: ``0`` + + +``osd snap trim sleep hdd`` + +:Description: Time in seconds to sleep before next snap trim op + for HDDs. + +:Type: Float +:Default: ``5`` + + +``osd snap trim sleep ssd`` + +:Description: Time in seconds to sleep before next snap trim op + for SSDs. + +:Type: Float +:Default: ``0`` + + +``osd snap trim sleep hybrid`` + +:Description: Time in seconds to sleep before next snap trim op + when osd data is on HDD and osd journal is on SSD. + +:Type: Float +:Default: ``2`` + +``osd op thread timeout`` + +:Description: The Ceph OSD Daemon operation thread timeout in seconds. +:Type: 32-bit Integer +:Default: ``15`` + + +``osd op complaint time`` + +:Description: An operation becomes complaint worthy after the specified number + of seconds have elapsed. + +:Type: Float +:Default: ``30`` + + +``osd op history size`` + +:Description: The maximum number of completed operations to track. +:Type: 32-bit Unsigned Integer +:Default: ``20`` + + +``osd op history duration`` + +:Description: The oldest completed operation to track. +:Type: 32-bit Unsigned Integer +:Default: ``600`` + + +``osd op log threshold`` + +:Description: How many operations logs to display at once. +:Type: 32-bit Integer +:Default: ``5`` + + +QoS Based on mClock +------------------- + +Ceph's use of mClock is currently in the experimental phase and should +be approached with an exploratory mindset. + +Core Concepts +````````````` + +The QoS support of Ceph is implemented using a queueing scheduler +based on `the dmClock algorithm`_. This algorithm allocates the I/O +resources of the Ceph cluster in proportion to weights, and enforces +the constraints of minimum reservation and maximum limitation, so that +the services can compete for the resources fairly. Currently the +*mclock_opclass* operation queue divides Ceph services involving I/O +resources into following buckets: + +- client op: the iops issued by client +- osd subop: the iops issued by primary OSD +- snap trim: the snap trimming related requests +- pg recovery: the recovery related requests +- pg scrub: the scrub related requests + +And the resources are partitioned using following three sets of tags. In other +words, the share of each type of service is controlled by three tags: + +#. reservation: the minimum IOPS allocated for the service. +#. limitation: the maximum IOPS allocated for the service. +#. weight: the proportional share of capacity if extra capacity or system + oversubscribed. + +In Ceph operations are graded with "cost". And the resources allocated +for serving various services are consumed by these "costs". So, for +example, the more reservation a services has, the more resource it is +guaranteed to possess, as long as it requires. Assuming there are 2 +services: recovery and client ops: + +- recovery: (r:1, l:5, w:1) +- client ops: (r:2, l:0, w:9) + +The settings above ensure that the recovery won't get more than 5 +requests per second serviced, even if it requires so (see CURRENT +IMPLEMENTATION NOTE below), and no other services are competing with +it. But if the clients start to issue large amount of I/O requests, +neither will they exhaust all the I/O resources. 1 request per second +is always allocated for recovery jobs as long as there are any such +requests. So the recovery jobs won't be starved even in a cluster with +high load. And in the meantime, the client ops can enjoy a larger +portion of the I/O resource, because its weight is "9", while its +competitor "1". In the case of client ops, it is not clamped by the +limit setting, so it can make use of all the resources if there is no +recovery ongoing. + +Along with *mclock_opclass* another mclock operation queue named +*mclock_client* is available. It divides operations based on category +but also divides them based on the client making the request. This +helps not only manage the distribution of resources spent on different +classes of operations but also tries to insure fairness among clients. + +CURRENT IMPLEMENTATION NOTE: the current experimental implementation +does not enforce the limit values. As a first approximation we decided +not to prevent operations that would otherwise enter the operation +sequencer from doing so. + +Subtleties of mClock +```````````````````` + +The reservation and limit values have a unit of requests per +second. The weight, however, does not technically have a unit and the +weights are relative to one another. So if one class of requests has a +weight of 1 and another a weight of 9, then the latter class of +requests should get 9 executed at a 9 to 1 ratio as the first class. +However that will only happen once the reservations are met and those +values include the operations executed under the reservation phase. + +Even though the weights do not have units, one must be careful in +choosing their values due how the algorithm assigns weight tags to +requests. If the weight is *W*, then for a given class of requests, +the next one that comes in will have a weight tag of *1/W* plus the +previous weight tag or the current time, whichever is larger. That +means if *W* is sufficiently large and therefore *1/W* is sufficiently +small, the calculated tag may never be assigned as it will get a value +of the current time. The ultimate lesson is that values for weight +should not be too large. They should be under the number of requests +one expects to ve serviced each second. + +Caveats +``````` + +There are some factors that can reduce the impact of the mClock op +queues within Ceph. First, requests to an OSD are sharded by their +placement group identifier. Each shard has its own mClock queue and +these queues neither interact nor share information among them. The +number of shards can be controlled with the configuration options +``osd_op_num_shards``, ``osd_op_num_shards_hdd``, and +``osd_op_num_shards_ssd``. A lower number of shards will increase the +impact of the mClock queues, but may have other deleterious effects. + +Second, requests are transferred from the operation queue to the +operation sequencer, in which they go through the phases of +execution. The operation queue is where mClock resides and mClock +determines the next op to transfer to the operation sequencer. The +number of operations allowed in the operation sequencer is a complex +issue. In general we want to keep enough operations in the sequencer +so it's always getting work done on some operations while it's waiting +for disk and network access to complete on other operations. On the +other hand, once an operation is transferred to the operation +sequencer, mClock no longer has control over it. Therefore to maximize +the impact of mClock, we want to keep as few operations in the +operation sequencer as possible. So we have an inherent tension. + +The configuration options that influence the number of operations in +the operation sequencer are ``bluestore_throttle_bytes``, +``bluestore_throttle_deferred_bytes``, +``bluestore_throttle_cost_per_io``, +``bluestore_throttle_cost_per_io_hdd``, and +``bluestore_throttle_cost_per_io_ssd``. + +A third factor that affects the impact of the mClock algorithm is that +we're using a distributed system, where requests are made to multiple +OSDs and each OSD has (can have) multiple shards. Yet we're currently +using the mClock algorithm, which is not distributed (note: dmClock is +the distributed version of mClock). + +Various organizations and individuals are currently experimenting with +mClock as it exists in this code base along with their modifications +to the code base. We hope you'll share you're experiences with your +mClock and dmClock experiments in the ceph-devel mailing list. + + +``osd push per object cost`` + +:Description: the overhead for serving a push op + +:Type: Unsigned Integer +:Default: 1000 + +``osd recovery max chunk`` + +:Description: the maximum total size of data chunks a recovery op can carry. + +:Type: Unsigned Integer +:Default: 8 MiB + + +``osd op queue mclock client op res`` + +:Description: the reservation of client op. + +:Type: Float +:Default: 1000.0 + + +``osd op queue mclock client op wgt`` + +:Description: the weight of client op. + +:Type: Float +:Default: 500.0 + + +``osd op queue mclock client op lim`` + +:Description: the limit of client op. + +:Type: Float +:Default: 1000.0 + + +``osd op queue mclock osd subop res`` + +:Description: the reservation of osd subop. + +:Type: Float +:Default: 1000.0 + + +``osd op queue mclock osd subop wgt`` + +:Description: the weight of osd subop. + +:Type: Float +:Default: 500.0 + + +``osd op queue mclock osd subop lim`` + +:Description: the limit of osd subop. + +:Type: Float +:Default: 0.0 + + +``osd op queue mclock snap res`` + +:Description: the reservation of snap trimming. + +:Type: Float +:Default: 0.0 + + +``osd op queue mclock snap wgt`` + +:Description: the weight of snap trimming. + +:Type: Float +:Default: 1.0 + + +``osd op queue mclock snap lim`` + +:Description: the limit of snap trimming. + +:Type: Float +:Default: 0.001 + + +``osd op queue mclock recov res`` + +:Description: the reservation of recovery. + +:Type: Float +:Default: 0.0 + + +``osd op queue mclock recov wgt`` + +:Description: the weight of recovery. + +:Type: Float +:Default: 1.0 + + +``osd op queue mclock recov lim`` + +:Description: the limit of recovery. + +:Type: Float +:Default: 0.001 + + +``osd op queue mclock scrub res`` + +:Description: the reservation of scrub jobs. + +:Type: Float +:Default: 0.0 + + +``osd op queue mclock scrub wgt`` + +:Description: the weight of scrub jobs. + +:Type: Float +:Default: 1.0 + + +``osd op queue mclock scrub lim`` + +:Description: the limit of scrub jobs. + +:Type: Float +:Default: 0.001 + +.. _the dmClock algorithm: https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Gulati.pdf + + +.. index:: OSD; backfilling + +Backfilling +=========== + +When you add or remove Ceph OSD Daemons to a cluster, the CRUSH algorithm will +want to rebalance the cluster by moving placement groups to or from Ceph OSD +Daemons to restore the balance. The process of migrating placement groups and +the objects they contain can reduce the cluster's operational performance +considerably. To maintain operational performance, Ceph performs this migration +with 'backfilling', which allows Ceph to set backfill operations to a lower +priority than requests to read or write data. + + +``osd max backfills`` + +:Description: The maximum number of backfills allowed to or from a single OSD. +:Type: 64-bit Unsigned Integer +:Default: ``1`` + + +``osd backfill scan min`` + +:Description: The minimum number of objects per backfill scan. + +:Type: 32-bit Integer +:Default: ``64`` + + +``osd backfill scan max`` + +:Description: The maximum number of objects per backfill scan. + +:Type: 32-bit Integer +:Default: ``512`` + + +``osd backfill retry interval`` + +:Description: The number of seconds to wait before retrying backfill requests. +:Type: Double +:Default: ``10.0`` + +.. index:: OSD; osdmap + +OSD Map +======= + +OSD maps reflect the OSD daemons operating in the cluster. Over time, the +number of map epochs increases. Ceph provides some settings to ensure that +Ceph performs well as the OSD map grows larger. + + +``osd map dedup`` + +:Description: Enable removing duplicates in the OSD map. +:Type: Boolean +:Default: ``true`` + + +``osd map cache size`` + +:Description: The number of OSD maps to keep cached. +:Type: 32-bit Integer +:Default: ``50`` + + +``osd map message max`` + +:Description: The maximum map entries allowed per MOSDMap message. +:Type: 32-bit Integer +:Default: ``40`` + + + +.. index:: OSD; recovery + +Recovery +======== + +When the cluster starts or when a Ceph OSD Daemon crashes and restarts, the OSD +begins peering with other Ceph OSD Daemons before writes can occur. See +`Monitoring OSDs and PGs`_ for details. + +If a Ceph OSD Daemon crashes and comes back online, usually it will be out of +sync with other Ceph OSD Daemons containing more recent versions of objects in +the placement groups. When this happens, the Ceph OSD Daemon goes into recovery +mode and seeks to get the latest copy of the data and bring its map back up to +date. Depending upon how long the Ceph OSD Daemon was down, the OSD's objects +and placement groups may be significantly out of date. Also, if a failure domain +went down (e.g., a rack), more than one Ceph OSD Daemon may come back online at +the same time. This can make the recovery process time consuming and resource +intensive. + +To maintain operational performance, Ceph performs recovery with limitations on +the number recovery requests, threads and object chunk sizes which allows Ceph +perform well in a degraded state. + + +``osd recovery delay start`` + +:Description: After peering completes, Ceph will delay for the specified number + of seconds before starting to recover objects. + +:Type: Float +:Default: ``0`` + + +``osd recovery max active`` + +:Description: The number of active recovery requests per OSD at one time. More + requests will accelerate recovery, but the requests places an + increased load on the cluster. + +:Type: 32-bit Integer +:Default: ``3`` + + +``osd recovery max chunk`` + +:Description: The maximum size of a recovered chunk of data to push. +:Type: 64-bit Unsigned Integer +:Default: ``8 << 20`` + + +``osd recovery max single start`` + +:Description: The maximum number of recovery operations per OSD that will be + newly started when an OSD is recovering. +:Type: 64-bit Unsigned Integer +:Default: ``1`` + + +``osd recovery thread timeout`` + +:Description: The maximum time in seconds before timing out a recovery thread. +:Type: 32-bit Integer +:Default: ``30`` + + +``osd recover clone overlap`` + +:Description: Preserves clone overlap during recovery. Should always be set + to ``true``. + +:Type: Boolean +:Default: ``true`` + + +``osd recovery sleep`` + +:Description: Time in seconds to sleep before next recovery or backfill op. + Increasing this value will slow down recovery operation while + client operations will be less impacted. + +:Type: Float +:Default: ``0`` + + +``osd recovery sleep hdd`` + +:Description: Time in seconds to sleep before next recovery or backfill op + for HDDs. + +:Type: Float +:Default: ``0.1`` + + +``osd recovery sleep ssd`` + +:Description: Time in seconds to sleep before next recovery or backfill op + for SSDs. + +:Type: Float +:Default: ``0`` + + +``osd recovery sleep hybrid`` + +:Description: Time in seconds to sleep before next recovery or backfill op + when osd data is on HDD and osd journal is on SSD. + +:Type: Float +:Default: ``0.025`` + + +``osd recovery priority`` + +:Description: The default priority set for recovery work queue. Not + related to a pool's ``recovery_priority``. + +:Type: 32-bit Integer +:Default: ``5`` + + +Tiering +======= + +``osd agent max ops`` + +:Description: The maximum number of simultaneous flushing ops per tiering agent + in the high speed mode. +:Type: 32-bit Integer +:Default: ``4`` + + +``osd agent max low ops`` + +:Description: The maximum number of simultaneous flushing ops per tiering agent + in the low speed mode. +:Type: 32-bit Integer +:Default: ``2`` + +See `cache target dirty high ratio`_ for when the tiering agent flushes dirty +objects within the high speed mode. + +Miscellaneous +============= + + +``osd snap trim thread timeout`` + +:Description: The maximum time in seconds before timing out a snap trim thread. +:Type: 32-bit Integer +:Default: ``60*60*1`` + + +``osd backlog thread timeout`` + +:Description: The maximum time in seconds before timing out a backlog thread. +:Type: 32-bit Integer +:Default: ``60*60*1`` + + +``osd default notify timeout`` + +:Description: The OSD default notification timeout (in seconds). +:Type: 32-bit Unsigned Integer +:Default: ``30`` + + +``osd check for log corruption`` + +:Description: Check log files for corruption. Can be computationally expensive. +:Type: Boolean +:Default: ``false`` + + +``osd remove thread timeout`` + +:Description: The maximum time in seconds before timing out a remove OSD thread. +:Type: 32-bit Integer +:Default: ``60*60`` + + +``osd command thread timeout`` + +:Description: The maximum time in seconds before timing out a command thread. +:Type: 32-bit Integer +:Default: ``10*60`` + + +``osd delete sleep`` + +:Description: Time in seconds to sleep before next removal transaction. This + helps to throttle the pg deletion process. + +:Type: Float +:Default: ``0`` + + +``osd delete sleep hdd`` + +:Description: Time in seconds to sleep before next removal transaction + for HDDs. + +:Type: Float +:Default: ``5`` + + +``osd delete sleep ssd`` + +:Description: Time in seconds to sleep before next removal transaction + for SSDs. + +:Type: Float +:Default: ``0`` + + +``osd delete sleep hybrid`` + +:Description: Time in seconds to sleep before next removal transaction + when osd data is on HDD and osd journal is on SSD. + +:Type: Float +:Default: ``1`` + + +``osd command max records`` + +:Description: Limits the number of lost objects to return. +:Type: 32-bit Integer +:Default: ``256`` + + +``osd fast fail on connection refused`` + +:Description: If this option is enabled, crashed OSDs are marked down + immediately by connected peers and MONs (assuming that the + crashed OSD host survives). Disable it to restore old + behavior, at the expense of possible long I/O stalls when + OSDs crash in the middle of I/O operations. +:Type: Boolean +:Default: ``true`` + + + +.. _pool: ../../operations/pools +.. _Configuring Monitor/OSD Interaction: ../mon-osd-interaction +.. _Monitoring OSDs and PGs: ../../operations/monitoring-osd-pg#peering +.. _Pool & PG Config Reference: ../pool-pg-config-ref +.. _Journal Config Reference: ../journal-ref +.. _cache target dirty high ratio: ../../operations/pools#cache-target-dirty-high-ratio diff --git a/doc/rados/configuration/pool-pg-config-ref.rst b/doc/rados/configuration/pool-pg-config-ref.rst new file mode 100644 index 00000000..ade82ff3 --- /dev/null +++ b/doc/rados/configuration/pool-pg-config-ref.rst @@ -0,0 +1,274 @@ +====================================== + Pool, PG and CRUSH Config Reference +====================================== + +.. index:: pools; configuration + +When you create pools and set the number of placement groups for the pool, Ceph +uses default values when you don't specifically override the defaults. **We +recommend** overriding some of the defaults. Specifically, we recommend setting +a pool's replica size and overriding the default number of placement groups. You +can specifically set these values when running `pool`_ commands. You can also +override the defaults by adding new ones in the ``[global]`` section of your +Ceph configuration file. + + +.. literalinclude:: pool-pg.conf + :language: ini + + + +``mon max pool pg num`` + +:Description: The maximum number of placement groups per pool. +:Type: Integer +:Default: ``65536`` + + +``mon pg create interval`` + +:Description: Number of seconds between PG creation in the same + Ceph OSD Daemon. + +:Type: Float +:Default: ``30.0`` + + +``mon pg stuck threshold`` + +:Description: Number of seconds after which PGs can be considered as + being stuck. + +:Type: 32-bit Integer +:Default: ``300`` + +``mon pg min inactive`` + +:Description: Issue a ``HEALTH_ERR`` in cluster log if the number of PGs stay + inactive longer than ``mon_pg_stuck_threshold`` exceeds this + setting. A non-positive number means disabled, never go into ERR. +:Type: Integer +:Default: ``1`` + + +``mon pg warn min per osd`` + +:Description: Issue a ``HEALTH_WARN`` in cluster log if the average number + of PGs per (in) OSD is under this number. (a non-positive number + disables this) +:Type: Integer +:Default: ``30`` + + +``mon pg warn min objects`` + +:Description: Do not warn if the total number of objects in cluster is below + this number +:Type: Integer +:Default: ``1000`` + + +``mon pg warn min pool objects`` + +:Description: Do not warn on pools whose object number is below this number +:Type: Integer +:Default: ``1000`` + + +``mon pg check down all threshold`` + +:Description: Threshold of down OSDs percentage after which we check all PGs + for stale ones. +:Type: Float +:Default: ``0.5`` + + +``mon pg warn max object skew`` + +:Description: Issue a ``HEALTH_WARN`` in cluster log if the average object number + of a certain pool is greater than ``mon pg warn max object skew`` times + the average object number of the whole pool. (a non-positive number + disables this) +:Type: Float +:Default: ``10`` + + +``mon delta reset interval`` + +:Description: Seconds of inactivity before we reset the pg delta to 0. We keep + track of the delta of the used space of each pool, so, for + example, it would be easier for us to understand the progress of + recovery or the performance of cache tier. But if there's no + activity reported for a certain pool, we just reset the history of + deltas of that pool. +:Type: Integer +:Default: ``10`` + + +``mon osd max op age`` + +:Description: Maximum op age before we get concerned (make it a power of 2). + A ``HEALTH_WARN`` will be issued if a request has been blocked longer + than this limit. +:Type: Float +:Default: ``32.0`` + + +``osd pg bits`` + +:Description: Placement group bits per Ceph OSD Daemon. +:Type: 32-bit Integer +:Default: ``6`` + + +``osd pgp bits`` + +:Description: The number of bits per Ceph OSD Daemon for PGPs. +:Type: 32-bit Integer +:Default: ``6`` + + +``osd crush chooseleaf type`` + +:Description: The bucket type to use for ``chooseleaf`` in a CRUSH rule. Uses + ordinal rank rather than name. + +:Type: 32-bit Integer +:Default: ``1``. Typically a host containing one or more Ceph OSD Daemons. + + +``osd crush initial weight`` + +:Description: The initial crush weight for newly added osds into crushmap. + +:Type: Double +:Default: ``the size of newly added osd in TB``. By default, the initial crush + weight for the newly added osd is set to its volume size in TB. + See `Weighting Bucket Items`_ for details. + + +``osd pool default crush rule`` + +:Description: The default CRUSH rule to use when creating a replicated pool. +:Type: 8-bit Integer +:Default: ``-1``, which means "pick the rule with the lowest numerical ID and + use that". This is to make pool creation work in the absence of rule 0. + + +``osd pool erasure code stripe unit`` + +:Description: Sets the default size, in bytes, of a chunk of an object + stripe for erasure coded pools. Every object of size S + will be stored as N stripes, with each data chunk + receiving ``stripe unit`` bytes. Each stripe of ``N * + stripe unit`` bytes will be encoded/decoded + individually. This option can is overridden by the + ``stripe_unit`` setting in an erasure code profile. + +:Type: Unsigned 32-bit Integer +:Default: ``4096`` + + +``osd pool default size`` + +:Description: Sets the number of replicas for objects in the pool. The default + value is the same as + ``ceph osd pool set {pool-name} size {size}``. + +:Type: 32-bit Integer +:Default: ``3`` + + +``osd pool default min size`` + +:Description: Sets the minimum number of written replicas for objects in the + pool in order to acknowledge a write operation to the client. If + minimum is not met, Ceph will not acknowledge the write to the + client, **which may result in data loss**. This setting ensures + a minimum number of replicas when operating in ``degraded`` mode. + +:Type: 32-bit Integer +:Default: ``0``, which means no particular minimum. If ``0``, + minimum is ``size - (size / 2)``. + + +``osd pool default pg num`` + +:Description: The default number of placement groups for a pool. The default + value is the same as ``pg_num`` with ``mkpool``. + +:Type: 32-bit Integer +:Default: ``32`` + + +``osd pool default pgp num`` + +:Description: The default number of placement groups for placement for a pool. + The default value is the same as ``pgp_num`` with ``mkpool``. + PG and PGP should be equal (for now). + +:Type: 32-bit Integer +:Default: ``8`` + + +``osd pool default flags`` + +:Description: The default flags for new pools. +:Type: 32-bit Integer +:Default: ``0`` + + +``osd max pgls`` + +:Description: The maximum number of placement groups to list. A client + requesting a large number can tie up the Ceph OSD Daemon. + +:Type: Unsigned 64-bit Integer +:Default: ``1024`` +:Note: Default should be fine. + + +``osd min pg log entries`` + +:Description: The minimum number of placement group logs to maintain + when trimming log files. + +:Type: 32-bit Int Unsigned +:Default: ``1000`` + + +``osd default data pool replay window`` + +:Description: The time (in seconds) for an OSD to wait for a client to replay + a request. + +:Type: 32-bit Integer +:Default: ``45`` + +``osd max pg per osd hard ratio`` + +:Description: The ratio of number of PGs per OSD allowed by the cluster before + OSD refuses to create new PGs. OSD stops creating new PGs if the number + of PGs it serves exceeds + ``osd max pg per osd hard ratio`` \* ``mon max pg per osd``. + +:Type: Float +:Default: ``2`` + +``osd recovery priority`` + +:Description: Priority of recovery in the work queue. + +:Type: Integer +:Default: ``5`` + +``osd recovery op priority`` + +:Description: Default priority used for recovery operations if pool doesn't override. + +:Type: Integer +:Default: ``3`` + +.. _pool: ../../operations/pools +.. _Monitoring OSDs and PGs: ../../operations/monitoring-osd-pg#peering +.. _Weighting Bucket Items: ../../operations/crush-map#weightingbucketitems diff --git a/doc/rados/configuration/pool-pg.conf b/doc/rados/configuration/pool-pg.conf new file mode 100644 index 00000000..0c69d0b9 --- /dev/null +++ b/doc/rados/configuration/pool-pg.conf @@ -0,0 +1,20 @@ +[global] + + # By default, Ceph makes 3 replicas of objects. If you want to make four + # copies of an object the default value--a primary copy and three replica + # copies--reset the default values as shown in 'osd pool default size'. + # If you want to allow Ceph to write a lesser number of copies in a degraded + # state, set 'osd pool default min size' to a number less than the + # 'osd pool default size' value. + + osd pool default size = 3 # Write an object 3 times. + osd pool default min size = 2 # Allow writing two copies in a degraded state. + + # Ensure you have a realistic number of placement groups. We recommend + # approximately 100 per OSD. E.g., total number of OSDs multiplied by 100 + # divided by the number of replicas (i.e., osd pool default size). So for + # 10 OSDs and osd pool default size = 4, we'd recommend approximately + # (100 * 10) / 4 = 250. + + osd pool default pg num = 250 + osd pool default pgp num = 250 diff --git a/doc/rados/configuration/storage-devices.rst b/doc/rados/configuration/storage-devices.rst new file mode 100644 index 00000000..262778d0 --- /dev/null +++ b/doc/rados/configuration/storage-devices.rst @@ -0,0 +1,83 @@ +================= + Storage Devices +================= + +There are two Ceph daemons that store data on disk: + +* **Ceph OSDs** (or Object Storage Daemons) are where most of the + data is stored in Ceph. Generally speaking, each OSD is backed by + a single storage device, like a traditional hard disk (HDD) or + solid state disk (SSD). OSDs can also be backed by a combination + of devices, like a HDD for most data and an SSD (or partition of an + SSD) for some metadata. The number of OSDs in a cluster is + generally a function of how much data will be stored, how big each + storage device will be, and the level and type of redundancy + (replication or erasure coding). +* **Ceph Monitor** daemons manage critical cluster state like cluster + membership and authentication information. For smaller clusters a + few gigabytes is all that is needed, although for larger clusters + the monitor database can reach tens or possibly hundreds of + gigabytes. + + +OSD Backends +============ + +There are two ways that OSDs can manage the data they store. Starting +with the Luminous 12.2.z release, the new default (and recommended) backend is +*BlueStore*. Prior to Luminous, the default (and only option) was +*FileStore*. + +BlueStore +--------- + +BlueStore is a special-purpose storage backend designed specifically +for managing data on disk for Ceph OSD workloads. It is motivated by +experience supporting and managing OSDs using FileStore over the +last ten years. Key BlueStore features include: + +* Direct management of storage devices. BlueStore consumes raw block + devices or partitions. This avoids any intervening layers of + abstraction (such as local file systems like XFS) that may limit + performance or add complexity. +* Metadata management with RocksDB. We embed RocksDB's key/value database + in order to manage internal metadata, such as the mapping from object + names to block locations on disk. +* Full data and metadata checksumming. By default all data and + metadata written to BlueStore is protected by one or more + checksums. No data or metadata will be read from disk or returned + to the user without being verified. +* Inline compression. Data written may be optionally compressed + before being written to disk. +* Multi-device metadata tiering. BlueStore allows its internal + journal (write-ahead log) to be written to a separate, high-speed + device (like an SSD, NVMe, or NVDIMM) to increased performance. If + a significant amount of faster storage is available, internal + metadata can also be stored on the faster device. +* Efficient copy-on-write. RBD and CephFS snapshots rely on a + copy-on-write *clone* mechanism that is implemented efficiently in + BlueStore. This results in efficient IO both for regular snapshots + and for erasure coded pools (which rely on cloning to implement + efficient two-phase commits). + +For more information, see :doc:`bluestore-config-ref` and :doc:`/rados/operations/bluestore-migration`. + +FileStore +--------- + +FileStore is the legacy approach to storing objects in Ceph. It +relies on a standard file system (normally XFS) in combination with a +key/value database (traditionally LevelDB, now RocksDB) for some +metadata. + +FileStore is well-tested and widely used in production but suffers +from many performance deficiencies due to its overall design and +reliance on a traditional file system for storing object data. + +Although FileStore is generally capable of functioning on most +POSIX-compatible file systems (including btrfs and ext4), we only +recommend that XFS be used. Both btrfs and ext4 have known bugs and +deficiencies and their use may lead to data loss. By default all Ceph +provisioning tools will use XFS. + +For more information, see :doc:`filestore-config-ref`. |