From 483eb2f56657e8e7f419ab1a4fab8dce9ade8609 Mon Sep 17 00:00:00 2001 From: Daniel Baumann Date: Sat, 27 Apr 2024 20:24:20 +0200 Subject: Adding upstream version 14.2.21. Signed-off-by: Daniel Baumann --- doc/rados/configuration/auth-config-ref.rst | 378 +++++++ doc/rados/configuration/bluestore-config-ref.rst | 498 +++++++++ doc/rados/configuration/ceph-conf.rst | 496 +++++++++ doc/rados/configuration/common.rst | 258 +++++ doc/rados/configuration/demo-ceph.conf | 31 + doc/rados/configuration/filestore-config-ref.rst | 365 +++++++ doc/rados/configuration/general-config-ref.rst | 66 ++ doc/rados/configuration/index.rst | 52 + doc/rados/configuration/journal-ref.rst | 116 ++ doc/rados/configuration/mon-config-ref.rst | 1261 ++++++++++++++++++++++ doc/rados/configuration/mon-lookup-dns.rst | 51 + doc/rados/configuration/mon-osd-interaction.rst | 412 +++++++ doc/rados/configuration/ms-ref.rst | 133 +++ doc/rados/configuration/msgr2.rst | 224 ++++ doc/rados/configuration/network-config-ref.rst | 415 +++++++ doc/rados/configuration/osd-config-ref.rst | 1134 +++++++++++++++++++ doc/rados/configuration/pool-pg-config-ref.rst | 274 +++++ doc/rados/configuration/pool-pg.conf | 20 + doc/rados/configuration/storage-devices.rst | 83 ++ 19 files changed, 6267 insertions(+) create mode 100644 doc/rados/configuration/auth-config-ref.rst create mode 100644 doc/rados/configuration/bluestore-config-ref.rst create mode 100644 doc/rados/configuration/ceph-conf.rst create mode 100644 doc/rados/configuration/common.rst create mode 100644 doc/rados/configuration/demo-ceph.conf create mode 100644 doc/rados/configuration/filestore-config-ref.rst create mode 100644 doc/rados/configuration/general-config-ref.rst create mode 100644 doc/rados/configuration/index.rst create mode 100644 doc/rados/configuration/journal-ref.rst create mode 100644 doc/rados/configuration/mon-config-ref.rst create mode 100644 doc/rados/configuration/mon-lookup-dns.rst create mode 100644 doc/rados/configuration/mon-osd-interaction.rst create mode 100644 doc/rados/configuration/ms-ref.rst create mode 100644 doc/rados/configuration/msgr2.rst create mode 100644 doc/rados/configuration/network-config-ref.rst create mode 100644 doc/rados/configuration/osd-config-ref.rst create mode 100644 doc/rados/configuration/pool-pg-config-ref.rst create mode 100644 doc/rados/configuration/pool-pg.conf create mode 100644 doc/rados/configuration/storage-devices.rst (limited to 'doc/rados/configuration') diff --git a/doc/rados/configuration/auth-config-ref.rst b/doc/rados/configuration/auth-config-ref.rst new file mode 100644 index 00000000..c6816f1e --- /dev/null +++ b/doc/rados/configuration/auth-config-ref.rst @@ -0,0 +1,378 @@ +======================== + Cephx Config Reference +======================== + +The ``cephx`` protocol is enabled by default. Cryptographic authentication has +some computational costs, though they should generally be quite low. If the +network environment connecting your client and server hosts is very safe and +you cannot afford authentication, you can turn it off. **This is not generally +recommended**. + +.. note:: If you disable authentication, you are at risk of a man-in-the-middle + attack altering your client/server messages, which could lead to disastrous + security effects. + +For creating users, see `User Management`_. For details on the architecture +of Cephx, see `Architecture - High Availability Authentication`_. 
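+
+To verify which authentication settings are currently in effect on a running
+cluster, you can query the monitors' centralized configuration database. This
+is a minimal sketch and assumes an admin keyring in the default location; each
+command should report ``cephx`` on a cluster with authentication enabled::
+
+    ceph config get mon auth_cluster_required
+    ceph config get mon auth_service_required
+    ceph config get mon auth_client_required
+
+Note that values set only in a local Ceph configuration file are not stored in
+the monitors' database; ``ceph config show mon.a`` (substituting one of your
+daemon names for ``mon.a``) reports the values a running daemon is actually
+using.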
+ + +Deployment Scenarios +==================== + +There are two main scenarios for deploying a Ceph cluster, which impact +how you initially configure Cephx. Most first time Ceph users use +``ceph-deploy`` to create a cluster (easiest). For clusters using +other deployment tools (e.g., Chef, Juju, Puppet, etc.), you will need +to use the manual procedures or configure your deployment tool to +bootstrap your monitor(s). + +ceph-deploy +----------- + +When you deploy a cluster with ``ceph-deploy``, you do not have to bootstrap the +monitor manually or create the ``client.admin`` user or keyring. The steps you +execute in the `Storage Cluster Quick Start`_ will invoke ``ceph-deploy`` to do +that for you. + +When you execute ``ceph-deploy new {initial-monitor(s)}``, Ceph will create a +monitor keyring for you (only used to bootstrap monitors), and it will generate +an initial Ceph configuration file for you, which contains the following +authentication settings, indicating that Ceph enables authentication by +default:: + + auth_cluster_required = cephx + auth_service_required = cephx + auth_client_required = cephx + +When you execute ``ceph-deploy mon create-initial``, Ceph will bootstrap the +initial monitor(s), retrieve a ``ceph.client.admin.keyring`` file containing the +key for the ``client.admin`` user. Additionally, it will also retrieve keyrings +that give ``ceph-deploy`` and ``ceph-volume`` utilities the ability to prepare and +activate OSDs and metadata servers. + +When you execute ``ceph-deploy admin {node-name}`` (**note:** Ceph must be +installed first), you are pushing a Ceph configuration file and the +``ceph.client.admin.keyring`` to the ``/etc/ceph`` directory of the node. You +will be able to execute Ceph administrative functions as ``root`` on the command +line of that node. + + +Manual Deployment +----------------- + +When you deploy a cluster manually, you have to bootstrap the monitor manually +and create the ``client.admin`` user and keyring. To bootstrap monitors, follow +the steps in `Monitor Bootstrapping`_. The steps for monitor bootstrapping are +the logical steps you must perform when using third party deployment tools like +Chef, Puppet, Juju, etc. + + +Enabling/Disabling Cephx +======================== + +Enabling Cephx requires that you have deployed keys for your monitors, +OSDs and metadata servers. If you are simply toggling Cephx on / off, +you do not have to repeat the bootstrapping procedures. + + +Enabling Cephx +-------------- + +When ``cephx`` is enabled, Ceph will look for the keyring in the default search +path, which includes ``/etc/ceph/$cluster.$name.keyring``. You can override +this location by adding a ``keyring`` option in the ``[global]`` section of +your `Ceph configuration`_ file, but this is not recommended. + +Execute the following procedures to enable ``cephx`` on a cluster with +authentication disabled. If you (or your deployment utility) have already +generated the keys, you may skip the steps related to generating keys. + +#. Create a ``client.admin`` key, and save a copy of the key for your client + host:: + + ceph auth get-or-create client.admin mon 'allow *' mds 'allow *' mgr 'allow *' osd 'allow *' -o /etc/ceph/ceph.client.admin.keyring + + **Warning:** This will clobber any existing + ``/etc/ceph/client.admin.keyring`` file. Do not perform this step if a + deployment tool has already done it for you. Be careful! + +#. Create a keyring for your monitor cluster and generate a monitor + secret key. 
:: + + ceph-authtool --create-keyring /tmp/ceph.mon.keyring --gen-key -n mon. --cap mon 'allow *' + +#. Copy the monitor keyring into a ``ceph.mon.keyring`` file in every monitor's + ``mon data`` directory. For example, to copy it to ``mon.a`` in cluster ``ceph``, + use the following:: + + cp /tmp/ceph.mon.keyring /var/lib/ceph/mon/ceph-a/keyring + +#. Generate a secret key for every MGR, where ``{$id}`` is the MGR letter:: + + ceph auth get-or-create mgr.{$id} mon 'allow profile mgr' mds 'allow *' osd 'allow *' -o /var/lib/ceph/mgr/ceph-{$id}/keyring + +#. Generate a secret key for every OSD, where ``{$id}`` is the OSD number:: + + ceph auth get-or-create osd.{$id} mon 'allow rwx' osd 'allow *' -o /var/lib/ceph/osd/ceph-{$id}/keyring + +#. Generate a secret key for every MDS, where ``{$id}`` is the MDS letter:: + + ceph auth get-or-create mds.{$id} mon 'allow rwx' osd 'allow *' mds 'allow *' mgr 'allow profile mds' -o /var/lib/ceph/mds/ceph-{$id}/keyring + +#. Enable ``cephx`` authentication by setting the following options in the + ``[global]`` section of your `Ceph configuration`_ file:: + + auth cluster required = cephx + auth service required = cephx + auth client required = cephx + + +#. Start or restart the Ceph cluster. See `Operating a Cluster`_ for details. + +For details on bootstrapping a monitor manually, see `Manual Deployment`_. + + + +Disabling Cephx +--------------- + +The following procedure describes how to disable Cephx. If your cluster +environment is relatively safe, you can offset the computation expense of +running authentication. **We do not recommend it.** However, it may be easier +during setup and/or troubleshooting to temporarily disable authentication. + +#. Disable ``cephx`` authentication by setting the following options in the + ``[global]`` section of your `Ceph configuration`_ file:: + + auth cluster required = none + auth service required = none + auth client required = none + + +#. Start or restart the Ceph cluster. See `Operating a Cluster`_ for details. + + +Configuration Settings +====================== + +Enablement +---------- + + +``auth cluster required`` + +:Description: If enabled, the Ceph Storage Cluster daemons (i.e., ``ceph-mon``, + ``ceph-osd``, ``ceph-mds`` and ``ceph-mgr``) must authenticate with + each other. Valid settings are ``cephx`` or ``none``. + +:Type: String +:Required: No +:Default: ``cephx``. + + +``auth service required`` + +:Description: If enabled, the Ceph Storage Cluster daemons require Ceph Clients + to authenticate with the Ceph Storage Cluster in order to access + Ceph services. Valid settings are ``cephx`` or ``none``. + +:Type: String +:Required: No +:Default: ``cephx``. + + +``auth client required`` + +:Description: If enabled, the Ceph Client requires the Ceph Storage Cluster to + authenticate with the Ceph Client. Valid settings are ``cephx`` + or ``none``. + +:Type: String +:Required: No +:Default: ``cephx``. + + +.. index:: keys; keyring + +Keys +---- + +When you run Ceph with authentication enabled, ``ceph`` administrative commands +and Ceph Clients require authentication keys to access the Ceph Storage Cluster. + +The most common way to provide these keys to the ``ceph`` administrative +commands and clients is to include a Ceph keyring under the ``/etc/ceph`` +directory. For Cuttlefish and later releases using ``ceph-deploy``, the filename +is usually ``ceph.client.admin.keyring`` (or ``$cluster.client.admin.keyring``). 
+If you include the keyring under the ``/etc/ceph`` directory, you don't need to +specify a ``keyring`` entry in your Ceph configuration file. + +We recommend copying the Ceph Storage Cluster's keyring file to nodes where you +will run administrative commands, because it contains the ``client.admin`` key. + +You may use ``ceph-deploy admin`` to perform this task. See `Create an Admin +Host`_ for details. To perform this step manually, execute the following:: + + sudo scp {user}@{ceph-cluster-host}:/etc/ceph/ceph.client.admin.keyring /etc/ceph/ceph.client.admin.keyring + +.. tip:: Ensure the ``ceph.keyring`` file has appropriate permissions set + (e.g., ``chmod 644``) on your client machine. + +You may specify the key itself in the Ceph configuration file using the ``key`` +setting (not recommended), or a path to a keyfile using the ``keyfile`` setting. + + +``keyring`` + +:Description: The path to the keyring file. +:Type: String +:Required: No +:Default: ``/etc/ceph/$cluster.$name.keyring,/etc/ceph/$cluster.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin`` + + +``keyfile`` + +:Description: The path to a key file (i.e,. a file containing only the key). +:Type: String +:Required: No +:Default: None + + +``key`` + +:Description: The key (i.e., the text string of the key itself). Not recommended. +:Type: String +:Required: No +:Default: None + + +Daemon Keyrings +--------------- + +Administrative users or deployment tools (e.g., ``ceph-deploy``) may generate +daemon keyrings in the same way as generating user keyrings. By default, Ceph +stores daemons keyrings inside their data directory. The default keyring +locations, and the capabilities necessary for the daemon to function, are shown +below. + +``ceph-mon`` + +:Location: ``$mon_data/keyring`` +:Capabilities: ``mon 'allow *'`` + +``ceph-osd`` + +:Location: ``$osd_data/keyring`` +:Capabilities: ``mgr 'allow profile osd' mon 'allow profile osd' osd 'allow *'`` + +``ceph-mds`` + +:Location: ``$mds_data/keyring`` +:Capabilities: ``mds 'allow' mgr 'allow profile mds' mon 'allow profile mds' osd 'allow rwx'`` + +``ceph-mgr`` + +:Location: ``$mgr_data/keyring`` +:Capabilities: ``mon 'allow profile mgr' mds 'allow *' osd 'allow *'`` + +``radosgw`` + +:Location: ``$rgw_data/keyring`` +:Capabilities: ``mon 'allow rwx' osd 'allow rwx'`` + + +.. note:: The monitor keyring (i.e., ``mon.``) contains a key but no + capabilities, and is not part of the cluster ``auth`` database. + +The daemon data directory locations default to directories of the form:: + + /var/lib/ceph/$type/$cluster-$id + +For example, ``osd.12`` would be:: + + /var/lib/ceph/osd/ceph-12 + +You can override these locations, but it is not recommended. + + +.. index:: signatures + +Signatures +---------- + +Ceph performs a signature check that provides some limited protection +against messages being tampered with in flight (e.g., by a "man in the +middle" attack). + +Like other parts of Ceph authentication, Ceph provides fine-grained control so +you can enable/disable signatures for service messages between the client and +Ceph, and you can enable/disable signatures for messages between Ceph daemons. + +Note that even with signatures enabled data is not encrypted in +flight. + +``cephx require signatures`` + +:Description: If set to ``true``, Ceph requires signatures on all message + traffic between the Ceph Client and the Ceph Storage Cluster, and + between daemons comprising the Ceph Storage Cluster. 
+ + Ceph Argonaut and Linux kernel versions prior to 3.19 do + not support signatures; if such clients are in use this + option can be turned off to allow them to connect. + +:Type: Boolean +:Required: No +:Default: ``false`` + + +``cephx cluster require signatures`` + +:Description: If set to ``true``, Ceph requires signatures on all message + traffic between Ceph daemons comprising the Ceph Storage Cluster. + +:Type: Boolean +:Required: No +:Default: ``false`` + + +``cephx service require signatures`` + +:Description: If set to ``true``, Ceph requires signatures on all message + traffic between Ceph Clients and the Ceph Storage Cluster. + +:Type: Boolean +:Required: No +:Default: ``false`` + + +``cephx sign messages`` + +:Description: If the Ceph version supports message signing, Ceph will sign + all messages so they are more difficult to spoof. + +:Type: Boolean +:Default: ``true`` + + +Time to Live +------------ + +``auth service ticket ttl`` + +:Description: When the Ceph Storage Cluster sends a Ceph Client a ticket for + authentication, the Ceph Storage Cluster assigns the ticket a + time to live. + +:Type: Double +:Default: ``60*60`` + + +.. _Storage Cluster Quick Start: ../../../start/quick-ceph-deploy/ +.. _Monitor Bootstrapping: ../../../install/manual-deployment#monitor-bootstrapping +.. _Operating a Cluster: ../../operations/operating +.. _Manual Deployment: ../../../install/manual-deployment +.. _Ceph configuration: ../ceph-conf +.. _Create an Admin Host: ../../deployment/ceph-deploy-admin +.. _Architecture - High Availability Authentication: ../../../architecture#high-availability-authentication +.. _User Management: ../../operations/user-management diff --git a/doc/rados/configuration/bluestore-config-ref.rst b/doc/rados/configuration/bluestore-config-ref.rst new file mode 100644 index 00000000..7d1c50c9 --- /dev/null +++ b/doc/rados/configuration/bluestore-config-ref.rst @@ -0,0 +1,498 @@ +========================== +BlueStore Config Reference +========================== + +Devices +======= + +BlueStore manages either one, two, or (in certain cases) three storage +devices. + +In the simplest case, BlueStore consumes a single (primary) storage device. +The storage device is normally used as a whole, occupying the full device that +is managed directly by BlueStore. This *primary device* is normally identified +by a ``block`` symlink in the data directory. + +The data directory is a ``tmpfs`` mount which gets populated (at boot time, or +when ``ceph-volume`` activates it) with all the common OSD files that hold +information about the OSD, like: its identifier, which cluster it belongs to, +and its private keyring. + +It is also possible to deploy BlueStore across two additional devices: + +* A *WAL device* (identified as ``block.wal`` in the data directory) can be + used for BlueStore's internal journal or write-ahead log. It is only useful + to use a WAL device if the device is faster than the primary device (e.g., + when it is on an SSD and the primary device is an HDD). +* A *DB device* (identified as ``block.db`` in the data directory) can be used + for storing BlueStore's internal metadata. BlueStore (or rather, the + embedded RocksDB) will put as much metadata as it can on the DB device to + improve performance. If the DB device fills up, metadata will spill back + onto the primary device (where it would have been otherwise). Again, it is + only helpful to provision a DB device if it is faster than the primary + device. 
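+
+As a quick way to see how these symlinks map to actual devices on an existing
+OSD, you can list the OSD's data directory or ask ``ceph-volume``. This is a
+minimal sketch; the OSD id (``0``) and cluster name (``ceph``) are assumptions,
+and the ``block.wal`` and ``block.db`` links are only present if those devices
+were provisioned::
+
+    ls -l /var/lib/ceph/osd/ceph-0/block*
+    ceph-volume lvm list
+
+For OSDs deployed with ``ceph-volume lvm``, the second command reports the
+logical volumes or partitions backing ``block``, ``block.db``, and
+``block.wal`` for each OSD on the host.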
+
+If there is only a small amount of fast storage available (e.g., less
+than a gigabyte), we recommend using it as a WAL device.  If there is
+more, provisioning a DB device makes more sense.  The BlueStore
+journal will always be placed on the fastest device available, so
+using a DB device will provide the same benefit that the WAL device
+would while *also* allowing additional metadata to be stored there (if
+it will fit).
+
+A single-device BlueStore OSD can be provisioned with::
+
+  ceph-volume lvm prepare --bluestore --data <device>
+
+To specify a WAL device and/or DB device, ::
+
+  ceph-volume lvm prepare --bluestore --data <device> --block.wal <wal-device> --block.db <db-device>
+
+.. note:: ``--data`` can be a Logical Volume using the vg/lv notation. Other
+          devices can be existing logical volumes or GPT partitions.
+
+Provisioning strategies
+-----------------------
+Although there are multiple ways to deploy a BlueStore OSD (unlike Filestore,
+which had only one), here are two common use cases that should help clarify
+the initial deployment strategy:
+
+.. _bluestore-single-type-device-config:
+
+**block (data) only**
+^^^^^^^^^^^^^^^^^^^^^
+If all the devices are the same type, for example all are spinning drives, and
+there are no fast devices to pair with them, it makes sense to deploy with
+block only and not try to separate ``block.db`` or ``block.wal``. The
+:ref:`ceph-volume-lvm` call for a single ``/dev/sda`` device would look like::
+
+    ceph-volume lvm create --bluestore --data /dev/sda
+
+If logical volumes have already been created for each device (one LV using
+100% of the device), then the :ref:`ceph-volume-lvm` call for an lv named
+``ceph-vg/block-lv`` would look like::
+
+    ceph-volume lvm create --bluestore --data ceph-vg/block-lv
+
+.. _bluestore-mixed-device-config:
+
+**block and block.db**
+^^^^^^^^^^^^^^^^^^^^^^
+If there is a mix of fast and slow devices (spinning and solid state),
+it is recommended to place ``block.db`` on the faster device while ``block``
+(data) lives on the slower (spinning drive). Size ``block.db`` as large as
+possible to avoid performance penalties. The ``ceph-volume`` tool is currently
+not able to create these automatically, so the volume groups and logical
+volumes need to be created manually.
+
+For the below example, let's assume 4 spinning drives (sda, sdb, sdc, and sdd)
+and 1 solid state drive (sdx).
First create the volume groups:: + + $ vgcreate ceph-block-0 /dev/sda + $ vgcreate ceph-block-1 /dev/sdb + $ vgcreate ceph-block-2 /dev/sdc + $ vgcreate ceph-block-3 /dev/sdd + +Now create the logical volumes for ``block``:: + + $ lvcreate -l 100%FREE -n block-0 ceph-block-0 + $ lvcreate -l 100%FREE -n block-1 ceph-block-1 + $ lvcreate -l 100%FREE -n block-2 ceph-block-2 + $ lvcreate -l 100%FREE -n block-3 ceph-block-3 + +We are creating 4 OSDs for the four slow spinning devices, so assuming a 200GB +SSD in ``/dev/sdx`` we will create 4 logical volumes, each of 50GB:: + + $ vgcreate ceph-db-0 /dev/sdx + $ lvcreate -L 50GB -n db-0 ceph-db-0 + $ lvcreate -L 50GB -n db-1 ceph-db-0 + $ lvcreate -L 50GB -n db-2 ceph-db-0 + $ lvcreate -L 50GB -n db-3 ceph-db-0 + +Finally, create the 4 OSDs with ``ceph-volume``:: + + $ ceph-volume lvm create --bluestore --data ceph-block-0/block-0 --block.db ceph-db-0/db-0 + $ ceph-volume lvm create --bluestore --data ceph-block-1/block-1 --block.db ceph-db-0/db-1 + $ ceph-volume lvm create --bluestore --data ceph-block-2/block-2 --block.db ceph-db-0/db-2 + $ ceph-volume lvm create --bluestore --data ceph-block-3/block-3 --block.db ceph-db-0/db-3 + +These operations should end up creating 4 OSDs, with ``block`` on the slower +spinning drives and a 50GB logical volume for each coming from the solid state +drive. + +Sizing +====== +When using a :ref:`mixed spinning and solid drive setup +` it is important to make a large-enough +``block.db`` logical volume for Bluestore. Generally, ``block.db`` should have +*as large as possible* logical volumes. + +It is recommended that the ``block.db`` size isn't smaller than 4% of +``block``. For example, if the ``block`` size is 1TB, then ``block.db`` +shouldn't be less than 40GB. + +If *not* using a mix of fast and slow devices, it isn't required to create +separate logical volumes for ``block.db`` (or ``block.wal``). Bluestore will +automatically manage these within the space of ``block``. + + +Automatic Cache Sizing +====================== + +Bluestore can be configured to automatically resize it's caches when tc_malloc +is configured as the memory allocator and the ``bluestore_cache_autotune`` +setting is enabled. This option is currently enabled by default. Bluestore +will attempt to keep OSD heap memory usage under a designated target size via +the ``osd_memory_target`` configuration option. This is a best effort +algorithm and caches will not shrink smaller than the amount specified by +``osd_memory_cache_min``. Cache ratios will be chosen based on a hierarchy +of priorities. If priority information is not availabe, the +``bluestore_cache_meta_ratio`` and ``bluestore_cache_kv_ratio`` options are +used as fallbacks. + +``bluestore_cache_autotune`` + +:Description: Automatically tune the ratios assigned to different bluestore caches while respecting minimum values. +:Type: Boolean +:Required: Yes +:Default: ``True`` + +``osd_memory_target`` + +:Description: When tcmalloc is available and cache autotuning is enabled, try to keep this many bytes mapped in memory. Note: This may not exactly match the RSS memory usage of the process. While the total amount of heap memory mapped by the process should generally stay close to this target, there is no guarantee that the kernel will actually reclaim memory that has been unmapped. During initial developement, it was found that some kernels result in the OSD's RSS Memory exceeding the mapped memory by up to 20%. 
It is hypothesised however, that the kernel generally may be more aggressive about reclaiming unmapped memory when there is a high amount of memory pressure. Your mileage may vary. +:Type: Unsigned Integer +:Required: Yes +:Default: ``4294967296`` + +``bluestore_cache_autotune_chunk_size`` + +:Description: The chunk size in bytes to allocate to caches when cache autotune is enabled. When the autotuner assigns memory to different caches, it will allocate memory in chunks. This is done to avoid evictions when there are minor fluctuations in the heap size or autotuned cache ratios. +:Type: Unsigned Integer +:Required: No +:Default: ``33554432`` + +``bluestore_cache_autotune_interval`` + +:Description: The number of seconds to wait between rebalances when cache autotune is enabled. This setting changes how quickly the ratios of the difference caches are recomputed. Note: Setting the interval too small can result in high CPU usage and lower performance. +:Type: Float +:Required: No +:Default: ``5`` + +``osd_memory_base`` + +:Description: When tcmalloc and cache autotuning is enabled, estimate the minimum amount of memory in bytes the OSD will need. This is used to help the autotuner estimate the expected aggregate memory consumption of the caches. +:Type: Unsigned Interger +:Required: No +:Default: ``805306368`` + +``osd_memory_expected_fragmentation`` + +:Description: When tcmalloc and cache autotuning is enabled, estimate the percent of memory fragmentation. This is used to help the autotuner estimate the expected aggregate memory consumption of the caches. +:Type: Float +:Required: No +:Default: ``0.15`` + +``osd_memory_cache_min`` + +:Description: When tcmalloc and cache autotuning is enabled, set the minimum amount of memory used for caches. Note: Setting this value too low can result in significant cache thrashing. +:Type: Unsigned Integer +:Required: No +:Default: ``134217728`` + +``osd_memory_cache_resize_interval`` + +:Description: When tcmalloc and cache autotuning is enabled, wait this many seconds between resizing caches. This setting changes the total amount of memory available for bluestore to use for caching. Note: Setting the interval too small can result in memory allocator thrashing and lower performance. +:Type: Float +:Required: No +:Default: ``1`` + + +Manual Cache Sizing +=================== + +The amount of memory consumed by each OSD for BlueStore's cache is +determined by the ``bluestore_cache_size`` configuration option. If +that config option is not set (i.e., remains at 0), there is a +different default value that is used depending on whether an HDD or +SSD is used for the primary device (set by the +``bluestore_cache_size_ssd`` and ``bluestore_cache_size_hdd`` config +options). + +BlueStore and the rest of the Ceph OSD does the best it can currently +to stick to the budgeted memory. Note that on top of the configured +cache size, there is also memory consumed by the OSD itself, and +generally some overhead due to memory fragmentation and other +allocator overhead. + +The configured cache memory budget can be used in a few different ways: + +* Key/Value metadata (i.e., RocksDB's internal cache) +* BlueStore metadata +* BlueStore data (i.e., recently read or written object data) + +Cache memory usage is governed by the following options: +``bluestore_cache_meta_ratio`` and ``bluestore_cache_kv_ratio``. 
+The fraction of the cache devoted to data +is governed by the effective bluestore cache size (depending on +``bluestore_cache_size[_ssd|_hdd]`` settings and the device class of the primary +device) as well as the meta and kv ratios. +The data fraction can be calculated by +`` * (1 - bluestore_cache_meta_ratio - bluestore_cache_kv_ratio)`` + +``bluestore_cache_size`` + +:Description: The amount of memory BlueStore will use for its cache. If zero, ``bluestore_cache_size_hdd`` or ``bluestore_cache_size_ssd`` will be used instead. +:Type: Unsigned Integer +:Required: Yes +:Default: ``0`` + +``bluestore_cache_size_hdd`` + +:Description: The default amount of memory BlueStore will use for its cache when backed by an HDD. +:Type: Unsigned Integer +:Required: Yes +:Default: ``1 * 1024 * 1024 * 1024`` (1 GB) + +``bluestore_cache_size_ssd`` + +:Description: The default amount of memory BlueStore will use for its cache when backed by an SSD. +:Type: Unsigned Integer +:Required: Yes +:Default: ``3 * 1024 * 1024 * 1024`` (3 GB) + +``bluestore_cache_meta_ratio`` + +:Description: The ratio of cache devoted to metadata. +:Type: Floating point +:Required: Yes +:Default: ``.4`` + +``bluestore_cache_kv_ratio`` + +:Description: The ratio of cache devoted to key/value data (rocksdb). +:Type: Floating point +:Required: Yes +:Default: ``.4`` + +``bluestore_cache_kv_max`` + +:Description: The maximum amount of cache devoted to key/value data (rocksdb). +:Type: Unsigned Integer +:Required: Yes +:Default: ``512 * 1024*1024`` (512 MB) + + +Checksums +========= + +BlueStore checksums all metadata and data written to disk. Metadata +checksumming is handled by RocksDB and uses `crc32c`. Data +checksumming is done by BlueStore and can make use of `crc32c`, +`xxhash32`, or `xxhash64`. The default is `crc32c` and should be +suitable for most purposes. + +Full data checksumming does increase the amount of metadata that +BlueStore must store and manage. When possible, e.g., when clients +hint that data is written and read sequentially, BlueStore will +checksum larger blocks, but in many cases it must store a checksum +value (usually 4 bytes) for every 4 kilobyte block of data. + +It is possible to use a smaller checksum value by truncating the +checksum to two or one byte, reducing the metadata overhead. The +trade-off is that the probability that a random error will not be +detected is higher with a smaller checksum, going from about one in +four billion with a 32-bit (4 byte) checksum to one in 65,536 for a +16-bit (2 byte) checksum or one in 256 for an 8-bit (1 byte) checksum. +The smaller checksum values can be used by selecting `crc32c_16` or +`crc32c_8` as the checksum algorithm. + +The *checksum algorithm* can be set either via a per-pool +``csum_type`` property or the global config option. For example, :: + + ceph osd pool set csum_type + +``bluestore_csum_type`` + +:Description: The default checksum algorithm to use. +:Type: String +:Required: Yes +:Valid Settings: ``none``, ``crc32c``, ``crc32c_16``, ``crc32c_8``, ``xxhash32``, ``xxhash64`` +:Default: ``crc32c`` + + +Inline Compression +================== + +BlueStore supports inline compression using `snappy`, `zlib`, or +`lz4`. Please note that the `lz4` compression plugin is not +distributed in the official release. + +Whether data in BlueStore is compressed is determined by a combination +of the *compression mode* and any hints associated with a write +operation. The modes are: + +* **none**: Never compress data. 
+* **passive**: Do not compress data unless the write operation has a + *compressible* hint set. +* **aggressive**: Compress data unless the write operation has an + *incompressible* hint set. +* **force**: Try to compress data no matter what. + +For more information about the *compressible* and *incompressible* IO +hints, see :c:func:`rados_set_alloc_hint`. + +Note that regardless of the mode, if the size of the data chunk is not +reduced sufficiently it will not be used and the original +(uncompressed) data will be stored. For example, if the ``bluestore +compression required ratio`` is set to ``.7`` then the compressed data +must be 70% of the size of the original (or smaller). + +The *compression mode*, *compression algorithm*, *compression required +ratio*, *min blob size*, and *max blob size* can be set either via a +per-pool property or a global config option. Pool properties can be +set with:: + + ceph osd pool set compression_algorithm + ceph osd pool set compression_mode + ceph osd pool set compression_required_ratio + ceph osd pool set compression_min_blob_size + ceph osd pool set compression_max_blob_size + +``bluestore compression algorithm`` + +:Description: The default compressor to use (if any) if the per-pool property + ``compression_algorithm`` is not set. Note that zstd is *not* + recommended for bluestore due to high CPU overhead when + compressing small amounts of data. +:Type: String +:Required: No +:Valid Settings: ``lz4``, ``snappy``, ``zlib``, ``zstd`` +:Default: ``snappy`` + +``bluestore compression mode`` + +:Description: The default policy for using compression if the per-pool property + ``compression_mode`` is not set. ``none`` means never use + compression. ``passive`` means use compression when + :c:func:`clients hint ` that data is + compressible. ``aggressive`` means use compression unless + clients hint that data is not compressible. ``force`` means use + compression under all circumstances even if the clients hint that + the data is not compressible. +:Type: String +:Required: No +:Valid Settings: ``none``, ``passive``, ``aggressive``, ``force`` +:Default: ``none`` + +``bluestore compression required ratio`` + +:Description: The ratio of the size of the data chunk after + compression relative to the original size must be at + least this small in order to store the compressed + version. + +:Type: Floating point +:Required: No +:Default: .875 + +``bluestore compression min blob size`` + +:Description: Chunks smaller than this are never compressed. + The per-pool property ``compression_min_blob_size`` overrides + this setting. + +:Type: Unsigned Integer +:Required: No +:Default: 0 + +``bluestore compression min blob size hdd`` + +:Description: Default value of ``bluestore compression min blob size`` + for rotational media. + +:Type: Unsigned Integer +:Required: No +:Default: 128K + +``bluestore compression min blob size ssd`` + +:Description: Default value of ``bluestore compression min blob size`` + for non-rotational (solid state) media. + +:Type: Unsigned Integer +:Required: No +:Default: 8K + +``bluestore compression max blob size`` + +:Description: Chunks larger than this are broken into smaller blobs sizing + ``bluestore compression max blob size`` before being compressed. + The per-pool property ``compression_max_blob_size`` overrides + this setting. + +:Type: Unsigned Integer +:Required: No +:Default: 0 + +``bluestore compression max blob size hdd`` + +:Description: Default value of ``bluestore compression max blob size`` + for rotational media. 
+ +:Type: Unsigned Integer +:Required: No +:Default: 512K + +``bluestore compression max blob size ssd`` + +:Description: Default value of ``bluestore compression max blob size`` + for non-rotational (solid state) media. + +:Type: Unsigned Integer +:Required: No +:Default: 64K + +SPDK Usage +================== + +If you want to use SPDK driver for NVME SSD, you need to ready your system. +Please refer to `SPDK document`__ for more details. + +.. __: http://www.spdk.io/doc/getting_started.html#getting_started_examples + +SPDK offers a script to configure the device automatically. Users can run the +script as root:: + + $ sudo src/spdk/scripts/setup.sh + +Then you need to specify NVMe device's device selector here with "spdk:" prefix for +``bluestore_block_path``. + +For example, users can find the device selector of an Intel PCIe SSD with:: + + $ lspci -mm -n -D -d 8086:0953 + +The device selector always has the form of ``DDDD:BB:DD.FF`` or ``DDDD.BB.DD.FF``. + +and then set:: + + bluestore block path = spdk:0000:01:00.0 + +Where ``0000:01:00.0`` is the device selector found in the output of ``lspci`` +command above. + +If you want to run multiple SPDK instances per node, you must specify the +amount of dpdk memory size in MB each instance will use, to make sure each +instance uses its own dpdk memory + +In most cases, we only need one device to serve as data, db, db wal purposes. +We need to make sure configurations below to make sure all IOs issued under +SPDK.:: + + bluestore_block_db_path = "" + bluestore_block_db_size = 0 + bluestore_block_wal_path = "" + bluestore_block_wal_size = 0 + +Otherwise, the current implementation will setup symbol file to kernel +filesystem location and uses kernel driver to issue DB/WAL IO. diff --git a/doc/rados/configuration/ceph-conf.rst b/doc/rados/configuration/ceph-conf.rst new file mode 100644 index 00000000..24c9afa4 --- /dev/null +++ b/doc/rados/configuration/ceph-conf.rst @@ -0,0 +1,496 @@ +.. _configuring-ceph: + +================== + Configuring Ceph +================== + +When you start the Ceph service, the initialization process activates a series +of daemons that run in the background. A :term:`Ceph Storage Cluster` runs +three types of daemons: + +- :term:`Ceph Monitor` (``ceph-mon``) +- :term:`Ceph Manager` (``ceph-mgr``) +- :term:`Ceph OSD Daemon` (``ceph-osd``) + +Ceph Storage Clusters that support the :term:`Ceph Filesystem` run at +least one :term:`Ceph Metadata Server` (``ceph-mds``). Clusters that +support :term:`Ceph Object Storage` run Ceph Gateway daemons +(``radosgw``). + +Each daemon has a series of configuration options, each of which has a +default values. You may adjust the behavior of the system by changing these +configuration options. + +Option names +============ + +All Ceph configuration options have a unique name consisting of words +formed with lower-case characters and connected with underscore +(``_``) characters. + +When option names are specified on the command line, either underscore +(``_``) or dash (``-``) characters can be used interchangeable (e.g., +``--mon-host`` is equivalent to ``--mon_host``). + +When option names appear in configuration files, spaces can also be +used in place of underscore or dash. + +Config sources +============== + +Each Ceph daemon, process, and library will pull its configuration +from several sources, listed below. Sources later in the list will +override those earlier in the list when both are present. 
+ +- the compiled-in default value +- the monitor cluster's centralized configuration database +- a configuration file stored on the local host +- environment variables +- command line arguments +- runtime overrides set by an administrator + +One of the first things a Ceph process does on startup is parse the +configuration options provided via the command line, environment, and +local configuration file. The process will then contact the monitor +cluster to retrieve configuration stored centrally for the entire +cluster. Once a complete view of the configuration is available, the +daemon or process startup will proceed. + +Bootstrap options +----------------- + +Because some configuration options affect the process's ability to +contact the monitors, authenticate, and retrieve the cluster-stored +configuration, they may need to be stored locally on the node and set +in a local configuration file. These options include: + + - ``mon_host``, the list of monitors for the cluster + - ``mon_host_override``, the list of monitors for the cluster to + **initially** contact when beginning a new instance of communication with the + Ceph cluster. This overrides the known monitor list derived from MonMap + updates sent to older Ceph instances (like librados cluster handles). It is + expected this option is primarily useful for debugging. + - ``mon_dns_serv_name`` (default: `ceph-mon`), the name of the DNS + SRV record to check to identify the cluster monitors via DNS + - ``mon_data``, ``osd_data``, ``mds_data``, ``mgr_data``, and + similar options that define which local directory the daemon + stores its data in. + - ``keyring``, ``keyfile``, and/or ``key``, which can be used to + specify the authentication credential to use to authenticate with + the monitor. Note that in most cases the default keyring location + is in the data directory specified above. + +In the vast majority of cases the default values of these are +appropriate, with the exception of the ``mon_host`` option that +identifies the addresses of the cluster's monitors. When DNS is used +to identify monitors a local ceph configuration file can be avoided +entirely. + +Skipping monitor config +----------------------- + +Any process may be passed the option ``--no-mon-config`` to skip the +step that retrieves configuration from the cluster monitors. This is +useful in cases where configuration is managed entirely via +configuration files or where the monitor cluster is currently down but +some maintenance activity needs to be done. + + +.. _ceph-conf-file: + + +Configuration sections +====================== + +Any given process or daemon has a single value for each configuration +option. However, values for an option may vary across different +daemon types even daemons of the same type. Ceph options that are +stored in the monitor configuration database or in local configuration +files are grouped into sections to indicate which daemons or clients +they apply to. + +These sections include: + +``global`` + +:Description: Settings under ``global`` affect all daemons and clients + in a Ceph Storage Cluster. + +:Example: ``log_file = /var/log/ceph/$cluster-$type.$id.log`` + +``mon`` + +:Description: Settings under ``mon`` affect all ``ceph-mon`` daemons in + the Ceph Storage Cluster, and override the same setting in + ``global``. 
+ +:Example: ``mon_cluster_log_to_syslog = true`` + + +``mgr`` + +:Description: Settings in the ``mgr`` section affect all ``ceph-mgr`` daemons in + the Ceph Storage Cluster, and override the same setting in + ``global``. + +:Example: ``mgr_stats_period = 10`` + +``osd`` + +:Description: Settings under ``osd`` affect all ``ceph-osd`` daemons in + the Ceph Storage Cluster, and override the same setting in + ``global``. + +:Example: ``osd_op_queue = wpq`` + +``mds`` + +:Description: Settings in the ``mds`` section affect all ``ceph-mds`` daemons in + the Ceph Storage Cluster, and override the same setting in + ``global``. + +:Example: ``mds_cache_size = 10G`` + +``client`` + +:Description: Settings under ``client`` affect all Ceph Clients + (e.g., mounted Ceph Filesystems, mounted Ceph Block Devices, + etc.) as well as Rados Gateway (RGW) daemons. + +:Example: ``objecter_inflight_ops = 512`` + + +Sections may also specify an individual daemon or client name. For example, +``mon.foo``, ``osd.123``, and ``client.smith`` are all valid section names. + + +Any given daemon will draw its settings from the global section, the +daemon or client type section, and the section sharing its name. +Settings in the most-specific section take precedence, so for example +if the same option is specified in both ``global``, ``mon``, and +``mon.foo`` on the same source (i.e., in the same configurationfile), +the ``mon.foo`` value will be used. + +Note that values from the local configuration file always take +precedence over values from the monitor configuration database, +regardless of which section they appear in. + + +.. _ceph-metavariables: + +Metavariables +============= + +Metavariables simplify Ceph Storage Cluster configuration +dramatically. When a metavariable is set in a configuration value, +Ceph expands the metavariable into a concrete value at the time the +configuration value is used. Ceph metavariables are similar to variable expansion in the Bash shell. + +Ceph supports the following metavariables: + +``$cluster`` + +:Description: Expands to the Ceph Storage Cluster name. Useful when running + multiple Ceph Storage Clusters on the same hardware. + +:Example: ``/etc/ceph/$cluster.keyring`` +:Default: ``ceph`` + + +``$type`` + +:Description: Expands to a daemon or process type (e.g., ``mds``, ``osd``, or ``mon``) + +:Example: ``/var/lib/ceph/$type`` + + +``$id`` + +:Description: Expands to the daemon or client identifier. For + ``osd.0``, this would be ``0``; for ``mds.a``, it would + be ``a``. + +:Example: ``/var/lib/ceph/$type/$cluster-$id`` + + +``$host`` + +:Description: Expands to the host name where the process is running. + + +``$name`` + +:Description: Expands to ``$type.$id``. +:Example: ``/var/run/ceph/$cluster-$name.asok`` + +``$pid`` + +:Description: Expands to daemon pid. +:Example: ``/var/run/ceph/$cluster-$name-$pid.asok`` + + + +The Configuration File +====================== + +On startup, Ceph processes search for a configuration file in the +following locations: + +#. ``$CEPH_CONF`` (*i.e.,* the path following the ``$CEPH_CONF`` + environment variable) +#. ``-c path/path`` (*i.e.,* the ``-c`` command line argument) +#. ``/etc/ceph/$cluster.conf`` +#. ``~/.ceph/$cluster.conf`` +#. ``./$cluster.conf`` (*i.e.,* in the current working directory) +#. On FreeBSD systems only, ``/usr/local/etc/ceph/$cluster.conf`` + +where ``$cluster`` is the cluster's name (default ``ceph``). + +The Ceph configuration file uses an *ini* style syntax. 
You can add comments +by preceding comments with a pound sign (#) or a semi-colon (;). For example: + +.. code-block:: ini + + # <--A number (#) sign precedes a comment. + ; A comment may be anything. + # Comments always follow a semi-colon (;) or a pound (#) on each line. + # The end of the line terminates a comment. + # We recommend that you provide comments in your configuration file(s). + + +.. _ceph-conf-settings: + +Config file section names +------------------------- + +The configuration file is divided into sections. Each section must begin with a +valid configuration section name (see `Configuration sections`_, above) +surrounded by square brackets. For example, + +.. code-block:: ini + + [global] + debug ms = 0 + + [osd] + debug ms = 1 + + [osd.1] + debug ms = 10 + + [osd.2] + debug ms = 10 + + + +Monitor configuration database +============================== + +The monitor cluster manages a database of configuration options that +can be consumed by the entire cluster, enabling streamlined central +configuration management for the entire system. The vast majority of +configuration options can and should be stored here for ease of +administration and transparency. + +A handful of settings may still need to be stored in local +configuration files because they affect the ability to connect to the +monitors, authenticate, and fetch configuration information. In most +cases this is limited to the ``mon_host`` option, although this can +also be avoided through the use of DNS SRV records. + +Sections and masks +------------------ + +Configuration options stored by the monitor can live in a global +section, daemon type section, or specific daemon section, just like +options in a configuration file can. + +In addition, options may also have a *mask* associated with them to +further restrict which daemons or clients the option applies to. +Masks take two forms: + +#. ``type:location`` where *type* is a CRUSH property like `rack` or + `host`, and *location* is a value for that property. For example, + ``host:foo`` would limit the option only to daemons or clients + running on a particular host. +#. ``class:device-class`` where *device-class* is the name of a CRUSH + device class (e.g., ``hdd`` or ``ssd``). For example, + ``class:ssd`` would limit the option only to OSDs backed by SSDs. + (This mask has no effect for non-OSD daemons or clients.) + +When setting a configuration option, the `who` may be a section name, +a mask, or a combination of both separated by a slash (``/``) +character. For example, ``osd/rack:foo`` would mean all OSD daemons +in the ``foo`` rack. + +When viewing configuration options, the section name and mask are +generally separated out into separate fields or columns to ease readability. + + +Commands +-------- + +The following CLI commands are used to configure the cluster: + +* ``ceph config dump`` will dump the entire configuration database for + the cluster. + +* ``ceph config get `` will dump the configuration for a specific + daemon or client (e.g., ``mds.a``), as stored in the monitors' + configuration database. + +* ``ceph config set