author    | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-04-21 11:54:28 +0000
committer | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-04-21 11:54:28 +0000
commit    | e6918187568dbd01842d8d1d2c808ce16a894239 (patch)
tree      | 64f88b554b444a49f656b6c656111a145cbbaa28 /doc/rados
parent    | Initial commit. (diff)
download  | ceph-e6918187568dbd01842d8d1d2c808ce16a894239.tar.xz
          | ceph-e6918187568dbd01842d8d1d2c808ce16a894239.zip
Adding upstream version 18.2.2. (upstream/18.2.2)
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'doc/rados')
70 files changed, 23619 insertions, 0 deletions
diff --git a/doc/rados/api/index.rst b/doc/rados/api/index.rst new file mode 100644 index 000000000..5422ce871 --- /dev/null +++ b/doc/rados/api/index.rst @@ -0,0 +1,25 @@ +.. _rados api: + +=========================== + Ceph Storage Cluster APIs +=========================== + +The :term:`Ceph Storage Cluster` has a messaging layer protocol that enables +clients to interact with a :term:`Ceph Monitor` and a :term:`Ceph OSD Daemon`. +``librados`` provides this functionality to :term:`Ceph Client`\s in the form of +a library. All Ceph Clients either use ``librados`` or the same functionality +encapsulated in ``librados`` to interact with the object store. For example, +``librbd`` and ``libcephfs`` leverage this functionality. You may use +``librados`` to interact with Ceph directly (e.g., an application that talks to +Ceph, your own interface to Ceph, etc.). + + +.. toctree:: + :maxdepth: 2 + + Introduction to librados <librados-intro> + librados (C) <librados> + librados (C++) <libradospp> + librados (Python) <python> + libcephsqlite (SQLite) <libcephsqlite> + object class <objclass-sdk> diff --git a/doc/rados/api/libcephsqlite.rst b/doc/rados/api/libcephsqlite.rst new file mode 100644 index 000000000..beee4a466 --- /dev/null +++ b/doc/rados/api/libcephsqlite.rst @@ -0,0 +1,454 @@ +.. _libcephsqlite: + +================ + Ceph SQLite VFS +================ + +This `SQLite VFS`_ may be used for storing and accessing a `SQLite`_ database +backed by RADOS. This allows you to fully decentralize your database using +Ceph's object store for improved availability, accessibility, and use of +storage. + +Note what this is not: a distributed SQL engine. SQLite on RADOS can be thought +of like RBD as compared to CephFS: RBD puts a disk image on RADOS for the +purposes of exclusive access by a machine and generally does not allow parallel +access by other machines; on the other hand, CephFS allows fully distributed +access to a file system from many client mounts. SQLite on RADOS is meant to be +accessed by a single SQLite client database connection at a given time. The +database may be manipulated safely by multiple clients only in a serial fashion +controlled by RADOS locks managed by the Ceph SQLite VFS. + + +Usage +^^^^^ + +Normal unmodified applications (including the sqlite command-line toolset +binary) may load the *ceph* VFS using the `SQLite Extension Loading API`_. + +.. code:: sql + + .LOAD libcephsqlite.so + +or during the invocation of ``sqlite3`` + +.. code:: sh + + sqlite3 -cmd '.load libcephsqlite.so' + +A database file is formatted as a SQLite URI:: + + file:///<"*"poolid|poolname>:[namespace]/<dbname>?vfs=ceph + +The RADOS ``namespace`` is optional. Note the triple ``///`` in the path. The URI +authority must be empty or localhost in SQLite. Only the path part of the URI +is parsed. For this reason, the URI will not parse properly if you only use two +``//``. + +A complete example of (optionally) creating a database and opening: + +.. code:: sh + + sqlite3 -cmd '.load libcephsqlite.so' -cmd '.open file:///foo:bar/baz.db?vfs=ceph' + +Note you cannot specify the database file as the normal positional argument to +``sqlite3``. This is because the ``.load libcephsqlite.so`` command is applied +after opening the database, but opening the database depends on the extension +being loaded first. + +An example passing the pool integer id and no RADOS namespace: + +.. 
code:: sh + + sqlite3 -cmd '.load libcephsqlite.so' -cmd '.open file:///*2:/baz.db?vfs=ceph' + +Like other Ceph tools, the *ceph* VFS looks at some environment variables that +help with configuring which Ceph cluster to communicate with and which +credential to use. Here would be a typical configuration: + +.. code:: sh + + export CEPH_CONF=/path/to/ceph.conf + export CEPH_KEYRING=/path/to/ceph.keyring + export CEPH_ARGS='--id myclientid' + ./runmyapp + # or + sqlite3 -cmd '.load libcephsqlite.so' -cmd '.open file:///foo:bar/baz.db?vfs=ceph' + +The default operation would look at the standard Ceph configuration file path +using the ``client.admin`` user. + + +User +^^^^ + +The *ceph* VFS requires a user credential with read access to the monitors, the +ability to blocklist dead clients of the database, and access to the OSDs +hosting the database. This can be done with authorizations as simply as: + +.. code:: sh + + ceph auth get-or-create client.X mon 'allow r, allow command "osd blocklist" with blocklistop=add' osd 'allow rwx' + +.. note:: The terminology change from ``blacklist`` to ``blocklist``; older clusters may require using the old terms. + +You may also simplify using the ``simple-rados-client-with-blocklist`` profile: + +.. code:: sh + + ceph auth get-or-create client.X mon 'profile simple-rados-client-with-blocklist' osd 'allow rwx' + +To learn why blocklisting is necessary, see :ref:`libcephsqlite-corrupt`. + + +Page Size +^^^^^^^^^ + +SQLite allows configuring the page size prior to creating a new database. It is +advisable to increase this config to 65536 (64K) when using RADOS backed +databases to reduce the number of OSD reads/writes and thereby improve +throughput and latency. + +.. code:: sql + + PRAGMA page_size = 65536 + +You may also try other values according to your application needs but note that +64K is the max imposed by SQLite. + + +Cache +^^^^^ + +The ceph VFS does not do any caching of reads or buffering of writes. Instead, +and more appropriately, the SQLite page cache is used. You may find it is too small +for most workloads and should therefore increase it significantly: + + +.. code:: sql + + PRAGMA cache_size = 4096 + +Which will cache 4096 pages or 256MB (with 64K ``page_cache``). + + +Journal Persistence +^^^^^^^^^^^^^^^^^^^ + +By default, SQLite deletes the journal for every transaction. This can be +expensive as the *ceph* VFS must delete every object backing the journal for each +transaction. For this reason, it is much faster and simpler to ask SQLite to +**persist** the journal. In this mode, SQLite will invalidate the journal via a +write to its header. This is done as: + +.. code:: sql + + PRAGMA journal_mode = PERSIST + +The cost of this may be increased unused space according to the high-water size +of the rollback journal (based on transaction type and size). + + +Exclusive Lock Mode +^^^^^^^^^^^^^^^^^^^ + +SQLite operates in a ``NORMAL`` locking mode where each transaction requires +locking the backing database file. This can add unnecessary overhead to +transactions when you know there's only ever one user of the database at a +given time. You can have SQLite lock the database once for the duration of the +connection using: + +.. code:: sql + + PRAGMA locking_mode = EXCLUSIVE + +This can more than **halve** the time taken to perform a transaction. Keep in +mind this prevents other clients from accessing the database. 
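+
+A minimal end-to-end sketch of the settings above, using Python's built-in
+``sqlite3`` module, is shown below. This is illustrative only: it assumes
+``libcephsqlite.so`` can be found by the dynamic loader, that the Python
+``sqlite3`` module was built with extension loading enabled, and that a pool
+named ``foo`` with namespace ``bar`` already exists.
+
+.. code:: python
+
+    import sqlite3
+
+    # Loading the extension registers the "ceph" VFS for the whole process.
+    # Keep this bootstrap connection open while the VFS is in use.
+    loader = sqlite3.connect(":memory:")
+    loader.enable_load_extension(True)
+    loader.load_extension("libcephsqlite.so")
+
+    # Open the database through the ceph VFS (note the triple slash).
+    db = sqlite3.connect("file:///foo:bar/baz.db?vfs=ceph", uri=True)
+
+    # Apply the recommendations discussed above: 64K pages, a larger page
+    # cache, a persisted rollback journal, and an exclusive lock held for
+    # the lifetime of the connection.
+    db.execute("PRAGMA page_size = 65536")
+    db.execute("PRAGMA cache_size = 4096")
+    db.execute("PRAGMA journal_mode = PERSIST")
+    db.execute("PRAGMA locking_mode = EXCLUSIVE")
+
+    db.execute("CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v TEXT)")
+    db.execute("INSERT OR REPLACE INTO kv VALUES (?, ?)", ("greeting", "hello"))
+    db.commit()
+
+    db.close()
+    loader.close()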
+ +In this locking mode, each write transaction to the database requires 3 +synchronization events: once to write to the journal, another to write to the +database file, and a final write to invalidate the journal header (in +``PERSIST`` journaling mode). + + +WAL Journal +^^^^^^^^^^^ + +The `WAL Journal Mode`_ is only available when SQLite is operating in exclusive +lock mode. This is because it requires shared memory communication with other +readers and writers when in the ``NORMAL`` locking mode. + +As with local disk databases, WAL mode may significantly reduce small +transaction latency. Testing has shown it can provide more than 50% speedup +over persisted rollback journals in exclusive locking mode. You can expect +around 150-250 transactions per second depending on size. + + +Performance Notes +^^^^^^^^^^^^^^^^^ + +The filing backend for the database on RADOS is asynchronous as much as +possible. Still, performance can be anywhere from 3x-10x slower than a local +database on SSD. Latency can be a major factor. It is advisable to be familiar +with SQL transactions and other strategies for efficient database updates. +Depending on the performance of the underlying pool, you can expect small +transactions to take up to 30 milliseconds to complete. If you use the +``EXCLUSIVE`` locking mode, it can be reduced further to 15 milliseconds per +transaction. A WAL journal in ``EXCLUSIVE`` locking mode can further reduce +this as low as ~2-5 milliseconds (or the time to complete a RADOS write; you +won't get better than that!). + +There is no limit to the size of a SQLite database on RADOS imposed by the Ceph +VFS. There are standard `SQLite Limits`_ to be aware of, notably the maximum +database size of 281 TB. Large databases may or may not be performant on Ceph. +Experimentation for your own use-case is advised. + +Be aware that read-heavy queries could take significant amounts of time as +reads are necessarily synchronous (due to the VFS API). No readahead is yet +performed by the VFS. + + +Recommended Use-Cases +^^^^^^^^^^^^^^^^^^^^^ + +The original purpose of this module was to support saving relational or large +data in RADOS which needs to span multiple objects. Many current applications +with trivial state try to use RADOS omap storage on a single object but this +cannot scale without striping data across multiple objects. Unfortunately, it +is non-trivial to design a store spanning multiple objects which is consistent +and also simple to use. SQLite can be used to bridge that gap. + + +Parallel Access +^^^^^^^^^^^^^^^ + +The VFS does not yet support concurrent readers. All database access is protected +by a single exclusive lock. + + +Export or Extract Database out of RADOS +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The database is striped on RADOS and can be extracted using the RADOS cli toolset. + +.. code:: sh + + rados --pool=foo --striper get bar.db local-bar.db + rados --pool=foo --striper get bar.db-journal local-bar.db-journal + sqlite3 local-bar.db ... + +Keep in mind the rollback journal is also striped and will need to be extracted +as well if the database was in the middle of a transaction. If you're using +WAL, that journal will need to be extracted as well. + +Keep in mind that extracting the database using the striper uses the same RADOS +locks as those used by the *ceph* VFS. However, the journal file locks are not +used by the *ceph* VFS (SQLite only locks the main database file) so there is a +potential race with other SQLite clients when extracting both files. 
That could +result in fetching a corrupt journal. + +Instead of manually extracting the files, it would be more advisable to use the +`SQLite Backup`_ mechanism instead. + + +Temporary Tables +^^^^^^^^^^^^^^^^ + +Temporary tables backed by the ceph VFS are not supported. The main reason for +this is that the VFS lacks context about where it should put the database, i.e. +which RADOS pool. The persistent database associated with the temporary +database is not communicated via the SQLite VFS API. + +Instead, it's suggested to attach a secondary local or `In-Memory Database`_ +and put the temporary tables there. Alternatively, you may set a connection +pragma: + +.. code:: sql + + PRAGMA temp_store=memory + + +.. _libcephsqlite-breaking-locks: + +Breaking Locks +^^^^^^^^^^^^^^ + +Access to the database file is protected by an exclusive lock on the first +object stripe of the database. If the application fails without unlocking the +database (e.g. a segmentation fault), the lock is not automatically unlocked, +even if the client connection is blocklisted afterward. Eventually, the lock +will timeout subject to the configurations:: + + cephsqlite_lock_renewal_timeout = 30000 + +The timeout is in milliseconds. Once the timeout is reached, the OSD will +expire the lock and allow clients to relock. When this occurs, the database +will be recovered by SQLite and the in-progress transaction rolled back. The +new client recovering the database will also blocklist the old client to +prevent potential database corruption from rogue writes. + +The holder of the exclusive lock on the database will periodically renew the +lock so it does not lose the lock. This is necessary for large transactions or +database connections operating in ``EXCLUSIVE`` locking mode. The lock renewal +interval is adjustable via:: + + cephsqlite_lock_renewal_interval = 2000 + +This configuration is also in units of milliseconds. + +It is possible to break the lock early if you know the client is gone for good +(e.g. blocklisted). This allows restoring database access to clients +immediately. For example: + +.. code:: sh + + $ rados --pool=foo --namespace bar lock info baz.db.0000000000000000 striper.lock + {"name":"striper.lock","type":"exclusive","tag":"","lockers":[{"name":"client.4463","cookie":"555c7208-db39-48e8-a4d7-3ba92433a41a","description":"SimpleRADOSStriper","expiration":"0.000000","addr":"127.0.0.1:0/1831418345"}]} + + $ rados --pool=foo --namespace bar lock break baz.db.0000000000000000 striper.lock client.4463 --lock-cookie 555c7208-db39-48e8-a4d7-3ba92433a41a + +.. _libcephsqlite-corrupt: + +How to Corrupt Your Database +^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +There is the usual reading on `How to Corrupt Your SQLite Database`_ that you +should review before using this tool. To add to that, the most likely way you +may corrupt your database is by a rogue process transiently losing network +connectivity and then resuming its work. The exclusive RADOS lock it held will +be lost but it cannot know that immediately. Any work it might do after +regaining network connectivity could corrupt the database. + +The *ceph* VFS library defaults do not allow for this scenario to occur. The Ceph +VFS will blocklist the last owner of the exclusive lock on the database if it +detects incomplete cleanup. + +By blocklisting the old client, it's no longer possible for the old client to +resume its work on the database when it returns (subject to blocklist +expiration, 3600 seconds by default). 
To turn off blocklisting the prior client, change:: + + cephsqlite_blocklist_dead_locker = false + +Do NOT do this unless you know database corruption cannot result due to other +guarantees. If this config is true (the default), the *ceph* VFS will cowardly +fail if it cannot blocklist the prior instance (due to lack of authorization, +for example). + +One example where out-of-band mechanisms exist to blocklist the last dead +holder of the exclusive lock on the database is in the ``ceph-mgr``. The +monitors are made aware of the RADOS connection used for the *ceph* VFS and will +blocklist the instance during ``ceph-mgr`` failover. This prevents a zombie +``ceph-mgr`` from continuing work and potentially corrupting the database. For +this reason, it is not necessary for the *ceph* VFS to do the blocklist command +in the new instance of the ``ceph-mgr`` (but it still does so, harmlessly). + +To blocklist the *ceph* VFS manually, you may see the instance address of the +*ceph* VFS using the ``ceph_status`` SQL function: + +.. code:: sql + + SELECT ceph_status(); + +.. code:: + + {"id":788461300,"addr":"172.21.10.4:0/1472139388"} + +You may easily manipulate that information using the `JSON1 extension`_: + +.. code:: sql + + SELECT json_extract(ceph_status(), '$.addr'); + +.. code:: + + 172.21.10.4:0/3563721180 + +This is the address you would pass to the ceph blocklist command: + +.. code:: sh + + ceph osd blocklist add 172.21.10.4:0/3082314560 + + +Performance Statistics +^^^^^^^^^^^^^^^^^^^^^^ + +The *ceph* VFS provides a SQLite function, ``ceph_perf``, for querying the +performance statistics of the VFS. The data is from "performance counters" as +in other Ceph services normally queried via an admin socket. + +.. code:: sql + + SELECT ceph_perf(); + +.. code:: + + {"libcephsqlite_vfs":{"op_open":{"avgcount":2,"sum":0.150001291,"avgtime":0.075000645},"op_delete":{"avgcount":0,"sum":0.000000000,"avgtime":0.000000000},"op_access":{"avgcount":1,"sum":0.003000026,"avgtime":0.003000026},"op_fullpathname":{"avgcount":1,"sum":0.064000551,"avgtime":0.064000551},"op_currenttime":{"avgcount":0,"sum":0.000000000,"avgtime":0.000000000},"opf_close":{"avgcount":1,"sum":0.000000000,"avgtime":0.000000000},"opf_read":{"avgcount":3,"sum":0.036000310,"avgtime":0.012000103},"opf_write":{"avgcount":0,"sum":0.000000000,"avgtime":0.000000000},"opf_truncate":{"avgcount":0,"sum":0.000000000,"avgtime":0.000000000},"opf_sync":{"avgcount":0,"sum":0.000000000,"avgtime":0.000000000},"opf_filesize":{"avgcount":2,"sum":0.000000000,"avgtime":0.000000000},"opf_lock":{"avgcount":1,"sum":0.158001360,"avgtime":0.158001360},"opf_unlock":{"avgcount":1,"sum":0.101000871,"avgtime":0.101000871},"opf_checkreservedlock":{"avgcount":1,"sum":0.002000017,"avgtime":0.002000017},"opf_filecontrol":{"avgcount":4,"sum":0.000000000,"avgtime":0.000000000},"opf_sectorsize":{"avgcount":0,"sum":0.000000000,"avgtime":0.000000000},"opf_devicecharacteristics":{"avgcount":4,"sum":0.000000000,"avgtime":0.000000000}},"libcephsqlite_striper":{"update_metadata":0,"update_allocated":0,"update_size":0,"update_version":0,"shrink":0,"shrink_bytes":0,"lock":1,"unlock":1}} + +You may easily manipulate that information using the `JSON1 extension`_: + +.. code:: sql + + SELECT json_extract(ceph_perf(), '$.libcephsqlite_vfs.opf_sync.avgcount'); + +.. code:: + + 776 + +That tells you the number of times SQLite has called the xSync method of the +`SQLite IO Methods`_ of the VFS (for **all** open database connections in the +process). 
You could analyze the performance stats before and after a number of +queries to see the number of file system syncs required (this would just be +proportional to the number of transactions). Alternatively, you may be more +interested in the average latency to complete a write: + +.. code:: sql + + SELECT json_extract(ceph_perf(), '$.libcephsqlite_vfs.opf_write'); + +.. code:: + + {"avgcount":7873,"sum":0.675005797,"avgtime":0.000085736} + +Which would tell you there have been 7873 writes with an average +time-to-complete of 85 microseconds. That clearly shows the calls are executed +asynchronously. Returning to sync: + +.. code:: sql + + SELECT json_extract(ceph_perf(), '$.libcephsqlite_vfs.opf_sync'); + +.. code:: + + {"avgcount":776,"sum":4.802041199,"avgtime":0.006188197} + +6 milliseconds were spent on average executing a sync call. This gathers all of +the asynchronous writes as well as an asynchronous update to the size of the +striped file. + + +Debugging +^^^^^^^^^ + +Debugging libcephsqlite can be turned on via:: + + debug_cephsqlite + +If running the ``sqlite3`` command-line tool, use: + +.. code:: sh + + env CEPH_ARGS='--log_to_file true --log-file sqlite3.log --debug_cephsqlite 20 --debug_ms 1' sqlite3 ... + +This will save all the usual Ceph debugging to a file ``sqlite3.log`` for inspection. + + +.. _SQLite: https://sqlite.org/index.html +.. _SQLite VFS: https://www.sqlite.org/vfs.html +.. _SQLite Backup: https://www.sqlite.org/backup.html +.. _SQLite Limits: https://www.sqlite.org/limits.html +.. _SQLite Extension Loading API: https://sqlite.org/c3ref/load_extension.html +.. _In-Memory Database: https://www.sqlite.org/inmemorydb.html +.. _WAL Journal Mode: https://sqlite.org/wal.html +.. _How to Corrupt Your SQLite Database: https://www.sqlite.org/howtocorrupt.html +.. _JSON1 Extension: https://www.sqlite.org/json1.html +.. _SQLite IO Methods: https://www.sqlite.org/c3ref/io_methods.html diff --git a/doc/rados/api/librados-intro.rst b/doc/rados/api/librados-intro.rst new file mode 100644 index 000000000..5174188b4 --- /dev/null +++ b/doc/rados/api/librados-intro.rst @@ -0,0 +1,1051 @@ +========================== + Introduction to librados +========================== + +The :term:`Ceph Storage Cluster` provides the basic storage service that allows +:term:`Ceph` to uniquely deliver **object, block, and file storage** in one +unified system. However, you are not limited to using the RESTful, block, or +POSIX interfaces. Based upon :abbr:`RADOS (Reliable Autonomic Distributed Object +Store)`, the ``librados`` API enables you to create your own interface to the +Ceph Storage Cluster. + +The ``librados`` API enables you to interact with the two types of daemons in +the Ceph Storage Cluster: + +- The :term:`Ceph Monitor`, which maintains a master copy of the cluster map. +- The :term:`Ceph OSD Daemon` (OSD), which stores data as objects on a storage node. + +.. ditaa:: + +---------------------------------+ + | Ceph Storage Cluster Protocol | + | (librados) | + +---------------------------------+ + +---------------+ +---------------+ + | OSDs | | Monitors | + +---------------+ +---------------+ + +This guide provides a high-level introduction to using ``librados``. +Refer to :doc:`../../architecture` for additional details of the Ceph +Storage Cluster. To use the API, you need a running Ceph Storage Cluster. +See `Installation (Quick)`_ for details. 
+ + +Step 1: Getting librados +======================== + +Your client application must bind with ``librados`` to connect to the Ceph +Storage Cluster. You must install ``librados`` and any required packages to +write applications that use ``librados``. The ``librados`` API is written in +C++, with additional bindings for C, Python, Java and PHP. + + +Getting librados for C/C++ +-------------------------- + +To install ``librados`` development support files for C/C++ on Debian/Ubuntu +distributions, execute the following: + +.. prompt:: bash $ + + sudo apt-get install librados-dev + +To install ``librados`` development support files for C/C++ on RHEL/CentOS +distributions, execute the following: + +.. prompt:: bash $ + + sudo yum install librados2-devel + +Once you install ``librados`` for developers, you can find the required +headers for C/C++ under ``/usr/include/rados``: + +.. prompt:: bash $ + + ls /usr/include/rados + + +Getting librados for Python +--------------------------- + +The ``rados`` module provides ``librados`` support to Python +applications. You may install ``python3-rados`` for Debian, Ubuntu, SLE or +openSUSE or the ``python-rados`` package for CentOS/RHEL. + +To install ``librados`` development support files for Python on Debian/Ubuntu +distributions, execute the following: + +.. prompt:: bash $ + + sudo apt-get install python3-rados + +To install ``librados`` development support files for Python on RHEL/CentOS +distributions, execute the following: + +.. prompt:: bash $ + + sudo yum install python-rados + +To install ``librados`` development support files for Python on SLE/openSUSE +distributions, execute the following: + +.. prompt:: bash $ + + sudo zypper install python3-rados + +You can find the module under ``/usr/share/pyshared`` on Debian systems, +or under ``/usr/lib/python*/site-packages`` on CentOS/RHEL systems. + + +Getting librados for Java +------------------------- + +To install ``librados`` for Java, you need to execute the following procedure: + +#. Install ``jna.jar``. For Debian/Ubuntu, execute: + + .. prompt:: bash $ + + sudo apt-get install libjna-java + + For CentOS/RHEL, execute: + + .. prompt:: bash $ + + sudo yum install jna + + The JAR files are located in ``/usr/share/java``. + +#. Clone the ``rados-java`` repository: + + .. prompt:: bash $ + + git clone --recursive https://github.com/ceph/rados-java.git + +#. Build the ``rados-java`` repository: + + .. prompt:: bash $ + + cd rados-java + ant + + The JAR file is located under ``rados-java/target``. + +#. Copy the JAR for RADOS to a common location (e.g., ``/usr/share/java``) and + ensure that it and the JNA JAR are in your JVM's classpath. For example: + + .. prompt:: bash $ + + sudo cp target/rados-0.1.3.jar /usr/share/java/rados-0.1.3.jar + sudo ln -s /usr/share/java/jna-3.2.7.jar /usr/lib/jvm/default-java/jre/lib/ext/jna-3.2.7.jar + sudo ln -s /usr/share/java/rados-0.1.3.jar /usr/lib/jvm/default-java/jre/lib/ext/rados-0.1.3.jar + +To build the documentation, execute the following: + +.. prompt:: bash $ + + ant docs + + +Getting librados for PHP +------------------------- + +To install the ``librados`` extension for PHP, you need to execute the following procedure: + +#. Install php-dev. For Debian/Ubuntu, execute: + + .. prompt:: bash $ + + sudo apt-get install php5-dev build-essential + + For CentOS/RHEL, execute: + + .. prompt:: bash $ + + sudo yum install php-devel + +#. Clone the ``phprados`` repository: + + .. prompt:: bash $ + + git clone https://github.com/ceph/phprados.git + +#. 
Build ``phprados``: + + .. prompt:: bash $ + + cd phprados + phpize + ./configure + make + sudo make install + +#. Enable ``phprados`` by adding the following line to ``php.ini``:: + + extension=rados.so + + +Step 2: Configuring a Cluster Handle +==================================== + +A :term:`Ceph Client`, via ``librados``, interacts directly with OSDs to store +and retrieve data. To interact with OSDs, the client app must invoke +``librados`` and connect to a Ceph Monitor. Once connected, ``librados`` +retrieves the :term:`Cluster Map` from the Ceph Monitor. When the client app +wants to read or write data, it creates an I/O context and binds to a +:term:`Pool`. The pool has an associated :term:`CRUSH rule` that defines how it +will place data in the storage cluster. Via the I/O context, the client +provides the object name to ``librados``, which takes the object name +and the cluster map (i.e., the topology of the cluster) and `computes`_ the +placement group and `OSD`_ for locating the data. Then the client application +can read or write data. The client app doesn't need to learn about the topology +of the cluster directly. + +.. ditaa:: + +--------+ Retrieves +---------------+ + | Client |------------>| Cluster Map | + +--------+ +---------------+ + | + v Writes + /-----\ + | obj | + \-----/ + | To + v + +--------+ +---------------+ + | Pool |---------->| CRUSH Rule | + +--------+ Selects +---------------+ + + +The Ceph Storage Cluster handle encapsulates the client configuration, including: + +- The `user ID`_ for ``rados_create()`` or user name for ``rados_create2()`` + (preferred). +- The :term:`cephx` authentication key +- The monitor ID and IP address +- Logging levels +- Debugging levels + +Thus, the first steps in using the cluster from your app are to 1) create +a cluster handle that your app will use to connect to the storage cluster, +and then 2) use that handle to connect. To connect to the cluster, the +app must supply a monitor address, a username and an authentication key +(cephx is enabled by default). + +.. tip:: Talking to different Ceph Storage Clusters – or to the same cluster + with different users – requires different cluster handles. + +RADOS provides a number of ways for you to set the required values. For +the monitor and encryption key settings, an easy way to handle them is to ensure +that your Ceph configuration file contains a ``keyring`` path to a keyring file +and at least one monitor address (e.g., ``mon_host``). For example:: + + [global] + mon_host = 192.168.1.1 + keyring = /etc/ceph/ceph.client.admin.keyring + +Once you create the handle, you can read a Ceph configuration file to configure +the handle. You can also pass arguments to your app and parse them with the +function for parsing command line arguments (e.g., ``rados_conf_parse_argv()``), +or parse Ceph environment variables (e.g., ``rados_conf_parse_env()``). Some +wrappers may not implement convenience methods, so you may need to implement +these capabilities. The following diagram provides a high-level flow for the +initial connection. + + +.. ditaa:: + +---------+ +---------+ + | Client | | Monitor | + +---------+ +---------+ + | | + |-----+ create | + | | cluster | + |<----+ handle | + | | + |-----+ read | + | | config | + |<----+ file | + | | + | connect | + |-------------->| + | | + |<--------------| + | connected | + | | + + +Once connected, your app can invoke functions that affect the whole cluster +with only the cluster handle. 
For example, once you have a cluster +handle, you can: + +- Get cluster statistics +- Use Pool Operation (exists, create, list, delete) +- Get and set the configuration + + +One of the powerful features of Ceph is the ability to bind to different pools. +Each pool may have a different number of placement groups, object replicas and +replication strategies. For example, a pool could be set up as a "hot" pool that +uses SSDs for frequently used objects or a "cold" pool that uses erasure coding. + +The main difference in the various ``librados`` bindings is between C and +the object-oriented bindings for C++, Java and Python. The object-oriented +bindings use objects to represent cluster handles, IO Contexts, iterators, +exceptions, etc. + + +C Example +--------- + +For C, creating a simple cluster handle using the ``admin`` user, configuring +it and connecting to the cluster might look something like this: + +.. code-block:: c + + #include <stdio.h> + #include <stdlib.h> + #include <string.h> + #include <rados/librados.h> + + int main (int argc, const char **argv) + { + + /* Declare the cluster handle and required arguments. */ + rados_t cluster; + char cluster_name[] = "ceph"; + char user_name[] = "client.admin"; + uint64_t flags = 0; + + /* Initialize the cluster handle with the "ceph" cluster name and the "client.admin" user */ + int err; + err = rados_create2(&cluster, cluster_name, user_name, flags); + + if (err < 0) { + fprintf(stderr, "%s: Couldn't create the cluster handle! %s\n", argv[0], strerror(-err)); + exit(EXIT_FAILURE); + } else { + printf("\nCreated a cluster handle.\n"); + } + + + /* Read a Ceph configuration file to configure the cluster handle. */ + err = rados_conf_read_file(cluster, "/etc/ceph/ceph.conf"); + if (err < 0) { + fprintf(stderr, "%s: cannot read config file: %s\n", argv[0], strerror(-err)); + exit(EXIT_FAILURE); + } else { + printf("\nRead the config file.\n"); + } + + /* Read command line arguments */ + err = rados_conf_parse_argv(cluster, argc, argv); + if (err < 0) { + fprintf(stderr, "%s: cannot parse command line arguments: %s\n", argv[0], strerror(-err)); + exit(EXIT_FAILURE); + } else { + printf("\nRead the command line arguments.\n"); + } + + /* Connect to the cluster */ + err = rados_connect(cluster); + if (err < 0) { + fprintf(stderr, "%s: cannot connect to cluster: %s\n", argv[0], strerror(-err)); + exit(EXIT_FAILURE); + } else { + printf("\nConnected to the cluster.\n"); + } + + } + +Compile your client and link to ``librados`` using ``-lrados``. For example: + +.. prompt:: bash $ + + gcc ceph-client.c -lrados -o ceph-client + + +C++ Example +----------- + +The Ceph project provides a C++ example in the ``ceph/examples/librados`` +directory. For C++, a simple cluster handle using the ``admin`` user requires +you to initialize a ``librados::Rados`` cluster handle object: + +.. code-block:: c++ + + #include <iostream> + #include <string> + #include <rados/librados.hpp> + + int main(int argc, const char **argv) + { + + int ret = 0; + + /* Declare the cluster handle and required variables. */ + librados::Rados cluster; + char cluster_name[] = "ceph"; + char user_name[] = "client.admin"; + uint64_t flags = 0; + + /* Initialize the cluster handle with the "ceph" cluster name and "client.admin" user */ + { + ret = cluster.init2(user_name, cluster_name, flags); + if (ret < 0) { + std::cerr << "Couldn't initialize the cluster handle! error " << ret << std::endl; + return EXIT_FAILURE; + } else { + std::cout << "Created a cluster handle." 
<< std::endl; + } + } + + /* Read a Ceph configuration file to configure the cluster handle. */ + { + ret = cluster.conf_read_file("/etc/ceph/ceph.conf"); + if (ret < 0) { + std::cerr << "Couldn't read the Ceph configuration file! error " << ret << std::endl; + return EXIT_FAILURE; + } else { + std::cout << "Read the Ceph configuration file." << std::endl; + } + } + + /* Read command line arguments */ + { + ret = cluster.conf_parse_argv(argc, argv); + if (ret < 0) { + std::cerr << "Couldn't parse command line options! error " << ret << std::endl; + return EXIT_FAILURE; + } else { + std::cout << "Parsed command line options." << std::endl; + } + } + + /* Connect to the cluster */ + { + ret = cluster.connect(); + if (ret < 0) { + std::cerr << "Couldn't connect to cluster! error " << ret << std::endl; + return EXIT_FAILURE; + } else { + std::cout << "Connected to the cluster." << std::endl; + } + } + + return 0; + } + + +Compile the source; then, link ``librados`` using ``-lrados``. +For example: + +.. prompt:: bash $ + + g++ -g -c ceph-client.cc -o ceph-client.o + g++ -g ceph-client.o -lrados -o ceph-client + + + +Python Example +-------------- + +Python uses the ``admin`` id and the ``ceph`` cluster name by default, and +will read the standard ``ceph.conf`` file if the conffile parameter is +set to the empty string. The Python binding converts C++ errors +into exceptions. + + +.. code-block:: python + + import rados + + try: + cluster = rados.Rados(conffile='') + except TypeError as e: + print('Argument validation error: {}'.format(e)) + raise e + + print("Created cluster handle.") + + try: + cluster.connect() + except Exception as e: + print("connection error: {}".format(e)) + raise e + finally: + print("Connected to the cluster.") + + +Execute the example to verify that it connects to your cluster: + +.. prompt:: bash $ + + python ceph-client.py + + +Java Example +------------ + +Java requires you to specify the user ID (``admin``) or user name +(``client.admin``), and uses the ``ceph`` cluster name by default . The Java +binding converts C++-based errors into exceptions. + +.. code-block:: java + + import com.ceph.rados.Rados; + import com.ceph.rados.RadosException; + + import java.io.File; + + public class CephClient { + public static void main (String args[]){ + + try { + Rados cluster = new Rados("admin"); + System.out.println("Created cluster handle."); + + File f = new File("/etc/ceph/ceph.conf"); + cluster.confReadFile(f); + System.out.println("Read the configuration file."); + + cluster.connect(); + System.out.println("Connected to the cluster."); + + } catch (RadosException e) { + System.out.println(e.getMessage() + ": " + e.getReturnValue()); + } + } + } + + +Compile the source; then, run it. If you have copied the JAR to +``/usr/share/java`` and sym linked from your ``ext`` directory, you won't need +to specify the classpath. For example: + +.. prompt:: bash $ + + javac CephClient.java + java CephClient + + +PHP Example +------------ + +With the RADOS extension enabled in PHP you can start creating a new cluster handle very easily: + +.. code-block:: php + + <?php + + $r = rados_create(); + rados_conf_read_file($r, '/etc/ceph/ceph.conf'); + if (!rados_connect($r)) { + echo "Failed to connect to Ceph cluster"; + } else { + echo "Successfully connected to Ceph cluster"; + } + + +Save this as rados.php and run the code: + +.. 
prompt:: bash $ + + php rados.php + + +Step 3: Creating an I/O Context +=============================== + +Once your app has a cluster handle and a connection to a Ceph Storage Cluster, +you may create an I/O Context and begin reading and writing data. An I/O Context +binds the connection to a specific pool. The user must have appropriate +`CAPS`_ permissions to access the specified pool. For example, a user with read +access but not write access will only be able to read data. I/O Context +functionality includes: + +- Write/read data and extended attributes +- List and iterate over objects and extended attributes +- Snapshot pools, list snapshots, etc. + + +.. ditaa:: + +---------+ +---------+ +---------+ + | Client | | Monitor | | OSD | + +---------+ +---------+ +---------+ + | | | + |-----+ create | | + | | I/O | | + |<----+ context | | + | | | + | write data | | + |---------------+-------------->| + | | | + | write ack | | + |<--------------+---------------| + | | | + | write xattr | | + |---------------+-------------->| + | | | + | xattr ack | | + |<--------------+---------------| + | | | + | read data | | + |---------------+-------------->| + | | | + | read ack | | + |<--------------+---------------| + | | | + | remove data | | + |---------------+-------------->| + | | | + | remove ack | | + |<--------------+---------------| + + + +RADOS enables you to interact both synchronously and asynchronously. Once your +app has an I/O Context, read/write operations only require you to know the +object/xattr name. The CRUSH algorithm encapsulated in ``librados`` uses the +cluster map to identify the appropriate OSD. OSD daemons handle the replication, +as described in `Smart Daemons Enable Hyperscale`_. The ``librados`` library also +maps objects to placement groups, as described in `Calculating PG IDs`_. + +The following examples use the default ``data`` pool. However, you may also +use the API to list pools, ensure they exist, or create and delete pools. For +the write operations, the examples illustrate how to use synchronous mode. For +the read operations, the examples illustrate how to use asynchronous mode. + +.. important:: Use caution when deleting pools with this API. If you delete + a pool, the pool and ALL DATA in the pool will be lost. + + +C Example +--------- + + +.. code-block:: c + + #include <stdio.h> + #include <stdlib.h> + #include <string.h> + #include <rados/librados.h> + + int main (int argc, const char **argv) + { + /* + * Continued from previous C example, where cluster handle and + * connection are established. First declare an I/O Context. + */ + + rados_ioctx_t io; + char *poolname = "data"; + + err = rados_ioctx_create(cluster, poolname, &io); + if (err < 0) { + fprintf(stderr, "%s: cannot open rados pool %s: %s\n", argv[0], poolname, strerror(-err)); + rados_shutdown(cluster); + exit(EXIT_FAILURE); + } else { + printf("\nCreated I/O context.\n"); + } + + /* Write data to the cluster synchronously. 
*/ + err = rados_write(io, "hw", "Hello World!", 12, 0); + if (err < 0) { + fprintf(stderr, "%s: Cannot write object \"hw\" to pool %s: %s\n", argv[0], poolname, strerror(-err)); + rados_ioctx_destroy(io); + rados_shutdown(cluster); + exit(1); + } else { + printf("\nWrote \"Hello World\" to object \"hw\".\n"); + } + + char xattr[] = "en_US"; + err = rados_setxattr(io, "hw", "lang", xattr, 5); + if (err < 0) { + fprintf(stderr, "%s: Cannot write xattr to pool %s: %s\n", argv[0], poolname, strerror(-err)); + rados_ioctx_destroy(io); + rados_shutdown(cluster); + exit(1); + } else { + printf("\nWrote \"en_US\" to xattr \"lang\" for object \"hw\".\n"); + } + + /* + * Read data from the cluster asynchronously. + * First, set up asynchronous I/O completion. + */ + rados_completion_t comp; + err = rados_aio_create_completion(NULL, NULL, NULL, &comp); + if (err < 0) { + fprintf(stderr, "%s: Could not create aio completion: %s\n", argv[0], strerror(-err)); + rados_ioctx_destroy(io); + rados_shutdown(cluster); + exit(1); + } else { + printf("\nCreated AIO completion.\n"); + } + + /* Next, read data using rados_aio_read. */ + char read_res[100]; + err = rados_aio_read(io, "hw", comp, read_res, 12, 0); + if (err < 0) { + fprintf(stderr, "%s: Cannot read object. %s %s\n", argv[0], poolname, strerror(-err)); + rados_ioctx_destroy(io); + rados_shutdown(cluster); + exit(1); + } else { + printf("\nRead object \"hw\". The contents are:\n %s \n", read_res); + } + + /* Wait for the operation to complete */ + rados_aio_wait_for_complete(comp); + + /* Release the asynchronous I/O complete handle to avoid memory leaks. */ + rados_aio_release(comp); + + + char xattr_res[100]; + err = rados_getxattr(io, "hw", "lang", xattr_res, 5); + if (err < 0) { + fprintf(stderr, "%s: Cannot read xattr. %s %s\n", argv[0], poolname, strerror(-err)); + rados_ioctx_destroy(io); + rados_shutdown(cluster); + exit(1); + } else { + printf("\nRead xattr \"lang\" for object \"hw\". The contents are:\n %s \n", xattr_res); + } + + err = rados_rmxattr(io, "hw", "lang"); + if (err < 0) { + fprintf(stderr, "%s: Cannot remove xattr. %s %s\n", argv[0], poolname, strerror(-err)); + rados_ioctx_destroy(io); + rados_shutdown(cluster); + exit(1); + } else { + printf("\nRemoved xattr \"lang\" for object \"hw\".\n"); + } + + err = rados_remove(io, "hw"); + if (err < 0) { + fprintf(stderr, "%s: Cannot remove object. %s %s\n", argv[0], poolname, strerror(-err)); + rados_ioctx_destroy(io); + rados_shutdown(cluster); + exit(1); + } else { + printf("\nRemoved object \"hw\".\n"); + } + + } + + + +C++ Example +----------- + + +.. code-block:: c++ + + #include <iostream> + #include <string> + #include <rados/librados.hpp> + + int main(int argc, const char **argv) + { + + /* Continued from previous C++ example, where cluster handle and + * connection are established. First declare an I/O Context. + */ + + librados::IoCtx io_ctx; + const char *pool_name = "data"; + + { + ret = cluster.ioctx_create(pool_name, io_ctx); + if (ret < 0) { + std::cerr << "Couldn't set up ioctx! error " << ret << std::endl; + exit(EXIT_FAILURE); + } else { + std::cout << "Created an ioctx for the pool." << std::endl; + } + } + + + /* Write an object synchronously. */ + { + librados::bufferlist bl; + bl.append("Hello World!"); + ret = io_ctx.write_full("hw", bl); + if (ret < 0) { + std::cerr << "Couldn't write object! error " << ret << std::endl; + exit(EXIT_FAILURE); + } else { + std::cout << "Wrote new object 'hw' " << std::endl; + } + } + + + /* + * Add an xattr to the object. 
+ */ + { + librados::bufferlist lang_bl; + lang_bl.append("en_US"); + ret = io_ctx.setxattr("hw", "lang", lang_bl); + if (ret < 0) { + std::cerr << "failed to set xattr version entry! error " + << ret << std::endl; + exit(EXIT_FAILURE); + } else { + std::cout << "Set the xattr 'lang' on our object!" << std::endl; + } + } + + + /* + * Read the object back asynchronously. + */ + { + librados::bufferlist read_buf; + int read_len = 4194304; + + //Create I/O Completion. + librados::AioCompletion *read_completion = librados::Rados::aio_create_completion(); + + //Send read request. + ret = io_ctx.aio_read("hw", read_completion, &read_buf, read_len, 0); + if (ret < 0) { + std::cerr << "Couldn't start read object! error " << ret << std::endl; + exit(EXIT_FAILURE); + } + + // Wait for the request to complete, and check that it succeeded. + read_completion->wait_for_complete(); + ret = read_completion->get_return_value(); + if (ret < 0) { + std::cerr << "Couldn't read object! error " << ret << std::endl; + exit(EXIT_FAILURE); + } else { + std::cout << "Read object hw asynchronously with contents.\n" + << read_buf.c_str() << std::endl; + } + } + + + /* + * Read the xattr. + */ + { + librados::bufferlist lang_res; + ret = io_ctx.getxattr("hw", "lang", lang_res); + if (ret < 0) { + std::cerr << "failed to get xattr version entry! error " + << ret << std::endl; + exit(EXIT_FAILURE); + } else { + std::cout << "Got the xattr 'lang' from object hw!" + << lang_res.c_str() << std::endl; + } + } + + + /* + * Remove the xattr. + */ + { + ret = io_ctx.rmxattr("hw", "lang"); + if (ret < 0) { + std::cerr << "Failed to remove xattr! error " + << ret << std::endl; + exit(EXIT_FAILURE); + } else { + std::cout << "Removed the xattr 'lang' from our object!" << std::endl; + } + } + + /* + * Remove the object. + */ + { + ret = io_ctx.remove("hw"); + if (ret < 0) { + std::cerr << "Couldn't remove object! error " << ret << std::endl; + exit(EXIT_FAILURE); + } else { + std::cout << "Removed object 'hw'." << std::endl; + } + } + } + + + +Python Example +-------------- + +.. code-block:: python + + print("\n\nI/O Context and Object Operations") + print("=================================") + + print("\nCreating a context for the 'data' pool") + if not cluster.pool_exists('data'): + raise RuntimeError('No data pool exists') + ioctx = cluster.open_ioctx('data') + + print("\nWriting object 'hw' with contents 'Hello World!' to pool 'data'.") + ioctx.write("hw", b"Hello World!") + print("Writing XATTR 'lang' with value 'en_US' to object 'hw'") + ioctx.set_xattr("hw", "lang", b"en_US") + + + print("\nWriting object 'bm' with contents 'Bonjour tout le monde!' to pool + 'data'.") + ioctx.write("bm", b"Bonjour tout le monde!") + print("Writing XATTR 'lang' with value 'fr_FR' to object 'bm'") + ioctx.set_xattr("bm", "lang", b"fr_FR") + + print("\nContents of object 'hw'\n------------------------") + print(ioctx.read("hw")) + + print("\n\nGetting XATTR 'lang' from object 'hw'") + print(ioctx.get_xattr("hw", "lang")) + + print("\nContents of object 'bm'\n------------------------") + print(ioctx.read("bm")) + + print("\n\nGetting XATTR 'lang' from object 'bm'") + print(ioctx.get_xattr("bm", "lang")) + + + print("\nRemoving object 'hw'") + ioctx.remove_object("hw") + + print("Removing object 'bm'") + ioctx.remove_object("bm") + + +Java-Example +------------ + +.. 
code-block:: java + + import com.ceph.rados.Rados; + import com.ceph.rados.RadosException; + + import java.io.File; + import com.ceph.rados.IoCTX; + + public class CephClient { + public static void main (String args[]){ + + try { + Rados cluster = new Rados("admin"); + System.out.println("Created cluster handle."); + + File f = new File("/etc/ceph/ceph.conf"); + cluster.confReadFile(f); + System.out.println("Read the configuration file."); + + cluster.connect(); + System.out.println("Connected to the cluster."); + + IoCTX io = cluster.ioCtxCreate("data"); + + String oidone = "hw"; + String contentone = "Hello World!"; + io.write(oidone, contentone); + + String oidtwo = "bm"; + String contenttwo = "Bonjour tout le monde!"; + io.write(oidtwo, contenttwo); + + String[] objects = io.listObjects(); + for (String object: objects) + System.out.println(object); + + io.remove(oidone); + io.remove(oidtwo); + + cluster.ioCtxDestroy(io); + + } catch (RadosException e) { + System.out.println(e.getMessage() + ": " + e.getReturnValue()); + } + } + } + + +PHP Example +----------- + +.. code-block:: php + + <?php + + $io = rados_ioctx_create($r, "mypool"); + rados_write_full($io, "oidOne", "mycontents"); + rados_remove("oidOne"); + rados_ioctx_destroy($io); + + +Step 4: Closing Sessions +======================== + +Once your app finishes with the I/O Context and cluster handle, the app should +close the connection and shutdown the handle. For asynchronous I/O, the app +should also ensure that pending asynchronous operations have completed. + + +C Example +--------- + +.. code-block:: c + + rados_ioctx_destroy(io); + rados_shutdown(cluster); + + +C++ Example +----------- + +.. code-block:: c++ + + io_ctx.close(); + cluster.shutdown(); + + +Java Example +-------------- + +.. code-block:: java + + cluster.ioCtxDestroy(io); + cluster.shutDown(); + + +Python Example +-------------- + +.. code-block:: python + + print("\nClosing the connection.") + ioctx.close() + + print("Shutting down the handle.") + cluster.shutdown() + +PHP Example +----------- + +.. code-block:: php + + rados_shutdown($r); + + + +.. _user ID: ../../operations/user-management#command-line-usage +.. _CAPS: ../../operations/user-management#authorization-capabilities +.. _Installation (Quick): ../../../start +.. _Smart Daemons Enable Hyperscale: ../../../architecture#smart-daemons-enable-hyperscale +.. _Calculating PG IDs: ../../../architecture#calculating-pg-ids +.. _computes: ../../../architecture#calculating-pg-ids +.. _OSD: ../../../architecture#mapping-pgs-to-osds diff --git a/doc/rados/api/librados.rst b/doc/rados/api/librados.rst new file mode 100644 index 000000000..3e202bd4b --- /dev/null +++ b/doc/rados/api/librados.rst @@ -0,0 +1,187 @@ +============== + Librados (C) +============== + +.. highlight:: c + +`librados` provides low-level access to the RADOS service. For an +overview of RADOS, see :doc:`../../architecture`. 
+ + +Example: connecting and writing an object +========================================= + +To use `Librados`, you instantiate a :c:type:`rados_t` variable (a cluster handle) and +call :c:func:`rados_create()` with a pointer to it:: + + int err; + rados_t cluster; + + err = rados_create(&cluster, NULL); + if (err < 0) { + fprintf(stderr, "%s: cannot create a cluster handle: %s\n", argv[0], strerror(-err)); + exit(1); + } + +Then you configure your :c:type:`rados_t` to connect to your cluster, +either by setting individual values (:c:func:`rados_conf_set()`), +using a configuration file (:c:func:`rados_conf_read_file()`), using +command line options (:c:func:`rados_conf_parse_argv`), or an +environment variable (:c:func:`rados_conf_parse_env()`):: + + err = rados_conf_read_file(cluster, "/path/to/myceph.conf"); + if (err < 0) { + fprintf(stderr, "%s: cannot read config file: %s\n", argv[0], strerror(-err)); + exit(1); + } + +Once the cluster handle is configured, you can connect to the cluster with :c:func:`rados_connect()`:: + + err = rados_connect(cluster); + if (err < 0) { + fprintf(stderr, "%s: cannot connect to cluster: %s\n", argv[0], strerror(-err)); + exit(1); + } + +Then you open an "IO context", a :c:type:`rados_ioctx_t`, with :c:func:`rados_ioctx_create()`:: + + rados_ioctx_t io; + char *poolname = "mypool"; + + err = rados_ioctx_create(cluster, poolname, &io); + if (err < 0) { + fprintf(stderr, "%s: cannot open rados pool %s: %s\n", argv[0], poolname, strerror(-err)); + rados_shutdown(cluster); + exit(1); + } + +Note that the pool you try to access must exist. + +Then you can use the RADOS data manipulation functions, for example +write into an object called ``greeting`` with +:c:func:`rados_write_full()`:: + + err = rados_write_full(io, "greeting", "hello", 5); + if (err < 0) { + fprintf(stderr, "%s: cannot write pool %s: %s\n", argv[0], poolname, strerror(-err)); + rados_ioctx_destroy(io); + rados_shutdown(cluster); + exit(1); + } + +In the end, you will want to close your IO context and connection to RADOS with :c:func:`rados_ioctx_destroy()` and :c:func:`rados_shutdown()`:: + + rados_ioctx_destroy(io); + rados_shutdown(cluster); + + +Asynchronous IO +=============== + +When doing lots of IO, you often don't need to wait for one operation +to complete before starting the next one. `Librados` provides +asynchronous versions of several operations: + +* :c:func:`rados_aio_write` +* :c:func:`rados_aio_append` +* :c:func:`rados_aio_write_full` +* :c:func:`rados_aio_read` + +For each operation, you must first create a +:c:type:`rados_completion_t` that represents what to do when the +operation is safe or complete by calling +:c:func:`rados_aio_create_completion`. 
If you don't need anything +special to happen, you can pass NULL:: + + rados_completion_t comp; + err = rados_aio_create_completion(NULL, NULL, NULL, &comp); + if (err < 0) { + fprintf(stderr, "%s: could not create aio completion: %s\n", argv[0], strerror(-err)); + rados_ioctx_destroy(io); + rados_shutdown(cluster); + exit(1); + } + +Now you can call any of the aio operations, and wait for it to +be in memory or on disk on all replicas:: + + err = rados_aio_write(io, "foo", comp, "bar", 3, 0); + if (err < 0) { + fprintf(stderr, "%s: could not schedule aio write: %s\n", argv[0], strerror(-err)); + rados_aio_release(comp); + rados_ioctx_destroy(io); + rados_shutdown(cluster); + exit(1); + } + rados_aio_wait_for_complete(comp); // in memory + rados_aio_wait_for_safe(comp); // on disk + +Finally, we need to free the memory used by the completion with :c:func:`rados_aio_release`:: + + rados_aio_release(comp); + +You can use the callbacks to tell your application when writes are +durable, or when read buffers are full. For example, if you wanted to +measure the latency of each operation when appending to several +objects, you could schedule several writes and store the ack and +commit time in the corresponding callback, then wait for all of them +to complete using :c:func:`rados_aio_flush` before analyzing the +latencies:: + + typedef struct { + struct timeval start; + struct timeval ack_end; + struct timeval commit_end; + } req_duration; + + void ack_callback(rados_completion_t comp, void *arg) { + req_duration *dur = (req_duration *) arg; + gettimeofday(&dur->ack_end, NULL); + } + + void commit_callback(rados_completion_t comp, void *arg) { + req_duration *dur = (req_duration *) arg; + gettimeofday(&dur->commit_end, NULL); + } + + int output_append_latency(rados_ioctx_t io, const char *data, size_t len, size_t num_writes) { + req_duration times[num_writes]; + rados_completion_t comps[num_writes]; + for (size_t i = 0; i < num_writes; ++i) { + gettimeofday(×[i].start, NULL); + int err = rados_aio_create_completion((void*) ×[i], ack_callback, commit_callback, &comps[i]); + if (err < 0) { + fprintf(stderr, "Error creating rados completion: %s\n", strerror(-err)); + return err; + } + char obj_name[100]; + snprintf(obj_name, sizeof(obj_name), "foo%ld", (unsigned long)i); + err = rados_aio_append(io, obj_name, comps[i], data, len); + if (err < 0) { + fprintf(stderr, "Error from rados_aio_append: %s", strerror(-err)); + return err; + } + } + // wait until all requests finish *and* the callbacks complete + rados_aio_flush(io); + // the latencies can now be analyzed + printf("Request # | Ack latency (s) | Commit latency (s)\n"); + for (size_t i = 0; i < num_writes; ++i) { + // don't forget to free the completions + rados_aio_release(comps[i]); + struct timeval ack_lat, commit_lat; + timersub(×[i].ack_end, ×[i].start, &ack_lat); + timersub(×[i].commit_end, ×[i].start, &commit_lat); + printf("%9ld | %8ld.%06ld | %10ld.%06ld\n", (unsigned long) i, ack_lat.tv_sec, ack_lat.tv_usec, commit_lat.tv_sec, commit_lat.tv_usec); + } + return 0; + } + +Note that all the :c:type:`rados_completion_t` must be freed with :c:func:`rados_aio_release` to avoid leaking memory. + + +API calls +========= + + .. autodoxygenfile:: rados_types.h + .. autodoxygenfile:: librados.h diff --git a/doc/rados/api/libradospp.rst b/doc/rados/api/libradospp.rst new file mode 100644 index 000000000..08483c8d4 --- /dev/null +++ b/doc/rados/api/libradospp.rst @@ -0,0 +1,9 @@ +================== + LibradosPP (C++) +================== + +.. 
note:: The librados C++ API is not guaranteed to be API+ABI stable + between major releases. All applications using the librados C++ API must + be recompiled and relinked against a specific Ceph release. + +.. todo:: write me! diff --git a/doc/rados/api/objclass-sdk.rst b/doc/rados/api/objclass-sdk.rst new file mode 100644 index 000000000..90b8eb018 --- /dev/null +++ b/doc/rados/api/objclass-sdk.rst @@ -0,0 +1,39 @@ +.. _`rados-objclass-api-sdk`: + +=========================== +SDK for Ceph Object Classes +=========================== + +`Ceph` can be extended by creating shared object classes called `Ceph Object +Classes`. The existing framework to build these object classes has dependencies +on the internal functionality of `Ceph`, which restricts users to build object +classes within the tree. The aim of this project is to create an independent +object class interface, which can be used to build object classes outside the +`Ceph` tree. This allows us to have two types of object classes, 1) those that +have in-tree dependencies and reside in the tree and 2) those that can make use +of the `Ceph Object Class SDK framework` and can be built outside of the `Ceph` +tree because they do not depend on any internal implementation of `Ceph`. This +project decouples object class development from Ceph and encourages creation +and distribution of object classes as packages. + +In order to demonstrate the use of this framework, we have provided an example +called ``cls_sdk``, which is a very simple object class that makes use of the +SDK framework. This object class resides in the ``src/cls`` directory. + +Installing objclass.h +--------------------- + +The object class interface that enables out-of-tree development of object +classes resides in ``src/include/rados/`` and gets installed with `Ceph` +installation. After running ``make install``, you should be able to see it +in ``<prefix>/include/rados``. :: + + ls /usr/local/include/rados + +Using the SDK example +--------------------- + +The ``cls_sdk`` object class resides in ``src/cls/sdk/``. This gets built and +loaded into Ceph, with the Ceph build process. You can run the +``ceph_test_cls_sdk`` unittest, which resides in ``src/test/cls_sdk/``, +to test this class. diff --git a/doc/rados/api/python.rst b/doc/rados/api/python.rst new file mode 100644 index 000000000..346653a3d --- /dev/null +++ b/doc/rados/api/python.rst @@ -0,0 +1,428 @@ +=================== + Librados (Python) +=================== + +The ``rados`` module is a thin Python wrapper for ``librados``. + +Installation +============ + +To install Python libraries for Ceph, see `Getting librados for Python`_. + + +Getting Started +=============== + +You can create your own Ceph client using Python. The following tutorial will +show you how to import the Ceph Python module, connect to a Ceph cluster, and +perform object operations as a ``client.admin`` user. + +.. note:: To use the Ceph Python bindings, you must have access to a + running Ceph cluster. To set one up quickly, see `Getting Started`_. + +First, create a Python source file for your Ceph client. + +.. prompt:: bash + + vim client.py + + +Import the Module +----------------- + +To use the ``rados`` module, import it into your source file. + +.. code-block:: python + :linenos: + + import rados + + +Configure a Cluster Handle +-------------------------- + +Before connecting to the Ceph Storage Cluster, create a cluster handle. 
By +default, the cluster handle assumes a cluster named ``ceph`` (i.e., the default +for deployment tools, and our Getting Started guides too), and a +``client.admin`` user name. You may change these defaults to suit your needs. + +To connect to the Ceph Storage Cluster, your application needs to know where to +find the Ceph Monitor. Provide this information to your application by +specifying the path to your Ceph configuration file, which contains the location +of the initial Ceph monitors. + +.. code-block:: python + :linenos: + + import rados, sys + + #Create Handle Examples. + cluster = rados.Rados(conffile='ceph.conf') + cluster = rados.Rados(conffile=sys.argv[1]) + cluster = rados.Rados(conffile = 'ceph.conf', conf = dict (keyring = '/path/to/keyring')) + +Ensure that the ``conffile`` argument provides the path and file name of your +Ceph configuration file. You may use the ``sys`` module to avoid hard-coding the +Ceph configuration path and file name. + +Your Python client also requires a client keyring. For this example, we use the +``client.admin`` key by default. If you would like to specify the keyring when +creating the cluster handle, you may use the ``conf`` argument. Alternatively, +you may specify the keyring path in your Ceph configuration file. For example, +you may add something like the following line to your Ceph configuration file:: + + keyring = /path/to/ceph.client.admin.keyring + +For additional details on modifying your configuration via Python, see `Configuration`_. + + +Connect to the Cluster +---------------------- + +Once you have a cluster handle configured, you may connect to the cluster. +With a connection to the cluster, you may execute methods that return +information about the cluster. + +.. code-block:: python + :linenos: + :emphasize-lines: 7 + + import rados, sys + + cluster = rados.Rados(conffile='ceph.conf') + print("\nlibrados version: {}".format(str(cluster.version()))) + print("Will attempt to connect to: {}".format(str(cluster.conf_get('mon host')))) + + cluster.connect() + print("\nCluster ID: {}".format(cluster.get_fsid())) + + print("\n\nCluster Statistics") + print("==================") + cluster_stats = cluster.get_cluster_stats() + + for key, value in cluster_stats.items(): + print(key, value) + + +By default, Ceph authentication is ``on``. Your application will need to know +the location of the keyring. The ``python-ceph`` module doesn't have the default +location, so you need to specify the keyring path. The easiest way to specify +the keyring is to add it to the Ceph configuration file. The following Ceph +configuration file example uses the ``client.admin`` keyring. + +.. code-block:: ini + :linenos: + + [global] + # ... elided configuration + keyring = /path/to/keyring/ceph.client.admin.keyring + + +Manage Pools +------------ + +When connected to the cluster, the ``Rados`` API allows you to manage pools. You +can list pools, check for the existence of a pool, create a pool and delete a +pool. + +.. 
code-block:: python + :linenos: + :emphasize-lines: 6, 13, 18, 25 + + print("\n\nPool Operations") + print("===============") + + print("\nAvailable Pools") + print("----------------") + pools = cluster.list_pools() + + for pool in pools: + print(pool) + + print("\nCreate 'test' Pool") + print("------------------") + cluster.create_pool('test') + + print("\nPool named 'test' exists: {}".format(str(cluster.pool_exists('test')))) + print("\nVerify 'test' Pool Exists") + print("-------------------------") + pools = cluster.list_pools() + + for pool in pools: + print(pool) + + print("\nDelete 'test' Pool") + print("------------------") + cluster.delete_pool('test') + print("\nPool named 'test' exists: {}".format(str(cluster.pool_exists('test')))) + + +Input/Output Context +-------------------- + +Reading from and writing to the Ceph Storage Cluster requires an input/output +context (ioctx). You can create an ioctx with the ``open_ioctx()`` or +``open_ioctx2()`` method of the ``Rados`` class. The ``ioctx_name`` parameter +is the name of the pool and ``pool_id`` is the ID of the pool you wish to use. + +.. code-block:: python + :linenos: + + ioctx = cluster.open_ioctx('data') + + +or + +.. code-block:: python + :linenos: + + ioctx = cluster.open_ioctx2(pool_id) + + +Once you have an I/O context, you can read/write objects, extended attributes, +and perform a number of other operations. After you complete operations, ensure +that you close the connection. For example: + +.. code-block:: python + :linenos: + + print("\nClosing the connection.") + ioctx.close() + + +Writing, Reading and Removing Objects +------------------------------------- + +Once you create an I/O context, you can write objects to the cluster. If you +write to an object that doesn't exist, Ceph creates it. If you write to an +object that exists, Ceph overwrites it (except when you specify a range, and +then it only overwrites the range). You may read objects (and object ranges) +from the cluster. You may also remove objects from the cluster. For example: + +.. code-block:: python + :linenos: + :emphasize-lines: 2, 5, 8 + + print("\nWriting object 'hw' with contents 'Hello World!' to pool 'data'.") + ioctx.write_full("hw", "Hello World!") + + print("\n\nContents of object 'hw'\n------------------------\n") + print(ioctx.read("hw")) + + print("\nRemoving object 'hw'") + ioctx.remove_object("hw") + + +Writing and Reading XATTRS +-------------------------- + +Once you create an object, you can write extended attributes (XATTRs) to +the object and read XATTRs from the object. For example: + +.. code-block:: python + :linenos: + :emphasize-lines: 2, 5 + + print("\n\nWriting XATTR 'lang' with value 'en_US' to object 'hw'") + ioctx.set_xattr("hw", "lang", "en_US") + + print("\n\nGetting XATTR 'lang' from object 'hw'\n") + print(ioctx.get_xattr("hw", "lang")) + + +Listing Objects +--------------- + +If you want to examine the list of objects in a pool, you may +retrieve the list of objects and iterate over them with the object iterator. +For example: + +.. 
code-block:: python + :linenos: + :emphasize-lines: 1, 6, 7, 13 + + object_iterator = ioctx.list_objects() + + while True : + + try : + rados_object = object_iterator.__next__() + print("Object contents = {}".format(rados_object.read())) + + except StopIteration : + break + + # Or alternatively + [print("Object contents = {}".format(obj.read())) for obj in ioctx.list_objects()] + +The ``Object`` class provides a file-like interface to an object, allowing +you to read and write content and extended attributes. Object operations using +the I/O context provide additional functionality and asynchronous capabilities. + + +Cluster Handle API +================== + +The ``Rados`` class provides an interface into the Ceph Storage Daemon. + + +Configuration +------------- + +The ``Rados`` class provides methods for getting and setting configuration +values, reading the Ceph configuration file, and parsing arguments. You +do not need to be connected to the Ceph Storage Cluster to invoke the following +methods. See `Storage Cluster Configuration`_ for details on settings. + +.. currentmodule:: rados +.. automethod:: Rados.conf_get(option) +.. automethod:: Rados.conf_set(option, val) +.. automethod:: Rados.conf_read_file(path=None) +.. automethod:: Rados.conf_parse_argv(args) +.. automethod:: Rados.version() + + +Connection Management +--------------------- + +Once you configure your cluster handle, you may connect to the cluster, check +the cluster ``fsid``, retrieve cluster statistics, and disconnect (shutdown) +from the cluster. You may also assert that the cluster handle is in a particular +state (e.g., "configuring", "connecting", etc.). + +.. automethod:: Rados.connect(timeout=0) +.. automethod:: Rados.shutdown() +.. automethod:: Rados.get_fsid() +.. automethod:: Rados.get_cluster_stats() + +.. documented manually because it raises warnings because of *args usage in the +.. signature + +.. py:class:: Rados + + .. py:method:: require_state(*args) + + Checks if the Rados object is in a special state + + :param args: Any number of states to check as separate arguments + :raises: :class:`RadosStateError` + + +Pool Operations +--------------- + +To use pool operation methods, you must connect to the Ceph Storage Cluster +first. You may list the available pools, create a pool, check to see if a pool +exists, and delete a pool. + +.. automethod:: Rados.list_pools() +.. automethod:: Rados.create_pool(pool_name, crush_rule=None) +.. automethod:: Rados.pool_exists() +.. automethod:: Rados.delete_pool(pool_name) + + +CLI Commands +------------ + +The Ceph CLI command is internally using the following librados Python binding methods. + +In order to send a command, choose the correct method and choose the correct target. + +.. automethod:: Rados.mon_command +.. automethod:: Rados.osd_command +.. automethod:: Rados.mgr_command +.. automethod:: Rados.pg_command + + +Input/Output Context API +======================== + +To write data to and read data from the Ceph Object Store, you must create +an Input/Output context (ioctx). The `Rados` class provides `open_ioctx()` +and `open_ioctx2()` methods. The remaining ``ioctx`` operations involve +invoking methods of the `Ioctx` and other classes. + +.. automethod:: Rados.open_ioctx(ioctx_name) +.. automethod:: Ioctx.require_ioctx_open() +.. automethod:: Ioctx.get_stats() +.. automethod:: Ioctx.get_last_version() +.. automethod:: Ioctx.close() + + +.. Pool Snapshots +.. -------------- + +.. The Ceph Storage Cluster allows you to make a snapshot of a pool's state. +.. 
Whereas, basic pool operations only require a connection to the cluster, +.. snapshots require an I/O context. + +.. Ioctx.create_snap(self, snap_name) +.. Ioctx.list_snaps(self) +.. SnapIterator.next(self) +.. Snap.get_timestamp(self) +.. Ioctx.lookup_snap(self, snap_name) +.. Ioctx.remove_snap(self, snap_name) + +.. not published. This doesn't seem ready yet. + +Object Operations +----------------- + +The Ceph Storage Cluster stores data as objects. You can read and write objects +synchronously or asynchronously. You can read and write from offsets. An object +has a name (or key) and data. + + +.. automethod:: Ioctx.aio_write(object_name, to_write, offset=0, oncomplete=None, onsafe=None) +.. automethod:: Ioctx.aio_write_full(object_name, to_write, oncomplete=None, onsafe=None) +.. automethod:: Ioctx.aio_append(object_name, to_append, oncomplete=None, onsafe=None) +.. automethod:: Ioctx.write(key, data, offset=0) +.. automethod:: Ioctx.write_full(key, data) +.. automethod:: Ioctx.aio_flush() +.. automethod:: Ioctx.set_locator_key(loc_key) +.. automethod:: Ioctx.aio_read(object_name, length, offset, oncomplete) +.. automethod:: Ioctx.read(key, length=8192, offset=0) +.. automethod:: Ioctx.stat(key) +.. automethod:: Ioctx.trunc(key, size) +.. automethod:: Ioctx.remove_object(key) + + +Object Extended Attributes +-------------------------- + +You may set extended attributes (XATTRs) on an object. You can retrieve a list +of objects or XATTRs and iterate over them. + +.. automethod:: Ioctx.set_xattr(key, xattr_name, xattr_value) +.. automethod:: Ioctx.get_xattrs(oid) +.. automethod:: XattrIterator.__next__() +.. automethod:: Ioctx.get_xattr(key, xattr_name) +.. automethod:: Ioctx.rm_xattr(key, xattr_name) + + + +Object Interface +================ + +From an I/O context, you can retrieve a list of objects from a pool and iterate +over them. The object interface provide makes each object look like a file, and +you may perform synchronous operations on the objects. For asynchronous +operations, you should use the I/O context methods. + +.. automethod:: Ioctx.list_objects() +.. automethod:: ObjectIterator.__next__() +.. automethod:: Object.read(length = 1024*1024) +.. automethod:: Object.write(string_to_write) +.. automethod:: Object.get_xattrs() +.. automethod:: Object.get_xattr(xattr_name) +.. automethod:: Object.set_xattr(xattr_name, xattr_value) +.. automethod:: Object.rm_xattr(xattr_name) +.. automethod:: Object.stat() +.. automethod:: Object.remove() + + + + +.. _Getting Started: ../../../start +.. _Storage Cluster Configuration: ../../configuration +.. 
_Getting librados for Python: ../librados-intro#getting-librados-for-python diff --git a/doc/rados/command/list-inconsistent-obj.json b/doc/rados/command/list-inconsistent-obj.json new file mode 100644 index 000000000..2bdc5f74c --- /dev/null +++ b/doc/rados/command/list-inconsistent-obj.json @@ -0,0 +1,237 @@ +{ + "$schema": "http://json-schema.org/draft-04/schema#", + "type": "object", + "properties": { + "epoch": { + "description": "Scrub epoch", + "type": "integer" + }, + "inconsistents": { + "type": "array", + "items": { + "type": "object", + "properties": { + "object": { + "description": "Identify a Ceph object", + "type": "object", + "properties": { + "name": { + "type": "string" + }, + "nspace": { + "type": "string" + }, + "locator": { + "type": "string" + }, + "version": { + "type": "integer", + "minimum": 0 + }, + "snap": { + "oneOf": [ + { + "type": "string", + "enum": [ "head", "snapdir" ] + }, + { + "type": "integer", + "minimum": 0 + } + ] + } + }, + "required": [ + "name", + "nspace", + "locator", + "version", + "snap" + ] + }, + "selected_object_info": { + "type": "object", + "description": "Selected object information", + "additionalProperties": true + }, + "union_shard_errors": { + "description": "Union of all shard errors", + "type": "array", + "items": { + "enum": [ + "missing", + "stat_error", + "read_error", + "data_digest_mismatch_info", + "omap_digest_mismatch_info", + "size_mismatch_info", + "ec_hash_error", + "ec_size_error", + "info_missing", + "info_corrupted", + "obj_size_info_mismatch", + "snapset_missing", + "snapset_corrupted", + "hinfo_missing", + "hinfo_corrupted" + ] + }, + "minItems": 0, + "uniqueItems": true + }, + "errors": { + "description": "Errors related to the analysis of this object", + "type": "array", + "items": { + "enum": [ + "object_info_inconsistency", + "data_digest_mismatch", + "omap_digest_mismatch", + "size_mismatch", + "attr_value_mismatch", + "attr_name_mismatch", + "snapset_inconsistency", + "hinfo_inconsistency", + "size_too_large" + ] + }, + "minItems": 0, + "uniqueItems": true + }, + "shards": { + "description": "All found or expected shards", + "type": "array", + "items": { + "description": "Information about a particular shard of object", + "type": "object", + "properties": { + "object_info": { + "oneOf": [ + { + "type": "string" + }, + { + "type": "object", + "description": "Object information", + "additionalProperties": true + } + ] + }, + "snapset": { + "oneOf": [ + { + "type": "string" + }, + { + "type": "object", + "description": "Snap set information", + "additionalProperties": true + } + ] + }, + "hashinfo": { + "oneOf": [ + { + "type": "string" + }, + { + "type": "object", + "description": "Erasure code hash information", + "additionalProperties": true + } + ] + }, + "shard": { + "type": "integer" + }, + "osd": { + "type": "integer" + }, + "primary": { + "type": "boolean" + }, + "size": { + "type": "integer" + }, + "omap_digest": { + "description": "Hex representation (e.g. 0x1abd1234)", + "type": "string" + }, + "data_digest": { + "description": "Hex representation (e.g. 
0x1abd1234)", + "type": "string" + }, + "errors": { + "description": "Errors with this shard", + "type": "array", + "items": { + "enum": [ + "missing", + "stat_error", + "read_error", + "data_digest_mismatch_info", + "omap_digest_mismatch_info", + "size_mismatch_info", + "ec_hash_error", + "ec_size_error", + "info_missing", + "info_corrupted", + "obj_size_info_mismatch", + "snapset_missing", + "snapset_corrupted", + "hinfo_missing", + "hinfo_corrupted" + ] + }, + "minItems": 0, + "uniqueItems": true + }, + "attrs": { + "description": "If any shard's attr error is set then all attrs are here", + "type": "array", + "items": { + "description": "Information about a particular shard of object", + "type": "object", + "properties": { + "name": { + "type": "string" + }, + "value": { + "type": "string" + }, + "Base64": { + "type": "boolean" + } + }, + "required": [ + "name", + "value", + "Base64" + ], + "additionalProperties": false + } + } + }, + "additionalProperties": false, + "required": [ + "osd", + "primary", + "errors" + ] + } + } + }, + "required": [ + "object", + "union_shard_errors", + "errors", + "shards" + ] + } + } + }, + "required": [ + "epoch", + "inconsistents" + ] +} diff --git a/doc/rados/command/list-inconsistent-snap.json b/doc/rados/command/list-inconsistent-snap.json new file mode 100644 index 000000000..55f1d53e9 --- /dev/null +++ b/doc/rados/command/list-inconsistent-snap.json @@ -0,0 +1,86 @@ +{ + "$schema": "http://json-schema.org/draft-04/schema#", + "type": "object", + "properties": { + "epoch": { + "description": "Scrub epoch", + "type": "integer" + }, + "inconsistents": { + "type": "array", + "items": { + "type": "object", + "properties": { + "name": { + "type": "string" + }, + "nspace": { + "type": "string" + }, + "locator": { + "type": "string" + }, + "snap": { + "oneOf": [ + { + "type": "string", + "enum": [ + "head", + "snapdir" + ] + }, + { + "type": "integer", + "minimum": 0 + } + ] + }, + "errors": { + "description": "Errors for this object's snap", + "type": "array", + "items": { + "enum": [ + "snapset_missing", + "snapset_corrupted", + "info_missing", + "info_corrupted", + "snapset_error", + "headless", + "size_mismatch", + "extra_clones", + "clone_missing" + ] + }, + "minItems": 0, + "uniqueItems": true + }, + "missing": { + "description": "List of missing clones if clone_missing error set", + "type": "array", + "items": { + "type": "integer" + } + }, + "extra_clones": { + "description": "List of extra clones if extra_clones error set", + "type": "array", + "items": { + "type": "integer" + } + } + }, + "required": [ + "name", + "nspace", + "locator", + "snap", + "errors" + ] + } + } + }, + "required": [ + "epoch", + "inconsistents" + ] +} diff --git a/doc/rados/configuration/auth-config-ref.rst b/doc/rados/configuration/auth-config-ref.rst new file mode 100644 index 000000000..fc14f4ee6 --- /dev/null +++ b/doc/rados/configuration/auth-config-ref.rst @@ -0,0 +1,379 @@ +.. _rados-cephx-config-ref: + +======================== + CephX Config Reference +======================== + +The CephX protocol is enabled by default. The cryptographic authentication that +CephX provides has some computational costs, though they should generally be +quite low. If the network environment connecting your client and server hosts +is very safe and you cannot afford authentication, you can disable it. +**Disabling authentication is not generally recommended**. + +.. 
note:: If you disable authentication, you will be at risk of a + man-in-the-middle attack that alters your client/server messages, which + could have disastrous security effects. + +For information about creating users, see `User Management`_. For details on +the architecture of CephX, see `Architecture - High Availability +Authentication`_. + + +Deployment Scenarios +==================== + +How you initially configure CephX depends on your scenario. There are two +common strategies for deploying a Ceph cluster. If you are a first-time Ceph +user, you should probably take the easiest approach: using ``cephadm`` to +deploy a cluster. But if your cluster uses other deployment tools (for example, +Ansible, Chef, Juju, or Puppet), you will need either to use the manual +deployment procedures or to configure your deployment tool so that it will +bootstrap your monitor(s). + +Manual Deployment +----------------- + +When you deploy a cluster manually, it is necessary to bootstrap the monitors +manually and to create the ``client.admin`` user and keyring. To bootstrap +monitors, follow the steps in `Monitor Bootstrapping`_. Follow these steps when +using third-party deployment tools (for example, Chef, Puppet, and Juju). + + +Enabling/Disabling CephX +======================== + +Enabling CephX is possible only if the keys for your monitors, OSDs, and +metadata servers have already been deployed. If you are simply toggling CephX +on or off, it is not necessary to repeat the bootstrapping procedures. + +Enabling CephX +-------------- + +When CephX is enabled, Ceph will look for the keyring in the default search +path: this path includes ``/etc/ceph/$cluster.$name.keyring``. It is possible +to override this search-path location by adding a ``keyring`` option in the +``[global]`` section of your `Ceph configuration`_ file, but this is not +recommended. + +To enable CephX on a cluster for which authentication has been disabled, carry +out the following procedure. If you (or your deployment utility) have already +generated the keys, you may skip the steps related to generating keys. + +#. Create a ``client.admin`` key, and save a copy of the key for your client + host: + + .. prompt:: bash $ + + ceph auth get-or-create client.admin mon 'allow *' mds 'allow *' mgr 'allow *' osd 'allow *' -o /etc/ceph/ceph.client.admin.keyring + + **Warning:** This step will clobber any existing + ``/etc/ceph/client.admin.keyring`` file. Do not perform this step if a + deployment tool has already generated a keyring file for you. Be careful! + +#. Create a monitor keyring and generate a monitor secret key: + + .. prompt:: bash $ + + ceph-authtool --create-keyring /tmp/ceph.mon.keyring --gen-key -n mon. --cap mon 'allow *' + +#. For each monitor, copy the monitor keyring into a ``ceph.mon.keyring`` file + in the monitor's ``mon data`` directory. For example, to copy the monitor + keyring to ``mon.a`` in a cluster called ``ceph``, run the following + command: + + .. prompt:: bash $ + + cp /tmp/ceph.mon.keyring /var/lib/ceph/mon/ceph-a/keyring + +#. Generate a secret key for every MGR, where ``{$id}`` is the MGR letter: + + .. prompt:: bash $ + + ceph auth get-or-create mgr.{$id} mon 'allow profile mgr' mds 'allow *' osd 'allow *' -o /var/lib/ceph/mgr/ceph-{$id}/keyring + +#. Generate a secret key for every OSD, where ``{$id}`` is the OSD number: + + .. prompt:: bash $ + + ceph auth get-or-create osd.{$id} mon 'allow rwx' osd 'allow *' -o /var/lib/ceph/osd/ceph-{$id}/keyring + +#. 
Generate a secret key for every MDS, where ``{$id}`` is the MDS letter: + + .. prompt:: bash $ + + ceph auth get-or-create mds.{$id} mon 'allow rwx' osd 'allow *' mds 'allow *' mgr 'allow profile mds' -o /var/lib/ceph/mds/ceph-{$id}/keyring + +#. Enable CephX authentication by setting the following options in the + ``[global]`` section of your `Ceph configuration`_ file: + + .. code-block:: ini + + auth_cluster_required = cephx + auth_service_required = cephx + auth_client_required = cephx + +#. Start or restart the Ceph cluster. For details, see `Operating a Cluster`_. + +For details on bootstrapping a monitor manually, see `Manual Deployment`_. + + + +Disabling CephX +--------------- + +The following procedure describes how to disable CephX. If your cluster +environment is safe, you might want to disable CephX in order to offset the +computational expense of running authentication. **We do not recommend doing +so.** However, setup and troubleshooting might be easier if authentication is +temporarily disabled and subsequently re-enabled. + +#. Disable CephX authentication by setting the following options in the + ``[global]`` section of your `Ceph configuration`_ file: + + .. code-block:: ini + + auth_cluster_required = none + auth_service_required = none + auth_client_required = none + +#. Start or restart the Ceph cluster. For details, see `Operating a Cluster`_. + + +Configuration Settings +====================== + +Enablement +---------- + + +``auth_cluster_required`` + +:Description: If this configuration setting is enabled, the Ceph Storage + Cluster daemons (that is, ``ceph-mon``, ``ceph-osd``, + ``ceph-mds``, and ``ceph-mgr``) are required to authenticate with + each other. Valid settings are ``cephx`` or ``none``. + +:Type: String +:Required: No +:Default: ``cephx``. + + +``auth_service_required`` + +:Description: If this configuration setting is enabled, then Ceph clients can + access Ceph services only if those clients authenticate with the + Ceph Storage Cluster. Valid settings are ``cephx`` or ``none``. + +:Type: String +:Required: No +:Default: ``cephx``. + + +``auth_client_required`` + +:Description: If this configuration setting is enabled, then communication + between the Ceph client and Ceph Storage Cluster can be + established only if the Ceph Storage Cluster authenticates + against the Ceph client. Valid settings are ``cephx`` or + ``none``. + +:Type: String +:Required: No +:Default: ``cephx``. + + +.. index:: keys; keyring + +Keys +---- + +When Ceph is run with authentication enabled, ``ceph`` administrative commands +and Ceph clients can access the Ceph Storage Cluster only if they use +authentication keys. + +The most common way to make these keys available to ``ceph`` administrative +commands and Ceph clients is to include a Ceph keyring under the ``/etc/ceph`` +directory. For Octopus and later releases that use ``cephadm``, the filename is +usually ``ceph.client.admin.keyring``. If the keyring is included in the +``/etc/ceph`` directory, then it is unnecessary to specify a ``keyring`` entry +in the Ceph configuration file. + +Because the Ceph Storage Cluster's keyring file contains the ``client.admin`` +key, we recommend copying the keyring file to nodes from which you run +administrative commands. + +To perform this step manually, run the following command: + +.. prompt:: bash $ + + sudo scp {user}@{ceph-cluster-host}:/etc/ceph/ceph.client.admin.keyring /etc/ceph/ceph.client.admin.keyring + +.. 
tip:: Make sure that the ``ceph.keyring`` file has appropriate permissions + (for example, ``chmod 644``) set on your client machine. + +You can specify the key itself by using the ``key`` setting in the Ceph +configuration file (this approach is not recommended), or instead specify a +path to a keyfile by using the ``keyfile`` setting in the Ceph configuration +file. + +``keyring`` + +:Description: The path to the keyring file. +:Type: String +:Required: No +:Default: ``/etc/ceph/$cluster.$name.keyring,/etc/ceph/$cluster.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin`` + + +``keyfile`` + +:Description: The path to a keyfile (that is, a file containing only the key). +:Type: String +:Required: No +:Default: None + + +``key`` + +:Description: The key (that is, the text string of the key itself). We do not + recommend that you use this setting unless you know what you're + doing. +:Type: String +:Required: No +:Default: None + + +Daemon Keyrings +--------------- + +Administrative users or deployment tools (for example, ``cephadm``) generate +daemon keyrings in the same way that they generate user keyrings. By default, +Ceph stores the keyring of a daemon inside that daemon's data directory. The +default keyring locations and the capabilities that are necessary for the +daemon to function are shown below. + +``ceph-mon`` + +:Location: ``$mon_data/keyring`` +:Capabilities: ``mon 'allow *'`` + +``ceph-osd`` + +:Location: ``$osd_data/keyring`` +:Capabilities: ``mgr 'allow profile osd' mon 'allow profile osd' osd 'allow *'`` + +``ceph-mds`` + +:Location: ``$mds_data/keyring`` +:Capabilities: ``mds 'allow' mgr 'allow profile mds' mon 'allow profile mds' osd 'allow rwx'`` + +``ceph-mgr`` + +:Location: ``$mgr_data/keyring`` +:Capabilities: ``mon 'allow profile mgr' mds 'allow *' osd 'allow *'`` + +``radosgw`` + +:Location: ``$rgw_data/keyring`` +:Capabilities: ``mon 'allow rwx' osd 'allow rwx'`` + + +.. note:: The monitor keyring (that is, ``mon.``) contains a key but no + capabilities, and this keyring is not part of the cluster ``auth`` database. + +The daemon's data-directory locations default to directories of the form:: + + /var/lib/ceph/$type/$cluster-$id + +For example, ``osd.12`` would have the following data directory:: + + /var/lib/ceph/osd/ceph-12 + +It is possible to override these locations, but it is not recommended. + + +.. index:: signatures + +Signatures +---------- + +Ceph performs a signature check that provides some limited protection against +messages being tampered with in flight (for example, by a "man in the middle" +attack). + +As with other parts of Ceph authentication, signatures admit of fine-grained +control. You can enable or disable signatures for service messages between +clients and Ceph, and for messages between Ceph daemons. + +Note that even when signatures are enabled data is not encrypted in flight. + +``cephx_require_signatures`` + +:Description: If this configuration setting is set to ``true``, Ceph requires + signatures on all message traffic between the Ceph client and the + Ceph Storage Cluster, and between daemons within the Ceph Storage + Cluster. + +.. note:: + **ANTIQUATED NOTE:** + + Neither Ceph Argonaut nor Linux kernel versions prior to 3.19 + support signatures; if one of these clients is in use, ``cephx_require_signatures`` + can be disabled in order to allow the client to connect. 
+ + +:Type: Boolean +:Required: No +:Default: ``false`` + + +``cephx_cluster_require_signatures`` + +:Description: If this configuration setting is set to ``true``, Ceph requires + signatures on all message traffic between Ceph daemons within the + Ceph Storage Cluster. + +:Type: Boolean +:Required: No +:Default: ``false`` + + +``cephx_service_require_signatures`` + +:Description: If this configuration setting is set to ``true``, Ceph requires + signatures on all message traffic between Ceph clients and the + Ceph Storage Cluster. + +:Type: Boolean +:Required: No +:Default: ``false`` + + +``cephx_sign_messages`` + +:Description: If this configuration setting is set to ``true``, and if the Ceph + version supports message signing, then Ceph will sign all + messages so that they are more difficult to spoof. + +:Type: Boolean +:Default: ``true`` + + +Time to Live +------------ + +``auth_service_ticket_ttl`` + +:Description: When the Ceph Storage Cluster sends a ticket for authentication + to a Ceph client, the Ceph Storage Cluster assigns that ticket a + Time To Live (TTL). + +:Type: Double +:Default: ``60*60`` + + +.. _Monitor Bootstrapping: ../../../install/manual-deployment#monitor-bootstrapping +.. _Operating a Cluster: ../../operations/operating +.. _Manual Deployment: ../../../install/manual-deployment +.. _Ceph configuration: ../ceph-conf +.. _Architecture - High Availability Authentication: ../../../architecture#high-availability-authentication +.. _User Management: ../../operations/user-management diff --git a/doc/rados/configuration/bluestore-config-ref.rst b/doc/rados/configuration/bluestore-config-ref.rst new file mode 100644 index 000000000..3707be1aa --- /dev/null +++ b/doc/rados/configuration/bluestore-config-ref.rst @@ -0,0 +1,552 @@ +================================== + BlueStore Configuration Reference +================================== + +Devices +======= + +BlueStore manages either one, two, or in certain cases three storage devices. +These *devices* are "devices" in the Linux/Unix sense. This means that they are +assets listed under ``/dev`` or ``/devices``. Each of these devices may be an +entire storage drive, or a partition of a storage drive, or a logical volume. +BlueStore does not create or mount a conventional file system on devices that +it uses; BlueStore reads and writes to the devices directly in a "raw" fashion. + +In the simplest case, BlueStore consumes all of a single storage device. This +device is known as the *primary device*. The primary device is identified by +the ``block`` symlink in the data directory. + +The data directory is a ``tmpfs`` mount. When this data directory is booted or +activated by ``ceph-volume``, it is populated with metadata files and links +that hold information about the OSD: for example, the OSD's identifier, the +name of the cluster that the OSD belongs to, and the OSD's private keyring. + +In more complicated cases, BlueStore is deployed across one or two additional +devices: + +* A *write-ahead log (WAL) device* (identified as ``block.wal`` in the data + directory) can be used to separate out BlueStore's internal journal or + write-ahead log. Using a WAL device is advantageous only if the WAL device + is faster than the primary device (for example, if the WAL device is an SSD + and the primary device is an HDD). +* A *DB device* (identified as ``block.db`` in the data directory) can be used + to store BlueStore's internal metadata. 
BlueStore (or more precisely, the
  embedded RocksDB) will put as much metadata as it can on the DB device in
  order to improve performance. If the DB device becomes full, metadata will
  spill back onto the primary device (where it would have been located in the
  absence of the DB device). Again, it is advantageous to provision a DB device
  only if it is faster than the primary device.

If there is only a small amount of fast storage available (for example, less
than a gigabyte), we recommend using the available space as a WAL device. But
if more fast storage is available, it makes more sense to provision a DB
device. Because the BlueStore journal is always placed on the fastest device
available, using a DB device provides the same benefit that using a WAL device
would, while *also* allowing additional metadata to be stored off the primary
device (provided that it fits). DB devices make this possible because whenever
a DB device is specified but an explicit WAL device is not, the WAL will be
implicitly colocated with the DB on the faster device.

To provision a single-device (colocated) BlueStore OSD, run the following
command:

.. prompt:: bash $

   ceph-volume lvm prepare --bluestore --data <device>

To specify a WAL device or DB device, run the following command:

.. prompt:: bash $

   ceph-volume lvm prepare --bluestore --data <device> --block.wal <wal-device> --block.db <db-device>

.. note:: The option ``--data`` can take as its argument any of the following
   devices: logical volumes specified using *vg/lv* notation, existing logical
   volumes, and GPT partitions.



Provisioning strategies
-----------------------

BlueStore differs from Filestore in that there are several ways to deploy a
BlueStore OSD. However, the overall deployment strategy for BlueStore can be
clarified by examining just these two common arrangements:

.. _bluestore-single-type-device-config:

**block (data) only**
^^^^^^^^^^^^^^^^^^^^^

If all devices are of the same type (for example, they are all HDDs), and if
there are no fast devices available for the storage of metadata, then it makes
sense to specify the block device only and to leave ``block.db`` and
``block.wal`` unseparated. The :ref:`ceph-volume-lvm` command for a single
``/dev/sda`` device is as follows:

.. prompt:: bash $

   ceph-volume lvm create --bluestore --data /dev/sda

If the devices to be used for a BlueStore OSD are pre-created logical volumes,
then the :ref:`ceph-volume-lvm` call for a logical volume named
``ceph-vg/block-lv`` is as follows:

.. prompt:: bash $

   ceph-volume lvm create --bluestore --data ceph-vg/block-lv

.. _bluestore-mixed-device-config:

**block and block.db**
^^^^^^^^^^^^^^^^^^^^^^

If you have a mix of fast and slow devices (for example, SSD and HDD), then we
recommend placing ``block.db`` on the faster device while ``block`` (that is,
the data) is stored on the slower device (that is, the rotational drive).

You must create these volume groups and logical volumes manually, as the
``ceph-volume`` tool is currently unable to create them automatically.

The following procedure illustrates the manual creation of volume groups and
logical volumes. For this example, we shall assume four rotational drives
(``sda``, ``sdb``, ``sdc``, and ``sdd``) and one (fast) SSD (``sdx``). First,
to create the volume groups, run the following commands:

.. 
prompt:: bash $ + + vgcreate ceph-block-0 /dev/sda + vgcreate ceph-block-1 /dev/sdb + vgcreate ceph-block-2 /dev/sdc + vgcreate ceph-block-3 /dev/sdd + +Next, to create the logical volumes for ``block``, run the following commands: + +.. prompt:: bash $ + + lvcreate -l 100%FREE -n block-0 ceph-block-0 + lvcreate -l 100%FREE -n block-1 ceph-block-1 + lvcreate -l 100%FREE -n block-2 ceph-block-2 + lvcreate -l 100%FREE -n block-3 ceph-block-3 + +Because there are four HDDs, there will be four OSDs. Supposing that there is a +200GB SSD in ``/dev/sdx``, we can create four 50GB logical volumes by running +the following commands: + +.. prompt:: bash $ + + vgcreate ceph-db-0 /dev/sdx + lvcreate -L 50GB -n db-0 ceph-db-0 + lvcreate -L 50GB -n db-1 ceph-db-0 + lvcreate -L 50GB -n db-2 ceph-db-0 + lvcreate -L 50GB -n db-3 ceph-db-0 + +Finally, to create the four OSDs, run the following commands: + +.. prompt:: bash $ + + ceph-volume lvm create --bluestore --data ceph-block-0/block-0 --block.db ceph-db-0/db-0 + ceph-volume lvm create --bluestore --data ceph-block-1/block-1 --block.db ceph-db-0/db-1 + ceph-volume lvm create --bluestore --data ceph-block-2/block-2 --block.db ceph-db-0/db-2 + ceph-volume lvm create --bluestore --data ceph-block-3/block-3 --block.db ceph-db-0/db-3 + +After this procedure is finished, there should be four OSDs, ``block`` should +be on the four HDDs, and each HDD should have a 50GB logical volume +(specifically, a DB device) on the shared SSD. + +Sizing +====== +When using a :ref:`mixed spinning-and-solid-drive setup +<bluestore-mixed-device-config>`, it is important to make a large enough +``block.db`` logical volume for BlueStore. The logical volumes associated with +``block.db`` should have logical volumes that are *as large as possible*. + +It is generally recommended that the size of ``block.db`` be somewhere between +1% and 4% of the size of ``block``. For RGW workloads, it is recommended that +the ``block.db`` be at least 4% of the ``block`` size, because RGW makes heavy +use of ``block.db`` to store metadata (in particular, omap keys). For example, +if the ``block`` size is 1TB, then ``block.db`` should have a size of at least +40GB. For RBD workloads, however, ``block.db`` usually needs no more than 1% to +2% of the ``block`` size. + +In older releases, internal level sizes are such that the DB can fully utilize +only those specific partition / logical volume sizes that correspond to sums of +L0, L0+L1, L1+L2, and so on--that is, given default settings, sizes of roughly +3GB, 30GB, 300GB, and so on. Most deployments do not substantially benefit from +sizing that accommodates L3 and higher, though DB compaction can be facilitated +by doubling these figures to 6GB, 60GB, and 600GB. + +Improvements in Nautilus 14.2.12, Octopus 15.2.6, and subsequent releases allow +for better utilization of arbitrarily-sized DB devices. Moreover, the Pacific +release brings experimental dynamic-level support. Because of these advances, +users of older releases might want to plan ahead by provisioning larger DB +devices today so that the benefits of scale can be realized when upgrades are +made in the future. + +When *not* using a mix of fast and slow devices, there is no requirement to +create separate logical volumes for ``block.db`` or ``block.wal``. BlueStore +will automatically colocate these devices within the space of ``block``. 
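
To make the sizing guidance above concrete, here is a minimal sketch of
provisioning one more such OSD. It assumes a hypothetical 1 TB rotational data
drive at ``/dev/sde`` and the already-created SSD volume group ``ceph-db-0``
from the example above; the 40 GB figure is simply 4% of 1 TB, the minimum
suggested for RGW-heavy workloads:

.. prompt:: bash $

   vgcreate ceph-block-4 /dev/sde
   lvcreate -l 100%FREE -n block-4 ceph-block-4
   lvcreate -L 40GB -n db-4 ceph-db-0
   ceph-volume lvm create --bluestore --data ceph-block-4/block-4 --block.db ceph-db-0/db-4

If the OSDs will instead serve mostly RBD workloads, a smaller DB logical
volume (roughly 1% to 2% of ``block``) is usually sufficient.
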
+ +Automatic Cache Sizing +====================== + +BlueStore can be configured to automatically resize its caches, provided that +certain conditions are met: TCMalloc must be configured as the memory allocator +and the ``bluestore_cache_autotune`` configuration option must be enabled (note +that it is currently enabled by default). When automatic cache sizing is in +effect, BlueStore attempts to keep OSD heap-memory usage under a certain target +size (as determined by ``osd_memory_target``). This approach makes use of a +best-effort algorithm and caches do not shrink smaller than the size defined by +the value of ``osd_memory_cache_min``. Cache ratios are selected in accordance +with a hierarchy of priorities. But if priority information is not available, +the values specified in the ``bluestore_cache_meta_ratio`` and +``bluestore_cache_kv_ratio`` options are used as fallback cache ratios. + +.. confval:: bluestore_cache_autotune +.. confval:: osd_memory_target +.. confval:: bluestore_cache_autotune_interval +.. confval:: osd_memory_base +.. confval:: osd_memory_expected_fragmentation +.. confval:: osd_memory_cache_min +.. confval:: osd_memory_cache_resize_interval + + +Manual Cache Sizing +=================== + +The amount of memory consumed by each OSD to be used for its BlueStore cache is +determined by the ``bluestore_cache_size`` configuration option. If that option +has not been specified (that is, if it remains at 0), then Ceph uses a +different configuration option to determine the default memory budget: +``bluestore_cache_size_hdd`` if the primary device is an HDD, or +``bluestore_cache_size_ssd`` if the primary device is an SSD. + +BlueStore and the rest of the Ceph OSD daemon make every effort to work within +this memory budget. Note that in addition to the configured cache size, there +is also memory consumed by the OSD itself. There is additional utilization due +to memory fragmentation and other allocator overhead. + +The configured cache-memory budget can be used to store the following types of +things: + +* Key/Value metadata (that is, RocksDB's internal cache) +* BlueStore metadata +* BlueStore data (that is, recently read or recently written object data) + +Cache memory usage is governed by the configuration options +``bluestore_cache_meta_ratio`` and ``bluestore_cache_kv_ratio``. The fraction +of the cache that is reserved for data is governed by both the effective +BlueStore cache size (which depends on the relevant +``bluestore_cache_size[_ssd|_hdd]`` option and the device class of the primary +device) and the "meta" and "kv" ratios. This data fraction can be calculated +with the following formula: ``<effective_cache_size> * (1 - +bluestore_cache_meta_ratio - bluestore_cache_kv_ratio)``. + +.. confval:: bluestore_cache_size +.. confval:: bluestore_cache_size_hdd +.. confval:: bluestore_cache_size_ssd +.. confval:: bluestore_cache_meta_ratio +.. confval:: bluestore_cache_kv_ratio + +Checksums +========= + +BlueStore checksums all metadata and all data written to disk. Metadata +checksumming is handled by RocksDB and uses the `crc32c` algorithm. By +contrast, data checksumming is handled by BlueStore and can use either +`crc32c`, `xxhash32`, or `xxhash64`. Nonetheless, `crc32c` is the default +checksum algorithm and it is suitable for most purposes. + +Full data checksumming increases the amount of metadata that BlueStore must +store and manage. 
Whenever possible (for example, when clients hint that data +is written and read sequentially), BlueStore will checksum larger blocks. In +many cases, however, it must store a checksum value (usually 4 bytes) for every +4 KB block of data. + +It is possible to obtain a smaller checksum value by truncating the checksum to +one or two bytes and reducing the metadata overhead. A drawback of this +approach is that it increases the probability of a random error going +undetected: about one in four billion given a 32-bit (4 byte) checksum, 1 in +65,536 given a 16-bit (2 byte) checksum, and 1 in 256 given an 8-bit (1 byte) +checksum. To use the smaller checksum values, select `crc32c_16` or `crc32c_8` +as the checksum algorithm. + +The *checksum algorithm* can be specified either via a per-pool ``csum_type`` +configuration option or via the global configuration option. For example: + +.. prompt:: bash $ + + ceph osd pool set <pool-name> csum_type <algorithm> + +.. confval:: bluestore_csum_type + +Inline Compression +================== + +BlueStore supports inline compression using `snappy`, `zlib`, `lz4`, or `zstd`. + +Whether data in BlueStore is compressed is determined by two factors: (1) the +*compression mode* and (2) any client hints associated with a write operation. +The compression modes are as follows: + +* **none**: Never compress data. +* **passive**: Do not compress data unless the write operation has a + *compressible* hint set. +* **aggressive**: Do compress data unless the write operation has an + *incompressible* hint set. +* **force**: Try to compress data no matter what. + +For more information about the *compressible* and *incompressible* I/O hints, +see :c:func:`rados_set_alloc_hint`. + +Note that data in Bluestore will be compressed only if the data chunk will be +sufficiently reduced in size (as determined by the ``bluestore compression +required ratio`` setting). No matter which compression modes have been used, if +the data chunk is too big, then it will be discarded and the original +(uncompressed) data will be stored instead. For example, if ``bluestore +compression required ratio`` is set to ``.7``, then data compression will take +place only if the size of the compressed data is no more than 70% of the size +of the original data. + +The *compression mode*, *compression algorithm*, *compression required ratio*, +*min blob size*, and *max blob size* settings can be specified either via a +per-pool property or via a global config option. To specify pool properties, +run the following commands: + +.. prompt:: bash $ + + ceph osd pool set <pool-name> compression_algorithm <algorithm> + ceph osd pool set <pool-name> compression_mode <mode> + ceph osd pool set <pool-name> compression_required_ratio <ratio> + ceph osd pool set <pool-name> compression_min_blob_size <size> + ceph osd pool set <pool-name> compression_max_blob_size <size> + +.. confval:: bluestore_compression_algorithm +.. confval:: bluestore_compression_mode +.. confval:: bluestore_compression_required_ratio +.. confval:: bluestore_compression_min_blob_size +.. confval:: bluestore_compression_min_blob_size_hdd +.. confval:: bluestore_compression_min_blob_size_ssd +.. confval:: bluestore_compression_max_blob_size +.. confval:: bluestore_compression_max_blob_size_hdd +.. confval:: bluestore_compression_max_blob_size_ssd + +.. _bluestore-rocksdb-sharding: + +RocksDB Sharding +================ + +BlueStore maintains several types of internal key-value data, all of which are +stored in RocksDB. 
Each data type in BlueStore is assigned a unique prefix. +Prior to the Pacific release, all key-value data was stored in a single RocksDB +column family: 'default'. In Pacific and later releases, however, BlueStore can +divide key-value data into several RocksDB column families. BlueStore achieves +better caching and more precise compaction when keys are similar: specifically, +when keys have similar access frequency, similar modification frequency, and a +similar lifetime. Under such conditions, performance is improved and less disk +space is required during compaction (because each column family is smaller and +is able to compact independently of the others). + +OSDs deployed in Pacific or later releases use RocksDB sharding by default. +However, if Ceph has been upgraded to Pacific or a later version from a +previous version, sharding is disabled on any OSDs that were created before +Pacific. + +To enable sharding and apply the Pacific defaults to a specific OSD, stop the +OSD and run the following command: + + .. prompt:: bash # + + ceph-bluestore-tool \ + --path <data path> \ + --sharding="m(3) p(3,0-12) o(3,0-13)=block_cache={type=binned_lru} l p" \ + reshard + +.. confval:: bluestore_rocksdb_cf +.. confval:: bluestore_rocksdb_cfs + +Throttling +========== + +.. confval:: bluestore_throttle_bytes +.. confval:: bluestore_throttle_deferred_bytes +.. confval:: bluestore_throttle_cost_per_io +.. confval:: bluestore_throttle_cost_per_io_hdd +.. confval:: bluestore_throttle_cost_per_io_ssd + +SPDK Usage +========== + +To use the SPDK driver for NVMe devices, you must first prepare your system. +See `SPDK document`__. + +.. __: http://www.spdk.io/doc/getting_started.html#getting_started_examples + +SPDK offers a script that will configure the device automatically. Run this +script with root permissions: + +.. prompt:: bash $ + + sudo src/spdk/scripts/setup.sh + +You will need to specify the subject NVMe device's device selector with the +"spdk:" prefix for ``bluestore_block_path``. + +In the following example, you first find the device selector of an Intel NVMe +SSD by running the following command: + +.. prompt:: bash $ + + lspci -mm -n -d -d 8086:0953 + +The form of the device selector is either ``DDDD:BB:DD.FF`` or +``DDDD.BB.DD.FF``. + +Next, supposing that ``0000:01:00.0`` is the device selector found in the +output of the ``lspci`` command, you can specify the device selector by running +the following command:: + + bluestore_block_path = "spdk:trtype:pcie traddr:0000:01:00.0" + +You may also specify a remote NVMeoF target over the TCP transport, as in the +following example:: + + bluestore_block_path = "spdk:trtype:tcp traddr:10.67.110.197 trsvcid:4420 subnqn:nqn.2019-02.io.spdk:cnode1" + +To run multiple SPDK instances per node, you must make sure each instance uses +its own DPDK memory by specifying for each instance the amount of DPDK memory +(in MB) that the instance will use. + +In most cases, a single device can be used for data, DB, and WAL. We describe +this strategy as *colocating* these components. Be sure to enter the below +settings to ensure that all I/Os are issued through SPDK:: + + bluestore_block_db_path = "" + bluestore_block_db_size = 0 + bluestore_block_wal_path = "" + bluestore_block_wal_size = 0 + +If these settings are not entered, then the current implementation will +populate the SPDK map files with kernel file system symbols and will use the +kernel driver to issue DB/WAL I/Os. 
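
As a summary of the above, a hypothetical ``ceph.conf`` fragment for a fully
colocated, all-SPDK OSD might look like the following sketch (the PCIe address
is the example selector from above; substitute the one reported by ``lspci``
on your system):

.. code-block:: ini

    [osd]
    bluestore_block_path = "spdk:trtype:pcie traddr:0000:01:00.0"
    # Leave the DB and WAL unconfigured so that they are colocated on the
    # SPDK device and all I/O is issued through SPDK rather than the kernel.
    bluestore_block_db_path = ""
    bluestore_block_db_size = 0
    bluestore_block_wal_path = ""
    bluestore_block_wal_size = 0

In practice such settings are usually scoped to the specific ``[osd.N]``
section of the OSD that owns the NVMe device rather than to all OSDs.
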

Minimum Allocation Size
=======================

There is a configured minimum amount of storage that BlueStore allocates on an
underlying storage device. In practice, this is the least amount of capacity
that even a tiny RADOS object can consume on each OSD's primary device. The
configuration option in question--:confval:`bluestore_min_alloc_size`--derives
its value from the value of either :confval:`bluestore_min_alloc_size_hdd` or
:confval:`bluestore_min_alloc_size_ssd`, depending on the OSD's ``rotational``
attribute. Thus if an OSD is created on an HDD, BlueStore is initialized with
the current value of :confval:`bluestore_min_alloc_size_hdd`; but with SSD
OSDs (including NVMe devices), BlueStore is initialized with the current value
of :confval:`bluestore_min_alloc_size_ssd`.

In Mimic and earlier releases, the default values were 64KB for rotational
media (HDD) and 16KB for non-rotational media (SSD). The Octopus release
changed the default value for non-rotational media (SSD) to 4KB, and the
Pacific release changed the default value for rotational media (HDD) to 4KB.

These changes were driven by space amplification that was experienced by Ceph
RADOS Gateway (RGW) deployments that hosted large numbers of small files
(S3/Swift objects).

For example, when an RGW client stores a 1 KB S3 object, that object is
written to a single RADOS object. In accordance with the default
:confval:`min_alloc_size` value, 4 KB of underlying drive space is allocated.
This means that roughly 3 KB (that is, 4 KB minus 1 KB) is allocated but never
used: this corresponds to 300% overhead or 25% efficiency. Similarly, a 5 KB
user object will be stored as two RADOS objects, a 4 KB RADOS object and a
1 KB RADOS object, with the result that 4KB of device capacity is stranded. In
this case, however, the overhead percentage is much smaller. Think of this in
terms of the remainder from a modulus operation. The overhead *percentage*
thus decreases rapidly as object size increases.

There is an additional subtlety that is easily missed: the amplification
phenomenon just described takes place for *each* replica. For example, when
using the default of three copies of data (3R), a 1 KB S3 object actually
strands roughly 9 KB of storage device capacity. If erasure coding (EC) is
used instead of replication, the amplification might be even higher: for a
``k=4, m=2`` pool, our 1 KB S3 object allocates 24 KB (that is, 4 KB
multiplied by 6) of device capacity.

When an RGW bucket pool contains many relatively large user objects, the
effect of this phenomenon is often negligible. However, with deployments that
can expect a significant fraction of relatively small user objects, the effect
should be taken into consideration.

The 4KB default value aligns well with conventional HDD and SSD devices.
However, certain novel coarse-IU (Indirection Unit) QLC SSDs perform and wear
best when :confval:`bluestore_min_alloc_size_ssd` is specified at OSD creation
to match the device's IU: this might be 8KB, 16KB, or even 64KB. These novel
storage drives can achieve read performance that is competitive with that of
conventional TLC SSDs and write performance that is faster than that of HDDs,
with higher density and lower cost than TLC SSDs.

Note that when creating OSDs on these novel devices, one must be careful to
apply the non-default value only to appropriate devices, and not to
conventional HDD and SSD devices. 
Error can be avoided through careful ordering +of OSD creation, with custom OSD device classes, and especially by the use of +central configuration *masks*. + +In Quincy and later releases, you can use the +:confval:`bluestore_use_optimal_io_size_for_min_alloc_size` option to allow +automatic discovery of the correct value as each OSD is created. Note that the +use of ``bcache``, ``OpenCAS``, ``dmcrypt``, ``ATA over Ethernet``, `iSCSI`, or +other device-layering and abstraction technologies might confound the +determination of correct values. Moreover, OSDs deployed on top of VMware +storage have sometimes been found to report a ``rotational`` attribute that +does not match the underlying hardware. + +We suggest inspecting such OSDs at startup via logs and admin sockets in order +to ensure that their behavior is correct. Be aware that this kind of inspection +might not work as expected with older kernels. To check for this issue, +examine the presence and value of ``/sys/block/<drive>/queue/optimal_io_size``. + +.. note:: When running Reef or a later Ceph release, the ``min_alloc_size`` + baked into each OSD is conveniently reported by ``ceph osd metadata``. + +To inspect a specific OSD, run the following command: + +.. prompt:: bash # + + ceph osd metadata osd.1701 | egrep rotational\|alloc + +This space amplification might manifest as an unusually high ratio of raw to +stored data as reported by ``ceph df``. There might also be ``%USE`` / ``VAR`` +values reported by ``ceph osd df`` that are unusually high in comparison to +other, ostensibly identical, OSDs. Finally, there might be unexpected balancer +behavior in pools that use OSDs that have mismatched ``min_alloc_size`` values. + +This BlueStore attribute takes effect *only* at OSD creation; if the attribute +is changed later, a specific OSD's behavior will not change unless and until +the OSD is destroyed and redeployed with the appropriate option value(s). +Upgrading to a later Ceph release will *not* change the value used by OSDs that +were deployed under older releases or with other settings. + +.. confval:: bluestore_min_alloc_size +.. confval:: bluestore_min_alloc_size_hdd +.. confval:: bluestore_min_alloc_size_ssd +.. confval:: bluestore_use_optimal_io_size_for_min_alloc_size + +DSA (Data Streaming Accelerator) Usage +====================================== + +If you want to use the DML library to drive the DSA device for offloading +read/write operations on persistent memory (PMEM) in BlueStore, you need to +install `DML`_ and the `idxd-config`_ library. This will work only on machines +that have a SPR (Sapphire Rapids) CPU. + +.. _dml: https://github.com/intel/dml +.. _idxd-config: https://github.com/intel/idxd-config + +After installing the DML software, configure the shared work queues (WQs) with +reference to the following WQ configuration example: + +.. prompt:: bash $ + + accel-config config-wq --group-id=1 --mode=shared --wq-size=16 --threshold=15 --type=user --name="myapp1" --priority=10 --block-on-fault=1 dsa0/wq0.1 + accel-config config-engine dsa0/engine0.1 --group-id=1 + accel-config enable-device dsa0 + accel-config enable-wq dsa0/wq0.1 diff --git a/doc/rados/configuration/ceph-conf.rst b/doc/rados/configuration/ceph-conf.rst new file mode 100644 index 000000000..d8d5c9d03 --- /dev/null +++ b/doc/rados/configuration/ceph-conf.rst @@ -0,0 +1,715 @@ +.. 
_configuring-ceph: + +================== + Configuring Ceph +================== + +When Ceph services start, the initialization process activates a set of +daemons that run in the background. A :term:`Ceph Storage Cluster` runs at +least three types of daemons: + +- :term:`Ceph Monitor` (``ceph-mon``) +- :term:`Ceph Manager` (``ceph-mgr``) +- :term:`Ceph OSD Daemon` (``ceph-osd``) + +Any Ceph Storage Cluster that supports the :term:`Ceph File System` also runs +at least one :term:`Ceph Metadata Server` (``ceph-mds``). Any Cluster that +supports :term:`Ceph Object Storage` runs Ceph RADOS Gateway daemons +(``radosgw``). + +Each daemon has a number of configuration options, and each of those options +has a default value. Adjust the behavior of the system by changing these +configuration options. Make sure to understand the consequences before +overriding the default values, as it is possible to significantly degrade the +performance and stability of your cluster. Remember that default values +sometimes change between releases. For this reason, it is best to review the +version of this documentation that applies to your Ceph release. + +Option names +============ + +Each of the Ceph configuration options has a unique name that consists of words +formed with lowercase characters and connected with underscore characters +(``_``). + +When option names are specified on the command line, underscore (``_``) and +dash (``-``) characters can be used interchangeably (for example, +``--mon-host`` is equivalent to ``--mon_host``). + +When option names appear in configuration files, spaces can also be used in +place of underscores or dashes. However, for the sake of clarity and +convenience, we suggest that you consistently use underscores, as we do +throughout this documentation. + +Config sources +============== + +Each Ceph daemon, process, and library pulls its configuration from one or more +of the several sources listed below. Sources that occur later in the list +override those that occur earlier in the list (when both are present). + +- the compiled-in default value +- the monitor cluster's centralized configuration database +- a configuration file stored on the local host +- environment variables +- command-line arguments +- runtime overrides that are set by an administrator + +One of the first things a Ceph process does on startup is parse the +configuration options provided via the command line, via the environment, and +via the local configuration file. Next, the process contacts the monitor +cluster to retrieve centrally-stored configuration for the entire cluster. +After a complete view of the configuration is available, the startup of the +daemon or process will commence. + +.. _bootstrap-options: + +Bootstrap options +----------------- + +Bootstrap options are configuration options that affect the process's ability +to contact the monitors, to authenticate, and to retrieve the cluster-stored +configuration. For this reason, these options might need to be stored locally +on the node, and set by means of a local configuration file. These options +include the following: + +.. confval:: mon_host +.. confval:: mon_host_override + +- :confval:`mon_dns_srv_name` +- :confval:`mon_data`, :confval:`osd_data`, :confval:`mds_data`, + :confval:`mgr_data`, and similar options that define which local directory + the daemon stores its data in. 
+- :confval:`keyring`, :confval:`keyfile`, and/or :confval:`key`, which can be + used to specify the authentication credential to use to authenticate with the + monitor. Note that in most cases the default keyring location is in the data + directory specified above. +
+In most cases, there is no reason to modify the default values of these +options. However, there is one exception to this: the :confval:`mon_host` +option that identifies the addresses of the cluster's monitors. But when +:ref:`DNS is used to identify monitors<mon-dns-lookup>`, a local Ceph +configuration file can be avoided entirely. + +
+Skipping monitor config +----------------------- +
+The option ``--no-mon-config`` can be passed in any command in order to skip +the step that retrieves configuration information from the cluster's monitors. +Skipping this retrieval step can be useful in cases where configuration is +managed entirely via configuration files, or when maintenance activity needs to +be done but the monitor cluster is down. +
+.. _ceph-conf-file: +
+Configuration sections +====================== +
+Each of the configuration options associated with a single process or daemon +has a single value. However, the values for a configuration option can vary +across daemon types, and can vary even across different daemons of the same +type. Ceph options that are stored in the monitor configuration database or in +local configuration files are grouped into sections |---| so-called "configuration +sections" |---| to indicate which daemons or clients they apply to. + +
+These sections include the following: +
+.. confsec:: global + + Settings under ``global`` affect all daemons and clients + in a Ceph Storage Cluster. + + :example: ``log_file = /var/log/ceph/$cluster-$type.$id.log`` +
+.. confsec:: mon + + Settings under ``mon`` affect all ``ceph-mon`` daemons in + the Ceph Storage Cluster, and override the same setting in + ``global``. + + :example: ``mon_cluster_log_to_syslog = true`` +
+.. confsec:: mgr + + Settings in the ``mgr`` section affect all ``ceph-mgr`` daemons in + the Ceph Storage Cluster, and override the same setting in + ``global``. + + :example: ``mgr_stats_period = 10`` +
+.. confsec:: osd + + Settings under ``osd`` affect all ``ceph-osd`` daemons in + the Ceph Storage Cluster, and override the same setting in + ``global``. + + :example: ``osd_op_queue = wpq`` +
+.. confsec:: mds + + Settings in the ``mds`` section affect all ``ceph-mds`` daemons in + the Ceph Storage Cluster, and override the same setting in + ``global``. + + :example: ``mds_cache_memory_limit = 10G`` +
+.. confsec:: client + + Settings under ``client`` affect all Ceph clients + (for example, mounted Ceph File Systems, mounted Ceph Block Devices) + as well as RADOS Gateway (RGW) daemons. + + :example: ``objecter_inflight_ops = 512`` + +
+Configuration sections can also specify an individual daemon or client name. For example, +``mon.foo``, ``osd.123``, and ``client.smith`` are all valid section names. + +
+Any given daemon will draw its settings from the global section, the daemon- or +client-type section, and the section sharing its name. Settings in the +most-specific section take precedence: for example, if the same +option is specified in :confsec:`global`, :confsec:`mon`, and ``mon.foo`` +on the same source (that is, in the same configuration file), the +``mon.foo`` setting will be used. 
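The following minimal sketch illustrates this precedence, reusing the ``debug_ms`` option that appears in later examples in this file (the values shown are illustrative only):

.. code-block:: ini

    [global]
    debug_ms = 0       # applies to every daemon and client

    [mon]
    debug_ms = 1       # overrides [global] for all ceph-mon daemons

    [mon.foo]
    debug_ms = 10      # overrides [mon] and [global] for mon.foo only

With a file like this in place, ``mon.foo`` would run with ``debug_ms = 10``, the other monitors with ``1``, and all remaining daemons and clients with ``0``.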
+ +If multiple values of the same configuration option are specified in the same +section, the last value specified takes precedence. +
+Note that values from the local configuration file always take precedence over +values from the monitor configuration database, regardless of the section in +which they appear. +
+.. _ceph-metavariables: +
+Metavariables +============= +
+Metavariables dramatically simplify Ceph storage cluster configuration. When a +metavariable is set in a configuration value, Ceph expands the metavariable at +the time the configuration value is used. In this way, Ceph metavariables +behave similarly to the way that variable expansion works in the Bash shell. +
+Ceph supports the following metavariables: +
+.. describe:: $cluster + + Expands to the Ceph Storage Cluster name. Useful when running + multiple Ceph Storage Clusters on the same hardware. + + :example: ``/etc/ceph/$cluster.keyring`` + :default: ``ceph`` +
+.. describe:: $type + + Expands to a daemon or process type (for example, ``mds``, ``osd``, or ``mon``) + + :example: ``/var/lib/ceph/$type`` +
+.. describe:: $id + + Expands to the daemon or client identifier. For + ``osd.0``, this would be ``0``; for ``mds.a``, it would + be ``a``. + + :example: ``/var/lib/ceph/$type/$cluster-$id`` +
+.. describe:: $host + + Expands to the host name where the process is running. +
+.. describe:: $name + + Expands to ``$type.$id``. + + :example: ``/var/run/ceph/$cluster-$name.asok`` +
+.. describe:: $pid + + Expands to daemon pid. + + :example: ``/var/run/ceph/$cluster-$name-$pid.asok`` + +
+Ceph configuration file +======================= +
+On startup, Ceph processes search for a configuration file in the +following locations: +
+#. ``$CEPH_CONF`` (that is, the path following the ``$CEPH_CONF`` + environment variable) +#. ``-c path/path`` (that is, the ``-c`` command line argument) +#. ``/etc/ceph/$cluster.conf`` +#. ``~/.ceph/$cluster.conf`` +#. ``./$cluster.conf`` (that is, in the current working directory) +#. On FreeBSD systems only, ``/usr/local/etc/ceph/$cluster.conf`` +
+Here ``$cluster`` is the cluster's name (default: ``ceph``). +
+The Ceph configuration file uses an ``ini`` style syntax. You can add "comment +text" after a pound sign (#) or a semicolon (;). For example: +
+.. code-block:: ini +
+ # <--A number sign (#) precedes a comment. + ; A comment may be anything. + # Comments always follow a semicolon (;) or a pound sign (#) on each line. + # The end of the line terminates a comment. + # We recommend that you provide comments in your configuration file(s). + +
+.. _ceph-conf-settings: +
+Config file section names +------------------------- +
+The configuration file is divided into sections. Each section must begin with a +valid configuration section name (see `Configuration sections`_, above) that is +surrounded by square brackets. For example: +
+.. code-block:: ini +
+ [global] + debug_ms = 0 + + [osd] + debug_ms = 1 + + [osd.1] + debug_ms = 10 + + [osd.2] + debug_ms = 10 +
+Config file option values +------------------------- +
+The value of a configuration option is a string. If the string is too long to +fit on a single line, you can put a backslash (``\``) at the end of the line +and the backslash will act as a line continuation marker. In such a case, the +value of the option will be the string after ``=`` in the current line, +combined with the string in the next line. 
Here is an example:: + + [global] + foo = long long ago\ + long ago + +In this example, the value of the "``foo``" option is "``long long ago long +ago``". + +An option value typically ends with either a newline or a comment. For +example: + +.. code-block:: ini + + [global] + obscure_one = difficult to explain # I will try harder in next release + simpler_one = nothing to explain + +In this example, the value of the "``obscure one``" option is "``difficult to +explain``" and the value of the "``simpler one`` options is "``nothing to +explain``". + +When an option value contains spaces, it can be enclosed within single quotes +or double quotes in order to make its scope clear and in order to make sure +that the first space in the value is not interpreted as the end of the value. +For example: + +.. code-block:: ini + + [global] + line = "to be, or not to be" + +In option values, there are four characters that are treated as escape +characters: ``=``, ``#``, ``;`` and ``[``. They are permitted to occur in an +option value only if they are immediately preceded by the backslash character +(``\``). For example: + +.. code-block:: ini + + [global] + secret = "i love \# and \[" + +Each configuration option falls under one of the following types: + +.. describe:: int + + 64-bit signed integer. Some SI suffixes are supported, such as "K", "M", + "G", "T", "P", and "E" (meaning, respectively, 10\ :sup:`3`, 10\ :sup:`6`, + 10\ :sup:`9`, etc.). "B" is the only supported unit string. Thus "1K", "1M", + "128B" and "-1" are all valid option values. When a negative value is + assigned to a threshold option, this can indicate that the option is + "unlimited" -- that is, that there is no threshold or limit in effect. + + :example: ``42``, ``-1`` + +.. describe:: uint + + This differs from ``integer`` only in that negative values are not + permitted. + + :example: ``256``, ``0`` + +.. describe:: str + + A string encoded in UTF-8. Certain characters are not permitted. Reference + the above notes for the details. + + :example: ``"hello world"``, ``"i love \#"``, ``yet-another-name`` + +.. describe:: boolean + + Typically either of the two values ``true`` or ``false``. However, any + integer is permitted: "0" implies ``false``, and any non-zero value implies + ``true``. + + :example: ``true``, ``false``, ``1``, ``0`` + +.. describe:: addr + + A single address, optionally prefixed with ``v1``, ``v2`` or ``any`` for the + messenger protocol. If no prefix is specified, the ``v2`` protocol is used. + For more details, see :ref:`address_formats`. + + :example: ``v1:1.2.3.4:567``, ``v2:1.2.3.4:567``, ``1.2.3.4:567``, ``2409:8a1e:8fb6:aa20:1260:4bff:fe92:18f5::567``, ``[::1]:6789`` + +.. describe:: addrvec + + A set of addresses separated by ",". The addresses can be optionally quoted + with ``[`` and ``]``. + + :example: ``[v1:1.2.3.4:567,v2:1.2.3.4:568]``, ``v1:1.2.3.4:567,v1:1.2.3.14:567`` ``[2409:8a1e:8fb6:aa20:1260:4bff:fe92:18f5::567], [2409:8a1e:8fb6:aa20:1260:4bff:fe92:18f5::568]`` + +.. describe:: uuid + + The string format of a uuid defined by `RFC4122 + <https://www.ietf.org/rfc/rfc4122.txt>`_. Certain variants are also + supported: for more details, see `Boost document + <https://www.boost.org/doc/libs/1_74_0/libs/uuid/doc/uuid.html#String%20Generator>`_. + + :example: ``f81d4fae-7dec-11d0-a765-00a0c91e6bf6`` + +.. describe:: size + + 64-bit unsigned integer. Both SI prefixes and IEC prefixes are supported. + "B" is the only supported unit string. Negative values are not permitted. 
+ + :example: ``1Ki``, ``1K``, ``1KiB`` and ``1B``. + +.. describe:: secs + + Denotes a duration of time. The default unit of time is the second. + The following units of time are supported: + + * second: ``s``, ``sec``, ``second``, ``seconds`` + * minute: ``m``, ``min``, ``minute``, ``minutes`` + * hour: ``hs``, ``hr``, ``hour``, ``hours`` + * day: ``d``, ``day``, ``days`` + * week: ``w``, ``wk``, ``week``, ``weeks`` + * month: ``mo``, ``month``, ``months`` + * year: ``y``, ``yr``, ``year``, ``years`` + + :example: ``1 m``, ``1m`` and ``1 week`` + +.. _ceph-conf-database: + +Monitor configuration database +============================== + +The monitor cluster manages a database of configuration options that can be +consumed by the entire cluster. This allows for streamlined central +configuration management of the entire system. For ease of administration and +transparency, the vast majority of configuration options can and should be +stored in this database. + +Some settings might need to be stored in local configuration files because they +affect the ability of the process to connect to the monitors, to authenticate, +and to fetch configuration information. In most cases this applies only to the +``mon_host`` option. This issue can be avoided by using :ref:`DNS SRV +records<mon-dns-lookup>`. + +Sections and masks +------------------ + +Configuration options stored by the monitor can be stored in a global section, +in a daemon-type section, or in a specific daemon section. In this, they are +no different from the options in a configuration file. + +In addition, options may have a *mask* associated with them to further restrict +which daemons or clients the option applies to. Masks take two forms: + +#. ``type:location`` where ``type`` is a CRUSH property like ``rack`` or + ``host``, and ``location`` is a value for that property. For example, + ``host:foo`` would limit the option only to daemons or clients + running on a particular host. +#. ``class:device-class`` where ``device-class`` is the name of a CRUSH + device class (for example, ``hdd`` or ``ssd``). For example, + ``class:ssd`` would limit the option only to OSDs backed by SSDs. + (This mask has no effect on non-OSD daemons or clients.) + +In commands that specify a configuration option, the argument of the option (in +the following examples, this is the "who" string) may be a section name, a +mask, or a combination of both separated by a slash character (``/``). For +example, ``osd/rack:foo`` would refer to all OSD daemons in the ``foo`` rack. + +When configuration options are shown, the section name and mask are presented +in separate fields or columns to make them more readable. + +Commands +-------- + +The following CLI commands are used to configure the cluster: + +* ``ceph config dump`` dumps the entire monitor configuration + database for the cluster. + +* ``ceph config get <who>`` dumps the configuration options stored in + the monitor configuration database for a specific daemon or client + (for example, ``mds.a``). + +* ``ceph config get <who> <option>`` shows either a configuration value + stored in the monitor configuration database for a specific daemon or client + (for example, ``mds.a``), or, if that value is not present in the monitor + configuration database, the compiled-in default value. + +* ``ceph config set <who> <option> <value>`` specifies a configuration + option in the monitor configuration database. + +* ``ceph config show <who>`` shows the configuration for a running daemon. 
+ These settings might differ from those stored by the monitors if there are + also local configuration files in use or if options have been overridden on + the command line or at run time. The source of the values of the options is + displayed in the output. + +* ``ceph config assimilate-conf -i <input file> -o <output file>`` ingests a + configuration file from *input file* and moves any valid options into the + monitor configuration database. Any settings that are unrecognized, are + invalid, or cannot be controlled by the monitor will be returned in an + abbreviated configuration file stored in *output file*. This command is + useful for transitioning from legacy configuration files to centralized + monitor-based configuration. + +Note that ``ceph config set <who> <option> <value>`` and ``ceph config get +<who> <option>`` will not necessarily return the same values. The latter +command will show compiled-in default values. In order to determine whether a +configuration option is present in the monitor configuration database, run +``ceph config dump``. + +Help +==== + +To get help for a particular option, run the following command: + +.. prompt:: bash $ + + ceph config help <option> + +For example: + +.. prompt:: bash $ + + ceph config help log_file + +:: + + log_file - path to log file + (std::string, basic) + Default (non-daemon): + Default (daemon): /var/log/ceph/$cluster-$name.log + Can update at runtime: false + See also: [log_to_stderr,err_to_stderr,log_to_syslog,err_to_syslog] + +or: + +.. prompt:: bash $ + + ceph config help log_file -f json-pretty + +:: + + { + "name": "log_file", + "type": "std::string", + "level": "basic", + "desc": "path to log file", + "long_desc": "", + "default": "", + "daemon_default": "/var/log/ceph/$cluster-$name.log", + "tags": [], + "services": [], + "see_also": [ + "log_to_stderr", + "err_to_stderr", + "log_to_syslog", + "err_to_syslog" + ], + "enum_values": [], + "min": "", + "max": "", + "can_update_at_runtime": false + } + +The ``level`` property can be ``basic``, ``advanced``, or ``dev``. The `dev` +options are intended for use by developers, generally for testing purposes, and +are not recommended for use by operators. + +.. note:: This command uses the configuration schema that is compiled into the + running monitors. If you have a mixed-version cluster (as might exist, for + example, during an upgrade), you might want to query the option schema from + a specific running daemon by running a command of the following form: + +.. prompt:: bash $ + + ceph daemon <name> config help [option] + +Runtime Changes +=============== + +In most cases, Ceph permits changes to the configuration of a daemon at +run time. This can be used for increasing or decreasing the amount of logging +output, for enabling or disabling debug settings, and for runtime optimization. + +Use the ``ceph config set`` command to update configuration options. For +example, to enable the most verbose debug log level on a specific OSD, run a +command of the following form: + +.. prompt:: bash $ + + ceph config set osd.123 debug_ms 20 + +.. note:: If an option has been customized in a local configuration file, the + `central config + <https://ceph.io/en/news/blog/2018/new-mimic-centralized-configuration-management/>`_ + setting will be ignored because it has a lower priority than the local + configuration file. + +.. note:: Log levels range from 0 to 20. 
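A quick way to confirm that a runtime change has taken hold is to read the value back: ``ceph config get`` reports what is stored in the monitor configuration database, while ``ceph config show`` reports what the running daemon is actually using (the OSD id and option below are illustrative only):

.. prompt:: bash $

   ceph config set osd.123 debug_ms 20
   ceph config get osd.123 debug_ms
   ceph config show osd.123 debug_ms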
+ +Override values +--------------- + +Options can be set temporarily by using the Ceph CLI ``tell`` or ``daemon`` +interfaces on the Ceph CLI. These *override* values are ephemeral, which means +that they affect only the current instance of the daemon and revert to +persistently configured values when the daemon restarts. + +Override values can be set in two ways: + +#. From any host, send a message to a daemon with a command of the following + form: + + .. prompt:: bash $ + + ceph tell <name> config set <option> <value> + + For example: + + .. prompt:: bash $ + + ceph tell osd.123 config set debug_osd 20 + + The ``tell`` command can also accept a wildcard as the daemon identifier. + For example, to adjust the debug level on all OSD daemons, run a command of + the following form: + + .. prompt:: bash $ + + ceph tell osd.* config set debug_osd 20 + +#. On the host where the daemon is running, connect to the daemon via a socket + in ``/var/run/ceph`` by running a command of the following form: + + .. prompt:: bash $ + + ceph daemon <name> config set <option> <value> + + For example: + + .. prompt:: bash $ + + ceph daemon osd.4 config set debug_osd 20 + +.. note:: In the output of the ``ceph config show`` command, these temporary + values are shown to have a source of ``override``. + + +Viewing runtime settings +======================== + +You can see the current settings specified for a running daemon with the ``ceph +config show`` command. For example, to see the (non-default) settings for the +daemon ``osd.0``, run the following command: + +.. prompt:: bash $ + + ceph config show osd.0 + +To see a specific setting, run the following command: + +.. prompt:: bash $ + + ceph config show osd.0 debug_osd + +To see all settings (including those with default values), run the following +command: + +.. prompt:: bash $ + + ceph config show-with-defaults osd.0 + +You can see all settings for a daemon that is currently running by connecting +to it on the local host via the admin socket. For example, to dump all +current settings, run the following command: + +.. prompt:: bash $ + + ceph daemon osd.0 config show + +To see non-default settings and to see where each value came from (for example, +a config file, the monitor, or an override), run the following command: + +.. prompt:: bash $ + + ceph daemon osd.0 config diff + +To see the value of a single setting, run the following command: + +.. prompt:: bash $ + + ceph daemon osd.0 config get debug_osd + + +Changes introduced in Octopus +============================= + +The Octopus release changed the way the configuration file is parsed. +These changes are as follows: + +- Repeated configuration options are allowed, and no warnings will be + displayed. This means that the setting that comes last in the file is the one + that takes effect. Prior to this change, Ceph displayed warning messages when + lines containing duplicate options were encountered, such as:: + + warning line 42: 'foo' in section 'bar' redefined +- Prior to Octopus, options containing invalid UTF-8 characters were ignored + with warning messages. But in Octopus, they are treated as fatal errors. +- The backslash character ``\`` is used as the line-continuation marker that + combines the next line with the current one. Prior to Octopus, there was a + requirement that any end-of-line backslash be followed by a non-empty line. + But in Octopus, an empty line following a backslash is allowed. +- In the configuration file, each line specifies an individual configuration + option. 
The option's name and its value are separated with ``=``, and the + value may be enclosed within single or double quotes. If an invalid + configuration is specified, we will treat it as an invalid configuration + file:: + + bad option ==== bad value +- Prior to Octopus, if no section name was specified in the configuration file, + all options would be set as though they were within the :confsec:`global` + section. This approach is discouraged. Since Octopus, any configuration + file that has no section name must contain only a single option. + +.. |---| unicode:: U+2014 .. EM DASH :trim: diff --git a/doc/rados/configuration/common.rst b/doc/rados/configuration/common.rst new file mode 100644 index 000000000..0b373f469 --- /dev/null +++ b/doc/rados/configuration/common.rst @@ -0,0 +1,207 @@ +.. _ceph-conf-common-settings: + +Common Settings +=============== + +The `Hardware Recommendations`_ section provides some hardware guidelines for +configuring a Ceph Storage Cluster. It is possible for a single :term:`Ceph +Node` to run multiple daemons. For example, a single node with multiple drives +ususally runs one ``ceph-osd`` for each drive. Ideally, each node will be +assigned to a particular type of process. For example, some nodes might run +``ceph-osd`` daemons, other nodes might run ``ceph-mds`` daemons, and still +other nodes might run ``ceph-mon`` daemons. + +Each node has a name. The name of a node can be found in its ``host`` setting. +Monitors also specify a network address and port (that is, a domain name or IP +address) that can be found in the ``addr`` setting. A basic configuration file +typically specifies only minimal settings for each instance of monitor daemons. +For example: + + + + +.. code-block:: ini + + [global] + mon_initial_members = ceph1 + mon_host = 10.0.0.1 + +.. important:: The ``host`` setting's value is the short name of the node. It + is not an FQDN. It is **NOT** an IP address. To retrieve the name of the + node, enter ``hostname -s`` on the command line. Unless you are deploying + Ceph manually, do not use ``host`` settings for anything other than initial + monitor setup. **DO NOT** specify the ``host`` setting under individual + daemons when using deployment tools like ``chef`` or ``cephadm``. Such tools + are designed to enter the appropriate values for you in the cluster map. + + +.. _ceph-network-config: + +Networks +======== + +For more about configuring a network for use with Ceph, see the `Network +Configuration Reference`_ . + + +Monitors +======== + +Ceph production clusters typically provision at least three :term:`Ceph +Monitor` daemons to ensure availability in the event of a monitor instance +crash. A minimum of three :term:`Ceph Monitor` daemons ensures that the Paxos +algorithm is able to determine which version of the :term:`Ceph Cluster Map` is +the most recent. It makes this determination by consulting a majority of Ceph +Monitors in the quorum. + +.. note:: You may deploy Ceph with a single monitor, but if the instance fails, + the lack of other monitors might interrupt data-service availability. + +Ceph Monitors normally listen on port ``3300`` for the new v2 protocol, and on +port ``6789`` for the old v1 protocol. + +By default, Ceph expects to store monitor data on the following path:: + + /var/lib/ceph/mon/$cluster-$id + +You or a deployment tool (for example, ``cephadm``) must create the +corresponding directory. 
With metavariables fully expressed and a cluster named +"ceph", the path specified in the above example evaluates to:: + + /var/lib/ceph/mon/ceph-a + +For additional details, see the `Monitor Config Reference`_. + +.. _Monitor Config Reference: ../mon-config-ref + + +.. _ceph-osd-config: + +Authentication +============== + +.. versionadded:: Bobtail 0.56 + +Authentication is explicitly enabled or disabled in the ``[global]`` section of +the Ceph configuration file, as shown here: + +.. code-block:: ini + + auth_cluster_required = cephx + auth_service_required = cephx + auth_client_required = cephx + +In addition, you should enable message signing. For details, see `Cephx Config +Reference`_. + +.. _Cephx Config Reference: ../auth-config-ref + + +.. _ceph-monitor-config: + + +OSDs +==== + +By default, Ceph expects to store a Ceph OSD Daemon's data on the following +path:: + + /var/lib/ceph/osd/$cluster-$id + +You or a deployment tool (for example, ``cephadm``) must create the +corresponding directory. With metavariables fully expressed and a cluster named +"ceph", the path specified in the above example evaluates to:: + + /var/lib/ceph/osd/ceph-0 + +You can override this path using the ``osd_data`` setting. We recommend that +you do not change the default location. To create the default directory on your +OSD host, run the following commands: + +.. prompt:: bash $ + + ssh {osd-host} + sudo mkdir /var/lib/ceph/osd/ceph-{osd-number} + +The ``osd_data`` path ought to lead to a mount point that has mounted on it a +device that is distinct from the device that contains the operating system and +the daemons. To use a device distinct from the device that contains the +operating system and the daemons, prepare it for use with Ceph and mount it on +the directory you just created by running the following commands: + +.. prompt:: bash $ + + ssh {new-osd-host} + sudo mkfs -t {fstype} /dev/{disk} + sudo mount -o user_xattr /dev/{disk} /var/lib/ceph/osd/ceph-{osd-number} + +We recommend using the ``xfs`` file system when running :command:`mkfs`. (The +``btrfs`` and ``ext4`` file systems are not recommended and are no longer +tested.) + +For additional configuration details, see `OSD Config Reference`_. + + +Heartbeats +========== + +During runtime operations, Ceph OSD Daemons check up on other Ceph OSD Daemons +and report their findings to the Ceph Monitor. This process does not require +you to provide any settings. However, if you have network latency issues, you +might want to modify the default settings. + +For additional details, see `Configuring Monitor/OSD Interaction`_. + + +.. _ceph-logging-and-debugging: + +Logs / Debugging +================ + +You might sometimes encounter issues with Ceph that require you to use Ceph's +logging and debugging features. For details on log rotation, see `Debugging and +Logging`_. + +.. _Debugging and Logging: ../../troubleshooting/log-and-debug + + +Example ceph.conf +================= + +.. literalinclude:: demo-ceph.conf + :language: ini + +.. _ceph-runtime-config: + + + +Naming Clusters (deprecated) +============================ + +Each Ceph cluster has an internal name. This internal name is used as part of +configuration, and as part of "log file" names as well as part of directory +names and as part of mountpoint names. This name defaults to "ceph". Previous +releases of Ceph allowed one to specify a custom name instead, for example +"ceph2". 
This option was intended to facilitate the running of multiple logical +clusters on the same physical hardware, but in practice it was rarely +exploited. Custom cluster names should no longer be attempted. Old +documentation might lead readers to wrongly think that unique cluster names are +required to use ``rbd-mirror``. They are not required. + +Custom cluster names are now considered deprecated and the ability to deploy +them has already been removed from some tools, although existing custom-name +deployments continue to operate. The ability to run and manage clusters with +custom names might be progressively removed by future Ceph releases, so **it is +strongly recommended to deploy all new clusters with the default name "ceph"**. + +Some Ceph CLI commands accept a ``--cluster`` (cluster name) option. This +option is present only for the sake of backward compatibility. New tools and +deployments cannot be relied upon to accommodate this option. + +If you need to allow multiple clusters to exist on the same host, use +:ref:`cephadm`, which uses containers to fully isolate each cluster. + +.. _Hardware Recommendations: ../../../start/hardware-recommendations +.. _Network Configuration Reference: ../network-config-ref +.. _OSD Config Reference: ../osd-config-ref +.. _Configuring Monitor/OSD Interaction: ../mon-osd-interaction diff --git a/doc/rados/configuration/demo-ceph.conf b/doc/rados/configuration/demo-ceph.conf new file mode 100644 index 000000000..8ba285a42 --- /dev/null +++ b/doc/rados/configuration/demo-ceph.conf @@ -0,0 +1,31 @@ +[global] +fsid = {cluster-id} +mon_initial_members = {hostname}[, {hostname}] +mon_host = {ip-address}[, {ip-address}] + +#All clusters have a front-side public network. +#If you have two network interfaces, you can configure a private / cluster +#network for RADOS object replication, heartbeats, backfill, +#recovery, etc. +public_network = {network}[, {network}] +#cluster_network = {network}[, {network}] + +#Clusters require authentication by default. +auth_cluster_required = cephx +auth_service_required = cephx +auth_client_required = cephx + +#Choose reasonable number of replicas and placement groups. +osd_journal_size = {n} +osd_pool_default_size = {n} # Write an object n times. +osd_pool_default_min_size = {n} # Allow writing n copies in a degraded state. +osd_pool_default_pg_autoscale_mode = {mode} # on, off, or warn +# Only used if autoscaling is off or warn: +osd_pool_default_pg_num = {n} + +#Choose a reasonable crush leaf type. +#0 for a 1-node cluster. +#1 for a multi node cluster in a single rack +#2 for a multi node, multi chassis cluster with multiple hosts in a chassis +#3 for a multi node cluster with hosts across racks, etc. +osd_crush_chooseleaf_type = {n} diff --git a/doc/rados/configuration/filestore-config-ref.rst b/doc/rados/configuration/filestore-config-ref.rst new file mode 100644 index 000000000..7aefe26b3 --- /dev/null +++ b/doc/rados/configuration/filestore-config-ref.rst @@ -0,0 +1,377 @@ +============================ + Filestore Config Reference +============================ + +.. note:: Since the Luminous release of Ceph, Filestore has not been Ceph's + default storage back end. Since the Luminous release of Ceph, BlueStore has + been Ceph's default storage back end. However, Filestore OSDs are still + supported up to Quincy. Filestore OSDs are not supported in Reef. See + :ref:`OSD Back Ends <rados_config_storage_devices_osd_backends>`. 
See + :ref:`BlueStore Migration <rados_operations_bluestore_migration>` for + instructions explaining how to replace an existing Filestore back end with a + BlueStore back end. + + +``filestore_debug_omap_check`` + +:Description: Debugging check on synchronization. Expensive. For debugging only. +:Type: Boolean +:Required: No +:Default: ``false`` + + +.. index:: filestore; extended attributes + +Extended Attributes +=================== + +Extended Attributes (XATTRs) are important for Filestore OSDs. However, Certain +disadvantages can occur when the underlying file system is used for the storage +of XATTRs: some file systems have limits on the number of bytes that can be +stored in XATTRs, and your file system might in some cases therefore run slower +than would an alternative method of storing XATTRs. For this reason, a method +of storing XATTRs extrinsic to the underlying file system might improve +performance. To implement such an extrinsic method, refer to the following +settings. + +If the underlying file system has no size limit, then Ceph XATTRs are stored as +``inline xattr``, using the XATTRs provided by the file system. But if there is +a size limit (for example, ext4 imposes a limit of 4 KB total), then some Ceph +XATTRs will be stored in a key/value database when the limit is reached. More +precisely, this begins to occur when either the +``filestore_max_inline_xattr_size`` or ``filestore_max_inline_xattrs`` +threshold is reached. + + +``filestore_max_inline_xattr_size`` + +:Description: Defines the maximum size per object of an XATTR that can be + stored in the file system (for example, XFS, Btrfs, ext4). The + specified size should not be larger than the file system can + handle. Using the default value of 0 instructs Filestore to use + the value specific to the file system. +:Type: Unsigned 32-bit Integer +:Required: No +:Default: ``0`` + + +``filestore_max_inline_xattr_size_xfs`` + +:Description: Defines the maximum size of an XATTR that can be stored in the + XFS file system. This setting is used only if + ``filestore_max_inline_xattr_size`` == 0. +:Type: Unsigned 32-bit Integer +:Required: No +:Default: ``65536`` + + +``filestore_max_inline_xattr_size_btrfs`` + +:Description: Defines the maximum size of an XATTR that can be stored in the + Btrfs file system. This setting is used only if + ``filestore_max_inline_xattr_size`` == 0. +:Type: Unsigned 32-bit Integer +:Required: No +:Default: ``2048`` + + +``filestore_max_inline_xattr_size_other`` + +:Description: Defines the maximum size of an XATTR that can be stored in other file systems. + This setting is used only if ``filestore_max_inline_xattr_size`` == 0. +:Type: Unsigned 32-bit Integer +:Required: No +:Default: ``512`` + + +``filestore_max_inline_xattrs`` + +:Description: Defines the maximum number of XATTRs per object that can be stored in the file system. + Using the default value of 0 instructs Filestore to use the value specific to the file system. +:Type: 32-bit Integer +:Required: No +:Default: ``0`` + + +``filestore_max_inline_xattrs_xfs`` + +:Description: Defines the maximum number of XATTRs per object that can be stored in the XFS file system. + This setting is used only if ``filestore_max_inline_xattrs`` == 0. +:Type: 32-bit Integer +:Required: No +:Default: ``10`` + + +``filestore_max_inline_xattrs_btrfs`` + +:Description: Defines the maximum number of XATTRs per object that can be stored in the Btrfs file system. + This setting is used only if ``filestore_max_inline_xattrs`` == 0. 
+:Type: 32-bit Integer +:Required: No +:Default: ``10`` + + +``filestore_max_inline_xattrs_other`` + +:Description: Defines the maximum number of XATTRs per object that can be stored in other file systems. + This setting is used only if ``filestore_max_inline_xattrs`` == 0. +:Type: 32-bit Integer +:Required: No +:Default: ``2`` + +.. index:: filestore; synchronization + +Synchronization Intervals +========================= + +Filestore must periodically quiesce writes and synchronize the file system. +Each synchronization creates a consistent commit point. When the commit point +is created, Filestore is able to free all journal entries up to that point. +More-frequent synchronization tends to reduce both synchronization time and +the amount of data that needs to remain in the journal. Less-frequent +synchronization allows the backing file system to coalesce small writes and +metadata updates, potentially increasing synchronization +efficiency but also potentially increasing tail latency. + + +``filestore_max_sync_interval`` + +:Description: Defines the maximum interval (in seconds) for synchronizing Filestore. +:Type: Double +:Required: No +:Default: ``5`` + + +``filestore_min_sync_interval`` + +:Description: Defines the minimum interval (in seconds) for synchronizing Filestore. +:Type: Double +:Required: No +:Default: ``.01`` + + +.. index:: filestore; flusher + +Flusher +======= + +The Filestore flusher forces data from large writes to be written out using +``sync_file_range`` prior to the synchronization. +Ideally, this action reduces the cost of the eventual synchronization. In practice, however, disabling +'filestore_flusher' seems in some cases to improve performance. + + +``filestore_flusher`` + +:Description: Enables the Filestore flusher. +:Type: Boolean +:Required: No +:Default: ``false`` + +.. deprecated:: v.65 + +``filestore_flusher_max_fds`` + +:Description: Defines the maximum number of file descriptors for the flusher. +:Type: Integer +:Required: No +:Default: ``512`` + +.. deprecated:: v.65 + +``filestore_sync_flush`` + +:Description: Enables the synchronization flusher. +:Type: Boolean +:Required: No +:Default: ``false`` + +.. deprecated:: v.65 + +``filestore_fsync_flushes_journal_data`` + +:Description: Flushes journal data during file-system synchronization. +:Type: Boolean +:Required: No +:Default: ``false`` + + +.. index:: filestore; queue + +Queue +===== + +The following settings define limits on the size of the Filestore queue: + +``filestore_queue_max_ops`` + +:Description: Defines the maximum number of in-progress operations that Filestore accepts before it blocks the queueing of any new operations. +:Type: Integer +:Required: No. Minimal impact on performance. +:Default: ``50`` + + +``filestore_queue_max_bytes`` + +:Description: Defines the maximum number of bytes permitted per operation. +:Type: Integer +:Required: No +:Default: ``100 << 20`` + + +.. index:: filestore; timeouts + +Timeouts +======== + +``filestore_op_threads`` + +:Description: Defines the number of file-system operation threads that execute in parallel. +:Type: Integer +:Required: No +:Default: ``2`` + + +``filestore_op_thread_timeout`` + +:Description: Defines the timeout (in seconds) for a file-system operation thread. +:Type: Integer +:Required: No +:Default: ``60`` + + +``filestore_op_thread_suicide_timeout`` + +:Description: Defines the timeout (in seconds) for a commit operation before the commit is cancelled. +:Type: Integer +:Required: No +:Default: ``180`` + + +.. 
index:: filestore; btrfs + +B-Tree Filesystem +================= + + +``filestore_btrfs_snap`` + +:Description: Enables snapshots for a ``btrfs`` Filestore. +:Type: Boolean +:Required: No. Used only for ``btrfs``. +:Default: ``true`` + + +``filestore_btrfs_clone_range`` + +:Description: Enables cloning ranges for a ``btrfs`` Filestore. +:Type: Boolean +:Required: No. Used only for ``btrfs``. +:Default: ``true`` + + +.. index:: filestore; journal + +Journal +======= + + +``filestore_journal_parallel`` + +:Description: Enables parallel journaling, default for ``btrfs``. +:Type: Boolean +:Required: No +:Default: ``false`` + + +``filestore_journal_writeahead`` + +:Description: Enables write-ahead journaling, default for XFS. +:Type: Boolean +:Required: No +:Default: ``false`` + + +``filestore_journal_trailing`` + +:Description: Deprecated. **Never use.** +:Type: Boolean +:Required: No +:Default: ``false`` + + +Misc +==== + + +``filestore_merge_threshold`` + +:Description: Defines the minimum number of files permitted in a subdirectory before the subdirectory is merged into its parent directory. + NOTE: A negative value means that subdirectory merging is disabled. +:Type: Integer +:Required: No +:Default: ``-10`` + + +``filestore_split_multiple`` + +:Description: ``(filestore_split_multiple * abs(filestore_merge_threshold) + (rand() % filestore_split_rand_factor)) * 16`` + is the maximum number of files permitted in a subdirectory + before the subdirectory is split into child directories. + +:Type: Integer +:Required: No +:Default: ``2`` + + +``filestore_split_rand_factor`` + +:Description: A random factor added to the split threshold to avoid + too many (expensive) Filestore splits occurring at the same time. + For details, see ``filestore_split_multiple``. + To change this setting for an existing OSD, it is necessary to take the OSD + offline before running the ``ceph-objectstore-tool apply-layout-settings`` command. + +:Type: Unsigned 32-bit Integer +:Required: No +:Default: ``20`` + + +``filestore_update_to`` + +:Description: Limits automatic upgrades to a specified version of Filestore. Useful in cases in which you want to avoid upgrading to a specific version. +:Type: Integer +:Required: No +:Default: ``1000`` + + +``filestore_blackhole`` + +:Description: Drops any new transactions on the floor, similar to redirecting to NULL. +:Type: Boolean +:Required: No +:Default: ``false`` + + +``filestore_dump_file`` + +:Description: Defines the file that transaction dumps are stored on. +:Type: Boolean +:Required: No +:Default: ``false`` + + +``filestore_kill_at`` + +:Description: Injects a failure at the *n*\th opportunity. +:Type: String +:Required: No +:Default: ``false`` + + +``filestore_fail_eio`` + +:Description: Fail/Crash on EIO. +:Type: Boolean +:Required: No +:Default: ``true`` diff --git a/doc/rados/configuration/general-config-ref.rst b/doc/rados/configuration/general-config-ref.rst new file mode 100644 index 000000000..f4613456a --- /dev/null +++ b/doc/rados/configuration/general-config-ref.rst @@ -0,0 +1,19 @@ +========================== + General Config Reference +========================== + +.. confval:: admin_socket + :default: /var/run/ceph/$cluster-$name.asok +.. confval:: pid_file +.. confval:: chdir +.. confval:: fatal_signal_handlers +.. describe:: max_open_files + + If set, when the :term:`Ceph Storage Cluster` starts, Ceph sets + the max open FDs at the OS level (i.e., the max # of file + descriptors). 
A suitably large value prevents Ceph Daemons from running out + of file descriptors. + + :Type: 64-bit Integer + :Required: No + :Default: ``0`` diff --git a/doc/rados/configuration/index.rst b/doc/rados/configuration/index.rst new file mode 100644 index 000000000..715b999d1 --- /dev/null +++ b/doc/rados/configuration/index.rst @@ -0,0 +1,53 @@ +=============== + Configuration +=============== + +Each Ceph process, daemon, or utility draws its configuration from several +sources on startup. Such sources can include (1) a local configuration, (2) the +monitors, (3) the command line, and (4) environment variables. + +Configuration options can be set globally so that they apply (1) to all +daemons, (2) to all daemons or services of a particular type, or (3) to only a +specific daemon, process, or client. + +.. raw:: html + + <table cellpadding="10"><colgroup><col width="50%"><col width="50%"></colgroup><tbody valign="top"><tr><td><h3>Configuring the Object Store</h3> + +For general object store configuration, refer to the following: + +.. toctree:: + :maxdepth: 1 + + Storage devices <storage-devices> + ceph-conf + + +.. raw:: html + + </td><td><h3>Reference</h3> + +To optimize the performance of your cluster, refer to the following: + +.. toctree:: + :maxdepth: 1 + + Common Settings <common> + Network Settings <network-config-ref> + Messenger v2 protocol <msgr2> + Auth Settings <auth-config-ref> + Monitor Settings <mon-config-ref> + mon-lookup-dns + Heartbeat Settings <mon-osd-interaction> + OSD Settings <osd-config-ref> + DmClock Settings <mclock-config-ref> + BlueStore Settings <bluestore-config-ref> + FileStore Settings <filestore-config-ref> + Journal Settings <journal-ref> + Pool, PG & CRUSH Settings <pool-pg-config-ref.rst> + General Settings <general-config-ref> + + +.. raw:: html + + </td></tr></tbody></table> diff --git a/doc/rados/configuration/journal-ref.rst b/doc/rados/configuration/journal-ref.rst new file mode 100644 index 000000000..5ce5a5e2d --- /dev/null +++ b/doc/rados/configuration/journal-ref.rst @@ -0,0 +1,39 @@ +========================== + Journal Config Reference +========================== +.. warning:: Filestore has been deprecated in the Reef release and is no longer supported. +.. index:: journal; journal configuration + +Filestore OSDs use a journal for two reasons: speed and consistency. Note +that since Luminous, the BlueStore OSD back end has been preferred and default. +This information is provided for pre-existing OSDs and for rare situations where +Filestore is preferred for new deployments. + +- **Speed:** The journal enables the Ceph OSD Daemon to commit small writes + quickly. Ceph writes small, random i/o to the journal sequentially, which + tends to speed up bursty workloads by allowing the backing file system more + time to coalesce writes. The Ceph OSD Daemon's journal, however, can lead + to spiky performance with short spurts of high-speed writes followed by + periods without any write progress as the file system catches up to the + journal. + +- **Consistency:** Ceph OSD Daemons require a file system interface that + guarantees atomic compound operations. Ceph OSD Daemons write a description + of the operation to the journal and apply the operation to the file system. + This enables atomic updates to an object (for example, placement group + metadata). 
Every few seconds--between ``filestore max sync interval`` and + ``filestore min sync interval``--the Ceph OSD Daemon stops writes and + synchronizes the journal with the file system, allowing Ceph OSD Daemons to + trim operations from the journal and reuse the space. On failure, Ceph + OSD Daemons replay the journal starting after the last synchronization + operation. + +Ceph OSD Daemons recognize the following journal settings: + +.. confval:: journal_dio +.. confval:: journal_aio +.. confval:: journal_block_align +.. confval:: journal_max_write_bytes +.. confval:: journal_max_write_entries +.. confval:: journal_align_min_size +.. confval:: journal_zero_on_create diff --git a/doc/rados/configuration/mclock-config-ref.rst b/doc/rados/configuration/mclock-config-ref.rst new file mode 100644 index 000000000..a338aa6da --- /dev/null +++ b/doc/rados/configuration/mclock-config-ref.rst @@ -0,0 +1,699 @@ +======================== + mClock Config Reference +======================== + +.. index:: mclock; configuration + +QoS support in Ceph is implemented using a queuing scheduler based on `the +dmClock algorithm`_. See :ref:`dmclock-qos` section for more details. + +To make the usage of mclock more user-friendly and intuitive, mclock config +profiles are introduced. The mclock profiles mask the low level details from +users, making it easier to configure and use mclock. + +The following input parameters are required for a mclock profile to configure +the QoS related parameters: + +* total capacity (IOPS) of each OSD (determined automatically - + See `OSD Capacity Determination (Automated)`_) + +* the max sequential bandwidth capacity (MiB/s) of each OSD - + See *osd_mclock_max_sequential_bandwidth_[hdd|ssd]* option + +* an mclock profile type to enable + +Using the settings in the specified profile, an OSD determines and applies the +lower-level mclock and Ceph parameters. The parameters applied by the mclock +profile make it possible to tune the QoS between client I/O and background +operations in the OSD. + + +.. index:: mclock; mclock clients + +mClock Client Types +=================== + +The mclock scheduler handles requests from different types of Ceph services. +Each service can be considered as a type of client from mclock's perspective. +Depending on the type of requests handled, mclock clients are classified into +the buckets as shown in the table below, + ++------------------------+--------------------------------------------------------------+ +| Client Type | Request Types | ++========================+==============================================================+ +| Client | I/O requests issued by external clients of Ceph | ++------------------------+--------------------------------------------------------------+ +| Background recovery | Internal recovery requests | ++------------------------+--------------------------------------------------------------+ +| Background best-effort | Internal backfill, scrub, snap trim and PG deletion requests | ++------------------------+--------------------------------------------------------------+ + +The mclock profiles allocate parameters like reservation, weight and limit +(see :ref:`dmclock-qos`) differently for each client type. The next sections +describe the mclock profiles in greater detail. + + +.. 
index:: mclock; profile definition +
+mClock Profiles - Definition and Purpose +======================================== +
+A mclock profile is *“a configuration setting that when applied on a running +Ceph cluster enables the throttling of the operations (IOPS) belonging to +different client classes (background recovery, scrub, snaptrim, client op, +osd subop)”*. +
+The mclock profile uses the capacity limits and the mclock profile type selected +by the user to determine the low-level mclock resource control configuration +parameters and apply them transparently. Additionally, other Ceph configuration +parameters are also applied. Please see sections below for more information. +
+The low-level mclock resource control parameters are the *reservation*, +*limit*, and *weight* that provide control of the resource shares, as +described in the :ref:`dmclock-qos` section. + +
+.. index:: mclock; profile types +
+mClock Profile Types +==================== +
+mclock profiles can be broadly classified into *built-in* and *custom* profiles. +
+Built-in Profiles +----------------- +Users can choose between the following built-in profile types: +
+.. note:: The values mentioned in the tables below represent the proportion + of the total IOPS capacity of the OSD allocated for the service type. +
+* balanced (default) +* high_client_ops +* high_recovery_ops +
+balanced (*default*) +^^^^^^^^^^^^^^^^^^^^ +The *balanced* profile is the default mClock profile. This profile allocates +equal reservation/priority to client operations and background recovery +operations. Background best-effort ops are given lower reservation and therefore +take a longer time to complete when there are competing operations. This profile +helps meet the normal/steady-state requirements of the cluster. This is the +case when the external client performance requirement is not critical and there are +other background operations that still need attention within the OSD. +
+But there might be instances that necessitate giving higher allocations to either +client ops or recovery ops. In order to deal with such a situation, the alternate +built-in profiles may be enabled by following the steps mentioned in the next sections. +
++------------------------+-------------+--------+-------+ +| Service Type | Reservation | Weight | Limit | ++========================+=============+========+=======+ +| client | 50% | 1 | MAX | ++------------------------+-------------+--------+-------+ +| background recovery | 50% | 1 | MAX | ++------------------------+-------------+--------+-------+ +| background best-effort | MIN | 1 | 90% | ++------------------------+-------------+--------+-------+ +
+high_client_ops +^^^^^^^^^^^^^^^ +This profile optimizes client performance over background activities by +allocating more reservation and limit to client operations as compared to +background operations in the OSD. This profile, for example, may be enabled +to provide the needed performance for I/O intensive applications for a +sustained period of time at the cost of slower recoveries. 
The table shows +the resource control parameters set by the profile: + ++------------------------+-------------+--------+-------+ +| Service Type | Reservation | Weight | Limit | ++========================+=============+========+=======+ +| client | 60% | 2 | MAX | ++------------------------+-------------+--------+-------+ +| background recovery | 40% | 1 | MAX | ++------------------------+-------------+--------+-------+ +| background best-effort | MIN | 1 | 70% | ++------------------------+-------------+--------+-------+ + +high_recovery_ops +^^^^^^^^^^^^^^^^^ +This profile optimizes background recovery performance as compared to external +clients and other background operations within the OSD. This profile, for +example, may be enabled by an administrator temporarily to speed-up background +recoveries during non-peak hours. The table shows the resource control +parameters set by the profile: + ++------------------------+-------------+--------+-------+ +| Service Type | Reservation | Weight | Limit | ++========================+=============+========+=======+ +| client | 30% | 1 | MAX | ++------------------------+-------------+--------+-------+ +| background recovery | 70% | 2 | MAX | ++------------------------+-------------+--------+-------+ +| background best-effort | MIN | 1 | MAX | ++------------------------+-------------+--------+-------+ + +.. note:: Across the built-in profiles, internal background best-effort clients + of mclock include "backfill", "scrub", "snap trim", and "pg deletion" + operations. + + +Custom Profile +-------------- +This profile gives users complete control over all the mclock configuration +parameters. This profile should be used with caution and is meant for advanced +users, who understand mclock and Ceph related configuration options. + + +.. index:: mclock; built-in profiles + +mClock Built-in Profiles - Locked Config Options +================================================= +The below sections describe the config options that are locked to certain values +in order to ensure mClock scheduler is able to provide predictable QoS. + +mClock Config Options +--------------------- +.. important:: These defaults cannot be changed using any of the config + subsytem commands like *config set* or via the *config daemon* or *config + tell* interfaces. Although the above command(s) report success, the mclock + QoS parameters are reverted to their respective built-in profile defaults. + +When a built-in profile is enabled, the mClock scheduler calculates the low +level mclock parameters [*reservation*, *weight*, *limit*] based on the profile +enabled for each client type. The mclock parameters are calculated based on +the max OSD capacity provided beforehand. As a result, the following mclock +config parameters cannot be modified when using any of the built-in profiles: + +- :confval:`osd_mclock_scheduler_client_res` +- :confval:`osd_mclock_scheduler_client_wgt` +- :confval:`osd_mclock_scheduler_client_lim` +- :confval:`osd_mclock_scheduler_background_recovery_res` +- :confval:`osd_mclock_scheduler_background_recovery_wgt` +- :confval:`osd_mclock_scheduler_background_recovery_lim` +- :confval:`osd_mclock_scheduler_background_best_effort_res` +- :confval:`osd_mclock_scheduler_background_best_effort_wgt` +- :confval:`osd_mclock_scheduler_background_best_effort_lim` + +Recovery/Backfill Options +------------------------- +.. warning:: The recommendation is to not change these options as the built-in + profiles are optimized based on them. 
Changing these defaults can result in + unexpected performance outcomes. +
+The following recovery and backfill related Ceph options are overridden to +mClock defaults: +
+- :confval:`osd_max_backfills` +- :confval:`osd_recovery_max_active` +- :confval:`osd_recovery_max_active_hdd` +- :confval:`osd_recovery_max_active_ssd` +
+The following table shows the mClock defaults, which are the same as the current +defaults. This is done to maximize the performance of the foreground (client) +operations: +
++----------------------------------------+------------------+----------------+ +| Config Option | Original Default | mClock Default | ++========================================+==================+================+ +| :confval:`osd_max_backfills` | 1 | 1 | ++----------------------------------------+------------------+----------------+ +| :confval:`osd_recovery_max_active` | 0 | 0 | ++----------------------------------------+------------------+----------------+ +| :confval:`osd_recovery_max_active_hdd` | 3 | 3 | ++----------------------------------------+------------------+----------------+ +| :confval:`osd_recovery_max_active_ssd` | 10 | 10 | ++----------------------------------------+------------------+----------------+ +
+The above mClock defaults can be modified only if necessary by enabling +:confval:`osd_mclock_override_recovery_settings` (default: false). The +steps for this are discussed in the +`Steps to Modify mClock Max Backfills/Recovery Limits`_ section. +
+Sleep Options +------------- +If any mClock profile (including "custom") is active, the following Ceph config +sleep options are disabled (set to 0): +
+- :confval:`osd_recovery_sleep` +- :confval:`osd_recovery_sleep_hdd` +- :confval:`osd_recovery_sleep_ssd` +- :confval:`osd_recovery_sleep_hybrid` +- :confval:`osd_scrub_sleep` +- :confval:`osd_delete_sleep` +- :confval:`osd_delete_sleep_hdd` +- :confval:`osd_delete_sleep_ssd` +- :confval:`osd_delete_sleep_hybrid` +- :confval:`osd_snap_trim_sleep` +- :confval:`osd_snap_trim_sleep_hdd` +- :confval:`osd_snap_trim_sleep_ssd` +- :confval:`osd_snap_trim_sleep_hybrid` +
+The above sleep options are disabled to ensure that the mclock scheduler is able to +determine when to pick the next op from its operation queue and transfer it to +the operation sequencer. This results in the desired QoS being provided across +all its clients. + +
+.. index:: mclock; enable built-in profile +
+Steps to Enable mClock Profile +============================== +
+As already mentioned, the default mclock profile is set to *balanced*. +The other values for the built-in profiles include *high_client_ops* and +*high_recovery_ops*. +
+If there is a requirement to change the default profile, then the option +:confval:`osd_mclock_profile` may be set during runtime by using the following +command: +
+ .. prompt:: bash # +
+ ceph config set osd.N osd_mclock_profile <value> +
+For example, to change the profile to allow faster recoveries on "osd.0", the +following command can be used to switch to the *high_recovery_ops* profile: +
+ .. prompt:: bash # +
+ ceph config set osd.0 osd_mclock_profile high_recovery_ops +
+.. note:: The *custom* profile is not recommended unless you are an advanced + user. +
+And that's it! You are ready to run workloads on the cluster and check if the +QoS requirements are being met. + +
+Switching Between Built-in and Custom Profiles +============================================== +
+There may be situations requiring switching from a built-in profile to the +*custom* profile and vice-versa. 
The following sections outline the steps to +accomplish this. + +Steps to Switch From a Built-in to the Custom Profile +----------------------------------------------------- + +The following command can be used to switch to the *custom* profile: + + .. prompt:: bash # + + ceph config set osd osd_mclock_profile custom + +For example, to change the profile to *custom* on all OSDs, the following +command can be used: + + .. prompt:: bash # + + ceph config set osd osd_mclock_profile custom + +After switching to the *custom* profile, the desired mClock configuration +option may be modified. For example, to change the client reservation IOPS +ratio for a specific OSD (say osd.0) to 0.5 (or 50%), the following command +can be used: + + .. prompt:: bash # + + ceph config set osd.0 osd_mclock_scheduler_client_res 0.5 + +.. important:: Care must be taken to change the reservations of other services + like recovery and background best effort accordingly to ensure that the sum + of the reservations do not exceed the maximum proportion (1.0) of the IOPS + capacity of the OSD. + +.. tip:: The reservation and limit parameter allocations are per-shard based on + the type of backing device (HDD/SSD) under the OSD. See + :confval:`osd_op_num_shards_hdd` and :confval:`osd_op_num_shards_ssd` for + more details. + +Steps to Switch From the Custom Profile to a Built-in Profile +------------------------------------------------------------- + +Switching from the *custom* profile to a built-in profile requires an +intermediate step of removing the custom settings from the central config +database for the changes to take effect. + +The following sequence of commands can be used to switch to a built-in profile: + +#. Set the desired built-in profile using: + + .. prompt:: bash # + + ceph config set osd <mClock Configuration Option> + + For example, to set the built-in profile to ``high_client_ops`` on all + OSDs, run the following command: + + .. prompt:: bash # + + ceph config set osd osd_mclock_profile high_client_ops +#. Determine the existing custom mClock configuration settings in the central + config database using the following command: + + .. prompt:: bash # + + ceph config dump +#. Remove the custom mClock configuration settings determined in the previous + step from the central config database: + + .. prompt:: bash # + + ceph config rm osd <mClock Configuration Option> + + For example, to remove the configuration option + :confval:`osd_mclock_scheduler_client_res` that was set on all OSDs, run the + following command: + + .. prompt:: bash # + + ceph config rm osd osd_mclock_scheduler_client_res +#. After all existing custom mClock configuration settings have been removed + from the central config database, the configuration settings pertaining to + ``high_client_ops`` will come into effect. For e.g., to verify the settings + on osd.0 use: + + .. prompt:: bash # + + ceph config show osd.0 + +Switch Temporarily Between mClock Profiles +------------------------------------------ + +To switch between mClock profiles on a temporary basis, the following commands +may be used to override the settings: + +.. warning:: This section is for advanced users or for experimental testing. The + recommendation is to not use the below commands on a running cluster as it + could have unexpected outcomes. + +.. note:: The configuration changes on an OSD using the below commands are + ephemeral and are lost when it restarts. 
It is also important to note that + the config options overridden using the below commands cannot be modified + further using the *ceph config set osd.N ...* command. The changes will not + take effect until a given OSD is restarted. This is intentional, as per the + config subsystem design. However, any further modification can still be made + ephemerally using the commands mentioned below. + +#. Run the *injectargs* command as shown to override the mclock settings: + + .. prompt:: bash # + + ceph tell osd.N injectargs '--<mClock Configuration Option>=<value>' + + For example, the following command overrides the + :confval:`osd_mclock_profile` option on osd.0: + + .. prompt:: bash # + + ceph tell osd.0 injectargs '--osd_mclock_profile=high_recovery_ops' + + +#. An alternate command that can be used is: + + .. prompt:: bash # + + ceph daemon osd.N config set <mClock Configuration Option> <value> + + For example, the following command overrides the + :confval:`osd_mclock_profile` option on osd.0: + + .. prompt:: bash # + + ceph daemon osd.0 config set osd_mclock_profile high_recovery_ops + +The individual QoS-related config options for the *custom* profile can also be +modified ephemerally using the above commands. + + +Steps to Modify mClock Max Backfills/Recovery Limits +==================================================== + +This section describes the steps to modify the default max backfills or recovery +limits if the need arises. + +.. warning:: This section is for advanced users or for experimental testing. The + recommendation is to retain the defaults as is on a running cluster as + modifying them could have unexpected performance outcomes. The values may + be modified only if the cluster is unable to cope/showing poor performance + with the default settings or for performing experiments on a test cluster. + +.. important:: The max backfill/recovery options that can be modified are listed + in section `Recovery/Backfill Options`_. The modification of the mClock + default backfills/recovery limit is gated by the + :confval:`osd_mclock_override_recovery_settings` option, which is set to + *false* by default. Attempting to modify any default recovery/backfill + limits without setting the gating option will reset that option back to the + mClock defaults along with a warning message logged in the cluster log. Note + that it may take a few seconds for the default value to come back into + effect. Verify the limit using the *config show* command as shown below. + +#. Set the :confval:`osd_mclock_override_recovery_settings` config option on all + osds to *true* using: + + .. prompt:: bash # + + ceph config set osd osd_mclock_override_recovery_settings true + +#. Set the desired max backfill/recovery option using: + + .. prompt:: bash # + + ceph config set osd osd_max_backfills <value> + + For example, the following command modifies the :confval:`osd_max_backfills` + option on all osds to 5. + + .. prompt:: bash # + + ceph config set osd osd_max_backfills 5 + +#. Wait for a few seconds and verify the running configuration for a specific + OSD using: + + .. prompt:: bash # + + ceph config show osd.N | grep osd_max_backfills + + For example, the following command shows the running configuration of + :confval:`osd_max_backfills` on osd.0. + + .. prompt:: bash # + + ceph config show osd.0 | grep osd_max_backfills + +#. Reset the :confval:`osd_mclock_override_recovery_settings` config option on + all osds to *false* using: + + .. 
prompt:: bash #
+
+      ceph config set osd osd_mclock_override_recovery_settings false
+
+
+OSD Capacity Determination (Automated)
+======================================
+
+The OSD capacity in terms of total IOPS is determined automatically during OSD
+initialization. This is achieved by running the OSD bench tool and overriding
+the default value of the ``osd_mclock_max_capacity_iops_[hdd, ssd]`` option,
+depending on the device type. No other action/input is expected from the user
+to set the OSD capacity.
+
+.. note:: If you wish to manually benchmark OSD(s) or manually tune the
+   Bluestore throttle parameters, see the section
+   `Steps to Manually Benchmark an OSD (Optional)`_.
+
+You may verify the capacity of an OSD after the cluster is brought up by using
+the following command:
+
+  .. prompt:: bash #
+
+     ceph config show osd.N osd_mclock_max_capacity_iops_[hdd, ssd]
+
+For example, the following command shows the max capacity for "osd.0" on a Ceph
+node whose underlying device type is SSD:
+
+  .. prompt:: bash #
+
+     ceph config show osd.0 osd_mclock_max_capacity_iops_ssd
+
+Mitigation of Unrealistic OSD Capacity From Automated Test
+----------------------------------------------------------
+In certain conditions, the OSD bench tool may report unrealistic or inflated
+results, depending on the drive configuration and other environment-related
+conditions. To mitigate the performance impact of such an unrealistic capacity,
+two threshold config options, chosen according to the OSD's device type, are
+defined and used:
+
+- :confval:`osd_mclock_iops_capacity_threshold_hdd` = 500
+- :confval:`osd_mclock_iops_capacity_threshold_ssd` = 80000
+
+The following automated step is performed:
+
+Fallback to using default OSD capacity (automated)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+If OSD bench reports a measurement that exceeds the above threshold values
+depending on the underlying device type, the fallback mechanism reverts to the
+default value of :confval:`osd_mclock_max_capacity_iops_hdd` or
+:confval:`osd_mclock_max_capacity_iops_ssd`. The threshold config options
+can be reconfigured based on the type of drive used. Additionally, a cluster
+warning is logged in case the measurement exceeds the threshold. For example, ::
+
+  2022-10-27T15:30:23.270+0000 7f9b5dbe95c0  0 log_channel(cluster) log [WRN]
+  : OSD bench result of 39546.479392 IOPS exceeded the threshold limit of
+  25000.000000 IOPS for osd.1. IOPS capacity is unchanged at 21500.000000
+  IOPS. The recommendation is to establish the osd's IOPS capacity using other
+  benchmark tools (e.g. Fio) and then override
+  osd_mclock_max_capacity_iops_[hdd|ssd].
+
+If the default capacity doesn't accurately represent the OSD's capacity, the
+following additional step is recommended to address this:
+
+Run custom drive benchmark if defaults are not accurate (manual)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+If the default OSD capacity is not accurate, the recommendation is to run a
+custom benchmark using your preferred tool (e.g. Fio) on the drive and then
+override the ``osd_mclock_max_capacity_iops_[hdd, ssd]`` option as described
+in the `Specifying Max OSD Capacity`_ section.
+
+This step is highly recommended until an alternative mechanism is available.
+
+Steps to Manually Benchmark an OSD (Optional)
+=============================================
+
+.. note:: These steps are only necessary if you want to override the OSD
+   capacity already determined automatically during OSD initialization.
+ Otherwise, you may skip this section entirely. + +.. tip:: If you have already determined the benchmark data and wish to manually + override the max osd capacity for an OSD, you may skip to section + `Specifying Max OSD Capacity`_. + + +Any existing benchmarking tool (e.g. Fio) can be used for this purpose. In this +case, the steps use the *Ceph OSD Bench* command described in the next section. +Regardless of the tool/command used, the steps outlined further below remain the +same. + +As already described in the :ref:`dmclock-qos` section, the number of +shards and the bluestore's throttle parameters have an impact on the mclock op +queues. Therefore, it is critical to set these values carefully in order to +maximize the impact of the mclock scheduler. + +:Number of Operational Shards: + We recommend using the default number of shards as defined by the + configuration options ``osd_op_num_shards``, ``osd_op_num_shards_hdd``, and + ``osd_op_num_shards_ssd``. In general, a lower number of shards will increase + the impact of the mclock queues. + +:Bluestore Throttle Parameters: + We recommend using the default values as defined by + :confval:`bluestore_throttle_bytes` and + :confval:`bluestore_throttle_deferred_bytes`. But these parameters may also be + determined during the benchmarking phase as described below. + +OSD Bench Command Syntax +------------------------ + +The :ref:`osd-subsystem` section describes the OSD bench command. The syntax +used for benchmarking is shown below : + +.. prompt:: bash # + + ceph tell osd.N bench [TOTAL_BYTES] [BYTES_PER_WRITE] [OBJ_SIZE] [NUM_OBJS] + +where, + +* ``TOTAL_BYTES``: Total number of bytes to write +* ``BYTES_PER_WRITE``: Block size per write +* ``OBJ_SIZE``: Bytes per object +* ``NUM_OBJS``: Number of objects to write + +Benchmarking Test Steps Using OSD Bench +--------------------------------------- + +The steps below use the default shards and detail the steps used to determine +the correct bluestore throttle values (optional). + +#. Bring up your Ceph cluster and login to the Ceph node hosting the OSDs that + you wish to benchmark. +#. Run a simple 4KiB random write workload on an OSD using the following + commands: + + .. note:: Note that before running the test, caches must be cleared to get an + accurate measurement. + + For example, if you are running the benchmark test on osd.0, run the following + commands: + + .. prompt:: bash # + + ceph tell osd.0 cache drop + + .. prompt:: bash # + + ceph tell osd.0 bench 12288000 4096 4194304 100 + +#. Note the overall throughput(IOPS) obtained from the output of the osd bench + command. This value is the baseline throughput(IOPS) when the default + bluestore throttle options are in effect. +#. If the intent is to determine the bluestore throttle values for your + environment, then set the two options, :confval:`bluestore_throttle_bytes` + and :confval:`bluestore_throttle_deferred_bytes` to 32 KiB(32768 Bytes) each + to begin with. Otherwise, you may skip to the next section. +#. Run the 4KiB random write test as before using OSD bench. +#. Note the overall throughput from the output and compare the value + against the baseline throughput recorded in step 3. +#. If the throughput doesn't match with the baseline, increment the bluestore + throttle options by 2x and repeat steps 5 through 7 until the obtained + throughput is very close to the baseline value. 
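+
+As an illustration, one pass through this loop (steps 5 through 7, after the
+throttles have been doubled once from 32 KiB to 64 KiB) might look like the
+following sketch for "osd.0". The throttle options are assumed here to be
+applied through the central config store; ``ceph tell osd.0 injectargs`` may
+be used instead:
+
+  .. prompt:: bash #
+
+     ceph config set osd.0 bluestore_throttle_bytes 65536
+     ceph config set osd.0 bluestore_throttle_deferred_bytes 65536
+     ceph tell osd.0 cache drop
+     ceph tell osd.0 bench 12288000 4096 4194304 100
+
+Compare the reported IOPS against the baseline recorded in step 3, and double
+both throttle values again if the throughput has not yet converged.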
+ +For example, during benchmarking on a machine with NVMe SSDs, a value of 256 KiB +for both bluestore throttle and deferred bytes was determined to maximize the +impact of mclock. For HDDs, the corresponding value was 40 MiB, where the +overall throughput was roughly equal to the baseline throughput. Note that in +general for HDDs, the bluestore throttle values are expected to be higher when +compared to SSDs. + + +Specifying Max OSD Capacity +---------------------------- + +The steps in this section may be performed only if you want to override the +max osd capacity automatically set during OSD initialization. The option +``osd_mclock_max_capacity_iops_[hdd, ssd]`` for an OSD can be set by running the +following command: + + .. prompt:: bash # + + ceph config set osd.N osd_mclock_max_capacity_iops_[hdd,ssd] <value> + +For example, the following command sets the max capacity for a specific OSD +(say "osd.0") whose underlying device type is HDD to 350 IOPS: + + .. prompt:: bash # + + ceph config set osd.0 osd_mclock_max_capacity_iops_hdd 350 + +Alternatively, you may specify the max capacity for OSDs within the Ceph +configuration file under the respective [osd.N] section. See +:ref:`ceph-conf-settings` for more details. + + +.. index:: mclock; config settings + +mClock Config Options +===================== + +.. confval:: osd_mclock_profile +.. confval:: osd_mclock_max_capacity_iops_hdd +.. confval:: osd_mclock_max_capacity_iops_ssd +.. confval:: osd_mclock_max_sequential_bandwidth_hdd +.. confval:: osd_mclock_max_sequential_bandwidth_ssd +.. confval:: osd_mclock_force_run_benchmark_on_init +.. confval:: osd_mclock_skip_benchmark +.. confval:: osd_mclock_override_recovery_settings +.. confval:: osd_mclock_iops_capacity_threshold_hdd +.. confval:: osd_mclock_iops_capacity_threshold_ssd + +.. _the dmClock algorithm: https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Gulati.pdf diff --git a/doc/rados/configuration/mon-config-ref.rst b/doc/rados/configuration/mon-config-ref.rst new file mode 100644 index 000000000..e0a12d093 --- /dev/null +++ b/doc/rados/configuration/mon-config-ref.rst @@ -0,0 +1,642 @@ +.. _monitor-config-reference: + +========================== + Monitor Config Reference +========================== + +Understanding how to configure a :term:`Ceph Monitor` is an important part of +building a reliable :term:`Ceph Storage Cluster`. **All Ceph Storage Clusters +have at least one monitor**. The monitor complement usually remains fairly +consistent, but you can add, remove or replace a monitor in a cluster. See +`Adding/Removing a Monitor`_ for details. + + +.. index:: Ceph Monitor; Paxos + +Background +========== + +Ceph Monitors maintain a "master copy" of the :term:`Cluster Map`. + +The :term:`Cluster Map` makes it possible for :term:`Ceph client`\s to +determine the location of all Ceph Monitors, Ceph OSD Daemons, and Ceph +Metadata Servers. Clients do this by connecting to one Ceph Monitor and +retrieving a current cluster map. Ceph clients must connect to a Ceph Monitor +before they can read from or write to Ceph OSD Daemons or Ceph Metadata +Servers. A Ceph client that has a current copy of the cluster map and the CRUSH +algorithm can compute the location of any RADOS object within the cluster. This +makes it possible for Ceph clients to talk directly to Ceph OSD Daemons. Direct +communication between clients and Ceph OSD Daemons improves upon traditional +storage architectures that required clients to communicate with a central +component. 
See `Scalability and High Availability`_ for more on this subject. + +The Ceph Monitor's primary function is to maintain a master copy of the cluster +map. Monitors also provide authentication and logging services. All changes in +the monitor services are written by the Ceph Monitor to a single Paxos +instance, and Paxos writes the changes to a key/value store. This provides +strong consistency. Ceph Monitors are able to query the most recent version of +the cluster map during sync operations, and they use the key/value store's +snapshots and iterators (using RocksDB) to perform store-wide synchronization. + +.. ditaa:: + /-------------\ /-------------\ + | Monitor | Write Changes | Paxos | + | cCCC +-------------->+ cCCC | + | | | | + +-------------+ \------+------/ + | Auth | | + +-------------+ | Write Changes + | Log | | + +-------------+ v + | Monitor Map | /------+------\ + +-------------+ | Key / Value | + | OSD Map | | Store | + +-------------+ | cCCC | + | PG Map | \------+------/ + +-------------+ ^ + | MDS Map | | Read Changes + +-------------+ | + | cCCC |*---------------------+ + \-------------/ + +.. index:: Ceph Monitor; cluster map + +Cluster Maps +------------ + +The cluster map is a composite of maps, including the monitor map, the OSD map, +the placement group map and the metadata server map. The cluster map tracks a +number of important things: which processes are ``in`` the Ceph Storage Cluster; +which processes that are ``in`` the Ceph Storage Cluster are ``up`` and running +or ``down``; whether, the placement groups are ``active`` or ``inactive``, and +``clean`` or in some other state; and, other details that reflect the current +state of the cluster such as the total amount of storage space, and the amount +of storage used. + +When there is a significant change in the state of the cluster--e.g., a Ceph OSD +Daemon goes down, a placement group falls into a degraded state, etc.--the +cluster map gets updated to reflect the current state of the cluster. +Additionally, the Ceph Monitor also maintains a history of the prior states of +the cluster. The monitor map, OSD map, placement group map and metadata server +map each maintain a history of their map versions. We call each version an +"epoch." + +When operating your Ceph Storage Cluster, keeping track of these states is an +important part of your system administration duties. See `Monitoring a Cluster`_ +and `Monitoring OSDs and PGs`_ for additional details. + +.. index:: high availability; quorum + +Monitor Quorum +-------------- + +Our Configuring ceph section provides a trivial `Ceph configuration file`_ that +provides for one monitor in the test cluster. A cluster will run fine with a +single monitor; however, **a single monitor is a single-point-of-failure**. To +ensure high availability in a production Ceph Storage Cluster, you should run +Ceph with multiple monitors so that the failure of a single monitor **WILL NOT** +bring down your entire cluster. + +When a Ceph Storage Cluster runs multiple Ceph Monitors for high availability, +Ceph Monitors use `Paxos`_ to establish consensus about the master cluster map. +A consensus requires a majority of monitors running to establish a quorum for +consensus about the cluster map (e.g., 1; 2 out of 3; 3 out of 5; 4 out of 6; +etc.). + +.. confval:: mon_force_quorum_join + +.. 
index:: Ceph Monitor; consistency + +Consistency +----------- + +When you add monitor settings to your Ceph configuration file, you need to be +aware of some of the architectural aspects of Ceph Monitors. **Ceph imposes +strict consistency requirements** for a Ceph monitor when discovering another +Ceph Monitor within the cluster. Whereas, Ceph Clients and other Ceph daemons +use the Ceph configuration file to discover monitors, monitors discover each +other using the monitor map (monmap), not the Ceph configuration file. + +A Ceph Monitor always refers to the local copy of the monmap when discovering +other Ceph Monitors in the Ceph Storage Cluster. Using the monmap instead of the +Ceph configuration file avoids errors that could break the cluster (e.g., typos +in ``ceph.conf`` when specifying a monitor address or port). Since monitors use +monmaps for discovery and they share monmaps with clients and other Ceph +daemons, **the monmap provides monitors with a strict guarantee that their +consensus is valid.** + +Strict consistency also applies to updates to the monmap. As with any other +updates on the Ceph Monitor, changes to the monmap always run through a +distributed consensus algorithm called `Paxos`_. The Ceph Monitors must agree on +each update to the monmap, such as adding or removing a Ceph Monitor, to ensure +that each monitor in the quorum has the same version of the monmap. Updates to +the monmap are incremental so that Ceph Monitors have the latest agreed upon +version, and a set of previous versions. Maintaining a history enables a Ceph +Monitor that has an older version of the monmap to catch up with the current +state of the Ceph Storage Cluster. + +If Ceph Monitors were to discover each other through the Ceph configuration file +instead of through the monmap, additional risks would be introduced because +Ceph configuration files are not updated and distributed automatically. Ceph +Monitors might inadvertently use an older Ceph configuration file, fail to +recognize a Ceph Monitor, fall out of a quorum, or develop a situation where +`Paxos`_ is not able to determine the current state of the system accurately. + + +.. index:: Ceph Monitor; bootstrapping monitors + +Bootstrapping Monitors +---------------------- + +In most configuration and deployment cases, tools that deploy Ceph help +bootstrap the Ceph Monitors by generating a monitor map for you (e.g., +``cephadm``, etc). A Ceph Monitor requires a few explicit +settings: + +- **Filesystem ID**: The ``fsid`` is the unique identifier for your + object store. Since you can run multiple clusters on the same + hardware, you must specify the unique ID of the object store when + bootstrapping a monitor. Deployment tools usually do this for you + (e.g., ``cephadm`` can call a tool like ``uuidgen``), but you + may specify the ``fsid`` manually too. + +- **Monitor ID**: A monitor ID is a unique ID assigned to each monitor within + the cluster. It is an alphanumeric value, and by convention the identifier + usually follows an alphabetical increment (e.g., ``a``, ``b``, etc.). This + can be set in a Ceph configuration file (e.g., ``[mon.a]``, ``[mon.b]``, etc.), + by a deployment tool, or using the ``ceph`` commandline. + +- **Keys**: The monitor must have secret keys. A deployment tool such as + ``cephadm`` usually does this for you, but you may + perform this step manually too. See `Monitor Keyrings`_ for details. + +For additional details on bootstrapping, see `Bootstrapping a Monitor`_. + +.. 
index:: Ceph Monitor; configuring monitors + +Configuring Monitors +==================== + +To apply configuration settings to the entire cluster, enter the configuration +settings under ``[global]``. To apply configuration settings to all monitors in +your cluster, enter the configuration settings under ``[mon]``. To apply +configuration settings to specific monitors, specify the monitor instance +(e.g., ``[mon.a]``). By convention, monitor instance names use alpha notation. + +.. code-block:: ini + + [global] + + [mon] + + [mon.a] + + [mon.b] + + [mon.c] + + +Minimum Configuration +--------------------- + +The bare minimum monitor settings for a Ceph monitor via the Ceph configuration +file include a hostname and a network address for each monitor. You can configure +these under ``[mon]`` or under the entry for a specific monitor. + +.. code-block:: ini + + [global] + mon_host = 10.0.0.2,10.0.0.3,10.0.0.4 + +.. code-block:: ini + + [mon.a] + host = hostname1 + mon_addr = 10.0.0.10:6789 + +See the `Network Configuration Reference`_ for details. + +.. note:: This minimum configuration for monitors assumes that a deployment + tool generates the ``fsid`` and the ``mon.`` key for you. + +Once you deploy a Ceph cluster, you **SHOULD NOT** change the IP addresses of +monitors. However, if you decide to change the monitor's IP address, you +must follow a specific procedure. See :ref:`Changing a Monitor's IP address` for +details. + +Monitors can also be found by clients by using DNS SRV records. See `Monitor lookup through DNS`_ for details. + +Cluster ID +---------- + +Each Ceph Storage Cluster has a unique identifier (``fsid``). If specified, it +usually appears under the ``[global]`` section of the configuration file. +Deployment tools usually generate the ``fsid`` and store it in the monitor map, +so the value may not appear in a configuration file. The ``fsid`` makes it +possible to run daemons for multiple clusters on the same hardware. + +.. confval:: fsid + +.. index:: Ceph Monitor; initial members + +Initial Members +--------------- + +We recommend running a production Ceph Storage Cluster with at least three Ceph +Monitors to ensure high availability. When you run multiple monitors, you may +specify the initial monitors that must be members of the cluster in order to +establish a quorum. This may reduce the time it takes for your cluster to come +online. + +.. code-block:: ini + + [mon] + mon_initial_members = a,b,c + + +.. confval:: mon_initial_members + +.. index:: Ceph Monitor; data path + +Data +---- + +Ceph provides a default path where Ceph Monitors store data. For optimal +performance in a production Ceph Storage Cluster, we recommend running Ceph +Monitors on separate hosts and drives from Ceph OSD Daemons. As leveldb uses +``mmap()`` for writing the data, Ceph Monitors flush their data from memory to disk +very often, which can interfere with Ceph OSD Daemon workloads if the data +store is co-located with the OSD Daemons. + +In Ceph versions 0.58 and earlier, Ceph Monitors store their data in plain files. This +approach allows users to inspect monitor data with common tools like ``ls`` +and ``cat``. However, this approach didn't provide strong consistency. + +In Ceph versions 0.59 and later, Ceph Monitors store their data as key/value +pairs. Ceph Monitors require `ACID`_ transactions. 
Using a data store prevents +recovering Ceph Monitors from running corrupted versions through Paxos, and it +enables multiple modification operations in one single atomic batch, among other +advantages. + +Generally, we do not recommend changing the default data location. If you modify +the default location, we recommend that you make it uniform across Ceph Monitors +by setting it in the ``[mon]`` section of the configuration file. + + +.. confval:: mon_data +.. confval:: mon_data_size_warn +.. confval:: mon_data_avail_warn +.. confval:: mon_data_avail_crit +.. confval:: mon_warn_on_crush_straw_calc_version_zero +.. confval:: mon_warn_on_legacy_crush_tunables +.. confval:: mon_crush_min_required_version +.. confval:: mon_warn_on_osd_down_out_interval_zero +.. confval:: mon_warn_on_slow_ping_ratio +.. confval:: mon_warn_on_slow_ping_time +.. confval:: mon_warn_on_pool_no_redundancy +.. confval:: mon_cache_target_full_warn_ratio +.. confval:: mon_health_to_clog +.. confval:: mon_health_to_clog_tick_interval +.. confval:: mon_health_to_clog_interval + +.. index:: Ceph Storage Cluster; capacity planning, Ceph Monitor; capacity planning + +.. _storage-capacity: + +Storage Capacity +---------------- + +When a Ceph Storage Cluster gets close to its maximum capacity +(see``mon_osd_full ratio``), Ceph prevents you from writing to or reading from OSDs +as a safety measure to prevent data loss. Therefore, letting a +production Ceph Storage Cluster approach its full ratio is not a good practice, +because it sacrifices high availability. The default full ratio is ``.95``, or +95% of capacity. This a very aggressive setting for a test cluster with a small +number of OSDs. + +.. tip:: When monitoring your cluster, be alert to warnings related to the + ``nearfull`` ratio. This means that a failure of some OSDs could result + in a temporary service disruption if one or more OSDs fails. Consider adding + more OSDs to increase storage capacity. + +A common scenario for test clusters involves a system administrator removing an +OSD from the Ceph Storage Cluster, watching the cluster rebalance, then removing +another OSD, and another, until at least one OSD eventually reaches the full +ratio and the cluster locks up. We recommend a bit of capacity +planning even with a test cluster. Planning enables you to gauge how much spare +capacity you will need in order to maintain high availability. Ideally, you want +to plan for a series of Ceph OSD Daemon failures where the cluster can recover +to an ``active+clean`` state without replacing those OSDs +immediately. Cluster operation continues in the ``active+degraded`` state, but this +is not ideal for normal operation and should be addressed promptly. + +The following diagram depicts a simplistic Ceph Storage Cluster containing 33 +Ceph Nodes with one OSD per host, each OSD reading from +and writing to a 3TB drive. So this exemplary Ceph Storage Cluster has a maximum +actual capacity of 99TB. With a ``mon osd full ratio`` of ``0.95``, if the Ceph +Storage Cluster falls to 5TB of remaining capacity, the cluster will not allow +Ceph Clients to read and write data. So the Ceph Storage Cluster's operating +capacity is 95TB, not 99TB. + +.. 
ditaa:: + +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ + | Rack 1 | | Rack 2 | | Rack 3 | | Rack 4 | | Rack 5 | | Rack 6 | + | cCCC | | cF00 | | cCCC | | cCCC | | cCCC | | cCCC | + +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ + | OSD 1 | | OSD 7 | | OSD 13 | | OSD 19 | | OSD 25 | | OSD 31 | + +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ + | OSD 2 | | OSD 8 | | OSD 14 | | OSD 20 | | OSD 26 | | OSD 32 | + +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ + | OSD 3 | | OSD 9 | | OSD 15 | | OSD 21 | | OSD 27 | | OSD 33 | + +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ + | OSD 4 | | OSD 10 | | OSD 16 | | OSD 22 | | OSD 28 | | Spare | + +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ + | OSD 5 | | OSD 11 | | OSD 17 | | OSD 23 | | OSD 29 | | Spare | + +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ + | OSD 6 | | OSD 12 | | OSD 18 | | OSD 24 | | OSD 30 | | Spare | + +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ + +It is normal in such a cluster for one or two OSDs to fail. A less frequent but +reasonable scenario involves a rack's router or power supply failing, which +brings down multiple OSDs simultaneously (e.g., OSDs 7-12). In such a scenario, +you should still strive for a cluster that can remain operational and achieve an +``active + clean`` state--even if that means adding a few hosts with additional +OSDs in short order. If your capacity utilization is too high, you may not lose +data, but you could still sacrifice data availability while resolving an outage +within a failure domain if capacity utilization of the cluster exceeds the full +ratio. For this reason, we recommend at least some rough capacity planning. + +Identify two numbers for your cluster: + +#. The number of OSDs. +#. The total capacity of the cluster + +If you divide the total capacity of your cluster by the number of OSDs in your +cluster, you will find the mean average capacity of an OSD within your cluster. +Consider multiplying that number by the number of OSDs you expect will fail +simultaneously during normal operations (a relatively small number). Finally +multiply the capacity of the cluster by the full ratio to arrive at a maximum +operating capacity; then, subtract the number of amount of data from the OSDs +you expect to fail to arrive at a reasonable full ratio. Repeat the foregoing +process with a higher number of OSD failures (e.g., a rack of OSDs) to arrive at +a reasonable number for a near full ratio. + +The following settings only apply on cluster creation and are then stored in +the OSDMap. To clarify, in normal operation the values that are used by OSDs +are those found in the OSDMap, not those in the configuration file or central +config store. + +.. code-block:: ini + + [global] + mon_osd_full_ratio = .80 + mon_osd_backfillfull_ratio = .75 + mon_osd_nearfull_ratio = .70 + + +``mon_osd_full_ratio`` + +:Description: The threshold percentage of device space utilized before an OSD is + considered ``full``. + +:Type: Float +:Default: ``0.95`` + + +``mon_osd_backfillfull_ratio`` + +:Description: The threshold percentage of device space utilized before an OSD is + considered too ``full`` to backfill. + +:Type: Float +:Default: ``0.90`` + + +``mon_osd_nearfull_ratio`` + +:Description: The threshold percentage of device space used before an OSD is + considered ``nearfull``. + +:Type: Float +:Default: ``0.85`` + + +.. 
tip:: If some OSDs are nearfull, but others have plenty of capacity, you + may have an inaccurate CRUSH weight set for the nearfull OSDs. + +.. tip:: These settings only apply during cluster creation. Afterwards they need + to be changed in the OSDMap using ``ceph osd set-nearfull-ratio`` and + ``ceph osd set-full-ratio`` + +.. index:: heartbeat + +Heartbeat +--------- + +Ceph monitors know about the cluster by requiring reports from each OSD, and by +receiving reports from OSDs about the status of their neighboring OSDs. Ceph +provides reasonable default settings for monitor/OSD interaction; however, you +may modify them as needed. See `Monitor/OSD Interaction`_ for details. + + +.. index:: Ceph Monitor; leader, Ceph Monitor; provider, Ceph Monitor; requester, Ceph Monitor; synchronization + +Monitor Store Synchronization +----------------------------- + +When you run a production cluster with multiple monitors (recommended), each +monitor checks to see if a neighboring monitor has a more recent version of the +cluster map (e.g., a map in a neighboring monitor with one or more epoch numbers +higher than the most current epoch in the map of the instant monitor). +Periodically, one monitor in the cluster may fall behind the other monitors to +the point where it must leave the quorum, synchronize to retrieve the most +current information about the cluster, and then rejoin the quorum. For the +purposes of synchronization, monitors may assume one of three roles: + +#. **Leader**: The `Leader` is the first monitor to achieve the most recent + Paxos version of the cluster map. + +#. **Provider**: The `Provider` is a monitor that has the most recent version + of the cluster map, but wasn't the first to achieve the most recent version. + +#. **Requester:** A `Requester` is a monitor that has fallen behind the leader + and must synchronize in order to retrieve the most recent information about + the cluster before it can rejoin the quorum. + +These roles enable a leader to delegate synchronization duties to a provider, +which prevents synchronization requests from overloading the leader--improving +performance. In the following diagram, the requester has learned that it has +fallen behind the other monitors. The requester asks the leader to synchronize, +and the leader tells the requester to synchronize with a provider. + + +.. ditaa:: + +-----------+ +---------+ +----------+ + | Requester | | Leader | | Provider | + +-----------+ +---------+ +----------+ + | | | + | | | + | Ask to Synchronize | | + |------------------->| | + | | | + |<-------------------| | + | Tell Requester to | | + | Sync with Provider | | + | | | + | Synchronize | + |--------------------+-------------------->| + | | | + |<-------------------+---------------------| + | Send Chunk to Requester | + | (repeat as necessary) | + | Requester Acks Chuck to Provider | + |--------------------+-------------------->| + | | + | Sync Complete | + | Notification | + |------------------->| + | | + |<-------------------| + | Ack | + | | + + +Synchronization always occurs when a new monitor joins the cluster. During +runtime operations, monitors may receive updates to the cluster map at different +times. This means the leader and provider roles may migrate from one monitor to +another. If this happens while synchronizing (e.g., a provider falls behind the +leader), the provider can terminate synchronization with a requester. + +Once synchronization is complete, Ceph performs trimming across the cluster. 
+Trimming requires that the placement groups are ``active+clean``. + + +.. confval:: mon_sync_timeout +.. confval:: mon_sync_max_payload_size +.. confval:: paxos_max_join_drift +.. confval:: paxos_stash_full_interval +.. confval:: paxos_propose_interval +.. confval:: paxos_min +.. confval:: paxos_min_wait +.. confval:: paxos_trim_min +.. confval:: paxos_trim_max +.. confval:: paxos_service_trim_min +.. confval:: paxos_service_trim_max +.. confval:: paxos_service_trim_max_multiplier +.. confval:: mon_mds_force_trim_to +.. confval:: mon_osd_force_trim_to +.. confval:: mon_osd_cache_size +.. confval:: mon_election_timeout +.. confval:: mon_lease +.. confval:: mon_lease_renew_interval_factor +.. confval:: mon_lease_ack_timeout_factor +.. confval:: mon_accept_timeout_factor +.. confval:: mon_min_osdmap_epochs +.. confval:: mon_max_log_epochs + + +.. index:: Ceph Monitor; clock + +.. _mon-config-ref-clock: + +Clock +----- + +Ceph daemons pass critical messages to each other, which must be processed +before daemons reach a timeout threshold. If the clocks in Ceph monitors +are not synchronized, it can lead to a number of anomalies. For example: + +- Daemons ignoring received messages (e.g., timestamps outdated) +- Timeouts triggered too soon/late when a message wasn't received in time. + +See `Monitor Store Synchronization`_ for details. + + +.. tip:: You must configure NTP or PTP daemons on your Ceph monitor hosts to + ensure that the monitor cluster operates with synchronized clocks. + It can be advantageous to have monitor hosts sync with each other + as well as with multiple quality upstream time sources. + +Clock drift may still be noticeable with NTP even though the discrepancy is not +yet harmful. Ceph's clock drift / clock skew warnings may get triggered even +though NTP maintains a reasonable level of synchronization. Increasing your +clock drift may be tolerable under such circumstances; however, a number of +factors such as workload, network latency, configuring overrides to default +timeouts and the `Monitor Store Synchronization`_ settings may influence +the level of acceptable clock drift without compromising Paxos guarantees. + +Ceph provides the following tunable options to allow you to find +acceptable values. + +.. confval:: mon_tick_interval +.. confval:: mon_clock_drift_allowed +.. confval:: mon_clock_drift_warn_backoff +.. confval:: mon_timecheck_interval +.. confval:: mon_timecheck_skew_interval + +Client +------ + +.. confval:: mon_client_hunt_interval +.. confval:: mon_client_ping_interval +.. confval:: mon_client_max_log_entries_per_message +.. confval:: mon_client_bytes + +.. _pool-settings: + +Pool settings +============= + +Since version v0.94 there is support for pool flags which allow or disallow changes to be made to pools. +Monitors can also disallow removal of pools if appropriately configured. The inconvenience of this guardrail +is far outweighed by the number of accidental pool (and thus data) deletions it prevents. + +.. confval:: mon_allow_pool_delete +.. confval:: osd_pool_default_ec_fast_read +.. confval:: osd_pool_default_flag_hashpspool +.. confval:: osd_pool_default_flag_nodelete +.. confval:: osd_pool_default_flag_nopgchange +.. confval:: osd_pool_default_flag_nosizechange + +For more information about the pool flags see :ref:`Pool values <setpoolvalues>`. + +Miscellaneous +============= + +.. confval:: mon_max_osd +.. confval:: mon_globalid_prealloc +.. confval:: mon_subscribe_interval +.. confval:: mon_stat_smooth_intervals +.. 
confval:: mon_probe_timeout +.. confval:: mon_daemon_bytes +.. confval:: mon_max_log_entries_per_event +.. confval:: mon_osd_prime_pg_temp +.. confval:: mon_osd_prime_pg_temp_max_time +.. confval:: mon_osd_prime_pg_temp_max_estimate +.. confval:: mon_mds_skip_sanity +.. confval:: mon_max_mdsmap_epochs +.. confval:: mon_config_key_max_entry_size +.. confval:: mon_scrub_interval +.. confval:: mon_scrub_max_keys +.. confval:: mon_compact_on_start +.. confval:: mon_compact_on_bootstrap +.. confval:: mon_compact_on_trim +.. confval:: mon_cpu_threads +.. confval:: mon_osd_mapping_pgs_per_chunk +.. confval:: mon_session_timeout +.. confval:: mon_osd_cache_size_min +.. confval:: mon_memory_target +.. confval:: mon_memory_autotune + +.. _Paxos: https://en.wikipedia.org/wiki/Paxos_(computer_science) +.. _Monitor Keyrings: ../../../dev/mon-bootstrap#secret-keys +.. _Ceph configuration file: ../ceph-conf/#monitors +.. _Network Configuration Reference: ../network-config-ref +.. _Monitor lookup through DNS: ../mon-lookup-dns +.. _ACID: https://en.wikipedia.org/wiki/ACID +.. _Adding/Removing a Monitor: ../../operations/add-or-rm-mons +.. _Monitoring a Cluster: ../../operations/monitoring +.. _Monitoring OSDs and PGs: ../../operations/monitoring-osd-pg +.. _Bootstrapping a Monitor: ../../../dev/mon-bootstrap +.. _Monitor/OSD Interaction: ../mon-osd-interaction +.. _Scalability and High Availability: ../../../architecture#scalability-and-high-availability diff --git a/doc/rados/configuration/mon-lookup-dns.rst b/doc/rados/configuration/mon-lookup-dns.rst new file mode 100644 index 000000000..129a083c4 --- /dev/null +++ b/doc/rados/configuration/mon-lookup-dns.rst @@ -0,0 +1,58 @@ +.. _mon-dns-lookup: + +=============================== +Looking up Monitors through DNS +=============================== + +Since Ceph version 11.0.0 (Kraken), RADOS has supported looking up monitors +through DNS. + +The addition of the ability to look up monitors through DNS means that daemons +and clients do not require a *mon host* configuration directive in their +``ceph.conf`` configuration file. + +With a DNS update, clients and daemons can be made aware of changes +in the monitor topology. To be more precise and technical, clients look up the +monitors by using ``DNS SRV TCP`` records. + +By default, clients and daemons look for the TCP service called *ceph-mon*, +which is configured by the *mon_dns_srv_name* configuration directive. + + +.. confval:: mon_dns_srv_name + +Example +------- +When the DNS search domain is set to *example.com* a DNS zone file might contain the following elements. + +First, create records for the Monitors, either IPv4 (A) or IPv6 (AAAA). + +:: + + mon1.example.com. AAAA 2001:db8::100 + mon2.example.com. AAAA 2001:db8::200 + mon3.example.com. AAAA 2001:db8::300 + +:: + + mon1.example.com. A 192.168.0.1 + mon2.example.com. A 192.168.0.2 + mon3.example.com. A 192.168.0.3 + + +With those records now existing we can create the SRV TCP records with the name *ceph-mon* pointing to the three Monitors. + +:: + + _ceph-mon._tcp.example.com. 60 IN SRV 10 20 6789 mon1.example.com. + _ceph-mon._tcp.example.com. 60 IN SRV 10 30 6789 mon2.example.com. + _ceph-mon._tcp.example.com. 60 IN SRV 20 50 6789 mon3.example.com. + +Now all Monitors are running on port *6789*, with priorities 10, 10, 20 and weights 20, 30, 50 respectively. + +Monitor clients choose monitor by referencing the SRV records. 
If a cluster has multiple Monitor SRV records +with the same priority value, clients and daemons will load balance the connections to Monitors in proportion +to the values of the SRV weight fields. + +For the above example, this will result in approximate 40% of the clients and daemons connecting to mon1, +60% of them connecting to mon2. However, if neither of them is reachable, then mon3 will be reconsidered as a fallback. diff --git a/doc/rados/configuration/mon-osd-interaction.rst b/doc/rados/configuration/mon-osd-interaction.rst new file mode 100644 index 000000000..8cf09707d --- /dev/null +++ b/doc/rados/configuration/mon-osd-interaction.rst @@ -0,0 +1,245 @@ +===================================== + Configuring Monitor/OSD Interaction +===================================== + +.. index:: heartbeat + +After you have completed your initial Ceph configuration, you may deploy and run +Ceph. When you execute a command such as ``ceph health`` or ``ceph -s``, the +:term:`Ceph Monitor` reports on the current state of the :term:`Ceph Storage +Cluster`. The Ceph Monitor knows about the Ceph Storage Cluster by requiring +reports from each :term:`Ceph OSD Daemon`, and by receiving reports from Ceph +OSD Daemons about the status of their neighboring Ceph OSD Daemons. If the Ceph +Monitor doesn't receive reports, or if it receives reports of changes in the +Ceph Storage Cluster, the Ceph Monitor updates the status of the :term:`Ceph +Cluster Map`. + +Ceph provides reasonable default settings for Ceph Monitor/Ceph OSD Daemon +interaction. However, you may override the defaults. The following sections +describe how Ceph Monitors and Ceph OSD Daemons interact for the purposes of +monitoring the Ceph Storage Cluster. + +.. index:: heartbeat interval + +OSDs Check Heartbeats +===================== + +Each Ceph OSD Daemon checks the heartbeat of other Ceph OSD Daemons at random +intervals less than every 6 seconds. If a neighboring Ceph OSD Daemon doesn't +show a heartbeat within a 20 second grace period, the Ceph OSD Daemon may +consider the neighboring Ceph OSD Daemon ``down`` and report it back to a Ceph +Monitor, which will update the Ceph Cluster Map. You may change this grace +period by adding an ``osd heartbeat grace`` setting under the ``[mon]`` +and ``[osd]`` or ``[global]`` section of your Ceph configuration file, +or by setting the value at runtime. + + +.. ditaa:: + +---------+ +---------+ + | OSD 1 | | OSD 2 | + +---------+ +---------+ + | | + |----+ Heartbeat | + | | Interval | + |<---+ Exceeded | + | | + | Check | + | Heartbeat | + |------------------->| + | | + |<-------------------| + | Heart Beating | + | | + |----+ Heartbeat | + | | Interval | + |<---+ Exceeded | + | | + | Check | + | Heartbeat | + |------------------->| + | | + |----+ Grace | + | | Period | + |<---+ Exceeded | + | | + |----+ Mark | + | | OSD 2 | + |<---+ Down | + + +.. index:: OSD down report + +OSDs Report Down OSDs +===================== + +By default, two Ceph OSD Daemons from different hosts must report to the Ceph +Monitors that another Ceph OSD Daemon is ``down`` before the Ceph Monitors +acknowledge that the reported Ceph OSD Daemon is ``down``. But there is chance +that all the OSDs reporting the failure are hosted in a rack with a bad switch +which has trouble connecting to another OSD. To avoid this sort of false alarm, +we consider the peers reporting a failure a proxy for a potential "subcluster" +over the overall cluster that is similarly laggy. 
This is clearly not true in +all cases, but will sometimes help us localize the grace correction to a subset +of the system that is unhappy. ``mon osd reporter subtree level`` is used to +group the peers into the "subcluster" by their common ancestor type in CRUSH +map. By default, only two reports from different subtree are required to report +another Ceph OSD Daemon ``down``. You can change the number of reporters from +unique subtrees and the common ancestor type required to report a Ceph OSD +Daemon ``down`` to a Ceph Monitor by adding an ``mon osd min down reporters`` +and ``mon osd reporter subtree level`` settings under the ``[mon]`` section of +your Ceph configuration file, or by setting the value at runtime. + + +.. ditaa:: + + +---------+ +---------+ +---------+ + | OSD 1 | | OSD 2 | | Monitor | + +---------+ +---------+ +---------+ + | | | + | OSD 3 Is Down | | + |---------------+--------------->| + | | | + | | | + | | OSD 3 Is Down | + | |--------------->| + | | | + | | | + | | |---------+ Mark + | | | | OSD 3 + | | |<--------+ Down + + +.. index:: peering failure + +OSDs Report Peering Failure +=========================== + +If a Ceph OSD Daemon cannot peer with any of the Ceph OSD Daemons defined in its +Ceph configuration file (or the cluster map), it will ping a Ceph Monitor for +the most recent copy of the cluster map every 30 seconds. You can change the +Ceph Monitor heartbeat interval by adding an ``osd mon heartbeat interval`` +setting under the ``[osd]`` section of your Ceph configuration file, or by +setting the value at runtime. + +.. ditaa:: + + +---------+ +---------+ +-------+ +---------+ + | OSD 1 | | OSD 2 | | OSD 3 | | Monitor | + +---------+ +---------+ +-------+ +---------+ + | | | | + | Request To | | | + | Peer | | | + |-------------->| | | + |<--------------| | | + | Peering | | + | | | + | Request To | | + | Peer | | + |----------------------------->| | + | | + |----+ OSD Monitor | + | | Heartbeat | + |<---+ Interval Exceeded | + | | + | Failed to Peer with OSD 3 | + |-------------------------------------------->| + |<--------------------------------------------| + | Receive New Cluster Map | + + +.. index:: OSD status + +OSDs Report Their Status +======================== + +If an Ceph OSD Daemon doesn't report to a Ceph Monitor, the Ceph Monitor will +consider the Ceph OSD Daemon ``down`` after the ``mon osd report timeout`` +elapses. A Ceph OSD Daemon sends a report to a Ceph Monitor when a reportable +event such as a failure, a change in placement group stats, a change in +``up_thru`` or when it boots within 5 seconds. You can change the Ceph OSD +Daemon minimum report interval by adding an ``osd mon report interval`` +setting under the ``[osd]`` section of your Ceph configuration file, or by +setting the value at runtime. A Ceph OSD Daemon sends a report to a Ceph +Monitor every 120 seconds irrespective of whether any notable changes occur. +You can change the Ceph Monitor report interval by adding an ``osd mon report +interval max`` setting under the ``[osd]`` section of your Ceph configuration +file, or by setting the value at runtime. + + +.. 
ditaa:: + + +---------+ +---------+ + | OSD 1 | | Monitor | + +---------+ +---------+ + | | + |----+ Report Min | + | | Interval | + |<---+ Exceeded | + | | + |----+ Reportable | + | | Event | + |<---+ Occurs | + | | + | Report To | + | Monitor | + |------------------->| + | | + |----+ Report Max | + | | Interval | + |<---+ Exceeded | + | | + | Report To | + | Monitor | + |------------------->| + | | + |----+ Monitor | + | | Fails | + |<---+ | + +----+ Monitor OSD + | | Report Timeout + |<---+ Exceeded + | + +----+ Mark + | | OSD 1 + |<---+ Down + + + + +Configuration Settings +====================== + +When modifying heartbeat settings, you should include them in the ``[global]`` +section of your configuration file. + +.. index:: monitor heartbeat + +Monitor Settings +---------------- + +.. confval:: mon_osd_min_up_ratio +.. confval:: mon_osd_min_in_ratio +.. confval:: mon_osd_laggy_halflife +.. confval:: mon_osd_laggy_weight +.. confval:: mon_osd_laggy_max_interval +.. confval:: mon_osd_adjust_heartbeat_grace +.. confval:: mon_osd_adjust_down_out_interval +.. confval:: mon_osd_auto_mark_in +.. confval:: mon_osd_auto_mark_auto_out_in +.. confval:: mon_osd_auto_mark_new_in +.. confval:: mon_osd_down_out_interval +.. confval:: mon_osd_down_out_subtree_limit +.. confval:: mon_osd_report_timeout +.. confval:: mon_osd_min_down_reporters +.. confval:: mon_osd_reporter_subtree_level + +.. index:: OSD heartbeat + +OSD Settings +------------ + +.. confval:: osd_heartbeat_interval +.. confval:: osd_heartbeat_grace +.. confval:: osd_mon_heartbeat_interval +.. confval:: osd_mon_heartbeat_stat_stale +.. confval:: osd_mon_report_interval diff --git a/doc/rados/configuration/msgr2.rst b/doc/rados/configuration/msgr2.rst new file mode 100644 index 000000000..33fe4e022 --- /dev/null +++ b/doc/rados/configuration/msgr2.rst @@ -0,0 +1,257 @@ +.. _msgr2: + +Messenger v2 +============ + +What is it +---------- + +The messenger v2 protocol, or msgr2, is the second major revision on +Ceph's on-wire protocol. It brings with it several key features: + +* A *secure* mode that encrypts all data passing over the network +* Improved encapsulation of authentication payloads, enabling future + integration of new authentication modes like Kerberos +* Improved earlier feature advertisement and negotiation, enabling + future protocol revisions + +Ceph daemons can now bind to multiple ports, allowing both legacy Ceph +clients and new v2-capable clients to connect to the same cluster. + +By default, monitors now bind to the new IANA-assigned port ``3300`` +(ce4h or 0xce4) for the new v2 protocol, while also binding to the +old default port ``6789`` for the legacy v1 protocol. + +.. _address_formats: + +Address formats +--------------- + +Prior to Nautilus, all network addresses were rendered like +``1.2.3.4:567/89012`` where there was an IP address, a port, and a +nonce to uniquely identify a client or daemon on the network. +Starting with Nautilus, we now have three different address types: + +* **v2**: ``v2:1.2.3.4:578/89012`` identifies a daemon binding to a + port speaking the new v2 protocol +* **v1**: ``v1:1.2.3.4:578/89012`` identifies a daemon binding to a + port speaking the legacy v1 protocol. Any address that was + previously shown with any prefix is now shown as a ``v1:`` address. +* **TYPE_ANY** ``any:1.2.3.4:578/89012`` identifies a client that can + speak either version of the protocol. 
Prior to nautilus, clients would appear as + ``1.2.3.4:0/123456``, where the port of 0 indicates they are clients + and do not accept incoming connections. Starting with Nautilus, + these clients are now internally represented by a **TYPE_ANY** + address, and still shown with no prefix, because they may + connect to daemons using the v2 or v1 protocol, depending on what + protocol(s) the daemons are using. + +Because daemons now bind to multiple ports, they are now described by +a vector of addresses instead of a single address. For example, +dumping the monitor map on a Nautilus cluster now includes lines +like:: + + epoch 1 + fsid 50fcf227-be32-4bcb-8b41-34ca8370bd16 + last_changed 2019-02-25 11:10:46.700821 + created 2019-02-25 11:10:46.700821 + min_mon_release 14 (nautilus) + 0: [v2:10.0.0.10:3300/0,v1:10.0.0.10:6789/0] mon.foo + 1: [v2:10.0.0.11:3300/0,v1:10.0.0.11:6789/0] mon.bar + 2: [v2:10.0.0.12:3300/0,v1:10.0.0.12:6789/0] mon.baz + +The bracketed list or vector of addresses means that the same daemon can be +reached on multiple ports (and protocols). Any client or other daemon +connecting to that daemon will use the v2 protocol (listed first) if +possible; otherwise it will back to the legacy v1 protocol. Legacy +clients will only see the v1 addresses and will continue to connect as +they did before, with the v1 protocol. + +Starting in Nautilus, the ``mon_host`` configuration option and ``-m +<mon-host>`` command line options support the same bracketed address +vector syntax. + + +Bind configuration options +^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Two new configuration options control whether the v1 and/or v2 +protocol is used: + + * :confval:`ms_bind_msgr1` [default: true] controls whether a daemon binds + to a port speaking the v1 protocol + * :confval:`ms_bind_msgr2` [default: true] controls whether a daemon binds + to a port speaking the v2 protocol + +Similarly, two options control whether IPv4 and IPv6 addresses are used: + + * :confval:`ms_bind_ipv4` [default: true] controls whether a daemon binds + to an IPv4 address + * :confval:`ms_bind_ipv6` [default: false] controls whether a daemon binds + to an IPv6 address + +.. note:: The ability to bind to multiple ports has paved the way for + dual-stack IPv4 and IPv6 support. That said, dual-stack operation is + not yet supported as of Quincy v17.2.0. + +Connection modes +---------------- + +The v2 protocol supports two connection modes: + +* *crc* mode provides: + + - a strong initial authentication when the connection is established + (with cephx, mutual authentication of both parties with protection + from a man-in-the-middle or eavesdropper), and + - a crc32c integrity check to protect against bit flips due to flaky + hardware or cosmic rays + + *crc* mode does *not* provide: + + - secrecy (an eavesdropper on the network can see all + post-authentication traffic as it goes by) or + - protection from a malicious man-in-the-middle (who can deliberate + modify traffic as it goes by, as long as they are careful to + adjust the crc32c values to match) + +* *secure* mode provides: + + - a strong initial authentication when the connection is established + (with cephx, mutual authentication of both parties with protection + from a man-in-the-middle or eavesdropper), and + - full encryption of all post-authentication traffic, including a + cryptographic integrity check. 
+ + In Nautilus, secure mode uses the `AES-GCM + <https://en.wikipedia.org/wiki/Galois/Counter_Mode>`_ stream cipher, + which is generally very fast on modern processors (e.g., faster than + a SHA-256 cryptographic hash). + +Connection mode configuration options +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +For most connections, there are options that control which modes are used: + +.. confval:: ms_cluster_mode +.. confval:: ms_service_mode +.. confval:: ms_client_mode + +There are a parallel set of options that apply specifically to +monitors, allowing administrators to set different (usually more +secure) requirements on communication with the monitors. + +.. confval:: ms_mon_cluster_mode +.. confval:: ms_mon_service_mode +.. confval:: ms_mon_client_mode + + +Compression modes +----------------- + +The v2 protocol supports two compression modes: + +* *force* mode provides: + + - In multi-availability zones deployment, compressing replication messages between OSDs saves latency. + - In the public cloud, inter-AZ communications are expensive. Thus, minimizing message + size reduces network costs to cloud provider. + - When using instance storage on AWS (probably other public clouds as well) the instances with NVMe + provide low network bandwidth relative to the device bandwidth. + In this case, NW compression can improve the overall performance since this is clearly + the bottleneck. + +* *none* mode provides: + + - messages are transmitted without compression. + + +Compression mode configuration options +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +For all connections, there is an option that controls compression usage in secure mode + +.. confval:: ms_compress_secure + +There is a parallel set of options that apply specifically to OSDs, +allowing administrators to set different requirements on communication between OSDs. + +.. confval:: ms_osd_compress_mode +.. confval:: ms_osd_compress_min_size +.. confval:: ms_osd_compression_algorithm + +Transitioning from v1-only to v2-plus-v1 +---------------------------------------- + +By default, ``ms_bind_msgr2`` is true starting with Nautilus 14.2.z. +However, until the monitors start using v2, only limited services will +start advertising v2 addresses. + +For most users, the monitors are binding to the default legacy port ``6789`` +for the v1 protocol. When this is the case, enabling v2 is as simple as: + +.. prompt:: bash $ + + ceph mon enable-msgr2 + +If the monitors are bound to non-standard ports, you will need to +specify an additional port for v2 explicitly. For example, if your +monitor ``mon.a`` binds to ``1.2.3.4:1111``, and you want to add v2 on +port ``1112``: + +.. prompt:: bash $ + + ceph mon set-addrs a [v2:1.2.3.4:1112,v1:1.2.3.4:1111] + +Once the monitors bind to v2, each daemon will start advertising a v2 +address when it is next restarted. + + +.. _msgr2_ceph_conf: + +Updating ceph.conf and mon_host +------------------------------- + +Prior to Nautilus, a CLI user or daemon will normally discover the +monitors via the ``mon_host`` option in ``/etc/ceph/ceph.conf``. The +syntax for this option has expanded starting with Nautilus to allow +support the new bracketed list format. 
For example, an old line
+like::
+
+  mon_host = 10.0.0.1:6789,10.0.0.2:6789,10.0.0.3:6789
+
+can be changed to::
+
+  mon_host = [v2:10.0.0.1:3300/0,v1:10.0.0.1:6789/0],[v2:10.0.0.2:3300/0,v1:10.0.0.2:6789/0],[v2:10.0.0.3:3300/0,v1:10.0.0.3:6789/0]
+
+However, when default ports are used (``3300`` and ``6789``), they can
+be omitted::
+
+  mon_host = 10.0.0.1,10.0.0.2,10.0.0.3
+
+Once v2 has been enabled on the monitors, ``ceph.conf`` may need to be
+updated to either specify no ports (this is usually simplest), or
+explicitly specify both the v2 and v1 addresses. Note, however, that
+the new bracketed syntax is only understood by Nautilus and later, so
+do not make that change on hosts that have not yet had their ceph
+packages upgraded.
+
+When you are updating ``ceph.conf``, note that the new ``ceph config
+generate-minimal-conf`` command (which generates a barebones config
+file with just enough information to reach the monitors) and the
+``ceph config assimilate-conf`` command (which moves config file options into
+the monitors' configuration database) may be helpful. For example::
+
+   # ceph config assimilate-conf < /etc/ceph/ceph.conf
+   # ceph config generate-minimal-conf > /etc/ceph/ceph.conf.new
+   # cat /etc/ceph/ceph.conf.new
+   # minimal ceph.conf for 0e5a806b-0ce5-4bc6-b949-aa6f68f5c2a3
+   [global]
+           fsid = 0e5a806b-0ce5-4bc6-b949-aa6f68f5c2a3
+           mon_host = [v2:10.0.0.1:3300/0,v1:10.0.0.1:6789/0]
+   # mv /etc/ceph/ceph.conf.new /etc/ceph/ceph.conf
+
+Protocol
+--------
+
+For a detailed description of the v2 wire protocol, see :ref:`msgr2-protocol`.
diff --git a/doc/rados/configuration/network-config-ref.rst b/doc/rados/configuration/network-config-ref.rst
new file mode 100644
index 000000000..81e85c5d1
--- /dev/null
+++ b/doc/rados/configuration/network-config-ref.rst
@@ -0,0 +1,355 @@
+=================================
+ Network Configuration Reference
+=================================
+
+Network configuration is critical for building a high performance :term:`Ceph
+Storage Cluster`. The Ceph Storage Cluster does not perform request routing or
+dispatching on behalf of the :term:`Ceph Client`. Instead, Ceph Clients make
+requests directly to Ceph OSD Daemons. Ceph OSD Daemons perform data replication
+on behalf of Ceph Clients, which means replication and other factors impose
+additional loads on Ceph Storage Cluster networks.
+
+Our Quick Start configurations provide a trivial Ceph configuration file that
+sets monitor IP addresses and daemon host names only. Unless you specify a
+cluster network, Ceph assumes a single "public" network. Ceph functions just
+fine with a public network only, but you may see significant performance
+improvement with a second "cluster" network in a large cluster.
+
+It is possible to run a Ceph Storage Cluster with two networks: a public
+(client, front-side) network and a cluster (private, replication, back-side)
+network. However, this approach
+complicates network configuration (both hardware and software) and does not usually
+have a significant impact on overall performance. For this reason, we recommend
+that, for resilience and capacity, dual-NIC systems either bond these interfaces
+in active/active mode or implement a layer 3 multipath strategy with, e.g., FRR.
+
+If, despite the complexity, one still wishes to use two networks, each
+:term:`Ceph Node` will need to have more than one network interface or VLAN. See `Hardware
+Recommendations - Networks`_ for additional details.
+
+.. 
ditaa:: + +-------------+ + | Ceph Client | + +----*--*-----+ + | ^ + Request | : Response + v | + /----------------------------------*--*-------------------------------------\ + | Public Network | + \---*--*------------*--*-------------*--*------------*--*------------*--*---/ + ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ + | | | | | | | | | | + | : | : | : | : | : + v v v v v v v v v v + +---*--*---+ +---*--*---+ +---*--*---+ +---*--*---+ +---*--*---+ + | Ceph MON | | Ceph MDS | | Ceph OSD | | Ceph OSD | | Ceph OSD | + +----------+ +----------+ +---*--*---+ +---*--*---+ +---*--*---+ + ^ ^ ^ ^ ^ ^ + The cluster network relieves | | | | | | + OSD replication and heartbeat | : | : | : + traffic from the public network. v v v v v v + /------------------------------------*--*------------*--*------------*--*---\ + | cCCC Cluster Network | + \---------------------------------------------------------------------------/ + + +IP Tables +========= + +By default, daemons `bind`_ to ports within the ``6800:7300`` range. You may +configure this range at your discretion. Before configuring your IP tables, +check the default ``iptables`` configuration. + +.. prompt:: bash $ + + sudo iptables -L + +Some Linux distributions include rules that reject all inbound requests +except SSH from all network interfaces. For example:: + + REJECT all -- anywhere anywhere reject-with icmp-host-prohibited + +You will need to delete these rules on both your public and cluster networks +initially, and replace them with appropriate rules when you are ready to +harden the ports on your Ceph Nodes. + + +Monitor IP Tables +----------------- + +Ceph Monitors listen on ports ``3300`` and ``6789`` by +default. Additionally, Ceph Monitors always operate on the public +network. When you add the rule using the example below, make sure you +replace ``{iface}`` with the public network interface (e.g., ``eth0``, +``eth1``, etc.), ``{ip-address}`` with the IP address of the public +network and ``{netmask}`` with the netmask for the public network. : + +.. prompt:: bash $ + + sudo iptables -A INPUT -i {iface} -p tcp -s {ip-address}/{netmask} --dport 6789 -j ACCEPT + + +MDS and Manager IP Tables +------------------------- + +A :term:`Ceph Metadata Server` or :term:`Ceph Manager` listens on the first +available port on the public network beginning at port 6800. Note that this +behavior is not deterministic, so if you are running more than one OSD or MDS +on the same host, or if you restart the daemons within a short window of time, +the daemons will bind to higher ports. You should open the entire 6800-7300 +range by default. When you add the rule using the example below, make sure +you replace ``{iface}`` with the public network interface (e.g., ``eth0``, +``eth1``, etc.), ``{ip-address}`` with the IP address of the public network +and ``{netmask}`` with the netmask of the public network. + +For example: + +.. prompt:: bash $ + + sudo iptables -A INPUT -i {iface} -m multiport -p tcp -s {ip-address}/{netmask} --dports 6800:7300 -j ACCEPT + + +OSD IP Tables +------------- + +By default, Ceph OSD Daemons `bind`_ to the first available ports on a Ceph Node +beginning at port 6800. Note that this behavior is not deterministic, so if you +are running more than one OSD or MDS on the same host, or if you restart the +daemons within a short window of time, the daemons will bind to higher ports. +Each Ceph OSD Daemon on a Ceph Node may use up to four ports: + +#. One for talking to clients and monitors. +#. One for sending data to other OSDs. +#. 
Two for heartbeating on each interface. + +.. ditaa:: + /---------------\ + | OSD | + | +---+----------------+-----------+ + | | Clients & Monitors | Heartbeat | + | +---+----------------+-----------+ + | | + | +---+----------------+-----------+ + | | Data Replication | Heartbeat | + | +---+----------------+-----------+ + | cCCC | + \---------------/ + +When a daemon fails and restarts without letting go of the port, the restarted +daemon will bind to a new port. You should open the entire 6800-7300 port range +to handle this possibility. + +If you set up separate public and cluster networks, you must add rules for both +the public network and the cluster network, because clients will connect using +the public network and other Ceph OSD Daemons will connect using the cluster +network. When you add the rule using the example below, make sure you replace +``{iface}`` with the network interface (e.g., ``eth0``, ``eth1``, etc.), +``{ip-address}`` with the IP address and ``{netmask}`` with the netmask of the +public or cluster network. For example: + +.. prompt:: bash $ + + sudo iptables -A INPUT -i {iface} -m multiport -p tcp -s {ip-address}/{netmask} --dports 6800:7300 -j ACCEPT + +.. tip:: If you run Ceph Metadata Servers on the same Ceph Node as the + Ceph OSD Daemons, you can consolidate the public network configuration step. + + +Ceph Networks +============= + +To configure Ceph networks, you must add a network configuration to the +``[global]`` section of the configuration file. Our 5-minute Quick Start +provides a trivial Ceph configuration file that assumes one public network +with client and server on the same network and subnet. Ceph functions just fine +with a public network only. However, Ceph allows you to establish much more +specific criteria, including multiple IP network and subnet masks for your +public network. You can also establish a separate cluster network to handle OSD +heartbeat, object replication and recovery traffic. Don't confuse the IP +addresses you set in your configuration with the public-facing IP addresses +network clients may use to access your service. Typical internal IP networks are +often ``192.168.0.0`` or ``10.0.0.0``. + +.. tip:: If you specify more than one IP address and subnet mask for + either the public or the cluster network, the subnets within the network + must be capable of routing to each other. Additionally, make sure you + include each IP address/subnet in your IP tables and open ports for them + as necessary. + +.. note:: Ceph uses `CIDR`_ notation for subnets (e.g., ``10.0.0.0/24``). + +When you have configured your networks, you may restart your cluster or restart +each daemon. Ceph daemons bind dynamically, so you do not have to restart the +entire cluster at once if you change your network configuration. + + +Public Network +-------------- + +To configure a public network, add the following option to the ``[global]`` +section of your Ceph configuration file. + +.. code-block:: ini + + [global] + # ... elided configuration + public_network = {public-network/netmask} + +.. _cluster-network: + +Cluster Network +--------------- + +If you declare a cluster network, OSDs will route heartbeat, object replication +and recovery traffic over the cluster network. This may improve performance +compared to using a single network. To configure a cluster network, add the +following option to the ``[global]`` section of your Ceph configuration file. + +.. code-block:: ini + + [global] + # ... 
elided configuration + cluster_network = {cluster-network/netmask} + +We prefer that the cluster network is **NOT** reachable from the public network +or the Internet for added security. + +IPv4/IPv6 Dual Stack Mode +------------------------- + +If you want to run in an IPv4/IPv6 dual stack mode and want to define your public and/or +cluster networks, then you need to specify both your IPv4 and IPv6 networks for each: + +.. code-block:: ini + + [global] + # ... elided configuration + public_network = {IPv4 public-network/netmask}, {IPv6 public-network/netmask} + +This is so that Ceph can find a valid IP address for both address families. + +If you want just an IPv4 or an IPv6 stack environment, then make sure you set the `ms bind` +options correctly. + +.. note:: + Binding to IPv4 is enabled by default, so if you just add the option to bind to IPv6 + you'll actually put yourself into dual stack mode. If you want just IPv6, then disable IPv4 and + enable IPv6. See `Bind`_ below. + +Ceph Daemons +============ + +Monitor daemons are each configured to bind to a specific IP address. These +addresses are normally configured by your deployment tool. Other components +in the Ceph cluster discover the monitors via the ``mon host`` configuration +option, normally specified in the ``[global]`` section of the ``ceph.conf`` file. + +.. code-block:: ini + + [global] + mon_host = 10.0.0.2, 10.0.0.3, 10.0.0.4 + +The ``mon_host`` value can be a list of IP addresses or a name that is +looked up via DNS. In the case of a DNS name with multiple A or AAAA +records, all records are probed in order to discover a monitor. Once +one monitor is reached, all other current monitors are discovered, so +the ``mon host`` configuration option only needs to be sufficiently up +to date such that a client can reach one monitor that is currently online. + +The MGR, OSD, and MDS daemons will bind to any available address and +do not require any special configuration. However, it is possible to +specify a specific IP address for them to bind to with the ``public +addr`` (and/or, in the case of OSD daemons, the ``cluster addr``) +configuration option. For example, + +.. code-block:: ini + + [osd.0] + public_addr = {host-public-ip-address} + cluster_addr = {host-cluster-ip-address} + +.. topic:: One NIC OSD in a Two Network Cluster + + Generally, we do not recommend deploying an OSD host with a single network interface in a + cluster with two networks. However, you may accomplish this by forcing the + OSD host to operate on the public network by adding a ``public_addr`` entry + to the ``[osd.n]`` section of the Ceph configuration file, where ``n`` + refers to the ID of the OSD with one network interface. Additionally, the public + network and cluster network must be able to route traffic to each other, + which we don't recommend for security reasons. + + +Network Config Settings +======================= + +Network configuration settings are not required. Ceph assumes a public network +with all hosts operating on it unless you specifically configure a cluster +network. + + +Public Network +-------------- + +The public network configuration allows you specifically define IP addresses +and subnets for the public network. You may specifically assign static IP +addresses or override ``public_network`` settings using the ``public_addr`` +setting for a specific daemon. + +.. confval:: public_network +.. 
confval:: public_addr + +Cluster Network +--------------- + +The cluster network configuration allows you to declare a cluster network, and +specifically define IP addresses and subnets for the cluster network. You may +specifically assign static IP addresses or override ``cluster_network`` +settings using the ``cluster_addr`` setting for specific OSD daemons. + + +.. confval:: cluster_network +.. confval:: cluster_addr + +Bind +---- + +Bind settings set the default port ranges Ceph OSD and MDS daemons use. The +default range is ``6800:7300``. Ensure that your `IP Tables`_ configuration +allows you to use the configured port range. + +You may also enable Ceph daemons to bind to IPv6 addresses instead of IPv4 +addresses. + +.. confval:: ms_bind_port_min +.. confval:: ms_bind_port_max +.. confval:: ms_bind_ipv4 +.. confval:: ms_bind_ipv6 +.. confval:: public_bind_addr + +TCP +--- + +Ceph disables TCP buffering by default. + +.. confval:: ms_tcp_nodelay +.. confval:: ms_tcp_rcvbuf + +General Settings +---------------- + +.. confval:: ms_type +.. confval:: ms_async_op_threads +.. confval:: ms_initial_backoff +.. confval:: ms_max_backoff +.. confval:: ms_die_on_bad_msg +.. confval:: ms_dispatch_throttle_bytes +.. confval:: ms_inject_socket_failures + + +.. _Scalability and High Availability: ../../../architecture#scalability-and-high-availability +.. _Hardware Recommendations - Networks: ../../../start/hardware-recommendations#networks +.. _hardware recommendations: ../../../start/hardware-recommendations +.. _Monitor / OSD Interaction: ../mon-osd-interaction +.. _Message Signatures: ../auth-config-ref#signatures +.. _CIDR: https://en.wikipedia.org/wiki/Classless_Inter-Domain_Routing +.. _Nagle's Algorithm: https://en.wikipedia.org/wiki/Nagle's_algorithm diff --git a/doc/rados/configuration/osd-config-ref.rst b/doc/rados/configuration/osd-config-ref.rst new file mode 100644 index 000000000..060121200 --- /dev/null +++ b/doc/rados/configuration/osd-config-ref.rst @@ -0,0 +1,445 @@ +====================== + OSD Config Reference +====================== + +.. index:: OSD; configuration + +You can configure Ceph OSD Daemons in the Ceph configuration file (or in recent +releases, the central config store), but Ceph OSD +Daemons can use the default values and a very minimal configuration. A minimal +Ceph OSD Daemon configuration sets ``host`` and +uses default values for nearly everything else. + +Ceph OSD Daemons are numerically identified in incremental fashion, beginning +with ``0`` using the following convention. :: + + osd.0 + osd.1 + osd.2 + +In a configuration file, you may specify settings for all Ceph OSD Daemons in +the cluster by adding configuration settings to the ``[osd]`` section of your +configuration file. To add settings directly to a specific Ceph OSD Daemon +(e.g., ``host``), enter it in an OSD-specific section of your configuration +file. For example: + +.. code-block:: ini + + [osd] + osd_journal_size = 5120 + + [osd.0] + host = osd-host-a + + [osd.1] + host = osd-host-b + + +.. index:: OSD; config settings + +General Settings +================ + +The following settings provide a Ceph OSD Daemon's ID, and determine paths to +data and journals. Ceph deployment scripts typically generate the UUID +automatically. + +.. warning:: **DO NOT** change the default paths for data or journals, as it + makes it more problematic to troubleshoot Ceph later. 
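+
+If you need to check the values an OSD is actually using for these settings,
+query the centralized configuration database or dump a running daemon's
+effective configuration rather than editing paths on disk. The following is a
+minimal sketch using the standard ``ceph config`` commands; ``osd.0`` is only
+an illustrative daemon id:
+
+.. prompt:: bash $
+
+   # value stored in the monitors' configuration database for all OSDs
+   ceph config get osd osd_max_write_size
+   # effective settings reported for a running daemon
+   ceph config show-with-defaults osd.0 | grep -E 'osd_data|osd_uuid'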
+ +When using Filestore, the journal size should be at least twice the product of the expected drive +speed multiplied by ``filestore_max_sync_interval``. However, the most common +practice is to partition the journal drive (often an SSD), and mount it such +that Ceph uses the entire partition for the journal. + +.. confval:: osd_uuid +.. confval:: osd_data +.. confval:: osd_max_write_size +.. confval:: osd_max_object_size +.. confval:: osd_client_message_size_cap +.. confval:: osd_class_dir + :default: $libdir/rados-classes + +.. index:: OSD; file system + +File System Settings +==================== +Ceph builds and mounts file systems which are used for Ceph OSDs. + +``osd_mkfs_options {fs-type}`` + +:Description: Options used when creating a new Ceph Filestore OSD of type {fs-type}. + +:Type: String +:Default for xfs: ``-f -i 2048`` +:Default for other file systems: {empty string} + +For example:: + ``osd_mkfs_options_xfs = -f -d agcount=24`` + +``osd_mount_options {fs-type}`` + +:Description: Options used when mounting a Ceph Filestore OSD of type {fs-type}. + +:Type: String +:Default for xfs: ``rw,noatime,inode64`` +:Default for other file systems: ``rw, noatime`` + +For example:: + ``osd_mount_options_xfs = rw, noatime, inode64, logbufs=8`` + + +.. index:: OSD; journal settings + +Journal Settings +================ + +This section applies only to the older Filestore OSD back end. Since Luminous +BlueStore has been default and preferred. + +By default, Ceph expects that you will provision a Ceph OSD Daemon's journal at +the following path, which is usually a symlink to a device or partition:: + + /var/lib/ceph/osd/$cluster-$id/journal + +When using a single device type (for example, spinning drives), the journals +should be *colocated*: the logical volume (or partition) should be in the same +device as the ``data`` logical volume. + +When using a mix of fast (SSDs, NVMe) devices with slower ones (like spinning +drives) it makes sense to place the journal on the faster device, while +``data`` occupies the slower device fully. + +The default ``osd_journal_size`` value is 5120 (5 gigabytes), but it can be +larger, in which case it will need to be set in the ``ceph.conf`` file. +A value of 10 gigabytes is common in practice:: + + osd_journal_size = 10240 + + +.. confval:: osd_journal +.. confval:: osd_journal_size + +See `Journal Config Reference`_ for additional details. + + +Monitor OSD Interaction +======================= + +Ceph OSD Daemons check each other's heartbeats and report to monitors +periodically. Ceph can use default values in many cases. However, if your +network has latency issues, you may need to adopt longer intervals. See +`Configuring Monitor/OSD Interaction`_ for a detailed discussion of heartbeats. + + +Data Placement +============== + +See `Pool & PG Config Reference`_ for details. + + +.. index:: OSD; scrubbing + +.. _rados_config_scrubbing: + +Scrubbing +========= + +One way that Ceph ensures data integrity is by "scrubbing" placement groups. +Ceph scrubbing is analogous to ``fsck`` on the object storage layer. Ceph +generates a catalog of all objects in each placement group and compares each +primary object to its replicas, ensuring that no objects are missing or +mismatched. Light scrubbing checks the object size and attributes, and is +usually done daily. Deep scrubbing reads the data and uses checksums to ensure +data integrity, and is usually done weekly. 
The frequencies of both light
+scrubbing and deep scrubbing are determined by the cluster's configuration,
+which is fully under your control and subject to the settings explained below
+in this section.
+
+Although scrubbing is important for maintaining data integrity, it can reduce
+the performance of the Ceph cluster. You can adjust the following settings to
+increase or decrease the frequency and depth of scrubbing operations.
+
+
+.. confval:: osd_max_scrubs
+.. confval:: osd_scrub_begin_hour
+.. confval:: osd_scrub_end_hour
+.. confval:: osd_scrub_begin_week_day
+.. confval:: osd_scrub_end_week_day
+.. confval:: osd_scrub_during_recovery
+.. confval:: osd_scrub_load_threshold
+.. confval:: osd_scrub_min_interval
+.. confval:: osd_scrub_max_interval
+.. confval:: osd_scrub_chunk_min
+.. confval:: osd_scrub_chunk_max
+.. confval:: osd_scrub_sleep
+.. confval:: osd_deep_scrub_interval
+.. confval:: osd_scrub_interval_randomize_ratio
+.. confval:: osd_deep_scrub_stride
+.. confval:: osd_scrub_auto_repair
+.. confval:: osd_scrub_auto_repair_num_errors
+
+.. index:: OSD; operations settings
+
+Operations
+==========
+
+.. confval:: osd_op_num_shards
+.. confval:: osd_op_num_shards_hdd
+.. confval:: osd_op_num_shards_ssd
+.. confval:: osd_op_queue
+.. confval:: osd_op_queue_cut_off
+.. confval:: osd_client_op_priority
+.. confval:: osd_recovery_op_priority
+.. confval:: osd_scrub_priority
+.. confval:: osd_requested_scrub_priority
+.. confval:: osd_snap_trim_priority
+.. confval:: osd_snap_trim_sleep
+.. confval:: osd_snap_trim_sleep_hdd
+.. confval:: osd_snap_trim_sleep_ssd
+.. confval:: osd_snap_trim_sleep_hybrid
+.. confval:: osd_op_thread_timeout
+.. confval:: osd_op_complaint_time
+.. confval:: osd_op_history_size
+.. confval:: osd_op_history_duration
+.. confval:: osd_op_log_threshold
+.. confval:: osd_op_thread_suicide_timeout
+.. note:: See https://old.ceph.com/planet/dealing-with-some-osd-timeouts/ for
+   more on ``osd_op_thread_suicide_timeout``. Be aware that this is a link to a
+   reworking of a blog post from 2017, and that its conclusion will direct you
+   back to this page "for more information".
+
+.. _dmclock-qos:
+
+QoS Based on mClock
+-------------------
+
+Ceph's use of mClock is now more refined and can be used by following the
+steps described in `mClock Config Reference`_.
+
+Core Concepts
+`````````````
+
+Ceph's QoS support is implemented using a queueing scheduler
+based on `the dmClock algorithm`_. This algorithm allocates the I/O
+resources of the Ceph cluster in proportion to weights, and enforces
+the constraints of minimum reservation and maximum limitation, so that
+the services can compete for the resources fairly. Currently the
+*mclock_scheduler* operation queue divides Ceph services involving I/O
+resources into the following buckets:
+
+- client op: the iops issued by a client
+- osd subop: the iops issued by a primary OSD
+- snap trim: the snap trimming related requests
+- pg recovery: the recovery related requests
+- pg scrub: the scrub related requests
+
+The resources are partitioned using the following three sets of tags. In other
+words, the share of each type of service is controlled by three tags:
+
+#. reservation: the minimum IOPS allocated for the service.
+#. limitation: the maximum IOPS allocated for the service.
+#. weight: the proportional share of capacity if extra capacity is available
+   or the system is oversubscribed.
+
+In Ceph, operations are graded with a "cost", and the resources allocated
+for serving the various services are consumed by these "costs".
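+
+These three tags are surfaced as the ``osd_mclock_scheduler_*`` options listed
+at the end of this section. The following is a minimal sketch, not a
+recommendation: the option names are real, the numeric values are illustrative
+only (they mirror the weights used in the example below), and such overrides
+typically take effect only when the *custom* mClock profile is active (see
+`mClock Config Reference`_). The weights can be inspected and adjusted with the
+standard ``ceph config`` commands:
+
+.. prompt:: bash $
+
+   # check the current client weight, then adjust the relative weights
+   ceph config get osd osd_mclock_scheduler_client_wgt
+   ceph config set osd osd_mclock_scheduler_client_wgt 9
+   ceph config set osd osd_mclock_scheduler_background_recovery_wgt 1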
+
+The larger the reservation a service has, the more resources it is guaranteed
+to receive, as long as it requires them. For example, assume there are two
+services, recovery and client ops:
+
+- recovery: (r:1, l:5, w:1)
+- client ops: (r:2, l:0, w:9)
+
+The settings above ensure that recovery won't get more than 5
+requests per second serviced, even if it needs more and no other services are
+competing with it (see the CURRENT IMPLEMENTATION NOTE below). Conversely, if
+clients start to issue a large amount of I/O requests, they will not exhaust
+all of the I/O resources either: 1 request per second
+is always allocated for recovery jobs as long as there are any such
+requests, so the recovery jobs won't be starved even in a cluster with
+high load. In the meantime, the client ops can enjoy a larger
+portion of the I/O resources, because their weight is "9" while their
+competitor's is "1". Client ops are not clamped by the
+limit setting, so they can make use of all the resources if there is no
+recovery ongoing.
+
+CURRENT IMPLEMENTATION NOTE: the current implementation enforces the limit
+values. Therefore, if a service crosses the enforced limit, the op remains
+in the operation queue until the limit is restored.
+
+Subtleties of mClock
+````````````````````
+
+The reservation and limit values have a unit of requests per
+second. The weight, however, does not technically have a unit and the
+weights are relative to one another. So if one class of requests has a
+weight of 1 and another a weight of 9, then the latter class of
+requests should get its requests executed at a 9 to 1 ratio relative to the
+first class. However, that will only happen once the reservations are met, and
+those values include the operations executed under the reservation phase.
+
+Even though the weights do not have units, one must be careful in
+choosing their values due to how the algorithm assigns weight tags to
+requests. If the weight is *W*, then for a given class of requests,
+the next one that comes in will have a weight tag of *1/W* plus the
+previous weight tag or the current time, whichever is larger. That
+means if *W* is sufficiently large and therefore *1/W* is sufficiently
+small, the calculated tag may never be assigned as it will get a value
+of the current time. The ultimate lesson is that values for weight
+should not be too large. They should be under the number of requests
+one expects to be serviced each second.
+
+Caveats
+```````
+
+There are some factors that can reduce the impact of the mClock op
+queues within Ceph. First, requests to an OSD are sharded by their
+placement group identifier. Each shard has its own mClock queue and
+these queues neither interact nor share information among them. The
+number of shards can be controlled with the configuration options
+:confval:`osd_op_num_shards`, :confval:`osd_op_num_shards_hdd`, and
+:confval:`osd_op_num_shards_ssd`. A lower number of shards will increase the
+impact of the mClock queues, but may have other deleterious effects.
+
+Second, requests are transferred from the operation queue to the
+operation sequencer, in which they go through the phases of
+execution. The operation queue is where mClock resides and mClock
+determines the next op to transfer to the operation sequencer. The
+number of operations allowed in the operation sequencer is a complex
+issue. In general, we want to keep enough operations in the sequencer
+so it's always getting work done on some operations while it's waiting
+for disk and network access to complete on other operations. 
On the +other hand, once an operation is transferred to the operation +sequencer, mClock no longer has control over it. Therefore to maximize +the impact of mClock, we want to keep as few operations in the +operation sequencer as possible. So we have an inherent tension. + +The configuration options that influence the number of operations in +the operation sequencer are :confval:`bluestore_throttle_bytes`, +:confval:`bluestore_throttle_deferred_bytes`, +:confval:`bluestore_throttle_cost_per_io`, +:confval:`bluestore_throttle_cost_per_io_hdd`, and +:confval:`bluestore_throttle_cost_per_io_ssd`. + +A third factor that affects the impact of the mClock algorithm is that +we're using a distributed system, where requests are made to multiple +OSDs and each OSD has (can have) multiple shards. Yet we're currently +using the mClock algorithm, which is not distributed (note: dmClock is +the distributed version of mClock). + +Various organizations and individuals are currently experimenting with +mClock as it exists in this code base along with their modifications +to the code base. We hope you'll share you're experiences with your +mClock and dmClock experiments on the ``ceph-devel`` mailing list. + +.. confval:: osd_async_recovery_min_cost +.. confval:: osd_push_per_object_cost +.. confval:: osd_mclock_scheduler_client_res +.. confval:: osd_mclock_scheduler_client_wgt +.. confval:: osd_mclock_scheduler_client_lim +.. confval:: osd_mclock_scheduler_background_recovery_res +.. confval:: osd_mclock_scheduler_background_recovery_wgt +.. confval:: osd_mclock_scheduler_background_recovery_lim +.. confval:: osd_mclock_scheduler_background_best_effort_res +.. confval:: osd_mclock_scheduler_background_best_effort_wgt +.. confval:: osd_mclock_scheduler_background_best_effort_lim + +.. _the dmClock algorithm: https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Gulati.pdf + +.. index:: OSD; backfilling + +Backfilling +=========== + +When you add or remove Ceph OSD Daemons to a cluster, CRUSH will +rebalance the cluster by moving placement groups to or from Ceph OSDs +to restore balanced utilization. The process of migrating placement groups and +the objects they contain can reduce the cluster's operational performance +considerably. To maintain operational performance, Ceph performs this migration +with 'backfilling', which allows Ceph to set backfill operations to a lower +priority than requests to read or write data. + + +.. confval:: osd_max_backfills +.. confval:: osd_backfill_scan_min +.. confval:: osd_backfill_scan_max +.. confval:: osd_backfill_retry_interval + +.. index:: OSD; osdmap + +OSD Map +======= + +OSD maps reflect the OSD daemons operating in the cluster. Over time, the +number of map epochs increases. Ceph provides some settings to ensure that +Ceph performs well as the OSD map grows larger. + +.. confval:: osd_map_dedup +.. confval:: osd_map_cache_size +.. confval:: osd_map_message_max + +.. index:: OSD; recovery + +Recovery +======== + +When the cluster starts or when a Ceph OSD Daemon crashes and restarts, the OSD +begins peering with other Ceph OSD Daemons before writes can occur. See +`Monitoring OSDs and PGs`_ for details. + +If a Ceph OSD Daemon crashes and comes back online, usually it will be out of +sync with other Ceph OSD Daemons containing more recent versions of objects in +the placement groups. When this happens, the Ceph OSD Daemon goes into recovery +mode and seeks to get the latest copy of the data and bring its map back up to +date. 
Depending upon how long the Ceph OSD Daemon was down, the OSD's objects +and placement groups may be significantly out of date. Also, if a failure domain +went down (e.g., a rack), more than one Ceph OSD Daemon may come back online at +the same time. This can make the recovery process time consuming and resource +intensive. + +To maintain operational performance, Ceph performs recovery with limitations on +the number recovery requests, threads and object chunk sizes which allows Ceph +perform well in a degraded state. + +.. confval:: osd_recovery_delay_start +.. confval:: osd_recovery_max_active +.. confval:: osd_recovery_max_active_hdd +.. confval:: osd_recovery_max_active_ssd +.. confval:: osd_recovery_max_chunk +.. confval:: osd_recovery_max_single_start +.. confval:: osd_recover_clone_overlap +.. confval:: osd_recovery_sleep +.. confval:: osd_recovery_sleep_hdd +.. confval:: osd_recovery_sleep_ssd +.. confval:: osd_recovery_sleep_hybrid +.. confval:: osd_recovery_priority + +Tiering +======= + +.. confval:: osd_agent_max_ops +.. confval:: osd_agent_max_low_ops + +See `cache target dirty high ratio`_ for when the tiering agent flushes dirty +objects within the high speed mode. + +Miscellaneous +============= + +.. confval:: osd_default_notify_timeout +.. confval:: osd_check_for_log_corruption +.. confval:: osd_delete_sleep +.. confval:: osd_delete_sleep_hdd +.. confval:: osd_delete_sleep_ssd +.. confval:: osd_delete_sleep_hybrid +.. confval:: osd_command_max_records +.. confval:: osd_fast_fail_on_connection_refused + +.. _pool: ../../operations/pools +.. _Configuring Monitor/OSD Interaction: ../mon-osd-interaction +.. _Monitoring OSDs and PGs: ../../operations/monitoring-osd-pg#peering +.. _Pool & PG Config Reference: ../pool-pg-config-ref +.. _Journal Config Reference: ../journal-ref +.. _cache target dirty high ratio: ../../operations/pools#cache-target-dirty-high-ratio +.. _mClock Config Reference: ../mclock-config-ref diff --git a/doc/rados/configuration/pool-pg-config-ref.rst b/doc/rados/configuration/pool-pg-config-ref.rst new file mode 100644 index 000000000..902c80346 --- /dev/null +++ b/doc/rados/configuration/pool-pg-config-ref.rst @@ -0,0 +1,46 @@ +.. _rados_config_pool_pg_crush_ref: + +====================================== + Pool, PG and CRUSH Config Reference +====================================== + +.. index:: pools; configuration + +Ceph uses default values to determine how many placement groups (PGs) will be +assigned to each pool. We recommend overriding some of the defaults. +Specifically, we recommend setting a pool's replica size and overriding the +default number of placement groups. You can set these values when running +`pool`_ commands. You can also override the defaults by adding new ones in the +``[global]`` section of your Ceph configuration file. + + +.. literalinclude:: pool-pg.conf + :language: ini + +.. confval:: mon_max_pool_pg_num +.. confval:: mon_pg_stuck_threshold +.. confval:: mon_pg_warn_min_per_osd +.. confval:: mon_pg_warn_min_objects +.. confval:: mon_pg_warn_min_pool_objects +.. confval:: mon_pg_check_down_all_threshold +.. confval:: mon_pg_warn_max_object_skew +.. confval:: mon_delta_reset_interval +.. confval:: osd_crush_chooseleaf_type +.. confval:: osd_crush_initial_weight +.. confval:: osd_pool_default_crush_rule +.. confval:: osd_pool_erasure_code_stripe_unit +.. confval:: osd_pool_default_size +.. confval:: osd_pool_default_min_size +.. confval:: osd_pool_default_pg_num +.. confval:: osd_pool_default_pgp_num +.. 
confval:: osd_pool_default_pg_autoscale_mode +.. confval:: osd_pool_default_flags +.. confval:: osd_max_pgls +.. confval:: osd_min_pg_log_entries +.. confval:: osd_max_pg_log_entries +.. confval:: osd_default_data_pool_replay_window +.. confval:: osd_max_pg_per_osd_hard_ratio + +.. _pool: ../../operations/pools +.. _Monitoring OSDs and PGs: ../../operations/monitoring-osd-pg#peering +.. _Weighting Bucket Items: ../../operations/crush-map#weightingbucketitems diff --git a/doc/rados/configuration/pool-pg.conf b/doc/rados/configuration/pool-pg.conf new file mode 100644 index 000000000..6765d37df --- /dev/null +++ b/doc/rados/configuration/pool-pg.conf @@ -0,0 +1,21 @@ +[global] + + # By default, Ceph makes three replicas of RADOS objects. If you want + # to maintain four copies of an object the default value--a primary + # copy and three replica copies--reset the default values as shown in + # 'osd_pool_default_size'. If you want to allow Ceph to accept an I/O + # operation to a degraded PG, set 'osd_pool_default_min_size' to a + # number less than the 'osd_pool_default_size' value. + + osd_pool_default_size = 3 # Write an object three times. + osd_pool_default_min_size = 2 # Accept an I/O operation to a PG that has two copies of an object. + + # Note: by default, PG autoscaling is enabled and this value is used only + # in specific circumstances. It is however still recommend to set it. + # Ensure you have a realistic number of placement groups. We recommend + # approximately 100 per OSD. E.g., total number of OSDs multiplied by 100 + # divided by the number of replicas (i.e., 'osd_pool_default_size'). So for + # 10 OSDs and 'osd_pool_default_size' = 4, we'd recommend approximately + # (100 * 10) / 4 = 250. + # Always use the nearest power of two. + osd_pool_default_pg_num = 256 diff --git a/doc/rados/configuration/storage-devices.rst b/doc/rados/configuration/storage-devices.rst new file mode 100644 index 000000000..c83e87da7 --- /dev/null +++ b/doc/rados/configuration/storage-devices.rst @@ -0,0 +1,93 @@ +================= + Storage Devices +================= + +There are several Ceph daemons in a storage cluster: + +.. _rados_configuration_storage-devices_ceph_osd: + +* **Ceph OSDs** (Object Storage Daemons) store most of the data + in Ceph. Usually each OSD is backed by a single storage device. + This can be a traditional hard disk (HDD) or a solid state disk + (SSD). OSDs can also be backed by a combination of devices: for + example, a HDD for most data and an SSD (or partition of an + SSD) for some metadata. The number of OSDs in a cluster is + usually a function of the amount of data to be stored, the size + of each storage device, and the level and type of redundancy + specified (replication or erasure coding). +* **Ceph Monitor** daemons manage critical cluster state. This + includes cluster membership and authentication information. + Small clusters require only a few gigabytes of storage to hold + the monitor database. In large clusters, however, the monitor + database can reach sizes of tens of gigabytes to hundreds of + gigabytes. +* **Ceph Manager** daemons run alongside monitor daemons, providing + additional monitoring and providing interfaces to external + monitoring and management systems. + +.. _rados_config_storage_devices_osd_backends: + +OSD Back Ends +============= + +There are two ways that OSDs manage the data they store. As of the Luminous +12.2.z release, the default (and recommended) back end is *BlueStore*. 
Prior +to the Luminous release, the default (and only) back end was *Filestore*. + +.. _rados_config_storage_devices_bluestore: + +BlueStore +--------- + +BlueStore is a special-purpose storage back end designed specifically for +managing data on disk for Ceph OSD workloads. BlueStore's design is based on +a decade of experience of supporting and managing Filestore OSDs. + +Key BlueStore features include: + +* Direct management of storage devices. BlueStore consumes raw block devices or + partitions. This avoids intervening layers of abstraction (such as local file + systems like XFS) that can limit performance or add complexity. +* Metadata management with RocksDB. RocksDB's key/value database is embedded + in order to manage internal metadata, including the mapping of object + names to block locations on disk. +* Full data and metadata checksumming. By default, all data and + metadata written to BlueStore is protected by one or more + checksums. No data or metadata is read from disk or returned + to the user without being verified. +* Inline compression. Data can be optionally compressed before being written + to disk. +* Multi-device metadata tiering. BlueStore allows its internal + journal (write-ahead log) to be written to a separate, high-speed + device (like an SSD, NVMe, or NVDIMM) for increased performance. If + a significant amount of faster storage is available, internal + metadata can be stored on the faster device. +* Efficient copy-on-write. RBD and CephFS snapshots rely on a + copy-on-write *clone* mechanism that is implemented efficiently in + BlueStore. This results in efficient I/O both for regular snapshots + and for erasure-coded pools (which rely on cloning to implement + efficient two-phase commits). + +For more information, see :doc:`bluestore-config-ref` and :doc:`/rados/operations/bluestore-migration`. + +FileStore +--------- +.. warning:: Filestore has been deprecated in the Reef release and is no longer supported. + + +FileStore is the legacy approach to storing objects in Ceph. It +relies on a standard file system (normally XFS) in combination with a +key/value database (traditionally LevelDB, now RocksDB) for some +metadata. + +FileStore is well-tested and widely used in production. However, it +suffers from many performance deficiencies due to its overall design +and its reliance on a traditional file system for object data storage. + +Although FileStore is capable of functioning on most POSIX-compatible +file systems (including btrfs and ext4), we recommend that only the +XFS file system be used with Ceph. Both btrfs and ext4 have known bugs and +deficiencies and their use may lead to data loss. By default, all Ceph +provisioning tools use XFS. + +For more information, see :doc:`filestore-config-ref`. diff --git a/doc/rados/index.rst b/doc/rados/index.rst new file mode 100644 index 000000000..b506b7a7e --- /dev/null +++ b/doc/rados/index.rst @@ -0,0 +1,81 @@ +.. _rados-index: + +====================== + Ceph Storage Cluster +====================== + +The :term:`Ceph Storage Cluster` is the foundation for all Ceph deployments. +Based upon :abbr:`RADOS (Reliable Autonomic Distributed Object Store)`, Ceph +Storage Clusters consist of several types of daemons: + + 1. a :term:`Ceph OSD Daemon` (OSD) stores data as objects on a storage node + 2. a :term:`Ceph Monitor` (MON) maintains a master copy of the cluster map. + 3. a :term:`Ceph Manager` manager daemon + +A Ceph Storage Cluster might contain thousands of storage nodes. 
A +minimal system has at least one Ceph Monitor and two Ceph OSD +Daemons for data replication. + +The Ceph File System, Ceph Object Storage and Ceph Block Devices read data from +and write data to the Ceph Storage Cluster. + +.. container:: columns-3 + + .. container:: column + + .. raw:: html + + <h3>Config and Deploy</h3> + + Ceph Storage Clusters have a few required settings, but most configuration + settings have default values. A typical deployment uses a deployment tool + to define a cluster and bootstrap a monitor. See :ref:`cephadm` for details. + + .. toctree:: + :maxdepth: 2 + + Configuration <configuration/index> + + .. container:: column + + .. raw:: html + + <h3>Operations</h3> + + Once you have deployed a Ceph Storage Cluster, you may begin operating + your cluster. + + .. toctree:: + :maxdepth: 2 + + Operations <operations/index> + + .. toctree:: + :maxdepth: 1 + + Man Pages <man/index> + + .. toctree:: + :hidden: + + troubleshooting/index + + .. container:: column + + .. raw:: html + + <h3>APIs</h3> + + Most Ceph deployments use `Ceph Block Devices`_, `Ceph Object Storage`_ and/or the + `Ceph File System`_. You may also develop applications that talk directly to + the Ceph Storage Cluster. + + .. toctree:: + :maxdepth: 2 + + APIs <api/index> + +.. _Ceph Block Devices: ../rbd/ +.. _Ceph File System: ../cephfs/ +.. _Ceph Object Storage: ../radosgw/ +.. _Deployment: ../cephadm/ diff --git a/doc/rados/man/index.rst b/doc/rados/man/index.rst new file mode 100644 index 000000000..bac56aa46 --- /dev/null +++ b/doc/rados/man/index.rst @@ -0,0 +1,32 @@ +======================= + Object Store Manpages +======================= + +.. toctree:: + :maxdepth: 1 + + ../../man/8/ceph-volume.rst + ../../man/8/ceph-volume-systemd.rst + ../../man/8/ceph.rst + ../../man/8/ceph-authtool.rst + ../../man/8/ceph-clsinfo.rst + ../../man/8/ceph-conf.rst + ../../man/8/ceph-debugpack.rst + ../../man/8/ceph-dencoder.rst + ../../man/8/ceph-mon.rst + ../../man/8/ceph-osd.rst + ../../man/8/ceph-kvstore-tool.rst + ../../man/8/ceph-run.rst + ../../man/8/ceph-syn.rst + ../../man/8/crushdiff.rst + ../../man/8/crushtool.rst + ../../man/8/librados-config.rst + ../../man/8/monmaptool.rst + ../../man/8/osdmaptool.rst + ../../man/8/rados.rst + + +.. toctree:: + :hidden: + + ../../man/8/ceph-post-file.rst diff --git a/doc/rados/operations/add-or-rm-mons.rst b/doc/rados/operations/add-or-rm-mons.rst new file mode 100644 index 000000000..3688bb798 --- /dev/null +++ b/doc/rados/operations/add-or-rm-mons.rst @@ -0,0 +1,458 @@ +.. _adding-and-removing-monitors: + +========================== + Adding/Removing Monitors +========================== + +It is possible to add monitors to a running cluster as long as redundancy is +maintained. To bootstrap a monitor, see `Manual Deployment`_ or `Monitor +Bootstrap`_. + +.. _adding-monitors: + +Adding Monitors +=============== + +Ceph monitors serve as the single source of truth for the cluster map. It is +possible to run a cluster with only one monitor, but for a production cluster +it is recommended to have at least three monitors provisioned and in quorum. +Ceph monitors use a variation of the `Paxos`_ algorithm to maintain consensus +about maps and about other critical information across the cluster. Due to the +nature of Paxos, Ceph is able to maintain quorum (and thus establish +consensus) only if a majority of the monitors are ``active``. + +It is best to run an odd number of monitors. 
This is because a cluster that is
+running an odd number of monitors is more resilient than a cluster running an
+even number. For example, in a two-monitor deployment, no failures can be
+tolerated if quorum is to be maintained; in a three-monitor deployment, one
+failure can be tolerated; in a four-monitor deployment, one failure can be
+tolerated; and in a five-monitor deployment, two failures can be tolerated. In
+general, a cluster running an odd number of monitors is best because it avoids
+what is called the *split brain* phenomenon. In short, Ceph is able to operate
+only if a majority of monitors are ``active`` and able to communicate with each
+other (for example, there must be a single monitor, two out of two monitors,
+two out of three monitors, three out of five monitors, or the like).
+
+For small or non-critical deployments of multi-node Ceph clusters, it is
+recommended to deploy three monitors. For larger clusters or for clusters that
+are intended to survive a double failure, it is recommended to deploy five
+monitors. Only in rare circumstances is there any justification for deploying
+seven or more monitors.
+
+It is possible to run a monitor on the same host that is running an OSD.
+However, this approach has disadvantages. For example, ``fsync`` issues with the
+kernel might weaken performance, and monitor and OSD daemons might be inactive at
+the same time and cause disruption if the node crashes, is rebooted, or is
+taken down for maintenance. Because of these risks, it is instead
+recommended to run monitors and managers on dedicated hosts.
+
+.. note:: A *majority* of monitors in your cluster must be able to
+   reach each other in order for quorum to be established.
+
+Deploying your Hardware
+-----------------------
+
+Some operators choose to add a new monitor host at the same time that they add
+a new monitor. For details on the minimum recommendations for monitor hardware,
+see `Hardware Recommendations`_. Before adding a monitor host to the cluster,
+make sure that there is an up-to-date version of Linux installed.
+
+Add the newly installed monitor host to a rack in your cluster, connect the
+host to the network, and make sure that the host has network connectivity.
+
+.. _Hardware Recommendations: ../../../start/hardware-recommendations
+
+Installing the Required Software
+--------------------------------
+
+In manually deployed clusters, it is necessary to install Ceph packages
+manually. For details, see `Installing Packages`_. Configure SSH so that it can
+be used by a user that has passwordless authentication and root permissions.
+
+.. _Installing Packages: ../../../install/install-storage-cluster
+
+
+.. _Adding a Monitor (Manual):
+
+Adding a Monitor (Manual)
+-------------------------
+
+The procedure in this section creates a ``ceph-mon`` data directory, retrieves
+both the monitor map and the monitor keyring, and adds a ``ceph-mon`` daemon to
+the cluster. The procedure might result in a Ceph cluster that contains only
+two monitor daemons. To add more monitors until there are enough ``ceph-mon``
+daemons to establish quorum, repeat the procedure.
+
+This is a good point at which to define the new monitor's ``id``. Monitors have
+often been named with single letters (``a``, ``b``, ``c``, etc.), but you are
+free to define the ``id`` however you see fit. In this document, ``{mon-id}``
+refers to the ``id`` exclusive of the ``mon.`` prefix: for example, if
+``mon.a`` has been chosen as the ``id`` of a monitor, then ``{mon-id}`` is
+``a``.
+
+#. 
Create a data directory on the machine that will host the new monitor: + + .. prompt:: bash $ + + ssh {new-mon-host} + sudo mkdir /var/lib/ceph/mon/ceph-{mon-id} + +#. Create a temporary directory ``{tmp}`` that will contain the files needed + during this procedure. This directory should be different from the data + directory created in the previous step. Because this is a temporary + directory, it can be removed after the procedure is complete: + + .. prompt:: bash $ + + mkdir {tmp} + +#. Retrieve the keyring for your monitors (``{tmp}`` is the path to the + retrieved keyring and ``{key-filename}`` is the name of the file that + contains the retrieved monitor key): + + .. prompt:: bash $ + + ceph auth get mon. -o {tmp}/{key-filename} + +#. Retrieve the monitor map (``{tmp}`` is the path to the retrieved monitor map + and ``{map-filename}`` is the name of the file that contains the retrieved + monitor map): + + .. prompt:: bash $ + + ceph mon getmap -o {tmp}/{map-filename} + +#. Prepare the monitor's data directory, which was created in the first step. + The following command must specify the path to the monitor map (so that + information about a quorum of monitors and their ``fsid``\s can be + retrieved) and specify the path to the monitor keyring: + + .. prompt:: bash $ + + sudo ceph-mon -i {mon-id} --mkfs --monmap {tmp}/{map-filename} --keyring {tmp}/{key-filename} + +#. Start the new monitor. It will automatically join the cluster. To provide + information to the daemon about which address to bind to, use either the + ``--public-addr {ip}`` option or the ``--public-network {network}`` option. + For example: + + .. prompt:: bash $ + + ceph-mon -i {mon-id} --public-addr {ip:port} + +.. _removing-monitors: + +Removing Monitors +================= + +When monitors are removed from a cluster, it is important to remember +that Ceph monitors use Paxos to maintain consensus about the cluster +map. Such consensus is possible only if the number of monitors is sufficient +to establish quorum. + + +.. _Removing a Monitor (Manual): + +Removing a Monitor (Manual) +--------------------------- + +The procedure in this section removes a ``ceph-mon`` daemon from the cluster. +The procedure might result in a Ceph cluster that contains a number of monitors +insufficient to maintain quorum, so plan carefully. When replacing an old +monitor with a new monitor, add the new monitor first, wait for quorum to be +established, and then remove the old monitor. This ensures that quorum is not +lost. + + +#. Stop the monitor: + + .. prompt:: bash $ + + service ceph -a stop mon.{mon-id} + +#. Remove the monitor from the cluster: + + .. prompt:: bash $ + + ceph mon remove {mon-id} + +#. Remove the monitor entry from the ``ceph.conf`` file: + +.. _rados-mon-remove-from-unhealthy: + + +Removing Monitors from an Unhealthy Cluster +------------------------------------------- + +The procedure in this section removes a ``ceph-mon`` daemon from an unhealthy +cluster (for example, a cluster whose monitors are unable to form a quorum). + +#. Stop all ``ceph-mon`` daemons on all monitor hosts: + + .. prompt:: bash $ + + ssh {mon-host} + systemctl stop ceph-mon.target + + Repeat this step on every monitor host. + +#. Identify a surviving monitor and log in to the monitor's host: + + .. prompt:: bash $ + + ssh {mon-host} + +#. Extract a copy of the ``monmap`` file by running a command of the following + form: + + .. prompt:: bash $ + + ceph-mon -i {mon-id} --extract-monmap {map-path} + + Here is a more concrete example. 
In this example, ``hostname`` is the + ``{mon-id}`` and ``/tmp/monpap`` is the ``{map-path}``: + + .. prompt:: bash $ + + ceph-mon -i `hostname` --extract-monmap /tmp/monmap + +#. Remove the non-surviving or otherwise problematic monitors: + + .. prompt:: bash $ + + monmaptool {map-path} --rm {mon-id} + + For example, suppose that there are three monitors |---| ``mon.a``, ``mon.b``, + and ``mon.c`` |---| and that only ``mon.a`` will survive: + + .. prompt:: bash $ + + monmaptool /tmp/monmap --rm b + monmaptool /tmp/monmap --rm c + +#. Inject the surviving map that includes the removed monitors into the + monmap of the surviving monitor(s): + + .. prompt:: bash $ + + ceph-mon -i {mon-id} --inject-monmap {map-path} + + Continuing with the above example, inject a map into monitor ``mon.a`` by + running the following command: + + .. prompt:: bash $ + + ceph-mon -i a --inject-monmap /tmp/monmap + + +#. Start only the surviving monitors. + +#. Verify that the monitors form a quorum by running the command ``ceph -s``. + +#. The data directory of the removed monitors is in ``/var/lib/ceph/mon``: + either archive this data directory in a safe location or delete this data + directory. However, do not delete it unless you are confident that the + remaining monitors are healthy and sufficiently redundant. Make sure that + there is enough room for the live DB to expand and compact, and make sure + that there is also room for an archived copy of the DB. The archived copy + can be compressed. + +.. _Changing a Monitor's IP address: + +Changing a Monitor's IP Address +=============================== + +.. important:: Existing monitors are not supposed to change their IP addresses. + +Monitors are critical components of a Ceph cluster. The entire system can work +properly only if the monitors maintain quorum, and quorum can be established +only if the monitors have discovered each other by means of their IP addresses. +Ceph has strict requirements on the discovery of monitors. + +Although the ``ceph.conf`` file is used by Ceph clients and other Ceph daemons +to discover monitors, the monitor map is used by monitors to discover each +other. This is why it is necessary to obtain the current ``monmap`` at the time +a new monitor is created: as can be seen above in `Adding a Monitor (Manual)`_, +the ``monmap`` is one of the arguments required by the ``ceph-mon -i {mon-id} +--mkfs`` command. The following sections explain the consistency requirements +for Ceph monitors, and also explain a number of safe ways to change a monitor's +IP address. + + +Consistency Requirements +------------------------ + +When a monitor discovers other monitors in the cluster, it always refers to the +local copy of the monitor map. Using the monitor map instead of using the +``ceph.conf`` file avoids errors that could break the cluster (for example, +typos or other slight errors in ``ceph.conf`` when a monitor address or port is +specified). Because monitors use monitor maps for discovery and because they +share monitor maps with Ceph clients and other Ceph daemons, the monitor map +provides monitors with a strict guarantee that their consensus is valid. + +Strict consistency also applies to updates to the monmap. As with any other +updates on the monitor, changes to the monmap always run through a distributed +consensus algorithm called `Paxos`_. The monitors must agree on each update to +the monmap, such as adding or removing a monitor, to ensure that each monitor +in the quorum has the same version of the monmap. 
There are additional advantages to using the monitor map rather than
``ceph.conf`` when monitors discover each other. Because ``ceph.conf`` is not
automatically updated and distributed, its use would bring certain risks:
monitors might use an outdated ``ceph.conf`` file, might fail to recognize a
specific monitor, might fall out of quorum, and might develop a situation in
which `Paxos`_ is unable to accurately ascertain the current state of the
system. Because of these risks, any changes to an existing monitor's IP address
must be made with great care.

.. _operations_add_or_rm_mons_changing_mon_ip:

Changing a Monitor's IP Address (Preferred Method)
--------------------------------------------------

If a monitor's IP address is changed only in the ``ceph.conf`` file, there is
no guarantee that the other monitors in the cluster will receive the update.
For this reason, the preferred method to change a monitor's IP address is as
follows: add a new monitor with the desired IP address (as described in `Adding
a Monitor (Manual)`_), make sure that the new monitor successfully joins the
quorum, remove the monitor that is using the old IP address, and update the
``ceph.conf`` file to ensure that clients and other daemons are made aware of
the new monitor's IP address.

For example, suppose that there are three monitors in place::

    [mon.a]
        host = host01
        addr = 10.0.0.1:6789
    [mon.b]
        host = host02
        addr = 10.0.0.2:6789
    [mon.c]
        host = host03
        addr = 10.0.0.3:6789

To change ``mon.c`` so that its name is ``host04`` and its IP address is
``10.0.0.4``: (1) follow the steps in `Adding a Monitor (Manual)`_ to add a new
monitor ``mon.d``, (2) make sure that ``mon.d`` is running before removing
``mon.c`` or else quorum will be broken, and (3) follow the steps in `Removing
a Monitor (Manual)`_ to remove ``mon.c``. To move all three monitors to new IP
addresses, repeat this process.

Changing a Monitor's IP Address (Advanced Method)
-------------------------------------------------

There are cases in which the method outlined in :ref:`Changing a Monitor's IP
Address (Preferred Method) <operations_add_or_rm_mons_changing_mon_ip>` cannot
be used. For example, it might be necessary to move the cluster's monitors to a
different network, to a different part of the datacenter, or to a different
datacenter altogether. It is still possible to change the monitors' IP
addresses, but a different method must be used.

For such cases, a new monitor map with updated IP addresses for every monitor
in the cluster must be generated and injected on each monitor. Although this
method is not particularly easy, such a major migration is unlikely to be a
routine task. As stated at the beginning of this section, existing monitors are
not supposed to change their IP addresses.

Continue with the monitor configuration in the example from :ref:`Changing a
Monitor's IP Address (Preferred Method)
<operations_add_or_rm_mons_changing_mon_ip>`. Suppose that all of the monitors
are to be moved from the ``10.0.0.x`` range to the ``10.1.0.x`` range, and that
these networks are unable to communicate. Carry out the following procedure:

#. 
Retrieve the monitor map (``{tmp}`` is the path to the retrieved monitor + map, and ``{filename}`` is the name of the file that contains the retrieved + monitor map): + + .. prompt:: bash $ + + ceph mon getmap -o {tmp}/{filename} + +#. Check the contents of the monitor map: + + .. prompt:: bash $ + + monmaptool --print {tmp}/{filename} + + :: + + monmaptool: monmap file {tmp}/{filename} + epoch 1 + fsid 224e376d-c5fe-4504-96bb-ea6332a19e61 + last_changed 2012-12-17 02:46:41.591248 + created 2012-12-17 02:46:41.591248 + 0: 10.0.0.1:6789/0 mon.a + 1: 10.0.0.2:6789/0 mon.b + 2: 10.0.0.3:6789/0 mon.c + +#. Remove the existing monitors from the monitor map: + + .. prompt:: bash $ + + monmaptool --rm a --rm b --rm c {tmp}/{filename} + + :: + + monmaptool: monmap file {tmp}/{filename} + monmaptool: removing a + monmaptool: removing b + monmaptool: removing c + monmaptool: writing epoch 1 to {tmp}/{filename} (0 monitors) + +#. Add the new monitor locations to the monitor map: + + .. prompt:: bash $ + + monmaptool --add a 10.1.0.1:6789 --add b 10.1.0.2:6789 --add c 10.1.0.3:6789 {tmp}/{filename} + + :: + + monmaptool: monmap file {tmp}/{filename} + monmaptool: writing epoch 1 to {tmp}/{filename} (3 monitors) + +#. Check the new contents of the monitor map: + + .. prompt:: bash $ + + monmaptool --print {tmp}/{filename} + + :: + + monmaptool: monmap file {tmp}/{filename} + epoch 1 + fsid 224e376d-c5fe-4504-96bb-ea6332a19e61 + last_changed 2012-12-17 02:46:41.591248 + created 2012-12-17 02:46:41.591248 + 0: 10.1.0.1:6789/0 mon.a + 1: 10.1.0.2:6789/0 mon.b + 2: 10.1.0.3:6789/0 mon.c + +At this point, we assume that the monitors (and stores) have been installed at +the new location. Next, propagate the modified monitor map to the new monitors, +and inject the modified monitor map into each new monitor. + +#. Make sure all of your monitors have been stopped. Never inject into a + monitor while the monitor daemon is running. + +#. Inject the monitor map: + + .. prompt:: bash $ + + ceph-mon -i {mon-id} --inject-monmap {tmp}/{filename} + +#. Restart all of the monitors. + +Migration to the new location is now complete. The monitors should operate +successfully. + + + +.. _Manual Deployment: ../../../install/manual-deployment +.. _Monitor Bootstrap: ../../../dev/mon-bootstrap +.. _Paxos: https://en.wikipedia.org/wiki/Paxos_(computer_science) + +.. |---| unicode:: U+2014 .. EM DASH + :trim: diff --git a/doc/rados/operations/add-or-rm-osds.rst b/doc/rados/operations/add-or-rm-osds.rst new file mode 100644 index 000000000..1a6621148 --- /dev/null +++ b/doc/rados/operations/add-or-rm-osds.rst @@ -0,0 +1,419 @@ +====================== + Adding/Removing OSDs +====================== + +When a cluster is up and running, it is possible to add or remove OSDs. + +Adding OSDs +=========== + +OSDs can be added to a cluster in order to expand the cluster's capacity and +resilience. Typically, an OSD is a Ceph ``ceph-osd`` daemon running on one +storage drive within a host machine. But if your host machine has multiple +storage drives, you may map one ``ceph-osd`` daemon for each drive on the +machine. + +It's a good idea to check the capacity of your cluster so that you know when it +approaches its capacity limits. If your cluster has reached its ``near full`` +ratio, then you should add OSDs to expand your cluster's capacity. + +.. warning:: Do not add an OSD after your cluster has reached its ``full + ratio``. 
OSD failures that occur after the cluster reaches its ``near full + ratio`` might cause the cluster to exceed its ``full ratio``. + + +Deploying your Hardware +----------------------- + +If you are also adding a new host when adding a new OSD, see `Hardware +Recommendations`_ for details on minimum recommendations for OSD hardware. To +add an OSD host to your cluster, begin by making sure that an appropriate +version of Linux has been installed on the host machine and that all initial +preparations for your storage drives have been carried out. For details, see +`Filesystem Recommendations`_. + +Next, add your OSD host to a rack in your cluster, connect the host to the +network, and ensure that the host has network connectivity. For details, see +`Network Configuration Reference`_. + + +.. _Hardware Recommendations: ../../../start/hardware-recommendations +.. _Filesystem Recommendations: ../../configuration/filesystem-recommendations +.. _Network Configuration Reference: ../../configuration/network-config-ref + +Installing the Required Software +-------------------------------- + +If your cluster has been manually deployed, you will need to install Ceph +software packages manually. For details, see `Installing Ceph (Manual)`_. +Configure SSH for the appropriate user to have both passwordless authentication +and root permissions. + +.. _Installing Ceph (Manual): ../../../install + + +Adding an OSD (Manual) +---------------------- + +The following procedure sets up a ``ceph-osd`` daemon, configures this OSD to +use one drive, and configures the cluster to distribute data to the OSD. If +your host machine has multiple drives, you may add an OSD for each drive on the +host by repeating this procedure. + +As the following procedure will demonstrate, adding an OSD involves creating a +metadata directory for it, configuring a data storage drive, adding the OSD to +the cluster, and then adding it to the CRUSH map. + +When you add the OSD to the CRUSH map, you will need to consider the weight you +assign to the new OSD. Since storage drive capacities increase over time, newer +OSD hosts are likely to have larger hard drives than the older hosts in the +cluster have and therefore might have greater weight as well. + +.. tip:: Ceph works best with uniform hardware across pools. It is possible to + add drives of dissimilar size and then adjust their weights accordingly. + However, for best performance, consider a CRUSH hierarchy that has drives of + the same type and size. It is better to add larger drives uniformly to + existing hosts. This can be done incrementally, replacing smaller drives + each time the new drives are added. + +#. Create the new OSD by running a command of the following form. If you opt + not to specify a UUID in this command, the UUID will be set automatically + when the OSD starts up. The OSD number, which is needed for subsequent + steps, is found in the command's output: + + .. prompt:: bash $ + + ceph osd create [{uuid} [{id}]] + + If the optional parameter {id} is specified it will be used as the OSD ID. + However, if the ID number is already in use, the command will fail. + + .. warning:: Explicitly specifying the ``{id}`` parameter is not + recommended. IDs are allocated as an array, and any skipping of entries + consumes extra memory. This memory consumption can become significant if + there are large gaps or if clusters are large. 
By leaving the ``{id}`` + parameter unspecified, we ensure that Ceph uses the smallest ID number + available and that these problems are avoided. + +#. Create the default directory for your new OSD by running commands of the + following form: + + .. prompt:: bash $ + + ssh {new-osd-host} + sudo mkdir /var/lib/ceph/osd/ceph-{osd-number} + +#. If the OSD will be created on a drive other than the OS drive, prepare it + for use with Ceph. Run commands of the following form: + + .. prompt:: bash $ + + ssh {new-osd-host} + sudo mkfs -t {fstype} /dev/{drive} + sudo mount -o user_xattr /dev/{hdd} /var/lib/ceph/osd/ceph-{osd-number} + +#. Initialize the OSD data directory by running commands of the following form: + + .. prompt:: bash $ + + ssh {new-osd-host} + ceph-osd -i {osd-num} --mkfs --mkkey + + Make sure that the directory is empty before running ``ceph-osd``. + +#. Register the OSD authentication key by running a command of the following + form: + + .. prompt:: bash $ + + ceph auth add osd.{osd-num} osd 'allow *' mon 'allow rwx' -i /var/lib/ceph/osd/ceph-{osd-num}/keyring + + This presentation of the command has ``ceph-{osd-num}`` in the listed path + because many clusters have the name ``ceph``. However, if your cluster name + is not ``ceph``, then the string ``ceph`` in ``ceph-{osd-num}`` needs to be + replaced with your cluster name. For example, if your cluster name is + ``cluster1``, then the path in the command should be + ``/var/lib/ceph/osd/cluster1-{osd-num}/keyring``. + +#. Add the OSD to the CRUSH map by running the following command. This allows + the OSD to begin receiving data. The ``ceph osd crush add`` command can add + OSDs to the CRUSH hierarchy wherever you want. If you specify one or more + buckets, the command places the OSD in the most specific of those buckets, + and it moves that bucket underneath any other buckets that you have + specified. **Important:** If you specify only the root bucket, the command + will attach the OSD directly to the root, but CRUSH rules expect OSDs to be + inside of hosts. If the OSDs are not inside hosts, the OSDS will likely not + receive any data. + + .. prompt:: bash $ + + ceph osd crush add {id-or-name} {weight} [{bucket-type}={bucket-name} ...] + + Note that there is another way to add a new OSD to the CRUSH map: decompile + the CRUSH map, add the OSD to the device list, add the host as a bucket (if + it is not already in the CRUSH map), add the device as an item in the host, + assign the device a weight, recompile the CRUSH map, and set the CRUSH map. + For details, see `Add/Move an OSD`_. This is rarely necessary with recent + releases (this sentence was written the month that Reef was released). + + +.. _rados-replacing-an-osd: + +Replacing an OSD +---------------- + +.. note:: If the procedure in this section does not work for you, try the + instructions in the ``cephadm`` documentation: + :ref:`cephadm-replacing-an-osd`. + +Sometimes OSDs need to be replaced: for example, when a disk fails, or when an +administrator wants to reprovision OSDs with a new back end (perhaps when +switching from Filestore to BlueStore). Replacing an OSD differs from `Removing +the OSD`_ in that the replaced OSD's ID and CRUSH map entry must be kept intact +after the OSD is destroyed for replacement. + + +#. Make sure that it is safe to destroy the OSD: + + .. prompt:: bash $ + + while ! ceph osd safe-to-destroy osd.{id} ; do sleep 10 ; done + +#. Destroy the OSD: + + .. prompt:: bash $ + + ceph osd destroy {id} --yes-i-really-mean-it + +#. 
*Optional*: If the disk that you plan to use is not a new disk and has been + used before for other purposes, zap the disk: + + .. prompt:: bash $ + + ceph-volume lvm zap /dev/sdX + +#. Prepare the disk for replacement by using the ID of the OSD that was + destroyed in previous steps: + + .. prompt:: bash $ + + ceph-volume lvm prepare --osd-id {id} --data /dev/sdX + +#. Finally, activate the OSD: + + .. prompt:: bash $ + + ceph-volume lvm activate {id} {fsid} + +Alternatively, instead of carrying out the final two steps (preparing the disk +and activating the OSD), you can re-create the OSD by running a single command +of the following form: + + .. prompt:: bash $ + + ceph-volume lvm create --osd-id {id} --data /dev/sdX + +Starting the OSD +---------------- + +After an OSD is added to Ceph, the OSD is in the cluster. However, until it is +started, the OSD is considered ``down`` and ``in``. The OSD is not running and +will be unable to receive data. To start an OSD, either run ``service ceph`` +from your admin host or run a command of the following form to start the OSD +from its host machine: + + .. prompt:: bash $ + + sudo systemctl start ceph-osd@{osd-num} + +After the OSD is started, it is considered ``up`` and ``in``. + +Observing the Data Migration +---------------------------- + +After the new OSD has been added to the CRUSH map, Ceph begins rebalancing the +cluster by migrating placement groups (PGs) to the new OSD. To observe this +process by using the `ceph`_ tool, run the following command: + + .. prompt:: bash $ + + ceph -w + +Or: + + .. prompt:: bash $ + + watch ceph status + +The PG states will first change from ``active+clean`` to ``active, some +degraded objects`` and then return to ``active+clean`` when migration +completes. When you are finished observing, press Ctrl-C to exit. + +.. _Add/Move an OSD: ../crush-map#addosd +.. _ceph: ../monitoring + + +Removing OSDs (Manual) +====================== + +It is possible to remove an OSD manually while the cluster is running: you +might want to do this in order to reduce the size of the cluster or when +replacing hardware. Typically, an OSD is a Ceph ``ceph-osd`` daemon running on +one storage drive within a host machine. Alternatively, if your host machine +has multiple storage drives, you might need to remove multiple ``ceph-osd`` +daemons: one daemon for each drive on the machine. + +.. warning:: Before you begin the process of removing an OSD, make sure that + your cluster is not near its ``full ratio``. Otherwise the act of removing + OSDs might cause the cluster to reach or exceed its ``full ratio``. + + +Taking the OSD ``out`` of the Cluster +------------------------------------- + +OSDs are typically ``up`` and ``in`` before they are removed from the cluster. +Before the OSD can be removed from the cluster, the OSD must be taken ``out`` +of the cluster so that Ceph can begin rebalancing and copying its data to other +OSDs. To take an OSD ``out`` of the cluster, run a command of the following +form: + + .. prompt:: bash $ + + ceph osd out {osd-num} + + +Observing the Data Migration +---------------------------- + +After the OSD has been taken ``out`` of the cluster, Ceph begins rebalancing +the cluster by migrating placement groups out of the OSD that was removed. To +observe this process by using the `ceph`_ tool, run the following command: + + .. 
prompt:: bash $ + + ceph -w + +The PG states will change from ``active+clean`` to ``active, some degraded +objects`` and will then return to ``active+clean`` when migration completes. +When you are finished observing, press Ctrl-C to exit. + +.. note:: Under certain conditions, the action of taking ``out`` an OSD + might lead CRUSH to encounter a corner case in which some PGs remain stuck + in the ``active+remapped`` state. This problem sometimes occurs in small + clusters with few hosts (for example, in a small testing cluster). To + address this problem, mark the OSD ``in`` by running a command of the + following form: + + .. prompt:: bash $ + + ceph osd in {osd-num} + + After the OSD has come back to its initial state, do not mark the OSD + ``out`` again. Instead, set the OSD's weight to ``0`` by running a command + of the following form: + + .. prompt:: bash $ + + ceph osd crush reweight osd.{osd-num} 0 + + After the OSD has been reweighted, observe the data migration and confirm + that it has completed successfully. The difference between marking an OSD + ``out`` and reweighting the OSD to ``0`` has to do with the bucket that + contains the OSD. When an OSD is marked ``out``, the weight of the bucket is + not changed. But when an OSD is reweighted to ``0``, the weight of the + bucket is updated (namely, the weight of the OSD is subtracted from the + overall weight of the bucket). When operating small clusters, it can + sometimes be preferable to use the above reweight command. + + +Stopping the OSD +---------------- + +After you take an OSD ``out`` of the cluster, the OSD might still be running. +In such a case, the OSD is ``up`` and ``out``. Before it is removed from the +cluster, the OSD must be stopped by running commands of the following form: + + .. prompt:: bash $ + + ssh {osd-host} + sudo systemctl stop ceph-osd@{osd-num} + +After the OSD has been stopped, it is ``down``. + + +Removing the OSD +---------------- + +The following procedure removes an OSD from the cluster map, removes the OSD's +authentication key, removes the OSD from the OSD map, and removes the OSD from +the ``ceph.conf`` file. If your host has multiple drives, it might be necessary +to remove an OSD from each drive by repeating this procedure. + +#. Begin by having the cluster forget the OSD. This step removes the OSD from + the CRUSH map, removes the OSD's authentication key, and removes the OSD + from the OSD map. (The :ref:`purge subcommand <ceph-admin-osd>` was + introduced in Luminous. For older releases, see :ref:`the procedure linked + here <ceph_osd_purge_procedure_pre_luminous>`.): + + .. prompt:: bash $ + + ceph osd purge {id} --yes-i-really-mean-it + + +#. Navigate to the host where the master copy of the cluster's + ``ceph.conf`` file is kept: + + .. prompt:: bash $ + + ssh {admin-host} + cd /etc/ceph + vim ceph.conf + +#. Remove the OSD entry from your ``ceph.conf`` file (if such an entry + exists):: + + [osd.1] + host = {hostname} + +#. Copy the updated ``ceph.conf`` file from the location on the host where the + master copy of the cluster's ``ceph.conf`` is kept to the ``/etc/ceph`` + directory of the other hosts in your cluster. + +.. _ceph_osd_purge_procedure_pre_luminous: + +If your Ceph cluster is older than Luminous, you will be unable to use the +``ceph osd purge`` command. Instead, carry out the following procedure: + +#. Remove the OSD from the CRUSH map so that it no longer receives data (for + more details, see `Remove an OSD`_): + + .. 
prompt:: bash $ + + ceph osd crush remove {name} + + Instead of removing the OSD from the CRUSH map, you might opt for one of two + alternatives: (1) decompile the CRUSH map, remove the OSD from the device + list, and remove the device from the host bucket; (2) remove the host bucket + from the CRUSH map (provided that it is in the CRUSH map and that you intend + to remove the host), recompile the map, and set it: + + +#. Remove the OSD authentication key: + + .. prompt:: bash $ + + ceph auth del osd.{osd-num} + +#. Remove the OSD: + + .. prompt:: bash $ + + ceph osd rm {osd-num} + + For example: + + .. prompt:: bash $ + + ceph osd rm 1 + +.. _Remove an OSD: ../crush-map#removeosd diff --git a/doc/rados/operations/balancer.rst b/doc/rados/operations/balancer.rst new file mode 100644 index 000000000..aa4eab93c --- /dev/null +++ b/doc/rados/operations/balancer.rst @@ -0,0 +1,221 @@ +.. _balancer: + +Balancer Module +======================= + +The *balancer* can optimize the allocation of placement groups (PGs) across +OSDs in order to achieve a balanced distribution. The balancer can operate +either automatically or in a supervised fashion. + + +Status +------ + +To check the current status of the balancer, run the following command: + + .. prompt:: bash $ + + ceph balancer status + + +Automatic balancing +------------------- + +When the balancer is in ``upmap`` mode, the automatic balancing feature is +enabled by default. For more details, see :ref:`upmap`. To disable the +balancer, run the following command: + + .. prompt:: bash $ + + ceph balancer off + +The balancer mode can be changed from ``upmap`` mode to ``crush-compat`` mode. +``crush-compat`` mode is backward compatible with older clients. In +``crush-compat`` mode, the balancer automatically makes small changes to the +data distribution in order to ensure that OSDs are utilized equally. + + +Throttling +---------- + +If the cluster is degraded (that is, if an OSD has failed and the system hasn't +healed itself yet), then the balancer will not make any adjustments to the PG +distribution. + +When the cluster is healthy, the balancer will incrementally move a small +fraction of unbalanced PGs in order to improve distribution. This fraction +will not exceed a certain threshold that defaults to 5%. To adjust this +``target_max_misplaced_ratio`` threshold setting, run the following command: + + .. prompt:: bash $ + + ceph config set mgr target_max_misplaced_ratio .07 # 7% + +The balancer sleeps between runs. To set the number of seconds for this +interval of sleep, run the following command: + + .. prompt:: bash $ + + ceph config set mgr mgr/balancer/sleep_interval 60 + +To set the time of day (in HHMM format) at which automatic balancing begins, +run the following command: + + .. prompt:: bash $ + + ceph config set mgr mgr/balancer/begin_time 0000 + +To set the time of day (in HHMM format) at which automatic balancing ends, run +the following command: + + .. prompt:: bash $ + + ceph config set mgr mgr/balancer/end_time 2359 + +Automatic balancing can be restricted to certain days of the week. To restrict +it to a specific day of the week or later (as with crontab, ``0`` is Sunday, +``1`` is Monday, and so on), run the following command: + + .. prompt:: bash $ + + ceph config set mgr mgr/balancer/begin_weekday 0 + +To restrict automatic balancing to a specific day of the week or earlier +(again, ``0`` is Sunday, ``1`` is Monday, and so on), run the following +command: + + .. 
prompt:: bash $ + + ceph config set mgr mgr/balancer/end_weekday 6 + +Automatic balancing can be restricted to certain pools. By default, the value +of this setting is an empty string, so that all pools are automatically +balanced. To restrict automatic balancing to specific pools, retrieve their +numeric pool IDs (by running the :command:`ceph osd pool ls detail` command), +and then run the following command: + + .. prompt:: bash $ + + ceph config set mgr mgr/balancer/pool_ids 1,2,3 + + +Modes +----- + +There are two supported balancer modes: + +#. **crush-compat**. This mode uses the compat weight-set feature (introduced + in Luminous) to manage an alternative set of weights for devices in the + CRUSH hierarchy. When the balancer is operating in this mode, the normal + weights should remain set to the size of the device in order to reflect the + target amount of data intended to be stored on the device. The balancer will + then optimize the weight-set values, adjusting them up or down in small + increments, in order to achieve a distribution that matches the target + distribution as closely as possible. (Because PG placement is a pseudorandom + process, it is subject to a natural amount of variation; optimizing the + weights serves to counteract that natural variation.) + + Note that this mode is *fully backward compatible* with older clients: when + an OSD Map and CRUSH map are shared with older clients, Ceph presents the + optimized weights as the "real" weights. + + The primary limitation of this mode is that the balancer cannot handle + multiple CRUSH hierarchies with different placement rules if the subtrees of + the hierarchy share any OSDs. (Such sharing of OSDs is not typical and, + because of the difficulty of managing the space utilization on the shared + OSDs, is generally not recommended.) + +#. **upmap**. In Luminous and later releases, the OSDMap can store explicit + mappings for individual OSDs as exceptions to the normal CRUSH placement + calculation. These ``upmap`` entries provide fine-grained control over the + PG mapping. This balancer mode optimizes the placement of individual PGs in + order to achieve a balanced distribution. In most cases, the resulting + distribution is nearly perfect: that is, there is an equal number of PGs on + each OSD (±1 PG, since the total number might not divide evenly). + + To use ``upmap``, all clients must be Luminous or newer. + +The default mode is ``upmap``. The mode can be changed to ``crush-compat`` by +running the following command: + + .. prompt:: bash $ + + ceph balancer mode crush-compat + +Supervised optimization +----------------------- + +Supervised use of the balancer can be understood in terms of three distinct +phases: + +#. building a plan +#. evaluating the quality of the data distribution, either for the current PG + distribution or for the PG distribution that would result after executing a + plan +#. executing the plan + +To evaluate the current distribution, run the following command: + + .. prompt:: bash $ + + ceph balancer eval + +To evaluate the distribution for a single pool, run the following command: + + .. prompt:: bash $ + + ceph balancer eval <pool-name> + +To see the evaluation in greater detail, run the following command: + + .. prompt:: bash $ + + ceph balancer eval-verbose ... + +To instruct the balancer to generate a plan (using the currently configured +mode), make up a name (any useful identifying string) for the plan, and run the +following command: + + .. 
prompt:: bash $ + + ceph balancer optimize <plan-name> + +To see the contents of a plan, run the following command: + + .. prompt:: bash $ + + ceph balancer show <plan-name> + +To display all plans, run the following command: + + .. prompt:: bash $ + + ceph balancer ls + +To discard an old plan, run the following command: + + .. prompt:: bash $ + + ceph balancer rm <plan-name> + +To see currently recorded plans, examine the output of the following status +command: + + .. prompt:: bash $ + + ceph balancer status + +To evaluate the distribution that would result from executing a specific plan, +run the following command: + + .. prompt:: bash $ + + ceph balancer eval <plan-name> + +If a plan is expected to improve the distribution (that is, the plan's score is +lower than the current cluster state's score), you can execute that plan by +running the following command: + + .. prompt:: bash $ + + ceph balancer execute <plan-name> diff --git a/doc/rados/operations/bluestore-migration.rst b/doc/rados/operations/bluestore-migration.rst new file mode 100644 index 000000000..d24782c46 --- /dev/null +++ b/doc/rados/operations/bluestore-migration.rst @@ -0,0 +1,357 @@ +.. _rados_operations_bluestore_migration: + +===================== + BlueStore Migration +===================== +.. warning:: Filestore has been deprecated in the Reef release and is no longer supported. + Please migrate to BlueStore. + +Each OSD must be formatted as either Filestore or BlueStore. However, a Ceph +cluster can operate with a mixture of both Filestore OSDs and BlueStore OSDs. +Because BlueStore is superior to Filestore in performance and robustness, and +because Filestore is not supported by Ceph releases beginning with Reef, users +deploying Filestore OSDs should transition to BlueStore. There are several +strategies for making the transition to BlueStore. + +BlueStore is so different from Filestore that an individual OSD cannot be +converted in place. Instead, the conversion process must use either (1) the +cluster's normal replication and healing support, or (2) tools and strategies +that copy OSD content from an old (Filestore) device to a new (BlueStore) one. + +Deploying new OSDs with BlueStore +================================= + +Use BlueStore when deploying new OSDs (for example, when the cluster is +expanded). Because this is the default behavior, no specific change is +needed. + +Similarly, use BlueStore for any OSDs that have been reprovisioned after +a failed drive was replaced. + +Converting existing OSDs +======================== + +"Mark-``out``" replacement +-------------------------- + +The simplest approach is to verify that the cluster is healthy and +then follow these steps for each Filestore OSD in succession: mark the OSD +``out``, wait for the data to replicate across the cluster, reprovision the OSD, +mark the OSD back ``in``, and wait for recovery to complete before proceeding +to the next OSD. This approach is easy to automate, but it entails unnecessary +data migration that carries costs in time and SSD wear. + +#. Identify a Filestore OSD to replace:: + + ID=<osd-id-number> + DEVICE=<disk-device> + + #. Determine whether a given OSD is Filestore or BlueStore: + + .. prompt:: bash $ + + ceph osd metadata $ID | grep osd_objectstore + + #. Get a current count of Filestore and BlueStore OSDs: + + .. prompt:: bash $ + + ceph osd count-metadata osd_objectstore + +#. Mark a Filestore OSD ``out``: + + .. prompt:: bash $ + + ceph osd out $ID + +#. Wait for the data to migrate off this OSD: + + .. 
prompt:: bash $ + + while ! ceph osd safe-to-destroy $ID ; do sleep 60 ; done + +#. Stop the OSD: + + .. prompt:: bash $ + + systemctl kill ceph-osd@$ID + + .. _osd_id_retrieval: + +#. Note which device the OSD is using: + + .. prompt:: bash $ + + mount | grep /var/lib/ceph/osd/ceph-$ID + +#. Unmount the OSD: + + .. prompt:: bash $ + + umount /var/lib/ceph/osd/ceph-$ID + +#. Destroy the OSD's data. Be *EXTREMELY CAREFUL*! These commands will destroy + the contents of the device; you must be certain that the data on the device is + not needed (in other words, that the cluster is healthy) before proceeding: + + .. prompt:: bash $ + + ceph-volume lvm zap $DEVICE + +#. Tell the cluster that the OSD has been destroyed (and that a new OSD can be + reprovisioned with the same OSD ID): + + .. prompt:: bash $ + + ceph osd destroy $ID --yes-i-really-mean-it + +#. Provision a BlueStore OSD in place by using the same OSD ID. This requires + you to identify which device to wipe, and to make certain that you target + the correct and intended device, using the information that was retrieved in + the :ref:`"Note which device the OSD is using" <osd_id_retrieval>` step. BE + CAREFUL! Note that you may need to modify these commands when dealing with + hybrid OSDs: + + .. prompt:: bash $ + + ceph-volume lvm create --bluestore --data $DEVICE --osd-id $ID + +#. Repeat. + +You may opt to (1) have the balancing of the replacement BlueStore OSD take +place concurrently with the draining of the next Filestore OSD, or instead +(2) follow the same procedure for multiple OSDs in parallel. In either case, +however, you must ensure that the cluster is fully clean (in other words, that +all data has all replicas) before destroying any OSDs. If you opt to reprovision +multiple OSDs in parallel, be **very** careful to destroy OSDs only within a +single CRUSH failure domain (for example, ``host`` or ``rack``). Failure to +satisfy this requirement will reduce the redundancy and availability of your +data and increase the risk of data loss (or even guarantee data loss). + +Advantages: + +* Simple. +* Can be done on a device-by-device basis. +* No spare devices or hosts are required. + +Disadvantages: + +* Data is copied over the network twice: once to another OSD in the cluster (to + maintain the specified number of replicas), and again back to the + reprovisioned BlueStore OSD. + +"Whole host" replacement +------------------------ + +If you have a spare host in the cluster, or sufficient free space to evacuate +an entire host for use as a spare, then the conversion can be done on a +host-by-host basis so that each stored copy of the data is migrated only once. + +To use this approach, you need an empty host that has no OSDs provisioned. +There are two ways to do this: either by using a new, empty host that is not +yet part of the cluster, or by offloading data from an existing host that is +already part of the cluster. + +Using a new, empty host +^^^^^^^^^^^^^^^^^^^^^^^ + +Ideally the host will have roughly the same capacity as each of the other hosts +you will be converting. Add the host to the CRUSH hierarchy, but do not attach +it to the root: + + +.. prompt:: bash $ + + NEWHOST=<empty-host-name> + ceph osd crush add-bucket $NEWHOST host + +Make sure that Ceph packages are installed on the new host. 
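
To confirm that the new bucket exists and is not yet attached beneath the
``default`` root, you can inspect the OSD tree. The host name and bucket ID
shown here are only illustrative; the new host should appear near the top of
the output with no parent and no weight:

.. prompt:: bash $

   ceph osd tree

::

   ID  CLASS  WEIGHT   TYPE NAME         STATUS  REWEIGHT  PRI-AFF
   -9         0        host newhost
   -1         3.00000  root default
   ...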
+ +Using an existing host +^^^^^^^^^^^^^^^^^^^^^^ + +If you would like to use an existing host that is already part of the cluster, +and if there is sufficient free space on that host so that all of its data can +be migrated off to other cluster hosts, you can do the following (instead of +using a new, empty host): + +.. prompt:: bash $ + + OLDHOST=<existing-cluster-host-to-offload> + ceph osd crush unlink $OLDHOST default + +where "default" is the immediate ancestor in the CRUSH map. (For +smaller clusters with unmodified configurations this will normally +be "default", but it might instead be a rack name.) You should now +see the host at the top of the OSD tree output with no parent: + +.. prompt:: bash $ + + bin/ceph osd tree + +:: + + ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF + -5 0 host oldhost + 10 ssd 1.00000 osd.10 up 1.00000 1.00000 + 11 ssd 1.00000 osd.11 up 1.00000 1.00000 + 12 ssd 1.00000 osd.12 up 1.00000 1.00000 + -1 3.00000 root default + -2 3.00000 host foo + 0 ssd 1.00000 osd.0 up 1.00000 1.00000 + 1 ssd 1.00000 osd.1 up 1.00000 1.00000 + 2 ssd 1.00000 osd.2 up 1.00000 1.00000 + ... + +If everything looks good, jump directly to the :ref:`"Wait for the data +migration to complete" <bluestore_data_migration_step>` step below and proceed +from there to clean up the old OSDs. + +Migration process +^^^^^^^^^^^^^^^^^ + +If you're using a new host, start at :ref:`the first step +<bluestore_migration_process_first_step>`. If you're using an existing host, +jump to :ref:`this step <bluestore_data_migration_step>`. + +.. _bluestore_migration_process_first_step: + +#. Provision new BlueStore OSDs for all devices: + + .. prompt:: bash $ + + ceph-volume lvm create --bluestore --data /dev/$DEVICE + +#. Verify that the new OSDs have joined the cluster: + + .. prompt:: bash $ + + ceph osd tree + + You should see the new host ``$NEWHOST`` with all of the OSDs beneath + it, but the host should *not* be nested beneath any other node in the + hierarchy (like ``root default``). For example, if ``newhost`` is + the empty host, you might see something like:: + + $ bin/ceph osd tree + ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF + -5 0 host newhost + 10 ssd 1.00000 osd.10 up 1.00000 1.00000 + 11 ssd 1.00000 osd.11 up 1.00000 1.00000 + 12 ssd 1.00000 osd.12 up 1.00000 1.00000 + -1 3.00000 root default + -2 3.00000 host oldhost1 + 0 ssd 1.00000 osd.0 up 1.00000 1.00000 + 1 ssd 1.00000 osd.1 up 1.00000 1.00000 + 2 ssd 1.00000 osd.2 up 1.00000 1.00000 + ... + +#. Identify the first target host to convert : + + .. prompt:: bash $ + + OLDHOST=<existing-cluster-host-to-convert> + +#. Swap the new host into the old host's position in the cluster: + + .. prompt:: bash $ + + ceph osd crush swap-bucket $NEWHOST $OLDHOST + + At this point all data on ``$OLDHOST`` will begin migrating to the OSDs on + ``$NEWHOST``. If there is a difference between the total capacity of the + old hosts and the total capacity of the new hosts, you may also see some + data migrate to or from other nodes in the cluster. Provided that the hosts + are similarly sized, however, this will be a relatively small amount of + data. + + .. _bluestore_data_migration_step: + +#. Wait for the data migration to complete: + + .. prompt:: bash $ + + while ! ceph osd safe-to-destroy $(ceph osd ls-tree $OLDHOST); do sleep 60 ; done + +#. Stop all old OSDs on the now-empty ``$OLDHOST``: + + .. prompt:: bash $ + + ssh $OLDHOST + systemctl kill ceph-osd.target + umount /var/lib/ceph/osd/ceph-* + +#. 
Destroy and purge the old OSDs: + + .. prompt:: bash $ + + for osd in `ceph osd ls-tree $OLDHOST`; do + ceph osd purge $osd --yes-i-really-mean-it + done + +#. Wipe the old OSDs. This requires you to identify which devices are to be + wiped manually. BE CAREFUL! For each device: + + .. prompt:: bash $ + + ceph-volume lvm zap $DEVICE + +#. Use the now-empty host as the new host, and repeat: + + .. prompt:: bash $ + + NEWHOST=$OLDHOST + +Advantages: + +* Data is copied over the network only once. +* An entire host's OSDs are converted at once. +* Can be parallelized, to make possible the conversion of multiple hosts at the same time. +* No host involved in this process needs to have a spare device. + +Disadvantages: + +* A spare host is required. +* An entire host's worth of OSDs will be migrating data at a time. This + is likely to impact overall cluster performance. +* All migrated data still makes one full hop over the network. + +Per-OSD device copy +------------------- +A single logical OSD can be converted by using the ``copy`` function +included in ``ceph-objectstore-tool``. This requires that the host have one or more free +devices to provision a new, empty BlueStore OSD. For +example, if each host in your cluster has twelve OSDs, then you need a +thirteenth unused OSD so that each OSD can be converted before the +previous OSD is reclaimed to convert the next OSD. + +Caveats: + +* This approach requires that we prepare an empty BlueStore OSD but that we do not allocate + a new OSD ID to it. The ``ceph-volume`` tool does not support such an operation. **IMPORTANT:** + because the setup of *dmcrypt* is closely tied to the identity of the OSD, this approach does not + work with encrypted OSDs. + +* The device must be manually partitioned. + +* An unsupported user-contributed script that demonstrates this process may be found here: + https://github.com/ceph/ceph/blob/master/src/script/contrib/ceph-migrate-bluestore.bash + +Advantages: + +* Provided that the 'noout' or the 'norecover'/'norebalance' flags are set on the OSD or the + cluster while the conversion process is underway, little or no data migrates over the + network during the conversion. + +Disadvantages: + +* Tooling is not fully implemented, supported, or documented. + +* Each host must have an appropriate spare or empty device for staging. + +* The OSD is offline during the conversion, which means new writes to PGs + with the OSD in their acting set may not be ideally redundant until the + subject OSD comes up and recovers. This increases the risk of data + loss due to an overlapping failure. However, if another OSD fails before + conversion and startup have completed, the original Filestore OSD can be + started to provide access to its original data. diff --git a/doc/rados/operations/cache-tiering.rst b/doc/rados/operations/cache-tiering.rst new file mode 100644 index 000000000..127b0141f --- /dev/null +++ b/doc/rados/operations/cache-tiering.rst @@ -0,0 +1,557 @@ +=============== + Cache Tiering +=============== + +.. warning:: Cache tiering has been deprecated in the Reef release as it + has lacked a maintainer for a very long time. This does not mean + it will be certainly removed, but we may choose to remove it + without much further notice. + +A cache tier provides Ceph Clients with better I/O performance for a subset of +the data stored in a backing storage tier. 
Cache tiering involves creating a +pool of relatively fast/expensive storage devices (e.g., solid state drives) +configured to act as a cache tier, and a backing pool of either erasure-coded +or relatively slower/cheaper devices configured to act as an economical storage +tier. The Ceph objecter handles where to place the objects and the tiering +agent determines when to flush objects from the cache to the backing storage +tier. So the cache tier and the backing storage tier are completely transparent +to Ceph clients. + + +.. ditaa:: + +-------------+ + | Ceph Client | + +------+------+ + ^ + Tiering is | + Transparent | Faster I/O + to Ceph | +---------------+ + Client Ops | | | + | +----->+ Cache Tier | + | | | | + | | +-----+---+-----+ + | | | ^ + v v | | Active Data in Cache Tier + +------+----+--+ | | + | Objecter | | | + +-----------+--+ | | + ^ | | Inactive Data in Storage Tier + | v | + | +-----+---+-----+ + | | | + +----->| Storage Tier | + | | + +---------------+ + Slower I/O + + +The cache tiering agent handles the migration of data between the cache tier +and the backing storage tier automatically. However, admins have the ability to +configure how this migration takes place by setting the ``cache-mode``. There are +two main scenarios: + +- **writeback** mode: If the base tier and the cache tier are configured in + ``writeback`` mode, Ceph clients receive an ACK from the base tier every time + they write data to it. Then the cache tiering agent determines whether + ``osd_tier_default_cache_min_write_recency_for_promote`` has been set. If it + has been set and the data has been written more than a specified number of + times per interval, the data is promoted to the cache tier. + + When Ceph clients need access to data stored in the base tier, the cache + tiering agent reads the data from the base tier and returns it to the client. + While data is being read from the base tier, the cache tiering agent consults + the value of ``osd_tier_default_cache_min_read_recency_for_promote`` and + decides whether to promote that data from the base tier to the cache tier. + When data has been promoted from the base tier to the cache tier, the Ceph + client is able to perform I/O operations on it using the cache tier. This is + well-suited for mutable data (for example, photo/video editing, transactional + data). + +- **readproxy** mode: This mode will use any objects that already + exist in the cache tier, but if an object is not present in the + cache the request will be proxied to the base tier. This is useful + for transitioning from ``writeback`` mode to a disabled cache as it + allows the workload to function properly while the cache is drained, + without adding any new objects to the cache. + +Other cache modes are: + +- **readonly** promotes objects to the cache on read operations only; write + operations are forwarded to the base tier. This mode is intended for + read-only workloads that do not require consistency to be enforced by the + storage system. (**Warning**: when objects are updated in the base tier, + Ceph makes **no** attempt to sync these updates to the corresponding objects + in the cache. Since this mode is considered experimental, a + ``--yes-i-really-mean-it`` option must be passed in order to enable it.) + +- **none** is used to completely disable caching. + + +A word of caution +================= + +Cache tiering will *degrade* performance for most workloads. Users should use +extreme caution before using this feature. 

* *Workload dependent*: Whether a cache will improve performance is
  highly dependent on the workload. Because there is a cost
  associated with moving objects into or out of the cache, it can only
  be effective when there is a *large skew* in the access pattern in
  the data set, such that most of the requests touch a small number of
  objects. The cache pool should be large enough to capture the
  working set for your workload to avoid thrashing.

* *Difficult to benchmark*: Most benchmarks that users run to measure
  performance will show terrible performance with cache tiering, in
  part because very few of them skew requests toward a small set of
  objects, because it can take a long time for the cache to "warm up,"
  and because the warm-up cost can be high.

* *Usually slower*: For workloads that are not cache tiering-friendly,
  performance is often slower than that of a normal RADOS pool without
  cache tiering enabled.

* *librados object enumeration*: The librados-level object enumeration
  API is not meant to be coherent in the presence of a cache. If
  your application is using librados directly and relies on object
  enumeration, cache tiering will probably not work as expected.
  (This is not a problem for RGW, RBD, or CephFS.)

* *Complexity*: Enabling cache tiering means that a lot of additional
  machinery and complexity within the RADOS cluster is being used.
  This increases the probability that you will encounter a bug in the system
  that other users have not yet encountered and will put your deployment at a
  higher level of risk.

Known Good Workloads
--------------------

* *RGW time-skewed*: If the RGW workload is such that almost all read
  operations are directed at recently written objects, a simple cache
  tiering configuration that destages recently written objects from
  the cache to the base tier after a configurable period can work
  well.

Known Bad Workloads
-------------------

The following configurations are *known to work poorly* with cache
tiering.

* *RBD with replicated cache and erasure-coded base*: This is a common
  request, but usually does not perform well. Even reasonably skewed
  workloads still send some small writes to cold objects, and because
  small writes are not yet supported by the erasure-coded pool, entire
  (usually 4 MB) objects must be migrated into the cache in order to
  satisfy a small (often 4 KB) write. Only a handful of users have
  successfully deployed this configuration, and it only works for them
  because their data is extremely cold (backups) and they are not in
  any way sensitive to performance.

* *RBD with replicated cache and base*: RBD with a replicated base
  tier does better than when the base is erasure coded, but it is
  still highly dependent on the amount of skew in the workload, and
  very difficult to validate. The user will need to have a good
  understanding of their workload and will need to tune the cache
  tiering parameters carefully.


Setting Up Pools
================

To set up cache tiering, you must have two pools. One will act as the
backing storage and the other will act as the cache.


Setting Up a Backing Storage Pool
---------------------------------

Setting up a backing storage pool typically involves one of two scenarios:

- **Standard Storage**: In this scenario, the pool stores multiple copies
  of an object in the Ceph Storage Cluster.
+ +- **Erasure Coding:** In this scenario, the pool uses erasure coding to + store data much more efficiently with a small performance tradeoff. + +In the standard storage scenario, you can setup a CRUSH rule to establish +the failure domain (e.g., osd, host, chassis, rack, row, etc.). Ceph OSD +Daemons perform optimally when all storage drives in the rule are of the +same size, speed (both RPMs and throughput) and type. See `CRUSH Maps`_ +for details on creating a rule. Once you have created a rule, create +a backing storage pool. + +In the erasure coding scenario, the pool creation arguments will generate the +appropriate rule automatically. See `Create a Pool`_ for details. + +In subsequent examples, we will refer to the backing storage pool +as ``cold-storage``. + + +Setting Up a Cache Pool +----------------------- + +Setting up a cache pool follows the same procedure as the standard storage +scenario, but with this difference: the drives for the cache tier are typically +high performance drives that reside in their own servers and have their own +CRUSH rule. When setting up such a rule, it should take account of the hosts +that have the high performance drives while omitting the hosts that don't. See +:ref:`CRUSH Device Class<crush-map-device-class>` for details. + + +In subsequent examples, we will refer to the cache pool as ``hot-storage`` and +the backing pool as ``cold-storage``. + +For cache tier configuration and default values, see +`Pools - Set Pool Values`_. + + +Creating a Cache Tier +===================== + +Setting up a cache tier involves associating a backing storage pool with +a cache pool: + +.. prompt:: bash $ + + ceph osd tier add {storagepool} {cachepool} + +For example: + +.. prompt:: bash $ + + ceph osd tier add cold-storage hot-storage + +To set the cache mode, execute the following: + +.. prompt:: bash $ + + ceph osd tier cache-mode {cachepool} {cache-mode} + +For example: + +.. prompt:: bash $ + + ceph osd tier cache-mode hot-storage writeback + +The cache tiers overlay the backing storage tier, so they require one +additional step: you must direct all client traffic from the storage pool to +the cache pool. To direct client traffic directly to the cache pool, execute +the following: + +.. prompt:: bash $ + + ceph osd tier set-overlay {storagepool} {cachepool} + +For example: + +.. prompt:: bash $ + + ceph osd tier set-overlay cold-storage hot-storage + + +Configuring a Cache Tier +======================== + +Cache tiers have several configuration options. You may set +cache tier configuration options with the following usage: + +.. prompt:: bash $ + + ceph osd pool set {cachepool} {key} {value} + +See `Pools - Set Pool Values`_ for details. + + +Target Size and Type +-------------------- + +Ceph's production cache tiers use a `Bloom Filter`_ for the ``hit_set_type``: + +.. prompt:: bash $ + + ceph osd pool set {cachepool} hit_set_type bloom + +For example: + +.. prompt:: bash $ + + ceph osd pool set hot-storage hit_set_type bloom + +The ``hit_set_count`` and ``hit_set_period`` define how many such HitSets to +store, and how much time each HitSet should cover: + +.. prompt:: bash $ + + ceph osd pool set {cachepool} hit_set_count 12 + ceph osd pool set {cachepool} hit_set_period 14400 + ceph osd pool set {cachepool} target_max_bytes 1000000000000 + +.. note:: A larger ``hit_set_count`` results in more RAM consumed by + the ``ceph-osd`` process. 
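
For example, the following commands apply those settings to the ``hot-storage``
example pool used throughout this section. The values shown are only
illustrative starting points, not tuned recommendations:

.. prompt:: bash $

   ceph osd pool set hot-storage hit_set_count 12
   ceph osd pool set hot-storage hit_set_period 14400
   ceph osd pool set hot-storage target_max_bytes 1000000000000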
+ +Binning accesses over time allows Ceph to determine whether a Ceph client +accessed an object at least once, or more than once over a time period +("age" vs "temperature"). + +The ``min_read_recency_for_promote`` defines how many HitSets to check for the +existence of an object when handling a read operation. The checking result is +used to decide whether to promote the object asynchronously. Its value should be +between 0 and ``hit_set_count``. If it's set to 0, the object is always promoted. +If it's set to 1, the current HitSet is checked. And if this object is in the +current HitSet, it's promoted. Otherwise not. For the other values, the exact +number of archive HitSets are checked. The object is promoted if the object is +found in any of the most recent ``min_read_recency_for_promote`` HitSets. + +A similar parameter can be set for the write operation, which is +``min_write_recency_for_promote``: + +.. prompt:: bash $ + + ceph osd pool set {cachepool} min_read_recency_for_promote 2 + ceph osd pool set {cachepool} min_write_recency_for_promote 2 + +.. note:: The longer the period and the higher the + ``min_read_recency_for_promote`` and + ``min_write_recency_for_promote``values, the more RAM the ``ceph-osd`` + daemon consumes. In particular, when the agent is active to flush + or evict cache objects, all ``hit_set_count`` HitSets are loaded + into RAM. + + +Cache Sizing +------------ + +The cache tiering agent performs two main functions: + +- **Flushing:** The agent identifies modified (or dirty) objects and forwards + them to the storage pool for long-term storage. + +- **Evicting:** The agent identifies objects that haven't been modified + (or clean) and evicts the least recently used among them from the cache. + + +Absolute Sizing +~~~~~~~~~~~~~~~ + +The cache tiering agent can flush or evict objects based upon the total number +of bytes or the total number of objects. To specify a maximum number of bytes, +execute the following: + +.. prompt:: bash $ + + ceph osd pool set {cachepool} target_max_bytes {#bytes} + +For example, to flush or evict at 1 TB, execute the following: + +.. prompt:: bash $ + + ceph osd pool set hot-storage target_max_bytes 1099511627776 + +To specify the maximum number of objects, execute the following: + +.. prompt:: bash $ + + ceph osd pool set {cachepool} target_max_objects {#objects} + +For example, to flush or evict at 1M objects, execute the following: + +.. prompt:: bash $ + + ceph osd pool set hot-storage target_max_objects 1000000 + +.. note:: Ceph is not able to determine the size of a cache pool automatically, so + the configuration on the absolute size is required here, otherwise the + flush/evict will not work. If you specify both limits, the cache tiering + agent will begin flushing or evicting when either threshold is triggered. + +.. note:: All client requests will be blocked only when ``target_max_bytes`` or + ``target_max_objects`` reached + +Relative Sizing +~~~~~~~~~~~~~~~ + +The cache tiering agent can flush or evict objects relative to the size of the +cache pool(specified by ``target_max_bytes`` / ``target_max_objects`` in +`Absolute sizing`_). When the cache pool consists of a certain percentage of +modified (or dirty) objects, the cache tiering agent will flush them to the +storage pool. To set the ``cache_target_dirty_ratio``, execute the following: + +.. 
prompt:: bash $

   ceph osd pool set {cachepool} cache_target_dirty_ratio {0.0..1.0}

For example, setting the value to ``0.4`` will begin flushing modified
(dirty) objects when they reach 40% of the cache pool's capacity:

.. prompt:: bash $

   ceph osd pool set hot-storage cache_target_dirty_ratio 0.4

When the dirty objects reach a certain higher percentage of the cache pool's
capacity, the cache tiering agent flushes them at a faster rate. To set the
``cache_target_dirty_high_ratio``, execute the following:

.. prompt:: bash $

   ceph osd pool set {cachepool} cache_target_dirty_high_ratio {0.0..1.0}

For example, setting the value to ``0.6`` will begin aggressively flushing
dirty objects when they reach 60% of the cache pool's capacity. This value
should be set between ``cache_target_dirty_ratio`` and
``cache_target_full_ratio``:

.. prompt:: bash $

   ceph osd pool set hot-storage cache_target_dirty_high_ratio 0.6

When the cache pool reaches a certain percentage of its capacity, the cache
tiering agent will evict objects to maintain free capacity. To set the
``cache_target_full_ratio``, execute the following:

.. prompt:: bash $

   ceph osd pool set {cachepool} cache_target_full_ratio {0.0..1.0}

For example, setting the value to ``0.8`` will begin evicting unmodified
(clean) objects when they reach 80% of the cache pool's capacity:

.. prompt:: bash $

   ceph osd pool set hot-storage cache_target_full_ratio 0.8


Cache Age
---------

You can specify the minimum age of an object before the cache tiering agent
flushes a recently modified (or dirty) object to the backing storage pool:

.. prompt:: bash $

   ceph osd pool set {cachepool} cache_min_flush_age {#seconds}

For example, to flush modified (or dirty) objects after 10 minutes, execute the
following:

.. prompt:: bash $

   ceph osd pool set hot-storage cache_min_flush_age 600

You can specify the minimum age of an object before it will be evicted from the
cache tier:

.. prompt:: bash $

   ceph osd pool set {cachepool} cache_min_evict_age {#seconds}

For example, to evict objects after 30 minutes, execute the following:

.. prompt:: bash $

   ceph osd pool set hot-storage cache_min_evict_age 1800


Removing a Cache Tier
=====================

Removing a cache tier differs depending on whether it is a writeback
cache or a read-only cache.


Removing a Read-Only Cache
--------------------------

Since a read-only cache does not have modified data, you can disable
and remove it without losing any recent changes to objects in the cache.

#. Change the cache-mode to ``none`` to disable it:

   .. prompt:: bash $

      ceph osd tier cache-mode {cachepool} none

   For example:

   .. prompt:: bash $

      ceph osd tier cache-mode hot-storage none

#. Remove the cache pool from the backing pool:

   .. prompt:: bash $

      ceph osd tier remove {storagepool} {cachepool}

   For example:

   .. prompt:: bash $

      ceph osd tier remove cold-storage hot-storage


Removing a Writeback Cache
--------------------------

Since a writeback cache may have modified data, you must take steps to ensure
that you do not lose any recent changes to objects in the cache before you
disable and remove it.


#. Change the cache mode to ``proxy`` so that new and modified objects will
   flush to the backing storage pool:

   .. prompt:: bash $

      ceph osd tier cache-mode {cachepool} proxy

   For example:

   .. prompt:: bash $

      ceph osd tier cache-mode hot-storage proxy


#. Ensure that the cache pool has been flushed. 
This may take a few minutes: + + .. prompt:: bash $ + + rados -p {cachepool} ls + + If the cache pool still has objects, you can flush them manually. + For example: + + .. prompt:: bash $ + + rados -p {cachepool} cache-flush-evict-all + + +#. Remove the overlay so that clients will not direct traffic to the cache.: + + .. prompt:: bash $ + + ceph osd tier remove-overlay {storagetier} + + For example: + + .. prompt:: bash $ + + ceph osd tier remove-overlay cold-storage + + +#. Finally, remove the cache tier pool from the backing storage pool.: + + .. prompt:: bash $ + + ceph osd tier remove {storagepool} {cachepool} + + For example: + + .. prompt:: bash $ + + ceph osd tier remove cold-storage hot-storage + + +.. _Create a Pool: ../pools#create-a-pool +.. _Pools - Set Pool Values: ../pools#set-pool-values +.. _Bloom Filter: https://en.wikipedia.org/wiki/Bloom_filter +.. _CRUSH Maps: ../crush-map +.. _Absolute Sizing: #absolute-sizing diff --git a/doc/rados/operations/change-mon-elections.rst b/doc/rados/operations/change-mon-elections.rst new file mode 100644 index 000000000..7418ea363 --- /dev/null +++ b/doc/rados/operations/change-mon-elections.rst @@ -0,0 +1,100 @@ +.. _changing_monitor_elections: + +======================================= +Configuring Monitor Election Strategies +======================================= + +By default, the monitors are in ``classic`` mode. We recommend staying in this +mode unless you have a very specific reason. + +If you want to switch modes BEFORE constructing the cluster, change the ``mon +election default strategy`` option. This option takes an integer value: + +* ``1`` for ``classic`` +* ``2`` for ``disallow`` +* ``3`` for ``connectivity`` + +After your cluster has started running, you can change strategies by running a +command of the following form: + + $ ceph mon set election_strategy {classic|disallow|connectivity} + +Choosing a mode +=============== + +The modes other than ``classic`` provide specific features. We recommend staying +in ``classic`` mode if you don't need these extra features because it is the +simplest mode. + +.. _rados_operations_disallow_mode: + +Disallow Mode +============= + +The ``disallow`` mode allows you to mark monitors as disallowed. Disallowed +monitors participate in the quorum and serve clients, but cannot be elected +leader. You might want to use this mode for monitors that are far away from +clients. + +To disallow a monitor from being elected leader, run a command of the following +form: + +.. prompt:: bash $ + + ceph mon add disallowed_leader {name} + +To remove a monitor from the disallowed list and allow it to be elected leader, +run a command of the following form: + +.. prompt:: bash $ + + ceph mon rm disallowed_leader {name} + +To see the list of disallowed leaders, examine the output of the following +command: + +.. prompt:: bash $ + + ceph mon dump + +Connectivity Mode +================= + +The ``connectivity`` mode evaluates connection scores that are provided by each +monitor for its peers and elects the monitor with the highest score. This mode +is designed to handle network partitioning (also called *net-splits*): network +partitioning might occur if your cluster is stretched across multiple data +centers or otherwise has a non-uniform or unbalanced network topology. + +The ``connectivity`` mode also supports disallowing monitors from being elected +leader by using the same commands that were presented in :ref:`Disallow Mode <rados_operations_disallow_mode>`. 
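+
+As an illustration, a minimal sketch of moving a running cluster to the
+``connectivity`` strategy and then marking one monitor as ineligible for
+leadership might look like the following. It uses only the commands shown
+above; the monitor name ``site-b`` is a placeholder for one of your own
+monitors:
+
+.. prompt:: bash $
+
+   ceph mon set election_strategy connectivity
+   ceph mon add disallowed_leader site-b
+   ceph mon dump
+
+The final ``ceph mon dump`` is a verification step only: its output should
+include ``site-b`` in the list of disallowed leaders.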
+ +Examining connectivity scores +============================= + +The monitors maintain connection scores even if they aren't in ``connectivity`` +mode. To examine a specific monitor's connection scores, run a command of the +following form: + +.. prompt:: bash $ + + ceph daemon mon.{name} connection scores dump + +Scores for an individual connection range from ``0`` to ``1`` inclusive and +include whether the connection is considered alive or dead (as determined by +whether it returned its latest ping before timeout). + +Connectivity scores are expected to remain valid. However, if during +troubleshooting you determine that these scores have for some reason become +invalid, drop the history and reset the scores by running a command of the +following form: + +.. prompt:: bash $ + + ceph daemon mon.{name} connection scores reset + +Resetting connectivity scores carries little risk: monitors will still quickly +determine whether a connection is alive or dead and trend back to the previous +scores if those scores were accurate. Nevertheless, resetting scores ought to +be unnecessary and it is not recommended unless advised by your support team +or by a developer. diff --git a/doc/rados/operations/control.rst b/doc/rados/operations/control.rst new file mode 100644 index 000000000..033f831cd --- /dev/null +++ b/doc/rados/operations/control.rst @@ -0,0 +1,665 @@ +.. index:: control, commands + +================== + Control Commands +================== + + +Monitor Commands +================ + +To issue monitor commands, use the ``ceph`` utility: + +.. prompt:: bash $ + + ceph [-m monhost] {command} + +In most cases, monitor commands have the following form: + +.. prompt:: bash $ + + ceph {subsystem} {command} + + +System Commands +=============== + +To display the current cluster status, run the following commands: + +.. prompt:: bash $ + + ceph -s + ceph status + +To display a running summary of cluster status and major events, run the +following command: + +.. prompt:: bash $ + + ceph -w + +To display the monitor quorum, including which monitors are participating and +which one is the leader, run the following commands: + +.. prompt:: bash $ + + ceph mon stat + ceph quorum_status + +To query the status of a single monitor, including whether it is in the quorum, +run the following command: + +.. prompt:: bash $ + + ceph tell mon.[id] mon_status + +Here the value of ``[id]`` can be found by consulting the output of ``ceph +-s``. + + +Authentication Subsystem +======================== + +To add an OSD keyring for a specific OSD, run the following command: + +.. prompt:: bash $ + + ceph auth add {osd} {--in-file|-i} {path-to-osd-keyring} + +To list the cluster's keys and their capabilities, run the following command: + +.. prompt:: bash $ + + ceph auth ls + + +Placement Group Subsystem +========================= + +To display the statistics for all placement groups (PGs), run the following +command: + +.. prompt:: bash $ + + ceph pg dump [--format {format}] + +Here the valid formats are ``plain`` (default), ``json`` ``json-pretty``, +``xml``, and ``xml-pretty``. When implementing monitoring tools and other +tools, it is best to use the ``json`` format. JSON parsing is more +deterministic than the ``plain`` format (which is more human readable), and the +layout is much more consistent from release to release. The ``jq`` utility is +very useful for extracting data from JSON output. + +To display the statistics for all PGs stuck in a specified state, run the +following command: + +.. 
prompt:: bash $ + + ceph pg dump_stuck inactive|unclean|stale|undersized|degraded [--format {format}] [-t|--threshold {seconds}] + +Here ``--format`` may be ``plain`` (default), ``json``, ``json-pretty``, +``xml``, or ``xml-pretty``. + +The ``--threshold`` argument determines the time interval (in seconds) for a PG +to be considered ``stuck`` (default: 300). + +PGs might be stuck in any of the following states: + +**Inactive** + + PGs are unable to process reads or writes because they are waiting for an + OSD that has the most up-to-date data to return to an ``up`` state. + + +**Unclean** + + PGs contain objects that have not been replicated the desired number of + times. These PGs have not yet completed the process of recovering. + + +**Stale** + + PGs are in an unknown state, because the OSDs that host them have not + reported to the monitor cluster for a certain period of time (specified by + the ``mon_osd_report_timeout`` configuration setting). + + +To delete a ``lost`` object or revert an object to its prior state, either by +reverting it to its previous version or by deleting it because it was just +created and has no previous version, run the following command: + +.. prompt:: bash $ + + ceph pg {pgid} mark_unfound_lost revert|delete + + +.. _osd-subsystem: + +OSD Subsystem +============= + +To query OSD subsystem status, run the following command: + +.. prompt:: bash $ + + ceph osd stat + +To write a copy of the most recent OSD map to a file (see :ref:`osdmaptool +<osdmaptool>`), run the following command: + +.. prompt:: bash $ + + ceph osd getmap -o file + +To write a copy of the CRUSH map from the most recent OSD map to a file, run +the following command: + +.. prompt:: bash $ + + ceph osd getcrushmap -o file + +Note that this command is functionally equivalent to the following two +commands: + +.. prompt:: bash $ + + ceph osd getmap -o /tmp/osdmap + osdmaptool /tmp/osdmap --export-crush file + +To dump the OSD map, run the following command: + +.. prompt:: bash $ + + ceph osd dump [--format {format}] + +The ``--format`` option accepts the following arguments: ``plain`` (default), +``json``, ``json-pretty``, ``xml``, and ``xml-pretty``. As noted above, JSON is +the recommended format for tools, scripting, and other forms of automation. + +To dump the OSD map as a tree that lists one OSD per line and displays +information about the weights and states of the OSDs, run the following +command: + +.. prompt:: bash $ + + ceph osd tree [--format {format}] + +To find out where a specific RADOS object is stored in the system, run a +command of the following form: + +.. prompt:: bash $ + + ceph osd map <pool-name> <object-name> + +To add or move a new OSD (specified by its ID, name, or weight) to a specific +CRUSH location, run the following command: + +.. prompt:: bash $ + + ceph osd crush set {id} {weight} [{loc1} [{loc2} ...]] + +To remove an existing OSD from the CRUSH map, run the following command: + +.. prompt:: bash $ + + ceph osd crush remove {name} + +To remove an existing bucket from the CRUSH map, run the following command: + +.. prompt:: bash $ + + ceph osd crush remove {bucket-name} + +To move an existing bucket from one position in the CRUSH hierarchy to another, +run the following command: + +.. prompt:: bash $ + + ceph osd crush move {id} {loc1} [{loc2} ...] + +To set the CRUSH weight of a specific OSD (specified by ``{name}``) to +``{weight}``, run the following command: + +.. 
prompt:: bash $ + + ceph osd crush reweight {name} {weight} + +To mark an OSD as ``lost``, run the following command: + +.. prompt:: bash $ + + ceph osd lost {id} [--yes-i-really-mean-it] + +.. warning:: + This could result in permanent data loss. Use with caution! + +To create a new OSD, run the following command: + +.. prompt:: bash $ + + ceph osd create [{uuid}] + +If no UUID is given as part of this command, the UUID will be set automatically +when the OSD starts up. + +To remove one or more specific OSDs, run the following command: + +.. prompt:: bash $ + + ceph osd rm [{id}...] + +To display the current ``max_osd`` parameter in the OSD map, run the following +command: + +.. prompt:: bash $ + + ceph osd getmaxosd + +To import a specific CRUSH map, run the following command: + +.. prompt:: bash $ + + ceph osd setcrushmap -i file + +To set the ``max_osd`` parameter in the OSD map, run the following command: + +.. prompt:: bash $ + + ceph osd setmaxosd + +The parameter has a default value of 10000. Most operators will never need to +adjust it. + +To mark a specific OSD ``down``, run the following command: + +.. prompt:: bash $ + + ceph osd down {osd-num} + +To mark a specific OSD ``out`` (so that no data will be allocated to it), run +the following command: + +.. prompt:: bash $ + + ceph osd out {osd-num} + +To mark a specific OSD ``in`` (so that data will be allocated to it), run the +following command: + +.. prompt:: bash $ + + ceph osd in {osd-num} + +By using the "pause flags" in the OSD map, you can pause or unpause I/O +requests. If the flags are set, then no I/O requests will be sent to any OSD. +When the flags are cleared, then pending I/O requests will be resent. To set or +clear pause flags, run one of the following commands: + +.. prompt:: bash $ + + ceph osd pause + ceph osd unpause + +You can assign an override or ``reweight`` weight value to a specific OSD if +the normal CRUSH distribution seems to be suboptimal. The weight of an OSD +helps determine the extent of its I/O requests and data storage: two OSDs with +the same weight will receive approximately the same number of I/O requests and +store approximately the same amount of data. The ``ceph osd reweight`` command +assigns an override weight to an OSD. The weight value is in the range 0 to 1, +and the command forces CRUSH to relocate a certain amount (1 - ``weight``) of +the data that would otherwise be on this OSD. The command does not change the +weights of the buckets above the OSD in the CRUSH map. Using the command is +merely a corrective measure: for example, if one of your OSDs is at 90% and the +others are at 50%, you could reduce the outlier weight to correct this +imbalance. To assign an override weight to a specific OSD, run the following +command: + +.. prompt:: bash $ + + ceph osd reweight {osd-num} {weight} + +.. note:: Any assigned override reweight value will conflict with the balancer. + This means that if the balancer is in use, all override reweight values + should be ``1.0000`` in order to avoid suboptimal cluster behavior. + +A cluster's OSDs can be reweighted in order to maintain balance if some OSDs +are being disproportionately utilized. Note that override or ``reweight`` +weights have values relative to one another that default to 1.00000; their +values are not absolute, and these weights must be distinguished from CRUSH +weights (which reflect the absolute capacity of a bucket, as measured in TiB). +To reweight OSDs by utilization, run the following command: + +.. 
prompt:: bash $ + + ceph osd reweight-by-utilization [threshold [max_change [max_osds]]] [--no-increasing] + +By default, this command adjusts the override weight of OSDs that have ±20% of +the average utilization, but you can specify a different percentage in the +``threshold`` argument. + +To limit the increment by which any OSD's reweight is to be changed, use the +``max_change`` argument (default: 0.05). To limit the number of OSDs that are +to be adjusted, use the ``max_osds`` argument (default: 4). Increasing these +variables can accelerate the reweighting process, but perhaps at the cost of +slower client operations (as a result of the increase in data movement). + +You can test the ``osd reweight-by-utilization`` command before running it. To +find out which and how many PGs and OSDs will be affected by a specific use of +the ``osd reweight-by-utilization`` command, run the following command: + +.. prompt:: bash $ + + ceph osd test-reweight-by-utilization [threshold [max_change max_osds]] [--no-increasing] + +The ``--no-increasing`` option can be added to the ``reweight-by-utilization`` +and ``test-reweight-by-utilization`` commands in order to prevent any override +weights that are currently less than 1.00000 from being increased. This option +can be useful in certain circumstances: for example, when you are hastily +balancing in order to remedy ``full`` or ``nearfull`` OSDs, or when there are +OSDs being evacuated or slowly brought into service. + +Operators of deployments that utilize Nautilus or newer (or later revisions of +Luminous and Mimic) and that have no pre-Luminous clients might likely instead +want to enable the `balancer`` module for ``ceph-mgr``. + +The blocklist can be modified by adding or removing an IP address or a CIDR +range. If an address is blocklisted, it will be unable to connect to any OSD. +If an OSD is contained within an IP address or CIDR range that has been +blocklisted, the OSD will be unable to perform operations on its peers when it +acts as a client: such blocked operations include tiering and copy-from +functionality. To add or remove an IP address or CIDR range to the blocklist, +run one of the following commands: + +.. prompt:: bash $ + + ceph osd blocklist ["range"] add ADDRESS[:source_port][/netmask_bits] [TIME] + ceph osd blocklist ["range"] rm ADDRESS[:source_port][/netmask_bits] + +If you add something to the blocklist with the above ``add`` command, you can +use the ``TIME`` keyword to specify the length of time (in seconds) that it +will remain on the blocklist (default: one hour). To add or remove a CIDR +range, use the ``range`` keyword in the above commands. + +Note that these commands are useful primarily in failure testing. Under normal +conditions, blocklists are maintained automatically and do not need any manual +intervention. + +To create or delete a snapshot of a specific storage pool, run one of the +following commands: + +.. prompt:: bash $ + + ceph osd pool mksnap {pool-name} {snap-name} + ceph osd pool rmsnap {pool-name} {snap-name} + +To create, delete, or rename a specific storage pool, run one of the following +commands: + +.. prompt:: bash $ + + ceph osd pool create {pool-name} [pg_num [pgp_num]] + ceph osd pool delete {pool-name} [{pool-name} --yes-i-really-really-mean-it] + ceph osd pool rename {old-name} {new-name} + +To change a pool setting, run the following command: + +.. 
prompt:: bash $
+
+   ceph osd pool set {pool-name} {field} {value}
+
+The following are valid fields:
+
+  * ``size``: The number of copies of data in the pool.
+  * ``pg_num``: The PG number.
+  * ``pgp_num``: The effective number of PGs when calculating placement.
+  * ``crush_rule``: The rule number for mapping placement.
+
+To retrieve the value of a pool setting, run the following command:
+
+.. prompt:: bash $
+
+   ceph osd pool get {pool-name} {field}
+
+Valid fields are:
+
+  * ``pg_num``: The PG number.
+  * ``pgp_num``: The effective number of PGs when calculating placement.
+
+To send a scrub command to a specific OSD, or to all OSDs (by using ``*``), run
+the following command:
+
+.. prompt:: bash $
+
+   ceph osd scrub {osd-num}
+
+To send a repair command to a specific OSD, or to all OSDs (by using ``*``),
+run the following command:
+
+.. prompt:: bash $
+
+   ceph osd repair {osd-num}
+
+You can run a simple throughput benchmark test against a specific OSD. This
+test writes a total size of ``TOTAL_DATA_BYTES`` (default: 1 GB) incrementally,
+in multiple write requests that each have a size of ``BYTES_PER_WRITE``
+(default: 4 MB). The test is not destructive and it will not overwrite existing
+live OSD data, but it might temporarily affect the performance of clients that
+are concurrently accessing the OSD. To launch this benchmark test, run the
+following command:
+
+.. prompt:: bash $
+
+   ceph tell osd.N bench [TOTAL_DATA_BYTES] [BYTES_PER_WRITE]
+
+To clear the caches of a specific OSD during the interval between one benchmark
+run and another, run the following command:
+
+.. prompt:: bash $
+
+   ceph tell osd.N cache drop
+
+To retrieve the cache statistics of a specific OSD, run the following command:
+
+.. prompt:: bash $
+
+   ceph tell osd.N cache status
+
+MDS Subsystem
+=============
+
+To change the configuration parameters of a running metadata server, run the
+following command:
+
+.. prompt:: bash $
+
+   ceph tell mds.{mds-id} config set {setting} {value}
+
+For example, to enable debug messages, run the following command:
+
+.. prompt:: bash $
+
+   ceph tell mds.0 config set debug_ms 1
+
+To display the status of all metadata servers, run the following command:
+
+.. prompt:: bash $
+
+   ceph mds stat
+
+To mark the active metadata server as failed (and to trigger failover to a
+standby if a standby is present), run the following command:
+
+.. prompt:: bash $
+
+   ceph mds fail 0
+
+.. todo:: ``ceph mds`` subcommands missing docs: set, dump, getmap, stop, setmap
+
+
+Mon Subsystem
+=============
+
+To display monitor statistics, run the following command:
+
+.. prompt:: bash $
+
+   ceph mon stat
+
+This command returns output similar to the following:
+
+::
+
+    e2: 3 mons at {a=127.0.0.1:40000/0,b=127.0.0.1:40001/0,c=127.0.0.1:40002/0}, election epoch 6, quorum 0,1,2 a,b,c
+
+There is a ``quorum`` list at the end of the output. It lists those monitor
+nodes that are part of the current quorum.
+
+To retrieve this information in a more direct way, run the following command:
+
+.. prompt:: bash $
+
+   ceph quorum_status -f json-pretty
+
+This command returns output similar to the following:
+
+.. 
code-block:: javascript + + { + "election_epoch": 6, + "quorum": [ + 0, + 1, + 2 + ], + "quorum_names": [ + "a", + "b", + "c" + ], + "quorum_leader_name": "a", + "monmap": { + "epoch": 2, + "fsid": "ba807e74-b64f-4b72-b43f-597dfe60ddbc", + "modified": "2016-12-26 14:42:09.288066", + "created": "2016-12-26 14:42:03.573585", + "features": { + "persistent": [ + "kraken" + ], + "optional": [] + }, + "mons": [ + { + "rank": 0, + "name": "a", + "addr": "127.0.0.1:40000\/0", + "public_addr": "127.0.0.1:40000\/0" + }, + { + "rank": 1, + "name": "b", + "addr": "127.0.0.1:40001\/0", + "public_addr": "127.0.0.1:40001\/0" + }, + { + "rank": 2, + "name": "c", + "addr": "127.0.0.1:40002\/0", + "public_addr": "127.0.0.1:40002\/0" + } + ] + } + } + + +The above will block until a quorum is reached. + +To see the status of a specific monitor, run the following command: + +.. prompt:: bash $ + + ceph tell mon.[name] mon_status + +Here the value of ``[name]`` can be found by consulting the output of the +``ceph quorum_status`` command. This command returns output similar to the +following: + +:: + + { + "name": "b", + "rank": 1, + "state": "peon", + "election_epoch": 6, + "quorum": [ + 0, + 1, + 2 + ], + "features": { + "required_con": "9025616074522624", + "required_mon": [ + "kraken" + ], + "quorum_con": "1152921504336314367", + "quorum_mon": [ + "kraken" + ] + }, + "outside_quorum": [], + "extra_probe_peers": [], + "sync_provider": [], + "monmap": { + "epoch": 2, + "fsid": "ba807e74-b64f-4b72-b43f-597dfe60ddbc", + "modified": "2016-12-26 14:42:09.288066", + "created": "2016-12-26 14:42:03.573585", + "features": { + "persistent": [ + "kraken" + ], + "optional": [] + }, + "mons": [ + { + "rank": 0, + "name": "a", + "addr": "127.0.0.1:40000\/0", + "public_addr": "127.0.0.1:40000\/0" + }, + { + "rank": 1, + "name": "b", + "addr": "127.0.0.1:40001\/0", + "public_addr": "127.0.0.1:40001\/0" + }, + { + "rank": 2, + "name": "c", + "addr": "127.0.0.1:40002\/0", + "public_addr": "127.0.0.1:40002\/0" + } + ] + } + } + +To see a dump of the monitor state, run the following command: + +.. prompt:: bash $ + + ceph mon dump + +This command returns output similar to the following: + +:: + + dumped monmap epoch 2 + epoch 2 + fsid ba807e74-b64f-4b72-b43f-597dfe60ddbc + last_changed 2016-12-26 14:42:09.288066 + created 2016-12-26 14:42:03.573585 + 0: 127.0.0.1:40000/0 mon.a + 1: 127.0.0.1:40001/0 mon.b + 2: 127.0.0.1:40002/0 mon.c diff --git a/doc/rados/operations/crush-map-edits.rst b/doc/rados/operations/crush-map-edits.rst new file mode 100644 index 000000000..46a4a4f74 --- /dev/null +++ b/doc/rados/operations/crush-map-edits.rst @@ -0,0 +1,746 @@ +Manually editing the CRUSH Map +============================== + +.. note:: Manually editing the CRUSH map is an advanced administrator + operation. For the majority of installations, CRUSH changes can be + implemented via the Ceph CLI and do not require manual CRUSH map edits. If + you have identified a use case where manual edits *are* necessary with a + recent Ceph release, consider contacting the Ceph developers at dev@ceph.io + so that future versions of Ceph do not have this problem. + +To edit an existing CRUSH map, carry out the following procedure: + +#. `Get the CRUSH map`_. +#. `Decompile`_ the CRUSH map. +#. Edit at least one of the following sections: `Devices`_, `Buckets`_, and + `Rules`_. Use a text editor for this task. +#. `Recompile`_ the CRUSH map. +#. `Set the CRUSH map`_. 
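+
+As a consolidated sketch of the procedure above, an editing session might look
+like the following. The filenames are arbitrary examples, and each command is
+described in detail in the sections below:
+
+.. prompt:: bash $
+
+   ceph osd getcrushmap -o crushmap.bin
+   crushtool -d crushmap.bin -o crushmap.txt
+   vi crushmap.txt      # edit devices, buckets, or rules with any text editor
+   crushtool -c crushmap.txt -o crushmap-new.bin
+   ceph osd setcrushmap -i crushmap-new.bin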
+ +For details on setting the CRUSH map rule for a specific pool, see `Set Pool +Values`_. + +.. _Get the CRUSH map: #getcrushmap +.. _Decompile: #decompilecrushmap +.. _Devices: #crushmapdevices +.. _Buckets: #crushmapbuckets +.. _Rules: #crushmaprules +.. _Recompile: #compilecrushmap +.. _Set the CRUSH map: #setcrushmap +.. _Set Pool Values: ../pools#setpoolvalues + +.. _getcrushmap: + +Get the CRUSH Map +----------------- + +To get the CRUSH map for your cluster, run a command of the following form: + +.. prompt:: bash $ + + ceph osd getcrushmap -o {compiled-crushmap-filename} + +Ceph outputs (``-o``) a compiled CRUSH map to the filename that you have +specified. Because the CRUSH map is in a compiled form, you must first +decompile it before you can edit it. + +.. _decompilecrushmap: + +Decompile the CRUSH Map +----------------------- + +To decompile the CRUSH map, run a command of the following form: + +.. prompt:: bash $ + + crushtool -d {compiled-crushmap-filename} -o {decompiled-crushmap-filename} + +.. _compilecrushmap: + +Recompile the CRUSH Map +----------------------- + +To compile the CRUSH map, run a command of the following form: + +.. prompt:: bash $ + + crushtool -c {decompiled-crushmap-filename} -o {compiled-crushmap-filename} + +.. _setcrushmap: + +Set the CRUSH Map +----------------- + +To set the CRUSH map for your cluster, run a command of the following form: + +.. prompt:: bash $ + + ceph osd setcrushmap -i {compiled-crushmap-filename} + +Ceph loads (``-i``) a compiled CRUSH map from the filename that you have +specified. + +Sections +-------- + +A CRUSH map has six main sections: + +#. **tunables:** The preamble at the top of the map describes any *tunables* + that are not a part of legacy CRUSH behavior. These tunables correct for old + bugs, optimizations, or other changes that have been made over the years to + improve CRUSH's behavior. + +#. **devices:** Devices are individual OSDs that store data. + +#. **types**: Bucket ``types`` define the types of buckets that are used in + your CRUSH hierarchy. + +#. **buckets:** Buckets consist of a hierarchical aggregation of storage + locations (for example, rows, racks, chassis, hosts) and their assigned + weights. After the bucket ``types`` have been defined, the CRUSH map defines + each node in the hierarchy, its type, and which devices or other nodes it + contains. + +#. **rules:** Rules define policy about how data is distributed across + devices in the hierarchy. + +#. **choose_args:** ``choose_args`` are alternative weights associated with + the hierarchy that have been adjusted in order to optimize data placement. A + single ``choose_args`` map can be used for the entire cluster, or a number + of ``choose_args`` maps can be created such that each map is crafted for a + particular pool. + + +.. _crushmapdevices: + +CRUSH-Map Devices +----------------- + +Devices are individual OSDs that store data. In this section, there is usually +one device defined for each OSD daemon in your cluster. Devices are identified +by an ``id`` (a non-negative integer) and a ``name`` (usually ``osd.N``, where +``N`` is the device's ``id``). + + +.. _crush-map-device-class: + +A device can also have a *device class* associated with it: for example, +``hdd`` or ``ssd``. Device classes make it possible for devices to be targeted +by CRUSH rules. This means that device classes allow CRUSH rules to select only +OSDs that match certain characteristics. 
For example, you might want an RBD +pool associated only with SSDs and a different RBD pool associated only with +HDDs. + +To see a list of devices, run the following command: + +.. prompt:: bash # + + ceph device ls + +The output of this command takes the following form: + +:: + + device {num} {osd.name} [class {class}] + +For example: + +.. prompt:: bash # + + ceph device ls + +:: + + device 0 osd.0 class ssd + device 1 osd.1 class hdd + device 2 osd.2 + device 3 osd.3 + +In most cases, each device maps to a corresponding ``ceph-osd`` daemon. This +daemon might map to a single storage device, a pair of devices (for example, +one for data and one for a journal or metadata), or in some cases a small RAID +device or a partition of a larger storage device. + + +CRUSH-Map Bucket Types +---------------------- + +The second list in the CRUSH map defines 'bucket' types. Buckets facilitate a +hierarchy of nodes and leaves. Node buckets (also known as non-leaf buckets) +typically represent physical locations in a hierarchy. Nodes aggregate other +nodes or leaves. Leaf buckets represent ``ceph-osd`` daemons and their +corresponding storage media. + +.. tip:: In the context of CRUSH, the term "bucket" is used to refer to + a node in the hierarchy (that is, to a location or a piece of physical + hardware). In the context of RADOS Gateway APIs, however, the term + "bucket" has a different meaning. + +To add a bucket type to the CRUSH map, create a new line under the list of +bucket types. Enter ``type`` followed by a unique numeric ID and a bucket name. +By convention, there is exactly one leaf bucket type and it is ``type 0``; +however, you may give the leaf bucket any name you like (for example: ``osd``, +``disk``, ``drive``, ``storage``):: + + # types + type {num} {bucket-name} + +For example:: + + # types + type 0 osd + type 1 host + type 2 chassis + type 3 rack + type 4 row + type 5 pdu + type 6 pod + type 7 room + type 8 datacenter + type 9 zone + type 10 region + type 11 root + +.. _crushmapbuckets: + +CRUSH-Map Bucket Hierarchy +-------------------------- + +The CRUSH algorithm distributes data objects among storage devices according to +a per-device weight value, approximating a uniform probability distribution. +CRUSH distributes objects and their replicas according to the hierarchical +cluster map you define. The CRUSH map represents the available storage devices +and the logical elements that contain them. + +To map placement groups (PGs) to OSDs across failure domains, a CRUSH map +defines a hierarchical list of bucket types under ``#types`` in the generated +CRUSH map. The purpose of creating a bucket hierarchy is to segregate the leaf +nodes according to their failure domains (for example: hosts, chassis, racks, +power distribution units, pods, rows, rooms, and data centers). With the +exception of the leaf nodes that represent OSDs, the hierarchy is arbitrary and +you may define it according to your own needs. + +We recommend adapting your CRUSH map to your preferred hardware-naming +conventions and using bucket names that clearly reflect the physical +hardware. Clear naming practice can make it easier to administer the cluster +and easier to troubleshoot problems when OSDs malfunction (or other hardware +malfunctions) and the administrator needs access to physical hardware. + + +In the following example, the bucket hierarchy has a leaf bucket named ``osd`` +and two node buckets named ``host`` and ``rack``: + +.. 
ditaa:: + +-----------+ + | {o}rack | + | Bucket | + +-----+-----+ + | + +---------------+---------------+ + | | + +-----+-----+ +-----+-----+ + | {o}host | | {o}host | + | Bucket | | Bucket | + +-----+-----+ +-----+-----+ + | | + +-------+-------+ +-------+-------+ + | | | | + +-----+-----+ +-----+-----+ +-----+-----+ +-----+-----+ + | osd | | osd | | osd | | osd | + | Bucket | | Bucket | | Bucket | | Bucket | + +-----------+ +-----------+ +-----------+ +-----------+ + +.. note:: The higher-numbered ``rack`` bucket type aggregates the + lower-numbered ``host`` bucket type. + +Because leaf nodes reflect storage devices that have already been declared +under the ``#devices`` list at the beginning of the CRUSH map, there is no need +to declare them as bucket instances. The second-lowest bucket type in your +hierarchy is typically used to aggregate the devices (that is, the +second-lowest bucket type is usually the computer that contains the storage +media and, such as ``node``, ``computer``, ``server``, ``host``, or +``machine``). In high-density environments, it is common to have multiple hosts +or nodes in a single chassis (for example, in the cases of blades or twins). It +is important to anticipate the potential consequences of chassis failure -- for +example, during the replacement of a chassis in case of a node failure, the +chassis's hosts or nodes (and their associated OSDs) will be in a ``down`` +state. + +To declare a bucket instance, do the following: specify its type, give it a +unique name (an alphanumeric string), assign it a unique ID expressed as a +negative integer (this is optional), assign it a weight relative to the total +capacity and capability of the item(s) in the bucket, assign it a bucket +algorithm (usually ``straw2``), and specify the bucket algorithm's hash +(usually ``0``, a setting that reflects the hash algorithm ``rjenkins1``). A +bucket may have one or more items. The items may consist of node buckets or +leaves. Items may have a weight that reflects the relative weight of the item. + +To declare a node bucket, use the following syntax:: + + [bucket-type] [bucket-name] { + id [a unique negative numeric ID] + weight [the relative capacity/capability of the item(s)] + alg [the bucket type: uniform | list | tree | straw | straw2 ] + hash [the hash type: 0 by default] + item [item-name] weight [weight] + } + +For example, in the above diagram, two host buckets (referred to in the +declaration below as ``node1`` and ``node2``) and one rack bucket (referred to +in the declaration below as ``rack1``) are defined. The OSDs are declared as +items within the host buckets:: + + host node1 { + id -1 + alg straw2 + hash 0 + item osd.0 weight 1.00 + item osd.1 weight 1.00 + } + + host node2 { + id -2 + alg straw2 + hash 0 + item osd.2 weight 1.00 + item osd.3 weight 1.00 + } + + rack rack1 { + id -3 + alg straw2 + hash 0 + item node1 weight 2.00 + item node2 weight 2.00 + } + +.. note:: In this example, the rack bucket does not contain any OSDs. Instead, + it contains lower-level host buckets and includes the sum of their weight in + the item entry. + + +.. topic:: Bucket Types + + Ceph supports five bucket types. Each bucket type provides a balance between + performance and reorganization efficiency, and each is different from the + others. If you are unsure of which bucket type to use, use the ``straw2`` + bucket. 
For a more technical discussion of bucket types than is offered
+   here, see **Section 3.4** of `CRUSH - Controlled, Scalable, Decentralized
+   Placement of Replicated Data`_.
+
+   The bucket types are as follows:
+
+    #. **uniform**: Uniform buckets aggregate devices that have **exactly**
+       the same weight. For example, when hardware is commissioned or
+       decommissioned, it is often done in sets of machines that have exactly
+       the same physical configuration (this can be the case, for example,
+       after bulk purchases). When storage devices have exactly the same
+       weight, you may use the ``uniform`` bucket type, which allows CRUSH to
+       map replicas into uniform buckets in constant time. If your devices have
+       non-uniform weights, you should not use the uniform bucket algorithm.
+
+    #. **list**: List buckets aggregate their content as linked lists. The
+       behavior of list buckets is governed by the :abbr:`RUSH (Replication
+       Under Scalable Hashing)`:sub:`P` algorithm. In the behavior of this
+       bucket type, an object is either relocated to the newest device in
+       accordance with an appropriate probability, or it remains on the older
+       devices as before. This results in optimal data migration when items are
+       added to the bucket. The removal of items from the middle or the tail of
+       the list, however, can result in a significant amount of unnecessary
+       data movement. This means that list buckets are most suitable for
+       circumstances in which they **never shrink or very rarely shrink**.
+
+    #. **tree**: Tree buckets use a binary search tree. They are more efficient
+       at dealing with buckets that contain many items than are list buckets.
+       The behavior of tree buckets is governed by the :abbr:`RUSH (Replication
+       Under Scalable Hashing)`:sub:`R` algorithm. Tree buckets reduce the
+       placement time to O(log\ :sub:`n`). This means that tree buckets are
+       suitable for managing large sets of devices or nested buckets.
+
+    #. **straw**: Straw buckets allow all items in the bucket to "compete"
+       against each other for replica placement through a process analogous to
+       drawing straws. This is different from the behavior of list buckets and
+       tree buckets, which use a divide-and-conquer strategy that either gives
+       certain items precedence (for example, those at the beginning of a list)
+       or obviates the need to consider entire subtrees of items. Such an
+       approach improves the performance of the replica placement process, but
+       can also introduce suboptimal reorganization behavior when the contents
+       of a bucket change due to an addition, a removal, or the re-weighting of
+       an item.
+
+    #. **straw2**: Straw2 buckets improve on straw buckets by correctly avoiding
+       any data movement between items when neighbor weights change. For
+       example, if the weight of a given item changes (including during the
+       operations of adding it to the cluster or removing it from the
+       cluster), there will be data movement to or from only that item.
+       Neighbor weights are not taken into account.
+
+
+.. topic:: Hash
+
+   Each bucket uses a hash algorithm. As of Reef, Ceph supports the
+   ``rjenkins1`` algorithm. To select ``rjenkins1`` as the hash algorithm,
+   enter ``0`` as your hash setting.
+
+.. _weightingbucketitems:
+
+.. topic:: Weighting Bucket Items
+
+   Ceph expresses bucket weights as doubles, which allows for fine-grained
+   weighting. A weight is the relative difference between device capacities. We
+   recommend using ``1.00`` as the relative weight for a 1 TB storage device. 
+ In such a scenario, a weight of ``0.50`` would represent approximately 500 + GB, and a weight of ``3.00`` would represent approximately 3 TB. Buckets + higher in the CRUSH hierarchy have a weight that is the sum of the weight of + the leaf items aggregated by the bucket. + + +.. _crushmaprules: + +CRUSH Map Rules +--------------- + +CRUSH maps have rules that include data placement for a pool: these are +called "CRUSH rules". The default CRUSH map has one rule for each pool. If you +are running a large cluster, you might create many pools and each of those +pools might have its own non-default CRUSH rule. + + +.. note:: In most cases, there is no need to modify the default rule. When a + new pool is created, by default the rule will be set to the value ``0`` + (which indicates the default CRUSH rule, which has the numeric ID ``0``). + +CRUSH rules define policy that governs how data is distributed across the devices in +the hierarchy. The rules define placement as well as replication strategies or +distribution policies that allow you to specify exactly how CRUSH places data +replicas. For example, you might create one rule selecting a pair of targets for +two-way mirroring, another rule for selecting three targets in two different data +centers for three-way replication, and yet another rule for erasure coding across +six storage devices. For a detailed discussion of CRUSH rules, see **Section 3.2** +of `CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data`_. + +A rule takes the following form:: + + rule <rulename> { + + id [a unique integer ID] + type [replicated|erasure] + step take <bucket-name> [class <device-class>] + step [choose|chooseleaf] [firstn|indep] <N> type <bucket-type> + step emit + } + + +``id`` + :Description: A unique integer that identifies the rule. + :Purpose: A component of the rule mask. + :Type: Integer + :Required: Yes + :Default: 0 + + +``type`` + :Description: Denotes the type of replication strategy to be enforced by the + rule. + :Purpose: A component of the rule mask. + :Type: String + :Required: Yes + :Default: ``replicated`` + :Valid Values: ``replicated`` or ``erasure`` + + +``step take <bucket-name> [class <device-class>]`` + :Description: Takes a bucket name and iterates down the tree. If + the ``device-class`` argument is specified, the argument must + match a class assigned to OSDs within the cluster. Only + devices belonging to the class are included. + :Purpose: A component of the rule. + :Required: Yes + :Example: ``step take data`` + + + +``step choose firstn {num} type {bucket-type}`` + :Description: Selects ``num`` buckets of the given type from within the + current bucket. ``{num}`` is usually the number of replicas in + the pool (in other words, the pool size). + + - If ``{num} == 0``, choose ``pool-num-replicas`` buckets (as many buckets as are available). + - If ``pool-num-replicas > {num} > 0``, choose that many buckets. + - If ``{num} < 0``, choose ``pool-num-replicas - {num}`` buckets. + + :Purpose: A component of the rule. + :Prerequisite: Follows ``step take`` or ``step choose``. + :Example: ``step choose firstn 1 type row`` + + +``step chooseleaf firstn {num} type {bucket-type}`` + :Description: Selects a set of buckets of the given type and chooses a leaf + node (that is, an OSD) from the subtree of each bucket in that set of buckets. The + number of buckets in the set is usually the number of replicas in + the pool (in other words, the pool size). 
+ + - If ``{num} == 0``, choose ``pool-num-replicas`` buckets (as many buckets as are available). + - If ``pool-num-replicas > {num} > 0``, choose that many buckets. + - If ``{num} < 0``, choose ``pool-num-replicas - {num}`` buckets. + :Purpose: A component of the rule. Using ``chooseleaf`` obviates the need to select a device in a separate step. + :Prerequisite: Follows ``step take`` or ``step choose``. + :Example: ``step chooseleaf firstn 0 type row`` + + +``step emit`` + :Description: Outputs the current value on the top of the stack and empties + the stack. Typically used + at the end of a rule, but may also be used to choose from different + trees in the same rule. + + :Purpose: A component of the rule. + :Prerequisite: Follows ``step choose``. + :Example: ``step emit`` + +.. important:: A single CRUSH rule can be assigned to multiple pools, but + a single pool cannot have multiple CRUSH rules. + +``firstn`` or ``indep`` + + :Description: Determines which replacement strategy CRUSH uses when items (OSDs) + are marked ``down`` in the CRUSH map. When this rule is used + with replicated pools, ``firstn`` is used. When this rule is + used with erasure-coded pools, ``indep`` is used. + + Suppose that a PG is stored on OSDs 1, 2, 3, 4, and 5 and then + OSD 3 goes down. + + When in ``firstn`` mode, CRUSH simply adjusts its calculation + to select OSDs 1 and 2, then selects 3 and discovers that 3 is + down, retries and selects 4 and 5, and finally goes on to + select a new OSD: OSD 6. The final CRUSH mapping + transformation is therefore 1, 2, 3, 4, 5 → 1, 2, 4, 5, 6. + + However, if you were storing an erasure-coded pool, the above + sequence would have changed the data that is mapped to OSDs 4, + 5, and 6. The ``indep`` mode attempts to avoid this unwanted + consequence. When in ``indep`` mode, CRUSH can be expected to + select 3, discover that 3 is down, retry, and select 6. The + final CRUSH mapping transformation is therefore 1, 2, 3, 4, 5 + → 1, 2, 6, 4, 5. + +.. _crush-reclassify: + +Migrating from a legacy SSD rule to device classes +-------------------------------------------------- + +Prior to the Luminous release's introduction of the *device class* feature, in +order to write rules that applied to a specialized device type (for example, +SSD), it was necessary to manually edit the CRUSH map and maintain a parallel +hierarchy for each device type. The device class feature provides a more +transparent way to achieve this end. + +However, if your cluster is migrated from an existing manually-customized +per-device map to new device class-based rules, all data in the system will be +reshuffled. + +The ``crushtool`` utility has several commands that can transform a legacy rule +and hierarchy and allow you to start using the new device class rules. There +are three possible types of transformation: + +#. ``--reclassify-root <root-name> <device-class>`` + + This command examines everything under ``root-name`` in the hierarchy and + rewrites any rules that reference the specified root and that have the + form ``take <root-name>`` so that they instead have the + form ``take <root-name> class <device-class>``. The command also renumbers + the buckets in such a way that the old IDs are used for the specified + class's "shadow tree" and as a result no data movement takes place. 
+ + For example, suppose you have the following as an existing rule:: + + rule replicated_rule { + id 0 + type replicated + step take default + step chooseleaf firstn 0 type rack + step emit + } + + If the root ``default`` is reclassified as class ``hdd``, the new rule will + be as follows:: + + rule replicated_rule { + id 0 + type replicated + step take default class hdd + step chooseleaf firstn 0 type rack + step emit + } + +#. ``--set-subtree-class <bucket-name> <device-class>`` + + This command marks every device in the subtree that is rooted at *bucket-name* + with the specified device class. + + This command is typically used in conjunction with the ``--reclassify-root`` option + in order to ensure that all devices in that root are labeled with the + correct class. In certain circumstances, however, some of those devices + are correctly labeled with a different class and must not be relabeled. To + manage this difficulty, one can exclude the ``--set-subtree-class`` + option. The remapping process will not be perfect, because the previous rule + had an effect on devices of multiple classes but the adjusted rules will map + only to devices of the specified device class. However, when there are not many + outlier devices, the resulting level of data movement is often within tolerable + limits. + + +#. ``--reclassify-bucket <match-pattern> <device-class> <default-parent>`` + + This command allows you to merge a parallel type-specific hierarchy with the + normal hierarchy. For example, many users have maps that resemble the + following:: + + host node1 { + id -2 # do not change unnecessarily + # weight 109.152 + alg straw2 + hash 0 # rjenkins1 + item osd.0 weight 9.096 + item osd.1 weight 9.096 + item osd.2 weight 9.096 + item osd.3 weight 9.096 + item osd.4 weight 9.096 + item osd.5 weight 9.096 + ... + } + + host node1-ssd { + id -10 # do not change unnecessarily + # weight 2.000 + alg straw2 + hash 0 # rjenkins1 + item osd.80 weight 2.000 + ... + } + + root default { + id -1 # do not change unnecessarily + alg straw2 + hash 0 # rjenkins1 + item node1 weight 110.967 + ... + } + + root ssd { + id -18 # do not change unnecessarily + # weight 16.000 + alg straw2 + hash 0 # rjenkins1 + item node1-ssd weight 2.000 + ... + } + + This command reclassifies each bucket that matches a certain + pattern. The pattern can be of the form ``%suffix`` or ``prefix%``. For + example, in the above example, we would use the pattern + ``%-ssd``. For each matched bucket, the remaining portion of the + name (corresponding to the ``%`` wildcard) specifies the *base bucket*. All + devices in the matched bucket are labeled with the specified + device class and then moved to the base bucket. If the base bucket + does not exist (for example, ``node12-ssd`` exists but ``node12`` does + not), then it is created and linked under the specified + *default parent* bucket. In each case, care is taken to preserve + the old bucket IDs for the new shadow buckets in order to prevent data + movement. Any rules with ``take`` steps that reference the old + buckets are adjusted accordingly. + + +#. ``--reclassify-bucket <bucket-name> <device-class> <base-bucket>`` + + The same command can also be used without a wildcard in order to map a + single bucket. For example, in the previous example, we want the + ``ssd`` bucket to be mapped to the ``default`` bucket. + +#. The final command to convert the map that consists of the above fragments + resembles the following: + + .. 
prompt:: bash $ + + ceph osd getcrushmap -o original + crushtool -i original --reclassify \ + --set-subtree-class default hdd \ + --reclassify-root default hdd \ + --reclassify-bucket %-ssd ssd default \ + --reclassify-bucket ssd ssd default \ + -o adjusted + +``--compare`` flag +------------------ + +A ``--compare`` flag is available to make sure that the conversion performed in +:ref:`Migrating from a legacy SSD rule to device classes <crush-reclassify>` is +correct. This flag tests a large sample of inputs against the CRUSH map and +checks that the expected result is output. The options that control these +inputs are the same as the options that apply to the ``--test`` command. For an +illustration of how this ``--compare`` command applies to the above example, +see the following: + +.. prompt:: bash $ + + crushtool -i original --compare adjusted + +:: + + rule 0 had 0/10240 mismatched mappings (0) + rule 1 had 0/10240 mismatched mappings (0) + maps appear equivalent + +If the command finds any differences, the ratio of remapped inputs is reported +in the parentheses. + +When you are satisfied with the adjusted map, apply it to the cluster by +running the following command: + +.. prompt:: bash $ + + ceph osd setcrushmap -i adjusted + +Manually Tuning CRUSH +--------------------- + +If you have verified that all clients are running recent code, you can adjust +the CRUSH tunables by extracting the CRUSH map, modifying the values, and +reinjecting the map into the cluster. The procedure is carried out as follows: + +#. Extract the latest CRUSH map: + + .. prompt:: bash $ + + ceph osd getcrushmap -o /tmp/crush + +#. Adjust tunables. In our tests, the following values appear to result in the + best behavior for both large and small clusters. The procedure requires that + you specify the ``--enable-unsafe-tunables`` flag in the ``crushtool`` + command. Use this option with **extreme care**: + + .. prompt:: bash $ + + crushtool -i /tmp/crush --set-choose-local-tries 0 --set-choose-local-fallback-tries 0 --set-choose-total-tries 50 -o /tmp/crush.new + +#. Reinject the modified map: + + .. prompt:: bash $ + + ceph osd setcrushmap -i /tmp/crush.new + +Legacy values +------------- + +To set the legacy values of the CRUSH tunables, run the following command: + +.. prompt:: bash $ + + crushtool -i /tmp/crush --set-choose-local-tries 2 --set-choose-local-fallback-tries 5 --set-choose-total-tries 19 --set-chooseleaf-descend-once 0 --set-chooseleaf-vary-r 0 -o /tmp/crush.legacy + +The special ``--enable-unsafe-tunables`` flag is required. Be careful when +running old versions of the ``ceph-osd`` daemon after reverting to legacy +values, because the feature bit is not perfectly enforced. + +.. _CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data: https://ceph.io/assets/pdfs/weil-crush-sc06.pdf diff --git a/doc/rados/operations/crush-map.rst b/doc/rados/operations/crush-map.rst new file mode 100644 index 000000000..39151e6d4 --- /dev/null +++ b/doc/rados/operations/crush-map.rst @@ -0,0 +1,1147 @@ +============ + CRUSH Maps +============ + +The :abbr:`CRUSH (Controlled Replication Under Scalable Hashing)` algorithm +computes storage locations in order to determine how to store and retrieve +data. CRUSH allows Ceph clients to communicate with OSDs directly rather than +through a centralized server or broker. 
By using an algorithmically-determined +method of storing and retrieving data, Ceph avoids a single point of failure, a +performance bottleneck, and a physical limit to its scalability. + +CRUSH uses a map of the cluster (the CRUSH map) to map data to OSDs, +distributing the data across the cluster in accordance with configured +replication policy and failure domains. For a detailed discussion of CRUSH, see +`CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data`_ + +CRUSH maps contain a list of :abbr:`OSDs (Object Storage Devices)` and a +hierarchy of "buckets" (``host``\s, ``rack``\s) and rules that govern how CRUSH +replicates data within the cluster's pools. By reflecting the underlying +physical organization of the installation, CRUSH can model (and thereby +address) the potential for correlated device failures. Some factors relevant +to the CRUSH hierarchy include chassis, racks, physical proximity, a shared +power source, shared networking, and failure domains. By encoding this +information into the CRUSH map, CRUSH placement policies distribute object +replicas across failure domains while maintaining the desired distribution. For +example, to address the possibility of concurrent failures, it might be +desirable to ensure that data replicas are on devices that reside in or rely +upon different shelves, racks, power supplies, controllers, or physical +locations. + +When OSDs are deployed, they are automatically added to the CRUSH map under a +``host`` bucket that is named for the node on which the OSDs run. This +behavior, combined with the configured CRUSH failure domain, ensures that +replicas or erasure-code shards are distributed across hosts and that the +failure of a single host or other kinds of failures will not affect +availability. For larger clusters, administrators must carefully consider their +choice of failure domain. For example, distributing replicas across racks is +typical for mid- to large-sized clusters. + + +CRUSH Location +============== + +The location of an OSD within the CRUSH map's hierarchy is referred to as its +``CRUSH location``. The specification of a CRUSH location takes the form of a +list of key-value pairs. For example, if an OSD is in a particular row, rack, +chassis, and host, and is also part of the 'default' CRUSH root (which is the +case for most clusters), its CRUSH location can be specified as follows:: + + root=default row=a rack=a2 chassis=a2a host=a2a1 + +.. note:: + + #. The order of the keys does not matter. + #. The key name (left of ``=``) must be a valid CRUSH ``type``. By default, + valid CRUSH types include ``root``, ``datacenter``, ``room``, ``row``, + ``pod``, ``pdu``, ``rack``, ``chassis``, and ``host``. These defined + types suffice for nearly all clusters, but can be customized by + modifying the CRUSH map. + #. Not all keys need to be specified. For example, by default, Ceph + automatically sets an ``OSD``'s location as ``root=default + host=HOSTNAME`` (as determined by the output of ``hostname -s``). + +The CRUSH location for an OSD can be modified by adding the ``crush location`` +option in ``ceph.conf``. When this option has been added, every time the OSD +starts it verifies that it is in the correct location in the CRUSH map and +moves itself if it is not. To disable this automatic CRUSH map management, add +the following to the ``ceph.conf`` configuration file in the ``[osd]`` +section:: + + osd crush update on start = false + +Note that this action is unnecessary in most cases. 
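+
+For example, a minimal sketch of pinning an OSD's CRUSH location through
+``ceph.conf`` might look like the following. The bucket names are taken from
+the example location string above and are placeholders for your own topology,
+and the option is assumed to live in the host's ``[osd]`` section::
+
+   [osd]
+   crush location = root=default row=a rack=a2 chassis=a2a host=a2a1
+
+With this in place, the OSDs on that host verify their position against the
+CRUSH map at every start, as described above, unless ``osd crush update on
+start = false`` is set.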
+
+
+Custom location hooks
+---------------------
+
+A custom location hook can be used to generate a more complete CRUSH location
+on startup. The CRUSH location is determined by, in order of preference:
+
+#. A ``crush location`` option in ``ceph.conf``
+#. A default of ``root=default host=HOSTNAME`` where the hostname is determined
+   by the output of the ``hostname -s`` command
+
+A script can be written to provide additional location fields (for example,
+``rack`` or ``datacenter``) and the hook can be enabled via the following
+config option::
+
+   crush location hook = /path/to/customized-ceph-crush-location
+
+This hook is passed several arguments (see below). The hook outputs a single
+line to ``stdout`` that contains the CRUSH location description. The output
+resembles the following::
+
+   --cluster CLUSTER --id ID --type TYPE
+
+Here the cluster name is typically ``ceph``, the ``id`` is the daemon
+identifier or (in the case of OSDs) the OSD number, and the daemon type is
+``osd``, ``mds``, ``mgr``, or ``mon``.
+
+For example, a simple hook that specifies a rack location via a value in the
+file ``/etc/rack`` might be as follows::
+
+   #!/bin/sh
+   echo "host=$(hostname -s) rack=$(cat /etc/rack) root=default"
+
+
+CRUSH structure
+===============
+
+The CRUSH map consists of (1) a hierarchy that describes the physical topology
+of the cluster and (2) a set of rules that defines data placement policy. The
+hierarchy has devices (OSDs) at the leaves and internal nodes corresponding to
+other physical features or groupings: hosts, racks, rows, data centers, and so
+on. The rules determine how replicas are placed in terms of that hierarchy (for
+example, 'three replicas in different racks').
+
+Devices
+-------
+
+Devices are individual OSDs that store data (usually one device for each
+storage drive). Devices are identified by an ``id`` (a non-negative integer)
+and a ``name`` (usually ``osd.N``, where ``N`` is the device's ``id``).
+
+In Luminous and later releases, OSDs can have a *device class* assigned (for
+example, ``hdd`` or ``ssd`` or ``nvme``), allowing them to be targeted by CRUSH
+rules. Device classes are especially useful when mixing device types within
+hosts.
+
+.. _crush_map_default_types:
+
+Types and Buckets
+-----------------
+
+"Bucket", in the context of CRUSH, is a term for any of the internal nodes in
+the hierarchy: hosts, racks, rows, and so on. The CRUSH map defines a series of
+*types* that are used to identify these nodes. Default types include:
+
+- ``osd`` (or ``device``)
+- ``host``
+- ``chassis``
+- ``rack``
+- ``row``
+- ``pdu``
+- ``pod``
+- ``room``
+- ``datacenter``
+- ``zone``
+- ``region``
+- ``root``
+
+Most clusters use only a handful of these types, and other types can be defined
+as needed.
+
+The hierarchy is built with devices (normally of type ``osd``) at the leaves
+and non-device types as the internal nodes. The root node is of type ``root``.
+For example:
+
+
+.. 
ditaa:: + + +-----------------+ + |{o}root default | + +--------+--------+ + | + +---------------+---------------+ + | | + +------+------+ +------+------+ + |{o}host foo | |{o}host bar | + +------+------+ +------+------+ + | | + +-------+-------+ +-------+-------+ + | | | | + +-----+-----+ +-----+-----+ +-----+-----+ +-----+-----+ + | osd.0 | | osd.1 | | osd.2 | | osd.3 | + +-----------+ +-----------+ +-----------+ +-----------+ + + +Each node (device or bucket) in the hierarchy has a *weight* that indicates the +relative proportion of the total data that should be stored by that device or +hierarchy subtree. Weights are set at the leaves, indicating the size of the +device. These weights automatically sum in an 'up the tree' direction: that is, +the weight of the ``root`` node will be the sum of the weights of all devices +contained under it. Weights are typically measured in tebibytes (TiB). + +To get a simple view of the cluster's CRUSH hierarchy, including weights, run +the following command: + +.. prompt:: bash $ + + ceph osd tree + +Rules +----- + +CRUSH rules define policy governing how data is distributed across the devices +in the hierarchy. The rules define placement as well as replication strategies +or distribution policies that allow you to specify exactly how CRUSH places +data replicas. For example, you might create one rule selecting a pair of +targets for two-way mirroring, another rule for selecting three targets in two +different data centers for three-way replication, and yet another rule for +erasure coding across six storage devices. For a detailed discussion of CRUSH +rules, see **Section 3.2** of `CRUSH - Controlled, Scalable, Decentralized +Placement of Replicated Data`_. + +CRUSH rules can be created via the command-line by specifying the *pool type* +that they will govern (replicated or erasure coded), the *failure domain*, and +optionally a *device class*. In rare cases, CRUSH rules must be created by +manually editing the CRUSH map. + +To see the rules that are defined for the cluster, run the following command: + +.. prompt:: bash $ + + ceph osd crush rule ls + +To view the contents of the rules, run the following command: + +.. prompt:: bash $ + + ceph osd crush rule dump + +.. _device_classes: + +Device classes +-------------- + +Each device can optionally have a *class* assigned. By default, OSDs +automatically set their class at startup to `hdd`, `ssd`, or `nvme` in +accordance with the type of device they are backed by. + +To explicitly set the device class of one or more OSDs, run a command of the +following form: + +.. prompt:: bash $ + + ceph osd crush set-device-class <class> <osd-name> [...] + +Once a device class has been set, it cannot be changed to another class until +the old class is unset. To remove the old class of one or more OSDs, run a +command of the following form: + +.. prompt:: bash $ + + ceph osd crush rm-device-class <osd-name> [...] + +This restriction allows administrators to set device classes that won't be +changed on OSD restart or by a script. + +To create a placement rule that targets a specific device class, run a command +of the following form: + +.. prompt:: bash $ + + ceph osd crush rule create-replicated <rule-name> <root> <failure-domain> <class> + +To apply the new placement rule to a specific pool, run a command of the +following form: + +.. prompt:: bash $ + + ceph osd pool set <pool-name> crush_rule <rule-name> + +Device classes are implemented by creating one or more "shadow" CRUSH +hierarchies. 
For each device class in use, there will be a shadow hierarchy +that contains only devices of that class. CRUSH rules can then distribute data +across the relevant shadow hierarchy. This approach is fully backward +compatible with older Ceph clients. To view the CRUSH hierarchy with shadow +items displayed, run the following command: + +.. prompt:: bash # + + ceph osd crush tree --show-shadow + +Some older clusters that were created before the Luminous release rely on +manually crafted CRUSH maps to maintain per-device-type hierarchies. For these +clusters, there is a *reclassify* tool available that can help them transition +to device classes without triggering unwanted data movement (see +:ref:`crush-reclassify`). + +Weight sets +----------- + +A *weight set* is an alternative set of weights to use when calculating data +placement. The normal weights associated with each device in the CRUSH map are +set in accordance with the device size and indicate how much data should be +stored where. However, because CRUSH is a probabilistic pseudorandom placement +process, there is always some variation from this ideal distribution (in the +same way that rolling a die sixty times will likely not result in exactly ten +ones and ten sixes). Weight sets allow the cluster to perform numerical +optimization based on the specifics of your cluster (for example: hierarchy, +pools) to achieve a balanced distribution. + +Ceph supports two types of weight sets: + +#. A **compat** weight set is a single alternative set of weights for each + device and each node in the cluster. Compat weight sets cannot be expected + to correct all anomalies (for example, PGs for different pools might be of + different sizes and have different load levels, but are mostly treated alike + by the balancer). However, they have the major advantage of being *backward + compatible* with previous versions of Ceph. This means that even though + weight sets were first introduced in Luminous v12.2.z, older clients (for + example, Firefly) can still connect to the cluster when a compat weight set + is being used to balance data. + +#. A **per-pool** weight set is more flexible in that it allows placement to + be optimized for each data pool. Additionally, weights can be adjusted + for each position of placement, allowing the optimizer to correct for a + subtle skew of data toward devices with small weights relative to their + peers (an effect that is usually apparent only in very large clusters + but that can cause balancing problems). + +When weight sets are in use, the weights associated with each node in the +hierarchy are visible in a separate column (labeled either as ``(compat)`` or +as the pool name) in the output of the following command: + +.. prompt:: bash # + + ceph osd tree + +If both *compat* and *per-pool* weight sets are in use, data placement for a +particular pool will use its own per-pool weight set if present. If only +*compat* weight sets are in use, data placement will use the compat weight set. +If neither are in use, data placement will use the normal CRUSH weights. + +Although weight sets can be set up and adjusted manually, we recommend enabling +the ``ceph-mgr`` *balancer* module to perform these tasks automatically if the +cluster is running Luminous or a later release. + +Modifying the CRUSH map +======================= + +.. _addosd: + +Adding/Moving an OSD +-------------------- + +.. note:: Under normal conditions, OSDs automatically add themselves to the + CRUSH map when they are created. 
The command in this section is rarely + needed. + + +To add or move an OSD in the CRUSH map of a running cluster, run a command of +the following form: + +.. prompt:: bash $ + + ceph osd crush set {name} {weight} root={root} [{bucket-type}={bucket-name} ...] + +For details on this command's parameters, see the following: + +``name`` + :Description: The full name of the OSD. + :Type: String + :Required: Yes + :Example: ``osd.0`` + + +``weight`` + :Description: The CRUSH weight of the OSD. Normally, this is its size, as measured in terabytes (TB). + :Type: Double + :Required: Yes + :Example: ``2.0`` + + +``root`` + :Description: The root node of the CRUSH hierarchy in which the OSD resides (normally ``default``). + :Type: Key-value pair. + :Required: Yes + :Example: ``root=default`` + + +``bucket-type`` + :Description: The OSD's location in the CRUSH hierarchy. + :Type: Key-value pairs. + :Required: No + :Example: ``datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1`` + +In the following example, the command adds ``osd.0`` to the hierarchy, or moves +``osd.0`` from a previous location: + +.. prompt:: bash $ + + ceph osd crush set osd.0 1.0 root=default datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1 + + +Adjusting OSD weight +-------------------- + +.. note:: Under normal conditions, OSDs automatically add themselves to the + CRUSH map with the correct weight when they are created. The command in this + section is rarely needed. + +To adjust an OSD's CRUSH weight in a running cluster, run a command of the +following form: + +.. prompt:: bash $ + + ceph osd crush reweight {name} {weight} + +For details on this command's parameters, see the following: + +``name`` + :Description: The full name of the OSD. + :Type: String + :Required: Yes + :Example: ``osd.0`` + + +``weight`` + :Description: The CRUSH weight of the OSD. + :Type: Double + :Required: Yes + :Example: ``2.0`` + + +.. _removeosd: + +Removing an OSD +--------------- + +.. note:: OSDs are normally removed from the CRUSH map as a result of the + `ceph osd purge`` command. This command is rarely needed. + +To remove an OSD from the CRUSH map of a running cluster, run a command of the +following form: + +.. prompt:: bash $ + + ceph osd crush remove {name} + +For details on the ``name`` parameter, see the following: + +``name`` + :Description: The full name of the OSD. + :Type: String + :Required: Yes + :Example: ``osd.0`` + + +Adding a CRUSH Bucket +--------------------- + +.. note:: Buckets are implicitly created when an OSD is added and the command + that creates it specifies a ``{bucket-type}={bucket-name}`` as part of the + OSD's location (provided that a bucket with that name does not already + exist). The command in this section is typically used when manually + adjusting the structure of the hierarchy after OSDs have already been + created. One use of this command is to move a series of hosts to a new + rack-level bucket. Another use of this command is to add new ``host`` + buckets (OSD nodes) to a dummy ``root`` so that the buckets don't receive + any data until they are ready to receive data. When they are ready, move the + buckets to the ``default`` root or to any other root as described below. + +To add a bucket in the CRUSH map of a running cluster, run a command of the +following form: + +.. prompt:: bash $ + + ceph osd crush add-bucket {bucket-name} {bucket-type} + +For details on this command's parameters, see the following: + +``bucket-name`` + :Description: The full name of the bucket. 
+ :Type: String + :Required: Yes + :Example: ``rack12`` + + +``bucket-type`` + :Description: The type of the bucket. This type must already exist in the CRUSH hierarchy. + :Type: String + :Required: Yes + :Example: ``rack`` + +In the following example, the command adds the ``rack12`` bucket to the hierarchy: + +.. prompt:: bash $ + + ceph osd crush add-bucket rack12 rack + +Moving a Bucket +--------------- + +To move a bucket to a different location or position in the CRUSH map +hierarchy, run a command of the following form: + +.. prompt:: bash $ + + ceph osd crush move {bucket-name} {bucket-type}={bucket-name}, [...] + +For details on this command's parameters, see the following: + +``bucket-name`` + :Description: The name of the bucket that you are moving. + :Type: String + :Required: Yes + :Example: ``foo-bar-1`` + +``bucket-type`` + :Description: The bucket's new location in the CRUSH hierarchy. + :Type: Key-value pairs. + :Required: No + :Example: ``datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1`` + +Removing a Bucket +----------------- + +To remove a bucket from the CRUSH hierarchy, run a command of the following +form: + +.. prompt:: bash $ + + ceph osd crush remove {bucket-name} + +.. note:: A bucket must already be empty before it is removed from the CRUSH + hierarchy. In other words, there must not be OSDs or any other CRUSH buckets + within it. + +For details on the ``bucket-name`` parameter, see the following: + +``bucket-name`` + :Description: The name of the bucket that is being removed. + :Type: String + :Required: Yes + :Example: ``rack12`` + +In the following example, the command removes the ``rack12`` bucket from the +hierarchy: + +.. prompt:: bash $ + + ceph osd crush remove rack12 + +Creating a compat weight set +---------------------------- + +.. note:: Normally this action is done automatically if needed by the + ``balancer`` module (provided that the module is enabled). + +To create a *compat* weight set, run the following command: + +.. prompt:: bash $ + + ceph osd crush weight-set create-compat + +To adjust the weights of the compat weight set, run a command of the following +form: + +.. prompt:: bash $ + + ceph osd crush weight-set reweight-compat {name} {weight} + +To destroy the compat weight set, run the following command: + +.. prompt:: bash $ + + ceph osd crush weight-set rm-compat + +Creating per-pool weight sets +----------------------------- + +To create a weight set for a specific pool, run a command of the following +form: + +.. prompt:: bash $ + + ceph osd crush weight-set create {pool-name} {mode} + +.. note:: Per-pool weight sets can be used only if all servers and daemons are + running Luminous v12.2.z or a later release. + +For details on this command's parameters, see the following: + +``pool-name`` + :Description: The name of a RADOS pool. + :Type: String + :Required: Yes + :Example: ``rbd`` + +``mode`` + :Description: Either ``flat`` or ``positional``. A *flat* weight set + assigns a single weight to all devices or buckets. A + *positional* weight set has a potentially different + weight for each position in the resulting placement + mapping. For example: if a pool has a replica count of + ``3``, then a positional weight set will have three + weights for each device and bucket. + :Type: String + :Required: Yes + :Example: ``flat`` + +To adjust the weight of an item in a weight set, run a command of the following +form: + +.. 
prompt:: bash $

   ceph osd crush weight-set reweight {pool-name} {item-name} {weight [...]}

To list existing weight sets, run the following command:

.. prompt:: bash $

   ceph osd crush weight-set ls

To remove a weight set, run a command of the following form:

.. prompt:: bash $

   ceph osd crush weight-set rm {pool-name}


Creating a rule for a replicated pool
-------------------------------------

When you create a CRUSH rule for a replicated pool, there is an important
decision to make: selecting a failure domain. For example, if you select a
failure domain of ``host``, then CRUSH will ensure that each replica of the
data is stored on a unique host. Alternatively, if you select a failure domain
of ``rack``, then each replica of the data will be stored in a different rack.
Your selection of failure domain should be guided by the size of your cluster
and by its CRUSH topology.

The entire cluster hierarchy is typically nested beneath a root node that is
named ``default``. If you have customized your hierarchy, you might want to
create a rule nested beneath some other node in the hierarchy. In creating
this rule for the customized hierarchy, the node type doesn't matter, and in
particular the rule does not have to be nested beneath a ``root`` node.

It is possible to create a rule that restricts data placement to a specific
*class* of device. By default, Ceph OSDs automatically classify themselves as
either ``hdd`` or ``ssd`` in accordance with the underlying type of device
being used. These device classes can be customized. One might set the ``device
class`` of OSDs to ``nvme`` to distinguish them from SATA SSDs, or one might
set them to something arbitrary like ``ssd-testing`` or ``ssd-ethel`` so that
rules and pools may be flexibly constrained to use (or avoid using) specific
subsets of OSDs based on specific requirements.

To create a rule for a replicated pool, run a command of the following form:

.. prompt:: bash $

   ceph osd crush rule create-replicated {name} {root} {failure-domain-type} [{class}]

For details on this command's parameters, see the following:

``name``
  :Description: The name of the rule.
  :Type: String
  :Required: Yes
  :Example: ``rbd-rule``

``root``
  :Description: The name of the CRUSH hierarchy node under which data is to be placed.
  :Type: String
  :Required: Yes
  :Example: ``default``

``failure-domain-type``
  :Description: The type of CRUSH nodes used for the replicas of the failure domain.
  :Type: String
  :Required: Yes
  :Example: ``rack``

``class``
  :Description: The device class on which data is to be placed.
  :Type: String
  :Required: No
  :Example: ``ssd``

Creating a rule for an erasure-coded pool
-----------------------------------------

For an erasure-coded pool, similar decisions need to be made: what the failure
domain is, which node in the hierarchy data will be placed under (usually
``default``), and whether placement is restricted to a specific device class.
However, erasure-code pools are created in a different way: there is a need to
construct them carefully with reference to the erasure code plugin in use. For
this reason, these decisions must be incorporated into the **erasure-code
profile**. A CRUSH rule will then be created from the erasure-code profile,
either explicitly or automatically when the profile is used to create a pool.

To list the erasure-code profiles, run the following command:

..
prompt:: bash $ + + ceph osd erasure-code-profile ls + +To view a specific existing profile, run a command of the following form: + +.. prompt:: bash $ + + ceph osd erasure-code-profile get {profile-name} + +Under normal conditions, profiles should never be modified; instead, a new +profile should be created and used when creating either a new pool or a new +rule for an existing pool. + +An erasure-code profile consists of a set of key-value pairs. Most of these +key-value pairs govern the behavior of the erasure code that encodes data in +the pool. However, key-value pairs that begin with ``crush-`` govern the CRUSH +rule that is created. + +The relevant erasure-code profile properties are as follows: + + * **crush-root**: the name of the CRUSH node under which to place data + [default: ``default``]. + * **crush-failure-domain**: the CRUSH bucket type used in the distribution of + erasure-coded shards [default: ``host``]. + * **crush-device-class**: the device class on which to place data [default: + none, which means that all devices are used]. + * **k** and **m** (and, for the ``lrc`` plugin, **l**): these determine the + number of erasure-code shards, affecting the resulting CRUSH rule. + + After a profile is defined, you can create a CRUSH rule by running a command + of the following form: + +.. prompt:: bash $ + + ceph osd crush rule create-erasure {name} {profile-name} + +.. note: When creating a new pool, it is not necessary to create the rule + explicitly. If only the erasure-code profile is specified and the rule + argument is omitted, then Ceph will create the CRUSH rule automatically. + + +Deleting rules +-------------- + +To delete rules that are not in use by pools, run a command of the following +form: + +.. prompt:: bash $ + + ceph osd crush rule rm {rule-name} + +.. _crush-map-tunables: + +Tunables +======== + +The CRUSH algorithm that is used to calculate the placement of data has been +improved over time. In order to support changes in behavior, we have provided +users with sets of tunables that determine which legacy or optimal version of +CRUSH is to be used. + +In order to use newer tunables, all Ceph clients and daemons must support the +new major release of CRUSH. Because of this requirement, we have created +``profiles`` that are named after the Ceph version in which they were +introduced. For example, the ``firefly`` tunables were first supported by the +Firefly release and do not work with older clients (for example, clients +running Dumpling). After a cluster's tunables profile is changed from a legacy +set to a newer or ``optimal`` set, the ``ceph-mon`` and ``ceph-osd`` options +will prevent older clients that do not support the new CRUSH features from +connecting to the cluster. + +argonaut (legacy) +----------------- + +The legacy CRUSH behavior used by Argonaut and older releases works fine for +most clusters, provided that not many OSDs have been marked ``out``. + +bobtail (CRUSH_TUNABLES2) +------------------------- + +The ``bobtail`` tunable profile provides the following improvements: + + * For hierarchies with a small number of devices in leaf buckets, some PGs + might map to fewer than the desired number of replicas, resulting in + ``undersized`` PGs. This is known to happen in the case of hierarchies with + ``host`` nodes that have a small number of OSDs (1 to 3) nested beneath each + host. + + * For large clusters, a small percentage of PGs might map to fewer than the + desired number of OSDs. 
This is known to happen when there are multiple + hierarchy layers in use (for example,, ``row``, ``rack``, ``host``, + ``osd``). + + * When one or more OSDs are marked ``out``, data tends to be redistributed + to nearby OSDs instead of across the entire hierarchy. + +The tunables introduced in the Bobtail release are as follows: + + * ``choose_local_tries``: Number of local retries. The legacy value is ``2``, + and the optimal value is ``0``. + + * ``choose_local_fallback_tries``: The legacy value is ``5``, and the optimal + value is 0. + + * ``choose_total_tries``: Total number of attempts to choose an item. The + legacy value is ``19``, but subsequent testing indicates that a value of + ``50`` is more appropriate for typical clusters. For extremely large + clusters, an even larger value might be necessary. + + * ``chooseleaf_descend_once``: Whether a recursive ``chooseleaf`` attempt will + retry, or try only once and allow the original placement to retry. The + legacy default is ``0``, and the optimal value is ``1``. + +Migration impact: + + * Moving from the ``argonaut`` tunables to the ``bobtail`` tunables triggers a + moderate amount of data movement. Use caution on a cluster that is already + populated with data. + +firefly (CRUSH_TUNABLES3) +------------------------- + +chooseleaf_vary_r +~~~~~~~~~~~~~~~~~ + +This ``firefly`` tunable profile fixes a problem with ``chooseleaf`` CRUSH step +behavior. This problem arose when a large fraction of OSDs were marked ``out``, which resulted in PG mappings with too few OSDs. + +This profile was introduced in the Firefly release, and adds a new tunable as follows: + + * ``chooseleaf_vary_r``: Whether a recursive chooseleaf attempt will start + with a non-zero value of ``r``, as determined by the number of attempts the + parent has already made. The legacy default value is ``0``, but with this + value CRUSH is sometimes unable to find a mapping. The optimal value (in + terms of computational cost and correctness) is ``1``. + +Migration impact: + + * For existing clusters that store a great deal of data, changing this tunable + from ``0`` to ``1`` will trigger a large amount of data migration; a value + of ``4`` or ``5`` will allow CRUSH to still find a valid mapping and will + cause less data to move. + +straw_calc_version tunable +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +There were problems with the internal weights calculated and stored in the +CRUSH map for ``straw`` algorithm buckets. When there were buckets with a CRUSH +weight of ``0`` or with a mix of different and unique weights, CRUSH would +distribute data incorrectly (that is, not in proportion to the weights). + +This tunable, introduced in the Firefly release, is as follows: + + * ``straw_calc_version``: A value of ``0`` preserves the old, broken + internal-weight calculation; a value of ``1`` fixes the problem. + +Migration impact: + + * Changing this tunable to a value of ``1`` and then adjusting a straw bucket + (either by adding, removing, or reweighting an item or by using the + reweight-all command) can trigger a small to moderate amount of data + movement provided that the cluster has hit one of the problematic + conditions. + +This tunable option is notable in that it has absolutely no impact on the +required kernel version in the client side. + +hammer (CRUSH_V4) +----------------- + +The ``hammer`` tunable profile does not affect the mapping of existing CRUSH +maps simply by changing the profile. However: + + * There is a new bucket algorithm supported: ``straw2``. 
This new algorithm + fixes several limitations in the original ``straw``. More specifically, the + old ``straw`` buckets would change some mappings that should not have + changed when a weight was adjusted, while ``straw2`` achieves the original + goal of changing mappings only to or from the bucket item whose weight has + changed. + + * The ``straw2`` type is the default type for any newly created buckets. + +Migration impact: + + * Changing a bucket type from ``straw`` to ``straw2`` will trigger a small + amount of data movement, depending on how much the bucket items' weights + vary from each other. When the weights are all the same no data will move, + and the more variance there is in the weights the more movement there will + be. + +jewel (CRUSH_TUNABLES5) +----------------------- + +The ``jewel`` tunable profile improves the overall behavior of CRUSH. As a +result, significantly fewer mappings change when an OSD is marked ``out`` of +the cluster. This improvement results in significantly less data movement. + +The new tunable introduced in the Jewel release is as follows: + + * ``chooseleaf_stable``: Determines whether a recursive chooseleaf attempt + will use a better value for an inner loop that greatly reduces the number of + mapping changes when an OSD is marked ``out``. The legacy value is ``0``, + and the new value of ``1`` uses the new approach. + +Migration impact: + + * Changing this value on an existing cluster will result in a very large + amount of data movement because nearly every PG mapping is likely to change. + +Client versions that support CRUSH_TUNABLES2 +-------------------------------------------- + + * v0.55 and later, including Bobtail (v0.56.x) + * Linux kernel version v3.9 and later (for the CephFS and RBD kernel clients) + +Client versions that support CRUSH_TUNABLES3 +-------------------------------------------- + + * v0.78 (Firefly) and later + * Linux kernel version v3.15 and later (for the CephFS and RBD kernel clients) + +Client versions that support CRUSH_V4 +------------------------------------- + + * v0.94 (Hammer) and later + * Linux kernel version v4.1 and later (for the CephFS and RBD kernel clients) + +Client versions that support CRUSH_TUNABLES5 +-------------------------------------------- + + * v10.0.2 (Jewel) and later + * Linux kernel version v4.5 and later (for the CephFS and RBD kernel clients) + +"Non-optimal tunables" warning +------------------------------ + +In v0.74 and later versions, Ceph will raise a health check ("HEALTH_WARN crush +map has non-optimal tunables") if any of the current CRUSH tunables have +non-optimal values: that is, if any fail to have the optimal values from the +:ref:` ``default`` profile +<rados_operations_crush_map_default_profile_definition>`. There are two +different ways to silence the alert: + +1. Adjust the CRUSH tunables on the existing cluster so as to render them + optimal. Making this adjustment will trigger some data movement + (possibly as much as 10%). This approach is generally preferred to the + other approach, but special care must be taken in situations where + data movement might affect performance: for example, in production clusters. + To enable optimal tunables, run the following command: + + .. prompt:: bash $ + + ceph osd crush tunables optimal + + There are several potential problems that might make it preferable to revert + to the previous values of the tunables. 
The new values might generate too + much load for the cluster to handle, the new values might unacceptably slow + the operation of the cluster, or there might be a client-compatibility + problem. Such client-compatibility problems can arise when using old-kernel + CephFS or RBD clients, or pre-Bobtail ``librados`` clients. To revert to + the previous values of the tunables, run the following command: + + .. prompt:: bash $ + + ceph osd crush tunables legacy + +2. To silence the alert without making any changes to CRUSH, + add the following option to the ``[mon]`` section of your ceph.conf file:: + + mon_warn_on_legacy_crush_tunables = false + + In order for this change to take effect, you will need to either restart + the monitors or run the following command to apply the option to the + monitors while they are still running: + + .. prompt:: bash $ + + ceph tell mon.\* config set mon_warn_on_legacy_crush_tunables false + + +Tuning CRUSH +------------ + +When making adjustments to CRUSH tunables, keep the following considerations in +mind: + + * Adjusting the values of CRUSH tunables will result in the shift of one or + more PGs from one storage node to another. If the Ceph cluster is already + storing a great deal of data, be prepared for significant data movement. + * When the ``ceph-osd`` and ``ceph-mon`` daemons get the updated map, they + immediately begin rejecting new connections from clients that do not support + the new feature. However, already-connected clients are effectively + grandfathered in, and any of these clients that do not support the new + feature will malfunction. + * If the CRUSH tunables are set to newer (non-legacy) values and subsequently + reverted to the legacy values, ``ceph-osd`` daemons will not be required to + support any of the newer CRUSH features associated with the newer + (non-legacy) values. However, the OSD peering process requires the + examination and understanding of old maps. For this reason, **if the cluster + has previously used non-legacy CRUSH values, do not run old versions of + the** ``ceph-osd`` **daemon** -- even if the latest version of the map has + been reverted so as to use the legacy defaults. + +The simplest way to adjust CRUSH tunables is to apply them in matched sets +known as *profiles*. As of the Octopus release, Ceph supports the following +profiles: + + * ``legacy``: The legacy behavior from argonaut and earlier. + * ``argonaut``: The legacy values supported by the argonaut release. + * ``bobtail``: The values supported by the bobtail release. + * ``firefly``: The values supported by the firefly release. + * ``hammer``: The values supported by the hammer release. + * ``jewel``: The values supported by the jewel release. + * ``optimal``: The best values for the current version of Ceph. + .. _rados_operations_crush_map_default_profile_definition: + * ``default``: The default values of a new cluster that has been installed + from scratch. These values, which depend on the current version of Ceph, are + hardcoded and are typically a mix of optimal and legacy values. These + values often correspond to the ``optimal`` profile of either the previous + LTS (long-term service) release or the most recent release for which most + users are expected to have up-to-date clients. + +To apply a profile to a running cluster, run a command of the following form: + +.. prompt:: bash $ + + ceph osd crush tunables {PROFILE} + +This action might trigger a great deal of data movement. 
Consult release notes +and documentation before changing the profile on a running cluster. Consider +throttling recovery and backfill parameters in order to limit the backfill +resulting from a specific change. + +.. _CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data: https://ceph.io/assets/pdfs/weil-crush-sc06.pdf + + +Tuning Primary OSD Selection +============================ + +When a Ceph client reads or writes data, it first contacts the primary OSD in +each affected PG's acting set. By default, the first OSD in the acting set is +the primary OSD (also known as the "lead OSD"). For example, in the acting set +``[2, 3, 4]``, ``osd.2`` is listed first and is therefore the primary OSD. +However, sometimes it is clear that an OSD is not well suited to act as the +lead as compared with other OSDs (for example, if the OSD has a slow drive or a +slow controller). To prevent performance bottlenecks (especially on read +operations) and at the same time maximize the utilization of your hardware, you +can influence the selection of the primary OSD either by adjusting "primary +affinity" values, or by crafting a CRUSH rule that selects OSDs that are better +suited to act as the lead rather than other OSDs. + +To determine whether tuning Ceph's selection of primary OSDs will improve +cluster performance, pool redundancy strategy must be taken into account. For +replicated pools, this tuning can be especially useful, because by default read +operations are served from the primary OSD of each PG. For erasure-coded pools, +however, the speed of read operations can be increased by enabling **fast +read** (see :ref:`pool-settings`). + +.. _rados_ops_primary_affinity: + +Primary Affinity +---------------- + +**Primary affinity** is a characteristic of an OSD that governs the likelihood +that a given OSD will be selected as the primary OSD (or "lead OSD") in a given +acting set. A primary affinity value can be any real number in the range ``0`` +to ``1``, inclusive. + +As an example of a common scenario in which it can be useful to adjust primary +affinity values, let us suppose that a cluster contains a mix of drive sizes: +for example, suppose it contains some older racks with 1.9 TB SATA SSDs and +some newer racks with 3.84 TB SATA SSDs. The latter will on average be assigned +twice the number of PGs and will thus serve twice the number of write and read +operations -- they will be busier than the former. In such a scenario, you +might make a rough assignment of primary affinity as inversely proportional to +OSD size. Such an assignment will not be 100% optimal, but it can readily +achieve a 15% improvement in overall read throughput by means of a more even +utilization of SATA interface bandwidth and CPU cycles. This example is not +merely a thought experiment meant to illustrate the theoretical benefits of +adjusting primary affinity values; this fifteen percent improvement was +achieved on an actual Ceph cluster. + +By default, every Ceph OSD has a primary affinity value of ``1``. In a cluster +in which every OSD has this default value, all OSDs are equally likely to act +as a primary OSD. + +By reducing the value of a Ceph OSD's primary affinity, you make CRUSH less +likely to select the OSD as primary in a PG's acting set. To change the weight +value associated with a specific OSD's primary affinity, run a command of the +following form: + +.. 
prompt:: bash $ + + ceph osd primary-affinity <osd-id> <weight> + +The primary affinity of an OSD can be set to any real number in the range +``[0-1]`` inclusive, where ``0`` indicates that the OSD may not be used as +primary and ``1`` indicates that the OSD is maximally likely to be used as a +primary. When the weight is between these extremes, its value indicates roughly +how likely it is that CRUSH will select the OSD associated with it as a +primary. + +The process by which CRUSH selects the lead OSD is not a mere function of a +simple probability determined by relative affinity values. Nevertheless, +measurable results can be achieved even with first-order approximations of +desirable primary affinity values. + + +Custom CRUSH Rules +------------------ + +Some clusters balance cost and performance by mixing SSDs and HDDs in the same +replicated pool. By setting the primary affinity of HDD OSDs to ``0``, +operations will be directed to an SSD OSD in each acting set. Alternatively, +you can define a CRUSH rule that always selects an SSD OSD as the primary OSD +and then selects HDDs for the remaining OSDs. Given this rule, each PG's acting +set will contain an SSD OSD as the primary and have the remaining OSDs on HDDs. + +For example, see the following CRUSH rule:: + + rule mixed_replicated_rule { + id 11 + type replicated + step take default class ssd + step chooseleaf firstn 1 type host + step emit + step take default class hdd + step chooseleaf firstn 0 type host + step emit + } + +This rule chooses an SSD as the first OSD. For an ``N``-times replicated pool, +this rule selects ``N+1`` OSDs in order to guarantee that ``N`` copies are on +different hosts, because the first SSD OSD might be colocated with any of the +``N`` HDD OSDs. + +To avoid this extra storage requirement, you might place SSDs and HDDs in +different hosts. However, taking this approach means that all client requests +will be received by hosts with SSDs. For this reason, it might be advisable to +have faster CPUs for SSD OSDs and more modest CPUs for HDD OSDs, since the +latter will under normal circumstances perform only recovery operations. Here +the CRUSH roots ``ssd_hosts`` and ``hdd_hosts`` are under a strict requirement +not to contain any of the same servers, as seen in the following CRUSH rule:: + + rule mixed_replicated_rule_two { + id 1 + type replicated + step take ssd_hosts class ssd + step chooseleaf firstn 1 type host + step emit + step take hdd_hosts class hdd + step chooseleaf firstn -1 type host + step emit + } + +.. note:: If a primary SSD OSD fails, then requests to the associated PG will + be temporarily served from a slower HDD OSD until the PG's data has been + replicated onto the replacement primary SSD OSD. + + diff --git a/doc/rados/operations/data-placement.rst b/doc/rados/operations/data-placement.rst new file mode 100644 index 000000000..3d3be65ec --- /dev/null +++ b/doc/rados/operations/data-placement.rst @@ -0,0 +1,47 @@ +========================= + Data Placement Overview +========================= + +Ceph stores, replicates, and rebalances data objects across a RADOS cluster +dynamically. Because different users store objects in different pools for +different purposes on many OSDs, Ceph operations require a certain amount of +data- placement planning. The main data-placement planning concepts in Ceph +include: + +- **Pools:** Ceph stores data within pools, which are logical groups used for + storing objects. 
Pools manage the number of placement groups, the number of + replicas, and the CRUSH rule for the pool. To store data in a pool, it is + necessary to be an authenticated user with permissions for the pool. Ceph is + able to make snapshots of pools. For additional details, see `Pools`_. + +- **Placement Groups:** Ceph maps objects to placement groups. Placement + groups (PGs) are shards or fragments of a logical object pool that place + objects as a group into OSDs. Placement groups reduce the amount of + per-object metadata that is necessary for Ceph to store the data in OSDs. A + greater number of placement groups (for example, 100 PGs per OSD as compared + with 50 PGs per OSD) leads to better balancing. For additional details, see + :ref:`placement groups`. + +- **CRUSH Maps:** CRUSH plays a major role in allowing Ceph to scale while + avoiding certain pitfalls, such as performance bottlenecks, limitations to + scalability, and single points of failure. CRUSH maps provide the physical + topology of the cluster to the CRUSH algorithm, so that it can determine both + (1) where the data for an object and its replicas should be stored and (2) + how to store that data across failure domains so as to improve data safety. + For additional details, see `CRUSH Maps`_. + +- **Balancer:** The balancer is a feature that automatically optimizes the + distribution of placement groups across devices in order to achieve a + balanced data distribution, in order to maximize the amount of data that can + be stored in the cluster, and in order to evenly distribute the workload + across OSDs. + +It is possible to use the default values for each of the above components. +Default values are recommended for a test cluster's initial setup. However, +when planning a large Ceph cluster, values should be customized for +data-placement operations with reference to the different roles played by +pools, placement groups, and CRUSH. + +.. _Pools: ../pools +.. _CRUSH Maps: ../crush-map +.. _Balancer: ../balancer diff --git a/doc/rados/operations/devices.rst b/doc/rados/operations/devices.rst new file mode 100644 index 000000000..f92f622d5 --- /dev/null +++ b/doc/rados/operations/devices.rst @@ -0,0 +1,227 @@ +.. _devices: + +Device Management +================= + +Device management allows Ceph to address hardware failure. Ceph tracks hardware +storage devices (HDDs, SSDs) to see which devices are managed by which daemons. +Ceph also collects health metrics about these devices. By doing so, Ceph can +provide tools that predict hardware failure and can automatically respond to +hardware failure. + +Device tracking +--------------- + +To see a list of the storage devices that are in use, run the following +command: + +.. prompt:: bash $ + + ceph device ls + +Alternatively, to list devices by daemon or by host, run a command of one of +the following forms: + +.. prompt:: bash $ + + ceph device ls-by-daemon <daemon> + ceph device ls-by-host <host> + +To see information about the location of an specific device and about how the +device is being consumed, run a command of the following form: + +.. prompt:: bash $ + + ceph device info <devid> + +Identifying physical devices +---------------------------- + +To make the replacement of failed disks easier and less error-prone, you can +(in some cases) "blink" the drive's LEDs on hardware enclosures by running a +command of the following form:: + + device light on|off <devid> [ident|fault] [--force] + +.. note:: Using this command to blink the lights might not work. 
Whether it + works will depend upon such factors as your kernel revision, your SES + firmware, or the setup of your HBA. + +The ``<devid>`` parameter is the device identification. To retrieve this +information, run the following command: + +.. prompt:: bash $ + + ceph device ls + +The ``[ident|fault]`` parameter determines which kind of light will blink. By +default, the `identification` light is used. + +.. note:: This command works only if the Cephadm or the Rook `orchestrator + <https://docs.ceph.com/docs/master/mgr/orchestrator/#orchestrator-cli-module>`_ + module is enabled. To see which orchestrator module is enabled, run the + following command: + + .. prompt:: bash $ + + ceph orch status + +The command that makes the drive's LEDs blink is `lsmcli`. To customize this +command, configure it via a Jinja2 template by running commands of the +following forms:: + + ceph config-key set mgr/cephadm/blink_device_light_cmd "<template>" + ceph config-key set mgr/cephadm/<host>/blink_device_light_cmd "lsmcli local-disk-{{ ident_fault }}-led-{{'on' if on else 'off'}} --path '{{ path or dev }}'" + +The following arguments can be used to customize the Jinja2 template: + +* ``on`` + A boolean value. +* ``ident_fault`` + A string that contains `ident` or `fault`. +* ``dev`` + A string that contains the device ID: for example, `SanDisk_X400_M.2_2280_512GB_162924424784`. +* ``path`` + A string that contains the device path: for example, `/dev/sda`. + +.. _enabling-monitoring: + +Enabling monitoring +------------------- + +Ceph can also monitor the health metrics associated with your device. For +example, SATA drives implement a standard called SMART that provides a wide +range of internal metrics about the device's usage and health (for example: the +number of hours powered on, the number of power cycles, the number of +unrecoverable read errors). Other device types such as SAS and NVMe present a +similar set of metrics (via slightly different standards). All of these +metrics can be collected by Ceph via the ``smartctl`` tool. + +You can enable or disable health monitoring by running one of the following +commands: + +.. prompt:: bash $ + + ceph device monitoring on + ceph device monitoring off + +Scraping +-------- + +If monitoring is enabled, device metrics will be scraped automatically at +regular intervals. To configure that interval, run a command of the following +form: + +.. prompt:: bash $ + + ceph config set mgr mgr/devicehealth/scrape_frequency <seconds> + +By default, device metrics are scraped once every 24 hours. + +To manually scrape all devices, run the following command: + +.. prompt:: bash $ + + ceph device scrape-health-metrics + +To scrape a single device, run a command of the following form: + +.. prompt:: bash $ + + ceph device scrape-health-metrics <device-id> + +To scrape a single daemon's devices, run a command of the following form: + +.. prompt:: bash $ + + ceph device scrape-daemon-health-metrics <who> + +To retrieve the stored health metrics for a device (optionally for a specific +timestamp), run a command of the following form: + +.. prompt:: bash $ + + ceph device get-health-metrics <devid> [sample-timestamp] + +Failure prediction +------------------ + +Ceph can predict drive life expectancy and device failures by analyzing the +health metrics that it collects. The prediction modes are as follows: + +* *none*: disable device failure prediction. +* *local*: use a pre-trained prediction model from the ``ceph-mgr`` daemon. 
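
Before either prediction mode can produce useful results, monitoring must be
enabled and metrics must have been collected. A typical sequence, sketched here
using only the commands described above (the device ID is the example ID shown
earlier; substitute an ID reported by ``ceph device ls``), might be:

.. prompt:: bash $

   ceph device monitoring on
   ceph config set mgr mgr/devicehealth/scrape_frequency 86400
   ceph device scrape-health-metrics
   ceph device get-health-metrics SanDisk_X400_M.2_2280_512GB_162924424784
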
+ +To configure the prediction mode, run a command of the following form: + +.. prompt:: bash $ + + ceph config set global device_failure_prediction_mode <mode> + +Under normal conditions, failure prediction runs periodically in the +background. For this reason, life expectancy values might be populated only +after a significant amount of time has passed. The life expectancy of all +devices is displayed in the output of the following command: + +.. prompt:: bash $ + + ceph device ls + +To see the metadata of a specific device, run a command of the following form: + +.. prompt:: bash $ + + ceph device info <devid> + +To explicitly force prediction of a specific device's life expectancy, run a +command of the following form: + +.. prompt:: bash $ + + ceph device predict-life-expectancy <devid> + +In addition to Ceph's internal device failure prediction, you might have an +external source of information about device failures. To inform Ceph of a +specific device's life expectancy, run a command of the following form: + +.. prompt:: bash $ + + ceph device set-life-expectancy <devid> <from> [<to>] + +Life expectancies are expressed as a time interval. This means that the +uncertainty of the life expectancy can be expressed in the form of a range of +time, and perhaps a wide range of time. The interval's end can be left +unspecified. + +Health alerts +------------- + +The ``mgr/devicehealth/warn_threshold`` configuration option controls the +health check for an expected device failure. If the device is expected to fail +within the specified time interval, an alert is raised. + +To check the stored life expectancy of all devices and generate any appropriate +health alert, run the following command: + +.. prompt:: bash $ + + ceph device check-health + +Automatic Migration +------------------- + +The ``mgr/devicehealth/self_heal`` option (enabled by default) automatically +migrates data away from devices that are expected to fail soon. If this option +is enabled, the module marks such devices ``out`` so that automatic migration +will occur. + +.. note:: The ``mon_osd_min_up_ratio`` configuration option can help prevent + this process from cascading to total failure. If the "self heal" module + marks ``out`` so many OSDs that the ratio value of ``mon_osd_min_up_ratio`` + is exceeded, then the cluster raises the ``DEVICE_HEALTH_TOOMANY`` health + check. For instructions on what to do in this situation, see + :ref:`DEVICE_HEALTH_TOOMANY<rados_health_checks_device_health_toomany>`. + +The ``mgr/devicehealth/mark_out_threshold`` configuration option specifies the +time interval for automatic migration. If a device is expected to fail within +the specified time interval, it will be automatically marked ``out``. diff --git a/doc/rados/operations/erasure-code-clay.rst b/doc/rados/operations/erasure-code-clay.rst new file mode 100644 index 000000000..1cffa32f5 --- /dev/null +++ b/doc/rados/operations/erasure-code-clay.rst @@ -0,0 +1,240 @@ +================ +CLAY code plugin +================ + +CLAY (short for coupled-layer) codes are erasure codes designed to bring about significant savings +in terms of network bandwidth and disk IO when a failed node/OSD/rack is being repaired. Let: + + d = number of OSDs contacted during repair + +If *jerasure* is configured with *k=8* and *m=4*, losing one OSD requires +reading from the *d=8* others to repair. And recovery of say a 1GiB needs +a download of 8 X 1GiB = 8GiB of information. 
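
Stated as a worked equation, where :math:`S = 1\,\mathrm{GiB}` is the amount of
data stored on the failed OSD and :math:`k` chunks must be read (the general
:math:`kS` repair-read cost for *jerasure*/*isa* is tabulated below):

.. math::

   \text{repair read} = k \, S = 8 \times 1\,\mathrm{GiB} = 8\,\mathrm{GiB}
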
+ +However, in the case of the *clay* plugin *d* is configurable within the limits: + + k+1 <= d <= k+m-1 + +By default, the clay code plugin picks *d=k+m-1* as it provides the greatest savings in terms +of network bandwidth and disk IO. In the case of the *clay* plugin configured with +*k=8*, *m=4* and *d=11* when a single OSD fails, d=11 osds are contacted and +250MiB is downloaded from each of them, resulting in a total download of 11 X 250MiB = 2.75GiB +amount of information. More general parameters are provided below. The benefits are substantial +when the repair is carried out for a rack that stores information on the order of +Terabytes. + + +-------------+---------------------------------------------------------+ + | plugin | total amount of disk IO | + +=============+=========================================================+ + |jerasure,isa | :math:`k S` | + +-------------+---------------------------------------------------------+ + | clay | :math:`\frac{d S}{d - k + 1} = \frac{(k + m - 1) S}{m}` | + +-------------+---------------------------------------------------------+ + +where *S* is the amount of data stored on a single OSD undergoing repair. In the table above, we have +used the largest possible value of *d* as this will result in the smallest amount of data download needed +to achieve recovery from an OSD failure. + +Erasure-code profile examples +============================= + +An example configuration that can be used to observe reduced bandwidth usage: + +.. prompt:: bash $ + + ceph osd erasure-code-profile set CLAYprofile \ + plugin=clay \ + k=4 m=2 d=5 \ + crush-failure-domain=host + ceph osd pool create claypool erasure CLAYprofile + + +Creating a clay profile +======================= + +To create a new clay code profile: + +.. prompt:: bash $ + + ceph osd erasure-code-profile set {name} \ + plugin=clay \ + k={data-chunks} \ + m={coding-chunks} \ + [d={helper-chunks}] \ + [scalar_mds={plugin-name}] \ + [technique={technique-name}] \ + [crush-failure-domain={bucket-type}] \ + [crush-device-class={device-class}] \ + [directory={directory}] \ + [--force] + +Where: + +``k={data chunks}`` + +:Description: Each object is split into **data-chunks** parts, + each of which is stored on a different OSD. + +:Type: Integer +:Required: Yes. +:Example: 4 + +``m={coding-chunks}`` + +:Description: Compute **coding chunks** for each object and store them + on different OSDs. The number of coding chunks is also + the number of OSDs that can be down without losing data. + +:Type: Integer +:Required: Yes. +:Example: 2 + +``d={helper-chunks}`` + +:Description: Number of OSDs requested to send data during recovery of + a single chunk. *d* needs to be chosen such that + k+1 <= d <= k+m-1. The larger the *d*, the better the savings. + +:Type: Integer +:Required: No. +:Default: k+m-1 + +``scalar_mds={jerasure|isa|shec}`` + +:Description: **scalar_mds** specifies the plugin that is used as a + building block in the layered construction. It can be + one of *jerasure*, *isa*, *shec* + +:Type: String +:Required: No. +:Default: jerasure + +``technique={technique}`` + +:Description: **technique** specifies the technique that will be picked + within the 'scalar_mds' plugin specified. Supported techniques + are 'reed_sol_van', 'reed_sol_r6_op', 'cauchy_orig', + 'cauchy_good', 'liber8tion' for jerasure, 'reed_sol_van', + 'cauchy' for isa and 'single', 'multiple' for shec. + +:Type: String +:Required: No. 
:Default: reed_sol_van (for jerasure, isa), single (for shec)


``crush-root={root}``

:Description: The name of the crush bucket used for the first step of
              the CRUSH rule. For instance **step take default**.

:Type: String
:Required: No.
:Default: default


``crush-failure-domain={bucket-type}``

:Description: Ensure that no two chunks are in a bucket with the same
              failure domain. For instance, if the failure domain is
              **host** no two chunks will be stored on the same
              host. It is used to create a CRUSH rule step such as **step
              chooseleaf host**.

:Type: String
:Required: No.
:Default: host

``crush-device-class={device-class}``

:Description: Restrict placement to devices of a specific class (e.g.,
              ``ssd`` or ``hdd``), using the crush device class names
              in the CRUSH map.

:Type: String
:Required: No.
:Default:

``directory={directory}``

:Description: Set the **directory** name from which the erasure code
              plugin is loaded.

:Type: String
:Required: No.
:Default: /usr/lib/ceph/erasure-code

``--force``

:Description: Override an existing profile by the same name.

:Type: String
:Required: No.


Notion of sub-chunks
====================

The Clay code is a vector code: it can view and manipulate data within a chunk
at a finer granularity, termed a sub-chunk. This is what enables its savings in
disk IO and network bandwidth. The number of sub-chunks within a chunk for a
Clay code is given by:

    sub-chunk count = :math:`q^{\frac{k+m}{q}}`, where :math:`q = d - k + 1`


During repair of an OSD, the helper information requested
from an available OSD is only a fraction of a chunk. In fact, the number
of sub-chunks within a chunk that are accessed during repair is given by:

    repair sub-chunk count = :math:`\frac{\text{sub-chunk count}}{q}`

Examples
--------

#. For a configuration with *k=4*, *m=2*, *d=5*, the sub-chunk count is
   8 and the repair sub-chunk count is 4. Therefore, only half of a chunk is read
   during repair.
#. When *k=8*, *m=4*, *d=11* the sub-chunk count is 64 and the repair sub-chunk
   count is 16. A quarter of a chunk is read from an available OSD for repair of
   a failed chunk.



How to choose a configuration given a workload
==============================================

During repair, only a few of the sub-chunks within a chunk are read, and these
sub-chunks are not necessarily stored consecutively within the chunk. For best
disk IO performance, it is helpful to read contiguous data. For this reason, it
is suggested that you choose the stripe size such that the sub-chunk size is
sufficiently large.

For a given stripe size (which is fixed based on the workload), choose ``k``, ``m``, and ``d`` such that:

    sub-chunk size = :math:`\frac{\text{stripe-size}}{k \times \text{sub-chunk count}}` = 4KB, 8KB, 12KB ...

#. For large-size workloads, for which the stripe size is large, it is easy to choose k, m, d.
   For example, consider a stripe-size of 64MB; choosing *k=16*, *m=4* and *d=19* will
   result in a sub-chunk count of 1024 and a sub-chunk size of 4KB.
#. For small-size workloads, *k=4*, *m=2* is a good configuration that provides both network
   and disk IO benefits.

Comparisons with LRC
====================

Locally Recoverable Codes (LRC) are also designed to save network bandwidth and
disk IO during single-OSD recovery. However, the focus in LRC is to keep the
number of OSDs contacted during repair (d) minimal, and this comes at the cost of storage overhead.
The *clay* code has a storage overhead of m/k. In the case of an *lrc*, it stores (k+m)/d parities in
addition to the ``m`` parities, resulting in a storage overhead of (m+(k+m)/d)/k. Both *clay* and *lrc*
can recover from the failure of any ``m`` OSDs.

 +-----------------+---------------------------------+----------------------------------+
 | Parameters      | disk IO, storage overhead (LRC) | disk IO, storage overhead (CLAY) |
 +=================+=================================+==================================+
 | (k=10, m=4)     | 7 * S, 0.6 (d=7)                | 3.25 * S, 0.4 (d=13)             |
 +-----------------+---------------------------------+----------------------------------+
 | (k=16, m=4)     | 4 * S, 0.5625 (d=4)             | 4.75 * S, 0.25 (d=19)            |
 +-----------------+---------------------------------+----------------------------------+


where ``S`` is the amount of data stored on a single OSD being recovered.
diff --git a/doc/rados/operations/erasure-code-isa.rst b/doc/rados/operations/erasure-code-isa.rst
new file mode 100644
index 000000000..9a43f89a2
--- /dev/null
+++ b/doc/rados/operations/erasure-code-isa.rst
@@ -0,0 +1,107 @@
=======================
ISA erasure code plugin
=======================

The *isa* plugin encapsulates the `ISA
<https://01.org/intel%C2%AE-storage-acceleration-library-open-source-version/>`_
library.

Create an isa profile
=====================

To create a new *isa* erasure code profile:

.. prompt:: bash $

   ceph osd erasure-code-profile set {name} \
        plugin=isa \
        technique={reed_sol_van|cauchy} \
        [k={data-chunks}] \
        [m={coding-chunks}] \
        [crush-root={root}] \
        [crush-failure-domain={bucket-type}] \
        [crush-device-class={device-class}] \
        [directory={directory}] \
        [--force]

Where:

``k={data chunks}``

:Description: Each object is split in **data-chunks** parts,
              each stored on a different OSD.

:Type: Integer
:Required: No.
:Default: 7

``m={coding-chunks}``

:Description: Compute **coding chunks** for each object and store them
              on different OSDs. The number of coding chunks is also
              the number of OSDs that can be down without losing data.

:Type: Integer
:Required: No.
:Default: 3

``technique={reed_sol_van|cauchy}``

:Description: The ISA plugin comes in two `Reed Solomon
              <https://en.wikipedia.org/wiki/Reed%E2%80%93Solomon_error_correction>`_
              forms. If *reed_sol_van* is set, it is `Vandermonde
              <https://en.wikipedia.org/wiki/Vandermonde_matrix>`_, if
              *cauchy* is set, it is `Cauchy
              <https://en.wikipedia.org/wiki/Cauchy_matrix>`_.

:Type: String
:Required: No.
:Default: reed_sol_van

``crush-root={root}``

:Description: The name of the crush bucket used for the first step of
              the CRUSH rule. For instance **step take default**.

:Type: String
:Required: No.
:Default: default

``crush-failure-domain={bucket-type}``

:Description: Ensure that no two chunks are in a bucket with the same
              failure domain. For instance, if the failure domain is
              **host** no two chunks will be stored on the same
              host. It is used to create a CRUSH rule step such as **step
              chooseleaf host**.

:Type: String
:Required: No.
:Default: host

``crush-device-class={device-class}``

:Description: Restrict placement to devices of a specific class (e.g.,
              ``ssd`` or ``hdd``), using the crush device class names
              in the CRUSH map.

:Type: String
:Required: No.
:Default:

``directory={directory}``

:Description: Set the **directory** name from which the erasure code
              plugin is loaded.
+
+:Type: String
+:Required: No.
+:Default: /usr/lib/ceph/erasure-code
+
+``--force``
+
+:Description: Override an existing profile by the same name.
+
+:Type: String
+:Required: No.
+
diff --git a/doc/rados/operations/erasure-code-jerasure.rst b/doc/rados/operations/erasure-code-jerasure.rst
new file mode 100644
index 000000000..8a0207748
--- /dev/null
+++ b/doc/rados/operations/erasure-code-jerasure.rst
@@ -0,0 +1,123 @@
+============================
+Jerasure erasure code plugin
+============================
+
+The *jerasure* plugin is the most generic and flexible plugin; it is
+also the default for Ceph erasure-coded pools.
+
+The *jerasure* plugin encapsulates the `Jerasure
+<https://github.com/ceph/jerasure>`_ library. It is
+recommended to read the ``jerasure`` documentation to
+understand the parameters. Note that, as of 2023, the ``jerasure.org``
+web site may no longer be connected to the original
+project or legitimate.
+
+Create a jerasure profile
+=========================
+
+To create a new *jerasure* erasure code profile:
+
+.. prompt:: bash $
+
+   ceph osd erasure-code-profile set {name} \
+        plugin=jerasure \
+        k={data-chunks} \
+        m={coding-chunks} \
+        technique={reed_sol_van|reed_sol_r6_op|cauchy_orig|cauchy_good|liberation|blaum_roth|liber8tion} \
+        [crush-root={root}] \
+        [crush-failure-domain={bucket-type}] \
+        [crush-device-class={device-class}] \
+        [directory={directory}] \
+        [--force]
+
+Where:
+
+``k={data chunks}``
+
+:Description: Each object is split into **data-chunks** parts,
+              each stored on a different OSD.
+
+:Type: Integer
+:Required: Yes.
+:Example: 4
+
+``m={coding-chunks}``
+
+:Description: Compute **coding chunks** for each object and store them
+              on different OSDs. The number of coding chunks is also
+              the number of OSDs that can be down without losing data.
+
+:Type: Integer
+:Required: Yes.
+:Example: 2
+
+``technique={reed_sol_van|reed_sol_r6_op|cauchy_orig|cauchy_good|liberation|blaum_roth|liber8tion}``
+
+:Description: The most flexible technique is *reed_sol_van*: it is
+              sufficient to set *k* and *m*. The *cauchy_good* technique
+              can be faster, but you need to choose the *packetsize*
+              carefully. All of *reed_sol_r6_op*, *liberation*,
+              *blaum_roth*, and *liber8tion* are *RAID6* equivalents in
+              the sense that they can only be configured with *m=2*.
+
+:Type: String
+:Required: No.
+:Default: reed_sol_van
+
+``packetsize={bytes}``
+
+:Description: The encoding is done on packets of *bytes* size at
+              a time. Choosing the right packet size is difficult. The
+              *jerasure* documentation contains extensive information
+              on this topic.
+
+:Type: Integer
+:Required: No.
+:Default: 2048
+
+``crush-root={root}``
+
+:Description: The name of the crush bucket used for the first step of
+              the CRUSH rule. For instance **step take default**.
+
+:Type: String
+:Required: No.
+:Default: default
+
+``crush-failure-domain={bucket-type}``
+
+:Description: Ensure that no two chunks are in a bucket with the same
+              failure domain. For instance, if the failure domain is
+              **host** no two chunks will be stored on the same
+              host. It is used to create a CRUSH rule step such as **step
+              chooseleaf host**.
+
+:Type: String
+:Required: No.
+:Default: host
+
+``crush-device-class={device-class}``
+
+:Description: Restrict placement to devices of a specific class (e.g.,
+              ``ssd`` or ``hdd``), using the crush device class names
+              in the CRUSH map.
+
+:Type: String
+:Required: No.
+ +``directory={directory}`` + +:Description: Set the **directory** name from which the erasure code + plugin is loaded. + +:Type: String +:Required: No. +:Default: /usr/lib/ceph/erasure-code + +``--force`` + +:Description: Override an existing profile by the same name. + +:Type: String +:Required: No. + diff --git a/doc/rados/operations/erasure-code-lrc.rst b/doc/rados/operations/erasure-code-lrc.rst new file mode 100644 index 000000000..5329603b9 --- /dev/null +++ b/doc/rados/operations/erasure-code-lrc.rst @@ -0,0 +1,388 @@ +====================================== +Locally repairable erasure code plugin +====================================== + +With the *jerasure* plugin, when an erasure coded object is stored on +multiple OSDs, recovering from the loss of one OSD requires reading +from *k* others. For instance if *jerasure* is configured with +*k=8* and *m=4*, recovering from the loss of one OSD requires reading +from eight others. + +The *lrc* erasure code plugin creates local parity chunks to enable +recovery using fewer surviving OSDs. For instance if *lrc* is configured with +*k=8*, *m=4* and *l=4*, it will create an additional parity chunk for +every four OSDs. When a single OSD is lost, it can be recovered with +only four OSDs instead of eight. + +Erasure code profile examples +============================= + +Reduce recovery bandwidth between hosts +--------------------------------------- + +Although it is probably not an interesting use case when all hosts are +connected to the same switch, reduced bandwidth usage can actually be +observed.: + +.. prompt:: bash $ + + ceph osd erasure-code-profile set LRCprofile \ + plugin=lrc \ + k=4 m=2 l=3 \ + crush-failure-domain=host + ceph osd pool create lrcpool erasure LRCprofile + + +Reduce recovery bandwidth between racks +--------------------------------------- + +In Firefly the bandwidth reduction will only be observed if the primary +OSD is in the same rack as the lost chunk.: + +.. prompt:: bash $ + + ceph osd erasure-code-profile set LRCprofile \ + plugin=lrc \ + k=4 m=2 l=3 \ + crush-locality=rack \ + crush-failure-domain=host + ceph osd pool create lrcpool erasure LRCprofile + + +Create an lrc profile +===================== + +To create a new lrc erasure code profile: + +.. prompt:: bash $ + + ceph osd erasure-code-profile set {name} \ + plugin=lrc \ + k={data-chunks} \ + m={coding-chunks} \ + l={locality} \ + [crush-root={root}] \ + [crush-locality={bucket-type}] \ + [crush-failure-domain={bucket-type}] \ + [crush-device-class={device-class}] \ + [directory={directory}] \ + [--force] + +Where: + +``k={data chunks}`` + +:Description: Each object is split in **data-chunks** parts, + each stored on a different OSD. + +:Type: Integer +:Required: Yes. +:Example: 4 + +``m={coding-chunks}`` + +:Description: Compute **coding chunks** for each object and store them + on different OSDs. The number of coding chunks is also + the number of OSDs that can be down without losing data. + +:Type: Integer +:Required: Yes. +:Example: 2 + +``l={locality}`` + +:Description: Group the coding and data chunks into sets of size + **locality**. For instance, for **k=4** and **m=2**, + when **locality=3** two groups of three are created. + Each set can be recovered without reading chunks + from another set. + +:Type: Integer +:Required: Yes. +:Example: 3 + +``crush-root={root}`` + +:Description: The name of the crush bucket used for the first step of + the CRUSH rule. For instance **step take default**. + +:Type: String +:Required: No. 
+:Default: default + +``crush-locality={bucket-type}`` + +:Description: The type of the CRUSH bucket in which each set of chunks + defined by **l** will be stored. For instance, if it is + set to **rack**, each group of **l** chunks will be + placed in a different rack. It is used to create a + CRUSH rule step such as **step choose rack**. If it is not + set, no such grouping is done. + +:Type: String +:Required: No. + +``crush-failure-domain={bucket-type}`` + +:Description: Ensure that no two chunks are in a bucket with the same + failure domain. For instance, if the failure domain is + **host** no two chunks will be stored on the same + host. It is used to create a CRUSH rule step such as **step + chooseleaf host**. + +:Type: String +:Required: No. +:Default: host + +``crush-device-class={device-class}`` + +:Description: Restrict placement to devices of a specific class (e.g., + ``ssd`` or ``hdd``), using the crush device class names + in the CRUSH map. + +:Type: String +:Required: No. +:Default: + +``directory={directory}`` + +:Description: Set the **directory** name from which the erasure code + plugin is loaded. + +:Type: String +:Required: No. +:Default: /usr/lib/ceph/erasure-code + +``--force`` + +:Description: Override an existing profile by the same name. + +:Type: String +:Required: No. + +Low level plugin configuration +============================== + +The sum of **k** and **m** must be a multiple of the **l** parameter. +The low level configuration parameters however do not enforce this +restriction and it may be advantageous to use them for specific +purposes. It is for instance possible to define two groups, one with 4 +chunks and another with 3 chunks. It is also possible to recursively +define locality sets, for instance datacenters and racks into +datacenters. The **k/m/l** are implemented by generating a low level +configuration. + +The *lrc* erasure code plugin recursively applies erasure code +techniques so that recovering from the loss of some chunks only +requires a subset of the available chunks, most of the time. + +For instance, when three coding steps are described as:: + + chunk nr 01234567 + step 1 _cDD_cDD + step 2 cDDD____ + step 3 ____cDDD + +where *c* are coding chunks calculated from the data chunks *D*, the +loss of chunk *7* can be recovered with the last four chunks. And the +loss of chunk *2* chunk can be recovered with the first four +chunks. + +Erasure code profile examples using low level configuration +=========================================================== + +Minimal testing +--------------- + +It is strictly equivalent to using a *K=2* *M=1* erasure code profile. The *DD* +implies *K=2*, the *c* implies *M=1* and the *jerasure* plugin is used +by default.: + +.. prompt:: bash $ + + ceph osd erasure-code-profile set LRCprofile \ + plugin=lrc \ + mapping=DD_ \ + layers='[ [ "DDc", "" ] ]' + ceph osd pool create lrcpool erasure LRCprofile + +Reduce recovery bandwidth between hosts +--------------------------------------- + +Although it is probably not an interesting use case when all hosts are +connected to the same switch, reduced bandwidth usage can actually be +observed. It is equivalent to **k=4**, **m=2** and **l=3** although +the layout of the chunks is different. 
**WARNING: PROMPTS ARE SELECTABLE** + +:: + + $ ceph osd erasure-code-profile set LRCprofile \ + plugin=lrc \ + mapping=__DD__DD \ + layers='[ + [ "_cDD_cDD", "" ], + [ "cDDD____", "" ], + [ "____cDDD", "" ], + ]' + $ ceph osd pool create lrcpool erasure LRCprofile + + +Reduce recovery bandwidth between racks +--------------------------------------- + +In Firefly the reduced bandwidth will only be observed if the primary OSD is in +the same rack as the lost chunk. **WARNING: PROMPTS ARE SELECTABLE** + +:: + + $ ceph osd erasure-code-profile set LRCprofile \ + plugin=lrc \ + mapping=__DD__DD \ + layers='[ + [ "_cDD_cDD", "" ], + [ "cDDD____", "" ], + [ "____cDDD", "" ], + ]' \ + crush-steps='[ + [ "choose", "rack", 2 ], + [ "chooseleaf", "host", 4 ], + ]' + + $ ceph osd pool create lrcpool erasure LRCprofile + +Testing with different Erasure Code backends +-------------------------------------------- + +LRC now uses jerasure as the default EC backend. It is possible to +specify the EC backend/algorithm on a per layer basis using the low +level configuration. The second argument in layers='[ [ "DDc", "" ] ]' +is actually an erasure code profile to be used for this level. The +example below specifies the ISA backend with the cauchy technique to +be used in the lrcpool.: + +.. prompt:: bash $ + + ceph osd erasure-code-profile set LRCprofile \ + plugin=lrc \ + mapping=DD_ \ + layers='[ [ "DDc", "plugin=isa technique=cauchy" ] ]' + ceph osd pool create lrcpool erasure LRCprofile + +You could also use a different erasure code profile for each +layer. **WARNING: PROMPTS ARE SELECTABLE** + +:: + + $ ceph osd erasure-code-profile set LRCprofile \ + plugin=lrc \ + mapping=__DD__DD \ + layers='[ + [ "_cDD_cDD", "plugin=isa technique=cauchy" ], + [ "cDDD____", "plugin=isa" ], + [ "____cDDD", "plugin=jerasure" ], + ]' + $ ceph osd pool create lrcpool erasure LRCprofile + + + +Erasure coding and decoding algorithm +===================================== + +The steps found in the layers description:: + + chunk nr 01234567 + + step 1 _cDD_cDD + step 2 cDDD____ + step 3 ____cDDD + +are applied in order. For instance, if a 4K object is encoded, it will +first go through *step 1* and be divided in four 1K chunks (the four +uppercase D). They are stored in the chunks 2, 3, 6 and 7, in +order. From these, two coding chunks are calculated (the two lowercase +c). The coding chunks are stored in the chunks 1 and 5, respectively. + +The *step 2* re-uses the content created by *step 1* in a similar +fashion and stores a single coding chunk *c* at position 0. The last four +chunks, marked with an underscore (*_*) for readability, are ignored. + +The *step 3* stores a single coding chunk *c* at position 4. The three +chunks created by *step 1* are used to compute this coding chunk, +i.e. the coding chunk from *step 1* becomes a data chunk in *step 3*. + +If chunk *2* is lost:: + + chunk nr 01234567 + + step 1 _c D_cDD + step 2 cD D____ + step 3 __ _cDDD + +decoding will attempt to recover it by walking the steps in reverse +order: *step 3* then *step 2* and finally *step 1*. + +The *step 3* knows nothing about chunk *2* (i.e. it is an underscore) +and is skipped. + +The coding chunk from *step 2*, stored in chunk *0*, allows it to +recover the content of chunk *2*. There are no more chunks to recover +and the process stops, without considering *step 1*. + +Recovering chunk *2* requires reading chunks *0, 1, 3* and writing +back chunk *2*. 
+ +If chunk *2, 3, 6* are lost:: + + chunk nr 01234567 + + step 1 _c _c D + step 2 cD __ _ + step 3 __ cD D + +The *step 3* can recover the content of chunk *6*:: + + chunk nr 01234567 + + step 1 _c _cDD + step 2 cD ____ + step 3 __ cDDD + +The *step 2* fails to recover and is skipped because there are two +chunks missing (*2, 3*) and it can only recover from one missing +chunk. + +The coding chunk from *step 1*, stored in chunk *1, 5*, allows it to +recover the content of chunk *2, 3*:: + + chunk nr 01234567 + + step 1 _cDD_cDD + step 2 cDDD____ + step 3 ____cDDD + +Controlling CRUSH placement +=========================== + +The default CRUSH rule provides OSDs that are on different hosts. For instance:: + + chunk nr 01234567 + + step 1 _cDD_cDD + step 2 cDDD____ + step 3 ____cDDD + +needs exactly *8* OSDs, one for each chunk. If the hosts are in two +adjacent racks, the first four chunks can be placed in the first rack +and the last four in the second rack. So that recovering from the loss +of a single OSD does not require using bandwidth between the two +racks. + +For instance:: + + crush-steps='[ [ "choose", "rack", 2 ], [ "chooseleaf", "host", 4 ] ]' + +will create a rule that will select two crush buckets of type +*rack* and for each of them choose four OSDs, each of them located in +different buckets of type *host*. + +The CRUSH rule can also be manually crafted for finer control. diff --git a/doc/rados/operations/erasure-code-profile.rst b/doc/rados/operations/erasure-code-profile.rst new file mode 100644 index 000000000..947b34c1f --- /dev/null +++ b/doc/rados/operations/erasure-code-profile.rst @@ -0,0 +1,128 @@ +.. _erasure-code-profiles: + +===================== +Erasure code profiles +===================== + +Erasure code is defined by a **profile** and is used when creating an +erasure coded pool and the associated CRUSH rule. + +The **default** erasure code profile (which is created when the Ceph +cluster is initialized) will split the data into 2 equal-sized chunks, +and have 2 parity chunks of the same size. It will take as much space +in the cluster as a 2-replica pool but can sustain the data loss of 2 +chunks out of 4. It is described as a profile with **k=2** and **m=2**, +meaning the information is spread over four OSD (k+m == 4) and two of +them can be lost. + +To improve redundancy without increasing raw storage requirements, a +new profile can be created. For instance, a profile with **k=10** and +**m=4** can sustain the loss of four (**m=4**) OSDs by distributing an +object on fourteen (k+m=14) OSDs. The object is first divided in +**10** chunks (if the object is 10MB, each chunk is 1MB) and **4** +coding chunks are computed, for recovery (each coding chunk has the +same size as the data chunk, i.e. 1MB). The raw space overhead is only +40% and the object will not be lost even if four OSDs break at the +same time. + +.. _list of available plugins: + +.. toctree:: + :maxdepth: 1 + + erasure-code-jerasure + erasure-code-isa + erasure-code-lrc + erasure-code-shec + erasure-code-clay + +osd erasure-code-profile set +============================ + +To create a new erasure code profile:: + + ceph osd erasure-code-profile set {name} \ + [{directory=directory}] \ + [{plugin=plugin}] \ + [{stripe_unit=stripe_unit}] \ + [{key=value} ...] \ + [--force] + +Where: + +``{directory=directory}`` + +:Description: Set the **directory** name from which the erasure code + plugin is loaded. + +:Type: String +:Required: No. 
+:Default: /usr/lib/ceph/erasure-code + +``{plugin=plugin}`` + +:Description: Use the erasure code **plugin** to compute coding chunks + and recover missing chunks. See the `list of available + plugins`_ for more information. + +:Type: String +:Required: No. +:Default: jerasure + +``{stripe_unit=stripe_unit}`` + +:Description: The amount of data in a data chunk, per stripe. For + example, a profile with 2 data chunks and stripe_unit=4K + would put the range 0-4K in chunk 0, 4K-8K in chunk 1, + then 8K-12K in chunk 0 again. This should be a multiple + of 4K for best performance. The default value is taken + from the monitor config option + ``osd_pool_erasure_code_stripe_unit`` when a pool is + created. The stripe_width of a pool using this profile + will be the number of data chunks multiplied by this + stripe_unit. + +:Type: String +:Required: No. + +``{key=value}`` + +:Description: The semantic of the remaining key/value pairs is defined + by the erasure code plugin. + +:Type: String +:Required: No. + +``--force`` + +:Description: Override an existing profile by the same name, and allow + setting a non-4K-aligned stripe_unit. + +:Type: String +:Required: No. + +osd erasure-code-profile rm +============================ + +To remove an erasure code profile:: + + ceph osd erasure-code-profile rm {name} + +If the profile is referenced by a pool, the deletion will fail. + +.. warning:: Removing an erasure code profile using ``osd erasure-code-profile rm`` does not automatically delete the associated CRUSH rule associated with the erasure code profile. It is recommended to manually remove the associated CRUSH rule using ``ceph osd crush rule remove {rule-name}`` to avoid unexpected behavior. + +osd erasure-code-profile get +============================ + +To display an erasure code profile:: + + ceph osd erasure-code-profile get {name} + +osd erasure-code-profile ls +=========================== + +To list the names of all erasure code profiles:: + + ceph osd erasure-code-profile ls + diff --git a/doc/rados/operations/erasure-code-shec.rst b/doc/rados/operations/erasure-code-shec.rst new file mode 100644 index 000000000..4e8f59b0b --- /dev/null +++ b/doc/rados/operations/erasure-code-shec.rst @@ -0,0 +1,145 @@ +======================== +SHEC erasure code plugin +======================== + +The *shec* plugin encapsulates the `multiple SHEC +<http://tracker.ceph.com/projects/ceph/wiki/Shingled_Erasure_Code_(SHEC)>`_ +library. It allows ceph to recover data more efficiently than Reed Solomon codes. + +Create an SHEC profile +====================== + +To create a new *shec* erasure code profile: + +.. prompt:: bash $ + + ceph osd erasure-code-profile set {name} \ + plugin=shec \ + [k={data-chunks}] \ + [m={coding-chunks}] \ + [c={durability-estimator}] \ + [crush-root={root}] \ + [crush-failure-domain={bucket-type}] \ + [crush-device-class={device-class}] \ + [directory={directory}] \ + [--force] + +Where: + +``k={data-chunks}`` + +:Description: Each object is split in **data-chunks** parts, + each stored on a different OSD. + +:Type: Integer +:Required: No. +:Default: 4 + +``m={coding-chunks}`` + +:Description: Compute **coding-chunks** for each object and store them on + different OSDs. The number of **coding-chunks** does not necessarily + equal the number of OSDs that can be down without losing data. + +:Type: Integer +:Required: No. +:Default: 3 + +``c={durability-estimator}`` + +:Description: The number of parity chunks each of which includes each data chunk in its + calculation range. 
The number is used as a **durability estimator**. + For instance, if c=2, 2 OSDs can be down without losing data. + +:Type: Integer +:Required: No. +:Default: 2 + +``crush-root={root}`` + +:Description: The name of the crush bucket used for the first step of + the CRUSH rule. For instance **step take default**. + +:Type: String +:Required: No. +:Default: default + +``crush-failure-domain={bucket-type}`` + +:Description: Ensure that no two chunks are in a bucket with the same + failure domain. For instance, if the failure domain is + **host** no two chunks will be stored on the same + host. It is used to create a CRUSH rule step such as **step + chooseleaf host**. + +:Type: String +:Required: No. +:Default: host + +``crush-device-class={device-class}`` + +:Description: Restrict placement to devices of a specific class (e.g., + ``ssd`` or ``hdd``), using the crush device class names + in the CRUSH map. + +:Type: String +:Required: No. +:Default: + +``directory={directory}`` + +:Description: Set the **directory** name from which the erasure code + plugin is loaded. + +:Type: String +:Required: No. +:Default: /usr/lib/ceph/erasure-code + +``--force`` + +:Description: Override an existing profile by the same name. + +:Type: String +:Required: No. + +Brief description of SHEC's layouts +=================================== + +Space Efficiency +---------------- + +Space efficiency is a ratio of data chunks to all ones in a object and +represented as k/(k+m). +In order to improve space efficiency, you should increase k or decrease m: + + space efficiency of SHEC(4,3,2) = :math:`\frac{4}{4+3}` = 0.57 + SHEC(5,3,2) or SHEC(4,2,2) improves SHEC(4,3,2)'s space efficiency + +Durability +---------- + +The third parameter of SHEC (=c) is a durability estimator, which approximates +the number of OSDs that can be down without losing data. + +``durability estimator of SHEC(4,3,2) = 2`` + +Recovery Efficiency +------------------- + +Describing calculation of recovery efficiency is beyond the scope of this document, +but at least increasing m without increasing c achieves improvement of recovery efficiency. +(However, we must pay attention to the sacrifice of space efficiency in this case.) + +``SHEC(4,2,2) -> SHEC(4,3,2) : achieves improvement of recovery efficiency`` + +Erasure code profile examples +============================= + + +.. prompt:: bash $ + + ceph osd erasure-code-profile set SHECprofile \ + plugin=shec \ + k=8 m=4 c=3 \ + crush-failure-domain=host + ceph osd pool create shecpool erasure SHECprofile diff --git a/doc/rados/operations/erasure-code.rst b/doc/rados/operations/erasure-code.rst new file mode 100644 index 000000000..e2bd3c296 --- /dev/null +++ b/doc/rados/operations/erasure-code.rst @@ -0,0 +1,272 @@ +.. _ecpool: + +============== + Erasure code +============== + +By default, Ceph `pools <../pools>`_ are created with the type "replicated". In +replicated-type pools, every object is copied to multiple disks. This +multiple copying is the method of data protection known as "replication". + +By contrast, `erasure-coded <https://en.wikipedia.org/wiki/Erasure_code>`_ +pools use a method of data protection that is different from replication. In +erasure coding, data is broken into fragments of two kinds: data blocks and +parity blocks. If a drive fails or becomes corrupted, the parity blocks are +used to rebuild the data. At scale, erasure coding saves space relative to +replication. 
+ +In this documentation, data blocks are referred to as "data chunks" +and parity blocks are referred to as "coding chunks". + +Erasure codes are also called "forward error correction codes". The +first forward error correction code was developed in 1950 by Richard +Hamming at Bell Laboratories. + + +Creating a sample erasure-coded pool +------------------------------------ + +The simplest erasure-coded pool is similar to `RAID5 +<https://en.wikipedia.org/wiki/Standard_RAID_levels#RAID_5>`_ and +requires at least three hosts: + +.. prompt:: bash $ + + ceph osd pool create ecpool erasure + +:: + + pool 'ecpool' created + +.. prompt:: bash $ + + echo ABCDEFGHI | rados --pool ecpool put NYAN - + rados --pool ecpool get NYAN - + +:: + + ABCDEFGHI + +Erasure-code profiles +--------------------- + +The default erasure-code profile can sustain the overlapping loss of two OSDs +without losing data. This erasure-code profile is equivalent to a replicated +pool of size three, but with different storage requirements: instead of +requiring 3TB to store 1TB, it requires only 2TB to store 1TB. The default +profile can be displayed with this command: + +.. prompt:: bash $ + + ceph osd erasure-code-profile get default + +:: + + k=2 + m=2 + plugin=jerasure + crush-failure-domain=host + technique=reed_sol_van + +.. note:: + The profile just displayed is for the *default* erasure-coded pool, not the + *simplest* erasure-coded pool. These two pools are not the same: + + The default erasure-coded pool has two data chunks (K) and two coding chunks + (M). The profile of the default erasure-coded pool is "k=2 m=2". + + The simplest erasure-coded pool has two data chunks (K) and one coding chunk + (M). The profile of the simplest erasure-coded pool is "k=2 m=1". + +Choosing the right profile is important because the profile cannot be modified +after the pool is created. If you find that you need an erasure-coded pool with +a profile different than the one you have created, you must create a new pool +with a different (and presumably more carefully considered) profile. When the +new pool is created, all objects from the wrongly configured pool must be moved +to the newly created pool. There is no way to alter the profile of a pool after +the pool has been created. + +The most important parameters of the profile are *K*, *M*, and +*crush-failure-domain* because they define the storage overhead and +the data durability. For example, if the desired architecture must +sustain the loss of two racks with a storage overhead of 67%, +the following profile can be defined: + +.. prompt:: bash $ + + ceph osd erasure-code-profile set myprofile \ + k=3 \ + m=2 \ + crush-failure-domain=rack + ceph osd pool create ecpool erasure myprofile + echo ABCDEFGHI | rados --pool ecpool put NYAN - + rados --pool ecpool get NYAN - + +:: + + ABCDEFGHI + +The *NYAN* object will be divided in three (*K=3*) and two additional +*chunks* will be created (*M=2*). The value of *M* defines how many +OSDs can be lost simultaneously without losing any data. The +*crush-failure-domain=rack* will create a CRUSH rule that ensures +no two *chunks* are stored in the same rack. + +.. 
ditaa:: + +-------------------+ + name | NYAN | + +-------------------+ + content | ABCDEFGHI | + +--------+----------+ + | + | + v + +------+------+ + +---------------+ encode(3,2) +-----------+ + | +--+--+---+---+ | + | | | | | + | +-------+ | +-----+ | + | | | | | + +--v---+ +--v---+ +--v---+ +--v---+ +--v---+ + name | NYAN | | NYAN | | NYAN | | NYAN | | NYAN | + +------+ +------+ +------+ +------+ +------+ + shard | 1 | | 2 | | 3 | | 4 | | 5 | + +------+ +------+ +------+ +------+ +------+ + content | ABC | | DEF | | GHI | | YXY | | QGC | + +--+---+ +--+---+ +--+---+ +--+---+ +--+---+ + | | | | | + | | v | | + | | +--+---+ | | + | | | OSD1 | | | + | | +------+ | | + | | | | + | | +------+ | | + | +------>| OSD2 | | | + | +------+ | | + | | | + | +------+ | | + | | OSD3 |<----+ | + | +------+ | + | | + | +------+ | + | | OSD4 |<--------------+ + | +------+ + | + | +------+ + +----------------->| OSD5 | + +------+ + + +More information can be found in the `erasure-code profiles +<../erasure-code-profile>`_ documentation. + + +Erasure Coding with Overwrites +------------------------------ + +By default, erasure-coded pools work only with operations that +perform full object writes and appends (for example, RGW). + +Since Luminous, partial writes for an erasure-coded pool may be +enabled with a per-pool setting. This lets RBD and CephFS store their +data in an erasure-coded pool: + +.. prompt:: bash $ + + ceph osd pool set ec_pool allow_ec_overwrites true + +This can be enabled only on a pool residing on BlueStore OSDs, since +BlueStore's checksumming is used during deep scrubs to detect bitrot +or other corruption. Using Filestore with EC overwrites is not only +unsafe, but it also results in lower performance compared to BlueStore. + +Erasure-coded pools do not support omap, so to use them with RBD and +CephFS you must instruct them to store their data in an EC pool and +their metadata in a replicated pool. For RBD, this means using the +erasure-coded pool as the ``--data-pool`` during image creation: + +.. prompt:: bash $ + + rbd create --size 1G --data-pool ec_pool replicated_pool/image_name + +For CephFS, an erasure-coded pool can be set as the default data pool during +file system creation or via `file layouts <../../../cephfs/file-layouts>`_. + + +Erasure-coded pools and cache tiering +------------------------------------- + +.. note:: Cache tiering is deprecated in Reef. + +Erasure-coded pools require more resources than replicated pools and +lack some of the functionality supported by replicated pools (for example, omap). +To overcome these limitations, one can set up a `cache tier <../cache-tiering>`_ +before setting up the erasure-coded pool. + +For example, if the pool *hot-storage* is made of fast storage, the following commands +will place the *hot-storage* pool as a tier of *ecpool* in *writeback* +mode: + +.. prompt:: bash $ + + ceph osd tier add ecpool hot-storage + ceph osd tier cache-mode hot-storage writeback + ceph osd tier set-overlay ecpool hot-storage + +The result is that every write and read to the *ecpool* actually uses +the *hot-storage* pool and benefits from its flexibility and speed. + +More information can be found in the `cache tiering +<../cache-tiering>`_ documentation. Note, however, that cache tiering +is deprecated and may be removed completely in a future release. + +Erasure-coded pool recovery +--------------------------- +If an erasure-coded pool loses any data shards, it must recover them from others. 
+This recovery involves reading from the remaining shards, reconstructing the data, and +writing new shards. + +In Octopus and later releases, erasure-coded pools can recover as long as there are at least *K* shards +available. (With fewer than *K* shards, you have actually lost data!) + +Prior to Octopus, erasure-coded pools required that at least ``min_size`` shards be +available, even if ``min_size`` was greater than ``K``. This was a conservative +decision made out of an abundance of caution when designing the new pool +mode. As a result, however, pools with lost OSDs but without complete data loss were +unable to recover and go active without manual intervention to temporarily change +the ``min_size`` setting. + +We recommend that ``min_size`` be ``K+1`` or greater to prevent loss of writes and +loss of data. + + + +Glossary +-------- + +*chunk* + When the encoding function is called, it returns chunks of the same size as each other. There are two + kinds of chunks: (1) *data chunks*, which can be concatenated to reconstruct the original object, and + (2) *coding chunks*, which can be used to rebuild a lost chunk. + +*K* + The number of data chunks into which an object is divided. For example, if *K* = 2, then a 10KB object + is divided into two objects of 5KB each. + +*M* + The number of coding chunks computed by the encoding function. *M* is equal to the number of OSDs that can + be missing from the cluster without the cluster suffering data loss. For example, if there are two coding + chunks, then two OSDs can be missing without data loss. + +Table of contents +----------------- + +.. toctree:: + :maxdepth: 1 + + erasure-code-profile + erasure-code-jerasure + erasure-code-isa + erasure-code-lrc + erasure-code-shec + erasure-code-clay diff --git a/doc/rados/operations/health-checks.rst b/doc/rados/operations/health-checks.rst new file mode 100644 index 000000000..d52465602 --- /dev/null +++ b/doc/rados/operations/health-checks.rst @@ -0,0 +1,1619 @@ +.. _health-checks: + +=============== + Health checks +=============== + +Overview +======== + +There is a finite set of health messages that a Ceph cluster can raise. These +messages are known as *health checks*. Each health check has a unique +identifier. + +The identifier is a terse human-readable string -- that is, the identifier is +readable in much the same way as a typical variable name. It is intended to +enable tools (for example, UIs) to make sense of health checks and present them +in a way that reflects their meaning. + +This page lists the health checks that are raised by the monitor and manager +daemons. In addition to these, you might see health checks that originate +from MDS daemons (see :ref:`cephfs-health-messages`), and health checks +that are defined by ``ceph-mgr`` python modules. + +Definitions +=========== + +Monitor +------- + +DAEMON_OLD_VERSION +__________________ + +Warn if one or more old versions of Ceph are running on any daemons. A health +check is raised if multiple versions are detected. This condition must exist +for a period of time greater than ``mon_warn_older_version_delay`` (set to one +week by default) in order for the health check to be raised. This allows most +upgrades to proceed without the occurrence of a false warning. If the upgrade +is paused for an extended time period, ``health mute`` can be used by running +``ceph health mute DAEMON_OLD_VERSION --sticky``. Be sure, however, to run +``ceph health unmute DAEMON_OLD_VERSION`` after the upgrade has finished. 
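+
+To see which releases the daemons in the cluster are currently running
+(and therefore whether a mixed-version condition is likely to trigger
+this check), one convenient starting point is the cluster's version
+summary:
+
+.. prompt:: bash $
+
+   ceph versions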
+ +MON_DOWN +________ + +One or more monitor daemons are currently down. The cluster requires a majority +(more than one-half) of the monitors to be available. When one or more monitors +are down, clients might have a harder time forming their initial connection to +the cluster, as they might need to try more addresses before they reach an +operating monitor. + +The down monitor daemon should be restarted as soon as possible to reduce the +risk of a subsequent monitor failure leading to a service outage. + +MON_CLOCK_SKEW +______________ + +The clocks on the hosts running the ceph-mon monitor daemons are not +well-synchronized. This health check is raised if the cluster detects a clock +skew greater than ``mon_clock_drift_allowed``. + +This issue is best resolved by synchronizing the clocks by using a tool like +``ntpd`` or ``chrony``. + +If it is impractical to keep the clocks closely synchronized, the +``mon_clock_drift_allowed`` threshold can also be increased. However, this +value must stay significantly below the ``mon_lease`` interval in order for the +monitor cluster to function properly. + +MON_MSGR2_NOT_ENABLED +_____________________ + +The :confval:`ms_bind_msgr2` option is enabled but one or more monitors are +not configured to bind to a v2 port in the cluster's monmap. This +means that features specific to the msgr2 protocol (for example, encryption) +are unavailable on some or all connections. + +In most cases this can be corrected by running the following command: + +.. prompt:: bash $ + + ceph mon enable-msgr2 + +After this command is run, any monitor configured to listen on the old default +port (6789) will continue to listen for v1 connections on 6789 and begin to +listen for v2 connections on the new default port 3300. + +If a monitor is configured to listen for v1 connections on a non-standard port +(that is, a port other than 6789), then the monmap will need to be modified +manually. + + +MON_DISK_LOW +____________ + +One or more monitors are low on disk space. This health check is raised if the +percentage of available space on the file system used by the monitor database +(normally ``/var/lib/ceph/mon``) drops below the percentage value +``mon_data_avail_warn`` (default: 30%). + +This alert might indicate that some other process or user on the system is +filling up the file system used by the monitor. It might also +indicate that the monitor database is too large (see ``MON_DISK_BIG`` +below). + +If space cannot be freed, the monitor's data directory might need to be +moved to another storage device or file system (this relocation process must be carried out while the monitor +daemon is not running). + + +MON_DISK_CRIT +_____________ + +One or more monitors are critically low on disk space. This health check is raised if the +percentage of available space on the file system used by the monitor database +(normally ``/var/lib/ceph/mon``) drops below the percentage value +``mon_data_avail_crit`` (default: 5%). See ``MON_DISK_LOW``, above. + +MON_DISK_BIG +____________ + +The database size for one or more monitors is very large. This health check is +raised if the size of the monitor database is larger than +``mon_data_size_warn`` (default: 15 GiB). + +A large database is unusual, but does not necessarily indicate a problem. +Monitor databases might grow in size when there are placement groups that have +not reached an ``active+clean`` state in a long time. 
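+
+One way to check whether any such long-stuck placement groups exist is
+to list the PGs that are currently stuck in an unclean state, for
+example:
+
+.. prompt:: bash $
+
+   ceph pg dump_stuck unclean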
+ +This alert might also indicate that the monitor's database is not properly +compacting, an issue that has been observed with some older versions of leveldb +and rocksdb. Forcing a compaction with ``ceph daemon mon.<id> compact`` might +shrink the database's on-disk size. + +This alert might also indicate that the monitor has a bug that prevents it from +pruning the cluster metadata that it stores. If the problem persists, please +report a bug. + +To adjust the warning threshold, run the following command: + +.. prompt:: bash $ + + ceph config set global mon_data_size_warn <size> + + +AUTH_INSECURE_GLOBAL_ID_RECLAIM +_______________________________ + +One or more clients or daemons that are connected to the cluster are not +securely reclaiming their ``global_id`` (a unique number that identifies each +entity in the cluster) when reconnecting to a monitor. The client is being +permitted to connect anyway because the +``auth_allow_insecure_global_id_reclaim`` option is set to ``true`` (which may +be necessary until all Ceph clients have been upgraded) and because the +``auth_expose_insecure_global_id_reclaim`` option is set to ``true`` (which +allows monitors to detect clients with "insecure reclaim" sooner by forcing +those clients to reconnect immediately after their initial authentication). + +To identify which client(s) are using unpatched Ceph client code, run the +following command: + +.. prompt:: bash $ + + ceph health detail + +If you collect a dump of the clients that are connected to an individual +monitor and examine the ``global_id_status`` field in the output of the dump, +you can see the ``global_id`` reclaim behavior of those clients. Here +``reclaim_insecure`` means that a client is unpatched and is contributing to +this health check. To effect a client dump, run the following command: + +.. prompt:: bash $ + + ceph tell mon.\* sessions + +We strongly recommend that all clients in the system be upgraded to a newer +version of Ceph that correctly reclaims ``global_id`` values. After all clients +have been updated, run the following command to stop allowing insecure +reconnections: + +.. prompt:: bash $ + + ceph config set mon auth_allow_insecure_global_id_reclaim false + +If it is impractical to upgrade all clients immediately, you can temporarily +silence this alert by running the following command: + +.. prompt:: bash $ + + ceph health mute AUTH_INSECURE_GLOBAL_ID_RECLAIM 1w # 1 week + +Although we do NOT recommend doing so, you can also disable this alert +indefinitely by running the following command: + +.. prompt:: bash $ + + ceph config set mon mon_warn_on_insecure_global_id_reclaim false + +AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED +_______________________________________ + +Ceph is currently configured to allow clients that reconnect to monitors using +an insecure process to reclaim their previous ``global_id``. Such reclaiming is +allowed because, by default, ``auth_allow_insecure_global_id_reclaim`` is set +to ``true``. It might be necessary to leave this setting enabled while existing +Ceph clients are upgraded to newer versions of Ceph that correctly and securely +reclaim their ``global_id``. + +If the ``AUTH_INSECURE_GLOBAL_ID_RECLAIM`` health check has not also been +raised and if the ``auth_expose_insecure_global_id_reclaim`` setting has not +been disabled (it is enabled by default), then there are currently no clients +connected that need to be upgraded. In that case, it is safe to disable +``insecure global_id reclaim`` by running the following command: + +.. 
prompt:: bash $ + + ceph config set mon auth_allow_insecure_global_id_reclaim false + +On the other hand, if there are still clients that need to be upgraded, then +this alert can be temporarily silenced by running the following command: + +.. prompt:: bash $ + + ceph health mute AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED 1w # 1 week + +Although we do NOT recommend doing so, you can also disable this alert indefinitely +by running the following command: + +.. prompt:: bash $ + + ceph config set mon mon_warn_on_insecure_global_id_reclaim_allowed false + + +Manager +------- + +MGR_DOWN +________ + +All manager daemons are currently down. The cluster should normally have at +least one running manager (``ceph-mgr``) daemon. If no manager daemon is +running, the cluster's ability to monitor itself will be compromised, and parts +of the management API will become unavailable (for example, the dashboard will +not work, and most CLI commands that report metrics or runtime state will +block). However, the cluster will still be able to perform all I/O operations +and to recover from failures. + +The "down" manager daemon should be restarted as soon as possible to ensure +that the cluster can be monitored (for example, so that the ``ceph -s`` +information is up to date, or so that metrics can be scraped by Prometheus). + + +MGR_MODULE_DEPENDENCY +_____________________ + +An enabled manager module is failing its dependency check. This health check +typically comes with an explanatory message from the module about the problem. + +For example, a module might report that a required package is not installed: in +this case, you should install the required package and restart your manager +daemons. + +This health check is applied only to enabled modules. If a module is not +enabled, you can see whether it is reporting dependency issues in the output of +`ceph module ls`. + + +MGR_MODULE_ERROR +________________ + +A manager module has experienced an unexpected error. Typically, this means +that an unhandled exception was raised from the module's `serve` function. The +human-readable description of the error might be obscurely worded if the +exception did not provide a useful description of itself. + +This health check might indicate a bug: please open a Ceph bug report if you +think you have encountered a bug. + +However, if you believe the error is transient, you may restart your manager +daemon(s) or use ``ceph mgr fail`` on the active daemon in order to force +failover to another daemon. + +OSDs +---- + +OSD_DOWN +________ + +One or more OSDs are marked "down". The ceph-osd daemon might have been +stopped, or peer OSDs might be unable to reach the OSD over the network. +Common causes include a stopped or crashed daemon, a "down" host, or a network +outage. + +Verify that the host is healthy, the daemon is started, and the network is +functioning. If the daemon has crashed, the daemon log file +(``/var/log/ceph/ceph-osd.*``) might contain debugging information. + +OSD_<crush type>_DOWN +_____________________ + +(for example, OSD_HOST_DOWN, OSD_ROOT_DOWN) + +All of the OSDs within a particular CRUSH subtree are marked "down" (for +example, all OSDs on a host). + +OSD_ORPHAN +__________ + +An OSD is referenced in the CRUSH map hierarchy, but does not exist. + +To remove the OSD from the CRUSH map hierarchy, run the following command: + +.. 
prompt:: bash $ + + ceph osd crush rm osd.<id> + +OSD_OUT_OF_ORDER_FULL +_____________________ + +The utilization thresholds for `nearfull`, `backfillfull`, `full`, and/or +`failsafe_full` are not ascending. In particular, the following pattern is +expected: `nearfull < backfillfull`, `backfillfull < full`, and `full < +failsafe_full`. + +To adjust these utilization thresholds, run the following commands: + +.. prompt:: bash $ + + ceph osd set-nearfull-ratio <ratio> + ceph osd set-backfillfull-ratio <ratio> + ceph osd set-full-ratio <ratio> + + +OSD_FULL +________ + +One or more OSDs have exceeded the `full` threshold and are preventing the +cluster from servicing writes. + +To check utilization by pool, run the following command: + +.. prompt:: bash $ + + ceph df + +To see the currently defined `full` ratio, run the following command: + +.. prompt:: bash $ + + ceph osd dump | grep full_ratio + +A short-term workaround to restore write availability is to raise the full +threshold by a small amount. To do so, run the following command: + +.. prompt:: bash $ + + ceph osd set-full-ratio <ratio> + +Additional OSDs should be deployed in order to add new storage to the cluster, +or existing data should be deleted in order to free up space in the cluster. + +OSD_BACKFILLFULL +________________ + +One or more OSDs have exceeded the `backfillfull` threshold or *would* exceed +it if the currently-mapped backfills were to finish, which will prevent data +from rebalancing to this OSD. This alert is an early warning that +rebalancing might be unable to complete and that the cluster is approaching +full. + +To check utilization by pool, run the following command: + +.. prompt:: bash $ + + ceph df + +OSD_NEARFULL +____________ + +One or more OSDs have exceeded the `nearfull` threshold. This alert is an early +warning that the cluster is approaching full. + +To check utilization by pool, run the following command: + +.. prompt:: bash $ + + ceph df + +OSDMAP_FLAGS +____________ + +One or more cluster flags of interest have been set. These flags include: + +* *full* - the cluster is flagged as full and cannot serve writes +* *pauserd*, *pausewr* - there are paused reads or writes +* *noup* - OSDs are not allowed to start +* *nodown* - OSD failure reports are being ignored, and that means that the + monitors will not mark OSDs "down" +* *noin* - OSDs that were previously marked ``out`` are not being marked + back ``in`` when they start +* *noout* - "down" OSDs are not automatically being marked ``out`` after the + configured interval +* *nobackfill*, *norecover*, *norebalance* - recovery or data + rebalancing is suspended +* *noscrub*, *nodeep_scrub* - scrubbing is disabled +* *notieragent* - cache-tiering activity is suspended + +With the exception of *full*, these flags can be set or cleared by running the +following commands: + +.. prompt:: bash $ + + ceph osd set <flag> + ceph osd unset <flag> + +OSD_FLAGS +_________ + +One or more OSDs or CRUSH {nodes,device classes} have a flag of interest set. +These flags include: + +* *noup*: these OSDs are not allowed to start +* *nodown*: failure reports for these OSDs will be ignored +* *noin*: if these OSDs were previously marked ``out`` automatically + after a failure, they will not be marked ``in`` when they start +* *noout*: if these OSDs are "down" they will not automatically be marked + ``out`` after the configured interval + +To set and clear these flags in batch, run the following commands: + +.. 
prompt:: bash $ + + ceph osd set-group <flags> <who> + ceph osd unset-group <flags> <who> + +For example: + +.. prompt:: bash $ + + ceph osd set-group noup,noout osd.0 osd.1 + ceph osd unset-group noup,noout osd.0 osd.1 + ceph osd set-group noup,noout host-foo + ceph osd unset-group noup,noout host-foo + ceph osd set-group noup,noout class-hdd + ceph osd unset-group noup,noout class-hdd + +OLD_CRUSH_TUNABLES +__________________ + +The CRUSH map is using very old settings and should be updated. The oldest set +of tunables that can be used (that is, the oldest client version that can +connect to the cluster) without raising this health check is determined by the +``mon_crush_min_required_version`` config option. For more information, see +:ref:`crush-map-tunables`. + +OLD_CRUSH_STRAW_CALC_VERSION +____________________________ + +The CRUSH map is using an older, non-optimal method of calculating intermediate +weight values for ``straw`` buckets. + +The CRUSH map should be updated to use the newer method (that is: +``straw_calc_version=1``). For more information, see :ref:`crush-map-tunables`. + +CACHE_POOL_NO_HIT_SET +_____________________ + +One or more cache pools are not configured with a *hit set* to track +utilization. This issue prevents the tiering agent from identifying cold +objects that are to be flushed and evicted from the cache. + +To configure hit sets on the cache pool, run the following commands: + +.. prompt:: bash $ + + ceph osd pool set <poolname> hit_set_type <type> + ceph osd pool set <poolname> hit_set_period <period-in-seconds> + ceph osd pool set <poolname> hit_set_count <number-of-hitsets> + ceph osd pool set <poolname> hit_set_fpp <target-false-positive-rate> + +OSD_NO_SORTBITWISE +__________________ + +No pre-Luminous v12.y.z OSDs are running, but the ``sortbitwise`` flag has not +been set. + +The ``sortbitwise`` flag must be set in order for OSDs running Luminous v12.y.z +or newer to start. To safely set the flag, run the following command: + +.. prompt:: bash $ + + ceph osd set sortbitwise + +OSD_FILESTORE +__________________ + +Warn if OSDs are running Filestore. The Filestore OSD back end has been +deprecated; the BlueStore back end has been the default object store since the +Ceph Luminous release. + +The 'mclock_scheduler' is not supported for Filestore OSDs. For this reason, +the default 'osd_op_queue' is set to 'wpq' for Filestore OSDs and is enforced +even if the user attempts to change it. + + + +.. prompt:: bash $ + + ceph report | jq -c '."osd_metadata" | .[] | select(.osd_objectstore | contains("filestore")) | {id, osd_objectstore}' + +**In order to upgrade to Reef or a later release, you must first migrate any +Filestore OSDs to BlueStore.** + +If you are upgrading a pre-Reef release to Reef or later, but it is not +feasible to migrate Filestore OSDs to BlueStore immediately, you can +temporarily silence this alert by running the following command: + +.. prompt:: bash $ + + ceph health mute OSD_FILESTORE + +Since this migration can take a considerable amount of time to complete, we +recommend that you begin the process well in advance of any update to Reef or +to later releases. + +POOL_FULL +_________ + +One or more pools have reached their quota and are no longer allowing writes. + +To see pool quotas and utilization, run the following command: + +.. prompt:: bash $ + + ceph df detail + +If you opt to raise the pool quota, run the following commands: + +.. 
prompt:: bash $ + + ceph osd pool set-quota <poolname> max_objects <num-objects> + ceph osd pool set-quota <poolname> max_bytes <num-bytes> + +If not, delete some existing data to reduce utilization. + +BLUEFS_SPILLOVER +________________ + +One or more OSDs that use the BlueStore back end have been allocated `db` +partitions (that is, storage space for metadata, normally on a faster device), +but because that space has been filled, metadata has "spilled over" onto the +slow device. This is not necessarily an error condition or even unexpected +behavior, but may result in degraded performance. If the administrator had +expected that all metadata would fit on the faster device, this alert indicates +that not enough space was provided. + +To disable this alert on all OSDs, run the following command: + +.. prompt:: bash $ + + ceph config set osd bluestore_warn_on_bluefs_spillover false + +Alternatively, to disable the alert on a specific OSD, run the following +command: + +.. prompt:: bash $ + + ceph config set osd.123 bluestore_warn_on_bluefs_spillover false + +To secure more metadata space, you can destroy and reprovision the OSD in +question. This process involves data migration and recovery. + +It might also be possible to expand the LVM logical volume that backs the `db` +storage. If the underlying LV has been expanded, you must stop the OSD daemon +and inform BlueFS of the device-size change by running the following command: + +.. prompt:: bash $ + + ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-$ID + +BLUEFS_AVAILABLE_SPACE +______________________ + +To see how much space is free for BlueFS, run the following command: + +.. prompt:: bash $ + + ceph daemon osd.123 bluestore bluefs available + +This will output up to three values: ``BDEV_DB free``, ``BDEV_SLOW free``, and +``available_from_bluestore``. ``BDEV_DB`` and ``BDEV_SLOW`` report the amount +of space that has been acquired by BlueFS and is now considered free. The value +``available_from_bluestore`` indicates the ability of BlueStore to relinquish +more space to BlueFS. It is normal for this value to differ from the amount of +BlueStore free space, because the BlueFS allocation unit is typically larger +than the BlueStore allocation unit. This means that only part of the BlueStore +free space will be available for BlueFS. + +BLUEFS_LOW_SPACE +_________________ + +If BlueFS is running low on available free space and there is not much free +space available from BlueStore (in other words, `available_from_bluestore` has +a low value), consider reducing the BlueFS allocation unit size. To simulate +available space when the allocation unit is different, run the following +command: + +.. prompt:: bash $ + + ceph daemon osd.123 bluestore bluefs available <alloc-unit-size> + +BLUESTORE_FRAGMENTATION +_______________________ + +As BlueStore operates, the free space on the underlying storage will become +fragmented. This is normal and unavoidable, but excessive fragmentation causes +slowdown. To inspect BlueStore fragmentation, run the following command: + +.. prompt:: bash $ + + ceph daemon osd.123 bluestore allocator score block + +The fragmentation score is given in a [0-1] range. +[0.0 .. 0.4] tiny fragmentation +[0.4 .. 0.7] small, acceptable fragmentation +[0.7 .. 0.9] considerable, but safe fragmentation +[0.9 .. 1.0] severe fragmentation, might impact BlueFS's ability to get space from BlueStore + +To see a detailed report of free fragments, run the following command: + +.. 
prompt:: bash $ + + ceph daemon osd.123 bluestore allocator dump block + +For OSD processes that are not currently running, fragmentation can be +inspected with `ceph-bluestore-tool`. To see the fragmentation score, run the +following command: + +.. prompt:: bash $ + + ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-123 --allocator block free-score + +To dump detailed free chunks, run the following command: + +.. prompt:: bash $ + + ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-123 --allocator block free-dump + +BLUESTORE_LEGACY_STATFS +_______________________ + +One or more OSDs have BlueStore volumes that were created prior to the +Nautilus release. (In Nautilus, BlueStore tracks its internal usage +statistics on a granular, per-pool basis.) + +If *all* OSDs +are older than Nautilus, this means that the per-pool metrics are +simply unavailable. But if there is a mixture of pre-Nautilus and +post-Nautilus OSDs, the cluster usage statistics reported by ``ceph +df`` will be inaccurate. + +The old OSDs can be updated to use the new usage-tracking scheme by stopping +each OSD, running a repair operation, and then restarting the OSD. For example, +to update ``osd.123``, run the following commands: + +.. prompt:: bash $ + + systemctl stop ceph-osd@123 + ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-123 + systemctl start ceph-osd@123 + +To disable this alert, run the following command: + +.. prompt:: bash $ + + ceph config set global bluestore_warn_on_legacy_statfs false + +BLUESTORE_NO_PER_POOL_OMAP +__________________________ + +One or more OSDs have volumes that were created prior to the Octopus release. +(In Octopus and later releases, BlueStore tracks omap space utilization by +pool.) + +If there are any BlueStore OSDs that do not have the new tracking enabled, the +cluster will report an approximate value for per-pool omap usage based on the +most recent deep scrub. + +The OSDs can be updated to track by pool by stopping each OSD, running a repair +operation, and then restarting the OSD. For example, to update ``osd.123``, run +the following commands: + +.. prompt:: bash $ + + systemctl stop ceph-osd@123 + ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-123 + systemctl start ceph-osd@123 + +To disable this alert, run the following command: + +.. prompt:: bash $ + + ceph config set global bluestore_warn_on_no_per_pool_omap false + +BLUESTORE_NO_PER_PG_OMAP +__________________________ + +One or more OSDs have volumes that were created prior to Pacific. (In Pacific +and later releases Bluestore tracks omap space utilitzation by Placement Group +(PG).) + +Per-PG omap allows faster PG removal when PGs migrate. + +The older OSDs can be updated to track by PG by stopping each OSD, running a +repair operation, and then restarting the OSD. For example, to update +``osd.123``, run the following commands: + +.. prompt:: bash $ + + systemctl stop ceph-osd@123 + ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-123 + systemctl start ceph-osd@123 + +To disable this alert, run the following command: + +.. prompt:: bash $ + + ceph config set global bluestore_warn_on_no_per_pg_omap false + + +BLUESTORE_DISK_SIZE_MISMATCH +____________________________ + +One or more BlueStore OSDs have an internal inconsistency between the size of +the physical device and the metadata that tracks its size. This inconsistency +can lead to the OSD(s) crashing in the future. + +The OSDs that have this inconsistency should be destroyed and reprovisioned. 
Be very careful to execute this procedure on only one OSD at a time, so as to
minimize the risk of losing any data. To execute this procedure, where ``$N``
is the OSD that has the inconsistency, run the following commands:

.. prompt:: bash $

   ceph osd out osd.$N
   while ! ceph osd safe-to-destroy osd.$N ; do sleep 1m ; done
   ceph osd destroy osd.$N
   ceph-volume lvm zap /path/to/device
   ceph-volume lvm create --osd-id $N --data /path/to/device

.. note::

   Wait for this recovery procedure to complete on one OSD before running it
   on the next.

BLUESTORE_NO_COMPRESSION
________________________

One or more OSDs are unable to load a BlueStore compression plugin. This issue
might be caused by a broken installation, in which the ``ceph-osd`` binary does
not match the compression plugins. Or it might be caused by a recent upgrade in
which the ``ceph-osd`` daemon was not restarted.

To resolve this issue, verify that all of the packages on the host that is
running the affected OSD(s) are correctly installed and that the OSD daemon(s)
have been restarted. If the problem persists, check the OSD log for information
about the source of the problem.

BLUESTORE_SPURIOUS_READ_ERRORS
______________________________

One or more BlueStore OSDs have detected spurious read errors on the main device.
BlueStore has recovered from these errors by retrying disk reads. This alert
might indicate issues with underlying hardware, issues with the I/O subsystem,
or something similar. In theory, such issues can cause permanent data
corruption. Some observations on the root cause of spurious read errors can be
found here: https://tracker.ceph.com/issues/22464

This alert does not require an immediate response, but the affected host might
need additional attention: for example, upgrading the host to the latest
OS/kernel versions and implementing hardware-resource-utilization monitoring.

To disable this alert on all OSDs, run the following command:

.. prompt:: bash $

   ceph config set osd bluestore_warn_on_spurious_read_errors false

Or, to disable this alert on a specific OSD, run the following command:

.. prompt:: bash $

   ceph config set osd.123 bluestore_warn_on_spurious_read_errors false

Device health
-------------

DEVICE_HEALTH
_____________

One or more OSD devices are expected to fail soon, where the warning threshold
is determined by the ``mgr/devicehealth/warn_threshold`` config option.

Because this alert applies only to OSDs that are currently marked ``in``, the
appropriate response to this expected failure is (1) to mark the OSD ``out`` so
that data is migrated off of the OSD, and then (2) to remove the hardware from
the system. Note that this marking ``out`` is normally done automatically if
``mgr/devicehealth/self_heal`` is enabled (as determined by
``mgr/devicehealth/mark_out_threshold``).

To check device health, run the following command:

.. prompt:: bash $

   ceph device info <device-id>

Device life expectancy is set either by a prediction model that the mgr runs or
by an external tool that is activated by running the following command:

.. prompt:: bash $

   ceph device set-life-expectancy <device-id> <from> <to>

You can change the stored life expectancy manually, but such a change usually
doesn't accomplish anything.
The reason for this is that whichever tool +originally set the stored life expectancy will probably undo your change by +setting it again, and a change to the stored value does not affect the actual +health of the hardware device. + +DEVICE_HEALTH_IN_USE +____________________ + +One or more devices (that is, OSDs) are expected to fail soon and have been +marked ``out`` of the cluster (as controlled by +``mgr/devicehealth/mark_out_threshold``), but they are still participating in +one or more Placement Groups. This might be because the OSD(s) were marked +``out`` only recently and data is still migrating, or because data cannot be +migrated off of the OSD(s) for some reason (for example, the cluster is nearly +full, or the CRUSH hierarchy is structured so that there isn't another suitable +OSD to migrate the data to). + +This message can be silenced by disabling self-heal behavior (that is, setting +``mgr/devicehealth/self_heal`` to ``false``), by adjusting +``mgr/devicehealth/mark_out_threshold``, or by addressing whichever condition +is preventing data from being migrated off of the ailing OSD(s). + +.. _rados_health_checks_device_health_toomany: + +DEVICE_HEALTH_TOOMANY +_____________________ + +Too many devices (that is, OSDs) are expected to fail soon, and because +``mgr/devicehealth/self_heal`` behavior is enabled, marking ``out`` all of the +ailing OSDs would exceed the cluster's ``mon_osd_min_in_ratio`` ratio. This +ratio prevents a cascade of too many OSDs from being automatically marked +``out``. + +You should promptly add new OSDs to the cluster to prevent data loss, or +incrementally replace the failing OSDs. + +Alternatively, you can silence this health check by adjusting options including +``mon_osd_min_in_ratio`` or ``mgr/devicehealth/mark_out_threshold``. Be +warned, however, that this will increase the likelihood of unrecoverable data +loss. + + +Data health (pools & placement groups) +-------------------------------------- + +PG_AVAILABILITY +_______________ + +Data availability is reduced. In other words, the cluster is unable to service +potential read or write requests for at least some data in the cluster. More +precisely, one or more Placement Groups (PGs) are in a state that does not +allow I/O requests to be serviced. Any of the following PG states are +problematic if they do not clear quickly: *peering*, *stale*, *incomplete*, and +the lack of *active*. + +For detailed information about which PGs are affected, run the following +command: + +.. prompt:: bash $ + + ceph health detail + +In most cases, the root cause of this issue is that one or more OSDs are +currently ``down``: see ``OSD_DOWN`` above. + +To see the state of a specific problematic PG, run the following command: + +.. prompt:: bash $ + + ceph tell <pgid> query + +PG_DEGRADED +___________ + +Data redundancy is reduced for some data: in other words, the cluster does not +have the desired number of replicas for all data (in the case of replicated +pools) or erasure code fragments (in the case of erasure-coded pools). More +precisely, one or more Placement Groups (PGs): + +* have the *degraded* or *undersized* flag set, which means that there are not + enough instances of that PG in the cluster; or +* have not had the *clean* state set for a long time. + +For detailed information about which PGs are affected, run the following +command: + +.. prompt:: bash $ + + ceph health detail + +In most cases, the root cause of this issue is that one or more OSDs are +currently "down": see ``OSD_DOWN`` above. 
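Since this check usually traces back to ``down`` OSDs, it can help to list only
those OSDs together with their position in the CRUSH hierarchy. A minimal
sketch (the state filter on ``ceph osd tree`` is available in recent releases):

.. prompt:: bash $

   ceph osd tree down
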
+ +To see the state of a specific problematic PG, run the following command: + +.. prompt:: bash $ + + ceph tell <pgid> query + + +PG_RECOVERY_FULL +________________ + +Data redundancy might be reduced or even put at risk for some data due to a +lack of free space in the cluster. More precisely, one or more Placement Groups +have the *recovery_toofull* flag set, which means that the cluster is unable to +migrate or recover data because one or more OSDs are above the ``full`` +threshold. + +For steps to resolve this condition, see *OSD_FULL* above. + +PG_BACKFILL_FULL +________________ + +Data redundancy might be reduced or even put at risk for some data due to a +lack of free space in the cluster. More precisely, one or more Placement Groups +have the *backfill_toofull* flag set, which means that the cluster is unable to +migrate or recover data because one or more OSDs are above the ``backfillfull`` +threshold. + +For steps to resolve this condition, see *OSD_BACKFILLFULL* above. + +PG_DAMAGED +__________ + +Data scrubbing has discovered problems with data consistency in the cluster. +More precisely, one or more Placement Groups either (1) have the *inconsistent* +or ``snaptrim_error`` flag set, which indicates that an earlier data scrub +operation found a problem, or (2) have the *repair* flag set, which means that +a repair for such an inconsistency is currently in progress. + +For more information, see :doc:`pg-repair`. + +OSD_SCRUB_ERRORS +________________ + +Recent OSD scrubs have discovered inconsistencies. This alert is generally +paired with *PG_DAMAGED* (see above). + +For more information, see :doc:`pg-repair`. + +OSD_TOO_MANY_REPAIRS +____________________ + +The count of read repairs has exceeded the config value threshold +``mon_osd_warn_num_repaired`` (default: ``10``). Because scrub handles errors +only for data at rest, and because any read error that occurs when another +replica is available will be repaired immediately so that the client can get +the object data, there might exist failing disks that are not registering any +scrub errors. This repair count is maintained as a way of identifying any such +failing disks. + + +LARGE_OMAP_OBJECTS +__________________ + +One or more pools contain large omap objects, as determined by +``osd_deep_scrub_large_omap_object_key_threshold`` (threshold for the number of +keys to determine what is considered a large omap object) or +``osd_deep_scrub_large_omap_object_value_sum_threshold`` (the threshold for the +summed size in bytes of all key values to determine what is considered a large +omap object) or both. To find more information on object name, key count, and +size in bytes, search the cluster log for 'Large omap object found'. This issue +can be caused by RGW-bucket index objects that do not have automatic resharding +enabled. For more information on resharding, see :ref:`RGW Dynamic Bucket Index +Resharding <rgw_dynamic_bucket_index_resharding>`. + +To adjust the thresholds mentioned above, run the following commands: + +.. prompt:: bash $ + + ceph config set osd osd_deep_scrub_large_omap_object_key_threshold <keys> + ceph config set osd osd_deep_scrub_large_omap_object_value_sum_threshold <bytes> + +CACHE_POOL_NEAR_FULL +____________________ + +A cache-tier pool is nearly full, as determined by the ``target_max_bytes`` and +``target_max_objects`` properties of the cache pool. Once the pool reaches the +target threshold, write requests to the pool might block while data is flushed +and evicted from the cache. 
This state normally leads to very high latencies +and poor performance. + +To adjust the cache pool's target size, run the following commands: + +.. prompt:: bash $ + + ceph osd pool set <cache-pool-name> target_max_bytes <bytes> + ceph osd pool set <cache-pool-name> target_max_objects <objects> + +There might be other reasons that normal cache flush and evict activity are +throttled: for example, reduced availability of the base tier, reduced +performance of the base tier, or overall cluster load. + +TOO_FEW_PGS +___________ + +The number of Placement Groups (PGs) that are in use in the cluster is below +the configurable threshold of ``mon_pg_warn_min_per_osd`` PGs per OSD. This can +lead to suboptimal distribution and suboptimal balance of data across the OSDs +in the cluster, and a reduction of overall performance. + +If data pools have not yet been created, this condition is expected. + +To address this issue, you can increase the PG count for existing pools or +create new pools. For more information, see +:ref:`choosing-number-of-placement-groups`. + +POOL_PG_NUM_NOT_POWER_OF_TWO +____________________________ + +One or more pools have a ``pg_num`` value that is not a power of two. Although +this is not strictly incorrect, it does lead to a less balanced distribution of +data because some Placement Groups will have roughly twice as much data as +others have. + +This is easily corrected by setting the ``pg_num`` value for the affected +pool(s) to a nearby power of two. To do so, run the following command: + +.. prompt:: bash $ + + ceph osd pool set <pool-name> pg_num <value> + +To disable this health check, run the following command: + +.. prompt:: bash $ + + ceph config set global mon_warn_on_pool_pg_num_not_power_of_two false + +POOL_TOO_FEW_PGS +________________ + +One or more pools should probably have more Placement Groups (PGs), given the +amount of data that is currently stored in the pool. This issue can lead to +suboptimal distribution and suboptimal balance of data across the OSDs in the +cluster, and a reduction of overall performance. This alert is raised only if +the ``pg_autoscale_mode`` property on the pool is set to ``warn``. + +To disable the alert, entirely disable auto-scaling of PGs for the pool by +running the following command: + +.. prompt:: bash $ + + ceph osd pool set <pool-name> pg_autoscale_mode off + +To allow the cluster to automatically adjust the number of PGs for the pool, +run the following command: + +.. prompt:: bash $ + + ceph osd pool set <pool-name> pg_autoscale_mode on + +Alternatively, to manually set the number of PGs for the pool to the +recommended amount, run the following command: + +.. prompt:: bash $ + + ceph osd pool set <pool-name> pg_num <new-pg-num> + +For more information, see :ref:`choosing-number-of-placement-groups` and +:ref:`pg-autoscaler`. + +TOO_MANY_PGS +____________ + +The number of Placement Groups (PGs) in use in the cluster is above the +configurable threshold of ``mon_max_pg_per_osd`` PGs per OSD. If this threshold +is exceeded, the cluster will not allow new pools to be created, pool `pg_num` +to be increased, or pool replication to be increased (any of which, if allowed, +would lead to more PGs in the cluster). A large number of PGs can lead to +higher memory utilization for OSD daemons, slower peering after cluster state +changes (for example, OSD restarts, additions, or removals), and higher load on +the Manager and Monitor daemons. 
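To gauge how close the cluster is to the threshold, you can compare the per-OSD
PG counts against the configured limit. A small sketch (the per-OSD PG count is
reported in the ``PGS`` column of ``ceph osd df`` in recent releases; treat the
exact column layout as release-dependent):

.. prompt:: bash $

   ceph osd df tree
   ceph config get mon mon_max_pg_per_osd
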
+ +The simplest way to mitigate the problem is to increase the number of OSDs in +the cluster by adding more hardware. Note that, because the OSD count that is +used for the purposes of this health check is the number of ``in`` OSDs, +marking ``out`` OSDs ``in`` (if there are any ``out`` OSDs available) can also +help. To do so, run the following command: + +.. prompt:: bash $ + + ceph osd in <osd id(s)> + +For more information, see :ref:`choosing-number-of-placement-groups`. + +POOL_TOO_MANY_PGS +_________________ + +One or more pools should probably have fewer Placement Groups (PGs), given the +amount of data that is currently stored in the pool. This issue can lead to +higher memory utilization for OSD daemons, slower peering after cluster state +changes (for example, OSD restarts, additions, or removals), and higher load on +the Manager and Monitor daemons. This alert is raised only if the +``pg_autoscale_mode`` property on the pool is set to ``warn``. + +To disable the alert, entirely disable auto-scaling of PGs for the pool by +running the following command: + +.. prompt:: bash $ + + ceph osd pool set <pool-name> pg_autoscale_mode off + +To allow the cluster to automatically adjust the number of PGs for the pool, +run the following command: + +.. prompt:: bash $ + + ceph osd pool set <pool-name> pg_autoscale_mode on + +Alternatively, to manually set the number of PGs for the pool to the +recommended amount, run the following command: + +.. prompt:: bash $ + + ceph osd pool set <pool-name> pg_num <new-pg-num> + +For more information, see :ref:`choosing-number-of-placement-groups` and +:ref:`pg-autoscaler`. + + +POOL_TARGET_SIZE_BYTES_OVERCOMMITTED +____________________________________ + +One or more pools have a ``target_size_bytes`` property that is set in order to +estimate the expected size of the pool, but the value(s) of this property are +greater than the total available storage (either by themselves or in +combination with other pools). + +This alert is usually an indication that the ``target_size_bytes`` value for +the pool is too large and should be reduced or set to zero. To reduce the +``target_size_bytes`` value or set it to zero, run the following command: + +.. prompt:: bash $ + + ceph osd pool set <pool-name> target_size_bytes 0 + +The above command sets the value of ``target_size_bytes`` to zero. To set the +value of ``target_size_bytes`` to a non-zero value, replace the ``0`` with that +non-zero value. + +For more information, see :ref:`specifying_pool_target_size`. + +POOL_HAS_TARGET_SIZE_BYTES_AND_RATIO +____________________________________ + +One or more pools have both ``target_size_bytes`` and ``target_size_ratio`` set +in order to estimate the expected size of the pool. Only one of these +properties should be non-zero. If both are set to a non-zero value, then +``target_size_ratio`` takes precedence and ``target_size_bytes`` is ignored. + +To reset ``target_size_bytes`` to zero, run the following command: + +.. prompt:: bash $ + + ceph osd pool set <pool-name> target_size_bytes 0 + +For more information, see :ref:`specifying_pool_target_size`. + +TOO_FEW_OSDS +____________ + +The number of OSDs in the cluster is below the configurable threshold of +``osd_pool_default_size``. This means that some or all data may not be able to +satisfy the data protection policy specified in CRUSH rules and pool settings. + +SMALLER_PGP_NUM +_______________ + +One or more pools have a ``pgp_num`` value less than ``pg_num``. 
This alert is normally an indication that the Placement Group (PG) count was
increased without a corresponding increase in ``pgp_num``, which governs where
data is actually placed.

This disparity is sometimes brought about deliberately, in order to separate
out the `split` step when the PG count is adjusted from the data migration that
is needed when ``pgp_num`` is changed.

This issue is normally resolved by setting ``pgp_num`` to match ``pg_num``, so
as to trigger the data migration, by running the following command:

.. prompt:: bash $

   ceph osd pool set <pool> pgp_num <pg-num-value>

MANY_OBJECTS_PER_PG
___________________

One or more pools have an average number of objects per Placement Group (PG)
that is significantly higher than the overall cluster average. The specific
threshold is determined by the ``mon_pg_warn_max_object_skew`` configuration
value.

This alert is usually an indication that the pool(s) that contain most of the
data in the cluster have too few PGs, or that other pools that contain less
data have too many PGs. See *TOO_MANY_PGS* above.

To silence the health check, raise the threshold by adjusting the
``mon_pg_warn_max_object_skew`` config option on the managers.

The health check will be silenced for a specific pool only if
``pg_autoscale_mode`` is set to ``on``.

POOL_APP_NOT_ENABLED
____________________

A pool exists but the pool has not been tagged for use by a particular
application.

To resolve this issue, tag the pool for use by an application. For
example, if the pool is used by RBD, run the following command:

.. prompt:: bash $

   rbd pool init <poolname>

Alternatively, if the pool is being used by a custom application (here 'foo'),
you can label the pool by running the following low-level command:

.. prompt:: bash $

   ceph osd pool application enable <poolname> foo

For more information, see :ref:`associate-pool-to-application`.

POOL_FULL
_________

One or more pools have reached (or are very close to reaching) their quota. The
threshold to raise this health check is determined by the
``mon_pool_quota_crit_threshold`` configuration option.

Pool quotas can be adjusted up or down (or removed) by running the following
commands:

.. prompt:: bash $

   ceph osd pool set-quota <pool> max_bytes <bytes>
   ceph osd pool set-quota <pool> max_objects <objects>

To disable a quota, set the quota value to 0.

POOL_NEAR_FULL
______________

One or more pools are approaching a configured fullness threshold.

One of the several thresholds that can raise this health check is determined by
the ``mon_pool_quota_warn_threshold`` configuration option.

Pool quotas can be adjusted up or down (or removed) by running the following
commands:

.. prompt:: bash $

   ceph osd pool set-quota <pool> max_bytes <bytes>
   ceph osd pool set-quota <pool> max_objects <objects>

To disable a quota, set the quota value to 0.

Other thresholds that can raise the two health checks above are
``mon_osd_nearfull_ratio`` and ``mon_osd_full_ratio``. For details and
resolution, see :ref:`storage-capacity` and :ref:`no-free-drive-space`.

OBJECT_MISPLACED
________________

One or more objects in the cluster are not stored on the node that CRUSH would
prefer that they be stored on. This alert is an indication that data migration
due to a recent cluster change has not yet completed.
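To watch the migration drain, you can track the ``misplaced`` object percentage
reported in the status output and list the PGs that are still remapped. A brief
sketch using standard commands:

.. prompt:: bash $

   ceph status
   ceph pg ls remapped
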
+ +Misplaced data is not a dangerous condition in and of itself; data consistency +is never at risk, and old copies of objects will not be removed until the +desired number of new copies (in the desired locations) has been created. + +OBJECT_UNFOUND +______________ + +One or more objects in the cluster cannot be found. More precisely, the OSDs +know that a new or updated copy of an object should exist, but no such copy has +been found on OSDs that are currently online. + +Read or write requests to unfound objects will block. + +Ideally, a "down" OSD that has a more recent copy of the unfound object can be +brought back online. To identify candidate OSDs, check the peering state of the +PG(s) responsible for the unfound object. To see the peering state, run the +following command: + +.. prompt:: bash $ + + ceph tell <pgid> query + +On the other hand, if the latest copy of the object is not available, the +cluster can be told to roll back to a previous version of the object. For more +information, see :ref:`failures-osd-unfound`. + +SLOW_OPS +________ + +One or more OSD requests or monitor requests are taking a long time to process. +This alert might be an indication of extreme load, a slow storage device, or a +software bug. + +To query the request queue for the daemon that is causing the slowdown, run the +following command from the daemon's host: + +.. prompt:: bash $ + + ceph daemon osd.<id> ops + +To see a summary of the slowest recent requests, run the following command: + +.. prompt:: bash $ + + ceph daemon osd.<id> dump_historic_ops + +To see the location of a specific OSD, run the following command: + +.. prompt:: bash $ + + ceph osd find osd.<id> + +PG_NOT_SCRUBBED +_______________ + +One or more Placement Groups (PGs) have not been scrubbed recently. PGs are +normally scrubbed within an interval determined by +:confval:`osd_scrub_max_interval` globally. This interval can be overridden on +per-pool basis by changing the value of the variable +:confval:`scrub_max_interval`. This health check is raised if a certain +percentage (determined by ``mon_warn_pg_not_scrubbed_ratio``) of the interval +has elapsed after the time the scrub was scheduled and no scrub has been +performed. + +PGs will be scrubbed only if they are flagged as ``clean`` (which means that +they are to be cleaned, and not that they have been examined and found to be +clean). Misplaced or degraded PGs will not be flagged as ``clean`` (see +*PG_AVAILABILITY* and *PG_DEGRADED* above). + +To manually initiate a scrub of a clean PG, run the following command: + +.. prompt: bash $ + + ceph pg scrub <pgid> + +PG_NOT_DEEP_SCRUBBED +____________________ + +One or more Placement Groups (PGs) have not been deep scrubbed recently. PGs +are normally scrubbed every :confval:`osd_deep_scrub_interval` seconds at most. +This health check is raised if a certain percentage (determined by +``mon_warn_pg_not_deep_scrubbed_ratio``) of the interval has elapsed after the +time the scrub was scheduled and no scrub has been performed. + +PGs will receive a deep scrub only if they are flagged as *clean* (which means +that they are to be cleaned, and not that they have been examined and found to +be clean). Misplaced or degraded PGs might not be flagged as ``clean`` (see +*PG_AVAILABILITY* and *PG_DEGRADED* above). + +To manually initiate a deep scrub of a clean PG, run the following command: + +.. 
prompt:: bash $ + + ceph pg deep-scrub <pgid> + + +PG_SLOW_SNAP_TRIMMING +_____________________ + +The snapshot trim queue for one or more PGs has exceeded the configured warning +threshold. This alert indicates either that an extremely large number of +snapshots was recently deleted, or that OSDs are unable to trim snapshots +quickly enough to keep up with the rate of new snapshot deletions. + +The warning threshold is determined by the ``mon_osd_snap_trim_queue_warn_on`` +option (default: 32768). + +This alert might be raised if OSDs are under excessive load and unable to keep +up with their background work, or if the OSDs' internal metadata database is +heavily fragmented and unable to perform. The alert might also indicate some +other performance issue with the OSDs. + +The exact size of the snapshot trim queue is reported by the ``snaptrimq_len`` +field of ``ceph pg ls -f json-detail``. + +Stretch Mode +------------ + +INCORRECT_NUM_BUCKETS_STRETCH_MODE +__________________________________ + +Stretch mode currently only support 2 dividing buckets with OSDs, this warning suggests +that the number of dividing buckets is not equal to 2 after stretch mode is enabled. +You can expect unpredictable failures and MON assertions until the condition is fixed. + +We encourage you to fix this by removing additional dividing buckets or bump the +number of dividing buckets to 2. + +UNEVEN_WEIGHTS_STRETCH_MODE +___________________________ + +The 2 dividing buckets must have equal weights when stretch mode is enabled. +This warning suggests that the 2 dividing buckets have uneven weights after +stretch mode is enabled. This is not immediately fatal, however, you can expect +Ceph to be confused when trying to process transitions between dividing buckets. + +We encourage you to fix this by making the weights even on both dividing buckets. +This can be done by making sure the combined weight of the OSDs on each dividing +bucket are the same. + +Miscellaneous +------------- + +RECENT_CRASH +____________ + +One or more Ceph daemons have crashed recently, and the crash(es) have not yet +been acknowledged and archived by the administrator. This alert might indicate +a software bug, a hardware problem (for example, a failing disk), or some other +problem. + +To list recent crashes, run the following command: + +.. prompt:: bash $ + + ceph crash ls-new + +To examine information about a specific crash, run the following command: + +.. prompt:: bash $ + + ceph crash info <crash-id> + +To silence this alert, you can archive the crash (perhaps after the crash +has been examined by an administrator) by running the following command: + +.. prompt:: bash $ + + ceph crash archive <crash-id> + +Similarly, to archive all recent crashes, run the following command: + +.. prompt:: bash $ + + ceph crash archive-all + +Archived crashes will still be visible by running the command ``ceph crash +ls``, but not by running the command ``ceph crash ls-new``. + +The time period that is considered recent is determined by the option +``mgr/crash/warn_recent_interval`` (default: two weeks). + +To entirely disable this alert, run the following command: + +.. prompt:: bash $ + + ceph config set mgr/crash/warn_recent_interval 0 + +RECENT_MGR_MODULE_CRASH +_______________________ + +One or more ``ceph-mgr`` modules have crashed recently, and the crash(es) have +not yet been acknowledged and archived by the administrator. 
This alert usually indicates a software bug in one of the software modules that
are running inside the ``ceph-mgr`` daemon. The module that experienced the
problem might be disabled as a result, but other modules are unaffected and
continue to function as expected.

As with the *RECENT_CRASH* health check, a specific crash can be inspected by
running the following command:

.. prompt:: bash $

   ceph crash info <crash-id>

To silence this alert, you can archive the crash (perhaps after the crash has
been examined by an administrator) by running the following command:

.. prompt:: bash $

   ceph crash archive <crash-id>

Similarly, to archive all recent crashes, run the following command:

.. prompt:: bash $

   ceph crash archive-all

Archived crashes will still be visible by running the command ``ceph crash ls``
but not by running the command ``ceph crash ls-new``.

The time period that is considered recent is determined by the option
``mgr/crash/warn_recent_interval`` (default: two weeks).

To entirely disable this alert, run the following command:

.. prompt:: bash $

   ceph config set mgr mgr/crash/warn_recent_interval 0

TELEMETRY_CHANGED
_________________

Telemetry has been enabled, but because the contents of the telemetry report
have changed in the meantime, telemetry reports will not be sent.

Ceph developers occasionally revise the telemetry feature to include new and
useful information, or to remove information found to be useless or sensitive.
If any new information is included in the report, Ceph requires the
administrator to re-enable telemetry. This requirement ensures that the
administrator has an opportunity to (re)review the information that will be
shared.

To review the contents of the telemetry report, run the following command:

.. prompt:: bash $

   ceph telemetry show

Note that the telemetry report consists of several channels that may be
independently enabled or disabled. For more information, see :ref:`telemetry`.

To re-enable telemetry (and silence the alert), run the following command:

.. prompt:: bash $

   ceph telemetry on

To disable telemetry (and silence the alert), run the following command:

.. prompt:: bash $

   ceph telemetry off

AUTH_BAD_CAPS
_____________

One or more auth users have capabilities that cannot be parsed by the monitors.
As a general rule, this alert indicates that there are one or more daemon types
that the user is not authorized to use to perform any action.

This alert is most likely to be raised after an upgrade if (1) the capabilities
were set with an older version of Ceph that did not properly validate the
syntax of those capabilities, or if (2) the syntax of the capabilities has
changed.

To remove the user(s) in question, run the following command:

.. prompt:: bash $

   ceph auth rm <entity-name>

(This resolves the health check, but it prevents clients from being able to
authenticate as the removed user.)

Alternatively, to update the capabilities for the user(s), run the following
command:

.. prompt:: bash $

   ceph auth caps <entity-name> <daemon-type> <caps> [<daemon-type> <caps> ...]

For more information about auth capabilities, see :ref:`user-management`.

OSD_NO_DOWN_OUT_INTERVAL
________________________

The ``mon_osd_down_out_interval`` option is set to zero, which means that the
system does not automatically perform any repair or healing operations when an
OSD fails.
Instead, an administrator or an external orchestrator must manually
mark "down" OSDs as ``out`` (by running ``ceph osd out <osd-id>``) in order to
trigger recovery.

This option is normally set to five or ten minutes, which should be enough time
for a host to power-cycle or reboot.

To silence this alert, set ``mon_warn_on_osd_down_out_interval_zero`` to
``false`` by running the following command:

.. prompt:: bash $

   ceph config set global mon_warn_on_osd_down_out_interval_zero false

DASHBOARD_DEBUG
_______________

The Dashboard debug mode is enabled. This means that if there is an error while
processing a REST API request, the HTTP error response will contain a Python
traceback. This mode should be disabled in production environments because such
a traceback might contain and expose sensitive information.

To disable the debug mode, run the following command:

.. prompt:: bash $

   ceph dashboard debug disable

diff --git a/doc/rados/operations/index.rst b/doc/rados/operations/index.rst
new file mode 100644
index 000000000..15525c1d3
--- /dev/null
+++ b/doc/rados/operations/index.rst
@@ -0,0 +1,99 @@
.. _rados-operations:

====================
 Cluster Operations
====================

.. raw:: html

   <table><colgroup><col width="50%"><col width="50%"></colgroup><tbody valign="top"><tr><td><h3>High-level Operations</h3>

High-level cluster operations consist primarily of starting, stopping, and
restarting a cluster with the ``ceph`` service; checking the cluster's health;
and monitoring an operating cluster.

.. toctree::
   :maxdepth: 1

   operating
   health-checks
   monitoring
   monitoring-osd-pg
   user-management
   pg-repair

.. raw:: html

   </td><td><h3>Data Placement</h3>

Once you have your cluster up and running, you may begin working with data
placement. Ceph supports petabyte-scale data storage clusters, with storage
pools and placement groups that distribute data across the cluster using Ceph's
CRUSH algorithm.

.. toctree::
   :maxdepth: 1

   data-placement
   pools
   erasure-code
   cache-tiering
   placement-groups
   upmap
   read-balancer
   balancer
   crush-map
   crush-map-edits
   stretch-mode
   change-mon-elections

.. raw:: html

   </td></tr><tr><td><h3>Low-level Operations</h3>

Low-level cluster operations consist of starting, stopping, and restarting a
particular daemon within a cluster; changing the settings of a particular
daemon or subsystem; and adding a daemon to the cluster or removing a daemon
from the cluster. The most common use cases for low-level operations include
growing or shrinking the Ceph cluster and replacing legacy or failed hardware
with new hardware.

.. toctree::
   :maxdepth: 1

   add-or-rm-osds
   add-or-rm-mons
   devices
   bluestore-migration
   Command Reference <control>

.. raw:: html

   </td><td><h3>Troubleshooting</h3>

Ceph is still on the leading edge, so you may encounter situations that require
you to evaluate your Ceph configuration and modify your logging and debugging
settings to identify and remedy issues you are encountering with your cluster.

.. toctree::
   :maxdepth: 1

   ../troubleshooting/community
   ../troubleshooting/troubleshooting-mon
   ../troubleshooting/troubleshooting-osd
   ../troubleshooting/troubleshooting-pg
   ../troubleshooting/log-and-debug
   ../troubleshooting/cpu-profiling
   ../troubleshooting/memory-profiling

..
raw:: html

   </td></tr></tbody></table>

diff --git a/doc/rados/operations/monitoring-osd-pg.rst b/doc/rados/operations/monitoring-osd-pg.rst
new file mode 100644
index 000000000..b0a6767a1
--- /dev/null
+++ b/doc/rados/operations/monitoring-osd-pg.rst
@@ -0,0 +1,556 @@
=========================
 Monitoring OSDs and PGs
=========================

High availability and high reliability require a fault-tolerant approach to
managing hardware and software issues. Ceph has no single point of failure and
it can service requests for data even when in a "degraded" mode. Ceph's `data
placement`_ introduces a layer of indirection to ensure that data doesn't bind
directly to specific OSDs. For this reason, tracking system faults
requires finding the `placement group`_ (PG) and the underlying OSDs at the
root of the problem.

.. tip:: A fault in one part of the cluster might prevent you from accessing a
   particular object, but that doesn't mean that you are prevented from
   accessing other objects. When you run into a fault, don't panic. Just
   follow the steps for monitoring your OSDs and placement groups, and then
   begin troubleshooting.

Ceph is self-repairing. However, when problems persist, monitoring OSDs and
placement groups will help you identify the problem.


Monitoring OSDs
===============

An OSD is either *in* service (``in``) or *out* of service (``out``). An OSD is
either running and reachable (``up``), or it is not running and not reachable
(``down``).

If an OSD is ``up``, it may be either ``in`` service (clients can read and
write data) or ``out`` of service. If the OSD was ``in`` but then, due to
a failure or a manual action, was set to the ``out`` state, Ceph will migrate
placement groups to the other OSDs to maintain the configured redundancy.

If an OSD is ``out`` of service, CRUSH will not assign placement groups to it.
If an OSD is ``down``, it will also be ``out``.

.. note:: If an OSD is ``down`` and ``in``, there is a problem and this
   indicates that the cluster is not in a healthy state.

.. ditaa::

           +----------------+        +----------------+
           |                |        |                |
           |   OSD #n In    |        |   OSD #n Up    |
           |                |        |                |
           +----------------+        +----------------+
                   ^                         ^
                   |                         |
                   |                         |
                   v                         v
           +----------------+        +----------------+
           |                |        |                |
           |  OSD #n Out    |        |   OSD #n Down  |
           |                |        |                |
           +----------------+        +----------------+

If you run the commands ``ceph health``, ``ceph -s``, or ``ceph -w``,
you might notice that the cluster does not always show ``HEALTH OK``. Don't
panic. There are certain circumstances in which it is expected and normal that
the cluster will **NOT** show ``HEALTH OK``:

#. You haven't started the cluster yet.
#. You have just started or restarted the cluster and it's not ready to show
   health statuses yet, because the PGs are in the process of being created and
   the OSDs are in the process of peering.
#. You have just added or removed an OSD.
#. You have just modified your cluster map.

Checking to see if OSDs are ``up`` and running is an important aspect of monitoring them:
whenever the cluster is up and running, every OSD that is ``in`` the cluster should also
be ``up`` and running. To see if all of the cluster's OSDs are running, run the following
command:

.. prompt:: bash $

   ceph osd stat

The output provides the following information: the total number of OSDs (x),
how many OSDs are ``up`` (y), how many OSDs are ``in`` (z), and the map epoch (eNNNN).
:: + + x osds: y up, z in; epoch: eNNNN + +If the number of OSDs that are ``in`` the cluster is greater than the number of +OSDs that are ``up``, run the following command to identify the ``ceph-osd`` +daemons that are not running: + +.. prompt:: bash $ + + ceph osd tree + +:: + + #ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF + -1 2.00000 pool openstack + -3 2.00000 rack dell-2950-rack-A + -2 2.00000 host dell-2950-A1 + 0 ssd 1.00000 osd.0 up 1.00000 1.00000 + 1 ssd 1.00000 osd.1 down 1.00000 1.00000 + +.. tip:: Searching through a well-designed CRUSH hierarchy to identify the physical + locations of particular OSDs might help you troubleshoot your cluster. + +If an OSD is ``down``, start it by running the following command: + +.. prompt:: bash $ + + sudo systemctl start ceph-osd@1 + +For problems associated with OSDs that have stopped or won't restart, see `OSD Not Running`_. + + +PG Sets +======= + +When CRUSH assigns a PG to OSDs, it takes note of how many replicas of the PG +are required by the pool and then assigns each replica to a different OSD. +For example, if the pool requires three replicas of a PG, CRUSH might assign +them individually to ``osd.1``, ``osd.2`` and ``osd.3``. CRUSH seeks a +pseudo-random placement that takes into account the failure domains that you +have set in your `CRUSH map`_; for this reason, PGs are rarely assigned to +immediately adjacent OSDs in a large cluster. + +Ceph processes client requests with the **Acting Set** of OSDs: this is the set +of OSDs that currently have a full and working version of a PG shard and that +are therefore responsible for handling requests. By contrast, the **Up Set** is +the set of OSDs that contain a shard of a specific PG. Data is moved or copied +to the **Up Set**, or planned to be moved or copied, to the **Up Set**. See +:ref:`Placement Group Concepts <rados_operations_pg_concepts>`. + +Sometimes an OSD in the Acting Set is ``down`` or otherwise unable to +service requests for objects in the PG. When this kind of situation +arises, don't panic. Common examples of such a situation include: + +- You added or removed an OSD, CRUSH reassigned the PG to + other OSDs, and this reassignment changed the composition of the Acting Set and triggered + the migration of data by means of a "backfill" process. +- An OSD was ``down``, was restarted, and is now ``recovering``. +- An OSD in the Acting Set is ``down`` or unable to service requests, + and another OSD has temporarily assumed its duties. + +Typically, the Up Set and the Acting Set are identical. When they are not, it +might indicate that Ceph is migrating the PG (in other words, that the PG has +been remapped), that an OSD is recovering, or that there is a problem with the +cluster (in such scenarios, Ceph usually shows a "HEALTH WARN" state with a +"stuck stale" message). + +To retrieve a list of PGs, run the following command: + +.. prompt:: bash $ + + ceph pg dump + +To see which OSDs are within the Acting Set and the Up Set for a specific PG, run the following command: + +.. prompt:: bash $ + + ceph pg map {pg-num} + +The output provides the following information: the osdmap epoch (eNNN), the PG number +({pg-num}), the OSDs in the Up Set (up[]), and the OSDs in the Acting Set +(acting[]):: + + osdmap eNNN pg {raw-pg-num} ({pg-num}) -> up [0,1,2] acting [0,1,2] + +.. note:: If the Up Set and the Acting Set do not match, this might indicate + that the cluster is rebalancing itself or that there is a problem with + the cluster. 
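To compare the two sets across all PGs at once, the brief PG dump lists each
PG's Up Set and Acting Set side by side; PGs where the two currently differ are
also reported in the ``remapped`` state. A minimal sketch:

.. prompt:: bash $

   ceph pg dump pgs_brief
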
+ + +Peering +======= + +Before you can write data to a PG, it must be in an ``active`` state and it +will preferably be in a ``clean`` state. For Ceph to determine the current +state of a PG, peering must take place. That is, the primary OSD of the PG +(that is, the first OSD in the Acting Set) must peer with the secondary and +OSDs so that consensus on the current state of the PG can be established. In +the following diagram, we assume a pool with three replicas of the PG: + +.. ditaa:: + + +---------+ +---------+ +-------+ + | OSD 1 | | OSD 2 | | OSD 3 | + +---------+ +---------+ +-------+ + | | | + | Request To | | + | Peer | | + |-------------->| | + |<--------------| | + | Peering | + | | + | Request To | + | Peer | + |----------------------------->| + |<-----------------------------| + | Peering | + +The OSDs also report their status to the monitor. For details, see `Configuring Monitor/OSD +Interaction`_. To troubleshoot peering issues, see `Peering +Failure`_. + + +Monitoring PG States +==================== + +If you run the commands ``ceph health``, ``ceph -s``, or ``ceph -w``, +you might notice that the cluster does not always show ``HEALTH OK``. After +first checking to see if the OSDs are running, you should also check PG +states. There are certain PG-peering-related circumstances in which it is expected +and normal that the cluster will **NOT** show ``HEALTH OK``: + +#. You have just created a pool and the PGs haven't peered yet. +#. The PGs are recovering. +#. You have just added an OSD to or removed an OSD from the cluster. +#. You have just modified your CRUSH map and your PGs are migrating. +#. There is inconsistent data in different replicas of a PG. +#. Ceph is scrubbing a PG's replicas. +#. Ceph doesn't have enough storage capacity to complete backfilling operations. + +If one of these circumstances causes Ceph to show ``HEALTH WARN``, don't +panic. In many cases, the cluster will recover on its own. In some cases, however, you +might need to take action. An important aspect of monitoring PGs is to check their +status as ``active`` and ``clean``: that is, it is important to ensure that, when the +cluster is up and running, all PGs are ``active`` and (preferably) ``clean``. +To see the status of every PG, run the following command: + +.. prompt:: bash $ + + ceph pg stat + +The output provides the following information: the total number of PGs (x), how many +PGs are in a particular state such as ``active+clean`` (y), and the +amount of data stored (z). :: + + x pgs: y active+clean; z bytes data, aa MB used, bb GB / cc GB avail + +.. note:: It is common for Ceph to report multiple states for PGs (for example, + ``active+clean``, ``active+clean+remapped``, ``active+clean+scrubbing``. + +Here Ceph shows not only the PG states, but also storage capacity used (aa), +the amount of storage capacity remaining (bb), and the total storage capacity +of the PG. These values can be important in a few cases: + +- The cluster is reaching its ``near full ratio`` or ``full ratio``. +- Data is not being distributed across the cluster due to an error in the + CRUSH configuration. + + +.. topic:: Placement Group IDs + + PG IDs consist of the pool number (not the pool name) followed by a period + (.) and a hexadecimal number. You can view pool numbers and their names from + in the output of ``ceph osd lspools``. For example, the first pool that was + created corresponds to pool number ``1``. 
A fully qualified PG ID has the + following form:: + + {pool-num}.{pg-id} + + It typically resembles the following:: + + 1.1701b + + +To retrieve a list of PGs, run the following command: + +.. prompt:: bash $ + + ceph pg dump + +To format the output in JSON format and save it to a file, run the following command: + +.. prompt:: bash $ + + ceph pg dump -o {filename} --format=json + +To query a specific PG, run the following command: + +.. prompt:: bash $ + + ceph pg {poolnum}.{pg-id} query + +Ceph will output the query in JSON format. + +The following subsections describe the most common PG states in detail. + + +Creating +-------- + +PGs are created when you create a pool: the command that creates a pool +specifies the total number of PGs for that pool, and when the pool is created +all of those PGs are created as well. Ceph will echo ``creating`` while it is +creating PGs. After the PG(s) are created, the OSDs that are part of a PG's +Acting Set will peer. Once peering is complete, the PG status should be +``active+clean``. This status means that Ceph clients begin writing to the +PG. + +.. ditaa:: + + /-----------\ /-----------\ /-----------\ + | Creating |------>| Peering |------>| Active | + \-----------/ \-----------/ \-----------/ + +Peering +------- + +When a PG peers, the OSDs that store the replicas of its data converge on an +agreed state of the data and metadata within that PG. When peering is complete, +those OSDs agree about the state of that PG. However, completion of the peering +process does **NOT** mean that each replica has the latest contents. + +.. topic:: Authoritative History + + Ceph will **NOT** acknowledge a write operation to a client until that write + operation is persisted by every OSD in the Acting Set. This practice ensures + that at least one member of the Acting Set will have a record of every + acknowledged write operation since the last successful peering operation. + + Given an accurate record of each acknowledged write operation, Ceph can + construct a new authoritative history of the PG--that is, a complete and + fully ordered set of operations that, if performed, would bring an OSD’s + copy of the PG up to date. + + +Active +------ + +After Ceph has completed the peering process, a PG should become ``active``. +The ``active`` state means that the data in the PG is generally available for +read and write operations in the primary and replica OSDs. + + +Clean +----- + +When a PG is in the ``clean`` state, all OSDs holding its data and metadata +have successfully peered and there are no stray replicas. Ceph has replicated +all objects in the PG the correct number of times. + + +Degraded +-------- + +When a client writes an object to the primary OSD, the primary OSD is +responsible for writing the replicas to the replica OSDs. After the primary OSD +writes the object to storage, the PG will remain in a ``degraded`` +state until the primary OSD has received an acknowledgement from the replica +OSDs that Ceph created the replica objects successfully. + +The reason that a PG can be ``active+degraded`` is that an OSD can be +``active`` even if it doesn't yet hold all of the PG's objects. If an OSD goes +``down``, Ceph marks each PG assigned to the OSD as ``degraded``. The PGs must +peer again when the OSD comes back online. However, a client can still write a +new object to a ``degraded`` PG if it is ``active``. 
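To see which PGs are currently degraded or missing replicas, the PG listing can
be filtered by state. A small sketch (the state filters shown here are accepted
by recent releases):

.. prompt:: bash $

   ceph pg ls degraded
   ceph pg ls undersized
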
+ +If an OSD is ``down`` and the ``degraded`` condition persists, Ceph might mark the +``down`` OSD as ``out`` of the cluster and remap the data from the ``down`` OSD +to another OSD. The time between being marked ``down`` and being marked ``out`` +is determined by ``mon_osd_down_out_interval``, which is set to ``600`` seconds +by default. + +A PG can also be in the ``degraded`` state because there are one or more +objects that Ceph expects to find in the PG but that Ceph cannot find. Although +you cannot read or write to unfound objects, you can still access all of the other +objects in the ``degraded`` PG. + + +Recovering +---------- + +Ceph was designed for fault-tolerance, because hardware and other server +problems are expected or even routine. When an OSD goes ``down``, its contents +might fall behind the current state of other replicas in the PGs. When the OSD +has returned to the ``up`` state, the contents of the PGs must be updated to +reflect that current state. During that time period, the OSD might be in a +``recovering`` state. + +Recovery is not always trivial, because a hardware failure might cause a +cascading failure of multiple OSDs. For example, a network switch for a rack or +cabinet might fail, which can cause the OSDs of a number of host machines to +fall behind the current state of the cluster. In such a scenario, general +recovery is possible only if each of the OSDs recovers after the fault has been +resolved.] + +Ceph provides a number of settings that determine how the cluster balances the +resource contention between the need to process new service requests and the +need to recover data objects and restore the PGs to the current state. The +``osd_recovery_delay_start`` setting allows an OSD to restart, re-peer, and +even process some replay requests before starting the recovery process. The +``osd_recovery_thread_timeout`` setting determines the duration of a thread +timeout, because multiple OSDs might fail, restart, and re-peer at staggered +rates. The ``osd_recovery_max_active`` setting limits the number of recovery +requests an OSD can entertain simultaneously, in order to prevent the OSD from +failing to serve. The ``osd_recovery_max_chunk`` setting limits the size of +the recovered data chunks, in order to prevent network congestion. + + +Back Filling +------------ + +When a new OSD joins the cluster, CRUSH will reassign PGs from OSDs that are +already in the cluster to the newly added OSD. It can put excessive load on the +new OSD to force it to immediately accept the reassigned PGs. Back filling the +OSD with the PGs allows this process to begin in the background. After the +backfill operations have completed, the new OSD will begin serving requests as +soon as it is ready. + +During the backfill operations, you might see one of several states: +``backfill_wait`` indicates that a backfill operation is pending, but is not +yet underway; ``backfilling`` indicates that a backfill operation is currently +underway; and ``backfill_toofull`` indicates that a backfill operation was +requested but couldn't be completed due to insufficient storage capacity. When +a PG cannot be backfilled, it might be considered ``incomplete``. + +The ``backfill_toofull`` state might be transient. It might happen that, as PGs +are moved around, space becomes available. The ``backfill_toofull`` state is +similar to ``backfill_wait`` in that backfill operations can proceed as soon as +conditions change. 
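To check whether any PGs are waiting for, performing, or blocked on backfill,
and to see the fullness ratios that gate them, something like the following can
be used (a sketch; the ``grep`` simply picks the ratio lines out of the OSD map
summary):

.. prompt:: bash $

   ceph pg ls backfilling
   ceph pg ls backfill_toofull
   ceph osd dump | grep ratio
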
+ +Ceph provides a number of settings to manage the load spike associated with the +reassignment of PGs to an OSD (especially a new OSD). The ``osd_max_backfills`` +setting specifies the maximum number of concurrent backfills to and from an OSD +(default: 1). The ``backfill_full_ratio`` setting allows an OSD to refuse a +backfill request if the OSD is approaching its full ratio (default: 90%). This +setting can be changed with the ``ceph osd set-backfillfull-ratio`` command. If +an OSD refuses a backfill request, the ``osd_backfill_retry_interval`` setting +allows an OSD to retry the request after a certain interval (default: 30 +seconds). OSDs can also set ``osd_backfill_scan_min`` and +``osd_backfill_scan_max`` in order to manage scan intervals (default: 64 and +512, respectively). + + +Remapped +-------- + +When the Acting Set that services a PG changes, the data migrates from the old +Acting Set to the new Acting Set. Because it might take time for the new +primary OSD to begin servicing requests, the old primary OSD might be required +to continue servicing requests until the PG data migration is complete. After +data migration has completed, the mapping uses the primary OSD of the new +Acting Set. + + +Stale +----- + +Although Ceph uses heartbeats in order to ensure that hosts and daemons are +running, the ``ceph-osd`` daemons might enter a ``stuck`` state where they are +not reporting statistics in a timely manner (for example, there might be a +temporary network fault). By default, OSD daemons report their PG, up through, +boot, and failure statistics every half second (that is, in accordance with a +value of ``0.5``), which is more frequent than the reports defined by the +heartbeat thresholds. If the primary OSD of a PG's Acting Set fails to report +to the monitor or if other OSDs have reported the primary OSD ``down``, the +monitors will mark the PG ``stale``. + +When you start your cluster, it is common to see the ``stale`` state until the +peering process completes. After your cluster has been running for a while, +however, seeing PGs in the ``stale`` state indicates that the primary OSD for +those PGs is ``down`` or not reporting PG statistics to the monitor. + + +Identifying Troubled PGs +======================== + +As previously noted, a PG is not necessarily having problems just because its +state is not ``active+clean``. When PGs are stuck, this might indicate that +Ceph cannot perform self-repairs. The stuck states include: + +- **Unclean**: PGs contain objects that have not been replicated the desired + number of times. Under normal conditions, it can be assumed that these PGs + are recovering. +- **Inactive**: PGs cannot process reads or writes because they are waiting for + an OSD that has the most up-to-date data to come back ``up``. +- **Stale**: PG are in an unknown state, because the OSDs that host them have + not reported to the monitor cluster for a certain period of time (determined + by ``mon_osd_report_timeout``). + +To identify stuck PGs, run the following command: + +.. prompt:: bash $ + + ceph pg dump_stuck [unclean|inactive|stale|undersized|degraded] + +For more detail, see `Placement Group Subsystem`_. To troubleshoot stuck PGs, +see `Troubleshooting PG Errors`_. + + +Finding an Object Location +========================== + +To store object data in the Ceph Object Store, a Ceph client must: + +#. Set an object name +#. 
Specify a `pool`_ + +The Ceph client retrieves the latest cluster map, the CRUSH algorithm +calculates how to map the object to a PG, and then the algorithm calculates how +to dynamically assign the PG to an OSD. To find the object location given only +the object name and the pool name, run a command of the following form: + +.. prompt:: bash $ + + ceph osd map {poolname} {object-name} [namespace] + +.. topic:: Exercise: Locate an Object + + As an exercise, let's create an object. We can specify an object name, a path + to a test file that contains some object data, and a pool name by using the + ``rados put`` command on the command line. For example: + + .. prompt:: bash $ + + rados put {object-name} {file-path} --pool=data + rados put test-object-1 testfile.txt --pool=data + + To verify that the Ceph Object Store stored the object, run the + following command: + + .. prompt:: bash $ + + rados -p data ls + + To identify the object location, run the following commands: + + .. prompt:: bash $ + + ceph osd map {pool-name} {object-name} + ceph osd map data test-object-1 + + Ceph should output the object's location. For example:: + + osdmap e537 pool 'data' (1) object 'test-object-1' -> pg 1.d1743484 (1.4) -> up ([0,1], p0) acting ([0,1], p0) + + To remove the test object, simply delete it by running the ``rados rm`` + command. For example: + + .. prompt:: bash $ + + rados rm test-object-1 --pool=data + +As the cluster evolves, the object location may change dynamically. One benefit +of Ceph's dynamic rebalancing is that Ceph spares you the burden of manually +performing the migration. For details, see the `Architecture`_ section. + +.. _data placement: ../data-placement +.. _pool: ../pools +.. _placement group: ../placement-groups +.. _Architecture: ../../../architecture +.. _OSD Not Running: ../../troubleshooting/troubleshooting-osd#osd-not-running +.. _Troubleshooting PG Errors: ../../troubleshooting/troubleshooting-pg#troubleshooting-pg-errors +.. _Peering Failure: ../../troubleshooting/troubleshooting-pg#failures-osd-peering +.. _CRUSH map: ../crush-map +.. _Configuring Monitor/OSD Interaction: ../../configuration/mon-osd-interaction/ +.. _Placement Group Subsystem: ../control#placement-group-subsystem diff --git a/doc/rados/operations/monitoring.rst b/doc/rados/operations/monitoring.rst new file mode 100644 index 000000000..a9171f2d8 --- /dev/null +++ b/doc/rados/operations/monitoring.rst @@ -0,0 +1,644 @@ +====================== + Monitoring a Cluster +====================== + +After you have a running cluster, you can use the ``ceph`` tool to monitor your +cluster. Monitoring a cluster typically involves checking OSD status, monitor +status, placement group status, and metadata server status. + +Using the command line +====================== + +Interactive mode +---------------- + +To run the ``ceph`` tool in interactive mode, type ``ceph`` at the command line +with no arguments. For example: + +.. prompt:: bash $ + + ceph + +.. prompt:: ceph> + :prompts: ceph> + + health + status + quorum_status + mon stat + +Non-default paths +----------------- + +If you specified non-default locations for your configuration or keyring when +you install the cluster, you may specify their locations to the ``ceph`` tool +by running the following command: + +.. 
prompt:: bash $ + + ceph -c /path/to/conf -k /path/to/keyring health + +Checking a Cluster's Status +=========================== + +After you start your cluster, and before you start reading and/or writing data, +you should check your cluster's status. + +To check a cluster's status, run the following command: + +.. prompt:: bash $ + + ceph status + +Alternatively, you can run the following command: + +.. prompt:: bash $ + + ceph -s + +In interactive mode, this operation is performed by typing ``status`` and +pressing **Enter**: + +.. prompt:: ceph> + :prompts: ceph> + + status + +Ceph will print the cluster status. For example, a tiny Ceph "demonstration +cluster" that is running one instance of each service (monitor, manager, and +OSD) might print the following: + +:: + + cluster: + id: 477e46f1-ae41-4e43-9c8f-72c918ab0a20 + health: HEALTH_OK + + services: + mon: 3 daemons, quorum a,b,c + mgr: x(active) + mds: cephfs_a-1/1/1 up {0=a=up:active}, 2 up:standby + osd: 3 osds: 3 up, 3 in + + data: + pools: 2 pools, 16 pgs + objects: 21 objects, 2.19K + usage: 546 GB used, 384 GB / 931 GB avail + pgs: 16 active+clean + + +How Ceph Calculates Data Usage +------------------------------ + +The ``usage`` value reflects the *actual* amount of raw storage used. The ``xxx +GB / xxx GB`` value means the amount available (the lesser number) of the +overall storage capacity of the cluster. The notional number reflects the size +of the stored data before it is replicated, cloned or snapshotted. Therefore, +the amount of data actually stored typically exceeds the notional amount +stored, because Ceph creates replicas of the data and may also use storage +capacity for cloning and snapshotting. + + +Watching a Cluster +================== + +Each daemon in the Ceph cluster maintains a log of events, and the Ceph cluster +itself maintains a *cluster log* that records high-level events about the +entire Ceph cluster. These events are logged to disk on monitor servers (in +the default location ``/var/log/ceph/ceph.log``), and they can be monitored via +the command line. + +To follow the cluster log, run the following command: + +.. prompt:: bash $ + + ceph -w + +Ceph will print the status of the system, followed by each log message as it is +added. For example: + +:: + + cluster: + id: 477e46f1-ae41-4e43-9c8f-72c918ab0a20 + health: HEALTH_OK + + services: + mon: 3 daemons, quorum a,b,c + mgr: x(active) + mds: cephfs_a-1/1/1 up {0=a=up:active}, 2 up:standby + osd: 3 osds: 3 up, 3 in + + data: + pools: 2 pools, 16 pgs + objects: 21 objects, 2.19K + usage: 546 GB used, 384 GB / 931 GB avail + pgs: 16 active+clean + + + 2017-07-24 08:15:11.329298 mon.a mon.0 172.21.9.34:6789/0 23 : cluster [INF] osd.0 172.21.9.34:6806/20527 boot + 2017-07-24 08:15:14.258143 mon.a mon.0 172.21.9.34:6789/0 39 : cluster [INF] Activating manager daemon x + 2017-07-24 08:15:15.446025 mon.a mon.0 172.21.9.34:6789/0 47 : cluster [INF] Manager daemon x is now available + +Instead of printing log lines as they are added, you might want to print only +the most recent lines. Run ``ceph log last [n]`` to see the most recent ``n`` +lines from the cluster log. + +Monitoring Health Checks +======================== + +Ceph continuously runs various *health checks*. When +a health check fails, this failure is reflected in the output of ``ceph status`` and +``ceph health``. The cluster log receives messages that +indicate when a check has failed and when the cluster has recovered. 
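+
+Health-check transitions can also be watched from a script. The following is a
+minimal sketch (not part of Ceph itself; the log path and polling interval are
+arbitrary choices) that polls ``ceph health`` and records any status other than
+``HEALTH_OK``:
+
+.. code-block:: bash
+
+   #!/bin/bash
+   # Poll cluster health once a minute and append any non-OK status to a log.
+   while true; do
+       status="$(ceph health)"
+       if [ "${status}" != "HEALTH_OK" ]; then
+           echo "$(date -Is) ${status}" >> /tmp/ceph-health-transitions.log
+       fi
+       sleep 60
+   done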
+
+For example, when an OSD goes down, the ``health`` section of the status
+output is updated as follows:
+
+::
+
+    health: HEALTH_WARN
+            1 osds down
+            Degraded data redundancy: 21/63 objects degraded (33.333%), 16 pgs unclean, 16 pgs degraded
+
+At the same time, cluster log messages are emitted to record the failure of the
+health checks:
+
+::
+
+    2017-07-25 10:08:58.265945 mon.a mon.0 172.21.9.34:6789/0 91 : cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)
+    2017-07-25 10:09:01.302624 mon.a mon.0 172.21.9.34:6789/0 94 : cluster [WRN] Health check failed: Degraded data redundancy: 21/63 objects degraded (33.333%), 16 pgs unclean, 16 pgs degraded (PG_DEGRADED)
+
+When the OSD comes back online, the cluster log records the cluster's return
+to a healthy state:
+
+::
+
+    2017-07-25 10:11:11.526841 mon.a mon.0 172.21.9.34:6789/0 109 : cluster [WRN] Health check update: Degraded data redundancy: 2 pgs unclean, 2 pgs degraded, 2 pgs undersized (PG_DEGRADED)
+    2017-07-25 10:11:13.535493 mon.a mon.0 172.21.9.34:6789/0 110 : cluster [INF] Health check cleared: PG_DEGRADED (was: Degraded data redundancy: 2 pgs unclean, 2 pgs degraded, 2 pgs undersized)
+    2017-07-25 10:11:13.535577 mon.a mon.0 172.21.9.34:6789/0 111 : cluster [INF] Cluster is now healthy
+
+Network Performance Checks
+--------------------------
+
+Ceph OSDs send heartbeat ping messages to each other in order to monitor daemon
+availability and network performance. If a single delayed response is detected,
+this might indicate nothing more than a busy OSD. But if multiple delays
+between distinct pairs of OSDs are detected, this might indicate a failed
+network switch, a NIC failure, or a layer 1 failure.
+
+By default, a heartbeat time that exceeds 1 second (1000 milliseconds) raises a
+health check (a ``HEALTH_WARN``). For example:
+
+::
+
+    HEALTH_WARN Slow OSD heartbeats on back (longest 1118.001ms)
+
+In the output of the ``ceph health detail`` command, you can see which OSDs are
+experiencing delays and how long the delays are. The output of ``ceph health
+detail`` is limited to ten lines. Here is an example of the output you can
+expect from the ``ceph health detail`` command::
+
+    [WRN] OSD_SLOW_PING_TIME_BACK: Slow OSD heartbeats on back (longest 1118.001ms)
+        Slow OSD heartbeats on back from osd.0 [dc1,rack1] to osd.1 [dc1,rack1] 1118.001 msec possibly improving
+        Slow OSD heartbeats on back from osd.0 [dc1,rack1] to osd.2 [dc1,rack2] 1030.123 msec
+        Slow OSD heartbeats on back from osd.2 [dc1,rack2] to osd.1 [dc1,rack1] 1015.321 msec
+        Slow OSD heartbeats on back from osd.1 [dc1,rack1] to osd.0 [dc1,rack1] 1010.456 msec
+
+To see more detail and to collect a complete dump of network performance
+information, use the ``dump_osd_network`` command. This command is usually sent
+to a Ceph Manager Daemon, but it can be used to collect information about a
+specific OSD's interactions by sending it to that OSD. The default threshold
+for a slow heartbeat is 1 second (1000 milliseconds), but this can be
+overridden by providing a number of milliseconds as an argument.
+
+To show all network performance data with a specified threshold of 0, send the
+following command to the mgr:
+
+.. 
prompt:: bash $ + + ceph daemon /var/run/ceph/ceph-mgr.x.asok dump_osd_network 0 + +:: + + { + "threshold": 0, + "entries": [ + { + "last update": "Wed Sep 4 17:04:49 2019", + "stale": false, + "from osd": 2, + "to osd": 0, + "interface": "front", + "average": { + "1min": 1.023, + "5min": 0.860, + "15min": 0.883 + }, + "min": { + "1min": 0.818, + "5min": 0.607, + "15min": 0.607 + }, + "max": { + "1min": 1.164, + "5min": 1.173, + "15min": 1.544 + }, + "last": 0.924 + }, + { + "last update": "Wed Sep 4 17:04:49 2019", + "stale": false, + "from osd": 2, + "to osd": 0, + "interface": "back", + "average": { + "1min": 0.968, + "5min": 0.897, + "15min": 0.830 + }, + "min": { + "1min": 0.860, + "5min": 0.563, + "15min": 0.502 + }, + "max": { + "1min": 1.171, + "5min": 1.216, + "15min": 1.456 + }, + "last": 0.845 + }, + { + "last update": "Wed Sep 4 17:04:48 2019", + "stale": false, + "from osd": 0, + "to osd": 1, + "interface": "front", + "average": { + "1min": 0.965, + "5min": 0.811, + "15min": 0.850 + }, + "min": { + "1min": 0.650, + "5min": 0.488, + "15min": 0.466 + }, + "max": { + "1min": 1.252, + "5min": 1.252, + "15min": 1.362 + }, + "last": 0.791 + }, + ... + + + +Muting Health Checks +-------------------- + +Health checks can be muted so that they have no effect on the overall +reported status of the cluster. For example, if the cluster has raised a +single health check and then you mute that health check, then the cluster will report a status of ``HEALTH_OK``. +To mute a specific health check, use the health check code that corresponds to that health check (see :ref:`health-checks`), and +run the following command: + +.. prompt:: bash $ + + ceph health mute <code> + +For example, to mute an ``OSD_DOWN`` health check, run the following command: + +.. prompt:: bash $ + + ceph health mute OSD_DOWN + +Mutes are reported as part of the short and long form of the ``ceph health`` command's output. +For example, in the above scenario, the cluster would report: + +.. prompt:: bash $ + + ceph health + +:: + + HEALTH_OK (muted: OSD_DOWN) + +.. prompt:: bash $ + + ceph health detail + +:: + + HEALTH_OK (muted: OSD_DOWN) + (MUTED) OSD_DOWN 1 osds down + osd.1 is down + +A mute can be removed by running the following command: + +.. prompt:: bash $ + + ceph health unmute <code> + +For example: + +.. prompt:: bash $ + + ceph health unmute OSD_DOWN + +A "health mute" can have a TTL (**T**\ime **T**\o **L**\ive) +associated with it: this means that the mute will automatically expire +after a specified period of time. The TTL is specified as an optional +duration argument, as seen in the following examples: + +.. prompt:: bash $ + + ceph health mute OSD_DOWN 4h # mute for 4 hours + ceph health mute MON_DOWN 15m # mute for 15 minutes + +Normally, if a muted health check is resolved (for example, if the OSD that raised the ``OSD_DOWN`` health check +in the example above has come back up), the mute goes away. If the health check comes +back later, it will be reported in the usual way. + +It is possible to make a health mute "sticky": this means that the mute will remain even if the +health check clears. For example, to make a health mute "sticky", you might run the following command: + +.. prompt:: bash $ + + ceph health mute OSD_DOWN 1h --sticky # ignore any/all down OSDs for next hour + +Most health mutes disappear if the unhealthy condition that triggered the health check gets worse. +For example, suppose that there is one OSD down and the health check is muted. 
In that case, if +one or more additional OSDs go down, then the health mute disappears. This behavior occurs in any health check with a threshold value. + + +Checking a Cluster's Usage Stats +================================ + +To check a cluster's data usage and data distribution among pools, use the +``df`` command. This option is similar to Linux's ``df`` command. Run the +following command: + +.. prompt:: bash $ + + ceph df + +The output of ``ceph df`` resembles the following:: + + CLASS SIZE AVAIL USED RAW USED %RAW USED + ssd 202 GiB 200 GiB 2.0 GiB 2.0 GiB 1.00 + TOTAL 202 GiB 200 GiB 2.0 GiB 2.0 GiB 1.00 + + --- POOLS --- + POOL ID PGS STORED (DATA) (OMAP) OBJECTS USED (DATA) (OMAP) %USED MAX AVAIL QUOTA OBJECTS QUOTA BYTES DIRTY USED COMPR UNDER COMPR + device_health_metrics 1 1 242 KiB 15 KiB 227 KiB 4 251 KiB 24 KiB 227 KiB 0 297 GiB N/A N/A 4 0 B 0 B + cephfs.a.meta 2 32 6.8 KiB 6.8 KiB 0 B 22 96 KiB 96 KiB 0 B 0 297 GiB N/A N/A 22 0 B 0 B + cephfs.a.data 3 32 0 B 0 B 0 B 0 0 B 0 B 0 B 0 99 GiB N/A N/A 0 0 B 0 B + test 4 32 22 MiB 22 MiB 50 KiB 248 19 MiB 19 MiB 50 KiB 0 297 GiB N/A N/A 248 0 B 0 B + +- **CLASS:** For example, "ssd" or "hdd". +- **SIZE:** The amount of storage capacity managed by the cluster. +- **AVAIL:** The amount of free space available in the cluster. +- **USED:** The amount of raw storage consumed by user data (excluding + BlueStore's database). +- **RAW USED:** The amount of raw storage consumed by user data, internal + overhead, and reserved capacity. +- **%RAW USED:** The percentage of raw storage used. Watch this number in + conjunction with ``full ratio`` and ``near full ratio`` to be forewarned when + your cluster approaches the fullness thresholds. See `Storage Capacity`_. + + +**POOLS:** + +The POOLS section of the output provides a list of pools and the *notional* +usage of each pool. This section of the output **DOES NOT** reflect replicas, +clones, or snapshots. For example, if you store an object with 1MB of data, +then the notional usage will be 1MB, but the actual usage might be 2MB or more +depending on the number of replicas, clones, and snapshots. + +- **ID:** The number of the specific node within the pool. +- **STORED:** The actual amount of data that the user has stored in a pool. + This is similar to the USED column in earlier versions of Ceph, but the + calculations (for BlueStore!) are more precise (in that gaps are properly + handled). + + - **(DATA):** Usage for RBD (RADOS Block Device), CephFS file data, and RGW + (RADOS Gateway) object data. + - **(OMAP):** Key-value pairs. Used primarily by CephFS and RGW (RADOS + Gateway) for metadata storage. + +- **OBJECTS:** The notional number of objects stored per pool (that is, the + number of objects other than replicas, clones, or snapshots). +- **USED:** The space allocated for a pool over all OSDs. This includes space + for replication, space for allocation granularity, and space for the overhead + associated with erasure-coding. Compression savings and object-content gaps + are also taken into account. However, BlueStore's database is not included in + the amount reported under USED. + + - **(DATA):** Object usage for RBD (RADOS Block Device), CephFS file data, + and RGW (RADOS Gateway) object data. + - **(OMAP):** Object key-value pairs. Used primarily by CephFS and RGW (RADOS + Gateway) for metadata storage. + +- **%USED:** The notional percentage of storage used per pool. +- **MAX AVAIL:** An estimate of the notional amount of data that can be written + to this pool. 
+- **QUOTA OBJECTS:** The number of quota objects.
+- **QUOTA BYTES:** The number of bytes in the quota objects.
+- **DIRTY:** The number of objects in the cache pool that have been written to
+  the cache pool but have not yet been flushed to the base pool. This field is
+  available only when cache tiering is in use.
+- **USED COMPR:** The amount of space allocated for compressed data. This
+  includes compressed data in addition to all of the space required for
+  replication, allocation granularity, and erasure-coding overhead.
+- **UNDER COMPR:** The amount of data that has passed through compression
+  (summed over all replicas) and that is worth storing in a compressed form.
+
+
+.. note:: The numbers in the POOLS section are notional. They do not include
+   the number of replicas, clones, or snapshots. As a result, the sum of the
+   USED and %USED amounts in the POOLS section of the output will not be equal
+   to the sum of the USED and %USED amounts in the RAW section of the output.
+
+.. note:: The MAX AVAIL value is a complicated function of the replication or
+   the kind of erasure coding used, the CRUSH rule that maps storage to
+   devices, the utilization of those devices, and the configured
+   ``mon_osd_full_ratio`` setting.
+
+
+Checking OSD Status
+===================
+
+To check if OSDs are ``up`` and ``in``, run the
+following command:
+
+.. prompt:: bash #
+
+   ceph osd stat
+
+Alternatively, you can run the following command:
+
+.. prompt:: bash #
+
+   ceph osd dump
+
+To view OSDs according to their position in the CRUSH map, run the following
+command:
+
+.. prompt:: bash #
+
+   ceph osd tree
+
+The ``ceph osd tree`` command prints a CRUSH tree that shows each host, its
+OSDs, whether the OSDs are ``up``, and their weights. The output resembles the
+following:
+
+.. code-block:: bash
+
+   #ID CLASS WEIGHT  TYPE NAME     STATUS REWEIGHT PRI-AFF
+   -1        3.00000 pool default
+   -3        3.00000 rack mainrack
+   -2        3.00000 host osd-host
+    0   ssd  1.00000      osd.0        up  1.00000 1.00000
+    1   ssd  1.00000      osd.1        up  1.00000 1.00000
+    2   ssd  1.00000      osd.2        up  1.00000 1.00000
+
+See `Monitoring OSDs and Placement Groups`_.
+
+Checking Monitor Status
+=======================
+
+If your cluster has multiple monitors, then you need to perform certain
+"monitor status" checks. After starting the cluster and before reading or
+writing data, you should check quorum status. A quorum must be present when
+multiple monitors are running to ensure proper functioning of your Ceph
+cluster. Check monitor status regularly in order to ensure that all of the
+monitors are running.
+
+To display the monitor map, run the following command:
+
+.. prompt:: bash $
+
+   ceph mon stat
+
+Alternatively, you can run the following command:
+
+.. prompt:: bash $
+
+   ceph mon dump
+
+To check the quorum status for the monitor cluster, run the following command:
+
+.. prompt:: bash $
+
+   ceph quorum_status
+
+Ceph returns the quorum status. For example, a Ceph cluster that consists of
+three monitors might return the following:
+
+.. 
code-block:: javascript + + { "election_epoch": 10, + "quorum": [ + 0, + 1, + 2], + "quorum_names": [ + "a", + "b", + "c"], + "quorum_leader_name": "a", + "monmap": { "epoch": 1, + "fsid": "444b489c-4f16-4b75-83f0-cb8097468898", + "modified": "2011-12-12 13:28:27.505520", + "created": "2011-12-12 13:28:27.505520", + "features": {"persistent": [ + "kraken", + "luminous", + "mimic"], + "optional": [] + }, + "mons": [ + { "rank": 0, + "name": "a", + "addr": "127.0.0.1:6789/0", + "public_addr": "127.0.0.1:6789/0"}, + { "rank": 1, + "name": "b", + "addr": "127.0.0.1:6790/0", + "public_addr": "127.0.0.1:6790/0"}, + { "rank": 2, + "name": "c", + "addr": "127.0.0.1:6791/0", + "public_addr": "127.0.0.1:6791/0"} + ] + } + } + +Checking MDS Status +=================== + +Metadata servers provide metadata services for CephFS. Metadata servers have +two sets of states: ``up | down`` and ``active | inactive``. To check if your +metadata servers are ``up`` and ``active``, run the following command: + +.. prompt:: bash $ + + ceph mds stat + +To display details of the metadata servers, run the following command: + +.. prompt:: bash $ + + ceph fs dump + + +Checking Placement Group States +=============================== + +Placement groups (PGs) map objects to OSDs. PGs are monitored in order to +ensure that they are ``active`` and ``clean``. See `Monitoring OSDs and +Placement Groups`_. + +.. _Monitoring OSDs and Placement Groups: ../monitoring-osd-pg + +.. _rados-monitoring-using-admin-socket: + +Using the Admin Socket +====================== + +The Ceph admin socket allows you to query a daemon via a socket interface. By +default, Ceph sockets reside under ``/var/run/ceph``. To access a daemon via +the admin socket, log in to the host that is running the daemon and run one of +the two following commands: + +.. prompt:: bash $ + + ceph daemon {daemon-name} + ceph daemon {path-to-socket-file} + +For example, the following commands are equivalent to each other: + +.. prompt:: bash $ + + ceph daemon osd.0 foo + ceph daemon /var/run/ceph/ceph-osd.0.asok foo + +To view the available admin-socket commands, run the following command: + +.. prompt:: bash $ + + ceph daemon {daemon-name} help + +Admin-socket commands enable you to view and set your configuration at runtime. +For more on viewing your configuration, see `Viewing a Configuration at +Runtime`_. There are two methods of setting configuration value at runtime: (1) +using the admin socket, which bypasses the monitor and requires a direct login +to the host in question, and (2) using the ``ceph tell {daemon-type}.{id} +config set`` command, which relies on the monitor and does not require a direct +login. + +.. _Viewing a Configuration at Runtime: ../../configuration/ceph-conf#viewing-a-configuration-at-runtime +.. _Storage Capacity: ../../configuration/mon-config-ref#storage-capacity diff --git a/doc/rados/operations/operating.rst b/doc/rados/operations/operating.rst new file mode 100644 index 000000000..f4a2fd988 --- /dev/null +++ b/doc/rados/operations/operating.rst @@ -0,0 +1,174 @@ +===================== + Operating a Cluster +===================== + +.. index:: systemd; operating a cluster + + +Running Ceph with systemd +========================= + +In all distributions that support systemd (CentOS 7, Fedora, Debian +Jessie 8 and later, and SUSE), systemd files (and NOT legacy SysVinit scripts) +are used to manage Ceph daemons. 
Ceph daemons therefore behave like any other daemons +that can be controlled by the ``systemctl`` command, as in the following examples: + +.. prompt:: bash $ + + sudo systemctl start ceph.target # start all daemons + sudo systemctl status ceph-osd@12 # check status of osd.12 + +To list all of the Ceph systemd units on a node, run the following command: + +.. prompt:: bash $ + + sudo systemctl status ceph\*.service ceph\*.target + + +Starting all daemons +-------------------- + +To start all of the daemons on a Ceph node (regardless of their type), run the +following command: + +.. prompt:: bash $ + + sudo systemctl start ceph.target + + +Stopping all daemons +-------------------- + +To stop all of the daemons on a Ceph node (regardless of their type), run the +following command: + +.. prompt:: bash $ + + sudo systemctl stop ceph\*.service ceph\*.target + + +Starting all daemons by type +---------------------------- + +To start all of the daemons of a particular type on a Ceph node, run one of the +following commands: + +.. prompt:: bash $ + + sudo systemctl start ceph-osd.target + sudo systemctl start ceph-mon.target + sudo systemctl start ceph-mds.target + + +Stopping all daemons by type +---------------------------- + +To stop all of the daemons of a particular type on a Ceph node, run one of the +following commands: + +.. prompt:: bash $ + + sudo systemctl stop ceph-osd\*.service ceph-osd.target + sudo systemctl stop ceph-mon\*.service ceph-mon.target + sudo systemctl stop ceph-mds\*.service ceph-mds.target + + +Starting a daemon +----------------- + +To start a specific daemon instance on a Ceph node, run one of the +following commands: + +.. prompt:: bash $ + + sudo systemctl start ceph-osd@{id} + sudo systemctl start ceph-mon@{hostname} + sudo systemctl start ceph-mds@{hostname} + +For example: + +.. prompt:: bash $ + + sudo systemctl start ceph-osd@1 + sudo systemctl start ceph-mon@ceph-server + sudo systemctl start ceph-mds@ceph-server + + +Stopping a daemon +----------------- + +To stop a specific daemon instance on a Ceph node, run one of the +following commands: + +.. prompt:: bash $ + + sudo systemctl stop ceph-osd@{id} + sudo systemctl stop ceph-mon@{hostname} + sudo systemctl stop ceph-mds@{hostname} + +For example: + +.. prompt:: bash $ + + sudo systemctl stop ceph-osd@1 + sudo systemctl stop ceph-mon@ceph-server + sudo systemctl stop ceph-mds@ceph-server + + +.. index:: sysvinit; operating a cluster + +Running Ceph with SysVinit +========================== + +Each time you start, restart, or stop Ceph daemons, you must specify at least one option and one command. +Likewise, each time you start, restart, or stop your entire cluster, you must specify at least one option and one command. +In both cases, you can also specify a daemon type or a daemon instance. :: + + {commandline} [options] [commands] [daemons] + +The ``ceph`` options include: + ++-----------------+----------+-------------------------------------------------+ +| Option | Shortcut | Description | ++=================+==========+=================================================+ +| ``--verbose`` | ``-v`` | Use verbose logging. | ++-----------------+----------+-------------------------------------------------+ +| ``--valgrind`` | ``N/A`` | (Dev and QA only) Use `Valgrind`_ debugging. | ++-----------------+----------+-------------------------------------------------+ +| ``--allhosts`` | ``-a`` | Execute on all nodes listed in ``ceph.conf``. | +| | | Otherwise, it only executes on ``localhost``. 
| ++-----------------+----------+-------------------------------------------------+ +| ``--restart`` | ``N/A`` | Automatically restart daemon if it core dumps. | ++-----------------+----------+-------------------------------------------------+ +| ``--norestart`` | ``N/A`` | Do not restart a daemon if it core dumps. | ++-----------------+----------+-------------------------------------------------+ +| ``--conf`` | ``-c`` | Use an alternate configuration file. | ++-----------------+----------+-------------------------------------------------+ + +The ``ceph`` commands include: + ++------------------+------------------------------------------------------------+ +| Command | Description | ++==================+============================================================+ +| ``start`` | Start the daemon(s). | ++------------------+------------------------------------------------------------+ +| ``stop`` | Stop the daemon(s). | ++------------------+------------------------------------------------------------+ +| ``forcestop`` | Force the daemon(s) to stop. Same as ``kill -9``. | ++------------------+------------------------------------------------------------+ +| ``killall`` | Kill all daemons of a particular type. | ++------------------+------------------------------------------------------------+ +| ``cleanlogs`` | Cleans out the log directory. | ++------------------+------------------------------------------------------------+ +| ``cleanalllogs`` | Cleans out **everything** in the log directory. | ++------------------+------------------------------------------------------------+ + +The ``[daemons]`` option allows the ``ceph`` service to target specific daemon types +in order to perform subsystem operations. Daemon types include: + +- ``mon`` +- ``osd`` +- ``mds`` + +.. _Valgrind: http://www.valgrind.org/ +.. _initctl: http://manpages.ubuntu.com/manpages/raring/en/man8/initctl.8.html diff --git a/doc/rados/operations/pg-concepts.rst b/doc/rados/operations/pg-concepts.rst new file mode 100644 index 000000000..83062b53a --- /dev/null +++ b/doc/rados/operations/pg-concepts.rst @@ -0,0 +1,104 @@ +.. _rados_operations_pg_concepts: + +========================== + Placement Group Concepts +========================== + +When you execute commands like ``ceph -w``, ``ceph osd dump``, and other +commands related to placement groups, Ceph may return values using some +of the following terms: + +*Peering* + The process of bringing all of the OSDs that store + a Placement Group (PG) into agreement about the state + of all of the objects (and their metadata) in that PG. + Note that agreeing on the state does not mean that + they all have the latest contents. + +*Acting Set* + The ordered list of OSDs who are (or were as of some epoch) + responsible for a particular placement group. + +*Up Set* + The ordered list of OSDs responsible for a particular placement + group for a particular epoch according to CRUSH. Normally this + is the same as the *Acting Set*, except when the *Acting Set* has + been explicitly overridden via ``pg_temp`` in the OSD Map. + +*Current Interval* or *Past Interval* + A sequence of OSD map epochs during which the *Acting Set* and *Up + Set* for particular placement group do not change. + +*Primary* + The member (and by convention first) of the *Acting Set*, + that is responsible for coordination peering, and is + the only OSD that will accept client-initiated + writes to objects in a placement group. 
+
+*Replica*
+  A non-primary OSD in the *Acting Set* for a placement group
+  (and one that has been recognized as such and *activated* by the primary).
+
+*Stray*
+  An OSD that is not a member of the current *Acting Set*, but
+  has not yet been told that it can delete its copies of a
+  particular placement group.
+
+*Recovery*
+  Ensuring that copies of all of the objects in a placement group
+  are on all of the OSDs in the *Acting Set*. Once *Peering* has
+  been performed, the *Primary* can start accepting write operations,
+  and *Recovery* can proceed in the background.
+
+*PG Info*
+  Basic metadata about the placement group's creation epoch, the version
+  for the most recent write to the placement group, *last epoch started*,
+  *last epoch clean*, and the beginning of the *current interval*. Any
+  inter-OSD communication about placement groups includes the *PG Info*,
+  such that any OSD that knows a placement group exists (or once existed)
+  also has a lower bound on *last epoch clean* or *last epoch started*.
+
+*PG Log*
+  A list of recent updates made to objects in a placement group.
+  Note that these logs can be truncated after all OSDs
+  in the *Acting Set* have acknowledged up to a certain
+  point.
+
+*Missing Set*
+  Each OSD notes update log entries and, if they imply updates to
+  the contents of an object, adds that object to a list of needed
+  updates. This list is called the *Missing Set* for that ``<OSD,PG>``.
+
+*Authoritative History*
+  A complete and fully ordered set of operations that, if
+  performed, would bring an OSD's copy of a placement group
+  up to date.
+
+*Epoch*
+  A (monotonically increasing) OSD map version number.
+
+*Last Epoch Start*
+  The last epoch at which all nodes in the *Acting Set*
+  for a particular placement group agreed on an
+  *Authoritative History*. At this point, *Peering* is
+  deemed to have been successful.
+
+*up_thru*
+  Before a *Primary* can successfully complete the *Peering* process,
+  it must inform a monitor that it is alive through the current
+  OSD map *Epoch* by having the monitor set its *up_thru* in the osd
+  map. This helps *Peering* ignore previous *Acting Sets* for which
+  *Peering* never completed after certain sequences of failures, such as
+  the second interval below:
+
+  - *acting set* = [A,B]
+  - *acting set* = [A]
+  - *acting set* = [] very shortly after (e.g., simultaneous failure, but staggered detection)
+  - *acting set* = [B] (B restarts, A does not)
+
+*Last Epoch Clean*
+  The last *Epoch* at which all nodes in the *Acting Set*
+  for a particular placement group were completely
+  up to date (both placement group logs and object contents).
+  At this point, *recovery* is deemed to have been
+  completed.
diff --git a/doc/rados/operations/pg-repair.rst b/doc/rados/operations/pg-repair.rst
new file mode 100644
index 000000000..609318fca
--- /dev/null
+++ b/doc/rados/operations/pg-repair.rst
@@ -0,0 +1,118 @@
+============================
+Repairing PG Inconsistencies
+============================
+Sometimes a Placement Group (PG) might become ``inconsistent``. To return the PG
+to an ``active+clean`` state, you must first determine which of the PGs has become
+inconsistent and then run the ``pg repair`` command on it. This page contains
+commands for diagnosing PGs and the command for repairing PGs that have become
+inconsistent.
+
+.. highlight:: console
+
+Commands for Diagnosing PG Problems
+===================================
+The commands in this section provide various ways of diagnosing broken PGs.
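+
+For example, the per-pool checks described below can be combined into a quick
+sweep of the whole cluster (a sketch only; it assumes the ``ceph`` and
+``rados`` CLIs are available on the host):
+
+.. code-block:: bash
+
+   # List the inconsistent PGs of every pool in the cluster.
+   for pool in $(ceph osd pool ls); do
+       echo "== ${pool} =="
+       rados list-inconsistent-pg "${pool}"
+   done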
+ +To see a high-level (low-detail) overview of Ceph cluster health, run the +following command: + +.. prompt:: bash # + + ceph health detail + +To see more detail on the status of the PGs, run the following command: + +.. prompt:: bash # + + ceph pg dump --format=json-pretty + +To see a list of inconsistent PGs, run the following command: + +.. prompt:: bash # + + rados list-inconsistent-pg {pool} + +To see a list of inconsistent RADOS objects, run the following command: + +.. prompt:: bash # + + rados list-inconsistent-obj {pgid} + +To see a list of inconsistent snapsets in a specific PG, run the following +command: + +.. prompt:: bash # + + rados list-inconsistent-snapset {pgid} + + +Commands for Repairing PGs +========================== +The form of the command to repair a broken PG is as follows: + +.. prompt:: bash # + + ceph pg repair {pgid} + +Here ``{pgid}`` represents the id of the affected PG. + +For example: + +.. prompt:: bash # + + ceph pg repair 1.4 + +.. note:: PG IDs have the form ``N.xxxxx``, where ``N`` is the number of the + pool that contains the PG. The command ``ceph osd listpools`` and the + command ``ceph osd dump | grep pool`` return a list of pool numbers. + +More Information on PG Repair +============================= +Ceph stores and updates the checksums of objects stored in the cluster. When a +scrub is performed on a PG, the OSD attempts to choose an authoritative copy +from among its replicas. Only one of the possible cases is consistent. After +performing a deep scrub, Ceph calculates the checksum of an object that is read +from disk and compares it to the checksum that was previously recorded. If the +current checksum and the previously recorded checksum do not match, that +mismatch is considered to be an inconsistency. In the case of replicated pools, +any mismatch between the checksum of any replica of an object and the checksum +of the authoritative copy means that there is an inconsistency. The discovery +of these inconsistencies cause a PG's state to be set to ``inconsistent``. + +The ``pg repair`` command attempts to fix inconsistencies of various kinds. If +``pg repair`` finds an inconsistent PG, it attempts to overwrite the digest of +the inconsistent copy with the digest of the authoritative copy. If ``pg +repair`` finds an inconsistent replicated pool, it marks the inconsistent copy +as missing. In the case of replicated pools, recovery is beyond the scope of +``pg repair``. + +In the case of erasure-coded and BlueStore pools, Ceph will automatically +perform repairs if ``osd_scrub_auto_repair`` (default ``false``) is set to +``true`` and if no more than ``osd_scrub_auto_repair_num_errors`` (default +``5``) errors are found. + +The ``pg repair`` command will not solve every problem. Ceph does not +automatically repair PGs when they are found to contain inconsistencies. + +The checksum of a RADOS object or an omap is not always available. Checksums +are calculated incrementally. If a replicated object is updated +non-sequentially, the write operation involved in the update changes the object +and invalidates its checksum. The whole object is not read while the checksum +is recalculated. The ``pg repair`` command is able to make repairs even when +checksums are not available to it, as in the case of Filestore. Users working +with replicated Filestore pools might prefer manual repair to ``ceph pg +repair``. + +This material is relevant for Filestore, but not for BlueStore, which has its +own internal checksums. 
The matched-record checksum and the calculated checksum +cannot prove that any specific copy is in fact authoritative. If there is no +checksum available, ``pg repair`` favors the data on the primary, but this +might not be the uncorrupted replica. Because of this uncertainty, human +intervention is necessary when an inconsistency is discovered. This +intervention sometimes involves use of ``ceph-objectstore-tool``. + +External Links +============== +https://ceph.io/geen-categorie/ceph-manually-repair-object/ - This page +contains a walkthrough of the repair of a PG. It is recommended reading if you +want to repair a PG but have never done so. diff --git a/doc/rados/operations/pg-states.rst b/doc/rados/operations/pg-states.rst new file mode 100644 index 000000000..495229d92 --- /dev/null +++ b/doc/rados/operations/pg-states.rst @@ -0,0 +1,118 @@ +======================== + Placement Group States +======================== + +When checking a cluster's status (e.g., running ``ceph -w`` or ``ceph -s``), +Ceph will report on the status of the placement groups. A placement group has +one or more states. The optimum state for placement groups in the placement group +map is ``active + clean``. + +*creating* + Ceph is still creating the placement group. + +*activating* + The placement group is peered but not yet active. + +*active* + Ceph will process requests to the placement group. + +*clean* + Ceph replicated all objects in the placement group the correct number of times. + +*down* + A replica with necessary data is down, so the placement group is offline. + +*laggy* + A replica is not acknowledging new leases from the primary in a timely fashion; IO is temporarily paused. + +*wait* + The set of OSDs for this PG has just changed and IO is temporarily paused until the previous interval's leases expire. + +*scrubbing* + Ceph is checking the placement group metadata for inconsistencies. + +*deep* + Ceph is checking the placement group data against stored checksums. + +*degraded* + Ceph has not replicated some objects in the placement group the correct number of times yet. + +*inconsistent* + Ceph detects inconsistencies in the one or more replicas of an object in the placement group + (e.g. objects are the wrong size, objects are missing from one replica *after* recovery finished, etc.). + +*peering* + The placement group is undergoing the peering process + +*repair* + Ceph is checking the placement group and repairing any inconsistencies it finds (if possible). + +*recovering* + Ceph is migrating/synchronizing objects and their replicas. + +*forced_recovery* + High recovery priority of that PG is enforced by user. + +*recovery_wait* + The placement group is waiting in line to start recover. + +*recovery_toofull* + A recovery operation is waiting because the destination OSD is over its + full ratio. + +*recovery_unfound* + Recovery stopped due to unfound objects. + +*backfilling* + Ceph is scanning and synchronizing the entire contents of a placement group + instead of inferring what contents need to be synchronized from the logs of + recent operations. Backfill is a special case of recovery. + +*forced_backfill* + High backfill priority of that PG is enforced by user. + +*backfill_wait* + The placement group is waiting in line to start backfill. + +*backfill_toofull* + A backfill operation is waiting because the destination OSD is over + the backfillfull ratio. + +*backfill_unfound* + Backfill stopped due to unfound objects. 
+ +*incomplete* + Ceph detects that a placement group is missing information about + writes that may have occurred, or does not have any healthy + copies. If you see this state, try to start any failed OSDs that may + contain the needed information. In the case of an erasure coded pool + temporarily reducing min_size may allow recovery. + +*stale* + The placement group is in an unknown state - the monitors have not received + an update for it since the placement group mapping changed. + +*remapped* + The placement group is temporarily mapped to a different set of OSDs from what + CRUSH specified. + +*undersized* + The placement group has fewer copies than the configured pool replication level. + +*peered* + The placement group has peered, but cannot serve client IO due to not having + enough copies to reach the pool's configured min_size parameter. Recovery + may occur in this state, so the pg may heal up to min_size eventually. + +*snaptrim* + Trimming snaps. + +*snaptrim_wait* + Queued to trim snaps. + +*snaptrim_error* + Error stopped trimming snaps. + +*unknown* + The ceph-mgr hasn't yet received any information about the PG's state from an + OSD since mgr started up. diff --git a/doc/rados/operations/placement-groups.rst b/doc/rados/operations/placement-groups.rst new file mode 100644 index 000000000..dda4a0177 --- /dev/null +++ b/doc/rados/operations/placement-groups.rst @@ -0,0 +1,897 @@ +.. _placement groups: + +================== + Placement Groups +================== + +.. _pg-autoscaler: + +Autoscaling placement groups +============================ + +Placement groups (PGs) are an internal implementation detail of how Ceph +distributes data. Autoscaling provides a way to manage PGs, and especially to +manage the number of PGs present in different pools. When *pg-autoscaling* is +enabled, the cluster is allowed to make recommendations or automatic +adjustments with respect to the number of PGs for each pool (``pgp_num``) in +accordance with expected cluster utilization and expected pool utilization. + +Each pool has a ``pg_autoscale_mode`` property that can be set to ``off``, +``on``, or ``warn``: + +* ``off``: Disable autoscaling for this pool. It is up to the administrator to + choose an appropriate ``pgp_num`` for each pool. For more information, see + :ref:`choosing-number-of-placement-groups`. +* ``on``: Enable automated adjustments of the PG count for the given pool. +* ``warn``: Raise health checks when the PG count is in need of adjustment. + +To set the autoscaling mode for an existing pool, run a command of the +following form: + +.. prompt:: bash # + + ceph osd pool set <pool-name> pg_autoscale_mode <mode> + +For example, to enable autoscaling on pool ``foo``, run the following command: + +.. prompt:: bash # + + ceph osd pool set foo pg_autoscale_mode on + +There is also a ``pg_autoscale_mode`` setting for any pools that are created +after the initial setup of the cluster. To change this setting, run a command +of the following form: + +.. prompt:: bash # + + ceph config set global osd_pool_default_pg_autoscale_mode <mode> + +You can disable or enable the autoscaler for all pools with the ``noautoscale`` +flag. By default, this flag is set to ``off``, but you can set it to ``on`` by +running the following command: + +.. prompt:: bash # + + ceph osd pool set noautoscale + +To set the ``noautoscale`` flag to ``off``, run the following command: + +.. prompt:: bash # + + ceph osd pool unset noautoscale + +To get the value of the flag, run the following command: + +.. 
prompt:: bash #
+
+   ceph osd pool get noautoscale
+
+Viewing PG scaling recommendations
+----------------------------------
+
+To view each pool, its relative utilization, and any recommended changes to the
+PG count, run the following command:
+
+.. prompt:: bash #
+
+   ceph osd pool autoscale-status
+
+The output will resemble the following::
+
+   POOL   SIZE   TARGET SIZE   RATE   RAW CAPACITY   RATIO    TARGET RATIO   EFFECTIVE RATIO   BIAS   PG_NUM   NEW PG_NUM   AUTOSCALE   BULK
+   a      12900M                3.0   82431M         0.4695                                    8      128                   warn        True
+   c      0                     3.0   82431M         0.0000   0.2000         0.9884            1.0    1        64           warn        True
+   b      0      953.6M         3.0   82431M         0.0347                                    8                            warn        False
+
+- **POOL** is the name of the pool.
+
+- **SIZE** is the amount of data stored in the pool.
+
+- **TARGET SIZE** (if present) is the amount of data that is expected to be
+  stored in the pool, as specified by the administrator. The system uses the
+  greater of the two values for its calculation.
+
+- **RATE** is the multiplier for the pool that determines how much raw storage
+  capacity is consumed. For example, a three-replica pool will have a ratio of
+  3.0, and a ``k=4 m=2`` erasure-coded pool will have a ratio of 1.5.
+
+- **RAW CAPACITY** is the total amount of raw storage capacity on the specific
+  OSDs that are responsible for storing the data of the pool (and perhaps the
+  data of other pools).
+
+- **RATIO** is the ratio of (1) the storage consumed by the pool to (2) the
+  total raw storage capacity. In other words, RATIO is defined as
+  (SIZE * RATE) / RAW CAPACITY.
+
+- **TARGET RATIO** (if present) is the ratio of the expected storage of this
+  pool (that is, the amount of storage that this pool is expected to consume,
+  as specified by the administrator) to the expected storage of all other pools
+  that have target ratios set. If both ``target_size_bytes`` and
+  ``target_size_ratio`` are specified, then ``target_size_ratio`` takes
+  precedence.
+
+- **EFFECTIVE RATIO** is the result of making two adjustments to the target
+  ratio:
+
+  #. Subtracting any capacity expected to be used by pools that have target
+     size set.
+
+  #. Normalizing the target ratios among pools that have target ratio set so
+     that collectively they target cluster capacity. For example, four pools
+     with target_ratio 1.0 would have an effective ratio of 0.25.
+
+  The system's calculations use whichever of these two ratios (that is, the
+  target ratio and the effective ratio) is greater.
+
+- **BIAS** is used as a multiplier to manually adjust a pool's PG count in
+  accordance with prior information about how many PGs a specific pool is
+  expected to have.
+
+- **PG_NUM** is either the current number of PGs associated with the pool or,
+  if a ``pg_num`` change is in progress, the current number of PGs that the
+  pool is working towards.
+
+- **NEW PG_NUM** (if present) is the value that the system is recommending the
+  ``pg_num`` of the pool to be changed to. It is always a power of 2, and it is
+  present only if the recommended value varies from the current value by more
+  than the default factor of ``3``. To adjust this factor (in the following
+  example, it is changed to ``2``), run the following command:
+
+  .. prompt:: bash #
+
+     ceph osd pool set threshold 2.0
+
+- **AUTOSCALE** is the pool's ``pg_autoscale_mode`` and is set to ``on``,
+  ``off``, or ``warn``.
+
+- **BULK** determines whether the pool is ``bulk``. It has a value of ``True``
+  or ``False``. A ``bulk`` pool is expected to be large and should initially
+  have a large number of PGs so that performance does not suffer. On the other
+  hand, a pool that is not ``bulk`` is expected to be small (for example, a
+  ``.mgr`` pool or a meta pool).
+
+.. note::
+
+   If the ``ceph osd pool autoscale-status`` command returns no output at all,
+   there is probably at least one pool that spans multiple CRUSH roots. This
+   'spanning pool' issue can happen in scenarios like the following:
+   when a new deployment auto-creates the ``.mgr`` pool on the ``default``
+   CRUSH root, subsequent pools are created with rules that constrain them to a
+   specific shadow CRUSH tree. For example, if you create an RBD metadata pool
+   that is constrained to ``deviceclass = ssd`` and an RBD data pool that is
+   constrained to ``deviceclass = hdd``, you will encounter this issue. To
+   remedy this issue, constrain the spanning pool to only one device class. In
+   the above scenario, there is likely to be a ``replicated-ssd`` CRUSH rule in
+   effect, and the ``.mgr`` pool can be constrained to ``ssd`` devices by
+   running the following command:
+
+   .. prompt:: bash #
+
+      ceph osd pool set .mgr crush_rule replicated-ssd
+
+   This intervention will result in a small amount of backfill, but
+   typically this traffic completes quickly.
+
+
+Automated scaling
+-----------------
+
+In the simplest approach to automated scaling, the cluster is allowed to
+automatically scale ``pgp_num`` in accordance with usage. Ceph considers the
+total available storage and the target number of PGs for the whole system,
+considers how much data is stored in each pool, and apportions PGs accordingly.
+The system is conservative with its approach, making changes to a pool only
+when the current number of PGs (``pg_num``) varies by more than a factor of 3
+from the recommended number.
+
+The target number of PGs per OSD is determined by the ``mon_target_pg_per_osd``
+parameter (default: 100), which can be adjusted by running the following
+command:
+
+.. prompt:: bash #
+
+   ceph config set global mon_target_pg_per_osd 100
+
+The autoscaler analyzes pools and adjusts on a per-subtree basis. Because each
+pool might map to a different CRUSH rule, and each rule might distribute data
+across different devices, Ceph will consider the utilization of each subtree of
+the hierarchy independently. For example, a pool that maps to OSDs of class
+``ssd`` and a pool that maps to OSDs of class ``hdd`` will each have optimal PG
+counts that are determined by how many of these two different device types
+there are.
+
+If a pool uses OSDs under two or more CRUSH roots (for example, shadow trees
+with both ``ssd`` and ``hdd`` devices), the autoscaler issues a warning to the
+user in the manager log. The warning states the name of the pool and the set of
+roots that overlap each other. The autoscaler does not scale any pools with
+overlapping roots because this condition can cause problems with the scaling
+process. We recommend constraining each pool so that it belongs to only one
+root (that is, one OSD class) to silence the warning and ensure a successful
+scaling process.
+
+.. _managing_bulk_flagged_pools:
+
+Managing pools that are flagged with ``bulk``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If a pool is flagged ``bulk``, then the autoscaler starts the pool with a full
+complement of PGs and then scales down the number of PGs only if the usage
+ratio across the pool is uneven.
However, if a pool is not flagged ``bulk``, +then the autoscaler starts the pool with minimal PGs and creates additional PGs +only if there is more usage in the pool. + +To create a pool that will be flagged ``bulk``, run the following command: + +.. prompt:: bash # + + ceph osd pool create <pool-name> --bulk + +To set or unset the ``bulk`` flag of an existing pool, run the following +command: + +.. prompt:: bash # + + ceph osd pool set <pool-name> bulk <true/false/1/0> + +To get the ``bulk`` flag of an existing pool, run the following command: + +.. prompt:: bash # + + ceph osd pool get <pool-name> bulk + +.. _specifying_pool_target_size: + +Specifying expected pool size +----------------------------- + +When a cluster or pool is first created, it consumes only a small fraction of +the total cluster capacity and appears to the system as if it should need only +a small number of PGs. However, in some cases, cluster administrators know +which pools are likely to consume most of the system capacity in the long run. +When Ceph is provided with this information, a more appropriate number of PGs +can be used from the beginning, obviating subsequent changes in ``pg_num`` and +the associated overhead cost of relocating data. + +The *target size* of a pool can be specified in two ways: either in relation to +the absolute size (in bytes) of the pool, or as a weight relative to all other +pools that have ``target_size_ratio`` set. + +For example, to tell the system that ``mypool`` is expected to consume 100 TB, +run the following command: + +.. prompt:: bash # + + ceph osd pool set mypool target_size_bytes 100T + +Alternatively, to tell the system that ``mypool`` is expected to consume a +ratio of 1.0 relative to other pools that have ``target_size_ratio`` set, +adjust the ``target_size_ratio`` setting of ``my pool`` by running the +following command: + +.. prompt:: bash # + + ceph osd pool set mypool target_size_ratio 1.0 + +If `mypool` is the only pool in the cluster, then it is expected to use 100% of +the total cluster capacity. However, if the cluster contains a second pool that +has ``target_size_ratio`` set to 1.0, then both pools are expected to use 50% +of the total cluster capacity. + +The ``ceph osd pool create`` command has two command-line options that can be +used to set the target size of a pool at creation time: ``--target-size-bytes +<bytes>`` and ``--target-size-ratio <ratio>``. + +Note that if the target-size values that have been specified are impossible +(for example, a capacity larger than the total cluster), then a health check +(``POOL_TARGET_SIZE_BYTES_OVERCOMMITTED``) will be raised. + +If both ``target_size_ratio`` and ``target_size_bytes`` are specified for a +pool, then the latter will be ignored, the former will be used in system +calculations, and a health check (``POOL_HAS_TARGET_SIZE_BYTES_AND_RATIO``) +will be raised. + +Specifying bounds on a pool's PGs +--------------------------------- + +It is possible to specify both the minimum number and the maximum number of PGs +for a pool. + +Setting a Minimum Number of PGs and a Maximum Number of PGs +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +If a minimum is set, then Ceph will not itself reduce (nor recommend that you +reduce) the number of PGs to a value below the configured value. Setting a +minimum serves to establish a lower bound on the amount of parallelism enjoyed +by a client during I/O, even if a pool is mostly empty. 
+ +If a maximum is set, then Ceph will not itself increase (or recommend that you +increase) the number of PGs to a value above the configured value. + +To set the minimum number of PGs for a pool, run a command of the following +form: + +.. prompt:: bash # + + ceph osd pool set <pool-name> pg_num_min <num> + +To set the maximum number of PGs for a pool, run a command of the following +form: + +.. prompt:: bash # + + ceph osd pool set <pool-name> pg_num_max <num> + +In addition, the ``ceph osd pool create`` command has two command-line options +that can be used to specify the minimum or maximum PG count of a pool at +creation time: ``--pg-num-min <num>`` and ``--pg-num-max <num>``. + +.. _preselection: + +Preselecting pg_num +=================== + +When creating a pool with the following command, you have the option to +preselect the value of the ``pg_num`` parameter: + +.. prompt:: bash # + + ceph osd pool create {pool-name} [pg_num] + +If you opt not to specify ``pg_num`` in this command, the cluster uses the PG +autoscaler to automatically configure the parameter in accordance with the +amount of data that is stored in the pool (see :ref:`pg-autoscaler` above). + +However, your decision of whether or not to specify ``pg_num`` at creation time +has no effect on whether the parameter will be automatically tuned by the +cluster afterwards. As seen above, autoscaling of PGs is enabled or disabled by +running a command of the following form: + +.. prompt:: bash # + + ceph osd pool set {pool-name} pg_autoscale_mode (on|off|warn) + +Without the balancer, the suggested target is approximately 100 PG replicas on +each OSD. With the balancer, an initial target of 50 PG replicas on each OSD is +reasonable. + +The autoscaler attempts to satisfy the following conditions: + +- the number of PGs per OSD should be proportional to the amount of data in the + pool +- there should be 50-100 PGs per pool, taking into account the replication + overhead or erasure-coding fan-out of each PG's replicas across OSDs + +Use of Placement Groups +======================= + +A placement group aggregates objects within a pool. The tracking of RADOS +object placement and object metadata on a per-object basis is computationally +expensive. It would be infeasible for a system with millions of RADOS +objects to efficiently track placement on a per-object basis. + +.. ditaa:: + /-----\ /-----\ /-----\ /-----\ /-----\ + | obj | | obj | | obj | | obj | | obj | + \-----/ \-----/ \-----/ \-----/ \-----/ + | | | | | + +--------+--------+ +---+----+ + | | + v v + +-----------------------+ +-----------------------+ + | Placement Group #1 | | Placement Group #2 | + | | | | + +-----------------------+ +-----------------------+ + | | + +------------------------------+ + | + v + +-----------------------+ + | Pool | + | | + +-----------------------+ + +The Ceph client calculates which PG a RADOS object should be in. As part of +this calculation, the client hashes the object ID and performs an operation +involving both the number of PGs in the specified pool and the pool ID. For +details, see `Mapping PGs to OSDs`_. + +The contents of a RADOS object belonging to a PG are stored in a set of OSDs. +For example, in a replicated pool of size two, each PG will store objects on +two OSDs, as shown below: + +.. 
ditaa:: + +-----------------------+ +-----------------------+ + | Placement Group #1 | | Placement Group #2 | + | | | | + +-----------------------+ +-----------------------+ + | | | | + v v v v + /----------\ /----------\ /----------\ /----------\ + | | | | | | | | + | OSD #1 | | OSD #2 | | OSD #2 | | OSD #3 | + | | | | | | | | + \----------/ \----------/ \----------/ \----------/ + + +If OSD #2 fails, another OSD will be assigned to Placement Group #1 and then +filled with copies of all objects in OSD #1. If the pool size is changed from +two to three, an additional OSD will be assigned to the PG and will receive +copies of all objects in the PG. + +An OSD assigned to a PG is not owned exclusively by that PG; rather, the OSD is +shared with other PGs either from the same pool or from other pools. In our +example, OSD #2 is shared by Placement Group #1 and Placement Group #2. If OSD +#2 fails, then Placement Group #2 must restore copies of objects (by making use +of OSD #3). + +When the number of PGs increases, several consequences ensue. The new PGs are +assigned OSDs. The result of the CRUSH function changes, which means that some +objects from the already-existing PGs are copied to the new PGs and removed +from the old ones. + +Factors Relevant To Specifying pg_num +===================================== + +On the one hand, the criteria of data durability and even distribution across +OSDs weigh in favor of a high number of PGs. On the other hand, the criteria of +saving CPU resources and minimizing memory usage weigh in favor of a low number +of PGs. + +.. _data durability: + +Data durability +--------------- + +When an OSD fails, the risk of data loss is increased until replication of the +data it hosted is restored to the configured level. To illustrate this point, +let's imagine a scenario that results in permanent data loss in a single PG: + +#. The OSD fails and all copies of the object that it contains are lost. For + each object within the PG, the number of its replicas suddenly drops from + three to two. + +#. Ceph starts recovery for this PG by choosing a new OSD on which to re-create + the third copy of each object. + +#. Another OSD within the same PG fails before the new OSD is fully populated + with the third copy. Some objects will then only have one surviving copy. + +#. Ceph selects yet another OSD and continues copying objects in order to + restore the desired number of copies. + +#. A third OSD within the same PG fails before recovery is complete. If this + OSD happened to contain the only remaining copy of an object, the object is + permanently lost. + +In a cluster containing 10 OSDs with 512 PGs in a three-replica pool, CRUSH +will give each PG three OSDs. Ultimately, each OSD hosts :math:`\frac{(512 * +3)}{10} = ~150` PGs. So when the first OSD fails in the above scenario, +recovery will begin for all 150 PGs at the same time. + +The 150 PGs that are being recovered are likely to be homogeneously distributed +across the 9 remaining OSDs. Each remaining OSD is therefore likely to send +copies of objects to all other OSDs and also likely to receive some new objects +to be stored because it has become part of a new PG. + +The amount of time it takes for this recovery to complete depends on the +architecture of the Ceph cluster. Compare two setups: (1) Each OSD is hosted by +a 1 TB SSD on a single machine, all of the OSDs are connected to a 10 Gb/s +switch, and the recovery of a single OSD completes within a certain number of +minutes. 
(2) There are two OSDs per machine using HDDs with no SSD WAL+DB and +a 1 Gb/s switch. In the second setup, recovery will be at least one order of +magnitude slower. + +In such a cluster, the number of PGs has almost no effect on data durability. +Whether there are 128 PGs per OSD or 8192 PGs per OSD, the recovery will be no +slower or faster. + +However, an increase in the number of OSDs can increase the speed of recovery. +Suppose our Ceph cluster is expanded from 10 OSDs to 20 OSDs. Each OSD now +participates in only ~75 PGs rather than ~150 PGs. All 19 remaining OSDs will +still be required to replicate the same number of objects in order to recover. +But instead of there being only 10 OSDs that have to copy ~100 GB each, there +are now 20 OSDs that have to copy only 50 GB each. If the network had +previously been a bottleneck, recovery now happens twice as fast. + +Similarly, suppose that our cluster grows to 40 OSDs. Each OSD will host only +~38 PGs. And if an OSD dies, recovery will take place faster than before unless +it is blocked by another bottleneck. Now, however, suppose that our cluster +grows to 200 OSDs. Each OSD will host only ~7 PGs. And if an OSD dies, recovery +will happen across at most :math:`\approx 21 = (7 \times 3)` OSDs +associated with these PGs. This means that recovery will take longer than when +there were only 40 OSDs. For this reason, the number of PGs should be +increased. + +No matter how brief the recovery time is, there is always a chance that an +additional OSD will fail while recovery is in progress. Consider the cluster +with 10 OSDs described above: if any of the OSDs fail, then :math:`\approx 17` +(approximately 150 divided by 9) PGs will have only one remaining copy. And if +any of the 8 remaining OSDs fail, then 2 (approximately 17 divided by 8) PGs +are likely to lose their remaining objects. This is one reason why setting +``size=2`` is risky. + +When the number of OSDs in the cluster increases to 20, the number of PGs that +would be damaged by the loss of three OSDs significantly decreases. The loss of +a second OSD degrades only approximately :math:`4` or (:math:`\frac{75}{19}`) +PGs rather than :math:`\approx 17` PGs, and the loss of a third OSD results in +data loss only if it is one of the 4 OSDs that contains the remaining copy. +This means -- assuming that the probability of losing one OSD during recovery +is 0.0001% -- that the probability of data loss when three OSDs are lost is +:math:`\approx 17 \times 10 \times 0.0001%` in the cluster with 10 OSDs, and +only :math:`\approx 4 \times 20 \times 0.0001%` in the cluster with 20 OSDs. + +In summary, the greater the number of OSDs, the faster the recovery and the +lower the risk of permanently losing a PG due to cascading failures. As far as +data durability is concerned, in a cluster with fewer than 50 OSDs, it doesn't +much matter whether there are 512 or 4096 PGs. + +.. note:: It can take a long time for an OSD that has been recently added to + the cluster to be populated with the PGs assigned to it. However, no object + degradation or impact on data durability will result from the slowness of + this process since Ceph populates data into the new PGs before removing it + from the old PGs. + +.. _object distribution: + +Object distribution within a pool +--------------------------------- + +Under ideal conditions, objects are evenly distributed across PGs. 
Because +CRUSH computes the PG for each object but does not know how much data is stored +in each OSD associated with the PG, the ratio between the number of PGs and the +number of OSDs can have a significant influence on data distribution. + +For example, suppose that there is only a single PG for ten OSDs in a +three-replica pool. In that case, only three OSDs would be used because CRUSH +would have no other option. However, if more PGs are available, RADOS objects are +more likely to be evenly distributed across OSDs. CRUSH makes every effort to +distribute OSDs evenly across all existing PGs. + +As long as there are one or two orders of magnitude more PGs than OSDs, the +distribution is likely to be even. For example: 256 PGs for 3 OSDs, 512 PGs for +10 OSDs, or 1024 PGs for 10 OSDs. + +However, uneven data distribution can emerge due to factors other than the +ratio of PGs to OSDs. For example, since CRUSH does not take into account the +size of the RADOS objects, the presence of a few very large RADOS objects can +create an imbalance. Suppose that one million 4 KB RADOS objects totaling 4 GB +are evenly distributed among 1024 PGs on 10 OSDs. These RADOS objects will +consume 4 GB / 10 = 400 MB on each OSD. If a single 400 MB RADOS object is then +added to the pool, the three OSDs supporting the PG in which the RADOS object +has been placed will each be filled with 400 MB + 400 MB = 800 MB but the seven +other OSDs will still contain only 400 MB. + +.. _resource usage: + +Memory, CPU and network usage +----------------------------- + +Every PG in the cluster imposes memory, network, and CPU demands upon OSDs and +MONs. These needs must be met at all times and are increased during recovery. +Indeed, one of the main reasons PGs were developed was to share this overhead +by clustering objects together. + +For this reason, minimizing the number of PGs saves significant resources. + +.. _choosing-number-of-placement-groups: + +Choosing the Number of PGs +========================== + +.. note: It is rarely necessary to do the math in this section by hand. + Instead, use the ``ceph osd pool autoscale-status`` command in combination + with the ``target_size_bytes`` or ``target_size_ratio`` pool properties. For + more information, see :ref:`pg-autoscaler`. + +If you have more than 50 OSDs, we recommend approximately 50-100 PGs per OSD in +order to balance resource usage, data durability, and data distribution. If you +have fewer than 50 OSDs, follow the guidance in the `preselection`_ section. +For a single pool, use the following formula to get a baseline value: + + Total PGs = :math:`\frac{OSDs \times 100}{pool \: size}` + +Here **pool size** is either the number of replicas for replicated pools or the +K+M sum for erasure-coded pools. To retrieve this sum, run the command ``ceph +osd erasure-code-profile get``. + +Next, check whether the resulting baseline value is consistent with the way you +designed your Ceph cluster to maximize `data durability`_ and `object +distribution`_ and to minimize `resource usage`_. + +This value should be **rounded up to the nearest power of two**. + +Each pool's ``pg_num`` should be a power of two. Other values are likely to +result in uneven distribution of data across OSDs. It is best to increase +``pg_num`` for a pool only when it is feasible and desirable to set the next +highest power of two. Note that this power of two rule is per-pool; it is +neither necessary nor easy to align the sum of all pools' ``pg_num`` to a power +of two. 
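
To gather the inputs for the baseline formula from a live cluster, you can
query the pool and profile directly, as mentioned above. The following is only
a sketch: it assumes a replicated pool named ``rbd`` and the ``default``
erasure code profile, so substitute the names used in your cluster:

.. prompt:: bash #

   # replace "rbd" and "default" with your own pool and profile names
   ceph osd pool get rbd size
   ceph osd erasure-code-profile get default

For a replicated pool, the reported ``size`` is the divisor in the formula; for
an erasure-coded pool, use the sum of the profile's ``k`` and ``m`` values.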
+ +For example, if you have a cluster with 200 OSDs and a single pool with a size +of 3 replicas, estimate the number of PGs as follows: + + :math:`\frac{200 \times 100}{3} = 6667`. Rounded up to the nearest power of 2: 8192. + +When using multiple data pools to store objects, make sure that you balance the +number of PGs per pool against the number of PGs per OSD so that you arrive at +a reasonable total number of PGs. It is important to find a number that +provides reasonably low variance per OSD without taxing system resources or +making the peering process too slow. + +For example, suppose you have a cluster of 10 pools, each with 512 PGs on 10 +OSDs. That amounts to 5,120 PGs distributed across 10 OSDs, or 512 PGs per OSD. +This cluster will not use too many resources. However, in a cluster of 1,000 +pools, each with 512 PGs on 10 OSDs, the OSDs will have to handle ~50,000 PGs +each. This cluster will require significantly more resources and significantly +more time for peering. + +For determining the optimal number of PGs per OSD, we recommend the `PGCalc`_ +tool. + + +.. _setting the number of placement groups: + +Setting the Number of PGs +========================= + +Setting the initial number of PGs in a pool must be done at the time you create +the pool. See `Create a Pool`_ for details. + +However, even after a pool is created, if the ``pg_autoscaler`` is not being +used to manage ``pg_num`` values, you can change the number of PGs by running a +command of the following form: + +.. prompt:: bash # + + ceph osd pool set {pool-name} pg_num {pg_num} + +If you increase the number of PGs, your cluster will not rebalance until you +increase the number of PGs for placement (``pgp_num``). The ``pgp_num`` +parameter specifies the number of PGs that are to be considered for placement +by the CRUSH algorithm. Increasing ``pg_num`` splits the PGs in your cluster, +but data will not be migrated to the newer PGs until ``pgp_num`` is increased. +The ``pgp_num`` parameter should be equal to the ``pg_num`` parameter. To +increase the number of PGs for placement, run a command of the following form: + +.. prompt:: bash # + + ceph osd pool set {pool-name} pgp_num {pgp_num} + +If you decrease the number of PGs, then ``pgp_num`` is adjusted automatically. +In releases of Ceph that are Nautilus and later (inclusive), when the +``pg_autoscaler`` is not used, ``pgp_num`` is automatically stepped to match +``pg_num``. This process manifests as periods of remapping of PGs and of +backfill, and is expected behavior and normal. + +.. _rados_ops_pgs_get_pg_num: + +Get the Number of PGs +===================== + +To get the number of PGs in a pool, run a command of the following form: + +.. prompt:: bash # + + ceph osd pool get {pool-name} pg_num + + +Get a Cluster's PG Statistics +============================= + +To see the details of the PGs in your cluster, run a command of the following +form: + +.. prompt:: bash # + + ceph pg dump [--format {format}] + +Valid formats are ``plain`` (default) and ``json``. + + +Get Statistics for Stuck PGs +============================ + +To see the statistics for all PGs that are stuck in a specified state, run a +command of the following form: + +.. prompt:: bash # + + ceph pg dump_stuck inactive|unclean|stale|undersized|degraded [--format <format>] [-t|--threshold <seconds>] + +- **Inactive** PGs cannot process reads or writes because they are waiting for + enough OSDs with the most up-to-date data to come ``up`` and ``in``. 
+ +- **Undersized** PGs contain objects that have not been replicated the desired + number of times. Under normal conditions, it can be assumed that these PGs + are recovering. + +- **Stale** PGs are in an unknown state -- the OSDs that host them have not + reported to the monitor cluster for a certain period of time (determined by + ``mon_osd_report_timeout``). + +Valid formats are ``plain`` (default) and ``json``. The threshold defines the +minimum number of seconds the PG is stuck before it is included in the returned +statistics (default: 300). + + +Get a PG Map +============ + +To get the PG map for a particular PG, run a command of the following form: + +.. prompt:: bash # + + ceph pg map {pg-id} + +For example: + +.. prompt:: bash # + + ceph pg map 1.6c + +Ceph will return the PG map, the PG, and the OSD status. The output resembles +the following: + +.. prompt:: bash # + + osdmap e13 pg 1.6c (1.6c) -> up [1,0] acting [1,0] + + +Get a PG's Statistics +===================== + +To see statistics for a particular PG, run a command of the following form: + +.. prompt:: bash # + + ceph pg {pg-id} query + + +Scrub a PG +========== + +To scrub a PG, run a command of the following form: + +.. prompt:: bash # + + ceph pg scrub {pg-id} + +Ceph checks the primary and replica OSDs, generates a catalog of all objects in +the PG, and compares the objects against each other in order to ensure that no +objects are missing or mismatched and that their contents are consistent. If +the replicas all match, then a final semantic sweep takes place to ensure that +all snapshot-related object metadata is consistent. Errors are reported in +logs. + +To scrub all PGs from a specific pool, run a command of the following form: + +.. prompt:: bash # + + ceph osd pool scrub {pool-name} + + +Prioritize backfill/recovery of PG(s) +===================================== + +You might encounter a situation in which multiple PGs require recovery or +backfill, but the data in some PGs is more important than the data in others +(for example, some PGs hold data for images that are used by running machines +and other PGs are used by inactive machines and hold data that is less +relevant). In that case, you might want to prioritize recovery or backfill of +the PGs with especially important data so that the performance of the cluster +and the availability of their data are restored sooner. To designate specific +PG(s) as prioritized during recovery, run a command of the following form: + +.. prompt:: bash # + + ceph pg force-recovery {pg-id} [{pg-id #2}] [{pg-id #3} ...] + +To mark specific PG(s) as prioritized during backfill, run a command of the +following form: + +.. prompt:: bash # + + ceph pg force-backfill {pg-id} [{pg-id #2}] [{pg-id #3} ...] + +These commands instruct Ceph to perform recovery or backfill on the specified +PGs before processing the other PGs. Prioritization does not interrupt current +backfills or recovery, but places the specified PGs at the top of the queue so +that they will be acted upon next. If you change your mind or realize that you +have prioritized the wrong PGs, run one or both of the following commands: + +.. prompt:: bash # + + ceph pg cancel-force-recovery {pg-id} [{pg-id #2}] [{pg-id #3} ...] + ceph pg cancel-force-backfill {pg-id} [{pg-id #2}] [{pg-id #3} ...] + +These commands remove the ``force`` flag from the specified PGs, so that the +PGs will be processed in their usual order. 
As in the case of adding the +``force`` flag, this affects only those PGs that are still queued but does not +affect PGs currently undergoing recovery. + +The ``force`` flag is cleared automatically after recovery or backfill of the +PGs is complete. + +Similarly, to instruct Ceph to prioritize all PGs from a specified pool (that +is, to perform recovery or backfill on those PGs first), run one or both of the +following commands: + +.. prompt:: bash # + + ceph osd pool force-recovery {pool-name} + ceph osd pool force-backfill {pool-name} + +These commands can also be cancelled. To revert to the default order, run one +or both of the following commands: + +.. prompt:: bash # + + ceph osd pool cancel-force-recovery {pool-name} + ceph osd pool cancel-force-backfill {pool-name} + +.. warning:: These commands can break the order of Ceph's internal priority + computations, so use them with caution! If you have multiple pools that are + currently sharing the same underlying OSDs, and if the data held by certain + pools is more important than the data held by other pools, then we recommend + that you run a command of the following form to arrange a custom + recovery/backfill priority for all pools: + +.. prompt:: bash # + + ceph osd pool set {pool-name} recovery_priority {value} + +For example, if you have twenty pools, you could make the most important pool +priority ``20``, and the next most important pool priority ``19``, and so on. + +Another option is to set the recovery/backfill priority for only a proper +subset of pools. In such a scenario, three important pools might (all) be +assigned priority ``1`` and all other pools would be left without an assigned +recovery/backfill priority. Another possibility is to select three important +pools and set their recovery/backfill priorities to ``3``, ``2``, and ``1`` +respectively. + +.. important:: Numbers of greater value have higher priority than numbers of + lesser value when using ``ceph osd pool set {pool-name} recovery_priority + {value}`` to set their recovery/backfill priority. For example, a pool with + the recovery/backfill priority ``30`` has a higher priority than a pool with + the recovery/backfill priority ``15``. + +Reverting Lost RADOS Objects +============================ + +If the cluster has lost one or more RADOS objects and you have decided to +abandon the search for the lost data, you must mark the unfound objects +``lost``. + +If every possible location has been queried and all OSDs are ``up`` and ``in``, +but certain RADOS objects are still lost, you might have to give up on those +objects. This situation can arise when rare and unusual combinations of +failures allow the cluster to learn about writes that were performed before the +writes themselves were recovered. + +The command to mark a RADOS object ``lost`` has only one supported option: +``revert``. The ``revert`` option will either roll back to a previous version +of the RADOS object (if it is old enough to have a previous version) or forget +about it entirely (if it is too new to have a previous version). To mark the +"unfound" objects ``lost``, run a command of the following form: + + +.. prompt:: bash # + + ceph pg {pg-id} mark_unfound_lost revert|delete + +.. important:: Use this feature with caution. It might confuse applications + that expect the object(s) to exist. + + +.. toctree:: + :hidden: + + pg-states + pg-concepts + + +.. _Create a Pool: ../pools#createpool +.. _Mapping PGs to OSDs: ../../../architecture#mapping-pgs-to-osds +.. 
_pgcalc: https://old.ceph.com/pgcalc/ diff --git a/doc/rados/operations/pools.rst b/doc/rados/operations/pools.rst new file mode 100644 index 000000000..dda9e844e --- /dev/null +++ b/doc/rados/operations/pools.rst @@ -0,0 +1,751 @@ +.. _rados_pools: + +======= + Pools +======= +Pools are logical partitions that are used to store objects. + +Pools provide: + +- **Resilience**: It is possible to set the number of OSDs that are allowed to + fail without any data being lost. If your cluster uses replicated pools, the + number of OSDs that can fail without data loss is equal to the number of + replicas. + + For example: a typical configuration stores an object and two replicas + (copies) of each RADOS object (that is: ``size = 3``), but you can configure + the number of replicas on a per-pool basis. For `erasure-coded pools + <../erasure-code>`_, resilience is defined as the number of coding chunks + (for example, ``m = 2`` in the default **erasure code profile**). + +- **Placement Groups**: You can set the number of placement groups (PGs) for + the pool. In a typical configuration, the target number of PGs is + approximately one hundred PGs per OSD. This provides reasonable balancing + without consuming excessive computing resources. When setting up multiple + pools, be careful to set an appropriate number of PGs for each pool and for + the cluster as a whole. Each PG belongs to a specific pool: when multiple + pools use the same OSDs, make sure that the **sum** of PG replicas per OSD is + in the desired PG-per-OSD target range. To calculate an appropriate number of + PGs for your pools, use the `pgcalc`_ tool. + +- **CRUSH Rules**: When data is stored in a pool, the placement of the object + and its replicas (or chunks, in the case of erasure-coded pools) in your + cluster is governed by CRUSH rules. Custom CRUSH rules can be created for a + pool if the default rule does not fit your use case. + +- **Snapshots**: The command ``ceph osd pool mksnap`` creates a snapshot of a + pool. + +Pool Names +========== + +Pool names beginning with ``.`` are reserved for use by Ceph's internal +operations. Do not create or manipulate pools with these names. + + +List Pools +========== + +There are multiple ways to get the list of pools in your cluster. + +To list just your cluster's pool names (good for scripting), execute: + +.. prompt:: bash $ + + ceph osd pool ls + +:: + + .rgw.root + default.rgw.log + default.rgw.control + default.rgw.meta + +To list your cluster's pools with the pool number, run the following command: + +.. prompt:: bash $ + + ceph osd lspools + +:: + + 1 .rgw.root + 2 default.rgw.log + 3 default.rgw.control + 4 default.rgw.meta + +To list your cluster's pools with additional information, execute: + +.. 
prompt:: bash $ + + ceph osd pool ls detail + +:: + + pool 1 '.rgw.root' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 19 flags hashpspool stripe_width 0 application rgw read_balance_score 4.00 + pool 2 'default.rgw.log' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 21 flags hashpspool stripe_width 0 application rgw read_balance_score 4.00 + pool 3 'default.rgw.control' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 23 flags hashpspool stripe_width 0 application rgw read_balance_score 4.00 + pool 4 'default.rgw.meta' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 25 flags hashpspool stripe_width 0 pg_autoscale_bias 4 application rgw read_balance_score 4.00 + +To get even more information, you can execute this command with the ``--format`` (or ``-f``) option and the ``json``, ``json-pretty``, ``xml`` or ``xml-pretty`` value. + +.. _createpool: + +Creating a Pool +=============== + +Before creating a pool, consult `Pool, PG and CRUSH Config Reference`_. Your +Ceph configuration file contains a setting (namely, ``pg_num``) that determines +the number of PGs. However, this setting's default value is NOT appropriate +for most systems. In most cases, you should override this default value when +creating your pool. For details on PG numbers, see `setting the number of +placement groups`_ + +For example: + +.. prompt:: bash $ + + osd_pool_default_pg_num = 128 + osd_pool_default_pgp_num = 128 + +.. note:: In Luminous and later releases, each pool must be associated with the + application that will be using the pool. For more information, see + `Associating a Pool with an Application`_ below. + +To create a pool, run one of the following commands: + +.. prompt:: bash $ + + ceph osd pool create {pool-name} [{pg-num} [{pgp-num}]] [replicated] \ + [crush-rule-name] [expected-num-objects] + +or: + +.. prompt:: bash $ + + ceph osd pool create {pool-name} [{pg-num} [{pgp-num}]] erasure \ + [erasure-code-profile] [crush-rule-name] [expected_num_objects] [--autoscale-mode=<on,off,warn>] + +For a brief description of the elements of the above commands, consult the +following: + +.. describe:: {pool-name} + + The name of the pool. It must be unique. + + :Type: String + :Required: Yes. + +.. describe:: {pg-num} + + The total number of PGs in the pool. For details on calculating an + appropriate number, see :ref:`placement groups`. The default value ``8`` is + NOT suitable for most systems. + + :Type: Integer + :Required: Yes. + :Default: 8 + +.. describe:: {pgp-num} + + The total number of PGs for placement purposes. This **should be equal to + the total number of PGs**, except briefly while ``pg_num`` is being + increased or decreased. + + :Type: Integer + :Required: Yes. If no value has been specified in the command, then the default value is used (unless a different value has been set in Ceph configuration). + :Default: 8 + +.. describe:: {replicated|erasure} + + The pool type. This can be either **replicated** (to recover from lost OSDs + by keeping multiple copies of the objects) or **erasure** (to achieve a kind + of `generalized parity RAID <../erasure-code>`_ capability). The + **replicated** pools require more raw storage but can implement all Ceph + operations. 
The **erasure** pools require less raw storage but can perform + only some Ceph tasks and may provide decreased performance. + + :Type: String + :Required: No. + :Default: replicated + +.. describe:: [crush-rule-name] + + The name of the CRUSH rule to use for this pool. The specified rule must + exist; otherwise the command will fail. + + :Type: String + :Required: No. + :Default: For **replicated** pools, it is the rule specified by the :confval:`osd_pool_default_crush_rule` configuration variable. This rule must exist. For **erasure** pools, it is the ``erasure-code`` rule if the ``default`` `erasure code profile`_ is used or the ``{pool-name}`` rule if not. This rule will be created implicitly if it doesn't already exist. + +.. describe:: [erasure-code-profile=profile] + + For **erasure** pools only. Instructs Ceph to use the specified `erasure + code profile`_. This profile must be an existing profile as defined by **osd + erasure-code-profile set**. + + :Type: String + :Required: No. + +.. _erasure code profile: ../erasure-code-profile + +.. describe:: --autoscale-mode=<on,off,warn> + + - ``on``: the Ceph cluster will autotune or recommend changes to the number of PGs in your pool based on actual usage. + - ``warn``: the Ceph cluster will autotune or recommend changes to the number of PGs in your pool based on actual usage. + - ``off``: refer to :ref:`placement groups` for more information. + + :Type: String + :Required: No. + :Default: The default behavior is determined by the :confval:`osd_pool_default_pg_autoscale_mode` option. + +.. describe:: [expected-num-objects] + + The expected number of RADOS objects for this pool. By setting this value and + assigning a negative value to **filestore merge threshold**, you arrange + for the PG folder splitting to occur at the time of pool creation and + avoid the latency impact that accompanies runtime folder splitting. + + :Type: Integer + :Required: No. + :Default: 0, no splitting at the time of pool creation. + +.. _associate-pool-to-application: + +Associating a Pool with an Application +====================================== + +Pools need to be associated with an application before they can be used. Pools +that are intended for use with CephFS and pools that are created automatically +by RGW are associated automatically. Pools that are intended for use with RBD +should be initialized with the ``rbd`` tool (see `Block Device Commands`_ for +more information). + +For other cases, you can manually associate a free-form application name to a +pool by running the following command.: + +.. prompt:: bash $ + + ceph osd pool application enable {pool-name} {application-name} + +.. note:: CephFS uses the application name ``cephfs``, RBD uses the + application name ``rbd``, and RGW uses the application name ``rgw``. + +Setting Pool Quotas +=================== + +To set pool quotas for the maximum number of bytes and/or the maximum number of +RADOS objects per pool, run the following command: + +.. prompt:: bash $ + + ceph osd pool set-quota {pool-name} [max_objects {obj-count}] [max_bytes {bytes}] + +For example: + +.. prompt:: bash $ + + ceph osd pool set-quota data max_objects 10000 + +To remove a quota, set its value to ``0``. + + +Deleting a Pool +=============== + +To delete a pool, run a command of the following form: + +.. prompt:: bash $ + + ceph osd pool delete {pool-name} [{pool-name} --yes-i-really-really-mean-it] + +To remove a pool, you must set the ``mon_allow_pool_delete`` flag to ``true`` +in the monitor's configuration. 
Otherwise, monitors will refuse to remove +pools. + +For more information, see `Monitor Configuration`_. + +.. _Monitor Configuration: ../../configuration/mon-config-ref + +If there are custom rules for a pool that is no longer needed, consider +deleting those rules. + +.. prompt:: bash $ + + ceph osd pool get {pool-name} crush_rule + +For example, if the custom rule is "123", check all pools to see whether they +use the rule by running the following command: + +.. prompt:: bash $ + + ceph osd dump | grep "^pool" | grep "crush_rule 123" + +If no pools use this custom rule, then it is safe to delete the rule from the +cluster. + +Similarly, if there are users with permissions restricted to a pool that no +longer exists, consider deleting those users by running commands of the +following forms: + +.. prompt:: bash $ + + ceph auth ls | grep -C 5 {pool-name} + ceph auth del {user} + + +Renaming a Pool +=============== + +To rename a pool, run a command of the following form: + +.. prompt:: bash $ + + ceph osd pool rename {current-pool-name} {new-pool-name} + +If you rename a pool for which an authenticated user has per-pool capabilities, +you must update the user's capabilities ("caps") to refer to the new pool name. + + +Showing Pool Statistics +======================= + +To show a pool's utilization statistics, run the following command: + +.. prompt:: bash $ + + rados df + +To obtain I/O information for a specific pool or for all pools, run a command +of the following form: + +.. prompt:: bash $ + + ceph osd pool stats [{pool-name}] + + +Making a Snapshot of a Pool +=========================== + +To make a snapshot of a pool, run a command of the following form: + +.. prompt:: bash $ + + ceph osd pool mksnap {pool-name} {snap-name} + +Removing a Snapshot of a Pool +============================= + +To remove a snapshot of a pool, run a command of the following form: + +.. prompt:: bash $ + + ceph osd pool rmsnap {pool-name} {snap-name} + +.. _setpoolvalues: + +Setting Pool Values +=================== + +To assign values to a pool's configuration keys, run a command of the following +form: + +.. prompt:: bash $ + + ceph osd pool set {pool-name} {key} {value} + +You may set values for the following keys: + +.. _compression_algorithm: + +.. describe:: compression_algorithm + + :Description: Sets the inline compression algorithm used in storing data on the underlying BlueStore back end. This key's setting overrides the global setting :confval:`bluestore_compression_algorithm`. + :Type: String + :Valid Settings: ``lz4``, ``snappy``, ``zlib``, ``zstd`` + +.. describe:: compression_mode + + :Description: Sets the policy for the inline compression algorithm used in storing data on the underlying BlueStore back end. This key's setting overrides the global setting :confval:`bluestore_compression_mode`. + :Type: String + :Valid Settings: ``none``, ``passive``, ``aggressive``, ``force`` + +.. describe:: compression_min_blob_size + + + :Description: Sets the minimum size for the compression of chunks: that is, chunks smaller than this are not compressed. This key's setting overrides the following global settings: + + * :confval:`bluestore_compression_min_blob_size` + * :confval:`bluestore_compression_min_blob_size_hdd` + * :confval:`bluestore_compression_min_blob_size_ssd` + + :Type: Unsigned Integer + + +.. describe:: compression_max_blob_size + + :Description: Sets the maximum size for chunks: that is, chunks larger than this are broken into smaller blobs of this size before compression is performed. 
+ :Type: Unsigned Integer + +.. _size: + +.. describe:: size + + :Description: Sets the number of replicas for objects in the pool. For further details, see `Setting the Number of RADOS Object Replicas`_. Replicated pools only. + :Type: Integer + +.. _min_size: + +.. describe:: min_size + + :Description: Sets the minimum number of replicas required for I/O. For further details, see `Setting the Number of RADOS Object Replicas`_. For erasure-coded pools, this should be set to a value greater than 'k'. If I/O is allowed at the value 'k', then there is no redundancy and data will be lost in the event of a permanent OSD failure. For more information, see `Erasure Code <../erasure-code>`_ + :Type: Integer + :Version: ``0.54`` and above + +.. _pg_num: + +.. describe:: pg_num + + :Description: Sets the effective number of PGs to use when calculating data placement. + :Type: Integer + :Valid Range: ``0`` to ``mon_max_pool_pg_num``. If set to ``0``, the value of ``osd_pool_default_pg_num`` will be used. + +.. _pgp_num: + +.. describe:: pgp_num + + :Description: Sets the effective number of PGs to use when calculating data placement. + :Type: Integer + :Valid Range: Between ``1`` and the current value of ``pg_num``. + +.. _crush_rule: + +.. describe:: crush_rule + + :Description: Sets the CRUSH rule that Ceph uses to map object placement within the pool. + :Type: String + +.. _allow_ec_overwrites: + +.. describe:: allow_ec_overwrites + + :Description: Determines whether writes to an erasure-coded pool are allowed to update only part of a RADOS object. This allows CephFS and RBD to use an EC (erasure-coded) pool for user data (but not for metadata). For more details, see `Erasure Coding with Overwrites`_. + :Type: Boolean + + .. versionadded:: 12.2.0 + +.. describe:: hashpspool + + :Description: Sets and unsets the HASHPSPOOL flag on a given pool. + :Type: Integer + :Valid Range: 1 sets flag, 0 unsets flag + +.. _nodelete: + +.. describe:: nodelete + + :Description: Sets and unsets the NODELETE flag on a given pool. + :Type: Integer + :Valid Range: 1 sets flag, 0 unsets flag + :Version: Version ``FIXME`` + +.. _nopgchange: + +.. describe:: nopgchange + + :Description: Sets and unsets the NOPGCHANGE flag on a given pool. + :Type: Integer + :Valid Range: 1 sets flag, 0 unsets flag + :Version: Version ``FIXME`` + +.. _nosizechange: + +.. describe:: nosizechange + + :Description: Sets and unsets the NOSIZECHANGE flag on a given pool. + :Type: Integer + :Valid Range: 1 sets flag, 0 unsets flag + :Version: Version ``FIXME`` + +.. _bulk: + +.. describe:: bulk + + :Description: Sets and unsets the bulk flag on a given pool. + :Type: Boolean + :Valid Range: ``true``/``1`` sets flag, ``false``/``0`` unsets flag + +.. _write_fadvise_dontneed: + +.. describe:: write_fadvise_dontneed + + :Description: Sets and unsets the WRITE_FADVISE_DONTNEED flag on a given pool. + :Type: Integer + :Valid Range: ``1`` sets flag, ``0`` unsets flag + +.. _noscrub: + +.. describe:: noscrub + + :Description: Sets and unsets the NOSCRUB flag on a given pool. + :Type: Integer + :Valid Range: ``1`` sets flag, ``0`` unsets flag + +.. _nodeep-scrub: + +.. describe:: nodeep-scrub + + :Description: Sets and unsets the NODEEP_SCRUB flag on a given pool. + :Type: Integer + :Valid Range: ``1`` sets flag, ``0`` unsets flag + +.. _target_max_bytes: + +.. describe:: target_max_bytes + + :Description: Ceph will begin flushing or evicting objects when the + ``max_bytes`` threshold is triggered. 
+ :Type: Integer + :Example: ``1000000000000`` #1-TB + +.. _target_max_objects: + +.. describe:: target_max_objects + + :Description: Ceph will begin flushing or evicting objects when the + ``max_objects`` threshold is triggered. + :Type: Integer + :Example: ``1000000`` #1M objects + +.. _fast_read: + +.. describe:: fast_read + + :Description: For erasure-coded pools, if this flag is turned ``on``, the + read request issues "sub reads" to all shards, and then waits + until it receives enough shards to decode before it serves + the client. If *jerasure* or *isa* erasure plugins are in + use, then after the first *K* replies have returned, the + client's request is served immediately using the data decoded + from these replies. This approach sacrifices resources in + exchange for better performance. This flag is supported only + for erasure-coded pools. + :Type: Boolean + :Defaults: ``0`` + +.. _scrub_min_interval: + +.. describe:: scrub_min_interval + + :Description: Sets the minimum interval (in seconds) for successive scrubs of the pool's PGs when the load is low. If the default value of ``0`` is in effect, then the value of ``osd_scrub_min_interval`` from central config is used. + + :Type: Double + :Default: ``0`` + +.. _scrub_max_interval: + +.. describe:: scrub_max_interval + + :Description: Sets the maximum interval (in seconds) for scrubs of the pool's PGs regardless of cluster load. If the value of ``scrub_max_interval`` is ``0``, then the value ``osd_scrub_max_interval`` from central config is used. + + :Type: Double + :Default: ``0`` + +.. _deep_scrub_interval: + +.. describe:: deep_scrub_interval + + :Description: Sets the interval (in seconds) for pool “deep” scrubs of the pool's PGs. If the value of ``deep_scrub_interval`` is ``0``, the value ``osd_deep_scrub_interval`` from central config is used. + + :Type: Double + :Default: ``0`` + +.. _recovery_priority: + +.. describe:: recovery_priority + + :Description: Setting this value adjusts a pool's computed reservation priority. This value must be in the range ``-10`` to ``10``. Any pool assigned a negative value will be given a lower priority than any new pools, so users are directed to assign negative values to low-priority pools. + + :Type: Integer + :Default: ``0`` + + +.. _recovery_op_priority: + +.. describe:: recovery_op_priority + + :Description: Sets the recovery operation priority for a specific pool's PGs. This overrides the general priority determined by :confval:`osd_recovery_op_priority`. + + :Type: Integer + :Default: ``0`` + + +Getting Pool Values +=================== + +To get a value from a pool's key, run a command of the following form: + +.. prompt:: bash $ + + ceph osd pool get {pool-name} {key} + + +You may get values from the following keys: + + +``size`` + +:Description: See size_. + +:Type: Integer + + +``min_size`` + +:Description: See min_size_. + +:Type: Integer +:Version: ``0.54`` and above + + +``pg_num`` + +:Description: See pg_num_. + +:Type: Integer + + +``pgp_num`` + +:Description: See pgp_num_. + +:Type: Integer +:Valid Range: Equal to or less than ``pg_num``. + + +``crush_rule`` + +:Description: See crush_rule_. + + +``target_max_bytes`` + +:Description: See target_max_bytes_. + +:Type: Integer + + +``target_max_objects`` + +:Description: See target_max_objects_. + +:Type: Integer + + +``fast_read`` + +:Description: See fast_read_. + +:Type: Boolean + + +``scrub_min_interval`` + +:Description: See scrub_min_interval_. 
+ +:Type: Double + + +``scrub_max_interval`` + +:Description: See scrub_max_interval_. + +:Type: Double + + +``deep_scrub_interval`` + +:Description: See deep_scrub_interval_. + +:Type: Double + + +``allow_ec_overwrites`` + +:Description: See allow_ec_overwrites_. + +:Type: Boolean + + +``recovery_priority`` + +:Description: See recovery_priority_. + +:Type: Integer + + +``recovery_op_priority`` + +:Description: See recovery_op_priority_. + +:Type: Integer + + +Setting the Number of RADOS Object Replicas +=========================================== + +To set the number of data replicas on a replicated pool, run a command of the +following form: + +.. prompt:: bash $ + + ceph osd pool set {poolname} size {num-replicas} + +.. important:: The ``{num-replicas}`` argument includes the primary object + itself. For example, if you want there to be two replicas of the object in + addition to the original object (for a total of three instances of the + object) specify ``3`` by running the following command: + +.. prompt:: bash $ + + ceph osd pool set data size 3 + +You may run the above command for each pool. + +.. Note:: An object might accept I/Os in degraded mode with fewer than ``pool + size`` replicas. To set a minimum number of replicas required for I/O, you + should use the ``min_size`` setting. For example, you might run the + following command: + +.. prompt:: bash $ + + ceph osd pool set data min_size 2 + +This command ensures that no object in the data pool will receive I/O if it has +fewer than ``min_size`` (in this case, two) replicas. + + +Getting the Number of Object Replicas +===================================== + +To get the number of object replicas, run the following command: + +.. prompt:: bash $ + + ceph osd dump | grep 'replicated size' + +Ceph will list pools and highlight the ``replicated size`` attribute. By +default, Ceph creates two replicas of an object (a total of three copies, for a +size of ``3``). + +Managing pools that are flagged with ``--bulk`` +=============================================== +See :ref:`managing_bulk_flagged_pools`. + + +.. _pgcalc: https://old.ceph.com/pgcalc/ +.. _Pool, PG and CRUSH Config Reference: ../../configuration/pool-pg-config-ref +.. _Bloom Filter: https://en.wikipedia.org/wiki/Bloom_filter +.. _setting the number of placement groups: ../placement-groups#set-the-number-of-placement-groups +.. _Erasure Coding with Overwrites: ../erasure-code#erasure-coding-with-overwrites +.. _Block Device Commands: ../../../rbd/rados-rbd-cmds/#create-a-block-device-pool diff --git a/doc/rados/operations/read-balancer.rst b/doc/rados/operations/read-balancer.rst new file mode 100644 index 000000000..0833e4326 --- /dev/null +++ b/doc/rados/operations/read-balancer.rst @@ -0,0 +1,64 @@ +.. _read_balancer: + +======================================= +Operating the Read (Primary) Balancer +======================================= + +You might be wondering: How can I improve performance in my Ceph cluster? +One important data point you can check is the ``read_balance_score`` on each +of your replicated pools. + +This metric, available via ``ceph osd pool ls detail`` (see :ref:`rados_pools` +for more details) indicates read performance, or how balanced the primaries are +for each replicated pool. In most cases, if a ``read_balance_score`` is above 1 +(for instance, 1.5), this means that your pool has unbalanced primaries and that +you may want to try improving your read performance with the read balancer. 
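
As a quick check, you can filter the detailed pool listing for the score. This
is only an illustrative sketch; the ``grep`` pattern simply selects the lines
that report a score:

.. prompt:: bash $

   # read_balance_score is reported only for replicated pools
   ceph osd pool ls detail | grep read_balance_score

A score close to 1 indicates that primary PGs are already spread evenly across
OSDs for that pool; larger values suggest that the read balancer may help.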
+ +Online Optimization +=================== + +At present, there is no online option for the read balancer. However, we plan to add +the read balancer as an option to the :ref:`balancer` in the next Ceph version +so it can be enabled to run automatically in the background like the upmap balancer. + +Offline Optimization +==================== + +Primaries are updated with an offline optimizer that is built into the +:ref:`osdmaptool`. + +#. Grab the latest copy of your osdmap: + + .. prompt:: bash $ + + ceph osd getmap -o om + +#. Run the optimizer: + + .. prompt:: bash $ + + osdmaptool om --read out.txt --read-pool <pool name> [--vstart] + + It is highly recommended that you run the capacity balancer before running the + balancer to ensure optimal results. See :ref:`upmap` for details on how to balance + capacity in a cluster. + +#. Apply the changes: + + .. prompt:: bash $ + + source out.txt + + In the above example, the proposed changes are written to the output file + ``out.txt``. The commands in this procedure are normal Ceph CLI commands + that can be run in order to apply the changes to the cluster. + + If you are working in a vstart cluster, you may pass the ``--vstart`` parameter + as shown above so the CLI commands are formatted with the `./bin/` prefix. + + Note that any time the number of pgs changes (for instance, if the pg autoscaler [:ref:`pg-autoscaler`] + kicks in), you should consider rechecking the scores and rerunning the balancer if needed. + +To see some details about what the tool is doing, you can pass +``--debug-osd 10`` to ``osdmaptool``. To see even more details, pass +``--debug-osd 20`` to ``osdmaptool``. diff --git a/doc/rados/operations/stretch-mode.rst b/doc/rados/operations/stretch-mode.rst new file mode 100644 index 000000000..f797b5b91 --- /dev/null +++ b/doc/rados/operations/stretch-mode.rst @@ -0,0 +1,262 @@ +.. _stretch_mode: + +================ +Stretch Clusters +================ + + +Stretch Clusters +================ + +A stretch cluster is a cluster that has servers in geographically separated +data centers, distributed over a WAN. Stretch clusters have LAN-like high-speed +and low-latency connections, but limited links. Stretch clusters have a higher +likelihood of (possibly asymmetric) network splits, and a higher likelihood of +temporary or complete loss of an entire data center (which can represent +one-third to one-half of the total cluster). + +Ceph is designed with the expectation that all parts of its network and cluster +will be reliable and that failures will be distributed randomly across the +CRUSH map. Even if a switch goes down and causes the loss of many OSDs, Ceph is +designed so that the remaining OSDs and monitors will route around such a loss. + +Sometimes this cannot be relied upon. If you have a "stretched-cluster" +deployment in which much of your cluster is behind a single network component, +you might need to use **stretch mode** to ensure data integrity. + +We will here consider two standard configurations: a configuration with two +data centers (or, in clouds, two availability zones), and a configuration with +three data centers (or, in clouds, three availability zones). + +In the two-site configuration, Ceph expects each of the sites to hold a copy of +the data, and Ceph also expects there to be a third site that has a tiebreaker +monitor. This tiebreaker monitor picks a winner if the network connection fails +and both data centers remain alive. + +The tiebreaker monitor can be a VM. 
It can also have high latency relative to +the two main sites. + +The standard Ceph configuration is able to survive MANY network failures or +data-center failures without ever compromising data availability. If enough +Ceph servers are brought back following a failure, the cluster *will* recover. +If you lose a data center but are still able to form a quorum of monitors and +still have all the data available, Ceph will maintain availability. (This +assumes that the cluster has enough copies to satisfy the pools' ``min_size`` +configuration option, or (failing that) that the cluster has CRUSH rules in +place that will cause the cluster to re-replicate the data until the +``min_size`` configuration option has been met.) + +Stretch Cluster Issues +====================== + +Ceph does not permit the compromise of data integrity and data consistency +under any circumstances. When service is restored after a network failure or a +loss of Ceph nodes, Ceph will restore itself to a state of normal functioning +without operator intervention. + +Ceph does not permit the compromise of data integrity or data consistency, but +there are situations in which *data availability* is compromised. These +situations can occur even though there are enough clusters available to satisfy +Ceph's consistency and sizing constraints. In some situations, you might +discover that your cluster does not satisfy those constraints. + +The first category of these failures that we will discuss involves inconsistent +networks -- if there is a netsplit (a disconnection between two servers that +splits the network into two pieces), Ceph might be unable to mark OSDs ``down`` +and remove them from the acting PG sets. This failure to mark ODSs ``down`` +will occur, despite the fact that the primary PG is unable to replicate data (a +situation that, under normal non-netsplit circumstances, would result in the +marking of affected OSDs as ``down`` and their removal from the PG). If this +happens, Ceph will be unable to satisfy its durability guarantees and +consequently IO will not be permitted. + +The second category of failures that we will discuss involves the situation in +which the constraints are not sufficient to guarantee the replication of data +across data centers, though it might seem that the data is correctly replicated +across data centers. For example, in a scenario in which there are two data +centers named Data Center A and Data Center B, and the CRUSH rule targets three +replicas and places a replica in each data center with a ``min_size`` of ``2``, +the PG might go active with two replicas in Data Center A and zero replicas in +Data Center B. In a situation of this kind, the loss of Data Center A means +that the data is lost and Ceph will not be able to operate on it. This +situation is surprisingly difficult to avoid using only standard CRUSH rules. + + +Stretch Mode +============ +Stretch mode is designed to handle deployments in which you cannot guarantee the +replication of data across two data centers. This kind of situation can arise +when the cluster's CRUSH rule specifies that three copies are to be made, but +then a copy is placed in each data center with a ``min_size`` of 2. Under such +conditions, a placement group can become active with two copies in the first +data center and no copies in the second data center. + + +Entering Stretch Mode +--------------------- + +To enable stretch mode, you must set the location of each monitor, matching +your CRUSH map. This procedure shows how to do this. + + +#. 
Place ``mon.a`` in your first data center: + + .. prompt:: bash $ + + ceph mon set_location a datacenter=site1 + +#. Generate a CRUSH rule that places two copies in each data center. + This requires editing the CRUSH map directly: + + .. prompt:: bash $ + + ceph osd getcrushmap > crush.map.bin + crushtool -d crush.map.bin -o crush.map.txt + +#. Edit the ``crush.map.txt`` file to add a new rule. Here there is only one + other rule (``id 1``), but you might need to use a different rule ID. We + have two data-center buckets named ``site1`` and ``site2``: + + :: + + rule stretch_rule { + id 1 + min_size 1 + max_size 10 + type replicated + step take site1 + step chooseleaf firstn 2 type host + step emit + step take site2 + step chooseleaf firstn 2 type host + step emit + } + +#. Inject the CRUSH map to make the rule available to the cluster: + + .. prompt:: bash $ + + crushtool -c crush.map.txt -o crush2.map.bin + ceph osd setcrushmap -i crush2.map.bin + +#. Run the monitors in connectivity mode. See `Changing Monitor Elections`_. + +#. Command the cluster to enter stretch mode. In this example, ``mon.e`` is the + tiebreaker monitor and we are splitting across data centers. The tiebreaker + monitor must be assigned a data center that is neither ``site1`` nor + ``site2``. For this purpose you can create another data-center bucket named + ``site3`` in your CRUSH and place ``mon.e`` there: + + .. prompt:: bash $ + + ceph mon set_location e datacenter=site3 + ceph mon enable_stretch_mode e stretch_rule datacenter + +When stretch mode is enabled, PGs will become active only when they peer +across data centers (or across whichever CRUSH bucket type was specified), +assuming both are alive. Pools will increase in size from the default ``3`` to +``4``, and two copies will be expected in each site. OSDs will be allowed to +connect to monitors only if they are in the same data center as the monitors. +New monitors will not be allowed to join the cluster if they do not specify a +location. + +If all OSDs and monitors in one of the data centers become inaccessible at once, +the surviving data center enters a "degraded stretch mode". A warning will be +issued, the ``min_size`` will be reduced to ``1``, and the cluster will be +allowed to go active with the data in the single remaining site. The pool size +does not change, so warnings will be generated that report that the pools are +too small -- but a special stretch mode flag will prevent the OSDs from +creating extra copies in the remaining data center. This means that the data +center will keep only two copies, just as before. + +When the missing data center comes back, the cluster will enter a "recovery +stretch mode". This changes the warning and allows peering, but requires OSDs +only from the data center that was ``up`` throughout the duration of the +downtime. When all PGs are in a known state, and are neither degraded nor +incomplete, the cluster transitions back to regular stretch mode, ends the +warning, restores ``min_size`` to its original value (``2``), requires both +sites to peer, and no longer requires the site that was up throughout the +duration of the downtime when peering (which makes failover to the other site +possible, if needed). + +.. _Changing Monitor elections: ../change-mon-elections + +Limitations of Stretch Mode +=========================== +When using stretch mode, OSDs must be located at exactly two sites. 
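
One way to verify this is to inspect the CRUSH hierarchy and confirm that every
OSD sits under one of the two data center buckets (assumed here to be ``site1``
and ``site2``, as in the example rule above):

.. prompt:: bash $

   # every OSD should appear under one of the two datacenter buckets
   ceph osd tree

Any OSD that is not under one of the two data center buckets should be moved
into place (for example, with ``ceph osd crush move``) before stretch mode is
enabled.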
+

Two monitors should be run in each data center, plus a tiebreaker in a third
(or in the cloud) for a total of five monitors. While in stretch mode, OSDs
will connect only to monitors within the data center in which they are located.
OSDs *DO NOT* connect to the tiebreaker monitor.

Erasure-coded pools cannot be used with stretch mode. Attempts to use
erasure-coded pools with stretch mode will fail, and erasure-coded pools cannot
be created while in stretch mode.

To use stretch mode, you will need to create a CRUSH rule that provides two
replicas in each data center. Ensure that there are four total replicas: two in
each data center. If pools exist in the cluster that do not have the default
``size`` or ``min_size``, Ceph will not enter stretch mode. An example of such
a CRUSH rule is given above.

Because stretch mode runs with ``min_size`` set to ``1`` (or, more directly,
``min_size 1``), we recommend enabling stretch mode only when using OSDs on
SSDs (including NVMe OSDs). Hybrid HDD+SSD or HDD-only OSDs are not recommended
due to the long time it takes for them to recover after connectivity between
data centers has been restored. This reduces the potential for data loss.

In the future, stretch mode might support erasure-coded pools and might support
deployments that have more than two data centers.

Other commands
==============

Replacing a failed tiebreaker monitor
-------------------------------------

Turn on a new monitor and run the following command:

.. prompt:: bash $

   ceph mon set_new_tiebreaker mon.<new_mon_name>

This command protests if the new monitor is in the same location as the
existing non-tiebreaker monitors. **This command WILL NOT remove the previous
tiebreaker monitor.** Remove the previous tiebreaker monitor yourself.

Using "--set-crush-location" and not "ceph mon set_location"
------------------------------------------------------------

If you write your own tooling for deploying Ceph, use the
``--set-crush-location`` option when booting monitors instead of running ``ceph
mon set_location``. This option accepts only a single ``bucket=loc`` pair (for
example, ``ceph-mon --set-crush-location 'datacenter=a'``), and that pair must
match the bucket type that was specified when running ``enable_stretch_mode``.

Forcing recovery stretch mode
-----------------------------

When in stretch degraded mode, the cluster will go into "recovery" mode
automatically when the disconnected data center comes back. If that does not
happen or you want to enable recovery mode early, run the following command:

.. prompt:: bash $

   ceph osd force_recovery_stretch_mode --yes-i-really-mean-it

Forcing normal stretch mode
---------------------------

When in recovery mode, the cluster should go back into normal stretch mode when
the PGs are healthy. If this fails to happen or if you want to force the
cross-data-center peering early and are willing to risk data downtime (or have
verified separately that all the PGs can peer, even if they aren't fully
recovered), run the following command:

.. prompt:: bash $

   ceph osd force_healthy_stretch_mode --yes-i-really-mean-it

This command can be used to remove the ``HEALTH_WARN`` state, which recovery
mode generates.
diff --git a/doc/rados/operations/upmap.rst b/doc/rados/operations/upmap.rst
new file mode 100644
index 000000000..8541680d8
--- /dev/null
+++ b/doc/rados/operations/upmap.rst
@@ -0,0 +1,113 @@
+.. 
_upmap: + +======================================= +Using pg-upmap +======================================= + +In Luminous v12.2.z and later releases, there is a *pg-upmap* exception table +in the OSDMap that allows the cluster to explicitly map specific PGs to +specific OSDs. This allows the cluster to fine-tune the data distribution to, +in most cases, uniformly distribute PGs across OSDs. + +However, there is an important caveat when it comes to this new feature: it +requires all clients to understand the new *pg-upmap* structure in the OSDMap. + +Online Optimization +=================== + +Enabling +-------- + +In order to use ``pg-upmap``, the cluster cannot have any pre-Luminous clients. +By default, new clusters enable the *balancer module*, which makes use of +``pg-upmap``. If you want to use a different balancer or you want to make your +own custom ``pg-upmap`` entries, you might want to turn off the balancer in +order to avoid conflict: + +.. prompt:: bash $ + + ceph balancer off + +To allow use of the new feature on an existing cluster, you must restrict the +cluster to supporting only Luminous (and newer) clients. To do so, run the +following command: + +.. prompt:: bash $ + + ceph osd set-require-min-compat-client luminous + +This command will fail if any pre-Luminous clients or daemons are connected to +the monitors. To see which client versions are in use, run the following +command: + +.. prompt:: bash $ + + ceph features + +Balancer Module +--------------- + +The `balancer` module for ``ceph-mgr`` will automatically balance the number of +PGs per OSD. See :ref:`balancer` + +Offline Optimization +==================== + +Upmap entries are updated with an offline optimizer that is built into the +:ref:`osdmaptool`. + +#. Grab the latest copy of your osdmap: + + .. prompt:: bash $ + + ceph osd getmap -o om + +#. Run the optimizer: + + .. prompt:: bash $ + + osdmaptool om --upmap out.txt [--upmap-pool <pool>] \ + [--upmap-max <max-optimizations>] \ + [--upmap-deviation <max-deviation>] \ + [--upmap-active] + + It is highly recommended that optimization be done for each pool + individually, or for sets of similarly utilized pools. You can specify the + ``--upmap-pool`` option multiple times. "Similarly utilized pools" means + pools that are mapped to the same devices and that store the same kind of + data (for example, RBD image pools are considered to be similarly utilized; + an RGW index pool and an RGW data pool are not considered to be similarly + utilized). + + The ``max-optimizations`` value determines the maximum number of upmap + entries to identify. The default is `10` (as is the case with the + ``ceph-mgr`` balancer module), but you should use a larger number if you are + doing offline optimization. If it cannot find any additional changes to + make (that is, if the pool distribution is perfect), it will stop early. + + The ``max-deviation`` value defaults to `5`. If an OSD's PG count varies + from the computed target number by no more than this amount it will be + considered perfect. + + The ``--upmap-active`` option simulates the behavior of the active balancer + in upmap mode. It keeps cycling until the OSDs are balanced and reports how + many rounds have occurred and how long each round takes. The elapsed time + for rounds indicates the CPU load that ``ceph-mgr`` consumes when it computes + the next optimization plan. + +#. Apply the changes: + + .. 
prompt:: bash $ + + source out.txt + + In the above example, the proposed changes are written to the output file + ``out.txt``. The commands in this procedure are normal Ceph CLI commands + that can be run in order to apply the changes to the cluster. + +The above steps can be repeated as many times as necessary to achieve a perfect +distribution of PGs for each set of pools. + +To see some (gory) details about what the tool is doing, you can pass +``--debug-osd 10`` to ``osdmaptool``. To see even more details, pass +``--debug-crush 10`` to ``osdmaptool``. diff --git a/doc/rados/operations/user-management.rst b/doc/rados/operations/user-management.rst new file mode 100644 index 000000000..130c02002 --- /dev/null +++ b/doc/rados/operations/user-management.rst @@ -0,0 +1,840 @@ +.. _user-management: + +================= + User Management +================= + +This document describes :term:`Ceph Client` users, and describes the process by +which they perform authentication and authorization so that they can access the +:term:`Ceph Storage Cluster`. Users are either individuals or system actors +(for example, applications) that use Ceph clients to interact with the Ceph +Storage Cluster daemons. + +.. ditaa:: + +-----+ + | {o} | + | | + +--+--+ /---------\ /---------\ + | | Ceph | | Ceph | + ---+---*----->| |<------------->| | + | uses | Clients | | Servers | + | \---------/ \---------/ + /--+--\ + | | + | | + actor + + +When Ceph runs with authentication and authorization enabled (both are enabled +by default), you must specify a user name and a keyring that contains the +secret key of the specified user (usually these are specified via the command +line). If you do not specify a user name, Ceph will use ``client.admin`` as the +default user name. If you do not specify a keyring, Ceph will look for a +keyring via the ``keyring`` setting in the Ceph configuration. For example, if +you execute the ``ceph health`` command without specifying a user or a keyring, +Ceph will assume that the keyring is in ``/etc/ceph/ceph.client.admin.keyring`` +and will attempt to use that keyring. The following illustrates this behavior: + +.. prompt:: bash $ + + ceph health + +Ceph will interpret the command like this: + +.. prompt:: bash $ + + ceph -n client.admin --keyring=/etc/ceph/ceph.client.admin.keyring health + +Alternatively, you may use the ``CEPH_ARGS`` environment variable to avoid +re-entry of the user name and secret. + +For details on configuring the Ceph Storage Cluster to use authentication, see +`Cephx Config Reference`_. For details on the architecture of Cephx, see +`Architecture - High Availability Authentication`_. + +Background +========== + +No matter what type of Ceph client is used (for example: Block Device, Object +Storage, Filesystem, native API), Ceph stores all data as RADOS objects within +`pools`_. Ceph users must have access to a given pool in order to read and +write data, and Ceph users must have execute permissions in order to use Ceph's +administrative commands. The following concepts will help you understand +Ceph['s] user management. + +.. _rados-ops-user: + +User +---- + +A user is either an individual or a system actor (for example, an application). +Creating users allows you to control who (or what) can access your Ceph Storage +Cluster, its pools, and the data within those pools. + +Ceph has the concept of a ``type`` of user. For purposes of user management, +the type will always be ``client``. 
Ceph identifies users in a "period- +delimited form" that consists of the user type and the user ID: for example, +``TYPE.ID``, ``client.admin``, or ``client.user1``. The reason for user typing +is that the Cephx protocol is used not only by clients but also non-clients, +such as Ceph Monitors, OSDs, and Metadata Servers. Distinguishing the user type +helps to distinguish between client users and other users. This distinction +streamlines access control, user monitoring, and traceability. + +Sometimes Ceph's user type might seem confusing, because the Ceph command line +allows you to specify a user with or without the type, depending upon your +command line usage. If you specify ``--user`` or ``--id``, you can omit the +type. For example, ``client.user1`` can be entered simply as ``user1``. On the +other hand, if you specify ``--name`` or ``-n``, you must supply the type and +name: for example, ``client.user1``. We recommend using the type and name as a +best practice wherever possible. + +.. note:: A Ceph Storage Cluster user is not the same as a Ceph Object Storage + user or a Ceph File System user. The Ceph Object Gateway uses a Ceph Storage + Cluster user to communicate between the gateway daemon and the storage + cluster, but the Ceph Object Gateway has its own user-management + functionality for end users. The Ceph File System uses POSIX semantics, and + the user space associated with the Ceph File System is not the same as the + user space associated with a Ceph Storage Cluster user. + +Authorization (Capabilities) +---------------------------- + +Ceph uses the term "capabilities" (caps) to describe the permissions granted to +an authenticated user to exercise the functionality of the monitors, OSDs, and +metadata servers. Capabilities can also restrict access to data within a pool, +a namespace within a pool, or a set of pools based on their application tags. +A Ceph administrative user specifies the capabilities of a user when creating +or updating that user. + +Capability syntax follows this form:: + + {daemon-type} '{cap-spec}[, {cap-spec} ...]' + +- **Monitor Caps:** Monitor capabilities include ``r``, ``w``, ``x`` access + settings, and can be applied in aggregate from pre-defined profiles with + ``profile {name}``. For example:: + + mon 'allow {access-spec} [network {network/prefix}]' + + mon 'profile {name}' + + The ``{access-spec}`` syntax is as follows: :: + + * | all | [r][w][x] + + The optional ``{network/prefix}`` is a standard network name and prefix + length in CIDR notation (for example, ``10.3.0.0/16``). If + ``{network/prefix}`` is present, the monitor capability can be used only by + clients that connect from the specified network. + +- **OSD Caps:** OSD capabilities include ``r``, ``w``, ``x``, and + ``class-read`` and ``class-write`` access settings. OSD capabilities can be + applied in aggregate from pre-defined profiles with ``profile {name}``. In + addition, OSD capabilities allow for pool and namespace settings. 
:: + + osd 'allow {access-spec} [{match-spec}] [network {network/prefix}]' + + osd 'profile {name} [pool={pool-name} [namespace={namespace-name}]] [network {network/prefix}]' + + There are two alternative forms of the ``{access-spec}`` syntax: :: + + * | all | [r][w][x] [class-read] [class-write] + + class {class name} [{method name}] + + There are two alternative forms of the optional ``{match-spec}`` syntax:: + + pool={pool-name} [namespace={namespace-name}] [object_prefix {prefix}] + + [namespace={namespace-name}] tag {application} {key}={value} + + The optional ``{network/prefix}`` is a standard network name and prefix + length in CIDR notation (for example, ``10.3.0.0/16``). If + ``{network/prefix}`` is present, the OSD capability can be used only by + clients that connect from the specified network. + +- **Manager Caps:** Manager (``ceph-mgr``) capabilities include ``r``, ``w``, + ``x`` access settings, and can be applied in aggregate from pre-defined + profiles with ``profile {name}``. For example:: + + mgr 'allow {access-spec} [network {network/prefix}]' + + mgr 'profile {name} [{key1} {match-type} {value1} ...] [network {network/prefix}]' + + Manager capabilities can also be specified for specific commands, for all + commands exported by a built-in manager service, or for all commands exported + by a specific add-on module. For example:: + + mgr 'allow command "{command-prefix}" [with {key1} {match-type} {value1} ...] [network {network/prefix}]' + + mgr 'allow service {service-name} {access-spec} [network {network/prefix}]' + + mgr 'allow module {module-name} [with {key1} {match-type} {value1} ...] {access-spec} [network {network/prefix}]' + + The ``{access-spec}`` syntax is as follows: :: + + * | all | [r][w][x] + + The ``{service-name}`` is one of the following: :: + + mgr | osd | pg | py + + The ``{match-type}`` is one of the following: :: + + = | prefix | regex + +- **Metadata Server Caps:** For administrators, use ``allow *``. For all other + users (for example, CephFS clients), consult :doc:`/cephfs/client-auth` + +.. note:: The Ceph Object Gateway daemon (``radosgw``) is a client of the + Ceph Storage Cluster. For this reason, it is not represented as + a Ceph Storage Cluster daemon type. + +The following entries describe access capabilities. + +``allow`` + +:Description: Precedes access settings for a daemon. Implies ``rw`` + for MDS only. + + +``r`` + +:Description: Gives the user read access. Required with monitors to retrieve + the CRUSH map. + + +``w`` + +:Description: Gives the user write access to objects. + + +``x`` + +:Description: Gives the user the capability to call class methods + (that is, both read and write) and to conduct ``auth`` + operations on monitors. + + +``class-read`` + +:Descriptions: Gives the user the capability to call class read methods. + Subset of ``x``. + + +``class-write`` + +:Description: Gives the user the capability to call class write methods. + Subset of ``x``. + + +``*``, ``all`` + +:Description: Gives the user read, write, and execute permissions for a + particular daemon/pool, as well as the ability to execute + admin commands. + + +The following entries describe valid capability profiles: + +``profile osd`` (Monitor only) + +:Description: Gives a user permissions to connect as an OSD to other OSDs or + monitors. Conferred on OSDs in order to enable OSDs to handle replication + heartbeat traffic and status reporting. 
+ + +``profile mds`` (Monitor only) + +:Description: Gives a user permissions to connect as an MDS to other MDSs or + monitors. + + +``profile bootstrap-osd`` (Monitor only) + +:Description: Gives a user permissions to bootstrap an OSD. Conferred on + deployment tools such as ``ceph-volume`` and ``cephadm`` + so that they have permissions to add keys when + bootstrapping an OSD. + + +``profile bootstrap-mds`` (Monitor only) + +:Description: Gives a user permissions to bootstrap a metadata server. + Conferred on deployment tools such as ``cephadm`` + so that they have permissions to add keys when bootstrapping + a metadata server. + +``profile bootstrap-rbd`` (Monitor only) + +:Description: Gives a user permissions to bootstrap an RBD user. + Conferred on deployment tools such as ``cephadm`` + so that they have permissions to add keys when bootstrapping + an RBD user. + +``profile bootstrap-rbd-mirror`` (Monitor only) + +:Description: Gives a user permissions to bootstrap an ``rbd-mirror`` daemon + user. Conferred on deployment tools such as ``cephadm`` so that + they have permissions to add keys when bootstrapping an + ``rbd-mirror`` daemon. + +``profile rbd`` (Manager, Monitor, and OSD) + +:Description: Gives a user permissions to manipulate RBD images. When used as a + Monitor cap, it provides the user with the minimal privileges + required by an RBD client application; such privileges include + the ability to blocklist other client users. When used as an OSD + cap, it provides an RBD client application with read-write access + to the specified pool. The Manager cap supports optional ``pool`` + and ``namespace`` keyword arguments. + +``profile rbd-mirror`` (Monitor only) + +:Description: Gives a user permissions to manipulate RBD images and retrieve + RBD mirroring config-key secrets. It provides the minimal + privileges required for the user to manipulate the ``rbd-mirror`` + daemon. + +``profile rbd-read-only`` (Manager and OSD) + +:Description: Gives a user read-only permissions to RBD images. The Manager cap + supports optional ``pool`` and ``namespace`` keyword arguments. + +``profile simple-rados-client`` (Monitor only) + +:Description: Gives a user read-only permissions for monitor, OSD, and PG data. + Intended for use by direct librados client applications. + +``profile simple-rados-client-with-blocklist`` (Monitor only) + +:Description: Gives a user read-only permissions for monitor, OSD, and PG data. + Intended for use by direct librados client applications. Also + includes permissions to add blocklist entries to build + high-availability (HA) applications. + +``profile fs-client`` (Monitor only) + +:Description: Gives a user read-only permissions for monitor, OSD, PG, and MDS + data. Intended for CephFS clients. + +``profile role-definer`` (Monitor and Auth) + +:Description: Gives a user **all** permissions for the auth subsystem, read-only + access to monitors, and nothing else. Useful for automation + tools. Do not assign this unless you really, **really** know what + you're doing, as the security ramifications are substantial and + pervasive. + +``profile crash`` (Monitor and MGR) + +:Description: Gives a user read-only access to monitors. Used in conjunction + with the manager ``crash`` module to upload daemon crash + dumps into monitor storage for later analysis. + +Pool +---- + +A pool is a logical partition where users store data. +In Ceph deployments, it is common to create a pool as a logical partition for +similar types of data. 
For example, when deploying Ceph as a back end for +OpenStack, a typical deployment would have pools for volumes, images, backups +and virtual machines, and such users as ``client.glance`` and ``client.cinder``. + +Application Tags +---------------- + +Access may be restricted to specific pools as defined by their application +metadata. The ``*`` wildcard may be used for the ``key`` argument, the +``value`` argument, or both. The ``all`` tag is a synonym for ``*``. + +Namespace +--------- + +Objects within a pool can be associated to a namespace: that is, to a logical group of +objects within the pool. A user's access to a pool can be associated with a +namespace so that reads and writes by the user can take place only within the +namespace. Objects written to a namespace within the pool can be accessed only +by users who have access to the namespace. + +.. note:: Namespaces are primarily useful for applications written on top of + ``librados``. In such situations, the logical grouping provided by + namespaces can obviate the need to create different pools. In Luminous and + later releases, Ceph Object Gateway uses namespaces for various metadata + objects. + +The rationale for namespaces is this: namespaces are relatively less +computationally expensive than pools, which (pools) can be a computationally +expensive method of segregating data sets between different authorized users. + +For example, a pool ought to host approximately 100 placement-group replicas +per OSD. This means that a cluster with 1000 OSDs and three 3R replicated pools +would have (in a single pool) 100,000 placement-group replicas, and that means +that it has 33,333 Placement Groups. + +By contrast, writing an object to a namespace simply associates the namespace +to the object name without incurring the computational overhead of a separate +pool. Instead of creating a separate pool for a user or set of users, you can +use a namespace. + +.. note:: + + Namespaces are available only when using ``librados``. + + +Access may be restricted to specific RADOS namespaces by use of the ``namespace`` +capability. Limited globbing of namespaces (that is, use of wildcards (``*``)) is supported: if the last character +of the specified namespace is ``*``, then access is granted to any namespace +starting with the provided argument. + +Managing Users +============== + +User management functionality provides Ceph Storage Cluster administrators with +the ability to create, update, and delete users directly in the Ceph Storage +Cluster. + +When you create or delete users in the Ceph Storage Cluster, you might need to +distribute keys to clients so that they can be added to keyrings. For details, see `Keyring +Management`_. + +Listing Users +------------- + +To list the users in your cluster, run the following command: + +.. prompt:: bash $ + + ceph auth ls + +Ceph will list all users in your cluster. 
For example, in a two-node +cluster, ``ceph auth ls`` will provide an output that resembles the following:: + + installed auth entries: + + osd.0 + key: AQCvCbtToC6MDhAATtuT70Sl+DymPCfDSsyV4w== + caps: [mon] allow profile osd + caps: [osd] allow * + osd.1 + key: AQC4CbtTCFJBChAAVq5spj0ff4eHZICxIOVZeA== + caps: [mon] allow profile osd + caps: [osd] allow * + client.admin + key: AQBHCbtT6APDHhAA5W00cBchwkQjh3dkKsyPjw== + caps: [mds] allow + caps: [mon] allow * + caps: [osd] allow * + client.bootstrap-mds + key: AQBICbtTOK9uGBAAdbe5zcIGHZL3T/u2g6EBww== + caps: [mon] allow profile bootstrap-mds + client.bootstrap-osd + key: AQBHCbtT4GxqORAADE5u7RkpCN/oo4e5W0uBtw== + caps: [mon] allow profile bootstrap-osd + +Note that, according to the ``TYPE.ID`` notation for users, ``osd.0`` is a +user of type ``osd`` and an ID of ``0``, and ``client.admin`` is a user of type +``client`` and an ID of ``admin`` (that is, the default ``client.admin`` user). +Note too that each entry has a ``key: <value>`` entry, and also has one or more +``caps:`` entries. + +To save the output of ``ceph auth ls`` to a file, use the ``-o {filename}`` option. + + +Getting a User +-------------- + +To retrieve a specific user, key, and capabilities, run the following command: + +.. prompt:: bash $ + + ceph auth get {TYPE.ID} + +For example: + +.. prompt:: bash $ + + ceph auth get client.admin + +To save the output of ``ceph auth get`` to a file, use the ``-o {filename}`` option. Developers may also run the following command: + +.. prompt:: bash $ + + ceph auth export {TYPE.ID} + +The ``auth export`` command is identical to ``auth get``. + +.. _rados_ops_adding_a_user: + +Adding a User +------------- + +Adding a user creates a user name (that is, ``TYPE.ID``), a secret key, and +any capabilities specified in the command that creates the user. + +A user's key allows the user to authenticate with the Ceph Storage Cluster. +The user's capabilities authorize the user to read, write, or execute on Ceph +monitors (``mon``), Ceph OSDs (``osd``) or Ceph Metadata Servers (``mds``). + +There are a few ways to add a user: + +- ``ceph auth add``: This command is the canonical way to add a user. It + will create the user, generate a key, and add any specified capabilities. + +- ``ceph auth get-or-create``: This command is often the most convenient way + to create a user, because it returns a keyfile format with the user name + (in brackets) and the key. If the user already exists, this command + simply returns the user name and key in the keyfile format. To save the output to + a file, use the ``-o {filename}`` option. + +- ``ceph auth get-or-create-key``: This command is a convenient way to create + a user and return the user's key and nothing else. This is useful for clients that + need only the key (for example, libvirt). If the user already exists, this command + simply returns the key. To save the output to + a file, use the ``-o {filename}`` option. + +It is possible, when creating client users, to create a user with no capabilities. A user +with no capabilities is useless beyond mere authentication, because the client +cannot retrieve the cluster map from the monitor. However, you might want to create a user +with no capabilities and wait until later to add capabilities to the user by using the ``ceph auth caps`` comand. + +A typical user has at least read capabilities on the Ceph monitor and +read and write capabilities on Ceph OSDs. A user's OSD permissions +are often restricted so that the user can access only one particular pool. 
+In the following example, the commands (1) add a client named ``john`` that has read capabilities on the Ceph monitor +and read and write capabilities on the pool named ``liverpool``, (2) authorize a client named ``paul`` to have read capabilities on the Ceph monitor and +read and write capabilities on the pool named ``liverpool``, (3) authorize a client named ``george`` to have read capabilities on the Ceph monitor and +read and write capabilities on the pool named ``liverpool`` and use the keyring named ``george.keyring`` to make this authorization, and (4) authorize +a client named ``ringo`` to have read capabilities on the Ceph monitor and read and write capabilities on the pool named ``liverpool`` and use the key +named ``ringo.key`` to make this authorization: + +.. prompt:: bash $ + + ceph auth add client.john mon 'allow r' osd 'allow rw pool=liverpool' + ceph auth get-or-create client.paul mon 'allow r' osd 'allow rw pool=liverpool' + ceph auth get-or-create client.george mon 'allow r' osd 'allow rw pool=liverpool' -o george.keyring + ceph auth get-or-create-key client.ringo mon 'allow r' osd 'allow rw pool=liverpool' -o ringo.key + +.. important:: Any user that has capabilities on OSDs will have access to ALL pools in the cluster + unless that user's access has been restricted to a proper subset of the pools in the cluster. + + +.. _modify-user-capabilities: + +Modifying User Capabilities +--------------------------- + +The ``ceph auth caps`` command allows you to specify a user and change that +user's capabilities. Setting new capabilities will overwrite current capabilities. +To view current capabilities, run ``ceph auth get USERTYPE.USERID``. +To add capabilities, run a command of the following form (and be sure to specify the existing capabilities): + +.. prompt:: bash $ + + ceph auth caps USERTYPE.USERID {daemon} 'allow [r|w|x|*|...] [pool={pool-name}] [namespace={namespace-name}]' [{daemon} 'allow [r|w|x|*|...] [pool={pool-name}] [namespace={namespace-name}]'] + +For example: + +.. prompt:: bash $ + + ceph auth get client.john + ceph auth caps client.john mon 'allow r' osd 'allow rw pool=liverpool' + ceph auth caps client.paul mon 'allow rw' osd 'allow rwx pool=liverpool' + ceph auth caps client.brian-manager mon 'allow *' osd 'allow *' + +For additional details on capabilities, see `Authorization (Capabilities)`_. + +Deleting a User +--------------- + +To delete a user, use ``ceph auth del``: + +.. prompt:: bash $ + + ceph auth del {TYPE}.{ID} + +Here ``{TYPE}`` is either ``client``, ``osd``, ``mon``, or ``mds``, +and ``{ID}`` is the user name or the ID of the daemon. + + +Printing a User's Key +--------------------- + +To print a user's authentication key to standard output, run the following command: + +.. prompt:: bash $ + + ceph auth print-key {TYPE}.{ID} + +Here ``{TYPE}`` is either ``client``, ``osd``, ``mon``, or ``mds``, +and ``{ID}`` is the user name or the ID of the daemon. + +When it is necessary to populate client software with a user's key (as in the case of libvirt), +you can print the user's key by running the following command: + +.. prompt:: bash $ + + mount -t ceph serverhost:/ mountpoint -o name=client.user,secret=`ceph auth print-key client.user` + +Importing a User +---------------- + +To import one or more users, use ``ceph auth import`` and +specify a keyring as follows: + +.. prompt:: bash $ + + ceph auth import -i /path/to/keyring + +For example: + +.. prompt:: bash $ + + sudo ceph auth import -i /etc/ceph/ceph.keyring + +.. 
note:: The Ceph storage cluster will add new users, their keys, and their + capabilities and will update existing users, their keys, and their + capabilities. + +Keyring Management +================== + +When you access Ceph via a Ceph client, the Ceph client will look for a local +keyring. Ceph presets the ``keyring`` setting with four keyring +names by default. For this reason, you do not have to set the keyring names in your Ceph configuration file +unless you want to override these defaults (which is not recommended). The four default keyring names are as follows: + +- ``/etc/ceph/$cluster.$name.keyring`` +- ``/etc/ceph/$cluster.keyring`` +- ``/etc/ceph/keyring`` +- ``/etc/ceph/keyring.bin`` + +The ``$cluster`` metavariable found in the first two default keyring names above +is your Ceph cluster name as defined by the name of the Ceph configuration +file: for example, if the Ceph configuration file is named ``ceph.conf``, +then your Ceph cluster name is ``ceph`` and the second name above would be +``ceph.keyring``. The ``$name`` metavariable is the user type and user ID: +for example, given the user ``client.admin``, the first name above would be +``ceph.client.admin.keyring``. + +.. note:: When running commands that read or write to ``/etc/ceph``, you might + need to use ``sudo`` to run the command as ``root``. + +After you create a user (for example, ``client.ringo``), you must get the key and add +it to a keyring on a Ceph client so that the user can access the Ceph Storage +Cluster. + +The `User Management`_ section details how to list, get, add, modify, and delete +users directly in the Ceph Storage Cluster. In addition, Ceph provides the +``ceph-authtool`` utility to allow you to manage keyrings from a Ceph client. + +Creating a Keyring +------------------ + +When you use the procedures in the `Managing Users`_ section to create users, +you must provide user keys to the Ceph client(s). This is required so that the Ceph client(s) +can retrieve the key for the specified user and authenticate that user against the Ceph +Storage Cluster. Ceph clients access keyrings in order to look up a user name and +retrieve the user's key. + +The ``ceph-authtool`` utility allows you to create a keyring. To create an +empty keyring, use ``--create-keyring`` or ``-C``. For example: + +.. prompt:: bash $ + + ceph-authtool --create-keyring /path/to/keyring + +When creating a keyring with multiple users, we recommend using the cluster name +(of the form ``$cluster.keyring``) for the keyring filename and saving the keyring in the +``/etc/ceph`` directory. By doing this, you ensure that the ``keyring`` configuration default setting +will pick up the filename without requiring you to specify the filename in the local copy +of your Ceph configuration file. For example, you can create ``ceph.keyring`` by +running the following command: + +.. prompt:: bash $ + + sudo ceph-authtool -C /etc/ceph/ceph.keyring + +When creating a keyring with a single user, we recommend using the cluster name, +the user type, and the user name, and saving the keyring in the ``/etc/ceph`` directory. +For example, we recommend that the ``client.admin`` user use ``ceph.client.admin.keyring``. + +To create a keyring in ``/etc/ceph``, you must do so as ``root``. This means +that the file will have ``rw`` permissions for the ``root`` user only, which is +appropriate when the keyring contains administrator keys. 
However, if you +intend to use the keyring for a particular user or group of users, be sure to use ``chown`` or ``chmod`` to establish appropriate keyring +ownership and access. + +Adding a User to a Keyring +-------------------------- + +When you :ref:`Add a user<rados_ops_adding_a_user>` to the Ceph Storage +Cluster, you can use the `Getting a User`_ procedure to retrieve a user, key, +and capabilities and then save the user to a keyring. + +If you want to use only one user per keyring, the `Getting a User`_ procedure with +the ``-o`` option will save the output in the keyring file format. For example, +to create a keyring for the ``client.admin`` user, run the following command: + +.. prompt:: bash $ + + sudo ceph auth get client.admin -o /etc/ceph/ceph.client.admin.keyring + +Notice that the file format in this command is the file format conventionally used when manipulating the keyrings of individual users. + +If you want to import users to a keyring, you can use ``ceph-authtool`` +to specify the destination keyring and the source keyring. +For example: + +.. prompt:: bash $ + + sudo ceph-authtool /etc/ceph/ceph.keyring --import-keyring /etc/ceph/ceph.client.admin.keyring + +Creating a User +--------------- + +Ceph provides the `Adding a User`_ function to create a user directly in the Ceph +Storage Cluster. However, you can also create a user, keys, and capabilities +directly on a Ceph client keyring, and then import the user to the Ceph +Storage Cluster. For example: + +.. prompt:: bash $ + + sudo ceph-authtool -n client.ringo --cap osd 'allow rwx' --cap mon 'allow rwx' /etc/ceph/ceph.keyring + +For additional details on capabilities, see `Authorization (Capabilities)`_. + +You can also create a keyring and add a new user to the keyring simultaneously. +For example: + +.. prompt:: bash $ + + sudo ceph-authtool -C /etc/ceph/ceph.keyring -n client.ringo --cap osd 'allow rwx' --cap mon 'allow rwx' --gen-key + +In the above examples, the new user ``client.ringo`` has been added only to the +keyring. The new user has not been added to the Ceph Storage Cluster. + +To add the new user ``client.ringo`` to the Ceph Storage Cluster, run the following command: + +.. prompt:: bash $ + + sudo ceph auth add client.ringo -i /etc/ceph/ceph.keyring + +Modifying a User +---------------- + +To modify the capabilities of a user record in a keyring, specify the keyring +and the user, followed by the capabilities. For example: + +.. prompt:: bash $ + + sudo ceph-authtool /etc/ceph/ceph.keyring -n client.ringo --cap osd 'allow rwx' --cap mon 'allow rwx' + +To update the user in the Ceph Storage Cluster, you must update the user +in the keyring to the user entry in the Ceph Storage Cluster. To do so, run the following command: + +.. prompt:: bash $ + + sudo ceph auth import -i /etc/ceph/ceph.keyring + +For details on updating a Ceph Storage Cluster user from a +keyring, see `Importing a User`_ + +You may also :ref:`Modify user capabilities<modify-user-capabilities>` directly in the cluster, store the +results to a keyring file, and then import the keyring into your main +``ceph.keyring`` file. + +Command Line Usage +================== + +Ceph supports the following usage for user name and secret: + +``--id`` | ``--user`` + +:Description: Ceph identifies users with a type and an ID: the form of this user identification is ``TYPE.ID``, and examples of the type and ID are + ``client.admin`` and ``client.user1``. 
The ``id``, ``name`` and + ``-n`` options allow you to specify the ID portion of the user + name (for example, ``admin``, ``user1``, ``foo``). You can specify + the user with the ``--id`` and omit the type. For example, + to specify user ``client.foo``, run the following commands: + + .. prompt:: bash $ + + ceph --id foo --keyring /path/to/keyring health + ceph --user foo --keyring /path/to/keyring health + + +``--name`` | ``-n`` + +:Description: Ceph identifies users with a type and an ID: the form of this user identification is ``TYPE.ID``, and examples of the type and ID are + ``client.admin`` and ``client.user1``. The ``--name`` and ``-n`` + options allow you to specify the fully qualified user name. + You are required to specify the user type (typically ``client``) with the + user ID. For example: + + .. prompt:: bash $ + + ceph --name client.foo --keyring /path/to/keyring health + ceph -n client.foo --keyring /path/to/keyring health + + +``--keyring`` + +:Description: The path to the keyring that contains one or more user names and + secrets. The ``--secret`` option provides the same functionality, + but it does not work with Ceph RADOS Gateway, which uses + ``--secret`` for another purpose. You may retrieve a keyring with + ``ceph auth get-or-create`` and store it locally. This is a + preferred approach, because you can switch user names without + switching the keyring path. For example: + + .. prompt:: bash $ + + sudo rbd map --id foo --keyring /path/to/keyring mypool/myimage + + +.. _pools: ../pools + +Limitations +=========== + +The ``cephx`` protocol authenticates Ceph clients and servers to each other. It +is not intended to handle authentication of human users or application programs +that are run on their behalf. If your access control +needs require that kind of authentication, you will need to have some other mechanism, which is likely to be specific to the +front end that is used to access the Ceph object store. This other mechanism would ensure that only acceptable users and programs are able to run on the +machine that Ceph permits to access its object store. + +The keys used to authenticate Ceph clients and servers are typically stored in +a plain text file on a trusted host. Appropriate permissions must be set on the plain text file. + +.. important:: Storing keys in plaintext files has security shortcomings, but + they are difficult to avoid, given the basic authentication methods Ceph + uses in the background. Anyone setting up Ceph systems should be aware of + these shortcomings. + +In particular, user machines, especially portable machines, should not +be configured to interact directly with Ceph, since that mode of use would +require the storage of a plaintext authentication key on an insecure machine. +Anyone who stole that machine or obtained access to it could +obtain a key that allows them to authenticate their own machines to Ceph. + +Instead of permitting potentially insecure machines to access a Ceph object +store directly, you should require users to sign in to a trusted machine in +your environment, using a method that provides sufficient security for your +purposes. That trusted machine will store the plaintext Ceph keys for the +human users. A future version of Ceph might address these particular +authentication issues more fully. + +At present, none of the Ceph authentication protocols provide secrecy for +messages in transit. 
As a result, an eavesdropper on the wire can hear and understand +all data sent between clients and servers in Ceph, even if the eavesdropper cannot create or +alter the data. Similarly, Ceph does not include options to encrypt user data in the +object store. Users can, of course, hand-encrypt and store their own data in the Ceph +object store, but Ceph itself provides no features to perform object +encryption. Anyone storing sensitive data in Ceph should consider +encrypting their data before providing it to the Ceph system. + + +.. _Architecture - High Availability Authentication: ../../../architecture#high-availability-authentication +.. _Cephx Config Reference: ../../configuration/auth-config-ref diff --git a/doc/rados/troubleshooting/community.rst b/doc/rados/troubleshooting/community.rst new file mode 100644 index 000000000..c0d7be10c --- /dev/null +++ b/doc/rados/troubleshooting/community.rst @@ -0,0 +1,37 @@ +==================== + The Ceph Community +==================== + +Ceph-users email list +===================== + +The Ceph community is an excellent source of information and help. For +operational issues with Ceph we recommend that you `subscribe to the ceph-users +email list`_. When you no longer want to receive emails, you can `unsubscribe +from the ceph-users email list`_. + +Ceph-devel email list +===================== + +You can also `subscribe to the ceph-devel email list`_. You should do so if +your issue is: + +- Likely related to a bug +- Related to a development release package +- Related to a development testing package +- Related to your own builds + +If you no longer want to receive emails from the ``ceph-devel`` email list, you +can `unsubscribe from the ceph-devel email list`_. + +Ceph report +=========== + +.. tip:: Community members can help you if you provide them with detailed + information about your problem. Attach the output of the ``ceph report`` + command to help people understand your issues. + +.. _subscribe to the ceph-devel email list: mailto:dev-join@ceph.io +.. _unsubscribe from the ceph-devel email list: mailto:dev-leave@ceph.io +.. _subscribe to the ceph-users email list: mailto:ceph-users-join@ceph.io +.. _unsubscribe from the ceph-users email list: mailto:ceph-users-leave@ceph.io diff --git a/doc/rados/troubleshooting/cpu-profiling.rst b/doc/rados/troubleshooting/cpu-profiling.rst new file mode 100644 index 000000000..b7fdd1d41 --- /dev/null +++ b/doc/rados/troubleshooting/cpu-profiling.rst @@ -0,0 +1,80 @@ +=============== + CPU Profiling +=============== + +If you built Ceph from source and compiled Ceph for use with `oprofile`_ +you can profile Ceph's CPU usage. See `Installing Oprofile`_ for details. + + +Initializing oprofile +===================== + +``oprofile`` must be initalized the first time it is used. Locate the +``vmlinux`` image that corresponds to the kernel you are running: + +.. prompt:: bash $ + + ls /boot + sudo opcontrol --init + sudo opcontrol --setup --vmlinux={path-to-image} --separate=library --callgraph=6 + + +Starting oprofile +================= + +Run the following command to start ``oprofile``: + +.. prompt:: bash $ + + opcontrol --start + + +Stopping oprofile +================= + +Run the following command to stop ``oprofile``: + +.. prompt:: bash $ + + opcontrol --stop + + +Retrieving oprofile Results +=========================== + +Run the following command to retrieve the top ``cmon`` results: + +.. 
prompt:: bash $ + + opreport -gal ./cmon | less + + +Run the following command to retrieve the top ``cmon`` results, with call +graphs attached: + +.. prompt:: bash $ + + opreport -cal ./cmon | less + +.. important:: After you have reviewed the results, reset ``oprofile`` before + running it again. The act of resetting ``oprofile`` removes data from the + session directory. + + +Resetting oprofile +================== + +Run the following command to reset ``oprofile``: + +.. prompt:: bash $ + + sudo opcontrol --reset + +.. important:: Reset ``oprofile`` after analyzing data. This ensures that + results from prior tests do not get mixed in with the results of the current + test. + +.. _oprofile: http://oprofile.sourceforge.net/about/ +.. _Installing Oprofile: ../../../dev/cpu-profiler + + diff --git a/doc/rados/troubleshooting/index.rst b/doc/rados/troubleshooting/index.rst new file mode 100644 index 000000000..b481ee1dc --- /dev/null +++ b/doc/rados/troubleshooting/index.rst @@ -0,0 +1,19 @@ +================= + Troubleshooting +================= + +You may encounter situations that require you to examine your configuration, +consult the documentation, modify your logging output, troubleshoot monitors +and OSDs, profile memory and CPU usage, and, in the last resort, reach out to +the Ceph community for help. + +.. toctree:: + :maxdepth: 1 + + community + log-and-debug + troubleshooting-mon + troubleshooting-osd + troubleshooting-pg + memory-profiling + cpu-profiling diff --git a/doc/rados/troubleshooting/log-and-debug.rst b/doc/rados/troubleshooting/log-and-debug.rst new file mode 100644 index 000000000..929c3f53f --- /dev/null +++ b/doc/rados/troubleshooting/log-and-debug.rst @@ -0,0 +1,430 @@ +======================= + Logging and Debugging +======================= + +Ceph component debug log levels can be adjusted at runtime, while services are +running. In some circumstances you might want to adjust debug log levels in +``ceph.conf`` or in the central config store. Increased debug logging can be +useful if you are encountering issues when operating your cluster. By default, +Ceph log files are in ``/var/log/ceph``. + +.. tip:: Remember that debug output can slow down your system, and that this + latency sometimes hides race conditions. + +Debug logging is resource intensive. If you encounter a problem in a specific +component of your cluster, begin troubleshooting by enabling logging for only +that component of the cluster. For example, if your OSDs are running without +errors, but your metadata servers are not, enable logging for any specific +metadata server instances that are having problems. Continue by enabling +logging for each subsystem only as needed. + +.. important:: Verbose logging sometimes generates over 1 GB of data per hour. + If the disk that your operating system runs on (your "OS disk") reaches its + capacity, the node associated with that disk will stop working. + +Whenever you enable or increase the rate of debug logging, make sure that you +have ample capacity for log files, as this may dramatically increase their +size. For details on rotating log files, see `Accelerating Log Rotation`_. +When your system is running well again, remove unnecessary debugging settings +in order to ensure that your cluster runs optimally. Logging debug-output +messages is a slow process and a potential waste of your cluster's resources. + +For details on available settings, see `Subsystem, Log and Debug Settings`_. 
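+
+For example, one lightweight way to use the central config store mentioned
+above is to raise the level of a single subsystem and then remove the override
+when you are finished. This is only a sketch: ``debug_osd`` and ``5/5`` are
+example values, so choose the subsystem and level that match the problem you
+are investigating:
+
+.. prompt:: bash $
+
+   ceph config set osd debug_osd 5/5
+   ceph config rm osd debug_osd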
+ +Runtime +======= + +To see the configuration settings at runtime, log in to a host that has a +running daemon and run a command of the following form: + +.. prompt:: bash $ + + ceph daemon {daemon-name} config show | less + +For example: + +.. prompt:: bash $ + + ceph daemon osd.0 config show | less + +To activate Ceph's debugging output (that is, the ``dout()`` logging function) +at runtime, inject arguments into the runtime configuration by running a ``ceph +tell`` command of the following form: + +.. prompt:: bash $ + + ceph tell {daemon-type}.{daemon id or *} config set {name} {value} + +Here ``{daemon-type}`` is ``osd``, ``mon``, or ``mds``. Apply the runtime +setting either to a specific daemon (by specifying its ID) or to all daemons of +a particular type (by using the ``*`` operator). For example, to increase +debug logging for a specific ``ceph-osd`` daemon named ``osd.0``, run the +following command: + +.. prompt:: bash $ + + ceph tell osd.0 config set debug_osd 0/5 + +The ``ceph tell`` command goes through the monitors. However, if you are unable +to bind to the monitor, there is another method that can be used to activate +Ceph's debugging output: use the ``ceph daemon`` command to log in to the host +of a specific daemon and change the daemon's configuration. For example: + +.. prompt:: bash $ + + sudo ceph daemon osd.0 config set debug_osd 0/5 + +For details on available settings, see `Subsystem, Log and Debug Settings`_. + + +Boot Time +========= + +To activate Ceph's debugging output (that is, the ``dout()`` logging function) +at boot time, you must add settings to your Ceph configuration file. +Subsystems that are common to all daemons are set under ``[global]`` in the +configuration file. Subsystems for a specific daemon are set under the relevant +daemon section in the configuration file (for example, ``[mon]``, ``[osd]``, +``[mds]``). Here is an example that shows possible debugging settings in a Ceph +configuration file: + +.. code-block:: ini + + [global] + debug_ms = 1/5 + + [mon] + debug_mon = 20 + debug_paxos = 1/5 + debug_auth = 2 + + [osd] + debug_osd = 1/5 + debug_filestore = 1/5 + debug_journal = 1 + debug_monc = 5/20 + + [mds] + debug_mds = 1 + debug_mds_balancer = 1 + + +For details, see `Subsystem, Log and Debug Settings`_. + + +Accelerating Log Rotation +========================= + +If your log filesystem is nearly full, you can accelerate log rotation by +modifying the Ceph log rotation file at ``/etc/logrotate.d/ceph``. To increase +the frequency of log rotation (which will guard against a filesystem reaching +capacity), add a ``size`` directive after the ``weekly`` frequency directive. +To smooth out volume spikes, consider changing ``weekly`` to ``daily`` and +consider changing ``rotate`` to ``30``. The procedure for adding the size +setting is shown immediately below. + +#. Note the default settings of the ``/etc/logrotate.d/ceph`` file:: + + rotate 7 + weekly + compress + sharedscripts + +#. Modify them by adding a ``size`` setting:: + + rotate 7 + weekly + size 500M + compress + sharedscripts + +#. Start the crontab editor for your user space: + + .. prompt:: bash $ + + crontab -e + +#. Add an entry to crontab that instructs cron to check the + ``etc/logrotate.d/ceph`` file:: + + 30 * * * * /usr/sbin/logrotate /etc/logrotate.d/ceph >/dev/null 2>&1 + +In this example, the ``etc/logrotate.d/ceph`` file will be checked every 30 +minutes. 
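+
+If you also adopt the ``daily`` and ``rotate 30`` suggestions made earlier in
+this section, the ``/etc/logrotate.d/ceph`` file might look like the following
+sketch (tune ``size`` and ``rotate`` to your log volume and disk capacity)::
+
+    rotate 30
+    daily
+    size 500M
+    compress
+    sharedscripts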
+
+Valgrind
+========
+
+When you are debugging your cluster's performance, you might find it necessary
+to track down memory and threading issues. The Valgrind tool suite can be used
+to detect problems in a specific daemon, in a particular type of daemon, or in
+the entire cluster. Because Valgrind is computationally expensive, it should be
+used only when developing or debugging Ceph, and it will slow down your system
+if used at other times. Valgrind messages are logged to ``stderr``.
+
+
+Subsystem, Log and Debug Settings
+=================================
+
+Debug logging output is typically enabled via subsystems.
+
+Ceph Subsystems
+---------------
+
+For each subsystem, there is a logging level for its output logs (a so-called
+"log level") and a logging level for its in-memory logs (a so-called "memory
+level"). Different values may be set for these two logging levels in each
+subsystem. Ceph's logging levels operate on a scale of ``1`` to ``20``, where
+``1`` is terse and ``20`` is verbose [#f1]_. As a general rule, the in-memory
+logs are not sent to the output log unless one or more of the following
+conditions is met:
+
+- a fatal signal is raised,
+- an ``assert`` in source code is triggered, or
+- a dump of the in-memory logs is explicitly requested. Please consult the
+  `document on admin socket
+  <http://docs.ceph.com/en/latest/man/8/ceph/#daemon>`_ for more details.
+
+.. warning ::
+    .. [#f1] In certain rare cases, there are logging levels that can take a value greater than 20. The resulting logs are extremely verbose.
+
+Log levels and memory levels can be set either together or separately. If a
+subsystem is assigned a single value, then that value determines both the log
+level and the memory level. For example, ``debug ms = 5`` will give the ``ms``
+subsystem a log level of ``5`` and a memory level of ``5``. On the other hand,
+if a subsystem is assigned two values that are separated by a forward slash
+(/), then the first value determines the log level and the second value
+determines the memory level. For example, ``debug ms = 1/5`` will give the
+``ms`` subsystem a log level of ``1`` and a memory level of ``5``. See the
+following:
+
+.. code-block:: ini
+
+    debug {subsystem} = {log-level}/{memory-level}
+    #for example
+    debug mds balancer = 1/20
+
+The following table provides a list of Ceph subsystems and their default log and
+memory levels. Once you complete your logging efforts, restore the subsystems
+to their default level or to a level suitable for normal operations. 
+ ++--------------------------+-----------+--------------+ +| Subsystem | Log Level | Memory Level | ++==========================+===========+==============+ +| ``default`` | 0 | 5 | ++--------------------------+-----------+--------------+ +| ``lockdep`` | 0 | 1 | ++--------------------------+-----------+--------------+ +| ``context`` | 0 | 1 | ++--------------------------+-----------+--------------+ +| ``crush`` | 1 | 1 | ++--------------------------+-----------+--------------+ +| ``mds`` | 1 | 5 | ++--------------------------+-----------+--------------+ +| ``mds balancer`` | 1 | 5 | ++--------------------------+-----------+--------------+ +| ``mds log`` | 1 | 5 | ++--------------------------+-----------+--------------+ +| ``mds log expire`` | 1 | 5 | ++--------------------------+-----------+--------------+ +| ``mds migrator`` | 1 | 5 | ++--------------------------+-----------+--------------+ +| ``buffer`` | 0 | 1 | ++--------------------------+-----------+--------------+ +| ``timer`` | 0 | 1 | ++--------------------------+-----------+--------------+ +| ``filer`` | 0 | 1 | ++--------------------------+-----------+--------------+ +| ``striper`` | 0 | 1 | ++--------------------------+-----------+--------------+ +| ``objecter`` | 0 | 1 | ++--------------------------+-----------+--------------+ +| ``rados`` | 0 | 5 | ++--------------------------+-----------+--------------+ +| ``rbd`` | 0 | 5 | ++--------------------------+-----------+--------------+ +| ``rbd mirror`` | 0 | 5 | ++--------------------------+-----------+--------------+ +| ``rbd replay`` | 0 | 5 | ++--------------------------+-----------+--------------+ +| ``rbd pwl`` | 0 | 5 | ++--------------------------+-----------+--------------+ +| ``journaler`` | 0 | 5 | ++--------------------------+-----------+--------------+ +| ``objectcacher`` | 0 | 5 | ++--------------------------+-----------+--------------+ +| ``immutable obj cache`` | 0 | 5 | ++--------------------------+-----------+--------------+ +| ``client`` | 0 | 5 | ++--------------------------+-----------+--------------+ +| ``osd`` | 1 | 5 | ++--------------------------+-----------+--------------+ +| ``optracker`` | 0 | 5 | ++--------------------------+-----------+--------------+ +| ``objclass`` | 0 | 5 | ++--------------------------+-----------+--------------+ +| ``filestore`` | 1 | 3 | ++--------------------------+-----------+--------------+ +| ``journal`` | 1 | 3 | ++--------------------------+-----------+--------------+ +| ``ms`` | 0 | 5 | ++--------------------------+-----------+--------------+ +| ``mon`` | 1 | 5 | ++--------------------------+-----------+--------------+ +| ``monc`` | 0 | 10 | ++--------------------------+-----------+--------------+ +| ``paxos`` | 1 | 5 | ++--------------------------+-----------+--------------+ +| ``tp`` | 0 | 5 | ++--------------------------+-----------+--------------+ +| ``auth`` | 1 | 5 | ++--------------------------+-----------+--------------+ +| ``crypto`` | 1 | 5 | ++--------------------------+-----------+--------------+ +| ``finisher`` | 1 | 1 | ++--------------------------+-----------+--------------+ +| ``reserver`` | 1 | 1 | ++--------------------------+-----------+--------------+ +| ``heartbeatmap`` | 1 | 5 | ++--------------------------+-----------+--------------+ +| ``perfcounter`` | 1 | 5 | ++--------------------------+-----------+--------------+ +| ``rgw`` | 1 | 5 | ++--------------------------+-----------+--------------+ +| ``rgw sync`` | 1 | 5 | ++--------------------------+-----------+--------------+ +| ``rgw datacache`` | 
1 | 5 | ++--------------------------+-----------+--------------+ +| ``rgw access`` | 1 | 5 | ++--------------------------+-----------+--------------+ +| ``rgw dbstore`` | 1 | 5 | ++--------------------------+-----------+--------------+ +| ``javaclient`` | 1 | 5 | ++--------------------------+-----------+--------------+ +| ``asok`` | 1 | 5 | ++--------------------------+-----------+--------------+ +| ``throttle`` | 1 | 1 | ++--------------------------+-----------+--------------+ +| ``refs`` | 0 | 0 | ++--------------------------+-----------+--------------+ +| ``compressor`` | 1 | 5 | ++--------------------------+-----------+--------------+ +| ``bluestore`` | 1 | 5 | ++--------------------------+-----------+--------------+ +| ``bluefs`` | 1 | 5 | ++--------------------------+-----------+--------------+ +| ``bdev`` | 1 | 3 | ++--------------------------+-----------+--------------+ +| ``kstore`` | 1 | 5 | ++--------------------------+-----------+--------------+ +| ``rocksdb`` | 4 | 5 | ++--------------------------+-----------+--------------+ +| ``leveldb`` | 4 | 5 | ++--------------------------+-----------+--------------+ +| ``fuse`` | 1 | 5 | ++--------------------------+-----------+--------------+ +| ``mgr`` | 2 | 5 | ++--------------------------+-----------+--------------+ +| ``mgrc`` | 1 | 5 | ++--------------------------+-----------+--------------+ +| ``dpdk`` | 1 | 5 | ++--------------------------+-----------+--------------+ +| ``eventtrace`` | 1 | 5 | ++--------------------------+-----------+--------------+ +| ``prioritycache`` | 1 | 5 | ++--------------------------+-----------+--------------+ +| ``test`` | 0 | 5 | ++--------------------------+-----------+--------------+ +| ``cephfs mirror`` | 0 | 5 | ++--------------------------+-----------+--------------+ +| ``cepgsqlite`` | 0 | 5 | ++--------------------------+-----------+--------------+ +| ``seastore`` | 0 | 5 | ++--------------------------+-----------+--------------+ +| ``seastore onode`` | 0 | 5 | ++--------------------------+-----------+--------------+ +| ``seastore odata`` | 0 | 5 | ++--------------------------+-----------+--------------+ +| ``seastore ompap`` | 0 | 5 | ++--------------------------+-----------+--------------+ +| ``seastore tm`` | 0 | 5 | ++--------------------------+-----------+--------------+ +| ``seastore t`` | 0 | 5 | ++--------------------------+-----------+--------------+ +| ``seastore cleaner`` | 0 | 5 | ++--------------------------+-----------+--------------+ +| ``seastore epm`` | 0 | 5 | ++--------------------------+-----------+--------------+ +| ``seastore lba`` | 0 | 5 | ++--------------------------+-----------+--------------+ +| ``seastore fixedkv tree``| 0 | 5 | ++--------------------------+-----------+--------------+ +| ``seastore cache`` | 0 | 5 | ++--------------------------+-----------+--------------+ +| ``seastore journal`` | 0 | 5 | ++--------------------------+-----------+--------------+ +| ``seastore device`` | 0 | 5 | ++--------------------------+-----------+--------------+ +| ``seastore backref`` | 0 | 5 | ++--------------------------+-----------+--------------+ +| ``alienstore`` | 0 | 5 | ++--------------------------+-----------+--------------+ +| ``mclock`` | 1 | 5 | ++--------------------------+-----------+--------------+ +| ``cyanstore`` | 0 | 5 | ++--------------------------+-----------+--------------+ +| ``ceph exporter`` | 1 | 5 | ++--------------------------+-----------+--------------+ +| ``memstore`` | 1 | 5 | ++--------------------------+-----------+--------------+ + + +Logging 
Settings +---------------- + +It is not necessary to specify logging and debugging settings in the Ceph +configuration file, but you may override default settings when needed. Ceph +supports the following settings: + +.. confval:: log_file +.. confval:: log_max_new +.. confval:: log_max_recent +.. confval:: log_to_file +.. confval:: log_to_stderr +.. confval:: err_to_stderr +.. confval:: log_to_syslog +.. confval:: err_to_syslog +.. confval:: log_flush_on_exit +.. confval:: clog_to_monitors +.. confval:: clog_to_syslog +.. confval:: mon_cluster_log_to_syslog +.. confval:: mon_cluster_log_file + +OSD +--- + +.. confval:: osd_debug_drop_ping_probability +.. confval:: osd_debug_drop_ping_duration + +Filestore +--------- + +.. confval:: filestore_debug_omap_check + +MDS +--- + +- :confval:`mds_debug_scatterstat` +- :confval:`mds_debug_frag` +- :confval:`mds_debug_auth_pins` +- :confval:`mds_debug_subtrees` + +RADOS Gateway +------------- + +- :confval:`rgw_log_nonexistent_bucket` +- :confval:`rgw_log_object_name` +- :confval:`rgw_log_object_name_utc` +- :confval:`rgw_enable_ops_log` +- :confval:`rgw_enable_usage_log` +- :confval:`rgw_usage_log_flush_threshold` +- :confval:`rgw_usage_log_tick_interval` diff --git a/doc/rados/troubleshooting/memory-profiling.rst b/doc/rados/troubleshooting/memory-profiling.rst new file mode 100644 index 000000000..8e58f2d76 --- /dev/null +++ b/doc/rados/troubleshooting/memory-profiling.rst @@ -0,0 +1,203 @@ +================== + Memory Profiling +================== + +Ceph Monitor, OSD, and MDS can report ``TCMalloc`` heap profiles. Install +``google-perftools`` if you want to generate these. Your OS distribution might +package this under a different name (for example, ``gperftools``), and your OS +distribution might use a different package manager. Run a command similar to +this one to install ``google-perftools``: + +.. prompt:: bash + + sudo apt-get install google-perftools + +The profiler dumps output to your ``log file`` directory (``/var/log/ceph``). +See `Logging and Debugging`_ for details. + +To view the profiler logs with Google's performance tools, run the following +command: + +.. prompt:: bash + + google-pprof --text {path-to-daemon} {log-path/filename} + +For example:: + + $ ceph tell osd.0 heap start_profiler + $ ceph tell osd.0 heap dump + osd.0 tcmalloc heap stats:------------------------------------------------ + MALLOC: 2632288 ( 2.5 MiB) Bytes in use by application + MALLOC: + 499712 ( 0.5 MiB) Bytes in page heap freelist + MALLOC: + 543800 ( 0.5 MiB) Bytes in central cache freelist + MALLOC: + 327680 ( 0.3 MiB) Bytes in transfer cache freelist + MALLOC: + 1239400 ( 1.2 MiB) Bytes in thread cache freelists + MALLOC: + 1142936 ( 1.1 MiB) Bytes in malloc metadata + MALLOC: ------------ + MALLOC: = 6385816 ( 6.1 MiB) Actual memory used (physical + swap) + MALLOC: + 0 ( 0.0 MiB) Bytes released to OS (aka unmapped) + MALLOC: ------------ + MALLOC: = 6385816 ( 6.1 MiB) Virtual address space used + MALLOC: + MALLOC: 231 Spans in use + MALLOC: 56 Thread heaps in use + MALLOC: 8192 Tcmalloc page size + ------------------------------------------------ + Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()). + Bytes released to the OS take up virtual address space but no physical memory. 
+ $ google-pprof --text \ + /usr/bin/ceph-osd \ + /var/log/ceph/ceph-osd.0.profile.0001.heap + Total: 3.7 MB + 1.9 51.1% 51.1% 1.9 51.1% ceph::log::Log::create_entry + 1.8 47.3% 98.4% 1.8 47.3% std::string::_Rep::_S_create + 0.0 0.4% 98.9% 0.0 0.6% SimpleMessenger::add_accept_pipe + 0.0 0.4% 99.2% 0.0 0.6% decode_message + ... + +Performing another heap dump on the same daemon creates another file. It is +convenient to compare the new file to a file created by a previous heap dump to +show what has grown in the interval. For example:: + + $ google-pprof --text --base out/osd.0.profile.0001.heap \ + ceph-osd out/osd.0.profile.0003.heap + Total: 0.2 MB + 0.1 50.3% 50.3% 0.1 50.3% ceph::log::Log::create_entry + 0.1 46.6% 96.8% 0.1 46.6% std::string::_Rep::_S_create + 0.0 0.9% 97.7% 0.0 26.1% ReplicatedPG::do_op + 0.0 0.8% 98.5% 0.0 0.8% __gnu_cxx::new_allocator::allocate + +See `Google Heap Profiler`_ for additional details. + +After you have installed the heap profiler, start your cluster and begin using +the heap profiler. You can enable or disable the heap profiler at runtime, or +ensure that it runs continuously. When running commands based on the examples +that follow, do the following: + +#. replace ``{daemon-type}`` with ``mon``, ``osd`` or ``mds`` +#. replace ``{daemon-id}`` with the OSD number or the MON ID or the MDS ID + + +Starting the Profiler +--------------------- + +To start the heap profiler, run a command of the following form: + +.. prompt:: bash + + ceph tell {daemon-type}.{daemon-id} heap start_profiler + +For example: + +.. prompt:: bash + + ceph tell osd.1 heap start_profiler + +Alternatively, if the ``CEPH_HEAP_PROFILER_INIT=true`` variable is found in the +environment, the profile will be started when the daemon starts running. + +Printing Stats +-------------- + +To print out statistics, run a command of the following form: + +.. prompt:: bash + + ceph tell {daemon-type}.{daemon-id} heap stats + +For example: + +.. prompt:: bash + + ceph tell osd.0 heap stats + +.. note:: The reporting of stats with this command does not require the + profiler to be running and does not dump the heap allocation information to + a file. + + +Dumping Heap Information +------------------------ + +To dump heap information, run a command of the following form: + +.. prompt:: bash + + ceph tell {daemon-type}.{daemon-id} heap dump + +For example: + +.. prompt:: bash + + ceph tell mds.a heap dump + +.. note:: Dumping heap information works only when the profiler is running. + + +Releasing Memory +---------------- + +To release memory that ``tcmalloc`` has allocated but which is not being used +by the Ceph daemon itself, run a command of the following form: + +.. prompt:: bash + + ceph tell {daemon-type}{daemon-id} heap release + +For example: + +.. prompt:: bash + + ceph tell osd.2 heap release + + +Stopping the Profiler +--------------------- + +To stop the heap profiler, run a command of the following form: + +.. prompt:: bash + + ceph tell {daemon-type}.{daemon-id} heap stop_profiler + +For example: + +.. prompt:: bash + + ceph tell osd.0 heap stop_profiler + +.. _Logging and Debugging: ../log-and-debug +.. _Google Heap Profiler: http://goog-perftools.sourceforge.net/doc/heap_profiler.html + +Alternative Methods of Memory Profiling +---------------------------------------- + +Running Massif heap profiler with Valgrind +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The Massif heap profiler tool can be used with Valgrind to measure how much +heap memory is used. 
This method is well-suited to troubleshooting RadosGW. + +See the `Massif documentation +<https://valgrind.org/docs/manual/ms-manual.html>`_ for more information. + +Install Valgrind from the package manager for your distribution then start the +Ceph daemon you want to troubleshoot: + +.. prompt:: bash + + sudo -u ceph valgrind --max-threads=1024 --tool=massif /usr/bin/radosgw -f --cluster ceph --name NAME --setuser ceph --setgroup ceph + +When this command has completed its run, a file with a name of the form +``massif.out.<pid>`` will be saved in your current working directory. To run +the command above, the user who runs it must have write permissions in the +current directory. + +Run the ``ms_print`` command to get a graph and statistics from the collected +data in the ``massif.out.<pid>`` file: + +.. prompt:: bash + + ms_print massif.out.12345 + +The output of this command is helpful when submitting a bug report. diff --git a/doc/rados/troubleshooting/troubleshooting-mon.rst b/doc/rados/troubleshooting/troubleshooting-mon.rst new file mode 100644 index 000000000..1170da7c3 --- /dev/null +++ b/doc/rados/troubleshooting/troubleshooting-mon.rst @@ -0,0 +1,713 @@ +.. _rados-troubleshooting-mon: + +========================== + Troubleshooting Monitors +========================== + +.. index:: monitor, high availability + +Even if a cluster experiences monitor-related problems, the cluster is not +necessarily in danger of going down. If a cluster has lost multiple monitors, +it can still remain up and running as long as there are enough surviving +monitors to form a quorum. + +If your cluster is having monitor-related problems, we recommend that you +consult the following troubleshooting information. + +Initial Troubleshooting +======================= + +The first steps in the process of troubleshooting Ceph Monitors involve making +sure that the Monitors are running and that they are able to communicate with +the network and on the network. Follow the steps in this section to rule out +the simplest causes of Monitor malfunction. + +#. **Make sure that the Monitors are running.** + + Make sure that the Monitor (*mon*) daemon processes (``ceph-mon``) are + running. It might be the case that the mons have not be restarted after an + upgrade. Checking for this simple oversight can save hours of painstaking + troubleshooting. + + It is also important to make sure that the manager daemons (``ceph-mgr``) + are running. Remember that typical cluster configurations provide one + Manager (``ceph-mgr``) for each Monitor (``ceph-mon``). + + .. note:: In releases prior to v1.12.5, Rook will not run more than two + managers. + +#. **Make sure that you can reach the Monitor nodes.** + + In certain rare cases, ``iptables`` rules might be blocking access to + Monitor nodes or TCP ports. These rules might be left over from earlier + stress testing or rule development. To check for the presence of such + rules, SSH into each Monitor node and use ``telnet`` or ``nc`` or a similar + tool to attempt to connect to each of the other Monitor nodes on ports + ``tcp/3300`` and ``tcp/6789``. + +#. **Make sure that the "ceph status" command runs and receives a reply from the cluster.** + + If the ``ceph status`` command receives a reply from the cluster, then the + cluster is up and running. Monitors answer to a ``status`` request only if + there is a formed quorum. Confirm that one or more ``mgr`` daemons are + reported as running. 
In a cluster with no deficiencies, ``ceph status`` + will report that all ``mgr`` daemons are running. + + If the ``ceph status`` command does not receive a reply from the cluster, + then there are probably not enough Monitors ``up`` to form a quorum. If the + ``ceph -s`` command is run with no further options specified, it connects + to an arbitrarily selected Monitor. In certain cases, however, it might be + helpful to connect to a specific Monitor (or to several specific Monitors + in sequence) by adding the ``-m`` flag to the command: for example, ``ceph + status -m mymon1``. + +#. **None of this worked. What now?** + + If the above solutions have not resolved your problems, you might find it + helpful to examine each individual Monitor in turn. Even if no quorum has + been formed, it is possible to contact each Monitor individually and + request its status by using the ``ceph tell mon.ID mon_status`` command + (here ``ID`` is the Monitor's identifier). + + Run the ``ceph tell mon.ID mon_status`` command for each Monitor in the + cluster. For more on this command's output, see :ref:`Understanding + mon_status + <rados_troubleshoting_troubleshooting_mon_understanding_mon_status>`. + + There is also an alternative method for contacting each individual Monitor: + SSH into each Monitor node and query the daemon's admin socket. See + :ref:`Using the Monitor's Admin + Socket<rados_troubleshoting_troubleshooting_mon_using_admin_socket>`. + +.. _rados_troubleshoting_troubleshooting_mon_using_admin_socket: + +Using the monitor's admin socket +================================ + +A monitor's admin socket allows you to interact directly with a specific daemon +by using a Unix socket file. This file is found in the monitor's ``run`` +directory. The admin socket's default directory is +``/var/run/ceph/ceph-mon.ID.asok``, but this can be overridden and the admin +socket might be elsewhere, especially if your cluster's daemons are deployed in +containers. If you cannot find it, either check your ``ceph.conf`` for an +alternative path or run the following command: + +.. prompt:: bash $ + + ceph-conf --name mon.ID --show-config-value admin_socket + +The admin socket is available for use only when the monitor daemon is running. +Whenever the monitor has been properly shut down, the admin socket is removed. +However, if the monitor is not running and the admin socket persists, it is +likely that the monitor has been improperly shut down. In any case, if the +monitor is not running, it will be impossible to use the admin socket, and the +``ceph`` command is likely to return ``Error 111: Connection Refused``. + +To access the admin socket, run a ``ceph tell`` command of the following form +(specifying the daemon that you are interested in): + +.. prompt:: bash $ + + ceph tell mon.<id> mon_status + +This command passes a ``help`` command to the specific running monitor daemon +``<id>`` via its admin socket. If you know the full path to the admin socket +file, this can be done more directly by running the following command: + +.. prompt:: bash $ + + ceph --admin-daemon <full_path_to_asok_file> <command> + +Running ``ceph help`` shows all supported commands that are available through +the admin socket. See especially ``config get``, ``config show``, ``mon stat``, +and ``quorum_status``. + +.. 
_rados_troubleshoting_troubleshooting_mon_understanding_mon_status: + +Understanding mon_status +======================== + +The status of the monitor (as reported by the ``ceph tell mon.X mon_status`` +command) can always be obtained via the admin socket. This command outputs a +great deal of information about the monitor (including the information found in +the output of the ``quorum_status`` command). + +To understand this command's output, let us consider the following example, in +which we see the output of ``ceph tell mon.c mon_status``:: + + { "name": "c", + "rank": 2, + "state": "peon", + "election_epoch": 38, + "quorum": [ + 1, + 2], + "outside_quorum": [], + "extra_probe_peers": [], + "sync_provider": [], + "monmap": { "epoch": 3, + "fsid": "5c4e9d53-e2e1-478a-8061-f543f8be4cf8", + "modified": "2013-10-30 04:12:01.945629", + "created": "2013-10-29 14:14:41.914786", + "mons": [ + { "rank": 0, + "name": "a", + "addr": "127.0.0.1:6789\/0"}, + { "rank": 1, + "name": "b", + "addr": "127.0.0.1:6790\/0"}, + { "rank": 2, + "name": "c", + "addr": "127.0.0.1:6795\/0"}]}} + +It is clear that there are three monitors in the monmap (*a*, *b*, and *c*), +the quorum is formed by only two monitors, and *c* is in the quorum as a +*peon*. + +**Which monitor is out of the quorum?** + + The answer is **a** (that is, ``mon.a``). + +**Why?** + + When the ``quorum`` set is examined, there are clearly two monitors in the + set: *1* and *2*. But these are not monitor names. They are monitor ranks, as + established in the current ``monmap``. The ``quorum`` set does not include + the monitor that has rank 0, and according to the ``monmap`` that monitor is + ``mon.a``. + +**How are monitor ranks determined?** + + Monitor ranks are calculated (or recalculated) whenever monitors are added or + removed. The calculation of ranks follows a simple rule: the **greater** the + ``IP:PORT`` combination, the **lower** the rank. In this case, because + ``127.0.0.1:6789`` is lower than the other two ``IP:PORT`` combinations, + ``mon.a`` has the highest rank: namely, rank 0. + + +Most Common Monitor Issues +=========================== + +The Cluster Has Quorum but at Least One Monitor is Down +------------------------------------------------------- + +When the cluster has quorum but at least one monitor is down, ``ceph health +detail`` returns a message similar to the following:: + + $ ceph health detail + [snip] + mon.a (rank 0) addr 127.0.0.1:6789/0 is down (out of quorum) + +**How do I troubleshoot a Ceph cluster that has quorum but also has at least one monitor down?** + + #. Make sure that ``mon.a`` is running. + + #. Make sure that you can connect to ``mon.a``'s node from the + other Monitor nodes. Check the TCP ports as well. Check ``iptables`` and + ``nf_conntrack`` on all nodes and make sure that you are not + dropping/rejecting connections. + + If this initial troubleshooting doesn't solve your problem, then further + investigation is necessary. + + First, check the problematic monitor's ``mon_status`` via the admin + socket as explained in `Using the monitor's admin socket`_ and + `Understanding mon_status`_. + + If the Monitor is out of the quorum, then its state will be one of the + following: ``probing``, ``electing`` or ``synchronizing``. If the state of + the Monitor is ``leader`` or ``peon``, then the Monitor believes itself to be + in quorum but the rest of the cluster believes that it is not in quorum. 
It + is possible that a Monitor that is in one of the ``probing``, ``electing``, + or ``synchronizing`` states has entered the quorum during the process of + troubleshooting. Check ``ceph status`` again to determine whether the Monitor + has entered quorum during your troubleshooting. If the Monitor remains out of + the quorum, then proceed with the investigations described in this section of + the documentation. + + +**What does it mean when a Monitor's state is ``probing``?** + + If ``ceph health detail`` shows that a Monitor's state is + ``probing``, then the Monitor is still looking for the other Monitors. Every + Monitor remains in this state for some time when it is started. When a + Monitor has connected to the other Monitors specified in the ``monmap``, it + ceases to be in the ``probing`` state. The amount of time that a Monitor is + in the ``probing`` state depends upon the parameters of the cluster of which + it is a part. For example, when a Monitor is a part of a single-monitor + cluster (never do this in production), the monitor passes through the probing + state almost instantaneously. In a multi-monitor cluster, the Monitors stay + in the ``probing`` state until they find enough monitors to form a quorum + |---| this means that if two out of three Monitors in the cluster are + ``down``, the one remaining Monitor stays in the ``probing`` state + indefinitely until you bring one of the other monitors up. + + If quorum has been established, then the Monitor daemon should be able to + find the other Monitors quickly, as long as they can be reached. If a Monitor + is stuck in the ``probing`` state and you have exhausted the procedures above + that describe the troubleshooting of communications between the Monitors, + then it is possible that the problem Monitor is trying to reach the other + Monitors at a wrong address. ``mon_status`` outputs the ``monmap`` that is + known to the monitor: determine whether the other Monitors' locations as + specified in the ``monmap`` match the locations of the Monitors in the + network. If they do not, see `Recovering a Monitor's Broken monmap`_. + If the locations of the Monitors as specified in the ``monmap`` match the + locations of the Monitors in the network, then the persistent + ``probing`` state could be related to severe clock skews amongst the monitor + nodes. See `Clock Skews`_. If the information in `Clock Skews`_ does not + bring the Monitor out of the ``probing`` state, then prepare your system logs + and ask the Ceph community for help. See `Preparing your logs`_ for + information about the proper preparation of logs. + + +**What does it mean when a Monitor's state is ``electing``?** + + If ``ceph health detail`` shows that a Monitor's state is ``electing``, the + monitor is in the middle of an election. Elections typically complete + quickly, but sometimes the monitors can get stuck in what is known as an + *election storm*. See :ref:`Monitor Elections <dev_mon_elections>` for more + on monitor elections. + + The presence of election storm might indicate clock skew among the monitor + nodes. See `Clock Skews`_ for more information. + + If your clocks are properly synchronized, search the mailing lists and bug + tracker for issues similar to your issue. The ``electing`` state is not + likely to persist. In versions of Ceph after the release of Cuttlefish, there + is no obvious reason other than clock skew that explains why an ``electing`` + state would persist. 
+ + It is possible to investigate the cause of a persistent ``electing`` state if + you put the problematic Monitor into a ``down`` state while you investigate. + This is possible only if there are enough surviving Monitors to form quorum. + +**What does it mean when a Monitor's state is ``synchronizing``?** + + If ``ceph health detail`` shows that the Monitor is ``synchronizing``, the + monitor is catching up with the rest of the cluster so that it can join the + quorum. The amount of time that it takes for the Monitor to synchronize with + the rest of the quorum is a function of the size of the cluster's monitor + store, the cluster's size, and the state of the cluster. Larger and degraded + clusters generally keep Monitors in the ``synchronizing`` state longer than + do smaller, new clusters. + + A Monitor that changes its state from ``synchronizing`` to ``electing`` and + then back to ``synchronizing`` indicates a problem: the cluster state may be + advancing (that is, generating new maps) too fast for the synchronization + process to keep up with the pace of the creation of the new maps. This issue + presented more frequently prior to the Cuttlefish release than it does in + more recent releases, because the synchronization process has since been + refactored and enhanced to avoid this dynamic. If you experience this in + later versions, report the issue in the `Ceph bug tracker + <https://tracker.ceph.com>`_. Prepare and provide logs to substantiate any + bug you raise. See `Preparing your logs`_ for information about the proper + preparation of logs. + +**What does it mean when a Monitor's state is ``leader`` or ``peon``?** + + If ``ceph health detail`` shows that the Monitor is in the ``leader`` state + or in the ``peon`` state, it is likely that clock skew is present. Follow the + instructions in `Clock Skews`_. If you have followed those instructions and + ``ceph health detail`` still shows that the Monitor is in the ``leader`` + state or the ``peon`` state, report the issue in the `Ceph bug tracker + <https://tracker.ceph.com>`_. If you raise an issue, provide logs to + substantiate it. See `Preparing your logs`_ for information about the + proper preparation of logs. + + +Recovering a Monitor's Broken ``monmap`` +---------------------------------------- + +This is how a ``monmap`` usually looks, depending on the number of +monitors:: + + + epoch 3 + fsid 5c4e9d53-e2e1-478a-8061-f543f8be4cf8 + last_changed 2013-10-30 04:12:01.945629 + created 2013-10-29 14:14:41.914786 + 0: 127.0.0.1:6789/0 mon.a + 1: 127.0.0.1:6790/0 mon.b + 2: 127.0.0.1:6795/0 mon.c + +This may not be what you have however. For instance, in some versions of +early Cuttlefish there was a bug that could cause your ``monmap`` +to be nullified. Completely filled with zeros. This means that not even +``monmaptool`` would be able to make sense of cold, hard, inscrutable zeros. +It's also possible to end up with a monitor with a severely outdated monmap, +notably if the node has been down for months while you fight with your vendor's +TAC. The subject ``ceph-mon`` daemon might be unable to find the surviving +monitors (e.g., say ``mon.c`` is down; you add a new monitor ``mon.d``, +then remove ``mon.a``, then add a new monitor ``mon.e`` and remove +``mon.b``; you will end up with a totally different monmap from the one +``mon.c`` knows). 
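Before deciding how to proceed, it can help to confirm what the problem
monitor actually has on disk. The following is only an illustrative check (the
monitor ID ``c`` and the temporary path are examples): with the monitor
stopped, extract its on-disk ``monmap`` and print it with ``monmaptool``::

    $ ceph-mon -i c --extract-monmap /tmp/monmap.c
    $ monmaptool --print /tmp/monmap.c

If the epoch or the monitor addresses shown differ from what the surviving
monitors report, the ``monmap`` is indeed stale or corrupted.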
+ +In this situation you have two possible solutions: + +Scrap the monitor and redeploy + + You should only take this route if you are positive that you won't + lose the information kept by that monitor; that you have other monitors + and that they are running just fine so that your new monitor is able + to synchronize from the remaining monitors. Keep in mind that destroying + a monitor, if there are no other copies of its contents, may lead to + loss of data. + +Inject a monmap into the monitor + + These are the basic steps: + + Retrieve the ``monmap`` from the surviving monitors and inject it into the + monitor whose ``monmap`` is corrupted or lost. + + Implement this solution by carrying out the following procedure: + + 1. Is there a quorum of monitors? If so, retrieve the ``monmap`` from the + quorum:: + + $ ceph mon getmap -o /tmp/monmap + + 2. If there is no quorum, then retrieve the ``monmap`` directly from another + monitor that has been stopped (in this example, the other monitor has + the ID ``ID-FOO``):: + + $ ceph-mon -i ID-FOO --extract-monmap /tmp/monmap + + 3. Stop the monitor you are going to inject the monmap into. + + 4. Inject the monmap:: + + $ ceph-mon -i ID --inject-monmap /tmp/monmap + + 5. Start the monitor + + .. warning:: Injecting ``monmaps`` can cause serious problems because doing + so will overwrite the latest existing ``monmap`` stored on the monitor. Be + careful! + +Clock Skews +----------- + +The Paxos consensus algorithm requires close time synchroniziation, which means +that clock skew among the monitors in the quorum can have a serious effect on +monitor operation. The resulting behavior can be puzzling. To avoid this issue, +run a clock synchronization tool on your monitor nodes: for example, use +``Chrony`` or the legacy ``ntpd`` utility. Configure each monitor nodes so that +the `iburst` option is in effect and so that each monitor has multiple peers, +including the following: + +* Each other +* Internal ``NTP`` servers +* Multiple external, public pool servers + +.. note:: The ``iburst`` option sends a burst of eight packets instead of the + usual single packet, and is used during the process of getting two peers + into initial synchronization. + +Furthermore, it is advisable to synchronize *all* nodes in your cluster against +internal and external servers, and perhaps even against your monitors. Run +``NTP`` servers on bare metal: VM-virtualized clocks are not suitable for +steady timekeeping. See `https://www.ntp.org <https://www.ntp.org>`_ for more +information about the Network Time Protocol (NTP). Your organization might +already have quality internal ``NTP`` servers available. Sources for ``NTP`` +server appliances include the following: + +* Microsemi (formerly Symmetricom) `https://microsemi.com <https://www.microsemi.com/product-directory/3425-timing-synchronization>`_ +* EndRun `https://endruntechnologies.com <https://endruntechnologies.com/products/ntp-time-servers>`_ +* Netburner `https://www.netburner.com <https://www.netburner.com/products/network-time-server/pk70-ex-ntp-network-time-server>`_ + +Clock Skew Questions and Answers +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +**What's the maximum tolerated clock skew?** + + By default, monitors allow clocks to drift up to a maximum of 0.05 seconds + (50 milliseconds). + +**Can I increase the maximum tolerated clock skew?** + + Yes, but we strongly recommend against doing so. 
The maximum tolerated clock + skew is configurable via the ``mon-clock-drift-allowed`` option, but it is + almost certainly a bad idea to make changes to this option. The clock skew + maximum is in place because clock-skewed monitors cannot be relied upon. The + current default value has proven its worth at alerting the user before the + monitors encounter serious problems. Changing this value might cause + unforeseen effects on the stability of the monitors and overall cluster + health. + +**How do I know whether there is a clock skew?** + + The monitors will warn you via the cluster status ``HEALTH_WARN``. When clock + skew is present, the ``ceph health detail`` and ``ceph status`` commands + return an output resembling the following:: + + mon.c addr 10.10.0.1:6789/0 clock skew 0.08235s > max 0.05s (latency 0.0045s) + + In this example, the monitor ``mon.c`` has been flagged as suffering from + clock skew. + + In Luminous and later releases, it is possible to check for a clock skew by + running the ``ceph time-sync-status`` command. Note that the lead monitor + typically has the numerically lowest IP address. It will always show ``0``: + the reported offsets of other monitors are relative to the lead monitor, not + to any external reference source. + +**What should I do if there is a clock skew?** + + Synchronize your clocks. Using an NTP client might help. However, if you + are already using an NTP client and you still encounter clock skew problems, + determine whether the NTP server that you are using is remote to your network + or instead hosted on your network. Hosting your own NTP servers tends to + mitigate clock skew problems. + + +Client Can't Connect or Mount +----------------------------- + +Check your IP tables. Some operating-system install utilities add a ``REJECT`` +rule to ``iptables``. ``iptables`` rules will reject all clients other than +``ssh`` that try to connect to the host. If your monitor host's IP tables have +a ``REJECT`` rule in place, clients that are connecting from a separate node +will fail and will raise a timeout error. Any ``iptables`` rules that reject +clients trying to connect to Ceph daemons must be addressed. For example:: + + REJECT all -- anywhere anywhere reject-with icmp-host-prohibited + +It might also be necessary to add rules to iptables on your Ceph hosts to +ensure that clients are able to access the TCP ports associated with your Ceph +monitors (default: port 6789) and Ceph OSDs (default: 6800 through 7300). For +example:: + + iptables -A INPUT -m multiport -p tcp -s {ip-address}/{netmask} --dports 6789,6800:7300 -j ACCEPT + + +Monitor Store Failures +====================== + +Symptoms of store corruption +---------------------------- + +Ceph monitors store the :term:`Cluster Map` in a key-value store. If key-value +store corruption causes a monitor to fail, then the monitor log might contain +one of the following error messages:: + + Corruption: error in middle of record + +or:: + + Corruption: 1 missing files; e.g.: /var/lib/ceph/mon/mon.foo/store.db/1234567.ldb + +Recovery using healthy monitor(s) +--------------------------------- + +If there are surviving monitors, we can always :ref:`replace +<adding-and-removing-monitors>` the corrupted monitor with a new one. After the +new monitor boots, it will synchronize with a healthy peer. After the new +monitor is fully synchronized, it will be able to serve clients. + +.. 
_mon-store-recovery-using-osds: + +Recovery using OSDs +------------------- + +Even if all monitors fail at the same time, it is possible to recover the +monitor store by using information stored in OSDs. You are encouraged to deploy +at least three (and preferably five) monitors in a Ceph cluster. In such a +deployment, complete monitor failure is unlikely. However, unplanned power loss +in a data center whose disk settings or filesystem settings are improperly +configured could cause the underlying filesystem to fail and this could kill +all of the monitors. In such a case, data in the OSDs can be used to recover +the monitors. The following is such a script and can be used to recover the +monitors: + + +.. code-block:: bash + + ms=/root/mon-store + mkdir $ms + + # collect the cluster map from stopped OSDs + for host in $hosts; do + rsync -avz $ms/. user@$host:$ms.remote + rm -rf $ms + ssh user@$host <<EOF + for osd in /var/lib/ceph/osd/ceph-*; do + ceph-objectstore-tool --data-path \$osd --no-mon-config --op update-mon-db --mon-store-path $ms.remote + done + EOF + rsync -avz user@$host:$ms.remote/. $ms + done + + # rebuild the monitor store from the collected map, if the cluster does not + # use cephx authentication, we can skip the following steps to update the + # keyring with the caps, and there is no need to pass the "--keyring" option. + # i.e. just use "ceph-monstore-tool $ms rebuild" instead + ceph-authtool /path/to/admin.keyring -n mon. \ + --cap mon 'allow *' + ceph-authtool /path/to/admin.keyring -n client.admin \ + --cap mon 'allow *' --cap osd 'allow *' --cap mds 'allow *' + # add one or more ceph-mgr's key to the keyring. in this case, an encoded key + # for mgr.x is added, you can find the encoded key in + # /etc/ceph/${cluster}.${mgr_name}.keyring on the machine where ceph-mgr is + # deployed + ceph-authtool /path/to/admin.keyring --add-key 'AQDN8kBe9PLWARAAZwxXMr+n85SBYbSlLcZnMA==' -n mgr.x \ + --cap mon 'allow profile mgr' --cap osd 'allow *' --cap mds 'allow *' + # If your monitors' ids are not sorted by ip address, please specify them in order. + # For example. if mon 'a' is 10.0.0.3, mon 'b' is 10.0.0.2, and mon 'c' is 10.0.0.4, + # please passing "--mon-ids b a c". + # In addition, if your monitors' ids are not single characters like 'a', 'b', 'c', please + # specify them in the command line by passing them as arguments of the "--mon-ids" + # option. if you are not sure, please check your ceph.conf to see if there is any + # sections named like '[mon.foo]'. don't pass the "--mon-ids" option, if you are + # using DNS SRV for looking up monitors. + ceph-monstore-tool $ms rebuild -- --keyring /path/to/admin.keyring --mon-ids alpha beta gamma + + # make a backup of the corrupted store.db just in case! repeat for + # all monitors. + mv /var/lib/ceph/mon/mon.foo/store.db /var/lib/ceph/mon/mon.foo/store.db.corrupted + + # move rebuild store.db into place. repeat for all monitors. + mv $ms/store.db /var/lib/ceph/mon/mon.foo/store.db + chown -R ceph:ceph /var/lib/ceph/mon/mon.foo/store.db + +This script performs the following steps: + +#. Collects the map from each OSD host. +#. Rebuilds the store. +#. Fills the entities in the keyring file with appropriate capabilities. +#. Replaces the corrupted store on ``mon.foo`` with the recovered copy. 
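Note that the script assumes a ``$hosts`` variable naming the nodes that carry
OSDs; it is not defined in the script itself. A minimal, illustrative preamble
and an optional sanity check of the rebuilt store (host names and paths are
examples, not requirements) might look like this:

.. code-block:: bash

   # Hosts that carry OSDs; adjust to your environment.
   hosts="ceph-osd1 ceph-osd2 ceph-osd3"

   # After "ceph-monstore-tool $ms rebuild ..." completes, optionally inspect
   # the monmap contained in the rebuilt store before moving store.db into
   # place. $ms is the store path used by the script above (/root/mon-store).
   ceph-monstore-tool $ms get monmap -- --out /tmp/monmap.rebuilt
   monmaptool --print /tmp/monmap.rebuilt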
+ + +Known limitations +~~~~~~~~~~~~~~~~~ + +The above recovery tool is unable to recover the following information: + +- **Certain added keyrings**: All of the OSD keyrings added using the ``ceph + auth add`` command are recovered from the OSD's copy, and the + ``client.admin`` keyring is imported using ``ceph-monstore-tool``. However, + the MDS keyrings and all other keyrings will be missing in the recovered + monitor store. You might need to manually re-add them. + +- **Creating pools**: If any RADOS pools were in the process of being created, + that state is lost. The recovery tool operates on the assumption that all + pools have already been created. If there are PGs that are stuck in the + 'unknown' state after the recovery for a partially created pool, you can + force creation of the *empty* PG by running the ``ceph osd force-create-pg`` + command. Note that this will create an *empty* PG, so take this action only + if you know the pool is empty. + +- **MDS Maps**: The MDS maps are lost. + + +Everything Failed! Now What? +============================ + +Reaching out for help +--------------------- + +You can find help on IRC in #ceph and #ceph-devel on OFTC (server +irc.oftc.net), or at ``dev@ceph.io`` and ``ceph-users@lists.ceph.com``. Make +sure that you have prepared your logs and that you have them ready upon +request. + +See https://ceph.io/en/community/connect/ for current (as of October 2023) +information on getting in contact with the upstream Ceph community. + + +Preparing your logs +------------------- + +The default location for monitor logs is ``/var/log/ceph/ceph-mon.FOO.log*``. +However, if they are not there, you can find their current location by running +the following command: + +.. prompt:: bash + + ceph-conf --name mon.FOO --show-config-value log_file + +The amount of information in the logs is determined by the debug levels in the +cluster's configuration files. If Ceph is using the default debug levels, then +your logs might be missing important information that would help the upstream +Ceph community address your issue. + +To make sure your monitor logs contain relevant information, you can raise +debug levels. Here we are interested in information from the monitors. As with +other components, the monitors have different parts that output their debug +information on different subsystems. + +If you are an experienced Ceph troubleshooter, we recommend raising the debug +levels of the most relevant subsystems. Of course, this approach might not be +easy for beginners. In most cases, however, enough information to address the +issue will be secured if the following debug levels are entered:: + + debug_mon = 10 + debug_ms = 1 + +Sometimes these debug levels do not yield enough information. In such cases, +members of the upstream Ceph community might ask you to make additional changes +to these or to other debug levels. In any case, it is better for us to receive +at least some useful information than to receive an empty log. + + +Do I need to restart a monitor to adjust debug levels? +------------------------------------------------------ + +No, restarting a monitor is not necessary. 
Debug levels may be adjusted by +using two different methods, depending on whether or not there is a quorum: + +There is a quorum + + Either inject the debug option into the specific monitor that needs to + be debugged:: + + ceph tell mon.FOO config set debug_mon 10/10 + + Or inject it into all monitors at once:: + + ceph tell mon.* config set debug_mon 10/10 + + +There is no quorum + + Use the admin socket of the specific monitor that needs to be debugged + and directly adjust the monitor's configuration options:: + + ceph daemon mon.FOO config set debug_mon 10/10 + + +To return the debug levels to their default values, run the above commands +using the debug level ``1/10`` rather than ``10/10``. To check a monitor's +current values, use the admin socket and run either of the following commands: + + .. prompt:: bash + + ceph daemon mon.FOO config show + +or: + + .. prompt:: bash + + ceph daemon mon.FOO config get 'OPTION_NAME' + + + +I Reproduced the problem with appropriate debug levels. Now what? +----------------------------------------------------------------- + +We prefer that you send us only the portions of your logs that are relevant to +your monitor problems. Of course, it might not be easy for you to determine +which portions are relevant so we are willing to accept complete and +unabridged logs. However, we request that you avoid sending logs containing +hundreds of thousands of lines with no additional clarifying information. One +common-sense way of making our task easier is to write down the current time +and date when you are reproducing the problem and then extract portions of your +logs based on that information. + +Finally, reach out to us on the mailing lists or IRC or Slack, or by filing a +new issue on the `tracker`_. + +.. _tracker: http://tracker.ceph.com/projects/ceph/issues/new + +.. |---| unicode:: U+2014 .. EM DASH + :trim: diff --git a/doc/rados/troubleshooting/troubleshooting-osd.rst b/doc/rados/troubleshooting/troubleshooting-osd.rst new file mode 100644 index 000000000..035947d7e --- /dev/null +++ b/doc/rados/troubleshooting/troubleshooting-osd.rst @@ -0,0 +1,787 @@ +====================== + Troubleshooting OSDs +====================== + +Before troubleshooting the cluster's OSDs, check the monitors +and the network. + +First, determine whether the monitors have a quorum. Run the ``ceph health`` +command or the ``ceph -s`` command and if Ceph shows ``HEALTH_OK`` then there +is a monitor quorum. + +If the monitors don't have a quorum or if there are errors with the monitor +status, address the monitor issues before proceeding by consulting the material +in `Troubleshooting Monitors <../troubleshooting-mon>`_. + +Next, check your networks to make sure that they are running properly. Networks +can have a significant impact on OSD operation and performance. Look for +dropped packets on the host side and CRC errors on the switch side. + + +Obtaining Data About OSDs +========================= + +When troubleshooting OSDs, it is useful to collect different kinds of +information about the OSDs. Some information comes from the practice of +`monitoring OSDs`_ (for example, by running the ``ceph osd tree`` command). +Additional information concerns the topology of your cluster, and is discussed +in the following sections. + + +Ceph Logs +--------- + +Ceph log files are stored under ``/var/log/ceph``. 
Unless the path has been +changed (or you are in a containerized environment that stores logs in a +different location), the log files can be listed by running the following +command: + +.. prompt:: bash + + ls /var/log/ceph + +If there is not enough log detail, change the logging level. To ensure that +Ceph performs adequately under high logging volume, see `Logging and +Debugging`_. + + + +Admin Socket +------------ + +Use the admin socket tool to retrieve runtime information. First, list the +sockets of Ceph's daemons by running the following command: + +.. prompt:: bash + + ls /var/run/ceph + +Next, run a command of the following form (replacing ``{daemon-name}`` with the +name of a specific daemon: for example, ``osd.0``): + +.. prompt:: bash + + ceph daemon {daemon-name} help + +Alternatively, run the command with a ``{socket-file}`` specified (a "socket +file" is a specific file in ``/var/run/ceph``): + +.. prompt:: bash + + ceph daemon {socket-file} help + +The admin socket makes many tasks possible, including: + +- Listing Ceph configuration at runtime +- Dumping historic operations +- Dumping the operation priority queue state +- Dumping operations in flight +- Dumping perfcounters + +Display Free Space +------------------ + +Filesystem issues may arise. To display your filesystems' free space, run the +following command: + +.. prompt:: bash + + df -h + +To see this command's supported syntax and options, run ``df --help``. + +I/O Statistics +-------------- + +The `iostat`_ tool can be used to identify I/O-related issues. Run the +following command: + +.. prompt:: bash + + iostat -x + + +Diagnostic Messages +------------------- + +To retrieve diagnostic messages from the kernel, run the ``dmesg`` command and +specify the output with ``less``, ``more``, ``grep``, or ``tail``. For +example: + +.. prompt:: bash + + dmesg | grep scsi + +Stopping without Rebalancing +============================ + +It might be occasionally necessary to perform maintenance on a subset of your +cluster or to resolve a problem that affects a failure domain (for example, a +rack). However, when you stop OSDs for maintenance, you might want to prevent +CRUSH from automatically rebalancing the cluster. To avert this rebalancing +behavior, set the cluster to ``noout`` by running the following command: + +.. prompt:: bash + + ceph osd set noout + +.. warning:: This is more a thought exercise offered for the purpose of giving + the reader a sense of failure domains and CRUSH behavior than a suggestion + that anyone in the post-Luminous world run ``ceph osd set noout``. When the + OSDs return to an ``up`` state, rebalancing will resume and the change + introduced by the ``ceph osd set noout`` command will be reverted. + +In Luminous and later releases, however, it is a safer approach to flag only +affected OSDs. To add or remove a ``noout`` flag to a specific OSD, run a +command like the following: + +.. prompt:: bash + + ceph osd add-noout osd.0 + ceph osd rm-noout osd.0 + +It is also possible to flag an entire CRUSH bucket. For example, if you plan to +take down ``prod-ceph-data1701`` in order to add RAM, you might run the +following command: + +.. prompt:: bash + + ceph osd set-group noout prod-ceph-data1701 + +After the flag is set, stop the OSDs and any other colocated +Ceph services within the failure domain that requires maintenance work:: + + systemctl stop ceph\*.service ceph\*.target + +.. note:: When an OSD is stopped, any placement groups within the OSD are + marked as ``degraded``. 
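You can confirm that the flag is in effect at any point. This is an optional,
illustrative check rather than a required step:

.. prompt:: bash

   ceph osd dump | grep flags
   ceph health detail

The osdmap ``flags`` line lists ``noout`` when the cluster-wide flag is set,
and ``ceph health detail`` typically reports flags that have been applied to
individual OSDs or CRUSH buckets.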
+ +After the maintenance is complete, it will be necessary to restart the OSDs +and any other daemons that have stopped. However, if the host was rebooted as +part of the maintenance, they do not need to be restarted and will come back up +automatically. To restart OSDs or other daemons, use a command of the following +form: + +.. prompt:: bash + + sudo systemctl start ceph.target + +Finally, unset the ``noout`` flag as needed by running commands like the +following: + +.. prompt:: bash + + ceph osd unset noout + ceph osd unset-group noout prod-ceph-data1701 + +Many contemporary Linux distributions employ ``systemd`` for service +management. However, for certain operating systems (especially older ones) it +might be necessary to issue equivalent ``service`` or ``start``/``stop`` +commands. + + +.. _osd-not-running: + +OSD Not Running +=============== + +Under normal conditions, restarting a ``ceph-osd`` daemon will allow it to +rejoin the cluster and recover. + + +An OSD Won't Start +------------------ + +If the cluster has started but an OSD isn't starting, check the following: + +- **Configuration File:** If you were not able to get OSDs running from a new + installation, check your configuration file to ensure it conforms to the + standard (for example, make sure that it says ``host`` and not ``hostname``, + etc.). + +- **Check Paths:** Ensure that the paths specified in the configuration + correspond to the paths for data and metadata that actually exist (for + example, the paths to the journals, the WAL, and the DB). Separate the OSD + data from the metadata in order to see whether there are errors in the + configuration file and in the actual mounts. If so, these errors might + explain why OSDs are not starting. To store the metadata on a separate block + device, partition or LVM the drive and assign one partition per OSD. + +- **Check Max Threadcount:** If the cluster has a node with an especially high + number of OSDs, it might be hitting the default maximum number of threads + (usually 32,000). This is especially likely to happen during recovery. + Increasing the maximum number of threads to the maximum possible number of + threads allowed (4194303) might help with the problem. To increase the number + of threads to the maximum, run the following command: + + .. prompt:: bash + + sysctl -w kernel.pid_max=4194303 + + If this increase resolves the issue, you must make the increase permanent by + including a ``kernel.pid_max`` setting either in a file under + ``/etc/sysctl.d`` or within the master ``/etc/sysctl.conf`` file. For + example:: + + kernel.pid_max = 4194303 + +- **Check ``nf_conntrack``:** This connection-tracking and connection-limiting + system causes problems for many production Ceph clusters. The problems often + emerge slowly and subtly. As cluster topology and client workload grow, + mysterious and intermittent connection failures and performance glitches + occur more and more, especially at certain times of the day. To begin taking + the measure of your problem, check the ``syslog`` history for "table full" + events. One way to address this kind of problem is as follows: First, use the + ``sysctl`` utility to assign ``nf_conntrack_max`` a much higher value. Next, + raise the value of ``nf_conntrack_buckets`` so that ``nf_conntrack_buckets`` + × 8 = ``nf_conntrack_max``; this action might require running commands + outside of ``sysctl`` (for example, ``"echo 131072 > + /sys/module/nf_conntrack/parameters/hashsize``). 
Another way to address the + problem is to blacklist the associated kernel modules in order to disable + processing altogether. This approach is powerful, but fragile. The modules + and the order in which the modules must be listed can vary among kernel + versions. Even when blacklisted, ``iptables`` and ``docker`` might sometimes + activate connection tracking anyway, so we advise a "set and forget" strategy + for the tunables. On modern systems, this approach will not consume + appreciable resources. + +- **Kernel Version:** Identify the kernel version and distribution that are in + use. By default, Ceph uses third-party tools that might be buggy or come into + conflict with certain distributions or kernel versions (for example, Google's + ``gperftools`` and ``TCMalloc``). Check the `OS recommendations`_ and the + release notes for each Ceph version in order to make sure that you have + addressed any issues related to your kernel. + +- **Segment Fault:** If there is a segment fault, increase log levels and + restart the problematic daemon(s). If segment faults recur, search the Ceph + bug tracker `https://tracker.ceph/com/projects/ceph + <https://tracker.ceph.com/projects/ceph/>`_ and the ``dev`` and + ``ceph-users`` mailing list archives `https://ceph.io/resources + <https://ceph.io/resources>`_ to see if others have experienced and reported + these issues. If this truly is a new and unique failure, post to the ``dev`` + email list and provide the following information: the specific Ceph release + being run, ``ceph.conf`` (with secrets XXX'd out), your monitor status + output, and excerpts from your log file(s). + + +An OSD Failed +------------- + +When an OSD fails, this means that a ``ceph-osd`` process is unresponsive or +has died and that the corresponding OSD has been marked ``down``. Surviving +``ceph-osd`` daemons will report to the monitors that the OSD appears to be +down, and a new status will be visible in the output of the ``ceph health`` +command, as in the following example: + +.. prompt:: bash + + ceph health + +:: + + HEALTH_WARN 1/3 in osds are down + +This health alert is raised whenever there are one or more OSDs marked ``in`` +and ``down``. To see which OSDs are ``down``, add ``detail`` to the command as in +the following example: + +.. prompt:: bash + + ceph health detail + +:: + + HEALTH_WARN 1/3 in osds are down + osd.0 is down since epoch 23, last address 192.168.106.220:6800/11080 + +Alternatively, run the following command: + +.. prompt:: bash + + ceph osd tree down + +If there is a drive failure or another fault that is preventing a given +``ceph-osd`` daemon from functioning or restarting, then there should be an +error message present in its log file under ``/var/log/ceph``. + +If the ``ceph-osd`` daemon stopped because of a heartbeat failure or a +``suicide timeout`` error, then the underlying drive or filesystem might be +unresponsive. Check ``dmesg`` output and `syslog` output for drive errors or +kernel errors. It might be necessary to specify certain flags (for example, +``dmesg -T`` to see human-readable timestamps) in order to avoid mistaking old +errors for new errors. + +If an entire host's OSDs are ``down``, check to see if there is a network +error or a hardware issue with the host. 
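When triaging a host whose OSDs all appear ``down``, a quick first pass might
look like the following sketch (the host name ``ceph-osd1``, the OSD id ``0``,
and the systemd unit name are placeholders; containerized deployments use
different unit names):

.. prompt:: bash

   ceph osd tree down
   ping -c 3 ceph-osd1
   ssh ceph-osd1 systemctl status ceph-osd@0 --no-pager
   ssh ceph-osd1 'dmesg -T | tail -n 50'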
+ +If the OSD problem is the result of a software error (for example, a failed +assertion or another unexpected error), search for reports of the issue in the +`bug tracker <https://tracker.ceph/com/projects/ceph>`_ , the `dev mailing list +archives <https://lists.ceph.io/hyperkitty/list/dev@ceph.io/>`_, and the +`ceph-users mailing list archives +<https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/>`_. If there is no +clear fix or existing bug, then :ref:`report the problem to the ceph-devel +email list <Get Involved>`. + + +.. _no-free-drive-space: + +No Free Drive Space +------------------- + +If an OSD is full, Ceph prevents data loss by ensuring that no new data is +written to the OSD. In an properly running cluster, health checks are raised +when the cluster's OSDs and pools approach certain "fullness" ratios. The +``mon_osd_full_ratio`` threshold defaults to ``0.95`` (or 95% of capacity): +this is the point above which clients are prevented from writing data. The +``mon_osd_backfillfull_ratio`` threshold defaults to ``0.90`` (or 90% of +capacity): this is the point above which backfills will not start. The +``mon_osd_nearfull_ratio`` threshold defaults to ``0.85`` (or 85% of capacity): +this is the point at which it raises the ``OSD_NEARFULL`` health check. + +OSDs within a cluster will vary in how much data is allocated to them by Ceph. +To check "fullness" by displaying data utilization for every OSD, run the +following command: + +.. prompt:: bash + + ceph osd df + +To check "fullness" by displaying a cluster’s overall data usage and data +distribution among pools, run the following command: + +.. prompt:: bash + + ceph df + +When examining the output of the ``ceph df`` command, pay special attention to +the **most full** OSDs, as opposed to the percentage of raw space used. If a +single outlier OSD becomes full, all writes to this OSD's pool might fail as a +result. When ``ceph df`` reports the space available to a pool, it considers +the ratio settings relative to the *most full* OSD that is part of the pool. To +flatten the distribution, two approaches are available: (1) Using the +``reweight-by-utilization`` command to progressively move data from excessively +full OSDs or move data to insufficiently full OSDs, and (2) in later revisions +of Luminous and subsequent releases, exploiting the ``ceph-mgr`` ``balancer`` +module to perform the same task automatically. + +To adjust the "fullness" ratios, run a command or commands of the following +form: + +.. prompt:: bash + + ceph osd set-nearfull-ratio <float[0.0-1.0]> + ceph osd set-full-ratio <float[0.0-1.0]> + ceph osd set-backfillfull-ratio <float[0.0-1.0]> + +Sometimes full cluster issues arise because an OSD has failed. This can happen +either because of a test or because the cluster is small, very full, or +unbalanced. When an OSD or node holds an excessive percentage of the cluster's +data, component failures or natural growth can result in the ``nearfull`` and +``full`` ratios being exceeded. When testing Ceph's resilience to OSD failures +on a small cluster, it is advised to leave ample free disk space and to +consider temporarily lowering the OSD ``full ratio``, OSD ``backfillfull +ratio``, and OSD ``nearfull ratio``. + +The "fullness" status of OSDs is visible in the output of the ``ceph health`` +command, as in the following example: + +.. prompt:: bash + + ceph health + +:: + + HEALTH_WARN 1 nearfull osd(s) + +For details, add the ``detail`` command as in the following example: + +.. 
prompt:: bash + + ceph health detail + +:: + + HEALTH_ERR 1 full osd(s); 1 backfillfull osd(s); 1 nearfull osd(s) + osd.3 is full at 97% + osd.4 is backfill full at 91% + osd.2 is near full at 87% + +To address full cluster issues, it is recommended to add capacity by adding +OSDs. Adding new OSDs allows the cluster to redistribute data to newly +available storage. Search for ``rados bench`` orphans that are wasting space. + +If a legacy Filestore OSD cannot be started because it is full, it is possible +to reclaim space by deleting a small number of placement group directories in +the full OSD. + +.. important:: If you choose to delete a placement group directory on a full + OSD, **DO NOT** delete the same placement group directory on another full + OSD. **OTHERWISE YOU WILL LOSE DATA**. You **MUST** maintain at least one + copy of your data on at least one OSD. Deleting placement group directories + is a rare and extreme intervention. It is not to be undertaken lightly. + +See `Monitor Config Reference`_ for more information. + + +OSDs are Slow/Unresponsive +========================== + +OSDs are sometimes slow or unresponsive. When troubleshooting this common +problem, it is advised to eliminate other possibilities before investigating +OSD performance issues. For example, be sure to confirm that your network(s) +are working properly, to verify that your OSDs are running, and to check +whether OSDs are throttling recovery traffic. + +.. tip:: In pre-Luminous releases of Ceph, ``up`` and ``in`` OSDs were + sometimes not available or were otherwise slow because recovering OSDs were + consuming system resources. Newer releases provide better recovery handling + by preventing this phenomenon. + + +Networking Issues +----------------- + +As a distributed storage system, Ceph relies upon networks for OSD peering and +replication, recovery from faults, and periodic heartbeats. Networking issues +can cause OSD latency and flapping OSDs. For more information, see `Flapping +OSDs`_. + +To make sure that Ceph processes and Ceph-dependent processes are connected and +listening, run the following commands: + +.. prompt:: bash + + netstat -a | grep ceph + netstat -l | grep ceph + sudo netstat -p | grep ceph + +To check network statistics, run the following command: + +.. prompt:: bash + + netstat -s + +Drive Configuration +------------------- + +An SAS or SATA storage drive should house only one OSD, but a NVMe drive can +easily house two or more. However, it is possible for read and write throughput +to bottleneck if other processes share the drive. Such processes include: +journals / metadata, operating systems, Ceph monitors, ``syslog`` logs, other +OSDs, and non-Ceph processes. + +Because Ceph acknowledges writes *after* journaling, fast SSDs are an +attractive option for accelerating response time -- particularly when using the +``XFS`` or ``ext4`` filesystems for legacy FileStore OSDs. By contrast, the +``Btrfs`` file system can write and journal simultaneously. (However, use of +``Btrfs`` is not recommended for production deployments.) + +.. note:: Partitioning a drive does not change its total throughput or + sequential read/write limits. Throughput might be improved somewhat by + running a journal in a separate partition, but it is better still to run + such a journal in a separate physical drive. + +.. warning:: Reef does not support FileStore. Releases after Reef do not + support FileStore. 
Any information that mentions FileStore is pertinent only + to the Quincy release of Ceph and to releases prior to Quincy. + + +Bad Sectors / Fragmented Disk +----------------------------- + +Check your drives for bad blocks, fragmentation, and other errors that can +cause significantly degraded performance. Tools that are useful in checking for +drive errors include ``dmesg``, ``syslog`` logs, and ``smartctl`` (found in the +``smartmontools`` package). + +.. note:: ``smartmontools`` 7.0 and late provides NVMe stat passthrough and + JSON output. + + +Co-resident Monitors/OSDs +------------------------- + +Although monitors are relatively lightweight processes, performance issues can +result when monitors are run on the same host machine as an OSD. Monitors issue +many ``fsync()`` calls and this can interfere with other workloads. The danger +of performance issues is especially acute when the monitors are co-resident on +the same storage drive as an OSD. In addition, if the monitors are running an +older kernel (pre-3.0) or a kernel with no ``syncfs(2)`` syscall, then multiple +OSDs running on the same host might make so many commits as to undermine each +other's performance. This problem sometimes results in what is called "the +bursty writes". + + +Co-resident Processes +--------------------- + +Significant OSD latency can result from processes that write data to Ceph (for +example, cloud-based solutions and virtual machines) while operating on the +same hardware as OSDs. For this reason, making such processes co-resident with +OSDs is not generally recommended. Instead, the recommended practice is to +optimize certain hosts for use with Ceph and use other hosts for other +processes. This practice of separating Ceph operations from other applications +might help improve performance and might also streamline troubleshooting and +maintenance. + +Running co-resident processes on the same hardware is sometimes called +"convergence". When using Ceph, engage in convergence only with expertise and +after consideration. + + +Logging Levels +-------------- + +Performance issues can result from high logging levels. Operators sometimes +raise logging levels in order to track an issue and then forget to lower them +afterwards. In such a situation, OSDs might consume valuable system resources to +write needlessly verbose logs onto the disk. Anyone who does want to use high logging +levels is advised to consider mounting a drive to the default path for logging +(for example, ``/var/log/ceph/$cluster-$name.log``). + +Recovery Throttling +------------------- + +Depending upon your configuration, Ceph may reduce recovery rates to maintain +client or OSD performance, or it may increase recovery rates to the point that +recovery impacts client or OSD performance. Check to see if the client or OSD +is recovering. + + +Kernel Version +-------------- + +Check the kernel version that you are running. Older kernels may lack updates +that improve Ceph performance. + + +Kernel Issues with SyncFS +------------------------- + +If you have kernel issues with SyncFS, try running one OSD per host to see if +performance improves. Old kernels might not have a recent enough version of +``glibc`` to support ``syncfs(2)``. + + +Filesystem Issues +----------------- + +In post-Luminous releases, we recommend deploying clusters with the BlueStore +back end. When running a pre-Luminous release, or if you have a specific +reason to deploy OSDs with the previous Filestore backend, we recommend +``XFS``. 
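If you are not sure which back end an existing OSD uses, its metadata reports
it. The following is only an illustrative query (OSD ``0`` is an example id);
the reported ``osd_objectstore`` value is ``bluestore`` for BlueStore OSDs and
``filestore`` for legacy Filestore OSDs:

.. prompt:: bash

   ceph osd metadata 0 | grep osd_objectstore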
+ +We recommend against using ``Btrfs`` or ``ext4``. The ``Btrfs`` filesystem has +many attractive features, but bugs may lead to performance issues and spurious +ENOSPC errors. We do not recommend ``ext4`` for Filestore OSDs because +``xattr`` limitations break support for long object names, which are needed for +RGW. + +For more information, see `Filesystem Recommendations`_. + +.. _Filesystem Recommendations: ../configuration/filesystem-recommendations + +Insufficient RAM +---------------- + +We recommend a *minimum* of 4GB of RAM per OSD daemon and we suggest rounding +up from 6GB to 8GB. During normal operations, you may notice that ``ceph-osd`` +processes use only a fraction of that amount. You might be tempted to use the +excess RAM for co-resident applications or to skimp on each node's memory +capacity. However, when OSDs experience recovery their memory utilization +spikes. If there is insufficient RAM available during recovery, OSD performance +will slow considerably and the daemons may even crash or be killed by the Linux +``OOM Killer``. + + +Blocked Requests or Slow Requests +--------------------------------- + +When a ``ceph-osd`` daemon is slow to respond to a request, the cluster log +receives messages reporting ops that are taking too long. The warning threshold +defaults to 30 seconds and is configurable via the ``osd_op_complaint_time`` +setting. + +Legacy versions of Ceph complain about ``old requests``:: + + osd.0 192.168.106.220:6800/18813 312 : [WRN] old request osd_op(client.5099.0:790 fatty_26485_object789 [write 0~4096] 2.5e54f643) v4 received at 2012-03-06 15:42:56.054801 currently waiting for sub ops + +Newer versions of Ceph complain about ``slow requests``:: + + {date} {osd.num} [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.005692 secs + {date} {osd.num} [WRN] slow request 30.005692 seconds old, received at {date-time}: osd_op(client.4240.0:8 benchmark_data_ceph-1_39426_object7 [write 0~4194304] 0.69848840) v4 currently waiting for subops from [610] + +Possible causes include: + +- A failing drive (check ``dmesg`` output) +- A bug in the kernel file system (check ``dmesg`` output) +- An overloaded cluster (check system load, iostat, etc.) +- A bug in the ``ceph-osd`` daemon. + +Possible solutions: + +- Remove VMs from Ceph hosts +- Upgrade kernel +- Upgrade Ceph +- Restart OSDs +- Replace failed or failing components + +Debugging Slow Requests +----------------------- + +If you run ``ceph daemon osd.<id> dump_historic_ops`` or ``ceph daemon osd.<id> +dump_ops_in_flight``, you will see a set of operations and a list of events +each operation went through. These are briefly described below. + +Events from the Messenger layer: + +- ``header_read``: The time that the messenger first started reading the message off the wire. +- ``throttled``: The time that the messenger tried to acquire memory throttle space to read + the message into memory. +- ``all_read``: The time that the messenger finished reading the message off the wire. +- ``dispatched``: The time that the messenger gave the message to the OSD. +- ``initiated``: This is identical to ``header_read``. The existence of both is a + historical oddity. + +Events from the OSD as it processes ops: + +- ``queued_for_pg``: The op has been put into the queue for processing by its PG. +- ``reached_pg``: The PG has started performing the op. 
+- ``waiting for \*``: The op is waiting for some other work to complete before + it can proceed (for example, a new OSDMap; the scrubbing of its object + target; the completion of a PG's peering; all as specified in the message). +- ``started``: The op has been accepted as something the OSD should do and + is now being performed. +- ``waiting for subops from``: The op has been sent to replica OSDs. + +Events from ``FileStore``: + +- ``commit_queued_for_journal_write``: The op has been given to the FileStore. +- ``write_thread_in_journal_buffer``: The op is in the journal's buffer and is waiting + to be persisted (as the next disk write). +- ``journaled_completion_queued``: The op was journaled to disk and its callback + has been queued for invocation. + +Events from the OSD after data has been given to underlying storage: + +- ``op_commit``: The op has been committed (that is, written to journal) by the + primary OSD. +- ``op_applied``: The op has been `write()'en + <https://www.freebsd.org/cgi/man.cgi?write(2)>`_ to the backing FS (that is, + applied in memory but not flushed out to disk) on the primary. +- ``sub_op_applied``: ``op_applied``, but for a replica's "subop". +- ``sub_op_committed``: ``op_commit``, but for a replica's subop (only for EC pools). +- ``sub_op_commit_rec/sub_op_apply_rec from <X>``: The primary marks this when it + hears about the above, but for a particular replica (i.e. ``<X>``). +- ``commit_sent``: We sent a reply back to the client (or primary OSD, for sub ops). + +Some of these events may appear redundant, but they cross important boundaries +in the internal code (such as passing data across locks into new threads). + + +Flapping OSDs +============= + +"Flapping" is the term for the phenomenon of an OSD being repeatedly marked +``up`` and then ``down`` in rapid succession. This section explains how to +recognize flapping, and how to mitigate it. + +When OSDs peer and check heartbeats, they use the cluster (back-end) network +when it is available. See `Monitor/OSD Interaction`_ for details. + +The upstream Ceph community has traditionally recommended separate *public* +(front-end) and *private* (cluster / back-end / replication) networks. This +provides the following benefits: + +#. Segregation of (1) heartbeat traffic and replication/recovery traffic + (private) from (2) traffic from clients and between OSDs and monitors + (public). This helps keep one stream of traffic from DoS-ing the other, + which could in turn result in a cascading failure. + +#. Additional throughput for both public and private traffic. + +In the past, when common networking technologies were measured in a range +encompassing 100Mb/s and 1Gb/s, this separation was often critical. But with +today's 10Gb/s, 40Gb/s, and 25/50/100Gb/s networks, the above capacity concerns +are often diminished or even obviated. For example, if your OSD nodes have two +network ports, dedicating one to the public and the other to the private +network means that you have no path redundancy. This degrades your ability to +endure network maintenance and network failures without significant cluster or +client impact. In situations like this, consider instead using both links for +only a public network: with bonding (LACP) or equal-cost routing (for example, +FRR) you reap the benefits of increased throughput headroom, fault tolerance, +and reduced OSD flapping. 
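+
+As a quick check of how a cluster is currently wired, you can ask the central
+configuration database for the effective network settings. This is only a
+sketch; clusters that define these options solely in ``ceph.conf`` may return
+nothing here:
+
+.. prompt:: bash
+
+   # empty output means no separate cluster (private) network is defined
+   ceph config get osd public_network
+   ceph config get osd cluster_network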
+ +When a private network (or even a single host link) fails or degrades while the +public network continues operating normally, OSDs may not handle this situation +well. In such situations, OSDs use the public network to report each other +``down`` to the monitors, while marking themselves ``up``. The monitors then +send out -- again on the public network -- an updated cluster map with the +affected OSDs marked ``down``. These OSDs reply to the monitors "I'm not dead +yet!", and the cycle repeats. We call this scenario 'flapping', and it can be +difficult to isolate and remediate. Without a private network, this irksome +dynamic is avoided: OSDs are generally either ``up`` or ``down`` without +flapping. + +If something does cause OSDs to 'flap' (repeatedly being marked ``down`` and +then ``up`` again), you can force the monitors to halt the flapping by +temporarily freezing their states: + +.. prompt:: bash + + ceph osd set noup # prevent OSDs from getting marked up + ceph osd set nodown # prevent OSDs from getting marked down + +These flags are recorded in the osdmap: + +.. prompt:: bash + + ceph osd dump | grep flags + +:: + + flags no-up,no-down + +You can clear these flags with: + +.. prompt:: bash + + ceph osd unset noup + ceph osd unset nodown + +Two other flags are available, ``noin`` and ``noout``, which prevent booting +OSDs from being marked ``in`` (allocated data) or protect OSDs from eventually +being marked ``out`` (regardless of the current value of +``mon_osd_down_out_interval``). + +.. note:: ``noup``, ``noout``, and ``nodown`` are temporary in the sense that + after the flags are cleared, the action that they were blocking should be + possible shortly thereafter. But the ``noin`` flag prevents OSDs from being + marked ``in`` on boot, and any daemons that started while the flag was set + will remain that way. + +.. note:: The causes and effects of flapping can be mitigated somewhat by + making careful adjustments to ``mon_osd_down_out_subtree_limit``, + ``mon_osd_reporter_subtree_level``, and ``mon_osd_min_down_reporters``. + Derivation of optimal settings depends on cluster size, topology, and the + Ceph release in use. The interaction of all of these factors is subtle and + is beyond the scope of this document. + + +.. _iostat: https://en.wikipedia.org/wiki/Iostat +.. _Ceph Logging and Debugging: ../../configuration/ceph-conf#ceph-logging-and-debugging +.. _Logging and Debugging: ../log-and-debug +.. _Debugging and Logging: ../debug +.. _Monitor/OSD Interaction: ../../configuration/mon-osd-interaction +.. _Monitor Config Reference: ../../configuration/mon-config-ref +.. _monitoring your OSDs: ../../operations/monitoring-osd-pg + +.. _monitoring OSDs: ../../operations/monitoring-osd-pg/#monitoring-osds + +.. _subscribe to the ceph-devel email list: mailto:majordomo@vger.kernel.org?body=subscribe+ceph-devel +.. _unsubscribe from the ceph-devel email list: mailto:majordomo@vger.kernel.org?body=unsubscribe+ceph-devel +.. _subscribe to the ceph-users email list: mailto:ceph-users-join@lists.ceph.com +.. _unsubscribe from the ceph-users email list: mailto:ceph-users-leave@lists.ceph.com +.. _OS recommendations: ../../../start/os-recommendations +.. 
_ceph-devel: ceph-devel@vger.kernel.org diff --git a/doc/rados/troubleshooting/troubleshooting-pg.rst b/doc/rados/troubleshooting/troubleshooting-pg.rst new file mode 100644 index 000000000..74d04bd9f --- /dev/null +++ b/doc/rados/troubleshooting/troubleshooting-pg.rst @@ -0,0 +1,782 @@ +==================== + Troubleshooting PGs +==================== + +Placement Groups Never Get Clean +================================ + +If, after you have created your cluster, any Placement Groups (PGs) remain in +the ``active`` status, the ``active+remapped`` status, or the +``active+degraded`` status and never achieve an ``active+clean`` status, you +likely have a problem with your configuration. + +In such a situation, it may be necessary to review the settings in the `Pool, +PG and CRUSH Config Reference`_ and make appropriate adjustments. + +As a general rule, run your cluster with more than one OSD and a pool size +greater than two object replicas. + +.. _one-node-cluster: + +One Node Cluster +---------------- + +Ceph no longer provides documentation for operating on a single node. Systems +designed for distributed computing by definition do not run on a single node. +The mounting of client kernel modules on a single node that contains a Ceph +daemon may cause a deadlock due to issues with the Linux kernel itself (unless +VMs are used as clients). You can experiment with Ceph in a one-node +configuration, in spite of the limitations as described herein. + +To create a cluster on a single node, you must change the +``osd_crush_chooseleaf_type`` setting from the default of ``1`` (meaning +``host`` or ``node``) to ``0`` (meaning ``osd``) in your Ceph configuration +file before you create your monitors and OSDs. This tells Ceph that an OSD is +permitted to place another OSD on the same host. If you are trying to set up a +single-node cluster and ``osd_crush_chooseleaf_type`` is greater than ``0``, +Ceph will attempt to place the PGs of one OSD with the PGs of another OSD on +another node, chassis, rack, row, or datacenter depending on the setting. + +.. tip:: DO NOT mount kernel clients directly on the same node as your Ceph + Storage Cluster. Kernel conflicts can arise. However, you can mount kernel + clients within virtual machines (VMs) on a single node. + +If you are creating OSDs using a single disk, you must manually create +directories for the data first. + + +Fewer OSDs than Replicas +------------------------ + +If two OSDs are in an ``up`` and ``in`` state, but the placement groups are not +in an ``active + clean`` state, you may have an ``osd_pool_default_size`` set +to greater than ``2``. + +There are a few ways to address this situation. If you want to operate your +cluster in an ``active + degraded`` state with two replicas, you can set the +``osd_pool_default_min_size`` to ``2`` so that you can write objects in an +``active + degraded`` state. You may also set the ``osd_pool_default_size`` +setting to ``2`` so that you have only two stored replicas (the original and +one replica). In such a case, the cluster should achieve an ``active + clean`` +state. + +.. note:: You can make the changes while the cluster is running. If you make + the changes in your Ceph configuration file, you might need to restart your + cluster. + + +Pool Size = 1 +------------- + +If you have ``osd_pool_default_size`` set to ``1``, you will have only one copy +of the object. OSDs rely on other OSDs to tell them which objects they should +have. 
If one OSD has a copy of an object and there is no second copy, then +there is no second OSD to tell the first OSD that it should have that copy. For +each placement group mapped to the first OSD (see ``ceph pg dump``), you can +force the first OSD to notice the placement groups it needs by running a +command of the following form: + +.. prompt:: bash + + ceph osd force-create-pg <pgid> + + +CRUSH Map Errors +---------------- + +If any placement groups in your cluster are unclean, then there might be errors +in your CRUSH map. + + +Stuck Placement Groups +====================== + +It is normal for placement groups to enter "degraded" or "peering" states after +a component failure. Normally, these states reflect the expected progression +through the failure recovery process. However, a placement group that stays in +one of these states for a long time might be an indication of a larger problem. +For this reason, the Ceph Monitors will warn when placement groups get "stuck" +in a non-optimal state. Specifically, we check for: + +* ``inactive`` - The placement group has not been ``active`` for too long (that + is, it hasn't been able to service read/write requests). + +* ``unclean`` - The placement group has not been ``clean`` for too long (that + is, it hasn't been able to completely recover from a previous failure). + +* ``stale`` - The placement group status has not been updated by a + ``ceph-osd``. This indicates that all nodes storing this placement group may + be ``down``. + +List stuck placement groups by running one of the following commands: + +.. prompt:: bash + + ceph pg dump_stuck stale + ceph pg dump_stuck inactive + ceph pg dump_stuck unclean + +- Stuck ``stale`` placement groups usually indicate that key ``ceph-osd`` + daemons are not running. +- Stuck ``inactive`` placement groups usually indicate a peering problem (see + :ref:`failures-osd-peering`). +- Stuck ``unclean`` placement groups usually indicate that something is + preventing recovery from completing, possibly unfound objects (see + :ref:`failures-osd-unfound`); + + + +.. _failures-osd-peering: + +Placement Group Down - Peering Failure +====================================== + +In certain cases, the ``ceph-osd`` `peering` process can run into problems, +which can prevent a PG from becoming active and usable. In such a case, running +the command ``ceph health detail`` will report something similar to the following: + +.. prompt:: bash + + ceph health detail + +:: + + HEALTH_ERR 7 pgs degraded; 12 pgs down; 12 pgs peering; 1 pgs recovering; 6 pgs stuck unclean; 114/3300 degraded (3.455%); 1/3 in osds are down + ... + pg 0.5 is down+peering + pg 1.4 is down+peering + ... + osd.1 is down since epoch 69, last address 192.168.106.220:6801/8651 + +Query the cluster to determine exactly why the PG is marked ``down`` by running a command of the following form: + +.. prompt:: bash + + ceph pg 0.5 query + +.. code-block:: javascript + + { "state": "down+peering", + ... 
+ "recovery_state": [ + { "name": "Started\/Primary\/Peering\/GetInfo", + "enter_time": "2012-03-06 14:40:16.169679", + "requested_info_from": []}, + { "name": "Started\/Primary\/Peering", + "enter_time": "2012-03-06 14:40:16.169659", + "probing_osds": [ + 0, + 1], + "blocked": "peering is blocked due to down osds", + "down_osds_we_would_probe": [ + 1], + "peering_blocked_by": [ + { "osd": 1, + "current_lost_at": 0, + "comment": "starting or marking this osd lost may let us proceed"}]}, + { "name": "Started", + "enter_time": "2012-03-06 14:40:16.169513"} + ] + } + +The ``recovery_state`` section tells us that peering is blocked due to down +``ceph-osd`` daemons, specifically ``osd.1``. In this case, we can start that +particular ``ceph-osd`` and recovery will proceed. + +Alternatively, if there is a catastrophic failure of ``osd.1`` (for example, if +there has been a disk failure), the cluster can be informed that the OSD is +``lost`` and the cluster can be instructed that it must cope as best it can. + +.. important:: Informing the cluster that an OSD has been lost is dangerous + because the cluster cannot guarantee that the other copies of the data are + consistent and up to date. + +To report an OSD ``lost`` and to instruct Ceph to continue to attempt recovery +anyway, run a command of the following form: + +.. prompt:: bash + + ceph osd lost 1 + +Recovery will proceed. + + +.. _failures-osd-unfound: + +Unfound Objects +=============== + +Under certain combinations of failures, Ceph may complain about ``unfound`` +objects, as in this example: + +.. prompt:: bash + + ceph health detail + +:: + + HEALTH_WARN 1 pgs degraded; 78/3778 unfound (2.065%) + pg 2.4 is active+degraded, 78 unfound + +This means that the storage cluster knows that some objects (or newer copies of +existing objects) exist, but it hasn't found copies of them. Here is an +example of how this might come about for a PG whose data is on two OSDS, which +we will call "1" and "2": + +* 1 goes down +* 2 handles some writes, alone +* 1 comes up +* 1 and 2 re-peer, and the objects missing on 1 are queued for recovery. +* Before the new objects are copied, 2 goes down. + +At this point, 1 knows that these objects exist, but there is no live +``ceph-osd`` that has a copy of the objects. In this case, IO to those objects +will block, and the cluster will hope that the failed node comes back soon. +This is assumed to be preferable to returning an IO error to the user. + +.. note:: The situation described immediately above is one reason that setting + ``size=2`` on a replicated pool and ``m=1`` on an erasure coded pool risks + data loss. + +Identify which objects are unfound by running a command of the following form: + +.. prompt:: bash + + ceph pg 2.4 list_unfound [starting offset, in json] + +.. code-block:: javascript + + { + "num_missing": 1, + "num_unfound": 1, + "objects": [ + { + "oid": { + "oid": "object", + "key": "", + "snapid": -2, + "hash": 2249616407, + "max": 0, + "pool": 2, + "namespace": "" + }, + "need": "43'251", + "have": "0'0", + "flags": "none", + "clean_regions": "clean_offsets: [], clean_omap: 0, new_object: 1", + "locations": [ + "0(3)", + "4(2)" + ] + } + ], + "state": "NotRecovering", + "available_might_have_unfound": true, + "might_have_unfound": [ + { + "osd": "2(4)", + "status": "osd is down" + } + ], + "more": false + } + +If there are too many objects to list in a single result, the ``more`` field +will be true and you can query for more. 
(Eventually the command line tool +will hide this from you, but not yet.) + +Now you can identify which OSDs have been probed or might contain data. + +At the end of the listing (before ``more: false``), ``might_have_unfound`` is +provided when ``available_might_have_unfound`` is true. This is equivalent to +the output of ``ceph pg #.# query``. This eliminates the need to use ``query`` +directly. The ``might_have_unfound`` information given behaves the same way as +that ``query`` does, which is described below. The only difference is that +OSDs that have the status of ``already probed`` are ignored. + +Use of ``query``: + +.. prompt:: bash + + ceph pg 2.4 query + +.. code-block:: javascript + + "recovery_state": [ + { "name": "Started\/Primary\/Active", + "enter_time": "2012-03-06 15:15:46.713212", + "might_have_unfound": [ + { "osd": 1, + "status": "osd is down"}]}, + +In this case, the cluster knows that ``osd.1`` might have data, but it is +``down``. Here is the full range of possible states: + +* already probed +* querying +* OSD is down +* not queried (yet) + +Sometimes it simply takes some time for the cluster to query possible +locations. + +It is possible that there are other locations where the object might exist that +are not listed. For example: if an OSD is stopped and taken out of the cluster +and then the cluster fully recovers, and then through a subsequent set of +failures the cluster ends up with an unfound object, the cluster will ignore +the removed OSD. (This scenario, however, is unlikely.) + +If all possible locations have been queried and objects are still lost, you may +have to give up on the lost objects. This, again, is possible only when unusual +combinations of failures have occurred that allow the cluster to learn about +writes that were performed before the writes themselves have been recovered. To +mark the "unfound" objects as "lost", run a command of the following form: + +.. prompt:: bash + + ceph pg 2.5 mark_unfound_lost revert|delete + +Here the final argument (``revert|delete``) specifies how the cluster should +deal with lost objects. + +The ``delete`` option will cause the cluster to forget about them entirely. + +The ``revert`` option (which is not available for erasure coded pools) will +either roll back to a previous version of the object or (if it was a new +object) forget about the object entirely. Use ``revert`` with caution, as it +may confuse applications that expect the object to exist. + +Homeless Placement Groups +========================= + +It is possible that every OSD that has copies of a given placement group fails. +If this happens, then the subset of the object store that contains those +placement groups becomes unavailable and the monitor will receive no status +updates for those placement groups. The monitor marks as ``stale`` any +placement group whose primary OSD has failed. For example: + +.. prompt:: bash + + ceph health + +:: + + HEALTH_WARN 24 pgs stale; 3/300 in osds are down + +Identify which placement groups are ``stale`` and which were the last OSDs to +store the ``stale`` placement groups by running the following command: + +.. prompt:: bash + + ceph health detail + +:: + + HEALTH_WARN 24 pgs stale; 3/300 in osds are down + ... + pg 2.5 is stuck stale+active+remapped, last acting [2,0] + ... 
+ osd.10 is down since epoch 23, last address 192.168.106.220:6800/11080 + osd.11 is down since epoch 13, last address 192.168.106.220:6803/11539 + osd.12 is down since epoch 24, last address 192.168.106.220:6806/11861 + +This output indicates that placement group 2.5 (``pg 2.5``) was last managed by +``osd.0`` and ``osd.2``. Restart those OSDs to allow the cluster to recover +that placement group. + + +Only a Few OSDs Receive Data +============================ + +If only a few of the nodes in the cluster are receiving data, check the number +of placement groups in the pool as instructed in the :ref:`Placement Groups +<rados_ops_pgs_get_pg_num>` documentation. Since placement groups get mapped to +OSDs in an operation involving dividing the number of placement groups in the +cluster by the number of OSDs in the cluster, a small number of placement +groups (the remainder, in this operation) are sometimes not distributed across +the cluster. In situations like this, create a pool with a placement group +count that is a multiple of the number of OSDs. See `Placement Groups`_ for +details. See the :ref:`Pool, PG, and CRUSH Config Reference +<rados_config_pool_pg_crush_ref>` for instructions on changing the default +values used to determine how many placement groups are assigned to each pool. + + +Can't Write Data +================ + +If the cluster is up, but some OSDs are down and you cannot write data, make +sure that you have the minimum number of OSDs running in the pool. If you don't +have the minimum number of OSDs running in the pool, Ceph will not allow you to +write data to it because there is no guarantee that Ceph can replicate your +data. See ``osd_pool_default_min_size`` in the :ref:`Pool, PG, and CRUSH +Config Reference <rados_config_pool_pg_crush_ref>` for details. + + +PGs Inconsistent +================ + +If the command ``ceph health detail`` returns an ``active + clean + +inconsistent`` state, this might indicate an error during scrubbing. Identify +the inconsistent placement group or placement groups by running the following +command: + +.. prompt:: bash + + $ ceph health detail + +:: + + HEALTH_ERR 1 pgs inconsistent; 2 scrub errors + pg 0.6 is active+clean+inconsistent, acting [0,1,2] + 2 scrub errors + +Alternatively, run this command if you prefer to inspect the output in a +programmatic way: + +.. prompt:: bash + + $ rados list-inconsistent-pg rbd + +:: + + ["0.6"] + +There is only one consistent state, but in the worst case, we could have +different inconsistencies in multiple perspectives found in more than one +object. If an object named ``foo`` in PG ``0.6`` is truncated, the output of +``rados list-inconsistent-pg rbd`` will look something like this: + +.. prompt:: bash + + rados list-inconsistent-obj 0.6 --format=json-pretty + +.. 
code-block:: javascript + + { + "epoch": 14, + "inconsistents": [ + { + "object": { + "name": "foo", + "nspace": "", + "locator": "", + "snap": "head", + "version": 1 + }, + "errors": [ + "data_digest_mismatch", + "size_mismatch" + ], + "union_shard_errors": [ + "data_digest_mismatch_info", + "size_mismatch_info" + ], + "selected_object_info": "0:602f83fe:::foo:head(16'1 client.4110.0:1 dirty|data_digest|omap_digest s 968 uv 1 dd e978e67f od ffffffff alloc_hint [0 0 0])", + "shards": [ + { + "osd": 0, + "errors": [], + "size": 968, + "omap_digest": "0xffffffff", + "data_digest": "0xe978e67f" + }, + { + "osd": 1, + "errors": [], + "size": 968, + "omap_digest": "0xffffffff", + "data_digest": "0xe978e67f" + }, + { + "osd": 2, + "errors": [ + "data_digest_mismatch_info", + "size_mismatch_info" + ], + "size": 0, + "omap_digest": "0xffffffff", + "data_digest": "0xffffffff" + } + ] + } + ] + } + +In this case, the output indicates the following: + +* The only inconsistent object is named ``foo``, and its head has + inconsistencies. +* The inconsistencies fall into two categories: + + #. ``errors``: these errors indicate inconsistencies between shards, without + an indication of which shard(s) are bad. Check for the ``errors`` in the + ``shards`` array, if available, to pinpoint the problem. + + * ``data_digest_mismatch``: the digest of the replica read from ``OSD.2`` + is different from the digests of the replica reads of ``OSD.0`` and + ``OSD.1`` + * ``size_mismatch``: the size of the replica read from ``OSD.2`` is ``0``, + but the size reported by ``OSD.0`` and ``OSD.1`` is ``968``. + + #. ``union_shard_errors``: the union of all shard-specific ``errors`` in the + ``shards`` array. The ``errors`` are set for the shard with the problem. + These errors include ``read_error`` and other similar errors. The + ``errors`` ending in ``oi`` indicate a comparison with + ``selected_object_info``. Examine the ``shards`` array to determine + which shard has which error or errors. + + * ``data_digest_mismatch_info``: the digest stored in the ``object-info`` + is not ``0xffffffff``, which is calculated from the shard read from + ``OSD.2`` + * ``size_mismatch_info``: the size stored in the ``object-info`` is + different from the size read from ``OSD.2``. The latter is ``0``. + +.. warning:: If ``read_error`` is listed in a shard's ``errors`` attribute, the + inconsistency is likely due to physical storage errors. In cases like this, + check the storage used by that OSD. + + Examine the output of ``dmesg`` and ``smartctl`` before attempting a drive + repair. + +To repair the inconsistent placement group, run a command of the following +form: + +.. prompt:: bash + + ceph pg repair {placement-group-ID} + +.. warning: This command overwrites the "bad" copies with "authoritative" + copies. In most cases, Ceph is able to choose authoritative copies from all + the available replicas by using some predefined criteria. This, however, + does not work in every case. For example, it might be the case that the + stored data digest is missing, which means that the calculated digest is + ignored when Ceph chooses the authoritative copies. Be aware of this, and + use the above command with caution. + + +If you receive ``active + clean + inconsistent`` states periodically due to +clock skew, consider configuring the `NTP +<https://en.wikipedia.org/wiki/Network_Time_Protocol>`_ daemons on your monitor +hosts to act as peers. 
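+
+A quick way to check whether the monitors themselves are observing clock skew
+is the built-in time-sync report (a sketch; the exact fields vary by release):
+
+.. prompt:: bash
+
+   ceph time-sync-status
+   ceph health detail | grep -i 'clock skew'
+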
See `The Network Time Protocol <http://www.ntp.org>`_ +and Ceph :ref:`Clock Settings <mon-config-ref-clock>` for more information. + + +Erasure Coded PGs are not active+clean +====================================== + +If CRUSH fails to find enough OSDs to map to a PG, it will show as a +``2147483647`` which is ``ITEM_NONE`` or ``no OSD found``. For example:: + + [2,1,6,0,5,8,2147483647,7,4] + +Not enough OSDs +--------------- + +If the Ceph cluster has only eight OSDs and an erasure coded pool needs nine +OSDs, the cluster will show "Not enough OSDs". In this case, you either create +another erasure coded pool that requires fewer OSDs, by running commands of the +following form: + +.. prompt:: bash + + ceph osd erasure-code-profile set myprofile k=5 m=3 + ceph osd pool create erasurepool erasure myprofile + +or add new OSDs, and the PG will automatically use them. + +CRUSH constraints cannot be satisfied +------------------------------------- + +If the cluster has enough OSDs, it is possible that the CRUSH rule is imposing +constraints that cannot be satisfied. If there are ten OSDs on two hosts and +the CRUSH rule requires that no two OSDs from the same host are used in the +same PG, the mapping may fail because only two OSDs will be found. Check the +constraint by displaying ("dumping") the rule, as shown here: + +.. prompt:: bash + + ceph osd crush rule ls + +:: + + [ + "replicated_rule", + "erasurepool"] + $ ceph osd crush rule dump erasurepool + { "rule_id": 1, + "rule_name": "erasurepool", + "type": 3, + "steps": [ + { "op": "take", + "item": -1, + "item_name": "default"}, + { "op": "chooseleaf_indep", + "num": 0, + "type": "host"}, + { "op": "emit"}]} + + +Resolve this problem by creating a new pool in which PGs are allowed to have +OSDs residing on the same host by running the following commands: + +.. prompt:: bash + + ceph osd erasure-code-profile set myprofile crush-failure-domain=osd + ceph osd pool create erasurepool erasure myprofile + +CRUSH gives up too soon +----------------------- + +If the Ceph cluster has just enough OSDs to map the PG (for instance a cluster +with a total of nine OSDs and an erasure coded pool that requires nine OSDs per +PG), it is possible that CRUSH gives up before finding a mapping. This problem +can be resolved by: + +* lowering the erasure coded pool requirements to use fewer OSDs per PG (this + requires the creation of another pool, because erasure code profiles cannot + be modified dynamically). + +* adding more OSDs to the cluster (this does not require the erasure coded pool + to be modified, because it will become clean automatically) + +* using a handmade CRUSH rule that tries more times to find a good mapping. + This can be modified for an existing CRUSH rule by setting + ``set_choose_tries`` to a value greater than the default. + +First, verify the problem by using ``crushtool`` after extracting the crushmap +from the cluster. This ensures that your experiments do not modify the Ceph +cluster and that they operate only on local files: + +.. 
prompt:: bash + + ceph osd crush rule dump erasurepool + +:: + + { "rule_id": 1, + "rule_name": "erasurepool", + "type": 3, + "steps": [ + { "op": "take", + "item": -1, + "item_name": "default"}, + { "op": "chooseleaf_indep", + "num": 0, + "type": "host"}, + { "op": "emit"}]} + $ ceph osd getcrushmap > crush.map + got crush map from osdmap epoch 13 + $ crushtool -i crush.map --test --show-bad-mappings \ + --rule 1 \ + --num-rep 9 \ + --min-x 1 --max-x $((1024 * 1024)) + bad mapping rule 8 x 43 num_rep 9 result [3,2,7,1,2147483647,8,5,6,0] + bad mapping rule 8 x 79 num_rep 9 result [6,0,2,1,4,7,2147483647,5,8] + bad mapping rule 8 x 173 num_rep 9 result [0,4,6,8,2,1,3,7,2147483647] + +Here, ``--num-rep`` is the number of OSDs that the erasure code CRUSH rule +needs, ``--rule`` is the value of the ``rule_id`` field that was displayed by +``ceph osd crush rule dump``. This test will attempt to map one million values +(in this example, the range defined by ``[--min-x,--max-x]``) and must display +at least one bad mapping. If this test outputs nothing, all mappings have been +successful and you can be assured that the problem with your cluster is not +caused by bad mappings. + +Changing the value of set_choose_tries +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +#. Decompile the CRUSH map to edit the CRUSH rule by running the following + command: + + .. prompt:: bash + + crushtool --decompile crush.map > crush.txt + +#. Add the following line to the rule:: + + step set_choose_tries 100 + + The relevant part of the ``crush.txt`` file will resemble this:: + + rule erasurepool { + id 1 + type erasure + step set_chooseleaf_tries 5 + step set_choose_tries 100 + step take default + step chooseleaf indep 0 type host + step emit + } + +#. Recompile and retest the CRUSH rule: + + .. prompt:: bash + + crushtool --compile crush.txt -o better-crush.map + +#. When all mappings succeed, display a histogram of the number of tries that + were necessary to find all of the mapping by using the + ``--show-choose-tries`` option of the ``crushtool`` command, as in the + following example: + + .. prompt:: bash + + crushtool -i better-crush.map --test --show-bad-mappings \ + --show-choose-tries \ + --rule 1 \ + --num-rep 9 \ + --min-x 1 --max-x $((1024 * 1024)) + ... + 11: 42 + 12: 44 + 13: 54 + 14: 45 + 15: 35 + 16: 34 + 17: 30 + 18: 25 + 19: 19 + 20: 22 + 21: 20 + 22: 17 + 23: 13 + 24: 16 + 25: 13 + 26: 11 + 27: 11 + 28: 13 + 29: 11 + 30: 10 + 31: 6 + 32: 5 + 33: 10 + 34: 3 + 35: 7 + 36: 5 + 37: 2 + 38: 5 + 39: 5 + 40: 2 + 41: 5 + 42: 4 + 43: 1 + 44: 2 + 45: 2 + 46: 3 + 47: 1 + 48: 0 + ... + 102: 0 + 103: 1 + 104: 0 + ... + + This output indicates that it took eleven tries to map forty-two PGs, twelve + tries to map forty-four PGs etc. The highest number of tries is the minimum + value of ``set_choose_tries`` that prevents bad mappings (for example, + ``103`` in the above output, because it did not take more than 103 tries for + any PG to be mapped). + +.. _check: ../../operations/placement-groups#get-the-number-of-placement-groups +.. _Placement Groups: ../../operations/placement-groups +.. _Pool, PG and CRUSH Config Reference: ../../configuration/pool-pg-config-ref |