Diffstat (limited to 'doc/rados')
70 files changed, 24582 insertions, 0 deletions
diff --git a/doc/rados/api/index.rst b/doc/rados/api/index.rst new file mode 100644 index 000000000..63bc7222d --- /dev/null +++ b/doc/rados/api/index.rst @@ -0,0 +1,23 @@ +=========================== + Ceph Storage Cluster APIs +=========================== + +The :term:`Ceph Storage Cluster` has a messaging layer protocol that enables +clients to interact with a :term:`Ceph Monitor` and a :term:`Ceph OSD Daemon`. +``librados`` provides this functionality to :term:`Ceph Client`\s in the form of +a library. All Ceph Clients either use ``librados`` or the same functionality +encapsulated in ``librados`` to interact with the object store. For example, +``librbd`` and ``libcephfs`` leverage this functionality. You may use +``librados`` to interact with Ceph directly (e.g., an application that talks to +Ceph, your own interface to Ceph, etc.). + + +.. toctree:: + :maxdepth: 2 + + Introduction to librados <librados-intro> + librados (C) <librados> + librados (C++) <libradospp> + librados (Python) <python> + libcephsqlite (SQLite) <libcephsqlite> + object class <objclass-sdk> diff --git a/doc/rados/api/libcephsqlite.rst b/doc/rados/api/libcephsqlite.rst new file mode 100644 index 000000000..76ab306bb --- /dev/null +++ b/doc/rados/api/libcephsqlite.rst @@ -0,0 +1,438 @@ +.. _libcephsqlite: + +================ + Ceph SQLite VFS +================ + +This `SQLite VFS`_ may be used for storing and accessing a `SQLite`_ database +backed by RADOS. This allows you to fully decentralize your database using +Ceph's object store for improved availability, accessibility, and use of +storage. + +Note what this is not: a distributed SQL engine. SQLite on RADOS can be thought +of like RBD as compared to CephFS: RBD puts a disk image on RADOS for the +purposes of exclusive access by a machine and generally does not allow parallel +access by other machines; on the other hand, CephFS allows fully distributed +access to a file system from many client mounts. 
SQLite on RADOS is meant to be +accessed by a single SQLite client database connection at a given time. The +database may be manipulated safely by multiple clients only in a serial fashion +controlled by RADOS locks managed by the Ceph SQLite VFS. + + +Usage +^^^^^ + +Normal unmodified applications (including the sqlite command-line toolset +binary) may load the *ceph* VFS using the `SQLite Extension Loading API`_. + +.. code:: sql + + .load libcephsqlite.so + +or during the invocation of ``sqlite3``: + +.. code:: sh + + sqlite3 -cmd '.load libcephsqlite.so' + +A database file is formatted as a SQLite URI:: + + file:///<"*"poolid|poolname>:[namespace]/<dbname>?vfs=ceph + +The RADOS ``namespace`` is optional. Note the triple ``///`` in the path. The URI +authority must be empty or localhost in SQLite. Only the path part of the URI +is parsed. For this reason, the URI will not parse properly if you only use two +``//``. + +A complete example of (optionally) creating and opening a database: + +.. code:: sh + + sqlite3 -cmd '.load libcephsqlite.so' -cmd '.open file:///foo:bar/baz.db?vfs=ceph' + +Note you cannot specify the database file as the normal positional argument to +``sqlite3``. This is because the ``.load libcephsqlite.so`` command is applied +after opening the database, but opening the database depends on the extension +being loaded first. + +An example passing the pool integer id and no RADOS namespace: + +.. code:: sh + + sqlite3 -cmd '.load libcephsqlite.so' -cmd '.open file:///*2:/baz.db?vfs=ceph' + +Like other Ceph tools, the *ceph* VFS looks at some environment variables that +help with configuring which Ceph cluster to communicate with and which +credential to use. A typical configuration looks like this: + +.. 
code:: sh + + export CEPH_CONF=/path/to/ceph.conf + export CEPH_KEYRING=/path/to/ceph.keyring + export CEPH_ARGS='--id myclientid' + ./runmyapp + # or + sqlite3 -cmd '.load libcephsqlite.so' -cmd '.open file:///foo:bar/baz.db?vfs=ceph' + +By default, the *ceph* VFS looks at the standard Ceph configuration file path +and uses the ``client.admin`` user. + + +User +^^^^ + +The *ceph* VFS requires a user credential with read access to the monitors, the +ability to blocklist dead clients of the database, and access to the OSDs +hosting the database. This can be done with authorizations as simply as: + +.. code:: sh + + ceph auth get-or-create client.X mon 'allow r, allow command "osd blocklist" with blocklistop=add' osd 'allow rwx' + +.. note:: Note the terminology change from ``blacklist`` to ``blocklist``; older clusters may require using the old terms. + +You may also simplify using the ``simple-rados-client-with-blocklist`` profile: + +.. code:: sh + + ceph auth get-or-create client.X mon 'profile simple-rados-client-with-blocklist' osd 'allow rwx' + +To learn why blocklisting is necessary, see :ref:`libcephsqlite-corrupt`. + + +Page Size +^^^^^^^^^ + +SQLite allows configuring the page size prior to creating a new database. It is +advisable to increase this config to 65536 (64K) when using RADOS-backed +databases to reduce the number of OSD reads/writes and thereby improve +throughput and latency. + +.. code:: sql + + PRAGMA page_size = 65536 + +You may also try other values according to your application needs but note that +64K is the max imposed by SQLite. + + +Cache +^^^^^ + +The *ceph* VFS does not do any caching of reads or buffering of writes. Instead, +and more appropriately, the SQLite page cache is used. You may find it is too small +for most workloads and should therefore increase it significantly: + + +.. code:: sql + + PRAGMA cache_size = 4096 + +This caches 4096 pages, or 256MB with a 64K ``page_size``. 
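The page-size and cache arithmetic above can be checked with Python's standard ``sqlite3`` module. The sketch below is only a stand-in under stated assumptions: it uses a scratch local file with SQLite's default VFS rather than the *ceph* VFS, and the helper name ``cache_bytes`` is hypothetical. It illustrates that a positive ``cache_size`` counts pages, so 4096 pages of 64K each come to 256 MiB.

```python
import os
import sqlite3
import tempfile

def cache_bytes(conn):
    """Return the page-cache budget in bytes for this connection."""
    page_size = conn.execute("PRAGMA page_size").fetchone()[0]
    cache_pages = conn.execute("PRAGMA cache_size").fetchone()[0]
    # A negative cache_size is a memory budget in KiB, not a page count.
    if cache_pages < 0:
        return -cache_pages * 1024
    return cache_pages * page_size

# Assumption: a throwaway local database stands in for one stored on RADOS.
tmpdir = tempfile.mkdtemp()
conn = sqlite3.connect(os.path.join(tmpdir, "demo.db"))

# page_size must be set before the database file is first written.
conn.execute("PRAGMA page_size = 65536")
conn.execute("CREATE TABLE t (x)")

# cache_size is per connection and is not stored in the database file.
conn.execute("PRAGMA cache_size = 4096")

page_size = conn.execute("PRAGMA page_size").fetchone()[0]
total = cache_bytes(conn)
print(page_size, total // (1024 * 1024), "MiB")
conn.close()
```

Because ``cache_size`` is a per-connection setting, an application would need to issue that pragma each time it opens the database, whereas ``page_size`` is fixed once the database has been created.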
+ + +Journal Persistence +^^^^^^^^^^^^^^^^^^^ + +By default, SQLite deletes the journal for every transaction. This can be +expensive as the *ceph* VFS must delete every object backing the journal for each +transaction. For this reason, it is much faster and simpler to ask SQLite to +**persist** the journal. In this mode, SQLite will invalidate the journal via a +write to its header. This is done as: + +.. code:: sql + + PRAGMA journal_mode = PERSIST + +The cost of this may be increased unused space according to the high-water size +of the rollback journal (based on transaction type and size). + + +Exclusive Lock Mode +^^^^^^^^^^^^^^^^^^^ + +SQLite operates in a ``NORMAL`` locking mode where each transaction requires +locking the backing database file. This can add unnecessary overhead to +transactions when you know there's only ever one user of the database at a +given time. You can have SQLite lock the database once for the duration of the +connection using: + +.. code:: sql + + PRAGMA locking_mode = EXCLUSIVE + +This can more than **halve** the time taken to perform a transaction. Keep in +mind this prevents other clients from accessing the database. + +In this locking mode, each write transaction to the database requires 3 +synchronization events: once to write to the journal, another to write to the +database file, and a final write to invalidate the journal header (in +``PERSIST`` journaling mode). + + +WAL Journal +^^^^^^^^^^^ + +The `WAL Journal Mode`_ is only available when SQLite is operating in exclusive +lock mode. This is because it requires shared memory communication with other +readers and writers when in the ``NORMAL`` locking mode. + +As with local disk databases, WAL mode may significantly reduce small +transaction latency. Testing has shown it can provide more than 50% speedup +over persisted rollback journals in exclusive locking mode. You can expect +around 150-250 transactions per second depending on size. 
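The journaling and locking pragmas above are ordinary SQLite settings, so their ordering can be tried on a local database first. What follows is a minimal sketch, assuming Python's standard ``sqlite3`` module and scratch local files with the default VFS (not the *ceph* VFS); the function name ``open_tuned`` is hypothetical. It requests ``EXCLUSIVE`` locking before asking for WAL, in line with the requirement that WAL be used only in exclusive lock mode.

```python
import os
import sqlite3
import tempfile

def open_tuned(path, wal=False):
    """Open a database with exclusive locking and a persistent or WAL journal."""
    conn = sqlite3.connect(path)
    # Lock once for the lifetime of the connection instead of per transaction.
    locking = conn.execute("PRAGMA locking_mode = EXCLUSIVE").fetchone()[0]
    # PERSIST invalidates the journal header instead of deleting the journal.
    # WAL is requested only after exclusive locking has been selected.
    wanted = "WAL" if wal else "PERSIST"
    journal = conn.execute("PRAGMA journal_mode = " + wanted).fetchone()[0]
    return conn, locking, journal

# Assumption: local scratch files stand in for databases stored on RADOS.
tmpdir = tempfile.mkdtemp()
conn1, locking1, journal1 = open_tuned(os.path.join(tmpdir, "rollback.db"))
conn2, locking2, journal2 = open_tuned(os.path.join(tmpdir, "wal.db"), wal=True)

# A small write transaction to exercise the WAL connection.
with conn2:
    conn2.execute("CREATE TABLE t (x)")
    conn2.execute("INSERT INTO t VALUES (1)")

print(locking1, journal1, locking2, journal2)
conn1.close()
conn2.close()
```

Each ``PRAGMA`` returns the mode that is actually in effect, so checking the returned value is a cheap way to notice when a requested mode was not applied.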
+ + +Performance Notes +^^^^^^^^^^^^^^^^^ + +The filing backend for the database on RADOS is asynchronous as much as +possible. Still, performance can be anywhere from 3x-10x slower than a local +database on SSD. Latency can be a major factor. It is advisable to be familiar +with SQL transactions and other strategies for efficient database updates. +Depending on the performance of the underlying pool, you can expect small +transactions to take up to 30 milliseconds to complete. If you use the +``EXCLUSIVE`` locking mode, it can be reduced further to 15 milliseconds per +transaction. A WAL journal in ``EXCLUSIVE`` locking mode can further reduce +this as low as ~2-5 milliseconds (or the time to complete a RADOS write; you +won't get better than that!). + +There is no limit to the size of a SQLite database on RADOS imposed by the Ceph +VFS. There are standard `SQLite Limits`_ to be aware of, notably the maximum +database size of 281 TB. Large databases may or may not be performant on Ceph. +Experimentation for your own use-case is advised. + +Be aware that read-heavy queries could take significant amounts of time as +reads are necessarily synchronous (due to the VFS API). No readahead is yet +performed by the VFS. + + +Recommended Use-Cases +^^^^^^^^^^^^^^^^^^^^^ + +The original purpose of this module was to support saving relational or large +data in RADOS which needs to span multiple objects. Many current applications +with trivial state try to use RADOS omap storage on a single object but this +cannot scale without striping data across multiple objects. Unfortunately, it +is non-trivial to design a store spanning multiple objects which is consistent +and also simple to use. SQLite can be used to bridge that gap. + + +Parallel Access +^^^^^^^^^^^^^^^ + +The VFS does not yet support concurrent readers. All database access is protected +by a single exclusive lock. 
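Since every transaction pays the synchronization cost described in the performance notes, grouping many small updates into one transaction is often the simplest optimization. The sketch below assumes Python's standard ``sqlite3`` module and a local in-memory database as a stand-in for the *ceph* VFS; the table ``kv`` and the helper ``insert_batched`` are hypothetical, and the 30 millisecond figure is just the upper estimate quoted above, used for a rough cost comparison.

```python
import sqlite3

# Upper estimate for one small transaction, taken from the text above.
PER_TXN_SECONDS = 0.030

def insert_batched(conn, rows):
    """Insert all rows inside a single transaction (one commit for the batch)."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.executemany("INSERT INTO kv (k, v) VALUES (?, ?)", rows)

# Assumption: an in-memory database stands in for one stored on RADOS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v TEXT)")

rows = [("key%d" % i, "value%d" % i) for i in range(1000)]
insert_batched(conn, rows)
count = conn.execute("SELECT COUNT(*) FROM kv").fetchone()[0]

# Rough model: one commit per row versus one commit for the whole batch.
# It ignores the extra time a larger transaction needs to write its pages.
unbatched_estimate = len(rows) * PER_TXN_SECONDS
batched_estimate = 1 * PER_TXN_SECONDS
print(count, unbatched_estimate, batched_estimate)
conn.close()
```

The estimate is deliberately crude; measuring with the ``ceph_perf`` counters on a real pool is the more reliable way to see how many syncs a workload actually triggers.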
+ + +Export or Extract Database out of RADOS +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The database is striped on RADOS and can be extracted using the RADOS cli toolset. + +.. code:: sh + + rados --pool=foo --striper get bar.db local-bar.db + rados --pool=foo --striper get bar.db-journal local-bar.db-journal + sqlite3 local-bar.db ... + +Keep in mind the rollback journal is also striped and will need to be extracted +as well if the database was in the middle of a transaction. If you're using +WAL, that journal will need to be extracted as well. + +Keep in mind that extracting the database using the striper uses the same RADOS +locks as those used by the *ceph* VFS. However, the journal file locks are not +used by the *ceph* VFS (SQLite only locks the main database file) so there is a +potential race with other SQLite clients when extracting both files. That could +result in fetching a corrupt journal. + +Instead of manually extracting the files, it would be more advisable to use the +`SQLite Backup`_ mechanism instead. + + +Temporary Tables +^^^^^^^^^^^^^^^^ + +Temporary tables backed by the ceph VFS are not supported. The main reason for +this is that the VFS lacks context about where it should put the database, i.e. +which RADOS pool. The persistent database associated with the temporary +database is not communicated via the SQLite VFS API. + +Instead, it's suggested to attach a secondary local or `In-Memory Database`_ +and put the temporary tables there. Alternatively, you may set a connection +pragma: + +.. code:: sql + + PRAGMA temp_store=memory + + +.. _libcephsqlite-breaking-locks: + +Breaking Locks +^^^^^^^^^^^^^^ + +Access to the database file is protected by an exclusive lock on the first +object stripe of the database. If the application fails without unlocking the +database (e.g. a segmentation fault), the lock is not automatically unlocked, +even if the client connection is blocklisted afterward. 
Eventually, the lock +will timeout subject to the configurations:: + + cephsqlite_lock_renewal_timeout = 30000 + +The timeout is in milliseconds. Once the timeout is reached, the OSD will +expire the lock and allow clients to relock. When this occurs, the database +will be recovered by SQLite and the in-progress transaction rolled back. The +new client recovering the database will also blocklist the old client to +prevent potential database corruption from rogue writes. + +The holder of the exclusive lock on the database will periodically renew the +lock so it does not lose the lock. This is necessary for large transactions or +database connections operating in ``EXCLUSIVE`` locking mode. The lock renewal +interval is adjustable via:: + + cephsqlite_lock_renewal_interval = 2000 + +This configuration is also in units of milliseconds. + +It is possible to break the lock early if you know the client is gone for good +(e.g. blocklisted). This allows restoring database access to clients +immediately. For example: + +.. code:: sh + + $ rados --pool=foo --namespace bar lock info baz.db.0000000000000000 striper.lock + {"name":"striper.lock","type":"exclusive","tag":"","lockers":[{"name":"client.4463","cookie":"555c7208-db39-48e8-a4d7-3ba92433a41a","description":"SimpleRADOSStriper","expiration":"0.000000","addr":"127.0.0.1:0/1831418345"}]} + + $ rados --pool=foo --namespace bar lock break baz.db.0000000000000000 striper.lock client.4463 --lock-cookie 555c7208-db39-48e8-a4d7-3ba92433a41a + +.. _libcephsqlite-corrupt: + +How to Corrupt Your Database +^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +There is the usual reading on `How to Corrupt Your SQLite Database`_ that you +should review before using this tool. To add to that, the most likely way you +may corrupt your database is by a rogue process transiently losing network +connectivity and then resuming its work. The exclusive RADOS lock it held will +be lost but it cannot know that immediately. 
Any work it might do after +regaining network connectivity could corrupt the database. + +The default settings of the *ceph* VFS library prevent this scenario. The Ceph +VFS will blocklist the last owner of the exclusive lock on the database if it +detects incomplete cleanup. + +By blocklisting the old client, it's no longer possible for the old client to +resume its work on the database when it returns (subject to blocklist +expiration, 3600 seconds by default). To turn off blocklisting the prior client, change:: + + cephsqlite_blocklist_dead_locker = false + +Do NOT do this unless you know database corruption cannot result due to other +guarantees. If this config is true (the default), the *ceph* VFS will cowardly +fail if it cannot blocklist the prior instance (due to lack of authorization, +for example). + +One example where out-of-band mechanisms exist to blocklist the last dead +holder of the exclusive lock on the database is in the ``ceph-mgr``. The +monitors are made aware of the RADOS connection used for the *ceph* VFS and will +blocklist the instance during ``ceph-mgr`` failover. This prevents a zombie +``ceph-mgr`` from continuing work and potentially corrupting the database. For +this reason, it is not necessary for the *ceph* VFS to do the blocklist command +in the new instance of the ``ceph-mgr`` (but it still does so, harmlessly). + +To blocklist the *ceph* VFS manually, first look up its instance address +using the ``ceph_status`` SQL function: + +.. code:: sql + + SELECT ceph_status(); + +.. code:: + + {"id":788461300,"addr":"172.21.10.4:0/1472139388"} + +You may easily manipulate that information using the `JSON1 extension`_: + +.. code:: sql + + SELECT json_extract(ceph_status(), '$.addr'); + +.. code:: + + 172.21.10.4:0/1472139388 + +This is the address you would pass to the ceph blocklist command: + +.. 
code:: sh + + ceph osd blocklist add 172.21.10.4:0/3082314560 + + +Performance Statistics +^^^^^^^^^^^^^^^^^^^^^^ + +The *ceph* VFS provides a SQLite function, ``ceph_perf``, for querying the +performance statistics of the VFS. The data is from "performance counters" as +in other Ceph services normally queried via an admin socket. + +.. code:: sql + + SELECT ceph_perf(); + +.. code:: + + {"libcephsqlite_vfs":{"op_open":{"avgcount":2,"sum":0.150001291,"avgtime":0.075000645},"op_delete":{"avgcount":0,"sum":0.000000000,"avgtime":0.000000000},"op_access":{"avgcount":1,"sum":0.003000026,"avgtime":0.003000026},"op_fullpathname":{"avgcount":1,"sum":0.064000551,"avgtime":0.064000551},"op_currenttime":{"avgcount":0,"sum":0.000000000,"avgtime":0.000000000},"opf_close":{"avgcount":1,"sum":0.000000000,"avgtime":0.000000000},"opf_read":{"avgcount":3,"sum":0.036000310,"avgtime":0.012000103},"opf_write":{"avgcount":0,"sum":0.000000000,"avgtime":0.000000000},"opf_truncate":{"avgcount":0,"sum":0.000000000,"avgtime":0.000000000},"opf_sync":{"avgcount":0,"sum":0.000000000,"avgtime":0.000000000},"opf_filesize":{"avgcount":2,"sum":0.000000000,"avgtime":0.000000000},"opf_lock":{"avgcount":1,"sum":0.158001360,"avgtime":0.158001360},"opf_unlock":{"avgcount":1,"sum":0.101000871,"avgtime":0.101000871},"opf_checkreservedlock":{"avgcount":1,"sum":0.002000017,"avgtime":0.002000017},"opf_filecontrol":{"avgcount":4,"sum":0.000000000,"avgtime":0.000000000},"opf_sectorsize":{"avgcount":0,"sum":0.000000000,"avgtime":0.000000000},"opf_devicecharacteristics":{"avgcount":4,"sum":0.000000000,"avgtime":0.000000000}},"libcephsqlite_striper":{"update_metadata":0,"update_allocated":0,"update_size":0,"update_version":0,"shrink":0,"shrink_bytes":0,"lock":1,"unlock":1}} + +You may easily manipulate that information using the `JSON1 extension`_: + +.. code:: sql + + SELECT json_extract(ceph_perf(), '$.libcephsqlite_vfs.opf_sync.avgcount'); + +.. 
code:: + + 776 + +That tells you the number of times SQLite has called the xSync method of the +`SQLite IO Methods`_ of the VFS (for **all** open database connections in the +process). You could analyze the performance stats before and after a number of +queries to see the number of file system syncs required (this would just be +proportional to the number of transactions). Alternatively, you may be more +interested in the average latency to complete a write: + +.. code:: sql + + SELECT json_extract(ceph_perf(), '$.libcephsqlite_vfs.opf_write'); + +.. code:: + + {"avgcount":7873,"sum":0.675005797,"avgtime":0.000085736} + +Which would tell you there have been 7873 writes with an average +time-to-complete of 85 microseconds. That clearly shows the calls are executed +asynchronously. Returning to sync: + +.. code:: sql + + SELECT json_extract(ceph_perf(), '$.libcephsqlite_vfs.opf_sync'); + +.. code:: + + {"avgcount":776,"sum":4.802041199,"avgtime":0.006188197} + +6 milliseconds were spent on average executing a sync call. This gathers all of +the asynchronous writes as well as an asynchronous update to the size of the +striped file. + + +.. _SQLite: https://sqlite.org/index.html +.. _SQLite VFS: https://www.sqlite.org/vfs.html +.. _SQLite Backup: https://www.sqlite.org/backup.html +.. _SQLite Limits: https://www.sqlite.org/limits.html +.. _SQLite Extension Loading API: https://sqlite.org/c3ref/load_extension.html +.. _In-Memory Database: https://www.sqlite.org/inmemorydb.html +.. _WAL Journal Mode: https://sqlite.org/wal.html +.. _How to Corrupt Your SQLite Database: https://www.sqlite.org/howtocorrupt.html +.. _JSON1 Extension: https://www.sqlite.org/json1.html +.. 
_SQLite IO Methods: https://www.sqlite.org/c3ref/io_methods.html diff --git a/doc/rados/api/librados-intro.rst b/doc/rados/api/librados-intro.rst new file mode 100644 index 000000000..9bffa3114 --- /dev/null +++ b/doc/rados/api/librados-intro.rst @@ -0,0 +1,1052 @@ +========================== + Introduction to librados +========================== + +The :term:`Ceph Storage Cluster` provides the basic storage service that allows +:term:`Ceph` to uniquely deliver **object, block, and file storage** in one +unified system. However, you are not limited to using the RESTful, block, or +POSIX interfaces. Based upon :abbr:`RADOS (Reliable Autonomic Distributed Object +Store)`, the ``librados`` API enables you to create your own interface to the +Ceph Storage Cluster. + +The ``librados`` API enables you to interact with the two types of daemons in +the Ceph Storage Cluster: + +- The :term:`Ceph Monitor`, which maintains a master copy of the cluster map. +- The :term:`Ceph OSD Daemon` (OSD), which stores data as objects on a storage node. + +.. ditaa:: + +---------------------------------+ + | Ceph Storage Cluster Protocol | + | (librados) | + +---------------------------------+ + +---------------+ +---------------+ + | OSDs | | Monitors | + +---------------+ +---------------+ + +This guide provides a high-level introduction to using ``librados``. +Refer to :doc:`../../architecture` for additional details of the Ceph +Storage Cluster. To use the API, you need a running Ceph Storage Cluster. +See `Installation (Quick)`_ for details. + + +Step 1: Getting librados +======================== + +Your client application must bind with ``librados`` to connect to the Ceph +Storage Cluster. You must install ``librados`` and any required packages to +write applications that use ``librados``. The ``librados`` API is written in +C++, with additional bindings for C, Python, Java and PHP. 
+ + +Getting librados for C/C++ +-------------------------- + +To install ``librados`` development support files for C/C++ on Debian/Ubuntu +distributions, execute the following: + +.. prompt:: bash $ + + sudo apt-get install librados-dev + +To install ``librados`` development support files for C/C++ on RHEL/CentOS +distributions, execute the following: + +.. prompt:: bash $ + + sudo yum install librados2-devel + +Once you install ``librados`` for developers, you can find the required +headers for C/C++ under ``/usr/include/rados``: + +.. prompt:: bash $ + + ls /usr/include/rados + + +Getting librados for Python +--------------------------- + +The ``rados`` module provides ``librados`` support to Python +applications. The ``librados-dev`` package for Debian/Ubuntu +and the ``librados2-devel`` package for RHEL/CentOS will install the +``python-rados`` package for you. You may install ``python-rados`` +directly too. + +To install ``librados`` development support files for Python on Debian/Ubuntu +distributions, execute the following: + +.. prompt:: bash $ + + sudo apt-get install python3-rados + +To install ``librados`` development support files for Python on RHEL/CentOS +distributions, execute the following: + +.. prompt:: bash $ + + sudo yum install python-rados + +To install ``librados`` development support files for Python on SLE/openSUSE +distributions, execute the following: + +.. prompt:: bash $ + + sudo zypper install python3-rados + +You can find the module under ``/usr/share/pyshared`` on Debian systems, +or under ``/usr/lib/python*/site-packages`` on CentOS/RHEL systems. + + +Getting librados for Java +------------------------- + +To install ``librados`` for Java, you need to execute the following procedure: + +#. Install ``jna.jar``. For Debian/Ubuntu, execute: + + .. prompt:: bash $ + + sudo apt-get install libjna-java + + For CentOS/RHEL, execute: + + .. prompt:: bash $ + + sudo yum install jna + + The JAR files are located in ``/usr/share/java``. + +#. 
Clone the ``rados-java`` repository: + + .. prompt:: bash $ + + git clone --recursive https://github.com/ceph/rados-java.git + +#. Build the ``rados-java`` repository: + + .. prompt:: bash $ + + cd rados-java + ant + + The JAR file is located under ``rados-java/target``. + +#. Copy the JAR for RADOS to a common location (e.g., ``/usr/share/java``) and + ensure that it and the JNA JAR are in your JVM's classpath. For example: + + .. prompt:: bash $ + + sudo cp target/rados-0.1.3.jar /usr/share/java/rados-0.1.3.jar + sudo ln -s /usr/share/java/jna-3.2.7.jar /usr/lib/jvm/default-java/jre/lib/ext/jna-3.2.7.jar + sudo ln -s /usr/share/java/rados-0.1.3.jar /usr/lib/jvm/default-java/jre/lib/ext/rados-0.1.3.jar + +To build the documentation, execute the following: + +.. prompt:: bash $ + + ant docs + + +Getting librados for PHP +------------------------- + +To install the ``librados`` extension for PHP, you need to execute the following procedure: + +#. Install php-dev. For Debian/Ubuntu, execute: + + .. prompt:: bash $ + + sudo apt-get install php5-dev build-essential + + For CentOS/RHEL, execute: + + .. prompt:: bash $ + + sudo yum install php-devel + +#. Clone the ``phprados`` repository: + + .. prompt:: bash $ + + git clone https://github.com/ceph/phprados.git + +#. Build ``phprados``: + + .. prompt:: bash $ + + cd phprados + phpize + ./configure + make + sudo make install + +#. Enable ``phprados`` by adding the following line to ``php.ini``:: + + extension=rados.so + + +Step 2: Configuring a Cluster Handle +==================================== + +A :term:`Ceph Client`, via ``librados``, interacts directly with OSDs to store +and retrieve data. To interact with OSDs, the client app must invoke +``librados`` and connect to a Ceph Monitor. Once connected, ``librados`` +retrieves the :term:`Cluster Map` from the Ceph Monitor. When the client app +wants to read or write data, it creates an I/O context and binds to a +:term:`Pool`. 
The pool has an associated :term:`CRUSH rule` that defines how it +will place data in the storage cluster. Via the I/O context, the client +provides the object name to ``librados``, which takes the object name +and the cluster map (i.e., the topology of the cluster) and `computes`_ the +placement group and `OSD`_ for locating the data. Then the client application +can read or write data. The client app doesn't need to learn about the topology +of the cluster directly. + +.. ditaa:: + +--------+ Retrieves +---------------+ + | Client |------------>| Cluster Map | + +--------+ +---------------+ + | + v Writes + /-----\ + | obj | + \-----/ + | To + v + +--------+ +---------------+ + | Pool |---------->| CRUSH Rule | + +--------+ Selects +---------------+ + + +The Ceph Storage Cluster handle encapsulates the client configuration, including: + +- The `user ID`_ for ``rados_create()`` or user name for ``rados_create2()`` + (preferred). +- The :term:`cephx` authentication key +- The monitor ID and IP address +- Logging levels +- Debugging levels + +Thus, the first steps in using the cluster from your app are to 1) create +a cluster handle that your app will use to connect to the storage cluster, +and then 2) use that handle to connect. To connect to the cluster, the +app must supply a monitor address, a username and an authentication key +(cephx is enabled by default). + +.. tip:: Talking to different Ceph Storage Clusters – or to the same cluster + with different users – requires different cluster handles. + +RADOS provides a number of ways for you to set the required values. For +the monitor and encryption key settings, an easy way to handle them is to ensure +that your Ceph configuration file contains a ``keyring`` path to a keyring file +and at least one monitor address (e.g., ``mon host``). 
For example:: + + [global] + mon host = 192.168.1.1 + keyring = /etc/ceph/ceph.client.admin.keyring + +Once you create the handle, you can read a Ceph configuration file to configure +the handle. You can also pass arguments to your app and parse them with the +function for parsing command line arguments (e.g., ``rados_conf_parse_argv()``), +or parse Ceph environment variables (e.g., ``rados_conf_parse_env()``). Some +wrappers may not implement convenience methods, so you may need to implement +these capabilities. The following diagram provides a high-level flow for the +initial connection. + + +.. ditaa:: + +---------+ +---------+ + | Client | | Monitor | + +---------+ +---------+ + | | + |-----+ create | + | | cluster | + |<----+ handle | + | | + |-----+ read | + | | config | + |<----+ file | + | | + | connect | + |-------------->| + | | + |<--------------| + | connected | + | | + + +Once connected, your app can invoke functions that affect the whole cluster +with only the cluster handle. For example, once you have a cluster +handle, you can: + +- Get cluster statistics +- Use Pool Operation (exists, create, list, delete) +- Get and set the configuration + + +One of the powerful features of Ceph is the ability to bind to different pools. +Each pool may have a different number of placement groups, object replicas and +replication strategies. For example, a pool could be set up as a "hot" pool that +uses SSDs for frequently used objects or a "cold" pool that uses erasure coding. + +The main difference in the various ``librados`` bindings is between C and +the object-oriented bindings for C++, Java and Python. The object-oriented +bindings use objects to represent cluster handles, IO Contexts, iterators, +exceptions, etc. + + +C Example +--------- + +For C, creating a simple cluster handle using the ``admin`` user, configuring +it and connecting to the cluster might look something like this: + +.. 
code-block:: c + + #include <stdio.h> + #include <stdlib.h> + #include <string.h> + #include <rados/librados.h> + + int main (int argc, const char **argv) + { + + /* Declare the cluster handle and required arguments. */ + rados_t cluster; + char cluster_name[] = "ceph"; + char user_name[] = "client.admin"; + uint64_t flags = 0; + + /* Initialize the cluster handle with the "ceph" cluster name and the "client.admin" user */ + int err; + err = rados_create2(&cluster, cluster_name, user_name, flags); + + if (err < 0) { + fprintf(stderr, "%s: Couldn't create the cluster handle! %s\n", argv[0], strerror(-err)); + exit(EXIT_FAILURE); + } else { + printf("\nCreated a cluster handle.\n"); + } + + + /* Read a Ceph configuration file to configure the cluster handle. */ + err = rados_conf_read_file(cluster, "/etc/ceph/ceph.conf"); + if (err < 0) { + fprintf(stderr, "%s: cannot read config file: %s\n", argv[0], strerror(-err)); + exit(EXIT_FAILURE); + } else { + printf("\nRead the config file.\n"); + } + + /* Read command line arguments */ + err = rados_conf_parse_argv(cluster, argc, argv); + if (err < 0) { + fprintf(stderr, "%s: cannot parse command line arguments: %s\n", argv[0], strerror(-err)); + exit(EXIT_FAILURE); + } else { + printf("\nRead the command line arguments.\n"); + } + + /* Connect to the cluster */ + err = rados_connect(cluster); + if (err < 0) { + fprintf(stderr, "%s: cannot connect to cluster: %s\n", argv[0], strerror(-err)); + exit(EXIT_FAILURE); + } else { + printf("\nConnected to the cluster.\n"); + } + + } + +Compile your client and link to ``librados`` using ``-lrados``. For example: + +.. prompt:: bash $ + + gcc ceph-client.c -lrados -o ceph-client + + +C++ Example +----------- + +The Ceph project provides a C++ example in the ``ceph/examples/librados`` +directory. For C++, a simple cluster handle using the ``admin`` user requires +you to initialize a ``librados::Rados`` cluster handle object: + +.. 
code-block:: c++ + + #include <iostream> + #include <string> + #include <rados/librados.hpp> + + int main(int argc, const char **argv) + { + + int ret = 0; + + /* Declare the cluster handle and required variables. */ + librados::Rados cluster; + char cluster_name[] = "ceph"; + char user_name[] = "client.admin"; + uint64_t flags = 0; + + /* Initialize the cluster handle with the "ceph" cluster name and "client.admin" user */ + { + ret = cluster.init2(user_name, cluster_name, flags); + if (ret < 0) { + std::cerr << "Couldn't initialize the cluster handle! error " << ret << std::endl; + return EXIT_FAILURE; + } else { + std::cout << "Created a cluster handle." << std::endl; + } + } + + /* Read a Ceph configuration file to configure the cluster handle. */ + { + ret = cluster.conf_read_file("/etc/ceph/ceph.conf"); + if (ret < 0) { + std::cerr << "Couldn't read the Ceph configuration file! error " << ret << std::endl; + return EXIT_FAILURE; + } else { + std::cout << "Read the Ceph configuration file." << std::endl; + } + } + + /* Read command line arguments */ + { + ret = cluster.conf_parse_argv(argc, argv); + if (ret < 0) { + std::cerr << "Couldn't parse command line options! error " << ret << std::endl; + return EXIT_FAILURE; + } else { + std::cout << "Parsed command line options." << std::endl; + } + } + + /* Connect to the cluster */ + { + ret = cluster.connect(); + if (ret < 0) { + std::cerr << "Couldn't connect to cluster! error " << ret << std::endl; + return EXIT_FAILURE; + } else { + std::cout << "Connected to the cluster." << std::endl; + } + } + + return 0; + } + + +Compile the source; then, link ``librados`` using ``-lrados``. +For example: + +.. 
prompt:: bash $ + + g++ -g -c ceph-client.cc -o ceph-client.o + g++ -g ceph-client.o -lrados -o ceph-client + + + +Python Example +-------------- + +Python uses the ``admin`` id and the ``ceph`` cluster name by default, and +will read the standard ``ceph.conf`` file if the conffile parameter is +set to the empty string. The Python binding converts C++ errors +into exceptions. + + +.. code-block:: python + + import rados + + try: + cluster = rados.Rados(conffile='') + except TypeError as e: + print("Argument validation error: {}".format(e)) + raise + + print("Created cluster handle.") + + try: + cluster.connect() + except Exception as e: + print("connection error: {}".format(e)) + raise + else: + print("Connected to the cluster.") + + +Execute the example to verify that it connects to your cluster: + +.. prompt:: bash $ + + python ceph-client.py + + +Java Example +------------ + +Java requires you to specify the user ID (``admin``) or user name +(``client.admin``), and uses the ``ceph`` cluster name by default. The Java +binding converts C++-based errors into exceptions. + +.. code-block:: java + + import com.ceph.rados.Rados; + import com.ceph.rados.RadosException; + + import java.io.File; + + public class CephClient { + public static void main (String args[]){ + + try { + Rados cluster = new Rados("admin"); + System.out.println("Created cluster handle."); + + File f = new File("/etc/ceph/ceph.conf"); + cluster.confReadFile(f); + System.out.println("Read the configuration file."); + + cluster.connect(); + System.out.println("Connected to the cluster."); + + } catch (RadosException e) { + System.out.println(e.getMessage() + ": " + e.getReturnValue()); + } + } + } + + +Compile the source; then, run it. If you have copied the JAR to +``/usr/share/java`` and symlinked it from your ``ext`` directory, you won't need +to specify the classpath. For example: + +.. 
prompt:: bash $ + + javac CephClient.java + java CephClient + + +PHP Example +------------ + +With the RADOS extension enabled in PHP you can start creating a new cluster handle very easily: + +.. code-block:: php + + <?php + + $r = rados_create(); + rados_conf_read_file($r, '/etc/ceph/ceph.conf'); + if (!rados_connect($r)) { + echo "Failed to connect to Ceph cluster"; + } else { + echo "Successfully connected to Ceph cluster"; + } + + +Save this as rados.php and run the code: + +.. prompt:: bash $ + + php rados.php + + +Step 3: Creating an I/O Context +=============================== + +Once your app has a cluster handle and a connection to a Ceph Storage Cluster, +you may create an I/O Context and begin reading and writing data. An I/O Context +binds the connection to a specific pool. The user must have appropriate +`CAPS`_ permissions to access the specified pool. For example, a user with read +access but not write access will only be able to read data. I/O Context +functionality includes: + +- Write/read data and extended attributes +- List and iterate over objects and extended attributes +- Snapshot pools, list snapshots, etc. + + +.. ditaa:: + +---------+ +---------+ +---------+ + | Client | | Monitor | | OSD | + +---------+ +---------+ +---------+ + | | | + |-----+ create | | + | | I/O | | + |<----+ context | | + | | | + | write data | | + |---------------+-------------->| + | | | + | write ack | | + |<--------------+---------------| + | | | + | write xattr | | + |---------------+-------------->| + | | | + | xattr ack | | + |<--------------+---------------| + | | | + | read data | | + |---------------+-------------->| + | | | + | read ack | | + |<--------------+---------------| + | | | + | remove data | | + |---------------+-------------->| + | | | + | remove ack | | + |<--------------+---------------| + + + +RADOS enables you to interact both synchronously and asynchronously. 
Once your +app has an I/O Context, read/write operations only require you to know the +object/xattr name. The CRUSH algorithm encapsulated in ``librados`` uses the +cluster map to identify the appropriate OSD. OSD daemons handle the replication, +as described in `Smart Daemons Enable Hyperscale`_. The ``librados`` library also +maps objects to placement groups, as described in `Calculating PG IDs`_. + +The following examples use the default ``data`` pool. However, you may also +use the API to list pools, ensure they exist, or create and delete pools. For +the write operations, the examples illustrate how to use synchronous mode. For +the read operations, the examples illustrate how to use asynchronous mode. + +.. important:: Use caution when deleting pools with this API. If you delete + a pool, the pool and ALL DATA in the pool will be lost. + + +C Example +--------- + + +.. code-block:: c + + #include <stdio.h> + #include <stdlib.h> + #include <string.h> + #include <rados/librados.h> + + int main (int argc, const char **argv) + { + /* + * Continued from previous C example, where cluster handle and + * connection are established. First declare an I/O Context. + */ + + rados_ioctx_t io; + char *poolname = "data"; + + err = rados_ioctx_create(cluster, poolname, &io); + if (err < 0) { + fprintf(stderr, "%s: cannot open rados pool %s: %s\n", argv[0], poolname, strerror(-err)); + rados_shutdown(cluster); + exit(EXIT_FAILURE); + } else { + printf("\nCreated I/O context.\n"); + } + + /* Write data to the cluster synchronously. 
*/ + err = rados_write(io, "hw", "Hello World!", 12, 0); + if (err < 0) { + fprintf(stderr, "%s: Cannot write object \"hw\" to pool %s: %s\n", argv[0], poolname, strerror(-err)); + rados_ioctx_destroy(io); + rados_shutdown(cluster); + exit(1); + } else { + printf("\nWrote \"Hello World\" to object \"hw\".\n"); + } + + char xattr[] = "en_US"; + err = rados_setxattr(io, "hw", "lang", xattr, 5); + if (err < 0) { + fprintf(stderr, "%s: Cannot write xattr to pool %s: %s\n", argv[0], poolname, strerror(-err)); + rados_ioctx_destroy(io); + rados_shutdown(cluster); + exit(1); + } else { + printf("\nWrote \"en_US\" to xattr \"lang\" for object \"hw\".\n"); + } + + /* + * Read data from the cluster asynchronously. + * First, set up asynchronous I/O completion. + */ + rados_completion_t comp; + err = rados_aio_create_completion(NULL, NULL, NULL, &comp); + if (err < 0) { + fprintf(stderr, "%s: Could not create aio completion: %s\n", argv[0], strerror(-err)); + rados_ioctx_destroy(io); + rados_shutdown(cluster); + exit(1); + } else { + printf("\nCreated AIO completion.\n"); + } + + /* Next, schedule the read using rados_aio_read. The buffer is zeroed so the result is NUL-terminated. */ + char read_res[100] = {0}; + err = rados_aio_read(io, "hw", comp, read_res, 12, 0); + if (err < 0) { + fprintf(stderr, "%s: Cannot read object. %s %s\n", argv[0], poolname, strerror(-err)); + rados_aio_release(comp); + rados_ioctx_destroy(io); + rados_shutdown(cluster); + exit(1); + } + + /* Wait for the operation to complete before using the buffer. */ + rados_aio_wait_for_complete(comp); + printf("\nRead object \"hw\". The contents are:\n %s \n", read_res); + + /* Release the asynchronous I/O complete handle to avoid memory leaks. */ + rados_aio_release(comp); + + + char xattr_res[100] = {0}; + err = rados_getxattr(io, "hw", "lang", xattr_res, 5); + if (err < 0) { + fprintf(stderr, "%s: Cannot read xattr. %s %s\n", argv[0], poolname, strerror(-err)); + rados_ioctx_destroy(io); + rados_shutdown(cluster); + exit(1); + } else { + printf("\nRead xattr \"lang\" for object \"hw\". 
The contents are:\n %s \n", xattr_res); + } + + err = rados_rmxattr(io, "hw", "lang"); + if (err < 0) { + fprintf(stderr, "%s: Cannot remove xattr. %s %s\n", argv[0], poolname, strerror(-err)); + rados_ioctx_destroy(io); + rados_shutdown(cluster); + exit(1); + } else { + printf("\nRemoved xattr \"lang\" for object \"hw\".\n"); + } + + err = rados_remove(io, "hw"); + if (err < 0) { + fprintf(stderr, "%s: Cannot remove object. %s %s\n", argv[0], poolname, strerror(-err)); + rados_ioctx_destroy(io); + rados_shutdown(cluster); + exit(1); + } else { + printf("\nRemoved object \"hw\".\n"); + } + + } + + + +C++ Example +----------- + + +.. code-block:: c++ + + #include <iostream> + #include <string> + #include <rados/librados.hpp> + + int main(int argc, const char **argv) + { + + /* Continued from previous C++ example, where cluster handle and + * connection are established. First declare an I/O Context. + */ + + librados::IoCtx io_ctx; + const char *pool_name = "data"; + + { + ret = cluster.ioctx_create(pool_name, io_ctx); + if (ret < 0) { + std::cerr << "Couldn't set up ioctx! error " << ret << std::endl; + exit(EXIT_FAILURE); + } else { + std::cout << "Created an ioctx for the pool." << std::endl; + } + } + + + /* Write an object synchronously. */ + { + librados::bufferlist bl; + bl.append("Hello World!"); + ret = io_ctx.write_full("hw", bl); + if (ret < 0) { + std::cerr << "Couldn't write object! error " << ret << std::endl; + exit(EXIT_FAILURE); + } else { + std::cout << "Wrote new object 'hw' " << std::endl; + } + } + + + /* + * Add an xattr to the object. + */ + { + librados::bufferlist lang_bl; + lang_bl.append("en_US"); + ret = io_ctx.setxattr("hw", "lang", lang_bl); + if (ret < 0) { + std::cerr << "failed to set xattr version entry! error " + << ret << std::endl; + exit(EXIT_FAILURE); + } else { + std::cout << "Set the xattr 'lang' on our object!" << std::endl; + } + } + + + /* + * Read the object back asynchronously. 
+ */ + { + librados::bufferlist read_buf; + int read_len = 4194304; + + //Create I/O Completion. + librados::AioCompletion *read_completion = librados::Rados::aio_create_completion(); + + //Send read request. + ret = io_ctx.aio_read("hw", read_completion, &read_buf, read_len, 0); + if (ret < 0) { + std::cerr << "Couldn't start read object! error " << ret << std::endl; + exit(EXIT_FAILURE); + } + + // Wait for the request to complete, and check that it succeeded. + read_completion->wait_for_complete(); + ret = read_completion->get_return_value(); + if (ret < 0) { + std::cerr << "Couldn't read object! error " << ret << std::endl; + exit(EXIT_FAILURE); + } else { + std::cout << "Read object hw asynchronously with contents.\n" + << read_buf.c_str() << std::endl; + } + } + + + /* + * Read the xattr. + */ + { + librados::bufferlist lang_res; + ret = io_ctx.getxattr("hw", "lang", lang_res); + if (ret < 0) { + std::cerr << "failed to get xattr version entry! error " + << ret << std::endl; + exit(EXIT_FAILURE); + } else { + std::cout << "Got the xattr 'lang' from object hw!" + << lang_res.c_str() << std::endl; + } + } + + + /* + * Remove the xattr. + */ + { + ret = io_ctx.rmxattr("hw", "lang"); + if (ret < 0) { + std::cerr << "Failed to remove xattr! error " + << ret << std::endl; + exit(EXIT_FAILURE); + } else { + std::cout << "Removed the xattr 'lang' from our object!" << std::endl; + } + } + + /* + * Remove the object. + */ + { + ret = io_ctx.remove("hw"); + if (ret < 0) { + std::cerr << "Couldn't remove object! error " << ret << std::endl; + exit(EXIT_FAILURE); + } else { + std::cout << "Removed object 'hw'." << std::endl; + } + } + } + + + +Python Example +-------------- + +.. 
code-block:: python + + print("\n\nI/O Context and Object Operations") + print("=================================") + + print("\nCreating a context for the 'data' pool") + if not cluster.pool_exists('data'): + raise RuntimeError('No data pool exists') + ioctx = cluster.open_ioctx('data') + + print("\nWriting object 'hw' with contents 'Hello World!' to pool 'data'.") + ioctx.write("hw", b"Hello World!") + print("Writing XATTR 'lang' with value 'en_US' to object 'hw'") + ioctx.set_xattr("hw", "lang", b"en_US") + + + print("\nWriting object 'bm' with contents 'Bonjour tout le monde!' to pool 'data'.") + ioctx.write("bm", b"Bonjour tout le monde!") + print("Writing XATTR 'lang' with value 'fr_FR' to object 'bm'") + ioctx.set_xattr("bm", "lang", b"fr_FR") + + print("\nContents of object 'hw'\n------------------------") + print(ioctx.read("hw")) + + print("\n\nGetting XATTR 'lang' from object 'hw'") + print(ioctx.get_xattr("hw", "lang")) + + print("\nContents of object 'bm'\n------------------------") + print(ioctx.read("bm")) + + print("Getting XATTR 'lang' from object 'bm'") + print(ioctx.get_xattr("bm", "lang")) + + + print("\nRemoving object 'hw'") + ioctx.remove_object("hw") + + print("Removing object 'bm'") + ioctx.remove_object("bm") + + +Java Example +------------ + +.. 
code-block:: java + + import com.ceph.rados.Rados; + import com.ceph.rados.RadosException; + + import java.io.File; + import com.ceph.rados.IoCTX; + + public class CephClient { + public static void main (String args[]){ + + try { + Rados cluster = new Rados("admin"); + System.out.println("Created cluster handle."); + + File f = new File("/etc/ceph/ceph.conf"); + cluster.confReadFile(f); + System.out.println("Read the configuration file."); + + cluster.connect(); + System.out.println("Connected to the cluster."); + + IoCTX io = cluster.ioCtxCreate("data"); + + String oidone = "hw"; + String contentone = "Hello World!"; + io.write(oidone, contentone); + + String oidtwo = "bm"; + String contenttwo = "Bonjour tout le monde!"; + io.write(oidtwo, contenttwo); + + String[] objects = io.listObjects(); + for (String object: objects) + System.out.println(object); + + io.remove(oidone); + io.remove(oidtwo); + + cluster.ioCtxDestroy(io); + + } catch (RadosException e) { + System.out.println(e.getMessage() + ": " + e.getReturnValue()); + } + } + } + + +PHP Example +----------- + +.. code-block:: php + + <?php + + $io = rados_ioctx_create($r, "mypool"); + rados_write_full($io, "oidOne", "mycontents"); + rados_remove($io, "oidOne"); + rados_ioctx_destroy($io); + + +Step 4: Closing Sessions +======================== + +Once your app finishes with the I/O Context and cluster handle, the app should +close the connection and shut down the handle. For asynchronous I/O, the app +should also ensure that pending asynchronous operations have completed. + + +C Example +--------- + +.. code-block:: c + + rados_ioctx_destroy(io); + rados_shutdown(cluster); + + +C++ Example +----------- + +.. code-block:: c++ + + io_ctx.close(); + cluster.shutdown(); + + +Java Example +------------ + +.. code-block:: java + + cluster.ioCtxDestroy(io); + cluster.shutDown(); + + +Python Example +-------------- + +.. code-block:: python + + print "\nClosing the connection." 
+ ioctx.close() + + print "Shutting down the handle." + cluster.shutdown() + +PHP Example +----------- + +.. code-block:: php + + rados_shutdown($r); + + + +.. _user ID: ../../operations/user-management#command-line-usage +.. _CAPS: ../../operations/user-management#authorization-capabilities +.. _Installation (Quick): ../../../start +.. _Smart Daemons Enable Hyperscale: ../../../architecture#smart-daemons-enable-hyperscale +.. _Calculating PG IDs: ../../../architecture#calculating-pg-ids +.. _computes: ../../../architecture#calculating-pg-ids +.. _OSD: ../../../architecture#mapping-pgs-to-osds diff --git a/doc/rados/api/librados.rst b/doc/rados/api/librados.rst new file mode 100644 index 000000000..3e202bd4b --- /dev/null +++ b/doc/rados/api/librados.rst @@ -0,0 +1,187 @@ +============== + Librados (C) +============== + +.. highlight:: c + +`librados` provides low-level access to the RADOS service. For an +overview of RADOS, see :doc:`../../architecture`. + + +Example: connecting and writing an object +========================================= + +To use `Librados`, you instantiate a :c:type:`rados_t` variable (a cluster handle) and +call :c:func:`rados_create()` with a pointer to it:: + + int err; + rados_t cluster; + + err = rados_create(&cluster, NULL); + if (err < 0) { + fprintf(stderr, "%s: cannot create a cluster handle: %s\n", argv[0], strerror(-err)); + exit(1); + } + +Then you configure your :c:type:`rados_t` to connect to your cluster, +either by setting individual values (:c:func:`rados_conf_set()`), +using a configuration file (:c:func:`rados_conf_read_file()`), using +command line options (:c:func:`rados_conf_parse_argv`), or an +environment variable (:c:func:`rados_conf_parse_env()`):: + + err = rados_conf_read_file(cluster, "/path/to/myceph.conf"); + if (err < 0) { + fprintf(stderr, "%s: cannot read config file: %s\n", argv[0], strerror(-err)); + exit(1); + } + +Once the cluster handle is configured, you can connect to the cluster with 
:c:func:`rados_connect()`:: + + err = rados_connect(cluster); + if (err < 0) { + fprintf(stderr, "%s: cannot connect to cluster: %s\n", argv[0], strerror(-err)); + exit(1); + } + +Then you open an "IO context", a :c:type:`rados_ioctx_t`, with :c:func:`rados_ioctx_create()`:: + + rados_ioctx_t io; + char *poolname = "mypool"; + + err = rados_ioctx_create(cluster, poolname, &io); + if (err < 0) { + fprintf(stderr, "%s: cannot open rados pool %s: %s\n", argv[0], poolname, strerror(-err)); + rados_shutdown(cluster); + exit(1); + } + +Note that the pool you try to access must exist. + +Then you can use the RADOS data manipulation functions, for example +write into an object called ``greeting`` with +:c:func:`rados_write_full()`:: + + err = rados_write_full(io, "greeting", "hello", 5); + if (err < 0) { + fprintf(stderr, "%s: cannot write pool %s: %s\n", argv[0], poolname, strerror(-err)); + rados_ioctx_destroy(io); + rados_shutdown(cluster); + exit(1); + } + +In the end, you will want to close your IO context and connection to RADOS with :c:func:`rados_ioctx_destroy()` and :c:func:`rados_shutdown()`:: + + rados_ioctx_destroy(io); + rados_shutdown(cluster); + + +Asynchronous IO +=============== + +When doing lots of IO, you often don't need to wait for one operation +to complete before starting the next one. `Librados` provides +asynchronous versions of several operations: + +* :c:func:`rados_aio_write` +* :c:func:`rados_aio_append` +* :c:func:`rados_aio_write_full` +* :c:func:`rados_aio_read` + +For each operation, you must first create a +:c:type:`rados_completion_t` that represents what to do when the +operation is safe or complete by calling +:c:func:`rados_aio_create_completion`. 
If you don't need anything +special to happen, you can pass NULL:: + + rados_completion_t comp; + err = rados_aio_create_completion(NULL, NULL, NULL, &comp); + if (err < 0) { + fprintf(stderr, "%s: could not create aio completion: %s\n", argv[0], strerror(-err)); + rados_ioctx_destroy(io); + rados_shutdown(cluster); + exit(1); + } + +Now you can call any of the aio operations, and wait for it to +be in memory or on disk on all replicas:: + + err = rados_aio_write(io, "foo", comp, "bar", 3, 0); + if (err < 0) { + fprintf(stderr, "%s: could not schedule aio write: %s\n", argv[0], strerror(-err)); + rados_aio_release(comp); + rados_ioctx_destroy(io); + rados_shutdown(cluster); + exit(1); + } + rados_aio_wait_for_complete(comp); // in memory + rados_aio_wait_for_safe(comp); // on disk + +Finally, we need to free the memory used by the completion with :c:func:`rados_aio_release`:: + + rados_aio_release(comp); + +You can use the callbacks to tell your application when writes are +durable, or when read buffers are full. 
For example, if you wanted to +measure the latency of each operation when appending to several +objects, you could schedule several writes and store the ack and +commit time in the corresponding callback, then wait for all of them +to complete using :c:func:`rados_aio_flush` before analyzing the +latencies:: + + typedef struct { + struct timeval start; + struct timeval ack_end; + struct timeval commit_end; + } req_duration; + + void ack_callback(rados_completion_t comp, void *arg) { + req_duration *dur = (req_duration *) arg; + gettimeofday(&dur->ack_end, NULL); + } + + void commit_callback(rados_completion_t comp, void *arg) { + req_duration *dur = (req_duration *) arg; + gettimeofday(&dur->commit_end, NULL); + } + + int output_append_latency(rados_ioctx_t io, const char *data, size_t len, size_t num_writes) { + req_duration times[num_writes]; + rados_completion_t comps[num_writes]; + for (size_t i = 0; i < num_writes; ++i) { + gettimeofday(&times[i].start, NULL); + int err = rados_aio_create_completion((void*) &times[i], ack_callback, commit_callback, &comps[i]); + if (err < 0) { + fprintf(stderr, "Error creating rados completion: %s\n", strerror(-err)); + return err; + } + char obj_name[100]; + snprintf(obj_name, sizeof(obj_name), "foo%lu", (unsigned long)i); + err = rados_aio_append(io, obj_name, comps[i], data, len); + if (err < 0) { + fprintf(stderr, "Error from rados_aio_append: %s", strerror(-err)); + return err; + } + } + // wait until all requests finish *and* the callbacks complete + rados_aio_flush(io); + // the latencies can now be analyzed + printf("Request # | Ack latency (s) | Commit latency (s)\n"); + for (size_t i = 0; i < num_writes; ++i) { + // don't forget to free the completions + rados_aio_release(comps[i]); + struct timeval ack_lat, commit_lat; + timersub(&times[i].ack_end, &times[i].start, &ack_lat); + timersub(&times[i].commit_end, &times[i].start, &commit_lat); + printf("%9lu | %8ld.%06ld | %10ld.%06ld\n", (unsigned long) i, ack_lat.tv_sec, ack_lat.tv_usec, 
commit_lat.tv_sec, commit_lat.tv_usec); + } + return 0; + } + +Note that all the :c:type:`rados_completion_t` must be freed with :c:func:`rados_aio_release` to avoid leaking memory. + + +API calls +========= + + .. autodoxygenfile:: rados_types.h + .. autodoxygenfile:: librados.h diff --git a/doc/rados/api/libradospp.rst b/doc/rados/api/libradospp.rst new file mode 100644 index 000000000..08483c8d4 --- /dev/null +++ b/doc/rados/api/libradospp.rst @@ -0,0 +1,9 @@ +================== + LibradosPP (C++) +================== + +.. note:: The librados C++ API is not guaranteed to be API+ABI stable + between major releases. All applications using the librados C++ API must + be recompiled and relinked against a specific Ceph release. + +.. todo:: write me! diff --git a/doc/rados/api/objclass-sdk.rst b/doc/rados/api/objclass-sdk.rst new file mode 100644 index 000000000..6b1162fd4 --- /dev/null +++ b/doc/rados/api/objclass-sdk.rst @@ -0,0 +1,37 @@ +=========================== +SDK for Ceph Object Classes +=========================== + +`Ceph` can be extended by creating shared object classes called `Ceph Object +Classes`. The existing framework to build these object classes has dependencies +on the internal functionality of `Ceph`, which restricts users to build object +classes within the tree. The aim of this project is to create an independent +object class interface, which can be used to build object classes outside the +`Ceph` tree. This allows us to have two types of object classes, 1) those that +have in-tree dependencies and reside in the tree and 2) those that can make use +of the `Ceph Object Class SDK framework` and can be built outside of the `Ceph` +tree because they do not depend on any internal implementation of `Ceph`. This +project decouples object class development from Ceph and encourages creation +and distribution of object classes as packages. 
+ +In order to demonstrate the use of this framework, we have provided an example +called ``cls_sdk``, which is a very simple object class that makes use of the +SDK framework. This object class resides in the ``src/cls`` directory. + +Installing objclass.h +--------------------- + +The object class interface that enables out-of-tree development of object +classes resides in ``src/include/rados/`` and gets installed with `Ceph` +installation. After running ``make install``, you should be able to see it +in ``<prefix>/include/rados``. :: + + ls /usr/local/include/rados + +Using the SDK example +--------------------- + +The ``cls_sdk`` object class resides in ``src/cls/sdk/``. This gets built and +loaded into Ceph, with the Ceph build process. You can run the +``ceph_test_cls_sdk`` unittest, which resides in ``src/test/cls_sdk/``, +to test this class. diff --git a/doc/rados/api/python.rst b/doc/rados/api/python.rst new file mode 100644 index 000000000..0c9cb9e98 --- /dev/null +++ b/doc/rados/api/python.rst @@ -0,0 +1,425 @@ +=================== + Librados (Python) +=================== + +The ``rados`` module is a thin Python wrapper for ``librados``. + +Installation +============ + +To install Python libraries for Ceph, see `Getting librados for Python`_. + + +Getting Started +=============== + +You can create your own Ceph client using Python. The following tutorial will +show you how to import the Ceph Python module, connect to a Ceph cluster, and +perform object operations as a ``client.admin`` user. + +.. note:: To use the Ceph Python bindings, you must have access to a + running Ceph cluster. To set one up quickly, see `Getting Started`_. + +First, create a Python source file for your Ceph client. :: + :linenos: + + sudo vim client.py + + +Import the Module +----------------- + +To use the ``rados`` module, import it into your source file. + +.. 
code-block:: python + :linenos: + + import rados + + +Configure a Cluster Handle +-------------------------- + +Before connecting to the Ceph Storage Cluster, create a cluster handle. By +default, the cluster handle assumes a cluster named ``ceph`` (i.e., the default +for deployment tools, and our Getting Started guides too), and a +``client.admin`` user name. You may change these defaults to suit your needs. + +To connect to the Ceph Storage Cluster, your application needs to know where to +find the Ceph Monitor. Provide this information to your application by +specifying the path to your Ceph configuration file, which contains the location +of the initial Ceph monitors. + +.. code-block:: python + :linenos: + + import rados, sys + + #Create Handle Examples. + cluster = rados.Rados(conffile='ceph.conf') + cluster = rados.Rados(conffile=sys.argv[1]) + cluster = rados.Rados(conffile = 'ceph.conf', conf = dict (keyring = '/path/to/keyring')) + +Ensure that the ``conffile`` argument provides the path and file name of your +Ceph configuration file. You may use the ``sys`` module to avoid hard-coding the +Ceph configuration path and file name. + +Your Python client also requires a client keyring. For this example, we use the +``client.admin`` key by default. If you would like to specify the keyring when +creating the cluster handle, you may use the ``conf`` argument. Alternatively, +you may specify the keyring path in your Ceph configuration file. For example, +you may add something like the following line to your Ceph configuration file:: + + keyring = /path/to/ceph.client.admin.keyring + +For additional details on modifying your configuration via Python, see `Configuration`_. + + +Connect to the Cluster +---------------------- + +Once you have a cluster handle configured, you may connect to the cluster. +With a connection to the cluster, you may execute methods that return +information about the cluster. + +.. 
code-block:: python + :linenos: + :emphasize-lines: 7 + + import rados, sys + + cluster = rados.Rados(conffile='ceph.conf') + print("\nlibrados version: " + str(cluster.version())) + print("Will attempt to connect to: " + str(cluster.conf_get('mon host'))) + + cluster.connect() + print("\nCluster ID: {}".format(cluster.get_fsid())) + + print("\n\nCluster Statistics") + print("==================") + cluster_stats = cluster.get_cluster_stats() + + for key, value in cluster_stats.items(): + print("{} {}".format(key, value)) + + +By default, Ceph authentication is ``on``. Your application will need to know +the location of the keyring. The ``python-ceph`` module doesn't have the default +location, so you need to specify the keyring path. The easiest way to specify +the keyring is to add it to the Ceph configuration file. The following Ceph +configuration file example uses the ``client.admin`` keyring. + +.. code-block:: ini + :linenos: + + [global] + # ... elided configuration + keyring=/path/to/keyring/ceph.client.admin.keyring + + +Manage Pools +------------ + +When connected to the cluster, the ``Rados`` API allows you to manage pools. You +can list pools, check for the existence of a pool, create a pool, and delete a +pool. + +.. 
code-block:: python + :linenos: + :emphasize-lines: 6, 13, 18, 25 + + print("\n\nPool Operations") + print("===============") + + print("\nAvailable Pools") + print("----------------") + pools = cluster.list_pools() + + for pool in pools: + print(pool) + + print("\nCreate 'test' Pool") + print("------------------") + cluster.create_pool('test') + + print("\nPool named 'test' exists: " + str(cluster.pool_exists('test'))) + print("\nVerify 'test' Pool Exists") + print("-------------------------") + pools = cluster.list_pools() + + for pool in pools: + print(pool) + + print("\nDelete 'test' Pool") + print("------------------") + cluster.delete_pool('test') + print("\nPool named 'test' exists: " + str(cluster.pool_exists('test'))) + + + +Input/Output Context +-------------------- + +Reading from and writing to the Ceph Storage Cluster requires an input/output +context (ioctx). You can create an ioctx with the ``open_ioctx()`` or +``open_ioctx2()`` method of the ``Rados`` class. The ``ioctx_name`` parameter +is the name of the pool and ``pool_id`` is the ID of the pool you wish to use. + +.. code-block:: python + :linenos: + + ioctx = cluster.open_ioctx('data') + + +or + +.. code-block:: python + :linenos: + + ioctx = cluster.open_ioctx2(pool_id) + + +Once you have an I/O context, you can read/write objects, extended attributes, +and perform a number of other operations. After you complete operations, ensure +that you close the connection. For example: + +.. code-block:: python + :linenos: + + print("\nClosing the connection.") + ioctx.close() + + +Writing, Reading and Removing Objects +------------------------------------- + +Once you create an I/O context, you can write objects to the cluster. If you +write to an object that doesn't exist, Ceph creates it. If you write to an +object that exists, Ceph overwrites it (except when you specify a range, and +then it only overwrites the range). You may read objects (and object ranges) +from the cluster. 
You may also remove objects from the cluster. For example: + +.. code-block:: python + :linenos: + :emphasize-lines: 2, 5, 8 + + print("\nWriting object 'hw' with contents 'Hello World!' to pool 'data'.") + ioctx.write_full("hw", b"Hello World!") + + print("\n\nContents of object 'hw'\n------------------------\n") + print(ioctx.read("hw")) + + print("\nRemoving object 'hw'") + ioctx.remove_object("hw") + + +Writing and Reading XATTRS +-------------------------- + +Once you create an object, you can write extended attributes (XATTRs) to +the object and read XATTRs from the object. For example: + +.. code-block:: python + :linenos: + :emphasize-lines: 2, 5 + + print("\n\nWriting XATTR 'lang' with value 'en_US' to object 'hw'") + ioctx.set_xattr("hw", "lang", b"en_US") + + print("\n\nGetting XATTR 'lang' from object 'hw'\n") + print(ioctx.get_xattr("hw", "lang")) + + +Listing Objects +--------------- + +If you want to examine the list of objects in a pool, you may +retrieve the list of objects and iterate over them with the object iterator. +For example: + +.. code-block:: python + :linenos: + :emphasize-lines: 1, 6, 7 + + object_iterator = ioctx.list_objects() + + while True: + + try: + rados_object = next(object_iterator) + print("Object contents = {}".format(rados_object.read())) + + except StopIteration: + break + +The ``Object`` class provides a file-like interface to an object, allowing +you to read and write content and extended attributes. Object operations using +the I/O context provide additional functionality and asynchronous capabilities. + + +Cluster Handle API +================== + +The ``Rados`` class provides an interface into the Ceph Storage Daemon. + + +Configuration +------------- + +The ``Rados`` class provides methods for getting and setting configuration +values, reading the Ceph configuration file, and parsing arguments. You +do not need to be connected to the Ceph Storage Cluster to invoke the following +methods. 
See `Storage Cluster Configuration`_ for details on settings. + +.. currentmodule:: rados +.. automethod:: Rados.conf_get(option) +.. automethod:: Rados.conf_set(option, val) +.. automethod:: Rados.conf_read_file(path=None) +.. automethod:: Rados.conf_parse_argv(args) +.. automethod:: Rados.version() + + +Connection Management +--------------------- + +Once you configure your cluster handle, you may connect to the cluster, check +the cluster ``fsid``, retrieve cluster statistics, and disconnect (shutdown) +from the cluster. You may also assert that the cluster handle is in a particular +state (e.g., "configuring", "connecting", etc.). + +.. automethod:: Rados.connect(timeout=0) +.. automethod:: Rados.shutdown() +.. automethod:: Rados.get_fsid() +.. automethod:: Rados.get_cluster_stats() + +.. documented manually because it raises warnings because of *args usage in the +.. signature + +.. py:class:: Rados + + .. py:method:: require_state(*args) + + Checks if the Rados object is in a special state + + :param args: Any number of states to check as separate arguments + :raises: :class:`RadosStateError` + + +Pool Operations +--------------- + +To use pool operation methods, you must connect to the Ceph Storage Cluster +first. You may list the available pools, create a pool, check to see if a pool +exists, and delete a pool. + +.. automethod:: Rados.list_pools() +.. automethod:: Rados.create_pool(pool_name, crush_rule=None) +.. automethod:: Rados.pool_exists() +.. automethod:: Rados.delete_pool(pool_name) + + +CLI Commands +------------ + +The Ceph CLI command is internally using the following librados Python binding methods. + +In order to send a command, choose the correct method and choose the correct target. + +.. automethod:: Rados.mon_command +.. automethod:: Rados.osd_command +.. automethod:: Rados.mgr_command +.. 
automethod:: Rados.pg_command + + +Input/Output Context API +======================== + +To write data to and read data from the Ceph Object Store, you must create +an Input/Output context (ioctx). The `Rados` class provides `open_ioctx()` +and `open_ioctx2()` methods. The remaining ``ioctx`` operations involve +invoking methods of the `Ioctx` and other classes. + +.. automethod:: Rados.open_ioctx(ioctx_name) +.. automethod:: Ioctx.require_ioctx_open() +.. automethod:: Ioctx.get_stats() +.. automethod:: Ioctx.get_last_version() +.. automethod:: Ioctx.close() + + +.. Pool Snapshots +.. -------------- + +.. The Ceph Storage Cluster allows you to make a snapshot of a pool's state. +.. Whereas, basic pool operations only require a connection to the cluster, +.. snapshots require an I/O context. + +.. Ioctx.create_snap(self, snap_name) +.. Ioctx.list_snaps(self) +.. SnapIterator.next(self) +.. Snap.get_timestamp(self) +.. Ioctx.lookup_snap(self, snap_name) +.. Ioctx.remove_snap(self, snap_name) + +.. not published. This doesn't seem ready yet. + +Object Operations +----------------- + +The Ceph Storage Cluster stores data as objects. You can read and write objects +synchronously or asynchronously. You can read and write from offsets. An object +has a name (or key) and data. + + +.. automethod:: Ioctx.aio_write(object_name, to_write, offset=0, oncomplete=None, onsafe=None) +.. automethod:: Ioctx.aio_write_full(object_name, to_write, oncomplete=None, onsafe=None) +.. automethod:: Ioctx.aio_append(object_name, to_append, oncomplete=None, onsafe=None) +.. automethod:: Ioctx.write(key, data, offset=0) +.. automethod:: Ioctx.write_full(key, data) +.. automethod:: Ioctx.aio_flush() +.. automethod:: Ioctx.set_locator_key(loc_key) +.. automethod:: Ioctx.aio_read(object_name, length, offset, oncomplete) +.. automethod:: Ioctx.read(key, length=8192, offset=0) +.. automethod:: Ioctx.stat(key) +.. automethod:: Ioctx.trunc(key, size) +.. 
automethod:: Ioctx.remove_object(key) + + +Object Extended Attributes +-------------------------- + +You may set extended attributes (XATTRs) on an object. You can retrieve a list +of objects or XATTRs and iterate over them. + +.. automethod:: Ioctx.set_xattr(key, xattr_name, xattr_value) +.. automethod:: Ioctx.get_xattrs(oid) +.. automethod:: XattrIterator.__next__() +.. automethod:: Ioctx.get_xattr(key, xattr_name) +.. automethod:: Ioctx.rm_xattr(key, xattr_name) + + + +Object Interface +================ + +From an I/O context, you can retrieve a list of objects from a pool and iterate +over them. The object interface makes each object look like a file, and +you may perform synchronous operations on the objects. For asynchronous +operations, you should use the I/O context methods. + +.. automethod:: Ioctx.list_objects() +.. automethod:: ObjectIterator.__next__() +.. automethod:: Object.read(length = 1024*1024) +.. automethod:: Object.write(string_to_write) +.. automethod:: Object.get_xattrs() +.. automethod:: Object.get_xattr(xattr_name) +.. automethod:: Object.set_xattr(xattr_name, xattr_value) +.. automethod:: Object.rm_xattr(xattr_name) +.. automethod:: Object.stat() +.. automethod:: Object.remove() + + + + +.. _Getting Started: ../../../start +.. _Storage Cluster Configuration: ../../configuration +.. 
_Getting librados for Python: ../librados-intro#getting-librados-for-python diff --git a/doc/rados/command/list-inconsistent-obj.json b/doc/rados/command/list-inconsistent-obj.json new file mode 100644 index 000000000..2bdc5f74c --- /dev/null +++ b/doc/rados/command/list-inconsistent-obj.json @@ -0,0 +1,237 @@ +{ + "$schema": "http://json-schema.org/draft-04/schema#", + "type": "object", + "properties": { + "epoch": { + "description": "Scrub epoch", + "type": "integer" + }, + "inconsistents": { + "type": "array", + "items": { + "type": "object", + "properties": { + "object": { + "description": "Identify a Ceph object", + "type": "object", + "properties": { + "name": { + "type": "string" + }, + "nspace": { + "type": "string" + }, + "locator": { + "type": "string" + }, + "version": { + "type": "integer", + "minimum": 0 + }, + "snap": { + "oneOf": [ + { + "type": "string", + "enum": [ "head", "snapdir" ] + }, + { + "type": "integer", + "minimum": 0 + } + ] + } + }, + "required": [ + "name", + "nspace", + "locator", + "version", + "snap" + ] + }, + "selected_object_info": { + "type": "object", + "description": "Selected object information", + "additionalProperties": true + }, + "union_shard_errors": { + "description": "Union of all shard errors", + "type": "array", + "items": { + "enum": [ + "missing", + "stat_error", + "read_error", + "data_digest_mismatch_info", + "omap_digest_mismatch_info", + "size_mismatch_info", + "ec_hash_error", + "ec_size_error", + "info_missing", + "info_corrupted", + "obj_size_info_mismatch", + "snapset_missing", + "snapset_corrupted", + "hinfo_missing", + "hinfo_corrupted" + ] + }, + "minItems": 0, + "uniqueItems": true + }, + "errors": { + "description": "Errors related to the analysis of this object", + "type": "array", + "items": { + "enum": [ + "object_info_inconsistency", + "data_digest_mismatch", + "omap_digest_mismatch", + "size_mismatch", + "attr_value_mismatch", + "attr_name_mismatch", + "snapset_inconsistency", + 
"hinfo_inconsistency", + "size_too_large" + ] + }, + "minItems": 0, + "uniqueItems": true + }, + "shards": { + "description": "All found or expected shards", + "type": "array", + "items": { + "description": "Information about a particular shard of object", + "type": "object", + "properties": { + "object_info": { + "oneOf": [ + { + "type": "string" + }, + { + "type": "object", + "description": "Object information", + "additionalProperties": true + } + ] + }, + "snapset": { + "oneOf": [ + { + "type": "string" + }, + { + "type": "object", + "description": "Snap set information", + "additionalProperties": true + } + ] + }, + "hashinfo": { + "oneOf": [ + { + "type": "string" + }, + { + "type": "object", + "description": "Erasure code hash information", + "additionalProperties": true + } + ] + }, + "shard": { + "type": "integer" + }, + "osd": { + "type": "integer" + }, + "primary": { + "type": "boolean" + }, + "size": { + "type": "integer" + }, + "omap_digest": { + "description": "Hex representation (e.g. 0x1abd1234)", + "type": "string" + }, + "data_digest": { + "description": "Hex representation (e.g. 
0x1abd1234)", + "type": "string" + }, + "errors": { + "description": "Errors with this shard", + "type": "array", + "items": { + "enum": [ + "missing", + "stat_error", + "read_error", + "data_digest_mismatch_info", + "omap_digest_mismatch_info", + "size_mismatch_info", + "ec_hash_error", + "ec_size_error", + "info_missing", + "info_corrupted", + "obj_size_info_mismatch", + "snapset_missing", + "snapset_corrupted", + "hinfo_missing", + "hinfo_corrupted" + ] + }, + "minItems": 0, + "uniqueItems": true + }, + "attrs": { + "description": "If any shard's attr error is set then all attrs are here", + "type": "array", + "items": { + "description": "Information about a particular shard of object", + "type": "object", + "properties": { + "name": { + "type": "string" + }, + "value": { + "type": "string" + }, + "Base64": { + "type": "boolean" + } + }, + "required": [ + "name", + "value", + "Base64" + ], + "additionalProperties": false + } + } + }, + "additionalProperties": false, + "required": [ + "osd", + "primary", + "errors" + ] + } + } + }, + "required": [ + "object", + "union_shard_errors", + "errors", + "shards" + ] + } + } + }, + "required": [ + "epoch", + "inconsistents" + ] +} diff --git a/doc/rados/command/list-inconsistent-snap.json b/doc/rados/command/list-inconsistent-snap.json new file mode 100644 index 000000000..55f1d53e9 --- /dev/null +++ b/doc/rados/command/list-inconsistent-snap.json @@ -0,0 +1,86 @@ +{ + "$schema": "http://json-schema.org/draft-04/schema#", + "type": "object", + "properties": { + "epoch": { + "description": "Scrub epoch", + "type": "integer" + }, + "inconsistents": { + "type": "array", + "items": { + "type": "object", + "properties": { + "name": { + "type": "string" + }, + "nspace": { + "type": "string" + }, + "locator": { + "type": "string" + }, + "snap": { + "oneOf": [ + { + "type": "string", + "enum": [ + "head", + "snapdir" + ] + }, + { + "type": "integer", + "minimum": 0 + } + ] + }, + "errors": { + "description": "Errors for this 
object's snap", + "type": "array", + "items": { + "enum": [ + "snapset_missing", + "snapset_corrupted", + "info_missing", + "info_corrupted", + "snapset_error", + "headless", + "size_mismatch", + "extra_clones", + "clone_missing" + ] + }, + "minItems": 0, + "uniqueItems": true + }, + "missing": { + "description": "List of missing clones if clone_missing error set", + "type": "array", + "items": { + "type": "integer" + } + }, + "extra_clones": { + "description": "List of extra clones if extra_clones error set", + "type": "array", + "items": { + "type": "integer" + } + } + }, + "required": [ + "name", + "nspace", + "locator", + "snap", + "errors" + ] + } + } + }, + "required": [ + "epoch", + "inconsistents" + ] +} diff --git a/doc/rados/configuration/auth-config-ref.rst b/doc/rados/configuration/auth-config-ref.rst new file mode 100644 index 000000000..5cc13ff6a --- /dev/null +++ b/doc/rados/configuration/auth-config-ref.rst @@ -0,0 +1,362 @@ +======================== + Cephx Config Reference +======================== + +The ``cephx`` protocol is enabled by default. Cryptographic authentication has +some computational costs, though they should generally be quite low. If the +network environment connecting your client and server hosts is very safe and +you cannot afford authentication, you can turn it off. **This is not generally +recommended**. + +.. note:: If you disable authentication, you are at risk of a man-in-the-middle + attack altering your client/server messages, which could lead to disastrous + security effects. + +For creating users, see `User Management`_. For details on the architecture +of Cephx, see `Architecture - High Availability Authentication`_. + + +Deployment Scenarios +==================== + +There are two main scenarios for deploying a Ceph cluster, which impact +how you initially configure Cephx. Most first time Ceph users use +``cephadm`` to create a cluster (easiest). 
For clusters using +other deployment tools (e.g., Chef, Juju, Puppet, etc.), you will need +to use the manual procedures or configure your deployment tool to +bootstrap your monitor(s). + +Manual Deployment +----------------- + +When you deploy a cluster manually, you have to bootstrap the monitor manually +and create the ``client.admin`` user and keyring. To bootstrap monitors, follow +the steps in `Monitor Bootstrapping`_. The steps for monitor bootstrapping are +the logical steps you must perform when using third party deployment tools like +Chef, Puppet, Juju, etc. + + +Enabling/Disabling Cephx +======================== + +Enabling Cephx requires that you have deployed keys for your monitors, +OSDs and metadata servers. If you are simply toggling Cephx on / off, +you do not have to repeat the bootstrapping procedures. + + +Enabling Cephx +-------------- + +When ``cephx`` is enabled, Ceph will look for the keyring in the default search +path, which includes ``/etc/ceph/$cluster.$name.keyring``. You can override +this location by adding a ``keyring`` option in the ``[global]`` section of +your `Ceph configuration`_ file, but this is not recommended. + +Execute the following procedures to enable ``cephx`` on a cluster with +authentication disabled. If you (or your deployment utility) have already +generated the keys, you may skip the steps related to generating keys. + +#. Create a ``client.admin`` key, and save a copy of the key for your client + host + + .. prompt:: bash $ + + ceph auth get-or-create client.admin mon 'allow *' mds 'allow *' mgr 'allow *' osd 'allow *' -o /etc/ceph/ceph.client.admin.keyring + + **Warning:** This will clobber any existing + ``/etc/ceph/client.admin.keyring`` file. Do not perform this step if a + deployment tool has already done it for you. Be careful! + +#. Create a keyring for your monitor cluster and generate a monitor + secret key. + + .. prompt:: bash $ + + ceph-authtool --create-keyring /tmp/ceph.mon.keyring --gen-key -n mon. 
--cap mon 'allow *' + +#. Copy the monitor keyring into a ``ceph.mon.keyring`` file in every monitor's + ``mon data`` directory. For example, to copy it to ``mon.a`` in cluster ``ceph``, + use the following + + .. prompt:: bash $ + + cp /tmp/ceph.mon.keyring /var/lib/ceph/mon/ceph-a/keyring + +#. Generate a secret key for every MGR, where ``{$id}`` is the MGR letter + + .. prompt:: bash $ + + ceph auth get-or-create mgr.{$id} mon 'allow profile mgr' mds 'allow *' osd 'allow *' -o /var/lib/ceph/mgr/ceph-{$id}/keyring + +#. Generate a secret key for every OSD, where ``{$id}`` is the OSD number + + .. prompt:: bash $ + + ceph auth get-or-create osd.{$id} mon 'allow rwx' osd 'allow *' -o /var/lib/ceph/osd/ceph-{$id}/keyring + +#. Generate a secret key for every MDS, where ``{$id}`` is the MDS letter + + .. prompt:: bash $ + + ceph auth get-or-create mds.{$id} mon 'allow rwx' osd 'allow *' mds 'allow *' mgr 'allow profile mds' -o /var/lib/ceph/mds/ceph-{$id}/keyring + +#. Enable ``cephx`` authentication by setting the following options in the + ``[global]`` section of your `Ceph configuration`_ file + + .. code-block:: ini + + auth_cluster_required = cephx + auth_service_required = cephx + auth_client_required = cephx + + +#. Start or restart the Ceph cluster. See `Operating a Cluster`_ for details. + +For details on bootstrapping a monitor manually, see `Manual Deployment`_. + + + +Disabling Cephx +--------------- + +The following procedure describes how to disable Cephx. If your cluster +environment is relatively safe, you can offset the computation expense of +running authentication. **We do not recommend it.** However, it may be easier +during setup and/or troubleshooting to temporarily disable authentication. + +#. Disable ``cephx`` authentication by setting the following options in the + ``[global]`` section of your `Ceph configuration`_ file + + .. 
code-block:: ini + + auth_cluster_required = none + auth_service_required = none + auth_client_required = none + + +#. Start or restart the Ceph cluster. See `Operating a Cluster`_ for details. + + +Configuration Settings +====================== + +Enablement +---------- + + +``auth_cluster_required`` + +:Description: If enabled, the Ceph Storage Cluster daemons (i.e., ``ceph-mon``, + ``ceph-osd``, ``ceph-mds`` and ``ceph-mgr``) must authenticate with + each other. Valid settings are ``cephx`` or ``none``. + +:Type: String +:Required: No +:Default: ``cephx``. + + +``auth_service_required`` + +:Description: If enabled, the Ceph Storage Cluster daemons require Ceph Clients + to authenticate with the Ceph Storage Cluster in order to access + Ceph services. Valid settings are ``cephx`` or ``none``. + +:Type: String +:Required: No +:Default: ``cephx``. + + +``auth_client_required`` + +:Description: If enabled, the Ceph Client requires the Ceph Storage Cluster to + authenticate with the Ceph Client. Valid settings are ``cephx`` + or ``none``. + +:Type: String +:Required: No +:Default: ``cephx``. + + +.. index:: keys; keyring + +Keys +---- + +When you run Ceph with authentication enabled, ``ceph`` administrative commands +and Ceph Clients require authentication keys to access the Ceph Storage Cluster. + +The most common way to provide these keys to the ``ceph`` administrative +commands and clients is to include a Ceph keyring under the ``/etc/ceph`` +directory. For Octopus and later releases using ``cephadm``, the filename +is usually ``ceph.client.admin.keyring`` (or ``$cluster.client.admin.keyring``). +If you include the keyring under the ``/etc/ceph`` directory, you don't need to +specify a ``keyring`` entry in your Ceph configuration file. + +We recommend copying the Ceph Storage Cluster's keyring file to nodes where you +will run administrative commands, because it contains the ``client.admin`` key. + +To perform this step manually, execute the following: + +.. 
prompt:: bash $ + + sudo scp {user}@{ceph-cluster-host}:/etc/ceph/ceph.client.admin.keyring /etc/ceph/ceph.client.admin.keyring + +.. tip:: Ensure the ``ceph.keyring`` file has appropriate permissions set + (e.g., ``chmod 644``) on your client machine. + +You may specify the key itself in the Ceph configuration file using the ``key`` +setting (not recommended), or a path to a keyfile using the ``keyfile`` setting. + + +``keyring`` + +:Description: The path to the keyring file. +:Type: String +:Required: No +:Default: ``/etc/ceph/$cluster.$name.keyring,/etc/ceph/$cluster.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin`` + + +``keyfile`` + +:Description: The path to a key file (i.e., a file containing only the key). +:Type: String +:Required: No +:Default: None + + +``key`` + +:Description: The key (i.e., the text string of the key itself). Not recommended. +:Type: String +:Required: No +:Default: None + + +Daemon Keyrings +--------------- + +Administrative users or deployment tools (e.g., ``cephadm``) may generate +daemon keyrings in the same way that they generate user keyrings. By default, Ceph +stores daemon keyrings inside their data directory. The default keyring +locations, and the capabilities necessary for the daemon to function, are shown +below. + +``ceph-mon`` + +:Location: ``$mon_data/keyring`` +:Capabilities: ``mon 'allow *'`` + +``ceph-osd`` + +:Location: ``$osd_data/keyring`` +:Capabilities: ``mgr 'allow profile osd' mon 'allow profile osd' osd 'allow *'`` + +``ceph-mds`` + +:Location: ``$mds_data/keyring`` +:Capabilities: ``mds 'allow' mgr 'allow profile mds' mon 'allow profile mds' osd 'allow rwx'`` + +``ceph-mgr`` + +:Location: ``$mgr_data/keyring`` +:Capabilities: ``mon 'allow profile mgr' mds 'allow *' osd 'allow *'`` + +``radosgw`` + +:Location: ``$rgw_data/keyring`` +:Capabilities: ``mon 'allow rwx' osd 'allow rwx'`` + + +.. 
note:: The monitor keyring (i.e., ``mon.``) contains a key but no + capabilities, and is not part of the cluster ``auth`` database. + +The daemon data directory locations default to directories of the form:: + + /var/lib/ceph/$type/$cluster-$id + +For example, ``osd.12`` would be:: + + /var/lib/ceph/osd/ceph-12 + +You can override these locations, but it is not recommended. + + +.. index:: signatures + +Signatures +---------- + +Ceph performs a signature check that provides some limited protection +against messages being tampered with in flight (e.g., by a "man in the +middle" attack). + +Like other parts of Ceph authentication, Ceph provides fine-grained control so +you can enable/disable signatures for service messages between clients and +Ceph, and so you can enable/disable signatures for messages between Ceph daemons. + +Note that even with signatures enabled data is not encrypted in +flight. + +``cephx_require_signatures`` + +:Description: If set to ``true``, Ceph requires signatures on all message + traffic between the Ceph Client and the Ceph Storage Cluster, and + between daemons comprising the Ceph Storage Cluster. + + Ceph Argonaut and Linux kernel versions prior to 3.19 do + not support signatures; if such clients are in use this + option can be turned off to allow them to connect. + +:Type: Boolean +:Required: No +:Default: ``false`` + + +``cephx_cluster_require_signatures`` + +:Description: If set to ``true``, Ceph requires signatures on all message + traffic between Ceph daemons comprising the Ceph Storage Cluster. + +:Type: Boolean +:Required: No +:Default: ``false`` + + +``cephx_service_require_signatures`` + +:Description: If set to ``true``, Ceph requires signatures on all message + traffic between Ceph Clients and the Ceph Storage Cluster. 
+ +:Type: Boolean +:Required: No +:Default: ``false`` + + +``cephx_sign_messages`` + +:Description: If the Ceph version supports message signing, Ceph will sign + all messages so they are more difficult to spoof. + +:Type: Boolean +:Default: ``true`` + + +Time to Live +------------ + +``auth_service_ticket_ttl`` + +:Description: When the Ceph Storage Cluster sends a Ceph Client a ticket for + authentication, the Ceph Storage Cluster assigns the ticket a + time to live. + +:Type: Double +:Default: ``60*60`` + + +.. _Monitor Bootstrapping: ../../../install/manual-deployment#monitor-bootstrapping +.. _Operating a Cluster: ../../operations/operating +.. _Manual Deployment: ../../../install/manual-deployment +.. _Ceph configuration: ../ceph-conf +.. _Architecture - High Availability Authentication: ../../../architecture#high-availability-authentication +.. _User Management: ../../operations/user-management diff --git a/doc/rados/configuration/bluestore-config-ref.rst b/doc/rados/configuration/bluestore-config-ref.rst new file mode 100644 index 000000000..3bfc8e295 --- /dev/null +++ b/doc/rados/configuration/bluestore-config-ref.rst @@ -0,0 +1,482 @@ +========================== +BlueStore Config Reference +========================== + +Devices +======= + +BlueStore manages either one, two, or (in certain cases) three storage +devices. + +In the simplest case, BlueStore consumes a single (primary) storage device. +The storage device is normally used as a whole, occupying the full device that +is managed directly by BlueStore. This *primary device* is normally identified +by a ``block`` symlink in the data directory. + +The data directory is a ``tmpfs`` mount which gets populated (at boot time, or +when ``ceph-volume`` activates it) with all the common OSD files that hold +information about the OSD, like: its identifier, which cluster it belongs to, +and its private keyring. 
+ +It is also possible to deploy BlueStore across one or two additional devices: + +* A *write-ahead log (WAL) device* (identified as ``block.wal`` in the data directory) can be + used for BlueStore's internal journal or write-ahead log. It is only useful + to use a WAL device if the device is faster than the primary device (e.g., + when it is on an SSD and the primary device is an HDD). +* A *DB device* (identified as ``block.db`` in the data directory) can be used + for storing BlueStore's internal metadata. BlueStore (or rather, the + embedded RocksDB) will put as much metadata as it can on the DB device to + improve performance. If the DB device fills up, metadata will spill back + onto the primary device (where it would have been otherwise). Again, it is + only helpful to provision a DB device if it is faster than the primary + device. + +If there is only a small amount of fast storage available (e.g., less +than a gigabyte), we recommend using it as a WAL device. If there is +more, provisioning a DB device makes more sense. The BlueStore +journal will always be placed on the fastest device available, so +using a DB device will provide the same benefit that the WAL device +would while *also* allowing additional metadata to be stored there (if +it will fit). This means that if a DB device is specified but an explicit +WAL device is not, the WAL will be implicitly colocated with the DB on the faster +device. + +A single-device (colocated) BlueStore OSD can be provisioned with: + +.. prompt:: bash $ + + ceph-volume lvm prepare --bluestore --data <device> + +To specify a WAL device and/or DB device: + +.. prompt:: bash $ + + ceph-volume lvm prepare --bluestore --data <device> --block.wal <wal-device> --block.db <db-device> + +.. note:: ``--data`` can be a Logical Volume using *vg/lv* notation. Other + devices can be existing logical volumes or GPT partitions. 
+ +Provisioning strategies +----------------------- +Although there are multiple ways to deploy a BlueStore OSD (unlike Filestore +which had just one), there are two common arrangements that should help clarify +the deployment strategy: + +.. _bluestore-single-type-device-config: + +**block (data) only** +^^^^^^^^^^^^^^^^^^^^^ +If all devices are the same type, for example all rotational drives, and +there are no fast devices to use for metadata, it makes sense to specify the +block device only and to not separate ``block.db`` or ``block.wal``. The +:ref:`ceph-volume-lvm` command for a single ``/dev/sda`` device looks like: + +.. prompt:: bash $ + + ceph-volume lvm create --bluestore --data /dev/sda + +If logical volumes have already been created for each device, (a single LV +using 100% of the device), then the :ref:`ceph-volume-lvm` call for an LV named +``ceph-vg/block-lv`` would look like: + +.. prompt:: bash $ + + ceph-volume lvm create --bluestore --data ceph-vg/block-lv + +.. _bluestore-mixed-device-config: + +**block and block.db** +^^^^^^^^^^^^^^^^^^^^^^ +If you have a mix of fast and slow devices (SSD / NVMe and rotational), +it is recommended to place ``block.db`` on the faster device while ``block`` +(data) lives on the slower (spinning drive). + +You must create these volume groups and logical volumes manually as +the ``ceph-volume`` tool is currently not able to do so automatically. + +For the below example, let us assume four rotational (``sda``, ``sdb``, ``sdc``, and ``sdd``) +and one (fast) solid state drive (``sdx``). First create the volume groups: + +.. prompt:: bash $ + + vgcreate ceph-block-0 /dev/sda + vgcreate ceph-block-1 /dev/sdb + vgcreate ceph-block-2 /dev/sdc + vgcreate ceph-block-3 /dev/sdd + +Now create the logical volumes for ``block``: + +.. 
prompt:: bash $ + + lvcreate -l 100%FREE -n block-0 ceph-block-0 + lvcreate -l 100%FREE -n block-1 ceph-block-1 + lvcreate -l 100%FREE -n block-2 ceph-block-2 + lvcreate -l 100%FREE -n block-3 ceph-block-3 + +We are creating 4 OSDs for the four slow spinning devices, so assuming a 200GB +SSD in ``/dev/sdx`` we will create 4 logical volumes, each of 50GB: + +.. prompt:: bash $ + + vgcreate ceph-db-0 /dev/sdx + lvcreate -L 50GB -n db-0 ceph-db-0 + lvcreate -L 50GB -n db-1 ceph-db-0 + lvcreate -L 50GB -n db-2 ceph-db-0 + lvcreate -L 50GB -n db-3 ceph-db-0 + +Finally, create the 4 OSDs with ``ceph-volume``: + +.. prompt:: bash $ + + ceph-volume lvm create --bluestore --data ceph-block-0/block-0 --block.db ceph-db-0/db-0 + ceph-volume lvm create --bluestore --data ceph-block-1/block-1 --block.db ceph-db-0/db-1 + ceph-volume lvm create --bluestore --data ceph-block-2/block-2 --block.db ceph-db-0/db-2 + ceph-volume lvm create --bluestore --data ceph-block-3/block-3 --block.db ceph-db-0/db-3 + +These operations should end up creating four OSDs, with ``block`` on the slower +rotational drives with a 50 GB logical volume (DB) for each on the solid state +drive. + +Sizing +====== +When using a :ref:`mixed spinning and solid drive setup +<bluestore-mixed-device-config>` it is important to make a large enough +``block.db`` logical volume for BlueStore. Generally, ``block.db`` should have +*as large as possible* logical volumes. + +The general recommendation is to have ``block.db`` size in between 1% to 4% +of ``block`` size. For RGW workloads, it is recommended that the ``block.db`` +size isn't smaller than 4% of ``block``, because RGW heavily uses it to store +metadata (omap keys). For example, if the ``block`` size is 1TB, then ``block.db`` shouldn't +be less than 40GB. For RBD workloads, 1% to 2% of ``block`` size is usually enough. 
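The sizing guidance above can be sketched as simple arithmetic. This is only an illustration, assuming decimal units (1 TB = 10^12 bytes); the helper name ``block_db_size_bytes`` is hypothetical and is not part of any Ceph tool.

```python
# Hypothetical helper (not a Ceph API): size block.db as a whole-number
# percentage of the block (data) device size.
GB = 10**9
TB = 10**12


def block_db_size_bytes(block_bytes: int, percent: int) -> int:
    """Return `percent` percent of `block_bytes`, using integer arithmetic."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range 1..100")
    return block_bytes * percent // 100


# RGW guidance from the text: not less than 4% of block.
rgw_min = block_db_size_bytes(1 * TB, 4)

# RBD guidance from the text: 1% to 2% of block is usually enough.
rbd_low = block_db_size_bytes(1 * TB, 1)
rbd_high = block_db_size_bytes(1 * TB, 2)

print(rgw_min // GB, rbd_low // GB, rbd_high // GB)
```

For a 1 TB ``block`` device this reproduces the 40 GB lower bound quoted for RGW, and gives 10 GB to 20 GB for the RBD range.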
+ +In older releases, internal level sizes mean that the DB can fully utilize only +specific partition / LV sizes that correspond to sums of L0, L0+L1, L1+L2, +etc. sizes, which with default settings means roughly 3 GB, 30 GB, 300 GB, and +so forth. Most deployments will not substantially benefit from sizing to +accommodate L3 and higher, though DB compaction can be facilitated by doubling +these figures to 6GB, 60GB, and 600GB. + +Improvements in releases beginning with Nautilus 14.2.12 and Octopus 15.2.6 +enable better utilization of arbitrary DB device sizes, and the Pacific +release brings experimental dynamic level support. Users of older releases may +thus wish to plan ahead by provisioning larger DB devices today so that their +benefits may be realized with future upgrades. + +When *not* using a mix of fast and slow devices, it isn't required to create +separate logical volumes for ``block.db`` (or ``block.wal``). BlueStore will +automatically colocate these within the space of ``block``. + + +Automatic Cache Sizing +====================== + +BlueStore can be configured to automatically resize its caches when TCMalloc +is configured as the memory allocator and the ``bluestore_cache_autotune`` +setting is enabled. This option is currently enabled by default. BlueStore +will attempt to keep OSD heap memory usage under a designated target size via +the ``osd_memory_target`` configuration option. This is a best effort +algorithm and caches will not shrink smaller than the amount specified by +``osd_memory_cache_min``. Cache ratios will be chosen based on a hierarchy +of priorities. If priority information is not available, the +``bluestore_cache_meta_ratio`` and ``bluestore_cache_kv_ratio`` options are +used as fallbacks. + +Manual Cache Sizing +=================== + +The amount of memory consumed by each OSD for BlueStore caches is +determined by the ``bluestore_cache_size`` configuration option. 
If +that config option is not set (i.e., remains at 0), there is a +different default value that is used depending on whether an HDD or +SSD is used for the primary device (set by the +``bluestore_cache_size_ssd`` and ``bluestore_cache_size_hdd`` config +options). + +BlueStore and the rest of the Ceph OSD daemon do the best they can +to work within this memory budget. Note that on top of the configured +cache size, there is also memory consumed by the OSD itself, and +some additional utilization due to memory fragmentation and other +allocator overhead. + +The configured cache memory budget can be used in a few different ways: + +* Key/Value metadata (i.e., RocksDB's internal cache) +* BlueStore metadata +* BlueStore data (i.e., recently read or written object data) + +Cache memory usage is governed by the following options: +``bluestore_cache_meta_ratio`` and ``bluestore_cache_kv_ratio``. +The fraction of the cache devoted to data +is governed by the effective bluestore cache size (depending on +``bluestore_cache_size[_ssd|_hdd]`` settings and the device class of the primary +device) as well as the meta and kv ratios. +The data fraction can be calculated by +``<effective_cache_size> * (1 - bluestore_cache_meta_ratio - bluestore_cache_kv_ratio)`` + +Checksums +========= + +BlueStore checksums all metadata and data written to disk. Metadata +checksumming is handled by RocksDB and uses `crc32c`. Data +checksumming is done by BlueStore and can make use of `crc32c`, +`xxhash32`, or `xxhash64`. The default is `crc32c` and should be +suitable for most purposes. + +Full data checksumming does increase the amount of metadata that +BlueStore must store and manage. When possible, e.g., when clients +hint that data is written and read sequentially, BlueStore will +checksum larger blocks, but in many cases it must store a checksum +value (usually 4 bytes) for every 4 kilobyte block of data. 
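As a rough illustration of the worst case described above (one checksum value per 4 kilobyte block), the following sketch estimates how much checksum metadata a given amount of data implies. The function name and its parameters are assumptions made for this example; they do not correspond to BlueStore options, and the estimate ignores the larger checksum blocks that BlueStore may use for sequential data.

```python
# Hypothetical estimate (not a Ceph API): checksum bytes needed when every
# `csum_block_bytes` of data carries one checksum of `csum_bytes` bytes.
def checksum_metadata_bytes(data_bytes: int,
                            csum_bytes: int = 4,
                            csum_block_bytes: int = 4096) -> int:
    # Round up: a partially filled block still needs its own checksum.
    blocks = -(-data_bytes // csum_block_bytes)
    return blocks * csum_bytes


one_tib = 2**40
overhead = checksum_metadata_bytes(one_tib)
print(overhead, overhead / one_tib)
```

Under these assumptions, 1 TiB of data needs 1 GiB of checksum values, which is slightly under 0.1% of the data size.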
+ +It is possible to use a smaller checksum value by truncating the +checksum to two or one byte, reducing the metadata overhead. The +trade-off is that the probability that a random error will not be +detected is higher with a smaller checksum, going from about one in +four billion with a 32-bit (4 byte) checksum to one in 65,536 for a +16-bit (2 byte) checksum or one in 256 for an 8-bit (1 byte) checksum. +The smaller checksum values can be used by selecting `crc32c_16` or +`crc32c_8` as the checksum algorithm. + +The *checksum algorithm* can be set either via a per-pool +``csum_type`` property or the global config option. For example: + +.. prompt:: bash $ + + ceph osd pool set <pool-name> csum_type <algorithm> + +Inline Compression +================== + +BlueStore supports inline compression using `snappy`, `zlib`, or +`lz4`. Please note that the `lz4` compression plugin is not +distributed in the official release. + +Whether data in BlueStore is compressed is determined by a combination +of the *compression mode* and any hints associated with a write +operation. The modes are: + +* **none**: Never compress data. +* **passive**: Do not compress data unless the write operation has a + *compressible* hint set. +* **aggressive**: Compress data unless the write operation has an + *incompressible* hint set. +* **force**: Try to compress data no matter what. + +For more information about the *compressible* and *incompressible* IO +hints, see :c:func:`rados_set_alloc_hint`. + +Note that regardless of the mode, if the size of the data chunk is not +reduced sufficiently it will not be used and the original +(uncompressed) data will be stored. For example, if the ``bluestore +compression required ratio`` is set to ``.7`` then the compressed data +must be 70% of the size of the original (or smaller). 
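The required-ratio rule above amounts to a single comparison. The sketch below is a simplified model, assuming the sizes are compared directly in bytes; the function name ``keep_compressed`` is hypothetical, and the real BlueStore decision may account for details (such as allocation granularity) that are not modeled here.

```python
# Hypothetical model (not a Ceph API) of the compression required ratio:
# keep the compressed chunk only if it is small enough relative to the
# original chunk.
def keep_compressed(original_len: int,
                    compressed_len: int,
                    required_ratio: float) -> bool:
    return compressed_len <= original_len * required_ratio


# With a required ratio of 0.7, a 64 KiB chunk must compress to at most
# 70% of its original size (about 45875 bytes) to be stored compressed.
print(keep_compressed(65536, 40000, 0.7))
print(keep_compressed(65536, 60000, 0.7))
```

When the function returns ``False``, the original (uncompressed) data would be stored instead, as the text describes.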
+ +The *compression mode*, *compression algorithm*, *compression required +ratio*, *min blob size*, and *max blob size* can be set either via a +per-pool property or a global config option. Pool properties can be +set with: + +.. prompt:: bash $ + + ceph osd pool set <pool-name> compression_algorithm <algorithm> + ceph osd pool set <pool-name> compression_mode <mode> + ceph osd pool set <pool-name> compression_required_ratio <ratio> + ceph osd pool set <pool-name> compression_min_blob_size <size> + ceph osd pool set <pool-name> compression_max_blob_size <size> + +.. _bluestore-rocksdb-sharding: + +RocksDB Sharding +================ + +Internally BlueStore uses multiple types of key-value data, +stored in RocksDB. Each data type in BlueStore is assigned a +unique prefix. Until Pacific all key-value data was stored in +a single RocksDB column family: 'default'. Since Pacific, +BlueStore can divide this data into multiple RocksDB column +families. When keys have similar access frequency, modification +frequency and lifetime, BlueStore benefits from better caching +and more precise compaction. This improves performance, and also +requires less disk space during compaction, since each column +family is smaller and can compact independently of the others. + +OSDs deployed in Pacific or later use RocksDB sharding by default. +If Ceph is upgraded to Pacific from a previous version, sharding remains +off for the existing OSDs. + +To enable sharding and apply the Pacific defaults, stop an OSD and run: + + .. prompt:: bash # + + ceph-bluestore-tool \ + --path <data path> \ + --sharding="m(3) p(3,0-12) O(3,0-13)=block_cache={type=binned_lru} L P" \ + reshard + + +Throttling +========== + +SPDK Usage +================== + +If you want to use the SPDK driver for NVMe devices, you must prepare your system. +Refer to `SPDK document`__ for more details. + +.. __: http://www.spdk.io/doc/getting_started.html#getting_started_examples + +SPDK offers a script to configure the device automatically.
Users can run the +script as root: + +.. prompt:: bash $ + + sudo src/spdk/scripts/setup.sh + +You will need to specify the subject NVMe device's device selector with +the "spdk:" prefix for ``bluestore_block_path``. + +For example, you can find the device selector of an Intel PCIe SSD with: + +.. prompt:: bash $ + + lspci -mm -n -D -d 8086:0953 + +The device selector always has the form of ``DDDD:BB:DD.FF`` or ``DDDD.BB.DD.FF``. + +and then set:: + + bluestore_block_path = "spdk:trtype:PCIe traddr:0000:01:00.0" + +Where ``0000:01:00.0`` is the device selector found in the output of ``lspci`` +command above. + +You may also specify a remote NVMeoF target over the TCP transport as in the +following example:: + + bluestore_block_path = "spdk:trtype:TCP traddr:10.67.110.197 trsvcid:4420 subnqn:nqn.2019-02.io.spdk:cnode1" + +To run multiple SPDK instances per node, you must specify the +amount of dpdk memory in MB that each instance will use, to make sure each +instance uses its own DPDK memory. + +In most cases, a single device can be used for data, DB, and WAL. We describe +this strategy as *colocating* these components. Be sure to enter the below +settings to ensure that all IOs are issued through SPDK.:: + + bluestore_block_db_path = "" + bluestore_block_db_size = 0 + bluestore_block_wal_path = "" + bluestore_block_wal_size = 0 + +Otherwise, the current implementation will populate the SPDK map files with +kernel file system symbols and will use the kernel driver to issue DB/WAL IO. + +Minimum Allocation Size +======================== + +There is a configured minimum amount of storage that BlueStore will allocate on +an OSD. In practice, this is the least amount of capacity that a RADOS object +can consume. The value of `bluestore_min_alloc_size` is derived from the +value of `bluestore_min_alloc_size_hdd` or `bluestore_min_alloc_size_ssd` +depending on the OSD's ``rotational`` attribute. 
This means that when an OSD +is created on an HDD, BlueStore will be initialized with the current value +of `bluestore_min_alloc_size_hdd`, and SSD OSDs (including NVMe devices) +with the value of `bluestore_min_alloc_size_ssd`. + +Through the Mimic release, the default values were 64KB and 16KB for rotational +(HDD) and non-rotational (SSD) media respectively. Octopus changed the default +for SSD (non-rotational) media to 4KB, and Pacific changed the default for HDD +(rotational) media to 4KB as well. + +These changes were driven by space amplification experienced by Ceph RADOS +Gateway (RGW) deployments that host large numbers of small files +(S3/Swift objects). + +For example, when an RGW client stores a 1KB S3 object, it is written to a +single RADOS object. With the default `min_alloc_size` value, 4KB of +underlying drive space is allocated. This means that roughly +(4KB - 1KB) == 3KB is allocated but never used, which corresponds to 300% +overhead or 25% efficiency. Similarly, a 5KB user object will be stored +as one 4KB and one 1KB RADOS object, again stranding 3KB of device capacity, +though in this case the overhead is a much smaller percentage. Think of this +in terms of the remainder from a modulus operation. The overhead *percentage* +thus decreases rapidly as user object size increases. + +An easily missed additional subtlety is that this +takes place for *each* replica. So when using the default three copies of +data (3R), a 1KB S3 object actually allocates 12KB of storage device +capacity, of which roughly 9KB is stranded. If erasure coding (EC) is used +instead of replication, the +amplification may be even higher: for a ``k=4,m=2`` pool, our 1KB S3 object +will allocate (6 * 4KB) = 24KB of device capacity. + +When an RGW bucket pool contains many relatively large user objects, the effect +of this phenomenon is often negligible, but should be considered for deployments +that expect a significant fraction of relatively small objects.
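The allocation arithmetic above can be reproduced with a short Python sketch. The helper names are hypothetical, and the erasure-coding model is a simplifying assumption: each of the ``k+m`` shards is taken to hold ``object_size / k`` bytes rounded up to one allocation unit, and stripe-unit effects are ignored.

```python
import math

KIB = 1024


def round_up(size: int, min_alloc_size: int = 4 * KIB) -> int:
    """Round size up to a whole number of allocation units (at least one)."""
    return max(1, math.ceil(size / min_alloc_size)) * min_alloc_size


def replicated_allocation(object_size: int, copies: int = 3,
                          min_alloc_size: int = 4 * KIB) -> int:
    """Device bytes allocated when every replica rounds up independently."""
    return copies * round_up(object_size, min_alloc_size)


def erasure_coded_allocation(object_size: int, k: int, m: int,
                             min_alloc_size: int = 4 * KIB) -> int:
    """Device bytes allocated when each of the k+m shards rounds up."""
    shard_size = math.ceil(object_size / k)
    return (k + m) * round_up(shard_size, min_alloc_size)


# A 1KB object: three replicas versus a k=4,m=2 erasure-coded pool.
replicated_1k = replicated_allocation(1 * KIB)
stranded_1k = replicated_1k - 3 * KIB
ec_1k = erasure_coded_allocation(1 * KIB, k=4, m=2)
```

Under these assumptions a 1KB object allocates 12KB with three replicas (9KB of which holds no user data) and 24KB in the ``k=4,m=2`` pool.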
+ +The 4KB default value aligns well with conventional HDD and SSD devices. Some +new coarse-IU (Indirection Unit) QLC SSDs however perform and wear best +when `bluestore_min_alloc_size_ssd` +is set at OSD creation to match the device's IU: 8KB, 16KB, or even 64KB. +These novel storage drives allow one to achieve read performance competitive +with conventional TLC SSDs and write performance faster than HDDs, with +high density and lower cost than TLC SSDs. + +Note that when creating OSDs on these devices, one must carefully apply the +non-default value only to appropriate devices, and not to conventional SSD and +HDD devices. This may be done through careful ordering of OSD creation, custom +OSD device classes, and especially by the use of central configuration *masks*. + +Quincy and later releases add +the `bluestore_use_optimal_io_size_for_min_alloc_size` +option that enables automatic discovery of the appropriate value as each OSD is +created. Note that the use of ``bcache``, ``OpenCAS``, ``dmcrypt``, +``ATA over Ethernet``, ``iSCSI``, or other device layering / abstraction +technologies may confound the determination of appropriate values. OSDs +deployed on top of VMware storage have been reported to also +sometimes report a ``rotational`` attribute that does not match the underlying +hardware. + +We suggest inspecting such OSDs at startup via logs and admin sockets to ensure that +behavior is appropriate. Note that this also may not work as desired with +older kernels. You can check for this by examining the presence and value +of ``/sys/block/<drive>/queue/optimal_io_size``. + +You may also inspect a given OSD: + +.. prompt:: bash # + + ceph osd metadata osd.1701 | grep rotational + +This space amplification may manifest as an unusually high ratio of raw to +stored data reported by ``ceph df``. ``ceph osd df`` may also report +anomalously high ``%USE`` / ``VAR`` values when +compared to other, ostensibly identical OSDs.
A pool using OSDs with +mismatched ``min_alloc_size`` values may experience unexpected balancer +behavior as well. + +Note that this BlueStore attribute takes effect *only* at OSD creation; if +changed later, a given OSD's behavior will not change unless / until it is +destroyed and redeployed with the appropriate option value(s). Upgrading +to a later Ceph release will *not* change the value used by OSDs deployed +under older releases or with other settings. + +DSA (Data Streaming Accelerator) Usage +====================================== + +If you want to use the DML library to drive a DSA device for offloading +read/write operations on persistent memory in BlueStore, you need to install the +`DML`_ and `idxd-config`_ libraries on a machine with an SPR (Sapphire Rapids) CPU. + +.. _DML: https://github.com/intel/DML +.. _idxd-config: https://github.com/intel/idxd-config + +After installing the DML software, you need to configure the shared +work queues (WQs) with the ``accel-config`` tool, as in the following WQ configuration example: + +.. prompt:: bash $ + + accel-config config-wq --group-id=1 --mode=shared --wq-size=16 --threshold=15 --type=user --name="MyApp1" --priority=10 --block-on-fault=1 dsa0/wq0.1 + accel-config config-engine dsa0/engine0.1 --group-id=1 + accel-config enable-device dsa0 + accel-config enable-wq dsa0/wq0.1 diff --git a/doc/rados/configuration/ceph-conf.rst b/doc/rados/configuration/ceph-conf.rst new file mode 100644 index 000000000..ad93598de --- /dev/null +++ b/doc/rados/configuration/ceph-conf.rst @@ -0,0 +1,689 @@ +.. _configuring-ceph: + +================== + Configuring Ceph +================== + +When Ceph services start, the initialization process activates a series +of daemons that run in the background.
A :term:`Ceph Storage Cluster` runs +at a minimum three types of daemons: + +- :term:`Ceph Monitor` (``ceph-mon``) +- :term:`Ceph Manager` (``ceph-mgr``) +- :term:`Ceph OSD Daemon` (``ceph-osd``) + +Ceph Storage Clusters that support the :term:`Ceph File System` also run at +least one :term:`Ceph Metadata Server` (``ceph-mds``). Clusters that +support :term:`Ceph Object Storage` run Ceph RADOS Gateway daemons +(``radosgw``) as well. + +Each daemon has a number of configuration options, each of which has a +default value. You may adjust the behavior of the system by changing these +configuration options. Be careful to understand the consequences before +overriding default values, as it is possible to significantly degrade the +performance and stability of your cluster. Also note that default values +sometimes change between releases, so it is best to review the version of +this documentation that aligns with your Ceph release. + +Option names +============ + +All Ceph configuration options have a unique name consisting of words +formed with lower-case characters and connected with underscore +(``_``) characters. + +When option names are specified on the command line, either underscore +(``_``) or dash (``-``) characters can be used interchangeably (e.g., +``--mon-host`` is equivalent to ``--mon_host``). + +When option names appear in configuration files, spaces can also be +used in place of underscore or dash. We suggest, though, that for +clarity and convenience you consistently use underscores, as we do +throughout this documentation. + +Config sources +============== + +Each Ceph daemon, process, and library will pull its configuration +from several sources, listed below. Sources later in the list will +override those earlier in the list when both are present.
+ +- the compiled-in default value +- the monitor cluster's centralized configuration database +- a configuration file stored on the local host +- environment variables +- command line arguments +- runtime overrides set by an administrator + +One of the first things a Ceph process does on startup is parse the +configuration options provided via the command line, environment, and +local configuration file. The process will then contact the monitor +cluster to retrieve configuration stored centrally for the entire +cluster. Once a complete view of the configuration is available, the +daemon or process startup will proceed. + +.. _bootstrap-options: + +Bootstrap options +----------------- + +Because some configuration options affect the process's ability to +contact the monitors, authenticate, and retrieve the cluster-stored +configuration, they may need to be stored locally on the node and set +in a local configuration file. These options include: + + - ``mon_host``, the list of monitors for the cluster + - ``mon_host_override``, the list of monitors for the cluster to + **initially** contact when beginning a new instance of communication with the + Ceph cluster. This overrides the known monitor list derived from MonMap + updates sent to older Ceph instances (like librados cluster handles). It is + expected this option is primarily useful for debugging. + - ``mon_dns_srv_name`` (default: `ceph-mon`), the name of the DNS + SRV record to check to identify the cluster monitors via DNS + - ``mon_data``, ``osd_data``, ``mds_data``, ``mgr_data``, and + similar options that define which local directory the daemon + stores its data in. + - ``keyring``, ``keyfile``, and/or ``key``, which can be used to + specify the authentication credential to use to authenticate with + the monitor. Note that in most cases the default keyring location + is in the data directory specified above. 
+ +In the vast majority of cases the default values of these are +appropriate, with the exception of the ``mon_host`` option that +identifies the addresses of the cluster's monitors. When DNS is used +to identify monitors, a local Ceph configuration file can be avoided +entirely. + +Skipping monitor config +----------------------- + +Any process may be passed the option ``--no-mon-config`` to skip the +step that retrieves configuration from the cluster monitors. This is +useful in cases where configuration is managed entirely via +configuration files or where the monitor cluster is currently down but +some maintenance activity needs to be done. + + +.. _ceph-conf-file: + + +Configuration sections +====================== + +Any given process or daemon has a single value for each configuration +option. However, values for an option may vary across different +daemon types, and even across daemons of the same type. Ceph options that are +stored in the monitor configuration database or in local configuration +files are grouped into sections to indicate which daemons or clients +they apply to. + +These sections include: + +``global`` + +:Description: Settings under ``global`` affect all daemons and clients + in a Ceph Storage Cluster. + +:Example: ``log_file = /var/log/ceph/$cluster-$type.$id.log`` + +``mon`` + +:Description: Settings under ``mon`` affect all ``ceph-mon`` daemons in + the Ceph Storage Cluster, and override the same setting in + ``global``. + +:Example: ``mon_cluster_log_to_syslog = true`` + + +``mgr`` + +:Description: Settings in the ``mgr`` section affect all ``ceph-mgr`` daemons in + the Ceph Storage Cluster, and override the same setting in + ``global``. + +:Example: ``mgr_stats_period = 10`` + +``osd`` + +:Description: Settings under ``osd`` affect all ``ceph-osd`` daemons in + the Ceph Storage Cluster, and override the same setting in + ``global``.
+ +:Example: ``osd_op_queue = wpq`` + +``mds`` + +:Description: Settings in the ``mds`` section affect all ``ceph-mds`` daemons in + the Ceph Storage Cluster, and override the same setting in + ``global``. + +:Example: ``mds_cache_memory_limit = 10G`` + +``client`` + +:Description: Settings under ``client`` affect all Ceph Clients + (e.g., mounted Ceph File Systems, mounted Ceph Block Devices, + etc.) as well as Rados Gateway (RGW) daemons. + +:Example: ``objecter_inflight_ops = 512`` + + +Sections may also specify an individual daemon or client name. For example, +``mon.foo``, ``osd.123``, and ``client.smith`` are all valid section names. + + +Any given daemon will draw its settings from the global section, the +daemon or client type section, and the section sharing its name. +Settings in the most-specific section take precedence, so for example +if the same option is specified in all of ``global``, ``mon``, and +``mon.foo`` on the same source (i.e., in the same configuration file), +the ``mon.foo`` value will be used. + +If multiple values of the same configuration option are specified in the same +section, the last value wins. + +Note that values from the local configuration file always take +precedence over values from the monitor configuration database, +regardless of which section they appear in. + + +.. _ceph-metavariables: + +Metavariables +============= + +Metavariables simplify Ceph Storage Cluster configuration +dramatically. When a metavariable is set in a configuration value, +Ceph expands the metavariable into a concrete value at the time the +configuration value is used. Ceph metavariables are similar to variable expansion in the Bash shell. + +Ceph supports the following metavariables: + +``$cluster`` + +:Description: Expands to the Ceph Storage Cluster name. Useful when running + multiple Ceph Storage Clusters on the same hardware.
+ +:Example: ``/etc/ceph/$cluster.keyring`` +:Default: ``ceph`` + + +``$type`` + +:Description: Expands to a daemon or process type (e.g., ``mds``, ``osd``, or ``mon``) + +:Example: ``/var/lib/ceph/$type`` + + +``$id`` + +:Description: Expands to the daemon or client identifier. For + ``osd.0``, this would be ``0``; for ``mds.a``, it would + be ``a``. + +:Example: ``/var/lib/ceph/$type/$cluster-$id`` + + +``$host`` + +:Description: Expands to the host name where the process is running. + + +``$name`` + +:Description: Expands to ``$type.$id``. +:Example: ``/var/run/ceph/$cluster-$name.asok`` + +``$pid`` + +:Description: Expands to daemon pid. +:Example: ``/var/run/ceph/$cluster-$name-$pid.asok`` + + + +The Configuration File +====================== + +On startup, Ceph processes search for a configuration file in the +following locations: + +#. ``$CEPH_CONF`` (*i.e.,* the path following the ``$CEPH_CONF`` + environment variable) +#. ``-c path/path`` (*i.e.,* the ``-c`` command line argument) +#. ``/etc/ceph/$cluster.conf`` +#. ``~/.ceph/$cluster.conf`` +#. ``./$cluster.conf`` (*i.e.,* in the current working directory) +#. On FreeBSD systems only, ``/usr/local/etc/ceph/$cluster.conf`` + +where ``$cluster`` is the cluster's name (default ``ceph``). + +The Ceph configuration file uses an *ini* style syntax. You can add comment +text after a pound sign (#) or a semi-colon (;). For example: + +.. code-block:: ini + + # <--A number (#) sign precedes a comment. + ; A comment may be anything. + # Comments always follow a semi-colon (;) or a pound (#) on each line. + # The end of the line terminates a comment. + # We recommend that you provide comments in your configuration file(s). + + +.. _ceph-conf-settings: + +Config file section names +------------------------- + +The configuration file is divided into sections. Each section must begin with a +valid configuration section name (see `Configuration sections`_, above) +surrounded by square brackets. For example, + +.. 
code-block:: ini + + [global] + debug_ms = 0 + + [osd] + debug_ms = 1 + + [osd.1] + debug_ms = 10 + + [osd.2] + debug_ms = 10 + + +Config file option values +------------------------- + +The value of a configuration option is a string. If it is too long to +fit in a single line, you can put a backslash (``\``) at the end of the line +as the line continuation marker, so the value of the option will be +the string after ``=`` in the current line combined with the string in the next +line:: + + [global] + foo = long long ago\ + long ago + +In the example above, the value of "``foo``" would be "``long long ago long ago``". + +Normally, the option value ends with a new line, or a comment, like + +.. code-block:: ini + + [global] + obscure_one = difficult to explain # I will try harder in next release + simpler_one = nothing to explain + +In the example above, the value of "``obscure_one``" would be "``difficult to explain``"; +and the value of "``simpler_one``" would be "``nothing to explain``". + +If an option value contains spaces, and we want to make it explicit, we +could quote the value using single or double quotes, like + +.. code-block:: ini + + [global] + line = "to be, or not to be" + +Certain characters are not allowed to be present in the option values directly. +They are ``=``, ``#``, ``;`` and ``[``. If we have to, we need to escape them, +like + +.. code-block:: ini + + [global] + secret = "i love \# and \[" + +Every configuration option is typed with one of the types below: + +``int`` + +:Description: 64-bit signed integer. Some SI prefixes are supported, like "K", "M", "G", + "T", "P", "E", meaning, respectively, 10\ :sup:`3`, 10\ :sup:`6`, + 10\ :sup:`9`, etc. And "B" is the only supported unit. So, "1K", "1M", "128B" and "-1" are all valid + option values. Sometimes, a negative value implies "unlimited" when it comes to + an option for threshold or limit. +:Example: ``42``, ``-1`` + +``uint`` + +:Description: It is almost identical to ``int``.
But a negative value will be rejected. +:Example: ``256``, ``0`` + +``str`` + +:Description: Free style strings encoded in UTF-8, but some characters are not allowed. Please + reference the above notes for the details. +:Example: ``"hello world"``, ``"i love \#"``, ``yet-another-name`` + +``boolean`` + +:Description: one of the two values ``true`` or ``false``. But an integer is also accepted, + where "0" implies ``false``, and any non-zero values imply ``true``. +:Example: ``true``, ``false``, ``1``, ``0`` + +``addr`` + +:Description: a single address optionally prefixed with ``v1``, ``v2`` or ``any`` for the messenger + protocol. If the prefix is not specified, ``v2`` protocol is used. Please see + :ref:`address_formats` for more details. +:Example: ``v1:1.2.3.4:567``, ``v2:1.2.3.4:567``, ``1.2.3.4:567``, ``2409:8a1e:8fb6:aa20:1260:4bff:fe92:18f5::567``, ``[::1]:6789`` + +``addrvec`` + +:Description: a set of addresses separated by ",". The addresses can be optionally quoted with ``[`` and ``]``. +:Example: ``[v1:1.2.3.4:567,v2:1.2.3.4:568]``, ``v1:1.2.3.4:567,v1:1.2.3.14:567`` ``[2409:8a1e:8fb6:aa20:1260:4bff:fe92:18f5::567], [2409:8a1e:8fb6:aa20:1260:4bff:fe92:18f5::568]`` + +``uuid`` + +:Description: the string format of a uuid defined by `RFC4122 <https://www.ietf.org/rfc/rfc4122.txt>`_. + And some variants are also supported, for more details, see + `Boost document <https://www.boost.org/doc/libs/1_74_0/libs/uuid/doc/uuid.html#String%20Generator>`_. +:Example: ``f81d4fae-7dec-11d0-a765-00a0c91e6bf6`` + +``size`` + +:Description: denotes a 64-bit unsigned integer. Both SI prefixes and IEC prefixes are + supported. And "B" is the only supported unit. A negative value will be + rejected. +:Example: ``1Ki``, ``1K``, ``1KiB`` and ``1B``. + +``secs`` + +:Description: denotes a duration of time. By default the unit is second if not specified. 
+ Following units of time are supported: + + * second: "s", "sec", "second", "seconds" + * minute: "m", "min", "minute", "minutes" + * hour: "hs", "hr", "hour", "hours" + * day: "d", "day", "days" + * week: "w", "wk", "week", "weeks" + * month: "mo", "month", "months" + * year: "y", "yr", "year", "years" +:Example: ``1 m``, ``1m`` and ``1 week`` + +.. _ceph-conf-database: + +Monitor configuration database +============================== + +The monitor cluster manages a database of configuration options that +can be consumed by the entire cluster, enabling streamlined central +configuration management for the entire system. The vast majority of +configuration options can and should be stored here for ease of +administration and transparency. + +A handful of settings may still need to be stored in local +configuration files because they affect the ability to connect to the +monitors, authenticate, and fetch configuration information. In most +cases this is limited to the ``mon_host`` option, although this can +also be avoided through the use of DNS SRV records. + +Sections and masks +------------------ + +Configuration options stored by the monitor can live in a global +section, daemon type section, or specific daemon section, just like +options in a configuration file can. + +In addition, options may also have a *mask* associated with them to +further restrict which daemons or clients the option applies to. +Masks take two forms: + +#. ``type:location`` where *type* is a CRUSH property like `rack` or + `host`, and *location* is a value for that property. For example, + ``host:foo`` would limit the option only to daemons or clients + running on a particular host. +#. ``class:device-class`` where *device-class* is the name of a CRUSH + device class (e.g., ``hdd`` or ``ssd``). For example, + ``class:ssd`` would limit the option only to OSDs backed by SSDs. + (This mask has no effect for non-OSD daemons or clients.) 
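The two mask forms listed above can be illustrated with a small Python sketch. The function ``mask_matches`` and the dictionary describing a daemon's CRUSH location are hypothetical; they only model the matching rule as described here and are not taken from the Ceph code base.

```python
from typing import Dict, Optional


def mask_matches(mask: str, crush_location: Dict[str, str],
                 device_class: Optional[str] = None) -> bool:
    """Return True if a daemon or client is selected by the given mask.

    mask is either "class:<device-class>" or "<type>:<location>",
    where <type> is a CRUSH property such as "host" or "rack".
    """
    kind, _, value = mask.partition(":")
    if kind == "class":
        # Only OSDs have a device class; for anything else
        # device_class is None and the mask never matches.
        return device_class == value
    return crush_location.get(kind) == value


# Hypothetical OSD located on host "foo" in rack "r1", backed by an SSD.
location = {"host": "foo", "rack": "r1"}
on_host_foo = mask_matches("host:foo", location, device_class="ssd")
in_rack_r2 = mask_matches("rack:r2", location, device_class="ssd")
is_ssd = mask_matches("class:ssd", location, device_class="ssd")
```

In this model, a monitor or client is represented by leaving ``device_class`` as ``None``, so a ``class:`` mask never selects it.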
+ +When setting a configuration option, the `who` may be a section name, +a mask, or a combination of both separated by a slash (``/``) +character. For example, ``osd/rack:foo`` would mean all OSD daemons +in the ``foo`` rack. + +When viewing configuration options, the section name and mask are +generally separated out into separate fields or columns to ease readability. + + +Commands +-------- + +The following CLI commands are used to configure the cluster: + +* ``ceph config dump`` will dump the entire configuration database for + the cluster. + +* ``ceph config get <who>`` will dump the configuration for a specific + daemon or client (e.g., ``mds.a``), as stored in the monitors' + configuration database. + +* ``ceph config set <who> <option> <value>`` will set a configuration + option in the monitors' configuration database. + +* ``ceph config show <who>`` will show the reported running + configuration for a running daemon. These settings may differ from + those stored by the monitors if there are also local configuration + files in use or options have been overridden on the command line or + at run time. The source of the option values is reported as part + of the output. + +* ``ceph config assimilate-conf -i <input file> -o <output file>`` + will ingest a configuration file from *input file* and move any + valid options into the monitors' configuration database. Any + settings that are unrecognized, invalid, or cannot be controlled by + the monitor will be returned in an abbreviated config file stored in + *output file*. This command is useful for transitioning from legacy + configuration files to centralized monitor-based configuration. + + +Help +==== + +You can get help for a particular option with: + +.. prompt:: bash $ + + ceph config help <option> + +Note that this will use the configuration schema that is compiled into the running monitors. 
If you have a mixed-version cluster (e.g., during an upgrade), you might also want to query the option schema from a specific running daemon: + +.. prompt:: bash $ + + ceph daemon <name> config help [option] + +For example: + +.. prompt:: bash $ + + ceph config help log_file + +:: + + log_file - path to log file + (std::string, basic) + Default (non-daemon): + Default (daemon): /var/log/ceph/$cluster-$name.log + Can update at runtime: false + See also: [log_to_stderr,err_to_stderr,log_to_syslog,err_to_syslog] + +or: + +.. prompt:: bash $ + + ceph config help log_file -f json-pretty + +:: + + { + "name": "log_file", + "type": "std::string", + "level": "basic", + "desc": "path to log file", + "long_desc": "", + "default": "", + "daemon_default": "/var/log/ceph/$cluster-$name.log", + "tags": [], + "services": [], + "see_also": [ + "log_to_stderr", + "err_to_stderr", + "log_to_syslog", + "err_to_syslog" + ], + "enum_values": [], + "min": "", + "max": "", + "can_update_at_runtime": false + } + +The ``level`` property can be any of `basic`, `advanced`, or `dev`. +The `dev` options are intended for use by developers, generally for +testing purposes, and are not recommended for use by operators. + + +Runtime Changes +=============== + +In most cases, Ceph allows you to make changes to the configuration of +a daemon at runtime. This capability is quite useful for +increasing/decreasing logging output, enabling/disabling debug +settings, and even for runtime optimization. + +Generally speaking, configuration options can be updated in the usual +way via the ``ceph config set`` command. For example, to enable the debug log level on a specific OSD: + +.. prompt:: bash $ + + ceph config set osd.123 debug_ms 20 + +Note that if the same option is also customized in a local +configuration file, the monitor setting will be ignored (it has a +lower priority than the local config file).
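The priority rule in the note above follows from the ordering of configuration sources, where later sources override earlier ones. A minimal Python sketch of that resolution is shown here; the source labels and the ``effective_value`` helper are assumptions made for illustration, not Ceph's internal representation.

```python
from typing import Dict, Tuple

# Configuration sources from lowest to highest priority. The labels are
# illustrative names for: compiled-in default, monitor configuration
# database, local configuration file, environment variables, command
# line arguments, and runtime overrides.
SOURCES = ["default", "mon", "file", "env", "cmdline", "override"]


def effective_value(values_by_source: Dict[str, str]) -> Tuple[str, str]:
    """Return (source, value) for the highest-priority source present."""
    for source in reversed(SOURCES):
        if source in values_by_source:
            return source, values_by_source[source]
    raise KeyError("option has no value from any source")


# debug_ms is set to 20 in the monitor database but to 1 in a local
# configuration file: the local file wins.
winner = effective_value({"default": "0/0", "mon": "20", "file": "1"})
```

In this example ``winner`` is ``("file", "1")``, so the value stored in the monitors has no effect until the local setting is removed.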
+ +Override values +--------------- + +You can also temporarily set an option using the `tell` or `daemon` +interfaces on the Ceph CLI. These *override* values are ephemeral in +that they only affect the running process and are discarded/lost if +the daemon or process restarts. + +Override values can be set in two ways: + +#. From any host, we can send a message to a daemon over the network with: + + .. prompt:: bash $ + + ceph tell <name> config set <option> <value> + + For example: + + .. prompt:: bash $ + + ceph tell osd.123 config set debug_osd 20 + + The `tell` command can also accept a wildcard for the daemon + identifier. For example, to adjust the debug level on all OSD + daemons: + + .. prompt:: bash $ + + ceph tell osd.* config set debug_osd 20 + +#. From the host the process is running on, we can connect directly to + the process via a socket in ``/var/run/ceph`` with: + + .. prompt:: bash $ + + ceph daemon <name> config set <option> <value> + + For example: + + .. prompt:: bash $ + + ceph daemon osd.4 config set debug_osd 20 + +Note that in the ``ceph config show`` command output these temporary +values will be shown with a source of ``override``. + + +Viewing runtime settings +======================== + +You can see the current options set for a running daemon with the ``ceph config show`` command. For example: + +.. prompt:: bash $ + + ceph config show osd.0 + +will show you the (non-default) options for that daemon. You can also look at a specific option with: + +.. prompt:: bash $ + + ceph config show osd.0 debug_osd + +or view all options (even those with default values) with: + +.. prompt:: bash $ + + ceph config show-with-defaults osd.0 + +You can also observe settings for a running daemon by connecting to it from the local host via the admin socket. For example: + +.. prompt:: bash $ + + ceph daemon osd.0 config show + +will dump all current settings: + +.. 
prompt:: bash $ + + ceph daemon osd.0 config diff + +will show only non-default settings (as well as where the value came from: a config file, the monitor, an override, etc.), and: + +.. prompt:: bash $ + + ceph daemon osd.0 config get debug_osd + +will report the value of a single option. + + + +Changes since Nautilus +====================== + +With the Octopus release we changed the way the configuration file is parsed. +These changes are as follows: + +- Repeated configuration options are allowed, and no warnings will be printed. + The value of the last one is used, which means that the setting last in the file + is the one that takes effect. Before this change, we would print warning messages + when lines with duplicated options were encountered, like:: + + warning line 42: 'foo' in section 'bar' redefined + +- Before Octopus, options containing invalid UTF-8 were ignored with warning + messages. Since Octopus, they are treated as fatal errors. + +- Backslash ``\`` is used as the line continuation marker to combine the next + line with the current one. Before Octopus, it was required to follow a backslash with + a non-empty line. But in Octopus, an empty line following a backslash is now allowed. + +- In the configuration file, each line specifies an individual configuration + option. The option's name and its value are separated with ``=``, and the + value may be quoted using single or double quotes. If an invalid + configuration is specified, we will treat it as an invalid configuration + file :: + + bad option ==== bad value + +- Before Octopus, if no section name was specified in the configuration file, + all options would be set as though they were within the ``global`` section. This is + now discouraged. Since Octopus, only a single option is allowed for + configuration files without a section name.
diff --git a/doc/rados/configuration/common.rst b/doc/rados/configuration/common.rst new file mode 100644 index 000000000..709c8bce2 --- /dev/null +++ b/doc/rados/configuration/common.rst @@ -0,0 +1,218 @@ + +.. _ceph-conf-common-settings: + +Common Settings +=============== + +The `Hardware Recommendations`_ section provides some hardware guidelines for +configuring a Ceph Storage Cluster. It is possible for a single :term:`Ceph +Node` to run multiple daemons. For example, a single node with multiple drives +may run one ``ceph-osd`` for each drive. Ideally, you will have a node for a +particular type of process. For example, some nodes may run ``ceph-osd`` +daemons, other nodes may run ``ceph-mds`` daemons, and still other nodes may +run ``ceph-mon`` daemons. + +Each node has a name identified by the ``host`` setting. Monitors also specify +a network address and port (i.e., domain name or IP address) identified by the +``addr`` setting. A basic configuration file will typically specify only +minimal settings for each instance of monitor daemons. For example: + +.. code-block:: ini + + [global] + mon_initial_members = ceph1 + mon_host = 10.0.0.1 + + +.. important:: The ``host`` setting is the short name of the node (i.e., not + an fqdn). It is **NOT** an IP address either. Enter ``hostname -s`` on + the command line to retrieve the name of the node. Do not use ``host`` + settings for anything other than initial monitors unless you are deploying + Ceph manually. You **MUST NOT** specify ``host`` under individual daemons + when using deployment tools like ``chef`` or ``cephadm``, as those tools + will enter the appropriate values for you in the cluster map. + + +.. _ceph-network-config: + +Networks +======== + +See the `Network Configuration Reference`_ for a detailed discussion about +configuring a network for use with Ceph. 
+ + +Monitors +======== + +Production Ceph clusters typically provision a minimum of three :term:`Ceph Monitor` +daemons to ensure availability should a monitor instance crash. A minimum of +three ensures that the Paxos algorithm can determine which version +of the :term:`Ceph Cluster Map` is the most recent from a majority of Ceph +Monitors in the quorum. + +.. note:: You may deploy Ceph with a single monitor, but if the instance fails, + the lack of other monitors may interrupt data service availability. + +Ceph Monitors normally listen on port ``3300`` for the new v2 protocol, and ``6789`` for the old v1 protocol. + +By default, Ceph expects to store monitor data under the +following path:: + + /var/lib/ceph/mon/$cluster-$id + +You or a deployment tool (e.g., ``cephadm``) must create the corresponding +directory. With metavariables fully expressed and a cluster named "ceph", the +foregoing directory would evaluate to:: + + /var/lib/ceph/mon/ceph-a + +For additional details, see the `Monitor Config Reference`_. + +.. _Monitor Config Reference: ../mon-config-ref + + +.. _ceph-osd-config: + + +Authentication +============== + +.. versionadded:: Bobtail 0.56 + +For Bobtail (v 0.56) and beyond, you should expressly enable or disable +authentication in the ``[global]`` section of your Ceph configuration file. + +.. code-block:: ini + + auth_cluster_required = cephx + auth_service_required = cephx + auth_client_required = cephx + +Additionally, you should enable message signing. See `Cephx Config Reference`_ for details. + +.. _Cephx Config Reference: ../auth-config-ref + + +.. _ceph-monitor-config: + + +OSDs +==== + +Ceph production clusters typically deploy :term:`Ceph OSD Daemons` where one node +has one OSD daemon running a Filestore on one storage device. The BlueStore back +end is now default, but when using Filestore you specify a journal size. For example: + +.. 
code-block:: ini + + [osd] + osd_journal_size = 10000 + + [osd.0] + host = {hostname} #manual deployments only. + + +By default, Ceph expects to store a Ceph OSD Daemon's data at the +following path:: + + /var/lib/ceph/osd/$cluster-$id + +You or a deployment tool (e.g., ``cephadm``) must create the corresponding +directory. With metavariables fully expressed and a cluster named "ceph", this +example would evaluate to:: + + /var/lib/ceph/osd/ceph-0 + +You may override this path using the ``osd_data`` setting. We recommend not +changing the default location. Create the default directory on your OSD host: + +.. prompt:: bash $ + + ssh {osd-host} + sudo mkdir /var/lib/ceph/osd/ceph-{osd-number} + +The ``osd_data`` path ideally leads to a mount point with a device that is +separate from the device that contains the operating system and +daemons. If an OSD is to use a device other than the OS device, prepare it for +use with Ceph, and mount it to the directory you just created: + +.. prompt:: bash $ + + ssh {osd-host} + sudo mkfs -t {fstype} /dev/{disk} + sudo mount -o user_xattr /dev/{disk} /var/lib/ceph/osd/ceph-{osd-number} + +We recommend using the ``xfs`` file system when running +:command:`mkfs`. (``btrfs`` and ``ext4`` are not recommended and are no +longer tested.) + +See the `OSD Config Reference`_ for additional configuration details. + + +Heartbeats +========== + +During runtime operations, Ceph OSD Daemons check up on other Ceph OSD Daemons +and report their findings to the Ceph Monitor. You do not have to provide any +settings. However, if you have network latency issues, you may wish to modify +the settings. + +See `Configuring Monitor/OSD Interaction`_ for additional details. + + +.. _ceph-logging-and-debugging: + +Logs / Debugging +================ + +Sometimes you may encounter issues with Ceph that require +modifying logging output and using Ceph's debugging. See `Debugging and +Logging`_ for details on log rotation. + +.. 
_Debugging and Logging: ../../troubleshooting/log-and-debug + + +Example ceph.conf +================= + +.. literalinclude:: demo-ceph.conf + :language: ini + +.. _ceph-runtime-config: + + + +Running Multiple Clusters (DEPRECATED) +====================================== + +Each Ceph cluster has an internal name that is used as part of configuration +and log file names as well as directory and mountpoint names. This name +defaults to "ceph". Previous releases of Ceph allowed one to specify a custom +name instead, for example "ceph2". This was intended to facilitate running +multiple logical clusters on the same physical hardware, but in practice this +was rarely exploited and should no longer be attempted. Prior documentation +could also be misinterpreted as requiring unique cluster names in order to +use ``rbd-mirror``. + +Custom cluster names are now considered deprecated and the ability to deploy +them has already been removed from some tools, though existing custom name +deployments continue to operate. The ability to run and manage clusters with +custom names may be progressively removed by future Ceph releases, so it is +strongly recommended to deploy all new clusters with the default name "ceph". + +Some Ceph CLI commands accept an optional ``--cluster`` (cluster name) option. This +option is present purely for backward compatibility and need not be accommodated +by new tools and deployments. + +If you do need to allow multiple clusters to exist on the same host, please use +:ref:`cephadm`, which uses containers to fully isolate each cluster. + + + + + +.. _Hardware Recommendations: ../../../start/hardware-recommendations +.. _Network Configuration Reference: ../network-config-ref +.. _OSD Config Reference: ../osd-config-ref +.. 
_Configuring Monitor/OSD Interaction: ../mon-osd-interaction diff --git a/doc/rados/configuration/demo-ceph.conf b/doc/rados/configuration/demo-ceph.conf new file mode 100644 index 000000000..58bb7061f --- /dev/null +++ b/doc/rados/configuration/demo-ceph.conf @@ -0,0 +1,31 @@ +[global] +fsid = {cluster-id} +mon_initial_members = {hostname}[, {hostname}] +mon_host = {ip-address}[, {ip-address}] + +#All clusters have a front-side public network. +#If you have two network interfaces, you can configure a private / cluster +#network for RADOS object replication, heartbeats, backfill, +#recovery, etc. +public_network = {network}[, {network}] +#cluster_network = {network}[, {network}] + +#Clusters require authentication by default. +auth_cluster_required = cephx +auth_service_required = cephx +auth_client_required = cephx + +#Choose reasonable numbers for journals, number of replicas +#and placement groups. +osd_journal_size = {n} +osd_pool_default_size = {n} # Write an object n times. +osd_pool_default_min_size = {n} # Allow writing n copies in a degraded state. +osd_pool_default_pg_num = {n} +osd_pool_default_pgp_num = {n} + +#Choose a reasonable crush leaf type. +#0 for a 1-node cluster. +#1 for a multi node cluster in a single rack +#2 for a multi node, multi chassis cluster with multiple hosts in a chassis +#3 for a multi node cluster with hosts across racks, etc. +osd_crush_chooseleaf_type = {n}
\ No newline at end of file diff --git a/doc/rados/configuration/filestore-config-ref.rst b/doc/rados/configuration/filestore-config-ref.rst new file mode 100644 index 000000000..435a800a8 --- /dev/null +++ b/doc/rados/configuration/filestore-config-ref.rst @@ -0,0 +1,367 @@ +============================ + Filestore Config Reference +============================ + +The Filestore back end is no longer the default when creating new OSDs, +though Filestore OSDs are still supported. + +``filestore debug omap check`` + +:Description: Debugging check on synchronization. Expensive. For debugging only. +:Type: Boolean +:Required: No +:Default: ``false`` + + +.. index:: filestore; extended attributes + +Extended Attributes +=================== + +Extended Attributes (XATTRs) are important for Filestore OSDs. +Some file systems have limits on the number of bytes that can be stored in XATTRs. +Additionally, in some cases, the file system may not be as fast as an alternative +method of storing XATTRs. The following settings may help improve performance +by using a method of storing XATTRs that is extrinsic to the underlying file system. + +Ceph XATTRs are stored as ``inline xattr``, using the XATTRs provided +by the underlying file system, if it does not impose a size limit. If +there is a size limit (4KB total on ext4, for instance), some Ceph +XATTRs will be stored in a key/value database when either the +``filestore_max_inline_xattr_size`` or ``filestore_max_inline_xattrs`` +threshold is reached. + + +``filestore_max_inline_xattr_size`` + +:Description: The maximum size of an XATTR stored in the file system (i.e., XFS, + Btrfs, EXT4, etc.) per object. Should not be larger than the + file system can handle. Default value of 0 means to use the value + specific to the underlying file system. +:Type: Unsigned 32-bit Integer +:Required: No +:Default: ``0`` + + +``filestore_max_inline_xattr_size_xfs`` + +:Description: The maximum size of an XATTR stored in the XFS file system. 
+ Only used if ``filestore_max_inline_xattr_size`` == 0. +:Type: Unsigned 32-bit Integer +:Required: No +:Default: ``65536`` + + +``filestore_max_inline_xattr_size_btrfs`` + +:Description: The maximum size of an XATTR stored in the Btrfs file system. + Only used if ``filestore_max_inline_xattr_size`` == 0. +:Type: Unsigned 32-bit Integer +:Required: No +:Default: ``2048`` + + +``filestore_max_inline_xattr_size_other`` + +:Description: The maximum size of an XATTR stored in other file systems. + Only used if ``filestore_max_inline_xattr_size`` == 0. +:Type: Unsigned 32-bit Integer +:Required: No +:Default: ``512`` + + +``filestore_max_inline_xattrs`` + +:Description: The maximum number of XATTRs stored in the file system per object. + Default value of 0 means to use the value specific to the + underlying file system. +:Type: 32-bit Integer +:Required: No +:Default: ``0`` + + +``filestore_max_inline_xattrs_xfs`` + +:Description: The maximum number of XATTRs stored in the XFS file system per object. + Only used if ``filestore_max_inline_xattrs`` == 0. +:Type: 32-bit Integer +:Required: No +:Default: ``10`` + + +``filestore_max_inline_xattrs_btrfs`` + +:Description: The maximum number of XATTRs stored in the Btrfs file system per object. + Only used if ``filestore_max_inline_xattrs`` == 0. +:Type: 32-bit Integer +:Required: No +:Default: ``10`` + + +``filestore_max_inline_xattrs_other`` + +:Description: The maximum number of XATTRs stored in other file systems per object. + Only used if ``filestore_max_inline_xattrs`` == 0. +:Type: 32-bit Integer +:Required: No +:Default: ``2`` + +.. index:: filestore; synchronization + +Synchronization Intervals +========================= + +Filestore needs to periodically quiesce writes and synchronize the +file system, which creates a consistent commit point. It can then free journal +entries up to the commit point. 
Synchronizing more frequently tends to reduce +the time required to perform synchronization, and reduces the amount of data +that needs to remain in the journal. Less frequent synchronization allows the +backing file system to coalesce small writes and metadata updates more +optimally, potentially resulting in more efficient synchronization at the +expense of potentially increasing tail latency. + +``filestore_max_sync_interval`` + +:Description: The maximum interval in seconds for synchronizing Filestore. +:Type: Double +:Required: No +:Default: ``5`` + + +``filestore_min_sync_interval`` + +:Description: The minimum interval in seconds for synchronizing Filestore. +:Type: Double +:Required: No +:Default: ``.01`` + + +.. index:: filestore; flusher + +Flusher +======= + +The Filestore flusher forces data from large writes to be written out using +``sync_file_range`` before the sync in order to (hopefully) reduce the cost of +the eventual sync. In practice, disabling 'filestore_flusher' seems to improve +performance in some cases. + + +``filestore_flusher`` + +:Description: Enables the filestore flusher. +:Type: Boolean +:Required: No +:Default: ``false`` + +.. deprecated:: v.65 + +``filestore_flusher_max_fds`` + +:Description: Sets the maximum number of file descriptors for the flusher. +:Type: Integer +:Required: No +:Default: ``512`` + +.. deprecated:: v.65 + +``filestore_sync_flush`` + +:Description: Enables the synchronization flusher. +:Type: Boolean +:Required: No +:Default: ``false`` + +.. deprecated:: v.65 + +``filestore_fsync_flushes_journal_data`` + +:Description: Flush journal data during file system synchronization. +:Type: Boolean +:Required: No +:Default: ``false`` + + +.. index:: filestore; queue + +Queue +===== + +The following settings provide limits on the size of the Filestore queue. + +``filestore_queue_max_ops`` + +:Description: Defines the maximum number of in progress operations the file store accepts before blocking on queuing new operations. 
+:Type: Integer +:Required: No. Minimal impact on performance. +:Default: ``50`` + + +``filestore_queue_max_bytes`` + +:Description: The maximum number of bytes for an operation. +:Type: Integer +:Required: No +:Default: ``100 << 20`` + + + + +.. index:: filestore; timeouts + +Timeouts +======== + + +``filestore_op_threads`` + +:Description: The number of file system operation threads that execute in parallel. +:Type: Integer +:Required: No +:Default: ``2`` + + +``filestore_op_thread_timeout`` + +:Description: The timeout for a file system operation thread (in seconds). +:Type: Integer +:Required: No +:Default: ``60`` + + +``filestore_op_thread_suicide_timeout`` + +:Description: The timeout for a commit operation before cancelling the commit (in seconds). +:Type: Integer +:Required: No +:Default: ``180`` + + +.. index:: filestore; btrfs + +B-Tree Filesystem +================= + + +``filestore_btrfs_snap`` + +:Description: Enable snapshots for a ``btrfs`` filestore. +:Type: Boolean +:Required: No. Only used for ``btrfs``. +:Default: ``true`` + + +``filestore_btrfs_clone_range`` + +:Description: Enable cloning ranges for a ``btrfs`` filestore. +:Type: Boolean +:Required: No. Only used for ``btrfs``. +:Default: ``true`` + + +.. index:: filestore; journal + +Journal +======= + + +``filestore_journal_parallel`` + +:Description: Enables parallel journaling, default for Btrfs. +:Type: Boolean +:Required: No +:Default: ``false`` + + +``filestore_journal_writeahead`` + +:Description: Enables writeahead journaling, default for XFS. +:Type: Boolean +:Required: No +:Default: ``false`` + + +``filestore_journal_trailing`` + +:Description: Deprecated, never use. 
+:Type: Boolean +:Required: No +:Default: ``false`` + + +Misc +==== + + +``filestore_merge_threshold`` + +:Description: Min number of files in a subdir before merging into parent + NOTE: A negative value means to disable subdir merging +:Type: Integer +:Required: No +:Default: ``-10`` + + +``filestore_split_multiple`` + +:Description: ``(filestore_split_multiple * abs(filestore_merge_threshold) + (rand() % filestore_split_rand_factor)) * 16`` + is the maximum number of files in a subdirectory before + splitting into child directories. + +:Type: Integer +:Required: No +:Default: ``2`` + + +``filestore_split_rand_factor`` + +:Description: A random factor added to the split threshold to avoid + too many (expensive) Filestore splits occurring at once. See + ``filestore_split_multiple`` for details. + This can only be changed offline for an existing OSD, + via the ``ceph-objectstore-tool apply-layout-settings`` command. + +:Type: Unsigned 32-bit Integer +:Required: No +:Default: ``20`` + + +``filestore_update_to`` + +:Description: Limits Filestore auto upgrade to specified version. +:Type: Integer +:Required: No +:Default: ``1000`` + + +``filestore_blackhole`` + +:Description: Drop any new transactions on the floor. +:Type: Boolean +:Required: No +:Default: ``false`` + + +``filestore_dump_file`` + +:Description: File onto which store transaction dumps. +:Type: Boolean +:Required: No +:Default: ``false`` + + +``filestore_kill_at`` + +:Description: inject a failure at the n'th opportunity +:Type: String +:Required: No +:Default: ``false`` + + +``filestore_fail_eio`` + +:Description: Fail/Crash on eio. 
+:Type: Boolean +:Required: No +:Default: ``true`` + diff --git a/doc/rados/configuration/general-config-ref.rst b/doc/rados/configuration/general-config-ref.rst new file mode 100644 index 000000000..fc837c395 --- /dev/null +++ b/doc/rados/configuration/general-config-ref.rst @@ -0,0 +1,66 @@ +========================== + General Config Reference +========================== + + +``fsid`` + +:Description: The file system ID. One per cluster. +:Type: UUID +:Required: No. +:Default: N/A. Usually generated by deployment tools. + + +``admin_socket`` + +:Description: The socket for executing administrative commands on a daemon, + irrespective of whether Ceph Monitors have established a quorum. + +:Type: String +:Required: No +:Default: ``/var/run/ceph/$cluster-$name.asok`` + + +``pid_file`` + +:Description: The file in which the mon, osd or mds will write its + PID. For instance, ``/var/run/$cluster/$type.$id.pid`` + will create /var/run/ceph/mon.a.pid for the ``mon`` with + id ``a`` running in the ``ceph`` cluster. The ``pid + file`` is removed when the daemon stops gracefully. If + the process is not daemonized (i.e. runs with the ``-f`` + or ``-d`` option), the ``pid file`` is not created. +:Type: String +:Required: No +:Default: No + + +``chdir`` + +:Description: The directory Ceph daemons change to once they are + up and running. Default ``/`` directory recommended. + +:Type: String +:Required: No +:Default: ``/`` + + +``max_open_files`` + +:Description: If set, when the :term:`Ceph Storage Cluster` starts, Ceph sets + the max open FDs at the OS level (i.e., the max # of file + descriptors). A suitably large value prevents Ceph Daemons from running out + of file descriptors. 
+ +:Type: 64-bit Integer +:Required: No +:Default: ``0`` + + +``fatal_signal_handlers`` + +:Description: If set, we will install signal handlers for SEGV, ABRT, BUS, ILL, + FPE, XCPU, XFSZ, SYS signals to generate a useful log message + +:Type: Boolean +:Default: ``true`` diff --git a/doc/rados/configuration/index.rst b/doc/rados/configuration/index.rst new file mode 100644 index 000000000..414fcc6fa --- /dev/null +++ b/doc/rados/configuration/index.rst @@ -0,0 +1,54 @@ +=============== + Configuration +=============== + +Each Ceph process, daemon, or utility draws its configuration from several +sources on startup. Such sources can include (1) a local configuration, (2) the +monitors, (3) the command line, and (4) environment variables. + +Configuration options can be set globally so that they apply (1) to all +daemons, (2) to all daemons or services of a particular type, or (3) to only a +specific daemon, process, or client. + +.. raw:: html + + <table cellpadding="10"><colgroup><col width="50%"><col width="50%"></colgroup><tbody valign="top"><tr><td><h3>Configuring the Object Store</h3> + +For general object store configuration, refer to the following: + +.. toctree:: + :maxdepth: 1 + + Storage devices <storage-devices> + ceph-conf + + +.. raw:: html + + </td><td><h3>Reference</h3> + +To optimize the performance of your cluster, refer to the following: + +.. toctree:: + :maxdepth: 1 + + Common Settings <common> + Network Settings <network-config-ref> + Messenger v2 protocol <msgr2> + Auth Settings <auth-config-ref> + Monitor Settings <mon-config-ref> + mon-lookup-dns + Heartbeat Settings <mon-osd-interaction> + OSD Settings <osd-config-ref> + DmClock Settings <mclock-config-ref> + BlueStore Settings <bluestore-config-ref> + FileStore Settings <filestore-config-ref> + Journal Settings <journal-ref> + Pool, PG & CRUSH Settings <pool-pg-config-ref.rst> + Messaging Settings <ms-ref> + General Settings <general-config-ref> + + +.. 
raw:: html + + </td></tr></tbody></table> diff --git a/doc/rados/configuration/journal-ref.rst b/doc/rados/configuration/journal-ref.rst new file mode 100644 index 000000000..71c74c606 --- /dev/null +++ b/doc/rados/configuration/journal-ref.rst @@ -0,0 +1,119 @@ +========================== + Journal Config Reference +========================== + +.. index:: journal; journal configuration + +Filestore OSDs use a journal for two reasons: speed and consistency. Note +that since Luminous, the BlueStore OSD back end has been preferred and default. +This information is provided for pre-existing OSDs and for rare situations where +Filestore is preferred for new deployments. + +- **Speed:** The journal enables the Ceph OSD Daemon to commit small writes + quickly. Ceph writes small, random i/o to the journal sequentially, which + tends to speed up bursty workloads by allowing the backing file system more + time to coalesce writes. The Ceph OSD Daemon's journal, however, can lead + to spiky performance with short spurts of high-speed writes followed by + periods without any write progress as the file system catches up to the + journal. + +- **Consistency:** Ceph OSD Daemons require a file system interface that + guarantees atomic compound operations. Ceph OSD Daemons write a description + of the operation to the journal and apply the operation to the file system. + This enables atomic updates to an object (for example, placement group + metadata). Every few seconds--between ``filestore max sync interval`` and + ``filestore min sync interval``--the Ceph OSD Daemon stops writes and + synchronizes the journal with the file system, allowing Ceph OSD Daemons to + trim operations from the journal and reuse the space. On failure, Ceph + OSD Daemons replay the journal starting after the last synchronization + operation. + +Ceph OSD Daemons recognize the following journal settings: + + +``journal_dio`` + +:Description: Enables direct i/o to the journal. 
Requires ``journal block + align`` set to ``true``. + +:Type: Boolean +:Required: Yes when using ``aio``. +:Default: ``true`` + + + +``journal_aio`` + +.. versionchanged:: 0.61 Cuttlefish + +:Description: Enables using ``libaio`` for asynchronous writes to the journal. + Requires ``journal dio`` set to ``true``. + +:Type: Boolean +:Required: No. +:Default: Version 0.61 and later, ``true``. Version 0.60 and earlier, ``false``. + + +``journal_block_align`` + +:Description: Block aligns write operations. Required for ``dio`` and ``aio``. +:Type: Boolean +:Required: Yes when using ``dio`` and ``aio``. +:Default: ``true`` + + +``journal_max_write_bytes`` + +:Description: The maximum number of bytes the journal will write at + any one time. + +:Type: Integer +:Required: No +:Default: ``10 << 20`` + + +``journal_max_write_entries`` + +:Description: The maximum number of entries the journal will write at + any one time. + +:Type: Integer +:Required: No +:Default: ``100`` + + +``journal_queue_max_ops`` + +:Description: The maximum number of operations allowed in the queue at + any one time. + +:Type: Integer +:Required: No +:Default: ``500`` + + +``journal_queue_max_bytes`` + +:Description: The maximum number of bytes allowed in the queue at + any one time. + +:Type: Integer +:Required: No +:Default: ``10 << 20`` + + +``journal_align_min_size`` + +:Description: Align data payloads greater than the specified minimum. +:Type: Integer +:Required: No +:Default: ``64 << 10`` + + +``journal_zero_on_create`` + +:Description: Causes the file store to overwrite the entire journal with + ``0``'s during ``mkfs``. +:Type: Boolean +:Required: No +:Default: ``false`` diff --git a/doc/rados/configuration/mclock-config-ref.rst b/doc/rados/configuration/mclock-config-ref.rst new file mode 100644 index 000000000..579056895 --- /dev/null +++ b/doc/rados/configuration/mclock-config-ref.rst @@ -0,0 +1,395 @@ +======================== + mClock Config Reference +======================== + +.. 
index:: mclock; configuration + +Mclock profiles mask the low level details from users, making it +easier for them to configure mclock. + +The following input parameters are required for a mclock profile to configure +the QoS related parameters: + +* total capacity (IOPS) of each OSD (determined automatically) + +* an mclock profile type to enable + +Using the settings in the specified profile, the OSD determines and applies the +lower-level mclock and Ceph parameters. The parameters applied by the mclock +profile make it possible to tune the QoS between client I/O, recovery/backfill +operations, and other background operations (for example, scrub, snap trim, and +PG deletion). These background activities are considered best-effort internal +clients of Ceph. + + +.. index:: mclock; profile definition + +mClock Profiles - Definition and Purpose +======================================== + +A mclock profile is *“a configuration setting that when applied on a running +Ceph cluster enables the throttling of the operations(IOPS) belonging to +different client classes (background recovery, scrub, snaptrim, client op, +osd subop)”*. + +The mclock profile uses the capacity limits and the mclock profile type selected +by the user to determine the low-level mclock resource control parameters. + +Depending on the profile type, lower-level mclock resource-control parameters +and some Ceph-configuration parameters are transparently applied. + +The low-level mclock resource control parameters are the *reservation*, +*limit*, and *weight* that provide control of the resource shares, as +described in the :ref:`dmclock-qos` section. + + +.. 
index:: mclock; profile types + +mClock Profile Types +==================== + +mclock profiles can be broadly classified into two types, + +- **Built-in**: Users can choose between the following built-in profile types: + + - **high_client_ops** (*default*): + This profile allocates more reservation and limit to external-client ops + as compared to background recoveries and other internal clients within + Ceph. This profile is enabled by default. + - **high_recovery_ops**: + This profile allocates more reservation to background recoveries as + compared to external clients and other internal clients within Ceph. For + example, an admin may enable this profile temporarily to speed-up background + recoveries during non-peak hours. + - **balanced**: + This profile allocates equal reservation to client ops and background + recovery ops. + +- **Custom**: This profile gives users complete control over all the mclock + configuration parameters. Using this profile is not recommended without + a deep understanding of mclock and related Ceph-configuration options. + +.. note:: Across the built-in profiles, internal clients of mclock (for example + "scrub", "snap trim", and "pg deletion") are given slightly lower + reservations, but higher weight and no limit. This ensures that + these operations are able to complete quickly if there are no other + competing services. + + +.. index:: mclock; built-in profiles + +mClock Built-in Profiles +======================== + +When a built-in profile is enabled, the mClock scheduler calculates the low +level mclock parameters [*reservation*, *weight*, *limit*] based on the profile +enabled for each client type. The mclock parameters are calculated based on +the max OSD capacity provided beforehand. 
As a result, the following mclock +config parameters cannot be modified when using any of the built-in profiles: + +- ``osd_mclock_scheduler_client_res`` +- ``osd_mclock_scheduler_client_wgt`` +- ``osd_mclock_scheduler_client_lim`` +- ``osd_mclock_scheduler_background_recovery_res`` +- ``osd_mclock_scheduler_background_recovery_wgt`` +- ``osd_mclock_scheduler_background_recovery_lim`` +- ``osd_mclock_scheduler_background_best_effort_res`` +- ``osd_mclock_scheduler_background_best_effort_wgt`` +- ``osd_mclock_scheduler_background_best_effort_lim`` + +The following Ceph options will not be modifiable by the user: + +- ``osd_max_backfills`` +- ``osd_recovery_max_active`` + +This is because the above options are internally modified by the mclock +scheduler in order to maximize the impact of the set profile. + +By default, the *high_client_ops* profile is enabled to ensure that a larger +chunk of the bandwidth allocation goes to client ops. Background recovery ops +are given lower allocation (and therefore take a longer time to complete). But +there might be instances that necessitate giving higher allocations to either +client ops or recovery ops. In order to deal with such a situation, you can +enable one of the alternate built-in profiles by following the steps mentioned +in the next section. + +If any mClock profile (including "custom") is active, the following Ceph config +sleep options will be disabled, + +- ``osd_recovery_sleep`` +- ``osd_recovery_sleep_hdd`` +- ``osd_recovery_sleep_ssd`` +- ``osd_recovery_sleep_hybrid`` +- ``osd_scrub_sleep`` +- ``osd_delete_sleep`` +- ``osd_delete_sleep_hdd`` +- ``osd_delete_sleep_ssd`` +- ``osd_delete_sleep_hybrid`` +- ``osd_snap_trim_sleep`` +- ``osd_snap_trim_sleep_hdd`` +- ``osd_snap_trim_sleep_ssd`` +- ``osd_snap_trim_sleep_hybrid`` + +The above sleep options are disabled to ensure that mclock scheduler is able to +determine when to pick the next op from its operation queue and transfer it to +the operation sequencer. 
This results in the desired QoS being provided across +all its clients. + + +.. index:: mclock; enable built-in profile + +Steps to Enable mClock Profile +============================== + +As already mentioned, the default mclock profile is set to *high_client_ops*. +The other values for the built-in profiles include *balanced* and +*high_recovery_ops*. + +If there is a requirement to change the default profile, then the option +``osd_mclock_profile`` may be set during runtime by using the following +command: + + .. prompt:: bash # + + ceph config set osd.N osd_mclock_profile <value> + +For example, to change the profile to allow faster recoveries on "osd.0", the +following command can be used to switch to the *high_recovery_ops* profile: + + .. prompt:: bash # + + ceph config set osd.0 osd_mclock_profile high_recovery_ops + +.. note:: The *custom* profile is not recommended unless you are an advanced + user. + +And that's it! You are ready to run workloads on the cluster and check if the +QoS requirements are being met. + + +OSD Capacity Determination (Automated) +====================================== + +The OSD capacity in terms of total IOPS is determined automatically during OSD +initialization. This is achieved by running the OSD bench tool and overriding +the default value of ``osd_mclock_max_capacity_iops_[hdd, ssd]`` option +depending on the device type. No other action/input is expected from the user +to set the OSD capacity. You may verify the capacity of an OSD after the +cluster is brought up by using the following command: + + .. prompt:: bash # + + ceph config show osd.N osd_mclock_max_capacity_iops_[hdd, ssd] + +For example, the following command shows the max capacity for "osd.0" on a Ceph +node whose underlying device type is SSD: + + .. prompt:: bash # + + ceph config show osd.0 osd_mclock_max_capacity_iops_ssd + + +Steps to Manually Benchmark an OSD (Optional) +============================================= + +.. 
note:: These steps are only necessary if you want to override the OSD + capacity already determined automatically during OSD initialization. + Otherwise, you may skip this section entirely. + +.. tip:: If you have already determined the benchmark data and wish to manually + override the max osd capacity for an OSD, you may skip to section + `Specifying Max OSD Capacity`_. + + +Any existing benchmarking tool can be used for this purpose. In this case, the +steps use the *Ceph OSD Bench* command described in the next section. Regardless +of the tool/command used, the steps outlined further below remain the same. + +As already described in the :ref:`dmclock-qos` section, the number of +shards and the bluestore's throttle parameters have an impact on the mclock op +queues. Therefore, it is critical to set these values carefully in order to +maximize the impact of the mclock scheduler. + +:Number of Operational Shards: + We recommend using the default number of shards as defined by the + configuration options ``osd_op_num_shards``, ``osd_op_num_shards_hdd``, and + ``osd_op_num_shards_ssd``. In general, a lower number of shards will increase + the impact of the mclock queues. + +:Bluestore Throttle Parameters: + We recommend using the default values as defined by + ``bluestore_throttle_bytes`` and ``bluestore_throttle_deferred_bytes``. But + these parameters may also be determined during the benchmarking phase as + described below. + + +OSD Bench Command Syntax +```````````````````````` + +The :ref:`osd-subsystem` section describes the OSD bench command. The syntax +used for benchmarking is shown below : + +.. 
prompt:: bash # + + ceph tell osd.N bench [TOTAL_BYTES] [BYTES_PER_WRITE] [OBJ_SIZE] [NUM_OBJS] + +where, + +* ``TOTAL_BYTES``: Total number of bytes to write +* ``BYTES_PER_WRITE``: Block size per write +* ``OBJ_SIZE``: Bytes per object +* ``NUM_OBJS``: Number of objects to write + +Benchmarking Test Steps Using OSD Bench +``````````````````````````````````````` + +The steps below use the default shards and detail the steps used to determine +the correct bluestore throttle values (optional). + +#. Bring up your Ceph cluster and login to the Ceph node hosting the OSDs that + you wish to benchmark. +#. Run a simple 4KiB random write workload on an OSD using the following + commands: + + .. note:: Note that before running the test, caches must be cleared to get an + accurate measurement. + + For example, if you are running the benchmark test on osd.0, run the following + commands: + + .. prompt:: bash # + + ceph tell osd.0 cache drop + + .. prompt:: bash # + + ceph tell osd.0 bench 12288000 4096 4194304 100 + +#. Note the overall throughput(IOPS) obtained from the output of the osd bench + command. This value is the baseline throughput(IOPS) when the default + bluestore throttle options are in effect. +#. If the intent is to determine the bluestore throttle values for your + environment, then set the two options, ``bluestore_throttle_bytes`` + and ``bluestore_throttle_deferred_bytes`` to 32 KiB(32768 Bytes) each + to begin with. Otherwise, you may skip to the next section. +#. Run the 4KiB random write test as before using OSD bench. +#. Note the overall throughput from the output and compare the value + against the baseline throughput recorded in step 3. +#. If the throughput doesn't match with the baseline, increment the bluestore + throttle options by 2x and repeat steps 5 through 7 until the obtained + throughput is very close to the baseline value. 
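The doubling search in steps 4 through 7 can be sketched as a small script. This is only an illustration under stated assumptions, not part of the documented procedure: the ``DRY_RUN`` switch, the helper names, the 5% tolerance, and the 256 KiB upper bound are all hypothetical choices. With ``DRY_RUN=1`` (the default here) the ``ceph`` commands are only printed.

```shell
#!/usr/bin/env bash
# Sketch of the throttle doubling search (steps 4-7 above).
# Assumptions, not from the Ceph docs: DRY_RUN, helper names,
# the 5% tolerance, and the 256 KiB cap are illustrative only.

OSD="${OSD:-osd.0}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print the ceph commands instead of running them

# Run a command, or just echo it in dry-run mode.
run() {
  if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# Print candidate throttle values: start, 2*start, 4*start, ... up to max.
throttle_candidates() {
  local v="$1" max="$2"
  while [ "$v" -le "$max" ]; do
    echo "$v"
    v=$((v * 2))
  done
}

# Succeed if measured IOPS is within pct percent below the baseline.
close_to_baseline() {
  local measured="$1" baseline="$2" pct="${3:-5}"
  [ $((measured * 100)) -ge $((baseline * (100 - pct))) ]
}

# Start at 32 KiB and double each round, as described in the steps.
for bytes in $(throttle_candidates 32768 262144); do
  run ceph config set "$OSD" bluestore_throttle_bytes "$bytes"
  run ceph config set "$OSD" bluestore_throttle_deferred_bytes "$bytes"
  run ceph tell "$OSD" cache drop
  run ceph tell "$OSD" bench 12288000 4096 4194304 100
  # Read the throughput from the bench output by hand, then stop once
  # close_to_baseline <measured> <baseline> succeeds.
done
```

In a real run you would set ``DRY_RUN=0``, record the baseline from step 3 first, and stop the loop at the first value for which the measured throughput is close to that baseline.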
+ +For example, during benchmarking on a machine with NVMe SSDs, a value of 256 KiB +for both bluestore throttle and deferred bytes was determined to maximize the +impact of mclock. For HDDs, the corresponding value was 40 MiB, where the +overall throughput was roughly equal to the baseline throughput. Note that in +general for HDDs, the bluestore throttle values are expected to be higher when +compared to SSDs. + + +Specifying Max OSD Capacity +```````````````````````````` + +The steps in this section may be performed only if you want to override the +max osd capacity automatically set during OSD initialization. The option +``osd_mclock_max_capacity_iops_[hdd, ssd]`` for an OSD can be set by running the +following command: + + .. prompt:: bash # + + ceph config set osd.N osd_mclock_max_capacity_iops_[hdd,ssd] <value> + +For example, the following command sets the max capacity for a specific OSD +(say "osd.0") whose underlying device type is HDD to 350 IOPS: + + .. prompt:: bash # + + ceph config set osd.0 osd_mclock_max_capacity_iops_hdd 350 + +Alternatively, you may specify the max capacity for OSDs within the Ceph +configuration file under the respective [osd.N] section. See +:ref:`ceph-conf-settings` for more details. + + +.. index:: mclock; config settings + +mClock Config Options +===================== + +``osd_mclock_profile`` + +:Description: This sets the type of mclock profile to use for providing QoS + based on operations belonging to different classes (background + recovery, scrub, snaptrim, client op, osd subop). Once a built-in + profile is enabled, the lower level mclock resource control + parameters [*reservation, weight, limit*] and some Ceph + configuration parameters are set transparently. Note that the + above does not apply for the *custom* profile. 
+ +:Type: String +:Valid Choices: high_client_ops, high_recovery_ops, balanced, custom +:Default: ``high_client_ops`` + +``osd_mclock_max_capacity_iops_hdd`` + +:Description: Max IOPS capacity (at 4KiB block size) to consider per OSD (for + rotational media) + +:Type: Float +:Default: ``315.0`` + +``osd_mclock_max_capacity_iops_ssd`` + +:Description: Max IOPS capacity (at 4KiB block size) to consider per OSD (for + solid state media) + +:Type: Float +:Default: ``21500.0`` + +``osd_mclock_cost_per_io_usec`` + +:Description: Cost per IO in microseconds to consider per OSD (overrides _ssd + and _hdd if non-zero) + +:Type: Float +:Default: ``0.0`` + +``osd_mclock_cost_per_io_usec_hdd`` + +:Description: Cost per IO in microseconds to consider per OSD (for rotational + media) + +:Type: Float +:Default: ``25000.0`` + +``osd_mclock_cost_per_io_usec_ssd`` + +:Description: Cost per IO in microseconds to consider per OSD (for solid state + media) + +:Type: Float +:Default: ``50.0`` + +``osd_mclock_cost_per_byte_usec`` + +:Description: Cost per byte in microseconds to consider per OSD (overrides _ssd + and _hdd if non-zero) + +:Type: Float +:Default: ``0.0`` + +``osd_mclock_cost_per_byte_usec_hdd`` + +:Description: Cost per byte in microseconds to consider per OSD (for rotational + media) + +:Type: Float +:Default: ``5.2`` + +``osd_mclock_cost_per_byte_usec_ssd`` + +:Description: Cost per byte in microseconds to consider per OSD (for solid state + media) + +:Type: Float +:Default: ``0.011`` diff --git a/doc/rados/configuration/mon-config-ref.rst b/doc/rados/configuration/mon-config-ref.rst new file mode 100644 index 000000000..3b12af43d --- /dev/null +++ b/doc/rados/configuration/mon-config-ref.rst @@ -0,0 +1,1243 @@ +.. _monitor-config-reference: + +========================== + Monitor Config Reference +========================== + +Understanding how to configure a :term:`Ceph Monitor` is an important part of +building a reliable :term:`Ceph Storage Cluster`. 
**All Ceph Storage Clusters +have at least one monitor**. The monitor complement usually remains fairly +consistent, but you can add, remove or replace a monitor in a cluster. See +`Adding/Removing a Monitor`_ for details. + + +.. index:: Ceph Monitor; Paxos + +Background +========== + +Ceph Monitors maintain a "master copy" of the :term:`Cluster Map`, which means a +:term:`Ceph Client` can determine the location of all Ceph Monitors, Ceph OSD +Daemons, and Ceph Metadata Servers just by connecting to one Ceph Monitor and +retrieving a current cluster map. Before Ceph Clients can read from or write to +Ceph OSD Daemons or Ceph Metadata Servers, they must connect to a Ceph Monitor +first. With a current copy of the cluster map and the CRUSH algorithm, a Ceph +Client can compute the location for any object. The ability to compute object +locations allows a Ceph Client to talk directly to Ceph OSD Daemons, which is a +very important aspect of Ceph's high scalability and performance. See +`Scalability and High Availability`_ for additional details. + +The primary role of the Ceph Monitor is to maintain a master copy of the cluster +map. Ceph Monitors also provide authentication and logging services. Ceph +Monitors write all changes in the monitor services to a single Paxos instance, +and Paxos writes the changes to a key/value store for strong consistency. Ceph +Monitors can query the most recent version of the cluster map during sync +operations. Ceph Monitors leverage the key/value store's snapshots and iterators +(using leveldb) to perform store-wide synchronization. + +.. 
ditaa:: + /-------------\ /-------------\ + | Monitor | Write Changes | Paxos | + | cCCC +-------------->+ cCCC | + | | | | + +-------------+ \------+------/ + | Auth | | + +-------------+ | Write Changes + | Log | | + +-------------+ v + | Monitor Map | /------+------\ + +-------------+ | Key / Value | + | OSD Map | | Store | + +-------------+ | cCCC | + | PG Map | \------+------/ + +-------------+ ^ + | MDS Map | | Read Changes + +-------------+ | + | cCCC |*---------------------+ + \-------------/ + + +.. deprecated:: version 0.58 + +In Ceph versions 0.58 and earlier, Ceph Monitors use a Paxos instance for +each service and store the map as a file. + +.. index:: Ceph Monitor; cluster map + +Cluster Maps +------------ + +The cluster map is a composite of maps, including the monitor map, the OSD map, +the placement group map and the metadata server map. The cluster map tracks a +number of important things: which processes are ``in`` the Ceph Storage Cluster; +which processes that are ``in`` the Ceph Storage Cluster are ``up`` and running +or ``down``; whether, the placement groups are ``active`` or ``inactive``, and +``clean`` or in some other state; and, other details that reflect the current +state of the cluster such as the total amount of storage space, and the amount +of storage used. + +When there is a significant change in the state of the cluster--e.g., a Ceph OSD +Daemon goes down, a placement group falls into a degraded state, etc.--the +cluster map gets updated to reflect the current state of the cluster. +Additionally, the Ceph Monitor also maintains a history of the prior states of +the cluster. The monitor map, OSD map, placement group map and metadata server +map each maintain a history of their map versions. We call each version an +"epoch." + +When operating your Ceph Storage Cluster, keeping track of these states is an +important part of your system administration duties. 
See `Monitoring a Cluster`_ +and `Monitoring OSDs and PGs`_ for additional details. + +.. index:: high availability; quorum + +Monitor Quorum +-------------- + +Our Configuring ceph section provides a trivial `Ceph configuration file`_ that +provides for one monitor in the test cluster. A cluster will run fine with a +single monitor; however, **a single monitor is a single-point-of-failure**. To +ensure high availability in a production Ceph Storage Cluster, you should run +Ceph with multiple monitors so that the failure of a single monitor **WILL NOT** +bring down your entire cluster. + +When a Ceph Storage Cluster runs multiple Ceph Monitors for high availability, +Ceph Monitors use `Paxos`_ to establish consensus about the master cluster map. +A consensus requires a majority of monitors running to establish a quorum for +consensus about the cluster map (e.g., 1; 2 out of 3; 3 out of 5; 4 out of 6; +etc.). + +``mon force quorum join`` + +:Description: Force monitor to join quorum even if it has been previously removed from the map +:Type: Boolean +:Default: ``False`` + +.. index:: Ceph Monitor; consistency + +Consistency +----------- + +When you add monitor settings to your Ceph configuration file, you need to be +aware of some of the architectural aspects of Ceph Monitors. **Ceph imposes +strict consistency requirements** for a Ceph monitor when discovering another +Ceph Monitor within the cluster. Whereas, Ceph Clients and other Ceph daemons +use the Ceph configuration file to discover monitors, monitors discover each +other using the monitor map (monmap), not the Ceph configuration file. + +A Ceph Monitor always refers to the local copy of the monmap when discovering +other Ceph Monitors in the Ceph Storage Cluster. Using the monmap instead of the +Ceph configuration file avoids errors that could break the cluster (e.g., typos +in ``ceph.conf`` when specifying a monitor address or port). 
Since monitors use +monmaps for discovery and they share monmaps with clients and other Ceph +daemons, **the monmap provides monitors with a strict guarantee that their +consensus is valid.** + +Strict consistency also applies to updates to the monmap. As with any other +updates on the Ceph Monitor, changes to the monmap always run through a +distributed consensus algorithm called `Paxos`_. The Ceph Monitors must agree on +each update to the monmap, such as adding or removing a Ceph Monitor, to ensure +that each monitor in the quorum has the same version of the monmap. Updates to +the monmap are incremental so that Ceph Monitors have the latest agreed upon +version, and a set of previous versions. Maintaining a history enables a Ceph +Monitor that has an older version of the monmap to catch up with the current +state of the Ceph Storage Cluster. + +If Ceph Monitors were to discover each other through the Ceph configuration file +instead of through the monmap, additional risks would be introduced because +Ceph configuration files are not updated and distributed automatically. Ceph +Monitors might inadvertently use an older Ceph configuration file, fail to +recognize a Ceph Monitor, fall out of a quorum, or develop a situation where +`Paxos`_ is not able to determine the current state of the system accurately. + + +.. index:: Ceph Monitor; bootstrapping monitors + +Bootstrapping Monitors +---------------------- + +In most configuration and deployment cases, tools that deploy Ceph help +bootstrap the Ceph Monitors by generating a monitor map for you (e.g., +``cephadm``, etc). A Ceph Monitor requires a few explicit +settings: + +- **Filesystem ID**: The ``fsid`` is the unique identifier for your + object store. Since you can run multiple clusters on the same + hardware, you must specify the unique ID of the object store when + bootstrapping a monitor. 
Deployment tools usually do this for you + (e.g., ``cephadm`` can call a tool like ``uuidgen``), but you + may specify the ``fsid`` manually too. + +- **Monitor ID**: A monitor ID is a unique ID assigned to each monitor within + the cluster. It is an alphanumeric value, and by convention the identifier + usually follows an alphabetical increment (e.g., ``a``, ``b``, etc.). This + can be set in a Ceph configuration file (e.g., ``[mon.a]``, ``[mon.b]``, etc.), + by a deployment tool, or using the ``ceph`` commandline. + +- **Keys**: The monitor must have secret keys. A deployment tool such as + ``cephadm`` usually does this for you, but you may + perform this step manually too. See `Monitor Keyrings`_ for details. + +For additional details on bootstrapping, see `Bootstrapping a Monitor`_. + +.. index:: Ceph Monitor; configuring monitors + +Configuring Monitors +==================== + +To apply configuration settings to the entire cluster, enter the configuration +settings under ``[global]``. To apply configuration settings to all monitors in +your cluster, enter the configuration settings under ``[mon]``. To apply +configuration settings to specific monitors, specify the monitor instance +(e.g., ``[mon.a]``). By convention, monitor instance names use alpha notation. + +.. code-block:: ini + + [global] + + [mon] + + [mon.a] + + [mon.b] + + [mon.c] + + +Minimum Configuration +--------------------- + +The bare minimum monitor settings for a Ceph monitor via the Ceph configuration +file include a hostname and a network address for each monitor. You can configure +these under ``[mon]`` or under the entry for a specific monitor. + +.. code-block:: ini + + [global] + mon host = 10.0.0.2,10.0.0.3,10.0.0.4 + +.. code-block:: ini + + [mon.a] + host = hostname1 + mon addr = 10.0.0.10:6789 + +See the `Network Configuration Reference`_ for details. + +.. 
note:: This minimum configuration for monitors assumes that a deployment + tool generates the ``fsid`` and the ``mon.`` key for you. + +Once you deploy a Ceph cluster, you **SHOULD NOT** change the IP addresses of +monitors. However, if you decide to change the monitor's IP address, you +must follow a specific procedure. See `Changing a Monitor's IP Address`_ for +details. + +Monitors can also be found by clients by using DNS SRV records. See `Monitor lookup through DNS`_ for details. + +Cluster ID +---------- + +Each Ceph Storage Cluster has a unique identifier (``fsid``). If specified, it +usually appears under the ``[global]`` section of the configuration file. +Deployment tools usually generate the ``fsid`` and store it in the monitor map, +so the value may not appear in a configuration file. The ``fsid`` makes it +possible to run daemons for multiple clusters on the same hardware. + +``fsid`` + +:Description: The cluster ID. One per cluster. +:Type: UUID +:Required: Yes. +:Default: N/A. May be generated by a deployment tool if not specified. + +.. note:: Do not set this value if you use a deployment tool that does + it for you. + + +.. index:: Ceph Monitor; initial members + +Initial Members +--------------- + +We recommend running a production Ceph Storage Cluster with at least three Ceph +Monitors to ensure high availability. When you run multiple monitors, you may +specify the initial monitors that must be members of the cluster in order to +establish a quorum. This may reduce the time it takes for your cluster to come +online. + +.. code-block:: ini + + [mon] + mon_initial_members = a,b,c + + +``mon_initial_members`` + +:Description: The IDs of initial monitors in a cluster during startup. If + specified, Ceph requires an odd number of monitors to form an + initial quorum (e.g., 3). + +:Type: String +:Default: None + +.. note:: A *majority* of monitors in your cluster must be able to reach + each other in order to establish a quorum. 
You can decrease the initial + number of monitors to establish a quorum with this setting. + +.. index:: Ceph Monitor; data path + +Data +---- + +Ceph provides a default path where Ceph Monitors store data. For optimal +performance in a production Ceph Storage Cluster, we recommend running Ceph +Monitors on separate hosts and drives from Ceph OSD Daemons. As leveldb uses +``mmap()`` for writing the data, Ceph Monitors flush their data from memory to disk +very often, which can interfere with Ceph OSD Daemon workloads if the data +store is co-located with the OSD Daemons. + +In Ceph versions 0.58 and earlier, Ceph Monitors stored their data in plain files. This +approach allowed users to inspect monitor data with common tools like ``ls`` +and ``cat``. However, this approach did not provide strong consistency. + +In Ceph versions 0.59 and later, Ceph Monitors store their data as key/value +pairs. Ceph Monitors require `ACID`_ transactions. Using a data store prevents +recovering Ceph Monitors from running corrupted versions through Paxos, and it +enables multiple modification operations in a single atomic batch, among other +advantages. + +Generally, we do not recommend changing the default data location. If you modify +the default location, we recommend that you make it uniform across Ceph Monitors +by setting it in the ``[mon]`` section of the configuration file. + + +``mon_data`` + +:Description: The monitor's data location. +:Type: String +:Default: ``/var/lib/ceph/mon/$cluster-$id`` + + +``mon_data_size_warn`` + +:Description: Raise ``HEALTH_WARN`` status when a monitor's data + store grows to be larger than this size, 15GB by default. + +:Type: Integer +:Default: ``15*1024*1024*1024`` + + +``mon_data_avail_warn`` + +:Description: Raise ``HEALTH_WARN`` status when the filesystem that houses a + monitor's data store reports that its available capacity is + less than or equal to this percentage.
+ +:Type: Integer +:Default: ``30`` + + +``mon_data_avail_crit`` + +:Description: Raise ``HEALTH_ERR`` status when the filesystem that houses a + monitor's data store reports that its available capacity is + less than or equal to this percentage. + +:Type: Integer +:Default: ``5`` + +``mon_warn_on_cache_pools_without_hit_sets`` + +:Description: Raise ``HEALTH_WARN`` when a cache pool does not + have the ``hit_set_type`` value configured. + See :ref:`hit_set_type <hit_set_type>` for more + details. + +:Type: Boolean +:Default: ``True`` + +``mon_warn_on_crush_straw_calc_version_zero`` + +:Description: Raise ``HEALTH_WARN`` when the CRUSH + ``straw_calc_version`` is zero. See + :ref:`CRUSH map tunables <crush-map-tunables>` for + details. + +:Type: Boolean +:Default: ``True`` + + +``mon_warn_on_legacy_crush_tunables`` + +:Description: Raise ``HEALTH_WARN`` when + CRUSH tunables are too old (older than ``mon_min_crush_required_version``) + +:Type: Boolean +:Default: ``True`` + + +``mon_crush_min_required_version`` + +:Description: The minimum tunable profile required by the cluster. + See + :ref:`CRUSH map tunables <crush-map-tunables>` for + details. + +:Type: String +:Default: ``hammer`` + + +``mon_warn_on_osd_down_out_interval_zero`` + +:Description: Raise ``HEALTH_WARN`` when + ``mon_osd_down_out_interval`` is zero. Having this option set to + zero on the leader acts much like the ``noout`` flag. It's hard + to figure out what's going wrong with clusters without the + ``noout`` flag set but acting like that just the same, so we + report a warning in this case. + +:Type: Boolean +:Default: ``True`` + + +``mon_warn_on_slow_ping_ratio`` + +:Description: Raise ``HEALTH_WARN`` when any heartbeat + between OSDs exceeds ``mon_warn_on_slow_ping_ratio`` + of ``osd_heartbeat_grace``. The default is 5%. +:Type: Float +:Default: ``0.05`` + + +``mon_warn_on_slow_ping_time`` + +:Description: Override ``mon_warn_on_slow_ping_ratio`` with a specific value. 
+ Raise ``HEALTH_WARN`` if any heartbeat + between OSDs exceeds ``mon_warn_on_slow_ping_time`` + milliseconds. The default is 0 (disabled). +:Type: Integer +:Default: ``0`` + + +``mon_warn_on_pool_no_redundancy`` + +:Description: Raise ``HEALTH_WARN`` if any pool is + configured with no replicas. +:Type: Boolean +:Default: ``True`` + + +``mon_cache_target_full_warn_ratio`` + +:Description: Position between pool's ``cache_target_full`` and + ``target_max_object`` where we start warning + +:Type: Float +:Default: ``0.66`` + + +``mon_health_to_clog`` + +:Description: Enable sending a health summary to the cluster log periodically. +:Type: Boolean +:Default: ``True`` + + +``mon_health_to_clog_tick_interval`` + +:Description: How often (in seconds) the monitor sends a health summary to the cluster + log (a non-positive number disables). If the current health summary + is empty or identical to the previous one, the monitor will not send it + to the cluster log. + +:Type: Float +:Default: ``60.0`` + + +``mon_health_to_clog_interval`` + +:Description: How often (in seconds) the monitor sends a health summary to the cluster + log (a non-positive number disables). Monitors will always + send a summary to the cluster log whether or not it differs from + the previous summary. + +:Type: Integer +:Default: ``3600`` + + + +.. index:: Ceph Storage Cluster; capacity planning, Ceph Monitor; capacity planning + +.. _storage-capacity: + +Storage Capacity +---------------- + +When a Ceph Storage Cluster gets close to its maximum capacity +(see ``mon_osd_full_ratio``), Ceph prevents you from writing to or reading from OSDs +as a safety measure to prevent data loss. Therefore, letting a +production Ceph Storage Cluster approach its full ratio is not a good practice, +because it sacrifices high availability. The default full ratio is ``.95``, or +95% of capacity. This is a very aggressive setting for a test cluster with a small +number of OSDs. + +.. 
tip:: When monitoring your cluster, be alert to warnings related to the + ``nearfull`` ratio. This means that a failure of some OSDs could result + in a temporary service disruption if one or more OSDs fails. Consider adding + more OSDs to increase storage capacity. + +A common scenario for test clusters involves a system administrator removing an +OSD from the Ceph Storage Cluster, watching the cluster rebalance, then removing +another OSD, and another, until at least one OSD eventually reaches the full +ratio and the cluster locks up. We recommend a bit of capacity +planning even with a test cluster. Planning enables you to gauge how much spare +capacity you will need in order to maintain high availability. Ideally, you want +to plan for a series of Ceph OSD Daemon failures where the cluster can recover +to an ``active+clean`` state without replacing those OSDs +immediately. Cluster operation continues in the ``active+degraded`` state, but this +is not ideal for normal operation and should be addressed promptly. + +The following diagram depicts a simplistic Ceph Storage Cluster containing 33 +Ceph Nodes with one OSD per host, each OSD reading from +and writing to a 3TB drive. So this exemplary Ceph Storage Cluster has a maximum +actual capacity of 99TB. With a ``mon osd full ratio`` of ``0.95``, if the Ceph +Storage Cluster falls to 5TB of remaining capacity, the cluster will not allow +Ceph Clients to read and write data. So the Ceph Storage Cluster's operating +capacity is 95TB, not 99TB. + +.. 
ditaa:: + +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ + | Rack 1 | | Rack 2 | | Rack 3 | | Rack 4 | | Rack 5 | | Rack 6 | + | cCCC | | cF00 | | cCCC | | cCCC | | cCCC | | cCCC | + +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ + | OSD 1 | | OSD 7 | | OSD 13 | | OSD 19 | | OSD 25 | | OSD 31 | + +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ + | OSD 2 | | OSD 8 | | OSD 14 | | OSD 20 | | OSD 26 | | OSD 32 | + +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ + | OSD 3 | | OSD 9 | | OSD 15 | | OSD 21 | | OSD 27 | | OSD 33 | + +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ + | OSD 4 | | OSD 10 | | OSD 16 | | OSD 22 | | OSD 28 | | Spare | + +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ + | OSD 5 | | OSD 11 | | OSD 17 | | OSD 23 | | OSD 29 | | Spare | + +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ + | OSD 6 | | OSD 12 | | OSD 18 | | OSD 24 | | OSD 30 | | Spare | + +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ + +It is normal in such a cluster for one or two OSDs to fail. A less frequent but +reasonable scenario involves a rack's router or power supply failing, which +brings down multiple OSDs simultaneously (e.g., OSDs 7-12). In such a scenario, +you should still strive for a cluster that can remain operational and achieve an +``active + clean`` state--even if that means adding a few hosts with additional +OSDs in short order. If your capacity utilization is too high, you may not lose +data, but you could still sacrifice data availability while resolving an outage +within a failure domain if capacity utilization of the cluster exceeds the full +ratio. For this reason, we recommend at least some rough capacity planning. + +Identify two numbers for your cluster: + +#. The number of OSDs. +#. 
The total capacity of the cluster. + +If you divide the total capacity of your cluster by the number of OSDs in your +cluster, you will find the mean capacity of an OSD within your cluster. +Consider multiplying that number by the number of OSDs you expect will fail +simultaneously during normal operations (a relatively small number). Finally, +multiply the capacity of the cluster by the full ratio to arrive at a maximum +operating capacity; then, subtract the amount of data from the OSDs +you expect to fail to arrive at a reasonable full ratio. Repeat the foregoing +process with a higher number of OSD failures (e.g., a rack of OSDs) to arrive at +a reasonable number for a near full ratio. + +The following settings only apply on cluster creation and are then stored in +the OSDMap. To clarify, in normal operation the values that are used by OSDs +are those found in the OSDMap, not those in the configuration file or central +config store. + +.. code-block:: ini + + [global] + mon_osd_full_ratio = .80 + mon_osd_backfillfull_ratio = .75 + mon_osd_nearfull_ratio = .70 + + +``mon_osd_full_ratio`` + +:Description: The threshold percentage of device space utilized before an OSD is + considered ``full``. + +:Type: Float +:Default: ``0.95`` + + +``mon_osd_backfillfull_ratio`` + +:Description: The threshold percentage of device space utilized before an OSD is + considered too ``full`` to backfill. + +:Type: Float +:Default: ``0.90`` + + +``mon_osd_nearfull_ratio`` + +:Description: The threshold percentage of device space used before an OSD is + considered ``nearfull``. + +:Type: Float +:Default: ``0.85`` + + +.. tip:: If some OSDs are nearfull, but others have plenty of capacity, you + may have an inaccurate CRUSH weight set for the nearfull OSDs. + +.. tip:: These settings only apply during cluster creation. Afterwards they need + to be changed in the OSDMap using ``ceph osd set-nearfull-ratio`` and + ``ceph osd set-full-ratio``. + +.. 
index:: heartbeat + +Heartbeat +--------- + +Ceph monitors know about the cluster by requiring reports from each OSD, and by +receiving reports from OSDs about the status of their neighboring OSDs. Ceph +provides reasonable default settings for monitor/OSD interaction; however, you +may modify them as needed. See `Monitor/OSD Interaction`_ for details. + + +.. index:: Ceph Monitor; leader, Ceph Monitor; provider, Ceph Monitor; requester, Ceph Monitor; synchronization + +Monitor Store Synchronization +----------------------------- + +When you run a production cluster with multiple monitors (recommended), each +monitor checks to see if a neighboring monitor has a more recent version of the +cluster map (e.g., a map in a neighboring monitor with one or more epoch numbers +higher than the most current epoch in the map of the instant monitor). +Periodically, one monitor in the cluster may fall behind the other monitors to +the point where it must leave the quorum, synchronize to retrieve the most +current information about the cluster, and then rejoin the quorum. For the +purposes of synchronization, monitors may assume one of three roles: + +#. **Leader**: The `Leader` is the first monitor to achieve the most recent + Paxos version of the cluster map. + +#. **Provider**: The `Provider` is a monitor that has the most recent version + of the cluster map, but wasn't the first to achieve the most recent version. + +#. **Requester:** A `Requester` is a monitor that has fallen behind the leader + and must synchronize in order to retrieve the most recent information about + the cluster before it can rejoin the quorum. + +These roles enable a leader to delegate synchronization duties to a provider, +which prevents synchronization requests from overloading the leader--improving +performance. In the following diagram, the requester has learned that it has +fallen behind the other monitors. 
The requester asks the leader to synchronize, +and the leader tells the requester to synchronize with a provider. + + +.. ditaa:: + +-----------+ +---------+ +----------+ + | Requester | | Leader | | Provider | + +-----------+ +---------+ +----------+ + | | | + | | | + | Ask to Synchronize | | + |------------------->| | + | | | + |<-------------------| | + | Tell Requester to | | + | Sync with Provider | | + | | | + | Synchronize | + |--------------------+-------------------->| + | | | + |<-------------------+---------------------| + | Send Chunk to Requester | + | (repeat as necessary) | + | Requester Acks Chuck to Provider | + |--------------------+-------------------->| + | | + | Sync Complete | + | Notification | + |------------------->| + | | + |<-------------------| + | Ack | + | | + + +Synchronization always occurs when a new monitor joins the cluster. During +runtime operations, monitors may receive updates to the cluster map at different +times. This means the leader and provider roles may migrate from one monitor to +another. If this happens while synchronizing (e.g., a provider falls behind the +leader), the provider can terminate synchronization with a requester. + +Once synchronization is complete, Ceph performs trimming across the cluster. +Trimming requires that the placement groups are ``active+clean``. + + +``mon_sync_timeout`` + +:Description: Number of seconds the monitor will wait for the next update + message from its sync provider before it gives up and bootstrap + again. + +:Type: Double +:Default: ``60.0`` + + +``mon_sync_max_payload_size`` + +:Description: The maximum size for a sync payload (in bytes). +:Type: 32-bit Integer +:Default: ``1048576`` + + +``paxos_max_join_drift`` + +:Description: The maximum Paxos iterations before we must first sync the + monitor data stores. When a monitor finds that its peer is too + far ahead of it, it will first sync with data stores before moving + on. 
+ +:Type: Integer +:Default: ``10`` + + +``paxos_stash_full_interval`` + +:Description: How often (in commits) to stash a full copy of the PaxosService state. + Currently this setting only affects ``mds``, ``mon``, ``auth`` and ``mgr`` + PaxosServices. + +:Type: Integer +:Default: ``25`` + + +``paxos_propose_interval`` + +:Description: Gather updates for this time interval before proposing + a map update. + +:Type: Double +:Default: ``1.0`` + + +``paxos_min`` + +:Description: The minimum number of Paxos states to keep around +:Type: Integer +:Default: ``500`` + + +``paxos_min_wait`` + +:Description: The minimum amount of time to gather updates after a period of + inactivity. + +:Type: Double +:Default: ``0.05`` + + +``paxos_trim_min`` + +:Description: Number of extra proposals tolerated before trimming +:Type: Integer +:Default: ``250`` + + +``paxos_trim_max`` + +:Description: The maximum number of extra proposals to trim at a time +:Type: Integer +:Default: ``500`` + + +``paxos_service_trim_min`` + +:Description: The minimum number of versions to trigger a trim (0 disables it) +:Type: Integer +:Default: ``250`` + + +``paxos_service_trim_max`` + +:Description: The maximum number of versions to trim during a single proposal (0 disables it) +:Type: Integer +:Default: ``500`` + + +``paxos_service_trim_max_multiplier`` + +:Description: The factor by which ``paxos_service_trim_max`` will be multiplied + to get a new upper bound when trim sizes are high (0 disables it) +:Type: Integer +:Default: ``20`` + + +``mon_mds_force_trim_to`` + +:Description: Force monitor to trim mdsmaps to this point (0 disables it. + dangerous, use with care) + +:Type: Integer +:Default: ``0`` + + +``mon_osd_force_trim_to`` + +:Description: Force monitor to trim osdmaps to this point, even if there are + PGs not clean at the specified epoch (0 disables it. 
dangerous, + use with care) + +:Type: Integer +:Default: ``0`` + + +``mon_osd_cache_size`` + +:Description: The size of osdmaps cache, not to rely on underlying store's cache +:Type: Integer +:Default: ``500`` + + +``mon_election_timeout`` + +:Description: On election proposer, maximum waiting time for all ACKs in seconds. +:Type: Float +:Default: ``5.00`` + + +``mon_lease`` + +:Description: The length (in seconds) of the lease on the monitor's versions. +:Type: Float +:Default: ``5.00`` + + +``mon_lease_renew_interval_factor`` + +:Description: ``mon_lease`` \* ``mon_lease_renew_interval_factor`` will be the + interval for the Leader to renew the other monitor's leases. The + factor should be less than ``1.0``. + +:Type: Float +:Default: ``0.60`` + + +``mon_lease_ack_timeout_factor`` + +:Description: The Leader will wait ``mon_lease`` \* ``mon_lease_ack_timeout_factor`` + for the Providers to acknowledge the lease extension. + +:Type: Float +:Default: ``2.00`` + + +``mon_accept_timeout_factor`` + +:Description: The Leader will wait ``mon_lease`` \* ``mon_accept_timeout_factor`` + for the Requester(s) to accept a Paxos update. It is also used + during the Paxos recovery phase for similar purposes. + +:Type: Float +:Default: ``2.00`` + + +``mon_min_osdmap_epochs`` + +:Description: Minimum number of OSD map epochs to keep at all times. +:Type: 32-bit Integer +:Default: ``500`` + + +``mon_max_log_epochs`` + +:Description: Maximum number of Log epochs the monitor should keep. +:Type: 32-bit Integer +:Default: ``500`` + + + +.. index:: Ceph Monitor; clock + +Clock +----- + +Ceph daemons pass critical messages to each other, which must be processed +before daemons reach a timeout threshold. If the clocks in Ceph monitors +are not synchronized, it can lead to a number of anomalies. For example: + +- Daemons ignoring received messages (e.g., timestamps outdated) +- Timeouts triggered too soon/late when a message wasn't received in time. 
+ +See `Monitor Store Synchronization`_ for details. + + +.. tip:: You must configure NTP or PTP daemons on your Ceph monitor hosts to + ensure that the monitor cluster operates with synchronized clocks. + It can be advantageous to have monitor hosts sync with each other + as well as with multiple quality upstream time sources. + +Clock drift may still be noticeable with NTP even though the discrepancy is not +yet harmful. Ceph's clock drift / clock skew warnings may get triggered even +though NTP maintains a reasonable level of synchronization. Increasing your +clock drift may be tolerable under such circumstances; however, a number of +factors such as workload, network latency, configuring overrides to default +timeouts and the `Monitor Store Synchronization`_ settings may influence +the level of acceptable clock drift without compromising Paxos guarantees. + +Ceph provides the following tunable options to allow you to find +acceptable values. + + +``mon_tick_interval`` + +:Description: A monitor's tick interval in seconds. +:Type: 32-bit Integer +:Default: ``5`` + + +``mon_clock_drift_allowed`` + +:Description: The clock drift in seconds allowed between monitors. +:Type: Float +:Default: ``0.05`` + + +``mon_clock_drift_warn_backoff`` + +:Description: Exponential backoff for clock drift warnings +:Type: Float +:Default: ``5.00`` + + +``mon_timecheck_interval`` + +:Description: The time check interval (clock drift check) in seconds + for the Leader. + +:Type: Float +:Default: ``300.00`` + + +``mon_timecheck_skew_interval`` + +:Description: The time check interval (clock drift check) in seconds when in + presence of a skew in seconds for the Leader. + +:Type: Float +:Default: ``30.00`` + + +Client +------ + +``mon_client_hunt_interval`` + +:Description: The client will try a new monitor every ``N`` seconds until it + establishes a connection. 
+ +:Type: Double +:Default: ``3.00`` + + +``mon_client_ping_interval`` + +:Description: The client will ping the monitor every ``N`` seconds. +:Type: Double +:Default: ``10.00`` + + +``mon_client_max_log_entries_per_message`` + +:Description: The maximum number of log entries a monitor will generate + per client message. + +:Type: Integer +:Default: ``1000`` + + +``mon_client_bytes`` + +:Description: The amount of client message data allowed in memory (in bytes). +:Type: 64-bit Integer Unsigned +:Default: ``100ul << 20`` + +.. _pool-settings: + +Pool settings +============= + +Since version v0.94 there is support for pool flags which allow or disallow changes to be made to pools. +Monitors can also disallow removal of pools if appropriately configured. The inconvenience of this guardrail +is far outweighed by the number of accidental pool (and thus data) deletions it prevents. + +``mon_allow_pool_delete`` + +:Description: Should monitors allow pools to be removed, regardless of what the pool flags say? + +:Type: Boolean +:Default: ``false`` + + +``osd_pool_default_ec_fast_read`` + +:Description: Whether to turn on fast read on the pool or not. It will be used as + the default setting of newly created erasure coded pools if ``fast_read`` + is not specified at create time. + +:Type: Boolean +:Default: ``false`` + + +``osd_pool_default_flag_hashpspool`` + +:Description: Set the hashpspool flag on new pools +:Type: Boolean +:Default: ``true`` + + +``osd_pool_default_flag_nodelete`` + +:Description: Set the ``nodelete`` flag on new pools, which prevents pool removal. +:Type: Boolean +:Default: ``false`` + + +``osd_pool_default_flag_nopgchange`` + +:Description: Set the ``nopgchange`` flag on new pools. Does not allow the number of PGs to be changed. +:Type: Boolean +:Default: ``false`` + + +``osd_pool_default_flag_nosizechange`` + +:Description: Set the ``nosizechange`` flag on new pools. Does not allow the ``size`` to be changed. 
+:Type: Boolean +:Default: ``false`` + +For more information about the pool flags see `Pool values`_. + +Miscellaneous +============= + +``mon_max_osd`` + +:Description: The maximum number of OSDs allowed in the cluster. +:Type: 32-bit Integer +:Default: ``10000`` + + +``mon_globalid_prealloc`` + +:Description: The number of global IDs to pre-allocate for clients and daemons in the cluster. +:Type: 32-bit Integer +:Default: ``10000`` + + +``mon_subscribe_interval`` + +:Description: The refresh interval (in seconds) for subscriptions. The + subscription mechanism enables obtaining cluster maps + and log information. + +:Type: Double +:Default: ``86400.00`` + + +``mon_stat_smooth_intervals`` + +:Description: Ceph will smooth statistics over the last ``N`` PG maps. +:Type: Integer +:Default: ``6`` + + +``mon_probe_timeout`` + +:Description: Number of seconds the monitor will wait to find peers before bootstrapping. +:Type: Double +:Default: ``2.00`` + + +``mon_daemon_bytes`` + +:Description: The message memory cap for metadata server and OSD messages (in bytes). +:Type: 64-bit Integer Unsigned +:Default: ``400ul << 20`` + + +``mon_max_log_entries_per_event`` + +:Description: The maximum number of log entries per event. +:Type: Integer +:Default: ``4096`` + + +``mon_osd_prime_pg_temp`` + +:Description: Enables or disables priming the PGMap with the previous OSDs when an ``out`` + OSD comes back into the cluster. With the ``true`` setting, clients + will continue to use the previous OSDs until the newly ``in`` OSDs for + a PG have peered. + +:Type: Boolean +:Default: ``true`` + + +``mon_osd_prime pg temp max time`` + +:Description: How much time in seconds the monitor should spend trying to prime the + PGMap when an out OSD comes back into the cluster. + +:Type: Float +:Default: ``0.50`` + + +``mon_osd_prime_pg_temp_max_time_estimate`` + +:Description: Maximum estimate of time spent on each PG before we prime all PGs + in parallel. 
+ +:Type: Float +:Default: ``0.25`` + + +``mon_mds_skip_sanity`` + +:Description: Skip safety assertions on FSMap (in case of bugs where we want to + continue anyway). Monitor terminates if the FSMap sanity check + fails, but we can disable it by enabling this option. + +:Type: Boolean +:Default: ``False`` + + +``mon_max_mdsmap_epochs`` + +:Description: The maximum number of mdsmap epochs to trim during a single proposal. +:Type: Integer +:Default: ``500`` + + +``mon_config_key_max_entry_size`` + +:Description: The maximum size of config-key entry (in bytes) +:Type: Integer +:Default: ``65536`` + + +``mon_scrub_interval`` + +:Description: How often the monitor scrubs its store by comparing + the stored checksums with the computed ones for all stored + keys. (0 disables it. dangerous, use with care) + +:Type: Seconds +:Default: ``1 day`` + + +``mon_scrub_max_keys`` + +:Description: The maximum number of keys to scrub each time. +:Type: Integer +:Default: ``100`` + + +``mon_compact_on_start`` + +:Description: Compact the database used as Ceph Monitor store on + ``ceph-mon`` start. A manual compaction helps to shrink the + monitor database and improve the performance of it if the regular + compaction fails to work. + +:Type: Boolean +:Default: ``False`` + + +``mon_compact_on_bootstrap`` + +:Description: Compact the database used as Ceph Monitor store + on bootstrap. Monitors probe each other to establish + a quorum after bootstrap. If a monitor times out before joining the + quorum, it will start over and bootstrap again. + +:Type: Boolean +:Default: ``False`` + + +``mon_compact_on_trim`` + +:Description: Compact a certain prefix (including paxos) when we trim its old states. +:Type: Boolean +:Default: ``True`` + + +``mon_cpu_threads`` + +:Description: Number of threads for performing CPU intensive work on monitor. +:Type: Integer +:Default: ``4`` + + +``mon_osd_mapping_pgs_per_chunk`` + +:Description: We calculate the mapping from placement group to OSDs in chunks. 
+ This option specifies the number of placement groups per chunk. + +:Type: Integer +:Default: ``4096`` + + +``mon_session_timeout`` + +:Description: Monitor will terminate inactive sessions stay idle over this + time limit. + +:Type: Integer +:Default: ``300`` + + +``mon_osd_cache_size_min`` + +:Description: The minimum amount of bytes to be kept mapped in memory for osd + monitor caches. + +:Type: 64-bit Integer +:Default: ``134217728`` + + +``mon_memory_target`` + +:Description: The amount of bytes pertaining to OSD monitor caches and KV cache + to be kept mapped in memory with cache auto-tuning enabled. + +:Type: 64-bit Integer +:Default: ``2147483648`` + + +``mon_memory_autotune`` + +:Description: Autotune the cache memory used for OSD monitors and KV + database. + +:Type: Boolean +:Default: ``True`` + + +.. _Paxos: https://en.wikipedia.org/wiki/Paxos_(computer_science) +.. _Monitor Keyrings: ../../../dev/mon-bootstrap#secret-keys +.. _Ceph configuration file: ../ceph-conf/#monitors +.. _Network Configuration Reference: ../network-config-ref +.. _Monitor lookup through DNS: ../mon-lookup-dns +.. _ACID: https://en.wikipedia.org/wiki/ACID +.. _Adding/Removing a Monitor: ../../operations/add-or-rm-mons +.. _Monitoring a Cluster: ../../operations/monitoring +.. _Monitoring OSDs and PGs: ../../operations/monitoring-osd-pg +.. _Bootstrapping a Monitor: ../../../dev/mon-bootstrap +.. _Changing a Monitor's IP Address: ../../operations/add-or-rm-mons#changing-a-monitor-s-ip-address +.. _Monitor/OSD Interaction: ../mon-osd-interaction +.. _Scalability and High Availability: ../../../architecture#scalability-and-high-availability +.. 
_Pool values: ../../operations/pools/#set-pool-values diff --git a/doc/rados/configuration/mon-lookup-dns.rst b/doc/rados/configuration/mon-lookup-dns.rst new file mode 100644 index 000000000..c9bece004 --- /dev/null +++ b/doc/rados/configuration/mon-lookup-dns.rst @@ -0,0 +1,56 @@ +=============================== +Looking up Monitors through DNS +=============================== + +Since version 11.0.0 RADOS supports looking up Monitors through DNS. + +This way daemons and clients do not require a *mon host* configuration directive in their ceph.conf configuration file. + +Using DNS SRV TCP records clients are able to look up the monitors. + +This allows for less configuration on clients and monitors. Using a DNS update clients and daemons can be made aware of changes in the monitor topology. + +By default clients and daemons will look for the TCP service called *ceph-mon* which is configured by the *mon_dns_srv_name* configuration directive. + + +``mon dns srv name`` + +:Description: the service name used querying the DNS for the monitor hosts/addresses +:Type: String +:Default: ``ceph-mon`` + +Example +------- +When the DNS search domain is set to *example.com* a DNS zone file might contain the following elements. + +First, create records for the Monitors, either IPv4 (A) or IPv6 (AAAA). + +:: + + mon1.example.com. AAAA 2001:db8::100 + mon2.example.com. AAAA 2001:db8::200 + mon3.example.com. AAAA 2001:db8::300 + +:: + + mon1.example.com. A 192.168.0.1 + mon2.example.com. A 192.168.0.2 + mon3.example.com. A 192.168.0.3 + + +With those records now existing we can create the SRV TCP records with the name *ceph-mon* pointing to the three Monitors. + +:: + + _ceph-mon._tcp.example.com. 60 IN SRV 10 20 6789 mon1.example.com. + _ceph-mon._tcp.example.com. 60 IN SRV 10 30 6789 mon2.example.com. + _ceph-mon._tcp.example.com. 60 IN SRV 20 50 6789 mon3.example.com. 
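As a rough illustration of how the priority and weight fields in the SRV records above interact, the following Python sketch computes the share of connections each monitor would be expected to receive. It is not Ceph's resolver code: ``expected_shares`` is a hypothetical helper, and it assumes the usual SRV convention that only the records with the lowest priority value are used while any of them is reachable, with connections spread in proportion to weight.

```python
from collections import defaultdict


def expected_shares(records):
    """Return {target: fraction} for the most-preferred SRV priority group.

    records: iterable of (priority, weight, port, target) tuples.
    Assumption: the lowest priority value wins, and weights are
    proportional within that group.
    """
    groups = defaultdict(list)
    for priority, weight, _port, target in records:
        groups[priority].append((weight, target))

    best = groups[min(groups)]
    total = sum(weight for weight, _ in best)
    if total == 0:
        # All weights are zero: treat the targets as equally likely.
        return {target: 1.0 / len(best) for _, target in best}
    return {target: weight / total for weight, target in best}


# The three example records: (priority, weight, port, target).
records = [
    (10, 20, 6789, "mon1.example.com."),
    (10, 30, 6789, "mon2.example.com."),
    (20, 50, 6789, "mon3.example.com."),
]
shares = expected_shares(records)
print(shares)
```

Under these assumptions mon1 and mon2 split the connections 20:30, and mon3 does not appear in the result because it belongs to a less-preferred priority group.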
+ +Now all Monitors are running on port *6789*, with priorities 10, 10, 20 and weights 20, 30, 50 respectively. + +Monitor clients choose a monitor by referencing the SRV records. If a cluster has multiple Monitor SRV records +with the same priority value, clients and daemons will load balance the connections to Monitors in proportion +to the values of the SRV weight fields. + +For the above example, this will result in approximately 40% of the clients and daemons connecting to mon1 and +60% of them connecting to mon2. However, if neither of them is reachable, then mon3 will be considered as a fallback. diff --git a/doc/rados/configuration/mon-osd-interaction.rst b/doc/rados/configuration/mon-osd-interaction.rst new file mode 100644 index 000000000..727070491 --- /dev/null +++ b/doc/rados/configuration/mon-osd-interaction.rst @@ -0,0 +1,396 @@ +===================================== + Configuring Monitor/OSD Interaction +===================================== + +.. index:: heartbeat + +After you have completed your initial Ceph configuration, you may deploy and run +Ceph. When you execute a command such as ``ceph health`` or ``ceph -s``, the +:term:`Ceph Monitor` reports on the current state of the :term:`Ceph Storage +Cluster`. The Ceph Monitor knows about the Ceph Storage Cluster by requiring +reports from each :term:`Ceph OSD Daemon`, and by receiving reports from Ceph +OSD Daemons about the status of their neighboring Ceph OSD Daemons. If the Ceph +Monitor doesn't receive reports, or if it receives reports of changes in the +Ceph Storage Cluster, the Ceph Monitor updates the status of the :term:`Ceph +Cluster Map`. + +Ceph provides reasonable default settings for Ceph Monitor/Ceph OSD Daemon +interaction. However, you may override the defaults. The following sections +describe how Ceph Monitors and Ceph OSD Daemons interact for the purposes of +monitoring the Ceph Storage Cluster. + +.. 
index:: heartbeat interval + +OSDs Check Heartbeats +===================== + +Each Ceph OSD Daemon checks the heartbeat of other Ceph OSD Daemons at random +intervals less than every 6 seconds. If a neighboring Ceph OSD Daemon doesn't +show a heartbeat within a 20 second grace period, the Ceph OSD Daemon may +consider the neighboring Ceph OSD Daemon ``down`` and report it back to a Ceph +Monitor, which will update the Ceph Cluster Map. You may change this grace +period by adding an ``osd heartbeat grace`` setting under the ``[mon]`` +and ``[osd]`` or ``[global]`` section of your Ceph configuration file, +or by setting the value at runtime. + + +.. ditaa:: + +---------+ +---------+ + | OSD 1 | | OSD 2 | + +---------+ +---------+ + | | + |----+ Heartbeat | + | | Interval | + |<---+ Exceeded | + | | + | Check | + | Heartbeat | + |------------------->| + | | + |<-------------------| + | Heart Beating | + | | + |----+ Heartbeat | + | | Interval | + |<---+ Exceeded | + | | + | Check | + | Heartbeat | + |------------------->| + | | + |----+ Grace | + | | Period | + |<---+ Exceeded | + | | + |----+ Mark | + | | OSD 2 | + |<---+ Down | + + +.. index:: OSD down report + +OSDs Report Down OSDs +===================== + +By default, two Ceph OSD Daemons from different hosts must report to the Ceph +Monitors that another Ceph OSD Daemon is ``down`` before the Ceph Monitors +acknowledge that the reported Ceph OSD Daemon is ``down``. But there is chance +that all the OSDs reporting the failure are hosted in a rack with a bad switch +which has trouble connecting to another OSD. To avoid this sort of false alarm, +we consider the peers reporting a failure a proxy for a potential "subcluster" +over the overall cluster that is similarly laggy. This is clearly not true in +all cases, but will sometimes help us localize the grace correction to a subset +of the system that is unhappy. 
``mon osd reporter subtree level`` is used to +group the peers into the "subcluster" by their common ancestor type in CRUSH +map. By default, only two reports from different subtree are required to report +another Ceph OSD Daemon ``down``. You can change the number of reporters from +unique subtrees and the common ancestor type required to report a Ceph OSD +Daemon ``down`` to a Ceph Monitor by adding an ``mon osd min down reporters`` +and ``mon osd reporter subtree level`` settings under the ``[mon]`` section of +your Ceph configuration file, or by setting the value at runtime. + + +.. ditaa:: + + +---------+ +---------+ +---------+ + | OSD 1 | | OSD 2 | | Monitor | + +---------+ +---------+ +---------+ + | | | + | OSD 3 Is Down | | + |---------------+--------------->| + | | | + | | | + | | OSD 3 Is Down | + | |--------------->| + | | | + | | | + | | |---------+ Mark + | | | | OSD 3 + | | |<--------+ Down + + +.. index:: peering failure + +OSDs Report Peering Failure +=========================== + +If a Ceph OSD Daemon cannot peer with any of the Ceph OSD Daemons defined in its +Ceph configuration file (or the cluster map), it will ping a Ceph Monitor for +the most recent copy of the cluster map every 30 seconds. You can change the +Ceph Monitor heartbeat interval by adding an ``osd mon heartbeat interval`` +setting under the ``[osd]`` section of your Ceph configuration file, or by +setting the value at runtime. + +.. 
ditaa:: + + +---------+ +---------+ +-------+ +---------+ + | OSD 1 | | OSD 2 | | OSD 3 | | Monitor | + +---------+ +---------+ +-------+ +---------+ + | | | | + | Request To | | | + | Peer | | | + |-------------->| | | + |<--------------| | | + | Peering | | + | | | + | Request To | | + | Peer | | + |----------------------------->| | + | | + |----+ OSD Monitor | + | | Heartbeat | + |<---+ Interval Exceeded | + | | + | Failed to Peer with OSD 3 | + |-------------------------------------------->| + |<--------------------------------------------| + | Receive New Cluster Map | + + +.. index:: OSD status + +OSDs Report Their Status +======================== + +If an Ceph OSD Daemon doesn't report to a Ceph Monitor, the Ceph Monitor will +consider the Ceph OSD Daemon ``down`` after the ``mon osd report timeout`` +elapses. A Ceph OSD Daemon sends a report to a Ceph Monitor when a reportable +event such as a failure, a change in placement group stats, a change in +``up_thru`` or when it boots within 5 seconds. You can change the Ceph OSD +Daemon minimum report interval by adding an ``osd mon report interval`` +setting under the ``[osd]`` section of your Ceph configuration file, or by +setting the value at runtime. A Ceph OSD Daemon sends a report to a Ceph +Monitor every 120 seconds irrespective of whether any notable changes occur. +You can change the Ceph Monitor report interval by adding an ``osd mon report +interval max`` setting under the ``[osd]`` section of your Ceph configuration +file, or by setting the value at runtime. + + +.. 
ditaa:: + + +---------+ +---------+ + | OSD 1 | | Monitor | + +---------+ +---------+ + | | + |----+ Report Min | + | | Interval | + |<---+ Exceeded | + | | + |----+ Reportable | + | | Event | + |<---+ Occurs | + | | + | Report To | + | Monitor | + |------------------->| + | | + |----+ Report Max | + | | Interval | + |<---+ Exceeded | + | | + | Report To | + | Monitor | + |------------------->| + | | + |----+ Monitor | + | | Fails | + |<---+ | + +----+ Monitor OSD + | | Report Timeout + |<---+ Exceeded + | + +----+ Mark + | | OSD 1 + |<---+ Down + + + + +Configuration Settings +====================== + +When modifying heartbeat settings, you should include them in the ``[global]`` +section of your configuration file. + +.. index:: monitor heartbeat + +Monitor Settings +---------------- + +``mon osd min up ratio`` + +:Description: The minimum ratio of ``up`` Ceph OSD Daemons before Ceph will + mark Ceph OSD Daemons ``down``. + +:Type: Double +:Default: ``.3`` + + +``mon osd min in ratio`` + +:Description: The minimum ratio of ``in`` Ceph OSD Daemons before Ceph will + mark Ceph OSD Daemons ``out``. + +:Type: Double +:Default: ``.75`` + + +``mon osd laggy halflife`` + +:Description: The number of seconds laggy estimates will decay. +:Type: Integer +:Default: ``60*60`` + + +``mon osd laggy weight`` + +:Description: The weight for new samples in laggy estimation decay. +:Type: Double +:Default: ``0.3`` + + + +``mon osd laggy max interval`` + +:Description: Maximum value of ``laggy_interval`` in laggy estimations (in seconds). + Monitor uses an adaptive approach to evaluate the ``laggy_interval`` of + a certain OSD. This value will be used to calculate the grace time for + that OSD. +:Type: Integer +:Default: 300 + +``mon osd adjust heartbeat grace`` + +:Description: If set to ``true``, Ceph will scale based on laggy estimations. 
+:Type: Boolean +:Default: ``true`` + + +``mon osd adjust down out interval`` + +:Description: If set to ``true``, Ceph will scaled based on laggy estimations. +:Type: Boolean +:Default: ``true`` + + +``mon osd auto mark in`` + +:Description: Ceph will mark any booting Ceph OSD Daemons as ``in`` + the Ceph Storage Cluster. + +:Type: Boolean +:Default: ``false`` + + +``mon osd auto mark auto out in`` + +:Description: Ceph will mark booting Ceph OSD Daemons auto marked ``out`` + of the Ceph Storage Cluster as ``in`` the cluster. + +:Type: Boolean +:Default: ``true`` + + +``mon osd auto mark new in`` + +:Description: Ceph will mark booting new Ceph OSD Daemons as ``in`` the + Ceph Storage Cluster. + +:Type: Boolean +:Default: ``true`` + + +``mon osd down out interval`` + +:Description: The number of seconds Ceph waits before marking a Ceph OSD Daemon + ``down`` and ``out`` if it doesn't respond. + +:Type: 32-bit Integer +:Default: ``600`` + + +``mon osd down out subtree limit`` + +:Description: The smallest :term:`CRUSH` unit type that Ceph will **not** + automatically mark out. For instance, if set to ``host`` and if + all OSDs of a host are down, Ceph will not automatically mark out + these OSDs. + +:Type: String +:Default: ``rack`` + + +``mon osd report timeout`` + +:Description: The grace period in seconds before declaring + unresponsive Ceph OSD Daemons ``down``. + +:Type: 32-bit Integer +:Default: ``900`` + +``mon osd min down reporters`` + +:Description: The minimum number of Ceph OSD Daemons required to report a + ``down`` Ceph OSD Daemon. + +:Type: 32-bit Integer +:Default: ``2`` + + +``mon_osd_reporter_subtree_level`` + +:Description: In which level of parent bucket the reporters are counted. The OSDs + send failure reports to monitors if they find a peer that is not responsive. + Monitors mark the reported ``OSD`` out and then ``down`` after a grace period. +:Type: String +:Default: ``host`` + + +.. 
index:: OSD heartbeat + +OSD Settings +------------ + +``osd_heartbeat_interval`` + +:Description: How often a Ceph OSD Daemon pings its peers (in seconds). +:Type: 32-bit Integer +:Default: ``6`` + + +``osd_heartbeat_grace`` + +:Description: The elapsed time (in seconds) without a heartbeat after which + the Ceph Storage Cluster considers a Ceph OSD Daemon ``down``. + This setting must be set in both the ``[mon]`` and ``[osd]`` sections, or in the ``[global]`` + section, so that it is read by both monitor and OSD daemons. +:Type: 32-bit Integer +:Default: ``20`` + + +``osd_mon_heartbeat_interval`` + +:Description: How often the Ceph OSD Daemon pings a Ceph Monitor if it has no + Ceph OSD Daemon peers. + +:Type: 32-bit Integer +:Default: ``30`` + + +``osd_mon_heartbeat_stat_stale`` + +:Description: Stop reporting on heartbeat ping times which haven't been updated for + this many seconds. Set to zero to disable this action. + +:Type: 32-bit Integer +:Default: ``3600`` + + +``osd_mon_report_interval`` + +:Description: The number of seconds a Ceph OSD Daemon may wait + from startup or another reportable event before reporting + to a Ceph Monitor. + +:Type: 32-bit Integer +:Default: ``5`` diff --git a/doc/rados/configuration/ms-ref.rst b/doc/rados/configuration/ms-ref.rst new file mode 100644 index 000000000..113bd0913 --- /dev/null +++ b/doc/rados/configuration/ms-ref.rst @@ -0,0 +1,133 @@ +=========== + Messaging +=========== + +General Settings +================ + +``ms_tcp_nodelay`` + +:Description: Disables Nagle's algorithm on messenger TCP sessions. +:Type: Boolean +:Required: No +:Default: ``true`` + + +``ms_initial_backoff`` + +:Description: The initial time to wait before reconnecting on a fault. +:Type: Double +:Required: No +:Default: ``.2`` + + +``ms_max_backoff`` + +:Description: The maximum time to wait before reconnecting on a fault. +:Type: Double +:Required: No +:Default: ``15.0`` + + +``ms_nocrc`` + +:Description: Disables CRC on network messages. 
May increase performance if CPU limited. +:Type: Boolean +:Required: No +:Default: ``false`` + + +``ms_die_on_bad_msg`` + +:Description: Debug option; do not configure. +:Type: Boolean +:Required: No +:Default: ``false`` + + +``ms_dispatch_throttle_bytes`` + +:Description: Throttles total size of messages waiting to be dispatched. +:Type: 64-bit Unsigned Integer +:Required: No +:Default: ``100 << 20`` + + +``ms_bind_ipv6`` + +:Description: Enable to bind daemons to IPv6 addresses instead of IPv4. Not required if you specify a daemon or cluster IP. +:Type: Boolean +:Required: No +:Default: ``false`` + + +``ms_rwthread_stack_bytes`` + +:Description: Debug option for stack size; do not configure. +:Type: 64-bit Unsigned Integer +:Required: No +:Default: ``1024 << 10`` + + +``ms_tcp_read_timeout`` + +:Description: Controls how long (in seconds) the messenger will wait before closing an idle connection. +:Type: 64-bit Unsigned Integer +:Required: No +:Default: ``900`` + + +``ms_inject_socket_failures`` + +:Description: Debug option; do not configure. +:Type: 64-bit Unsigned Integer +:Required: No +:Default: ``0`` + +Async messenger options +======================= + + +``ms_async_transport_type`` + +:Description: Transport type used by Async Messenger. Can be ``posix``, ``dpdk`` + or ``rdma``. Posix uses standard TCP/IP networking and is default. + Other transports may be experimental and support may be limited. +:Type: String +:Required: No +:Default: ``posix`` + + +``ms_async_op_threads`` + +:Description: Initial number of worker threads used by each Async Messenger instance. + Should be at least equal to highest number of replicas, but you can + decrease it if you are low on CPU core count and/or you host a lot of + OSDs on single server. +:Type: 64-bit Unsigned Integer +:Required: No +:Default: ``3`` + + +``ms_async_max_op_threads`` + +:Description: Maximum number of worker threads used by each Async Messenger instance. 
+ Set to lower values when your machine has limited CPU count, and increase + when your CPUs are underutilized (i. e. one or more of CPUs are + constantly on 100% load during I/O operations). +:Type: 64-bit Unsigned Integer +:Required: No +:Default: ``5`` + + +``ms_async_send_inline`` + +:Description: Send messages directly from the thread that generated them instead of + queuing and sending from Async Messenger thread. This option is known + to decrease performance on systems with a lot of CPU cores, so it's + disabled by default. +:Type: Boolean +:Required: No +:Default: ``false`` + + diff --git a/doc/rados/configuration/msgr2.rst b/doc/rados/configuration/msgr2.rst new file mode 100644 index 000000000..3415b1f5f --- /dev/null +++ b/doc/rados/configuration/msgr2.rst @@ -0,0 +1,233 @@ +.. _msgr2: + +Messenger v2 +============ + +What is it +---------- + +The messenger v2 protocol, or msgr2, is the second major revision on +Ceph's on-wire protocol. It brings with it several key features: + +* A *secure* mode that encrypts all data passing over the network +* Improved encapsulation of authentication payloads, enabling future + integration of new authentication modes like Kerberos +* Improved earlier feature advertisement and negotiation, enabling + future protocol revisions + +Ceph daemons can now bind to multiple ports, allowing both legacy Ceph +clients and new v2-capable clients to connect to the same cluster. + +By default, monitors now bind to the new IANA-assigned port ``3300`` +(ce4h or 0xce4) for the new v2 protocol, while also binding to the +old default port ``6789`` for the legacy v1 protocol. + +.. _address_formats: + +Address formats +--------------- + +Prior to Nautilus, all network addresses were rendered like +``1.2.3.4:567/89012`` where there was an IP address, a port, and a +nonce to uniquely identify a client or daemon on the network. 
+Starting with Nautilus, we now have three different address types: + +* **v2**: ``v2:1.2.3.4:578/89012`` identifies a daemon binding to a + port speaking the new v2 protocol. +* **v1**: ``v1:1.2.3.4:578/89012`` identifies a daemon binding to a + port speaking the legacy v1 protocol. Any address that was + previously shown without any prefix is now shown as a ``v1:`` address. +* **TYPE_ANY**: ``any:1.2.3.4:578/89012`` identifies a client that can + speak either version of the protocol. Prior to Nautilus, clients would appear as + ``1.2.3.4:0/123456``, where the port of 0 indicates they are clients + and do not accept incoming connections. Starting with Nautilus, + these clients are now internally represented by a **TYPE_ANY** + address, and still shown with no prefix, because they may + connect to daemons using the v2 or v1 protocol, depending on what + protocol(s) the daemons are using. + +Because daemons now bind to multiple ports, they are now described by +a vector of addresses instead of a single address. For example, +dumping the monitor map on a Nautilus cluster now includes lines +like:: + + epoch 1 + fsid 50fcf227-be32-4bcb-8b41-34ca8370bd16 + last_changed 2019-02-25 11:10:46.700821 + created 2019-02-25 11:10:46.700821 + min_mon_release 14 (nautilus) + 0: [v2:10.0.0.10:3300/0,v1:10.0.0.10:6789/0] mon.foo + 1: [v2:10.0.0.11:3300/0,v1:10.0.0.11:6789/0] mon.bar + 2: [v2:10.0.0.12:3300/0,v1:10.0.0.12:6789/0] mon.baz + +The bracketed list or vector of addresses means that the same daemon can be +reached on multiple ports (and protocols). Any client or other daemon +connecting to that daemon will use the v2 protocol (listed first) if +possible; otherwise it will fall back to the legacy v1 protocol. Legacy +clients will only see the v1 addresses and will continue to connect as +they did before, with the v1 protocol. + +Starting in Nautilus, the ``mon_host`` configuration option and ``-m +<mon-host>`` command line options support the same bracketed address +vector syntax. 
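To make the bracketed address vector format concrete, here is a minimal Python sketch that splits one vector into its entries and picks the v2 entry when one is present. It is only an illustration, not Ceph's own parser: ``parse_addrvec`` and ``preferred`` are hypothetical helper names, and the sketch assumes IPv4 addresses with an explicit ``v1:`` or ``v2:`` prefix on every entry.

```python
def parse_addrvec(text):
    """Split '[v2:10.0.0.10:3300/0,v1:10.0.0.10:6789/0]' into a list of dicts.

    Assumption: IPv4 only, and each entry has the form
    proto:host:port[/nonce].
    """
    text = text.strip()
    if text.startswith("[") and text.endswith("]"):
        text = text[1:-1]

    addrs = []
    for item in text.split(","):
        proto, rest = item.strip().split(":", 1)
        hostport, _, nonce = rest.partition("/")
        host, _, port = hostport.rpartition(":")
        addrs.append({
            "proto": proto,
            "host": host,
            "port": int(port),
            "nonce": int(nonce) if nonce else None,
        })
    return addrs


def preferred(addrs):
    """Return the v2 entry if present, otherwise the v1 entry, else None."""
    for proto in ("v2", "v1"):
        for addr in addrs:
            if addr["proto"] == proto:
                return addr
    return None


addrs = parse_addrvec("[v2:10.0.0.10:3300/0,v1:10.0.0.10:6789/0]")
choice = preferred(addrs)
print(choice["proto"], choice["host"], choice["port"])
```

The ``preferred`` helper mirrors the rule described above: a v2-capable peer uses the v2 address when one is advertised and otherwise falls back to v1.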
+ + +Bind configuration options +^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Two new configuration options control whether the v1 and/or v2 +protocol is used: + + * ``ms_bind_msgr1`` [default: true] controls whether a daemon binds + to a port speaking the v1 protocol + * ``ms_bind_msgr2`` [default: true] controls whether a daemon binds + to a port speaking the v2 protocol + +Similarly, two options control whether IPv4 and IPv6 addresses are used: + + * ``ms_bind_ipv4`` [default: true] controls whether a daemon binds + to an IPv4 address + * ``ms_bind_ipv6`` [default: false] controls whether a daemon binds + to an IPv6 address + +.. note:: The ability to bind to multiple ports has paved the way for + dual-stack IPv4 and IPv6 support. That said, dual-stack support is + not yet tested as of Nautilus v14.2.0 and likely needs some + additional code changes to work correctly. + +Connection modes +---------------- + +The v2 protocol supports two connection modes: + +* *crc* mode provides: + + - a strong initial authentication when the connection is established + (with cephx, mutual authentication of both parties with protection + from a man-in-the-middle or eavesdropper), and + - a crc32c integrity check to protect against bit flips due to flaky + hardware or cosmic rays + + *crc* mode does *not* provide: + + - secrecy (an eavesdropper on the network can see all + post-authentication traffic as it goes by) or + - protection from a malicious man-in-the-middle (who can deliberately + modify traffic as it goes by, as long as they are careful to + adjust the crc32c values to match) + +* *secure* mode provides: + + - a strong initial authentication when the connection is established + (with cephx, mutual authentication of both parties with protection + from a man-in-the-middle or eavesdropper), and + - full encryption of all post-authentication traffic, including a + cryptographic integrity check.
+ + In Nautilus, secure mode uses the `AES-GCM + <https://en.wikipedia.org/wiki/Galois/Counter_Mode>`_ stream cipher, + which is generally very fast on modern processors (e.g., faster than + a SHA-256 cryptographic hash). + +Connection mode configuration options +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +For most connections, there are options that control which modes are used: + +* ``ms_cluster_mode`` is the connection mode (or permitted modes) used + for intra-cluster communication between Ceph daemons. If multiple + modes are listed, the modes listed first are preferred. +* ``ms_service_mode`` is a list of permitted modes for clients to use + when connecting to the cluster. +* ``ms_client_mode`` is a list of connection modes, in order of + preference, for clients to use (or allow) when talking to a Ceph + cluster. + +There is a parallel set of options that apply specifically to +monitors, allowing administrators to set different (usually more +secure) requirements on communication with the monitors. + +* ``ms_mon_cluster_mode`` is the connection mode (or permitted modes) + to use between monitors. +* ``ms_mon_service_mode`` is a list of permitted modes for clients or + other Ceph daemons to use when connecting to monitors. +* ``ms_mon_client_mode`` is a list of connection modes, in order of + preference, for clients or non-monitor daemons to use when + connecting to monitors. + + +Transitioning from v1-only to v2-plus-v1 +---------------------------------------- + +By default, ``ms_bind_msgr2`` is true starting with Nautilus 14.2.z. +However, until the monitors start using v2, only limited services will +start advertising v2 addresses. + +For most users, the monitors are binding to the default legacy port ``6789`` +for the v1 protocol. When this is the case, enabling v2 is as simple as: + +.. prompt:: bash $ + + ceph mon enable-msgr2 + +If the monitors are bound to non-standard ports, you will need to +specify an additional port for v2 explicitly.
For example, if your +monitor ``mon.a`` binds to ``1.2.3.4:1111``, and you want to add v2 on +port ``1112``: + +.. prompt:: bash $ + + ceph mon set-addrs a [v2:1.2.3.4:1112,v1:1.2.3.4:1111] + +Once the monitors bind to v2, each daemon will start advertising a v2 +address when it is next restarted. + + +.. _msgr2_ceph_conf: + +Updating ceph.conf and mon_host +------------------------------- + +Prior to Nautilus, a CLI user or daemon would normally discover the +monitors via the ``mon_host`` option in ``/etc/ceph/ceph.conf``. The +syntax for this option has expanded starting with Nautilus to +support the new bracketed list format. For example, an old line +like:: + + mon_host = 10.0.0.1:6789,10.0.0.2:6789,10.0.0.3:6789 + +Can be changed to:: + + mon_host = [v2:10.0.0.1:3300/0,v1:10.0.0.1:6789/0],[v2:10.0.0.2:3300/0,v1:10.0.0.2:6789/0],[v2:10.0.0.3:3300/0,v1:10.0.0.3:6789/0] + +However, when default ports are used (``3300`` and ``6789``), they can +be omitted:: + + mon_host = 10.0.0.1,10.0.0.2,10.0.0.3 + +Once v2 has been enabled on the monitors, ``ceph.conf`` may need to be +updated to either specify no ports (this is usually simplest), or +explicitly specify both the v2 and v1 addresses. Note, however, that +the new bracketed syntax is only understood by Nautilus and later, so +do not make that change on hosts that have not yet had their Ceph +packages upgraded. + +When you are updating ``ceph.conf``, note the new ``ceph config +generate-minimal-conf`` command (which generates a barebones config +file with just enough information to reach the monitors) and the +``ceph config assimilate-conf`` (which moves config file options into +the monitors' configuration database) may be helpful.
For example,:: + + # ceph config assimilate-conf < /etc/ceph/ceph.conf + # ceph config generate-minimal-conf > /etc/ceph/ceph.conf.new + # cat /etc/ceph/ceph.conf.new + # minimal ceph.conf for 0e5a806b-0ce5-4bc6-b949-aa6f68f5c2a3 + [global] + fsid = 0e5a806b-0ce5-4bc6-b949-aa6f68f5c2a3 + mon_host = [v2:10.0.0.1:3300/0,v1:10.0.0.1:6789/0] + # mv /etc/ceph/ceph.conf.new /etc/ceph/ceph.conf + +Protocol +-------- + +For a detailed description of the v2 wire protocol, see :ref:`msgr2-protocol`. diff --git a/doc/rados/configuration/network-config-ref.rst b/doc/rados/configuration/network-config-ref.rst new file mode 100644 index 000000000..97229a401 --- /dev/null +++ b/doc/rados/configuration/network-config-ref.rst @@ -0,0 +1,454 @@ +================================= + Network Configuration Reference +================================= + +Network configuration is critical for building a high performance :term:`Ceph +Storage Cluster`. The Ceph Storage Cluster does not perform request routing or +dispatching on behalf of the :term:`Ceph Client`. Instead, Ceph Clients make +requests directly to Ceph OSD Daemons. Ceph OSD Daemons perform data replication +on behalf of Ceph Clients, which means replication and other factors impose +additional loads on Ceph Storage Cluster networks. + +Our Quick Start configurations provide a trivial Ceph configuration file that +sets monitor IP addresses and daemon host names only. Unless you specify a +cluster network, Ceph assumes a single "public" network. Ceph functions just +fine with a public network only, but you may see significant performance +improvement with a second "cluster" network in a large cluster. + +It is possible to run a Ceph Storage Cluster with two networks: a public +(client, front-side) network and a cluster (private, replication, back-side) +network. However, this approach +complicates network configuration (both hardware and software) and does not usually +have a significant impact on overall performance.
For this reason, we recommend +that, for resilience and capacity, dual-NIC systems either active/active bond +these interfaces or implement a layer 3 multipath strategy with, e.g., FRR. + +If, despite the complexity, one still wishes to use two networks, each +:term:`Ceph Node` will need to have more than one network interface or VLAN. See `Hardware +Recommendations - Networks`_ for additional details. + +.. ditaa:: + +-------------+ + | Ceph Client | + +----*--*-----+ + | ^ + Request | : Response + v | + /----------------------------------*--*-------------------------------------\ + | Public Network | + \---*--*------------*--*-------------*--*------------*--*------------*--*---/ + ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ + | | | | | | | | | | + | : | : | : | : | : + v v v v v v v v v v + +---*--*---+ +---*--*---+ +---*--*---+ +---*--*---+ +---*--*---+ + | Ceph MON | | Ceph MDS | | Ceph OSD | | Ceph OSD | | Ceph OSD | + +----------+ +----------+ +---*--*---+ +---*--*---+ +---*--*---+ + ^ ^ ^ ^ ^ ^ + The cluster network relieves | | | | | | + OSD replication and heartbeat | : | : | : + traffic from the public network. v v v v v v + /------------------------------------*--*------------*--*------------*--*---\ + | cCCC Cluster Network | + \---------------------------------------------------------------------------/ + + +IP Tables +========= + +By default, daemons `bind`_ to ports within the ``6800:7300`` range. You may +configure this range at your discretion. Before configuring your IP tables, +check the default ``iptables`` configuration. + +.. prompt:: bash $ + + sudo iptables -L + +Some Linux distributions include rules that reject all inbound requests +except SSH from all network interfaces. For example:: + + REJECT all -- anywhere anywhere reject-with icmp-host-prohibited + +You will need to delete these rules on both your public and cluster networks +initially, and replace them with appropriate rules when you are ready to +harden the ports on your Ceph Nodes.
+ + +Monitor IP Tables +----------------- + +Ceph Monitors listen on ports ``3300`` and ``6789`` by +default. Additionally, Ceph Monitors always operate on the public +network. When you add the rule using the example below, make sure you +replace ``{iface}`` with the public network interface (e.g., ``eth0``, +``eth1``, etc.), ``{ip-address}`` with the IP address of the public +network and ``{netmask}`` with the netmask for the public network. + +.. prompt:: bash $ + + sudo iptables -A INPUT -i {iface} -p tcp -s {ip-address}/{netmask} --dport 6789 -j ACCEPT + + +MDS and Manager IP Tables +------------------------- + +A :term:`Ceph Metadata Server` or :term:`Ceph Manager` listens on the first +available port on the public network beginning at port 6800. Note that this +behavior is not deterministic, so if you are running more than one OSD or MDS +on the same host, or if you restart the daemons within a short window of time, +the daemons will bind to higher ports. You should open the entire 6800-7300 +range by default. When you add the rule using the example below, make sure +you replace ``{iface}`` with the public network interface (e.g., ``eth0``, +``eth1``, etc.), ``{ip-address}`` with the IP address of the public network +and ``{netmask}`` with the netmask of the public network. + +For example: + +.. prompt:: bash $ + + sudo iptables -A INPUT -i {iface} -m multiport -p tcp -s {ip-address}/{netmask} --dports 6800:7300 -j ACCEPT + + +OSD IP Tables +------------- + +By default, Ceph OSD Daemons `bind`_ to the first available ports on a Ceph Node +beginning at port 6800. Note that this behavior is not deterministic, so if you +are running more than one OSD or MDS on the same host, or if you restart the +daemons within a short window of time, the daemons will bind to higher ports. +Each Ceph OSD Daemon on a Ceph Node may use up to four ports: + +#. One for talking to clients and monitors. +#. One for sending data to other OSDs. +#.
Two for heartbeating on each interface. + +.. ditaa:: + /---------------\ + | OSD | + | +---+----------------+-----------+ + | | Clients & Monitors | Heartbeat | + | +---+----------------+-----------+ + | | + | +---+----------------+-----------+ + | | Data Replication | Heartbeat | + | +---+----------------+-----------+ + | cCCC | + \---------------/ + +When a daemon fails and restarts without letting go of the port, the restarted +daemon will bind to a new port. You should open the entire 6800-7300 port range +to handle this possibility. + +If you set up separate public and cluster networks, you must add rules for both +the public network and the cluster network, because clients will connect using +the public network and other Ceph OSD Daemons will connect using the cluster +network. When you add the rule using the example below, make sure you replace +``{iface}`` with the network interface (e.g., ``eth0``, ``eth1``, etc.), +``{ip-address}`` with the IP address and ``{netmask}`` with the netmask of the +public or cluster network. For example: + +.. prompt:: bash $ + + sudo iptables -A INPUT -i {iface} -m multiport -p tcp -s {ip-address}/{netmask} --dports 6800:7300 -j ACCEPT + +.. tip:: If you run Ceph Metadata Servers on the same Ceph Node as the + Ceph OSD Daemons, you can consolidate the public network configuration step. + + +Ceph Networks +============= + +To configure Ceph networks, you must add a network configuration to the +``[global]`` section of the configuration file. Our 5-minute Quick Start +provides a trivial Ceph configuration file that assumes one public network +with client and server on the same network and subnet. Ceph functions just fine +with a public network only. However, Ceph allows you to establish much more +specific criteria, including multiple IP network and subnet masks for your +public network. You can also establish a separate cluster network to handle OSD +heartbeat, object replication and recovery traffic. 
Don't confuse the IP +addresses you set in your configuration with the public-facing IP addresses +network clients may use to access your service. Typical internal IP networks are +often ``192.168.0.0`` or ``10.0.0.0``. + +.. tip:: If you specify more than one IP address and subnet mask for + either the public or the cluster network, the subnets within the network + must be capable of routing to each other. Additionally, make sure you + include each IP address/subnet in your IP tables and open ports for them + as necessary. + +.. note:: Ceph uses `CIDR`_ notation for subnets (e.g., ``10.0.0.0/24``). + +When you have configured your networks, you may restart your cluster or restart +each daemon. Ceph daemons bind dynamically, so you do not have to restart the +entire cluster at once if you change your network configuration. + + +Public Network +-------------- + +To configure a public network, add the following option to the ``[global]`` +section of your Ceph configuration file. + +.. code-block:: ini + + [global] + # ... elided configuration + public_network = {public-network/netmask} + +.. _cluster-network: + +Cluster Network +--------------- + +If you declare a cluster network, OSDs will route heartbeat, object replication +and recovery traffic over the cluster network. This may improve performance +compared to using a single network. To configure a cluster network, add the +following option to the ``[global]`` section of your Ceph configuration file. + +.. code-block:: ini + + [global] + # ... elided configuration + cluster_network = {cluster-network/netmask} + +We prefer that the cluster network is **NOT** reachable from the public network +or the Internet for added security. + +IPv4/IPv6 Dual Stack Mode +------------------------- + +If you want to run in an IPv4/IPv6 dual stack mode and want to define your public and/or +cluster networks, then you need to specify both your IPv4 and IPv6 networks for each: + +.. code-block:: ini + + [global] + # ... 
elided configuration + public_network = {IPv4 public-network/netmask}, {IPv6 public-network/netmask} + +This is so that Ceph can find a valid IP address for both address families. + +If you want just an IPv4 or an IPv6 stack environment, then make sure you set the `ms bind` +options correctly. + +.. note:: + Binding to IPv4 is enabled by default, so if you just add the option to bind to IPv6 + you'll actually put yourself into dual stack mode. If you want just IPv6, then disable IPv4 and + enable IPv6. See `Bind`_ below. + +Ceph Daemons +============ + +Monitor daemons are each configured to bind to a specific IP address. These +addresses are normally configured by your deployment tool. Other components +in the Ceph cluster discover the monitors via the ``mon host`` configuration +option, normally specified in the ``[global]`` section of the ``ceph.conf`` file. + +.. code-block:: ini + + [global] + mon_host = 10.0.0.2, 10.0.0.3, 10.0.0.4 + +The ``mon_host`` value can be a list of IP addresses or a name that is +looked up via DNS. In the case of a DNS name with multiple A or AAAA +records, all records are probed in order to discover a monitor. Once +one monitor is reached, all other current monitors are discovered, so +the ``mon host`` configuration option only needs to be sufficiently up +to date such that a client can reach one monitor that is currently online. + +The MGR, OSD, and MDS daemons will bind to any available address and +do not require any special configuration. However, it is possible to +specify a specific IP address for them to bind to with the ``public +addr`` (and/or, in the case of OSD daemons, the ``cluster addr``) +configuration option. For example, + +.. code-block:: ini + + [osd.0] + public addr = {host-public-ip-address} + cluster addr = {host-cluster-ip-address} + +.. topic:: One NIC OSD in a Two Network Cluster + + Generally, we do not recommend deploying an OSD host with a single network interface in a + cluster with two networks. 
However, you may accomplish this by forcing the + OSD host to operate on the public network by adding a ``public_addr`` entry + to the ``[osd.n]`` section of the Ceph configuration file, where ``n`` + refers to the ID of the OSD with one network interface. Additionally, the public + network and cluster network must be able to route traffic to each other, + which we don't recommend for security reasons. + + +Network Config Settings +======================= + +Network configuration settings are not required. Ceph assumes a public network +with all hosts operating on it unless you specifically configure a cluster +network. + + +Public Network +-------------- + +The public network configuration allows you to specifically define IP addresses +and subnets for the public network. You may specifically assign static IP +addresses or override ``public_network`` settings using the ``public_addr`` +setting for a specific daemon. + +``public_network`` + +:Description: The IP address and netmask of the public (front-side) network + (e.g., ``192.168.0.0/24``). Set in ``[global]``. You may specify + comma-separated subnets. + +:Type: ``{ip-address}/{netmask} [, {ip-address}/{netmask}]`` +:Required: No +:Default: N/A + + +``public_addr`` + +:Description: The IP address for the public (front-side) network. + Set for each daemon. + +:Type: IP Address +:Required: No +:Default: N/A + + + +Cluster Network +--------------- + +The cluster network configuration allows you to declare a cluster network, and +specifically define IP addresses and subnets for the cluster network. You may +specifically assign static IP addresses or override ``cluster_network`` +settings using the ``cluster_addr`` setting for specific OSD daemons. + + +``cluster_network`` + +:Description: The IP address and netmask of the cluster (back-side) network + (e.g., ``10.0.0.0/24``). Set in ``[global]``. You may specify + comma-separated subnets.
+ +:Type: ``{ip-address}/{netmask} [, {ip-address}/{netmask}]`` +:Required: No +:Default: N/A + + +``cluster_addr`` + +:Description: The IP address for the cluster (back-side) network. + Set for each daemon. + +:Type: Address +:Required: No +:Default: N/A + + +Bind +---- + +Bind settings set the default port ranges Ceph OSD and MDS daemons use. The +default range is ``6800:7300``. Ensure that your `IP Tables`_ configuration +allows you to use the configured port range. + +You may also enable Ceph daemons to bind to IPv6 addresses instead of IPv4 +addresses. + + +``ms_bind_port_min`` + +:Description: The minimum port number to which an OSD or MDS daemon will bind. +:Type: 32-bit Integer +:Default: ``6800`` +:Required: No + + +``ms_bind_port_max`` + +:Description: The maximum port number to which an OSD or MDS daemon will bind. +:Type: 32-bit Integer +:Default: ``7300`` +:Required: No. + +``ms_bind_ipv4`` + +:Description: Enables Ceph daemons to bind to IPv4 addresses. +:Type: Boolean +:Default: ``true`` +:Required: No + +``ms_bind_ipv6`` + +:Description: Enables Ceph daemons to bind to IPv6 addresses. +:Type: Boolean +:Default: ``false`` +:Required: No + +``public_bind_addr`` + +:Description: In some dynamic deployments the Ceph MON daemon might bind + to an IP address locally that is different from the ``public_addr`` + advertised to other peers in the network. The environment must ensure + that routing rules are set correctly. If ``public_bind_addr`` is set + the Ceph Monitor daemon will bind to it locally and use ``public_addr`` + in the monmaps to advertise its address to peers. This behavior is limited + to the Monitor daemon. + +:Type: IP Address +:Required: No +:Default: N/A + + + +TCP +--- + +Ceph disables TCP buffering by default. + + +``ms_tcp_nodelay`` + +:Description: Ceph enables ``ms_tcp_nodelay`` so that each request is sent + immediately (no buffering). Disabling `Nagle's algorithm`_ + increases network traffic, which can introduce latency. 
If you + experience large numbers of small packets, you may try + disabling ``ms_tcp_nodelay``. + +:Type: Boolean +:Required: No +:Default: ``true`` + + +``ms_tcp_rcvbuf`` + +:Description: The size of the socket buffer on the receiving end of a network + connection. Disabled by default. + +:Type: 32-bit Integer +:Required: No +:Default: ``0`` + + +``ms_tcp_read_timeout`` + +:Description: If a client or daemon makes a request to another Ceph daemon and + does not drop an unused connection, the ``ms_tcp_read_timeout`` + defines the connection as idle after the specified number + of seconds. + +:Type: Unsigned 64-bit Integer +:Required: No +:Default: ``900`` (15 minutes). + + + +.. _Scalability and High Availability: ../../../architecture#scalability-and-high-availability +.. _Hardware Recommendations - Networks: ../../../start/hardware-recommendations#networks +.. _hardware recommendations: ../../../start/hardware-recommendations +.. _Monitor / OSD Interaction: ../mon-osd-interaction +.. _Message Signatures: ../auth-config-ref#signatures +.. _CIDR: https://en.wikipedia.org/wiki/Classless_Inter-Domain_Routing +.. _Nagle's Algorithm: https://en.wikipedia.org/wiki/Nagle's_algorithm diff --git a/doc/rados/configuration/osd-config-ref.rst b/doc/rados/configuration/osd-config-ref.rst new file mode 100644 index 000000000..7f69b5e80 --- /dev/null +++ b/doc/rados/configuration/osd-config-ref.rst @@ -0,0 +1,1127 @@ +====================== + OSD Config Reference +====================== + +.. index:: OSD; configuration + +You can configure Ceph OSD Daemons in the Ceph configuration file (or in recent +releases, the central config store), but Ceph OSD +Daemons can use the default values and a very minimal configuration. A minimal +Ceph OSD Daemon configuration sets ``osd journal size`` (for Filestore), ``host``, and +uses default values for nearly everything else.
+ +Ceph OSD Daemons are numerically identified in incremental fashion, beginning +with ``0`` using the following convention. :: + + osd.0 + osd.1 + osd.2 + +In a configuration file, you may specify settings for all Ceph OSD Daemons in +the cluster by adding configuration settings to the ``[osd]`` section of your +configuration file. To add settings directly to a specific Ceph OSD Daemon +(e.g., ``host``), enter it in an OSD-specific section of your configuration +file. For example: + +.. code-block:: ini + + [osd] + osd_journal_size = 5120 + + [osd.0] + host = osd-host-a + + [osd.1] + host = osd-host-b + + +.. index:: OSD; config settings + +General Settings +================ + +The following settings provide a Ceph OSD Daemon's ID, and determine paths to +data and journals. Ceph deployment scripts typically generate the UUID +automatically. + +.. warning:: **DO NOT** change the default paths for data or journals, as it + makes it more problematic to troubleshoot Ceph later. + +When using Filestore, the journal size should be at least twice the product of the expected drive +speed multiplied by ``filestore_max_sync_interval``. However, the most common +practice is to partition the journal drive (often an SSD), and mount it such +that Ceph uses the entire partition for the journal. + + +``osd_uuid`` + +:Description: The universally unique identifier (UUID) for the Ceph OSD Daemon. +:Type: UUID +:Default: The UUID. +:Note: The ``osd_uuid`` applies to a single Ceph OSD Daemon. The ``fsid`` + applies to the entire cluster. + + +``osd_data`` + +:Description: The path to the OSDs data. You must create the directory when + deploying Ceph. You should mount a drive for OSD data at this + mount point. We do not recommend changing the default. + +:Type: String +:Default: ``/var/lib/ceph/osd/$cluster-$id`` + + +``osd_max_write_size`` + +:Description: The maximum size of a write in megabytes. 
+:Type: 32-bit Integer +:Default: ``90`` + + +``osd_max_object_size`` + +:Description: The maximum size of a RADOS object in bytes. +:Type: 32-bit Unsigned Integer +:Default: 128MB + + +``osd_client_message_size_cap`` + +:Description: The largest client data message allowed in memory. +:Type: 64-bit Unsigned Integer +:Default: 500MB default. ``500*1024L*1024L`` + + +``osd_class_dir`` + +:Description: The class path for RADOS class plug-ins. +:Type: String +:Default: ``$libdir/rados-classes`` + + +.. index:: OSD; file system + +File System Settings +==================== +Ceph builds and mounts file systems which are used for Ceph OSDs. + +``osd_mkfs_options {fs-type}`` + +:Description: Options used when creating a new Ceph Filestore OSD of type {fs-type}. + +:Type: String +:Default for xfs: ``-f -i 2048`` +:Default for other file systems: {empty string} + +For example:: + ``osd_mkfs_options_xfs = -f -d agcount=24`` + +``osd_mount_options {fs-type}`` + +:Description: Options used when mounting a Ceph Filestore OSD of type {fs-type}. + +:Type: String +:Default for xfs: ``rw,noatime,inode64`` +:Default for other file systems: ``rw, noatime`` + +For example:: + ``osd_mount_options_xfs = rw, noatime, inode64, logbufs=8`` + + +.. index:: OSD; journal settings + +Journal Settings +================ + +This section applies only to the older Filestore OSD back end. Since Luminous, +BlueStore has been the default and preferred back end. + +By default, Ceph expects that you will provision a Ceph OSD Daemon's journal at +the following path, which is usually a symlink to a device or partition:: + + /var/lib/ceph/osd/$cluster-$id/journal + +When using a single device type (for example, spinning drives), the journals +should be *colocated*: the logical volume (or partition) should be in the same +device as the ``data`` logical volume.
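The Filestore sizing rule given under General Settings (at least twice the product of the expected drive speed and ``filestore_max_sync_interval``) can be turned into a quick calculation. The sketch below is only an illustration of that arithmetic: the function name ``min_journal_size_mb`` is hypothetical, and it assumes the drive speed is expressed in megabytes per second and the sync interval in seconds, so that the result is in megabytes like ``osd_journal_size``.

```python
import math


def min_journal_size_mb(expected_throughput_mb_s: float,
                        filestore_max_sync_interval_s: float) -> int:
    """Lower bound for osd_journal_size (in MB) under the 2x rule.

    Assumes throughput in MB/s and the sync interval in seconds;
    both argument names are illustrative, not Ceph option names.
    """
    if expected_throughput_mb_s <= 0 or filestore_max_sync_interval_s <= 0:
        raise ValueError("throughput and sync interval must be positive")
    return math.ceil(
        2 * expected_throughput_mb_s * filestore_max_sync_interval_s
    )
```

As a worked example with made-up inputs, a drive that sustains 100 MB/s combined with a 5 second sync interval gives ``min_journal_size_mb(100, 5) == 1000``, which the 5120 MB default covers comfortably.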
+ +When using a mix of fast (SSDs, NVMe) devices with slower ones (like spinning +drives) it makes sense to place the journal on the faster device, while +``data`` occupies the slower device fully. + +The default ``osd_journal_size`` value is 5120 (5 gigabytes), but it can be +larger, in which case it will need to be set in the ``ceph.conf`` file. +A value of 10 gigabytes is common in practice:: + + osd_journal_size = 10240 + + +``osd_journal`` + +:Description: The path to the OSD's journal. This may be a path to a file or a + block device (such as a partition of an SSD). If it is a file, + you must create the directory to contain it. We recommend using a + separate fast device when the ``osd_data`` drive is an HDD. + +:Type: String +:Default: ``/var/lib/ceph/osd/$cluster-$id/journal`` + + +``osd_journal_size`` + +:Description: The size of the journal in megabytes. + +:Type: 32-bit Integer +:Default: ``5120`` + + +See `Journal Config Reference`_ for additional details. + + +Monitor OSD Interaction +======================= + +Ceph OSD Daemons check each other's heartbeats and report to monitors +periodically. Ceph can use default values in many cases. However, if your +network has latency issues, you may need to adopt longer intervals. See +`Configuring Monitor/OSD Interaction`_ for a detailed discussion of heartbeats. + + +Data Placement +============== + +See `Pool & PG Config Reference`_ for details. + + +.. index:: OSD; scrubbing + +Scrubbing +========= + +In addition to making multiple copies of objects, Ceph ensures data integrity by +scrubbing placement groups. Ceph scrubbing is analogous to ``fsck`` on the +object storage layer. For each placement group, Ceph generates a catalog of all +objects and compares each primary object and its replicas to ensure that no +objects are missing or mismatched. Light scrubbing (daily) checks the object +size and attributes. Deep scrubbing (weekly) reads the data and uses checksums +to ensure data integrity. 
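The scrub scheduling settings in this section (``osd_scrub_begin_hour``, ``osd_scrub_end_hour`` and ``osd_scrub_max_interval``) combine into a simple decision: a scrub may start inside the configured window, or at any time once the placement group is overdue. The Python sketch below illustrates that decision only; it is not the OSD's implementation, the function names are hypothetical, and two behaviors are assumptions that the text does not spell out: a window whose begin hour is later than its end hour is treated as wrapping past midnight, and any pair of equal hours is treated like the documented ``0``/``0`` case (the whole day).

```python
ONE_WEEK_S = 7 * 24 * 60 * 60  # documented default of osd_scrub_max_interval


def scrub_hour_permitted(hour: int, begin_hour: int = 0,
                         end_hour: int = 0) -> bool:
    """Return True if `hour` (0-23) falls inside the scrub window."""
    for name, value in (("hour", hour), ("begin_hour", begin_hour),
                        ("end_hour", end_hour)):
        if not 0 <= value <= 23:
            raise ValueError(f"{name} must be in the range 0 to 23")
    if begin_hour == end_hour:
        # 0/0 is documented as "entire day"; other equal pairs: assumption.
        return True
    if begin_hour < end_hour:
        return begin_hour <= hour < end_hour
    # Assumption: begin > end means the window wraps past midnight.
    return hour >= begin_hour or hour < end_hour


def scrub_allowed(hour: int, begin_hour: int, end_hour: int,
                  seconds_since_last_scrub: float,
                  max_interval_s: float = ONE_WEEK_S) -> bool:
    """Window check, overridden once the PG exceeds the max interval."""
    if seconds_since_last_scrub > max_interval_s:
        return True
    return scrub_hour_permitted(hour, begin_hour, end_hour)
```

With a window of 01:00 to 06:00, ``scrub_allowed(12, 1, 6, seconds_since_last_scrub=3600)`` is ``False``, but the same call with a placement group last scrubbed eight days ago returns ``True`` because the maximum interval has been exceeded.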
+ +Scrubbing is important for maintaining data integrity, but it can reduce +performance. You can adjust the following settings to increase or decrease +scrubbing operations. + + +``osd_max_scrubs`` + +:Description: The maximum number of simultaneous scrub operations for + a Ceph OSD Daemon. + +:Type: 32-bit Int +:Default: ``1`` + +``osd_scrub_begin_hour`` + +:Description: This restricts scrubbing to this hour of the day or later. + Use ``osd_scrub_begin_hour = 0`` and ``osd_scrub_end_hour = 0`` + to allow scrubbing the entire day. Along with ``osd_scrub_end_hour``, they define a time + window, in which the scrubs can happen. + But a scrub will be performed + no matter whether the time window allows or not, as long as the placement + group's scrub interval exceeds ``osd_scrub_max_interval``. +:Type: Integer in the range of 0 to 23 +:Default: ``0`` + + +``osd_scrub_end_hour`` + +:Description: This restricts scrubbing to the hour earlier than this. + Use ``osd_scrub_begin_hour = 0`` and ``osd_scrub_end_hour = 0`` to allow scrubbing + for the entire day. Along with ``osd_scrub_begin_hour``, they define a time + window, in which the scrubs can happen. But a scrub will be performed + no matter whether the time window allows or not, as long as the placement + group's scrub interval exceeds ``osd_scrub_max_interval``. +:Type: Integer in the range of 0 to 23 +:Default: ``0`` + + +``osd_scrub_begin_week_day`` + +:Description: This restricts scrubbing to this day of the week or later. + 0 = Sunday, 1 = Monday, etc. Use ``osd_scrub_begin_week_day = 0`` + and ``osd_scrub_end_week_day = 0`` to allow scrubbing for the entire week. + Along with ``osd_scrub_end_week_day``, they define a time window in which + scrubs can happen. But a scrub will be performed + no matter whether the time window allows or not, when the PG's + scrub interval exceeds ``osd_scrub_max_interval``. 
+:Type: Integer in the range of 0 to 6 +:Default: ``0`` + + +``osd_scrub_end_week_day`` + +:Description: This restricts scrubbing to days of the week earlier than this. + 0 = Sunday, 1 = Monday, etc. Use ``osd_scrub_begin_week_day = 0`` + and ``osd_scrub_end_week_day = 0`` to allow scrubbing for the entire week. + Along with ``osd_scrub_begin_week_day``, they define a time + window, in which the scrubs can happen. But a scrub will be performed + no matter whether the time window allows or not, as long as the placement + group's scrub interval exceeds ``osd_scrub_max_interval``. +:Type: Integer in the range of 0 to 6 +:Default: ``0`` + + +``osd_scrub_during_recovery`` + +:Description: Allow scrub during recovery. Setting this to ``false`` will disable + scheduling new scrubs (and deep-scrubs) while there is active recovery. + Already running scrubs will continue. This might be useful to reduce + load on busy clusters. +:Type: Boolean +:Default: ``false`` + + +``osd_scrub_thread_timeout`` + +:Description: The maximum time in seconds before timing out a scrub thread. +:Type: 32-bit Integer +:Default: ``60`` + + +``osd_scrub_finalize_thread_timeout`` + +:Description: The maximum time in seconds before timing out a scrub finalize + thread. + +:Type: 32-bit Integer +:Default: ``10*60`` + + +``osd_scrub_load_threshold`` + +:Description: The normalized maximum load. Ceph will not scrub when the system load + (as defined by ``getloadavg() / number of online CPUs``) is higher than this number. + Default is ``0.5``. + +:Type: Float +:Default: ``0.5`` + + +``osd_scrub_min_interval`` + +:Description: The minimal interval in seconds for scrubbing the Ceph OSD Daemon + when the Ceph Storage Cluster load is low. + +:Type: Float +:Default: Once per day. ``24*60*60`` + +.. _osd_scrub_max_interval: + +``osd_scrub_max_interval`` + +:Description: The maximum interval in seconds for scrubbing the Ceph OSD Daemon + irrespective of cluster load.
+ +:Type: Float +:Default: Once per week. ``7*24*60*60`` + + +``osd_scrub_chunk_min`` + +:Description: The minimal number of object store chunks to scrub during a single operation. + Ceph blocks writes to a single chunk during a scrub. + +:Type: 32-bit Integer +:Default: 5 + + +``osd_scrub_chunk_max`` + +:Description: The maximum number of object store chunks to scrub during a single operation. + +:Type: 32-bit Integer +:Default: 25 + + +``osd_scrub_sleep`` + +:Description: Time in seconds to sleep before scrubbing the next group of chunks. Increasing this value will slow + down the overall rate of scrubbing so that client operations will be less impacted. + +:Type: Float +:Default: 0 + + +``osd_deep_scrub_interval`` + +:Description: The interval for "deep" scrubbing (fully reading all data). The + ``osd_scrub_load_threshold`` does not affect this setting. + +:Type: Float +:Default: Once per week. ``7*24*60*60`` + + +``osd_scrub_interval_randomize_ratio`` + +:Description: Add a random delay to ``osd_scrub_min_interval`` when scheduling + the next scrub job for a PG. The delay is a random + value less than ``osd_scrub_min_interval`` \* + ``osd_scrub_interval_randomize_ratio``. The default setting + spreads scrubs throughout the allowed time + window of ``[1, 1.5]`` \* ``osd_scrub_min_interval``. +:Type: Float +:Default: ``0.5`` + +``osd_deep_scrub_stride`` + +:Description: Read size when doing a deep scrub. +:Type: 32-bit Integer +:Default: 512 KB. ``524288`` + + +``osd_scrub_auto_repair`` + +:Description: Setting this to ``true`` will enable automatic PG repair when errors + are found by scrubs or deep-scrubs. However, if more than + ``osd_scrub_auto_repair_num_errors`` errors are found a repair is NOT performed. +:Type: Boolean +:Default: ``false`` + + +``osd_scrub_auto_repair_num_errors`` + +:Description: Auto repair will not occur if more than this many errors are found. +:Type: 32-bit Integer +:Default: ``5`` + + +.. 
index:: OSD; operations settings + +Operations +========== + + ``osd_op_queue`` + +:Description: This sets the type of queue to be used for prioritizing ops + within each OSD. Both queues feature a strict sub-queue which is + dequeued before the normal queue. The normal queue is different + between implementations. The WeightedPriorityQueue (``wpq``) + dequeues operations in relation to their priorities to prevent + starvation of any queue. WPQ should help in cases where a few OSDs + are more overloaded than others. The new mClockQueue + (``mclock_scheduler``) prioritizes operations based on which class + they belong to (recovery, scrub, snaptrim, client op, osd subop). + See `QoS Based on mClock`_. Requires a restart. + +:Type: String +:Valid Choices: wpq, mclock_scheduler +:Default: ``wpq`` + + +``osd_op_queue_cut_off`` + +:Description: This selects which priority ops will be sent to the strict + queue versus the normal queue. The ``low`` setting sends all + replication ops and higher to the strict queue, while the ``high`` + option sends only replication acknowledgment ops and higher to + the strict queue. Setting this to ``high`` should help when a few + OSDs in the cluster are very busy, especially when combined with + ``wpq`` in the ``osd_op_queue`` setting. OSDs that are very busy + handling replication traffic could starve primary client traffic + on these OSDs without these settings. Requires a restart. + +:Type: String +:Valid Choices: low, high +:Default: ``high`` + + +``osd_client_op_priority`` + +:Description: The priority set for client operations. This value is relative + to that of ``osd_recovery_op_priority`` below. The default + strongly favors client ops over recovery. + +:Type: 32-bit Integer +:Default: ``63`` +:Valid Range: 1-63 + + +``osd_recovery_op_priority`` + +:Description: The priority of recovery operations vs client operations, if not specified by the + pool's ``recovery_op_priority``. 
The default value prioritizes client + ops (see above) over recovery ops. You may adjust the tradeoff of client + impact against the time to restore cluster health by lowering this value + for increased prioritization of client ops, or by increasing it to favor + recovery. + +:Type: 32-bit Integer +:Default: ``3`` +:Valid Range: 1-63 + + +``osd_scrub_priority`` + +:Description: The default work queue priority for scheduled scrubs when the + pool doesn't specify a value of ``scrub_priority``. This can be + boosted to the value of ``osd_client_op_priority`` when scrubs are + blocking client operations. + +:Type: 32-bit Integer +:Default: ``5`` +:Valid Range: 1-63 + + +``osd_requested_scrub_priority`` + +:Description: The priority set for user requested scrub on the work queue. If + this value were to be smaller than ``osd_client_op_priority`` it + can be boosted to the value of ``osd_client_op_priority`` when + scrub is blocking client operations. + +:Type: 32-bit Integer +:Default: ``120`` + + +``osd_snap_trim_priority`` + +:Description: The priority set for the snap trim work queue. + +:Type: 32-bit Integer +:Default: ``5`` +:Valid Range: 1-63 + +``osd_snap_trim_sleep`` + +:Description: Time in seconds to sleep before next snap trim op. + Increasing this value will slow down snap trimming. + This option overrides backend specific variants. + +:Type: Float +:Default: ``0`` + + +``osd_snap_trim_sleep_hdd`` + +:Description: Time in seconds to sleep before next snap trim op + for HDDs. + +:Type: Float +:Default: ``5`` + + +``osd_snap_trim_sleep_ssd`` + +:Description: Time in seconds to sleep before next snap trim op + for SSD OSDs (including NVMe). + +:Type: Float +:Default: ``0`` + + +``osd_snap_trim_sleep_hybrid`` + +:Description: Time in seconds to sleep before next snap trim op + when OSD data is on an HDD and the OSD journal or WAL+DB is on an SSD. 
+ +:Type: Float +:Default: ``2`` + +``osd_op_thread_timeout`` + +:Description: The Ceph OSD Daemon operation thread timeout in seconds. +:Type: 32-bit Integer +:Default: ``15`` + + +``osd_op_complaint_time`` + +:Description: An operation becomes complaint worthy after the specified number + of seconds have elapsed. + +:Type: Float +:Default: ``30`` + + +``osd_op_history_size`` + +:Description: The maximum number of completed operations to track. +:Type: 32-bit Unsigned Integer +:Default: ``20`` + + +``osd_op_history_duration`` + +:Description: The oldest completed operation to track. +:Type: 32-bit Unsigned Integer +:Default: ``600`` + + +``osd_op_log_threshold`` + +:Description: How many operations logs to display at once. +:Type: 32-bit Integer +:Default: ``5`` + + +.. _dmclock-qos: + +QoS Based on mClock +------------------- + +Ceph's use of mClock is now more refined and can be used by following the +steps as described in `mClock Config Reference`_. + +Core Concepts +````````````` + +Ceph's QoS support is implemented using a queueing scheduler +based on `the dmClock algorithm`_. This algorithm allocates the I/O +resources of the Ceph cluster in proportion to weights, and enforces +the constraints of minimum reservation and maximum limitation, so that +the services can compete for the resources fairly. Currently the +*mclock_scheduler* operation queue divides Ceph services involving I/O +resources into following buckets: + +- client op: the iops issued by client +- osd subop: the iops issued by primary OSD +- snap trim: the snap trimming related requests +- pg recovery: the recovery related requests +- pg scrub: the scrub related requests + +And the resources are partitioned using following three sets of tags. In other +words, the share of each type of service is controlled by three tags: + +#. reservation: the minimum IOPS allocated for the service. +#. limitation: the maximum IOPS allocated for the service. +#. 
weight: the proportional share of capacity if there is extra capacity or the system is + oversubscribed. + +In Ceph, operations are graded with "cost". And the resources allocated +for serving various services are consumed by these "costs". So, for +example, the more reservation a service has, the more resources it is +guaranteed to possess, for as long as it requires them. Assuming there are 2 +services: recovery and client ops: + +- recovery: (r:1, l:5, w:1) +- client ops: (r:2, l:0, w:9) + +The settings above ensure that recovery won't get more than 5 +requests per second serviced, even if it requires more (see CURRENT +IMPLEMENTATION NOTE below) and no other services are competing with +it. But if the clients start to issue a large number of I/O requests, +they will not exhaust all the I/O resources either. 1 request per second +is always allocated for recovery jobs as long as there are any such +requests. So the recovery jobs won't be starved even in a cluster with +high load. And in the meantime, the client ops can enjoy a larger +portion of the I/O resource, because their weight is "9", while their +competitor's is "1". The client ops are not clamped by the +limit setting, so they can make use of all the resources if there is no +recovery ongoing. + +CURRENT IMPLEMENTATION NOTE: the current implementation enforces the limit +values. Therefore, if a service crosses the enforced limit, the op remains +in the operation queue until the limit is restored. + +Subtleties of mClock +```````````````````` + +The reservation and limit values have a unit of requests per +second. The weight, however, does not technically have a unit and the +weights are relative to one another. So if one class of requests has a +weight of 1 and another a weight of 9, then the latter class of +requests should be executed at a 9 to 1 ratio relative to the first class. +However that will only happen once the reservations are met and those +values include the operations executed under the reservation phase. 
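The 9 to 1 ratio can be illustrated with a minimal sketch of weight-tag scheduling. It is a simplification, not Ceph's implementation: it assumes both classes always have requests queued, it ignores the reservation and limit tags, and it does not clamp tags to the current time. The function name ``simulate_weight_sharing`` is hypothetical.

```python
import heapq
from fractions import Fraction


def simulate_weight_sharing(weights, dispatches):
    """Count how many ops each class gets when ops are picked by smallest weight tag.

    Assumes every class is always backlogged. Each op's tag is the class's
    previous tag plus 1/W; exact fractions avoid floating-point ties.
    """
    # The first op of each class gets the tag 1/W.
    heap = [(Fraction(1, w), name) for name, w in weights.items()]
    heapq.heapify(heap)
    served = {name: 0 for name in weights}
    for _ in range(dispatches):
        tag, name = heapq.heappop(heap)  # dispatch the op with the smallest tag
        served[name] += 1
        # The class's next op is tagged 1/W later than the one just served.
        heapq.heappush(heap, (tag + Fraction(1, weights[name]), name))
    return served


if __name__ == "__main__":
    print(simulate_weight_sharing({"recovery": 1, "client": 9}, 100))
```

Under these assumptions, a class with a larger weight advances its tags in smaller steps, so it is selected proportionally more often.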
+ +Even though the weights do not have units, one must be careful in +choosing their values due to how the algorithm assigns weight tags to +requests. If the weight is *W*, then for a given class of requests, +the next one that comes in will have a weight tag of *1/W* plus the +previous weight tag or the current time, whichever is larger. That +means if *W* is sufficiently large and therefore *1/W* is sufficiently +small, the calculated tag may never be assigned as it will get a value +of the current time. The ultimate lesson is that values for weight +should not be too large. They should be under the number of requests +one expects to be serviced each second. + +Caveats +``````` + +There are some factors that can reduce the impact of the mClock op +queues within Ceph. First, requests to an OSD are sharded by their +placement group identifier. Each shard has its own mClock queue and +these queues neither interact nor share information among them. The +number of shards can be controlled with the configuration options +``osd_op_num_shards``, ``osd_op_num_shards_hdd``, and +``osd_op_num_shards_ssd``. A lower number of shards will increase the +impact of the mClock queues, but may have other deleterious effects. + +Second, requests are transferred from the operation queue to the +operation sequencer, in which they go through the phases of +execution. The operation queue is where mClock resides and mClock +determines the next op to transfer to the operation sequencer. The +number of operations allowed in the operation sequencer is a complex +issue. In general we want to keep enough operations in the sequencer +so it's always getting work done on some operations while it's waiting +for disk and network access to complete on other operations. On the +other hand, once an operation is transferred to the operation +sequencer, mClock no longer has control over it. 
Therefore to maximize +the impact of mClock, we want to keep as few operations in the +operation sequencer as possible. So we have an inherent tension. + +The configuration options that influence the number of operations in +the operation sequencer are ``bluestore_throttle_bytes``, +``bluestore_throttle_deferred_bytes``, +``bluestore_throttle_cost_per_io``, +``bluestore_throttle_cost_per_io_hdd``, and +``bluestore_throttle_cost_per_io_ssd``. + +A third factor that affects the impact of the mClock algorithm is that +we're using a distributed system, where requests are made to multiple +OSDs and each OSD has (can have) multiple shards. Yet we're currently +using the mClock algorithm, which is not distributed (note: dmClock is +the distributed version of mClock). + +Various organizations and individuals are currently experimenting with +mClock as it exists in this code base along with their modifications +to the code base. We hope you'll share your experiences with your +mClock and dmClock experiments on the ``ceph-devel`` mailing list. + + +``osd_push_per_object_cost`` + +:Description: the overhead for serving a push op + +:Type: Unsigned Integer +:Default: 1000 + + +``osd_recovery_max_chunk`` + +:Description: the maximum total size of data chunks a recovery op can carry. + +:Type: Unsigned Integer +:Default: 8 MiB + + +``osd_mclock_scheduler_client_res`` + +:Description: IO proportion reserved for each client (default). + +:Type: Unsigned Integer +:Default: 1 + + +``osd_mclock_scheduler_client_wgt`` + +:Description: IO share for each client (default) over reservation. + +:Type: Unsigned Integer +:Default: 1 + + +``osd_mclock_scheduler_client_lim`` + +:Description: IO limit for each client (default) over reservation. + +:Type: Unsigned Integer +:Default: 999999 + + +``osd_mclock_scheduler_background_recovery_res`` + +:Description: IO proportion reserved for background recovery (default). 
+ +:Type: Unsigned Integer +:Default: 1 + + +``osd_mclock_scheduler_background_recovery_wgt`` + +:Description: IO share for each background recovery over reservation. + +:Type: Unsigned Integer +:Default: 1 + + +``osd_mclock_scheduler_background_recovery_lim`` + +:Description: IO limit for background recovery over reservation. + +:Type: Unsigned Integer +:Default: 999999 + + +``osd_mclock_scheduler_background_best_effort_res`` + +:Description: IO proportion reserved for background best_effort (default). + +:Type: Unsigned Integer +:Default: 1 + + +``osd_mclock_scheduler_background_best_effort_wgt`` + +:Description: IO share for each background best_effort over reservation. + +:Type: Unsigned Integer +:Default: 1 + + +``osd_mclock_scheduler_background_best_effort_lim`` + +:Description: IO limit for background best_effort over reservation. + +:Type: Unsigned Integer +:Default: 999999 + +.. _the dmClock algorithm: https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Gulati.pdf + + +.. index:: OSD; backfilling + +Backfilling +=========== + +When you add or remove Ceph OSD Daemons to a cluster, CRUSH will +rebalance the cluster by moving placement groups to or from Ceph OSDs +to restore balanced utilization. The process of migrating placement groups and +the objects they contain can reduce the cluster's operational performance +considerably. To maintain operational performance, Ceph performs this migration +with 'backfilling', which allows Ceph to set backfill operations to a lower +priority than requests to read or write data. + + +``osd_max_backfills`` + +:Description: The maximum number of backfills allowed to or from a single OSD. + Note that this is applied separately for read and write operations. +:Type: 64-bit Unsigned Integer +:Default: ``1`` + + +``osd_backfill_scan_min`` + +:Description: The minimum number of objects per backfill scan. 
+ +:Type: 32-bit Integer +:Default: ``64`` + + +``osd_backfill_scan_max`` + +:Description: The maximum number of objects per backfill scan. + +:Type: 32-bit Integer +:Default: ``512`` + + +``osd_backfill_retry_interval`` + +:Description: The number of seconds to wait before retrying backfill requests. +:Type: Double +:Default: ``10.0`` + +.. index:: OSD; osdmap + +OSD Map +======= + +OSD maps reflect the OSD daemons operating in the cluster. Over time, the +number of map epochs increases. Ceph provides some settings to ensure that +Ceph performs well as the OSD map grows larger. + + +``osd_map_dedup`` + +:Description: Enable removing duplicates in the OSD map. +:Type: Boolean +:Default: ``true`` + + +``osd_map_cache_size`` + +:Description: The number of OSD maps to keep cached. +:Type: 32-bit Integer +:Default: ``50`` + + +``osd_map_message_max`` + +:Description: The maximum map entries allowed per MOSDMap message. +:Type: 32-bit Integer +:Default: ``40`` + + + +.. index:: OSD; recovery + +Recovery +======== + +When the cluster starts or when a Ceph OSD Daemon crashes and restarts, the OSD +begins peering with other Ceph OSD Daemons before writes can occur. See +`Monitoring OSDs and PGs`_ for details. + +If a Ceph OSD Daemon crashes and comes back online, usually it will be out of +sync with other Ceph OSD Daemons containing more recent versions of objects in +the placement groups. When this happens, the Ceph OSD Daemon goes into recovery +mode and seeks to get the latest copy of the data and bring its map back up to +date. Depending upon how long the Ceph OSD Daemon was down, the OSD's objects +and placement groups may be significantly out of date. Also, if a failure domain +went down (e.g., a rack), more than one Ceph OSD Daemon may come back online at +the same time. This can make the recovery process time consuming and resource +intensive. 
+ +To maintain operational performance, Ceph performs recovery with limitations on +the number of recovery requests, threads, and object chunk sizes, which allows Ceph +to perform well in a degraded state. + + +``osd_recovery_delay_start`` + +:Description: After peering completes, Ceph will delay for the specified number + of seconds before starting to recover RADOS objects. + +:Type: Float +:Default: ``0`` + + +``osd_recovery_max_active`` + +:Description: The number of active recovery requests per OSD at one time. More + requests will accelerate recovery, but the requests place an + increased load on the cluster. + + This value is only used if it is non-zero. Normally it + is ``0``, which means that the ``hdd`` or ``ssd`` values + (below) are used, depending on the type of the primary + device backing the OSD. + +:Type: 32-bit Integer +:Default: ``0`` + +``osd_recovery_max_active_hdd`` + +:Description: The number of active recovery requests per OSD at one time, if the + primary device is rotational. + +:Type: 32-bit Integer +:Default: ``3`` + +``osd_recovery_max_active_ssd`` + +:Description: The number of active recovery requests per OSD at one time, if the + primary device is non-rotational (i.e., an SSD). + +:Type: 32-bit Integer +:Default: ``10`` + + +``osd_recovery_max_chunk`` + +:Description: The maximum size of a recovered chunk of data to push. +:Type: 64-bit Unsigned Integer +:Default: ``8 << 20`` + + +``osd_recovery_max_single_start`` + +:Description: The maximum number of recovery operations per OSD that will be + newly started when an OSD is recovering. +:Type: 64-bit Unsigned Integer +:Default: ``1`` + + +``osd_recovery_thread_timeout`` + +:Description: The maximum time in seconds before timing out a recovery thread. +:Type: 32-bit Integer +:Default: ``30`` + + +``osd_recover_clone_overlap`` + +:Description: Preserves clone overlap during recovery. Should always be set + to ``true``. 
+ +:Type: Boolean +:Default: ``true`` + + +``osd_recovery_sleep`` + +:Description: Time in seconds to sleep before the next recovery or backfill op. + Increasing this value will slow down recovery operation while + client operations will be less impacted. + +:Type: Float +:Default: ``0`` + + +``osd_recovery_sleep_hdd`` + +:Description: Time in seconds to sleep before next recovery or backfill op + for HDDs. + +:Type: Float +:Default: ``0.1`` + + +``osd_recovery_sleep_ssd`` + +:Description: Time in seconds to sleep before the next recovery or backfill op + for SSDs. + +:Type: Float +:Default: ``0`` + + +``osd_recovery_sleep_hybrid`` + +:Description: Time in seconds to sleep before the next recovery or backfill op + when OSD data is on HDD and OSD journal / WAL+DB is on SSD. + +:Type: Float +:Default: ``0.025`` + + +``osd_recovery_priority`` + +:Description: The default priority set for recovery work queue. Not + related to a pool's ``recovery_priority``. + +:Type: 32-bit Integer +:Default: ``5`` + + +Tiering +======= + +``osd_agent_max_ops`` + +:Description: The maximum number of simultaneous flushing ops per tiering agent + in the high speed mode. +:Type: 32-bit Integer +:Default: ``4`` + + +``osd_agent_max_low_ops`` + +:Description: The maximum number of simultaneous flushing ops per tiering agent + in the low speed mode. +:Type: 32-bit Integer +:Default: ``2`` + +See `cache target dirty high ratio`_ for when the tiering agent flushes dirty +objects within the high speed mode. + +Miscellaneous +============= + + +``osd_snap_trim_thread_timeout`` + +:Description: The maximum time in seconds before timing out a snap trim thread. +:Type: 32-bit Integer +:Default: ``1*60*60`` + + +``osd_backlog_thread_timeout`` + +:Description: The maximum time in seconds before timing out a backlog thread. +:Type: 32-bit Integer +:Default: ``1*60*60`` + + +``osd_default_notify_timeout`` + +:Description: The OSD default notification timeout (in seconds). 
+:Type: 32-bit Unsigned Integer +:Default: ``30`` + + +``osd_check_for_log_corruption`` + +:Description: Check log files for corruption. Can be computationally expensive. +:Type: Boolean +:Default: ``false`` + + +``osd_remove_thread_timeout`` + +:Description: The maximum time in seconds before timing out a remove OSD thread. +:Type: 32-bit Integer +:Default: ``60*60`` + + +``osd_command_thread_timeout`` + +:Description: The maximum time in seconds before timing out a command thread. +:Type: 32-bit Integer +:Default: ``10*60`` + + +``osd_delete_sleep`` + +:Description: Time in seconds to sleep before the next removal transaction. This + throttles the PG deletion process. + +:Type: Float +:Default: ``0`` + + +``osd_delete_sleep_hdd`` + +:Description: Time in seconds to sleep before the next removal transaction + for HDDs. + +:Type: Float +:Default: ``5`` + + +``osd_delete_sleep_ssd`` + +:Description: Time in seconds to sleep before the next removal transaction + for SSDs. + +:Type: Float +:Default: ``0`` + + +``osd_delete_sleep_hybrid`` + +:Description: Time in seconds to sleep before the next removal transaction + when OSD data is on HDD and OSD journal or WAL+DB is on SSD. + +:Type: Float +:Default: ``1`` + + +``osd_command_max_records`` + +:Description: Limits the number of lost objects to return. +:Type: 32-bit Integer +:Default: ``256`` + + +``osd_fast_fail_on_connection_refused`` + +:Description: If this option is enabled, crashed OSDs are marked down + immediately by connected peers and MONs (assuming that the + crashed OSD host survives). Disable it to restore old + behavior, at the expense of possible long I/O stalls when + OSDs crash in the middle of I/O operations. +:Type: Boolean +:Default: ``true`` + + + +.. _pool: ../../operations/pools +.. _Configuring Monitor/OSD Interaction: ../mon-osd-interaction +.. _Monitoring OSDs and PGs: ../../operations/monitoring-osd-pg#peering +.. _Pool & PG Config Reference: ../pool-pg-config-ref +.. 
_Journal Config Reference: ../journal-ref +.. _cache target dirty high ratio: ../../operations/pools#cache-target-dirty-high-ratio +.. _mClock Config Reference: ../mclock-config-ref diff --git a/doc/rados/configuration/pool-pg-config-ref.rst b/doc/rados/configuration/pool-pg-config-ref.rst new file mode 100644 index 000000000..b4b5df478 --- /dev/null +++ b/doc/rados/configuration/pool-pg-config-ref.rst @@ -0,0 +1,282 @@ +====================================== + Pool, PG and CRUSH Config Reference +====================================== + +.. index:: pools; configuration + +When you create pools and set the number of placement groups (PGs) for each, Ceph +uses default values when you don't specifically override the defaults. **We +recommend** overriding some of the defaults. Specifically, we recommend setting +a pool's replica size and overriding the default number of placement groups. You +can specifically set these values when running `pool`_ commands. You can also +override the defaults by adding new ones in the ``[global]`` section of your +Ceph configuration file. + + +.. literalinclude:: pool-pg.conf + :language: ini + + +``mon_max_pool_pg_num`` + +:Description: The maximum number of placement groups per pool. +:Type: Integer +:Default: ``65536`` + + +``mon_pg_create_interval`` + +:Description: Number of seconds between PG creation in the same + Ceph OSD Daemon. + +:Type: Float +:Default: ``30.0`` + + +``mon_pg_stuck_threshold`` + +:Description: Number of seconds after which PGs can be considered as + being stuck. + +:Type: 32-bit Integer +:Default: ``300`` + +``mon_pg_min_inactive`` + +:Description: Raise ``HEALTH_ERR`` if the count of PGs that have been + inactive longer than the ``mon_pg_stuck_threshold`` exceeds this + setting. A non-positive number means disabled, never go into ERR. +:Type: Integer +:Default: ``1`` + + +``mon_pg_warn_min_per_osd`` + +:Description: Raise ``HEALTH_WARN`` if the average number + of PGs per ``in`` OSD is under this number. 
A non-positive number + disables this. +:Type: Integer +:Default: ``30`` + + +``mon_pg_warn_min_objects`` + +:Description: Do not warn if the total number of RADOS objects in cluster is below + this number +:Type: Integer +:Default: ``1000`` + + +``mon_pg_warn_min_pool_objects`` + +:Description: Do not warn on pools whose RADOS object count is below this number +:Type: Integer +:Default: ``1000`` + + +``mon_pg_check_down_all_threshold`` + +:Description: Percentage threshold of ``down`` OSDs above which we check all PGs + for stale ones. +:Type: Float +:Default: ``0.5`` + + +``mon_pg_warn_max_object_skew`` + +:Description: Raise ``HEALTH_WARN`` if the average RADOS object count per PG + of any pool is greater than ``mon_pg_warn_max_object_skew`` times + the average RADOS object count per PG of all pools. Zero or a non-positive + number disables this. Note that this option applies to ``ceph-mgr`` daemons. +:Type: Float +:Default: ``10`` + + +``mon_delta_reset_interval`` + +:Description: Seconds of inactivity before we reset the PG delta to 0. We keep + track of the delta of the used space of each pool, so, for + example, it would be easier for us to understand the progress of + recovery or the performance of cache tier. But if there's no + activity reported for a certain pool, we just reset the history of + deltas of that pool. +:Type: Integer +:Default: ``10`` + + +``mon_osd_max_op_age`` + +:Description: Maximum op age before we get concerned (make it a power of 2). + ``HEALTH_WARN`` will be raised if a request has been blocked longer + than this limit. +:Type: Float +:Default: ``32.0`` + + +``osd_pg_bits`` + +:Description: Placement group bits per Ceph OSD Daemon. +:Type: 32-bit Integer +:Default: ``6`` + + +``osd_pgp_bits`` + +:Description: The number of bits per Ceph OSD Daemon for PGPs. +:Type: 32-bit Integer +:Default: ``6`` + + +``osd_crush_chooseleaf_type`` + +:Description: The bucket type to use for ``chooseleaf`` in a CRUSH rule. 
Uses + ordinal rank rather than name. + +:Type: 32-bit Integer +:Default: ``1``. Typically a host containing one or more Ceph OSD Daemons. + + +``osd_crush_initial_weight`` + +:Description: The initial CRUSH weight for newly added OSDs. + +:Type: Double +:Default: ``the size of a newly added OSD in TB``. By default, the initial CRUSH + weight for a newly added OSD is set to its device size in TB. + See `Weighting Bucket Items`_ for details. + + +``osd_pool_default_crush_rule`` + +:Description: The default CRUSH rule to use when creating a replicated pool. +:Type: 8-bit Integer +:Default: ``-1``, which means "pick the rule with the lowest numerical ID and + use that". This is to make pool creation work in the absence of rule 0. + + +``osd_pool_erasure_code_stripe_unit`` + +:Description: Sets the default size, in bytes, of a chunk of an object + stripe for erasure coded pools. Every object of size S + will be stored as N stripes, with each data chunk + receiving ``stripe unit`` bytes. Each stripe of ``N * + stripe unit`` bytes will be encoded/decoded + individually. This option is overridden by the + ``stripe_unit`` setting in an erasure code profile. + +:Type: Unsigned 32-bit Integer +:Default: ``4096`` + + +``osd_pool_default_size`` + +:Description: Sets the number of replicas for objects in the pool. The default + value is the same as + ``ceph osd pool set {pool-name} size {size}``. + +:Type: 32-bit Integer +:Default: ``3`` + + +``osd_pool_default_min_size`` + +:Description: Sets the minimum number of written replicas for objects in the + pool in order to acknowledge a write operation to the client. If + minimum is not met, Ceph will not acknowledge the write to the + client, **which may result in data loss**. This setting ensures + a minimum number of replicas when operating in ``degraded`` mode. + +:Type: 32-bit Integer +:Default: ``0``, which means no particular minimum. If ``0``, + minimum is ``size - (size / 2)``. 
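The arithmetic behind two of these defaults can be sketched as follows. This is an illustration only: it assumes that ``size / 2`` in the ``osd_pool_default_min_size`` formula is integer division, and it follows the placement group sizing rule given in the comments of ``pool-pg.conf`` (about 100 PGs per OSD, divided by the replica count, rounded to the nearest power of 2). The function names are hypothetical and are not part of Ceph.

```python
def default_min_size(size):
    """Effective min_size when osd_pool_default_min_size is 0.

    Assumes the division in ``size - (size / 2)`` is integer division.
    """
    return size - (size // 2)


def nearest_power_of_two(n):
    """Return the power of 2 closest to n (n >= 1); ties round up."""
    lower = 1 << (n.bit_length() - 1)  # largest power of 2 that is <= n
    upper = lower << 1                 # smallest power of 2 that is > n
    return lower if (n - lower) < (upper - n) else upper


def suggested_pg_num(num_osds, pool_size, pgs_per_osd=100):
    """Apply the pool-pg.conf rule of thumb: (pgs_per_osd * OSDs) / replicas."""
    raw = (pgs_per_osd * num_osds) // pool_size
    if raw < 1:
        return 1
    return nearest_power_of_two(raw)


if __name__ == "__main__":
    print(default_min_size(3))      # replica size 3
    print(suggested_pg_num(10, 4))  # 10 OSDs, replica size 4
```

For 10 OSDs and a replica size of 4, the raw value is 250, which rounds to the 256 used in ``pool-pg.conf``.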
+ + +``osd_pool_default_pg_num`` + +:Description: The default number of placement groups for a pool. The default + value is the same as ``pg_num`` with ``mkpool``. + +:Type: 32-bit Integer +:Default: ``32`` + + +``osd_pool_default_pgp_num`` + +:Description: The default number of placement groups for placement for a pool. + The default value is the same as ``pgp_num`` with ``mkpool``. + PG and PGP should be equal (for now). + +:Type: 32-bit Integer +:Default: ``8`` + + +``osd_pool_default_flags`` + +:Description: The default flags for new pools. +:Type: 32-bit Integer +:Default: ``0`` + + +``osd_max_pgls`` + +:Description: The maximum number of placement groups to list. A client + requesting a large number can tie up the Ceph OSD Daemon. + +:Type: Unsigned 64-bit Integer +:Default: ``1024`` +:Note: Default should be fine. + + +``osd_min_pg_log_entries`` + +:Description: The minimum number of placement group logs to maintain + when trimming log files. + +:Type: 32-bit Int Unsigned +:Default: ``250`` + + +``osd_max_pg_log_entries`` + +:Description: The maximum number of placement group logs to maintain + when trimming log files. + +:Type: 32-bit Int Unsigned +:Default: ``10000`` + + +``osd_default_data_pool_replay_window`` + +:Description: The time (in seconds) for an OSD to wait for a client to replay + a request. + +:Type: 32-bit Integer +:Default: ``45`` + +``osd_max_pg_per_osd_hard_ratio`` + +:Description: The ratio of number of PGs per OSD allowed by the cluster before the + OSD refuses to create new PGs. An OSD stops creating new PGs if the number + of PGs it serves exceeds + ``osd_max_pg_per_osd_hard_ratio`` \* ``mon_max_pg_per_osd``. + +:Type: Float +:Default: ``2`` + +``osd_recovery_priority`` + +:Description: Priority of recovery in the work queue. + +:Type: Integer +:Default: ``5`` + +``osd_recovery_op_priority`` + +:Description: Default priority used for recovery operations if pool doesn't override. + +:Type: Integer +:Default: ``3`` + +.. 
_pool: ../../operations/pools +.. _Monitoring OSDs and PGs: ../../operations/monitoring-osd-pg#peering +.. _Weighting Bucket Items: ../../operations/crush-map#weightingbucketitems diff --git a/doc/rados/configuration/pool-pg.conf b/doc/rados/configuration/pool-pg.conf new file mode 100644 index 000000000..34c3af9f0 --- /dev/null +++ b/doc/rados/configuration/pool-pg.conf @@ -0,0 +1,21 @@ +[global] + + # By default, Ceph makes 3 replicas of RADOS objects. If you want to maintain four + # copies of an object (a primary copy and three replica copies), reset the + # default value as shown in 'osd_pool_default_size'. + # If you want to allow Ceph to write a lesser number of copies in a degraded + # state, set 'osd_pool_default_min_size' to a number less than the + # 'osd_pool_default_size' value. + + osd_pool_default_size = 3 # Write an object 3 times. + osd_pool_default_min_size = 2 # Allow writing two copies in a degraded state. + + # Ensure you have a realistic number of placement groups. We recommend + # approximately 100 per OSD. E.g., total number of OSDs multiplied by 100 + # divided by the number of replicas (i.e., osd pool default size). So for + # 10 OSDs and osd pool default size = 4, we'd recommend approximately + # (100 * 10) / 4 = 250. + # Always use the nearest power of 2. + + osd_pool_default_pg_num = 256 + osd_pool_default_pgp_num = 256 diff --git a/doc/rados/configuration/storage-devices.rst b/doc/rados/configuration/storage-devices.rst new file mode 100644 index 000000000..8536d2cfa --- /dev/null +++ b/doc/rados/configuration/storage-devices.rst @@ -0,0 +1,96 @@ +================= + Storage Devices +================= + +There are two Ceph daemons that store data on devices: + +.. _rados_configuration_storage-devices_ceph_osd: + +* **Ceph OSDs** (Object Storage Daemons) store most of the data + in Ceph. Usually each OSD is backed by a single storage device. + This can be a traditional hard disk (HDD) or a solid state disk + (SSD). 
OSDs can also be backed by a combination of devices: for + example, an HDD for most data and an SSD (or partition of an + SSD) for some metadata. The number of OSDs in a cluster is + usually a function of the amount of data to be stored, the size + of each storage device, and the level and type of redundancy + specified (replication or erasure coding). +* **Ceph Monitor** daemons manage critical cluster state. This + includes cluster membership and authentication information. + Small clusters require only a few gigabytes of storage to hold + the monitor database. In large clusters, however, the monitor + database can reach sizes of tens of gigabytes to hundreds of + gigabytes. +* **Ceph Manager** daemons run alongside monitor daemons, providing + additional monitoring and providing interfaces to external + monitoring and management systems. + + +OSD Back Ends +============= + +There are two ways that OSDs manage the data they store. As of the Luminous +12.2.z release, the default (and recommended) back end is *BlueStore*. Prior +to the Luminous release, the default (and only) back end was *Filestore*. + +.. _rados_config_storage_devices_bluestore: + +BlueStore +--------- + +BlueStore is a special-purpose storage back end designed specifically for +managing data on disk for Ceph OSD workloads. BlueStore's design is based on +a decade of experience of supporting and managing Filestore OSDs. Key +BlueStore features include: + +* Direct management of storage devices. BlueStore consumes raw block + devices or partitions. This avoids any intervening layers of + abstraction (such as local file systems like XFS) that may limit + performance or add complexity. 
+* Metadata management with RocksDB. We embed RocksDB's key/value database + in order to manage internal metadata, such as the mapping from object + names to block locations on disk. +* Full data and metadata checksumming. By default all data and + metadata written to BlueStore is protected by one or more + checksums. No data or metadata will be read from disk or returned + to the user without being verified. +* Inline compression. Data written may be optionally compressed + before being written to disk. +* Multi-device metadata tiering. BlueStore allows its internal + journal (write-ahead log) to be written to a separate, high-speed + device (like an SSD, NVMe, or NVDIMM) to increase performance. If + a significant amount of faster storage is available, internal + metadata can also be stored on the faster device. +* Efficient copy-on-write. RBD and CephFS snapshots rely on a + copy-on-write *clone* mechanism that is implemented efficiently in + BlueStore. This results in efficient IO both for regular snapshots + and for erasure coded pools (which rely on cloning to implement + efficient two-phase commits). + +For more information, see :doc:`bluestore-config-ref` and :doc:`/rados/operations/bluestore-migration`. + +FileStore +--------- + +FileStore is the legacy approach to storing objects in Ceph. It +relies on a standard file system (normally XFS) in combination with a +key/value database (traditionally LevelDB, now RocksDB) for some +metadata. + +FileStore is well-tested and widely used in production but suffers +from many performance deficiencies due to its overall design and +reliance on a traditional file system for storing object data. + +Although FileStore is generally capable of functioning on most +POSIX-compatible file systems (including btrfs and ext4), we only +recommend that XFS be used. Both btrfs and ext4 have known bugs and +deficiencies and their use may lead to data loss. By default all Ceph +provisioning tools will use XFS. 
+ +For more information, see :doc:`filestore-config-ref`. diff --git a/doc/rados/index.rst b/doc/rados/index.rst new file mode 100644 index 000000000..5f3f112e8 --- /dev/null +++ b/doc/rados/index.rst @@ -0,0 +1,78 @@ +.. _rados-index: + +====================== + Ceph Storage Cluster +====================== + +The :term:`Ceph Storage Cluster` is the foundation for all Ceph deployments. +Based upon :abbr:`RADOS (Reliable Autonomic Distributed Object Store)`, Ceph +Storage Clusters consist of two types of daemons: a :term:`Ceph OSD Daemon` +(OSD) stores data as objects on a storage node; and a :term:`Ceph Monitor` (MON) +maintains a master copy of the cluster map. A Ceph Storage Cluster may contain +thousands of storage nodes. A minimal system will have at least one +Ceph Monitor and two Ceph OSD Daemons for data replication. + +The Ceph File System, Ceph Object Storage and Ceph Block Devices read data from +and write data to the Ceph Storage Cluster. + +.. raw:: html + + <style type="text/css">div.body h3{margin:5px 0px 0px 0px;}</style> + <table cellpadding="10"><colgroup><col width="33%"><col width="33%"><col width="33%"></colgroup><tbody valign="top"><tr><td><h3>Config and Deploy</h3> + +Ceph Storage Clusters have a few required settings, but most configuration +settings have default values. A typical deployment uses a deployment tool +to define a cluster and bootstrap a monitor. See `Deployment`_ for details +on ``cephadm.`` + +.. toctree:: + :maxdepth: 2 + + Configuration <configuration/index> + Deployment <../cephadm/index> + +.. raw:: html + + </td><td><h3>Operations</h3> + +Once you have deployed a Ceph Storage Cluster, you may begin operating +your cluster. + +.. toctree:: + :maxdepth: 2 + + + Operations <operations/index> + +.. toctree:: + :maxdepth: 1 + + Man Pages <man/index> + + +.. toctree:: + :hidden: + + troubleshooting/index + +.. 
raw:: html + + </td><td><h3>APIs</h3> + +Most Ceph deployments use `Ceph Block Devices`_, `Ceph Object Storage`_ and/or the +`Ceph File System`_. You may also develop applications that talk directly to +the Ceph Storage Cluster. + +.. toctree:: + :maxdepth: 2 + + APIs <api/index> + +.. raw:: html + + </td></tr></tbody></table> + +.. _Ceph Block Devices: ../rbd/ +.. _Ceph File System: ../cephfs/ +.. _Ceph Object Storage: ../radosgw/ +.. _Deployment: ../cephadm/ diff --git a/doc/rados/man/index.rst b/doc/rados/man/index.rst new file mode 100644 index 000000000..8311bfd5a --- /dev/null +++ b/doc/rados/man/index.rst @@ -0,0 +1,31 @@ +======================= + Object Store Manpages +======================= + +.. toctree:: + :maxdepth: 1 + + ../../man/8/ceph-volume.rst + ../../man/8/ceph-volume-systemd.rst + ../../man/8/ceph.rst + ../../man/8/ceph-authtool.rst + ../../man/8/ceph-clsinfo.rst + ../../man/8/ceph-conf.rst + ../../man/8/ceph-debugpack.rst + ../../man/8/ceph-dencoder.rst + ../../man/8/ceph-mon.rst + ../../man/8/ceph-osd.rst + ../../man/8/ceph-kvstore-tool.rst + ../../man/8/ceph-run.rst + ../../man/8/ceph-syn.rst + ../../man/8/crushtool.rst + ../../man/8/librados-config.rst + ../../man/8/monmaptool.rst + ../../man/8/osdmaptool.rst + ../../man/8/rados.rst + + +.. toctree:: + :hidden: + + ../../man/8/ceph-post-file.rst diff --git a/doc/rados/operations/add-or-rm-mons.rst b/doc/rados/operations/add-or-rm-mons.rst new file mode 100644 index 000000000..359fa7676 --- /dev/null +++ b/doc/rados/operations/add-or-rm-mons.rst @@ -0,0 +1,446 @@ +.. _adding-and-removing-monitors: + +========================== + Adding/Removing Monitors +========================== + +When you have a cluster up and running, you may add or remove monitors +from the cluster at runtime. To bootstrap a monitor, see `Manual Deployment`_ +or `Monitor Bootstrap`_. + +.. 
_adding-monitors: + +Adding Monitors +=============== + +Ceph monitors are lightweight processes that are the single source of truth +for the cluster map. You can run a cluster with 1 monitor but we recommend at least 3 +for a production cluster. Ceph monitors use a variation of the +`Paxos`_ algorithm to establish consensus about maps and other critical +information across the cluster. Due to the nature of Paxos, Ceph requires +a majority of monitors to be active to establish a quorum (thus establishing +consensus). + +It is advisable to run an odd number of monitors. An +odd number of monitors is more resilient than an +even number. For instance, with a two monitor deployment, no +failures can be tolerated and still maintain a quorum; with three monitors, +one failure can be tolerated; in a four monitor deployment, one failure can +be tolerated; with five monitors, two failures can be tolerated. This avoids +the dreaded *split brain* phenomenon, and is why an odd number is best. +In short, Ceph needs a majority of +monitors to be active (and able to communicate with each other), but that +majority can be achieved using a single monitor, or 2 out of 2 monitors, +2 out of 3, 3 out of 4, etc. + +For small or non-critical deployments of multi-node Ceph clusters, it is +advisable to deploy three monitors, and to increase the number of monitors +to five for larger clusters or to survive a double failure. There is rarely +justification for seven or more. + +Since monitors are lightweight, it is possible to run them on the same +host as OSDs; however, we recommend running them on separate hosts, +because `fsync` issues with the kernel may impair performance. +Dedicated monitor nodes also minimize disruption since monitor and OSD +daemons are not inactive at the same time when a node crashes or is +taken down for maintenance. 
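The majority rule described earlier in this section can be checked with a little arithmetic. The following is a minimal sketch in Python, not part of Ceph itself; the function names ``quorum_size`` and ``tolerated_failures`` are hypothetical and chosen only for illustration. It assumes that a quorum is a strict majority of the configured monitors, as stated above.

```python
def quorum_size(monitors: int) -> int:
    """Return the smallest number of monitors that forms a strict majority."""
    if monitors < 1:
        raise ValueError("a cluster needs at least one monitor")
    return monitors // 2 + 1


def tolerated_failures(monitors: int) -> int:
    """Return how many monitors can fail while a majority can still form."""
    return monitors - quorum_size(monitors)


if __name__ == "__main__":
    # Print the quorum size and failure tolerance for 1 to 5 monitors.
    for n in range(1, 6):
        print(f"{n} monitors: quorum={quorum_size(n)}, "
              f"tolerated failures={tolerated_failures(n)}")
```

Under this assumption, going from three monitors to four raises the quorum size without raising the number of failures that can be tolerated, which is why an odd number of monitors is recommended.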
+ +Dedicated +monitor nodes also make for cleaner maintenance by avoiding both OSDs and +a mon going down if a node is rebooted, taken down, or crashes. + +.. note:: A *majority* of monitors in your cluster must be able to + reach each other in order to establish a quorum. + +Deploy your Hardware +-------------------- + +If you are adding a new host when adding a new monitor, see `Hardware +Recommendations`_ for details on minimum recommendations for monitor hardware. +To add a monitor host to your cluster, first make sure you have an up-to-date +version of Linux installed (typically Ubuntu 16.04 or RHEL 7). + +Add your monitor host to a rack in your cluster, connect it to the network +and ensure that it has network connectivity. + +.. _Hardware Recommendations: ../../../start/hardware-recommendations + +Install the Required Software +----------------------------- + +For manually deployed clusters, you must install Ceph packages +manually. See `Installing Packages`_ for details. +You should configure SSH to a user with password-less authentication +and root permissions. + +.. _Installing Packages: ../../../install/install-storage-cluster + + +.. _Adding a Monitor (Manual): + +Adding a Monitor (Manual) +------------------------- + +This procedure creates a ``ceph-mon`` data directory, retrieves the monitor map +and monitor keyring, and adds a ``ceph-mon`` daemon to your cluster. If +this results in only two monitor daemons, you may add more monitors by +repeating this procedure until you have a sufficient number of ``ceph-mon`` +daemons to achieve a quorum. + +At this point you should define your monitor's id. Traditionally, monitors +have been named with single letters (``a``, ``b``, ``c``, ...), but you are +free to define the id as you see fit. For the purpose of this document, +please take into account that ``{mon-id}`` should be the id you chose, +without the ``mon.`` prefix (i.e., ``{mon-id}`` should be the ``a`` +on ``mon.a``). + +#. 
Create the default directory on the machine that will host your + new monitor: + + .. prompt:: bash $ + + ssh {new-mon-host} + sudo mkdir /var/lib/ceph/mon/ceph-{mon-id} + +#. Create a temporary directory ``{tmp}`` to keep the files needed during + this process. This directory should be different from the monitor's default + directory created in the previous step, and can be removed after all the + steps are executed: + + .. prompt:: bash $ + + mkdir {tmp} + +#. Retrieve the keyring for your monitors, where ``{tmp}`` is the path to + the retrieved keyring, and ``{key-filename}`` is the name of the file + containing the retrieved monitor key: + + .. prompt:: bash $ + + ceph auth get mon. -o {tmp}/{key-filename} + +#. Retrieve the monitor map, where ``{tmp}`` is the path to + the retrieved monitor map, and ``{map-filename}`` is the name of the file + containing the retrieved monitor map: + + .. prompt:: bash $ + + ceph mon getmap -o {tmp}/{map-filename} + +#. Prepare the monitor's data directory created in the first step. You must + specify the path to the monitor map so that you can retrieve the + information about a quorum of monitors and their ``fsid``. You must also + specify a path to the monitor keyring: + + .. prompt:: bash $ + + sudo ceph-mon -i {mon-id} --mkfs --monmap {tmp}/{map-filename} --keyring {tmp}/{key-filename} + + +#. Start the new monitor and it will automatically join the cluster. + The daemon needs to know which address to bind to, via either the + ``--public-addr {ip}`` or ``--public-network {network}`` argument. + For example: + + .. prompt:: bash $ + + ceph-mon -i {mon-id} --public-addr {ip:port} + +.. _removing-monitors: + +Removing Monitors +================= + +When you remove monitors from a cluster, consider that Ceph monitors use +Paxos to establish consensus about the master cluster map. You must have +a sufficient number of monitors to establish a quorum for consensus about +the cluster map. + +.. 
_Removing a Monitor (Manual): + +Removing a Monitor (Manual) +--------------------------- + +This procedure removes a ``ceph-mon`` daemon from your cluster. If this +procedure results in only two monitor daemons, you may add or remove another +monitor until you have a number of ``ceph-mon`` daemons that can achieve a +quorum. + +#. Stop the monitor: + + .. prompt:: bash $ + + service ceph -a stop mon.{mon-id} + +#. Remove the monitor from the cluster: + + .. prompt:: bash $ + + ceph mon remove {mon-id} + +#. Remove the monitor entry from ``ceph.conf``. + +.. _rados-mon-remove-from-unhealthy: + +Removing Monitors from an Unhealthy Cluster +------------------------------------------- + +This procedure removes a ``ceph-mon`` daemon from an unhealthy +cluster, for example a cluster where the monitors cannot form a +quorum. + + +#. Stop all ``ceph-mon`` daemons on all monitor hosts: + + .. prompt:: bash $ + + ssh {mon-host} + systemctl stop ceph-mon.target + + Repeat for all monitor hosts. + +#. Identify a surviving monitor and log in to that host: + + .. prompt:: bash $ + + ssh {mon-host} + +#. Extract a copy of the monmap file: + + .. prompt:: bash $ + + ceph-mon -i {mon-id} --extract-monmap {map-path} + + In most cases, this command will be: + + .. prompt:: bash $ + + ceph-mon -i `hostname` --extract-monmap /tmp/monmap + +#. Remove the non-surviving or problematic monitors. For example, if + you have three monitors, ``mon.a``, ``mon.b``, and ``mon.c``, where + only ``mon.a`` will survive, follow the example below: + + .. prompt:: bash $ + + monmaptool {map-path} --rm {mon-id} + + For example, + + .. prompt:: bash $ + + monmaptool /tmp/monmap --rm b + monmaptool /tmp/monmap --rm c + +#. Inject the surviving map with the removed monitors into the + surviving monitor(s). For example, to inject a map into monitor + ``mon.a``, follow the example below: + + .. prompt:: bash $ + + ceph-mon -i {mon-id} --inject-monmap {map-path} + + For example: + + .. 
prompt:: bash $ + + ceph-mon -i a --inject-monmap /tmp/monmap + +#. Start only the surviving monitors. + +#. Verify the monitors form a quorum (``ceph -s``). + +#. You may wish to archive the removed monitors' data directory in + ``/var/lib/ceph/mon`` in a safe location, or delete it if you are + confident the remaining monitors are healthy and are sufficiently + redundant. + +.. _Changing a Monitor's IP address: + +Changing a Monitor's IP Address +=============================== + +.. important:: Existing monitors are not supposed to change their IP addresses. + +Monitors are critical components of a Ceph cluster, and they need to maintain a +quorum for the whole system to work properly. To establish a quorum, the +monitors need to discover each other. Ceph has strict requirements for +discovering monitors. + +Ceph clients and other Ceph daemons use ``ceph.conf`` to discover monitors. +However, monitors discover each other using the monitor map, not ``ceph.conf``. +For example, if you refer to `Adding a Monitor (Manual)`_ you will see that you +need to obtain the current monmap for the cluster when creating a new monitor, +as it is one of the required arguments of ``ceph-mon -i {mon-id} --mkfs``. The +following sections explain the consistency requirements for Ceph monitors, and a +few safe ways to change a monitor's IP address. + + +Consistency Requirements +------------------------ + +A monitor always refers to the local copy of the monmap when discovering other +monitors in the cluster. Using the monmap instead of ``ceph.conf`` avoids +errors that could break the cluster (e.g., typos in ``ceph.conf`` when +specifying a monitor address or port). Since monitors use monmaps for discovery +and they share monmaps with clients and other Ceph daemons, the monmap provides +monitors with a strict guarantee that their consensus is valid. + +Strict consistency also applies to updates to the monmap. 
As with any other +updates on the monitor, changes to the monmap always run through a distributed +consensus algorithm called `Paxos`_. The monitors must agree on each update to +the monmap, such as adding or removing a monitor, to ensure that each monitor in +the quorum has the same version of the monmap. Updates to the monmap are +incremental so that monitors have the latest agreed upon version, and a set of +previous versions, allowing a monitor that has an older version of the monmap to +catch up with the current state of the cluster. + +If monitors discovered each other through the Ceph configuration file instead of +through the monmap, it would introduce additional risks because the Ceph +configuration files are not updated and distributed automatically. Monitors +might inadvertently use an older ``ceph.conf`` file, fail to recognize a +monitor, fall out of a quorum, or develop a situation where `Paxos`_ is not able +to determine the current state of the system accurately. Consequently, making +changes to an existing monitor's IP address must be done with great care. + + +Changing a Monitor's IP address (The Right Way) +----------------------------------------------- + +Changing a monitor's IP address in ``ceph.conf`` only is not sufficient to +ensure that other monitors in the cluster will receive the update. To change a +monitor's IP address, you must add a new monitor with the IP address you want +to use (as described in `Adding a Monitor (Manual)`_), ensure that the new +monitor successfully joins the quorum; then, remove the monitor that uses the +old IP address. Then, update the ``ceph.conf`` file to ensure that clients and +other daemons know the IP address of the new monitor. 
+ +For example, let's assume there are three monitors in place, such as :: + + [mon.a] + host = host01 + addr = 10.0.0.1:6789 + [mon.b] + host = host02 + addr = 10.0.0.2:6789 + [mon.c] + host = host03 + addr = 10.0.0.3:6789 + +To change ``mon.c`` to ``host04`` with the IP address ``10.0.0.4``, follow the +steps in `Adding a Monitor (Manual)`_ by adding a new monitor ``mon.d``. Ensure +that ``mon.d`` is running before removing ``mon.c``, or it will break the +quorum. Remove ``mon.c`` as described in `Removing a Monitor (Manual)`_. Moving +all three monitors would thus require repeating this process as many times as +needed. + + +Changing a Monitor's IP address (The Messy Way) +----------------------------------------------- + +There may come a time when the monitors must be moved to a different network, a +different part of the datacenter or a different datacenter altogether. While +this is possible, the process becomes a bit more hazardous. + +In such a case, the solution is to generate a new monmap with updated IP +addresses for all the monitors in the cluster, and inject the new map on each +individual monitor. This is not the most user-friendly approach, but we do not +expect this to be something that needs to be done every other week. As stated +at the top of this section, monitors are not supposed to change +IP addresses. + +Using the previous monitor configuration as an example, assume you want to move +all the monitors from the ``10.0.0.x`` range to ``10.1.0.x``, and these +networks are unable to communicate. Use the following procedure: + +#. Retrieve the monitor map, where ``{tmp}`` is the path to + the retrieved monitor map, and ``{filename}`` is the name of the file + containing the retrieved monitor map: + + .. prompt:: bash $ + + ceph mon getmap -o {tmp}/{filename} + +#. The following example demonstrates the contents of the monmap: + + .. 
prompt:: bash $ + + monmaptool --print {tmp}/{filename} + + :: + + monmaptool: monmap file {tmp}/{filename} + epoch 1 + fsid 224e376d-c5fe-4504-96bb-ea6332a19e61 + last_changed 2012-12-17 02:46:41.591248 + created 2012-12-17 02:46:41.591248 + 0: 10.0.0.1:6789/0 mon.a + 1: 10.0.0.2:6789/0 mon.b + 2: 10.0.0.3:6789/0 mon.c + +#. Remove the existing monitors: + + .. prompt:: bash $ + + monmaptool --rm a --rm b --rm c {tmp}/{filename} + + + :: + + monmaptool: monmap file {tmp}/{filename} + monmaptool: removing a + monmaptool: removing b + monmaptool: removing c + monmaptool: writing epoch 1 to {tmp}/{filename} (0 monitors) + +#. Add the new monitor locations: + + .. prompt:: bash $ + + monmaptool --add a 10.1.0.1:6789 --add b 10.1.0.2:6789 --add c 10.1.0.3:6789 {tmp}/{filename} + + + :: + + monmaptool: monmap file {tmp}/{filename} + monmaptool: writing epoch 1 to {tmp}/{filename} (3 monitors) + +#. Check new contents: + + .. prompt:: bash $ + + monmaptool --print {tmp}/{filename} + + :: + + monmaptool: monmap file {tmp}/{filename} + epoch 1 + fsid 224e376d-c5fe-4504-96bb-ea6332a19e61 + last_changed 2012-12-17 02:46:41.591248 + created 2012-12-17 02:46:41.591248 + 0: 10.1.0.1:6789/0 mon.a + 1: 10.1.0.2:6789/0 mon.b + 2: 10.1.0.3:6789/0 mon.c + +At this point, we assume the monitors (and stores) are installed at the new +location. The next step is to propagate the modified monmap to the new +monitors, and inject the modified monmap into each new monitor. + +#. First, make sure to stop all your monitors. Injection must be done while + the daemon is not running. + +#. Inject the monmap: + + .. prompt:: bash $ + + ceph-mon -i {mon-id} --inject-monmap {tmp}/{filename} + +#. Restart the monitors. + +After this step, migration to the new location is complete and +the monitors should operate successfully. + + +.. _Manual Deployment: ../../../install/manual-deployment +.. _Monitor Bootstrap: ../../../dev/mon-bootstrap +.. 
_Paxos: https://en.wikipedia.org/wiki/Paxos_(computer_science) diff --git a/doc/rados/operations/add-or-rm-osds.rst b/doc/rados/operations/add-or-rm-osds.rst new file mode 100644 index 000000000..315552859 --- /dev/null +++ b/doc/rados/operations/add-or-rm-osds.rst @@ -0,0 +1,386 @@ +====================== + Adding/Removing OSDs +====================== + +When you have a cluster up and running, you may add OSDs or remove OSDs +from the cluster at runtime. + +Adding OSDs +=========== + +When you want to expand a cluster, you may add an OSD at runtime. With Ceph, an +OSD is generally one Ceph ``ceph-osd`` daemon for one storage drive within a +host machine. If your host has multiple storage drives, you may map one +``ceph-osd`` daemon for each drive. + +Generally, it's a good idea to check the capacity of your cluster to see if you +are reaching the upper end of its capacity. As your cluster reaches its ``near +full`` ratio, you should add one or more OSDs to expand your cluster's capacity. + +.. warning:: Do not let your cluster reach its ``full ratio`` before + adding an OSD. OSD failures that occur after the cluster reaches + its ``near full`` ratio may cause the cluster to exceed its + ``full ratio``. + +Deploy your Hardware +-------------------- + +If you are adding a new host when adding a new OSD, see `Hardware +Recommendations`_ for details on minimum recommendations for OSD hardware. To +add an OSD host to your cluster, first make sure you have an up-to-date version +of Linux installed, and you have made some initial preparations for your +storage drives. See `Filesystem Recommendations`_ for details. + +Add your OSD host to a rack in your cluster, connect it to the network +and ensure that it has network connectivity. See the `Network Configuration +Reference`_ for details. + +.. _Hardware Recommendations: ../../../start/hardware-recommendations +.. _Filesystem Recommendations: ../../configuration/filesystem-recommendations +.. 
_Network Configuration Reference: ../../configuration/network-config-ref + +Install the Required Software +----------------------------- + +For manually deployed clusters, you must install Ceph packages +manually. See `Installing Ceph (Manual)`_ for details. +You should configure SSH to a user with password-less authentication +and root permissions. + +.. _Installing Ceph (Manual): ../../../install + + +Adding an OSD (Manual) +---------------------- + +This procedure sets up a ``ceph-osd`` daemon, configures it to use one drive, +and configures the cluster to distribute data to the OSD. If your host has +multiple drives, you may add an OSD for each drive by repeating this procedure. + +To add an OSD, create a data directory for it, mount a drive to that directory, +add the OSD to the cluster, and then add it to the CRUSH map. + +When you add the OSD to the CRUSH map, consider the weight you give to the new +OSD. Hard drive capacity grows 40% per year, so newer OSD hosts may have larger +hard drives than older hosts in the cluster (i.e., they may have greater +weight). + +.. tip:: Ceph prefers uniform hardware across pools. If you are adding drives + of dissimilar size, you can adjust their weights. However, for best + performance, consider a CRUSH hierarchy with drives of the same type/size. + +#. Create the OSD. If no UUID is given, it will be set automatically when the + OSD starts up. The following command will output the OSD number, which you + will need for subsequent steps: + + .. prompt:: bash $ + + ceph osd create [{uuid} [{id}]] + + If the optional parameter {id} is given it will be used as the OSD id. + Note, in this case the command may fail if the number is already in use. + + .. warning:: In general, explicitly specifying {id} is not recommended. + IDs are allocated as an array, and skipping entries consumes some extra + memory. This can become significant if there are large gaps and/or + clusters are large. 
If {id} is not specified, the smallest available is + used. + +#. Create the default directory on your new OSD: + + .. prompt:: bash $ + + ssh {new-osd-host} + sudo mkdir /var/lib/ceph/osd/ceph-{osd-number} + +#. If the OSD is for a drive other than the OS drive, prepare it + for use with Ceph, and mount it to the directory you just created: + + .. prompt:: bash $ + + ssh {new-osd-host} + sudo mkfs -t {fstype} /dev/{drive} + sudo mount -o user_xattr /dev/{hdd} /var/lib/ceph/osd/ceph-{osd-number} + +#. Initialize the OSD data directory: + + .. prompt:: bash $ + + ssh {new-osd-host} + ceph-osd -i {osd-num} --mkfs --mkkey + + The directory must be empty before you can run ``ceph-osd``. + +#. Register the OSD authentication key. The value of ``ceph`` for + ``ceph-{osd-num}`` in the path is the ``$cluster-$id``. If your + cluster name differs from ``ceph``, use your cluster name instead: + + .. prompt:: bash $ + + ceph auth add osd.{osd-num} osd 'allow *' mon 'allow rwx' -i /var/lib/ceph/osd/ceph-{osd-num}/keyring + +#. Add the OSD to the CRUSH map so that the OSD can begin receiving data. The + ``ceph osd crush add`` command allows you to add OSDs to the CRUSH hierarchy + wherever you wish. If you specify at least one bucket, the command + will place the OSD into the most specific bucket you specify, *and* it will + move that bucket underneath any other buckets you specify. **Important:** If + you specify only the root bucket, the command will attach the OSD directly + to the root, but CRUSH rules expect OSDs to be inside of hosts. + + Execute the following: + + .. prompt:: bash $ + + ceph osd crush add {id-or-name} {weight} [{bucket-type}={bucket-name} ...] + + You may also decompile the CRUSH map, add the OSD to the device list, add the + host as a bucket (if it's not already in the CRUSH map), add the device as an + item in the host, assign it a weight, recompile it and set it. See + `Add/Move an OSD`_ for details. + + +.. 
_rados-replacing-an-osd: + +Replacing an OSD +---------------- + +.. note:: If the instructions in this section do not work for you, try the + instructions in the cephadm documentation: :ref:`cephadm-replacing-an-osd`. + +When disks fail, or if an administrator wants to reprovision OSDs with a new +back end (for instance, when switching from FileStore to BlueStore), OSDs need +to be replaced. Unlike in `Removing the OSD`_, the replaced OSD's id and CRUSH +map entry need to be kept intact after the OSD is destroyed for replacement. + +#. Make sure it is safe to destroy the OSD: + + .. prompt:: bash $ + + while ! ceph osd safe-to-destroy osd.{id} ; do sleep 10 ; done + +#. Destroy the OSD first: + + .. prompt:: bash $ + + ceph osd destroy {id} --yes-i-really-mean-it + +#. Zap a disk for the new OSD, if the disk was used before for other purposes. + It's not necessary for a new disk: + + .. prompt:: bash $ + + ceph-volume lvm zap /dev/sdX + +#. Prepare the disk for replacement by using the previously destroyed OSD id: + + .. prompt:: bash $ + + ceph-volume lvm prepare --osd-id {id} --data /dev/sdX + +#. And activate the OSD: + + .. prompt:: bash $ + + ceph-volume lvm activate {id} {fsid} + +Alternatively, instead of preparing and activating, the device can be recreated +in a single call: + + .. prompt:: bash $ + + ceph-volume lvm create --osd-id {id} --data /dev/sdX + + +Starting the OSD +---------------- + +After you add an OSD to Ceph, the OSD is in your configuration. However, +it is not yet running. The OSD is ``down`` and ``in``. You must start +your new OSD before it can begin receiving data. You may use +``service ceph`` from your admin host or start the OSD from its host +machine: + + .. prompt:: bash $ + + sudo systemctl start ceph-osd@{osd-num} + + +Once you start your OSD, it is ``up`` and ``in``. 
+ + +Observe the Data Migration +-------------------------- + +Once you have added your new OSD to the CRUSH map, Ceph will begin rebalancing +the server by migrating placement groups to your new OSD. You can observe this +process with the `ceph`_ tool. : + + .. prompt:: bash $ + + ceph -w + +You should see the placement group states change from ``active+clean`` to +``active, some degraded objects``, and finally ``active+clean`` when migration +completes. (Control-c to exit.) + +.. _Add/Move an OSD: ../crush-map#addosd +.. _ceph: ../monitoring + + + +Removing OSDs (Manual) +====================== + +When you want to reduce the size of a cluster or replace hardware, you may +remove an OSD at runtime. With Ceph, an OSD is generally one Ceph ``ceph-osd`` +daemon for one storage drive within a host machine. If your host has multiple +storage drives, you may need to remove one ``ceph-osd`` daemon for each drive. +Generally, it's a good idea to check the capacity of your cluster to see if you +are reaching the upper end of its capacity. Ensure that when you remove an OSD +that your cluster is not at its ``near full`` ratio. + +.. warning:: Do not let your cluster reach its ``full ratio`` when + removing an OSD. Removing OSDs could cause the cluster to reach + or exceed its ``full ratio``. + + +Take the OSD out of the Cluster +----------------------------------- + +Before you remove an OSD, it is usually ``up`` and ``in``. You need to take it +out of the cluster so that Ceph can begin rebalancing and copying its data to +other OSDs. : + + .. prompt:: bash $ + + ceph osd out {osd-num} + + +Observe the Data Migration +-------------------------- + +Once you have taken your OSD ``out`` of the cluster, Ceph will begin +rebalancing the cluster by migrating placement groups out of the OSD you +removed. You can observe this process with the `ceph`_ tool. : + + .. 
prompt:: bash $ + + ceph -w + +You should see the placement group states change from ``active+clean`` to +``active, some degraded objects``, and finally ``active+clean`` when migration +completes. (Press Control-c to exit.) + +.. note:: Sometimes, typically in a "small" cluster with few hosts (for + instance, a small testing cluster), taking the OSD ``out`` can trigger + a CRUSH corner case in which some PGs remain stuck in the + ``active+remapped`` state. If this happens, mark + the OSD ``in`` with: + + .. prompt:: bash $ + + ceph osd in {osd-num} + + to return to the initial state and then, instead of marking the OSD + ``out``, set its weight to 0 with: + + .. prompt:: bash $ + + ceph osd crush reweight osd.{osd-num} 0 + + After that, you can observe the data migration, which should run to + completion. The difference between marking the OSD ``out`` and reweighting it + to 0 is that in the first case the weight of the bucket that contains + the OSD is not changed, whereas in the second case the weight of the bucket + is updated (decreased by the OSD's weight). The reweight command can + sometimes be preferable in a "small" cluster. + + + +Stopping the OSD +---------------- + +After you take an OSD out of the cluster, it may still be running. +That is, the OSD may be ``up`` and ``out``. You must stop +your OSD before you remove it from the configuration: + + .. prompt:: bash $ + + ssh {osd-host} + sudo systemctl stop ceph-osd@{osd-num} + +Once you stop your OSD, it is ``down``. + + +Removing the OSD +---------------- + +This procedure removes an OSD from a cluster map, removes its authentication +key, removes the OSD from the OSD map, and removes the OSD from the +``ceph.conf`` file. If your host has multiple drives, you may need to remove an +OSD for each drive by repeating this procedure. + +#. Let the cluster forget the OSD first. This step removes the OSD from the CRUSH + map, removes its authentication key. 
And it is removed from the OSD map as + well. Please note the :ref:`purge subcommand <ceph-admin-osd>` is introduced in Luminous, for older + versions, please see below: + + .. prompt:: bash $ + + ceph osd purge {id} --yes-i-really-mean-it + +#. Navigate to the host where you keep the master copy of the cluster's + ``ceph.conf`` file: + + .. prompt:: bash $ + + ssh {admin-host} + cd /etc/ceph + vim ceph.conf + +#. Remove the OSD entry from your ``ceph.conf`` file (if it exists):: + + [osd.1] + host = {hostname} + +#. From the host where you keep the master copy of the cluster's ``ceph.conf`` + file, copy the updated ``ceph.conf`` file to the ``/etc/ceph`` directory of + other hosts in your cluster. + +If your Ceph cluster is older than Luminous, instead of using ``ceph osd +purge``, you need to perform this step manually: + + +#. Remove the OSD from the CRUSH map so that it no longer receives data. You may + also decompile the CRUSH map, remove the OSD from the device list, remove the + device as an item in the host bucket or remove the host bucket (if it's in the + CRUSH map and you intend to remove the host), recompile the map and set it. + See `Remove an OSD`_ for details: + + .. prompt:: bash $ + + ceph osd crush remove {name} + +#. Remove the OSD authentication key: + + .. prompt:: bash $ + + ceph auth del osd.{osd-num} + + The value of ``ceph`` for ``ceph-{osd-num}`` in the path is the + ``$cluster-$id``. If your cluster name differs from ``ceph``, use your + cluster name instead. + +#. Remove the OSD: + + .. prompt:: bash $ + + ceph osd rm {osd-num} + + for example: + + .. prompt:: bash $ + + ceph osd rm 1 + +.. _Remove an OSD: ../crush-map#removeosd diff --git a/doc/rados/operations/balancer.rst b/doc/rados/operations/balancer.rst new file mode 100644 index 000000000..b02a8914d --- /dev/null +++ b/doc/rados/operations/balancer.rst @@ -0,0 +1,206 @@ +.. 
_balancer: + +Balancer +======== + +The *balancer* can optimize the placement of PGs across OSDs in +order to achieve a balanced distribution, either automatically or in a +supervised fashion. + +Status +------ + +The current status of the balancer can be checked at any time with: + + .. prompt:: bash $ + + ceph balancer status + + +Automatic balancing +------------------- + +The automatic balancing feature is enabled by default in ``upmap`` +mode. Please refer to :ref:`upmap` for more details. The balancer can be +turned off with: + + .. prompt:: bash $ + + ceph balancer off + +The balancer mode can be changed to ``crush-compat`` mode, which is +backward compatible with older clients, and will make small changes to +the data distribution over time to ensure that OSDs are equally utilized. + + +Throttling +---------- + +No adjustments will be made to the PG distribution if the cluster is +degraded (e.g., because an OSD has failed and the system has not yet +healed itself). + +When the cluster is healthy, the balancer will throttle its changes +such that the percentage of PGs that are misplaced (i.e., that need to +be moved) is below a threshold of (by default) 5%. The +``target_max_misplaced_ratio`` threshold can be adjusted with: + + .. prompt:: bash $ + + ceph config set mgr target_max_misplaced_ratio .07 # 7% + +Set the number of seconds to sleep in between runs of the automatic balancer: + + .. prompt:: bash $ + + ceph config set mgr mgr/balancer/sleep_interval 60 + +Set the time of day to begin automatic balancing in HHMM format: + + .. prompt:: bash $ + + ceph config set mgr mgr/balancer/begin_time 0000 + +Set the time of day to finish automatic balancing in HHMM format: + + .. prompt:: bash $ + + ceph config set mgr mgr/balancer/end_time 2359 + +Restrict automatic balancing to this day of the week or later. +Uses the same conventions as crontab, 0 is Sunday, 1 is Monday, and so on: + + .. 
prompt:: bash $ + + ceph config set mgr mgr/balancer/begin_weekday 0 + +Restrict automatic balancing to this day of the week or earlier. +Uses the same conventions as crontab, 0 is Sunday, 1 is Monday, and so on: + + .. prompt:: bash $ + + ceph config set mgr mgr/balancer/end_weekday 6 + +Pool IDs to which the automatic balancing will be limited. +The default for this is an empty string, meaning all pools will be balanced. +The numeric pool IDs can be gotten with the :command:`ceph osd pool ls detail` command: + + .. prompt:: bash $ + + ceph config set mgr mgr/balancer/pool_ids 1,2,3 + + +Modes +----- + +There are currently two supported balancer modes: + +#. **crush-compat**. The CRUSH compat mode uses the compat weight-set + feature (introduced in Luminous) to manage an alternative set of + weights for devices in the CRUSH hierarchy. The normal weights + should remain set to the size of the device to reflect the target + amount of data that we want to store on the device. The balancer + then optimizes the weight-set values, adjusting them up or down in + small increments, in order to achieve a distribution that matches + the target distribution as closely as possible. (Because PG + placement is a pseudorandom process, there is a natural amount of + variation in the placement; by optimizing the weights we + counter-act that natural variation.) + + Notably, this mode is *fully backwards compatible* with older + clients: when an OSDMap and CRUSH map is shared with older clients, + we present the optimized weights as the "real" weights. + + The primary restriction of this mode is that the balancer cannot + handle multiple CRUSH hierarchies with different placement rules if + the subtrees of the hierarchy share any OSDs. (This is normally + not the case, and is generally not a recommended configuration + because it is hard to manage the space utilization on the shared + OSDs.) + +#. **upmap**. 
Starting with Luminous, the OSDMap can store explicit + mappings for individual OSDs as exceptions to the normal CRUSH + placement calculation. These ``upmap`` entries provide fine-grained + control over the PG mapping. This mode will optimize the + placement of individual PGs in order to achieve a balanced + distribution. In most cases, this distribution is "perfect," with + an equal number of PGs on each OSD (+/-1 PG, since they might not + divide evenly). + + Note that using upmap requires that all clients be Luminous or newer. + +The default mode is ``upmap``. The mode can be adjusted with: + + .. prompt:: bash $ + + ceph balancer mode crush-compat + +Supervised optimization +----------------------- + +The balancer operation is broken into a few distinct phases: + +#. building a *plan* +#. evaluating the quality of the data distribution, either for the current PG distribution, or the PG distribution that would result after executing a *plan* +#. executing the *plan* + +To evaluate and score the current distribution: + + .. prompt:: bash $ + + ceph balancer eval + +You can also evaluate the distribution for a single pool with: + + .. prompt:: bash $ + + ceph balancer eval <pool-name> + +Greater detail for the evaluation can be seen with: + + .. prompt:: bash $ + + ceph balancer eval-verbose ... + +The balancer can generate a plan, using the currently configured mode, with: + + .. prompt:: bash $ + + ceph balancer optimize <plan-name> + +The name is provided by the user and can be any useful identifying string. The contents of a plan can be seen with: + + .. prompt:: bash $ + + ceph balancer show <plan-name> + +All plans can be shown with: + + .. prompt:: bash $ + + ceph balancer ls + +Old plans can be discarded with: + + .. prompt:: bash $ + + ceph balancer rm <plan-name> + +Currently recorded plans are shown as part of the status command: + + .. 
prompt:: bash $ + + ceph balancer status + +The quality of the distribution that would result after executing a plan can be calculated with: + + .. prompt:: bash $ + + ceph balancer eval <plan-name> + +Assuming the plan is expected to improve the distribution (i.e., it has a lower score than the current cluster state), the user can execute that plan with: + + .. prompt:: bash $ + + ceph balancer execute <plan-name> + diff --git a/doc/rados/operations/bluestore-migration.rst b/doc/rados/operations/bluestore-migration.rst new file mode 100644 index 000000000..1ac5f2b13 --- /dev/null +++ b/doc/rados/operations/bluestore-migration.rst @@ -0,0 +1,338 @@ +===================== + BlueStore Migration +===================== + +Each OSD can run either BlueStore or FileStore, and a single Ceph +cluster can contain a mix of both. Users who have previously deployed +FileStore are likely to want to transition to BlueStore in order to +take advantage of the improved performance and robustness. There are +several strategies for making such a transition. + +An individual OSD cannot be converted in place in isolation, however: +BlueStore and FileStore are simply too different for that to be +practical. "Conversion" will rely either on the cluster's normal +replication and healing support or tools and strategies that copy OSD +content from an old (FileStore) device to a new (BlueStore) one. + + +Deploy new OSDs with BlueStore +============================== + +Any new OSDs (e.g., when the cluster is expanded) can be deployed +using BlueStore. This is the default behavior so no specific change +is needed. + +Similarly, any OSDs that are reprovisioned after replacing a failed drive +can use BlueStore. + +Convert existing OSDs +===================== + +Mark out and replace +-------------------- + +The simplest approach is to mark out each device in turn, wait for the +data to replicate across the cluster, reprovision the OSD, and mark +it back in again. 
It is simple and easy to automate. However, it requires +more data migration than should be necessary, so it is not optimal. + +#. Identify a FileStore OSD to replace:: + + ID=<osd-id-number> + DEVICE=<disk-device> + + You can tell whether a given OSD is FileStore or BlueStore with: + + .. prompt:: bash $ + + ceph osd metadata $ID | grep osd_objectstore + + You can get a current count of FileStore vs BlueStore OSDs with: + + .. prompt:: bash $ + + ceph osd count-metadata osd_objectstore + +#. Mark the FileStore OSD out: + + .. prompt:: bash $ + + ceph osd out $ID + +#. Wait for the data to migrate off the OSD in question: + + .. prompt:: bash $ + + while ! ceph osd safe-to-destroy $ID ; do sleep 60 ; done + +#. Stop the OSD: + + .. prompt:: bash $ + + systemctl kill ceph-osd@$ID + +#. Make note of which device this OSD is using: + + .. prompt:: bash $ + + mount | grep /var/lib/ceph/osd/ceph-$ID + +#. Unmount the OSD: + + .. prompt:: bash $ + + umount /var/lib/ceph/osd/ceph-$ID + +#. Destroy the OSD data. Be *EXTREMELY CAREFUL* as this will destroy + the contents of the device; be certain the data on the device is + not needed (i.e., that the cluster is healthy) before proceeding: + + .. prompt:: bash $ + + ceph-volume lvm zap $DEVICE + +#. Tell the cluster the OSD has been destroyed (and a new OSD can be + reprovisioned with the same ID): + + .. prompt:: bash $ + + ceph osd destroy $ID --yes-i-really-mean-it + +#. Reprovision a BlueStore OSD in its place with the same OSD ID. + This requires you to identify which device to wipe based on what you saw + mounted above. BE CAREFUL! : + + .. prompt:: bash $ + + ceph-volume lvm create --bluestore --data $DEVICE --osd-id $ID + +#. Repeat. + +You can allow the refilling of the replacement OSD to happen +concurrently with the draining of the next OSD, or follow the same +procedure for multiple OSDs in parallel, as long as you ensure the +cluster is fully clean (all data has all replicas) before destroying +any OSDs. 
Failure to do so will reduce the redundancy of your data +and increase the risk of (or potentially even cause) data loss. + +Advantages: + +* Simple. +* Can be done on a device-by-device basis. +* No spare devices or hosts are required. + +Disadvantages: + +* Data is copied over the network twice: once to some other OSD in the + cluster (to maintain the desired number of replicas), and then again + back to the reprovisioned BlueStore OSD. + + +Whole host replacement +---------------------- + +If you have a spare host in the cluster, or have sufficient free space +to evacuate an entire host in order to use it as a spare, then the +conversion can be done on a host-by-host basis with each stored copy of +the data migrating only once. + +First, you need an empty host that has no data. There are two ways to do this: either by starting with a new, empty host that isn't yet part of the cluster, or by offloading data from an existing host that is in the cluster. + +Use a new, empty host +^^^^^^^^^^^^^^^^^^^^^ + +Ideally the host should have roughly the +same capacity as other hosts you will be converting (although it +doesn't strictly matter). :: + + NEWHOST=<empty-host-name> + +Add the host to the CRUSH hierarchy, but do not attach it to the root: + +.. prompt:: bash $ + + ceph osd crush add-bucket $NEWHOST host + +Make sure the Ceph packages are installed. + +Use an existing host +^^^^^^^^^^^^^^^^^^^^ + +If you would like to use an existing host +that is already part of the cluster, and there is sufficient free +space on that host so that all of its data can be migrated off, +then you can instead do:: + + OLDHOST=<existing-cluster-host-to-offload> + +.. prompt:: bash $ + + ceph osd crush unlink $OLDHOST default + +where "default" is the immediate ancestor in the CRUSH map. (For +smaller clusters with unmodified configurations this will normally +be "default", but it might also be a rack name.) 
You should now +see the host at the top of the OSD tree output with no parent: + +.. prompt:: bash $ + + bin/ceph osd tree + +:: + + ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF + -5 0 host oldhost + 10 ssd 1.00000 osd.10 up 1.00000 1.00000 + 11 ssd 1.00000 osd.11 up 1.00000 1.00000 + 12 ssd 1.00000 osd.12 up 1.00000 1.00000 + -1 3.00000 root default + -2 3.00000 host foo + 0 ssd 1.00000 osd.0 up 1.00000 1.00000 + 1 ssd 1.00000 osd.1 up 1.00000 1.00000 + 2 ssd 1.00000 osd.2 up 1.00000 1.00000 + ... + +If everything looks good, jump directly to the "Wait for data +migration to complete" step below and proceed from there to clean up +the old OSDs. + +Migration process +^^^^^^^^^^^^^^^^^ + +If you're using a new host, start at step #1. For an existing host, +jump to step #5 below. + +#. Provision new BlueStore OSDs for all devices: + + .. prompt:: bash $ + + ceph-volume lvm create --bluestore --data /dev/$DEVICE + +#. Verify OSDs join the cluster with: + + .. prompt:: bash $ + + ceph osd tree + + You should see the new host ``$NEWHOST`` with all of the OSDs beneath + it, but the host should *not* be nested beneath any other node in + hierarchy (like ``root default``). For example, if ``newhost`` is + the empty host, you might see something like:: + + $ bin/ceph osd tree + ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF + -5 0 host newhost + 10 ssd 1.00000 osd.10 up 1.00000 1.00000 + 11 ssd 1.00000 osd.11 up 1.00000 1.00000 + 12 ssd 1.00000 osd.12 up 1.00000 1.00000 + -1 3.00000 root default + -2 3.00000 host oldhost1 + 0 ssd 1.00000 osd.0 up 1.00000 1.00000 + 1 ssd 1.00000 osd.1 up 1.00000 1.00000 + 2 ssd 1.00000 osd.2 up 1.00000 1.00000 + ... + +#. Identify the first target host to convert : + + .. prompt:: bash $ + + OLDHOST=<existing-cluster-host-to-convert> + +#. Swap the new host into the old host's position in the cluster: + + .. 
prompt:: bash $ + + ceph osd crush swap-bucket $NEWHOST $OLDHOST + + At this point all data on ``$OLDHOST`` will start migrating to OSDs + on ``$NEWHOST``. If there is a difference in the total capacity of + the old and new hosts you may also see some data migrate to or from + other nodes in the cluster, but as long as the hosts are similarly + sized this will be a relatively small amount of data. + +#. Wait for data migration to complete: + + .. prompt:: bash $ + + while ! ceph osd safe-to-destroy $(ceph osd ls-tree $OLDHOST); do sleep 60 ; done + +#. Stop all old OSDs on the now-empty ``$OLDHOST``: + + .. prompt:: bash $ + + ssh $OLDHOST + systemctl kill ceph-osd.target + umount /var/lib/ceph/osd/ceph-* + +#. Destroy and purge the old OSDs: + + .. prompt:: bash $ + + for osd in `ceph osd ls-tree $OLDHOST`; do + ceph osd purge $osd --yes-i-really-mean-it + done + +#. Wipe the old OSD devices. This requires you to manually identify which + devices are to be wiped (BE CAREFUL!). For each device: + + .. prompt:: bash $ + + ceph-volume lvm zap $DEVICE + +#. Use the now-empty host as the new host, and repeat:: + + NEWHOST=$OLDHOST + +Advantages: + +* Data is copied over the network only once. +* Converts an entire host's OSDs at once. +* Can be parallelized to convert multiple hosts at a time. +* No spare devices are required on each host. + +Disadvantages: + +* A spare host is required. +* An entire host's worth of OSDs will be migrating data at a time. This + is likely to impact overall cluster performance. +* All migrated data still makes one full hop over the network. + + +Per-OSD device copy +------------------- + +A single logical OSD can be converted by using the ``copy`` function +of ``ceph-objectstore-tool``. This requires that the host have a free +device (or devices) to provision a new, empty BlueStore OSD. 
For +example, if each host in your cluster has 12 OSDs, then you'd need a +13th available device so that each OSD can be converted in turn before the +old device is reclaimed to convert the next OSD. + +Caveats: + +* This strategy requires that a blank BlueStore OSD be prepared + without allocating a new OSD ID, something that the ``ceph-volume`` + tool doesn't support. More importantly, the setup of *dmcrypt* is + closely tied to the OSD identity, which means that this approach + does not work with encrypted OSDs. + +* The device must be manually partitioned. + +* Tooling not implemented! + +* Not documented! + +Advantages: + +* Little or no data migrates over the network during the conversion. + +Disadvantages: + +* Tooling not fully implemented. +* Process not documented. +* Each host must have a spare or empty device. +* The OSD is offline during the conversion, which means new writes will + be written to only a subset of the OSDs. This increases the risk of data + loss due to a subsequent failure. (However, if there is a failure before + conversion is complete, the original FileStore OSD can be started to provide + access to its original data.) diff --git a/doc/rados/operations/cache-tiering.rst b/doc/rados/operations/cache-tiering.rst new file mode 100644 index 000000000..8056ace47 --- /dev/null +++ b/doc/rados/operations/cache-tiering.rst @@ -0,0 +1,552 @@ +=============== + Cache Tiering +=============== + +A cache tier provides Ceph Clients with better I/O performance for a subset of +the data stored in a backing storage tier. Cache tiering involves creating a +pool of relatively fast/expensive storage devices (e.g., solid state drives) +configured to act as a cache tier, and a backing pool of either erasure-coded +or relatively slower/cheaper devices configured to act as an economical storage +tier. The Ceph objecter handles where to place the objects and the tiering +agent determines when to flush objects from the cache to the backing storage +tier. 
So the cache tier and the backing storage tier are completely transparent +to Ceph clients. + + +.. ditaa:: + +-------------+ + | Ceph Client | + +------+------+ + ^ + Tiering is | + Transparent | Faster I/O + to Ceph | +---------------+ + Client Ops | | | + | +----->+ Cache Tier | + | | | | + | | +-----+---+-----+ + | | | ^ + v v | | Active Data in Cache Tier + +------+----+--+ | | + | Objecter | | | + +-----------+--+ | | + ^ | | Inactive Data in Storage Tier + | v | + | +-----+---+-----+ + | | | + +----->| Storage Tier | + | | + +---------------+ + Slower I/O + + +The cache tiering agent handles the migration of data between the cache tier +and the backing storage tier automatically. However, admins have the ability to +configure how this migration takes place by setting the ``cache-mode``. There are +two main scenarios: + +- **writeback** mode: If the base tier and the cache tier are configured in + ``writeback`` mode, Ceph clients receive an ACK from the base tier every time + they write data to it. Then the cache tiering agent determines whether + ``osd_tier_default_cache_min_write_recency_for_promote`` has been set. If it + has been set and the data has been written more than a specified number of + times per interval, the data is promoted to the cache tier. + + When Ceph clients need access to data stored in the base tier, the cache + tiering agent reads the data from the base tier and returns it to the client. + While data is being read from the base tier, the cache tiering agent consults + the value of ``osd_tier_default_cache_min_read_recency_for_promote`` and + decides whether to promote that data from the base tier to the cache tier. + When data has been promoted from the base tier to the cache tier, the Ceph + client is able to perform I/O operations on it using the cache tier. This is + well-suited for mutable data (for example, photo/video editing, transactional + data). 
+ +- **readproxy** mode: This mode will use any objects that already + exist in the cache tier, but if an object is not present in the + cache the request will be proxied to the base tier. This is useful + for transitioning from ``writeback`` mode to a disabled cache as it + allows the workload to function properly while the cache is drained, + without adding any new objects to the cache. + +Other cache modes are: + +- **readonly** promotes objects to the cache on read operations only; write + operations are forwarded to the base tier. This mode is intended for + read-only workloads that do not require consistency to be enforced by the + storage system. (**Warning**: when objects are updated in the base tier, + Ceph makes **no** attempt to sync these updates to the corresponding objects + in the cache. Since this mode is considered experimental, a + ``--yes-i-really-mean-it`` option must be passed in order to enable it.) + +- **none** is used to completely disable caching. + + +A word of caution +================= + +Cache tiering will *degrade* performance for most workloads. Users should use +extreme caution before using this feature. + +* *Workload dependent*: Whether a cache will improve performance is + highly dependent on the workload. Because there is a cost + associated with moving objects into or out of the cache, it can only + be effective when there is a *large skew* in the access pattern in + the data set, such that most of the requests touch a small number of + objects. The cache pool should be large enough to capture the + working set for your workload to avoid thrashing. + +* *Difficult to benchmark*: Most benchmarks that users run to measure + performance will show terrible performance with cache tiering, in + part because very few of them skew requests toward a small set of + objects, it can take a long time for the cache to "warm up," and + because the warm-up cost can be high. 
+ +* *Usually slower*: For workloads that are not cache tiering-friendly, + performance is often slower than a normal RADOS pool without cache + tiering enabled. + +* *librados object enumeration*: The librados-level object enumeration + API is not meant to be coherent in the presence of the cache. If + your application is using librados directly and relies on object + enumeration, cache tiering will probably not work as expected. + (This is not a problem for RGW, RBD, or CephFS.) + +* *Complexity*: Enabling cache tiering means that a lot of additional + machinery and complexity within the RADOS cluster is being used. + This increases the probability that you will encounter a bug in the system + that other users have not yet encountered and will put your deployment at a + higher level of risk. + +Known Good Workloads +-------------------- + +* *RGW time-skewed*: If the RGW workload is such that almost all read + operations are directed at recently written objects, a simple cache + tiering configuration that destages recently written objects from + the cache to the base tier after a configurable period can work + well. + +Known Bad Workloads +------------------- + +The following configurations are *known to work poorly* with cache +tiering. + +* *RBD with replicated cache and erasure-coded base*: This is a common + request, but usually does not perform well. Even reasonably skewed + workloads still send some small writes to cold objects, and because + small writes are not yet supported by the erasure-coded pool, entire + (usually 4 MB) objects must be migrated into the cache in order to + satisfy a small (often 4 KB) write. Only a handful of users have + successfully deployed this configuration, and it only works for them + because their data is extremely cold (backups) and they are not in + any way sensitive to performance. 
+ +* *RBD with replicated cache and base*: RBD with a replicated base + tier does better than when the base is erasure coded, but it is + still highly dependent on the amount of skew in the workload, and + very difficult to validate. The user will need to have a good + understanding of their workload and will need to tune the cache + tiering parameters carefully. + + +Setting Up Pools +================ + +To set up cache tiering, you must have two pools. One will act as the +backing storage and the other will act as the cache. + + +Setting Up a Backing Storage Pool +--------------------------------- + +Setting up a backing storage pool typically involves one of two scenarios: + +- **Standard Storage**: In this scenario, the pool stores multiple copies + of an object in the Ceph Storage Cluster. + +- **Erasure Coding**: In this scenario, the pool uses erasure coding to + store data much more efficiently with a small performance tradeoff. + +In the standard storage scenario, you can set up a CRUSH rule to establish +the failure domain (e.g., osd, host, chassis, rack, row, etc.). Ceph OSD +Daemons perform optimally when all storage drives in the rule are of the +same size, speed (both RPMs and throughput) and type. See `CRUSH Maps`_ +for details on creating a rule. Once you have created a rule, create +a backing storage pool. + +In the erasure coding scenario, the pool creation arguments will generate the +appropriate rule automatically. See `Create a Pool`_ for details. + +In subsequent examples, we will refer to the backing storage pool +as ``cold-storage``. + + +Setting Up a Cache Pool +----------------------- + +Setting up a cache pool follows the same procedure as the standard storage +scenario, but with this difference: the drives for the cache tier are typically +high performance drives that reside in their own servers and have their own +CRUSH rule. 
When setting up such a rule, it should take account of the hosts +that have the high performance drives while omitting the hosts that don't. See +:ref:`CRUSH Device Class<crush-map-device-class>` for details. + + +In subsequent examples, we will refer to the cache pool as ``hot-storage`` and +the backing pool as ``cold-storage``. + +For cache tier configuration and default values, see +`Pools - Set Pool Values`_. + + +Creating a Cache Tier +===================== + +Setting up a cache tier involves associating a backing storage pool with +a cache pool: + +.. prompt:: bash $ + + ceph osd tier add {storagepool} {cachepool} + +For example: + +.. prompt:: bash $ + + ceph osd tier add cold-storage hot-storage + +To set the cache mode, execute the following: + +.. prompt:: bash $ + + ceph osd tier cache-mode {cachepool} {cache-mode} + +For example: + +.. prompt:: bash $ + + ceph osd tier cache-mode hot-storage writeback + +The cache tiers overlay the backing storage tier, so they require one +additional step: you must direct all client traffic from the storage pool to +the cache pool. To direct client traffic directly to the cache pool, execute +the following: + +.. prompt:: bash $ + + ceph osd tier set-overlay {storagepool} {cachepool} + +For example: + +.. prompt:: bash $ + + ceph osd tier set-overlay cold-storage hot-storage + + +Configuring a Cache Tier +======================== + +Cache tiers have several configuration options. You may set +cache tier configuration options with the following usage: + +.. prompt:: bash $ + + ceph osd pool set {cachepool} {key} {value} + +See `Pools - Set Pool Values`_ for details. + + +Target Size and Type +-------------------- + +Ceph's production cache tiers use a `Bloom Filter`_ for the ``hit_set_type``: + +.. prompt:: bash $ + + ceph osd pool set {cachepool} hit_set_type bloom + +For example: + +.. 
prompt:: bash $ + + ceph osd pool set hot-storage hit_set_type bloom + +The ``hit_set_count`` and ``hit_set_period`` define how many such HitSets to +store, and how much time each HitSet should cover: + +.. prompt:: bash $ + + ceph osd pool set {cachepool} hit_set_count 12 + ceph osd pool set {cachepool} hit_set_period 14400 + ceph osd pool set {cachepool} target_max_bytes 1000000000000 + +.. note:: A larger ``hit_set_count`` results in more RAM consumed by + the ``ceph-osd`` process. + +Binning accesses over time allows Ceph to determine whether a Ceph client +accessed an object at least once, or more than once over a time period +("age" vs "temperature"). + +The ``min_read_recency_for_promote`` setting defines how many HitSets to check for the +existence of an object when handling a read operation. The result of the check is +used to decide whether to promote the object asynchronously. Its value should be +between 0 and ``hit_set_count``. If it is set to 0, the object is always promoted. +If it is set to 1, only the current HitSet is checked, and the object is promoted +if it is found there. For other values, that +number of archived HitSets is checked, and the object is promoted if it is +found in any of the most recent ``min_read_recency_for_promote`` HitSets. + +A similar parameter can be set for the write operation, which is +``min_write_recency_for_promote``: + +.. prompt:: bash $ + + ceph osd pool set {cachepool} min_read_recency_for_promote 2 + ceph osd pool set {cachepool} min_write_recency_for_promote 2 + +.. note:: The longer the period and the higher the + ``min_read_recency_for_promote`` and + ``min_write_recency_for_promote`` values, the more RAM the ``ceph-osd`` + daemon consumes. In particular, when the agent is active to flush + or evict cache objects, all ``hit_set_count`` HitSets are loaded + into RAM. 
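The promotion rule for ``min_read_recency_for_promote`` can be modeled as follows. This is only an illustrative sketch of the rule as stated in this section, not Ceph's implementation. ``should_promote`` is a hypothetical helper: its first argument is the recency value, and the remaining arguments are 1 or 0 for whether the object appears in each HitSet, most recent first:

```shell
# Succeed (exit status 0) if the object would be promoted, fail (1) otherwise.
# $1: recency value; $2...: HitSet membership flags (1 = present), newest first.
should_promote() {
    recency=$1
    shift
    # A recency of 0 means the object is always promoted.
    if [ "$recency" -eq 0 ]; then
        return 0
    fi
    checked=0
    for hit in "$@"; do
        # Only the most recent $recency HitSets are consulted.
        if [ "$checked" -ge "$recency" ]; then
            break
        fi
        if [ "$hit" -eq 1 ]; then
            return 0
        fi
        checked=$((checked + 1))
    done
    return 1
}
```

For example, ``should_promote 2 0 1`` succeeds because the object is in the second most recent HitSet, while ``should_promote 1 0 1`` fails because only the current HitSet is checked.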
+ + +Cache Sizing +------------ + +The cache tiering agent performs two main functions: + +- **Flushing:** The agent identifies modified (or dirty) objects and forwards + them to the storage pool for long-term storage. + +- **Evicting:** The agent identifies unmodified (or clean) objects + and evicts the least recently used among them from the cache. + + +Absolute Sizing +~~~~~~~~~~~~~~~ + +The cache tiering agent can flush or evict objects based upon the total number +of bytes or the total number of objects. To specify a maximum number of bytes, +execute the following: + +.. prompt:: bash $ + + ceph osd pool set {cachepool} target_max_bytes {#bytes} + +For example, to flush or evict at 1 TiB, execute the following: + +.. prompt:: bash $ + + ceph osd pool set hot-storage target_max_bytes 1099511627776 + +To specify the maximum number of objects, execute the following: + +.. prompt:: bash $ + + ceph osd pool set {cachepool} target_max_objects {#objects} + +For example, to flush or evict at 1M objects, execute the following: + +.. prompt:: bash $ + + ceph osd pool set hot-storage target_max_objects 1000000 + +.. note:: Ceph is not able to determine the size of a cache pool automatically, so + you must configure an absolute size here; otherwise, flushing and + evicting will not work. If you specify both limits, the cache tiering + agent will begin flushing or evicting when either threshold is triggered. + +.. note:: Client requests are blocked only when ``target_max_bytes`` or + ``target_max_objects`` is reached. + +Relative Sizing +~~~~~~~~~~~~~~~ + +The cache tiering agent can flush or evict objects relative to the size of the +cache pool (specified by ``target_max_bytes`` / ``target_max_objects`` in +`Absolute sizing`_). When the cache pool consists of a certain percentage of +modified (or dirty) objects, the cache tiering agent will flush them to the +storage pool. To set the ``cache_target_dirty_ratio``, execute the following: + +.. 
prompt:: bash $ + + ceph osd pool set {cachepool} cache_target_dirty_ratio {0.0..1.0} + +For example, setting the value to ``0.4`` will begin flushing modified +(dirty) objects when they reach 40% of the cache pool's capacity: + +.. prompt:: bash $ + + ceph osd pool set hot-storage cache_target_dirty_ratio 0.4 + +When dirty objects reach a higher percentage of the cache pool's capacity, the +agent flushes them at a higher speed. To set the ``cache_target_dirty_high_ratio``, +execute the following: + +.. prompt:: bash $ + + ceph osd pool set {cachepool} cache_target_dirty_high_ratio {0.0..1.0} + +For example, setting the value to ``0.6`` will begin aggressively flushing dirty +objects when they reach 60% of the cache pool's capacity. This value should lie +between ``cache_target_dirty_ratio`` and ``cache_target_full_ratio``: + +.. prompt:: bash $ + + ceph osd pool set hot-storage cache_target_dirty_high_ratio 0.6 + +When the cache pool reaches a certain percentage of its capacity, the cache +tiering agent will evict objects to maintain free capacity. To set the +``cache_target_full_ratio``, execute the following: + +.. prompt:: bash $ + + ceph osd pool set {cachepool} cache_target_full_ratio {0.0..1.0} + +For example, setting the value to ``0.8`` will begin evicting unmodified +(clean) objects when they reach 80% of the cache pool's capacity: + +.. prompt:: bash $ + + ceph osd pool set hot-storage cache_target_full_ratio 0.8 + + +Cache Age +--------- + +You can specify the minimum age of an object before the cache tiering agent +flushes a recently modified (or dirty) object to the backing storage pool: + +.. prompt:: bash $ + + ceph osd pool set {cachepool} cache_min_flush_age {#seconds} + +For example, to flush modified (or dirty) objects after 10 minutes, execute the +following: + +.. prompt:: bash $ + + ceph osd pool set hot-storage cache_min_flush_age 600 + +You can specify the minimum age of an object before it will be evicted from the +cache tier: + +.. 
prompt:: bash $ + + ceph osd pool set {cachepool} cache_min_evict_age {#seconds} + +For example, to evict objects after 30 minutes, execute the following: + +.. prompt:: bash $ + + ceph osd pool set hot-storage cache_min_evict_age 1800 + + +Removing a Cache Tier +===================== + +Removing a cache tier differs depending on whether it is a writeback +cache or a read-only cache. + + +Removing a Read-Only Cache +-------------------------- + +Since a read-only cache does not have modified data, you can disable +and remove it without losing any recent changes to objects in the cache. + +#. Change the cache-mode to ``none`` to disable it: + + .. prompt:: bash $ + + ceph osd tier cache-mode {cachepool} none + + For example: + + .. prompt:: bash $ + + ceph osd tier cache-mode hot-storage none + +#. Remove the cache pool from the backing pool: + + .. prompt:: bash $ + + ceph osd tier remove {storagepool} {cachepool} + + For example: + + .. prompt:: bash $ + + ceph osd tier remove cold-storage hot-storage + + +Removing a Writeback Cache +-------------------------- + +Since a writeback cache may have modified data, you must take steps to ensure +that you do not lose any recent changes to objects in the cache before you +disable and remove it. + + +#. Change the cache mode to ``proxy`` so that new and modified objects will + flush to the backing storage pool: + + .. prompt:: bash $ + + ceph osd tier cache-mode {cachepool} proxy + + For example: + + .. prompt:: bash $ + + ceph osd tier cache-mode hot-storage proxy + + +#. Ensure that the cache pool has been flushed. This may take a few minutes: + + .. prompt:: bash $ + + rados -p {cachepool} ls + + If the cache pool still has objects, you can flush them manually. + For example: + + .. prompt:: bash $ + + rados -p {cachepool} cache-flush-evict-all + + +#. Remove the overlay so that clients will not direct traffic to the cache: + + .. prompt:: bash $ + + ceph osd tier remove-overlay {storagepool} + + For example: + + .. 
prompt:: bash $ + + ceph osd tier remove-overlay cold-storage + + +#. Finally, remove the cache tier pool from the backing storage pool: + + .. prompt:: bash $ + + ceph osd tier remove {storagepool} {cachepool} + + For example: + + .. prompt:: bash $ + + ceph osd tier remove cold-storage hot-storage + + +.. _Create a Pool: ../pools#create-a-pool +.. _Pools - Set Pool Values: ../pools#set-pool-values +.. _Bloom Filter: https://en.wikipedia.org/wiki/Bloom_filter +.. _CRUSH Maps: ../crush-map +.. _Absolute Sizing: #absolute-sizing diff --git a/doc/rados/operations/change-mon-elections.rst b/doc/rados/operations/change-mon-elections.rst new file mode 100644 index 000000000..eba730bdc --- /dev/null +++ b/doc/rados/operations/change-mon-elections.rst @@ -0,0 +1,88 @@ +.. _changing_monitor_elections: + +===================================== +Configure Monitor Election Strategies +===================================== + +By default, the monitors will use the ``classic`` mode. We +recommend that you stay in this mode unless you have a very specific reason. + +If you want to switch modes BEFORE constructing the cluster, change +the ``mon election default strategy`` option. This option is an integer value: + +* 1 for "classic" +* 2 for "disallow" +* 3 for "connectivity" + +Once your cluster is running, you can change strategies by running :: + + $ ceph mon set election_strategy {classic|disallow|connectivity} + +Choosing a mode +=============== +The modes other than classic provide different features. We recommend +you stay in classic mode if you don't need the extra features as it is +the simplest mode. + +The disallow Mode +================= +This mode lets you mark monitors as disallowed, in which case they will +participate in the quorum and serve clients, but cannot be elected leader. You +may wish to use this if you have some monitors which are known to be far away +from clients. +You can disallow a leader by running: + +.. 
prompt:: bash $ + + ceph mon add disallowed_leader {name} + +You can remove a monitor from the disallowed list, and allow it to become +a leader again, by running: + +.. prompt:: bash $ + + ceph mon rm disallowed_leader {name} + +The list of disallowed_leaders is included when you run: + +.. prompt:: bash $ + + ceph mon dump + +The connectivity Mode +===================== +This mode evaluates connection scores provided by each monitor for its +peers and elects the monitor with the highest score. This mode is designed +to handle network partitioning or *net-splits*, which may happen if your cluster +is stretched across multiple data centers or otherwise has a non-uniform +or unbalanced network topology. + +This mode also supports disallowing monitors from being the leader +using the same commands as above in disallow. + +Examining connectivity scores +============================= +The monitors maintain connection scores even if they aren't in +the connectivity election mode. You can examine the scores a monitor +has by running: + +.. prompt:: bash $ + + ceph daemon mon.{name} connection scores dump + +Scores for individual connections range from 0-1 inclusive, and also +include whether the connection is considered alive or dead (determined by +whether it returned its latest ping within the timeout). + +While this would be an unexpected occurrence, if for some reason you experience +problems and troubleshooting makes you think your scores have become invalid, +you can forget history and reset them by running: + +.. prompt:: bash $ + + ceph daemon mon.{name} connection scores reset + +While resetting scores has low risk (monitors will still quickly determine +if a connection is alive or dead, and trend back to the previous scores if they +were accurate!), it should also not be needed and is not recommended unless +requested by your support team or a developer. 
diff --git a/doc/rados/operations/control.rst b/doc/rados/operations/control.rst new file mode 100644 index 000000000..d7a512618 --- /dev/null +++ b/doc/rados/operations/control.rst @@ -0,0 +1,601 @@ +.. index:: control, commands + +================== + Control Commands +================== + + +Monitor Commands +================ + +Monitor commands are issued using the ``ceph`` utility: + +.. prompt:: bash $ + + ceph [-m monhost] {command} + +The command is usually (though not always) of the form: + +.. prompt:: bash $ + + ceph {subsystem} {command} + + +System Commands +=============== + +Execute the following to display the current cluster status. : + +.. prompt:: bash $ + + ceph -s + ceph status + +Execute the following to display a running summary of cluster status +and major events. : + +.. prompt:: bash $ + + ceph -w + +Execute the following to show the monitor quorum, including which monitors are +participating and which one is the leader. : + +.. prompt:: bash $ + + ceph mon stat + ceph quorum_status + +Execute the following to query the status of a single monitor, including whether +or not it is in the quorum. : + +.. prompt:: bash $ + + ceph tell mon.[id] mon_status + +where the value of ``[id]`` can be determined, e.g., from ``ceph -s``. + + +Authentication Subsystem +======================== + +To add a keyring for an OSD, execute the following: + +.. prompt:: bash $ + + ceph auth add {osd} {--in-file|-i} {path-to-osd-keyring} + +To list the cluster's keys and their capabilities, execute the following: + +.. prompt:: bash $ + + ceph auth ls + + +Placement Group Subsystem +========================= + +To display the statistics for all placement groups (PGs), execute the following: + +.. prompt:: bash $ + + ceph pg dump [--format {format}] + +The valid formats are ``plain`` (default), ``json``, ``json-pretty``, ``xml``, and ``xml-pretty``. +When implementing monitoring and other tools, it is best to use ``json`` format. 
+JSON parsing is more deterministic than the human-oriented ``plain``, and the layout is much +less variable from release to release. The ``jq`` utility can be invaluable when extracting +data from JSON output. + +To display the statistics for all placement groups stuck in a specified state, +execute the following: + +.. prompt:: bash $ + + ceph pg dump_stuck inactive|unclean|stale|undersized|degraded [--format {format}] [-t|--threshold {seconds}] + + +``--format`` may be ``plain`` (default), ``json``, ``json-pretty``, ``xml``, or ``xml-pretty``. + +``--threshold`` defines how many seconds "stuck" is (default: 300) + +**Inactive** Placement groups cannot process reads or writes because they are waiting for an OSD +with the most up-to-date data to come back. + +**Unclean** Placement groups contain objects that are not replicated the desired number +of times. They should be recovering. + +**Stale** Placement groups are in an unknown state - the OSDs that host them have not +reported to the monitor cluster in a while (configured by +``mon_osd_report_timeout``). + +Delete "lost" objects or revert them to their prior state, either a previous version +or delete them if they were just created. : + +.. prompt:: bash $ + + ceph pg {pgid} mark_unfound_lost revert|delete + + +.. _osd-subsystem: + +OSD Subsystem +============= + +Query OSD subsystem status. : + +.. prompt:: bash $ + + ceph osd stat + +Write a copy of the most recent OSD map to a file. See +:ref:`osdmaptool <osdmaptool>`. : + +.. prompt:: bash $ + + ceph osd getmap -o file + +Write a copy of the crush map from the most recent OSD map to +file. : + +.. prompt:: bash $ + + ceph osd getcrushmap -o file + +The foregoing is functionally equivalent to : + +.. prompt:: bash $ + + ceph osd getmap -o /tmp/osdmap + osdmaptool /tmp/osdmap --export-crush file + +Dump the OSD map. Valid formats for ``-f`` are ``plain``, ``json``, ``json-pretty``, +``xml``, and ``xml-pretty``. 
If no ``--format`` option is given, the OSD map is +dumped as plain text. As above, JSON format is best for tools, scripting, and other automation. : + +.. prompt:: bash $ + + ceph osd dump [--format {format}] + +Dump the OSD map as a tree with one line per OSD containing weight +and state. : + +.. prompt:: bash $ + + ceph osd tree [--format {format}] + +Find out where a specific object is or would be stored in the system: + +.. prompt:: bash $ + + ceph osd map <pool-name> <object-name> + +Add or move a new item (OSD) with the given id/name/weight at the specified +location. : + +.. prompt:: bash $ + + ceph osd crush set {id} {weight} [{loc1} [{loc2} ...]] + +Remove an existing item (OSD) from the CRUSH map. : + +.. prompt:: bash $ + + ceph osd crush remove {name} + +Remove an existing bucket from the CRUSH map. : + +.. prompt:: bash $ + + ceph osd crush remove {bucket-name} + +Move an existing bucket from one position in the hierarchy to another. : + +.. prompt:: bash $ + + ceph osd crush move {id} {loc1} [{loc2} ...] + +Set the weight of the item given by ``{name}`` to ``{weight}``. : + +.. prompt:: bash $ + + ceph osd crush reweight {name} {weight} + +Mark an OSD as ``lost``. This may result in permanent data loss. Use with caution. : + +.. prompt:: bash $ + + ceph osd lost {id} [--yes-i-really-mean-it] + +Create a new OSD. If no UUID is given, it will be set automatically when the OSD +starts up. : + +.. prompt:: bash $ + + ceph osd create [{uuid}] + +Remove the given OSD(s). : + +.. prompt:: bash $ + + ceph osd rm [{id}...] + +Query the current ``max_osd`` parameter in the OSD map. : + +.. prompt:: bash $ + + ceph osd getmaxosd + +Import the given crush map. : + +.. prompt:: bash $ + + ceph osd setcrushmap -i file + +Set the ``max_osd`` parameter in the OSD map. This defaults to 10000 now so +most admins will never need to adjust this. : + +.. prompt:: bash $ + + ceph osd setmaxosd + +Mark OSD ``{osd-num}`` down. : + +.. 
prompt:: bash $ + + ceph osd down {osd-num} + +Mark OSD ``{osd-num}`` out of the distribution (i.e. allocated no data). : + +.. prompt:: bash $ + + ceph osd out {osd-num} + +Mark OSD ``{osd-num}`` in the distribution (i.e. allocated data). : + +.. prompt:: bash $ + + ceph osd in {osd-num} + +Set or clear the pause flags in the OSD map. If set, no IO requests +will be sent to any OSD. Clearing the flags via unpause results in +resending pending requests. : + +.. prompt:: bash $ + + ceph osd pause + ceph osd unpause + +Set the override weight (reweight) of ``{osd-num}`` to ``{weight}``. Two OSDs with the +same weight will receive roughly the same number of I/O requests and +store approximately the same amount of data. ``ceph osd reweight`` +sets an override weight on the OSD. This value is in the range 0 to 1, +and forces CRUSH to re-place (1-weight) of the data that would +otherwise live on this drive. It does not change weights assigned +to the buckets above the OSD in the crush map, and is a corrective +measure in case the normal CRUSH distribution is not working out quite +right. For instance, if one of your OSDs is at 90% and the others are +at 50%, you could reduce this weight to compensate. : + +.. prompt:: bash $ + + ceph osd reweight {osd-num} {weight} + +Balance OSD fullness by reducing the override weight of OSDs which are +overly utilized. Note that these override aka ``reweight`` values +default to 1.00000 and are relative only to each other; they are not absolute. +It is crucial to distinguish them from CRUSH weights, which reflect the +absolute capacity of a bucket in TiB. By default this command adjusts +override weight on OSDs which have + or - 20% of the average utilization, +but if you include a ``threshold`` that percentage will be used instead. : + +.. 
prompt:: bash $ + + ceph osd reweight-by-utilization [threshold [max_change [max_osds]]] [--no-increasing] + +To limit the step by which any OSD's reweight will be changed, specify +``max_change`` which defaults to 0.05. To limit the number of OSDs that will +be adjusted, specify ``max_osds`` as well; the default is 4. Increasing these +parameters can speed leveling of OSD utilization, at the potential cost of +greater impact on client operations due to more data moving at once. + +To determine which and how many PGs and OSDs will be affected by a given invocation +you can test before executing. : + +.. prompt:: bash $ + + ceph osd test-reweight-by-utilization [threshold [max_change max_osds]] [--no-increasing] + +Adding ``--no-increasing`` to either command prevents increasing any +override weights that are currently < 1.00000. This can be useful when +you are balancing in a hurry to remedy ``full`` or ``nearfull`` OSDs or +when some OSDs are being evacuated or slowly brought into service. + +Deployments utilizing Nautilus (or later revisions of Luminous and Mimic) +that have no pre-Luminous clients may wish to enable the +``balancer`` module for ``ceph-mgr`` instead. + +Add/remove an IP address or CIDR range to/from the blocklist. +When adding to the blocklist, +you can specify how long it should be blocklisted in seconds; otherwise, +it will default to 1 hour. A blocklisted address is prevented from +connecting to any OSD. If you blocklist an IP or range containing an OSD, be aware +that OSD will also be prevented from performing operations on its peers where it +acts as a client. (This includes tiering and copy-from functionality.) + +If you want to blocklist a range (in CIDR format), you may do so by +including the ``range`` keyword. + +These commands are mostly only useful for failure testing, as +blocklists are normally maintained automatically and shouldn't need +manual intervention. : + +.. 
prompt:: bash $ + + ceph osd blocklist ["range"] add ADDRESS[:source_port][/netmask_bits] [TIME] + ceph osd blocklist ["range"] rm ADDRESS[:source_port][/netmask_bits] + +Creates/deletes a snapshot of a pool. : + +.. prompt:: bash $ + + ceph osd pool mksnap {pool-name} {snap-name} + ceph osd pool rmsnap {pool-name} {snap-name} + +Creates/deletes/renames a storage pool. : + +.. prompt:: bash $ + + ceph osd pool create {pool-name} [pg_num [pgp_num]] + ceph osd pool delete {pool-name} [{pool-name} --yes-i-really-really-mean-it] + ceph osd pool rename {old-name} {new-name} + +Changes a pool setting. : + +.. prompt:: bash $ + + ceph osd pool set {pool-name} {field} {value} + +Valid fields are: + + * ``size``: Sets the number of copies of data in the pool. + * ``pg_num``: The placement group number. + * ``pgp_num``: Effective number when calculating pg placement. + * ``crush_rule``: rule number for mapping placement. + +Get the value of a pool setting. : + +.. prompt:: bash $ + + ceph osd pool get {pool-name} {field} + +Valid fields are: + + * ``pg_num``: The placement group number. + * ``pgp_num``: Effective number of placement groups when calculating placement. + + +Sends a scrub command to OSD ``{osd-num}``. To send the command to all OSDs, use ``*``. : + +.. prompt:: bash $ + + ceph osd scrub {osd-num} + +Sends a repair command to OSD.N. To send the command to all OSDs, use ``*``. : + +.. prompt:: bash $ + + ceph osd repair N + +Runs a simple throughput benchmark against OSD.N, writing ``TOTAL_DATA_BYTES`` +in write requests of ``BYTES_PER_WRITE`` each. By default, the test +writes 1 GB in total in 4-MB increments. +The benchmark is non-destructive and will not overwrite existing live +OSD data, but might temporarily affect the performance of clients +concurrently accessing the OSD. : + +.. prompt:: bash $ + + ceph tell osd.N bench [TOTAL_DATA_BYTES] [BYTES_PER_WRITE] + +To clear an OSD's caches between benchmark runs, use the 'cache drop' command : + +.. 
prompt:: bash $ + + ceph tell osd.N cache drop + +To get the cache statistics of an OSD, use the 'cache status' command : + +.. prompt:: bash $ + + ceph tell osd.N cache status + +MDS Subsystem +============= + +Change configuration parameters on a running mds. : + +.. prompt:: bash $ + + ceph tell mds.{mds-id} config set {setting} {value} + +Example: + +.. prompt:: bash $ + + ceph tell mds.0 config set debug_ms 1 + +Enables debug messages. : + +.. prompt:: bash $ + + ceph mds stat + +Displays the status of all metadata servers. : + +.. prompt:: bash $ + + ceph mds fail 0 + +Marks the active MDS as failed, triggering failover to a standby if present. + +.. todo:: ``ceph mds`` subcommands missing docs: set, dump, getmap, stop, setmap + + +Mon Subsystem +============= + +Show monitor stats: + +.. prompt:: bash $ + + ceph mon stat + +:: + + e2: 3 mons at {a=127.0.0.1:40000/0,b=127.0.0.1:40001/0,c=127.0.0.1:40002/0}, election epoch 6, quorum 0,1,2 a,b,c + + +The ``quorum`` list at the end lists monitor nodes that are part of the current quorum. + +This is also available more directly: + +.. prompt:: bash $ + + ceph quorum_status -f json-pretty + +.. code-block:: javascript + + { + "election_epoch": 6, + "quorum": [ + 0, + 1, + 2 + ], + "quorum_names": [ + "a", + "b", + "c" + ], + "quorum_leader_name": "a", + "monmap": { + "epoch": 2, + "fsid": "ba807e74-b64f-4b72-b43f-597dfe60ddbc", + "modified": "2016-12-26 14:42:09.288066", + "created": "2016-12-26 14:42:03.573585", + "features": { + "persistent": [ + "kraken" + ], + "optional": [] + }, + "mons": [ + { + "rank": 0, + "name": "a", + "addr": "127.0.0.1:40000\/0", + "public_addr": "127.0.0.1:40000\/0" + }, + { + "rank": 1, + "name": "b", + "addr": "127.0.0.1:40001\/0", + "public_addr": "127.0.0.1:40001\/0" + }, + { + "rank": 2, + "name": "c", + "addr": "127.0.0.1:40002\/0", + "public_addr": "127.0.0.1:40002\/0" + } + ] + } + } + + +The above will block until a quorum is reached. 
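Because the ``quorum_status`` output above is JSON, a monitoring script can read the leader and quorum members directly rather than parsing the ``ceph mon stat`` text. The following is a minimal sketch, assuming a trimmed copy of the sample output shown above; ``summarize_quorum`` is a hypothetical helper name, not part of any Ceph library.

```python
import json

# Trimmed copy of the `ceph quorum_status -f json-pretty` sample shown above.
sample = """
{
    "election_epoch": 6,
    "quorum": [0, 1, 2],
    "quorum_names": ["a", "b", "c"],
    "quorum_leader_name": "a"
}
"""


def summarize_quorum(text):
    """Return (leader_name, member_names) from quorum_status JSON text."""
    status = json.loads(text)
    return status["quorum_leader_name"], status["quorum_names"]


leader, members = summarize_quorum(sample)
print(f"leader={leader} members={','.join(members)}")
```

On a live cluster, the JSON text would come from running ``ceph quorum_status -f json`` (for example through Python's ``subprocess`` module) instead of from the embedded sample.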
+ +For a status of just a single monitor: + +.. prompt:: bash $ + + ceph tell mon.[name] mon_status + +where the value of ``[name]`` can be taken from ``ceph quorum_status``. Sample +output:: + + { + "name": "b", + "rank": 1, + "state": "peon", + "election_epoch": 6, + "quorum": [ + 0, + 1, + 2 + ], + "features": { + "required_con": "9025616074522624", + "required_mon": [ + "kraken" + ], + "quorum_con": "1152921504336314367", + "quorum_mon": [ + "kraken" + ] + }, + "outside_quorum": [], + "extra_probe_peers": [], + "sync_provider": [], + "monmap": { + "epoch": 2, + "fsid": "ba807e74-b64f-4b72-b43f-597dfe60ddbc", + "modified": "2016-12-26 14:42:09.288066", + "created": "2016-12-26 14:42:03.573585", + "features": { + "persistent": [ + "kraken" + ], + "optional": [] + }, + "mons": [ + { + "rank": 0, + "name": "a", + "addr": "127.0.0.1:40000\/0", + "public_addr": "127.0.0.1:40000\/0" + }, + { + "rank": 1, + "name": "b", + "addr": "127.0.0.1:40001\/0", + "public_addr": "127.0.0.1:40001\/0" + }, + { + "rank": 2, + "name": "c", + "addr": "127.0.0.1:40002\/0", + "public_addr": "127.0.0.1:40002\/0" + } + ] + } + } + +A dump of the monitor state: + + .. prompt:: bash $ + + ceph mon dump + + :: + + dumped monmap epoch 2 + epoch 2 + fsid ba807e74-b64f-4b72-b43f-597dfe60ddbc + last_changed 2016-12-26 14:42:09.288066 + created 2016-12-26 14:42:03.573585 + 0: 127.0.0.1:40000/0 mon.a + 1: 127.0.0.1:40001/0 mon.b + 2: 127.0.0.1:40002/0 mon.c + diff --git a/doc/rados/operations/crush-map-edits.rst b/doc/rados/operations/crush-map-edits.rst new file mode 100644 index 000000000..18553e47d --- /dev/null +++ b/doc/rados/operations/crush-map-edits.rst @@ -0,0 +1,747 @@ +Manually editing a CRUSH Map +============================ + +.. note:: Manually editing the CRUSH map is an advanced + administrator operation. All CRUSH changes that are + necessary for the overwhelming majority of installations are + possible via the standard ceph CLI and do not require manual + CRUSH map edits. 
If you have identified a use case where + manual edits *are* necessary with recent Ceph releases, consider + contacting the Ceph developers so that future versions of Ceph + can obviate your corner case. + +To edit an existing CRUSH map: + +#. `Get the CRUSH map`_. +#. `Decompile`_ the CRUSH map. +#. Edit at least one of `Devices`_, `Buckets`_ and `Rules`_. +#. `Recompile`_ the CRUSH map. +#. `Set the CRUSH map`_. + +For details on setting the CRUSH map rule for a specific pool, see `Set +Pool Values`_. + +.. _Get the CRUSH map: #getcrushmap +.. _Decompile: #decompilecrushmap +.. _Devices: #crushmapdevices +.. _Buckets: #crushmapbuckets +.. _Rules: #crushmaprules +.. _Recompile: #compilecrushmap +.. _Set the CRUSH map: #setcrushmap +.. _Set Pool Values: ../pools#setpoolvalues + +.. _getcrushmap: + +Get a CRUSH Map +--------------- + +To get the CRUSH map for your cluster, execute the following: + +.. prompt:: bash $ + + ceph osd getcrushmap -o {compiled-crushmap-filename} + +Ceph will output (-o) a compiled CRUSH map to the filename you specified. Since +the CRUSH map is in a compiled form, you must decompile it first before you can +edit it. + +.. _decompilecrushmap: + +Decompile a CRUSH Map +--------------------- + +To decompile a CRUSH map, execute the following: + +.. prompt:: bash $ + + crushtool -d {compiled-crushmap-filename} -o {decompiled-crushmap-filename} + +.. _compilecrushmap: + +Recompile a CRUSH Map +--------------------- + +To compile a CRUSH map, execute the following: + +.. prompt:: bash $ + + crushtool -c {decompiled-crushmap-filename} -o {compiled-crushmap-filename} + +.. _setcrushmap: + +Set the CRUSH Map +----------------- + +To set the CRUSH map for your cluster, execute the following: + +.. prompt:: bash $ + + ceph osd setcrushmap -i {compiled-crushmap-filename} + +Ceph will load (-i) a compiled CRUSH map from the filename you specified. + +Sections +-------- + +There are six main sections to a CRUSH Map. + +#. 
**tunables:** The preamble at the top of the map describes any *tunables* + that differ from the historical / legacy CRUSH behavior. These + correct for old bugs, optimizations, or other changes that have + been made over the years to improve CRUSH's behavior. + +#. **devices:** Devices are individual OSDs that store data. + +#. **types**: Bucket ``types`` define the types of buckets used in + your CRUSH hierarchy. Buckets consist of a hierarchical aggregation + of storage locations (e.g., rows, racks, chassis, hosts, etc.) and + their assigned weights. + +#. **buckets:** Once you define bucket types, you must define each node + in the hierarchy, its type, and which devices or other nodes it + contains. + +#. **rules:** Rules define policy about how data is distributed across + devices in the hierarchy. + +#. **choose_args:** Choose_args are alternative weights associated with + the hierarchy that have been adjusted to optimize data placement. A single + choose_args map can be used for the entire cluster, or one can be + created for each individual pool. + + +.. _crushmapdevices: + +CRUSH Map Devices +----------------- + +Devices are individual OSDs that store data. Usually one is defined here for each +OSD daemon in your +cluster. Devices are identified by an ``id`` (a non-negative integer) and +a ``name``, normally ``osd.N`` where ``N`` is the device id. + +.. _crush-map-device-class: + +Devices may also have a *device class* associated with them (e.g., +``hdd`` or ``ssd``), allowing them to be conveniently targeted by a +crush rule. + +.. prompt:: bash # + + devices + +:: + + device {num} {osd.name} [class {class}] + +For example: + +.. prompt:: bash # + + devices + +:: + + device 0 osd.0 class ssd + device 1 osd.1 class hdd + device 2 osd.2 + device 3 osd.3 + +In most cases, each device maps to a single ``ceph-osd`` daemon. 
This +is normally a single storage device, a pair of devices (for example, +one for data and one for a journal or metadata), or in some cases a +small RAID device. + +CRUSH Map Bucket Types +---------------------- + +The second list in the CRUSH map defines 'bucket' types. Buckets facilitate +a hierarchy of nodes and leaves. Node (or non-leaf) buckets typically represent +physical locations in a hierarchy. Nodes aggregate other nodes or leaves. +Leaf buckets represent ``ceph-osd`` daemons and their corresponding storage +media. + +.. tip:: The term "bucket" used in the context of CRUSH means a node in + the hierarchy, i.e. a location or a piece of physical hardware. It + is a different concept from the term "bucket" when used in the + context of RADOS Gateway APIs. + +To add a bucket type to the CRUSH map, create a new line under your list of +bucket types. Enter ``type`` followed by a unique numeric ID and a bucket name. +By convention, there is one leaf bucket and it is ``type 0``; however, you may +give it any name you like (e.g., osd, disk, drive, storage):: + + # types + type {num} {bucket-name} + +For example:: + + # types + type 0 osd + type 1 host + type 2 chassis + type 3 rack + type 4 row + type 5 pdu + type 6 pod + type 7 room + type 8 datacenter + type 9 zone + type 10 region + type 11 root + + + +.. _crushmapbuckets: + +CRUSH Map Bucket Hierarchy +-------------------------- + +The CRUSH algorithm distributes data objects among storage devices according +to a per-device weight value, approximating a uniform probability distribution. +CRUSH distributes objects and their replicas according to the hierarchical +cluster map you define. Your CRUSH map represents the available storage +devices and the logical elements that contain them. + +To map placement groups to OSDs across failure domains, a CRUSH map defines a +hierarchical list of bucket types (i.e., under ``#types`` in the generated CRUSH +map). 
The purpose of creating a bucket hierarchy is to segregate the +leaf nodes by their failure domains, such as hosts, chassis, racks, power +distribution units, pods, rows, rooms, and data centers. With the exception of +the leaf nodes representing OSDs, the rest of the hierarchy is arbitrary, and +you may define it according to your own needs. + +We recommend adapting your CRUSH map to your firm's hardware naming conventions +and using instance names that reflect the physical hardware. Your naming +practice can make it easier to administer the cluster and troubleshoot +problems when an OSD and/or other hardware malfunctions and the administrator +need access to physical hardware. + +In the following example, the bucket hierarchy has a leaf bucket named ``osd``, +and two node buckets named ``host`` and ``rack`` respectively. + +.. ditaa:: + +-----------+ + | {o}rack | + | Bucket | + +-----+-----+ + | + +---------------+---------------+ + | | + +-----+-----+ +-----+-----+ + | {o}host | | {o}host | + | Bucket | | Bucket | + +-----+-----+ +-----+-----+ + | | + +-------+-------+ +-------+-------+ + | | | | + +-----+-----+ +-----+-----+ +-----+-----+ +-----+-----+ + | osd | | osd | | osd | | osd | + | Bucket | | Bucket | | Bucket | | Bucket | + +-----------+ +-----------+ +-----------+ +-----------+ + +.. note:: The higher numbered ``rack`` bucket type aggregates the lower + numbered ``host`` bucket type. + +Since leaf nodes reflect storage devices declared under the ``#devices`` list +at the beginning of the CRUSH map, you do not need to declare them as bucket +instances. The second lowest bucket type in your hierarchy usually aggregates +the devices (i.e., it's usually the computer containing the storage media, and +uses whatever term you prefer to describe it, such as "node", "computer", +"server," "host", "machine", etc.). In high density environments, it is +increasingly common to see multiple hosts/nodes per chassis. 
You should account +for chassis failure too--e.g., the need to pull a chassis if a node fails may +result in bringing down numerous hosts/nodes and their OSDs. + +When declaring a bucket instance, you must specify its type, give it a unique +name (string), assign it a unique ID expressed as a negative integer (optional), +specify a weight relative to the total capacity/capability of its item(s), +specify the bucket algorithm (usually ``straw2``), and the hash (usually ``0``, +reflecting hash algorithm ``rjenkins1``). A bucket may have one or more items. +The items may consist of node buckets or leaves. Items may have a weight that +reflects the relative weight of the item. + +You may declare a node bucket with the following syntax:: + + [bucket-type] [bucket-name] { + id [a unique negative numeric ID] + weight [the relative capacity/capability of the item(s)] + alg [the bucket type: uniform | list | tree | straw | straw2 ] + hash [the hash type: 0 by default] + item [item-name] weight [weight] + } + +For example, using the diagram above, we would define two host buckets +and one rack bucket. The OSDs are declared as items within the host buckets:: + + host node1 { + id -1 + alg straw2 + hash 0 + item osd.0 weight 1.00 + item osd.1 weight 1.00 + } + + host node2 { + id -2 + alg straw2 + hash 0 + item osd.2 weight 1.00 + item osd.3 weight 1.00 + } + + rack rack1 { + id -3 + alg straw2 + hash 0 + item node1 weight 2.00 + item node2 weight 2.00 + } + +.. note:: In the foregoing example, note that the rack bucket does not contain + any OSDs. Rather it contains lower level host buckets, and includes the + sum total of their weight in the item entry. + +.. topic:: Bucket Types + + Ceph supports five bucket types, each representing a tradeoff between + performance and reorganization efficiency. If you are unsure of which bucket + type to use, we recommend using a ``straw2`` bucket. 
For a detailed + discussion of bucket types, refer to + `CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data`_, + and more specifically to **Section 3.4**. The bucket types are: + + #. **uniform**: Uniform buckets aggregate devices with **exactly** the same + weight. For example, when firms commission or decommission hardware, they + typically do so with many machines that have exactly the same physical + configuration (e.g., bulk purchases). When storage devices have exactly + the same weight, you may use the ``uniform`` bucket type, which allows + CRUSH to map replicas into uniform buckets in constant time. With + non-uniform weights, you should use another bucket algorithm. + + #. **list**: List buckets aggregate their content as linked lists. Based on + the :abbr:`RUSH (Replication Under Scalable Hashing)` :sub:`P` algorithm, + a list is a natural and intuitive choice for an **expanding cluster**: + either an object is relocated to the newest device with some appropriate + probability, or it remains on the older devices as before. The result is + optimal data migration when items are added to the bucket. Items removed + from the middle or tail of the list, however, can result in a significant + amount of unnecessary movement, making list buckets most suitable for + circumstances in which they **never (or very rarely) shrink**. + + #. **tree**: Tree buckets use a binary search tree. They are more efficient + than list buckets when a bucket contains a larger set of items. Based on + the :abbr:`RUSH (Replication Under Scalable Hashing)` :sub:`R` algorithm, + tree buckets reduce the placement time to O(log :sub:`n`), making them + suitable for managing much larger sets of devices or nested buckets. + + #. **straw**: List and Tree buckets use a divide and conquer strategy + in a way that either gives certain items precedence (e.g., those + at the beginning of a list) or obviates the need to consider entire + subtrees of items at all. 
That improves the performance of the replica + placement process, but can also introduce suboptimal reorganization + behavior when the contents of a bucket change due to an addition, removal, + or re-weighting of an item. The straw bucket type allows all items to + fairly “compete” against each other for replica placement through a + process analogous to a draw of straws. + + #. **straw2**: Straw2 buckets improve on Straw by correctly avoiding any data + movement between items when neighbor weights change. + + For example, when the weight of item A changes (including adding it anew or + removing it completely), there will be data movement only to or from item A. + +.. topic:: Hash + + Each bucket uses a hash algorithm. Currently, Ceph supports ``rjenkins1``. + Enter ``0`` as your hash setting to select ``rjenkins1``. + + +.. _weightingbucketitems: + +.. topic:: Weighting Bucket Items + + Ceph expresses bucket weights as doubles, which allows for fine + weighting. A weight is the relative difference between device capacities. We + recommend using ``1.00`` as the relative weight for a 1TB storage device. + In such a scenario, a weight of ``0.5`` would represent approximately 500GB, + and a weight of ``3.00`` would represent approximately 3TB. Higher level + buckets have a weight that is the sum total of the leaf items aggregated by + the bucket. + + A bucket item weight is one dimensional, but you may also calculate your + item weights to reflect the performance of the storage drive. For example, + if you have many 1TB drives where some have relatively low data transfer + rate and the others have a relatively high data transfer rate, you may + weight them differently, even though they have the same capacity (e.g., + a weight of 0.80 for the first set of drives with lower total throughput, + and 1.20 for the second set of drives with higher total throughput). + + +..
_crushmaprules: + +CRUSH Map Rules +--------------- + +CRUSH maps support the notion of 'CRUSH rules', which are the rules that +determine data placement for a pool. The default CRUSH map has a rule for each +pool. For large clusters, you will likely create many pools where each pool may +have its own non-default CRUSH rule. + +.. note:: In most cases, you will not need to modify the default rule. When + you create a new pool, by default the rule will be set to ``0``. + + +CRUSH rules define placement and replication strategies or distribution policies +that allow you to specify exactly how CRUSH places object replicas. For +example, you might create a rule selecting a pair of targets for 2-way +mirroring, another rule for selecting three targets in two different data +centers for 3-way mirroring, and yet another rule for erasure coding over six +storage devices. For a detailed discussion of CRUSH rules, refer to +`CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data`_, +and more specifically to **Section 3.2**. + +A rule takes the following form:: + + rule <rulename> { + + id [a unique whole numeric ID] + type [ replicated | erasure ] + min_size <min-size> + max_size <max-size> + step take <bucket-name> [class <device-class>] + step [choose|chooseleaf] [firstn|indep] <N> type <bucket-type> + step emit + } + + +``id`` + +:Description: A unique whole number for identifying the rule. + +:Purpose: A component of the rule mask. +:Type: Integer +:Required: Yes +:Default: 0 + + +``type`` + +:Description: Describes a rule for either a storage drive (replicated) + or a RAID. + +:Purpose: A component of the rule mask. +:Type: String +:Required: Yes +:Default: ``replicated`` +:Valid Values: Currently only ``replicated`` and ``erasure`` + +``min_size`` + +:Description: If a pool makes fewer replicas than this number, CRUSH will + **NOT** select this rule. + +:Type: Integer +:Purpose: A component of the rule mask. 
+:Required: Yes +:Default: ``1`` + +``max_size`` + +:Description: If a pool makes more replicas than this number, CRUSH will + **NOT** select this rule. + +:Type: Integer +:Purpose: A component of the rule mask. +:Required: Yes +:Default: 10 + + +``step take <bucket-name> [class <device-class>]`` + +:Description: Takes a bucket name, and begins iterating down the tree. + If the ``device-class`` is specified, it must match + a class previously used when defining a device. All + devices that do not belong to the class are excluded. +:Purpose: A component of the rule. +:Required: Yes +:Example: ``step take data`` + + +``step choose firstn {num} type {bucket-type}`` + +:Description: Selects the number of buckets of the given type from within the + current bucket. The number is usually the number of replicas in + the pool (i.e., pool size). + + - If ``{num} == 0``, choose ``pool-num-replicas`` buckets (all available). + - If ``{num} > 0 && < pool-num-replicas``, choose that many buckets. + - If ``{num} < 0``, choose ``pool-num-replicas - |{num}|`` buckets (for example, ``-1`` with a pool size of 3 chooses 2). + +:Purpose: A component of the rule. +:Prerequisite: Follows ``step take`` or ``step choose``. +:Example: ``step choose firstn 1 type row`` + + +``step chooseleaf firstn {num} type {bucket-type}`` + +:Description: Selects a set of buckets of ``{bucket-type}`` and chooses a leaf + node (that is, an OSD) from the subtree of each bucket in the set of buckets. + The number of buckets in the set is usually the number of replicas in + the pool (i.e., pool size). + + - If ``{num} == 0``, choose ``pool-num-replicas`` buckets (all available). + - If ``{num} > 0 && < pool-num-replicas``, choose that many buckets. + - If ``{num} < 0``, choose ``pool-num-replicas - |{num}|`` buckets (for example, ``-1`` with a pool size of 3 chooses 2). + +:Purpose: A component of the rule. Usage removes the need to select a device using two steps. +:Prerequisite: Follows ``step take`` or ``step choose``.
+:Example: ``step chooseleaf firstn 0 type row`` + + +``step emit`` + +:Description: Outputs the current value and empties the stack. Typically used + at the end of a rule, but may also be used to pick from different + trees in the same rule. + +:Purpose: A component of the rule. +:Prerequisite: Follows ``step choose``. +:Example: ``step emit`` + +.. important:: A given CRUSH rule may be assigned to multiple pools, but it + is not possible for a single pool to have multiple CRUSH rules. + +``firstn`` versus ``indep`` + +:Description: Controls the replacement strategy CRUSH uses when items (OSDs) + are marked down in the CRUSH map. If this rule is to be used with + replicated pools it should be ``firstn`` and if it's for + erasure-coded pools it should be ``indep``. + + The reason has to do with how they behave when a + previously-selected device fails. Let's say you have a PG stored + on OSDs 1, 2, 3, 4, 5. Then 3 goes down. + + With the "firstn" mode, CRUSH simply adjusts its calculation to + select 1 and 2, then selects 3 but discovers it's down, so it + retries and selects 4 and 5, and then goes on to select a new + OSD 6. So the final CRUSH mapping change is + 1, 2, 3, 4, 5 -> 1, 2, 4, 5, 6. + + But if you're storing an EC pool, that means you just changed the + data mapped to OSDs 4, 5, and 6! So the "indep" mode attempts to + not do that. You can instead expect it, when it selects the failed + OSD 3, to try again and pick out 6, for a final transformation of: + 1, 2, 3, 4, 5 -> 1, 2, 6, 4, 5 + +.. _crush-reclassify: + +Migrating from a legacy SSD rule to device classes +-------------------------------------------------- + +It used to be necessary to manually edit your CRUSH map and maintain a +parallel hierarchy for each specialized device type (e.g., SSD) in order to +write rules that apply to those devices. Since the Luminous release, +the *device class* feature has enabled this transparently. 
+ +However, migrating from an existing, manually customized per-device map to +the new device class rules in the trivial way will cause all data in the +system to be reshuffled. + +The ``crushtool`` has a few commands that can transform a legacy rule +and hierarchy so that you can start using the new class-based rules. +There are three types of transformations possible: + +#. ``--reclassify-root <root-name> <device-class>`` + + This will take everything in the hierarchy beneath root-name and + adjust any rules that reference that root via a ``take + <root-name>`` to instead ``take <root-name> class <device-class>``. + It renumbers the buckets in such a way that the old IDs are instead + used for the specified class's "shadow tree" so that no data + movement takes place. + + For example, imagine you have an existing rule like:: + + rule replicated_ruleset { + id 0 + type replicated + min_size 1 + max_size 10 + step take default + step chooseleaf firstn 0 type rack + step emit + } + + If you reclassify the root `default` as class `hdd`, the rule will + become:: + + rule replicated_ruleset { + id 0 + type replicated + min_size 1 + max_size 10 + step take default class hdd + step chooseleaf firstn 0 type rack + step emit + } + +#. ``--set-subtree-class <bucket-name> <device-class>`` + + This will mark every device in the subtree rooted at *bucket-name* + with the specified device class. + + This is normally used in conjunction with the ``--reclassify-root`` + option to ensure that all devices in that root are labeled with the + correct class. In some situations, however, some of those devices + (correctly) have a different class and we do not want to relabel + them. In such cases, one can exclude the ``--set-subtree-class`` + option. 
This means that the remapping process will not be perfect, + since the previous rule distributed across devices of multiple + classes but the adjusted rules will only map to devices of the + specified *device-class*, but that often is an accepted level of + data movement when the number of outlier devices is small. + +#. ``--reclassify-bucket <match-pattern> <device-class> <default-parent>`` + + This will allow you to merge a parallel type-specific hierarchy with the normal hierarchy. For example, many users have maps like:: + + host node1 { + id -2 # do not change unnecessarily + # weight 109.152 + alg straw2 + hash 0 # rjenkins1 + item osd.0 weight 9.096 + item osd.1 weight 9.096 + item osd.2 weight 9.096 + item osd.3 weight 9.096 + item osd.4 weight 9.096 + item osd.5 weight 9.096 + ... + } + + host node1-ssd { + id -10 # do not change unnecessarily + # weight 2.000 + alg straw2 + hash 0 # rjenkins1 + item osd.80 weight 2.000 + ... + } + + root default { + id -1 # do not change unnecessarily + alg straw2 + hash 0 # rjenkins1 + item node1 weight 110.967 + ... + } + + root ssd { + id -18 # do not change unnecessarily + # weight 16.000 + alg straw2 + hash 0 # rjenkins1 + item node1-ssd weight 2.000 + ... + } + + This function will reclassify each bucket that matches a + pattern. The pattern can look like ``%suffix`` or ``prefix%``. + For example, in the above example, we would use the pattern + ``%-ssd``. For each matched bucket, the remaining portion of the + name (that matches the ``%`` wildcard) specifies the *base bucket*. + All devices in the matched bucket are labeled with the specified + device class and then moved to the base bucket. If the base bucket + does not exist (e.g., ``node12-ssd`` exists but ``node12`` does + not), then it is created and linked underneath the specified + *default parent* bucket. In each case, we are careful to preserve + the old bucket IDs for the new shadow buckets to prevent data + movement. 
Any rules with ``take`` steps referencing the old +buckets are adjusted. + +#. ``--reclassify-bucket <bucket-name> <device-class> <base-bucket>`` + + The same command can also be used without a wildcard to map a + single bucket. For example, in the previous example, we want the + ``ssd`` bucket to be mapped to the ``default`` bucket. + +The final command to convert the map comprising the above fragments would be something like: + +.. prompt:: bash $ + + ceph osd getcrushmap -o original + crushtool -i original --reclassify \ + --set-subtree-class default hdd \ + --reclassify-root default hdd \ + --reclassify-bucket %-ssd ssd default \ + --reclassify-bucket ssd ssd default \ + -o adjusted + +In order to ensure that the conversion is correct, there is a ``--compare`` command that will test a large sample of inputs against the CRUSH map and check that the same result is output. These inputs are controlled by the same options that apply to the ``--test`` command. For the above example: + +.. prompt:: bash $ + + crushtool -i original --compare adjusted + +:: + + rule 0 had 0/10240 mismatched mappings (0) + rule 1 had 0/10240 mismatched mappings (0) + maps appear equivalent + +If there were differences, the ratio of remapped inputs would be reported in +the parentheses. + +When you are satisfied with the adjusted map, apply it to the cluster with a command of the form: + +.. prompt:: bash $ + + ceph osd setcrushmap -i adjusted + +Tuning CRUSH, the hard way +-------------------------- + +If you can ensure that all clients are running recent code, you can +adjust the tunables by extracting the CRUSH map, modifying the values, +and reinjecting it into the cluster. + +* Extract the latest CRUSH map: + + .. prompt:: bash $ + + ceph osd getcrushmap -o /tmp/crush + +* Adjust tunables. These values appear to offer the best behavior + for both large and small clusters we tested with.
You will need to + additionally specify the ``--enable-unsafe-tunables`` argument to + ``crushtool`` for this to work. Please use this option with + extreme care: + + .. prompt:: bash $ + + crushtool -i /tmp/crush --set-choose-local-tries 0 --set-choose-local-fallback-tries 0 --set-choose-total-tries 50 -o /tmp/crush.new + +* Reinject modified map: + + .. prompt:: bash $ + + ceph osd setcrushmap -i /tmp/crush.new + +Legacy values +------------- + +For reference, the legacy values for the CRUSH tunables can be set +with: + +.. prompt:: bash $ + + crushtool -i /tmp/crush --set-choose-local-tries 2 --set-choose-local-fallback-tries 5 --set-choose-total-tries 19 --set-chooseleaf-descend-once 0 --set-chooseleaf-vary-r 0 -o /tmp/crush.legacy + +Again, the special ``--enable-unsafe-tunables`` option is required. +Further, as noted above, be careful running old versions of the +``ceph-osd`` daemon after reverting to legacy values as the feature +bit is not perfectly enforced. + +.. _CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data: https://ceph.io/assets/pdfs/weil-crush-sc06.pdf diff --git a/doc/rados/operations/crush-map.rst b/doc/rados/operations/crush-map.rst new file mode 100644 index 000000000..f22ebb24e --- /dev/null +++ b/doc/rados/operations/crush-map.rst @@ -0,0 +1,1126 @@ +============ + CRUSH Maps +============ + +The :abbr:`CRUSH (Controlled Replication Under Scalable Hashing)` algorithm +determines how to store and retrieve data by computing storage locations. +CRUSH empowers Ceph clients to communicate with OSDs directly rather than +through a centralized server or broker. With an algorithmically determined +method of storing and retrieving data, Ceph avoids a single point of failure, a +performance bottleneck, and a physical limit to its scalability.
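The idea of computing a location instead of looking it up can be illustrated with a small sketch. This is not Ceph's implementation: the function name ``straw2_choose`` and the item names are hypothetical, and SHA-256 stands in for the ``rjenkins1`` hash. It assumes a straw2-style draw in which each item draws ``ln(u) / weight`` from a per-object hash and the largest draw wins.

```python
import hashlib
import math


def straw2_choose(obj_name, items):
    """Pick one item from {name: weight} with a straw2-style draw.

    Illustrative only: SHA-256 replaces Ceph's rjenkins1 hash, and the
    names used here are hypothetical.
    """
    best_name, best_draw = None, -math.inf
    for name, weight in sorted(items.items()):
        if weight <= 0:
            continue  # zero-weight items never receive data
        digest = hashlib.sha256(f"{obj_name}/{name}".encode()).digest()
        # Map the first 8 bytes of the hash to a value in (0, 1].
        u = (int.from_bytes(digest[:8], "big") + 1) / 2**64
        # ln(u) <= 0, so a larger weight pulls the draw toward 0 (a win).
        draw = math.log(u) / weight
        if draw > best_draw:
            best_name, best_draw = name, draw
    return best_name


if __name__ == "__main__":
    hosts = {"node1": 1.0, "node2": 1.0, "node3": 2.0}
    for obj in ("obj-a", "obj-b", "obj-c"):
        print(obj, "->", straw2_choose(obj, hosts))
```

Because every client evaluates the same function over the same map, any client can locate an object without consulting a central directory. Each item's draw depends only on its own weight, so in this sketch reweighting one item moves data only to or from that item.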
+ +CRUSH uses a map of your cluster (the CRUSH map) to pseudo-randomly +map data to OSDs, distributing it across the cluster according to configured +replication policy and failure domain. For a detailed discussion of CRUSH, see +`CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data`_ + +CRUSH maps contain a list of :abbr:`OSDs (Object Storage Devices)`, a hierarchy +of 'buckets' for aggregating devices and buckets, and +rules that govern how CRUSH replicates data within the cluster's pools. By +reflecting the underlying physical organization of the installation, CRUSH can +model (and thereby address) the potential for correlated device failures. +Typical factors include chassis, racks, physical proximity, a shared power +source, and shared networking. By encoding this information into the cluster +map, CRUSH placement +policies distribute object replicas across failure domains while +maintaining the desired distribution. For example, to address the +possibility of concurrent failures, it may be desirable to ensure that data +replicas are on devices using different shelves, racks, power supplies, +controllers, and/or physical locations. + +When you deploy OSDs they are automatically added to the CRUSH map under a +``host`` bucket named for the node on which they run. This, +combined with the configured CRUSH failure domain, ensures that replicas or +erasure code shards are distributed across hosts and that a single host or other +failure will not affect availability. For larger clusters, administrators must +carefully consider their choice of failure domain. Separating replicas across racks, +for example, is typical for mid- to large-sized clusters. + + +CRUSH Location +============== + +The location of an OSD within the CRUSH map's hierarchy is +referred to as a ``CRUSH location``. This location specifier takes the +form of a list of key and value pairs. 
For +example, if an OSD is in a particular row, rack, chassis and host, and +is part of the 'default' CRUSH root (which is the case for most +clusters), its CRUSH location could be described as:: + + root=default row=a rack=a2 chassis=a2a host=a2a1 + +Note: + +#. Note that the order of the keys does not matter. +#. The key name (left of ``=``) must be a valid CRUSH ``type``. By default + these include ``root``, ``datacenter``, ``room``, ``row``, ``pod``, ``pdu``, + ``rack``, ``chassis`` and ``host``. + These defined types suffice for almost all clusters, but can be customized + by modifying the CRUSH map. +#. Not all keys need to be specified. For example, by default, Ceph + automatically sets an ``OSD``'s location to be + ``root=default host=HOSTNAME`` (based on the output from ``hostname -s``). + +The CRUSH location for an OSD can be defined by adding the ``crush location`` +option in ``ceph.conf``. Each time the OSD starts, +it verifies it is in the correct location in the CRUSH map and, if it is not, +it moves itself. To disable this automatic CRUSH map management, add the +following to your configuration file in the ``[osd]`` section:: + + osd crush update on start = false + +Note that in most cases you will not need to manually configure this. + + +Custom location hooks +--------------------- + +A customized location hook can be used to generate a more complete +CRUSH location on startup. The CRUSH location is based on, in order +of preference: + +#. A ``crush location`` option in ``ceph.conf`` +#. 
A default of ``root=default host=HOSTNAME`` where the hostname is + derived from the ``hostname -s`` command + +A script can be written to provide additional +location fields (for example, ``rack`` or ``datacenter``) and the +hook enabled via the config option:: + + crush location hook = /path/to/customized-ceph-crush-location + +This hook is passed several arguments (below) and should output a single line +to ``stdout`` with the CRUSH location description.:: + + --cluster CLUSTER --id ID --type TYPE + +where the cluster name is typically ``ceph``, the ``id`` is the daemon +identifier (e.g., the OSD number or daemon identifier), and the daemon +type is ``osd``, ``mds``, etc. + +For example, a simple hook that additionally specifies a rack location +based on a value in the file ``/etc/rack`` might be:: + + #!/bin/sh + echo "host=$(hostname -s) rack=$(cat /etc/rack) root=default" + + +CRUSH structure +=============== + +The CRUSH map consists of a hierarchy that describes +the physical topology of the cluster and a set of rules defining +data placement policy. The hierarchy has +devices (OSDs) at the leaves, and internal nodes +corresponding to other physical features or groupings: hosts, racks, +rows, datacenters, and so on. The rules describe how replicas are +placed in terms of that hierarchy (e.g., 'three replicas in different +racks'). + +Devices +------- + +Devices are individual OSDs that store data, usually one for each storage drive. +Devices are identified by an ``id`` +(a non-negative integer) and a ``name``, normally ``osd.N`` where ``N`` is the device id. + +Since the Luminous release, devices may also have a *device class* assigned (e.g., +``hdd`` or ``ssd`` or ``nvme``), allowing them to be conveniently targeted by +CRUSH rules. This is especially useful when mixing device types within hosts. + +.. 
_crush_map_default_types: + +Types and Buckets +----------------- + +A bucket is the CRUSH term for internal nodes in the hierarchy: hosts, +racks, rows, etc. The CRUSH map defines a series of *types* that are +used to describe these nodes. Default types include: + +- ``osd`` (or ``device``) +- ``host`` +- ``chassis`` +- ``rack`` +- ``row`` +- ``pdu`` +- ``pod`` +- ``room`` +- ``datacenter`` +- ``zone`` +- ``region`` +- ``root`` + +Most clusters use only a handful of these types, and others +can be defined as needed. + +The hierarchy is built with devices (normally type ``osd``) at the +leaves, interior nodes with non-device types, and a root node of type +``root``. For example, + +.. ditaa:: + + +-----------------+ + |{o}root default | + +--------+--------+ + | + +---------------+---------------+ + | | + +------+------+ +------+------+ + |{o}host foo | |{o}host bar | + +------+------+ +------+------+ + | | + +-------+-------+ +-------+-------+ + | | | | + +-----+-----+ +-----+-----+ +-----+-----+ +-----+-----+ + | osd.0 | | osd.1 | | osd.2 | | osd.3 | + +-----------+ +-----------+ +-----------+ +-----------+ + +Each node (device or bucket) in the hierarchy has a *weight* +that indicates the relative proportion of the total +data that device or hierarchy subtree should store. Weights are set +at the leaves, indicating the size of the device, and automatically +sum up the tree, such that the weight of the ``root`` node +will be the total of all devices contained beneath it. Normally +weights are in units of terabytes (TB). + +You can get a simple view of the CRUSH hierarchy for your cluster, +including weights, with: + +.. prompt:: bash $ + + ceph osd tree + +Rules +----- + +CRUSH Rules define policy about how data is distributed across the devices +in the hierarchy. They define placement and replication strategies or +distribution policies that allow you to specify exactly how CRUSH +places data replicas.
For example, you might create a rule selecting +a pair of targets for two-way mirroring, another rule for selecting +three targets in two different data centers for three-way mirroring, and +yet another rule for erasure coding (EC) across six storage devices. For a +detailed discussion of CRUSH rules, refer to `CRUSH - Controlled, +Scalable, Decentralized Placement of Replicated Data`_, and more +specifically to **Section 3.2**. + +CRUSH rules can be created via the CLI by +specifying the *pool type* they will be used for (replicated or +erasure coded), the *failure domain*, and optionally a *device class*. +In rare cases rules must be written by hand by manually editing the +CRUSH map. + +You can see what rules are defined for your cluster with: + +.. prompt:: bash $ + + ceph osd crush rule ls + +You can view the contents of the rules with: + +.. prompt:: bash $ + + ceph osd crush rule dump + +Device classes +-------------- + +Each device can optionally have a *class* assigned. By +default, OSDs automatically set their class at startup to +`hdd`, `ssd`, or `nvme` based on the type of device they are backed +by. + +The device class for one or more OSDs can be explicitly set with: + +.. prompt:: bash $ + + ceph osd crush set-device-class <class> <osd-name> [...] + +Once a device class is set, it cannot be changed to another class +until the old class is unset with: + +.. prompt:: bash $ + + ceph osd crush rm-device-class <osd-name> [...] + +This allows administrators to set device classes without the class +being changed on OSD restart or by some other script. + +A placement rule that targets a specific device class can be created with: + +.. prompt:: bash $ + + ceph osd crush rule create-replicated <rule-name> <root> <failure-domain> <class> + +A pool can then be changed to use the new rule with: + +.. 
prompt:: bash $ + + ceph osd pool set <pool-name> crush_rule <rule-name> + +Device classes are implemented by creating a "shadow" CRUSH hierarchy +for each device class in use that contains only devices of that class. +CRUSH rules can then distribute data over the shadow hierarchy. +This approach is fully backward compatible with +old Ceph clients. You can view the CRUSH hierarchy with shadow items +with: + +.. prompt:: bash $ + + ceph osd crush tree --show-shadow + +For older clusters created before Luminous that relied on manually +crafted CRUSH maps to maintain per-device-type hierarchies, there is a +*reclassify* tool available to help transition to device classes +without triggering data movement (see :ref:`crush-reclassify`). + + +Weight sets +----------- + +A *weight set* is an alternative set of weights to use when +calculating data placement. The normal weights associated with each +device in the CRUSH map are set based on the device size and indicate +how much data we *should* be storing where. However, because CRUSH is +a "probabilistic" pseudorandom placement process, there is always some +variation from this ideal distribution, in the same way that rolling a +die sixty times will not result in rolling exactly 10 ones and 10 +sixes. Weight sets allow the cluster to perform numerical optimization +based on the specifics of your cluster (hierarchy, pools, etc.) to achieve +a balanced distribution. + +There are two types of weight sets supported: + + #. A **compat** weight set is a single alternative set of weights for + each device and node in the cluster. This is not well-suited for + correcting for all anomalies (for example, placement groups for + different pools may be different sizes and have different load + levels, but will be mostly treated the same by the balancer).
+ However, compat weight sets have the huge advantage that they are + *backward compatible* with previous versions of Ceph, which means + that even though weight sets were first introduced in Luminous + v12.2.z, older clients (e.g., firefly) can still connect to the + cluster when a compat weight set is being used to balance data. + #. A **per-pool** weight set is more flexible in that it allows + placement to be optimized for each data pool. Additionally, + weights can be adjusted for each position of placement, allowing + the optimizer to correct for a subtle skew of data toward devices + with small weights relative to their peers (an effect that is + usually only apparent in very large clusters but which can cause + balancing problems). + +When weight sets are in use, the weights associated with each node in +the hierarchy are visible as a separate column (labeled either +``(compat)`` or the pool name) from the command: + +.. prompt:: bash $ + + ceph osd tree + +When both *compat* and *per-pool* weight sets are in use, data +placement for a particular pool will use its own per-pool weight set +if present. If not, it will use the compat weight set if present. If +neither is present, it will use the normal CRUSH weights. + +Although weight sets can be set up and manipulated by hand, it is +recommended that the ``ceph-mgr`` *balancer* module be enabled to do so +automatically when running Luminous or later releases. + + +Modifying the CRUSH map +======================= + +.. _addosd: + +Add/Move an OSD +--------------- + +.. note:: OSDs are normally automatically added to the CRUSH map when + the OSD is created. This command is rarely needed. + +To add or move an OSD in the CRUSH map of a running cluster: + +.. prompt:: bash $ + + ceph osd crush set {name} {weight} root={root} [{bucket-type}={bucket-name} ...] + +Where: + +``name`` + +:Description: The full name of the OSD.
+:Type: String +:Required: Yes +:Example: ``osd.0`` + + +``weight`` + +:Description: The CRUSH weight for the OSD, normally its size measured in terabytes (TB). +:Type: Double +:Required: Yes +:Example: ``2.0`` + + +``root`` + +:Description: The root node of the tree in which the OSD resides (normally ``default``) +:Type: Key/value pair. +:Required: Yes +:Example: ``root=default`` + + +``bucket-type`` + +:Description: You may specify the OSD's location in the CRUSH hierarchy. +:Type: Key/value pairs. +:Required: No +:Example: ``datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1`` + + +The following example adds ``osd.0`` to the hierarchy, or moves the +OSD from a previous location: + +.. prompt:: bash $ + + ceph osd crush set osd.0 1.0 root=default datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1 + + +Adjust OSD weight +----------------- + +.. note:: Normally OSDs automatically add themselves to the CRUSH map + with the correct weight when they are created. This command + is rarely needed. + +To adjust an OSD's CRUSH weight in the CRUSH map of a running cluster, execute +the following: + +.. prompt:: bash $ + + ceph osd crush reweight {name} {weight} + +Where: + +``name`` + +:Description: The full name of the OSD. +:Type: String +:Required: Yes +:Example: ``osd.0`` + + +``weight`` + +:Description: The CRUSH weight for the OSD. +:Type: Double +:Required: Yes +:Example: ``2.0`` + + +.. _removeosd: + +Remove an OSD +------------- + +.. note:: OSDs are normally removed from the CRUSH map as part of the + ``ceph osd purge`` command. This command is rarely needed. + +To remove an OSD from the CRUSH map of a running cluster, execute the +following: + +.. prompt:: bash $ + + ceph osd crush remove {name} + +Where: + +``name`` + +:Description: The full name of the OSD. +:Type: String +:Required: Yes +:Example: ``osd.0`` + + +Add a Bucket +------------ + +..
note:: Buckets are implicitly created when an OSD is added + that specifies a ``{bucket-type}={bucket-name}`` as part of its + location, if a bucket with that name does not already exist. This + command is typically used when manually adjusting the structure of the + hierarchy after OSDs have been created. One use is to move a + series of hosts underneath a new rack-level bucket; another is to + add new ``host`` buckets (OSD nodes) to a dummy ``root`` so that they don't + receive data until you're ready, at which time you would move them to the + ``default`` or other root as described below. + +To add a bucket in the CRUSH map of a running cluster, execute the +``ceph osd crush add-bucket`` command: + +.. prompt:: bash $ + + ceph osd crush add-bucket {bucket-name} {bucket-type} + +Where: + +``bucket-name`` + +:Description: The full name of the bucket. +:Type: String +:Required: Yes +:Example: ``rack12`` + + +``bucket-type`` + +:Description: The type of the bucket. The type must already exist in the hierarchy. +:Type: String +:Required: Yes +:Example: ``rack`` + + +The following example adds the ``rack12`` bucket to the hierarchy: + +.. prompt:: bash $ + + ceph osd crush add-bucket rack12 rack + +Move a Bucket +------------- + +To move a bucket to a different location or position in the CRUSH map +hierarchy, execute the following: + +.. prompt:: bash $ + + ceph osd crush move {bucket-name} {bucket-type}={bucket-name}, [...] + +Where: + +``bucket-name`` + +:Description: The name of the bucket to move/reposition. +:Type: String +:Required: Yes +:Example: ``foo-bar-1`` + +``bucket-type`` + +:Description: You may specify the bucket's location in the CRUSH hierarchy. +:Type: Key/value pairs. +:Required: No +:Example: ``datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1`` + +Remove a Bucket +--------------- + +To remove a bucket from the CRUSH hierarchy, execute the following: + +.. prompt:: bash $ + + ceph osd crush remove {bucket-name} + +.. 
note:: A bucket must be empty before removing it from the CRUSH hierarchy. + +Where: + +``bucket-name`` + +:Description: The name of the bucket that you'd like to remove. +:Type: String +:Required: Yes +:Example: ``rack12`` + +The following example removes the ``rack12`` bucket from the hierarchy: + +.. prompt:: bash $ + + ceph osd crush remove rack12 + +Creating a compat weight set +---------------------------- + +.. note:: This step is normally done automatically by the ``balancer`` + module when enabled. + +To create a *compat* weight set: + +.. prompt:: bash $ + + ceph osd crush weight-set create-compat + +Weights for the compat weight set can be adjusted with: + +.. prompt:: bash $ + + ceph osd crush weight-set reweight-compat {name} {weight} + +The compat weight set can be destroyed with: + +.. prompt:: bash $ + + ceph osd crush weight-set rm-compat + +Creating per-pool weight sets +----------------------------- + +To create a weight set for a specific pool: + +.. prompt:: bash $ + + ceph osd crush weight-set create {pool-name} {mode} + +.. note:: Per-pool weight sets require that all servers and daemons + run Luminous v12.2.z or later. + +Where: + +``pool-name`` + +:Description: The name of a RADOS pool +:Type: String +:Required: Yes +:Example: ``rbd`` + +``mode`` + +:Description: Either ``flat`` or ``positional``. A *flat* weight set + has a single weight for each device or bucket. A + *positional* weight set has a potentially different + weight for each position in the resulting placement + mapping. For example, if a pool has a replica count of + 3, then a positional weight set will have three weights + for each device and bucket. +:Type: String +:Required: Yes +:Example: ``flat`` + +To adjust the weight of an item in a weight set: + +.. prompt:: bash $ + + ceph osd crush weight-set reweight {pool-name} {item-name} {weight [...]} + +To list existing weight sets: + +.. prompt:: bash $ + + ceph osd crush weight-set ls + +To remove a weight set: + +.. 
prompt:: bash $ + + ceph osd crush weight-set rm {pool-name} + +Creating a rule for a replicated pool +------------------------------------- + +For a replicated pool, the primary decision when creating the CRUSH +rule is what the failure domain is going to be. For example, if a +failure domain of ``host`` is selected, then CRUSH will ensure that +each replica of the data is stored on a unique host. If ``rack`` +is selected, then each replica will be stored in a different rack. +What failure domain you choose primarily depends on the size and +topology of your cluster. + +In most cases the entire cluster hierarchy is nested beneath a root node +named ``default``. If you have customized your hierarchy, you may +want to create a rule nested at some other node in the hierarchy. It +doesn't matter what type is associated with that node (it doesn't have +to be a ``root`` node). + +It is also possible to create a rule that restricts data placement to +a specific *class* of device. By default, Ceph OSDs automatically +classify themselves as either ``hdd`` or ``ssd``, depending on the +underlying type of device being used. These classes can also be +customized. + +To create a replicated rule: + +.. prompt:: bash $ + + ceph osd crush rule create-replicated {name} {root} {failure-domain-type} [{class}] + +Where: + +``name`` + +:Description: The name of the rule +:Type: String +:Required: Yes +:Example: ``rbd-rule`` + +``root`` + +:Description: The name of the node under which data should be placed. +:Type: String +:Required: Yes +:Example: ``default`` + +``failure-domain-type`` + +:Description: The type of CRUSH nodes across which we should separate replicas. +:Type: String +:Required: Yes +:Example: ``rack`` + +``class`` + +:Description: The device class on which data should be placed. 
+:Type: String +:Required: No +:Example: ``ssd`` + +Creating a rule for an erasure coded pool +----------------------------------------- + +For an erasure-coded (EC) pool, the same basic decisions need to be made: +what is the failure domain, which node in the +hierarchy will data be placed under (usually ``default``), and will +placement be restricted to a specific device class. Erasure code +pools are created a bit differently, however, because they need to be +constructed carefully based on the erasure code being used. For this reason, +you must include this information in the *erasure code profile*. A CRUSH +rule will then be created from that either explicitly or automatically when +the profile is used to create a pool. + +The erasure code profiles can be listed with: + +.. prompt:: bash $ + + ceph osd erasure-code-profile ls + +An existing profile can be viewed with: + +.. prompt:: bash $ + + ceph osd erasure-code-profile get {profile-name} + +Normally profiles should never be modified; instead, a new profile +should be created and used when creating a new pool or creating a new +rule for an existing pool. + +An erasure code profile consists of a set of key=value pairs. Most of +these control the behavior of the erasure code that is encoding data +in the pool. Those that begin with ``crush-``, however, affect the +CRUSH rule that is created. + +The erasure code profile properties of interest are: + + * **crush-root**: the name of the CRUSH node under which to place data [default: ``default``]. + * **crush-failure-domain**: the CRUSH bucket type across which to distribute erasure-coded shards [default: ``host``]. + * **crush-device-class**: the device class on which to place data [default: none, meaning all devices are used]. + * **k** and **m** (and, for the ``lrc`` plugin, **l**): these determine the number of erasure code shards, affecting the resulting CRUSH rule. + +Once a profile is defined, you can create a CRUSH rule with: + +.. 
prompt:: bash $ + + ceph osd crush rule create-erasure {name} {profile-name} + +.. note:: When creating a new pool, it is not actually necessary to + explicitly create the rule. If the erasure code profile alone is + specified and the rule argument is left off then Ceph will create + the CRUSH rule automatically. + +Deleting rules +-------------- + +Rules that are not in use by pools can be deleted with: + +.. prompt:: bash $ + + ceph osd crush rule rm {rule-name} + + +.. _crush-map-tunables: + +Tunables +======== + +Over time, we have made (and continue to make) improvements to the +CRUSH algorithm used to calculate the placement of data. In order to +support the change in behavior, we have introduced a series of tunable +options that control whether the legacy or improved variation of the +algorithm is used. + +In order to use newer tunables, both clients and servers must support +the new version of CRUSH. For this reason, we have created +``profiles`` that are named after the Ceph version in which they were +introduced. For example, the ``firefly`` tunables are first supported +by the Firefly release, and will not work with older (e.g., Dumpling) +clients. Once a given set of tunables is changed from the legacy +default behavior, the ``ceph-mon`` and ``ceph-osd`` daemons will prevent older +clients that do not support the new CRUSH features from connecting to +the cluster. + +argonaut (legacy) +----------------- + +The legacy CRUSH behavior used by Argonaut and older releases works +fine for most clusters, provided there are not many OSDs that have +been marked out. + +bobtail (CRUSH_TUNABLES2) +------------------------- + +The ``bobtail`` tunable profile fixes a few key misbehaviors: + + * For hierarchies with a small number of devices in the leaf buckets, + some PGs map to fewer than the desired number of replicas. This + commonly happens for hierarchies with "host" nodes with a small + number (1-3) of OSDs nested beneath each one. 
+ + * For large clusters, some small percentages of PGs map to fewer than + the desired number of OSDs. This is more prevalent when there are + multiple hierarchy layers in use (e.g., ``row``, ``rack``, ``host``, ``osd``). + + * When some OSDs are marked out, the data tends to get redistributed + to nearby OSDs instead of across the entire hierarchy. + +The new tunables are: + + * ``choose_local_tries``: Number of local retries. Legacy value is + 2, optimal value is 0. + + * ``choose_local_fallback_tries``: Legacy value is 5, optimal value + is 0. + + * ``choose_total_tries``: Total number of attempts to choose an item. + Legacy value was 19; subsequent testing indicates that a value of + 50 is more appropriate for typical clusters. For extremely large + clusters, a larger value might be necessary. + + * ``chooseleaf_descend_once``: Whether a recursive chooseleaf attempt + will retry, or only try once and allow the original placement to + retry. Legacy default is 0, optimal value is 1. + +Migration impact: + + * Moving from ``argonaut`` to ``bobtail`` tunables triggers a moderate amount + of data movement. Use caution on a cluster that is already + populated with data. + +firefly (CRUSH_TUNABLES3) +------------------------- + +The ``firefly`` tunable profile fixes a problem +with ``chooseleaf`` CRUSH rule behavior that tends to result in PG +mappings with too few results when too many OSDs have been marked out. + +The new tunable is: + + * ``chooseleaf_vary_r``: Whether a recursive chooseleaf attempt will + start with a non-zero value of ``r``, based on how many attempts the + parent has already made. Legacy default is ``0``, but with this value + CRUSH is sometimes unable to find a mapping. The optimal value (in + terms of computational cost and correctness) is ``1``. 
+ +Migration impact: + + * For existing clusters that house lots of data, changing + from ``0`` to ``1`` will cause a lot of data to move; a value of ``4`` or ``5`` + will allow CRUSH to still find a valid mapping but will cause less data + to move. + +straw_calc_version tunable (introduced with Firefly too) +-------------------------------------------------------- + +There were some problems with the internal weights calculated and +stored in the CRUSH map for ``straw`` algorithm buckets. Specifically, when +there were items with a CRUSH weight of ``0``, or a mix of different and +unique weights, CRUSH would distribute data incorrectly (i.e., +not in proportion to the weights). + +The new tunable is: + + * ``straw_calc_version``: A value of ``0`` preserves the old, broken + internal weight calculation; a value of ``1`` fixes the behavior. + +Migration impact: + + * Moving to straw_calc_version ``1`` and then adjusting a straw bucket + (by adding, removing, or reweighting an item, or by using the + reweight-all command) can trigger a small to moderate amount of + data movement *if* the cluster has hit one of the problematic + conditions. + +This tunable option is special because it has no impact +on the kernel version required on the client side. + +hammer (CRUSH_V4) +----------------- + +The ``hammer`` tunable profile does not affect the +mapping of existing CRUSH maps simply by changing the profile. However: + + * There is a new bucket algorithm (``straw2``) supported. The new + ``straw2`` bucket algorithm fixes several limitations in the original + ``straw``. Specifically, the old ``straw`` buckets would + change some mappings that should not have changed when a weight was + adjusted, while ``straw2`` achieves the original goal of only + changing mappings to or from the bucket item whose weight has + changed. + + * ``straw2`` is the default for any newly created buckets. 
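The ``straw2`` property described above (reweighting one item only moves mappings to or from that item) can be illustrated with a small sketch. This is not Ceph's implementation: Ceph uses its own hash function and fixed-point arithmetic, whereas this sketch assumes SHA-256 and floating point as stand-ins. The function name ``straw2_choose``, the OSD names, and the weights are hypothetical. The only idea taken from ``straw2`` is that each item draws ``ln(u) / weight`` independently and the largest draw wins.

```python
import hashlib
import math


def straw2_choose(weights, key):
    """Pick one item for `key`: each item draws ln(u) / weight, largest draw wins."""
    best_item, best_draw = None, -math.inf
    for item, weight in weights.items():
        if weight <= 0:
            continue  # zero-weight items never win
        # Stand-in hash (assumption): Ceph does not use SHA-256 here.
        digest = hashlib.sha256(f"{key}:{item}".encode()).digest()
        # Map the hash to a uniform value u in (0, 1].
        u = (int.from_bytes(digest[:8], "big") + 1) / 2**64
        # The draw depends only on this item's own weight.
        draw = math.log(u) / weight
        if draw > best_draw:
            best_item, best_draw = item, draw
    return best_item


before = {"osd.0": 1.0, "osd.1": 1.0, "osd.2": 1.0}
after = dict(before, **{"osd.2": 2.0})  # only osd.2 is reweighted

# Collect (old, new) pairs for every key whose mapping changed.
moved = []
for pg in range(10000):
    old, new = straw2_choose(before, pg), straw2_choose(after, pg)
    if old != new:
        moved.append((old, new))

# Every mapping that changed moved onto the reweighted item.
print(all(new == "osd.2" for _, new in moved), len(moved))
```

Because the draws of ``osd.0`` and ``osd.1`` do not depend on the weight of ``osd.2``, raising that weight can only pull keys toward ``osd.2``; no key moves between the two untouched items.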
+ +Migration impact: + + * Changing a bucket type from ``straw`` to ``straw2`` will result in + a reasonably small amount of data movement, depending on how much + the bucket item weights vary from each other. When the weights are + all the same no data will move, and when item weights vary + significantly there will be more movement. + +jewel (CRUSH_TUNABLES5) +----------------------- + +The ``jewel`` tunable profile improves the +overall behavior of CRUSH such that significantly fewer mappings +change when an OSD is marked out of the cluster. This results in +significantly less data movement. + +The new tunable is: + + * ``chooseleaf_stable``: Whether a recursive chooseleaf attempt will + use a better value for an inner loop that greatly reduces the number + of mapping changes when an OSD is marked out. The legacy value is ``0``, + while the new value of ``1`` uses the new approach. + +Migration impact: + + * Changing this value on an existing cluster will result in a very + large amount of data movement as almost every PG mapping is likely + to change. 
+ + + + +Which client versions support CRUSH_TUNABLES +-------------------------------------------- + + * argonaut series, v0.48.1 or later + * v0.49 or later + * Linux kernel version v3.6 or later (for the file system and RBD kernel clients) + +Which client versions support CRUSH_TUNABLES2 +--------------------------------------------- + + * v0.55 or later, including bobtail series (v0.56.x) + * Linux kernel version v3.9 or later (for the file system and RBD kernel clients) + +Which client versions support CRUSH_TUNABLES3 +--------------------------------------------- + + * v0.78 (firefly) or later + * Linux kernel version v3.15 or later (for the file system and RBD kernel clients) + +Which client versions support CRUSH_V4 +-------------------------------------- + + * v0.94 (hammer) or later + * Linux kernel version v4.1 or later (for the file system and RBD kernel clients) + +Which client versions support CRUSH_TUNABLES5 +--------------------------------------------- + + * v10.0.2 (jewel) or later + * Linux kernel version v4.5 or later (for the file system and RBD kernel clients) + +Warning when tunables are non-optimal +------------------------------------- + +Starting with version v0.74, Ceph will issue a health warning if the +current CRUSH tunables don't include all the optimal values from the +``default`` profile (see below for the meaning of the ``default`` profile). +To make this warning go away, you have two options: + +1. Adjust the tunables on the existing cluster. Note that this will + result in some data movement (possibly as much as 10%). This is the + preferred route, but should be taken with care on a production cluster + where the data movement may affect performance. You can enable optimal + tunables with: + + .. 
prompt:: bash $ + + ceph osd crush tunables optimal + + If things go poorly (e.g., too much load) and not very much + progress has been made, or there is a client compatibility problem + (old kernel CephFS or RBD clients, or pre-Bobtail ``librados`` + clients), you can switch back with: + + .. prompt:: bash $ + + ceph osd crush tunables legacy + +2. You can make the warning go away without making any changes to CRUSH by + adding the following option to your ceph.conf ``[mon]`` section:: + + mon warn on legacy crush tunables = false + + For the change to take effect, you will need to restart the monitors, or + apply the option to running monitors with: + + .. prompt:: bash $ + + ceph tell mon.\* config set mon_warn_on_legacy_crush_tunables false + + +A few important points +---------------------- + + * Adjusting these values will result in the shift of some PGs between + storage nodes. If the Ceph cluster is already storing a lot of + data, be prepared for some fraction of the data to move. + * The ``ceph-osd`` and ``ceph-mon`` daemons will start requiring the + feature bits of new connections as soon as they get + the updated map. However, already-connected clients are + effectively grandfathered in, and will misbehave if they do not + support the new feature. + * If the CRUSH tunables are set to non-legacy values and then later + changed back to the default values, ``ceph-osd`` daemons will not be + required to support the feature. However, the OSD peering process + requires examining and understanding old maps. Therefore, you + should not run old versions of the ``ceph-osd`` daemon + if the cluster has previously used non-legacy CRUSH values, even if + the latest version of the map has been switched back to using the + legacy defaults. + +Tuning CRUSH +------------ + +The simplest way to adjust CRUSH tunables is by applying them in matched +sets known as *profiles*. 
As of the Octopus release these are: + + * ``legacy``: the legacy behavior from argonaut and earlier. + * ``argonaut``: the legacy values supported by the original argonaut release + * ``bobtail``: the values supported by the bobtail release + * ``firefly``: the values supported by the firefly release + * ``hammer``: the values supported by the hammer release + * ``jewel``: the values supported by the jewel release + * ``optimal``: the best (i.e. optimal) values of the current version of Ceph + * ``default``: the default values of a new cluster installed from + scratch. These values, which depend on the current version of Ceph, + are hardcoded and are generally a mix of optimal and legacy values. + These values generally match the ``optimal`` profile of the previous + LTS release, or the most recent release for which we generally expect + most users to have up-to-date clients for. + +You can apply a profile to a running cluster with the command: + +.. prompt:: bash $ + + ceph osd crush tunables {PROFILE} + +Note that this may result in data movement, potentially quite a bit. Study +release notes and documentation carefully before changing the profile on a +running cluster, and consider throttling recovery/backfill parameters to +limit the impact of a bolus of backfill. + +.. _CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data: https://ceph.io/assets/pdfs/weil-crush-sc06.pdf + + +Primary Affinity +================ + +When a Ceph Client reads or writes data, it first contacts the primary OSD in +each affected PG's acting set. By default, the first OSD in the acting set is +the primary. For example, in the acting set ``[2, 3, 4]``, ``osd.2`` is +listed first and thus is the primary (aka lead) OSD. Sometimes we know that an +OSD is less well suited to act as the lead than are other OSDs (e.g., it has +a slow drive or a slow controller). 
To prevent performance bottlenecks +(especially on read operations) while maximizing utilization of your hardware, +you can influence the selection of primary OSDs by adjusting primary affinity +values, or by crafting a CRUSH rule that selects preferred OSDs first. + +Tuning primary OSD selection is mainly useful for replicated pools, because +by default read operations are served from the primary OSD for each PG. +For erasure coded (EC) pools, a way to speed up read operations is to enable +**fast read** as described in :ref:`pool-settings`. + +A common scenario for primary affinity is when a cluster contains +a mix of drive sizes, for example older racks with 1.9 TB SATA SSDs and newer racks with +3.84 TB SATA SSDs. On average the latter will be assigned double the number of +PGs and thus will serve double the number of write and read operations, so +they'll be busier than the former. A rough assignment of primary affinity +inversely proportional to OSD size won't be 100% optimal, but it can readily +achieve a 15% improvement in overall read throughput by utilizing SATA +interface bandwidth and CPU cycles more evenly. + +By default, all Ceph OSDs have primary affinity of ``1``, which indicates that +any OSD may act as a primary with equal probability. + +You can reduce a Ceph OSD's primary affinity so that CRUSH is less likely to +choose the OSD as primary in a PG's acting set: + +.. prompt:: bash $ + + ceph osd primary-affinity <osd-id> <weight> + +You may set an OSD's primary affinity to a real number in the range ``[0-1]``, +where ``0`` indicates that the OSD may **NOT** be used as a primary and ``1`` +indicates that an OSD may be used as a primary. When the weight is between +these extremes, it is less likely that CRUSH will select that OSD as a primary. 
+The process for selecting the lead OSD is more nuanced than a simple +probability based on relative affinity values, but measurable results can be +achieved even with first-order approximations of desirable values. + +Custom CRUSH Rules +------------------ + +There are occasional clusters that balance cost and performance by mixing SSDs +and HDDs in the same replicated pool. By setting the primary affinity of HDD +OSDs to ``0`` one can direct operations to the SSD in each acting set. An +alternative is to define a CRUSH rule that always selects an SSD OSD as the +first OSD, then selects HDDs for the remaining OSDs. Thus, each PG's acting +set will contain exactly one SSD OSD as the primary with the balance on HDDs. + +For example, the CRUSH rule below:: + + rule mixed_replicated_rule { + id 11 + type replicated + min_size 1 + max_size 10 + step take default class ssd + step chooseleaf firstn 1 type host + step emit + step take default class hdd + step chooseleaf firstn 0 type host + step emit + } + +chooses an SSD as the first OSD. Note that for an ``N``-times replicated pool +this rule selects ``N+1`` OSDs to guarantee that ``N`` copies are on different +hosts, because the first SSD OSD might be co-located with any of the ``N`` HDD +OSDs. + +This extra storage requirement can be avoided by placing SSDs and HDDs in +different hosts with the tradeoff that hosts with SSDs will receive all client +requests. You may thus consider faster CPU(s) for SSD hosts and more modest +ones for HDD nodes, since the latter will normally only service recovery +operations. 
Here the CRUSH roots ``ssd_hosts`` and ``hdd_hosts`` strictly +must not contain the same servers:: + + rule mixed_replicated_rule_two { + id 1 + type replicated + min_size 1 + max_size 10 + step take ssd_hosts class ssd + step chooseleaf firstn 1 type host + step emit + step take hdd_hosts class hdd + step chooseleaf firstn -1 type host + step emit + } + + +Note also that on failure of an SSD, requests to a PG will be served temporarily +from a (slower) HDD OSD until the PG's data has been replicated onto the replacement +primary SSD OSD. + diff --git a/doc/rados/operations/data-placement.rst b/doc/rados/operations/data-placement.rst new file mode 100644 index 000000000..bd9bd7ec7 --- /dev/null +++ b/doc/rados/operations/data-placement.rst @@ -0,0 +1,43 @@ +========================= + Data Placement Overview +========================= + +Ceph stores, replicates and rebalances data objects across a RADOS cluster +dynamically. With many different users storing objects in different pools for +different purposes on countless OSDs, Ceph operations require some data +placement planning. The main data placement planning concepts in Ceph include: + +- **Pools:** Ceph stores data within pools, which are logical groups for storing + objects. Pools manage the number of placement groups, the number of replicas, + and the CRUSH rule for the pool. To store data in a pool, you must have + an authenticated user with permissions for the pool. Ceph can snapshot pools. + See `Pools`_ for additional details. + +- **Placement Groups:** Ceph maps objects to placement groups (PGs). + Placement groups (PGs) are shards or fragments of a logical object pool + that place objects as a group into OSDs. Placement groups reduce the amount + of per-object metadata when Ceph stores the data in OSDs. A larger number of + placement groups (e.g., 100 per OSD) leads to better balancing. See + `Placement Groups`_ for additional details. 
+ +- **CRUSH Maps:** CRUSH is a big part of what allows Ceph to scale without + performance bottlenecks, without limitations to scalability, and without a + single point of failure. CRUSH maps provide the physical topology of the + cluster to the CRUSH algorithm to determine where the data for an object + and its replicas should be stored, and how to do so across failure domains + for added data safety among other things. See `CRUSH Maps`_ for additional + details. + +- **Balancer:** The balancer is a feature that will automatically optimize the + distribution of PGs across devices to achieve a balanced data distribution, + maximizing the amount of data that can be stored in the cluster and evenly + distributing the workload across OSDs. + +When you initially set up a test cluster, you can use the default values. Once +you begin planning for a large Ceph cluster, refer to pools, placement groups +and CRUSH for data placement operations. + +.. _Pools: ../pools +.. _Placement Groups: ../placement-groups +.. _CRUSH Maps: ../crush-map +.. _Balancer: ../balancer diff --git a/doc/rados/operations/devices.rst b/doc/rados/operations/devices.rst new file mode 100644 index 000000000..1b6eaebde --- /dev/null +++ b/doc/rados/operations/devices.rst @@ -0,0 +1,208 @@ +.. _devices: + +Device Management +================= + +Ceph tracks which hardware storage devices (e.g., HDDs, SSDs) are consumed by +which daemons, and collects health metrics about those devices in order to +provide tools to predict and/or automatically respond to hardware failure. + +Device tracking +--------------- + +You can query which storage devices are in use with: + +.. prompt:: bash $ + + ceph device ls + +You can also list devices by daemon or by host: + +.. prompt:: bash $ + + ceph device ls-by-daemon <daemon> + ceph device ls-by-host <host> + +For any individual device, you can query information about its +location and how it is being consumed with: + +.. 
prompt:: bash $ + + ceph device info <devid> + +Identifying physical devices +---------------------------- + +You can blink the drive LEDs on hardware enclosures to make the replacement of +failed disks easy and less error-prone. Use the following command:: + + device light on|off <devid> [ident|fault] [--force] + +The ``<devid>`` parameter is the device identification. You can obtain this +information using the following command: + +.. prompt:: bash $ + + ceph device ls + +The ``[ident|fault]`` parameter is used to set the kind of light to blink. +By default, the `identification` light is used. + +.. note:: + This command needs the Cephadm or the Rook `orchestrator <https://docs.ceph.com/docs/master/mgr/orchestrator/#orchestrator-cli-module>`_ module enabled. + You can check which orchestrator module is enabled by executing the following command: + + .. prompt:: bash $ + + ceph orch status + +The command used behind the scenes to blink the drive LEDs is `lsmcli`. If you need +to customize this command, you can configure it via a Jinja2 template:: + + ceph config-key set mgr/cephadm/blink_device_light_cmd "<template>" + ceph config-key set mgr/cephadm/<host>/blink_device_light_cmd "lsmcli local-disk-{{ ident_fault }}-led-{{'on' if on else 'off'}} --path '{{ path or dev }}'" + +The Jinja2 template is rendered using the following arguments: + +* ``on`` + A boolean value. +* ``ident_fault`` + A string containing `ident` or `fault`. +* ``dev`` + A string containing the device ID, e.g. `SanDisk_X400_M.2_2280_512GB_162924424784`. +* ``path`` + A string containing the device path, e.g. `/dev/sda`. + +.. _enabling-monitoring: + +Enabling monitoring +------------------- + +Ceph can also monitor health metrics associated with your device. For +example, SATA hard disks implement a standard called SMART that +provides a wide range of internal metrics about the device's usage and +health, like the number of hours powered on, number of power cycles, +or unrecoverable read errors. 
Other device types like SAS and NVMe +implement a similar set of metrics (via slightly different standards). +All of these can be collected by Ceph via the ``smartctl`` tool. + +You can enable or disable health monitoring with: + +.. prompt:: bash $ + + ceph device monitoring on + +or: + +.. prompt:: bash $ + + ceph device monitoring off + + +Scraping +-------- + +If monitoring is enabled, metrics will automatically be scraped at regular intervals. That interval can be configured with: + +.. prompt:: bash $ + + ceph config set mgr mgr/devicehealth/scrape_frequency <seconds> + +The default is to scrape once every 24 hours. + +You can manually trigger a scrape of all devices with: + +.. prompt:: bash $ + + ceph device scrape-health-metrics + +A single device can be scraped with: + +.. prompt:: bash $ + + ceph device scrape-health-metrics <device-id> + +Or a single daemon's devices can be scraped with: + +.. prompt:: bash $ + + ceph device scrape-daemon-health-metrics <who> + +The stored health metrics for a device can be retrieved (optionally +for a specific timestamp) with: + +.. prompt:: bash $ + + ceph device get-health-metrics <devid> [sample-timestamp] + +Failure prediction +------------------ + +Ceph can predict life expectancy and device failures based on the +health metrics it collects. There are two modes: + +* *none*: disable device failure prediction. +* *local*: use a pre-trained prediction model from the ceph-mgr daemon. + +The prediction mode can be configured with: + +.. prompt:: bash $ + + ceph config set global device_failure_prediction_mode <mode> + +Prediction normally runs in the background on a periodic basis, so it +may take some time before life expectancy values are populated. You +can see the life expectancy of all devices in output from: + +.. prompt:: bash $ + + ceph device ls + +You can also query the metadata for a specific device with: + +.. 
prompt:: bash $ + + ceph device info <devid> + +You can explicitly force prediction of a device's life expectancy with: + +.. prompt:: bash $ + + ceph device predict-life-expectancy <devid> + +If you are not using Ceph's internal device failure prediction but +have some external source of information about device failures, you +can inform Ceph of a device's life expectancy with: + +.. prompt:: bash $ + + ceph device set-life-expectancy <devid> <from> [<to>] + +Life expectancies are expressed as a time interval so that +uncertainty can be expressed in the form of a wide interval. The +interval end can also be left unspecified. + +Health alerts +------------- + +The ``mgr/devicehealth/warn_threshold`` controls how soon an expected +device failure must be before we generate a health warning. + +The stored life expectancy of all devices can be checked, and any +appropriate health alerts generated, with: + +.. prompt:: bash $ + + ceph device check-health + +Automatic Mitigation +-------------------- + +If the ``mgr/devicehealth/self_heal`` option is enabled (it is by +default), then for devices that are expected to fail soon the module +will automatically migrate data away from them by marking the devices +"out". + +The ``mgr/devicehealth/mark_out_threshold`` controls how soon an +expected device failure must be before we automatically mark an osd +"out". diff --git a/doc/rados/operations/erasure-code-clay.rst b/doc/rados/operations/erasure-code-clay.rst new file mode 100644 index 000000000..1cffa32f5 --- /dev/null +++ b/doc/rados/operations/erasure-code-clay.rst @@ -0,0 +1,240 @@ +================ +CLAY code plugin +================ + +CLAY (short for coupled-layer) codes are erasure codes designed to bring about significant savings +in terms of network bandwidth and disk IO when a failed node/OSD/rack is being repaired. 
Let: + + d = number of OSDs contacted during repair + +If *jerasure* is configured with *k=8* and *m=4*, losing one OSD requires +reading from *d=8* other OSDs to repair it. Recovering, say, 1GiB of data therefore requires +a download of 8 X 1GiB = 8GiB of information. + +However, in the case of the *clay* plugin *d* is configurable within the limits: + + k+1 <= d <= k+m-1 + +By default, the clay code plugin picks *d=k+m-1* as it provides the greatest savings in terms +of network bandwidth and disk IO. In the case of the *clay* plugin configured with +*k=8*, *m=4* and *d=11*, when a single OSD fails, *d=11* OSDs are contacted and +256MiB is downloaded from each of them, resulting in a total download of 11 X 256MiB = 2.75GiB +of information. More general parameters are provided below. The benefits are substantial +when the repair is carried out for a rack that stores information on the order of +terabytes. + + +-------------+---------------------------------------------------------+ + | plugin | total amount of disk IO | + +=============+=========================================================+ + |jerasure,isa | :math:`k S` | + +-------------+---------------------------------------------------------+ + | clay | :math:`\frac{d S}{d - k + 1} = \frac{(k + m - 1) S}{m}` | + +-------------+---------------------------------------------------------+ + +where *S* is the amount of data stored on a single OSD undergoing repair. In the table above, we have +used the largest possible value of *d* as this will result in the smallest amount of data download needed +to achieve recovery from an OSD failure. + +Erasure-code profile examples +============================= + +An example configuration that can be used to observe reduced bandwidth usage: + +..
prompt:: bash $ + + ceph osd erasure-code-profile set CLAYprofile \ + plugin=clay \ + k=4 m=2 d=5 \ + crush-failure-domain=host + ceph osd pool create claypool erasure CLAYprofile + + +Creating a clay profile +======================= + +To create a new clay code profile: + +.. prompt:: bash $ + + ceph osd erasure-code-profile set {name} \ + plugin=clay \ + k={data-chunks} \ + m={coding-chunks} \ + [d={helper-chunks}] \ + [scalar_mds={plugin-name}] \ + [technique={technique-name}] \ + [crush-failure-domain={bucket-type}] \ + [crush-device-class={device-class}] \ + [directory={directory}] \ + [--force] + +Where: + +``k={data chunks}`` + +:Description: Each object is split into **data-chunks** parts, + each of which is stored on a different OSD. + +:Type: Integer +:Required: Yes. +:Example: 4 + +``m={coding-chunks}`` + +:Description: Compute **coding chunks** for each object and store them + on different OSDs. The number of coding chunks is also + the number of OSDs that can be down without losing data. + +:Type: Integer +:Required: Yes. +:Example: 2 + +``d={helper-chunks}`` + +:Description: Number of OSDs requested to send data during recovery of + a single chunk. *d* needs to be chosen such that + k+1 <= d <= k+m-1. The larger the *d*, the better the savings. + +:Type: Integer +:Required: No. +:Default: k+m-1 + +``scalar_mds={jerasure|isa|shec}`` + +:Description: **scalar_mds** specifies the plugin that is used as a + building block in the layered construction. It can be + one of *jerasure*, *isa*, *shec* + +:Type: String +:Required: No. +:Default: jerasure + +``technique={technique}`` + +:Description: **technique** specifies the technique that will be picked + within the 'scalar_mds' plugin specified. Supported techniques + are 'reed_sol_van', 'reed_sol_r6_op', 'cauchy_orig', + 'cauchy_good', 'liber8tion' for jerasure, 'reed_sol_van', + 'cauchy' for isa and 'single', 'multiple' for shec. + +:Type: String +:Required: No. 
+:Default: reed_sol_van (for jerasure, isa), single (for shec) + + +``crush-root={root}`` + +:Description: The name of the crush bucket used for the first step of + the CRUSH rule. For instance **step take default**. + +:Type: String +:Required: No. +:Default: default + + +``crush-failure-domain={bucket-type}`` + +:Description: Ensure that no two chunks are in a bucket with the same + failure domain. For instance, if the failure domain is + **host** no two chunks will be stored on the same + host. It is used to create a CRUSH rule step such as **step + chooseleaf host**. + +:Type: String +:Required: No. +:Default: host + +``crush-device-class={device-class}`` + +:Description: Restrict placement to devices of a specific class (e.g., + ``ssd`` or ``hdd``), using the crush device class names + in the CRUSH map. + +:Type: String +:Required: No. +:Default: + +``directory={directory}`` + +:Description: Set the **directory** name from which the erasure code + plugin is loaded. + +:Type: String +:Required: No. +:Default: /usr/lib/ceph/erasure-code + +``--force`` + +:Description: Override an existing profile by the same name. + +:Type: String +:Required: No. + + +Notion of sub-chunks +==================== + +The Clay code saves disk IO and network bandwidth because it +is a vector code: it can view and manipulate data within a chunk +at a finer granularity, termed a sub-chunk. The number of sub-chunks within +a chunk for a Clay code is given by: + + sub-chunk count = :math:`q^{\frac{k+m}{q}}`, where :math:`q = d - k + 1` + + +During repair of an OSD, the helper information requested +from an available OSD is only a fraction of a chunk. In fact, the number +of sub-chunks within a chunk that are accessed during repair is given by: + + repair sub-chunk count = :math:`\frac{\text{sub-chunk count}}{q}` + +Examples +-------- + +#. For a configuration with *k=4*, *m=2*, *d=5*, the sub-chunk count is + 8 and the repair sub-chunk count is 4.
Therefore, only half of a chunk is read + during repair. +#. When *k=8*, *m=4*, *d=11* the sub-chunk count is 64 and repair sub-chunk count + is 16. A quarter of a chunk is read from an available OSD for repair of a failed + chunk. + + + +How to choose a configuration given a workload +============================================== + +Only a few of the sub-chunks within a chunk are read during repair. These sub-chunks +are not necessarily stored consecutively within a chunk. For best disk IO +performance, it is helpful to read contiguous data. For this reason, it is suggested that +you choose stripe-size such that the sub-chunk size is sufficiently large. + +For a given stripe-size (that's fixed based on a workload), choose ``k``, ``m``, ``d`` such that: + + sub-chunk size = :math:`\frac{\text{stripe-size}}{k \times \text{sub-chunk count}}` = 4KB, 8KB, 12KB ... + +#. For large size workloads for which the stripe size is large, it is easy to choose k, m, d. + For example, consider a stripe-size of 64MB: choosing *k=16*, *m=4* and *d=19* will + result in a sub-chunk count of 1024 and a sub-chunk size of 4KB. +#. For small size workloads, *k=4*, *m=2* is a good configuration that provides both network + and disk IO benefits. + +Comparisons with LRC +==================== + +Locally Recoverable Codes (LRC) are also designed to save network +bandwidth and disk IO during single OSD recovery. However, the focus in LRCs is to minimize the +number of OSDs contacted during repair (d), and this comes at the cost of storage overhead. +The *clay* code has a storage overhead m/k. In the case of an *lrc*, it stores (k+m)/d parities in +addition to the ``m`` parities resulting in a storage overhead (m+(k+m)/d)/k. Both *clay* and *lrc* +can recover from the failure of any ``m`` OSDs.
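As a rough check of the comparison above, the following Python sketch evaluates the storage-overhead and repair-IO expressions quoted in this document for both codes. The function names are illustrative only and are not part of Ceph. The LRC repair IO is assumed here to be d * S, that is, one full chunk read from each of the d helpers.

```python
from fractions import Fraction


def clay_overhead(k, m):
    """Storage overhead of a clay profile: m parity chunks per k data chunks."""
    return Fraction(m, k)


def clay_repair_io(k, m, d=None):
    """Repair disk IO in units of S, using d * S / (d - k + 1).

    d defaults to k + m - 1, the value this document says the plugin picks.
    """
    if d is None:
        d = k + m - 1
    if not k + 1 <= d <= k + m - 1:
        raise ValueError("d must satisfy k+1 <= d <= k+m-1")
    return Fraction(d, d - k + 1)


def lrc_overhead(k, m, d):
    """Storage overhead of lrc as given in the text: (m + (k + m) / d) / k."""
    return (m + Fraction(k + m, d)) / k


def lrc_repair_io(d):
    """Assumption: an lrc repair reads one whole chunk from each of d helpers."""
    return Fraction(d)


if __name__ == "__main__":
    # (k, m, d used for lrc); clay uses its default d = k + m - 1.
    for k, m, lrc_d in [(10, 4, 7), (16, 4, 4)]:
        print(
            f"k={k} m={m}: "
            f"lrc io={float(lrc_repair_io(lrc_d))}*S "
            f"overhead={float(lrc_overhead(k, m, lrc_d))}, "
            f"clay io={float(clay_repair_io(k, m))}*S "
            f"overhead={float(clay_overhead(k, m))}"
        )
```

Exact fractions are used so that the results can be compared without floating-point rounding.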
+ + +-----------------+----------------------------------+----------------------------------+ + | Parameters | disk IO, storage overhead (LRC) | disk IO, storage overhead (CLAY) | + +=================+==================================+==================================+ + | (k=10, m=4) | 7 * S, 0.6 (d=7) | 3.25 * S, 0.4 (d=13) | + +-----------------+----------------------------------+----------------------------------+ + | (k=16, m=4) | 4 * S, 0.5625 (d=4) | 4.75 * S, 0.25 (d=19) | + +-----------------+----------------------------------+----------------------------------+ + + +where ``S`` is the amount of data stored on the single OSD being recovered. diff --git a/doc/rados/operations/erasure-code-isa.rst b/doc/rados/operations/erasure-code-isa.rst new file mode 100644 index 000000000..9a43f89a2 --- /dev/null +++ b/doc/rados/operations/erasure-code-isa.rst @@ -0,0 +1,107 @@ +======================= +ISA erasure code plugin +======================= + +The *isa* plugin encapsulates the `ISA +<https://01.org/intel%C2%AE-storage-acceleration-library-open-source-version/>`_ +library. + +Create an isa profile +===================== + +To create a new *isa* erasure code profile: + +.. prompt:: bash $ + + ceph osd erasure-code-profile set {name} \ + plugin=isa \ + technique={reed_sol_van|cauchy} \ + [k={data-chunks}] \ + [m={coding-chunks}] \ + [crush-root={root}] \ + [crush-failure-domain={bucket-type}] \ + [crush-device-class={device-class}] \ + [directory={directory}] \ + [--force] + +Where: + +``k={data chunks}`` + +:Description: Each object is split into **data-chunks** parts, + each stored on a different OSD. + +:Type: Integer +:Required: No. +:Default: 7 + +``m={coding-chunks}`` + +:Description: Compute **coding chunks** for each object and store them + on different OSDs. The number of coding chunks is also + the number of OSDs that can be down without losing data. + +:Type: Integer +:Required: No.
+:Default: 3 + +``technique={reed_sol_van|cauchy}`` + +:Description: The ISA plugin comes in two `Reed Solomon + <https://en.wikipedia.org/wiki/Reed%E2%80%93Solomon_error_correction>`_ + forms. If *reed_sol_van* is set, it is `Vandermonde + <https://en.wikipedia.org/wiki/Vandermonde_matrix>`_, if + *cauchy* is set, it is `Cauchy + <https://en.wikipedia.org/wiki/Cauchy_matrix>`_. + +:Type: String +:Required: No. +:Default: reed_sol_van + +``crush-root={root}`` + +:Description: The name of the crush bucket used for the first step of + the CRUSH rule. For instance **step take default**. + +:Type: String +:Required: No. +:Default: default + +``crush-failure-domain={bucket-type}`` + +:Description: Ensure that no two chunks are in a bucket with the same + failure domain. For instance, if the failure domain is + **host** no two chunks will be stored on the same + host. It is used to create a CRUSH rule step such as **step + chooseleaf host**. + +:Type: String +:Required: No. +:Default: host + +``crush-device-class={device-class}`` + +:Description: Restrict placement to devices of a specific class (e.g., + ``ssd`` or ``hdd``), using the crush device class names + in the CRUSH map. + +:Type: String +:Required: No. +:Default: + +``directory={directory}`` + +:Description: Set the **directory** name from which the erasure code + plugin is loaded. + +:Type: String +:Required: No. +:Default: /usr/lib/ceph/erasure-code + +``--force`` + +:Description: Override an existing profile by the same name. + +:Type: String +:Required: No. + diff --git a/doc/rados/operations/erasure-code-jerasure.rst b/doc/rados/operations/erasure-code-jerasure.rst new file mode 100644 index 000000000..553afa09d --- /dev/null +++ b/doc/rados/operations/erasure-code-jerasure.rst @@ -0,0 +1,121 @@ +============================ +Jerasure erasure code plugin +============================ + +The *jerasure* plugin is the most generic and flexible plugin, it is +also the default for Ceph erasure coded pools. 
+ +The *jerasure* plugin encapsulates the `Jerasure +<http://jerasure.org>`_ library. It is +recommended to read the *jerasure* documentation to get a better +understanding of the parameters. + +Create a jerasure profile +========================= + +To create a new *jerasure* erasure code profile: + +.. prompt:: bash $ + + ceph osd erasure-code-profile set {name} \ + plugin=jerasure \ + k={data-chunks} \ + m={coding-chunks} \ + technique={reed_sol_van|reed_sol_r6_op|cauchy_orig|cauchy_good|liberation|blaum_roth|liber8tion} \ + [crush-root={root}] \ + [crush-failure-domain={bucket-type}] \ + [crush-device-class={device-class}] \ + [directory={directory}] \ + [--force] + +Where: + +``k={data chunks}`` + +:Description: Each object is split into **data-chunks** parts, + each stored on a different OSD. + +:Type: Integer +:Required: Yes. +:Example: 4 + +``m={coding-chunks}`` + +:Description: Compute **coding chunks** for each object and store them + on different OSDs. The number of coding chunks is also + the number of OSDs that can be down without losing data. + +:Type: Integer +:Required: Yes. +:Example: 2 + +``technique={reed_sol_van|reed_sol_r6_op|cauchy_orig|cauchy_good|liberation|blaum_roth|liber8tion}`` + +:Description: The most flexible technique is *reed_sol_van*: it is + enough to set *k* and *m*. The *cauchy_good* technique + can be faster but you need to choose the *packetsize* + carefully. All of *reed_sol_r6_op*, *liberation*, + *blaum_roth*, *liber8tion* are *RAID6* equivalents in + the sense that they can only be configured with *m=2*. + +:Type: String +:Required: No. +:Default: reed_sol_van + +``packetsize={bytes}`` + +:Description: The encoding will be done on packets of *bytes* size at + a time. Choosing the right packet size is difficult. The + *jerasure* documentation contains extensive information + on this topic. + +:Type: Integer +:Required: No.
+:Default: 2048 + +``crush-root={root}`` + +:Description: The name of the crush bucket used for the first step of + the CRUSH rule. For instance **step take default**. + +:Type: String +:Required: No. +:Default: default + +``crush-failure-domain={bucket-type}`` + +:Description: Ensure that no two chunks are in a bucket with the same + failure domain. For instance, if the failure domain is + **host** no two chunks will be stored on the same + host. It is used to create a CRUSH rule step such as **step + chooseleaf host**. + +:Type: String +:Required: No. +:Default: host + +``crush-device-class={device-class}`` + +:Description: Restrict placement to devices of a specific class (e.g., + ``ssd`` or ``hdd``), using the crush device class names + in the CRUSH map. + +:Type: String +:Required: No. + +``directory={directory}`` + +:Description: Set the **directory** name from which the erasure code + plugin is loaded. + +:Type: String +:Required: No. +:Default: /usr/lib/ceph/erasure-code + +``--force`` + +:Description: Override an existing profile by the same name. + +:Type: String +:Required: No. + diff --git a/doc/rados/operations/erasure-code-lrc.rst b/doc/rados/operations/erasure-code-lrc.rst new file mode 100644 index 000000000..5329603b9 --- /dev/null +++ b/doc/rados/operations/erasure-code-lrc.rst @@ -0,0 +1,388 @@ +====================================== +Locally repairable erasure code plugin +====================================== + +With the *jerasure* plugin, when an erasure coded object is stored on +multiple OSDs, recovering from the loss of one OSD requires reading +from *k* others. For instance if *jerasure* is configured with +*k=8* and *m=4*, recovering from the loss of one OSD requires reading +from eight others. + +The *lrc* erasure code plugin creates local parity chunks to enable +recovery using fewer surviving OSDs. For instance if *lrc* is configured with +*k=8*, *m=4* and *l=4*, it will create an additional parity chunk for +every four OSDs. 
When a single OSD is lost, it can be recovered with +only four OSDs instead of eight. + +Erasure code profile examples +============================= + +Reduce recovery bandwidth between hosts +--------------------------------------- + +Although it is probably not an interesting use case when all hosts are +connected to the same switch, reduced bandwidth usage can actually be +observed.: + +.. prompt:: bash $ + + ceph osd erasure-code-profile set LRCprofile \ + plugin=lrc \ + k=4 m=2 l=3 \ + crush-failure-domain=host + ceph osd pool create lrcpool erasure LRCprofile + + +Reduce recovery bandwidth between racks +--------------------------------------- + +In Firefly the bandwidth reduction will only be observed if the primary +OSD is in the same rack as the lost chunk.: + +.. prompt:: bash $ + + ceph osd erasure-code-profile set LRCprofile \ + plugin=lrc \ + k=4 m=2 l=3 \ + crush-locality=rack \ + crush-failure-domain=host + ceph osd pool create lrcpool erasure LRCprofile + + +Create an lrc profile +===================== + +To create a new lrc erasure code profile: + +.. prompt:: bash $ + + ceph osd erasure-code-profile set {name} \ + plugin=lrc \ + k={data-chunks} \ + m={coding-chunks} \ + l={locality} \ + [crush-root={root}] \ + [crush-locality={bucket-type}] \ + [crush-failure-domain={bucket-type}] \ + [crush-device-class={device-class}] \ + [directory={directory}] \ + [--force] + +Where: + +``k={data chunks}`` + +:Description: Each object is split in **data-chunks** parts, + each stored on a different OSD. + +:Type: Integer +:Required: Yes. +:Example: 4 + +``m={coding-chunks}`` + +:Description: Compute **coding chunks** for each object and store them + on different OSDs. The number of coding chunks is also + the number of OSDs that can be down without losing data. + +:Type: Integer +:Required: Yes. +:Example: 2 + +``l={locality}`` + +:Description: Group the coding and data chunks into sets of size + **locality**. 
For instance, for **k=4** and **m=2**, + when **locality=3** two groups of three are created. + Each set can be recovered without reading chunks + from another set. + +:Type: Integer +:Required: Yes. +:Example: 3 + +``crush-root={root}`` + +:Description: The name of the crush bucket used for the first step of + the CRUSH rule. For instance **step take default**. + +:Type: String +:Required: No. +:Default: default + +``crush-locality={bucket-type}`` + +:Description: The type of the CRUSH bucket in which each set of chunks + defined by **l** will be stored. For instance, if it is + set to **rack**, each group of **l** chunks will be + placed in a different rack. It is used to create a + CRUSH rule step such as **step choose rack**. If it is not + set, no such grouping is done. + +:Type: String +:Required: No. + +``crush-failure-domain={bucket-type}`` + +:Description: Ensure that no two chunks are in a bucket with the same + failure domain. For instance, if the failure domain is + **host** no two chunks will be stored on the same + host. It is used to create a CRUSH rule step such as **step + chooseleaf host**. + +:Type: String +:Required: No. +:Default: host + +``crush-device-class={device-class}`` + +:Description: Restrict placement to devices of a specific class (e.g., + ``ssd`` or ``hdd``), using the crush device class names + in the CRUSH map. + +:Type: String +:Required: No. +:Default: + +``directory={directory}`` + +:Description: Set the **directory** name from which the erasure code + plugin is loaded. + +:Type: String +:Required: No. +:Default: /usr/lib/ceph/erasure-code + +``--force`` + +:Description: Override an existing profile by the same name. + +:Type: String +:Required: No. + +Low level plugin configuration +============================== + +The sum of **k** and **m** must be a multiple of the **l** parameter. +The low level configuration parameters however do not enforce this +restriction and it may be advantageous to use them for specific +purposes. 
It is for instance possible to define two groups, one with 4 +chunks and another with 3 chunks. It is also possible to recursively +define locality sets, for instance racks within +datacenters. The **k/m/l** parameters are implemented by generating a low level +configuration. + +The *lrc* erasure code plugin recursively applies erasure code +techniques so that recovering from the loss of some chunks only +requires a subset of the available chunks, most of the time. + +For instance, when three coding steps are described as:: + + chunk nr 01234567 + step 1 _cDD_cDD + step 2 cDDD____ + step 3 ____cDDD + +where *c* are coding chunks calculated from the data chunks *D*, the +loss of chunk *7* can be recovered with the last four chunks. The +loss of chunk *2* can be recovered with the first four +chunks. + +Erasure code profile examples using low level configuration +=========================================================== + +Minimal testing +--------------- + +This is strictly equivalent to using a *K=2* *M=1* erasure code profile. The *DD* +implies *K=2*, the *c* implies *M=1* and the *jerasure* plugin is used +by default: + +.. prompt:: bash $ + + ceph osd erasure-code-profile set LRCprofile \ + plugin=lrc \ + mapping=DD_ \ + layers='[ [ "DDc", "" ] ]' + ceph osd pool create lrcpool erasure LRCprofile + +Reduce recovery bandwidth between hosts +--------------------------------------- + +Although it is probably not an interesting use case when all hosts are +connected to the same switch, reduced bandwidth usage can actually be +observed. It is equivalent to **k=4**, **m=2** and **l=3** although +the layout of the chunks is different.
**WARNING: PROMPTS ARE SELECTABLE** + +:: + + $ ceph osd erasure-code-profile set LRCprofile \ + plugin=lrc \ + mapping=__DD__DD \ + layers='[ + [ "_cDD_cDD", "" ], + [ "cDDD____", "" ], + [ "____cDDD", "" ], + ]' + $ ceph osd pool create lrcpool erasure LRCprofile + + +Reduce recovery bandwidth between racks +--------------------------------------- + +In Firefly the reduced bandwidth will only be observed if the primary OSD is in +the same rack as the lost chunk. **WARNING: PROMPTS ARE SELECTABLE** + +:: + + $ ceph osd erasure-code-profile set LRCprofile \ + plugin=lrc \ + mapping=__DD__DD \ + layers='[ + [ "_cDD_cDD", "" ], + [ "cDDD____", "" ], + [ "____cDDD", "" ], + ]' \ + crush-steps='[ + [ "choose", "rack", 2 ], + [ "chooseleaf", "host", 4 ], + ]' + + $ ceph osd pool create lrcpool erasure LRCprofile + +Testing with different Erasure Code backends +-------------------------------------------- + +LRC now uses jerasure as the default EC backend. It is possible to +specify the EC backend/algorithm on a per layer basis using the low +level configuration. The second argument in layers='[ [ "DDc", "" ] ]' +is actually an erasure code profile to be used for this level. The +example below specifies the ISA backend with the cauchy technique to +be used in the lrcpool.: + +.. prompt:: bash $ + + ceph osd erasure-code-profile set LRCprofile \ + plugin=lrc \ + mapping=DD_ \ + layers='[ [ "DDc", "plugin=isa technique=cauchy" ] ]' + ceph osd pool create lrcpool erasure LRCprofile + +You could also use a different erasure code profile for each +layer. 
**WARNING: PROMPTS ARE SELECTABLE** + +:: + + $ ceph osd erasure-code-profile set LRCprofile \ + plugin=lrc \ + mapping=__DD__DD \ + layers='[ + [ "_cDD_cDD", "plugin=isa technique=cauchy" ], + [ "cDDD____", "plugin=isa" ], + [ "____cDDD", "plugin=jerasure" ], + ]' + $ ceph osd pool create lrcpool erasure LRCprofile + + + +Erasure coding and decoding algorithm +===================================== + +The steps found in the layers description:: + + chunk nr 01234567 + + step 1 _cDD_cDD + step 2 cDDD____ + step 3 ____cDDD + +are applied in order. For instance, if a 4K object is encoded, it will +first go through *step 1* and be divided in four 1K chunks (the four +uppercase D). They are stored in the chunks 2, 3, 6 and 7, in +order. From these, two coding chunks are calculated (the two lowercase +c). The coding chunks are stored in the chunks 1 and 5, respectively. + +The *step 2* re-uses the content created by *step 1* in a similar +fashion and stores a single coding chunk *c* at position 0. The last four +chunks, marked with an underscore (*_*) for readability, are ignored. + +The *step 3* stores a single coding chunk *c* at position 4. The three +chunks created by *step 1* are used to compute this coding chunk, +i.e. the coding chunk from *step 1* becomes a data chunk in *step 3*. + +If chunk *2* is lost:: + + chunk nr 01234567 + + step 1 _c D_cDD + step 2 cD D____ + step 3 __ _cDDD + +decoding will attempt to recover it by walking the steps in reverse +order: *step 3* then *step 2* and finally *step 1*. + +The *step 3* knows nothing about chunk *2* (i.e. it is an underscore) +and is skipped. + +The coding chunk from *step 2*, stored in chunk *0*, allows it to +recover the content of chunk *2*. There are no more chunks to recover +and the process stops, without considering *step 1*. + +Recovering chunk *2* requires reading chunks *0, 1, 3* and writing +back chunk *2*. 
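The reverse walk described above can be sketched in a few lines of Python. This is only an illustrative model of which chunks each step can repair and which chunks it reads; it performs no erasure-code arithmetic, and the function name ``plan_recovery`` is hypothetical rather than part of the *lrc* plugin. It assumes a single pass over the steps in reverse order, and that a step can repair at most as many chunks as it has coding chunks (*c*).

```python
LAYERS = ["_cDD_cDD", "cDDD____", "____cDDD"]


def plan_recovery(layers, lost):
    """Walk the steps in reverse order and record what each one repairs.

    Returns (steps, unrecovered) where steps is a list of
    (step number, repaired chunks, chunks read) tuples.
    """
    missing = set(lost)
    steps = []
    for number in range(len(layers), 0, -1):
        layer = layers[number - 1]
        # Chunks this step knows about: every position that is not "_".
        used = {i for i, ch in enumerate(layer) if ch != "_"}
        gone = used & missing
        if not gone:
            continue  # the step knows nothing about the lost chunks
        if len(gone) > layer.count("c"):
            continue  # more chunks missing than this step can repair
        steps.append((number, sorted(gone), sorted(used - gone)))
        missing -= gone
        if not missing:
            break  # nothing left to recover
    return steps, sorted(missing)


if __name__ == "__main__":
    print(plan_recovery(LAYERS, [2]))
    print(plan_recovery(LAYERS, [2, 3, 6]))
```

For a lost chunk *2*, the sketch skips *step 3*, repairs the chunk in *step 2* by reading chunks *0, 1, 3*, and never reaches *step 1*, which matches the walk-through above.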
+ +If chunk *2, 3, 6* are lost:: + + chunk nr 01234567 + + step 1 _c _c D + step 2 cD __ _ + step 3 __ cD D + +The *step 3* can recover the content of chunk *6*:: + + chunk nr 01234567 + + step 1 _c _cDD + step 2 cD ____ + step 3 __ cDDD + +The *step 2* fails to recover and is skipped because there are two +chunks missing (*2, 3*) and it can only recover from one missing +chunk. + +The coding chunk from *step 1*, stored in chunk *1, 5*, allows it to +recover the content of chunk *2, 3*:: + + chunk nr 01234567 + + step 1 _cDD_cDD + step 2 cDDD____ + step 3 ____cDDD + +Controlling CRUSH placement +=========================== + +The default CRUSH rule provides OSDs that are on different hosts. For instance:: + + chunk nr 01234567 + + step 1 _cDD_cDD + step 2 cDDD____ + step 3 ____cDDD + +needs exactly *8* OSDs, one for each chunk. If the hosts are in two +adjacent racks, the first four chunks can be placed in the first rack +and the last four in the second rack. So that recovering from the loss +of a single OSD does not require using bandwidth between the two +racks. + +For instance:: + + crush-steps='[ [ "choose", "rack", 2 ], [ "chooseleaf", "host", 4 ] ]' + +will create a rule that will select two crush buckets of type +*rack* and for each of them choose four OSDs, each of them located in +different buckets of type *host*. + +The CRUSH rule can also be manually crafted for finer control. diff --git a/doc/rados/operations/erasure-code-profile.rst b/doc/rados/operations/erasure-code-profile.rst new file mode 100644 index 000000000..45b071f8a --- /dev/null +++ b/doc/rados/operations/erasure-code-profile.rst @@ -0,0 +1,126 @@ +.. _erasure-code-profiles: + +===================== +Erasure code profiles +===================== + +Erasure code is defined by a **profile** and is used when creating an +erasure coded pool and the associated CRUSH rule. 
+ +The **default** erasure code profile (which is created when the Ceph +cluster is initialized) will split the data into 2 equal-sized chunks, +and have 2 parity chunks of the same size. It will take as much space +in the cluster as a 2-replica pool but can sustain the data loss of 2 +chunks out of 4. It is described as a profile with **k=2** and **m=2**, +meaning the information is spread over four OSDs (k+m == 4) and two of +them can be lost. + +To improve redundancy without increasing raw storage requirements, a +new profile can be created. For instance, a profile with **k=10** and +**m=4** can sustain the loss of four (**m=4**) OSDs by distributing an +object on fourteen (k+m=14) OSDs. The object is first divided into +**10** chunks (if the object is 10MB, each chunk is 1MB) and **4** +coding chunks are computed, for recovery (each coding chunk has the +same size as the data chunk, i.e. 1MB). The raw space overhead is only +40% and the object will not be lost even if four OSDs break at the +same time. + +.. _list of available plugins: + +.. toctree:: + :maxdepth: 1 + + erasure-code-jerasure + erasure-code-isa + erasure-code-lrc + erasure-code-shec + erasure-code-clay + +osd erasure-code-profile set +============================ + +To create a new erasure code profile:: + + ceph osd erasure-code-profile set {name} \ + [{directory=directory}] \ + [{plugin=plugin}] \ + [{stripe_unit=stripe_unit}] \ + [{key=value} ...] \ + [--force] + +Where: + +``{directory=directory}`` + +:Description: Set the **directory** name from which the erasure code + plugin is loaded. + +:Type: String +:Required: No. +:Default: /usr/lib/ceph/erasure-code + +``{plugin=plugin}`` + +:Description: Use the erasure code **plugin** to compute coding chunks + and recover missing chunks. See the `list of available + plugins`_ for more information. + +:Type: String +:Required: No. +:Default: jerasure + +``{stripe_unit=stripe_unit}`` + +:Description: The amount of data in a data chunk, per stripe.
For + example, a profile with 2 data chunks and stripe_unit=4K + would put the range 0-4K in chunk 0, 4K-8K in chunk 1, + then 8K-12K in chunk 0 again. This should be a multiple + of 4K for best performance. The default value is taken + from the monitor config option + ``osd_pool_erasure_code_stripe_unit`` when a pool is + created. The stripe_width of a pool using this profile + will be the number of data chunks multiplied by this + stripe_unit. + +:Type: String +:Required: No. + +``{key=value}`` + +:Description: The semantic of the remaining key/value pairs is defined + by the erasure code plugin. + +:Type: String +:Required: No. + +``--force`` + +:Description: Override an existing profile by the same name, and allow + setting a non-4K-aligned stripe_unit. + +:Type: String +:Required: No. + +osd erasure-code-profile rm +============================ + +To remove an erasure code profile:: + + ceph osd erasure-code-profile rm {name} + +If the profile is referenced by a pool, the deletion will fail. + +osd erasure-code-profile get +============================ + +To display an erasure code profile:: + + ceph osd erasure-code-profile get {name} + +osd erasure-code-profile ls +=========================== + +To list the names of all erasure code profiles:: + + ceph osd erasure-code-profile ls + diff --git a/doc/rados/operations/erasure-code-shec.rst b/doc/rados/operations/erasure-code-shec.rst new file mode 100644 index 000000000..4e8f59b0b --- /dev/null +++ b/doc/rados/operations/erasure-code-shec.rst @@ -0,0 +1,145 @@ +======================== +SHEC erasure code plugin +======================== + +The *shec* plugin encapsulates the `multiple SHEC +<http://tracker.ceph.com/projects/ceph/wiki/Shingled_Erasure_Code_(SHEC)>`_ +library. It allows ceph to recover data more efficiently than Reed Solomon codes. + +Create an SHEC profile +====================== + +To create a new *shec* erasure code profile: + +.. 
prompt:: bash $ + + ceph osd erasure-code-profile set {name} \ + plugin=shec \ + [k={data-chunks}] \ + [m={coding-chunks}] \ + [c={durability-estimator}] \ + [crush-root={root}] \ + [crush-failure-domain={bucket-type}] \ + [crush-device-class={device-class}] \ + [directory={directory}] \ + [--force] + +Where: + +``k={data-chunks}`` + +:Description: Each object is split in **data-chunks** parts, + each stored on a different OSD. + +:Type: Integer +:Required: No. +:Default: 4 + +``m={coding-chunks}`` + +:Description: Compute **coding-chunks** for each object and store them on + different OSDs. The number of **coding-chunks** does not necessarily + equal the number of OSDs that can be down without losing data. + +:Type: Integer +:Required: No. +:Default: 3 + +``c={durability-estimator}`` + +:Description: The number of parity chunks each of which includes each data chunk in its + calculation range. The number is used as a **durability estimator**. + For instance, if c=2, 2 OSDs can be down without losing data. + +:Type: Integer +:Required: No. +:Default: 2 + +``crush-root={root}`` + +:Description: The name of the crush bucket used for the first step of + the CRUSH rule. For instance **step take default**. + +:Type: String +:Required: No. +:Default: default + +``crush-failure-domain={bucket-type}`` + +:Description: Ensure that no two chunks are in a bucket with the same + failure domain. For instance, if the failure domain is + **host** no two chunks will be stored on the same + host. It is used to create a CRUSH rule step such as **step + chooseleaf host**. + +:Type: String +:Required: No. +:Default: host + +``crush-device-class={device-class}`` + +:Description: Restrict placement to devices of a specific class (e.g., + ``ssd`` or ``hdd``), using the crush device class names + in the CRUSH map. + +:Type: String +:Required: No. +:Default: + +``directory={directory}`` + +:Description: Set the **directory** name from which the erasure code + plugin is loaded. 
+ +:Type: String +:Required: No. +:Default: /usr/lib/ceph/erasure-code + +``--force`` + +:Description: Override an existing profile by the same name. + +:Type: String +:Required: No. + +Brief description of SHEC's layouts +=================================== + +Space Efficiency +---------------- + +Space efficiency is the ratio of data chunks to all chunks in an object, +represented as k/(k+m). +In order to improve space efficiency, you should increase k or decrease m: + + space efficiency of SHEC(4,3,2) = :math:`\frac{4}{4+3}` = 0.57 + SHEC(5,3,2) or SHEC(4,2,2) improves SHEC(4,3,2)'s space efficiency + +Durability +---------- + +The third parameter of SHEC (=c) is a durability estimator, which approximates +the number of OSDs that can be down without losing data. + +``durability estimator of SHEC(4,3,2) = 2`` + +Recovery Efficiency +------------------- + +Calculating recovery efficiency is beyond the scope of this document, +but increasing m without increasing c improves recovery efficiency. +(However, this improvement comes at the cost of space efficiency.) + +``SHEC(4,2,2) -> SHEC(4,3,2) : achieves improvement of recovery efficiency`` + +Erasure code profile examples +============================= + + +.. prompt:: bash $ + + ceph osd erasure-code-profile set SHECprofile \ + plugin=shec \ + k=8 m=4 c=3 \ + crush-failure-domain=host + ceph osd pool create shecpool erasure SHECprofile diff --git a/doc/rados/operations/erasure-code.rst b/doc/rados/operations/erasure-code.rst new file mode 100644 index 000000000..1dea23c35 --- /dev/null +++ b/doc/rados/operations/erasure-code.rst @@ -0,0 +1,262 @@ +.. _ecpool: + +============= + Erasure code +============= + +By default, Ceph `pools <../pools>`_ are created with the type "replicated". In +replicated-type pools, every object is copied to multiple disks (this +multiple copying is the "replication"). 
+ +In contrast, `erasure-coded <https://en.wikipedia.org/wiki/Erasure_code>`_ +pools use a method of data protection that is different from replication. In +erasure coding, data is broken into fragments of two kinds: data blocks and +parity blocks. If a drive fails or becomes corrupted, the parity blocks are +used to rebuild the data. At scale, erasure coding saves space relative to +replication. + +In this documentation, data blocks are referred to as "data chunks" +and parity blocks are referred to as "encoding chunks". + +Erasure codes are also called "forward error correction codes". The +first forward error correction code was developed in 1950 by Richard +Hamming at Bell Laboratories. + + +Creating a sample erasure coded pool +------------------------------------ + +The simplest erasure coded pool is equivalent to `RAID5 +<https://en.wikipedia.org/wiki/Standard_RAID_levels#RAID_5>`_ and +requires at least three hosts: + +.. prompt:: bash $ + + ceph osd pool create ecpool erasure + +:: + + pool 'ecpool' created + +.. prompt:: bash $ + + echo ABCDEFGHI | rados --pool ecpool put NYAN - + rados --pool ecpool get NYAN - + +:: + + ABCDEFGHI + +Erasure code profiles +--------------------- + +The default erasure code profile can sustain the loss of two OSDs. This erasure +code profile is equivalent to a replicated pool of size three, but requires +2TB to store 1TB of data instead of 3TB to store 1TB of data. The default +profile can be displayed with this command: + +.. prompt:: bash $ + + ceph osd erasure-code-profile get default + +:: + + k=2 + m=2 + plugin=jerasure + crush-failure-domain=host + technique=reed_sol_van + +.. note:: + The default erasure-coded pool, the profile of which is displayed here, is + not the same as the simplest erasure-coded pool. + + The default erasure-coded pool has two data chunks (k) and two coding chunks + (m). The profile of the default erasure-coded pool is "k=2 m=2". 
+ + The simplest erasure-coded pool has two data chunks (k) and one coding chunk + (m). The profile of the simplest erasure-coded pool is "k=2 m=1". + +Choosing the right profile is important because the profile cannot be modified +after the pool is created. If you find that you need an erasure-coded pool with +a profile different from the one you have created, you must create a new pool +with a different (and presumably more carefully-considered) profile. When the +new pool is created, all objects from the wrongly-configured pool must be moved +to the newly-created pool. + +The most important parameters of the profile are *K*, *M* and +*crush-failure-domain* because they define the storage overhead and +the data durability. For example, if the desired architecture must +sustain the loss of two racks with a storage overhead of 67%, +the following profile can be defined: + +.. prompt:: bash $ + + ceph osd erasure-code-profile set myprofile \ + k=3 \ + m=2 \ + crush-failure-domain=rack + ceph osd pool create ecpool erasure myprofile + echo ABCDEFGHI | rados --pool ecpool put NYAN - + rados --pool ecpool get NYAN - + +:: + + ABCDEFGHI + +The *NYAN* object will be divided into three (*K=3*) and two additional +*chunks* will be created (*M=2*). The value of *M* defines how many +OSDs can be lost simultaneously without losing any data. The +*crush-failure-domain=rack* will create a CRUSH rule that ensures +no two *chunks* are stored in the same rack. + +.. 
ditaa:: + +-------------------+ + name | NYAN | + +-------------------+ + content | ABCDEFGHI | + +--------+----------+ + | + | + v + +------+------+ + +---------------+ encode(3,2) +-----------+ + | +--+--+---+---+ | + | | | | | + | +-------+ | +-----+ | + | | | | | + +--v---+ +--v---+ +--v---+ +--v---+ +--v---+ + name | NYAN | | NYAN | | NYAN | | NYAN | | NYAN | + +------+ +------+ +------+ +------+ +------+ + shard | 1 | | 2 | | 3 | | 4 | | 5 | + +------+ +------+ +------+ +------+ +------+ + content | ABC | | DEF | | GHI | | YXY | | QGC | + +--+---+ +--+---+ +--+---+ +--+---+ +--+---+ + | | | | | + | | v | | + | | +--+---+ | | + | | | OSD1 | | | + | | +------+ | | + | | | | + | | +------+ | | + | +------>| OSD2 | | | + | +------+ | | + | | | + | +------+ | | + | | OSD3 |<----+ | + | +------+ | + | | + | +------+ | + | | OSD4 |<--------------+ + | +------+ + | + | +------+ + +----------------->| OSD5 | + +------+ + + +More information can be found in the `erasure code profiles +<../erasure-code-profile>`_ documentation. + + +Erasure Coding with Overwrites +------------------------------ + +By default, erasure coded pools only work with uses like RGW that +perform full object writes and appends. + +Since Luminous, partial writes for an erasure coded pool may be +enabled with a per-pool setting. This lets RBD and CephFS store their +data in an erasure coded pool: + +.. prompt:: bash $ + + ceph osd pool set ec_pool allow_ec_overwrites true + +This can only be enabled on a pool residing on bluestore OSDs, since +bluestore's checksumming is used to detect bitrot or other corruption +during deep-scrub. In addition to being unsafe, using filestore with +ec overwrites yields low performance compared to bluestore. + +Erasure coded pools do not support omap, so to use them with RBD and +CephFS you must instruct them to store their data in an ec pool, and +their metadata in a replicated pool. 
For RBD, this means using the +erasure coded pool as the ``--data-pool`` during image creation: + +.. prompt:: bash $ + + rbd create --size 1G --data-pool ec_pool replicated_pool/image_name + +For CephFS, an erasure coded pool can be set as the default data pool during +file system creation or via `file layouts <../../../cephfs/file-layouts>`_. + + +Erasure coded pool and cache tiering +------------------------------------ + +Erasure coded pools require more resources than replicated pools and +lack some functionalities such as omap. To overcome these +limitations, one can set up a `cache tier <../cache-tiering>`_ +before the erasure coded pool. + +For instance, if the pool *hot-storage* is made of fast storage: + +.. prompt:: bash $ + + ceph osd tier add ecpool hot-storage + ceph osd tier cache-mode hot-storage writeback + ceph osd tier set-overlay ecpool hot-storage + +will place the *hot-storage* pool as tier of *ecpool* in *writeback* +mode so that every write and read to the *ecpool* are actually using +the *hot-storage* and benefit from its flexibility and speed. + +More information can be found in the `cache tiering +<../cache-tiering>`_ documentation. + +Erasure coded pool recovery +--------------------------- +If an erasure coded pool loses some shards, it must recover them from the others. +This generally involves reading from the remaining shards, reconstructing the data, and +writing it to the new peer. +In Octopus, erasure coded pools can recover as long as there are at least *K* shards +available. (With fewer than *K* shards, you have actually lost data!) + +Prior to Octopus, erasure coded pools required at least *min_size* shards to be +available, even if *min_size* is greater than *K*. (We generally recommend min_size +be *K+2* or more to prevent loss of writes and data.) 
+This conservative decision was made out of an abundance of caution when designing the new pool +mode but also meant pools with lost OSDs but no data loss were unable to recover and go active +without manual intervention to change the *min_size*. + +Glossary +-------- + +*chunk* + when the encoding function is called, it returns chunks of the same + size: data chunks, which can be concatenated to reconstruct the original + object, and coding chunks, which can be used to rebuild a lost chunk. + +*K* + the number of data *chunks*, i.e. the number of *chunks* into which the + original object is divided. For instance, if *K* = 2, a 10KB object + will be divided into *K* objects of 5KB each. + +*M* + the number of coding *chunks*, i.e. the number of additional *chunks* + computed by the encoding functions. If there are 2 coding *chunks*, + it means 2 OSDs can be out without losing data. + + +Table of contents +----------------- + +.. toctree:: + :maxdepth: 1 + + erasure-code-profile + erasure-code-jerasure + erasure-code-isa + erasure-code-lrc + erasure-code-shec + erasure-code-clay diff --git a/doc/rados/operations/health-checks.rst b/doc/rados/operations/health-checks.rst new file mode 100644 index 000000000..a8fa8243f --- /dev/null +++ b/doc/rados/operations/health-checks.rst @@ -0,0 +1,1549 @@ +.. _health-checks: + +============= +Health checks +============= + +Overview +======== + +There is a finite set of possible health messages that a Ceph cluster can +raise -- these are defined as *health checks* which have unique identifiers. + +The identifier is a terse pseudo-human-readable (i.e. like a variable name) +string. It is intended to enable tools (such as UIs) to make sense of +health checks, and present them in a way that reflects their meaning. + +This page lists the health checks that are raised by the monitor and manager +daemons. 
In addition to these, you may also see health checks that originate +from MDS daemons (see :ref:`cephfs-health-messages`), and health checks +that are defined by ceph-mgr python modules. + +Definitions +=========== + +Monitor +------- + +DAEMON_OLD_VERSION +__________________ + +Warn if old version(s) of Ceph are running on any daemons. +It will generate a health error if multiple versions are detected. +This condition must exist for longer than ``mon_warn_older_version_delay`` (set to 1 week by default) in order for the +health condition to be triggered. This allows most upgrades to proceed +without falsely seeing the warning. If the upgrade is paused for an extended +time period, the health check can be muted with +``ceph health mute DAEMON_OLD_VERSION --sticky``. In this case, after the +upgrade has finished, use ``ceph health unmute DAEMON_OLD_VERSION``. + +MON_DOWN +________ + +One or more monitor daemons are currently down. The cluster requires a +majority (more than 1/2) of the monitors in order to function. When +one or more monitors are down, clients may have a harder time forming +their initial connection to the cluster as they may need to try more +addresses before they reach an operating monitor. + +The down monitor daemon should generally be restarted as soon as +possible to reduce the risk of a subsequent monitor failure leading to +a service outage. + +MON_CLOCK_SKEW +______________ + +The clocks on the hosts running the ceph-mon monitor daemons are not +sufficiently well synchronized. This health alert is raised if the +cluster detects a clock skew greater than ``mon_clock_drift_allowed``. + +This is best resolved by synchronizing the clocks using a tool like +``ntpd`` or ``chrony``. + +If it is impractical to keep the clocks closely synchronized, the +``mon_clock_drift_allowed`` threshold can also be increased, but this +value must stay significantly below the ``mon_lease`` interval in +order for the monitor cluster to function properly. 
+ +MON_MSGR2_NOT_ENABLED +_____________________ + +The ``ms_bind_msgr2`` option is enabled but one or more monitors is +not configured to bind to a v2 port in the cluster's monmap. This +means that features specific to the msgr2 protocol (e.g., encryption) +are not available on some or all connections. + +In most cases this can be corrected by issuing the command: + +.. prompt:: bash $ + + ceph mon enable-msgr2 + +That command will change any monitor configured for the old default +port 6789 to continue to listen for v1 connections on 6789 and also +listen for v2 connections on the new default 3300 port. + +If a monitor is configured to listen for v1 connections on a non-standard port (not 6789), then the monmap will need to be modified manually. + + +MON_DISK_LOW +____________ + +One or more monitors is low on disk space. This alert triggers if the +available space on the file system storing the monitor database +(normally ``/var/lib/ceph/mon``), as a percentage, drops below +``mon_data_avail_warn`` (default: 30%). + +This may indicate that some other process or user on the system is +filling up the same file system used by the monitor. It may also +indicate that the monitors database is large (see ``MON_DISK_BIG`` +below). + +If space cannot be freed, the monitor's data directory may need to be +moved to another storage device or file system (while the monitor +daemon is not running, of course). + + +MON_DISK_CRIT +_____________ + +One or more monitors is critically low on disk space. This alert +triggers if the available space on the file system storing the monitor +database (normally ``/var/lib/ceph/mon``), as a percentage, drops +below ``mon_data_avail_crit`` (default: 5%). See ``MON_DISK_LOW``, above. + +MON_DISK_BIG +____________ + +The database size for one or more monitors is very large. This alert +triggers if the size of the monitor's database is larger than +``mon_data_size_warn`` (default: 15 GiB). 
+ +A large database is unusual, but may not necessarily indicate a +problem. Monitor databases may grow in size when there are placement +groups that have not reached an ``active+clean`` state in a long time. + +This may also indicate that the monitor's database is not properly +compacting, which has been observed with some older versions of +leveldb and rocksdb. Forcing a compaction with ``ceph daemon mon.<id> +compact`` may shrink the on-disk size. + +This warning may also indicate that the monitor has a bug that is +preventing it from pruning the cluster metadata it stores. If the +problem persists, please report a bug. + +The warning threshold may be adjusted with: + +.. prompt:: bash $ + + ceph config set global mon_data_size_warn <size> + +AUTH_INSECURE_GLOBAL_ID_RECLAIM +_______________________________ + +One or more clients or daemons are connected to the cluster that are +not securely reclaiming their global_id (a unique number identifying +each entity in the cluster) when reconnecting to a monitor. The +client is being permitted to connect anyway because the +``auth_allow_insecure_global_id_reclaim`` option is set to ``true`` (which may +be necessary until all Ceph clients have been upgraded), and the +``auth_expose_insecure_global_id_reclaim`` option is set to ``true`` (which +allows monitors to detect clients with insecure reclaim early by forcing them to +reconnect right after they first authenticate). + +You can identify which client(s) are using unpatched Ceph client code with: + +.. prompt:: bash $ + + ceph health detail + +Clients' global_id reclaim behavior can also be seen in the +``global_id_status`` field in the dump of clients connected to an +individual monitor (``reclaim_insecure`` means the client is +unpatched and is contributing to this health alert): + +.. prompt:: bash $ + + ceph tell mon.\* sessions + +We strongly recommend that all clients in the system are upgraded to a +newer version of Ceph that correctly reclaims global_id values. 
Once +all clients have been updated, you can stop allowing insecure reconnections +with: + +.. prompt:: bash $ + + ceph config set mon auth_allow_insecure_global_id_reclaim false + +If it is impractical to upgrade all clients immediately, you can silence +this warning temporarily with: + +.. prompt:: bash $ + + ceph health mute AUTH_INSECURE_GLOBAL_ID_RECLAIM 1w # 1 week + +Although we do NOT recommend doing so, you can also disable this warning +indefinitely with: + +.. prompt:: bash $ + + ceph config set mon mon_warn_on_insecure_global_id_reclaim false + +AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED +_______________________________________ + +Ceph is currently configured to allow clients to reconnect to monitors using +an insecure process to reclaim their previous global_id because the setting +``auth_allow_insecure_global_id_reclaim`` is set to ``true``. It may be necessary to +leave this setting enabled while existing Ceph clients are upgraded to newer +versions of Ceph that correctly and securely reclaim their global_id. + +If the ``AUTH_INSECURE_GLOBAL_ID_RECLAIM`` health alert has not also been raised and +the ``auth_expose_insecure_global_id_reclaim`` setting has not been disabled (it is +on by default), then there are currently no clients connected that need to be +upgraded, and it is safe to disallow insecure global_id reclaim with: + +.. prompt:: bash $ + + ceph config set mon auth_allow_insecure_global_id_reclaim false + +If there are still clients that need to be upgraded, then this alert can be +silenced temporarily with: + +.. prompt:: bash $ + + ceph health mute AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED 1w # 1 week + +Although we do NOT recommend doing so, you can also disable this warning indefinitely +with: + +.. prompt:: bash $ + + ceph config set mon mon_warn_on_insecure_global_id_reclaim_allowed false + + +Manager +------- + +MGR_DOWN +________ + +All manager daemons are currently down. 
The cluster should normally +have at least one running manager (``ceph-mgr``) daemon. If no +manager daemon is running, the cluster's ability to monitor itself will +be compromised, and parts of the management API will become +unavailable (for example, the dashboard will not work, and most CLI +commands that report metrics or runtime state will block). However, +the cluster will still be able to perform all IO operations and +recover from failures. + +The down manager daemon should generally be restarted as soon as +possible to ensure that the cluster can be monitored (e.g., so that +the ``ceph -s`` information is up to date, and/or metrics can be +scraped by Prometheus). + + +MGR_MODULE_DEPENDENCY +_____________________ + +An enabled manager module is failing its dependency check. This health check +should come with an explanatory message from the module about the problem. + +For example, a module might report that a required package is not installed; in that case, +install the required package and restart your manager daemons. + +This health check is only applied to enabled modules. If a module is +not enabled, you can see whether it is reporting dependency issues in +the output of `ceph mgr module ls`. + + +MGR_MODULE_ERROR +________________ + +A manager module has experienced an unexpected error. Typically, +this means an unhandled exception was raised from the module's `serve` +function. The human-readable description of the error may be obscurely +worded if the exception did not provide a useful description of itself. + +This health check may indicate a bug: please open a Ceph bug report if you +think you have encountered a bug. + +If you believe the error is transient, you may restart your manager +daemon(s), or use `ceph mgr fail` on the active daemon to prompt +a failover to another daemon. + + +OSDs +---- + +OSD_DOWN +________ + +One or more OSDs are marked down. The ceph-osd daemon may have been +stopped, or peer OSDs may be unable to reach the OSD over the network. 
+Common causes include a stopped or crashed daemon, a down host, or a +network outage. + +Verify the host is healthy, the daemon is started, and network is +functioning. If the daemon has crashed, the daemon log file +(``/var/log/ceph/ceph-osd.*``) may contain debugging information. + +OSD_<crush type>_DOWN +_____________________ + +(e.g. OSD_HOST_DOWN, OSD_ROOT_DOWN) + +All the OSDs within a particular CRUSH subtree are marked down, for example +all OSDs on a host. + +OSD_ORPHAN +__________ + +An OSD is referenced in the CRUSH map hierarchy but does not exist. + +The OSD can be removed from the CRUSH hierarchy with: + +.. prompt:: bash $ + + ceph osd crush rm osd.<id> + +OSD_OUT_OF_ORDER_FULL +_____________________ + +The utilization thresholds for `nearfull`, `backfillfull`, `full`, +and/or `failsafe_full` are not ascending. In particular, we expect +`nearfull < backfillfull`, `backfillfull < full`, and `full < +failsafe_full`. + +The thresholds can be adjusted with: + +.. prompt:: bash $ + + ceph osd set-nearfull-ratio <ratio> + ceph osd set-backfillfull-ratio <ratio> + ceph osd set-full-ratio <ratio> + + +OSD_FULL +________ + +One or more OSDs has exceeded the `full` threshold and is preventing +the cluster from servicing writes. + +Utilization by pool can be checked with: + +.. prompt:: bash $ + + ceph df + +The currently defined `full` ratio can be seen with: + +.. prompt:: bash $ + + ceph osd dump | grep full_ratio + +A short-term workaround to restore write availability is to raise the full +threshold by a small amount: + +.. prompt:: bash $ + + ceph osd set-full-ratio <ratio> + +New storage should be added to the cluster by deploying more OSDs or +existing data should be deleted in order to free up space. + +OSD_BACKFILLFULL +________________ + +One or more OSDs has exceeded the `backfillfull` threshold, which will +prevent data from being allowed to rebalance to this device. 
This is +an early warning that rebalancing may not be able to complete and that +the cluster is approaching full. + +Utilization by pool can be checked with: + +.. prompt:: bash $ + + ceph df + +OSD_NEARFULL +____________ + +One or more OSDs has exceeded the `nearfull` threshold. This is an early +warning that the cluster is approaching full. + +Utilization by pool can be checked with: + +.. prompt:: bash $ + + ceph df + +OSDMAP_FLAGS +____________ + +One or more cluster flags of interest has been set. These flags include: + +* *full* - the cluster is flagged as full and cannot serve writes +* *pauserd*, *pausewr* - paused reads or writes +* *noup* - OSDs are not allowed to start +* *nodown* - OSD failure reports are being ignored, such that the + monitors will not mark OSDs `down` +* *noin* - OSDs that were previously marked `out` will not be marked + back `in` when they start +* *noout* - down OSDs will not automatically be marked out after the + configured interval +* *nobackfill*, *norecover*, *norebalance* - recovery or data + rebalancing is suspended +* *noscrub*, *nodeep_scrub* - scrubbing is disabled +* *notieragent* - cache tiering activity is suspended + +With the exception of *full*, these flags can be set or cleared with: + +.. prompt:: bash $ + + ceph osd set <flag> + ceph osd unset <flag> + +OSD_FLAGS +_________ + +One or more OSDs or CRUSH {nodes,device classes} has a flag of interest set. +These flags include: + +* *noup*: these OSDs are not allowed to start +* *nodown*: failure reports for these OSDs will be ignored +* *noin*: if these OSDs were previously marked `out` automatically + after a failure, they will not be marked in when they start +* *noout*: if these OSDs are down they will not automatically be marked + `out` after the configured interval + +These flags can be set and cleared in batch with: + +.. prompt:: bash $ + + ceph osd set-group <flags> <who> + ceph osd unset-group <flags> <who> + +For example: + +.. 
prompt:: bash $ + + ceph osd set-group noup,noout osd.0 osd.1 + ceph osd unset-group noup,noout osd.0 osd.1 + ceph osd set-group noup,noout host-foo + ceph osd unset-group noup,noout host-foo + ceph osd set-group noup,noout class-hdd + ceph osd unset-group noup,noout class-hdd + +OLD_CRUSH_TUNABLES +__________________ + +The CRUSH map is using very old settings and should be updated. The +oldest tunables that can be used (i.e., the oldest client version that +can connect to the cluster) without triggering this health warning is +determined by the ``mon_crush_min_required_version`` config option. +See :ref:`crush-map-tunables` for more information. + +OLD_CRUSH_STRAW_CALC_VERSION +____________________________ + +The CRUSH map is using an older, non-optimal method for calculating +intermediate weight values for ``straw`` buckets. + +The CRUSH map should be updated to use the newer method +(``straw_calc_version=1``). See +:ref:`crush-map-tunables` for more information. + +CACHE_POOL_NO_HIT_SET +_____________________ + +One or more cache pools is not configured with a *hit set* to track +utilization, which will prevent the tiering agent from identifying +cold objects to flush and evict from the cache. + +Hit sets can be configured on the cache pool with: + +.. prompt:: bash $ + + ceph osd pool set <poolname> hit_set_type <type> + ceph osd pool set <poolname> hit_set_period <period-in-seconds> + ceph osd pool set <poolname> hit_set_count <number-of-hitsets> + ceph osd pool set <poolname> hit_set_fpp <target-false-positive-rate> + +OSD_NO_SORTBITWISE +__________________ + +No pre-luminous v12.y.z OSDs are running but the ``sortbitwise`` flag has not +been set. + +The ``sortbitwise`` flag must be set before luminous v12.y.z or newer +OSDs can start. You can safely set the flag with: + +.. 
prompt:: bash $ + + ceph osd set sortbitwise + +OSD_FILESTORE +__________________ + +Filestore has been deprecated, considering that Bluestore has been the default +objectstore for quite some time. Warn if OSDs are running Filestore. + +The 'mclock_scheduler' is not supported for filestore OSDs. Therefore, the +default 'osd_op_queue' is set to 'wpq' for filestore OSDs and is enforced +even if the user attempts to change it. + +Filestore OSDs can be listed with: + +.. prompt:: bash $ + + ceph report | jq -c '."osd_metadata" | .[] | select(.osd_objectstore | contains("filestore")) | {id, osd_objectstore}' + +If it is not feasible to migrate Filestore OSDs to Bluestore immediately, you +can silence this warning temporarily with: + +.. prompt:: bash $ + + ceph health mute OSD_FILESTORE + +POOL_FULL +_________ + +One or more pools has reached its quota and is no longer allowing writes. + +Pool quotas and utilization can be seen with: + +.. prompt:: bash $ + + ceph df detail + +You can either raise the pool quota with: + +.. prompt:: bash $ + + ceph osd pool set-quota <poolname> max_objects <num-objects> + ceph osd pool set-quota <poolname> max_bytes <num-bytes> + +or delete some existing data to reduce utilization. + +BLUEFS_SPILLOVER +________________ + +One or more OSDs that use the BlueStore backend have been allocated +`db` partitions (storage space for metadata, normally on a faster +device) but that space has filled, such that metadata has "spilled +over" onto the normal slow device. This isn't necessarily an error +condition or even unexpected, but if the administrator's expectation +was that all metadata would fit on the faster device, it indicates +that not enough space was provided. + +This warning can be disabled on all OSDs with: + +.. prompt:: bash $ + + ceph config set osd bluestore_warn_on_bluefs_spillover false + +Alternatively, it can be disabled on a specific OSD with: + +.. 
prompt:: bash $ + + ceph config set osd.123 bluestore_warn_on_bluefs_spillover false + +To provide more metadata space, the OSD in question could be destroyed and +reprovisioned. This will involve data migration and recovery. + +It may also be possible to expand the LVM logical volume backing the +`db` storage. If the underlying LV has been expanded, the OSD daemon +needs to be stopped and BlueFS informed of the device size change with: + +.. prompt:: bash $ + + ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-$ID + +BLUEFS_AVAILABLE_SPACE +______________________ + +To check how much space is free for BlueFS do: + +.. prompt:: bash $ + + ceph daemon osd.123 bluestore bluefs available + +This will output up to 3 values: `BDEV_DB free`, `BDEV_SLOW free` and +`available_from_bluestore`. `BDEV_DB` and `BDEV_SLOW` report amount of space that +has been acquired by BlueFS and is considered free. Value `available_from_bluestore` +denotes ability of BlueStore to relinquish more space to BlueFS. +It is normal that this value is different from amount of BlueStore free space, as +BlueFS allocation unit is typically larger than BlueStore allocation unit. +This means that only part of BlueStore free space will be acceptable for BlueFS. + +BLUEFS_LOW_SPACE +_________________ + +If BlueFS is running low on available free space and there is little +`available_from_bluestore` one can consider reducing BlueFS allocation unit size. +To simulate available space when allocation unit is different do: + +.. prompt:: bash $ + + ceph daemon osd.123 bluestore bluefs available <alloc-unit-size> + +BLUESTORE_FRAGMENTATION +_______________________ + +As BlueStore works free space on underlying storage will get fragmented. +This is normal and unavoidable but excessive fragmentation will cause slowdown. +To inspect BlueStore fragmentation one can do: + +.. prompt:: bash $ + + ceph daemon osd.123 bluestore allocator score block + +Score is given in [0-1] range. +[0.0 .. 
0.4] tiny fragmentation +[0.4 .. 0.7] small, acceptable fragmentation +[0.7 .. 0.9] considerable, but safe fragmentation +[0.9 .. 1.0] severe fragmentation, may impact BlueFS's ability to get space from BlueStore + +If a detailed report of free fragments is required, do: + +.. prompt:: bash $ + + ceph daemon osd.123 bluestore allocator dump block + +If the OSD process is not running, fragmentation can be +inspected with `ceph-bluestore-tool`. +To get the fragmentation score: + +.. prompt:: bash $ + + ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-123 --allocator block free-score + +To dump detailed free chunks: + +.. prompt:: bash $ + + ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-123 --allocator block free-dump + +BLUESTORE_LEGACY_STATFS +_______________________ + +In the Nautilus release, BlueStore tracks its internal usage +statistics on a per-pool granular basis, and one or more OSDs have +BlueStore volumes that were created prior to Nautilus. If *all* OSDs +are older than Nautilus, this just means that the per-pool metrics are +not available. However, if there is a mix of pre-Nautilus and +post-Nautilus OSDs, the cluster usage statistics reported by ``ceph +df`` will not be accurate. + +The old OSDs can be updated to use the new usage tracking scheme by stopping each OSD, running a repair operation, and then restarting it. For example, if ``osd.123`` needs to be updated: + +.. prompt:: bash $ + + systemctl stop ceph-osd@123 + ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-123 + systemctl start ceph-osd@123 + +This warning can be disabled with: + +.. prompt:: bash $ + + ceph config set global bluestore_warn_on_legacy_statfs false + +BLUESTORE_NO_PER_POOL_OMAP +__________________________ + +Starting with the Octopus release, BlueStore tracks omap space utilization +by pool, and one or more OSDs have volumes that were created prior to +Octopus. 
If not all OSDs are running BlueStore with the new tracking +enabled, the cluster will report an approximate value for per-pool omap usage +based on the most recent deep-scrub. + +The old OSDs can be updated to track by pool by stopping each OSD, +running a repair operation, and then restarting it. For example, if +``osd.123`` needs to be updated: + +.. prompt:: bash $ + + systemctl stop ceph-osd@123 + ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-123 + systemctl start ceph-osd@123 + +This warning can be disabled with: + +.. prompt:: bash $ + + ceph config set global bluestore_warn_on_no_per_pool_omap false + +BLUESTORE_NO_PER_PG_OMAP +__________________________ + +Starting with the Pacific release, BlueStore tracks omap space utilization +by PG, and one or more OSDs have volumes that were created prior to +Pacific. Per-PG omap enables faster PG removal when PGs migrate. + +The older OSDs can be updated to track by PG by stopping each OSD, +running a repair operation, and then restarting it. For example, if +``osd.123`` needs to be updated: + +.. prompt:: bash $ + + systemctl stop ceph-osd@123 + ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-123 + systemctl start ceph-osd@123 + +This warning can be disabled with: + +.. prompt:: bash $ + + ceph config set global bluestore_warn_on_no_per_pg_omap false + + +BLUESTORE_DISK_SIZE_MISMATCH +____________________________ + +One or more OSDs using BlueStore has an internal inconsistency between the size +of the physical device and the metadata tracking its size. This can lead to +the OSD crashing in the future. + +The OSDs in question should be destroyed and reprovisioned. Care should be +taken to do this one OSD at a time, and in a way that doesn't put any data at +risk. For example, if osd ``$N`` has the error: + +.. prompt:: bash $ + + ceph osd out osd.$N + while ! 
ceph osd safe-to-destroy osd.$N ; do sleep 1m ; done + ceph osd destroy osd.$N + ceph-volume lvm zap /path/to/device + ceph-volume lvm create --osd-id $N --data /path/to/device + +BLUESTORE_NO_COMPRESSION +________________________ + +One or more OSDs is unable to load a BlueStore compression plugin. +This can be caused by a broken installation, in which the ``ceph-osd`` +binary does not match the compression plugins, or a recent upgrade +that did not include a restart of the ``ceph-osd`` daemon. + +Verify that the package(s) on the host running the OSD(s) in question +are correctly installed and that the OSD daemon(s) have been +restarted. If the problem persists, check the OSD log for any clues +as to the source of the problem. + +BLUESTORE_SPURIOUS_READ_ERRORS +______________________________ + +One or more OSDs using BlueStore has detected spurious read errors on the main device. +BlueStore has recovered from these errors by retrying disk reads. +However, this might indicate issues with the underlying hardware, I/O subsystem, +etc., which could in theory cause permanent data corruption. +Some observations on the root cause can be found at +https://tracker.ceph.com/issues/22464 + +This alert doesn't require an immediate response, but the corresponding host might need +additional attention, e.g. upgrading to the latest OS/kernel versions and +monitoring H/W resource utilization. + +This warning can be disabled on all OSDs with: + +.. prompt:: bash $ + + ceph config set osd bluestore_warn_on_spurious_read_errors false + +Alternatively, it can be disabled on a specific OSD with: + +.. prompt:: bash $ + + ceph config set osd.123 bluestore_warn_on_spurious_read_errors false + + +Device health +------------- + +DEVICE_HEALTH +_____________ + +One or more devices is expected to fail soon, where the warning +threshold is controlled by the ``mgr/devicehealth/warn_threshold`` +config option. 
+ +This warning only applies to OSDs that are currently marked "in", so +the expected response to this failure is to mark the device "out" so +that data is migrated off of the device, and then to remove the +hardware from the system. Note that the marking out is normally done +automatically if ``mgr/devicehealth/self_heal`` is enabled based on +the ``mgr/devicehealth/mark_out_threshold``. + +Device health can be checked with: + +.. prompt:: bash $ + + ceph device info <device-id> + +Device life expectancy is set by a prediction model run by +the mgr or by an external tool via the command: + +.. prompt:: bash $ + + ceph device set-life-expectancy <device-id> <from> <to> + +You can change the stored life expectancy manually, but that usually +doesn't accomplish anything as whatever tool originally set it will +probably set it again, and changing the stored value does not affect +the actual health of the hardware device. + +DEVICE_HEALTH_IN_USE +____________________ + +One or more devices is expected to fail soon and has been marked "out" +of the cluster based on ``mgr/devicehealth/mark_out_threshold``, but it +is still participating in one or more PGs. This may be because it was +only recently marked "out" and data is still migrating, or because data +cannot be migrated off for some reason (e.g., the cluster is nearly +full, or the CRUSH hierarchy is such that there isn't another suitable +OSD to migrate the data to). + +This message can be silenced by disabling the self-heal behavior +(setting ``mgr/devicehealth/self_heal`` to false), by adjusting the +``mgr/devicehealth/mark_out_threshold``, or by addressing what is +preventing data from being migrated off of the ailing device. 
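
The self-heal option mentioned above is a ``ceph-mgr`` module option. As a
sketch only, assuming the option is managed through the centralized
configuration database under the ``mgr`` target, it could be turned off with:

.. prompt:: bash $

   ceph config set mgr mgr/devicehealth/self_heal false

Disabling self-heal only stops the automatic marking "out"; it does not
change the predicted health of the device.
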
+ +DEVICE_HEALTH_TOOMANY +_____________________ + +Too many devices are expected to fail soon and the +``mgr/devicehealth/self_heal`` behavior is enabled, such that marking +out all of the ailing devices would exceed the cluster's +``mon_osd_min_in_ratio`` ratio that prevents too many OSDs from being +automatically marked "out". + +This generally indicates that too many devices in your cluster are +expected to fail soon and you should take action to add newer +(healthier) devices before too many devices fail and data is lost. + +The health message can also be silenced by adjusting parameters like +``mon_osd_min_in_ratio`` or ``mgr/devicehealth/mark_out_threshold``, +but be warned that this will increase the likelihood of unrecoverable +data loss in the cluster. + + +Data health (pools & placement groups) +-------------------------------------- + +PG_AVAILABILITY +_______________ + +Data availability is reduced, meaning that the cluster is unable to +service potential read or write requests for some data in the cluster. +Specifically, one or more PGs is in a state that does not allow IO +requests to be serviced. Problematic PG states include *peering*, +*stale*, *incomplete*, and the lack of *active* (if those conditions do not clear +quickly). + +Detailed information about which PGs are affected is available from: + +.. prompt:: bash $ + + ceph health detail + +In most cases the root cause is that one or more OSDs is currently +down; see the discussion for ``OSD_DOWN`` above. + +The state of specific problematic PGs can be queried with: + +.. prompt:: bash $ + + ceph tell <pgid> query + +PG_DEGRADED +___________ + +Data redundancy is reduced for some data, meaning the cluster does not +have the desired number of replicas for all data (for replicated +pools) or erasure code fragments (for erasure coded pools). 
+Specifically, one or more PGs: + +* has the *degraded* or *undersized* flag set, meaning there are not + enough instances of that placement group in the cluster; +* has not had the *clean* flag set for some time. + +Detailed information about which PGs are affected is available from: + +.. prompt:: bash $ + + ceph health detail + +In most cases the root cause is that one or more OSDs is currently +down; see the discussion for ``OSD_DOWN`` above. + +The state of specific problematic PGs can be queried with: + +.. prompt:: bash $ + + ceph tell <pgid> query + + +PG_RECOVERY_FULL +________________ + +Data redundancy may be reduced or at risk for some data due to a lack +of free space in the cluster. Specifically, one or more PGs has the +*recovery_toofull* flag set, meaning that the +cluster is unable to migrate or recover data because one or more OSDs +is above the *full* threshold. + +See the discussion for *OSD_FULL* above for steps to resolve this condition. + +PG_BACKFILL_FULL +________________ + +Data redundancy may be reduced or at risk for some data due to a lack +of free space in the cluster. Specifically, one or more PGs has the +*backfill_toofull* flag set, meaning that the +cluster is unable to migrate or recover data because one or more OSDs +is above the *backfillfull* threshold. + +See the discussion for *OSD_BACKFILLFULL* above for +steps to resolve this condition. + +PG_DAMAGED +__________ + +Data scrubbing has discovered some problems with data consistency in +the cluster. Specifically, one or more PGs has the *inconsistent* or +*snaptrim_error* flag set, indicating that an earlier scrub operation +found a problem, or has the *repair* flag set, meaning that a repair +for such an inconsistency is currently in progress. + +See :doc:`pg-repair` for more information. + +OSD_SCRUB_ERRORS +________________ + +Recent OSD scrubs have uncovered inconsistencies. This error is generally +paired with *PG_DAMAGED* (see above). 
+ +See :doc:`pg-repair` for more information. + +OSD_TOO_MANY_REPAIRS +____________________ + +When a read error occurs and another replica is available, that replica is used to repair +the error immediately, so that the client can get the object data. Scrub +handles errors for data at rest. In order to identify possible failing disks +that aren't seeing scrub errors, a count of read repairs is maintained. If +the count exceeds the threshold set by ``mon_osd_warn_num_repaired`` (default: 10), +this health warning is generated. + +LARGE_OMAP_OBJECTS +__________________ + +One or more pools contain large omap objects as determined by +``osd_deep_scrub_large_omap_object_key_threshold`` (threshold for number of keys +to determine a large omap object) or +``osd_deep_scrub_large_omap_object_value_sum_threshold`` (the threshold for +summed size (bytes) of all key values to determine a large omap object) or both. +More information on the object name, key count, and size in bytes can be found +by searching the cluster log for 'Large omap object found'. Large omap objects +can be caused by RGW bucket index objects that do not have automatic resharding +enabled. Please see :ref:`RGW Dynamic Bucket Index Resharding +<rgw_dynamic_bucket_index_resharding>` for more information on resharding. + +The thresholds can be adjusted with: + +.. prompt:: bash $ + + ceph config set osd osd_deep_scrub_large_omap_object_key_threshold <keys> + ceph config set osd osd_deep_scrub_large_omap_object_value_sum_threshold <bytes> + +CACHE_POOL_NEAR_FULL +____________________ + +A cache tier pool is nearly full. Full in this context is determined +by the ``target_max_bytes`` and ``target_max_objects`` properties on +the cache pool. Once the pool reaches the target threshold, write +requests to the pool may block while data is flushed and evicted +from the cache, a state that normally leads to very high latencies and +poor performance. + +The cache pool target size can be adjusted with: + +.. 
prompt:: bash $ + + ceph osd pool set <cache-pool-name> target_max_bytes <bytes> + ceph osd pool set <cache-pool-name> target_max_objects <objects> + +Normal cache flush and evict activity may also be throttled due to reduced +availability or performance of the base tier, or overall cluster load. + +TOO_FEW_PGS +___________ + +The number of PGs in use in the cluster is below the configurable +threshold of ``mon_pg_warn_min_per_osd`` PGs per OSD. This can lead +to suboptimal distribution and balance of data across the OSDs in +the cluster, and similarly reduce overall performance. + +This may be an expected condition if data pools have not yet been +created. + +The PG count for existing pools can be increased or new pools can be created. +Please refer to :ref:`choosing-number-of-placement-groups` for more +information. + +POOL_PG_NUM_NOT_POWER_OF_TWO +____________________________ + +One or more pools has a ``pg_num`` value that is not a power of two. +Although this is not strictly incorrect, it does lead to a less +balanced distribution of data because some PGs have roughly twice as +much data as others. + +This is easily corrected by setting the ``pg_num`` value for the +affected pool(s) to a nearby power of two: + +.. prompt:: bash $ + + ceph osd pool set <pool-name> pg_num <value> + +This health warning can be disabled with: + +.. prompt:: bash $ + + ceph config set global mon_warn_on_pool_pg_num_not_power_of_two false + +POOL_TOO_FEW_PGS +________________ + +One or more pools should probably have more PGs, based on the amount +of data that is currently stored in the pool. This can lead to +suboptimal distribution and balance of data across the OSDs in the +cluster, and similarly reduce overall performance. This warning is +generated if the ``pg_autoscale_mode`` property on the pool is set to +``warn``. + +To disable the warning, you can disable auto-scaling of PGs for the +pool entirely with: + +.. 
prompt:: bash $ + + ceph osd pool set <pool-name> pg_autoscale_mode off + +To allow the cluster to automatically adjust the number of PGs: + +.. prompt:: bash $ + + ceph osd pool set <pool-name> pg_autoscale_mode on + +You can also manually set the number of PGs for the pool to the +recommended amount with: + +.. prompt:: bash $ + + ceph osd pool set <pool-name> pg_num <new-pg-num> + +Please refer to :ref:`choosing-number-of-placement-groups` and +:ref:`pg-autoscaler` for more information. + +TOO_MANY_PGS +____________ + +The number of PGs in use in the cluster is above the configurable +threshold of ``mon_max_pg_per_osd`` PGs per OSD. If this threshold is +exceeded, the cluster will not allow new pools to be created, pool ``pg_num`` to +be increased, or pool replication to be increased (any of which would lead to +more PGs in the cluster). A large number of PGs can lead +to higher memory utilization for OSD daemons, slower peering after +cluster state changes (like OSD restarts, additions, or removals), and +higher load on the Manager and Monitor daemons. + +The simplest way to mitigate the problem is to increase the number of +OSDs in the cluster by adding more hardware. Note that the OSD count +used for the purposes of this health check is the number of "in" OSDs, +so marking "out" OSDs "in" (if there are any) can also help: + +.. prompt:: bash $ + + ceph osd in <osd id(s)> + +Please refer to :ref:`choosing-number-of-placement-groups` for more +information. + +POOL_TOO_MANY_PGS +_________________ + +One or more pools should probably have fewer PGs, based on the amount +of data that is currently stored in the pool. This can lead to higher +memory utilization for OSD daemons, slower peering after cluster state +changes (like OSD restarts, additions, or removals), and higher load +on the Manager and Monitor daemons. This warning is generated if the +``pg_autoscale_mode`` property on the pool is set to ``warn``. 
+ +To disable the warning, you can disable auto-scaling of PGs for the +pool entirely with: + +.. prompt:: bash $ + + ceph osd pool set <pool-name> pg_autoscale_mode off + +To allow the cluster to automatically adjust the number of PGs: + +.. prompt:: bash $ + + ceph osd pool set <pool-name> pg_autoscale_mode on + +You can also manually set the number of PGs for the pool to the +recommended amount with: + +.. prompt:: bash $ + + ceph osd pool set <pool-name> pg_num <new-pg-num> + +Please refer to :ref:`choosing-number-of-placement-groups` and +:ref:`pg-autoscaler` for more information. + +POOL_TARGET_SIZE_BYTES_OVERCOMMITTED +____________________________________ + +One or more pools have a ``target_size_bytes`` property set to +estimate the expected size of the pool, +but the value(s) exceed the total available storage (either by +themselves or in combination with other pools' actual usage). + +This is usually an indication that the ``target_size_bytes`` value for +the pool is too large and should be reduced or set to zero with: + +.. prompt:: bash $ + + ceph osd pool set <pool-name> target_size_bytes 0 + +For more information, see :ref:`specifying_pool_target_size`. + +POOL_HAS_TARGET_SIZE_BYTES_AND_RATIO +____________________________________ + +One or more pools have both ``target_size_bytes`` and +``target_size_ratio`` set to estimate the expected size of the pool. +Only one of these properties should be non-zero. If both are set, +``target_size_ratio`` takes precedence and ``target_size_bytes`` is +ignored. + +To reset ``target_size_bytes`` to zero: + +.. prompt:: bash $ + + ceph osd pool set <pool-name> target_size_bytes 0 + +For more information, see :ref:`specifying_pool_target_size`. + +TOO_FEW_OSDS +____________ + +The number of OSDs in the cluster is below the configurable +threshold of ``osd_pool_default_size``. + +SMALLER_PGP_NUM +_______________ + +One or more pools has a ``pgp_num`` value less than ``pg_num``. 
This +is normally an indication that the PG count was increased without +a corresponding increase in ``pgp_num``, which controls placement. + +This is sometimes done deliberately to separate out the `split` step +when the PG count is adjusted from the data migration that is needed +when ``pgp_num`` is changed. + +This is normally resolved by setting ``pgp_num`` to match ``pg_num``, +triggering the data migration, with: + +.. prompt:: bash $ + + ceph osd pool set <pool> pgp_num <pg-num-value> + +MANY_OBJECTS_PER_PG +___________________ + +One or more pools has an average number of objects per PG that is +significantly higher than the overall cluster average. The specific +threshold is controlled by the ``mon_pg_warn_max_object_skew`` +configuration value. + +This is usually an indication that the pool(s) containing most of the +data in the cluster have too few PGs, and/or that other pools that do +not contain as much data have too many PGs. See the discussion of +*TOO_MANY_PGS* above. + +The threshold can be raised to silence the health warning by adjusting +the ``mon_pg_warn_max_object_skew`` config option on the managers. + +The health warning will be silenced for a particular pool if +``pg_autoscale_mode`` is set to ``on``. + +POOL_APP_NOT_ENABLED +____________________ + +A pool exists that contains one or more objects but has not been +tagged for use by a particular application. + +Resolve this warning by labeling the pool for use by an application. For +example, if the pool is used by RBD: + +.. prompt:: bash $ + + rbd pool init <poolname> + +If the pool is being used by a custom application 'foo', you can also label +it via the low-level command: + +.. prompt:: bash $ + + ceph osd pool application enable <poolname> foo + +For more information, see :ref:`associate-pool-to-application`. + +POOL_FULL +_________ + +One or more pools has reached (or is very close to reaching) its +quota. 
The threshold to trigger this error condition is controlled by +the ``mon_pool_quota_crit_threshold`` configuration option. + +Pool quotas can be adjusted up or down (or removed) with: + +.. prompt:: bash $ + + ceph osd pool set-quota <pool> max_bytes <bytes> + ceph osd pool set-quota <pool> max_objects <objects> + +Setting the quota value to 0 will disable the quota. + +POOL_NEAR_FULL +______________ + +One or more pools is approaching a configured fullness threshold. + +One threshold that can trigger this warning condition is the +``mon_pool_quota_warn_threshold`` configuration option. + +Pool quotas can be adjusted up or down (or removed) with: + +.. prompt:: bash $ + + ceph osd pool set-quota <pool> max_bytes <bytes> + ceph osd pool set-quota <pool> max_objects <objects> + +Setting the quota value to 0 will disable the quota. + +Other thresholds that can trigger the above two warning conditions are +``mon_osd_nearfull_ratio`` and ``mon_osd_full_ratio``. Visit the +:ref:`storage-capacity` and :ref:`no-free-drive-space` documents for details +and resolution. + +OBJECT_MISPLACED +________________ + +One or more objects in the cluster is not stored on the node the +cluster would like it to be stored on. This is an indication that +data migration due to some recent cluster change has not yet completed. + +Misplaced data is not a dangerous condition in and of itself; data +consistency is never at risk, and old copies of objects are never +removed until the desired number of new copies (in the desired +locations) are present. + +OBJECT_UNFOUND +______________ + +One or more objects in the cluster cannot be found. Specifically, the +OSDs know that a new or updated copy of an object should exist, but a +copy of that version of the object has not been found on OSDs that are +currently online. + +Read or write requests to unfound objects will block. + +Ideally, a down OSD can be brought back online that has the more +recent copy of the unfound object. 
Candidate OSDs can be identified from the +peering state for the PG(s) responsible for the unfound object: + +.. prompt:: bash $ + + ceph tell <pgid> query + +If the latest copy of the object is not available, the cluster can be +told to roll back to a previous version of the object. See +:ref:`failures-osd-unfound` for more information. + +SLOW_OPS +________ + +One or more OSD or monitor requests is taking a long time to process. This can +be an indication of extreme load, a slow storage device, or a software +bug. + +The request queue for the daemon in question can be queried with the +following command, executed from the daemon's host: + +.. prompt:: bash $ + + ceph daemon osd.<id> ops + +A summary of the slowest recent requests can be seen with: + +.. prompt:: bash $ + + ceph daemon osd.<id> dump_historic_ops + +The location of an OSD can be found with: + +.. prompt:: bash $ + + ceph osd find osd.<id> + +PG_NOT_SCRUBBED +_______________ + +One or more PGs has not been scrubbed recently. PGs are normally scrubbed +within the interval specified globally by +:ref:`osd_scrub_max_interval <osd_scrub_max_interval>`. This +interval can be overridden on a per-pool basis with +:ref:`scrub_max_interval <scrub_max_interval>`. The warning triggers when +``mon_warn_pg_not_scrubbed_ratio`` percentage of the interval has elapsed without a +scrub since it was due. + +PGs will not scrub if they are not flagged as *clean*, which may +happen if they are misplaced or degraded (see *PG_AVAILABILITY* and +*PG_DEGRADED* above). + +You can manually initiate a scrub of a clean PG with: + +.. prompt:: bash $ + + ceph pg scrub <pgid> + +PG_NOT_DEEP_SCRUBBED +____________________ + +One or more PGs has not been deep scrubbed recently. PGs are normally +deep scrubbed every ``osd_deep_scrub_interval`` seconds, and this warning +triggers when ``mon_warn_pg_not_deep_scrubbed_ratio`` percentage of the interval has elapsed +without a deep scrub since it was due. 
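
As a rough worked example of the trigger rule described above: if a deep scrub
becomes due one interval after the previous one, and the warning fires once a
further ``ratio`` share of the interval has passed, the warning is expected at
``interval * (1 + ratio)`` seconds after the last deep scrub. This is a sketch
only. The 7-day interval and the 0.75 ratio are illustrative values, not values
read from a cluster, and ``deep_scrub_warn_after`` is a hypothetical helper, not
a Ceph command:

.. code-block:: bash

   # Hypothetical helper: seconds after the last deep scrub at which the
   # warning would be expected, given an interval (seconds) and a ratio.
   deep_scrub_warn_after() {
       awk -v interval="$1" -v ratio="$2" \
           'BEGIN { printf "%d\n", interval * (1 + ratio) }'
   }

   # Illustrative values (assumed): 7 days = 604800 seconds, ratio = 0.75
   deep_scrub_warn_after 604800 0.75   # prints 1058400 (12.25 days)

With these assumed values, a PG would be reported roughly 12.25 days after its
last deep scrub. The same arithmetic applies to the regular scrub warning when
the scrub interval and ``mon_warn_pg_not_scrubbed_ratio`` are used instead.
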
+ +PGs will not (deep) scrub if they are not flagged as *clean*, which may +happen if they are misplaced or degraded (see *PG_AVAILABILITY* and +*PG_DEGRADED* above). + +You can manually initiate a deep scrub of a clean PG with: + +.. prompt:: bash $ + + ceph pg deep-scrub <pgid> + + +PG_SLOW_SNAP_TRIMMING +_____________________ + +The snapshot trim queue for one or more PGs has exceeded the +configured warning threshold. This indicates that either an extremely +large number of snapshots were recently deleted, or that the OSDs are +unable to trim snapshots quickly enough to keep up with the rate of +new snapshot deletions. + +The warning threshold is controlled by the +``mon_osd_snap_trim_queue_warn_on`` option (default: 32768). + +This warning may trigger if OSDs are under excessive load and unable +to keep up with their background work, or if the OSDs' internal +metadata database is heavily fragmented and unable to perform well. It may +also indicate some other performance issue with the OSDs. + +The exact size of the snapshot trim queue is reported by the +``snaptrimq_len`` field of ``ceph pg ls -f json-detail``. + +Miscellaneous +------------- + +RECENT_CRASH +____________ + +One or more Ceph daemons has crashed recently, and the crash has not +yet been archived (acknowledged) by the administrator. This may +indicate a software bug, a hardware problem (e.g., a failing disk), or +some other problem. + +New crashes can be listed with: + +.. prompt:: bash $ + + ceph crash ls-new + +Information about a specific crash can be examined with: + +.. prompt:: bash $ + + ceph crash info <crash-id> + +This warning can be silenced by "archiving" the crash (perhaps after +being examined by an administrator) so that it does not generate this +warning: + +.. prompt:: bash $ + + ceph crash archive <crash-id> + +Similarly, all new crashes can be archived with: + +.. 
prompt:: bash $ + + ceph crash archive-all + +Archived crashes will still be visible via ``ceph crash ls`` but not +``ceph crash ls-new``. + +The time period for what "recent" means is controlled by the option +``mgr/crash/warn_recent_interval`` (default: two weeks). + +These warnings can be disabled entirely with: + +.. prompt:: bash $ + + ceph config set mgr mgr/crash/warn_recent_interval 0 + +RECENT_MGR_MODULE_CRASH +_______________________ + +One or more ceph-mgr modules has crashed recently, and the crash has +not yet been archived (acknowledged) by the administrator. This +generally indicates a software bug in one of the software modules run +inside the ceph-mgr daemon. Although the module that experienced the +problem may be disabled as a result, the function of other modules +is normally unaffected. + +As with the *RECENT_CRASH* health alert, the crash can be inspected with: + +.. prompt:: bash $ + + ceph crash info <crash-id> + +This warning can be silenced by "archiving" the crash (perhaps after +being examined by an administrator) so that it does not generate this +warning: + +.. prompt:: bash $ + + ceph crash archive <crash-id> + +Similarly, all new crashes can be archived with: + +.. prompt:: bash $ + + ceph crash archive-all + +Archived crashes will still be visible via ``ceph crash ls`` but not +``ceph crash ls-new``. + +The time period for what "recent" means is controlled by the option +``mgr/crash/warn_recent_interval`` (default: two weeks). + +These warnings can be disabled entirely with: + +.. prompt:: bash $ + + ceph config set mgr mgr/crash/warn_recent_interval 0 + +TELEMETRY_CHANGED +_________________ + +Telemetry has been enabled, but the contents of the telemetry report +have changed since that time, so telemetry reports will not be sent. + +The Ceph developers periodically revise the telemetry feature to +include new and useful information, or to remove information found to +be useless or sensitive. 
If any new information is included in the +report, Ceph will require the administrator to re-enable telemetry to +ensure they have an opportunity to (re)review what information will be +shared. + +To review the contents of the telemetry report: + +.. prompt:: bash $ + + ceph telemetry show + +Note that the telemetry report consists of several optional channels +that may be independently enabled or disabled. For more information, see +:ref:`telemetry`. + +To re-enable telemetry (and make this warning go away): + +.. prompt:: bash $ + + ceph telemetry on + +To disable telemetry (and make this warning go away): + +.. prompt:: bash $ + + ceph telemetry off + +AUTH_BAD_CAPS +_____________ + +One or more auth users has capabilities that cannot be parsed by the +monitor. This generally indicates that the user will not be +authorized to perform any action with one or more daemon types. + +This error is most likely to occur after an upgrade if the +capabilities were set with an older version of Ceph that did not +properly validate their syntax, or if the syntax of the capabilities +has changed. + +The user in question can be removed with: + +.. prompt:: bash $ + + ceph auth rm <entity-name> + +(This will resolve the health alert, but obviously clients will not be +able to authenticate as that user.) + +Alternatively, the capabilities for the user can be updated with: + +.. prompt:: bash $ + + ceph auth caps <entity-name> <daemon-type> <caps> [<daemon-type> <caps> ...] + +For more information about auth capabilities, see :ref:`user-management`. + +OSD_NO_DOWN_OUT_INTERVAL +________________________ + +The ``mon_osd_down_out_interval`` option is set to zero, which means +that the system will not automatically perform any repair or healing +operations after an OSD fails. Instead, an administrator (or some +other external entity) will need to manually mark down OSDs as 'out' +(i.e., via ``ceph osd out <osd-id>``) in order to trigger recovery. 
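
As a sketch only, assuming the option is managed through the centralized
configuration database under the ``mon`` target, the current value could be
inspected and automatic marking "out" restored as follows. The value of 600
seconds (ten minutes) is chosen here purely for illustration:

.. prompt:: bash $

   ceph config get mon mon_osd_down_out_interval
   ceph config set mon mon_osd_down_out_interval 600
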
+ +This option is normally set to five or ten minutes--enough time for a +host to power-cycle or reboot. + +This warning can be silenced by setting the +``mon_warn_on_osd_down_out_interval_zero`` option to false: + +.. prompt:: bash $ + + ceph config set mon mon_warn_on_osd_down_out_interval_zero false + +DASHBOARD_DEBUG +_______________ + +The Dashboard debug mode is enabled. This means that, if there is an error +while processing a REST API request, the HTTP error response contains +a Python traceback. This behaviour should be disabled in production +environments because such a traceback might contain and expose sensitive +information. + +The debug mode can be disabled with: + +.. prompt:: bash $ + + ceph dashboard debug disable diff --git a/doc/rados/operations/index.rst b/doc/rados/operations/index.rst new file mode 100644 index 000000000..2136918c7 --- /dev/null +++ b/doc/rados/operations/index.rst @@ -0,0 +1,98 @@ +.. _rados-operations: + +==================== + Cluster Operations +==================== + +.. raw:: html + + <table><colgroup><col width="50%"><col width="50%"></colgroup><tbody valign="top"><tr><td><h3>High-level Operations</h3> + +High-level cluster operations consist primarily of starting, stopping, and +restarting a cluster with the ``ceph`` service; checking the cluster's health; +and, monitoring an operating cluster. + +.. toctree:: + :maxdepth: 1 + + operating + health-checks + monitoring + monitoring-osd-pg + user-management + pg-repair + +.. raw:: html + + </td><td><h3>Data Placement</h3> + +Once you have your cluster up and running, you may begin working with data +placement. Ceph supports petabyte-scale data storage clusters, with storage +pools and placement groups that distribute data across the cluster using Ceph's +CRUSH algorithm. + +.. toctree:: + :maxdepth: 1 + + data-placement + pools + erasure-code + cache-tiering + placement-groups + balancer + upmap + crush-map + crush-map-edits + stretch-mode + change-mon-elections + + + +.. 
raw:: html + + </td></tr><tr><td><h3>Low-level Operations</h3> + +Low-level cluster operations consist of starting, stopping, and restarting a +particular daemon within a cluster; changing the settings of a particular +daemon or subsystem; and, adding a daemon to the cluster or removing a daemon +from the cluster. The most common use cases for low-level operations include +growing or shrinking the Ceph cluster and replacing legacy or failed hardware +with new hardware. + +.. toctree:: + :maxdepth: 1 + + add-or-rm-osds + add-or-rm-mons + devices + bluestore-migration + Command Reference <control> + + + +.. raw:: html + + </td><td><h3>Troubleshooting</h3> + +Ceph is still on the leading edge, so you may encounter situations that require +you to evaluate your Ceph configuration and modify your logging and debugging +settings to identify and remedy issues you are encountering with your cluster. + +.. toctree:: + :maxdepth: 1 + + ../troubleshooting/community + ../troubleshooting/troubleshooting-mon + ../troubleshooting/troubleshooting-osd + ../troubleshooting/troubleshooting-pg + ../troubleshooting/log-and-debug + ../troubleshooting/cpu-profiling + ../troubleshooting/memory-profiling + + + + +.. raw:: html + + </td></tr></tbody></table> + diff --git a/doc/rados/operations/monitoring-osd-pg.rst b/doc/rados/operations/monitoring-osd-pg.rst new file mode 100644 index 000000000..3b997bfb4 --- /dev/null +++ b/doc/rados/operations/monitoring-osd-pg.rst @@ -0,0 +1,553 @@ +========================= + Monitoring OSDs and PGs +========================= + +High availability and high reliability require a fault-tolerant approach to +managing hardware and software issues. Ceph has no single point-of-failure, and +can service requests for data in a "degraded" mode. Ceph's `data placement`_ +introduces a layer of indirection to ensure that data doesn't bind directly to +particular OSD addresses. 
This means that tracking down system faults requires +finding the `placement group`_ and the underlying OSDs at the root of the problem. + +.. tip:: A fault in one part of the cluster may prevent you from accessing a + particular object, but that doesn't mean that you cannot access other objects. + When you run into a fault, don't panic. Just follow the steps for monitoring + your OSDs and placement groups. Then, begin troubleshooting. + +Ceph is generally self-repairing. However, when problems persist, monitoring +OSDs and placement groups will help you identify the problem. + + +Monitoring OSDs +=============== + +An OSD's status is either in the cluster (``in``) or out of the cluster +(``out``), and it is either up and running (``up``) or down and not +running (``down``). If an OSD is ``up``, it may be either ``in`` the cluster +(you can read and write data) or ``out`` of the cluster. If it was +``in`` the cluster and recently moved ``out`` of the cluster, Ceph will migrate +placement groups to other OSDs. If an OSD is ``out`` of the cluster, CRUSH will +not assign placement groups to the OSD. If an OSD is ``down``, it should also be +``out``. + +.. note:: If an OSD is ``down`` and ``in``, there is a problem and the cluster + will not be in a healthy state. + +.. ditaa:: + + +----------------+ +----------------+ + | | | | + | OSD #n In | | OSD #n Up | + | | | | + +----------------+ +----------------+ + ^ ^ + | | + | | + v v + +----------------+ +----------------+ + | | | | + | OSD #n Out | | OSD #n Down | + | | | | + +----------------+ +----------------+ + +If you execute a command such as ``ceph health``, ``ceph -s`` or ``ceph -w``, +you may notice that the cluster does not always echo back ``HEALTH_OK``. Don't +panic. With respect to OSDs, you should expect that the cluster will **NOT** +echo ``HEALTH_OK`` in a few expected circumstances: + +#. You haven't started the cluster yet (it won't respond). +#. 
You have just started or restarted the cluster and it's not ready yet, + because the placement groups are getting created and the OSDs are in + the process of peering. +#. You have just added or removed an OSD. +#. You have just modified your cluster map. + +An important aspect of monitoring OSDs is to ensure that, when the cluster +is up and running, all OSDs that are ``in`` the cluster are also ``up`` and +running. To see if all OSDs are running, execute: + +.. prompt:: bash $ + + ceph osd stat + +The result should tell you the total number of OSDs (x), +how many are ``up`` (y), how many are ``in`` (z) and the map epoch (eNNNN). :: + + x osds: y up, z in; epoch: eNNNN + +If the number of OSDs that are ``in`` the cluster is more than the number of +OSDs that are ``up``, execute the following command to identify the ``ceph-osd`` +daemons that are not running: + +.. prompt:: bash $ + + ceph osd tree + +:: + + #ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF + -1 2.00000 pool openstack + -3 2.00000 rack dell-2950-rack-A + -2 2.00000 host dell-2950-A1 + 0 ssd 1.00000 osd.0 up 1.00000 1.00000 + 1 ssd 1.00000 osd.1 down 1.00000 1.00000 + +.. tip:: The ability to search through a well-designed CRUSH hierarchy may help + you troubleshoot your cluster by identifying the physical locations faster. + +If an OSD is ``down``, start it: + +.. prompt:: bash $ + + sudo systemctl start ceph-osd@1 + +See `OSD Not Running`_ for problems associated with OSDs that have stopped or won't +restart. + + +PG Sets +======= + +When CRUSH assigns placement groups to OSDs, it looks at the number of replicas +for the pool and assigns the placement group to OSDs such that each replica of +the placement group gets assigned to a different OSD. For example, if the pool +requires three replicas of a placement group, CRUSH may assign them to +``osd.1``, ``osd.2`` and ``osd.3`` respectively. 
CRUSH actually seeks a +pseudo-random placement that will take into account failure domains you set in +your `CRUSH map`_, so you will rarely see placement groups assigned to nearest +neighbor OSDs in a large cluster. + +Ceph processes a client request using the **Acting Set**, which is the set of +OSDs that will actually handle the requests since they have a full and working +version of a placement group shard. The set of OSDs that should contain a shard +of a particular placement group is the **Up Set**, i.e., where data is +moved or copied to (or is planned to be). + +In some cases, an OSD in the Acting Set is ``down`` or otherwise not able to +service requests for objects in the placement group. When these situations +arise, don't panic. Common examples include: + +- You added or removed an OSD. Then, CRUSH reassigned the placement group to + other OSDs--thereby changing the composition of the Acting Set and spawning + the migration of data with a "backfill" process. +- An OSD was ``down``, was restarted, and is now ``recovering``. +- An OSD in the Acting Set is ``down`` or unable to service requests, + and another OSD has temporarily assumed its duties. + +In most cases, the Up Set and the Acting Set are identical. When they are not, +it may indicate that Ceph is migrating the PG (it's remapped), an OSD is +recovering, or that there is a problem (i.e., Ceph usually echoes a ``HEALTH_WARN`` +state with a "stuck stale" message in such scenarios). + +To retrieve a list of placement groups, execute: + +.. prompt:: bash $ + + ceph pg dump + +To view which OSDs are within the Acting Set or the Up Set for a given placement +group, execute: + +.. prompt:: bash $ + + ceph pg map {pg-num} + +The result should tell you the osdmap epoch (eNNN), the placement group number +({pg-num}), the OSDs in the Up Set (up[]), and the OSDs in the Acting Set +(acting[]):: + + osdmap eNNN pg {raw-pg-num} ({pg-num}) -> up [0,1,2] acting [0,1,2] + +.. 
note:: If the Up Set and Acting Set do not match, this may be an indicator + that the cluster is rebalancing itself or of a potential problem with + the cluster. + + +Peering +======= + +Before you can write data to a placement group, it must be in an ``active`` +state, and it **should** be in a ``clean`` state. For Ceph to determine the +current state of a placement group, the primary OSD of the placement group +(i.e., the first OSD in the acting set) peers with the secondary and tertiary +OSDs to establish agreement on the current state of the placement group +(assuming a pool with 3 replicas of the PG). + + +.. ditaa:: + + +---------+ +---------+ +-------+ + | OSD 1 | | OSD 2 | | OSD 3 | + +---------+ +---------+ +-------+ + | | | + | Request To | | + | Peer | | + |-------------->| | + |<--------------| | + | Peering | + | | + | Request To | + | Peer | + |----------------------------->| + |<-----------------------------| + | Peering | + +The OSDs also report their status to the monitor. See `Configuring Monitor/OSD +Interaction`_ for details. To troubleshoot peering issues, see `Peering +Failure`_. + + +Monitoring Placement Group States +================================= + +If you execute a command such as ``ceph health``, ``ceph -s`` or ``ceph -w``, +you may notice that the cluster does not always echo back ``HEALTH_OK``. After +you check to see if the OSDs are running, you should also check placement group +states. You should expect that the cluster will **NOT** echo ``HEALTH_OK`` in a +number of placement group peering-related circumstances: + +#. You have just created a pool and placement groups haven't peered yet. +#. The placement groups are recovering. +#. You have just added an OSD to or removed an OSD from the cluster. +#. You have just modified your CRUSH map and your placement groups are migrating. +#. There is inconsistent data in different replicas of a placement group. +#. Ceph is scrubbing a placement group's replicas. +#. 
Ceph doesn't have enough storage capacity to complete backfilling operations. + +If one of the foregoing circumstances causes Ceph to echo ``HEALTH_WARN``, don't +panic. In many cases, the cluster will recover on its own. In some cases, you +may need to take action. An important aspect of monitoring placement groups is +to ensure that, when the cluster is up and running, all placement groups are +``active``, and preferably in the ``clean`` state. To see the status of all +placement groups, execute: + +.. prompt:: bash $ + + ceph pg stat + +The result should tell you the total number of placement groups (x), how many +placement groups are in a particular state such as ``active+clean`` (y) and the +amount of data stored (z). :: + + x pgs: y active+clean; z bytes data, aa MB used, bb GB / cc GB avail + +.. note:: It is common for Ceph to report multiple states for placement groups. + +In addition to the placement group states, Ceph will also echo back the amount of +storage capacity used (aa), the amount of storage capacity remaining (bb), and the total +storage capacity (cc). These numbers can be important in a +few cases: + +- You are reaching your ``near full ratio`` or ``full ratio``. +- Your data is not getting distributed across the cluster due to an + error in your CRUSH configuration. + + +.. topic:: Placement Group IDs + + Placement group IDs consist of the pool number (not pool name) followed + by a period (.) and the placement group ID--a hexadecimal number. You + can view pool numbers and their names from the output of ``ceph osd + lspools``. For example, the first pool created corresponds to + pool number ``1``. A fully qualified placement group ID has the + following form:: + + {pool-num}.{pg-id} + + And it typically looks like this:: + + 1.1f + + +To retrieve a list of placement groups, execute the following: + +.. prompt:: bash $ + + ceph pg dump + +You can also format the output in JSON format and save it to a file: + +.. 
prompt:: bash $ + + ceph pg dump -o {filename} --format=json + +To query a particular placement group, execute the following: + +.. prompt:: bash $ + + ceph pg {poolnum}.{pg-id} query + +Ceph will output the query in JSON format. + +The following subsections describe the common pg states in detail. + +Creating +-------- + +When you create a pool, it will create the number of placement groups you +specified. Ceph will echo ``creating`` when it is creating one or more +placement groups. Once they are created, the OSDs that are part of a placement +group's Acting Set will peer. Once peering is complete, the placement group +status should be ``active+clean``, which means a Ceph client can begin writing +to the placement group. + +.. ditaa:: + + /-----------\ /-----------\ /-----------\ + | Creating |------>| Peering |------>| Active | + \-----------/ \-----------/ \-----------/ + +Peering +------- + +When Ceph is Peering a placement group, Ceph is bringing the OSDs that +store the replicas of the placement group into **agreement about the state** +of the objects and metadata in the placement group. When Ceph completes peering, +this means that the OSDs that store the placement group agree about the current +state of the placement group. However, completion of the peering process does +**NOT** mean that each replica has the latest contents. + +.. topic:: Authoritative History + + Ceph will **NOT** acknowledge a write operation to a client, until + all OSDs of the acting set persist the write operation. This practice + ensures that at least one member of the acting set will have a record + of every acknowledged write operation since the last successful + peering operation. + + With an accurate record of each acknowledged write operation, Ceph can + construct and disseminate a new authoritative history of the placement + group--a complete, and fully ordered set of operations that, if performed, + would bring an OSD’s copy of a placement group up to date. 
+ + +Active +------ + +Once Ceph completes the peering process, a placement group may become +``active``. The ``active`` state means that the data in the placement group is +generally available in the primary placement group and the replicas for read +and write operations. + + +Clean +----- + +When a placement group is in the ``clean`` state, the primary OSD and the +replica OSDs have successfully peered and there are no stray replicas for the +placement group. Ceph replicated all objects in the placement group the correct +number of times. + + +Degraded +-------- + +When a client writes an object to the primary OSD, the primary OSD is +responsible for writing the replicas to the replica OSDs. After the primary OSD +writes the object to storage, the placement group will remain in a ``degraded`` +state until the primary OSD has received an acknowledgement from the replica +OSDs that Ceph created the replica objects successfully. + +The reason a placement group can be ``active+degraded`` is that an OSD may be +``active`` even though it doesn't hold all of the objects yet. If an OSD goes +``down``, Ceph marks each placement group assigned to the OSD as ``degraded``. +The OSDs must peer again when the OSD comes back online. However, a client can +still write a new object to a ``degraded`` placement group if it is ``active``. + +If an OSD is ``down`` and the ``degraded`` condition persists, Ceph may mark the +``down`` OSD as ``out`` of the cluster and remap the data from the ``down`` OSD +to another OSD. The time between being marked ``down`` and being marked ``out`` +is controlled by ``mon osd down out interval``, which is set to ``600`` seconds +by default. + +A placement group can also be ``degraded``, because Ceph cannot find one or more +objects that Ceph thinks should be in the placement group. While you cannot +read or write to unfound objects, you can still access all of the other objects +in the ``degraded`` placement group. 
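As a rough illustration of how compound states such as ``active+degraded`` can be read, the following Python sketch splits a state string on ``+`` and applies the simplified rules described above. The function names are hypothetical, and the rule "I/O is possible while ``active`` is present" is a simplification of the text, not Ceph's internal logic.

```python
def parse_pg_state(state):
    """Split a compound PG state such as 'active+degraded' into a set of states."""
    return {part for part in state.strip().split("+") if part}


def can_serve_io(state):
    """Simplified rule from the text: a PG accepts reads and writes while active."""
    return "active" in parse_pg_state(state)


def is_degraded(state):
    """True when 'degraded' is one of the reported states."""
    return "degraded" in parse_pg_state(state)


if __name__ == "__main__":
    for sample in ("active+clean", "active+degraded", "peering"):
        print(sample, can_serve_io(sample), is_degraded(sample))
```

Splitting into a set keeps the check independent of the order in which the states are reported.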
+ + +Recovering +---------- + +Ceph was designed for fault-tolerance at a scale where hardware and software +problems are ongoing. When an OSD goes ``down``, its contents may fall behind +the current state of other replicas in the placement groups. When the OSD is +back ``up``, the contents of the placement groups must be updated to reflect the +current state. During that time period, the OSD may reflect a ``recovering`` +state. + +Recovery is not always trivial, because a hardware failure might cause a +cascading failure of multiple OSDs. For example, a network switch for a rack or +cabinet may fail, which can cause the OSDs of a number of host machines to fall +behind the current state of the cluster. Each one of the OSDs must recover once +the fault is resolved. + +Ceph provides a number of settings to balance the resource contention between +new service requests and the need to recover data objects and restore the +placement groups to the current state. The ``osd recovery delay start`` setting +allows an OSD to restart, re-peer and even process some replay requests before +starting the recovery process. +The ``osd recovery thread timeout`` setting sets a thread timeout, because multiple OSDs may fail, +restart and re-peer at staggered rates. The ``osd recovery max active`` setting +limits the number of recovery requests an OSD will entertain simultaneously to +prevent the OSD from failing to serve requests. The ``osd recovery max chunk`` setting +limits the size of the recovered data chunks to prevent network congestion. + + +Back Filling +------------ + +When a new OSD joins the cluster, CRUSH will reassign placement groups from OSDs +in the cluster to the newly added OSD. Forcing the new OSD to accept the +reassigned placement groups immediately can put excessive load on the new OSD. +Back filling the OSD with the placement groups allows this process to begin in +the background. Once backfilling is complete, the new OSD will begin serving +requests when it is ready. 
+ +During the backfill operations, you may see one of several states: +``backfill_wait`` indicates that a backfill operation is pending, but is not +underway yet; ``backfilling`` indicates that a backfill operation is underway; +and, ``backfill_toofull`` indicates that a backfill operation was requested, +but couldn't be completed due to insufficient storage capacity. When a +placement group cannot be backfilled, it may be considered ``incomplete``. + +The ``backfill_toofull`` state may be transient. It is possible that as PGs +are moved around, space may become available. The ``backfill_toofull`` state is +similar to ``backfill_wait`` in that backfill can proceed as soon as +conditions change. + +Ceph provides a number of settings to manage the load spike associated with +reassigning placement groups to an OSD (especially a new OSD). By default, +``osd_max_backfills`` sets the maximum number of concurrent backfills to and from +an OSD to 1. The ``backfill full ratio`` enables an OSD to refuse a +backfill request if the OSD is approaching its full ratio (90%, by default); +this ratio can be changed with the ``ceph osd set-backfillfull-ratio`` command. +If an OSD refuses a backfill request, the ``osd backfill retry interval`` +enables an OSD to retry the request (after 30 seconds, by default). OSDs can +also set ``osd backfill scan min`` and ``osd backfill scan max`` to manage scan +intervals (64 and 512, by default). + + +Remapped +-------- + +When the Acting Set that services a placement group changes, the data migrates +from the old acting set to the new acting set. It may take some time for a new +primary OSD to service requests. So it may ask the old primary to continue to +service requests until the placement group migration is complete. Once data +migration completes, the mapping uses the primary OSD of the new acting set. 
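One way to spot a remapped placement group is to compare the Up Set and the Acting Set reported by ``ceph pg map``. The Python sketch below parses a line in the ``osdmap eNNN pg ... -> up [...] acting [...]`` form shown earlier in this document. It is a minimal sketch: the helper names and the sample lines are hypothetical, and the exact output format may differ between Ceph releases.

```python
import re

# Matches the "up [..] acting [..]" tail of a `ceph pg map` output line.
PG_MAP_RE = re.compile(r"up \[(?P<up>[\d,]*)\] acting \[(?P<acting>[\d,]*)\]")


def parse_osd_list(text):
    """Turn '0,1,2' into [0, 1, 2]; an empty string yields an empty list."""
    return [int(item) for item in text.split(",") if item]


def up_and_acting(line):
    """Return (up_set, acting_set) as lists of OSD ids."""
    match = PG_MAP_RE.search(line)
    if match is None:
        raise ValueError("unrecognized pg map line: %r" % line)
    return parse_osd_list(match.group("up")), parse_osd_list(match.group("acting"))


def is_remapped(line):
    """Treat a PG as remapped when its Up Set differs from its Acting Set."""
    up, acting = up_and_acting(line)
    return up != acting


if __name__ == "__main__":
    # Hypothetical sample line, modelled on the format shown in this document.
    print(is_remapped("osdmap e538 pg 1.4 (1.4) -> up [3,1,2] acting [0,1,2]"))
```

The comparison is order-sensitive on purpose, because the first OSD in each list is the primary.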
+ + +Stale +----- + +While Ceph uses heartbeats to ensure that hosts and daemons are running, the +``ceph-osd`` daemons may also get into a ``stuck`` state where they are not +reporting statistics in a timely manner (e.g., a temporary network fault). By +default, OSD daemons report their placement group, up through, boot and failure +statistics every half second (i.e., ``0.5``), which is more frequent than the +heartbeat thresholds. If the **Primary OSD** of a placement group's acting set +fails to report to the monitor or if other OSDs have reported the primary OSD +``down``, the monitors will mark the placement group ``stale``. + +When you start your cluster, it is common to see the ``stale`` state until +the peering process completes. After your cluster has been running for a while, +seeing placement groups in the ``stale`` state indicates that the primary OSD +for those placement groups is ``down`` or not reporting placement group statistics +to the monitor. + + +Identifying Troubled PGs +======================== + +As previously noted, a placement group is not necessarily problematic just +because its state is not ``active+clean``. Generally, Ceph's ability to self +repair may not be working when placement groups get stuck. The stuck states +include: + +- **Unclean**: Placement groups contain objects that are not replicated the + desired number of times. They should be recovering. +- **Inactive**: Placement groups cannot process reads or writes because they + are waiting for an OSD with the most up-to-date data to come back ``up``. +- **Stale**: Placement groups are in an unknown state, because the OSDs that + host them have not reported to the monitor cluster in a while (configured + by ``mon osd report timeout``). + +To identify stuck placement groups, execute the following: + +.. prompt:: bash $ + + ceph pg dump_stuck [unclean|inactive|stale|undersized|degraded] + +See `Placement Group Subsystem`_ for additional details. 
To troubleshoot +stuck placement groups, see `Troubleshooting PG Errors`_. + + +Finding an Object Location +========================== + +To store object data in the Ceph Object Store, a Ceph client must: + +#. Set an object name +#. Specify a `pool`_ + +The Ceph client retrieves the latest cluster map and the CRUSH algorithm +calculates how to map the object to a `placement group`_, and then calculates +how to assign the placement group to an OSD dynamically. To find the object +location, all you need is the object name and the pool name. For example: + +.. prompt:: bash $ + + ceph osd map {poolname} {object-name} [namespace] + +.. topic:: Exercise: Locate an Object + + As an exercise, let's create an object. Specify an object name, a path + to a test file containing some object data and a pool name using the + ``rados put`` command on the command line. For example: + + .. prompt:: bash $ + + rados put {object-name} {file-path} --pool=data + rados put test-object-1 testfile.txt --pool=data + + To verify that the Ceph Object Store stored the object, execute the + following: + + .. prompt:: bash $ + + rados -p data ls + + Now, identify the object location: + + .. prompt:: bash $ + + ceph osd map {pool-name} {object-name} + ceph osd map data test-object-1 + + Ceph should output the object's location. For example:: + + osdmap e537 pool 'data' (1) object 'test-object-1' -> pg 1.d1743484 (1.4) -> up ([0,1], p0) acting ([0,1], p0) + + To remove the test object, simply delete it using the ``rados rm`` + command. For example: + + .. prompt:: bash $ + + rados rm test-object-1 --pool=data + + +As the cluster evolves, the object location may change dynamically. One benefit +of Ceph's dynamic rebalancing is that Ceph relieves you from having to perform +the migration manually. See the `Architecture`_ section for details. + +.. _data placement: ../data-placement +.. _pool: ../pools +.. _placement group: ../placement-groups +.. _Architecture: ../../../architecture +.. 
_OSD Not Running: ../../troubleshooting/troubleshooting-osd#osd-not-running +.. _Troubleshooting PG Errors: ../../troubleshooting/troubleshooting-pg#troubleshooting-pg-errors +.. _Peering Failure: ../../troubleshooting/troubleshooting-pg#failures-osd-peering +.. _CRUSH map: ../crush-map +.. _Configuring Monitor/OSD Interaction: ../../configuration/mon-osd-interaction/ +.. _Placement Group Subsystem: ../control#placement-group-subsystem diff --git a/doc/rados/operations/monitoring.rst b/doc/rados/operations/monitoring.rst new file mode 100644 index 000000000..4df711d8b --- /dev/null +++ b/doc/rados/operations/monitoring.rst @@ -0,0 +1,647 @@ +====================== + Monitoring a Cluster +====================== + +Once you have a running cluster, you may use the ``ceph`` tool to monitor your +cluster. Monitoring a cluster typically involves checking OSD status, monitor +status, placement group status and metadata server status. + +Using the command line +====================== + +Interactive mode +---------------- + +To run the ``ceph`` tool in interactive mode, type ``ceph`` at the command line +with no arguments. For example: + +.. prompt:: bash $ + + ceph + +.. prompt:: ceph> + :prompts: ceph> + + health + status + quorum_status + mon stat + +Non-default paths +----------------- + +If you specified non-default locations for your configuration or keyring, +you may specify their locations: + +.. prompt:: bash $ + + ceph -c /path/to/conf -k /path/to/keyring health + +Checking a Cluster's Status +=========================== + +After you start your cluster, and before you start reading and/or +writing data, check your cluster's status first. + +To check a cluster's status, execute the following: + +.. prompt:: bash $ + + ceph status + +Or: + +.. prompt:: bash $ + + ceph -s + +In interactive mode, type ``status`` and press **Enter**: + +.. prompt:: ceph> + :prompts: ceph> + + ceph> status + +Ceph will print the cluster status. 
For example, a tiny Ceph demonstration +cluster with one of each service may print the following: + +:: + + cluster: + id: 477e46f1-ae41-4e43-9c8f-72c918ab0a20 + health: HEALTH_OK + + services: + mon: 3 daemons, quorum a,b,c + mgr: x(active) + mds: cephfs_a-1/1/1 up {0=a=up:active}, 2 up:standby + osd: 3 osds: 3 up, 3 in + + data: + pools: 2 pools, 16 pgs + objects: 21 objects, 2.19K + usage: 546 GB used, 384 GB / 931 GB avail + pgs: 16 active+clean + + +.. topic:: How Ceph Calculates Data Usage + + The ``usage`` value reflects the *actual* amount of raw storage used. The + ``xxx GB / xxx GB`` value means the amount available (the lesser number) + of the overall storage capacity of the cluster. The notional number reflects + the size of the stored data before it is replicated, cloned or snapshotted. + Therefore, the amount of data actually stored typically exceeds the notional + amount stored, because Ceph creates replicas of the data and may also use + storage capacity for cloning and snapshotting. + + +Watching a Cluster +================== + +In addition to local logging by each daemon, Ceph clusters maintain +a *cluster log* that records high level events about the whole system. +This is logged to disk on monitor servers (as ``/var/log/ceph/ceph.log`` by +default), but can also be monitored via the command line. + +To follow the cluster log, use the following command: + +.. prompt:: bash $ + + ceph -w + +Ceph will print the status of the system, followed by each log message as it +is emitted. 
For example: + +:: + + cluster: + id: 477e46f1-ae41-4e43-9c8f-72c918ab0a20 + health: HEALTH_OK + + services: + mon: 3 daemons, quorum a,b,c + mgr: x(active) + mds: cephfs_a-1/1/1 up {0=a=up:active}, 2 up:standby + osd: 3 osds: 3 up, 3 in + + data: + pools: 2 pools, 16 pgs + objects: 21 objects, 2.19K + usage: 546 GB used, 384 GB / 931 GB avail + pgs: 16 active+clean + + + 2017-07-24 08:15:11.329298 mon.a mon.0 172.21.9.34:6789/0 23 : cluster [INF] osd.0 172.21.9.34:6806/20527 boot + 2017-07-24 08:15:14.258143 mon.a mon.0 172.21.9.34:6789/0 39 : cluster [INF] Activating manager daemon x + 2017-07-24 08:15:15.446025 mon.a mon.0 172.21.9.34:6789/0 47 : cluster [INF] Manager daemon x is now available + + +In addition to using ``ceph -w`` to print log lines as they are emitted, +use ``ceph log last [n]`` to see the most recent ``n`` lines from the cluster +log. + +Monitoring Health Checks +======================== + +Ceph continuously runs various *health checks* against its own status. When +a health check fails, this is reflected in the output of ``ceph status`` (or +``ceph health``). In addition, messages are sent to the cluster log to +indicate when a check fails, and when the cluster recovers. 
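Cluster log lines like those shown in this section can be split into fields for filtering or alerting. The Python sketch below is one way to do that. The field names (``source``, ``entity``, ``seq`` and so on) are labels chosen for this example, not names defined by Ceph, and the regular expression is an assumption based only on the sample lines in this document.

```python
import re

# Layout assumed from the sample lines in this document:
# date time source entity address seq : channel [LEVEL] message
LOG_RE = re.compile(
    r"^(?P<date>\S+) (?P<time>\S+) (?P<source>\S+) (?P<entity>\S+) "
    r"(?P<addr>\S+) (?P<seq>\d+) : (?P<channel>\S+) "
    r"\[(?P<level>[A-Z]+)\] (?P<message>.*)$"
)

# A health check code such as (OSD_DOWN) at the very end of the message.
CODE_RE = re.compile(r"\(([A-Z][A-Z0-9_]+)\)\s*$")


def parse_cluster_log_line(line):
    """Return a dict of fields, or None if the line does not match the layout."""
    match = LOG_RE.match(line.strip())
    if match is None:
        return None
    entry = match.groupdict()
    code = CODE_RE.search(entry["message"])
    # Only a trailing "(CODE)" is extracted; other messages get code=None.
    entry["code"] = code.group(1) if code else None
    return entry


if __name__ == "__main__":
    sample = ("2017-07-25 10:08:58.265945 mon.a mon.0 172.21.9.34:6789/0 91 : "
              "cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)")
    print(parse_cluster_log_line(sample))
```

Lines that do not fit the assumed layout return ``None`` rather than raising, so the function can be applied to a mixed stream.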
+ +For example, when an OSD goes down, the ``health`` section of the status +output may be updated as follows: + +:: + + health: HEALTH_WARN + 1 osds down + Degraded data redundancy: 21/63 objects degraded (33.333%), 16 pgs unclean, 16 pgs degraded + +At this time, cluster log messages are also emitted to record the failure of the +health checks: + +:: + + 2017-07-25 10:08:58.265945 mon.a mon.0 172.21.9.34:6789/0 91 : cluster [WRN] Health check failed: 1 osds down (OSD_DOWN) + 2017-07-25 10:09:01.302624 mon.a mon.0 172.21.9.34:6789/0 94 : cluster [WRN] Health check failed: Degraded data redundancy: 21/63 objects degraded (33.333%), 16 pgs unclean, 16 pgs degraded (PG_DEGRADED) + +When the OSD comes back online, the cluster log records the cluster's return +to a healthy state: + +:: + + 2017-07-25 10:11:11.526841 mon.a mon.0 172.21.9.34:6789/0 109 : cluster [WRN] Health check update: Degraded data redundancy: 2 pgs unclean, 2 pgs degraded, 2 pgs undersized (PG_DEGRADED) + 2017-07-25 10:11:13.535493 mon.a mon.0 172.21.9.34:6789/0 110 : cluster [INF] Health check cleared: PG_DEGRADED (was: Degraded data redundancy: 2 pgs unclean, 2 pgs degraded, 2 pgs undersized) + 2017-07-25 10:11:13.535577 mon.a mon.0 172.21.9.34:6789/0 111 : cluster [INF] Cluster is now healthy + +Network Performance Checks +-------------------------- + +Ceph OSDs send heartbeat ping messages amongst themselves to monitor daemon availability. We +also use the response times to monitor network performance. +While it is possible that a busy OSD could delay a ping response, we can assume +that if a network switch fails, multiple delays will be detected between distinct pairs of OSDs. + +By default we will warn about ping times which exceed 1 second (1000 milliseconds). + +:: + + HEALTH_WARN Slow OSD heartbeats on back (longest 1118.001ms) + +The health detail output lists the combinations of OSDs that are seeing the delays and by how much. There is a limit of 10 +detail line items. 
+ +:: + + [WRN] OSD_SLOW_PING_TIME_BACK: Slow OSD heartbeats on back (longest 1118.001ms) + Slow OSD heartbeats on back from osd.0 [dc1,rack1] to osd.1 [dc1,rack1] 1118.001 msec possibly improving + Slow OSD heartbeats on back from osd.0 [dc1,rack1] to osd.2 [dc1,rack2] 1030.123 msec + Slow OSD heartbeats on back from osd.2 [dc1,rack2] to osd.1 [dc1,rack1] 1015.321 msec + Slow OSD heartbeats on back from osd.1 [dc1,rack1] to osd.0 [dc1,rack1] 1010.456 msec + +To see even more detail and a complete dump of network performance information the ``dump_osd_network`` command can be used. Typically, this would be +sent to a mgr, but it can be limited to a particular OSD's interactions by issuing it to any OSD. The current threshold which defaults to 1 second +(1000 milliseconds) can be overridden as an argument in milliseconds. + +The following command will show all gathered network performance data by specifying a threshold of 0 and sending to the mgr. + +.. prompt:: bash $ + + ceph daemon /var/run/ceph/ceph-mgr.x.asok dump_osd_network 0 + +:: + + { + "threshold": 0, + "entries": [ + { + "last update": "Wed Sep 4 17:04:49 2019", + "stale": false, + "from osd": 2, + "to osd": 0, + "interface": "front", + "average": { + "1min": 1.023, + "5min": 0.860, + "15min": 0.883 + }, + "min": { + "1min": 0.818, + "5min": 0.607, + "15min": 0.607 + }, + "max": { + "1min": 1.164, + "5min": 1.173, + "15min": 1.544 + }, + "last": 0.924 + }, + { + "last update": "Wed Sep 4 17:04:49 2019", + "stale": false, + "from osd": 2, + "to osd": 0, + "interface": "back", + "average": { + "1min": 0.968, + "5min": 0.897, + "15min": 0.830 + }, + "min": { + "1min": 0.860, + "5min": 0.563, + "15min": 0.502 + }, + "max": { + "1min": 1.171, + "5min": 1.216, + "15min": 1.456 + }, + "last": 0.845 + }, + { + "last update": "Wed Sep 4 17:04:48 2019", + "stale": false, + "from osd": 0, + "to osd": 1, + "interface": "front", + "average": { + "1min": 0.965, + "5min": 0.811, + "15min": 0.850 + }, + "min": { + 
"1min": 0.650, + "5min": 0.488, + "15min": 0.466 + }, + "max": { + "1min": 1.252, + "5min": 1.252, + "15min": 1.362 + }, + "last": 0.791 + }, + ... + + + +Muting health checks +-------------------- + +Health checks can be muted so that they do not affect the overall +reported status of the cluster. Alerts are specified using the health +check code (see :ref:`health-checks`): + +.. prompt:: bash $ + + ceph health mute <code> + +For example, if there is a health warning, muting it will make the +cluster report an overall status of ``HEALTH_OK``. For example, to +mute an ``OSD_DOWN`` alert,: + +.. prompt:: bash $ + + ceph health mute OSD_DOWN + +Mutes are reported as part of the short and long form of the ``ceph health`` command. +For example, in the above scenario, the cluster would report: + +.. prompt:: bash $ + + ceph health + +:: + + HEALTH_OK (muted: OSD_DOWN) + +.. prompt:: bash $ + + ceph health detail + +:: + + HEALTH_OK (muted: OSD_DOWN) + (MUTED) OSD_DOWN 1 osds down + osd.1 is down + +A mute can be explicitly removed with: + +.. prompt:: bash $ + + ceph health unmute <code> + +For example: + +.. prompt:: bash $ + + ceph health unmute OSD_DOWN + +A health check mute may optionally have a TTL (time to live) +associated with it, such that the mute will automatically expire +after the specified period of time has elapsed. The TTL is specified as an optional +duration argument, e.g.: + +.. prompt:: bash $ + + ceph health mute OSD_DOWN 4h # mute for 4 hours + ceph health mute MON_DOWN 15m # mute for 15 minutes + +Normally, if a muted health alert is resolved (e.g., in the example +above, the OSD comes back up), the mute goes away. If the alert comes +back later, it will be reported in the usual way. + +It is possible to make a mute "sticky" such that the mute will remain even if the +alert clears. For example: + +.. 
prompt:: bash $ + + ceph health mute OSD_DOWN 1h --sticky # ignore any/all down OSDs for next hour + +Most health mutes also disappear if the extent of an alert gets worse. For example, +if there is one OSD down, and the alert is muted, the mute will disappear if one +or more additional OSDs go down. This is true for any health alert that involves +a count indicating how much or how many of something is triggering the warning or +error. + + +Detecting configuration issues +============================== + +In addition to the health checks that Ceph continuously runs on its +own status, there are some configuration issues that may only be detected +by an external tool. + +Use the `ceph-medic`_ tool to run these additional checks on your Ceph +cluster's configuration. + +Checking a Cluster's Usage Stats +================================ + +To check a cluster's data usage and data distribution among pools, you can +use the ``df`` option. It is similar to Linux ``df``. Execute +the following: + +.. prompt:: bash $ + + ceph df + +The output of ``ceph df`` looks like this:: + + CLASS SIZE AVAIL USED RAW USED %RAW USED + ssd 202 GiB 200 GiB 2.0 GiB 2.0 GiB 1.00 + TOTAL 202 GiB 200 GiB 2.0 GiB 2.0 GiB 1.00 + + --- POOLS --- + POOL ID PGS STORED (DATA) (OMAP) OBJECTS USED (DATA) (OMAP) %USED MAX AVAIL QUOTA OBJECTS QUOTA BYTES DIRTY USED COMPR UNDER COMPR + device_health_metrics 1 1 242 KiB 15 KiB 227 KiB 4 251 KiB 24 KiB 227 KiB 0 297 GiB N/A N/A 4 0 B 0 B + cephfs.a.meta 2 32 6.8 KiB 6.8 KiB 0 B 22 96 KiB 96 KiB 0 B 0 297 GiB N/A N/A 22 0 B 0 B + cephfs.a.data 3 32 0 B 0 B 0 B 0 0 B 0 B 0 B 0 99 GiB N/A N/A 0 0 B 0 B + test 4 32 22 MiB 22 MiB 50 KiB 248 19 MiB 19 MiB 50 KiB 0 297 GiB N/A N/A 248 0 B 0 B + + + + + +- **CLASS:** for example, "ssd" or "hdd" +- **SIZE:** The amount of storage capacity managed by the cluster. +- **AVAIL:** The amount of free space available in the cluster. 
+- **USED:** The amount of raw storage consumed by user data (excluding + BlueStore's database). +- **RAW USED:** The amount of raw storage consumed by user data, internal + overhead, or reserved capacity. +- **%RAW USED:** The percentage of raw storage used. Use this number in + conjunction with the ``full ratio`` and ``near full ratio`` to ensure that + you are not reaching your cluster's capacity. See `Storage Capacity`_ for + additional details. + + +**POOLS:** + +The **POOLS** section of the output provides a list of pools and the notional +usage of each pool. The output from this section **DOES NOT** reflect replicas, +clones or snapshots. For example, if you store an object with 1MB of data, the +notional usage will be 1MB, but the actual usage may be 2MB or more depending +on the number of replicas, clones and snapshots. + +- **ID:** The numeric ID of the pool. +- **STORED:** The actual amount of data that the user or Ceph has stored in a pool. This is + similar to the USED column in earlier versions of Ceph, but the calculations + (for BlueStore) are more precise (gaps are properly handled). + + - **(DATA):** usage for RBD (RADOS Block Device), CephFS file data, and RGW + (RADOS Gateway) object data. + - **(OMAP):** key-value pairs. Used primarily by CephFS and RGW (RADOS + Gateway) for metadata storage. + +- **OBJECTS:** The notional number of objects stored per pool. "Notional" is + defined above in the paragraph immediately under "POOLS". +- **USED:** The space allocated for a pool over all OSDs. This includes + replication, allocation granularity, and erasure-coding overhead. Compression + savings and object content gaps are also taken into account. BlueStore's + database is not included in this amount. + + - **(DATA):** object usage for RBD (RADOS Block Device), CephFS file data, and RGW + (RADOS Gateway) object data. + - **(OMAP):** object key-value pairs. Used primarily by CephFS and RGW (RADOS + Gateway) for metadata storage. 
+ +- **%USED:** The notional percentage of storage used per pool. +- **MAX AVAIL:** An estimate of the notional amount of data that can be written + to this pool. +- **QUOTA OBJECTS:** The object-count quota set on the pool (``N/A`` if no quota is set). +- **QUOTA BYTES:** The byte quota set on the pool (``N/A`` if no quota is set). +- **DIRTY:** The number of objects in the cache pool that have been written to + the cache pool but have not been flushed yet to the base pool. This field is + only available when cache tiering is in use. +- **USED COMPR:** The amount of space allocated for compressed data (i.e., this + includes compressed data plus all the allocation, replication and erasure + coding overhead). +- **UNDER COMPR:** The amount of data passed through compression (summed over all + replicas) and beneficial enough to be stored in a compressed form. + + +.. note:: The numbers in the POOLS section are notional. They are not + inclusive of the number of replicas, snapshots or clones. As a result, the + sum of the USED and %USED amounts will not add up to the USED and %USED + amounts in the RAW section of the output. + +.. note:: The MAX AVAIL value is a complicated function of the replication + or erasure code used, the CRUSH rule that maps storage to devices, the + utilization of those devices, and the configured ``mon_osd_full_ratio``. + + +Checking OSD Status +=================== + +You can check OSDs to ensure they are ``up`` and ``in`` by executing the +following command: + +.. prompt:: bash # + + ceph osd stat + +Or: + +.. prompt:: bash # + + ceph osd dump + +You can also view OSDs according to their position in the CRUSH map by +using the following command: + +.. prompt:: bash # + + ceph osd tree + +Ceph will print out a CRUSH tree with a host, its OSDs, whether they are up, +and their weight: + +.. 
code-block:: bash + + #ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF + -1 3.00000 pool default + -3 3.00000 rack mainrack + -2 3.00000 host osd-host + 0 ssd 1.00000 osd.0 up 1.00000 1.00000 + 1 ssd 1.00000 osd.1 up 1.00000 1.00000 + 2 ssd 1.00000 osd.2 up 1.00000 1.00000 + +For a detailed discussion, refer to `Monitoring OSDs and Placement Groups`_. + +Checking Monitor Status +======================= + +If your cluster has multiple monitors (likely), you should check the monitor +quorum status after you start the cluster and before reading and/or writing data. A +quorum must be present when multiple monitors are running. You should also check +monitor status periodically to ensure that they are running. + +To display the monitor map, execute the following: + +.. prompt:: bash $ + + ceph mon stat + +Or: + +.. prompt:: bash $ + + ceph mon dump + +To check the quorum status for the monitor cluster, execute the following: + +.. prompt:: bash $ + + ceph quorum_status + +Ceph will return the quorum status. For example, a Ceph cluster consisting of +three monitors may return the following: + +.. code-block:: javascript + + { "election_epoch": 10, + "quorum": [ + 0, + 1, + 2], + "quorum_names": [ + "a", + "b", + "c"], + "quorum_leader_name": "a", + "monmap": { "epoch": 1, + "fsid": "444b489c-4f16-4b75-83f0-cb8097468898", + "modified": "2011-12-12 13:28:27.505520", + "created": "2011-12-12 13:28:27.505520", + "features": {"persistent": [ + "kraken", + "luminous", + "mimic"], + "optional": [] + }, + "mons": [ + { "rank": 0, + "name": "a", + "addr": "127.0.0.1:6789/0", + "public_addr": "127.0.0.1:6789/0"}, + { "rank": 1, + "name": "b", + "addr": "127.0.0.1:6790/0", + "public_addr": "127.0.0.1:6790/0"}, + { "rank": 2, + "name": "c", + "addr": "127.0.0.1:6791/0", + "public_addr": "127.0.0.1:6791/0"} + ] + } + } + +Checking MDS Status +=================== + +Metadata servers provide metadata services for CephFS. 
Metadata servers have +two sets of states: ``up | down`` and ``active | inactive``. To ensure your +metadata servers are ``up`` and ``active``, execute the following: + +.. prompt:: bash $ + + ceph mds stat + +To display details of the metadata cluster, execute the following: + +.. prompt:: bash $ + + ceph fs dump + + +Checking Placement Group States +=============================== + +Placement groups map objects to OSDs. When you monitor your +placement groups, you will want them to be ``active`` and ``clean``. +For a detailed discussion, refer to `Monitoring OSDs and Placement Groups`_. + +.. _Monitoring OSDs and Placement Groups: ../monitoring-osd-pg + +.. _rados-monitoring-using-admin-socket: + +Using the Admin Socket +====================== + +The Ceph admin socket allows you to query a daemon via a socket interface. +By default, Ceph sockets reside under ``/var/run/ceph``. To access a daemon +via the admin socket, log in to the host running the daemon and use one of the +following commands: + +.. prompt:: bash $ + + ceph daemon {daemon-name} + ceph daemon {path-to-socket-file} + +For example, the following are equivalent: + +.. prompt:: bash $ + + ceph daemon osd.0 foo + ceph daemon /var/run/ceph/ceph-osd.0.asok foo + +To view the available admin socket commands, execute the following command: + +.. prompt:: bash $ + + ceph daemon {daemon-name} help + +The admin socket command enables you to show and set your configuration at +runtime. See `Viewing a Configuration at Runtime`_ for details. + +Additionally, you can set configuration values at runtime directly (i.e., the +admin socket bypasses the monitor, unlike ``ceph tell {daemon-type}.{id} +config set``, which relies on the monitor but doesn't require you to log in +directly to the host in question). + +.. _Viewing a Configuration at Runtime: ../../configuration/ceph-conf#viewing-a-configuration-at-runtime +.. _Storage Capacity: ../../configuration/mon-config-ref#storage-capacity +.. 
_ceph-medic: http://docs.ceph.com/ceph-medic/master/ diff --git a/doc/rados/operations/operating.rst b/doc/rados/operations/operating.rst new file mode 100644 index 000000000..134774ccb --- /dev/null +++ b/doc/rados/operations/operating.rst @@ -0,0 +1,255 @@ +===================== + Operating a Cluster +===================== + +.. index:: systemd; operating a cluster + + +Running Ceph with systemd +========================== + +For all distributions that support systemd (CentOS 7, Fedora, Debian +Jessie 8 and later, SUSE), ceph daemons are now managed using native +systemd files instead of the legacy sysvinit scripts. For example: + +.. prompt:: bash $ + + sudo systemctl start ceph.target # start all daemons + sudo systemctl status ceph-osd@12 # check status of osd.12 + +To list the Ceph systemd units on a node, execute: + +.. prompt:: bash $ + + sudo systemctl status ceph\*.service ceph\*.target + +Starting all Daemons +-------------------- + +To start all daemons on a Ceph Node (irrespective of type), execute the +following: + +.. prompt:: bash $ + + sudo systemctl start ceph.target + + +Stopping all Daemons +-------------------- + +To stop all daemons on a Ceph Node (irrespective of type), execute the +following: + +.. prompt:: bash $ + + sudo systemctl stop ceph\*.service ceph\*.target + + +Starting all Daemons by Type +---------------------------- + +To start all daemons of a particular type on a Ceph Node, execute one of the +following: + +.. prompt:: bash $ + + sudo systemctl start ceph-osd.target + sudo systemctl start ceph-mon.target + sudo systemctl start ceph-mds.target + + +Stopping all Daemons by Type +---------------------------- + +To stop all daemons of a particular type on a Ceph Node, execute one of the +following: + +.. 
prompt:: bash $ + + sudo systemctl stop ceph-mon\*.service ceph-mon.target + sudo systemctl stop ceph-osd\*.service ceph-osd.target + sudo systemctl stop ceph-mds\*.service ceph-mds.target + + +Starting a Daemon +----------------- + +To start a specific daemon instance on a Ceph Node, execute one of the +following: + +.. prompt:: bash $ + + sudo systemctl start ceph-osd@{id} + sudo systemctl start ceph-mon@{hostname} + sudo systemctl start ceph-mds@{hostname} + +For example: + +.. prompt:: bash $ + + sudo systemctl start ceph-osd@1 + sudo systemctl start ceph-mon@ceph-server + sudo systemctl start ceph-mds@ceph-server + + +Stopping a Daemon +----------------- + +To stop a specific daemon instance on a Ceph Node, execute one of the +following: + +.. prompt:: bash $ + + sudo systemctl stop ceph-osd@{id} + sudo systemctl stop ceph-mon@{hostname} + sudo systemctl stop ceph-mds@{hostname} + +For example: + +.. prompt:: bash $ + + sudo systemctl stop ceph-osd@1 + sudo systemctl stop ceph-mon@ceph-server + sudo systemctl stop ceph-mds@ceph-server + + +.. 
index:: Upstart; operating a cluster + +Running Ceph with Upstart +========================== + +Starting all Daemons +-------------------- + +To start all daemons on a Ceph Node (irrespective of type), execute the +following:: + + sudo start ceph-all + + +Stopping all Daemons +-------------------- + +To stop all daemons on a Ceph Node (irrespective of type), execute the +following:: + + sudo stop ceph-all + + +Starting all Daemons by Type +---------------------------- + +To start all daemons of a particular type on a Ceph Node, execute one of the +following:: + + sudo start ceph-osd-all + sudo start ceph-mon-all + sudo start ceph-mds-all + + +Stopping all Daemons by Type +---------------------------- + +To stop all daemons of a particular type on a Ceph Node, execute one of the +following:: + + sudo stop ceph-osd-all + sudo stop ceph-mon-all + sudo stop ceph-mds-all + + +Starting a Daemon +----------------- + +To start a specific daemon instance on a Ceph Node, execute one of the +following:: + + sudo start ceph-osd id={id} + sudo start ceph-mon id={hostname} + sudo start ceph-mds id={hostname} + +For example:: + + sudo start ceph-osd id=1 + sudo start ceph-mon id=ceph-server + sudo start ceph-mds id=ceph-server + + +Stopping a Daemon +----------------- + +To stop a specific daemon instance on a Ceph Node, execute one of the +following:: + + sudo stop ceph-osd id={id} + sudo stop ceph-mon id={hostname} + sudo stop ceph-mds id={hostname} + +For example:: + + sudo stop ceph-osd id=1 + sudo stop ceph-mon id=ceph-server + sudo stop ceph-mds id=ceph-server + + +.. index:: sysvinit; operating a cluster + +Running Ceph with sysvinit +========================== + +Each time you **start**, **restart**, or **stop** Ceph daemons (or your +entire cluster), you must specify at least one option and one command. You may +also specify a daemon type or a daemon instance. 
:: + + {commandline} [options] [commands] [daemons] + + +The ``ceph`` options include: + ++-----------------+----------+-------------------------------------------------+ +| Option | Shortcut | Description | ++=================+==========+=================================================+ +| ``--verbose`` | ``-v`` | Use verbose logging. | ++-----------------+----------+-------------------------------------------------+ +| ``--valgrind`` | ``N/A`` | (Dev and QA only) Use `Valgrind`_ debugging. | ++-----------------+----------+-------------------------------------------------+ +| ``--allhosts`` | ``-a`` | Execute on all nodes in ``ceph.conf.`` | +| | | Otherwise, it only executes on ``localhost``. | ++-----------------+----------+-------------------------------------------------+ +| ``--restart`` | ``N/A`` | Automatically restart daemon if it core dumps. | ++-----------------+----------+-------------------------------------------------+ +| ``--norestart`` | ``N/A`` | Don't restart a daemon if it core dumps. | ++-----------------+----------+-------------------------------------------------+ +| ``--conf`` | ``-c`` | Use an alternate configuration file. | ++-----------------+----------+-------------------------------------------------+ + +The ``ceph`` commands include: + ++------------------+------------------------------------------------------------+ +| Command | Description | ++==================+============================================================+ +| ``start`` | Start the daemon(s). | ++------------------+------------------------------------------------------------+ +| ``stop`` | Stop the daemon(s). | ++------------------+------------------------------------------------------------+ +| ``forcestop`` | Force the daemon(s) to stop. Same as ``kill -9`` | ++------------------+------------------------------------------------------------+ +| ``killall`` | Kill all daemons of a particular type. 
| ++------------------+------------------------------------------------------------+ +| ``cleanlogs`` | Cleans out the log directory. | ++------------------+------------------------------------------------------------+ +| ``cleanalllogs`` | Cleans out **everything** in the log directory. | ++------------------+------------------------------------------------------------+ + +For subsystem operations, the ``ceph`` service can target specific daemon types +by adding a particular daemon type for the ``[daemons]`` option. Daemon types +include: + +- ``mon`` +- ``osd`` +- ``mds`` + + + +.. _Valgrind: http://www.valgrind.org/ +.. _initctl: http://manpages.ubuntu.com/manpages/raring/en/man8/initctl.8.html diff --git a/doc/rados/operations/pg-concepts.rst b/doc/rados/operations/pg-concepts.rst new file mode 100644 index 000000000..636d6bf9a --- /dev/null +++ b/doc/rados/operations/pg-concepts.rst @@ -0,0 +1,102 @@ +========================== + Placement Group Concepts +========================== + +When you execute commands like ``ceph -w``, ``ceph osd dump``, and other +commands related to placement groups, Ceph may return values using some +of the following terms: + +*Peering* + The process of bringing all of the OSDs that store + a Placement Group (PG) into agreement about the state + of all of the objects (and their metadata) in that PG. + Note that agreeing on the state does not mean that + they all have the latest contents. + +*Acting Set* + The ordered list of OSDs who are (or were as of some epoch) + responsible for a particular placement group. + +*Up Set* + The ordered list of OSDs responsible for a particular placement + group for a particular epoch according to CRUSH. Normally this + is the same as the *Acting Set*, except when the *Acting Set* has + been explicitly overridden via ``pg_temp`` in the OSD Map. 
+ +*Current Interval* or *Past Interval* + A sequence of OSD map epochs during which the *Acting Set* and *Up + Set* for a particular placement group do not change. + +*Primary* + The member (and by convention first) of the *Acting Set* + that is responsible for coordinating peering, and is + the only OSD that will accept client-initiated + writes to objects in a placement group. + +*Replica* + A non-primary OSD in the *Acting Set* for a placement group + (and who has been recognized as such and *activated* by the primary). + +*Stray* + An OSD that is not a member of the current *Acting Set*, but + has not yet been told that it can delete its copies of a + particular placement group. + +*Recovery* + Ensuring that copies of all of the objects in a placement group + are on all of the OSDs in the *Acting Set*. Once *Peering* has + been performed, the *Primary* can start accepting write operations, + and *Recovery* can proceed in the background. + +*PG Info* + Basic metadata about the placement group's creation epoch, the version + for the most recent write to the placement group, *last epoch started*, + *last epoch clean*, and the beginning of the *current interval*. Any + inter-OSD communication about placement groups includes the *PG Info*, + such that any OSD that knows a placement group exists (or once existed) + also has a lower bound on *last epoch clean* or *last epoch started*. + +*PG Log* + A list of recent updates made to objects in a placement group. + Note that these logs can be truncated after all OSDs + in the *Acting Set* have acknowledged up to a certain + point. + +*Missing Set* + Each OSD notes update log entries and, if they imply updates to + the contents of an object, adds that object to a list of needed + updates. This list is called the *Missing Set* for that ``<OSD,PG>``. + +*Authoritative History* + A complete and fully ordered set of operations that, if + performed, would bring an OSD's copy of a placement group + up to date. 
+ +*Epoch* + A (monotonically increasing) OSD map version number. + +*Last Epoch Start* + The last epoch at which all nodes in the *Acting Set* + for a particular placement group agreed on an + *Authoritative History*. At this point, *Peering* is + deemed to have been successful. + +*up_thru* + Before a *Primary* can successfully complete the *Peering* process, + it must inform a monitor that it is alive through the current + OSD map *Epoch* by having the monitor set its *up_thru* in the osd + map. This helps *Peering* ignore previous *Acting Sets* for which + *Peering* never completed after certain sequences of failures, such as + the second interval below: + + - *acting set* = [A,B] + - *acting set* = [A] + - *acting set* = [] very shortly after (e.g., simultaneous failure, but staggered detection) + - *acting set* = [B] (B restarts, A does not) + +*Last Epoch Clean* + The last *Epoch* at which all nodes in the *Acting set* + for a particular placement group were completely + up to date (both placement group logs and object contents). + At this point, *recovery* is deemed to have been + completed. diff --git a/doc/rados/operations/pg-repair.rst b/doc/rados/operations/pg-repair.rst new file mode 100644 index 000000000..f495530cc --- /dev/null +++ b/doc/rados/operations/pg-repair.rst @@ -0,0 +1,81 @@ +============================ +Repairing PG inconsistencies +============================ +Sometimes a placement group might become "inconsistent". To return the +placement group to an active+clean state, you must first determine which +of the placement groups has become inconsistent and then run the "pg +repair" command on it. This page contains commands for diagnosing placement +groups and the command for repairing placement groups that have become +inconsistent. + +.. 
highlight:: console + +Commands for Diagnosing Placement-group Problems +================================================ +The commands in this section provide various ways of diagnosing broken placement groups. + +The following command provides a high-level (low detail) overview of the health of the ceph cluster: + +.. prompt:: bash # + + ceph health detail + +The following command provides more detail on the status of the placement groups: + +.. prompt:: bash # + + ceph pg dump --format=json-pretty + +The following command lists inconsistent placement groups: + +.. prompt:: bash # + + rados list-inconsistent-pg {pool} + +The following command lists inconsistent rados objects: + +.. prompt:: bash # + + rados list-inconsistent-obj {pgid} + +The following command lists inconsistent snapsets in the given placement group: + +.. prompt:: bash # + + rados list-inconsistent-snapset {pgid} + + +Commands for Repairing Placement Groups +======================================= +The form of the command to repair a broken placement group is: + +.. prompt:: bash # + + ceph pg repair {pgid} + +Where ``{pgid}`` is the id of the affected placement group. + +For example: + +.. prompt:: bash # + + ceph pg repair 1.4 + +More Information on Placement Group Repair +========================================== +Ceph stores and updates the checksums of objects stored in the cluster. When a scrub is performed on a placement group, the OSD attempts to choose an authoritative copy from among its replicas. Among all of the possible cases, only one case is consistent. After a deep scrub, Ceph calculates the checksum of an object read from the disk and compares it to the checksum previously recorded. If the current checksum and the previously recorded checksums do not match, that is an inconsistency. In the case of replicated pools, any mismatch between the checksum of any replica of an object and the checksum of the authoritative copy means that there is an inconsistency. 
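The checksum comparison described above can be sketched in a few lines of Python. This is an illustration only, not Ceph's implementation: the function name ``find_inconsistent`` and the sample data are hypothetical, and ``zlib.crc32`` is used here merely as a stand-in for whatever checksum the object store records.

```python
import zlib


def find_inconsistent(replicas, recorded_digest):
    """Return the replicas whose current digest differs from the recorded one.

    `replicas` maps a (hypothetical) OSD name to the bytes read from disk.
    """
    return sorted(
        osd for osd, data in replicas.items()
        if zlib.crc32(data) != recorded_digest
    )


# Digest recorded when the object was written (stand-in checksum).
recorded = zlib.crc32(b"object payload")

# Three copies of the object; the copy on osd.2 has two bytes swapped.
replicas = {
    "osd.0": b"object payload",
    "osd.1": b"object payload",
    "osd.2": b"object paylaod",
}

print(find_inconsistent(replicas, recorded))
```

Any replica reported by such a comparison is what the text above calls an inconsistency; deciding which copy is authoritative is a separate step.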
+ +The "pg repair" command attempts to fix inconsistencies of various kinds. If "pg repair" finds an inconsistent placement group, it attempts to overwrite the digest of the inconsistent copy with the digest of the authoritative copy. If "pg repair" finds an inconsistent replicated pool, it marks the inconsistent copy as missing. Recovery, in the case of replicated pools, is beyond the scope of "pg repair". + +For erasure coded and bluestore pools, Ceph will automatically repair if ``osd_scrub_auto_repair`` (configuration default "false") is set to true and at most ``osd_scrub_auto_repair_num_errors`` (configuration default 5) errors are found. + +"pg repair" will not solve every problem. Ceph does not automatically repair placement groups when inconsistencies are found in them. + +The checksum of an object or an omap is not always available. Checksums are calculated incrementally. If a replicated object is updated non-sequentially, the write operation involved in the update changes the object and invalidates its checksum. The whole object is not read while recalculating the checksum. "ceph pg repair" is able to repair things even when checksums are not available to it, as in the case of filestore. When replicated filestore pools are in question, users might prefer manual repair to "ceph pg repair". + +The material in this paragraph is relevant for filestore; bluestore has its own internal checksums. The matched-record checksum and the calculated checksum cannot prove that the authoritative copy is in fact authoritative. In the case that there is no checksum available, "pg repair" favors the data on the primary. This might or might not be the uncorrupted replica. This is why human intervention is necessary when an inconsistency is discovered. Human intervention sometimes means using the "ceph-objectstore-tool". 
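The auto-repair condition described in this section (the option must be enabled, and the scrub must find no more errors than the configured limit) can be written as a small predicate. This is a sketch of the stated rule only; the function name ``should_auto_repair`` is hypothetical, and the default of 5 is the one quoted above.

```python
def should_auto_repair(auto_repair_enabled, errors_found, max_errors=5):
    """Return True if a scrub's findings fall within the auto-repair limits.

    auto_repair_enabled -- value of osd_scrub_auto_repair (default false)
    errors_found        -- number of errors the scrub reported
    max_errors          -- value of osd_scrub_auto_repair_num_errors
    """
    return auto_repair_enabled and errors_found <= max_errors


print(should_auto_repair(True, 3))   # within the limit
print(should_auto_repair(True, 6))   # too many errors
print(should_auto_repair(False, 1))  # option disabled (the default)
```

With the defaults the predicate is always false, which matches the statement that Ceph does not automatically repair placement groups unless configured to do so.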
+ +External Links +============== +https://ceph.io/geen-categorie/ceph-manually-repair-object/ - This page contains a walkthrough of the repair of a placement group, and is recommended reading if you want to repair a placement +group but have never done so. diff --git a/doc/rados/operations/pg-states.rst b/doc/rados/operations/pg-states.rst new file mode 100644 index 000000000..495229d92 --- /dev/null +++ b/doc/rados/operations/pg-states.rst @@ -0,0 +1,118 @@ +======================== + Placement Group States +======================== + +When checking a cluster's status (e.g., running ``ceph -w`` or ``ceph -s``), +Ceph will report on the status of the placement groups. A placement group has +one or more states. The optimum state for placement groups in the placement group +map is ``active + clean``. + +*creating* + Ceph is still creating the placement group. + +*activating* + The placement group is peered but not yet active. + +*active* + Ceph will process requests to the placement group. + +*clean* + Ceph replicated all objects in the placement group the correct number of times. + +*down* + A replica with necessary data is down, so the placement group is offline. + +*laggy* + A replica is not acknowledging new leases from the primary in a timely fashion; IO is temporarily paused. + +*wait* + The set of OSDs for this PG has just changed and IO is temporarily paused until the previous interval's leases expire. + +*scrubbing* + Ceph is checking the placement group metadata for inconsistencies. + +*deep* + Ceph is checking the placement group data against stored checksums. + +*degraded* + Ceph has not replicated some objects in the placement group the correct number of times yet. + +*inconsistent* + Ceph detects inconsistencies in the one or more replicas of an object in the placement group + (e.g. objects are the wrong size, objects are missing from one replica *after* recovery finished, etc.). 
+ +*peering* + The placement group is undergoing the peering process. + +*repair* + Ceph is checking the placement group and repairing any inconsistencies it finds (if possible). + +*recovering* + Ceph is migrating/synchronizing objects and their replicas. + +*forced_recovery* + High recovery priority of that PG is enforced by user. + +*recovery_wait* + The placement group is waiting in line to start recovery. + +*recovery_toofull* + A recovery operation is waiting because the destination OSD is over its + full ratio. + +*recovery_unfound* + Recovery stopped due to unfound objects. + +*backfilling* + Ceph is scanning and synchronizing the entire contents of a placement group + instead of inferring what contents need to be synchronized from the logs of + recent operations. Backfill is a special case of recovery. + +*forced_backfill* + High backfill priority of that PG is enforced by user. + +*backfill_wait* + The placement group is waiting in line to start backfill. + +*backfill_toofull* + A backfill operation is waiting because the destination OSD is over + the backfillfull ratio. + +*backfill_unfound* + Backfill stopped due to unfound objects. + +*incomplete* + Ceph detects that a placement group is missing information about + writes that may have occurred, or does not have any healthy + copies. If you see this state, try to start any failed OSDs that may + contain the needed information. In the case of an erasure coded pool, + temporarily reducing min_size may allow recovery. + +*stale* + The placement group is in an unknown state - the monitors have not received + an update for it since the placement group mapping changed. + +*remapped* + The placement group is temporarily mapped to a different set of OSDs from what + CRUSH specified. + +*undersized* + The placement group has fewer copies than the configured pool replication level. 
+ +*peered* + The placement group has peered, but cannot serve client IO due to not having + enough copies to reach the pool's configured min_size parameter. Recovery + may occur in this state, so the pg may heal up to min_size eventually. + +*snaptrim* + Trimming snaps. + +*snaptrim_wait* + Queued to trim snaps. + +*snaptrim_error* + Error stopped trimming snaps. + +*unknown* + The ceph-mgr hasn't yet received any information about the PG's state from an + OSD since mgr started up. diff --git a/doc/rados/operations/placement-groups.rst b/doc/rados/operations/placement-groups.rst new file mode 100644 index 000000000..d51f8d76e --- /dev/null +++ b/doc/rados/operations/placement-groups.rst @@ -0,0 +1,798 @@ +================== + Placement Groups +================== + +.. _pg-autoscaler: + +Autoscaling placement groups +============================ + +Placement groups (PGs) are an internal implementation detail of how +Ceph distributes data. You can allow the cluster to either make +recommendations or automatically tune PGs based on how the cluster is +used by enabling *pg-autoscaling*. + +Each pool in the system has a ``pg_autoscale_mode`` property that can be set to ``off``, ``on``, or ``warn``. + +* ``off``: Disable autoscaling for this pool. It is up to the administrator to choose an appropriate PG number for each pool. Please refer to :ref:`choosing-number-of-placement-groups` for more information. +* ``on``: Enable automated adjustments of the PG count for the given pool. +* ``warn``: Raise health alerts when the PG count should be adjusted + +To set the autoscaling mode for an existing pool: + +.. prompt:: bash # + + ceph osd pool set <pool-name> pg_autoscale_mode <mode> + +For example to enable autoscaling on pool ``foo``: + +.. prompt:: bash # + + ceph osd pool set foo pg_autoscale_mode on + +You can also configure the default ``pg_autoscale_mode`` that is +set on any pools that are subsequently created: + +.. 
prompt:: bash # + + ceph config set global osd_pool_default_pg_autoscale_mode <mode> + +You can disable or enable the autoscaler for all pools with +the ``noautoscale`` flag. By default this flag is set to be ``off``, +but you can turn it ``on`` by using the command: + +.. prompt:: bash $ + + ceph osd pool set noautoscale + +You can turn it ``off`` using the command: + +.. prompt:: bash # + + ceph osd pool unset noautoscale + +To ``get`` the value of the flag use the command: + +.. prompt:: bash # + + ceph osd pool get noautoscale + +Viewing PG scaling recommendations +---------------------------------- + +You can view each pool, its relative utilization, and any suggested changes to +the PG count with this command: + +.. prompt:: bash # + + ceph osd pool autoscale-status + +Output will be something like:: + + POOL SIZE TARGET SIZE RATE RAW CAPACITY RATIO TARGET RATIO EFFECTIVE RATIO BIAS PG_NUM NEW PG_NUM AUTOSCALE BULK + a 12900M 3.0 82431M 0.4695 8 128 warn True + c 0 3.0 82431M 0.0000 0.2000 0.9884 1.0 1 64 warn True + b 0 953.6M 3.0 82431M 0.0347 8 warn False + +**SIZE** is the amount of data stored in the pool. **TARGET SIZE**, if +present, is the amount of data the administrator has specified that +they expect to eventually be stored in this pool. The system uses +the larger of the two values for its calculation. + +**RATE** is the multiplier for the pool that determines how much raw +storage capacity is consumed. For example, a 3 replica pool will +have a ratio of 3.0, while a k=4,m=2 erasure coded pool will have a +ratio of 1.5. + +**RAW CAPACITY** is the total amount of raw storage capacity on the +OSDs that are responsible for storing this pool's (and perhaps other +pools') data. **RATIO** is the ratio of that total capacity that +this pool is consuming (i.e., ratio = size * rate / raw capacity). 
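As a worked example, the **RATIO** values in the sample output above can be reproduced with a few lines of Python. This is a sketch of the formula given in the text, not the autoscaler's code; the helper names ``ec_rate`` and ``capacity_ratio`` are hypothetical, and all sizes are in the same unit (MiB here).

```python
def ec_rate(k, m):
    """Raw-space multiplier for an erasure coded pool with k data and m coding chunks."""
    return (k + m) / k


def capacity_ratio(size, target_size, rate, raw_capacity):
    """ratio = size * rate / raw capacity, using the larger of SIZE and TARGET SIZE."""
    effective_size = max(size, target_size)
    return effective_size * rate / raw_capacity


# Pool "a": 12900M stored, no target size, 3 replicas, 82431M raw capacity.
ratio_a = capacity_ratio(12900, 0, 3.0, 82431)
# Pool "b": nothing stored yet, but a target size of 953.6M has been set.
ratio_b = capacity_ratio(0, 953.6, 3.0, 82431)

print(f"{ratio_a:.4f}")  # 0.4695
print(f"{ratio_b:.4f}")  # 0.0347
print(ec_rate(4, 2))     # 1.5
```

The two printed ratios match the RATIO column for pools ``a`` and ``b`` in the sample output, and ``ec_rate(4, 2)`` gives the 1.5 multiplier quoted for a k=4,m=2 pool.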
+ +**TARGET RATIO**, if present, is the ratio of storage that the +administrator has specified that they expect this pool to consume +relative to other pools with target ratios set. +If both target size bytes and ratio are specified, the +ratio takes precedence. + +**EFFECTIVE RATIO** is the target ratio after adjusting in two ways: + +1. subtracting any capacity expected to be used by pools with target size set +2. normalizing the target ratios among pools with target ratio set so + they collectively target the rest of the space. For example, 4 + pools with target_ratio 1.0 would have an effective ratio of 0.25. + +The system uses the larger of the actual ratio and the effective ratio +for its calculation. + +**BIAS** is used as a multiplier to manually adjust a pool's PG count based +on prior information about how many PGs a specific pool is expected +to have. + +**PG_NUM** is the current number of PGs for the pool (or the current +number of PGs that the pool is working towards, if a ``pg_num`` +change is in progress). **NEW PG_NUM**, if present, is what the +system believes the pool's ``pg_num`` should be changed to. It is +always a power of 2, and will only be present if the "ideal" value +varies from the current value by more than a factor of 3 by default. +This factor can be adjusted with: + +.. prompt:: bash # + + ceph osd pool set threshold 2.0 + +**AUTOSCALE** is the pool's ``pg_autoscale_mode`` +and will be either ``on``, ``off``, or ``warn``. + +The final column, **BULK**, indicates whether the pool is ``bulk`` +and will be either ``True`` or ``False``. A ``bulk`` pool +is expected to be large and should start out +with a large number of PGs for performance purposes. On the other hand, +pools without the ``bulk`` flag are expected to be smaller, e.g., +.mgr or meta pools. + + +Automated scaling +----------------- + +Allowing the cluster to automatically scale PGs based on usage is the +simplest approach. 
Ceph will look at the total available storage and +target number of PGs for the whole system, look at how much data is +stored in each pool, and try to apportion the PGs accordingly. The +system is relatively conservative with its approach, only making +changes to a pool when the current number of PGs (``pg_num``) is more +than 3 times off from what it thinks it should be. + +The target number of PGs per OSD is based on the +``mon_target_pg_per_osd`` configurable (default: 100), which can be +adjusted with: + +.. prompt:: bash # + + ceph config set global mon_target_pg_per_osd 100 + +The autoscaler analyzes pools and adjusts on a per-subtree basis. +Because each pool may map to a different CRUSH rule, and each rule may +distribute data across different devices, Ceph will consider +utilization of each subtree of the hierarchy independently. For +example, a pool that maps to OSDs of class `ssd` and a pool that maps +to OSDs of class `hdd` will each have optimal PG counts that depend on +the number of those respective device types. + +In the case where a pool uses OSDs under two or more CRUSH roots (e.g., shadow +trees with both `ssd` and `hdd` devices), the autoscaler will +issue a warning to the user in the manager log stating the name of the pool +and the set of roots that overlap each other. The autoscaler will not +scale any pools with overlapping roots because this can cause problems +with the scaling process. We recommend making each pool belong to only +one root (one OSD class) to get rid of the warning and ensure a successful +scaling process. + +The autoscaler uses the `bulk` flag to determine which pools +should start out with a full complement of PGs and only +scales down when the usage ratio across the pool is not even. +However, if the pool doesn't have the `bulk` flag, the pool will +start out with minimal PGs and gain PGs only when there is more usage in the pool. + +To create a pool with the `bulk` flag: + +.. 
prompt:: bash # + + ceph osd pool create <pool-name> --bulk + +To set/unset `bulk` flag of existing pool: + +.. prompt:: bash # + + ceph osd pool set <pool-name> bulk <true/false/1/0> + +To get `bulk` flag of existing pool: + +.. prompt:: bash # + + ceph osd pool get <pool-name> bulk + +.. _specifying_pool_target_size: + +Specifying expected pool size +----------------------------- + +When a cluster or pool is first created, it will consume a small +fraction of the total cluster capacity and will appear to the system +as if it should only need a small number of placement groups. +However, in most cases cluster administrators have a good idea which +pools are expected to consume most of the system capacity over time. +By providing this information to Ceph, a more appropriate number of +PGs can be used from the beginning, preventing subsequent changes in +``pg_num`` and the overhead associated with moving data around when +those adjustments are made. + +The *target size* of a pool can be specified in two ways: either in +terms of the absolute size of the pool (i.e., bytes), or as a weight +relative to other pools with a ``target_size_ratio`` set. + +For example: + +.. prompt:: bash # + + ceph osd pool set mypool target_size_bytes 100T + +will tell the system that `mypool` is expected to consume 100 TiB of +space. Alternatively: + +.. prompt:: bash # + + ceph osd pool set mypool target_size_ratio 1.0 + +will tell the system that `mypool` is expected to consume 1.0 relative +to the other pools with ``target_size_ratio`` set. If `mypool` is the +only pool in the cluster, this means an expected use of 100% of the +total capacity. If there is a second pool with ``target_size_ratio`` +1.0, both pools would expect to use 50% of the cluster capacity. + +You can also set the target size of a pool at creation time with the optional ``--target-size-bytes <bytes>`` or ``--target-size-ratio <ratio>`` arguments to the ``ceph osd pool create`` command. 
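The relative weighting of ``target_size_ratio`` described above can be sketched as a simple normalization. This is an illustration only: ``expected_share`` and the pool name ``otherpool`` are hypothetical, and the sketch assumes that no pool in the cluster has ``target_size_bytes`` set, so no capacity is subtracted before normalizing.

```python
def expected_share(target_ratios):
    """Map each pool to its expected fraction of cluster capacity.

    target_ratios: dict of pool name -> target_size_ratio.
    Assumes no pool also sets target_size_bytes.
    """
    total = sum(target_ratios.values())
    return {pool: ratio / total for pool, ratio in target_ratios.items()}


# A single pool with a ratio set is expected to use the whole cluster.
single = expected_share({"mypool": 1.0})

# Two pools with equal ratios are each expected to use half.
pair = expected_share({"mypool": 1.0, "otherpool": 1.0})

print(single)
print(pair)
```

Only the proportions between the values matter here, so setting both pools to 2.0 would give the same shares as setting both to 1.0.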
+ +Note that if impossible target size values are specified (for example, +a capacity larger than the total cluster) then a health warning +(``POOL_TARGET_SIZE_BYTES_OVERCOMMITTED``) will be raised. + +If both ``target_size_ratio`` and ``target_size_bytes`` are specified +for a pool, only the ratio will be considered, and a health warning +(``POOL_HAS_TARGET_SIZE_BYTES_AND_RATIO``) will be issued. + +Specifying bounds on a pool's PGs +--------------------------------- + +It is also possible to specify a minimum number of PGs for a pool. +This is useful for establishing a lower bound on the amount of +parallelism clients will see when doing IO, even when a pool is mostly +empty. Setting the lower bound prevents Ceph from reducing (or +recommending you reduce) the PG number below the configured number. + +You can set the minimum or maximum number of PGs for a pool with: + +.. prompt:: bash # + + ceph osd pool set <pool-name> pg_num_min <num> + ceph osd pool set <pool-name> pg_num_max <num> + +You can also specify the minimum or maximum PG count at pool creation +time with the optional ``--pg-num-min <num>`` or ``--pg-num-max +<num>`` arguments to the ``ceph osd pool create`` command. + +.. _preselection: + +A preselection of pg_num +======================== + +When creating a new pool with: + +.. prompt:: bash # + + ceph osd pool create {pool-name} [pg_num] + +it is optional to choose the value of ``pg_num``. If you do not +specify ``pg_num``, the cluster can (by default) automatically tune it +for you based on how much data is stored in the pool (see above, :ref:`pg-autoscaler`). + +Alternatively, ``pg_num`` can be explicitly provided. However, +whether you specify a ``pg_num`` value or not does not affect whether +the value is automatically tuned by the cluster after the fact. To +enable or disable auto-tuning: + +.. prompt:: bash # + + ceph osd pool set {pool-name} pg_autoscale_mode (on|off|warn) + +The "rule of thumb" for PGs per OSD has traditionally been 100. 
With +the addition of the balancer (which is also enabled by default), a +value of more like 50 PGs per OSD is probably reasonable. The +challenge (which the autoscaler normally handles for you) is to: + +- have the PGs per pool proportional to the data in the pool, and +- end up with 50-100 PGs per OSD, after the replication or + erasure-coding fan-out of each PG across OSDs is taken into + consideration + +How are Placement Groups used? +=============================== + +A placement group (PG) aggregates objects within a pool because +tracking object placement and object metadata on a per-object basis is +computationally expensive--i.e., a system with millions of objects +cannot realistically track placement on a per-object basis. + +.. ditaa:: + /-----\ /-----\ /-----\ /-----\ /-----\ + | obj | | obj | | obj | | obj | | obj | + \-----/ \-----/ \-----/ \-----/ \-----/ + | | | | | + +--------+--------+ +---+----+ + | | + v v + +-----------------------+ +-----------------------+ + | Placement Group #1 | | Placement Group #2 | + | | | | + +-----------------------+ +-----------------------+ + | | + +------------------------------+ + | + v + +-----------------------+ + | Pool | + | | + +-----------------------+ + +The Ceph client will calculate which placement group an object should +be in. It does this by hashing the object ID and applying an operation +based on the number of PGs in the defined pool and the ID of the pool. +See `Mapping PGs to OSDs`_ for details. + +The object's contents within a placement group are stored in a set of +OSDs. For instance, in a replicated pool of size two, each placement +group will store objects on two OSDs, as shown below. + +.. 
ditaa:: + +-----------------------+ +-----------------------+ + | Placement Group #1 | | Placement Group #2 | + | | | | + +-----------------------+ +-----------------------+ + | | | | + v v v v + /----------\ /----------\ /----------\ /----------\ + | | | | | | | | + | OSD #1 | | OSD #2 | | OSD #2 | | OSD #3 | + | | | | | | | | + \----------/ \----------/ \----------/ \----------/ + + +Should OSD #2 fail, another will be assigned to Placement Group #1 and +will be filled with copies of all objects in OSD #1. If the pool size +is changed from two to three, an additional OSD will be assigned to +the placement group and will receive copies of all objects in the +placement group. + +Placement groups do not own the OSD; they share it with other +placement groups from the same pool or even other pools. If OSD #2 +fails, the Placement Group #2 will also have to restore copies of +objects, using OSD #3. + +When the number of placement groups increases, the new placement +groups will be assigned OSDs. The result of the CRUSH function will +also change and some objects from the former placement groups will be +copied over to the new Placement Groups and removed from the old ones. + +Placement Groups Tradeoffs +========================== + +Data durability and even distribution among all OSDs call for more +placement groups but their number should be reduced to the minimum to +save CPU and memory. + +.. _data durability: + +Data durability +--------------- + +After an OSD fails, the risk of data loss increases until the data it +contained is fully recovered. Let's imagine a scenario that causes +permanent data loss in a single placement group: + +- The OSD fails and all copies of the object it contains are lost. + For all objects within the placement group the number of replica + suddenly drops from three to two. + +- Ceph starts recovery for this placement group by choosing a new OSD + to re-create the third copy of all objects. 
+ +- Another OSD, within the same placement group, fails before the new + OSD is fully populated with the third copy. Some objects will then + only have one surviving copy. + +- Ceph picks yet another OSD and keeps copying objects to restore the + desired number of copies. + +- A third OSD, within the same placement group, fails before recovery + is complete. If this OSD contained the only remaining copy of an + object, that object is permanently lost. + +In a cluster containing 10 OSDs with 512 placement groups in a three +replica pool, CRUSH will give each placement group three OSDs. In the +end, each OSD will end up hosting (512 * 3) / 10 = ~150 Placement +Groups. When the first OSD fails, the above scenario will therefore +start recovery for all 150 placement groups at the same time. + +The 150 placement groups being recovered are likely to be +homogeneously spread over the 9 remaining OSDs. Each remaining OSD is +therefore likely to send copies of objects to all others and also +receive some new objects to be stored because they became part of a +new placement group. + +The amount of time it takes for this recovery to complete entirely +depends on the architecture of the Ceph cluster. Let's say each OSD is +hosted by a 1TB SSD on a single machine and all of them are connected +to a 10Gb/s switch and the recovery for a single OSD completes within +M minutes. If there are two OSDs per machine using spinners with no +SSD journal and a 1Gb/s switch, recovery will be at least an order of +magnitude slower. + +In a cluster of this size, the number of placement groups has almost +no influence on data durability. It could be 128 or 8192 and the +recovery would not be slower or faster. + +However, growing the same Ceph cluster to 20 OSDs instead of 10 OSDs +is likely to speed up recovery and therefore improve data durability +significantly. 
Each OSD now participates in only ~75 placement groups +instead of ~150 when there were only 10 OSDs and it will still require +all 19 remaining OSDs to perform the same amount of object copies in +order to recover. But where 10 OSDs had to copy approximately 100GB +each, they now have to copy 50GB each instead. If the network was the +bottleneck, recovery will happen twice as fast. In other words, +recovery goes faster when the number of OSDs increases. + +If this cluster grows to 40 OSDs, each of them will only host ~35 +placement groups. If an OSD dies, recovery will keep going faster +unless it is blocked by another bottleneck. However, if this cluster +grows to 200 OSDs, each of them will only host ~7 placement groups. If +an OSD dies, recovery will happen between at most of ~21 (7 * 3) OSDs +in these placement groups: recovery will take longer than when there +were 40 OSDs, meaning the number of placement groups should be +increased. + +No matter how short the recovery time is, there is a chance for a +second OSD to fail while it is in progress. In the 10 OSDs cluster +described above, if any of them fail, then ~17 placement groups +(i.e. ~150 / 9 placement groups being recovered) will only have one +surviving copy. And if any of the 8 remaining OSD fail, the last +objects of two placement groups are likely to be lost (i.e. ~17 / 8 +placement groups with only one remaining copy being recovered). + +When the size of the cluster grows to 20 OSDs, the number of Placement +Groups damaged by the loss of three OSDs drops. The second OSD lost +will degrade ~4 (i.e. ~75 / 19 placement groups being recovered) +instead of ~17 and the third OSD lost will only lose data if it is one +of the four OSDs containing the surviving copy. In other words, if the +probability of losing one OSD is 0.0001% during the recovery time +frame, it goes from 17 * 10 * 0.0001% in the cluster with 10 OSDs to 4 * 20 * +0.0001% in the cluster with 20 OSDs. 
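The back-of-the-envelope figures used in this section can be reproduced with a few lines of arithmetic. This is a rough sketch under the same simplifying assumptions as the text (PGs spread evenly over OSDs, three replicas, 512 PGs); the function names are hypothetical.

```python
def pgs_per_osd(pg_num, replicas, osds):
    """Average number of PG replicas hosted by each OSD."""
    return pg_num * replicas / osds


def pgs_shared_with_next_failure(pgs_recovering, remaining_osds):
    """PGs under recovery that a further OSD failure is expected to hit."""
    return pgs_recovering / remaining_osds


for osds in (10, 20):
    hosted = pgs_per_osd(512, 3, osds)
    # PGs left with a single copy if a second OSD fails during recovery.
    second = pgs_shared_with_next_failure(hosted, osds - 1)
    # PGs expected to lose their last copy if a third OSD then fails.
    third = pgs_shared_with_next_failure(second, osds - 2)
    print(f"{osds} OSDs: ~{hosted:.0f} PGs per OSD, "
          f"~{second:.0f} PGs down to one copy, ~{third:.1f} PGs at risk")
```

The exact results (about 154 and 77 PGs per OSD) differ slightly from the rounded ~150 and ~75 used in the text, but the second-failure estimates still come out at roughly 17 and 4.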
+ +In a nutshell, more OSDs mean faster recovery and a lower risk of +cascading failures leading to the permanent loss of a Placement +Group. Having 512 or 4096 Placement Groups is roughly equivalent in a +cluster with less than 50 OSDs as far as data durability is concerned. + +Note: It may take a long time for a new OSD added to the cluster to be +populated with placement groups that were assigned to it. However +there is no degradation of any object and it has no impact on the +durability of the data contained in the Cluster. + +.. _object distribution: + +Object distribution within a pool +--------------------------------- + +Ideally objects are evenly distributed in each placement group. Since +CRUSH computes the placement group for each object, but does not +actually know how much data is stored in each OSD within this +placement group, the ratio between the number of placement groups and +the number of OSDs may influence the distribution of the data +significantly. + +For instance, if there was a single placement group for ten OSDs in a +three replica pool, only three OSD would be used because CRUSH would +have no other choice. When more placement groups are available, +objects are more likely to be evenly spread among them. CRUSH also +makes every effort to evenly spread OSDs among all existing Placement +Groups. + +As long as there are one or two orders of magnitude more Placement +Groups than OSDs, the distribution should be even. For instance, 256 +placement groups for 3 OSDs, 512 or 1024 placement groups for 10 OSDs +etc. + +Uneven data distribution can be caused by factors other than the ratio +between OSDs and placement groups. Since CRUSH does not take into +account the size of the objects, a few very large objects may create +an imbalance. Let say one million 4K objects totaling 4GB are evenly +spread among 1024 placement groups on 10 OSDs. They will use 4GB / 10 += 400MB on each OSD. 
If one 400MB object is added to the pool, the +three OSDs supporting the placement group in which the object has been +placed will be filled with 400MB + 400MB = 800MB while the seven +others will remain occupied with only 400MB. + +.. _resource usage: + +Memory, CPU and network usage +----------------------------- + +For each placement group, OSDs and MONs need memory, network and CPU +at all times and even more during recovery. Sharing this overhead by +clustering objects within a placement group is one of the main reasons +placement groups exist. + +Minimizing the number of placement groups saves significant amounts of +resources. + +.. _choosing-number-of-placement-groups: + +Choosing the number of Placement Groups +======================================= + +.. note:: It is rarely necessary to do this math by hand. Instead, use the ``ceph osd pool autoscale-status`` command in combination with the ``target_size_bytes`` or ``target_size_ratio`` pool properties. See :ref:`pg-autoscaler` for more information. + +If you have more than 50 OSDs, we recommend approximately 50-100 +placement groups per OSD to balance out resource usage, data +durability and distribution. If you have fewer than 50 OSDs, choosing +among the `preselection`_ above is best. For a single pool of objects, +you can use the following formula to get a baseline: + + Total PGs = :math:`\frac{OSDs \times 100}{pool \: size}` + +Where **pool size** is either the number of replicas for replicated +pools or the K+M sum for erasure coded pools (as returned by **ceph +osd erasure-code-profile get**). + +You should then check if the result makes sense with the way you +designed your Ceph cluster to maximize `data durability`_, +`object distribution`_ and minimize `resource usage`_. + +The result should always be **rounded up to the nearest power of two**. + +Only a power of two will evenly balance the number of objects among +placement groups. Other values will result in an uneven distribution of +data across your OSDs. 
Their use should be limited to incrementally +stepping from one power of two to another. + +As an example, for a cluster with 200 OSDs and a pool size of 3 +replicas, you would estimate your number of PGs as follows + + :math:`\frac{200 \times 100}{3} = 6667`. Nearest power of 2: 8192 + +When using multiple data pools for storing objects, you need to ensure +that you balance the number of placement groups per pool with the +number of placement groups per OSD so that you arrive at a reasonable +total number of placement groups that provides reasonably low variance +per OSD without taxing system resources or making the peering process +too slow. + +For instance a cluster of 10 pools each with 512 placement groups on +ten OSDs is a total of 5,120 placement groups spread over ten OSDs, +that is 512 placement groups per OSD. That does not use too many +resources. However, if 1,000 pools were created with 512 placement +groups each, the OSDs will handle ~50,000 placement groups each and it +would require significantly more resources and time for peering. + +You may find the `PGCalc`_ tool helpful. + + +.. _setting the number of placement groups: + +Set the Number of Placement Groups +================================== + +To set the number of placement groups in a pool, you must specify the +number of placement groups at the time you create the pool. +See `Create a Pool`_ for details. Even after a pool is created you can also change the number of placement groups with: + +.. prompt:: bash # + + ceph osd pool set {pool-name} pg_num {pg_num} + +After you increase the number of placement groups, you must also +increase the number of placement groups for placement (``pgp_num``) +before your cluster will rebalance. The ``pgp_num`` will be the number of +placement groups that will be considered for placement by the CRUSH +algorithm. 
Increasing ``pg_num`` splits the placement groups but data +will not be migrated to the newer placement groups until placement +groups for placement, ie. ``pgp_num`` is increased. The ``pgp_num`` +should be equal to the ``pg_num``. To increase the number of +placement groups for placement, execute the following: + +.. prompt:: bash # + + ceph osd pool set {pool-name} pgp_num {pgp_num} + +When decreasing the number of PGs, ``pgp_num`` is adjusted +automatically for you. + +Get the Number of Placement Groups +================================== + +To get the number of placement groups in a pool, execute the following: + +.. prompt:: bash # + + ceph osd pool get {pool-name} pg_num + + +Get a Cluster's PG Statistics +============================= + +To get the statistics for the placement groups in your cluster, execute the following: + +.. prompt:: bash # + + ceph pg dump [--format {format}] + +Valid formats are ``plain`` (default) and ``json``. + + +Get Statistics for Stuck PGs +============================ + +To get the statistics for all placement groups stuck in a specified state, +execute the following: + +.. prompt:: bash # + + ceph pg dump_stuck inactive|unclean|stale|undersized|degraded [--format <format>] [-t|--threshold <seconds>] + +**Inactive** Placement groups cannot process reads or writes because they are waiting for an OSD +with the most up-to-date data to come up and in. + +**Unclean** Placement groups contain objects that are not replicated the desired number +of times. They should be recovering. + +**Stale** Placement groups are in an unknown state - the OSDs that host them have not +reported to the monitor cluster in a while (configured by ``mon_osd_report_timeout``). + +Valid formats are ``plain`` (default) and ``json``. The threshold defines the minimum number +of seconds the placement group is stuck before including it in the returned statistics +(default 300 seconds). 
+ + +Get a PG Map +============ + +To get the placement group map for a particular placement group, execute the following: + +.. prompt:: bash # + + ceph pg map {pg-id} + +For example: + +.. prompt:: bash # + + ceph pg map 1.6c + +Ceph will return the placement group map, the placement group, and the OSD status: + +.. prompt:: bash # + + osdmap e13 pg 1.6c (1.6c) -> up [1,0] acting [1,0] + + +Get a PGs Statistics +==================== + +To retrieve statistics for a particular placement group, execute the following: + +.. prompt:: bash # + + ceph pg {pg-id} query + + +Scrub a Placement Group +======================= + +To scrub a placement group, execute the following: + +.. prompt:: bash # + + ceph pg scrub {pg-id} + +Ceph checks the primary and any replica nodes, generates a catalog of all objects +in the placement group and compares them to ensure that no objects are missing +or mismatched, and their contents are consistent. Assuming the replicas all +match, a final semantic sweep ensures that all of the snapshot-related object +metadata is consistent. Errors are reported via logs. + +To scrub all placement groups from a specific pool, execute the following: + +.. prompt:: bash # + + ceph osd pool scrub {pool-name} + +Prioritize backfill/recovery of a Placement Group(s) +==================================================== + +You may run into a situation where a bunch of placement groups will require +recovery and/or backfill, and some particular groups hold data more important +than others (for example, those PGs may hold data for images used by running +machines and other PGs may be used by inactive machines/less relevant data). +In that case, you may want to prioritize recovery of those groups so +performance and/or availability of data stored on those groups is restored +earlier. To do this (mark particular placement group(s) as prioritized during +backfill or recovery), execute the following: + +.. 
prompt:: bash # + + ceph pg force-recovery {pg-id} [{pg-id #2}] [{pg-id #3} ...] + ceph pg force-backfill {pg-id} [{pg-id #2}] [{pg-id #3} ...] + +This will cause Ceph to perform recovery or backfill on specified placement +groups first, before other placement groups. This does not interrupt currently +ongoing backfills or recovery, but causes specified PGs to be processed +as soon as possible. If you change your mind or prioritize wrong groups, +use: + +.. prompt:: bash # + + ceph pg cancel-force-recovery {pg-id} [{pg-id #2}] [{pg-id #3} ...] + ceph pg cancel-force-backfill {pg-id} [{pg-id #2}] [{pg-id #3} ...] + +This will remove "force" flag from those PGs and they will be processed +in default order. Again, this doesn't affect currently processed placement +group, only those that are still queued. + +The "force" flag is cleared automatically after recovery or backfill of group +is done. + +Similarly, you may use the following commands to force Ceph to perform recovery +or backfill on all placement groups from a specified pool first: + +.. prompt:: bash # + + ceph osd pool force-recovery {pool-name} + ceph osd pool force-backfill {pool-name} + +or: + +.. prompt:: bash # + + ceph osd pool cancel-force-recovery {pool-name} + ceph osd pool cancel-force-backfill {pool-name} + +to restore to the default recovery or backfill priority if you change your mind. + +Note that these commands could possibly break the ordering of Ceph's internal +priority computations, so use them with caution! +Especially, if you have multiple pools that are currently sharing the same +underlying OSDs, and some particular pools hold data more important than others, +we recommend you use the following command to re-arrange all pools's +recovery/backfill priority in a better order: + +.. prompt:: bash # + + ceph osd pool set {pool-name} recovery_priority {value} + +For example, if you have 10 pools you could make the most important one priority 10, +next 9, etc. 
Or you could leave most pools alone and have say 3 important pools +all priority 1 or priorities 3, 2, 1 respectively. + +Revert Lost +=========== + +If the cluster has lost one or more objects, and you have decided to +abandon the search for the lost data, you must mark the unfound objects +as ``lost``. + +If all possible locations have been queried and objects are still +lost, you may have to give up on the lost objects. This is +possible given unusual combinations of failures that allow the cluster +to learn about writes that were performed before the writes themselves +are recovered. + +Currently the only supported option is "revert", which will either roll back to +a previous version of the object or (if it was a new object) forget about it +entirely. To mark the "unfound" objects as "lost", execute the following: + +.. prompt:: bash # + + ceph pg {pg-id} mark_unfound_lost revert|delete + +.. important:: Use this feature with caution, because it may confuse + applications that expect the object(s) to exist. + + +.. toctree:: + :hidden: + + pg-states + pg-concepts + + +.. _Create a Pool: ../pools#createpool +.. _Mapping PGs to OSDs: ../../../architecture#mapping-pgs-to-osds +.. _pgcalc: https://old.ceph.com/pgcalc/ diff --git a/doc/rados/operations/pools.rst b/doc/rados/operations/pools.rst new file mode 100644 index 000000000..b44c48460 --- /dev/null +++ b/doc/rados/operations/pools.rst @@ -0,0 +1,900 @@ +.. _rados_pools: + +======= + Pools +======= +Pools are logical partitions that are used to store objects. + +Pools provide: + +- **Resilience**: It is possible to set the number of OSDs that are allowed to + fail without any data being lost. If your cluster uses replicated pools, the + number of OSDs that can fail without data loss is the number of replicas. + For example: a typical configuration stores an object and two additional + copies (that is: ``size = 3``), but you can configure the number of replicas + on a per-pool basis. 
For `erasure coded pools <../erasure-code>`_, resilience + is defined as the number of coding chunks (for example, ``m = 2`` in the + **erasure code profile**). + +- **Placement Groups**: You can set the number of placement groups for the + pool. A typical configuration targets approximately 100 placement groups per + OSD, providing optimal balancing without consuming many computing resources. + When setting up multiple pools, be careful to set a reasonable number of + placement groups for each pool and for the cluster as a whole. Note that each + PG belongs to a specific pool: when multiple pools use the same OSDs, make + sure that the **sum** of PG replicas per OSD is in the desired PG per OSD + target range. Use the `pgcalc`_ tool to calculate the number of placement + groups to set for your pool. + +- **CRUSH Rules**: When data is stored in a pool, the placement of the object + and its replicas (or chunks, in the case of erasure-coded pools) in your + cluster is governed by CRUSH rules. Custom CRUSH rules can be created for a + pool if the default rule does not fit your use case. + +- **Snapshots**: The command ``ceph osd pool mksnap`` creates a snapshot of a + pool. + +Pool Names +========== + +Pool names beginning with ``.`` are reserved for use by Ceph's internal +operations. Please do not create or manipulate pools with these names. + +List Pools +========== + +To list your cluster's pools, execute: + +.. prompt:: bash $ + + ceph osd lspools + +.. _createpool: + +Create a Pool +============= + +Before creating pools, refer to the `Pool, PG and CRUSH Config Reference`_. +Ideally, you should override the default value for the number of placement +groups in your Ceph configuration file, as the default is NOT ideal. +For details on placement group numbers refer to `setting the number of placement groups`_ + +.. note:: Starting with Luminous, all pools need to be associated to the + application using the pool. 
See `Associate Pool to Application`_ below for + more information. + +For example: + +.. prompt:: bash $ + + osd pool default pg num = 100 + osd pool default pgp num = 100 + +To create a pool, execute: + +.. prompt:: bash $ + + ceph osd pool create {pool-name} [{pg-num} [{pgp-num}]] [replicated] \ + [crush-rule-name] [expected-num-objects] + ceph osd pool create {pool-name} [{pg-num} [{pgp-num}]] erasure \ + [erasure-code-profile] [crush-rule-name] [expected_num_objects] [--autoscale-mode=<on,off,warn>] + +Where: + +``{pool-name}`` + +:Description: The name of the pool. It must be unique. +:Type: String +:Required: Yes. + +``{pg-num}`` + +:Description: The total number of placement groups for the pool. See `Placement + Groups`_ for details on calculating a suitable number. The + default value ``8`` is NOT suitable for most systems. + +:Type: Integer +:Required: Yes. +:Default: 8 + +``{pgp-num}`` + +:Description: The total number of placement groups for placement purposes. This + **should be equal to the total number of placement groups**, except + for placement group splitting scenarios. + +:Type: Integer +:Required: Yes. Picks up default or Ceph configuration value if not specified. +:Default: 8 + +``{replicated|erasure}`` + +:Description: The pool type which may either be **replicated** to + recover from lost OSDs by keeping multiple copies of the + objects or **erasure** to get a kind of + `generalized RAID5 <../erasure-code>`_ capability. + The **replicated** pools require more + raw storage but implement all Ceph operations. The + **erasure** pools require less raw storage but only + implement a subset of the available operations. + +:Type: String +:Required: No. +:Default: replicated + +``[crush-rule-name]`` + +:Description: The name of a CRUSH rule to use for this pool. The specified + rule must exist. + +:Type: String +:Required: No. +:Default: For **replicated** pools it is the rule specified by the ``osd + pool default crush rule`` config variable. 
This rule must exist. + For **erasure** pools it is ``erasure-code`` if the ``default`` + `erasure code profile`_ is used or ``{pool-name}`` otherwise. This + rule will be created implicitly if it doesn't exist already. + + +``[erasure-code-profile=profile]`` + +.. _erasure code profile: ../erasure-code-profile + +:Description: For **erasure** pools only. Use the `erasure code profile`_. It + must be an existing profile as defined by + **osd erasure-code-profile set**. + +:Type: String +:Required: No. + +``--autoscale-mode=<on,off,warn>`` + +:Description: Autoscale mode + +:Type: String +:Required: No. +:Default: The default behavior is controlled by the ``osd pool default pg autoscale mode`` option. + +If you set the autoscale mode to ``on`` or ``warn``, you can let the system autotune or recommend changes to the number of placement groups in your pool based on actual usage. If you leave it off, then you should refer to `Placement Groups`_ for more information. + +.. _Placement Groups: ../placement-groups + +``[expected-num-objects]`` + +:Description: The expected number of objects for this pool. By setting this value ( + together with a negative **filestore merge threshold**), the PG folder + splitting would happen at the pool creation time, to avoid the latency + impact to do a runtime folder splitting. + +:Type: Integer +:Required: No. +:Default: 0, no splitting at the pool creation time. + +.. _associate-pool-to-application: + +Associate Pool to Application +============================= + +Pools need to be associated with an application before use. Pools that will be +used with CephFS or pools that are automatically created by RGW are +automatically associated. Pools that are intended for use with RBD should be +initialized using the ``rbd`` tool (see `Block Device Commands`_ for more +information). + +For other cases, you can manually associate a free-form application name to +a pool.: + +.. 
prompt:: bash $ + + ceph osd pool application enable {pool-name} {application-name} + +.. note:: CephFS uses the application name ``cephfs``, RBD uses the + application name ``rbd``, and RGW uses the application name ``rgw``. + +Set Pool Quotas +=============== + +You can set pool quotas for the maximum number of bytes and/or the maximum +number of objects per pool: + +.. prompt:: bash $ + + ceph osd pool set-quota {pool-name} [max_objects {obj-count}] [max_bytes {bytes}] + +For example: + +.. prompt:: bash $ + + ceph osd pool set-quota data max_objects 10000 + +To remove a quota, set its value to ``0``. + + +Delete a Pool +============= + +To delete a pool, execute: + +.. prompt:: bash $ + + ceph osd pool delete {pool-name} [{pool-name} --yes-i-really-really-mean-it] + + +To remove a pool the mon_allow_pool_delete flag must be set to true in the Monitor's +configuration. Otherwise they will refuse to remove a pool. + +See `Monitor Configuration`_ for more information. + +.. _Monitor Configuration: ../../configuration/mon-config-ref + +If you created your own rules for a pool you created, you should consider +removing them when you no longer need your pool: + +.. prompt:: bash $ + + ceph osd pool get {pool-name} crush_rule + +If the rule was "123", for example, you can check the other pools like so: + +.. prompt:: bash $ + + ceph osd dump | grep "^pool" | grep "crush_rule 123" + +If no other pools use that custom rule, then it's safe to delete that +rule from the cluster. + +If you created users with permissions strictly for a pool that no longer +exists, you should consider deleting those users too: + + +.. prompt:: bash $ + + ceph auth ls | grep -C 5 {pool-name} + ceph auth del {user} + + +Rename a Pool +============= + +To rename a pool, execute: + +.. 
prompt:: bash $ + + ceph osd pool rename {current-pool-name} {new-pool-name} + +If you rename a pool and you have per-pool capabilities for an authenticated +user, you must update the user's capabilities (i.e., caps) with the new pool +name. + +Show Pool Statistics +==================== + +To show a pool's utilization statistics, execute: + +.. prompt:: bash $ + + rados df + +Additionally, to obtain I/O information for a specific pool or all, execute: + +.. prompt:: bash $ + + ceph osd pool stats [{pool-name}] + + +Make a Snapshot of a Pool +========================= + +To make a snapshot of a pool, execute: + +.. prompt:: bash $ + + ceph osd pool mksnap {pool-name} {snap-name} + +Remove a Snapshot of a Pool +=========================== + +To remove a snapshot of a pool, execute: + +.. prompt:: bash $ + + ceph osd pool rmsnap {pool-name} {snap-name} + +.. _setpoolvalues: + + +Set Pool Values +=============== + +To set a value to a pool, execute the following: + +.. prompt:: bash $ + + ceph osd pool set {pool-name} {key} {value} + +You may set values for the following keys: + +.. _compression_algorithm: + +``compression_algorithm`` + +:Description: Sets inline compression algorithm to use for underlying BlueStore. This setting overrides the `global setting <https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/#inline-compression>`__ of ``bluestore compression algorithm``. + +:Type: String +:Valid Settings: ``lz4``, ``snappy``, ``zlib``, ``zstd`` + +``compression_mode`` + +:Description: Sets the policy for the inline compression algorithm for underlying BlueStore. This setting overrides the `global setting <http://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/#inline-compression>`__ of ``bluestore compression mode``. + +:Type: String +:Valid Settings: ``none``, ``passive``, ``aggressive``, ``force`` + +``compression_min_blob_size`` + +:Description: Chunks smaller than this are never compressed. 
This setting overrides the `global setting <http://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/#inline-compression>`__ of ``bluestore compression min blob *``. + +:Type: Unsigned Integer + +``compression_max_blob_size`` + +:Description: Chunks larger than this are broken into smaller blobs of size + ``compression_max_blob_size`` before being compressed. + +:Type: Unsigned Integer + +.. _size: + +``size`` + +:Description: Sets the number of replicas for objects in the pool. + See `Set the Number of Object Replicas`_ for further details. + Replicated pools only. + +:Type: Integer + +.. _min_size: + +``min_size`` + +:Description: Sets the minimum number of replicas required for I/O. + See `Set the Number of Object Replicas`_ for further details. + For erasure-coded pools, this should be set to a value + greater than 'k', because if I/O is allowed at the value 'k' there is no + redundancy and data will be lost in the event of a permanent OSD + failure. For more information, see `Erasure Code + <../erasure-code>`_. + +:Type: Integer +:Version: ``0.54`` and above + +.. _pg_num: + +``pg_num`` + +:Description: The effective number of placement groups to use when calculating + data placement. +:Type: Integer +:Valid Range: Greater than the current ``pg_num`` value. + +.. _pgp_num: + +``pgp_num`` + +:Description: The effective number of placement groups for placement to use + when calculating data placement. + +:Type: Integer +:Valid Range: Equal to or less than ``pg_num``. + +.. _crush_rule: + +``crush_rule`` + +:Description: The rule to use for mapping object placement in the cluster. +:Type: String + +.. _allow_ec_overwrites: + +``allow_ec_overwrites`` + +:Description: Whether writes to an erasure coded pool can update part + of an object, so that CephFS and RBD can use it. See + `Erasure Coding with Overwrites`_ for more details. +:Type: Boolean +:Version: ``12.2.0`` and above + +..
_hashpspool: + +``hashpspool`` + +:Description: Set/Unset HASHPSPOOL flag on a given pool. +:Type: Integer +:Valid Range: 1 sets flag, 0 unsets flag + +.. _nodelete: + +``nodelete`` + +:Description: Set/Unset NODELETE flag on a given pool. +:Type: Integer +:Valid Range: 1 sets flag, 0 unsets flag +:Version: Version ``FIXME`` + +.. _nopgchange: + +``nopgchange`` + +:Description: Set/Unset NOPGCHANGE flag on a given pool. +:Type: Integer +:Valid Range: 1 sets flag, 0 unsets flag +:Version: Version ``FIXME`` + +.. _nosizechange: + +``nosizechange`` + +:Description: Set/Unset NOSIZECHANGE flag on a given pool. +:Type: Integer +:Valid Range: 1 sets flag, 0 unsets flag +:Version: Version ``FIXME`` + +.. _bulk: + +.. describe:: bulk + + Set/Unset bulk flag on a given pool. + + :Type: Boolean + :Valid Range: true/1 sets flag, false/0 unsets flag + +.. _write_fadvise_dontneed: + +``write_fadvise_dontneed`` + +:Description: Set/Unset WRITE_FADVISE_DONTNEED flag on a given pool. +:Type: Integer +:Valid Range: 1 sets flag, 0 unsets flag + +.. _noscrub: + +``noscrub`` + +:Description: Set/Unset NOSCRUB flag on a given pool. +:Type: Integer +:Valid Range: 1 sets flag, 0 unsets flag + +.. _nodeep-scrub: + +``nodeep-scrub`` + +:Description: Set/Unset NODEEP_SCRUB flag on a given pool. +:Type: Integer +:Valid Range: 1 sets flag, 0 unsets flag + +.. _hit_set_type: + +``hit_set_type`` + +:Description: Enables hit set tracking for cache pools. + See `Bloom Filter`_ for additional information. + +:Type: String +:Valid Settings: ``bloom``, ``explicit_hash``, ``explicit_object`` +:Default: ``bloom``. Other values are for testing. + +.. _hit_set_count: + +``hit_set_count`` + +:Description: The number of hit sets to store for cache pools. The higher + the number, the more RAM consumed by the ``ceph-osd`` daemon. + +:Type: Integer +:Valid Range: ``1``. Agent doesn't handle > 1 yet. + +.. 
_hit_set_period: + +``hit_set_period`` + +:Description: The duration of a hit set period in seconds for cache pools. + The higher the number, the more RAM consumed by the + ``ceph-osd`` daemon. + +:Type: Integer +:Example: ``3600`` 1hr + +.. _hit_set_fpp: + +``hit_set_fpp`` + +:Description: The false positive probability for the ``bloom`` hit set type. + See `Bloom Filter`_ for additional information. + +:Type: Double +:Valid Range: 0.0 - 1.0 +:Default: ``0.05`` + +.. _cache_target_dirty_ratio: + +``cache_target_dirty_ratio`` + +:Description: The percentage of the cache pool containing modified (dirty) + objects before the cache tiering agent will flush them to the + backing storage pool. + +:Type: Double +:Default: ``.4`` + +.. _cache_target_dirty_high_ratio: + +``cache_target_dirty_high_ratio`` + +:Description: The percentage of the cache pool containing modified (dirty) + objects before the cache tiering agent will flush them to the + backing storage pool with a higher speed. + +:Type: Double +:Default: ``.6`` + +.. _cache_target_full_ratio: + +``cache_target_full_ratio`` + +:Description: The percentage of the cache pool containing unmodified (clean) + objects before the cache tiering agent will evict them from the + cache pool. + +:Type: Double +:Default: ``.8`` + +.. _target_max_bytes: + +``target_max_bytes`` + +:Description: Ceph will begin flushing or evicting objects when the + ``max_bytes`` threshold is triggered. + +:Type: Integer +:Example: ``1000000000000`` #1-TB + +.. _target_max_objects: + +``target_max_objects`` + +:Description: Ceph will begin flushing or evicting objects when the + ``max_objects`` threshold is triggered. 
+ +:Type: Integer +:Example: ``1000000`` #1M objects + + +``hit_set_grade_decay_rate`` + +:Description: Temperature decay rate between two successive hit_sets +:Type: Integer +:Valid Range: 0 - 100 +:Default: ``20`` + + +``hit_set_search_last_n`` + +:Description: Count at most N appearances in hit_sets for temperature calculation +:Type: Integer +:Valid Range: 0 - hit_set_count +:Default: ``1`` + + +.. _cache_min_flush_age: + +``cache_min_flush_age`` + +:Description: The time (in seconds) before the cache tiering agent will flush + an object from the cache pool to the storage pool. + +:Type: Integer +:Example: ``600`` 10min + +.. _cache_min_evict_age: + +``cache_min_evict_age`` + +:Description: The time (in seconds) before the cache tiering agent will evict + an object from the cache pool. + +:Type: Integer +:Example: ``1800`` 30min + +.. _fast_read: + +``fast_read`` + +:Description: On an erasure-coded pool, if this flag is turned on, a read request + issues sub-reads to all shards and waits until it receives enough + shards to decode and serve the client. In the case of the jerasure and isa + erasure plugins, once the first K replies return, the client's request is + served immediately using the data decoded from these replies. This + trades some resources for better performance. Currently this + flag is only supported for erasure-coded pools. + +:Type: Boolean +:Default: ``0`` + +.. _scrub_min_interval: + +``scrub_min_interval`` + +:Description: The minimum interval in seconds for pool scrubbing when + load is low. If it is 0, the value osd_scrub_min_interval + from config is used. + +:Type: Double +:Default: ``0`` + +.. _scrub_max_interval: + +``scrub_max_interval`` + +:Description: The maximum interval in seconds for pool scrubbing + irrespective of cluster load. If it is 0, the value + osd_scrub_max_interval from config is used. + +:Type: Double +:Default: ``0`` + +..
_deep_scrub_interval: + +``deep_scrub_interval`` + +:Description: The interval in seconds for pool “deep” scrubbing. If it + is 0, the value osd_deep_scrub_interval from config is used. + +:Type: Double +:Default: ``0`` + + +.. _recovery_priority: + +``recovery_priority`` + +:Description: When a value is set it will increase or decrease the computed + reservation priority. This value must be in the range -10 to + 10. Use a negative priority for less important pools so they + have lower priority than any new pools. + +:Type: Integer +:Default: ``0`` + + +.. _recovery_op_priority: + +``recovery_op_priority`` + +:Description: Specify the recovery operation priority for this pool instead of ``osd_recovery_op_priority``. + +:Type: Integer +:Default: ``0`` + + +Get Pool Values +=============== + +To get a value from a pool, execute the following: + +.. prompt:: bash $ + + ceph osd pool get {pool-name} {key} + +You may get values for the following keys: + +``size`` + +:Description: see size_ + +:Type: Integer + +``min_size`` + +:Description: see min_size_ + +:Type: Integer +:Version: ``0.54`` and above + +``pg_num`` + +:Description: see pg_num_ + +:Type: Integer + + +``pgp_num`` + +:Description: see pgp_num_ + +:Type: Integer +:Valid Range: Equal to or less than ``pg_num``. 
+ + +``crush_rule`` + +:Description: see crush_rule_ + + +``hit_set_type`` + +:Description: see hit_set_type_ + +:Type: String +:Valid Settings: ``bloom``, ``explicit_hash``, ``explicit_object`` + +``hit_set_count`` + +:Description: see hit_set_count_ + +:Type: Integer + + +``hit_set_period`` + +:Description: see hit_set_period_ + +:Type: Integer + + +``hit_set_fpp`` + +:Description: see hit_set_fpp_ + +:Type: Double + + +``cache_target_dirty_ratio`` + +:Description: see cache_target_dirty_ratio_ + +:Type: Double + + +``cache_target_dirty_high_ratio`` + +:Description: see cache_target_dirty_high_ratio_ + +:Type: Double + + +``cache_target_full_ratio`` + +:Description: see cache_target_full_ratio_ + +:Type: Double + + +``target_max_bytes`` + +:Description: see target_max_bytes_ + +:Type: Integer + + +``target_max_objects`` + +:Description: see target_max_objects_ + +:Type: Integer + + +``cache_min_flush_age`` + +:Description: see cache_min_flush_age_ + +:Type: Integer + + +``cache_min_evict_age`` + +:Description: see cache_min_evict_age_ + +:Type: Integer + + +``fast_read`` + +:Description: see fast_read_ + +:Type: Boolean + + +``scrub_min_interval`` + +:Description: see scrub_min_interval_ + +:Type: Double + + +``scrub_max_interval`` + +:Description: see scrub_max_interval_ + +:Type: Double + + +``deep_scrub_interval`` + +:Description: see deep_scrub_interval_ + +:Type: Double + + +``allow_ec_overwrites`` + +:Description: see allow_ec_overwrites_ + +:Type: Boolean + + +``recovery_priority`` + +:Description: see recovery_priority_ + +:Type: Integer + + +``recovery_op_priority`` + +:Description: see recovery_op_priority_ + +:Type: Integer + + +Set the Number of Object Replicas +================================= + +To set the number of object replicas on a replicated pool, execute the following: + +.. prompt:: bash $ + + ceph osd pool set {poolname} size {num-replicas} + +.. important:: The ``{num-replicas}`` includes the object itself. 
+ If you want the object and two copies of the object for a total of + three instances of the object, specify ``3``. + +For example: + +.. prompt:: bash $ + + ceph osd pool set data size 3 + +You may execute this command for each pool. **Note:** An object might accept +I/Os in degraded mode with fewer than ``pool size`` replicas. To set a minimum +number of required replicas for I/O, you should use the ``min_size`` setting. +For example: + +.. prompt:: bash $ + + ceph osd pool set data min_size 2 + +This ensures that no object in the data pool will receive I/O with fewer than +``min_size`` replicas. + + +Get the Number of Object Replicas +================================= + +To get the number of object replicas, execute the following: + +.. prompt:: bash $ + + ceph osd dump | grep 'replicated size' + +Ceph will list the pools, with the ``replicated size`` attribute highlighted. +By default, ceph creates two replicas of an object (a total of three copies, or +a size of 3). + + +.. _pgcalc: https://old.ceph.com/pgcalc/ +.. _Pool, PG and CRUSH Config Reference: ../../configuration/pool-pg-config-ref +.. _Bloom Filter: https://en.wikipedia.org/wiki/Bloom_filter +.. _setting the number of placement groups: ../placement-groups#set-the-number-of-placement-groups +.. _Erasure Coding with Overwrites: ../erasure-code#erasure-coding-with-overwrites +.. _Block Device Commands: ../../../rbd/rados-rbd-cmds/#create-a-block-device-pool diff --git a/doc/rados/operations/stretch-mode.rst b/doc/rados/operations/stretch-mode.rst new file mode 100644 index 000000000..6b4c9ba8a --- /dev/null +++ b/doc/rados/operations/stretch-mode.rst @@ -0,0 +1,215 @@ +.. _stretch_mode: + +================ +Stretch Clusters +================ + + +Stretch Clusters +================ +Ceph generally expects all parts of its network and overall cluster to be +equally reliable, with failures randomly distributed across the CRUSH map. 
+So you may lose a switch that knocks out a number of OSDs, but we expect +the remaining OSDs and monitors to route around that. + +This is usually a good choice, but may not work well in some +stretched cluster configurations where a significant part of your cluster +is stuck behind a single network component. For instance, you might have a single +cluster that is located in multiple data centers and want it to +sustain the loss of a full DC. + +There are two standard configurations we've seen deployed, with either +two or three data centers (or, in clouds, availability zones). With two +zones, we expect each site to hold a copy of the data, and for a third +site to have a tiebreaker monitor (this can be a VM or high-latency compared +to the main sites) to pick a winner if the network connection fails and both +DCs remain alive. For three sites, we expect a copy of the data and an equal +number of monitors in each site. + +Note that the standard Ceph configuration will survive MANY failures of the +network or data centers and it will never compromise data consistency. If you +bring back enough Ceph servers following a failure, it will recover. If you +lose a data center, but can still form a quorum of monitors and have all the data +available (with enough copies to satisfy pools' ``min_size``, or CRUSH rules +that will re-replicate to meet it), Ceph will maintain availability. + +What can't it handle? + +Stretch Cluster Issues +====================== +No matter what happens, Ceph will not compromise on data integrity +and consistency. If there's a failure in your network or a loss of nodes and +you can restore service, Ceph will return to normal functionality on its own. + +But there are scenarios where you lose data availability despite having +enough servers available to satisfy Ceph's consistency and sizing constraints, or +where you may be surprised to not satisfy Ceph's constraints.
+The first important category of these failures revolves around inconsistent +networks -- if there's a netsplit, Ceph may be unable to mark OSDs down and kick +them out of the acting PG sets despite the primary being unable to replicate data. +If this happens, IO will not be permitted, because Ceph can't satisfy its durability +guarantees. + +The second important category of failures is when you think you have data replicated +across data centers, but the constraints aren't sufficient to guarantee this. +For instance, you might have data centers A and B, and your CRUSH rule targets 3 copies +and places a copy in each data center with a ``min_size`` of 2. The PG may go active with +2 copies in site A and no copies in site B, which means that if you then lose site A you +have lost data and Ceph can't operate on it. This situation is surprisingly difficult +to avoid with standard CRUSH rules. + +Stretch Mode +============ +The new stretch mode is designed to handle the 2-site case. Three sites are +just as susceptible to netsplit issues, but are much more tolerant of +component availability outages than 2-site clusters are. + +To enter stretch mode, you must set the location of each monitor, matching +your CRUSH map. For instance, to place ``mon.a`` in your first data center: + +.. prompt:: bash $ + + ceph mon set_location a datacenter=site1 + +Next, generate a CRUSH rule which will place 2 copies in each data center. This +will require editing the CRUSH map directly: + +.. prompt:: bash $ + + ceph osd getcrushmap > crush.map.bin + crushtool -d crush.map.bin -o crush.map.txt + +Now edit the ``crush.map.txt`` file to add a new rule. Here +there is only one other rule, so this is ID 1, but you may need +to use a different rule ID.
We also have two datacenter buckets +named ``site1`` and ``site2``:: + + rule stretch_rule { + id 1 + type replicated + min_size 1 + max_size 10 + step take site1 + step chooseleaf firstn 2 type host + step emit + step take site2 + step chooseleaf firstn 2 type host + step emit + } + +Finally, inject the CRUSH map to make the rule available to the cluster: + +.. prompt:: bash $ + + crushtool -c crush.map.txt -o crush2.map.bin + ceph osd setcrushmap -i crush2.map.bin + +If you aren't already running your monitors in connectivity mode, do so with +the instructions in `Changing Monitor Elections`_. + +.. _Changing Monitor Elections: ../change-mon-elections + +And lastly, tell the cluster to enter stretch mode. Here, ``mon.e`` is the +tiebreaker and we are splitting across data centers. ``mon.e`` must also be +assigned a data center, and it must differ from ``site1`` and ``site2``. For this +purpose you can create another datacenter bucket named ``site3`` in your +CRUSH map and place ``mon.e`` there: + +.. prompt:: bash $ + + ceph mon set_location e datacenter=site3 + ceph mon enable_stretch_mode e stretch_rule datacenter + +When stretch mode is enabled, the OSDs will only take PGs active when +they peer across data centers (or whatever other CRUSH bucket type +you specified), assuming both are alive. Pools will increase in size +from the default 3 to 4, expecting 2 copies in each site. OSDs will only +be allowed to connect to monitors in the same data center. New monitors +will not be allowed to join the cluster if they do not specify a location. + +If all the OSDs and monitors from a data center become inaccessible +at once, the surviving data center will enter a degraded stretch mode. This +will issue a warning, reduce the min_size to 1, and allow +the cluster to go active with data in the single remaining site.
Note that +we do not change the pool size, so you will also get warnings that the +pools are too small -- but a special stretch mode flag will prevent the OSDs +from creating extra copies in the remaining data center (so it will only keep +2 copies, as before). + +When the missing data center comes back, the cluster will enter +recovery stretch mode. This changes the warning and allows peering, but +still only requires OSDs from the data center which was up the whole time. +When all PGs are in a known state, and are neither degraded nor incomplete, +the cluster transitions back to regular stretch mode, ends the warning, +restores min_size to its starting value (2) and requires both sites to peer, +and stops requiring the always-alive site when peering (so that you can fail +over to the other site, if necessary). + + +Stretch Mode Limitations +======================== +As implied by the setup, stretch mode only handles 2 sites with OSDs. + +While it is not enforced, you should run 2 monitors in each site plus +a tiebreaker, for a total of 5. This is because OSDs can only connect +to monitors in their own site when in stretch mode. + +You cannot use erasure coded pools with stretch mode. If you try, it will +refuse, and it will not allow you to create EC pools once in stretch mode. + +You must create your own CRUSH rule which provides 2 copies in each site, and +you must use 4 total copies with 2 in each site. If you have existing pools +with non-default size/min_size, Ceph will object when you attempt to +enable stretch mode. + +Because it runs with ``min_size 1`` when degraded, you should only use stretch +mode with all-flash OSDs. This minimizes the time needed to recover once +connectivity is restored, and thus minimizes the potential for data loss. + +Hopefully, future development will extend this feature to support EC pools and +running with more than 2 full sites. 
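The recommended monitor layout described above (two monitors in each data site plus a tiebreaker in a third location) could be scripted along the following lines. This is a minimal dry-run sketch, not part of Ceph: the function name ``print_set_location_cmds`` and the monitor-to-site assignments are hypothetical, and the helper only prints the ``ceph mon set_location`` commands shown earlier instead of running them.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Ceph): print, but do not run, one
# "ceph mon set_location" command for each MON=SITE argument.
print_set_location_cmds() {
    local pair
    for pair in "$@"; do
        # ${pair%%=*} is the monitor ID; ${pair#*=} is the datacenter bucket.
        printf 'ceph mon set_location %s datacenter=%s\n' "${pair%%=*}" "${pair#*=}"
    done
}

# Assumed layout: mons a and b in site1, c and d in site2, tiebreaker e in site3.
print_set_location_cmds a=site1 b=site1 c=site2 d=site2 e=site3
```

Reviewing the printed commands before piping them to a shell keeps the monitor locations easy to compare against the bucket names in your CRUSH map.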
+ +Other commands +============== +If your tiebreaker monitor fails for some reason, you can replace it. Turn on +a new monitor and run: + +.. prompt:: bash $ + + ceph mon set_new_tiebreaker mon.<new_mon_name> + +This command will protest if the new monitor is in the same location as existing +non-tiebreaker monitors. This command WILL NOT remove the previous tiebreaker +monitor; you should do so yourself. + +Also in 16.2.7, if you are writing your own tooling for deploying Ceph, you can use a new +``--set-crush-location`` option when booting monitors, instead of running +``ceph mon set_location``. This option accepts only a single "bucket=loc" pair, e.g., +``ceph-mon --set-crush-location 'datacenter=a'``, which must match the +bucket type you specified when running ``enable_stretch_mode``. + + +When in stretch degraded mode, the cluster will go into "recovery" mode automatically +when the disconnected data center comes back. If that doesn't work, or you want to +enable recovery mode early, you can invoke: + +.. prompt:: bash $ + + ceph osd force_recovery_stretch_mode --yes-i-really-mean-it + +But this command should not be necessary; it is included to deal with +unanticipated situations. + +When in recovery mode, the cluster should go back into normal stretch mode +when the PGs are healthy. If this doesn't happen, or you want to force the +cross-data-center peering early and are willing to risk data downtime (or have +verified separately that all the PGs can peer, even if they aren't fully +recovered), you can invoke: + +.. prompt:: bash $ + + ceph osd force_healthy_stretch_mode --yes-i-really-mean-it + +This command should not be necessary; it is included to deal with +unanticipated situations. But you might wish to invoke it to remove +the ``HEALTH_WARN`` state which recovery mode generates.
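The location constraint on a replacement tiebreaker could be checked before the command is issued. The following is a minimal dry-run sketch under stated assumptions: ``plan_new_tiebreaker`` is a hypothetical helper, the site names are examples, and the function only prints the ``ceph mon set_new_tiebreaker`` command rather than running it. The cluster performs its own check when the real command runs.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Ceph).
# Usage: plan_new_tiebreaker MON_NAME MON_SITE DATA_SITE...
# Rejects a tiebreaker whose site matches any data site; otherwise prints
# the command that would make MON_NAME the new tiebreaker.
plan_new_tiebreaker() {
    local mon="$1" site="$2"
    shift 2
    local data_site
    for data_site in "$@"; do
        if [ "$site" = "$data_site" ]; then
            echo "error: tiebreaker site '$site' is also a data site" >&2
            return 1
        fi
    done
    printf 'ceph mon set_new_tiebreaker mon.%s\n' "$mon"
}

# Assumed example: new monitor "f" in site3; data sites are site1 and site2.
plan_new_tiebreaker f site3 site1 site2
```

Remember that the previous tiebreaker monitor is not removed automatically, so any real tooling would still need a separate removal step.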
diff --git a/doc/rados/operations/upmap.rst b/doc/rados/operations/upmap.rst new file mode 100644 index 000000000..343adf2c4 --- /dev/null +++ b/doc/rados/operations/upmap.rst @@ -0,0 +1,105 @@ +.. _upmap: + +Using the pg-upmap +================== + +Starting in Luminous v12.2.z there is a new *pg-upmap* exception table +in the OSDMap that allows the cluster to explicitly map specific PGs to +specific OSDs. This allows the cluster to fine-tune the data +distribution so that, in most cases, PGs are perfectly distributed across OSDs. + +The key caveat to this new mechanism is that it requires that all +clients understand the new *pg-upmap* structure in the OSDMap. + +Enabling +-------- + +New clusters will have this module on by default. The cluster must only +have luminous (and newer) clients. You can turn the balancer off with: + +.. prompt:: bash $ + + ceph balancer off + +To allow use of the feature on existing clusters, you must tell the +cluster that it only needs to support luminous (and newer) clients with: + +.. prompt:: bash $ + + ceph osd set-require-min-compat-client luminous + +This command will fail if any pre-luminous clients or daemons are +connected to the monitors. You can see what client versions are in +use with: + +.. prompt:: bash $ + + ceph features + +Balancer module +----------------- + +The `balancer` module for ceph-mgr will automatically balance +the number of PGs per OSD. See :ref:`balancer`. + + +Offline optimization +-------------------- + +Upmap entries are updated with an offline optimizer built into ``osdmaptool``. + +#. Grab the latest copy of your osdmap: + + .. prompt:: bash $ + + ceph osd getmap -o om + +#. Run the optimizer: + + .. prompt:: bash $ + + osdmaptool om --upmap out.txt [--upmap-pool <pool>] \ + [--upmap-max <max-optimizations>] \ + [--upmap-deviation <max-deviation>] \ + [--upmap-active] + + It is highly recommended that optimization be done for each pool + individually, or for sets of similarly-utilized pools.
You can + specify the ``--upmap-pool`` option multiple times. "Similar pools" + means pools that are mapped to the same devices and store the same + kind of data (e.g., RBD image pools, yes; RGW index pool and RGW + data pool, no). + + The ``max-optimizations`` value is the maximum number of upmap entries to + identify in the run. The default is `10` like the ceph-mgr balancer module, + but you should use a larger number if you are doing offline optimization. + If it cannot find any additional changes to make it will stop early + (i.e., when the pool distribution is perfect). + + The ``max-deviation`` value defaults to `5`. If an OSD PG count + varies from the computed target number by less than or equal + to this amount it will be considered perfect. + + The ``--upmap-active`` option simulates the behavior of the active + balancer in upmap mode. It keeps cycling until the OSDs are balanced + and reports how many rounds and how long each round is taking. The + elapsed time for rounds indicates the CPU load ceph-mgr will be + consuming when it tries to compute the next optimization plan. + +#. Apply the changes: + + .. prompt:: bash $ + + source out.txt + + The proposed changes are written to the output file ``out.txt`` in + the example above. These are normal ceph CLI commands that can be + run to apply the changes to the cluster. + + +The above steps can be repeated as many times as necessary to achieve +a perfect distribution of PGs for each set of pools. + +You can see some (gory) details about what the tool is doing by +passing ``--debug-osd 10`` and even more with ``--debug-crush 10`` +to ``osdmaptool``. diff --git a/doc/rados/operations/user-management.rst b/doc/rados/operations/user-management.rst new file mode 100644 index 000000000..78d77236d --- /dev/null +++ b/doc/rados/operations/user-management.rst @@ -0,0 +1,823 @@ +.. 
_user-management: + +================= + User Management +================= + +This document describes :term:`Ceph Client` users, and their authentication and +authorization with the :term:`Ceph Storage Cluster`. Users are either +individuals or system actors such as applications, which use Ceph clients to +interact with the Ceph Storage Cluster daemons. + +.. ditaa:: + +-----+ + | {o} | + | | + +--+--+ /---------\ /---------\ + | | Ceph | | Ceph | + ---+---*----->| |<------------->| | + | uses | Clients | | Servers | + | \---------/ \---------/ + /--+--\ + | | + | | + actor + + +When Ceph runs with authentication and authorization enabled (enabled by +default), you must specify a user name and a keyring containing the secret key +of the specified user (usually via the command line). If you do not specify a +user name, Ceph will use ``client.admin`` as the default user name. If you do +not specify a keyring, Ceph will look for a keyring via the ``keyring`` setting +in the Ceph configuration. For example, if you execute the ``ceph health`` +command without specifying a user or keyring: + +.. prompt:: bash $ + + ceph health + +Ceph interprets the command like this: + +.. prompt:: bash $ + + ceph -n client.admin --keyring=/etc/ceph/ceph.client.admin.keyring health + +Alternatively, you may use the ``CEPH_ARGS`` environment variable to avoid +re-entry of the user name and secret. + +For details on configuring the Ceph Storage Cluster to use authentication, +see `Cephx Config Reference`_. For details on the architecture of Cephx, see +`Architecture - High Availability Authentication`_. + +Background +========== + +Irrespective of the type of Ceph client (e.g., Block Device, Object Storage, +Filesystem, native API, etc.), Ceph stores all data as objects within `pools`_. +Ceph users must have access to pools in order to read and write data. +Additionally, Ceph users must have execute permissions to use Ceph's +administrative commands. 
The following concepts will help you understand Ceph +user management. + +User +---- + +A user is either an individual or a system actor such as an application. +Creating users allows you to control who (or what) can access your Ceph Storage +Cluster, its pools, and the data within pools. + +Ceph has the notion of a ``type`` of user. For the purposes of user management, +the type will always be ``client``. Ceph identifies users in period (.) +delimited form consisting of the user type and the user ID: for example, +``TYPE.ID``, ``client.admin``, or ``client.user1``. The reason for user typing +is that Ceph Monitors, OSDs, and Metadata Servers also use the Cephx protocol, +but they are not clients. Distinguishing the user type helps to distinguish +between client users and other users--streamlining access control, user +monitoring and traceability. + +Sometimes Ceph's user type may seem confusing, because the Ceph command line +allows you to specify a user with or without the type, depending upon your +command line usage. If you specify ``--user`` or ``--id``, you can omit the +type. So ``client.user1`` can be entered simply as ``user1``. If you specify +``--name`` or ``-n``, you must specify the type and name, such as +``client.user1``. We recommend using the type and name as a best practice +wherever possible. + +.. note:: A Ceph Storage Cluster user is not the same as a Ceph Object Storage + user or a Ceph File System user. The Ceph Object Gateway uses a Ceph Storage + Cluster user to communicate between the gateway daemon and the storage + cluster, but the gateway has its own user management functionality for end + users. The Ceph File System uses POSIX semantics. The user space associated + with the Ceph File System is not the same as a Ceph Storage Cluster user. 
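The naming rule described above (a bare ID with ``--id`` or ``--user``, the full ``TYPE.ID`` form with ``--name`` or ``-n``) can be illustrated with a small helper. This is a minimal sketch, not part of Ceph: ``qualify_ceph_user`` is a hypothetical function, and it assumes that a name is already qualified only when it starts with one of the type prefixes listed in the code.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Ceph): print the TYPE.ID form of a user.
# Assumption: only names starting with a known type prefix are treated as
# already qualified; anything else is taken to be a bare client ID.
qualify_ceph_user() {
    case "$1" in
        client.*|mon.*|osd.*|mds.*|mgr.*) printf '%s\n' "$1" ;;
        *) printf 'client.%s\n' "$1" ;;
    esac
}

qualify_ceph_user user1          # bare ID, as accepted by --id or --user
qualify_ceph_user client.admin   # already TYPE.ID, as required by --name or -n
```

Normalizing to the ``TYPE.ID`` form in scripts follows the recommendation above to use the type and name wherever possible.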
+ + + +Authorization (Capabilities) +---------------------------- + +Ceph uses the term "capabilities" (caps) to describe authorizing an +authenticated user to exercise the functionality of the monitors, OSDs and +metadata servers. Capabilities can also restrict access to data within a pool, +a namespace within a pool, or a set of pools based on their application tags. +A Ceph administrative user sets a user's capabilities when creating or updating +a user. + +Capability syntax follows the form:: + + {daemon-type} '{cap-spec}[, {cap-spec} ...]' + +- **Monitor Caps:** Monitor capabilities include ``r``, ``w``, ``x`` access + settings or ``profile {name}``. For example:: + + mon 'allow {access-spec} [network {network/prefix}]' + + mon 'profile {name}' + + The ``{access-spec}`` syntax is as follows: :: + + * | all | [r][w][x] + + The optional ``{network/prefix}`` is a standard network name and + prefix length in CIDR notation (e.g., ``10.3.0.0/16``). If present, + the use of this capability is restricted to clients connecting from + this network. + +- **OSD Caps:** OSD capabilities include ``r``, ``w``, ``x``, ``class-read``, + ``class-write`` access settings or ``profile {name}``. Additionally, OSD + capabilities also allow for pool and namespace settings. :: + + osd 'allow {access-spec} [{match-spec}] [network {network/prefix}]' + + osd 'profile {name} [pool={pool-name} [namespace={namespace-name}]] [network {network/prefix}]' + + The ``{access-spec}`` syntax is either of the following: :: + + * | all | [r][w][x] [class-read] [class-write] + + class {class name} [{method name}] + + The optional ``{match-spec}`` syntax is either of the following: :: + + pool={pool-name} [namespace={namespace-name}] [object_prefix {prefix}] + + [namespace={namespace-name}] tag {application} {key}={value} + + The optional ``{network/prefix}`` is a standard network name and + prefix length in CIDR notation (e.g., ``10.3.0.0/16``). 
If present, + the use of this capability is restricted to clients connecting from + this network. + +- **Manager Caps:** Manager (``ceph-mgr``) capabilities include + ``r``, ``w``, ``x`` access settings or ``profile {name}``. For example: :: + + mgr 'allow {access-spec} [network {network/prefix}]' + + mgr 'profile {name} [{key1} {match-type} {value1} ...] [network {network/prefix}]' + + Manager capabilities can also be specified for specific commands, + all commands exported by a built-in manager service, or all commands + exported by a specific add-on module. For example: :: + + mgr 'allow command "{command-prefix}" [with {key1} {match-type} {value1} ...] [network {network/prefix}]' + + mgr 'allow service {service-name} {access-spec} [network {network/prefix}]' + + mgr 'allow module {module-name} [with {key1} {match-type} {value1} ...] {access-spec} [network {network/prefix}]' + + The ``{access-spec}`` syntax is as follows: :: + + * | all | [r][w][x] + + The ``{service-name}`` is one of the following: :: + + mgr | osd | pg | py + + The ``{match-type}`` is one of the following: :: + + = | prefix | regex + +- **Metadata Server Caps:** For administrators, use ``allow *``. For all + other users, such as CephFS clients, consult :doc:`/cephfs/client-auth` + + +.. note:: The Ceph Object Gateway daemon (``radosgw``) is a client of the + Ceph Storage Cluster, so it is not represented as a Ceph Storage + Cluster daemon type. + +The following entries describe each access capability. + +``allow`` + +:Description: Precedes access settings for a daemon. Implies ``rw`` + for MDS only. + + +``r`` + +:Description: Gives the user read access. Required with monitors to retrieve + the CRUSH map. + + +``w`` + +:Description: Gives the user write access to objects. + + +``x`` + +:Description: Gives the user the capability to call class methods + (i.e., both read and write) and to conduct ``auth`` + operations on monitors. 
+ + +``class-read`` + +:Description: Gives the user the capability to call class read methods. + Subset of ``x``. + + +``class-write`` + +:Description: Gives the user the capability to call class write methods. + Subset of ``x``. + + +``*``, ``all`` + +:Description: Gives the user read, write and execute permissions for a + particular daemon/pool, and the ability to execute + admin commands. + +The following entries describe valid capability profiles: + +``profile osd`` (Monitor only) + +:Description: Gives a user permissions to connect as an OSD to other OSDs or + monitors. Conferred on OSDs to enable OSDs to handle replication + heartbeat traffic and status reporting. + + +``profile mds`` (Monitor only) + +:Description: Gives a user permissions to connect as an MDS to other MDSs or + monitors. + + +``profile bootstrap-osd`` (Monitor only) + +:Description: Gives a user permissions to bootstrap an OSD. Conferred on + deployment tools such as ``ceph-volume``, ``cephadm``, etc. + so that they have permissions to add keys, etc. when + bootstrapping an OSD. + + +``profile bootstrap-mds`` (Monitor only) + +:Description: Gives a user permissions to bootstrap a metadata server. + Conferred on deployment tools such as ``cephadm``, etc. + so they have permissions to add keys, etc. when bootstrapping + a metadata server. + +``profile bootstrap-rbd`` (Monitor only) + +:Description: Gives a user permissions to bootstrap an RBD user. + Conferred on deployment tools such as ``cephadm``, etc. + so they have permissions to add keys, etc. when bootstrapping + an RBD user. + +``profile bootstrap-rbd-mirror`` (Monitor only) + +:Description: Gives a user permissions to bootstrap an ``rbd-mirror`` daemon + user. Conferred on deployment tools such as ``cephadm``, etc. + so they have permissions to add keys, etc. when bootstrapping + an ``rbd-mirror`` daemon. + +``profile rbd`` (Manager, Monitor, and OSD) + +:Description: Gives a user permissions to manipulate RBD images.
When used + as a Monitor cap, it provides the minimal privileges required + by an RBD client application; this includes the ability + to blocklist other client users. When used as an OSD cap, it + provides read-write access to the specified pool to an + RBD client application. The Manager cap supports optional + ``pool`` and ``namespace`` keyword arguments. + +``profile rbd-mirror`` (Monitor only) + +:Description: Gives a user permissions to manipulate RBD images and retrieve + RBD mirroring config-key secrets. It provides the minimal + privileges required for the ``rbd-mirror`` daemon. + +``profile rbd-read-only`` (Manager and OSD) + +:Description: Gives a user read-only permissions to RBD images. The Manager + cap supports optional ``pool`` and ``namespace`` keyword + arguments. + +``profile simple-rados-client`` (Monitor only) + +:Description: Gives a user read-only permissions for monitor, OSD, and PG data. + Intended for use by direct librados client applications. + +``profile simple-rados-client-with-blocklist`` (Monitor only) + +:Description: Gives a user read-only permissions for monitor, OSD, and PG data. + Intended for use by direct librados client applications. Also + includes permission to add blocklist entries to build HA + applications. + +``profile fs-client`` (Monitor only) + +:Description: Gives a user read-only permissions for monitor, OSD, PG, and MDS + data. Intended for CephFS clients. + +``profile role-definer`` (Monitor and Auth) + +:Description: Gives a user **all** permissions for the auth subsystem, read-only + access to monitors, and nothing else. Useful for automation + tools. Do not assign this unless you really, **really** know what + you're doing as the security ramifications are substantial and + pervasive. + +``profile crash`` (Monitor only) + +:Description: Gives a user read-only access to monitors, used in conjunction + with the manager ``crash`` module when collecting daemon crash + dumps for later analysis. 
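The capability syntax described in this section can be assembled programmatically. The following Python sketch builds an OSD capability string in the ``allow {access-spec} [{match-spec}] [network {network/prefix}]`` form and prints a ``ceph auth get-or-create`` command line without running it. The helper names ``osd_cap`` and ``get_or_create_argv`` are hypothetical, and only the ``pool=... [namespace=...]`` form of the match-spec is covered.

```python
import shlex


def osd_cap(access, pool=None, namespace=None, network=None):
    """Build an OSD capability string such as 'allow rw pool=liverpool'."""
    if namespace and not pool:
        # The tag-based match-spec is outside the scope of this sketch.
        raise ValueError("this sketch only handles namespace together with pool")
    parts = ["allow", access]
    if pool:
        parts.append(f"pool={pool}")
    if namespace:
        parts.append(f"namespace={namespace}")
    if network:
        parts += ["network", network]
    return " ".join(parts)


def get_or_create_argv(user, mon_cap, osd_cap_str):
    """Return the argument vector for 'ceph auth get-or-create' (not executed)."""
    return ["ceph", "auth", "get-or-create", user, "mon", mon_cap, "osd", osd_cap_str]


argv = get_or_create_argv("client.paul", "allow r", osd_cap("rw", pool="liverpool"))
print(shlex.join(argv))
```

Building the argument vector as a list and quoting it only for display keeps each capability string intact as a single shell argument.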
+ +Pool +---- + +A pool is a logical partition where users store data. +In Ceph deployments, it is common to create a pool as a logical partition for +similar types of data. For example, when deploying Ceph as a backend for +OpenStack, a typical deployment would have pools for volumes, images, backups +and virtual machines, and users such as ``client.glance``, ``client.cinder``, +etc. + +Application Tags +---------------- + +Access may be restricted to specific pools as defined by their application +metadata. The ``*`` wildcard may be used for the ``key`` argument, the +``value`` argument, or both. ``all`` is a synonym for ``*``. + +Namespace +--------- + +Objects within a pool can be associated with a namespace--a logical group of +objects within the pool. A user's access to a pool can be associated with a +namespace such that reads and writes by the user take place only within the +namespace. Objects written to a namespace within the pool can only be accessed +by users who have access to the namespace. + +.. note:: Namespaces are primarily useful for applications written on top of + ``librados`` where the logical grouping can alleviate the need to create + different pools. Ceph Object Gateway (from ``luminous``) uses namespaces for various + metadata objects. + +The rationale for namespaces is that pools can be a computationally expensive +method of segregating data sets for the purposes of authorizing separate sets +of users. For example, a pool should have ~100 placement groups per OSD. So an +exemplary cluster with 1000 OSDs would have 100,000 placement groups for one +pool. Each additional pool would create another 100,000 placement groups in the +exemplary cluster. By contrast, writing an object to a namespace simply +associates the namespace with the object name without the computational overhead +of a separate pool. Rather than creating a separate pool for a user or set of +users, you may use a namespace. **Note:** Only available using ``librados`` at this time.
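The placement group arithmetic in the preceding paragraph can be written out as a short Python sketch. It uses only the figures quoted above (about 100 placement groups per OSD per pool) and is not a sizing recommendation; the function name ``placement_groups`` is hypothetical.

```python
def placement_groups(osds, pgs_per_osd=100, pools=1):
    """Total placement groups when every pool is sized at pgs_per_osd per OSD."""
    return osds * pgs_per_osd * pools


# One pool on the exemplary 1000-OSD cluster.
print(placement_groups(1000))

# Separating three sets of users with three pools triples the count,
# whereas three namespaces inside one pool leave it unchanged.
print(placement_groups(1000, pools=3))
```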
+ +Access may be restricted to specific RADOS namespaces using the ``namespace`` +capability. Limited globbing of namespaces is supported; if the last character +of the specified namespace is ``*``, then access is granted to any namespace +starting with the provided argument. + +Managing Users +============== + +User management functionality provides Ceph Storage Cluster administrators with +the ability to create, update and delete users directly in the Ceph Storage +Cluster. + +When you create or delete users in the Ceph Storage Cluster, you may need to +distribute keys to clients so that they can be added to keyrings. See `Keyring +Management`_ for details. + +List Users +---------- + +To list the users in your cluster, execute the following: + +.. prompt:: bash $ + + ceph auth ls + +Ceph will list out all users in your cluster. For example, in a two-node +exemplary cluster, ``ceph auth ls`` will output something that looks like +this:: + + installed auth entries: + + osd.0 + key: AQCvCbtToC6MDhAATtuT70Sl+DymPCfDSsyV4w== + caps: [mon] allow profile osd + caps: [osd] allow * + osd.1 + key: AQC4CbtTCFJBChAAVq5spj0ff4eHZICxIOVZeA== + caps: [mon] allow profile osd + caps: [osd] allow * + client.admin + key: AQBHCbtT6APDHhAA5W00cBchwkQjh3dkKsyPjw== + caps: [mds] allow + caps: [mon] allow * + caps: [osd] allow * + client.bootstrap-mds + key: AQBICbtTOK9uGBAAdbe5zcIGHZL3T/u2g6EBww== + caps: [mon] allow profile bootstrap-mds + client.bootstrap-osd + key: AQBHCbtT4GxqORAADE5u7RkpCN/oo4e5W0uBtw== + caps: [mon] allow profile bootstrap-osd + + +Note that the ``TYPE.ID`` notation for users applies such that ``osd.0`` is a +user of type ``osd`` and its ID is ``0``, ``client.admin`` is a user of type +``client`` and its ID is ``admin`` (i.e., the default ``client.admin`` user). +Note also that each entry has a ``key: <value>`` entry, and one or more +``caps:`` entries. + +You may use the ``-o {filename}`` option with ``ceph auth ls`` to +save the output to a file. 
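For scripting, the listing format shown above can be turned into a dictionary keyed by ``TYPE.ID``. The Python sketch below parses a shortened copy of the sample output; it assumes the plain-text layout shown in this section, and the function name ``parse_auth_ls`` is hypothetical.

```python
SAMPLE = """\
installed auth entries:

osd.0
        key: AQCvCbtToC6MDhAATtuT70Sl+DymPCfDSsyV4w==
        caps: [mon] allow profile osd
        caps: [osd] allow *
client.admin
        key: AQBHCbtT6APDHhAA5W00cBchwkQjh3dkKsyPjw==
        caps: [mds] allow
        caps: [mon] allow *
        caps: [osd] allow *
"""


def parse_auth_ls(text):
    """Map each TYPE.ID to its key and its per-daemon capability strings."""
    users = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line == "installed auth entries:":
            continue
        if line.startswith("key:"):
            current["key"] = line.split(":", 1)[1].strip()
        elif line.startswith("caps:"):
            # A caps line looks like: caps: [mon] allow profile osd
            daemon, _, spec = line[len("caps:"):].strip().partition("]")
            current["caps"][daemon.lstrip("[")] = spec.strip()
        else:
            # Any other non-empty line starts a new user entry.
            current = users.setdefault(line, {"key": None, "caps": {}})
    return users


users = parse_auth_ls(SAMPLE)
for name, entry in users.items():
    print(name, entry["caps"])
```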
+ + +Get a User +---------- + +To retrieve a specific user, key and capabilities, execute the +following: + +.. prompt:: bash $ + + ceph auth get {TYPE.ID} + +For example: + +.. prompt:: bash $ + + ceph auth get client.admin + +You may also use the ``-o {filename}`` option with ``ceph auth get`` to +save the output to a file. Developers may also execute the following: + +.. prompt:: bash $ + + ceph auth export {TYPE.ID} + +The ``auth export`` command is identical to ``auth get``. + +Add a User +---------- + +Adding a user creates a username (i.e., ``TYPE.ID``), a secret key and +any capabilities included in the command you use to create the user. + +A user's key enables the user to authenticate with the Ceph Storage Cluster. +The user's capabilities authorize the user to read, write, or execute on Ceph +monitors (``mon``), Ceph OSDs (``osd``) or Ceph Metadata Servers (``mds``). + +There are a few ways to add a user: + +- ``ceph auth add``: This command is the canonical way to add a user. It + will create the user, generate a key and add any specified capabilities. + +- ``ceph auth get-or-create``: This command is often the most convenient way + to create a user, because it returns a keyfile format with the user name + (in brackets) and the key. If the user already exists, this command + simply returns the user name and key in the keyfile format. You may use the + ``-o {filename}`` option to save the output to a file. + +- ``ceph auth get-or-create-key``: This command is a convenient way to create + a user and return the user's key (only). This is useful for clients that + need the key only (e.g., libvirt). If the user already exists, this command + simply returns the key. You may use the ``-o {filename}`` option to save the + output to a file. + +When creating client users, you may create a user with no capabilities. A user +with no capabilities is useless beyond mere authentication, because the client +cannot retrieve the cluster map from the monitor. 
However, you can create a +user with no capabilities if you wish to defer adding capabilities until later, +using the ``ceph auth caps`` command. + +A typical user has at least read capabilities on the Ceph monitor and +read and write capabilities on Ceph OSDs. Additionally, a user's OSD permissions +are often restricted to accessing a particular pool: + +.. prompt:: bash $ + + ceph auth add client.john mon 'allow r' osd 'allow rw pool=liverpool' + ceph auth get-or-create client.paul mon 'allow r' osd 'allow rw pool=liverpool' + ceph auth get-or-create client.george mon 'allow r' osd 'allow rw pool=liverpool' -o george.keyring + ceph auth get-or-create-key client.ringo mon 'allow r' osd 'allow rw pool=liverpool' -o ringo.key + + +.. important:: If you provide a user with capabilities to OSDs, but you DO NOT + restrict access to particular pools, the user will have access to ALL + pools in the cluster! + + +.. _modify-user-capabilities: + +Modify User Capabilities +------------------------ + +The ``ceph auth caps`` command allows you to specify a user and change the +user's capabilities. Setting new capabilities will overwrite current capabilities. +To view current capabilities run ``ceph auth get USERTYPE.USERID``. Because new +capabilities replace the old ones, to add capabilities you must also restate the +existing capabilities, using the form: + +.. prompt:: bash $ + + ceph auth caps USERTYPE.USERID {daemon} 'allow [r|w|x|*|...] [pool={pool-name}] [namespace={namespace-name}]' [{daemon} 'allow [r|w|x|*|...] [pool={pool-name}] [namespace={namespace-name}]'] + +For example: + +.. prompt:: bash $ + + ceph auth get client.john + ceph auth caps client.john mon 'allow r' osd 'allow rw pool=liverpool' + ceph auth caps client.paul mon 'allow rw' osd 'allow rwx pool=liverpool' + ceph auth caps client.brian-manager mon 'allow *' osd 'allow *' + +See `Authorization (Capabilities)`_ for additional details on capabilities. + +Delete a User +------------- + +To delete a user, use ``ceph auth del``: + +.. 
prompt:: bash $ + + ceph auth del {TYPE}.{ID} + +Where ``{TYPE}`` is one of ``client``, ``osd``, ``mon``, or ``mds``, +and ``{ID}`` is the user name or ID of the daemon. + + +Print a User's Key +------------------ + +To print a user's authentication key to standard output, execute the following: + +.. prompt:: bash $ + + ceph auth print-key {TYPE}.{ID} + +Where ``{TYPE}`` is one of ``client``, ``osd``, ``mon``, or ``mds``, +and ``{ID}`` is the user name or ID of the daemon. + +Printing a user's key is useful when you need to populate client +software with a user's key (e.g., libvirt): + +.. prompt:: bash $ + + mount -t ceph serverhost:/ mountpoint -o name=client.user,secret=`ceph auth print-key client.user` + +Import a User(s) +---------------- + +To import one or more users, use ``ceph auth import`` and +specify a keyring: + +.. prompt:: bash $ + + ceph auth import -i /path/to/keyring + +For example: + +.. prompt:: bash $ + + sudo ceph auth import -i /etc/ceph/ceph.keyring + + +.. note:: The Ceph storage cluster will add new users, their keys and their + capabilities and will update existing users, their keys and their + capabilities. + +Keyring Management +================== + +When you access Ceph via a Ceph client, the Ceph client will look for a local +keyring. Ceph presets the ``keyring`` setting with the following four keyring +names by default so you don't have to set them in your Ceph configuration file +unless you want to override the defaults (not recommended): + +- ``/etc/ceph/$cluster.$name.keyring`` +- ``/etc/ceph/$cluster.keyring`` +- ``/etc/ceph/keyring`` +- ``/etc/ceph/keyring.bin`` + +The ``$cluster`` metavariable is your Ceph cluster name as defined by the +name of the Ceph configuration file (i.e., ``ceph.conf`` means the cluster name +is ``ceph``; thus, ``ceph.keyring``). The ``$name`` metavariable is the user +type and user ID (e.g., ``client.admin``; thus, ``ceph.client.admin.keyring``). + +.. 
note:: When executing commands that read or write to ``/etc/ceph``, you may + need to use ``sudo`` to execute the command as ``root``. + +After you create a user (e.g., ``client.ringo``), you must get the key and add +it to a keyring on a Ceph client so that the user can access the Ceph Storage +Cluster. + +The `Managing Users`_ section details how to list, get, add, modify and delete +users directly in the Ceph Storage Cluster. However, Ceph also provides the +``ceph-authtool`` utility to allow you to manage keyrings from a Ceph client. + +Create a Keyring +---------------- + +When you use the procedures in the `Managing Users`_ section to create users, +you need to provide user keys to the Ceph client(s) so that the Ceph client +can retrieve the key for the specified user and authenticate with the Ceph +Storage Cluster. Ceph Clients access keyrings to look up a user name and +retrieve the user's key. + +The ``ceph-authtool`` utility allows you to create a keyring. To create an +empty keyring, use ``--create-keyring`` or ``-C``. For example: + +.. prompt:: bash $ + + ceph-authtool --create-keyring /path/to/keyring + +When creating a keyring with multiple users, we recommend using the cluster name +(e.g., ``$cluster.keyring``) for the keyring filename and saving it in the +``/etc/ceph`` directory so that the ``keyring`` configuration default setting +will pick up the filename without requiring you to specify it in the local copy +of your Ceph configuration file. For example, create ``ceph.keyring`` by +executing the following: + +.. prompt:: bash $ + + sudo ceph-authtool -C /etc/ceph/ceph.keyring + +When creating a keyring with a single user, we recommend using the cluster name, +the user type and the user name and saving it in the ``/etc/ceph`` directory. +For example, ``ceph.client.admin.keyring`` for the ``client.admin`` user. + +To create a keyring in ``/etc/ceph``, you must do so as ``root``.
This means +the file will have ``rw`` permissions for the ``root`` user only, which is +appropriate when the keyring contains administrator keys. However, if you +intend to use the keyring for a particular user or group of users, ensure +that you execute ``chown`` or ``chmod`` to establish appropriate keyring +ownership and access. + +Add a User to a Keyring +----------------------- + +When you `Add a User`_ to the Ceph Storage Cluster, you can use the `Get a +User`_ procedure to retrieve a user, key and capabilities and save the user to a +keyring. + +When you only want to use one user per keyring, the `Get a User`_ procedure with +the ``-o`` option will save the output in the keyring file format. For example, +to create a keyring for the ``client.admin`` user, execute the following: + +.. prompt:: bash $ + + sudo ceph auth get client.admin -o /etc/ceph/ceph.client.admin.keyring + +Notice that we use the recommended file format for an individual user. + +When you want to import users to a keyring, you can use ``ceph-authtool`` +to specify the destination keyring and the source keyring. +For example: + +.. prompt:: bash $ + + sudo ceph-authtool /etc/ceph/ceph.keyring --import-keyring /etc/ceph/ceph.client.admin.keyring + +Create a User +------------- + +Ceph provides the `Add a User`_ function to create a user directly in the Ceph +Storage Cluster. However, you can also create a user, keys and capabilities +directly on a Ceph client keyring. Then, you can import the user to the Ceph +Storage Cluster. For example: + +.. prompt:: bash $ + + sudo ceph-authtool -n client.ringo --cap osd 'allow rwx' --cap mon 'allow rwx' /etc/ceph/ceph.keyring + +See `Authorization (Capabilities)`_ for additional details on capabilities. + +You can also create a keyring and add a new user to the keyring simultaneously. +For example: + +.. 
prompt:: bash $ + + sudo ceph-authtool -C /etc/ceph/ceph.keyring -n client.ringo --cap osd 'allow rwx' --cap mon 'allow rwx' --gen-key + +In the foregoing scenarios, the new user ``client.ringo`` exists only in the +keyring. To add the new user to the Ceph Storage Cluster, run ``ceph auth add`` +with the keyring as input: + +.. prompt:: bash $ + + sudo ceph auth add client.ringo -i /etc/ceph/ceph.keyring + +Modify a User +------------- + +To modify the capabilities of a user record in a keyring, specify the keyring, +and the user followed by the capabilities. For example: + +.. prompt:: bash $ + + sudo ceph-authtool /etc/ceph/ceph.keyring -n client.ringo --cap osd 'allow rwx' --cap mon 'allow rwx' + +To apply the change to the Ceph Storage Cluster, import the keyring so that the +user entry in the Ceph Storage Cluster matches the user in the keyring: + +.. prompt:: bash $ + + sudo ceph auth import -i /etc/ceph/ceph.keyring + +See `Import a User(s)`_ for details on updating a Ceph Storage Cluster user +from a keyring. + +You may also `Modify User Capabilities`_ directly in the cluster, store the +results to a keyring file, and then import the keyring into your main +``ceph.keyring`` file. + +Command Line Usage +================== + +Ceph supports the following usage for user name and secret: + +``--id`` | ``--user`` + +:Description: Ceph identifies users with a type and an ID (e.g., ``TYPE.ID`` or + ``client.admin``, ``client.user1``). The ``--id`` and ``--user`` + options enable you to specify the ID portion of the user + name (e.g., ``admin``, ``user1``, ``foo``, etc.). You can specify + the user with ``--id`` or ``--user`` and omit the type. For example, + to specify user ``client.foo`` enter the following: + + .. prompt:: bash $ + + ceph --id foo --keyring /path/to/keyring health + ceph --user foo --keyring /path/to/keyring health + + +``--name`` | ``-n`` + +:Description: Ceph identifies users with a type and an ID (e.g., ``TYPE.ID`` or + ``client.admin``, ``client.user1``).
The ``--name`` and ``-n`` + options enable you to specify the fully qualified user name. + You must specify the user type (typically ``client``) with the + user ID. For example: + + .. prompt:: bash $ + + ceph --name client.foo --keyring /path/to/keyring health + ceph -n client.foo --keyring /path/to/keyring health + + +``--keyring`` + +:Description: The path to the keyring containing one or more user names and + secrets. The ``--secret`` option provides the same functionality, + but it does not work with Ceph RADOS Gateway, which uses + ``--secret`` for another purpose. You may retrieve a keyring with + ``ceph auth get-or-create`` and store it locally. This is a + preferred approach, because you can switch user names without + switching the keyring path. For example: + + .. prompt:: bash $ + + sudo rbd map --id foo --keyring /path/to/keyring mypool/myimage + + +.. _pools: ../pools + +Limitations +=========== + +The ``cephx`` protocol authenticates Ceph clients and servers to each other. It +is not intended to handle authentication of human users or application programs +run on their behalf. If that effect is required to handle your access control +needs, you must have another mechanism, which is likely to be specific to the +front end used to access the Ceph object store. This other mechanism has the +role of ensuring that only acceptable users and programs are able to run on the +machine that Ceph will permit to access its object store. + +The keys used to authenticate Ceph clients and servers are typically stored in +a plain text file with appropriate permissions in a trusted host. + +.. important:: Storing keys in plaintext files has security shortcomings, but + they are difficult to avoid, given the basic authentication methods Ceph + uses in the background. Those setting up Ceph systems should be aware of + these shortcomings.
+ +In particular, arbitrary user machines, especially portable machines, should not +be configured to interact directly with Ceph, since that mode of use would +require the storage of a plaintext authentication key on an insecure machine. +Anyone who stole that machine or obtained surreptitious access to it could +obtain the key that will allow them to authenticate their own machines to Ceph. + +Rather than permitting potentially insecure machines to access a Ceph object +store directly, users should be required to sign in to a trusted machine in +your environment using a method that provides sufficient security for your +purposes. That trusted machine will store the plaintext Ceph keys for the +human users. A future version of Ceph may address these particular +authentication issues more fully. + +At the moment, none of the Ceph authentication protocols provide secrecy for +messages in transit. Thus, an eavesdropper on the wire can hear and understand +all data sent between clients and servers in Ceph, even if it cannot create or +alter them. Further, Ceph does not include options to encrypt user data in the +object store. Users can hand-encrypt and store their own data in the Ceph +object store, of course, but Ceph provides no features to perform object +encryption itself. Those storing sensitive data in Ceph should consider +encrypting their data before providing it to the Ceph system. + + +.. _Architecture - High Availability Authentication: ../../../architecture#high-availability-authentication +.. _Cephx Config Reference: ../../configuration/auth-config-ref diff --git a/doc/rados/troubleshooting/community.rst b/doc/rados/troubleshooting/community.rst new file mode 100644 index 000000000..f816584ae --- /dev/null +++ b/doc/rados/troubleshooting/community.rst @@ -0,0 +1,28 @@ +==================== + The Ceph Community +==================== + +The Ceph community is an excellent source of information and help. 
For +operational issues with Ceph releases we recommend you `subscribe to the +ceph-users email list`_. When you no longer want to receive emails, you can +`unsubscribe from the ceph-users email list`_. + +You may also `subscribe to the ceph-devel email list`_. You should do so if +your issue is: + +- Likely related to a bug +- Related to a development release package +- Related to a development testing package +- Related to your own builds + +If you no longer want to receive emails from the ``ceph-devel`` email list, you +may `unsubscribe from the ceph-devel email list`_. + +.. tip:: The Ceph community is growing rapidly, and community members can help + you if you provide them with detailed information about your problem. You + can attach the output of the ``ceph report`` command to help people understand your issues. + +.. _subscribe to the ceph-devel email list: mailto:dev-join@ceph.io +.. _unsubscribe from the ceph-devel email list: mailto:dev-leave@ceph.io +.. _subscribe to the ceph-users email list: mailto:ceph-users-join@ceph.io +.. _unsubscribe from the ceph-users email list: mailto:ceph-users-leave@ceph.io diff --git a/doc/rados/troubleshooting/cpu-profiling.rst b/doc/rados/troubleshooting/cpu-profiling.rst new file mode 100644 index 000000000..159f7998d --- /dev/null +++ b/doc/rados/troubleshooting/cpu-profiling.rst @@ -0,0 +1,67 @@ +=============== + CPU Profiling +=============== + +If you built Ceph from source and compiled Ceph for use with `oprofile`_ +you can profile Ceph's CPU usage. See `Installing Oprofile`_ for details. + + +Initializing oprofile +===================== + +The first time you use ``oprofile`` you need to initialize it. Locate the +``vmlinux`` image corresponding to the kernel you are now running. 
:: + + ls /boot + sudo opcontrol --init + sudo opcontrol --setup --vmlinux={path-to-image} --separate=library --callgraph=6 + + +Starting oprofile +================= + +To start ``oprofile`` execute the following command:: + + opcontrol --start + +Once you start ``oprofile``, you may run some tests with Ceph. + + +Stopping oprofile +================= + +To stop ``oprofile`` execute the following command:: + + opcontrol --stop + + +Retrieving oprofile Results +=========================== + +To retrieve the top ``cmon`` results, execute the following command:: + + opreport -gal ./cmon | less + + +To retrieve the top ``cmon`` results with call graphs attached, execute the +following command:: + + opreport -cal ./cmon | less + +.. important:: After reviewing results, you should reset ``oprofile`` before + running it again. Resetting ``oprofile`` removes data from the session + directory. + + +Resetting oprofile +================== + +To reset ``oprofile``, execute the following command:: + + sudo opcontrol --reset + +.. important:: You should reset ``oprofile`` after analyzing data so that + you do not commingle results from different tests. + +.. _oprofile: http://oprofile.sourceforge.net/about/ +.. _Installing Oprofile: ../../../dev/cpu-profiler diff --git a/doc/rados/troubleshooting/index.rst b/doc/rados/troubleshooting/index.rst new file mode 100644 index 000000000..80d14f3ce --- /dev/null +++ b/doc/rados/troubleshooting/index.rst @@ -0,0 +1,19 @@ +================= + Troubleshooting +================= + +Ceph is still on the leading edge, so you may encounter situations that require +you to examine your configuration, modify your logging output, troubleshoot +monitors and OSDs, profile memory and CPU usage, and reach out to the +Ceph community for help. + +.. 
toctree:: + :maxdepth: 1 + + community + log-and-debug + troubleshooting-mon + troubleshooting-osd + troubleshooting-pg + memory-profiling + cpu-profiling diff --git a/doc/rados/troubleshooting/log-and-debug.rst b/doc/rados/troubleshooting/log-and-debug.rst new file mode 100644 index 000000000..71170149b --- /dev/null +++ b/doc/rados/troubleshooting/log-and-debug.rst @@ -0,0 +1,599 @@ +======================= + Logging and Debugging +======================= + +Typically, when you add debugging to your Ceph configuration, you do so at +runtime. You can also add Ceph debug logging to your Ceph configuration file if +you are encountering issues when starting your cluster. You may view Ceph log +files under ``/var/log/ceph`` (the default location). + +.. tip:: When debug output slows down your system, the latency can hide + race conditions. + +Logging is resource intensive. If you are encountering a problem in a specific +area of your cluster, enable logging for that area of the cluster. For example, +if your OSDs are running fine, but your metadata servers are not, you should +start by enabling debug logging for the specific metadata server instance(s) +giving you trouble. Enable logging for each subsystem as needed. + +.. important:: Verbose logging can generate over 1GB of data per hour. If your + OS disk reaches its capacity, the node will stop working. + +If you enable or increase the rate of Ceph logging, ensure that you have +sufficient disk space on your OS disk. See `Accelerating Log Rotation`_ for +details on rotating log files. When your system is running well, remove +unnecessary debugging settings to ensure your cluster runs optimally. Logging +debug output messages is relatively slow, and a waste of resources when +operating your cluster. + +See `Subsystem, Log and Debug Settings`_ for details on available settings. 
+ +Runtime +======= + +If you would like to see the configuration settings at runtime, you must log +in to a host with a running daemon and execute the following:: + + ceph daemon {daemon-name} config show | less + +For example,:: + + ceph daemon osd.0 config show | less + +To activate Ceph's debugging output (*i.e.*, ``dout()``) at runtime, use the +``ceph tell`` command to inject arguments into the runtime configuration:: + + ceph tell {daemon-type}.{daemon id or *} config set {name} {value} + +Replace ``{daemon-type}`` with one of ``osd``, ``mon`` or ``mds``. You may apply +the runtime setting to all daemons of a particular type with ``*``, or specify +a specific daemon's ID. For example, to increase +debug logging for a ``ceph-osd`` daemon named ``osd.0``, execute the following:: + + ceph tell osd.0 config set debug_osd 0/5 + +The ``ceph tell`` command goes through the monitors. If you cannot bind to the +monitor, you can still make the change by logging into the host of the daemon +whose configuration you'd like to change using ``ceph daemon``. +For example:: + + sudo ceph daemon osd.0 config set debug_osd 0/5 + +See `Subsystem, Log and Debug Settings`_ for details on available settings. + + +Boot Time +========= + +To activate Ceph's debugging output (*i.e.*, ``dout()``) at boot time, you must +add settings to your Ceph configuration file. Subsystems common to each daemon +may be set under ``[global]`` in your configuration file. Subsystems for +particular daemons are set under the daemon section in your configuration file +(*e.g.*, ``[mon]``, ``[osd]``, ``[mds]``). For example:: + + [global] + debug ms = 1/5 + + [mon] + debug mon = 20 + debug paxos = 1/5 + debug auth = 2 + + [osd] + debug osd = 1/5 + debug filestore = 1/5 + debug journal = 1 + debug monc = 5/20 + + [mds] + debug mds = 1 + debug mds balancer = 1 + + +See `Subsystem, Log and Debug Settings`_ for details. 
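The runtime workflow described above (raise a subsystem's debug level while reproducing a problem, then restore it) can be sketched as a small shell helper. This is only an illustrative sketch: the function name ``set_debug`` and the ``CEPH`` override variable are hypothetical and not part of Ceph. The only real command used is the ``ceph tell ... config set`` form shown in the Runtime section.

```shell
# Hypothetical helper, not part of Ceph.
# CEPH defaults to the real CLI; set CEPH=echo to print the arguments instead
# of contacting a cluster (dry run).
set_debug() {
    # $1 = daemon (e.g. osd.0)
    # $2 = option (e.g. debug_osd)
    # $3 = levels in {log-level}/{memory-level} form (e.g. 0/5)
    "${CEPH:-ceph}" tell "$1" config set "$2" "$3"
}
```

For example, ``CEPH=echo set_debug osd.0 debug_osd 0/5`` prints the arguments that would be passed to ``ceph``. Calling ``set_debug`` again with the subsystem's default levels restores normal logging once you have finished debugging.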
+ + +Accelerating Log Rotation +========================= + +If your OS disk is relatively full, you can accelerate log rotation by modifying +the Ceph log rotation file at ``/etc/logrotate.d/ceph``. Add a size setting +after the rotation frequency to accelerate log rotation (via cronjob) if your +logs exceed the size setting. For example, the default setting looks like +this:: + + rotate 7 + weekly + compress + sharedscripts + +Modify it by adding a ``size`` setting. :: + + rotate 7 + weekly + size 500M + compress + sharedscripts + +Then, start the crontab editor for your user space. :: + + crontab -e + +Finally, add an entry to check the ``/etc/logrotate.d/ceph`` file. :: + + 30 * * * * /usr/sbin/logrotate /etc/logrotate.d/ceph >/dev/null 2>&1 + +The preceding example checks the ``/etc/logrotate.d/ceph`` file once an hour, +at 30 minutes past the hour. + + +Valgrind +======== + +Debugging may also require you to track down memory and threading issues. +You can run a single daemon, a type of daemon, or the whole cluster with +Valgrind. You should only use Valgrind when developing or debugging Ceph. +Valgrind is computationally expensive, and will slow down your system otherwise. +Valgrind messages are logged to ``stderr``. + + +Subsystem, Log and Debug Settings +================================= + +In most cases, you will enable debug logging output via subsystems. + +Ceph Subsystems +--------------- + +Each subsystem has a logging level for its output logs, and for its logs +in-memory. You may set different values for each of these subsystems by setting +a log file level and a memory level for debug logging. Ceph's logging levels +operate on a scale of ``1`` to ``20``, where ``1`` is terse and ``20`` is +verbose [#]_ . In general, the logs in-memory are not sent to the output log unless: + +- a fatal signal is raised or +- an ``assert`` in source code is triggered or +- upon request. 
Please consult `document on admin socket <http://docs.ceph.com/en/latest/man/8/ceph/#daemon>`_ for more details. + +A debug logging setting can take a single value for the log level and the +memory level, which sets them both as the same value. For example, if you +specify ``debug ms = 5``, Ceph will treat it as a log level and a memory level +of ``5``. You may also specify them separately. The first setting is the log +level, and the second setting is the memory level. You must separate them with +a forward slash (/). For example, if you want to set the ``ms`` subsystem's +debug logging level to ``1`` and its memory level to ``5``, you would specify it +as ``debug ms = 1/5``. For example: + + + +.. code-block:: ini + + debug {subsystem} = {log-level}/{memory-level} + #for example + debug mds balancer = 1/20 + + +The following table provides a list of Ceph subsystems and their default log and +memory levels. Once you complete your logging efforts, restore the subsystems +to their default level or to a level suitable for normal operations. 
+ + ++--------------------+-----------+--------------+ +| Subsystem | Log Level | Memory Level | ++====================+===========+==============+ +| ``default`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``lockdep`` | 0 | 1 | ++--------------------+-----------+--------------+ +| ``context`` | 0 | 1 | ++--------------------+-----------+--------------+ +| ``crush`` | 1 | 1 | ++--------------------+-----------+--------------+ +| ``mds`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``mds balancer`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``mds locker`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``mds log`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``mds log expire`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``mds migrator`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``buffer`` | 0 | 1 | ++--------------------+-----------+--------------+ +| ``timer`` | 0 | 1 | ++--------------------+-----------+--------------+ +| ``filer`` | 0 | 1 | ++--------------------+-----------+--------------+ +| ``striper`` | 0 | 1 | ++--------------------+-----------+--------------+ +| ``objecter`` | 0 | 1 | ++--------------------+-----------+--------------+ +| ``rados`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``rbd`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``rbd mirror`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``rbd replay`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``journaler`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``objectcacher`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``client`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``osd`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``optracker`` | 0 | 5 | ++--------------------+-----------+--------------+ +| 
``objclass`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``filestore`` | 1 | 3 | ++--------------------+-----------+--------------+ +| ``journal`` | 1 | 3 | ++--------------------+-----------+--------------+ +| ``ms`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``mon`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``monc`` | 0 | 10 | ++--------------------+-----------+--------------+ +| ``paxos`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``tp`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``auth`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``crypto`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``finisher`` | 1 | 1 | ++--------------------+-----------+--------------+ +| ``reserver`` | 1 | 1 | ++--------------------+-----------+--------------+ +| ``heartbeatmap`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``perfcounter`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``rgw`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``rgw sync`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``civetweb`` | 1 | 10 | ++--------------------+-----------+--------------+ +| ``javaclient`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``asok`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``throttle`` | 1 | 1 | ++--------------------+-----------+--------------+ +| ``refs`` | 0 | 0 | ++--------------------+-----------+--------------+ +| ``compressor`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``bluestore`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``bluefs`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``bdev`` | 1 | 3 | ++--------------------+-----------+--------------+ +| ``kstore`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``rocksdb`` | 4 | 5 | 
++--------------------+-----------+--------------+ +| ``leveldb`` | 4 | 5 | ++--------------------+-----------+--------------+ +| ``memdb`` | 4 | 5 | ++--------------------+-----------+--------------+ +| ``fuse`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``mgr`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``mgrc`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``dpdk`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``eventtrace`` | 1 | 5 | ++--------------------+-----------+--------------+ + + +Logging Settings +---------------- + +Logging and debugging settings are not required in a Ceph configuration file, +but you may override default settings as needed. Ceph supports the following +settings: + + +``log file`` + +:Description: The location of the logging file for your cluster. +:Type: String +:Required: No +:Default: ``/var/log/ceph/$cluster-$name.log`` + + +``log max new`` + +:Description: The maximum number of new log files. +:Type: Integer +:Required: No +:Default: ``1000`` + + +``log max recent`` + +:Description: The maximum number of recent events to include in a log file. +:Type: Integer +:Required: No +:Default: ``10000`` + + +``log to file`` + +:Description: Determines if logging messages should appear in a file. +:Type: Boolean +:Required: No +:Default: ``true`` + + +``log to stderr`` + +:Description: Determines if logging messages should appear in ``stderr``. +:Type: Boolean +:Required: No +:Default: ``false`` + + +``err to stderr`` + +:Description: Determines if error messages should appear in ``stderr``. +:Type: Boolean +:Required: No +:Default: ``true`` + + +``log to syslog`` + +:Description: Determines if logging messages should appear in ``syslog``. +:Type: Boolean +:Required: No +:Default: ``false`` + + +``err to syslog`` + +:Description: Determines if error messages should appear in ``syslog``. 
+:Type: Boolean +:Required: No +:Default: ``false`` + + +``log flush on exit`` + +:Description: Determines if Ceph should flush the log files after exit. +:Type: Boolean +:Required: No +:Default: ``false`` + + +``clog to monitors`` + +:Description: Determines if ``clog`` messages should be sent to monitors. +:Type: Boolean +:Required: No +:Default: ``true`` + + +``clog to syslog`` + +:Description: Determines if ``clog`` messages should be sent to syslog. +:Type: Boolean +:Required: No +:Default: ``false`` + + +``mon cluster log to syslog`` + +:Description: Determines if the cluster log should be output to the syslog. +:Type: Boolean +:Required: No +:Default: ``false`` + + +``mon cluster log file`` + +:Description: The locations of the cluster's log files. There are two channels in + Ceph: ``cluster`` and ``audit``. This option represents a mapping + from channels to log files, where the log entries of that + channel are sent to. The ``default`` entry is a fallback + mapping for channels not explicitly specified. So, the following + default setting will send cluster log to ``$cluster.log``, and + send audit log to ``$cluster.audit.log``, where ``$cluster`` will + be replaced with the actual cluster name. +:Type: String +:Required: No +:Default: ``default=/var/log/ceph/$cluster.$channel.log,cluster=/var/log/ceph/$cluster.log`` + + + +OSD +--- + + +``osd debug drop ping probability`` + +:Description: ? +:Type: Double +:Required: No +:Default: 0 + + +``osd debug drop ping duration`` + +:Description: +:Type: Integer +:Required: No +:Default: 0 + +``osd debug drop pg create probability`` + +:Description: +:Type: Integer +:Required: No +:Default: 0 + +``osd debug drop pg create duration`` + +:Description: ? +:Type: Double +:Required: No +:Default: 1 + + +``osd min pg log entries`` + +:Description: The minimum number of log entries for placement groups. 
+:Type: 32-bit Unsigned Integer +:Required: No +:Default: 250 + + +``osd op log threshold`` + +:Description: How many op log messages to show up in one pass. +:Type: Integer +:Required: No +:Default: 5 + + + +Filestore +--------- + +``filestore debug omap check`` + +:Description: Debugging check on synchronization. This is an expensive operation. +:Type: Boolean +:Required: No +:Default: ``false`` + + +MDS +--- + + +``mds debug scatterstat`` + +:Description: Ceph will assert that various recursive stat invariants are true + (for developers only). + +:Type: Boolean +:Required: No +:Default: ``false`` + + +``mds debug frag`` + +:Description: Ceph will verify directory fragmentation invariants when + convenient (developers only). + +:Type: Boolean +:Required: No +:Default: ``false`` + + +``mds debug auth pins`` + +:Description: The debug auth pin invariants (for developers only). +:Type: Boolean +:Required: No +:Default: ``false`` + + +``mds debug subtrees`` + +:Description: The debug subtree invariants (for developers only). +:Type: Boolean +:Required: No +:Default: ``false`` + + + +RADOS Gateway +------------- + + +``rgw log nonexistent bucket`` + +:Description: Should we log a non-existent buckets? +:Type: Boolean +:Required: No +:Default: ``false`` + + +``rgw log object name`` + +:Description: Should an object's name be logged. // man date to see codes (a subset are supported) +:Type: String +:Required: No +:Default: ``%Y-%m-%d-%H-%i-%n`` + + +``rgw log object name utc`` + +:Description: Object log name contains UTC? +:Type: Boolean +:Required: No +:Default: ``false`` + + +``rgw enable ops log`` + +:Description: Enables logging of every RGW operation. +:Type: Boolean +:Required: No +:Default: ``true`` + + +``rgw enable usage log`` + +:Description: Enable logging of RGW's bandwidth usage. +:Type: Boolean +:Required: No +:Default: ``false`` + + +``rgw usage log flush threshold`` + +:Description: Threshold to flush pending log data. 
+:Type: Integer +:Required: No +:Default: ``1024`` + + +``rgw usage log tick interval`` + +:Description: Flush pending log data every ``s`` seconds. +:Type: Integer +:Required: No +:Default: 30 + + +``rgw intent log object name`` + +:Description: +:Type: String +:Required: No +:Default: ``%Y-%m-%d-%i-%n`` + + +``rgw intent log object name utc`` + +:Description: Include a UTC timestamp in the intent log object name. +:Type: Boolean +:Required: No +:Default: ``false`` + +.. [#] there are levels >20 in some rare cases and that they are extremely verbose. diff --git a/doc/rados/troubleshooting/memory-profiling.rst b/doc/rados/troubleshooting/memory-profiling.rst new file mode 100644 index 000000000..e2396e2fd --- /dev/null +++ b/doc/rados/troubleshooting/memory-profiling.rst @@ -0,0 +1,142 @@ +================== + Memory Profiling +================== + +Ceph MON, OSD and MDS can generate heap profiles using +``tcmalloc``. To generate heap profiles, ensure you have +``google-perftools`` installed:: + + sudo apt-get install google-perftools + +The profiler dumps output to your ``log file`` directory (i.e., +``/var/log/ceph``). See `Logging and Debugging`_ for details. 
+To view the profiler logs with Google's performance tools, execute the +following:: + + google-pprof --text {path-to-daemon} {log-path/filename} + +For example:: + + $ ceph tell osd.0 heap start_profiler + $ ceph tell osd.0 heap dump + osd.0 tcmalloc heap stats:------------------------------------------------ + MALLOC: 2632288 ( 2.5 MiB) Bytes in use by application + MALLOC: + 499712 ( 0.5 MiB) Bytes in page heap freelist + MALLOC: + 543800 ( 0.5 MiB) Bytes in central cache freelist + MALLOC: + 327680 ( 0.3 MiB) Bytes in transfer cache freelist + MALLOC: + 1239400 ( 1.2 MiB) Bytes in thread cache freelists + MALLOC: + 1142936 ( 1.1 MiB) Bytes in malloc metadata + MALLOC: ------------ + MALLOC: = 6385816 ( 6.1 MiB) Actual memory used (physical + swap) + MALLOC: + 0 ( 0.0 MiB) Bytes released to OS (aka unmapped) + MALLOC: ------------ + MALLOC: = 6385816 ( 6.1 MiB) Virtual address space used + MALLOC: + MALLOC: 231 Spans in use + MALLOC: 56 Thread heaps in use + MALLOC: 8192 Tcmalloc page size + ------------------------------------------------ + Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()). + Bytes released to the OS take up virtual address space but no physical memory. + $ google-pprof --text \ + /usr/bin/ceph-osd \ + /var/log/ceph/ceph-osd.0.profile.0001.heap + Total: 3.7 MB + 1.9 51.1% 51.1% 1.9 51.1% ceph::log::Log::create_entry + 1.8 47.3% 98.4% 1.8 47.3% std::string::_Rep::_S_create + 0.0 0.4% 98.9% 0.0 0.6% SimpleMessenger::add_accept_pipe + 0.0 0.4% 99.2% 0.0 0.6% decode_message + ... + +Another heap dump on the same daemon will add another file. It is +convenient to compare to a previous heap dump to show what has grown +in the interval. 
For instance:: + + $ google-pprof --text --base out/osd.0.profile.0001.heap \ + ceph-osd out/osd.0.profile.0003.heap + Total: 0.2 MB + 0.1 50.3% 50.3% 0.1 50.3% ceph::log::Log::create_entry + 0.1 46.6% 96.8% 0.1 46.6% std::string::_Rep::_S_create + 0.0 0.9% 97.7% 0.0 26.1% ReplicatedPG::do_op + 0.0 0.8% 98.5% 0.0 0.8% __gnu_cxx::new_allocator::allocate + +Refer to `Google Heap Profiler`_ for additional details. + +Once you have the heap profiler installed, start your cluster and +begin using the heap profiler. You may enable or disable the heap +profiler at runtime, or ensure that it runs continuously. For the +following commandline usage, replace ``{daemon-type}`` with ``mon``, +``osd`` or ``mds``, and replace ``{daemon-id}`` with the OSD number or +the MON or MDS id. + + +Starting the Profiler +--------------------- + +To start the heap profiler, execute the following:: + + ceph tell {daemon-type}.{daemon-id} heap start_profiler + +For example:: + + ceph tell osd.1 heap start_profiler + +Alternatively the profile can be started when the daemon starts +running if the ``CEPH_HEAP_PROFILER_INIT=true`` variable is found in +the environment. + +Printing Stats +-------------- + +To print out statistics, execute the following:: + + ceph tell {daemon-type}.{daemon-id} heap stats + +For example:: + + ceph tell osd.0 heap stats + +.. note:: Printing stats does not require the profiler to be running and does + not dump the heap allocation information to a file. + + +Dumping Heap Information +------------------------ + +To dump heap information, execute the following:: + + ceph tell {daemon-type}.{daemon-id} heap dump + +For example:: + + ceph tell mds.a heap dump + +.. note:: Dumping heap information only works when the profiler is running. 
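The profiler commands in this document (``heap start_profiler``, ``heap dump``, ``heap stop_profiler``) can be combined into one profiling cycle. The sketch below is a hedged illustration only: the function names ``heap_cmd`` and ``heap_cycle`` and the ``CEPH`` override variable are hypothetical. It assumes that waiting a fixed number of seconds between starting the profiler and dumping is an acceptable way to capture a workload.

```shell
# Hypothetical helpers, not part of Ceph.
# CEPH defaults to the real CLI; set CEPH=echo for a dry run.
heap_cmd() {
    # $1 = daemon (e.g. osd.0), $2 = heap subcommand
    "${CEPH:-ceph}" tell "$1" heap "$2"
}

heap_cycle() {
    # $1 = daemon, $2 = optional seconds to wait while the workload runs
    heap_cmd "$1" start_profiler || return 1
    sleep "${2:-0}"
    # Dumping only works while the profiler is running, so dump before stopping.
    heap_cmd "$1" dump || return 1
    heap_cmd "$1" stop_profiler
}
```

For example, ``CEPH=echo heap_cycle osd.0`` prints the three commands in order without touching a cluster.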
+ + +Releasing Memory +---------------- + +To release memory that ``tcmalloc`` has allocated but which is not being used by +the Ceph daemon itself, execute the following:: + + ceph tell {daemon-type}.{daemon-id} heap release + +For example:: + + ceph tell osd.2 heap release + + +Stopping the Profiler +--------------------- + +To stop the heap profiler, execute the following:: + + ceph tell {daemon-type}.{daemon-id} heap stop_profiler + +For example:: + + ceph tell osd.0 heap stop_profiler + +.. _Logging and Debugging: ../log-and-debug +.. _Google Heap Profiler: http://goog-perftools.sourceforge.net/doc/heap_profiler.html diff --git a/doc/rados/troubleshooting/troubleshooting-mon.rst b/doc/rados/troubleshooting/troubleshooting-mon.rst new file mode 100644 index 000000000..dc575f761 --- /dev/null +++ b/doc/rados/troubleshooting/troubleshooting-mon.rst @@ -0,0 +1,613 @@ +.. _rados-troubleshooting-mon: + +================================= + Troubleshooting Monitors +================================= + +.. index:: monitor, high availability + +When a cluster encounters monitor-related troubles there's a tendency to +panic, and sometimes with good reason. Losing one or more monitors doesn't +necessarily mean that your cluster is down, so long as a majority are up, +running, and form a quorum. +Regardless of how bad the situation is, the first thing you should do is to +calm down, take a breath, and step through the troubleshooting steps below. + + +Initial Troubleshooting +======================== + + +**Are the monitors running?** + + First of all, we need to make sure the monitor (*mon*) daemon processes + (``ceph-mon``) are running. You would be amazed by how often Ceph admins + forget to start the mons, or to restart them after an upgrade. There's no + shame, but try not to lose a couple of hours looking for a deeper problem. + When running Kraken or later releases also ensure that the manager + daemons (``ceph-mgr``) are running, usually alongside each ``ceph-mon``. 
+ + +**Are you able to reach the mon nodes?** + + It doesn't happen often, but sometimes there are ``iptables`` rules that + block access to mon nodes or TCP ports. These may be leftovers from + prior stress-testing or rule development. Try SSHing into + the server and, if that succeeds, try connecting to the monitor's ports + (``tcp/3300`` and ``tcp/6789``) using ``telnet``, ``nc``, or a similar tool. + +**Does ceph -s run and obtain a reply from the cluster?** + + If the answer is yes then your cluster is up and running. One thing you + can take for granted is that the monitors will only answer to a ``status`` + request if there is a formed quorum. Also check that at least one ``mgr`` + daemon is reported as running, ideally all of them. + + If ``ceph -s`` hangs without obtaining a reply from the cluster + or shows ``fault`` messages, then it is likely that your monitors + are either down completely or just a fraction are up -- a fraction + insufficient to form a majority quorum. This check will connect to an + arbitrary mon; in rare cases it may be illuminating to bind to specific + mons in sequence by adding e.g. ``-m mymon1`` to the command. + +**What if ceph -s doesn't come back?** + + If you haven't gone through all the steps so far, please go back and do so. + + You can contact each monitor individually asking them for their status, + regardless of a quorum being formed. This can be achieved using + ``ceph tell mon.ID mon_status``, ID being the monitor's identifier. You should + perform this for each monitor in the cluster. In section `Understanding + mon_status`_ we will explain how to interpret the output of this command. + + You may instead SSH into each mon node and query the daemon's admin socket. + + +Using the monitor's admin socket +================================= + +The admin socket allows you to interact with a given daemon directly using a +Unix socket file. This file can be found in your monitor's ``run`` directory. 
+By default, the admin socket will be kept in ``/var/run/ceph/ceph-mon.ID.asok`` +but this may be elsewhere if you have overridden the default directory. If you +don't find it there, check your ``ceph.conf`` for an alternative path or +run:: + + ceph-conf --name mon.ID --show-config-value admin_socket + +Bear in mind that the admin socket will be available only while the monitor +daemon is running. When the monitor is properly shut down, the admin socket +will be removed. If however the monitor is not running and the admin socket +persists, it is likely that the monitor was improperly shut down. +Regardless, if the monitor is not running, you will not be able to use the +admin socket, with ``ceph`` likely returning ``Error 111: Connection Refused``. + +Accessing the admin socket is as simple as running ``ceph tell`` on the daemon +you are interested in. For example:: + + ceph tell mon.<id> mon_status + +Under the hood, this passes the command ``mon_status`` to the running MON daemon +``<id>`` via its "admin socket", which is a file ending in ``.asok`` +somewhere under ``/var/run/ceph``. Once you know the full path to the file, +you can even do this yourself:: + + ceph --admin-daemon <full_path_to_asok_file> <command> + +Using ``help`` as the command to the ``ceph`` tool will show you the +supported commands available through the admin socket. Please take a look +at ``config get``, ``config show``, ``mon stat`` and ``quorum_status``, +as those can be enlightening when troubleshooting a monitor. + + +Understanding mon_status +========================= + +``mon_status`` can always be obtained via the admin socket. This command will +output a multitude of information about the monitor, including the same output +you would get with ``quorum_status``. 
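Because ``mon_status`` should be collected from every monitor, a small loop can save typing. The following is a minimal sketch under stated assumptions: the function name ``mon_status_all`` and the ``CEPH`` override variable are hypothetical, and the monitor IDs passed as arguments are placeholders for your own. The only real command used is ``ceph tell mon.ID mon_status`` from this document.

```shell
# Hypothetical helper, not part of Ceph.
# CEPH defaults to the real CLI; set CEPH=echo for a dry run.
mon_status_all() {
    # Arguments: one or more monitor IDs (e.g. a b c)
    for id in "$@"; do
        echo "== mon.$id =="
        "${CEPH:-ceph}" tell "mon.$id" mon_status
    done
}
```

For example, ``CEPH=echo mon_status_all a b c`` prints a header and the command arguments for each of the three monitors without contacting a cluster.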
+ +Take the following example output of ``ceph tell mon.c mon_status``:: + + + { "name": "c", + "rank": 2, + "state": "peon", + "election_epoch": 38, + "quorum": [ + 1, + 2], + "outside_quorum": [], + "extra_probe_peers": [], + "sync_provider": [], + "monmap": { "epoch": 3, + "fsid": "5c4e9d53-e2e1-478a-8061-f543f8be4cf8", + "modified": "2013-10-30 04:12:01.945629", + "created": "2013-10-29 14:14:41.914786", + "mons": [ + { "rank": 0, + "name": "a", + "addr": "127.0.0.1:6789\/0"}, + { "rank": 1, + "name": "b", + "addr": "127.0.0.1:6790\/0"}, + { "rank": 2, + "name": "c", + "addr": "127.0.0.1:6795\/0"}]}} + +A couple of things are obvious: we have three monitors in the monmap (*a*, *b* +and *c*), the quorum is formed by only two monitors, and *c* is in the quorum +as a *peon*. + +Which monitor is out of the quorum? + + The answer would be **a**. + +Why? + + Take a look at the ``quorum`` set. We have two monitors in this set: *1* + and *2*. These are not monitor names. These are monitor ranks, as established + in the current monmap. We are missing the monitor with rank 0, and according + to the monmap that would be ``mon.a``. + +By the way, how are ranks established? + + Ranks are (re)calculated whenever you add or remove monitors and follow a + simple rule: the **lower** the ``IP:PORT`` combination, the **lower** the + rank number is. In this case, considering that ``127.0.0.1:6789`` is lower than all + the remaining ``IP:PORT`` combinations, ``mon.a`` has rank 0. + +Most Common Monitor Issues +=========================== + +Have Quorum but at least one Monitor is down +--------------------------------------------- + +When this happens, depending on the version of Ceph you are running, +you should be seeing something similar to:: + + $ ceph health detail + [snip] + mon.a (rank 0) addr 127.0.0.1:6789/0 is down (out of quorum) + +How to troubleshoot this? + + First, make sure ``mon.a`` is running. 
+ + Second, make sure you are able to connect to ``mon.a``'s node from the + other mon nodes. Check the TCP ports as well. Check ``iptables`` and + ``nf_conntrack`` on all nodes and ensure that you are not + dropping/rejecting connections. + + If this initial troubleshooting doesn't solve your problems, then it's + time to go deeper. + + First, check the problematic monitor's ``mon_status`` via the admin + socket as explained in `Using the monitor's admin socket`_ and + `Understanding mon_status`_. + + If the monitor is out of the quorum, its state should be one of + ``probing``, ``electing`` or ``synchronizing``. If it happens to be either + ``leader`` or ``peon``, then the monitor believes it is in quorum, while + the remaining cluster is sure it is not; or maybe it got into the quorum + while we were troubleshooting the monitor, so check ``ceph -s`` again + just to make sure. Proceed if the monitor is not yet in the quorum. + +What if the state is ``probing``? + + This means the monitor is still looking for the other monitors. Every time + you start a monitor, the monitor will stay in this state for some time + while trying to connect to the rest of the monitors specified in the ``monmap``. + The time a monitor will spend in this state can vary. For instance, when on + a single-monitor cluster (never do this in production), + the monitor will pass through the probing state almost instantaneously. + In a multi-monitor cluster, the monitors will stay in this state until they + find enough monitors to form a quorum -- this means that if you have 2 out + of 3 monitors down, the one remaining monitor will stay in this state + indefinitely until you bring one of the other monitors up. + + If you have a quorum the starting daemon should be able to find the + other monitors quickly, as long as they can be reached. 
If your + monitor is stuck probing and you have gone through with all the communication + troubleshooting, then there is a fair chance that the monitor is trying + to reach the other monitors on a wrong address. ``mon_status`` outputs the + ``monmap`` known to the monitor: check if the other monitor's locations + match reality. If they don't, jump to + `Recovering a Monitor's Broken monmap`_; if they do, then it may be related + to severe clock skews amongst the monitor nodes and you should refer to + `Clock Skews`_ first, but if that doesn't solve your problem then it is + the time to prepare some logs and reach out to the community (please refer + to `Preparing your logs`_ on how to best prepare your logs). + + +What if state is ``electing``? + + This means the monitor is in the middle of an election. With recent Ceph + releases these typically complete quickly, but at times the monitors can + get stuck in what is known as an *election storm*. This can indicate + clock skew among the monitor nodes; jump to + `Clock Skews`_ for more information. If all your clocks are properly + synchronized, you should search the mailing lists and tracker. + This is not a state that is likely to persist and aside from + (*really*) old bugs there is not an obvious reason besides clock skews on + why this would happen. Worst case, if there are enough surviving mons, + down the problematic one while you investigate. + +What if state is ``synchronizing``? + + This means the monitor is catching up with the rest of the cluster in + order to join the quorum. Time to synchronize is a function of the size + of your monitor store and thus of cluster size and state, so if you have a + large or degraded cluster this may take a while. 
+ + If you notice that the monitor jumps from ``synchronizing`` to + ``electing`` and then back to ``synchronizing``, then you do have a + problem: the cluster state may be advancing (i.e., generating new maps) + too fast for the synchronization process to keep up. This was a more common + thing in early days (Cuttlefish), but since then the synchronization process + has been refactored and enhanced to avoid this dynamic. If you experience + this in later versions please let us know via a bug tracker. And bring some logs + (see `Preparing your logs`_). + +What if state is ``leader`` or ``peon``? + + This should not happen: famous last words. If it does, however, it likely + has a lot to do with clock skew -- see `Clock Skews`_. If you are not + suffering from clock skew, then please prepare your logs (see + `Preparing your logs`_) and reach out to the community. + + +Recovering a Monitor's Broken ``monmap`` +---------------------------------------- + +This is how a ``monmap`` usually looks, depending on the number of +monitors:: + + + epoch 3 + fsid 5c4e9d53-e2e1-478a-8061-f543f8be4cf8 + last_changed 2013-10-30 04:12:01.945629 + created 2013-10-29 14:14:41.914786 + 0: 127.0.0.1:6789/0 mon.a + 1: 127.0.0.1:6790/0 mon.b + 2: 127.0.0.1:6795/0 mon.c + +This may not be what you have however. For instance, in some versions of +early Cuttlefish there was a bug that could cause your ``monmap`` +to be nullified. Completely filled with zeros. This means that not even +``monmaptool`` would be able to make sense of cold, hard, inscrutable zeros. +It's also possible to end up with a monitor with a severely outdated monmap, +notably if the node has been down for months while you fight with your vendor's +TAC. 
The subject ``ceph-mon`` daemon might be unable to find the surviving +monitors (e.g., say ``mon.c`` is down; you add a new monitor ``mon.d``, +then remove ``mon.a``, then add a new monitor ``mon.e`` and remove +``mon.b``; you will end up with a totally different monmap from the one +``mon.c`` knows). + +In this situation you have two possible solutions: + +Scrap the monitor and redeploy + + You should only take this route if you are positive that you won't + lose the information kept by that monitor; that you have other monitors + and that they are running just fine so that your new monitor is able + to synchronize from the remaining monitors. Keep in mind that destroying + a monitor, if there are no other copies of its contents, may lead to + loss of data. + +Inject a monmap into the monitor + + Usually the safest path. You should grab the monmap from the remaining + monitors and inject it into the monitor with the corrupted/lost monmap. + + These are the basic steps: + + 1. Is there a formed quorum? If so, grab the monmap from the quorum:: + + $ ceph mon getmap -o /tmp/monmap + + 2. No quorum? Grab the monmap directly from another monitor (this + assumes the monitor you are grabbing the monmap from has id ID-FOO + and has been stopped):: + + $ ceph-mon -i ID-FOO --extract-monmap /tmp/monmap + + 3. Stop the monitor you are going to inject the monmap into. + + 4. Inject the monmap:: + + $ ceph-mon -i ID --inject-monmap /tmp/monmap + + 5. Start the monitor + + Please keep in mind that the ability to inject monmaps is a powerful + feature that can cause havoc with your monitors if misused as it will + overwrite the latest, existing monmap kept by the monitor. + + +Clock Skews +------------ + +Monitor operation can be severely affected by clock skew among the quorum's +mons, as the PAXOS consensus algorithm requires tight time alignment. +Skew can result in weird behavior with no obvious +cause. 
To avoid such issues, you must run a clock synchronization tool +on your monitor nodes: ``Chrony`` or the legacy ``ntpd``. Be sure to +configure the mon nodes with the ``iburst`` option and multiple peers: + +* Each other +* Internal ``NTP`` servers +* Multiple external, public pool servers + +For good measure, *all* nodes in your cluster should also sync against +internal and external servers, and perhaps even your mons. ``NTP`` servers +should run on bare metal; VM virtualized clocks are not suitable for steady +timekeeping. Visit `https://www.ntp.org <https://www.ntp.org>`_ for more information. Your +organization may already have quality internal ``NTP`` servers you can use. +Sources for ``NTP`` server appliances include: + +* Microsemi (formerly Symmetricom) `https://microsemi.com <https://www.microsemi.com/product-directory/3425-timing-synchronization>`_ +* EndRun `https://endruntechnologies.com <https://endruntechnologies.com/products/ntp-time-servers>`_ +* Netburner `https://www.netburner.com <https://www.netburner.com/products/network-time-server/pk70-ex-ntp-network-time-server>`_ + + +What's the maximum tolerated clock skew? + + By default the monitors will allow clocks to drift up to 0.05 seconds (50 ms). + + +Can I increase the maximum tolerated clock skew? + + The maximum tolerated clock skew is configurable via the + ``mon-clock-drift-allowed`` option, and + although you *CAN* you almost certainly *SHOULDN'T*. The clock skew mechanism + is in place because clock-skewed monitors are likely to misbehave. We, as + developers and QA aficionados, are comfortable with the current default + value, as it will alert the user before the monitors get out of hand. Changing + this value may cause unforeseen effects on the + stability of the monitors and overall cluster health. + +How do I know there's a clock skew? + + The monitors will warn you via the cluster status ``HEALTH_WARN``. 
``ceph health + detail`` or ``ceph status`` should show something like:: + + mon.c addr 10.10.0.1:6789/0 clock skew 0.08235s > max 0.05s (latency 0.0045s) + + That means that ``mon.c`` has been flagged as suffering from a clock skew. + + On releases beginning with Luminous you can issue the + ``ceph time-sync-status`` command to check status. Note that the lead mon + is typically the one with the numerically lowest IP address. The lead mon + will always show an offset of ``0``: the reported offsets of other mons are + relative to the lead mon, not to any external reference source. + + +What should I do if there's a clock skew? + + Synchronize your clocks. Running an NTP client may help. If you are already + using one and you hit this sort of issue, check if you are using some NTP + server remote to your network and consider hosting your own NTP server on + your network. This last option tends to reduce the number of issues with + monitor clock skews. + + +Client Can't Connect or Mount +------------------------------ + +Check your IP tables. Some OS install utilities add a ``REJECT`` rule to +``iptables``. The rule rejects all clients trying to connect to the host except +for ``ssh``. If your monitor host's IP tables have such a ``REJECT`` rule in +place, clients connecting from a separate node will fail to mount with a timeout +error. You need to address ``iptables`` rules that reject clients trying to +connect to Ceph daemons. For example, you would need to address rules that look +like this appropriately:: + + REJECT all -- anywhere anywhere reject-with icmp-host-prohibited + +You may also need to add rules to IP tables on your Ceph hosts to ensure +that clients can access the ports associated with your Ceph monitors (i.e., port +6789 by default) and Ceph OSDs (i.e., 6800 through 7300 by default). 
For +example:: + + iptables -A INPUT -m multiport -p tcp -s {ip-address}/{netmask} --dports 6789,6800:7300 -j ACCEPT + +Monitor Store Failures +====================== + +Symptoms of store corruption +---------------------------- + +The Ceph monitor stores the :term:`Cluster Map` in a key/value store such as LevelDB. If +a monitor fails due to key/value store corruption, the following error messages +might be found in the monitor log:: + + Corruption: error in middle of record + +or:: + + Corruption: 1 missing files; e.g.: /var/lib/ceph/mon/mon.foo/store.db/1234567.ldb + +Recovery using healthy monitor(s) +--------------------------------- + +If there are any survivors, we can always :ref:`replace <adding-and-removing-monitors>` the corrupted one with a +new one. After booting up, the new joiner will sync up with a healthy +peer, and once it is fully sync'ed, it will be able to serve the clients. + +.. _mon-store-recovery-using-osds: + +Recovery using OSDs +------------------- + +But what if all monitors fail at the same time? Since users are encouraged to +deploy at least three (and preferably five) monitors in a Ceph cluster, simultaneous +failure is rare. But unplanned power-downs in a data center with improperly +configured disk/fs settings could fail the underlying file system, and hence +kill all the monitors. In this case, we can recover the monitor store with the +information stored in OSDs:: + + ms=/root/mon-store + mkdir $ms + + # collect the cluster map from stopped OSDs + for host in $hosts; do + rsync -avz $ms/. user@$host:$ms.remote + rm -rf $ms + ssh user@$host <<EOF + for osd in /var/lib/ceph/osd/ceph-*; do + ceph-objectstore-tool --data-path \$osd --no-mon-config --op update-mon-db --mon-store-path $ms.remote + done + EOF + rsync -avz user@$host:$ms.remote/. 
$ms + done + + # rebuild the monitor store from the collected map. if the cluster does not + # use cephx authentication, we can skip the following steps to update the + # keyring with the caps, and there is no need to pass the "--keyring" option. + # i.e. just use "ceph-monstore-tool $ms rebuild" instead + ceph-authtool /path/to/admin.keyring -n mon. \ + --cap mon 'allow *' + ceph-authtool /path/to/admin.keyring -n client.admin \ + --cap mon 'allow *' --cap osd 'allow *' --cap mds 'allow *' + # add one or more ceph-mgr keys to the keyring. in this case, an encoded key + # for mgr.x is added. you can find the encoded key in + # /etc/ceph/${cluster}.${mgr_name}.keyring on the machine where ceph-mgr is + # deployed + ceph-authtool /path/to/admin.keyring --add-key 'AQDN8kBe9PLWARAAZwxXMr+n85SBYbSlLcZnMA==' -n mgr.x \ + --cap mon 'allow profile mgr' --cap osd 'allow *' --cap mds 'allow *' + # if your monitors' ids are not single characters like 'a', 'b', 'c', please + # specify them in the command line by passing them as arguments of the "--mon-ids" + # option. if you are not sure, please check your ceph.conf to see if there is any + # section named like '[mon.foo]'. don't pass the "--mon-ids" option if you are + # using DNS SRV for looking up monitors. + ceph-monstore-tool $ms rebuild -- --keyring /path/to/admin.keyring --mon-ids alpha beta gamma + + # make a backup of the corrupted store.db just in case! repeat for + # all monitors. + mv /var/lib/ceph/mon/mon.foo/store.db /var/lib/ceph/mon/mon.foo/store.db.corrupted + + # move the rebuilt store.db into place. repeat for all monitors. + mv $ms/store.db /var/lib/ceph/mon/mon.foo/store.db + chown -R ceph:ceph /var/lib/ceph/mon/mon.foo/store.db + +The steps above + +#. collect the map from all OSD hosts, +#. fill the entities in the keyring file with appropriate caps, +#. rebuild the store, +#. replace the corrupted store on ``mon.foo`` with the recovered copy. 
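The script moves the rebuilt ``store.db`` with ``mv``, so the rebuilt copy is gone after the first monitor even though the comments say to repeat the step for all monitors. The following is a minimal sketch of a helper that copies instead of moving, so the same rebuilt store can be reused. The function name ``install_rebuilt_store`` is hypothetical, it only handles a monitor data directory on the local host, and it deliberately leaves the ``chown -R ceph:ceph`` step to the caller.

```shell
# Hypothetical helper, not part of Ceph: back up a monitor's existing
# store.db and copy (not move) the rebuilt one into place.
#   $1: directory holding the rebuilt store.db (the $ms of the script above)
#   $2: monitor data directory, e.g. /var/lib/ceph/mon/mon.foo
install_rebuilt_store() {
    rebuilt=$1
    mon_dir=$2
    if [ ! -d "$rebuilt/store.db" ]; then
        echo "no rebuilt store.db under $rebuilt" >&2
        return 1
    fi
    # Never clobber an earlier backup of the corrupted store.
    if [ -e "$mon_dir/store.db.corrupted" ]; then
        echo "backup already exists in $mon_dir; refusing to overwrite" >&2
        return 1
    fi
    if [ -e "$mon_dir/store.db" ]; then
        mv "$mon_dir/store.db" "$mon_dir/store.db.corrupted" || return 1
    fi
    # Copy so that the rebuilt store stays available for the next monitor.
    cp -a "$rebuilt/store.db" "$mon_dir/store.db"
}
```

A call such as ``install_rebuilt_store /root/mon-store /var/lib/ceph/mon/mon.foo`` would then be followed by the ``chown`` from the script above, run as root.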
+ +Known limitations +~~~~~~~~~~~~~~~~~ + +The following information is not recoverable using the steps above: + +- **some added keyrings**: all the OSD keyrings added using the ``ceph auth add`` command + are recovered from the OSD's copy. And the ``client.admin`` keyring is imported + using ``ceph-monstore-tool``. But the MDS keyrings and other keyrings are missing + in the recovered monitor store. You might need to re-add them manually. + +- **creating pools**: If any RADOS pools were in the process of being created, that state is lost. The recovery tool assumes that all pools have been created. If there are PGs that are stuck in the ``unknown`` state after the recovery for a partially created pool, you can force creation of the *empty* PG with the ``ceph osd force-create-pg`` command. Note that this will create an *empty* PG, so only do this if you know the pool is empty. + +- **MDS Maps**: the MDS maps are lost. + + + +Everything Failed! Now What? +============================= + +Reaching out for help +---------------------- + +You can find us on IRC at #ceph and #ceph-devel at OFTC (server irc.oftc.net) +and on ``ceph-devel@vger.kernel.org`` and ``ceph-users@lists.ceph.com``. Make +sure you have grabbed your logs and have them ready if someone asks: the faster +the interaction and the lower the response latency, the better the chances that +everyone's time is used well. + + +Preparing your logs +--------------------- + +Monitor logs are, by default, kept in ``/var/log/ceph/ceph-mon.FOO.log*``. We +may want them. However, your logs may not have the necessary information. If +you don't find your monitor logs at their default location, you can check +where they should be by running:: + + ceph-conf --name mon.FOO --show-config-value log_file + +The amount of information in the logs is subject to the debug levels being +enforced by your configuration files. 
If you have not enforced a specific +debug level then Ceph is using the default levels and your logs may not +contain important information to track down your issue. +A first step in getting relevant information into your logs will be to raise +debug levels. In this case we will be interested in the information from the +monitor. +Similarly to what happens on other components, different parts of the monitor +will output their debug information on different subsystems. + +You will have to raise the debug levels of those subsystems more closely +related to your issue. This may not be an easy task for someone unfamiliar +with troubleshooting Ceph. For most situations, setting the following options +on your monitors will be enough to pinpoint a potential source of the issue:: + + debug mon = 10 + debug ms = 1 + +If we find that these debug levels are not enough, there's a chance we may +ask you to raise them or even define other debug subsystems to obtain information +from, but at least we will have started off with some useful information, instead +of a massively empty log without much to go on. + +Do I need to restart a monitor to adjust debug levels? +------------------------------------------------------ + +No. You may do it in one of two ways: + +You have quorum + + Either inject the debug option into the monitor you want to debug:: + + ceph tell mon.FOO config set debug_mon 10/10 + + or into all monitors at once:: + + ceph tell mon.* config set debug_mon 10/10 + +No quorum + + Use the monitor's admin socket and directly adjust the configuration + options:: + + ceph daemon mon.FOO config set debug_mon 10/10 + + +Going back to default values is as easy as rerunning the above commands +using the debug level ``1/10`` instead. You can check your current +values using the admin socket and the following commands:: + + ceph daemon mon.FOO config show + +or:: + + ceph daemon mon.FOO config get 'OPTION_NAME' + + +Reproduced the problem with appropriate debug levels. Now what? 
+---------------------------------------------------------------- + +Ideally you would send us only the relevant portions of your logs. +We realise that figuring out the corresponding portion may not be the +easiest of tasks. Therefore, we won't hold it against you if you provide the +full log, but common sense should be employed. If your log has hundreds +of thousands of lines, it may get tricky to go through the whole thing, +especially if we do not know at which point your issue occurred. +For instance, when reproducing, remember to write down +the current date and time and to extract the relevant portions of your logs +based on that. + +Finally, you should reach out to us on the mailing lists, on IRC or file +a new issue on the `tracker`_. + +.. _tracker: http://tracker.ceph.com/projects/ceph/issues/new diff --git a/doc/rados/troubleshooting/troubleshooting-osd.rst b/doc/rados/troubleshooting/troubleshooting-osd.rst new file mode 100644 index 000000000..cc852d73d --- /dev/null +++ b/doc/rados/troubleshooting/troubleshooting-osd.rst @@ -0,0 +1,620 @@ +====================== + Troubleshooting OSDs +====================== + +Before troubleshooting your OSDs, first check your monitors and network. If +you execute ``ceph health`` or ``ceph -s`` on the command line and Ceph shows +``HEALTH_OK``, it means that the monitors have a quorum. +If you don't have a monitor quorum or if there are errors with the monitor +status, `address the monitor issues first <../troubleshooting-mon>`_. +Check your networks to ensure they +are running properly, because networks may have a significant impact on OSD +operation and performance. Look for dropped packets on the host side +and CRC errors on the switch side. + +Obtaining Data About OSDs +========================= + +A good first step in troubleshooting your OSDs is to obtain topology information in +addition to the information you collected while `monitoring your OSDs`_ +(e.g., ``ceph osd tree``). 
+ + +Ceph Logs +--------- + +If you haven't changed the default path, you can find Ceph log files at +``/var/log/ceph``:: + + ls /var/log/ceph + +If you don't see enough log detail you can change your logging level. See +`Logging and Debugging`_ for details to ensure that Ceph performs adequately +under high logging volume. + + +Admin Socket +------------ + +Use the admin socket tool to retrieve runtime information. For details, list +the sockets for your Ceph daemons:: + + ls /var/run/ceph + +Then, execute the following, replacing ``{daemon-name}`` with an actual +daemon (e.g., ``osd.0``):: + + ceph daemon {daemon-name} help + +Alternatively, you can specify a ``{socket-file}`` (e.g., something in ``/var/run/ceph``):: + + ceph daemon {socket-file} help + +The admin socket, among other things, allows you to: + +- List your configuration at runtime +- Dump historic operations +- Dump the operation priority queue state +- Dump operations in flight +- Dump perfcounters + +Display Freespace +----------------- + +Filesystem issues may arise. To display your file system's free space, execute +``df``. :: + + df -h + +Execute ``df --help`` for additional usage. + +I/O Statistics +-------------- + +Use `iostat`_ to identify I/O-related issues. :: + + iostat -x + +Diagnostic Messages +------------------- + +To retrieve diagnostic messages from the kernel, use ``dmesg`` with ``less``, ``more``, ``grep`` +or ``tail``. For example:: + + dmesg | grep scsi + +Stopping w/out Rebalancing +========================== + +Periodically, you may need to perform maintenance on a subset of your cluster, +or resolve a problem that affects a failure domain (e.g., a rack). If you do not +want CRUSH to automatically rebalance the cluster as you stop OSDs for +maintenance, set the cluster to ``noout`` first:: + + ceph osd set noout + +On Luminous or newer releases it is safer to set the flag only on affected OSDs. 
+You can do this individually :: + + ceph osd add-noout osd.0 + ceph osd rm-noout osd.0 + +Or for an entire CRUSH bucket at a time. Say you're going to take down +``prod-ceph-data1701`` to add RAM :: + + ceph osd set-group noout prod-ceph-data1701 + +Once the flag is set you can stop the OSDs and any other colocated Ceph +services within the failure domain that requires maintenance work. :: + + systemctl stop ceph\*.service ceph\*.target + +.. note:: Placement groups within the OSDs you stop will become ``degraded`` + while you are addressing issues within the failure domain. + +Once you have completed your maintenance, restart the OSDs and any other +daemons. If you rebooted the host as part of the maintenance, these should +come back on their own without intervention. :: + + sudo systemctl start ceph.target + +Finally, unset the ``noout`` flag, using the command that matches how you set it:: + + ceph osd unset noout + ceph osd unset-group noout prod-ceph-data1701 + +Note that most Linux distributions that Ceph supports today employ ``systemd`` +for service management. For other or older operating systems you may need +to issue equivalent ``service`` or ``start``/``stop`` commands. + +.. _osd-not-running: + +OSD Not Running +=============== + +Under normal circumstances, simply restarting the ``ceph-osd`` daemon will +allow it to rejoin the cluster and recover. + +An OSD Won't Start +------------------ + +If you start your cluster and an OSD won't start, check the following: + +- **Configuration File:** If you were not able to get OSDs running from + a new installation, check your configuration file to ensure it conforms + to the expected syntax (e.g., ``host`` not ``hostname``, etc.). + +- **Check Paths:** Check the paths in your configuration, and the actual + paths themselves for data and metadata (journals, WAL, DB). If you separate the OSD data from + the metadata and there are errors in your configuration file or in the + actual mounts, you may have trouble starting OSDs. 
If you want to store the + metadata on a separate block device, you should partition the drive or + create LVM logical volumes on it, and assign one partition or volume per OSD. + +- **Check Max Threadcount:** If you have a node with a lot of OSDs, you may be + hitting the default maximum number of threads (usually 32k), especially + during recovery. You can use ``sysctl`` to raise the limit to the maximum + allowed value (i.e., 4194303) and see whether that helps. + For example:: + + sysctl -w kernel.pid_max=4194303 + + If increasing the maximum thread count resolves the issue, you can make it + permanent by including a ``kernel.pid_max`` setting in a file under ``/etc/sysctl.d`` or + within the master ``/etc/sysctl.conf`` file. For example:: + + kernel.pid_max = 4194303 + +- **Check ``nf_conntrack``:** This connection tracking and limiting system + is the bane of many production Ceph clusters, and can be insidious in that + everything is fine at first. As cluster topology and client workload + grow, mysterious and intermittent connection failures and performance + glitches manifest, becoming worse over time and at certain times of day. + Check ``syslog`` history for table-full events. You can mitigate this + bother by raising ``nf_conntrack_max`` to a much higher value via ``sysctl``. + Be sure to raise ``nf_conntrack_buckets`` accordingly to + ``nf_conntrack_max / 4``, which may require action outside of ``sysctl``, e.g. + ``echo 131072 > /sys/module/nf_conntrack/parameters/hashsize``. + A more drastic but fussier approach is to blacklist the associated kernel modules + to disable processing altogether. This is fragile in that the modules + vary among kernel versions, as does the order in which they must be listed. + Even when blacklisted there are situations in which ``iptables`` or ``docker`` + may activate connection tracking anyway, so a "set and forget" strategy for + the tunables is advised. 
On modern systems this will not consume appreciable + resources. + +- **Kernel Version:** Identify the kernel version and distribution you + are using. Ceph uses some third-party tools by default, which may be + buggy or may conflict with certain distributions and/or kernel + versions (e.g., Google ``gperftools`` and ``TCMalloc``). Check the + `OS recommendations`_ and the release notes for each Ceph version + to ensure you have addressed any issues related to your kernel. + +- **Segmentation Fault:** If there is a segmentation fault, increase log levels + and start the problematic daemon(s) again. If segmentation faults recur, + search the Ceph bug tracker `https://tracker.ceph.com/projects/ceph <https://tracker.ceph.com/projects/ceph/>`_ + and the ``dev`` and ``ceph-users`` mailing list archives `https://ceph.io/resources <https://ceph.io/resources>`_. + If this is truly a new and unique + failure, post to the ``dev`` email list and provide the specific Ceph + release being run, ``ceph.conf`` (with secrets XXX'd out), + your monitor status output and excerpts from your log file(s). + +An OSD Failed +------------- + +When a ``ceph-osd`` process dies, surviving ``ceph-osd`` daemons will report +to the mons that it appears down, which will in turn surface the new status +via the ``ceph health`` command:: + + ceph health + HEALTH_WARN 1/3 in osds are down + +Specifically, you will get a warning whenever there are OSDs marked ``in`` +and ``down``. You can identify which are ``down`` with:: + + ceph health detail + HEALTH_WARN 1/3 in osds are down + osd.0 is down since epoch 23, last address 192.168.106.220:6800/11080 + +or :: + + ceph osd tree down + +If there is a drive +failure or other fault preventing ``ceph-osd`` from functioning or +restarting, an error message should be present in its log file under +``/var/log/ceph``. + +If the daemon stopped because of a heartbeat failure or ``suicide timeout``, +the underlying drive or filesystem may be unresponsive. 
Check ``dmesg`` +and ``syslog`` output for drive or other kernel errors. You may need to +specify something like ``dmesg -T`` to get timestamps, otherwise it's +easy to mistake old errors for new. + +If the problem is a software error (failed assertion or other +unexpected error), search the archives and tracker as above, and +report it to the `ceph-devel`_ email list if there's no clear fix or +existing bug. + +.. _no-free-drive-space: + +No Free Drive Space +------------------- + +Ceph prevents you from writing to a full OSD so that you don't lose data. +In an operational cluster, you should receive a warning when your cluster's OSDs +and pools approach the full ratio. The ``mon osd full ratio`` defaults to +``0.95``, or 95% of capacity before it stops clients from writing data. +The ``mon osd backfillfull ratio`` defaults to ``0.90``, or 90% of +capacity above which backfills will not start. The +``mon osd nearfull ratio`` defaults to ``0.85``, or 85% of capacity +when it generates a health warning. + +Note that individual OSDs within a cluster will vary in how much data Ceph +allocates to them. This utilization can be displayed for each OSD with :: + + ceph osd df + +Overall cluster / pool fullness can be checked with :: + + ceph df + +Pay close attention to the **most full** OSDs, not the percentage of raw space +used as reported by ``ceph df``. It only takes one outlier OSD filling up to +fail writes to its pool. The space available to each pool as reported by +``ceph df`` considers the ratio settings relative to the *most full* OSD that +is part of a given pool. The distribution can be flattened by progressively +moving data from overfull to underfull OSDs using the ``reweight-by-utilization`` +command. With Ceph releases beginning with later revisions of Luminous one can also +exploit the ``ceph-mgr`` ``balancer`` module to perform this task automatically +and rather effectively. 
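To make the three thresholds concrete, the following minimal sketch classifies a single OSD utilization ratio against the default values quoted above (``0.85``, ``0.90``, ``0.95``). The function name ``classify_osd_fullness`` is hypothetical, and the use of "greater than or equal" at each boundary is an assumption made for illustration; this is not how Ceph itself evaluates the ratios.

```shell
# Hypothetical helper: classify a utilization ratio (0.0-1.0) against the
# nearfull, backfillfull and full ratios. Defaults are the values quoted in
# the text; pass arguments 2-4 to override them.
# Assumption: a ratio equal to a threshold counts as having reached it.
classify_osd_fullness() {
    awk -v u="$1" -v near="${2:-0.85}" -v back="${3:-0.90}" -v full="${4:-0.95}" '
        BEGIN {
            if (u >= full)      print "full"
            else if (u >= back) print "backfillfull"
            else if (u >= near) print "nearfull"
            else                print "ok"
        }'
}
```

For instance, ``classify_osd_fullness 0.91`` reports ``backfillfull``: backfills onto that OSD would not start, although clients could still write to it.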
+ +The ratios can be adjusted: + +:: + + ceph osd set-nearfull-ratio <float[0.0-1.0]> + ceph osd set-full-ratio <float[0.0-1.0]> + ceph osd set-backfillfull-ratio <float[0.0-1.0]> + +Full cluster issues can arise when an OSD fails either as a test or organically +within a small and/or very full or unbalanced cluster. When an OSD or node +holds an outsize percentage of the cluster's data, the ``nearfull`` and ``full`` +ratios may be exceeded as a result of component failures or even natural growth. +If you are testing how Ceph reacts to OSD failures on a small +cluster, you should leave ample free disk space and consider temporarily +lowering the OSD ``full ratio``, OSD ``backfillfull ratio`` and +OSD ``nearfull ratio``. + +Full ``ceph-osds`` will be reported by ``ceph health``:: + + ceph health + HEALTH_WARN 1 nearfull osd(s) + +Or:: + + ceph health detail + HEALTH_ERR 1 full osd(s); 1 backfillfull osd(s); 1 nearfull osd(s) + osd.3 is full at 97% + osd.4 is backfill full at 91% + osd.2 is near full at 87% + +The best way to deal with a full cluster is to add capacity via new OSDs, enabling +the cluster to redistribute data to newly available storage. + +If you cannot start a legacy Filestore OSD because it is full, you may reclaim +some space by deleting a few placement group directories in the full OSD. + +.. important:: If you choose to delete a placement group directory on a full OSD, + **DO NOT** delete the same placement group directory on another full OSD, or + **YOU WILL LOSE DATA**. You **MUST** maintain at least one copy of your data on + at least one OSD. This is a rare and extreme intervention, and is not to be + undertaken lightly. + +See `Monitor Config Reference`_ for additional details. + +OSDs are Slow/Unresponsive +========================== + +A common issue involves slow or unresponsive OSDs. Ensure that you +have eliminated other troubleshooting possibilities before delving into OSD +performance issues. 
For example, ensure that your networks are working properly +and your OSDs are running. Check to see if OSDs are throttling recovery traffic. + +.. tip:: Newer versions of Ceph provide better recovery handling: they prevent + recovering OSDs from using up so many system resources that ``up`` and ``in`` + OSDs become unavailable or otherwise slow. + +Networking Issues +----------------- + +Ceph is a distributed storage system, so it relies upon networks for OSD peering +and replication, recovery from faults, and periodic heartbeats. Networking +issues can cause OSD latency and flapping OSDs. See `Flapping OSDs`_ for +details. + +Ensure that Ceph processes and Ceph-dependent processes are connected and/or +listening. :: + + netstat -a | grep ceph + netstat -l | grep ceph + sudo netstat -p | grep ceph + +Check network statistics. :: + + netstat -s + +Drive Configuration +------------------- + +A SAS or SATA storage drive should only house one OSD; NVMe drives readily +handle two or more. Read and write throughput can bottleneck if other processes +share the drive, including journals / metadata, operating systems, Ceph monitors, +``syslog`` logs, other OSDs, and non-Ceph processes. + +Ceph acknowledges writes *after* journaling, so fast SSDs are an +attractive option to accelerate the response time, particularly when +using the ``XFS`` or ``ext4`` file systems for legacy Filestore OSDs. +By contrast, the ``Btrfs`` +file system can write and journal simultaneously. (Note, however, that +we recommend against using ``Btrfs`` for production deployments.) + +.. note:: Partitioning a drive does not change its total throughput or + sequential read/write limits. Running a journal in a separate partition + may help, but you should prefer a separate physical drive. + +Bad Sectors / Fragmented Disk +----------------------------- + +Check your drives for bad blocks, fragmentation, and other errors that can cause +performance to drop substantially. 
Invaluable tools include ``dmesg``, ``syslog`` +logs, and ``smartctl`` (from the ``smartmontools`` package). + +Co-resident Monitors/OSDs +------------------------- + +Monitors are relatively lightweight processes, but they issue lots of +``fsync()`` calls, +which can interfere with other workloads, particularly if monitors run on the +same drive as an OSD. Additionally, if you run monitors on the same host as +OSDs, you may incur performance issues related to: + +- Running an older kernel (pre-3.0) +- Running a kernel with no ``syncfs(2)`` syscall. + +In these cases, multiple OSDs running on the same host can drag each other down +by doing lots of commits. This often leads to bursty writes. + +Co-resident Processes +--------------------- + +Spinning up co-resident processes (convergence) such as a cloud-based solution, virtual +machines and other applications that write data to Ceph while operating on the +same hardware as OSDs can introduce significant OSD latency. Generally, we +recommend optimizing hosts for use with Ceph and using other hosts for other +processes. The practice of separating Ceph operations from other applications +may help improve performance and may streamline troubleshooting and maintenance. + +Logging Levels +-------------- + +If you turned logging levels up to track an issue and then forgot to turn +logging levels back down, the OSD may be putting a lot of logs onto the disk. If +you intend to keep logging levels high, you may consider mounting a dedicated drive at the +default path for logging (i.e., ``/var/log/ceph/$cluster-$name.log``). + +Recovery Throttling +------------------- + +Depending upon your configuration, Ceph may reduce recovery rates to maintain +performance or it may increase recovery rates to the point that recovery +impacts OSD performance. Check to see if the OSD is recovering. + +Kernel Version +-------------- + +Check the kernel version you are running. 
Older kernels may not receive +new backports that Ceph depends upon for better performance. + +Kernel Issues with SyncFS +------------------------- + +Try running one OSD per host to see if performance improves. Old kernels +might not have a recent enough version of ``glibc`` to support ``syncfs(2)``. + +Filesystem Issues +----------------- + +Currently, we recommend deploying clusters with the BlueStore back end. +When running a pre-Luminous release or if you have a specific reason to deploy +OSDs with the previous Filestore backend, we recommend ``XFS``. + +We recommend against using ``Btrfs`` or ``ext4``. The ``Btrfs`` filesystem has +many attractive features, but bugs may lead to +performance issues and spurious ENOSPC errors. We do not recommend +``ext4`` for Filestore OSDs because ``xattr`` limitations break support for long +object names, which are needed for RGW. + +For more information, see `Filesystem Recommendations`_. + +.. _Filesystem Recommendations: ../configuration/filesystem-recommendations + +Insufficient RAM +---------------- + +We recommend a *minimum* of 4GB of RAM per OSD daemon and suggest rounding up +from 6-8GB. You may notice that during normal operations, ``ceph-osd`` +processes only use a fraction of that amount. +Unused RAM makes it tempting to use the excess RAM for co-resident +applications or to skimp on each node's memory capacity. However, +when OSDs experience recovery their memory utilization spikes. If +there is insufficient RAM available, OSD performance will slow considerably +and the daemons may even crash or be killed by the Linux ``OOM Killer``. + +Blocked Requests or Slow Requests +--------------------------------- + +If a ``ceph-osd`` daemon is slow to respond to a request, messages will be logged +noting ops that are taking too long. The warning threshold +defaults to 30 seconds and is configurable via the ``osd op complaint time`` +setting. When this happens, the cluster log will receive messages. 
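When many such warnings accumulate, it can help to extract how long each request has been blocked. The following is a minimal sketch, not an official tool: the function name ``slow_request_ages`` is hypothetical, and it assumes log lines that use the ``slow request <seconds> seconds old`` wording of newer releases.

```shell
# Hypothetical helper: read log lines on stdin and print the age, in seconds,
# of every "slow request ... seconds old" entry. Lines without that wording
# (including the summary "N slow requests" lines) are ignored.
slow_request_ages() {
    sed -n 's/.*slow request \([0-9][0-9.]*\) seconds old.*/\1/p'
}
```

Assuming the cluster log lives at ``/var/log/ceph/ceph.log`` (adjust the path for your deployment), ``slow_request_ages < /var/log/ceph/ceph.log | sort -n | tail -n 5`` would list the five longest-blocked requests.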
+ +Legacy versions of Ceph complain about ``old requests``:: + + osd.0 192.168.106.220:6800/18813 312 : [WRN] old request osd_op(client.5099.0:790 fatty_26485_object789 [write 0~4096] 2.5e54f643) v4 received at 2012-03-06 15:42:56.054801 currently waiting for sub ops + +New versions of Ceph complain about ``slow requests``:: + + {date} {osd.num} [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.005692 secs + {date} {osd.num} [WRN] slow request 30.005692 seconds old, received at {date-time}: osd_op(client.4240.0:8 benchmark_data_ceph-1_39426_object7 [write 0~4194304] 0.69848840) v4 currently waiting for subops from [610] + +Possible causes include: + +- A failing drive (check ``dmesg`` output) +- A bug in the kernel file system (check ``dmesg`` output) +- An overloaded cluster (check system load, iostat, etc.) +- A bug in the ``ceph-osd`` daemon. + +Possible solutions: + +- Remove VMs from Ceph hosts +- Upgrade kernel +- Upgrade Ceph +- Restart OSDs +- Replace failed or failing components + +Debugging Slow Requests +----------------------- + +If you run ``ceph daemon osd.<id> dump_historic_ops`` or ``ceph daemon osd.<id> dump_ops_in_flight``, +you will see a set of operations and a list of events each operation went +through. These are briefly described below. + +Events from the Messenger layer: + +- ``header_read``: When the messenger first started reading the message off the wire. +- ``throttled``: When the messenger tried to acquire memory throttle space to read + the message into memory. +- ``all_read``: When the messenger finished reading the message off the wire. +- ``dispatched``: When the messenger gave the message to the OSD. +- ``initiated``: This is identical to ``header_read``. The existence of both is a + historical oddity. + +Events from the OSD as it processes ops: + +- ``queued_for_pg``: The op has been put into the queue for processing by its PG. +- ``reached_pg``: The PG has started doing the op. 
+- ``waiting for \*``: The op is waiting for some other work to complete before it + can proceed (e.g. a new OSDMap; for its object target to scrub; for the PG to + finish peering; all as specified in the message). +- ``started``: The op has been accepted as something the OSD should do and + is now being performed. +- ``waiting for subops from``: The op has been sent to replica OSDs. + +Events from ``Filestore``: + +- ``commit_queued_for_journal_write``: The op has been given to the FileStore. +- ``write_thread_in_journal_buffer``: The op is in the journal's buffer and waiting + to be persisted (as the next disk write). +- ``journaled_completion_queued``: The op was journaled to disk and its callback + queued for invocation. + +Events from the OSD after data has been given to underlying storage: + +- ``op_commit``: The op has been committed (i.e. written to journal) by the + primary OSD. +- ``op_applied``: The op has been `write()'en <https://www.freebsd.org/cgi/man.cgi?write(2)>`_ to the backing FS (i.e. applied in memory but not flushed out to disk) on the primary. +- ``sub_op_applied``: ``op_applied``, but for a replica's "subop". +- ``sub_op_committed``: ``op_commit``, but for a replica's subop (only for EC pools). +- ``sub_op_commit_rec/sub_op_apply_rec from <X>``: The primary marks this when it + hears about the above, but for a particular replica (i.e. ``<X>``). +- ``commit_sent``: We sent a reply back to the client (or primary OSD, for sub ops). + +Many of these events are seemingly redundant, but cross important boundaries in +the internal code (such as passing data across locks into new threads). + +Flapping OSDs +============= + +When OSDs peer and check heartbeats, they use the cluster (back-end) +network when it's available. See `Monitor/OSD Interaction`_ for details. + +We have traditionally recommended separate *public* (front-end) and *private* +(cluster / back-end / replication) networks: + +#.
Segregation of heartbeat and replication / recovery traffic (private) + from client and OSD <-> mon traffic (public). This helps keep one + from DoS-ing the other, which could in turn result in a cascading failure. + +#. Additional throughput for both public and private traffic. + +When common networking technologies were 100Mb/s and 1Gb/s, this separation +was often critical. With today's 10Gb/s, 40Gb/s, and 25/50/100Gb/s +networks, the above capacity concerns are often diminished or even obviated. +For example, if your OSD nodes have two network ports, dedicating one to +the public and the other to the private network means no path redundancy. +This degrades your ability to weather network maintenance and failures without +significant cluster or client impact. Consider instead using both links +for just a public network: with bonding (LACP) or equal-cost routing (e.g. FRR) +you reap the benefits of increased throughput headroom, fault tolerance, and +reduced OSD flapping. + +When a private network (or even a single host link) fails or degrades while the +public network operates normally, OSDs may not handle this situation well. What +happens is that OSDs use the public network to report each other ``down`` to +the monitors, while marking themselves ``up``. The monitors then send out, +again on the public network, an updated cluster map with affected OSDs marked +``down``. These OSDs reply to the monitors "I'm not dead yet!", and the cycle +repeats. We call this scenario 'flapping', and it can be difficult to isolate +and remediate. With no private network, this irksome dynamic is avoided: +OSDs are generally either ``up`` or ``down`` without flapping.
+ +If something does cause OSDs to 'flap' (repeatedly getting marked ``down`` and +then ``up`` again), you can force the monitors to halt the flapping by +temporarily freezing their states:: + + ceph osd set noup # prevent OSDs from getting marked up + ceph osd set nodown # prevent OSDs from getting marked down + +These flags are recorded in the osdmap:: + + ceph osd dump | grep flags + flags no-up,no-down + +You can clear the flags with:: + + ceph osd unset noup + ceph osd unset nodown + +Two other flags are supported, ``noin`` and ``noout``, which prevent +booting OSDs from being marked ``in`` (allocated data) or protect OSDs +from eventually being marked ``out`` (regardless of what the current value for +``mon osd down out interval`` is). + +.. note:: ``noup``, ``noout``, and ``nodown`` are temporary in the + sense that once the flags are cleared, the action they were blocking + should occur shortly after. The ``noin`` flag, on the other hand, + prevents OSDs from being marked ``in`` on boot, and any daemons that + started while the flag was set will remain that way. + +.. note:: The causes and effects of flapping can be somewhat mitigated through + careful adjustments to the ``mon_osd_down_out_subtree_limit``, + ``mon_osd_reporter_subtree_level``, and ``mon_osd_min_down_reporters``. + Derivation of optimal settings depends on cluster size, topology, and the + Ceph release in use. Their interactions are subtle and beyond the scope of + this document. + + +.. _iostat: https://en.wikipedia.org/wiki/Iostat +.. _Ceph Logging and Debugging: ../../configuration/ceph-conf#ceph-logging-and-debugging +.. _Logging and Debugging: ../log-and-debug +.. _Debugging and Logging: ../debug +.. _Monitor/OSD Interaction: ../../configuration/mon-osd-interaction +.. _Monitor Config Reference: ../../configuration/mon-config-ref +.. _monitoring your OSDs: ../../operations/monitoring-osd-pg +.. 
_subscribe to the ceph-devel email list: mailto:majordomo@vger.kernel.org?body=subscribe+ceph-devel +.. _unsubscribe from the ceph-devel email list: mailto:majordomo@vger.kernel.org?body=unsubscribe+ceph-devel +.. _subscribe to the ceph-users email list: mailto:ceph-users-join@lists.ceph.com +.. _unsubscribe from the ceph-users email list: mailto:ceph-users-leave@lists.ceph.com +.. _OS recommendations: ../../../start/os-recommendations +.. _ceph-devel: ceph-devel@vger.kernel.org diff --git a/doc/rados/troubleshooting/troubleshooting-pg.rst b/doc/rados/troubleshooting/troubleshooting-pg.rst new file mode 100644 index 000000000..f5e5054ba --- /dev/null +++ b/doc/rados/troubleshooting/troubleshooting-pg.rst @@ -0,0 +1,693 @@ +===================== + Troubleshooting PGs +===================== + +Placement Groups Never Get Clean +================================ + +When you create a cluster and your cluster remains in ``active``, +``active+remapped`` or ``active+degraded`` status and never achieves an +``active+clean`` status, you likely have a problem with your configuration. + +You may need to review settings in the `Pool, PG and CRUSH Config Reference`_ +and make appropriate adjustments. + +As a general rule, you should run your cluster with more than one OSD and a +pool size greater than 1 object replica. + +.. _one-node-cluster: + +One Node Cluster +---------------- + +Ceph no longer provides documentation for operating on a single node, because +you would never deploy a system designed for distributed computing on a single +node. Additionally, mounting client kernel modules on a single node containing a +Ceph daemon may cause a deadlock due to issues with the Linux kernel itself +(unless you use VMs for the clients). You can experiment with Ceph in a 1-node +configuration, in spite of the limitations as described herein. 
+ +If you are trying to create a cluster on a single node, you must change the +default of the ``osd crush chooseleaf type`` setting from ``1`` (meaning +``host`` or ``node``) to ``0`` (meaning ``osd``) in your Ceph configuration +file before you create your monitors and OSDs. This tells Ceph that an OSD +can peer with another OSD on the same host. If you are trying to set up a +1-node cluster and ``osd crush chooseleaf type`` is greater than ``0``, +Ceph will try to peer the PGs of one OSD with the PGs of another OSD on +another node, chassis, rack, row, or even datacenter depending on the setting. + +.. tip:: DO NOT mount kernel clients directly on the same node as your + Ceph Storage Cluster, because kernel conflicts can arise. However, you + can mount kernel clients within virtual machines (VMs) on a single node. + +If you are creating OSDs using a single disk, you must create directories +for the data manually first. + + +Fewer OSDs than Replicas +------------------------ + +If you have brought up two OSDs to an ``up`` and ``in`` state, but you still +don't see ``active + clean`` placement groups, you may have an +``osd pool default size`` set to greater than ``2``. + +There are a few ways to address this situation. If you want to operate your +cluster in an ``active + degraded`` state with two replicas, you can set the +``osd pool default min size`` to ``2`` so that you can write objects in +an ``active + degraded`` state. You may also set the ``osd pool default size`` +setting to ``2`` so that you only have two stored replicas (the original and +one replica), in which case the cluster should achieve an ``active + clean`` +state. + +.. note:: You can make the changes at runtime. If you make the changes in + your Ceph configuration file, you may need to restart your cluster. + + +Pool Size = 1 +------------- + +If you have the ``osd pool default size`` set to ``1``, you will only have +one copy of the object. 
OSDs rely on other OSDs to tell them which objects +they should have. If a first OSD has a copy of an object and there is no +second copy, then no second OSD can tell the first OSD that it should have +that copy. For each placement group mapped to the first OSD (see +``ceph pg dump``), you can force the first OSD to notice the placement groups +it needs by running:: + + ceph osd force-create-pg <pgid> + + +CRUSH Map Errors +---------------- + +Another candidate for placement groups remaining unclean involves errors +in your CRUSH map. + + +Stuck Placement Groups +====================== + +It is normal for placement groups to enter states like "degraded" or "peering" +following a failure. Normally these states indicate the normal progression +through the failure recovery process. However, if a placement group stays in one +of these states for a long time this may be an indication of a larger problem. +For this reason, the monitor will warn when placement groups get "stuck" in a +non-optimal state. Specifically, we check for: + +* ``inactive`` - The placement group has not been ``active`` for too long + (i.e., it hasn't been able to service read/write requests). + +* ``unclean`` - The placement group has not been ``clean`` for too long + (i.e., it hasn't been able to completely recover from a previous failure). + +* ``stale`` - The placement group status has not been updated by a ``ceph-osd``, + indicating that all nodes storing this placement group may be ``down``. + +You can explicitly list stuck placement groups with one of:: + + ceph pg dump_stuck stale + ceph pg dump_stuck inactive + ceph pg dump_stuck unclean + +For stuck ``stale`` placement groups, it is normally a matter of getting the +right ``ceph-osd`` daemons running again. For stuck ``inactive`` placement +groups, it is usually a peering problem (see :ref:`failures-osd-peering`). 
For +stuck ``unclean`` placement groups, there is usually something preventing +recovery from completing, like unfound objects (see +:ref:`failures-osd-unfound`); + + + +.. _failures-osd-peering: + +Placement Group Down - Peering Failure +====================================== + +In certain cases, the ``ceph-osd`` `Peering` process can run into +problems, preventing a PG from becoming active and usable. For +example, ``ceph health`` might report:: + + ceph health detail + HEALTH_ERR 7 pgs degraded; 12 pgs down; 12 pgs peering; 1 pgs recovering; 6 pgs stuck unclean; 114/3300 degraded (3.455%); 1/3 in osds are down + ... + pg 0.5 is down+peering + pg 1.4 is down+peering + ... + osd.1 is down since epoch 69, last address 192.168.106.220:6801/8651 + +We can query the cluster to determine exactly why the PG is marked ``down`` with:: + + ceph pg 0.5 query + +.. code-block:: javascript + + { "state": "down+peering", + ... + "recovery_state": [ + { "name": "Started\/Primary\/Peering\/GetInfo", + "enter_time": "2012-03-06 14:40:16.169679", + "requested_info_from": []}, + { "name": "Started\/Primary\/Peering", + "enter_time": "2012-03-06 14:40:16.169659", + "probing_osds": [ + 0, + 1], + "blocked": "peering is blocked due to down osds", + "down_osds_we_would_probe": [ + 1], + "peering_blocked_by": [ + { "osd": 1, + "current_lost_at": 0, + "comment": "starting or marking this osd lost may let us proceed"}]}, + { "name": "Started", + "enter_time": "2012-03-06 14:40:16.169513"} + ] + } + +The ``recovery_state`` section tells us that peering is blocked due to +down ``ceph-osd`` daemons, specifically ``osd.1``. In this case, we can start that ``ceph-osd`` +and things will recover. + +Alternatively, if there is a catastrophic failure of ``osd.1`` (e.g., disk +failure), we can tell the cluster that it is ``lost`` and to cope as +best it can. + +.. important:: This is dangerous in that the cluster cannot + guarantee that the other copies of the data are consistent + and up to date. 
+ +To instruct Ceph to continue anyway:: + + ceph osd lost 1 + +Recovery will proceed. + + +.. _failures-osd-unfound: + +Unfound Objects +=============== + +Under certain combinations of failures Ceph may complain about +``unfound`` objects:: + + ceph health detail + HEALTH_WARN 1 pgs degraded; 78/3778 unfound (2.065%) + pg 2.4 is active+degraded, 78 unfound + +This means that the storage cluster knows that some objects (or newer +copies of existing objects) exist, but it hasn't found copies of them. +One example of how this might come about for a PG whose data is on ceph-osds +1 and 2: + +* 1 goes down +* 2 handles some writes, alone +* 1 comes up +* 1 and 2 repeer, and the objects missing on 1 are queued for recovery. +* Before the new objects are copied, 2 goes down. + +Now 1 knows that these objects exist, but there is no live ``ceph-osd`` that +has a copy. In this case, IO to those objects will block, and the +cluster will hope that the failed node comes back soon; this is +assumed to be preferable to returning an IO error to the user. + +First, you can identify which objects are unfound with:: + + ceph pg 2.4 list_unfound [starting offset, in json] + +.. code-block:: javascript + + { + "num_missing": 1, + "num_unfound": 1, + "objects": [ + { + "oid": { + "oid": "object", + "key": "", + "snapid": -2, + "hash": 2249616407, + "max": 0, + "pool": 2, + "namespace": "" + }, + "need": "43'251", + "have": "0'0", + "flags": "none", + "clean_regions": "clean_offsets: [], clean_omap: 0, new_object: 1", + "locations": [ + "0(3)", + "4(2)" + ] + } + ], + "state": "NotRecovering", + "available_might_have_unfound": true, + "might_have_unfound": [ + { + "osd": "2(4)", + "status": "osd is down" + } + ], + "more": false + } + +If there are too many objects to list in a single result, the ``more`` +field will be true and you can query for more. (Eventually the +command line tool will hide this from you, but not yet.)
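The ``more`` field and the object list can also be inspected programmatically. The following is a minimal Python sketch, assuming the JSON printed by ``ceph pg <pgid> list_unfound`` has already been captured into a string and uses the field names shown in the sample output; the helper name ``summarize_unfound`` is hypothetical and is not part of any Ceph tool.

```python
import json


def summarize_unfound(raw):
    """Summarize captured list_unfound JSON (field names as in the sample output)."""
    report = json.loads(raw)
    # Each entry nests the object name under "oid" -> "oid".
    names = [entry["oid"]["oid"] for entry in report.get("objects", [])]
    return {
        "num_unfound": report.get("num_unfound", 0),
        "object_names": names,
        # True means the listing was truncated and another query is needed.
        "more": bool(report.get("more", False)),
    }


# Trimmed-down copy of the sample output, for illustration only.
sample = (
    '{"num_missing": 1, "num_unfound": 1, '
    '"objects": [{"oid": {"oid": "object", "pool": 2}}], '
    '"more": false}'
)
print(summarize_unfound(sample))
```

A script built on this idea would keep querying while ``more`` is true; how the starting offset is passed is left to the command's own documentation.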
+ +Second, you can identify which OSDs have been probed or might contain +data. + +At the end of the listing (before ``more`` is false), ``might_have_unfound`` is provided +when ``available_might_have_unfound`` is true. This is equivalent to the output +of ``ceph pg #.# query``. This eliminates the need to use ``query`` directly. +The ``might_have_unfound`` information given behaves the same way as described below for ``query``. +The only difference is that OSDs that have ``already probed`` status are ignored. + +Use of ``query``:: + + ceph pg 2.4 query + +.. code-block:: javascript + + "recovery_state": [ + { "name": "Started\/Primary\/Active", + "enter_time": "2012-03-06 15:15:46.713212", + "might_have_unfound": [ + { "osd": 1, + "status": "osd is down"}]}, + +In this case, for example, the cluster knows that ``osd.1`` might have +data, but it is ``down``. The full range of possible states includes: + +* already probed +* querying +* OSD is down +* not queried (yet) + +Sometimes it simply takes some time for the cluster to query possible +locations. + +It is possible that there are other locations where the object can +exist that are not listed. For example, if a ceph-osd is stopped and +taken out of the cluster, the cluster fully recovers, and due to some +future set of failures ends up with an unfound object, it won't +consider the long-departed ceph-osd as a potential location. +(This scenario, however, is unlikely.) + +If all possible locations have been queried and objects are still +lost, you may have to give up on the lost objects. This, again, is +possible given unusual combinations of failures that allow the cluster +to learn about writes that were performed before the writes themselves +are recovered. To mark the "unfound" objects as "lost":: + + ceph pg 2.5 mark_unfound_lost revert|delete + +The final argument specifies how the cluster should deal with +lost objects. + +The "delete" option will forget about them entirely.
+ +The "revert" option (not available for erasure coded pools) will +either roll back to a previous version of the object or (if it was a +new object) forget about it entirely. Use this with caution, as it +may confuse applications that expected the object to exist. + + +Homeless Placement Groups +========================= + +It is possible for all OSDs that had copies of a given placement group to fail. +If that's the case, that subset of the object store is unavailable, and the +monitor will receive no status updates for those placement groups. To detect +this situation, the monitor marks any placement group whose primary OSD has +failed as ``stale``. For example:: + + ceph health + HEALTH_WARN 24 pgs stale; 3/300 in osds are down + +You can identify which placement groups are ``stale``, and what the last OSDs to +store them were, with:: + + ceph health detail + HEALTH_WARN 24 pgs stale; 3/300 in osds are down + ... + pg 2.5 is stuck stale+active+remapped, last acting [2,0] + ... + osd.10 is down since epoch 23, last address 192.168.106.220:6800/11080 + osd.11 is down since epoch 13, last address 192.168.106.220:6803/11539 + osd.12 is down since epoch 24, last address 192.168.106.220:6806/11861 + +If we want to get placement group 2.5 back online, for example, this tells us that +it was last managed by ``osd.0`` and ``osd.2``. Restarting those ``ceph-osd`` +daemons will allow the cluster to recover that placement group (and, presumably, +many others). + + +Only a Few OSDs Receive Data +============================ + +If you have many nodes in your cluster and only a few of them receive data, +`check`_ the number of placement groups in your pool. Since placement groups get +mapped to OSDs, a small number of placement groups will not distribute across +your cluster. Try creating a pool with a placement group count that is a +multiple of the number of OSDs. See `Placement Groups`_ for details.
The default +placement group count for pools is not useful, but you can change it `here`_. + + +Can't Write Data +================ + +If your cluster is up, but some OSDs are down and you cannot write data, +check to ensure that you have the minimum number of OSDs running for the +placement group. If you don't have the minimum number of OSDs running, +Ceph will not allow you to write data because there is no guarantee +that Ceph can replicate your data. See ``osd pool default min size`` +in the `Pool, PG and CRUSH Config Reference`_ for details. + + +PGs Inconsistent +================ + +If you receive an ``active + clean + inconsistent`` state, this may happen +due to an error during scrubbing. As always, we can identify the inconsistent +placement group(s) with:: + + $ ceph health detail + HEALTH_ERR 1 pgs inconsistent; 2 scrub errors + pg 0.6 is active+clean+inconsistent, acting [0,1,2] + 2 scrub errors + +Or if you prefer inspecting the output in a programmatic way:: + + $ rados list-inconsistent-pg rbd + ["0.6"] + +There is only one consistent state, but in the worst case, we could have +different inconsistencies in multiple perspectives found in more than one +object. If an object named ``foo`` in PG ``0.6`` is truncated, we will have:: + + $ rados list-inconsistent-obj 0.6 --format=json-pretty + +..
code-block:: javascript + + { + "epoch": 14, + "inconsistents": [ + { + "object": { + "name": "foo", + "nspace": "", + "locator": "", + "snap": "head", + "version": 1 + }, + "errors": [ + "data_digest_mismatch", + "size_mismatch" + ], + "union_shard_errors": [ + "data_digest_mismatch_info", + "size_mismatch_info" + ], + "selected_object_info": "0:602f83fe:::foo:head(16'1 client.4110.0:1 dirty|data_digest|omap_digest s 968 uv 1 dd e978e67f od ffffffff alloc_hint [0 0 0])", + "shards": [ + { + "osd": 0, + "errors": [], + "size": 968, + "omap_digest": "0xffffffff", + "data_digest": "0xe978e67f" + }, + { + "osd": 1, + "errors": [], + "size": 968, + "omap_digest": "0xffffffff", + "data_digest": "0xe978e67f" + }, + { + "osd": 2, + "errors": [ + "data_digest_mismatch_info", + "size_mismatch_info" + ], + "size": 0, + "omap_digest": "0xffffffff", + "data_digest": "0xffffffff" + } + ] + } + ] + } + +In this case, we can learn from the output: + +* The only inconsistent object is named ``foo``, and it is its head that has + inconsistencies. +* The inconsistencies fall into two categories: + + * ``errors``: these errors indicate inconsistencies between shards without a + determination of which shard(s) are bad. Check for the ``errors`` in the + `shards` array, if available, to pinpoint the problem. + + * ``data_digest_mismatch``: the digest of the replica read from OSD.2 is + different from the ones of OSD.0 and OSD.1 + * ``size_mismatch``: the size of the replica read from OSD.2 is 0, while + the size reported by OSD.0 and OSD.1 is 968. + * ``union_shard_errors``: the union of all shard specific ``errors`` in + ``shards`` array. The ``errors`` are set for the given shard that has the + problem. They include errors like ``read_error``. The ``errors`` ending in + ``oi`` indicate a comparison with ``selected_object_info``. Look at the + ``shards`` array to determine which shard has which error(s). 
+ + * ``data_digest_mismatch_info``: the digest stored in the object-info is not + ``0xffffffff``, which is calculated from the shard read from OSD.2 + * ``size_mismatch_info``: the size stored in the object-info is different + from the one read from OSD.2. The latter is 0. + +You can repair the inconsistent placement group by executing:: + + ceph pg repair {placement-group-ID} + +This command overwrites the `bad` copies with the `authoritative` ones. In most cases, +Ceph is able to choose authoritative copies from all available replicas using +some predefined criteria. But this does not always work. For example, the stored +data digest could be missing, and the calculated digest will be ignored when +choosing the authoritative copies. So, please use the above command with caution. + +If ``read_error`` is listed in the ``errors`` attribute of a shard, the +inconsistency is likely due to disk errors. You might want to check the disk +used by that OSD. + +If you receive ``active + clean + inconsistent`` states periodically due to +clock skew, you may consider configuring your `NTP`_ daemons on your +monitor hosts to act as peers. See `The Network Time Protocol`_ and Ceph +`Clock Settings`_ for additional details. + + +Erasure Coded PGs are not active+clean +====================================== + +When CRUSH fails to find enough OSDs to map to a PG, it will show as a +``2147483647`` which is ITEM_NONE or ``no OSD found``. For instance:: + + [2,1,6,0,5,8,2147483647,7,4] + +Not enough OSDs +--------------- + +If the Ceph cluster only has 8 OSDs and the erasure coded pool needs +9, that is what it will show. You can either create another erasure +coded pool that requires fewer OSDs:: + + ceph osd erasure-code-profile set myprofile k=5 m=3 + ceph osd pool create erasurepool erasure myprofile + +or add new OSDs, and the PG will automatically use them.
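The ``2147483647`` placeholder and the OSD count of a profile can be checked with a few lines of Python. This is only an illustrative sketch: the helper names ``unmapped_slots`` and ``required_osds`` are hypothetical, and the second helper relies on the fact that an erasure coded PG stores ``k`` data chunks plus ``m`` coding chunks, each on a different OSD.

```python
ITEM_NONE = 2147483647  # value shown in a PG mapping when CRUSH found no OSD


def unmapped_slots(mapping):
    """Return the positions in a PG mapping that CRUSH could not fill."""
    return [index for index, osd in enumerate(mapping) if osd == ITEM_NONE]


def required_osds(k, m):
    """Number of distinct OSDs an erasure coded PG needs (k data + m coding chunks)."""
    return k + m


# The mapping from the example above: one slot has no OSD assigned.
mapping = [2, 1, 6, 0, 5, 8, 2147483647, 7, 4]
print("unmapped positions:", unmapped_slots(mapping))

# The k=5 m=3 profile from the example fits on a cluster with 8 OSDs.
print("OSDs needed for k=5, m=3:", required_osds(5, 3))
```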
+ +CRUSH constraints cannot be satisfied +------------------------------------- + +If the cluster has enough OSDs, it is possible that the CRUSH rule +imposes constraints that cannot be satisfied. If there are 10 OSDs on +two hosts and the CRUSH rule requires that no two OSDs from the +same host are used in the same PG, the mapping may fail because only +two OSDs will be found. You can check the constraint by displaying ("dumping") +the rule:: + + $ ceph osd crush rule ls + [ + "replicated_rule", + "erasurepool"] + $ ceph osd crush rule dump erasurepool + { "rule_id": 1, + "rule_name": "erasurepool", + "ruleset": 1, + "type": 3, + "min_size": 3, + "max_size": 20, + "steps": [ + { "op": "take", + "item": -1, + "item_name": "default"}, + { "op": "chooseleaf_indep", + "num": 0, + "type": "host"}, + { "op": "emit"}]} + + +You can resolve the problem by creating a new pool in which PGs are allowed +to have OSDs residing on the same host with:: + + ceph osd erasure-code-profile set myprofile crush-failure-domain=osd + ceph osd pool create erasurepool erasure myprofile + +CRUSH gives up too soon +----------------------- + +If the Ceph cluster has just enough OSDs to map the PG (for instance a +cluster with a total of 9 OSDs and an erasure coded pool that requires +9 OSDs per PG), it is possible that CRUSH gives up before finding a +mapping. It can be resolved by: + +* lowering the erasure coded pool requirements to use fewer OSDs per PG + (that requires the creation of another pool as erasure code profiles + cannot be dynamically modified). + +* adding more OSDs to the cluster (that does not require the erasure + coded pool to be modified, it will become clean automatically) + +* using a handmade CRUSH rule that tries more times to find a good + mapping. This can be done by setting ``set_choose_tries`` to a value + greater than the default.
+ +You should first verify the problem with ``crushtool`` after +extracting the crushmap from the cluster so your experiments do not +modify the Ceph cluster and only work on local files:: + + $ ceph osd crush rule dump erasurepool + { "rule_name": "erasurepool", + "ruleset": 1, + "type": 3, + "min_size": 3, + "max_size": 20, + "steps": [ + { "op": "take", + "item": -1, + "item_name": "default"}, + { "op": "chooseleaf_indep", + "num": 0, + "type": "host"}, + { "op": "emit"}]} + $ ceph osd getcrushmap > crush.map + got crush map from osdmap epoch 13 + $ crushtool -i crush.map --test --show-bad-mappings \ + --rule 1 \ + --num-rep 9 \ + --min-x 1 --max-x $((1024 * 1024)) + bad mapping rule 8 x 43 num_rep 9 result [3,2,7,1,2147483647,8,5,6,0] + bad mapping rule 8 x 79 num_rep 9 result [6,0,2,1,4,7,2147483647,5,8] + bad mapping rule 8 x 173 num_rep 9 result [0,4,6,8,2,1,3,7,2147483647] + +Here ``--num-rep`` is the number of OSDs the erasure code CRUSH +rule needs, and ``--rule`` is the value of the ``ruleset`` field +displayed by ``ceph osd crush rule dump``. The test will try mapping +one million values (i.e. the range defined by ``[--min-x,--max-x]``) +and must display at least one bad mapping. If it outputs nothing it +means all mappings are successful and you can stop right there: the +problem is elsewhere.
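When the test prints many lines, it can help to tally them in a script. The sketch below is a hedged Python example that parses ``bad mapping`` lines in the textual form shown above; it assumes that line format stays as in the sample, and the function name ``parse_bad_mappings`` is hypothetical rather than part of ``crushtool``.

```python
import re

ITEM_NONE = 2147483647  # placeholder for a slot CRUSH could not fill

# Matches lines such as:
#   bad mapping rule 8 x 43 num_rep 9 result [3,2,7,1,2147483647,8,5,6,0]
BAD_MAPPING = re.compile(
    r"bad mapping rule (\d+) x (\d+) num_rep (\d+) result \[([\d,]+)\]"
)


def parse_bad_mappings(text):
    """Return one dict per 'bad mapping' line found in captured crushtool output."""
    mappings = []
    for line in text.splitlines():
        match = BAD_MAPPING.search(line)
        if match is None:
            continue  # skip prompts, blank lines, and other output
        result = [int(value) for value in match.group(4).split(",")]
        mappings.append({
            "rule": int(match.group(1)),
            "x": int(match.group(2)),
            "num_rep": int(match.group(3)),
            "missing": result.count(ITEM_NONE),  # slots left without an OSD
        })
    return mappings


# The three lines from the sample output above.
sample = """\
bad mapping rule 8 x 43 num_rep 9 result [3,2,7,1,2147483647,8,5,6,0]
bad mapping rule 8 x 79 num_rep 9 result [6,0,2,1,4,7,2147483647,5,8]
bad mapping rule 8 x 173 num_rep 9 result [0,4,6,8,2,1,3,7,2147483647]
"""
for entry in parse_bad_mappings(sample):
    print(entry)
```

An empty list from such a parser corresponds to the "outputs nothing" case described above, where all mappings succeeded.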
+ +The CRUSH rule can be edited by decompiling the crush map:: + + $ crushtool --decompile crush.map > crush.txt + +and adding the following line to the rule:: + + step set_choose_tries 100 + +The relevant part of the ``crush.txt`` file should look something +like:: + + rule erasurepool { + ruleset 1 + type erasure + min_size 3 + max_size 20 + step set_chooseleaf_tries 5 + step set_choose_tries 100 + step take default + step chooseleaf indep 0 type host + step emit + } + +It can then be compiled and tested again:: + + $ crushtool --compile crush.txt -o better-crush.map + +When all mappings succeed, a histogram of the number of tries that +were necessary to find all of them can be displayed with the +``--show-choose-tries`` option of ``crushtool``:: + + $ crushtool -i better-crush.map --test --show-bad-mappings \ + --show-choose-tries \ + --rule 1 \ + --num-rep 9 \ + --min-x 1 --max-x $((1024 * 1024)) + ... + 11: 42 + 12: 44 + 13: 54 + 14: 45 + 15: 35 + 16: 34 + 17: 30 + 18: 25 + 19: 19 + 20: 22 + 21: 20 + 22: 17 + 23: 13 + 24: 16 + 25: 13 + 26: 11 + 27: 11 + 28: 13 + 29: 11 + 30: 10 + 31: 6 + 32: 5 + 33: 10 + 34: 3 + 35: 7 + 36: 5 + 37: 2 + 38: 5 + 39: 5 + 40: 2 + 41: 5 + 42: 4 + 43: 1 + 44: 2 + 45: 2 + 46: 3 + 47: 1 + 48: 0 + ... + 102: 0 + 103: 1 + 104: 0 + ... + +It took 11 tries to map 42 PGs, 12 tries to map 44 PGs, etc. The highest number of tries is the minimum value of ``set_choose_tries`` that prevents bad mappings (i.e. 103 in the above output because it did not take more than 103 tries for any PG to be mapped). + +.. _check: ../../operations/placement-groups#get-the-number-of-placement-groups +.. _here: ../../configuration/pool-pg-config-ref +.. _Placement Groups: ../../operations/placement-groups +.. _Pool, PG and CRUSH Config Reference: ../../configuration/pool-pg-config-ref +.. _NTP: https://en.wikipedia.org/wiki/Network_Time_Protocol +.. _The Network Time Protocol: http://www.ntp.org/ +..
_Clock Settings: ../../configuration/mon-config-ref/#clock + +