From e6918187568dbd01842d8d1d2c808ce16a894239 Mon Sep 17 00:00:00 2001 From: Daniel Baumann Date: Sun, 21 Apr 2024 13:54:28 +0200 Subject: Adding upstream version 18.2.2. Signed-off-by: Daniel Baumann --- doc/rados/api/index.rst | 25 + doc/rados/api/libcephsqlite.rst | 454 ++++++ doc/rados/api/librados-intro.rst | 1051 +++++++++++++ doc/rados/api/librados.rst | 187 +++ doc/rados/api/libradospp.rst | 9 + doc/rados/api/objclass-sdk.rst | 39 + doc/rados/api/python.rst | 428 ++++++ doc/rados/command/list-inconsistent-obj.json | 237 +++ doc/rados/command/list-inconsistent-snap.json | 86 ++ doc/rados/configuration/auth-config-ref.rst | 379 +++++ doc/rados/configuration/bluestore-config-ref.rst | 552 +++++++ doc/rados/configuration/ceph-conf.rst | 715 +++++++++ doc/rados/configuration/common.rst | 207 +++ doc/rados/configuration/demo-ceph.conf | 31 + doc/rados/configuration/filestore-config-ref.rst | 377 +++++ doc/rados/configuration/general-config-ref.rst | 19 + doc/rados/configuration/index.rst | 53 + doc/rados/configuration/journal-ref.rst | 39 + doc/rados/configuration/mclock-config-ref.rst | 699 +++++++++ doc/rados/configuration/mon-config-ref.rst | 642 ++++++++ doc/rados/configuration/mon-lookup-dns.rst | 58 + doc/rados/configuration/mon-osd-interaction.rst | 245 ++++ doc/rados/configuration/msgr2.rst | 257 ++++ doc/rados/configuration/network-config-ref.rst | 355 +++++ doc/rados/configuration/osd-config-ref.rst | 445 ++++++ doc/rados/configuration/pool-pg-config-ref.rst | 46 + doc/rados/configuration/pool-pg.conf | 21 + doc/rados/configuration/storage-devices.rst | 93 ++ doc/rados/index.rst | 81 ++ doc/rados/man/index.rst | 32 + doc/rados/operations/add-or-rm-mons.rst | 458 ++++++ doc/rados/operations/add-or-rm-osds.rst | 419 ++++++ doc/rados/operations/balancer.rst | 221 +++ doc/rados/operations/bluestore-migration.rst | 357 +++++ doc/rados/operations/cache-tiering.rst | 557 +++++++ doc/rados/operations/change-mon-elections.rst | 100 ++ doc/rados/operations/control.rst | 665 +++++++++ doc/rados/operations/crush-map-edits.rst | 746 ++++++++++ doc/rados/operations/crush-map.rst | 1147 +++++++++++++++ doc/rados/operations/data-placement.rst | 47 + doc/rados/operations/devices.rst | 227 +++ doc/rados/operations/erasure-code-clay.rst | 240 +++ doc/rados/operations/erasure-code-isa.rst | 107 ++ doc/rados/operations/erasure-code-jerasure.rst | 123 ++ doc/rados/operations/erasure-code-lrc.rst | 388 +++++ doc/rados/operations/erasure-code-profile.rst | 128 ++ doc/rados/operations/erasure-code-shec.rst | 145 ++ doc/rados/operations/erasure-code.rst | 272 ++++ doc/rados/operations/health-checks.rst | 1619 +++++++++++++++++++++ doc/rados/operations/index.rst | 99 ++ doc/rados/operations/monitoring-osd-pg.rst | 556 +++++++ doc/rados/operations/monitoring.rst | 644 ++++++++ doc/rados/operations/operating.rst | 174 +++ doc/rados/operations/pg-concepts.rst | 104 ++ doc/rados/operations/pg-repair.rst | 118 ++ doc/rados/operations/pg-states.rst | 118 ++ doc/rados/operations/placement-groups.rst | 897 ++++++++++++ doc/rados/operations/pools.rst | 751 ++++++++++ doc/rados/operations/read-balancer.rst | 64 + doc/rados/operations/stretch-mode.rst | 262 ++++ doc/rados/operations/upmap.rst | 113 ++ doc/rados/operations/user-management.rst | 840 +++++++++++ doc/rados/troubleshooting/community.rst | 37 + doc/rados/troubleshooting/cpu-profiling.rst | 80 + doc/rados/troubleshooting/index.rst | 19 + doc/rados/troubleshooting/log-and-debug.rst | 430 ++++++ doc/rados/troubleshooting/memory-profiling.rst | 203 +++ 
doc/rados/troubleshooting/troubleshooting-mon.rst | 713 +++++++++ doc/rados/troubleshooting/troubleshooting-osd.rst | 787 ++++++++++ doc/rados/troubleshooting/troubleshooting-pg.rst | 782 ++++++++++ 70 files changed, 23619 insertions(+) create mode 100644 doc/rados/api/index.rst create mode 100644 doc/rados/api/libcephsqlite.rst create mode 100644 doc/rados/api/librados-intro.rst create mode 100644 doc/rados/api/librados.rst create mode 100644 doc/rados/api/libradospp.rst create mode 100644 doc/rados/api/objclass-sdk.rst create mode 100644 doc/rados/api/python.rst create mode 100644 doc/rados/command/list-inconsistent-obj.json create mode 100644 doc/rados/command/list-inconsistent-snap.json create mode 100644 doc/rados/configuration/auth-config-ref.rst create mode 100644 doc/rados/configuration/bluestore-config-ref.rst create mode 100644 doc/rados/configuration/ceph-conf.rst create mode 100644 doc/rados/configuration/common.rst create mode 100644 doc/rados/configuration/demo-ceph.conf create mode 100644 doc/rados/configuration/filestore-config-ref.rst create mode 100644 doc/rados/configuration/general-config-ref.rst create mode 100644 doc/rados/configuration/index.rst create mode 100644 doc/rados/configuration/journal-ref.rst create mode 100644 doc/rados/configuration/mclock-config-ref.rst create mode 100644 doc/rados/configuration/mon-config-ref.rst create mode 100644 doc/rados/configuration/mon-lookup-dns.rst create mode 100644 doc/rados/configuration/mon-osd-interaction.rst create mode 100644 doc/rados/configuration/msgr2.rst create mode 100644 doc/rados/configuration/network-config-ref.rst create mode 100644 doc/rados/configuration/osd-config-ref.rst create mode 100644 doc/rados/configuration/pool-pg-config-ref.rst create mode 100644 doc/rados/configuration/pool-pg.conf create mode 100644 doc/rados/configuration/storage-devices.rst create mode 100644 doc/rados/index.rst create mode 100644 doc/rados/man/index.rst create mode 100644 doc/rados/operations/add-or-rm-mons.rst create mode 100644 doc/rados/operations/add-or-rm-osds.rst create mode 100644 doc/rados/operations/balancer.rst create mode 100644 doc/rados/operations/bluestore-migration.rst create mode 100644 doc/rados/operations/cache-tiering.rst create mode 100644 doc/rados/operations/change-mon-elections.rst create mode 100644 doc/rados/operations/control.rst create mode 100644 doc/rados/operations/crush-map-edits.rst create mode 100644 doc/rados/operations/crush-map.rst create mode 100644 doc/rados/operations/data-placement.rst create mode 100644 doc/rados/operations/devices.rst create mode 100644 doc/rados/operations/erasure-code-clay.rst create mode 100644 doc/rados/operations/erasure-code-isa.rst create mode 100644 doc/rados/operations/erasure-code-jerasure.rst create mode 100644 doc/rados/operations/erasure-code-lrc.rst create mode 100644 doc/rados/operations/erasure-code-profile.rst create mode 100644 doc/rados/operations/erasure-code-shec.rst create mode 100644 doc/rados/operations/erasure-code.rst create mode 100644 doc/rados/operations/health-checks.rst create mode 100644 doc/rados/operations/index.rst create mode 100644 doc/rados/operations/monitoring-osd-pg.rst create mode 100644 doc/rados/operations/monitoring.rst create mode 100644 doc/rados/operations/operating.rst create mode 100644 doc/rados/operations/pg-concepts.rst create mode 100644 doc/rados/operations/pg-repair.rst create mode 100644 doc/rados/operations/pg-states.rst create mode 100644 doc/rados/operations/placement-groups.rst create mode 100644 
doc/rados/operations/pools.rst create mode 100644 doc/rados/operations/read-balancer.rst create mode 100644 doc/rados/operations/stretch-mode.rst create mode 100644 doc/rados/operations/upmap.rst create mode 100644 doc/rados/operations/user-management.rst create mode 100644 doc/rados/troubleshooting/community.rst create mode 100644 doc/rados/troubleshooting/cpu-profiling.rst create mode 100644 doc/rados/troubleshooting/index.rst create mode 100644 doc/rados/troubleshooting/log-and-debug.rst create mode 100644 doc/rados/troubleshooting/memory-profiling.rst create mode 100644 doc/rados/troubleshooting/troubleshooting-mon.rst create mode 100644 doc/rados/troubleshooting/troubleshooting-osd.rst create mode 100644 doc/rados/troubleshooting/troubleshooting-pg.rst (limited to 'doc/rados') diff --git a/doc/rados/api/index.rst b/doc/rados/api/index.rst new file mode 100644 index 000000000..5422ce871 --- /dev/null +++ b/doc/rados/api/index.rst @@ -0,0 +1,25 @@ +.. _rados api: + +=========================== + Ceph Storage Cluster APIs +=========================== + +The :term:`Ceph Storage Cluster` has a messaging layer protocol that enables +clients to interact with a :term:`Ceph Monitor` and a :term:`Ceph OSD Daemon`. +``librados`` provides this functionality to :term:`Ceph Client`\s in the form of +a library. All Ceph Clients either use ``librados`` or the same functionality +encapsulated in ``librados`` to interact with the object store. For example, +``librbd`` and ``libcephfs`` leverage this functionality. You may use +``librados`` to interact with Ceph directly (e.g., an application that talks to +Ceph, your own interface to Ceph, etc.). + + +.. toctree:: + :maxdepth: 2 + + Introduction to librados + librados (C) + librados (C++) + librados (Python) + libcephsqlite (SQLite) + object class diff --git a/doc/rados/api/libcephsqlite.rst b/doc/rados/api/libcephsqlite.rst new file mode 100644 index 000000000..beee4a466 --- /dev/null +++ b/doc/rados/api/libcephsqlite.rst @@ -0,0 +1,454 @@ +.. _libcephsqlite: + +================ + Ceph SQLite VFS +================ + +This `SQLite VFS`_ may be used for storing and accessing a `SQLite`_ database +backed by RADOS. This allows you to fully decentralize your database using +Ceph's object store for improved availability, accessibility, and use of +storage. + +Note what this is not: a distributed SQL engine. SQLite on RADOS can be thought +of like RBD as compared to CephFS: RBD puts a disk image on RADOS for the +purposes of exclusive access by a machine and generally does not allow parallel +access by other machines; on the other hand, CephFS allows fully distributed +access to a file system from many client mounts. SQLite on RADOS is meant to be +accessed by a single SQLite client database connection at a given time. The +database may be manipulated safely by multiple clients only in a serial fashion +controlled by RADOS locks managed by the Ceph SQLite VFS. + + +Usage +^^^^^ + +Normal unmodified applications (including the sqlite command-line toolset +binary) may load the *ceph* VFS using the `SQLite Extension Loading API`_. + +.. code:: sql + + .LOAD libcephsqlite.so + +or during the invocation of ``sqlite3`` + +.. code:: sh + + sqlite3 -cmd '.load libcephsqlite.so' + +A database file is formatted as a SQLite URI:: + + file:///<"*"poolid|poolname>:[namespace]/?vfs=ceph + +The RADOS ``namespace`` is optional. Note the triple ``///`` in the path. The URI +authority must be empty or localhost in SQLite. Only the path part of the URI +is parsed. 
For this reason, the URI will not parse properly if you only use two +``//``. + +A complete example of (optionally) creating a database and opening: + +.. code:: sh + + sqlite3 -cmd '.load libcephsqlite.so' -cmd '.open file:///foo:bar/baz.db?vfs=ceph' + +Note you cannot specify the database file as the normal positional argument to +``sqlite3``. This is because the ``.load libcephsqlite.so`` command is applied +after opening the database, but opening the database depends on the extension +being loaded first. + +An example passing the pool integer id and no RADOS namespace: + +.. code:: sh + + sqlite3 -cmd '.load libcephsqlite.so' -cmd '.open file:///*2:/baz.db?vfs=ceph' + +Like other Ceph tools, the *ceph* VFS looks at some environment variables that +help with configuring which Ceph cluster to communicate with and which +credential to use. Here would be a typical configuration: + +.. code:: sh + + export CEPH_CONF=/path/to/ceph.conf + export CEPH_KEYRING=/path/to/ceph.keyring + export CEPH_ARGS='--id myclientid' + ./runmyapp + # or + sqlite3 -cmd '.load libcephsqlite.so' -cmd '.open file:///foo:bar/baz.db?vfs=ceph' + +The default operation would look at the standard Ceph configuration file path +using the ``client.admin`` user. + + +User +^^^^ + +The *ceph* VFS requires a user credential with read access to the monitors, the +ability to blocklist dead clients of the database, and access to the OSDs +hosting the database. This can be done with authorizations as simply as: + +.. code:: sh + + ceph auth get-or-create client.X mon 'allow r, allow command "osd blocklist" with blocklistop=add' osd 'allow rwx' + +.. note:: The terminology change from ``blacklist`` to ``blocklist``; older clusters may require using the old terms. + +You may also simplify using the ``simple-rados-client-with-blocklist`` profile: + +.. code:: sh + + ceph auth get-or-create client.X mon 'profile simple-rados-client-with-blocklist' osd 'allow rwx' + +To learn why blocklisting is necessary, see :ref:`libcephsqlite-corrupt`. + + +Page Size +^^^^^^^^^ + +SQLite allows configuring the page size prior to creating a new database. It is +advisable to increase this config to 65536 (64K) when using RADOS backed +databases to reduce the number of OSD reads/writes and thereby improve +throughput and latency. + +.. code:: sql + + PRAGMA page_size = 65536 + +You may also try other values according to your application needs but note that +64K is the max imposed by SQLite. + + +Cache +^^^^^ + +The ceph VFS does not do any caching of reads or buffering of writes. Instead, +and more appropriately, the SQLite page cache is used. You may find it is too small +for most workloads and should therefore increase it significantly: + + +.. code:: sql + + PRAGMA cache_size = 4096 + +Which will cache 4096 pages or 256MB (with 64K ``page_cache``). + + +Journal Persistence +^^^^^^^^^^^^^^^^^^^ + +By default, SQLite deletes the journal for every transaction. This can be +expensive as the *ceph* VFS must delete every object backing the journal for each +transaction. For this reason, it is much faster and simpler to ask SQLite to +**persist** the journal. In this mode, SQLite will invalidate the journal via a +write to its header. This is done as: + +.. code:: sql + + PRAGMA journal_mode = PERSIST + +The cost of this may be increased unused space according to the high-water size +of the rollback journal (based on transaction type and size). 
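Putting the preceding pragmas together, a hypothetical tuning invocation might
look like the following sketch (it reuses the example pool ``foo``, namespace
``bar``, and database ``baz.db`` from above; adjust the names for your own
cluster):

.. code:: sh

    # Open (or create) the RADOS-backed database and apply the tuning pragmas
    # discussed above. Note that PRAGMA page_size only takes effect before the
    # database has been written to for the first time.
    sqlite3 -cmd '.load libcephsqlite.so' \
            -cmd '.open file:///foo:bar/baz.db?vfs=ceph' \
            -cmd 'PRAGMA page_size = 65536;' \
            -cmd 'PRAGMA cache_size = 4096;' \
            -cmd 'PRAGMA journal_mode = PERSIST;'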
+ + +Exclusive Lock Mode +^^^^^^^^^^^^^^^^^^^ + +SQLite operates in a ``NORMAL`` locking mode where each transaction requires +locking the backing database file. This can add unnecessary overhead to +transactions when you know there's only ever one user of the database at a +given time. You can have SQLite lock the database once for the duration of the +connection using: + +.. code:: sql + + PRAGMA locking_mode = EXCLUSIVE + +This can more than **halve** the time taken to perform a transaction. Keep in +mind this prevents other clients from accessing the database. + +In this locking mode, each write transaction to the database requires 3 +synchronization events: once to write to the journal, another to write to the +database file, and a final write to invalidate the journal header (in +``PERSIST`` journaling mode). + + +WAL Journal +^^^^^^^^^^^ + +The `WAL Journal Mode`_ is only available when SQLite is operating in exclusive +lock mode. This is because it requires shared memory communication with other +readers and writers when in the ``NORMAL`` locking mode. + +As with local disk databases, WAL mode may significantly reduce small +transaction latency. Testing has shown it can provide more than 50% speedup +over persisted rollback journals in exclusive locking mode. You can expect +around 150-250 transactions per second depending on size. + + +Performance Notes +^^^^^^^^^^^^^^^^^ + +The filing backend for the database on RADOS is asynchronous as much as +possible. Still, performance can be anywhere from 3x-10x slower than a local +database on SSD. Latency can be a major factor. It is advisable to be familiar +with SQL transactions and other strategies for efficient database updates. +Depending on the performance of the underlying pool, you can expect small +transactions to take up to 30 milliseconds to complete. If you use the +``EXCLUSIVE`` locking mode, it can be reduced further to 15 milliseconds per +transaction. A WAL journal in ``EXCLUSIVE`` locking mode can further reduce +this as low as ~2-5 milliseconds (or the time to complete a RADOS write; you +won't get better than that!). + +There is no limit to the size of a SQLite database on RADOS imposed by the Ceph +VFS. There are standard `SQLite Limits`_ to be aware of, notably the maximum +database size of 281 TB. Large databases may or may not be performant on Ceph. +Experimentation for your own use-case is advised. + +Be aware that read-heavy queries could take significant amounts of time as +reads are necessarily synchronous (due to the VFS API). No readahead is yet +performed by the VFS. + + +Recommended Use-Cases +^^^^^^^^^^^^^^^^^^^^^ + +The original purpose of this module was to support saving relational or large +data in RADOS which needs to span multiple objects. Many current applications +with trivial state try to use RADOS omap storage on a single object but this +cannot scale without striping data across multiple objects. Unfortunately, it +is non-trivial to design a store spanning multiple objects which is consistent +and also simple to use. SQLite can be used to bridge that gap. + + +Parallel Access +^^^^^^^^^^^^^^^ + +The VFS does not yet support concurrent readers. All database access is protected +by a single exclusive lock. + + +Export or Extract Database out of RADOS +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The database is striped on RADOS and can be extracted using the RADOS cli toolset. + +.. 
code:: sh + + rados --pool=foo --striper get bar.db local-bar.db + rados --pool=foo --striper get bar.db-journal local-bar.db-journal + sqlite3 local-bar.db ... + +Keep in mind the rollback journal is also striped and will need to be extracted +as well if the database was in the middle of a transaction. If you're using +WAL, that journal will need to be extracted as well. + +Keep in mind that extracting the database using the striper uses the same RADOS +locks as those used by the *ceph* VFS. However, the journal file locks are not +used by the *ceph* VFS (SQLite only locks the main database file) so there is a +potential race with other SQLite clients when extracting both files. That could +result in fetching a corrupt journal. + +Instead of manually extracting the files, it would be more advisable to use the +`SQLite Backup`_ mechanism instead. + + +Temporary Tables +^^^^^^^^^^^^^^^^ + +Temporary tables backed by the ceph VFS are not supported. The main reason for +this is that the VFS lacks context about where it should put the database, i.e. +which RADOS pool. The persistent database associated with the temporary +database is not communicated via the SQLite VFS API. + +Instead, it's suggested to attach a secondary local or `In-Memory Database`_ +and put the temporary tables there. Alternatively, you may set a connection +pragma: + +.. code:: sql + + PRAGMA temp_store=memory + + +.. _libcephsqlite-breaking-locks: + +Breaking Locks +^^^^^^^^^^^^^^ + +Access to the database file is protected by an exclusive lock on the first +object stripe of the database. If the application fails without unlocking the +database (e.g. a segmentation fault), the lock is not automatically unlocked, +even if the client connection is blocklisted afterward. Eventually, the lock +will timeout subject to the configurations:: + + cephsqlite_lock_renewal_timeout = 30000 + +The timeout is in milliseconds. Once the timeout is reached, the OSD will +expire the lock and allow clients to relock. When this occurs, the database +will be recovered by SQLite and the in-progress transaction rolled back. The +new client recovering the database will also blocklist the old client to +prevent potential database corruption from rogue writes. + +The holder of the exclusive lock on the database will periodically renew the +lock so it does not lose the lock. This is necessary for large transactions or +database connections operating in ``EXCLUSIVE`` locking mode. The lock renewal +interval is adjustable via:: + + cephsqlite_lock_renewal_interval = 2000 + +This configuration is also in units of milliseconds. + +It is possible to break the lock early if you know the client is gone for good +(e.g. blocklisted). This allows restoring database access to clients +immediately. For example: + +.. code:: sh + + $ rados --pool=foo --namespace bar lock info baz.db.0000000000000000 striper.lock + {"name":"striper.lock","type":"exclusive","tag":"","lockers":[{"name":"client.4463","cookie":"555c7208-db39-48e8-a4d7-3ba92433a41a","description":"SimpleRADOSStriper","expiration":"0.000000","addr":"127.0.0.1:0/1831418345"}]} + + $ rados --pool=foo --namespace bar lock break baz.db.0000000000000000 striper.lock client.4463 --lock-cookie 555c7208-db39-48e8-a4d7-3ba92433a41a + +.. _libcephsqlite-corrupt: + +How to Corrupt Your Database +^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +There is the usual reading on `How to Corrupt Your SQLite Database`_ that you +should review before using this tool. 
To add to that, the most likely way you +may corrupt your database is by a rogue process transiently losing network +connectivity and then resuming its work. The exclusive RADOS lock it held will +be lost but it cannot know that immediately. Any work it might do after +regaining network connectivity could corrupt the database. + +The *ceph* VFS library defaults do not allow for this scenario to occur. The Ceph +VFS will blocklist the last owner of the exclusive lock on the database if it +detects incomplete cleanup. + +By blocklisting the old client, it's no longer possible for the old client to +resume its work on the database when it returns (subject to blocklist +expiration, 3600 seconds by default). To turn off blocklisting the prior client, change:: + + cephsqlite_blocklist_dead_locker = false + +Do NOT do this unless you know database corruption cannot result due to other +guarantees. If this config is true (the default), the *ceph* VFS will cowardly +fail if it cannot blocklist the prior instance (due to lack of authorization, +for example). + +One example where out-of-band mechanisms exist to blocklist the last dead +holder of the exclusive lock on the database is in the ``ceph-mgr``. The +monitors are made aware of the RADOS connection used for the *ceph* VFS and will +blocklist the instance during ``ceph-mgr`` failover. This prevents a zombie +``ceph-mgr`` from continuing work and potentially corrupting the database. For +this reason, it is not necessary for the *ceph* VFS to do the blocklist command +in the new instance of the ``ceph-mgr`` (but it still does so, harmlessly). + +To blocklist the *ceph* VFS manually, you may see the instance address of the +*ceph* VFS using the ``ceph_status`` SQL function: + +.. code:: sql + + SELECT ceph_status(); + +.. code:: + + {"id":788461300,"addr":"172.21.10.4:0/1472139388"} + +You may easily manipulate that information using the `JSON1 extension`_: + +.. code:: sql + + SELECT json_extract(ceph_status(), '$.addr'); + +.. code:: + + 172.21.10.4:0/3563721180 + +This is the address you would pass to the ceph blocklist command: + +.. code:: sh + + ceph osd blocklist add 172.21.10.4:0/3082314560 + + +Performance Statistics +^^^^^^^^^^^^^^^^^^^^^^ + +The *ceph* VFS provides a SQLite function, ``ceph_perf``, for querying the +performance statistics of the VFS. The data is from "performance counters" as +in other Ceph services normally queried via an admin socket. + +.. code:: sql + + SELECT ceph_perf(); + +.. 
code:: + + {"libcephsqlite_vfs":{"op_open":{"avgcount":2,"sum":0.150001291,"avgtime":0.075000645},"op_delete":{"avgcount":0,"sum":0.000000000,"avgtime":0.000000000},"op_access":{"avgcount":1,"sum":0.003000026,"avgtime":0.003000026},"op_fullpathname":{"avgcount":1,"sum":0.064000551,"avgtime":0.064000551},"op_currenttime":{"avgcount":0,"sum":0.000000000,"avgtime":0.000000000},"opf_close":{"avgcount":1,"sum":0.000000000,"avgtime":0.000000000},"opf_read":{"avgcount":3,"sum":0.036000310,"avgtime":0.012000103},"opf_write":{"avgcount":0,"sum":0.000000000,"avgtime":0.000000000},"opf_truncate":{"avgcount":0,"sum":0.000000000,"avgtime":0.000000000},"opf_sync":{"avgcount":0,"sum":0.000000000,"avgtime":0.000000000},"opf_filesize":{"avgcount":2,"sum":0.000000000,"avgtime":0.000000000},"opf_lock":{"avgcount":1,"sum":0.158001360,"avgtime":0.158001360},"opf_unlock":{"avgcount":1,"sum":0.101000871,"avgtime":0.101000871},"opf_checkreservedlock":{"avgcount":1,"sum":0.002000017,"avgtime":0.002000017},"opf_filecontrol":{"avgcount":4,"sum":0.000000000,"avgtime":0.000000000},"opf_sectorsize":{"avgcount":0,"sum":0.000000000,"avgtime":0.000000000},"opf_devicecharacteristics":{"avgcount":4,"sum":0.000000000,"avgtime":0.000000000}},"libcephsqlite_striper":{"update_metadata":0,"update_allocated":0,"update_size":0,"update_version":0,"shrink":0,"shrink_bytes":0,"lock":1,"unlock":1}} + +You may easily manipulate that information using the `JSON1 extension`_: + +.. code:: sql + + SELECT json_extract(ceph_perf(), '$.libcephsqlite_vfs.opf_sync.avgcount'); + +.. code:: + + 776 + +That tells you the number of times SQLite has called the xSync method of the +`SQLite IO Methods`_ of the VFS (for **all** open database connections in the +process). You could analyze the performance stats before and after a number of +queries to see the number of file system syncs required (this would just be +proportional to the number of transactions). Alternatively, you may be more +interested in the average latency to complete a write: + +.. code:: sql + + SELECT json_extract(ceph_perf(), '$.libcephsqlite_vfs.opf_write'); + +.. code:: + + {"avgcount":7873,"sum":0.675005797,"avgtime":0.000085736} + +Which would tell you there have been 7873 writes with an average +time-to-complete of 85 microseconds. That clearly shows the calls are executed +asynchronously. Returning to sync: + +.. code:: sql + + SELECT json_extract(ceph_perf(), '$.libcephsqlite_vfs.opf_sync'); + +.. code:: + + {"avgcount":776,"sum":4.802041199,"avgtime":0.006188197} + +6 milliseconds were spent on average executing a sync call. This gathers all of +the asynchronous writes as well as an asynchronous update to the size of the +striped file. + + +Debugging +^^^^^^^^^ + +Debugging libcephsqlite can be turned on via:: + + debug_cephsqlite + +If running the ``sqlite3`` command-line tool, use: + +.. code:: sh + + env CEPH_ARGS='--log_to_file true --log-file sqlite3.log --debug_cephsqlite 20 --debug_ms 1' sqlite3 ... + +This will save all the usual Ceph debugging to a file ``sqlite3.log`` for inspection. + + +.. _SQLite: https://sqlite.org/index.html +.. _SQLite VFS: https://www.sqlite.org/vfs.html +.. _SQLite Backup: https://www.sqlite.org/backup.html +.. _SQLite Limits: https://www.sqlite.org/limits.html +.. _SQLite Extension Loading API: https://sqlite.org/c3ref/load_extension.html +.. _In-Memory Database: https://www.sqlite.org/inmemorydb.html +.. _WAL Journal Mode: https://sqlite.org/wal.html +.. _How to Corrupt Your SQLite Database: https://www.sqlite.org/howtocorrupt.html +.. 
_JSON1 Extension: https://www.sqlite.org/json1.html +.. _SQLite IO Methods: https://www.sqlite.org/c3ref/io_methods.html diff --git a/doc/rados/api/librados-intro.rst b/doc/rados/api/librados-intro.rst new file mode 100644 index 000000000..5174188b4 --- /dev/null +++ b/doc/rados/api/librados-intro.rst @@ -0,0 +1,1051 @@ +========================== + Introduction to librados +========================== + +The :term:`Ceph Storage Cluster` provides the basic storage service that allows +:term:`Ceph` to uniquely deliver **object, block, and file storage** in one +unified system. However, you are not limited to using the RESTful, block, or +POSIX interfaces. Based upon :abbr:`RADOS (Reliable Autonomic Distributed Object +Store)`, the ``librados`` API enables you to create your own interface to the +Ceph Storage Cluster. + +The ``librados`` API enables you to interact with the two types of daemons in +the Ceph Storage Cluster: + +- The :term:`Ceph Monitor`, which maintains a master copy of the cluster map. +- The :term:`Ceph OSD Daemon` (OSD), which stores data as objects on a storage node. + +.. ditaa:: + +---------------------------------+ + | Ceph Storage Cluster Protocol | + | (librados) | + +---------------------------------+ + +---------------+ +---------------+ + | OSDs | | Monitors | + +---------------+ +---------------+ + +This guide provides a high-level introduction to using ``librados``. +Refer to :doc:`../../architecture` for additional details of the Ceph +Storage Cluster. To use the API, you need a running Ceph Storage Cluster. +See `Installation (Quick)`_ for details. + + +Step 1: Getting librados +======================== + +Your client application must bind with ``librados`` to connect to the Ceph +Storage Cluster. You must install ``librados`` and any required packages to +write applications that use ``librados``. The ``librados`` API is written in +C++, with additional bindings for C, Python, Java and PHP. + + +Getting librados for C/C++ +-------------------------- + +To install ``librados`` development support files for C/C++ on Debian/Ubuntu +distributions, execute the following: + +.. prompt:: bash $ + + sudo apt-get install librados-dev + +To install ``librados`` development support files for C/C++ on RHEL/CentOS +distributions, execute the following: + +.. prompt:: bash $ + + sudo yum install librados2-devel + +Once you install ``librados`` for developers, you can find the required +headers for C/C++ under ``/usr/include/rados``: + +.. prompt:: bash $ + + ls /usr/include/rados + + +Getting librados for Python +--------------------------- + +The ``rados`` module provides ``librados`` support to Python +applications. You may install ``python3-rados`` for Debian, Ubuntu, SLE or +openSUSE or the ``python-rados`` package for CentOS/RHEL. + +To install ``librados`` development support files for Python on Debian/Ubuntu +distributions, execute the following: + +.. prompt:: bash $ + + sudo apt-get install python3-rados + +To install ``librados`` development support files for Python on RHEL/CentOS +distributions, execute the following: + +.. prompt:: bash $ + + sudo yum install python-rados + +To install ``librados`` development support files for Python on SLE/openSUSE +distributions, execute the following: + +.. prompt:: bash $ + + sudo zypper install python3-rados + +You can find the module under ``/usr/share/pyshared`` on Debian systems, +or under ``/usr/lib/python*/site-packages`` on CentOS/RHEL systems. 
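As a quick sanity check, you can confirm that the binding imports and report
the ``librados`` version it is linked against. (The ``python3`` interpreter
below is an assumption; use whichever interpreter matches the package you
installed.)

.. prompt:: bash $

   python3 -c 'import rados; print(rados.Rados().version())'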
+ + +Getting librados for Java +------------------------- + +To install ``librados`` for Java, you need to execute the following procedure: + +#. Install ``jna.jar``. For Debian/Ubuntu, execute: + + .. prompt:: bash $ + + sudo apt-get install libjna-java + + For CentOS/RHEL, execute: + + .. prompt:: bash $ + + sudo yum install jna + + The JAR files are located in ``/usr/share/java``. + +#. Clone the ``rados-java`` repository: + + .. prompt:: bash $ + + git clone --recursive https://github.com/ceph/rados-java.git + +#. Build the ``rados-java`` repository: + + .. prompt:: bash $ + + cd rados-java + ant + + The JAR file is located under ``rados-java/target``. + +#. Copy the JAR for RADOS to a common location (e.g., ``/usr/share/java``) and + ensure that it and the JNA JAR are in your JVM's classpath. For example: + + .. prompt:: bash $ + + sudo cp target/rados-0.1.3.jar /usr/share/java/rados-0.1.3.jar + sudo ln -s /usr/share/java/jna-3.2.7.jar /usr/lib/jvm/default-java/jre/lib/ext/jna-3.2.7.jar + sudo ln -s /usr/share/java/rados-0.1.3.jar /usr/lib/jvm/default-java/jre/lib/ext/rados-0.1.3.jar + +To build the documentation, execute the following: + +.. prompt:: bash $ + + ant docs + + +Getting librados for PHP +------------------------- + +To install the ``librados`` extension for PHP, you need to execute the following procedure: + +#. Install php-dev. For Debian/Ubuntu, execute: + + .. prompt:: bash $ + + sudo apt-get install php5-dev build-essential + + For CentOS/RHEL, execute: + + .. prompt:: bash $ + + sudo yum install php-devel + +#. Clone the ``phprados`` repository: + + .. prompt:: bash $ + + git clone https://github.com/ceph/phprados.git + +#. Build ``phprados``: + + .. prompt:: bash $ + + cd phprados + phpize + ./configure + make + sudo make install + +#. Enable ``phprados`` by adding the following line to ``php.ini``:: + + extension=rados.so + + +Step 2: Configuring a Cluster Handle +==================================== + +A :term:`Ceph Client`, via ``librados``, interacts directly with OSDs to store +and retrieve data. To interact with OSDs, the client app must invoke +``librados`` and connect to a Ceph Monitor. Once connected, ``librados`` +retrieves the :term:`Cluster Map` from the Ceph Monitor. When the client app +wants to read or write data, it creates an I/O context and binds to a +:term:`Pool`. The pool has an associated :term:`CRUSH rule` that defines how it +will place data in the storage cluster. Via the I/O context, the client +provides the object name to ``librados``, which takes the object name +and the cluster map (i.e., the topology of the cluster) and `computes`_ the +placement group and `OSD`_ for locating the data. Then the client application +can read or write data. The client app doesn't need to learn about the topology +of the cluster directly. + +.. ditaa:: + +--------+ Retrieves +---------------+ + | Client |------------>| Cluster Map | + +--------+ +---------------+ + | + v Writes + /-----\ + | obj | + \-----/ + | To + v + +--------+ +---------------+ + | Pool |---------->| CRUSH Rule | + +--------+ Selects +---------------+ + + +The Ceph Storage Cluster handle encapsulates the client configuration, including: + +- The `user ID`_ for ``rados_create()`` or user name for ``rados_create2()`` + (preferred). 
+- The :term:`cephx` authentication key +- The monitor ID and IP address +- Logging levels +- Debugging levels + +Thus, the first steps in using the cluster from your app are to 1) create +a cluster handle that your app will use to connect to the storage cluster, +and then 2) use that handle to connect. To connect to the cluster, the +app must supply a monitor address, a username and an authentication key +(cephx is enabled by default). + +.. tip:: Talking to different Ceph Storage Clusters – or to the same cluster + with different users – requires different cluster handles. + +RADOS provides a number of ways for you to set the required values. For +the monitor and encryption key settings, an easy way to handle them is to ensure +that your Ceph configuration file contains a ``keyring`` path to a keyring file +and at least one monitor address (e.g., ``mon_host``). For example:: + + [global] + mon_host = 192.168.1.1 + keyring = /etc/ceph/ceph.client.admin.keyring + +Once you create the handle, you can read a Ceph configuration file to configure +the handle. You can also pass arguments to your app and parse them with the +function for parsing command line arguments (e.g., ``rados_conf_parse_argv()``), +or parse Ceph environment variables (e.g., ``rados_conf_parse_env()``). Some +wrappers may not implement convenience methods, so you may need to implement +these capabilities. The following diagram provides a high-level flow for the +initial connection. + + +.. ditaa:: + +---------+ +---------+ + | Client | | Monitor | + +---------+ +---------+ + | | + |-----+ create | + | | cluster | + |<----+ handle | + | | + |-----+ read | + | | config | + |<----+ file | + | | + | connect | + |-------------->| + | | + |<--------------| + | connected | + | | + + +Once connected, your app can invoke functions that affect the whole cluster +with only the cluster handle. For example, once you have a cluster +handle, you can: + +- Get cluster statistics +- Use Pool Operation (exists, create, list, delete) +- Get and set the configuration + + +One of the powerful features of Ceph is the ability to bind to different pools. +Each pool may have a different number of placement groups, object replicas and +replication strategies. For example, a pool could be set up as a "hot" pool that +uses SSDs for frequently used objects or a "cold" pool that uses erasure coding. + +The main difference in the various ``librados`` bindings is between C and +the object-oriented bindings for C++, Java and Python. The object-oriented +bindings use objects to represent cluster handles, IO Contexts, iterators, +exceptions, etc. + + +C Example +--------- + +For C, creating a simple cluster handle using the ``admin`` user, configuring +it and connecting to the cluster might look something like this: + +.. code-block:: c + + #include + #include + #include + #include + + int main (int argc, const char **argv) + { + + /* Declare the cluster handle and required arguments. */ + rados_t cluster; + char cluster_name[] = "ceph"; + char user_name[] = "client.admin"; + uint64_t flags = 0; + + /* Initialize the cluster handle with the "ceph" cluster name and the "client.admin" user */ + int err; + err = rados_create2(&cluster, cluster_name, user_name, flags); + + if (err < 0) { + fprintf(stderr, "%s: Couldn't create the cluster handle! %s\n", argv[0], strerror(-err)); + exit(EXIT_FAILURE); + } else { + printf("\nCreated a cluster handle.\n"); + } + + + /* Read a Ceph configuration file to configure the cluster handle. 
*/ + err = rados_conf_read_file(cluster, "/etc/ceph/ceph.conf"); + if (err < 0) { + fprintf(stderr, "%s: cannot read config file: %s\n", argv[0], strerror(-err)); + exit(EXIT_FAILURE); + } else { + printf("\nRead the config file.\n"); + } + + /* Read command line arguments */ + err = rados_conf_parse_argv(cluster, argc, argv); + if (err < 0) { + fprintf(stderr, "%s: cannot parse command line arguments: %s\n", argv[0], strerror(-err)); + exit(EXIT_FAILURE); + } else { + printf("\nRead the command line arguments.\n"); + } + + /* Connect to the cluster */ + err = rados_connect(cluster); + if (err < 0) { + fprintf(stderr, "%s: cannot connect to cluster: %s\n", argv[0], strerror(-err)); + exit(EXIT_FAILURE); + } else { + printf("\nConnected to the cluster.\n"); + } + + } + +Compile your client and link to ``librados`` using ``-lrados``. For example: + +.. prompt:: bash $ + + gcc ceph-client.c -lrados -o ceph-client + + +C++ Example +----------- + +The Ceph project provides a C++ example in the ``ceph/examples/librados`` +directory. For C++, a simple cluster handle using the ``admin`` user requires +you to initialize a ``librados::Rados`` cluster handle object: + +.. code-block:: c++ + + #include + #include + #include + + int main(int argc, const char **argv) + { + + int ret = 0; + + /* Declare the cluster handle and required variables. */ + librados::Rados cluster; + char cluster_name[] = "ceph"; + char user_name[] = "client.admin"; + uint64_t flags = 0; + + /* Initialize the cluster handle with the "ceph" cluster name and "client.admin" user */ + { + ret = cluster.init2(user_name, cluster_name, flags); + if (ret < 0) { + std::cerr << "Couldn't initialize the cluster handle! error " << ret << std::endl; + return EXIT_FAILURE; + } else { + std::cout << "Created a cluster handle." << std::endl; + } + } + + /* Read a Ceph configuration file to configure the cluster handle. */ + { + ret = cluster.conf_read_file("/etc/ceph/ceph.conf"); + if (ret < 0) { + std::cerr << "Couldn't read the Ceph configuration file! error " << ret << std::endl; + return EXIT_FAILURE; + } else { + std::cout << "Read the Ceph configuration file." << std::endl; + } + } + + /* Read command line arguments */ + { + ret = cluster.conf_parse_argv(argc, argv); + if (ret < 0) { + std::cerr << "Couldn't parse command line options! error " << ret << std::endl; + return EXIT_FAILURE; + } else { + std::cout << "Parsed command line options." << std::endl; + } + } + + /* Connect to the cluster */ + { + ret = cluster.connect(); + if (ret < 0) { + std::cerr << "Couldn't connect to cluster! error " << ret << std::endl; + return EXIT_FAILURE; + } else { + std::cout << "Connected to the cluster." << std::endl; + } + } + + return 0; + } + + +Compile the source; then, link ``librados`` using ``-lrados``. +For example: + +.. prompt:: bash $ + + g++ -g -c ceph-client.cc -o ceph-client.o + g++ -g ceph-client.o -lrados -o ceph-client + + + +Python Example +-------------- + +Python uses the ``admin`` id and the ``ceph`` cluster name by default, and +will read the standard ``ceph.conf`` file if the conffile parameter is +set to the empty string. The Python binding converts C++ errors +into exceptions. + + +.. 
code-block:: python + + import rados + + try: + cluster = rados.Rados(conffile='') + except TypeError as e: + print('Argument validation error: {}'.format(e)) + raise e + + print("Created cluster handle.") + + try: + cluster.connect() + except Exception as e: + print("connection error: {}".format(e)) + raise e + finally: + print("Connected to the cluster.") + + +Execute the example to verify that it connects to your cluster: + +.. prompt:: bash $ + + python ceph-client.py + + +Java Example +------------ + +Java requires you to specify the user ID (``admin``) or user name +(``client.admin``), and uses the ``ceph`` cluster name by default . The Java +binding converts C++-based errors into exceptions. + +.. code-block:: java + + import com.ceph.rados.Rados; + import com.ceph.rados.RadosException; + + import java.io.File; + + public class CephClient { + public static void main (String args[]){ + + try { + Rados cluster = new Rados("admin"); + System.out.println("Created cluster handle."); + + File f = new File("/etc/ceph/ceph.conf"); + cluster.confReadFile(f); + System.out.println("Read the configuration file."); + + cluster.connect(); + System.out.println("Connected to the cluster."); + + } catch (RadosException e) { + System.out.println(e.getMessage() + ": " + e.getReturnValue()); + } + } + } + + +Compile the source; then, run it. If you have copied the JAR to +``/usr/share/java`` and sym linked from your ``ext`` directory, you won't need +to specify the classpath. For example: + +.. prompt:: bash $ + + javac CephClient.java + java CephClient + + +PHP Example +------------ + +With the RADOS extension enabled in PHP you can start creating a new cluster handle very easily: + +.. code-block:: php + + | + | | | + | write ack | | + |<--------------+---------------| + | | | + | write xattr | | + |---------------+-------------->| + | | | + | xattr ack | | + |<--------------+---------------| + | | | + | read data | | + |---------------+-------------->| + | | | + | read ack | | + |<--------------+---------------| + | | | + | remove data | | + |---------------+-------------->| + | | | + | remove ack | | + |<--------------+---------------| + + + +RADOS enables you to interact both synchronously and asynchronously. Once your +app has an I/O Context, read/write operations only require you to know the +object/xattr name. The CRUSH algorithm encapsulated in ``librados`` uses the +cluster map to identify the appropriate OSD. OSD daemons handle the replication, +as described in `Smart Daemons Enable Hyperscale`_. The ``librados`` library also +maps objects to placement groups, as described in `Calculating PG IDs`_. + +The following examples use the default ``data`` pool. However, you may also +use the API to list pools, ensure they exist, or create and delete pools. For +the write operations, the examples illustrate how to use synchronous mode. For +the read operations, the examples illustrate how to use asynchronous mode. + +.. important:: Use caution when deleting pools with this API. If you delete + a pool, the pool and ALL DATA in the pool will be lost. + + +C Example +--------- + + +.. code-block:: c + + #include + #include + #include + #include + + int main (int argc, const char **argv) + { + /* + * Continued from previous C example, where cluster handle and + * connection are established. First declare an I/O Context. 
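        * (Note: a pool named "data" is assumed to exist for these examples;
        * on a test cluster it can be created with `ceph osd pool create data`.)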
+ */ + + rados_ioctx_t io; + char *poolname = "data"; + + err = rados_ioctx_create(cluster, poolname, &io); + if (err < 0) { + fprintf(stderr, "%s: cannot open rados pool %s: %s\n", argv[0], poolname, strerror(-err)); + rados_shutdown(cluster); + exit(EXIT_FAILURE); + } else { + printf("\nCreated I/O context.\n"); + } + + /* Write data to the cluster synchronously. */ + err = rados_write(io, "hw", "Hello World!", 12, 0); + if (err < 0) { + fprintf(stderr, "%s: Cannot write object \"hw\" to pool %s: %s\n", argv[0], poolname, strerror(-err)); + rados_ioctx_destroy(io); + rados_shutdown(cluster); + exit(1); + } else { + printf("\nWrote \"Hello World\" to object \"hw\".\n"); + } + + char xattr[] = "en_US"; + err = rados_setxattr(io, "hw", "lang", xattr, 5); + if (err < 0) { + fprintf(stderr, "%s: Cannot write xattr to pool %s: %s\n", argv[0], poolname, strerror(-err)); + rados_ioctx_destroy(io); + rados_shutdown(cluster); + exit(1); + } else { + printf("\nWrote \"en_US\" to xattr \"lang\" for object \"hw\".\n"); + } + + /* + * Read data from the cluster asynchronously. + * First, set up asynchronous I/O completion. + */ + rados_completion_t comp; + err = rados_aio_create_completion(NULL, NULL, NULL, &comp); + if (err < 0) { + fprintf(stderr, "%s: Could not create aio completion: %s\n", argv[0], strerror(-err)); + rados_ioctx_destroy(io); + rados_shutdown(cluster); + exit(1); + } else { + printf("\nCreated AIO completion.\n"); + } + + /* Next, read data using rados_aio_read. */ + char read_res[100]; + err = rados_aio_read(io, "hw", comp, read_res, 12, 0); + if (err < 0) { + fprintf(stderr, "%s: Cannot read object. %s %s\n", argv[0], poolname, strerror(-err)); + rados_ioctx_destroy(io); + rados_shutdown(cluster); + exit(1); + } else { + printf("\nRead object \"hw\". The contents are:\n %s \n", read_res); + } + + /* Wait for the operation to complete */ + rados_aio_wait_for_complete(comp); + + /* Release the asynchronous I/O complete handle to avoid memory leaks. */ + rados_aio_release(comp); + + + char xattr_res[100]; + err = rados_getxattr(io, "hw", "lang", xattr_res, 5); + if (err < 0) { + fprintf(stderr, "%s: Cannot read xattr. %s %s\n", argv[0], poolname, strerror(-err)); + rados_ioctx_destroy(io); + rados_shutdown(cluster); + exit(1); + } else { + printf("\nRead xattr \"lang\" for object \"hw\". The contents are:\n %s \n", xattr_res); + } + + err = rados_rmxattr(io, "hw", "lang"); + if (err < 0) { + fprintf(stderr, "%s: Cannot remove xattr. %s %s\n", argv[0], poolname, strerror(-err)); + rados_ioctx_destroy(io); + rados_shutdown(cluster); + exit(1); + } else { + printf("\nRemoved xattr \"lang\" for object \"hw\".\n"); + } + + err = rados_remove(io, "hw"); + if (err < 0) { + fprintf(stderr, "%s: Cannot remove object. %s %s\n", argv[0], poolname, strerror(-err)); + rados_ioctx_destroy(io); + rados_shutdown(cluster); + exit(1); + } else { + printf("\nRemoved object \"hw\".\n"); + } + + } + + + +C++ Example +----------- + + +.. code-block:: c++ + + #include + #include + #include + + int main(int argc, const char **argv) + { + + /* Continued from previous C++ example, where cluster handle and + * connection are established. First declare an I/O Context. + */ + + librados::IoCtx io_ctx; + const char *pool_name = "data"; + + { + ret = cluster.ioctx_create(pool_name, io_ctx); + if (ret < 0) { + std::cerr << "Couldn't set up ioctx! error " << ret << std::endl; + exit(EXIT_FAILURE); + } else { + std::cout << "Created an ioctx for the pool." 
<< std::endl; + } + } + + + /* Write an object synchronously. */ + { + librados::bufferlist bl; + bl.append("Hello World!"); + ret = io_ctx.write_full("hw", bl); + if (ret < 0) { + std::cerr << "Couldn't write object! error " << ret << std::endl; + exit(EXIT_FAILURE); + } else { + std::cout << "Wrote new object 'hw' " << std::endl; + } + } + + + /* + * Add an xattr to the object. + */ + { + librados::bufferlist lang_bl; + lang_bl.append("en_US"); + ret = io_ctx.setxattr("hw", "lang", lang_bl); + if (ret < 0) { + std::cerr << "failed to set xattr version entry! error " + << ret << std::endl; + exit(EXIT_FAILURE); + } else { + std::cout << "Set the xattr 'lang' on our object!" << std::endl; + } + } + + + /* + * Read the object back asynchronously. + */ + { + librados::bufferlist read_buf; + int read_len = 4194304; + + //Create I/O Completion. + librados::AioCompletion *read_completion = librados::Rados::aio_create_completion(); + + //Send read request. + ret = io_ctx.aio_read("hw", read_completion, &read_buf, read_len, 0); + if (ret < 0) { + std::cerr << "Couldn't start read object! error " << ret << std::endl; + exit(EXIT_FAILURE); + } + + // Wait for the request to complete, and check that it succeeded. + read_completion->wait_for_complete(); + ret = read_completion->get_return_value(); + if (ret < 0) { + std::cerr << "Couldn't read object! error " << ret << std::endl; + exit(EXIT_FAILURE); + } else { + std::cout << "Read object hw asynchronously with contents.\n" + << read_buf.c_str() << std::endl; + } + } + + + /* + * Read the xattr. + */ + { + librados::bufferlist lang_res; + ret = io_ctx.getxattr("hw", "lang", lang_res); + if (ret < 0) { + std::cerr << "failed to get xattr version entry! error " + << ret << std::endl; + exit(EXIT_FAILURE); + } else { + std::cout << "Got the xattr 'lang' from object hw!" + << lang_res.c_str() << std::endl; + } + } + + + /* + * Remove the xattr. + */ + { + ret = io_ctx.rmxattr("hw", "lang"); + if (ret < 0) { + std::cerr << "Failed to remove xattr! error " + << ret << std::endl; + exit(EXIT_FAILURE); + } else { + std::cout << "Removed the xattr 'lang' from our object!" << std::endl; + } + } + + /* + * Remove the object. + */ + { + ret = io_ctx.remove("hw"); + if (ret < 0) { + std::cerr << "Couldn't remove object! error " << ret << std::endl; + exit(EXIT_FAILURE); + } else { + std::cout << "Removed object 'hw'." << std::endl; + } + } + } + + + +Python Example +-------------- + +.. code-block:: python + + print("\n\nI/O Context and Object Operations") + print("=================================") + + print("\nCreating a context for the 'data' pool") + if not cluster.pool_exists('data'): + raise RuntimeError('No data pool exists') + ioctx = cluster.open_ioctx('data') + + print("\nWriting object 'hw' with contents 'Hello World!' to pool 'data'.") + ioctx.write("hw", b"Hello World!") + print("Writing XATTR 'lang' with value 'en_US' to object 'hw'") + ioctx.set_xattr("hw", "lang", b"en_US") + + + print("\nWriting object 'bm' with contents 'Bonjour tout le monde!' 
to pool + 'data'.") + ioctx.write("bm", b"Bonjour tout le monde!") + print("Writing XATTR 'lang' with value 'fr_FR' to object 'bm'") + ioctx.set_xattr("bm", "lang", b"fr_FR") + + print("\nContents of object 'hw'\n------------------------") + print(ioctx.read("hw")) + + print("\n\nGetting XATTR 'lang' from object 'hw'") + print(ioctx.get_xattr("hw", "lang")) + + print("\nContents of object 'bm'\n------------------------") + print(ioctx.read("bm")) + + print("\n\nGetting XATTR 'lang' from object 'bm'") + print(ioctx.get_xattr("bm", "lang")) + + + print("\nRemoving object 'hw'") + ioctx.remove_object("hw") + + print("Removing object 'bm'") + ioctx.remove_object("bm") + + +Java-Example +------------ + +.. code-block:: java + + import com.ceph.rados.Rados; + import com.ceph.rados.RadosException; + + import java.io.File; + import com.ceph.rados.IoCTX; + + public class CephClient { + public static void main (String args[]){ + + try { + Rados cluster = new Rados("admin"); + System.out.println("Created cluster handle."); + + File f = new File("/etc/ceph/ceph.conf"); + cluster.confReadFile(f); + System.out.println("Read the configuration file."); + + cluster.connect(); + System.out.println("Connected to the cluster."); + + IoCTX io = cluster.ioCtxCreate("data"); + + String oidone = "hw"; + String contentone = "Hello World!"; + io.write(oidone, contentone); + + String oidtwo = "bm"; + String contenttwo = "Bonjour tout le monde!"; + io.write(oidtwo, contenttwo); + + String[] objects = io.listObjects(); + for (String object: objects) + System.out.println(object); + + io.remove(oidone); + io.remove(oidtwo); + + cluster.ioCtxDestroy(io); + + } catch (RadosException e) { + System.out.println(e.getMessage() + ": " + e.getReturnValue()); + } + } + } + + +PHP Example +----------- + +.. code-block:: php + + ack_end, NULL); + } + + void commit_callback(rados_completion_t comp, void *arg) { + req_duration *dur = (req_duration *) arg; + gettimeofday(&dur->commit_end, NULL); + } + + int output_append_latency(rados_ioctx_t io, const char *data, size_t len, size_t num_writes) { + req_duration times[num_writes]; + rados_completion_t comps[num_writes]; + for (size_t i = 0; i < num_writes; ++i) { + gettimeofday(×[i].start, NULL); + int err = rados_aio_create_completion((void*) ×[i], ack_callback, commit_callback, &comps[i]); + if (err < 0) { + fprintf(stderr, "Error creating rados completion: %s\n", strerror(-err)); + return err; + } + char obj_name[100]; + snprintf(obj_name, sizeof(obj_name), "foo%ld", (unsigned long)i); + err = rados_aio_append(io, obj_name, comps[i], data, len); + if (err < 0) { + fprintf(stderr, "Error from rados_aio_append: %s", strerror(-err)); + return err; + } + } + // wait until all requests finish *and* the callbacks complete + rados_aio_flush(io); + // the latencies can now be analyzed + printf("Request # | Ack latency (s) | Commit latency (s)\n"); + for (size_t i = 0; i < num_writes; ++i) { + // don't forget to free the completions + rados_aio_release(comps[i]); + struct timeval ack_lat, commit_lat; + timersub(×[i].ack_end, ×[i].start, &ack_lat); + timersub(×[i].commit_end, ×[i].start, &commit_lat); + printf("%9ld | %8ld.%06ld | %10ld.%06ld\n", (unsigned long) i, ack_lat.tv_sec, ack_lat.tv_usec, commit_lat.tv_sec, commit_lat.tv_usec); + } + return 0; + } + +Note that all the :c:type:`rados_completion_t` must be freed with :c:func:`rados_aio_release` to avoid leaking memory. + + +API calls +========= + + .. autodoxygenfile:: rados_types.h + .. 
autodoxygenfile:: librados.h diff --git a/doc/rados/api/libradospp.rst b/doc/rados/api/libradospp.rst new file mode 100644 index 000000000..08483c8d4 --- /dev/null +++ b/doc/rados/api/libradospp.rst @@ -0,0 +1,9 @@ +================== + LibradosPP (C++) +================== + +.. note:: The librados C++ API is not guaranteed to be API+ABI stable + between major releases. All applications using the librados C++ API must + be recompiled and relinked against a specific Ceph release. + +.. todo:: write me! diff --git a/doc/rados/api/objclass-sdk.rst b/doc/rados/api/objclass-sdk.rst new file mode 100644 index 000000000..90b8eb018 --- /dev/null +++ b/doc/rados/api/objclass-sdk.rst @@ -0,0 +1,39 @@ +.. _`rados-objclass-api-sdk`: + +=========================== +SDK for Ceph Object Classes +=========================== + +`Ceph` can be extended by creating shared object classes called `Ceph Object +Classes`. The existing framework to build these object classes has dependencies +on the internal functionality of `Ceph`, which restricts users to build object +classes within the tree. The aim of this project is to create an independent +object class interface, which can be used to build object classes outside the +`Ceph` tree. This allows us to have two types of object classes, 1) those that +have in-tree dependencies and reside in the tree and 2) those that can make use +of the `Ceph Object Class SDK framework` and can be built outside of the `Ceph` +tree because they do not depend on any internal implementation of `Ceph`. This +project decouples object class development from Ceph and encourages creation +and distribution of object classes as packages. + +In order to demonstrate the use of this framework, we have provided an example +called ``cls_sdk``, which is a very simple object class that makes use of the +SDK framework. This object class resides in the ``src/cls`` directory. + +Installing objclass.h +--------------------- + +The object class interface that enables out-of-tree development of object +classes resides in ``src/include/rados/`` and gets installed with `Ceph` +installation. After running ``make install``, you should be able to see it +in ``/include/rados``. :: + + ls /usr/local/include/rados + +Using the SDK example +--------------------- + +The ``cls_sdk`` object class resides in ``src/cls/sdk/``. This gets built and +loaded into Ceph, with the Ceph build process. You can run the +``ceph_test_cls_sdk`` unittest, which resides in ``src/test/cls_sdk/``, +to test this class. diff --git a/doc/rados/api/python.rst b/doc/rados/api/python.rst new file mode 100644 index 000000000..346653a3d --- /dev/null +++ b/doc/rados/api/python.rst @@ -0,0 +1,428 @@ +=================== + Librados (Python) +=================== + +The ``rados`` module is a thin Python wrapper for ``librados``. + +Installation +============ + +To install Python libraries for Ceph, see `Getting librados for Python`_. + + +Getting Started +=============== + +You can create your own Ceph client using Python. The following tutorial will +show you how to import the Ceph Python module, connect to a Ceph cluster, and +perform object operations as a ``client.admin`` user. + +.. note:: To use the Ceph Python bindings, you must have access to a + running Ceph cluster. To set one up quickly, see `Getting Started`_. + +First, create a Python source file for your Ceph client. + +.. prompt:: bash + + vim client.py + + +Import the Module +----------------- + +To use the ``rados`` module, import it into your source file. + +.. 
code-block:: python + :linenos: + + import rados + + +Configure a Cluster Handle +-------------------------- + +Before connecting to the Ceph Storage Cluster, create a cluster handle. By +default, the cluster handle assumes a cluster named ``ceph`` (i.e., the default +for deployment tools, and our Getting Started guides too), and a +``client.admin`` user name. You may change these defaults to suit your needs. + +To connect to the Ceph Storage Cluster, your application needs to know where to +find the Ceph Monitor. Provide this information to your application by +specifying the path to your Ceph configuration file, which contains the location +of the initial Ceph monitors. + +.. code-block:: python + :linenos: + + import rados, sys + + #Create Handle Examples. + cluster = rados.Rados(conffile='ceph.conf') + cluster = rados.Rados(conffile=sys.argv[1]) + cluster = rados.Rados(conffile = 'ceph.conf', conf = dict (keyring = '/path/to/keyring')) + +Ensure that the ``conffile`` argument provides the path and file name of your +Ceph configuration file. You may use the ``sys`` module to avoid hard-coding the +Ceph configuration path and file name. + +Your Python client also requires a client keyring. For this example, we use the +``client.admin`` key by default. If you would like to specify the keyring when +creating the cluster handle, you may use the ``conf`` argument. Alternatively, +you may specify the keyring path in your Ceph configuration file. For example, +you may add something like the following line to your Ceph configuration file:: + + keyring = /path/to/ceph.client.admin.keyring + +For additional details on modifying your configuration via Python, see `Configuration`_. + + +Connect to the Cluster +---------------------- + +Once you have a cluster handle configured, you may connect to the cluster. +With a connection to the cluster, you may execute methods that return +information about the cluster. + +.. code-block:: python + :linenos: + :emphasize-lines: 7 + + import rados, sys + + cluster = rados.Rados(conffile='ceph.conf') + print("\nlibrados version: {}".format(str(cluster.version()))) + print("Will attempt to connect to: {}".format(str(cluster.conf_get('mon host')))) + + cluster.connect() + print("\nCluster ID: {}".format(cluster.get_fsid())) + + print("\n\nCluster Statistics") + print("==================") + cluster_stats = cluster.get_cluster_stats() + + for key, value in cluster_stats.items(): + print(key, value) + + +By default, Ceph authentication is ``on``. Your application will need to know +the location of the keyring. The ``python-ceph`` module doesn't have the default +location, so you need to specify the keyring path. The easiest way to specify +the keyring is to add it to the Ceph configuration file. The following Ceph +configuration file example uses the ``client.admin`` keyring. + +.. code-block:: ini + :linenos: + + [global] + # ... elided configuration + keyring = /path/to/keyring/ceph.client.admin.keyring + + +Manage Pools +------------ + +When connected to the cluster, the ``Rados`` API allows you to manage pools. You +can list pools, check for the existence of a pool, create a pool and delete a +pool. + +.. 
code-block:: python + :linenos: + :emphasize-lines: 6, 13, 18, 25 + + print("\n\nPool Operations") + print("===============") + + print("\nAvailable Pools") + print("----------------") + pools = cluster.list_pools() + + for pool in pools: + print(pool) + + print("\nCreate 'test' Pool") + print("------------------") + cluster.create_pool('test') + + print("\nPool named 'test' exists: {}".format(str(cluster.pool_exists('test')))) + print("\nVerify 'test' Pool Exists") + print("-------------------------") + pools = cluster.list_pools() + + for pool in pools: + print(pool) + + print("\nDelete 'test' Pool") + print("------------------") + cluster.delete_pool('test') + print("\nPool named 'test' exists: {}".format(str(cluster.pool_exists('test')))) + + +Input/Output Context +-------------------- + +Reading from and writing to the Ceph Storage Cluster requires an input/output +context (ioctx). You can create an ioctx with the ``open_ioctx()`` or +``open_ioctx2()`` method of the ``Rados`` class. The ``ioctx_name`` parameter +is the name of the pool and ``pool_id`` is the ID of the pool you wish to use. + +.. code-block:: python + :linenos: + + ioctx = cluster.open_ioctx('data') + + +or + +.. code-block:: python + :linenos: + + ioctx = cluster.open_ioctx2(pool_id) + + +Once you have an I/O context, you can read/write objects, extended attributes, +and perform a number of other operations. After you complete operations, ensure +that you close the connection. For example: + +.. code-block:: python + :linenos: + + print("\nClosing the connection.") + ioctx.close() + + +Writing, Reading and Removing Objects +------------------------------------- + +Once you create an I/O context, you can write objects to the cluster. If you +write to an object that doesn't exist, Ceph creates it. If you write to an +object that exists, Ceph overwrites it (except when you specify a range, and +then it only overwrites the range). You may read objects (and object ranges) +from the cluster. You may also remove objects from the cluster. For example: + +.. code-block:: python + :linenos: + :emphasize-lines: 2, 5, 8 + + print("\nWriting object 'hw' with contents 'Hello World!' to pool 'data'.") + ioctx.write_full("hw", "Hello World!") + + print("\n\nContents of object 'hw'\n------------------------\n") + print(ioctx.read("hw")) + + print("\nRemoving object 'hw'") + ioctx.remove_object("hw") + + +Writing and Reading XATTRS +-------------------------- + +Once you create an object, you can write extended attributes (XATTRs) to +the object and read XATTRs from the object. For example: + +.. code-block:: python + :linenos: + :emphasize-lines: 2, 5 + + print("\n\nWriting XATTR 'lang' with value 'en_US' to object 'hw'") + ioctx.set_xattr("hw", "lang", "en_US") + + print("\n\nGetting XATTR 'lang' from object 'hw'\n") + print(ioctx.get_xattr("hw", "lang")) + + +Listing Objects +--------------- + +If you want to examine the list of objects in a pool, you may +retrieve the list of objects and iterate over them with the object iterator. +For example: + +.. 
code-block:: python + :linenos: + :emphasize-lines: 1, 6, 7, 13 + + object_iterator = ioctx.list_objects() + + while True : + + try : + rados_object = object_iterator.__next__() + print("Object contents = {}".format(rados_object.read())) + + except StopIteration : + break + + # Or alternatively + [print("Object contents = {}".format(obj.read())) for obj in ioctx.list_objects()] + +The ``Object`` class provides a file-like interface to an object, allowing +you to read and write content and extended attributes. Object operations using +the I/O context provide additional functionality and asynchronous capabilities. + + +Cluster Handle API +================== + +The ``Rados`` class provides an interface into the Ceph Storage Daemon. + + +Configuration +------------- + +The ``Rados`` class provides methods for getting and setting configuration +values, reading the Ceph configuration file, and parsing arguments. You +do not need to be connected to the Ceph Storage Cluster to invoke the following +methods. See `Storage Cluster Configuration`_ for details on settings. + +.. currentmodule:: rados +.. automethod:: Rados.conf_get(option) +.. automethod:: Rados.conf_set(option, val) +.. automethod:: Rados.conf_read_file(path=None) +.. automethod:: Rados.conf_parse_argv(args) +.. automethod:: Rados.version() + + +Connection Management +--------------------- + +Once you configure your cluster handle, you may connect to the cluster, check +the cluster ``fsid``, retrieve cluster statistics, and disconnect (shutdown) +from the cluster. You may also assert that the cluster handle is in a particular +state (e.g., "configuring", "connecting", etc.). + +.. automethod:: Rados.connect(timeout=0) +.. automethod:: Rados.shutdown() +.. automethod:: Rados.get_fsid() +.. automethod:: Rados.get_cluster_stats() + +.. documented manually because it raises warnings because of *args usage in the +.. signature + +.. py:class:: Rados + + .. py:method:: require_state(*args) + + Checks if the Rados object is in a special state + + :param args: Any number of states to check as separate arguments + :raises: :class:`RadosStateError` + + +Pool Operations +--------------- + +To use pool operation methods, you must connect to the Ceph Storage Cluster +first. You may list the available pools, create a pool, check to see if a pool +exists, and delete a pool. + +.. automethod:: Rados.list_pools() +.. automethod:: Rados.create_pool(pool_name, crush_rule=None) +.. automethod:: Rados.pool_exists() +.. automethod:: Rados.delete_pool(pool_name) + + +CLI Commands +------------ + +The Ceph CLI command is internally using the following librados Python binding methods. + +In order to send a command, choose the correct method and choose the correct target. + +.. automethod:: Rados.mon_command +.. automethod:: Rados.osd_command +.. automethod:: Rados.mgr_command +.. automethod:: Rados.pg_command + + +Input/Output Context API +======================== + +To write data to and read data from the Ceph Object Store, you must create +an Input/Output context (ioctx). The `Rados` class provides `open_ioctx()` +and `open_ioctx2()` methods. The remaining ``ioctx`` operations involve +invoking methods of the `Ioctx` and other classes. + +.. automethod:: Rados.open_ioctx(ioctx_name) +.. automethod:: Ioctx.require_ioctx_open() +.. automethod:: Ioctx.get_stats() +.. automethod:: Ioctx.get_last_version() +.. automethod:: Ioctx.close() + + +.. Pool Snapshots +.. -------------- + +.. The Ceph Storage Cluster allows you to make a snapshot of a pool's state. +.. 
Whereas, basic pool operations only require a connection to the cluster, +.. snapshots require an I/O context. + +.. Ioctx.create_snap(self, snap_name) +.. Ioctx.list_snaps(self) +.. SnapIterator.next(self) +.. Snap.get_timestamp(self) +.. Ioctx.lookup_snap(self, snap_name) +.. Ioctx.remove_snap(self, snap_name) + +.. not published. This doesn't seem ready yet. + +Object Operations +----------------- + +The Ceph Storage Cluster stores data as objects. You can read and write objects +synchronously or asynchronously. You can read and write from offsets. An object +has a name (or key) and data. + + +.. automethod:: Ioctx.aio_write(object_name, to_write, offset=0, oncomplete=None, onsafe=None) +.. automethod:: Ioctx.aio_write_full(object_name, to_write, oncomplete=None, onsafe=None) +.. automethod:: Ioctx.aio_append(object_name, to_append, oncomplete=None, onsafe=None) +.. automethod:: Ioctx.write(key, data, offset=0) +.. automethod:: Ioctx.write_full(key, data) +.. automethod:: Ioctx.aio_flush() +.. automethod:: Ioctx.set_locator_key(loc_key) +.. automethod:: Ioctx.aio_read(object_name, length, offset, oncomplete) +.. automethod:: Ioctx.read(key, length=8192, offset=0) +.. automethod:: Ioctx.stat(key) +.. automethod:: Ioctx.trunc(key, size) +.. automethod:: Ioctx.remove_object(key) + + +Object Extended Attributes +-------------------------- + +You may set extended attributes (XATTRs) on an object. You can retrieve a list +of objects or XATTRs and iterate over them. + +.. automethod:: Ioctx.set_xattr(key, xattr_name, xattr_value) +.. automethod:: Ioctx.get_xattrs(oid) +.. automethod:: XattrIterator.__next__() +.. automethod:: Ioctx.get_xattr(key, xattr_name) +.. automethod:: Ioctx.rm_xattr(key, xattr_name) + + + +Object Interface +================ + +From an I/O context, you can retrieve a list of objects from a pool and iterate +over them. The object interface provide makes each object look like a file, and +you may perform synchronous operations on the objects. For asynchronous +operations, you should use the I/O context methods. + +.. automethod:: Ioctx.list_objects() +.. automethod:: ObjectIterator.__next__() +.. automethod:: Object.read(length = 1024*1024) +.. automethod:: Object.write(string_to_write) +.. automethod:: Object.get_xattrs() +.. automethod:: Object.get_xattr(xattr_name) +.. automethod:: Object.set_xattr(xattr_name, xattr_value) +.. automethod:: Object.rm_xattr(xattr_name) +.. automethod:: Object.stat() +.. automethod:: Object.remove() + + + + +.. _Getting Started: ../../../start +.. _Storage Cluster Configuration: ../../configuration +.. 
_Getting librados for Python: ../librados-intro#getting-librados-for-python diff --git a/doc/rados/command/list-inconsistent-obj.json b/doc/rados/command/list-inconsistent-obj.json new file mode 100644 index 000000000..2bdc5f74c --- /dev/null +++ b/doc/rados/command/list-inconsistent-obj.json @@ -0,0 +1,237 @@ +{ + "$schema": "http://json-schema.org/draft-04/schema#", + "type": "object", + "properties": { + "epoch": { + "description": "Scrub epoch", + "type": "integer" + }, + "inconsistents": { + "type": "array", + "items": { + "type": "object", + "properties": { + "object": { + "description": "Identify a Ceph object", + "type": "object", + "properties": { + "name": { + "type": "string" + }, + "nspace": { + "type": "string" + }, + "locator": { + "type": "string" + }, + "version": { + "type": "integer", + "minimum": 0 + }, + "snap": { + "oneOf": [ + { + "type": "string", + "enum": [ "head", "snapdir" ] + }, + { + "type": "integer", + "minimum": 0 + } + ] + } + }, + "required": [ + "name", + "nspace", + "locator", + "version", + "snap" + ] + }, + "selected_object_info": { + "type": "object", + "description": "Selected object information", + "additionalProperties": true + }, + "union_shard_errors": { + "description": "Union of all shard errors", + "type": "array", + "items": { + "enum": [ + "missing", + "stat_error", + "read_error", + "data_digest_mismatch_info", + "omap_digest_mismatch_info", + "size_mismatch_info", + "ec_hash_error", + "ec_size_error", + "info_missing", + "info_corrupted", + "obj_size_info_mismatch", + "snapset_missing", + "snapset_corrupted", + "hinfo_missing", + "hinfo_corrupted" + ] + }, + "minItems": 0, + "uniqueItems": true + }, + "errors": { + "description": "Errors related to the analysis of this object", + "type": "array", + "items": { + "enum": [ + "object_info_inconsistency", + "data_digest_mismatch", + "omap_digest_mismatch", + "size_mismatch", + "attr_value_mismatch", + "attr_name_mismatch", + "snapset_inconsistency", + "hinfo_inconsistency", + "size_too_large" + ] + }, + "minItems": 0, + "uniqueItems": true + }, + "shards": { + "description": "All found or expected shards", + "type": "array", + "items": { + "description": "Information about a particular shard of object", + "type": "object", + "properties": { + "object_info": { + "oneOf": [ + { + "type": "string" + }, + { + "type": "object", + "description": "Object information", + "additionalProperties": true + } + ] + }, + "snapset": { + "oneOf": [ + { + "type": "string" + }, + { + "type": "object", + "description": "Snap set information", + "additionalProperties": true + } + ] + }, + "hashinfo": { + "oneOf": [ + { + "type": "string" + }, + { + "type": "object", + "description": "Erasure code hash information", + "additionalProperties": true + } + ] + }, + "shard": { + "type": "integer" + }, + "osd": { + "type": "integer" + }, + "primary": { + "type": "boolean" + }, + "size": { + "type": "integer" + }, + "omap_digest": { + "description": "Hex representation (e.g. 0x1abd1234)", + "type": "string" + }, + "data_digest": { + "description": "Hex representation (e.g. 
0x1abd1234)", + "type": "string" + }, + "errors": { + "description": "Errors with this shard", + "type": "array", + "items": { + "enum": [ + "missing", + "stat_error", + "read_error", + "data_digest_mismatch_info", + "omap_digest_mismatch_info", + "size_mismatch_info", + "ec_hash_error", + "ec_size_error", + "info_missing", + "info_corrupted", + "obj_size_info_mismatch", + "snapset_missing", + "snapset_corrupted", + "hinfo_missing", + "hinfo_corrupted" + ] + }, + "minItems": 0, + "uniqueItems": true + }, + "attrs": { + "description": "If any shard's attr error is set then all attrs are here", + "type": "array", + "items": { + "description": "Information about a particular shard of object", + "type": "object", + "properties": { + "name": { + "type": "string" + }, + "value": { + "type": "string" + }, + "Base64": { + "type": "boolean" + } + }, + "required": [ + "name", + "value", + "Base64" + ], + "additionalProperties": false + } + } + }, + "additionalProperties": false, + "required": [ + "osd", + "primary", + "errors" + ] + } + } + }, + "required": [ + "object", + "union_shard_errors", + "errors", + "shards" + ] + } + } + }, + "required": [ + "epoch", + "inconsistents" + ] +} diff --git a/doc/rados/command/list-inconsistent-snap.json b/doc/rados/command/list-inconsistent-snap.json new file mode 100644 index 000000000..55f1d53e9 --- /dev/null +++ b/doc/rados/command/list-inconsistent-snap.json @@ -0,0 +1,86 @@ +{ + "$schema": "http://json-schema.org/draft-04/schema#", + "type": "object", + "properties": { + "epoch": { + "description": "Scrub epoch", + "type": "integer" + }, + "inconsistents": { + "type": "array", + "items": { + "type": "object", + "properties": { + "name": { + "type": "string" + }, + "nspace": { + "type": "string" + }, + "locator": { + "type": "string" + }, + "snap": { + "oneOf": [ + { + "type": "string", + "enum": [ + "head", + "snapdir" + ] + }, + { + "type": "integer", + "minimum": 0 + } + ] + }, + "errors": { + "description": "Errors for this object's snap", + "type": "array", + "items": { + "enum": [ + "snapset_missing", + "snapset_corrupted", + "info_missing", + "info_corrupted", + "snapset_error", + "headless", + "size_mismatch", + "extra_clones", + "clone_missing" + ] + }, + "minItems": 0, + "uniqueItems": true + }, + "missing": { + "description": "List of missing clones if clone_missing error set", + "type": "array", + "items": { + "type": "integer" + } + }, + "extra_clones": { + "description": "List of extra clones if extra_clones error set", + "type": "array", + "items": { + "type": "integer" + } + } + }, + "required": [ + "name", + "nspace", + "locator", + "snap", + "errors" + ] + } + } + }, + "required": [ + "epoch", + "inconsistents" + ] +} diff --git a/doc/rados/configuration/auth-config-ref.rst b/doc/rados/configuration/auth-config-ref.rst new file mode 100644 index 000000000..fc14f4ee6 --- /dev/null +++ b/doc/rados/configuration/auth-config-ref.rst @@ -0,0 +1,379 @@ +.. _rados-cephx-config-ref: + +======================== + CephX Config Reference +======================== + +The CephX protocol is enabled by default. The cryptographic authentication that +CephX provides has some computational costs, though they should generally be +quite low. If the network environment connecting your client and server hosts +is very safe and you cannot afford authentication, you can disable it. +**Disabling authentication is not generally recommended**. + +.. 
note:: If you disable authentication, you will be at risk of a + man-in-the-middle attack that alters your client/server messages, which + could have disastrous security effects. + +For information about creating users, see `User Management`_. For details on +the architecture of CephX, see `Architecture - High Availability +Authentication`_. + + +Deployment Scenarios +==================== + +How you initially configure CephX depends on your scenario. There are two +common strategies for deploying a Ceph cluster. If you are a first-time Ceph +user, you should probably take the easiest approach: using ``cephadm`` to +deploy a cluster. But if your cluster uses other deployment tools (for example, +Ansible, Chef, Juju, or Puppet), you will need either to use the manual +deployment procedures or to configure your deployment tool so that it will +bootstrap your monitor(s). + +Manual Deployment +----------------- + +When you deploy a cluster manually, it is necessary to bootstrap the monitors +manually and to create the ``client.admin`` user and keyring. To bootstrap +monitors, follow the steps in `Monitor Bootstrapping`_. Follow these steps when +using third-party deployment tools (for example, Chef, Puppet, and Juju). + + +Enabling/Disabling CephX +======================== + +Enabling CephX is possible only if the keys for your monitors, OSDs, and +metadata servers have already been deployed. If you are simply toggling CephX +on or off, it is not necessary to repeat the bootstrapping procedures. + +Enabling CephX +-------------- + +When CephX is enabled, Ceph will look for the keyring in the default search +path: this path includes ``/etc/ceph/$cluster.$name.keyring``. It is possible +to override this search-path location by adding a ``keyring`` option in the +``[global]`` section of your `Ceph configuration`_ file, but this is not +recommended. + +To enable CephX on a cluster for which authentication has been disabled, carry +out the following procedure. If you (or your deployment utility) have already +generated the keys, you may skip the steps related to generating keys. + +#. Create a ``client.admin`` key, and save a copy of the key for your client + host: + + .. prompt:: bash $ + + ceph auth get-or-create client.admin mon 'allow *' mds 'allow *' mgr 'allow *' osd 'allow *' -o /etc/ceph/ceph.client.admin.keyring + + **Warning:** This step will clobber any existing + ``/etc/ceph/client.admin.keyring`` file. Do not perform this step if a + deployment tool has already generated a keyring file for you. Be careful! + +#. Create a monitor keyring and generate a monitor secret key: + + .. prompt:: bash $ + + ceph-authtool --create-keyring /tmp/ceph.mon.keyring --gen-key -n mon. --cap mon 'allow *' + +#. For each monitor, copy the monitor keyring into a ``ceph.mon.keyring`` file + in the monitor's ``mon data`` directory. For example, to copy the monitor + keyring to ``mon.a`` in a cluster called ``ceph``, run the following + command: + + .. prompt:: bash $ + + cp /tmp/ceph.mon.keyring /var/lib/ceph/mon/ceph-a/keyring + +#. Generate a secret key for every MGR, where ``{$id}`` is the MGR letter: + + .. prompt:: bash $ + + ceph auth get-or-create mgr.{$id} mon 'allow profile mgr' mds 'allow *' osd 'allow *' -o /var/lib/ceph/mgr/ceph-{$id}/keyring + +#. Generate a secret key for every OSD, where ``{$id}`` is the OSD number: + + .. prompt:: bash $ + + ceph auth get-or-create osd.{$id} mon 'allow rwx' osd 'allow *' -o /var/lib/ceph/osd/ceph-{$id}/keyring + +#. 
Generate a secret key for every MDS, where ``{$id}`` is the MDS letter: + + .. prompt:: bash $ + + ceph auth get-or-create mds.{$id} mon 'allow rwx' osd 'allow *' mds 'allow *' mgr 'allow profile mds' -o /var/lib/ceph/mds/ceph-{$id}/keyring + +#. Enable CephX authentication by setting the following options in the + ``[global]`` section of your `Ceph configuration`_ file: + + .. code-block:: ini + + auth_cluster_required = cephx + auth_service_required = cephx + auth_client_required = cephx + +#. Start or restart the Ceph cluster. For details, see `Operating a Cluster`_. + +For details on bootstrapping a monitor manually, see `Manual Deployment`_. + + + +Disabling CephX +--------------- + +The following procedure describes how to disable CephX. If your cluster +environment is safe, you might want to disable CephX in order to offset the +computational expense of running authentication. **We do not recommend doing +so.** However, setup and troubleshooting might be easier if authentication is +temporarily disabled and subsequently re-enabled. + +#. Disable CephX authentication by setting the following options in the + ``[global]`` section of your `Ceph configuration`_ file: + + .. code-block:: ini + + auth_cluster_required = none + auth_service_required = none + auth_client_required = none + +#. Start or restart the Ceph cluster. For details, see `Operating a Cluster`_. + + +Configuration Settings +====================== + +Enablement +---------- + + +``auth_cluster_required`` + +:Description: If this configuration setting is enabled, the Ceph Storage + Cluster daemons (that is, ``ceph-mon``, ``ceph-osd``, + ``ceph-mds``, and ``ceph-mgr``) are required to authenticate with + each other. Valid settings are ``cephx`` or ``none``. + +:Type: String +:Required: No +:Default: ``cephx``. + + +``auth_service_required`` + +:Description: If this configuration setting is enabled, then Ceph clients can + access Ceph services only if those clients authenticate with the + Ceph Storage Cluster. Valid settings are ``cephx`` or ``none``. + +:Type: String +:Required: No +:Default: ``cephx``. + + +``auth_client_required`` + +:Description: If this configuration setting is enabled, then communication + between the Ceph client and Ceph Storage Cluster can be + established only if the Ceph Storage Cluster authenticates + against the Ceph client. Valid settings are ``cephx`` or + ``none``. + +:Type: String +:Required: No +:Default: ``cephx``. + + +.. index:: keys; keyring + +Keys +---- + +When Ceph is run with authentication enabled, ``ceph`` administrative commands +and Ceph clients can access the Ceph Storage Cluster only if they use +authentication keys. + +The most common way to make these keys available to ``ceph`` administrative +commands and Ceph clients is to include a Ceph keyring under the ``/etc/ceph`` +directory. For Octopus and later releases that use ``cephadm``, the filename is +usually ``ceph.client.admin.keyring``. If the keyring is included in the +``/etc/ceph`` directory, then it is unnecessary to specify a ``keyring`` entry +in the Ceph configuration file. + +Because the Ceph Storage Cluster's keyring file contains the ``client.admin`` +key, we recommend copying the keyring file to nodes from which you run +administrative commands. + +To perform this step manually, run the following command: + +.. prompt:: bash $ + + sudo scp {user}@{ceph-cluster-host}:/etc/ceph/ceph.client.admin.keyring /etc/ceph/ceph.client.admin.keyring + +.. 
tip:: Make sure that the ``ceph.keyring`` file has appropriate permissions + (for example, ``chmod 644``) set on your client machine. + +You can specify the key itself by using the ``key`` setting in the Ceph +configuration file (this approach is not recommended), or instead specify a +path to a keyfile by using the ``keyfile`` setting in the Ceph configuration +file. + +``keyring`` + +:Description: The path to the keyring file. +:Type: String +:Required: No +:Default: ``/etc/ceph/$cluster.$name.keyring,/etc/ceph/$cluster.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin`` + + +``keyfile`` + +:Description: The path to a keyfile (that is, a file containing only the key). +:Type: String +:Required: No +:Default: None + + +``key`` + +:Description: The key (that is, the text string of the key itself). We do not + recommend that you use this setting unless you know what you're + doing. +:Type: String +:Required: No +:Default: None + + +Daemon Keyrings +--------------- + +Administrative users or deployment tools (for example, ``cephadm``) generate +daemon keyrings in the same way that they generate user keyrings. By default, +Ceph stores the keyring of a daemon inside that daemon's data directory. The +default keyring locations and the capabilities that are necessary for the +daemon to function are shown below. + +``ceph-mon`` + +:Location: ``$mon_data/keyring`` +:Capabilities: ``mon 'allow *'`` + +``ceph-osd`` + +:Location: ``$osd_data/keyring`` +:Capabilities: ``mgr 'allow profile osd' mon 'allow profile osd' osd 'allow *'`` + +``ceph-mds`` + +:Location: ``$mds_data/keyring`` +:Capabilities: ``mds 'allow' mgr 'allow profile mds' mon 'allow profile mds' osd 'allow rwx'`` + +``ceph-mgr`` + +:Location: ``$mgr_data/keyring`` +:Capabilities: ``mon 'allow profile mgr' mds 'allow *' osd 'allow *'`` + +``radosgw`` + +:Location: ``$rgw_data/keyring`` +:Capabilities: ``mon 'allow rwx' osd 'allow rwx'`` + + +.. note:: The monitor keyring (that is, ``mon.``) contains a key but no + capabilities, and this keyring is not part of the cluster ``auth`` database. + +The daemon's data-directory locations default to directories of the form:: + + /var/lib/ceph/$type/$cluster-$id + +For example, ``osd.12`` would have the following data directory:: + + /var/lib/ceph/osd/ceph-12 + +It is possible to override these locations, but it is not recommended. + + +.. index:: signatures + +Signatures +---------- + +Ceph performs a signature check that provides some limited protection against +messages being tampered with in flight (for example, by a "man in the middle" +attack). + +As with other parts of Ceph authentication, signatures admit of fine-grained +control. You can enable or disable signatures for service messages between +clients and Ceph, and for messages between Ceph daemons. + +Note that even when signatures are enabled data is not encrypted in flight. + +``cephx_require_signatures`` + +:Description: If this configuration setting is set to ``true``, Ceph requires + signatures on all message traffic between the Ceph client and the + Ceph Storage Cluster, and between daemons within the Ceph Storage + Cluster. + +.. note:: + **ANTIQUATED NOTE:** + + Neither Ceph Argonaut nor Linux kernel versions prior to 3.19 + support signatures; if one of these clients is in use, ``cephx_require_signatures`` + can be disabled in order to allow the client to connect. 
+ + +:Type: Boolean +:Required: No +:Default: ``false`` + + +``cephx_cluster_require_signatures`` + +:Description: If this configuration setting is set to ``true``, Ceph requires + signatures on all message traffic between Ceph daemons within the + Ceph Storage Cluster. + +:Type: Boolean +:Required: No +:Default: ``false`` + + +``cephx_service_require_signatures`` + +:Description: If this configuration setting is set to ``true``, Ceph requires + signatures on all message traffic between Ceph clients and the + Ceph Storage Cluster. + +:Type: Boolean +:Required: No +:Default: ``false`` + + +``cephx_sign_messages`` + +:Description: If this configuration setting is set to ``true``, and if the Ceph + version supports message signing, then Ceph will sign all + messages so that they are more difficult to spoof. + +:Type: Boolean +:Default: ``true`` + + +Time to Live +------------ + +``auth_service_ticket_ttl`` + +:Description: When the Ceph Storage Cluster sends a ticket for authentication + to a Ceph client, the Ceph Storage Cluster assigns that ticket a + Time To Live (TTL). + +:Type: Double +:Default: ``60*60`` + + +.. _Monitor Bootstrapping: ../../../install/manual-deployment#monitor-bootstrapping +.. _Operating a Cluster: ../../operations/operating +.. _Manual Deployment: ../../../install/manual-deployment +.. _Ceph configuration: ../ceph-conf +.. _Architecture - High Availability Authentication: ../../../architecture#high-availability-authentication +.. _User Management: ../../operations/user-management diff --git a/doc/rados/configuration/bluestore-config-ref.rst b/doc/rados/configuration/bluestore-config-ref.rst new file mode 100644 index 000000000..3707be1aa --- /dev/null +++ b/doc/rados/configuration/bluestore-config-ref.rst @@ -0,0 +1,552 @@ +================================== + BlueStore Configuration Reference +================================== + +Devices +======= + +BlueStore manages either one, two, or in certain cases three storage devices. +These *devices* are "devices" in the Linux/Unix sense. This means that they are +assets listed under ``/dev`` or ``/devices``. Each of these devices may be an +entire storage drive, or a partition of a storage drive, or a logical volume. +BlueStore does not create or mount a conventional file system on devices that +it uses; BlueStore reads and writes to the devices directly in a "raw" fashion. + +In the simplest case, BlueStore consumes all of a single storage device. This +device is known as the *primary device*. The primary device is identified by +the ``block`` symlink in the data directory. + +The data directory is a ``tmpfs`` mount. When this data directory is booted or +activated by ``ceph-volume``, it is populated with metadata files and links +that hold information about the OSD: for example, the OSD's identifier, the +name of the cluster that the OSD belongs to, and the OSD's private keyring. + +In more complicated cases, BlueStore is deployed across one or two additional +devices: + +* A *write-ahead log (WAL) device* (identified as ``block.wal`` in the data + directory) can be used to separate out BlueStore's internal journal or + write-ahead log. Using a WAL device is advantageous only if the WAL device + is faster than the primary device (for example, if the WAL device is an SSD + and the primary device is an HDD). +* A *DB device* (identified as ``block.db`` in the data directory) can be used + to store BlueStore's internal metadata. 
BlueStore (or more precisely, the + embedded RocksDB) will put as much metadata as it can on the DB device in + order to improve performance. If the DB device becomes full, metadata will + spill back onto the primary device (where it would have been located in the + absence of the DB device). Again, it is advantageous to provision a DB device + only if it is faster than the primary device. + +If there is only a small amount of fast storage available (for example, less +than a gigabyte), we recommend using the available space as a WAL device. But +if more fast storage is available, it makes more sense to provision a DB +device. Because the BlueStore journal is always placed on the fastest device +available, using a DB device provides the same benefit that using a WAL device +would, while *also* allowing additional metadata to be stored off the primary +device (provided that it fits). DB devices make this possible because whenever +a DB device is specified but an explicit WAL device is not, the WAL will be +implicitly colocated with the DB on the faster device. + +To provision a single-device (colocated) BlueStore OSD, run the following +command: + +.. prompt:: bash $ + + ceph-volume lvm prepare --bluestore --data <device> + +To specify a WAL device or DB device, run the following command: + +.. prompt:: bash $ + + ceph-volume lvm prepare --bluestore --data <device> --block.wal <wal-device> --block.db <db-device> + +.. note:: The option ``--data`` can take as its argument any of the + following devices: logical volumes specified using *vg/lv* notation, + existing logical volumes, and GPT partitions. + + + +Provisioning strategies +----------------------- + +BlueStore differs from Filestore in that there are several ways to deploy a +BlueStore OSD. However, the overall deployment strategy for BlueStore can be +clarified by examining just these two common arrangements: + +.. _bluestore-single-type-device-config: + +**block (data) only** +^^^^^^^^^^^^^^^^^^^^^ +If all devices are of the same type (for example, they are all HDDs), and if +there are no fast devices available for the storage of metadata, then it makes +sense to specify the block device only and to leave ``block.db`` and +``block.wal`` unseparated. The :ref:`ceph-volume-lvm` command for a single +``/dev/sda`` device is as follows: + +.. prompt:: bash $ + + ceph-volume lvm create --bluestore --data /dev/sda + +If the devices to be used for a BlueStore OSD are pre-created logical volumes, +then the :ref:`ceph-volume-lvm` call for a logical volume named +``ceph-vg/block-lv`` is as follows: + +.. prompt:: bash $ + + ceph-volume lvm create --bluestore --data ceph-vg/block-lv + +.. _bluestore-mixed-device-config: + +**block and block.db** +^^^^^^^^^^^^^^^^^^^^^^ + +If you have a mix of fast and slow devices (for example, SSD or HDD), then we +recommend placing ``block.db`` on the faster device while ``block`` (that is, +the data) is stored on the slower device (that is, the rotational drive). + +You must create these volume groups and logical volumes manually, because the +``ceph-volume`` tool is currently unable to create them automatically. + +The following procedure illustrates the manual creation of volume groups and +logical volumes. For this example, we shall assume four rotational drives +(``sda``, ``sdb``, ``sdc``, and ``sdd``) and one (fast) SSD (``sdx``). First, +to create the volume groups, run the following commands: + +.. 
prompt:: bash $ + + vgcreate ceph-block-0 /dev/sda + vgcreate ceph-block-1 /dev/sdb + vgcreate ceph-block-2 /dev/sdc + vgcreate ceph-block-3 /dev/sdd + +Next, to create the logical volumes for ``block``, run the following commands: + +.. prompt:: bash $ + + lvcreate -l 100%FREE -n block-0 ceph-block-0 + lvcreate -l 100%FREE -n block-1 ceph-block-1 + lvcreate -l 100%FREE -n block-2 ceph-block-2 + lvcreate -l 100%FREE -n block-3 ceph-block-3 + +Because there are four HDDs, there will be four OSDs. Supposing that there is a +200GB SSD in ``/dev/sdx``, we can create four 50GB logical volumes by running +the following commands: + +.. prompt:: bash $ + + vgcreate ceph-db-0 /dev/sdx + lvcreate -L 50GB -n db-0 ceph-db-0 + lvcreate -L 50GB -n db-1 ceph-db-0 + lvcreate -L 50GB -n db-2 ceph-db-0 + lvcreate -L 50GB -n db-3 ceph-db-0 + +Finally, to create the four OSDs, run the following commands: + +.. prompt:: bash $ + + ceph-volume lvm create --bluestore --data ceph-block-0/block-0 --block.db ceph-db-0/db-0 + ceph-volume lvm create --bluestore --data ceph-block-1/block-1 --block.db ceph-db-0/db-1 + ceph-volume lvm create --bluestore --data ceph-block-2/block-2 --block.db ceph-db-0/db-2 + ceph-volume lvm create --bluestore --data ceph-block-3/block-3 --block.db ceph-db-0/db-3 + +After this procedure is finished, there should be four OSDs, ``block`` should +be on the four HDDs, and each HDD should have a 50GB logical volume +(specifically, a DB device) on the shared SSD. + +Sizing +====== +When using a :ref:`mixed spinning-and-solid-drive setup +`, it is important to make a large enough +``block.db`` logical volume for BlueStore. The logical volumes associated with +``block.db`` should have logical volumes that are *as large as possible*. + +It is generally recommended that the size of ``block.db`` be somewhere between +1% and 4% of the size of ``block``. For RGW workloads, it is recommended that +the ``block.db`` be at least 4% of the ``block`` size, because RGW makes heavy +use of ``block.db`` to store metadata (in particular, omap keys). For example, +if the ``block`` size is 1TB, then ``block.db`` should have a size of at least +40GB. For RBD workloads, however, ``block.db`` usually needs no more than 1% to +2% of the ``block`` size. + +In older releases, internal level sizes are such that the DB can fully utilize +only those specific partition / logical volume sizes that correspond to sums of +L0, L0+L1, L1+L2, and so on--that is, given default settings, sizes of roughly +3GB, 30GB, 300GB, and so on. Most deployments do not substantially benefit from +sizing that accommodates L3 and higher, though DB compaction can be facilitated +by doubling these figures to 6GB, 60GB, and 600GB. + +Improvements in Nautilus 14.2.12, Octopus 15.2.6, and subsequent releases allow +for better utilization of arbitrarily-sized DB devices. Moreover, the Pacific +release brings experimental dynamic-level support. Because of these advances, +users of older releases might want to plan ahead by provisioning larger DB +devices today so that the benefits of scale can be realized when upgrades are +made in the future. + +When *not* using a mix of fast and slow devices, there is no requirement to +create separate logical volumes for ``block.db`` or ``block.wal``. BlueStore +will automatically colocate these devices within the space of ``block``. 
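To make the percentage guidance above concrete, here is a minimal sketch (not part of Ceph; the workload ratios are simply the rules of thumb quoted above) that converts a ``block`` device size into a suggested ``block.db`` size:

.. code-block:: python

    def suggested_db_size_bytes(block_size_bytes, workload="rgw"):
        """Suggest a block.db size from the rule-of-thumb percentages above."""
        # RGW-heavy workloads: at least 4% of block (omap-heavy metadata).
        # RBD workloads: 1%-2% of block is usually enough; 2% is used here.
        ratios = {"rgw": 0.04, "rbd": 0.02}
        if workload not in ratios:
            raise ValueError("workload must be 'rgw' or 'rbd'")
        return int(block_size_bytes * ratios[workload])

    # Example: a 12 TB HDD used as the block (data) device.
    TB = 10 ** 12
    print(suggested_db_size_bytes(12 * TB, "rgw") / 10 ** 9)  # ~480 GB
    print(suggested_db_size_bytes(12 * TB, "rbd") / 10 ** 9)  # ~240 GB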
+ +Automatic Cache Sizing +====================== + +BlueStore can be configured to automatically resize its caches, provided that +certain conditions are met: TCMalloc must be configured as the memory allocator +and the ``bluestore_cache_autotune`` configuration option must be enabled (note +that it is currently enabled by default). When automatic cache sizing is in +effect, BlueStore attempts to keep OSD heap-memory usage under a certain target +size (as determined by ``osd_memory_target``). This approach makes use of a +best-effort algorithm and caches do not shrink smaller than the size defined by +the value of ``osd_memory_cache_min``. Cache ratios are selected in accordance +with a hierarchy of priorities. But if priority information is not available, +the values specified in the ``bluestore_cache_meta_ratio`` and +``bluestore_cache_kv_ratio`` options are used as fallback cache ratios. + +.. confval:: bluestore_cache_autotune +.. confval:: osd_memory_target +.. confval:: bluestore_cache_autotune_interval +.. confval:: osd_memory_base +.. confval:: osd_memory_expected_fragmentation +.. confval:: osd_memory_cache_min +.. confval:: osd_memory_cache_resize_interval + + +Manual Cache Sizing +=================== + +The amount of memory consumed by each OSD to be used for its BlueStore cache is +determined by the ``bluestore_cache_size`` configuration option. If that option +has not been specified (that is, if it remains at 0), then Ceph uses a +different configuration option to determine the default memory budget: +``bluestore_cache_size_hdd`` if the primary device is an HDD, or +``bluestore_cache_size_ssd`` if the primary device is an SSD. + +BlueStore and the rest of the Ceph OSD daemon make every effort to work within +this memory budget. Note that in addition to the configured cache size, there +is also memory consumed by the OSD itself. There is additional utilization due +to memory fragmentation and other allocator overhead. + +The configured cache-memory budget can be used to store the following types of +things: + +* Key/Value metadata (that is, RocksDB's internal cache) +* BlueStore metadata +* BlueStore data (that is, recently read or recently written object data) + +Cache memory usage is governed by the configuration options +``bluestore_cache_meta_ratio`` and ``bluestore_cache_kv_ratio``. The fraction +of the cache that is reserved for data is governed by both the effective +BlueStore cache size (which depends on the relevant +``bluestore_cache_size[_ssd|_hdd]`` option and the device class of the primary +device) and the "meta" and "kv" ratios. This data fraction can be calculated +with the following formula: ``<effective_cache_size> * (1 - +bluestore_cache_meta_ratio - bluestore_cache_kv_ratio)``. + +.. confval:: bluestore_cache_size +.. confval:: bluestore_cache_size_hdd +.. confval:: bluestore_cache_size_ssd +.. confval:: bluestore_cache_meta_ratio +.. confval:: bluestore_cache_kv_ratio + +Checksums +========= + +BlueStore checksums all metadata and all data written to disk. Metadata +checksumming is handled by RocksDB and uses the `crc32c` algorithm. By +contrast, data checksumming is handled by BlueStore and can use either +`crc32c`, `xxhash32`, or `xxhash64`. Nonetheless, `crc32c` is the default +checksum algorithm and it is suitable for most purposes. + +Full data checksumming increases the amount of metadata that BlueStore must +store and manage. Whenever possible (for example, when clients hint that data +is written and read sequentially), BlueStore will checksum larger blocks. In +many cases, however, it must store a checksum value (usually 4 bytes) for every +4 KB block of data.
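As a rough illustration of that bookkeeping (a sketch only, not Ceph code), the checksum metadata attached to an object can be estimated from the object size, the checksum width, and the 4 KB granularity:

.. code-block:: python

    def csum_overhead_bytes(object_bytes, csum_bytes=4, block_bytes=4096):
        """Estimate checksum metadata for one object: one checksum per block."""
        # csum_bytes: 4 for crc32c or xxhash32, 8 for xxhash64,
        # 2 for crc32c_16 and 1 for crc32c_8 (see below).
        blocks = -(-object_bytes // block_bytes)  # ceiling division
        return blocks * csum_bytes

    # A 4 MB object with the default crc32c: 4 KB of checksums (~0.1% overhead).
    print(csum_overhead_bytes(4 * 1024 * 1024))                 # 4096
    # The same object with the truncated crc32c_8 checksum: 1 KB.
    print(csum_overhead_bytes(4 * 1024 * 1024, csum_bytes=1))   # 1024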
It is possible to obtain a smaller checksum value by truncating the checksum to +one or two bytes and reducing the metadata overhead. A drawback of this +approach is that it increases the probability of a random error going +undetected: about one in four billion given a 32-bit (4 byte) checksum, 1 in +65,536 given a 16-bit (2 byte) checksum, and 1 in 256 given an 8-bit (1 byte) +checksum. To use the smaller checksum values, select `crc32c_16` or `crc32c_8` +as the checksum algorithm. + +The *checksum algorithm* can be specified either via a per-pool ``csum_type`` +configuration option or via the global configuration option. For example: + +.. prompt:: bash $ + + ceph osd pool set <pool-name> csum_type <algorithm> + +.. confval:: bluestore_csum_type + +Inline Compression +================== + +BlueStore supports inline compression using `snappy`, `zlib`, `lz4`, or `zstd`. + +Whether data in BlueStore is compressed is determined by two factors: (1) the +*compression mode* and (2) any client hints associated with a write operation. +The compression modes are as follows: + +* **none**: Never compress data. +* **passive**: Do not compress data unless the write operation has a + *compressible* hint set. +* **aggressive**: Do compress data unless the write operation has an + *incompressible* hint set. +* **force**: Try to compress data no matter what. + +For more information about the *compressible* and *incompressible* I/O hints, +see :c:func:`rados_set_alloc_hint`. + +Note that data in BlueStore will be compressed only if the data chunk will be +sufficiently reduced in size (as determined by the ``bluestore compression +required ratio`` setting). No matter which compression modes have been used, if +the data chunk is too big, then it will be discarded and the original +(uncompressed) data will be stored instead. For example, if ``bluestore +compression required ratio`` is set to ``.7``, then data compression will take +place only if the size of the compressed data is no more than 70% of the size +of the original data. + +The *compression mode*, *compression algorithm*, *compression required ratio*, +*min blob size*, and *max blob size* settings can be specified either via a +per-pool property or via a global config option. To specify pool properties, +run the following commands: + +.. prompt:: bash $ + + ceph osd pool set <pool-name> compression_algorithm <algorithm> + ceph osd pool set <pool-name> compression_mode <mode> + ceph osd pool set <pool-name> compression_required_ratio <ratio> + ceph osd pool set <pool-name> compression_min_blob_size <size> + ceph osd pool set <pool-name> compression_max_blob_size <size> + +.. confval:: bluestore_compression_algorithm +.. confval:: bluestore_compression_mode +.. confval:: bluestore_compression_required_ratio +.. confval:: bluestore_compression_min_blob_size +.. confval:: bluestore_compression_min_blob_size_hdd +.. confval:: bluestore_compression_min_blob_size_ssd +.. confval:: bluestore_compression_max_blob_size +.. confval:: bluestore_compression_max_blob_size_hdd +.. confval:: bluestore_compression_max_blob_size_ssd + +.. _bluestore-rocksdb-sharding: + +RocksDB Sharding +================ + +BlueStore maintains several types of internal key-value data, all of which are +stored in RocksDB. Each data type in BlueStore is assigned a unique prefix. +Prior to the Pacific release, all key-value data was stored in a single RocksDB +column family: 'default'. In Pacific and later releases, however, BlueStore can +divide key-value data into several RocksDB column families. 
BlueStore achieves +better caching and more precise compaction when keys are similar: specifically, +when keys have similar access frequency, similar modification frequency, and a +similar lifetime. Under such conditions, performance is improved and less disk +space is required during compaction (because each column family is smaller and +is able to compact independently of the others). + +OSDs deployed in Pacific or later releases use RocksDB sharding by default. +However, if Ceph has been upgraded to Pacific or a later version from a +previous version, sharding is disabled on any OSDs that were created before +Pacific. + +To enable sharding and apply the Pacific defaults to a specific OSD, stop the +OSD and run the following command: + + .. prompt:: bash # + + ceph-bluestore-tool \ + --path \ + --sharding="m(3) p(3,0-12) o(3,0-13)=block_cache={type=binned_lru} l p" \ + reshard + +.. confval:: bluestore_rocksdb_cf +.. confval:: bluestore_rocksdb_cfs + +Throttling +========== + +.. confval:: bluestore_throttle_bytes +.. confval:: bluestore_throttle_deferred_bytes +.. confval:: bluestore_throttle_cost_per_io +.. confval:: bluestore_throttle_cost_per_io_hdd +.. confval:: bluestore_throttle_cost_per_io_ssd + +SPDK Usage +========== + +To use the SPDK driver for NVMe devices, you must first prepare your system. +See `SPDK document`__. + +.. __: http://www.spdk.io/doc/getting_started.html#getting_started_examples + +SPDK offers a script that will configure the device automatically. Run this +script with root permissions: + +.. prompt:: bash $ + + sudo src/spdk/scripts/setup.sh + +You will need to specify the subject NVMe device's device selector with the +"spdk:" prefix for ``bluestore_block_path``. + +In the following example, you first find the device selector of an Intel NVMe +SSD by running the following command: + +.. prompt:: bash $ + + lspci -mm -n -d -d 8086:0953 + +The form of the device selector is either ``DDDD:BB:DD.FF`` or +``DDDD.BB.DD.FF``. + +Next, supposing that ``0000:01:00.0`` is the device selector found in the +output of the ``lspci`` command, you can specify the device selector by running +the following command:: + + bluestore_block_path = "spdk:trtype:pcie traddr:0000:01:00.0" + +You may also specify a remote NVMeoF target over the TCP transport, as in the +following example:: + + bluestore_block_path = "spdk:trtype:tcp traddr:10.67.110.197 trsvcid:4420 subnqn:nqn.2019-02.io.spdk:cnode1" + +To run multiple SPDK instances per node, you must make sure each instance uses +its own DPDK memory by specifying for each instance the amount of DPDK memory +(in MB) that the instance will use. + +In most cases, a single device can be used for data, DB, and WAL. We describe +this strategy as *colocating* these components. Be sure to enter the below +settings to ensure that all I/Os are issued through SPDK:: + + bluestore_block_db_path = "" + bluestore_block_db_size = 0 + bluestore_block_wal_path = "" + bluestore_block_wal_size = 0 + +If these settings are not entered, then the current implementation will +populate the SPDK map files with kernel file system symbols and will use the +kernel driver to issue DB/WAL I/Os. + +Minimum Allocation Size +======================= + +There is a configured minimum amount of storage that BlueStore allocates on an +underlying storage device. In practice, this is the least amount of capacity +that even a tiny RADOS object can consume on each OSD's primary device. 
The +configuration option in question--:confval:`bluestore_min_alloc_size`--derives +its value from the value of either :confval:`bluestore_min_alloc_size_hdd` or +:confval:`bluestore_min_alloc_size_ssd`, depending on the OSD's ``rotational`` +attribute. Thus if an OSD is created on an HDD, BlueStore is initialized with +the current value of :confval:`bluestore_min_alloc_size_hdd`; but with SSD OSDs +(including NVMe devices), BlueStore is initialized with the current value of +:confval:`bluestore_min_alloc_size_ssd`. + +In Mimic and earlier releases, the default values were 64KB for rotational +media (HDD) and 16KB for non-rotational media (SSD). The Octopus release +changed the default value for non-rotational media (SSD) to 4KB, and the +Pacific release changed the default value for rotational media (HDD) to 4KB. + +These changes were driven by space amplification that was experienced by Ceph +RADOS Gateway (RGW) deployments that hosted large numbers of small files +(S3/Swift objects). + +For example, when an RGW client stores a 1 KB S3 object, that object is written +to a single RADOS object. In accordance with the default +:confval:`min_alloc_size` value, 4 KB of underlying drive space is allocated. +This means that roughly 3 KB (that is, 4 KB minus 1 KB) is allocated but never +used: this corresponds to 300% overhead or 25% efficiency. Similarly, a 5 KB +user object will be stored as two RADOS objects, a 4 KB RADOS object and a 1 KB +RADOS object, with the result that roughly 3 KB of device capacity is stranded. +In this case, however, the overhead percentage is much smaller. Think of this +in terms of the remainder from a modulus operation. The overhead *percentage* +thus decreases rapidly as object size increases. + +There is an additional subtlety that is easily missed: the amplification +phenomenon just described takes place for *each* replica. For example, when +using the default of three copies of data (3R), a 1 KB S3 object actually +strands roughly 9 KB of storage device capacity. If erasure coding (EC) is used +instead of replication, the amplification might be even higher: for a ``k=4, +m=2`` pool, our 1 KB S3 object allocates 24 KB (that is, 4 KB multiplied by 6) +of device capacity. + +When an RGW bucket pool contains many relatively large user objects, the effect +of this phenomenon is often negligible. However, with deployments that can +expect a significant fraction of relatively small user objects, the effect +should be taken into consideration. + +The 4KB default value aligns well with conventional HDD and SSD devices. +However, certain novel coarse-IU (Indirection Unit) QLC SSDs perform and wear +best when :confval:`bluestore_min_alloc_size_ssd` is specified at OSD creation +to match the device's IU: this might be 8KB, 16KB, or even 64KB. These novel +storage drives can achieve read performance that is competitive with that of +conventional TLC SSDs and write performance that is faster than that of HDDs, +with higher density and lower cost than TLC SSDs. + +Note that when creating OSDs on these novel devices, one must be careful to +apply the non-default value only to appropriate devices, and not to +conventional HDD and SSD devices. Errors can be avoided through careful ordering +of OSD creation, with custom OSD device classes, and especially by the use of +central configuration *masks*. 
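The space-amplification arithmetic above can be written down as a small sketch (illustrative only; it ignores striping details and BlueStore blob packing, and the EC branch assumes objects much smaller than a stripe, as in the example above):

.. code-block:: python

    import math

    def allocated_bytes(user_bytes, min_alloc=4096, replicas=3, ec_k=None, ec_m=None):
        """Rough device capacity consumed by one small user object."""
        def round_up(n):
            # Every chunk of data occupies a whole number of min_alloc units.
            return max(min_alloc, math.ceil(n / min_alloc) * min_alloc)
        if ec_k and ec_m:
            # k data shards plus m coding shards, each at least one unit.
            return (ec_k + ec_m) * round_up(math.ceil(user_bytes / ec_k))
        return replicas * round_up(user_bytes)

    # The 1 KB S3 object from the discussion above:
    print(allocated_bytes(1024))                  # 12288: ~9 KB stranded across 3 replicas
    print(allocated_bytes(1024, ec_k=4, ec_m=2))  # 24576: 4 KB * 6 shards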
+ +In Quincy and later releases, you can use the +:confval:`bluestore_use_optimal_io_size_for_min_alloc_size` option to allow +automatic discovery of the correct value as each OSD is created. Note that the +use of ``bcache``, ``OpenCAS``, ``dmcrypt``, ``ATA over Ethernet``, `iSCSI`, or +other device-layering and abstraction technologies might confound the +determination of correct values. Moreover, OSDs deployed on top of VMware +storage have sometimes been found to report a ``rotational`` attribute that +does not match the underlying hardware. + +We suggest inspecting such OSDs at startup via logs and admin sockets in order +to ensure that their behavior is correct. Be aware that this kind of inspection +might not work as expected with older kernels. To check for this issue, +examine the presence and value of ``/sys/block//queue/optimal_io_size``. + +.. note:: When running Reef or a later Ceph release, the ``min_alloc_size`` + baked into each OSD is conveniently reported by ``ceph osd metadata``. + +To inspect a specific OSD, run the following command: + +.. prompt:: bash # + + ceph osd metadata osd.1701 | egrep rotational\|alloc + +This space amplification might manifest as an unusually high ratio of raw to +stored data as reported by ``ceph df``. There might also be ``%USE`` / ``VAR`` +values reported by ``ceph osd df`` that are unusually high in comparison to +other, ostensibly identical, OSDs. Finally, there might be unexpected balancer +behavior in pools that use OSDs that have mismatched ``min_alloc_size`` values. + +This BlueStore attribute takes effect *only* at OSD creation; if the attribute +is changed later, a specific OSD's behavior will not change unless and until +the OSD is destroyed and redeployed with the appropriate option value(s). +Upgrading to a later Ceph release will *not* change the value used by OSDs that +were deployed under older releases or with other settings. + +.. confval:: bluestore_min_alloc_size +.. confval:: bluestore_min_alloc_size_hdd +.. confval:: bluestore_min_alloc_size_ssd +.. confval:: bluestore_use_optimal_io_size_for_min_alloc_size + +DSA (Data Streaming Accelerator) Usage +====================================== + +If you want to use the DML library to drive the DSA device for offloading +read/write operations on persistent memory (PMEM) in BlueStore, you need to +install `DML`_ and the `idxd-config`_ library. This will work only on machines +that have a SPR (Sapphire Rapids) CPU. + +.. _dml: https://github.com/intel/dml +.. _idxd-config: https://github.com/intel/idxd-config + +After installing the DML software, configure the shared work queues (WQs) with +reference to the following WQ configuration example: + +.. prompt:: bash $ + + accel-config config-wq --group-id=1 --mode=shared --wq-size=16 --threshold=15 --type=user --name="myapp1" --priority=10 --block-on-fault=1 dsa0/wq0.1 + accel-config config-engine dsa0/engine0.1 --group-id=1 + accel-config enable-device dsa0 + accel-config enable-wq dsa0/wq0.1 diff --git a/doc/rados/configuration/ceph-conf.rst b/doc/rados/configuration/ceph-conf.rst new file mode 100644 index 000000000..d8d5c9d03 --- /dev/null +++ b/doc/rados/configuration/ceph-conf.rst @@ -0,0 +1,715 @@ +.. _configuring-ceph: + +================== + Configuring Ceph +================== + +When Ceph services start, the initialization process activates a set of +daemons that run in the background. 
A :term:`Ceph Storage Cluster` runs at +least three types of daemons: + +- :term:`Ceph Monitor` (``ceph-mon``) +- :term:`Ceph Manager` (``ceph-mgr``) +- :term:`Ceph OSD Daemon` (``ceph-osd``) + +Any Ceph Storage Cluster that supports the :term:`Ceph File System` also runs +at least one :term:`Ceph Metadata Server` (``ceph-mds``). Any Cluster that +supports :term:`Ceph Object Storage` runs Ceph RADOS Gateway daemons +(``radosgw``). + +Each daemon has a number of configuration options, and each of those options +has a default value. Adjust the behavior of the system by changing these +configuration options. Make sure to understand the consequences before +overriding the default values, as it is possible to significantly degrade the +performance and stability of your cluster. Remember that default values +sometimes change between releases. For this reason, it is best to review the +version of this documentation that applies to your Ceph release. + +Option names +============ + +Each of the Ceph configuration options has a unique name that consists of words +formed with lowercase characters and connected with underscore characters +(``_``). + +When option names are specified on the command line, underscore (``_``) and +dash (``-``) characters can be used interchangeably (for example, +``--mon-host`` is equivalent to ``--mon_host``). + +When option names appear in configuration files, spaces can also be used in +place of underscores or dashes. However, for the sake of clarity and +convenience, we suggest that you consistently use underscores, as we do +throughout this documentation. + +Config sources +============== + +Each Ceph daemon, process, and library pulls its configuration from one or more +of the several sources listed below. Sources that occur later in the list +override those that occur earlier in the list (when both are present). + +- the compiled-in default value +- the monitor cluster's centralized configuration database +- a configuration file stored on the local host +- environment variables +- command-line arguments +- runtime overrides that are set by an administrator + +One of the first things a Ceph process does on startup is parse the +configuration options provided via the command line, via the environment, and +via the local configuration file. Next, the process contacts the monitor +cluster to retrieve centrally-stored configuration for the entire cluster. +After a complete view of the configuration is available, the startup of the +daemon or process will commence. + +.. _bootstrap-options: + +Bootstrap options +----------------- + +Bootstrap options are configuration options that affect the process's ability +to contact the monitors, to authenticate, and to retrieve the cluster-stored +configuration. For this reason, these options might need to be stored locally +on the node, and set by means of a local configuration file. These options +include the following: + +.. confval:: mon_host +.. confval:: mon_host_override + +- :confval:`mon_dns_srv_name` +- :confval:`mon_data`, :confval:`osd_data`, :confval:`mds_data`, + :confval:`mgr_data`, and similar options that define which local directory + the daemon stores its data in. +- :confval:`keyring`, :confval:`keyfile`, and/or :confval:`key`, which can be + used to specify the authentication credential to use to authenticate with the + monitor. Note that in most cases the default keyring location is in the data + directory specified above. + +In most cases, there is no reason to modify the default values of these +options. 
However, there is one exception to this: the :confval:`mon_host`
+option that identifies the addresses of the cluster's monitors. But when
+:ref:`DNS is used to identify monitors`, a local Ceph
+configuration file can be avoided entirely.
+
+
+Skipping monitor config
+-----------------------
+
+The option ``--no-mon-config`` can be passed in any command in order to skip
+the step that retrieves configuration information from the cluster's monitors.
+Skipping this retrieval step can be useful in cases where configuration is
+managed entirely via configuration files, or when maintenance activity needs to
+be done but the monitor cluster is down.
+
+.. _ceph-conf-file:
+
+Configuration sections
+======================
+
+Each of the configuration options associated with a single process or daemon
+has a single value. However, the values for a configuration option can vary
+across daemon types, and can vary even across different daemons of the same
+type. Ceph options that are stored in the monitor configuration database or in
+local configuration files are grouped into sections |---| so-called
+"configuration sections" |---| to indicate which daemons or clients they apply
+to.
+
+These sections include the following:
+
+.. confsec:: global
+
+   Settings under ``global`` affect all daemons and clients
+   in a Ceph Storage Cluster.
+
+   :example: ``log_file = /var/log/ceph/$cluster-$type.$id.log``
+
+.. confsec:: mon
+
+   Settings under ``mon`` affect all ``ceph-mon`` daemons in
+   the Ceph Storage Cluster, and override the same setting in
+   ``global``.
+
+   :example: ``mon_cluster_log_to_syslog = true``
+
+.. confsec:: mgr
+
+   Settings in the ``mgr`` section affect all ``ceph-mgr`` daemons in
+   the Ceph Storage Cluster, and override the same setting in
+   ``global``.
+
+   :example: ``mgr_stats_period = 10``
+
+.. confsec:: osd
+
+   Settings under ``osd`` affect all ``ceph-osd`` daemons in
+   the Ceph Storage Cluster, and override the same setting in
+   ``global``.
+
+   :example: ``osd_op_queue = wpq``
+
+.. confsec:: mds
+
+   Settings in the ``mds`` section affect all ``ceph-mds`` daemons in
+   the Ceph Storage Cluster, and override the same setting in
+   ``global``.
+
+   :example: ``mds_cache_memory_limit = 10G``
+
+.. confsec:: client
+
+   Settings under ``client`` affect all Ceph clients
+   (for example, mounted Ceph File Systems, mounted Ceph Block Devices)
+   as well as RADOS Gateway (RGW) daemons.
+
+   :example: ``objecter_inflight_ops = 512``
+
+
+Configuration sections can also specify an individual daemon or client name.
+For example, ``mon.foo``, ``osd.123``, and ``client.smith`` are all valid
+section names.
+
+Any given daemon will draw its settings from the global section, the daemon- or
+client-type section, and the section sharing its name. Settings in the
+most-specific section take precedence: for example, if the same option is
+specified in :confsec:`global`, :confsec:`mon`, and ``mon.foo`` in the same
+source (that is, in the same configuration file), the ``mon.foo`` setting will
+be used.
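+
+For example, given the following configuration-file fragment (a minimal
+illustrative sketch; the option and the values are arbitrary), the monitor
+``mon.foo`` would use ``debug_ms = 5``, every other monitor would use ``1``,
+and all remaining daemons and clients would use ``0``:
+
+.. code-block:: ini
+
+   [global]
+   debug_ms = 0
+
+   [mon]
+   debug_ms = 1
+
+   [mon.foo]
+   debug_ms = 5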
+
+If multiple values of the same configuration option are specified in the same
+section, the last value specified takes precedence.
+
+Note that values from the local configuration file always take precedence over
+values from the monitor configuration database, regardless of the section in
+which they appear.
+
+.. _ceph-metavariables:
+
+Metavariables
+=============
+
+Metavariables dramatically simplify Ceph storage cluster configuration. When a
+metavariable is set in a configuration value, Ceph expands the metavariable at
+the time the configuration value is used. In this way, Ceph metavariables
+behave similarly to the way that variable expansion works in the Bash shell.
+
+Ceph supports the following metavariables:
+
+.. describe:: $cluster
+
+   Expands to the Ceph Storage Cluster name. Useful when running
+   multiple Ceph Storage Clusters on the same hardware.
+
+   :example: ``/etc/ceph/$cluster.keyring``
+   :default: ``ceph``
+
+.. describe:: $type
+
+   Expands to a daemon or process type (for example, ``mds``, ``osd``, or
+   ``mon``).
+
+   :example: ``/var/lib/ceph/$type``
+
+.. describe:: $id
+
+   Expands to the daemon or client identifier. For
+   ``osd.0``, this would be ``0``; for ``mds.a``, it would
+   be ``a``.
+
+   :example: ``/var/lib/ceph/$type/$cluster-$id``
+
+.. describe:: $host
+
+   Expands to the host name where the process is running.
+
+.. describe:: $name
+
+   Expands to ``$type.$id``.
+
+   :example: ``/var/run/ceph/$cluster-$name.asok``
+
+.. describe:: $pid
+
+   Expands to the daemon PID.
+
+   :example: ``/var/run/ceph/$cluster-$name-$pid.asok``
+
+
+Ceph configuration file
+=======================
+
+On startup, Ceph processes search for a configuration file in the
+following locations:
+
+#. ``$CEPH_CONF`` (that is, the path following the ``$CEPH_CONF``
+   environment variable)
+#. ``-c path/path`` (that is, the ``-c`` command line argument)
+#. ``/etc/ceph/$cluster.conf``
+#. ``~/.ceph/$cluster.conf``
+#. ``./$cluster.conf`` (that is, in the current working directory)
+#. On FreeBSD systems only, ``/usr/local/etc/ceph/$cluster.conf``
+
+Here ``$cluster`` is the cluster's name (default: ``ceph``).
+
+The Ceph configuration file uses an ``ini`` style syntax. You can add "comment
+text" after a pound sign (#) or a semicolon (;). For example:
+
+.. code-block:: ini
+
+   # <-- A number sign (#) precedes a comment.
+   ; A comment may be anything.
+   # Comments always follow a semicolon (;) or a pound sign (#) on each line.
+   # The end of the line terminates a comment.
+   # We recommend that you provide comments in your configuration file(s).
+
+
+.. _ceph-conf-settings:
+
+Config file section names
+-------------------------
+
+The configuration file is divided into sections. Each section must begin with a
+valid configuration section name (see `Configuration sections`_, above) that is
+surrounded by square brackets. For example:
+
+.. code-block:: ini
+
+   [global]
+   debug_ms = 0
+
+   [osd]
+   debug_ms = 1
+
+   [osd.1]
+   debug_ms = 10
+
+   [osd.2]
+   debug_ms = 10
+
+Config file option values
+-------------------------
+
+The value of a configuration option is a string. If the string is too long to
+fit on a single line, you can put a backslash (``\``) at the end of the line
+and the backslash will act as a line continuation marker. In such a case, the
+value of the option will be the string after ``=`` in the current line,
+combined with the string in the next line. Here is an example::
+
+  [global]
+  foo = long long ago\
+  long ago
+
+In this example, the value of the "``foo``" option is "``long long ago long
+ago``".
+
+An option value typically ends with either a newline or a comment. For
+example:
+
+.. code-block:: ini
+
+   [global]
+   obscure_one = difficult to explain # I will try harder in next release
+   simpler_one = nothing to explain
+
+In this example, the value of the "``obscure_one``" option is "``difficult to
+explain``" and the value of the "``simpler_one``" option is "``nothing to
+explain``".
+
+When an option value contains spaces, it can be enclosed within single quotes
+or double quotes in order to make its scope clear and in order to make sure
+that the first space in the value is not interpreted as the end of the value.
+For example:
+
+.. code-block:: ini
+
+   [global]
+   line = "to be, or not to be"
+
+In option values, there are four characters that are treated as escape
+characters: ``=``, ``#``, ``;`` and ``[``. They are permitted to occur in an
+option value only if they are immediately preceded by the backslash character
+(``\``). For example:
+
+.. code-block:: ini
+
+   [global]
+   secret = "i love \# and \["
+
+Each configuration option falls under one of the following types:
+
+.. describe:: int
+
+   64-bit signed integer. Some SI suffixes are supported, such as "K", "M",
+   "G", "T", "P", and "E" (meaning, respectively, 10\ :sup:`3`, 10\ :sup:`6`,
+   10\ :sup:`9`, etc.). "B" is the only supported unit string. Thus "1K", "1M",
+   "128B" and "-1" are all valid option values. When a negative value is
+   assigned to a threshold option, this can indicate that the option is
+   "unlimited" -- that is, that there is no threshold or limit in effect.
+
+   :example: ``42``, ``-1``
+
+.. describe:: uint
+
+   This differs from ``int`` only in that negative values are not
+   permitted.
+
+   :example: ``256``, ``0``
+
+.. describe:: str
+
+   A string encoded in UTF-8. Certain characters are not permitted. Reference
+   the above notes for the details.
+
+   :example: ``"hello world"``, ``"i love \#"``, ``yet-another-name``
+
+.. describe:: boolean
+
+   Typically either of the two values ``true`` or ``false``. However, any
+   integer is permitted: "0" implies ``false``, and any non-zero value implies
+   ``true``.
+
+   :example: ``true``, ``false``, ``1``, ``0``
+
+.. describe:: addr
+
+   A single address, optionally prefixed with ``v1``, ``v2`` or ``any`` for the
+   messenger protocol. If no prefix is specified, the ``v2`` protocol is used.
+   For more details, see :ref:`address_formats`.
+
+   :example: ``v1:1.2.3.4:567``, ``v2:1.2.3.4:567``, ``1.2.3.4:567``, ``2409:8a1e:8fb6:aa20:1260:4bff:fe92:18f5::567``, ``[::1]:6789``
+
+.. describe:: addrvec
+
+   A set of addresses separated by ",". The addresses can be optionally quoted
+   with ``[`` and ``]``.
+
+   :example: ``[v1:1.2.3.4:567,v2:1.2.3.4:568]``, ``v1:1.2.3.4:567,v1:1.2.3.14:567``, ``[2409:8a1e:8fb6:aa20:1260:4bff:fe92:18f5::567], [2409:8a1e:8fb6:aa20:1260:4bff:fe92:18f5::568]``
+
+.. describe:: uuid
+
+   The string format of a uuid defined by `RFC4122
+   `_. Certain variants are also
+   supported: for more details, see `Boost document
+   `_.
+
+   :example: ``f81d4fae-7dec-11d0-a765-00a0c91e6bf6``
+
+.. describe:: size
+
+   64-bit unsigned integer. Both SI prefixes and IEC prefixes are supported.
+   "B" is the only supported unit string. Negative values are not permitted.
+
+   :example: ``1Ki``, ``1K``, ``1KiB`` and ``1B``.
+
+.. describe:: secs
+
+   Denotes a duration of time. The default unit of time is the second.
+   The following units of time are supported:
+
+   * second: ``s``, ``sec``, ``second``, ``seconds``
+   * minute: ``m``, ``min``, ``minute``, ``minutes``
+   * hour: ``hs``, ``hr``, ``hour``, ``hours``
+   * day: ``d``, ``day``, ``days``
+   * week: ``w``, ``wk``, ``week``, ``weeks``
+   * month: ``mo``, ``month``, ``months``
+   * year: ``y``, ``yr``, ``year``, ``years``
+
+   :example: ``1 m``, ``1m`` and ``1 week``
+
+.. _ceph-conf-database:
+
+Monitor configuration database
+==============================
+
+The monitor cluster manages a database of configuration options that can be
+consumed by the entire cluster. This allows for streamlined central
+configuration management of the entire system. For ease of administration and
+transparency, the vast majority of configuration options can and should be
+stored in this database.
+
+Some settings might need to be stored in local configuration files because they
+affect the ability of the process to connect to the monitors, to authenticate,
+and to fetch configuration information. In most cases this applies only to the
+``mon_host`` option. This issue can be avoided by using :ref:`DNS SRV
+records`.
+
+Sections and masks
+------------------
+
+Configuration options stored by the monitor can live in a global section, in a
+daemon-type section, or in a specific daemon section. In this respect, they are
+no different from the options in a configuration file.
+
+In addition, options may have a *mask* associated with them to further restrict
+which daemons or clients the option applies to. Masks take two forms:
+
+#. ``type:location`` where ``type`` is a CRUSH property like ``rack`` or
+   ``host``, and ``location`` is a value for that property. For example,
+   ``host:foo`` would limit the option only to daemons or clients
+   running on a particular host.
+#. ``class:device-class`` where ``device-class`` is the name of a CRUSH
+   device class (for example, ``hdd`` or ``ssd``). For example,
+   ``class:ssd`` would limit the option only to OSDs backed by SSDs.
+   (This mask has no effect on non-OSD daemons or clients.)
+
+In commands that specify a configuration option, the argument of the option (in
+the following examples, this is the "who" string) may be a section name, a
+mask, or a combination of both separated by a slash character (``/``). For
+example, ``osd/rack:foo`` would refer to all OSD daemons in the ``foo`` rack.
+
+When configuration options are shown, the section name and mask are presented
+in separate fields or columns to make them more readable.
+
+Commands
+--------
+
+The following CLI commands are used to configure the cluster:
+
+* ``ceph config dump`` dumps the entire monitor configuration
+  database for the cluster.
+
+* ``ceph config get <who>`` dumps the configuration options stored in
+  the monitor configuration database for a specific daemon or client
+  (for example, ``mds.a``).
+
+* ``ceph config get