From 19fcec84d8d7d21e796c7624e521b60d28ee21ed Mon Sep 17 00:00:00 2001 From: Daniel Baumann Date: Sun, 7 Apr 2024 20:45:59 +0200 Subject: Adding upstream version 16.2.11+ds. Signed-off-by: Daniel Baumann --- doc/rados/api/index.rst | 23 + doc/rados/api/libcephsqlite.rst | 438 ++++++ doc/rados/api/librados-intro.rst | 1052 ++++++++++++++ doc/rados/api/librados.rst | 187 +++ doc/rados/api/libradospp.rst | 9 + doc/rados/api/objclass-sdk.rst | 37 + doc/rados/api/python.rst | 425 ++++++ doc/rados/command/list-inconsistent-obj.json | 237 ++++ doc/rados/command/list-inconsistent-snap.json | 86 ++ doc/rados/configuration/auth-config-ref.rst | 362 +++++ doc/rados/configuration/bluestore-config-ref.rst | 482 +++++++ doc/rados/configuration/ceph-conf.rst | 689 +++++++++ doc/rados/configuration/common.rst | 218 +++ doc/rados/configuration/demo-ceph.conf | 31 + doc/rados/configuration/filestore-config-ref.rst | 367 +++++ doc/rados/configuration/general-config-ref.rst | 66 + doc/rados/configuration/index.rst | 54 + doc/rados/configuration/journal-ref.rst | 119 ++ doc/rados/configuration/mclock-config-ref.rst | 395 ++++++ doc/rados/configuration/mon-config-ref.rst | 1243 +++++++++++++++++ doc/rados/configuration/mon-lookup-dns.rst | 56 + doc/rados/configuration/mon-osd-interaction.rst | 396 ++++++ doc/rados/configuration/ms-ref.rst | 133 ++ doc/rados/configuration/msgr2.rst | 233 ++++ doc/rados/configuration/network-config-ref.rst | 454 ++++++ doc/rados/configuration/osd-config-ref.rst | 1127 +++++++++++++++ doc/rados/configuration/pool-pg-config-ref.rst | 282 ++++ doc/rados/configuration/pool-pg.conf | 21 + doc/rados/configuration/storage-devices.rst | 96 ++ doc/rados/index.rst | 78 ++ doc/rados/man/index.rst | 31 + doc/rados/operations/add-or-rm-mons.rst | 446 ++++++ doc/rados/operations/add-or-rm-osds.rst | 386 +++++ doc/rados/operations/balancer.rst | 206 +++ doc/rados/operations/bluestore-migration.rst | 338 +++++ doc/rados/operations/cache-tiering.rst | 552 ++++++++ doc/rados/operations/change-mon-elections.rst | 88 ++ doc/rados/operations/control.rst | 601 ++++++++ doc/rados/operations/crush-map-edits.rst | 747 ++++++++++ doc/rados/operations/crush-map.rst | 1126 +++++++++++++++ doc/rados/operations/data-placement.rst | 43 + doc/rados/operations/devices.rst | 208 +++ doc/rados/operations/erasure-code-clay.rst | 240 ++++ doc/rados/operations/erasure-code-isa.rst | 107 ++ doc/rados/operations/erasure-code-jerasure.rst | 121 ++ doc/rados/operations/erasure-code-lrc.rst | 388 ++++++ doc/rados/operations/erasure-code-profile.rst | 126 ++ doc/rados/operations/erasure-code-shec.rst | 145 ++ doc/rados/operations/erasure-code.rst | 262 ++++ doc/rados/operations/health-checks.rst | 1549 +++++++++++++++++++++ doc/rados/operations/index.rst | 98 ++ doc/rados/operations/monitoring-osd-pg.rst | 553 ++++++++ doc/rados/operations/monitoring.rst | 647 +++++++++ doc/rados/operations/operating.rst | 255 ++++ doc/rados/operations/pg-concepts.rst | 102 ++ doc/rados/operations/pg-repair.rst | 81 ++ doc/rados/operations/pg-states.rst | 118 ++ doc/rados/operations/placement-groups.rst | 798 +++++++++++ doc/rados/operations/pools.rst | 900 ++++++++++++ doc/rados/operations/stretch-mode.rst | 215 +++ doc/rados/operations/upmap.rst | 105 ++ doc/rados/operations/user-management.rst | 823 +++++++++++ doc/rados/troubleshooting/community.rst | 28 + doc/rados/troubleshooting/cpu-profiling.rst | 67 + doc/rados/troubleshooting/index.rst | 19 + doc/rados/troubleshooting/log-and-debug.rst | 599 ++++++++ 
doc/rados/troubleshooting/memory-profiling.rst | 142 ++ doc/rados/troubleshooting/troubleshooting-mon.rst | 613 ++++++++ doc/rados/troubleshooting/troubleshooting-osd.rst | 620 +++++++++ doc/rados/troubleshooting/troubleshooting-pg.rst | 693 +++++++++ 70 files changed, 24582 insertions(+) create mode 100644 doc/rados/api/index.rst create mode 100644 doc/rados/api/libcephsqlite.rst create mode 100644 doc/rados/api/librados-intro.rst create mode 100644 doc/rados/api/librados.rst create mode 100644 doc/rados/api/libradospp.rst create mode 100644 doc/rados/api/objclass-sdk.rst create mode 100644 doc/rados/api/python.rst create mode 100644 doc/rados/command/list-inconsistent-obj.json create mode 100644 doc/rados/command/list-inconsistent-snap.json create mode 100644 doc/rados/configuration/auth-config-ref.rst create mode 100644 doc/rados/configuration/bluestore-config-ref.rst create mode 100644 doc/rados/configuration/ceph-conf.rst create mode 100644 doc/rados/configuration/common.rst create mode 100644 doc/rados/configuration/demo-ceph.conf create mode 100644 doc/rados/configuration/filestore-config-ref.rst create mode 100644 doc/rados/configuration/general-config-ref.rst create mode 100644 doc/rados/configuration/index.rst create mode 100644 doc/rados/configuration/journal-ref.rst create mode 100644 doc/rados/configuration/mclock-config-ref.rst create mode 100644 doc/rados/configuration/mon-config-ref.rst create mode 100644 doc/rados/configuration/mon-lookup-dns.rst create mode 100644 doc/rados/configuration/mon-osd-interaction.rst create mode 100644 doc/rados/configuration/ms-ref.rst create mode 100644 doc/rados/configuration/msgr2.rst create mode 100644 doc/rados/configuration/network-config-ref.rst create mode 100644 doc/rados/configuration/osd-config-ref.rst create mode 100644 doc/rados/configuration/pool-pg-config-ref.rst create mode 100644 doc/rados/configuration/pool-pg.conf create mode 100644 doc/rados/configuration/storage-devices.rst create mode 100644 doc/rados/index.rst create mode 100644 doc/rados/man/index.rst create mode 100644 doc/rados/operations/add-or-rm-mons.rst create mode 100644 doc/rados/operations/add-or-rm-osds.rst create mode 100644 doc/rados/operations/balancer.rst create mode 100644 doc/rados/operations/bluestore-migration.rst create mode 100644 doc/rados/operations/cache-tiering.rst create mode 100644 doc/rados/operations/change-mon-elections.rst create mode 100644 doc/rados/operations/control.rst create mode 100644 doc/rados/operations/crush-map-edits.rst create mode 100644 doc/rados/operations/crush-map.rst create mode 100644 doc/rados/operations/data-placement.rst create mode 100644 doc/rados/operations/devices.rst create mode 100644 doc/rados/operations/erasure-code-clay.rst create mode 100644 doc/rados/operations/erasure-code-isa.rst create mode 100644 doc/rados/operations/erasure-code-jerasure.rst create mode 100644 doc/rados/operations/erasure-code-lrc.rst create mode 100644 doc/rados/operations/erasure-code-profile.rst create mode 100644 doc/rados/operations/erasure-code-shec.rst create mode 100644 doc/rados/operations/erasure-code.rst create mode 100644 doc/rados/operations/health-checks.rst create mode 100644 doc/rados/operations/index.rst create mode 100644 doc/rados/operations/monitoring-osd-pg.rst create mode 100644 doc/rados/operations/monitoring.rst create mode 100644 doc/rados/operations/operating.rst create mode 100644 doc/rados/operations/pg-concepts.rst create mode 100644 doc/rados/operations/pg-repair.rst create mode 100644 
doc/rados/operations/pg-states.rst create mode 100644 doc/rados/operations/placement-groups.rst create mode 100644 doc/rados/operations/pools.rst create mode 100644 doc/rados/operations/stretch-mode.rst create mode 100644 doc/rados/operations/upmap.rst create mode 100644 doc/rados/operations/user-management.rst create mode 100644 doc/rados/troubleshooting/community.rst create mode 100644 doc/rados/troubleshooting/cpu-profiling.rst create mode 100644 doc/rados/troubleshooting/index.rst create mode 100644 doc/rados/troubleshooting/log-and-debug.rst create mode 100644 doc/rados/troubleshooting/memory-profiling.rst create mode 100644 doc/rados/troubleshooting/troubleshooting-mon.rst create mode 100644 doc/rados/troubleshooting/troubleshooting-osd.rst create mode 100644 doc/rados/troubleshooting/troubleshooting-pg.rst (limited to 'doc/rados') diff --git a/doc/rados/api/index.rst b/doc/rados/api/index.rst new file mode 100644 index 000000000..63bc7222d --- /dev/null +++ b/doc/rados/api/index.rst @@ -0,0 +1,23 @@ +=========================== + Ceph Storage Cluster APIs +=========================== + +The :term:`Ceph Storage Cluster` has a messaging layer protocol that enables +clients to interact with a :term:`Ceph Monitor` and a :term:`Ceph OSD Daemon`. +``librados`` provides this functionality to :term:`Ceph Client`\s in the form of +a library. All Ceph Clients either use ``librados`` or the same functionality +encapsulated in ``librados`` to interact with the object store. For example, +``librbd`` and ``libcephfs`` leverage this functionality. You may use +``librados`` to interact with Ceph directly (e.g., an application that talks to +Ceph, your own interface to Ceph, etc.). + + +.. toctree:: + :maxdepth: 2 + + Introduction to librados + librados (C) + librados (C++) + librados (Python) + libcephsqlite (SQLite) + object class diff --git a/doc/rados/api/libcephsqlite.rst b/doc/rados/api/libcephsqlite.rst new file mode 100644 index 000000000..76ab306bb --- /dev/null +++ b/doc/rados/api/libcephsqlite.rst @@ -0,0 +1,438 @@ +.. _libcephsqlite: + +================ + Ceph SQLite VFS +================ + +This `SQLite VFS`_ may be used for storing and accessing a `SQLite`_ database +backed by RADOS. This allows you to fully decentralize your database using +Ceph's object store for improved availability, accessibility, and use of +storage. + +Note what this is not: a distributed SQL engine. SQLite on RADOS can be thought +of like RBD as compared to CephFS: RBD puts a disk image on RADOS for the +purposes of exclusive access by a machine and generally does not allow parallel +access by other machines; on the other hand, CephFS allows fully distributed +access to a file system from many client mounts. SQLite on RADOS is meant to be +accessed by a single SQLite client database connection at a given time. The +database may be manipulated safely by multiple clients only in a serial fashion +controlled by RADOS locks managed by the Ceph SQLite VFS. + + +Usage +^^^^^ + +Normal unmodified applications (including the sqlite command-line toolset +binary) may load the *ceph* VFS using the `SQLite Extension Loading API`_. + +.. code:: sql + + .LOAD libcephsqlite.so + +or during the invocation of ``sqlite3`` + +.. code:: sh + + sqlite3 -cmd '.load libcephsqlite.so' + +A database file is formatted as a SQLite URI:: + + file:///<"*"poolid|poolname>:[namespace]/?vfs=ceph + +The RADOS ``namespace`` is optional. Note the triple ``///`` in the path. The URI +authority must be empty or localhost in SQLite. 
Only the path part of the URI +is parsed. For this reason, the URI will not parse properly if you only use two +``//``. + +A complete example of (optionally) creating a database and opening: + +.. code:: sh + + sqlite3 -cmd '.load libcephsqlite.so' -cmd '.open file:///foo:bar/baz.db?vfs=ceph' + +Note you cannot specify the database file as the normal positional argument to +``sqlite3``. This is because the ``.load libcephsqlite.so`` command is applied +after opening the database, but opening the database depends on the extension +being loaded first. + +An example passing the pool integer id and no RADOS namespace: + +.. code:: sh + + sqlite3 -cmd '.load libcephsqlite.so' -cmd '.open file:///*2:/baz.db?vfs=ceph' + +Like other Ceph tools, the *ceph* VFS looks at some environment variables that +help with configuring which Ceph cluster to communicate with and which +credential to use. Here would be a typical configuration: + +.. code:: sh + + export CEPH_CONF=/path/to/ceph.conf + export CEPH_KEYRING=/path/to/ceph.keyring + export CEPH_ARGS='--id myclientid' + ./runmyapp + # or + sqlite3 -cmd '.load libcephsqlite.so' -cmd '.open file:///foo:bar/baz.db?vfs=ceph' + +The default operation would look at the standard Ceph configuration file path +using the ``client.admin`` user. + + +User +^^^^ + +The *ceph* VFS requires a user credential with read access to the monitors, the +ability to blocklist dead clients of the database, and access to the OSDs +hosting the database. This can be done with authorizations as simply as: + +.. code:: sh + + ceph auth get-or-create client.X mon 'allow r, allow command "osd blocklist" with blocklistop=add' osd 'allow rwx' + +.. note:: The terminology change from ``blacklist`` to ``blocklist``; older clusters may require using the old terms. + +You may also simplify using the ``simple-rados-client-with-blocklist`` profile: + +.. code:: sh + + ceph auth get-or-create client.X mon 'profile simple-rados-client-with-blocklist' osd 'allow rwx' + +To learn why blocklisting is necessary, see :ref:`libcephsqlite-corrupt`. + + +Page Size +^^^^^^^^^ + +SQLite allows configuring the page size prior to creating a new database. It is +advisable to increase this config to 65536 (64K) when using RADOS backed +databases to reduce the number of OSD reads/writes and thereby improve +throughput and latency. + +.. code:: sql + + PRAGMA page_size = 65536 + +You may also try other values according to your application needs but note that +64K is the max imposed by SQLite. + + +Cache +^^^^^ + +The ceph VFS does not do any caching of reads or buffering of writes. Instead, +and more appropriately, the SQLite page cache is used. You may find it is too small +for most workloads and should therefore increase it significantly: + + +.. code:: sql + + PRAGMA cache_size = 4096 + +Which will cache 4096 pages or 256MB (with 64K ``page_cache``). + + +Journal Persistence +^^^^^^^^^^^^^^^^^^^ + +By default, SQLite deletes the journal for every transaction. This can be +expensive as the *ceph* VFS must delete every object backing the journal for each +transaction. For this reason, it is much faster and simpler to ask SQLite to +**persist** the journal. In this mode, SQLite will invalidate the journal via a +write to its header. This is done as: + +.. code:: sql + + PRAGMA journal_mode = PERSIST + +The cost of this may be increased unused space according to the high-water size +of the rollback journal (based on transaction type and size). 
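+
+For applications that embed SQLite rather than drive it through the
+command-line shell, the same settings can be applied programmatically. The
+following is a minimal Python sketch, assuming a Python/SQLite build with
+extension loading enabled and ``libcephsqlite.so`` resolvable by the dynamic
+loader; the pool ``foo`` and database name ``baz.db`` are placeholders:
+
+.. code:: python
+
+    import sqlite3
+
+    # Register the ceph VFS by loading the extension on a throwaway
+    # connection; VFS registration is global to the process in SQLite.
+    boot = sqlite3.connect(":memory:")
+    boot.enable_load_extension(True)      # may be disabled in some builds
+    boot.load_extension("libcephsqlite.so")
+    boot.close()
+
+    # Open the RADOS-backed database: pool "foo", empty namespace, "baz.db".
+    db = sqlite3.connect("file:///foo:/baz.db?vfs=ceph", uri=True)
+    db.execute("PRAGMA page_size = 65536")      # set before the first write
+    db.execute("PRAGMA cache_size = 4096")
+    db.execute("PRAGMA journal_mode = PERSIST")
+    db.execute("CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v TEXT)")
+    db.commit()
+    db.close()
+
+Because VFS registration is process-wide, the extension needs to be loaded
+only once; any later connection in the same process may open databases with
+``vfs=ceph``.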
+ + +Exclusive Lock Mode +^^^^^^^^^^^^^^^^^^^ + +SQLite operates in a ``NORMAL`` locking mode where each transaction requires +locking the backing database file. This can add unnecessary overhead to +transactions when you know there's only ever one user of the database at a +given time. You can have SQLite lock the database once for the duration of the +connection using: + +.. code:: sql + + PRAGMA locking_mode = EXCLUSIVE + +This can more than **halve** the time taken to perform a transaction. Keep in +mind this prevents other clients from accessing the database. + +In this locking mode, each write transaction to the database requires 3 +synchronization events: once to write to the journal, another to write to the +database file, and a final write to invalidate the journal header (in +``PERSIST`` journaling mode). + + +WAL Journal +^^^^^^^^^^^ + +The `WAL Journal Mode`_ is only available when SQLite is operating in exclusive +lock mode. This is because it requires shared memory communication with other +readers and writers when in the ``NORMAL`` locking mode. + +As with local disk databases, WAL mode may significantly reduce small +transaction latency. Testing has shown it can provide more than 50% speedup +over persisted rollback journals in exclusive locking mode. You can expect +around 150-250 transactions per second depending on size. + + +Performance Notes +^^^^^^^^^^^^^^^^^ + +The filing backend for the database on RADOS is asynchronous as much as +possible. Still, performance can be anywhere from 3x-10x slower than a local +database on SSD. Latency can be a major factor. It is advisable to be familiar +with SQL transactions and other strategies for efficient database updates. +Depending on the performance of the underlying pool, you can expect small +transactions to take up to 30 milliseconds to complete. If you use the +``EXCLUSIVE`` locking mode, it can be reduced further to 15 milliseconds per +transaction. A WAL journal in ``EXCLUSIVE`` locking mode can further reduce +this as low as ~2-5 milliseconds (or the time to complete a RADOS write; you +won't get better than that!). + +There is no limit to the size of a SQLite database on RADOS imposed by the Ceph +VFS. There are standard `SQLite Limits`_ to be aware of, notably the maximum +database size of 281 TB. Large databases may or may not be performant on Ceph. +Experimentation for your own use-case is advised. + +Be aware that read-heavy queries could take significant amounts of time as +reads are necessarily synchronous (due to the VFS API). No readahead is yet +performed by the VFS. + + +Recommended Use-Cases +^^^^^^^^^^^^^^^^^^^^^ + +The original purpose of this module was to support saving relational or large +data in RADOS which needs to span multiple objects. Many current applications +with trivial state try to use RADOS omap storage on a single object but this +cannot scale without striping data across multiple objects. Unfortunately, it +is non-trivial to design a store spanning multiple objects which is consistent +and also simple to use. SQLite can be used to bridge that gap. + + +Parallel Access +^^^^^^^^^^^^^^^ + +The VFS does not yet support concurrent readers. All database access is protected +by a single exclusive lock. + + +Export or Extract Database out of RADOS +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The database is striped on RADOS and can be extracted using the RADOS cli toolset. + +.. 
code:: sh + + rados --pool=foo --striper get bar.db local-bar.db + rados --pool=foo --striper get bar.db-journal local-bar.db-journal + sqlite3 local-bar.db ... + +Keep in mind the rollback journal is also striped and will need to be extracted +as well if the database was in the middle of a transaction. If you're using +WAL, that journal will need to be extracted as well. + +Keep in mind that extracting the database using the striper uses the same RADOS +locks as those used by the *ceph* VFS. However, the journal file locks are not +used by the *ceph* VFS (SQLite only locks the main database file) so there is a +potential race with other SQLite clients when extracting both files. That could +result in fetching a corrupt journal. + +Instead of manually extracting the files, it would be more advisable to use the +`SQLite Backup`_ mechanism instead. + + +Temporary Tables +^^^^^^^^^^^^^^^^ + +Temporary tables backed by the ceph VFS are not supported. The main reason for +this is that the VFS lacks context about where it should put the database, i.e. +which RADOS pool. The persistent database associated with the temporary +database is not communicated via the SQLite VFS API. + +Instead, it's suggested to attach a secondary local or `In-Memory Database`_ +and put the temporary tables there. Alternatively, you may set a connection +pragma: + +.. code:: sql + + PRAGMA temp_store=memory + + +.. _libcephsqlite-breaking-locks: + +Breaking Locks +^^^^^^^^^^^^^^ + +Access to the database file is protected by an exclusive lock on the first +object stripe of the database. If the application fails without unlocking the +database (e.g. a segmentation fault), the lock is not automatically unlocked, +even if the client connection is blocklisted afterward. Eventually, the lock +will timeout subject to the configurations:: + + cephsqlite_lock_renewal_timeout = 30000 + +The timeout is in milliseconds. Once the timeout is reached, the OSD will +expire the lock and allow clients to relock. When this occurs, the database +will be recovered by SQLite and the in-progress transaction rolled back. The +new client recovering the database will also blocklist the old client to +prevent potential database corruption from rogue writes. + +The holder of the exclusive lock on the database will periodically renew the +lock so it does not lose the lock. This is necessary for large transactions or +database connections operating in ``EXCLUSIVE`` locking mode. The lock renewal +interval is adjustable via:: + + cephsqlite_lock_renewal_interval = 2000 + +This configuration is also in units of milliseconds. + +It is possible to break the lock early if you know the client is gone for good +(e.g. blocklisted). This allows restoring database access to clients +immediately. For example: + +.. code:: sh + + $ rados --pool=foo --namespace bar lock info baz.db.0000000000000000 striper.lock + {"name":"striper.lock","type":"exclusive","tag":"","lockers":[{"name":"client.4463","cookie":"555c7208-db39-48e8-a4d7-3ba92433a41a","description":"SimpleRADOSStriper","expiration":"0.000000","addr":"127.0.0.1:0/1831418345"}]} + + $ rados --pool=foo --namespace bar lock break baz.db.0000000000000000 striper.lock client.4463 --lock-cookie 555c7208-db39-48e8-a4d7-3ba92433a41a + +.. _libcephsqlite-corrupt: + +How to Corrupt Your Database +^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +There is the usual reading on `How to Corrupt Your SQLite Database`_ that you +should review before using this tool. 
To add to that, the most likely way you +may corrupt your database is by a rogue process transiently losing network +connectivity and then resuming its work. The exclusive RADOS lock it held will +be lost but it cannot know that immediately. Any work it might do after +regaining network connectivity could corrupt the database. + +The *ceph* VFS library defaults do not allow for this scenario to occur. The Ceph +VFS will blocklist the last owner of the exclusive lock on the database if it +detects incomplete cleanup. + +By blocklisting the old client, it's no longer possible for the old client to +resume its work on the database when it returns (subject to blocklist +expiration, 3600 seconds by default). To turn off blocklisting the prior client, change:: + + cephsqlite_blocklist_dead_locker = false + +Do NOT do this unless you know database corruption cannot result due to other +guarantees. If this config is true (the default), the *ceph* VFS will cowardly +fail if it cannot blocklist the prior instance (due to lack of authorization, +for example). + +One example where out-of-band mechanisms exist to blocklist the last dead +holder of the exclusive lock on the database is in the ``ceph-mgr``. The +monitors are made aware of the RADOS connection used for the *ceph* VFS and will +blocklist the instance during ``ceph-mgr`` failover. This prevents a zombie +``ceph-mgr`` from continuing work and potentially corrupting the database. For +this reason, it is not necessary for the *ceph* VFS to do the blocklist command +in the new instance of the ``ceph-mgr`` (but it still does so, harmlessly). + +To blocklist the *ceph* VFS manually, you may see the instance address of the +*ceph* VFS using the ``ceph_status`` SQL function: + +.. code:: sql + + SELECT ceph_status(); + +.. code:: + + {"id":788461300,"addr":"172.21.10.4:0/1472139388"} + +You may easily manipulate that information using the `JSON1 extension`_: + +.. code:: sql + + SELECT json_extract(ceph_status(), '$.addr'); + +.. code:: + + 172.21.10.4:0/3563721180 + +This is the address you would pass to the ceph blocklist command: + +.. code:: sh + + ceph osd blocklist add 172.21.10.4:0/3082314560 + + +Performance Statistics +^^^^^^^^^^^^^^^^^^^^^^ + +The *ceph* VFS provides a SQLite function, ``ceph_perf``, for querying the +performance statistics of the VFS. The data is from "performance counters" as +in other Ceph services normally queried via an admin socket. + +.. code:: sql + + SELECT ceph_perf(); + +.. 
code:: + + {"libcephsqlite_vfs":{"op_open":{"avgcount":2,"sum":0.150001291,"avgtime":0.075000645},"op_delete":{"avgcount":0,"sum":0.000000000,"avgtime":0.000000000},"op_access":{"avgcount":1,"sum":0.003000026,"avgtime":0.003000026},"op_fullpathname":{"avgcount":1,"sum":0.064000551,"avgtime":0.064000551},"op_currenttime":{"avgcount":0,"sum":0.000000000,"avgtime":0.000000000},"opf_close":{"avgcount":1,"sum":0.000000000,"avgtime":0.000000000},"opf_read":{"avgcount":3,"sum":0.036000310,"avgtime":0.012000103},"opf_write":{"avgcount":0,"sum":0.000000000,"avgtime":0.000000000},"opf_truncate":{"avgcount":0,"sum":0.000000000,"avgtime":0.000000000},"opf_sync":{"avgcount":0,"sum":0.000000000,"avgtime":0.000000000},"opf_filesize":{"avgcount":2,"sum":0.000000000,"avgtime":0.000000000},"opf_lock":{"avgcount":1,"sum":0.158001360,"avgtime":0.158001360},"opf_unlock":{"avgcount":1,"sum":0.101000871,"avgtime":0.101000871},"opf_checkreservedlock":{"avgcount":1,"sum":0.002000017,"avgtime":0.002000017},"opf_filecontrol":{"avgcount":4,"sum":0.000000000,"avgtime":0.000000000},"opf_sectorsize":{"avgcount":0,"sum":0.000000000,"avgtime":0.000000000},"opf_devicecharacteristics":{"avgcount":4,"sum":0.000000000,"avgtime":0.000000000}},"libcephsqlite_striper":{"update_metadata":0,"update_allocated":0,"update_size":0,"update_version":0,"shrink":0,"shrink_bytes":0,"lock":1,"unlock":1}} + +You may easily manipulate that information using the `JSON1 extension`_: + +.. code:: sql + + SELECT json_extract(ceph_perf(), '$.libcephsqlite_vfs.opf_sync.avgcount'); + +.. code:: + + 776 + +That tells you the number of times SQLite has called the xSync method of the +`SQLite IO Methods`_ of the VFS (for **all** open database connections in the +process). You could analyze the performance stats before and after a number of +queries to see the number of file system syncs required (this would just be +proportional to the number of transactions). Alternatively, you may be more +interested in the average latency to complete a write: + +.. code:: sql + + SELECT json_extract(ceph_perf(), '$.libcephsqlite_vfs.opf_write'); + +.. code:: + + {"avgcount":7873,"sum":0.675005797,"avgtime":0.000085736} + +Which would tell you there have been 7873 writes with an average +time-to-complete of 85 microseconds. That clearly shows the calls are executed +asynchronously. Returning to sync: + +.. code:: sql + + SELECT json_extract(ceph_perf(), '$.libcephsqlite_vfs.opf_sync'); + +.. code:: + + {"avgcount":776,"sum":4.802041199,"avgtime":0.006188197} + +6 milliseconds were spent on average executing a sync call. This gathers all of +the asynchronous writes as well as an asynchronous update to the size of the +striped file. + + +.. _SQLite: https://sqlite.org/index.html +.. _SQLite VFS: https://www.sqlite.org/vfs.html +.. _SQLite Backup: https://www.sqlite.org/backup.html +.. _SQLite Limits: https://www.sqlite.org/limits.html +.. _SQLite Extension Loading API: https://sqlite.org/c3ref/load_extension.html +.. _In-Memory Database: https://www.sqlite.org/inmemorydb.html +.. _WAL Journal Mode: https://sqlite.org/wal.html +.. _How to Corrupt Your SQLite Database: https://www.sqlite.org/howtocorrupt.html +.. _JSON1 Extension: https://www.sqlite.org/json1.html +.. 
_SQLite IO Methods: https://www.sqlite.org/c3ref/io_methods.html diff --git a/doc/rados/api/librados-intro.rst b/doc/rados/api/librados-intro.rst new file mode 100644 index 000000000..9bffa3114 --- /dev/null +++ b/doc/rados/api/librados-intro.rst @@ -0,0 +1,1052 @@ +========================== + Introduction to librados +========================== + +The :term:`Ceph Storage Cluster` provides the basic storage service that allows +:term:`Ceph` to uniquely deliver **object, block, and file storage** in one +unified system. However, you are not limited to using the RESTful, block, or +POSIX interfaces. Based upon :abbr:`RADOS (Reliable Autonomic Distributed Object +Store)`, the ``librados`` API enables you to create your own interface to the +Ceph Storage Cluster. + +The ``librados`` API enables you to interact with the two types of daemons in +the Ceph Storage Cluster: + +- The :term:`Ceph Monitor`, which maintains a master copy of the cluster map. +- The :term:`Ceph OSD Daemon` (OSD), which stores data as objects on a storage node. + +.. ditaa:: + +---------------------------------+ + | Ceph Storage Cluster Protocol | + | (librados) | + +---------------------------------+ + +---------------+ +---------------+ + | OSDs | | Monitors | + +---------------+ +---------------+ + +This guide provides a high-level introduction to using ``librados``. +Refer to :doc:`../../architecture` for additional details of the Ceph +Storage Cluster. To use the API, you need a running Ceph Storage Cluster. +See `Installation (Quick)`_ for details. + + +Step 1: Getting librados +======================== + +Your client application must bind with ``librados`` to connect to the Ceph +Storage Cluster. You must install ``librados`` and any required packages to +write applications that use ``librados``. The ``librados`` API is written in +C++, with additional bindings for C, Python, Java and PHP. + + +Getting librados for C/C++ +-------------------------- + +To install ``librados`` development support files for C/C++ on Debian/Ubuntu +distributions, execute the following: + +.. prompt:: bash $ + + sudo apt-get install librados-dev + +To install ``librados`` development support files for C/C++ on RHEL/CentOS +distributions, execute the following: + +.. prompt:: bash $ + + sudo yum install librados2-devel + +Once you install ``librados`` for developers, you can find the required +headers for C/C++ under ``/usr/include/rados``: + +.. prompt:: bash $ + + ls /usr/include/rados + + +Getting librados for Python +--------------------------- + +The ``rados`` module provides ``librados`` support to Python +applications. The ``librados-dev`` package for Debian/Ubuntu +and the ``librados2-devel`` package for RHEL/CentOS will install the +``python-rados`` package for you. You may install ``python-rados`` +directly too. + +To install ``librados`` development support files for Python on Debian/Ubuntu +distributions, execute the following: + +.. prompt:: bash $ + + sudo apt-get install python3-rados + +To install ``librados`` development support files for Python on RHEL/CentOS +distributions, execute the following: + +.. prompt:: bash $ + + sudo yum install python-rados + +To install ``librados`` development support files for Python on SLE/openSUSE +distributions, execute the following: + +.. prompt:: bash $ + + sudo zypper install python3-rados + +You can find the module under ``/usr/share/pyshared`` on Debian systems, +or under ``/usr/lib/python*/site-packages`` on CentOS/RHEL systems. 
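+
+A quick way to confirm that the binding is installed and importable is to
+query the library version; no cluster connection is required, and the exact
+version reported will vary with your release:
+
+.. code-block:: python
+
+    import rados
+
+    # Prints the librados version (major, minor, extra) that the
+    # binding is linked against.
+    print(rados.Rados().version())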
+ + +Getting librados for Java +------------------------- + +To install ``librados`` for Java, you need to execute the following procedure: + +#. Install ``jna.jar``. For Debian/Ubuntu, execute: + + .. prompt:: bash $ + + sudo apt-get install libjna-java + + For CentOS/RHEL, execute: + + .. prompt:: bash $ + + sudo yum install jna + + The JAR files are located in ``/usr/share/java``. + +#. Clone the ``rados-java`` repository: + + .. prompt:: bash $ + + git clone --recursive https://github.com/ceph/rados-java.git + +#. Build the ``rados-java`` repository: + + .. prompt:: bash $ + + cd rados-java + ant + + The JAR file is located under ``rados-java/target``. + +#. Copy the JAR for RADOS to a common location (e.g., ``/usr/share/java``) and + ensure that it and the JNA JAR are in your JVM's classpath. For example: + + .. prompt:: bash $ + + sudo cp target/rados-0.1.3.jar /usr/share/java/rados-0.1.3.jar + sudo ln -s /usr/share/java/jna-3.2.7.jar /usr/lib/jvm/default-java/jre/lib/ext/jna-3.2.7.jar + sudo ln -s /usr/share/java/rados-0.1.3.jar /usr/lib/jvm/default-java/jre/lib/ext/rados-0.1.3.jar + +To build the documentation, execute the following: + +.. prompt:: bash $ + + ant docs + + +Getting librados for PHP +------------------------- + +To install the ``librados`` extension for PHP, you need to execute the following procedure: + +#. Install php-dev. For Debian/Ubuntu, execute: + + .. prompt:: bash $ + + sudo apt-get install php5-dev build-essential + + For CentOS/RHEL, execute: + + .. prompt:: bash $ + + sudo yum install php-devel + +#. Clone the ``phprados`` repository: + + .. prompt:: bash $ + + git clone https://github.com/ceph/phprados.git + +#. Build ``phprados``: + + .. prompt:: bash $ + + cd phprados + phpize + ./configure + make + sudo make install + +#. Enable ``phprados`` by adding the following line to ``php.ini``:: + + extension=rados.so + + +Step 2: Configuring a Cluster Handle +==================================== + +A :term:`Ceph Client`, via ``librados``, interacts directly with OSDs to store +and retrieve data. To interact with OSDs, the client app must invoke +``librados`` and connect to a Ceph Monitor. Once connected, ``librados`` +retrieves the :term:`Cluster Map` from the Ceph Monitor. When the client app +wants to read or write data, it creates an I/O context and binds to a +:term:`Pool`. The pool has an associated :term:`CRUSH rule` that defines how it +will place data in the storage cluster. Via the I/O context, the client +provides the object name to ``librados``, which takes the object name +and the cluster map (i.e., the topology of the cluster) and `computes`_ the +placement group and `OSD`_ for locating the data. Then the client application +can read or write data. The client app doesn't need to learn about the topology +of the cluster directly. + +.. ditaa:: + +--------+ Retrieves +---------------+ + | Client |------------>| Cluster Map | + +--------+ +---------------+ + | + v Writes + /-----\ + | obj | + \-----/ + | To + v + +--------+ +---------------+ + | Pool |---------->| CRUSH Rule | + +--------+ Selects +---------------+ + + +The Ceph Storage Cluster handle encapsulates the client configuration, including: + +- The `user ID`_ for ``rados_create()`` or user name for ``rados_create2()`` + (preferred). 
+- The :term:`cephx` authentication key +- The monitor ID and IP address +- Logging levels +- Debugging levels + +Thus, the first steps in using the cluster from your app are to 1) create +a cluster handle that your app will use to connect to the storage cluster, +and then 2) use that handle to connect. To connect to the cluster, the +app must supply a monitor address, a username and an authentication key +(cephx is enabled by default). + +.. tip:: Talking to different Ceph Storage Clusters – or to the same cluster + with different users – requires different cluster handles. + +RADOS provides a number of ways for you to set the required values. For +the monitor and encryption key settings, an easy way to handle them is to ensure +that your Ceph configuration file contains a ``keyring`` path to a keyring file +and at least one monitor address (e.g., ``mon host``). For example:: + + [global] + mon host = 192.168.1.1 + keyring = /etc/ceph/ceph.client.admin.keyring + +Once you create the handle, you can read a Ceph configuration file to configure +the handle. You can also pass arguments to your app and parse them with the +function for parsing command line arguments (e.g., ``rados_conf_parse_argv()``), +or parse Ceph environment variables (e.g., ``rados_conf_parse_env()``). Some +wrappers may not implement convenience methods, so you may need to implement +these capabilities. The following diagram provides a high-level flow for the +initial connection. + + +.. ditaa:: + +---------+ +---------+ + | Client | | Monitor | + +---------+ +---------+ + | | + |-----+ create | + | | cluster | + |<----+ handle | + | | + |-----+ read | + | | config | + |<----+ file | + | | + | connect | + |-------------->| + | | + |<--------------| + | connected | + | | + + +Once connected, your app can invoke functions that affect the whole cluster +with only the cluster handle. For example, once you have a cluster +handle, you can: + +- Get cluster statistics +- Use Pool Operation (exists, create, list, delete) +- Get and set the configuration + + +One of the powerful features of Ceph is the ability to bind to different pools. +Each pool may have a different number of placement groups, object replicas and +replication strategies. For example, a pool could be set up as a "hot" pool that +uses SSDs for frequently used objects or a "cold" pool that uses erasure coding. + +The main difference in the various ``librados`` bindings is between C and +the object-oriented bindings for C++, Java and Python. The object-oriented +bindings use objects to represent cluster handles, IO Contexts, iterators, +exceptions, etc. + + +C Example +--------- + +For C, creating a simple cluster handle using the ``admin`` user, configuring +it and connecting to the cluster might look something like this: + +.. code-block:: c + + #include + #include + #include + #include + + int main (int argc, const char **argv) + { + + /* Declare the cluster handle and required arguments. */ + rados_t cluster; + char cluster_name[] = "ceph"; + char user_name[] = "client.admin"; + uint64_t flags = 0; + + /* Initialize the cluster handle with the "ceph" cluster name and the "client.admin" user */ + int err; + err = rados_create2(&cluster, cluster_name, user_name, flags); + + if (err < 0) { + fprintf(stderr, "%s: Couldn't create the cluster handle! %s\n", argv[0], strerror(-err)); + exit(EXIT_FAILURE); + } else { + printf("\nCreated a cluster handle.\n"); + } + + + /* Read a Ceph configuration file to configure the cluster handle. 
*/ + err = rados_conf_read_file(cluster, "/etc/ceph/ceph.conf"); + if (err < 0) { + fprintf(stderr, "%s: cannot read config file: %s\n", argv[0], strerror(-err)); + exit(EXIT_FAILURE); + } else { + printf("\nRead the config file.\n"); + } + + /* Read command line arguments */ + err = rados_conf_parse_argv(cluster, argc, argv); + if (err < 0) { + fprintf(stderr, "%s: cannot parse command line arguments: %s\n", argv[0], strerror(-err)); + exit(EXIT_FAILURE); + } else { + printf("\nRead the command line arguments.\n"); + } + + /* Connect to the cluster */ + err = rados_connect(cluster); + if (err < 0) { + fprintf(stderr, "%s: cannot connect to cluster: %s\n", argv[0], strerror(-err)); + exit(EXIT_FAILURE); + } else { + printf("\nConnected to the cluster.\n"); + } + + } + +Compile your client and link to ``librados`` using ``-lrados``. For example: + +.. prompt:: bash $ + + gcc ceph-client.c -lrados -o ceph-client + + +C++ Example +----------- + +The Ceph project provides a C++ example in the ``ceph/examples/librados`` +directory. For C++, a simple cluster handle using the ``admin`` user requires +you to initialize a ``librados::Rados`` cluster handle object: + +.. code-block:: c++ + + #include + #include + #include + + int main(int argc, const char **argv) + { + + int ret = 0; + + /* Declare the cluster handle and required variables. */ + librados::Rados cluster; + char cluster_name[] = "ceph"; + char user_name[] = "client.admin"; + uint64_t flags = 0; + + /* Initialize the cluster handle with the "ceph" cluster name and "client.admin" user */ + { + ret = cluster.init2(user_name, cluster_name, flags); + if (ret < 0) { + std::cerr << "Couldn't initialize the cluster handle! error " << ret << std::endl; + return EXIT_FAILURE; + } else { + std::cout << "Created a cluster handle." << std::endl; + } + } + + /* Read a Ceph configuration file to configure the cluster handle. */ + { + ret = cluster.conf_read_file("/etc/ceph/ceph.conf"); + if (ret < 0) { + std::cerr << "Couldn't read the Ceph configuration file! error " << ret << std::endl; + return EXIT_FAILURE; + } else { + std::cout << "Read the Ceph configuration file." << std::endl; + } + } + + /* Read command line arguments */ + { + ret = cluster.conf_parse_argv(argc, argv); + if (ret < 0) { + std::cerr << "Couldn't parse command line options! error " << ret << std::endl; + return EXIT_FAILURE; + } else { + std::cout << "Parsed command line options." << std::endl; + } + } + + /* Connect to the cluster */ + { + ret = cluster.connect(); + if (ret < 0) { + std::cerr << "Couldn't connect to cluster! error " << ret << std::endl; + return EXIT_FAILURE; + } else { + std::cout << "Connected to the cluster." << std::endl; + } + } + + return 0; + } + + +Compile the source; then, link ``librados`` using ``-lrados``. +For example: + +.. prompt:: bash $ + + g++ -g -c ceph-client.cc -o ceph-client.o + g++ -g ceph-client.o -lrados -o ceph-client + + + +Python Example +-------------- + +Python uses the ``admin`` id and the ``ceph`` cluster name by default, and +will read the standard ``ceph.conf`` file if the conffile parameter is +set to the empty string. The Python binding converts C++ errors +into exceptions. + + +.. code-block:: python + + import rados + + try: + cluster = rados.Rados(conffile='') + except TypeError as e: + print 'Argument validation error: ', e + raise e + + print "Created cluster handle." + + try: + cluster.connect() + except Exception as e: + print "connection error: ", e + raise e + finally: + print "Connected to the cluster." 
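+
+The example above uses Python 2 syntax. With the ``python3-rados`` package
+installed as described earlier, an equivalent Python 3 sketch looks like
+this:
+
+.. code-block:: python
+
+    import rados
+
+    try:
+        cluster = rados.Rados(conffile='')
+    except TypeError as e:
+        print('Argument validation error:', e)
+        raise
+
+    print('Created cluster handle.')
+
+    try:
+        cluster.connect()
+    except Exception as e:
+        print('Connection error:', e)
+        raise
+
+    print('Connected to the cluster.')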
+ + +Execute the example to verify that it connects to your cluster: + +.. prompt:: bash $ + + python ceph-client.py + + +Java Example +------------ + +Java requires you to specify the user ID (``admin``) or user name +(``client.admin``), and uses the ``ceph`` cluster name by default . The Java +binding converts C++-based errors into exceptions. + +.. code-block:: java + + import com.ceph.rados.Rados; + import com.ceph.rados.RadosException; + + import java.io.File; + + public class CephClient { + public static void main (String args[]){ + + try { + Rados cluster = new Rados("admin"); + System.out.println("Created cluster handle."); + + File f = new File("/etc/ceph/ceph.conf"); + cluster.confReadFile(f); + System.out.println("Read the configuration file."); + + cluster.connect(); + System.out.println("Connected to the cluster."); + + } catch (RadosException e) { + System.out.println(e.getMessage() + ": " + e.getReturnValue()); + } + } + } + + +Compile the source; then, run it. If you have copied the JAR to +``/usr/share/java`` and sym linked from your ``ext`` directory, you won't need +to specify the classpath. For example: + +.. prompt:: bash $ + + javac CephClient.java + java CephClient + + +PHP Example +------------ + +With the RADOS extension enabled in PHP you can start creating a new cluster handle very easily: + +.. code-block:: php + + | + | | | + | write ack | | + |<--------------+---------------| + | | | + | write xattr | | + |---------------+-------------->| + | | | + | xattr ack | | + |<--------------+---------------| + | | | + | read data | | + |---------------+-------------->| + | | | + | read ack | | + |<--------------+---------------| + | | | + | remove data | | + |---------------+-------------->| + | | | + | remove ack | | + |<--------------+---------------| + + + +RADOS enables you to interact both synchronously and asynchronously. Once your +app has an I/O Context, read/write operations only require you to know the +object/xattr name. The CRUSH algorithm encapsulated in ``librados`` uses the +cluster map to identify the appropriate OSD. OSD daemons handle the replication, +as described in `Smart Daemons Enable Hyperscale`_. The ``librados`` library also +maps objects to placement groups, as described in `Calculating PG IDs`_. + +The following examples use the default ``data`` pool. However, you may also +use the API to list pools, ensure they exist, or create and delete pools. For +the write operations, the examples illustrate how to use synchronous mode. For +the read operations, the examples illustrate how to use asynchronous mode. + +.. important:: Use caution when deleting pools with this API. If you delete + a pool, the pool and ALL DATA in the pool will be lost. + + +C Example +--------- + + +.. code-block:: c + + #include + #include + #include + #include + + int main (int argc, const char **argv) + { + /* + * Continued from previous C example, where cluster handle and + * connection are established. First declare an I/O Context. + */ + + rados_ioctx_t io; + char *poolname = "data"; + + err = rados_ioctx_create(cluster, poolname, &io); + if (err < 0) { + fprintf(stderr, "%s: cannot open rados pool %s: %s\n", argv[0], poolname, strerror(-err)); + rados_shutdown(cluster); + exit(EXIT_FAILURE); + } else { + printf("\nCreated I/O context.\n"); + } + + /* Write data to the cluster synchronously. 
*/ + err = rados_write(io, "hw", "Hello World!", 12, 0); + if (err < 0) { + fprintf(stderr, "%s: Cannot write object \"hw\" to pool %s: %s\n", argv[0], poolname, strerror(-err)); + rados_ioctx_destroy(io); + rados_shutdown(cluster); + exit(1); + } else { + printf("\nWrote \"Hello World\" to object \"hw\".\n"); + } + + char xattr[] = "en_US"; + err = rados_setxattr(io, "hw", "lang", xattr, 5); + if (err < 0) { + fprintf(stderr, "%s: Cannot write xattr to pool %s: %s\n", argv[0], poolname, strerror(-err)); + rados_ioctx_destroy(io); + rados_shutdown(cluster); + exit(1); + } else { + printf("\nWrote \"en_US\" to xattr \"lang\" for object \"hw\".\n"); + } + + /* + * Read data from the cluster asynchronously. + * First, set up asynchronous I/O completion. + */ + rados_completion_t comp; + err = rados_aio_create_completion(NULL, NULL, NULL, &comp); + if (err < 0) { + fprintf(stderr, "%s: Could not create aio completion: %s\n", argv[0], strerror(-err)); + rados_ioctx_destroy(io); + rados_shutdown(cluster); + exit(1); + } else { + printf("\nCreated AIO completion.\n"); + } + + /* Next, read data using rados_aio_read. */ + char read_res[100]; + err = rados_aio_read(io, "hw", comp, read_res, 12, 0); + if (err < 0) { + fprintf(stderr, "%s: Cannot read object. %s %s\n", argv[0], poolname, strerror(-err)); + rados_ioctx_destroy(io); + rados_shutdown(cluster); + exit(1); + } else { + printf("\nRead object \"hw\". The contents are:\n %s \n", read_res); + } + + /* Wait for the operation to complete */ + rados_aio_wait_for_complete(comp); + + /* Release the asynchronous I/O complete handle to avoid memory leaks. */ + rados_aio_release(comp); + + + char xattr_res[100]; + err = rados_getxattr(io, "hw", "lang", xattr_res, 5); + if (err < 0) { + fprintf(stderr, "%s: Cannot read xattr. %s %s\n", argv[0], poolname, strerror(-err)); + rados_ioctx_destroy(io); + rados_shutdown(cluster); + exit(1); + } else { + printf("\nRead xattr \"lang\" for object \"hw\". The contents are:\n %s \n", xattr_res); + } + + err = rados_rmxattr(io, "hw", "lang"); + if (err < 0) { + fprintf(stderr, "%s: Cannot remove xattr. %s %s\n", argv[0], poolname, strerror(-err)); + rados_ioctx_destroy(io); + rados_shutdown(cluster); + exit(1); + } else { + printf("\nRemoved xattr \"lang\" for object \"hw\".\n"); + } + + err = rados_remove(io, "hw"); + if (err < 0) { + fprintf(stderr, "%s: Cannot remove object. %s %s\n", argv[0], poolname, strerror(-err)); + rados_ioctx_destroy(io); + rados_shutdown(cluster); + exit(1); + } else { + printf("\nRemoved object \"hw\".\n"); + } + + } + + + +C++ Example +----------- + + +.. code-block:: c++ + + #include + #include + #include + + int main(int argc, const char **argv) + { + + /* Continued from previous C++ example, where cluster handle and + * connection are established. First declare an I/O Context. + */ + + librados::IoCtx io_ctx; + const char *pool_name = "data"; + + { + ret = cluster.ioctx_create(pool_name, io_ctx); + if (ret < 0) { + std::cerr << "Couldn't set up ioctx! error " << ret << std::endl; + exit(EXIT_FAILURE); + } else { + std::cout << "Created an ioctx for the pool." << std::endl; + } + } + + + /* Write an object synchronously. */ + { + librados::bufferlist bl; + bl.append("Hello World!"); + ret = io_ctx.write_full("hw", bl); + if (ret < 0) { + std::cerr << "Couldn't write object! error " << ret << std::endl; + exit(EXIT_FAILURE); + } else { + std::cout << "Wrote new object 'hw' " << std::endl; + } + } + + + /* + * Add an xattr to the object. 
+ */ + { + librados::bufferlist lang_bl; + lang_bl.append("en_US"); + ret = io_ctx.setxattr("hw", "lang", lang_bl); + if (ret < 0) { + std::cerr << "failed to set xattr version entry! error " + << ret << std::endl; + exit(EXIT_FAILURE); + } else { + std::cout << "Set the xattr 'lang' on our object!" << std::endl; + } + } + + + /* + * Read the object back asynchronously. + */ + { + librados::bufferlist read_buf; + int read_len = 4194304; + + //Create I/O Completion. + librados::AioCompletion *read_completion = librados::Rados::aio_create_completion(); + + //Send read request. + ret = io_ctx.aio_read("hw", read_completion, &read_buf, read_len, 0); + if (ret < 0) { + std::cerr << "Couldn't start read object! error " << ret << std::endl; + exit(EXIT_FAILURE); + } + + // Wait for the request to complete, and check that it succeeded. + read_completion->wait_for_complete(); + ret = read_completion->get_return_value(); + if (ret < 0) { + std::cerr << "Couldn't read object! error " << ret << std::endl; + exit(EXIT_FAILURE); + } else { + std::cout << "Read object hw asynchronously with contents.\n" + << read_buf.c_str() << std::endl; + } + } + + + /* + * Read the xattr. + */ + { + librados::bufferlist lang_res; + ret = io_ctx.getxattr("hw", "lang", lang_res); + if (ret < 0) { + std::cerr << "failed to get xattr version entry! error " + << ret << std::endl; + exit(EXIT_FAILURE); + } else { + std::cout << "Got the xattr 'lang' from object hw!" + << lang_res.c_str() << std::endl; + } + } + + + /* + * Remove the xattr. + */ + { + ret = io_ctx.rmxattr("hw", "lang"); + if (ret < 0) { + std::cerr << "Failed to remove xattr! error " + << ret << std::endl; + exit(EXIT_FAILURE); + } else { + std::cout << "Removed the xattr 'lang' from our object!" << std::endl; + } + } + + /* + * Remove the object. + */ + { + ret = io_ctx.remove("hw"); + if (ret < 0) { + std::cerr << "Couldn't remove object! error " << ret << std::endl; + exit(EXIT_FAILURE); + } else { + std::cout << "Removed object 'hw'." << std::endl; + } + } + } + + + +Python Example +-------------- + +.. code-block:: python + + print "\n\nI/O Context and Object Operations" + print "=================================" + + print "\nCreating a context for the 'data' pool" + if not cluster.pool_exists('data'): + raise RuntimeError('No data pool exists') + ioctx = cluster.open_ioctx('data') + + print "\nWriting object 'hw' with contents 'Hello World!' to pool 'data'." + ioctx.write("hw", "Hello World!") + print "Writing XATTR 'lang' with value 'en_US' to object 'hw'" + ioctx.set_xattr("hw", "lang", "en_US") + + + print "\nWriting object 'bm' with contents 'Bonjour tout le monde!' to pool 'data'." + ioctx.write("bm", "Bonjour tout le monde!") + print "Writing XATTR 'lang' with value 'fr_FR' to object 'bm'" + ioctx.set_xattr("bm", "lang", "fr_FR") + + print "\nContents of object 'hw'\n------------------------" + print ioctx.read("hw") + + print "\n\nGetting XATTR 'lang' from object 'hw'" + print ioctx.get_xattr("hw", "lang") + + print "\nContents of object 'bm'\n------------------------" + print ioctx.read("bm") + + print "Getting XATTR 'lang' from object 'bm'" + print ioctx.get_xattr("bm", "lang") + + + print "\nRemoving object 'hw'" + ioctx.remove_object("hw") + + print "Removing object 'bm'" + ioctx.remove_object("bm") + + +Java-Example +------------ + +.. 
code-block:: java + + import com.ceph.rados.Rados; + import com.ceph.rados.RadosException; + + import java.io.File; + import com.ceph.rados.IoCTX; + + public class CephClient { + public static void main (String args[]){ + + try { + Rados cluster = new Rados("admin"); + System.out.println("Created cluster handle."); + + File f = new File("/etc/ceph/ceph.conf"); + cluster.confReadFile(f); + System.out.println("Read the configuration file."); + + cluster.connect(); + System.out.println("Connected to the cluster."); + + IoCTX io = cluster.ioCtxCreate("data"); + + String oidone = "hw"; + String contentone = "Hello World!"; + io.write(oidone, contentone); + + String oidtwo = "bm"; + String contenttwo = "Bonjour tout le monde!"; + io.write(oidtwo, contenttwo); + + String[] objects = io.listObjects(); + for (String object: objects) + System.out.println(object); + + io.remove(oidone); + io.remove(oidtwo); + + cluster.ioCtxDestroy(io); + + } catch (RadosException e) { + System.out.println(e.getMessage() + ": " + e.getReturnValue()); + } + } + } + + +PHP Example +----------- + +.. code-block:: php + + ack_end, NULL); + } + + void commit_callback(rados_completion_t comp, void *arg) { + req_duration *dur = (req_duration *) arg; + gettimeofday(&dur->commit_end, NULL); + } + + int output_append_latency(rados_ioctx_t io, const char *data, size_t len, size_t num_writes) { + req_duration times[num_writes]; + rados_completion_t comps[num_writes]; + for (size_t i = 0; i < num_writes; ++i) { + gettimeofday(×[i].start, NULL); + int err = rados_aio_create_completion((void*) ×[i], ack_callback, commit_callback, &comps[i]); + if (err < 0) { + fprintf(stderr, "Error creating rados completion: %s\n", strerror(-err)); + return err; + } + char obj_name[100]; + snprintf(obj_name, sizeof(obj_name), "foo%ld", (unsigned long)i); + err = rados_aio_append(io, obj_name, comps[i], data, len); + if (err < 0) { + fprintf(stderr, "Error from rados_aio_append: %s", strerror(-err)); + return err; + } + } + // wait until all requests finish *and* the callbacks complete + rados_aio_flush(io); + // the latencies can now be analyzed + printf("Request # | Ack latency (s) | Commit latency (s)\n"); + for (size_t i = 0; i < num_writes; ++i) { + // don't forget to free the completions + rados_aio_release(comps[i]); + struct timeval ack_lat, commit_lat; + timersub(×[i].ack_end, ×[i].start, &ack_lat); + timersub(×[i].commit_end, ×[i].start, &commit_lat); + printf("%9ld | %8ld.%06ld | %10ld.%06ld\n", (unsigned long) i, ack_lat.tv_sec, ack_lat.tv_usec, commit_lat.tv_sec, commit_lat.tv_usec); + } + return 0; + } + +Note that all the :c:type:`rados_completion_t` must be freed with :c:func:`rados_aio_release` to avoid leaking memory. + + +API calls +========= + + .. autodoxygenfile:: rados_types.h + .. autodoxygenfile:: librados.h diff --git a/doc/rados/api/libradospp.rst b/doc/rados/api/libradospp.rst new file mode 100644 index 000000000..08483c8d4 --- /dev/null +++ b/doc/rados/api/libradospp.rst @@ -0,0 +1,9 @@ +================== + LibradosPP (C++) +================== + +.. note:: The librados C++ API is not guaranteed to be API+ABI stable + between major releases. All applications using the librados C++ API must + be recompiled and relinked against a specific Ceph release. + +.. todo:: write me! 
diff --git a/doc/rados/api/objclass-sdk.rst b/doc/rados/api/objclass-sdk.rst new file mode 100644 index 000000000..6b1162fd4 --- /dev/null +++ b/doc/rados/api/objclass-sdk.rst @@ -0,0 +1,37 @@ +=========================== +SDK for Ceph Object Classes +=========================== + +`Ceph` can be extended by creating shared object classes called `Ceph Object +Classes`. The existing framework to build these object classes has dependencies +on the internal functionality of `Ceph`, which restricts users to build object +classes within the tree. The aim of this project is to create an independent +object class interface, which can be used to build object classes outside the +`Ceph` tree. This allows us to have two types of object classes, 1) those that +have in-tree dependencies and reside in the tree and 2) those that can make use +of the `Ceph Object Class SDK framework` and can be built outside of the `Ceph` +tree because they do not depend on any internal implementation of `Ceph`. This +project decouples object class development from Ceph and encourages creation +and distribution of object classes as packages. + +In order to demonstrate the use of this framework, we have provided an example +called ``cls_sdk``, which is a very simple object class that makes use of the +SDK framework. This object class resides in the ``src/cls`` directory. + +Installing objclass.h +--------------------- + +The object class interface that enables out-of-tree development of object +classes resides in ``src/include/rados/`` and gets installed with `Ceph` +installation. After running ``make install``, you should be able to see it +in ``/include/rados``. :: + + ls /usr/local/include/rados + +Using the SDK example +--------------------- + +The ``cls_sdk`` object class resides in ``src/cls/sdk/``. This gets built and +loaded into Ceph, with the Ceph build process. You can run the +``ceph_test_cls_sdk`` unittest, which resides in ``src/test/cls_sdk/``, +to test this class. diff --git a/doc/rados/api/python.rst b/doc/rados/api/python.rst new file mode 100644 index 000000000..0c9cb9e98 --- /dev/null +++ b/doc/rados/api/python.rst @@ -0,0 +1,425 @@ +=================== + Librados (Python) +=================== + +The ``rados`` module is a thin Python wrapper for ``librados``. + +Installation +============ + +To install Python libraries for Ceph, see `Getting librados for Python`_. + + +Getting Started +=============== + +You can create your own Ceph client using Python. The following tutorial will +show you how to import the Ceph Python module, connect to a Ceph cluster, and +perform object operations as a ``client.admin`` user. + +.. note:: To use the Ceph Python bindings, you must have access to a + running Ceph cluster. To set one up quickly, see `Getting Started`_. + +First, create a Python source file for your Ceph client. :: + :linenos: + + sudo vim client.py + + +Import the Module +----------------- + +To use the ``rados`` module, import it into your source file. + +.. code-block:: python + :linenos: + + import rados + + +Configure a Cluster Handle +-------------------------- + +Before connecting to the Ceph Storage Cluster, create a cluster handle. By +default, the cluster handle assumes a cluster named ``ceph`` (i.e., the default +for deployment tools, and our Getting Started guides too), and a +``client.admin`` user name. You may change these defaults to suit your needs. + +To connect to the Ceph Storage Cluster, your application needs to know where to +find the Ceph Monitor. 
Provide this information to your application by +specifying the path to your Ceph configuration file, which contains the location +of the initial Ceph monitors. + +.. code-block:: python + :linenos: + + import rados, sys + + #Create Handle Examples. + cluster = rados.Rados(conffile='ceph.conf') + cluster = rados.Rados(conffile=sys.argv[1]) + cluster = rados.Rados(conffile = 'ceph.conf', conf = dict (keyring = '/path/to/keyring')) + +Ensure that the ``conffile`` argument provides the path and file name of your +Ceph configuration file. You may use the ``sys`` module to avoid hard-coding the +Ceph configuration path and file name. + +Your Python client also requires a client keyring. For this example, we use the +``client.admin`` key by default. If you would like to specify the keyring when +creating the cluster handle, you may use the ``conf`` argument. Alternatively, +you may specify the keyring path in your Ceph configuration file. For example, +you may add something like the following line to your Ceph configuration file:: + + keyring = /path/to/ceph.client.admin.keyring + +For additional details on modifying your configuration via Python, see `Configuration`_. + + +Connect to the Cluster +---------------------- + +Once you have a cluster handle configured, you may connect to the cluster. +With a connection to the cluster, you may execute methods that return +information about the cluster. + +.. code-block:: python + :linenos: + :emphasize-lines: 7 + + import rados, sys + + cluster = rados.Rados(conffile='ceph.conf') + print "\nlibrados version: " + str(cluster.version()) + print "Will attempt to connect to: " + str(cluster.conf_get('mon host')) + + cluster.connect() + print "\nCluster ID: " + cluster.get_fsid() + + print "\n\nCluster Statistics" + print "==================" + cluster_stats = cluster.get_cluster_stats() + + for key, value in cluster_stats.iteritems(): + print key, value + + +By default, Ceph authentication is ``on``. Your application will need to know +the location of the keyring. The ``python-ceph`` module doesn't have the default +location, so you need to specify the keyring path. The easiest way to specify +the keyring is to add it to the Ceph configuration file. The following Ceph +configuration file example uses the ``client.admin`` keyring. + +.. code-block:: ini + :linenos: + + [global] + # ... elided configuration + keyring=/path/to/keyring/ceph.client.admin.keyring + + +Manage Pools +------------ + +When connected to the cluster, the ``Rados`` API allows you to manage pools. You +can list pools, check for the existence of a pool, create a pool and delete a +pool. + +.. code-block:: python + :linenos: + :emphasize-lines: 6, 13, 18, 25 + + print "\n\nPool Operations" + print "===============" + + print "\nAvailable Pools" + print "----------------" + pools = cluster.list_pools() + + for pool in pools: + print pool + + print "\nCreate 'test' Pool" + print "------------------" + cluster.create_pool('test') + + print "\nPool named 'test' exists: " + str(cluster.pool_exists('test')) + print "\nVerify 'test' Pool Exists" + print "-------------------------" + pools = cluster.list_pools() + + for pool in pools: + print pool + + print "\nDelete 'test' Pool" + print "------------------" + cluster.delete_pool('test') + print "\nPool named 'test' exists: " + str(cluster.pool_exists('test')) + + + +Input/Output Context +-------------------- + +Reading from and writing to the Ceph Storage Cluster requires an input/output +context (ioctx). 
You can create an ioctx with the ``open_ioctx()`` or +``open_ioctx2()`` method of the ``Rados`` class. The ``ioctx_name`` parameter +is the name of the pool and ``pool_id`` is the ID of the pool you wish to use. + +.. code-block:: python + :linenos: + + ioctx = cluster.open_ioctx('data') + + +or + +.. code-block:: python + :linenos: + + ioctx = cluster.open_ioctx2(pool_id) + + +Once you have an I/O context, you can read/write objects, extended attributes, +and perform a number of other operations. After you complete operations, ensure +that you close the connection. For example: + +.. code-block:: python + :linenos: + + print "\nClosing the connection." + ioctx.close() + + +Writing, Reading and Removing Objects +------------------------------------- + +Once you create an I/O context, you can write objects to the cluster. If you +write to an object that doesn't exist, Ceph creates it. If you write to an +object that exists, Ceph overwrites it (except when you specify a range, and +then it only overwrites the range). You may read objects (and object ranges) +from the cluster. You may also remove objects from the cluster. For example: + +.. code-block:: python + :linenos: + :emphasize-lines: 2, 5, 8 + + print "\nWriting object 'hw' with contents 'Hello World!' to pool 'data'." + ioctx.write_full("hw", "Hello World!") + + print "\n\nContents of object 'hw'\n------------------------\n" + print ioctx.read("hw") + + print "\nRemoving object 'hw'" + ioctx.remove_object("hw") + + +Writing and Reading XATTRS +-------------------------- + +Once you create an object, you can write extended attributes (XATTRs) to +the object and read XATTRs from the object. For example: + +.. code-block:: python + :linenos: + :emphasize-lines: 2, 5 + + print "\n\nWriting XATTR 'lang' with value 'en_US' to object 'hw'" + ioctx.set_xattr("hw", "lang", "en_US") + + print "\n\nGetting XATTR 'lang' from object 'hw'\n" + print ioctx.get_xattr("hw", "lang") + + +Listing Objects +--------------- + +If you want to examine the list of objects in a pool, you may +retrieve the list of objects and iterate over them with the object iterator. +For example: + +.. code-block:: python + :linenos: + :emphasize-lines: 1, 6, 7 + + object_iterator = ioctx.list_objects() + + while True : + + try : + rados_object = object_iterator.next() + print "Object contents = " + rados_object.read() + + except StopIteration : + break + +The ``Object`` class provides a file-like interface to an object, allowing +you to read and write content and extended attributes. Object operations using +the I/O context provide additional functionality and asynchronous capabilities. + + +Cluster Handle API +================== + +The ``Rados`` class provides an interface into the Ceph Storage Daemon. + + +Configuration +------------- + +The ``Rados`` class provides methods for getting and setting configuration +values, reading the Ceph configuration file, and parsing arguments. You +do not need to be connected to the Ceph Storage Cluster to invoke the following +methods. See `Storage Cluster Configuration`_ for details on settings. + +.. currentmodule:: rados +.. automethod:: Rados.conf_get(option) +.. automethod:: Rados.conf_set(option, val) +.. automethod:: Rados.conf_read_file(path=None) +.. automethod:: Rados.conf_parse_argv(args) +.. 
automethod:: Rados.version() + + +Connection Management +--------------------- + +Once you configure your cluster handle, you may connect to the cluster, check +the cluster ``fsid``, retrieve cluster statistics, and disconnect (shutdown) +from the cluster. You may also assert that the cluster handle is in a particular +state (e.g., "configuring", "connecting", etc.). + +.. automethod:: Rados.connect(timeout=0) +.. automethod:: Rados.shutdown() +.. automethod:: Rados.get_fsid() +.. automethod:: Rados.get_cluster_stats() + +.. documented manually because it raises warnings because of *args usage in the +.. signature + +.. py:class:: Rados + + .. py:method:: require_state(*args) + + Checks if the Rados object is in a special state + + :param args: Any number of states to check as separate arguments + :raises: :class:`RadosStateError` + + +Pool Operations +--------------- + +To use pool operation methods, you must connect to the Ceph Storage Cluster +first. You may list the available pools, create a pool, check to see if a pool +exists, and delete a pool. + +.. automethod:: Rados.list_pools() +.. automethod:: Rados.create_pool(pool_name, crush_rule=None) +.. automethod:: Rados.pool_exists() +.. automethod:: Rados.delete_pool(pool_name) + + +CLI Commands +------------ + +The Ceph CLI command is internally using the following librados Python binding methods. + +In order to send a command, choose the correct method and choose the correct target. + +.. automethod:: Rados.mon_command +.. automethod:: Rados.osd_command +.. automethod:: Rados.mgr_command +.. automethod:: Rados.pg_command + + +Input/Output Context API +======================== + +To write data to and read data from the Ceph Object Store, you must create +an Input/Output context (ioctx). The `Rados` class provides `open_ioctx()` +and `open_ioctx2()` methods. The remaining ``ioctx`` operations involve +invoking methods of the `Ioctx` and other classes. + +.. automethod:: Rados.open_ioctx(ioctx_name) +.. automethod:: Ioctx.require_ioctx_open() +.. automethod:: Ioctx.get_stats() +.. automethod:: Ioctx.get_last_version() +.. automethod:: Ioctx.close() + + +.. Pool Snapshots +.. -------------- + +.. The Ceph Storage Cluster allows you to make a snapshot of a pool's state. +.. Whereas, basic pool operations only require a connection to the cluster, +.. snapshots require an I/O context. + +.. Ioctx.create_snap(self, snap_name) +.. Ioctx.list_snaps(self) +.. SnapIterator.next(self) +.. Snap.get_timestamp(self) +.. Ioctx.lookup_snap(self, snap_name) +.. Ioctx.remove_snap(self, snap_name) + +.. not published. This doesn't seem ready yet. + +Object Operations +----------------- + +The Ceph Storage Cluster stores data as objects. You can read and write objects +synchronously or asynchronously. You can read and write from offsets. An object +has a name (or key) and data. + + +.. automethod:: Ioctx.aio_write(object_name, to_write, offset=0, oncomplete=None, onsafe=None) +.. automethod:: Ioctx.aio_write_full(object_name, to_write, oncomplete=None, onsafe=None) +.. automethod:: Ioctx.aio_append(object_name, to_append, oncomplete=None, onsafe=None) +.. automethod:: Ioctx.write(key, data, offset=0) +.. automethod:: Ioctx.write_full(key, data) +.. automethod:: Ioctx.aio_flush() +.. automethod:: Ioctx.set_locator_key(loc_key) +.. automethod:: Ioctx.aio_read(object_name, length, offset, oncomplete) +.. automethod:: Ioctx.read(key, length=8192, offset=0) +.. automethod:: Ioctx.stat(key) +.. automethod:: Ioctx.trunc(key, size) +.. 
automethod:: Ioctx.remove_object(key) + + +Object Extended Attributes +-------------------------- + +You may set extended attributes (XATTRs) on an object. You can retrieve a list +of objects or XATTRs and iterate over them. + +.. automethod:: Ioctx.set_xattr(key, xattr_name, xattr_value) +.. automethod:: Ioctx.get_xattrs(oid) +.. automethod:: XattrIterator.__next__() +.. automethod:: Ioctx.get_xattr(key, xattr_name) +.. automethod:: Ioctx.rm_xattr(key, xattr_name) + + + +Object Interface +================ + +From an I/O context, you can retrieve a list of objects from a pool and iterate +over them. The object interface provide makes each object look like a file, and +you may perform synchronous operations on the objects. For asynchronous +operations, you should use the I/O context methods. + +.. automethod:: Ioctx.list_objects() +.. automethod:: ObjectIterator.__next__() +.. automethod:: Object.read(length = 1024*1024) +.. automethod:: Object.write(string_to_write) +.. automethod:: Object.get_xattrs() +.. automethod:: Object.get_xattr(xattr_name) +.. automethod:: Object.set_xattr(xattr_name, xattr_value) +.. automethod:: Object.rm_xattr(xattr_name) +.. automethod:: Object.stat() +.. automethod:: Object.remove() + + + + +.. _Getting Started: ../../../start +.. _Storage Cluster Configuration: ../../configuration +.. _Getting librados for Python: ../librados-intro#getting-librados-for-python diff --git a/doc/rados/command/list-inconsistent-obj.json b/doc/rados/command/list-inconsistent-obj.json new file mode 100644 index 000000000..2bdc5f74c --- /dev/null +++ b/doc/rados/command/list-inconsistent-obj.json @@ -0,0 +1,237 @@ +{ + "$schema": "http://json-schema.org/draft-04/schema#", + "type": "object", + "properties": { + "epoch": { + "description": "Scrub epoch", + "type": "integer" + }, + "inconsistents": { + "type": "array", + "items": { + "type": "object", + "properties": { + "object": { + "description": "Identify a Ceph object", + "type": "object", + "properties": { + "name": { + "type": "string" + }, + "nspace": { + "type": "string" + }, + "locator": { + "type": "string" + }, + "version": { + "type": "integer", + "minimum": 0 + }, + "snap": { + "oneOf": [ + { + "type": "string", + "enum": [ "head", "snapdir" ] + }, + { + "type": "integer", + "minimum": 0 + } + ] + } + }, + "required": [ + "name", + "nspace", + "locator", + "version", + "snap" + ] + }, + "selected_object_info": { + "type": "object", + "description": "Selected object information", + "additionalProperties": true + }, + "union_shard_errors": { + "description": "Union of all shard errors", + "type": "array", + "items": { + "enum": [ + "missing", + "stat_error", + "read_error", + "data_digest_mismatch_info", + "omap_digest_mismatch_info", + "size_mismatch_info", + "ec_hash_error", + "ec_size_error", + "info_missing", + "info_corrupted", + "obj_size_info_mismatch", + "snapset_missing", + "snapset_corrupted", + "hinfo_missing", + "hinfo_corrupted" + ] + }, + "minItems": 0, + "uniqueItems": true + }, + "errors": { + "description": "Errors related to the analysis of this object", + "type": "array", + "items": { + "enum": [ + "object_info_inconsistency", + "data_digest_mismatch", + "omap_digest_mismatch", + "size_mismatch", + "attr_value_mismatch", + "attr_name_mismatch", + "snapset_inconsistency", + "hinfo_inconsistency", + "size_too_large" + ] + }, + "minItems": 0, + "uniqueItems": true + }, + "shards": { + "description": "All found or expected shards", + "type": "array", + "items": { + "description": "Information about a 
particular shard of object", + "type": "object", + "properties": { + "object_info": { + "oneOf": [ + { + "type": "string" + }, + { + "type": "object", + "description": "Object information", + "additionalProperties": true + } + ] + }, + "snapset": { + "oneOf": [ + { + "type": "string" + }, + { + "type": "object", + "description": "Snap set information", + "additionalProperties": true + } + ] + }, + "hashinfo": { + "oneOf": [ + { + "type": "string" + }, + { + "type": "object", + "description": "Erasure code hash information", + "additionalProperties": true + } + ] + }, + "shard": { + "type": "integer" + }, + "osd": { + "type": "integer" + }, + "primary": { + "type": "boolean" + }, + "size": { + "type": "integer" + }, + "omap_digest": { + "description": "Hex representation (e.g. 0x1abd1234)", + "type": "string" + }, + "data_digest": { + "description": "Hex representation (e.g. 0x1abd1234)", + "type": "string" + }, + "errors": { + "description": "Errors with this shard", + "type": "array", + "items": { + "enum": [ + "missing", + "stat_error", + "read_error", + "data_digest_mismatch_info", + "omap_digest_mismatch_info", + "size_mismatch_info", + "ec_hash_error", + "ec_size_error", + "info_missing", + "info_corrupted", + "obj_size_info_mismatch", + "snapset_missing", + "snapset_corrupted", + "hinfo_missing", + "hinfo_corrupted" + ] + }, + "minItems": 0, + "uniqueItems": true + }, + "attrs": { + "description": "If any shard's attr error is set then all attrs are here", + "type": "array", + "items": { + "description": "Information about a particular shard of object", + "type": "object", + "properties": { + "name": { + "type": "string" + }, + "value": { + "type": "string" + }, + "Base64": { + "type": "boolean" + } + }, + "required": [ + "name", + "value", + "Base64" + ], + "additionalProperties": false + } + } + }, + "additionalProperties": false, + "required": [ + "osd", + "primary", + "errors" + ] + } + } + }, + "required": [ + "object", + "union_shard_errors", + "errors", + "shards" + ] + } + } + }, + "required": [ + "epoch", + "inconsistents" + ] +} diff --git a/doc/rados/command/list-inconsistent-snap.json b/doc/rados/command/list-inconsistent-snap.json new file mode 100644 index 000000000..55f1d53e9 --- /dev/null +++ b/doc/rados/command/list-inconsistent-snap.json @@ -0,0 +1,86 @@ +{ + "$schema": "http://json-schema.org/draft-04/schema#", + "type": "object", + "properties": { + "epoch": { + "description": "Scrub epoch", + "type": "integer" + }, + "inconsistents": { + "type": "array", + "items": { + "type": "object", + "properties": { + "name": { + "type": "string" + }, + "nspace": { + "type": "string" + }, + "locator": { + "type": "string" + }, + "snap": { + "oneOf": [ + { + "type": "string", + "enum": [ + "head", + "snapdir" + ] + }, + { + "type": "integer", + "minimum": 0 + } + ] + }, + "errors": { + "description": "Errors for this object's snap", + "type": "array", + "items": { + "enum": [ + "snapset_missing", + "snapset_corrupted", + "info_missing", + "info_corrupted", + "snapset_error", + "headless", + "size_mismatch", + "extra_clones", + "clone_missing" + ] + }, + "minItems": 0, + "uniqueItems": true + }, + "missing": { + "description": "List of missing clones if clone_missing error set", + "type": "array", + "items": { + "type": "integer" + } + }, + "extra_clones": { + "description": "List of extra clones if extra_clones error set", + "type": "array", + "items": { + "type": "integer" + } + } + }, + "required": [ + "name", + "nspace", + "locator", + "snap", + "errors" + ] + } + } + }, + 
"required": [ + "epoch", + "inconsistents" + ] +} diff --git a/doc/rados/configuration/auth-config-ref.rst b/doc/rados/configuration/auth-config-ref.rst new file mode 100644 index 000000000..5cc13ff6a --- /dev/null +++ b/doc/rados/configuration/auth-config-ref.rst @@ -0,0 +1,362 @@ +======================== + Cephx Config Reference +======================== + +The ``cephx`` protocol is enabled by default. Cryptographic authentication has +some computational costs, though they should generally be quite low. If the +network environment connecting your client and server hosts is very safe and +you cannot afford authentication, you can turn it off. **This is not generally +recommended**. + +.. note:: If you disable authentication, you are at risk of a man-in-the-middle + attack altering your client/server messages, which could lead to disastrous + security effects. + +For creating users, see `User Management`_. For details on the architecture +of Cephx, see `Architecture - High Availability Authentication`_. + + +Deployment Scenarios +==================== + +There are two main scenarios for deploying a Ceph cluster, which impact +how you initially configure Cephx. Most first time Ceph users use +``cephadm`` to create a cluster (easiest). For clusters using +other deployment tools (e.g., Chef, Juju, Puppet, etc.), you will need +to use the manual procedures or configure your deployment tool to +bootstrap your monitor(s). + +Manual Deployment +----------------- + +When you deploy a cluster manually, you have to bootstrap the monitor manually +and create the ``client.admin`` user and keyring. To bootstrap monitors, follow +the steps in `Monitor Bootstrapping`_. The steps for monitor bootstrapping are +the logical steps you must perform when using third party deployment tools like +Chef, Puppet, Juju, etc. + + +Enabling/Disabling Cephx +======================== + +Enabling Cephx requires that you have deployed keys for your monitors, +OSDs and metadata servers. If you are simply toggling Cephx on / off, +you do not have to repeat the bootstrapping procedures. + + +Enabling Cephx +-------------- + +When ``cephx`` is enabled, Ceph will look for the keyring in the default search +path, which includes ``/etc/ceph/$cluster.$name.keyring``. You can override +this location by adding a ``keyring`` option in the ``[global]`` section of +your `Ceph configuration`_ file, but this is not recommended. + +Execute the following procedures to enable ``cephx`` on a cluster with +authentication disabled. If you (or your deployment utility) have already +generated the keys, you may skip the steps related to generating keys. + +#. Create a ``client.admin`` key, and save a copy of the key for your client + host + + .. prompt:: bash $ + + ceph auth get-or-create client.admin mon 'allow *' mds 'allow *' mgr 'allow *' osd 'allow *' -o /etc/ceph/ceph.client.admin.keyring + + **Warning:** This will clobber any existing + ``/etc/ceph/client.admin.keyring`` file. Do not perform this step if a + deployment tool has already done it for you. Be careful! + +#. Create a keyring for your monitor cluster and generate a monitor + secret key. + + .. prompt:: bash $ + + ceph-authtool --create-keyring /tmp/ceph.mon.keyring --gen-key -n mon. --cap mon 'allow *' + +#. Copy the monitor keyring into a ``ceph.mon.keyring`` file in every monitor's + ``mon data`` directory. For example, to copy it to ``mon.a`` in cluster ``ceph``, + use the following + + .. prompt:: bash $ + + cp /tmp/ceph.mon.keyring /var/lib/ceph/mon/ceph-a/keyring + +#. 
Generate a secret key for every MGR, where ``{$id}`` is the MGR letter + + .. prompt:: bash $ + + ceph auth get-or-create mgr.{$id} mon 'allow profile mgr' mds 'allow *' osd 'allow *' -o /var/lib/ceph/mgr/ceph-{$id}/keyring + +#. Generate a secret key for every OSD, where ``{$id}`` is the OSD number + + .. prompt:: bash $ + + ceph auth get-or-create osd.{$id} mon 'allow rwx' osd 'allow *' -o /var/lib/ceph/osd/ceph-{$id}/keyring + +#. Generate a secret key for every MDS, where ``{$id}`` is the MDS letter + + .. prompt:: bash $ + + ceph auth get-or-create mds.{$id} mon 'allow rwx' osd 'allow *' mds 'allow *' mgr 'allow profile mds' -o /var/lib/ceph/mds/ceph-{$id}/keyring + +#. Enable ``cephx`` authentication by setting the following options in the + ``[global]`` section of your `Ceph configuration`_ file + + .. code-block:: ini + + auth_cluster_required = cephx + auth_service_required = cephx + auth_client_required = cephx + + +#. Start or restart the Ceph cluster. See `Operating a Cluster`_ for details. + +For details on bootstrapping a monitor manually, see `Manual Deployment`_. + + + +Disabling Cephx +--------------- + +The following procedure describes how to disable Cephx. If your cluster +environment is relatively safe, you can offset the computation expense of +running authentication. **We do not recommend it.** However, it may be easier +during setup and/or troubleshooting to temporarily disable authentication. + +#. Disable ``cephx`` authentication by setting the following options in the + ``[global]`` section of your `Ceph configuration`_ file + + .. code-block:: ini + + auth_cluster_required = none + auth_service_required = none + auth_client_required = none + + +#. Start or restart the Ceph cluster. See `Operating a Cluster`_ for details. + + +Configuration Settings +====================== + +Enablement +---------- + + +``auth_cluster_required`` + +:Description: If enabled, the Ceph Storage Cluster daemons (i.e., ``ceph-mon``, + ``ceph-osd``, ``ceph-mds`` and ``ceph-mgr``) must authenticate with + each other. Valid settings are ``cephx`` or ``none``. + +:Type: String +:Required: No +:Default: ``cephx``. + + +``auth_service_required`` + +:Description: If enabled, the Ceph Storage Cluster daemons require Ceph Clients + to authenticate with the Ceph Storage Cluster in order to access + Ceph services. Valid settings are ``cephx`` or ``none``. + +:Type: String +:Required: No +:Default: ``cephx``. + + +``auth_client_required`` + +:Description: If enabled, the Ceph Client requires the Ceph Storage Cluster to + authenticate with the Ceph Client. Valid settings are ``cephx`` + or ``none``. + +:Type: String +:Required: No +:Default: ``cephx``. + + +.. index:: keys; keyring + +Keys +---- + +When you run Ceph with authentication enabled, ``ceph`` administrative commands +and Ceph Clients require authentication keys to access the Ceph Storage Cluster. + +The most common way to provide these keys to the ``ceph`` administrative +commands and clients is to include a Ceph keyring under the ``/etc/ceph`` +directory. For Octopus and later releases using ``cephadm``, the filename +is usually ``ceph.client.admin.keyring`` (or ``$cluster.client.admin.keyring``). +If you include the keyring under the ``/etc/ceph`` directory, you don't need to +specify a ``keyring`` entry in your Ceph configuration file. + +We recommend copying the Ceph Storage Cluster's keyring file to nodes where you +will run administrative commands, because it contains the ``client.admin`` key. 
+ +To perform this step manually, execute the following: + +.. prompt:: bash $ + + sudo scp {user}@{ceph-cluster-host}:/etc/ceph/ceph.client.admin.keyring /etc/ceph/ceph.client.admin.keyring + +.. tip:: Ensure the ``ceph.keyring`` file has appropriate permissions set + (e.g., ``chmod 644``) on your client machine. + +You may specify the key itself in the Ceph configuration file using the ``key`` +setting (not recommended), or a path to a keyfile using the ``keyfile`` setting. + + +``keyring`` + +:Description: The path to the keyring file. +:Type: String +:Required: No +:Default: ``/etc/ceph/$cluster.$name.keyring,/etc/ceph/$cluster.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin`` + + +``keyfile`` + +:Description: The path to a key file (i.e,. a file containing only the key). +:Type: String +:Required: No +:Default: None + + +``key`` + +:Description: The key (i.e., the text string of the key itself). Not recommended. +:Type: String +:Required: No +:Default: None + + +Daemon Keyrings +--------------- + +Administrative users or deployment tools (e.g., ``cephadm``) may generate +daemon keyrings in the same way as generating user keyrings. By default, Ceph +stores daemons keyrings inside their data directory. The default keyring +locations, and the capabilities necessary for the daemon to function, are shown +below. + +``ceph-mon`` + +:Location: ``$mon_data/keyring`` +:Capabilities: ``mon 'allow *'`` + +``ceph-osd`` + +:Location: ``$osd_data/keyring`` +:Capabilities: ``mgr 'allow profile osd' mon 'allow profile osd' osd 'allow *'`` + +``ceph-mds`` + +:Location: ``$mds_data/keyring`` +:Capabilities: ``mds 'allow' mgr 'allow profile mds' mon 'allow profile mds' osd 'allow rwx'`` + +``ceph-mgr`` + +:Location: ``$mgr_data/keyring`` +:Capabilities: ``mon 'allow profile mgr' mds 'allow *' osd 'allow *'`` + +``radosgw`` + +:Location: ``$rgw_data/keyring`` +:Capabilities: ``mon 'allow rwx' osd 'allow rwx'`` + + +.. note:: The monitor keyring (i.e., ``mon.``) contains a key but no + capabilities, and is not part of the cluster ``auth`` database. + +The daemon data directory locations default to directories of the form:: + + /var/lib/ceph/$type/$cluster-$id + +For example, ``osd.12`` would be:: + + /var/lib/ceph/osd/ceph-12 + +You can override these locations, but it is not recommended. + + +.. index:: signatures + +Signatures +---------- + +Ceph performs a signature check that provides some limited protection +against messages being tampered with in flight (e.g., by a "man in the +middle" attack). + +Like other parts of Ceph authentication, Ceph provides fine-grained control so +you can enable/disable signatures for service messages between clients and +Ceph, and so you can enable/disable signatures for messages between Ceph daemons. + +Note that even with signatures enabled data is not encrypted in +flight. + +``cephx_require_signatures`` + +:Description: If set to ``true``, Ceph requires signatures on all message + traffic between the Ceph Client and the Ceph Storage Cluster, and + between daemons comprising the Ceph Storage Cluster. + + Ceph Argonaut and Linux kernel versions prior to 3.19 do + not support signatures; if such clients are in use this + option can be turned off to allow them to connect. + +:Type: Boolean +:Required: No +:Default: ``false`` + + +``cephx_cluster_require_signatures`` + +:Description: If set to ``true``, Ceph requires signatures on all message + traffic between Ceph daemons comprising the Ceph Storage Cluster. 
+ +:Type: Boolean +:Required: No +:Default: ``false`` + + +``cephx_service_require_signatures`` + +:Description: If set to ``true``, Ceph requires signatures on all message + traffic between Ceph Clients and the Ceph Storage Cluster. + +:Type: Boolean +:Required: No +:Default: ``false`` + + +``cephx_sign_messages`` + +:Description: If the Ceph version supports message signing, Ceph will sign + all messages so they are more difficult to spoof. + +:Type: Boolean +:Default: ``true`` + + +Time to Live +------------ + +``auth_service_ticket_ttl`` + +:Description: When the Ceph Storage Cluster sends a Ceph Client a ticket for + authentication, the Ceph Storage Cluster assigns the ticket a + time to live. + +:Type: Double +:Default: ``60*60`` + + +.. _Monitor Bootstrapping: ../../../install/manual-deployment#monitor-bootstrapping +.. _Operating a Cluster: ../../operations/operating +.. _Manual Deployment: ../../../install/manual-deployment +.. _Ceph configuration: ../ceph-conf +.. _Architecture - High Availability Authentication: ../../../architecture#high-availability-authentication +.. _User Management: ../../operations/user-management diff --git a/doc/rados/configuration/bluestore-config-ref.rst b/doc/rados/configuration/bluestore-config-ref.rst new file mode 100644 index 000000000..3bfc8e295 --- /dev/null +++ b/doc/rados/configuration/bluestore-config-ref.rst @@ -0,0 +1,482 @@ +========================== +BlueStore Config Reference +========================== + +Devices +======= + +BlueStore manages either one, two, or (in certain cases) three storage +devices. + +In the simplest case, BlueStore consumes a single (primary) storage device. +The storage device is normally used as a whole, occupying the full device that +is managed directly by BlueStore. This *primary device* is normally identified +by a ``block`` symlink in the data directory. + +The data directory is a ``tmpfs`` mount which gets populated (at boot time, or +when ``ceph-volume`` activates it) with all the common OSD files that hold +information about the OSD, like: its identifier, which cluster it belongs to, +and its private keyring. + +It is also possible to deploy BlueStore across one or two additional devices: + +* A *write-ahead log (WAL) device* (identified as ``block.wal`` in the data directory) can be + used for BlueStore's internal journal or write-ahead log. It is only useful + to use a WAL device if the device is faster than the primary device (e.g., + when it is on an SSD and the primary device is an HDD). +* A *DB device* (identified as ``block.db`` in the data directory) can be used + for storing BlueStore's internal metadata. BlueStore (or rather, the + embedded RocksDB) will put as much metadata as it can on the DB device to + improve performance. If the DB device fills up, metadata will spill back + onto the primary device (where it would have been otherwise). Again, it is + only helpful to provision a DB device if it is faster than the primary + device. + +If there is only a small amount of fast storage available (e.g., less +than a gigabyte), we recommend using it as a WAL device. If there is +more, provisioning a DB device makes more sense. The BlueStore +journal will always be placed on the fastest device available, so +using a DB device will provide the same benefit that the WAL device +would while *also* allowing additional metadata to be stored there (if +it will fit). 
This means that if a DB device is specified but an explicit +WAL device is not, the WAL will be implicitly colocated with the DB on the faster +device. + +A single-device (colocated) BlueStore OSD can be provisioned with: + +.. prompt:: bash $ + + ceph-volume lvm prepare --bluestore --data + +To specify a WAL device and/or DB device: + +.. prompt:: bash $ + + ceph-volume lvm prepare --bluestore --data --block.wal --block.db + +.. note:: ``--data`` can be a Logical Volume using *vg/lv* notation. Other + devices can be existing logical volumes or GPT partitions. + +Provisioning strategies +----------------------- +Although there are multiple ways to deploy a BlueStore OSD (unlike Filestore +which had just one), there are two common arrangements that should help clarify +the deployment strategy: + +.. _bluestore-single-type-device-config: + +**block (data) only** +^^^^^^^^^^^^^^^^^^^^^ +If all devices are the same type, for example all rotational drives, and +there are no fast devices to use for metadata, it makes sense to specify the +block device only and to not separate ``block.db`` or ``block.wal``. The +:ref:`ceph-volume-lvm` command for a single ``/dev/sda`` device looks like: + +.. prompt:: bash $ + + ceph-volume lvm create --bluestore --data /dev/sda + +If logical volumes have already been created for each device, (a single LV +using 100% of the device), then the :ref:`ceph-volume-lvm` call for an LV named +``ceph-vg/block-lv`` would look like: + +.. prompt:: bash $ + + ceph-volume lvm create --bluestore --data ceph-vg/block-lv + +.. _bluestore-mixed-device-config: + +**block and block.db** +^^^^^^^^^^^^^^^^^^^^^^ +If you have a mix of fast and slow devices (SSD / NVMe and rotational), +it is recommended to place ``block.db`` on the faster device while ``block`` +(data) lives on the slower (spinning drive). + +You must create these volume groups and logical volumes manually as +the ``ceph-volume`` tool is currently not able to do so automatically. + +For the below example, let us assume four rotational (``sda``, ``sdb``, ``sdc``, and ``sdd``) +and one (fast) solid state drive (``sdx``). First create the volume groups: + +.. prompt:: bash $ + + vgcreate ceph-block-0 /dev/sda + vgcreate ceph-block-1 /dev/sdb + vgcreate ceph-block-2 /dev/sdc + vgcreate ceph-block-3 /dev/sdd + +Now create the logical volumes for ``block``: + +.. prompt:: bash $ + + lvcreate -l 100%FREE -n block-0 ceph-block-0 + lvcreate -l 100%FREE -n block-1 ceph-block-1 + lvcreate -l 100%FREE -n block-2 ceph-block-2 + lvcreate -l 100%FREE -n block-3 ceph-block-3 + +We are creating 4 OSDs for the four slow spinning devices, so assuming a 200GB +SSD in ``/dev/sdx`` we will create 4 logical volumes, each of 50GB: + +.. prompt:: bash $ + + vgcreate ceph-db-0 /dev/sdx + lvcreate -L 50GB -n db-0 ceph-db-0 + lvcreate -L 50GB -n db-1 ceph-db-0 + lvcreate -L 50GB -n db-2 ceph-db-0 + lvcreate -L 50GB -n db-3 ceph-db-0 + +Finally, create the 4 OSDs with ``ceph-volume``: + +.. prompt:: bash $ + + ceph-volume lvm create --bluestore --data ceph-block-0/block-0 --block.db ceph-db-0/db-0 + ceph-volume lvm create --bluestore --data ceph-block-1/block-1 --block.db ceph-db-0/db-1 + ceph-volume lvm create --bluestore --data ceph-block-2/block-2 --block.db ceph-db-0/db-2 + ceph-volume lvm create --bluestore --data ceph-block-3/block-3 --block.db ceph-db-0/db-3 + +These operations should end up creating four OSDs, with ``block`` on the slower +rotational drives with a 50 GB logical volume (DB) for each on the solid state +drive. 
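+
+If you want to double-check the result, ``ceph-volume`` can report the logical
+volumes and devices it now associates with each OSD (the exact output will
+vary with your device names and release):
+
+.. prompt:: bash $
+
+    ceph-volume lvm list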
+ +Sizing +====== +When using a :ref:`mixed spinning and solid drive setup +` it is important to make a large enough +``block.db`` logical volume for BlueStore. Generally, ``block.db`` should have +*as large as possible* logical volumes. + +The general recommendation is to have ``block.db`` size in between 1% to 4% +of ``block`` size. For RGW workloads, it is recommended that the ``block.db`` +size isn't smaller than 4% of ``block``, because RGW heavily uses it to store +metadata (omap keys). For example, if the ``block`` size is 1TB, then ``block.db`` shouldn't +be less than 40GB. For RBD workloads, 1% to 2% of ``block`` size is usually enough. + +In older releases, internal level sizes mean that the DB can fully utilize only +specific partition / LV sizes that correspond to sums of L0, L0+L1, L1+L2, +etc. sizes, which with default settings means roughly 3 GB, 30 GB, 300 GB, and +so forth. Most deployments will not substantially benefit from sizing to +accommodate L3 and higher, though DB compaction can be facilitated by doubling +these figures to 6GB, 60GB, and 600GB. + +Improvements in releases beginning with Nautilus 14.2.12 and Octopus 15.2.6 +enable better utilization of arbitrary DB device sizes, and the Pacific +release brings experimental dynamic level support. Users of older releases may +thus wish to plan ahead by provisioning larger DB devices today so that their +benefits may be realized with future upgrades. + +When *not* using a mix of fast and slow devices, it isn't required to create +separate logical volumes for ``block.db`` (or ``block.wal``). BlueStore will +automatically colocate these within the space of ``block``. + + +Automatic Cache Sizing +====================== + +BlueStore can be configured to automatically resize its caches when TCMalloc +is configured as the memory allocator and the ``bluestore_cache_autotune`` +setting is enabled. This option is currently enabled by default. BlueStore +will attempt to keep OSD heap memory usage under a designated target size via +the ``osd_memory_target`` configuration option. This is a best effort +algorithm and caches will not shrink smaller than the amount specified by +``osd_memory_cache_min``. Cache ratios will be chosen based on a hierarchy +of priorities. If priority information is not available, the +``bluestore_cache_meta_ratio`` and ``bluestore_cache_kv_ratio`` options are +used as fallbacks. + +Manual Cache Sizing +=================== + +The amount of memory consumed by each OSD for BlueStore caches is +determined by the ``bluestore_cache_size`` configuration option. If +that config option is not set (i.e., remains at 0), there is a +different default value that is used depending on whether an HDD or +SSD is used for the primary device (set by the +``bluestore_cache_size_ssd`` and ``bluestore_cache_size_hdd`` config +options). + +BlueStore and the rest of the Ceph OSD daemon do the best they can +to work within this memory budget. Note that on top of the configured +cache size, there is also memory consumed by the OSD itself, and +some additional utilization due to memory fragmentation and other +allocator overhead. + +The configured cache memory budget can be used in a few different ways: + +* Key/Value metadata (i.e., RocksDB's internal cache) +* BlueStore metadata +* BlueStore data (i.e., recently read or written object data) + +Cache memory usage is governed by the following options: +``bluestore_cache_meta_ratio`` and ``bluestore_cache_kv_ratio``. 
+The fraction of the cache devoted to data +is governed by the effective bluestore cache size (depending on +``bluestore_cache_size[_ssd|_hdd]`` settings and the device class of the primary +device) as well as the meta and kv ratios. +The data fraction can be calculated by +`` * (1 - bluestore_cache_meta_ratio - bluestore_cache_kv_ratio)`` + +Checksums +========= + +BlueStore checksums all metadata and data written to disk. Metadata +checksumming is handled by RocksDB and uses `crc32c`. Data +checksumming is done by BlueStore and can make use of `crc32c`, +`xxhash32`, or `xxhash64`. The default is `crc32c` and should be +suitable for most purposes. + +Full data checksumming does increase the amount of metadata that +BlueStore must store and manage. When possible, e.g., when clients +hint that data is written and read sequentially, BlueStore will +checksum larger blocks, but in many cases it must store a checksum +value (usually 4 bytes) for every 4 kilobyte block of data. + +It is possible to use a smaller checksum value by truncating the +checksum to two or one byte, reducing the metadata overhead. The +trade-off is that the probability that a random error will not be +detected is higher with a smaller checksum, going from about one in +four billion with a 32-bit (4 byte) checksum to one in 65,536 for a +16-bit (2 byte) checksum or one in 256 for an 8-bit (1 byte) checksum. +The smaller checksum values can be used by selecting `crc32c_16` or +`crc32c_8` as the checksum algorithm. + +The *checksum algorithm* can be set either via a per-pool +``csum_type`` property or the global config option. For example: + +.. prompt:: bash $ + + ceph osd pool set csum_type + +Inline Compression +================== + +BlueStore supports inline compression using `snappy`, `zlib`, or +`lz4`. Please note that the `lz4` compression plugin is not +distributed in the official release. + +Whether data in BlueStore is compressed is determined by a combination +of the *compression mode* and any hints associated with a write +operation. The modes are: + +* **none**: Never compress data. +* **passive**: Do not compress data unless the write operation has a + *compressible* hint set. +* **aggressive**: Compress data unless the write operation has an + *incompressible* hint set. +* **force**: Try to compress data no matter what. + +For more information about the *compressible* and *incompressible* IO +hints, see :c:func:`rados_set_alloc_hint`. + +Note that regardless of the mode, if the size of the data chunk is not +reduced sufficiently it will not be used and the original +(uncompressed) data will be stored. For example, if the ``bluestore +compression required ratio`` is set to ``.7`` then the compressed data +must be 70% of the size of the original (or smaller). + +The *compression mode*, *compression algorithm*, *compression required +ratio*, *min blob size*, and *max blob size* can be set either via a +per-pool property or a global config option. Pool properties can be +set with: + +.. prompt:: bash $ + + ceph osd pool set compression_algorithm + ceph osd pool set compression_mode + ceph osd pool set compression_required_ratio + ceph osd pool set compression_min_blob_size + ceph osd pool set compression_max_blob_size + +.. _bluestore-rocksdb-sharding: + +RocksDB Sharding +================ + +Internally BlueStore uses multiple types of key-value data, +stored in RocksDB. Each data type in BlueStore is assigned a +unique prefix. 
Until Pacific all key-value data was stored in +single RocksDB column family: 'default'. Since Pacific, +BlueStore can divide this data into multiple RocksDB column +families. When keys have similar access frequency, modification +frequency and lifetime, BlueStore benefits from better caching +and more precise compaction. This improves performance, and also +requires less disk space during compaction, since each column +family is smaller and can compact independent of others. + +OSDs deployed in Pacific or later use RocksDB sharding by default. +If Ceph is upgraded to Pacific from a previous version, sharding is off. + +To enable sharding and apply the Pacific defaults, stop an OSD and run + + .. prompt:: bash # + + ceph-bluestore-tool \ + --path \ + --sharding="m(3) p(3,0-12) O(3,0-13)=block_cache={type=binned_lru} L P" \ + reshard + + +Throttling +========== + +SPDK Usage +================== + +If you want to use the SPDK driver for NVMe devices, you must prepare your system. +Refer to `SPDK document`__ for more details. + +.. __: http://www.spdk.io/doc/getting_started.html#getting_started_examples + +SPDK offers a script to configure the device automatically. Users can run the +script as root: + +.. prompt:: bash $ + + sudo src/spdk/scripts/setup.sh + +You will need to specify the subject NVMe device's device selector with +the "spdk:" prefix for ``bluestore_block_path``. + +For example, you can find the device selector of an Intel PCIe SSD with: + +.. prompt:: bash $ + + lspci -mm -n -D -d 8086:0953 + +The device selector always has the form of ``DDDD:BB:DD.FF`` or ``DDDD.BB.DD.FF``. + +and then set:: + + bluestore_block_path = "spdk:trtype:PCIe traddr:0000:01:00.0" + +Where ``0000:01:00.0`` is the device selector found in the output of ``lspci`` +command above. + +You may also specify a remote NVMeoF target over the TCP transport as in the +following example:: + + bluestore_block_path = "spdk:trtype:TCP traddr:10.67.110.197 trsvcid:4420 subnqn:nqn.2019-02.io.spdk:cnode1" + +To run multiple SPDK instances per node, you must specify the +amount of dpdk memory in MB that each instance will use, to make sure each +instance uses its own DPDK memory. + +In most cases, a single device can be used for data, DB, and WAL. We describe +this strategy as *colocating* these components. Be sure to enter the below +settings to ensure that all IOs are issued through SPDK.:: + + bluestore_block_db_path = "" + bluestore_block_db_size = 0 + bluestore_block_wal_path = "" + bluestore_block_wal_size = 0 + +Otherwise, the current implementation will populate the SPDK map files with +kernel file system symbols and will use the kernel driver to issue DB/WAL IO. + +Minimum Allocation Size +======================== + +There is a configured minimum amount of storage that BlueStore will allocate on +an OSD. In practice, this is the least amount of capacity that a RADOS object +can consume. The value of `bluestore_min_alloc_size` is derived from the +value of `bluestore_min_alloc_size_hdd` or `bluestore_min_alloc_size_ssd` +depending on the OSD's ``rotational`` attribute. This means that when an OSD +is created on an HDD, BlueStore will be initialized with the current value +of `bluestore_min_alloc_size_hdd`, and SSD OSDs (including NVMe devices) +with the value of `bluestore_min_alloc_size_ssd`. + +Through the Mimic release, the default values were 64KB and 16KB for rotational +(HDD) and non-rotational (SSD) media respectively. 
Octopus changed the default
+for SSD (non-rotational) media to 4KB, and Pacific changed the default for HDD
+(rotational) media to 4KB as well.
+
+These changes were driven by space amplification experienced by Ceph RADOS
+Gateway (RGW) deployments that host large numbers of small files
+(S3/Swift objects).
+
+For example, when an RGW client stores a 1KB S3 object, it is written to a
+single RADOS object. With the default `min_alloc_size` value, 4KB of
+underlying drive space is allocated. This means that roughly
+(4KB - 1KB) == 3KB is allocated but never used, which corresponds to 300%
+overhead or 25% efficiency. Similarly, a 5KB user object will be stored
+as one 4KB and one 1KB RADOS object, again stranding 4KB of device capacity,
+though in this case the overhead is a much smaller percentage. Think of this
+in terms of the remainder from a modulus operation. The overhead *percentage*
+thus decreases rapidly as user object size increases.
+
+An easily missed additional subtlety is that this
+takes place for *each* replica. So when using the default three copies of
+data (3R), a 1KB S3 object actually consumes roughly 9KB of storage device
+capacity. If erasure coding (EC) is used instead of replication, the
+amplification may be even higher: for a ``k=4,m=2`` pool, our 1KB S3 object
+will allocate (6 * 4KB) = 24KB of device capacity.
+
+When an RGW bucket pool contains many relatively large user objects, the effect
+of this phenomenon is often negligible, but it should be considered for
+deployments that expect a significant fraction of relatively small objects.
+
+The 4KB default value aligns well with conventional HDD and SSD devices. Some
+new coarse-IU (Indirection Unit) QLC SSDs, however, perform and wear best
+when `bluestore_min_alloc_size_ssd`
+is set at OSD creation to match the device's IU: 8KB, 16KB, or even 64KB.
+These novel storage drives allow one to achieve read performance competitive
+with conventional TLC SSDs and write performance faster than HDDs, with
+high density and lower cost than TLC SSDs.
+
+Note that when creating OSDs on these devices, one must carefully apply the
+non-default value only to appropriate devices, and not to conventional SSD and
+HDD devices. This may be done through careful ordering of OSD creation, custom
+OSD device classes, and especially by the use of central configuration *masks*.
+
+Quincy and later releases add
+the `bluestore_use_optimal_io_size_for_min_alloc_size`
+option that enables automatic discovery of the appropriate value as each OSD is
+created. Note that the use of ``bcache``, ``OpenCAS``, ``dmcrypt``,
+``ATA over Ethernet``, `iSCSI`, or other device layering / abstraction
+technologies may confound the determination of appropriate values. OSDs
+deployed on top of VMware storage have also been reported to
+sometimes present a ``rotational`` attribute that does not match the underlying
+hardware.
+
+We suggest inspecting such OSDs at startup via logs and admin sockets to ensure
+that their behavior is appropriate. Note that this also may not work as desired
+with older kernels. You can check for this by examining the presence and value
+of ``/sys/block/<device>/queue/optimal_io_size``.
+
+You may also inspect a given OSD:
+
+.. prompt:: bash #
+
+   ceph osd metadata osd.1701 | grep rotational
+
+This space amplification may manifest as an unusually high ratio of raw to
+stored data reported by ``ceph df``. ``ceph osd df`` may also report
+anomalously high ``%USE`` / ``VAR`` values when
+compared to other, ostensibly identical OSDs.
A pool using OSDs with +mismatched ``min_alloc_size`` values may experience unexpected balancer +behavior as well. + +Note that this BlueStore attribute takes effect *only* at OSD creation; if +changed later, a given OSD's behavior will not change unless / until it is +destroyed and redeployed with the appropriate option value(s). Upgrading +to a later Ceph release will *not* change the value used by OSDs deployed +under older releases or with other settings. + +DSA (Data Streaming Accelerator Usage) +====================================== + +If you want to use the DML library to drive DSA device for offloading +read/write operations on Persist memory in Bluestore. You need to install +`DML`_ and `idxd-config`_ library in your machine with SPR (Sapphire Rapids) CPU. + +.. _DML: https://github.com/intel/DML +.. _idxd-config: https://github.com/intel/idxd-config + +After installing the DML software, you need to configure the shared +work queues (WQs) with the following WQ configuration example via accel-config tool: + +.. prompt:: bash $ + + accel-config config-wq --group-id=1 --mode=shared --wq-size=16 --threshold=15 --type=user --name="MyApp1" --priority=10 --block-on-fault=1 dsa0/wq0.1 + accel-config config-engine dsa0/engine0.1 --group-id=1 + accel-config enable-device dsa0 + accel-config enable-wq dsa0/wq0.1 diff --git a/doc/rados/configuration/ceph-conf.rst b/doc/rados/configuration/ceph-conf.rst new file mode 100644 index 000000000..ad93598de --- /dev/null +++ b/doc/rados/configuration/ceph-conf.rst @@ -0,0 +1,689 @@ +.. _configuring-ceph: + +================== + Configuring Ceph +================== + +When Ceph services start, the initialization process activates a series +of daemons that run in the background. A :term:`Ceph Storage Cluster` runs +at a minimum three types of daemons: + +- :term:`Ceph Monitor` (``ceph-mon``) +- :term:`Ceph Manager` (``ceph-mgr``) +- :term:`Ceph OSD Daemon` (``ceph-osd``) + +Ceph Storage Clusters that support the :term:`Ceph File System` also run at +least one :term:`Ceph Metadata Server` (``ceph-mds``). Clusters that +support :term:`Ceph Object Storage` run Ceph RADOS Gateway daemons +(``radosgw``) as well. + +Each daemon has a number of configuration options, each of which has a +default value. You may adjust the behavior of the system by changing these +configuration options. Be careful to understand the consequences before +overriding default values, as it is possible to significantly degrade the +performance and stability of your cluster. Also note that default values +sometimes change between releases, so it is best to review the version of +this documentation that aligns with your Ceph release. + +Option names +============ + +All Ceph configuration options have a unique name consisting of words +formed with lower-case characters and connected with underscore +(``_``) characters. + +When option names are specified on the command line, either underscore +(``_``) or dash (``-``) characters can be used interchangeable (e.g., +``--mon-host`` is equivalent to ``--mon_host``). + +When option names appear in configuration files, spaces can also be +used in place of underscore or dash. We suggest, though, that for +clarity and convenience you consistently use underscores, as we do +throughout this documentation. + +Config sources +============== + +Each Ceph daemon, process, and library will pull its configuration +from several sources, listed below. Sources later in the list will +override those earlier in the list when both are present. 
+ +- the compiled-in default value +- the monitor cluster's centralized configuration database +- a configuration file stored on the local host +- environment variables +- command line arguments +- runtime overrides set by an administrator + +One of the first things a Ceph process does on startup is parse the +configuration options provided via the command line, environment, and +local configuration file. The process will then contact the monitor +cluster to retrieve configuration stored centrally for the entire +cluster. Once a complete view of the configuration is available, the +daemon or process startup will proceed. + +.. _bootstrap-options: + +Bootstrap options +----------------- + +Because some configuration options affect the process's ability to +contact the monitors, authenticate, and retrieve the cluster-stored +configuration, they may need to be stored locally on the node and set +in a local configuration file. These options include: + + - ``mon_host``, the list of monitors for the cluster + - ``mon_host_override``, the list of monitors for the cluster to + **initially** contact when beginning a new instance of communication with the + Ceph cluster. This overrides the known monitor list derived from MonMap + updates sent to older Ceph instances (like librados cluster handles). It is + expected this option is primarily useful for debugging. + - ``mon_dns_srv_name`` (default: `ceph-mon`), the name of the DNS + SRV record to check to identify the cluster monitors via DNS + - ``mon_data``, ``osd_data``, ``mds_data``, ``mgr_data``, and + similar options that define which local directory the daemon + stores its data in. + - ``keyring``, ``keyfile``, and/or ``key``, which can be used to + specify the authentication credential to use to authenticate with + the monitor. Note that in most cases the default keyring location + is in the data directory specified above. + +In the vast majority of cases the default values of these are +appropriate, with the exception of the ``mon_host`` option that +identifies the addresses of the cluster's monitors. When DNS is used +to identify monitors a local ceph configuration file can be avoided +entirely. + +Skipping monitor config +----------------------- + +Any process may be passed the option ``--no-mon-config`` to skip the +step that retrieves configuration from the cluster monitors. This is +useful in cases where configuration is managed entirely via +configuration files or where the monitor cluster is currently down but +some maintenance activity needs to be done. + + +.. _ceph-conf-file: + + +Configuration sections +====================== + +Any given process or daemon has a single value for each configuration +option. However, values for an option may vary across different +daemon types even daemons of the same type. Ceph options that are +stored in the monitor configuration database or in local configuration +files are grouped into sections to indicate which daemons or clients +they apply to. + +These sections include: + +``global`` + +:Description: Settings under ``global`` affect all daemons and clients + in a Ceph Storage Cluster. + +:Example: ``log_file = /var/log/ceph/$cluster-$type.$id.log`` + +``mon`` + +:Description: Settings under ``mon`` affect all ``ceph-mon`` daemons in + the Ceph Storage Cluster, and override the same setting in + ``global``. 
+ +:Example: ``mon_cluster_log_to_syslog = true`` + + +``mgr`` + +:Description: Settings in the ``mgr`` section affect all ``ceph-mgr`` daemons in + the Ceph Storage Cluster, and override the same setting in + ``global``. + +:Example: ``mgr_stats_period = 10`` + +``osd`` + +:Description: Settings under ``osd`` affect all ``ceph-osd`` daemons in + the Ceph Storage Cluster, and override the same setting in + ``global``. + +:Example: ``osd_op_queue = wpq`` + +``mds`` + +:Description: Settings in the ``mds`` section affect all ``ceph-mds`` daemons in + the Ceph Storage Cluster, and override the same setting in + ``global``. + +:Example: ``mds_cache_memory_limit = 10G`` + +``client`` + +:Description: Settings under ``client`` affect all Ceph Clients + (e.g., mounted Ceph File Systems, mounted Ceph Block Devices, + etc.) as well as Rados Gateway (RGW) daemons. + +:Example: ``objecter_inflight_ops = 512`` + + +Sections may also specify an individual daemon or client name. For example, +``mon.foo``, ``osd.123``, and ``client.smith`` are all valid section names. + + +Any given daemon will draw its settings from the global section, the +daemon or client type section, and the section sharing its name. +Settings in the most-specific section take precedence, so for example +if the same option is specified in both ``global``, ``mon``, and +``mon.foo`` on the same source (i.e., in the same configurationfile), +the ``mon.foo`` value will be used. + +If multiple values of the same configuration option are specified in the same +section, the last value wins. + +Note that values from the local configuration file always take +precedence over values from the monitor configuration database, +regardless of which section they appear in. + + +.. _ceph-metavariables: + +Metavariables +============= + +Metavariables simplify Ceph Storage Cluster configuration +dramatically. When a metavariable is set in a configuration value, +Ceph expands the metavariable into a concrete value at the time the +configuration value is used. Ceph metavariables are similar to variable expansion in the Bash shell. + +Ceph supports the following metavariables: + +``$cluster`` + +:Description: Expands to the Ceph Storage Cluster name. Useful when running + multiple Ceph Storage Clusters on the same hardware. + +:Example: ``/etc/ceph/$cluster.keyring`` +:Default: ``ceph`` + + +``$type`` + +:Description: Expands to a daemon or process type (e.g., ``mds``, ``osd``, or ``mon``) + +:Example: ``/var/lib/ceph/$type`` + + +``$id`` + +:Description: Expands to the daemon or client identifier. For + ``osd.0``, this would be ``0``; for ``mds.a``, it would + be ``a``. + +:Example: ``/var/lib/ceph/$type/$cluster-$id`` + + +``$host`` + +:Description: Expands to the host name where the process is running. + + +``$name`` + +:Description: Expands to ``$type.$id``. +:Example: ``/var/run/ceph/$cluster-$name.asok`` + +``$pid`` + +:Description: Expands to daemon pid. +:Example: ``/var/run/ceph/$cluster-$name-$pid.asok`` + + + +The Configuration File +====================== + +On startup, Ceph processes search for a configuration file in the +following locations: + +#. ``$CEPH_CONF`` (*i.e.,* the path following the ``$CEPH_CONF`` + environment variable) +#. ``-c path/path`` (*i.e.,* the ``-c`` command line argument) +#. ``/etc/ceph/$cluster.conf`` +#. ``~/.ceph/$cluster.conf`` +#. ``./$cluster.conf`` (*i.e.,* in the current working directory) +#. 
+
+
+.. _ceph-metavariables:
+
+Metavariables
+=============
+
+Metavariables simplify Ceph Storage Cluster configuration
+dramatically. When a metavariable is set in a configuration value,
+Ceph expands the metavariable into a concrete value at the time the
+configuration value is used. Ceph metavariables are similar to
+variable expansion in the Bash shell.
+
+Ceph supports the following metavariables:
+
+``$cluster``
+
+:Description: Expands to the Ceph Storage Cluster name. Useful when running
+              multiple Ceph Storage Clusters on the same hardware.
+
+:Example: ``/etc/ceph/$cluster.keyring``
+:Default: ``ceph``
+
+
+``$type``
+
+:Description: Expands to a daemon or process type (e.g., ``mds``, ``osd``, or ``mon``)
+
+:Example: ``/var/lib/ceph/$type``
+
+
+``$id``
+
+:Description: Expands to the daemon or client identifier. For
+              ``osd.0``, this would be ``0``; for ``mds.a``, it would
+              be ``a``.
+
+:Example: ``/var/lib/ceph/$type/$cluster-$id``
+
+
+``$host``
+
+:Description: Expands to the host name where the process is running.
+
+
+``$name``
+
+:Description: Expands to ``$type.$id``.
+:Example: ``/var/run/ceph/$cluster-$name.asok``
+
+``$pid``
+
+:Description: Expands to the daemon's PID.
+:Example: ``/var/run/ceph/$cluster-$name-$pid.asok``
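+
+As a worked sketch of how expansion behaves (reusing the ``log_file`` option
+from the ``global`` example above; the expanded path follows from the
+expansion rules just described):
+
+.. code-block:: ini
+
+    [global]
+    # With the default cluster name "ceph", the daemon osd.0 expands this to
+    # /var/log/ceph/ceph-osd.0.log ($cluster -> ceph, $name -> osd.0).
+    log_file = /var/log/ceph/$cluster-$name.log
+
+Because ``$name`` is itself defined as ``$type.$id``, the same path could
+equivalently be written as ``/var/log/ceph/$cluster-$type.$id.log``.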
+
+
+The Configuration File
+======================
+
+On startup, Ceph processes search for a configuration file in the
+following locations:
+
+#. ``$CEPH_CONF`` (*i.e.,* the path following the ``$CEPH_CONF``
+   environment variable)
+#. ``-c path/path`` (*i.e.,* the ``-c`` command line argument)
+#. ``/etc/ceph/$cluster.conf``
+#. ``~/.ceph/$cluster.conf``
+#. ``./$cluster.conf`` (*i.e.,* in the current working directory)
+#. On FreeBSD systems only, ``/usr/local/etc/ceph/$cluster.conf``
+
+where ``$cluster`` is the cluster's name (default ``ceph``).
+
+The Ceph configuration file uses an *ini* style syntax. You can add comment
+text after a pound sign (#) or a semi-colon (;). For example:
+
+.. code-block:: ini
+
+    # <--A number (#) sign precedes a comment.
+    ; A comment may be anything.
+    # Comments always follow a semi-colon (;) or a pound (#) on each line.
+    # The end of the line terminates a comment.
+    # We recommend that you provide comments in your configuration file(s).
+
+
+.. _ceph-conf-settings:
+
+Config file section names
+-------------------------
+
+The configuration file is divided into sections. Each section must begin with a
+valid configuration section name (see `Configuration sections`_, above)
+surrounded by square brackets. For example:
+
+.. code-block:: ini
+
+    [global]
+    debug_ms = 0
+
+    [osd]
+    debug_ms = 1
+
+    [osd.1]
+    debug_ms = 10
+
+    [osd.2]
+    debug_ms = 10
+
+
+Config file option values
+-------------------------
+
+The value of a configuration option is a string. If it is too long to
+fit on a single line, you can put a backslash (``\``) at the end of the line
+as a line-continuation marker; the value of the option is then the string
+after ``=`` on the current line combined with the string on the next
+line::
+
+    [global]
+    foo = long long ago\
+    long ago
+
+In the example above, the value of "``foo``" would be "``long long ago long ago``".
+
+Normally, an option value ends with either a new line or a comment, as in
+
+.. code-block:: ini
+
+    [global]
+    obscure_one = difficult to explain # I will try harder in next release
+    simpler_one = nothing to explain
+
+In the example above, the value of "``obscure_one``" would be "``difficult to explain``",
+and the value of "``simpler_one``" would be "``nothing to explain``".
+
+If an option value contains spaces and you want to make that explicit, you
+can quote the value using single or double quotes, as in
+
+.. code-block:: ini
+
+    [global]
+    line = "to be, or not to be"
+
+Certain characters are not allowed to appear directly in option values:
+``=``, ``#``, ``;`` and ``[``. If they are needed, they must be escaped,
+as in
+
+.. code-block:: ini
+
+    [global]
+    secret = "i love \# and \["
+
+Every configuration option is typed with one of the types below:
+
+``int``
+
+:Description: 64-bit signed integer. Some SI prefixes are supported, like "K", "M",
+              "G", "T", "P" and "E", meaning, respectively, 10\ :sup:`3`, 10\ :sup:`6`,
+              10\ :sup:`9`, etc. "B" is the only supported unit, so "1K", "1M", "128B"
+              and "-1" are all valid option values. For some options that express a
+              threshold or limit, a negative value implies "unlimited".
+:Example: ``42``, ``-1``
+
+``uint``
+
+:Description: Almost identical to ``int``, but a negative value will be rejected.
+:Example: ``256``, ``0``
+
+``str``
+
+:Description: Free-form strings encoded in UTF-8. Some characters are not allowed;
+              see the notes above for details.
+:Example: ``"hello world"``, ``"i love \#"``, ``yet-another-name``
+
+``boolean``
+
+:Description: One of the two values ``true`` or ``false``. An integer is also accepted,
+              where "0" implies ``false`` and any non-zero value implies ``true``.
+:Example: ``true``, ``false``, ``1``, ``0``
+
+``addr``
+
+:Description: A single address, optionally prefixed with ``v1``, ``v2`` or ``any`` for
+              the messenger protocol. If the prefix is not specified, the ``v2``
+              protocol is used. Please see :ref:`address_formats` for more details.
+:Example: ``v1:1.2.3.4:567``, ``v2:1.2.3.4:567``, ``1.2.3.4:567``, ``2409:8a1e:8fb6:aa20:1260:4bff:fe92:18f5::567``, ``[::1]:6789``
+
+``addrvec``
+
+:Description: A set of addresses separated by ",". The addresses can optionally be
+              quoted with ``[`` and ``]``.
+:Example: ``[v1:1.2.3.4:567,v2:1.2.3.4:568]``, ``v1:1.2.3.4:567,v1:1.2.3.14:567``, ``[2409:8a1e:8fb6:aa20:1260:4bff:fe92:18f5::567], [2409:8a1e:8fb6:aa20:1260:4bff:fe92:18f5::568]``
+
+``uuid``
+
+:Description: The string form of a UUID, as defined by RFC 4122. Some variants are
+              also supported; see the Boost UUID documentation for details.
+:Example: ``f81d4fae-7dec-11d0-a765-00a0c91e6bf6``
+
+``size``
+
+:Description: Denotes a 64-bit unsigned integer. Both SI prefixes and IEC prefixes
+              are supported, and "B" is the only supported unit. A negative value
+              will be rejected.
+:Example: ``1Ki``, ``1K``, ``1KiB`` and ``1B``.
+
+``secs``
+
+:Description: Denotes a duration of time. The unit defaults to seconds if none is
+              specified. The following units of time are supported:
+
+              * second: "s", "sec", "second", "seconds"
+              * minute: "m", "min", "minute", "minutes"
+              * hour: "hs", "hr", "hour", "hours"
+              * day: "d", "day", "days"
+              * week: "w", "wk", "week", "weeks"
+              * month: "mo", "month", "months"
+              * year: "y", "yr", "year", "years"
+:Example: ``1 m``, ``1m`` and ``1 week``
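+
+Putting the value syntax together, the snippet below reuses example settings
+that already appear earlier in this document; the comments describe only the
+form of each value, and the particular numbers are illustrative:
+
+.. code-block:: ini
+
+    [global]
+    mon_cluster_log_to_syslog = true   # a boolean value
+
+    [mgr]
+    mgr_stats_period = 10              # a plain integer value
+
+    [mds]
+    mds_cache_memory_limit = 10G       # an SI-prefixed value; the unit is bytes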
+
+.. _ceph-conf-database:
+
+Monitor configuration database
+==============================
+
+The monitor cluster manages a database of configuration options that
+can be consumed by the entire cluster, enabling streamlined central
+configuration management for the entire system. The vast majority of
+configuration options can and should be stored here for ease of
+administration and transparency.
+
+A handful of settings may still need to be stored in local
+configuration files because they affect the ability to connect to the
+monitors, authenticate, and fetch configuration information. In most
+cases this is limited to the ``mon_host`` option, although this can
+also be avoided through the use of DNS SRV records.
+
+Sections and masks
+------------------
+
+Configuration options stored by the monitor can live in a global
+section, a daemon-type section, or a specific daemon section, just like
+options in a configuration file can.
+
+In addition, options may also have a *mask* associated with them to
+further restrict which daemons or clients the option applies to.
+Masks take two forms:
+
+#. ``type:location`` where *type* is a CRUSH property like ``rack`` or
+   ``host``, and *location* is a value for that property. For example,
+   ``host:foo`` would limit the option only to daemons or clients
+   running on a particular host.
+#. ``class:device-class`` where *device-class* is the name of a CRUSH
+   device class (e.g., ``hdd`` or ``ssd``). For example,
+   ``class:ssd`` would limit the option only to OSDs backed by SSDs.
+   (This mask has no effect for non-OSD daemons or clients.)
+
+When setting a configuration option, the `who` may be a section name,
+a mask, or a combination of both separated by a slash (``/``)
+character. For example, ``osd/rack:foo`` would mean all OSD daemons
+in the ``foo`` rack.
+
+When viewing configuration options, the section name and mask are
+generally separated out into separate fields or columns to ease readability.
+
+
+Commands
+--------
+
+The following CLI commands are used to configure the cluster:
+
+* ``ceph config dump`` will dump the entire configuration database for
+  the cluster.
+
+* ``ceph config get <who>`` will dump the configuration for a specific
+  daemon or client (e.g., ``mds.a``), as stored in the monitors'
+  configuration database.
+
+* ``ceph config set