diff options
author | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-04-07 18:45:59 +0000 |
---|---|---|
committer | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-04-07 18:45:59 +0000 |
commit | 19fcec84d8d7d21e796c7624e521b60d28ee21ed (patch) | |
tree | 42d26aa27d1e3f7c0b8bd3fd14e7d7082f5008dc /doc/rados | |
parent | Initial commit. (diff) | |
download | ceph-19fcec84d8d7d21e796c7624e521b60d28ee21ed.tar.xz ceph-19fcec84d8d7d21e796c7624e521b60d28ee21ed.zip |
Adding upstream version 16.2.11+ds.upstream/16.2.11+dsupstream
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to '')
136 files changed, 43514 insertions, 0 deletions
diff --git a/doc/rados/api/index.rst b/doc/rados/api/index.rst new file mode 100644 index 000000000..63bc7222d --- /dev/null +++ b/doc/rados/api/index.rst @@ -0,0 +1,23 @@ +=========================== + Ceph Storage Cluster APIs +=========================== + +The :term:`Ceph Storage Cluster` has a messaging layer protocol that enables +clients to interact with a :term:`Ceph Monitor` and a :term:`Ceph OSD Daemon`. +``librados`` provides this functionality to :term:`Ceph Client`\s in the form of +a library. All Ceph Clients either use ``librados`` or the same functionality +encapsulated in ``librados`` to interact with the object store. For example, +``librbd`` and ``libcephfs`` leverage this functionality. You may use +``librados`` to interact with Ceph directly (e.g., an application that talks to +Ceph, your own interface to Ceph, etc.). + + +.. toctree:: + :maxdepth: 2 + + Introduction to librados <librados-intro> + librados (C) <librados> + librados (C++) <libradospp> + librados (Python) <python> + libcephsqlite (SQLite) <libcephsqlite> + object class <objclass-sdk> diff --git a/doc/rados/api/libcephsqlite.rst b/doc/rados/api/libcephsqlite.rst new file mode 100644 index 000000000..76ab306bb --- /dev/null +++ b/doc/rados/api/libcephsqlite.rst @@ -0,0 +1,438 @@ +.. _libcephsqlite: + +================ + Ceph SQLite VFS +================ + +This `SQLite VFS`_ may be used for storing and accessing a `SQLite`_ database +backed by RADOS. This allows you to fully decentralize your database using +Ceph's object store for improved availability, accessibility, and use of +storage. + +Note what this is not: a distributed SQL engine. SQLite on RADOS can be thought +of like RBD as compared to CephFS: RBD puts a disk image on RADOS for the +purposes of exclusive access by a machine and generally does not allow parallel +access by other machines; on the other hand, CephFS allows fully distributed +access to a file system from many client mounts. SQLite on RADOS is meant to be +accessed by a single SQLite client database connection at a given time. The +database may be manipulated safely by multiple clients only in a serial fashion +controlled by RADOS locks managed by the Ceph SQLite VFS. + + +Usage +^^^^^ + +Normal unmodified applications (including the sqlite command-line toolset +binary) may load the *ceph* VFS using the `SQLite Extension Loading API`_. + +.. code:: sql + + .LOAD libcephsqlite.so + +or during the invocation of ``sqlite3`` + +.. code:: sh + + sqlite3 -cmd '.load libcephsqlite.so' + +A database file is formatted as a SQLite URI:: + + file:///<"*"poolid|poolname>:[namespace]/<dbname>?vfs=ceph + +The RADOS ``namespace`` is optional. Note the triple ``///`` in the path. The URI +authority must be empty or localhost in SQLite. Only the path part of the URI +is parsed. For this reason, the URI will not parse properly if you only use two +``//``. + +A complete example of (optionally) creating a database and opening: + +.. code:: sh + + sqlite3 -cmd '.load libcephsqlite.so' -cmd '.open file:///foo:bar/baz.db?vfs=ceph' + +Note you cannot specify the database file as the normal positional argument to +``sqlite3``. This is because the ``.load libcephsqlite.so`` command is applied +after opening the database, but opening the database depends on the extension +being loaded first. + +An example passing the pool integer id and no RADOS namespace: + +.. code:: sh + + sqlite3 -cmd '.load libcephsqlite.so' -cmd '.open file:///*2:/baz.db?vfs=ceph' + +Like other Ceph tools, the *ceph* VFS looks at some environment variables that +help with configuring which Ceph cluster to communicate with and which +credential to use. Here would be a typical configuration: + +.. code:: sh + + export CEPH_CONF=/path/to/ceph.conf + export CEPH_KEYRING=/path/to/ceph.keyring + export CEPH_ARGS='--id myclientid' + ./runmyapp + # or + sqlite3 -cmd '.load libcephsqlite.so' -cmd '.open file:///foo:bar/baz.db?vfs=ceph' + +The default operation would look at the standard Ceph configuration file path +using the ``client.admin`` user. + + +User +^^^^ + +The *ceph* VFS requires a user credential with read access to the monitors, the +ability to blocklist dead clients of the database, and access to the OSDs +hosting the database. This can be done with authorizations as simply as: + +.. code:: sh + + ceph auth get-or-create client.X mon 'allow r, allow command "osd blocklist" with blocklistop=add' osd 'allow rwx' + +.. note:: The terminology change from ``blacklist`` to ``blocklist``; older clusters may require using the old terms. + +You may also simplify using the ``simple-rados-client-with-blocklist`` profile: + +.. code:: sh + + ceph auth get-or-create client.X mon 'profile simple-rados-client-with-blocklist' osd 'allow rwx' + +To learn why blocklisting is necessary, see :ref:`libcephsqlite-corrupt`. + + +Page Size +^^^^^^^^^ + +SQLite allows configuring the page size prior to creating a new database. It is +advisable to increase this config to 65536 (64K) when using RADOS backed +databases to reduce the number of OSD reads/writes and thereby improve +throughput and latency. + +.. code:: sql + + PRAGMA page_size = 65536 + +You may also try other values according to your application needs but note that +64K is the max imposed by SQLite. + + +Cache +^^^^^ + +The ceph VFS does not do any caching of reads or buffering of writes. Instead, +and more appropriately, the SQLite page cache is used. You may find it is too small +for most workloads and should therefore increase it significantly: + + +.. code:: sql + + PRAGMA cache_size = 4096 + +Which will cache 4096 pages or 256MB (with 64K ``page_cache``). + + +Journal Persistence +^^^^^^^^^^^^^^^^^^^ + +By default, SQLite deletes the journal for every transaction. This can be +expensive as the *ceph* VFS must delete every object backing the journal for each +transaction. For this reason, it is much faster and simpler to ask SQLite to +**persist** the journal. In this mode, SQLite will invalidate the journal via a +write to its header. This is done as: + +.. code:: sql + + PRAGMA journal_mode = PERSIST + +The cost of this may be increased unused space according to the high-water size +of the rollback journal (based on transaction type and size). + + +Exclusive Lock Mode +^^^^^^^^^^^^^^^^^^^ + +SQLite operates in a ``NORMAL`` locking mode where each transaction requires +locking the backing database file. This can add unnecessary overhead to +transactions when you know there's only ever one user of the database at a +given time. You can have SQLite lock the database once for the duration of the +connection using: + +.. code:: sql + + PRAGMA locking_mode = EXCLUSIVE + +This can more than **halve** the time taken to perform a transaction. Keep in +mind this prevents other clients from accessing the database. + +In this locking mode, each write transaction to the database requires 3 +synchronization events: once to write to the journal, another to write to the +database file, and a final write to invalidate the journal header (in +``PERSIST`` journaling mode). + + +WAL Journal +^^^^^^^^^^^ + +The `WAL Journal Mode`_ is only available when SQLite is operating in exclusive +lock mode. This is because it requires shared memory communication with other +readers and writers when in the ``NORMAL`` locking mode. + +As with local disk databases, WAL mode may significantly reduce small +transaction latency. Testing has shown it can provide more than 50% speedup +over persisted rollback journals in exclusive locking mode. You can expect +around 150-250 transactions per second depending on size. + + +Performance Notes +^^^^^^^^^^^^^^^^^ + +The filing backend for the database on RADOS is asynchronous as much as +possible. Still, performance can be anywhere from 3x-10x slower than a local +database on SSD. Latency can be a major factor. It is advisable to be familiar +with SQL transactions and other strategies for efficient database updates. +Depending on the performance of the underlying pool, you can expect small +transactions to take up to 30 milliseconds to complete. If you use the +``EXCLUSIVE`` locking mode, it can be reduced further to 15 milliseconds per +transaction. A WAL journal in ``EXCLUSIVE`` locking mode can further reduce +this as low as ~2-5 milliseconds (or the time to complete a RADOS write; you +won't get better than that!). + +There is no limit to the size of a SQLite database on RADOS imposed by the Ceph +VFS. There are standard `SQLite Limits`_ to be aware of, notably the maximum +database size of 281 TB. Large databases may or may not be performant on Ceph. +Experimentation for your own use-case is advised. + +Be aware that read-heavy queries could take significant amounts of time as +reads are necessarily synchronous (due to the VFS API). No readahead is yet +performed by the VFS. + + +Recommended Use-Cases +^^^^^^^^^^^^^^^^^^^^^ + +The original purpose of this module was to support saving relational or large +data in RADOS which needs to span multiple objects. Many current applications +with trivial state try to use RADOS omap storage on a single object but this +cannot scale without striping data across multiple objects. Unfortunately, it +is non-trivial to design a store spanning multiple objects which is consistent +and also simple to use. SQLite can be used to bridge that gap. + + +Parallel Access +^^^^^^^^^^^^^^^ + +The VFS does not yet support concurrent readers. All database access is protected +by a single exclusive lock. + + +Export or Extract Database out of RADOS +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The database is striped on RADOS and can be extracted using the RADOS cli toolset. + +.. code:: sh + + rados --pool=foo --striper get bar.db local-bar.db + rados --pool=foo --striper get bar.db-journal local-bar.db-journal + sqlite3 local-bar.db ... + +Keep in mind the rollback journal is also striped and will need to be extracted +as well if the database was in the middle of a transaction. If you're using +WAL, that journal will need to be extracted as well. + +Keep in mind that extracting the database using the striper uses the same RADOS +locks as those used by the *ceph* VFS. However, the journal file locks are not +used by the *ceph* VFS (SQLite only locks the main database file) so there is a +potential race with other SQLite clients when extracting both files. That could +result in fetching a corrupt journal. + +Instead of manually extracting the files, it would be more advisable to use the +`SQLite Backup`_ mechanism instead. + + +Temporary Tables +^^^^^^^^^^^^^^^^ + +Temporary tables backed by the ceph VFS are not supported. The main reason for +this is that the VFS lacks context about where it should put the database, i.e. +which RADOS pool. The persistent database associated with the temporary +database is not communicated via the SQLite VFS API. + +Instead, it's suggested to attach a secondary local or `In-Memory Database`_ +and put the temporary tables there. Alternatively, you may set a connection +pragma: + +.. code:: sql + + PRAGMA temp_store=memory + + +.. _libcephsqlite-breaking-locks: + +Breaking Locks +^^^^^^^^^^^^^^ + +Access to the database file is protected by an exclusive lock on the first +object stripe of the database. If the application fails without unlocking the +database (e.g. a segmentation fault), the lock is not automatically unlocked, +even if the client connection is blocklisted afterward. Eventually, the lock +will timeout subject to the configurations:: + + cephsqlite_lock_renewal_timeout = 30000 + +The timeout is in milliseconds. Once the timeout is reached, the OSD will +expire the lock and allow clients to relock. When this occurs, the database +will be recovered by SQLite and the in-progress transaction rolled back. The +new client recovering the database will also blocklist the old client to +prevent potential database corruption from rogue writes. + +The holder of the exclusive lock on the database will periodically renew the +lock so it does not lose the lock. This is necessary for large transactions or +database connections operating in ``EXCLUSIVE`` locking mode. The lock renewal +interval is adjustable via:: + + cephsqlite_lock_renewal_interval = 2000 + +This configuration is also in units of milliseconds. + +It is possible to break the lock early if you know the client is gone for good +(e.g. blocklisted). This allows restoring database access to clients +immediately. For example: + +.. code:: sh + + $ rados --pool=foo --namespace bar lock info baz.db.0000000000000000 striper.lock + {"name":"striper.lock","type":"exclusive","tag":"","lockers":[{"name":"client.4463","cookie":"555c7208-db39-48e8-a4d7-3ba92433a41a","description":"SimpleRADOSStriper","expiration":"0.000000","addr":"127.0.0.1:0/1831418345"}]} + + $ rados --pool=foo --namespace bar lock break baz.db.0000000000000000 striper.lock client.4463 --lock-cookie 555c7208-db39-48e8-a4d7-3ba92433a41a + +.. _libcephsqlite-corrupt: + +How to Corrupt Your Database +^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +There is the usual reading on `How to Corrupt Your SQLite Database`_ that you +should review before using this tool. To add to that, the most likely way you +may corrupt your database is by a rogue process transiently losing network +connectivity and then resuming its work. The exclusive RADOS lock it held will +be lost but it cannot know that immediately. Any work it might do after +regaining network connectivity could corrupt the database. + +The *ceph* VFS library defaults do not allow for this scenario to occur. The Ceph +VFS will blocklist the last owner of the exclusive lock on the database if it +detects incomplete cleanup. + +By blocklisting the old client, it's no longer possible for the old client to +resume its work on the database when it returns (subject to blocklist +expiration, 3600 seconds by default). To turn off blocklisting the prior client, change:: + + cephsqlite_blocklist_dead_locker = false + +Do NOT do this unless you know database corruption cannot result due to other +guarantees. If this config is true (the default), the *ceph* VFS will cowardly +fail if it cannot blocklist the prior instance (due to lack of authorization, +for example). + +One example where out-of-band mechanisms exist to blocklist the last dead +holder of the exclusive lock on the database is in the ``ceph-mgr``. The +monitors are made aware of the RADOS connection used for the *ceph* VFS and will +blocklist the instance during ``ceph-mgr`` failover. This prevents a zombie +``ceph-mgr`` from continuing work and potentially corrupting the database. For +this reason, it is not necessary for the *ceph* VFS to do the blocklist command +in the new instance of the ``ceph-mgr`` (but it still does so, harmlessly). + +To blocklist the *ceph* VFS manually, you may see the instance address of the +*ceph* VFS using the ``ceph_status`` SQL function: + +.. code:: sql + + SELECT ceph_status(); + +.. code:: + + {"id":788461300,"addr":"172.21.10.4:0/1472139388"} + +You may easily manipulate that information using the `JSON1 extension`_: + +.. code:: sql + + SELECT json_extract(ceph_status(), '$.addr'); + +.. code:: + + 172.21.10.4:0/3563721180 + +This is the address you would pass to the ceph blocklist command: + +.. code:: sh + + ceph osd blocklist add 172.21.10.4:0/3082314560 + + +Performance Statistics +^^^^^^^^^^^^^^^^^^^^^^ + +The *ceph* VFS provides a SQLite function, ``ceph_perf``, for querying the +performance statistics of the VFS. The data is from "performance counters" as +in other Ceph services normally queried via an admin socket. + +.. code:: sql + + SELECT ceph_perf(); + +.. code:: + + {"libcephsqlite_vfs":{"op_open":{"avgcount":2,"sum":0.150001291,"avgtime":0.075000645},"op_delete":{"avgcount":0,"sum":0.000000000,"avgtime":0.000000000},"op_access":{"avgcount":1,"sum":0.003000026,"avgtime":0.003000026},"op_fullpathname":{"avgcount":1,"sum":0.064000551,"avgtime":0.064000551},"op_currenttime":{"avgcount":0,"sum":0.000000000,"avgtime":0.000000000},"opf_close":{"avgcount":1,"sum":0.000000000,"avgtime":0.000000000},"opf_read":{"avgcount":3,"sum":0.036000310,"avgtime":0.012000103},"opf_write":{"avgcount":0,"sum":0.000000000,"avgtime":0.000000000},"opf_truncate":{"avgcount":0,"sum":0.000000000,"avgtime":0.000000000},"opf_sync":{"avgcount":0,"sum":0.000000000,"avgtime":0.000000000},"opf_filesize":{"avgcount":2,"sum":0.000000000,"avgtime":0.000000000},"opf_lock":{"avgcount":1,"sum":0.158001360,"avgtime":0.158001360},"opf_unlock":{"avgcount":1,"sum":0.101000871,"avgtime":0.101000871},"opf_checkreservedlock":{"avgcount":1,"sum":0.002000017,"avgtime":0.002000017},"opf_filecontrol":{"avgcount":4,"sum":0.000000000,"avgtime":0.000000000},"opf_sectorsize":{"avgcount":0,"sum":0.000000000,"avgtime":0.000000000},"opf_devicecharacteristics":{"avgcount":4,"sum":0.000000000,"avgtime":0.000000000}},"libcephsqlite_striper":{"update_metadata":0,"update_allocated":0,"update_size":0,"update_version":0,"shrink":0,"shrink_bytes":0,"lock":1,"unlock":1}} + +You may easily manipulate that information using the `JSON1 extension`_: + +.. code:: sql + + SELECT json_extract(ceph_perf(), '$.libcephsqlite_vfs.opf_sync.avgcount'); + +.. code:: + + 776 + +That tells you the number of times SQLite has called the xSync method of the +`SQLite IO Methods`_ of the VFS (for **all** open database connections in the +process). You could analyze the performance stats before and after a number of +queries to see the number of file system syncs required (this would just be +proportional to the number of transactions). Alternatively, you may be more +interested in the average latency to complete a write: + +.. code:: sql + + SELECT json_extract(ceph_perf(), '$.libcephsqlite_vfs.opf_write'); + +.. code:: + + {"avgcount":7873,"sum":0.675005797,"avgtime":0.000085736} + +Which would tell you there have been 7873 writes with an average +time-to-complete of 85 microseconds. That clearly shows the calls are executed +asynchronously. Returning to sync: + +.. code:: sql + + SELECT json_extract(ceph_perf(), '$.libcephsqlite_vfs.opf_sync'); + +.. code:: + + {"avgcount":776,"sum":4.802041199,"avgtime":0.006188197} + +6 milliseconds were spent on average executing a sync call. This gathers all of +the asynchronous writes as well as an asynchronous update to the size of the +striped file. + + +.. _SQLite: https://sqlite.org/index.html +.. _SQLite VFS: https://www.sqlite.org/vfs.html +.. _SQLite Backup: https://www.sqlite.org/backup.html +.. _SQLite Limits: https://www.sqlite.org/limits.html +.. _SQLite Extension Loading API: https://sqlite.org/c3ref/load_extension.html +.. _In-Memory Database: https://www.sqlite.org/inmemorydb.html +.. _WAL Journal Mode: https://sqlite.org/wal.html +.. _How to Corrupt Your SQLite Database: https://www.sqlite.org/howtocorrupt.html +.. _JSON1 Extension: https://www.sqlite.org/json1.html +.. _SQLite IO Methods: https://www.sqlite.org/c3ref/io_methods.html diff --git a/doc/rados/api/librados-intro.rst b/doc/rados/api/librados-intro.rst new file mode 100644 index 000000000..9bffa3114 --- /dev/null +++ b/doc/rados/api/librados-intro.rst @@ -0,0 +1,1052 @@ +========================== + Introduction to librados +========================== + +The :term:`Ceph Storage Cluster` provides the basic storage service that allows +:term:`Ceph` to uniquely deliver **object, block, and file storage** in one +unified system. However, you are not limited to using the RESTful, block, or +POSIX interfaces. Based upon :abbr:`RADOS (Reliable Autonomic Distributed Object +Store)`, the ``librados`` API enables you to create your own interface to the +Ceph Storage Cluster. + +The ``librados`` API enables you to interact with the two types of daemons in +the Ceph Storage Cluster: + +- The :term:`Ceph Monitor`, which maintains a master copy of the cluster map. +- The :term:`Ceph OSD Daemon` (OSD), which stores data as objects on a storage node. + +.. ditaa:: + +---------------------------------+ + | Ceph Storage Cluster Protocol | + | (librados) | + +---------------------------------+ + +---------------+ +---------------+ + | OSDs | | Monitors | + +---------------+ +---------------+ + +This guide provides a high-level introduction to using ``librados``. +Refer to :doc:`../../architecture` for additional details of the Ceph +Storage Cluster. To use the API, you need a running Ceph Storage Cluster. +See `Installation (Quick)`_ for details. + + +Step 1: Getting librados +======================== + +Your client application must bind with ``librados`` to connect to the Ceph +Storage Cluster. You must install ``librados`` and any required packages to +write applications that use ``librados``. The ``librados`` API is written in +C++, with additional bindings for C, Python, Java and PHP. + + +Getting librados for C/C++ +-------------------------- + +To install ``librados`` development support files for C/C++ on Debian/Ubuntu +distributions, execute the following: + +.. prompt:: bash $ + + sudo apt-get install librados-dev + +To install ``librados`` development support files for C/C++ on RHEL/CentOS +distributions, execute the following: + +.. prompt:: bash $ + + sudo yum install librados2-devel + +Once you install ``librados`` for developers, you can find the required +headers for C/C++ under ``/usr/include/rados``: + +.. prompt:: bash $ + + ls /usr/include/rados + + +Getting librados for Python +--------------------------- + +The ``rados`` module provides ``librados`` support to Python +applications. The ``librados-dev`` package for Debian/Ubuntu +and the ``librados2-devel`` package for RHEL/CentOS will install the +``python-rados`` package for you. You may install ``python-rados`` +directly too. + +To install ``librados`` development support files for Python on Debian/Ubuntu +distributions, execute the following: + +.. prompt:: bash $ + + sudo apt-get install python3-rados + +To install ``librados`` development support files for Python on RHEL/CentOS +distributions, execute the following: + +.. prompt:: bash $ + + sudo yum install python-rados + +To install ``librados`` development support files for Python on SLE/openSUSE +distributions, execute the following: + +.. prompt:: bash $ + + sudo zypper install python3-rados + +You can find the module under ``/usr/share/pyshared`` on Debian systems, +or under ``/usr/lib/python*/site-packages`` on CentOS/RHEL systems. + + +Getting librados for Java +------------------------- + +To install ``librados`` for Java, you need to execute the following procedure: + +#. Install ``jna.jar``. For Debian/Ubuntu, execute: + + .. prompt:: bash $ + + sudo apt-get install libjna-java + + For CentOS/RHEL, execute: + + .. prompt:: bash $ + + sudo yum install jna + + The JAR files are located in ``/usr/share/java``. + +#. Clone the ``rados-java`` repository: + + .. prompt:: bash $ + + git clone --recursive https://github.com/ceph/rados-java.git + +#. Build the ``rados-java`` repository: + + .. prompt:: bash $ + + cd rados-java + ant + + The JAR file is located under ``rados-java/target``. + +#. Copy the JAR for RADOS to a common location (e.g., ``/usr/share/java``) and + ensure that it and the JNA JAR are in your JVM's classpath. For example: + + .. prompt:: bash $ + + sudo cp target/rados-0.1.3.jar /usr/share/java/rados-0.1.3.jar + sudo ln -s /usr/share/java/jna-3.2.7.jar /usr/lib/jvm/default-java/jre/lib/ext/jna-3.2.7.jar + sudo ln -s /usr/share/java/rados-0.1.3.jar /usr/lib/jvm/default-java/jre/lib/ext/rados-0.1.3.jar + +To build the documentation, execute the following: + +.. prompt:: bash $ + + ant docs + + +Getting librados for PHP +------------------------- + +To install the ``librados`` extension for PHP, you need to execute the following procedure: + +#. Install php-dev. For Debian/Ubuntu, execute: + + .. prompt:: bash $ + + sudo apt-get install php5-dev build-essential + + For CentOS/RHEL, execute: + + .. prompt:: bash $ + + sudo yum install php-devel + +#. Clone the ``phprados`` repository: + + .. prompt:: bash $ + + git clone https://github.com/ceph/phprados.git + +#. Build ``phprados``: + + .. prompt:: bash $ + + cd phprados + phpize + ./configure + make + sudo make install + +#. Enable ``phprados`` by adding the following line to ``php.ini``:: + + extension=rados.so + + +Step 2: Configuring a Cluster Handle +==================================== + +A :term:`Ceph Client`, via ``librados``, interacts directly with OSDs to store +and retrieve data. To interact with OSDs, the client app must invoke +``librados`` and connect to a Ceph Monitor. Once connected, ``librados`` +retrieves the :term:`Cluster Map` from the Ceph Monitor. When the client app +wants to read or write data, it creates an I/O context and binds to a +:term:`Pool`. The pool has an associated :term:`CRUSH rule` that defines how it +will place data in the storage cluster. Via the I/O context, the client +provides the object name to ``librados``, which takes the object name +and the cluster map (i.e., the topology of the cluster) and `computes`_ the +placement group and `OSD`_ for locating the data. Then the client application +can read or write data. The client app doesn't need to learn about the topology +of the cluster directly. + +.. ditaa:: + +--------+ Retrieves +---------------+ + | Client |------------>| Cluster Map | + +--------+ +---------------+ + | + v Writes + /-----\ + | obj | + \-----/ + | To + v + +--------+ +---------------+ + | Pool |---------->| CRUSH Rule | + +--------+ Selects +---------------+ + + +The Ceph Storage Cluster handle encapsulates the client configuration, including: + +- The `user ID`_ for ``rados_create()`` or user name for ``rados_create2()`` + (preferred). +- The :term:`cephx` authentication key +- The monitor ID and IP address +- Logging levels +- Debugging levels + +Thus, the first steps in using the cluster from your app are to 1) create +a cluster handle that your app will use to connect to the storage cluster, +and then 2) use that handle to connect. To connect to the cluster, the +app must supply a monitor address, a username and an authentication key +(cephx is enabled by default). + +.. tip:: Talking to different Ceph Storage Clusters – or to the same cluster + with different users – requires different cluster handles. + +RADOS provides a number of ways for you to set the required values. For +the monitor and encryption key settings, an easy way to handle them is to ensure +that your Ceph configuration file contains a ``keyring`` path to a keyring file +and at least one monitor address (e.g., ``mon host``). For example:: + + [global] + mon host = 192.168.1.1 + keyring = /etc/ceph/ceph.client.admin.keyring + +Once you create the handle, you can read a Ceph configuration file to configure +the handle. You can also pass arguments to your app and parse them with the +function for parsing command line arguments (e.g., ``rados_conf_parse_argv()``), +or parse Ceph environment variables (e.g., ``rados_conf_parse_env()``). Some +wrappers may not implement convenience methods, so you may need to implement +these capabilities. The following diagram provides a high-level flow for the +initial connection. + + +.. ditaa:: + +---------+ +---------+ + | Client | | Monitor | + +---------+ +---------+ + | | + |-----+ create | + | | cluster | + |<----+ handle | + | | + |-----+ read | + | | config | + |<----+ file | + | | + | connect | + |-------------->| + | | + |<--------------| + | connected | + | | + + +Once connected, your app can invoke functions that affect the whole cluster +with only the cluster handle. For example, once you have a cluster +handle, you can: + +- Get cluster statistics +- Use Pool Operation (exists, create, list, delete) +- Get and set the configuration + + +One of the powerful features of Ceph is the ability to bind to different pools. +Each pool may have a different number of placement groups, object replicas and +replication strategies. For example, a pool could be set up as a "hot" pool that +uses SSDs for frequently used objects or a "cold" pool that uses erasure coding. + +The main difference in the various ``librados`` bindings is between C and +the object-oriented bindings for C++, Java and Python. The object-oriented +bindings use objects to represent cluster handles, IO Contexts, iterators, +exceptions, etc. + + +C Example +--------- + +For C, creating a simple cluster handle using the ``admin`` user, configuring +it and connecting to the cluster might look something like this: + +.. code-block:: c + + #include <stdio.h> + #include <stdlib.h> + #include <string.h> + #include <rados/librados.h> + + int main (int argc, const char **argv) + { + + /* Declare the cluster handle and required arguments. */ + rados_t cluster; + char cluster_name[] = "ceph"; + char user_name[] = "client.admin"; + uint64_t flags = 0; + + /* Initialize the cluster handle with the "ceph" cluster name and the "client.admin" user */ + int err; + err = rados_create2(&cluster, cluster_name, user_name, flags); + + if (err < 0) { + fprintf(stderr, "%s: Couldn't create the cluster handle! %s\n", argv[0], strerror(-err)); + exit(EXIT_FAILURE); + } else { + printf("\nCreated a cluster handle.\n"); + } + + + /* Read a Ceph configuration file to configure the cluster handle. */ + err = rados_conf_read_file(cluster, "/etc/ceph/ceph.conf"); + if (err < 0) { + fprintf(stderr, "%s: cannot read config file: %s\n", argv[0], strerror(-err)); + exit(EXIT_FAILURE); + } else { + printf("\nRead the config file.\n"); + } + + /* Read command line arguments */ + err = rados_conf_parse_argv(cluster, argc, argv); + if (err < 0) { + fprintf(stderr, "%s: cannot parse command line arguments: %s\n", argv[0], strerror(-err)); + exit(EXIT_FAILURE); + } else { + printf("\nRead the command line arguments.\n"); + } + + /* Connect to the cluster */ + err = rados_connect(cluster); + if (err < 0) { + fprintf(stderr, "%s: cannot connect to cluster: %s\n", argv[0], strerror(-err)); + exit(EXIT_FAILURE); + } else { + printf("\nConnected to the cluster.\n"); + } + + } + +Compile your client and link to ``librados`` using ``-lrados``. For example: + +.. prompt:: bash $ + + gcc ceph-client.c -lrados -o ceph-client + + +C++ Example +----------- + +The Ceph project provides a C++ example in the ``ceph/examples/librados`` +directory. For C++, a simple cluster handle using the ``admin`` user requires +you to initialize a ``librados::Rados`` cluster handle object: + +.. code-block:: c++ + + #include <iostream> + #include <string> + #include <rados/librados.hpp> + + int main(int argc, const char **argv) + { + + int ret = 0; + + /* Declare the cluster handle and required variables. */ + librados::Rados cluster; + char cluster_name[] = "ceph"; + char user_name[] = "client.admin"; + uint64_t flags = 0; + + /* Initialize the cluster handle with the "ceph" cluster name and "client.admin" user */ + { + ret = cluster.init2(user_name, cluster_name, flags); + if (ret < 0) { + std::cerr << "Couldn't initialize the cluster handle! error " << ret << std::endl; + return EXIT_FAILURE; + } else { + std::cout << "Created a cluster handle." << std::endl; + } + } + + /* Read a Ceph configuration file to configure the cluster handle. */ + { + ret = cluster.conf_read_file("/etc/ceph/ceph.conf"); + if (ret < 0) { + std::cerr << "Couldn't read the Ceph configuration file! error " << ret << std::endl; + return EXIT_FAILURE; + } else { + std::cout << "Read the Ceph configuration file." << std::endl; + } + } + + /* Read command line arguments */ + { + ret = cluster.conf_parse_argv(argc, argv); + if (ret < 0) { + std::cerr << "Couldn't parse command line options! error " << ret << std::endl; + return EXIT_FAILURE; + } else { + std::cout << "Parsed command line options." << std::endl; + } + } + + /* Connect to the cluster */ + { + ret = cluster.connect(); + if (ret < 0) { + std::cerr << "Couldn't connect to cluster! error " << ret << std::endl; + return EXIT_FAILURE; + } else { + std::cout << "Connected to the cluster." << std::endl; + } + } + + return 0; + } + + +Compile the source; then, link ``librados`` using ``-lrados``. +For example: + +.. prompt:: bash $ + + g++ -g -c ceph-client.cc -o ceph-client.o + g++ -g ceph-client.o -lrados -o ceph-client + + + +Python Example +-------------- + +Python uses the ``admin`` id and the ``ceph`` cluster name by default, and +will read the standard ``ceph.conf`` file if the conffile parameter is +set to the empty string. The Python binding converts C++ errors +into exceptions. + + +.. code-block:: python + + import rados + + try: + cluster = rados.Rados(conffile='') + except TypeError as e: + print 'Argument validation error: ', e + raise e + + print "Created cluster handle." + + try: + cluster.connect() + except Exception as e: + print "connection error: ", e + raise e + finally: + print "Connected to the cluster." + + +Execute the example to verify that it connects to your cluster: + +.. prompt:: bash $ + + python ceph-client.py + + +Java Example +------------ + +Java requires you to specify the user ID (``admin``) or user name +(``client.admin``), and uses the ``ceph`` cluster name by default . The Java +binding converts C++-based errors into exceptions. + +.. code-block:: java + + import com.ceph.rados.Rados; + import com.ceph.rados.RadosException; + + import java.io.File; + + public class CephClient { + public static void main (String args[]){ + + try { + Rados cluster = new Rados("admin"); + System.out.println("Created cluster handle."); + + File f = new File("/etc/ceph/ceph.conf"); + cluster.confReadFile(f); + System.out.println("Read the configuration file."); + + cluster.connect(); + System.out.println("Connected to the cluster."); + + } catch (RadosException e) { + System.out.println(e.getMessage() + ": " + e.getReturnValue()); + } + } + } + + +Compile the source; then, run it. If you have copied the JAR to +``/usr/share/java`` and sym linked from your ``ext`` directory, you won't need +to specify the classpath. For example: + +.. prompt:: bash $ + + javac CephClient.java + java CephClient + + +PHP Example +------------ + +With the RADOS extension enabled in PHP you can start creating a new cluster handle very easily: + +.. code-block:: php + + <?php + + $r = rados_create(); + rados_conf_read_file($r, '/etc/ceph/ceph.conf'); + if (!rados_connect($r)) { + echo "Failed to connect to Ceph cluster"; + } else { + echo "Successfully connected to Ceph cluster"; + } + + +Save this as rados.php and run the code: + +.. prompt:: bash $ + + php rados.php + + +Step 3: Creating an I/O Context +=============================== + +Once your app has a cluster handle and a connection to a Ceph Storage Cluster, +you may create an I/O Context and begin reading and writing data. An I/O Context +binds the connection to a specific pool. The user must have appropriate +`CAPS`_ permissions to access the specified pool. For example, a user with read +access but not write access will only be able to read data. I/O Context +functionality includes: + +- Write/read data and extended attributes +- List and iterate over objects and extended attributes +- Snapshot pools, list snapshots, etc. + + +.. ditaa:: + +---------+ +---------+ +---------+ + | Client | | Monitor | | OSD | + +---------+ +---------+ +---------+ + | | | + |-----+ create | | + | | I/O | | + |<----+ context | | + | | | + | write data | | + |---------------+-------------->| + | | | + | write ack | | + |<--------------+---------------| + | | | + | write xattr | | + |---------------+-------------->| + | | | + | xattr ack | | + |<--------------+---------------| + | | | + | read data | | + |---------------+-------------->| + | | | + | read ack | | + |<--------------+---------------| + | | | + | remove data | | + |---------------+-------------->| + | | | + | remove ack | | + |<--------------+---------------| + + + +RADOS enables you to interact both synchronously and asynchronously. Once your +app has an I/O Context, read/write operations only require you to know the +object/xattr name. The CRUSH algorithm encapsulated in ``librados`` uses the +cluster map to identify the appropriate OSD. OSD daemons handle the replication, +as described in `Smart Daemons Enable Hyperscale`_. The ``librados`` library also +maps objects to placement groups, as described in `Calculating PG IDs`_. + +The following examples use the default ``data`` pool. However, you may also +use the API to list pools, ensure they exist, or create and delete pools. For +the write operations, the examples illustrate how to use synchronous mode. For +the read operations, the examples illustrate how to use asynchronous mode. + +.. important:: Use caution when deleting pools with this API. If you delete + a pool, the pool and ALL DATA in the pool will be lost. + + +C Example +--------- + + +.. code-block:: c + + #include <stdio.h> + #include <stdlib.h> + #include <string.h> + #include <rados/librados.h> + + int main (int argc, const char **argv) + { + /* + * Continued from previous C example, where cluster handle and + * connection are established. First declare an I/O Context. + */ + + rados_ioctx_t io; + char *poolname = "data"; + + err = rados_ioctx_create(cluster, poolname, &io); + if (err < 0) { + fprintf(stderr, "%s: cannot open rados pool %s: %s\n", argv[0], poolname, strerror(-err)); + rados_shutdown(cluster); + exit(EXIT_FAILURE); + } else { + printf("\nCreated I/O context.\n"); + } + + /* Write data to the cluster synchronously. */ + err = rados_write(io, "hw", "Hello World!", 12, 0); + if (err < 0) { + fprintf(stderr, "%s: Cannot write object \"hw\" to pool %s: %s\n", argv[0], poolname, strerror(-err)); + rados_ioctx_destroy(io); + rados_shutdown(cluster); + exit(1); + } else { + printf("\nWrote \"Hello World\" to object \"hw\".\n"); + } + + char xattr[] = "en_US"; + err = rados_setxattr(io, "hw", "lang", xattr, 5); + if (err < 0) { + fprintf(stderr, "%s: Cannot write xattr to pool %s: %s\n", argv[0], poolname, strerror(-err)); + rados_ioctx_destroy(io); + rados_shutdown(cluster); + exit(1); + } else { + printf("\nWrote \"en_US\" to xattr \"lang\" for object \"hw\".\n"); + } + + /* + * Read data from the cluster asynchronously. + * First, set up asynchronous I/O completion. + */ + rados_completion_t comp; + err = rados_aio_create_completion(NULL, NULL, NULL, &comp); + if (err < 0) { + fprintf(stderr, "%s: Could not create aio completion: %s\n", argv[0], strerror(-err)); + rados_ioctx_destroy(io); + rados_shutdown(cluster); + exit(1); + } else { + printf("\nCreated AIO completion.\n"); + } + + /* Next, read data using rados_aio_read. */ + char read_res[100]; + err = rados_aio_read(io, "hw", comp, read_res, 12, 0); + if (err < 0) { + fprintf(stderr, "%s: Cannot read object. %s %s\n", argv[0], poolname, strerror(-err)); + rados_ioctx_destroy(io); + rados_shutdown(cluster); + exit(1); + } else { + printf("\nRead object \"hw\". The contents are:\n %s \n", read_res); + } + + /* Wait for the operation to complete */ + rados_aio_wait_for_complete(comp); + + /* Release the asynchronous I/O complete handle to avoid memory leaks. */ + rados_aio_release(comp); + + + char xattr_res[100]; + err = rados_getxattr(io, "hw", "lang", xattr_res, 5); + if (err < 0) { + fprintf(stderr, "%s: Cannot read xattr. %s %s\n", argv[0], poolname, strerror(-err)); + rados_ioctx_destroy(io); + rados_shutdown(cluster); + exit(1); + } else { + printf("\nRead xattr \"lang\" for object \"hw\". The contents are:\n %s \n", xattr_res); + } + + err = rados_rmxattr(io, "hw", "lang"); + if (err < 0) { + fprintf(stderr, "%s: Cannot remove xattr. %s %s\n", argv[0], poolname, strerror(-err)); + rados_ioctx_destroy(io); + rados_shutdown(cluster); + exit(1); + } else { + printf("\nRemoved xattr \"lang\" for object \"hw\".\n"); + } + + err = rados_remove(io, "hw"); + if (err < 0) { + fprintf(stderr, "%s: Cannot remove object. %s %s\n", argv[0], poolname, strerror(-err)); + rados_ioctx_destroy(io); + rados_shutdown(cluster); + exit(1); + } else { + printf("\nRemoved object \"hw\".\n"); + } + + } + + + +C++ Example +----------- + + +.. code-block:: c++ + + #include <iostream> + #include <string> + #include <rados/librados.hpp> + + int main(int argc, const char **argv) + { + + /* Continued from previous C++ example, where cluster handle and + * connection are established. First declare an I/O Context. + */ + + librados::IoCtx io_ctx; + const char *pool_name = "data"; + + { + ret = cluster.ioctx_create(pool_name, io_ctx); + if (ret < 0) { + std::cerr << "Couldn't set up ioctx! error " << ret << std::endl; + exit(EXIT_FAILURE); + } else { + std::cout << "Created an ioctx for the pool." << std::endl; + } + } + + + /* Write an object synchronously. */ + { + librados::bufferlist bl; + bl.append("Hello World!"); + ret = io_ctx.write_full("hw", bl); + if (ret < 0) { + std::cerr << "Couldn't write object! error " << ret << std::endl; + exit(EXIT_FAILURE); + } else { + std::cout << "Wrote new object 'hw' " << std::endl; + } + } + + + /* + * Add an xattr to the object. + */ + { + librados::bufferlist lang_bl; + lang_bl.append("en_US"); + ret = io_ctx.setxattr("hw", "lang", lang_bl); + if (ret < 0) { + std::cerr << "failed to set xattr version entry! error " + << ret << std::endl; + exit(EXIT_FAILURE); + } else { + std::cout << "Set the xattr 'lang' on our object!" << std::endl; + } + } + + + /* + * Read the object back asynchronously. + */ + { + librados::bufferlist read_buf; + int read_len = 4194304; + + //Create I/O Completion. + librados::AioCompletion *read_completion = librados::Rados::aio_create_completion(); + + //Send read request. + ret = io_ctx.aio_read("hw", read_completion, &read_buf, read_len, 0); + if (ret < 0) { + std::cerr << "Couldn't start read object! error " << ret << std::endl; + exit(EXIT_FAILURE); + } + + // Wait for the request to complete, and check that it succeeded. + read_completion->wait_for_complete(); + ret = read_completion->get_return_value(); + if (ret < 0) { + std::cerr << "Couldn't read object! error " << ret << std::endl; + exit(EXIT_FAILURE); + } else { + std::cout << "Read object hw asynchronously with contents.\n" + << read_buf.c_str() << std::endl; + } + } + + + /* + * Read the xattr. + */ + { + librados::bufferlist lang_res; + ret = io_ctx.getxattr("hw", "lang", lang_res); + if (ret < 0) { + std::cerr << "failed to get xattr version entry! error " + << ret << std::endl; + exit(EXIT_FAILURE); + } else { + std::cout << "Got the xattr 'lang' from object hw!" + << lang_res.c_str() << std::endl; + } + } + + + /* + * Remove the xattr. + */ + { + ret = io_ctx.rmxattr("hw", "lang"); + if (ret < 0) { + std::cerr << "Failed to remove xattr! error " + << ret << std::endl; + exit(EXIT_FAILURE); + } else { + std::cout << "Removed the xattr 'lang' from our object!" << std::endl; + } + } + + /* + * Remove the object. + */ + { + ret = io_ctx.remove("hw"); + if (ret < 0) { + std::cerr << "Couldn't remove object! error " << ret << std::endl; + exit(EXIT_FAILURE); + } else { + std::cout << "Removed object 'hw'." << std::endl; + } + } + } + + + +Python Example +-------------- + +.. code-block:: python + + print "\n\nI/O Context and Object Operations" + print "=================================" + + print "\nCreating a context for the 'data' pool" + if not cluster.pool_exists('data'): + raise RuntimeError('No data pool exists') + ioctx = cluster.open_ioctx('data') + + print "\nWriting object 'hw' with contents 'Hello World!' to pool 'data'." + ioctx.write("hw", "Hello World!") + print "Writing XATTR 'lang' with value 'en_US' to object 'hw'" + ioctx.set_xattr("hw", "lang", "en_US") + + + print "\nWriting object 'bm' with contents 'Bonjour tout le monde!' to pool 'data'." + ioctx.write("bm", "Bonjour tout le monde!") + print "Writing XATTR 'lang' with value 'fr_FR' to object 'bm'" + ioctx.set_xattr("bm", "lang", "fr_FR") + + print "\nContents of object 'hw'\n------------------------" + print ioctx.read("hw") + + print "\n\nGetting XATTR 'lang' from object 'hw'" + print ioctx.get_xattr("hw", "lang") + + print "\nContents of object 'bm'\n------------------------" + print ioctx.read("bm") + + print "Getting XATTR 'lang' from object 'bm'" + print ioctx.get_xattr("bm", "lang") + + + print "\nRemoving object 'hw'" + ioctx.remove_object("hw") + + print "Removing object 'bm'" + ioctx.remove_object("bm") + + +Java-Example +------------ + +.. code-block:: java + + import com.ceph.rados.Rados; + import com.ceph.rados.RadosException; + + import java.io.File; + import com.ceph.rados.IoCTX; + + public class CephClient { + public static void main (String args[]){ + + try { + Rados cluster = new Rados("admin"); + System.out.println("Created cluster handle."); + + File f = new File("/etc/ceph/ceph.conf"); + cluster.confReadFile(f); + System.out.println("Read the configuration file."); + + cluster.connect(); + System.out.println("Connected to the cluster."); + + IoCTX io = cluster.ioCtxCreate("data"); + + String oidone = "hw"; + String contentone = "Hello World!"; + io.write(oidone, contentone); + + String oidtwo = "bm"; + String contenttwo = "Bonjour tout le monde!"; + io.write(oidtwo, contenttwo); + + String[] objects = io.listObjects(); + for (String object: objects) + System.out.println(object); + + io.remove(oidone); + io.remove(oidtwo); + + cluster.ioCtxDestroy(io); + + } catch (RadosException e) { + System.out.println(e.getMessage() + ": " + e.getReturnValue()); + } + } + } + + +PHP Example +----------- + +.. code-block:: php + + <?php + + $io = rados_ioctx_create($r, "mypool"); + rados_write_full($io, "oidOne", "mycontents"); + rados_remove("oidOne"); + rados_ioctx_destroy($io); + + +Step 4: Closing Sessions +======================== + +Once your app finishes with the I/O Context and cluster handle, the app should +close the connection and shutdown the handle. For asynchronous I/O, the app +should also ensure that pending asynchronous operations have completed. + + +C Example +--------- + +.. code-block:: c + + rados_ioctx_destroy(io); + rados_shutdown(cluster); + + +C++ Example +----------- + +.. code-block:: c++ + + io_ctx.close(); + cluster.shutdown(); + + +Java Example +-------------- + +.. code-block:: java + + cluster.ioCtxDestroy(io); + cluster.shutDown(); + + +Python Example +-------------- + +.. code-block:: python + + print "\nClosing the connection." + ioctx.close() + + print "Shutting down the handle." + cluster.shutdown() + +PHP Example +----------- + +.. code-block:: php + + rados_shutdown($r); + + + +.. _user ID: ../../operations/user-management#command-line-usage +.. _CAPS: ../../operations/user-management#authorization-capabilities +.. _Installation (Quick): ../../../start +.. _Smart Daemons Enable Hyperscale: ../../../architecture#smart-daemons-enable-hyperscale +.. _Calculating PG IDs: ../../../architecture#calculating-pg-ids +.. _computes: ../../../architecture#calculating-pg-ids +.. _OSD: ../../../architecture#mapping-pgs-to-osds diff --git a/doc/rados/api/librados.rst b/doc/rados/api/librados.rst new file mode 100644 index 000000000..3e202bd4b --- /dev/null +++ b/doc/rados/api/librados.rst @@ -0,0 +1,187 @@ +============== + Librados (C) +============== + +.. highlight:: c + +`librados` provides low-level access to the RADOS service. For an +overview of RADOS, see :doc:`../../architecture`. + + +Example: connecting and writing an object +========================================= + +To use `Librados`, you instantiate a :c:type:`rados_t` variable (a cluster handle) and +call :c:func:`rados_create()` with a pointer to it:: + + int err; + rados_t cluster; + + err = rados_create(&cluster, NULL); + if (err < 0) { + fprintf(stderr, "%s: cannot create a cluster handle: %s\n", argv[0], strerror(-err)); + exit(1); + } + +Then you configure your :c:type:`rados_t` to connect to your cluster, +either by setting individual values (:c:func:`rados_conf_set()`), +using a configuration file (:c:func:`rados_conf_read_file()`), using +command line options (:c:func:`rados_conf_parse_argv`), or an +environment variable (:c:func:`rados_conf_parse_env()`):: + + err = rados_conf_read_file(cluster, "/path/to/myceph.conf"); + if (err < 0) { + fprintf(stderr, "%s: cannot read config file: %s\n", argv[0], strerror(-err)); + exit(1); + } + +Once the cluster handle is configured, you can connect to the cluster with :c:func:`rados_connect()`:: + + err = rados_connect(cluster); + if (err < 0) { + fprintf(stderr, "%s: cannot connect to cluster: %s\n", argv[0], strerror(-err)); + exit(1); + } + +Then you open an "IO context", a :c:type:`rados_ioctx_t`, with :c:func:`rados_ioctx_create()`:: + + rados_ioctx_t io; + char *poolname = "mypool"; + + err = rados_ioctx_create(cluster, poolname, &io); + if (err < 0) { + fprintf(stderr, "%s: cannot open rados pool %s: %s\n", argv[0], poolname, strerror(-err)); + rados_shutdown(cluster); + exit(1); + } + +Note that the pool you try to access must exist. + +Then you can use the RADOS data manipulation functions, for example +write into an object called ``greeting`` with +:c:func:`rados_write_full()`:: + + err = rados_write_full(io, "greeting", "hello", 5); + if (err < 0) { + fprintf(stderr, "%s: cannot write pool %s: %s\n", argv[0], poolname, strerror(-err)); + rados_ioctx_destroy(io); + rados_shutdown(cluster); + exit(1); + } + +In the end, you will want to close your IO context and connection to RADOS with :c:func:`rados_ioctx_destroy()` and :c:func:`rados_shutdown()`:: + + rados_ioctx_destroy(io); + rados_shutdown(cluster); + + +Asynchronous IO +=============== + +When doing lots of IO, you often don't need to wait for one operation +to complete before starting the next one. `Librados` provides +asynchronous versions of several operations: + +* :c:func:`rados_aio_write` +* :c:func:`rados_aio_append` +* :c:func:`rados_aio_write_full` +* :c:func:`rados_aio_read` + +For each operation, you must first create a +:c:type:`rados_completion_t` that represents what to do when the +operation is safe or complete by calling +:c:func:`rados_aio_create_completion`. If you don't need anything +special to happen, you can pass NULL:: + + rados_completion_t comp; + err = rados_aio_create_completion(NULL, NULL, NULL, &comp); + if (err < 0) { + fprintf(stderr, "%s: could not create aio completion: %s\n", argv[0], strerror(-err)); + rados_ioctx_destroy(io); + rados_shutdown(cluster); + exit(1); + } + +Now you can call any of the aio operations, and wait for it to +be in memory or on disk on all replicas:: + + err = rados_aio_write(io, "foo", comp, "bar", 3, 0); + if (err < 0) { + fprintf(stderr, "%s: could not schedule aio write: %s\n", argv[0], strerror(-err)); + rados_aio_release(comp); + rados_ioctx_destroy(io); + rados_shutdown(cluster); + exit(1); + } + rados_aio_wait_for_complete(comp); // in memory + rados_aio_wait_for_safe(comp); // on disk + +Finally, we need to free the memory used by the completion with :c:func:`rados_aio_release`:: + + rados_aio_release(comp); + +You can use the callbacks to tell your application when writes are +durable, or when read buffers are full. For example, if you wanted to +measure the latency of each operation when appending to several +objects, you could schedule several writes and store the ack and +commit time in the corresponding callback, then wait for all of them +to complete using :c:func:`rados_aio_flush` before analyzing the +latencies:: + + typedef struct { + struct timeval start; + struct timeval ack_end; + struct timeval commit_end; + } req_duration; + + void ack_callback(rados_completion_t comp, void *arg) { + req_duration *dur = (req_duration *) arg; + gettimeofday(&dur->ack_end, NULL); + } + + void commit_callback(rados_completion_t comp, void *arg) { + req_duration *dur = (req_duration *) arg; + gettimeofday(&dur->commit_end, NULL); + } + + int output_append_latency(rados_ioctx_t io, const char *data, size_t len, size_t num_writes) { + req_duration times[num_writes]; + rados_completion_t comps[num_writes]; + for (size_t i = 0; i < num_writes; ++i) { + gettimeofday(×[i].start, NULL); + int err = rados_aio_create_completion((void*) ×[i], ack_callback, commit_callback, &comps[i]); + if (err < 0) { + fprintf(stderr, "Error creating rados completion: %s\n", strerror(-err)); + return err; + } + char obj_name[100]; + snprintf(obj_name, sizeof(obj_name), "foo%ld", (unsigned long)i); + err = rados_aio_append(io, obj_name, comps[i], data, len); + if (err < 0) { + fprintf(stderr, "Error from rados_aio_append: %s", strerror(-err)); + return err; + } + } + // wait until all requests finish *and* the callbacks complete + rados_aio_flush(io); + // the latencies can now be analyzed + printf("Request # | Ack latency (s) | Commit latency (s)\n"); + for (size_t i = 0; i < num_writes; ++i) { + // don't forget to free the completions + rados_aio_release(comps[i]); + struct timeval ack_lat, commit_lat; + timersub(×[i].ack_end, ×[i].start, &ack_lat); + timersub(×[i].commit_end, ×[i].start, &commit_lat); + printf("%9ld | %8ld.%06ld | %10ld.%06ld\n", (unsigned long) i, ack_lat.tv_sec, ack_lat.tv_usec, commit_lat.tv_sec, commit_lat.tv_usec); + } + return 0; + } + +Note that all the :c:type:`rados_completion_t` must be freed with :c:func:`rados_aio_release` to avoid leaking memory. + + +API calls +========= + + .. autodoxygenfile:: rados_types.h + .. autodoxygenfile:: librados.h diff --git a/doc/rados/api/libradospp.rst b/doc/rados/api/libradospp.rst new file mode 100644 index 000000000..08483c8d4 --- /dev/null +++ b/doc/rados/api/libradospp.rst @@ -0,0 +1,9 @@ +================== + LibradosPP (C++) +================== + +.. note:: The librados C++ API is not guaranteed to be API+ABI stable + between major releases. All applications using the librados C++ API must + be recompiled and relinked against a specific Ceph release. + +.. todo:: write me! diff --git a/doc/rados/api/objclass-sdk.rst b/doc/rados/api/objclass-sdk.rst new file mode 100644 index 000000000..6b1162fd4 --- /dev/null +++ b/doc/rados/api/objclass-sdk.rst @@ -0,0 +1,37 @@ +=========================== +SDK for Ceph Object Classes +=========================== + +`Ceph` can be extended by creating shared object classes called `Ceph Object +Classes`. The existing framework to build these object classes has dependencies +on the internal functionality of `Ceph`, which restricts users to build object +classes within the tree. The aim of this project is to create an independent +object class interface, which can be used to build object classes outside the +`Ceph` tree. This allows us to have two types of object classes, 1) those that +have in-tree dependencies and reside in the tree and 2) those that can make use +of the `Ceph Object Class SDK framework` and can be built outside of the `Ceph` +tree because they do not depend on any internal implementation of `Ceph`. This +project decouples object class development from Ceph and encourages creation +and distribution of object classes as packages. + +In order to demonstrate the use of this framework, we have provided an example +called ``cls_sdk``, which is a very simple object class that makes use of the +SDK framework. This object class resides in the ``src/cls`` directory. + +Installing objclass.h +--------------------- + +The object class interface that enables out-of-tree development of object +classes resides in ``src/include/rados/`` and gets installed with `Ceph` +installation. After running ``make install``, you should be able to see it +in ``<prefix>/include/rados``. :: + + ls /usr/local/include/rados + +Using the SDK example +--------------------- + +The ``cls_sdk`` object class resides in ``src/cls/sdk/``. This gets built and +loaded into Ceph, with the Ceph build process. You can run the +``ceph_test_cls_sdk`` unittest, which resides in ``src/test/cls_sdk/``, +to test this class. diff --git a/doc/rados/api/python.rst b/doc/rados/api/python.rst new file mode 100644 index 000000000..0c9cb9e98 --- /dev/null +++ b/doc/rados/api/python.rst @@ -0,0 +1,425 @@ +=================== + Librados (Python) +=================== + +The ``rados`` module is a thin Python wrapper for ``librados``. + +Installation +============ + +To install Python libraries for Ceph, see `Getting librados for Python`_. + + +Getting Started +=============== + +You can create your own Ceph client using Python. The following tutorial will +show you how to import the Ceph Python module, connect to a Ceph cluster, and +perform object operations as a ``client.admin`` user. + +.. note:: To use the Ceph Python bindings, you must have access to a + running Ceph cluster. To set one up quickly, see `Getting Started`_. + +First, create a Python source file for your Ceph client. :: + :linenos: + + sudo vim client.py + + +Import the Module +----------------- + +To use the ``rados`` module, import it into your source file. + +.. code-block:: python + :linenos: + + import rados + + +Configure a Cluster Handle +-------------------------- + +Before connecting to the Ceph Storage Cluster, create a cluster handle. By +default, the cluster handle assumes a cluster named ``ceph`` (i.e., the default +for deployment tools, and our Getting Started guides too), and a +``client.admin`` user name. You may change these defaults to suit your needs. + +To connect to the Ceph Storage Cluster, your application needs to know where to +find the Ceph Monitor. Provide this information to your application by +specifying the path to your Ceph configuration file, which contains the location +of the initial Ceph monitors. + +.. code-block:: python + :linenos: + + import rados, sys + + #Create Handle Examples. + cluster = rados.Rados(conffile='ceph.conf') + cluster = rados.Rados(conffile=sys.argv[1]) + cluster = rados.Rados(conffile = 'ceph.conf', conf = dict (keyring = '/path/to/keyring')) + +Ensure that the ``conffile`` argument provides the path and file name of your +Ceph configuration file. You may use the ``sys`` module to avoid hard-coding the +Ceph configuration path and file name. + +Your Python client also requires a client keyring. For this example, we use the +``client.admin`` key by default. If you would like to specify the keyring when +creating the cluster handle, you may use the ``conf`` argument. Alternatively, +you may specify the keyring path in your Ceph configuration file. For example, +you may add something like the following line to your Ceph configuration file:: + + keyring = /path/to/ceph.client.admin.keyring + +For additional details on modifying your configuration via Python, see `Configuration`_. + + +Connect to the Cluster +---------------------- + +Once you have a cluster handle configured, you may connect to the cluster. +With a connection to the cluster, you may execute methods that return +information about the cluster. + +.. code-block:: python + :linenos: + :emphasize-lines: 7 + + import rados, sys + + cluster = rados.Rados(conffile='ceph.conf') + print "\nlibrados version: " + str(cluster.version()) + print "Will attempt to connect to: " + str(cluster.conf_get('mon host')) + + cluster.connect() + print "\nCluster ID: " + cluster.get_fsid() + + print "\n\nCluster Statistics" + print "==================" + cluster_stats = cluster.get_cluster_stats() + + for key, value in cluster_stats.iteritems(): + print key, value + + +By default, Ceph authentication is ``on``. Your application will need to know +the location of the keyring. The ``python-ceph`` module doesn't have the default +location, so you need to specify the keyring path. The easiest way to specify +the keyring is to add it to the Ceph configuration file. The following Ceph +configuration file example uses the ``client.admin`` keyring. + +.. code-block:: ini + :linenos: + + [global] + # ... elided configuration + keyring=/path/to/keyring/ceph.client.admin.keyring + + +Manage Pools +------------ + +When connected to the cluster, the ``Rados`` API allows you to manage pools. You +can list pools, check for the existence of a pool, create a pool and delete a +pool. + +.. code-block:: python + :linenos: + :emphasize-lines: 6, 13, 18, 25 + + print "\n\nPool Operations" + print "===============" + + print "\nAvailable Pools" + print "----------------" + pools = cluster.list_pools() + + for pool in pools: + print pool + + print "\nCreate 'test' Pool" + print "------------------" + cluster.create_pool('test') + + print "\nPool named 'test' exists: " + str(cluster.pool_exists('test')) + print "\nVerify 'test' Pool Exists" + print "-------------------------" + pools = cluster.list_pools() + + for pool in pools: + print pool + + print "\nDelete 'test' Pool" + print "------------------" + cluster.delete_pool('test') + print "\nPool named 'test' exists: " + str(cluster.pool_exists('test')) + + + +Input/Output Context +-------------------- + +Reading from and writing to the Ceph Storage Cluster requires an input/output +context (ioctx). You can create an ioctx with the ``open_ioctx()`` or +``open_ioctx2()`` method of the ``Rados`` class. The ``ioctx_name`` parameter +is the name of the pool and ``pool_id`` is the ID of the pool you wish to use. + +.. code-block:: python + :linenos: + + ioctx = cluster.open_ioctx('data') + + +or + +.. code-block:: python + :linenos: + + ioctx = cluster.open_ioctx2(pool_id) + + +Once you have an I/O context, you can read/write objects, extended attributes, +and perform a number of other operations. After you complete operations, ensure +that you close the connection. For example: + +.. code-block:: python + :linenos: + + print "\nClosing the connection." + ioctx.close() + + +Writing, Reading and Removing Objects +------------------------------------- + +Once you create an I/O context, you can write objects to the cluster. If you +write to an object that doesn't exist, Ceph creates it. If you write to an +object that exists, Ceph overwrites it (except when you specify a range, and +then it only overwrites the range). You may read objects (and object ranges) +from the cluster. You may also remove objects from the cluster. For example: + +.. code-block:: python + :linenos: + :emphasize-lines: 2, 5, 8 + + print "\nWriting object 'hw' with contents 'Hello World!' to pool 'data'." + ioctx.write_full("hw", "Hello World!") + + print "\n\nContents of object 'hw'\n------------------------\n" + print ioctx.read("hw") + + print "\nRemoving object 'hw'" + ioctx.remove_object("hw") + + +Writing and Reading XATTRS +-------------------------- + +Once you create an object, you can write extended attributes (XATTRs) to +the object and read XATTRs from the object. For example: + +.. code-block:: python + :linenos: + :emphasize-lines: 2, 5 + + print "\n\nWriting XATTR 'lang' with value 'en_US' to object 'hw'" + ioctx.set_xattr("hw", "lang", "en_US") + + print "\n\nGetting XATTR 'lang' from object 'hw'\n" + print ioctx.get_xattr("hw", "lang") + + +Listing Objects +--------------- + +If you want to examine the list of objects in a pool, you may +retrieve the list of objects and iterate over them with the object iterator. +For example: + +.. code-block:: python + :linenos: + :emphasize-lines: 1, 6, 7 + + object_iterator = ioctx.list_objects() + + while True : + + try : + rados_object = object_iterator.next() + print "Object contents = " + rados_object.read() + + except StopIteration : + break + +The ``Object`` class provides a file-like interface to an object, allowing +you to read and write content and extended attributes. Object operations using +the I/O context provide additional functionality and asynchronous capabilities. + + +Cluster Handle API +================== + +The ``Rados`` class provides an interface into the Ceph Storage Daemon. + + +Configuration +------------- + +The ``Rados`` class provides methods for getting and setting configuration +values, reading the Ceph configuration file, and parsing arguments. You +do not need to be connected to the Ceph Storage Cluster to invoke the following +methods. See `Storage Cluster Configuration`_ for details on settings. + +.. currentmodule:: rados +.. automethod:: Rados.conf_get(option) +.. automethod:: Rados.conf_set(option, val) +.. automethod:: Rados.conf_read_file(path=None) +.. automethod:: Rados.conf_parse_argv(args) +.. automethod:: Rados.version() + + +Connection Management +--------------------- + +Once you configure your cluster handle, you may connect to the cluster, check +the cluster ``fsid``, retrieve cluster statistics, and disconnect (shutdown) +from the cluster. You may also assert that the cluster handle is in a particular +state (e.g., "configuring", "connecting", etc.). + +.. automethod:: Rados.connect(timeout=0) +.. automethod:: Rados.shutdown() +.. automethod:: Rados.get_fsid() +.. automethod:: Rados.get_cluster_stats() + +.. documented manually because it raises warnings because of *args usage in the +.. signature + +.. py:class:: Rados + + .. py:method:: require_state(*args) + + Checks if the Rados object is in a special state + + :param args: Any number of states to check as separate arguments + :raises: :class:`RadosStateError` + + +Pool Operations +--------------- + +To use pool operation methods, you must connect to the Ceph Storage Cluster +first. You may list the available pools, create a pool, check to see if a pool +exists, and delete a pool. + +.. automethod:: Rados.list_pools() +.. automethod:: Rados.create_pool(pool_name, crush_rule=None) +.. automethod:: Rados.pool_exists() +.. automethod:: Rados.delete_pool(pool_name) + + +CLI Commands +------------ + +The Ceph CLI command is internally using the following librados Python binding methods. + +In order to send a command, choose the correct method and choose the correct target. + +.. automethod:: Rados.mon_command +.. automethod:: Rados.osd_command +.. automethod:: Rados.mgr_command +.. automethod:: Rados.pg_command + + +Input/Output Context API +======================== + +To write data to and read data from the Ceph Object Store, you must create +an Input/Output context (ioctx). The `Rados` class provides `open_ioctx()` +and `open_ioctx2()` methods. The remaining ``ioctx`` operations involve +invoking methods of the `Ioctx` and other classes. + +.. automethod:: Rados.open_ioctx(ioctx_name) +.. automethod:: Ioctx.require_ioctx_open() +.. automethod:: Ioctx.get_stats() +.. automethod:: Ioctx.get_last_version() +.. automethod:: Ioctx.close() + + +.. Pool Snapshots +.. -------------- + +.. The Ceph Storage Cluster allows you to make a snapshot of a pool's state. +.. Whereas, basic pool operations only require a connection to the cluster, +.. snapshots require an I/O context. + +.. Ioctx.create_snap(self, snap_name) +.. Ioctx.list_snaps(self) +.. SnapIterator.next(self) +.. Snap.get_timestamp(self) +.. Ioctx.lookup_snap(self, snap_name) +.. Ioctx.remove_snap(self, snap_name) + +.. not published. This doesn't seem ready yet. + +Object Operations +----------------- + +The Ceph Storage Cluster stores data as objects. You can read and write objects +synchronously or asynchronously. You can read and write from offsets. An object +has a name (or key) and data. + + +.. automethod:: Ioctx.aio_write(object_name, to_write, offset=0, oncomplete=None, onsafe=None) +.. automethod:: Ioctx.aio_write_full(object_name, to_write, oncomplete=None, onsafe=None) +.. automethod:: Ioctx.aio_append(object_name, to_append, oncomplete=None, onsafe=None) +.. automethod:: Ioctx.write(key, data, offset=0) +.. automethod:: Ioctx.write_full(key, data) +.. automethod:: Ioctx.aio_flush() +.. automethod:: Ioctx.set_locator_key(loc_key) +.. automethod:: Ioctx.aio_read(object_name, length, offset, oncomplete) +.. automethod:: Ioctx.read(key, length=8192, offset=0) +.. automethod:: Ioctx.stat(key) +.. automethod:: Ioctx.trunc(key, size) +.. automethod:: Ioctx.remove_object(key) + + +Object Extended Attributes +-------------------------- + +You may set extended attributes (XATTRs) on an object. You can retrieve a list +of objects or XATTRs and iterate over them. + +.. automethod:: Ioctx.set_xattr(key, xattr_name, xattr_value) +.. automethod:: Ioctx.get_xattrs(oid) +.. automethod:: XattrIterator.__next__() +.. automethod:: Ioctx.get_xattr(key, xattr_name) +.. automethod:: Ioctx.rm_xattr(key, xattr_name) + + + +Object Interface +================ + +From an I/O context, you can retrieve a list of objects from a pool and iterate +over them. The object interface provide makes each object look like a file, and +you may perform synchronous operations on the objects. For asynchronous +operations, you should use the I/O context methods. + +.. automethod:: Ioctx.list_objects() +.. automethod:: ObjectIterator.__next__() +.. automethod:: Object.read(length = 1024*1024) +.. automethod:: Object.write(string_to_write) +.. automethod:: Object.get_xattrs() +.. automethod:: Object.get_xattr(xattr_name) +.. automethod:: Object.set_xattr(xattr_name, xattr_value) +.. automethod:: Object.rm_xattr(xattr_name) +.. automethod:: Object.stat() +.. automethod:: Object.remove() + + + + +.. _Getting Started: ../../../start +.. _Storage Cluster Configuration: ../../configuration +.. _Getting librados for Python: ../librados-intro#getting-librados-for-python diff --git a/doc/rados/command/list-inconsistent-obj.json b/doc/rados/command/list-inconsistent-obj.json new file mode 100644 index 000000000..2bdc5f74c --- /dev/null +++ b/doc/rados/command/list-inconsistent-obj.json @@ -0,0 +1,237 @@ +{ + "$schema": "http://json-schema.org/draft-04/schema#", + "type": "object", + "properties": { + "epoch": { + "description": "Scrub epoch", + "type": "integer" + }, + "inconsistents": { + "type": "array", + "items": { + "type": "object", + "properties": { + "object": { + "description": "Identify a Ceph object", + "type": "object", + "properties": { + "name": { + "type": "string" + }, + "nspace": { + "type": "string" + }, + "locator": { + "type": "string" + }, + "version": { + "type": "integer", + "minimum": 0 + }, + "snap": { + "oneOf": [ + { + "type": "string", + "enum": [ "head", "snapdir" ] + }, + { + "type": "integer", + "minimum": 0 + } + ] + } + }, + "required": [ + "name", + "nspace", + "locator", + "version", + "snap" + ] + }, + "selected_object_info": { + "type": "object", + "description": "Selected object information", + "additionalProperties": true + }, + "union_shard_errors": { + "description": "Union of all shard errors", + "type": "array", + "items": { + "enum": [ + "missing", + "stat_error", + "read_error", + "data_digest_mismatch_info", + "omap_digest_mismatch_info", + "size_mismatch_info", + "ec_hash_error", + "ec_size_error", + "info_missing", + "info_corrupted", + "obj_size_info_mismatch", + "snapset_missing", + "snapset_corrupted", + "hinfo_missing", + "hinfo_corrupted" + ] + }, + "minItems": 0, + "uniqueItems": true + }, + "errors": { + "description": "Errors related to the analysis of this object", + "type": "array", + "items": { + "enum": [ + "object_info_inconsistency", + "data_digest_mismatch", + "omap_digest_mismatch", + "size_mismatch", + "attr_value_mismatch", + "attr_name_mismatch", + "snapset_inconsistency", + "hinfo_inconsistency", + "size_too_large" + ] + }, + "minItems": 0, + "uniqueItems": true + }, + "shards": { + "description": "All found or expected shards", + "type": "array", + "items": { + "description": "Information about a particular shard of object", + "type": "object", + "properties": { + "object_info": { + "oneOf": [ + { + "type": "string" + }, + { + "type": "object", + "description": "Object information", + "additionalProperties": true + } + ] + }, + "snapset": { + "oneOf": [ + { + "type": "string" + }, + { + "type": "object", + "description": "Snap set information", + "additionalProperties": true + } + ] + }, + "hashinfo": { + "oneOf": [ + { + "type": "string" + }, + { + "type": "object", + "description": "Erasure code hash information", + "additionalProperties": true + } + ] + }, + "shard": { + "type": "integer" + }, + "osd": { + "type": "integer" + }, + "primary": { + "type": "boolean" + }, + "size": { + "type": "integer" + }, + "omap_digest": { + "description": "Hex representation (e.g. 0x1abd1234)", + "type": "string" + }, + "data_digest": { + "description": "Hex representation (e.g. 0x1abd1234)", + "type": "string" + }, + "errors": { + "description": "Errors with this shard", + "type": "array", + "items": { + "enum": [ + "missing", + "stat_error", + "read_error", + "data_digest_mismatch_info", + "omap_digest_mismatch_info", + "size_mismatch_info", + "ec_hash_error", + "ec_size_error", + "info_missing", + "info_corrupted", + "obj_size_info_mismatch", + "snapset_missing", + "snapset_corrupted", + "hinfo_missing", + "hinfo_corrupted" + ] + }, + "minItems": 0, + "uniqueItems": true + }, + "attrs": { + "description": "If any shard's attr error is set then all attrs are here", + "type": "array", + "items": { + "description": "Information about a particular shard of object", + "type": "object", + "properties": { + "name": { + "type": "string" + }, + "value": { + "type": "string" + }, + "Base64": { + "type": "boolean" + } + }, + "required": [ + "name", + "value", + "Base64" + ], + "additionalProperties": false + } + } + }, + "additionalProperties": false, + "required": [ + "osd", + "primary", + "errors" + ] + } + } + }, + "required": [ + "object", + "union_shard_errors", + "errors", + "shards" + ] + } + } + }, + "required": [ + "epoch", + "inconsistents" + ] +} diff --git a/doc/rados/command/list-inconsistent-snap.json b/doc/rados/command/list-inconsistent-snap.json new file mode 100644 index 000000000..55f1d53e9 --- /dev/null +++ b/doc/rados/command/list-inconsistent-snap.json @@ -0,0 +1,86 @@ +{ + "$schema": "http://json-schema.org/draft-04/schema#", + "type": "object", + "properties": { + "epoch": { + "description": "Scrub epoch", + "type": "integer" + }, + "inconsistents": { + "type": "array", + "items": { + "type": "object", + "properties": { + "name": { + "type": "string" + }, + "nspace": { + "type": "string" + }, + "locator": { + "type": "string" + }, + "snap": { + "oneOf": [ + { + "type": "string", + "enum": [ + "head", + "snapdir" + ] + }, + { + "type": "integer", + "minimum": 0 + } + ] + }, + "errors": { + "description": "Errors for this object's snap", + "type": "array", + "items": { + "enum": [ + "snapset_missing", + "snapset_corrupted", + "info_missing", + "info_corrupted", + "snapset_error", + "headless", + "size_mismatch", + "extra_clones", + "clone_missing" + ] + }, + "minItems": 0, + "uniqueItems": true + }, + "missing": { + "description": "List of missing clones if clone_missing error set", + "type": "array", + "items": { + "type": "integer" + } + }, + "extra_clones": { + "description": "List of extra clones if extra_clones error set", + "type": "array", + "items": { + "type": "integer" + } + } + }, + "required": [ + "name", + "nspace", + "locator", + "snap", + "errors" + ] + } + } + }, + "required": [ + "epoch", + "inconsistents" + ] +} diff --git a/doc/rados/configuration/auth-config-ref.rst b/doc/rados/configuration/auth-config-ref.rst new file mode 100644 index 000000000..5cc13ff6a --- /dev/null +++ b/doc/rados/configuration/auth-config-ref.rst @@ -0,0 +1,362 @@ +======================== + Cephx Config Reference +======================== + +The ``cephx`` protocol is enabled by default. Cryptographic authentication has +some computational costs, though they should generally be quite low. If the +network environment connecting your client and server hosts is very safe and +you cannot afford authentication, you can turn it off. **This is not generally +recommended**. + +.. note:: If you disable authentication, you are at risk of a man-in-the-middle + attack altering your client/server messages, which could lead to disastrous + security effects. + +For creating users, see `User Management`_. For details on the architecture +of Cephx, see `Architecture - High Availability Authentication`_. + + +Deployment Scenarios +==================== + +There are two main scenarios for deploying a Ceph cluster, which impact +how you initially configure Cephx. Most first time Ceph users use +``cephadm`` to create a cluster (easiest). For clusters using +other deployment tools (e.g., Chef, Juju, Puppet, etc.), you will need +to use the manual procedures or configure your deployment tool to +bootstrap your monitor(s). + +Manual Deployment +----------------- + +When you deploy a cluster manually, you have to bootstrap the monitor manually +and create the ``client.admin`` user and keyring. To bootstrap monitors, follow +the steps in `Monitor Bootstrapping`_. The steps for monitor bootstrapping are +the logical steps you must perform when using third party deployment tools like +Chef, Puppet, Juju, etc. + + +Enabling/Disabling Cephx +======================== + +Enabling Cephx requires that you have deployed keys for your monitors, +OSDs and metadata servers. If you are simply toggling Cephx on / off, +you do not have to repeat the bootstrapping procedures. + + +Enabling Cephx +-------------- + +When ``cephx`` is enabled, Ceph will look for the keyring in the default search +path, which includes ``/etc/ceph/$cluster.$name.keyring``. You can override +this location by adding a ``keyring`` option in the ``[global]`` section of +your `Ceph configuration`_ file, but this is not recommended. + +Execute the following procedures to enable ``cephx`` on a cluster with +authentication disabled. If you (or your deployment utility) have already +generated the keys, you may skip the steps related to generating keys. + +#. Create a ``client.admin`` key, and save a copy of the key for your client + host + + .. prompt:: bash $ + + ceph auth get-or-create client.admin mon 'allow *' mds 'allow *' mgr 'allow *' osd 'allow *' -o /etc/ceph/ceph.client.admin.keyring + + **Warning:** This will clobber any existing + ``/etc/ceph/client.admin.keyring`` file. Do not perform this step if a + deployment tool has already done it for you. Be careful! + +#. Create a keyring for your monitor cluster and generate a monitor + secret key. + + .. prompt:: bash $ + + ceph-authtool --create-keyring /tmp/ceph.mon.keyring --gen-key -n mon. --cap mon 'allow *' + +#. Copy the monitor keyring into a ``ceph.mon.keyring`` file in every monitor's + ``mon data`` directory. For example, to copy it to ``mon.a`` in cluster ``ceph``, + use the following + + .. prompt:: bash $ + + cp /tmp/ceph.mon.keyring /var/lib/ceph/mon/ceph-a/keyring + +#. Generate a secret key for every MGR, where ``{$id}`` is the MGR letter + + .. prompt:: bash $ + + ceph auth get-or-create mgr.{$id} mon 'allow profile mgr' mds 'allow *' osd 'allow *' -o /var/lib/ceph/mgr/ceph-{$id}/keyring + +#. Generate a secret key for every OSD, where ``{$id}`` is the OSD number + + .. prompt:: bash $ + + ceph auth get-or-create osd.{$id} mon 'allow rwx' osd 'allow *' -o /var/lib/ceph/osd/ceph-{$id}/keyring + +#. Generate a secret key for every MDS, where ``{$id}`` is the MDS letter + + .. prompt:: bash $ + + ceph auth get-or-create mds.{$id} mon 'allow rwx' osd 'allow *' mds 'allow *' mgr 'allow profile mds' -o /var/lib/ceph/mds/ceph-{$id}/keyring + +#. Enable ``cephx`` authentication by setting the following options in the + ``[global]`` section of your `Ceph configuration`_ file + + .. code-block:: ini + + auth_cluster_required = cephx + auth_service_required = cephx + auth_client_required = cephx + + +#. Start or restart the Ceph cluster. See `Operating a Cluster`_ for details. + +For details on bootstrapping a monitor manually, see `Manual Deployment`_. + + + +Disabling Cephx +--------------- + +The following procedure describes how to disable Cephx. If your cluster +environment is relatively safe, you can offset the computation expense of +running authentication. **We do not recommend it.** However, it may be easier +during setup and/or troubleshooting to temporarily disable authentication. + +#. Disable ``cephx`` authentication by setting the following options in the + ``[global]`` section of your `Ceph configuration`_ file + + .. code-block:: ini + + auth_cluster_required = none + auth_service_required = none + auth_client_required = none + + +#. Start or restart the Ceph cluster. See `Operating a Cluster`_ for details. + + +Configuration Settings +====================== + +Enablement +---------- + + +``auth_cluster_required`` + +:Description: If enabled, the Ceph Storage Cluster daemons (i.e., ``ceph-mon``, + ``ceph-osd``, ``ceph-mds`` and ``ceph-mgr``) must authenticate with + each other. Valid settings are ``cephx`` or ``none``. + +:Type: String +:Required: No +:Default: ``cephx``. + + +``auth_service_required`` + +:Description: If enabled, the Ceph Storage Cluster daemons require Ceph Clients + to authenticate with the Ceph Storage Cluster in order to access + Ceph services. Valid settings are ``cephx`` or ``none``. + +:Type: String +:Required: No +:Default: ``cephx``. + + +``auth_client_required`` + +:Description: If enabled, the Ceph Client requires the Ceph Storage Cluster to + authenticate with the Ceph Client. Valid settings are ``cephx`` + or ``none``. + +:Type: String +:Required: No +:Default: ``cephx``. + + +.. index:: keys; keyring + +Keys +---- + +When you run Ceph with authentication enabled, ``ceph`` administrative commands +and Ceph Clients require authentication keys to access the Ceph Storage Cluster. + +The most common way to provide these keys to the ``ceph`` administrative +commands and clients is to include a Ceph keyring under the ``/etc/ceph`` +directory. For Octopus and later releases using ``cephadm``, the filename +is usually ``ceph.client.admin.keyring`` (or ``$cluster.client.admin.keyring``). +If you include the keyring under the ``/etc/ceph`` directory, you don't need to +specify a ``keyring`` entry in your Ceph configuration file. + +We recommend copying the Ceph Storage Cluster's keyring file to nodes where you +will run administrative commands, because it contains the ``client.admin`` key. + +To perform this step manually, execute the following: + +.. prompt:: bash $ + + sudo scp {user}@{ceph-cluster-host}:/etc/ceph/ceph.client.admin.keyring /etc/ceph/ceph.client.admin.keyring + +.. tip:: Ensure the ``ceph.keyring`` file has appropriate permissions set + (e.g., ``chmod 644``) on your client machine. + +You may specify the key itself in the Ceph configuration file using the ``key`` +setting (not recommended), or a path to a keyfile using the ``keyfile`` setting. + + +``keyring`` + +:Description: The path to the keyring file. +:Type: String +:Required: No +:Default: ``/etc/ceph/$cluster.$name.keyring,/etc/ceph/$cluster.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin`` + + +``keyfile`` + +:Description: The path to a key file (i.e,. a file containing only the key). +:Type: String +:Required: No +:Default: None + + +``key`` + +:Description: The key (i.e., the text string of the key itself). Not recommended. +:Type: String +:Required: No +:Default: None + + +Daemon Keyrings +--------------- + +Administrative users or deployment tools (e.g., ``cephadm``) may generate +daemon keyrings in the same way as generating user keyrings. By default, Ceph +stores daemons keyrings inside their data directory. The default keyring +locations, and the capabilities necessary for the daemon to function, are shown +below. + +``ceph-mon`` + +:Location: ``$mon_data/keyring`` +:Capabilities: ``mon 'allow *'`` + +``ceph-osd`` + +:Location: ``$osd_data/keyring`` +:Capabilities: ``mgr 'allow profile osd' mon 'allow profile osd' osd 'allow *'`` + +``ceph-mds`` + +:Location: ``$mds_data/keyring`` +:Capabilities: ``mds 'allow' mgr 'allow profile mds' mon 'allow profile mds' osd 'allow rwx'`` + +``ceph-mgr`` + +:Location: ``$mgr_data/keyring`` +:Capabilities: ``mon 'allow profile mgr' mds 'allow *' osd 'allow *'`` + +``radosgw`` + +:Location: ``$rgw_data/keyring`` +:Capabilities: ``mon 'allow rwx' osd 'allow rwx'`` + + +.. note:: The monitor keyring (i.e., ``mon.``) contains a key but no + capabilities, and is not part of the cluster ``auth`` database. + +The daemon data directory locations default to directories of the form:: + + /var/lib/ceph/$type/$cluster-$id + +For example, ``osd.12`` would be:: + + /var/lib/ceph/osd/ceph-12 + +You can override these locations, but it is not recommended. + + +.. index:: signatures + +Signatures +---------- + +Ceph performs a signature check that provides some limited protection +against messages being tampered with in flight (e.g., by a "man in the +middle" attack). + +Like other parts of Ceph authentication, Ceph provides fine-grained control so +you can enable/disable signatures for service messages between clients and +Ceph, and so you can enable/disable signatures for messages between Ceph daemons. + +Note that even with signatures enabled data is not encrypted in +flight. + +``cephx_require_signatures`` + +:Description: If set to ``true``, Ceph requires signatures on all message + traffic between the Ceph Client and the Ceph Storage Cluster, and + between daemons comprising the Ceph Storage Cluster. + + Ceph Argonaut and Linux kernel versions prior to 3.19 do + not support signatures; if such clients are in use this + option can be turned off to allow them to connect. + +:Type: Boolean +:Required: No +:Default: ``false`` + + +``cephx_cluster_require_signatures`` + +:Description: If set to ``true``, Ceph requires signatures on all message + traffic between Ceph daemons comprising the Ceph Storage Cluster. + +:Type: Boolean +:Required: No +:Default: ``false`` + + +``cephx_service_require_signatures`` + +:Description: If set to ``true``, Ceph requires signatures on all message + traffic between Ceph Clients and the Ceph Storage Cluster. + +:Type: Boolean +:Required: No +:Default: ``false`` + + +``cephx_sign_messages`` + +:Description: If the Ceph version supports message signing, Ceph will sign + all messages so they are more difficult to spoof. + +:Type: Boolean +:Default: ``true`` + + +Time to Live +------------ + +``auth_service_ticket_ttl`` + +:Description: When the Ceph Storage Cluster sends a Ceph Client a ticket for + authentication, the Ceph Storage Cluster assigns the ticket a + time to live. + +:Type: Double +:Default: ``60*60`` + + +.. _Monitor Bootstrapping: ../../../install/manual-deployment#monitor-bootstrapping +.. _Operating a Cluster: ../../operations/operating +.. _Manual Deployment: ../../../install/manual-deployment +.. _Ceph configuration: ../ceph-conf +.. _Architecture - High Availability Authentication: ../../../architecture#high-availability-authentication +.. _User Management: ../../operations/user-management diff --git a/doc/rados/configuration/bluestore-config-ref.rst b/doc/rados/configuration/bluestore-config-ref.rst new file mode 100644 index 000000000..3bfc8e295 --- /dev/null +++ b/doc/rados/configuration/bluestore-config-ref.rst @@ -0,0 +1,482 @@ +========================== +BlueStore Config Reference +========================== + +Devices +======= + +BlueStore manages either one, two, or (in certain cases) three storage +devices. + +In the simplest case, BlueStore consumes a single (primary) storage device. +The storage device is normally used as a whole, occupying the full device that +is managed directly by BlueStore. This *primary device* is normally identified +by a ``block`` symlink in the data directory. + +The data directory is a ``tmpfs`` mount which gets populated (at boot time, or +when ``ceph-volume`` activates it) with all the common OSD files that hold +information about the OSD, like: its identifier, which cluster it belongs to, +and its private keyring. + +It is also possible to deploy BlueStore across one or two additional devices: + +* A *write-ahead log (WAL) device* (identified as ``block.wal`` in the data directory) can be + used for BlueStore's internal journal or write-ahead log. It is only useful + to use a WAL device if the device is faster than the primary device (e.g., + when it is on an SSD and the primary device is an HDD). +* A *DB device* (identified as ``block.db`` in the data directory) can be used + for storing BlueStore's internal metadata. BlueStore (or rather, the + embedded RocksDB) will put as much metadata as it can on the DB device to + improve performance. If the DB device fills up, metadata will spill back + onto the primary device (where it would have been otherwise). Again, it is + only helpful to provision a DB device if it is faster than the primary + device. + +If there is only a small amount of fast storage available (e.g., less +than a gigabyte), we recommend using it as a WAL device. If there is +more, provisioning a DB device makes more sense. The BlueStore +journal will always be placed on the fastest device available, so +using a DB device will provide the same benefit that the WAL device +would while *also* allowing additional metadata to be stored there (if +it will fit). This means that if a DB device is specified but an explicit +WAL device is not, the WAL will be implicitly colocated with the DB on the faster +device. + +A single-device (colocated) BlueStore OSD can be provisioned with: + +.. prompt:: bash $ + + ceph-volume lvm prepare --bluestore --data <device> + +To specify a WAL device and/or DB device: + +.. prompt:: bash $ + + ceph-volume lvm prepare --bluestore --data <device> --block.wal <wal-device> --block.db <db-device> + +.. note:: ``--data`` can be a Logical Volume using *vg/lv* notation. Other + devices can be existing logical volumes or GPT partitions. + +Provisioning strategies +----------------------- +Although there are multiple ways to deploy a BlueStore OSD (unlike Filestore +which had just one), there are two common arrangements that should help clarify +the deployment strategy: + +.. _bluestore-single-type-device-config: + +**block (data) only** +^^^^^^^^^^^^^^^^^^^^^ +If all devices are the same type, for example all rotational drives, and +there are no fast devices to use for metadata, it makes sense to specify the +block device only and to not separate ``block.db`` or ``block.wal``. The +:ref:`ceph-volume-lvm` command for a single ``/dev/sda`` device looks like: + +.. prompt:: bash $ + + ceph-volume lvm create --bluestore --data /dev/sda + +If logical volumes have already been created for each device, (a single LV +using 100% of the device), then the :ref:`ceph-volume-lvm` call for an LV named +``ceph-vg/block-lv`` would look like: + +.. prompt:: bash $ + + ceph-volume lvm create --bluestore --data ceph-vg/block-lv + +.. _bluestore-mixed-device-config: + +**block and block.db** +^^^^^^^^^^^^^^^^^^^^^^ +If you have a mix of fast and slow devices (SSD / NVMe and rotational), +it is recommended to place ``block.db`` on the faster device while ``block`` +(data) lives on the slower (spinning drive). + +You must create these volume groups and logical volumes manually as +the ``ceph-volume`` tool is currently not able to do so automatically. + +For the below example, let us assume four rotational (``sda``, ``sdb``, ``sdc``, and ``sdd``) +and one (fast) solid state drive (``sdx``). First create the volume groups: + +.. prompt:: bash $ + + vgcreate ceph-block-0 /dev/sda + vgcreate ceph-block-1 /dev/sdb + vgcreate ceph-block-2 /dev/sdc + vgcreate ceph-block-3 /dev/sdd + +Now create the logical volumes for ``block``: + +.. prompt:: bash $ + + lvcreate -l 100%FREE -n block-0 ceph-block-0 + lvcreate -l 100%FREE -n block-1 ceph-block-1 + lvcreate -l 100%FREE -n block-2 ceph-block-2 + lvcreate -l 100%FREE -n block-3 ceph-block-3 + +We are creating 4 OSDs for the four slow spinning devices, so assuming a 200GB +SSD in ``/dev/sdx`` we will create 4 logical volumes, each of 50GB: + +.. prompt:: bash $ + + vgcreate ceph-db-0 /dev/sdx + lvcreate -L 50GB -n db-0 ceph-db-0 + lvcreate -L 50GB -n db-1 ceph-db-0 + lvcreate -L 50GB -n db-2 ceph-db-0 + lvcreate -L 50GB -n db-3 ceph-db-0 + +Finally, create the 4 OSDs with ``ceph-volume``: + +.. prompt:: bash $ + + ceph-volume lvm create --bluestore --data ceph-block-0/block-0 --block.db ceph-db-0/db-0 + ceph-volume lvm create --bluestore --data ceph-block-1/block-1 --block.db ceph-db-0/db-1 + ceph-volume lvm create --bluestore --data ceph-block-2/block-2 --block.db ceph-db-0/db-2 + ceph-volume lvm create --bluestore --data ceph-block-3/block-3 --block.db ceph-db-0/db-3 + +These operations should end up creating four OSDs, with ``block`` on the slower +rotational drives with a 50 GB logical volume (DB) for each on the solid state +drive. + +Sizing +====== +When using a :ref:`mixed spinning and solid drive setup +<bluestore-mixed-device-config>` it is important to make a large enough +``block.db`` logical volume for BlueStore. Generally, ``block.db`` should have +*as large as possible* logical volumes. + +The general recommendation is to have ``block.db`` size in between 1% to 4% +of ``block`` size. For RGW workloads, it is recommended that the ``block.db`` +size isn't smaller than 4% of ``block``, because RGW heavily uses it to store +metadata (omap keys). For example, if the ``block`` size is 1TB, then ``block.db`` shouldn't +be less than 40GB. For RBD workloads, 1% to 2% of ``block`` size is usually enough. + +In older releases, internal level sizes mean that the DB can fully utilize only +specific partition / LV sizes that correspond to sums of L0, L0+L1, L1+L2, +etc. sizes, which with default settings means roughly 3 GB, 30 GB, 300 GB, and +so forth. Most deployments will not substantially benefit from sizing to +accommodate L3 and higher, though DB compaction can be facilitated by doubling +these figures to 6GB, 60GB, and 600GB. + +Improvements in releases beginning with Nautilus 14.2.12 and Octopus 15.2.6 +enable better utilization of arbitrary DB device sizes, and the Pacific +release brings experimental dynamic level support. Users of older releases may +thus wish to plan ahead by provisioning larger DB devices today so that their +benefits may be realized with future upgrades. + +When *not* using a mix of fast and slow devices, it isn't required to create +separate logical volumes for ``block.db`` (or ``block.wal``). BlueStore will +automatically colocate these within the space of ``block``. + + +Automatic Cache Sizing +====================== + +BlueStore can be configured to automatically resize its caches when TCMalloc +is configured as the memory allocator and the ``bluestore_cache_autotune`` +setting is enabled. This option is currently enabled by default. BlueStore +will attempt to keep OSD heap memory usage under a designated target size via +the ``osd_memory_target`` configuration option. This is a best effort +algorithm and caches will not shrink smaller than the amount specified by +``osd_memory_cache_min``. Cache ratios will be chosen based on a hierarchy +of priorities. If priority information is not available, the +``bluestore_cache_meta_ratio`` and ``bluestore_cache_kv_ratio`` options are +used as fallbacks. + +Manual Cache Sizing +=================== + +The amount of memory consumed by each OSD for BlueStore caches is +determined by the ``bluestore_cache_size`` configuration option. If +that config option is not set (i.e., remains at 0), there is a +different default value that is used depending on whether an HDD or +SSD is used for the primary device (set by the +``bluestore_cache_size_ssd`` and ``bluestore_cache_size_hdd`` config +options). + +BlueStore and the rest of the Ceph OSD daemon do the best they can +to work within this memory budget. Note that on top of the configured +cache size, there is also memory consumed by the OSD itself, and +some additional utilization due to memory fragmentation and other +allocator overhead. + +The configured cache memory budget can be used in a few different ways: + +* Key/Value metadata (i.e., RocksDB's internal cache) +* BlueStore metadata +* BlueStore data (i.e., recently read or written object data) + +Cache memory usage is governed by the following options: +``bluestore_cache_meta_ratio`` and ``bluestore_cache_kv_ratio``. +The fraction of the cache devoted to data +is governed by the effective bluestore cache size (depending on +``bluestore_cache_size[_ssd|_hdd]`` settings and the device class of the primary +device) as well as the meta and kv ratios. +The data fraction can be calculated by +``<effective_cache_size> * (1 - bluestore_cache_meta_ratio - bluestore_cache_kv_ratio)`` + +Checksums +========= + +BlueStore checksums all metadata and data written to disk. Metadata +checksumming is handled by RocksDB and uses `crc32c`. Data +checksumming is done by BlueStore and can make use of `crc32c`, +`xxhash32`, or `xxhash64`. The default is `crc32c` and should be +suitable for most purposes. + +Full data checksumming does increase the amount of metadata that +BlueStore must store and manage. When possible, e.g., when clients +hint that data is written and read sequentially, BlueStore will +checksum larger blocks, but in many cases it must store a checksum +value (usually 4 bytes) for every 4 kilobyte block of data. + +It is possible to use a smaller checksum value by truncating the +checksum to two or one byte, reducing the metadata overhead. The +trade-off is that the probability that a random error will not be +detected is higher with a smaller checksum, going from about one in +four billion with a 32-bit (4 byte) checksum to one in 65,536 for a +16-bit (2 byte) checksum or one in 256 for an 8-bit (1 byte) checksum. +The smaller checksum values can be used by selecting `crc32c_16` or +`crc32c_8` as the checksum algorithm. + +The *checksum algorithm* can be set either via a per-pool +``csum_type`` property or the global config option. For example: + +.. prompt:: bash $ + + ceph osd pool set <pool-name> csum_type <algorithm> + +Inline Compression +================== + +BlueStore supports inline compression using `snappy`, `zlib`, or +`lz4`. Please note that the `lz4` compression plugin is not +distributed in the official release. + +Whether data in BlueStore is compressed is determined by a combination +of the *compression mode* and any hints associated with a write +operation. The modes are: + +* **none**: Never compress data. +* **passive**: Do not compress data unless the write operation has a + *compressible* hint set. +* **aggressive**: Compress data unless the write operation has an + *incompressible* hint set. +* **force**: Try to compress data no matter what. + +For more information about the *compressible* and *incompressible* IO +hints, see :c:func:`rados_set_alloc_hint`. + +Note that regardless of the mode, if the size of the data chunk is not +reduced sufficiently it will not be used and the original +(uncompressed) data will be stored. For example, if the ``bluestore +compression required ratio`` is set to ``.7`` then the compressed data +must be 70% of the size of the original (or smaller). + +The *compression mode*, *compression algorithm*, *compression required +ratio*, *min blob size*, and *max blob size* can be set either via a +per-pool property or a global config option. Pool properties can be +set with: + +.. prompt:: bash $ + + ceph osd pool set <pool-name> compression_algorithm <algorithm> + ceph osd pool set <pool-name> compression_mode <mode> + ceph osd pool set <pool-name> compression_required_ratio <ratio> + ceph osd pool set <pool-name> compression_min_blob_size <size> + ceph osd pool set <pool-name> compression_max_blob_size <size> + +.. _bluestore-rocksdb-sharding: + +RocksDB Sharding +================ + +Internally BlueStore uses multiple types of key-value data, +stored in RocksDB. Each data type in BlueStore is assigned a +unique prefix. Until Pacific all key-value data was stored in +single RocksDB column family: 'default'. Since Pacific, +BlueStore can divide this data into multiple RocksDB column +families. When keys have similar access frequency, modification +frequency and lifetime, BlueStore benefits from better caching +and more precise compaction. This improves performance, and also +requires less disk space during compaction, since each column +family is smaller and can compact independent of others. + +OSDs deployed in Pacific or later use RocksDB sharding by default. +If Ceph is upgraded to Pacific from a previous version, sharding is off. + +To enable sharding and apply the Pacific defaults, stop an OSD and run + + .. prompt:: bash # + + ceph-bluestore-tool \ + --path <data path> \ + --sharding="m(3) p(3,0-12) O(3,0-13)=block_cache={type=binned_lru} L P" \ + reshard + + +Throttling +========== + +SPDK Usage +================== + +If you want to use the SPDK driver for NVMe devices, you must prepare your system. +Refer to `SPDK document`__ for more details. + +.. __: http://www.spdk.io/doc/getting_started.html#getting_started_examples + +SPDK offers a script to configure the device automatically. Users can run the +script as root: + +.. prompt:: bash $ + + sudo src/spdk/scripts/setup.sh + +You will need to specify the subject NVMe device's device selector with +the "spdk:" prefix for ``bluestore_block_path``. + +For example, you can find the device selector of an Intel PCIe SSD with: + +.. prompt:: bash $ + + lspci -mm -n -D -d 8086:0953 + +The device selector always has the form of ``DDDD:BB:DD.FF`` or ``DDDD.BB.DD.FF``. + +and then set:: + + bluestore_block_path = "spdk:trtype:PCIe traddr:0000:01:00.0" + +Where ``0000:01:00.0`` is the device selector found in the output of ``lspci`` +command above. + +You may also specify a remote NVMeoF target over the TCP transport as in the +following example:: + + bluestore_block_path = "spdk:trtype:TCP traddr:10.67.110.197 trsvcid:4420 subnqn:nqn.2019-02.io.spdk:cnode1" + +To run multiple SPDK instances per node, you must specify the +amount of dpdk memory in MB that each instance will use, to make sure each +instance uses its own DPDK memory. + +In most cases, a single device can be used for data, DB, and WAL. We describe +this strategy as *colocating* these components. Be sure to enter the below +settings to ensure that all IOs are issued through SPDK.:: + + bluestore_block_db_path = "" + bluestore_block_db_size = 0 + bluestore_block_wal_path = "" + bluestore_block_wal_size = 0 + +Otherwise, the current implementation will populate the SPDK map files with +kernel file system symbols and will use the kernel driver to issue DB/WAL IO. + +Minimum Allocation Size +======================== + +There is a configured minimum amount of storage that BlueStore will allocate on +an OSD. In practice, this is the least amount of capacity that a RADOS object +can consume. The value of `bluestore_min_alloc_size` is derived from the +value of `bluestore_min_alloc_size_hdd` or `bluestore_min_alloc_size_ssd` +depending on the OSD's ``rotational`` attribute. This means that when an OSD +is created on an HDD, BlueStore will be initialized with the current value +of `bluestore_min_alloc_size_hdd`, and SSD OSDs (including NVMe devices) +with the value of `bluestore_min_alloc_size_ssd`. + +Through the Mimic release, the default values were 64KB and 16KB for rotational +(HDD) and non-rotational (SSD) media respectively. Octopus changed the default +for SSD (non-rotational) media to 4KB, and Pacific changed the default for HDD +(rotational) media to 4KB as well. + +These changes were driven by space amplification experienced by Ceph RADOS +GateWay (RGW) deployments that host large numbers of small files +(S3/Swift objects). + +For example, when an RGW client stores a 1KB S3 object, it is written to a +single RADOS object. With the default `min_alloc_size` value, 4KB of +underlying drive space is allocated. This means that roughly +(4KB - 1KB) == 3KB is allocated but never used, which corresponds to 300% +overhead or 25% efficiency. Similarly, a 5KB user object will be stored +as one 4KB and one 1KB RADOS object, again stranding 4KB of device capcity, +though in this case the overhead is a much smaller percentage. Think of this +in terms of the remainder from a modulus operation. The overhead *percentage* +thus decreases rapidly as user object size increases. + +An easily missed additional subtlety is that this +takes place for *each* replica. So when using the default three copies of +data (3R), a 1KB S3 object actually consumes roughly 9KB of storage device +capacity. If erasure coding (EC) is used instead of replication, the +amplification may be even higher: for a ``k=4,m=2`` pool, our 1KB S3 object +will allocate (6 * 4KB) = 24KB of device capacity. + +When an RGW bucket pool contains many relatively large user objects, the effect +of this phenomenon is often negligible, but should be considered for deployments +that expect a signficiant fraction of relatively small objects. + +The 4KB default value aligns well with conventional HDD and SSD devices. Some +new coarse-IU (Indirection Unit) QLC SSDs however perform and wear best +when `bluestore_min_alloc_size_ssd` +is set at OSD creation to match the device's IU:. 8KB, 16KB, or even 64KB. +These novel storage drives allow one to achieve read performance competitive +with conventional TLC SSDs and write performance faster than HDDs, with +high density and lower cost than TLC SSDs. + +Note that when creating OSDs on these devices, one must carefully apply the +non-default value only to appropriate devices, and not to conventional SSD and +HDD devices. This may be done through careful ordering of OSD creation, custom +OSD device classes, and especially by the use of central configuration _masks_. + +Quincy and later releases add +the `bluestore_use_optimal_io_size_for_min_alloc_size` +option that enables automatic discovery of the appropriate value as each OSD is +created. Note that the use of ``bcache``, ``OpenCAS``, ``dmcrypt``, +``ATA over Ethernet``, `iSCSI`, or other device layering / abstraction +technologies may confound the determination of appropriate values. OSDs +deployed on top of VMware storage have been reported to also +sometimes report a ``rotational`` attribute that does not match the underlying +hardware. + +We suggest inspecting such OSDs at startup via logs and admin sockets to ensure that +behavior is appropriate. Note that this also may not work as desired with +older kernels. You can check for this by examining the presence and value +of ``/sys/block/<drive>/queue/optimal_io_size``. + +You may also inspect a given OSD: + +.. prompt:: bash # + + ceph osd metadata osd.1701 | grep rotational + +This space amplification may manifest as an unusually high ratio of raw to +stored data reported by ``ceph df``. ``ceph osd df`` may also report +anomalously high ``%USE`` / ``VAR`` values when +compared to other, ostensibly identical OSDs. A pool using OSDs with +mismatched ``min_alloc_size`` values may experience unexpected balancer +behavior as well. + +Note that this BlueStore attribute takes effect *only* at OSD creation; if +changed later, a given OSD's behavior will not change unless / until it is +destroyed and redeployed with the appropriate option value(s). Upgrading +to a later Ceph release will *not* change the value used by OSDs deployed +under older releases or with other settings. + +DSA (Data Streaming Accelerator Usage) +====================================== + +If you want to use the DML library to drive DSA device for offloading +read/write operations on Persist memory in Bluestore. You need to install +`DML`_ and `idxd-config`_ library in your machine with SPR (Sapphire Rapids) CPU. + +.. _DML: https://github.com/intel/DML +.. _idxd-config: https://github.com/intel/idxd-config + +After installing the DML software, you need to configure the shared +work queues (WQs) with the following WQ configuration example via accel-config tool: + +.. prompt:: bash $ + + accel-config config-wq --group-id=1 --mode=shared --wq-size=16 --threshold=15 --type=user --name="MyApp1" --priority=10 --block-on-fault=1 dsa0/wq0.1 + accel-config config-engine dsa0/engine0.1 --group-id=1 + accel-config enable-device dsa0 + accel-config enable-wq dsa0/wq0.1 diff --git a/doc/rados/configuration/ceph-conf.rst b/doc/rados/configuration/ceph-conf.rst new file mode 100644 index 000000000..ad93598de --- /dev/null +++ b/doc/rados/configuration/ceph-conf.rst @@ -0,0 +1,689 @@ +.. _configuring-ceph: + +================== + Configuring Ceph +================== + +When Ceph services start, the initialization process activates a series +of daemons that run in the background. A :term:`Ceph Storage Cluster` runs +at a minimum three types of daemons: + +- :term:`Ceph Monitor` (``ceph-mon``) +- :term:`Ceph Manager` (``ceph-mgr``) +- :term:`Ceph OSD Daemon` (``ceph-osd``) + +Ceph Storage Clusters that support the :term:`Ceph File System` also run at +least one :term:`Ceph Metadata Server` (``ceph-mds``). Clusters that +support :term:`Ceph Object Storage` run Ceph RADOS Gateway daemons +(``radosgw``) as well. + +Each daemon has a number of configuration options, each of which has a +default value. You may adjust the behavior of the system by changing these +configuration options. Be careful to understand the consequences before +overriding default values, as it is possible to significantly degrade the +performance and stability of your cluster. Also note that default values +sometimes change between releases, so it is best to review the version of +this documentation that aligns with your Ceph release. + +Option names +============ + +All Ceph configuration options have a unique name consisting of words +formed with lower-case characters and connected with underscore +(``_``) characters. + +When option names are specified on the command line, either underscore +(``_``) or dash (``-``) characters can be used interchangeable (e.g., +``--mon-host`` is equivalent to ``--mon_host``). + +When option names appear in configuration files, spaces can also be +used in place of underscore or dash. We suggest, though, that for +clarity and convenience you consistently use underscores, as we do +throughout this documentation. + +Config sources +============== + +Each Ceph daemon, process, and library will pull its configuration +from several sources, listed below. Sources later in the list will +override those earlier in the list when both are present. + +- the compiled-in default value +- the monitor cluster's centralized configuration database +- a configuration file stored on the local host +- environment variables +- command line arguments +- runtime overrides set by an administrator + +One of the first things a Ceph process does on startup is parse the +configuration options provided via the command line, environment, and +local configuration file. The process will then contact the monitor +cluster to retrieve configuration stored centrally for the entire +cluster. Once a complete view of the configuration is available, the +daemon or process startup will proceed. + +.. _bootstrap-options: + +Bootstrap options +----------------- + +Because some configuration options affect the process's ability to +contact the monitors, authenticate, and retrieve the cluster-stored +configuration, they may need to be stored locally on the node and set +in a local configuration file. These options include: + + - ``mon_host``, the list of monitors for the cluster + - ``mon_host_override``, the list of monitors for the cluster to + **initially** contact when beginning a new instance of communication with the + Ceph cluster. This overrides the known monitor list derived from MonMap + updates sent to older Ceph instances (like librados cluster handles). It is + expected this option is primarily useful for debugging. + - ``mon_dns_srv_name`` (default: `ceph-mon`), the name of the DNS + SRV record to check to identify the cluster monitors via DNS + - ``mon_data``, ``osd_data``, ``mds_data``, ``mgr_data``, and + similar options that define which local directory the daemon + stores its data in. + - ``keyring``, ``keyfile``, and/or ``key``, which can be used to + specify the authentication credential to use to authenticate with + the monitor. Note that in most cases the default keyring location + is in the data directory specified above. + +In the vast majority of cases the default values of these are +appropriate, with the exception of the ``mon_host`` option that +identifies the addresses of the cluster's monitors. When DNS is used +to identify monitors a local ceph configuration file can be avoided +entirely. + +Skipping monitor config +----------------------- + +Any process may be passed the option ``--no-mon-config`` to skip the +step that retrieves configuration from the cluster monitors. This is +useful in cases where configuration is managed entirely via +configuration files or where the monitor cluster is currently down but +some maintenance activity needs to be done. + + +.. _ceph-conf-file: + + +Configuration sections +====================== + +Any given process or daemon has a single value for each configuration +option. However, values for an option may vary across different +daemon types even daemons of the same type. Ceph options that are +stored in the monitor configuration database or in local configuration +files are grouped into sections to indicate which daemons or clients +they apply to. + +These sections include: + +``global`` + +:Description: Settings under ``global`` affect all daemons and clients + in a Ceph Storage Cluster. + +:Example: ``log_file = /var/log/ceph/$cluster-$type.$id.log`` + +``mon`` + +:Description: Settings under ``mon`` affect all ``ceph-mon`` daemons in + the Ceph Storage Cluster, and override the same setting in + ``global``. + +:Example: ``mon_cluster_log_to_syslog = true`` + + +``mgr`` + +:Description: Settings in the ``mgr`` section affect all ``ceph-mgr`` daemons in + the Ceph Storage Cluster, and override the same setting in + ``global``. + +:Example: ``mgr_stats_period = 10`` + +``osd`` + +:Description: Settings under ``osd`` affect all ``ceph-osd`` daemons in + the Ceph Storage Cluster, and override the same setting in + ``global``. + +:Example: ``osd_op_queue = wpq`` + +``mds`` + +:Description: Settings in the ``mds`` section affect all ``ceph-mds`` daemons in + the Ceph Storage Cluster, and override the same setting in + ``global``. + +:Example: ``mds_cache_memory_limit = 10G`` + +``client`` + +:Description: Settings under ``client`` affect all Ceph Clients + (e.g., mounted Ceph File Systems, mounted Ceph Block Devices, + etc.) as well as Rados Gateway (RGW) daemons. + +:Example: ``objecter_inflight_ops = 512`` + + +Sections may also specify an individual daemon or client name. For example, +``mon.foo``, ``osd.123``, and ``client.smith`` are all valid section names. + + +Any given daemon will draw its settings from the global section, the +daemon or client type section, and the section sharing its name. +Settings in the most-specific section take precedence, so for example +if the same option is specified in both ``global``, ``mon``, and +``mon.foo`` on the same source (i.e., in the same configurationfile), +the ``mon.foo`` value will be used. + +If multiple values of the same configuration option are specified in the same +section, the last value wins. + +Note that values from the local configuration file always take +precedence over values from the monitor configuration database, +regardless of which section they appear in. + + +.. _ceph-metavariables: + +Metavariables +============= + +Metavariables simplify Ceph Storage Cluster configuration +dramatically. When a metavariable is set in a configuration value, +Ceph expands the metavariable into a concrete value at the time the +configuration value is used. Ceph metavariables are similar to variable expansion in the Bash shell. + +Ceph supports the following metavariables: + +``$cluster`` + +:Description: Expands to the Ceph Storage Cluster name. Useful when running + multiple Ceph Storage Clusters on the same hardware. + +:Example: ``/etc/ceph/$cluster.keyring`` +:Default: ``ceph`` + + +``$type`` + +:Description: Expands to a daemon or process type (e.g., ``mds``, ``osd``, or ``mon``) + +:Example: ``/var/lib/ceph/$type`` + + +``$id`` + +:Description: Expands to the daemon or client identifier. For + ``osd.0``, this would be ``0``; for ``mds.a``, it would + be ``a``. + +:Example: ``/var/lib/ceph/$type/$cluster-$id`` + + +``$host`` + +:Description: Expands to the host name where the process is running. + + +``$name`` + +:Description: Expands to ``$type.$id``. +:Example: ``/var/run/ceph/$cluster-$name.asok`` + +``$pid`` + +:Description: Expands to daemon pid. +:Example: ``/var/run/ceph/$cluster-$name-$pid.asok`` + + + +The Configuration File +====================== + +On startup, Ceph processes search for a configuration file in the +following locations: + +#. ``$CEPH_CONF`` (*i.e.,* the path following the ``$CEPH_CONF`` + environment variable) +#. ``-c path/path`` (*i.e.,* the ``-c`` command line argument) +#. ``/etc/ceph/$cluster.conf`` +#. ``~/.ceph/$cluster.conf`` +#. ``./$cluster.conf`` (*i.e.,* in the current working directory) +#. On FreeBSD systems only, ``/usr/local/etc/ceph/$cluster.conf`` + +where ``$cluster`` is the cluster's name (default ``ceph``). + +The Ceph configuration file uses an *ini* style syntax. You can add comment +text after a pound sign (#) or a semi-colon (;). For example: + +.. code-block:: ini + + # <--A number (#) sign precedes a comment. + ; A comment may be anything. + # Comments always follow a semi-colon (;) or a pound (#) on each line. + # The end of the line terminates a comment. + # We recommend that you provide comments in your configuration file(s). + + +.. _ceph-conf-settings: + +Config file section names +------------------------- + +The configuration file is divided into sections. Each section must begin with a +valid configuration section name (see `Configuration sections`_, above) +surrounded by square brackets. For example, + +.. code-block:: ini + + [global] + debug_ms = 0 + + [osd] + debug_ms = 1 + + [osd.1] + debug_ms = 10 + + [osd.2] + debug_ms = 10 + + +Config file option values +------------------------- + +The value of a configuration option is a string. If it is too long to +fit in a single line, you can put a backslash (``\``) at the end of line +as the line continuation marker, so the value of the option will be +the string after ``=`` in current line combined with the string in the next +line:: + + [global] + foo = long long ago\ + long ago + +In the example above, the value of "``foo``" would be "``long long ago long ago``". + +Normally, the option value ends with a new line, or a comment, like + +.. code-block:: ini + + [global] + obscure_one = difficult to explain # I will try harder in next release + simpler_one = nothing to explain + +In the example above, the value of "``obscure one``" would be "``difficult to explain``"; +and the value of "``simpler one`` would be "``nothing to explain``". + +If an option value contains spaces, and we want to make it explicit, we +could quote the value using single or double quotes, like + +.. code-block:: ini + + [global] + line = "to be, or not to be" + +Certain characters are not allowed to be present in the option values directly. +They are ``=``, ``#``, ``;`` and ``[``. If we have to, we need to escape them, +like + +.. code-block:: ini + + [global] + secret = "i love \# and \[" + +Every configuration option is typed with one of the types below: + +``int`` + +:Description: 64-bit signed integer, Some SI prefixes are supported, like "K", "M", "G", + "T", "P", "E", meaning, respectively, 10\ :sup:`3`, 10\ :sup:`6`, + 10\ :sup:`9`, etc. And "B" is the only supported unit. So, "1K", "1M", "128B" and "-1" are all valid + option values. Some times, a negative value implies "unlimited" when it comes to + an option for threshold or limit. +:Example: ``42``, ``-1`` + +``uint`` + +:Description: It is almost identical to ``integer``. But a negative value will be rejected. +:Example: ``256``, ``0`` + +``str`` + +:Description: Free style strings encoded in UTF-8, but some characters are not allowed. Please + reference the above notes for the details. +:Example: ``"hello world"``, ``"i love \#"``, ``yet-another-name`` + +``boolean`` + +:Description: one of the two values ``true`` or ``false``. But an integer is also accepted, + where "0" implies ``false``, and any non-zero values imply ``true``. +:Example: ``true``, ``false``, ``1``, ``0`` + +``addr`` + +:Description: a single address optionally prefixed with ``v1``, ``v2`` or ``any`` for the messenger + protocol. If the prefix is not specified, ``v2`` protocol is used. Please see + :ref:`address_formats` for more details. +:Example: ``v1:1.2.3.4:567``, ``v2:1.2.3.4:567``, ``1.2.3.4:567``, ``2409:8a1e:8fb6:aa20:1260:4bff:fe92:18f5::567``, ``[::1]:6789`` + +``addrvec`` + +:Description: a set of addresses separated by ",". The addresses can be optionally quoted with ``[`` and ``]``. +:Example: ``[v1:1.2.3.4:567,v2:1.2.3.4:568]``, ``v1:1.2.3.4:567,v1:1.2.3.14:567`` ``[2409:8a1e:8fb6:aa20:1260:4bff:fe92:18f5::567], [2409:8a1e:8fb6:aa20:1260:4bff:fe92:18f5::568]`` + +``uuid`` + +:Description: the string format of a uuid defined by `RFC4122 <https://www.ietf.org/rfc/rfc4122.txt>`_. + And some variants are also supported, for more details, see + `Boost document <https://www.boost.org/doc/libs/1_74_0/libs/uuid/doc/uuid.html#String%20Generator>`_. +:Example: ``f81d4fae-7dec-11d0-a765-00a0c91e6bf6`` + +``size`` + +:Description: denotes a 64-bit unsigned integer. Both SI prefixes and IEC prefixes are + supported. And "B" is the only supported unit. A negative value will be + rejected. +:Example: ``1Ki``, ``1K``, ``1KiB`` and ``1B``. + +``secs`` + +:Description: denotes a duration of time. By default the unit is second if not specified. + Following units of time are supported: + + * second: "s", "sec", "second", "seconds" + * minute: "m", "min", "minute", "minutes" + * hour: "hs", "hr", "hour", "hours" + * day: "d", "day", "days" + * week: "w", "wk", "week", "weeks" + * month: "mo", "month", "months" + * year: "y", "yr", "year", "years" +:Example: ``1 m``, ``1m`` and ``1 week`` + +.. _ceph-conf-database: + +Monitor configuration database +============================== + +The monitor cluster manages a database of configuration options that +can be consumed by the entire cluster, enabling streamlined central +configuration management for the entire system. The vast majority of +configuration options can and should be stored here for ease of +administration and transparency. + +A handful of settings may still need to be stored in local +configuration files because they affect the ability to connect to the +monitors, authenticate, and fetch configuration information. In most +cases this is limited to the ``mon_host`` option, although this can +also be avoided through the use of DNS SRV records. + +Sections and masks +------------------ + +Configuration options stored by the monitor can live in a global +section, daemon type section, or specific daemon section, just like +options in a configuration file can. + +In addition, options may also have a *mask* associated with them to +further restrict which daemons or clients the option applies to. +Masks take two forms: + +#. ``type:location`` where *type* is a CRUSH property like `rack` or + `host`, and *location* is a value for that property. For example, + ``host:foo`` would limit the option only to daemons or clients + running on a particular host. +#. ``class:device-class`` where *device-class* is the name of a CRUSH + device class (e.g., ``hdd`` or ``ssd``). For example, + ``class:ssd`` would limit the option only to OSDs backed by SSDs. + (This mask has no effect for non-OSD daemons or clients.) + +When setting a configuration option, the `who` may be a section name, +a mask, or a combination of both separated by a slash (``/``) +character. For example, ``osd/rack:foo`` would mean all OSD daemons +in the ``foo`` rack. + +When viewing configuration options, the section name and mask are +generally separated out into separate fields or columns to ease readability. + + +Commands +-------- + +The following CLI commands are used to configure the cluster: + +* ``ceph config dump`` will dump the entire configuration database for + the cluster. + +* ``ceph config get <who>`` will dump the configuration for a specific + daemon or client (e.g., ``mds.a``), as stored in the monitors' + configuration database. + +* ``ceph config set <who> <option> <value>`` will set a configuration + option in the monitors' configuration database. + +* ``ceph config show <who>`` will show the reported running + configuration for a running daemon. These settings may differ from + those stored by the monitors if there are also local configuration + files in use or options have been overridden on the command line or + at run time. The source of the option values is reported as part + of the output. + +* ``ceph config assimilate-conf -i <input file> -o <output file>`` + will ingest a configuration file from *input file* and move any + valid options into the monitors' configuration database. Any + settings that are unrecognized, invalid, or cannot be controlled by + the monitor will be returned in an abbreviated config file stored in + *output file*. This command is useful for transitioning from legacy + configuration files to centralized monitor-based configuration. + + +Help +==== + +You can get help for a particular option with: + +.. prompt:: bash $ + + ceph config help <option> + +Note that this will use the configuration schema that is compiled into the running monitors. If you have a mixed-version cluster (e.g., during an upgrade), you might also want to query the option schema from a specific running daemon: + +.. prompt:: bash $ + + ceph daemon <name> config help [option] + +For example: + +.. prompt:: bash $ + + ceph config help log_file + +:: + + log_file - path to log file + (std::string, basic) + Default (non-daemon): + Default (daemon): /var/log/ceph/$cluster-$name.log + Can update at runtime: false + See also: [log_to_stderr,err_to_stderr,log_to_syslog,err_to_syslog] + +or: + +.. prompt:: bash $ + + ceph config help log_file -f json-pretty + +:: + + { + "name": "log_file", + "type": "std::string", + "level": "basic", + "desc": "path to log file", + "long_desc": "", + "default": "", + "daemon_default": "/var/log/ceph/$cluster-$name.log", + "tags": [], + "services": [], + "see_also": [ + "log_to_stderr", + "err_to_stderr", + "log_to_syslog", + "err_to_syslog" + ], + "enum_values": [], + "min": "", + "max": "", + "can_update_at_runtime": false + } + +The ``level`` property can be any of `basic`, `advanced`, or `dev`. +The `dev` options are intended for use by developers, generally for +testing purposes, and are not recommended for use by operators. + + +Runtime Changes +=============== + +In most cases, Ceph allows you to make changes to the configuration of +a daemon at runtime. This capability is quite useful for +increasing/decreasing logging output, enabling/disabling debug +settings, and even for runtime optimization. + +Generally speaking, configuration options can be updated in the usual +way via the ``ceph config set`` command. For example, do enable the debug log level on a specific OSD: + +.. prompt:: bash $ + + ceph config set osd.123 debug_ms 20 + +Note that if the same option is also customized in a local +configuration file, the monitor setting will be ignored (it has a +lower priority than the local config file). + +Override values +--------------- + +You can also temporarily set an option using the `tell` or `daemon` +interfaces on the Ceph CLI. These *override* values are ephemeral in +that they only affect the running process and are discarded/lost if +the daemon or process restarts. + +Override values can be set in two ways: + +#. From any host, we can send a message to a daemon over the network with: + + .. prompt:: bash $ + + ceph tell <name> config set <option> <value> + + For example: + + .. prompt:: bash $ + + ceph tell osd.123 config set debug_osd 20 + + The `tell` command can also accept a wildcard for the daemon + identifier. For example, to adjust the debug level on all OSD + daemons: + + .. prompt:: bash $ + + ceph tell osd.* config set debug_osd 20 + +#. From the host the process is running on, we can connect directly to + the process via a socket in ``/var/run/ceph`` with: + + .. prompt:: bash $ + + ceph daemon <name> config set <option> <value> + + For example: + + .. prompt:: bash $ + + ceph daemon osd.4 config set debug_osd 20 + +Note that in the ``ceph config show`` command output these temporary +values will be shown with a source of ``override``. + + +Viewing runtime settings +======================== + +You can see the current options set for a running daemon with the ``ceph config show`` command. For example: + +.. prompt:: bash $ + + ceph config show osd.0 + +will show you the (non-default) options for that daemon. You can also look at a specific option with: + +.. prompt:: bash $ + + ceph config show osd.0 debug_osd + +or view all options (even those with default values) with: + +.. prompt:: bash $ + + ceph config show-with-defaults osd.0 + +You can also observe settings for a running daemon by connecting to it from the local host via the admin socket. For example: + +.. prompt:: bash $ + + ceph daemon osd.0 config show + +will dump all current settings: + +.. prompt:: bash $ + + ceph daemon osd.0 config diff + +will show only non-default settings (as well as where the value came from: a config file, the monitor, an override, etc.), and: + +.. prompt:: bash $ + + ceph daemon osd.0 config get debug_osd + +will report the value of a single option. + + + +Changes since Nautilus +====================== + +With the Octopus release We changed the way the configuration file is parsed. +These changes are as follows: + +- Repeated configuration options are allowed, and no warnings will be printed. + The value of the last one is used, which means that the setting last in the file + is the one that takes effect. Before this change, we would print warning messages + when lines with duplicated options were encountered, like:: + + warning line 42: 'foo' in section 'bar' redefined + +- Invalid UTF-8 options were ignored with warning messages. But since Octopus, + they are treated as fatal errors. + +- Backslash ``\`` is used as the line continuation marker to combine the next + line with current one. Before Octopus, it was required to follow a backslash with + a non-empty line. But in Octopus, an empty line following a backslash is now allowed. + +- In the configuration file, each line specifies an individual configuration + option. The option's name and its value are separated with ``=``, and the + value may be quoted using single or double quotes. If an invalid + configuration is specified, we will treat it as an invalid configuration + file :: + + bad option ==== bad value + +- Before Octopus, if no section name was specified in the configuration file, + all options would be set as though they were within the ``global`` section. This is + now discouraged. Since Octopus, only a single option is allowed for + configuration files without a section name. diff --git a/doc/rados/configuration/common.rst b/doc/rados/configuration/common.rst new file mode 100644 index 000000000..709c8bce2 --- /dev/null +++ b/doc/rados/configuration/common.rst @@ -0,0 +1,218 @@ + +.. _ceph-conf-common-settings: + +Common Settings +=============== + +The `Hardware Recommendations`_ section provides some hardware guidelines for +configuring a Ceph Storage Cluster. It is possible for a single :term:`Ceph +Node` to run multiple daemons. For example, a single node with multiple drives +may run one ``ceph-osd`` for each drive. Ideally, you will have a node for a +particular type of process. For example, some nodes may run ``ceph-osd`` +daemons, other nodes may run ``ceph-mds`` daemons, and still other nodes may +run ``ceph-mon`` daemons. + +Each node has a name identified by the ``host`` setting. Monitors also specify +a network address and port (i.e., domain name or IP address) identified by the +``addr`` setting. A basic configuration file will typically specify only +minimal settings for each instance of monitor daemons. For example: + +.. code-block:: ini + + [global] + mon_initial_members = ceph1 + mon_host = 10.0.0.1 + + +.. important:: The ``host`` setting is the short name of the node (i.e., not + an fqdn). It is **NOT** an IP address either. Enter ``hostname -s`` on + the command line to retrieve the name of the node. Do not use ``host`` + settings for anything other than initial monitors unless you are deploying + Ceph manually. You **MUST NOT** specify ``host`` under individual daemons + when using deployment tools like ``chef`` or ``cephadm``, as those tools + will enter the appropriate values for you in the cluster map. + + +.. _ceph-network-config: + +Networks +======== + +See the `Network Configuration Reference`_ for a detailed discussion about +configuring a network for use with Ceph. + + +Monitors +======== + +Production Ceph clusters typically provision a minimum of three :term:`Ceph Monitor` +daemons to ensure availability should a monitor instance crash. A minimum of +three ensures that the Paxos algorithm can determine which version +of the :term:`Ceph Cluster Map` is the most recent from a majority of Ceph +Monitors in the quorum. + +.. note:: You may deploy Ceph with a single monitor, but if the instance fails, + the lack of other monitors may interrupt data service availability. + +Ceph Monitors normally listen on port ``3300`` for the new v2 protocol, and ``6789`` for the old v1 protocol. + +By default, Ceph expects to store monitor data under the +following path:: + + /var/lib/ceph/mon/$cluster-$id + +You or a deployment tool (e.g., ``cephadm``) must create the corresponding +directory. With metavariables fully expressed and a cluster named "ceph", the +foregoing directory would evaluate to:: + + /var/lib/ceph/mon/ceph-a + +For additional details, see the `Monitor Config Reference`_. + +.. _Monitor Config Reference: ../mon-config-ref + + +.. _ceph-osd-config: + + +Authentication +============== + +.. versionadded:: Bobtail 0.56 + +For Bobtail (v 0.56) and beyond, you should expressly enable or disable +authentication in the ``[global]`` section of your Ceph configuration file. + +.. code-block:: ini + + auth_cluster_required = cephx + auth_service_required = cephx + auth_client_required = cephx + +Additionally, you should enable message signing. See `Cephx Config Reference`_ for details. + +.. _Cephx Config Reference: ../auth-config-ref + + +.. _ceph-monitor-config: + + +OSDs +==== + +Ceph production clusters typically deploy :term:`Ceph OSD Daemons` where one node +has one OSD daemon running a Filestore on one storage device. The BlueStore back +end is now default, but when using Filestore you specify a journal size. For example: + +.. code-block:: ini + + [osd] + osd_journal_size = 10000 + + [osd.0] + host = {hostname} #manual deployments only. + + +By default, Ceph expects to store a Ceph OSD Daemon's data at the +following path:: + + /var/lib/ceph/osd/$cluster-$id + +You or a deployment tool (e.g., ``cephadm``) must create the corresponding +directory. With metavariables fully expressed and a cluster named "ceph", this +example would evaluate to:: + + /var/lib/ceph/osd/ceph-0 + +You may override this path using the ``osd_data`` setting. We recommend not +changing the default location. Create the default directory on your OSD host. + +.. prompt:: bash $ + + ssh {osd-host} + sudo mkdir /var/lib/ceph/osd/ceph-{osd-number} + +The ``osd_data`` path ideally leads to a mount point with a device that is +separate from the device that contains the operating system and +daemons. If an OSD is to use a device other than the OS device, prepare it for +use with Ceph, and mount it to the directory you just created + +.. prompt:: bash $ + + ssh {new-osd-host} + sudo mkfs -t {fstype} /dev/{disk} + sudo mount -o user_xattr /dev/{hdd} /var/lib/ceph/osd/ceph-{osd-number} + +We recommend using the ``xfs`` file system when running +:command:`mkfs`. (``btrfs`` and ``ext4`` are not recommended and are no +longer tested.) + +See the `OSD Config Reference`_ for additional configuration details. + + +Heartbeats +========== + +During runtime operations, Ceph OSD Daemons check up on other Ceph OSD Daemons +and report their findings to the Ceph Monitor. You do not have to provide any +settings. However, if you have network latency issues, you may wish to modify +the settings. + +See `Configuring Monitor/OSD Interaction`_ for additional details. + + +.. _ceph-logging-and-debugging: + +Logs / Debugging +================ + +Sometimes you may encounter issues with Ceph that require +modifying logging output and using Ceph's debugging. See `Debugging and +Logging`_ for details on log rotation. + +.. _Debugging and Logging: ../../troubleshooting/log-and-debug + + +Example ceph.conf +================= + +.. literalinclude:: demo-ceph.conf + :language: ini + +.. _ceph-runtime-config: + + + +Running Multiple Clusters (DEPRECATED) +====================================== + +Each Ceph cluster has an internal name that is used as part of configuration +and log file names as well as directory and mountpoint names. This name +defaults to "ceph". Previous releases of Ceph allowed one to specify a custom +name instead, for example "ceph2". This was intended to faciliate running +multiple logical clusters on the same physical hardware, but in practice this +was rarely exploited and should no longer be attempted. Prior documentation +could also be misinterpreted as requiring unique cluster names in order to +use ``rbd-mirror``. + +Custom cluster names are now considered deprecated and the ability to deploy +them has already been removed from some tools, though existing custom name +deployments continue to operate. The ability to run and manage clusters with +custom names may be progressively removed by future Ceph releases, so it is +strongly recommended to deploy all new clusters with the default name "ceph". + +Some Ceph CLI commands accept an optional ``--cluster`` (cluster name) option. This +option is present purely for backward compatibility and need not be accomodated +by new tools and deployments. + +If you do need to allow multiple clusters to exist on the same host, please use +:ref:`cephadm`, which uses containers to fully isolate each cluster. + + + + + +.. _Hardware Recommendations: ../../../start/hardware-recommendations +.. _Network Configuration Reference: ../network-config-ref +.. _OSD Config Reference: ../osd-config-ref +.. _Configuring Monitor/OSD Interaction: ../mon-osd-interaction diff --git a/doc/rados/configuration/demo-ceph.conf b/doc/rados/configuration/demo-ceph.conf new file mode 100644 index 000000000..58bb7061f --- /dev/null +++ b/doc/rados/configuration/demo-ceph.conf @@ -0,0 +1,31 @@ +[global] +fsid = {cluster-id} +mon_initial_ members = {hostname}[, {hostname}] +mon_host = {ip-address}[, {ip-address}] + +#All clusters have a front-side public network. +#If you have two network interfaces, you can configure a private / cluster +#network for RADOS object replication, heartbeats, backfill, +#recovery, etc. +public_network = {network}[, {network}] +#cluster_network = {network}[, {network}] + +#Clusters require authentication by default. +auth_cluster_required = cephx +auth_service_required = cephx +auth_client_required = cephx + +#Choose reasonable numbers for journals, number of replicas +#and placement groups. +osd_journal_size = {n} +osd_pool_default_size = {n} # Write an object n times. +osd_pool_default_min size = {n} # Allow writing n copy in a degraded state. +osd_pool_default_pg num = {n} +osd_pool_default_pgp num = {n} + +#Choose a reasonable crush leaf type. +#0 for a 1-node cluster. +#1 for a multi node cluster in a single rack +#2 for a multi node, multi chassis cluster with multiple hosts in a chassis +#3 for a multi node cluster with hosts across racks, etc. +osd_crush_chooseleaf_type = {n}
\ No newline at end of file diff --git a/doc/rados/configuration/filestore-config-ref.rst b/doc/rados/configuration/filestore-config-ref.rst new file mode 100644 index 000000000..435a800a8 --- /dev/null +++ b/doc/rados/configuration/filestore-config-ref.rst @@ -0,0 +1,367 @@ +============================ + Filestore Config Reference +============================ + +The Filestore back end is no longer the default when creating new OSDs, +though Filestore OSDs are still supported. + +``filestore debug omap check`` + +:Description: Debugging check on synchronization. Expensive. For debugging only. +:Type: Boolean +:Required: No +:Default: ``false`` + + +.. index:: filestore; extended attributes + +Extended Attributes +=================== + +Extended Attributes (XATTRs) are important for Filestore OSDs. +Some file systems have limits on the number of bytes that can be stored in XATTRs. +Additionally, in some cases, the file system may not be as fast as an alternative +method of storing XATTRs. The following settings may help improve performance +by using a method of storing XATTRs that is extrinsic to the underlying file system. + +Ceph XATTRs are stored as ``inline xattr``, using the XATTRs provided +by the underlying file system, if it does not impose a size limit. If +there is a size limit (4KB total on ext4, for instance), some Ceph +XATTRs will be stored in a key/value database when either the +``filestore_max_inline_xattr_size`` or ``filestore_max_inline_xattrs`` +threshold is reached. + + +``filestore_max_inline_xattr_size`` + +:Description: The maximum size of an XATTR stored in the file system (i.e., XFS, + Btrfs, EXT4, etc.) per object. Should not be larger than the + file system can handle. Default value of 0 means to use the value + specific to the underlying file system. +:Type: Unsigned 32-bit Integer +:Required: No +:Default: ``0`` + + +``filestore_max_inline_xattr_size_xfs`` + +:Description: The maximum size of an XATTR stored in the XFS file system. + Only used if ``filestore_max_inline_xattr_size`` == 0. +:Type: Unsigned 32-bit Integer +:Required: No +:Default: ``65536`` + + +``filestore_max_inline_xattr_size_btrfs`` + +:Description: The maximum size of an XATTR stored in the Btrfs file system. + Only used if ``filestore_max_inline_xattr_size`` == 0. +:Type: Unsigned 32-bit Integer +:Required: No +:Default: ``2048`` + + +``filestore_max_inline_xattr_size_other`` + +:Description: The maximum size of an XATTR stored in other file systems. + Only used if ``filestore_max_inline_xattr_size`` == 0. +:Type: Unsigned 32-bit Integer +:Required: No +:Default: ``512`` + + +``filestore_max_inline_xattrs`` + +:Description: The maximum number of XATTRs stored in the file system per object. + Default value of 0 means to use the value specific to the + underlying file system. +:Type: 32-bit Integer +:Required: No +:Default: ``0`` + + +``filestore_max_inline_xattrs_xfs`` + +:Description: The maximum number of XATTRs stored in the XFS file system per object. + Only used if ``filestore_max_inline_xattrs`` == 0. +:Type: 32-bit Integer +:Required: No +:Default: ``10`` + + +``filestore_max_inline_xattrs_btrfs`` + +:Description: The maximum number of XATTRs stored in the Btrfs file system per object. + Only used if ``filestore_max_inline_xattrs`` == 0. +:Type: 32-bit Integer +:Required: No +:Default: ``10`` + + +``filestore_max_inline_xattrs_other`` + +:Description: The maximum number of XATTRs stored in other file systems per object. + Only used if ``filestore_max_inline_xattrs`` == 0. +:Type: 32-bit Integer +:Required: No +:Default: ``2`` + +.. index:: filestore; synchronization + +Synchronization Intervals +========================= + +Filestore needs to periodically quiesce writes and synchronize the +file system, which creates a consistent commit point. It can then free journal +entries up to the commit point. Synchronizing more frequently tends to reduce +the time required to perform synchronization, and reduces the amount of data +that needs to remain in the journal. Less frequent synchronization allows the +backing file system to coalesce small writes and metadata updates more +optimally, potentially resulting in more efficient synchronization at the +expense of potentially increasing tail latency. + +``filestore_max_sync_interval`` + +:Description: The maximum interval in seconds for synchronizing Filestore. +:Type: Double +:Required: No +:Default: ``5`` + + +``filestore_min_sync_interval`` + +:Description: The minimum interval in seconds for synchronizing Filestore. +:Type: Double +:Required: No +:Default: ``.01`` + + +.. index:: filestore; flusher + +Flusher +======= + +The Filestore flusher forces data from large writes to be written out using +``sync_file_range`` before the sync in order to (hopefully) reduce the cost of +the eventual sync. In practice, disabling 'filestore_flusher' seems to improve +performance in some cases. + + +``filestore_flusher`` + +:Description: Enables the filestore flusher. +:Type: Boolean +:Required: No +:Default: ``false`` + +.. deprecated:: v.65 + +``filestore_flusher_max_fds`` + +:Description: Sets the maximum number of file descriptors for the flusher. +:Type: Integer +:Required: No +:Default: ``512`` + +.. deprecated:: v.65 + +``filestore_sync_flush`` + +:Description: Enables the synchronization flusher. +:Type: Boolean +:Required: No +:Default: ``false`` + +.. deprecated:: v.65 + +``filestore_fsync_flushes_journal_data`` + +:Description: Flush journal data during file system synchronization. +:Type: Boolean +:Required: No +:Default: ``false`` + + +.. index:: filestore; queue + +Queue +===== + +The following settings provide limits on the size of the Filestore queue. + +``filestore_queue_max_ops`` + +:Description: Defines the maximum number of in progress operations the file store accepts before blocking on queuing new operations. +:Type: Integer +:Required: No. Minimal impact on performance. +:Default: ``50`` + + +``filestore_queue_max_bytes`` + +:Description: The maximum number of bytes for an operation. +:Type: Integer +:Required: No +:Default: ``100 << 20`` + + + + +.. index:: filestore; timeouts + +Timeouts +======== + + +``filestore_op_threads`` + +:Description: The number of file system operation threads that execute in parallel. +:Type: Integer +:Required: No +:Default: ``2`` + + +``filestore_op_thread_timeout`` + +:Description: The timeout for a file system operation thread (in seconds). +:Type: Integer +:Required: No +:Default: ``60`` + + +``filestore_op_thread_suicide_timeout`` + +:Description: The timeout for a commit operation before cancelling the commit (in seconds). +:Type: Integer +:Required: No +:Default: ``180`` + + +.. index:: filestore; btrfs + +B-Tree Filesystem +================= + + +``filestore_btrfs_snap`` + +:Description: Enable snapshots for a ``btrfs`` filestore. +:Type: Boolean +:Required: No. Only used for ``btrfs``. +:Default: ``true`` + + +``filestore_btrfs_clone_range`` + +:Description: Enable cloning ranges for a ``btrfs`` filestore. +:Type: Boolean +:Required: No. Only used for ``btrfs``. +:Default: ``true`` + + +.. index:: filestore; journal + +Journal +======= + + +``filestore_journal_parallel`` + +:Description: Enables parallel journaling, default for Btrfs. +:Type: Boolean +:Required: No +:Default: ``false`` + + +``filestore_journal_writeahead`` + +:Description: Enables writeahead journaling, default for XFS. +:Type: Boolean +:Required: No +:Default: ``false`` + + +``filestore_journal_trailing`` + +:Description: Deprecated, never use. +:Type: Boolean +:Required: No +:Default: ``false`` + + +Misc +==== + + +``filestore_merge_threshold`` + +:Description: Min number of files in a subdir before merging into parent + NOTE: A negative value means to disable subdir merging +:Type: Integer +:Required: No +:Default: ``-10`` + + +``filestore_split_multiple`` + +:Description: ``(filestore_split_multiple * abs(filestore_merge_threshold) + (rand() % filestore_split_rand_factor)) * 16`` + is the maximum number of files in a subdirectory before + splitting into child directories. + +:Type: Integer +:Required: No +:Default: ``2`` + + +``filestore_split_rand_factor`` + +:Description: A random factor added to the split threshold to avoid + too many (expensive) Filestore splits occurring at once. See + ``filestore_split_multiple`` for details. + This can only be changed offline for an existing OSD, + via the ``ceph-objectstore-tool apply-layout-settings`` command. + +:Type: Unsigned 32-bit Integer +:Required: No +:Default: ``20`` + + +``filestore_update_to`` + +:Description: Limits Filestore auto upgrade to specified version. +:Type: Integer +:Required: No +:Default: ``1000`` + + +``filestore_blackhole`` + +:Description: Drop any new transactions on the floor. +:Type: Boolean +:Required: No +:Default: ``false`` + + +``filestore_dump_file`` + +:Description: File onto which store transaction dumps. +:Type: Boolean +:Required: No +:Default: ``false`` + + +``filestore_kill_at`` + +:Description: inject a failure at the n'th opportunity +:Type: String +:Required: No +:Default: ``false`` + + +``filestore_fail_eio`` + +:Description: Fail/Crash on eio. +:Type: Boolean +:Required: No +:Default: ``true`` + diff --git a/doc/rados/configuration/general-config-ref.rst b/doc/rados/configuration/general-config-ref.rst new file mode 100644 index 000000000..fc837c395 --- /dev/null +++ b/doc/rados/configuration/general-config-ref.rst @@ -0,0 +1,66 @@ +========================== + General Config Reference +========================== + + +``fsid`` + +:Description: The file system ID. One per cluster. +:Type: UUID +:Required: No. +:Default: N/A. Usually generated by deployment tools. + + +``admin_socket`` + +:Description: The socket for executing administrative commands on a daemon, + irrespective of whether Ceph Monitors have established a quorum. + +:Type: String +:Required: No +:Default: ``/var/run/ceph/$cluster-$name.asok`` + + +``pid_file`` + +:Description: The file in which the mon, osd or mds will write its + PID. For instance, ``/var/run/$cluster/$type.$id.pid`` + will create /var/run/ceph/mon.a.pid for the ``mon`` with + id ``a`` running in the ``ceph`` cluster. The ``pid + file`` is removed when the daemon stops gracefully. If + the process is not daemonized (i.e. runs with the ``-f`` + or ``-d`` option), the ``pid file`` is not created. +:Type: String +:Required: No +:Default: No + + +``chdir`` + +:Description: The directory Ceph daemons change to once they are + up and running. Default ``/`` directory recommended. + +:Type: String +:Required: No +:Default: ``/`` + + +``max_open_files`` + +:Description: If set, when the :term:`Ceph Storage Cluster` starts, Ceph sets + the max open FDs at the OS level (i.e., the max # of file + descriptors). A suitably large value prevents Ceph Daemons from running out + of file descriptors. + +:Type: 64-bit Integer +:Required: No +:Default: ``0`` + + +``fatal_signal_handlers`` + +:Description: If set, we will install signal handlers for SEGV, ABRT, BUS, ILL, + FPE, XCPU, XFSZ, SYS signals to generate a useful log message + +:Type: Boolean +:Default: ``true`` diff --git a/doc/rados/configuration/index.rst b/doc/rados/configuration/index.rst new file mode 100644 index 000000000..414fcc6fa --- /dev/null +++ b/doc/rados/configuration/index.rst @@ -0,0 +1,54 @@ +=============== + Configuration +=============== + +Each Ceph process, daemon, or utility draws its configuration from several +sources on startup. Such sources can include (1) a local configuration, (2) the +monitors, (3) the command line, and (4) environment variables. + +Configuration options can be set globally so that they apply (1) to all +daemons, (2) to all daemons or services of a particular type, or (3) to only a +specific daemon, process, or client. + +.. raw:: html + + <table cellpadding="10"><colgroup><col width="50%"><col width="50%"></colgroup><tbody valign="top"><tr><td><h3>Configuring the Object Store</h3> + +For general object store configuration, refer to the following: + +.. toctree:: + :maxdepth: 1 + + Storage devices <storage-devices> + ceph-conf + + +.. raw:: html + + </td><td><h3>Reference</h3> + +To optimize the performance of your cluster, refer to the following: + +.. toctree:: + :maxdepth: 1 + + Common Settings <common> + Network Settings <network-config-ref> + Messenger v2 protocol <msgr2> + Auth Settings <auth-config-ref> + Monitor Settings <mon-config-ref> + mon-lookup-dns + Heartbeat Settings <mon-osd-interaction> + OSD Settings <osd-config-ref> + DmClock Settings <mclock-config-ref> + BlueStore Settings <bluestore-config-ref> + FileStore Settings <filestore-config-ref> + Journal Settings <journal-ref> + Pool, PG & CRUSH Settings <pool-pg-config-ref.rst> + Messaging Settings <ms-ref> + General Settings <general-config-ref> + + +.. raw:: html + + </td></tr></tbody></table> diff --git a/doc/rados/configuration/journal-ref.rst b/doc/rados/configuration/journal-ref.rst new file mode 100644 index 000000000..71c74c606 --- /dev/null +++ b/doc/rados/configuration/journal-ref.rst @@ -0,0 +1,119 @@ +========================== + Journal Config Reference +========================== + +.. index:: journal; journal configuration + +Filestore OSDs use a journal for two reasons: speed and consistency. Note +that since Luminous, the BlueStore OSD back end has been preferred and default. +This information is provided for pre-existing OSDs and for rare situations where +Filestore is preferred for new deployments. + +- **Speed:** The journal enables the Ceph OSD Daemon to commit small writes + quickly. Ceph writes small, random i/o to the journal sequentially, which + tends to speed up bursty workloads by allowing the backing file system more + time to coalesce writes. The Ceph OSD Daemon's journal, however, can lead + to spiky performance with short spurts of high-speed writes followed by + periods without any write progress as the file system catches up to the + journal. + +- **Consistency:** Ceph OSD Daemons require a file system interface that + guarantees atomic compound operations. Ceph OSD Daemons write a description + of the operation to the journal and apply the operation to the file system. + This enables atomic updates to an object (for example, placement group + metadata). Every few seconds--between ``filestore max sync interval`` and + ``filestore min sync interval``--the Ceph OSD Daemon stops writes and + synchronizes the journal with the file system, allowing Ceph OSD Daemons to + trim operations from the journal and reuse the space. On failure, Ceph + OSD Daemons replay the journal starting after the last synchronization + operation. + +Ceph OSD Daemons recognize the following journal settings: + + +``journal_dio`` + +:Description: Enables direct i/o to the journal. Requires ``journal block + align`` set to ``true``. + +:Type: Boolean +:Required: Yes when using ``aio``. +:Default: ``true`` + + + +``journal_aio`` + +.. versionchanged:: 0.61 Cuttlefish + +:Description: Enables using ``libaio`` for asynchronous writes to the journal. + Requires ``journal dio`` set to ``true``. + +:Type: Boolean +:Required: No. +:Default: Version 0.61 and later, ``true``. Version 0.60 and earlier, ``false``. + + +``journal_block_align`` + +:Description: Block aligns write operations. Required for ``dio`` and ``aio``. +:Type: Boolean +:Required: Yes when using ``dio`` and ``aio``. +:Default: ``true`` + + +``journal_max_write_bytes`` + +:Description: The maximum number of bytes the journal will write at + any one time. + +:Type: Integer +:Required: No +:Default: ``10 << 20`` + + +``journal_max_write_entries`` + +:Description: The maximum number of entries the journal will write at + any one time. + +:Type: Integer +:Required: No +:Default: ``100`` + + +``journal_queue_max_ops`` + +:Description: The maximum number of operations allowed in the queue at + any one time. + +:Type: Integer +:Required: No +:Default: ``500`` + + +``journal_queue_max_bytes`` + +:Description: The maximum number of bytes allowed in the queue at + any one time. + +:Type: Integer +:Required: No +:Default: ``10 << 20`` + + +``journal_align_min_size`` + +:Description: Align data payloads greater than the specified minimum. +:Type: Integer +:Required: No +:Default: ``64 << 10`` + + +``journal_zero_on_create`` + +:Description: Causes the file store to overwrite the entire journal with + ``0``'s during ``mkfs``. +:Type: Boolean +:Required: No +:Default: ``false`` diff --git a/doc/rados/configuration/mclock-config-ref.rst b/doc/rados/configuration/mclock-config-ref.rst new file mode 100644 index 000000000..579056895 --- /dev/null +++ b/doc/rados/configuration/mclock-config-ref.rst @@ -0,0 +1,395 @@ +======================== + mClock Config Reference +======================== + +.. index:: mclock; configuration + +Mclock profiles mask the low level details from users, making it +easier for them to configure mclock. + +The following input parameters are required for a mclock profile to configure +the QoS related parameters: + +* total capacity (IOPS) of each OSD (determined automatically) + +* an mclock profile type to enable + +Using the settings in the specified profile, the OSD determines and applies the +lower-level mclock and Ceph parameters. The parameters applied by the mclock +profile make it possible to tune the QoS between client I/O, recovery/backfill +operations, and other background operations (for example, scrub, snap trim, and +PG deletion). These background activities are considered best-effort internal +clients of Ceph. + + +.. index:: mclock; profile definition + +mClock Profiles - Definition and Purpose +======================================== + +A mclock profile is *“a configuration setting that when applied on a running +Ceph cluster enables the throttling of the operations(IOPS) belonging to +different client classes (background recovery, scrub, snaptrim, client op, +osd subop)”*. + +The mclock profile uses the capacity limits and the mclock profile type selected +by the user to determine the low-level mclock resource control parameters. + +Depending on the profile type, lower-level mclock resource-control parameters +and some Ceph-configuration parameters are transparently applied. + +The low-level mclock resource control parameters are the *reservation*, +*limit*, and *weight* that provide control of the resource shares, as +described in the :ref:`dmclock-qos` section. + + +.. index:: mclock; profile types + +mClock Profile Types +==================== + +mclock profiles can be broadly classified into two types, + +- **Built-in**: Users can choose between the following built-in profile types: + + - **high_client_ops** (*default*): + This profile allocates more reservation and limit to external-client ops + as compared to background recoveries and other internal clients within + Ceph. This profile is enabled by default. + - **high_recovery_ops**: + This profile allocates more reservation to background recoveries as + compared to external clients and other internal clients within Ceph. For + example, an admin may enable this profile temporarily to speed-up background + recoveries during non-peak hours. + - **balanced**: + This profile allocates equal reservation to client ops and background + recovery ops. + +- **Custom**: This profile gives users complete control over all the mclock + configuration parameters. Using this profile is not recommended without + a deep understanding of mclock and related Ceph-configuration options. + +.. note:: Across the built-in profiles, internal clients of mclock (for example + "scrub", "snap trim", and "pg deletion") are given slightly lower + reservations, but higher weight and no limit. This ensures that + these operations are able to complete quickly if there are no other + competing services. + + +.. index:: mclock; built-in profiles + +mClock Built-in Profiles +======================== + +When a built-in profile is enabled, the mClock scheduler calculates the low +level mclock parameters [*reservation*, *weight*, *limit*] based on the profile +enabled for each client type. The mclock parameters are calculated based on +the max OSD capacity provided beforehand. As a result, the following mclock +config parameters cannot be modified when using any of the built-in profiles: + +- ``osd_mclock_scheduler_client_res`` +- ``osd_mclock_scheduler_client_wgt`` +- ``osd_mclock_scheduler_client_lim`` +- ``osd_mclock_scheduler_background_recovery_res`` +- ``osd_mclock_scheduler_background_recovery_wgt`` +- ``osd_mclock_scheduler_background_recovery_lim`` +- ``osd_mclock_scheduler_background_best_effort_res`` +- ``osd_mclock_scheduler_background_best_effort_wgt`` +- ``osd_mclock_scheduler_background_best_effort_lim`` + +The following Ceph options will not be modifiable by the user: + +- ``osd_max_backfills`` +- ``osd_recovery_max_active`` + +This is because the above options are internally modified by the mclock +scheduler in order to maximize the impact of the set profile. + +By default, the *high_client_ops* profile is enabled to ensure that a larger +chunk of the bandwidth allocation goes to client ops. Background recovery ops +are given lower allocation (and therefore take a longer time to complete). But +there might be instances that necessitate giving higher allocations to either +client ops or recovery ops. In order to deal with such a situation, you can +enable one of the alternate built-in profiles by following the steps mentioned +in the next section. + +If any mClock profile (including "custom") is active, the following Ceph config +sleep options will be disabled, + +- ``osd_recovery_sleep`` +- ``osd_recovery_sleep_hdd`` +- ``osd_recovery_sleep_ssd`` +- ``osd_recovery_sleep_hybrid`` +- ``osd_scrub_sleep`` +- ``osd_delete_sleep`` +- ``osd_delete_sleep_hdd`` +- ``osd_delete_sleep_ssd`` +- ``osd_delete_sleep_hybrid`` +- ``osd_snap_trim_sleep`` +- ``osd_snap_trim_sleep_hdd`` +- ``osd_snap_trim_sleep_ssd`` +- ``osd_snap_trim_sleep_hybrid`` + +The above sleep options are disabled to ensure that mclock scheduler is able to +determine when to pick the next op from its operation queue and transfer it to +the operation sequencer. This results in the desired QoS being provided across +all its clients. + + +.. index:: mclock; enable built-in profile + +Steps to Enable mClock Profile +============================== + +As already mentioned, the default mclock profile is set to *high_client_ops*. +The other values for the built-in profiles include *balanced* and +*high_recovery_ops*. + +If there is a requirement to change the default profile, then the option +``osd_mclock_profile`` may be set during runtime by using the following +command: + + .. prompt:: bash # + + ceph config set osd.N osd_mclock_profile <value> + +For example, to change the profile to allow faster recoveries on "osd.0", the +following command can be used to switch to the *high_recovery_ops* profile: + + .. prompt:: bash # + + ceph config set osd.0 osd_mclock_profile high_recovery_ops + +.. note:: The *custom* profile is not recommended unless you are an advanced + user. + +And that's it! You are ready to run workloads on the cluster and check if the +QoS requirements are being met. + + +OSD Capacity Determination (Automated) +====================================== + +The OSD capacity in terms of total IOPS is determined automatically during OSD +initialization. This is achieved by running the OSD bench tool and overriding +the default value of ``osd_mclock_max_capacity_iops_[hdd, ssd]`` option +depending on the device type. No other action/input is expected from the user +to set the OSD capacity. You may verify the capacity of an OSD after the +cluster is brought up by using the following command: + + .. prompt:: bash # + + ceph config show osd.N osd_mclock_max_capacity_iops_[hdd, ssd] + +For example, the following command shows the max capacity for "osd.0" on a Ceph +node whose underlying device type is SSD: + + .. prompt:: bash # + + ceph config show osd.0 osd_mclock_max_capacity_iops_ssd + + +Steps to Manually Benchmark an OSD (Optional) +============================================= + +.. note:: These steps are only necessary if you want to override the OSD + capacity already determined automatically during OSD initialization. + Otherwise, you may skip this section entirely. + +.. tip:: If you have already determined the benchmark data and wish to manually + override the max osd capacity for an OSD, you may skip to section + `Specifying Max OSD Capacity`_. + + +Any existing benchmarking tool can be used for this purpose. In this case, the +steps use the *Ceph OSD Bench* command described in the next section. Regardless +of the tool/command used, the steps outlined further below remain the same. + +As already described in the :ref:`dmclock-qos` section, the number of +shards and the bluestore's throttle parameters have an impact on the mclock op +queues. Therefore, it is critical to set these values carefully in order to +maximize the impact of the mclock scheduler. + +:Number of Operational Shards: + We recommend using the default number of shards as defined by the + configuration options ``osd_op_num_shards``, ``osd_op_num_shards_hdd``, and + ``osd_op_num_shards_ssd``. In general, a lower number of shards will increase + the impact of the mclock queues. + +:Bluestore Throttle Parameters: + We recommend using the default values as defined by + ``bluestore_throttle_bytes`` and ``bluestore_throttle_deferred_bytes``. But + these parameters may also be determined during the benchmarking phase as + described below. + + +OSD Bench Command Syntax +```````````````````````` + +The :ref:`osd-subsystem` section describes the OSD bench command. The syntax +used for benchmarking is shown below : + +.. prompt:: bash # + + ceph tell osd.N bench [TOTAL_BYTES] [BYTES_PER_WRITE] [OBJ_SIZE] [NUM_OBJS] + +where, + +* ``TOTAL_BYTES``: Total number of bytes to write +* ``BYTES_PER_WRITE``: Block size per write +* ``OBJ_SIZE``: Bytes per object +* ``NUM_OBJS``: Number of objects to write + +Benchmarking Test Steps Using OSD Bench +``````````````````````````````````````` + +The steps below use the default shards and detail the steps used to determine +the correct bluestore throttle values (optional). + +#. Bring up your Ceph cluster and login to the Ceph node hosting the OSDs that + you wish to benchmark. +#. Run a simple 4KiB random write workload on an OSD using the following + commands: + + .. note:: Note that before running the test, caches must be cleared to get an + accurate measurement. + + For example, if you are running the benchmark test on osd.0, run the following + commands: + + .. prompt:: bash # + + ceph tell osd.0 cache drop + + .. prompt:: bash # + + ceph tell osd.0 bench 12288000 4096 4194304 100 + +#. Note the overall throughput(IOPS) obtained from the output of the osd bench + command. This value is the baseline throughput(IOPS) when the default + bluestore throttle options are in effect. +#. If the intent is to determine the bluestore throttle values for your + environment, then set the two options, ``bluestore_throttle_bytes`` + and ``bluestore_throttle_deferred_bytes`` to 32 KiB(32768 Bytes) each + to begin with. Otherwise, you may skip to the next section. +#. Run the 4KiB random write test as before using OSD bench. +#. Note the overall throughput from the output and compare the value + against the baseline throughput recorded in step 3. +#. If the throughput doesn't match with the baseline, increment the bluestore + throttle options by 2x and repeat steps 5 through 7 until the obtained + throughput is very close to the baseline value. + +For example, during benchmarking on a machine with NVMe SSDs, a value of 256 KiB +for both bluestore throttle and deferred bytes was determined to maximize the +impact of mclock. For HDDs, the corresponding value was 40 MiB, where the +overall throughput was roughly equal to the baseline throughput. Note that in +general for HDDs, the bluestore throttle values are expected to be higher when +compared to SSDs. + + +Specifying Max OSD Capacity +```````````````````````````` + +The steps in this section may be performed only if you want to override the +max osd capacity automatically set during OSD initialization. The option +``osd_mclock_max_capacity_iops_[hdd, ssd]`` for an OSD can be set by running the +following command: + + .. prompt:: bash # + + ceph config set osd.N osd_mclock_max_capacity_iops_[hdd,ssd] <value> + +For example, the following command sets the max capacity for a specific OSD +(say "osd.0") whose underlying device type is HDD to 350 IOPS: + + .. prompt:: bash # + + ceph config set osd.0 osd_mclock_max_capacity_iops_hdd 350 + +Alternatively, you may specify the max capacity for OSDs within the Ceph +configuration file under the respective [osd.N] section. See +:ref:`ceph-conf-settings` for more details. + + +.. index:: mclock; config settings + +mClock Config Options +===================== + +``osd_mclock_profile`` + +:Description: This sets the type of mclock profile to use for providing QoS + based on operations belonging to different classes (background + recovery, scrub, snaptrim, client op, osd subop). Once a built-in + profile is enabled, the lower level mclock resource control + parameters [*reservation, weight, limit*] and some Ceph + configuration parameters are set transparently. Note that the + above does not apply for the *custom* profile. + +:Type: String +:Valid Choices: high_client_ops, high_recovery_ops, balanced, custom +:Default: ``high_client_ops`` + +``osd_mclock_max_capacity_iops_hdd`` + +:Description: Max IOPS capacity (at 4KiB block size) to consider per OSD (for + rotational media) + +:Type: Float +:Default: ``315.0`` + +``osd_mclock_max_capacity_iops_ssd`` + +:Description: Max IOPS capacity (at 4KiB block size) to consider per OSD (for + solid state media) + +:Type: Float +:Default: ``21500.0`` + +``osd_mclock_cost_per_io_usec`` + +:Description: Cost per IO in microseconds to consider per OSD (overrides _ssd + and _hdd if non-zero) + +:Type: Float +:Default: ``0.0`` + +``osd_mclock_cost_per_io_usec_hdd`` + +:Description: Cost per IO in microseconds to consider per OSD (for rotational + media) + +:Type: Float +:Default: ``25000.0`` + +``osd_mclock_cost_per_io_usec_ssd`` + +:Description: Cost per IO in microseconds to consider per OSD (for solid state + media) + +:Type: Float +:Default: ``50.0`` + +``osd_mclock_cost_per_byte_usec`` + +:Description: Cost per byte in microseconds to consider per OSD (overrides _ssd + and _hdd if non-zero) + +:Type: Float +:Default: ``0.0`` + +``osd_mclock_cost_per_byte_usec_hdd`` + +:Description: Cost per byte in microseconds to consider per OSD (for rotational + media) + +:Type: Float +:Default: ``5.2`` + +``osd_mclock_cost_per_byte_usec_ssd`` + +:Description: Cost per byte in microseconds to consider per OSD (for solid state + media) + +:Type: Float +:Default: ``0.011`` diff --git a/doc/rados/configuration/mon-config-ref.rst b/doc/rados/configuration/mon-config-ref.rst new file mode 100644 index 000000000..3b12af43d --- /dev/null +++ b/doc/rados/configuration/mon-config-ref.rst @@ -0,0 +1,1243 @@ +.. _monitor-config-reference: + +========================== + Monitor Config Reference +========================== + +Understanding how to configure a :term:`Ceph Monitor` is an important part of +building a reliable :term:`Ceph Storage Cluster`. **All Ceph Storage Clusters +have at least one monitor**. The monitor complement usually remains fairly +consistent, but you can add, remove or replace a monitor in a cluster. See +`Adding/Removing a Monitor`_ for details. + + +.. index:: Ceph Monitor; Paxos + +Background +========== + +Ceph Monitors maintain a "master copy" of the :term:`Cluster Map`, which means a +:term:`Ceph Client` can determine the location of all Ceph Monitors, Ceph OSD +Daemons, and Ceph Metadata Servers just by connecting to one Ceph Monitor and +retrieving a current cluster map. Before Ceph Clients can read from or write to +Ceph OSD Daemons or Ceph Metadata Servers, they must connect to a Ceph Monitor +first. With a current copy of the cluster map and the CRUSH algorithm, a Ceph +Client can compute the location for any object. The ability to compute object +locations allows a Ceph Client to talk directly to Ceph OSD Daemons, which is a +very important aspect of Ceph's high scalability and performance. See +`Scalability and High Availability`_ for additional details. + +The primary role of the Ceph Monitor is to maintain a master copy of the cluster +map. Ceph Monitors also provide authentication and logging services. Ceph +Monitors write all changes in the monitor services to a single Paxos instance, +and Paxos writes the changes to a key/value store for strong consistency. Ceph +Monitors can query the most recent version of the cluster map during sync +operations. Ceph Monitors leverage the key/value store's snapshots and iterators +(using leveldb) to perform store-wide synchronization. + +.. ditaa:: + /-------------\ /-------------\ + | Monitor | Write Changes | Paxos | + | cCCC +-------------->+ cCCC | + | | | | + +-------------+ \------+------/ + | Auth | | + +-------------+ | Write Changes + | Log | | + +-------------+ v + | Monitor Map | /------+------\ + +-------------+ | Key / Value | + | OSD Map | | Store | + +-------------+ | cCCC | + | PG Map | \------+------/ + +-------------+ ^ + | MDS Map | | Read Changes + +-------------+ | + | cCCC |*---------------------+ + \-------------/ + + +.. deprecated:: version 0.58 + +In Ceph versions 0.58 and earlier, Ceph Monitors use a Paxos instance for +each service and store the map as a file. + +.. index:: Ceph Monitor; cluster map + +Cluster Maps +------------ + +The cluster map is a composite of maps, including the monitor map, the OSD map, +the placement group map and the metadata server map. The cluster map tracks a +number of important things: which processes are ``in`` the Ceph Storage Cluster; +which processes that are ``in`` the Ceph Storage Cluster are ``up`` and running +or ``down``; whether, the placement groups are ``active`` or ``inactive``, and +``clean`` or in some other state; and, other details that reflect the current +state of the cluster such as the total amount of storage space, and the amount +of storage used. + +When there is a significant change in the state of the cluster--e.g., a Ceph OSD +Daemon goes down, a placement group falls into a degraded state, etc.--the +cluster map gets updated to reflect the current state of the cluster. +Additionally, the Ceph Monitor also maintains a history of the prior states of +the cluster. The monitor map, OSD map, placement group map and metadata server +map each maintain a history of their map versions. We call each version an +"epoch." + +When operating your Ceph Storage Cluster, keeping track of these states is an +important part of your system administration duties. See `Monitoring a Cluster`_ +and `Monitoring OSDs and PGs`_ for additional details. + +.. index:: high availability; quorum + +Monitor Quorum +-------------- + +Our Configuring ceph section provides a trivial `Ceph configuration file`_ that +provides for one monitor in the test cluster. A cluster will run fine with a +single monitor; however, **a single monitor is a single-point-of-failure**. To +ensure high availability in a production Ceph Storage Cluster, you should run +Ceph with multiple monitors so that the failure of a single monitor **WILL NOT** +bring down your entire cluster. + +When a Ceph Storage Cluster runs multiple Ceph Monitors for high availability, +Ceph Monitors use `Paxos`_ to establish consensus about the master cluster map. +A consensus requires a majority of monitors running to establish a quorum for +consensus about the cluster map (e.g., 1; 2 out of 3; 3 out of 5; 4 out of 6; +etc.). + +``mon force quorum join`` + +:Description: Force monitor to join quorum even if it has been previously removed from the map +:Type: Boolean +:Default: ``False`` + +.. index:: Ceph Monitor; consistency + +Consistency +----------- + +When you add monitor settings to your Ceph configuration file, you need to be +aware of some of the architectural aspects of Ceph Monitors. **Ceph imposes +strict consistency requirements** for a Ceph monitor when discovering another +Ceph Monitor within the cluster. Whereas, Ceph Clients and other Ceph daemons +use the Ceph configuration file to discover monitors, monitors discover each +other using the monitor map (monmap), not the Ceph configuration file. + +A Ceph Monitor always refers to the local copy of the monmap when discovering +other Ceph Monitors in the Ceph Storage Cluster. Using the monmap instead of the +Ceph configuration file avoids errors that could break the cluster (e.g., typos +in ``ceph.conf`` when specifying a monitor address or port). Since monitors use +monmaps for discovery and they share monmaps with clients and other Ceph +daemons, **the monmap provides monitors with a strict guarantee that their +consensus is valid.** + +Strict consistency also applies to updates to the monmap. As with any other +updates on the Ceph Monitor, changes to the monmap always run through a +distributed consensus algorithm called `Paxos`_. The Ceph Monitors must agree on +each update to the monmap, such as adding or removing a Ceph Monitor, to ensure +that each monitor in the quorum has the same version of the monmap. Updates to +the monmap are incremental so that Ceph Monitors have the latest agreed upon +version, and a set of previous versions. Maintaining a history enables a Ceph +Monitor that has an older version of the monmap to catch up with the current +state of the Ceph Storage Cluster. + +If Ceph Monitors were to discover each other through the Ceph configuration file +instead of through the monmap, additional risks would be introduced because +Ceph configuration files are not updated and distributed automatically. Ceph +Monitors might inadvertently use an older Ceph configuration file, fail to +recognize a Ceph Monitor, fall out of a quorum, or develop a situation where +`Paxos`_ is not able to determine the current state of the system accurately. + + +.. index:: Ceph Monitor; bootstrapping monitors + +Bootstrapping Monitors +---------------------- + +In most configuration and deployment cases, tools that deploy Ceph help +bootstrap the Ceph Monitors by generating a monitor map for you (e.g., +``cephadm``, etc). A Ceph Monitor requires a few explicit +settings: + +- **Filesystem ID**: The ``fsid`` is the unique identifier for your + object store. Since you can run multiple clusters on the same + hardware, you must specify the unique ID of the object store when + bootstrapping a monitor. Deployment tools usually do this for you + (e.g., ``cephadm`` can call a tool like ``uuidgen``), but you + may specify the ``fsid`` manually too. + +- **Monitor ID**: A monitor ID is a unique ID assigned to each monitor within + the cluster. It is an alphanumeric value, and by convention the identifier + usually follows an alphabetical increment (e.g., ``a``, ``b``, etc.). This + can be set in a Ceph configuration file (e.g., ``[mon.a]``, ``[mon.b]``, etc.), + by a deployment tool, or using the ``ceph`` commandline. + +- **Keys**: The monitor must have secret keys. A deployment tool such as + ``cephadm`` usually does this for you, but you may + perform this step manually too. See `Monitor Keyrings`_ for details. + +For additional details on bootstrapping, see `Bootstrapping a Monitor`_. + +.. index:: Ceph Monitor; configuring monitors + +Configuring Monitors +==================== + +To apply configuration settings to the entire cluster, enter the configuration +settings under ``[global]``. To apply configuration settings to all monitors in +your cluster, enter the configuration settings under ``[mon]``. To apply +configuration settings to specific monitors, specify the monitor instance +(e.g., ``[mon.a]``). By convention, monitor instance names use alpha notation. + +.. code-block:: ini + + [global] + + [mon] + + [mon.a] + + [mon.b] + + [mon.c] + + +Minimum Configuration +--------------------- + +The bare minimum monitor settings for a Ceph monitor via the Ceph configuration +file include a hostname and a network address for each monitor. You can configure +these under ``[mon]`` or under the entry for a specific monitor. + +.. code-block:: ini + + [global] + mon host = 10.0.0.2,10.0.0.3,10.0.0.4 + +.. code-block:: ini + + [mon.a] + host = hostname1 + mon addr = 10.0.0.10:6789 + +See the `Network Configuration Reference`_ for details. + +.. note:: This minimum configuration for monitors assumes that a deployment + tool generates the ``fsid`` and the ``mon.`` key for you. + +Once you deploy a Ceph cluster, you **SHOULD NOT** change the IP addresses of +monitors. However, if you decide to change the monitor's IP address, you +must follow a specific procedure. See `Changing a Monitor's IP Address`_ for +details. + +Monitors can also be found by clients by using DNS SRV records. See `Monitor lookup through DNS`_ for details. + +Cluster ID +---------- + +Each Ceph Storage Cluster has a unique identifier (``fsid``). If specified, it +usually appears under the ``[global]`` section of the configuration file. +Deployment tools usually generate the ``fsid`` and store it in the monitor map, +so the value may not appear in a configuration file. The ``fsid`` makes it +possible to run daemons for multiple clusters on the same hardware. + +``fsid`` + +:Description: The cluster ID. One per cluster. +:Type: UUID +:Required: Yes. +:Default: N/A. May be generated by a deployment tool if not specified. + +.. note:: Do not set this value if you use a deployment tool that does + it for you. + + +.. index:: Ceph Monitor; initial members + +Initial Members +--------------- + +We recommend running a production Ceph Storage Cluster with at least three Ceph +Monitors to ensure high availability. When you run multiple monitors, you may +specify the initial monitors that must be members of the cluster in order to +establish a quorum. This may reduce the time it takes for your cluster to come +online. + +.. code-block:: ini + + [mon] + mon_initial_members = a,b,c + + +``mon_initial_members`` + +:Description: The IDs of initial monitors in a cluster during startup. If + specified, Ceph requires an odd number of monitors to form an + initial quorum (e.g., 3). + +:Type: String +:Default: None + +.. note:: A *majority* of monitors in your cluster must be able to reach + each other in order to establish a quorum. You can decrease the initial + number of monitors to establish a quorum with this setting. + +.. index:: Ceph Monitor; data path + +Data +---- + +Ceph provides a default path where Ceph Monitors store data. For optimal +performance in a production Ceph Storage Cluster, we recommend running Ceph +Monitors on separate hosts and drives from Ceph OSD Daemons. As leveldb uses +``mmap()`` for writing the data, Ceph Monitors flush their data from memory to disk +very often, which can interfere with Ceph OSD Daemon workloads if the data +store is co-located with the OSD Daemons. + +In Ceph versions 0.58 and earlier, Ceph Monitors store their data in plain files. This +approach allows users to inspect monitor data with common tools like ``ls`` +and ``cat``. However, this approach didn't provide strong consistency. + +In Ceph versions 0.59 and later, Ceph Monitors store their data as key/value +pairs. Ceph Monitors require `ACID`_ transactions. Using a data store prevents +recovering Ceph Monitors from running corrupted versions through Paxos, and it +enables multiple modification operations in one single atomic batch, among other +advantages. + +Generally, we do not recommend changing the default data location. If you modify +the default location, we recommend that you make it uniform across Ceph Monitors +by setting it in the ``[mon]`` section of the configuration file. + + +``mon_data`` + +:Description: The monitor's data location. +:Type: String +:Default: ``/var/lib/ceph/mon/$cluster-$id`` + + +``mon_data_size_warn`` + +:Description: Raise ``HEALTH_WARN`` status when a monitor's data + store grows to be larger than this size, 15GB by default. + +:Type: Integer +:Default: ``15*1024*1024*1024`` + + +``mon_data_avail_warn`` + +:Description: Raise ``HEALTH_WARN`` status when the filesystem that houses a + monitor's data store reports that its available capacity is + less than or equal to this percentage . + +:Type: Integer +:Default: ``30`` + + +``mon_data_avail_crit`` + +:Description: Raise ``HEALTH_ERR`` status when the filesystem that houses a + monitor's data store reports that its available capacity is + less than or equal to this percentage. + +:Type: Integer +:Default: ``5`` + +``mon_warn_on_cache_pools_without_hit_sets`` + +:Description: Raise ``HEALTH_WARN`` when a cache pool does not + have the ``hit_set_type`` value configured. + See :ref:`hit_set_type <hit_set_type>` for more + details. + +:Type: Boolean +:Default: ``True`` + +``mon_warn_on_crush_straw_calc_version_zero`` + +:Description: Raise ``HEALTH_WARN`` when the CRUSH + ``straw_calc_version`` is zero. See + :ref:`CRUSH map tunables <crush-map-tunables>` for + details. + +:Type: Boolean +:Default: ``True`` + + +``mon_warn_on_legacy_crush_tunables`` + +:Description: Raise ``HEALTH_WARN`` when + CRUSH tunables are too old (older than ``mon_min_crush_required_version``) + +:Type: Boolean +:Default: ``True`` + + +``mon_crush_min_required_version`` + +:Description: The minimum tunable profile required by the cluster. + See + :ref:`CRUSH map tunables <crush-map-tunables>` for + details. + +:Type: String +:Default: ``hammer`` + + +``mon_warn_on_osd_down_out_interval_zero`` + +:Description: Raise ``HEALTH_WARN`` when + ``mon_osd_down_out_interval`` is zero. Having this option set to + zero on the leader acts much like the ``noout`` flag. It's hard + to figure out what's going wrong with clusters without the + ``noout`` flag set but acting like that just the same, so we + report a warning in this case. + +:Type: Boolean +:Default: ``True`` + + +``mon_warn_on_slow_ping_ratio`` + +:Description: Raise ``HEALTH_WARN`` when any heartbeat + between OSDs exceeds ``mon_warn_on_slow_ping_ratio`` + of ``osd_heartbeat_grace``. The default is 5%. +:Type: Float +:Default: ``0.05`` + + +``mon_warn_on_slow_ping_time`` + +:Description: Override ``mon_warn_on_slow_ping_ratio`` with a specific value. + Raise ``HEALTH_WARN`` if any heartbeat + between OSDs exceeds ``mon_warn_on_slow_ping_time`` + milliseconds. The default is 0 (disabled). +:Type: Integer +:Default: ``0`` + + +``mon_warn_on_pool_no_redundancy`` + +:Description: Raise ``HEALTH_WARN`` if any pool is + configured with no replicas. +:Type: Boolean +:Default: ``True`` + + +``mon_cache_target_full_warn_ratio`` + +:Description: Position between pool's ``cache_target_full`` and + ``target_max_object`` where we start warning + +:Type: Float +:Default: ``0.66`` + + +``mon_health_to_clog`` + +:Description: Enable sending a health summary to the cluster log periodically. +:Type: Boolean +:Default: ``True`` + + +``mon_health_to_clog_tick_interval`` + +:Description: How often (in seconds) the monitor sends a health summary to the cluster + log (a non-positive number disables). If current health summary + is empty or identical to the last time, monitor will not send it + to cluster log. + +:Type: Float +:Default: ``60.0`` + + +``mon_health_to_clog_interval`` + +:Description: How often (in seconds) the monitor sends a health summary to the cluster + log (a non-positive number disables). Monitors will always + send a summary to the cluster log whether or not it differs from + the previous summary. + +:Type: Integer +:Default: ``3600`` + + + +.. index:: Ceph Storage Cluster; capacity planning, Ceph Monitor; capacity planning + +.. _storage-capacity: + +Storage Capacity +---------------- + +When a Ceph Storage Cluster gets close to its maximum capacity +(see``mon_osd_full ratio``), Ceph prevents you from writing to or reading from OSDs +as a safety measure to prevent data loss. Therefore, letting a +production Ceph Storage Cluster approach its full ratio is not a good practice, +because it sacrifices high availability. The default full ratio is ``.95``, or +95% of capacity. This a very aggressive setting for a test cluster with a small +number of OSDs. + +.. tip:: When monitoring your cluster, be alert to warnings related to the + ``nearfull`` ratio. This means that a failure of some OSDs could result + in a temporary service disruption if one or more OSDs fails. Consider adding + more OSDs to increase storage capacity. + +A common scenario for test clusters involves a system administrator removing an +OSD from the Ceph Storage Cluster, watching the cluster rebalance, then removing +another OSD, and another, until at least one OSD eventually reaches the full +ratio and the cluster locks up. We recommend a bit of capacity +planning even with a test cluster. Planning enables you to gauge how much spare +capacity you will need in order to maintain high availability. Ideally, you want +to plan for a series of Ceph OSD Daemon failures where the cluster can recover +to an ``active+clean`` state without replacing those OSDs +immediately. Cluster operation continues in the ``active+degraded`` state, but this +is not ideal for normal operation and should be addressed promptly. + +The following diagram depicts a simplistic Ceph Storage Cluster containing 33 +Ceph Nodes with one OSD per host, each OSD reading from +and writing to a 3TB drive. So this exemplary Ceph Storage Cluster has a maximum +actual capacity of 99TB. With a ``mon osd full ratio`` of ``0.95``, if the Ceph +Storage Cluster falls to 5TB of remaining capacity, the cluster will not allow +Ceph Clients to read and write data. So the Ceph Storage Cluster's operating +capacity is 95TB, not 99TB. + +.. ditaa:: + +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ + | Rack 1 | | Rack 2 | | Rack 3 | | Rack 4 | | Rack 5 | | Rack 6 | + | cCCC | | cF00 | | cCCC | | cCCC | | cCCC | | cCCC | + +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ + | OSD 1 | | OSD 7 | | OSD 13 | | OSD 19 | | OSD 25 | | OSD 31 | + +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ + | OSD 2 | | OSD 8 | | OSD 14 | | OSD 20 | | OSD 26 | | OSD 32 | + +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ + | OSD 3 | | OSD 9 | | OSD 15 | | OSD 21 | | OSD 27 | | OSD 33 | + +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ + | OSD 4 | | OSD 10 | | OSD 16 | | OSD 22 | | OSD 28 | | Spare | + +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ + | OSD 5 | | OSD 11 | | OSD 17 | | OSD 23 | | OSD 29 | | Spare | + +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ + | OSD 6 | | OSD 12 | | OSD 18 | | OSD 24 | | OSD 30 | | Spare | + +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ + +It is normal in such a cluster for one or two OSDs to fail. A less frequent but +reasonable scenario involves a rack's router or power supply failing, which +brings down multiple OSDs simultaneously (e.g., OSDs 7-12). In such a scenario, +you should still strive for a cluster that can remain operational and achieve an +``active + clean`` state--even if that means adding a few hosts with additional +OSDs in short order. If your capacity utilization is too high, you may not lose +data, but you could still sacrifice data availability while resolving an outage +within a failure domain if capacity utilization of the cluster exceeds the full +ratio. For this reason, we recommend at least some rough capacity planning. + +Identify two numbers for your cluster: + +#. The number of OSDs. +#. The total capacity of the cluster + +If you divide the total capacity of your cluster by the number of OSDs in your +cluster, you will find the mean average capacity of an OSD within your cluster. +Consider multiplying that number by the number of OSDs you expect will fail +simultaneously during normal operations (a relatively small number). Finally +multiply the capacity of the cluster by the full ratio to arrive at a maximum +operating capacity; then, subtract the number of amount of data from the OSDs +you expect to fail to arrive at a reasonable full ratio. Repeat the foregoing +process with a higher number of OSD failures (e.g., a rack of OSDs) to arrive at +a reasonable number for a near full ratio. + +The following settings only apply on cluster creation and are then stored in +the OSDMap. To clarify, in normal operation the values that are used by OSDs +are those found in the OSDMap, not those in the configuration file or central +config store. + +.. code-block:: ini + + [global] + mon_osd_full_ratio = .80 + mon_osd_backfillfull_ratio = .75 + mon_osd_nearfull_ratio = .70 + + +``mon_osd_full_ratio`` + +:Description: The threshold percentage of device space utilized before an OSD is + considered ``full``. + +:Type: Float +:Default: ``0.95`` + + +``mon_osd_backfillfull_ratio`` + +:Description: The threshold percentage of device space utilized before an OSD is + considered too ``full`` to backfill. + +:Type: Float +:Default: ``0.90`` + + +``mon_osd_nearfull_ratio`` + +:Description: The threshold percentage of device space used before an OSD is + considered ``nearfull``. + +:Type: Float +:Default: ``0.85`` + + +.. tip:: If some OSDs are nearfull, but others have plenty of capacity, you + may have an inaccurate CRUSH weight set for the nearfull OSDs. + +.. tip:: These settings only apply during cluster creation. Afterwards they need + to be changed in the OSDMap using ``ceph osd set-nearfull-ratio`` and + ``ceph osd set-full-ratio`` + +.. index:: heartbeat + +Heartbeat +--------- + +Ceph monitors know about the cluster by requiring reports from each OSD, and by +receiving reports from OSDs about the status of their neighboring OSDs. Ceph +provides reasonable default settings for monitor/OSD interaction; however, you +may modify them as needed. See `Monitor/OSD Interaction`_ for details. + + +.. index:: Ceph Monitor; leader, Ceph Monitor; provider, Ceph Monitor; requester, Ceph Monitor; synchronization + +Monitor Store Synchronization +----------------------------- + +When you run a production cluster with multiple monitors (recommended), each +monitor checks to see if a neighboring monitor has a more recent version of the +cluster map (e.g., a map in a neighboring monitor with one or more epoch numbers +higher than the most current epoch in the map of the instant monitor). +Periodically, one monitor in the cluster may fall behind the other monitors to +the point where it must leave the quorum, synchronize to retrieve the most +current information about the cluster, and then rejoin the quorum. For the +purposes of synchronization, monitors may assume one of three roles: + +#. **Leader**: The `Leader` is the first monitor to achieve the most recent + Paxos version of the cluster map. + +#. **Provider**: The `Provider` is a monitor that has the most recent version + of the cluster map, but wasn't the first to achieve the most recent version. + +#. **Requester:** A `Requester` is a monitor that has fallen behind the leader + and must synchronize in order to retrieve the most recent information about + the cluster before it can rejoin the quorum. + +These roles enable a leader to delegate synchronization duties to a provider, +which prevents synchronization requests from overloading the leader--improving +performance. In the following diagram, the requester has learned that it has +fallen behind the other monitors. The requester asks the leader to synchronize, +and the leader tells the requester to synchronize with a provider. + + +.. ditaa:: + +-----------+ +---------+ +----------+ + | Requester | | Leader | | Provider | + +-----------+ +---------+ +----------+ + | | | + | | | + | Ask to Synchronize | | + |------------------->| | + | | | + |<-------------------| | + | Tell Requester to | | + | Sync with Provider | | + | | | + | Synchronize | + |--------------------+-------------------->| + | | | + |<-------------------+---------------------| + | Send Chunk to Requester | + | (repeat as necessary) | + | Requester Acks Chuck to Provider | + |--------------------+-------------------->| + | | + | Sync Complete | + | Notification | + |------------------->| + | | + |<-------------------| + | Ack | + | | + + +Synchronization always occurs when a new monitor joins the cluster. During +runtime operations, monitors may receive updates to the cluster map at different +times. This means the leader and provider roles may migrate from one monitor to +another. If this happens while synchronizing (e.g., a provider falls behind the +leader), the provider can terminate synchronization with a requester. + +Once synchronization is complete, Ceph performs trimming across the cluster. +Trimming requires that the placement groups are ``active+clean``. + + +``mon_sync_timeout`` + +:Description: Number of seconds the monitor will wait for the next update + message from its sync provider before it gives up and bootstrap + again. + +:Type: Double +:Default: ``60.0`` + + +``mon_sync_max_payload_size`` + +:Description: The maximum size for a sync payload (in bytes). +:Type: 32-bit Integer +:Default: ``1048576`` + + +``paxos_max_join_drift`` + +:Description: The maximum Paxos iterations before we must first sync the + monitor data stores. When a monitor finds that its peer is too + far ahead of it, it will first sync with data stores before moving + on. + +:Type: Integer +:Default: ``10`` + + +``paxos_stash_full_interval`` + +:Description: How often (in commits) to stash a full copy of the PaxosService state. + Current this setting only affects ``mds``, ``mon``, ``auth`` and ``mgr`` + PaxosServices. + +:Type: Integer +:Default: ``25`` + + +``paxos_propose_interval`` + +:Description: Gather updates for this time interval before proposing + a map update. + +:Type: Double +:Default: ``1.0`` + + +``paxos_min`` + +:Description: The minimum number of Paxos states to keep around +:Type: Integer +:Default: ``500`` + + +``paxos_min_wait`` + +:Description: The minimum amount of time to gather updates after a period of + inactivity. + +:Type: Double +:Default: ``0.05`` + + +``paxos_trim_min`` + +:Description: Number of extra proposals tolerated before trimming +:Type: Integer +:Default: ``250`` + + +``paxos_trim_max`` + +:Description: The maximum number of extra proposals to trim at a time +:Type: Integer +:Default: ``500`` + + +``paxos_service_trim_min`` + +:Description: The minimum amount of versions to trigger a trim (0 disables it) +:Type: Integer +:Default: ``250`` + + +``paxos_service_trim_max`` + +:Description: The maximum amount of versions to trim during a single proposal (0 disables it) +:Type: Integer +:Default: ``500`` + + +``paxos service trim max multiplier`` + +:Description: The factor by which paxos service trim max will be multiplied + to get a new upper bound when trim sizes are high (0 disables it) +:Type: Integer +:Default: ``20`` + + +``mon mds force trim to`` + +:Description: Force monitor to trim mdsmaps to this point (0 disables it. + dangerous, use with care) + +:Type: Integer +:Default: ``0`` + + +``mon_osd_force_trim_to`` + +:Description: Force monitor to trim osdmaps to this point, even if there is + PGs not clean at the specified epoch (0 disables it. dangerous, + use with care) + +:Type: Integer +:Default: ``0`` + + +``mon_osd_cache_size`` + +:Description: The size of osdmaps cache, not to rely on underlying store's cache +:Type: Integer +:Default: ``500`` + + +``mon_election_timeout`` + +:Description: On election proposer, maximum waiting time for all ACKs in seconds. +:Type: Float +:Default: ``5.00`` + + +``mon_lease`` + +:Description: The length (in seconds) of the lease on the monitor's versions. +:Type: Float +:Default: ``5.00`` + + +``mon_lease_renew_interval_factor`` + +:Description: ``mon_lease`` \* ``mon_lease_renew_interval_factor`` will be the + interval for the Leader to renew the other monitor's leases. The + factor should be less than ``1.0``. + +:Type: Float +:Default: ``0.60`` + + +``mon_lease_ack_timeout_factor`` + +:Description: The Leader will wait ``mon_lease`` \* ``mon_lease_ack_timeout_factor`` + for the Providers to acknowledge the lease extension. + +:Type: Float +:Default: ``2.00`` + + +``mon_accept_timeout_factor`` + +:Description: The Leader will wait ``mon_lease`` \* ``mon_accept_timeout_factor`` + for the Requester(s) to accept a Paxos update. It is also used + during the Paxos recovery phase for similar purposes. + +:Type: Float +:Default: ``2.00`` + + +``mon_min_osdmap_epochs`` + +:Description: Minimum number of OSD map epochs to keep at all times. +:Type: 32-bit Integer +:Default: ``500`` + + +``mon_max_log_epochs`` + +:Description: Maximum number of Log epochs the monitor should keep. +:Type: 32-bit Integer +:Default: ``500`` + + + +.. index:: Ceph Monitor; clock + +Clock +----- + +Ceph daemons pass critical messages to each other, which must be processed +before daemons reach a timeout threshold. If the clocks in Ceph monitors +are not synchronized, it can lead to a number of anomalies. For example: + +- Daemons ignoring received messages (e.g., timestamps outdated) +- Timeouts triggered too soon/late when a message wasn't received in time. + +See `Monitor Store Synchronization`_ for details. + + +.. tip:: You must configure NTP or PTP daemons on your Ceph monitor hosts to + ensure that the monitor cluster operates with synchronized clocks. + It can be advantageous to have monitor hosts sync with each other + as well as with multiple quality upstream time sources. + +Clock drift may still be noticeable with NTP even though the discrepancy is not +yet harmful. Ceph's clock drift / clock skew warnings may get triggered even +though NTP maintains a reasonable level of synchronization. Increasing your +clock drift may be tolerable under such circumstances; however, a number of +factors such as workload, network latency, configuring overrides to default +timeouts and the `Monitor Store Synchronization`_ settings may influence +the level of acceptable clock drift without compromising Paxos guarantees. + +Ceph provides the following tunable options to allow you to find +acceptable values. + + +``mon_tick_interval`` + +:Description: A monitor's tick interval in seconds. +:Type: 32-bit Integer +:Default: ``5`` + + +``mon_clock_drift_allowed`` + +:Description: The clock drift in seconds allowed between monitors. +:Type: Float +:Default: ``0.05`` + + +``mon_clock_drift_warn_backoff`` + +:Description: Exponential backoff for clock drift warnings +:Type: Float +:Default: ``5.00`` + + +``mon_timecheck_interval`` + +:Description: The time check interval (clock drift check) in seconds + for the Leader. + +:Type: Float +:Default: ``300.00`` + + +``mon_timecheck_skew_interval`` + +:Description: The time check interval (clock drift check) in seconds when in + presence of a skew in seconds for the Leader. + +:Type: Float +:Default: ``30.00`` + + +Client +------ + +``mon_client_hunt_interval`` + +:Description: The client will try a new monitor every ``N`` seconds until it + establishes a connection. + +:Type: Double +:Default: ``3.00`` + + +``mon_client_ping_interval`` + +:Description: The client will ping the monitor every ``N`` seconds. +:Type: Double +:Default: ``10.00`` + + +``mon_client_max_log_entries_per_message`` + +:Description: The maximum number of log entries a monitor will generate + per client message. + +:Type: Integer +:Default: ``1000`` + + +``mon_client_bytes`` + +:Description: The amount of client message data allowed in memory (in bytes). +:Type: 64-bit Integer Unsigned +:Default: ``100ul << 20`` + +.. _pool-settings: + +Pool settings +============= + +Since version v0.94 there is support for pool flags which allow or disallow changes to be made to pools. +Monitors can also disallow removal of pools if appropriately configured. The inconvenience of this guardrail +is far outweighed by the number of accidental pool (and thus data) deletions it prevents. + +``mon_allow_pool_delete`` + +:Description: Should monitors allow pools to be removed, regardless of what the pool flags say? + +:Type: Boolean +:Default: ``false`` + + +``osd_pool_default_ec_fast_read`` + +:Description: Whether to turn on fast read on the pool or not. It will be used as + the default setting of newly created erasure coded pools if ``fast_read`` + is not specified at create time. + +:Type: Boolean +:Default: ``false`` + + +``osd_pool_default_flag_hashpspool`` + +:Description: Set the hashpspool flag on new pools +:Type: Boolean +:Default: ``true`` + + +``osd_pool_default_flag_nodelete`` + +:Description: Set the ``nodelete`` flag on new pools, which prevents pool removal. +:Type: Boolean +:Default: ``false`` + + +``osd_pool_default_flag_nopgchange`` + +:Description: Set the ``nopgchange`` flag on new pools. Does not allow the number of PGs to be changed. +:Type: Boolean +:Default: ``false`` + + +``osd_pool_default_flag_nosizechange`` + +:Description: Set the ``nosizechange`` flag on new pools. Does not allow the ``size`` to be changed. +:Type: Boolean +:Default: ``false`` + +For more information about the pool flags see `Pool values`_. + +Miscellaneous +============= + +``mon_max_osd`` + +:Description: The maximum number of OSDs allowed in the cluster. +:Type: 32-bit Integer +:Default: ``10000`` + + +``mon_globalid_prealloc`` + +:Description: The number of global IDs to pre-allocate for clients and daemons in the cluster. +:Type: 32-bit Integer +:Default: ``10000`` + + +``mon_subscribe_interval`` + +:Description: The refresh interval (in seconds) for subscriptions. The + subscription mechanism enables obtaining cluster maps + and log information. + +:Type: Double +:Default: ``86400.00`` + + +``mon_stat_smooth_intervals`` + +:Description: Ceph will smooth statistics over the last ``N`` PG maps. +:Type: Integer +:Default: ``6`` + + +``mon_probe_timeout`` + +:Description: Number of seconds the monitor will wait to find peers before bootstrapping. +:Type: Double +:Default: ``2.00`` + + +``mon_daemon_bytes`` + +:Description: The message memory cap for metadata server and OSD messages (in bytes). +:Type: 64-bit Integer Unsigned +:Default: ``400ul << 20`` + + +``mon_max_log_entries_per_event`` + +:Description: The maximum number of log entries per event. +:Type: Integer +:Default: ``4096`` + + +``mon_osd_prime_pg_temp`` + +:Description: Enables or disables priming the PGMap with the previous OSDs when an ``out`` + OSD comes back into the cluster. With the ``true`` setting, clients + will continue to use the previous OSDs until the newly ``in`` OSDs for + a PG have peered. + +:Type: Boolean +:Default: ``true`` + + +``mon_osd_prime pg temp max time`` + +:Description: How much time in seconds the monitor should spend trying to prime the + PGMap when an out OSD comes back into the cluster. + +:Type: Float +:Default: ``0.50`` + + +``mon_osd_prime_pg_temp_max_time_estimate`` + +:Description: Maximum estimate of time spent on each PG before we prime all PGs + in parallel. + +:Type: Float +:Default: ``0.25`` + + +``mon_mds_skip_sanity`` + +:Description: Skip safety assertions on FSMap (in case of bugs where we want to + continue anyway). Monitor terminates if the FSMap sanity check + fails, but we can disable it by enabling this option. + +:Type: Boolean +:Default: ``False`` + + +``mon_max_mdsmap_epochs`` + +:Description: The maximum number of mdsmap epochs to trim during a single proposal. +:Type: Integer +:Default: ``500`` + + +``mon_config_key_max_entry_size`` + +:Description: The maximum size of config-key entry (in bytes) +:Type: Integer +:Default: ``65536`` + + +``mon_scrub_interval`` + +:Description: How often the monitor scrubs its store by comparing + the stored checksums with the computed ones for all stored + keys. (0 disables it. dangerous, use with care) + +:Type: Seconds +:Default: ``1 day`` + + +``mon_scrub_max_keys`` + +:Description: The maximum number of keys to scrub each time. +:Type: Integer +:Default: ``100`` + + +``mon_compact_on_start`` + +:Description: Compact the database used as Ceph Monitor store on + ``ceph-mon`` start. A manual compaction helps to shrink the + monitor database and improve the performance of it if the regular + compaction fails to work. + +:Type: Boolean +:Default: ``False`` + + +``mon_compact_on_bootstrap`` + +:Description: Compact the database used as Ceph Monitor store + on bootstrap. Monitors probe each other to establish + a quorum after bootstrap. If a monitor times out before joining the + quorum, it will start over and bootstrap again. + +:Type: Boolean +:Default: ``False`` + + +``mon_compact_on_trim`` + +:Description: Compact a certain prefix (including paxos) when we trim its old states. +:Type: Boolean +:Default: ``True`` + + +``mon_cpu_threads`` + +:Description: Number of threads for performing CPU intensive work on monitor. +:Type: Integer +:Default: ``4`` + + +``mon_osd_mapping_pgs_per_chunk`` + +:Description: We calculate the mapping from placement group to OSDs in chunks. + This option specifies the number of placement groups per chunk. + +:Type: Integer +:Default: ``4096`` + + +``mon_session_timeout`` + +:Description: Monitor will terminate inactive sessions stay idle over this + time limit. + +:Type: Integer +:Default: ``300`` + + +``mon_osd_cache_size_min`` + +:Description: The minimum amount of bytes to be kept mapped in memory for osd + monitor caches. + +:Type: 64-bit Integer +:Default: ``134217728`` + + +``mon_memory_target`` + +:Description: The amount of bytes pertaining to OSD monitor caches and KV cache + to be kept mapped in memory with cache auto-tuning enabled. + +:Type: 64-bit Integer +:Default: ``2147483648`` + + +``mon_memory_autotune`` + +:Description: Autotune the cache memory used for OSD monitors and KV + database. + +:Type: Boolean +:Default: ``True`` + + +.. _Paxos: https://en.wikipedia.org/wiki/Paxos_(computer_science) +.. _Monitor Keyrings: ../../../dev/mon-bootstrap#secret-keys +.. _Ceph configuration file: ../ceph-conf/#monitors +.. _Network Configuration Reference: ../network-config-ref +.. _Monitor lookup through DNS: ../mon-lookup-dns +.. _ACID: https://en.wikipedia.org/wiki/ACID +.. _Adding/Removing a Monitor: ../../operations/add-or-rm-mons +.. _Monitoring a Cluster: ../../operations/monitoring +.. _Monitoring OSDs and PGs: ../../operations/monitoring-osd-pg +.. _Bootstrapping a Monitor: ../../../dev/mon-bootstrap +.. _Changing a Monitor's IP Address: ../../operations/add-or-rm-mons#changing-a-monitor-s-ip-address +.. _Monitor/OSD Interaction: ../mon-osd-interaction +.. _Scalability and High Availability: ../../../architecture#scalability-and-high-availability +.. _Pool values: ../../operations/pools/#set-pool-values diff --git a/doc/rados/configuration/mon-lookup-dns.rst b/doc/rados/configuration/mon-lookup-dns.rst new file mode 100644 index 000000000..c9bece004 --- /dev/null +++ b/doc/rados/configuration/mon-lookup-dns.rst @@ -0,0 +1,56 @@ +=============================== +Looking up Monitors through DNS +=============================== + +Since version 11.0.0 RADOS supports looking up Monitors through DNS. + +This way daemons and clients do not require a *mon host* configuration directive in their ceph.conf configuration file. + +Using DNS SRV TCP records clients are able to look up the monitors. + +This allows for less configuration on clients and monitors. Using a DNS update clients and daemons can be made aware of changes in the monitor topology. + +By default clients and daemons will look for the TCP service called *ceph-mon* which is configured by the *mon_dns_srv_name* configuration directive. + + +``mon dns srv name`` + +:Description: the service name used querying the DNS for the monitor hosts/addresses +:Type: String +:Default: ``ceph-mon`` + +Example +------- +When the DNS search domain is set to *example.com* a DNS zone file might contain the following elements. + +First, create records for the Monitors, either IPv4 (A) or IPv6 (AAAA). + +:: + + mon1.example.com. AAAA 2001:db8::100 + mon2.example.com. AAAA 2001:db8::200 + mon3.example.com. AAAA 2001:db8::300 + +:: + + mon1.example.com. A 192.168.0.1 + mon2.example.com. A 192.168.0.2 + mon3.example.com. A 192.168.0.3 + + +With those records now existing we can create the SRV TCP records with the name *ceph-mon* pointing to the three Monitors. + +:: + + _ceph-mon._tcp.example.com. 60 IN SRV 10 20 6789 mon1.example.com. + _ceph-mon._tcp.example.com. 60 IN SRV 10 30 6789 mon2.example.com. + _ceph-mon._tcp.example.com. 60 IN SRV 20 50 6789 mon3.example.com. + +Now all Monitors are running on port *6789*, with priorities 10, 10, 20 and weights 20, 30, 50 respectively. + +Monitor clients choose monitor by referencing the SRV records. If a cluster has multiple Monitor SRV records +with the same priority value, clients and daemons will load balance the connections to Monitors in proportion +to the values of the SRV weight fields. + +For the above example, this will result in approximate 40% of the clients and daemons connecting to mon1, +60% of them connecting to mon2. However, if neither of them is reachable, then mon3 will be reconsidered as a fallback. diff --git a/doc/rados/configuration/mon-osd-interaction.rst b/doc/rados/configuration/mon-osd-interaction.rst new file mode 100644 index 000000000..727070491 --- /dev/null +++ b/doc/rados/configuration/mon-osd-interaction.rst @@ -0,0 +1,396 @@ +===================================== + Configuring Monitor/OSD Interaction +===================================== + +.. index:: heartbeat + +After you have completed your initial Ceph configuration, you may deploy and run +Ceph. When you execute a command such as ``ceph health`` or ``ceph -s``, the +:term:`Ceph Monitor` reports on the current state of the :term:`Ceph Storage +Cluster`. The Ceph Monitor knows about the Ceph Storage Cluster by requiring +reports from each :term:`Ceph OSD Daemon`, and by receiving reports from Ceph +OSD Daemons about the status of their neighboring Ceph OSD Daemons. If the Ceph +Monitor doesn't receive reports, or if it receives reports of changes in the +Ceph Storage Cluster, the Ceph Monitor updates the status of the :term:`Ceph +Cluster Map`. + +Ceph provides reasonable default settings for Ceph Monitor/Ceph OSD Daemon +interaction. However, you may override the defaults. The following sections +describe how Ceph Monitors and Ceph OSD Daemons interact for the purposes of +monitoring the Ceph Storage Cluster. + +.. index:: heartbeat interval + +OSDs Check Heartbeats +===================== + +Each Ceph OSD Daemon checks the heartbeat of other Ceph OSD Daemons at random +intervals less than every 6 seconds. If a neighboring Ceph OSD Daemon doesn't +show a heartbeat within a 20 second grace period, the Ceph OSD Daemon may +consider the neighboring Ceph OSD Daemon ``down`` and report it back to a Ceph +Monitor, which will update the Ceph Cluster Map. You may change this grace +period by adding an ``osd heartbeat grace`` setting under the ``[mon]`` +and ``[osd]`` or ``[global]`` section of your Ceph configuration file, +or by setting the value at runtime. + + +.. ditaa:: + +---------+ +---------+ + | OSD 1 | | OSD 2 | + +---------+ +---------+ + | | + |----+ Heartbeat | + | | Interval | + |<---+ Exceeded | + | | + | Check | + | Heartbeat | + |------------------->| + | | + |<-------------------| + | Heart Beating | + | | + |----+ Heartbeat | + | | Interval | + |<---+ Exceeded | + | | + | Check | + | Heartbeat | + |------------------->| + | | + |----+ Grace | + | | Period | + |<---+ Exceeded | + | | + |----+ Mark | + | | OSD 2 | + |<---+ Down | + + +.. index:: OSD down report + +OSDs Report Down OSDs +===================== + +By default, two Ceph OSD Daemons from different hosts must report to the Ceph +Monitors that another Ceph OSD Daemon is ``down`` before the Ceph Monitors +acknowledge that the reported Ceph OSD Daemon is ``down``. But there is chance +that all the OSDs reporting the failure are hosted in a rack with a bad switch +which has trouble connecting to another OSD. To avoid this sort of false alarm, +we consider the peers reporting a failure a proxy for a potential "subcluster" +over the overall cluster that is similarly laggy. This is clearly not true in +all cases, but will sometimes help us localize the grace correction to a subset +of the system that is unhappy. ``mon osd reporter subtree level`` is used to +group the peers into the "subcluster" by their common ancestor type in CRUSH +map. By default, only two reports from different subtree are required to report +another Ceph OSD Daemon ``down``. You can change the number of reporters from +unique subtrees and the common ancestor type required to report a Ceph OSD +Daemon ``down`` to a Ceph Monitor by adding an ``mon osd min down reporters`` +and ``mon osd reporter subtree level`` settings under the ``[mon]`` section of +your Ceph configuration file, or by setting the value at runtime. + + +.. ditaa:: + + +---------+ +---------+ +---------+ + | OSD 1 | | OSD 2 | | Monitor | + +---------+ +---------+ +---------+ + | | | + | OSD 3 Is Down | | + |---------------+--------------->| + | | | + | | | + | | OSD 3 Is Down | + | |--------------->| + | | | + | | | + | | |---------+ Mark + | | | | OSD 3 + | | |<--------+ Down + + +.. index:: peering failure + +OSDs Report Peering Failure +=========================== + +If a Ceph OSD Daemon cannot peer with any of the Ceph OSD Daemons defined in its +Ceph configuration file (or the cluster map), it will ping a Ceph Monitor for +the most recent copy of the cluster map every 30 seconds. You can change the +Ceph Monitor heartbeat interval by adding an ``osd mon heartbeat interval`` +setting under the ``[osd]`` section of your Ceph configuration file, or by +setting the value at runtime. + +.. ditaa:: + + +---------+ +---------+ +-------+ +---------+ + | OSD 1 | | OSD 2 | | OSD 3 | | Monitor | + +---------+ +---------+ +-------+ +---------+ + | | | | + | Request To | | | + | Peer | | | + |-------------->| | | + |<--------------| | | + | Peering | | + | | | + | Request To | | + | Peer | | + |----------------------------->| | + | | + |----+ OSD Monitor | + | | Heartbeat | + |<---+ Interval Exceeded | + | | + | Failed to Peer with OSD 3 | + |-------------------------------------------->| + |<--------------------------------------------| + | Receive New Cluster Map | + + +.. index:: OSD status + +OSDs Report Their Status +======================== + +If an Ceph OSD Daemon doesn't report to a Ceph Monitor, the Ceph Monitor will +consider the Ceph OSD Daemon ``down`` after the ``mon osd report timeout`` +elapses. A Ceph OSD Daemon sends a report to a Ceph Monitor when a reportable +event such as a failure, a change in placement group stats, a change in +``up_thru`` or when it boots within 5 seconds. You can change the Ceph OSD +Daemon minimum report interval by adding an ``osd mon report interval`` +setting under the ``[osd]`` section of your Ceph configuration file, or by +setting the value at runtime. A Ceph OSD Daemon sends a report to a Ceph +Monitor every 120 seconds irrespective of whether any notable changes occur. +You can change the Ceph Monitor report interval by adding an ``osd mon report +interval max`` setting under the ``[osd]`` section of your Ceph configuration +file, or by setting the value at runtime. + + +.. ditaa:: + + +---------+ +---------+ + | OSD 1 | | Monitor | + +---------+ +---------+ + | | + |----+ Report Min | + | | Interval | + |<---+ Exceeded | + | | + |----+ Reportable | + | | Event | + |<---+ Occurs | + | | + | Report To | + | Monitor | + |------------------->| + | | + |----+ Report Max | + | | Interval | + |<---+ Exceeded | + | | + | Report To | + | Monitor | + |------------------->| + | | + |----+ Monitor | + | | Fails | + |<---+ | + +----+ Monitor OSD + | | Report Timeout + |<---+ Exceeded + | + +----+ Mark + | | OSD 1 + |<---+ Down + + + + +Configuration Settings +====================== + +When modifying heartbeat settings, you should include them in the ``[global]`` +section of your configuration file. + +.. index:: monitor heartbeat + +Monitor Settings +---------------- + +``mon osd min up ratio`` + +:Description: The minimum ratio of ``up`` Ceph OSD Daemons before Ceph will + mark Ceph OSD Daemons ``down``. + +:Type: Double +:Default: ``.3`` + + +``mon osd min in ratio`` + +:Description: The minimum ratio of ``in`` Ceph OSD Daemons before Ceph will + mark Ceph OSD Daemons ``out``. + +:Type: Double +:Default: ``.75`` + + +``mon osd laggy halflife`` + +:Description: The number of seconds laggy estimates will decay. +:Type: Integer +:Default: ``60*60`` + + +``mon osd laggy weight`` + +:Description: The weight for new samples in laggy estimation decay. +:Type: Double +:Default: ``0.3`` + + + +``mon osd laggy max interval`` + +:Description: Maximum value of ``laggy_interval`` in laggy estimations (in seconds). + Monitor uses an adaptive approach to evaluate the ``laggy_interval`` of + a certain OSD. This value will be used to calculate the grace time for + that OSD. +:Type: Integer +:Default: 300 + +``mon osd adjust heartbeat grace`` + +:Description: If set to ``true``, Ceph will scale based on laggy estimations. +:Type: Boolean +:Default: ``true`` + + +``mon osd adjust down out interval`` + +:Description: If set to ``true``, Ceph will scaled based on laggy estimations. +:Type: Boolean +:Default: ``true`` + + +``mon osd auto mark in`` + +:Description: Ceph will mark any booting Ceph OSD Daemons as ``in`` + the Ceph Storage Cluster. + +:Type: Boolean +:Default: ``false`` + + +``mon osd auto mark auto out in`` + +:Description: Ceph will mark booting Ceph OSD Daemons auto marked ``out`` + of the Ceph Storage Cluster as ``in`` the cluster. + +:Type: Boolean +:Default: ``true`` + + +``mon osd auto mark new in`` + +:Description: Ceph will mark booting new Ceph OSD Daemons as ``in`` the + Ceph Storage Cluster. + +:Type: Boolean +:Default: ``true`` + + +``mon osd down out interval`` + +:Description: The number of seconds Ceph waits before marking a Ceph OSD Daemon + ``down`` and ``out`` if it doesn't respond. + +:Type: 32-bit Integer +:Default: ``600`` + + +``mon osd down out subtree limit`` + +:Description: The smallest :term:`CRUSH` unit type that Ceph will **not** + automatically mark out. For instance, if set to ``host`` and if + all OSDs of a host are down, Ceph will not automatically mark out + these OSDs. + +:Type: String +:Default: ``rack`` + + +``mon osd report timeout`` + +:Description: The grace period in seconds before declaring + unresponsive Ceph OSD Daemons ``down``. + +:Type: 32-bit Integer +:Default: ``900`` + +``mon osd min down reporters`` + +:Description: The minimum number of Ceph OSD Daemons required to report a + ``down`` Ceph OSD Daemon. + +:Type: 32-bit Integer +:Default: ``2`` + + +``mon_osd_reporter_subtree_level`` + +:Description: In which level of parent bucket the reporters are counted. The OSDs + send failure reports to monitors if they find a peer that is not responsive. + Monitors mark the reported ``OSD`` out and then ``down`` after a grace period. +:Type: String +:Default: ``host`` + + +.. index:: OSD hearbeat + +OSD Settings +------------ + +``osd_heartbeat_interval`` + +:Description: How often an Ceph OSD Daemon pings its peers (in seconds). +:Type: 32-bit Integer +:Default: ``6`` + + +``osd_heartbeat_grace`` + +:Description: The elapsed time when a Ceph OSD Daemon hasn't shown a heartbeat + that the Ceph Storage Cluster considers it ``down``. + This setting must be set in both the [mon] and [osd] or [global] + sections so that it is read by both monitor and OSD daemons. +:Type: 32-bit Integer +:Default: ``20`` + + +``osd_mon_heartbeat_interval`` + +:Description: How often the Ceph OSD Daemon pings a Ceph Monitor if it has no + Ceph OSD Daemon peers. + +:Type: 32-bit Integer +:Default: ``30`` + + +``osd_mon_heartbeat_stat_stale`` + +:Description: Stop reporting on heartbeat ping times which haven't been updated for + this many seconds. Set to zero to disable this action. + +:Type: 32-bit Integer +:Default: ``3600`` + + +``osd_mon_report_interval`` + +:Description: The number of seconds a Ceph OSD Daemon may wait + from startup or another reportable event before reporting + to a Ceph Monitor. + +:Type: 32-bit Integer +:Default: ``5`` diff --git a/doc/rados/configuration/ms-ref.rst b/doc/rados/configuration/ms-ref.rst new file mode 100644 index 000000000..113bd0913 --- /dev/null +++ b/doc/rados/configuration/ms-ref.rst @@ -0,0 +1,133 @@ +=========== + Messaging +=========== + +General Settings +================ + +``ms_tcp_nodelay`` + +:Description: Disables Nagle's algorithm on messenger TCP sessions. +:Type: Boolean +:Required: No +:Default: ``true`` + + +``ms_initial_backoff`` + +:Description: The initial time to wait before reconnecting on a fault. +:Type: Double +:Required: No +:Default: ``.2`` + + +``ms_max_backoff`` + +:Description: The maximum time to wait before reconnecting on a fault. +:Type: Double +:Required: No +:Default: ``15.0`` + + +``ms_nocrc`` + +:Description: Disables CRC on network messages. May increase performance if CPU limited. +:Type: Boolean +:Required: No +:Default: ``false`` + + +``ms_die_on_bad_msg`` + +:Description: Debug option; do not configure. +:Type: Boolean +:Required: No +:Default: ``false`` + + +``ms_dispatch_throttle_bytes`` + +:Description: Throttles total size of messages waiting to be dispatched. +:Type: 64-bit Unsigned Integer +:Required: No +:Default: ``100 << 20`` + + +``ms_bind_ipv6`` + +:Description: Enable to bind daemons to IPv6 addresses instead of IPv4. Not required if you specify a daemon or cluster IP. +:Type: Boolean +:Required: No +:Default: ``false`` + + +``ms_rwthread_stack_bytes`` + +:Description: Debug option for stack size; do not configure. +:Type: 64-bit Unsigned Integer +:Required: No +:Default: ``1024 << 10`` + + +``ms_tcp_read_timeout`` + +:Description: Controls how long (in seconds) the messenger will wait before closing an idle connection. +:Type: 64-bit Unsigned Integer +:Required: No +:Default: ``900`` + + +``ms_inject_socket_failures`` + +:Description: Debug option; do not configure. +:Type: 64-bit Unsigned Integer +:Required: No +:Default: ``0`` + +Async messenger options +======================= + + +``ms_async_transport_type`` + +:Description: Transport type used by Async Messenger. Can be ``posix``, ``dpdk`` + or ``rdma``. Posix uses standard TCP/IP networking and is default. + Other transports may be experimental and support may be limited. +:Type: String +:Required: No +:Default: ``posix`` + + +``ms_async_op_threads`` + +:Description: Initial number of worker threads used by each Async Messenger instance. + Should be at least equal to highest number of replicas, but you can + decrease it if you are low on CPU core count and/or you host a lot of + OSDs on single server. +:Type: 64-bit Unsigned Integer +:Required: No +:Default: ``3`` + + +``ms_async_max_op_threads`` + +:Description: Maximum number of worker threads used by each Async Messenger instance. + Set to lower values when your machine has limited CPU count, and increase + when your CPUs are underutilized (i. e. one or more of CPUs are + constantly on 100% load during I/O operations). +:Type: 64-bit Unsigned Integer +:Required: No +:Default: ``5`` + + +``ms_async_send_inline`` + +:Description: Send messages directly from the thread that generated them instead of + queuing and sending from Async Messenger thread. This option is known + to decrease performance on systems with a lot of CPU cores, so it's + disabled by default. +:Type: Boolean +:Required: No +:Default: ``false`` + + diff --git a/doc/rados/configuration/msgr2.rst b/doc/rados/configuration/msgr2.rst new file mode 100644 index 000000000..3415b1f5f --- /dev/null +++ b/doc/rados/configuration/msgr2.rst @@ -0,0 +1,233 @@ +.. _msgr2: + +Messenger v2 +============ + +What is it +---------- + +The messenger v2 protocol, or msgr2, is the second major revision on +Ceph's on-wire protocol. It brings with it several key features: + +* A *secure* mode that encrypts all data passing over the network +* Improved encapsulation of authentication payloads, enabling future + integration of new authentication modes like Kerberos +* Improved earlier feature advertisement and negotiation, enabling + future protocol revisions + +Ceph daemons can now bind to multiple ports, allowing both legacy Ceph +clients and new v2-capable clients to connect to the same cluster. + +By default, monitors now bind to the new IANA-assigned port ``3300`` +(ce4h or 0xce4) for the new v2 protocol, while also binding to the +old default port ``6789`` for the legacy v1 protocol. + +.. _address_formats: + +Address formats +--------------- + +Prior to Nautilus, all network addresses were rendered like +``1.2.3.4:567/89012`` where there was an IP address, a port, and a +nonce to uniquely identify a client or daemon on the network. +Starting with Nautilus, we now have three different address types: + +* **v2**: ``v2:1.2.3.4:578/89012`` identifies a daemon binding to a + port speaking the new v2 protocol +* **v1**: ``v1:1.2.3.4:578/89012`` identifies a daemon binding to a + port speaking the legacy v1 protocol. Any address that was + previously shown with any prefix is now shown as a ``v1:`` address. +* **TYPE_ANY** ``any:1.2.3.4:578/89012`` identifies a client that can + speak either version of the protocol. Prior to nautilus, clients would appear as + ``1.2.3.4:0/123456``, where the port of 0 indicates they are clients + and do not accept incoming connections. Starting with Nautilus, + these clients are now internally represented by a **TYPE_ANY** + address, and still shown with no prefix, because they may + connect to daemons using the v2 or v1 protocol, depending on what + protocol(s) the daemons are using. + +Because daemons now bind to multiple ports, they are now described by +a vector of addresses instead of a single address. For example, +dumping the monitor map on a Nautilus cluster now includes lines +like:: + + epoch 1 + fsid 50fcf227-be32-4bcb-8b41-34ca8370bd16 + last_changed 2019-02-25 11:10:46.700821 + created 2019-02-25 11:10:46.700821 + min_mon_release 14 (nautilus) + 0: [v2:10.0.0.10:3300/0,v1:10.0.0.10:6789/0] mon.foo + 1: [v2:10.0.0.11:3300/0,v1:10.0.0.11:6789/0] mon.bar + 2: [v2:10.0.0.12:3300/0,v1:10.0.0.12:6789/0] mon.baz + +The bracketed list or vector of addresses means that the same daemon can be +reached on multiple ports (and protocols). Any client or other daemon +connecting to that daemon will use the v2 protocol (listed first) if +possible; otherwise it will back to the legacy v1 protocol. Legacy +clients will only see the v1 addresses and will continue to connect as +they did before, with the v1 protocol. + +Starting in Nautilus, the ``mon_host`` configuration option and ``-m +<mon-host>`` command line options support the same bracketed address +vector syntax. + + +Bind configuration options +^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Two new configuration options control whether the v1 and/or v2 +protocol is used: + + * ``ms_bind_msgr1`` [default: true] controls whether a daemon binds + to a port speaking the v1 protocol + * ``ms_bind_msgr2`` [default: true] controls whether a daemon binds + to a port speaking the v2 protocol + +Similarly, two options control whether IPv4 and IPv6 addresses are used: + + * ``ms_bind_ipv4`` [default: true] controls whether a daemon binds + to an IPv4 address + * ``ms_bind_ipv6`` [default: false] controls whether a daemon binds + to an IPv6 address + +.. note:: The ability to bind to multiple ports has paved the way for + dual-stack IPv4 and IPv6 support. That said, dual-stack support is + not yet tested as of Nautilus v14.2.0 and likely needs some + additional code changes to work correctly. + +Connection modes +---------------- + +The v2 protocol supports two connection modes: + +* *crc* mode provides: + + - a strong initial authentication when the connection is established + (with cephx, mutual authentication of both parties with protection + from a man-in-the-middle or eavesdropper), and + - a crc32c integrity check to protect against bit flips due to flaky + hardware or cosmic rays + + *crc* mode does *not* provide: + + - secrecy (an eavesdropper on the network can see all + post-authentication traffic as it goes by) or + - protection from a malicious man-in-the-middle (who can deliberate + modify traffic as it goes by, as long as they are careful to + adjust the crc32c values to match) + +* *secure* mode provides: + + - a strong initial authentication when the connection is established + (with cephx, mutual authentication of both parties with protection + from a man-in-the-middle or eavesdropper), and + - full encryption of all post-authentication traffic, including a + cryptographic integrity check. + + In Nautilus, secure mode uses the `AES-GCM + <https://en.wikipedia.org/wiki/Galois/Counter_Mode>`_ stream cipher, + which is generally very fast on modern processors (e.g., faster than + a SHA-256 cryptographic hash). + +Connection mode configuration options +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +For most connections, there are options that control which modes are used: + +* ``ms_cluster_mode`` is the connection mode (or permitted modes) used + for intra-cluster communication between Ceph daemons. If multiple + modes are listed, the modes listed first are preferred. +* ``ms_service_mode`` is a list of permitted modes for clients to use + when connecting to the cluster. +* ``ms_client_mode`` is a list of connection modes, in order of + preference, for clients to use (or allow) when talking to a Ceph + cluster. + +There are a parallel set of options that apply specifically to +monitors, allowing administrators to set different (usually more +secure) requirements on communication with the monitors. + +* ``ms_mon_cluster_mode`` is the connection mode (or permitted modes) + to use between monitors. +* ``ms_mon_service_mode`` is a list of permitted modes for clients or + other Ceph daemons to use when connecting to monitors. +* ``ms_mon_client_mode`` is a list of connection modes, in order of + preference, for clients or non-monitor daemons to use when + connecting to monitors. + + +Transitioning from v1-only to v2-plus-v1 +---------------------------------------- + +By default, ``ms_bind_msgr2`` is true starting with Nautilus 14.2.z. +However, until the monitors start using v2, only limited services will +start advertising v2 addresses. + +For most users, the monitors are binding to the default legacy port ``6789`` +for the v1 protocol. When this is the case, enabling v2 is as simple as: + +.. prompt:: bash $ + + ceph mon enable-msgr2 + +If the monitors are bound to non-standard ports, you will need to +specify an additional port for v2 explicitly. For example, if your +monitor ``mon.a`` binds to ``1.2.3.4:1111``, and you want to add v2 on +port ``1112``: + +.. prompt:: bash $ + + ceph mon set-addrs a [v2:1.2.3.4:1112,v1:1.2.3.4:1111] + +Once the monitors bind to v2, each daemon will start advertising a v2 +address when it is next restarted. + + +.. _msgr2_ceph_conf: + +Updating ceph.conf and mon_host +------------------------------- + +Prior to Nautilus, a CLI user or daemon will normally discover the +monitors via the ``mon_host`` option in ``/etc/ceph/ceph.conf``. The +syntax for this option has expanded starting with Nautilus to allow +support the new bracketed list format. For example, an old line +like:: + + mon_host = 10.0.0.1:6789,10.0.0.2:6789,10.0.0.3:6789 + +Can be changed to:: + + mon_host = [v2:10.0.0.1:3300/0,v1:10.0.0.1:6789/0],[v2:10.0.0.2:3300/0,v1:10.0.0.2:6789/0],[v2:10.0.0.3:3300/0,v1:10.0.0.3:6789/0] + +However, when default ports are used (``3300`` and ``6789``), they can +be omitted:: + + mon_host = 10.0.0.1,10.0.0.2,10.0.0.3 + +Once v2 has been enabled on the monitors, ``ceph.conf`` may need to be +updated to either specify no ports (this is usually simplest), or +explicitly specify both the v2 and v1 addresses. Note, however, that +the new bracketed syntax is only understood by Nautilus and later, so +do not make that change on hosts that have not yet had their ceph +packages upgraded. + +When you are updating ``ceph.conf``, note the new ``ceph config +generate-minimal-conf`` command (which generates a barebones config +file with just enough information to reach the monitors) and the +``ceph config assimilate-conf`` (which moves config file options into +the monitors' configuration database) may be helpful. For example,:: + + # ceph config assimilate-conf < /etc/ceph/ceph.conf + # ceph config generate-minimal-config > /etc/ceph/ceph.conf.new + # cat /etc/ceph/ceph.conf.new + # minimal ceph.conf for 0e5a806b-0ce5-4bc6-b949-aa6f68f5c2a3 + [global] + fsid = 0e5a806b-0ce5-4bc6-b949-aa6f68f5c2a3 + mon_host = [v2:10.0.0.1:3300/0,v1:10.0.0.1:6789/0] + # mv /etc/ceph/ceph.conf.new /etc/ceph/ceph.conf + +Protocol +-------- + +For a detailed description of the v2 wire protocol, see :ref:`msgr2-protocol`. diff --git a/doc/rados/configuration/network-config-ref.rst b/doc/rados/configuration/network-config-ref.rst new file mode 100644 index 000000000..97229a401 --- /dev/null +++ b/doc/rados/configuration/network-config-ref.rst @@ -0,0 +1,454 @@ +================================= + Network Configuration Reference +================================= + +Network configuration is critical for building a high performance :term:`Ceph +Storage Cluster`. The Ceph Storage Cluster does not perform request routing or +dispatching on behalf of the :term:`Ceph Client`. Instead, Ceph Clients make +requests directly to Ceph OSD Daemons. Ceph OSD Daemons perform data replication +on behalf of Ceph Clients, which means replication and other factors impose +additional loads on Ceph Storage Cluster networks. + +Our Quick Start configurations provide a trivial Ceph configuration file that +sets monitor IP addresses and daemon host names only. Unless you specify a +cluster network, Ceph assumes a single "public" network. Ceph functions just +fine with a public network only, but you may see significant performance +improvement with a second "cluster" network in a large cluster. + +It is possible to run a Ceph Storage Cluster with two networks: a public +(client, front-side) network and a cluster (private, replication, back-side) +network. However, this approach +complicates network configuration (both hardware and software) and does not usually +have a significant impact on overall performance. For this reason, we recommend +that for resilience and capacity dual-NIC systems either active/active bond +these interfaces or implemebnt a layer 3 multipath strategy with eg. FRR. + +If, despite the complexity, one still wishes to use two networks, each +:term:`Ceph Node` will need to have more than one network interface or VLAN. See `Hardware +Recommendations - Networks`_ for additional details. + +.. ditaa:: + +-------------+ + | Ceph Client | + +----*--*-----+ + | ^ + Request | : Response + v | + /----------------------------------*--*-------------------------------------\ + | Public Network | + \---*--*------------*--*-------------*--*------------*--*------------*--*---/ + ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ + | | | | | | | | | | + | : | : | : | : | : + v v v v v v v v v v + +---*--*---+ +---*--*---+ +---*--*---+ +---*--*---+ +---*--*---+ + | Ceph MON | | Ceph MDS | | Ceph OSD | | Ceph OSD | | Ceph OSD | + +----------+ +----------+ +---*--*---+ +---*--*---+ +---*--*---+ + ^ ^ ^ ^ ^ ^ + The cluster network relieves | | | | | | + OSD replication and heartbeat | : | : | : + traffic from the public network. v v v v v v + /------------------------------------*--*------------*--*------------*--*---\ + | cCCC Cluster Network | + \---------------------------------------------------------------------------/ + + +IP Tables +========= + +By default, daemons `bind`_ to ports within the ``6800:7300`` range. You may +configure this range at your discretion. Before configuring your IP tables, +check the default ``iptables`` configuration. + +.. prompt:: bash $ + + sudo iptables -L + +Some Linux distributions include rules that reject all inbound requests +except SSH from all network interfaces. For example:: + + REJECT all -- anywhere anywhere reject-with icmp-host-prohibited + +You will need to delete these rules on both your public and cluster networks +initially, and replace them with appropriate rules when you are ready to +harden the ports on your Ceph Nodes. + + +Monitor IP Tables +----------------- + +Ceph Monitors listen on ports ``3300`` and ``6789`` by +default. Additionally, Ceph Monitors always operate on the public +network. When you add the rule using the example below, make sure you +replace ``{iface}`` with the public network interface (e.g., ``eth0``, +``eth1``, etc.), ``{ip-address}`` with the IP address of the public +network and ``{netmask}`` with the netmask for the public network. : + +.. prompt:: bash $ + + sudo iptables -A INPUT -i {iface} -p tcp -s {ip-address}/{netmask} --dport 6789 -j ACCEPT + + +MDS and Manager IP Tables +------------------------- + +A :term:`Ceph Metadata Server` or :term:`Ceph Manager` listens on the first +available port on the public network beginning at port 6800. Note that this +behavior is not deterministic, so if you are running more than one OSD or MDS +on the same host, or if you restart the daemons within a short window of time, +the daemons will bind to higher ports. You should open the entire 6800-7300 +range by default. When you add the rule using the example below, make sure +you replace ``{iface}`` with the public network interface (e.g., ``eth0``, +``eth1``, etc.), ``{ip-address}`` with the IP address of the public network +and ``{netmask}`` with the netmask of the public network. + +For example: + +.. prompt:: bash $ + + sudo iptables -A INPUT -i {iface} -m multiport -p tcp -s {ip-address}/{netmask} --dports 6800:7300 -j ACCEPT + + +OSD IP Tables +------------- + +By default, Ceph OSD Daemons `bind`_ to the first available ports on a Ceph Node +beginning at port 6800. Note that this behavior is not deterministic, so if you +are running more than one OSD or MDS on the same host, or if you restart the +daemons within a short window of time, the daemons will bind to higher ports. +Each Ceph OSD Daemon on a Ceph Node may use up to four ports: + +#. One for talking to clients and monitors. +#. One for sending data to other OSDs. +#. Two for heartbeating on each interface. + +.. ditaa:: + /---------------\ + | OSD | + | +---+----------------+-----------+ + | | Clients & Monitors | Heartbeat | + | +---+----------------+-----------+ + | | + | +---+----------------+-----------+ + | | Data Replication | Heartbeat | + | +---+----------------+-----------+ + | cCCC | + \---------------/ + +When a daemon fails and restarts without letting go of the port, the restarted +daemon will bind to a new port. You should open the entire 6800-7300 port range +to handle this possibility. + +If you set up separate public and cluster networks, you must add rules for both +the public network and the cluster network, because clients will connect using +the public network and other Ceph OSD Daemons will connect using the cluster +network. When you add the rule using the example below, make sure you replace +``{iface}`` with the network interface (e.g., ``eth0``, ``eth1``, etc.), +``{ip-address}`` with the IP address and ``{netmask}`` with the netmask of the +public or cluster network. For example: + +.. prompt:: bash $ + + sudo iptables -A INPUT -i {iface} -m multiport -p tcp -s {ip-address}/{netmask} --dports 6800:7300 -j ACCEPT + +.. tip:: If you run Ceph Metadata Servers on the same Ceph Node as the + Ceph OSD Daemons, you can consolidate the public network configuration step. + + +Ceph Networks +============= + +To configure Ceph networks, you must add a network configuration to the +``[global]`` section of the configuration file. Our 5-minute Quick Start +provides a trivial Ceph configuration file that assumes one public network +with client and server on the same network and subnet. Ceph functions just fine +with a public network only. However, Ceph allows you to establish much more +specific criteria, including multiple IP network and subnet masks for your +public network. You can also establish a separate cluster network to handle OSD +heartbeat, object replication and recovery traffic. Don't confuse the IP +addresses you set in your configuration with the public-facing IP addresses +network clients may use to access your service. Typical internal IP networks are +often ``192.168.0.0`` or ``10.0.0.0``. + +.. tip:: If you specify more than one IP address and subnet mask for + either the public or the cluster network, the subnets within the network + must be capable of routing to each other. Additionally, make sure you + include each IP address/subnet in your IP tables and open ports for them + as necessary. + +.. note:: Ceph uses `CIDR`_ notation for subnets (e.g., ``10.0.0.0/24``). + +When you have configured your networks, you may restart your cluster or restart +each daemon. Ceph daemons bind dynamically, so you do not have to restart the +entire cluster at once if you change your network configuration. + + +Public Network +-------------- + +To configure a public network, add the following option to the ``[global]`` +section of your Ceph configuration file. + +.. code-block:: ini + + [global] + # ... elided configuration + public_network = {public-network/netmask} + +.. _cluster-network: + +Cluster Network +--------------- + +If you declare a cluster network, OSDs will route heartbeat, object replication +and recovery traffic over the cluster network. This may improve performance +compared to using a single network. To configure a cluster network, add the +following option to the ``[global]`` section of your Ceph configuration file. + +.. code-block:: ini + + [global] + # ... elided configuration + cluster_network = {cluster-network/netmask} + +We prefer that the cluster network is **NOT** reachable from the public network +or the Internet for added security. + +IPv4/IPv6 Dual Stack Mode +------------------------- + +If you want to run in an IPv4/IPv6 dual stack mode and want to define your public and/or +cluster networks, then you need to specify both your IPv4 and IPv6 networks for each: + +.. code-block:: ini + + [global] + # ... elided configuration + public_network = {IPv4 public-network/netmask}, {IPv6 public-network/netmask} + +This is so that Ceph can find a valid IP address for both address families. + +If you want just an IPv4 or an IPv6 stack environment, then make sure you set the `ms bind` +options correctly. + +.. note:: + Binding to IPv4 is enabled by default, so if you just add the option to bind to IPv6 + you'll actually put yourself into dual stack mode. If you want just IPv6, then disable IPv4 and + enable IPv6. See `Bind`_ below. + +Ceph Daemons +============ + +Monitor daemons are each configured to bind to a specific IP address. These +addresses are normally configured by your deployment tool. Other components +in the Ceph cluster discover the monitors via the ``mon host`` configuration +option, normally specified in the ``[global]`` section of the ``ceph.conf`` file. + +.. code-block:: ini + + [global] + mon_host = 10.0.0.2, 10.0.0.3, 10.0.0.4 + +The ``mon_host`` value can be a list of IP addresses or a name that is +looked up via DNS. In the case of a DNS name with multiple A or AAAA +records, all records are probed in order to discover a monitor. Once +one monitor is reached, all other current monitors are discovered, so +the ``mon host`` configuration option only needs to be sufficiently up +to date such that a client can reach one monitor that is currently online. + +The MGR, OSD, and MDS daemons will bind to any available address and +do not require any special configuration. However, it is possible to +specify a specific IP address for them to bind to with the ``public +addr`` (and/or, in the case of OSD daemons, the ``cluster addr``) +configuration option. For example, + +.. code-block:: ini + + [osd.0] + public addr = {host-public-ip-address} + cluster addr = {host-cluster-ip-address} + +.. topic:: One NIC OSD in a Two Network Cluster + + Generally, we do not recommend deploying an OSD host with a single network interface in a + cluster with two networks. However, you may accomplish this by forcing the + OSD host to operate on the public network by adding a ``public_addr`` entry + to the ``[osd.n]`` section of the Ceph configuration file, where ``n`` + refers to the ID of the OSD with one network interface. Additionally, the public + network and cluster network must be able to route traffic to each other, + which we don't recommend for security reasons. + + +Network Config Settings +======================= + +Network configuration settings are not required. Ceph assumes a public network +with all hosts operating on it unless you specifically configure a cluster +network. + + +Public Network +-------------- + +The public network configuration allows you specifically define IP addresses +and subnets for the public network. You may specifically assign static IP +addresses or override ``public_network`` settings using the ``public_addr`` +setting for a specific daemon. + +``public_network`` + +:Description: The IP address and netmask of the public (front-side) network + (e.g., ``192.168.0.0/24``). Set in ``[global]``. You may specify + comma-separated subnets. + +:Type: ``{ip-address}/{netmask} [, {ip-address}/{netmask}]`` +:Required: No +:Default: N/A + + +``public_addr`` + +:Description: The IP address for the public (front-side) network. + Set for each daemon. + +:Type: IP Address +:Required: No +:Default: N/A + + + +Cluster Network +--------------- + +The cluster network configuration allows you to declare a cluster network, and +specifically define IP addresses and subnets for the cluster network. You may +specifically assign static IP addresses or override ``cluster_network`` +settings using the ``cluster_addr`` setting for specific OSD daemons. + + +``cluster_network`` + +:Description: The IP address and netmask of the cluster (back-side) network + (e.g., ``10.0.0.0/24``). Set in ``[global]``. You may specify + comma-separated subnets. + +:Type: ``{ip-address}/{netmask} [, {ip-address}/{netmask}]`` +:Required: No +:Default: N/A + + +``cluster_addr`` + +:Description: The IP address for the cluster (back-side) network. + Set for each daemon. + +:Type: Address +:Required: No +:Default: N/A + + +Bind +---- + +Bind settings set the default port ranges Ceph OSD and MDS daemons use. The +default range is ``6800:7300``. Ensure that your `IP Tables`_ configuration +allows you to use the configured port range. + +You may also enable Ceph daemons to bind to IPv6 addresses instead of IPv4 +addresses. + + +``ms_bind_port_min`` + +:Description: The minimum port number to which an OSD or MDS daemon will bind. +:Type: 32-bit Integer +:Default: ``6800`` +:Required: No + + +``ms_bind_port_max`` + +:Description: The maximum port number to which an OSD or MDS daemon will bind. +:Type: 32-bit Integer +:Default: ``7300`` +:Required: No. + +``ms_bind_ipv4`` + +:Description: Enables Ceph daemons to bind to IPv4 addresses. +:Type: Boolean +:Default: ``true`` +:Required: No + +``ms_bind_ipv6`` + +:Description: Enables Ceph daemons to bind to IPv6 addresses. +:Type: Boolean +:Default: ``false`` +:Required: No + +``public_bind_addr`` + +:Description: In some dynamic deployments the Ceph MON daemon might bind + to an IP address locally that is different from the ``public_addr`` + advertised to other peers in the network. The environment must ensure + that routing rules are set correctly. If ``public_bind_addr`` is set + the Ceph Monitor daemon will bind to it locally and use ``public_addr`` + in the monmaps to advertise its address to peers. This behavior is limited + to the Monitor daemon. + +:Type: IP Address +:Required: No +:Default: N/A + + + +TCP +--- + +Ceph disables TCP buffering by default. + + +``ms_tcp_nodelay`` + +:Description: Ceph enables ``ms_tcp_nodelay`` so that each request is sent + immediately (no buffering). Disabling `Nagle's algorithm`_ + increases network traffic, which can introduce latency. If you + experience large numbers of small packets, you may try + disabling ``ms_tcp_nodelay``. + +:Type: Boolean +:Required: No +:Default: ``true`` + + +``ms_tcp_rcvbuf`` + +:Description: The size of the socket buffer on the receiving end of a network + connection. Disable by default. + +:Type: 32-bit Integer +:Required: No +:Default: ``0`` + + +``ms_tcp_read_timeout`` + +:Description: If a client or daemon makes a request to another Ceph daemon and + does not drop an unused connection, the ``ms tcp read timeout`` + defines the connection as idle after the specified number + of seconds. + +:Type: Unsigned 64-bit Integer +:Required: No +:Default: ``900`` 15 minutes. + + + +.. _Scalability and High Availability: ../../../architecture#scalability-and-high-availability +.. _Hardware Recommendations - Networks: ../../../start/hardware-recommendations#networks +.. _hardware recommendations: ../../../start/hardware-recommendations +.. _Monitor / OSD Interaction: ../mon-osd-interaction +.. _Message Signatures: ../auth-config-ref#signatures +.. _CIDR: https://en.wikipedia.org/wiki/Classless_Inter-Domain_Routing +.. _Nagle's Algorithm: https://en.wikipedia.org/wiki/Nagle's_algorithm diff --git a/doc/rados/configuration/osd-config-ref.rst b/doc/rados/configuration/osd-config-ref.rst new file mode 100644 index 000000000..7f69b5e80 --- /dev/null +++ b/doc/rados/configuration/osd-config-ref.rst @@ -0,0 +1,1127 @@ +====================== + OSD Config Reference +====================== + +.. index:: OSD; configuration + +You can configure Ceph OSD Daemons in the Ceph configuration file (or in recent +releases, the central config store), but Ceph OSD +Daemons can use the default values and a very minimal configuration. A minimal +Ceph OSD Daemon configuration sets ``osd journal size`` (for Filestore), ``host``, and +uses default values for nearly everything else. + +Ceph OSD Daemons are numerically identified in incremental fashion, beginning +with ``0`` using the following convention. :: + + osd.0 + osd.1 + osd.2 + +In a configuration file, you may specify settings for all Ceph OSD Daemons in +the cluster by adding configuration settings to the ``[osd]`` section of your +configuration file. To add settings directly to a specific Ceph OSD Daemon +(e.g., ``host``), enter it in an OSD-specific section of your configuration +file. For example: + +.. code-block:: ini + + [osd] + osd_journal_size = 5120 + + [osd.0] + host = osd-host-a + + [osd.1] + host = osd-host-b + + +.. index:: OSD; config settings + +General Settings +================ + +The following settings provide a Ceph OSD Daemon's ID, and determine paths to +data and journals. Ceph deployment scripts typically generate the UUID +automatically. + +.. warning:: **DO NOT** change the default paths for data or journals, as it + makes it more problematic to troubleshoot Ceph later. + +When using Filestore, the journal size should be at least twice the product of the expected drive +speed multiplied by ``filestore_max_sync_interval``. However, the most common +practice is to partition the journal drive (often an SSD), and mount it such +that Ceph uses the entire partition for the journal. + + +``osd_uuid`` + +:Description: The universally unique identifier (UUID) for the Ceph OSD Daemon. +:Type: UUID +:Default: The UUID. +:Note: The ``osd_uuid`` applies to a single Ceph OSD Daemon. The ``fsid`` + applies to the entire cluster. + + +``osd_data`` + +:Description: The path to the OSDs data. You must create the directory when + deploying Ceph. You should mount a drive for OSD data at this + mount point. We do not recommend changing the default. + +:Type: String +:Default: ``/var/lib/ceph/osd/$cluster-$id`` + + +``osd_max_write_size`` + +:Description: The maximum size of a write in megabytes. +:Type: 32-bit Integer +:Default: ``90`` + + +``osd_max_object_size`` + +:Description: The maximum size of a RADOS object in bytes. +:Type: 32-bit Unsigned Integer +:Default: 128MB + + +``osd_client_message_size_cap`` + +:Description: The largest client data message allowed in memory. +:Type: 64-bit Unsigned Integer +:Default: 500MB default. ``500*1024L*1024L`` + + +``osd_class_dir`` + +:Description: The class path for RADOS class plug-ins. +:Type: String +:Default: ``$libdir/rados-classes`` + + +.. index:: OSD; file system + +File System Settings +==================== +Ceph builds and mounts file systems which are used for Ceph OSDs. + +``osd_mkfs_options {fs-type}`` + +:Description: Options used when creating a new Ceph Filestore OSD of type {fs-type}. + +:Type: String +:Default for xfs: ``-f -i 2048`` +:Default for other file systems: {empty string} + +For example:: + ``osd_mkfs_options_xfs = -f -d agcount=24`` + +``osd_mount_options {fs-type}`` + +:Description: Options used when mounting a Ceph Filestore OSD of type {fs-type}. + +:Type: String +:Default for xfs: ``rw,noatime,inode64`` +:Default for other file systems: ``rw, noatime`` + +For example:: + ``osd_mount_options_xfs = rw, noatime, inode64, logbufs=8`` + + +.. index:: OSD; journal settings + +Journal Settings +================ + +This section applies only to the older Filestore OSD back end. Since Luminous +BlueStore has been default and preferred. + +By default, Ceph expects that you will provision a Ceph OSD Daemon's journal at +the following path, which is usually a symlink to a device or partition:: + + /var/lib/ceph/osd/$cluster-$id/journal + +When using a single device type (for example, spinning drives), the journals +should be *colocated*: the logical volume (or partition) should be in the same +device as the ``data`` logical volume. + +When using a mix of fast (SSDs, NVMe) devices with slower ones (like spinning +drives) it makes sense to place the journal on the faster device, while +``data`` occupies the slower device fully. + +The default ``osd_journal_size`` value is 5120 (5 gigabytes), but it can be +larger, in which case it will need to be set in the ``ceph.conf`` file. +A value of 10 gigabytes is common in practice:: + + osd_journal_size = 10240 + + +``osd_journal`` + +:Description: The path to the OSD's journal. This may be a path to a file or a + block device (such as a partition of an SSD). If it is a file, + you must create the directory to contain it. We recommend using a + separate fast device when the ``osd_data`` drive is an HDD. + +:Type: String +:Default: ``/var/lib/ceph/osd/$cluster-$id/journal`` + + +``osd_journal_size`` + +:Description: The size of the journal in megabytes. + +:Type: 32-bit Integer +:Default: ``5120`` + + +See `Journal Config Reference`_ for additional details. + + +Monitor OSD Interaction +======================= + +Ceph OSD Daemons check each other's heartbeats and report to monitors +periodically. Ceph can use default values in many cases. However, if your +network has latency issues, you may need to adopt longer intervals. See +`Configuring Monitor/OSD Interaction`_ for a detailed discussion of heartbeats. + + +Data Placement +============== + +See `Pool & PG Config Reference`_ for details. + + +.. index:: OSD; scrubbing + +Scrubbing +========= + +In addition to making multiple copies of objects, Ceph ensures data integrity by +scrubbing placement groups. Ceph scrubbing is analogous to ``fsck`` on the +object storage layer. For each placement group, Ceph generates a catalog of all +objects and compares each primary object and its replicas to ensure that no +objects are missing or mismatched. Light scrubbing (daily) checks the object +size and attributes. Deep scrubbing (weekly) reads the data and uses checksums +to ensure data integrity. + +Scrubbing is important for maintaining data integrity, but it can reduce +performance. You can adjust the following settings to increase or decrease +scrubbing operations. + + +``osd_max_scrubs`` + +:Description: The maximum number of simultaneous scrub operations for + a Ceph OSD Daemon. + +:Type: 32-bit Int +:Default: ``1`` + +``osd_scrub_begin_hour`` + +:Description: This restricts scrubbing to this hour of the day or later. + Use ``osd_scrub_begin_hour = 0`` and ``osd_scrub_end_hour = 0`` + to allow scrubbing the entire day. Along with ``osd_scrub_end_hour``, they define a time + window, in which the scrubs can happen. + But a scrub will be performed + no matter whether the time window allows or not, as long as the placement + group's scrub interval exceeds ``osd_scrub_max_interval``. +:Type: Integer in the range of 0 to 23 +:Default: ``0`` + + +``osd_scrub_end_hour`` + +:Description: This restricts scrubbing to the hour earlier than this. + Use ``osd_scrub_begin_hour = 0`` and ``osd_scrub_end_hour = 0`` to allow scrubbing + for the entire day. Along with ``osd_scrub_begin_hour``, they define a time + window, in which the scrubs can happen. But a scrub will be performed + no matter whether the time window allows or not, as long as the placement + group's scrub interval exceeds ``osd_scrub_max_interval``. +:Type: Integer in the range of 0 to 23 +:Default: ``0`` + + +``osd_scrub_begin_week_day`` + +:Description: This restricts scrubbing to this day of the week or later. + 0 = Sunday, 1 = Monday, etc. Use ``osd_scrub_begin_week_day = 0`` + and ``osd_scrub_end_week_day = 0`` to allow scrubbing for the entire week. + Along with ``osd_scrub_end_week_day``, they define a time window in which + scrubs can happen. But a scrub will be performed + no matter whether the time window allows or not, when the PG's + scrub interval exceeds ``osd_scrub_max_interval``. +:Type: Integer in the range of 0 to 6 +:Default: ``0`` + + +``osd_scrub_end_week_day`` + +:Description: This restricts scrubbing to days of the week earlier than this. + 0 = Sunday, 1 = Monday, etc. Use ``osd_scrub_begin_week_day = 0`` + and ``osd_scrub_end_week_day = 0`` to allow scrubbing for the entire week. + Along with ``osd_scrub_begin_week_day``, they define a time + window, in which the scrubs can happen. But a scrub will be performed + no matter whether the time window allows or not, as long as the placement + group's scrub interval exceeds ``osd_scrub_max_interval``. +:Type: Integer in the range of 0 to 6 +:Default: ``0`` + + +``osd scrub during recovery`` + +:Description: Allow scrub during recovery. Setting this to ``false`` will disable + scheduling new scrub (and deep--scrub) while there is active recovery. + Already running scrubs will be continued. This might be useful to reduce + load on busy clusters. +:Type: Boolean +:Default: ``false`` + + +``osd_scrub_thread_timeout`` + +:Description: The maximum time in seconds before timing out a scrub thread. +:Type: 32-bit Integer +:Default: ``60`` + + +``osd_scrub_finalize_thread_timeout`` + +:Description: The maximum time in seconds before timing out a scrub finalize + thread. + +:Type: 32-bit Integer +:Default: ``10*60`` + + +``osd_scrub_load_threshold`` + +:Description: The normalized maximum load. Ceph will not scrub when the system load + (as defined by ``getloadavg() / number of online CPUs``) is higher than this number. + Default is ``0.5``. + +:Type: Float +:Default: ``0.5`` + + +``osd_scrub_min_interval`` + +:Description: The minimal interval in seconds for scrubbing the Ceph OSD Daemon + when the Ceph Storage Cluster load is low. + +:Type: Float +:Default: Once per day. ``24*60*60`` + +.. _osd_scrub_max_interval: + +``osd_scrub_max_interval`` + +:Description: The maximum interval in seconds for scrubbing the Ceph OSD Daemon + irrespective of cluster load. + +:Type: Float +:Default: Once per week. ``7*24*60*60`` + + +``osd_scrub_chunk_min`` + +:Description: The minimal number of object store chunks to scrub during single operation. + Ceph blocks writes to single chunk during scrub. + +:Type: 32-bit Integer +:Default: 5 + + +``osd_scrub_chunk_max`` + +:Description: The maximum number of object store chunks to scrub during single operation. + +:Type: 32-bit Integer +:Default: 25 + + +``osd_scrub_sleep`` + +:Description: Time to sleep before scrubbing the next group of chunks. Increasing this value will slow + down the overall rate of scrubbing so that client operations will be less impacted. + +:Type: Float +:Default: 0 + + +``osd_deep_scrub_interval`` + +:Description: The interval for "deep" scrubbing (fully reading all data). The + ``osd_scrub_load_threshold`` does not affect this setting. + +:Type: Float +:Default: Once per week. ``7*24*60*60`` + + +``osd_scrub_interval_randomize_ratio`` + +:Description: Add a random delay to ``osd_scrub_min_interval`` when scheduling + the next scrub job for a PG. The delay is a random + value less than ``osd_scrub_min_interval`` \* + ``osd_scrub_interval_randomized_ratio``. The default setting + spreads scrubs throughout the allowed time + window of ``[1, 1.5]`` \* ``osd_scrub_min_interval``. +:Type: Float +:Default: ``0.5`` + +``osd_deep_scrub_stride`` + +:Description: Read size when doing a deep scrub. +:Type: 32-bit Integer +:Default: 512 KB. ``524288`` + + +``osd_scrub_auto_repair`` + +:Description: Setting this to ``true`` will enable automatic PG repair when errors + are found by scrubs or deep-scrubs. However, if more than + ``osd_scrub_auto_repair_num_errors`` errors are found a repair is NOT performed. +:Type: Boolean +:Default: ``false`` + + +``osd_scrub_auto_repair_num_errors`` + +:Description: Auto repair will not occur if more than this many errors are found. +:Type: 32-bit Integer +:Default: ``5`` + + +.. index:: OSD; operations settings + +Operations +========== + + ``osd_op_queue`` + +:Description: This sets the type of queue to be used for prioritizing ops + within each OSD. Both queues feature a strict sub-queue which is + dequeued before the normal queue. The normal queue is different + between implementations. The WeightedPriorityQueue (``wpq``) + dequeues operations in relation to their priorities to prevent + starvation of any queue. WPQ should help in cases where a few OSDs + are more overloaded than others. The new mClockQueue + (``mclock_scheduler``) prioritizes operations based on which class + they belong to (recovery, scrub, snaptrim, client op, osd subop). + See `QoS Based on mClock`_. Requires a restart. + +:Type: String +:Valid Choices: wpq, mclock_scheduler +:Default: ``wpq`` + + +``osd_op_queue_cut_off`` + +:Description: This selects which priority ops will be sent to the strict + queue verses the normal queue. The ``low`` setting sends all + replication ops and higher to the strict queue, while the ``high`` + option sends only replication acknowledgment ops and higher to + the strict queue. Setting this to ``high`` should help when a few + OSDs in the cluster are very busy especially when combined with + ``wpq`` in the ``osd_op_queue`` setting. OSDs that are very busy + handling replication traffic could starve primary client traffic + on these OSDs without these settings. Requires a restart. + +:Type: String +:Valid Choices: low, high +:Default: ``high`` + + +``osd_client_op_priority`` + +:Description: The priority set for client operations. This value is relative + to that of ``osd_recovery_op_priority`` below. The default + strongly favors client ops over recovery. + +:Type: 32-bit Integer +:Default: ``63`` +:Valid Range: 1-63 + + +``osd_recovery_op_priority`` + +:Description: The priority of recovery operations vs client operations, if not specified by the + pool's ``recovery_op_priority``. The default value prioritizes client + ops (see above) over recovery ops. You may adjust the tradeoff of client + impact against the time to restore cluster health by lowering this value + for increased prioritization of client ops, or by increasing it to favor + recovery. + +:Type: 32-bit Integer +:Default: ``3`` +:Valid Range: 1-63 + + +``osd_scrub_priority`` + +:Description: The default work queue priority for scheduled scrubs when the + pool doesn't specify a value of ``scrub_priority``. This can be + boosted to the value of ``osd_client_op_priority`` when scrubs are + blocking client operations. + +:Type: 32-bit Integer +:Default: ``5`` +:Valid Range: 1-63 + + +``osd_requested_scrub_priority`` + +:Description: The priority set for user requested scrub on the work queue. If + this value were to be smaller than ``osd_client_op_priority`` it + can be boosted to the value of ``osd_client_op_priority`` when + scrub is blocking client operations. + +:Type: 32-bit Integer +:Default: ``120`` + + +``osd_snap_trim_priority`` + +:Description: The priority set for the snap trim work queue. + +:Type: 32-bit Integer +:Default: ``5`` +:Valid Range: 1-63 + +``osd_snap_trim_sleep`` + +:Description: Time in seconds to sleep before next snap trim op. + Increasing this value will slow down snap trimming. + This option overrides backend specific variants. + +:Type: Float +:Default: ``0`` + + +``osd_snap_trim_sleep_hdd`` + +:Description: Time in seconds to sleep before next snap trim op + for HDDs. + +:Type: Float +:Default: ``5`` + + +``osd_snap_trim_sleep_ssd`` + +:Description: Time in seconds to sleep before next snap trim op + for SSD OSDs (including NVMe). + +:Type: Float +:Default: ``0`` + + +``osd_snap_trim_sleep_hybrid`` + +:Description: Time in seconds to sleep before next snap trim op + when OSD data is on an HDD and the OSD journal or WAL+DB is on an SSD. + +:Type: Float +:Default: ``2`` + +``osd_op_thread_timeout`` + +:Description: The Ceph OSD Daemon operation thread timeout in seconds. +:Type: 32-bit Integer +:Default: ``15`` + + +``osd_op_complaint_time`` + +:Description: An operation becomes complaint worthy after the specified number + of seconds have elapsed. + +:Type: Float +:Default: ``30`` + + +``osd_op_history_size`` + +:Description: The maximum number of completed operations to track. +:Type: 32-bit Unsigned Integer +:Default: ``20`` + + +``osd_op_history_duration`` + +:Description: The oldest completed operation to track. +:Type: 32-bit Unsigned Integer +:Default: ``600`` + + +``osd_op_log_threshold`` + +:Description: How many operations logs to display at once. +:Type: 32-bit Integer +:Default: ``5`` + + +.. _dmclock-qos: + +QoS Based on mClock +------------------- + +Ceph's use of mClock is now more refined and can be used by following the +steps as described in `mClock Config Reference`_. + +Core Concepts +````````````` + +Ceph's QoS support is implemented using a queueing scheduler +based on `the dmClock algorithm`_. This algorithm allocates the I/O +resources of the Ceph cluster in proportion to weights, and enforces +the constraints of minimum reservation and maximum limitation, so that +the services can compete for the resources fairly. Currently the +*mclock_scheduler* operation queue divides Ceph services involving I/O +resources into following buckets: + +- client op: the iops issued by client +- osd subop: the iops issued by primary OSD +- snap trim: the snap trimming related requests +- pg recovery: the recovery related requests +- pg scrub: the scrub related requests + +And the resources are partitioned using following three sets of tags. In other +words, the share of each type of service is controlled by three tags: + +#. reservation: the minimum IOPS allocated for the service. +#. limitation: the maximum IOPS allocated for the service. +#. weight: the proportional share of capacity if extra capacity or system + oversubscribed. + +In Ceph, operations are graded with "cost". And the resources allocated +for serving various services are consumed by these "costs". So, for +example, the more reservation a services has, the more resource it is +guaranteed to possess, as long as it requires. Assuming there are 2 +services: recovery and client ops: + +- recovery: (r:1, l:5, w:1) +- client ops: (r:2, l:0, w:9) + +The settings above ensure that the recovery won't get more than 5 +requests per second serviced, even if it requires so (see CURRENT +IMPLEMENTATION NOTE below), and no other services are competing with +it. But if the clients start to issue large amount of I/O requests, +neither will they exhaust all the I/O resources. 1 request per second +is always allocated for recovery jobs as long as there are any such +requests. So the recovery jobs won't be starved even in a cluster with +high load. And in the meantime, the client ops can enjoy a larger +portion of the I/O resource, because its weight is "9", while its +competitor "1". In the case of client ops, it is not clamped by the +limit setting, so it can make use of all the resources if there is no +recovery ongoing. + +CURRENT IMPLEMENTATION NOTE: the current implementation enforces the limit +values. Therefore, if a service crosses the enforced limit, the op remains +in the operation queue until the limit is restored. + +Subtleties of mClock +```````````````````` + +The reservation and limit values have a unit of requests per +second. The weight, however, does not technically have a unit and the +weights are relative to one another. So if one class of requests has a +weight of 1 and another a weight of 9, then the latter class of +requests should get 9 executed at a 9 to 1 ratio as the first class. +However that will only happen once the reservations are met and those +values include the operations executed under the reservation phase. + +Even though the weights do not have units, one must be careful in +choosing their values due how the algorithm assigns weight tags to +requests. If the weight is *W*, then for a given class of requests, +the next one that comes in will have a weight tag of *1/W* plus the +previous weight tag or the current time, whichever is larger. That +means if *W* is sufficiently large and therefore *1/W* is sufficiently +small, the calculated tag may never be assigned as it will get a value +of the current time. The ultimate lesson is that values for weight +should not be too large. They should be under the number of requests +one expects to be serviced each second. + +Caveats +``````` + +There are some factors that can reduce the impact of the mClock op +queues within Ceph. First, requests to an OSD are sharded by their +placement group identifier. Each shard has its own mClock queue and +these queues neither interact nor share information among them. The +number of shards can be controlled with the configuration options +``osd_op_num_shards``, ``osd_op_num_shards_hdd``, and +``osd_op_num_shards_ssd``. A lower number of shards will increase the +impact of the mClock queues, but may have other deleterious effects. + +Second, requests are transferred from the operation queue to the +operation sequencer, in which they go through the phases of +execution. The operation queue is where mClock resides and mClock +determines the next op to transfer to the operation sequencer. The +number of operations allowed in the operation sequencer is a complex +issue. In general we want to keep enough operations in the sequencer +so it's always getting work done on some operations while it's waiting +for disk and network access to complete on other operations. On the +other hand, once an operation is transferred to the operation +sequencer, mClock no longer has control over it. Therefore to maximize +the impact of mClock, we want to keep as few operations in the +operation sequencer as possible. So we have an inherent tension. + +The configuration options that influence the number of operations in +the operation sequencer are ``bluestore_throttle_bytes``, +``bluestore_throttle_deferred_bytes``, +``bluestore_throttle_cost_per_io``, +``bluestore_throttle_cost_per_io_hdd``, and +``bluestore_throttle_cost_per_io_ssd``. + +A third factor that affects the impact of the mClock algorithm is that +we're using a distributed system, where requests are made to multiple +OSDs and each OSD has (can have) multiple shards. Yet we're currently +using the mClock algorithm, which is not distributed (note: dmClock is +the distributed version of mClock). + +Various organizations and individuals are currently experimenting with +mClock as it exists in this code base along with their modifications +to the code base. We hope you'll share you're experiences with your +mClock and dmClock experiments on the ``ceph-devel`` mailing list. + + +``osd_push_per_object_cost`` + +:Description: the overhead for serving a push op + +:Type: Unsigned Integer +:Default: 1000 + + +``osd_recovery_max_chunk`` + +:Description: the maximum total size of data chunks a recovery op can carry. + +:Type: Unsigned Integer +:Default: 8 MiB + + +``osd_mclock_scheduler_client_res`` + +:Description: IO proportion reserved for each client (default). + +:Type: Unsigned Integer +:Default: 1 + + +``osd_mclock_scheduler_client_wgt`` + +:Description: IO share for each client (default) over reservation. + +:Type: Unsigned Integer +:Default: 1 + + +``osd_mclock_scheduler_client_lim`` + +:Description: IO limit for each client (default) over reservation. + +:Type: Unsigned Integer +:Default: 999999 + + +``osd_mclock_scheduler_background_recovery_res`` + +:Description: IO proportion reserved for background recovery (default). + +:Type: Unsigned Integer +:Default: 1 + + +``osd_mclock_scheduler_background_recovery_wgt`` + +:Description: IO share for each background recovery over reservation. + +:Type: Unsigned Integer +:Default: 1 + + +``osd_mclock_scheduler_background_recovery_lim`` + +:Description: IO limit for background recovery over reservation. + +:Type: Unsigned Integer +:Default: 999999 + + +``osd_mclock_scheduler_background_best_effort_res`` + +:Description: IO proportion reserved for background best_effort (default). + +:Type: Unsigned Integer +:Default: 1 + + +``osd_mclock_scheduler_background_best_effort_wgt`` + +:Description: IO share for each background best_effort over reservation. + +:Type: Unsigned Integer +:Default: 1 + + +``osd_mclock_scheduler_background_best_effort_lim`` + +:Description: IO limit for background best_effort over reservation. + +:Type: Unsigned Integer +:Default: 999999 + +.. _the dmClock algorithm: https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Gulati.pdf + + +.. index:: OSD; backfilling + +Backfilling +=========== + +When you add or remove Ceph OSD Daemons to a cluster, CRUSH will +rebalance the cluster by moving placement groups to or from Ceph OSDs +to restore balanced utilization. The process of migrating placement groups and +the objects they contain can reduce the cluster's operational performance +considerably. To maintain operational performance, Ceph performs this migration +with 'backfilling', which allows Ceph to set backfill operations to a lower +priority than requests to read or write data. + + +``osd_max_backfills`` + +:Description: The maximum number of backfills allowed to or from a single OSD. + Note that this is applied separately for read and write operations. +:Type: 64-bit Unsigned Integer +:Default: ``1`` + + +``osd_backfill_scan_min`` + +:Description: The minimum number of objects per backfill scan. + +:Type: 32-bit Integer +:Default: ``64`` + + +``osd_backfill_scan_max`` + +:Description: The maximum number of objects per backfill scan. + +:Type: 32-bit Integer +:Default: ``512`` + + +``osd_backfill_retry_interval`` + +:Description: The number of seconds to wait before retrying backfill requests. +:Type: Double +:Default: ``10.0`` + +.. index:: OSD; osdmap + +OSD Map +======= + +OSD maps reflect the OSD daemons operating in the cluster. Over time, the +number of map epochs increases. Ceph provides some settings to ensure that +Ceph performs well as the OSD map grows larger. + + +``osd_map_dedup`` + +:Description: Enable removing duplicates in the OSD map. +:Type: Boolean +:Default: ``true`` + + +``osd_map_cache_size`` + +:Description: The number of OSD maps to keep cached. +:Type: 32-bit Integer +:Default: ``50`` + + +``osd_map_message_max`` + +:Description: The maximum map entries allowed per MOSDMap message. +:Type: 32-bit Integer +:Default: ``40`` + + + +.. index:: OSD; recovery + +Recovery +======== + +When the cluster starts or when a Ceph OSD Daemon crashes and restarts, the OSD +begins peering with other Ceph OSD Daemons before writes can occur. See +`Monitoring OSDs and PGs`_ for details. + +If a Ceph OSD Daemon crashes and comes back online, usually it will be out of +sync with other Ceph OSD Daemons containing more recent versions of objects in +the placement groups. When this happens, the Ceph OSD Daemon goes into recovery +mode and seeks to get the latest copy of the data and bring its map back up to +date. Depending upon how long the Ceph OSD Daemon was down, the OSD's objects +and placement groups may be significantly out of date. Also, if a failure domain +went down (e.g., a rack), more than one Ceph OSD Daemon may come back online at +the same time. This can make the recovery process time consuming and resource +intensive. + +To maintain operational performance, Ceph performs recovery with limitations on +the number recovery requests, threads and object chunk sizes which allows Ceph +perform well in a degraded state. + + +``osd_recovery_delay_start`` + +:Description: After peering completes, Ceph will delay for the specified number + of seconds before starting to recover RADOS objects. + +:Type: Float +:Default: ``0`` + + +``osd_recovery_max_active`` + +:Description: The number of active recovery requests per OSD at one time. More + requests will accelerate recovery, but the requests places an + increased load on the cluster. + + This value is only used if it is non-zero. Normally it + is ``0``, which means that the ``hdd`` or ``ssd`` values + (below) are used, depending on the type of the primary + device backing the OSD. + +:Type: 32-bit Integer +:Default: ``0`` + +``osd_recovery_max_active_hdd`` + +:Description: The number of active recovery requests per OSD at one time, if the + primary device is rotational. + +:Type: 32-bit Integer +:Default: ``3`` + +``osd_recovery_max_active_ssd`` + +:Description: The number of active recovery requests per OSD at one time, if the + primary device is non-rotational (i.e., an SSD). + +:Type: 32-bit Integer +:Default: ``10`` + + +``osd_recovery_max_chunk`` + +:Description: The maximum size of a recovered chunk of data to push. +:Type: 64-bit Unsigned Integer +:Default: ``8 << 20`` + + +``osd_recovery_max_single_start`` + +:Description: The maximum number of recovery operations per OSD that will be + newly started when an OSD is recovering. +:Type: 64-bit Unsigned Integer +:Default: ``1`` + + +``osd_recovery_thread_timeout`` + +:Description: The maximum time in seconds before timing out a recovery thread. +:Type: 32-bit Integer +:Default: ``30`` + + +``osd_recover_clone_overlap`` + +:Description: Preserves clone overlap during recovery. Should always be set + to ``true``. + +:Type: Boolean +:Default: ``true`` + + +``osd_recovery_sleep`` + +:Description: Time in seconds to sleep before the next recovery or backfill op. + Increasing this value will slow down recovery operation while + client operations will be less impacted. + +:Type: Float +:Default: ``0`` + + +``osd_recovery_sleep_hdd`` + +:Description: Time in seconds to sleep before next recovery or backfill op + for HDDs. + +:Type: Float +:Default: ``0.1`` + + +``osd_recovery_sleep_ssd`` + +:Description: Time in seconds to sleep before the next recovery or backfill op + for SSDs. + +:Type: Float +:Default: ``0`` + + +``osd_recovery_sleep_hybrid`` + +:Description: Time in seconds to sleep before the next recovery or backfill op + when OSD data is on HDD and OSD journal / WAL+DB is on SSD. + +:Type: Float +:Default: ``0.025`` + + +``osd_recovery_priority`` + +:Description: The default priority set for recovery work queue. Not + related to a pool's ``recovery_priority``. + +:Type: 32-bit Integer +:Default: ``5`` + + +Tiering +======= + +``osd_agent_max_ops`` + +:Description: The maximum number of simultaneous flushing ops per tiering agent + in the high speed mode. +:Type: 32-bit Integer +:Default: ``4`` + + +``osd_agent_max_low_ops`` + +:Description: The maximum number of simultaneous flushing ops per tiering agent + in the low speed mode. +:Type: 32-bit Integer +:Default: ``2`` + +See `cache target dirty high ratio`_ for when the tiering agent flushes dirty +objects within the high speed mode. + +Miscellaneous +============= + + +``osd_snap_trim_thread_timeout`` + +:Description: The maximum time in seconds before timing out a snap trim thread. +:Type: 32-bit Integer +:Default: ``1*60*60`` + + +``osd_backlog_thread_timeout`` + +:Description: The maximum time in seconds before timing out a backlog thread. +:Type: 32-bit Integer +:Default: ``1*60*60`` + + +``osd_default_notify_timeout`` + +:Description: The OSD default notification timeout (in seconds). +:Type: 32-bit Unsigned Integer +:Default: ``30`` + + +``osd_check_for_log_corruption`` + +:Description: Check log files for corruption. Can be computationally expensive. +:Type: Boolean +:Default: ``false`` + + +``osd_remove_thread_timeout`` + +:Description: The maximum time in seconds before timing out a remove OSD thread. +:Type: 32-bit Integer +:Default: ``60*60`` + + +``osd_command_thread_timeout`` + +:Description: The maximum time in seconds before timing out a command thread. +:Type: 32-bit Integer +:Default: ``10*60`` + + +``osd_delete_sleep`` + +:Description: Time in seconds to sleep before the next removal transaction. This + throttles the PG deletion process. + +:Type: Float +:Default: ``0`` + + +``osd_delete_sleep_hdd`` + +:Description: Time in seconds to sleep before the next removal transaction + for HDDs. + +:Type: Float +:Default: ``5`` + + +``osd_delete_sleep_ssd`` + +:Description: Time in seconds to sleep before the next removal transaction + for SSDs. + +:Type: Float +:Default: ``0`` + + +``osd_delete_sleep_hybrid`` + +:Description: Time in seconds to sleep before the next removal transaction + when OSD data is on HDD and OSD journal or WAL+DB is on SSD. + +:Type: Float +:Default: ``1`` + + +``osd_command_max_records`` + +:Description: Limits the number of lost objects to return. +:Type: 32-bit Integer +:Default: ``256`` + + +``osd_fast_fail_on_connection_refused`` + +:Description: If this option is enabled, crashed OSDs are marked down + immediately by connected peers and MONs (assuming that the + crashed OSD host survives). Disable it to restore old + behavior, at the expense of possible long I/O stalls when + OSDs crash in the middle of I/O operations. +:Type: Boolean +:Default: ``true`` + + + +.. _pool: ../../operations/pools +.. _Configuring Monitor/OSD Interaction: ../mon-osd-interaction +.. _Monitoring OSDs and PGs: ../../operations/monitoring-osd-pg#peering +.. _Pool & PG Config Reference: ../pool-pg-config-ref +.. _Journal Config Reference: ../journal-ref +.. _cache target dirty high ratio: ../../operations/pools#cache-target-dirty-high-ratio +.. _mClock Config Reference: ../mclock-config-ref diff --git a/doc/rados/configuration/pool-pg-config-ref.rst b/doc/rados/configuration/pool-pg-config-ref.rst new file mode 100644 index 000000000..b4b5df478 --- /dev/null +++ b/doc/rados/configuration/pool-pg-config-ref.rst @@ -0,0 +1,282 @@ +====================================== + Pool, PG and CRUSH Config Reference +====================================== + +.. index:: pools; configuration + +When you create pools and set the number of placement groups (PGs) for each, Ceph +uses default values when you don't specifically override the defaults. **We +recommend** overriding some of the defaults. Specifically, we recommend setting +a pool's replica size and overriding the default number of placement groups. You +can specifically set these values when running `pool`_ commands. You can also +override the defaults by adding new ones in the ``[global]`` section of your +Ceph configuration file. + + +.. literalinclude:: pool-pg.conf + :language: ini + + +``mon_max_pool_pg_num`` + +:Description: The maximum number of placement groups per pool. +:Type: Integer +:Default: ``65536`` + + +``mon_pg_create_interval`` + +:Description: Number of seconds between PG creation in the same + Ceph OSD Daemon. + +:Type: Float +:Default: ``30.0`` + + +``mon_pg_stuck_threshold`` + +:Description: Number of seconds after which PGs can be considered as + being stuck. + +:Type: 32-bit Integer +:Default: ``300`` + +``mon_pg_min_inactive`` + +:Description: Raise ``HEALTH_ERR`` if the count of PGs that have been + inactive longer than the ``mon_pg_stuck_threshold`` exceeds this + setting. A non-positive number means disabled, never go into ERR. +:Type: Integer +:Default: ``1`` + + +``mon_pg_warn_min_per_osd`` + +:Description: Raise ``HEALTH_WARN`` if the average number + of PGs per ``in`` OSD is under this number. A non-positive number + disables this. +:Type: Integer +:Default: ``30`` + + +``mon_pg_warn_min_objects`` + +:Description: Do not warn if the total number of RADOS objects in cluster is below + this number +:Type: Integer +:Default: ``1000`` + + +``mon_pg_warn_min_pool_objects`` + +:Description: Do not warn on pools whose RADOS object count is below this number +:Type: Integer +:Default: ``1000`` + + +``mon_pg_check_down_all_threshold`` + +:Description: Percentage threshold of ``down`` OSDs above which we check all PGs + for stale ones. +:Type: Float +:Default: ``0.5`` + + +``mon_pg_warn_max_object_skew`` + +:Description: Raise ``HEALTH_WARN`` if the average RADOS object count per PG + of any pool is greater than ``mon_pg_warn_max_object_skew`` times + the average RADOS object count per PG of all pools. Zero or a non-positive + number disables this. Note that this option applies to ``ceph-mgr`` daemons. +:Type: Float +:Default: ``10`` + + +``mon_delta_reset_interval`` + +:Description: Seconds of inactivity before we reset the PG delta to 0. We keep + track of the delta of the used space of each pool, so, for + example, it would be easier for us to understand the progress of + recovery or the performance of cache tier. But if there's no + activity reported for a certain pool, we just reset the history of + deltas of that pool. +:Type: Integer +:Default: ``10`` + + +``mon_osd_max_op_age`` + +:Description: Maximum op age before we get concerned (make it a power of 2). + ``HEALTH_WARN`` will be raised if a request has been blocked longer + than this limit. +:Type: Float +:Default: ``32.0`` + + +``osd_pg_bits`` + +:Description: Placement group bits per Ceph OSD Daemon. +:Type: 32-bit Integer +:Default: ``6`` + + +``osd_pgp_bits`` + +:Description: The number of bits per Ceph OSD Daemon for PGPs. +:Type: 32-bit Integer +:Default: ``6`` + + +``osd_crush_chooseleaf_type`` + +:Description: The bucket type to use for ``chooseleaf`` in a CRUSH rule. Uses + ordinal rank rather than name. + +:Type: 32-bit Integer +:Default: ``1``. Typically a host containing one or more Ceph OSD Daemons. + + +``osd_crush_initial_weight`` + +:Description: The initial CRUSH weight for newly added OSDs. + +:Type: Double +:Default: ``the size of a newly added OSD in TB``. By default, the initial CRUSH + weight for a newly added OSD is set to its device size in TB. + See `Weighting Bucket Items`_ for details. + + +``osd_pool_default_crush_rule`` + +:Description: The default CRUSH rule to use when creating a replicated pool. +:Type: 8-bit Integer +:Default: ``-1``, which means "pick the rule with the lowest numerical ID and + use that". This is to make pool creation work in the absence of rule 0. + + +``osd_pool_erasure_code_stripe_unit`` + +:Description: Sets the default size, in bytes, of a chunk of an object + stripe for erasure coded pools. Every object of size S + will be stored as N stripes, with each data chunk + receiving ``stripe unit`` bytes. Each stripe of ``N * + stripe unit`` bytes will be encoded/decoded + individually. This option can is overridden by the + ``stripe_unit`` setting in an erasure code profile. + +:Type: Unsigned 32-bit Integer +:Default: ``4096`` + + +``osd_pool_default_size`` + +:Description: Sets the number of replicas for objects in the pool. The default + value is the same as + ``ceph osd pool set {pool-name} size {size}``. + +:Type: 32-bit Integer +:Default: ``3`` + + +``osd_pool_default_min_size`` + +:Description: Sets the minimum number of written replicas for objects in the + pool in order to acknowledge a write operation to the client. If + minimum is not met, Ceph will not acknowledge the write to the + client, **which may result in data loss**. This setting ensures + a minimum number of replicas when operating in ``degraded`` mode. + +:Type: 32-bit Integer +:Default: ``0``, which means no particular minimum. If ``0``, + minimum is ``size - (size / 2)``. + + +``osd_pool_default_pg_num`` + +:Description: The default number of placement groups for a pool. The default + value is the same as ``pg_num`` with ``mkpool``. + +:Type: 32-bit Integer +:Default: ``32`` + + +``osd_pool_default_pgp_num`` + +:Description: The default number of placement groups for placement for a pool. + The default value is the same as ``pgp_num`` with ``mkpool``. + PG and PGP should be equal (for now). + +:Type: 32-bit Integer +:Default: ``8`` + + +``osd_pool_default_flags`` + +:Description: The default flags for new pools. +:Type: 32-bit Integer +:Default: ``0`` + + +``osd_max_pgls`` + +:Description: The maximum number of placement groups to list. A client + requesting a large number can tie up the Ceph OSD Daemon. + +:Type: Unsigned 64-bit Integer +:Default: ``1024`` +:Note: Default should be fine. + + +``osd_min_pg_log_entries`` + +:Description: The minimum number of placement group logs to maintain + when trimming log files. + +:Type: 32-bit Int Unsigned +:Default: ``250`` + + +``osd_max_pg_log_entries`` + +:Description: The maximum number of placement group logs to maintain + when trimming log files. + +:Type: 32-bit Int Unsigned +:Default: ``10000`` + + +``osd_default_data_pool_replay_window`` + +:Description: The time (in seconds) for an OSD to wait for a client to replay + a request. + +:Type: 32-bit Integer +:Default: ``45`` + +``osd_max_pg_per_osd_hard_ratio`` + +:Description: The ratio of number of PGs per OSD allowed by the cluster before the + OSD refuses to create new PGs. An OSD stops creating new PGs if the number + of PGs it serves exceeds + ``osd_max_pg_per_osd_hard_ratio`` \* ``mon_max_pg_per_osd``. + +:Type: Float +:Default: ``2`` + +``osd_recovery_priority`` + +:Description: Priority of recovery in the work queue. + +:Type: Integer +:Default: ``5`` + +``osd_recovery_op_priority`` + +:Description: Default priority used for recovery operations if pool doesn't override. + +:Type: Integer +:Default: ``3`` + +.. _pool: ../../operations/pools +.. _Monitoring OSDs and PGs: ../../operations/monitoring-osd-pg#peering +.. _Weighting Bucket Items: ../../operations/crush-map#weightingbucketitems diff --git a/doc/rados/configuration/pool-pg.conf b/doc/rados/configuration/pool-pg.conf new file mode 100644 index 000000000..34c3af9f0 --- /dev/null +++ b/doc/rados/configuration/pool-pg.conf @@ -0,0 +1,21 @@ +[global] + + # By default, Ceph makes 3 replicas of RADOS objects. If you want to maintain four + # copies of an object the default value--a primary copy and three replica + # copies--reset the default values as shown in 'osd_pool_default_size'. + # If you want to allow Ceph to write a lesser number of copies in a degraded + # state, set 'osd_pool_default_min_size' to a number less than the + # 'osd_pool_default_size' value. + + osd_pool_default_size = 3 # Write an object 3 times. + osd_pool_default_min_size = 2 # Allow writing two copies in a degraded state. + + # Ensure you have a realistic number of placement groups. We recommend + # approximately 100 per OSD. E.g., total number of OSDs multiplied by 100 + # divided by the number of replicas (i.e., osd pool default size). So for + # 10 OSDs and osd pool default size = 4, we'd recommend approximately + # (100 * 10) / 4 = 250. + # always use the nearest power of 2 + + osd_pool_default_pg_num = 256 + osd_pool_default_pgp_num = 256 diff --git a/doc/rados/configuration/storage-devices.rst b/doc/rados/configuration/storage-devices.rst new file mode 100644 index 000000000..8536d2cfa --- /dev/null +++ b/doc/rados/configuration/storage-devices.rst @@ -0,0 +1,96 @@ +================= + Storage Devices +================= + +There are two Ceph daemons that store data on devices: + +.. _rados_configuration_storage-devices_ceph_osd: + +* **Ceph OSDs** (Object Storage Daemons) store most of the data + in Ceph. Usually each OSD is backed by a single storage device. + This can be a traditional hard disk (HDD) or a solid state disk + (SSD). OSDs can also be backed by a combination of devices: for + example, a HDD for most data and an SSD (or partition of an + SSD) for some metadata. The number of OSDs in a cluster is + usually a function of the amount of data to be stored, the size + of each storage device, and the level and type of redundancy + specified (replication or erasure coding). +* **Ceph Monitor** daemons manage critical cluster state. This + includes cluster membership and authentication information. + Small clusters require only a few gigabytes of storage to hold + the monitor database. In large clusters, however, the monitor + database can reach sizes of tens of gigabytes to hundreds of + gigabytes. +* **Ceph Manager** daemons run alongside monitor daemons, providing + additional monitoring and providing interfaces to external + monitoring and management systems. + + +OSD Back Ends +============= + +There are two ways that OSDs manage the data they store. As of the Luminous +12.2.z release, the default (and recommended) back end is *BlueStore*. Prior +to the Luminous release, the default (and only) back end was *Filestore*. + +.. _rados_config_storage_devices_bluestore: + +BlueStore +--------- + +<<<<<<< HEAD +BlueStore is a special-purpose storage backend designed specifically +for managing data on disk for Ceph OSD workloads. It is motivated by +experience supporting and managing OSDs using FileStore over the +last ten years. Key BlueStore features include: +======= +BlueStore is a special-purpose storage back end designed specifically for +managing data on disk for Ceph OSD workloads. BlueStore's design is based on +a decade of experience of supporting and managing Filestore OSDs. +>>>>>>> 28abc6a9a59 (doc/rados: s/backend/back end/) + +* Direct management of storage devices. BlueStore consumes raw block + devices or partitions. This avoids any intervening layers of + abstraction (such as local file systems like XFS) that may limit + performance or add complexity. +* Metadata management with RocksDB. We embed RocksDB's key/value database + in order to manage internal metadata, such as the mapping from object + names to block locations on disk. +* Full data and metadata checksumming. By default all data and + metadata written to BlueStore is protected by one or more + checksums. No data or metadata will be read from disk or returned + to the user without being verified. +* Inline compression. Data written may be optionally compressed + before being written to disk. +* Multi-device metadata tiering. BlueStore allows its internal + journal (write-ahead log) to be written to a separate, high-speed + device (like an SSD, NVMe, or NVDIMM) to increased performance. If + a significant amount of faster storage is available, internal + metadata can also be stored on the faster device. +* Efficient copy-on-write. RBD and CephFS snapshots rely on a + copy-on-write *clone* mechanism that is implemented efficiently in + BlueStore. This results in efficient IO both for regular snapshots + and for erasure coded pools (which rely on cloning to implement + efficient two-phase commits). + +For more information, see :doc:`bluestore-config-ref` and :doc:`/rados/operations/bluestore-migration`. + +FileStore +--------- + +FileStore is the legacy approach to storing objects in Ceph. It +relies on a standard file system (normally XFS) in combination with a +key/value database (traditionally LevelDB, now RocksDB) for some +metadata. + +FileStore is well-tested and widely used in production but suffers +from many performance deficiencies due to its overall design and +reliance on a traditional file system for storing object data. + +Although FileStore is generally capable of functioning on most +POSIX-compatible file systems (including btrfs and ext4), we only +recommend that XFS be used. Both btrfs and ext4 have known bugs and +deficiencies and their use may lead to data loss. By default all Ceph +provisioning tools will use XFS. + +For more information, see :doc:`filestore-config-ref`. diff --git a/doc/rados/index.rst b/doc/rados/index.rst new file mode 100644 index 000000000..5f3f112e8 --- /dev/null +++ b/doc/rados/index.rst @@ -0,0 +1,78 @@ +.. _rados-index: + +====================== + Ceph Storage Cluster +====================== + +The :term:`Ceph Storage Cluster` is the foundation for all Ceph deployments. +Based upon :abbr:`RADOS (Reliable Autonomic Distributed Object Store)`, Ceph +Storage Clusters consist of two types of daemons: a :term:`Ceph OSD Daemon` +(OSD) stores data as objects on a storage node; and a :term:`Ceph Monitor` (MON) +maintains a master copy of the cluster map. A Ceph Storage Cluster may contain +thousands of storage nodes. A minimal system will have at least one +Ceph Monitor and two Ceph OSD Daemons for data replication. + +The Ceph File System, Ceph Object Storage and Ceph Block Devices read data from +and write data to the Ceph Storage Cluster. + +.. raw:: html + + <style type="text/css">div.body h3{margin:5px 0px 0px 0px;}</style> + <table cellpadding="10"><colgroup><col width="33%"><col width="33%"><col width="33%"></colgroup><tbody valign="top"><tr><td><h3>Config and Deploy</h3> + +Ceph Storage Clusters have a few required settings, but most configuration +settings have default values. A typical deployment uses a deployment tool +to define a cluster and bootstrap a monitor. See `Deployment`_ for details +on ``cephadm.`` + +.. toctree:: + :maxdepth: 2 + + Configuration <configuration/index> + Deployment <../cephadm/index> + +.. raw:: html + + </td><td><h3>Operations</h3> + +Once you have deployed a Ceph Storage Cluster, you may begin operating +your cluster. + +.. toctree:: + :maxdepth: 2 + + + Operations <operations/index> + +.. toctree:: + :maxdepth: 1 + + Man Pages <man/index> + + +.. toctree:: + :hidden: + + troubleshooting/index + +.. raw:: html + + </td><td><h3>APIs</h3> + +Most Ceph deployments use `Ceph Block Devices`_, `Ceph Object Storage`_ and/or the +`Ceph File System`_. You may also develop applications that talk directly to +the Ceph Storage Cluster. + +.. toctree:: + :maxdepth: 2 + + APIs <api/index> + +.. raw:: html + + </td></tr></tbody></table> + +.. _Ceph Block Devices: ../rbd/ +.. _Ceph File System: ../cephfs/ +.. _Ceph Object Storage: ../radosgw/ +.. _Deployment: ../cephadm/ diff --git a/doc/rados/man/index.rst b/doc/rados/man/index.rst new file mode 100644 index 000000000..8311bfd5a --- /dev/null +++ b/doc/rados/man/index.rst @@ -0,0 +1,31 @@ +======================= + Object Store Manpages +======================= + +.. toctree:: + :maxdepth: 1 + + ../../man/8/ceph-volume.rst + ../../man/8/ceph-volume-systemd.rst + ../../man/8/ceph.rst + ../../man/8/ceph-authtool.rst + ../../man/8/ceph-clsinfo.rst + ../../man/8/ceph-conf.rst + ../../man/8/ceph-debugpack.rst + ../../man/8/ceph-dencoder.rst + ../../man/8/ceph-mon.rst + ../../man/8/ceph-osd.rst + ../../man/8/ceph-kvstore-tool.rst + ../../man/8/ceph-run.rst + ../../man/8/ceph-syn.rst + ../../man/8/crushtool.rst + ../../man/8/librados-config.rst + ../../man/8/monmaptool.rst + ../../man/8/osdmaptool.rst + ../../man/8/rados.rst + + +.. toctree:: + :hidden: + + ../../man/8/ceph-post-file.rst diff --git a/doc/rados/operations/add-or-rm-mons.rst b/doc/rados/operations/add-or-rm-mons.rst new file mode 100644 index 000000000..359fa7676 --- /dev/null +++ b/doc/rados/operations/add-or-rm-mons.rst @@ -0,0 +1,446 @@ +.. _adding-and-removing-monitors: + +========================== + Adding/Removing Monitors +========================== + +When you have a cluster up and running, you may add or remove monitors +from the cluster at runtime. To bootstrap a monitor, see `Manual Deployment`_ +or `Monitor Bootstrap`_. + +.. _adding-monitors: + +Adding Monitors +=============== + +Ceph monitors are lightweight processes that are the single source of truth +for the cluster map. You can run a cluster with 1 monitor but we recommend at least 3 +for a production cluster. Ceph monitors use a variation of the +`Paxos`_ algorithm to establish consensus about maps and other critical +information across the cluster. Due to the nature of Paxos, Ceph requires +a majority of monitors to be active to establish a quorum (thus establishing +consensus). + +It is advisable to run an odd number of monitors. An +odd number of monitors is more resilient than an +even number. For instance, with a two monitor deployment, no +failures can be tolerated and still maintain a quorum; with three monitors, +one failure can be tolerated; in a four monitor deployment, one failure can +be tolerated; with five monitors, two failures can be tolerated. This avoids +the dreaded *split brain* phenomenon, and is why an odd number is best. +In short, Ceph needs a majority of +monitors to be active (and able to communicate with each other), but that +majority can be achieved using a single monitor, or 2 out of 2 monitors, +2 out of 3, 3 out of 4, etc. + +For small or non-critical deployments of multi-node Ceph clusters, it is +advisable to deploy three monitors, and to increase the number of monitors +to five for larger clusters or to survive a double failure. There is rarely +justification for seven or more. + +Since monitors are lightweight, it is possible to run them on the same +host as OSDs; however, we recommend running them on separate hosts, +because `fsync` issues with the kernel may impair performance. +Dedicated monitor nodes also minimize disruption since monitor and OSD +daemons are not inactive at the same time when a node crashes or is +taken down for maintenance. + +Dedicated +monitor nodes also make for cleaner maintenance by avoiding both OSDs and +a mon going down if a node is rebooted, taken down, or crashes. + +.. note:: A *majority* of monitors in your cluster must be able to + reach each other in order to establish a quorum. + +Deploy your Hardware +-------------------- + +If you are adding a new host when adding a new monitor, see `Hardware +Recommendations`_ for details on minimum recommendations for monitor hardware. +To add a monitor host to your cluster, first make sure you have an up-to-date +version of Linux installed (typically Ubuntu 16.04 or RHEL 7). + +Add your monitor host to a rack in your cluster, connect it to the network +and ensure that it has network connectivity. + +.. _Hardware Recommendations: ../../../start/hardware-recommendations + +Install the Required Software +----------------------------- + +For manually deployed clusters, you must install Ceph packages +manually. See `Installing Packages`_ for details. +You should configure SSH to a user with password-less authentication +and root permissions. + +.. _Installing Packages: ../../../install/install-storage-cluster + + +.. _Adding a Monitor (Manual): + +Adding a Monitor (Manual) +------------------------- + +This procedure creates a ``ceph-mon`` data directory, retrieves the monitor map +and monitor keyring, and adds a ``ceph-mon`` daemon to your cluster. If +this results in only two monitor daemons, you may add more monitors by +repeating this procedure until you have a sufficient number of ``ceph-mon`` +daemons to achieve a quorum. + +At this point you should define your monitor's id. Traditionally, monitors +have been named with single letters (``a``, ``b``, ``c``, ...), but you are +free to define the id as you see fit. For the purpose of this document, +please take into account that ``{mon-id}`` should be the id you chose, +without the ``mon.`` prefix (i.e., ``{mon-id}`` should be the ``a`` +on ``mon.a``). + +#. Create the default directory on the machine that will host your + new monitor: + + .. prompt:: bash $ + + ssh {new-mon-host} + sudo mkdir /var/lib/ceph/mon/ceph-{mon-id} + +#. Create a temporary directory ``{tmp}`` to keep the files needed during + this process. This directory should be different from the monitor's default + directory created in the previous step, and can be removed after all the + steps are executed: + + .. prompt:: bash $ + + mkdir {tmp} + +#. Retrieve the keyring for your monitors, where ``{tmp}`` is the path to + the retrieved keyring, and ``{key-filename}`` is the name of the file + containing the retrieved monitor key: + + .. prompt:: bash $ + + ceph auth get mon. -o {tmp}/{key-filename} + +#. Retrieve the monitor map, where ``{tmp}`` is the path to + the retrieved monitor map, and ``{map-filename}`` is the name of the file + containing the retrieved monitor map: + + .. prompt:: bash $ + + ceph mon getmap -o {tmp}/{map-filename} + +#. Prepare the monitor's data directory created in the first step. You must + specify the path to the monitor map so that you can retrieve the + information about a quorum of monitors and their ``fsid``. You must also + specify a path to the monitor keyring: + + .. prompt:: bash $ + + sudo ceph-mon -i {mon-id} --mkfs --monmap {tmp}/{map-filename} --keyring {tmp}/{key-filename} + + +#. Start the new monitor and it will automatically join the cluster. + The daemon needs to know which address to bind to, via either the + ``--public-addr {ip}`` or ``--public-network {network}`` argument. + For example: + + .. prompt:: bash $ + + ceph-mon -i {mon-id} --public-addr {ip:port} + +.. _removing-monitors: + +Removing Monitors +================= + +When you remove monitors from a cluster, consider that Ceph monitors use +Paxos to establish consensus about the master cluster map. You must have +a sufficient number of monitors to establish a quorum for consensus about +the cluster map. + +.. _Removing a Monitor (Manual): + +Removing a Monitor (Manual) +--------------------------- + +This procedure removes a ``ceph-mon`` daemon from your cluster. If this +procedure results in only two monitor daemons, you may add or remove another +monitor until you have a number of ``ceph-mon`` daemons that can achieve a +quorum. + +#. Stop the monitor: + + .. prompt:: bash $ + + service ceph -a stop mon.{mon-id} + +#. Remove the monitor from the cluster: + + .. prompt:: bash $ + + ceph mon remove {mon-id} + +#. Remove the monitor entry from ``ceph.conf``. + +.. _rados-mon-remove-from-unhealthy: + +Removing Monitors from an Unhealthy Cluster +------------------------------------------- + +This procedure removes a ``ceph-mon`` daemon from an unhealthy +cluster, for example a cluster where the monitors cannot form a +quorum. + + +#. Stop all ``ceph-mon`` daemons on all monitor hosts: + + .. prompt:: bash $ + + ssh {mon-host} + systemctl stop ceph-mon.target + + Repeat for all monitor hosts. + +#. Identify a surviving monitor and log in to that host: + + .. prompt:: bash $ + + ssh {mon-host} + +#. Extract a copy of the monmap file: + + .. prompt:: bash $ + + ceph-mon -i {mon-id} --extract-monmap {map-path} + + In most cases, this command will be: + + .. prompt:: bash $ + + ceph-mon -i `hostname` --extract-monmap /tmp/monmap + +#. Remove the non-surviving or problematic monitors. For example, if + you have three monitors, ``mon.a``, ``mon.b``, and ``mon.c``, where + only ``mon.a`` will survive, follow the example below: + + .. prompt:: bash $ + + monmaptool {map-path} --rm {mon-id} + + For example, + + .. prompt:: bash $ + + monmaptool /tmp/monmap --rm b + monmaptool /tmp/monmap --rm c + +#. Inject the surviving map with the removed monitors into the + surviving monitor(s). For example, to inject a map into monitor + ``mon.a``, follow the example below: + + .. prompt:: bash $ + + ceph-mon -i {mon-id} --inject-monmap {map-path} + + For example: + + .. prompt:: bash $ + + ceph-mon -i a --inject-monmap /tmp/monmap + +#. Start only the surviving monitors. + +#. Verify the monitors form a quorum (``ceph -s``). + +#. You may wish to archive the removed monitors' data directory in + ``/var/lib/ceph/mon`` in a safe location, or delete it if you are + confident the remaining monitors are healthy and are sufficiently + redundant. + +.. _Changing a Monitor's IP address: + +Changing a Monitor's IP Address +=============================== + +.. important:: Existing monitors are not supposed to change their IP addresses. + +Monitors are critical components of a Ceph cluster, and they need to maintain a +quorum for the whole system to work properly. To establish a quorum, the +monitors need to discover each other. Ceph has strict requirements for +discovering monitors. + +Ceph clients and other Ceph daemons use ``ceph.conf`` to discover monitors. +However, monitors discover each other using the monitor map, not ``ceph.conf``. +For example, if you refer to `Adding a Monitor (Manual)`_ you will see that you +need to obtain the current monmap for the cluster when creating a new monitor, +as it is one of the required arguments of ``ceph-mon -i {mon-id} --mkfs``. The +following sections explain the consistency requirements for Ceph monitors, and a +few safe ways to change a monitor's IP address. + + +Consistency Requirements +------------------------ + +A monitor always refers to the local copy of the monmap when discovering other +monitors in the cluster. Using the monmap instead of ``ceph.conf`` avoids +errors that could break the cluster (e.g., typos in ``ceph.conf`` when +specifying a monitor address or port). Since monitors use monmaps for discovery +and they share monmaps with clients and other Ceph daemons, the monmap provides +monitors with a strict guarantee that their consensus is valid. + +Strict consistency also applies to updates to the monmap. As with any other +updates on the monitor, changes to the monmap always run through a distributed +consensus algorithm called `Paxos`_. The monitors must agree on each update to +the monmap, such as adding or removing a monitor, to ensure that each monitor in +the quorum has the same version of the monmap. Updates to the monmap are +incremental so that monitors have the latest agreed upon version, and a set of +previous versions, allowing a monitor that has an older version of the monmap to +catch up with the current state of the cluster. + +If monitors discovered each other through the Ceph configuration file instead of +through the monmap, it would introduce additional risks because the Ceph +configuration files are not updated and distributed automatically. Monitors +might inadvertently use an older ``ceph.conf`` file, fail to recognize a +monitor, fall out of a quorum, or develop a situation where `Paxos`_ is not able +to determine the current state of the system accurately. Consequently, making +changes to an existing monitor's IP address must be done with great care. + + +Changing a Monitor's IP address (The Right Way) +----------------------------------------------- + +Changing a monitor's IP address in ``ceph.conf`` only is not sufficient to +ensure that other monitors in the cluster will receive the update. To change a +monitor's IP address, you must add a new monitor with the IP address you want +to use (as described in `Adding a Monitor (Manual)`_), ensure that the new +monitor successfully joins the quorum; then, remove the monitor that uses the +old IP address. Then, update the ``ceph.conf`` file to ensure that clients and +other daemons know the IP address of the new monitor. + +For example, lets assume there are three monitors in place, such as :: + + [mon.a] + host = host01 + addr = 10.0.0.1:6789 + [mon.b] + host = host02 + addr = 10.0.0.2:6789 + [mon.c] + host = host03 + addr = 10.0.0.3:6789 + +To change ``mon.c`` to ``host04`` with the IP address ``10.0.0.4``, follow the +steps in `Adding a Monitor (Manual)`_ by adding a new monitor ``mon.d``. Ensure +that ``mon.d`` is running before removing ``mon.c``, or it will break the +quorum. Remove ``mon.c`` as described on `Removing a Monitor (Manual)`_. Moving +all three monitors would thus require repeating this process as many times as +needed. + + +Changing a Monitor's IP address (The Messy Way) +----------------------------------------------- + +There may come a time when the monitors must be moved to a different network, a +different part of the datacenter or a different datacenter altogether. While it +is possible to do it, the process becomes a bit more hazardous. + +In such a case, the solution is to generate a new monmap with updated IP +addresses for all the monitors in the cluster, and inject the new map on each +individual monitor. This is not the most user-friendly approach, but we do not +expect this to be something that needs to be done every other week. As it is +clearly stated on the top of this section, monitors are not supposed to change +IP addresses. + +Using the previous monitor configuration as an example, assume you want to move +all the monitors from the ``10.0.0.x`` range to ``10.1.0.x``, and these +networks are unable to communicate. Use the following procedure: + +#. Retrieve the monitor map, where ``{tmp}`` is the path to + the retrieved monitor map, and ``{filename}`` is the name of the file + containing the retrieved monitor map: + + .. prompt:: bash $ + + ceph mon getmap -o {tmp}/{filename} + +#. The following example demonstrates the contents of the monmap: + + .. prompt:: bash $ + + monmaptool --print {tmp}/{filename} + + :: + + monmaptool: monmap file {tmp}/{filename} + epoch 1 + fsid 224e376d-c5fe-4504-96bb-ea6332a19e61 + last_changed 2012-12-17 02:46:41.591248 + created 2012-12-17 02:46:41.591248 + 0: 10.0.0.1:6789/0 mon.a + 1: 10.0.0.2:6789/0 mon.b + 2: 10.0.0.3:6789/0 mon.c + +#. Remove the existing monitors: + + .. prompt:: bash $ + + monmaptool --rm a --rm b --rm c {tmp}/{filename} + + + :: + + monmaptool: monmap file {tmp}/{filename} + monmaptool: removing a + monmaptool: removing b + monmaptool: removing c + monmaptool: writing epoch 1 to {tmp}/{filename} (0 monitors) + +#. Add the new monitor locations: + + .. prompt:: bash $ + + monmaptool --add a 10.1.0.1:6789 --add b 10.1.0.2:6789 --add c 10.1.0.3:6789 {tmp}/{filename} + + + :: + + monmaptool: monmap file {tmp}/{filename} + monmaptool: writing epoch 1 to {tmp}/{filename} (3 monitors) + +#. Check new contents: + + .. prompt:: bash $ + + monmaptool --print {tmp}/{filename} + + :: + + monmaptool: monmap file {tmp}/{filename} + epoch 1 + fsid 224e376d-c5fe-4504-96bb-ea6332a19e61 + last_changed 2012-12-17 02:46:41.591248 + created 2012-12-17 02:46:41.591248 + 0: 10.1.0.1:6789/0 mon.a + 1: 10.1.0.2:6789/0 mon.b + 2: 10.1.0.3:6789/0 mon.c + +At this point, we assume the monitors (and stores) are installed at the new +location. The next step is to propagate the modified monmap to the new +monitors, and inject the modified monmap into each new monitor. + +#. First, make sure to stop all your monitors. Injection must be done while + the daemon is not running. + +#. Inject the monmap: + + .. prompt:: bash $ + + ceph-mon -i {mon-id} --inject-monmap {tmp}/{filename} + +#. Restart the monitors. + +After this step, migration to the new location is complete and +the monitors should operate successfully. + + +.. _Manual Deployment: ../../../install/manual-deployment +.. _Monitor Bootstrap: ../../../dev/mon-bootstrap +.. _Paxos: https://en.wikipedia.org/wiki/Paxos_(computer_science) diff --git a/doc/rados/operations/add-or-rm-osds.rst b/doc/rados/operations/add-or-rm-osds.rst new file mode 100644 index 000000000..315552859 --- /dev/null +++ b/doc/rados/operations/add-or-rm-osds.rst @@ -0,0 +1,386 @@ +====================== + Adding/Removing OSDs +====================== + +When you have a cluster up and running, you may add OSDs or remove OSDs +from the cluster at runtime. + +Adding OSDs +=========== + +When you want to expand a cluster, you may add an OSD at runtime. With Ceph, an +OSD is generally one Ceph ``ceph-osd`` daemon for one storage drive within a +host machine. If your host has multiple storage drives, you may map one +``ceph-osd`` daemon for each drive. + +Generally, it's a good idea to check the capacity of your cluster to see if you +are reaching the upper end of its capacity. As your cluster reaches its ``near +full`` ratio, you should add one or more OSDs to expand your cluster's capacity. + +.. warning:: Do not let your cluster reach its ``full ratio`` before + adding an OSD. OSD failures that occur after the cluster reaches + its ``near full`` ratio may cause the cluster to exceed its + ``full ratio``. + +Deploy your Hardware +-------------------- + +If you are adding a new host when adding a new OSD, see `Hardware +Recommendations`_ for details on minimum recommendations for OSD hardware. To +add an OSD host to your cluster, first make sure you have an up-to-date version +of Linux installed, and you have made some initial preparations for your +storage drives. See `Filesystem Recommendations`_ for details. + +Add your OSD host to a rack in your cluster, connect it to the network +and ensure that it has network connectivity. See the `Network Configuration +Reference`_ for details. + +.. _Hardware Recommendations: ../../../start/hardware-recommendations +.. _Filesystem Recommendations: ../../configuration/filesystem-recommendations +.. _Network Configuration Reference: ../../configuration/network-config-ref + +Install the Required Software +----------------------------- + +For manually deployed clusters, you must install Ceph packages +manually. See `Installing Ceph (Manual)`_ for details. +You should configure SSH to a user with password-less authentication +and root permissions. + +.. _Installing Ceph (Manual): ../../../install + + +Adding an OSD (Manual) +---------------------- + +This procedure sets up a ``ceph-osd`` daemon, configures it to use one drive, +and configures the cluster to distribute data to the OSD. If your host has +multiple drives, you may add an OSD for each drive by repeating this procedure. + +To add an OSD, create a data directory for it, mount a drive to that directory, +add the OSD to the cluster, and then add it to the CRUSH map. + +When you add the OSD to the CRUSH map, consider the weight you give to the new +OSD. Hard drive capacity grows 40% per year, so newer OSD hosts may have larger +hard drives than older hosts in the cluster (i.e., they may have greater +weight). + +.. tip:: Ceph prefers uniform hardware across pools. If you are adding drives + of dissimilar size, you can adjust their weights. However, for best + performance, consider a CRUSH hierarchy with drives of the same type/size. + +#. Create the OSD. If no UUID is given, it will be set automatically when the + OSD starts up. The following command will output the OSD number, which you + will need for subsequent steps: + + .. prompt:: bash $ + + ceph osd create [{uuid} [{id}]] + + If the optional parameter {id} is given it will be used as the OSD id. + Note, in this case the command may fail if the number is already in use. + + .. warning:: In general, explicitly specifying {id} is not recommended. + IDs are allocated as an array, and skipping entries consumes some extra + memory. This can become significant if there are large gaps and/or + clusters are large. If {id} is not specified, the smallest available is + used. + +#. Create the default directory on your new OSD: + + .. prompt:: bash $ + + ssh {new-osd-host} + sudo mkdir /var/lib/ceph/osd/ceph-{osd-number} + +#. If the OSD is for a drive other than the OS drive, prepare it + for use with Ceph, and mount it to the directory you just created: + + .. prompt:: bash $ + + ssh {new-osd-host} + sudo mkfs -t {fstype} /dev/{drive} + sudo mount -o user_xattr /dev/{hdd} /var/lib/ceph/osd/ceph-{osd-number} + +#. Initialize the OSD data directory: + + .. prompt:: bash $ + + ssh {new-osd-host} + ceph-osd -i {osd-num} --mkfs --mkkey + + The directory must be empty before you can run ``ceph-osd``. + +#. Register the OSD authentication key. The value of ``ceph`` for + ``ceph-{osd-num}`` in the path is the ``$cluster-$id``. If your + cluster name differs from ``ceph``, use your cluster name instead: + + .. prompt:: bash $ + + ceph auth add osd.{osd-num} osd 'allow *' mon 'allow rwx' -i /var/lib/ceph/osd/ceph-{osd-num}/keyring + +#. Add the OSD to the CRUSH map so that the OSD can begin receiving data. The + ``ceph osd crush add`` command allows you to add OSDs to the CRUSH hierarchy + wherever you wish. If you specify at least one bucket, the command + will place the OSD into the most specific bucket you specify, *and* it will + move that bucket underneath any other buckets you specify. **Important:** If + you specify only the root bucket, the command will attach the OSD directly + to the root, but CRUSH rules expect OSDs to be inside of hosts. + + Execute the following: + + .. prompt:: bash $ + + ceph osd crush add {id-or-name} {weight} [{bucket-type}={bucket-name} ...] + + You may also decompile the CRUSH map, add the OSD to the device list, add the + host as a bucket (if it's not already in the CRUSH map), add the device as an + item in the host, assign it a weight, recompile it and set it. See + `Add/Move an OSD`_ for details. + + +.. _rados-replacing-an-osd: + +Replacing an OSD +---------------- + +.. note:: If the instructions in this section do not work for you, try the + instructions in the cephadm documentation: :ref:`cephadm-replacing-an-osd`. + +When disks fail, or if an administrator wants to reprovision OSDs with a new +backend, for instance, for switching from FileStore to BlueStore, OSDs need to +be replaced. Unlike `Removing the OSD`_, replaced OSD's id and CRUSH map entry +need to be keep intact after the OSD is destroyed for replacement. + +#. Make sure it is safe to destroy the OSD: + + .. prompt:: bash $ + + while ! ceph osd safe-to-destroy osd.{id} ; do sleep 10 ; done + +#. Destroy the OSD first: + + .. prompt:: bash $ + + ceph osd destroy {id} --yes-i-really-mean-it + +#. Zap a disk for the new OSD, if the disk was used before for other purposes. + It's not necessary for a new disk: + + .. prompt:: bash $ + + ceph-volume lvm zap /dev/sdX + +#. Prepare the disk for replacement by using the previously destroyed OSD id: + + .. prompt:: bash $ + + ceph-volume lvm prepare --osd-id {id} --data /dev/sdX + +#. And activate the OSD: + + .. prompt:: bash $ + + ceph-volume lvm activate {id} {fsid} + +Alternatively, instead of preparing and activating, the device can be recreated +in one call, like: + + .. prompt:: bash $ + + ceph-volume lvm create --osd-id {id} --data /dev/sdX + + +Starting the OSD +---------------- + +After you add an OSD to Ceph, the OSD is in your configuration. However, +it is not yet running. The OSD is ``down`` and ``in``. You must start +your new OSD before it can begin receiving data. You may use +``service ceph`` from your admin host or start the OSD from its host +machine: + + .. prompt:: bash $ + + sudo systemctl start ceph-osd@{osd-num} + + +Once you start your OSD, it is ``up`` and ``in``. + + +Observe the Data Migration +-------------------------- + +Once you have added your new OSD to the CRUSH map, Ceph will begin rebalancing +the server by migrating placement groups to your new OSD. You can observe this +process with the `ceph`_ tool. : + + .. prompt:: bash $ + + ceph -w + +You should see the placement group states change from ``active+clean`` to +``active, some degraded objects``, and finally ``active+clean`` when migration +completes. (Control-c to exit.) + +.. _Add/Move an OSD: ../crush-map#addosd +.. _ceph: ../monitoring + + + +Removing OSDs (Manual) +====================== + +When you want to reduce the size of a cluster or replace hardware, you may +remove an OSD at runtime. With Ceph, an OSD is generally one Ceph ``ceph-osd`` +daemon for one storage drive within a host machine. If your host has multiple +storage drives, you may need to remove one ``ceph-osd`` daemon for each drive. +Generally, it's a good idea to check the capacity of your cluster to see if you +are reaching the upper end of its capacity. Ensure that when you remove an OSD +that your cluster is not at its ``near full`` ratio. + +.. warning:: Do not let your cluster reach its ``full ratio`` when + removing an OSD. Removing OSDs could cause the cluster to reach + or exceed its ``full ratio``. + + +Take the OSD out of the Cluster +----------------------------------- + +Before you remove an OSD, it is usually ``up`` and ``in``. You need to take it +out of the cluster so that Ceph can begin rebalancing and copying its data to +other OSDs. : + + .. prompt:: bash $ + + ceph osd out {osd-num} + + +Observe the Data Migration +-------------------------- + +Once you have taken your OSD ``out`` of the cluster, Ceph will begin +rebalancing the cluster by migrating placement groups out of the OSD you +removed. You can observe this process with the `ceph`_ tool. : + + .. prompt:: bash $ + + ceph -w + +You should see the placement group states change from ``active+clean`` to +``active, some degraded objects``, and finally ``active+clean`` when migration +completes. (Control-c to exit.) + +.. note:: Sometimes, typically in a "small" cluster with few hosts (for + instance with a small testing cluster), the fact to take ``out`` the + OSD can spawn a CRUSH corner case where some PGs remain stuck in the + ``active+remapped`` state. If you are in this case, you should mark + the OSD ``in`` with: + + .. prompt:: bash $ + + ceph osd in {osd-num} + + to come back to the initial state and then, instead of marking ``out`` + the OSD, set its weight to 0 with: + + .. prompt:: bash $ + + ceph osd crush reweight osd.{osd-num} 0 + + After that, you can observe the data migration which should come to its + end. The difference between marking ``out`` the OSD and reweighting it + to 0 is that in the first case the weight of the bucket which contains + the OSD is not changed whereas in the second case the weight of the bucket + is updated (and decreased of the OSD weight). The reweight command could + be sometimes favoured in the case of a "small" cluster. + + + +Stopping the OSD +---------------- + +After you take an OSD out of the cluster, it may still be running. +That is, the OSD may be ``up`` and ``out``. You must stop +your OSD before you remove it from the configuration: + + .. prompt:: bash $ + + ssh {osd-host} + sudo systemctl stop ceph-osd@{osd-num} + +Once you stop your OSD, it is ``down``. + + +Removing the OSD +---------------- + +This procedure removes an OSD from a cluster map, removes its authentication +key, removes the OSD from the OSD map, and removes the OSD from the +``ceph.conf`` file. If your host has multiple drives, you may need to remove an +OSD for each drive by repeating this procedure. + +#. Let the cluster forget the OSD first. This step removes the OSD from the CRUSH + map, removes its authentication key. And it is removed from the OSD map as + well. Please note the :ref:`purge subcommand <ceph-admin-osd>` is introduced in Luminous, for older + versions, please see below: + + .. prompt:: bash $ + + ceph osd purge {id} --yes-i-really-mean-it + +#. Navigate to the host where you keep the master copy of the cluster's + ``ceph.conf`` file: + + .. prompt:: bash $ + + ssh {admin-host} + cd /etc/ceph + vim ceph.conf + +#. Remove the OSD entry from your ``ceph.conf`` file (if it exists):: + + [osd.1] + host = {hostname} + +#. From the host where you keep the master copy of the cluster's ``ceph.conf`` + file, copy the updated ``ceph.conf`` file to the ``/etc/ceph`` directory of + other hosts in your cluster. + +If your Ceph cluster is older than Luminous, instead of using ``ceph osd +purge``, you need to perform this step manually: + + +#. Remove the OSD from the CRUSH map so that it no longer receives data. You may + also decompile the CRUSH map, remove the OSD from the device list, remove the + device as an item in the host bucket or remove the host bucket (if it's in the + CRUSH map and you intend to remove the host), recompile the map and set it. + See `Remove an OSD`_ for details: + + .. prompt:: bash $ + + ceph osd crush remove {name} + +#. Remove the OSD authentication key: + + .. prompt:: bash $ + + ceph auth del osd.{osd-num} + + The value of ``ceph`` for ``ceph-{osd-num}`` in the path is the + ``$cluster-$id``. If your cluster name differs from ``ceph``, use your + cluster name instead. + +#. Remove the OSD: + + .. prompt:: bash $ + + ceph osd rm {osd-num} + + for example: + + .. prompt:: bash $ + + ceph osd rm 1 + +.. _Remove an OSD: ../crush-map#removeosd diff --git a/doc/rados/operations/balancer.rst b/doc/rados/operations/balancer.rst new file mode 100644 index 000000000..b02a8914d --- /dev/null +++ b/doc/rados/operations/balancer.rst @@ -0,0 +1,206 @@ +.. _balancer: + +Balancer +======== + +The *balancer* can optimize the placement of PGs across OSDs in +order to achieve a balanced distribution, either automatically or in a +supervised fashion. + +Status +------ + +The current status of the balancer can be checked at any time with: + + .. prompt:: bash $ + + ceph balancer status + + +Automatic balancing +------------------- + +The automatic balancing feature is enabled by default in ``upmap`` +mode. Please refer to :ref:`upmap` for more details. The balancer can be +turned off with: + + .. prompt:: bash $ + + ceph balancer off + +The balancer mode can be changed to ``crush-compat`` mode, which is +backward compatible with older clients, and will make small changes to +the data distribution over time to ensure that OSDs are equally utilized. + + +Throttling +---------- + +No adjustments will be made to the PG distribution if the cluster is +degraded (e.g., because an OSD has failed and the system has not yet +healed itself). + +When the cluster is healthy, the balancer will throttle its changes +such that the percentage of PGs that are misplaced (i.e., that need to +be moved) is below a threshold of (by default) 5%. The +``target_max_misplaced_ratio`` threshold can be adjusted with: + + .. prompt:: bash $ + + ceph config set mgr target_max_misplaced_ratio .07 # 7% + +Set the number of seconds to sleep in between runs of the automatic balancer: + + .. prompt:: bash $ + + ceph config set mgr mgr/balancer/sleep_interval 60 + +Set the time of day to begin automatic balancing in HHMM format: + + .. prompt:: bash $ + + ceph config set mgr mgr/balancer/begin_time 0000 + +Set the time of day to finish automatic balancing in HHMM format: + + .. prompt:: bash $ + + ceph config set mgr mgr/balancer/end_time 2359 + +Restrict automatic balancing to this day of the week or later. +Uses the same conventions as crontab, 0 is Sunday, 1 is Monday, and so on: + + .. prompt:: bash $ + + ceph config set mgr mgr/balancer/begin_weekday 0 + +Restrict automatic balancing to this day of the week or earlier. +Uses the same conventions as crontab, 0 is Sunday, 1 is Monday, and so on: + + .. prompt:: bash $ + + ceph config set mgr mgr/balancer/end_weekday 6 + +Pool IDs to which the automatic balancing will be limited. +The default for this is an empty string, meaning all pools will be balanced. +The numeric pool IDs can be gotten with the :command:`ceph osd pool ls detail` command: + + .. prompt:: bash $ + + ceph config set mgr mgr/balancer/pool_ids 1,2,3 + + +Modes +----- + +There are currently two supported balancer modes: + +#. **crush-compat**. The CRUSH compat mode uses the compat weight-set + feature (introduced in Luminous) to manage an alternative set of + weights for devices in the CRUSH hierarchy. The normal weights + should remain set to the size of the device to reflect the target + amount of data that we want to store on the device. The balancer + then optimizes the weight-set values, adjusting them up or down in + small increments, in order to achieve a distribution that matches + the target distribution as closely as possible. (Because PG + placement is a pseudorandom process, there is a natural amount of + variation in the placement; by optimizing the weights we + counter-act that natural variation.) + + Notably, this mode is *fully backwards compatible* with older + clients: when an OSDMap and CRUSH map is shared with older clients, + we present the optimized weights as the "real" weights. + + The primary restriction of this mode is that the balancer cannot + handle multiple CRUSH hierarchies with different placement rules if + the subtrees of the hierarchy share any OSDs. (This is normally + not the case, and is generally not a recommended configuration + because it is hard to manage the space utilization on the shared + OSDs.) + +#. **upmap**. Starting with Luminous, the OSDMap can store explicit + mappings for individual OSDs as exceptions to the normal CRUSH + placement calculation. These `upmap` entries provide fine-grained + control over the PG mapping. This CRUSH mode will optimize the + placement of individual PGs in order to achieve a balanced + distribution. In most cases, this distribution is "perfect," which + an equal number of PGs on each OSD (+/-1 PG, since they might not + divide evenly). + + Note that using upmap requires that all clients be Luminous or newer. + +The default mode is ``upmap``. The mode can be adjusted with: + + .. prompt:: bash $ + + ceph balancer mode crush-compat + +Supervised optimization +----------------------- + +The balancer operation is broken into a few distinct phases: + +#. building a *plan* +#. evaluating the quality of the data distribution, either for the current PG distribution, or the PG distribution that would result after executing a *plan* +#. executing the *plan* + +To evaluate and score the current distribution: + + .. prompt:: bash $ + + ceph balancer eval + +You can also evaluate the distribution for a single pool with: + + .. prompt:: bash $ + + ceph balancer eval <pool-name> + +Greater detail for the evaluation can be seen with: + + .. prompt:: bash $ + + ceph balancer eval-verbose ... + +The balancer can generate a plan, using the currently configured mode, with: + + .. prompt:: bash $ + + ceph balancer optimize <plan-name> + +The name is provided by the user and can be any useful identifying string. The contents of a plan can be seen with: + + .. prompt:: bash $ + + ceph balancer show <plan-name> + +All plans can be shown with: + + .. prompt:: bash $ + + ceph balancer ls + +Old plans can be discarded with: + + .. prompt:: bash $ + + ceph balancer rm <plan-name> + +Currently recorded plans are shown as part of the status command: + + .. prompt:: bash $ + + ceph balancer status + +The quality of the distribution that would result after executing a plan can be calculated with: + + .. prompt:: bash $ + + ceph balancer eval <plan-name> + +Assuming the plan is expected to improve the distribution (i.e., it has a lower score than the current cluster state), the user can execute that plan with: + + .. prompt:: bash $ + + ceph balancer execute <plan-name> + diff --git a/doc/rados/operations/bluestore-migration.rst b/doc/rados/operations/bluestore-migration.rst new file mode 100644 index 000000000..1ac5f2b13 --- /dev/null +++ b/doc/rados/operations/bluestore-migration.rst @@ -0,0 +1,338 @@ +===================== + BlueStore Migration +===================== + +Each OSD can run either BlueStore or FileStore, and a single Ceph +cluster can contain a mix of both. Users who have previously deployed +FileStore are likely to want to transition to BlueStore in order to +take advantage of the improved performance and robustness. There are +several strategies for making such a transition. + +An individual OSD cannot be converted in place in isolation, however: +BlueStore and FileStore are simply too different for that to be +practical. "Conversion" will rely either on the cluster's normal +replication and healing support or tools and strategies that copy OSD +content from an old (FileStore) device to a new (BlueStore) one. + + +Deploy new OSDs with BlueStore +============================== + +Any new OSDs (e.g., when the cluster is expanded) can be deployed +using BlueStore. This is the default behavior so no specific change +is needed. + +Similarly, any OSDs that are reprovisioned after replacing a failed drive +can use BlueStore. + +Convert existing OSDs +===================== + +Mark out and replace +-------------------- + +The simplest approach is to mark out each device in turn, wait for the +data to replicate across the cluster, reprovision the OSD, and mark +it back in again. It is simple and easy to automate. However, it requires +more data migration than should be necessary, so it is not optimal. + +#. Identify a FileStore OSD to replace:: + + ID=<osd-id-number> + DEVICE=<disk-device> + + You can tell whether a given OSD is FileStore or BlueStore with: + + .. prompt:: bash $ + + ceph osd metadata $ID | grep osd_objectstore + + You can get a current count of filestore vs bluestore with: + + .. prompt:: bash $ + + ceph osd count-metadata osd_objectstore + +#. Mark the filestore OSD out: + + .. prompt:: bash $ + + ceph osd out $ID + +#. Wait for the data to migrate off the OSD in question: + + .. prompt:: bash $ + + while ! ceph osd safe-to-destroy $ID ; do sleep 60 ; done + +#. Stop the OSD: + + .. prompt:: bash $ + + systemctl kill ceph-osd@$ID + +#. Make note of which device this OSD is using: + + .. prompt:: bash $ + + mount | grep /var/lib/ceph/osd/ceph-$ID + +#. Unmount the OSD: + + .. prompt:: bash $ + + umount /var/lib/ceph/osd/ceph-$ID + +#. Destroy the OSD data. Be *EXTREMELY CAREFUL* as this will destroy + the contents of the device; be certain the data on the device is + not needed (i.e., that the cluster is healthy) before proceeding: + + .. prompt:: bash $ + + ceph-volume lvm zap $DEVICE + +#. Tell the cluster the OSD has been destroyed (and a new OSD can be + reprovisioned with the same ID): + + .. prompt:: bash $ + + ceph osd destroy $ID --yes-i-really-mean-it + +#. Reprovision a BlueStore OSD in its place with the same OSD ID. + This requires you do identify which device to wipe based on what you saw + mounted above. BE CAREFUL! : + + .. prompt:: bash $ + + ceph-volume lvm create --bluestore --data $DEVICE --osd-id $ID + +#. Repeat. + +You can allow the refilling of the replacement OSD to happen +concurrently with the draining of the next OSD, or follow the same +procedure for multiple OSDs in parallel, as long as you ensure the +cluster is fully clean (all data has all replicas) before destroying +any OSDs. Failure to do so will reduce the redundancy of your data +and increase the risk of (or potentially even cause) data loss. + +Advantages: + +* Simple. +* Can be done on a device-by-device basis. +* No spare devices or hosts are required. + +Disadvantages: + +* Data is copied over the network twice: once to some other OSD in the + cluster (to maintain the desired number of replicas), and then again + back to the reprovisioned BlueStore OSD. + + +Whole host replacement +---------------------- + +If you have a spare host in the cluster, or have sufficient free space +to evacuate an entire host in order to use it as a spare, then the +conversion can be done on a host-by-host basis with each stored copy of +the data migrating only once. + +First, you need have empty host that has no data. There are two ways to do this: either by starting with a new, empty host that isn't yet part of the cluster, or by offloading data from an existing host that in the cluster. + +Use a new, empty host +^^^^^^^^^^^^^^^^^^^^^ + +Ideally the host should have roughly the +same capacity as other hosts you will be converting (although it +doesn't strictly matter). :: + + NEWHOST=<empty-host-name> + +Add the host to the CRUSH hierarchy, but do not attach it to the root: + +.. prompt:: bash $ + + ceph osd crush add-bucket $NEWHOST host + +Make sure the ceph packages are installed. + +Use an existing host +^^^^^^^^^^^^^^^^^^^^ + +If you would like to use an existing host +that is already part of the cluster, and there is sufficient free +space on that host so that all of its data can be migrated off, +then you can instead do:: + + OLDHOST=<existing-cluster-host-to-offload> + +.. prompt:: bash $ + + ceph osd crush unlink $OLDHOST default + +where "default" is the immediate ancestor in the CRUSH map. (For +smaller clusters with unmodified configurations this will normally +be "default", but it might also be a rack name.) You should now +see the host at the top of the OSD tree output with no parent: + +.. prompt:: bash $ + + bin/ceph osd tree + +:: + + ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF + -5 0 host oldhost + 10 ssd 1.00000 osd.10 up 1.00000 1.00000 + 11 ssd 1.00000 osd.11 up 1.00000 1.00000 + 12 ssd 1.00000 osd.12 up 1.00000 1.00000 + -1 3.00000 root default + -2 3.00000 host foo + 0 ssd 1.00000 osd.0 up 1.00000 1.00000 + 1 ssd 1.00000 osd.1 up 1.00000 1.00000 + 2 ssd 1.00000 osd.2 up 1.00000 1.00000 + ... + +If everything looks good, jump directly to the "Wait for data +migration to complete" step below and proceed from there to clean up +the old OSDs. + +Migration process +^^^^^^^^^^^^^^^^^ + +If you're using a new host, start at step #1. For an existing host, +jump to step #5 below. + +#. Provision new BlueStore OSDs for all devices: + + .. prompt:: bash $ + + ceph-volume lvm create --bluestore --data /dev/$DEVICE + +#. Verify OSDs join the cluster with: + + .. prompt:: bash $ + + ceph osd tree + + You should see the new host ``$NEWHOST`` with all of the OSDs beneath + it, but the host should *not* be nested beneath any other node in + hierarchy (like ``root default``). For example, if ``newhost`` is + the empty host, you might see something like:: + + $ bin/ceph osd tree + ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF + -5 0 host newhost + 10 ssd 1.00000 osd.10 up 1.00000 1.00000 + 11 ssd 1.00000 osd.11 up 1.00000 1.00000 + 12 ssd 1.00000 osd.12 up 1.00000 1.00000 + -1 3.00000 root default + -2 3.00000 host oldhost1 + 0 ssd 1.00000 osd.0 up 1.00000 1.00000 + 1 ssd 1.00000 osd.1 up 1.00000 1.00000 + 2 ssd 1.00000 osd.2 up 1.00000 1.00000 + ... + +#. Identify the first target host to convert : + + .. prompt:: bash $ + + OLDHOST=<existing-cluster-host-to-convert> + +#. Swap the new host into the old host's position in the cluster: + + .. prompt:: bash $ + + ceph osd crush swap-bucket $NEWHOST $OLDHOST + + At this point all data on ``$OLDHOST`` will start migrating to OSDs + on ``$NEWHOST``. If there is a difference in the total capacity of + the old and new hosts you may also see some data migrate to or from + other nodes in the cluster, but as long as the hosts are similarly + sized this will be a relatively small amount of data. + +#. Wait for data migration to complete: + + .. prompt:: bash $ + + while ! ceph osd safe-to-destroy $(ceph osd ls-tree $OLDHOST); do sleep 60 ; done + +#. Stop all old OSDs on the now-empty ``$OLDHOST``: + + .. prompt:: bash $ + + ssh $OLDHOST + systemctl kill ceph-osd.target + umount /var/lib/ceph/osd/ceph-* + +#. Destroy and purge the old OSDs: + + .. prompt:: bash $ + + for osd in `ceph osd ls-tree $OLDHOST`; do + ceph osd purge $osd --yes-i-really-mean-it + done + +#. Wipe the old OSD devices. This requires you do identify which + devices are to be wiped manually (BE CAREFUL!). For each device: + + .. prompt:: bash $ + + ceph-volume lvm zap $DEVICE + +#. Use the now-empty host as the new host, and repeat:: + + NEWHOST=$OLDHOST + +Advantages: + +* Data is copied over the network only once. +* Converts an entire host's OSDs at once. +* Can parallelize to converting multiple hosts at a time. +* No spare devices are required on each host. + +Disadvantages: + +* A spare host is required. +* An entire host's worth of OSDs will be migrating data at a time. This + is like likely to impact overall cluster performance. +* All migrated data still makes one full hop over the network. + + +Per-OSD device copy +------------------- + +A single logical OSD can be converted by using the ``copy`` function +of ``ceph-objectstore-tool``. This requires that the host have a free +device (or devices) to provision a new, empty BlueStore OSD. For +example, if each host in your cluster has 12 OSDs, then you'd need a +13th available device so that each OSD can be converted in turn before the +old device is reclaimed to convert the next OSD. + +Caveats: + +* This strategy requires that a blank BlueStore OSD be prepared + without allocating a new OSD ID, something that the ``ceph-volume`` + tool doesn't support. More importantly, the setup of *dmcrypt* is + closely tied to the OSD identity, which means that this approach + does not work with encrypted OSDs. + +* The device must be manually partitioned. + +* Tooling not implemented! + +* Not documented! + +Advantages: + +* Little or no data migrates over the network during the conversion. + +Disadvantages: + +* Tooling not fully implemented. +* Process not documented. +* Each host must have a spare or empty device. +* The OSD is offline during the conversion, which means new writes will + be written to only a subset of the OSDs. This increases the risk of data + loss due to a subsequent failure. (However, if there is a failure before + conversion is complete, the original FileStore OSD can be started to provide + access to its original data.) diff --git a/doc/rados/operations/cache-tiering.rst b/doc/rados/operations/cache-tiering.rst new file mode 100644 index 000000000..8056ace47 --- /dev/null +++ b/doc/rados/operations/cache-tiering.rst @@ -0,0 +1,552 @@ +=============== + Cache Tiering +=============== + +A cache tier provides Ceph Clients with better I/O performance for a subset of +the data stored in a backing storage tier. Cache tiering involves creating a +pool of relatively fast/expensive storage devices (e.g., solid state drives) +configured to act as a cache tier, and a backing pool of either erasure-coded +or relatively slower/cheaper devices configured to act as an economical storage +tier. The Ceph objecter handles where to place the objects and the tiering +agent determines when to flush objects from the cache to the backing storage +tier. So the cache tier and the backing storage tier are completely transparent +to Ceph clients. + + +.. ditaa:: + +-------------+ + | Ceph Client | + +------+------+ + ^ + Tiering is | + Transparent | Faster I/O + to Ceph | +---------------+ + Client Ops | | | + | +----->+ Cache Tier | + | | | | + | | +-----+---+-----+ + | | | ^ + v v | | Active Data in Cache Tier + +------+----+--+ | | + | Objecter | | | + +-----------+--+ | | + ^ | | Inactive Data in Storage Tier + | v | + | +-----+---+-----+ + | | | + +----->| Storage Tier | + | | + +---------------+ + Slower I/O + + +The cache tiering agent handles the migration of data between the cache tier +and the backing storage tier automatically. However, admins have the ability to +configure how this migration takes place by setting the ``cache-mode``. There are +two main scenarios: + +- **writeback** mode: If the base tier and the cache tier are configured in + ``writeback`` mode, Ceph clients receive an ACK from the base tier every time + they write data to it. Then the cache tiering agent determines whether + ``osd_tier_default_cache_min_write_recency_for_promote`` has been set. If it + has been set and the data has been written more than a specified number of + times per interval, the data is promoted to the cache tier. + + When Ceph clients need access to data stored in the base tier, the cache + tiering agent reads the data from the base tier and returns it to the client. + While data is being read from the base tier, the cache tiering agent consults + the value of ``osd_tier_default_cache_min_read_recency_for_promote`` and + decides whether to promote that data from the base tier to the cache tier. + When data has been promoted from the base tier to the cache tier, the Ceph + client is able to perform I/O operations on it using the cache tier. This is + well-suited for mutable data (for example, photo/video editing, transactional + data). + +- **readproxy** mode: This mode will use any objects that already + exist in the cache tier, but if an object is not present in the + cache the request will be proxied to the base tier. This is useful + for transitioning from ``writeback`` mode to a disabled cache as it + allows the workload to function properly while the cache is drained, + without adding any new objects to the cache. + +Other cache modes are: + +- **readonly** promotes objects to the cache on read operations only; write + operations are forwarded to the base tier. This mode is intended for + read-only workloads that do not require consistency to be enforced by the + storage system. (**Warning**: when objects are updated in the base tier, + Ceph makes **no** attempt to sync these updates to the corresponding objects + in the cache. Since this mode is considered experimental, a + ``--yes-i-really-mean-it`` option must be passed in order to enable it.) + +- **none** is used to completely disable caching. + + +A word of caution +================= + +Cache tiering will *degrade* performance for most workloads. Users should use +extreme caution before using this feature. + +* *Workload dependent*: Whether a cache will improve performance is + highly dependent on the workload. Because there is a cost + associated with moving objects into or out of the cache, it can only + be effective when there is a *large skew* in the access pattern in + the data set, such that most of the requests touch a small number of + objects. The cache pool should be large enough to capture the + working set for your workload to avoid thrashing. + +* *Difficult to benchmark*: Most benchmarks that users run to measure + performance will show terrible performance with cache tiering, in + part because very few of them skew requests toward a small set of + objects, it can take a long time for the cache to "warm up," and + because the warm-up cost can be high. + +* *Usually slower*: For workloads that are not cache tiering-friendly, + performance is often slower than a normal RADOS pool without cache + tiering enabled. + +* *librados object enumeration*: The librados-level object enumeration + API is not meant to be coherent in the presence of the case. If + your application is using librados directly and relies on object + enumeration, cache tiering will probably not work as expected. + (This is not a problem for RGW, RBD, or CephFS.) + +* *Complexity*: Enabling cache tiering means that a lot of additional + machinery and complexity within the RADOS cluster is being used. + This increases the probability that you will encounter a bug in the system + that other users have not yet encountered and will put your deployment at a + higher level of risk. + +Known Good Workloads +-------------------- + +* *RGW time-skewed*: If the RGW workload is such that almost all read + operations are directed at recently written objects, a simple cache + tiering configuration that destages recently written objects from + the cache to the base tier after a configurable period can work + well. + +Known Bad Workloads +------------------- + +The following configurations are *known to work poorly* with cache +tiering. + +* *RBD with replicated cache and erasure-coded base*: This is a common + request, but usually does not perform well. Even reasonably skewed + workloads still send some small writes to cold objects, and because + small writes are not yet supported by the erasure-coded pool, entire + (usually 4 MB) objects must be migrated into the cache in order to + satisfy a small (often 4 KB) write. Only a handful of users have + successfully deployed this configuration, and it only works for them + because their data is extremely cold (backups) and they are not in + any way sensitive to performance. + +* *RBD with replicated cache and base*: RBD with a replicated base + tier does better than when the base is erasure coded, but it is + still highly dependent on the amount of skew in the workload, and + very difficult to validate. The user will need to have a good + understanding of their workload and will need to tune the cache + tiering parameters carefully. + + +Setting Up Pools +================ + +To set up cache tiering, you must have two pools. One will act as the +backing storage and the other will act as the cache. + + +Setting Up a Backing Storage Pool +--------------------------------- + +Setting up a backing storage pool typically involves one of two scenarios: + +- **Standard Storage**: In this scenario, the pool stores multiple copies + of an object in the Ceph Storage Cluster. + +- **Erasure Coding:** In this scenario, the pool uses erasure coding to + store data much more efficiently with a small performance tradeoff. + +In the standard storage scenario, you can setup a CRUSH rule to establish +the failure domain (e.g., osd, host, chassis, rack, row, etc.). Ceph OSD +Daemons perform optimally when all storage drives in the rule are of the +same size, speed (both RPMs and throughput) and type. See `CRUSH Maps`_ +for details on creating a rule. Once you have created a rule, create +a backing storage pool. + +In the erasure coding scenario, the pool creation arguments will generate the +appropriate rule automatically. See `Create a Pool`_ for details. + +In subsequent examples, we will refer to the backing storage pool +as ``cold-storage``. + + +Setting Up a Cache Pool +----------------------- + +Setting up a cache pool follows the same procedure as the standard storage +scenario, but with this difference: the drives for the cache tier are typically +high performance drives that reside in their own servers and have their own +CRUSH rule. When setting up such a rule, it should take account of the hosts +that have the high performance drives while omitting the hosts that don't. See +:ref:`CRUSH Device Class<crush-map-device-class>` for details. + + +In subsequent examples, we will refer to the cache pool as ``hot-storage`` and +the backing pool as ``cold-storage``. + +For cache tier configuration and default values, see +`Pools - Set Pool Values`_. + + +Creating a Cache Tier +===================== + +Setting up a cache tier involves associating a backing storage pool with +a cache pool: + +.. prompt:: bash $ + + ceph osd tier add {storagepool} {cachepool} + +For example: + +.. prompt:: bash $ + + ceph osd tier add cold-storage hot-storage + +To set the cache mode, execute the following: + +.. prompt:: bash $ + + ceph osd tier cache-mode {cachepool} {cache-mode} + +For example: + +.. prompt:: bash $ + + ceph osd tier cache-mode hot-storage writeback + +The cache tiers overlay the backing storage tier, so they require one +additional step: you must direct all client traffic from the storage pool to +the cache pool. To direct client traffic directly to the cache pool, execute +the following: + +.. prompt:: bash $ + + ceph osd tier set-overlay {storagepool} {cachepool} + +For example: + +.. prompt:: bash $ + + ceph osd tier set-overlay cold-storage hot-storage + + +Configuring a Cache Tier +======================== + +Cache tiers have several configuration options. You may set +cache tier configuration options with the following usage: + +.. prompt:: bash $ + + ceph osd pool set {cachepool} {key} {value} + +See `Pools - Set Pool Values`_ for details. + + +Target Size and Type +-------------------- + +Ceph's production cache tiers use a `Bloom Filter`_ for the ``hit_set_type``: + +.. prompt:: bash $ + + ceph osd pool set {cachepool} hit_set_type bloom + +For example: + +.. prompt:: bash $ + + ceph osd pool set hot-storage hit_set_type bloom + +The ``hit_set_count`` and ``hit_set_period`` define how many such HitSets to +store, and how much time each HitSet should cover: + +.. prompt:: bash $ + + ceph osd pool set {cachepool} hit_set_count 12 + ceph osd pool set {cachepool} hit_set_period 14400 + ceph osd pool set {cachepool} target_max_bytes 1000000000000 + +.. note:: A larger ``hit_set_count`` results in more RAM consumed by + the ``ceph-osd`` process. + +Binning accesses over time allows Ceph to determine whether a Ceph client +accessed an object at least once, or more than once over a time period +("age" vs "temperature"). + +The ``min_read_recency_for_promote`` defines how many HitSets to check for the +existence of an object when handling a read operation. The checking result is +used to decide whether to promote the object asynchronously. Its value should be +between 0 and ``hit_set_count``. If it's set to 0, the object is always promoted. +If it's set to 1, the current HitSet is checked. And if this object is in the +current HitSet, it's promoted. Otherwise not. For the other values, the exact +number of archive HitSets are checked. The object is promoted if the object is +found in any of the most recent ``min_read_recency_for_promote`` HitSets. + +A similar parameter can be set for the write operation, which is +``min_write_recency_for_promote``: + +.. prompt:: bash $ + + ceph osd pool set {cachepool} min_read_recency_for_promote 2 + ceph osd pool set {cachepool} min_write_recency_for_promote 2 + +.. note:: The longer the period and the higher the + ``min_read_recency_for_promote`` and + ``min_write_recency_for_promote``values, the more RAM the ``ceph-osd`` + daemon consumes. In particular, when the agent is active to flush + or evict cache objects, all ``hit_set_count`` HitSets are loaded + into RAM. + + +Cache Sizing +------------ + +The cache tiering agent performs two main functions: + +- **Flushing:** The agent identifies modified (or dirty) objects and forwards + them to the storage pool for long-term storage. + +- **Evicting:** The agent identifies objects that haven't been modified + (or clean) and evicts the least recently used among them from the cache. + + +Absolute Sizing +~~~~~~~~~~~~~~~ + +The cache tiering agent can flush or evict objects based upon the total number +of bytes or the total number of objects. To specify a maximum number of bytes, +execute the following: + +.. prompt:: bash $ + + ceph osd pool set {cachepool} target_max_bytes {#bytes} + +For example, to flush or evict at 1 TB, execute the following: + +.. prompt:: bash $ + + ceph osd pool set hot-storage target_max_bytes 1099511627776 + +To specify the maximum number of objects, execute the following: + +.. prompt:: bash $ + + ceph osd pool set {cachepool} target_max_objects {#objects} + +For example, to flush or evict at 1M objects, execute the following: + +.. prompt:: bash $ + + ceph osd pool set hot-storage target_max_objects 1000000 + +.. note:: Ceph is not able to determine the size of a cache pool automatically, so + the configuration on the absolute size is required here, otherwise the + flush/evict will not work. If you specify both limits, the cache tiering + agent will begin flushing or evicting when either threshold is triggered. + +.. note:: All client requests will be blocked only when ``target_max_bytes`` or + ``target_max_objects`` reached + +Relative Sizing +~~~~~~~~~~~~~~~ + +The cache tiering agent can flush or evict objects relative to the size of the +cache pool(specified by ``target_max_bytes`` / ``target_max_objects`` in +`Absolute sizing`_). When the cache pool consists of a certain percentage of +modified (or dirty) objects, the cache tiering agent will flush them to the +storage pool. To set the ``cache_target_dirty_ratio``, execute the following: + +.. prompt:: bash $ + + ceph osd pool set {cachepool} cache_target_dirty_ratio {0.0..1.0} + +For example, setting the value to ``0.4`` will begin flushing modified +(dirty) objects when they reach 40% of the cache pool's capacity: + +.. prompt:: bash $ + + ceph osd pool set hot-storage cache_target_dirty_ratio 0.4 + +When the dirty objects reaches a certain percentage of its capacity, flush dirty +objects with a higher speed. To set the ``cache_target_dirty_high_ratio``: + +.. prompt:: bash $ + + ceph osd pool set {cachepool} cache_target_dirty_high_ratio {0.0..1.0} + +For example, setting the value to ``0.6`` will begin aggressively flush dirty +objects when they reach 60% of the cache pool's capacity. obviously, we'd +better set the value between dirty_ratio and full_ratio: + +.. prompt:: bash $ + + ceph osd pool set hot-storage cache_target_dirty_high_ratio 0.6 + +When the cache pool reaches a certain percentage of its capacity, the cache +tiering agent will evict objects to maintain free capacity. To set the +``cache_target_full_ratio``, execute the following: + +.. prompt:: bash $ + + ceph osd pool set {cachepool} cache_target_full_ratio {0.0..1.0} + +For example, setting the value to ``0.8`` will begin flushing unmodified +(clean) objects when they reach 80% of the cache pool's capacity: + +.. prompt:: bash $ + + ceph osd pool set hot-storage cache_target_full_ratio 0.8 + + +Cache Age +--------- + +You can specify the minimum age of an object before the cache tiering agent +flushes a recently modified (or dirty) object to the backing storage pool: + +.. prompt:: bash $ + + ceph osd pool set {cachepool} cache_min_flush_age {#seconds} + +For example, to flush modified (or dirty) objects after 10 minutes, execute the +following: + +.. prompt:: bash $ + + ceph osd pool set hot-storage cache_min_flush_age 600 + +You can specify the minimum age of an object before it will be evicted from the +cache tier: + +.. prompt:: bash $ + + ceph osd pool {cache-tier} cache_min_evict_age {#seconds} + +For example, to evict objects after 30 minutes, execute the following: + +.. prompt:: bash $ + + ceph osd pool set hot-storage cache_min_evict_age 1800 + + +Removing a Cache Tier +===================== + +Removing a cache tier differs depending on whether it is a writeback +cache or a read-only cache. + + +Removing a Read-Only Cache +-------------------------- + +Since a read-only cache does not have modified data, you can disable +and remove it without losing any recent changes to objects in the cache. + +#. Change the cache-mode to ``none`` to disable it.: + + .. prompt:: bash + + ceph osd tier cache-mode {cachepool} none + + For example: + + .. prompt:: bash $ + + ceph osd tier cache-mode hot-storage none + +#. Remove the cache pool from the backing pool.: + + .. prompt:: bash $ + + ceph osd tier remove {storagepool} {cachepool} + + For example: + + .. prompt:: bash $ + + ceph osd tier remove cold-storage hot-storage + + +Removing a Writeback Cache +-------------------------- + +Since a writeback cache may have modified data, you must take steps to ensure +that you do not lose any recent changes to objects in the cache before you +disable and remove it. + + +#. Change the cache mode to ``proxy`` so that new and modified objects will + flush to the backing storage pool.: + + .. prompt:: bash $ + + ceph osd tier cache-mode {cachepool} proxy + + For example: + + .. prompt:: bash $ + + ceph osd tier cache-mode hot-storage proxy + + +#. Ensure that the cache pool has been flushed. This may take a few minutes: + + .. prompt:: bash $ + + rados -p {cachepool} ls + + If the cache pool still has objects, you can flush them manually. + For example: + + .. prompt:: bash $ + + rados -p {cachepool} cache-flush-evict-all + + +#. Remove the overlay so that clients will not direct traffic to the cache.: + + .. prompt:: bash $ + + ceph osd tier remove-overlay {storagetier} + + For example: + + .. prompt:: bash $ + + ceph osd tier remove-overlay cold-storage + + +#. Finally, remove the cache tier pool from the backing storage pool.: + + .. prompt:: bash $ + + ceph osd tier remove {storagepool} {cachepool} + + For example: + + .. prompt:: bash $ + + ceph osd tier remove cold-storage hot-storage + + +.. _Create a Pool: ../pools#create-a-pool +.. _Pools - Set Pool Values: ../pools#set-pool-values +.. _Bloom Filter: https://en.wikipedia.org/wiki/Bloom_filter +.. _CRUSH Maps: ../crush-map +.. _Absolute Sizing: #absolute-sizing diff --git a/doc/rados/operations/change-mon-elections.rst b/doc/rados/operations/change-mon-elections.rst new file mode 100644 index 000000000..eba730bdc --- /dev/null +++ b/doc/rados/operations/change-mon-elections.rst @@ -0,0 +1,88 @@ +.. _changing_monitor_elections: + +===================================== +Configure Monitor Election Strategies +===================================== + +By default, the monitors will use the ``classic`` mode. We +recommend that you stay in this mode unless you have a very specific reason. + +If you want to switch modes BEFORE constructing the cluster, change +the ``mon election default strategy`` option. This option is an integer value: + +* 1 for "classic" +* 2 for "disallow" +* 3 for "connectivity" + +Once your cluster is running, you can change strategies by running :: + + $ ceph mon set election_strategy {classic|disallow|connectivity} + +Choosing a mode +=============== +The modes other than classic provide different features. We recommend +you stay in classic mode if you don't need the extra features as it is +the simplest mode. + +The disallow Mode +================= +This mode lets you mark monitors as disallowd, in which case they will +participate in the quorum and serve clients, but cannot be elected leader. You +may wish to use this if you have some monitors which are known to be far away +from clients. +You can disallow a leader by running: + +.. prompt:: bash $ + + ceph mon add disallowed_leader {name} + +You can remove a monitor from the disallowed list, and allow it to become +a leader again, by running: + +.. prompt:: bash $ + + ceph mon rm disallowed_leader {name} + +The list of disallowed_leaders is included when you run: + +.. prompt:: bash $ + + ceph mon dump + +The connectivity Mode +===================== +This mode evaluates connection scores provided by each monitor for its +peers and elects the monitor with the highest score. This mode is designed +to handle network partitioning or *net-splits*, which may happen if your cluster +is stretched across multiple data centers or otherwise has a non-uniform +or unbalanced network topology. + +This mode also supports disallowing monitors from being the leader +using the same commands as above in disallow. + +Examining connectivity scores +============================= +The monitors maintain connection scores even if they aren't in +the connectivity election mode. You can examine the scores a monitor +has by running: + +.. prompt:: bash $ + + ceph daemon mon.{name} connection scores dump + +Scores for individual connections range from 0-1 inclusive, and also +include whether the connection is considered alive or dead (determined by +whether it returned its latest ping within the timeout). + +While this would be an unexpected occurrence, if for some reason you experience +problems and troubleshooting makes you think your scores have become invalid, +you can forget history and reset them by running: + +.. prompt:: bash $ + + ceph daemon mon.{name} connection scores reset + +While resetting scores has low risk (monitors will still quickly determine +if a connection is alive or dead, and trend back to the previous scores if they +were accurate!), it should also not be needed and is not recommended unless +requested by your support team or a developer. diff --git a/doc/rados/operations/control.rst b/doc/rados/operations/control.rst new file mode 100644 index 000000000..d7a512618 --- /dev/null +++ b/doc/rados/operations/control.rst @@ -0,0 +1,601 @@ +.. index:: control, commands + +================== + Control Commands +================== + + +Monitor Commands +================ + +Monitor commands are issued using the ``ceph`` utility: + +.. prompt:: bash $ + + ceph [-m monhost] {command} + +The command is usually (though not always) of the form: + +.. prompt:: bash $ + + ceph {subsystem} {command} + + +System Commands +=============== + +Execute the following to display the current cluster status. : + +.. prompt:: bash $ + + ceph -s + ceph status + +Execute the following to display a running summary of cluster status +and major events. : + +.. prompt:: bash $ + + ceph -w + +Execute the following to show the monitor quorum, including which monitors are +participating and which one is the leader. : + +.. prompt:: bash $ + + ceph mon stat + ceph quorum_status + +Execute the following to query the status of a single monitor, including whether +or not it is in the quorum. : + +.. prompt:: bash $ + + ceph tell mon.[id] mon_status + +where the value of ``[id]`` can be determined, e.g., from ``ceph -s``. + + +Authentication Subsystem +======================== + +To add a keyring for an OSD, execute the following: + +.. prompt:: bash $ + + ceph auth add {osd} {--in-file|-i} {path-to-osd-keyring} + +To list the cluster's keys and their capabilities, execute the following: + +.. prompt:: bash $ + + ceph auth ls + + +Placement Group Subsystem +========================= + +To display the statistics for all placement groups (PGs), execute the following: + +.. prompt:: bash $ + + ceph pg dump [--format {format}] + +The valid formats are ``plain`` (default), ``json`` ``json-pretty``, ``xml``, and ``xml-pretty``. +When implementing monitoring and other tools, it is best to use ``json`` format. +JSON parsing is more deterministic than the human-oriented ``plain``, and the layout is much +less variable from release to release. The ``jq`` utility can be invaluable when extracting +data from JSON output. + +To display the statistics for all placement groups stuck in a specified state, +execute the following: + +.. prompt:: bash $ + + ceph pg dump_stuck inactive|unclean|stale|undersized|degraded [--format {format}] [-t|--threshold {seconds}] + + +``--format`` may be ``plain`` (default), ``json``, ``json-pretty``, ``xml``, or ``xml-pretty``. + +``--threshold`` defines how many seconds "stuck" is (default: 300) + +**Inactive** Placement groups cannot process reads or writes because they are waiting for an OSD +with the most up-to-date data to come back. + +**Unclean** Placement groups contain objects that are not replicated the desired number +of times. They should be recovering. + +**Stale** Placement groups are in an unknown state - the OSDs that host them have not +reported to the monitor cluster in a while (configured by +``mon_osd_report_timeout``). + +Delete "lost" objects or revert them to their prior state, either a previous version +or delete them if they were just created. : + +.. prompt:: bash $ + + ceph pg {pgid} mark_unfound_lost revert|delete + + +.. _osd-subsystem: + +OSD Subsystem +============= + +Query OSD subsystem status. : + +.. prompt:: bash $ + + ceph osd stat + +Write a copy of the most recent OSD map to a file. See +:ref:`osdmaptool <osdmaptool>`. : + +.. prompt:: bash $ + + ceph osd getmap -o file + +Write a copy of the crush map from the most recent OSD map to +file. : + +.. prompt:: bash $ + + ceph osd getcrushmap -o file + +The foregoing is functionally equivalent to : + +.. prompt:: bash $ + + ceph osd getmap -o /tmp/osdmap + osdmaptool /tmp/osdmap --export-crush file + +Dump the OSD map. Valid formats for ``-f`` are ``plain``, ``json``, ``json-pretty``, +``xml``, and ``xml-pretty``. If no ``--format`` option is given, the OSD map is +dumped as plain text. As above, JSON format is best for tools, scripting, and other automation. : + +.. prompt:: bash $ + + ceph osd dump [--format {format}] + +Dump the OSD map as a tree with one line per OSD containing weight +and state. : + +.. prompt:: bash $ + + ceph osd tree [--format {format}] + +Find out where a specific object is or would be stored in the system: + +.. prompt:: bash $ + + ceph osd map <pool-name> <object-name> + +Add or move a new item (OSD) with the given id/name/weight at the specified +location. : + +.. prompt:: bash $ + + ceph osd crush set {id} {weight} [{loc1} [{loc2} ...]] + +Remove an existing item (OSD) from the CRUSH map. : + +.. prompt:: bash $ + + ceph osd crush remove {name} + +Remove an existing bucket from the CRUSH map. : + +.. prompt:: bash $ + + ceph osd crush remove {bucket-name} + +Move an existing bucket from one position in the hierarchy to another. : + +.. prompt:: bash $ + + ceph osd crush move {id} {loc1} [{loc2} ...] + +Set the weight of the item given by ``{name}`` to ``{weight}``. : + +.. prompt:: bash $ + + ceph osd crush reweight {name} {weight} + +Mark an OSD as ``lost``. This may result in permanent data loss. Use with caution. : + +.. prompt:: bash $ + + ceph osd lost {id} [--yes-i-really-mean-it] + +Create a new OSD. If no UUID is given, it will be set automatically when the OSD +starts up. : + +.. prompt:: bash $ + + ceph osd create [{uuid}] + +Remove the given OSD(s). : + +.. prompt:: bash $ + + ceph osd rm [{id}...] + +Query the current ``max_osd`` parameter in the OSD map. : + +.. prompt:: bash $ + + ceph osd getmaxosd + +Import the given crush map. : + +.. prompt:: bash $ + + ceph osd setcrushmap -i file + +Set the ``max_osd`` parameter in the OSD map. This defaults to 10000 now so +most admins will never need to adjust this. : + +.. prompt:: bash $ + + ceph osd setmaxosd + +Mark OSD ``{osd-num}`` down. : + +.. prompt:: bash $ + + ceph osd down {osd-num} + +Mark OSD ``{osd-num}`` out of the distribution (i.e. allocated no data). : + +.. prompt:: bash $ + + ceph osd out {osd-num} + +Mark ``{osd-num}`` in the distribution (i.e. allocated data). : + +.. prompt:: bash $ + + ceph osd in {osd-num} + +Set or clear the pause flags in the OSD map. If set, no IO requests +will be sent to any OSD. Clearing the flags via unpause results in +resending pending requests. : + +.. prompt:: bash $ + + ceph osd pause + ceph osd unpause + +Set the override weight (reweight) of ``{osd-num}`` to ``{weight}``. Two OSDs with the +same weight will receive roughly the same number of I/O requests and +store approximately the same amount of data. ``ceph osd reweight`` +sets an override weight on the OSD. This value is in the range 0 to 1, +and forces CRUSH to re-place (1-weight) of the data that would +otherwise live on this drive. It does not change weights assigned +to the buckets above the OSD in the crush map, and is a corrective +measure in case the normal CRUSH distribution is not working out quite +right. For instance, if one of your OSDs is at 90% and the others are +at 50%, you could reduce this weight to compensate. : + +.. prompt:: bash $ + + ceph osd reweight {osd-num} {weight} + +Balance OSD fullness by reducing the override weight of OSDs which are +overly utilized. Note that these override aka ``reweight`` values +default to 1.00000 and are relative only to each other; they not absolute. +It is crucial to distinguish them from CRUSH weights, which reflect the +absolute capacity of a bucket in TiB. By default this command adjusts +override weight on OSDs which have + or - 20% of the average utilization, +but if you include a ``threshold`` that percentage will be used instead. : + +.. prompt:: bash $ + + ceph osd reweight-by-utilization [threshold [max_change [max_osds]]] [--no-increasing] + +To limit the step by which any OSD's reweight will be changed, specify +``max_change`` which defaults to 0.05. To limit the number of OSDs that will +be adjusted, specify ``max_osds`` as well; the default is 4. Increasing these +parameters can speed leveling of OSD utilization, at the potential cost of +greater impact on client operations due to more data moving at once. + +To determine which and how many PGs and OSDs will be affected by a given invocation +you can test before executing. : + +.. prompt:: bash $ + + ceph osd test-reweight-by-utilization [threshold [max_change max_osds]] [--no-increasing] + +Adding ``--no-increasing`` to either command prevents increasing any +override weights that are currently < 1.00000. This can be useful when +you are balancing in a hurry to remedy ``full`` or ``nearful`` OSDs or +when some OSDs are being evacuated or slowly brought into service. + +Deployments utilizing Nautilus (or later revisions of Luminous and Mimic) +that have no pre-Luminous cients may instead wish to instead enable the +`balancer`` module for ``ceph-mgr``. + +Add/remove an IP address or CIDR range to/from the blocklist. +When adding to the blocklist, +you can specify how long it should be blocklisted in seconds; otherwise, +it will default to 1 hour. A blocklisted address is prevented from +connecting to any OSD. If you blocklist an IP or range containing an OSD, be aware +that OSD will also be prevented from performing operations on its peers where it +acts as a client. (This includes tiering and copy-from functionality.) + +If you want to blocklist a range (in CIDR format), you may do so by +including the ``range`` keyword. + +These commands are mostly only useful for failure testing, as +blocklists are normally maintained automatically and shouldn't need +manual intervention. : + +.. prompt:: bash $ + + ceph osd blocklist ["range"] add ADDRESS[:source_port][/netmask_bits] [TIME] + ceph osd blocklist ["range"] rm ADDRESS[:source_port][/netmask_bits] + +Creates/deletes a snapshot of a pool. : + +.. prompt:: bash $ + + ceph osd pool mksnap {pool-name} {snap-name} + ceph osd pool rmsnap {pool-name} {snap-name} + +Creates/deletes/renames a storage pool. : + +.. prompt:: bash $ + + ceph osd pool create {pool-name} [pg_num [pgp_num]] + ceph osd pool delete {pool-name} [{pool-name} --yes-i-really-really-mean-it] + ceph osd pool rename {old-name} {new-name} + +Changes a pool setting. : + +.. prompt:: bash $ + + ceph osd pool set {pool-name} {field} {value} + +Valid fields are: + + * ``size``: Sets the number of copies of data in the pool. + * ``pg_num``: The placement group number. + * ``pgp_num``: Effective number when calculating pg placement. + * ``crush_rule``: rule number for mapping placement. + +Get the value of a pool setting. : + +.. prompt:: bash $ + + ceph osd pool get {pool-name} {field} + +Valid fields are: + + * ``pg_num``: The placement group number. + * ``pgp_num``: Effective number of placement groups when calculating placement. + + +Sends a scrub command to OSD ``{osd-num}``. To send the command to all OSDs, use ``*``. : + +.. prompt:: bash $ + + ceph osd scrub {osd-num} + +Sends a repair command to OSD.N. To send the command to all OSDs, use ``*``. : + +.. prompt:: bash $ + + ceph osd repair N + +Runs a simple throughput benchmark against OSD.N, writing ``TOTAL_DATA_BYTES`` +in write requests of ``BYTES_PER_WRITE`` each. By default, the test +writes 1 GB in total in 4-MB increments. +The benchmark is non-destructive and will not overwrite existing live +OSD data, but might temporarily affect the performance of clients +concurrently accessing the OSD. : + +.. prompt:: bash $ + + ceph tell osd.N bench [TOTAL_DATA_BYTES] [BYTES_PER_WRITE] + +To clear an OSD's caches between benchmark runs, use the 'cache drop' command : + +.. prompt:: bash $ + + ceph tell osd.N cache drop + +To get the cache statistics of an OSD, use the 'cache status' command : + +.. prompt:: bash $ + + ceph tell osd.N cache status + +MDS Subsystem +============= + +Change configuration parameters on a running mds. : + +.. prompt:: bash $ + + ceph tell mds.{mds-id} config set {setting} {value} + +Example: + +.. prompt:: bash $ + + ceph tell mds.0 config set debug_ms 1 + +Enables debug messages. : + +.. prompt:: bash $ + + ceph mds stat + +Displays the status of all metadata servers. : + +.. prompt:: bash $ + + ceph mds fail 0 + +Marks the active MDS as failed, triggering failover to a standby if present. + +.. todo:: ``ceph mds`` subcommands missing docs: set, dump, getmap, stop, setmap + + +Mon Subsystem +============= + +Show monitor stats: + +.. prompt:: bash $ + + ceph mon stat + +:: + + e2: 3 mons at {a=127.0.0.1:40000/0,b=127.0.0.1:40001/0,c=127.0.0.1:40002/0}, election epoch 6, quorum 0,1,2 a,b,c + + +The ``quorum`` list at the end lists monitor nodes that are part of the current quorum. + +This is also available more directly: + +.. prompt:: bash $ + + ceph quorum_status -f json-pretty + +.. code-block:: javascript + + { + "election_epoch": 6, + "quorum": [ + 0, + 1, + 2 + ], + "quorum_names": [ + "a", + "b", + "c" + ], + "quorum_leader_name": "a", + "monmap": { + "epoch": 2, + "fsid": "ba807e74-b64f-4b72-b43f-597dfe60ddbc", + "modified": "2016-12-26 14:42:09.288066", + "created": "2016-12-26 14:42:03.573585", + "features": { + "persistent": [ + "kraken" + ], + "optional": [] + }, + "mons": [ + { + "rank": 0, + "name": "a", + "addr": "127.0.0.1:40000\/0", + "public_addr": "127.0.0.1:40000\/0" + }, + { + "rank": 1, + "name": "b", + "addr": "127.0.0.1:40001\/0", + "public_addr": "127.0.0.1:40001\/0" + }, + { + "rank": 2, + "name": "c", + "addr": "127.0.0.1:40002\/0", + "public_addr": "127.0.0.1:40002\/0" + } + ] + } + } + + +The above will block until a quorum is reached. + +For a status of just a single monitor: + +.. prompt:: bash $ + + ceph tell mon.[name] mon_status + +where the value of ``[name]`` can be taken from ``ceph quorum_status``. Sample +output:: + + { + "name": "b", + "rank": 1, + "state": "peon", + "election_epoch": 6, + "quorum": [ + 0, + 1, + 2 + ], + "features": { + "required_con": "9025616074522624", + "required_mon": [ + "kraken" + ], + "quorum_con": "1152921504336314367", + "quorum_mon": [ + "kraken" + ] + }, + "outside_quorum": [], + "extra_probe_peers": [], + "sync_provider": [], + "monmap": { + "epoch": 2, + "fsid": "ba807e74-b64f-4b72-b43f-597dfe60ddbc", + "modified": "2016-12-26 14:42:09.288066", + "created": "2016-12-26 14:42:03.573585", + "features": { + "persistent": [ + "kraken" + ], + "optional": [] + }, + "mons": [ + { + "rank": 0, + "name": "a", + "addr": "127.0.0.1:40000\/0", + "public_addr": "127.0.0.1:40000\/0" + }, + { + "rank": 1, + "name": "b", + "addr": "127.0.0.1:40001\/0", + "public_addr": "127.0.0.1:40001\/0" + }, + { + "rank": 2, + "name": "c", + "addr": "127.0.0.1:40002\/0", + "public_addr": "127.0.0.1:40002\/0" + } + ] + } + } + +A dump of the monitor state: + + .. prompt:: bash $ + + ceph mon dump + + :: + + dumped monmap epoch 2 + epoch 2 + fsid ba807e74-b64f-4b72-b43f-597dfe60ddbc + last_changed 2016-12-26 14:42:09.288066 + created 2016-12-26 14:42:03.573585 + 0: 127.0.0.1:40000/0 mon.a + 1: 127.0.0.1:40001/0 mon.b + 2: 127.0.0.1:40002/0 mon.c + diff --git a/doc/rados/operations/crush-map-edits.rst b/doc/rados/operations/crush-map-edits.rst new file mode 100644 index 000000000..18553e47d --- /dev/null +++ b/doc/rados/operations/crush-map-edits.rst @@ -0,0 +1,747 @@ +Manually editing a CRUSH Map +============================ + +.. note:: Manually editing the CRUSH map is an advanced + administrator operation. All CRUSH changes that are + necessary for the overwhelming majority of installations are + possible via the standard ceph CLI and do not require manual + CRUSH map edits. If you have identified a use case where + manual edits *are* necessary with recent Ceph releases, consider + contacting the Ceph developers so that future versions of Ceph + can obviate your corner case. + +To edit an existing CRUSH map: + +#. `Get the CRUSH map`_. +#. `Decompile`_ the CRUSH map. +#. Edit at least one of `Devices`_, `Buckets`_ and `Rules`_. +#. `Recompile`_ the CRUSH map. +#. `Set the CRUSH map`_. + +For details on setting the CRUSH map rule for a specific pool, see `Set +Pool Values`_. + +.. _Get the CRUSH map: #getcrushmap +.. _Decompile: #decompilecrushmap +.. _Devices: #crushmapdevices +.. _Buckets: #crushmapbuckets +.. _Rules: #crushmaprules +.. _Recompile: #compilecrushmap +.. _Set the CRUSH map: #setcrushmap +.. _Set Pool Values: ../pools#setpoolvalues + +.. _getcrushmap: + +Get a CRUSH Map +--------------- + +To get the CRUSH map for your cluster, execute the following: + +.. prompt:: bash $ + + ceph osd getcrushmap -o {compiled-crushmap-filename} + +Ceph will output (-o) a compiled CRUSH map to the filename you specified. Since +the CRUSH map is in a compiled form, you must decompile it first before you can +edit it. + +.. _decompilecrushmap: + +Decompile a CRUSH Map +--------------------- + +To decompile a CRUSH map, execute the following: + +.. prompt:: bash $ + + crushtool -d {compiled-crushmap-filename} -o {decompiled-crushmap-filename} + +.. _compilecrushmap: + +Recompile a CRUSH Map +--------------------- + +To compile a CRUSH map, execute the following: + +.. prompt:: bash $ + + crushtool -c {decompiled-crushmap-filename} -o {compiled-crushmap-filename} + +.. _setcrushmap: + +Set the CRUSH Map +----------------- + +To set the CRUSH map for your cluster, execute the following: + +.. prompt:: bash $ + + ceph osd setcrushmap -i {compiled-crushmap-filename} + +Ceph will load (-i) a compiled CRUSH map from the filename you specified. + +Sections +-------- + +There are six main sections to a CRUSH Map. + +#. **tunables:** The preamble at the top of the map describes any *tunables* + that differ from the historical / legacy CRUSH behavior. These + correct for old bugs, optimizations, or other changes that have + been made over the years to improve CRUSH's behavior. + +#. **devices:** Devices are individual OSDs that store data. + +#. **types**: Bucket ``types`` define the types of buckets used in + your CRUSH hierarchy. Buckets consist of a hierarchical aggregation + of storage locations (e.g., rows, racks, chassis, hosts, etc.) and + their assigned weights. + +#. **buckets:** Once you define bucket types, you must define each node + in the hierarchy, its type, and which devices or other nodes it + contains. + +#. **rules:** Rules define policy about how data is distributed across + devices in the hierarchy. + +#. **choose_args:** Choose_args are alternative weights associated with + the hierarchy that have been adjusted to optimize data placement. A single + choose_args map can be used for the entire cluster, or one can be + created for each individual pool. + + +.. _crushmapdevices: + +CRUSH Map Devices +----------------- + +Devices are individual OSDs that store data. Usually one is defined here for each +OSD daemon in your +cluster. Devices are identified by an ``id`` (a non-negative integer) and +a ``name``, normally ``osd.N`` where ``N`` is the device id. + +.. _crush-map-device-class: + +Devices may also have a *device class* associated with them (e.g., +``hdd`` or ``ssd``), allowing them to be conveniently targeted by a +crush rule. + +.. prompt:: bash # + + devices + +:: + + device {num} {osd.name} [class {class}] + +For example: + +.. prompt:: bash # + + devices + +:: + + device 0 osd.0 class ssd + device 1 osd.1 class hdd + device 2 osd.2 + device 3 osd.3 + +In most cases, each device maps to a single ``ceph-osd`` daemon. This +is normally a single storage device, a pair of devices (for example, +one for data and one for a journal or metadata), or in some cases a +small RAID device. + +CRUSH Map Bucket Types +---------------------- + +The second list in the CRUSH map defines 'bucket' types. Buckets facilitate +a hierarchy of nodes and leaves. Node (or non-leaf) buckets typically represent +physical locations in a hierarchy. Nodes aggregate other nodes or leaves. +Leaf buckets represent ``ceph-osd`` daemons and their corresponding storage +media. + +.. tip:: The term "bucket" used in the context of CRUSH means a node in + the hierarchy, i.e. a location or a piece of physical hardware. It + is a different concept from the term "bucket" when used in the + context of RADOS Gateway APIs. + +To add a bucket type to the CRUSH map, create a new line under your list of +bucket types. Enter ``type`` followed by a unique numeric ID and a bucket name. +By convention, there is one leaf bucket and it is ``type 0``; however, you may +give it any name you like (e.g., osd, disk, drive, storage):: + + # types + type {num} {bucket-name} + +For example:: + + # types + type 0 osd + type 1 host + type 2 chassis + type 3 rack + type 4 row + type 5 pdu + type 6 pod + type 7 room + type 8 datacenter + type 9 zone + type 10 region + type 11 root + + + +.. _crushmapbuckets: + +CRUSH Map Bucket Hierarchy +-------------------------- + +The CRUSH algorithm distributes data objects among storage devices according +to a per-device weight value, approximating a uniform probability distribution. +CRUSH distributes objects and their replicas according to the hierarchical +cluster map you define. Your CRUSH map represents the available storage +devices and the logical elements that contain them. + +To map placement groups to OSDs across failure domains, a CRUSH map defines a +hierarchical list of bucket types (i.e., under ``#types`` in the generated CRUSH +map). The purpose of creating a bucket hierarchy is to segregate the +leaf nodes by their failure domains, such as hosts, chassis, racks, power +distribution units, pods, rows, rooms, and data centers. With the exception of +the leaf nodes representing OSDs, the rest of the hierarchy is arbitrary, and +you may define it according to your own needs. + +We recommend adapting your CRUSH map to your firm's hardware naming conventions +and using instance names that reflect the physical hardware. Your naming +practice can make it easier to administer the cluster and troubleshoot +problems when an OSD and/or other hardware malfunctions and the administrator +need access to physical hardware. + +In the following example, the bucket hierarchy has a leaf bucket named ``osd``, +and two node buckets named ``host`` and ``rack`` respectively. + +.. ditaa:: + +-----------+ + | {o}rack | + | Bucket | + +-----+-----+ + | + +---------------+---------------+ + | | + +-----+-----+ +-----+-----+ + | {o}host | | {o}host | + | Bucket | | Bucket | + +-----+-----+ +-----+-----+ + | | + +-------+-------+ +-------+-------+ + | | | | + +-----+-----+ +-----+-----+ +-----+-----+ +-----+-----+ + | osd | | osd | | osd | | osd | + | Bucket | | Bucket | | Bucket | | Bucket | + +-----------+ +-----------+ +-----------+ +-----------+ + +.. note:: The higher numbered ``rack`` bucket type aggregates the lower + numbered ``host`` bucket type. + +Since leaf nodes reflect storage devices declared under the ``#devices`` list +at the beginning of the CRUSH map, you do not need to declare them as bucket +instances. The second lowest bucket type in your hierarchy usually aggregates +the devices (i.e., it's usually the computer containing the storage media, and +uses whatever term you prefer to describe it, such as "node", "computer", +"server," "host", "machine", etc.). In high density environments, it is +increasingly common to see multiple hosts/nodes per chassis. You should account +for chassis failure too--e.g., the need to pull a chassis if a node fails may +result in bringing down numerous hosts/nodes and their OSDs. + +When declaring a bucket instance, you must specify its type, give it a unique +name (string), assign it a unique ID expressed as a negative integer (optional), +specify a weight relative to the total capacity/capability of its item(s), +specify the bucket algorithm (usually ``straw2``), and the hash (usually ``0``, +reflecting hash algorithm ``rjenkins1``). A bucket may have one or more items. +The items may consist of node buckets or leaves. Items may have a weight that +reflects the relative weight of the item. + +You may declare a node bucket with the following syntax:: + + [bucket-type] [bucket-name] { + id [a unique negative numeric ID] + weight [the relative capacity/capability of the item(s)] + alg [the bucket type: uniform | list | tree | straw | straw2 ] + hash [the hash type: 0 by default] + item [item-name] weight [weight] + } + +For example, using the diagram above, we would define two host buckets +and one rack bucket. The OSDs are declared as items within the host buckets:: + + host node1 { + id -1 + alg straw2 + hash 0 + item osd.0 weight 1.00 + item osd.1 weight 1.00 + } + + host node2 { + id -2 + alg straw2 + hash 0 + item osd.2 weight 1.00 + item osd.3 weight 1.00 + } + + rack rack1 { + id -3 + alg straw2 + hash 0 + item node1 weight 2.00 + item node2 weight 2.00 + } + +.. note:: In the foregoing example, note that the rack bucket does not contain + any OSDs. Rather it contains lower level host buckets, and includes the + sum total of their weight in the item entry. + +.. topic:: Bucket Types + + Ceph supports five bucket types, each representing a tradeoff between + performance and reorganization efficiency. If you are unsure of which bucket + type to use, we recommend using a ``straw2`` bucket. For a detailed + discussion of bucket types, refer to + `CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data`_, + and more specifically to **Section 3.4**. The bucket types are: + + #. **uniform**: Uniform buckets aggregate devices with **exactly** the same + weight. For example, when firms commission or decommission hardware, they + typically do so with many machines that have exactly the same physical + configuration (e.g., bulk purchases). When storage devices have exactly + the same weight, you may use the ``uniform`` bucket type, which allows + CRUSH to map replicas into uniform buckets in constant time. With + non-uniform weights, you should use another bucket algorithm. + + #. **list**: List buckets aggregate their content as linked lists. Based on + the :abbr:`RUSH (Replication Under Scalable Hashing)` :sub:`P` algorithm, + a list is a natural and intuitive choice for an **expanding cluster**: + either an object is relocated to the newest device with some appropriate + probability, or it remains on the older devices as before. The result is + optimal data migration when items are added to the bucket. Items removed + from the middle or tail of the list, however, can result in a significant + amount of unnecessary movement, making list buckets most suitable for + circumstances in which they **never (or very rarely) shrink**. + + #. **tree**: Tree buckets use a binary search tree. They are more efficient + than list buckets when a bucket contains a larger set of items. Based on + the :abbr:`RUSH (Replication Under Scalable Hashing)` :sub:`R` algorithm, + tree buckets reduce the placement time to O(log :sub:`n`), making them + suitable for managing much larger sets of devices or nested buckets. + + #. **straw**: List and Tree buckets use a divide and conquer strategy + in a way that either gives certain items precedence (e.g., those + at the beginning of a list) or obviates the need to consider entire + subtrees of items at all. That improves the performance of the replica + placement process, but can also introduce suboptimal reorganization + behavior when the contents of a bucket change due an addition, removal, + or re-weighting of an item. The straw bucket type allows all items to + fairly “compete” against each other for replica placement through a + process analogous to a draw of straws. + + #. **straw2**: Straw2 buckets improve Straw to correctly avoid any data + movement between items when neighbor weights change. + + For example the weight of item A including adding it anew or removing + it completely, there will be data movement only to or from item A. + +.. topic:: Hash + + Each bucket uses a hash algorithm. Currently, Ceph supports ``rjenkins1``. + Enter ``0`` as your hash setting to select ``rjenkins1``. + + +.. _weightingbucketitems: + +.. topic:: Weighting Bucket Items + + Ceph expresses bucket weights as doubles, which allows for fine + weighting. A weight is the relative difference between device capacities. We + recommend using ``1.00`` as the relative weight for a 1TB storage device. + In such a scenario, a weight of ``0.5`` would represent approximately 500GB, + and a weight of ``3.00`` would represent approximately 3TB. Higher level + buckets have a weight that is the sum total of the leaf items aggregated by + the bucket. + + A bucket item weight is one dimensional, but you may also calculate your + item weights to reflect the performance of the storage drive. For example, + if you have many 1TB drives where some have relatively low data transfer + rate and the others have a relatively high data transfer rate, you may + weight them differently, even though they have the same capacity (e.g., + a weight of 0.80 for the first set of drives with lower total throughput, + and 1.20 for the second set of drives with higher total throughput). + + +.. _crushmaprules: + +CRUSH Map Rules +--------------- + +CRUSH maps support the notion of 'CRUSH rules', which are the rules that +determine data placement for a pool. The default CRUSH map has a rule for each +pool. For large clusters, you will likely create many pools where each pool may +have its own non-default CRUSH rule. + +.. note:: In most cases, you will not need to modify the default rule. When + you create a new pool, by default the rule will be set to ``0``. + + +CRUSH rules define placement and replication strategies or distribution policies +that allow you to specify exactly how CRUSH places object replicas. For +example, you might create a rule selecting a pair of targets for 2-way +mirroring, another rule for selecting three targets in two different data +centers for 3-way mirroring, and yet another rule for erasure coding over six +storage devices. For a detailed discussion of CRUSH rules, refer to +`CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data`_, +and more specifically to **Section 3.2**. + +A rule takes the following form:: + + rule <rulename> { + + id [a unique whole numeric ID] + type [ replicated | erasure ] + min_size <min-size> + max_size <max-size> + step take <bucket-name> [class <device-class>] + step [choose|chooseleaf] [firstn|indep] <N> type <bucket-type> + step emit + } + + +``id`` + +:Description: A unique whole number for identifying the rule. + +:Purpose: A component of the rule mask. +:Type: Integer +:Required: Yes +:Default: 0 + + +``type`` + +:Description: Describes a rule for either a storage drive (replicated) + or a RAID. + +:Purpose: A component of the rule mask. +:Type: String +:Required: Yes +:Default: ``replicated`` +:Valid Values: Currently only ``replicated`` and ``erasure`` + +``min_size`` + +:Description: If a pool makes fewer replicas than this number, CRUSH will + **NOT** select this rule. + +:Type: Integer +:Purpose: A component of the rule mask. +:Required: Yes +:Default: ``1`` + +``max_size`` + +:Description: If a pool makes more replicas than this number, CRUSH will + **NOT** select this rule. + +:Type: Integer +:Purpose: A component of the rule mask. +:Required: Yes +:Default: 10 + + +``step take <bucket-name> [class <device-class>]`` + +:Description: Takes a bucket name, and begins iterating down the tree. + If the ``device-class`` is specified, it must match + a class previously used when defining a device. All + devices that do not belong to the class are excluded. +:Purpose: A component of the rule. +:Required: Yes +:Example: ``step take data`` + + +``step choose firstn {num} type {bucket-type}`` + +:Description: Selects the number of buckets of the given type from within the + current bucket. The number is usually the number of replicas in + the pool (i.e., pool size). + + - If ``{num} == 0``, choose ``pool-num-replicas`` buckets (all available). + - If ``{num} > 0 && < pool-num-replicas``, choose that many buckets. + - If ``{num} < 0``, it means ``pool-num-replicas - {num}``. + +:Purpose: A component of the rule. +:Prerequisite: Follows ``step take`` or ``step choose``. +:Example: ``step choose firstn 1 type row`` + + +``step chooseleaf firstn {num} type {bucket-type}`` + +:Description: Selects a set of buckets of ``{bucket-type}`` and chooses a leaf + node (that is, an OSD) from the subtree of each bucket in the set of buckets. + The number of buckets in the set is usually the number of replicas in + the pool (i.e., pool size). + + - If ``{num} == 0``, choose ``pool-num-replicas`` buckets (all available). + - If ``{num} > 0 && < pool-num-replicas``, choose that many buckets. + - If ``{num} < 0``, it means ``pool-num-replicas - {num}``. + +:Purpose: A component of the rule. Usage removes the need to select a device using two steps. +:Prerequisite: Follows ``step take`` or ``step choose``. +:Example: ``step chooseleaf firstn 0 type row`` + + +``step emit`` + +:Description: Outputs the current value and empties the stack. Typically used + at the end of a rule, but may also be used to pick from different + trees in the same rule. + +:Purpose: A component of the rule. +:Prerequisite: Follows ``step choose``. +:Example: ``step emit`` + +.. important:: A given CRUSH rule may be assigned to multiple pools, but it + is not possible for a single pool to have multiple CRUSH rules. + +``firstn`` versus ``indep`` + +:Description: Controls the replacement strategy CRUSH uses when items (OSDs) + are marked down in the CRUSH map. If this rule is to be used with + replicated pools it should be ``firstn`` and if it's for + erasure-coded pools it should be ``indep``. + + The reason has to do with how they behave when a + previously-selected device fails. Let's say you have a PG stored + on OSDs 1, 2, 3, 4, 5. Then 3 goes down. + + With the "firstn" mode, CRUSH simply adjusts its calculation to + select 1 and 2, then selects 3 but discovers it's down, so it + retries and selects 4 and 5, and then goes on to select a new + OSD 6. So the final CRUSH mapping change is + 1, 2, 3, 4, 5 -> 1, 2, 4, 5, 6. + + But if you're storing an EC pool, that means you just changed the + data mapped to OSDs 4, 5, and 6! So the "indep" mode attempts to + not do that. You can instead expect it, when it selects the failed + OSD 3, to try again and pick out 6, for a final transformation of: + 1, 2, 3, 4, 5 -> 1, 2, 6, 4, 5 + +.. _crush-reclassify: + +Migrating from a legacy SSD rule to device classes +-------------------------------------------------- + +It used to be necessary to manually edit your CRUSH map and maintain a +parallel hierarchy for each specialized device type (e.g., SSD) in order to +write rules that apply to those devices. Since the Luminous release, +the *device class* feature has enabled this transparently. + +However, migrating from an existing, manually customized per-device map to +the new device class rules in the trivial way will cause all data in the +system to be reshuffled. + +The ``crushtool`` has a few commands that can transform a legacy rule +and hierarchy so that you can start using the new class-based rules. +There are three types of transformations possible: + +#. ``--reclassify-root <root-name> <device-class>`` + + This will take everything in the hierarchy beneath root-name and + adjust any rules that reference that root via a ``take + <root-name>`` to instead ``take <root-name> class <device-class>``. + It renumbers the buckets in such a way that the old IDs are instead + used for the specified class's "shadow tree" so that no data + movement takes place. + + For example, imagine you have an existing rule like:: + + rule replicated_ruleset { + id 0 + type replicated + min_size 1 + max_size 10 + step take default + step chooseleaf firstn 0 type rack + step emit + } + + If you reclassify the root `default` as class `hdd`, the rule will + become:: + + rule replicated_ruleset { + id 0 + type replicated + min_size 1 + max_size 10 + step take default class hdd + step chooseleaf firstn 0 type rack + step emit + } + +#. ``--set-subtree-class <bucket-name> <device-class>`` + + This will mark every device in the subtree rooted at *bucket-name* + with the specified device class. + + This is normally used in conjunction with the ``--reclassify-root`` + option to ensure that all devices in that root are labeled with the + correct class. In some situations, however, some of those devices + (correctly) have a different class and we do not want to relabel + them. In such cases, one can exclude the ``--set-subtree-class`` + option. This means that the remapping process will not be perfect, + since the previous rule distributed across devices of multiple + classes but the adjusted rules will only map to devices of the + specified *device-class*, but that often is an accepted level of + data movement when the number of outlier devices is small. + +#. ``--reclassify-bucket <match-pattern> <device-class> <default-parent>`` + + This will allow you to merge a parallel type-specific hierarchy with the normal hierarchy. For example, many users have maps like:: + + host node1 { + id -2 # do not change unnecessarily + # weight 109.152 + alg straw2 + hash 0 # rjenkins1 + item osd.0 weight 9.096 + item osd.1 weight 9.096 + item osd.2 weight 9.096 + item osd.3 weight 9.096 + item osd.4 weight 9.096 + item osd.5 weight 9.096 + ... + } + + host node1-ssd { + id -10 # do not change unnecessarily + # weight 2.000 + alg straw2 + hash 0 # rjenkins1 + item osd.80 weight 2.000 + ... + } + + root default { + id -1 # do not change unnecessarily + alg straw2 + hash 0 # rjenkins1 + item node1 weight 110.967 + ... + } + + root ssd { + id -18 # do not change unnecessarily + # weight 16.000 + alg straw2 + hash 0 # rjenkins1 + item node1-ssd weight 2.000 + ... + } + + This function will reclassify each bucket that matches a + pattern. The pattern can look like ``%suffix`` or ``prefix%``. + For example, in the above example, we would use the pattern + ``%-ssd``. For each matched bucket, the remaining portion of the + name (that matches the ``%`` wildcard) specifies the *base bucket*. + All devices in the matched bucket are labeled with the specified + device class and then moved to the base bucket. If the base bucket + does not exist (e.g., ``node12-ssd`` exists but ``node12`` does + not), then it is created and linked underneath the specified + *default parent* bucket. In each case, we are careful to preserve + the old bucket IDs for the new shadow buckets to prevent data + movement. Any rules with ``take`` steps referencing the old + buckets are adjusted. + +#. ``--reclassify-bucket <bucket-name> <device-class> <base-bucket>`` + + The same command can also be used without a wildcard to map a + single bucket. For example, in the previous example, we want the + ``ssd`` bucket to be mapped to the ``default`` bucket. + +The final command to convert the map comprising the above fragments would be something like: + +.. prompt:: bash $ + + ceph osd getcrushmap -o original + crushtool -i original --reclassify \ + --set-subtree-class default hdd \ + --reclassify-root default hdd \ + --reclassify-bucket %-ssd ssd default \ + --reclassify-bucket ssd ssd default \ + -o adjusted + +In order to ensure that the conversion is correct, there is a ``--compare`` command that will test a large sample of inputs against the CRUSH map and check that the same result is output. These inputs are controlled by the same options that apply to the ``--test`` command. For the above example,: + +.. prompt:: bash $ + + crushtool -i original --compare adjusted + +:: + + rule 0 had 0/10240 mismatched mappings (0) + rule 1 had 0/10240 mismatched mappings (0) + maps appear equivalent + +If there were differences, the ratio of remapped inputs would be reported in +the parentheses. + +When you are satisfied with the adjusted map, apply it to the cluster with a command of the form: + +.. prompt:: bash $ + + ceph osd setcrushmap -i adjusted + +Tuning CRUSH, the hard way +-------------------------- + +If you can ensure that all clients are running recent code, you can +adjust the tunables by extracting the CRUSH map, modifying the values, +and reinjecting it into the cluster. + +* Extract the latest CRUSH map: + + .. prompt:: bash $ + + ceph osd getcrushmap -o /tmp/crush + +* Adjust tunables. These values appear to offer the best behavior + for both large and small clusters we tested with. You will need to + additionally specify the ``--enable-unsafe-tunables`` argument to + ``crushtool`` for this to work. Please use this option with + extreme care.: + + .. prompt:: bash $ + + crushtool -i /tmp/crush --set-choose-local-tries 0 --set-choose-local-fallback-tries 0 --set-choose-total-tries 50 -o /tmp/crush.new + +* Reinject modified map: + + .. prompt:: bash $ + + ceph osd setcrushmap -i /tmp/crush.new + +Legacy values +------------- + +For reference, the legacy values for the CRUSH tunables can be set +with: + +.. prompt:: bash $ + + crushtool -i /tmp/crush --set-choose-local-tries 2 --set-choose-local-fallback-tries 5 --set-choose-total-tries 19 --set-chooseleaf-descend-once 0 --set-chooseleaf-vary-r 0 -o /tmp/crush.legacy + +Again, the special ``--enable-unsafe-tunables`` option is required. +Further, as noted above, be careful running old versions of the +``ceph-osd`` daemon after reverting to legacy values as the feature +bit is not perfectly enforced. + +.. _CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data: https://ceph.io/assets/pdfs/weil-crush-sc06.pdf diff --git a/doc/rados/operations/crush-map.rst b/doc/rados/operations/crush-map.rst new file mode 100644 index 000000000..f22ebb24e --- /dev/null +++ b/doc/rados/operations/crush-map.rst @@ -0,0 +1,1126 @@ +============ + CRUSH Maps +============ + +The :abbr:`CRUSH (Controlled Replication Under Scalable Hashing)` algorithm +determines how to store and retrieve data by computing storage locations. +CRUSH empowers Ceph clients to communicate with OSDs directly rather than +through a centralized server or broker. With an algorithmically determined +method of storing and retrieving data, Ceph avoids a single point of failure, a +performance bottleneck, and a physical limit to its scalability. + +CRUSH uses a map of your cluster (the CRUSH map) to pseudo-randomly +map data to OSDs, distributing it across the cluster according to configured +replication policy and failure domain. For a detailed discussion of CRUSH, see +`CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data`_ + +CRUSH maps contain a list of :abbr:`OSDs (Object Storage Devices)`, a hierarchy +of 'buckets' for aggregating devices and buckets, and +rules that govern how CRUSH replicates data within the cluster's pools. By +reflecting the underlying physical organization of the installation, CRUSH can +model (and thereby address) the potential for correlated device failures. +Typical factors include chassis, racks, physical proximity, a shared power +source, and shared networking. By encoding this information into the cluster +map, CRUSH placement +policies distribute object replicas across failure domains while +maintaining the desired distribution. For example, to address the +possibility of concurrent failures, it may be desirable to ensure that data +replicas are on devices using different shelves, racks, power supplies, +controllers, and/or physical locations. + +When you deploy OSDs they are automatically added to the CRUSH map under a +``host`` bucket named for the node on which they run. This, +combined with the configured CRUSH failure domain, ensures that replicas or +erasure code shards are distributed across hosts and that a single host or other +failure will not affect availability. For larger clusters, administrators must +carefully consider their choice of failure domain. Separating replicas across racks, +for example, is typical for mid- to large-sized clusters. + + +CRUSH Location +============== + +The location of an OSD within the CRUSH map's hierarchy is +referred to as a ``CRUSH location``. This location specifier takes the +form of a list of key and value pairs. For +example, if an OSD is in a particular row, rack, chassis and host, and +is part of the 'default' CRUSH root (which is the case for most +clusters), its CRUSH location could be described as:: + + root=default row=a rack=a2 chassis=a2a host=a2a1 + +Note: + +#. Note that the order of the keys does not matter. +#. The key name (left of ``=``) must be a valid CRUSH ``type``. By default + these include ``root``, ``datacenter``, ``room``, ``row``, ``pod``, ``pdu``, + ``rack``, ``chassis`` and ``host``. + These defined types suffice for almost all clusters, but can be customized + by modifying the CRUSH map. +#. Not all keys need to be specified. For example, by default, Ceph + automatically sets an ``OSD``'s location to be + ``root=default host=HOSTNAME`` (based on the output from ``hostname -s``). + +The CRUSH location for an OSD can be defined by adding the ``crush location`` +option in ``ceph.conf``. Each time the OSD starts, +it verifies it is in the correct location in the CRUSH map and, if it is not, +it moves itself. To disable this automatic CRUSH map management, add the +following to your configuration file in the ``[osd]`` section:: + + osd crush update on start = false + +Note that in most cases you will not need to manually configure this. + + +Custom location hooks +--------------------- + +A customized location hook can be used to generate a more complete +CRUSH location on startup. The CRUSH location is based on, in order +of preference: + +#. A ``crush location`` option in ``ceph.conf`` +#. A default of ``root=default host=HOSTNAME`` where the hostname is + derived from the ``hostname -s`` command + +A script can be written to provide additional +location fields (for example, ``rack`` or ``datacenter``) and the +hook enabled via the config option:: + + crush location hook = /path/to/customized-ceph-crush-location + +This hook is passed several arguments (below) and should output a single line +to ``stdout`` with the CRUSH location description.:: + + --cluster CLUSTER --id ID --type TYPE + +where the cluster name is typically ``ceph``, the ``id`` is the daemon +identifier (e.g., the OSD number or daemon identifier), and the daemon +type is ``osd``, ``mds``, etc. + +For example, a simple hook that additionally specifies a rack location +based on a value in the file ``/etc/rack`` might be:: + + #!/bin/sh + echo "host=$(hostname -s) rack=$(cat /etc/rack) root=default" + + +CRUSH structure +=============== + +The CRUSH map consists of a hierarchy that describes +the physical topology of the cluster and a set of rules defining +data placement policy. The hierarchy has +devices (OSDs) at the leaves, and internal nodes +corresponding to other physical features or groupings: hosts, racks, +rows, datacenters, and so on. The rules describe how replicas are +placed in terms of that hierarchy (e.g., 'three replicas in different +racks'). + +Devices +------- + +Devices are individual OSDs that store data, usually one for each storage drive. +Devices are identified by an ``id`` +(a non-negative integer) and a ``name``, normally ``osd.N`` where ``N`` is the device id. + +Since the Luminous release, devices may also have a *device class* assigned (e.g., +``hdd`` or ``ssd`` or ``nvme``), allowing them to be conveniently targeted by +CRUSH rules. This is especially useful when mixing device types within hosts. + +.. _crush_map_default_types: + +Types and Buckets +----------------- + +A bucket is the CRUSH term for internal nodes in the hierarchy: hosts, +racks, rows, etc. The CRUSH map defines a series of *types* that are +used to describe these nodes. Default types include: + +- ``osd`` (or ``device``) +- ``host`` +- ``chassis`` +- ``rack`` +- ``row`` +- ``pdu`` +- ``pod`` +- ``room`` +- ``datacenter`` +- ``zone`` +- ``region`` +- ``root`` + +Most clusters use only a handful of these types, and others +can be defined as needed. + +The hierarchy is built with devices (normally type ``osd``) at the +leaves, interior nodes with non-device types, and a root node of type +``root``. For example, + +.. ditaa:: + + +-----------------+ + |{o}root default | + +--------+--------+ + | + +---------------+---------------+ + | | + +------+------+ +------+------+ + |{o}host foo | |{o}host bar | + +------+------+ +------+------+ + | | + +-------+-------+ +-------+-------+ + | | | | + +-----+-----+ +-----+-----+ +-----+-----+ +-----+-----+ + | osd.0 | | osd.1 | | osd.2 | | osd.3 | + +-----------+ +-----------+ +-----------+ +-----------+ + +Each node (device or bucket) in the hierarchy has a *weight* +that indicates the relative proportion of the total +data that device or hierarchy subtree should store. Weights are set +at the leaves, indicating the size of the device, and automatically +sum up the tree, such that the weight of the ``root`` node +will be the total of all devices contained beneath it. Normally +weights are in units of terabytes (TB). + +You can get a simple view the of CRUSH hierarchy for your cluster, +including weights, with: + +.. prompt:: bash $ + + ceph osd tree + +Rules +----- + +CRUSH Rules define policy about how data is distributed across the devices +in the hierarchy. They define placement and replication strategies or +distribution policies that allow you to specify exactly how CRUSH +places data replicas. For example, you might create a rule selecting +a pair of targets for two-way mirroring, another rule for selecting +three targets in two different data centers for three-way mirroring, and +yet another rule for erasure coding (EC) across six storage devices. For a +detailed discussion of CRUSH rules, refer to `CRUSH - Controlled, +Scalable, Decentralized Placement of Replicated Data`_, and more +specifically to **Section 3.2**. + +CRUSH rules can be created via the CLI by +specifying the *pool type* they will be used for (replicated or +erasure coded), the *failure domain*, and optionally a *device class*. +In rare cases rules must be written by hand by manually editing the +CRUSH map. + +You can see what rules are defined for your cluster with: + +.. prompt:: bash $ + + ceph osd crush rule ls + +You can view the contents of the rules with: + +.. prompt:: bash $ + + ceph osd crush rule dump + +Device classes +-------------- + +Each device can optionally have a *class* assigned. By +default, OSDs automatically set their class at startup to +`hdd`, `ssd`, or `nvme` based on the type of device they are backed +by. + +The device class for one or more OSDs can be explicitly set with: + +.. prompt:: bash $ + + ceph osd crush set-device-class <class> <osd-name> [...] + +Once a device class is set, it cannot be changed to another class +until the old class is unset with: + +.. prompt:: bash $ + + ceph osd crush rm-device-class <osd-name> [...] + +This allows administrators to set device classes without the class +being changed on OSD restart or by some other script. + +A placement rule that targets a specific device class can be created with: + +.. prompt:: bash $ + + ceph osd crush rule create-replicated <rule-name> <root> <failure-domain> <class> + +A pool can then be changed to use the new rule with: + +.. prompt:: bash $ + + ceph osd pool set <pool-name> crush_rule <rule-name> + +Device classes are implemented by creating a "shadow" CRUSH hierarchy +for each device class in use that contains only devices of that class. +CRUSH rules can then distribute data over the shadow hierarchy. +This approach is fully backward compatible with +old Ceph clients. You can view the CRUSH hierarchy with shadow items +with: + +.. prompt:: bash $ + + ceph osd crush tree --show-shadow + +For older clusters created before Luminous that relied on manually +crafted CRUSH maps to maintain per-device-type hierarchies, there is a +*reclassify* tool available to help transition to device classes +without triggering data movement (see :ref:`crush-reclassify`). + + +Weights sets +------------ + +A *weight set* is an alternative set of weights to use when +calculating data placement. The normal weights associated with each +device in the CRUSH map are set based on the device size and indicate +how much data we *should* be storing where. However, because CRUSH is +a "probabilistic" pseudorandom placement process, there is always some +variation from this ideal distribution, in the same way that rolling a +die sixty times will not result in rolling exactly 10 ones and 10 +sixes. Weight sets allow the cluster to perform numerical optimization +based on the specifics of your cluster (hierarchy, pools, etc.) to achieve +a balanced distribution. + +There are two types of weight sets supported: + + #. A **compat** weight set is a single alternative set of weights for + each device and node in the cluster. This is not well-suited for + correcting for all anomalies (for example, placement groups for + different pools may be different sizes and have different load + levels, but will be mostly treated the same by the balancer). + However, compat weight sets have the huge advantage that they are + *backward compatible* with previous versions of Ceph, which means + that even though weight sets were first introduced in Luminous + v12.2.z, older clients (e.g., firefly) can still connect to the + cluster when a compat weight set is being used to balance data. + #. A **per-pool** weight set is more flexible in that it allows + placement to be optimized for each data pool. Additionally, + weights can be adjusted for each position of placement, allowing + the optimizer to correct for a subtle skew of data toward devices + with small weights relative to their peers (and effect that is + usually only apparently in very large clusters but which can cause + balancing problems). + +When weight sets are in use, the weights associated with each node in +the hierarchy is visible as a separate column (labeled either +``(compat)`` or the pool name) from the command: + +.. prompt:: bash $ + + ceph osd tree + +When both *compat* and *per-pool* weight sets are in use, data +placement for a particular pool will use its own per-pool weight set +if present. If not, it will use the compat weight set if present. If +neither are present, it will use the normal CRUSH weights. + +Although weight sets can be set up and manipulated by hand, it is +recommended that the ``ceph-mgr`` *balancer* module be enabled to do so +automatically when running Luminous or later releases. + + +Modifying the CRUSH map +======================= + +.. _addosd: + +Add/Move an OSD +--------------- + +.. note: OSDs are normally automatically added to the CRUSH map when + the OSD is created. This command is rarely needed. + +To add or move an OSD in the CRUSH map of a running cluster: + +.. prompt:: bash $ + + ceph osd crush set {name} {weight} root={root} [{bucket-type}={bucket-name} ...] + +Where: + +``name`` + +:Description: The full name of the OSD. +:Type: String +:Required: Yes +:Example: ``osd.0`` + + +``weight`` + +:Description: The CRUSH weight for the OSD, normally its size measure in terabytes (TB). +:Type: Double +:Required: Yes +:Example: ``2.0`` + + +``root`` + +:Description: The root node of the tree in which the OSD resides (normally ``default``) +:Type: Key/value pair. +:Required: Yes +:Example: ``root=default`` + + +``bucket-type`` + +:Description: You may specify the OSD's location in the CRUSH hierarchy. +:Type: Key/value pairs. +:Required: No +:Example: ``datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1`` + + +The following example adds ``osd.0`` to the hierarchy, or moves the +OSD from a previous location: + +.. prompt:: bash $ + + ceph osd crush set osd.0 1.0 root=default datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1 + + +Adjust OSD weight +----------------- + +.. note: Normally OSDs automatically add themselves to the CRUSH map + with the correct weight when they are created. This command + is rarely needed. + +To adjust an OSD's CRUSH weight in the CRUSH map of a running cluster, execute +the following: + +.. prompt:: bash $ + + ceph osd crush reweight {name} {weight} + +Where: + +``name`` + +:Description: The full name of the OSD. +:Type: String +:Required: Yes +:Example: ``osd.0`` + + +``weight`` + +:Description: The CRUSH weight for the OSD. +:Type: Double +:Required: Yes +:Example: ``2.0`` + + +.. _removeosd: + +Remove an OSD +------------- + +.. note: OSDs are normally removed from the CRUSH as part of the + ``ceph osd purge`` command. This command is rarely needed. + +To remove an OSD from the CRUSH map of a running cluster, execute the +following: + +.. prompt:: bash $ + + ceph osd crush remove {name} + +Where: + +``name`` + +:Description: The full name of the OSD. +:Type: String +:Required: Yes +:Example: ``osd.0`` + + +Add a Bucket +------------ + +.. note: Buckets are implicitly created when an OSD is added + that specifies a ``{bucket-type}={bucket-name}`` as part of its + location, if a bucket with that name does not already exist. This + command is typically used when manually adjusting the structure of the + hierarchy after OSDs have been created. One use is to move a + series of hosts underneath a new rack-level bucket; another is to + add new ``host`` buckets (OSD nodes) to a dummy ``root`` so that they don't + receive data until you're ready, at which time you would move them to the + ``default`` or other root as described below. + +To add a bucket in the CRUSH map of a running cluster, execute the +``ceph osd crush add-bucket`` command: + +.. prompt:: bash $ + + ceph osd crush add-bucket {bucket-name} {bucket-type} + +Where: + +``bucket-name`` + +:Description: The full name of the bucket. +:Type: String +:Required: Yes +:Example: ``rack12`` + + +``bucket-type`` + +:Description: The type of the bucket. The type must already exist in the hierarchy. +:Type: String +:Required: Yes +:Example: ``rack`` + + +The following example adds the ``rack12`` bucket to the hierarchy: + +.. prompt:: bash $ + + ceph osd crush add-bucket rack12 rack + +Move a Bucket +------------- + +To move a bucket to a different location or position in the CRUSH map +hierarchy, execute the following: + +.. prompt:: bash $ + + ceph osd crush move {bucket-name} {bucket-type}={bucket-name}, [...] + +Where: + +``bucket-name`` + +:Description: The name of the bucket to move/reposition. +:Type: String +:Required: Yes +:Example: ``foo-bar-1`` + +``bucket-type`` + +:Description: You may specify the bucket's location in the CRUSH hierarchy. +:Type: Key/value pairs. +:Required: No +:Example: ``datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1`` + +Remove a Bucket +--------------- + +To remove a bucket from the CRUSH hierarchy, execute the following: + +.. prompt:: bash $ + + ceph osd crush remove {bucket-name} + +.. note:: A bucket must be empty before removing it from the CRUSH hierarchy. + +Where: + +``bucket-name`` + +:Description: The name of the bucket that you'd like to remove. +:Type: String +:Required: Yes +:Example: ``rack12`` + +The following example removes the ``rack12`` bucket from the hierarchy: + +.. prompt:: bash $ + + ceph osd crush remove rack12 + +Creating a compat weight set +---------------------------- + +.. note: This step is normally done automatically by the ``balancer`` + module when enabled. + +To create a *compat* weight set: + +.. prompt:: bash $ + + ceph osd crush weight-set create-compat + +Weights for the compat weight set can be adjusted with: + +.. prompt:: bash $ + + ceph osd crush weight-set reweight-compat {name} {weight} + +The compat weight set can be destroyed with: + +.. prompt:: bash $ + + ceph osd crush weight-set rm-compat + +Creating per-pool weight sets +----------------------------- + +To create a weight set for a specific pool: + +.. prompt:: bash $ + + ceph osd crush weight-set create {pool-name} {mode} + +.. note:: Per-pool weight sets require that all servers and daemons + run Luminous v12.2.z or later. + +Where: + +``pool-name`` + +:Description: The name of a RADOS pool +:Type: String +:Required: Yes +:Example: ``rbd`` + +``mode`` + +:Description: Either ``flat`` or ``positional``. A *flat* weight set + has a single weight for each device or bucket. A + *positional* weight set has a potentially different + weight for each position in the resulting placement + mapping. For example, if a pool has a replica count of + 3, then a positional weight set will have three weights + for each device and bucket. +:Type: String +:Required: Yes +:Example: ``flat`` + +To adjust the weight of an item in a weight set: + +.. prompt:: bash $ + + ceph osd crush weight-set reweight {pool-name} {item-name} {weight [...]} + +To list existing weight sets: + +.. prompt:: bash $ + + ceph osd crush weight-set ls + +To remove a weight set: + +.. prompt:: bash $ + + ceph osd crush weight-set rm {pool-name} + +Creating a rule for a replicated pool +------------------------------------- + +For a replicated pool, the primary decision when creating the CRUSH +rule is what the failure domain is going to be. For example, if a +failure domain of ``host`` is selected, then CRUSH will ensure that +each replica of the data is stored on a unique host. If ``rack`` +is selected, then each replica will be stored in a different rack. +What failure domain you choose primarily depends on the size and +topology of your cluster. + +In most cases the entire cluster hierarchy is nested beneath a root node +named ``default``. If you have customized your hierarchy, you may +want to create a rule nested at some other node in the hierarchy. It +doesn't matter what type is associated with that node (it doesn't have +to be a ``root`` node). + +It is also possible to create a rule that restricts data placement to +a specific *class* of device. By default, Ceph OSDs automatically +classify themselves as either ``hdd`` or ``ssd``, depending on the +underlying type of device being used. These classes can also be +customized. + +To create a replicated rule: + +.. prompt:: bash $ + + ceph osd crush rule create-replicated {name} {root} {failure-domain-type} [{class}] + +Where: + +``name`` + +:Description: The name of the rule +:Type: String +:Required: Yes +:Example: ``rbd-rule`` + +``root`` + +:Description: The name of the node under which data should be placed. +:Type: String +:Required: Yes +:Example: ``default`` + +``failure-domain-type`` + +:Description: The type of CRUSH nodes across which we should separate replicas. +:Type: String +:Required: Yes +:Example: ``rack`` + +``class`` + +:Description: The device class on which data should be placed. +:Type: String +:Required: No +:Example: ``ssd`` + +Creating a rule for an erasure coded pool +----------------------------------------- + +For an erasure-coded (EC) pool, the same basic decisions need to be made: +what is the failure domain, which node in the +hierarchy will data be placed under (usually ``default``), and will +placement be restricted to a specific device class. Erasure code +pools are created a bit differently, however, because they need to be +constructed carefully based on the erasure code being used. For this reason, +you must include this information in the *erasure code profile*. A CRUSH +rule will then be created from that either explicitly or automatically when +the profile is used to create a pool. + +The erasure code profiles can be listed with: + +.. prompt:: bash $ + + ceph osd erasure-code-profile ls + +An existing profile can be viewed with: + +.. prompt:: bash $ + + ceph osd erasure-code-profile get {profile-name} + +Normally profiles should never be modified; instead, a new profile +should be created and used when creating a new pool or creating a new +rule for an existing pool. + +An erasure code profile consists of a set of key=value pairs. Most of +these control the behavior of the erasure code that is encoding data +in the pool. Those that begin with ``crush-``, however, affect the +CRUSH rule that is created. + +The erasure code profile properties of interest are: + + * **crush-root**: the name of the CRUSH node under which to place data [default: ``default``]. + * **crush-failure-domain**: the CRUSH bucket type across which to distribute erasure-coded shards [default: ``host``]. + * **crush-device-class**: the device class on which to place data [default: none, meaning all devices are used]. + * **k** and **m** (and, for the ``lrc`` plugin, **l**): these determine the number of erasure code shards, affecting the resulting CRUSH rule. + +Once a profile is defined, you can create a CRUSH rule with: + +.. prompt:: bash $ + + ceph osd crush rule create-erasure {name} {profile-name} + +.. note: When creating a new pool, it is not actually necessary to + explicitly create the rule. If the erasure code profile alone is + specified and the rule argument is left off then Ceph will create + the CRUSH rule automatically. + +Deleting rules +-------------- + +Rules that are not in use by pools can be deleted with: + +.. prompt:: bash $ + + ceph osd crush rule rm {rule-name} + + +.. _crush-map-tunables: + +Tunables +======== + +Over time, we have made (and continue to make) improvements to the +CRUSH algorithm used to calculate the placement of data. In order to +support the change in behavior, we have introduced a series of tunable +options that control whether the legacy or improved variation of the +algorithm is used. + +In order to use newer tunables, both clients and servers must support +the new version of CRUSH. For this reason, we have created +``profiles`` that are named after the Ceph version in which they were +introduced. For example, the ``firefly`` tunables are first supported +by the Firefly release, and will not work with older (e.g., Dumpling) +clients. Once a given set of tunables are changed from the legacy +default behavior, the ``ceph-mon`` and ``ceph-osd`` will prevent older +clients who do not support the new CRUSH features from connecting to +the cluster. + +argonaut (legacy) +----------------- + +The legacy CRUSH behavior used by Argonaut and older releases works +fine for most clusters, provided there are not many OSDs that have +been marked out. + +bobtail (CRUSH_TUNABLES2) +------------------------- + +The ``bobtail`` tunable profile fixes a few key misbehaviors: + + * For hierarchies with a small number of devices in the leaf buckets, + some PGs map to fewer than the desired number of replicas. This + commonly happens for hierarchies with "host" nodes with a small + number (1-3) of OSDs nested beneath each one. + + * For large clusters, some small percentages of PGs map to fewer than + the desired number of OSDs. This is more prevalent when there are + mutiple hierarchy layers in use (e.g., ``row``, ``rack``, ``host``, ``osd``). + + * When some OSDs are marked out, the data tends to get redistributed + to nearby OSDs instead of across the entire hierarchy. + +The new tunables are: + + * ``choose_local_tries``: Number of local retries. Legacy value is + 2, optimal value is 0. + + * ``choose_local_fallback_tries``: Legacy value is 5, optimal value + is 0. + + * ``choose_total_tries``: Total number of attempts to choose an item. + Legacy value was 19, subsequent testing indicates that a value of + 50 is more appropriate for typical clusters. For extremely large + clusters, a larger value might be necessary. + + * ``chooseleaf_descend_once``: Whether a recursive chooseleaf attempt + will retry, or only try once and allow the original placement to + retry. Legacy default is 0, optimal value is 1. + +Migration impact: + + * Moving from ``argonaut`` to ``bobtail`` tunables triggers a moderate amount + of data movement. Use caution on a cluster that is already + populated with data. + +firefly (CRUSH_TUNABLES3) +------------------------- + +The ``firefly`` tunable profile fixes a problem +with ``chooseleaf`` CRUSH rule behavior that tends to result in PG +mappings with too few results when too many OSDs have been marked out. + +The new tunable is: + + * ``chooseleaf_vary_r``: Whether a recursive chooseleaf attempt will + start with a non-zero value of ``r``, based on how many attempts the + parent has already made. Legacy default is ``0``, but with this value + CRUSH is sometimes unable to find a mapping. The optimal value (in + terms of computational cost and correctness) is ``1``. + +Migration impact: + + * For existing clusters that house lots of data, changing + from ``0`` to ``1`` will cause a lot of data to move; a value of ``4`` or ``5`` + will allow CRUSH to still find a valid mapping but will cause less data + to move. + +straw_calc_version tunable (introduced with Firefly too) +-------------------------------------------------------- + +There were some problems with the internal weights calculated and +stored in the CRUSH map for ``straw`` algorithm buckets. Specifically, when +there were items with a CRUSH weight of ``0``, or both a mix of different and +unique weights, CRUSH would distribute data incorrectly (i.e., +not in proportion to the weights). + +The new tunable is: + + * ``straw_calc_version``: A value of ``0`` preserves the old, broken + internal weight calculation; a value of ``1`` fixes the behavior. + +Migration impact: + + * Moving to straw_calc_version ``1`` and then adjusting a straw bucket + (by adding, removing, or reweighting an item, or by using the + reweight-all command) can trigger a small to moderate amount of + data movement *if* the cluster has hit one of the problematic + conditions. + +This tunable option is special because it has absolutely no impact +concerning the required kernel version in the client side. + +hammer (CRUSH_V4) +----------------- + +The ``hammer`` tunable profile does not affect the +mapping of existing CRUSH maps simply by changing the profile. However: + + * There is a new bucket algorithm (``straw2``) supported. The new + ``straw2`` bucket algorithm fixes several limitations in the original + ``straw``. Specifically, the old ``straw`` buckets would + change some mappings that should have changed when a weight was + adjusted, while ``straw2`` achieves the original goal of only + changing mappings to or from the bucket item whose weight has + changed. + + * ``straw2`` is the default for any newly created buckets. + +Migration impact: + + * Changing a bucket type from ``straw`` to ``straw2`` will result in + a reasonably small amount of data movement, depending on how much + the bucket item weights vary from each other. When the weights are + all the same no data will move, and when item weights vary + significantly there will be more movement. + +jewel (CRUSH_TUNABLES5) +----------------------- + +The ``jewel`` tunable profile improves the +overall behavior of CRUSH such that significantly fewer mappings +change when an OSD is marked out of the cluster. This results in +significantly less data movement. + +The new tunable is: + + * ``chooseleaf_stable``: Whether a recursive chooseleaf attempt will + use a better value for an inner loop that greatly reduces the number + of mapping changes when an OSD is marked out. The legacy value is ``0``, + while the new value of ``1`` uses the new approach. + +Migration impact: + + * Changing this value on an existing cluster will result in a very + large amount of data movement as almost every PG mapping is likely + to change. + + + + +Which client versions support CRUSH_TUNABLES +-------------------------------------------- + + * argonaut series, v0.48.1 or later + * v0.49 or later + * Linux kernel version v3.6 or later (for the file system and RBD kernel clients) + +Which client versions support CRUSH_TUNABLES2 +--------------------------------------------- + + * v0.55 or later, including bobtail series (v0.56.x) + * Linux kernel version v3.9 or later (for the file system and RBD kernel clients) + +Which client versions support CRUSH_TUNABLES3 +--------------------------------------------- + + * v0.78 (firefly) or later + * Linux kernel version v3.15 or later (for the file system and RBD kernel clients) + +Which client versions support CRUSH_V4 +-------------------------------------- + + * v0.94 (hammer) or later + * Linux kernel version v4.1 or later (for the file system and RBD kernel clients) + +Which client versions support CRUSH_TUNABLES5 +--------------------------------------------- + + * v10.0.2 (jewel) or later + * Linux kernel version v4.5 or later (for the file system and RBD kernel clients) + +Warning when tunables are non-optimal +------------------------------------- + +Starting with version v0.74, Ceph will issue a health warning if the +current CRUSH tunables don't include all the optimal values from the +``default`` profile (see below for the meaning of the ``default`` profile). +To make this warning go away, you have two options: + +1. Adjust the tunables on the existing cluster. Note that this will + result in some data movement (possibly as much as 10%). This is the + preferred route, but should be taken with care on a production cluster + where the data movement may affect performance. You can enable optimal + tunables with: + + .. prompt:: bash $ + + ceph osd crush tunables optimal + + If things go poorly (e.g., too much load) and not very much + progress has been made, or there is a client compatibility problem + (old kernel CephFS or RBD clients, or pre-Bobtail ``librados`` + clients), you can switch back with: + + .. prompt:: bash $ + + ceph osd crush tunables legacy + +2. You can make the warning go away without making any changes to CRUSH by + adding the following option to your ceph.conf ``[mon]`` section:: + + mon warn on legacy crush tunables = false + + For the change to take effect, you will need to restart the monitors, or + apply the option to running monitors with: + + .. prompt:: bash $ + + ceph tell mon.\* config set mon_warn_on_legacy_crush_tunables false + + +A few important points +---------------------- + + * Adjusting these values will result in the shift of some PGs between + storage nodes. If the Ceph cluster is already storing a lot of + data, be prepared for some fraction of the data to move. + * The ``ceph-osd`` and ``ceph-mon`` daemons will start requiring the + feature bits of new connections as soon as they get + the updated map. However, already-connected clients are + effectively grandfathered in, and will misbehave if they do not + support the new feature. + * If the CRUSH tunables are set to non-legacy values and then later + changed back to the default values, ``ceph-osd`` daemons will not be + required to support the feature. However, the OSD peering process + requires examining and understanding old maps. Therefore, you + should not run old versions of the ``ceph-osd`` daemon + if the cluster has previously used non-legacy CRUSH values, even if + the latest version of the map has been switched back to using the + legacy defaults. + +Tuning CRUSH +------------ + +The simplest way to adjust CRUSH tunables is by applying them in matched +sets known as *profiles*. As of the Octopus release these are: + + * ``legacy``: the legacy behavior from argonaut and earlier. + * ``argonaut``: the legacy values supported by the original argonaut release + * ``bobtail``: the values supported by the bobtail release + * ``firefly``: the values supported by the firefly release + * ``hammer``: the values supported by the hammer release + * ``jewel``: the values supported by the jewel release + * ``optimal``: the best (i.e. optimal) values of the current version of Ceph + * ``default``: the default values of a new cluster installed from + scratch. These values, which depend on the current version of Ceph, + are hardcoded and are generally a mix of optimal and legacy values. + These values generally match the ``optimal`` profile of the previous + LTS release, or the most recent release for which we generally expect + most users to have up-to-date clients for. + +You can apply a profile to a running cluster with the command: + +.. prompt:: bash $ + + ceph osd crush tunables {PROFILE} + +Note that this may result in data movement, potentially quite a bit. Study +release notes and documentation carefully before changing the profile on a +running cluster, and consider throttling recovery/backfill parameters to +limit the impact of a bolus of backfill. + +.. _CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data: https://ceph.io/assets/pdfs/weil-crush-sc06.pdf + + +Primary Affinity +================ + +When a Ceph Client reads or writes data, it first contacts the primary OSD in +each affected PG's acting set. By default, the first OSD in the acting set is +the primary. For example, in the acting set ``[2, 3, 4]``, ``osd.2`` is +listed first and thus is the primary (aka lead) OSD. Sometimes we know that an +OSD is less well suited to act as the lead than are other OSDs (e.g., it has +a slow drive or a slow controller). To prevent performance bottlenecks +(especially on read operations) while maximizing utilization of your hardware, +you can influence the selection of primary OSDs by adjusting primary affinity +values, or by crafting a CRUSH rule that selects preferred OSDs first. + +Tuning primary OSD selection is mainly useful for replicated pools, because +by default read operations are served from the primary OSD for each PG. +For erasure coded (EC) pools, a way to speed up read operations is to enable +**fast read** as described in :ref:`pool-settings`. + +A common scenario for primary affinity is when a cluster contains +a mix of drive sizes, for example older racks with 1.9 TB SATA SSDS and newer racks with +3.84TB SATA SSDs. On average the latter will be assigned double the number of +PGs and thus will serve double the number of write and read operations, thus +they'll be busier than the former. A rough assignment of primary affinity +inversely proportional to OSD size won't be 100% optimal, but it can readily +achieve a 15% improvement in overall read throughput by utilizing SATA +interface bandwidth and CPU cycles more evenly. + +By default, all ceph OSDs have primary affinity of ``1``, which indicates that +any OSD may act as a primary with equal probability. + +You can reduce a Ceph OSD's primary affinity so that CRUSH is less likely to +choose the OSD as primary in a PG's acting set.: + +.. prompt:: bash $ + + ceph osd primary-affinity <osd-id> <weight> + +You may set an OSD's primary affinity to a real number in the range ``[0-1]``, +where ``0`` indicates that the OSD may **NOT** be used as a primary and ``1`` +indicates that an OSD may be used as a primary. When the weight is between +these extremes, it is less likely that CRUSH will select that OSD as a primary. +The process for selecting the lead OSD is more nuanced than a simple +probability based on relative affinity values, but measurable results can be +achieved even with first-order approximations of desirable values. + +Custom CRUSH Rules +------------------ + +There are occasional clusters that balance cost and performance by mixing SSDs +and HDDs in the same replicated pool. By setting the primary affinity of HDD +OSDs to ``0`` one can direct operations to the SSD in each acting set. An +alternative is to define a CRUSH rule that always selects an SSD OSD as the +first OSD, then selects HDDs for the remaining OSDs. Thus, each PG's acting +set will contain exactly one SSD OSD as the primary with the balance on HDDs. + +For example, the CRUSH rule below:: + + rule mixed_replicated_rule { + id 11 + type replicated + min_size 1 + max_size 10 + step take default class ssd + step chooseleaf firstn 1 type host + step emit + step take default class hdd + step chooseleaf firstn 0 type host + step emit + } + +chooses an SSD as the first OSD. Note that for an ``N``-times replicated pool +this rule selects ``N+1`` OSDs to guarantee that ``N`` copies are on different +hosts, because the first SSD OSD might be co-located with any of the ``N`` HDD +OSDs. + +This extra storage requirement can be avoided by placing SSDs and HDDs in +different hosts with the tradeoff that hosts with SSDs will receive all client +requests. You may thus consider faster CPU(s) for SSD hosts and more modest +ones for HDD nodes, since the latter will normally only service recovery +operations. Here the CRUSH roots ``ssd_hosts`` and ``hdd_hosts`` strictly +must not contain the same servers:: + + rule mixed_replicated_rule_two { + id 1 + type replicated + min_size 1 + max_size 10 + step take ssd_hosts class ssd + step chooseleaf firstn 1 type host + step emit + step take hdd_hosts class hdd + step chooseleaf firstn -1 type host + step emit + } + + +Note also that on failure of an SSD, requests to a PG will be served temporarily +from a (slower) HDD OSD until the PG's data has been replicated onto the replacement +primary SSD OSD. + diff --git a/doc/rados/operations/data-placement.rst b/doc/rados/operations/data-placement.rst new file mode 100644 index 000000000..bd9bd7ec7 --- /dev/null +++ b/doc/rados/operations/data-placement.rst @@ -0,0 +1,43 @@ +========================= + Data Placement Overview +========================= + +Ceph stores, replicates and rebalances data objects across a RADOS cluster +dynamically. With many different users storing objects in different pools for +different purposes on countless OSDs, Ceph operations require some data +placement planning. The main data placement planning concepts in Ceph include: + +- **Pools:** Ceph stores data within pools, which are logical groups for storing + objects. Pools manage the number of placement groups, the number of replicas, + and the CRUSH rule for the pool. To store data in a pool, you must have + an authenticated user with permissions for the pool. Ceph can snapshot pools. + See `Pools`_ for additional details. + +- **Placement Groups:** Ceph maps objects to placement groups (PGs). + Placement groups (PGs) are shards or fragments of a logical object pool + that place objects as a group into OSDs. Placement groups reduce the amount + of per-object metadata when Ceph stores the data in OSDs. A larger number of + placement groups (e.g., 100 per OSD) leads to better balancing. See + `Placement Groups`_ for additional details. + +- **CRUSH Maps:** CRUSH is a big part of what allows Ceph to scale without + performance bottlenecks, without limitations to scalability, and without a + single point of failure. CRUSH maps provide the physical topology of the + cluster to the CRUSH algorithm to determine where the data for an object + and its replicas should be stored, and how to do so across failure domains + for added data safety among other things. See `CRUSH Maps`_ for additional + details. + +- **Balancer:** The balancer is a feature that will automatically optimize the + distribution of PGs across devices to achieve a balanced data distribution, + maximizing the amount of data that can be stored in the cluster and evenly + distributing the workload across OSDs. + +When you initially set up a test cluster, you can use the default values. Once +you begin planning for a large Ceph cluster, refer to pools, placement groups +and CRUSH for data placement operations. + +.. _Pools: ../pools +.. _Placement Groups: ../placement-groups +.. _CRUSH Maps: ../crush-map +.. _Balancer: ../balancer diff --git a/doc/rados/operations/devices.rst b/doc/rados/operations/devices.rst new file mode 100644 index 000000000..1b6eaebde --- /dev/null +++ b/doc/rados/operations/devices.rst @@ -0,0 +1,208 @@ +.. _devices: + +Device Management +================= + +Ceph tracks which hardware storage devices (e.g., HDDs, SSDs) are consumed by +which daemons, and collects health metrics about those devices in order to +provide tools to predict and/or automatically respond to hardware failure. + +Device tracking +--------------- + +You can query which storage devices are in use with: + +.. prompt:: bash $ + + ceph device ls + +You can also list devices by daemon or by host: + +.. prompt:: bash $ + + ceph device ls-by-daemon <daemon> + ceph device ls-by-host <host> + +For any individual device, you can query information about its +location and how it is being consumed with: + +.. prompt:: bash $ + + ceph device info <devid> + +Identifying physical devices +---------------------------- + +You can blink the drive LEDs on hardware enclosures to make the replacement of +failed disks easy and less error-prone. Use the following command:: + + device light on|off <devid> [ident|fault] [--force] + +The ``<devid>`` parameter is the device identification. You can obtain this +information using the following command: + +.. prompt:: bash $ + + ceph device ls + +The ``[ident|fault]`` parameter is used to set the kind of light to blink. +By default, the `identification` light is used. + +.. note:: + This command needs the Cephadm or the Rook `orchestrator <https://docs.ceph.com/docs/master/mgr/orchestrator/#orchestrator-cli-module>`_ module enabled. + The orchestrator module enabled is shown by executing the following command: + + .. prompt:: bash $ + + ceph orch status + +The command behind the scene to blink the drive LEDs is `lsmcli`. If you need +to customize this command you can configure this via a Jinja2 template:: + + ceph config-key set mgr/cephadm/blink_device_light_cmd "<template>" + ceph config-key set mgr/cephadm/<host>/blink_device_light_cmd "lsmcli local-disk-{{ ident_fault }}-led-{{'on' if on else 'off'}} --path '{{ path or dev }}'" + +The Jinja2 template is rendered using the following arguments: + +* ``on`` + A boolean value. +* ``ident_fault`` + A string containing `ident` or `fault`. +* ``dev`` + A string containing the device ID, e.g. `SanDisk_X400_M.2_2280_512GB_162924424784`. +* ``path`` + A string containing the device path, e.g. `/dev/sda`. + +.. _enabling-monitoring: + +Enabling monitoring +------------------- + +Ceph can also monitor health metrics associated with your device. For +example, SATA hard disks implement a standard called SMART that +provides a wide range of internal metrics about the device's usage and +health, like the number of hours powered on, number of power cycles, +or unrecoverable read errors. Other device types like SAS and NVMe +implement a similar set of metrics (via slightly different standards). +All of these can be collected by Ceph via the ``smartctl`` tool. + +You can enable or disable health monitoring with: + +.. prompt:: bash $ + + ceph device monitoring on + +or: + +.. prompt:: bash $ + + ceph device monitoring off + + +Scraping +-------- + +If monitoring is enabled, metrics will automatically be scraped at regular intervals. That interval can be configured with: + +.. prompt:: bash $ + + ceph config set mgr mgr/devicehealth/scrape_frequency <seconds> + +The default is to scrape once every 24 hours. + +You can manually trigger a scrape of all devices with: + +.. prompt:: bash $ + + ceph device scrape-health-metrics + +A single device can be scraped with: + +.. prompt:: bash $ + + ceph device scrape-health-metrics <device-id> + +Or a single daemon's devices can be scraped with: + +.. prompt:: bash $ + + ceph device scrape-daemon-health-metrics <who> + +The stored health metrics for a device can be retrieved (optionally +for a specific timestamp) with: + +.. prompt:: bash $ + + ceph device get-health-metrics <devid> [sample-timestamp] + +Failure prediction +------------------ + +Ceph can predict life expectancy and device failures based on the +health metrics it collects. There are three modes: + +* *none*: disable device failure prediction. +* *local*: use a pre-trained prediction model from the ceph-mgr daemon + +The prediction mode can be configured with: + +.. prompt:: bash $ + + ceph config set global device_failure_prediction_mode <mode> + +Prediction normally runs in the background on a periodic basis, so it +may take some time before life expectancy values are populated. You +can see the life expectancy of all devices in output from: + +.. prompt:: bash $ + + ceph device ls + +You can also query the metadata for a specific device with: + +.. prompt:: bash $ + + ceph device info <devid> + +You can explicitly force prediction of a device's life expectancy with: + +.. prompt:: bash $ + + ceph device predict-life-expectancy <devid> + +If you are not using Ceph's internal device failure prediction but +have some external source of information about device failures, you +can inform Ceph of a device's life expectancy with: + +.. prompt:: bash $ + + ceph device set-life-expectancy <devid> <from> [<to>] + +Life expectancies are expressed as a time interval so that +uncertainty can be expressed in the form of a wide interval. The +interval end can also be left unspecified. + +Health alerts +------------- + +The ``mgr/devicehealth/warn_threshold`` controls how soon an expected +device failure must be before we generate a health warning. + +The stored life expectancy of all devices can be checked, and any +appropriate health alerts generated, with: + +.. prompt:: bash $ + + ceph device check-health + +Automatic Mitigation +-------------------- + +If the ``mgr/devicehealth/self_heal`` option is enabled (it is by +default), then for devices that are expected to fail soon the module +will automatically migrate data away from them by marking the devices +"out". + +The ``mgr/devicehealth/mark_out_threshold`` controls how soon an +expected device failure must be before we automatically mark an osd +"out". diff --git a/doc/rados/operations/erasure-code-clay.rst b/doc/rados/operations/erasure-code-clay.rst new file mode 100644 index 000000000..1cffa32f5 --- /dev/null +++ b/doc/rados/operations/erasure-code-clay.rst @@ -0,0 +1,240 @@ +================ +CLAY code plugin +================ + +CLAY (short for coupled-layer) codes are erasure codes designed to bring about significant savings +in terms of network bandwidth and disk IO when a failed node/OSD/rack is being repaired. Let: + + d = number of OSDs contacted during repair + +If *jerasure* is configured with *k=8* and *m=4*, losing one OSD requires +reading from the *d=8* others to repair. And recovery of say a 1GiB needs +a download of 8 X 1GiB = 8GiB of information. + +However, in the case of the *clay* plugin *d* is configurable within the limits: + + k+1 <= d <= k+m-1 + +By default, the clay code plugin picks *d=k+m-1* as it provides the greatest savings in terms +of network bandwidth and disk IO. In the case of the *clay* plugin configured with +*k=8*, *m=4* and *d=11* when a single OSD fails, d=11 osds are contacted and +250MiB is downloaded from each of them, resulting in a total download of 11 X 250MiB = 2.75GiB +amount of information. More general parameters are provided below. The benefits are substantial +when the repair is carried out for a rack that stores information on the order of +Terabytes. + + +-------------+---------------------------------------------------------+ + | plugin | total amount of disk IO | + +=============+=========================================================+ + |jerasure,isa | :math:`k S` | + +-------------+---------------------------------------------------------+ + | clay | :math:`\frac{d S}{d - k + 1} = \frac{(k + m - 1) S}{m}` | + +-------------+---------------------------------------------------------+ + +where *S* is the amount of data stored on a single OSD undergoing repair. In the table above, we have +used the largest possible value of *d* as this will result in the smallest amount of data download needed +to achieve recovery from an OSD failure. + +Erasure-code profile examples +============================= + +An example configuration that can be used to observe reduced bandwidth usage: + +.. prompt:: bash $ + + ceph osd erasure-code-profile set CLAYprofile \ + plugin=clay \ + k=4 m=2 d=5 \ + crush-failure-domain=host + ceph osd pool create claypool erasure CLAYprofile + + +Creating a clay profile +======================= + +To create a new clay code profile: + +.. prompt:: bash $ + + ceph osd erasure-code-profile set {name} \ + plugin=clay \ + k={data-chunks} \ + m={coding-chunks} \ + [d={helper-chunks}] \ + [scalar_mds={plugin-name}] \ + [technique={technique-name}] \ + [crush-failure-domain={bucket-type}] \ + [crush-device-class={device-class}] \ + [directory={directory}] \ + [--force] + +Where: + +``k={data chunks}`` + +:Description: Each object is split into **data-chunks** parts, + each of which is stored on a different OSD. + +:Type: Integer +:Required: Yes. +:Example: 4 + +``m={coding-chunks}`` + +:Description: Compute **coding chunks** for each object and store them + on different OSDs. The number of coding chunks is also + the number of OSDs that can be down without losing data. + +:Type: Integer +:Required: Yes. +:Example: 2 + +``d={helper-chunks}`` + +:Description: Number of OSDs requested to send data during recovery of + a single chunk. *d* needs to be chosen such that + k+1 <= d <= k+m-1. The larger the *d*, the better the savings. + +:Type: Integer +:Required: No. +:Default: k+m-1 + +``scalar_mds={jerasure|isa|shec}`` + +:Description: **scalar_mds** specifies the plugin that is used as a + building block in the layered construction. It can be + one of *jerasure*, *isa*, *shec* + +:Type: String +:Required: No. +:Default: jerasure + +``technique={technique}`` + +:Description: **technique** specifies the technique that will be picked + within the 'scalar_mds' plugin specified. Supported techniques + are 'reed_sol_van', 'reed_sol_r6_op', 'cauchy_orig', + 'cauchy_good', 'liber8tion' for jerasure, 'reed_sol_van', + 'cauchy' for isa and 'single', 'multiple' for shec. + +:Type: String +:Required: No. +:Default: reed_sol_van (for jerasure, isa), single (for shec) + + +``crush-root={root}`` + +:Description: The name of the crush bucket used for the first step of + the CRUSH rule. For instance **step take default**. + +:Type: String +:Required: No. +:Default: default + + +``crush-failure-domain={bucket-type}`` + +:Description: Ensure that no two chunks are in a bucket with the same + failure domain. For instance, if the failure domain is + **host** no two chunks will be stored on the same + host. It is used to create a CRUSH rule step such as **step + chooseleaf host**. + +:Type: String +:Required: No. +:Default: host + +``crush-device-class={device-class}`` + +:Description: Restrict placement to devices of a specific class (e.g., + ``ssd`` or ``hdd``), using the crush device class names + in the CRUSH map. + +:Type: String +:Required: No. +:Default: + +``directory={directory}`` + +:Description: Set the **directory** name from which the erasure code + plugin is loaded. + +:Type: String +:Required: No. +:Default: /usr/lib/ceph/erasure-code + +``--force`` + +:Description: Override an existing profile by the same name. + +:Type: String +:Required: No. + + +Notion of sub-chunks +==================== + +The Clay code is able to save in terms of disk IO, network bandwidth as it +is a vector code and it is able to view and manipulate data within a chunk +at a finer granularity termed as a sub-chunk. The number of sub-chunks within +a chunk for a Clay code is given by: + + sub-chunk count = :math:`q^{\frac{k+m}{q}}`, where :math:`q = d - k + 1` + + +During repair of an OSD, the helper information requested +from an available OSD is only a fraction of a chunk. In fact, the number +of sub-chunks within a chunk that are accessed during repair is given by: + + repair sub-chunk count = :math:`\frac{sub---chunk \: count}{q}` + +Examples +-------- + +#. For a configuration with *k=4*, *m=2*, *d=5*, the sub-chunk count is + 8 and the repair sub-chunk count is 4. Therefore, only half of a chunk is read + during repair. +#. When *k=8*, *m=4*, *d=11* the sub-chunk count is 64 and repair sub-chunk count + is 16. A quarter of a chunk is read from an available OSD for repair of a failed + chunk. + + + +How to choose a configuration given a workload +============================================== + +Only a few sub-chunks are read of all the sub-chunks within a chunk. These sub-chunks +are not necessarily stored consecutively within a chunk. For best disk IO +performance, it is helpful to read contiguous data. For this reason, it is suggested that +you choose stripe-size such that the sub-chunk size is sufficiently large. + +For a given stripe-size (that's fixed based on a workload), choose ``k``, ``m``, ``d`` such that: + + sub-chunk size = :math:`\frac{stripe-size}{k sub-chunk count}` = 4KB, 8KB, 12KB ... + +#. For large size workloads for which the stripe size is large, it is easy to choose k, m, d. + For example consider a stripe-size of size 64MB, choosing *k=16*, *m=4* and *d=19* will + result in a sub-chunk count of 1024 and a sub-chunk size of 4KB. +#. For small size workloads, *k=4*, *m=2* is a good configuration that provides both network + and disk IO benefits. + +Comparisons with LRC +==================== + +Locally Recoverable Codes (LRC) are also designed in order to save in terms of network +bandwidth, disk IO during single OSD recovery. However, the focus in LRCs is to keep the +number of OSDs contacted during repair (d) to be minimal, but this comes at the cost of storage overhead. +The *clay* code has a storage overhead m/k. In the case of an *lrc*, it stores (k+m)/d parities in +addition to the ``m`` parities resulting in a storage overhead (m+(k+m)/d)/k. Both *clay* and *lrc* +can recover from the failure of any ``m`` OSDs. + + +-----------------+----------------------------------+----------------------------------+ + | Parameters | disk IO, storage overhead (LRC) | disk IO, storage overhead (CLAY) | + +=================+================+=================+==================================+ + | (k=10, m=4) | 7 * S, 0.6 (d=7) | 3.25 * S, 0.4 (d=13) | + +-----------------+----------------------------------+----------------------------------+ + | (k=16, m=4) | 4 * S, 0.5625 (d=4) | 4.75 * S, 0.25 (d=19) | + +-----------------+----------------------------------+----------------------------------+ + + +where ``S`` is the amount of data stored of single OSD being recovered. diff --git a/doc/rados/operations/erasure-code-isa.rst b/doc/rados/operations/erasure-code-isa.rst new file mode 100644 index 000000000..9a43f89a2 --- /dev/null +++ b/doc/rados/operations/erasure-code-isa.rst @@ -0,0 +1,107 @@ +======================= +ISA erasure code plugin +======================= + +The *isa* plugin encapsulates the `ISA +<https://01.org/intel%C2%AE-storage-acceleration-library-open-source-version/>`_ +library. + +Create an isa profile +===================== + +To create a new *isa* erasure code profile: + +.. prompt:: bash $ + + ceph osd erasure-code-profile set {name} \ + plugin=isa \ + technique={reed_sol_van|cauchy} \ + [k={data-chunks}] \ + [m={coding-chunks}] \ + [crush-root={root}] \ + [crush-failure-domain={bucket-type}] \ + [crush-device-class={device-class}] \ + [directory={directory}] \ + [--force] + +Where: + +``k={data chunks}`` + +:Description: Each object is split in **data-chunks** parts, + each stored on a different OSD. + +:Type: Integer +:Required: No. +:Default: 7 + +``m={coding-chunks}`` + +:Description: Compute **coding chunks** for each object and store them + on different OSDs. The number of coding chunks is also + the number of OSDs that can be down without losing data. + +:Type: Integer +:Required: No. +:Default: 3 + +``technique={reed_sol_van|cauchy}`` + +:Description: The ISA plugin comes in two `Reed Solomon + <https://en.wikipedia.org/wiki/Reed%E2%80%93Solomon_error_correction>`_ + forms. If *reed_sol_van* is set, it is `Vandermonde + <https://en.wikipedia.org/wiki/Vandermonde_matrix>`_, if + *cauchy* is set, it is `Cauchy + <https://en.wikipedia.org/wiki/Cauchy_matrix>`_. + +:Type: String +:Required: No. +:Default: reed_sol_van + +``crush-root={root}`` + +:Description: The name of the crush bucket used for the first step of + the CRUSH rule. For instance **step take default**. + +:Type: String +:Required: No. +:Default: default + +``crush-failure-domain={bucket-type}`` + +:Description: Ensure that no two chunks are in a bucket with the same + failure domain. For instance, if the failure domain is + **host** no two chunks will be stored on the same + host. It is used to create a CRUSH rule step such as **step + chooseleaf host**. + +:Type: String +:Required: No. +:Default: host + +``crush-device-class={device-class}`` + +:Description: Restrict placement to devices of a specific class (e.g., + ``ssd`` or ``hdd``), using the crush device class names + in the CRUSH map. + +:Type: String +:Required: No. +:Default: + +``directory={directory}`` + +:Description: Set the **directory** name from which the erasure code + plugin is loaded. + +:Type: String +:Required: No. +:Default: /usr/lib/ceph/erasure-code + +``--force`` + +:Description: Override an existing profile by the same name. + +:Type: String +:Required: No. + diff --git a/doc/rados/operations/erasure-code-jerasure.rst b/doc/rados/operations/erasure-code-jerasure.rst new file mode 100644 index 000000000..553afa09d --- /dev/null +++ b/doc/rados/operations/erasure-code-jerasure.rst @@ -0,0 +1,121 @@ +============================ +Jerasure erasure code plugin +============================ + +The *jerasure* plugin is the most generic and flexible plugin, it is +also the default for Ceph erasure coded pools. + +The *jerasure* plugin encapsulates the `Jerasure +<http://jerasure.org>`_ library. It is +recommended to read the *jerasure* documentation to get a better +understanding of the parameters. + +Create a jerasure profile +========================= + +To create a new *jerasure* erasure code profile: + +.. prompt:: bash $ + + ceph osd erasure-code-profile set {name} \ + plugin=jerasure \ + k={data-chunks} \ + m={coding-chunks} \ + technique={reed_sol_van|reed_sol_r6_op|cauchy_orig|cauchy_good|liberation|blaum_roth|liber8tion} \ + [crush-root={root}] \ + [crush-failure-domain={bucket-type}] \ + [crush-device-class={device-class}] \ + [directory={directory}] \ + [--force] + +Where: + +``k={data chunks}`` + +:Description: Each object is split in **data-chunks** parts, + each stored on a different OSD. + +:Type: Integer +:Required: Yes. +:Example: 4 + +``m={coding-chunks}`` + +:Description: Compute **coding chunks** for each object and store them + on different OSDs. The number of coding chunks is also + the number of OSDs that can be down without losing data. + +:Type: Integer +:Required: Yes. +:Example: 2 + +``technique={reed_sol_van|reed_sol_r6_op|cauchy_orig|cauchy_good|liberation|blaum_roth|liber8tion}`` + +:Description: The more flexible technique is *reed_sol_van* : it is + enough to set *k* and *m*. The *cauchy_good* technique + can be faster but you need to chose the *packetsize* + carefully. All of *reed_sol_r6_op*, *liberation*, + *blaum_roth*, *liber8tion* are *RAID6* equivalents in + the sense that they can only be configured with *m=2*. + +:Type: String +:Required: No. +:Default: reed_sol_van + +``packetsize={bytes}`` + +:Description: The encoding will be done on packets of *bytes* size at + a time. Choosing the right packet size is difficult. The + *jerasure* documentation contains extensive information + on this topic. + +:Type: Integer +:Required: No. +:Default: 2048 + +``crush-root={root}`` + +:Description: The name of the crush bucket used for the first step of + the CRUSH rule. For instance **step take default**. + +:Type: String +:Required: No. +:Default: default + +``crush-failure-domain={bucket-type}`` + +:Description: Ensure that no two chunks are in a bucket with the same + failure domain. For instance, if the failure domain is + **host** no two chunks will be stored on the same + host. It is used to create a CRUSH rule step such as **step + chooseleaf host**. + +:Type: String +:Required: No. +:Default: host + +``crush-device-class={device-class}`` + +:Description: Restrict placement to devices of a specific class (e.g., + ``ssd`` or ``hdd``), using the crush device class names + in the CRUSH map. + +:Type: String +:Required: No. + +``directory={directory}`` + +:Description: Set the **directory** name from which the erasure code + plugin is loaded. + +:Type: String +:Required: No. +:Default: /usr/lib/ceph/erasure-code + +``--force`` + +:Description: Override an existing profile by the same name. + +:Type: String +:Required: No. + diff --git a/doc/rados/operations/erasure-code-lrc.rst b/doc/rados/operations/erasure-code-lrc.rst new file mode 100644 index 000000000..5329603b9 --- /dev/null +++ b/doc/rados/operations/erasure-code-lrc.rst @@ -0,0 +1,388 @@ +====================================== +Locally repairable erasure code plugin +====================================== + +With the *jerasure* plugin, when an erasure coded object is stored on +multiple OSDs, recovering from the loss of one OSD requires reading +from *k* others. For instance if *jerasure* is configured with +*k=8* and *m=4*, recovering from the loss of one OSD requires reading +from eight others. + +The *lrc* erasure code plugin creates local parity chunks to enable +recovery using fewer surviving OSDs. For instance if *lrc* is configured with +*k=8*, *m=4* and *l=4*, it will create an additional parity chunk for +every four OSDs. When a single OSD is lost, it can be recovered with +only four OSDs instead of eight. + +Erasure code profile examples +============================= + +Reduce recovery bandwidth between hosts +--------------------------------------- + +Although it is probably not an interesting use case when all hosts are +connected to the same switch, reduced bandwidth usage can actually be +observed.: + +.. prompt:: bash $ + + ceph osd erasure-code-profile set LRCprofile \ + plugin=lrc \ + k=4 m=2 l=3 \ + crush-failure-domain=host + ceph osd pool create lrcpool erasure LRCprofile + + +Reduce recovery bandwidth between racks +--------------------------------------- + +In Firefly the bandwidth reduction will only be observed if the primary +OSD is in the same rack as the lost chunk.: + +.. prompt:: bash $ + + ceph osd erasure-code-profile set LRCprofile \ + plugin=lrc \ + k=4 m=2 l=3 \ + crush-locality=rack \ + crush-failure-domain=host + ceph osd pool create lrcpool erasure LRCprofile + + +Create an lrc profile +===================== + +To create a new lrc erasure code profile: + +.. prompt:: bash $ + + ceph osd erasure-code-profile set {name} \ + plugin=lrc \ + k={data-chunks} \ + m={coding-chunks} \ + l={locality} \ + [crush-root={root}] \ + [crush-locality={bucket-type}] \ + [crush-failure-domain={bucket-type}] \ + [crush-device-class={device-class}] \ + [directory={directory}] \ + [--force] + +Where: + +``k={data chunks}`` + +:Description: Each object is split in **data-chunks** parts, + each stored on a different OSD. + +:Type: Integer +:Required: Yes. +:Example: 4 + +``m={coding-chunks}`` + +:Description: Compute **coding chunks** for each object and store them + on different OSDs. The number of coding chunks is also + the number of OSDs that can be down without losing data. + +:Type: Integer +:Required: Yes. +:Example: 2 + +``l={locality}`` + +:Description: Group the coding and data chunks into sets of size + **locality**. For instance, for **k=4** and **m=2**, + when **locality=3** two groups of three are created. + Each set can be recovered without reading chunks + from another set. + +:Type: Integer +:Required: Yes. +:Example: 3 + +``crush-root={root}`` + +:Description: The name of the crush bucket used for the first step of + the CRUSH rule. For instance **step take default**. + +:Type: String +:Required: No. +:Default: default + +``crush-locality={bucket-type}`` + +:Description: The type of the CRUSH bucket in which each set of chunks + defined by **l** will be stored. For instance, if it is + set to **rack**, each group of **l** chunks will be + placed in a different rack. It is used to create a + CRUSH rule step such as **step choose rack**. If it is not + set, no such grouping is done. + +:Type: String +:Required: No. + +``crush-failure-domain={bucket-type}`` + +:Description: Ensure that no two chunks are in a bucket with the same + failure domain. For instance, if the failure domain is + **host** no two chunks will be stored on the same + host. It is used to create a CRUSH rule step such as **step + chooseleaf host**. + +:Type: String +:Required: No. +:Default: host + +``crush-device-class={device-class}`` + +:Description: Restrict placement to devices of a specific class (e.g., + ``ssd`` or ``hdd``), using the crush device class names + in the CRUSH map. + +:Type: String +:Required: No. +:Default: + +``directory={directory}`` + +:Description: Set the **directory** name from which the erasure code + plugin is loaded. + +:Type: String +:Required: No. +:Default: /usr/lib/ceph/erasure-code + +``--force`` + +:Description: Override an existing profile by the same name. + +:Type: String +:Required: No. + +Low level plugin configuration +============================== + +The sum of **k** and **m** must be a multiple of the **l** parameter. +The low level configuration parameters however do not enforce this +restriction and it may be advantageous to use them for specific +purposes. It is for instance possible to define two groups, one with 4 +chunks and another with 3 chunks. It is also possible to recursively +define locality sets, for instance datacenters and racks into +datacenters. The **k/m/l** are implemented by generating a low level +configuration. + +The *lrc* erasure code plugin recursively applies erasure code +techniques so that recovering from the loss of some chunks only +requires a subset of the available chunks, most of the time. + +For instance, when three coding steps are described as:: + + chunk nr 01234567 + step 1 _cDD_cDD + step 2 cDDD____ + step 3 ____cDDD + +where *c* are coding chunks calculated from the data chunks *D*, the +loss of chunk *7* can be recovered with the last four chunks. And the +loss of chunk *2* chunk can be recovered with the first four +chunks. + +Erasure code profile examples using low level configuration +=========================================================== + +Minimal testing +--------------- + +It is strictly equivalent to using a *K=2* *M=1* erasure code profile. The *DD* +implies *K=2*, the *c* implies *M=1* and the *jerasure* plugin is used +by default.: + +.. prompt:: bash $ + + ceph osd erasure-code-profile set LRCprofile \ + plugin=lrc \ + mapping=DD_ \ + layers='[ [ "DDc", "" ] ]' + ceph osd pool create lrcpool erasure LRCprofile + +Reduce recovery bandwidth between hosts +--------------------------------------- + +Although it is probably not an interesting use case when all hosts are +connected to the same switch, reduced bandwidth usage can actually be +observed. It is equivalent to **k=4**, **m=2** and **l=3** although +the layout of the chunks is different. **WARNING: PROMPTS ARE SELECTABLE** + +:: + + $ ceph osd erasure-code-profile set LRCprofile \ + plugin=lrc \ + mapping=__DD__DD \ + layers='[ + [ "_cDD_cDD", "" ], + [ "cDDD____", "" ], + [ "____cDDD", "" ], + ]' + $ ceph osd pool create lrcpool erasure LRCprofile + + +Reduce recovery bandwidth between racks +--------------------------------------- + +In Firefly the reduced bandwidth will only be observed if the primary OSD is in +the same rack as the lost chunk. **WARNING: PROMPTS ARE SELECTABLE** + +:: + + $ ceph osd erasure-code-profile set LRCprofile \ + plugin=lrc \ + mapping=__DD__DD \ + layers='[ + [ "_cDD_cDD", "" ], + [ "cDDD____", "" ], + [ "____cDDD", "" ], + ]' \ + crush-steps='[ + [ "choose", "rack", 2 ], + [ "chooseleaf", "host", 4 ], + ]' + + $ ceph osd pool create lrcpool erasure LRCprofile + +Testing with different Erasure Code backends +-------------------------------------------- + +LRC now uses jerasure as the default EC backend. It is possible to +specify the EC backend/algorithm on a per layer basis using the low +level configuration. The second argument in layers='[ [ "DDc", "" ] ]' +is actually an erasure code profile to be used for this level. The +example below specifies the ISA backend with the cauchy technique to +be used in the lrcpool.: + +.. prompt:: bash $ + + ceph osd erasure-code-profile set LRCprofile \ + plugin=lrc \ + mapping=DD_ \ + layers='[ [ "DDc", "plugin=isa technique=cauchy" ] ]' + ceph osd pool create lrcpool erasure LRCprofile + +You could also use a different erasure code profile for each +layer. **WARNING: PROMPTS ARE SELECTABLE** + +:: + + $ ceph osd erasure-code-profile set LRCprofile \ + plugin=lrc \ + mapping=__DD__DD \ + layers='[ + [ "_cDD_cDD", "plugin=isa technique=cauchy" ], + [ "cDDD____", "plugin=isa" ], + [ "____cDDD", "plugin=jerasure" ], + ]' + $ ceph osd pool create lrcpool erasure LRCprofile + + + +Erasure coding and decoding algorithm +===================================== + +The steps found in the layers description:: + + chunk nr 01234567 + + step 1 _cDD_cDD + step 2 cDDD____ + step 3 ____cDDD + +are applied in order. For instance, if a 4K object is encoded, it will +first go through *step 1* and be divided in four 1K chunks (the four +uppercase D). They are stored in the chunks 2, 3, 6 and 7, in +order. From these, two coding chunks are calculated (the two lowercase +c). The coding chunks are stored in the chunks 1 and 5, respectively. + +The *step 2* re-uses the content created by *step 1* in a similar +fashion and stores a single coding chunk *c* at position 0. The last four +chunks, marked with an underscore (*_*) for readability, are ignored. + +The *step 3* stores a single coding chunk *c* at position 4. The three +chunks created by *step 1* are used to compute this coding chunk, +i.e. the coding chunk from *step 1* becomes a data chunk in *step 3*. + +If chunk *2* is lost:: + + chunk nr 01234567 + + step 1 _c D_cDD + step 2 cD D____ + step 3 __ _cDDD + +decoding will attempt to recover it by walking the steps in reverse +order: *step 3* then *step 2* and finally *step 1*. + +The *step 3* knows nothing about chunk *2* (i.e. it is an underscore) +and is skipped. + +The coding chunk from *step 2*, stored in chunk *0*, allows it to +recover the content of chunk *2*. There are no more chunks to recover +and the process stops, without considering *step 1*. + +Recovering chunk *2* requires reading chunks *0, 1, 3* and writing +back chunk *2*. + +If chunk *2, 3, 6* are lost:: + + chunk nr 01234567 + + step 1 _c _c D + step 2 cD __ _ + step 3 __ cD D + +The *step 3* can recover the content of chunk *6*:: + + chunk nr 01234567 + + step 1 _c _cDD + step 2 cD ____ + step 3 __ cDDD + +The *step 2* fails to recover and is skipped because there are two +chunks missing (*2, 3*) and it can only recover from one missing +chunk. + +The coding chunk from *step 1*, stored in chunk *1, 5*, allows it to +recover the content of chunk *2, 3*:: + + chunk nr 01234567 + + step 1 _cDD_cDD + step 2 cDDD____ + step 3 ____cDDD + +Controlling CRUSH placement +=========================== + +The default CRUSH rule provides OSDs that are on different hosts. For instance:: + + chunk nr 01234567 + + step 1 _cDD_cDD + step 2 cDDD____ + step 3 ____cDDD + +needs exactly *8* OSDs, one for each chunk. If the hosts are in two +adjacent racks, the first four chunks can be placed in the first rack +and the last four in the second rack. So that recovering from the loss +of a single OSD does not require using bandwidth between the two +racks. + +For instance:: + + crush-steps='[ [ "choose", "rack", 2 ], [ "chooseleaf", "host", 4 ] ]' + +will create a rule that will select two crush buckets of type +*rack* and for each of them choose four OSDs, each of them located in +different buckets of type *host*. + +The CRUSH rule can also be manually crafted for finer control. diff --git a/doc/rados/operations/erasure-code-profile.rst b/doc/rados/operations/erasure-code-profile.rst new file mode 100644 index 000000000..45b071f8a --- /dev/null +++ b/doc/rados/operations/erasure-code-profile.rst @@ -0,0 +1,126 @@ +.. _erasure-code-profiles: + +===================== +Erasure code profiles +===================== + +Erasure code is defined by a **profile** and is used when creating an +erasure coded pool and the associated CRUSH rule. + +The **default** erasure code profile (which is created when the Ceph +cluster is initialized) will split the data into 2 equal-sized chunks, +and have 2 parity chunks of the same size. It will take as much space +in the cluster as a 2-replica pool but can sustain the data loss of 2 +chunks out of 4. It is described as a profile with **k=2** and **m=2**, +meaning the information is spread over four OSD (k+m == 4) and two of +them can be lost. + +To improve redundancy without increasing raw storage requirements, a +new profile can be created. For instance, a profile with **k=10** and +**m=4** can sustain the loss of four (**m=4**) OSDs by distributing an +object on fourteen (k+m=14) OSDs. The object is first divided in +**10** chunks (if the object is 10MB, each chunk is 1MB) and **4** +coding chunks are computed, for recovery (each coding chunk has the +same size as the data chunk, i.e. 1MB). The raw space overhead is only +40% and the object will not be lost even if four OSDs break at the +same time. + +.. _list of available plugins: + +.. toctree:: + :maxdepth: 1 + + erasure-code-jerasure + erasure-code-isa + erasure-code-lrc + erasure-code-shec + erasure-code-clay + +osd erasure-code-profile set +============================ + +To create a new erasure code profile:: + + ceph osd erasure-code-profile set {name} \ + [{directory=directory}] \ + [{plugin=plugin}] \ + [{stripe_unit=stripe_unit}] \ + [{key=value} ...] \ + [--force] + +Where: + +``{directory=directory}`` + +:Description: Set the **directory** name from which the erasure code + plugin is loaded. + +:Type: String +:Required: No. +:Default: /usr/lib/ceph/erasure-code + +``{plugin=plugin}`` + +:Description: Use the erasure code **plugin** to compute coding chunks + and recover missing chunks. See the `list of available + plugins`_ for more information. + +:Type: String +:Required: No. +:Default: jerasure + +``{stripe_unit=stripe_unit}`` + +:Description: The amount of data in a data chunk, per stripe. For + example, a profile with 2 data chunks and stripe_unit=4K + would put the range 0-4K in chunk 0, 4K-8K in chunk 1, + then 8K-12K in chunk 0 again. This should be a multiple + of 4K for best performance. The default value is taken + from the monitor config option + ``osd_pool_erasure_code_stripe_unit`` when a pool is + created. The stripe_width of a pool using this profile + will be the number of data chunks multiplied by this + stripe_unit. + +:Type: String +:Required: No. + +``{key=value}`` + +:Description: The semantic of the remaining key/value pairs is defined + by the erasure code plugin. + +:Type: String +:Required: No. + +``--force`` + +:Description: Override an existing profile by the same name, and allow + setting a non-4K-aligned stripe_unit. + +:Type: String +:Required: No. + +osd erasure-code-profile rm +============================ + +To remove an erasure code profile:: + + ceph osd erasure-code-profile rm {name} + +If the profile is referenced by a pool, the deletion will fail. + +osd erasure-code-profile get +============================ + +To display an erasure code profile:: + + ceph osd erasure-code-profile get {name} + +osd erasure-code-profile ls +=========================== + +To list the names of all erasure code profiles:: + + ceph osd erasure-code-profile ls + diff --git a/doc/rados/operations/erasure-code-shec.rst b/doc/rados/operations/erasure-code-shec.rst new file mode 100644 index 000000000..4e8f59b0b --- /dev/null +++ b/doc/rados/operations/erasure-code-shec.rst @@ -0,0 +1,145 @@ +======================== +SHEC erasure code plugin +======================== + +The *shec* plugin encapsulates the `multiple SHEC +<http://tracker.ceph.com/projects/ceph/wiki/Shingled_Erasure_Code_(SHEC)>`_ +library. It allows ceph to recover data more efficiently than Reed Solomon codes. + +Create an SHEC profile +====================== + +To create a new *shec* erasure code profile: + +.. prompt:: bash $ + + ceph osd erasure-code-profile set {name} \ + plugin=shec \ + [k={data-chunks}] \ + [m={coding-chunks}] \ + [c={durability-estimator}] \ + [crush-root={root}] \ + [crush-failure-domain={bucket-type}] \ + [crush-device-class={device-class}] \ + [directory={directory}] \ + [--force] + +Where: + +``k={data-chunks}`` + +:Description: Each object is split in **data-chunks** parts, + each stored on a different OSD. + +:Type: Integer +:Required: No. +:Default: 4 + +``m={coding-chunks}`` + +:Description: Compute **coding-chunks** for each object and store them on + different OSDs. The number of **coding-chunks** does not necessarily + equal the number of OSDs that can be down without losing data. + +:Type: Integer +:Required: No. +:Default: 3 + +``c={durability-estimator}`` + +:Description: The number of parity chunks each of which includes each data chunk in its + calculation range. The number is used as a **durability estimator**. + For instance, if c=2, 2 OSDs can be down without losing data. + +:Type: Integer +:Required: No. +:Default: 2 + +``crush-root={root}`` + +:Description: The name of the crush bucket used for the first step of + the CRUSH rule. For instance **step take default**. + +:Type: String +:Required: No. +:Default: default + +``crush-failure-domain={bucket-type}`` + +:Description: Ensure that no two chunks are in a bucket with the same + failure domain. For instance, if the failure domain is + **host** no two chunks will be stored on the same + host. It is used to create a CRUSH rule step such as **step + chooseleaf host**. + +:Type: String +:Required: No. +:Default: host + +``crush-device-class={device-class}`` + +:Description: Restrict placement to devices of a specific class (e.g., + ``ssd`` or ``hdd``), using the crush device class names + in the CRUSH map. + +:Type: String +:Required: No. +:Default: + +``directory={directory}`` + +:Description: Set the **directory** name from which the erasure code + plugin is loaded. + +:Type: String +:Required: No. +:Default: /usr/lib/ceph/erasure-code + +``--force`` + +:Description: Override an existing profile by the same name. + +:Type: String +:Required: No. + +Brief description of SHEC's layouts +=================================== + +Space Efficiency +---------------- + +Space efficiency is a ratio of data chunks to all ones in a object and +represented as k/(k+m). +In order to improve space efficiency, you should increase k or decrease m: + + space efficiency of SHEC(4,3,2) = :math:`\frac{4}{4+3}` = 0.57 + SHEC(5,3,2) or SHEC(4,2,2) improves SHEC(4,3,2)'s space efficiency + +Durability +---------- + +The third parameter of SHEC (=c) is a durability estimator, which approximates +the number of OSDs that can be down without losing data. + +``durability estimator of SHEC(4,3,2) = 2`` + +Recovery Efficiency +------------------- + +Describing calculation of recovery efficiency is beyond the scope of this document, +but at least increasing m without increasing c achieves improvement of recovery efficiency. +(However, we must pay attention to the sacrifice of space efficiency in this case.) + +``SHEC(4,2,2) -> SHEC(4,3,2) : achieves improvement of recovery efficiency`` + +Erasure code profile examples +============================= + + +.. prompt:: bash $ + + ceph osd erasure-code-profile set SHECprofile \ + plugin=shec \ + k=8 m=4 c=3 \ + crush-failure-domain=host + ceph osd pool create shecpool erasure SHECprofile diff --git a/doc/rados/operations/erasure-code.rst b/doc/rados/operations/erasure-code.rst new file mode 100644 index 000000000..1dea23c35 --- /dev/null +++ b/doc/rados/operations/erasure-code.rst @@ -0,0 +1,262 @@ +.. _ecpool: + +============= + Erasure code +============= + +By default, Ceph `pools <../pools>`_ are created with the type "replicated". In +replicated-type pools, every object is copied to multiple disks (this +multiple copying is the "replication"). + +In contrast, `erasure-coded <https://en.wikipedia.org/wiki/Erasure_code>`_ +pools use a method of data protection that is different from replication. In +erasure coding, data is broken into fragments of two kinds: data blocks and +parity blocks. If a drive fails or becomes corrupted, the parity blocks are +used to rebuild the data. At scale, erasure coding saves space relative to +replication. + +In this documentation, data blocks are referred to as "data chunks" +and parity blocks are referred to as "encoding chunks". + +Erasure codes are also called "forward error correction codes". The +first forward error correction code was developed in 1950 by Richard +Hamming at Bell Laboratories. + + +Creating a sample erasure coded pool +------------------------------------ + +The simplest erasure coded pool is equivalent to `RAID5 +<https://en.wikipedia.org/wiki/Standard_RAID_levels#RAID_5>`_ and +requires at least three hosts: + +.. prompt:: bash $ + + ceph osd pool create ecpool erasure + +:: + + pool 'ecpool' created + +.. prompt:: bash $ + + echo ABCDEFGHI | rados --pool ecpool put NYAN - + rados --pool ecpool get NYAN - + +:: + + ABCDEFGHI + +Erasure code profiles +--------------------- + +The default erasure code profile can sustain the loss of two OSDs. This erasure +code profile is equivalent to a replicated pool of size three, but requires +2TB to store 1TB of data instead of 3TB to store 1TB of data. The default +profile can be displayed with this command: + +.. prompt:: bash $ + + ceph osd erasure-code-profile get default + +:: + + k=2 + m=2 + plugin=jerasure + crush-failure-domain=host + technique=reed_sol_van + +.. note:: + The default erasure-coded pool, the profile of which is displayed here, is + not the same as the simplest erasure-coded pool. + + The default erasure-coded pool has two data chunks (k) and two coding chunks + (m). The profile of the default erasure-coded pool is "k=2 m=2". + + The simplest erasure-coded pool has two data chunks (k) and one coding chunk + (m). The profile of the simplest erasure-coded pool is "k=2 m=1". + +Choosing the right profile is important because the profile cannot be modified +after the pool is created. If you find that you need an erasure-coded pool with +a profile different than the one you have created, you must create a new pool +with a different (and presumably more carefully-considered) profile. When the +new pool is created, all objects from the wrongly-configured pool must be moved +to the newly-created pool. There is no way to alter the profile of a pool after its creation. + +The most important parameters of the profile are *K*, *M* and +*crush-failure-domain* because they define the storage overhead and +the data durability. For example, if the desired architecture must +sustain the loss of two racks with a storage overhead of 67% overhead, +the following profile can be defined: + +.. prompt:: bash $ + + ceph osd erasure-code-profile set myprofile \ + k=3 \ + m=2 \ + crush-failure-domain=rack + ceph osd pool create ecpool erasure myprofile + echo ABCDEFGHI | rados --pool ecpool put NYAN - + rados --pool ecpool get NYAN - + +:: + + ABCDEFGHI + +The *NYAN* object will be divided in three (*K=3*) and two additional +*chunks* will be created (*M=2*). The value of *M* defines how many +OSD can be lost simultaneously without losing any data. The +*crush-failure-domain=rack* will create a CRUSH rule that ensures +no two *chunks* are stored in the same rack. + +.. ditaa:: + +-------------------+ + name | NYAN | + +-------------------+ + content | ABCDEFGHI | + +--------+----------+ + | + | + v + +------+------+ + +---------------+ encode(3,2) +-----------+ + | +--+--+---+---+ | + | | | | | + | +-------+ | +-----+ | + | | | | | + +--v---+ +--v---+ +--v---+ +--v---+ +--v---+ + name | NYAN | | NYAN | | NYAN | | NYAN | | NYAN | + +------+ +------+ +------+ +------+ +------+ + shard | 1 | | 2 | | 3 | | 4 | | 5 | + +------+ +------+ +------+ +------+ +------+ + content | ABC | | DEF | | GHI | | YXY | | QGC | + +--+---+ +--+---+ +--+---+ +--+---+ +--+---+ + | | | | | + | | v | | + | | +--+---+ | | + | | | OSD1 | | | + | | +------+ | | + | | | | + | | +------+ | | + | +------>| OSD2 | | | + | +------+ | | + | | | + | +------+ | | + | | OSD3 |<----+ | + | +------+ | + | | + | +------+ | + | | OSD4 |<--------------+ + | +------+ + | + | +------+ + +----------------->| OSD5 | + +------+ + + +More information can be found in the `erasure code profiles +<../erasure-code-profile>`_ documentation. + + +Erasure Coding with Overwrites +------------------------------ + +By default, erasure coded pools only work with uses like RGW that +perform full object writes and appends. + +Since Luminous, partial writes for an erasure coded pool may be +enabled with a per-pool setting. This lets RBD and CephFS store their +data in an erasure coded pool: + +.. prompt:: bash $ + + ceph osd pool set ec_pool allow_ec_overwrites true + +This can only be enabled on a pool residing on bluestore OSDs, since +bluestore's checksumming is used to detect bitrot or other corruption +during deep-scrub. In addition to being unsafe, using filestore with +ec overwrites yields low performance compared to bluestore. + +Erasure coded pools do not support omap, so to use them with RBD and +CephFS you must instruct them to store their data in an ec pool, and +their metadata in a replicated pool. For RBD, this means using the +erasure coded pool as the ``--data-pool`` during image creation: + +.. prompt:: bash $ + + rbd create --size 1G --data-pool ec_pool replicated_pool/image_name + +For CephFS, an erasure coded pool can be set as the default data pool during +file system creation or via `file layouts <../../../cephfs/file-layouts>`_. + + +Erasure coded pool and cache tiering +------------------------------------ + +Erasure coded pools require more resources than replicated pools and +lack some functionalities such as omap. To overcome these +limitations, one can set up a `cache tier <../cache-tiering>`_ +before the erasure coded pool. + +For instance, if the pool *hot-storage* is made of fast storage: + +.. prompt:: bash $ + + ceph osd tier add ecpool hot-storage + ceph osd tier cache-mode hot-storage writeback + ceph osd tier set-overlay ecpool hot-storage + +will place the *hot-storage* pool as tier of *ecpool* in *writeback* +mode so that every write and read to the *ecpool* are actually using +the *hot-storage* and benefit from its flexibility and speed. + +More information can be found in the `cache tiering +<../cache-tiering>`_ documentation. + +Erasure coded pool recovery +--------------------------- +If an erasure coded pool loses some shards, it must recover them from the others. +This generally involves reading from the remaining shards, reconstructing the data, and +writing it to the new peer. +In Octopus, erasure coded pools can recover as long as there are at least *K* shards +available. (With fewer than *K* shards, you have actually lost data!) + +Prior to Octopus, erasure coded pools required at least *min_size* shards to be +available, even if *min_size* is greater than *K*. (We generally recommend min_size +be *K+2* or more to prevent loss of writes and data.) +This conservative decision was made out of an abundance of caution when designing the new pool +mode but also meant pools with lost OSDs but no data loss were unable to recover and go active +without manual intervention to change the *min_size*. + +Glossary +-------- + +*chunk* + when the encoding function is called, it returns chunks of the same + size. Data chunks which can be concatenated to reconstruct the original + object and coding chunks which can be used to rebuild a lost chunk. + +*K* + the number of data *chunks*, i.e. the number of *chunks* in which the + original object is divided. For instance if *K* = 2 a 10KB object + will be divided into *K* objects of 5KB each. + +*M* + the number of coding *chunks*, i.e. the number of additional *chunks* + computed by the encoding functions. If there are 2 coding *chunks*, + it means 2 OSDs can be out without losing data. + + +Table of content +---------------- + +.. toctree:: + :maxdepth: 1 + + erasure-code-profile + erasure-code-jerasure + erasure-code-isa + erasure-code-lrc + erasure-code-shec + erasure-code-clay diff --git a/doc/rados/operations/health-checks.rst b/doc/rados/operations/health-checks.rst new file mode 100644 index 000000000..a8fa8243f --- /dev/null +++ b/doc/rados/operations/health-checks.rst @@ -0,0 +1,1549 @@ +.. _health-checks: + +============= +Health checks +============= + +Overview +======== + +There is a finite set of possible health messages that a Ceph cluster can +raise -- these are defined as *health checks* which have unique identifiers. + +The identifier is a terse pseudo-human-readable (i.e. like a variable name) +string. It is intended to enable tools (such as UIs) to make sense of +health checks, and present them in a way that reflects their meaning. + +This page lists the health checks that are raised by the monitor and manager +daemons. In addition to these, you may also see health checks that originate +from MDS daemons (see :ref:`cephfs-health-messages`), and health checks +that are defined by ceph-mgr python modules. + +Definitions +=========== + +Monitor +------- + +DAEMON_OLD_VERSION +__________________ + +Warn if old version(s) of Ceph are running on any daemons. +It will generate a health error if multiple versions are detected. +This condition must exist for over mon_warn_older_version_delay (set to 1 week by default) in order for the +health condition to be triggered. This allows most upgrades to proceed +without falsely seeing the warning. If upgrade is paused for an extended +time period, health mute can be used like this +"ceph health mute DAEMON_OLD_VERSION --sticky". In this case after +upgrade has finished use "ceph health unmute DAEMON_OLD_VERSION". + +MON_DOWN +________ + +One or more monitor daemons is currently down. The cluster requires a +majority (more than 1/2) of the monitors in order to function. When +one or more monitors are down, clients may have a harder time forming +their initial connection to the cluster as they may need to try more +addresses before they reach an operating monitor. + +The down monitor daemon should generally be restarted as soon as +possible to reduce the risk of a subsequen monitor failure leading to +a service outage. + +MON_CLOCK_SKEW +______________ + +The clocks on the hosts running the ceph-mon monitor daemons are not +sufficiently well synchronized. This health alert is raised if the +cluster detects a clock skew greater than ``mon_clock_drift_allowed``. + +This is best resolved by synchronizing the clocks using a tool like +``ntpd`` or ``chrony``. + +If it is impractical to keep the clocks closely synchronized, the +``mon_clock_drift_allowed`` threshold can also be increased, but this +value must stay significantly below the ``mon_lease`` interval in +order for monitor cluster to function properly. + +MON_MSGR2_NOT_ENABLED +_____________________ + +The ``ms_bind_msgr2`` option is enabled but one or more monitors is +not configured to bind to a v2 port in the cluster's monmap. This +means that features specific to the msgr2 protocol (e.g., encryption) +are not available on some or all connections. + +In most cases this can be corrected by issuing the command: + +.. prompt:: bash $ + + ceph mon enable-msgr2 + +That command will change any monitor configured for the old default +port 6789 to continue to listen for v1 connections on 6789 and also +listen for v2 connections on the new default 3300 port. + +If a monitor is configured to listen for v1 connections on a non-standard port (not 6789), then the monmap will need to be modified manually. + + +MON_DISK_LOW +____________ + +One or more monitors is low on disk space. This alert triggers if the +available space on the file system storing the monitor database +(normally ``/var/lib/ceph/mon``), as a percentage, drops below +``mon_data_avail_warn`` (default: 30%). + +This may indicate that some other process or user on the system is +filling up the same file system used by the monitor. It may also +indicate that the monitors database is large (see ``MON_DISK_BIG`` +below). + +If space cannot be freed, the monitor's data directory may need to be +moved to another storage device or file system (while the monitor +daemon is not running, of course). + + +MON_DISK_CRIT +_____________ + +One or more monitors is critically low on disk space. This alert +triggers if the available space on the file system storing the monitor +database (normally ``/var/lib/ceph/mon``), as a percentage, drops +below ``mon_data_avail_crit`` (default: 5%). See ``MON_DISK_LOW``, above. + +MON_DISK_BIG +____________ + +The database size for one or more monitors is very large. This alert +triggers if the size of the monitor's database is larger than +``mon_data_size_warn`` (default: 15 GiB). + +A large database is unusual, but may not necessarily indicate a +problem. Monitor databases may grow in size when there are placement +groups that have not reached an ``active+clean`` state in a long time. + +This may also indicate that the monitor's database is not properly +compacting, which has been observed with some older versions of +leveldb and rocksdb. Forcing a compaction with ``ceph daemon mon.<id> +compact`` may shrink the on-disk size. + +This warning may also indicate that the monitor has a bug that is +preventing it from pruning the cluster metadata it stores. If the +problem persists, please report a bug. + +The warning threshold may be adjusted with: + +.. prompt:: bash $ + + ceph config set global mon_data_size_warn <size> + +AUTH_INSECURE_GLOBAL_ID_RECLAIM +_______________________________ + +One or more clients or daemons are connected to the cluster that are +not securely reclaiming their global_id (a unique number identifying +each entity in the cluster) when reconnecting to a monitor. The +client is being permitted to connect anyway because the +``auth_allow_insecure_global_id_reclaim`` option is set to true (which may +be necessary until all ceph clients have been upgraded), and the +``auth_expose_insecure_global_id_reclaim`` option set to ``true`` (which +allows monitors to detect clients with insecure reclaim early by forcing them to +reconnect right after they first authenticate). + +You can identify which client(s) are using unpatched ceph client code with: + +.. prompt:: bash $ + + ceph health detail + +Clients global_id reclaim rehavior can also seen in the +``global_id_status`` field in the dump of clients connected to an +individual monitor (``reclaim_insecure`` means the client is +unpatched and is contributing to this health alert): + +.. prompt:: bash $ + + ceph tell mon.\* sessions + +We strongly recommend that all clients in the system are upgraded to a +newer version of Ceph that correctly reclaims global_id values. Once +all clients have been updated, you can stop allowing insecure reconnections +with: + +.. prompt:: bash $ + + ceph config set mon auth_allow_insecure_global_id_reclaim false + +If it is impractical to upgrade all clients immediately, you can silence +this warning temporarily with: + +.. prompt:: bash $ + + ceph health mute AUTH_INSECURE_GLOBAL_ID_RECLAIM 1w # 1 week + +Although we do NOT recommend doing so, you can also disable this warning +indefinitely with: + +.. prompt:: bash $ + + ceph config set mon mon_warn_on_insecure_global_id_reclaim false + +AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED +_______________________________________ + +Ceph is currently configured to allow clients to reconnect to monitors using +an insecure process to reclaim their previous global_id because the setting +``auth_allow_insecure_global_id_reclaim`` is set to ``true``. It may be necessary to +leave this setting enabled while existing Ceph clients are upgraded to newer +versions of Ceph that correctly and securely reclaim their global_id. + +If the ``AUTH_INSECURE_GLOBAL_ID_RECLAIM`` health alert has not also been raised and +the ``auth_expose_insecure_global_id_reclaim`` setting has not been disabled (it is +on by default), then there are currently no clients connected that need to be +upgraded, and it is safe to disallow insecure global_id reclaim with: + +.. prompt:: bash $ + + ceph config set mon auth_allow_insecure_global_id_reclaim false + +If there are still clients that need to be upgraded, then this alert can be +silenced temporarily with: + +.. prompt:: bash $ + + ceph health mute AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED 1w # 1 week + +Although we do NOT recommend doing so, you can also disable this warning indefinitely +with: + +.. prompt:: bash $ + + ceph config set mon mon_warn_on_insecure_global_id_reclaim_allowed false + + +Manager +------- + +MGR_DOWN +________ + +All manager daemons are currently down. The cluster should normally +have at least one running manager (``ceph-mgr``) daemon. If no +manager daemon is running, the cluster's ability to monitor itself will +be compromised, and parts of the management API will become +unavailable (for example, the dashboard will not work, and most CLI +commands that report metrics or runtime state will block). However, +the cluster will still be able to perform all IO operations and +recover from failures. + +The down manager daemon should generally be restarted as soon as +possible to ensure that the cluster can be monitored (e.g., so that +the ``ceph -s`` information is up to date, and/or metrics can be +scraped by Prometheus). + + +MGR_MODULE_DEPENDENCY +_____________________ + +An enabled manager module is failing its dependency check. This health check +should come with an explanatory message from the module about the problem. + +For example, a module might report that a required package is not installed: +install the required package and restart your manager daemons. + +This health check is only applied to enabled modules. If a module is +not enabled, you can see whether it is reporting dependency issues in +the output of `ceph module ls`. + + +MGR_MODULE_ERROR +________________ + +A manager module has experienced an unexpected error. Typically, +this means an unhandled exception was raised from the module's `serve` +function. The human readable description of the error may be obscurely +worded if the exception did not provide a useful description of itself. + +This health check may indicate a bug: please open a Ceph bug report if you +think you have encountered a bug. + +If you believe the error is transient, you may restart your manager +daemon(s), or use `ceph mgr fail` on the active daemon to prompt +a failover to another daemon. + + +OSDs +---- + +OSD_DOWN +________ + +One or more OSDs are marked down. The ceph-osd daemon may have been +stopped, or peer OSDs may be unable to reach the OSD over the network. +Common causes include a stopped or crashed daemon, a down host, or a +network outage. + +Verify the host is healthy, the daemon is started, and network is +functioning. If the daemon has crashed, the daemon log file +(``/var/log/ceph/ceph-osd.*``) may contain debugging information. + +OSD_<crush type>_DOWN +_____________________ + +(e.g. OSD_HOST_DOWN, OSD_ROOT_DOWN) + +All the OSDs within a particular CRUSH subtree are marked down, for example +all OSDs on a host. + +OSD_ORPHAN +__________ + +An OSD is referenced in the CRUSH map hierarchy but does not exist. + +The OSD can be removed from the CRUSH hierarchy with: + +.. prompt:: bash $ + + ceph osd crush rm osd.<id> + +OSD_OUT_OF_ORDER_FULL +_____________________ + +The utilization thresholds for `nearfull`, `backfillfull`, `full`, +and/or `failsafe_full` are not ascending. In particular, we expect +`nearfull < backfillfull`, `backfillfull < full`, and `full < +failsafe_full`. + +The thresholds can be adjusted with: + +.. prompt:: bash $ + + ceph osd set-nearfull-ratio <ratio> + ceph osd set-backfillfull-ratio <ratio> + ceph osd set-full-ratio <ratio> + + +OSD_FULL +________ + +One or more OSDs has exceeded the `full` threshold and is preventing +the cluster from servicing writes. + +Utilization by pool can be checked with: + +.. prompt:: bash $ + + ceph df + +The currently defined `full` ratio can be seen with: + +.. prompt:: bash $ + + ceph osd dump | grep full_ratio + +A short-term workaround to restore write availability is to raise the full +threshold by a small amount: + +.. prompt:: bash $ + + ceph osd set-full-ratio <ratio> + +New storage should be added to the cluster by deploying more OSDs or +existing data should be deleted in order to free up space. + +OSD_BACKFILLFULL +________________ + +One or more OSDs has exceeded the `backfillfull` threshold, which will +prevent data from being allowed to rebalance to this device. This is +an early warning that rebalancing may not be able to complete and that +the cluster is approaching full. + +Utilization by pool can be checked with: + +.. prompt:: bash $ + + ceph df + +OSD_NEARFULL +____________ + +One or more OSDs has exceeded the `nearfull` threshold. This is an early +warning that the cluster is approaching full. + +Utilization by pool can be checked with: + +.. prompt:: bash $ + + ceph df + +OSDMAP_FLAGS +____________ + +One or more cluster flags of interest has been set. These flags include: + +* *full* - the cluster is flagged as full and cannot serve writes +* *pauserd*, *pausewr* - paused reads or writes +* *noup* - OSDs are not allowed to start +* *nodown* - OSD failure reports are being ignored, such that the + monitors will not mark OSDs `down` +* *noin* - OSDs that were previously marked `out` will not be marked + back `in` when they start +* *noout* - down OSDs will not automatically be marked out after the + configured interval +* *nobackfill*, *norecover*, *norebalance* - recovery or data + rebalancing is suspended +* *noscrub*, *nodeep_scrub* - scrubbing is disabled +* *notieragent* - cache tiering activity is suspended + +With the exception of *full*, these flags can be set or cleared with: + +.. prompt:: bash $ + + ceph osd set <flag> + ceph osd unset <flag> + +OSD_FLAGS +_________ + +One or more OSDs or CRUSH {nodes,device classes} has a flag of interest set. +These flags include: + +* *noup*: these OSDs are not allowed to start +* *nodown*: failure reports for these OSDs will be ignored +* *noin*: if these OSDs were previously marked `out` automatically + after a failure, they will not be marked in when they start +* *noout*: if these OSDs are down they will not automatically be marked + `out` after the configured interval + +These flags can be set and cleared in batch with: + +.. prompt:: bash $ + + ceph osd set-group <flags> <who> + ceph osd unset-group <flags> <who> + +For example: + +.. prompt:: bash $ + + ceph osd set-group noup,noout osd.0 osd.1 + ceph osd unset-group noup,noout osd.0 osd.1 + ceph osd set-group noup,noout host-foo + ceph osd unset-group noup,noout host-foo + ceph osd set-group noup,noout class-hdd + ceph osd unset-group noup,noout class-hdd + +OLD_CRUSH_TUNABLES +__________________ + +The CRUSH map is using very old settings and should be updated. The +oldest tunables that can be used (i.e., the oldest client version that +can connect to the cluster) without triggering this health warning is +determined by the ``mon_crush_min_required_version`` config option. +See :ref:`crush-map-tunables` for more information. + +OLD_CRUSH_STRAW_CALC_VERSION +____________________________ + +The CRUSH map is using an older, non-optimal method for calculating +intermediate weight values for ``straw`` buckets. + +The CRUSH map should be updated to use the newer method +(``straw_calc_version=1``). See +:ref:`crush-map-tunables` for more information. + +CACHE_POOL_NO_HIT_SET +_____________________ + +One or more cache pools is not configured with a *hit set* to track +utilization, which will prevent the tiering agent from identifying +cold objects to flush and evict from the cache. + +Hit sets can be configured on the cache pool with: + +.. prompt:: bash $ + + ceph osd pool set <poolname> hit_set_type <type> + ceph osd pool set <poolname> hit_set_period <period-in-seconds> + ceph osd pool set <poolname> hit_set_count <number-of-hitsets> + ceph osd pool set <poolname> hit_set_fpp <target-false-positive-rate> + +OSD_NO_SORTBITWISE +__________________ + +No pre-luminous v12.y.z OSDs are running but the ``sortbitwise`` flag has not +been set. + +The ``sortbitwise`` flag must be set before luminous v12.y.z or newer +OSDs can start. You can safely set the flag with: + +.. prompt:: bash $ + + ceph osd set sortbitwise + +OSD_FILESTORE +__________________ + +Filestore has been deprecated, considering that Bluestore has been the default +objectstore for quite some time. Warn if OSDs are running Filestore. + +The 'mclock_scheduler' is not supported for filestore OSDs. Therefore, the +default 'osd_op_queue' is set to 'wpq' for filestore OSDs and is enforced +even if the user attempts to change it. + +Filestore OSDs can be listed with: + +.. prompt:: bash $ + + ceph report | jq -c '."osd_metadata" | .[] | select(.osd_objectstore | contains("filestore")) | {id, osd_objectstore}' + +If it is not feasible to migrate Filestore OSDs to Bluestore immediately, you +can silence this warning temporarily with: + +.. prompt:: bash $ + + ceph health mute OSD_FILESTORE + +POOL_FULL +_________ + +One or more pools has reached its quota and is no longer allowing writes. + +Pool quotas and utilization can be seen with: + +.. prompt:: bash $ + + ceph df detail + +You can either raise the pool quota with: + +.. prompt:: bash $ + + ceph osd pool set-quota <poolname> max_objects <num-objects> + ceph osd pool set-quota <poolname> max_bytes <num-bytes> + +or delete some existing data to reduce utilization. + +BLUEFS_SPILLOVER +________________ + +One or more OSDs that use the BlueStore backend have been allocated +`db` partitions (storage space for metadata, normally on a faster +device) but that space has filled, such that metadata has "spilled +over" onto the normal slow device. This isn't necessarily an error +condition or even unexpected, but if the administrator's expectation +was that all metadata would fit on the faster device, it indicates +that not enough space was provided. + +This warning can be disabled on all OSDs with: + +.. prompt:: bash $ + + ceph config set osd bluestore_warn_on_bluefs_spillover false + +Alternatively, it can be disabled on a specific OSD with: + +.. prompt:: bash $ + + ceph config set osd.123 bluestore_warn_on_bluefs_spillover false + +To provide more metadata space, the OSD in question could be destroyed and +reprovisioned. This will involve data migration and recovery. + +It may also be possible to expand the LVM logical volume backing the +`db` storage. If the underlying LV has been expanded, the OSD daemon +needs to be stopped and BlueFS informed of the device size change with: + +.. prompt:: bash $ + + ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-$ID + +BLUEFS_AVAILABLE_SPACE +______________________ + +To check how much space is free for BlueFS do: + +.. prompt:: bash $ + + ceph daemon osd.123 bluestore bluefs available + +This will output up to 3 values: `BDEV_DB free`, `BDEV_SLOW free` and +`available_from_bluestore`. `BDEV_DB` and `BDEV_SLOW` report amount of space that +has been acquired by BlueFS and is considered free. Value `available_from_bluestore` +denotes ability of BlueStore to relinquish more space to BlueFS. +It is normal that this value is different from amount of BlueStore free space, as +BlueFS allocation unit is typically larger than BlueStore allocation unit. +This means that only part of BlueStore free space will be acceptable for BlueFS. + +BLUEFS_LOW_SPACE +_________________ + +If BlueFS is running low on available free space and there is little +`available_from_bluestore` one can consider reducing BlueFS allocation unit size. +To simulate available space when allocation unit is different do: + +.. prompt:: bash $ + + ceph daemon osd.123 bluestore bluefs available <alloc-unit-size> + +BLUESTORE_FRAGMENTATION +_______________________ + +As BlueStore works free space on underlying storage will get fragmented. +This is normal and unavoidable but excessive fragmentation will cause slowdown. +To inspect BlueStore fragmentation one can do: + +.. prompt:: bash $ + + ceph daemon osd.123 bluestore allocator score block + +Score is given in [0-1] range. +[0.0 .. 0.4] tiny fragmentation +[0.4 .. 0.7] small, acceptable fragmentation +[0.7 .. 0.9] considerable, but safe fragmentation +[0.9 .. 1.0] severe fragmentation, may impact BlueFS ability to get space from BlueStore + +If detailed report of free fragments is required do: + +.. prompt:: bash $ + + ceph daemon osd.123 bluestore allocator dump block + +In case when handling OSD process that is not running fragmentation can be +inspected with `ceph-bluestore-tool`. +Get fragmentation score: + +.. prompt:: bash $ + + ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-123 --allocator block free-score + +And dump detailed free chunks: + +.. prompt:: bash $ + + ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-123 --allocator block free-dump + +BLUESTORE_LEGACY_STATFS +_______________________ + +In the Nautilus release, BlueStore tracks its internal usage +statistics on a per-pool granular basis, and one or more OSDs have +BlueStore volumes that were created prior to Nautilus. If *all* OSDs +are older than Nautilus, this just means that the per-pool metrics are +not available. However, if there is a mix of pre-Nautilus and +post-Nautilus OSDs, the cluster usage statistics reported by ``ceph +df`` will not be accurate. + +The old OSDs can be updated to use the new usage tracking scheme by stopping each OSD, running a repair operation, and the restarting it. For example, if ``osd.123`` needed to be updated,: + +.. prompt:: bash $ + + systemctl stop ceph-osd@123 + ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-123 + systemctl start ceph-osd@123 + +This warning can be disabled with: + +.. prompt:: bash $ + + ceph config set global bluestore_warn_on_legacy_statfs false + +BLUESTORE_NO_PER_POOL_OMAP +__________________________ + +Starting with the Octopus release, BlueStore tracks omap space utilization +by pool, and one or more OSDs have volumes that were created prior to +Octopus. If all OSDs are not running BlueStore with the new tracking +enabled, the cluster will report and approximate value for per-pool omap usage +based on the most recent deep-scrub. + +The old OSDs can be updated to track by pool by stopping each OSD, +running a repair operation, and the restarting it. For example, if +``osd.123`` needed to be updated,: + +.. prompt:: bash $ + + systemctl stop ceph-osd@123 + ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-123 + systemctl start ceph-osd@123 + +This warning can be disabled with: + +.. prompt:: bash $ + + ceph config set global bluestore_warn_on_no_per_pool_omap false + +BLUESTORE_NO_PER_PG_OMAP +__________________________ + +Starting with the Pacific release, BlueStore tracks omap space utilization +by PG, and one or more OSDs have volumes that were created prior to +Pacific. Per-PG omap enables faster PG removal when PGs migrate. + +The older OSDs can be updated to track by PG by stopping each OSD, +running a repair operation, and the restarting it. For example, if +``osd.123`` needed to be updated,: + +.. prompt:: bash $ + + systemctl stop ceph-osd@123 + ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-123 + systemctl start ceph-osd@123 + +This warning can be disabled with: + +.. prompt:: bash $ + + ceph config set global bluestore_warn_on_no_per_pg_omap false + + +BLUESTORE_DISK_SIZE_MISMATCH +____________________________ + +One or more OSDs using BlueStore has an internal inconsistency between the size +of the physical device and the metadata tracking its size. This can lead to +the OSD crashing in the future. + +The OSDs in question should be destroyed and reprovisioned. Care should be +taken to do this one OSD at a time, and in a way that doesn't put any data at +risk. For example, if osd ``$N`` has the error: + +.. prompt:: bash $ + + ceph osd out osd.$N + while ! ceph osd safe-to-destroy osd.$N ; do sleep 1m ; done + ceph osd destroy osd.$N + ceph-volume lvm zap /path/to/device + ceph-volume lvm create --osd-id $N --data /path/to/device + +BLUESTORE_NO_COMPRESSION +________________________ + +One or more OSDs is unable to load a BlueStore compression plugin. +This can be caused by a broken installation, in which the ``ceph-osd`` +binary does not match the compression plugins, or a recent upgrade +that did not include a restart of the ``ceph-osd`` daemon. + +Verify that the package(s) on the host running the OSD(s) in question +are correctly installed and that the OSD daemon(s) have been +restarted. If the problem persists, check the OSD log for any clues +as to the source of the problem. + +BLUESTORE_SPURIOUS_READ_ERRORS +______________________________ + +One or more OSDs using BlueStore detects spurious read errors at main device. +BlueStore has recovered from these errors by retrying disk reads. +Though this might show some issues with underlying hardware, I/O subsystem, +etc. +Which theoretically might cause permanent data corruption. +Some observations on the root cause can be found at +https://tracker.ceph.com/issues/22464 + +This alert doesn't require immediate response but corresponding host might need +additional attention, e.g. upgrading to the latest OS/kernel versions and +H/W resource utilization monitoring. + +This warning can be disabled on all OSDs with: + +.. prompt:: bash $ + + ceph config set osd bluestore_warn_on_spurious_read_errors false + +Alternatively, it can be disabled on a specific OSD with: + +.. prompt:: bash $ + + ceph config set osd.123 bluestore_warn_on_spurious_read_errors false + + +Device health +------------- + +DEVICE_HEALTH +_____________ + +One or more devices is expected to fail soon, where the warning +threshold is controlled by the ``mgr/devicehealth/warn_threshold`` +config option. + +This warning only applies to OSDs that are currently marked "in", so +the expected response to this failure is to mark the device "out" so +that data is migrated off of the device, and then to remove the +hardware from the system. Note that the marking out is normally done +automatically if ``mgr/devicehealth/self_heal`` is enabled based on +the ``mgr/devicehealth/mark_out_threshold``. + +Device health can be checked with: + +.. prompt:: bash $ + + ceph device info <device-id> + +Device life expectancy is set by a prediction model run by +the mgr or an by external tool via the command: + +.. prompt:: bash $ + + ceph device set-life-expectancy <device-id> <from> <to> + +You can change the stored life expectancy manually, but that usually +doesn't accomplish anything as whatever tool originally set it will +probably set it again, and changing the stored value does not affect +the actual health of the hardware device. + +DEVICE_HEALTH_IN_USE +____________________ + +One or more devices is expected to fail soon and has been marked "out" +of the cluster based on ``mgr/devicehealth/mark_out_threshold``, but it +is still participating in one more PGs. This may be because it was +only recently marked "out" and data is still migrating, or because data +cannot be migrated off for some reason (e.g., the cluster is nearly +full, or the CRUSH hierarchy is such that there isn't another suitable +OSD to migrate the data too). + +This message can be silenced by disabling the self heal behavior +(setting ``mgr/devicehealth/self_heal`` to false), by adjusting the +``mgr/devicehealth/mark_out_threshold``, or by addressing what is +preventing data from being migrated off of the ailing device. + +DEVICE_HEALTH_TOOMANY +_____________________ + +Too many devices is expected to fail soon and the +``mgr/devicehealth/self_heal`` behavior is enabled, such that marking +out all of the ailing devices would exceed the clusters +``mon_osd_min_in_ratio`` ratio that prevents too many OSDs from being +automatically marked "out". + +This generally indicates that too many devices in your cluster are +expected to fail soon and you should take action to add newer +(healthier) devices before too many devices fail and data is lost. + +The health message can also be silenced by adjusting parameters like +``mon_osd_min_in_ratio`` or ``mgr/devicehealth/mark_out_threshold``, +but be warned that this will increase the likelihood of unrecoverable +data loss in the cluster. + + +Data health (pools & placement groups) +-------------------------------------- + +PG_AVAILABILITY +_______________ + +Data availability is reduced, meaning that the cluster is unable to +service potential read or write requests for some data in the cluster. +Specifically, one or more PGs is in a state that does not allow IO +requests to be serviced. Problematic PG states include *peering*, +*stale*, *incomplete*, and the lack of *active* (if those conditions do not clear +quickly). + +Detailed information about which PGs are affected is available from: + +.. prompt:: bash $ + + ceph health detail + +In most cases the root cause is that one or more OSDs is currently +down; see the discussion for ``OSD_DOWN`` above. + +The state of specific problematic PGs can be queried with: + +.. prompt:: bash $ + + ceph tell <pgid> query + +PG_DEGRADED +___________ + +Data redundancy is reduced for some data, meaning the cluster does not +have the desired number of replicas for all data (for replicated +pools) or erasure code fragments (for erasure coded pools). +Specifically, one or more PGs: + +* has the *degraded* or *undersized* flag set, meaning there are not + enough instances of that placement group in the cluster; +* has not had the *clean* flag set for some time. + +Detailed information about which PGs are affected is available from: + +.. prompt:: bash $ + + ceph health detail + +In most cases the root cause is that one or more OSDs is currently +down; see the dicussion for ``OSD_DOWN`` above. + +The state of specific problematic PGs can be queried with: + +.. prompt:: bash $ + + ceph tell <pgid> query + + +PG_RECOVERY_FULL +________________ + +Data redundancy may be reduced or at risk for some data due to a lack +of free space in the cluster. Specifically, one or more PGs has the +*recovery_toofull* flag set, meaning that the +cluster is unable to migrate or recover data because one or more OSDs +is above the *full* threshold. + +See the discussion for *OSD_FULL* above for steps to resolve this condition. + +PG_BACKFILL_FULL +________________ + +Data redundancy may be reduced or at risk for some data due to a lack +of free space in the cluster. Specifically, one or more PGs has the +*backfill_toofull* flag set, meaning that the +cluster is unable to migrate or recover data because one or more OSDs +is above the *backfillfull* threshold. + +See the discussion for *OSD_BACKFILLFULL* above for +steps to resolve this condition. + +PG_DAMAGED +__________ + +Data scrubbing has discovered some problems with data consistency in +the cluster. Specifically, one or more PGs has the *inconsistent* or +*snaptrim_error* flag is set, indicating an earlier scrub operation +found a problem, or that the *repair* flag is set, meaning a repair +for such an inconsistency is currently in progress. + +See :doc:`pg-repair` for more information. + +OSD_SCRUB_ERRORS +________________ + +Recent OSD scrubs have uncovered inconsistencies. This error is generally +paired with *PG_DAMAGED* (see above). + +See :doc:`pg-repair` for more information. + +OSD_TOO_MANY_REPAIRS +____________________ + +When a read error occurs and another replica is available it is used to repair +the error immediately, so that the client can get the object data. Scrub +handles errors for data at rest. In order to identify possible failing disks +that aren't seeing scrub errors, a count of read repairs is maintained. If +it exceeds a config value threshold *mon_osd_warn_num_repaired* default 10, +this health warning is generated. + +LARGE_OMAP_OBJECTS +__________________ + +One or more pools contain large omap objects as determined by +``osd_deep_scrub_large_omap_object_key_threshold`` (threshold for number of keys +to determine a large omap object) or +``osd_deep_scrub_large_omap_object_value_sum_threshold`` (the threshold for +summed size (bytes) of all key values to determine a large omap object) or both. +More information on the object name, key count, and size in bytes can be found +by searching the cluster log for 'Large omap object found'. Large omap objects +can be caused by RGW bucket index objects that do not have automatic resharding +enabled. Please see :ref:`RGW Dynamic Bucket Index Resharding +<rgw_dynamic_bucket_index_resharding>` for more information on resharding. + +The thresholds can be adjusted with: + +.. prompt:: bash $ + + ceph config set osd osd_deep_scrub_large_omap_object_key_threshold <keys> + ceph config set osd osd_deep_scrub_large_omap_object_value_sum_threshold <bytes> + +CACHE_POOL_NEAR_FULL +____________________ + +A cache tier pool is nearly full. Full in this context is determined +by the ``target_max_bytes`` and ``target_max_objects`` properties on +the cache pool. Once the pool reaches the target threshold, write +requests to the pool may block while data is flushed and evicted +from the cache, a state that normally leads to very high latencies and +poor performance. + +The cache pool target size can be adjusted with: + +.. prompt:: bash $ + + ceph osd pool set <cache-pool-name> target_max_bytes <bytes> + ceph osd pool set <cache-pool-name> target_max_objects <objects> + +Normal cache flush and evict activity may also be throttled due to reduced +availability or performance of the base tier, or overall cluster load. + +TOO_FEW_PGS +___________ + +The number of PGs in use in the cluster is below the configurable +threshold of ``mon_pg_warn_min_per_osd`` PGs per OSD. This can lead +to suboptimal distribution and balance of data across the OSDs in +the cluster, and similarly reduce overall performance. + +This may be an expected condition if data pools have not yet been +created. + +The PG count for existing pools can be increased or new pools can be created. +Please refer to :ref:`choosing-number-of-placement-groups` for more +information. + +POOL_PG_NUM_NOT_POWER_OF_TWO +____________________________ + +One or more pools has a ``pg_num`` value that is not a power of two. +Although this is not strictly incorrect, it does lead to a less +balanced distribution of data because some PGs have roughly twice as +much data as others. + +This is easily corrected by setting the ``pg_num`` value for the +affected pool(s) to a nearby power of two: + +.. prompt:: bash $ + + ceph osd pool set <pool-name> pg_num <value> + +This health warning can be disabled with: + +.. prompt:: bash $ + + ceph config set global mon_warn_on_pool_pg_num_not_power_of_two false + +POOL_TOO_FEW_PGS +________________ + +One or more pools should probably have more PGs, based on the amount +of data that is currently stored in the pool. This can lead to +suboptimal distribution and balance of data across the OSDs in the +cluster, and similarly reduce overall performance. This warning is +generated if the ``pg_autoscale_mode`` property on the pool is set to +``warn``. + +To disable the warning, you can disable auto-scaling of PGs for the +pool entirely with: + +.. prompt:: bash $ + + ceph osd pool set <pool-name> pg_autoscale_mode off + +To allow the cluster to automatically adjust the number of PGs,: + +.. prompt:: bash $ + + ceph osd pool set <pool-name> pg_autoscale_mode on + +You can also manually set the number of PGs for the pool to the +recommended amount with: + +.. prompt:: bash $ + + ceph osd pool set <pool-name> pg_num <new-pg-num> + +Please refer to :ref:`choosing-number-of-placement-groups` and +:ref:`pg-autoscaler` for more information. + +TOO_MANY_PGS +____________ + +The number of PGs in use in the cluster is above the configurable +threshold of ``mon_max_pg_per_osd`` PGs per OSD. If this threshold is +exceed the cluster will not allow new pools to be created, pool `pg_num` to +be increased, or pool replication to be increased (any of which would lead to +more PGs in the cluster). A large number of PGs can lead +to higher memory utilization for OSD daemons, slower peering after +cluster state changes (like OSD restarts, additions, or removals), and +higher load on the Manager and Monitor daemons. + +The simplest way to mitigate the problem is to increase the number of +OSDs in the cluster by adding more hardware. Note that the OSD count +used for the purposes of this health check is the number of "in" OSDs, +so marking "out" OSDs "in" (if there are any) can also help: + +.. prompt:: bash $ + + ceph osd in <osd id(s)> + +Please refer to :ref:`choosing-number-of-placement-groups` for more +information. + +POOL_TOO_MANY_PGS +_________________ + +One or more pools should probably have more PGs, based on the amount +of data that is currently stored in the pool. This can lead to higher +memory utilization for OSD daemons, slower peering after cluster state +changes (like OSD restarts, additions, or removals), and higher load +on the Manager and Monitor daemons. This warning is generated if the +``pg_autoscale_mode`` property on the pool is set to ``warn``. + +To disable the warning, you can disable auto-scaling of PGs for the +pool entirely with: + +.. prompt:: bash $ + + ceph osd pool set <pool-name> pg_autoscale_mode off + +To allow the cluster to automatically adjust the number of PGs,: + +.. prompt:: bash $ + + ceph osd pool set <pool-name> pg_autoscale_mode on + +You can also manually set the number of PGs for the pool to the +recommended amount with: + +.. prompt:: bash $ + + ceph osd pool set <pool-name> pg_num <new-pg-num> + +Please refer to :ref:`choosing-number-of-placement-groups` and +:ref:`pg-autoscaler` for more information. + +POOL_TARGET_SIZE_BYTES_OVERCOMMITTED +____________________________________ + +One or more pools have a ``target_size_bytes`` property set to +estimate the expected size of the pool, +but the value(s) exceed the total available storage (either by +themselves or in combination with other pools' actual usage). + +This is usually an indication that the ``target_size_bytes`` value for +the pool is too large and should be reduced or set to zero with: + +.. prompt:: bash $ + + ceph osd pool set <pool-name> target_size_bytes 0 + +For more information, see :ref:`specifying_pool_target_size`. + +POOL_HAS_TARGET_SIZE_BYTES_AND_RATIO +____________________________________ + +One or more pools have both ``target_size_bytes`` and +``target_size_ratio`` set to estimate the expected size of the pool. +Only one of these properties should be non-zero. If both are set, +``target_size_ratio`` takes precedence and ``target_size_bytes`` is +ignored. + +To reset ``target_size_bytes`` to zero: + +.. prompt:: bash $ + + ceph osd pool set <pool-name> target_size_bytes 0 + +For more information, see :ref:`specifying_pool_target_size`. + +TOO_FEW_OSDS +____________ + +The number of OSDs in the cluster is below the configurable +threshold of ``osd_pool_default_size``. + +SMALLER_PGP_NUM +_______________ + +One or more pools has a ``pgp_num`` value less than ``pg_num``. This +is normally an indication that the PG count was increased without +also increasing the placement behavior. + +This is sometimes done deliberately to separate out the `split` step +when the PG count is adjusted from the data migration that is needed +when ``pgp_num`` is changed. + +This is normally resolved by setting ``pgp_num`` to match ``pg_num``, +triggering the data migration, with: + +.. prompt:: bash $ + + ceph osd pool set <pool> pgp_num <pg-num-value> + +MANY_OBJECTS_PER_PG +___________________ + +One or more pools has an average number of objects per PG that is +significantly higher than the overall cluster average. The specific +threshold is controlled by the ``mon_pg_warn_max_object_skew`` +configuration value. + +This is usually an indication that the pool(s) containing most of the +data in the cluster have too few PGs, and/or that other pools that do +not contain as much data have too many PGs. See the discussion of +*TOO_MANY_PGS* above. + +The threshold can be raised to silence the health warning by adjusting +the ``mon_pg_warn_max_object_skew`` config option on the managers. + +The health warning will be silenced for a particular pool if +``pg_autoscale_mode`` is set to ``on``. + +POOL_APP_NOT_ENABLED +____________________ + +A pool exists that contains one or more objects but has not been +tagged for use by a particular application. + +Resolve this warning by labeling the pool for use by an application. For +example, if the pool is used by RBD,: + +.. prompt:: bash $ + + rbd pool init <poolname> + +If the pool is being used by a custom application 'foo', you can also label +via the low-level command: + +.. prompt:: bash $ + + ceph osd pool application enable foo + +For more information, see :ref:`associate-pool-to-application`. + +POOL_FULL +_________ + +One or more pools has reached (or is very close to reaching) its +quota. The threshold to trigger this error condition is controlled by +the ``mon_pool_quota_crit_threshold`` configuration option. + +Pool quotas can be adjusted up or down (or removed) with: + +.. prompt:: bash $ + + ceph osd pool set-quota <pool> max_bytes <bytes> + ceph osd pool set-quota <pool> max_objects <objects> + +Setting the quota value to 0 will disable the quota. + +POOL_NEAR_FULL +______________ + +One or more pools is approaching a configured fullness threshold. + +One threshold that can trigger this warning condition is the +``mon_pool_quota_warn_threshold`` configuration option. + +Pool quotas can be adjusted up or down (or removed) with: + +.. prompt:: bash $ + + ceph osd pool set-quota <pool> max_bytes <bytes> + ceph osd pool set-quota <pool> max_objects <objects> + +Setting the quota value to 0 will disable the quota. + +Other thresholds that can trigger the above two warning conditions are +``mon_osd_nearfull_ratio`` and ``mon_osd_full_ratio``. Visit the +:ref:`storage-capacity` and :ref:`no-free-drive-space` documents for details +and resolution. + +OBJECT_MISPLACED +________________ + +One or more objects in the cluster is not stored on the node the +cluster would like it to be stored on. This is an indication that +data migration due to some recent cluster change has not yet completed. + +Misplaced data is not a dangerous condition in and of itself; data +consistency is never at risk, and old copies of objects are never +removed until the desired number of new copies (in the desired +locations) are present. + +OBJECT_UNFOUND +______________ + +One or more objects in the cluster cannot be found. Specifically, the +OSDs know that a new or updated copy of an object should exist, but a +copy of that version of the object has not been found on OSDs that are +currently online. + +Read or write requests to unfound objects will block. + +Ideally, a down OSD can be brought back online that has the more +recent copy of the unfound object. Candidate OSDs can be identified from the +peering state for the PG(s) responsible for the unfound object: + +.. prompt:: bash $ + + ceph tell <pgid> query + +If the latest copy of the object is not available, the cluster can be +told to roll back to a previous version of the object. See +:ref:`failures-osd-unfound` for more information. + +SLOW_OPS +________ + +One or more OSD or monitor requests is taking a long time to process. This can +be an indication of extreme load, a slow storage device, or a software +bug. + +The request queue for the daemon in question can be queried with the +following command, executed from the daemon's host: + +.. prompt:: bash $ + + ceph daemon osd.<id> ops + +A summary of the slowest recent requests can be seen with: + +.. prompt:: bash $ + + ceph daemon osd.<id> dump_historic_ops + +The location of an OSD can be found with: + +.. prompt:: bash $ + + ceph osd find osd.<id> + +PG_NOT_SCRUBBED +_______________ + +One or more PGs has not been scrubbed recently. PGs are normally scrubbed +within every configured interval specified by +:ref:`osd_scrub_max_interval <osd_scrub_max_interval>` globally. This +interval can be overriden on per-pool basis with +:ref:`scrub_max_interval <scrub_max_interval>`. The warning triggers when +``mon_warn_pg_not_scrubbed_ratio`` percentage of interval has elapsed without a +scrub since it was due. + +PGs will not scrub if they are not flagged as *clean*, which may +happen if they are misplaced or degraded (see *PG_AVAILABILITY* and +*PG_DEGRADED* above). + +You can manually initiate a scrub of a clean PG with:: + + ceph pg scrub <pgid> + +PG_NOT_DEEP_SCRUBBED +____________________ + +One or more PGs has not been deep scrubbed recently. PGs are normally +scrubbed every ``osd_deep_scrub_interval`` seconds, and this warning +triggers when ``mon_warn_pg_not_deep_scrubbed_ratio`` percentage of interval has elapsed +without a scrub since it was due. + +PGs will not (deep) scrub if they are not flagged as *clean*, which may +happen if they are misplaced or degraded (see *PG_AVAILABILITY* and +*PG_DEGRADED* above). + +You can manually initiate a scrub of a clean PG with: + +.. prompt:: bash $ + + ceph pg deep-scrub <pgid> + + +PG_SLOW_SNAP_TRIMMING +_____________________ + +The snapshot trim queue for one or more PGs has exceeded the +configured warning threshold. This indicates that either an extremely +large number of snapshots were recently deleted, or that the OSDs are +unable to trim snapshots quickly enough to keep up with the rate of +new snapshot deletions. + +The warning threshold is controlled by the +``mon_osd_snap_trim_queue_warn_on`` option (default: 32768). + +This warning may trigger if OSDs are under excessive load and unable +to keep up with their background work, or if the OSDs' internal +metadata database is heavily fragmented and unable to perform. It may +also indicate some other performance issue with the OSDs. + +The exact size of the snapshot trim queue is reported by the +``snaptrimq_len`` field of ``ceph pg ls -f json-detail``. + +Miscellaneous +------------- + +RECENT_CRASH +____________ + +One or more Ceph daemons has crashed recently, and the crash has not +yet been archived (acknowledged) by the administrator. This may +indicate a software bug, a hardware problem (e.g., a failing disk), or +some other problem. + +New crashes can be listed with: + +.. prompt:: bash $ + + ceph crash ls-new + +Information about a specific crash can be examined with: + +.. prompt:: bash $ + + ceph crash info <crash-id> + +This warning can be silenced by "archiving" the crash (perhaps after +being examined by an administrator) so that it does not generate this +warning: + +.. prompt:: bash $ + + ceph crash archive <crash-id> + +Similarly, all new crashes can be archived with: + +.. prompt:: bash $ + + ceph crash archive-all + +Archived crashes will still be visible via ``ceph crash ls`` but not +``ceph crash ls-new``. + +The time period for what "recent" means is controlled by the option +``mgr/crash/warn_recent_interval`` (default: two weeks). + +These warnings can be disabled entirely with: + +.. prompt:: bash $ + + ceph config set mgr/crash/warn_recent_interval 0 + +RECENT_MGR_MODULE_CRASH +_______________________ + +One or more ceph-mgr modules has crashed recently, and the crash as +not yet been archived (acknowledged) by the administrator. This +generally indicates a software bug in one of the software modules run +inside the ceph-mgr daemon. Although the module that experienced the +problem maybe be disabled as a result, the function of other modules +is normally unaffected. + +As with the *RECENT_CRASH* health alert, the crash can be inspected with: + +.. prompt:: bash $ + + ceph crash info <crash-id> + +This warning can be silenced by "archiving" the crash (perhaps after +being examined by an administrator) so that it does not generate this +warning: + +.. prompt:: bash $ + + ceph crash archive <crash-id> + +Similarly, all new crashes can be archived with: + +.. prompt:: bash $ + + ceph crash archive-all + +Archived crashes will still be visible via ``ceph crash ls`` but not +``ceph crash ls-new``. + +The time period for what "recent" means is controlled by the option +``mgr/crash/warn_recent_interval`` (default: two weeks). + +These warnings can be disabled entirely with: + +.. prompt:: bash $ + + ceph config set mgr/crash/warn_recent_interval 0 + +TELEMETRY_CHANGED +_________________ + +Telemetry has been enabled, but the contents of the telemetry report +have changed since that time, so telemetry reports will not be sent. + +The Ceph developers periodically revise the telemetry feature to +include new and useful information, or to remove information found to +be useless or sensitive. If any new information is included in the +report, Ceph will require the administrator to re-enable telemetry to +ensure they have an opportunity to (re)review what information will be +shared. + +To review the contents of the telemetry report: + +.. prompt:: bash $ + + ceph telemetry show + +Note that the telemetry report consists of several optional channels +that may be independently enabled or disabled. For more information, see +:ref:`telemetry`. + +To re-enable telemetry (and make this warning go away): + +.. prompt:: bash $ + + ceph telemetry on + +To disable telemetry (and make this warning go away): + +.. prompt:: bash $ + + ceph telemetry off + +AUTH_BAD_CAPS +_____________ + +One or more auth users has capabilities that cannot be parsed by the +monitor. This generally indicates that the user will not be +authorized to perform any action with one or more daemon types. + +This error is mostly likely to occur after an upgrade if the +capabilities were set with an older version of Ceph that did not +properly validate their syntax, or if the syntax of the capabilities +has changed. + +The user in question can be removed with: + +.. prompt:: bash $ + + ceph auth rm <entity-name> + +(This will resolve the health alert, but obviously clients will not be +able to authenticate as that user.) + +Alternatively, the capabilities for the user can be updated with: + +.. prompt:: bash $ + + ceph auth <entity-name> <daemon-type> <caps> [<daemon-type> <caps> ...] + +For more information about auth capabilities, see :ref:`user-management`. + +OSD_NO_DOWN_OUT_INTERVAL +________________________ + +The ``mon_osd_down_out_interval`` option is set to zero, which means +that the system will not automatically perform any repair or healing +operations after an OSD fails. Instead, an administrator (or some +other external entity) will need to manually mark down OSDs as 'out' +(i.e., via ``ceph osd out <osd-id>``) in order to trigger recovery. + +This option is normally set to five or ten minutes--enough time for a +host to power-cycle or reboot. + +This warning can silenced by setting the +``mon_warn_on_osd_down_out_interval_zero`` to false: + +.. prompt:: bash $ + + ceph config global mon mon_warn_on_osd_down_out_interval_zero false + +DASHBOARD_DEBUG +_______________ + +The Dashboard debug mode is enabled. This means, if there is an error +while processing a REST API request, the HTTP error response contains +a Python traceback. This behaviour should be disabled in production +environments because such a traceback might contain and expose sensible +information. + +The debug mode can be disabled with: + +.. prompt:: bash $ + + ceph dashboard debug disable diff --git a/doc/rados/operations/index.rst b/doc/rados/operations/index.rst new file mode 100644 index 000000000..2136918c7 --- /dev/null +++ b/doc/rados/operations/index.rst @@ -0,0 +1,98 @@ +.. _rados-operations: + +==================== + Cluster Operations +==================== + +.. raw:: html + + <table><colgroup><col width="50%"><col width="50%"></colgroup><tbody valign="top"><tr><td><h3>High-level Operations</h3> + +High-level cluster operations consist primarily of starting, stopping, and +restarting a cluster with the ``ceph`` service; checking the cluster's health; +and, monitoring an operating cluster. + +.. toctree:: + :maxdepth: 1 + + operating + health-checks + monitoring + monitoring-osd-pg + user-management + pg-repair + +.. raw:: html + + </td><td><h3>Data Placement</h3> + +Once you have your cluster up and running, you may begin working with data +placement. Ceph supports petabyte-scale data storage clusters, with storage +pools and placement groups that distribute data across the cluster using Ceph's +CRUSH algorithm. + +.. toctree:: + :maxdepth: 1 + + data-placement + pools + erasure-code + cache-tiering + placement-groups + balancer + upmap + crush-map + crush-map-edits + stretch-mode + change-mon-elections + + + +.. raw:: html + + </td></tr><tr><td><h3>Low-level Operations</h3> + +Low-level cluster operations consist of starting, stopping, and restarting a +particular daemon within a cluster; changing the settings of a particular +daemon or subsystem; and, adding a daemon to the cluster or removing a daemon +from the cluster. The most common use cases for low-level operations include +growing or shrinking the Ceph cluster and replacing legacy or failed hardware +with new hardware. + +.. toctree:: + :maxdepth: 1 + + add-or-rm-osds + add-or-rm-mons + devices + bluestore-migration + Command Reference <control> + + + +.. raw:: html + + </td><td><h3>Troubleshooting</h3> + +Ceph is still on the leading edge, so you may encounter situations that require +you to evaluate your Ceph configuration and modify your logging and debugging +settings to identify and remedy issues you are encountering with your cluster. + +.. toctree:: + :maxdepth: 1 + + ../troubleshooting/community + ../troubleshooting/troubleshooting-mon + ../troubleshooting/troubleshooting-osd + ../troubleshooting/troubleshooting-pg + ../troubleshooting/log-and-debug + ../troubleshooting/cpu-profiling + ../troubleshooting/memory-profiling + + + + +.. raw:: html + + </td></tr></tbody></table> + diff --git a/doc/rados/operations/monitoring-osd-pg.rst b/doc/rados/operations/monitoring-osd-pg.rst new file mode 100644 index 000000000..3b997bfb4 --- /dev/null +++ b/doc/rados/operations/monitoring-osd-pg.rst @@ -0,0 +1,553 @@ +========================= + Monitoring OSDs and PGs +========================= + +High availability and high reliability require a fault-tolerant approach to +managing hardware and software issues. Ceph has no single point-of-failure, and +can service requests for data in a "degraded" mode. Ceph's `data placement`_ +introduces a layer of indirection to ensure that data doesn't bind directly to +particular OSD addresses. This means that tracking down system faults requires +finding the `placement group`_ and the underlying OSDs at root of the problem. + +.. tip:: A fault in one part of the cluster may prevent you from accessing a + particular object, but that doesn't mean that you cannot access other objects. + When you run into a fault, don't panic. Just follow the steps for monitoring + your OSDs and placement groups. Then, begin troubleshooting. + +Ceph is generally self-repairing. However, when problems persist, monitoring +OSDs and placement groups will help you identify the problem. + + +Monitoring OSDs +=============== + +An OSD's status is either in the cluster (``in``) or out of the cluster +(``out``); and, it is either up and running (``up``), or it is down and not +running (``down``). If an OSD is ``up``, it may be either ``in`` the cluster +(you can read and write data) or it is ``out`` of the cluster. If it was +``in`` the cluster and recently moved ``out`` of the cluster, Ceph will migrate +placement groups to other OSDs. If an OSD is ``out`` of the cluster, CRUSH will +not assign placement groups to the OSD. If an OSD is ``down``, it should also be +``out``. + +.. note:: If an OSD is ``down`` and ``in``, there is a problem and the cluster + will not be in a healthy state. + +.. ditaa:: + + +----------------+ +----------------+ + | | | | + | OSD #n In | | OSD #n Up | + | | | | + +----------------+ +----------------+ + ^ ^ + | | + | | + v v + +----------------+ +----------------+ + | | | | + | OSD #n Out | | OSD #n Down | + | | | | + +----------------+ +----------------+ + +If you execute a command such as ``ceph health``, ``ceph -s`` or ``ceph -w``, +you may notice that the cluster does not always echo back ``HEALTH OK``. Don't +panic. With respect to OSDs, you should expect that the cluster will **NOT** +echo ``HEALTH OK`` in a few expected circumstances: + +#. You haven't started the cluster yet (it won't respond). +#. You have just started or restarted the cluster and it's not ready yet, + because the placement groups are getting created and the OSDs are in + the process of peering. +#. You just added or removed an OSD. +#. You just have modified your cluster map. + +An important aspect of monitoring OSDs is to ensure that when the cluster +is up and running that all OSDs that are ``in`` the cluster are ``up`` and +running, too. To see if all OSDs are running, execute: + +.. prompt:: bash $ + + ceph osd stat + +The result should tell you the total number of OSDs (x), +how many are ``up`` (y), how many are ``in`` (z) and the map epoch (eNNNN). :: + + x osds: y up, z in; epoch: eNNNN + +If the number of OSDs that are ``in`` the cluster is more than the number of +OSDs that are ``up``, execute the following command to identify the ``ceph-osd`` +daemons that are not running: + +.. prompt:: bash $ + + ceph osd tree + +:: + + #ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF + -1 2.00000 pool openstack + -3 2.00000 rack dell-2950-rack-A + -2 2.00000 host dell-2950-A1 + 0 ssd 1.00000 osd.0 up 1.00000 1.00000 + 1 ssd 1.00000 osd.1 down 1.00000 1.00000 + +.. tip:: The ability to search through a well-designed CRUSH hierarchy may help + you troubleshoot your cluster by identifying the physical locations faster. + +If an OSD is ``down``, start it: + +.. prompt:: bash $ + + sudo systemctl start ceph-osd@1 + +See `OSD Not Running`_ for problems associated with OSDs that stopped, or won't +restart. + + +PG Sets +======= + +When CRUSH assigns placement groups to OSDs, it looks at the number of replicas +for the pool and assigns the placement group to OSDs such that each replica of +the placement group gets assigned to a different OSD. For example, if the pool +requires three replicas of a placement group, CRUSH may assign them to +``osd.1``, ``osd.2`` and ``osd.3`` respectively. CRUSH actually seeks a +pseudo-random placement that will take into account failure domains you set in +your `CRUSH map`_, so you will rarely see placement groups assigned to nearest +neighbor OSDs in a large cluster. + +Ceph processes a client request using the **Acting Set**, which is the set of +OSDs that will actually handle the requests since they have a full and working +version of a placement group shard. The set of OSDs that should contain a shard +of a particular placement group as the **Up Set**, i.e. where data is +moved/copied to (or planned to be). + +In some cases, an OSD in the Acting Set is ``down`` or otherwise not able to +service requests for objects in the placement group. When these situations +arise, don't panic. Common examples include: + +- You added or removed an OSD. Then, CRUSH reassigned the placement group to + other OSDs--thereby changing the composition of the Acting Set and spawning + the migration of data with a "backfill" process. +- An OSD was ``down``, was restarted, and is now ``recovering``. +- An OSD in the Acting Set is ``down`` or unable to service requests, + and another OSD has temporarily assumed its duties. + +In most cases, the Up Set and the Acting Set are identical. When they are not, +it may indicate that Ceph is migrating the PG (it's remapped), an OSD is +recovering, or that there is a problem (i.e., Ceph usually echoes a "HEALTH +WARN" state with a "stuck stale" message in such scenarios). + +To retrieve a list of placement groups, execute: + +.. prompt:: bash $ + + ceph pg dump + +To view which OSDs are within the Acting Set or the Up Set for a given placement +group, execute: + +.. prompt:: bash $ + + ceph pg map {pg-num} + +The result should tell you the osdmap epoch (eNNN), the placement group number +({pg-num}), the OSDs in the Up Set (up[]), and the OSDs in the acting set +(acting[]):: + + osdmap eNNN pg {raw-pg-num} ({pg-num}) -> up [0,1,2] acting [0,1,2] + +.. note:: If the Up Set and Acting Set do not match, this may be an indicator + that the cluster rebalancing itself or of a potential problem with + the cluster. + + +Peering +======= + +Before you can write data to a placement group, it must be in an ``active`` +state, and it **should** be in a ``clean`` state. For Ceph to determine the +current state of a placement group, the primary OSD of the placement group +(i.e., the first OSD in the acting set), peers with the secondary and tertiary +OSDs to establish agreement on the current state of the placement group +(assuming a pool with 3 replicas of the PG). + + +.. ditaa:: + + +---------+ +---------+ +-------+ + | OSD 1 | | OSD 2 | | OSD 3 | + +---------+ +---------+ +-------+ + | | | + | Request To | | + | Peer | | + |-------------->| | + |<--------------| | + | Peering | + | | + | Request To | + | Peer | + |----------------------------->| + |<-----------------------------| + | Peering | + +The OSDs also report their status to the monitor. See `Configuring Monitor/OSD +Interaction`_ for details. To troubleshoot peering issues, see `Peering +Failure`_. + + +Monitoring Placement Group States +================================= + +If you execute a command such as ``ceph health``, ``ceph -s`` or ``ceph -w``, +you may notice that the cluster does not always echo back ``HEALTH OK``. After +you check to see if the OSDs are running, you should also check placement group +states. You should expect that the cluster will **NOT** echo ``HEALTH OK`` in a +number of placement group peering-related circumstances: + +#. You have just created a pool and placement groups haven't peered yet. +#. The placement groups are recovering. +#. You have just added an OSD to or removed an OSD from the cluster. +#. You have just modified your CRUSH map and your placement groups are migrating. +#. There is inconsistent data in different replicas of a placement group. +#. Ceph is scrubbing a placement group's replicas. +#. Ceph doesn't have enough storage capacity to complete backfilling operations. + +If one of the foregoing circumstances causes Ceph to echo ``HEALTH WARN``, don't +panic. In many cases, the cluster will recover on its own. In some cases, you +may need to take action. An important aspect of monitoring placement groups is +to ensure that when the cluster is up and running that all placement groups are +``active``, and preferably in the ``clean`` state. To see the status of all +placement groups, execute: + +.. prompt:: bash $ + + ceph pg stat + +The result should tell you the total number of placement groups (x), how many +placement groups are in a particular state such as ``active+clean`` (y) and the +amount of data stored (z). :: + + x pgs: y active+clean; z bytes data, aa MB used, bb GB / cc GB avail + +.. note:: It is common for Ceph to report multiple states for placement groups. + +In addition to the placement group states, Ceph will also echo back the amount of +storage capacity used (aa), the amount of storage capacity remaining (bb), and the total +storage capacity for the placement group. These numbers can be important in a +few cases: + +- You are reaching your ``near full ratio`` or ``full ratio``. +- Your data is not getting distributed across the cluster due to an + error in your CRUSH configuration. + + +.. topic:: Placement Group IDs + + Placement group IDs consist of the pool number (not pool name) followed + by a period (.) and the placement group ID--a hexadecimal number. You + can view pool numbers and their names from the output of ``ceph osd + lspools``. For example, the first pool created corresponds to + pool number ``1``. A fully qualified placement group ID has the + following form:: + + {pool-num}.{pg-id} + + And it typically looks like this:: + + 1.1f + + +To retrieve a list of placement groups, execute the following: + +.. prompt:: bash $ + + ceph pg dump + +You can also format the output in JSON format and save it to a file: + +.. prompt:: bash $ + + ceph pg dump -o {filename} --format=json + +To query a particular placement group, execute the following: + +.. prompt:: bash $ + + ceph pg {poolnum}.{pg-id} query + +Ceph will output the query in JSON format. + +The following subsections describe the common pg states in detail. + +Creating +-------- + +When you create a pool, it will create the number of placement groups you +specified. Ceph will echo ``creating`` when it is creating one or more +placement groups. Once they are created, the OSDs that are part of a placement +group's Acting Set will peer. Once peering is complete, the placement group +status should be ``active+clean``, which means a Ceph client can begin writing +to the placement group. + +.. ditaa:: + + /-----------\ /-----------\ /-----------\ + | Creating |------>| Peering |------>| Active | + \-----------/ \-----------/ \-----------/ + +Peering +------- + +When Ceph is Peering a placement group, Ceph is bringing the OSDs that +store the replicas of the placement group into **agreement about the state** +of the objects and metadata in the placement group. When Ceph completes peering, +this means that the OSDs that store the placement group agree about the current +state of the placement group. However, completion of the peering process does +**NOT** mean that each replica has the latest contents. + +.. topic:: Authoritative History + + Ceph will **NOT** acknowledge a write operation to a client, until + all OSDs of the acting set persist the write operation. This practice + ensures that at least one member of the acting set will have a record + of every acknowledged write operation since the last successful + peering operation. + + With an accurate record of each acknowledged write operation, Ceph can + construct and disseminate a new authoritative history of the placement + group--a complete, and fully ordered set of operations that, if performed, + would bring an OSD’s copy of a placement group up to date. + + +Active +------ + +Once Ceph completes the peering process, a placement group may become +``active``. The ``active`` state means that the data in the placement group is +generally available in the primary placement group and the replicas for read +and write operations. + + +Clean +----- + +When a placement group is in the ``clean`` state, the primary OSD and the +replica OSDs have successfully peered and there are no stray replicas for the +placement group. Ceph replicated all objects in the placement group the correct +number of times. + + +Degraded +-------- + +When a client writes an object to the primary OSD, the primary OSD is +responsible for writing the replicas to the replica OSDs. After the primary OSD +writes the object to storage, the placement group will remain in a ``degraded`` +state until the primary OSD has received an acknowledgement from the replica +OSDs that Ceph created the replica objects successfully. + +The reason a placement group can be ``active+degraded`` is that an OSD may be +``active`` even though it doesn't hold all of the objects yet. If an OSD goes +``down``, Ceph marks each placement group assigned to the OSD as ``degraded``. +The OSDs must peer again when the OSD comes back online. However, a client can +still write a new object to a ``degraded`` placement group if it is ``active``. + +If an OSD is ``down`` and the ``degraded`` condition persists, Ceph may mark the +``down`` OSD as ``out`` of the cluster and remap the data from the ``down`` OSD +to another OSD. The time between being marked ``down`` and being marked ``out`` +is controlled by ``mon osd down out interval``, which is set to ``600`` seconds +by default. + +A placement group can also be ``degraded``, because Ceph cannot find one or more +objects that Ceph thinks should be in the placement group. While you cannot +read or write to unfound objects, you can still access all of the other objects +in the ``degraded`` placement group. + + +Recovering +---------- + +Ceph was designed for fault-tolerance at a scale where hardware and software +problems are ongoing. When an OSD goes ``down``, its contents may fall behind +the current state of other replicas in the placement groups. When the OSD is +back ``up``, the contents of the placement groups must be updated to reflect the +current state. During that time period, the OSD may reflect a ``recovering`` +state. + +Recovery is not always trivial, because a hardware failure might cause a +cascading failure of multiple OSDs. For example, a network switch for a rack or +cabinet may fail, which can cause the OSDs of a number of host machines to fall +behind the current state of the cluster. Each one of the OSDs must recover once +the fault is resolved. + +Ceph provides a number of settings to balance the resource contention between +new service requests and the need to recover data objects and restore the +placement groups to the current state. The ``osd recovery delay start`` setting +allows an OSD to restart, re-peer and even process some replay requests before +starting the recovery process. The ``osd +recovery thread timeout`` sets a thread timeout, because multiple OSDs may fail, +restart and re-peer at staggered rates. The ``osd recovery max active`` setting +limits the number of recovery requests an OSD will entertain simultaneously to +prevent the OSD from failing to serve . The ``osd recovery max chunk`` setting +limits the size of the recovered data chunks to prevent network congestion. + + +Back Filling +------------ + +When a new OSD joins the cluster, CRUSH will reassign placement groups from OSDs +in the cluster to the newly added OSD. Forcing the new OSD to accept the +reassigned placement groups immediately can put excessive load on the new OSD. +Back filling the OSD with the placement groups allows this process to begin in +the background. Once backfilling is complete, the new OSD will begin serving +requests when it is ready. + +During the backfill operations, you may see one of several states: +``backfill_wait`` indicates that a backfill operation is pending, but is not +underway yet; ``backfilling`` indicates that a backfill operation is underway; +and, ``backfill_toofull`` indicates that a backfill operation was requested, +but couldn't be completed due to insufficient storage capacity. When a +placement group cannot be backfilled, it may be considered ``incomplete``. + +The ``backfill_toofull`` state may be transient. It is possible that as PGs +are moved around, space may become available. The ``backfill_toofull`` is +similar to ``backfill_wait`` in that as soon as conditions change +backfill can proceed. + +Ceph provides a number of settings to manage the load spike associated with +reassigning placement groups to an OSD (especially a new OSD). By default, +``osd_max_backfills`` sets the maximum number of concurrent backfills to and from +an OSD to 1. The ``backfill full ratio`` enables an OSD to refuse a +backfill request if the OSD is approaching its full ratio (90%, by default) and +change with ``ceph osd set-backfillfull-ratio`` command. +If an OSD refuses a backfill request, the ``osd backfill retry interval`` +enables an OSD to retry the request (after 30 seconds, by default). OSDs can +also set ``osd backfill scan min`` and ``osd backfill scan max`` to manage scan +intervals (64 and 512, by default). + + +Remapped +-------- + +When the Acting Set that services a placement group changes, the data migrates +from the old acting set to the new acting set. It may take some time for a new +primary OSD to service requests. So it may ask the old primary to continue to +service requests until the placement group migration is complete. Once data +migration completes, the mapping uses the primary OSD of the new acting set. + + +Stale +----- + +While Ceph uses heartbeats to ensure that hosts and daemons are running, the +``ceph-osd`` daemons may also get into a ``stuck`` state where they are not +reporting statistics in a timely manner (e.g., a temporary network fault). By +default, OSD daemons report their placement group, up through, boot and failure +statistics every half second (i.e., ``0.5``), which is more frequent than the +heartbeat thresholds. If the **Primary OSD** of a placement group's acting set +fails to report to the monitor or if other OSDs have reported the primary OSD +``down``, the monitors will mark the placement group ``stale``. + +When you start your cluster, it is common to see the ``stale`` state until +the peering process completes. After your cluster has been running for awhile, +seeing placement groups in the ``stale`` state indicates that the primary OSD +for those placement groups is ``down`` or not reporting placement group statistics +to the monitor. + + +Identifying Troubled PGs +======================== + +As previously noted, a placement group is not necessarily problematic just +because its state is not ``active+clean``. Generally, Ceph's ability to self +repair may not be working when placement groups get stuck. The stuck states +include: + +- **Unclean**: Placement groups contain objects that are not replicated the + desired number of times. They should be recovering. +- **Inactive**: Placement groups cannot process reads or writes because they + are waiting for an OSD with the most up-to-date data to come back ``up``. +- **Stale**: Placement groups are in an unknown state, because the OSDs that + host them have not reported to the monitor cluster in a while (configured + by ``mon osd report timeout``). + +To identify stuck placement groups, execute the following: + +.. prompt:: bash $ + + ceph pg dump_stuck [unclean|inactive|stale|undersized|degraded] + +See `Placement Group Subsystem`_ for additional details. To troubleshoot +stuck placement groups, see `Troubleshooting PG Errors`_. + + +Finding an Object Location +========================== + +To store object data in the Ceph Object Store, a Ceph client must: + +#. Set an object name +#. Specify a `pool`_ + +The Ceph client retrieves the latest cluster map and the CRUSH algorithm +calculates how to map the object to a `placement group`_, and then calculates +how to assign the placement group to an OSD dynamically. To find the object +location, all you need is the object name and the pool name. For example: + +.. prompt:: bash $ + + ceph osd map {poolname} {object-name} [namespace] + +.. topic:: Exercise: Locate an Object + + As an exercise, let's create an object. Specify an object name, a path + to a test file containing some object data and a pool name using the + ``rados put`` command on the command line. For example: + + .. prompt:: bash $ + + rados put {object-name} {file-path} --pool=data + rados put test-object-1 testfile.txt --pool=data + + To verify that the Ceph Object Store stored the object, execute the + following: + + .. prompt:: bash $ + + rados -p data ls + + Now, identify the object location: + + .. prompt:: bash $ + + ceph osd map {pool-name} {object-name} + ceph osd map data test-object-1 + + Ceph should output the object's location. For example:: + + osdmap e537 pool 'data' (1) object 'test-object-1' -> pg 1.d1743484 (1.4) -> up ([0,1], p0) acting ([0,1], p0) + + To remove the test object, simply delete it using the ``rados rm`` + command. For example: + + .. prompt:: bash $ + + rados rm test-object-1 --pool=data + + +As the cluster evolves, the object location may change dynamically. One benefit +of Ceph's dynamic rebalancing is that Ceph relieves you from having to perform +the migration manually. See the `Architecture`_ section for details. + +.. _data placement: ../data-placement +.. _pool: ../pools +.. _placement group: ../placement-groups +.. _Architecture: ../../../architecture +.. _OSD Not Running: ../../troubleshooting/troubleshooting-osd#osd-not-running +.. _Troubleshooting PG Errors: ../../troubleshooting/troubleshooting-pg#troubleshooting-pg-errors +.. _Peering Failure: ../../troubleshooting/troubleshooting-pg#failures-osd-peering +.. _CRUSH map: ../crush-map +.. _Configuring Monitor/OSD Interaction: ../../configuration/mon-osd-interaction/ +.. _Placement Group Subsystem: ../control#placement-group-subsystem diff --git a/doc/rados/operations/monitoring.rst b/doc/rados/operations/monitoring.rst new file mode 100644 index 000000000..4df711d8b --- /dev/null +++ b/doc/rados/operations/monitoring.rst @@ -0,0 +1,647 @@ +====================== + Monitoring a Cluster +====================== + +Once you have a running cluster, you may use the ``ceph`` tool to monitor your +cluster. Monitoring a cluster typically involves checking OSD status, monitor +status, placement group status and metadata server status. + +Using the command line +====================== + +Interactive mode +---------------- + +To run the ``ceph`` tool in interactive mode, type ``ceph`` at the command line +with no arguments. For example: + +.. prompt:: bash $ + + ceph + +.. prompt:: ceph> + :prompts: ceph> + + health + status + quorum_status + mon stat + +Non-default paths +----------------- + +If you specified non-default locations for your configuration or keyring, +you may specify their locations: + +.. prompt:: bash $ + + ceph -c /path/to/conf -k /path/to/keyring health + +Checking a Cluster's Status +=========================== + +After you start your cluster, and before you start reading and/or +writing data, check your cluster's status first. + +To check a cluster's status, execute the following: + +.. prompt:: bash $ + + ceph status + +Or: + +.. prompt:: bash $ + + ceph -s + +In interactive mode, type ``status`` and press **Enter**: + +.. prompt:: ceph> + :prompts: ceph> + + ceph> status + +Ceph will print the cluster status. For example, a tiny Ceph demonstration +cluster with one of each service may print the following: + +:: + + cluster: + id: 477e46f1-ae41-4e43-9c8f-72c918ab0a20 + health: HEALTH_OK + + services: + mon: 3 daemons, quorum a,b,c + mgr: x(active) + mds: cephfs_a-1/1/1 up {0=a=up:active}, 2 up:standby + osd: 3 osds: 3 up, 3 in + + data: + pools: 2 pools, 16 pgs + objects: 21 objects, 2.19K + usage: 546 GB used, 384 GB / 931 GB avail + pgs: 16 active+clean + + +.. topic:: How Ceph Calculates Data Usage + + The ``usage`` value reflects the *actual* amount of raw storage used. The + ``xxx GB / xxx GB`` value means the amount available (the lesser number) + of the overall storage capacity of the cluster. The notional number reflects + the size of the stored data before it is replicated, cloned or snapshotted. + Therefore, the amount of data actually stored typically exceeds the notional + amount stored, because Ceph creates replicas of the data and may also use + storage capacity for cloning and snapshotting. + + +Watching a Cluster +================== + +In addition to local logging by each daemon, Ceph clusters maintain +a *cluster log* that records high level events about the whole system. +This is logged to disk on monitor servers (as ``/var/log/ceph/ceph.log`` by +default), but can also be monitored via the command line. + +To follow the cluster log, use the following command: + +.. prompt:: bash $ + + ceph -w + +Ceph will print the status of the system, followed by each log message as it +is emitted. For example: + +:: + + cluster: + id: 477e46f1-ae41-4e43-9c8f-72c918ab0a20 + health: HEALTH_OK + + services: + mon: 3 daemons, quorum a,b,c + mgr: x(active) + mds: cephfs_a-1/1/1 up {0=a=up:active}, 2 up:standby + osd: 3 osds: 3 up, 3 in + + data: + pools: 2 pools, 16 pgs + objects: 21 objects, 2.19K + usage: 546 GB used, 384 GB / 931 GB avail + pgs: 16 active+clean + + + 2017-07-24 08:15:11.329298 mon.a mon.0 172.21.9.34:6789/0 23 : cluster [INF] osd.0 172.21.9.34:6806/20527 boot + 2017-07-24 08:15:14.258143 mon.a mon.0 172.21.9.34:6789/0 39 : cluster [INF] Activating manager daemon x + 2017-07-24 08:15:15.446025 mon.a mon.0 172.21.9.34:6789/0 47 : cluster [INF] Manager daemon x is now available + + +In addition to using ``ceph -w`` to print log lines as they are emitted, +use ``ceph log last [n]`` to see the most recent ``n`` lines from the cluster +log. + +Monitoring Health Checks +======================== + +Ceph continuously runs various *health checks* against its own status. When +a health check fails, this is reflected in the output of ``ceph status`` (or +``ceph health``). In addition, messages are sent to the cluster log to +indicate when a check fails, and when the cluster recovers. + +For example, when an OSD goes down, the ``health`` section of the status +output may be updated as follows: + +:: + + health: HEALTH_WARN + 1 osds down + Degraded data redundancy: 21/63 objects degraded (33.333%), 16 pgs unclean, 16 pgs degraded + +At this time, cluster log messages are also emitted to record the failure of the +health checks: + +:: + + 2017-07-25 10:08:58.265945 mon.a mon.0 172.21.9.34:6789/0 91 : cluster [WRN] Health check failed: 1 osds down (OSD_DOWN) + 2017-07-25 10:09:01.302624 mon.a mon.0 172.21.9.34:6789/0 94 : cluster [WRN] Health check failed: Degraded data redundancy: 21/63 objects degraded (33.333%), 16 pgs unclean, 16 pgs degraded (PG_DEGRADED) + +When the OSD comes back online, the cluster log records the cluster's return +to a health state: + +:: + + 2017-07-25 10:11:11.526841 mon.a mon.0 172.21.9.34:6789/0 109 : cluster [WRN] Health check update: Degraded data redundancy: 2 pgs unclean, 2 pgs degraded, 2 pgs undersized (PG_DEGRADED) + 2017-07-25 10:11:13.535493 mon.a mon.0 172.21.9.34:6789/0 110 : cluster [INF] Health check cleared: PG_DEGRADED (was: Degraded data redundancy: 2 pgs unclean, 2 pgs degraded, 2 pgs undersized) + 2017-07-25 10:11:13.535577 mon.a mon.0 172.21.9.34:6789/0 111 : cluster [INF] Cluster is now healthy + +Network Performance Checks +-------------------------- + +Ceph OSDs send heartbeat ping messages amongst themselves to monitor daemon availability. We +also use the response times to monitor network performance. +While it is possible that a busy OSD could delay a ping response, we can assume +that if a network switch fails multiple delays will be detected between distinct pairs of OSDs. + +By default we will warn about ping times which exceed 1 second (1000 milliseconds). + +:: + + HEALTH_WARN Slow OSD heartbeats on back (longest 1118.001ms) + +The health detail will add the combination of OSDs are seeing the delays and by how much. There is a limit of 10 +detail line items. + +:: + + [WRN] OSD_SLOW_PING_TIME_BACK: Slow OSD heartbeats on back (longest 1118.001ms) + Slow OSD heartbeats on back from osd.0 [dc1,rack1] to osd.1 [dc1,rack1] 1118.001 msec possibly improving + Slow OSD heartbeats on back from osd.0 [dc1,rack1] to osd.2 [dc1,rack2] 1030.123 msec + Slow OSD heartbeats on back from osd.2 [dc1,rack2] to osd.1 [dc1,rack1] 1015.321 msec + Slow OSD heartbeats on back from osd.1 [dc1,rack1] to osd.0 [dc1,rack1] 1010.456 msec + +To see even more detail and a complete dump of network performance information the ``dump_osd_network`` command can be used. Typically, this would be +sent to a mgr, but it can be limited to a particular OSD's interactions by issuing it to any OSD. The current threshold which defaults to 1 second +(1000 milliseconds) can be overridden as an argument in milliseconds. + +The following command will show all gathered network performance data by specifying a threshold of 0 and sending to the mgr. + +.. prompt:: bash $ + + ceph daemon /var/run/ceph/ceph-mgr.x.asok dump_osd_network 0 + +:: + + { + "threshold": 0, + "entries": [ + { + "last update": "Wed Sep 4 17:04:49 2019", + "stale": false, + "from osd": 2, + "to osd": 0, + "interface": "front", + "average": { + "1min": 1.023, + "5min": 0.860, + "15min": 0.883 + }, + "min": { + "1min": 0.818, + "5min": 0.607, + "15min": 0.607 + }, + "max": { + "1min": 1.164, + "5min": 1.173, + "15min": 1.544 + }, + "last": 0.924 + }, + { + "last update": "Wed Sep 4 17:04:49 2019", + "stale": false, + "from osd": 2, + "to osd": 0, + "interface": "back", + "average": { + "1min": 0.968, + "5min": 0.897, + "15min": 0.830 + }, + "min": { + "1min": 0.860, + "5min": 0.563, + "15min": 0.502 + }, + "max": { + "1min": 1.171, + "5min": 1.216, + "15min": 1.456 + }, + "last": 0.845 + }, + { + "last update": "Wed Sep 4 17:04:48 2019", + "stale": false, + "from osd": 0, + "to osd": 1, + "interface": "front", + "average": { + "1min": 0.965, + "5min": 0.811, + "15min": 0.850 + }, + "min": { + "1min": 0.650, + "5min": 0.488, + "15min": 0.466 + }, + "max": { + "1min": 1.252, + "5min": 1.252, + "15min": 1.362 + }, + "last": 0.791 + }, + ... + + + +Muting health checks +-------------------- + +Health checks can be muted so that they do not affect the overall +reported status of the cluster. Alerts are specified using the health +check code (see :ref:`health-checks`): + +.. prompt:: bash $ + + ceph health mute <code> + +For example, if there is a health warning, muting it will make the +cluster report an overall status of ``HEALTH_OK``. For example, to +mute an ``OSD_DOWN`` alert,: + +.. prompt:: bash $ + + ceph health mute OSD_DOWN + +Mutes are reported as part of the short and long form of the ``ceph health`` command. +For example, in the above scenario, the cluster would report: + +.. prompt:: bash $ + + ceph health + +:: + + HEALTH_OK (muted: OSD_DOWN) + +.. prompt:: bash $ + + ceph health detail + +:: + + HEALTH_OK (muted: OSD_DOWN) + (MUTED) OSD_DOWN 1 osds down + osd.1 is down + +A mute can be explicitly removed with: + +.. prompt:: bash $ + + ceph health unmute <code> + +For example: + +.. prompt:: bash $ + + ceph health unmute OSD_DOWN + +A health check mute may optionally have a TTL (time to live) +associated with it, such that the mute will automatically expire +after the specified period of time has elapsed. The TTL is specified as an optional +duration argument, e.g.: + +.. prompt:: bash $ + + ceph health mute OSD_DOWN 4h # mute for 4 hours + ceph health mute MON_DOWN 15m # mute for 15 minutes + +Normally, if a muted health alert is resolved (e.g., in the example +above, the OSD comes back up), the mute goes away. If the alert comes +back later, it will be reported in the usual way. + +It is possible to make a mute "sticky" such that the mute will remain even if the +alert clears. For example: + +.. prompt:: bash $ + + ceph health mute OSD_DOWN 1h --sticky # ignore any/all down OSDs for next hour + +Most health mutes also disappear if the extent of an alert gets worse. For example, +if there is one OSD down, and the alert is muted, the mute will disappear if one +or more additional OSDs go down. This is true for any health alert that involves +a count indicating how much or how many of something is triggering the warning or +error. + + +Detecting configuration issues +============================== + +In addition to the health checks that Ceph continuously runs on its +own status, there are some configuration issues that may only be detected +by an external tool. + +Use the `ceph-medic`_ tool to run these additional checks on your Ceph +cluster's configuration. + +Checking a Cluster's Usage Stats +================================ + +To check a cluster's data usage and data distribution among pools, you can +use the ``df`` option. It is similar to Linux ``df``. Execute +the following: + +.. prompt:: bash $ + + ceph df + +The output of ``ceph df`` looks like this:: + + CLASS SIZE AVAIL USED RAW USED %RAW USED + ssd 202 GiB 200 GiB 2.0 GiB 2.0 GiB 1.00 + TOTAL 202 GiB 200 GiB 2.0 GiB 2.0 GiB 1.00 + + --- POOLS --- + POOL ID PGS STORED (DATA) (OMAP) OBJECTS USED (DATA) (OMAP) %USED MAX AVAIL QUOTA OBJECTS QUOTA BYTES DIRTY USED COMPR UNDER COMPR + device_health_metrics 1 1 242 KiB 15 KiB 227 KiB 4 251 KiB 24 KiB 227 KiB 0 297 GiB N/A N/A 4 0 B 0 B + cephfs.a.meta 2 32 6.8 KiB 6.8 KiB 0 B 22 96 KiB 96 KiB 0 B 0 297 GiB N/A N/A 22 0 B 0 B + cephfs.a.data 3 32 0 B 0 B 0 B 0 0 B 0 B 0 B 0 99 GiB N/A N/A 0 0 B 0 B + test 4 32 22 MiB 22 MiB 50 KiB 248 19 MiB 19 MiB 50 KiB 0 297 GiB N/A N/A 248 0 B 0 B + + + + + +- **CLASS:** for example, "ssd" or "hdd" +- **SIZE:** The amount of storage capacity managed by the cluster. +- **AVAIL:** The amount of free space available in the cluster. +- **USED:** The amount of raw storage consumed by user data (excluding + BlueStore's database) +- **RAW USED:** The amount of raw storage consumed by user data, internal + overhead, or reserved capacity. +- **%RAW USED:** The percentage of raw storage used. Use this number in + conjunction with the ``full ratio`` and ``near full ratio`` to ensure that + you are not reaching your cluster's capacity. See `Storage Capacity`_ for + additional details. + + +**POOLS:** + +The **POOLS** section of the output provides a list of pools and the notional +usage of each pool. The output from this section **DOES NOT** reflect replicas, +clones or snapshots. For example, if you store an object with 1MB of data, the +notional usage will be 1MB, but the actual usage may be 2MB or more depending +on the number of replicas, clones and snapshots. + +- **ID:** The number of the node within the pool. +- **STORED:** actual amount of data user/Ceph has stored in a pool. This is + similar to the USED column in earlier versions of Ceph but the calculations + (for BlueStore!) are more precise (gaps are properly handled). + + - **(DATA):** usage for RBD (RADOS Block Device), CephFS file data, and RGW + (RADOS Gateway) object data. + - **(OMAP):** key-value pairs. Used primarily by CephFS and RGW (RADOS + Gateway) for metadata storage. + +- **OBJECTS:** The notional number of objects stored per pool. "Notional" is + defined above in the paragraph immediately under "POOLS". +- **USED:** The space allocated for a pool over all OSDs. This includes + replication, allocation granularity, and erasure-coding overhead. Compression + savings and object content gaps are also taken into account. BlueStore's + database is not included in this amount. + + - **(DATA):** object usage for RBD (RADOS Block Device), CephFS file data, and RGW + (RADOS Gateway) object data. + - **(OMAP):** object key-value pairs. Used primarily by CephFS and RGW (RADOS + Gateway) for metadata storage. + +- **%USED:** The notional percentage of storage used per pool. +- **MAX AVAIL:** An estimate of the notional amount of data that can be written + to this pool. +- **QUOTA OBJECTS:** The number of quota objects. +- **QUOTA BYTES:** The number of bytes in the quota objects. +- **DIRTY:** The number of objects in the cache pool that have been written to + the cache pool but have not been flushed yet to the base pool. This field is + only available when cache tiering is in use. +- **USED COMPR:** amount of space allocated for compressed data (i.e. this + includes comrpessed data plus all the allocation, replication and erasure + coding overhead). +- **UNDER COMPR:** amount of data passed through compression (summed over all + replicas) and beneficial enough to be stored in a compressed form. + + +.. note:: The numbers in the POOLS section are notional. They are not + inclusive of the number of replicas, snapshots or clones. As a result, the + sum of the USED and %USED amounts will not add up to the USED and %USED + amounts in the RAW section of the output. + +.. note:: The MAX AVAIL value is a complicated function of the replication + or erasure code used, the CRUSH rule that maps storage to devices, the + utilization of those devices, and the configured ``mon_osd_full_ratio``. + + +Checking OSD Status +=================== + +You can check OSDs to ensure they are ``up`` and ``in`` by executing the +following command: + +.. prompt:: bash # + + ceph osd stat + +Or: + +.. prompt:: bash # + + ceph osd dump + +You can also check view OSDs according to their position in the CRUSH map by +using the folloiwng command: + +.. prompt:: bash # + + ceph osd tree + +Ceph will print out a CRUSH tree with a host, its OSDs, whether they are up +and their weight: + +.. code-block:: bash + + #ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF + -1 3.00000 pool default + -3 3.00000 rack mainrack + -2 3.00000 host osd-host + 0 ssd 1.00000 osd.0 up 1.00000 1.00000 + 1 ssd 1.00000 osd.1 up 1.00000 1.00000 + 2 ssd 1.00000 osd.2 up 1.00000 1.00000 + +For a detailed discussion, refer to `Monitoring OSDs and Placement Groups`_. + +Checking Monitor Status +======================= + +If your cluster has multiple monitors (likely), you should check the monitor +quorum status after you start the cluster and before reading and/or writing data. A +quorum must be present when multiple monitors are running. You should also check +monitor status periodically to ensure that they are running. + +To see display the monitor map, execute the following: + +.. prompt:: bash $ + + ceph mon stat + +Or: + +.. prompt:: bash $ + + ceph mon dump + +To check the quorum status for the monitor cluster, execute the following: + +.. prompt:: bash $ + + ceph quorum_status + +Ceph will return the quorum status. For example, a Ceph cluster consisting of +three monitors may return the following: + +.. code-block:: javascript + + { "election_epoch": 10, + "quorum": [ + 0, + 1, + 2], + "quorum_names": [ + "a", + "b", + "c"], + "quorum_leader_name": "a", + "monmap": { "epoch": 1, + "fsid": "444b489c-4f16-4b75-83f0-cb8097468898", + "modified": "2011-12-12 13:28:27.505520", + "created": "2011-12-12 13:28:27.505520", + "features": {"persistent": [ + "kraken", + "luminous", + "mimic"], + "optional": [] + }, + "mons": [ + { "rank": 0, + "name": "a", + "addr": "127.0.0.1:6789/0", + "public_addr": "127.0.0.1:6789/0"}, + { "rank": 1, + "name": "b", + "addr": "127.0.0.1:6790/0", + "public_addr": "127.0.0.1:6790/0"}, + { "rank": 2, + "name": "c", + "addr": "127.0.0.1:6791/0", + "public_addr": "127.0.0.1:6791/0"} + ] + } + } + +Checking MDS Status +=================== + +Metadata servers provide metadata services for CephFS. Metadata servers have +two sets of states: ``up | down`` and ``active | inactive``. To ensure your +metadata servers are ``up`` and ``active``, execute the following: + +.. prompt:: bash $ + + ceph mds stat + +To display details of the metadata cluster, execute the following: + +.. prompt:: bash $ + + ceph fs dump + + +Checking Placement Group States +=============================== + +Placement groups map objects to OSDs. When you monitor your +placement groups, you will want them to be ``active`` and ``clean``. +For a detailed discussion, refer to `Monitoring OSDs and Placement Groups`_. + +.. _Monitoring OSDs and Placement Groups: ../monitoring-osd-pg + +.. _rados-monitoring-using-admin-socket: + +Using the Admin Socket +====================== + +The Ceph admin socket allows you to query a daemon via a socket interface. +By default, Ceph sockets reside under ``/var/run/ceph``. To access a daemon +via the admin socket, login to the host running the daemon and use the +following command: + +.. prompt:: bash $ + + ceph daemon {daemon-name} + ceph daemon {path-to-socket-file} + +For example, the following are equivalent: + +.. prompt:: bash $ + + ceph daemon osd.0 foo + ceph daemon /var/run/ceph/ceph-osd.0.asok foo + +To view the available admin socket commands, execute the following command: + +.. prompt:: bash $ + + ceph daemon {daemon-name} help + +The admin socket command enables you to show and set your configuration at +runtime. See `Viewing a Configuration at Runtime`_ for details. + +Additionally, you can set configuration values at runtime directly (i.e., the +admin socket bypasses the monitor, unlike ``ceph tell {daemon-type}.{id} +config set``, which relies on the monitor but doesn't require you to login +directly to the host in question ). + +.. _Viewing a Configuration at Runtime: ../../configuration/ceph-conf#viewing-a-configuration-at-runtime +.. _Storage Capacity: ../../configuration/mon-config-ref#storage-capacity +.. _ceph-medic: http://docs.ceph.com/ceph-medic/master/ diff --git a/doc/rados/operations/operating.rst b/doc/rados/operations/operating.rst new file mode 100644 index 000000000..134774ccb --- /dev/null +++ b/doc/rados/operations/operating.rst @@ -0,0 +1,255 @@ +===================== + Operating a Cluster +===================== + +.. index:: systemd; operating a cluster + + +Running Ceph with systemd +========================== + +For all distributions that support systemd (CentOS 7, Fedora, Debian +Jessie 8 and later, SUSE), ceph daemons are now managed using native +systemd files instead of the legacy sysvinit scripts. For example: + +.. prompt:: bash $ + + sudo systemctl start ceph.target # start all daemons + sudo systemctl status ceph-osd@12 # check status of osd.12 + +To list the Ceph systemd units on a node, execute: + +.. prompt:: bash $ + + sudo systemctl status ceph\*.service ceph\*.target + +Starting all Daemons +-------------------- + +To start all daemons on a Ceph Node (irrespective of type), execute the +following: + +.. prompt:: bash $ + + sudo systemctl start ceph.target + + +Stopping all Daemons +-------------------- + +To stop all daemons on a Ceph Node (irrespective of type), execute the +following: + +.. prompt:: bash $ + + sudo systemctl stop ceph\*.service ceph\*.target + + +Starting all Daemons by Type +---------------------------- + +To start all daemons of a particular type on a Ceph Node, execute one of the +following: + +.. prompt:: bash $ + + sudo systemctl start ceph-osd.target + sudo systemctl start ceph-mon.target + sudo systemctl start ceph-mds.target + + +Stopping all Daemons by Type +---------------------------- + +To stop all daemons of a particular type on a Ceph Node, execute one of the +following: + +.. prompt:: bash $ + + sudo systemctl stop ceph-mon\*.service ceph-mon.target + sudo systemctl stop ceph-osd\*.service ceph-osd.target + sudo systemctl stop ceph-mds\*.service ceph-mds.target + + +Starting a Daemon +----------------- + +To start a specific daemon instance on a Ceph Node, execute one of the +following: + +.. prompt:: bash $ + + sudo systemctl start ceph-osd@{id} + sudo systemctl start ceph-mon@{hostname} + sudo systemctl start ceph-mds@{hostname} + +For example: + +.. prompt:: bash $ + + sudo systemctl start ceph-osd@1 + sudo systemctl start ceph-mon@ceph-server + sudo systemctl start ceph-mds@ceph-server + + +Stopping a Daemon +----------------- + +To stop a specific daemon instance on a Ceph Node, execute one of the +following: + +.. prompt:: bash $ + + sudo systemctl stop ceph-osd@{id} + sudo systemctl stop ceph-mon@{hostname} + sudo systemctl stop ceph-mds@{hostname} + +For example: + +.. prompt:: bash $ + + sudo systemctl stop ceph-osd@1 + sudo systemctl stop ceph-mon@ceph-server + sudo systemctl stop ceph-mds@ceph-server + + +.. index:: Upstart; operating a cluster + +Running Ceph with Upstart +========================== + +Starting all Daemons +-------------------- + +To start all daemons on a Ceph Node (irrespective of type), execute the +following:: + + sudo start ceph-all + + +Stopping all Daemons +-------------------- + +To stop all daemons on a Ceph Node (irrespective of type), execute the +following:: + + sudo stop ceph-all + + +Starting all Daemons by Type +---------------------------- + +To start all daemons of a particular type on a Ceph Node, execute one of the +following:: + + sudo start ceph-osd-all + sudo start ceph-mon-all + sudo start ceph-mds-all + + +Stopping all Daemons by Type +---------------------------- + +To stop all daemons of a particular type on a Ceph Node, execute one of the +following:: + + sudo stop ceph-osd-all + sudo stop ceph-mon-all + sudo stop ceph-mds-all + + +Starting a Daemon +----------------- + +To start a specific daemon instance on a Ceph Node, execute one of the +following:: + + sudo start ceph-osd id={id} + sudo start ceph-mon id={hostname} + sudo start ceph-mds id={hostname} + +For example:: + + sudo start ceph-osd id=1 + sudo start ceph-mon id=ceph-server + sudo start ceph-mds id=ceph-server + + +Stopping a Daemon +----------------- + +To stop a specific daemon instance on a Ceph Node, execute one of the +following:: + + sudo stop ceph-osd id={id} + sudo stop ceph-mon id={hostname} + sudo stop ceph-mds id={hostname} + +For example:: + + sudo stop ceph-osd id=1 + sudo start ceph-mon id=ceph-server + sudo start ceph-mds id=ceph-server + + +.. index:: sysvinit; operating a cluster + +Running Ceph with sysvinit +========================== + +Each time you to **start**, **restart**, and **stop** Ceph daemons (or your +entire cluster) you must specify at least one option and one command. You may +also specify a daemon type or a daemon instance. :: + + {commandline} [options] [commands] [daemons] + + +The ``ceph`` options include: + ++-----------------+----------+-------------------------------------------------+ +| Option | Shortcut | Description | ++=================+==========+=================================================+ +| ``--verbose`` | ``-v`` | Use verbose logging. | ++-----------------+----------+-------------------------------------------------+ +| ``--valgrind`` | ``N/A`` | (Dev and QA only) Use `Valgrind`_ debugging. | ++-----------------+----------+-------------------------------------------------+ +| ``--allhosts`` | ``-a`` | Execute on all nodes in ``ceph.conf.`` | +| | | Otherwise, it only executes on ``localhost``. | ++-----------------+----------+-------------------------------------------------+ +| ``--restart`` | ``N/A`` | Automatically restart daemon if it core dumps. | ++-----------------+----------+-------------------------------------------------+ +| ``--norestart`` | ``N/A`` | Don't restart a daemon if it core dumps. | ++-----------------+----------+-------------------------------------------------+ +| ``--conf`` | ``-c`` | Use an alternate configuration file. | ++-----------------+----------+-------------------------------------------------+ + +The ``ceph`` commands include: + ++------------------+------------------------------------------------------------+ +| Command | Description | ++==================+============================================================+ +| ``start`` | Start the daemon(s). | ++------------------+------------------------------------------------------------+ +| ``stop`` | Stop the daemon(s). | ++------------------+------------------------------------------------------------+ +| ``forcestop`` | Force the daemon(s) to stop. Same as ``kill -9`` | ++------------------+------------------------------------------------------------+ +| ``killall`` | Kill all daemons of a particular type. | ++------------------+------------------------------------------------------------+ +| ``cleanlogs`` | Cleans out the log directory. | ++------------------+------------------------------------------------------------+ +| ``cleanalllogs`` | Cleans out **everything** in the log directory. | ++------------------+------------------------------------------------------------+ + +For subsystem operations, the ``ceph`` service can target specific daemon types +by adding a particular daemon type for the ``[daemons]`` option. Daemon types +include: + +- ``mon`` +- ``osd`` +- ``mds`` + + + +.. _Valgrind: http://www.valgrind.org/ +.. _initctl: http://manpages.ubuntu.com/manpages/raring/en/man8/initctl.8.html diff --git a/doc/rados/operations/pg-concepts.rst b/doc/rados/operations/pg-concepts.rst new file mode 100644 index 000000000..636d6bf9a --- /dev/null +++ b/doc/rados/operations/pg-concepts.rst @@ -0,0 +1,102 @@ +========================== + Placement Group Concepts +========================== + +When you execute commands like ``ceph -w``, ``ceph osd dump``, and other +commands related to placement groups, Ceph may return values using some +of the following terms: + +*Peering* + The process of bringing all of the OSDs that store + a Placement Group (PG) into agreement about the state + of all of the objects (and their metadata) in that PG. + Note that agreeing on the state does not mean that + they all have the latest contents. + +*Acting Set* + The ordered list of OSDs who are (or were as of some epoch) + responsible for a particular placement group. + +*Up Set* + The ordered list of OSDs responsible for a particular placement + group for a particular epoch according to CRUSH. Normally this + is the same as the *Acting Set*, except when the *Acting Set* has + been explicitly overridden via ``pg_temp`` in the OSD Map. + +*Current Interval* or *Past Interval* + A sequence of OSD map epochs during which the *Acting Set* and *Up + Set* for particular placement group do not change. + +*Primary* + The member (and by convention first) of the *Acting Set*, + that is responsible for coordination peering, and is + the only OSD that will accept client-initiated + writes to objects in a placement group. + +*Replica* + A non-primary OSD in the *Acting Set* for a placement group + (and who has been recognized as such and *activated* by the primary). + +*Stray* + An OSD that is not a member of the current *Acting Set*, but + has not yet been told that it can delete its copies of a + particular placement group. + +*Recovery* + Ensuring that copies of all of the objects in a placement group + are on all of the OSDs in the *Acting Set*. Once *Peering* has + been performed, the *Primary* can start accepting write operations, + and *Recovery* can proceed in the background. + +*PG Info* + Basic metadata about the placement group's creation epoch, the version + for the most recent write to the placement group, *last epoch started*, + *last epoch clean*, and the beginning of the *current interval*. Any + inter-OSD communication about placement groups includes the *PG Info*, + such that any OSD that knows a placement group exists (or once existed) + also has a lower bound on *last epoch clean* or *last epoch started*. + +*PG Log* + A list of recent updates made to objects in a placement group. + Note that these logs can be truncated after all OSDs + in the *Acting Set* have acknowledged up to a certain + point. + +*Missing Set* + Each OSD notes update log entries and if they imply updates to + the contents of an object, adds that object to a list of needed + updates. This list is called the *Missing Set* for that ``<OSD,PG>``. + +*Authoritative History* + A complete, and fully ordered set of operations that, if + performed, would bring an OSD's copy of a placement group + up to date. + +*Epoch* + A (monotonically increasing) OSD map version number + +*Last Epoch Start* + The last epoch at which all nodes in the *Acting Set* + for a particular placement group agreed on an + *Authoritative History*. At this point, *Peering* is + deemed to have been successful. + +*up_thru* + Before a *Primary* can successfully complete the *Peering* process, + it must inform a monitor that is alive through the current + OSD map *Epoch* by having the monitor set its *up_thru* in the osd + map. This helps *Peering* ignore previous *Acting Sets* for which + *Peering* never completed after certain sequences of failures, such as + the second interval below: + + - *acting set* = [A,B] + - *acting set* = [A] + - *acting set* = [] very shortly after (e.g., simultaneous failure, but staggered detection) + - *acting set* = [B] (B restarts, A does not) + +*Last Epoch Clean* + The last *Epoch* at which all nodes in the *Acting set* + for a particular placement group were completely + up to date (both placement group logs and object contents). + At this point, *recovery* is deemed to have been + completed. diff --git a/doc/rados/operations/pg-repair.rst b/doc/rados/operations/pg-repair.rst new file mode 100644 index 000000000..f495530cc --- /dev/null +++ b/doc/rados/operations/pg-repair.rst @@ -0,0 +1,81 @@ +============================ +Repairing PG inconsistencies +============================ +Sometimes a placement group might become "inconsistent". To return the +placement group to an active+clean state, you must first determine which +of the placement groups has become inconsistent and then run the "pg +repair" command on it. This page contains commands for diagnosing placement +groups and the command for repairing placement groups that have become +inconsistent. + +.. highlight:: console + +Commands for Diagnosing Placement-group Problems +================================================ +The commands in this section provide various ways of diagnosing broken placement groups. + +The following command provides a high-level (low detail) overview of the health of the ceph cluster: + +.. prompt:: bash # + + ceph health detail + +The following command provides more detail on the status of the placement groups: + +.. prompt:: bash # + + ceph pg dump --format=json-pretty + +The following command lists inconsistent placement groups: + +.. prompt:: bash # + + rados list-inconsistent-pg {pool} + +The following command lists inconsistent rados objects: + +.. prompt:: bash # + + rados list-inconsistent-obj {pgid} + +The following command lists inconsistent snapsets in the given placement group: + +.. prompt:: bash # + + rados list-inconsistent-snapset {pgid} + + +Commands for Repairing Placement Groups +======================================= +The form of the command to repair a broken placement group is: + +.. prompt:: bash # + + ceph pg repair {pgid} + +Where ``{pgid}`` is the id of the affected placement group. + +For example: + +.. prompt:: bash # + + ceph pg repair 1.4 + +More Information on Placement Group Repair +========================================== +Ceph stores and updates the checksums of objects stored in the cluster. When a scrub is performed on a placement group, the OSD attempts to choose an authoritative copy from among its replicas. Among all of the possible cases, only one case is consistent. After a deep scrub, Ceph calculates the checksum of an object read from the disk and compares it to the checksum previously recorded. If the current checksum and the previously recorded checksums do not match, that is an inconsistency. In the case of replicated pools, any mismatch between the checksum of any replica of an object and the checksum of the authoritative copy means that there is an inconsistency. + +The "pg repair" command attempts to fix inconsistencies of various kinds. If "pg repair" finds an inconsistent placement group, it attempts to overwrite the digest of the inconsistent copy with the digest of the authoritative copy. If "pg repair" finds an inconsistent replicated pool, it marks the inconsistent copy as missing. Recovery, in the case of replicated pools, is beyond the scope of "pg repair". + +For erasure coded and bluestore pools, Ceph will automatically repair if osd_scrub_auto_repair (configuration default "false") is set to true and at most osd_scrub_auto_repair_num_errors (configuration default 5) errors are found. + +"pg repair" will not solve every problem. Ceph does not automatically repair placement groups when inconsistencies are found in them. + +The checksum of an object or an omap is not always available. Checksums are calculated incrementally. If a replicated object is updated non-sequentially, the write operation involved in the update changes the object and invalidates its checksum. The whole object is not read while recalculating the checksum. "ceph pg repair" is able to repair things even when checksums are not available to it, as in the case of filestore. When replicated filestore pools are in question, users might prefer manual repair to "ceph pg repair". + +The material in this paragraph is relevant for filestore, and bluestore has its own internal checksums. The matched-record checksum and the calculated checksum cannot prove that the authoritative copy is in fact authoritative. In the case that there is no checksum available, "pg repair" favors the data on the primary. this might or might not be the uncorrupted replica. This is why human intervention is necessary when an inconsistency is discovered. Human intervention sometimes means using the "ceph-objectstore-tool". + +External Links +============== +https://ceph.io/geen-categorie/ceph-manually-repair-object/ - This page contains a walkthrough of the repair of a placement group, and is recommended reading if you want to repair a placement +group but have never done so. diff --git a/doc/rados/operations/pg-states.rst b/doc/rados/operations/pg-states.rst new file mode 100644 index 000000000..495229d92 --- /dev/null +++ b/doc/rados/operations/pg-states.rst @@ -0,0 +1,118 @@ +======================== + Placement Group States +======================== + +When checking a cluster's status (e.g., running ``ceph -w`` or ``ceph -s``), +Ceph will report on the status of the placement groups. A placement group has +one or more states. The optimum state for placement groups in the placement group +map is ``active + clean``. + +*creating* + Ceph is still creating the placement group. + +*activating* + The placement group is peered but not yet active. + +*active* + Ceph will process requests to the placement group. + +*clean* + Ceph replicated all objects in the placement group the correct number of times. + +*down* + A replica with necessary data is down, so the placement group is offline. + +*laggy* + A replica is not acknowledging new leases from the primary in a timely fashion; IO is temporarily paused. + +*wait* + The set of OSDs for this PG has just changed and IO is temporarily paused until the previous interval's leases expire. + +*scrubbing* + Ceph is checking the placement group metadata for inconsistencies. + +*deep* + Ceph is checking the placement group data against stored checksums. + +*degraded* + Ceph has not replicated some objects in the placement group the correct number of times yet. + +*inconsistent* + Ceph detects inconsistencies in the one or more replicas of an object in the placement group + (e.g. objects are the wrong size, objects are missing from one replica *after* recovery finished, etc.). + +*peering* + The placement group is undergoing the peering process + +*repair* + Ceph is checking the placement group and repairing any inconsistencies it finds (if possible). + +*recovering* + Ceph is migrating/synchronizing objects and their replicas. + +*forced_recovery* + High recovery priority of that PG is enforced by user. + +*recovery_wait* + The placement group is waiting in line to start recover. + +*recovery_toofull* + A recovery operation is waiting because the destination OSD is over its + full ratio. + +*recovery_unfound* + Recovery stopped due to unfound objects. + +*backfilling* + Ceph is scanning and synchronizing the entire contents of a placement group + instead of inferring what contents need to be synchronized from the logs of + recent operations. Backfill is a special case of recovery. + +*forced_backfill* + High backfill priority of that PG is enforced by user. + +*backfill_wait* + The placement group is waiting in line to start backfill. + +*backfill_toofull* + A backfill operation is waiting because the destination OSD is over + the backfillfull ratio. + +*backfill_unfound* + Backfill stopped due to unfound objects. + +*incomplete* + Ceph detects that a placement group is missing information about + writes that may have occurred, or does not have any healthy + copies. If you see this state, try to start any failed OSDs that may + contain the needed information. In the case of an erasure coded pool + temporarily reducing min_size may allow recovery. + +*stale* + The placement group is in an unknown state - the monitors have not received + an update for it since the placement group mapping changed. + +*remapped* + The placement group is temporarily mapped to a different set of OSDs from what + CRUSH specified. + +*undersized* + The placement group has fewer copies than the configured pool replication level. + +*peered* + The placement group has peered, but cannot serve client IO due to not having + enough copies to reach the pool's configured min_size parameter. Recovery + may occur in this state, so the pg may heal up to min_size eventually. + +*snaptrim* + Trimming snaps. + +*snaptrim_wait* + Queued to trim snaps. + +*snaptrim_error* + Error stopped trimming snaps. + +*unknown* + The ceph-mgr hasn't yet received any information about the PG's state from an + OSD since mgr started up. diff --git a/doc/rados/operations/placement-groups.rst b/doc/rados/operations/placement-groups.rst new file mode 100644 index 000000000..d51f8d76e --- /dev/null +++ b/doc/rados/operations/placement-groups.rst @@ -0,0 +1,798 @@ +================== + Placement Groups +================== + +.. _pg-autoscaler: + +Autoscaling placement groups +============================ + +Placement groups (PGs) are an internal implementation detail of how +Ceph distributes data. You can allow the cluster to either make +recommendations or automatically tune PGs based on how the cluster is +used by enabling *pg-autoscaling*. + +Each pool in the system has a ``pg_autoscale_mode`` property that can be set to ``off``, ``on``, or ``warn``. + +* ``off``: Disable autoscaling for this pool. It is up to the administrator to choose an appropriate PG number for each pool. Please refer to :ref:`choosing-number-of-placement-groups` for more information. +* ``on``: Enable automated adjustments of the PG count for the given pool. +* ``warn``: Raise health alerts when the PG count should be adjusted + +To set the autoscaling mode for an existing pool: + +.. prompt:: bash # + + ceph osd pool set <pool-name> pg_autoscale_mode <mode> + +For example to enable autoscaling on pool ``foo``: + +.. prompt:: bash # + + ceph osd pool set foo pg_autoscale_mode on + +You can also configure the default ``pg_autoscale_mode`` that is +set on any pools that are subsequently created: + +.. prompt:: bash # + + ceph config set global osd_pool_default_pg_autoscale_mode <mode> + +You can disable or enable the autoscaler for all pools with +the ``noautoscale`` flag. By default this flag is set to be ``off``, +but you can turn it ``on`` by using the command: + +.. prompt:: bash $ + + ceph osd pool set noautoscale + +You can turn it ``off`` using the command: + +.. prompt:: bash # + + ceph osd pool unset noautoscale + +To ``get`` the value of the flag use the command: + +.. prompt:: bash # + + ceph osd pool get noautoscale + +Viewing PG scaling recommendations +---------------------------------- + +You can view each pool, its relative utilization, and any suggested changes to +the PG count with this command: + +.. prompt:: bash # + + ceph osd pool autoscale-status + +Output will be something like:: + + POOL SIZE TARGET SIZE RATE RAW CAPACITY RATIO TARGET RATIO EFFECTIVE RATIO BIAS PG_NUM NEW PG_NUM AUTOSCALE BULK + a 12900M 3.0 82431M 0.4695 8 128 warn True + c 0 3.0 82431M 0.0000 0.2000 0.9884 1.0 1 64 warn True + b 0 953.6M 3.0 82431M 0.0347 8 warn False + +**SIZE** is the amount of data stored in the pool. **TARGET SIZE**, if +present, is the amount of data the administrator has specified that +they expect to eventually be stored in this pool. The system uses +the larger of the two values for its calculation. + +**RATE** is the multiplier for the pool that determines how much raw +storage capacity is consumed. For example, a 3 replica pool will +have a ratio of 3.0, while a k=4,m=2 erasure coded pool will have a +ratio of 1.5. + +**RAW CAPACITY** is the total amount of raw storage capacity on the +OSDs that are responsible for storing this pool's (and perhaps other +pools') data. **RATIO** is the ratio of that total capacity that +this pool is consuming (i.e., ratio = size * rate / raw capacity). + +**TARGET RATIO**, if present, is the ratio of storage that the +administrator has specified that they expect this pool to consume +relative to other pools with target ratios set. +If both target size bytes and ratio are specified, the +ratio takes precedence. + +**EFFECTIVE RATIO** is the target ratio after adjusting in two ways: + +1. subtracting any capacity expected to be used by pools with target size set +2. normalizing the target ratios among pools with target ratio set so + they collectively target the rest of the space. For example, 4 + pools with target_ratio 1.0 would have an effective ratio of 0.25. + +The system uses the larger of the actual ratio and the effective ratio +for its calculation. + +**BIAS** is used as a multiplier to manually adjust a pool's PG based +on prior information about how much PGs a specific pool is expected +to have. + +**PG_NUM** is the current number of PGs for the pool (or the current +number of PGs that the pool is working towards, if a ``pg_num`` +change is in progress). **NEW PG_NUM**, if present, is what the +system believes the pool's ``pg_num`` should be changed to. It is +always a power of 2, and will only be present if the "ideal" value +varies from the current value by more than a factor of 3 by default. +This factor can be be adjusted with: + +.. prompt:: bash # + + ceph osd pool set threshold 2.0 + +**AUTOSCALE**, is the pool ``pg_autoscale_mode`` +and will be either ``on``, ``off``, or ``warn``. + +The final column, **BULK** determines if the pool is ``bulk`` +and will be either ``True`` or ``False``. A ``bulk`` pool +means that the pool is expected to be large and should start out +with large amount of PGs for performance purposes. On the other hand, +pools without the ``bulk`` flag are expected to be smaller e.g., +.mgr or meta pools. + + +Automated scaling +----------------- + +Allowing the cluster to automatically scale PGs based on usage is the +simplest approach. Ceph will look at the total available storage and +target number of PGs for the whole system, look at how much data is +stored in each pool, and try to apportion the PGs accordingly. The +system is relatively conservative with its approach, only making +changes to a pool when the current number of PGs (``pg_num``) is more +than 3 times off from what it thinks it should be. + +The target number of PGs per OSD is based on the +``mon_target_pg_per_osd`` configurable (default: 100), which can be +adjusted with: + +.. prompt:: bash # + + ceph config set global mon_target_pg_per_osd 100 + +The autoscaler analyzes pools and adjusts on a per-subtree basis. +Because each pool may map to a different CRUSH rule, and each rule may +distribute data across different devices, Ceph will consider +utilization of each subtree of the hierarchy independently. For +example, a pool that maps to OSDs of class `ssd` and a pool that maps +to OSDs of class `hdd` will each have optimal PG counts that depend on +the number of those respective device types. + +In the case where a pool uses OSDs under two or more CRUSH roots, e.g., (shadow +trees with both `ssd` and `hdd` devices), the autoscaler will +issue a warning to the user in the manager log stating the name of the pool +and the set of roots that overlap each other. The autoscaler will not +scale any pools with overlapping roots because this can cause problems +with the scaling process. We recommend making each pool belong to only +one root (one OSD class) to get rid of the warning and ensure a successful +scaling process. + +The autoscaler uses the `bulk` flag to determine which pool +should start out with a full complement of PGs and only +scales down when the usage ratio across the pool is not even. +However, if the pool doesn't have the `bulk` flag, the pool will +start out with minimal PGs and only when there is more usage in the pool. + +To create pool with `bulk` flag: + +.. prompt:: bash # + + ceph osd pool create <pool-name> --bulk + +To set/unset `bulk` flag of existing pool: + +.. prompt:: bash # + + ceph osd pool set <pool-name> bulk <true/false/1/0> + +To get `bulk` flag of existing pool: + +.. prompt:: bash # + + ceph osd pool get <pool-name> bulk + +.. _specifying_pool_target_size: + +Specifying expected pool size +----------------------------- + +When a cluster or pool is first created, it will consume a small +fraction of the total cluster capacity and will appear to the system +as if it should only need a small number of placement groups. +However, in most cases cluster administrators have a good idea which +pools are expected to consume most of the system capacity over time. +By providing this information to Ceph, a more appropriate number of +PGs can be used from the beginning, preventing subsequent changes in +``pg_num`` and the overhead associated with moving data around when +those adjustments are made. + +The *target size* of a pool can be specified in two ways: either in +terms of the absolute size of the pool (i.e., bytes), or as a weight +relative to other pools with a ``target_size_ratio`` set. + +For example: + +.. prompt:: bash # + + ceph osd pool set mypool target_size_bytes 100T + +will tell the system that `mypool` is expected to consume 100 TiB of +space. Alternatively: + +.. prompt:: bash # + + ceph osd pool set mypool target_size_ratio 1.0 + +will tell the system that `mypool` is expected to consume 1.0 relative +to the other pools with ``target_size_ratio`` set. If `mypool` is the +only pool in the cluster, this means an expected use of 100% of the +total capacity. If there is a second pool with ``target_size_ratio`` +1.0, both pools would expect to use 50% of the cluster capacity. + +You can also set the target size of a pool at creation time with the optional ``--target-size-bytes <bytes>`` or ``--target-size-ratio <ratio>`` arguments to the ``ceph osd pool create`` command. + +Note that if impossible target size values are specified (for example, +a capacity larger than the total cluster) then a health warning +(``POOL_TARGET_SIZE_BYTES_OVERCOMMITTED``) will be raised. + +If both ``target_size_ratio`` and ``target_size_bytes`` are specified +for a pool, only the ratio will be considered, and a health warning +(``POOL_HAS_TARGET_SIZE_BYTES_AND_RATIO``) will be issued. + +Specifying bounds on a pool's PGs +--------------------------------- + +It is also possible to specify a minimum number of PGs for a pool. +This is useful for establishing a lower bound on the amount of +parallelism client will see when doing IO, even when a pool is mostly +empty. Setting the lower bound prevents Ceph from reducing (or +recommending you reduce) the PG number below the configured number. + +You can set the minimum or maximum number of PGs for a pool with: + +.. prompt:: bash # + + ceph osd pool set <pool-name> pg_num_min <num> + ceph osd pool set <pool-name> pg_num_max <num> + +You can also specify the minimum or maximum PG count at pool creation +time with the optional ``--pg-num-min <num>`` or ``--pg-num-max +<num>`` arguments to the ``ceph osd pool create`` command. + +.. _preselection: + +A preselection of pg_num +======================== + +When creating a new pool with: + +.. prompt:: bash # + + ceph osd pool create {pool-name} [pg_num] + +it is optional to choose the value of ``pg_num``. If you do not +specify ``pg_num``, the cluster can (by default) automatically tune it +for you based on how much data is stored in the pool (see above, :ref:`pg-autoscaler`). + +Alternatively, ``pg_num`` can be explicitly provided. However, +whether you specify a ``pg_num`` value or not does not affect whether +the value is automatically tuned by the cluster after the fact. To +enable or disable auto-tuning: + +.. prompt:: bash # + + ceph osd pool set {pool-name} pg_autoscale_mode (on|off|warn) + +The "rule of thumb" for PGs per OSD has traditionally be 100. With +the additional of the balancer (which is also enabled by default), a +value of more like 50 PGs per OSD is probably reasonable. The +challenge (which the autoscaler normally does for you), is to: + +- have the PGs per pool proportional to the data in the pool, and +- end up with 50-100 PGs per OSDs, after the replication or + erasuring-coding fan-out of each PG across OSDs is taken into + consideration + +How are Placement Groups used ? +=============================== + +A placement group (PG) aggregates objects within a pool because +tracking object placement and object metadata on a per-object basis is +computationally expensive--i.e., a system with millions of objects +cannot realistically track placement on a per-object basis. + +.. ditaa:: + /-----\ /-----\ /-----\ /-----\ /-----\ + | obj | | obj | | obj | | obj | | obj | + \-----/ \-----/ \-----/ \-----/ \-----/ + | | | | | + +--------+--------+ +---+----+ + | | + v v + +-----------------------+ +-----------------------+ + | Placement Group #1 | | Placement Group #2 | + | | | | + +-----------------------+ +-----------------------+ + | | + +------------------------------+ + | + v + +-----------------------+ + | Pool | + | | + +-----------------------+ + +The Ceph client will calculate which placement group an object should +be in. It does this by hashing the object ID and applying an operation +based on the number of PGs in the defined pool and the ID of the pool. +See `Mapping PGs to OSDs`_ for details. + +The object's contents within a placement group are stored in a set of +OSDs. For instance, in a replicated pool of size two, each placement +group will store objects on two OSDs, as shown below. + +.. ditaa:: + +-----------------------+ +-----------------------+ + | Placement Group #1 | | Placement Group #2 | + | | | | + +-----------------------+ +-----------------------+ + | | | | + v v v v + /----------\ /----------\ /----------\ /----------\ + | | | | | | | | + | OSD #1 | | OSD #2 | | OSD #2 | | OSD #3 | + | | | | | | | | + \----------/ \----------/ \----------/ \----------/ + + +Should OSD #2 fail, another will be assigned to Placement Group #1 and +will be filled with copies of all objects in OSD #1. If the pool size +is changed from two to three, an additional OSD will be assigned to +the placement group and will receive copies of all objects in the +placement group. + +Placement groups do not own the OSD; they share it with other +placement groups from the same pool or even other pools. If OSD #2 +fails, the Placement Group #2 will also have to restore copies of +objects, using OSD #3. + +When the number of placement groups increases, the new placement +groups will be assigned OSDs. The result of the CRUSH function will +also change and some objects from the former placement groups will be +copied over to the new Placement Groups and removed from the old ones. + +Placement Groups Tradeoffs +========================== + +Data durability and even distribution among all OSDs call for more +placement groups but their number should be reduced to the minimum to +save CPU and memory. + +.. _data durability: + +Data durability +--------------- + +After an OSD fails, the risk of data loss increases until the data it +contained is fully recovered. Let's imagine a scenario that causes +permanent data loss in a single placement group: + +- The OSD fails and all copies of the object it contains are lost. + For all objects within the placement group the number of replica + suddenly drops from three to two. + +- Ceph starts recovery for this placement group by choosing a new OSD + to re-create the third copy of all objects. + +- Another OSD, within the same placement group, fails before the new + OSD is fully populated with the third copy. Some objects will then + only have one surviving copies. + +- Ceph picks yet another OSD and keeps copying objects to restore the + desired number of copies. + +- A third OSD, within the same placement group, fails before recovery + is complete. If this OSD contained the only remaining copy of an + object, it is permanently lost. + +In a cluster containing 10 OSDs with 512 placement groups in a three +replica pool, CRUSH will give each placement groups three OSDs. In the +end, each OSDs will end up hosting (512 * 3) / 10 = ~150 Placement +Groups. When the first OSD fails, the above scenario will therefore +start recovery for all 150 placement groups at the same time. + +The 150 placement groups being recovered are likely to be +homogeneously spread over the 9 remaining OSDs. Each remaining OSD is +therefore likely to send copies of objects to all others and also +receive some new objects to be stored because they became part of a +new placement group. + +The amount of time it takes for this recovery to complete entirely +depends on the architecture of the Ceph cluster. Let say each OSD is +hosted by a 1TB SSD on a single machine and all of them are connected +to a 10Gb/s switch and the recovery for a single OSD completes within +M minutes. If there are two OSDs per machine using spinners with no +SSD journal and a 1Gb/s switch, it will at least be an order of +magnitude slower. + +In a cluster of this size, the number of placement groups has almost +no influence on data durability. It could be 128 or 8192 and the +recovery would not be slower or faster. + +However, growing the same Ceph cluster to 20 OSDs instead of 10 OSDs +is likely to speed up recovery and therefore improve data durability +significantly. Each OSD now participates in only ~75 placement groups +instead of ~150 when there were only 10 OSDs and it will still require +all 19 remaining OSDs to perform the same amount of object copies in +order to recover. But where 10 OSDs had to copy approximately 100GB +each, they now have to copy 50GB each instead. If the network was the +bottleneck, recovery will happen twice as fast. In other words, +recovery goes faster when the number of OSDs increases. + +If this cluster grows to 40 OSDs, each of them will only host ~35 +placement groups. If an OSD dies, recovery will keep going faster +unless it is blocked by another bottleneck. However, if this cluster +grows to 200 OSDs, each of them will only host ~7 placement groups. If +an OSD dies, recovery will happen between at most of ~21 (7 * 3) OSDs +in these placement groups: recovery will take longer than when there +were 40 OSDs, meaning the number of placement groups should be +increased. + +No matter how short the recovery time is, there is a chance for a +second OSD to fail while it is in progress. In the 10 OSDs cluster +described above, if any of them fail, then ~17 placement groups +(i.e. ~150 / 9 placement groups being recovered) will only have one +surviving copy. And if any of the 8 remaining OSD fail, the last +objects of two placement groups are likely to be lost (i.e. ~17 / 8 +placement groups with only one remaining copy being recovered). + +When the size of the cluster grows to 20 OSDs, the number of Placement +Groups damaged by the loss of three OSDs drops. The second OSD lost +will degrade ~4 (i.e. ~75 / 19 placement groups being recovered) +instead of ~17 and the third OSD lost will only lose data if it is one +of the four OSDs containing the surviving copy. In other words, if the +probability of losing one OSD is 0.0001% during the recovery time +frame, it goes from 17 * 10 * 0.0001% in the cluster with 10 OSDs to 4 * 20 * +0.0001% in the cluster with 20 OSDs. + +In a nutshell, more OSDs mean faster recovery and a lower risk of +cascading failures leading to the permanent loss of a Placement +Group. Having 512 or 4096 Placement Groups is roughly equivalent in a +cluster with less than 50 OSDs as far as data durability is concerned. + +Note: It may take a long time for a new OSD added to the cluster to be +populated with placement groups that were assigned to it. However +there is no degradation of any object and it has no impact on the +durability of the data contained in the Cluster. + +.. _object distribution: + +Object distribution within a pool +--------------------------------- + +Ideally objects are evenly distributed in each placement group. Since +CRUSH computes the placement group for each object, but does not +actually know how much data is stored in each OSD within this +placement group, the ratio between the number of placement groups and +the number of OSDs may influence the distribution of the data +significantly. + +For instance, if there was a single placement group for ten OSDs in a +three replica pool, only three OSD would be used because CRUSH would +have no other choice. When more placement groups are available, +objects are more likely to be evenly spread among them. CRUSH also +makes every effort to evenly spread OSDs among all existing Placement +Groups. + +As long as there are one or two orders of magnitude more Placement +Groups than OSDs, the distribution should be even. For instance, 256 +placement groups for 3 OSDs, 512 or 1024 placement groups for 10 OSDs +etc. + +Uneven data distribution can be caused by factors other than the ratio +between OSDs and placement groups. Since CRUSH does not take into +account the size of the objects, a few very large objects may create +an imbalance. Let say one million 4K objects totaling 4GB are evenly +spread among 1024 placement groups on 10 OSDs. They will use 4GB / 10 += 400MB on each OSD. If one 400MB object is added to the pool, the +three OSDs supporting the placement group in which the object has been +placed will be filled with 400MB + 400MB = 800MB while the seven +others will remain occupied with only 400MB. + +.. _resource usage: + +Memory, CPU and network usage +----------------------------- + +For each placement group, OSDs and MONs need memory, network and CPU +at all times and even more during recovery. Sharing this overhead by +clustering objects within a placement group is one of the main reasons +they exist. + +Minimizing the number of placement groups saves significant amounts of +resources. + +.. _choosing-number-of-placement-groups: + +Choosing the number of Placement Groups +======================================= + +.. note: It is rarely necessary to do this math by hand. Instead, use the ``ceph osd pool autoscale-status`` command in combination with the ``target_size_bytes`` or ``target_size_ratio`` pool properties. See :ref:`pg-autoscaler` for more information. + +If you have more than 50 OSDs, we recommend approximately 50-100 +placement groups per OSD to balance out resource usage, data +durability and distribution. If you have less than 50 OSDs, choosing +among the `preselection`_ above is best. For a single pool of objects, +you can use the following formula to get a baseline + + Total PGs = :math:`\frac{OSDs \times 100}{pool \: size}` + +Where **pool size** is either the number of replicas for replicated +pools or the K+M sum for erasure coded pools (as returned by **ceph +osd erasure-code-profile get**). + +You should then check if the result makes sense with the way you +designed your Ceph cluster to maximize `data durability`_, +`object distribution`_ and minimize `resource usage`_. + +The result should always be **rounded up to the nearest power of two**. + +Only a power of two will evenly balance the number of objects among +placement groups. Other values will result in an uneven distribution of +data across your OSDs. Their use should be limited to incrementally +stepping from one power of two to another. + +As an example, for a cluster with 200 OSDs and a pool size of 3 +replicas, you would estimate your number of PGs as follows + + :math:`\frac{200 \times 100}{3} = 6667`. Nearest power of 2: 8192 + +When using multiple data pools for storing objects, you need to ensure +that you balance the number of placement groups per pool with the +number of placement groups per OSD so that you arrive at a reasonable +total number of placement groups that provides reasonably low variance +per OSD without taxing system resources or making the peering process +too slow. + +For instance a cluster of 10 pools each with 512 placement groups on +ten OSDs is a total of 5,120 placement groups spread over ten OSDs, +that is 512 placement groups per OSD. That does not use too many +resources. However, if 1,000 pools were created with 512 placement +groups each, the OSDs will handle ~50,000 placement groups each and it +would require significantly more resources and time for peering. + +You may find the `PGCalc`_ tool helpful. + + +.. _setting the number of placement groups: + +Set the Number of Placement Groups +================================== + +To set the number of placement groups in a pool, you must specify the +number of placement groups at the time you create the pool. +See `Create a Pool`_ for details. Even after a pool is created you can also change the number of placement groups with: + +.. prompt:: bash # + + ceph osd pool set {pool-name} pg_num {pg_num} + +After you increase the number of placement groups, you must also +increase the number of placement groups for placement (``pgp_num``) +before your cluster will rebalance. The ``pgp_num`` will be the number of +placement groups that will be considered for placement by the CRUSH +algorithm. Increasing ``pg_num`` splits the placement groups but data +will not be migrated to the newer placement groups until placement +groups for placement, ie. ``pgp_num`` is increased. The ``pgp_num`` +should be equal to the ``pg_num``. To increase the number of +placement groups for placement, execute the following: + +.. prompt:: bash # + + ceph osd pool set {pool-name} pgp_num {pgp_num} + +When decreasing the number of PGs, ``pgp_num`` is adjusted +automatically for you. + +Get the Number of Placement Groups +================================== + +To get the number of placement groups in a pool, execute the following: + +.. prompt:: bash # + + ceph osd pool get {pool-name} pg_num + + +Get a Cluster's PG Statistics +============================= + +To get the statistics for the placement groups in your cluster, execute the following: + +.. prompt:: bash # + + ceph pg dump [--format {format}] + +Valid formats are ``plain`` (default) and ``json``. + + +Get Statistics for Stuck PGs +============================ + +To get the statistics for all placement groups stuck in a specified state, +execute the following: + +.. prompt:: bash # + + ceph pg dump_stuck inactive|unclean|stale|undersized|degraded [--format <format>] [-t|--threshold <seconds>] + +**Inactive** Placement groups cannot process reads or writes because they are waiting for an OSD +with the most up-to-date data to come up and in. + +**Unclean** Placement groups contain objects that are not replicated the desired number +of times. They should be recovering. + +**Stale** Placement groups are in an unknown state - the OSDs that host them have not +reported to the monitor cluster in a while (configured by ``mon_osd_report_timeout``). + +Valid formats are ``plain`` (default) and ``json``. The threshold defines the minimum number +of seconds the placement group is stuck before including it in the returned statistics +(default 300 seconds). + + +Get a PG Map +============ + +To get the placement group map for a particular placement group, execute the following: + +.. prompt:: bash # + + ceph pg map {pg-id} + +For example: + +.. prompt:: bash # + + ceph pg map 1.6c + +Ceph will return the placement group map, the placement group, and the OSD status: + +.. prompt:: bash # + + osdmap e13 pg 1.6c (1.6c) -> up [1,0] acting [1,0] + + +Get a PGs Statistics +==================== + +To retrieve statistics for a particular placement group, execute the following: + +.. prompt:: bash # + + ceph pg {pg-id} query + + +Scrub a Placement Group +======================= + +To scrub a placement group, execute the following: + +.. prompt:: bash # + + ceph pg scrub {pg-id} + +Ceph checks the primary and any replica nodes, generates a catalog of all objects +in the placement group and compares them to ensure that no objects are missing +or mismatched, and their contents are consistent. Assuming the replicas all +match, a final semantic sweep ensures that all of the snapshot-related object +metadata is consistent. Errors are reported via logs. + +To scrub all placement groups from a specific pool, execute the following: + +.. prompt:: bash # + + ceph osd pool scrub {pool-name} + +Prioritize backfill/recovery of a Placement Group(s) +==================================================== + +You may run into a situation where a bunch of placement groups will require +recovery and/or backfill, and some particular groups hold data more important +than others (for example, those PGs may hold data for images used by running +machines and other PGs may be used by inactive machines/less relevant data). +In that case, you may want to prioritize recovery of those groups so +performance and/or availability of data stored on those groups is restored +earlier. To do this (mark particular placement group(s) as prioritized during +backfill or recovery), execute the following: + +.. prompt:: bash # + + ceph pg force-recovery {pg-id} [{pg-id #2}] [{pg-id #3} ...] + ceph pg force-backfill {pg-id} [{pg-id #2}] [{pg-id #3} ...] + +This will cause Ceph to perform recovery or backfill on specified placement +groups first, before other placement groups. This does not interrupt currently +ongoing backfills or recovery, but causes specified PGs to be processed +as soon as possible. If you change your mind or prioritize wrong groups, +use: + +.. prompt:: bash # + + ceph pg cancel-force-recovery {pg-id} [{pg-id #2}] [{pg-id #3} ...] + ceph pg cancel-force-backfill {pg-id} [{pg-id #2}] [{pg-id #3} ...] + +This will remove "force" flag from those PGs and they will be processed +in default order. Again, this doesn't affect currently processed placement +group, only those that are still queued. + +The "force" flag is cleared automatically after recovery or backfill of group +is done. + +Similarly, you may use the following commands to force Ceph to perform recovery +or backfill on all placement groups from a specified pool first: + +.. prompt:: bash # + + ceph osd pool force-recovery {pool-name} + ceph osd pool force-backfill {pool-name} + +or: + +.. prompt:: bash # + + ceph osd pool cancel-force-recovery {pool-name} + ceph osd pool cancel-force-backfill {pool-name} + +to restore to the default recovery or backfill priority if you change your mind. + +Note that these commands could possibly break the ordering of Ceph's internal +priority computations, so use them with caution! +Especially, if you have multiple pools that are currently sharing the same +underlying OSDs, and some particular pools hold data more important than others, +we recommend you use the following command to re-arrange all pools's +recovery/backfill priority in a better order: + +.. prompt:: bash # + + ceph osd pool set {pool-name} recovery_priority {value} + +For example, if you have 10 pools you could make the most important one priority 10, +next 9, etc. Or you could leave most pools alone and have say 3 important pools +all priority 1 or priorities 3, 2, 1 respectively. + +Revert Lost +=========== + +If the cluster has lost one or more objects, and you have decided to +abandon the search for the lost data, you must mark the unfound objects +as ``lost``. + +If all possible locations have been queried and objects are still +lost, you may have to give up on the lost objects. This is +possible given unusual combinations of failures that allow the cluster +to learn about writes that were performed before the writes themselves +are recovered. + +Currently the only supported option is "revert", which will either roll back to +a previous version of the object or (if it was a new object) forget about it +entirely. To mark the "unfound" objects as "lost", execute the following: + +.. prompt:: bash # + + ceph pg {pg-id} mark_unfound_lost revert|delete + +.. important:: Use this feature with caution, because it may confuse + applications that expect the object(s) to exist. + + +.. toctree:: + :hidden: + + pg-states + pg-concepts + + +.. _Create a Pool: ../pools#createpool +.. _Mapping PGs to OSDs: ../../../architecture#mapping-pgs-to-osds +.. _pgcalc: https://old.ceph.com/pgcalc/ diff --git a/doc/rados/operations/pools.rst b/doc/rados/operations/pools.rst new file mode 100644 index 000000000..b44c48460 --- /dev/null +++ b/doc/rados/operations/pools.rst @@ -0,0 +1,900 @@ +.. _rados_pools: + +======= + Pools +======= +Pools are logical partitions that are used to store objects. + +Pools provide: + +- **Resilience**: It is possible to set the number of OSDs that are allowed to + fail without any data being lost. If your cluster uses replicated pools, the + number of OSDs that can fail without data loss is the number of replicas. + For example: a typical configuration stores an object and two additional + copies (that is: ``size = 3``), but you can configure the number of replicas + on a per-pool basis. For `erasure coded pools <../erasure-code>`_, resilience + is defined as the number of coding chunks (for example, ``m = 2`` in the + **erasure code profile**). + +- **Placement Groups**: You can set the number of placement groups for the + pool. A typical configuration targets approximately 100 placement groups per + OSD, providing optimal balancing without consuming many computing resources. + When setting up multiple pools, be careful to set a reasonable number of + placement groups for each pool and for the cluster as a whole. Note that each + PG belongs to a specific pool: when multiple pools use the same OSDs, make + sure that the **sum** of PG replicas per OSD is in the desired PG per OSD + target range. Use the `pgcalc`_ tool to calculate the number of placement + groups to set for your pool. + +- **CRUSH Rules**: When data is stored in a pool, the placement of the object + and its replicas (or chunks, in the case of erasure-coded pools) in your + cluster is governed by CRUSH rules. Custom CRUSH rules can be created for a + pool if the default rule does not fit your use case. + +- **Snapshots**: The command ``ceph osd pool mksnap`` creates a snapshot of a + pool. + +Pool Names +========== + +Pool names beginning with ``.`` are reserved for use by Ceph's internal +operations. Please do not create or manipulate pools with these names. + +List Pools +========== + +To list your cluster's pools, execute: + +.. prompt:: bash $ + + ceph osd lspools + +.. _createpool: + +Create a Pool +============= + +Before creating pools, refer to the `Pool, PG and CRUSH Config Reference`_. +Ideally, you should override the default value for the number of placement +groups in your Ceph configuration file, as the default is NOT ideal. +For details on placement group numbers refer to `setting the number of placement groups`_ + +.. note:: Starting with Luminous, all pools need to be associated to the + application using the pool. See `Associate Pool to Application`_ below for + more information. + +For example: + +.. prompt:: bash $ + + osd pool default pg num = 100 + osd pool default pgp num = 100 + +To create a pool, execute: + +.. prompt:: bash $ + + ceph osd pool create {pool-name} [{pg-num} [{pgp-num}]] [replicated] \ + [crush-rule-name] [expected-num-objects] + ceph osd pool create {pool-name} [{pg-num} [{pgp-num}]] erasure \ + [erasure-code-profile] [crush-rule-name] [expected_num_objects] [--autoscale-mode=<on,off,warn>] + +Where: + +``{pool-name}`` + +:Description: The name of the pool. It must be unique. +:Type: String +:Required: Yes. + +``{pg-num}`` + +:Description: The total number of placement groups for the pool. See `Placement + Groups`_ for details on calculating a suitable number. The + default value ``8`` is NOT suitable for most systems. + +:Type: Integer +:Required: Yes. +:Default: 8 + +``{pgp-num}`` + +:Description: The total number of placement groups for placement purposes. This + **should be equal to the total number of placement groups**, except + for placement group splitting scenarios. + +:Type: Integer +:Required: Yes. Picks up default or Ceph configuration value if not specified. +:Default: 8 + +``{replicated|erasure}`` + +:Description: The pool type which may either be **replicated** to + recover from lost OSDs by keeping multiple copies of the + objects or **erasure** to get a kind of + `generalized RAID5 <../erasure-code>`_ capability. + The **replicated** pools require more + raw storage but implement all Ceph operations. The + **erasure** pools require less raw storage but only + implement a subset of the available operations. + +:Type: String +:Required: No. +:Default: replicated + +``[crush-rule-name]`` + +:Description: The name of a CRUSH rule to use for this pool. The specified + rule must exist. + +:Type: String +:Required: No. +:Default: For **replicated** pools it is the rule specified by the ``osd + pool default crush rule`` config variable. This rule must exist. + For **erasure** pools it is ``erasure-code`` if the ``default`` + `erasure code profile`_ is used or ``{pool-name}`` otherwise. This + rule will be created implicitly if it doesn't exist already. + + +``[erasure-code-profile=profile]`` + +.. _erasure code profile: ../erasure-code-profile + +:Description: For **erasure** pools only. Use the `erasure code profile`_. It + must be an existing profile as defined by + **osd erasure-code-profile set**. + +:Type: String +:Required: No. + +``--autoscale-mode=<on,off,warn>`` + +:Description: Autoscale mode + +:Type: String +:Required: No. +:Default: The default behavior is controlled by the ``osd pool default pg autoscale mode`` option. + +If you set the autoscale mode to ``on`` or ``warn``, you can let the system autotune or recommend changes to the number of placement groups in your pool based on actual usage. If you leave it off, then you should refer to `Placement Groups`_ for more information. + +.. _Placement Groups: ../placement-groups + +``[expected-num-objects]`` + +:Description: The expected number of objects for this pool. By setting this value ( + together with a negative **filestore merge threshold**), the PG folder + splitting would happen at the pool creation time, to avoid the latency + impact to do a runtime folder splitting. + +:Type: Integer +:Required: No. +:Default: 0, no splitting at the pool creation time. + +.. _associate-pool-to-application: + +Associate Pool to Application +============================= + +Pools need to be associated with an application before use. Pools that will be +used with CephFS or pools that are automatically created by RGW are +automatically associated. Pools that are intended for use with RBD should be +initialized using the ``rbd`` tool (see `Block Device Commands`_ for more +information). + +For other cases, you can manually associate a free-form application name to +a pool.: + +.. prompt:: bash $ + + ceph osd pool application enable {pool-name} {application-name} + +.. note:: CephFS uses the application name ``cephfs``, RBD uses the + application name ``rbd``, and RGW uses the application name ``rgw``. + +Set Pool Quotas +=============== + +You can set pool quotas for the maximum number of bytes and/or the maximum +number of objects per pool: + +.. prompt:: bash $ + + ceph osd pool set-quota {pool-name} [max_objects {obj-count}] [max_bytes {bytes}] + +For example: + +.. prompt:: bash $ + + ceph osd pool set-quota data max_objects 10000 + +To remove a quota, set its value to ``0``. + + +Delete a Pool +============= + +To delete a pool, execute: + +.. prompt:: bash $ + + ceph osd pool delete {pool-name} [{pool-name} --yes-i-really-really-mean-it] + + +To remove a pool the mon_allow_pool_delete flag must be set to true in the Monitor's +configuration. Otherwise they will refuse to remove a pool. + +See `Monitor Configuration`_ for more information. + +.. _Monitor Configuration: ../../configuration/mon-config-ref + +If you created your own rules for a pool you created, you should consider +removing them when you no longer need your pool: + +.. prompt:: bash $ + + ceph osd pool get {pool-name} crush_rule + +If the rule was "123", for example, you can check the other pools like so: + +.. prompt:: bash $ + + ceph osd dump | grep "^pool" | grep "crush_rule 123" + +If no other pools use that custom rule, then it's safe to delete that +rule from the cluster. + +If you created users with permissions strictly for a pool that no longer +exists, you should consider deleting those users too: + + +.. prompt:: bash $ + + ceph auth ls | grep -C 5 {pool-name} + ceph auth del {user} + + +Rename a Pool +============= + +To rename a pool, execute: + +.. prompt:: bash $ + + ceph osd pool rename {current-pool-name} {new-pool-name} + +If you rename a pool and you have per-pool capabilities for an authenticated +user, you must update the user's capabilities (i.e., caps) with the new pool +name. + +Show Pool Statistics +==================== + +To show a pool's utilization statistics, execute: + +.. prompt:: bash $ + + rados df + +Additionally, to obtain I/O information for a specific pool or all, execute: + +.. prompt:: bash $ + + ceph osd pool stats [{pool-name}] + + +Make a Snapshot of a Pool +========================= + +To make a snapshot of a pool, execute: + +.. prompt:: bash $ + + ceph osd pool mksnap {pool-name} {snap-name} + +Remove a Snapshot of a Pool +=========================== + +To remove a snapshot of a pool, execute: + +.. prompt:: bash $ + + ceph osd pool rmsnap {pool-name} {snap-name} + +.. _setpoolvalues: + + +Set Pool Values +=============== + +To set a value to a pool, execute the following: + +.. prompt:: bash $ + + ceph osd pool set {pool-name} {key} {value} + +You may set values for the following keys: + +.. _compression_algorithm: + +``compression_algorithm`` + +:Description: Sets inline compression algorithm to use for underlying BlueStore. This setting overrides the `global setting <https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/#inline-compression>`__ of ``bluestore compression algorithm``. + +:Type: String +:Valid Settings: ``lz4``, ``snappy``, ``zlib``, ``zstd`` + +``compression_mode`` + +:Description: Sets the policy for the inline compression algorithm for underlying BlueStore. This setting overrides the `global setting <http://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/#inline-compression>`__ of ``bluestore compression mode``. + +:Type: String +:Valid Settings: ``none``, ``passive``, ``aggressive``, ``force`` + +``compression_min_blob_size`` + +:Description: Chunks smaller than this are never compressed. This setting overrides the `global setting <http://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/#inline-compression>`__ of ``bluestore compression min blob *``. + +:Type: Unsigned Integer + +``compression_max_blob_size`` + +:Description: Chunks larger than this are broken into smaller blobs sizing + ``compression_max_blob_size`` before being compressed. + +:Type: Unsigned Integer + +.. _size: + +``size`` + +:Description: Sets the number of replicas for objects in the pool. + See `Set the Number of Object Replicas`_ for further details. + Replicated pools only. + +:Type: Integer + +.. _min_size: + +``min_size`` + +:Description: Sets the minimum number of replicas required for I/O. + See `Set the Number of Object Replicas`_ for further details. + In the case of Erasure Coded pools this should be set to a value + greater than 'k' since if we allow IO at the value 'k' there is no + redundancy and data will be lost in the event of a permanent OSD + failure. For more information see `Erasure Code + <../erasure-code>`_ + +:Type: Integer +:Version: ``0.54`` and above + +.. _pg_num: + +``pg_num`` + +:Description: The effective number of placement groups to use when calculating + data placement. +:Type: Integer +:Valid Range: Superior to ``pg_num`` current value. + +.. _pgp_num: + +``pgp_num`` + +:Description: The effective number of placement groups for placement to use + when calculating data placement. + +:Type: Integer +:Valid Range: Equal to or less than ``pg_num``. + +.. _crush_rule: + +``crush_rule`` + +:Description: The rule to use for mapping object placement in the cluster. +:Type: String + +.. _allow_ec_overwrites: + +``allow_ec_overwrites`` + +:Description: Whether writes to an erasure coded pool can update part + of an object, so cephfs and rbd can use it. See + `Erasure Coding with Overwrites`_ for more details. +:Type: Boolean +:Version: ``12.2.0`` and above + +.. _hashpspool: + +``hashpspool`` + +:Description: Set/Unset HASHPSPOOL flag on a given pool. +:Type: Integer +:Valid Range: 1 sets flag, 0 unsets flag + +.. _nodelete: + +``nodelete`` + +:Description: Set/Unset NODELETE flag on a given pool. +:Type: Integer +:Valid Range: 1 sets flag, 0 unsets flag +:Version: Version ``FIXME`` + +.. _nopgchange: + +``nopgchange`` + +:Description: Set/Unset NOPGCHANGE flag on a given pool. +:Type: Integer +:Valid Range: 1 sets flag, 0 unsets flag +:Version: Version ``FIXME`` + +.. _nosizechange: + +``nosizechange`` + +:Description: Set/Unset NOSIZECHANGE flag on a given pool. +:Type: Integer +:Valid Range: 1 sets flag, 0 unsets flag +:Version: Version ``FIXME`` + +.. _bulk: + +.. describe:: bulk + + Set/Unset bulk flag on a given pool. + + :Type: Boolean + :Valid Range: true/1 sets flag, false/0 unsets flag + +.. _write_fadvise_dontneed: + +``write_fadvise_dontneed`` + +:Description: Set/Unset WRITE_FADVISE_DONTNEED flag on a given pool. +:Type: Integer +:Valid Range: 1 sets flag, 0 unsets flag + +.. _noscrub: + +``noscrub`` + +:Description: Set/Unset NOSCRUB flag on a given pool. +:Type: Integer +:Valid Range: 1 sets flag, 0 unsets flag + +.. _nodeep-scrub: + +``nodeep-scrub`` + +:Description: Set/Unset NODEEP_SCRUB flag on a given pool. +:Type: Integer +:Valid Range: 1 sets flag, 0 unsets flag + +.. _hit_set_type: + +``hit_set_type`` + +:Description: Enables hit set tracking for cache pools. + See `Bloom Filter`_ for additional information. + +:Type: String +:Valid Settings: ``bloom``, ``explicit_hash``, ``explicit_object`` +:Default: ``bloom``. Other values are for testing. + +.. _hit_set_count: + +``hit_set_count`` + +:Description: The number of hit sets to store for cache pools. The higher + the number, the more RAM consumed by the ``ceph-osd`` daemon. + +:Type: Integer +:Valid Range: ``1``. Agent doesn't handle > 1 yet. + +.. _hit_set_period: + +``hit_set_period`` + +:Description: The duration of a hit set period in seconds for cache pools. + The higher the number, the more RAM consumed by the + ``ceph-osd`` daemon. + +:Type: Integer +:Example: ``3600`` 1hr + +.. _hit_set_fpp: + +``hit_set_fpp`` + +:Description: The false positive probability for the ``bloom`` hit set type. + See `Bloom Filter`_ for additional information. + +:Type: Double +:Valid Range: 0.0 - 1.0 +:Default: ``0.05`` + +.. _cache_target_dirty_ratio: + +``cache_target_dirty_ratio`` + +:Description: The percentage of the cache pool containing modified (dirty) + objects before the cache tiering agent will flush them to the + backing storage pool. + +:Type: Double +:Default: ``.4`` + +.. _cache_target_dirty_high_ratio: + +``cache_target_dirty_high_ratio`` + +:Description: The percentage of the cache pool containing modified (dirty) + objects before the cache tiering agent will flush them to the + backing storage pool with a higher speed. + +:Type: Double +:Default: ``.6`` + +.. _cache_target_full_ratio: + +``cache_target_full_ratio`` + +:Description: The percentage of the cache pool containing unmodified (clean) + objects before the cache tiering agent will evict them from the + cache pool. + +:Type: Double +:Default: ``.8`` + +.. _target_max_bytes: + +``target_max_bytes`` + +:Description: Ceph will begin flushing or evicting objects when the + ``max_bytes`` threshold is triggered. + +:Type: Integer +:Example: ``1000000000000`` #1-TB + +.. _target_max_objects: + +``target_max_objects`` + +:Description: Ceph will begin flushing or evicting objects when the + ``max_objects`` threshold is triggered. + +:Type: Integer +:Example: ``1000000`` #1M objects + + +``hit_set_grade_decay_rate`` + +:Description: Temperature decay rate between two successive hit_sets +:Type: Integer +:Valid Range: 0 - 100 +:Default: ``20`` + + +``hit_set_search_last_n`` + +:Description: Count at most N appearance in hit_sets for temperature calculation +:Type: Integer +:Valid Range: 0 - hit_set_count +:Default: ``1`` + + +.. _cache_min_flush_age: + +``cache_min_flush_age`` + +:Description: The time (in seconds) before the cache tiering agent will flush + an object from the cache pool to the storage pool. + +:Type: Integer +:Example: ``600`` 10min + +.. _cache_min_evict_age: + +``cache_min_evict_age`` + +:Description: The time (in seconds) before the cache tiering agent will evict + an object from the cache pool. + +:Type: Integer +:Example: ``1800`` 30min + +.. _fast_read: + +``fast_read`` + +:Description: On Erasure Coding pool, if this flag is turned on, the read request + would issue sub reads to all shards, and waits until it receives enough + shards to decode to serve the client. In the case of jerasure and isa + erasure plugins, once the first K replies return, client's request is + served immediately using the data decoded from these replies. This + helps to tradeoff some resources for better performance. Currently this + flag is only supported for Erasure Coding pool. + +:Type: Boolean +:Defaults: ``0`` + +.. _scrub_min_interval: + +``scrub_min_interval`` + +:Description: The minimum interval in seconds for pool scrubbing when + load is low. If it is 0, the value osd_scrub_min_interval + from config is used. + +:Type: Double +:Default: ``0`` + +.. _scrub_max_interval: + +``scrub_max_interval`` + +:Description: The maximum interval in seconds for pool scrubbing + irrespective of cluster load. If it is 0, the value + osd_scrub_max_interval from config is used. + +:Type: Double +:Default: ``0`` + +.. _deep_scrub_interval: + +``deep_scrub_interval`` + +:Description: The interval in seconds for pool “deep” scrubbing. If it + is 0, the value osd_deep_scrub_interval from config is used. + +:Type: Double +:Default: ``0`` + + +.. _recovery_priority: + +``recovery_priority`` + +:Description: When a value is set it will increase or decrease the computed + reservation priority. This value must be in the range -10 to + 10. Use a negative priority for less important pools so they + have lower priority than any new pools. + +:Type: Integer +:Default: ``0`` + + +.. _recovery_op_priority: + +``recovery_op_priority`` + +:Description: Specify the recovery operation priority for this pool instead of ``osd_recovery_op_priority``. + +:Type: Integer +:Default: ``0`` + + +Get Pool Values +=============== + +To get a value from a pool, execute the following: + +.. prompt:: bash $ + + ceph osd pool get {pool-name} {key} + +You may get values for the following keys: + +``size`` + +:Description: see size_ + +:Type: Integer + +``min_size`` + +:Description: see min_size_ + +:Type: Integer +:Version: ``0.54`` and above + +``pg_num`` + +:Description: see pg_num_ + +:Type: Integer + + +``pgp_num`` + +:Description: see pgp_num_ + +:Type: Integer +:Valid Range: Equal to or less than ``pg_num``. + + +``crush_rule`` + +:Description: see crush_rule_ + + +``hit_set_type`` + +:Description: see hit_set_type_ + +:Type: String +:Valid Settings: ``bloom``, ``explicit_hash``, ``explicit_object`` + +``hit_set_count`` + +:Description: see hit_set_count_ + +:Type: Integer + + +``hit_set_period`` + +:Description: see hit_set_period_ + +:Type: Integer + + +``hit_set_fpp`` + +:Description: see hit_set_fpp_ + +:Type: Double + + +``cache_target_dirty_ratio`` + +:Description: see cache_target_dirty_ratio_ + +:Type: Double + + +``cache_target_dirty_high_ratio`` + +:Description: see cache_target_dirty_high_ratio_ + +:Type: Double + + +``cache_target_full_ratio`` + +:Description: see cache_target_full_ratio_ + +:Type: Double + + +``target_max_bytes`` + +:Description: see target_max_bytes_ + +:Type: Integer + + +``target_max_objects`` + +:Description: see target_max_objects_ + +:Type: Integer + + +``cache_min_flush_age`` + +:Description: see cache_min_flush_age_ + +:Type: Integer + + +``cache_min_evict_age`` + +:Description: see cache_min_evict_age_ + +:Type: Integer + + +``fast_read`` + +:Description: see fast_read_ + +:Type: Boolean + + +``scrub_min_interval`` + +:Description: see scrub_min_interval_ + +:Type: Double + + +``scrub_max_interval`` + +:Description: see scrub_max_interval_ + +:Type: Double + + +``deep_scrub_interval`` + +:Description: see deep_scrub_interval_ + +:Type: Double + + +``allow_ec_overwrites`` + +:Description: see allow_ec_overwrites_ + +:Type: Boolean + + +``recovery_priority`` + +:Description: see recovery_priority_ + +:Type: Integer + + +``recovery_op_priority`` + +:Description: see recovery_op_priority_ + +:Type: Integer + + +Set the Number of Object Replicas +================================= + +To set the number of object replicas on a replicated pool, execute the following: + +.. prompt:: bash $ + + ceph osd pool set {poolname} size {num-replicas} + +.. important:: The ``{num-replicas}`` includes the object itself. + If you want the object and two copies of the object for a total of + three instances of the object, specify ``3``. + +For example: + +.. prompt:: bash $ + + ceph osd pool set data size 3 + +You may execute this command for each pool. **Note:** An object might accept +I/Os in degraded mode with fewer than ``pool size`` replicas. To set a minimum +number of required replicas for I/O, you should use the ``min_size`` setting. +For example: + +.. prompt:: bash $ + + ceph osd pool set data min_size 2 + +This ensures that no object in the data pool will receive I/O with fewer than +``min_size`` replicas. + + +Get the Number of Object Replicas +================================= + +To get the number of object replicas, execute the following: + +.. prompt:: bash $ + + ceph osd dump | grep 'replicated size' + +Ceph will list the pools, with the ``replicated size`` attribute highlighted. +By default, ceph creates two replicas of an object (a total of three copies, or +a size of 3). + + +.. _pgcalc: https://old.ceph.com/pgcalc/ +.. _Pool, PG and CRUSH Config Reference: ../../configuration/pool-pg-config-ref +.. _Bloom Filter: https://en.wikipedia.org/wiki/Bloom_filter +.. _setting the number of placement groups: ../placement-groups#set-the-number-of-placement-groups +.. _Erasure Coding with Overwrites: ../erasure-code#erasure-coding-with-overwrites +.. _Block Device Commands: ../../../rbd/rados-rbd-cmds/#create-a-block-device-pool diff --git a/doc/rados/operations/stretch-mode.rst b/doc/rados/operations/stretch-mode.rst new file mode 100644 index 000000000..6b4c9ba8a --- /dev/null +++ b/doc/rados/operations/stretch-mode.rst @@ -0,0 +1,215 @@ +.. _stretch_mode: + +================ +Stretch Clusters +================ + + +Stretch Clusters +================ +Ceph generally expects all parts of its network and overall cluster to be +equally reliable, with failures randomly distributed across the CRUSH map. +So you may lose a switch that knocks out a number of OSDs, but we expect +the remaining OSDs and monitors to route around that. + +This is usually a good choice, but may not work well in some +stretched cluster configurations where a significant part of your cluster +is stuck behind a single network component. For instance, a single +cluster which is located in multiple data centers, and you want to +sustain the loss of a full DC. + +There are two standard configurations we've seen deployed, with either +two or three data centers (or, in clouds, availability zones). With two +zones, we expect each site to hold a copy of the data, and for a third +site to have a tiebreaker monitor (this can be a VM or high-latency compared +to the main sites) to pick a winner if the network connection fails and both +DCs remain alive. For three sites, we expect a copy of the data and an equal +number of monitors in each site. + +Note that the standard Ceph configuration will survive MANY failures of the +network or data centers and it will never compromise data consistency. If you +bring back enough Ceph servers following a failure, it will recover. If you +lose a data center, but can still form a quorum of monitors and have all the data +available (with enough copies to satisfy pools' ``min_size``, or CRUSH rules +that will re-replicate to meet it), Ceph will maintain availability. + +What can't it handle? + +Stretch Cluster Issues +====================== +No matter what happens, Ceph will not compromise on data integrity +and consistency. If there's a failure in your network or a loss of nodes and +you can restore service, Ceph will return to normal functionality on its own. + +But there are scenarios where you lose data availibility despite having +enough servers available to satisfy Ceph's consistency and sizing constraints, or +where you may be surprised to not satisfy Ceph's constraints. +The first important category of these failures resolve around inconsistent +networks -- if there's a netsplit, Ceph may be unable to mark OSDs down and kick +them out of the acting PG sets despite the primary being unable to replicate data. +If this happens, IO will not be permitted, because Ceph can't satisfy its durability +guarantees. + +The second important category of failures is when you think you have data replicated +across data centers, but the constraints aren't sufficient to guarantee this. +For instance, you might have data centers A and B, and your CRUSH rule targets 3 copies +and places a copy in each data center with a ``min_size`` of 2. The PG may go active with +2 copies in site A and no copies in site B, which means that if you then lose site A you +have lost data and Ceph can't operate on it. This situation is surprisingly difficult +to avoid with standard CRUSH rules. + +Stretch Mode +============ +The new stretch mode is designed to handle the 2-site case. Three sites are +just as susceptible to netsplit issues, but are much more tolerant of +component availability outages than 2-site clusters are. + +To enter stretch mode, you must set the location of each monitor, matching +your CRUSH map. For instance, to place ``mon.a`` in your first data center: + +.. prompt:: bash $ + + ceph mon set_location a datacenter=site1 + +Next, generate a CRUSH rule which will place 2 copies in each data center. This +will require editing the CRUSH map directly: + +.. prompt:: bash $ + + ceph osd getcrushmap > crush.map.bin + crushtool -d crush.map.bin -o crush.map.txt + +Now edit the ``crush.map.txt`` file to add a new rule. Here +there is only one other rule, so this is ID 1, but you may need +to use a different rule ID. We also have two datacenter buckets +named ``site1`` and ``site2``:: + + rule stretch_rule { + id 1 + type replicated + min_size 1 + max_size 10 + step take site1 + step chooseleaf firstn 2 type host + step emit + step take site2 + step chooseleaf firstn 2 type host + step emit + } + +Finally, inject the CRUSH map to make the rule available to the cluster: + +.. prompt:: bash $ + + crushtool -c crush.map.txt -o crush2.map.bin + ceph osd setcrushmap -i crush2.map.bin + +If you aren't already running your monitors in connectivity mode, do so with +the instructions in `Changing Monitor Elections`_. + +.. _Changing Monitor elections: ../change-mon-elections + +And lastly, tell the cluster to enter stretch mode. Here, ``mon.e`` is the +tiebreaker and we are splitting across data centers. ``mon.e`` should be also +set a datacenter, that will differ from ``site1`` and ``site2``. For this +purpose you can create another datacenter bucket named ```site3`` in your +CRUSH and place ``mon.e`` there: + +.. prompt:: bash $ + + ceph mon set_location e datacenter=site3 + ceph mon enable_stretch_mode e stretch_rule datacenter + +When stretch mode is enabled, the OSDs wlll only take PGs active when +they peer across data centers (or whatever other CRUSH bucket type +you specified), assuming both are alive. Pools will increase in size +from the default 3 to 4, expecting 2 copies in each site. OSDs will only +be allowed to connect to monitors in the same data center. New monitors +will not be allowed to join the cluster if they do not specify a location. + +If all the OSDs and monitors from a data center become inaccessible +at once, the surviving data center will enter a degraded stretch mode. This +will issue a warning, reduce the min_size to 1, and allow +the cluster to go active with data in the single remaining site. Note that +we do not change the pool size, so you will also get warnings that the +pools are too small -- but a special stretch mode flag will prevent the OSDs +from creating extra copies in the remaining data center (so it will only keep +2 copies, as before). + +When the missing data center comes back, the cluster will enter +recovery stretch mode. This changes the warning and allows peering, but +still only requires OSDs from the data center which was up the whole time. +When all PGs are in a known state, and are neither degraded nor incomplete, +the cluster transitions back to regular stretch mode, ends the warning, +restores min_size to its starting value (2) and requires both sites to peer, +and stops requiring the always-alive site when peering (so that you can fail +over to the other site, if necessary). + + +Stretch Mode Limitations +======================== +As implied by the setup, stretch mode only handles 2 sites with OSDs. + +While it is not enforced, you should run 2 monitors in each site plus +a tiebreaker, for a total of 5. This is because OSDs can only connect +to monitors in their own site when in stretch mode. + +You cannot use erasure coded pools with stretch mode. If you try, it will +refuse, and it will not allow you to create EC pools once in stretch mode. + +You must create your own CRUSH rule which provides 2 copies in each site, and +you must use 4 total copies with 2 in each site. If you have existing pools +with non-default size/min_size, Ceph will object when you attempt to +enable stretch mode. + +Because it runs with ``min_size 1`` when degraded, you should only use stretch +mode with all-flash OSDs. This minimizes the time needed to recover once +connectivity is restored, and thus minimizes the potential for data loss. + +Hopefully, future development will extend this feature to support EC pools and +running with more than 2 full sites. + +Other commands +============== +If your tiebreaker monitor fails for some reason, you can replace it. Turn on +a new monitor and run: + +.. prompt:: bash $ + + ceph mon set_new_tiebreaker mon.<new_mon_name> + +This command will protest if the new monitor is in the same location as existing +non-tiebreaker monitors. This command WILL NOT remove the previous tiebreaker +monitor; you should do so yourself. + +Also in 16.2.7, if you are writing your own tooling for deploying Ceph, you can use a new +``--set-crush-location`` option when booting monitors, instead of running +``ceph mon set_location``. This option accepts only a single "bucket=loc" pair, eg +``ceph-mon --set-crush-location 'datacenter=a'``, which must match the +bucket type you specified when running ``enable_stretch_mode``. + + +When in stretch degraded mode, the cluster will go into "recovery" mode automatically +when the disconnected data center comes back. If that doesn't work, or you want to +enable recovery mode early, you can invoke: + +.. prompt:: bash $ + + ceph osd force_recovery_stretch_mode --yes-i-really-mean-it + +But this command should not be necessary; it is included to deal with +unanticipated situations. + +When in recovery mode, the cluster should go back into normal stretch mode +when the PGs are healthy. If this doesn't happen, or you want to force the +cross-data-center peering early and are willing to risk data downtime (or have +verified separately that all the PGs can peer, even if they aren't fully +recovered), you can invoke: + +.. prompt:: bash $ + + ceph osd force_healthy_stretch_mode --yes-i-really-mean-it + +This command should not be necessary; it is included to deal with +unanticipated situations. But you might wish to invoke it to remove +the ``HEALTH_WARN`` state which recovery mode generates. diff --git a/doc/rados/operations/upmap.rst b/doc/rados/operations/upmap.rst new file mode 100644 index 000000000..343adf2c4 --- /dev/null +++ b/doc/rados/operations/upmap.rst @@ -0,0 +1,105 @@ +.. _upmap: + +Using the pg-upmap +================== + +Starting in Luminous v12.2.z there is a new *pg-upmap* exception table +in the OSDMap that allows the cluster to explicitly map specific PGs to +specific OSDs. This allows the cluster to fine-tune the data +distribution to, in most cases, perfectly distributed PGs across OSDs. + +The key caveat to this new mechanism is that it requires that all +clients understand the new *pg-upmap* structure in the OSDMap. + +Enabling +-------- + +New clusters will have this module on by default. The cluster must only +have luminous (and newer) clients. You can the turn the balancer off with: + +.. prompt:: bash $ + + ceph balancer off + +To allow use of the feature on existing clusters, you must tell the +cluster that it only needs to support luminous (and newer) clients with: + +.. prompt:: bash $ + + ceph osd set-require-min-compat-client luminous + +This command will fail if any pre-luminous clients or daemons are +connected to the monitors. You can see what client versions are in +use with: + +.. prompt:: bash $ + + ceph features + +Balancer module +----------------- + +The `balancer` module for ceph-mgr will automatically balance +the number of PGs per OSD. See :ref:`balancer` + + +Offline optimization +-------------------- + +Upmap entries are updated with an offline optimizer built into ``osdmaptool``. + +#. Grab the latest copy of your osdmap: + + .. prompt:: bash $ + + ceph osd getmap -o om + +#. Run the optimizer: + + .. prompt:: bash $ + + osdmaptool om --upmap out.txt [--upmap-pool <pool>] \ + [--upmap-max <max-optimizations>] \ + [--upmap-deviation <max-deviation>] \ + [--upmap-active] + + It is highly recommended that optimization be done for each pool + individually, or for sets of similarly-utilized pools. You can + specify the ``--upmap-pool`` option multiple times. "Similar pools" + means pools that are mapped to the same devices and store the same + kind of data (e.g., RBD image pools, yes; RGW index pool and RGW + data pool, no). + + The ``max-optimizations`` value is the maximum number of upmap entries to + identify in the run. The default is `10` like the ceph-mgr balancer module, + but you should use a larger number if you are doing offline optimization. + If it cannot find any additional changes to make it will stop early + (i.e., when the pool distribution is perfect). + + The ``max-deviation`` value defaults to `5`. If an OSD PG count + varies from the computed target number by less than or equal + to this amount it will be considered perfect. + + The ``--upmap-active`` option simulates the behavior of the active + balancer in upmap mode. It keeps cycling until the OSDs are balanced + and reports how many rounds and how long each round is taking. The + elapsed time for rounds indicates the CPU load ceph-mgr will be + consuming when it tries to compute the next optimization plan. + +#. Apply the changes: + + .. prompt:: bash $ + + source out.txt + + The proposed changes are written to the output file ``out.txt`` in + the example above. These are normal ceph CLI commands that can be + run to apply the changes to the cluster. + + +The above steps can be repeated as many times as necessary to achieve +a perfect distribution of PGs for each set of pools. + +You can see some (gory) details about what the tool is doing by +passing ``--debug-osd 10`` and even more with ``--debug-crush 10`` +to ``osdmaptool``. diff --git a/doc/rados/operations/user-management.rst b/doc/rados/operations/user-management.rst new file mode 100644 index 000000000..78d77236d --- /dev/null +++ b/doc/rados/operations/user-management.rst @@ -0,0 +1,823 @@ +.. _user-management: + +================= + User Management +================= + +This document describes :term:`Ceph Client` users, and their authentication and +authorization with the :term:`Ceph Storage Cluster`. Users are either +individuals or system actors such as applications, which use Ceph clients to +interact with the Ceph Storage Cluster daemons. + +.. ditaa:: + +-----+ + | {o} | + | | + +--+--+ /---------\ /---------\ + | | Ceph | | Ceph | + ---+---*----->| |<------------->| | + | uses | Clients | | Servers | + | \---------/ \---------/ + /--+--\ + | | + | | + actor + + +When Ceph runs with authentication and authorization enabled (enabled by +default), you must specify a user name and a keyring containing the secret key +of the specified user (usually via the command line). If you do not specify a +user name, Ceph will use ``client.admin`` as the default user name. If you do +not specify a keyring, Ceph will look for a keyring via the ``keyring`` setting +in the Ceph configuration. For example, if you execute the ``ceph health`` +command without specifying a user or keyring: + +.. prompt:: bash $ + + ceph health + +Ceph interprets the command like this: + +.. prompt:: bash $ + + ceph -n client.admin --keyring=/etc/ceph/ceph.client.admin.keyring health + +Alternatively, you may use the ``CEPH_ARGS`` environment variable to avoid +re-entry of the user name and secret. + +For details on configuring the Ceph Storage Cluster to use authentication, +see `Cephx Config Reference`_. For details on the architecture of Cephx, see +`Architecture - High Availability Authentication`_. + +Background +========== + +Irrespective of the type of Ceph client (e.g., Block Device, Object Storage, +Filesystem, native API, etc.), Ceph stores all data as objects within `pools`_. +Ceph users must have access to pools in order to read and write data. +Additionally, Ceph users must have execute permissions to use Ceph's +administrative commands. The following concepts will help you understand Ceph +user management. + +User +---- + +A user is either an individual or a system actor such as an application. +Creating users allows you to control who (or what) can access your Ceph Storage +Cluster, its pools, and the data within pools. + +Ceph has the notion of a ``type`` of user. For the purposes of user management, +the type will always be ``client``. Ceph identifies users in period (.) +delimited form consisting of the user type and the user ID: for example, +``TYPE.ID``, ``client.admin``, or ``client.user1``. The reason for user typing +is that Ceph Monitors, OSDs, and Metadata Servers also use the Cephx protocol, +but they are not clients. Distinguishing the user type helps to distinguish +between client users and other users--streamlining access control, user +monitoring and traceability. + +Sometimes Ceph's user type may seem confusing, because the Ceph command line +allows you to specify a user with or without the type, depending upon your +command line usage. If you specify ``--user`` or ``--id``, you can omit the +type. So ``client.user1`` can be entered simply as ``user1``. If you specify +``--name`` or ``-n``, you must specify the type and name, such as +``client.user1``. We recommend using the type and name as a best practice +wherever possible. + +.. note:: A Ceph Storage Cluster user is not the same as a Ceph Object Storage + user or a Ceph File System user. The Ceph Object Gateway uses a Ceph Storage + Cluster user to communicate between the gateway daemon and the storage + cluster, but the gateway has its own user management functionality for end + users. The Ceph File System uses POSIX semantics. The user space associated + with the Ceph File System is not the same as a Ceph Storage Cluster user. + + + +Authorization (Capabilities) +---------------------------- + +Ceph uses the term "capabilities" (caps) to describe authorizing an +authenticated user to exercise the functionality of the monitors, OSDs and +metadata servers. Capabilities can also restrict access to data within a pool, +a namespace within a pool, or a set of pools based on their application tags. +A Ceph administrative user sets a user's capabilities when creating or updating +a user. + +Capability syntax follows the form:: + + {daemon-type} '{cap-spec}[, {cap-spec} ...]' + +- **Monitor Caps:** Monitor capabilities include ``r``, ``w``, ``x`` access + settings or ``profile {name}``. For example:: + + mon 'allow {access-spec} [network {network/prefix}]' + + mon 'profile {name}' + + The ``{access-spec}`` syntax is as follows: :: + + * | all | [r][w][x] + + The optional ``{network/prefix}`` is a standard network name and + prefix length in CIDR notation (e.g., ``10.3.0.0/16``). If present, + the use of this capability is restricted to clients connecting from + this network. + +- **OSD Caps:** OSD capabilities include ``r``, ``w``, ``x``, ``class-read``, + ``class-write`` access settings or ``profile {name}``. Additionally, OSD + capabilities also allow for pool and namespace settings. :: + + osd 'allow {access-spec} [{match-spec}] [network {network/prefix}]' + + osd 'profile {name} [pool={pool-name} [namespace={namespace-name}]] [network {network/prefix}]' + + The ``{access-spec}`` syntax is either of the following: :: + + * | all | [r][w][x] [class-read] [class-write] + + class {class name} [{method name}] + + The optional ``{match-spec}`` syntax is either of the following: :: + + pool={pool-name} [namespace={namespace-name}] [object_prefix {prefix}] + + [namespace={namespace-name}] tag {application} {key}={value} + + The optional ``{network/prefix}`` is a standard network name and + prefix length in CIDR notation (e.g., ``10.3.0.0/16``). If present, + the use of this capability is restricted to clients connecting from + this network. + +- **Manager Caps:** Manager (``ceph-mgr``) capabilities include + ``r``, ``w``, ``x`` access settings or ``profile {name}``. For example: :: + + mgr 'allow {access-spec} [network {network/prefix}]' + + mgr 'profile {name} [{key1} {match-type} {value1} ...] [network {network/prefix}]' + + Manager capabilities can also be specified for specific commands, + all commands exported by a built-in manager service, or all commands + exported by a specific add-on module. For example: :: + + mgr 'allow command "{command-prefix}" [with {key1} {match-type} {value1} ...] [network {network/prefix}]' + + mgr 'allow service {service-name} {access-spec} [network {network/prefix}]' + + mgr 'allow module {module-name} [with {key1} {match-type} {value1} ...] {access-spec} [network {network/prefix}]' + + The ``{access-spec}`` syntax is as follows: :: + + * | all | [r][w][x] + + The ``{service-name}`` is one of the following: :: + + mgr | osd | pg | py + + The ``{match-type}`` is one of the following: :: + + = | prefix | regex + +- **Metadata Server Caps:** For administrators, use ``allow *``. For all + other users, such as CephFS clients, consult :doc:`/cephfs/client-auth` + + +.. note:: The Ceph Object Gateway daemon (``radosgw``) is a client of the + Ceph Storage Cluster, so it is not represented as a Ceph Storage + Cluster daemon type. + +The following entries describe each access capability. + +``allow`` + +:Description: Precedes access settings for a daemon. Implies ``rw`` + for MDS only. + + +``r`` + +:Description: Gives the user read access. Required with monitors to retrieve + the CRUSH map. + + +``w`` + +:Description: Gives the user write access to objects. + + +``x`` + +:Description: Gives the user the capability to call class methods + (i.e., both read and write) and to conduct ``auth`` + operations on monitors. + + +``class-read`` + +:Descriptions: Gives the user the capability to call class read methods. + Subset of ``x``. + + +``class-write`` + +:Description: Gives the user the capability to call class write methods. + Subset of ``x``. + + +``*``, ``all`` + +:Description: Gives the user read, write and execute permissions for a + particular daemon/pool, and the ability to execute + admin commands. + +The following entries describe valid capability profiles: + +``profile osd`` (Monitor only) + +:Description: Gives a user permissions to connect as an OSD to other OSDs or + monitors. Conferred on OSDs to enable OSDs to handle replication + heartbeat traffic and status reporting. + + +``profile mds`` (Monitor only) + +:Description: Gives a user permissions to connect as a MDS to other MDSs or + monitors. + + +``profile bootstrap-osd`` (Monitor only) + +:Description: Gives a user permissions to bootstrap an OSD. Conferred on + deployment tools such as ``ceph-volume``, ``cephadm``, etc. + so that they have permissions to add keys, etc. when + bootstrapping an OSD. + + +``profile bootstrap-mds`` (Monitor only) + +:Description: Gives a user permissions to bootstrap a metadata server. + Conferred on deployment tools such as ``cephadm``, etc. + so they have permissions to add keys, etc. when bootstrapping + a metadata server. + +``profile bootstrap-rbd`` (Monitor only) + +:Description: Gives a user permissions to bootstrap an RBD user. + Conferred on deployment tools such as ``cephadm``, etc. + so they have permissions to add keys, etc. when bootstrapping + an RBD user. + +``profile bootstrap-rbd-mirror`` (Monitor only) + +:Description: Gives a user permissions to bootstrap an ``rbd-mirror`` daemon + user. Conferred on deployment tools such as ``cephadm``, etc. + so they have permissions to add keys, etc. when bootstrapping + an ``rbd-mirror`` daemon. + +``profile rbd`` (Manager, Monitor, and OSD) + +:Description: Gives a user permissions to manipulate RBD images. When used + as a Monitor cap, it provides the minimal privileges required + by an RBD client application; this includes the ability + to blocklist other client users. When used as an OSD cap, it + provides read-write access to the specified pool to an + RBD client application. The Manager cap supports optional + ``pool`` and ``namespace`` keyword arguments. + +``profile rbd-mirror`` (Monitor only) + +:Description: Gives a user permissions to manipulate RBD images and retrieve + RBD mirroring config-key secrets. It provides the minimal + privileges required for the ``rbd-mirror`` daemon. + +``profile rbd-read-only`` (Manager and OSD) + +:Description: Gives a user read-only permissions to RBD images. The Manager + cap supports optional ``pool`` and ``namespace`` keyword + arguments. + +``profile simple-rados-client`` (Monitor only) + +:Description: Gives a user read-only permissions for monitor, OSD, and PG data. + Intended for use by direct librados client applications. + +``profile simple-rados-client-with-blocklist`` (Monitor only) + +:Description: Gives a user read-only permissions for monitor, OSD, and PG data. + Intended for use by direct librados client applications. Also + includes permission to add blocklist entries to build HA + applications. + +``profile fs-client`` (Monitor only) + +:Description: Gives a user read-only permissions for monitor, OSD, PG, and MDS + data. Intended for CephFS clients. + +``profile role-definer`` (Monitor and Auth) + +:Description: Gives a user **all** permissions for the auth subsystem, read-only + access to monitors, and nothing else. Useful for automation + tools. Do not assign this unless you really, **really** know what + you're doing as the security ramifications are substantial and + pervasive. + +``profile crash`` (Monitor only) + +:Description: Gives a user read-only access to monitors, used in conjunction + with the manager ``crash`` module when collecting daemon crash + dumps for later analysis. + +Pool +---- + +A pool is a logical partition where users store data. +In Ceph deployments, it is common to create a pool as a logical partition for +similar types of data. For example, when deploying Ceph as a backend for +OpenStack, a typical deployment would have pools for volumes, images, backups +and virtual machines, and users such as ``client.glance``, ``client.cinder``, +etc. + +Application Tags +---------------- + +Access may be restricted to specific pools as defined by their application +metadata. The ``*`` wildcard may be used for the ``key`` argument, the +``value`` argument, or both. ``all`` is a synony for ``*``. + +Namespace +--------- + +Objects within a pool can be associated to a namespace--a logical group of +objects within the pool. A user's access to a pool can be associated with a +namespace such that reads and writes by the user take place only within the +namespace. Objects written to a namespace within the pool can only be accessed +by users who have access to the namespace. + +.. note:: Namespaces are primarily useful for applications written on top of + ``librados`` where the logical grouping can alleviate the need to create + different pools. Ceph Object Gateway (from ``luminous``) uses namespaces for various + metadata objects. + +The rationale for namespaces is that pools can be a computationally expensive +method of segregating data sets for the purposes of authorizing separate sets +of users. For example, a pool should have ~100 placement groups per OSD. So an +exemplary cluster with 1000 OSDs would have 100,000 placement groups for one +pool. Each pool would create another 100,000 placement groups in the exemplary +cluster. By contrast, writing an object to a namespace simply associates the +namespace to the object name with out the computational overhead of a separate +pool. Rather than creating a separate pool for a user or set of users, you may +use a namespace. **Note:** Only available using ``librados`` at this time. + +Access may be restricted to specific RADOS namespaces using the ``namespace`` +capability. Limited globbing of namespaces is supported; if the last character +of the specified namespace is ``*``, then access is granted to any namespace +starting with the provided argument. + +Managing Users +============== + +User management functionality provides Ceph Storage Cluster administrators with +the ability to create, update and delete users directly in the Ceph Storage +Cluster. + +When you create or delete users in the Ceph Storage Cluster, you may need to +distribute keys to clients so that they can be added to keyrings. See `Keyring +Management`_ for details. + +List Users +---------- + +To list the users in your cluster, execute the following: + +.. prompt:: bash $ + + ceph auth ls + +Ceph will list out all users in your cluster. For example, in a two-node +exemplary cluster, ``ceph auth ls`` will output something that looks like +this:: + + installed auth entries: + + osd.0 + key: AQCvCbtToC6MDhAATtuT70Sl+DymPCfDSsyV4w== + caps: [mon] allow profile osd + caps: [osd] allow * + osd.1 + key: AQC4CbtTCFJBChAAVq5spj0ff4eHZICxIOVZeA== + caps: [mon] allow profile osd + caps: [osd] allow * + client.admin + key: AQBHCbtT6APDHhAA5W00cBchwkQjh3dkKsyPjw== + caps: [mds] allow + caps: [mon] allow * + caps: [osd] allow * + client.bootstrap-mds + key: AQBICbtTOK9uGBAAdbe5zcIGHZL3T/u2g6EBww== + caps: [mon] allow profile bootstrap-mds + client.bootstrap-osd + key: AQBHCbtT4GxqORAADE5u7RkpCN/oo4e5W0uBtw== + caps: [mon] allow profile bootstrap-osd + + +Note that the ``TYPE.ID`` notation for users applies such that ``osd.0`` is a +user of type ``osd`` and its ID is ``0``, ``client.admin`` is a user of type +``client`` and its ID is ``admin`` (i.e., the default ``client.admin`` user). +Note also that each entry has a ``key: <value>`` entry, and one or more +``caps:`` entries. + +You may use the ``-o {filename}`` option with ``ceph auth ls`` to +save the output to a file. + + +Get a User +---------- + +To retrieve a specific user, key and capabilities, execute the +following: + +.. prompt:: bash $ + + ceph auth get {TYPE.ID} + +For example: + +.. prompt:: bash $ + + ceph auth get client.admin + +You may also use the ``-o {filename}`` option with ``ceph auth get`` to +save the output to a file. Developers may also execute the following: + +.. prompt:: bash $ + + ceph auth export {TYPE.ID} + +The ``auth export`` command is identical to ``auth get``. + +Add a User +---------- + +Adding a user creates a username (i.e., ``TYPE.ID``), a secret key and +any capabilities included in the command you use to create the user. + +A user's key enables the user to authenticate with the Ceph Storage Cluster. +The user's capabilities authorize the user to read, write, or execute on Ceph +monitors (``mon``), Ceph OSDs (``osd``) or Ceph Metadata Servers (``mds``). + +There are a few ways to add a user: + +- ``ceph auth add``: This command is the canonical way to add a user. It + will create the user, generate a key and add any specified capabilities. + +- ``ceph auth get-or-create``: This command is often the most convenient way + to create a user, because it returns a keyfile format with the user name + (in brackets) and the key. If the user already exists, this command + simply returns the user name and key in the keyfile format. You may use the + ``-o {filename}`` option to save the output to a file. + +- ``ceph auth get-or-create-key``: This command is a convenient way to create + a user and return the user's key (only). This is useful for clients that + need the key only (e.g., libvirt). If the user already exists, this command + simply returns the key. You may use the ``-o {filename}`` option to save the + output to a file. + +When creating client users, you may create a user with no capabilities. A user +with no capabilities is useless beyond mere authentication, because the client +cannot retrieve the cluster map from the monitor. However, you can create a +user with no capabilities if you wish to defer adding capabilities later using +the ``ceph auth caps`` command. + +A typical user has at least read capabilities on the Ceph monitor and +read and write capability on Ceph OSDs. Additionally, a user's OSD permissions +are often restricted to accessing a particular pool: + +.. prompt:: bash $ + + ceph auth add client.john mon 'allow r' osd 'allow rw pool=liverpool' + ceph auth get-or-create client.paul mon 'allow r' osd 'allow rw pool=liverpool' + ceph auth get-or-create client.george mon 'allow r' osd 'allow rw pool=liverpool' -o george.keyring + ceph auth get-or-create-key client.ringo mon 'allow r' osd 'allow rw pool=liverpool' -o ringo.key + + +.. important:: If you provide a user with capabilities to OSDs, but you DO NOT + restrict access to particular pools, the user will have access to ALL + pools in the cluster! + + +.. _modify-user-capabilities: + +Modify User Capabilities +------------------------ + +The ``ceph auth caps`` command allows you to specify a user and change the +user's capabilities. Setting new capabilities will overwrite current capabilities. +To view current capabilities run ``ceph auth get USERTYPE.USERID``. To add +capabilities, you should also specify the existing capabilities when using the form: + +.. prompt:: bash $ + + ceph auth caps USERTYPE.USERID {daemon} 'allow [r|w|x|*|...] [pool={pool-name}] [namespace={namespace-name}]' [{daemon} 'allow [r|w|x|*|...] [pool={pool-name}] [namespace={namespace-name}]'] + +For example: + +.. prompt:: bash $ + + ceph auth get client.john + ceph auth caps client.john mon 'allow r' osd 'allow rw pool=liverpool' + ceph auth caps client.paul mon 'allow rw' osd 'allow rwx pool=liverpool' + ceph auth caps client.brian-manager mon 'allow *' osd 'allow *' + +See `Authorization (Capabilities)`_ for additional details on capabilities. + +Delete a User +------------- + +To delete a user, use ``ceph auth del``: + +.. prompt:: bash $ + + ceph auth del {TYPE}.{ID} + +Where ``{TYPE}`` is one of ``client``, ``osd``, ``mon``, or ``mds``, +and ``{ID}`` is the user name or ID of the daemon. + + +Print a User's Key +------------------ + +To print a user's authentication key to standard output, execute the following: + +.. prompt:: bash $ + + ceph auth print-key {TYPE}.{ID} + +Where ``{TYPE}`` is one of ``client``, ``osd``, ``mon``, or ``mds``, +and ``{ID}`` is the user name or ID of the daemon. + +Printing a user's key is useful when you need to populate client +software with a user's key (e.g., libvirt): + +.. prompt:: bash $ + + mount -t ceph serverhost:/ mountpoint -o name=client.user,secret=`ceph auth print-key client.user` + +Import a User(s) +---------------- + +To import one or more users, use ``ceph auth import`` and +specify a keyring: + +.. prompt:: bash $ + + ceph auth import -i /path/to/keyring + +For example: + +.. prompt:: bash $ + + sudo ceph auth import -i /etc/ceph/ceph.keyring + + +.. note:: The Ceph storage cluster will add new users, their keys and their + capabilities and will update existing users, their keys and their + capabilities. + +Keyring Management +================== + +When you access Ceph via a Ceph client, the Ceph client will look for a local +keyring. Ceph presets the ``keyring`` setting with the following four keyring +names by default so you don't have to set them in your Ceph configuration file +unless you want to override the defaults (not recommended): + +- ``/etc/ceph/$cluster.$name.keyring`` +- ``/etc/ceph/$cluster.keyring`` +- ``/etc/ceph/keyring`` +- ``/etc/ceph/keyring.bin`` + +The ``$cluster`` metavariable is your Ceph cluster name as defined by the +name of the Ceph configuration file (i.e., ``ceph.conf`` means the cluster name +is ``ceph``; thus, ``ceph.keyring``). The ``$name`` metavariable is the user +type and user ID (e.g., ``client.admin``; thus, ``ceph.client.admin.keyring``). + +.. note:: When executing commands that read or write to ``/etc/ceph``, you may + need to use ``sudo`` to execute the command as ``root``. + +After you create a user (e.g., ``client.ringo``), you must get the key and add +it to a keyring on a Ceph client so that the user can access the Ceph Storage +Cluster. + +The `User Management`_ section details how to list, get, add, modify and delete +users directly in the Ceph Storage Cluster. However, Ceph also provides the +``ceph-authtool`` utility to allow you to manage keyrings from a Ceph client. + +Create a Keyring +---------------- + +When you use the procedures in the `Managing Users`_ section to create users, +you need to provide user keys to the Ceph client(s) so that the Ceph client +can retrieve the key for the specified user and authenticate with the Ceph +Storage Cluster. Ceph Clients access keyrings to lookup a user name and +retrieve the user's key. + +The ``ceph-authtool`` utility allows you to create a keyring. To create an +empty keyring, use ``--create-keyring`` or ``-C``. For example: + +.. prompt:: bash $ + + ceph-authtool --create-keyring /path/to/keyring + +When creating a keyring with multiple users, we recommend using the cluster name +(e.g., ``$cluster.keyring``) for the keyring filename and saving it in the +``/etc/ceph`` directory so that the ``keyring`` configuration default setting +will pick up the filename without requiring you to specify it in the local copy +of your Ceph configuration file. For example, create ``ceph.keyring`` by +executing the following: + +.. prompt:: bash $ + + sudo ceph-authtool -C /etc/ceph/ceph.keyring + +When creating a keyring with a single user, we recommend using the cluster name, +the user type and the user name and saving it in the ``/etc/ceph`` directory. +For example, ``ceph.client.admin.keyring`` for the ``client.admin`` user. + +To create a keyring in ``/etc/ceph``, you must do so as ``root``. This means +the file will have ``rw`` permissions for the ``root`` user only, which is +appropriate when the keyring contains administrator keys. However, if you +intend to use the keyring for a particular user or group of users, ensure +that you execute ``chown`` or ``chmod`` to establish appropriate keyring +ownership and access. + +Add a User to a Keyring +----------------------- + +When you `Add a User`_ to the Ceph Storage Cluster, you can use the `Get a +User`_ procedure to retrieve a user, key and capabilities and save the user to a +keyring. + +When you only want to use one user per keyring, the `Get a User`_ procedure with +the ``-o`` option will save the output in the keyring file format. For example, +to create a keyring for the ``client.admin`` user, execute the following: + +.. prompt:: bash $ + + sudo ceph auth get client.admin -o /etc/ceph/ceph.client.admin.keyring + +Notice that we use the recommended file format for an individual user. + +When you want to import users to a keyring, you can use ``ceph-authtool`` +to specify the destination keyring and the source keyring. +For example: + +.. prompt:: bash $ + + sudo ceph-authtool /etc/ceph/ceph.keyring --import-keyring /etc/ceph/ceph.client.admin.keyring + +Create a User +------------- + +Ceph provides the `Add a User`_ function to create a user directly in the Ceph +Storage Cluster. However, you can also create a user, keys and capabilities +directly on a Ceph client keyring. Then, you can import the user to the Ceph +Storage Cluster. For example: + +.. prompt:: bash $ + + sudo ceph-authtool -n client.ringo --cap osd 'allow rwx' --cap mon 'allow rwx' /etc/ceph/ceph.keyring + +See `Authorization (Capabilities)`_ for additional details on capabilities. + +You can also create a keyring and add a new user to the keyring simultaneously. +For example: + +.. prompt:: bash $ + + sudo ceph-authtool -C /etc/ceph/ceph.keyring -n client.ringo --cap osd 'allow rwx' --cap mon 'allow rwx' --gen-key + +In the foregoing scenarios, the new user ``client.ringo`` is only in the +keyring. To add the new user to the Ceph Storage Cluster, you must still add +the new user to the Ceph Storage Cluster: + +.. prompt:: bash $ + + sudo ceph auth add client.ringo -i /etc/ceph/ceph.keyring + +Modify a User +------------- + +To modify the capabilities of a user record in a keyring, specify the keyring, +and the user followed by the capabilities. For example: + +.. prompt:: bash $ + + sudo ceph-authtool /etc/ceph/ceph.keyring -n client.ringo --cap osd 'allow rwx' --cap mon 'allow rwx' + +To update the user to the Ceph Storage Cluster, you must update the user +in the keyring to the user entry in the Ceph Storage Cluster: + +.. prompt:: bash $ + + sudo ceph auth import -i /etc/ceph/ceph.keyring + +See `Import a User(s)`_ for details on updating a Ceph Storage Cluster user +from a keyring. + +You may also `Modify User Capabilities`_ directly in the cluster, store the +results to a keyring file; then, import the keyring into your main +``ceph.keyring`` file. + +Command Line Usage +================== + +Ceph supports the following usage for user name and secret: + +``--id`` | ``--user`` + +:Description: Ceph identifies users with a type and an ID (e.g., ``TYPE.ID`` or + ``client.admin``, ``client.user1``). The ``id``, ``name`` and + ``-n`` options enable you to specify the ID portion of the user + name (e.g., ``admin``, ``user1``, ``foo``, etc.). You can specify + the user with the ``--id`` and omit the type. For example, + to specify user ``client.foo`` enter the following: + + .. prompt:: bash $ + + ceph --id foo --keyring /path/to/keyring health + ceph --user foo --keyring /path/to/keyring health + + +``--name`` | ``-n`` + +:Description: Ceph identifies users with a type and an ID (e.g., ``TYPE.ID`` or + ``client.admin``, ``client.user1``). The ``--name`` and ``-n`` + options enables you to specify the fully qualified user name. + You must specify the user type (typically ``client``) with the + user ID. For example: + + .. prompt:: bash $ + + ceph --name client.foo --keyring /path/to/keyring health + ceph -n client.foo --keyring /path/to/keyring health + + +``--keyring`` + +:Description: The path to the keyring containing one or more user name and + secret. The ``--secret`` option provides the same functionality, + but it does not work with Ceph RADOS Gateway, which uses + ``--secret`` for another purpose. You may retrieve a keyring with + ``ceph auth get-or-create`` and store it locally. This is a + preferred approach, because you can switch user names without + switching the keyring path. For example: + + .. prompt:: bash $ + + sudo rbd map --id foo --keyring /path/to/keyring mypool/myimage + + +.. _pools: ../pools + +Limitations +=========== + +The ``cephx`` protocol authenticates Ceph clients and servers to each other. It +is not intended to handle authentication of human users or application programs +run on their behalf. If that effect is required to handle your access control +needs, you must have another mechanism, which is likely to be specific to the +front end used to access the Ceph object store. This other mechanism has the +role of ensuring that only acceptable users and programs are able to run on the +machine that Ceph will permit to access its object store. + +The keys used to authenticate Ceph clients and servers are typically stored in +a plain text file with appropriate permissions in a trusted host. + +.. important:: Storing keys in plaintext files has security shortcomings, but + they are difficult to avoid, given the basic authentication methods Ceph + uses in the background. Those setting up Ceph systems should be aware of + these shortcomings. + +In particular, arbitrary user machines, especially portable machines, should not +be configured to interact directly with Ceph, since that mode of use would +require the storage of a plaintext authentication key on an insecure machine. +Anyone who stole that machine or obtained surreptitious access to it could +obtain the key that will allow them to authenticate their own machines to Ceph. + +Rather than permitting potentially insecure machines to access a Ceph object +store directly, users should be required to sign in to a trusted machine in +your environment using a method that provides sufficient security for your +purposes. That trusted machine will store the plaintext Ceph keys for the +human users. A future version of Ceph may address these particular +authentication issues more fully. + +At the moment, none of the Ceph authentication protocols provide secrecy for +messages in transit. Thus, an eavesdropper on the wire can hear and understand +all data sent between clients and servers in Ceph, even if it cannot create or +alter them. Further, Ceph does not include options to encrypt user data in the +object store. Users can hand-encrypt and store their own data in the Ceph +object store, of course, but Ceph provides no features to perform object +encryption itself. Those storing sensitive data in Ceph should consider +encrypting their data before providing it to the Ceph system. + + +.. _Architecture - High Availability Authentication: ../../../architecture#high-availability-authentication +.. _Cephx Config Reference: ../../configuration/auth-config-ref diff --git a/doc/rados/troubleshooting/community.rst b/doc/rados/troubleshooting/community.rst new file mode 100644 index 000000000..f816584ae --- /dev/null +++ b/doc/rados/troubleshooting/community.rst @@ -0,0 +1,28 @@ +==================== + The Ceph Community +==================== + +The Ceph community is an excellent source of information and help. For +operational issues with Ceph releases we recommend you `subscribe to the +ceph-users email list`_. When you no longer want to receive emails, you can +`unsubscribe from the ceph-users email list`_. + +You may also `subscribe to the ceph-devel email list`_. You should do so if +your issue is: + +- Likely related to a bug +- Related to a development release package +- Related to a development testing package +- Related to your own builds + +If you no longer want to receive emails from the ``ceph-devel`` email list, you +may `unsubscribe from the ceph-devel email list`_. + +.. tip:: The Ceph community is growing rapidly, and community members can help + you if you provide them with detailed information about your problem. You + can attach the output of the ``ceph report`` command to help people understand your issues. + +.. _subscribe to the ceph-devel email list: mailto:dev-join@ceph.io +.. _unsubscribe from the ceph-devel email list: mailto:dev-leave@ceph.io +.. _subscribe to the ceph-users email list: mailto:ceph-users-join@ceph.io +.. _unsubscribe from the ceph-users email list: mailto:ceph-users-leave@ceph.io diff --git a/doc/rados/troubleshooting/cpu-profiling.rst b/doc/rados/troubleshooting/cpu-profiling.rst new file mode 100644 index 000000000..159f7998d --- /dev/null +++ b/doc/rados/troubleshooting/cpu-profiling.rst @@ -0,0 +1,67 @@ +=============== + CPU Profiling +=============== + +If you built Ceph from source and compiled Ceph for use with `oprofile`_ +you can profile Ceph's CPU usage. See `Installing Oprofile`_ for details. + + +Initializing oprofile +===================== + +The first time you use ``oprofile`` you need to initialize it. Locate the +``vmlinux`` image corresponding to the kernel you are now running. :: + + ls /boot + sudo opcontrol --init + sudo opcontrol --setup --vmlinux={path-to-image} --separate=library --callgraph=6 + + +Starting oprofile +================= + +To start ``oprofile`` execute the following command:: + + opcontrol --start + +Once you start ``oprofile``, you may run some tests with Ceph. + + +Stopping oprofile +================= + +To stop ``oprofile`` execute the following command:: + + opcontrol --stop + + +Retrieving oprofile Results +=========================== + +To retrieve the top ``cmon`` results, execute the following command:: + + opreport -gal ./cmon | less + + +To retrieve the top ``cmon`` results with call graphs attached, execute the +following command:: + + opreport -cal ./cmon | less + +.. important:: After reviewing results, you should reset ``oprofile`` before + running it again. Resetting ``oprofile`` removes data from the session + directory. + + +Resetting oprofile +================== + +To reset ``oprofile``, execute the following command:: + + sudo opcontrol --reset + +.. important:: You should reset ``oprofile`` after analyzing data so that + you do not commingle results from different tests. + +.. _oprofile: http://oprofile.sourceforge.net/about/ +.. _Installing Oprofile: ../../../dev/cpu-profiler diff --git a/doc/rados/troubleshooting/index.rst b/doc/rados/troubleshooting/index.rst new file mode 100644 index 000000000..80d14f3ce --- /dev/null +++ b/doc/rados/troubleshooting/index.rst @@ -0,0 +1,19 @@ +================= + Troubleshooting +================= + +Ceph is still on the leading edge, so you may encounter situations that require +you to examine your configuration, modify your logging output, troubleshoot +monitors and OSDs, profile memory and CPU usage, and reach out to the +Ceph community for help. + +.. toctree:: + :maxdepth: 1 + + community + log-and-debug + troubleshooting-mon + troubleshooting-osd + troubleshooting-pg + memory-profiling + cpu-profiling diff --git a/doc/rados/troubleshooting/log-and-debug.rst b/doc/rados/troubleshooting/log-and-debug.rst new file mode 100644 index 000000000..71170149b --- /dev/null +++ b/doc/rados/troubleshooting/log-and-debug.rst @@ -0,0 +1,599 @@ +======================= + Logging and Debugging +======================= + +Typically, when you add debugging to your Ceph configuration, you do so at +runtime. You can also add Ceph debug logging to your Ceph configuration file if +you are encountering issues when starting your cluster. You may view Ceph log +files under ``/var/log/ceph`` (the default location). + +.. tip:: When debug output slows down your system, the latency can hide + race conditions. + +Logging is resource intensive. If you are encountering a problem in a specific +area of your cluster, enable logging for that area of the cluster. For example, +if your OSDs are running fine, but your metadata servers are not, you should +start by enabling debug logging for the specific metadata server instance(s) +giving you trouble. Enable logging for each subsystem as needed. + +.. important:: Verbose logging can generate over 1GB of data per hour. If your + OS disk reaches its capacity, the node will stop working. + +If you enable or increase the rate of Ceph logging, ensure that you have +sufficient disk space on your OS disk. See `Accelerating Log Rotation`_ for +details on rotating log files. When your system is running well, remove +unnecessary debugging settings to ensure your cluster runs optimally. Logging +debug output messages is relatively slow, and a waste of resources when +operating your cluster. + +See `Subsystem, Log and Debug Settings`_ for details on available settings. + +Runtime +======= + +If you would like to see the configuration settings at runtime, you must log +in to a host with a running daemon and execute the following:: + + ceph daemon {daemon-name} config show | less + +For example,:: + + ceph daemon osd.0 config show | less + +To activate Ceph's debugging output (*i.e.*, ``dout()``) at runtime, use the +``ceph tell`` command to inject arguments into the runtime configuration:: + + ceph tell {daemon-type}.{daemon id or *} config set {name} {value} + +Replace ``{daemon-type}`` with one of ``osd``, ``mon`` or ``mds``. You may apply +the runtime setting to all daemons of a particular type with ``*``, or specify +a specific daemon's ID. For example, to increase +debug logging for a ``ceph-osd`` daemon named ``osd.0``, execute the following:: + + ceph tell osd.0 config set debug_osd 0/5 + +The ``ceph tell`` command goes through the monitors. If you cannot bind to the +monitor, you can still make the change by logging into the host of the daemon +whose configuration you'd like to change using ``ceph daemon``. +For example:: + + sudo ceph daemon osd.0 config set debug_osd 0/5 + +See `Subsystem, Log and Debug Settings`_ for details on available settings. + + +Boot Time +========= + +To activate Ceph's debugging output (*i.e.*, ``dout()``) at boot time, you must +add settings to your Ceph configuration file. Subsystems common to each daemon +may be set under ``[global]`` in your configuration file. Subsystems for +particular daemons are set under the daemon section in your configuration file +(*e.g.*, ``[mon]``, ``[osd]``, ``[mds]``). For example:: + + [global] + debug ms = 1/5 + + [mon] + debug mon = 20 + debug paxos = 1/5 + debug auth = 2 + + [osd] + debug osd = 1/5 + debug filestore = 1/5 + debug journal = 1 + debug monc = 5/20 + + [mds] + debug mds = 1 + debug mds balancer = 1 + + +See `Subsystem, Log and Debug Settings`_ for details. + + +Accelerating Log Rotation +========================= + +If your OS disk is relatively full, you can accelerate log rotation by modifying +the Ceph log rotation file at ``/etc/logrotate.d/ceph``. Add a size setting +after the rotation frequency to accelerate log rotation (via cronjob) if your +logs exceed the size setting. For example, the default setting looks like +this:: + + rotate 7 + weekly + compress + sharedscripts + +Modify it by adding a ``size`` setting. :: + + rotate 7 + weekly + size 500M + compress + sharedscripts + +Then, start the crontab editor for your user space. :: + + crontab -e + +Finally, add an entry to check the ``etc/logrotate.d/ceph`` file. :: + + 30 * * * * /usr/sbin/logrotate /etc/logrotate.d/ceph >/dev/null 2>&1 + +The preceding example checks the ``etc/logrotate.d/ceph`` file every 30 minutes. + + +Valgrind +======== + +Debugging may also require you to track down memory and threading issues. +You can run a single daemon, a type of daemon, or the whole cluster with +Valgrind. You should only use Valgrind when developing or debugging Ceph. +Valgrind is computationally expensive, and will slow down your system otherwise. +Valgrind messages are logged to ``stderr``. + + +Subsystem, Log and Debug Settings +================================= + +In most cases, you will enable debug logging output via subsystems. + +Ceph Subsystems +--------------- + +Each subsystem has a logging level for its output logs, and for its logs +in-memory. You may set different values for each of these subsystems by setting +a log file level and a memory level for debug logging. Ceph's logging levels +operate on a scale of ``1`` to ``20``, where ``1`` is terse and ``20`` is +verbose [#]_ . In general, the logs in-memory are not sent to the output log unless: + +- a fatal signal is raised or +- an ``assert`` in source code is triggered or +- upon requested. Please consult `document on admin socket <http://docs.ceph.com/en/latest/man/8/ceph/#daemon>`_ for more details. + +A debug logging setting can take a single value for the log level and the +memory level, which sets them both as the same value. For example, if you +specify ``debug ms = 5``, Ceph will treat it as a log level and a memory level +of ``5``. You may also specify them separately. The first setting is the log +level, and the second setting is the memory level. You must separate them with +a forward slash (/). For example, if you want to set the ``ms`` subsystem's +debug logging level to ``1`` and its memory level to ``5``, you would specify it +as ``debug ms = 1/5``. For example: + + + +.. code-block:: ini + + debug {subsystem} = {log-level}/{memory-level} + #for example + debug mds balancer = 1/20 + + +The following table provides a list of Ceph subsystems and their default log and +memory levels. Once you complete your logging efforts, restore the subsystems +to their default level or to a level suitable for normal operations. + + ++--------------------+-----------+--------------+ +| Subsystem | Log Level | Memory Level | ++====================+===========+==============+ +| ``default`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``lockdep`` | 0 | 1 | ++--------------------+-----------+--------------+ +| ``context`` | 0 | 1 | ++--------------------+-----------+--------------+ +| ``crush`` | 1 | 1 | ++--------------------+-----------+--------------+ +| ``mds`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``mds balancer`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``mds locker`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``mds log`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``mds log expire`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``mds migrator`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``buffer`` | 0 | 1 | ++--------------------+-----------+--------------+ +| ``timer`` | 0 | 1 | ++--------------------+-----------+--------------+ +| ``filer`` | 0 | 1 | ++--------------------+-----------+--------------+ +| ``striper`` | 0 | 1 | ++--------------------+-----------+--------------+ +| ``objecter`` | 0 | 1 | ++--------------------+-----------+--------------+ +| ``rados`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``rbd`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``rbd mirror`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``rbd replay`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``journaler`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``objectcacher`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``client`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``osd`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``optracker`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``objclass`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``filestore`` | 1 | 3 | ++--------------------+-----------+--------------+ +| ``journal`` | 1 | 3 | ++--------------------+-----------+--------------+ +| ``ms`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``mon`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``monc`` | 0 | 10 | ++--------------------+-----------+--------------+ +| ``paxos`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``tp`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``auth`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``crypto`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``finisher`` | 1 | 1 | ++--------------------+-----------+--------------+ +| ``reserver`` | 1 | 1 | ++--------------------+-----------+--------------+ +| ``heartbeatmap`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``perfcounter`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``rgw`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``rgw sync`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``civetweb`` | 1 | 10 | ++--------------------+-----------+--------------+ +| ``javaclient`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``asok`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``throttle`` | 1 | 1 | ++--------------------+-----------+--------------+ +| ``refs`` | 0 | 0 | ++--------------------+-----------+--------------+ +| ``compressor`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``bluestore`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``bluefs`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``bdev`` | 1 | 3 | ++--------------------+-----------+--------------+ +| ``kstore`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``rocksdb`` | 4 | 5 | ++--------------------+-----------+--------------+ +| ``leveldb`` | 4 | 5 | ++--------------------+-----------+--------------+ +| ``memdb`` | 4 | 5 | ++--------------------+-----------+--------------+ +| ``fuse`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``mgr`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``mgrc`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``dpdk`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``eventtrace`` | 1 | 5 | ++--------------------+-----------+--------------+ + + +Logging Settings +---------------- + +Logging and debugging settings are not required in a Ceph configuration file, +but you may override default settings as needed. Ceph supports the following +settings: + + +``log file`` + +:Description: The location of the logging file for your cluster. +:Type: String +:Required: No +:Default: ``/var/log/ceph/$cluster-$name.log`` + + +``log max new`` + +:Description: The maximum number of new log files. +:Type: Integer +:Required: No +:Default: ``1000`` + + +``log max recent`` + +:Description: The maximum number of recent events to include in a log file. +:Type: Integer +:Required: No +:Default: ``10000`` + + +``log to file`` + +:Description: Determines if logging messages should appear in a file. +:Type: Boolean +:Required: No +:Default: ``true`` + + +``log to stderr`` + +:Description: Determines if logging messages should appear in ``stderr``. +:Type: Boolean +:Required: No +:Default: ``false`` + + +``err to stderr`` + +:Description: Determines if error messages should appear in ``stderr``. +:Type: Boolean +:Required: No +:Default: ``true`` + + +``log to syslog`` + +:Description: Determines if logging messages should appear in ``syslog``. +:Type: Boolean +:Required: No +:Default: ``false`` + + +``err to syslog`` + +:Description: Determines if error messages should appear in ``syslog``. +:Type: Boolean +:Required: No +:Default: ``false`` + + +``log flush on exit`` + +:Description: Determines if Ceph should flush the log files after exit. +:Type: Boolean +:Required: No +:Default: ``false`` + + +``clog to monitors`` + +:Description: Determines if ``clog`` messages should be sent to monitors. +:Type: Boolean +:Required: No +:Default: ``true`` + + +``clog to syslog`` + +:Description: Determines if ``clog`` messages should be sent to syslog. +:Type: Boolean +:Required: No +:Default: ``false`` + + +``mon cluster log to syslog`` + +:Description: Determines if the cluster log should be output to the syslog. +:Type: Boolean +:Required: No +:Default: ``false`` + + +``mon cluster log file`` + +:Description: The locations of the cluster's log files. There are two channels in + Ceph: ``cluster`` and ``audit``. This option represents a mapping + from channels to log files, where the log entries of that + channel are sent to. The ``default`` entry is a fallback + mapping for channels not explicitly specified. So, the following + default setting will send cluster log to ``$cluster.log``, and + send audit log to ``$cluster.audit.log``, where ``$cluster`` will + be replaced with the actual cluster name. +:Type: String +:Required: No +:Default: ``default=/var/log/ceph/$cluster.$channel.log,cluster=/var/log/ceph/$cluster.log`` + + + +OSD +--- + + +``osd debug drop ping probability`` + +:Description: ? +:Type: Double +:Required: No +:Default: 0 + + +``osd debug drop ping duration`` + +:Description: +:Type: Integer +:Required: No +:Default: 0 + +``osd debug drop pg create probability`` + +:Description: +:Type: Integer +:Required: No +:Default: 0 + +``osd debug drop pg create duration`` + +:Description: ? +:Type: Double +:Required: No +:Default: 1 + + +``osd min pg log entries`` + +:Description: The minimum number of log entries for placement groups. +:Type: 32-bit Unsigned Integer +:Required: No +:Default: 250 + + +``osd op log threshold`` + +:Description: How many op log messages to show up in one pass. +:Type: Integer +:Required: No +:Default: 5 + + + +Filestore +--------- + +``filestore debug omap check`` + +:Description: Debugging check on synchronization. This is an expensive operation. +:Type: Boolean +:Required: No +:Default: ``false`` + + +MDS +--- + + +``mds debug scatterstat`` + +:Description: Ceph will assert that various recursive stat invariants are true + (for developers only). + +:Type: Boolean +:Required: No +:Default: ``false`` + + +``mds debug frag`` + +:Description: Ceph will verify directory fragmentation invariants when + convenient (developers only). + +:Type: Boolean +:Required: No +:Default: ``false`` + + +``mds debug auth pins`` + +:Description: The debug auth pin invariants (for developers only). +:Type: Boolean +:Required: No +:Default: ``false`` + + +``mds debug subtrees`` + +:Description: The debug subtree invariants (for developers only). +:Type: Boolean +:Required: No +:Default: ``false`` + + + +RADOS Gateway +------------- + + +``rgw log nonexistent bucket`` + +:Description: Should we log a non-existent buckets? +:Type: Boolean +:Required: No +:Default: ``false`` + + +``rgw log object name`` + +:Description: Should an object's name be logged. // man date to see codes (a subset are supported) +:Type: String +:Required: No +:Default: ``%Y-%m-%d-%H-%i-%n`` + + +``rgw log object name utc`` + +:Description: Object log name contains UTC? +:Type: Boolean +:Required: No +:Default: ``false`` + + +``rgw enable ops log`` + +:Description: Enables logging of every RGW operation. +:Type: Boolean +:Required: No +:Default: ``true`` + + +``rgw enable usage log`` + +:Description: Enable logging of RGW's bandwidth usage. +:Type: Boolean +:Required: No +:Default: ``false`` + + +``rgw usage log flush threshold`` + +:Description: Threshold to flush pending log data. +:Type: Integer +:Required: No +:Default: ``1024`` + + +``rgw usage log tick interval`` + +:Description: Flush pending log data every ``s`` seconds. +:Type: Integer +:Required: No +:Default: 30 + + +``rgw intent log object name`` + +:Description: +:Type: String +:Required: No +:Default: ``%Y-%m-%d-%i-%n`` + + +``rgw intent log object name utc`` + +:Description: Include a UTC timestamp in the intent log object name. +:Type: Boolean +:Required: No +:Default: ``false`` + +.. [#] there are levels >20 in some rare cases and that they are extremely verbose. diff --git a/doc/rados/troubleshooting/memory-profiling.rst b/doc/rados/troubleshooting/memory-profiling.rst new file mode 100644 index 000000000..e2396e2fd --- /dev/null +++ b/doc/rados/troubleshooting/memory-profiling.rst @@ -0,0 +1,142 @@ +================== + Memory Profiling +================== + +Ceph MON, OSD and MDS can generate heap profiles using +``tcmalloc``. To generate heap profiles, ensure you have +``google-perftools`` installed:: + + sudo apt-get install google-perftools + +The profiler dumps output to your ``log file`` directory (i.e., +``/var/log/ceph``). See `Logging and Debugging`_ for details. +To view the profiler logs with Google's performance tools, execute the +following:: + + google-pprof --text {path-to-daemon} {log-path/filename} + +For example:: + + $ ceph tell osd.0 heap start_profiler + $ ceph tell osd.0 heap dump + osd.0 tcmalloc heap stats:------------------------------------------------ + MALLOC: 2632288 ( 2.5 MiB) Bytes in use by application + MALLOC: + 499712 ( 0.5 MiB) Bytes in page heap freelist + MALLOC: + 543800 ( 0.5 MiB) Bytes in central cache freelist + MALLOC: + 327680 ( 0.3 MiB) Bytes in transfer cache freelist + MALLOC: + 1239400 ( 1.2 MiB) Bytes in thread cache freelists + MALLOC: + 1142936 ( 1.1 MiB) Bytes in malloc metadata + MALLOC: ------------ + MALLOC: = 6385816 ( 6.1 MiB) Actual memory used (physical + swap) + MALLOC: + 0 ( 0.0 MiB) Bytes released to OS (aka unmapped) + MALLOC: ------------ + MALLOC: = 6385816 ( 6.1 MiB) Virtual address space used + MALLOC: + MALLOC: 231 Spans in use + MALLOC: 56 Thread heaps in use + MALLOC: 8192 Tcmalloc page size + ------------------------------------------------ + Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()). + Bytes released to the OS take up virtual address space but no physical memory. + $ google-pprof --text \ + /usr/bin/ceph-osd \ + /var/log/ceph/ceph-osd.0.profile.0001.heap + Total: 3.7 MB + 1.9 51.1% 51.1% 1.9 51.1% ceph::log::Log::create_entry + 1.8 47.3% 98.4% 1.8 47.3% std::string::_Rep::_S_create + 0.0 0.4% 98.9% 0.0 0.6% SimpleMessenger::add_accept_pipe + 0.0 0.4% 99.2% 0.0 0.6% decode_message + ... + +Another heap dump on the same daemon will add another file. It is +convenient to compare to a previous heap dump to show what has grown +in the interval. For instance:: + + $ google-pprof --text --base out/osd.0.profile.0001.heap \ + ceph-osd out/osd.0.profile.0003.heap + Total: 0.2 MB + 0.1 50.3% 50.3% 0.1 50.3% ceph::log::Log::create_entry + 0.1 46.6% 96.8% 0.1 46.6% std::string::_Rep::_S_create + 0.0 0.9% 97.7% 0.0 26.1% ReplicatedPG::do_op + 0.0 0.8% 98.5% 0.0 0.8% __gnu_cxx::new_allocator::allocate + +Refer to `Google Heap Profiler`_ for additional details. + +Once you have the heap profiler installed, start your cluster and +begin using the heap profiler. You may enable or disable the heap +profiler at runtime, or ensure that it runs continuously. For the +following commandline usage, replace ``{daemon-type}`` with ``mon``, +``osd`` or ``mds``, and replace ``{daemon-id}`` with the OSD number or +the MON or MDS id. + + +Starting the Profiler +--------------------- + +To start the heap profiler, execute the following:: + + ceph tell {daemon-type}.{daemon-id} heap start_profiler + +For example:: + + ceph tell osd.1 heap start_profiler + +Alternatively the profile can be started when the daemon starts +running if the ``CEPH_HEAP_PROFILER_INIT=true`` variable is found in +the environment. + +Printing Stats +-------------- + +To print out statistics, execute the following:: + + ceph tell {daemon-type}.{daemon-id} heap stats + +For example:: + + ceph tell osd.0 heap stats + +.. note:: Printing stats does not require the profiler to be running and does + not dump the heap allocation information to a file. + + +Dumping Heap Information +------------------------ + +To dump heap information, execute the following:: + + ceph tell {daemon-type}.{daemon-id} heap dump + +For example:: + + ceph tell mds.a heap dump + +.. note:: Dumping heap information only works when the profiler is running. + + +Releasing Memory +---------------- + +To release memory that ``tcmalloc`` has allocated but which is not being used by +the Ceph daemon itself, execute the following:: + + ceph tell {daemon-type}{daemon-id} heap release + +For example:: + + ceph tell osd.2 heap release + + +Stopping the Profiler +--------------------- + +To stop the heap profiler, execute the following:: + + ceph tell {daemon-type}.{daemon-id} heap stop_profiler + +For example:: + + ceph tell osd.0 heap stop_profiler + +.. _Logging and Debugging: ../log-and-debug +.. _Google Heap Profiler: http://goog-perftools.sourceforge.net/doc/heap_profiler.html diff --git a/doc/rados/troubleshooting/troubleshooting-mon.rst b/doc/rados/troubleshooting/troubleshooting-mon.rst new file mode 100644 index 000000000..dc575f761 --- /dev/null +++ b/doc/rados/troubleshooting/troubleshooting-mon.rst @@ -0,0 +1,613 @@ +.. _rados-troubleshooting-mon: + +================================= + Troubleshooting Monitors +================================= + +.. index:: monitor, high availability + +When a cluster encounters monitor-related troubles there's a tendency to +panic, and sometimes with good reason. Losing one or more monitors doesn't +necessarily mean that your cluster is down, so long as a majority are up, +running, and form a quorum. +Regardless of how bad the situation is, the first thing you should do is to +calm down, take a breath, and step through the below troubleshooting steps. + + +Initial Troubleshooting +======================== + + +**Are the monitors running?** + + First of all, we need to make sure the monitor (*mon*) daemon processes + (``ceph-mon``) are running. You would be amazed by how often Ceph admins + forget to start the mons, or to restart them after an upgrade. There's no + shame, but try to not lose a couple of hours looking for a deeper problem. + When running Kraken or later releases also ensure that the manager + daemons (``ceph-mgr``) are running, usually alongside each ``ceph-mon``. + + +**Are you able to reach to the mon nodes?** + + Doesn't happen often, but sometimes there are ``iptables`` rules that + block accesse to mon nodes or TCP ports. These may be leftovers from + prior stress-testing or rule development. Try SSHing into + the server and, if that succeeds, try connecting to the monitor's ports + (``tcp/3300`` and ``tcp/6789``) using a ``telnet``, ``nc``, or similar tools. + +**Does ceph -s run and obtain a reply from the cluster?** + + If the answer is yes then your cluster is up and running. One thing you + can take for granted is that the monitors will only answer to a ``status`` + request if there is a formed quorum. Also check that at least one ``mgr`` + daemon is reported as running, ideally all of them. + + If ``ceph -s`` hangs without obtaining a reply from the cluster + or showing ``fault`` messages, then it is likely that your monitors + are either down completely or just a fraction are up -- a fraction + insufficient to form a majority quorum. This check will connect to an + arbitrary mon; in rare cases it may be illuminating to bind to specific + mons in sequence by adding e.g. ``-m mymon1`` to the command. + +**What if ceph -s doesn't come back?** + + If you haven't gone through all the steps so far, please go back and do. + + You can contact each monitor individually asking them for their status, + regardless of a quorum being formed. This can be achieved using + ``ceph tell mon.ID mon_status``, ID being the monitor's identifier. You should + perform this for each monitor in the cluster. In section `Understanding + mon_status`_ we will explain how to interpret the output of this command. + + You may instead SSH into each mon node and query the daemon's admin socket. + + +Using the monitor's admin socket +================================= + +The admin socket allows you to interact with a given daemon directly using a +Unix socket file. This file can be found in your monitor's ``run`` directory. +By default, the admin socket will be kept in ``/var/run/ceph/ceph-mon.ID.asok`` +but this may be elsewhere if you have overridden the default directory. If you +don't find it there, check your ``ceph.conf`` for an alternative path or +run:: + + ceph-conf --name mon.ID --show-config-value admin_socket + +Bear in mind that the admin socket will be available only while the monitor +daemon is running. When the monitor is properly shut down, the admin socket +will be removed. If however the monitor is not running and the admin socket +persists, it is likely that the monitor was improperly shut down. +Regardless, if the monitor is not running, you will not be able to use the +admin socket, with ``ceph`` likely returning ``Error 111: Connection Refused``. + +Accessing the admin socket is as simple as running ``ceph tell`` on the daemon +you are interested in. For example:: + + ceph tell mon.<id> mon_status + +Under the hood, this passes the command ``help`` to the running MON daemon +``<id>`` via its "admin socket", which is a file ending in ``.asok`` +somewhere under ``/var/run/ceph``. Once you know the full path to the file, +you can even do this yourself:: + + ceph --admin-daemon <full_path_to_asok_file> <command> + +Using ``help`` as the command to the ``ceph`` tool will show you the +supported commands available through the admin socket. Please take a look +at ``config get``, ``config show``, ``mon stat`` and ``quorum_status``, +as those can be enlightening when troubleshooting a monitor. + + +Understanding mon_status +========================= + +``mon_status`` can always be obtained via the admin socket. This command will +output a multitude of information about the monitor, including the same output +you would get with ``quorum_status``. + +Take the following example output of ``ceph tell mon.c mon_status``:: + + + { "name": "c", + "rank": 2, + "state": "peon", + "election_epoch": 38, + "quorum": [ + 1, + 2], + "outside_quorum": [], + "extra_probe_peers": [], + "sync_provider": [], + "monmap": { "epoch": 3, + "fsid": "5c4e9d53-e2e1-478a-8061-f543f8be4cf8", + "modified": "2013-10-30 04:12:01.945629", + "created": "2013-10-29 14:14:41.914786", + "mons": [ + { "rank": 0, + "name": "a", + "addr": "127.0.0.1:6789\/0"}, + { "rank": 1, + "name": "b", + "addr": "127.0.0.1:6790\/0"}, + { "rank": 2, + "name": "c", + "addr": "127.0.0.1:6795\/0"}]}} + +A couple of things are obvious: we have three monitors in the monmap (*a*, *b* +and *c*), the quorum is formed by only two monitors, and *c* is in the quorum +as a *peon*. + +Which monitor is out of the quorum? + + The answer would be **a**. + +Why? + + Take a look at the ``quorum`` set. We have two monitors in this set: *1* + and *2*. These are not monitor names. These are monitor ranks, as established + in the current monmap. We are missing the monitor with rank 0, and according + to the monmap that would be ``mon.a``. + +By the way, how are ranks established? + + Ranks are (re)calculated whenever you add or remove monitors and follow a + simple rule: the **greater** the ``IP:PORT`` combination, the **lower** the + rank is. In this case, considering that ``127.0.0.1:6789`` is lower than all + the remaining ``IP:PORT`` combinations, ``mon.a`` has rank 0. + +Most Common Monitor Issues +=========================== + +Have Quorum but at least one Monitor is down +--------------------------------------------- + +When this happens, depending on the version of Ceph you are running, +you should be seeing something similar to:: + + $ ceph health detail + [snip] + mon.a (rank 0) addr 127.0.0.1:6789/0 is down (out of quorum) + +How to troubleshoot this? + + First, make sure ``mon.a`` is running. + + Second, make sure you are able to connect to ``mon.a``'s node from the + other mon nodes. Check the TCP ports as well. Check ``iptables`` and + ``nf_conntrack`` on all nodes and ensure that you are not + dropping/rejecting connections. + + If this initial troubleshooting doesn't solve your problems, then it's + time to go deeper. + + First, check the problematic monitor's ``mon_status`` via the admin + socket as explained in `Using the monitor's admin socket`_ and + `Understanding mon_status`_. + + If the monitor is out of the quorum, its state should be one of + ``probing``, ``electing`` or ``synchronizing``. If it happens to be either + ``leader`` or ``peon``, then the monitor believes to be in quorum, while + the remaining cluster is sure it is not; or maybe it got into the quorum + while we were troubleshooting the monitor, so check you ``ceph -s`` again + just to make sure. Proceed if the monitor is not yet in the quorum. + +What if the state is ``probing``? + + This means the monitor is still looking for the other monitors. Every time + you start a monitor, the monitor will stay in this state for some time + while trying to connect the rest of the monitors specified in the ``monmap``. + The time a monitor will spend in this state can vary. For instance, when on + a single-monitor cluster (never do this in production), + the monitor will pass through the probing state almost instantaneously. + In a multi-monitor cluster, the monitors will stay in this state until they + find enough monitors to form a quorum -- this means that if you have 2 out + of 3 monitors down, the one remaining monitor will stay in this state + indefinitely until you bring one of the other monitors up. + + If you have a quorum the starting daemon should be able to find the + other monitors quickly, as long as they can be reached. If your + monitor is stuck probing and you have gone through with all the communication + troubleshooting, then there is a fair chance that the monitor is trying + to reach the other monitors on a wrong address. ``mon_status`` outputs the + ``monmap`` known to the monitor: check if the other monitor's locations + match reality. If they don't, jump to + `Recovering a Monitor's Broken monmap`_; if they do, then it may be related + to severe clock skews amongst the monitor nodes and you should refer to + `Clock Skews`_ first, but if that doesn't solve your problem then it is + the time to prepare some logs and reach out to the community (please refer + to `Preparing your logs`_ on how to best prepare your logs). + + +What if state is ``electing``? + + This means the monitor is in the middle of an election. With recent Ceph + releases these typically complete quickly, but at times the monitors can + get stuck in what is known as an *election storm*. This can indicate + clock skew among the monitor nodes; jump to + `Clock Skews`_ for more information. If all your clocks are properly + synchronized, you should search the mailing lists and tracker. + This is not a state that is likely to persist and aside from + (*really*) old bugs there is not an obvious reason besides clock skews on + why this would happen. Worst case, if there are enough surviving mons, + down the problematic one while you investigate. + +What if state is ``synchronizing``? + + This means the monitor is catching up with the rest of the cluster in + order to join the quorum. Time to synchronize is a function of the size + of your monitor store and thus of cluster size and state, so if you have a + large or degraded cluster this may take a while. + + If you notice that the monitor jumps from ``synchronizing`` to + ``electing`` and then back to ``synchronizing``, then you do have a + problem: the cluster state may be advancing (i.e., generating new maps) + too fast for the synchronization process to keep up. This was a more common + thing in early days (Cuttlefish), but since then the synchronization process + has been refactored and enhanced to avoid this dynamic. If you experience + this in later versions please let us know via a bug tracker. And bring some logs + (see `Preparing your logs`_). + +What if state is ``leader`` or ``peon``? + + This should not happen: famous last words. If it does, however, it likely + has a lot to do with clock skew -- see `Clock Skews`_. If you are not + suffering from clock skew, then please prepare your logs (see + `Preparing your logs`_) and reach out to the community. + + +Recovering a Monitor's Broken ``monmap`` +---------------------------------------- + +This is how a ``monmap`` usually looks, depending on the number of +monitors:: + + + epoch 3 + fsid 5c4e9d53-e2e1-478a-8061-f543f8be4cf8 + last_changed 2013-10-30 04:12:01.945629 + created 2013-10-29 14:14:41.914786 + 0: 127.0.0.1:6789/0 mon.a + 1: 127.0.0.1:6790/0 mon.b + 2: 127.0.0.1:6795/0 mon.c + +This may not be what you have however. For instance, in some versions of +early Cuttlefish there was a bug that could cause your ``monmap`` +to be nullified. Completely filled with zeros. This means that not even +``monmaptool`` would be able to make sense of cold, hard, inscrutable zeros. +It's also possible to end up with a monitor with a severely outdated monmap, +notably if the node has been down for months while you fight with your vendor's +TAC. The subject ``ceph-mon`` daemon might be unable to find the surviving +monitors (e.g., say ``mon.c`` is down; you add a new monitor ``mon.d``, +then remove ``mon.a``, then add a new monitor ``mon.e`` and remove +``mon.b``; you will end up with a totally different monmap from the one +``mon.c`` knows). + +In this situation you have two possible solutions: + +Scrap the monitor and redeploy + + You should only take this route if you are positive that you won't + lose the information kept by that monitor; that you have other monitors + and that they are running just fine so that your new monitor is able + to synchronize from the remaining monitors. Keep in mind that destroying + a monitor, if there are no other copies of its contents, may lead to + loss of data. + +Inject a monmap into the monitor + + Usually the safest path. You should grab the monmap from the remaining + monitors and inject it into the monitor with the corrupted/lost monmap. + + These are the basic steps: + + 1. Is there a formed quorum? If so, grab the monmap from the quorum:: + + $ ceph mon getmap -o /tmp/monmap + + 2. No quorum? Grab the monmap directly from another monitor (this + assumes the monitor you are grabbing the monmap from has id ID-FOO + and has been stopped):: + + $ ceph-mon -i ID-FOO --extract-monmap /tmp/monmap + + 3. Stop the monitor you are going to inject the monmap into. + + 4. Inject the monmap:: + + $ ceph-mon -i ID --inject-monmap /tmp/monmap + + 5. Start the monitor + + Please keep in mind that the ability to inject monmaps is a powerful + feature that can cause havoc with your monitors if misused as it will + overwrite the latest, existing monmap kept by the monitor. + + +Clock Skews +------------ + +Monitor operation can be severely affected by clock skew among the quorum's +mons, as the PAXOS consensus algorithm requires tight time alignment. +Skew can result in weird behavior with no obvious +cause. To avoid such issues, you must run a clock synchronization tool +on your monitor nodes: ``Chrony`` or the legacy ``ntpd``. Be sure to +configure the mon nodes with the `iburst` option and multiple peers: + +* Each other +* Internal ``NTP`` servers +* Multiple external, public pool servers + +For good measure, *all* nodes in your cluster should also sync against +internal and external servers, and perhaps even your mons. ``NTP`` servers +should run on bare metal; VM virtualized clocks are not suitable for steady +timekeeping. Visit `https://www.ntp.org <https://www.ntp.org>`_ for more info. Your +organization may already have quality internal ``NTP`` servers you can use. +Sources for ``NTP`` server appliances include: + +* Microsemi (formerly Symmetricom) `https://microsemi.com <https://www.microsemi.com/product-directory/3425-timing-synchronization>`_ +* EndRun `https://endruntechnologies.com <https://endruntechnologies.com/products/ntp-time-servers>`_ +* Netburner `https://www.netburner.com <https://www.netburner.com/products/network-time-server/pk70-ex-ntp-network-time-server>`_ + + +What's the maximum tolerated clock skew? + + By default the monitors will allow clocks to drift up to 0.05 seconds (50 ms). + + +Can I increase the maximum tolerated clock skew? + + The maximum tolerated clock skew is configurable via the + ``mon-clock-drift-allowed`` option, and + although you *CAN* you almost certainly *SHOULDN'T*. The clock skew mechanism + is in place because clock-skewed monitors are liely to misbehave. We, as + developers and QA aficionados, are comfortable with the current default + value, as it will alert the user before the monitors get out hand. Changing + this value may cause unforeseen effects on the + stability of the monitors and overall cluster health. + +How do I know there's a clock skew? + + The monitors will warn you via the cluster status ``HEALTH_WARN``. ``ceph health + detail`` or ``ceph status`` should show something like:: + + mon.c addr 10.10.0.1:6789/0 clock skew 0.08235s > max 0.05s (latency 0.0045s) + + That means that ``mon.c`` has been flagged as suffering from a clock skew. + + On releases beginning with Luminous you can issue the + ``ceph time-sync-status`` command to check status. Note that the lead mon + is typically the one with the numerically lowest IP address. It will always + show ``0``: the reported offsets of other mons are relative to + the lead mon, not to any external reference source. + + +What should I do if there's a clock skew? + + Synchronize your clocks. Running an NTP client may help. If you are already + using one and you hit this sort of issues, check if you are using some NTP + server remote to your network and consider hosting your own NTP server on + your network. This last option tends to reduce the amount of issues with + monitor clock skews. + + +Client Can't Connect or Mount +------------------------------ + +Check your IP tables. Some OS install utilities add a ``REJECT`` rule to +``iptables``. The rule rejects all clients trying to connect to the host except +for ``ssh``. If your monitor host's IP tables have such a ``REJECT`` rule in +place, clients connecting from a separate node will fail to mount with a timeout +error. You need to address ``iptables`` rules that reject clients trying to +connect to Ceph daemons. For example, you would need to address rules that look +like this appropriately:: + + REJECT all -- anywhere anywhere reject-with icmp-host-prohibited + +You may also need to add rules to IP tables on your Ceph hosts to ensure +that clients can access the ports associated with your Ceph monitors (i.e., port +6789 by default) and Ceph OSDs (i.e., 6800 through 7300 by default). For +example:: + + iptables -A INPUT -m multiport -p tcp -s {ip-address}/{netmask} --dports 6789,6800:7300 -j ACCEPT + +Monitor Store Failures +====================== + +Symptoms of store corruption +---------------------------- + +Ceph monitor stores the :term:`Cluster Map` in a key/value store such as LevelDB. If +a monitor fails due to the key/value store corruption, following error messages +might be found in the monitor log:: + + Corruption: error in middle of record + +or:: + + Corruption: 1 missing files; e.g.: /var/lib/ceph/mon/mon.foo/store.db/1234567.ldb + +Recovery using healthy monitor(s) +--------------------------------- + +If there are any survivors, we can always :ref:`replace <adding-and-removing-monitors>` the corrupted one with a +new one. After booting up, the new joiner will sync up with a healthy +peer, and once it is fully sync'ed, it will be able to serve the clients. + +.. _mon-store-recovery-using-osds: + +Recovery using OSDs +------------------- + +But what if all monitors fail at the same time? Since users are encouraged to +deploy at least three (and preferably five) monitors in a Ceph cluster, the chance of simultaneous +failure is rare. But unplanned power-downs in a data center with improperly +configured disk/fs settings could fail the underlying file system, and hence +kill all the monitors. In this case, we can recover the monitor store with the +information stored in OSDs.:: + + ms=/root/mon-store + mkdir $ms + + # collect the cluster map from stopped OSDs + for host in $hosts; do + rsync -avz $ms/. user@$host:$ms.remote + rm -rf $ms + ssh user@$host <<EOF + for osd in /var/lib/ceph/osd/ceph-*; do + ceph-objectstore-tool --data-path \$osd --no-mon-config --op update-mon-db --mon-store-path $ms.remote + done + EOF + rsync -avz user@$host:$ms.remote/. $ms + done + + # rebuild the monitor store from the collected map, if the cluster does not + # use cephx authentication, we can skip the following steps to update the + # keyring with the caps, and there is no need to pass the "--keyring" option. + # i.e. just use "ceph-monstore-tool $ms rebuild" instead + ceph-authtool /path/to/admin.keyring -n mon. \ + --cap mon 'allow *' + ceph-authtool /path/to/admin.keyring -n client.admin \ + --cap mon 'allow *' --cap osd 'allow *' --cap mds 'allow *' + # add one or more ceph-mgr's key to the keyring. in this case, an encoded key + # for mgr.x is added, you can find the encoded key in + # /etc/ceph/${cluster}.${mgr_name}.keyring on the machine where ceph-mgr is + # deployed + ceph-authtool /path/to/admin.keyring --add-key 'AQDN8kBe9PLWARAAZwxXMr+n85SBYbSlLcZnMA==' -n mgr.x \ + --cap mon 'allow profile mgr' --cap osd 'allow *' --cap mds 'allow *' + # if your monitors' ids are not single characters like 'a', 'b', 'c', please + # specify them in the command line by passing them as arguments of the "--mon-ids" + # option. if you are not sure, please check your ceph.conf to see if there is any + # sections named like '[mon.foo]'. don't pass the "--mon-ids" option, if you are + # using DNS SRV for looking up monitors. + ceph-monstore-tool $ms rebuild -- --keyring /path/to/admin.keyring --mon-ids alpha beta gamma + + # make a backup of the corrupted store.db just in case! repeat for + # all monitors. + mv /var/lib/ceph/mon/mon.foo/store.db /var/lib/ceph/mon/mon.foo/store.db.corrupted + + # move rebuild store.db into place. repeat for all monitors. + mv $ms/store.db /var/lib/ceph/mon/mon.foo/store.db + chown -R ceph:ceph /var/lib/ceph/mon/mon.foo/store.db + +The steps above + +#. collect the map from all OSD hosts, +#. then rebuild the store, +#. fill the entities in keyring file with appropriate caps +#. replace the corrupted store on ``mon.foo`` with the recovered copy. + +Known limitations +~~~~~~~~~~~~~~~~~ + +Following information are not recoverable using the steps above: + +- **some added keyrings**: all the OSD keyrings added using ``ceph auth add`` command + are recovered from the OSD's copy. And the ``client.admin`` keyring is imported + using ``ceph-monstore-tool``. But the MDS keyrings and other keyrings are missing + in the recovered monitor store. You might need to re-add them manually. + +- **creating pools**: If any RADOS pools were in the process of being creating, that state is lost. The recovery tool assumes that all pools have been created. If there are PGs that are stuck in the 'unknown' after the recovery for a partially created pool, you can force creation of the *empty* PG with the ``ceph osd force-create-pg`` command. Note that this will create an *empty* PG, so only do this if you know the pool is empty. + +- **MDS Maps**: the MDS maps are lost. + + + +Everything Failed! Now What? +============================= + +Reaching out for help +---------------------- + +You can find us on IRC at #ceph and #ceph-devel at OFTC (server irc.oftc.net) +and on ``ceph-devel@vger.kernel.org`` and ``ceph-users@lists.ceph.com``. Make +sure you have grabbed your logs and have them ready if someone asks: the faster +the interaction and lower the latency in response, the better chances everyone's +time is optimized. + + +Preparing your logs +--------------------- + +Monitor logs are, by default, kept in ``/var/log/ceph/ceph-mon.FOO.log*``. We +may want them. However, your logs may not have the necessary information. If +you don't find your monitor logs at their default location, you can check +where they should be by running:: + + ceph-conf --name mon.FOO --show-config-value log_file + +The amount of information in the logs are subject to the debug levels being +enforced by your configuration files. If you have not enforced a specific +debug level then Ceph is using the default levels and your logs may not +contain important information to track down you issue. +A first step in getting relevant information into your logs will be to raise +debug levels. In this case we will be interested in the information from the +monitor. +Similarly to what happens on other components, different parts of the monitor +will output their debug information on different subsystems. + +You will have to raise the debug levels of those subsystems more closely +related to your issue. This may not be an easy task for someone unfamiliar +with troubleshooting Ceph. For most situations, setting the following options +on your monitors will be enough to pinpoint a potential source of the issue:: + + debug mon = 10 + debug ms = 1 + +If we find that these debug levels are not enough, there's a chance we may +ask you to raise them or even define other debug subsystems to obtain infos +from -- but at least we started off with some useful information, instead +of a massively empty log without much to go on with. + +Do I need to restart a monitor to adjust debug levels? +------------------------------------------------------ + +No. You may do it in one of two ways: + +You have quorum + + Either inject the debug option into the monitor you want to debug:: + + ceph tell mon.FOO config set debug_mon 10/10 + + or into all monitors at once:: + + ceph tell mon.* config set debug_mon 10/10 + +No quorum + + Use the monitor's admin socket and directly adjust the configuration + options:: + + ceph daemon mon.FOO config set debug_mon 10/10 + + +Going back to default values is as easy as rerunning the above commands +using the debug level ``1/10`` instead. You can check your current +values using the admin socket and the following commands:: + + ceph daemon mon.FOO config show + +or:: + + ceph daemon mon.FOO config get 'OPTION_NAME' + + +Reproduced the problem with appropriate debug levels. Now what? +---------------------------------------------------------------- + +Ideally you would send us only the relevant portions of your logs. +We realise that figuring out the corresponding portion may not be the +easiest of tasks. Therefore, we won't hold it to you if you provide the +full log, but common sense should be employed. If your log has hundreds +of thousands of lines, it may get tricky to go through the whole thing, +specially if we are not aware at which point, whatever your issue is, +happened. For instance, when reproducing, keep in mind to write down +current time and date and to extract the relevant portions of your logs +based on that. + +Finally, you should reach out to us on the mailing lists, on IRC or file +a new issue on the `tracker`_. + +.. _tracker: http://tracker.ceph.com/projects/ceph/issues/new diff --git a/doc/rados/troubleshooting/troubleshooting-osd.rst b/doc/rados/troubleshooting/troubleshooting-osd.rst new file mode 100644 index 000000000..cc852d73d --- /dev/null +++ b/doc/rados/troubleshooting/troubleshooting-osd.rst @@ -0,0 +1,620 @@ +====================== + Troubleshooting OSDs +====================== + +Before troubleshooting your OSDs, first check your monitors and network. If +you execute ``ceph health`` or ``ceph -s`` on the command line and Ceph shows +``HEALTH_OK``, it means that the monitors have a quorum. +If you don't have a monitor quorum or if there are errors with the monitor +status, `address the monitor issues first <../troubleshooting-mon>`_. +Check your networks to ensure they +are running properly, because networks may have a significant impact on OSD +operation and performance. Look for dropped packets on the host side +and CRC errors on the switch side. + +Obtaining Data About OSDs +========================= + +A good first step in troubleshooting your OSDs is to obtain topology information in +addition to the information you collected while `monitoring your OSDs`_ +(e.g., ``ceph osd tree``). + + +Ceph Logs +--------- + +If you haven't changed the default path, you can find Ceph log files at +``/var/log/ceph``:: + + ls /var/log/ceph + +If you don't see enough log detail you can change your logging level. See +`Logging and Debugging`_ for details to ensure that Ceph performs adequately +under high logging volume. + + +Admin Socket +------------ + +Use the admin socket tool to retrieve runtime information. For details, list +the sockets for your Ceph daemons:: + + ls /var/run/ceph + +Then, execute the following, replacing ``{daemon-name}`` with an actual +daemon (e.g., ``osd.0``):: + + ceph daemon osd.0 help + +Alternatively, you can specify a ``{socket-file}`` (e.g., something in ``/var/run/ceph``):: + + ceph daemon {socket-file} help + +The admin socket, among other things, allows you to: + +- List your configuration at runtime +- Dump historic operations +- Dump the operation priority queue state +- Dump operations in flight +- Dump perfcounters + +Display Freespace +----------------- + +Filesystem issues may arise. To display your file system's free space, execute +``df``. :: + + df -h + +Execute ``df --help`` for additional usage. + +I/O Statistics +-------------- + +Use `iostat`_ to identify I/O-related issues. :: + + iostat -x + +Diagnostic Messages +------------------- + +To retrieve diagnostic messages from the kernel, use ``dmesg`` with ``less``, ``more``, ``grep`` +or ``tail``. For example:: + + dmesg | grep scsi + +Stopping w/out Rebalancing +========================== + +Periodically, you may need to perform maintenance on a subset of your cluster, +or resolve a problem that affects a failure domain (e.g., a rack). If you do not +want CRUSH to automatically rebalance the cluster as you stop OSDs for +maintenance, set the cluster to ``noout`` first:: + + ceph osd set noout + +On Luminous or newer releases it is safer to set the flag only on affected OSDs. +You can do this individually :: + + ceph osd add-noout osd.0 + ceph osd rm-noout osd.0 + +Or an entire CRUSH bucket at a time. Say you're going to take down +``prod-ceph-data1701`` to add RAM :: + + ceph osd set-group noout prod-ceph-data1701 + +Once the flag is set you can stop the OSDs and any other colocated Ceph +services within the failure domain that requires maintenance work. :: + + systemctl stop ceph\*.service ceph\*.target + +.. note:: Placement groups within the OSDs you stop will become ``degraded`` + while you are addressing issues with within the failure domain. + +Once you have completed your maintenance, restart the OSDs and any other +daemons. If you rebooted the host as part of the maintenance, these should +come back on their own without intervention. :: + + sudo systemctl start ceph.target + +Finally, you must unset the cluster-wide``noout`` flag:: + + ceph osd unset noout + ceph osd unset-group noout prod-ceph-data1701 + +Note that most Linux distributions that Ceph supports today employ ``systemd`` +for service management. For other or older operating systems you may need +to issue equivalent ``service`` or ``start``/``stop`` commands. + +.. _osd-not-running: + +OSD Not Running +=============== + +Under normal circumstances, simply restarting the ``ceph-osd`` daemon will +allow it to rejoin the cluster and recover. + +An OSD Won't Start +------------------ + +If you start your cluster and an OSD won't start, check the following: + +- **Configuration File:** If you were not able to get OSDs running from + a new installation, check your configuration file to ensure it conforms + (e.g., ``host`` not ``hostname``, etc.). + +- **Check Paths:** Check the paths in your configuration, and the actual + paths themselves for data and metadata (journals, WAL, DB). If you separate the OSD data from + the metadata and there are errors in your configuration file or in the + actual mounts, you may have trouble starting OSDs. If you want to store the + metadata on a separate block device, you should partition or LVM your + drive and assign one partition per OSD. + +- **Check Max Threadcount:** If you have a node with a lot of OSDs, you may be + hitting the default maximum number of threads (e.g., usually 32k), especially + during recovery. You can increase the number of threads using ``sysctl`` to + see if increasing the maximum number of threads to the maximum possible + number of threads allowed (i.e., 4194303) will help. For example:: + + sysctl -w kernel.pid_max=4194303 + + If increasing the maximum thread count resolves the issue, you can make it + permanent by including a ``kernel.pid_max`` setting in a file under ``/etc/sysctl.d`` or + within the master ``/etc/sysctl.conf`` file. For example:: + + kernel.pid_max = 4194303 + +- **Check ``nf_conntrack``:** This connection tracking and limiting system + is the bane of many production Ceph clusters, and can be insidious in that + everything is fine at first. As cluster topology and client workload + grow, mysterious and intermittent connection failures and performance + glitches manifest, becoming worse over time and at certain times of day. + Check ``syslog`` history for table fillage events. You can mitigate this + bother by raising ``nf_conntrack_max`` to a much higher value via ``sysctl``. + Be sure to raise ``nf_conntrack_buckets`` accordingly to + ``nf_conntrack_max / 4``, which may require action outside of ``sysctl`` e.g. + ``"echo 131072 > /sys/module/nf_conntrack/parameters/hashsize`` + More interdictive but fussier is to blacklist the associated kernel modules + to disable processing altogether. This is fragile in that the modules + vary among kernel versions, as does the order in which they must be listed. + Even when blacklisted there are situations in which ``iptables`` or ``docker`` + may activate connection tracking anyway, so a "set and forget" strategy for + the tunables is advised. On modern systems this will not consume appreciable + resources. + +- **Kernel Version:** Identify the kernel version and distribution you + are using. Ceph uses some third party tools by default, which may be + buggy or may conflict with certain distributions and/or kernel + versions (e.g., Google ``gperftools`` and ``TCMalloc``). Check the + `OS recommendations`_ and the release notes for each Ceph version + to ensure you have addressed any issues related to your kernel. + +- **Segment Fault:** If there is a segment fault, increase log levels + and start the problematic daemon(s) again. If segment faults recur, + search the Ceph bug tracker `https://tracker.ceph/com/projects/ceph <https://tracker.ceph.com/projects/ceph/>`_ + and the ``dev`` and ``ceph-users`` mailing list archives `https://ceph.io/resources <https://ceph.io/resources>`_. + If this is truly a new and unique + failure, post to the ``dev`` email list and provide the specific Ceph + release being run, ``ceph.conf`` (with secrets XXX'd out), + your monitor status output and excerpts from your log file(s). + +An OSD Failed +------------- + +When a ``ceph-osd`` process dies, surviving ``ceph-osd`` daemons will report +to the mons that it appears down, which will in turn surface the new status +via the ``ceph health`` command:: + + ceph health + HEALTH_WARN 1/3 in osds are down + +Specifically, you will get a warning whenever there are OSDs marked ``in`` +and ``down``. You can identify which are ``down`` with:: + + ceph health detail + HEALTH_WARN 1/3 in osds are down + osd.0 is down since epoch 23, last address 192.168.106.220:6800/11080 + +or :: + + ceph osd tree down + +If there is a drive +failure or other fault preventing ``ceph-osd`` from functioning or +restarting, an error message should be present in its log file under +``/var/log/ceph``. + +If the daemon stopped because of a heartbeat failure or ``suicide timeout``, +the underlying drive or filesystem may be unresponsive. Check ``dmesg`` +and `syslog` output for drive or other kernel errors. You may need to +specify something like ``dmesg -T`` to get timestamps, otherwise it's +easy to mistake old errors for new. + +If the problem is a software error (failed assertion or other +unexpected error), search the archives and tracker as above, and +report it to the `ceph-devel`_ email list if there's no clear fix or +existing bug. + +.. _no-free-drive-space: + +No Free Drive Space +------------------- + +Ceph prevents you from writing to a full OSD so that you don't lose data. +In an operational cluster, you should receive a warning when your cluster's OSDs +and pools approach the full ratio. The ``mon osd full ratio`` defaults to +``0.95``, or 95% of capacity before it stops clients from writing data. +The ``mon osd backfillfull ratio`` defaults to ``0.90``, or 90 % of +capacity above which backfills will not start. The +OSD nearfull ratio defaults to ``0.85``, or 85% of capacity +when it generates a health warning. + +Note that individual OSDs within a cluster will vary in how much data Ceph +allocates to them. This utilization can be displayed for each OSD with :: + + ceph osd df + +Overall cluster / pool fullness can be checked with :: + + ceph df + +Pay close attention to the **most full** OSDs, not the percentage of raw space +used as reported by ``ceph df``. It only takes one outlier OSD filling up to +fail writes to its pool. The space available to each pool as reported by +``ceph df`` considers the ratio settings relative to the *most full* OSD that +is part of a given pool. The distribution can be flattened by progressively +moving data from overfull or to underfull OSDs using the ``reweight-by-utilization`` +command. With Ceph releases beginning with later revisions of Luminous one can also +exploit the ``ceph-mgr`` ``balancer`` module to perform this task automatically +and rather effectively. + +The ratios can be adjusted: + +:: + + ceph osd set-nearfull-ratio <float[0.0-1.0]> + ceph osd set-full-ratio <float[0.0-1.0]> + ceph osd set-backfillfull-ratio <float[0.0-1.0]> + +Full cluster issues can arise when an OSD fails either as a test or organically +within small and/or very full or unbalanced cluster. When an OSD or node +holds an outsize percentage of the cluster's data, the ``nearfull`` and ``full`` +ratios may be exceeded as a result of component failures or even natural growth. +If you are testing how Ceph reacts to OSD failures on a small +cluster, you should leave ample free disk space and consider temporarily +lowering the OSD ``full ratio``, OSD ``backfillfull ratio`` and +OSD ``nearfull ratio`` + +Full ``ceph-osds`` will be reported by ``ceph health``:: + + ceph health + HEALTH_WARN 1 nearfull osd(s) + +Or:: + + ceph health detail + HEALTH_ERR 1 full osd(s); 1 backfillfull osd(s); 1 nearfull osd(s) + osd.3 is full at 97% + osd.4 is backfill full at 91% + osd.2 is near full at 87% + +The best way to deal with a full cluster is to add capacity via new OSDs, enabling +the cluster to redistribute data to newly available storage. + +If you cannot start a legacy Filestore OSD because it is full, you may reclaim +some space deleting a few placement group directories in the full OSD. + +.. important:: If you choose to delete a placement group directory on a full OSD, + **DO NOT** delete the same placement group directory on another full OSD, or + **YOU WILL LOSE DATA**. You **MUST** maintain at least one copy of your data on + at least one OSD. This is a rare and extreme intervention, and is not to be + undertaken lightly. + +See `Monitor Config Reference`_ for additional details. + +OSDs are Slow/Unresponsive +========================== + +A common issue involves slow or unresponsive OSDs. Ensure that you +have eliminated other troubleshooting possibilities before delving into OSD +performance issues. For example, ensure that your network(s) is working properly +and your OSDs are running. Check to see if OSDs are throttling recovery traffic. + +.. tip:: Newer versions of Ceph provide better recovery handling by preventing + recovering OSDs from using up system resources so that ``up`` and ``in`` + OSDs are not available or are otherwise slow. + +Networking Issues +----------------- + +Ceph is a distributed storage system, so it relies upon networks for OSD peering +and replication, recovery from faults, and periodic heartbeats. Networking +issues can cause OSD latency and flapping OSDs. See `Flapping OSDs`_ for +details. + +Ensure that Ceph processes and Ceph-dependent processes are connected and/or +listening. :: + + netstat -a | grep ceph + netstat -l | grep ceph + sudo netstat -p | grep ceph + +Check network statistics. :: + + netstat -s + +Drive Configuration +------------------- + +A SAS or SATA storage drive should only house one OSD; NVMe drives readily +handle two or more. Read and write throughput can bottleneck if other processes +share the drive, including journals / metadata, operating systems, Ceph monitors, +`syslog` logs, other OSDs, and non-Ceph processes. + +Ceph acknowledges writes *after* journaling, so fast SSDs are an +attractive option to accelerate the response time--particularly when +using the ``XFS`` or ``ext4`` file systems for legacy Filestore OSDs. +By contrast, the ``Btrfs`` +file system can write and journal simultaneously. (Note, however, that +we recommend against using ``Btrfs`` for production deployments.) + +.. note:: Partitioning a drive does not change its total throughput or + sequential read/write limits. Running a journal in a separate partition + may help, but you should prefer a separate physical drive. + +Bad Sectors / Fragmented Disk +----------------------------- + +Check your drives for bad blocks, fragmentation, and other errors that can cause +performance to drop substantially. Invaluable tools include ``dmesg``, ``syslog`` +logs, and ``smartctl`` (from the ``smartmontools`` package). + +Co-resident Monitors/OSDs +------------------------- + +Monitors are relatively lightweight processes, but they issue lots of +``fsync()`` calls, +which can interfere with other workloads, particularly if monitors run on the +same drive as an OSD. Additionally, if you run monitors on the same host as +OSDs, you may incur performance issues related to: + +- Running an older kernel (pre-3.0) +- Running a kernel with no ``syncfs(2)`` syscall. + +In these cases, multiple OSDs running on the same host can drag each other down +by doing lots of commits. That often leads to the bursty writes. + +Co-resident Processes +--------------------- + +Spinning up co-resident processes (convergence) such as a cloud-based solution, virtual +machines and other applications that write data to Ceph while operating on the +same hardware as OSDs can introduce significant OSD latency. Generally, we +recommend optimizing hosts for use with Ceph and using other hosts for other +processes. The practice of separating Ceph operations from other applications +may help improve performance and may streamline troubleshooting and maintenance. + +Logging Levels +-------------- + +If you turned logging levels up to track an issue and then forgot to turn +logging levels back down, the OSD may be putting a lot of logs onto the disk. If +you intend to keep logging levels high, you may consider mounting a drive to the +default path for logging (i.e., ``/var/log/ceph/$cluster-$name.log``). + +Recovery Throttling +------------------- + +Depending upon your configuration, Ceph may reduce recovery rates to maintain +performance or it may increase recovery rates to the point that recovery +impacts OSD performance. Check to see if the OSD is recovering. + +Kernel Version +-------------- + +Check the kernel version you are running. Older kernels may not receive +new backports that Ceph depends upon for better performance. + +Kernel Issues with SyncFS +------------------------- + +Try running one OSD per host to see if performance improves. Old kernels +might not have a recent enough version of ``glibc`` to support ``syncfs(2)``. + +Filesystem Issues +----------------- + +Currently, we recommend deploying clusters with the BlueStore back end. +When running a pre-Luminous release or if you have a specific reason to deploy +OSDs with the previous Filestore backend, we recommend ``XFS``. + +We recommend against using ``Btrfs`` or ``ext4``. The ``Btrfs`` filesystem has +many attractive features, but bugs may lead to +performance issues and spurious ENOSPC errors. We do not recommend +``ext4`` for Filestore OSDs because ``xattr`` limitations break support for long +object names, which are needed for RGW. + +For more information, see `Filesystem Recommendations`_. + +.. _Filesystem Recommendations: ../configuration/filesystem-recommendations + +Insufficient RAM +---------------- + +We recommend a *minimum* of 4GB of RAM per OSD daemon and suggest rounding up +from 6-8GB. You may notice that during normal operations, ``ceph-osd`` +processes only use a fraction of that amount. +Unused RAM makes it tempting to use the excess RAM for co-resident +applications or to skimp on each node's memory capacity. However, +when OSDs experience recovery their memory utilization spikes. If +there is insufficient RAM available, OSD performance will slow considerably +and the daemons may even crash or be killed by the Linux ``OOM Killer``. + +Blocked Requests or Slow Requests +--------------------------------- + +If a ``ceph-osd`` daemon is slow to respond to a request, messages will be logged +noting ops that are taking too long. The warning threshold +defaults to 30 seconds and is configurable via the ``osd op complaint time`` +setting. When this happens, the cluster log will receive messages. + +Legacy versions of Ceph complain about ``old requests``:: + + osd.0 192.168.106.220:6800/18813 312 : [WRN] old request osd_op(client.5099.0:790 fatty_26485_object789 [write 0~4096] 2.5e54f643) v4 received at 2012-03-06 15:42:56.054801 currently waiting for sub ops + +New versions of Ceph complain about ``slow requests``:: + + {date} {osd.num} [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.005692 secs + {date} {osd.num} [WRN] slow request 30.005692 seconds old, received at {date-time}: osd_op(client.4240.0:8 benchmark_data_ceph-1_39426_object7 [write 0~4194304] 0.69848840) v4 currently waiting for subops from [610] + +Possible causes include: + +- A failing drive (check ``dmesg`` output) +- A bug in the kernel file system (check ``dmesg`` output) +- An overloaded cluster (check system load, iostat, etc.) +- A bug in the ``ceph-osd`` daemon. + +Possible solutions: + +- Remove VMs from Ceph hosts +- Upgrade kernel +- Upgrade Ceph +- Restart OSDs +- Replace failed or failing components + +Debugging Slow Requests +----------------------- + +If you run ``ceph daemon osd.<id> dump_historic_ops`` or ``ceph daemon osd.<id> dump_ops_in_flight``, +you will see a set of operations and a list of events each operation went +through. These are briefly described below. + +Events from the Messenger layer: + +- ``header_read``: When the messenger first started reading the message off the wire. +- ``throttled``: When the messenger tried to acquire memory throttle space to read + the message into memory. +- ``all_read``: When the messenger finished reading the message off the wire. +- ``dispatched``: When the messenger gave the message to the OSD. +- ``initiated``: This is identical to ``header_read``. The existence of both is a + historical oddity. + +Events from the OSD as it processes ops: + +- ``queued_for_pg``: The op has been put into the queue for processing by its PG. +- ``reached_pg``: The PG has started doing the op. +- ``waiting for \*``: The op is waiting for some other work to complete before it + can proceed (e.g. a new OSDMap; for its object target to scrub; for the PG to + finish peering; all as specified in the message). +- ``started``: The op has been accepted as something the OSD should do and + is now being performed. +- ``waiting for subops from``: The op has been sent to replica OSDs. + +Events from ```Filestore```: + +- ``commit_queued_for_journal_write``: The op has been given to the FileStore. +- ``write_thread_in_journal_buffer``: The op is in the journal's buffer and waiting + to be persisted (as the next disk write). +- ``journaled_completion_queued``: The op was journaled to disk and its callback + queued for invocation. + +Events from the OSD after data has been given to underlying storage: + +- ``op_commit``: The op has been committed (i.e. written to journal) by the + primary OSD. +- ``op_applied``: The op has been `write()'en <https://www.freebsd.org/cgi/man.cgi?write(2)>`_ to the backing FS (i.e. applied in memory but not flushed out to disk) on the primary. +- ``sub_op_applied``: ``op_applied``, but for a replica's "subop". +- ``sub_op_committed``: ``op_commit``, but for a replica's subop (only for EC pools). +- ``sub_op_commit_rec/sub_op_apply_rec from <X>``: The primary marks this when it + hears about the above, but for a particular replica (i.e. ``<X>``). +- ``commit_sent``: We sent a reply back to the client (or primary OSD, for sub ops). + +Many of these events are seemingly redundant, but cross important boundaries in +the internal code (such as passing data across locks into new threads). + +Flapping OSDs +============= + +When OSDs peer and check heartbeats, they use the cluster (back-end) +network when it's available. See `Monitor/OSD Interaction`_ for details. + +We have tradtionally recommended separate *public* (front-end) and *private* +(cluster / back-end / replication) networks: + +#. Segregation of heartbeat and replication / recovery traffic (private) + from client and OSD <-> mon traffic (public). This helps keep one + from DoS-ing the other, which could in turn result in a cascading failure. + +#. Additional throughput for both public and private traffic. + +When common networking technloogies were 100Mb/s and 1Gb/s, this separation +was often critical. With today's 10Gb/s, 40Gb/s, and 25/50/100Gb/s +networks, the above capacity concerns are often diminished or even obviated. +For example, if your OSD nodes have two network ports, dedicating one to +the public and the other to the private network means no path redundancy. +This degrades your ability to weather network maintenance and failures without +significant cluster or client impact. Consider instead using both links +for just a public network: with bonding (LACP) or equal-cost routing (e.g. FRR) +you reap the benefits of increased throughput headroom, fault tolerance, and +reduced OSD flapping. + +When a private network (or even a single host link) fails or degrades while the +public network operates normally, OSDs may not handle this situation well. What +happens is that OSDs use the public network to report each other ``down`` to +the monitors, while marking themselves ``up``. The monitors then send out, +again on the public network, an updated cluster map with affected OSDs marked +`down`. These OSDs reply to the monitors "I'm not dead yet!", and the cycle +repeats. We call this scenario 'flapping`, and it can be difficult to isolate +and remediate. With no private network, this irksome dynamic is avoided: +OSDs are generally either ``up`` or ``down`` without flapping. + +If something does cause OSDs to 'flap' (repeatedly getting marked ``down`` and +then ``up`` again), you can force the monitors to halt the flapping by +temporarily freezing their states:: + + ceph osd set noup # prevent OSDs from getting marked up + ceph osd set nodown # prevent OSDs from getting marked down + +These flags are recorded in the osdmap:: + + ceph osd dump | grep flags + flags no-up,no-down + +You can clear the flags with:: + + ceph osd unset noup + ceph osd unset nodown + +Two other flags are supported, ``noin`` and ``noout``, which prevent +booting OSDs from being marked ``in`` (allocated data) or protect OSDs +from eventually being marked ``out`` (regardless of what the current value for +``mon osd down out interval`` is). + +.. note:: ``noup``, ``noout``, and ``nodown`` are temporary in the + sense that once the flags are cleared, the action they were blocking + should occur shortly after. The ``noin`` flag, on the other hand, + prevents OSDs from being marked ``in`` on boot, and any daemons that + started while the flag was set will remain that way. + +.. note:: The causes and effects of flapping can be somewhat mitigated through + careful adjustments to the ``mon_osd_down_out_subtree_limit``, + ``mon_osd_reporter_subtree_level``, and ``mon_osd_min_down_reporters``. + Derivation of optimal settings depends on cluster size, topology, and the + Ceph release in use. Their interactions are subtle and beyond the scope of + this document. + + +.. _iostat: https://en.wikipedia.org/wiki/Iostat +.. _Ceph Logging and Debugging: ../../configuration/ceph-conf#ceph-logging-and-debugging +.. _Logging and Debugging: ../log-and-debug +.. _Debugging and Logging: ../debug +.. _Monitor/OSD Interaction: ../../configuration/mon-osd-interaction +.. _Monitor Config Reference: ../../configuration/mon-config-ref +.. _monitoring your OSDs: ../../operations/monitoring-osd-pg +.. _subscribe to the ceph-devel email list: mailto:majordomo@vger.kernel.org?body=subscribe+ceph-devel +.. _unsubscribe from the ceph-devel email list: mailto:majordomo@vger.kernel.org?body=unsubscribe+ceph-devel +.. _subscribe to the ceph-users email list: mailto:ceph-users-join@lists.ceph.com +.. _unsubscribe from the ceph-users email list: mailto:ceph-users-leave@lists.ceph.com +.. _OS recommendations: ../../../start/os-recommendations +.. _ceph-devel: ceph-devel@vger.kernel.org diff --git a/doc/rados/troubleshooting/troubleshooting-pg.rst b/doc/rados/troubleshooting/troubleshooting-pg.rst new file mode 100644 index 000000000..f5e5054ba --- /dev/null +++ b/doc/rados/troubleshooting/troubleshooting-pg.rst @@ -0,0 +1,693 @@ +===================== + Troubleshooting PGs +===================== + +Placement Groups Never Get Clean +================================ + +When you create a cluster and your cluster remains in ``active``, +``active+remapped`` or ``active+degraded`` status and never achieves an +``active+clean`` status, you likely have a problem with your configuration. + +You may need to review settings in the `Pool, PG and CRUSH Config Reference`_ +and make appropriate adjustments. + +As a general rule, you should run your cluster with more than one OSD and a +pool size greater than 1 object replica. + +.. _one-node-cluster: + +One Node Cluster +---------------- + +Ceph no longer provides documentation for operating on a single node, because +you would never deploy a system designed for distributed computing on a single +node. Additionally, mounting client kernel modules on a single node containing a +Ceph daemon may cause a deadlock due to issues with the Linux kernel itself +(unless you use VMs for the clients). You can experiment with Ceph in a 1-node +configuration, in spite of the limitations as described herein. + +If you are trying to create a cluster on a single node, you must change the +default of the ``osd crush chooseleaf type`` setting from ``1`` (meaning +``host`` or ``node``) to ``0`` (meaning ``osd``) in your Ceph configuration +file before you create your monitors and OSDs. This tells Ceph that an OSD +can peer with another OSD on the same host. If you are trying to set up a +1-node cluster and ``osd crush chooseleaf type`` is greater than ``0``, +Ceph will try to peer the PGs of one OSD with the PGs of another OSD on +another node, chassis, rack, row, or even datacenter depending on the setting. + +.. tip:: DO NOT mount kernel clients directly on the same node as your + Ceph Storage Cluster, because kernel conflicts can arise. However, you + can mount kernel clients within virtual machines (VMs) on a single node. + +If you are creating OSDs using a single disk, you must create directories +for the data manually first. + + +Fewer OSDs than Replicas +------------------------ + +If you have brought up two OSDs to an ``up`` and ``in`` state, but you still +don't see ``active + clean`` placement groups, you may have an +``osd pool default size`` set to greater than ``2``. + +There are a few ways to address this situation. If you want to operate your +cluster in an ``active + degraded`` state with two replicas, you can set the +``osd pool default min size`` to ``2`` so that you can write objects in +an ``active + degraded`` state. You may also set the ``osd pool default size`` +setting to ``2`` so that you only have two stored replicas (the original and +one replica), in which case the cluster should achieve an ``active + clean`` +state. + +.. note:: You can make the changes at runtime. If you make the changes in + your Ceph configuration file, you may need to restart your cluster. + + +Pool Size = 1 +------------- + +If you have the ``osd pool default size`` set to ``1``, you will only have +one copy of the object. OSDs rely on other OSDs to tell them which objects +they should have. If a first OSD has a copy of an object and there is no +second copy, then no second OSD can tell the first OSD that it should have +that copy. For each placement group mapped to the first OSD (see +``ceph pg dump``), you can force the first OSD to notice the placement groups +it needs by running:: + + ceph osd force-create-pg <pgid> + + +CRUSH Map Errors +---------------- + +Another candidate for placement groups remaining unclean involves errors +in your CRUSH map. + + +Stuck Placement Groups +====================== + +It is normal for placement groups to enter states like "degraded" or "peering" +following a failure. Normally these states indicate the normal progression +through the failure recovery process. However, if a placement group stays in one +of these states for a long time this may be an indication of a larger problem. +For this reason, the monitor will warn when placement groups get "stuck" in a +non-optimal state. Specifically, we check for: + +* ``inactive`` - The placement group has not been ``active`` for too long + (i.e., it hasn't been able to service read/write requests). + +* ``unclean`` - The placement group has not been ``clean`` for too long + (i.e., it hasn't been able to completely recover from a previous failure). + +* ``stale`` - The placement group status has not been updated by a ``ceph-osd``, + indicating that all nodes storing this placement group may be ``down``. + +You can explicitly list stuck placement groups with one of:: + + ceph pg dump_stuck stale + ceph pg dump_stuck inactive + ceph pg dump_stuck unclean + +For stuck ``stale`` placement groups, it is normally a matter of getting the +right ``ceph-osd`` daemons running again. For stuck ``inactive`` placement +groups, it is usually a peering problem (see :ref:`failures-osd-peering`). For +stuck ``unclean`` placement groups, there is usually something preventing +recovery from completing, like unfound objects (see +:ref:`failures-osd-unfound`); + + + +.. _failures-osd-peering: + +Placement Group Down - Peering Failure +====================================== + +In certain cases, the ``ceph-osd`` `Peering` process can run into +problems, preventing a PG from becoming active and usable. For +example, ``ceph health`` might report:: + + ceph health detail + HEALTH_ERR 7 pgs degraded; 12 pgs down; 12 pgs peering; 1 pgs recovering; 6 pgs stuck unclean; 114/3300 degraded (3.455%); 1/3 in osds are down + ... + pg 0.5 is down+peering + pg 1.4 is down+peering + ... + osd.1 is down since epoch 69, last address 192.168.106.220:6801/8651 + +We can query the cluster to determine exactly why the PG is marked ``down`` with:: + + ceph pg 0.5 query + +.. code-block:: javascript + + { "state": "down+peering", + ... + "recovery_state": [ + { "name": "Started\/Primary\/Peering\/GetInfo", + "enter_time": "2012-03-06 14:40:16.169679", + "requested_info_from": []}, + { "name": "Started\/Primary\/Peering", + "enter_time": "2012-03-06 14:40:16.169659", + "probing_osds": [ + 0, + 1], + "blocked": "peering is blocked due to down osds", + "down_osds_we_would_probe": [ + 1], + "peering_blocked_by": [ + { "osd": 1, + "current_lost_at": 0, + "comment": "starting or marking this osd lost may let us proceed"}]}, + { "name": "Started", + "enter_time": "2012-03-06 14:40:16.169513"} + ] + } + +The ``recovery_state`` section tells us that peering is blocked due to +down ``ceph-osd`` daemons, specifically ``osd.1``. In this case, we can start that ``ceph-osd`` +and things will recover. + +Alternatively, if there is a catastrophic failure of ``osd.1`` (e.g., disk +failure), we can tell the cluster that it is ``lost`` and to cope as +best it can. + +.. important:: This is dangerous in that the cluster cannot + guarantee that the other copies of the data are consistent + and up to date. + +To instruct Ceph to continue anyway:: + + ceph osd lost 1 + +Recovery will proceed. + + +.. _failures-osd-unfound: + +Unfound Objects +=============== + +Under certain combinations of failures Ceph may complain about +``unfound`` objects:: + + ceph health detail + HEALTH_WARN 1 pgs degraded; 78/3778 unfound (2.065%) + pg 2.4 is active+degraded, 78 unfound + +This means that the storage cluster knows that some objects (or newer +copies of existing objects) exist, but it hasn't found copies of them. +One example of how this might come about for a PG whose data is on ceph-osds +1 and 2: + +* 1 goes down +* 2 handles some writes, alone +* 1 comes up +* 1 and 2 repeer, and the objects missing on 1 are queued for recovery. +* Before the new objects are copied, 2 goes down. + +Now 1 knows that these object exist, but there is no live ``ceph-osd`` who +has a copy. In this case, IO to those objects will block, and the +cluster will hope that the failed node comes back soon; this is +assumed to be preferable to returning an IO error to the user. + +First, you can identify which objects are unfound with:: + + ceph pg 2.4 list_unfound [starting offset, in json] + +.. code-block:: javascript + + { + "num_missing": 1, + "num_unfound": 1, + "objects": [ + { + "oid": { + "oid": "object", + "key": "", + "snapid": -2, + "hash": 2249616407, + "max": 0, + "pool": 2, + "namespace": "" + }, + "need": "43'251", + "have": "0'0", + "flags": "none", + "clean_regions": "clean_offsets: [], clean_omap: 0, new_object: 1", + "locations": [ + "0(3)", + "4(2)" + ] + } + ], + "state": "NotRecovering", + "available_might_have_unfound": true, + "might_have_unfound": [ + { + "osd": "2(4)", + "status": "osd is down" + } + ], + "more": false + } + +If there are too many objects to list in a single result, the ``more`` +field will be true and you can query for more. (Eventually the +command line tool will hide this from you, but not yet.) + +Second, you can identify which OSDs have been probed or might contain +data. + +At the end of the listing (before ``more`` is false), ``might_have_unfound`` is provided +when ``available_might_have_unfound`` is true. This is equivalent to the output +of ``ceph pg #.# query``. This eliminates the need to use ``query`` directly. +The ``might_have_unfound`` information given behaves the same way as described below for ``query``. +The only difference is that OSDs that have ``already probed`` status are ignored. + +Use of ``query``:: + + ceph pg 2.4 query + +.. code-block:: javascript + + "recovery_state": [ + { "name": "Started\/Primary\/Active", + "enter_time": "2012-03-06 15:15:46.713212", + "might_have_unfound": [ + { "osd": 1, + "status": "osd is down"}]}, + +In this case, for example, the cluster knows that ``osd.1`` might have +data, but it is ``down``. The full range of possible states include: + +* already probed +* querying +* OSD is down +* not queried (yet) + +Sometimes it simply takes some time for the cluster to query possible +locations. + +It is possible that there are other locations where the object can +exist that are not listed. For example, if a ceph-osd is stopped and +taken out of the cluster, the cluster fully recovers, and due to some +future set of failures ends up with an unfound object, it won't +consider the long-departed ceph-osd as a potential location to +consider. (This scenario, however, is unlikely.) + +If all possible locations have been queried and objects are still +lost, you may have to give up on the lost objects. This, again, is +possible given unusual combinations of failures that allow the cluster +to learn about writes that were performed before the writes themselves +are recovered. To mark the "unfound" objects as "lost":: + + ceph pg 2.5 mark_unfound_lost revert|delete + +This the final argument specifies how the cluster should deal with +lost objects. + +The "delete" option will forget about them entirely. + +The "revert" option (not available for erasure coded pools) will +either roll back to a previous version of the object or (if it was a +new object) forget about it entirely. Use this with caution, as it +may confuse applications that expected the object to exist. + + +Homeless Placement Groups +========================= + +It is possible for all OSDs that had copies of a given placement groups to fail. +If that's the case, that subset of the object store is unavailable, and the +monitor will receive no status updates for those placement groups. To detect +this situation, the monitor marks any placement group whose primary OSD has +failed as ``stale``. For example:: + + ceph health + HEALTH_WARN 24 pgs stale; 3/300 in osds are down + +You can identify which placement groups are ``stale``, and what the last OSDs to +store them were, with:: + + ceph health detail + HEALTH_WARN 24 pgs stale; 3/300 in osds are down + ... + pg 2.5 is stuck stale+active+remapped, last acting [2,0] + ... + osd.10 is down since epoch 23, last address 192.168.106.220:6800/11080 + osd.11 is down since epoch 13, last address 192.168.106.220:6803/11539 + osd.12 is down since epoch 24, last address 192.168.106.220:6806/11861 + +If we want to get placement group 2.5 back online, for example, this tells us that +it was last managed by ``osd.0`` and ``osd.2``. Restarting those ``ceph-osd`` +daemons will allow the cluster to recover that placement group (and, presumably, +many others). + + +Only a Few OSDs Receive Data +============================ + +If you have many nodes in your cluster and only a few of them receive data, +`check`_ the number of placement groups in your pool. Since placement groups get +mapped to OSDs, a small number of placement groups will not distribute across +your cluster. Try creating a pool with a placement group count that is a +multiple of the number of OSDs. See `Placement Groups`_ for details. The default +placement group count for pools is not useful, but you can change it `here`_. + + +Can't Write Data +================ + +If your cluster is up, but some OSDs are down and you cannot write data, +check to ensure that you have the minimum number of OSDs running for the +placement group. If you don't have the minimum number of OSDs running, +Ceph will not allow you to write data because there is no guarantee +that Ceph can replicate your data. See ``osd pool default min size`` +in the `Pool, PG and CRUSH Config Reference`_ for details. + + +PGs Inconsistent +================ + +If you receive an ``active + clean + inconsistent`` state, this may happen +due to an error during scrubbing. As always, we can identify the inconsistent +placement group(s) with:: + + $ ceph health detail + HEALTH_ERR 1 pgs inconsistent; 2 scrub errors + pg 0.6 is active+clean+inconsistent, acting [0,1,2] + 2 scrub errors + +Or if you prefer inspecting the output in a programmatic way:: + + $ rados list-inconsistent-pg rbd + ["0.6"] + +There is only one consistent state, but in the worst case, we could have +different inconsistencies in multiple perspectives found in more than one +objects. If an object named ``foo`` in PG ``0.6`` is truncated, we will have:: + + $ rados list-inconsistent-obj 0.6 --format=json-pretty + +.. code-block:: javascript + + { + "epoch": 14, + "inconsistents": [ + { + "object": { + "name": "foo", + "nspace": "", + "locator": "", + "snap": "head", + "version": 1 + }, + "errors": [ + "data_digest_mismatch", + "size_mismatch" + ], + "union_shard_errors": [ + "data_digest_mismatch_info", + "size_mismatch_info" + ], + "selected_object_info": "0:602f83fe:::foo:head(16'1 client.4110.0:1 dirty|data_digest|omap_digest s 968 uv 1 dd e978e67f od ffffffff alloc_hint [0 0 0])", + "shards": [ + { + "osd": 0, + "errors": [], + "size": 968, + "omap_digest": "0xffffffff", + "data_digest": "0xe978e67f" + }, + { + "osd": 1, + "errors": [], + "size": 968, + "omap_digest": "0xffffffff", + "data_digest": "0xe978e67f" + }, + { + "osd": 2, + "errors": [ + "data_digest_mismatch_info", + "size_mismatch_info" + ], + "size": 0, + "omap_digest": "0xffffffff", + "data_digest": "0xffffffff" + } + ] + } + ] + } + +In this case, we can learn from the output: + +* The only inconsistent object is named ``foo``, and it is its head that has + inconsistencies. +* The inconsistencies fall into two categories: + + * ``errors``: these errors indicate inconsistencies between shards without a + determination of which shard(s) are bad. Check for the ``errors`` in the + `shards` array, if available, to pinpoint the problem. + + * ``data_digest_mismatch``: the digest of the replica read from OSD.2 is + different from the ones of OSD.0 and OSD.1 + * ``size_mismatch``: the size of the replica read from OSD.2 is 0, while + the size reported by OSD.0 and OSD.1 is 968. + * ``union_shard_errors``: the union of all shard specific ``errors`` in + ``shards`` array. The ``errors`` are set for the given shard that has the + problem. They include errors like ``read_error``. The ``errors`` ending in + ``oi`` indicate a comparison with ``selected_object_info``. Look at the + ``shards`` array to determine which shard has which error(s). + + * ``data_digest_mismatch_info``: the digest stored in the object-info is not + ``0xffffffff``, which is calculated from the shard read from OSD.2 + * ``size_mismatch_info``: the size stored in the object-info is different + from the one read from OSD.2. The latter is 0. + +You can repair the inconsistent placement group by executing:: + + ceph pg repair {placement-group-ID} + +Which overwrites the `bad` copies with the `authoritative` ones. In most cases, +Ceph is able to choose authoritative copies from all available replicas using +some predefined criteria. But this does not always work. For example, the stored +data digest could be missing, and the calculated digest will be ignored when +choosing the authoritative copies. So, please use the above command with caution. + +If ``read_error`` is listed in the ``errors`` attribute of a shard, the +inconsistency is likely due to disk errors. You might want to check your disk +used by that OSD. + +If you receive ``active + clean + inconsistent`` states periodically due to +clock skew, you may consider configuring your `NTP`_ daemons on your +monitor hosts to act as peers. See `The Network Time Protocol`_ and Ceph +`Clock Settings`_ for additional details. + + +Erasure Coded PGs are not active+clean +====================================== + +When CRUSH fails to find enough OSDs to map to a PG, it will show as a +``2147483647`` which is ITEM_NONE or ``no OSD found``. For instance:: + + [2,1,6,0,5,8,2147483647,7,4] + +Not enough OSDs +--------------- + +If the Ceph cluster only has 8 OSDs and the erasure coded pool needs +9, that is what it will show. You can either create another erasure +coded pool that requires less OSDs:: + + ceph osd erasure-code-profile set myprofile k=5 m=3 + ceph osd pool create erasurepool erasure myprofile + +or add a new OSDs and the PG will automatically use them. + +CRUSH constraints cannot be satisfied +------------------------------------- + +If the cluster has enough OSDs, it is possible that the CRUSH rule +imposes constraints that cannot be satisfied. If there are 10 OSDs on +two hosts and the CRUSH rule requires that no two OSDs from the +same host are used in the same PG, the mapping may fail because only +two OSDs will be found. You can check the constraint by displaying ("dumping") +the rule:: + + $ ceph osd crush rule ls + [ + "replicated_rule", + "erasurepool"] + $ ceph osd crush rule dump erasurepool + { "rule_id": 1, + "rule_name": "erasurepool", + "ruleset": 1, + "type": 3, + "min_size": 3, + "max_size": 20, + "steps": [ + { "op": "take", + "item": -1, + "item_name": "default"}, + { "op": "chooseleaf_indep", + "num": 0, + "type": "host"}, + { "op": "emit"}]} + + +You can resolve the problem by creating a new pool in which PGs are allowed +to have OSDs residing on the same host with:: + + ceph osd erasure-code-profile set myprofile crush-failure-domain=osd + ceph osd pool create erasurepool erasure myprofile + +CRUSH gives up too soon +----------------------- + +If the Ceph cluster has just enough OSDs to map the PG (for instance a +cluster with a total of 9 OSDs and an erasure coded pool that requires +9 OSDs per PG), it is possible that CRUSH gives up before finding a +mapping. It can be resolved by: + +* lowering the erasure coded pool requirements to use less OSDs per PG + (that requires the creation of another pool as erasure code profiles + cannot be dynamically modified). + +* adding more OSDs to the cluster (that does not require the erasure + coded pool to be modified, it will become clean automatically) + +* use a handmade CRUSH rule that tries more times to find a good + mapping. This can be done by setting ``set_choose_tries`` to a value + greater than the default. + +You should first verify the problem with ``crushtool`` after +extracting the crushmap from the cluster so your experiments do not +modify the Ceph cluster and only work on a local files:: + + $ ceph osd crush rule dump erasurepool + { "rule_name": "erasurepool", + "ruleset": 1, + "type": 3, + "min_size": 3, + "max_size": 20, + "steps": [ + { "op": "take", + "item": -1, + "item_name": "default"}, + { "op": "chooseleaf_indep", + "num": 0, + "type": "host"}, + { "op": "emit"}]} + $ ceph osd getcrushmap > crush.map + got crush map from osdmap epoch 13 + $ crushtool -i crush.map --test --show-bad-mappings \ + --rule 1 \ + --num-rep 9 \ + --min-x 1 --max-x $((1024 * 1024)) + bad mapping rule 8 x 43 num_rep 9 result [3,2,7,1,2147483647,8,5,6,0] + bad mapping rule 8 x 79 num_rep 9 result [6,0,2,1,4,7,2147483647,5,8] + bad mapping rule 8 x 173 num_rep 9 result [0,4,6,8,2,1,3,7,2147483647] + +Where ``--num-rep`` is the number of OSDs the erasure code CRUSH +rule needs, ``--rule`` is the value of the ``ruleset`` field +displayed by ``ceph osd crush rule dump``. The test will try mapping +one million values (i.e. the range defined by ``[--min-x,--max-x]``) +and must display at least one bad mapping. If it outputs nothing it +means all mappings are successful and you can stop right there: the +problem is elsewhere. + +The CRUSH rule can be edited by decompiling the crush map:: + + $ crushtool --decompile crush.map > crush.txt + +and adding the following line to the rule:: + + step set_choose_tries 100 + +The relevant part of the ``crush.txt`` file should look something +like:: + + rule erasurepool { + ruleset 1 + type erasure + min_size 3 + max_size 20 + step set_chooseleaf_tries 5 + step set_choose_tries 100 + step take default + step chooseleaf indep 0 type host + step emit + } + +It can then be compiled and tested again:: + + $ crushtool --compile crush.txt -o better-crush.map + +When all mappings succeed, an histogram of the number of tries that +were necessary to find all of them can be displayed with the +``--show-choose-tries`` option of ``crushtool``:: + + $ crushtool -i better-crush.map --test --show-bad-mappings \ + --show-choose-tries \ + --rule 1 \ + --num-rep 9 \ + --min-x 1 --max-x $((1024 * 1024)) + ... + 11: 42 + 12: 44 + 13: 54 + 14: 45 + 15: 35 + 16: 34 + 17: 30 + 18: 25 + 19: 19 + 20: 22 + 21: 20 + 22: 17 + 23: 13 + 24: 16 + 25: 13 + 26: 11 + 27: 11 + 28: 13 + 29: 11 + 30: 10 + 31: 6 + 32: 5 + 33: 10 + 34: 3 + 35: 7 + 36: 5 + 37: 2 + 38: 5 + 39: 5 + 40: 2 + 41: 5 + 42: 4 + 43: 1 + 44: 2 + 45: 2 + 46: 3 + 47: 1 + 48: 0 + ... + 102: 0 + 103: 1 + 104: 0 + ... + +It took 11 tries to map 42 PGs, 12 tries to map 44 PGs etc. The highest number of tries is the minimum value of ``set_choose_tries`` that prevents bad mappings (i.e. 103 in the above output because it did not take more than 103 tries for any PG to be mapped). + +.. _check: ../../operations/placement-groups#get-the-number-of-placement-groups +.. _here: ../../configuration/pool-pg-config-ref +.. _Placement Groups: ../../operations/placement-groups +.. _Pool, PG and CRUSH Config Reference: ../../configuration/pool-pg-config-ref +.. _NTP: https://en.wikipedia.org/wiki/Network_Time_Protocol +.. _The Network Time Protocol: http://www.ntp.org/ +.. _Clock Settings: ../../configuration/mon-config-ref/#clock + + diff --git a/doc/radosgw/STS.rst b/doc/radosgw/STS.rst new file mode 100644 index 000000000..1678f86d5 --- /dev/null +++ b/doc/radosgw/STS.rst @@ -0,0 +1,298 @@ +=========== +STS in Ceph +=========== + +Secure Token Service is a web service in AWS that returns a set of temporary security credentials for authenticating federated users. +The link to official AWS documentation can be found here: https://docs.aws.amazon.com/STS/latest/APIReference/Welcome.html. + +Ceph Object Gateway implements a subset of STS APIs that provide temporary credentials for identity and access management. +These temporary credentials can be used to make subsequent S3 calls which will be authenticated by the STS engine in Ceph Object Gateway. +Permissions of the temporary credentials can be further restricted via an IAM policy passed as a parameter to the STS APIs. + +STS REST APIs +============= + +The following STS REST APIs have been implemented in Ceph Object Gateway: + +1. AssumeRole: Returns a set of temporary credentials that can be used for +cross-account access. The temporary credentials will have permissions that are +allowed by both - permission policies attached with the Role and policy attached +with the AssumeRole API. + +Parameters: + **RoleArn** (String/ Required): ARN of the Role to Assume. + + **RoleSessionName** (String/ Required): An Identifier for the assumed role + session. + + **Policy** (String/ Optional): An IAM Policy in JSON format. + + **DurationSeconds** (Integer/ Optional): The duration in seconds of the session. + Its default value is 3600. + + **ExternalId** (String/ Optional): A unique Id that might be used when a role is + assumed in another account. + + **SerialNumber** (String/ Optional): The Id number of the MFA device associated + with the user making the AssumeRole call. + + **TokenCode** (String/ Optional): The value provided by the MFA device, if the + trust policy of the role being assumed requires MFA. + +2. AssumeRoleWithWebIdentity: Returns a set of temporary credentials for users that +have been authenticated by a web/mobile app by an OpenID Connect /OAuth2.0 Identity Provider. +Currently Keycloak has been tested and integrated with RGW. + +Parameters: + **RoleArn** (String/ Required): ARN of the Role to Assume. + + **RoleSessionName** (String/ Required): An Identifier for the assumed role + session. + + **Policy** (String/ Optional): An IAM Policy in JSON format. + + **DurationSeconds** (Integer/ Optional): The duration in seconds of the session. + Its default value is 3600. + + **ProviderId** (String/ Optional): Fully qualified host component of the domain name + of the IDP. Valid only for OAuth2.0 tokens (not for OpenID Connect tokens). + + **WebIdentityToken** (String/ Required): The OpenID Connect/ OAuth2.0 token, which the + application gets in return after authenticating its user with an IDP. + +Before invoking AssumeRoleWithWebIdentity, an OpenID Connect Provider entity (which the web application +authenticates with), needs to be created in RGW. + +The trust between the IDP and the role is created by adding a condition to the role's trust policy, which +allows access only to applications which satisfy the given condition. +All claims of the JWT are supported in the condition of the role's trust policy. +An example of a policy that uses the 'aud' claim in the condition is of the form:: + + "{\"Version\":\"2012-10-17\",\"Statement\":[{\"Effect\":\"Allow\",\"Principal\":{\"Federated\":[\"arn:aws:iam:::oidc-provider/<URL of IDP>\"]},\"Action\":[\"sts:AssumeRoleWithWebIdentity\"],\"Condition\":{\"StringEquals\":{\"<URL of IDP> :app_id\":\"<aud>\"\}\}\}\]\}" + +The app_id in the condition above must match the 'aud' claim of the incoming token. + +An example of a policy that uses the 'sub' claim in the condition is of the form:: + + "{\"Version\":\"2012-10-17\",\"Statement\":[{\"Effect\":\"Allow\",\"Principal\":{\"Federated\":[\"arn:aws:iam:::oidc-provider/<URL of IDP>\"]},\"Action\":[\"sts:AssumeRoleWithWebIdentity\"],\"Condition\":{\"StringEquals\":{\"<URL of IDP> :sub\":\"<sub>\"\}\}\}\]\}" + +Similarly, an example of a policy that uses 'azp' claim in the condition is of the form:: + + "{\"Version\":\"2012-10-17\",\"Statement\":[{\"Effect\":\"Allow\",\"Principal\":{\"Federated\":[\"arn:aws:iam:::oidc-provider/<URL of IDP>\"]},\"Action\":[\"sts:AssumeRoleWithWebIdentity\"],\"Condition\":{\"StringEquals\":{\"<URL of IDP> :azp\":\"<azp>\"\}\}\}\]\}" + +A shadow user is created corresponding to every federated user. The user id is derived from the 'sub' field of the incoming web token. +The user is created in a separate namespace - 'oidc' such that the user id doesn't clash with any other user ids in rgw. The format of the user id +is - <tenant>$<user-namespace>$<sub> where user-namespace is 'oidc' for users that authenticate with oidc providers. + +RGW now supports Session tags that can be passed in the web token to AssumeRoleWithWebIdentity call. More information related to Session Tags can be found here +:doc:`session-tags`. + +STS Configuration +================= + +The following configurable options have to be added for STS integration:: + + [client.radosgw.gateway] + rgw sts key = {sts key for encrypting the session token} + rgw s3 auth use sts = true + +Note: By default, STS and S3 APIs co-exist in the same namespace, and both S3 +and STS APIs can be accessed via the same endpoint in Ceph Object Gateway. + +Examples +======== + +1. The following is an example of AssumeRole API call, which shows steps to create a role, assign a policy to it +(that allows access to S3 resources), assuming a role to get temporary credentials and accessing s3 resources using +those credentials. In this example, TESTER1 assumes a role created by TESTER, to access S3 resources owned by TESTER, +according to the permission policy attached to the role. + +.. code-block:: console + + radosgw-admin caps add --uid="TESTER" --caps="roles=*" + +2. The following is an example of the AssumeRole API call, which shows steps to create a role, assign a policy to it + (that allows access to S3 resources), assuming a role to get temporary credentials and accessing S3 resources using + those credentials. In this example, TESTER1 assumes a role created by TESTER, to access S3 resources owned by TESTER, + according to the permission policy attached to the role. + +.. code-block:: python + + import boto3 + + iam_client = boto3.client('iam', + aws_access_key_id=<access_key of TESTER>, + aws_secret_access_key=<secret_key of TESTER>, + endpoint_url=<IAM URL>, + region_name='' + ) + + policy_document = "{\"Version\":\"2012-10-17\",\"Statement\":{\"Effect\":\"Allow\",\"Principal\":{\"AWS\":[\"arn:aws:iam:::user/TESTER1\"]},\"Action\":[\"sts:AssumeRole\"]}]}" + + role_response = iam_client.create_role( + AssumeRolePolicyDocument=policy_document, + Path='/', + RoleName='S3Access', + ) + + role_policy = "{\"Version\":\"2012-10-17\",\"Statement\":{\"Effect\":\"Allow\",\"Action\":\"s3:*\",\"Resource\":\"arn:aws:s3:::*\"}}" + + response = iam_client.put_role_policy( + RoleName='S3Access', + PolicyName='Policy1', + PolicyDocument=role_policy + ) + + sts_client = boto3.client('sts', + aws_access_key_id=<access_key of TESTER1>, + aws_secret_access_key=<secret_key of TESTER1>, + endpoint_url=<STS URL>, + region_name='', + ) + + response = sts_client.assume_role( + RoleArn=role_response['Role']['Arn'], + RoleSessionName='Bob', + DurationSeconds=3600 + ) + + s3client = boto3.client('s3', + aws_access_key_id = response['Credentials']['AccessKeyId'], + aws_secret_access_key = response['Credentials']['SecretAccessKey'], + aws_session_token = response['Credentials']['SessionToken'], + endpoint_url=<S3 URL>, + region_name='',) + + bucket_name = 'my-bucket' + s3bucket = s3client.create_bucket(Bucket=bucket_name) + resp = s3client.list_buckets() + +2. The following is an example of AssumeRoleWithWebIdentity API call, where an external app that has users authenticated with +an OpenID Connect/ OAuth2 IDP (Keycloak in this example), assumes a role to get back temporary credentials and access S3 resources +according to permission policy of the role. + +.. code-block:: python + + import boto3 + + iam_client = boto3.client('iam', + aws_access_key_id=<access_key of TESTER>, + aws_secret_access_key=<secret_key of TESTER>, + endpoint_url=<IAM URL>, + region_name='' + ) + + oidc_response = iam_client.create_open_id_connect_provider( + Url=<URL of the OpenID Connect Provider, + ClientIDList=[ + <Client id registered with the IDP> + ], + ThumbprintList=[ + <Thumbprint of the IDP> + ] + ) + + policy_document = "{\"Version\":\"2012-10-17\",\"Statement\":[{\"Effect\":\"Allow\",\"Principal\":{\"Federated\":[\"arn:aws:iam:::oidc-provider/localhost:8080/auth/realms/demo\"]},\"Action\":[\"sts:AssumeRoleWithWebIdentity\"],\"Condition\":{\"StringEquals\":{\"localhost:8080/auth/realms/demo:app_id\":\"customer-portal\"}}}]}" + role_response = iam_client.create_role( + AssumeRolePolicyDocument=policy_document, + Path='/', + RoleName='S3Access', + ) + + role_policy = "{\"Version\":\"2012-10-17\",\"Statement\":{\"Effect\":\"Allow\",\"Action\":\"s3:*\",\"Resource\":\"arn:aws:s3:::*\"}}" + + response = iam_client.put_role_policy( + RoleName='S3Access', + PolicyName='Policy1', + PolicyDocument=role_policy + ) + + sts_client = boto3.client('sts', + aws_access_key_id=<access_key of TESTER1>, + aws_secret_access_key=<secret_key of TESTER1>, + endpoint_url=<STS URL>, + region_name='', + ) + + response = client.assume_role_with_web_identity( + RoleArn=role_response['Role']['Arn'], + RoleSessionName='Bob', + DurationSeconds=3600, + WebIdentityToken=<Web Token> + ) + + s3client = boto3.client('s3', + aws_access_key_id = response['Credentials']['AccessKeyId'], + aws_secret_access_key = response['Credentials']['SecretAccessKey'], + aws_session_token = response['Credentials']['SessionToken'], + endpoint_url=<S3 URL>, + region_name='',) + + bucket_name = 'my-bucket' + s3bucket = s3client.create_bucket(Bucket=bucket_name) + resp = s3client.list_buckets() + +How to obtain thumbprint of an OpenID Connect Provider IDP +========================================================== +1. Take the OpenID connect provider's URL and add /.well-known/openid-configuration +to it to get the URL to get the IDP's configuration document. For example, if the URL +of the IDP is http://localhost:8000/auth/realms/quickstart, then the URL to get the +document from is http://localhost:8000/auth/realms/quickstart/.well-known/openid-configuration + +2. Use the following curl command to get the configuration document from the URL described +in step 1:: + + curl -k -v \ + -X GET \ + -H "Content-Type: application/x-www-form-urlencoded" \ + "http://localhost:8000/auth/realms/quickstart/.well-known/openid-configuration" \ + | jq . + + 3. From the response of step 2, use the value of "jwks_uri" to get the certificate of the IDP, + using the following code:: + curl -k -v \ + -X GET \ + -H "Content-Type: application/x-www-form-urlencoded" \ + "http://$KC_SERVER/$KC_CONTEXT/realms/$KC_REALM/protocol/openid-connect/certs" \ + | jq . + +3. Copy the result of "x5c" in the response above, in a file certificate.crt, and add +'-----BEGIN CERTIFICATE-----' at the beginning and "-----END CERTIFICATE-----" +at the end. + +4. Use the following OpenSSL command to get the certificate thumbprint:: + + openssl x509 -in certificate.crt -fingerprint -noout + +5. The result of the above command in step 4, will be a SHA1 fingerprint, like the following:: + + SHA1 Fingerprint=F7:D7:B3:51:5D:D0:D3:19:DD:21:9A:43:A9:EA:72:7A:D6:06:52:87 + +6. Remove the colons from the result above to get the final thumbprint which can be as input +while creating the OpenID Connect Provider entity in IAM:: + + F7D7B3515DD0D319DD219A43A9EA727AD6065287 + +Roles in RGW +============ + +More information for role manipulation can be found here +:doc:`role`. + +OpenID Connect Provider in RGW +============================== + +More information for OpenID Connect Provider entity manipulation +can be found here +:doc:`oidc`. + +Keycloak integration with Radosgw +================================= + +Steps for integrating Radosgw with Keycloak can be found here +:doc:`keycloak`. + +STSLite +======= +STSLite has been built on STS, and documentation for the same can be found here +:doc:`STSLite`. diff --git a/doc/radosgw/STSLite.rst b/doc/radosgw/STSLite.rst new file mode 100644 index 000000000..63164e48d --- /dev/null +++ b/doc/radosgw/STSLite.rst @@ -0,0 +1,196 @@ +========= +STS Lite +========= + +Ceph Object Gateway provides support for a subset of Amazon Secure Token Service +(STS) APIs. STS Lite is an extension of STS and builds upon one of its APIs to +decrease the load on external IDPs like Keystone and LDAP. + +A set of temporary security credentials is returned after authenticating +a set of AWS credentials with the external IDP. These temporary credentials can be used +to make subsequent S3 calls which will be authenticated by the STS engine in Ceph, +resulting in less load on the Keystone/ LDAP server. + +Temporary and limited privileged credentials can be obtained for a local user +also using the STS Lite API. + +STS Lite REST APIs +================== + +The following STS Lite REST API is part of STS Lite in Ceph Object Gateway: + +1. GetSessionToken: Returns a set of temporary credentials for a set of AWS +credentials. After initial authentication with Keystone/ LDAP, the temporary +credentials returned can be used to make subsequent S3 calls. The temporary +credentials will have the same permission as that of the AWS credentials. + +Parameters: + **DurationSeconds** (Integer/ Optional): The duration in seconds for which the + credentials should remain valid. Its default value is 3600. Its default max + value is 43200 which is can be configured using rgw sts max session duration. + + **SerialNumber** (String/ Optional): The Id number of the MFA device associated + with the user making the GetSessionToken call. + + **TokenCode** (String/ Optional): The value provided by the MFA device, if MFA is required. + +An administrative user needs to attach a policy to allow invocation of GetSessionToken API using its permanent +credentials and to allow subsequent S3 operations invocation using only the temporary credentials returned +by GetSessionToken. + +The user attaching the policy needs to have admin caps. For example:: + + radosgw-admin caps add --uid="TESTER" --caps="user-policy=*" + +The following is the policy that needs to be attached to a user 'TESTER1':: + + user_policy = "{\"Version\":\"2012-10-17\",\"Statement\":[{\"Effect\":\"Deny\",\"Action\":\"s3:*\",\"Resource\":[\"*\"],\"Condition\":{\"BoolIfExists\":{\"sts:authentication\":\"false\"}}},{\"Effect\":\"Allow\",\"Action\":\"sts:GetSessionToken\",\"Resource\":\"*\",\"Condition\":{\"BoolIfExists\":{\"sts:authentication\":\"false\"}}}]}" + + +STS Lite Configuration +====================== + +The following configurable options are available for STS Lite integration:: + + [client.radosgw.gateway] + rgw sts key = {sts key for encrypting the session token} + rgw s3 auth use sts = true + +The above STS configurables can be used with the Keystone configurables if one +needs to use STS Lite in conjunction with Keystone. The complete set of +configurable options will be:: + + [client.radosgw.gateway] + rgw sts key = {sts key for encrypting/ decrypting the session token} + rgw s3 auth use sts = true + + rgw keystone url = {keystone server url:keystone server admin port} + rgw keystone admin project = {keystone admin project name} + rgw keystone admin tenant = {keystone service tenant name} + rgw keystone admin domain = {keystone admin domain name} + rgw keystone api version = {keystone api version} + rgw keystone implicit tenants = {true for private tenant for each new user} + rgw keystone admin password = {keystone service tenant user name} + rgw keystone admin user = keystone service tenant user password} + rgw keystone accepted roles = {accepted user roles} + rgw keystone token cache size = {number of tokens to cache} + rgw s3 auth use keystone = true + +The details of the integrating ldap with Ceph Object Gateway can be found here: +:doc:`keystone` + +The complete set of configurables to use STS Lite with LDAP are:: + + [client.radosgw.gateway] + rgw sts key = {sts key for encrypting/ decrypting the session token} + rgw s3 auth use sts = true + + rgw_s3_auth_use_ldap = true + rgw_ldap_uri = {LDAP server to use} + rgw_ldap_binddn = {Distinguished Name (DN) of the service account} + rgw_ldap_secret = {password for the service account} + rgw_ldap_searchdn = {base in the directory information tree for searching users} + rgw_ldap_dnattr = {attribute being used in the constructed search filter to match a username} + rgw_ldap_searchfilter = {search filter} + +The details of the integrating ldap with Ceph Object Gateway can be found here: +:doc:`ldap-auth` + +Note: By default, STS and S3 APIs co-exist in the same namespace, and both S3 +and STS APIs can be accessed via the same endpoint in Ceph Object Gateway. + +Example showing how to Use STS Lite with Keystone +================================================= + +The following are the steps needed to use STS Lite with Keystone. Boto 3.x has +been used to write an example code to show the integration of STS Lite with +Keystone. + +1. Generate EC2 credentials : + +.. code-block:: javascript + + openstack ec2 credentials create + +------------+--------------------------------------------------------+ + | Field | Value | + +------------+--------------------------------------------------------+ + | access | b924dfc87d454d15896691182fdeb0ef | + | links | {u'self': u'http://192.168.0.15/identity/v3/users/ | + | | 40a7140e424f493d8165abc652dc731c/credentials/ | + | | OS-EC2/b924dfc87d454d15896691182fdeb0ef'} | + | project_id | c703801dccaf4a0aaa39bec8c481e25a | + | secret | 6a2142613c504c42a94ba2b82147dc28 | + | trust_id | None | + | user_id | 40a7140e424f493d8165abc652dc731c | + +------------+--------------------------------------------------------+ + +2. Use the credentials created in the step 1. to get back a set of temporary + credentials using GetSessionToken API. + +.. code-block:: python + + import boto3 + + access_key = <ec2 access key> + secret_key = <ec2 secret key> + + client = boto3.client('sts', + aws_access_key_id=access_key, + aws_secret_access_key=secret_key, + endpoint_url=<STS URL>, + region_name='', + ) + + response = client.get_session_token( + DurationSeconds=43200 + ) + +3. The temporary credentials obtained in step 2. can be used for making S3 calls: + +.. code-block:: python + + s3client = boto3.client('s3', + aws_access_key_id = response['Credentials']['AccessKeyId'], + aws_secret_access_key = response['Credentials']['SecretAccessKey'], + aws_session_token = response['Credentials']['SessionToken'], + endpoint_url=<S3 URL>, + region_name='') + + bucket = s3client.create_bucket(Bucket='my-new-shiny-bucket') + response = s3client.list_buckets() + for bucket in response["Buckets"]: + print "{name}\t{created}".format( + name = bucket['Name'], + created = bucket['CreationDate'], + ) + +Similar steps can be performed for using GetSessionToken with LDAP. + +Limitations and Workarounds +=========================== + +1. Keystone currently supports only S3 requests, hence in order to successfully +authenticate an STS request, the following workaround needs to be added to boto +to the following file - botocore/auth.py + +Lines 13-16 have been added as a workaround in the code block below: + +.. code-block:: python + + class SigV4Auth(BaseSigner): + """ + Sign a request with Signature V4. + """ + REQUIRES_REGION = True + + def __init__(self, credentials, service_name, region_name): + self.credentials = credentials + # We initialize these value here so the unit tests can have + # valid values. But these will get overridden in ``add_auth`` + # later for real requests. + self._region_name = region_name + if service_name == 'sts': + self._service_name = 's3' + else: + self._service_name = service_name + diff --git a/doc/radosgw/admin.rst b/doc/radosgw/admin.rst new file mode 100644 index 000000000..5f47471ac --- /dev/null +++ b/doc/radosgw/admin.rst @@ -0,0 +1,528 @@ +============= + Admin Guide +============= + +Once you have your Ceph Object Storage service up and running, you may +administer the service with user management, access controls, quotas +and usage tracking among other features. + + +User Management +=============== + +Ceph Object Storage user management refers to users of the Ceph Object Storage +service (i.e., not the Ceph Object Gateway as a user of the Ceph Storage +Cluster). You must create a user, access key and secret to enable end users to +interact with Ceph Object Gateway services. + +There are two user types: + +- **User:** The term 'user' reflects a user of the S3 interface. + +- **Subuser:** The term 'subuser' reflects a user of the Swift interface. A subuser + is associated to a user . + +.. ditaa:: + +---------+ + | User | + +----+----+ + | + | +-----------+ + +-----+ Subuser | + +-----------+ + +You can create, modify, view, suspend and remove users and subusers. In addition +to user and subuser IDs, you may add a display name and an email address for a +user. You can specify a key and secret, or generate a key and secret +automatically. When generating or specifying keys, note that user IDs correspond +to an S3 key type and subuser IDs correspond to a swift key type. Swift keys +also have access levels of ``read``, ``write``, ``readwrite`` and ``full``. + + +Create a User +------------- + +To create a user (S3 interface), execute the following:: + + radosgw-admin user create --uid={username} --display-name="{display-name}" [--email={email}] + +For example:: + + radosgw-admin user create --uid=johndoe --display-name="John Doe" --email=john@example.com + +.. code-block:: javascript + + { "user_id": "johndoe", + "display_name": "John Doe", + "email": "john@example.com", + "suspended": 0, + "max_buckets": 1000, + "subusers": [], + "keys": [ + { "user": "johndoe", + "access_key": "11BS02LGFB6AL6H1ADMW", + "secret_key": "vzCEkuryfn060dfee4fgQPqFrncKEIkh3ZcdOANY"}], + "swift_keys": [], + "caps": [], + "op_mask": "read, write, delete", + "default_placement": "", + "placement_tags": [], + "bucket_quota": { "enabled": false, + "max_size_kb": -1, + "max_objects": -1}, + "user_quota": { "enabled": false, + "max_size_kb": -1, + "max_objects": -1}, + "temp_url_keys": []} + +Creating a user also creates an ``access_key`` and ``secret_key`` entry for use +with any S3 API-compatible client. + +.. important:: Check the key output. Sometimes ``radosgw-admin`` + generates a JSON escape (``\``) character, and some clients + do not know how to handle JSON escape characters. Remedies include + removing the JSON escape character (``\``), encapsulating the string + in quotes, regenerating the key and ensuring that it + does not have a JSON escape character or specify the key and secret + manually. + + +Create a Subuser +---------------- + +To create a subuser (Swift interface) for the user, you must specify the user ID +(``--uid={username}``), a subuser ID and the access level for the subuser. :: + + radosgw-admin subuser create --uid={uid} --subuser={uid} --access=[ read | write | readwrite | full ] + +For example:: + + radosgw-admin subuser create --uid=johndoe --subuser=johndoe:swift --access=full + + +.. note:: ``full`` is not ``readwrite``, as it also includes the access control policy. + +.. code-block:: javascript + + { "user_id": "johndoe", + "display_name": "John Doe", + "email": "john@example.com", + "suspended": 0, + "max_buckets": 1000, + "subusers": [ + { "id": "johndoe:swift", + "permissions": "full-control"}], + "keys": [ + { "user": "johndoe", + "access_key": "11BS02LGFB6AL6H1ADMW", + "secret_key": "vzCEkuryfn060dfee4fgQPqFrncKEIkh3ZcdOANY"}], + "swift_keys": [], + "caps": [], + "op_mask": "read, write, delete", + "default_placement": "", + "placement_tags": [], + "bucket_quota": { "enabled": false, + "max_size_kb": -1, + "max_objects": -1}, + "user_quota": { "enabled": false, + "max_size_kb": -1, + "max_objects": -1}, + "temp_url_keys": []} + + +Get User Info +------------- + +To get information about a user, you must specify ``user info`` and the user ID +(``--uid={username}``) . :: + + radosgw-admin user info --uid=johndoe + + + +Modify User Info +---------------- + +To modify information about a user, you must specify the user ID (``--uid={username}``) +and the attributes you want to modify. Typical modifications are to keys and secrets, +email addresses, display names and access levels. For example:: + + radosgw-admin user modify --uid=johndoe --display-name="John E. Doe" + +To modify subuser values, specify ``subuser modify``, user ID and the subuser ID. For example:: + + radosgw-admin subuser modify --uid=johndoe --subuser=johndoe:swift --access=full + + +User Enable/Suspend +------------------- + +When you create a user, the user is enabled by default. However, you may suspend +user privileges and re-enable them at a later time. To suspend a user, specify +``user suspend`` and the user ID. :: + + radosgw-admin user suspend --uid=johndoe + +To re-enable a suspended user, specify ``user enable`` and the user ID. :: + + radosgw-admin user enable --uid=johndoe + +.. note:: Disabling the user disables the subuser. + + +Remove a User +------------- + +When you remove a user, the user and subuser are removed from the system. +However, you may remove just the subuser if you wish. To remove a user (and +subuser), specify ``user rm`` and the user ID. :: + + radosgw-admin user rm --uid=johndoe + +To remove the subuser only, specify ``subuser rm`` and the subuser ID. :: + + radosgw-admin subuser rm --subuser=johndoe:swift + + +Options include: + +- **Purge Data:** The ``--purge-data`` option purges all data associated + to the UID. + +- **Purge Keys:** The ``--purge-keys`` option purges all keys associated + to the UID. + + +Remove a Subuser +---------------- + +When you remove a sub user, you are removing access to the Swift interface. +The user will remain in the system. To remove the subuser, specify +``subuser rm`` and the subuser ID. :: + + radosgw-admin subuser rm --subuser=johndoe:swift + + + +Options include: + +- **Purge Keys:** The ``--purge-keys`` option purges all keys associated + to the UID. + + +Add / Remove a Key +------------------------ + +Both users and subusers require the key to access the S3 or Swift interface. To +use S3, the user needs a key pair which is composed of an access key and a +secret key. On the other hand, to use Swift, the user typically needs a secret +key (password), and use it together with the associated user ID. You may create +a key and either specify or generate the access key and/or secret key. You may +also remove a key. Options include: + +- ``--key-type=<type>`` specifies the key type. The options are: s3, swift +- ``--access-key=<key>`` manually specifies an S3 access key. +- ``--secret-key=<key>`` manually specifies a S3 secret key or a Swift secret key. +- ``--gen-access-key`` automatically generates a random S3 access key. +- ``--gen-secret`` automatically generates a random S3 secret key or a random Swift secret key. + +An example how to add a specified S3 key pair for a user. :: + + radosgw-admin key create --uid=foo --key-type=s3 --access-key fooAccessKey --secret-key fooSecretKey + +.. code-block:: javascript + + { "user_id": "foo", + "rados_uid": 0, + "display_name": "foo", + "email": "foo@example.com", + "suspended": 0, + "keys": [ + { "user": "foo", + "access_key": "fooAccessKey", + "secret_key": "fooSecretKey"}], + } + +Note that you may create multiple S3 key pairs for a user. + +To attach a specified swift secret key for a subuser. :: + + radosgw-admin key create --subuser=foo:bar --key-type=swift --secret-key barSecret + +.. code-block:: javascript + + { "user_id": "foo", + "rados_uid": 0, + "display_name": "foo", + "email": "foo@example.com", + "suspended": 0, + "subusers": [ + { "id": "foo:bar", + "permissions": "full-control"}], + "swift_keys": [ + { "user": "foo:bar", + "secret_key": "asfghjghghmgm"}]} + +Note that a subuser can have only one swift secret key. + +Subusers can also be used with S3 APIs if the subuser is associated with a S3 key pair. :: + + radosgw-admin key create --subuser=foo:bar --key-type=s3 --access-key barAccessKey --secret-key barSecretKey + +.. code-block:: javascript + + { "user_id": "foo", + "rados_uid": 0, + "display_name": "foo", + "email": "foo@example.com", + "suspended": 0, + "subusers": [ + { "id": "foo:bar", + "permissions": "full-control"}], + "keys": [ + { "user": "foo:bar", + "access_key": "barAccessKey", + "secret_key": "barSecretKey"}], + } + + +To remove a S3 key pair, specify the access key. :: + + radosgw-admin key rm --uid=foo --key-type=s3 --access-key=fooAccessKey + +To remove the swift secret key. :: + + radosgw-admin key rm --subuser=foo:bar --key-type=swift + + +Add / Remove Admin Capabilities +------------------------------- + +The Ceph Storage Cluster provides an administrative API that enables users to +execute administrative functions via the REST API. By default, users do NOT have +access to this API. To enable a user to exercise administrative functionality, +provide the user with administrative capabilities. + +To add administrative capabilities to a user, execute the following:: + + radosgw-admin caps add --uid={uid} --caps={caps} + + +You can add read, write or all capabilities to users, buckets, metadata and +usage (utilization). For example:: + + --caps="[users|buckets|metadata|usage|zone]=[*|read|write|read, write]" + +For example:: + + radosgw-admin caps add --uid=johndoe --caps="users=*;buckets=*" + + +To remove administrative capabilities from a user, execute the following:: + + radosgw-admin caps rm --uid=johndoe --caps={caps} + + +Quota Management +================ + +The Ceph Object Gateway enables you to set quotas on users and buckets owned by +users. Quotas include the maximum number of objects in a bucket and the maximum +storage size a bucket can hold. + +- **Bucket:** The ``--bucket`` option allows you to specify a quota for + buckets the user owns. + +- **Maximum Objects:** The ``--max-objects`` setting allows you to specify + the maximum number of objects. A negative value disables this setting. + +- **Maximum Size:** The ``--max-size`` option allows you to specify a quota + size in B/K/M/G/T, where B is the default. A negative value disables this setting. + +- **Quota Scope:** The ``--quota-scope`` option sets the scope for the quota. + The options are ``bucket`` and ``user``. Bucket quotas apply to buckets a + user owns. User quotas apply to a user. + + +Set User Quota +-------------- + +Before you enable a quota, you must first set the quota parameters. +For example:: + + radosgw-admin quota set --quota-scope=user --uid=<uid> [--max-objects=<num objects>] [--max-size=<max size>] + +For example:: + + radosgw-admin quota set --quota-scope=user --uid=johndoe --max-objects=1024 --max-size=1024B + + +A negative value for num objects and / or max size means that the +specific quota attribute check is disabled. + + +Enable/Disable User Quota +------------------------- + +Once you set a user quota, you may enable it. For example:: + + radosgw-admin quota enable --quota-scope=user --uid=<uid> + +You may disable an enabled user quota. For example:: + + radosgw-admin quota disable --quota-scope=user --uid=<uid> + + +Set Bucket Quota +---------------- + +Bucket quotas apply to the buckets owned by the specified ``uid``. They are +independent of the user. :: + + radosgw-admin quota set --uid=<uid> --quota-scope=bucket [--max-objects=<num objects>] [--max-size=<max size] + +A negative value for num objects and / or max size means that the +specific quota attribute check is disabled. + + +Enable/Disable Bucket Quota +--------------------------- + +Once you set a bucket quota, you may enable it. For example:: + + radosgw-admin quota enable --quota-scope=bucket --uid=<uid> + +You may disable an enabled bucket quota. For example:: + + radosgw-admin quota disable --quota-scope=bucket --uid=<uid> + + +Get Quota Settings +------------------ + +You may access each user's quota settings via the user information +API. To read user quota setting information with the CLI interface, +execute the following:: + + radosgw-admin user info --uid=<uid> + + +Update Quota Stats +------------------ + +Quota stats get updated asynchronously. You can update quota +statistics for all users and all buckets manually to retrieve +the latest quota stats. :: + + radosgw-admin user stats --uid=<uid> --sync-stats + +.. _rgw_user_usage_stats: + +Get User Usage Stats +-------------------- + +To see how much of the quota a user has consumed, execute the following:: + + radosgw-admin user stats --uid=<uid> + +.. note:: You should execute ``radosgw-admin user stats`` with the + ``--sync-stats`` option to receive the latest data. + +Default Quotas +-------------- + +You can set default quotas in the config. These defaults are used when +creating a new user and have no effect on existing users. If the +relevant default quota is set in config, then that quota is set on the +new user, and that quota is enabled. See ``rgw bucket default quota max objects``, +``rgw bucket default quota max size``, ``rgw user default quota max objects``, and +``rgw user default quota max size`` in `Ceph Object Gateway Config Reference`_ + +Quota Cache +----------- + +Quota statistics are cached on each RGW instance. If there are multiple +instances, then the cache can keep quotas from being perfectly enforced, as +each instance will have a different view of quotas. The options that control +this are ``rgw bucket quota ttl``, ``rgw user quota bucket sync interval`` and +``rgw user quota sync interval``. The higher these values are, the more +efficient quota operations are, but the more out-of-sync multiple instances +will be. The lower these values are, the closer to perfect enforcement +multiple instances will achieve. If all three are 0, then quota caching is +effectively disabled, and multiple instances will have perfect quota +enforcement. See `Ceph Object Gateway Config Reference`_ + +Reading / Writing Global Quotas +------------------------------- + +You can read and write global quota settings in the period configuration. To +view the global quota settings:: + + radosgw-admin global quota get + +The global quota settings can be manipulated with the ``global quota`` +counterparts of the ``quota set``, ``quota enable``, and ``quota disable`` +commands. :: + + radosgw-admin global quota set --quota-scope bucket --max-objects 1024 + radosgw-admin global quota enable --quota-scope bucket + +.. note:: In a multisite configuration, where there is a realm and period + present, changes to the global quotas must be committed using ``period + update --commit``. If there is no period present, the rados gateway(s) must + be restarted for the changes to take effect. + + +Usage +===== + +The Ceph Object Gateway logs usage for each user. You can track +user usage within date ranges too. + +- Add ``rgw enable usage log = true`` in [client.rgw] section of ceph.conf and restart the radosgw service. + +Options include: + +- **Start Date:** The ``--start-date`` option allows you to filter usage + stats from a particular start date (**format:** ``yyyy-mm-dd[HH:MM:SS]``). + +- **End Date:** The ``--end-date`` option allows you to filter usage up + to a particular date (**format:** ``yyyy-mm-dd[HH:MM:SS]``). + +- **Log Entries:** The ``--show-log-entries`` option allows you to specify + whether or not to include log entries with the usage stats + (options: ``true`` | ``false``). + +.. note:: You may specify time with minutes and seconds, but it is stored + with 1 hour resolution. + + +Show Usage +---------- + +To show usage statistics, specify the ``usage show``. To show usage for a +particular user, you must specify a user ID. You may also specify a start date, +end date, and whether or not to show log entries.:: + + radosgw-admin usage show --uid=johndoe --start-date=2012-03-01 --end-date=2012-04-01 + +You may also show a summary of usage information for all users by omitting a user ID. :: + + radosgw-admin usage show --show-log-entries=false + + +Trim Usage +---------- + +With heavy use, usage logs can begin to take up storage space. You can trim +usage logs for all users and for specific users. You may also specify date +ranges for trim operations. :: + + radosgw-admin usage trim --start-date=2010-01-01 --end-date=2010-12-31 + radosgw-admin usage trim --uid=johndoe + radosgw-admin usage trim --uid=johndoe --end-date=2013-12-31 + + +.. _radosgw-admin: ../../man/8/radosgw-admin/ +.. _Pool Configuration: ../../rados/configuration/pool-pg-config-ref/ +.. _Ceph Object Gateway Config Reference: ../config-ref/ diff --git a/doc/radosgw/adminops.rst b/doc/radosgw/adminops.rst new file mode 100644 index 000000000..2dfd9f6d9 --- /dev/null +++ b/doc/radosgw/adminops.rst @@ -0,0 +1,1994 @@ +================== + Admin Operations +================== + +An admin API request will be done on a URI that starts with the configurable 'admin' +resource entry point. Authorization for the admin API duplicates the S3 authorization +mechanism. Some operations require that the user holds special administrative capabilities. +The response entity type (XML or JSON) may be specified as the 'format' option in the +request and defaults to JSON if not specified. + +Get Usage +========= + +Request bandwidth usage information. + +Note: this feature is disabled by default, can be enabled by setting ``rgw +enable usage log = true`` in the appropriate section of ceph.conf. For changes +in ceph.conf to take effect, radosgw process restart is needed. + +:caps: usage=read + +Syntax +~~~~~~ + +:: + + GET /{admin}/usage?format=json HTTP/1.1 + Host: {fqdn} + + + +Request Parameters +~~~~~~~~~~~~~~~~~~ + +``uid`` + +:Description: The user for which the information is requested. If not specified will apply to all users. +:Type: String +:Example: ``foo_user`` +:Required: No + +``start`` + +:Description: Date and (optional) time that specifies the start time of the requested data. +:Type: String +:Example: ``2012-09-25 16:00:00`` +:Required: No + +``end`` + +:Description: Date and (optional) time that specifies the end time of the requested data (non-inclusive). +:Type: String +:Example: ``2012-09-25 16:00:00`` +:Required: No + + +``show-entries`` + +:Description: Specifies whether data entries should be returned. +:Type: Boolean +:Example: True [True] +:Required: No + + +``show-summary`` + +:Description: Specifies whether data summary should be returned. +:Type: Boolean +:Example: True [True] +:Required: No + + + +Response Entities +~~~~~~~~~~~~~~~~~ + +If successful, the response contains the requested information. + +``usage`` + +:Description: A container for the usage information. +:Type: Container + +``entries`` + +:Description: A container for the usage entries information. +:Type: Container + +``user`` + +:Description: A container for the user data information. +:Type: Container + +``owner`` + +:Description: The name of the user that owns the buckets. +:Type: String + +``bucket`` + +:Description: The bucket name. +:Type: String + +``time`` + +:Description: Time lower bound for which data is being specified (rounded to the beginning of the first relevant hour). +:Type: String + +``epoch`` + +:Description: The time specified in seconds since 1/1/1970. +:Type: String + +``categories`` + +:Description: A container for stats categories. +:Type: Container + +``entry`` + +:Description: A container for stats entry. +:Type: Container + +``category`` + +:Description: Name of request category for which the stats are provided. +:Type: String + +``bytes_sent`` + +:Description: Number of bytes sent by the RADOS Gateway. +:Type: Integer + +``bytes_received`` + +:Description: Number of bytes received by the RADOS Gateway. +:Type: Integer + +``ops`` + +:Description: Number of operations. +:Type: Integer + +``successful_ops`` + +:Description: Number of successful operations. +:Type: Integer + +``summary`` + +:Description: A container for stats summary. +:Type: Container + +``total`` + +:Description: A container for stats summary aggregated total. +:Type: Container + +Special Error Responses +~~~~~~~~~~~~~~~~~~~~~~~ + +TBD. + +Trim Usage +========== + +Remove usage information. With no dates specified, removes all usage +information. + +Note: this feature is disabled by default, can be enabled by setting ``rgw +enable usage log = true`` in the appropriate section of ceph.conf. For changes +in ceph.conf to take effect, radosgw process restart is needed. + +:caps: usage=write + +Syntax +~~~~~~ + +:: + + DELETE /{admin}/usage?format=json HTTP/1.1 + Host: {fqdn} + + + +Request Parameters +~~~~~~~~~~~~~~~~~~ + +``uid`` + +:Description: The user for which the information is requested. If not specified will apply to all users. +:Type: String +:Example: ``foo_user`` +:Required: No + +``start`` + +:Description: Date and (optional) time that specifies the start time of the requested data. +:Type: String +:Example: ``2012-09-25 16:00:00`` +:Required: No + +``end`` + +:Description: Date and (optional) time that specifies the end time of the requested data (none inclusive). +:Type: String +:Example: ``2012-09-25 16:00:00`` +:Required: No + + +``remove-all`` + +:Description: Required when uid is not specified, in order to acknowledge multi user data removal. +:Type: Boolean +:Example: True [False] +:Required: No + +Special Error Responses +~~~~~~~~~~~~~~~~~~~~~~~ + +TBD. + +Get User Info +============= + +Get user information. + +:caps: users=read + + +Syntax +~~~~~~ + +:: + + GET /{admin}/user?format=json HTTP/1.1 + Host: {fqdn} + + +Request Parameters +~~~~~~~~~~~~~~~~~~ + +``uid`` + +:Description: The user for which the information is requested. +:Type: String +:Example: ``foo_user`` +:Required: Yes + + +Response Entities +~~~~~~~~~~~~~~~~~ + +If successful, the response contains the user information. + +``user`` + +:Description: A container for the user data information. +:Type: Container + +``user_id`` + +:Description: The user id. +:Type: String +:Parent: ``user`` + +``display_name`` + +:Description: Display name for the user. +:Type: String +:Parent: ``user`` + +``suspended`` + +:Description: True if the user is suspended. +:Type: Boolean +:Parent: ``user`` + +``max_buckets`` + +:Description: The maximum number of buckets to be owned by the user. +:Type: Integer +:Parent: ``user`` + +``subusers`` + +:Description: Subusers associated with this user account. +:Type: Container +:Parent: ``user`` + +``keys`` + +:Description: S3 keys associated with this user account. +:Type: Container +:Parent: ``user`` + +``swift_keys`` + +:Description: Swift keys associated with this user account. +:Type: Container +:Parent: ``user`` + +``caps`` + +:Description: User capabilities. +:Type: Container +:Parent: ``user`` + +Special Error Responses +~~~~~~~~~~~~~~~~~~~~~~~ + +None. + +Create User +=========== + +Create a new user. By default, a S3 key pair will be created automatically +and returned in the response. If only one of ``access-key`` or ``secret-key`` +is provided, the omitted key will be automatically generated. By default, a +generated key is added to the keyring without replacing an existing key pair. +If ``access-key`` is specified and refers to an existing key owned by the user +then it will be modified. + +.. versionadded:: Luminous + +A ``tenant`` may either be specified as a part of uid or as an additional +request param. + +:caps: users=write + +Syntax +~~~~~~ + +:: + + PUT /{admin}/user?format=json HTTP/1.1 + Host: {fqdn} + + + +Request Parameters +~~~~~~~~~~~~~~~~~~ + +``uid`` + +:Description: The user ID to be created. +:Type: String +:Example: ``foo_user`` +:Required: Yes + +A tenant name may also specified as a part of ``uid``, by following the syntax +``tenant$user``, refer to :ref:`Multitenancy <rgw-multitenancy>` for more details. + +``display-name`` + +:Description: The display name of the user to be created. +:Type: String +:Example: ``foo user`` +:Required: Yes + + +``email`` + +:Description: The email address associated with the user. +:Type: String +:Example: ``foo@bar.com`` +:Required: No + +``key-type`` + +:Description: Key type to be generated, options are: swift, s3 (default). +:Type: String +:Example: ``s3`` [``s3``] +:Required: No + +``access-key`` + +:Description: Specify access key. +:Type: String +:Example: ``ABCD0EF12GHIJ2K34LMN`` +:Required: No + + +``secret-key`` + +:Description: Specify secret key. +:Type: String +:Example: ``0AbCDEFg1h2i34JklM5nop6QrSTUV+WxyzaBC7D8`` +:Required: No + +``user-caps`` + +:Description: User capabilities. +:Type: String +:Example: ``usage=read, write; users=read`` +:Required: No + +``generate-key`` + +:Description: Generate a new key pair and add to the existing keyring. +:Type: Boolean +:Example: True [True] +:Required: No + +``max-buckets`` + +:Description: Specify the maximum number of buckets the user can own. +:Type: Integer +:Example: 500 [1000] +:Required: No + +``suspended`` + +:Description: Specify whether the user should be suspended. +:Type: Boolean +:Example: False [False] +:Required: No + +.. versionadded:: Jewel + +``tenant`` + +:Description: the Tenant under which a user is a part of. +:Type: string +:Example: tenant1 +:Required: No + +Response Entities +~~~~~~~~~~~~~~~~~ + +If successful, the response contains the user information. + +``user`` + +:Description: A container for the user data information. +:Type: Container + +``tenant`` + +:Description: The tenant which user is a part of. +:Type: String +:Parent: ``user`` + +``user_id`` + +:Description: The user id. +:Type: String +:Parent: ``user`` + +``display_name`` + +:Description: Display name for the user. +:Type: String +:Parent: ``user`` + +``suspended`` + +:Description: True if the user is suspended. +:Type: Boolean +:Parent: ``user`` + +``max_buckets`` + +:Description: The maximum number of buckets to be owned by the user. +:Type: Integer +:Parent: ``user`` + +``subusers`` + +:Description: Subusers associated with this user account. +:Type: Container +:Parent: ``user`` + +``keys`` + +:Description: S3 keys associated with this user account. +:Type: Container +:Parent: ``user`` + +``swift_keys`` + +:Description: Swift keys associated with this user account. +:Type: Container +:Parent: ``user`` + +``caps`` + +:Description: User capabilities. +:Type: Container +:Parent: ``user`` + +Special Error Responses +~~~~~~~~~~~~~~~~~~~~~~~ + +``UserExists`` + +:Description: Attempt to create existing user. +:Code: 409 Conflict + +``InvalidAccessKey`` + +:Description: Invalid access key specified. +:Code: 400 Bad Request + +``InvalidKeyType`` + +:Description: Invalid key type specified. +:Code: 400 Bad Request + +``InvalidSecretKey`` + +:Description: Invalid secret key specified. +:Code: 400 Bad Request + +``InvalidKeyType`` + +:Description: Invalid key type specified. +:Code: 400 Bad Request + +``KeyExists`` + +:Description: Provided access key exists and belongs to another user. +:Code: 409 Conflict + +``EmailExists`` + +:Description: Provided email address exists. +:Code: 409 Conflict + +``InvalidCapability`` + +:Description: Attempt to grant invalid admin capability. +:Code: 400 Bad Request + + +Modify User +=========== + +Modify a user. + +:caps: users=write + +Syntax +~~~~~~ + +:: + + POST /{admin}/user?format=json HTTP/1.1 + Host: {fqdn} + + +Request Parameters +~~~~~~~~~~~~~~~~~~ + +``uid`` + +:Description: The user ID to be modified. +:Type: String +:Example: ``foo_user`` +:Required: Yes + +``display-name`` + +:Description: The display name of the user to be modified. +:Type: String +:Example: ``foo user`` +:Required: No + +``email`` + +:Description: The email address to be associated with the user. +:Type: String +:Example: ``foo@bar.com`` +:Required: No + +``generate-key`` + +:Description: Generate a new key pair and add to the existing keyring. +:Type: Boolean +:Example: True [False] +:Required: No + +``access-key`` + +:Description: Specify access key. +:Type: String +:Example: ``ABCD0EF12GHIJ2K34LMN`` +:Required: No + +``secret-key`` + +:Description: Specify secret key. +:Type: String +:Example: ``0AbCDEFg1h2i34JklM5nop6QrSTUV+WxyzaBC7D8`` +:Required: No + +``key-type`` + +:Description: Key type to be generated, options are: swift, s3 (default). +:Type: String +:Example: ``s3`` +:Required: No + +``user-caps`` + +:Description: User capabilities. +:Type: String +:Example: ``usage=read, write; users=read`` +:Required: No + +``max-buckets`` + +:Description: Specify the maximum number of buckets the user can own. +:Type: Integer +:Example: 500 [1000] +:Required: No + +``suspended`` + +:Description: Specify whether the user should be suspended. +:Type: Boolean +:Example: False [False] +:Required: No + +``op-mask`` + +:Description: The op-mask of the user to be modified. +:Type: String +:Example: ``read, write, delete, *`` +:Required: No + +Response Entities +~~~~~~~~~~~~~~~~~ + +If successful, the response contains the user information. + +``user`` + +:Description: A container for the user data information. +:Type: Container + +``user_id`` + +:Description: The user id. +:Type: String +:Parent: ``user`` + +``display_name`` + +:Description: Display name for the user. +:Type: String +:Parent: ``user`` + + +``suspended`` + +:Description: True if the user is suspended. +:Type: Boolean +:Parent: ``user`` + + +``max_buckets`` + +:Description: The maximum number of buckets to be owned by the user. +:Type: Integer +:Parent: ``user`` + + +``subusers`` + +:Description: Subusers associated with this user account. +:Type: Container +:Parent: ``user`` + + +``keys`` + +:Description: S3 keys associated with this user account. +:Type: Container +:Parent: ``user`` + + +``swift_keys`` + +:Description: Swift keys associated with this user account. +:Type: Container +:Parent: ``user`` + + +``caps`` + +:Description: User capabilities. +:Type: Container +:Parent: ``user`` + + +Special Error Responses +~~~~~~~~~~~~~~~~~~~~~~~ + +``InvalidAccessKey`` + +:Description: Invalid access key specified. +:Code: 400 Bad Request + +``InvalidKeyType`` + +:Description: Invalid key type specified. +:Code: 400 Bad Request + +``InvalidSecretKey`` + +:Description: Invalid secret key specified. +:Code: 400 Bad Request + +``KeyExists`` + +:Description: Provided access key exists and belongs to another user. +:Code: 409 Conflict + +``EmailExists`` + +:Description: Provided email address exists. +:Code: 409 Conflict + +``InvalidCapability`` + +:Description: Attempt to grant invalid admin capability. +:Code: 400 Bad Request + +Remove User +=========== + +Remove an existing user. + +:caps: users=write + +Syntax +~~~~~~ + +:: + + DELETE /{admin}/user?format=json HTTP/1.1 + Host: {fqdn} + + +Request Parameters +~~~~~~~~~~~~~~~~~~ + +``uid`` + +:Description: The user ID to be removed. +:Type: String +:Example: ``foo_user`` +:Required: Yes. + +``purge-data`` + +:Description: When specified the buckets and objects belonging + to the user will also be removed. +:Type: Boolean +:Example: True +:Required: No + +Response Entities +~~~~~~~~~~~~~~~~~ + +None + +Special Error Responses +~~~~~~~~~~~~~~~~~~~~~~~ + +None. + +Create Subuser +============== + +Create a new subuser (primarily useful for clients using the Swift API). +Note that in general for a subuser to be useful, it must be granted +permissions by specifying ``access``. As with user creation if +``subuser`` is specified without ``secret``, then a secret key will +be automatically generated. + +:caps: users=write + +Syntax +~~~~~~ + +:: + + PUT /{admin}/user?subuser&format=json HTTP/1.1 + Host {fqdn} + + +Request Parameters +~~~~~~~~~~~~~~~~~~ + +``uid`` + +:Description: The user ID under which a subuser is to be created. +:Type: String +:Example: ``foo_user`` +:Required: Yes + + +``subuser`` + +:Description: Specify the subuser ID to be created. +:Type: String +:Example: ``sub_foo`` +:Required: Yes + +``secret-key`` + +:Description: Specify secret key. +:Type: String +:Example: ``0AbCDEFg1h2i34JklM5nop6QrSTUV+WxyzaBC7D8`` +:Required: No + +``key-type`` + +:Description: Key type to be generated, options are: swift (default), s3. +:Type: String +:Example: ``swift`` [``swift``] +:Required: No + +``access`` + +:Description: Set access permissions for sub-user, should be one + of ``read, write, readwrite, full``. +:Type: String +:Example: ``read`` +:Required: No + +``generate-secret`` + +:Description: Generate the secret key. +:Type: Boolean +:Example: True [False] +:Required: No + +Response Entities +~~~~~~~~~~~~~~~~~ + +If successful, the response contains the subuser information. + + +``subusers`` + +:Description: Subusers associated with the user account. +:Type: Container + +``id`` + +:Description: Subuser id. +:Type: String +:Parent: ``subusers`` + +``permissions`` + +:Description: Subuser access to user account. +:Type: String +:Parent: ``subusers`` + +Special Error Responses +~~~~~~~~~~~~~~~~~~~~~~~ + +``SubuserExists`` + +:Description: Specified subuser exists. +:Code: 409 Conflict + +``InvalidKeyType`` + +:Description: Invalid key type specified. +:Code: 400 Bad Request + +``InvalidSecretKey`` + +:Description: Invalid secret key specified. +:Code: 400 Bad Request + +``InvalidAccess`` + +:Description: Invalid subuser access specified. +:Code: 400 Bad Request + +Modify Subuser +============== + +Modify an existing subuser + +:caps: users=write + +Syntax +~~~~~~ + +:: + + POST /{admin}/user?subuser&format=json HTTP/1.1 + Host {fqdn} + + +Request Parameters +~~~~~~~~~~~~~~~~~~ + +``uid`` + +:Description: The user ID under which the subuser is to be modified. +:Type: String +:Example: ``foo_user`` +:Required: Yes + +``subuser`` + +:Description: The subuser ID to be modified. +:Type: String +:Example: ``sub_foo`` +:Required: Yes + +``generate-secret`` + +:Description: Generate a new secret key for the subuser, + replacing the existing key. +:Type: Boolean +:Example: True [False] +:Required: No + +``secret`` + +:Description: Specify secret key. +:Type: String +:Example: ``0AbCDEFg1h2i34JklM5nop6QrSTUV+WxyzaBC7D8`` +:Required: No + +``key-type`` + +:Description: Key type to be generated, options are: swift (default), s3 . +:Type: String +:Example: ``swift`` [``swift``] +:Required: No + +``access`` + +:Description: Set access permissions for sub-user, should be one + of ``read, write, readwrite, full``. +:Type: String +:Example: ``read`` +:Required: No + + +Response Entities +~~~~~~~~~~~~~~~~~ + +If successful, the response contains the subuser information. + + +``subusers`` + +:Description: Subusers associated with the user account. +:Type: Container + +``id`` + +:Description: Subuser id. +:Type: String +:Parent: ``subusers`` + +``permissions`` + +:Description: Subuser access to user account. +:Type: String +:Parent: ``subusers`` + +Special Error Responses +~~~~~~~~~~~~~~~~~~~~~~~ + +``InvalidKeyType`` + +:Description: Invalid key type specified. +:Code: 400 Bad Request + +``InvalidSecretKey`` + +:Description: Invalid secret key specified. +:Code: 400 Bad Request + +``InvalidAccess`` + +:Description: Invalid subuser access specified. +:Code: 400 Bad Request + +Remove Subuser +============== + +Remove an existing subuser + +:caps: users=write + +Syntax +~~~~~~ + +:: + + DELETE /{admin}/user?subuser&format=json HTTP/1.1 + Host {fqdn} + + +Request Parameters +~~~~~~~~~~~~~~~~~~ + +``uid`` + +:Description: The user ID under which the subuser is to be removed. +:Type: String +:Example: ``foo_user`` +:Required: Yes + + +``subuser`` + +:Description: The subuser ID to be removed. +:Type: String +:Example: ``sub_foo`` +:Required: Yes + +``purge-keys`` + +:Description: Remove keys belonging to the subuser. +:Type: Boolean +:Example: True [True] +:Required: No + +Response Entities +~~~~~~~~~~~~~~~~~ + +None. + +Special Error Responses +~~~~~~~~~~~~~~~~~~~~~~~ +None. + +Create Key +========== + +Create a new key. If a ``subuser`` is specified then by default created keys +will be swift type. If only one of ``access-key`` or ``secret-key`` is provided the +committed key will be automatically generated, that is if only ``secret-key`` is +specified then ``access-key`` will be automatically generated. By default, a +generated key is added to the keyring without replacing an existing key pair. +If ``access-key`` is specified and refers to an existing key owned by the user +then it will be modified. The response is a container listing all keys of the same +type as the key created. Note that when creating a swift key, specifying the option +``access-key`` will have no effect. Additionally, only one swift key may be held by +each user or subuser. + +:caps: users=write + + +Syntax +~~~~~~ + +:: + + PUT /{admin}/user?key&format=json HTTP/1.1 + Host {fqdn} + + +Request Parameters +~~~~~~~~~~~~~~~~~~ + +``uid`` + +:Description: The user ID to receive the new key. +:Type: String +:Example: ``foo_user`` +:Required: Yes + +``subuser`` + +:Description: The subuser ID to receive the new key. +:Type: String +:Example: ``sub_foo`` +:Required: No + +``key-type`` + +:Description: Key type to be generated, options are: swift, s3 (default). +:Type: String +:Example: ``s3`` [``s3``] +:Required: No + +``access-key`` + +:Description: Specify the access key. +:Type: String +:Example: ``AB01C2D3EF45G6H7IJ8K`` +:Required: No + +``secret-key`` + +:Description: Specify the secret key. +:Type: String +:Example: ``0ab/CdeFGhij1klmnopqRSTUv1WxyZabcDEFgHij`` +:Required: No + +``generate-key`` + +:Description: Generate a new key pair and add to the existing keyring. +:Type: Boolean +:Example: True [``True``] +:Required: No + + +Response Entities +~~~~~~~~~~~~~~~~~ + +``keys`` + +:Description: Keys of type created associated with this user account. +:Type: Container + +``user`` + +:Description: The user account associated with the key. +:Type: String +:Parent: ``keys`` + +``access-key`` + +:Description: The access key. +:Type: String +:Parent: ``keys`` + +``secret-key`` + +:Description: The secret key +:Type: String +:Parent: ``keys`` + + +Special Error Responses +~~~~~~~~~~~~~~~~~~~~~~~ + +``InvalidAccessKey`` + +:Description: Invalid access key specified. +:Code: 400 Bad Request + +``InvalidKeyType`` + +:Description: Invalid key type specified. +:Code: 400 Bad Request + +``InvalidSecretKey`` + +:Description: Invalid secret key specified. +:Code: 400 Bad Request + +``InvalidKeyType`` + +:Description: Invalid key type specified. +:Code: 400 Bad Request + +``KeyExists`` + +:Description: Provided access key exists and belongs to another user. +:Code: 409 Conflict + +Remove Key +========== + +Remove an existing key. + +:caps: users=write + +Syntax +~~~~~~ + +:: + + DELETE /{admin}/user?key&format=json HTTP/1.1 + Host {fqdn} + + +Request Parameters +~~~~~~~~~~~~~~~~~~ + +``access-key`` + +:Description: The S3 access key belonging to the S3 key pair to remove. +:Type: String +:Example: ``AB01C2D3EF45G6H7IJ8K`` +:Required: Yes + +``uid`` + +:Description: The user to remove the key from. +:Type: String +:Example: ``foo_user`` +:Required: No + +``subuser`` + +:Description: The subuser to remove the key from. +:Type: String +:Example: ``sub_foo`` +:Required: No + +``key-type`` + +:Description: Key type to be removed, options are: swift, s3. + NOTE: Required to remove swift key. +:Type: String +:Example: ``swift`` +:Required: No + +Special Error Responses +~~~~~~~~~~~~~~~~~~~~~~~ + +None. + +Response Entities +~~~~~~~~~~~~~~~~~ + +None. + +Get Bucket Info +=============== + +Get information about a subset of the existing buckets. If ``uid`` is specified +without ``bucket`` then all buckets belonging to the user will be returned. If +``bucket`` alone is specified, information for that particular bucket will be +retrieved. + +:caps: buckets=read + +Syntax +~~~~~~ + +:: + + GET /{admin}/bucket?format=json HTTP/1.1 + Host {fqdn} + + +Request Parameters +~~~~~~~~~~~~~~~~~~ + +``bucket`` + +:Description: The bucket to return info on. +:Type: String +:Example: ``foo_bucket`` +:Required: No + +``uid`` + +:Description: The user to retrieve bucket information for. +:Type: String +:Example: ``foo_user`` +:Required: No + +``stats`` + +:Description: Return bucket statistics. +:Type: Boolean +:Example: True [False] +:Required: No + +Response Entities +~~~~~~~~~~~~~~~~~ + +If successful the request returns a buckets container containing +the desired bucket information. + +``stats`` + +:Description: Per bucket information. +:Type: Container + +``buckets`` + +:Description: Contains a list of one or more bucket containers. +:Type: Container + +``bucket`` + +:Description: Container for single bucket information. +:Type: Container +:Parent: ``buckets`` + +``name`` + +:Description: The name of the bucket. +:Type: String +:Parent: ``bucket`` + +``pool`` + +:Description: The pool the bucket is stored in. +:Type: String +:Parent: ``bucket`` + +``id`` + +:Description: The unique bucket id. +:Type: String +:Parent: ``bucket`` + +``marker`` + +:Description: Internal bucket tag. +:Type: String +:Parent: ``bucket`` + +``owner`` + +:Description: The user id of the bucket owner. +:Type: String +:Parent: ``bucket`` + +``usage`` + +:Description: Storage usage information. +:Type: Container +:Parent: ``bucket`` + +``index`` + +:Description: Status of bucket index. +:Type: String +:Parent: ``bucket`` + +Special Error Responses +~~~~~~~~~~~~~~~~~~~~~~~ + +``IndexRepairFailed`` + +:Description: Bucket index repair failed. +:Code: 409 Conflict + +Check Bucket Index +================== + +Check the index of an existing bucket. NOTE: to check multipart object +accounting with ``check-objects``, ``fix`` must be set to True. + +:caps: buckets=write + +Syntax +~~~~~~ + +:: + + GET /{admin}/bucket?index&format=json HTTP/1.1 + Host {fqdn} + + +Request Parameters +~~~~~~~~~~~~~~~~~~ + +``bucket`` + +:Description: The bucket to return info on. +:Type: String +:Example: ``foo_bucket`` +:Required: Yes + +``check-objects`` + +:Description: Check multipart object accounting. +:Type: Boolean +:Example: True [False] +:Required: No + +``fix`` + +:Description: Also fix the bucket index when checking. +:Type: Boolean +:Example: False [False] +:Required: No + +Response Entities +~~~~~~~~~~~~~~~~~ + +``index`` + +:Description: Status of bucket index. +:Type: String + +Special Error Responses +~~~~~~~~~~~~~~~~~~~~~~~ + +``IndexRepairFailed`` + +:Description: Bucket index repair failed. +:Code: 409 Conflict + +Remove Bucket +============= + +Delete an existing bucket. + +:caps: buckets=write + +Syntax +~~~~~~ + +:: + + DELETE /{admin}/bucket?format=json HTTP/1.1 + Host {fqdn} + + + +Request Parameters +~~~~~~~~~~~~~~~~~~ + +``bucket`` + +:Description: The bucket to remove. +:Type: String +:Example: ``foo_bucket`` +:Required: Yes + +``purge-objects`` + +:Description: Remove a buckets objects before deletion. +:Type: Boolean +:Example: True [False] +:Required: No + +Response Entities +~~~~~~~~~~~~~~~~~ + +None. + +Special Error Responses +~~~~~~~~~~~~~~~~~~~~~~~ + +``BucketNotEmpty`` + +:Description: Attempted to delete non-empty bucket. +:Code: 409 Conflict + +``ObjectRemovalFailed`` + +:Description: Unable to remove objects. +:Code: 409 Conflict + +Unlink Bucket +============= + +Unlink a bucket from a specified user. Primarily useful for changing +bucket ownership. + +:caps: buckets=write + +Syntax +~~~~~~ + +:: + + POST /{admin}/bucket?format=json HTTP/1.1 + Host {fqdn} + + +Request Parameters +~~~~~~~~~~~~~~~~~~ + +``bucket`` + +:Description: The bucket to unlink. +:Type: String +:Example: ``foo_bucket`` +:Required: Yes + +``uid`` + +:Description: The user ID to unlink the bucket from. +:Type: String +:Example: ``foo_user`` +:Required: Yes + +Response Entities +~~~~~~~~~~~~~~~~~ + +None. + +Special Error Responses +~~~~~~~~~~~~~~~~~~~~~~~ + +``BucketUnlinkFailed`` + +:Description: Unable to unlink bucket from specified user. +:Code: 409 Conflict + +Link Bucket +=========== + +Link a bucket to a specified user, unlinking the bucket from +any previous user. + +:caps: buckets=write + +Syntax +~~~~~~ + +:: + + PUT /{admin}/bucket?format=json HTTP/1.1 + Host {fqdn} + + +Request Parameters +~~~~~~~~~~~~~~~~~~ + +``bucket`` + +:Description: The bucket name to unlink. +:Type: String +:Example: ``foo_bucket`` +:Required: Yes + +``bucket-id`` + +:Description: The bucket id to unlink. +:Type: String +:Example: ``dev.6607669.420`` +:Required: No + +``uid`` + +:Description: The user ID to link the bucket to. +:Type: String +:Example: ``foo_user`` +:Required: Yes + +Response Entities +~~~~~~~~~~~~~~~~~ + +``bucket`` + +:Description: Container for single bucket information. +:Type: Container + +``name`` + +:Description: The name of the bucket. +:Type: String +:Parent: ``bucket`` + +``pool`` + +:Description: The pool the bucket is stored in. +:Type: String +:Parent: ``bucket`` + +``id`` + +:Description: The unique bucket id. +:Type: String +:Parent: ``bucket`` + +``marker`` + +:Description: Internal bucket tag. +:Type: String +:Parent: ``bucket`` + +``owner`` + +:Description: The user id of the bucket owner. +:Type: String +:Parent: ``bucket`` + +``usage`` + +:Description: Storage usage information. +:Type: Container +:Parent: ``bucket`` + +``index`` + +:Description: Status of bucket index. +:Type: String +:Parent: ``bucket`` + +Special Error Responses +~~~~~~~~~~~~~~~~~~~~~~~ + +``BucketUnlinkFailed`` + +:Description: Unable to unlink bucket from specified user. +:Code: 409 Conflict + +``BucketLinkFailed`` + +:Description: Unable to link bucket to specified user. +:Code: 409 Conflict + +Remove Object +============= + +Remove an existing object. NOTE: Does not require owner to be non-suspended. + +:caps: buckets=write + +Syntax +~~~~~~ + +:: + + DELETE /{admin}/bucket?object&format=json HTTP/1.1 + Host {fqdn} + +Request Parameters +~~~~~~~~~~~~~~~~~~ + +``bucket`` + +:Description: The bucket containing the object to be removed. +:Type: String +:Example: ``foo_bucket`` +:Required: Yes + +``object`` + +:Description: The object to remove. +:Type: String +:Example: ``foo.txt`` +:Required: Yes + +Response Entities +~~~~~~~~~~~~~~~~~ + +None. + +Special Error Responses +~~~~~~~~~~~~~~~~~~~~~~~ + +``NoSuchObject`` + +:Description: Specified object does not exist. +:Code: 404 Not Found + +``ObjectRemovalFailed`` + +:Description: Unable to remove objects. +:Code: 409 Conflict + + + +Get Bucket or Object Policy +=========================== + +Read the policy of an object or bucket. + +:caps: buckets=read + +Syntax +~~~~~~ + +:: + + GET /{admin}/bucket?policy&format=json HTTP/1.1 + Host {fqdn} + + +Request Parameters +~~~~~~~~~~~~~~~~~~ + +``bucket`` + +:Description: The bucket to read the policy from. +:Type: String +:Example: ``foo_bucket`` +:Required: Yes + +``object`` + +:Description: The object to read the policy from. +:Type: String +:Example: ``foo.txt`` +:Required: No + +Response Entities +~~~~~~~~~~~~~~~~~ + +If successful, returns the object or bucket policy + +``policy`` + +:Description: Access control policy. +:Type: Container + +Special Error Responses +~~~~~~~~~~~~~~~~~~~~~~~ + +``IncompleteBody`` + +:Description: Either bucket was not specified for a bucket policy request or bucket + and object were not specified for an object policy request. +:Code: 400 Bad Request + +Add A User Capability +===================== + +Add an administrative capability to a specified user. + +:caps: users=write + +Syntax +~~~~~~ + +:: + + PUT /{admin}/user?caps&format=json HTTP/1.1 + Host {fqdn} + +Request Parameters +~~~~~~~~~~~~~~~~~~ + +``uid`` + +:Description: The user ID to add an administrative capability to. +:Type: String +:Example: ``foo_user`` +:Required: Yes + +``user-caps`` + +:Description: The administrative capability to add to the user. +:Type: String +:Example: ``usage=read,write;user=write`` +:Required: Yes + +Response Entities +~~~~~~~~~~~~~~~~~ + +If successful, the response contains the user's capabilities. + +``user`` + +:Description: A container for the user data information. +:Type: Container +:Parent: ``user`` + +``user_id`` + +:Description: The user id. +:Type: String +:Parent: ``user`` + +``caps`` + +:Description: User capabilities. +:Type: Container +:Parent: ``user`` + + +Special Error Responses +~~~~~~~~~~~~~~~~~~~~~~~ + +``InvalidCapability`` + +:Description: Attempt to grant invalid admin capability. +:Code: 400 Bad Request + +Example Request +~~~~~~~~~~~~~~~ + +:: + + PUT /{admin}/user?caps&user-caps=usage=read,write;user=write&format=json HTTP/1.1 + Host: {fqdn} + Content-Type: text/plain + Authorization: {your-authorization-token} + + + +Remove A User Capability +======================== + +Remove an administrative capability from a specified user. + +:caps: users=write + +Syntax +~~~~~~ + +:: + + DELETE /{admin}/user?caps&format=json HTTP/1.1 + Host {fqdn} + +Request Parameters +~~~~~~~~~~~~~~~~~~ + +``uid`` + +:Description: The user ID to remove an administrative capability from. +:Type: String +:Example: ``foo_user`` +:Required: Yes + +``user-caps`` + +:Description: The administrative capabilities to remove from the user. +:Type: String +:Example: ``usage=read, write`` +:Required: Yes + +Response Entities +~~~~~~~~~~~~~~~~~ + +If successful, the response contains the user's capabilities. + +``user`` + +:Description: A container for the user data information. +:Type: Container +:Parent: ``user`` + +``user_id`` + +:Description: The user id. +:Type: String +:Parent: ``user`` + +``caps`` + +:Description: User capabilities. +:Type: Container +:Parent: ``user`` + + +Special Error Responses +~~~~~~~~~~~~~~~~~~~~~~~ + +``InvalidCapability`` + +:Description: Attempt to remove an invalid admin capability. +:Code: 400 Bad Request + +``NoSuchCap`` + +:Description: User does not possess specified capability. +:Code: 404 Not Found + + +Quotas +====== + +The Admin Operations API enables you to set quotas on users and on bucket owned +by users. See `Quota Management`_ for additional details. Quotas include the +maximum number of objects in a bucket and the maximum storage size in megabytes. + +To view quotas, the user must have a ``users=read`` capability. To set, +modify or disable a quota, the user must have ``users=write`` capability. +See the `Admin Guide`_ for details. + +Valid parameters for quotas include: + +- **Bucket:** The ``bucket`` option allows you to specify a quota for + buckets owned by a user. + +- **Maximum Objects:** The ``max-objects`` setting allows you to specify + the maximum number of objects. A negative value disables this setting. + +- **Maximum Size:** The ``max-size`` option allows you to specify a quota + for the maximum number of bytes. The ``max-size-kb`` option allows you + to specify it in KiB. A negative value disables this setting. + +- **Quota Type:** The ``quota-type`` option sets the scope for the quota. + The options are ``bucket`` and ``user``. + +- **Enable/Disable Quota:** The ``enabled`` option specifies whether the + quota should be enabled. The value should be either 'True' or 'False'. + +Get User Quota +~~~~~~~~~~~~~~ + +To get a quota, the user must have ``users`` capability set with ``read`` +permission. :: + + GET /admin/user?quota&uid=<uid>"a-type=user + + +Set User Quota +~~~~~~~~~~~~~~ + +To set a quota, the user must have ``users`` capability set with ``write`` +permission. :: + + PUT /admin/user?quota&uid=<uid>"a-type=user + + +The content must include a JSON representation of the quota settings +as encoded in the corresponding read operation. + + +Get Bucket Quota +~~~~~~~~~~~~~~~~ + +To get a quota, the user must have ``users`` capability set with ``read`` +permission. :: + + GET /admin/user?quota&uid=<uid>"a-type=bucket + + +Set Bucket Quota +~~~~~~~~~~~~~~~~ + +To set a quota, the user must have ``users`` capability set with ``write`` +permission. :: + + PUT /admin/user?quota&uid=<uid>"a-type=bucket + +The content must include a JSON representation of the quota settings +as encoded in the corresponding read operation. + + +Set Quota for an Individual Bucket +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +To set a quota, the user must have ``buckets`` capability set with ``write`` +permission. :: + + PUT /admin/bucket?quota&uid=<uid>&bucket=<bucket-name>"a + +The content must include a JSON representation of the quota settings +as mentioned in Set Bucket Quota section above. + + + +Standard Error Responses +======================== + +``AccessDenied`` + +:Description: Access denied. +:Code: 403 Forbidden + +``InternalError`` + +:Description: Internal server error. +:Code: 500 Internal Server Error + +``NoSuchUser`` + +:Description: User does not exist. +:Code: 404 Not Found + +``NoSuchBucket`` + +:Description: Bucket does not exist. +:Code: 404 Not Found + +``NoSuchKey`` + +:Description: No such access key. +:Code: 404 Not Found + + + + +Binding libraries +======================== + +``Golang`` + + - `IrekFasikhov/go-rgwadmin`_ + - `QuentinPerez/go-radosgw`_ + +``Java`` + + - `twonote/radosgw-admin4j`_ + +``Python`` + + - `UMIACS/rgwadmin`_ + - `valerytschopp/python-radosgw-admin`_ + + + +.. _Admin Guide: ../admin +.. _Quota Management: ../admin#quota-management +.. _IrekFasikhov/go-rgwadmin: https://github.com/IrekFasikhov/go-rgwadmin +.. _QuentinPerez/go-radosgw: https://github.com/QuentinPerez/go-radosgw +.. _twonote/radosgw-admin4j: https://github.com/twonote/radosgw-admin4j +.. _UMIACS/rgwadmin: https://github.com/UMIACS/rgwadmin +.. _valerytschopp/python-radosgw-admin: https://github.com/valerytschopp/python-radosgw-admin + diff --git a/doc/radosgw/api.rst b/doc/radosgw/api.rst new file mode 100644 index 000000000..c01a3e579 --- /dev/null +++ b/doc/radosgw/api.rst @@ -0,0 +1,14 @@ +=============== +librgw (Python) +=============== + +.. highlight:: python + +The `rgw` python module provides file-like access to rgw. + +API Reference +============= + +.. automodule:: rgw + :members: LibRGWFS, FileHandle + diff --git a/doc/radosgw/archive-sync-module.rst b/doc/radosgw/archive-sync-module.rst new file mode 100644 index 000000000..b121ee6b1 --- /dev/null +++ b/doc/radosgw/archive-sync-module.rst @@ -0,0 +1,44 @@ +=================== +Archive Sync Module +=================== + +.. versionadded:: Nautilus + +This sync module leverages the versioning feature of the S3 objects in RGW to +have an archive zone that captures the different versions of the S3 objects +as they occur over time in the other zones. + +An archive zone allows to have a history of versions of S3 objects that can +only be eliminated through the gateways associated with the archive zone. + +This functionality is useful to have a configuration where several +non-versioned zones replicate their data and metadata through their zone +gateways (mirror configuration) providing high availability to the end users, +while the archive zone captures all the data updates and metadata for +consolidate them as versions of S3 objects. + +Including an archive zone in a multizone configuration allows you to have the +flexibility of an S3 object history in one only zone while saving the space +that the replicas of the versioned S3 objects would consume in the rest of the +zones. + + + +Archive Sync Tier Type Configuration +------------------------------------ + +How to Configure +~~~~~~~~~~~~~~~~ + +See `Multisite Configuration`_ for how to multisite config instructions. The +archive sync module requires a creation of a new zone. The zone tier type needs +to be defined as ``archive``: + +:: + + # radosgw-admin zone create --rgw-zonegroup={zone-group-name} \ + --rgw-zone={zone-name} \ + --endpoints={http://fqdn}[,{http://fqdn}] + --tier-type=archive + +.. _Multisite Configuration: ../multisite diff --git a/doc/radosgw/barbican.rst b/doc/radosgw/barbican.rst new file mode 100644 index 000000000..a90d063fb --- /dev/null +++ b/doc/radosgw/barbican.rst @@ -0,0 +1,123 @@ +============================== +OpenStack Barbican Integration +============================== + +OpenStack `Barbican`_ can be used as a secure key management service for +`Server-Side Encryption`_. + +.. image:: ../images/rgw-encryption-barbican.png + +#. `Configure Keystone`_ +#. `Create a Keystone user`_ +#. `Configure the Ceph Object Gateway`_ +#. `Create a key in Barbican`_ + +Configure Keystone +================== + +Barbican depends on Keystone for authorization and access control of its keys. + +See `OpenStack Keystone Integration`_. + +Create a Keystone user +====================== + +Create a new user that will be used by the Ceph Object Gateway to retrieve +keys. + +For example:: + + user = rgwcrypt-user + pass = rgwcrypt-password + tenant = rgwcrypt + +See OpenStack documentation for `Manage projects, users, and roles`_. + +Create a key in Barbican +======================== + +See Barbican documentation for `How to Create a Secret`_. Requests to +Barbican must include a valid Keystone token in the ``X-Auth-Token`` header. + +.. note:: Server-side encryption keys must be 256-bit long and base64 encoded. + +Example request:: + + POST /v1/secrets HTTP/1.1 + Host: barbican.example.com:9311 + Accept: */* + Content-Type: application/json + X-Auth-Token: 7f7d588dd29b44df983bc961a6b73a10 + Content-Length: 299 + + { + "name": "my-key", + "expiration": "2016-12-28T19:14:44.180394", + "algorithm": "aes", + "bit_length": 256, + "mode": "cbc", + "payload": "6b+WOZ1T3cqZMxgThRcXAQBrS5mXKdDUphvpxptl9/4=", + "payload_content_type": "application/octet-stream", + "payload_content_encoding": "base64" + } + +Response:: + + {"secret_ref": "http://barbican.example.com:9311/v1/secrets/d1e7ef3b-f841-4b7c-90b2-b7d90ca2d723"} + +In the response, ``d1e7ef3b-f841-4b7c-90b2-b7d90ca2d723`` is the key id that +can be used in any `SSE-KMS`_ request. + +This newly created key is not accessible by user ``rgwcrypt-user``. This +privilege must be added with an ACL. See `How to Set/Replace ACL`_ for more +details. + +Example request (assuming that the Keystone id of ``rgwcrypt-user`` is +``906aa90bd8a946c89cdff80d0869460f``):: + + PUT /v1/secrets/d1e7ef3b-f841-4b7c-90b2-b7d90ca2d723/acl HTTP/1.1 + Host: barbican.example.com:9311 + Accept: */* + Content-Type: application/json + X-Auth-Token: 7f7d588dd29b44df983bc961a6b73a10 + Content-Length: 101 + + { + "read":{ + "users":[ "906aa90bd8a946c89cdff80d0869460f" ], + "project-access": true + } + } + +Response:: + + {"acl_ref": "http://barbican.example.com:9311/v1/secrets/d1e7ef3b-f841-4b7c-90b2-b7d90ca2d723/acl"} + +Configure the Ceph Object Gateway +================================= + +Edit the Ceph configuration file to enable Barbican as a KMS and add information +about the Barbican server and Keystone user:: + + rgw crypt s3 kms backend = barbican + rgw barbican url = http://barbican.example.com:9311 + rgw keystone barbican user = rgwcrypt-user + rgw keystone barbican password = rgwcrypt-password + +When using Keystone API version 2:: + + rgw keystone barbican tenant = rgwcrypt + +When using API version 3:: + + rgw keystone barbican project + rgw keystone barbican domain + + +.. _Barbican: https://wiki.openstack.org/wiki/Barbican +.. _Server-Side Encryption: ../encryption +.. _OpenStack Keystone Integration: ../keystone +.. _Manage projects, users, and roles: https://docs.openstack.org/admin-guide/cli-manage-projects-users-and-roles.html#create-a-user +.. _How to Create a Secret: https://developer.openstack.org/api-guide/key-manager/secrets.html#how-to-create-a-secret +.. _SSE-KMS: http://docs.aws.amazon.com/AmazonS3/latest/dev/UsingKMSEncryption.html +.. _How to Set/Replace ACL: https://developer.openstack.org/api-guide/key-manager/acls.html#how-to-set-replace-acl diff --git a/doc/radosgw/bucketpolicy.rst b/doc/radosgw/bucketpolicy.rst new file mode 100644 index 000000000..5d6dddc74 --- /dev/null +++ b/doc/radosgw/bucketpolicy.rst @@ -0,0 +1,216 @@ +=============== +Bucket Policies +=============== + +.. versionadded:: Luminous + +The Ceph Object Gateway supports a subset of the Amazon S3 policy +language applied to buckets. + + +Creation and Removal +==================== + +Bucket policies are managed through standard S3 operations rather than +radosgw-admin. + +For example, one may use s3cmd to set or delete a policy thus:: + + $ cat > examplepol + { + "Version": "2012-10-17", + "Statement": [{ + "Effect": "Allow", + "Principal": {"AWS": ["arn:aws:iam::usfolks:user/fred:subuser"]}, + "Action": "s3:PutObjectAcl", + "Resource": [ + "arn:aws:s3:::happybucket/*" + ] + }] + } + + $ s3cmd setpolicy examplepol s3://happybucket + $ s3cmd delpolicy s3://happybucket + + +Limitations +=========== + +Currently, we support only the following actions: + +- s3:AbortMultipartUpload +- s3:CreateBucket +- s3:DeleteBucketPolicy +- s3:DeleteBucket +- s3:DeleteBucketWebsite +- s3:DeleteObject +- s3:DeleteObjectVersion +- s3:DeleteReplicationConfiguration +- s3:GetAccelerateConfiguration +- s3:GetBucketAcl +- s3:GetBucketCORS +- s3:GetBucketLocation +- s3:GetBucketLogging +- s3:GetBucketNotification +- s3:GetBucketPolicy +- s3:GetBucketRequestPayment +- s3:GetBucketTagging +- s3:GetBucketVersioning +- s3:GetBucketWebsite +- s3:GetLifecycleConfiguration +- s3:GetObjectAcl +- s3:GetObject +- s3:GetObjectTorrent +- s3:GetObjectVersionAcl +- s3:GetObjectVersion +- s3:GetObjectVersionTorrent +- s3:GetReplicationConfiguration +- s3:IPAddress +- s3:NotIpAddress +- s3:ListAllMyBuckets +- s3:ListBucketMultipartUploads +- s3:ListBucket +- s3:ListBucketVersions +- s3:ListMultipartUploadParts +- s3:PutAccelerateConfiguration +- s3:PutBucketAcl +- s3:PutBucketCORS +- s3:PutBucketLogging +- s3:PutBucketNotification +- s3:PutBucketPolicy +- s3:PutBucketRequestPayment +- s3:PutBucketTagging +- s3:PutBucketVersioning +- s3:PutBucketWebsite +- s3:PutLifecycleConfiguration +- s3:PutObjectAcl +- s3:PutObject +- s3:PutObjectVersionAcl +- s3:PutReplicationConfiguration +- s3:RestoreObject + +We do not yet support setting policies on users, groups, or roles. + +We use the RGW ‘tenant’ identifier in place of the Amazon twelve-digit +account ID. In the future we may allow you to assign an account ID to +a tenant, but for now if you want to use policies between AWS S3 and +RGW S3 you will have to use the Amazon account ID as the tenant ID when +creating users. + +Under AWS, all tenants share a single namespace. RGW gives every +tenant its own namespace of buckets. There may be an option to enable +an AWS-like 'flat' bucket namespace in future versions. At present, to +access a bucket belonging to another tenant, address it as +"tenant:bucket" in the S3 request. + +In AWS, a bucket policy can grant access to another account, and that +account owner can then grant access to individual users with user +permissions. Since we do not yet support user, role, and group +permissions, account owners will currently need to grant access +directly to individual users, and granting an entire account access to +a bucket grants access to all users in that account. + +Bucket policies do not yet support string interpolation. + +For all requests, condition keys we support are: +- aws:CurrentTime +- aws:EpochTime +- aws:PrincipalType +- aws:Referer +- aws:SecureTransport +- aws:SourceIp +- aws:UserAgent +- aws:username + +We support certain s3 condition keys for bucket and object requests. + +.. versionadded:: Mimic + +Bucket Related Operations +~~~~~~~~~~~~~~~~~~~~~~~~~~ + ++-----------------------+----------------------+----------------+ +| Permission | Condition Keys | Comments | ++-----------------------+----------------------+----------------+ +| | s3:x-amz-acl | | +| | s3:x-amz-grant-<perm>| | +|s3:createBucket | where perm is one of | | +| | read/write/read-acp | | +| | write-acp/ | | +| | full-control | | ++-----------------------+----------------------+----------------+ +| | s3:prefix | | +| +----------------------+----------------+ +| s3:ListBucket & | s3:delimiter | | +| +----------------------+----------------+ +| s3:ListBucketVersions | s3:max-keys | | ++-----------------------+----------------------+----------------+ +| s3:PutBucketAcl | s3:x-amz-acl | | +| | s3:x-amz-grant-<perm>| | ++-----------------------+----------------------+----------------+ + +.. _tag_policy: + +Object Related Operations +~~~~~~~~~~~~~~~~~~~~~~~~~~ + ++-----------------------------+-----------------------------------------------+-------------------+ +|Permission |Condition Keys | Comments | +| | | | ++-----------------------------+-----------------------------------------------+-------------------+ +| |s3:x-amz-acl & s3:x-amz-grant-<perm> | | +| | | | +| +-----------------------------------------------+-------------------+ +| |s3:x-amz-copy-source | | +| | | | +| +-----------------------------------------------+-------------------+ +| |s3:x-amz-server-side-encryption | | +| | | | +| +-----------------------------------------------+-------------------+ +|s3:PutObject |s3:x-amz-server-side-encryption-aws-kms-key-id | | +| | | | +| +-----------------------------------------------+-------------------+ +| |s3:x-amz-metadata-directive |PUT & COPY to | +| | |overwrite/preserve | +| | |metadata in COPY | +| | |requests | +| +-----------------------------------------------+-------------------+ +| |s3:RequestObjectTag/<tag-key> | | +| | | | ++-----------------------------+-----------------------------------------------+-------------------+ +|s3:PutObjectAcl |s3:x-amz-acl & s3-amz-grant-<perm> | | +|s3:PutObjectVersionAcl | | | +| +-----------------------------------------------+-------------------+ +| |s3:ExistingObjectTag/<tag-key> | | +| | | | ++-----------------------------+-----------------------------------------------+-------------------+ +| |s3:RequestObjectTag/<tag-key> | | +|s3:PutObjectTagging & +-----------------------------------------------+-------------------+ +|s3:PutObjectVersionTagging |s3:ExistingObjectTag/<tag-key> | | +| | | | ++-----------------------------+-----------------------------------------------+-------------------+ +|s3:GetObject & |s3:ExistingObjectTag/<tag-key> | | +|s3:GetObjectVersion | | | ++-----------------------------+-----------------------------------------------+-------------------+ +|s3:GetObjectAcl & |s3:ExistingObjectTag/<tag-key> | | +|s3:GetObjectVersionAcl | | | ++-----------------------------+-----------------------------------------------+-------------------+ +|s3:GetObjectTagging & |s3:ExistingObjectTag/<tag-key> | | +|s3:GetObjectVersionTagging | | | ++-----------------------------+-----------------------------------------------+-------------------+ +|s3:DeleteObjectTagging & |s3:ExistingOBjectTag/<tag-key> | | +|s3:DeleteObjectVersionTagging| | | ++-----------------------------+-----------------------------------------------+-------------------+ + + +More may be supported soon as we integrate with the recently rewritten +Authentication/Authorization subsystem. + +Swift +===== + +There is no way to set bucket policies under Swift, but bucket +policies that have been set govern Swift as well as S3 operations. + +Swift credentials are matched against Principals specified in a policy +in a way specific to whatever backend is being used. diff --git a/doc/radosgw/cloud-sync-module.rst b/doc/radosgw/cloud-sync-module.rst new file mode 100644 index 000000000..bea7f441b --- /dev/null +++ b/doc/radosgw/cloud-sync-module.rst @@ -0,0 +1,244 @@ +========================= +Cloud Sync Module +========================= + +.. versionadded:: Mimic + +This module syncs zone data to a remote cloud service. The sync is unidirectional; data is not synced back from the +remote zone. The goal of this module is to enable syncing data to multiple cloud providers. The currently supported +cloud providers are those that are compatible with AWS (S3). + +User credentials for the remote cloud object store service need to be configured. Since many cloud services impose limits +on the number of buckets that each user can create, the mapping of source objects and buckets is configurable. +It is possible to configure different targets to different buckets and bucket prefixes. Note that source ACLs will not +be preserved. It is possible to map permissions of specific source users to specific destination users. + +Due to API limitations there is no way to preserve original object modification time and ETag. The cloud sync module +stores these as metadata attributes on the destination objects. + + + +Cloud Sync Tier Type Configuration +------------------------------------- + +Trivial Configuration: +~~~~~~~~~~~~~~~~~~~~~~ + +:: + + { + "connection": { + "access_key": <access>, + "secret": <secret>, + "endpoint": <endpoint>, + "host_style": <path | virtual>, + }, + "acls": [ { "type": <id | email | uri>, + "source_id": <source_id>, + "dest_id": <dest_id> } ... ], + "target_path": <target_path>, + } + + +Non Trivial Configuration: +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +:: + + { + "default": { + "connection": { + "access_key": <access>, + "secret": <secret>, + "endpoint": <endpoint>, + "host_style" <path | virtual>, + }, + "acls": [ + { + "type" : <id | email | uri>, # optional, default is id + "source_id": <id>, + "dest_id": <id> + } ... ] + "target_path": <path> # optional + }, + "connections": [ + { + "connection_id": <id>, + "access_key": <access>, + "secret": <secret>, + "endpoint": <endpoint>, + "host_style" <path | virtual>, # optional + } ... ], + "acl_profiles": [ + { + "acls_id": <id>, # acl mappings + "acls": [ { + "type": <id | email | uri>, + "source_id": <id>, + "dest_id": <id> + } ... ] + } + ], + "profiles": [ + { + "source_bucket": <source>, + "connection_id": <connection_id>, + "acls_id": <mappings_id>, + "target_path": <dest>, # optional + } ... ], + } + + +.. Note:: Trivial configuration can coincide with the non-trivial one. + + +* ``connection`` (container) + +Represents a connection to the remote cloud service. Contains ``conection_id`, ``access_key``, +``secret``, ``endpoint``, and ``host_style``. + +* ``access_key`` (string) + +The remote cloud access key that will be used for a specific connection. + +* ``secret`` (string) + +The secret key for the remote cloud service. + +* ``endpoint`` (string) + +URL of remote cloud service endpoint. + +* ``host_style`` (path | virtual) + +Type of host style to be used when accessing remote cloud endpoint (default: ``path``). + +* ``acls`` (array) + +Contains a list of ``acl_mappings``. + +* ``acl_mapping`` (container) + +Each ``acl_mapping`` structure contains ``type``, ``source_id``, and ``dest_id``. These +will define the ACL mutation that will be done on each object. An ACL mutation allows converting source +user id to a destination id. + +* ``type`` (id | email | uri) + +ACL type: ``id`` defines user id, ``email`` defines user by email, and ``uri`` defines user by ``uri`` (group). + +* ``source_id`` (string) + +ID of user in the source zone. + +* ``dest_id`` (string) + +ID of user in the destination. + +* ``target_path`` (string) + +A string that defines how the target path is created. The target path specifies a prefix to which +the source object name is appended. The target path configurable can include any of the following +variables: +- ``sid``: unique string that represents the sync instance ID +- ``zonegroup``: the zonegroup name +- ``zonegroup_id``: the zonegroup ID +- ``zone``: the zone name +- ``zone_id``: the zone id +- ``bucket``: source bucket name +- ``owner``: source bucket owner ID + +For example: ``target_path = rgwx-${zone}-${sid}/${owner}/${bucket}`` + + +* ``acl_profiles`` (array) + +An array of ``acl_profile``. + +* ``acl_profile`` (container) + +Each profile contains ``acls_id`` (string) that represents the profile, and ``acls`` array that +holds a list of ``acl_mappings``. + +* ``profiles`` (array) + +A list of profiles. Each profile contains the following: +- ``source_bucket``: either a bucket name, or a bucket prefix (if ends with ``*``) that defines the source bucket(s) for this profile +- ``target_path``: as defined above +- ``connection_id``: ID of the connection that will be used for this profile +- ``acls_id``: ID of ACLs profile that will be used for this profile + + +S3 Specific Configurables: +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Currently cloud sync will only work with backends that are compatible with AWS S3. There are +a few configurables that can be used to tweak its behavior when accessing these cloud services: + +:: + + { + "multipart_sync_threshold": {object_size}, + "multipart_min_part_size": {part_size} + } + + +* ``multipart_sync_threshold`` (integer) + +Objects this size or larger will be synced to the cloud using multipart upload. + +* ``multipart_min_part_size`` (integer) + +Minimum parts size to use when syncing objects using multipart upload. + + +How to Configure +~~~~~~~~~~~~~~~~ + +See :ref:`multisite` for how to multisite config instructions. The cloud sync module requires a creation of a new zone. The zone +tier type needs to be defined as ``cloud``: + +:: + + # radosgw-admin zone create --rgw-zonegroup={zone-group-name} \ + --rgw-zone={zone-name} \ + --endpoints={http://fqdn}[,{http://fqdn}] + --tier-type=cloud + + +The tier configuration can be then done using the following command + +:: + + # radosgw-admin zone modify --rgw-zonegroup={zone-group-name} \ + --rgw-zone={zone-name} \ + --tier-config={key}={val}[,{key}={val}] + +The ``key`` in the configuration specifies the config variable that needs to be updated, and +the ``val`` specifies its new value. Nested values can be accessed using period. For example: + +:: + + # radosgw-admin zone modify --rgw-zonegroup={zone-group-name} \ + --rgw-zone={zone-name} \ + --tier-config=connection.access_key={key},connection.secret={secret} + + +Configuration array entries can be accessed by specifying the specific entry to be referenced enclosed +in square brackets, and adding new array entry can be done by using `[]`. Index value of `-1` references +the last entry in the array. At the moment it is not possible to create a new entry and reference it +again at the same command. +For example, creating a new profile for buckets starting with {prefix}: + +:: + + # radosgw-admin zone modify --rgw-zonegroup={zone-group-name} \ + --rgw-zone={zone-name} \ + --tier-config=profiles[].source_bucket={prefix}'*' + + # radosgw-admin zone modify --rgw-zonegroup={zone-group-name} \ + --rgw-zone={zone-name} \ + --tier-config=profiles[-1].connection_id={conn_id},profiles[-1].acls_id={acls_id} + + +An entry can be removed by using ``--tier-config-rm={key}``. diff --git a/doc/radosgw/compression.rst b/doc/radosgw/compression.rst new file mode 100644 index 000000000..23655f1dc --- /dev/null +++ b/doc/radosgw/compression.rst @@ -0,0 +1,87 @@ +=========== +Compression +=========== + +.. versionadded:: Kraken + +The Ceph Object Gateway supports server-side compression of uploaded objects, +using any of Ceph's existing compression plugins. + + +Configuration +============= + +Compression can be enabled on a storage class in the Zone's placement target +by providing the ``--compression=<type>`` option to the command +``radosgw-admin zone placement modify``. + +The compression ``type`` refers to the name of the compression plugin to use +when writing new object data. Each compressed object remembers which plugin +was used, so changing this setting does not hinder the ability to decompress +existing objects, nor does it force existing objects to be recompressed. + +This compression setting applies to all new objects uploaded to buckets using +this placement target. Compression can be disabled by setting the ``type`` to +an empty string or ``none``. + +For example:: + + $ radosgw-admin zone placement modify \ + --rgw-zone default \ + --placement-id default-placement \ + --storage-class STANDARD \ + --compression zlib + { + ... + "placement_pools": [ + { + "key": "default-placement", + "val": { + "index_pool": "default.rgw.buckets.index", + "storage_classes": { + "STANDARD": { + "data_pool": "default.rgw.buckets.data", + "compression_type": "zlib" + } + }, + "data_extra_pool": "default.rgw.buckets.non-ec", + "index_type": 0, + } + } + ], + ... + } + +.. note:: A ``default`` zone is created for you if you have not done any + previous `Multisite Configuration`_. + + +Statistics +========== + +While all existing commands and APIs continue to report object and bucket +sizes based their uncompressed data, compression statistics for a given bucket +are included in its ``bucket stats``:: + + $ radosgw-admin bucket stats --bucket=<name> + { + ... + "usage": { + "rgw.main": { + "size": 1075028, + "size_actual": 1331200, + "size_utilized": 592035, + "size_kb": 1050, + "size_kb_actual": 1300, + "size_kb_utilized": 579, + "num_objects": 104 + } + }, + ... + } + +The ``size_utilized`` and ``size_kb_utilized`` fields represent the total +size of compressed data, in bytes and kilobytes respectively. + + +.. _`Multisite Configuration`: ../multisite diff --git a/doc/radosgw/config-ref.rst b/doc/radosgw/config-ref.rst new file mode 100644 index 000000000..1ed6085a6 --- /dev/null +++ b/doc/radosgw/config-ref.rst @@ -0,0 +1,1161 @@ +====================================== + Ceph Object Gateway Config Reference +====================================== + +The following settings may added to the Ceph configuration file (i.e., usually +``ceph.conf``) under the ``[client.radosgw.{instance-name}]`` section. The +settings may contain default values. If you do not specify each setting in the +Ceph configuration file, the default value will be set automatically. + +Configuration variables set under the ``[client.radosgw.{instance-name}]`` +section will not apply to rgw or radosgw-admin commands without an instance-name +specified in the command. Thus variables meant to be applied to all RGW +instances or all radosgw-admin options can be put into the ``[global]`` or the +``[client]`` section to avoid specifying ``instance-name``. + +``rgw_frontends`` + +:Description: Configures the HTTP frontend(s). The configuration for multiple + frontends can be provided in a comma-delimited list. Each frontend + configuration may include a list of options separated by spaces, + where each option is in the form "key=value" or "key". See + `HTTP Frontends`_ for more on supported options. + +:Type: String +:Default: ``beast port=7480`` + +``rgw_data`` + +:Description: Sets the location of the data files for Ceph RADOS Gateway. +:Type: String +:Default: ``/var/lib/ceph/radosgw/$cluster-$id`` + + +``rgw_enable_apis`` + +:Description: Enables the specified APIs. + + .. note:: Enabling the ``s3`` API is a requirement for + any ``radosgw`` instance that is meant to + participate in a `multi-site <../multisite>`_ + configuration. +:Type: String +:Default: ``s3, s3website, swift, swift_auth, admin, sts, iam, notifications`` All APIs. + + +``rgw_cache_enabled`` + +:Description: Whether the Ceph Object Gateway cache is enabled. +:Type: Boolean +:Default: ``true`` + + +``rgw_cache_lru_size`` + +:Description: The number of entries in the Ceph Object Gateway cache. +:Type: Integer +:Default: ``10000`` + + +``rgw_socket_path`` + +:Description: The socket path for the domain socket. ``FastCgiExternalServer`` + uses this socket. If you do not specify a socket path, Ceph + Object Gateway will not run as an external server. The path you + specify here must be the same as the path specified in the + ``rgw.conf`` file. + +:Type: String +:Default: N/A + +``rgw_fcgi_socket_backlog`` + +:Description: The socket backlog for fcgi. +:Type: Integer +:Default: ``1024`` + +``rgw_host`` + +:Description: The host for the Ceph Object Gateway instance. Can be an IP + address or a hostname. + +:Type: String +:Default: ``0.0.0.0`` + + +``rgw_port`` + +:Description: Port the instance listens for requests. If not specified, + Ceph Object Gateway runs external FastCGI. + +:Type: String +:Default: None + + +``rgw_dns_name`` + +:Description: The DNS name of the served domain. See also the ``hostnames`` setting within regions. +:Type: String +:Default: None + + +``rgw_script_uri`` + +:Description: The alternative value for the ``SCRIPT_URI`` if not set + in the request. + +:Type: String +:Default: None + + +``rgw_request_uri`` + +:Description: The alternative value for the ``REQUEST_URI`` if not set + in the request. + +:Type: String +:Default: None + + +``rgw_print_continue`` + +:Description: Enable ``100-continue`` if it is operational. +:Type: Boolean +:Default: ``true`` + + +``rgw_remote_addr_param`` + +:Description: The remote address parameter. For example, the HTTP field + containing the remote address, or the ``X-Forwarded-For`` + address if a reverse proxy is operational. + +:Type: String +:Default: ``REMOTE_ADDR`` + + +``rgw_op_thread_timeout`` + +:Description: The timeout in seconds for open threads. +:Type: Integer +:Default: 600 + + +``rgw_op_thread_suicide_timeout`` + +:Description: The time ``timeout`` in seconds before a Ceph Object Gateway + process dies. Disabled if set to ``0``. + +:Type: Integer +:Default: ``0`` + + +``rgw_thread_pool_size`` + +:Description: The size of the thread pool. +:Type: Integer +:Default: 100 threads. + + +``rgw_num_control_oids`` + +:Description: The number of notification objects used for cache synchronization + between different ``rgw`` instances. + +:Type: Integer +:Default: ``8`` + + +``rgw_init_timeout`` + +:Description: The number of seconds before Ceph Object Gateway gives up on + initialization. + +:Type: Integer +:Default: ``30`` + + +``rgw_mime_types_file`` + +:Description: The path and location of the MIME-types file. Used for Swift + auto-detection of object types. + +:Type: String +:Default: ``/etc/mime.types`` + + +``rgw_s3_success_create_obj_status`` + +:Description: The alternate success status response for ``create-obj``. +:Type: Integer +:Default: ``0`` + + +``rgw_resolve_cname`` + +:Description: Whether ``rgw`` should use DNS CNAME record of the request + hostname field (if hostname is not equal to ``rgw dns name``). + +:Type: Boolean +:Default: ``false`` + + +``rgw_obj_stripe_size`` + +:Description: The size of an object stripe for Ceph Object Gateway objects. + See `Architecture`_ for details on striping. + +:Type: Integer +:Default: ``4 << 20`` + + +``rgw_extended_http_attrs`` + +:Description: Add new set of attributes that could be set on an entity + (user, bucket or object). These extra attributes can be set + through HTTP header fields when putting the entity or modifying + it using POST method. If set, these attributes will return as + HTTP fields when doing GET/HEAD on the entity. + +:Type: String +:Default: None +:Example: "content_foo, content_bar, x-foo-bar" + + +``rgw_exit_timeout_secs`` + +:Description: Number of seconds to wait for a process before exiting + unconditionally. + +:Type: Integer +:Default: ``120`` + + +``rgw_get_obj_window_size`` + +:Description: The window size in bytes for a single object request. +:Type: Integer +:Default: ``16 << 20`` + + +``rgw_get_obj_max_req_size`` + +:Description: The maximum request size of a single get operation sent to the + Ceph Storage Cluster. + +:Type: Integer +:Default: ``4 << 20`` + + +``rgw_relaxed_s3_bucket_names`` + +:Description: Enables relaxed S3 bucket names rules for US region buckets. +:Type: Boolean +:Default: ``false`` + + +``rgw_list_buckets_max_chunk`` + +:Description: The maximum number of buckets to retrieve in a single operation + when listing user buckets. + +:Type: Integer +:Default: ``1000`` + + +``rgw_override_bucket_index_max_shards`` + +:Description: Represents the number of shards for the bucket index object, + a value of zero indicates there is no sharding. It is not + recommended to set a value too large (e.g. thousand) as it + increases the cost for bucket listing. + This variable should be set in the client or global sections + so that it is automatically applied to radosgw-admin commands. + +:Type: Integer +:Default: ``0`` + + +``rgw_curl_wait_timeout_ms`` + +:Description: The timeout in milliseconds for certain ``curl`` calls. +:Type: Integer +:Default: ``1000`` + + +``rgw_copy_obj_progress`` + +:Description: Enables output of object progress during long copy operations. +:Type: Boolean +:Default: ``true`` + + +``rgw_copy_obj_progress_every_bytes`` + +:Description: The minimum bytes between copy progress output. +:Type: Integer +:Default: ``1024 * 1024`` + + +``rgw_admin_entry`` + +:Description: The entry point for an admin request URL. +:Type: String +:Default: ``admin`` + + +``rgw_content_length_compat`` + +:Description: Enable compatibility handling of FCGI requests with both ``CONTENT_LENGTH`` and ``HTTP_CONTENT_LENGTH`` set. +:Type: Boolean +:Default: ``false`` + + +``rgw_bucket_quota_ttl`` + +:Description: The amount of time in seconds cached quota information is + trusted. After this timeout, the quota information will be + re-fetched from the cluster. +:Type: Integer +:Default: ``600`` + + +``rgw_user_quota_bucket_sync_interval`` + +:Description: The amount of time in seconds bucket quota information is + accumulated before syncing to the cluster. During this time, + other RGW instances will not see the changes in bucket quota + stats from operations on this instance. +:Type: Integer +:Default: ``180`` + + +``rgw_user_quota_sync_interval`` + +:Description: The amount of time in seconds user quota information is + accumulated before syncing to the cluster. During this time, + other RGW instances will not see the changes in user quota stats + from operations on this instance. +:Type: Integer +:Default: ``180`` + + +``rgw_bucket_default_quota_max_objects`` + +:Description: Default max number of objects per bucket. Set on new users, + if no other quota is specified. Has no effect on existing users. + This variable should be set in the client or global sections + so that it is automatically applied to radosgw-admin commands. +:Type: Integer +:Default: ``-1`` + + +``rgw_bucket_default_quota_max_size`` + +:Description: Default max capacity per bucket, in bytes. Set on new users, + if no other quota is specified. Has no effect on existing users. +:Type: Integer +:Default: ``-1`` + + +``rgw_user_default_quota_max_objects`` + +:Description: Default max number of objects for a user. This includes all + objects in all buckets owned by the user. Set on new users, + if no other quota is specified. Has no effect on existing users. +:Type: Integer +:Default: ``-1`` + + +``rgw_user_default_quota_max_size`` + +:Description: The value for user max size quota in bytes set on new users, + if no other quota is specified. Has no effect on existing users. +:Type: Integer +:Default: ``-1`` + + +``rgw_verify_ssl`` + +:Description: Verify SSL certificates while making requests. +:Type: Boolean +:Default: ``true`` + +Lifecycle Settings +================== + +Bucket Lifecycle configuration can be used to manage your objects so they are stored +effectively throughout their lifetime. In past releases Lifecycle processing was rate-limited +by single threaded processing. With the Nautilus release this has been addressed and the +Ceph Object Gateway now allows for parallel thread processing of bucket lifecycles across +additional Ceph Object Gateway instances and replaces the in-order +index shard enumeration with a random ordered sequence. + +There are two options in particular to look at when looking to increase the +aggressiveness of lifecycle processing: + +``rgw_lc_max_worker`` + +:Description: This option specifies the number of lifecycle worker threads + to run in parallel, thereby processing bucket and index + shards simultaneously. + +:Type: Integer +:Default: ``3`` + +``rgw_lc_max_wp_worker`` + +:Description: This option specifies the number of threads in each lifecycle + workers work pool. This option can help accelerate processing each bucket. + +These values can be tuned based upon your specific workload to further increase the +aggressiveness of lifecycle processing. For a workload with a larger number of buckets (thousands) +you would look at increasing the ``rgw_lc_max_worker`` value from the default value of 3 whereas for a +workload with a smaller number of buckets but higher number of objects (hundreds of thousands) +per bucket you would consider decreasing ``rgw_lc_max_wp_worker`` from the default value of 3. + +:NOTE: When looking to tune either of these specific values please validate the + current Cluster performance and Ceph Object Gateway utilization before increasing. + +Garbage Collection Settings +=========================== + +The Ceph Object Gateway allocates storage for new objects immediately. + +The Ceph Object Gateway purges the storage space used for deleted and overwritten +objects in the Ceph Storage cluster some time after the gateway deletes the +objects from the bucket index. The process of purging the deleted object data +from the Ceph Storage cluster is known as Garbage Collection or GC. + +To view the queue of objects awaiting garbage collection, execute the following:: + + $ radosgw-admin gc list + + Note: specify ``--include-all`` to list all entries, including unexpired + +Garbage collection is a background activity that may +execute continuously or during times of low loads, depending upon how the +administrator configures the Ceph Object Gateway. By default, the Ceph Object +Gateway conducts GC operations continuously. Since GC operations are a normal +part of Ceph Object Gateway operations, especially with object delete +operations, objects eligible for garbage collection exist most of the time. + +Some workloads may temporarily or permanently outpace the rate of garbage +collection activity. This is especially true of delete-heavy workloads, where +many objects get stored for a short period of time and then deleted. For these +types of workloads, administrators can increase the priority of garbage +collection operations relative to other operations with the following +configuration parameters. + + +``rgw_gc_max_objs`` + +:Description: The maximum number of objects that may be handled by + garbage collection in one garbage collection processing cycle. + Please do not change this value after the first deployment. + +:Type: Integer +:Default: ``32`` + + +``rgw_gc_obj_min_wait`` + +:Description: The minimum wait time before a deleted object may be removed + and handled by garbage collection processing. + +:Type: Integer +:Default: ``2 * 3600`` + + +``rgw_gc_processor_max_time`` + +:Description: The maximum time between the beginning of two consecutive garbage + collection processing cycles. + +:Type: Integer +:Default: ``3600`` + + +``rgw_gc_processor_period`` + +:Description: The cycle time for garbage collection processing. +:Type: Integer +:Default: ``3600`` + + +``rgw_gc_max_concurrent_io`` + +:Description: The maximum number of concurrent IO operations that the RGW garbage + collection thread will use when purging old data. +:Type: Integer +:Default: ``10`` + + +:Tuning Garbage Collection for Delete Heavy Workloads: + +As an initial step towards tuning Ceph Garbage Collection to be more aggressive the following options are suggested to be increased from their default configuration values: + +``rgw_gc_max_concurrent_io = 20`` +``rgw_gc_max_trim_chunk = 64`` + +:NOTE: Modifying these values requires a restart of the RGW service. + +Once these values have been increased from default please monitor for performance of the cluster during Garbage Collection to verify no adverse performance issues due to the increased values. + +Multisite Settings +================== + +.. versionadded:: Jewel + +You may include the following settings in your Ceph configuration +file under each ``[client.radosgw.{instance-name}]`` instance. + + +``rgw_zone`` + +:Description: The name of the zone for the gateway instance. If no zone is + set, a cluster-wide default can be configured with the command + ``radosgw-admin zone default``. +:Type: String +:Default: None + + +``rgw_zonegroup`` + +:Description: The name of the zonegroup for the gateway instance. If no + zonegroup is set, a cluster-wide default can be configured with + the command ``radosgw-admin zonegroup default``. +:Type: String +:Default: None + + +``rgw_realm`` + +:Description: The name of the realm for the gateway instance. If no realm is + set, a cluster-wide default can be configured with the command + ``radosgw-admin realm default``. +:Type: String +:Default: None + + +``rgw_run_sync_thread`` + +:Description: If there are other zones in the realm to sync from, spawn threads + to handle the sync of data and metadata. +:Type: Boolean +:Default: ``true`` + + +``rgw_data_log_window`` + +:Description: The data log entries window in seconds. +:Type: Integer +:Default: ``30`` + + +``rgw_data_log_changes_size`` + +:Description: The number of in-memory entries to hold for the data changes log. +:Type: Integer +:Default: ``1000`` + + +``rgw_data_log_obj_prefix`` + +:Description: The object name prefix for the data log. +:Type: String +:Default: ``data_log`` + + +``rgw_data_log_num_shards`` + +:Description: The number of shards (objects) on which to keep the + data changes log. + +:Type: Integer +:Default: ``128`` + + +``rgw_md_log_max_shards`` + +:Description: The maximum number of shards for the metadata log. +:Type: Integer +:Default: ``64`` + +.. important:: The values of ``rgw_data_log_num_shards`` and + ``rgw_md_log_max_shards`` should not be changed after sync has + started. + +S3 Settings +=========== + +``rgw_s3_auth_use_ldap`` + +:Description: Should S3 authentication use LDAP? +:Type: Boolean +:Default: ``false`` + + +Swift Settings +============== + +``rgw_enforce_swift_acls`` + +:Description: Enforces the Swift Access Control List (ACL) settings. +:Type: Boolean +:Default: ``true`` + + +``rgw_swift_token_expiration`` + +:Description: The time in seconds for expiring a Swift token. +:Type: Integer +:Default: ``24 * 3600`` + + +``rgw_swift_url`` + +:Description: The URL for the Ceph Object Gateway Swift API. +:Type: String +:Default: None + + +``rgw_swift_url_prefix`` + +:Description: The URL prefix for the Swift API, to distinguish it from + the S3 API endpoint. The default is ``swift``, which + makes the Swift API available at the URL + ``http://host:port/swift/v1`` (or + ``http://host:port/swift/v1/AUTH_%(tenant_id)s`` if + ``rgw swift account in url`` is enabled). + + For compatibility, setting this configuration variable + to the empty string causes the default ``swift`` to be + used; if you do want an empty prefix, set this option to + ``/``. + + .. warning:: If you set this option to ``/``, you must + disable the S3 API by modifying ``rgw + enable apis`` to exclude ``s3``. It is not + possible to operate radosgw with ``rgw + swift url prefix = /`` and simultaneously + support both the S3 and Swift APIs. If you + do need to support both APIs without + prefixes, deploy multiple radosgw instances + to listen on different hosts (or ports) + instead, enabling some for S3 and some for + Swift. +:Default: ``swift`` +:Example: "/swift-testing" + + +``rgw_swift_auth_url`` + +:Description: Default URL for verifying v1 auth tokens (if not using internal + Swift auth). + +:Type: String +:Default: None + + +``rgw_swift_auth_entry`` + +:Description: The entry point for a Swift auth URL. +:Type: String +:Default: ``auth`` + + +``rgw_swift_account_in_url`` + +:Description: Whether or not the Swift account name should be included + in the Swift API URL. + + If set to ``false`` (the default), then the Swift API + will listen on a URL formed like + ``http://host:port/<rgw_swift_url_prefix>/v1``, and the + account name (commonly a Keystone project UUID if + radosgw is configured with `Keystone integration + <../keystone>`_) will be inferred from request + headers. + + If set to ``true``, the Swift API URL will be + ``http://host:port/<rgw_swift_url_prefix>/v1/AUTH_<account_name>`` + (or + ``http://host:port/<rgw_swift_url_prefix>/v1/AUTH_<keystone_project_id>``) + instead, and the Keystone ``object-store`` endpoint must + accordingly be configured to include the + ``AUTH_%(tenant_id)s`` suffix. + + You **must** set this option to ``true`` (and update the + Keystone service catalog) if you want radosgw to support + publicly-readable containers and `temporary URLs + <../swift/tempurl>`_. +:Type: Boolean +:Default: ``false`` + + +``rgw_swift_versioning_enabled`` + +:Description: Enables the Object Versioning of OpenStack Object Storage API. + This allows clients to put the ``X-Versions-Location`` attribute + on containers that should be versioned. The attribute specifies + the name of container storing archived versions. It must be owned + by the same user that the versioned container due to access + control verification - ACLs are NOT taken into consideration. + Those containers cannot be versioned by the S3 object versioning + mechanism. + + A slightly different attribute, ``X-History-Location``, which is also understood by + `OpenStack Swift <https://docs.openstack.org/swift/latest/api/object_versioning.html>`_ + for handling ``DELETE`` operations, is currently not supported. +:Type: Boolean +:Default: ``false`` + + +``rgw_trust_forwarded_https`` + +:Description: When a proxy in front of radosgw is used for ssl termination, radosgw + does not know whether incoming http connections are secure. Enable + this option to trust the ``Forwarded`` and ``X-Forwarded-Proto`` headers + sent by the proxy when determining whether the connection is secure. + This is required for some features, such as server side encryption. + (Never enable this setting if you do not have a trusted proxy in front of + radosgw, or else malicious users will be able to set these headers in + any request.) +:Type: Boolean +:Default: ``false`` + + + +Logging Settings +================ + + +``rgw_log_nonexistent_bucket`` + +:Description: Enables Ceph Object Gateway to log a request for a non-existent + bucket. + +:Type: Boolean +:Default: ``false`` + + +``rgw_log_object_name`` + +:Description: The logging format for an object name. See ma npage + :manpage:`date` for details about format specifiers. + +:Type: Date +:Default: ``%Y-%m-%d-%H-%i-%n`` + + +``rgw_log_object_name_utc`` + +:Description: Whether a logged object name includes a UTC time. + If ``false``, it uses the local time. + +:Type: Boolean +:Default: ``false`` + + +``rgw_usage_max_shards`` + +:Description: The maximum number of shards for usage logging. +:Type: Integer +:Default: ``32`` + + +``rgw_usage_max_user_shards`` + +:Description: The maximum number of shards used for a single user's + usage logging. + +:Type: Integer +:Default: ``1`` + + +``rgw_enable_ops_log`` + +:Description: Enable logging for each successful Ceph Object Gateway operation. +:Type: Boolean +:Default: ``false`` + + +``rgw_enable_usage_log`` + +:Description: Enable the usage log. +:Type: Boolean +:Default: ``false`` + + +``rgw_ops_log_rados`` + +:Description: Whether the operations log should be written to the + Ceph Storage Cluster backend. + +:Type: Boolean +:Default: ``true`` + + +``rgw_ops_log_socket_path`` + +:Description: The Unix domain socket for writing operations logs. +:Type: String +:Default: None + + +``rgw_ops_log_file_path`` + +:Description: The file for writing operations logs. +:Type: String +:Default: None + + +``rgw_ops_log_data_backlog`` + +:Description: The maximum data backlog data size for operations logs written + to a Unix domain socket. + +:Type: Integer +:Default: ``5 << 20`` + + +``rgw_usage_log_flush_threshold`` + +:Description: The number of dirty merged entries in the usage log before + flushing synchronously. + +:Type: Integer +:Default: 1024 + + +``rgw_usage_log_tick_interval`` + +:Description: Flush pending usage log data every ``n`` seconds. +:Type: Integer +:Default: ``30`` + + +``rgw_log_http_headers`` + +:Description: Comma-delimited list of HTTP headers to include with ops + log entries. Header names are case insensitive, and use + the full header name with words separated by underscores. + +:Type: String +:Default: None +:Example: "http_x_forwarded_for, http_x_special_k" + + +``rgw_intent_log_object_name`` + +:Description: The logging format for the intent log object name. See the manpage + :manpage:`date` for details about format specifiers. + +:Type: Date +:Default: ``%Y-%m-%d-%i-%n`` + + +``rgw_intent_log_object_name_utc`` + +:Description: Whether the intent log object name uses UTC time. + If ``false``, it uses the local time. + +:Type: Boolean +:Default: ``false`` + + + +Keystone Settings +================= + + +``rgw_keystone_url`` + +:Description: The URL for the Keystone server. +:Type: String +:Default: None + + +``rgw_keystone_api_version`` + +:Description: The version (2 or 3) of OpenStack Identity API that should be + used for communication with the Keystone server. +:Type: Integer +:Default: ``2`` + + +``rgw_keystone_admin_domain`` + +:Description: The name of OpenStack domain with admin privilege when using + OpenStack Identity API v3. +:Type: String +:Default: None + + +``rgw_keystone_admin_project`` + +:Description: The name of OpenStack project with admin privilege when using + OpenStack Identity API v3. If left unspecified, value of + ``rgw keystone admin tenant`` will be used instead. +:Type: String +:Default: None + + +``rgw_keystone_admin_token`` + +:Description: The Keystone admin token (shared secret). In Ceph RGW + authentication with the admin token has priority over + authentication with the admin credentials + (``rgw_keystone_admin_user``, ``rgw_keystone_admin_password``, + ``rgw_keystone_admin_tenant``, ``rgw_keystone_admin_project``, + ``rgw_keystone_admin_domain``). The Keystone admin token + has been deprecated, but can be used to integrate with + older environments. It is preferred to instead configure + ``rgw_keystone_admin_token_path`` to avoid exposing the token. +:Type: String +:Default: None + +``rgw_keystone_admin_token_path`` + +:Description: Path to a file containing the Keystone admin token + (shared secret). In Ceph RadosGW authentication with + the admin token has priority over authentication with + the admin credentials + (``rgw_keystone_admin_user``, ``rgw_keystone_admin_password``, + ``rgw_keystone_admin_tenant``, ``rgw_keystone_admin_project``, + ``rgw_keystone_admin_domain``). + The Keystone admin token has been deprecated, but can be + used to integrate with older environments. +:Type: String +:Default: None + +``rgw_keystone_admin_tenant`` + +:Description: The name of OpenStack tenant with admin privilege (Service Tenant) when + using OpenStack Identity API v2 +:Type: String +:Default: None + + +``rgw_keystone_admin_user`` + +:Description: The name of OpenStack user with admin privilege for Keystone + authentication (Service User) when using OpenStack Identity API v2 +:Type: String +:Default: None + + +``rgw_keystone_admin_password`` + +:Description: The password for OpenStack admin user when using OpenStack + Identity API v2. It is preferred to instead configure + ``rgw_keystone_admin_password_path`` to avoid exposing the token. +:Type: String +:Default: None + +``rgw_keystone_admin_password_path`` + +:Description: Path to a file containing the password for OpenStack + admin user when using OpenStack Identity API v2. +:Type: String +:Default: None + + +``rgw_keystone_accepted_roles`` + +:Description: The roles required to serve requests. +:Type: String +:Default: ``Member, admin`` + + +``rgw_keystone_token_cache_size`` + +:Description: The maximum number of entries in each Keystone token cache. +:Type: Integer +:Default: ``10000`` + + +``rgw_keystone_revocation_interval`` + +:Description: The number of seconds between token revocation checks. +:Type: Integer +:Default: ``15 * 60`` + + +``rgw_keystone_verify_ssl`` + +:Description: Verify SSL certificates while making token requests to keystone. +:Type: Boolean +:Default: ``true`` + + +Server-side encryption Settings +=============================== + +``rgw_crypt_s3_kms_backend`` + +:Description: Where the SSE-KMS encryption keys are stored. Supported KMS + systems are OpenStack Barbican (``barbican``, the default) and + HashiCorp Vault (``vault``). +:Type: String +:Default: None + + +Barbican Settings +================= + +``rgw_barbican_url`` + +:Description: The URL for the Barbican server. +:Type: String +:Default: None + +``rgw_keystone_barbican_user`` + +:Description: The name of the OpenStack user with access to the `Barbican`_ + secrets used for `Encryption`_. +:Type: String +:Default: None + +``rgw_keystone_barbican_password`` + +:Description: The password associated with the `Barbican`_ user. +:Type: String +:Default: None + +``rgw_keystone_barbican_tenant`` + +:Description: The name of the OpenStack tenant associated with the `Barbican`_ + user when using OpenStack Identity API v2. +:Type: String +:Default: None + +``rgw_keystone_barbican_project`` + +:Description: The name of the OpenStack project associated with the `Barbican`_ + user when using OpenStack Identity API v3. +:Type: String +:Default: None + +``rgw_keystone_barbican_domain`` + +:Description: The name of the OpenStack domain associated with the `Barbican`_ + user when using OpenStack Identity API v3. +:Type: String +:Default: None + + +HashiCorp Vault Settings +======================== + +``rgw_crypt_vault_auth`` + +:Description: Type of authentication method to be used. The only method + currently supported is ``token``. +:Type: String +:Default: ``token`` + +``rgw_crypt_vault_token_file`` + +:Description: If authentication method is ``token``, provide a path to the token + file, which should be readable only by Rados Gateway. +:Type: String +:Default: None + +``rgw_crypt_vault_addr`` + +:Description: Vault server base address, e.g. ``http://vaultserver:8200``. +:Type: String +:Default: None + +``rgw_crypt_vault_prefix`` + +:Description: The Vault secret URL prefix, which can be used to restrict access + to a particular subset of the secret space, e.g. ``/v1/secret/data``. +:Type: String +:Default: None + +``rgw_crypt_vault_secret_engine`` + +:Description: Vault Secret Engine to be used to retrieve encryption keys: choose + between kv-v2, transit. +:Type: String +:Default: None + +``rgw_crypt_vault_namespace`` + +:Description: If set, Vault Namespace provides tenant isolation for teams and individuals + on the same Vault Enterprise instance, e.g. ``acme/tenant1`` +:Type: String +:Default: None + + +QoS settings +------------ + +.. versionadded:: Nautilus + +The ``civetweb`` frontend has a threading model that uses a thread per +connection and hence is automatically throttled by ``rgw_thread_pool_size`` +configurable when it comes to accepting connections. The newer ``beast`` frontend is +not restricted by the thread pool size when it comes to accepting new +connections, so a scheduler abstraction is introduced in the Nautilus release +to support future methods of scheduling requests. + +Currently the scheduler defaults to a throttler which throttles the active +connections to a configured limit. QoS based on mClock is currently in an +*experimental* phase and not recommended for production yet. Current +implementation of *dmclock_client* op queue divides RGW Ops on admin, auth +(swift auth, sts) metadata & data requests. + + +``rgw_max_concurrent_requests`` + +:Description: Maximum number of concurrent HTTP requests that the Beast front end + will process. Tuning this can help to limit memory usage under + heavy load. +:Type: Integer +:Default: 1024 + + +``rgw_scheduler_type`` + +:Description: The RGW scheduler to use. Valid values are ``throttler` and + ``dmclock``. Currently defaults to ``throttler`` which throttles Beast + frontend requests. ``dmclock` is *experimental* and requires the + ``dmclock`` to be included in the ``experimental_feature_enabled`` + configuration option. + + +The options below tune the experimental dmclock scheduler. For +additional reading on dmclock, see :ref:`dmclock-qos`. `op_class` for the flags below is +one of ``admin``, ``auth``, ``metadata``, or ``data``. + +``rgw_dmclock_<op_class>_res`` + +:Description: The mclock reservation for `op_class` requests +:Type: float +:Default: 100.0 + +``rgw_dmclock_<op_class>_wgt`` + +:Description: The mclock weight for `op_class` requests +:Type: float +:Default: 1.0 + +``rgw_dmclock_<op_class>_lim`` + +:Description: The mclock limit for `op_class` requests +:Type: float +:Default: 0.0 + + + +.. _Architecture: ../../architecture#data-striping +.. _Pool Configuration: ../../rados/configuration/pool-pg-config-ref/ +.. _Cluster Pools: ../../rados/operations/pools +.. _Rados cluster handles: ../../rados/api/librados-intro/#step-2-configuring-a-cluster-handle +.. _Barbican: ../barbican +.. _Encryption: ../encryption +.. _HTTP Frontends: ../frontends diff --git a/doc/radosgw/dynamicresharding.rst b/doc/radosgw/dynamicresharding.rst new file mode 100644 index 000000000..a7698e7bb --- /dev/null +++ b/doc/radosgw/dynamicresharding.rst @@ -0,0 +1,235 @@ +.. _rgw_dynamic_bucket_index_resharding: + +=================================== +RGW Dynamic Bucket Index Resharding +=================================== + +.. versionadded:: Luminous + +A large bucket index can lead to performance problems. In order +to address this problem we introduced bucket index sharding. +Until Luminous, changing the number of bucket shards (resharding) +needed to be done offline. Starting with Luminous we support +online bucket resharding. + +Each bucket index shard can handle its entries efficiently up until +reaching a certain threshold number of entries. If this threshold is +exceeded the system can suffer from performance issues. The dynamic +resharding feature detects this situation and automatically increases +the number of shards used by the bucket index, resulting in a +reduction of the number of entries in each bucket index shard. This +process is transparent to the user. + +By default dynamic bucket index resharding can only increase the +number of bucket index shards to 1999, although this upper-bound is a +configuration parameter (see Configuration below). When +possible, the process chooses a prime number of bucket index shards to +spread the number of bucket index entries across the bucket index +shards more evenly. + +The detection process runs in a background process that periodically +scans all the buckets. A bucket that requires resharding is added to +the resharding queue and will be scheduled to be resharded later. The +reshard thread runs in the background and execute the scheduled +resharding tasks, one at a time. + +Multisite +========= + +Dynamic resharding is not supported in a multisite environment. + + +Configuration +============= + +Enable/Disable dynamic bucket index resharding: + +- ``rgw_dynamic_resharding``: true/false, default: true + +Configuration options that control the resharding process: + +- ``rgw_max_objs_per_shard``: maximum number of objects per bucket index shard before resharding is triggered, default: 100000 objects + +- ``rgw_max_dynamic_shards``: maximum number of shards that dynamic bucket index resharding can increase to, default: 1999 + +- ``rgw_reshard_bucket_lock_duration``: duration, in seconds, of lock on bucket obj during resharding, default: 360 seconds (i.e., 6 minutes) + +- ``rgw_reshard_thread_interval``: maximum time, in seconds, between rounds of resharding queue processing, default: 600 seconds (i.e., 10 minutes) + +- ``rgw_reshard_num_logs``: number of shards for the resharding queue, default: 16 + +Admin commands +============== + +Add a bucket to the resharding queue +------------------------------------ + +:: + + # radosgw-admin reshard add --bucket <bucket_name> --num-shards <new number of shards> + +List resharding queue +--------------------- + +:: + + # radosgw-admin reshard list + +Process tasks on the resharding queue +------------------------------------- + +:: + + # radosgw-admin reshard process + +Bucket resharding status +------------------------ + +:: + + # radosgw-admin reshard status --bucket <bucket_name> + +The output is a json array of 3 objects (reshard_status, new_bucket_instance_id, num_shards) per shard. + +For example, the output at different Dynamic Resharding stages is shown below: + +``1. Before resharding occurred:`` +:: + + [ + { + "reshard_status": "not-resharding", + "new_bucket_instance_id": "", + "num_shards": -1 + } + ] + +``2. During resharding:`` +:: + + [ + { + "reshard_status": "in-progress", + "new_bucket_instance_id": "1179f470-2ebf-4630-8ec3-c9922da887fd.8652.1", + "num_shards": 2 + }, + { + "reshard_status": "in-progress", + "new_bucket_instance_id": "1179f470-2ebf-4630-8ec3-c9922da887fd.8652.1", + "num_shards": 2 + } + ] + +``3, After resharding completed:`` +:: + + [ + { + "reshard_status": "not-resharding", + "new_bucket_instance_id": "", + "num_shards": -1 + }, + { + "reshard_status": "not-resharding", + "new_bucket_instance_id": "", + "num_shards": -1 + } + ] + + +Cancel pending bucket resharding +-------------------------------- + +Note: Ongoing bucket resharding operations cannot be cancelled. :: + + # radosgw-admin reshard cancel --bucket <bucket_name> + +Manual immediate bucket resharding +---------------------------------- + +:: + + # radosgw-admin bucket reshard --bucket <bucket_name> --num-shards <new number of shards> + +When choosing a number of shards, the administrator should keep a +number of items in mind. Ideally the administrator is aiming for no +more than 100000 entries per shard, now and through some future point +in time. + +Additionally, bucket index shards that are prime numbers tend to work +better in evenly distributing bucket index entries across the +shards. For example, 7001 bucket index shards is better than 7000 +since the former is prime. A variety of web sites have lists of prime +numbers; search for "list of prime numbers" withy your favorite web +search engine to locate some web sites. + +Troubleshooting +=============== + +Clusters prior to Luminous 12.2.11 and Mimic 13.2.5 left behind stale bucket +instance entries, which were not automatically cleaned up. The issue also affected +LifeCycle policies, which were not applied to resharded buckets anymore. Both of +these issues can be worked around using a couple of radosgw-admin commands. + +Stale instance management +------------------------- + +List the stale instances in a cluster that are ready to be cleaned up. + +:: + + # radosgw-admin reshard stale-instances list + +Clean up the stale instances in a cluster. Note: cleanup of these +instances should only be done on a single site cluster. + +:: + + # radosgw-admin reshard stale-instances rm + + +Lifecycle fixes +--------------- + +For clusters that had resharded instances, it is highly likely that the old +lifecycle processes would have flagged and deleted lifecycle processing as the +bucket instance changed during a reshard. While this is fixed for newer clusters +(from Mimic 13.2.6 and Luminous 12.2.12), older buckets that had lifecycle policies and +that have undergone resharding will have to be manually fixed. + +The command to do so is: + +:: + + # radosgw-admin lc reshard fix --bucket {bucketname} + + +As a convenience wrapper, if the ``--bucket`` argument is dropped then this +command will try and fix lifecycle policies for all the buckets in the cluster. + +Object Expirer fixes +-------------------- + +Objects subject to Swift object expiration on older clusters may have +been dropped from the log pool and never deleted after the bucket was +resharded. This would happen if their expiration time was before the +cluster was upgraded, but if their expiration was after the upgrade +the objects would be correctly handled. To manage these expire-stale +objects, radosgw-admin provides two subcommands. + +Listing: + +:: + + # radosgw-admin objects expire-stale list --bucket {bucketname} + +Displays a list of object names and expiration times in JSON format. + +Deleting: + +:: + + # radosgw-admin objects expire-stale rm --bucket {bucketname} + + +Initiates deletion of such objects, displaying a list of object names, expiration times, and deletion status in JSON format. diff --git a/doc/radosgw/elastic-sync-module.rst b/doc/radosgw/elastic-sync-module.rst new file mode 100644 index 000000000..60c806e87 --- /dev/null +++ b/doc/radosgw/elastic-sync-module.rst @@ -0,0 +1,181 @@ +========================= +ElasticSearch Sync Module +========================= + +.. versionadded:: Kraken + +.. note:: + As of 31 May 2020, only Elasticsearch 6 and lower are supported. ElasticSearch 7 is not supported. + +This sync module writes the metadata from other zones to `ElasticSearch`_. As of +luminous this is a json of data fields we currently store in ElasticSearch. + +:: + + { + "_index" : "rgw-gold-ee5863d6", + "_type" : "object", + "_id" : "34137443-8592-48d9-8ca7-160255d52ade.34137.1:object1:null", + "_score" : 1.0, + "_source" : { + "bucket" : "testbucket123", + "name" : "object1", + "instance" : "null", + "versioned_epoch" : 0, + "owner" : { + "id" : "user1", + "display_name" : "user1" + }, + "permissions" : [ + "user1" + ], + "meta" : { + "size" : 712354, + "mtime" : "2017-05-04T12:54:16.462Z", + "etag" : "7ac66c0f148de9519b8bd264312c4d64" + } + } + } + + + +ElasticSearch tier type configurables +------------------------------------- + +* ``endpoint`` + +Specifies the Elasticsearch server endpoint to access + +* ``num_shards`` (integer) + +The number of shards that Elasticsearch will be configured with on +data sync initialization. Note that this cannot be changed after init. +Any change here requires rebuild of the Elasticsearch index and reinit +of the data sync process. + +* ``num_replicas`` (integer) + +The number of the replicas that Elasticsearch will be configured with +on data sync initialization. + +* ``explicit_custom_meta`` (true | false) + +Specifies whether all user custom metadata will be indexed, or whether +user will need to configure (at the bucket level) what custom +metadata entries should be indexed. This is false by default + +* ``index_buckets_list`` (comma separated list of strings) + +If empty, all buckets will be indexed. Otherwise, only buckets +specified here will be indexed. It is possible to provide bucket +prefixes (e.g., foo\*), or bucket suffixes (e.g., \*bar). + +* ``approved_owners_list`` (comma separated list of strings) + +If empty, buckets of all owners will be indexed (subject to other +restrictions), otherwise, only buckets owned by specified owners will +be indexed. Suffixes and prefixes can also be provided. + +* ``override_index_path`` (string) + +if not empty, this string will be used as the elasticsearch index +path. Otherwise the index path will be determined and generated on +sync initialization. + + +End user metadata queries +------------------------- + +.. versionadded:: Luminous + +Since the ElasticSearch cluster now stores object metadata, it is important that +the ElasticSearch endpoint is not exposed to the public and only accessible to +the cluster administrators. For exposing metadata queries to the end user itself +this poses a problem since we'd want the user to only query their metadata and +not of any other users, this would require the ElasticSearch cluster to +authenticate users in a way similar to RGW does which poses a problem. + +As of Luminous RGW in the metadata master zone can now service end user +requests. This allows for not exposing the elasticsearch endpoint in public and +also solves the authentication and authorization problem since RGW itself can +authenticate the end user requests. For this purpose RGW introduces a new query +in the bucket APIs that can service elasticsearch requests. All these requests +must be sent to the metadata master zone. + +Syntax +~~~~~~ + +Get an elasticsearch query +`````````````````````````` + +:: + + GET /{bucket}?query={query-expr} + +request params: + - max-keys: max number of entries to return + - marker: pagination marker + +``expression := [(]<arg> <op> <value> [)][<and|or> ...]`` + +op is one of the following: +<, <=, ==, >=, > + +For example :: + + GET /?query=name==foo + +Will return all the indexed keys that user has read permission to, and +are named 'foo'. + +The output will be a list of keys in XML that is similar to the S3 +list buckets response. + +Configure custom metadata fields +```````````````````````````````` + +Define which custom metadata entries should be indexed (under the +specified bucket), and what are the types of these keys. If explicit +custom metadata indexing is configured, this is needed so that rgw +will index the specified custom metadata values. Otherwise it is +needed in cases where the indexed metadata keys are of a type other +than string. + +:: + + POST /{bucket}?mdsearch + x-amz-meta-search: <key [; type]> [, ...] + +Multiple metadata fields must be comma separated, a type can be forced for a +field with a `;`. The currently allowed types are string(default), integer and +date + +eg. if you want to index a custom object metadata x-amz-meta-year as int, +x-amz-meta-date as type date and x-amz-meta-title as string, you'd do + +:: + + POST /mybooks?mdsearch + x-amz-meta-search: x-amz-meta-year;int, x-amz-meta-release-date;date, x-amz-meta-title;string + + +Delete custom metadata configuration +```````````````````````````````````` + +Delete custom metadata bucket configuration. + +:: + + DELETE /<bucket>?mdsearch + +Get custom metadata configuration +````````````````````````````````` + +Retrieve custom metadata bucket configuration. + +:: + + GET /<bucket>?mdsearch + + +.. _`Elasticsearch`: https://github.com/elastic/elasticsearch diff --git a/doc/radosgw/encryption.rst b/doc/radosgw/encryption.rst new file mode 100644 index 000000000..07fd60ac5 --- /dev/null +++ b/doc/radosgw/encryption.rst @@ -0,0 +1,68 @@ +========== +Encryption +========== + +.. versionadded:: Luminous + +The Ceph Object Gateway supports server-side encryption of uploaded objects, +with 3 options for the management of encryption keys. Server-side encryption +means that the data is sent over HTTP in its unencrypted form, and the Ceph +Object Gateway stores that data in the Ceph Storage Cluster in encrypted form. + +.. note:: Requests for server-side encryption must be sent over a secure HTTPS + connection to avoid sending secrets in plaintext. If a proxy is used + for SSL termination, ``rgw trust forwarded https`` must be enabled + before forwarded requests will be trusted as secure. + +.. note:: Server-side encryption keys must be 256-bit long and base64 encoded. + +Customer-Provided Keys +====================== + +In this mode, the client passes an encryption key along with each request to +read or write encrypted data. It is the client's responsibility to manage those +keys and remember which key was used to encrypt each object. + +This is implemented in S3 according to the `Amazon SSE-C`_ specification. + +As all key management is handled by the client, no special configuration is +needed to support this encryption mode. + +Key Management Service +====================== + +This mode allows keys to be stored in a secure key management service and +retrieved on demand by the Ceph Object Gateway to serve requests to encrypt +or decrypt data. + +This is implemented in S3 according to the `Amazon SSE-KMS`_ specification. + +In principle, any key management service could be used here. Currently +integration with `Barbican`_, `Vault`_, and `KMIP`_ are implemented. + +See `OpenStack Barbican Integration`_, `HashiCorp Vault Integration`_, +and `KMIP Integration`_. + +Automatic Encryption (for testing only) +======================================= + +A ``rgw crypt default encryption key`` can be set in ceph.conf to force the +encryption of all objects that do not otherwise specify an encryption mode. + +The configuration expects a base64-encoded 256 bit key. For example:: + + rgw crypt default encryption key = 4YSmvJtBv0aZ7geVgAsdpRnLBEwWSWlMIGnRS8a9TSA= + +.. important:: This mode is for diagnostic purposes only! The ceph configuration + file is not a secure method for storing encryption keys. Keys that are + accidentally exposed in this way should be considered compromised. + + +.. _Amazon SSE-C: https://docs.aws.amazon.com/AmazonS3/latest/dev/ServerSideEncryptionCustomerKeys.html +.. _Amazon SSE-KMS: http://docs.aws.amazon.com/AmazonS3/latest/dev/UsingKMSEncryption.html +.. _Barbican: https://wiki.openstack.org/wiki/Barbican +.. _Vault: https://www.vaultproject.io/docs/ +.. _KMIP: http://www.oasis-open.org/committees/kmip/ +.. _OpenStack Barbican Integration: ../barbican +.. _HashiCorp Vault Integration: ../vault +.. _KMIP Integration: ../kmip diff --git a/doc/radosgw/frontends.rst b/doc/radosgw/frontends.rst new file mode 100644 index 000000000..274cdce87 --- /dev/null +++ b/doc/radosgw/frontends.rst @@ -0,0 +1,226 @@ +.. _rgw_frontends: + +============== +HTTP Frontends +============== + +.. contents:: + +The Ceph Object Gateway supports two embedded HTTP frontend libraries +that can be configured with ``rgw_frontends``. See `Config Reference`_ +for details about the syntax. + +Beast +===== + +.. versionadded:: Mimic + +The ``beast`` frontend uses the Boost.Beast library for HTTP parsing +and the Boost.Asio library for asynchronous network i/o. + +Options +------- + +``port`` and ``ssl_port`` + +:Description: Sets the ipv4 & ipv6 listening port number. Can be specified multiple + times as in ``port=80 port=8000``. +:Type: Integer +:Default: ``80`` + + +``endpoint`` and ``ssl_endpoint`` + +:Description: Sets the listening address in the form ``address[:port]``, where + the address is an IPv4 address string in dotted decimal form, or + an IPv6 address in hexadecimal notation surrounded by square + brackets. Specifying a IPv6 endpoint would listen to v6 only. The + optional port defaults to 80 for ``endpoint`` and 443 for + ``ssl_endpoint``. Can be specified multiple times as in + ``endpoint=[::1] endpoint=192.168.0.100:8000``. + +:Type: Integer +:Default: None + + +``ssl_certificate`` + +:Description: Path to the SSL certificate file used for SSL-enabled endpoints. + If path is prefixed with ``config://``, the certificate will be + pulled from the ceph monitor ``config-key`` database. + +:Type: String +:Default: None + + +``ssl_private_key`` + +:Description: Optional path to the private key file used for SSL-enabled + endpoints. If one is not given, the ``ssl_certificate`` file + is used as the private key. + If path is prefixed with ``config://``, the certificate will be + pulled from the ceph monitor ``config-key`` database. + +:Type: String +:Default: None + +``ssl_options`` + +:Description: Optional colon separated list of ssl context options: + + ``default_workarounds`` Implement various bug workarounds. + + ``no_compression`` Disable compression. + + ``no_sslv2`` Disable SSL v2. + + ``no_sslv3`` Disable SSL v3. + + ``no_tlsv1`` Disable TLS v1. + + ``no_tlsv1_1`` Disable TLS v1.1. + + ``no_tlsv1_2`` Disable TLS v1.2. + + ``single_dh_use`` Always create a new key when using tmp_dh parameters. + +:Type: String +:Default: ``no_sslv2:no_sslv3:no_tlsv1:no_tlsv1_1`` + +``ssl_ciphers`` + +:Description: Optional list of one or more cipher strings separated by colons. + The format of the string is described in openssl's ciphers(1) + manual. + +:Type: String +:Default: None + +``tcp_nodelay`` + +:Description: If set the socket option will disable Nagle's algorithm on + the connection which means that packets will be sent as soon + as possible instead of waiting for a full buffer or timeout to occur. + + ``1`` Disable Nagel's algorithm for all sockets. + + ``0`` Keep the default: Nagel's algorithm enabled. + +:Type: Integer (0 or 1) +:Default: 0 + +``max_connection_backlog`` + +:Description: Optional value to define the maximum size for the queue of + connections waiting to be accepted. If not configured, the value + from ``boost::asio::socket_base::max_connections`` will be used. + +:Type: Integer +:Default: None + +``request_timeout_ms`` + +:Description: The amount of time in milliseconds that Beast will wait + for more incoming data or outgoing data before giving up. + Setting this value to 0 will disable timeout. + +:Type: Integer +:Default: ``65000`` + + +Civetweb +======== + +.. versionadded:: Firefly +.. deprecated:: Pacific + +The ``civetweb`` frontend uses the Civetweb HTTP library, which is a +fork of Mongoose. + + +Options +------- + +``port`` + +:Description: Sets the listening port number. For SSL-enabled ports, add an + ``s`` suffix like ``443s``. To bind a specific IPv4 or IPv6 + address, use the form ``address:port``. Multiple endpoints + can either be separated by ``+`` as in ``127.0.0.1:8000+443s``, + or by providing multiple options as in ``port=8000 port=443s``. + +:Type: String +:Default: ``7480`` + + +``num_threads`` + +:Description: Sets the number of threads spawned by Civetweb to handle + incoming HTTP connections. This effectively limits the number + of concurrent connections that the frontend can service. + +:Type: Integer +:Default: ``rgw_thread_pool_size`` + + +``request_timeout_ms`` + +:Description: The amount of time in milliseconds that Civetweb will wait + for more incoming data before giving up. + +:Type: Integer +:Default: ``30000`` + + +``ssl_certificate`` + +:Description: Path to the SSL certificate file used for SSL-enabled ports. + +:Type: String +:Default: None + +``access_log_file`` + +:Description: Path to a file for access logs. Either full path, or relative + to the current working directory. If absent (default), then + accesses are not logged. + +:Type: String +:Default: ``EMPTY`` + + +``error_log_file`` + +:Description: Path to a file for error logs. Either full path, or relative + to the current working directory. If absent (default), then + errors are not logged. + +:Type: String +:Default: ``EMPTY`` + + +The following is an example of the ``/etc/ceph/ceph.conf`` file with some of these options set: :: + + [client.rgw.gateway-node1] + rgw_frontends = civetweb request_timeout_ms=30000 error_log_file=/var/log/radosgw/civetweb.error.log access_log_file=/var/log/radosgw/civetweb.access.log + +A complete list of supported options can be found in the `Civetweb User Manual`_. + + +Generic Options +=============== + +Some frontend options are generic and supported by all frontends: + +``prefix`` + +:Description: A prefix string that is inserted into the URI of all + requests. For example, a swift-only frontend could supply + a uri prefix of ``/swift``. + +:Type: String +:Default: None + + +.. _Civetweb User Manual: https://civetweb.github.io/civetweb/UserManual.html +.. _Config Reference: ../config-ref diff --git a/doc/radosgw/index.rst b/doc/radosgw/index.rst new file mode 100644 index 000000000..efdb12d28 --- /dev/null +++ b/doc/radosgw/index.rst @@ -0,0 +1,85 @@ +.. _object-gateway: + +===================== + Ceph Object Gateway +===================== + +:term:`Ceph Object Gateway` is an object storage interface built on top of +``librados``. It provides a RESTful gateway between applications and Ceph +Storage Clusters. :term:`Ceph Object Storage` supports two interfaces: + +#. **S3-compatible:** Provides object storage functionality with an interface + that is compatible with a large subset of the Amazon S3 RESTful API. + +#. **Swift-compatible:** Provides object storage functionality with an interface + that is compatible with a large subset of the OpenStack Swift API. + +Ceph Object Storage uses the Ceph Object Gateway daemon (``radosgw``), an HTTP +server designed for interacting with a Ceph Storage Cluster. The Ceph Object +Gateway provides interfaces that are compatible with both Amazon S3 and +OpenStack Swift, and it has its own user management. Ceph Object Gateway can +store data in the same Ceph Storage Cluster in which data from Ceph File System +clients and Ceph Block Device clients is stored. The S3 API and the Swift API +share a common namespace, which makes it possible to write data to a Ceph +Storage Cluster with one API and then retrieve that data with the other API. + +.. ditaa:: + + +------------------------+ +------------------------+ + | S3 compatible API | | Swift compatible API | + +------------------------+-+------------------------+ + | radosgw | + +---------------------------------------------------+ + | librados | + +------------------------+-+------------------------+ + | OSDs | | Monitors | + +------------------------+ +------------------------+ + +.. note:: Ceph Object Storage does **NOT** use the Ceph Metadata Server. + + +.. toctree:: + :maxdepth: 1 + + HTTP Frontends <frontends> + Pool Placement and Storage Classes <placement> + Multisite Configuration <multisite> + Multisite Sync Policy Configuration <multisite-sync-policy> + Configuring Pools <pools> + Config Reference <config-ref> + Admin Guide <admin> + S3 API <s3> + Data caching and CDN <rgw-cache.rst> + Swift API <swift> + Admin Ops API <adminops> + Python binding <api> + Export over NFS <nfs> + OpenStack Keystone Integration <keystone> + OpenStack Barbican Integration <barbican> + HashiCorp Vault Integration <vault> + KMIP Integration <kmip> + Open Policy Agent Integration <opa> + Multi-tenancy <multitenancy> + Compression <compression> + LDAP Authentication <ldap-auth> + Server-Side Encryption <encryption> + Bucket Policy <bucketpolicy> + Dynamic bucket index resharding <dynamicresharding> + Multi factor authentication <mfa> + Sync Modules <sync-modules> + Bucket Notifications <notifications> + Data Layout in RADOS <layout> + STS <STS> + STS Lite <STSLite> + Keycloak <keycloak> + Session Tags <session-tags> + Role <role> + Orphan List and Associated Tooling <orphans> + OpenID Connect Provider <oidc> + troubleshooting + Manpage radosgw <../../man/8/radosgw> + Manpage radosgw-admin <../../man/8/radosgw-admin> + QAT Acceleration for Encryption and Compression <qat-accel> + S3-select <s3select> + Lua Scripting <lua-scripting> + diff --git a/doc/radosgw/keycloak.rst b/doc/radosgw/keycloak.rst new file mode 100644 index 000000000..534c4733a --- /dev/null +++ b/doc/radosgw/keycloak.rst @@ -0,0 +1,127 @@ +================================= +Keycloak integration with RadosGW +================================= + +Keycloak can be setup as an OpenID Connect Identity Provider, which can be used by mobile/ web apps +to authenticate their users. The Web token returned as a result of authentication can be used by the +mobile/ web app to call AssumeRoleWithWebIdentity to get back a set of temporary S3 credentials, +which can be used by the app to make S3 calls. + +Setting up Keycloak +==================== + +Installing and bringing up Keycloak can be found here: https://www.keycloak.org/docs/latest/server_installation/. + +Configuring Keycloak to talk to RGW +=================================== + +The following configurables have to be added for RGW to talk to Keycloak:: + + [client.radosgw.gateway] + rgw sts key = {sts key for encrypting/ decrypting the session token} + rgw s3 auth use sts = true + +Example showing how to fetch a web token from Keycloak +====================================================== + +Several examples of apps authenticating with Keycloak are given here: https://github.com/keycloak/keycloak-quickstarts/blob/latest/docs/getting-started.md +Taking the example of app-profile-jee-jsp app given in the link above, its client id and client secret, can be used to fetch the +access token (web token) for an application using grant type 'client_credentials' as given below:: + + KC_REALM=demo + KC_CLIENT=<client id> + KC_CLIENT_SECRET=<client secret> + KC_SERVER=<host>:8080 + KC_CONTEXT=auth + + # Request Tokens for credentials + KC_RESPONSE=$( \ + curl -k -v -X POST \ + -H "Content-Type: application/x-www-form-urlencoded" \ + -d "scope=openid" \ + -d "grant_type=client_credentials" \ + -d "client_id=$KC_CLIENT" \ + -d "client_secret=$KC_CLIENT_SECRET" \ + "http://$KC_SERVER/$KC_CONTEXT/realms/$KC_REALM/protocol/openid-connect/token" \ + | jq . + ) + + KC_ACCESS_TOKEN=$(echo $KC_RESPONSE| jq -r .access_token) + +An access token can also be fetched for a particular user with grant type 'password', using client id, client secret, username and its password +as given below:: + + KC_REALM=demo + KC_USERNAME=<username> + KC_PASSWORD=<userpassword> + KC_CLIENT=<client id> + KC_CLIENT_SECRET=<client secret> + KC_SERVER=<host>:8080 + KC_CONTEXT=auth + + # Request Tokens for credentials + KC_RESPONSE=$( \ + curl -k -v -X POST \ + -H "Content-Type: application/x-www-form-urlencoded" \ + -d "scope=openid" \ + -d "grant_type=password" \ + -d "client_id=$KC_CLIENT" \ + -d "client_secret=$KC_CLIENT_SECRET" \ + -d "username=$KC_USERNAME" \ + -d "password=$KC_PASSWORD" \ + "http://$KC_SERVER/$KC_CONTEXT/realms/$KC_REALM/protocol/openid-connect/token" \ + | jq . + ) + + KC_ACCESS_TOKEN=$(echo $KC_RESPONSE| jq -r .access_token) + + +KC_ACCESS_TOKEN can be used to invoke AssumeRoleWithWebIdentity as given in +:doc:`STS`. + +Attaching tags to a user in Keycloak +==================================== + +We need to create a user in keycloak, and add tags to it as its attributes. + +Add a user as shown below: + +.. image:: ../images/keycloak-adduser.png + :align: center + +Add user details as shown below: + +.. image:: ../images/keycloak-userdetails.png + :align: center + +Add user credentials as shown below: + +.. image:: ../images/keycloak-usercredentials.png + :align: center + +Add tags to the 'attributes' tab of the user as shown below: + +.. image:: ../images/keycloak-usertags.png + :align: center + +Add a protocol mapper for the user attribute to a client as shown below: + +.. image:: ../images/keycloak-userclientmapper.png + :align: center + + +After following the steps shown above, the tag 'Department' will appear in the JWT (web token), under 'https://aws.amazon.com/tags' namespace. +The tags can be verified using token introspection of the JWT. The command to introspect a token using client id and client secret is shown below:: + + KC_REALM=demo + KC_CLIENT=<client id> + KC_CLIENT_SECRET=<client secret> + KC_SERVER=<host>:8080 + KC_CONTEXT=auth + + curl -k -v \ + -X POST \ + -u "$KC_CLIENT:$KC_CLIENT_SECRET" \ + -d "token=$KC_ACCESS_TOKEN" \ + "http://$KC_SERVER/$KC_CONTEXT/realms/$KC_REALM/protocol/openid-connect/token/introspect" \ + | jq . diff --git a/doc/radosgw/keystone.rst b/doc/radosgw/keystone.rst new file mode 100644 index 000000000..628810ad3 --- /dev/null +++ b/doc/radosgw/keystone.rst @@ -0,0 +1,160 @@ +===================================== + Integrating with OpenStack Keystone +===================================== + +It is possible to integrate the Ceph Object Gateway with Keystone, the OpenStack +identity service. This sets up the gateway to accept Keystone as the users +authority. A user that Keystone authorizes to access the gateway will also be +automatically created on the Ceph Object Gateway (if didn't exist beforehand). A +token that Keystone validates will be considered as valid by the gateway. + +The following configuration options are available for Keystone integration:: + + [client.radosgw.gateway] + rgw keystone api version = {keystone api version} + rgw keystone url = {keystone server url:keystone server admin port} + rgw keystone admin token = {keystone admin token} + rgw keystone admin token path = {path to keystone admin token} #preferred + rgw keystone accepted roles = {accepted user roles} + rgw keystone token cache size = {number of tokens to cache} + rgw keystone implicit tenants = {true for private tenant for each new user} + +It is also possible to configure a Keystone service tenant, user & password for +Keystone (for v2.0 version of the OpenStack Identity API), similar to the way +OpenStack services tend to be configured, this avoids the need for setting the +shared secret ``rgw keystone admin token`` in the configuration file, which is +recommended to be disabled in production environments. The service tenant +credentials should have admin privileges, for more details refer the `OpenStack +Keystone documentation`_, which explains the process in detail. The requisite +configuration options for are:: + + rgw keystone admin user = {keystone service tenant user name} + rgw keystone admin password = {keystone service tenant user password} + rgw keystone admin password = {keystone service tenant user password path} # preferred + rgw keystone admin tenant = {keystone service tenant name} + + +A Ceph Object Gateway user is mapped into a Keystone ``tenant``. A Keystone user +has different roles assigned to it on possibly more than a single tenant. When +the Ceph Object Gateway gets the ticket, it looks at the tenant, and the user +roles that are assigned to that ticket, and accepts/rejects the request +according to the ``rgw keystone accepted roles`` configurable. + +For a v3 version of the OpenStack Identity API you should replace +``rgw keystone admin tenant`` with:: + + rgw keystone admin domain = {keystone admin domain name} + rgw keystone admin project = {keystone admin project name} + +For compatibility with previous versions of ceph, it is also +possible to set ``rgw keystone implicit tenants`` to either +``s3`` or ``swift``. This has the effect of splitting +the identity space such that the indicated protocol will +only use implicit tenants, and the other protocol will +never use implicit tenants. Some older versions of ceph +only supported implicit tenants with swift. + +Ocata (and later) +----------------- + +Keystone itself needs to be configured to point to the Ceph Object Gateway as an +object-storage endpoint:: + + openstack service create --name=swift \ + --description="Swift Service" \ + object-store + +-------------+----------------------------------+ + | Field | Value | + +-------------+----------------------------------+ + | description | Swift Service | + | enabled | True | + | id | 37c4c0e79571404cb4644201a4a6e5ee | + | name | swift | + | type | object-store | + +-------------+----------------------------------+ + + openstack endpoint create --region RegionOne \ + --publicurl "http://radosgw.example.com:8080/swift/v1" \ + --adminurl "http://radosgw.example.com:8080/swift/v1" \ + --internalurl "http://radosgw.example.com:8080/swift/v1" \ + swift + +--------------+------------------------------------------+ + | Field | Value | + +--------------+------------------------------------------+ + | adminurl | http://radosgw.example.com:8080/swift/v1 | + | id | e4249d2b60e44743a67b5e5b38c18dd3 | + | internalurl | http://radosgw.example.com:8080/swift/v1 | + | publicurl | http://radosgw.example.com:8080/swift/v1 | + | region | RegionOne | + | service_id | 37c4c0e79571404cb4644201a4a6e5ee | + | service_name | swift | + | service_type | object-store | + +--------------+------------------------------------------+ + + $ openstack endpoint show object-store + +--------------+------------------------------------------+ + | Field | Value | + +--------------+------------------------------------------+ + | adminurl | http://radosgw.example.com:8080/swift/v1 | + | enabled | True | + | id | e4249d2b60e44743a67b5e5b38c18dd3 | + | internalurl | http://radosgw.example.com:8080/swift/v1 | + | publicurl | http://radosgw.example.com:8080/swift/v1 | + | region | RegionOne | + | service_id | 37c4c0e79571404cb4644201a4a6e5ee | + | service_name | swift | + | service_type | object-store | + +--------------+------------------------------------------+ + +.. note:: If your radosgw ``ceph.conf`` sets the configuration option + ``rgw swift account in url = true``, your ``object-store`` + endpoint URLs must be set to include the suffix + ``/v1/AUTH_%(tenant_id)s`` (instead of just ``/v1``). + +The Keystone URL is the Keystone admin RESTful API URL. The admin token is the +token that is configured internally in Keystone for admin requests. + +OpenStack Keystone may be terminated with a self signed ssl certificate, in +order for radosgw to interact with Keystone in such a case, you could either +install Keystone's ssl certificate in the node running radosgw. Alternatively +radosgw could be made to not verify the ssl certificate at all (similar to +OpenStack clients with a ``--insecure`` switch) by setting the value of the +configurable ``rgw keystone verify ssl`` to false. + + +.. _OpenStack Keystone documentation: http://docs.openstack.org/developer/keystone/configuringservices.html#setting-up-projects-users-and-roles + +Cross Project(Tenant) Access +---------------------------- + +In order to let a project (earlier called a 'tenant') access buckets belonging to a different project, the following config option needs to be enabled:: + + rgw swift account in url = true + +The Keystone object-store endpoint must accordingly be configured to include the AUTH_%(project_id)s suffix:: + + openstack endpoint create --region RegionOne \ + --publicurl "http://radosgw.example.com:8080/swift/v1/AUTH_$(project_id)s" \ + --adminurl "http://radosgw.example.com:8080/swift/v1/AUTH_$(project_id)s" \ + --internalurl "http://radosgw.example.com:8080/swift/v1/AUTH_$(project_id)s" \ + swift + +--------------+--------------------------------------------------------------+ + | Field | Value | + +--------------+--------------------------------------------------------------+ + | adminurl | http://radosgw.example.com:8080/swift/v1/AUTH_$(project_id)s | + | id | e4249d2b60e44743a67b5e5b38c18dd3 | + | internalurl | http://radosgw.example.com:8080/swift/v1/AUTH_$(project_id)s | + | publicurl | http://radosgw.example.com:8080/swift/v1/AUTH_$(project_id)s | + | region | RegionOne | + | service_id | 37c4c0e79571404cb4644201a4a6e5ee | + | service_name | swift | + | service_type | object-store | + +--------------+--------------------------------------------------------------+ + +Keystone integration with the S3 API +------------------------------------ + +It is possible to use Keystone for authentication even when using the +S3 API (with AWS-like access and secret keys), if the ``rgw s3 auth +use keystone`` option is set. For details, see +:doc:`s3/authentication`. diff --git a/doc/radosgw/kmip.rst b/doc/radosgw/kmip.rst new file mode 100644 index 000000000..988897121 --- /dev/null +++ b/doc/radosgw/kmip.rst @@ -0,0 +1,219 @@ +================ +KMIP Integration +================ + +`KMIP`_ can be used as a secure key management service for +`Server-Side Encryption`_ (SSE-KMS). + +.. ditaa:: + + +---------+ +---------+ +------+ +-------+ + | Client | | RadosGW | | KMIP | | OSD | + +---------+ +---------+ +------+ +-------+ + | create secret | | | + | key for key ID | | | + |-----------------+---------------->| | + | | | | + | upload object | | | + | with key ID | | | + |---------------->| request secret | | + | | key for key ID | | + | |---------------->| | + | |<----------------| | + | | return secret | | + | | key | | + | | | | + | | encrypt object | | + | | with secret key | | + | |--------------+ | | + | | | | | + | |<-------------+ | | + | | | | + | | store encrypted | | + | | object | | + | |------------------------------>| + +#. `Setting KMIP Access for Ceph`_ +#. `Creating Keys in KMIP`_ +#. `Configure the Ceph Object Gateway`_ +#. `Upload object`_ + +Before you can use KMIP with ceph, you will need to do three things. +You will need to associate ceph with client information in KMIP, +and configure ceph to use that client information. +You will also need to create 1 or more keys in KMIP. + +Setting KMIP Access for Ceph +============================ + +Setting up Ceph in KMIP is very dependent on the mechanism(s) supported +by your implementation of KMIP. Two implementations are described +here, + +1. `IBM Security Guardium Key Lifecycle Manager (SKLM)`__. This is a well + supported commercial product. + +__ SKLM_ + +2. PyKMIP_. This is a small python project, suitable for experimental + and testing use only. + +Using IBM SKLM +-------------- + +IBM SKLM__ supports client authentication using certificates. +Certificates may either be self-signed certificates created, +for instance, using openssl, or certificates may be created +using SKLM. Ceph should then be configured (see below) to +use KMIP and an attempt made to use it. This will fail, +but it will leave an "untrusted client device certificate" in SKLM. +This can be then upgraded to a registered client using the web +interface to complete the registration process. + +__ SKLM_ + +Find untrusted clients under ``Advanced Configuration``, +``Client Device Communication Certificates``. Select +``Modify SSL/KMIP Certificates for Clients``, then toggle the flag +``allow the server to trust this certificate and communicate...``. + +Using PyKMIP +------------ + +PyKMIP_ has no special registration process, it simply +trusts the certificate. However, the certificate has to +be issued by a certificate authority that is trusted by +pykmip. PyKMIP also prefers that the certificate contain +an extension for "extended key usage". However, that +can be defeated by specifying ``enable_tls_client_auth=False`` +in the server configuration. + +Creating Keys in KMIP +===================== + +Some KMIP implementations come with a web interface or other +administrative tools to create and manage keys. Refer to your +documentation on that if you wish to use it. The KMIP protocol can also +be used to create and manage keys. PyKMIP comes with a python client +library that can be used this way. + +In preparation to using the pykmip client, you'll need to have a valid +kmip client key & certificate, such as the one you created for ceph. + +Next, you'll then need to download and install it:: + + virtualenv $HOME/my-kmip-env + source $HOME/my-kmip-env/bin/activate + pip install pykmip + +Then you'll need to prepare a configuration file +for the client, something like this:: + + cat <<EOF >$HOME/my-kmip-configuration + [client] + host={hostname} + port=5696 + certfile={clientcert} + keyfile={clientkey} + ca_certs={clientca} + ssl_version=PROTOCOL_TLSv1_2 + EOF + +You will need to replace {hostname} with the name of your kmip host, +also replace {clientcert} {clientkey} and {clientca} with pathnames to +a suitable pem encoded certificate, such as the one you created for +ceph to use. + +Now, you can run this python script directly from +the shell:: + + python + from kmip.pie import client + from kmip import enums + import ssl + import os + import sys + import json + c = client.ProxyKmipClient(config_file=os.environ['HOME']+"/my-kmip-configuration") + + while True: + l=sys.stdin.readline() + keyname=l.strip() + if keyname == "": break + with c: + key_id = c.create( + enums.CryptographicAlgorithm.AES, + 256, + operation_policy_name='default', + name=keyname, + cryptographic_usage_mask=[ + enums.CryptographicUsageMask.ENCRYPT, + enums.CryptographicUsageMask.DECRYPT + ] + ) + c.activate(key_id) + attrs = c.get_attributes(uid=key_id) + r = {} + for a in attrs[1]: + r[str(a.attribute_name)] = str(a.attribute_value) + print (json.dumps(r)) + +If this is all entered at the shell prompt, python will +prompt with ">>>" then "..." until the script is read in, +after which it will read and process names with no prompt +until a blank line or end of file (^D) is given it, or +an error occurs. Of course you can turn this into a regular +python script if you prefer. + +Configure the Ceph Object Gateway +================================= + +Edit the Ceph configuration file to enable Vault as a KMS backend for +server-side encryption:: + + rgw crypt s3 kms backend = kmip + rgw crypt kmip ca path: /etc/ceph/kmiproot.crt + rgw crypt kmip client cert: /etc/ceph/kmip-client.crt + rgw crypt kmip client key: /etc/ceph/private/kmip-client.key + rgw crypt kmip kms key template: pykmip-$keyid + +You may need to change the paths above to match where +you actually want to store kmip certificate data. + +The kmip key template describes how ceph will modify +the name given to it before it looks it up +in kmip. The default is just "$keyid". +If you don't want ceph to see all your kmip +keys, you can use this to limit ceph to just the +designated subset of your kmip key namespace. + +Upload object +============= + +When uploading an object to the Gateway, provide the SSE key ID in the request. +As an example, for the kv engine, using the AWS command-line client:: + + aws --endpoint=http://radosgw:8000 s3 cp plaintext.txt \ + s3://mybucket/encrypted.txt --sse=aws:kms --sse-kms-key-id mybucketkey + +As an example, for the transit engine, using the AWS command-line client:: + + aws --endpoint=http://radosgw:8000 s3 cp plaintext.txt \ + s3://mybucket/encrypted.txt --sse=aws:kms --sse-kms-key-id mybucketkey + +The Object Gateway will fetch the key from Vault, encrypt the object and store +it in the bucket. Any request to download the object will make the Gateway +automatically retrieve the correspondent key from Vault and decrypt the object. + +Note that the secret will be fetched from kmip using a name constructed +from the key template, replacing ``$keyid`` with the key provided. + +With the ceph configuration given above, +radosgw would fetch the secret from:: + + pykmip-mybucketkey + +.. _Server-Side Encryption: ../encryption +.. _KMIP: http://www.oasis-open.org/committees/kmip/ +.. _SKLM: https://www.ibm.com/products/ibm-security-key-lifecycle-manager +.. _PyKMIP: https://pykmip.readthedocs.io/en/latest/ diff --git a/doc/radosgw/layout.rst b/doc/radosgw/layout.rst new file mode 100644 index 000000000..37c3a72cd --- /dev/null +++ b/doc/radosgw/layout.rst @@ -0,0 +1,207 @@ +=========================== + Rados Gateway Data Layout +=========================== + +Although the source code is the ultimate guide, this document helps +new developers to get up to speed with the implementation details. + +Introduction +------------ + +Swift offers something called a *container*, which we use interchangeably with +the term *bucket*, so we say that RGW's buckets implement Swift containers. + +This document does not consider how RGW operates on these structures, +e.g. the use of encode() and decode() methods for serialization and so on. + +Conceptual View +--------------- + +Although RADOS only knows about pools and objects with their xattrs and +omap[1], conceptually RGW organizes its data into three different kinds: +metadata, bucket index, and data. + +Metadata +^^^^^^^^ + +We have 3 'sections' of metadata: 'user', 'bucket', and 'bucket.instance'. +You can use the following commands to introspect metadata entries: :: + + $ radosgw-admin metadata list + $ radosgw-admin metadata list bucket + $ radosgw-admin metadata list bucket.instance + $ radosgw-admin metadata list user + + $ radosgw-admin metadata get bucket:<bucket> + $ radosgw-admin metadata get bucket.instance:<bucket>:<bucket_id> + $ radosgw-admin metadata get user:<user> # get or set + +Some variables have been used in above commands, they are: + +- user: Holds user information +- bucket: Holds a mapping between bucket name and bucket instance id +- bucket.instance: Holds bucket instance information[2] + +Every metadata entry is kept on a single RADOS object. See below for implementation details. + +Note that the metadata is not indexed. When listing a metadata section we do a +RADOS ``pgls`` operation on the containing pool. + +Bucket Index +^^^^^^^^^^^^ + +It's a different kind of metadata, and kept separately. The bucket index holds +a key-value map in RADOS objects. By default it is a single RADOS object per +bucket, but it is possible since Hammer to shard that map over multiple RADOS +objects. The map itself is kept in omap, associated with each RADOS object. +The key of each omap is the name of the objects, and the value holds some basic +metadata of that object -- metadata that shows up when listing the bucket. +Also, each omap holds a header, and we keep some bucket accounting metadata +in that header (number of objects, total size, etc.). + +Note that we also hold other information in the bucket index, and it's kept in +other key namespaces. We can hold the bucket index log there, and for versioned +objects there is more information that we keep on other keys. + +Data +^^^^ + +Objects data is kept in one or more RADOS objects for each rgw object. + +Object Lookup Path +------------------ + +When accessing objects, REST APIs come to RGW with three parameters: +account information (access key in S3 or account name in Swift), +bucket or container name, and object name (or key). At present, RGW only +uses account information to find out the user ID and for access control. +Only the bucket name and object key are used to address the object in a pool. + +The user ID in RGW is a string, typically the actual user name from the user +credentials and not a hashed or mapped identifier. + +When accessing a user's data, the user record is loaded from an object +"<user_id>" in pool "default.rgw.meta" with namespace "users.uid". + +Bucket names are represented in the pool "default.rgw.meta" with namespace +"root". Bucket record is +loaded in order to obtain so-called marker, which serves as a bucket ID. + +The object is located in pool "default.rgw.buckets.data". +Object name is "<marker>_<key>", +for example "default.7593.4_image.png", where the marker is "default.7593.4" +and the key is "image.png". Since these concatenated names are not parsed, +only passed down to RADOS, the choice of the separator is not important and +causes no ambiguity. For the same reason, slashes are permitted in object +names (keys). + +It is also possible to create multiple data pools and make it so that +different users buckets will be created in different RADOS pools by default, +thus providing the necessary scaling. The layout and naming of these pools +is controlled by a 'policy' setting.[3] + +An RGW object may consist of several RADOS objects, the first of which +is the head that contains the metadata, such as manifest, ACLs, content type, +ETag, and user-defined metadata. The metadata is stored in xattrs. +The head may also contain up to 512 kilobytes of object data, for efficiency +and atomicity. The manifest describes how each object is laid out in RADOS +objects. + +Bucket and Object Listing +------------------------- + +Buckets that belong to a given user are listed in an omap of an object named +"<user_id>.buckets" (for example, "foo.buckets") in pool "default.rgw.meta" +with namespace "users.uid". +These objects are accessed when listing buckets, when updating bucket +contents, and updating and retrieving bucket statistics (e.g. for quota). + +See the user-visible, encoded class 'cls_user_bucket_entry' and its +nested class 'cls_user_bucket' for the values of these omap entires. + +These listings are kept consistent with buckets in pool ".rgw". + +Objects that belong to a given bucket are listed in a bucket index, +as discussed in sub-section 'Bucket Index' above. The default naming +for index objects is ".dir.<marker>" in pool "default.rgw.buckets.index". + +Footnotes +--------- + +[1] Omap is a key-value store, associated with an object, in a way similar +to how Extended Attributes associate with a POSIX file. An object's omap +is not physically located in the object's storage, but its precise +implementation is invisible and immaterial to RADOS Gateway. +In Hammer, one LevelDB is used to store omap in each OSD. + +[2] Before the Dumpling release, the 'bucket.instance' metadata did not +exist and the 'bucket' metadata contained its information. It is possible +to encounter such buckets in old installations. + +[3] The pool names have been changed starting with the Infernalis release. +If you are looking at an older setup, some details may be different. In +particular there was a different pool for each of the namespaces that are +now being used inside the default.root.meta pool. + +Appendix: Compendium +-------------------- + +Known pools: + +.rgw.root + Unspecified region, zone, and global information records, one per object. + +<zone>.rgw.control + notify.<N> + +<zone>.rgw.meta + Multiple namespaces with different kinds of metadata: + + namespace: root + <bucket> + .bucket.meta.<bucket>:<marker> # see put_bucket_instance_info() + + The tenant is used to disambiguate buckets, but not bucket instances. + Example:: + + .bucket.meta.prodtx:test%25star:default.84099.6 + .bucket.meta.testcont:default.4126.1 + .bucket.meta.prodtx:testcont:default.84099.4 + prodtx/testcont + prodtx/test%25star + testcont + + namespace: users.uid + Contains _both_ per-user information (RGWUserInfo) in "<user>" objects + and per-user lists of buckets in omaps of "<user>.buckets" objects. + The "<user>" may contain the tenant if non-empty, for example:: + + prodtx$prodt + test2.buckets + prodtx$prodt.buckets + test2 + + namespace: users.email + Unimportant + + namespace: users.keys + 47UA98JSTJZ9YAN3OS3O + + This allows ``radosgw`` to look up users by their access keys during authentication. + + namespace: users.swift + test:tester + +<zone>.rgw.buckets.index + Objects are named ".dir.<marker>", each contains a bucket index. + If the index is sharded, each shard appends the shard index after + the marker. + +<zone>.rgw.buckets.data + default.7593.4__shadow_.488urDFerTYXavx4yAd-Op8mxehnvTI_1 + <marker>_<key> + +An example of a marker would be "default.16004.1" or "default.7593.4". +The current format is "<zone>.<instance_id>.<bucket_id>". But once +generated, a marker is not parsed again, so its format may change +freely in the future. diff --git a/doc/radosgw/ldap-auth.rst b/doc/radosgw/ldap-auth.rst new file mode 100644 index 000000000..486d0c623 --- /dev/null +++ b/doc/radosgw/ldap-auth.rst @@ -0,0 +1,167 @@ +=================== +LDAP Authentication +=================== + +.. versionadded:: Jewel + +You can delegate the Ceph Object Gateway authentication to an LDAP server. + +How it works +============ + +The Ceph Object Gateway extracts the users LDAP credentials from a token. A +search filter is constructed with the user name. The Ceph Object Gateway uses +the configured service account to search the directory for a matching entry. If +an entry is found, the Ceph Object Gateway attempts to bind to the found +distinguished name with the password from the token. If the credentials are +valid, the bind will succeed, and the Ceph Object Gateway will grant access and +radosgw-user will be created with the provided username. + +You can limit the allowed users by setting the base for the search to a +specific organizational unit or by specifying a custom search filter, for +example requiring specific group membership, custom object classes, or +attributes. + +The LDAP credentials must be available on the server to perform the LDAP +authentication. Make sure to set the ``rgw`` log level low enough to hide the +base-64-encoded credentials / access tokens. + +Requirements +============ + +- **LDAP or Active Directory:** A running LDAP instance accessible by the Ceph + Object Gateway +- **Service account:** LDAP credentials to be used by the Ceph Object Gateway + with search permissions +- **User account:** At least one user account in the LDAP directory +- **Do not overlap LDAP and local users:** You should not use the same user + names for local users and for users being authenticated by using LDAP. The + Ceph Object Gateway cannot distinguish them and it treats them as the same + user. + +Sanity checks +============= + +Use the ``ldapsearch`` utility to verify the service account or the LDAP connection: + +:: + + # ldapsearch -x -D "uid=ceph,ou=system,dc=example,dc=com" -W \ + -H ldaps://example.com -b "ou=users,dc=example,dc=com" 'uid=*' dn + +.. note:: Make sure to use the same LDAP parameters like in the Ceph configuration file to + eliminate possible problems. + +Configuring the Ceph Object Gateway to use LDAP authentication +============================================================== + +The following parameters in the Ceph configuration file are related to the LDAP +authentication: + +- ``rgw_s3_auth_use_ldap``: Set this to ``true`` to enable S3 authentication with LDAP +- ``rgw_ldap_uri``: Specifies the LDAP server to use. Make sure to use the + ``ldaps://<fqdn>:<port>`` parameter to not transmit clear text credentials + over the wire. +- ``rgw_ldap_binddn``: The Distinguished Name (DN) of the service account used + by the Ceph Object Gateway +- ``rgw_ldap_secret``: Path to file containing credentials for ``rgw_ldap_binddn`` +- ``rgw_ldap_searchdn``: Specifies the base in the directory information tree + for searching users. This might be your users organizational unit or some + more specific Organizational Unit (OU). +- ``rgw_ldap_dnattr``: The attribute being used in the constructed search + filter to match a username. Depending on your Directory Information Tree + (DIT) this would probably be ``uid`` or ``cn``. The generated filter string + will be, e.g., ``cn=some_username``. +- ``rgw_ldap_searchfilter``: If not specified, the Ceph Object Gateway + automatically constructs the search filter with the ``rgw_ldap_dnattr`` + setting. Use this parameter to narrow the list of allowed users in very + flexible ways. Consult the *Using a custom search filter to limit user access + section* for details + +Using a custom search filter to limit user access +================================================= + +There are two ways to use the ``rgw_search_filter`` parameter: + +Specifying a partial filter to further limit the constructed search filter +-------------------------------------------------------------------------- + +An example for a partial filter: + +:: + + "objectclass=inetorgperson" + +The Ceph Object Gateway will generate the search filter as usual with the +user name from the token and the value of ``rgw_ldap_dnattr``. The constructed +filter is then combined with the partial filter from the ``rgw_search_filter`` +attribute. Depending on the user name and the settings the final search filter +might become: + +:: + + "(&(uid=hari)(objectclass=inetorgperson))" + +So user ``hari`` will only be granted access if he is found in the LDAP +directory, has an object class of ``inetorgperson``, and did specify a valid +password. + +Specifying a complete filter +---------------------------- + +A complete filter must contain a ``@USERNAME@`` token which will be substituted +with the user name during the authentication attempt. The ``rgw_ldap_dnattr`` +parameter is not used anymore in this case. For example, to limit valid users +to a specific group, use the following filter: + +:: + + "(&(uid=@USERNAME@)(memberOf=cn=ceph-users,ou=groups,dc=mycompany,dc=com))" + +.. note:: Using the ``memberOf`` attribute in LDAP searches requires server side + support from you specific LDAP server implementation. + +Generating an access token for LDAP authentication +================================================== + +The ``radosgw-token`` utility generates the access token based on the LDAP +user name and password. It will output a base-64 encoded string which is the +access token. + +:: + + # export RGW_ACCESS_KEY_ID="<username>" + # export RGW_SECRET_ACCESS_KEY="<password>" + # radosgw-token --encode + +.. important:: The access token is a base-64 encoded JSON struct and contains + the LDAP credentials as a clear text. + +Alternatively, users can also generate the token manually by base-64-encoding +this JSON snippet, if they do not have the ``radosgw-token`` tool installed. + +:: + + { + "RGW_TOKEN": { + "version": 1, + "type": "ldap", + "id": "your_username", + "key": "your_clear_text_password_here" + } + } + +Using the access token +====================== + +Use your favorite S3 client and specify the token as the access key in your +client or environment variables. + +:: + + # export AWS_ACCESS_KEY_ID=<base64-encoded token generated by radosgw-token> + # export AWS_SECRET_ACCESS_KEY="" # define this with an empty string, otherwise tools might complain about missing env variables. + +.. important:: The access token is a base-64 encoded JSON struct and contains + the LDAP credentials as a clear text. DO NOT share it unless + you want to share your clear text password! diff --git a/doc/radosgw/lua-scripting.rst b/doc/radosgw/lua-scripting.rst new file mode 100644 index 000000000..0f385c46f --- /dev/null +++ b/doc/radosgw/lua-scripting.rst @@ -0,0 +1,393 @@ +============= +Lua Scripting +============= + +.. versionadded:: Pacific + +.. contents:: + +This feature allows users to upload Lua scripts to different context in the radosgw. The two supported context are "preRequest" that will execute a script before the +operation was taken, and "postRequest" that will execute after each operation is taken. Script may be uploaded to address requests for users of a specific tenant. +The script can access fields in the request and modify some fields. All Lua language features can be used in the script. + +By default, all lua standard libraries are available in the script, however, in order to allow for other lua modules to be used in the script, we support adding packages to an allowlist: + + - All packages in the allowlist are being re-installed using the luarocks package manager on radosgw restart. Therefore a restart is needed for adding or removing of packages to take effect + - To add a package that contains C source code that needs to be compiled, use the `--allow-compilation` flag. In this case a C compiler needs to be available on the host + - Lua packages are installed in, and used from, a directory local to the radosgw. Meaning that lua packages in the allowlist are separated from any lua packages available on the host. + By default, this directory would be `/tmp/luarocks/<entity name>`. Its prefix part (`/tmp/luarocks/`) could be set to a different location via the `rgw_luarocks_location` configuration parameter. + Note that this parameter should not be set to one of the default locations where luarocks install packages (e.g. `$HOME/.luarocks`, `/usr/lib64/lua`, `/usr/share/lua`) + + +.. toctree:: + :maxdepth: 1 + + +Script Management via CLI +------------------------- + +To upload a script: + +:: + + # radosgw-admin script put --infile={lua-file} --context={preRequest|postRequest} [--tenant={tenant-name}] + + +To print the content of the script to standard output: + +:: + + # radosgw-admin script get --context={preRequest|postRequest} [--tenant={tenant-name}] + + +To remove the script: + +:: + + # radosgw-admin script rm --context={preRequest|postRequest} [--tenant={tenant-name}] + + +Package Management via CLI +-------------------------- + +To add a package to the allowlist: + +:: + + # radosgw-admin script-package add --package={package name} [--allow-compilation] + + +To remove a package from the allowlist: + +:: + + # radosgw-admin script-package rm --package={package name} + + +To print the list of packages in the allowlist: + +:: + + # radosgw-admin script-package list + + +Context Free Functions +---------------------- +Debug Log +~~~~~~~~~ +The ``RGWDebugLog()`` function accepts a string and prints it to the debug log with priority 20. +Each log message is prefixed ``Lua INFO:``. This function has no return value. + +Request Fields +----------------- + +.. warning:: This feature is experimental. Fields may be removed or renamed in the future. + +.. note:: + + - Although Lua is a case-sensitive language, field names provided by the radosgw are case-insensitive. Function names remain case-sensitive. + - Fields marked "optional" can have a nil value. + - Fields marked as "iterable" can be used by the pairs() function and with the # length operator. + - All table fields can be used with the bracket operator ``[]``. + - ``time`` fields are strings with the following format: ``%Y-%m-%d %H:%M:%S``. + + ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| Field | Type | Description | Iterable | Writeable | Optional | ++====================================================+==========+==============================================================+==========+===========+==========+ +| ``Request.RGWOp`` | string | radosgw operation | no | no | no | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| ``Request.DecodedURI`` | string | decoded URI | no | no | no | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| ``Request.ContentLength`` | integer | size of the request | no | no | no | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| ``Request.GenericAttributes`` | table | string to string generic attributes map | yes | no | no | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| ``Request.Response`` | table | response to the request | no | no | no | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| ``Request.Response.HTTPStatusCode`` | integer | HTTP status code | no | yes | no | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| ``Request.Response.HTTPStatus`` | string | HTTP status text | no | yes | no | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| ``Request.Response.RGWCode`` | integer | radosgw error code | no | yes | no | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| ``Request.Response.Message`` | string | response message | no | yes | no | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| ``Request.SwiftAccountName`` | string | swift account name | no | no | yes | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| ``Request.Bucket`` | table | info on the bucket | no | no | yes | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| ``Request.Bucket.Tenant`` | string | tenant of the bucket | no | no | no | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| ``Request.Bucket.Name`` | string | bucket name | no | no | no | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| ``Request.Bucket.Marker`` | string | bucket marker (initial id) | no | no | no | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| ``Request.Bucket.Id`` | string | bucket id | no | no | no | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| ``Request.Bucket.Count`` | integer | number of objects in the bucket | no | no | no | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| ``Request.Bucket.Size`` | integer | total size of objects in the bucket | no | no | no | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| ``Request.Bucket.ZoneGroupId`` | string | zone group of the bucket | no | no | no | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| ``Request.Bucket.CreationTime`` | time | creation time of the bucket | no | no | no | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| ``Request.Bucket.MTime`` | time | modification time of the bucket | no | no | no | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| ``Request.Bucket.Quota`` | table | bucket quota | no | no | no | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| ``Request.Bucket.Quota.MaxSize`` | integer | bucket quota max size | no | no | no | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| ``Request.Bucket.Quota.MaxObjects`` | integer | bucket quota max number of objects | no | no | no | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| ``Reques.Bucket.Quota.Enabled`` | boolean | bucket quota is enabled | no | no | no | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| ``Request.Bucket.Quota.Rounded`` | boolean | bucket quota is rounded to 4K | no | no | no | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| ``Request.Bucket.PlacementRule`` | table | bucket placement rule | no | no | no | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| ``Request.Bucket.PlacementRule.Name`` | string | bucket placement rule name | no | no | no | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| ``Request.Bucket.PlacementRule.StorageClass`` | string | bucket placement rule storage class | no | no | no | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| ``Request.Bucket.User`` | table | bucket owner | no | no | no | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| ``Request.Bucket.User.Tenant`` | string | bucket owner tenant | no | no | no | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| ``Request.Bucket.User.Id`` | string | bucket owner id | no | no | no | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| ``Request.Object`` | table | info on the object | no | no | yes | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| ``Request.Object.Name`` | string | object name | no | no | no | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| ``Request.Object.Instance`` | string | object version | no | no | no | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| ``Request.Object.Id`` | string | object id | no | no | no | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| ``Request.Object.Size`` | integer | object size | no | no | no | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| ``Request.Object.MTime`` | time | object mtime | no | no | no | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| ``Request.CopyFrom`` | table | information on copy operation | no | no | yes | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| ``Request.CopyFrom.Tenant`` | string | tenant of the object copied from | no | no | no | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| ``Request.CopyFrom.Bucket`` | string | bucket of the object copied from | no | no | no | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| ``Request.CopyFrom.Object`` | table | object copied from. See: ``Request.Object`` | no | no | yes | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| ``Request.ObjectOwner`` | table | object owner | no | no | no | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| ``Request.ObjectOwner.DisplayName`` | string | object owner display name | no | no | no | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| ``Request.ObjectOwner.User`` | table | object user. See: ``Request.Bucket.User`` | no | no | no | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| ``Request.ZoneGroup.Name`` | string | name of zone group | no | no | no | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| ``Request.ZoneGroup.Endpoint`` | string | endpoint of zone group | no | no | no | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| ``Request.UserAcl`` | table | user ACL | no | no | no | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| ``Request.UserAcl.Owner`` | table | user ACL owner. See: ``Request.ObjectOwner`` | no | no | no | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| ``Request.UserAcl.Grants`` | table | user ACL map of string to grant | yes | no | no | +| | | note: grants without an Id are not presented when iterated | | | | +| | | and only one of them can be accessed via brackets | | | | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| ``Request.UserAcl.Grants["<name>"]`` | table | user ACL grant | no | no | no | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| ``Request.UserAcl.Grants["<name>"].Type`` | integer | user ACL grant type | no | no | no | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| ``Request.UserAcl.Grants["<name>"].User`` | table | user ACL grant user | no | no | no | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| ``Request.UserAcl.Grants["<name>"].User.Tenant`` | table | user ACL grant user tenant | no | no | no | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| ``Request.UserAcl.Grants["<name>"].User.Id`` | table | user ACL grant user id | no | no | no | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| ``Request.UserAcl.Grants["<name>"].GroupType`` | integer | user ACL grant group type | no | no | no | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| ``Request.UserAcl.Grants["<name>"].Referer`` | string | user ACL grant referer | no | no | no | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| ``Request.BucketAcl`` | table | bucket ACL. See: ``Request.UserAcl`` | no | no | no | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| ``Request.ObjectAcl`` | table | object ACL. See: ``Request.UserAcl`` | no | no | no | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| ``Request.Environment`` | table | string to string environment map | yes | no | no | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| ``Request.Policy`` | table | policy | no | no | yes | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| ``Request.Policy.Text`` | string | policy text | no | no | no | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| ``Request.Policy.Id`` | string | policy Id | no | no | yes | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| ``Request.Policy.Statements`` | table | list of string statements | yes | no | no | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| ``Request.UserPolicies`` | table | list of user policies | yes | no | no | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| ``Request.UserPolicies[<index>]`` | table | user policy. See: ``Request.Policy`` | no | no | no | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| ``Request.RGWId`` | string | radosgw host id: ``<host>-<zone>-<zonegroup>`` | no | no | no | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| ``Request.HTTP`` | table | HTTP header | no | no | no | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| ``Request.HTTP.Parameters`` | table | string to string parameter map | yes | no | no | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| ``Request.HTTP.Resources`` | table | string to string resource map | yes | no | no | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| ``Request.HTTP.Metadata`` | table | string to string metadata map | yes | yes | no | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| ``Request.HTTP.Host`` | string | host name | no | no | no | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| ``Request.HTTP.Method`` | string | HTTP method | no | no | no | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| ``Request.HTTP.URI`` | string | URI | no | no | no | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| ``Request.HTTP.QueryString`` | string | HTTP query string | no | no | no | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| ``Request.HTTP.Domain`` | string | domain name | no | no | no | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| ``Request.Time`` | time | request time | no | no | no | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| ``Request.Dialect`` | string | "S3" or "Swift" | no | no | no | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| ``Request.Id`` | string | request Id | no | no | no | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| ``Request.TransactionId`` | string | transaction Id | no | no | no | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ +| ``Request.Tags`` | table | object tags map | yes | no | no | ++----------------------------------------------------+----------+--------------------------------------------------------------+----------+-----------+----------+ + +Request Functions +-------------------- +Operations Log +~~~~~~~~~~~~~~ +The ``Request.Log()`` function prints the requests into the operations log. This function has no parameters. It returns 0 for success and an error code if it fails. + +Lua Code Samples +---------------- +- Print information on source and destination objects in case of copy: + +.. code-block:: lua + + function print_object(object) + RGWDebugLog(" Name: " .. object.Name) + RGWDebugLog(" Instance: " .. object.Instance) + RGWDebugLog(" Id: " .. object.Id) + RGWDebugLog(" Size: " .. object.Size) + RGWDebugLog(" MTime: " .. object.MTime) + end + + if Request.CopyFrom and Request.Object and Request.CopyFrom.Object then + RGWDebugLog("copy from object:") + print_object(Request.CopyFrom.Object) + RGWDebugLog("to object:") + print_object(Request.Object) + end + +- Print ACLs via a "generic function": + +.. code-block:: lua + + function print_owner(owner) + RGWDebugLog("Owner:") + RGWDebugLog(" Dispaly Name: " .. owner.DisplayName) + RGWDebugLog(" Id: " .. owner.User.Id) + RGWDebugLog(" Tenanet: " .. owner.User.Tenant) + end + + function print_acl(acl_type) + index = acl_type .. "ACL" + acl = Request[index] + if acl then + RGWDebugLog(acl_type .. "ACL Owner") + print_owner(acl.Owner) + RGWDebugLog(" there are " .. #acl.Grants .. " grant for owner") + for k,v in pairs(acl.Grants) do + RGWDebugLog(" Grant Key: " .. k) + RGWDebugLog(" Grant Type: " .. v.Type) + RGWDebugLog(" Grant Group Type: " .. v.GroupType) + RGWDebugLog(" Grant Referer: " .. v.Referer) + RGWDebugLog(" Grant User Tenant: " .. v.User.Tenant) + RGWDebugLog(" Grant User Id: " .. v.User.Id) + end + else + RGWDebugLog("no " .. acl_type .. " ACL in request: " .. Request.Id) + end + end + + print_acl("User") + print_acl("Bucket") + print_acl("Object") + +- Use of operations log only in case of errors: + +.. code-block:: lua + + if Request.Response.HTTPStatusCode ~= 200 then + RGWDebugLog("request is bad, use ops log") + rc = Request.Log() + RGWDebugLog("ops log return code: " .. rc) + end + +- Set values into the error message: + +.. code-block:: lua + + if Request.Response.HTTPStatusCode == 500 then + Request.Response.Message = "<Message> something bad happened :-( </Message>" + end + +- Add metadata to objects that was not originally sent by the client: + +In the `preRequest` context we should add: + +.. code-block:: lua + + if Request.RGWOp == 'put_obj' then + Request.HTTP.Metadata["x-amz-meta-mydata"] = "my value" + end + +In the `postRequest` context we look at the metadata: + +.. code-block:: lua + + RGWDebugLog("number of metadata entries is: " .. #Request.HTTP.Metadata) + for k, v in pairs(Request.HTTP.Metadata) do + RGWDebugLog("key=" .. k .. ", " .. "value=" .. v) + end + +- Use modules to create Unix socket based, JSON encoded, "access log": + +First we should add the following packages to the allowlist: + +:: + + # radosgw-admin script-package add --package=luajson + # radosgw-admin script-package add --package=luasocket --allow-compilation + + +Then, do a restart for the radosgw and upload the following script to the `postRequest` context: + +.. code-block:: lua + + if Request.RGWOp == "get_obj" then + local json = require("json") + local socket = require("socket") + local unix = require("socket.unix") + local s = assert(unix()) + E = {} + + msg = {bucket = (Request.Bucket or (Request.CopyFrom or E).Bucket).Name, + time = Request.Time, + operation = Request.RGWOp, + http_status = Request.Response.HTTPStatusCode, + error_code = Request.Response.HTTPStatus, + object_size = Request.Object.Size, + trans_id = Request.TransactionId} + + assert(s:connect("/tmp/socket")) + assert(s:send(json.encode(msg).."\n")) + assert(s:close()) + end + diff --git a/doc/radosgw/mfa.rst b/doc/radosgw/mfa.rst new file mode 100644 index 000000000..0cbead85f --- /dev/null +++ b/doc/radosgw/mfa.rst @@ -0,0 +1,102 @@ +.. _rgw_mfa: + +========================================== +RGW Support for Multifactor Authentication +========================================== + +.. versionadded:: Mimic + +The S3 multifactor authentication (MFA) feature allows +users to require the use of one-time password when removing +objects on certain buckets. The buckets need to be configured +with versioning and MFA enabled which can be done through +the S3 api. + +Time-based one time password tokens can be assigned to a user +through radosgw-admin. Each token has a secret seed, and a serial +id that is assigned to it. Tokens are added to the user, can +be listedm removed, and can also be re-synchronized. + +Multisite +========= + +While the MFA IDs are set on the user's metadata, the +actual MFA one time password configuration resides in the local zone's +osds. Therefore, in a multi-site environment it is advisable to use +different tokens for different zones. + + +Terminology +============= + +-``TOTP``: Time-based One Time Password + +-``token serial``: a string that represents the ID of a TOTP token + +-``token seed``: the secret seed that is used to calculate the TOTP + +-``totp seconds``: the time resolution that is being used for TOTP generation + +-``totp window``: the number of TOTP tokens that are checked before and after the current token when validating token + +-``totp pin``: the valid value of a TOTP token at a certain time + + +Admin commands +============== + +Create a new MFA TOTP token +------------------------------------ + +:: + + # radosgw-admin mfa create --uid=<user-id> \ + --totp-serial=<serial> \ + --totp-seed=<seed> \ + [ --totp-seed-type=<hex|base32> ] \ + [ --totp-seconds=<num-seconds> ] \ + [ --totp-window=<twindow> ] + +List MFA TOTP tokens +--------------------- + +:: + + # radosgw-admin mfa list --uid=<user-id> + + +Show MFA TOTP token +------------------------------------ + +:: + + # radosgw-admin mfa get --uid=<user-id> --totp-serial=<serial> + + +Delete MFA TOTP token +------------------------ + +:: + + # radosgw-admin mfa remove --uid=<user-id> --totp-serial=<serial> + + +Check MFA TOTP token +-------------------------------- + +Test a TOTP token pin, needed for validating that TOTP functions correctly. :: + + # radosgw-admin mfa check --uid=<user-id> --totp-serial=<serial> \ + --totp-pin=<pin> + + +Re-sync MFA TOTP token +-------------------------------- + +In order to re-sync the TOTP token (in case of time skew). This requires +feeding two consecutive pins: the previous pin, and the current pin. :: + + # radosgw-admin mfa resync --uid=<user-id> --totp-serial=<serial> \ + --totp-pin=<prev-pin> --totp=pin=<current-pin> + + diff --git a/doc/radosgw/multisite-sync-policy.rst b/doc/radosgw/multisite-sync-policy.rst new file mode 100644 index 000000000..56342473c --- /dev/null +++ b/doc/radosgw/multisite-sync-policy.rst @@ -0,0 +1,714 @@ +===================== +Multisite Sync Policy +===================== + +.. versionadded:: Octopus + +Multisite bucket-granularity sync policy provides fine grained control of data movement between buckets in different zones. It extends the zone sync mechanism. Previously buckets were being treated symmetrically, that is -- each (data) zone holds a mirror of that bucket that should be the same as all the other zones. Whereas leveraging the bucket-granularity sync policy is possible for buckets to diverge, and a bucket can pull data from other buckets (ones that don't share its name or its ID) in different zone. The sync process was assuming therefore that the bucket sync source and the bucket sync destination were always referring to the same bucket, now that is not the case anymore. + +The sync policy supersedes the old zonegroup coarse configuration (sync_from*). The sync policy can be configured at the zonegroup level (and if it is configured it replaces the old style config), but it can also be configured at the bucket level. + +In the sync policy multiple groups that can contain lists of data-flow configurations can be defined, as well as lists of pipe configurations. The data-flow defines the flow of data between the different zones. It can define symmetrical data flow, in which multiple zones sync data from each other, and it can define directional data flow, in which the data moves in one way from one zone to another. + +A pipe defines the actual buckets that can use these data flows, and the properties that are associated with it (for example: source object prefix). + +A sync policy group can be in 3 states: + ++----------------------------+----------------------------------------+ +| Value | Description | ++============================+========================================+ +| ``enabled`` | sync is allowed and enabled | ++----------------------------+----------------------------------------+ +| ``allowed`` | sync is allowed | ++----------------------------+----------------------------------------+ +| ``forbidden`` | sync (as defined by this group) is not | +| | allowed and can override other groups | ++----------------------------+----------------------------------------+ + +A policy can be defined at the bucket level. A bucket level sync policy inherits the data flow of the zonegroup policy, and can only define a subset of what the zonegroup allows. + +A wildcard zone, and a wildcard bucket parameter in the policy defines all relevant zones, or all relevant buckets. In the context of a bucket policy it means the current bucket instance. A disaster recovery configuration where entire zones are mirrored doesn't require configuring anything on the buckets. However, for a fine grained bucket sync it would be better to configure the pipes to be synced by allowing (status=allowed) them at the zonegroup level (e.g., using wildcards), but only enable the specific sync at the bucket level (status=enabled). If needed, the policy at the bucket level can limit the data movement to specific relevant zones. + +.. important:: Any changes to the zonegroup policy needs to be applied on the + zonegroup master zone, and require period update and commit. Changes + to the bucket policy needs to be applied on the zonegroup master + zone. The changes are dynamically handled by rgw. + + +S3 Replication API +~~~~~~~~~~~~~~~~~~ + +The S3 bucket replication api has also been implemented, and allows users to create replication rules between different buckets. Note though that while the AWS replication feature allows bucket replication within the same zone, rgw does not allow it at the moment. However, the rgw api also added a new 'Zone' array that allows users to select to what zones the specific bucket will be synced. + + +Sync Policy Control Reference +============================= + + +Get Sync Policy +~~~~~~~~~~~~~~~ + +To retrieve the current zonegroup sync policy, or a specific bucket policy: + +:: + + # radosgw-admin sync policy get [--bucket=<bucket>] + + +Create Sync Policy Group +~~~~~~~~~~~~~~~~~~~~~~~~ + +To create a sync policy group: + +:: + + # radosgw-admin sync group create [--bucket=<bucket>] \ + --group-id=<group-id> \ + --status=<enabled | allowed | forbidden> \ + + +Modify Sync Policy Group +~~~~~~~~~~~~~~~~~~~~~~~~ + +To modify a sync policy group: + +:: + + # radosgw-admin sync group modify [--bucket=<bucket>] \ + --group-id=<group-id> \ + --status=<enabled | allowed | forbidden> \ + + +Show Sync Policy Group +~~~~~~~~~~~~~~~~~~~~~~~~ + +To show a sync policy group: + +:: + + # radosgw-admin sync group get [--bucket=<bucket>] \ + --group-id=<group-id> + + +Remove Sync Policy Group +~~~~~~~~~~~~~~~~~~~~~~~~ + +To remove a sync policy group: + +:: + + # radosgw-admin sync group remove [--bucket=<bucket>] \ + --group-id=<group-id> + + + +Create Sync Flow +~~~~~~~~~~~~~~~~ + +- To create or update directional sync flow: + +:: + + # radosgw-admin sync group flow create [--bucket=<bucket>] \ + --group-id=<group-id> \ + --flow-id=<flow-id> \ + --flow-type=directional \ + --source-zone=<source_zone> \ + --dest-zone=<dest_zone> + + +- To create or update symmetrical sync flow: + +:: + + # radosgw-admin sync group flow create [--bucket=<bucket>] \ + --group-id=<group-id> \ + --flow-id=<flow-id> \ + --flow-type=symmetrical \ + --zones=<zones> + + +Where zones are a comma separated list of all the zones that need to add to the flow. + + +Remove Sync Flow Zones +~~~~~~~~~~~~~~~~~~~~~~ + +- To remove directional sync flow: + +:: + + # radosgw-admin sync group flow remove [--bucket=<bucket>] \ + --group-id=<group-id> \ + --flow-id=<flow-id> \ + --flow-type=directional \ + --source-zone=<source_zone> \ + --dest-zone=<dest_zone> + + +- To remove specific zones from symmetrical sync flow: + +:: + + # radosgw-admin sync group flow remove [--bucket=<bucket>] \ + --group-id=<group-id> \ + --flow-id=<flow-id> \ + --flow-type=symmetrical \ + --zones=<zones> + + +Where zones are a comma separated list of all zones to remove from the flow. + + +- To remove symmetrical sync flow: + +:: + + # radosgw-admin sync group flow remove [--bucket=<bucket>] \ + --group-id=<group-id> \ + --flow-id=<flow-id> \ + --flow-type=symmetrical + + +Create Sync Pipe +~~~~~~~~~~~~~~~~ + +To create sync group pipe, or update its parameters: + + +:: + + # radosgw-admin sync group pipe create [--bucket=<bucket>] \ + --group-id=<group-id> \ + --pipe-id=<pipe-id> \ + --source-zones=<source_zones> \ + [--source-bucket=<source_buckets>] \ + [--source-bucket-id=<source_bucket_id>] \ + --dest-zones=<dest_zones> \ + [--dest-bucket=<dest_buckets>] \ + [--dest-bucket-id=<dest_bucket_id>] \ + [--prefix=<source_prefix>] \ + [--prefix-rm] \ + [--tags-add=<tags>] \ + [--tags-rm=<tags>] \ + [--dest-owner=<owner>] \ + [--storage-class=<storage_class>] \ + [--mode=<system | user>] \ + [--uid=<user_id>] + + +Zones are either a list of zones, or '*' (wildcard). Wildcard zones mean any zone that matches the sync flow rules. +Buckets are either a bucket name, or '*' (wildcard). Wildcard bucket means the current bucket +Prefix can be defined to filter source objects. +Tags are passed by a comma separated list of 'key=value'. +Destination owner can be set to force a destination owner of the objects. If user mode is selected, only the destination bucket owner can be set. +Destinatino storage class can also be condfigured. +User id can be set for user mode, and will be the user under which the sync operation will be executed (for permissions validation). + + +Remove Sync Pipe +~~~~~~~~~~~~~~~~ + +To remove specific sync group pipe params, or the entire pipe: + + +:: + + # radosgw-admin sync group pipe remove [--bucket=<bucket>] \ + --group-id=<group-id> \ + --pipe-id=<pipe-id> \ + [--source-zones=<source_zones>] \ + [--source-bucket=<source_buckets>] \ + [--source-bucket-id=<source_bucket_id>] \ + [--dest-zones=<dest_zones>] \ + [--dest-bucket=<dest_buckets>] \ + [--dest-bucket-id=<dest_bucket_id>] + + +Sync Info +~~~~~~~~~ + +To get information about the expected sync sources and targets (as defined by the sync policy): + +:: + + # radosgw-admin sync info [--bucket=<bucket>] \ + [--effective-zone-name=<zone>] + + +Since a bucket can define a policy that defines data movement from it towards a different bucket at a different zone, when the policy is created we also generate a list of bucket dependencies that are used as hints when a sync of any particular bucket happens. The fact that a bucket references another bucket does not mean it actually syncs to/from it, as the data flow might not permit it. + + +Examples +======== + +The system in these examples includes 3 zones: ``us-east`` (the master zone), ``us-west``, ``us-west-2``. + +Example 1: Two Zones, Complete Mirror +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +This is similar to older (pre ``Octopus``) sync capabilities, but being done via the new sync policy engine. Note that changes to the zonegroup sync policy require a period update and commit. + + +:: + + [us-east] $ radosgw-admin sync group create --group-id=group1 --status=allowed + [us-east] $ radosgw-admin sync group flow create --group-id=group1 \ + --flow-id=flow-mirror --flow-type=symmetrical \ + --zones=us-east,us-west + [us-east] $ radosgw-admin sync group pipe create --group-id=group1 \ + --pipe-id=pipe1 --source-zones='*' \ + --source-bucket='*' --dest-zones='*' \ + --dest-bucket='*' + [us-east] $ radosgw-admin sync group modify --group-id=group1 --status=enabled + [us-east] $ radosgw-admin period update --commit + + $ radosgw-admin sync info --bucket=buck + { + "sources": [ + { + "id": "pipe1", + "source": { + "zone": "us-west", + "bucket": "buck:115b12b3-....4409.1" + }, + "dest": { + "zone": "us-east", + "bucket": "buck:115b12b3-....4409.1" + }, + "params": { + ... + } + } + ], + "dests": [ + { + "id": "pipe1", + "source": { + "zone": "us-east", + "bucket": "buck:115b12b3-....4409.1" + }, + "dest": { + "zone": "us-west", + "bucket": "buck:115b12b3-....4409.1" + }, + ... + } + ], + ... + } + } + + +Note that the "id" field in the output above reflects the pipe rule +that generated that entry, a single rule can generate multiple sync +entries as can be seen in the example. + +:: + + [us-west] $ radosgw-admin sync info --bucket=buck + { + "sources": [ + { + "id": "pipe1", + "source": { + "zone": "us-east", + "bucket": "buck:115b12b3-....4409.1" + }, + "dest": { + "zone": "us-west", + "bucket": "buck:115b12b3-....4409.1" + }, + ... + } + ], + "dests": [ + { + "id": "pipe1", + "source": { + "zone": "us-west", + "bucket": "buck:115b12b3-....4409.1" + }, + "dest": { + "zone": "us-east", + "bucket": "buck:115b12b3-....4409.1" + }, + ... + } + ], + ... + } + + + +Example 2: Directional, Entire Zone Backup +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Also similar to older sync capabilities. In here we add a third zone, ``us-west-2`` that will be a replica of ``us-west``, but data will not be replicated back from it. + +:: + + [us-east] $ radosgw-admin sync group flow create --group-id=group1 \ + --flow-id=us-west-backup --flow-type=directional \ + --source-zone=us-west --dest-zone=us-west-2 + [us-east] $ radosgw-admin period update --commit + + +Note that us-west has two dests: + +:: + + [us-west] $ radosgw-admin sync info --bucket=buck + { + "sources": [ + { + "id": "pipe1", + "source": { + "zone": "us-east", + "bucket": "buck:115b12b3-....4409.1" + }, + "dest": { + "zone": "us-west", + "bucket": "buck:115b12b3-....4409.1" + }, + ... + } + ], + "dests": [ + { + "id": "pipe1", + "source": { + "zone": "us-west", + "bucket": "buck:115b12b3-....4409.1" + }, + "dest": { + "zone": "us-east", + "bucket": "buck:115b12b3-....4409.1" + }, + ... + }, + { + "id": "pipe1", + "source": { + "zone": "us-west", + "bucket": "buck:115b12b3-....4409.1" + }, + "dest": { + "zone": "us-west-2", + "bucket": "buck:115b12b3-....4409.1" + }, + ... + } + ], + ... + } + + +Whereas us-west-2 has only source and no destinations: + +:: + + [us-west-2] $ radosgw-admin sync info --bucket=buck + { + "sources": [ + { + "id": "pipe1", + "source": { + "zone": "us-west", + "bucket": "buck:115b12b3-....4409.1" + }, + "dest": { + "zone": "us-west-2", + "bucket": "buck:115b12b3-....4409.1" + }, + ... + } + ], + "dests": [], + ... + } + + + +Example 3: Mirror a Specific Bucket +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Using the same group configuration, but this time switching it to ``allowed`` state, which means that sync is allowed but not enabled. + +:: + + [us-east] $ radosgw-admin sync group modify --group-id=group1 --status=allowed + [us-east] $ radosgw-admin period update --commit + + +And we will create a bucket level policy rule for existing bucket ``buck2``. Note that the bucket needs to exist before being able to set this policy, and admin commands that modify bucket policies need to run on the master zone, however, they do not require period update. There is no need to change the data flow, as it is inherited from the zonegroup policy. A bucket policy flow will only be a subset of the flow defined in the zonegroup policy. Same goes for pipes, although a bucket policy can enable pipes that are not enabled (albeit not forbidden) at the zonegroup policy. + +:: + + [us-east] $ radosgw-admin sync group create --bucket=buck2 \ + --group-id=buck2-default --status=enabled + + [us-east] $ radosgw-admin sync group pipe create --bucket=buck2 \ + --group-id=buck2-default --pipe-id=pipe1 \ + --source-zones='*' --dest-zones='*' + + + +Example 4: Limit Bucket Sync To Specific Zones +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +This will only sync ``buck3`` to ``us-east`` (from any zone that flow allows to sync into ``us-east``). + +:: + + [us-east] $ radosgw-admin sync group create --bucket=buck3 \ + --group-id=buck3-default --status=enabled + + [us-east] $ radosgw-admin sync group pipe create --bucket=buck3 \ + --group-id=buck3-default --pipe-id=pipe1 \ + --source-zones='*' --dest-zones=us-east + + + +Example 5: Sync From a Different Bucket +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Note that bucket sync only works (currently) across zones and not within the same zone. + +Set ``buck4`` to pull data from ``buck5``: + +:: + + [us-east] $ radosgw-admin sync group create --bucket=buck4 ' + --group-id=buck4-default --status=enabled + + [us-east] $ radosgw-admin sync group pipe create --bucket=buck4 \ + --group-id=buck4-default --pipe-id=pipe1 \ + --source-zones='*' --source-bucket=buck5 \ + --dest-zones='*' + + +can also limit it to specific zones, for example the following will +only sync data originated in us-west: + +:: + + [us-east] $ radosgw-admin sync group pipe modify --bucket=buck4 \ + --group-id=buck4-default --pipe-id=pipe1 \ + --source-zones=us-west --source-bucket=buck5 \ + --dest-zones='*' + + +Checking the sync info for ``buck5`` on ``us-west`` is interesting: + +:: + + [us-west] $ radosgw-admin sync info --bucket=buck5 + { + "sources": [], + "dests": [], + "hints": { + "sources": [], + "dests": [ + "buck4:115b12b3-....14433.2" + ] + }, + "resolved-hints-1": { + "sources": [], + "dests": [ + { + "id": "pipe1", + "source": { + "zone": "us-west", + "bucket": "buck5" + }, + "dest": { + "zone": "us-east", + "bucket": "buck4:115b12b3-....14433.2" + }, + ... + }, + { + "id": "pipe1", + "source": { + "zone": "us-west", + "bucket": "buck5" + }, + "dest": { + "zone": "us-west-2", + "bucket": "buck4:115b12b3-....14433.2" + }, + ... + } + ] + }, + "resolved-hints": { + "sources": [], + "dests": [] + } + } + + +Note that there are resolved hints, which means that the bucket ``buck5`` found about ``buck4`` syncing from it indirectly, and not from its own policy (the policy for ``buck5`` itself is empty). + + +Example 6: Sync To Different Bucket +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The same mechanism can work for configuring data to be synced to (vs. synced from as in the previous example). Note that internally data is still pulled from the source at the destination zone: + +Set ``buck6`` to "push" data to ``buck5``: + +:: + + [us-east] $ radosgw-admin sync group create --bucket=buck6 \ + --group-id=buck6-default --status=enabled + + [us-east] $ radosgw-admin sync group pipe create --bucket=buck6 \ + --group-id=buck6-default --pipe-id=pipe1 \ + --source-zones='*' --source-bucket='*' \ + --dest-zones='*' --dest-bucket=buck5 + + +A wildcard bucket name means the current bucket in the context of bucket sync policy. + +Combined with the configuration in Example 5, we can now write data to ``buck6`` on ``us-east``, data will sync to ``buck5`` on ``us-west``, and from there it will be distributed to ``buck4`` on ``us-east``, and on ``us-west-2``. + +Example 7: Source Filters +~~~~~~~~~~~~~~~~~~~~~~~~~ + +Sync from ``buck8`` to ``buck9``, but only objects that start with ``foo/``: + +:: + + [us-east] $ radosgw-admin sync group create --bucket=buck8 \ + --group-id=buck8-default --status=enabled + + [us-east] $ radosgw-admin sync group pipe create --bucket=buck8 \ + --group-id=buck8-default --pipe-id=pipe-prefix \ + --prefix=foo/ --source-zones='*' --dest-zones='*' \ + --dest-bucket=buck9 + + +Also sync from ``buck8`` to ``buck9`` any object that has the tags ``color=blue`` or ``color=red``: + +:: + + [us-east] $ radosgw-admin sync group pipe create --bucket=buck8 \ + --group-id=buck8-default --pipe-id=pipe-tags \ + --tags-add=color=blue,color=red --source-zones='*' \ + --dest-zones='*' --dest-bucket=buck9 + + +And we can check the expected sync in ``us-east`` (for example): + +:: + + [us-east] $ radosgw-admin sync info --bucket=buck8 + { + "sources": [], + "dests": [ + { + "id": "pipe-prefix", + "source": { + "zone": "us-east", + "bucket": "buck8:115b12b3-....14433.5" + }, + "dest": { + "zone": "us-west", + "bucket": "buck9" + }, + "params": { + "source": { + "filter": { + "prefix": "foo/", + "tags": [] + } + }, + ... + } + }, + { + "id": "pipe-tags", + "source": { + "zone": "us-east", + "bucket": "buck8:115b12b3-....14433.5" + }, + "dest": { + "zone": "us-west", + "bucket": "buck9" + }, + "params": { + "source": { + "filter": { + "tags": [ + { + "key": "color", + "value": "blue" + }, + { + "key": "color", + "value": "red" + } + ] + } + }, + ... + } + } + ], + ... + } + + +Note that there aren't any sources, only two different destinations (one for each configuration). When the sync process happens it will select the relevant rule for each object it syncs. + +Prefixes and tags can be combined, in which object will need to have both in order to be synced. The priority param can also be passed, and it can be used to determine when there are multiple different rules that are matched (and have the same source and destination), to determine which of the rules to be used. + + +Example 8: Destination Params: Storage Class +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Storage class of the destination objects can be configured: + +:: + + [us-east] $ radosgw-admin sync group create --bucket=buck10 \ + --group-id=buck10-default --status=enabled + + [us-east] $ radosgw-admin sync group pipe create --bucket=buck10 \ + --group-id=buck10-default \ + --pipe-id=pipe-storage-class \ + --source-zones='*' --dest-zones=us-west-2 \ + --storage-class=CHEAP_AND_SLOW + + +Example 9: Destination Params: Destination Owner Translation +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Set the destination objects owner as the destination bucket owner. +This requires specifying the uid of the destination bucket: + +:: + + [us-east] $ radosgw-admin sync group create --bucket=buck11 \ + --group-id=buck11-default --status=enabled + + [us-east] $ radosgw-admin sync group pipe create --bucket=buck11 \ + --group-id=buck11-default --pipe-id=pipe-dest-owner \ + --source-zones='*' --dest-zones='*' \ + --dest-bucket=buck12 --dest-owner=joe + +Example 10: Destination Params: User Mode +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +User mode makes sure that the user has permissions to both read the objects, and write to the destination bucket. This requires that the uid of the user (which in its context the operation executes) is specified. + +:: + + [us-east] $ radosgw-admin sync group pipe modify --bucket=buck11 \ + --group-id=buck11-default --pipe-id=pipe-dest-owner \ + --mode=user --uid=jenny + + + diff --git a/doc/radosgw/multisite.rst b/doc/radosgw/multisite.rst new file mode 100644 index 000000000..a4a89d268 --- /dev/null +++ b/doc/radosgw/multisite.rst @@ -0,0 +1,1537 @@ +.. _multisite: + +========== +Multi-Site +========== + +.. versionadded:: Jewel + +A single zone configuration typically consists of one zone group containing one +zone and one or more `ceph-radosgw` instances where you may load-balance gateway +client requests between the instances. In a single zone configuration, typically +multiple gateway instances point to a single Ceph storage cluster. However, Kraken +supports several multi-site configuration options for the Ceph Object Gateway: + +- **Multi-zone:** A more advanced configuration consists of one zone group and + multiple zones, each zone with one or more `ceph-radosgw` instances. Each zone + is backed by its own Ceph Storage Cluster. Multiple zones in a zone group + provides disaster recovery for the zone group should one of the zones experience + a significant failure. In Kraken, each zone is active and may receive write + operations. In addition to disaster recovery, multiple active zones may also + serve as a foundation for content delivery networks. + +- **Multi-zone-group:** Formerly called 'regions', Ceph Object Gateway can also + support multiple zone groups, each zone group with one or more zones. Objects + stored to zones in one zone group within the same realm as another zone + group will share a global object namespace, ensuring unique object IDs across + zone groups and zones. + +- **Multiple Realms:** In Kraken, the Ceph Object Gateway supports the notion + of realms, which can be a single zone group or multiple zone groups and + a globally unique namespace for the realm. Multiple realms provide the ability + to support numerous configurations and namespaces. + +Replicating object data between zones within a zone group looks something +like this: + +.. image:: ../images/zone-sync2.png + :align: center + +For additional details on setting up a cluster, see `Ceph Object Gateway for +Production <https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html/ceph_object_gateway_for_production/index/>`__. + +Functional Changes from Infernalis +================================== + +In Kraken, you can configure each Ceph Object Gateway to +work in an active-active zone configuration, allowing for writes to +non-master zones. + +The multi-site configuration is stored within a container called a +"realm." The realm stores zone groups, zones, and a time "period" with +multiple epochs for tracking changes to the configuration. In Kraken, +the ``ceph-radosgw`` daemons handle the synchronization, +eliminating the need for a separate synchronization agent. Additionally, +the new approach to synchronization allows the Ceph Object Gateway to +operate with an "active-active" configuration instead of +"active-passive". + +Requirements and Assumptions +============================ + +A multi-site configuration requires at least two Ceph storage clusters, +preferably given a distinct cluster name. At least two Ceph object +gateway instances, one for each Ceph storage cluster. + +This guide assumes at least two Ceph storage clusters are in geographically +separate locations; however, the configuration can work on the same +site. This guide also assumes two Ceph object gateway servers named +``rgw1`` and ``rgw2``. + +.. important:: Running a single Ceph storage cluster is NOT recommended unless you have + low latency WAN connections. + +A multi-site configuration requires a master zone group and a master +zone. Additionally, each zone group requires a master zone. Zone groups +may have one or more secondary or non-master zones. + +In this guide, the ``rgw1`` host will serve as the master zone of the +master zone group; and, the ``rgw2`` host will serve as the secondary zone +of the master zone group. + +See `Pools`_ for instructions on creating and tuning pools for Ceph +Object Storage. + +See `Sync Policy Config`_ for instructions on defining fine grained bucket sync +policy rules. + +.. _master-zone-label: + +Configuring a Master Zone +========================= + +All gateways in a multi-site configuration will retrieve their +configuration from a ``ceph-radosgw`` daemon on a host within the master +zone group and master zone. To configure your gateways in a multi-site +configuration, choose a ``ceph-radosgw`` instance to configure the +master zone group and master zone. + +Create a Realm +-------------- + +A realm contains the multi-site configuration of zone groups and zones +and also serves to enforce a globally unique namespace within the realm. + +Create a new realm for the multi-site configuration by opening a command +line interface on a host identified to serve in the master zone group +and zone. Then, execute the following: + +.. prompt:: bash # + + radosgw-admin realm create --rgw-realm={realm-name} [--default] + +For example: + +.. prompt:: bash # + + radosgw-admin realm create --rgw-realm=movies --default + +If the cluster will have a single realm, specify the ``--default`` flag. +If ``--default`` is specified, ``radosgw-admin`` will use this realm by +default. If ``--default`` is not specified, adding zone-groups and zones +requires specifying either the ``--rgw-realm`` flag or the +``--realm-id`` flag to identify the realm when adding zone groups and +zones. + +After creating the realm, ``radosgw-admin`` will echo back the realm +configuration. For example: + +:: + + { + "id": "0956b174-fe14-4f97-8b50-bb7ec5e1cf62", + "name": "movies", + "current_period": "1950b710-3e63-4c41-a19e-46a715000980", + "epoch": 1 + } + +.. note:: Ceph generates a unique ID for the realm, which allows the renaming + of a realm if the need arises. + +Create a Master Zone Group +-------------------------- + +A realm must have at least one zone group, which will serve as the +master zone group for the realm. + +Create a new master zone group for the multi-site configuration by +opening a command line interface on a host identified to serve in the +master zone group and zone. Then, execute the following: + +.. prompt:: bash # + + radosgw-admin zonegroup create --rgw-zonegroup={name} --endpoints={url} [--rgw-realm={realm-name}|--realm-id={realm-id}] --master --default + +For example: + +.. prompt:: bash # + + radosgw-admin zonegroup create --rgw-zonegroup=us --endpoints=http://rgw1:80 --rgw-realm=movies --master --default + +If the realm will only have a single zone group, specify the +``--default`` flag. If ``--default`` is specified, ``radosgw-admin`` +will use this zone group by default when adding new zones. If +``--default`` is not specified, adding zones will require either the +``--rgw-zonegroup`` flag or the ``--zonegroup-id`` flag to identify the +zone group when adding or modifying zones. + +After creating the master zone group, ``radosgw-admin`` will echo back +the zone group configuration. For example: + +:: + + { + "id": "f1a233f5-c354-4107-b36c-df66126475a6", + "name": "us", + "api_name": "us", + "is_master": "true", + "endpoints": [ + "http:\/\/rgw1:80" + ], + "hostnames": [], + "hostnames_s3webzone": [], + "master_zone": "", + "zones": [], + "placement_targets": [], + "default_placement": "", + "realm_id": "0956b174-fe14-4f97-8b50-bb7ec5e1cf62" + } + +Create a Master Zone +-------------------- + +.. important:: Zones must be created on a Ceph Object Gateway node that will be + within the zone. + +Create a new master zone for the multi-site configuration by opening a +command line interface on a host identified to serve in the master zone +group and zone. Then, execute the following: + +.. prompt:: bash # + + radosgw-admin zone create --rgw-zonegroup={zone-group-name} \ + --rgw-zone={zone-name} \ + --master --default \ + --endpoints={http://fqdn}[,{http://fqdn}] + +For example: + +.. prompt:: bash # + + radosgw-admin zone create --rgw-zonegroup=us --rgw-zone=us-east \ + --master --default \ + --endpoints={http://fqdn}[,{http://fqdn}] + + +.. note:: The ``--access-key`` and ``--secret`` aren’t specified. These + settings will be added to the zone once the user is created in the + next section. + +.. important:: The following steps assume a multi-site configuration using newly + installed systems that aren’t storing data yet. DO NOT DELETE the + ``default`` zone and its pools if you are already using it to store + data, or the data will be deleted and unrecoverable. + +Delete Default Zone Group and Zone +---------------------------------- + +Delete the ``default`` zone if it exists. Make sure to remove it from +the default zone group first. + +.. prompt:: bash # + + radosgw-admin zonegroup delete --rgw-zonegroup=default --rgw-zone=default + radosgw-admin period update --commit + radosgw-admin zone delete --rgw-zone=default + radosgw-admin period update --commit + radosgw-admin zonegroup delete --rgw-zonegroup=default + radosgw-admin period update --commit + +Finally, delete the ``default`` pools in your Ceph storage cluster if +they exist. + +.. important:: The following step assumes a multi-site configuration using newly + installed systems that aren’t currently storing data. DO NOT DELETE + the ``default`` zone group if you are already using it to store + data. + +.. prompt:: bash # + + ceph osd pool rm default.rgw.control default.rgw.control --yes-i-really-really-mean-it + ceph osd pool rm default.rgw.data.root default.rgw.data.root --yes-i-really-really-mean-it + ceph osd pool rm default.rgw.gc default.rgw.gc --yes-i-really-really-mean-it + ceph osd pool rm default.rgw.log default.rgw.log --yes-i-really-really-mean-it + ceph osd pool rm default.rgw.users.uid default.rgw.users.uid --yes-i-really-really-mean-it + +Create a System User +-------------------- + +The ``ceph-radosgw`` daemons must authenticate before pulling realm and +period information. In the master zone, create a system user to +facilitate authentication between daemons. + + +.. prompt:: bash # + + radosgw-admin user create --uid="{user-name}" --display-name="{Display Name}" --system + +For example: + +.. prompt:: bash # + + radosgw-admin user create --uid="synchronization-user" --display-name="Synchronization User" --system + +Make a note of the ``access_key`` and ``secret_key``, as the secondary +zones will require them to authenticate with the master zone. + +Finally, add the system user to the master zone. + +.. prompt:: bash # + + radosgw-admin zone modify --rgw-zone={zone-name} --access-key={access-key} --secret={secret} + radosgw-admin period update --commit + +Update the Period +----------------- + +After updating the master zone configuration, update the period. + +.. prompt:: bash # + + radosgw-admin period update --commit + +.. note:: Updating the period changes the epoch, and ensures that other zones + will receive the updated configuration. + +Update the Ceph Configuration File +---------------------------------- + +Update the Ceph configuration file on master zone hosts by adding the +``rgw_zone`` configuration option and the name of the master zone to the +instance entry. + +:: + + [client.rgw.{instance-name}] + ... + rgw_zone={zone-name} + +For example: + +:: + + [client.rgw.rgw1] + host = rgw1 + rgw frontends = "civetweb port=80" + rgw_zone=us-east + +Start the Gateway +----------------- + +On the object gateway host, start and enable the Ceph Object Gateway +service: + +.. prompt:: bash # + + systemctl start ceph-radosgw@rgw.`hostname -s` + systemctl enable ceph-radosgw@rgw.`hostname -s` + +.. _secondary-zone-label: + +Configure Secondary Zones +========================= + +Zones within a zone group replicate all data to ensure that each zone +has the same data. When creating the secondary zone, execute all of the +following operations on a host identified to serve the secondary zone. + +.. note:: To add a third zone, follow the same procedures as for adding the + secondary zone. Use different zone name. + +.. important:: You must execute metadata operations, such as user creation, on a + host within the master zone. The master zone and the secondary zone + can receive bucket operations, but the secondary zone redirects + bucket operations to the master zone. If the master zone is down, + bucket operations will fail. + +Pull the Realm +-------------- + +Using the URL path, access key and secret of the master zone in the +master zone group, pull the realm configuration to the host. To pull a +non-default realm, specify the realm using the ``--rgw-realm`` or +``--realm-id`` configuration options. + +.. prompt:: bash # + + radosgw-admin realm pull --url={url-to-master-zone-gateway} --access-key={access-key} --secret={secret} + +.. note:: Pulling the realm also retrieves the remote's current period + configuration, and makes it the current period on this host as well. + +If this realm is the default realm or the only realm, make the realm the +default realm. + +.. prompt:: bash # + + radosgw-admin realm default --rgw-realm={realm-name} + +Create a Secondary Zone +----------------------- + +.. important:: Zones must be created on a Ceph Object Gateway node that will be + within the zone. + +Create a secondary zone for the multi-site configuration by opening a +command line interface on a host identified to serve the secondary zone. +Specify the zone group ID, the new zone name and an endpoint for the +zone. **DO NOT** use the ``--master`` or ``--default`` flags. In Kraken, +all zones run in an active-active configuration by +default; that is, a gateway client may write data to any zone and the +zone will replicate the data to all other zones within the zone group. +If the secondary zone should not accept write operations, specify the +``--read-only`` flag to create an active-passive configuration between +the master zone and the secondary zone. Additionally, provide the +``access_key`` and ``secret_key`` of the generated system user stored in +the master zone of the master zone group. Execute the following: + +.. prompt:: bash # + + radosgw-admin zone create --rgw-zonegroup={zone-group-name} \ + --rgw-zone={zone-name} \ + --access-key={system-key} --secret={secret} \ + --endpoints=http://{fqdn}:80 \ + [--read-only] + +For example: + +.. prompt:: bash # + + radosgw-admin zone create --rgw-zonegroup=us --rgw-zone=us-west \ + --access-key={system-key} --secret={secret} \ + --endpoints=http://rgw2:80 + +.. important:: The following steps assume a multi-site configuration using newly + installed systems that aren’t storing data. **DO NOT DELETE** the + ``default`` zone and its pools if you are already using it to store + data, or the data will be lost and unrecoverable. + +Delete the default zone if needed. + +.. prompt:: bash # + + radosgw-admin zone delete --rgw-zone=default + +Finally, delete the default pools in your Ceph storage cluster if +needed. + +.. prompt:: bash # + + ceph osd pool rm default.rgw.control default.rgw.control --yes-i-really-really-mean-it + ceph osd pool rm default.rgw.data.root default.rgw.data.root --yes-i-really-really-mean-it + ceph osd pool rm default.rgw.gc default.rgw.gc --yes-i-really-really-mean-it + ceph osd pool rm default.rgw.log default.rgw.log --yes-i-really-really-mean-it + ceph osd pool rm default.rgw.users.uid default.rgw.users.uid --yes-i-really-really-mean-it + +Update the Ceph Configuration File +---------------------------------- + +Update the Ceph configuration file on the secondary zone hosts by adding +the ``rgw_zone`` configuration option and the name of the secondary zone +to the instance entry. + +:: + + [client.rgw.{instance-name}] + ... + rgw_zone={zone-name} + +For example: + +:: + + [client.rgw.rgw2] + host = rgw2 + rgw frontends = "civetweb port=80" + rgw_zone=us-west + +Update the Period +----------------- + +After updating the master zone configuration, update the period. + +.. prompt:: bash # + + radosgw-admin period update --commit + +.. note:: Updating the period changes the epoch, and ensures that other zones + will receive the updated configuration. + +Start the Gateway +----------------- + +On the object gateway host, start and enable the Ceph Object Gateway +service: + +.. prompt:: bash # + + systemctl start ceph-radosgw@rgw.`hostname -s` + systemctl enable ceph-radosgw@rgw.`hostname -s` + +Check Synchronization Status +---------------------------- + +Once the secondary zone is up and running, check the synchronization +status. Synchronization copies users and buckets created in the master +zone to the secondary zone. + +.. prompt:: bash # + + radosgw-admin sync status + +The output will provide the status of synchronization operations. For +example: + +:: + + realm f3239bc5-e1a8-4206-a81d-e1576480804d (earth) + zonegroup c50dbb7e-d9ce-47cc-a8bb-97d9b399d388 (us) + zone 4c453b70-4a16-4ce8-8185-1893b05d346e (us-west) + metadata sync syncing + full sync: 0/64 shards + metadata is caught up with master + incremental sync: 64/64 shards + data sync source: 1ee9da3e-114d-4ae3-a8a4-056e8a17f532 (us-east) + syncing + full sync: 0/128 shards + incremental sync: 128/128 shards + data is caught up with source + +.. note:: Secondary zones accept bucket operations; however, secondary zones + redirect bucket operations to the master zone and then synchronize + with the master zone to receive the result of the bucket operations. + If the master zone is down, bucket operations executed on the + secondary zone will fail, but object operations should succeed. + + +Maintenance +=========== + +Checking the Sync Status +------------------------ + +Information about the replication status of a zone can be queried with: + +.. prompt:: bash $ + + radosgw-admin sync status + +:: + + realm b3bc1c37-9c44-4b89-a03b-04c269bea5da (earth) + zonegroup f54f9b22-b4b6-4a0e-9211-fa6ac1693f49 (us) + zone adce11c9-b8ed-4a90-8bc5-3fc029ff0816 (us-2) + metadata sync syncing + full sync: 0/64 shards + incremental sync: 64/64 shards + metadata is behind on 1 shards + oldest incremental change not applied: 2017-03-22 10:20:00.0.881361s + data sync source: 341c2d81-4574-4d08-ab0f-5a2a7b168028 (us-1) + syncing + full sync: 0/128 shards + incremental sync: 128/128 shards + data is caught up with source + source: 3b5d1a3f-3f27-4e4a-8f34-6072d4bb1275 (us-3) + syncing + full sync: 0/128 shards + incremental sync: 128/128 shards + data is caught up with source + +Changing the Metadata Master Zone +--------------------------------- + +.. important:: Care must be taken when changing which zone is the metadata + master. If a zone has not finished syncing metadata from the current + master zone, it will be unable to serve any remaining entries when + promoted to master and those changes will be lost. For this reason, + waiting for a zone's ``radosgw-admin sync status`` to catch up on + metadata sync before promoting it to master is recommended. + +Similarly, if changes to metadata are being processed by the current master +zone while another zone is being promoted to master, those changes are +likely to be lost. To avoid this, shutting down any ``radosgw`` instances +on the previous master zone is recommended. After promoting another zone, +its new period can be fetched with ``radosgw-admin period pull`` and the +gateway(s) can be restarted. + +To promote a zone (for example, zone ``us-2`` in zonegroup ``us``) to metadata +master, run the following commands on that zone: + +.. prompt:: bash $ + + radosgw-admin zone modify --rgw-zone=us-2 --master + radosgw-admin zonegroup modify --rgw-zonegroup=us --master + radosgw-admin period update --commit + +This will generate a new period, and the radosgw instance(s) in zone ``us-2`` +will send this period to other zones. + +Failover and Disaster Recovery +============================== + +If the master zone should fail, failover to the secondary zone for +disaster recovery. + +1. Make the secondary zone the master and default zone. For example: + + .. prompt:: bash # + + radosgw-admin zone modify --rgw-zone={zone-name} --master --default + + By default, Ceph Object Gateway will run in an active-active + configuration. If the cluster was configured to run in an + active-passive configuration, the secondary zone is a read-only zone. + Remove the ``--read-only`` status to allow the zone to receive write + operations. For example: + + .. prompt:: bash # + + radosgw-admin zone modify --rgw-zone={zone-name} --master --default \ + --read-only=false + +2. Update the period to make the changes take effect. + + .. prompt:: bash # + + radosgw-admin period update --commit + +3. Finally, restart the Ceph Object Gateway. + + .. prompt:: bash # + + systemctl restart ceph-radosgw@rgw.`hostname -s` + +If the former master zone recovers, revert the operation. + +1. From the recovered zone, pull the latest realm configuration + from the current master zone: + + .. prompt:: bash # + + radosgw-admin realm pull --url={url-to-master-zone-gateway} \ + --access-key={access-key} --secret={secret} + +2. Make the recovered zone the master and default zone. + + .. prompt:: bash # + + radosgw-admin zone modify --rgw-zone={zone-name} --master --default + +3. Update the period to make the changes take effect. + + .. prompt:: bash # + + radosgw-admin period update --commit + +4. Then, restart the Ceph Object Gateway in the recovered zone. + + .. prompt:: bash # + + systemctl restart ceph-radosgw@rgw.`hostname -s` + +5. If the secondary zone needs to be a read-only configuration, update + the secondary zone. + + .. prompt:: bash # + + radosgw-admin zone modify --rgw-zone={zone-name} --read-only + +6. Update the period to make the changes take effect. + + .. prompt:: bash # + + radosgw-admin period update --commit + +7. Finally, restart the Ceph Object Gateway in the secondary zone. + + .. prompt:: bash # + + systemctl restart ceph-radosgw@rgw.`hostname -s` + +.. _rgw-multisite-migrate-from-single-site: + +Migrating a Single Site System to Multi-Site +============================================ + +To migrate from a single site system with a ``default`` zone group and +zone to a multi site system, use the following steps: + +1. Create a realm. Replace ``<name>`` with the realm name. + + .. prompt:: bash # + + radosgw-admin realm create --rgw-realm=<name> --default + +2. Rename the default zone and zonegroup. Replace ``<name>`` with the + zonegroup or zone name. + + .. prompt:: bash # + + radosgw-admin zonegroup rename --rgw-zonegroup default --zonegroup-new-name=<name> + radosgw-admin zone rename --rgw-zone default --zone-new-name us-east-1 --rgw-zonegroup=<name> + +3. Configure the master zonegroup. Replace ``<name>`` with the realm or + zonegroup name. Replace ``<fqdn>`` with the fully qualified domain + name(s) in the zonegroup. + + .. prompt:: bash # + + radosgw-admin zonegroup modify --rgw-realm=<name> --rgw-zonegroup=<name> --endpoints http://<fqdn>:80 --master --default + +4. Configure the master zone. Replace ``<name>`` with the realm, + zonegroup or zone name. Replace ``<fqdn>`` with the fully qualified + domain name(s) in the zonegroup. + + .. prompt:: bash # + + radosgw-admin zone modify --rgw-realm=<name> --rgw-zonegroup=<name> \ + --rgw-zone=<name> --endpoints http://<fqdn>:80 \ + --access-key=<access-key> --secret=<secret-key> \ + --master --default + +5. Create a system user. Replace ``<user-id>`` with the username. + Replace ``<display-name>`` with a display name. It may contain + spaces. + + .. prompt:: bash # + + radosgw-admin user create --uid=<user-id> \ + --display-name="<display-name>" \ + --access-key=<access-key> \ + --secret=<secret-key> --system + +6. Commit the updated configuration. + + .. prompt:: bash # + + radosgw-admin period update --commit + +7. Finally, restart the Ceph Object Gateway. + + .. prompt:: bash # + + systemctl restart ceph-radosgw@rgw.`hostname -s` + +After completing this procedure, proceed to `Configure a Secondary +Zone <#configure-secondary-zones>`__ to create a secondary zone +in the master zone group. + + +Multi-Site Configuration Reference +================================== + +The following sections provide additional details and command-line +usage for realms, periods, zone groups and zones. + +Realms +------ + +A realm represents a globally unique namespace consisting of one or more +zonegroups containing one or more zones, and zones containing buckets, +which in turn contain objects. A realm enables the Ceph Object Gateway +to support multiple namespaces and their configuration on the same +hardware. + +A realm contains the notion of periods. Each period represents the state +of the zone group and zone configuration in time. Each time you make a +change to a zonegroup or zone, update the period and commit it. + +By default, the Ceph Object Gateway does not create a realm +for backward compatibility with Infernalis and earlier releases. +However, as a best practice, we recommend creating realms for new +clusters. + +Create a Realm +~~~~~~~~~~~~~~ + +To create a realm, execute ``realm create`` and specify the realm name. +If the realm is the default, specify ``--default``. + +.. prompt:: bash # + + radosgw-admin realm create --rgw-realm={realm-name} [--default] + +For example: + +.. prompt:: bash # + + radosgw-admin realm create --rgw-realm=movies --default + +By specifying ``--default``, the realm will be called implicitly with +each ``radosgw-admin`` call unless ``--rgw-realm`` and the realm name +are explicitly provided. + +Make a Realm the Default +~~~~~~~~~~~~~~~~~~~~~~~~ + +One realm in the list of realms should be the default realm. There may +be only one default realm. If there is only one realm and it wasn’t +specified as the default realm when it was created, make it the default +realm. Alternatively, to change which realm is the default, execute: + +.. prompt:: bash # + + radosgw-admin realm default --rgw-realm=movies + +.. note:: When the realm is default, the command line assumes + ``--rgw-realm=<realm-name>`` as an argument. + +Delete a Realm +~~~~~~~~~~~~~~ + +To delete a realm, execute ``realm delete`` and specify the realm name. + +.. prompt:: bash # + + radosgw-admin realm rm --rgw-realm={realm-name} + +For example: + +.. prompt:: bash # + + radosgw-admin realm rm --rgw-realm=movies + +Get a Realm +~~~~~~~~~~~ + +To get a realm, execute ``realm get`` and specify the realm name. + +.. prompt:: bash # + + radosgw-admin realm get --rgw-realm=<name> + +For example: + +.. prompt:: bash # + + radosgw-admin realm get --rgw-realm=movies [> filename.json] + +The CLI will echo a JSON object with the realm properties. + +:: + + { + "id": "0a68d52e-a19c-4e8e-b012-a8f831cb3ebc", + "name": "movies", + "current_period": "b0c5bbef-4337-4edd-8184-5aeab2ec413b", + "epoch": 1 + } + +Use ``>`` and an output file name to output the JSON object to a file. + +Set a Realm +~~~~~~~~~~~ + +To set a realm, execute ``realm set``, specify the realm name, and +``--infile=`` with an input file name. + +.. prompt:: bash # + + radosgw-admin realm set --rgw-realm=<name> --infile=<infilename> + +For example: + +.. prompt:: bash # + + radosgw-admin realm set --rgw-realm=movies --infile=filename.json + +List Realms +~~~~~~~~~~~ + +To list realms, execute ``realm list``. + +.. prompt:: bash # + + radosgw-admin realm list + +List Realm Periods +~~~~~~~~~~~~~~~~~~ + +To list realm periods, execute ``realm list-periods``. + +.. prompt:: bash # + + radosgw-admin realm list-periods + +Pull a Realm +~~~~~~~~~~~~ + +To pull a realm from the node containing the master zone group and +master zone to a node containing a secondary zone group or zone, execute +``realm pull`` on the node that will receive the realm configuration. + +.. prompt:: bash # + + radosgw-admin realm pull --url={url-to-master-zone-gateway} --access-key={access-key} --secret={secret} + +Rename a Realm +~~~~~~~~~~~~~~ + +A realm is not part of the period. Consequently, renaming the realm is +only applied locally, and will not get pulled with ``realm pull``. When +renaming a realm with multiple zones, run the command on each zone. To +rename a realm, execute the following: + +.. prompt:: bash # + + radosgw-admin realm rename --rgw-realm=<current-name> --realm-new-name=<new-realm-name> + +.. note:: DO NOT use ``realm set`` to change the ``name`` parameter. That + changes the internal name only. Specifying ``--rgw-realm`` would + still use the old realm name. + +Zone Groups +----------- + +The Ceph Object Gateway supports multi-site deployments and a global +namespace by using the notion of zone groups. Formerly called a region +in Infernalis, a zone group defines the geographic location of one or more Ceph +Object Gateway instances within one or more zones. + +Configuring zone groups differs from typical configuration procedures, +because not all of the settings end up in a Ceph configuration file. You +can list zone groups, get a zone group configuration, and set a zone +group configuration. + +Create a Zone Group +~~~~~~~~~~~~~~~~~~~ + +Creating a zone group consists of specifying the zone group name. +Creating a zone assumes it will live in the default realm unless +``--rgw-realm=<realm-name>`` is specified. If the zonegroup is the +default zonegroup, specify the ``--default`` flag. If the zonegroup is +the master zonegroup, specify the ``--master`` flag. For example: + +.. prompt:: bash # + + radosgw-admin zonegroup create --rgw-zonegroup=<name> [--rgw-realm=<name>][--master] [--default] + + +.. note:: Use ``zonegroup modify --rgw-zonegroup=<zonegroup-name>`` to modify + an existing zone group’s settings. + +Make a Zone Group the Default +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +One zonegroup in the list of zonegroups should be the default zonegroup. +There may be only one default zonegroup. If there is only one zonegroup +and it wasn’t specified as the default zonegroup when it was created, +make it the default zonegroup. Alternatively, to change which zonegroup +is the default, execute: + +.. prompt:: bash # + + radosgw-admin zonegroup default --rgw-zonegroup=comedy + +.. note:: When the zonegroup is default, the command line assumes + ``--rgw-zonegroup=<zonegroup-name>`` as an argument. + +Then, update the period: + +.. prompt:: bash # + + radosgw-admin period update --commit + +Add a Zone to a Zone Group +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +To add a zone to a zonegroup, execute the following: + +.. prompt:: bash # + + radosgw-admin zonegroup add --rgw-zonegroup=<name> --rgw-zone=<name> + +Then, update the period: + +.. prompt:: bash # + + radosgw-admin period update --commit + +Remove a Zone from a Zone Group +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +To remove a zone from a zonegroup, execute the following: + +.. prompt:: bash # + + radosgw-admin zonegroup remove --rgw-zonegroup=<name> --rgw-zone=<name> + +Then, update the period: + +.. prompt:: bash # + + radosgw-admin period update --commit + +Rename a Zone Group +~~~~~~~~~~~~~~~~~~~ + +To rename a zonegroup, execute the following: + + +.. prompt:: bash # + + radosgw-admin zonegroup rename --rgw-zonegroup=<name> --zonegroup-new-name=<name> + +Then, update the period: + +.. prompt:: bash # + + radosgw-admin period update --commit + +Delete a Zone Group +~~~~~~~~~~~~~~~~~~~ + +To delete a zonegroup, execute the following: + +.. prompt:: bash # + + radosgw-admin zonegroup delete --rgw-zonegroup=<name> + +Then, update the period: + +.. prompt:: bash # + + radosgw-admin period update --commit + +List Zone Groups +~~~~~~~~~~~~~~~~ + +A Ceph cluster contains a list of zone groups. To list the zone groups, +execute: + +.. prompt:: bash # + + radosgw-admin zonegroup list + +The ``radosgw-admin`` returns a JSON formatted list of zone groups. + +:: + + { + "default_info": "90b28698-e7c3-462c-a42d-4aa780d24eda", + "zonegroups": [ + "us" + ] + } + +Get a Zone Group Map +~~~~~~~~~~~~~~~~~~~~ + +To list the details of each zone group, execute: + +.. prompt:: bash # + + radosgw-admin zonegroup-map get + +.. note:: If you receive a ``failed to read zonegroup map`` error, run + ``radosgw-admin zonegroup-map update`` as ``root`` first. + +Get a Zone Group +~~~~~~~~~~~~~~~~ + +To view the configuration of a zone group, execute: + +.. prompt:: bash # + + dosgw-admin zonegroup get [--rgw-zonegroup=<zonegroup>] + +The zone group configuration looks like this: + +:: + + { + "id": "90b28698-e7c3-462c-a42d-4aa780d24eda", + "name": "us", + "api_name": "us", + "is_master": "true", + "endpoints": [ + "http:\/\/rgw1:80" + ], + "hostnames": [], + "hostnames_s3website": [], + "master_zone": "9248cab2-afe7-43d8-a661-a40bf316665e", + "zones": [ + { + "id": "9248cab2-afe7-43d8-a661-a40bf316665e", + "name": "us-east", + "endpoints": [ + "http:\/\/rgw1" + ], + "log_meta": "true", + "log_data": "true", + "bucket_index_max_shards": 0, + "read_only": "false" + }, + { + "id": "d1024e59-7d28-49d1-8222-af101965a939", + "name": "us-west", + "endpoints": [ + "http:\/\/rgw2:80" + ], + "log_meta": "false", + "log_data": "true", + "bucket_index_max_shards": 0, + "read_only": "false" + } + ], + "placement_targets": [ + { + "name": "default-placement", + "tags": [] + } + ], + "default_placement": "default-placement", + "realm_id": "ae031368-8715-4e27-9a99-0c9468852cfe" + } + +Set a Zone Group +~~~~~~~~~~~~~~~~ + +Defining a zone group consists of creating a JSON object, specifying at +least the required settings: + +1. ``name``: The name of the zone group. Required. + +2. ``api_name``: The API name for the zone group. Optional. + +3. ``is_master``: Determines if the zone group is the master zone group. + Required. **note:** You can only have one master zone group. + +4. ``endpoints``: A list of all the endpoints in the zone group. For + example, you may use multiple domain names to refer to the same zone + group. Remember to escape the forward slashes (``\/``). You may also + specify a port (``fqdn:port``) for each endpoint. Optional. + +5. ``hostnames``: A list of all the hostnames in the zone group. For + example, you may use multiple domain names to refer to the same zone + group. Optional. The ``rgw dns name`` setting will automatically be + included in this list. You should restart the gateway daemon(s) after + changing this setting. + +6. ``master_zone``: The master zone for the zone group. Optional. Uses + the default zone if not specified. **note:** You can only have one + master zone per zone group. + +7. ``zones``: A list of all zones within the zone group. Each zone has a + name (required), a list of endpoints (optional), and whether or not + the gateway will log metadata and data operations (false by default). + +8. ``placement_targets``: A list of placement targets (optional). Each + placement target contains a name (required) for the placement target + and a list of tags (optional) so that only users with the tag can use + the placement target (i.e., the user’s ``placement_tags`` field in + the user info). + +9. ``default_placement``: The default placement target for the object + index and object data. Set to ``default-placement`` by default. You + may also set a per-user default placement in the user info for each + user. + +To set a zone group, create a JSON object consisting of the required +fields, save the object to a file (e.g., ``zonegroup.json``); then, +execute the following command: + +.. prompt:: bash # + + radosgw-admin zonegroup set --infile zonegroup.json + +Where ``zonegroup.json`` is the JSON file you created. + +.. important:: The ``default`` zone group ``is_master`` setting is ``true`` by + default. If you create a new zone group and want to make it the + master zone group, you must either set the ``default`` zone group + ``is_master`` setting to ``false``, or delete the ``default`` zone + group. + +Finally, update the period: + +.. prompt:: bash # + + radosgw-admin period update --commit + +Set a Zone Group Map +~~~~~~~~~~~~~~~~~~~~ + +Setting a zone group map consists of creating a JSON object consisting +of one or more zone groups, and setting the ``master_zonegroup`` for the +cluster. Each zone group in the zone group map consists of a key/value +pair, where the ``key`` setting is equivalent to the ``name`` setting +for an individual zone group configuration, and the ``val`` is a JSON +object consisting of an individual zone group configuration. + +You may only have one zone group with ``is_master`` equal to ``true``, +and it must be specified as the ``master_zonegroup`` at the end of the +zone group map. The following JSON object is an example of a default +zone group map. + +:: + + { + "zonegroups": [ + { + "key": "90b28698-e7c3-462c-a42d-4aa780d24eda", + "val": { + "id": "90b28698-e7c3-462c-a42d-4aa780d24eda", + "name": "us", + "api_name": "us", + "is_master": "true", + "endpoints": [ + "http:\/\/rgw1:80" + ], + "hostnames": [], + "hostnames_s3website": [], + "master_zone": "9248cab2-afe7-43d8-a661-a40bf316665e", + "zones": [ + { + "id": "9248cab2-afe7-43d8-a661-a40bf316665e", + "name": "us-east", + "endpoints": [ + "http:\/\/rgw1" + ], + "log_meta": "true", + "log_data": "true", + "bucket_index_max_shards": 0, + "read_only": "false" + }, + { + "id": "d1024e59-7d28-49d1-8222-af101965a939", + "name": "us-west", + "endpoints": [ + "http:\/\/rgw2:80" + ], + "log_meta": "false", + "log_data": "true", + "bucket_index_max_shards": 0, + "read_only": "false" + } + ], + "placement_targets": [ + { + "name": "default-placement", + "tags": [] + } + ], + "default_placement": "default-placement", + "realm_id": "ae031368-8715-4e27-9a99-0c9468852cfe" + } + } + ], + "master_zonegroup": "90b28698-e7c3-462c-a42d-4aa780d24eda", + "bucket_quota": { + "enabled": false, + "max_size_kb": -1, + "max_objects": -1 + }, + "user_quota": { + "enabled": false, + "max_size_kb": -1, + "max_objects": -1 + } + } + +To set a zone group map, execute the following: + +.. prompt:: bash # + + radosgw-admin zonegroup-map set --infile zonegroupmap.json + +Where ``zonegroupmap.json`` is the JSON file you created. Ensure that +you have zones created for the ones specified in the zone group map. +Finally, update the period. + +.. prompt:: bash # + + radosgw-admin period update --commit + +Zones +----- + +Ceph Object Gateway supports the notion of zones. A zone defines a +logical group consisting of one or more Ceph Object Gateway instances. + +Configuring zones differs from typical configuration procedures, because +not all of the settings end up in a Ceph configuration file. You can +list zones, get a zone configuration and set a zone configuration. + +Create a Zone +~~~~~~~~~~~~~ + +To create a zone, specify a zone name. If it is a master zone, specify +the ``--master`` option. Only one zone in a zone group may be a master +zone. To add the zone to a zonegroup, specify the ``--rgw-zonegroup`` +option with the zonegroup name. + +.. prompt:: bash # + + radosgw-admin zone create --rgw-zone=<name> \ + [--zonegroup=<zonegroup-name]\ + [--endpoints=<endpoint>[,<endpoint>] \ + [--master] [--default] \ + --access-key $SYSTEM_ACCESS_KEY --secret $SYSTEM_SECRET_KEY + +Then, update the period: + +.. prompt:: bash # + + radosgw-admin period update --commit + +Delete a Zone +~~~~~~~~~~~~~ + +To delete zone, first remove it from the zonegroup. + +.. prompt:: bash # + + radosgw-admin zonegroup remove --zonegroup=<name>\ + --zone=<name> + +Then, update the period: + +.. prompt:: bash # + + radosgw-admin period update --commit + +Next, delete the zone. Execute the following: + +.. prompt:: bash # + + radosgw-admin zone delete --rgw-zone<name> + +Finally, update the period: + +.. prompt:: bash # + + radosgw-admin period update --commit + +.. important:: Do not delete a zone without removing it from a zone group first. + Otherwise, updating the period will fail. + +If the pools for the deleted zone will not be used anywhere else, +consider deleting the pools. Replace ``<del-zone>`` in the example below +with the deleted zone’s name. + +.. important:: Only delete the pools with prepended zone names. Deleting the root + pool, such as, ``.rgw.root`` will remove all of the system’s + configuration. + +.. important:: Once the pools are deleted, all of the data within them are deleted + in an unrecoverable manner. Only delete the pools if the pool + contents are no longer needed. + +.. prompt:: bash # + + ceph osd pool rm <del-zone>.rgw.control <del-zone>.rgw.control --yes-i-really-really-mean-it + ceph osd pool rm <del-zone>.rgw.meta <del-zone>.rgw.meta --yes-i-really-really-mean-it + ceph osd pool rm <del-zone>.rgw.log <del-zone>.rgw.log --yes-i-really-really-mean-it + ceph osd pool rm <del-zone>.rgw.otp <del-zone>.rgw.otp --yes-i-really-really-mean-it + ceph osd pool rm <del-zone>.rgw.buckets.index <del-zone>.rgw.buckets.index --yes-i-really-really-mean-it + ceph osd pool rm <del-zone>.rgw.buckets.non-ec <del-zone>.rgw.buckets.non-ec --yes-i-really-really-mean-it + ceph osd pool rm <del-zone>.rgw.buckets.data <del-zone>.rgw.buckets.data --yes-i-really-really-mean-it + +Modify a Zone +~~~~~~~~~~~~~ + +To modify a zone, specify the zone name and the parameters you wish to +modify. + +.. prompt:: bash # + + radosgw-admin zone modify [options] + +Where ``[options]``: + +- ``--access-key=<key>`` +- ``--secret/--secret-key=<key>`` +- ``--master`` +- ``--default`` +- ``--endpoints=<list>`` + +Then, update the period: + +.. prompt:: bash # + + radosgw-admin period update --commit + +List Zones +~~~~~~~~~~ + +As ``root``, to list the zones in a cluster, execute: + +.. prompt:: bash # + + radosgw-admin zone list + +Get a Zone +~~~~~~~~~~ + +As ``root``, to get the configuration of a zone, execute: + +.. prompt:: bash # + + radosgw-admin zone get [--rgw-zone=<zone>] + +The ``default`` zone looks like this: + +:: + + { "domain_root": ".rgw", + "control_pool": ".rgw.control", + "gc_pool": ".rgw.gc", + "log_pool": ".log", + "intent_log_pool": ".intent-log", + "usage_log_pool": ".usage", + "user_keys_pool": ".users", + "user_email_pool": ".users.email", + "user_swift_pool": ".users.swift", + "user_uid_pool": ".users.uid", + "system_key": { "access_key": "", "secret_key": ""}, + "placement_pools": [ + { "key": "default-placement", + "val": { "index_pool": ".rgw.buckets.index", + "data_pool": ".rgw.buckets"} + } + ] + } + +Set a Zone +~~~~~~~~~~ + +Configuring a zone involves specifying a series of Ceph Object Gateway +pools. For consistency, we recommend using a pool prefix that is the +same as the zone name. See +`Pools <http://docs.ceph.com/en/latest/rados/operations/pools/#pools>`__ +for details of configuring pools. + +To set a zone, create a JSON object consisting of the pools, save the +object to a file (e.g., ``zone.json``); then, execute the following +command, replacing ``{zone-name}`` with the name of the zone: + +.. prompt:: bash # + + radosgw-admin zone set --rgw-zone={zone-name} --infile zone.json + +Where ``zone.json`` is the JSON file you created. + +Then, as ``root``, update the period: + +.. prompt:: bash # + + radosgw-admin period update --commit + +Rename a Zone +~~~~~~~~~~~~~ + +To rename a zone, specify the zone name and the new zone name. + +.. prompt:: bash # + + radosgw-admin zone rename --rgw-zone=<name> --zone-new-name=<name> + +Then, update the period: + +.. prompt:: bash # + + radosgw-admin period update --commit + +Zone Group and Zone Settings +---------------------------- + +When configuring a default zone group and zone, the pool name includes +the zone name. For example: + +- ``default.rgw.control`` + +To change the defaults, include the following settings in your Ceph +configuration file under each ``[client.radosgw.{instance-name}]`` +instance. + ++-------------------------------------+-----------------------------------+---------+-----------------------+ +| Name | Description | Type | Default | ++=====================================+===================================+=========+=======================+ +| ``rgw_zone`` | The name of the zone for the | String | None | +| | gateway instance. | | | ++-------------------------------------+-----------------------------------+---------+-----------------------+ +| ``rgw_zonegroup`` | The name of the zone group for | String | None | +| | the gateway instance. | | | ++-------------------------------------+-----------------------------------+---------+-----------------------+ +| ``rgw_zonegroup_root_pool`` | The root pool for the zone group. | String | ``.rgw.root`` | ++-------------------------------------+-----------------------------------+---------+-----------------------+ +| ``rgw_zone_root_pool`` | The root pool for the zone. | String | ``.rgw.root`` | ++-------------------------------------+-----------------------------------+---------+-----------------------+ +| ``rgw_default_zone_group_info_oid`` | The OID for storing the default | String | ``default.zonegroup`` | +| | zone group. We do not recommend | | | +| | changing this setting. | | | ++-------------------------------------+-----------------------------------+---------+-----------------------+ + + +Zone Features +============= + +Some multisite features require support from all zones before they can be enabled. Each zone lists its ``supported_features``, and each zonegroup lists its ``enabled_features``. Before a feature can be enabled in the zonegroup, it must be supported by all of its zones. + +On creation of new zones and zonegroups, all known features are supported/enabled. After upgrading an existing multisite configuration, however, new features must be enabled manually. + +Supported Features +------------------ + ++---------------------------+---------+ +| Feature | Release | ++===========================+=========+ +| :ref:`feature_resharding` | Quincy | ++---------------------------+---------+ + +.. _feature_resharding: + +resharding +~~~~~~~~~~ + +Allows buckets to be resharded in a multisite configuration without interrupting the replication of their objects. When ``rgw_dynamic_resharding`` is enabled, it runs on each zone independently, and zones may choose different shard counts for the same bucket. When buckets are resharded manually with ``radosgw-admin bucket reshard``, only that zone's bucket is modified. A zone feature should only be marked as supported after all of its radosgws and osds have upgraded. + + +Commands +----------------- + +Add support for a zone feature +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +On the cluster that contains the given zone: + +.. prompt:: bash $ + + radosgw-admin zone modify --rgw-zone={zone-name} --enable-feature={feature-name} + radosgw-admin period update --commit + + +Remove support for a zone feature +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +On the cluster that contains the given zone: + +.. prompt:: bash $ + + radosgw-admin zone modify --rgw-zone={zone-name} --disable-feature={feature-name} + radosgw-admin period update --commit + +Enable a zonegroup feature +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +On any cluster in the realm: + +.. prompt:: bash $ + + radosgw-admin zonegroup modify --rgw-zonegroup={zonegroup-name} --enable-feature={feature-name} + radosgw-admin period update --commit + +Disable a zonegroup feature +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +On any cluster in the realm: + +.. prompt:: bash $ + + radosgw-admin zonegroup modify --rgw-zonegroup={zonegroup-name} --disable-feature={feature-name} + radosgw-admin period update --commit + + +.. _`Pools`: ../pools +.. _`Sync Policy Config`: ../multisite-sync-policy diff --git a/doc/radosgw/multitenancy.rst b/doc/radosgw/multitenancy.rst new file mode 100644 index 000000000..0cca50d96 --- /dev/null +++ b/doc/radosgw/multitenancy.rst @@ -0,0 +1,169 @@ +.. _rgw-multitenancy: + +================= +RGW Multi-tenancy +================= + +.. versionadded:: Jewel + +The multi-tenancy feature allows to use buckets and users of the same +name simultaneously by segregating them under so-called ``tenants``. +This may be useful, for instance, to permit users of Swift API to +create buckets with easily conflicting names such as "test" or "trove". + +From the Jewel release onward, each user and bucket lies under a tenant. +For compatibility, a "legacy" tenant with an empty name is provided. +Whenever a bucket is referred without an explicit tenant, an implicit +tenant is used, taken from the user performing the operation. Since +the pre-existing users are under the legacy tenant, they continue +to create and access buckets as before. The layout of objects in RADOS +is extended in a compatible way, ensuring a smooth upgrade to Jewel. + +Administering Users With Explicit Tenants +========================================= + +Tenants as such do not have any operations on them. They appear and +disappear as needed, when users are administered. In order to create, +modify, and remove users with explicit tenants, either an additional +option --tenant is supplied, or a syntax "<tenant>$<user>" is used +in the parameters of the radosgw-admin command. + +Examples +-------- + +Create a user testx$tester to be accessed with S3:: + + # radosgw-admin --tenant testx --uid tester --display-name "Test User" --access_key TESTER --secret test123 user create + +Create a user testx$tester to be accessed with Swift:: + + # radosgw-admin --tenant testx --uid tester --display-name "Test User" --subuser tester:test --key-type swift --access full user create + # radosgw-admin --subuser 'testx$tester:test' --key-type swift --secret test123 + +.. note:: The subuser with explicit tenant has to be quoted in the shell. + + Tenant names may contain only alphanumeric characters and underscores. + +Accessing Buckets with Explicit Tenants +======================================= + +When a client application accesses buckets, it always operates with +credentials of a particular user. As mentioned above, every user belongs +to a tenant. Therefore, every operation has an implicit tenant in its +context, to be used if no tenant is specified explicitly. Thus a complete +compatibility is maintained with previous releases, as long as the +referred buckets and referring user belong to the same tenant. +In other words, anything unusual occurs when accessing another tenant's +buckets *only*. + +Extensions employed to specify an explicit tenant differ according +to the protocol and authentication system used. + +S3 +-- + +In case of S3, a colon character is used to separate tenant and bucket. +Thus a sample URL would be:: + + https://ep.host.dom/tenant:bucket + +Here's a simple Python sample: + +.. code-block:: python + :linenos: + + from boto.s3.connection import S3Connection, OrdinaryCallingFormat + c = S3Connection( + aws_access_key_id="TESTER", + aws_secret_access_key="test123", + host="ep.host.dom", + calling_format = OrdinaryCallingFormat()) + bucket = c.get_bucket("test5b:testbucket") + +Note that it's not possible to supply an explicit tenant using +a hostname. Hostnames cannot contain colons, or any other separators +that are not already valid in bucket names. Using a period creates an +ambiguous syntax. Therefore, the bucket-in-URL-path format has to be +used. + +Due to the fact that the native S3 API does not deal with +multi-tenancy and radosgw's implementation does, things get a bit +involved when dealing with signed URLs and public read ACLs. + +* A **signed URL** does contain the ``AWSAccessKeyId`` query + parameters, from which radosgw is able to discern the correct user + and tenant owning the bucket. In other words, an application + generating signed URLs should be able to take just the un-prefixed + bucket name, and produce a signed URL that itself contains the + bucket name without the tenant prefix. However, it is *possible* to + include the prefix if you so choose. + + Thus, accessing a signed URL of an object ``bar`` in a container + ``foo`` belonging to the tenant ``7188e165c0ae4424ac68ae2e89a05c50`` + would be possible either via + ``http://<host>:<port>/foo/bar?AWSAccessKeyId=b200fb6634c547199e436a0f93c0c46e&Expires=1542890806&Signature=eok6CYQC%2FDwmQQmqvY5jTg6ehXU%3D``, + or via + ``http://<host>:<port>/7188e165c0ae4424ac68ae2e89a05c50:foo/bar?AWSAccessKeyId=b200fb6634c547199e436a0f93c0c46e&Expires=1542890806&Signature=eok6CYQC%2FDwmQQmqvY5jTg6ehXU%3D``, + depending on whether or not the tenant prefix was passed in on + signature generation. + +* A bucket with a **public read ACL** is meant to be read by an HTTP + client *without* including any query parameters that would allow + radosgw to discern tenants. Thus, publicly readable objects must + always be accessed using the bucket name with the tenant prefix. + + Thus, if you set a public read ACL on an object ``bar`` in a + container ``foo`` belonging to the tenant + ``7188e165c0ae4424ac68ae2e89a05c50``, you would need to access that + object via the public URL + ``http://<host>:<port>/7188e165c0ae4424ac68ae2e89a05c50:foo/bar``. + +Swift with built-in authenticator +--------------------------------- + +TBD -- not in test_multen.py yet + +Swift with Keystone +------------------- + +In the default configuration, although native Swift has inherent +multi-tenancy, radosgw does not enable multi-tenancy for the Swift +API. This is to ensure that a setup with legacy buckets --- that is, +buckets that were created before radosgw supported multitenancy ---, +those buckets retain their dual-API capability to be queried and +modified using either S3 or Swift. + +If you want to enable multitenancy for Swift, particularly if your +users only ever authenticate against OpenStack Keystone, you should +enable Keystone-based multitenancy with the following ``ceph.conf`` +configuration option:: + + rgw keystone implicit tenants = true + +Once you enable this option, any newly connecting user (whether they +are using the Swift API, or Keystone-authenticated S3) will prompt +radosgw to create a user named ``<tenant_id>$<tenant_id``, where +``<tenant_id>`` is a Keystone tenant (project) UUID --- for example, +``7188e165c0ae4424ac68ae2e89a05c50$7188e165c0ae4424ac68ae2e89a05c50``. + +Whenever that user then creates an Swift container, radosgw internally +translates the given container name into +``<tenant_id>/<container_name>``, such as +``7188e165c0ae4424ac68ae2e89a05c50/foo``. This ensures that if there +are two or more different tenants all creating a container named +``foo``, radosgw is able to transparently discern them by their tenant +prefix. + +It is also possible to limit the effects of implicit tenants +to only apply to swift or s3, by setting ``rgw keystone implicit tenants`` +to either ``s3`` or ``swift``. This will likely primarily +be of use to users who had previously used implicit tenants +with older versions of ceph, where implicit tenants +only applied to the swift protocol. + +Notes and known issues +---------------------- + +Just to be clear, it is not possible to create buckets in other +tenants at present. The owner of newly created bucket is extracted +from authentication information. diff --git a/doc/radosgw/nfs.rst b/doc/radosgw/nfs.rst new file mode 100644 index 000000000..0a506599a --- /dev/null +++ b/doc/radosgw/nfs.rst @@ -0,0 +1,371 @@ +=== +NFS +=== + +.. versionadded:: Jewel + +.. note:: Only the NFSv4 protocol is supported when using a cephadm or rook based deployment. + +Ceph Object Gateway namespaces can be exported over the file-based +NFSv4 protocols, alongside traditional HTTP access +protocols (S3 and Swift). + +In particular, the Ceph Object Gateway can now be configured to +provide file-based access when embedded in the NFS-Ganesha NFS server. + +The simplest and preferred way of managing nfs-ganesha clusters and rgw exports +is using ``ceph nfs ...`` commands. See :doc:`/mgr/nfs` for more details. + +librgw +====== + +The librgw.so shared library (Unix) provides a loadable interface to +Ceph Object Gateway services, and instantiates a full Ceph Object Gateway +instance on initialization. + +In turn, librgw.so exports rgw_file, a stateful API for file-oriented +access to RGW buckets and objects. The API is general, but its design +is strongly influenced by the File System Abstraction Layer (FSAL) API +of NFS-Ganesha, for which it has been primarily designed. + +A set of Python bindings is also provided. + +Namespace Conventions +===================== + +The implementation conforms to Amazon Web Services (AWS) hierarchical +namespace conventions which map UNIX-style path names onto S3 buckets +and objects. + +The top level of the attached namespace consists of S3 buckets, +represented as NFS directories. Files and directories subordinate to +buckets are each represented as objects, following S3 prefix and +delimiter conventions, with '/' being the only supported path +delimiter [#]_. + +For example, if an NFS client has mounted an RGW namespace at "/nfs", +then a file "/nfs/mybucket/www/index.html" in the NFS namespace +corresponds to an RGW object "www/index.html" in a bucket/container +"mybucket." + +Although it is generally invisible to clients, the NFS namespace is +assembled through concatenation of the corresponding paths implied by +the objects in the namespace. Leaf objects, whether files or +directories, will always be materialized in an RGW object of the +corresponding key name, "<name>" if a file, "<name>/" if a directory. +Non-leaf directories (e.g., "www" above) might only be implied by +their appearance in the names of one or more leaf objects. Directories +created within NFS or directly operated on by an NFS client (e.g., via +an attribute-setting operation such as chown or chmod) always have a +leaf object representation used to store materialized attributes such +as Unix ownership and permissions. + +Supported Operations +==================== + +The RGW NFS interface supports most operations on files and +directories, with the following restrictions: + +- Links, including symlinks, are not supported. +- NFS ACLs are not supported. + + + Unix user and group ownership and permissions *are* supported. + +- Directories may not be moved/renamed. + + + Files may be moved between directories. + +- Only full, sequential *write* I/O is supported + + + i.e., write operations are constrained to be **uploads**. + + Many typical I/O operations such as editing files in place will necessarily fail as they perform non-sequential stores. + + Some file utilities *apparently* writing sequentially (e.g., some versions of GNU tar) may fail due to infrequent non-sequential stores. + + When mounting via NFS, sequential application I/O can generally be constrained to be written sequentially to the NFS server via a synchronous mount option (e.g. -osync in Linux). + + NFS clients which cannot mount synchronously (e.g., MS Windows) will not be able to upload files. + +Security +======== + +The RGW NFS interface provides a hybrid security model with the +following characteristics: + +- NFS protocol security is provided by the NFS-Ganesha server, as negotiated by the NFS server and clients + + + e.g., clients can by trusted (AUTH_SYS), or required to present Kerberos user credentials (RPCSEC_GSS) + + RPCSEC_GSS wire security can be integrity only (krb5i) or integrity and privacy (encryption, krb5p) + + various NFS-specific security and permission rules are available + + * e.g., root-squashing + +- a set of RGW/S3 security credentials (unknown to NFS) is associated with each RGW NFS mount (i.e., NFS-Ganesha EXPORT) + + + all RGW object operations performed via the NFS server will be performed by the RGW user associated with the credentials stored in the export being accessed (currently only RGW and RGW LDAP credentials are supported) + + * additional RGW authentication types such as Keystone are not currently supported + +Manually configuring an NFS-Ganesha Instance +============================================ + +Each NFS RGW instance is an NFS-Ganesha server instance *embeddding* +a full Ceph RGW instance. + +Therefore, the RGW NFS configuration includes Ceph and Ceph Object +Gateway-specific configuration in a local ceph.conf, as well as +NFS-Ganesha-specific configuration in the NFS-Ganesha config file, +ganesha.conf. + +ceph.conf +--------- + +Required ceph.conf configuration for RGW NFS includes: + +* valid [client.radosgw.{instance-name}] section +* valid values for minimal instance configuration, in particular, an installed and correct ``keyring`` + +Other config variables are optional, front-end-specific and front-end +selection variables (e.g., ``rgw data`` and ``rgw frontends``) are +optional and in some cases ignored. + +A small number of config variables (e.g., ``rgw_nfs_namespace_expire_secs``) +are unique to RGW NFS. + +ganesha.conf +------------ + +A strictly minimal ganesha.conf for use with RGW NFS includes one +EXPORT block with embedded FSAL block of type RGW:: + + EXPORT + { + Export_ID={numeric-id}; + Path = "/"; + Pseudo = "/"; + Access_Type = RW; + SecType = "sys"; + NFS_Protocols = 4; + Transport_Protocols = TCP; + + # optional, permit unsquashed access by client "root" user + #Squash = No_Root_Squash; + + FSAL { + Name = RGW; + User_Id = {s3-user-id}; + Access_Key_Id ="{s3-access-key}"; + Secret_Access_Key = "{s3-secret}"; + } + } + +``Export_ID`` must have an integer value, e.g., "77" + +``Path`` (for RGW) should be "/" + +``Pseudo`` defines an NFSv4 pseudo root name (NFSv4 only) + +``SecType = sys;`` allows clients to attach without Kerberos +authentication + +``Squash = No_Root_Squash;`` enables the client root user to override +permissions (Unix convention). When root-squashing is enabled, +operations attempted by the root user are performed as if by the local +"nobody" (and "nogroup") user on the NFS-Ganesha server + +The RGW FSAL additionally supports RGW-specific configuration +variables in the RGW config section:: + + RGW { + cluster = "{cluster name, default 'ceph'}"; + name = "client.rgw.{instance-name}"; + ceph_conf = "/opt/ceph-rgw/etc/ceph/ceph.conf"; + init_args = "-d --debug-rgw=16"; + } + +``cluster`` sets a Ceph cluster name (must match the cluster being exported) + +``name`` sets an RGW instance name (must match the cluster being exported) + +``ceph_conf`` gives a path to a non-default ceph.conf file to use + + +Other useful NFS-Ganesha configuration: +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Any EXPORT block which should support NFSv3 should include version 3 +in the NFS_Protocols setting. Additionally, NFSv3 is the last major +version to support the UDP transport. To enable UDP, include it in the +Transport_Protocols setting. For example:: + + EXPORT { + ... + NFS_Protocols = 3,4; + Transport_Protocols = UDP,TCP; + ... + } + +One important family of options pertains to interaction with the Linux +idmapping service, which is used to normalize user and group names +across systems. Details of idmapper integration are not provided here. + +With Linux NFS clients, NFS-Ganesha can be configured +to accept client-supplied numeric user and group identifiers with +NFSv4, which by default stringifies these--this may be useful in small +setups and for experimentation:: + + NFSV4 { + Allow_Numeric_Owners = true; + Only_Numeric_Owners = true; + } + +Troubleshooting +~~~~~~~~~~~~~~~ + +NFS-Ganesha configuration problems are usually debugged by running the +server with debugging options, controlled by the LOG config section. + +NFS-Ganesha log messages are grouped into various components, logging +can be enabled separately for each component. Valid values for +component logging include:: + + *FATAL* critical errors only + *WARN* unusual condition + *DEBUG* mildly verbose trace output + *FULL_DEBUG* verbose trace output + +Example:: + + LOG { + + Components { + MEMLEAKS = FATAL; + FSAL = FATAL; + NFSPROTO = FATAL; + NFS_V4 = FATAL; + EXPORT = FATAL; + FILEHANDLE = FATAL; + DISPATCH = FATAL; + CACHE_INODE = FATAL; + CACHE_INODE_LRU = FATAL; + HASHTABLE = FATAL; + HASHTABLE_CACHE = FATAL; + DUPREQ = FATAL; + INIT = DEBUG; + MAIN = DEBUG; + IDMAPPER = FATAL; + NFS_READDIR = FATAL; + NFS_V4_LOCK = FATAL; + CONFIG = FATAL; + CLIENTID = FATAL; + SESSIONS = FATAL; + PNFS = FATAL; + RW_LOCK = FATAL; + NLM = FATAL; + RPC = FATAL; + NFS_CB = FATAL; + THREAD = FATAL; + NFS_V4_ACL = FATAL; + STATE = FATAL; + FSAL_UP = FATAL; + DBUS = FATAL; + } + # optional: redirect log output + # Facility { + # name = FILE; + # destination = "/tmp/ganesha-rgw.log"; + # enable = active; + } + } + +Running Multiple NFS Gateways +============================= + +Each NFS-Ganesha instance acts as a full gateway endpoint, with the +limitation that currently an NFS-Ganesha instance cannot be configured +to export HTTP services. As with ordinary gateway instances, any +number of NFS-Ganesha instances can be started, exporting the same or +different resources from the cluster. This enables the clustering of +NFS-Ganesha instances. However, this does not imply high availability. + +When regular gateway instances and NFS-Ganesha instances overlap the +same data resources, they will be accessible from both the standard S3 +API and through the NFS-Ganesha instance as exported. You can +co-locate the NFS-Ganesha instance with a Ceph Object Gateway instance +on the same host. + +RGW vs RGW NFS +============== + +Exporting an NFS namespace and other RGW namespaces (e.g., S3 or Swift +via the Civetweb HTTP front-end) from the same program instance is +currently not supported. + +When adding objects and buckets outside of NFS, those objects will +appear in the NFS namespace in the time set by +``rgw_nfs_namespace_expire_secs``, which defaults to 300 seconds (5 minutes). +Override the default value for ``rgw_nfs_namespace_expire_secs`` in the +Ceph configuration file to change the refresh rate. + +If exporting Swift containers that do not conform to valid S3 bucket +naming requirements, set ``rgw_relaxed_s3_bucket_names`` to true in the +[client.radosgw] section of the Ceph configuration file. For example, +if a Swift container name contains underscores, it is not a valid S3 +bucket name and will be rejected unless ``rgw_relaxed_s3_bucket_names`` +is set to true. + +Configuring NFSv4 clients +========================= + +To access the namespace, mount the configured NFS-Ganesha export(s) +into desired locations in the local POSIX namespace. As noted, this +implementation has a few unique restrictions: + +- NFS 4.1 and higher protocol flavors are preferred + + + NFSv4 OPEN and CLOSE operations are used to track upload transactions + +- To upload data successfully, clients must preserve write ordering + + + on Linux and many Unix NFS clients, use the -osync mount option + +Conventions for mounting NFS resources are platform-specific. The +following conventions work on Linux and some Unix platforms: + +From the command line:: + + mount -t nfs -o nfsvers=4.1,noauto,soft,sync,proto=tcp <ganesha-host-name>:/ <mount-point> + +In /etc/fstab:: + +<ganesha-host-name>:/ <mount-point> nfs noauto,soft,nfsvers=4.1,sync,proto=tcp 0 0 + +Specify the NFS-Ganesha host name and the path to the mount point on +the client. + +Configuring NFSv3 Clients +========================= + +Linux clients can be configured to mount with NFSv3 by supplying +``nfsvers=3`` and ``noacl`` as mount options. To use UDP as the +transport, add ``proto=udp`` to the mount options. However, TCP is the +preferred transport:: + + <ganesha-host-name>:/ <mount-point> nfs noauto,noacl,soft,nfsvers=3,sync,proto=tcp 0 0 + +Configure the NFS Ganesha EXPORT block Protocols setting with version +3 and the Transports setting with UDP if the mount will use version 3 with UDP. + +NFSv3 Semantics +--------------- + +Since NFSv3 does not communicate client OPEN and CLOSE operations to +file servers, RGW NFS cannot use these operations to mark the +beginning and ending of file upload transactions. Instead, RGW NFS +starts a new upload when the first write is sent to a file at offset +0, and finalizes the upload when no new writes to the file have been +seen for a period of time, by default, 10 seconds. To change this +timeout, set an alternate value for ``rgw_nfs_write_completion_interval_s`` +in the RGW section(s) of the Ceph configuration file. + +References +========== + +.. [#] http://docs.aws.amazon.com/AmazonS3/latest/dev/ListingKeysHierarchy.html diff --git a/doc/radosgw/notifications.rst b/doc/radosgw/notifications.rst new file mode 100644 index 000000000..a294e6c15 --- /dev/null +++ b/doc/radosgw/notifications.rst @@ -0,0 +1,543 @@ +==================== +Bucket Notifications +==================== + +.. versionadded:: Nautilus + +.. contents:: + +Bucket notifications provide a mechanism for sending information out of radosgw +when certain events happen on the bucket. Notifications can be sent to HTTP +endpoints, AMQP0.9.1 endpoints, and Kafka endpoints. + +The `PubSub Module`_ (and *not* the bucket-notification mechanism) should be +used for events stored in Ceph. + +A user can create topics. A topic entity is defined by its name and is "per +tenant". A user can associate its topics (via notification configuration) only +with buckets it owns. + +A notification entity must be created in order to send event notifications for +a specific bucket. A notification entity can be created either for a subset +of event types or for all event types (which is the default). The +notification may also filter out events based on matches of the prefixes and +suffixes of (1) the keys, (2) the metadata attributes attached to the object, +or (3) the object tags. Regular-expression matching can also be used on these +to create filters. There can be multiple notifications for any specific topic, +and the same topic can used for multiple notifications. + +REST API has been defined so as to provide configuration and control interfaces +for the bucket notification mechanism. This API is similar to the one defined +as the S3-compatible API of the `PubSub Module`_. + +.. toctree:: + :maxdepth: 1 + + S3 Bucket Notification Compatibility <s3-notification-compatibility> + +.. note:: To enable bucket notifications API, the `rgw_enable_apis` configuration parameter should contain: "notifications". + +Notification Reliability +------------------------ + +Notifications can be sent synchronously or asynchronously. This section +describes the latency and reliability that you should expect for synchronous +and asynchronous notifications. + +Synchronous Notifications +~~~~~~~~~~~~~~~~~~~~~~~~~ + +Notifications can be sent synchronously, as part of the operation that +triggered them. In this mode, the operation is acknowledged (acked) only after +the notification is sent to the topic's configured endpoint. This means that +the round trip time of the notification (the time it takes to send the +notification to the topic's endpoint plus the time it takes to receive the +acknowledgement) is added to the latency of the operation itself. + +.. note:: The original triggering operation is considered successful even if + the notification fails with an error, cannot be delivered, or times out. + +Asynchronous Notifications +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Notifications can be sent asynchronously. They are committed into persistent +storage and then asynchronously sent to the topic's configured endpoint. In +this case, the only latency added to the original operation is the latency +added when the notification is committed to persistent storage. + +.. note:: If the notification fails with an error, cannot be delivered, or + times out, it is retried until it is successfully acknowledged. + +.. tip:: To minimize the latency added by asynchronous notification, we + recommended placing the "log" pool on fast media. + + +Topic Management via CLI +------------------------ + +Configuration of all topics, associated with a tenant, could be fetched using the following command: + +.. prompt:: bash # + + radosgw-admin topic list [--tenant={tenant}] + + +Configuration of a specific topic could be fetched using: + +.. prompt:: bash # + + radosgw-admin topic get --topic={topic-name} [--tenant={tenant}] + + +And removed using: + +.. prompt:: bash # + + radosgw-admin topic rm --topic={topic-name} [--tenant={tenant}] + + +Notification Performance Stats +------------------------------ +The same counters are shared between the pubsub sync module and the bucket notification mechanism. + +- ``pubsub_event_triggered``: running counter of events with at least one topic associated with them +- ``pubsub_event_lost``: running counter of events that had topics associated with them but that were not pushed to any of the endpoints +- ``pubsub_push_ok``: running counter, for all notifications, of events successfully pushed to their endpoint +- ``pubsub_push_fail``: running counter, for all notifications, of events failed to be pushed to their endpoint +- ``pubsub_push_pending``: gauge value of events pushed to an endpoint but not acked or nacked yet + +.. note:: + + ``pubsub_event_triggered`` and ``pubsub_event_lost`` are incremented per event, while: + ``pubsub_push_ok``, ``pubsub_push_fail``, are incremented per push action on each notification + +Bucket Notification REST API +---------------------------- + +Topics +~~~~~~ + +.. note:: + + In all topic actions, the parameters are URL-encoded and sent in the + message body using this content type: + ``application/x-www-form-urlencoded``. + + +Create a Topic +`````````````` + +This creates a new topic. Provide the topic with push endpoint parameters, +which will be used later when a notification is created. A response is +generated. A successful response includes the the topic's `ARN +<https://docs.aws.amazon.com/general/latest/gr/aws-arns-and-namespaces.html>`_ +(the "Amazon Resource Name", a unique identifier used to reference the topic). +To update a topic, use the same command that you used to create it (but when +updating, use the name of an existing topic and different endpoint values). + +.. tip:: Any notification already associated with the topic must be re-created + in order for the topic to update. + +.. note:: For rabbitmq, ``push-endpoint`` (with a hyphen in the middle) must be + changed to ``push_endpoint`` (with an underscore in the middle). + +:: + + POST + + Action=CreateTopic + &Name=<topic-name> + [&Attributes.entry.1.key=amqp-exchange&Attributes.entry.1.value=<exchange>] + [&Attributes.entry.2.key=amqp-ack-level&Attributes.entry.2.value=none|broker|routable] + [&Attributes.entry.3.key=verify-ssl&Attributes.entry.3.value=true|false] + [&Attributes.entry.4.key=kafka-ack-level&Attributes.entry.4.value=none|broker] + [&Attributes.entry.5.key=use-ssl&Attributes.entry.5.value=true|false] + [&Attributes.entry.6.key=ca-location&Attributes.entry.6.value=<file path>] + [&Attributes.entry.7.key=OpaqueData&Attributes.entry.7.value=<opaque data>] + [&Attributes.entry.8.key=push-endpoint&Attributes.entry.8.value=<endpoint>] + [&Attributes.entry.9.key=persistent&Attributes.entry.9.value=true|false] + +Request parameters: + +- push-endpoint: This is the URI of an endpoint to send push notifications to. +- OpaqueData: Opaque data is set in the topic configuration and added to all + notifications that are triggered by the topic. +- persistent: This indicates whether notifications to this endpoint are + persistent (=asynchronous) or not persistent. (This is "false" by default.) + +- HTTP endpoint + + - URI: ``http[s]://<fqdn>[:<port]`` + - port: This defaults to 80 for HTTP and 443 for HTTPS. + - verify-ssl: This indicates whether the server certificate is validated by + the client. (This is "true" by default.) + - cloudevents: This indicates whether the HTTP header should contain + attributes according to the `S3 CloudEvents Spec`_. (This is "false" by + default.) + +- AMQP0.9.1 endpoint + + - URI: ``amqp[s]://[<user>:<password>@]<fqdn>[:<port>][/<vhost>]`` + - user/password: This defaults to "guest/guest". + - user/password: This must be provided only over HTTPS. Topic creation + requests will otherwise be rejected. + - port: This defaults to 5672 for unencrypted connections and 5671 for + SSL-encrypted connections. + - vhost: This defaults to "/". + - verify-ssl: This indicates whether the server certificate is validated by + the client. (This is "true" by default.) + - If ``ca-location`` is provided and a secure connection is used, the + specified CA will be used to authenticate the broker. The default CA will + not be used. + - amqp-exchange: The exchanges must exist and must be able to route messages + based on topics. This parameter is mandatory. Different topics that point + to the same endpoint must use the same exchange. + - amqp-ack-level: No end2end acking is required. Messages may persist in the + broker before being delivered to their final destinations. Three ack methods + exist: + + - "none": The message is considered "delivered" if it is sent to the broker. + - "broker": The message is considered "delivered" if it is acked by the broker (default). + - "routable": The message is considered "delivered" if the broker can route to a consumer. + +.. tip:: The topic-name (see :ref:`radosgw-create-a-topic`) is used for the + AMQP topic ("routing key" for a topic exchange). + +- Kafka endpoint + + - URI: ``kafka://[<user>:<password>@]<fqdn>[:<port]`` + - ``use-ssl``: If this is set to "true", a secure connection is used to + connect to the broker. (This is "false" by default.) + - ``ca-location``: If this is provided and a secure connection is used, the + specified CA will be used insted of the default CA to authenticate the + broker. + - user/password: This must be provided only over HTTPS. Topic creation + requests will otherwise be rejected. + - user/password: This must be provided along with ``use-ssl``. Connections to + the broker will otherwise fail. + - port: This defaults to 9092. + - kafka-ack-level: No end2end acking is required. Messages may persist in the + broker before being delivered to their final destinations. Two ack methods + exist: + + - "none": Messages are considered "delivered" if sent to the broker. + - "broker": Messages are considered "delivered" if acked by the broker. (This + is the default.) + +.. note:: + + - The key-value pair of a specific parameter need not reside in the same + line as the parameter, and need not appear in any specific order, but it + must use the same index. + - Attribute indexing need not be sequential and need not start from any + specific value. + - `AWS Create Topic`_ provides a detailed explanation of the endpoint + attributes format. In our case, however, different keys and values are + used. + +The response has the following format: + +:: + + <CreateTopicResponse xmlns="https://sns.amazonaws.com/doc/2010-03-31/"> + <CreateTopicResult> + <TopicArn></TopicArn> + </CreateTopicResult> + <ResponseMetadata> + <RequestId></RequestId> + </ResponseMetadata> + </CreateTopicResponse> + +The topic `ARN +<https://docs.aws.amazon.com/general/latest/gr/aws-arns-and-namespaces.html>`_ +in the response has the following format: + +:: + + arn:aws:sns:<zone-group>:<tenant>:<topic> + +Get Topic Attributes +```````````````````` + +This returns information about a specific topic. This includes push-endpoint +information, if provided. + +:: + + POST + + Action=GetTopicAttributes + &TopicArn=<topic-arn> + +The response has the following format: + +:: + + <GetTopicAttributesResponse> + <GetTopicAttributesRersult> + <Attributes> + <entry> + <key>User</key> + <value></value> + </entry> + <entry> + <key>Name</key> + <value></value> + </entry> + <entry> + <key>EndPoint</key> + <value></value> + </entry> + <entry> + <key>TopicArn</key> + <value></value> + </entry> + <entry> + <key>OpaqueData</key> + <value></value> + </entry> + </Attributes> + </GetTopicAttributesResult> + <ResponseMetadata> + <RequestId></RequestId> + </ResponseMetadata> + </GetTopicAttributesResponse> + +- User: the name of the user that created the topic. +- Name: the name of the topic. +- EndPoint: The JSON-formatted endpoint parameters, including: + - EndpointAddress: The push-endpoint URL. + - EndpointArgs: The push-endpoint args. + - EndpointTopic: The topic name to be sent to the endpoint (can be different + than the above topic name). + - HasStoredSecret: This is "true" if the endpoint URL contains user/password + information. In this case, the request must be made over HTTPS. The "topic + get" request will otherwise be rejected. + - Persistent: This is "true" if the topic is persistent. +- TopicArn: topic `ARN + <https://docs.aws.amazon.com/general/latest/gr/aws-arns-and-namespaces.html>`_. +- OpaqueData: The opaque data set on the topic. + +Get Topic Information +````````````````````` + +This returns information about a specific topic. This includes push-endpoint +information, if provided. Note that this API is now deprecated in favor of the +AWS compliant `GetTopicAttributes` API. + +:: + + POST + + Action=GetTopic + &TopicArn=<topic-arn> + +The response has the following format: + +:: + + <GetTopicResponse> + <GetTopicRersult> + <Topic> + <User></User> + <Name></Name> + <EndPoint> + <EndpointAddress></EndpointAddress> + <EndpointArgs></EndpointArgs> + <EndpointTopic></EndpointTopic> + <HasStoredSecret></HasStoredSecret> + <Persistent></Persistent> + </EndPoint> + <TopicArn></TopicArn> + <OpaqueData></OpaqueData> + </Topic> + </GetTopicResult> + <ResponseMetadata> + <RequestId></RequestId> + </ResponseMetadata> + </GetTopicResponse> + +- User: The name of the user that created the topic. +- Name: The name of the topic. +- EndpointAddress: The push-endpoint URL. +- EndpointArgs: The push-endpoint args. +- EndpointTopic: The topic name to be sent to the endpoint (which can be + different than the above topic name). +- HasStoredSecret: This is "true" if the endpoint URL contains user/password + information. In this case, the request must be made over HTTPS. The "topic + get" request will otherwise be rejected. +- Persistent: "true" if topic is persistent. +- TopicArn: topic `ARN + <https://docs.aws.amazon.com/general/latest/gr/aws-arns-and-namespaces.html>`_. +- OpaqueData: the opaque data set on the topic. + +Delete Topic +```````````` + +:: + + POST + + Action=DeleteTopic + &TopicArn=<topic-arn> + +This deletes the specified topic. + +.. note:: + + - Deleting an unknown notification (for example, double delete) is not + considered an error. + - Deleting a topic does not automatically delete all notifications associated + with it. + +The response has the following format: + +:: + + <DeleteTopicResponse xmlns="https://sns.amazonaws.com/doc/2010-03-31/"> + <ResponseMetadata> + <RequestId></RequestId> + </ResponseMetadata> + </DeleteTopicResponse> + +List Topics +``````````` + +List all topics associated with a tenant. + +:: + + POST + + Action=ListTopics + +The response has the following format: + +:: + + <ListTopicdResponse xmlns="https://sns.amazonaws.com/doc/2010-03-31/"> + <ListTopicsRersult> + <Topics> + <member> + <User></User> + <Name></Name> + <EndPoint> + <EndpointAddress></EndpointAddress> + <EndpointArgs></EndpointArgs> + <EndpointTopic></EndpointTopic> + </EndPoint> + <TopicArn></TopicArn> + <OpaqueData></OpaqueData> + </member> + </Topics> + </ListTopicsResult> + <ResponseMetadata> + <RequestId></RequestId> + </ResponseMetadata> + </ListTopicsResponse> + +- If the endpoint URL contains user/password information in any part of the + topic, the request must be made over HTTPS. The "topic list" request will + otherwise be rejected. + +Notifications +~~~~~~~~~~~~~ + +Detailed under: `Bucket Operations`_. + +.. note:: + + - "Abort Multipart Upload" request does not emit a notification + - Both "Initiate Multipart Upload" and "POST Object" requests will emit an ``s3:ObjectCreated:Post`` notification + +Events +~~~~~~ + +Events are in JSON format (regardless of the actual endpoint), and share the +same structure as S3-compatible events that are pushed or pulled using the +pubsub sync module. For example: + +:: + + {"Records":[ + { + "eventVersion":"2.1", + "eventSource":"ceph:s3", + "awsRegion":"us-east-1", + "eventTime":"2019-11-22T13:47:35.124724Z", + "eventName":"ObjectCreated:Put", + "userIdentity":{ + "principalId":"tester" + }, + "requestParameters":{ + "sourceIPAddress":"" + }, + "responseElements":{ + "x-amz-request-id":"503a4c37-85eb-47cd-8681-2817e80b4281.5330.903595", + "x-amz-id-2":"14d2-zone1-zonegroup1" + }, + "s3":{ + "s3SchemaVersion":"1.0", + "configurationId":"mynotif1", + "bucket":{ + "name":"mybucket1", + "ownerIdentity":{ + "principalId":"tester" + }, + "arn":"arn:aws:s3:us-east-1::mybucket1", + "id":"503a4c37-85eb-47cd-8681-2817e80b4281.5332.38" + }, + "object":{ + "key":"myimage1.jpg", + "size":"1024", + "eTag":"37b51d194a7513e45b56f6524f2d51f2", + "versionId":"", + "sequencer": "F7E6D75DC742D108", + "metadata":[], + "tags":[] + } + }, + "eventId":"", + "opaqueData":"me@example.com" + } + ]} + +- awsRegion: The zonegroup. +- eventTime: The timestamp, indicating when the event was triggered. +- eventName: For the list of supported events see: `S3 Notification + Compatibility`_. Note that eventName values do not start with the `s3:` + prefix. +- userIdentity.principalId: The user that triggered the change. +- requestParameters.sourceIPAddress: not supported +- responseElements.x-amz-request-id: The request ID of the original change. +- responseElements.x_amz_id_2: The RGW on which the change was made. +- s3.configurationId: The notification ID that created the event. +- s3.bucket.name: The name of the bucket. +- s3.bucket.ownerIdentity.principalId: The owner of the bucket. +- s3.bucket.arn: The ARN of the bucket. +- s3.bucket.id: The ID of the bucket. (This is an extension to the S3 + notification API.) +- s3.object.key: The object key. +- s3.object.size: The object size. +- s3.object.eTag: The object etag. +- s3.object.versionId: The object version, if the bucket is versioned. When a + copy is made, it includes the version of the target object. When a delete + marker is created, it includes the version of the delete marker. +- s3.object.sequencer: The monotonically-increasing identifier of the "change + per object" (hexadecimal format). +- s3.object.metadata: Any metadata set on the object that is sent as + ``x-amz-meta-`` (that is, any metadata set on the object that is sent as an + extension to the S3 notification API). +- s3.object.tags: Any tags set on the object. (This is an extension to the S3 + notification API.) +- s3.eventId: The unique ID of the event, which could be used for acking. (This + is an extension to the S3 notification API.) +- s3.opaqueData: This means that "opaque data" is set in the topic configuration + and is added to all notifications triggered by the topic. (This is an + extension to the S3 notification API.) + +.. _PubSub Module : ../pubsub-module +.. _S3 Notification Compatibility: ../s3-notification-compatibility +.. _AWS Create Topic: https://docs.aws.amazon.com/sns/latest/api/API_CreateTopic.html +.. _Bucket Operations: ../s3/bucketops +.. _S3 CloudEvents Spec: https://github.com/cloudevents/spec/blob/main/cloudevents/adapters/aws-s3.md diff --git a/doc/radosgw/oidc.rst b/doc/radosgw/oidc.rst new file mode 100644 index 000000000..46593f1d8 --- /dev/null +++ b/doc/radosgw/oidc.rst @@ -0,0 +1,97 @@ +=============================== + OpenID Connect Provider in RGW +=============================== + +An entity describing the OpenID Connect Provider needs to be created in RGW, in order to establish trust between the two. + +REST APIs for Manipulating an OpenID Connect Provider +===================================================== + +The following REST APIs can be used for creating and managing an OpenID Connect Provider entity in RGW. + +In order to invoke the REST admin APIs, a user with admin caps needs to be created. + +.. code-block:: javascript + + radosgw-admin --uid TESTER --display-name "TestUser" --access_key TESTER --secret test123 user create + radosgw-admin caps add --uid="TESTER" --caps="oidc-provider=*" + + +CreateOpenIDConnectProvider +--------------------------------- + +Create an OpenID Connect Provider entity in RGW + +Request Parameters +~~~~~~~~~~~~~~~~~~ + +``ClientIDList.member.N`` + +:Description: List of Client Ids that needs access to S3 resources. +:Type: Array of Strings + +``ThumbprintList.member.N`` + +:Description: List of OpenID Connect IDP's server certificates' thumbprints. A maximum of 5 thumbprints are allowed. +:Type: Array of Strings + +``Url`` + +:Description: URL of the IDP. +:Type: String + + +Example:: + POST "<hostname>?Action=Action=CreateOpenIDConnectProvider + &ThumbprintList.list.1=F7D7B3515DD0D319DD219A43A9EA727AD6065287 + &ClientIDList.list.1=app-profile-jsp + &Url=http://localhost:8080/auth/realms/quickstart + + +DeleteOpenIDConnectProvider +--------------------------- + +Deletes an OpenID Connect Provider entity in RGW + +Request Parameters +~~~~~~~~~~~~~~~~~~ + +``OpenIDConnectProviderArn`` + +:Description: ARN of the IDP which is returned by the Create API. +:Type: String + +Example:: + POST "<hostname>?Action=Action=DeleteOpenIDConnectProvider + &OpenIDConnectProviderArn=arn:aws:iam:::oidc-provider/localhost:8080/auth/realms/quickstart + + +GetOpenIDConnectProvider +--------------------------- + +Gets information about an IDP. + +Request Parameters +~~~~~~~~~~~~~~~~~~ + +``OpenIDConnectProviderArn`` + +:Description: ARN of the IDP which is returned by the Create API. +:Type: String + +Example:: + POST "<hostname>?Action=Action=GetOpenIDConnectProvider + &OpenIDConnectProviderArn=arn:aws:iam:::oidc-provider/localhost:8080/auth/realms/quickstart + +ListOpenIDConnectProviders +-------------------------- + +Lists information about all IDPs + +Request Parameters +~~~~~~~~~~~~~~~~~~ + +None + +Example:: + POST "<hostname>?Action=Action=ListOpenIDConnectProviders diff --git a/doc/radosgw/opa.rst b/doc/radosgw/opa.rst new file mode 100644 index 000000000..f1b76b5ef --- /dev/null +++ b/doc/radosgw/opa.rst @@ -0,0 +1,72 @@ +============================== +Open Policy Agent Integration +============================== + +Open Policy Agent (OPA) is a lightweight general-purpose policy engine +that can be co-located with a service. OPA can be integrated as a +sidecar, host-level daemon, or library. + +Services can offload policy decisions to OPA by executing queries. Hence, +policy enforcement can be decoupled from policy decisions. + +Configure OPA +============= + +To configure OPA, load custom policies into OPA that control the resources users +are allowed to access. Relevant data or context can also be loaded into OPA to make decisions. + +Policies and data can be loaded into OPA in the following ways:: + * OPA's RESTful APIs + * OPA's *bundle* feature that downloads policies and data from remote HTTP servers + * Filesystem + +Configure the Ceph Object Gateway +================================= + +The following configuration options are available for OPA integration:: + + rgw use opa authz = {use opa server to authorize client requests} + rgw opa url = {opa server url:opa server port} + rgw opa token = {opa bearer token} + rgw opa verify ssl = {verify opa server ssl certificate} + +How does the RGW-OPA integration work +===================================== + +After a user is authenticated, OPA can be used to check if the user is authorized +to perform the given action on the resource. OPA responds with an allow or deny +decision which is sent back to the RGW which enforces the decision. + +Example request:: + + POST /v1/data/ceph/authz HTTP/1.1 + Host: opa.example.com:8181 + Content-Type: application/json + + { + "input": { + "method": "GET", + "subuser": "subuser", + "user_info": { + "user_id": "john", + "display_name": "John" + }, + "bucket_info": { + "bucket": { + "name": "Testbucket", + "bucket_id": "testbucket" + }, + "owner": "john" + } + } + } + +Response:: + + {"result": true} + +The above is a sample request sent to OPA which contains information about the +user, resource and the action to be performed on the resource. Based on the polices +and data loaded into OPA, it will verify whether the request should be allowed or denied. +In the sample request, RGW makes a POST request to the endpoint */v1/data/ceph/authz*, +where *ceph* is the package name and *authz* is the rule name. diff --git a/doc/radosgw/orphans.rst b/doc/radosgw/orphans.rst new file mode 100644 index 000000000..9a77d60de --- /dev/null +++ b/doc/radosgw/orphans.rst @@ -0,0 +1,115 @@ +================================== +Orphan List and Associated Tooling +================================== + +.. version added:: Luminous + +.. contents:: + +Orphans are RADOS objects that are left behind after their associated +RGW objects are removed. Normally these RADOS objects are removed +automatically, either immediately or through a process known as +"garbage collection". Over the history of RGW, however, there may have +been bugs that prevented these RADOS objects from being deleted, and +these RADOS objects may be consuming space on the Ceph cluster without +being of any use. From the perspective of RGW, we call such RADOS +objects "orphans". + +Orphans Find -- DEPRECATED +-------------------------- + +The `radosgw-admin` tool has/had three subcommands to help manage +orphans, however these subcommands are (or will soon be) +deprecated. These subcommands are: + +:: + # radosgw-admin orphans find ... + # radosgw-admin orphans finish ... + # radosgw-admin orphans list-jobs ... + +There are two key problems with these subcommands, however. First, +these subcommands have not been actively maintained and therefore have +not tracked RGW as it has evolved in terms of features and updates. As +a result the confidence that these subcommands can accurately identify +true orphans is presently low. + +Second, these subcommands store intermediate results on the cluster +itself. This can be problematic when cluster administrators are +confronting insufficient storage space and want to remove orphans as a +means of addressing the issue. The intermediate results could strain +the existing cluster storage capacity even further. + +For these reasons "orphans find" has been deprecated. + +Orphan List +----------- + +Because "orphans find" has been deprecated, RGW now includes an +additional tool -- 'rgw-orphan-list'. When run it will list the +available pools and prompt the user to enter the name of the data +pool. At that point the tool will, perhaps after an extended period of +time, produce a local file containing the RADOS objects from the +designated pool that appear to be orphans. The administrator is free +to examine this file and the decide on a course of action, perhaps +removing those RADOS objects from the designated pool. + +All intermediate results are stored on the local file system rather +than the Ceph cluster. So running the 'rgw-orphan-list' tool should +have no appreciable impact on the amount of cluster storage consumed. + +WARNING: Experimental Status +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The 'rgw-orphan-list' tool is new and therefore currently considered +experimental. The list of orphans produced should be "sanity checked" +before being used for a large delete operation. + +WARNING: Specifying a Data Pool +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +If a pool other than an RGW data pool is specified, the results of the +tool will be erroneous. All RADOS objects found on such a pool will +falsely be designated as orphans. + +WARNING: Unindexed Buckets +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +RGW allows for unindexed buckets, that is buckets that do not maintain +an index of their contents. This is not a typical configuration, but +it is supported. Because the 'rgw-orphan-list' tool uses the bucket +indices to determine what RADOS objects should exist, objects in the +unindexed buckets will falsely be listed as orphans. + + +RADOS List +---------- + +One of the sub-steps in computing a list of orphans is to map each RGW +object into its corresponding set of RADOS objects. This is done using +a subcommand of 'radosgw-admin'. + +:: + # radosgw-admin bucket radoslist [--bucket={bucket-name}] + +The subcommand will produce a list of RADOS objects that support all +of the RGW objects. If a bucket is specified then the subcommand will +only produce a list of RADOS objects that correspond back the RGW +objects in the specified bucket. + +Note: Shared Bucket Markers +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Some administrators will be aware of the coding schemes used to name +the RADOS objects that correspond to RGW objects, which include a +"marker" unique to a given bucket. + +RADOS objects that correspond with the contents of one RGW bucket, +however, may contain a marker that specifies a different bucket. This +behavior is a consequence of the "shallow copy" optimization used by +RGW. When larger objects are copied from bucket to bucket, only the +"head" objects are actually copied, and the tail objects are +shared. Those shared objects will contain the marker of the original +bucket. + +.. _Data Layout in RADOS : ../layout +.. _Pool Placement and Storage Classes : ../placement diff --git a/doc/radosgw/placement.rst b/doc/radosgw/placement.rst new file mode 100644 index 000000000..4931cc87d --- /dev/null +++ b/doc/radosgw/placement.rst @@ -0,0 +1,250 @@ +================================== +Pool Placement and Storage Classes +================================== + +.. contents:: + +Placement Targets +================= + +.. versionadded:: Jewel + +Placement targets control which `Pools`_ are associated with a particular +bucket. A bucket's placement target is selected on creation, and cannot be +modified. The ``radosgw-admin bucket stats`` command will display its +``placement_rule``. + +The zonegroup configuration contains a list of placement targets with an +initial target named ``default-placement``. The zone configuration then maps +each zonegroup placement target name onto its local storage. This zone +placement information includes the ``index_pool`` name for the bucket index, +the ``data_extra_pool`` name for metadata about incomplete multipart uploads, +and a ``data_pool`` name for each storage class. + +.. _storage_classes: + +Storage Classes +=============== + +.. versionadded:: Nautilus + +Storage classes are used to customize the placement of object data. S3 Bucket +Lifecycle rules can automate the transition of objects between storage classes. + +Storage classes are defined in terms of placement targets. Each zonegroup +placement target lists its available storage classes with an initial class +named ``STANDARD``. The zone configuration is responsible for providing a +``data_pool`` pool name for each of the zonegroup's storage classes. + +Zonegroup/Zone Configuration +============================ + +Placement configuration is performed with ``radosgw-admin`` commands on +the zonegroups and zones. + +The zonegroup placement configuration can be queried with: + +:: + + $ radosgw-admin zonegroup get + { + "id": "ab01123f-e0df-4f29-9d71-b44888d67cd5", + "name": "default", + "api_name": "default", + ... + "placement_targets": [ + { + "name": "default-placement", + "tags": [], + "storage_classes": [ + "STANDARD" + ] + } + ], + "default_placement": "default-placement", + ... + } + +The zone placement configuration can be queried with: + +:: + + $ radosgw-admin zone get + { + "id": "557cdcee-3aae-4e9e-85c7-2f86f5eddb1f", + "name": "default", + "domain_root": "default.rgw.meta:root", + ... + "placement_pools": [ + { + "key": "default-placement", + "val": { + "index_pool": "default.rgw.buckets.index", + "storage_classes": { + "STANDARD": { + "data_pool": "default.rgw.buckets.data" + } + }, + "data_extra_pool": "default.rgw.buckets.non-ec", + "index_type": 0 + } + } + ], + ... + } + +.. note:: If you have not done any previous `Multisite Configuration`_, + a ``default`` zone and zonegroup are created for you, and changes + to the zone/zonegroup will not take effect until the Ceph Object + Gateways are restarted. If you have created a realm for multisite, + the zone/zonegroup changes will take effect once the changes are + committed with ``radosgw-admin period update --commit``. + +Adding a Placement Target +------------------------- + +To create a new placement target named ``temporary``, start by adding it to +the zonegroup: + +:: + + $ radosgw-admin zonegroup placement add \ + --rgw-zonegroup default \ + --placement-id temporary + +Then provide the zone placement info for that target: + +:: + + $ radosgw-admin zone placement add \ + --rgw-zone default \ + --placement-id temporary \ + --data-pool default.rgw.temporary.data \ + --index-pool default.rgw.temporary.index \ + --data-extra-pool default.rgw.temporary.non-ec + +Adding a Storage Class +---------------------- + +To add a new storage class named ``GLACIER`` to the ``default-placement`` target, +start by adding it to the zonegroup: + +:: + + $ radosgw-admin zonegroup placement add \ + --rgw-zonegroup default \ + --placement-id default-placement \ + --storage-class GLACIER + +Then provide the zone placement info for that storage class: + +:: + + $ radosgw-admin zone placement add \ + --rgw-zone default \ + --placement-id default-placement \ + --storage-class GLACIER \ + --data-pool default.rgw.glacier.data \ + --compression lz4 + +Customizing Placement +===================== + +Default Placement +----------------- + +By default, new buckets will use the zonegroup's ``default_placement`` target. +This zonegroup setting can be changed with: + +:: + + $ radosgw-admin zonegroup placement default \ + --rgw-zonegroup default \ + --placement-id new-placement + +User Placement +-------------- + +A Ceph Object Gateway user can override the zonegroup's default placement +target by setting a non-empty ``default_placement`` field in the user info. +Similarly, the ``default_storage_class`` can override the ``STANDARD`` +storage class applied to objects by default. + +:: + + $ radosgw-admin user info --uid testid + { + ... + "default_placement": "", + "default_storage_class": "", + "placement_tags": [], + ... + } + +If a zonegroup's placement target contains any ``tags``, users will be unable +to create buckets with that placement target unless their user info contains +at least one matching tag in its ``placement_tags`` field. This can be useful +to restrict access to certain types of storage. + +The ``radosgw-admin`` command can modify these fields directly with: + +:: + + $ radosgw-admin user modify \ + --uid <user-id> \ + --placement-id <default-placement-id> \ + --storage-class <default-storage-class> \ + --tags <tag1,tag2> + +.. _s3_bucket_placement: + +S3 Bucket Placement +------------------- + +When creating a bucket with the S3 protocol, a placement target can be +provided as part of the LocationConstraint to override the default placement +targets from the user and zonegroup. + +Normally, the LocationConstraint must match the zonegroup's ``api_name``: + +:: + + <LocationConstraint>default</LocationConstraint> + +A custom placement target can be added to the ``api_name`` following a colon: + +:: + + <LocationConstraint>default:new-placement</LocationConstraint> + +Swift Bucket Placement +---------------------- + +When creating a bucket with the Swift protocol, a placement target can be +provided in the HTTP header ``X-Storage-Policy``: + +:: + + X-Storage-Policy: new-placement + +Using Storage Classes +===================== + +All placement targets have a ``STANDARD`` storage class which is applied to +new objects by default. The user can override this default with its +``default_storage_class``. + +To create an object in a non-default storage class, provide that storage class +name in an HTTP header with the request. The S3 protocol uses the +``X-Amz-Storage-Class`` header, while the Swift protocol uses the +``X-Object-Storage-Class`` header. + +When using AWS S3 SDKs such as python boto3, it is important that the non-default +storage class will be called as one on of the AWS S3 allowed storage classes, or else the SDK +will drop the request and raise an exception. + +S3 Object Lifecycle Management can then be used to move object data between +storage classes using ``Transition`` actions. + +.. _`Pools`: ../pools +.. _`Multisite Configuration`: ../multisite diff --git a/doc/radosgw/pools.rst b/doc/radosgw/pools.rst new file mode 100644 index 000000000..bb1246c1f --- /dev/null +++ b/doc/radosgw/pools.rst @@ -0,0 +1,57 @@ +===== +Pools +===== + +The Ceph Object Gateway uses several pools for its various storage needs, +which are listed in the Zone object (see ``radosgw-admin zone get``). A +single zone named ``default`` is created automatically with pool names +starting with ``default.rgw.``, but a `Multisite Configuration`_ will have +multiple zones. + +Tuning +====== + +When ``radosgw`` first tries to operate on a zone pool that does not +exist, it will create that pool with the default values from +``osd pool default pg num`` and ``osd pool default pgp num``. These defaults +are sufficient for some pools, but others (especially those listed in +``placement_pools`` for the bucket index and data) will require additional +tuning. We recommend using the `Ceph Placement Group’s per Pool +Calculator <https://old.ceph.com/pgcalc/>`__ to calculate a suitable number of +placement groups for these pools. See +`Pools <http://docs.ceph.com/en/latest/rados/operations/pools/#pools>`__ +for details on pool creation. + +.. _radosgw-pool-namespaces: + +Pool Namespaces +=============== + +.. versionadded:: Luminous + +Pool names particular to a zone follow the naming convention +``{zone-name}.pool-name``. For example, a zone named ``us-east`` will +have the following pools: + +- ``.rgw.root`` + +- ``us-east.rgw.control`` + +- ``us-east.rgw.meta`` + +- ``us-east.rgw.log`` + +- ``us-east.rgw.buckets.index`` + +- ``us-east.rgw.buckets.data`` + +The zone definitions list several more pools than that, but many of those +are consolidated through the use of rados namespaces. For example, all of +the following pool entries use namespaces of the ``us-east.rgw.meta`` pool:: + + "user_keys_pool": "us-east.rgw.meta:users.keys", + "user_email_pool": "us-east.rgw.meta:users.email", + "user_swift_pool": "us-east.rgw.meta:users.swift", + "user_uid_pool": "us-east.rgw.meta:users.uid", + +.. _`Multisite Configuration`: ../multisite diff --git a/doc/radosgw/pubsub-module.rst b/doc/radosgw/pubsub-module.rst new file mode 100644 index 000000000..8a45f4105 --- /dev/null +++ b/doc/radosgw/pubsub-module.rst @@ -0,0 +1,646 @@ +================== +PubSub Sync Module +================== + +.. versionadded:: Nautilus + +.. contents:: + +This sync module provides a publish and subscribe mechanism for the object store modification +events. Events are published into predefined topics. Topics can be subscribed to, and events +can be pulled from them. Events need to be acked. Also, events will expire and disappear +after a period of time. + +A push notification mechanism exists too, currently supporting HTTP, +AMQP0.9.1 and Kafka endpoints. In this case, the events are pushed to an endpoint on top of storing them in Ceph. If events should only be pushed to an endpoint +and do not need to be stored in Ceph, the `Bucket Notification`_ mechanism should be used instead of pubsub sync module. + +A user can create different topics. A topic entity is defined by its name and is per tenant. A +user can only associate its topics (via notification configuration) with buckets it owns. + +In order to publish events for specific bucket a notification entity needs to be created. A +notification can be created on a subset of event types, or for all event types (default). +There can be multiple notifications for any specific topic, and the same topic could be used for multiple notifications. + +A subscription to a topic can also be defined. There can be multiple subscriptions for any +specific topic. + +REST API has been defined to provide configuration and control interfaces for the pubsub +mechanisms. This API has two flavors, one is S3-compatible and one is not. The two flavors can be used +together, although it is recommended to use the S3-compatible one. +The S3-compatible API is similar to the one used in the bucket notification mechanism. + +Events are stored as RGW objects in a special bucket, under a special user (pubsub control user). Events cannot +be accessed directly, but need to be pulled and acked using the new REST API. + +.. toctree:: + :maxdepth: 1 + + S3 Bucket Notification Compatibility <s3-notification-compatibility> + +.. note:: To enable bucket notifications API, the `rgw_enable_apis` configuration parameter should contain: "notifications". + +PubSub Zone Configuration +------------------------- + +The pubsub sync module requires the creation of a new zone in a :ref:`multisite` environment... +First, a master zone must exist (see: :ref:`master-zone-label`), +then a secondary zone should be created (see :ref:`secondary-zone-label`). +In the creation of the secondary zone, its tier type must be set to ``pubsub``: + +:: + + # radosgw-admin zone create --rgw-zonegroup={zone-group-name} \ + --rgw-zone={zone-name} \ + --endpoints={http://fqdn}[,{http://fqdn}] \ + --sync-from-all=0 \ + --sync-from={master-zone-name} \ + --tier-type=pubsub + + +PubSub Zone Configuration Parameters +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +:: + + { + "tenant": <tenant>, # default: <empty> + "uid": <uid>, # default: "pubsub" + "data_bucket_prefix": <prefix> # default: "pubsub-" + "data_oid_prefix": <prefix> # + "events_retention_days": <days> # default: 7 + } + +* ``tenant`` (string) + +The tenant of the pubsub control user. + +* ``uid`` (string) + +The uid of the pubsub control user. + +* ``data_bucket_prefix`` (string) + +The prefix of the bucket name that will be created to store events for specific topic. + +* ``data_oid_prefix`` (string) + +The oid prefix for the stored events. + +* ``events_retention_days`` (integer) + +How many days to keep events that weren't acked. + +Configuring Parameters via CLI +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The tier configuration could be set using the following command: + +:: + + # radosgw-admin zone modify --rgw-zonegroup={zone-group-name} \ + --rgw-zone={zone-name} \ + --tier-config={key}={val}[,{key}={val}] + +Where the ``key`` in the configuration specifies the configuration variable that needs to be updated (from the list above), and +the ``val`` specifies its new value. For example, setting the pubsub control user ``uid`` to ``user_ps``: + +:: + + # radosgw-admin zone modify --rgw-zonegroup={zone-group-name} \ + --rgw-zone={zone-name} \ + --tier-config=uid=pubsub + +A configuration field can be removed by using ``--tier-config-rm={key}``. + + +Topic and Subscription Management via CLI +----------------------------------------- + +Configuration of all topics, associated with a tenant, could be fetched using the following command: + +:: + + # radosgw-admin topic list [--tenant={tenant}] + + +Configuration of a specific topic could be fetched using: + +:: + + # radosgw-admin topic get --topic={topic-name} [--tenant={tenant}] + + +And removed using: + +:: + + # radosgw-admin topic rm --topic={topic-name} [--tenant={tenant}] + + +Configuration of a subscription could be fetched using: + +:: + + # radosgw-admin subscription get --subscription={topic-name} [--tenant={tenant}] + +And removed using: + +:: + + # radosgw-admin subscription rm --subscription={topic-name} [--tenant={tenant}] + + +To fetch all of the events stored in a subcription, use: + +:: + + # radosgw-admin subscription pull --subscription={topic-name} [--marker={last-marker}] [--tenant={tenant}] + + +To ack (and remove) an event from a subscription, use: + +:: + + # radosgw-admin subscription ack --subscription={topic-name} --event-id={event-id} [--tenant={tenant}] + + +PubSub Performance Stats +------------------------- +Same counters are shared between the pubsub sync module and the notification mechanism. + +- ``pubsub_event_triggered``: running counter of events with at lease one topic associated with them +- ``pubsub_event_lost``: running counter of events that had topics and subscriptions associated with them but that were not stored or pushed to any of the subscriptions +- ``pubsub_store_ok``: running counter, for all subscriptions, of stored events +- ``pubsub_store_fail``: running counter, for all subscriptions, of events failed to be stored +- ``pubsub_push_ok``: running counter, for all subscriptions, of events successfully pushed to their endpoint +- ``pubsub_push_fail``: running counter, for all subscriptions, of events failed to be pushed to their endpoint +- ``pubsub_push_pending``: gauge value of events pushed to an endpoint but not acked or nacked yet + +.. note:: + + ``pubsub_event_triggered`` and ``pubsub_event_lost`` are incremented per event, while: + ``pubsub_store_ok``, ``pubsub_store_fail``, ``pubsub_push_ok``, ``pubsub_push_fail``, are incremented per store/push action on each subscriptions. + +PubSub REST API +--------------- + +.. tip:: PubSub REST calls, and only them, should be sent to an RGW which belong to a PubSub zone + +Topics +~~~~~~ + +.. _radosgw-create-a-topic: + +Create a Topic +`````````````` + +This will create a new topic. Topic creation is needed both for both flavors of the API. +Optionally the topic could be provided with push endpoint parameters that would be used later +when an S3-compatible notification is created. +Upon successful request, the response will include the topic ARN that could be later used to reference this topic in an S3-compatible notification request. +To update a topic, use the same command used for topic creation, with the topic name of an existing topic and different endpoint values. + +.. tip:: Any S3-compatible notification already associated with the topic needs to be re-created for the topic update to take effect + +:: + + PUT /topics/<topic-name>[?OpaqueData=<opaque data>][&push-endpoint=<endpoint>[&amqp-exchange=<exchange>][&amqp-ack-level=none|broker|routable][&verify-ssl=true|false][&kafka-ack-level=none|broker][&use-ssl=true|false][&ca-location=<file path>]] + +Request parameters: + +- push-endpoint: URI of an endpoint to send push notification to +- OpaqueData: opaque data is set in the topic configuration and added to all notifications triggered by the ropic + +The endpoint URI may include parameters depending with the type of endpoint: + +- HTTP endpoint + + - URI: ``http[s]://<fqdn>[:<port]`` + - port defaults to: 80/443 for HTTP/S accordingly + - verify-ssl: indicate whether the server certificate is validated by the client or not ("true" by default) + +- AMQP0.9.1 endpoint + + - URI: ``amqp[s]://[<user>:<password>@]<fqdn>[:<port>][/<vhost>]`` + - user/password defaults to: guest/guest + - user/password may only be provided over HTTPS. Topic creation request will be rejected if not + - port defaults to: 5672/5671 for unencrypted/SSL-encrypted connections + - vhost defaults to: "/" + - verify-ssl: indicate whether the server certificate is validated by the client or not ("true" by default) + - if ``ca-location`` is provided, and secure connection is used, the specified CA will be used, instead of the default one, to authenticate the broker + - amqp-exchange: the exchanges must exist and be able to route messages based on topics (mandatory parameter for AMQP0.9.1). Different topics pointing to the same endpoint must use the same exchange + - amqp-ack-level: no end2end acking is required, as messages may persist in the broker before delivered into their final destination. Three ack methods exist: + + - "none": message is considered "delivered" if sent to broker + - "broker": message is considered "delivered" if acked by broker (default) + - "routable": message is considered "delivered" if broker can route to a consumer + +.. tip:: The topic-name (see :ref:`radosgw-create-a-topic`) is used for the AMQP topic ("routing key" for a topic exchange) + +- Kafka endpoint + + - URI: ``kafka://[<user>:<password>@]<fqdn>[:<port]`` + - if ``use-ssl`` is set to "true", secure connection will be used for connecting with the broker ("false" by default) + - if ``ca-location`` is provided, and secure connection is used, the specified CA will be used, instead of the default one, to authenticate the broker + - user/password may only be provided over HTTPS. Topic creation request will be rejected if not + - user/password may only be provided together with ``use-ssl``, connection to the broker would fail if not + - port defaults to: 9092 + - kafka-ack-level: no end2end acking is required, as messages may persist in the broker before delivered into their final destination. Two ack methods exist: + + - "none": message is considered "delivered" if sent to broker + - "broker": message is considered "delivered" if acked by broker (default) + +The topic ARN in the response will have the following format: + +:: + + arn:aws:sns:<zone-group>:<tenant>:<topic> + +Get Topic Information +````````````````````` + +Returns information about specific topic. This includes subscriptions to that topic, and push-endpoint information, if provided. + +:: + + GET /topics/<topic-name> + +Response will have the following format (JSON): + +:: + + { + "topic":{ + "user":"", + "name":"", + "dest":{ + "bucket_name":"", + "oid_prefix":"", + "push_endpoint":"", + "push_endpoint_args":"", + "push_endpoint_topic":"", + "stored_secret":"", + "persistent":"" + }, + "arn":"" + "opaqueData":"" + }, + "subs":[] + } + +- topic.user: name of the user that created the topic +- name: name of the topic +- dest.bucket_name: not used +- dest.oid_prefix: not used +- dest.push_endpoint: in case of S3-compliant notifications, this value will be used as the push-endpoint URL +- if push-endpoint URL contain user/password information, request must be made over HTTPS. Topic get request will be rejected if not +- dest.push_endpoint_args: in case of S3-compliant notifications, this value will be used as the push-endpoint args +- dest.push_endpoint_topic: in case of S3-compliant notifications, this value will hold the topic name as sent to the endpoint (may be different than the internal topic name) +- topic.arn: topic ARN +- subs: list of subscriptions associated with this topic + +Delete Topic +```````````` + +:: + + DELETE /topics/<topic-name> + +Delete the specified topic. + +List Topics +``````````` + +List all topics associated with a tenant. + +:: + + GET /topics + +- if push-endpoint URL contain user/password information, in any of the topic, request must be made over HTTPS. Topic list request will be rejected if not + +S3-Compliant Notifications +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Detailed under: `Bucket Operations`_. + +.. note:: + + - Notification creation will also create a subscription for pushing/pulling events + - The generated subscription's name will have the same as the notification Id, and could be used later to fetch and ack events with the subscription API. + - Notification deletion will deletes all generated subscriptions + - In case that bucket deletion implicitly deletes the notification, + the associated subscription will not be deleted automatically (any events of the deleted bucket could still be access), + and will have to be deleted explicitly with the subscription deletion API + - Filtering based on metadata (which is an extension to S3) is not supported, and such rules will be ignored + - Filtering based on tags (which is an extension to S3) is not supported, and such rules will be ignored + + +Non S3-Compliant Notifications +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Create a Notification +````````````````````` + +This will create a publisher for a specific bucket into a topic. + +:: + + PUT /notifications/bucket/<bucket>?topic=<topic-name>[&events=<event>[,<event>]] + +Request parameters: + +- topic-name: name of topic +- event: event type (string), one of: ``OBJECT_CREATE``, ``OBJECT_DELETE``, ``DELETE_MARKER_CREATE`` + +Delete Notification Information +``````````````````````````````` + +Delete publisher from a specific bucket into a specific topic. + +:: + + DELETE /notifications/bucket/<bucket>?topic=<topic-name> + +Request parameters: + +- topic-name: name of topic + +.. note:: When the bucket is deleted, any notification defined on it is also deleted + +List Notifications +`````````````````` + +List all topics with associated events defined on a bucket. + +:: + + GET /notifications/bucket/<bucket> + +Response will have the following format (JSON): + +:: + + {"topics":[ + { + "topic":{ + "user":"", + "name":"", + "dest":{ + "bucket_name":"", + "oid_prefix":"", + "push_endpoint":"", + "push_endpoint_args":"", + "push_endpoint_topic":"" + } + "arn":"" + }, + "events":[] + } + ]} + +Subscriptions +~~~~~~~~~~~~~ + +Create a Subscription +````````````````````` + +Creates a new subscription. + +:: + + PUT /subscriptions/<sub-name>?topic=<topic-name>[?push-endpoint=<endpoint>[&amqp-exchange=<exchange>][&amqp-ack-level=none|broker|routable][&verify-ssl=true|false][&kafka-ack-level=none|broker][&ca-location=<file path>]] + +Request parameters: + +- topic-name: name of topic +- push-endpoint: URI of endpoint to send push notification to + +The endpoint URI may include parameters depending with the type of endpoint: + +- HTTP endpoint + + - URI: ``http[s]://<fqdn>[:<port]`` + - port defaults to: 80/443 for HTTP/S accordingly + - verify-ssl: indicate whether the server certificate is validated by the client or not ("true" by default) + +- AMQP0.9.1 endpoint + + - URI: ``amqp://[<user>:<password>@]<fqdn>[:<port>][/<vhost>]`` + - user/password defaults to : guest/guest + - port defaults to: 5672 + - vhost defaults to: "/" + - amqp-exchange: the exchanges must exist and be able to route messages based on topics (mandatory parameter for AMQP0.9.1) + - amqp-ack-level: no end2end acking is required, as messages may persist in the broker before delivered into their final destination. Three ack methods exist: + + - "none": message is considered "delivered" if sent to broker + - "broker": message is considered "delivered" if acked by broker (default) + - "routable": message is considered "delivered" if broker can route to a consumer + +- Kafka endpoint + + - URI: ``kafka://[<user>:<password>@]<fqdn>[:<port]`` + - if ``ca-location`` is provided, secure connection will be used for connection with the broker + - user/password may only be provided over HTTPS. Topic creation request will be rejected if not + - user/password may only be provided together with ``ca-location``. Topic creation request will be rejected if not + - port defaults to: 9092 + - kafka-ack-level: no end2end acking is required, as messages may persist in the broker before delivered into their final destination. Two ack methods exist: + + - "none": message is considered "delivered" if sent to broker + - "broker": message is considered "delivered" if acked by broker (default) + + +Get Subscription Information +```````````````````````````` + +Returns information about specific subscription. + +:: + + GET /subscriptions/<sub-name> + +Response will have the following format (JSON): + +:: + + { + "user":"", + "name":"", + "topic":"", + "dest":{ + "bucket_name":"", + "oid_prefix":"", + "push_endpoint":"", + "push_endpoint_args":"", + "push_endpoint_topic":"" + } + "s3_id":"" + } + +- user: name of the user that created the subscription +- name: name of the subscription +- topic: name of the topic the subscription is associated with +- dest.bucket_name: name of the bucket storing the events +- dest.oid_prefix: oid prefix for the events stored in the bucket +- dest.push_endpoint: in case of S3-compliant notifications, this value will be used as the push-endpoint URL +- if push-endpoint URL contain user/password information, request must be made over HTTPS. Topic get request will be rejected if not +- dest.push_endpoint_args: in case of S3-compliant notifications, this value will be used as the push-endpoint args +- dest.push_endpoint_topic: in case of S3-compliant notifications, this value will hold the topic name as sent to the endpoint (may be different than the internal topic name) +- s3_id: in case of S3-compliant notifications, this will hold the notification name that created the subscription + +Delete Subscription +``````````````````` + +Removes a subscription. + +:: + + DELETE /subscriptions/<sub-name> + +Events +~~~~~~ + +Pull Events +``````````` + +Pull events sent to a specific subscription. + +:: + + GET /subscriptions/<sub-name>?events[&max-entries=<max-entries>][&marker=<marker>] + +Request parameters: + +- marker: pagination marker for list of events, if not specified will start from the oldest +- max-entries: max number of events to return + +The response will hold information on the current marker and whether there are more events not fetched: + +:: + + {"next_marker":"","is_truncated":"",...} + + +The actual content of the response is depended with how the subscription was created. +In case that the subscription was created via an S3-compatible notification, +the events will have an S3-compatible record format (JSON): + +:: + + {"Records":[ + { + "eventVersion":"2.1" + "eventSource":"aws:s3", + "awsRegion":"", + "eventTime":"", + "eventName":"", + "userIdentity":{ + "principalId":"" + }, + "requestParameters":{ + "sourceIPAddress":"" + }, + "responseElements":{ + "x-amz-request-id":"", + "x-amz-id-2":"" + }, + "s3":{ + "s3SchemaVersion":"1.0", + "configurationId":"", + "bucket":{ + "name":"", + "ownerIdentity":{ + "principalId":"" + }, + "arn":"", + "id":"" + }, + "object":{ + "key":"", + "size":"0", + "eTag":"", + "versionId":"", + "sequencer":"", + "metadata":[], + "tags":[] + } + }, + "eventId":"", + "opaqueData":"", + } + ]} + +- awsRegion: zonegroup +- eventTime: timestamp indicating when the event was triggered +- eventName: either ``s3:ObjectCreated:``, or ``s3:ObjectRemoved:`` +- userIdentity: not supported +- requestParameters: not supported +- responseElements: not supported +- s3.configurationId: notification ID that created the subscription for the event +- s3.bucket.name: name of the bucket +- s3.bucket.ownerIdentity.principalId: owner of the bucket +- s3.bucket.arn: ARN of the bucket +- s3.bucket.id: Id of the bucket (an extension to the S3 notification API) +- s3.object.key: object key +- s3.object.size: not supported +- s3.object.eTag: object etag +- s3.object.version: object version in case of versioned bucket +- s3.object.sequencer: monotonically increasing identifier of the change per object (hexadecimal format) +- s3.object.metadata: not supported (an extension to the S3 notification API) +- s3.object.tags: not supported (an extension to the S3 notification API) +- s3.eventId: unique ID of the event, that could be used for acking (an extension to the S3 notification API) +- s3.opaqueData: opaque data is set in the topic configuration and added to all notifications triggered by the ropic (an extension to the S3 notification API) + +In case that the subscription was not created via a non S3-compatible notification, +the events will have the following event format (JSON): + +:: + + {"events":[ + { + "id":"", + "event":"", + "timestamp":"", + "info":{ + "attrs":{ + "mtime":"" + }, + "bucket":{ + "bucket_id":"", + "name":"", + "tenant":"" + }, + "key":{ + "instance":"", + "name":"" + } + } + } + ]} + +- id: unique ID of the event, that could be used for acking +- event: one of: ``OBJECT_CREATE``, ``OBJECT_DELETE``, ``DELETE_MARKER_CREATE`` +- timestamp: timestamp indicating when the event was sent +- info.attrs.mtime: timestamp indicating when the event was triggered +- info.bucket.bucket_id: id of the bucket +- info.bucket.name: name of the bucket +- info.bucket.tenant: tenant the bucket belongs to +- info.key.instance: object version in case of versioned bucket +- info.key.name: object key + +Ack Event +````````` + +Ack event so that it can be removed from the subscription history. + +:: + + POST /subscriptions/<sub-name>?ack&event-id=<event-id> + +Request parameters: + +- event-id: id of event to be acked + +.. _Bucket Notification : ../notifications +.. _Bucket Operations: ../s3/bucketops diff --git a/doc/radosgw/qat-accel.rst b/doc/radosgw/qat-accel.rst new file mode 100644 index 000000000..4e661c63d --- /dev/null +++ b/doc/radosgw/qat-accel.rst @@ -0,0 +1,94 @@ +=============================================== +QAT Acceleration for Encryption and Compression +=============================================== + +Intel QAT (QuickAssist Technology) can provide extended accelerated encryption +and compression services by offloading the actual encryption and compression +request(s) to the hardware QuickAssist accelerators, which are more efficient +in terms of cost and power than general purpose CPUs for those specific +compute-intensive workloads. + +See `QAT Support for Compression`_ and `QAT based Encryption for RGW`_. + + +QAT in the Software Stack +========================= + +Application developers can access QuickAssist features through the QAT API. +The QAT API is the top-level API for QuickAssist technology, and enables easy +interfacing between the customer application and the QuickAssist acceleration +driver. + +The QAT API accesses the QuickAssist driver, which in turn drives the +QuickAssist Accelerator hardware. The QuickAssist driver is responsible for +exposing the acceleration services to the application software. + +A user can write directly to the QAT API, or the use of QAT can be done via +frameworks that have been enabled by others including Intel (for example, zlib*, +OpenSSL* libcrypto*, and the Linux* Kernel Crypto Framework). + +QAT Environment Setup +===================== +1. QuickAssist Accelerator hardware is necessary to make use of accelerated + encryption and compression services. And QAT driver in kernel space have to + be loaded to drive the hardware. + +The driver package can be downloaded from `Intel Quickassist Technology`_. + +2. The implementation for QAT based encryption is directly base on QAT API which + is included the driver package. But QAT support for compression depends on + QATzip project, which is a user space library which builds on top of the QAT + API. Currently, QATzip speeds up gzip compression and decompression at the + time of writing. + +See `QATzip`_. + +Implementation +============== +1. QAT based Encryption for RGW + +`OpenSSL support for RGW encryption`_ has been merged into Ceph, and Intel also +provides one `QAT Engine`_ for OpenSSL. So, theoretically speaking, QAT based +encryption in Ceph can be directly supported through OpenSSl+QAT Engine. + +But the QAT Engine for OpenSSL currently supports chained operations only, and +so Ceph will not be able to utilize QAT hardware feature for crypto operations +based on OpenSSL crypto plugin. As a result, one QAT plugin based on native +QAT API is added into crypto framework. + +2. QAT Support for Compression + +As mentioned above, QAT support for compression is based on QATzip library in +user space, which is designed to take full advantage of the performance provided +by QuickAssist Technology. Unlike QAT based encryption, QAT based compression +is supported through a tool class for QAT acceleration rather than a compressor +plugin. The common tool class can transparently accelerate the existing compression +types, but only zlib compressor can be supported at the time of writing. So +user is allowed to use it to speed up zlib compressor as long as the QAT +hardware is available and QAT is capable to handle it. + +Configuration +============= +1. QAT based Encryption for RGW + +Edit the Ceph configuration file to make use of QAT based crypto plugin:: + + plugin crypto accelerator = crypto_qat + +2. QAT Support for Compression + +One CMake option have to be used to trigger QAT based compression:: + + -DWITH_QATZIP=ON + +Edit the Ceph configuration file to enable QAT support for compression:: + + qat compressor enabled=true + + +.. _QAT Support for Compression: https://github.com/ceph/ceph/pull/19714 +.. _QAT based Encryption for RGW: https://github.com/ceph/ceph/pull/19386 +.. _Intel Quickassist Technology: https://01.org/intel-quickassist-technology +.. _QATzip: https://github.com/intel/QATzip +.. _OpenSSL support for RGW encryption: https://github.com/ceph/ceph/pull/15168 +.. _QAT Engine: https://github.com/intel/QAT_Engine diff --git a/doc/radosgw/rgw-cache.rst b/doc/radosgw/rgw-cache.rst new file mode 100644 index 000000000..9b4c96f1e --- /dev/null +++ b/doc/radosgw/rgw-cache.rst @@ -0,0 +1,150 @@ +========================== +RGW Data caching and CDN +========================== + +.. versionadded:: Octopus + +.. contents:: + +This feature adds to RGW the ability to securely cache objects and offload the workload from the cluster, using Nginx. +After an object is accessed the first time it will be stored in the Nginx cache directory. +When data is already cached, it need not be fetched from RGW. A permission check will be made against RGW to ensure the requesting user has access. +This feature is based on some Nginx modules, ngx_http_auth_request_module, https://github.com/kaltura/nginx-aws-auth-module, Openresty for Lua capabilities. + +Currently, this feature will cache only AWSv4 requests (only s3 requests), caching-in the output of the 1st GET request +and caching-out on subsequent GET requests, passing thru transparently PUT,POST,HEAD,DELETE and COPY requests. + + +The feature introduces 2 new APIs: Auth and Cache. + +New APIs +------------------------- + +There are 2 new APIs for this feature: + +Auth API - The cache uses this to validate that a user can access the cached data + +Cache API - Adds the ability to override securely Range header, that way Nginx can use it is own smart cache on top of S3: +https://www.nginx.com/blog/smart-efficient-byte-range-caching-nginx/ +Using this API gives the ability to read ahead objects when clients asking a specific range from the object. +On subsequent accesses to the cached object, Nginx will satisfy requests for already-cached ranges from the cache. Uncached ranges will be read from RGW (and cached). + +Auth API +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +This API Validates a specific authenticated access being made to the cache, using RGW's knowledge of the client credentials and stored access policy. +Returns success if the encapsulated request would be granted. + +Cache API +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +This API is meant to allow changing signed Range headers using a privileged user, cache user. + +Creating cache user + +:: + +$ radosgw-admin user create --uid=<uid for cache user> --display-name="cache user" --caps="amz-cache=read" + +This user can send to the RGW the Cache API header ``X-Amz-Cache``, this header contains the headers from the original request(before changing the Range header). +It means that ``X-Amz-Cache`` built from several headers. +The headers that are building the ``X-Amz-Cache`` header are separated by char with ASCII code 177 and the header name and value are separated by char ASCII code 178. +The RGW will check that the cache user is an authorized user and if it is a cache user, +if yes it will use the ``X-Amz-Cache`` to revalidate that the user has permissions, using the headers from the X-Amz-Cache. +During this flow, the RGW will override the Range header. + + +Using Nginx with RGW +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Download the source of Openresty: + +:: + +$ wget https://openresty.org/download/openresty-1.15.8.3.tar.gz + +git clone the AWS auth Nginx module: + +:: + +$ git clone https://github.com/kaltura/nginx-aws-auth-module + +untar the openresty package: + +:: + +$ tar xvzf openresty-1.15.8.3.tar.gz +$ cd openresty-1.15.8.3 + +Compile openresty, Make sure that you have pcre lib and openssl lib: + +:: + +$ sudo yum install pcre-devel openssl-devel gcc curl zlib-devel nginx +$ ./configure --add-module=<the nginx-aws-auth-module dir> --with-http_auth_request_module --with-http_slice_module --conf-path=/etc/nginx/nginx.conf +$ gmake -j $(nproc) +$ sudo gmake install +$ sudo ln -sf /usr/local/openresty/bin/openresty /usr/bin/nginx + +Put in-place your Nginx configuration files and edit them according to your environment: + +All Nginx conf files are under: https://github.com/ceph/ceph/tree/master/examples/rgw-cache + +`nginx.conf` should go to `/etc/nginx/nginx.conf` + +`nginx-lua-file.lua` should go to `/etc/nginx/nginx-lua-file.lua` + +`nginx-default.conf` should go to `/etc/nginx/conf.d/nginx-default.conf` + +The parameters that are most likely to require adjustment according to the environment are located in the file `nginx-default.conf` + +Modify the example values of *proxy_cache_path* and *max_size* at: + +:: + + proxy_cache_path /data/cache levels=2:2:2 keys_zone=mycache:999m max_size=20G inactive=1d use_temp_path=off; + + +And modify the example *server* values to point to the RGWs URIs: + +:: + + server rgw1:8000 max_fails=2 fail_timeout=5s; + server rgw2:8000 max_fails=2 fail_timeout=5s; + server rgw3:8000 max_fails=2 fail_timeout=5s; + +| It is important to substitute the *access key* and *secret key* located in the `nginx.conf` with those belong to the user with the `amz-cache` caps +| for example, create the `cache` user as following: + +:: + + radosgw-admin user create --uid=cacheuser --display-name="cache user" --caps="amz-cache=read" --access-key <access> --secret <secret> + +It is possible to use Nginx slicing which is a better method for streaming purposes. + +For using slice you should use `nginx-slicing.conf` and not `nginx-default.conf` + +Further information about Nginx slicing: + +https://docs.nginx.com/nginx/admin-guide/content-cache/content-caching/#byte-range-caching + + +If you do not want to use the prefetch caching, It is possible to replace `nginx-default.conf` with `nginx-noprefetch.conf` +Using `noprefetch` means that if the client is sending range request of 0-4095 and then 0-4096 Nginx will cache those requests separately, So it will need to fetch those requests twice. + + +Run Nginx(openresty): + +:: + +$ sudo systemctl restart nginx + +Appendix +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +**A note about performance:** In certain instances like development environment, disabling the authentication by commenting the following line in `nginx-default.conf`: + +:: + + #auth_request /authentication; + +may (depending on the hardware) increases the performance significantly as it forgoes the auth API calls to radosgw. diff --git a/doc/radosgw/role.rst b/doc/radosgw/role.rst new file mode 100644 index 000000000..2e2697f57 --- /dev/null +++ b/doc/radosgw/role.rst @@ -0,0 +1,523 @@ +====== + Role +====== + +A role is similar to a user and has permission policies attached to it, that determine what a role can or can not do. A role can be assumed by any identity that needs it. If a user assumes a role, a set of dynamically created temporary credentials are returned to the user. A role can be used to delegate access to users, applications, services that do not have permissions to access some s3 resources. + +The following radosgw-admin commands can be used to create/ delete/ update a role and permissions asscociated with a role. + +Create a Role +------------- + +To create a role, execute the following:: + + radosgw-admin role create --role-name={role-name} [--path=="{path to the role}"] [--assume-role-policy-doc={trust-policy-document}] + +Request Parameters +~~~~~~~~~~~~~~~~~~ + +``role-name`` + +:Description: Name of the role. +:Type: String + +``path`` + +:Description: Path to the role. The default value is a slash(/). +:Type: String + +``assume-role-policy-doc`` + +:Description: The trust relationship policy document that grants an entity permission to assume the role. +:Type: String + +For example:: + + radosgw-admin role create --role-name=S3Access1 --path=/application_abc/component_xyz/ --assume-role-policy-doc=\{\"Version\":\"2012-10-17\",\"Statement\":\[\{\"Effect\":\"Allow\",\"Principal\":\{\"AWS\":\[\"arn:aws:iam:::user/TESTER\"\]\},\"Action\":\[\"sts:AssumeRole\"\]\}\]\} + +.. code-block:: javascript + + { + "id": "ca43045c-082c-491a-8af1-2eebca13deec", + "name": "S3Access1", + "path": "/application_abc/component_xyz/", + "arn": "arn:aws:iam:::role/application_abc/component_xyz/S3Access1", + "create_date": "2018-10-17T10:18:29.116Z", + "max_session_duration": 3600, + "assume_role_policy_document": "{\"Version\":\"2012-10-17\",\"Statement\":[{\"Effect\":\"Allow\",\"Principal\":{\"AWS\":[\"arn:aws:iam:::user/TESTER\"]},\"Action\":[\"sts:AssumeRole\"]}]}" + } + + +Delete a Role +------------- + +To delete a role, execute the following:: + + radosgw-admin role rm --role-name={role-name} + +Request Parameters +~~~~~~~~~~~~~~~~~~ + +``role-name`` + +:Description: Name of the role. +:Type: String + +For example:: + + radosgw-admin role rm --role-name=S3Access1 + +Note: A role can be deleted only when it doesn't have any permission policy attached to it. + +Get a Role +---------- + +To get information about a role, execute the following:: + + radosgw-admin role get --role-name={role-name} + +Request Parameters +~~~~~~~~~~~~~~~~~~ + +``role-name`` + +:Description: Name of the role. +:Type: String + +For example:: + + radosgw-admin role get --role-name=S3Access1 + +.. code-block:: javascript + + { + "id": "ca43045c-082c-491a-8af1-2eebca13deec", + "name": "S3Access1", + "path": "/application_abc/component_xyz/", + "arn": "arn:aws:iam:::role/application_abc/component_xyz/S3Access1", + "create_date": "2018-10-17T10:18:29.116Z", + "max_session_duration": 3600, + "assume_role_policy_document": "{\"Version\":\"2012-10-17\",\"Statement\":[{\"Effect\":\"Allow\",\"Principal\":{\"AWS\":[\"arn:aws:iam:::user/TESTER\"]},\"Action\":[\"sts:AssumeRole\"]}]}" + } + + +List Roles +---------- + +To list roles with a specified path prefix, execute the following:: + + radosgw-admin role list [--path-prefix ={path prefix}] + +Request Parameters +~~~~~~~~~~~~~~~~~~ + +``path-prefix`` + +:Description: Path prefix for filtering roles. If this is not specified, all roles are listed. +:Type: String + +For example:: + + radosgw-admin role list --path-prefix="/application" + +.. code-block:: javascript + + [ + { + "id": "3e1c0ff7-8f2b-456c-8fdf-20f428ba6a7f", + "name": "S3Access1", + "path": "/application_abc/component_xyz/", + "arn": "arn:aws:iam:::role/application_abc/component_xyz/S3Access1", + "create_date": "2018-10-17T10:32:01.881Z", + "max_session_duration": 3600, + "assume_role_policy_document": "{\"Version\":\"2012-10-17\",\"Statement\":[{\"Effect\":\"Allow\",\"Principal\":{\"AWS\":[\"arn:aws:iam:::user/TESTER\"]},\"Action\":[\"sts:AssumeRole\"]}]}" + } + ] + + +Update Assume Role Policy Document of a role +-------------------------------------------- + +To modify a role's assume role policy document, execute the following:: + + radosgw-admin role modify --role-name={role-name} --assume-role-policy-doc={trust-policy-document} + +Request Parameters +~~~~~~~~~~~~~~~~~~ + +``role-name`` + +:Description: Name of the role. +:Type: String + +``assume-role-policy-doc`` + +:Description: The trust relationship policy document that grants an entity permission to assume the role. +:Type: String + +For example:: + + radosgw-admin role modify --role-name=S3Access1 --assume-role-policy-doc=\{\"Version\":\"2012-10-17\",\"Statement\":\[\{\"Effect\":\"Allow\",\"Principal\":\{\"AWS\":\[\"arn:aws:iam:::user/TESTER2\"\]\},\"Action\":\[\"sts:AssumeRole\"\]\}\]\} + +.. code-block:: javascript + + { + "id": "ca43045c-082c-491a-8af1-2eebca13deec", + "name": "S3Access1", + "path": "/application_abc/component_xyz/", + "arn": "arn:aws:iam:::role/application_abc/component_xyz/S3Access1", + "create_date": "2018-10-17T10:18:29.116Z", + "max_session_duration": 3600, + "assume_role_policy_document": "{\"Version\":\"2012-10-17\",\"Statement\":[{\"Effect\":\"Allow\",\"Principal\":{\"AWS\":[\"arn:aws:iam:::user/TESTER2\"]},\"Action\":[\"sts:AssumeRole\"]}]}" + } + + +In the above example, we are modifying the Principal from TESTER to TESTER2 in its assume role policy document. + +Add/ Update a Policy attached to a Role +--------------------------------------- + +To add or update the inline policy attached to a role, execute the following:: + + radosgw-admin role policy put --role-name={role-name} --policy-name={policy-name} --policy-doc={permission-policy-doc} + +Request Parameters +~~~~~~~~~~~~~~~~~~ + +``role-name`` + +:Description: Name of the role. +:Type: String + +``policy-name`` + +:Description: Name of the policy. +:Type: String + +``policy-doc`` + +:Description: The Permission policy document. +:Type: String + +For example:: + + radosgw-admin role-policy put --role-name=S3Access1 --policy-name=Policy1 --policy-doc=\{\"Version\":\"2012-10-17\",\"Statement\":\[\{\"Effect\":\"Allow\",\"Action\":\[\"s3:*\"\],\"Resource\":\"arn:aws:s3:::example_bucket\"\}\]\} + +In the above example, we are attaching a policy 'Policy1' to role 'S3Access1', which allows all s3 actions on 'example_bucket'. + +List Permission Policy Names attached to a Role +----------------------------------------------- + +To list the names of permission policies attached to a role, execute the following:: + + radosgw-admin role policy get --role-name={role-name} + +Request Parameters +~~~~~~~~~~~~~~~~~~ + +``role-name`` + +:Description: Name of the role. +:Type: String + +For example:: + + radosgw-admin role-policy list --role-name=S3Access1 + +.. code-block:: javascript + + [ + "Policy1" + ] + + +Get Permission Policy attached to a Role +---------------------------------------- + +To get a specific permission policy attached to a role, execute the following:: + + radosgw-admin role policy get --role-name={role-name} --policy-name={policy-name} + +Request Parameters +~~~~~~~~~~~~~~~~~~ + +``role-name`` + +:Description: Name of the role. +:Type: String + +``policy-name`` + +:Description: Name of the policy. +:Type: String + +For example:: + + radosgw-admin role-policy get --role-name=S3Access1 --policy-name=Policy1 + +.. code-block:: javascript + + { + "Permission policy": "{\"Version\":\"2012-10-17\",\"Statement\":[{\"Effect\":\"Allow\",\"Action\":[\"s3:*\"],\"Resource\":\"arn:aws:s3:::example_bucket\"}]}" + } + + +Delete Policy attached to a Role +-------------------------------- + +To delete permission policy attached to a role, execute the following:: + + radosgw-admin role policy rm --role-name={role-name} --policy-name={policy-name} + +Request Parameters +~~~~~~~~~~~~~~~~~~ + +``role-name`` + +:Description: Name of the role. +:Type: String + +``policy-name`` + +:Description: Name of the policy. +:Type: String + +For example:: + + radosgw-admin role-policy get --role-name=S3Access1 --policy-name=Policy1 + + +REST APIs for Manipulating a Role +================================= + +In addition to the above radosgw-admin commands, the following REST APIs can be used for manipulating a role. For the request parameters and their explanations, refer to the sections above. + +In order to invoke the REST admin APIs, a user with admin caps needs to be created. + +.. code-block:: javascript + + radosgw-admin --uid TESTER --display-name "TestUser" --access_key TESTER --secret test123 user create + radosgw-admin caps add --uid="TESTER" --caps="roles=*" + + +Create a Role +------------- + +Example:: + POST "<hostname>?Action=CreateRole&RoleName=S3Access&Path=/application_abc/component_xyz/&AssumeRolePolicyDocument=\{\"Version\":\"2012-10-17\",\"Statement\":\[\{\"Effect\":\"Allow\",\"Principal\":\{\"AWS\":\[\"arn:aws:iam:::user/TESTER\"\]\},\"Action\":\[\"sts:AssumeRole\"\]\}\]\}" + +.. code-block:: XML + + <role> + <id>8f41f4e0-7094-4dc0-ac20-074a881ccbc5</id> + <name>S3Access</name> + <path>/application_abc/component_xyz/</path> + <arn>arn:aws:iam:::role/application_abc/component_xyz/S3Access</arn> + <create_date>2018-10-23T07:43:42.811Z</create_date> + <max_session_duration>3600</max_session_duration> + <assume_role_policy_document>{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":{"AWS":["arn:aws:iam:::user/TESTER"]},"Action":["sts:AssumeRole"]}]}</assume_role_policy_document> + </role> + + +Delete a Role +------------- + +Example:: + POST "<hostname>?Action=DeleteRole&RoleName=S3Access" + +Note: A role can be deleted only when it doesn't have any permission policy attached to it. + +Get a Role +---------- + +Example:: + POST "<hostname>?Action=GetRole&RoleName=S3Access" + +.. code-block:: XML + + <role> + <id>8f41f4e0-7094-4dc0-ac20-074a881ccbc5</id> + <name>S3Access</name> + <path>/application_abc/component_xyz/</path> + <arn>arn:aws:iam:::role/application_abc/component_xyz/S3Access</arn> + <create_date>2018-10-23T07:43:42.811Z</create_date> + <max_session_duration>3600</max_session_duration> + <assume_role_policy_document>{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":{"AWS":["arn:aws:iam:::user/TESTER"]},"Action":["sts:AssumeRole"]}]}</assume_role_policy_document> + </role> + + +List Roles +---------- + +Example:: + POST "<hostname>?Action=ListRoles&RoleName=S3Access&PathPrefix=/application" + +.. code-block:: XML + + <role> + <id>8f41f4e0-7094-4dc0-ac20-074a881ccbc5</id> + <name>S3Access</name> + <path>/application_abc/component_xyz/</path> + <arn>arn:aws:iam:::role/application_abc/component_xyz/S3Access</arn> + <create_date>2018-10-23T07:43:42.811Z</create_date> + <max_session_duration>3600</max_session_duration> + <assume_role_policy_document>{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":{"AWS":["arn:aws:iam:::user/TESTER"]},"Action":["sts:AssumeRole"]}]}</assume_role_policy_document> + </role> + + +Update Assume Role Policy Document +---------------------------------- + +Example:: + POST "<hostname>?Action=UpdateAssumeRolePolicy&RoleName=S3Access&PolicyDocument=\{\"Version\":\"2012-10-17\",\"Statement\":\[\{\"Effect\":\"Allow\",\"Principal\":\{\"AWS\":\[\"arn:aws:iam:::user/TESTER2\"\]\},\"Action\":\[\"sts:AssumeRole\"\]\}\]\}" + +Add/ Update a Policy attached to a Role +--------------------------------------- + +Example:: + POST "<hostname>?Action=PutRolePolicy&RoleName=S3Access&PolicyName=Policy1&PolicyDocument=\{\"Version\":\"2012-10-17\",\"Statement\":\[\{\"Effect\":\"Allow\",\"Action\":\[\"s3:CreateBucket\"\],\"Resource\":\"arn:aws:s3:::example_bucket\"\}\]\}" + +List Permission Policy Names attached to a Role +----------------------------------------------- + +Example:: + POST "<hostname>?Action=ListRolePolicies&RoleName=S3Access" + +.. code-block:: XML + + <PolicyNames> + <member>Policy1</member> + </PolicyNames> + + +Get Permission Policy attached to a Role +---------------------------------------- + +Example:: + POST "<hostname>?Action=GetRolePolicy&RoleName=S3Access&PolicyName=Policy1" + +.. code-block:: XML + + <GetRolePolicyResult> + <PolicyName>Policy1</PolicyName> + <RoleName>S3Access</RoleName> + <Permission_policy>{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Action":["s3:CreateBucket"],"Resource":"arn:aws:s3:::example_bucket"}]}</Permission_policy> + </GetRolePolicyResult> + + +Delete Policy attached to a Role +-------------------------------- + +Example:: + POST "<hostname>?Action=DeleteRolePolicy&RoleName=S3Access&PolicyName=Policy1" + +Tag a role +---------- +A role can have multivalued tags attached to it. These tags can be passed in as part of CreateRole REST API also. +AWS does not support multi-valued role tags. + +Example:: + POST "<hostname>?Action=TagRole&RoleName=S3Access&Tags.member.1.Key=Department&Tags.member.1.Value=Engineering" + +.. code-block:: XML + + <TagRoleResponse> + <ResponseMetadata> + <RequestId>tx000000000000000000004-00611f337e-1027-default</RequestId> + </ResponseMetadata> + </TagRoleResponse> + + +List role tags +-------------- +Lists the tags attached to a role. + +Example:: + POST "<hostname>?Action=ListRoleTags&RoleName=S3Access" + +.. code-block:: XML + + <ListRoleTagsResponse> + <ListRoleTagsResult> + <Tags> + <member> + <Key>Department</Key> + <Value>Engineering</Value> + </member> + </Tags> + </ListRoleTagsResult> + <ResponseMetadata> + <RequestId>tx000000000000000000005-00611f337e-1027-default</RequestId> + </ResponseMetadata> + </ListRoleTagsResponse> + +Delete role tags +---------------- +Delete a tag/ tags attached to a role. + +Example:: + POST "<hostname>?Action=UntagRoles&RoleName=S3Access&TagKeys.member.1=Department" + +.. code-block:: XML + + <UntagRoleResponse> + <ResponseMetadata> + <RequestId>tx000000000000000000007-00611f337e-1027-default</RequestId> + </ResponseMetadata> + </UntagRoleResponse> + + +Sample code for tagging, listing tags and untagging a role +---------------------------------------------------------- + +The following is sample code for adding tags to role, listing tags and untagging a role using boto3. + +.. code-block:: python + + import boto3 + + access_key = 'TESTER' + secret_key = 'test123' + + iam_client = boto3.client('iam', + aws_access_key_id=access_key, + aws_secret_access_key=secret_key, + endpoint_url='http://s3.us-east.localhost:8000', + region_name='' + ) + + policy_document = "{\"Version\":\"2012-10-17\",\"Statement\":[{\"Effect\":\"Allow\",\"Principal\":{\"Federated\":[\"arn:aws:iam:::oidc-provider/localhost:8080/auth/realms/quickstart\"]},\"Action\":[\"sts:AssumeRoleWithWebIdentity\"],\"Condition\":{\"StringEquals\":{\"localhost:8080/auth/realms/quickstart:sub\":\"user1\"}}}]}" + + print ("\n Creating Role with tags\n") + tags_list = [ + {'Key':'Department','Value':'Engineering'} + ] + role_response = iam_client.create_role( + AssumeRolePolicyDocument=policy_document, + Path='/', + RoleName='S3Access', + Tags=tags_list, + ) + + print ("Adding tags to role\n") + response = iam_client.tag_role( + RoleName='S3Access', + Tags= [ + {'Key':'CostCenter','Value':'123456'} + ] + ) + print ("Listing role tags\n") + response = iam_client.list_role_tags( + RoleName='S3Access' + ) + print (response) + print ("Untagging role\n") + response = iam_client.untag_role( + RoleName='S3Access', + TagKeys=[ + 'Department', + ] + ) + + + diff --git a/doc/radosgw/s3-notification-compatibility.rst b/doc/radosgw/s3-notification-compatibility.rst new file mode 100644 index 000000000..09054bed3 --- /dev/null +++ b/doc/radosgw/s3-notification-compatibility.rst @@ -0,0 +1,139 @@ +===================================== +S3 Bucket Notifications Compatibility +===================================== + +Ceph's `Bucket Notifications`_ and `PubSub Module`_ APIs follow `AWS S3 Bucket Notifications API`_. However, some differences exist, as listed below. + + +.. note:: + + Compatibility is different depending on which of the above mechanism is used + +Supported Destination +--------------------- + +AWS supports: **SNS**, **SQS** and **Lambda** as possible destinations (AWS internal destinations). +Currently, we support: **HTTP/S**, **Kafka** and **AMQP**. And also support pulling and acking of events stored in Ceph (as an intrenal destination). + +We are using the **SNS** ARNs to represent the **HTTP/S**, **Kafka** and **AMQP** destinations. + +Notification Configuration XML +------------------------------ + +Following tags (and the tags inside them) are not supported: + ++-----------------------------------+----------------------------------------------+ +| Tag | Remaks | ++===================================+==============================================+ +| ``<QueueConfiguration>`` | not needed, we treat all destinations as SNS | ++-----------------------------------+----------------------------------------------+ +| ``<CloudFunctionConfiguration>`` | not needed, we treat all destinations as SNS | ++-----------------------------------+----------------------------------------------+ + +REST API Extension +------------------ + +Ceph's bucket notification API has the following extensions: + +- Deletion of a specific notification, or all notifications on a bucket, using the ``DELETE`` verb + + - In S3, all notifications are deleted when the bucket is deleted, or when an empty notification is set on the bucket + +- Getting the information on a specific notification (when more than one exists on a bucket) + + - In S3, it is only possible to fetch all notifications on a bucket + +- In addition to filtering based on prefix/suffix of object keys we support: + + - Filtering based on regular expression matching + + - Filtering based on metadata attributes attached to the object + + - Filtering based on object tags + +- Each one of the additional filters extends the S3 API and using it will require extension of the client SDK (unless you are using plain HTTP). + +- Filtering overlapping is allowed, so that same event could be sent as different notification + + +Unsupported Fields in the Event Record +-------------------------------------- + +The records sent for bucket notification follow format described in: `Event Message Structure`_. +However, the following fields may be sent empty, under the different deployment options (Notification/PubSub): + ++----------------------------------------+--------------+---------------+------------------------------------------------------------+ +| Field | Notification | PubSub | Description | ++========================================+==============+===============+============================================================+ +| ``userIdentity.principalId`` | Supported | Not Supported | The identity of the user that triggered the event | ++----------------------------------------+--------------+---------------+------------------------------------------------------------+ +| ``requestParameters.sourceIPAddress`` | Not Supported | The IP address of the client that triggered the event | ++----------------------------------------+--------------+---------------+------------------------------------------------------------+ +| ``requestParameters.x-amz-request-id`` | Supported | Not Supported | The request id that triggered the event | ++----------------------------------------+--------------+---------------+------------------------------------------------------------+ +| ``requestParameters.x-amz-id-2`` | Supported | Not Supported | The IP address of the RGW on which the event was triggered | ++----------------------------------------+--------------+---------------+------------------------------------------------------------+ +| ``s3.object.size`` | Supported | Not Supported | The size of the object | ++----------------------------------------+--------------+---------------+------------------------------------------------------------+ + +Event Types +----------- + ++----------------------------------------------+-----------------+-------------------------------------------+ +| Event | Notification | PubSub | ++==============================================+=================+===========================================+ +| ``s3:ObjectCreated:*`` | Supported | ++----------------------------------------------+-----------------+-------------------------------------------+ +| ``s3:ObjectCreated:Put`` | Supported | Supported at ``s3:ObjectCreated:*`` level | ++----------------------------------------------+-----------------+-------------------------------------------+ +| ``s3:ObjectCreated:Post`` | Supported | Not Supported | ++----------------------------------------------+-----------------+-------------------------------------------+ +| ``s3:ObjectCreated:Copy`` | Supported | Supported at ``s3:ObjectCreated:*`` level | ++----------------------------------------------+-----------------+-------------------------------------------+ +| ``s3:ObjectCreated:CompleteMultipartUpload`` | Supported | Supported at ``s3:ObjectCreated:*`` level | ++----------------------------------------------+-----------------+-------------------------------------------+ +| ``s3:ObjectRemoved:*`` | Supported | Supported only the specific events below | ++----------------------------------------------+-----------------+-------------------------------------------+ +| ``s3:ObjectRemoved:Delete`` | Supported | ++----------------------------------------------+-----------------+-------------------------------------------+ +| ``s3:ObjectRemoved:DeleteMarkerCreated`` | Supported | ++----------------------------------------------+-----------------+-------------------------------------------+ +| ``s3:ObjectRestore:Post`` | Not applicable to Ceph | ++----------------------------------------------+-----------------+-------------------------------------------+ +| ``s3:ObjectRestore:Complete`` | Not applicable to Ceph | ++----------------------------------------------+-----------------+-------------------------------------------+ +| ``s3:ReducedRedundancyLostObject`` | Not applicable to Ceph | ++----------------------------------------------+-----------------+-------------------------------------------+ + +.. note:: + + The ``s3:ObjectRemoved:DeleteMarkerCreated`` event presents information on the latest version of the object + +.. note:: + + In case of multipart upload, an ``ObjectCreated:CompleteMultipartUpload`` notification will be sent at the end of the process. + +Topic Configuration +------------------- +In the case of bucket notifications, the topics management API will be derived from `AWS Simple Notification Service API`_. +Note that most of the API is not applicable to Ceph, and only the following actions are implemented: + + - ``CreateTopic`` + - ``DeleteTopic`` + - ``ListTopics`` + +We also have the following extensions to topic configuration: + + - In ``GetTopic`` we allow fetching a specific topic, instead of all user topics + - In ``CreateTopic`` + + - we allow setting endpoint attributes + - we allow setting opaque data thta will be sent to the endpoint in the notification + + +.. _AWS Simple Notification Service API: https://docs.aws.amazon.com/sns/latest/api/API_Operations.html +.. _AWS S3 Bucket Notifications API: https://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html +.. _Event Message Structure: https://docs.aws.amazon.com/AmazonS3/latest/dev/notification-content-structure.html +.. _`PubSub Module`: ../pubsub-module +.. _`Bucket Notifications`: ../notifications +.. _`boto3 SDK filter extensions`: https://github.com/ceph/ceph/tree/master/examples/boto3 diff --git a/doc/radosgw/s3.rst b/doc/radosgw/s3.rst new file mode 100644 index 000000000..d68863d6b --- /dev/null +++ b/doc/radosgw/s3.rst @@ -0,0 +1,102 @@ +============================ + Ceph Object Gateway S3 API +============================ + +Ceph supports a RESTful API that is compatible with the basic data access model of the `Amazon S3 API`_. + +API +--- + +.. toctree:: + :maxdepth: 1 + + Common <s3/commons> + Authentication <s3/authentication> + Service Ops <s3/serviceops> + Bucket Ops <s3/bucketops> + Object Ops <s3/objectops> + C++ <s3/cpp> + C# <s3/csharp> + Java <s3/java> + Perl <s3/perl> + PHP <s3/php> + Python <s3/python> + Ruby <s3/ruby> + + +Features Support +---------------- + +The following table describes the support status for current Amazon S3 functional features: + ++---------------------------------+-----------------+----------------------------------------+ +| Feature | Status | Remarks | ++=================================+=================+========================================+ +| **List Buckets** | Supported | | ++---------------------------------+-----------------+----------------------------------------+ +| **Delete Bucket** | Supported | | ++---------------------------------+-----------------+----------------------------------------+ +| **Create Bucket** | Supported | Different set of canned ACLs | ++---------------------------------+-----------------+----------------------------------------+ +| **Bucket Lifecycle** | Supported | | ++---------------------------------+-----------------+----------------------------------------+ +| **Bucket Replication** | Partial | Only permitted across zones | ++---------------------------------+-----------------+----------------------------------------+ +| **Policy (Buckets, Objects)** | Supported | ACLs & bucket policies are supported | ++---------------------------------+-----------------+----------------------------------------+ +| **Bucket Website** | Supported | | ++---------------------------------+-----------------+----------------------------------------+ +| **Bucket ACLs (Get, Put)** | Supported | Different set of canned ACLs | ++---------------------------------+-----------------+----------------------------------------+ +| **Bucket Location** | Supported | | ++---------------------------------+-----------------+----------------------------------------+ +| **Bucket Notification** | Supported | See `S3 Notification Compatibility`_ | ++---------------------------------+-----------------+----------------------------------------+ +| **Bucket Object Versions** | Supported | | ++---------------------------------+-----------------+----------------------------------------+ +| **Get Bucket Info (HEAD)** | Supported | | ++---------------------------------+-----------------+----------------------------------------+ +| **Bucket Request Payment** | Supported | | ++---------------------------------+-----------------+----------------------------------------+ +| **Put Object** | Supported | | ++---------------------------------+-----------------+----------------------------------------+ +| **Delete Object** | Supported | | ++---------------------------------+-----------------+----------------------------------------+ +| **Get Object** | Supported | | ++---------------------------------+-----------------+----------------------------------------+ +| **Object ACLs (Get, Put)** | Supported | | ++---------------------------------+-----------------+----------------------------------------+ +| **Get Object Info (HEAD)** | Supported | | ++---------------------------------+-----------------+----------------------------------------+ +| **POST Object** | Supported | | ++---------------------------------+-----------------+----------------------------------------+ +| **Copy Object** | Supported | | ++---------------------------------+-----------------+----------------------------------------+ +| **Multipart Uploads** | Supported | | ++---------------------------------+-----------------+----------------------------------------+ +| **Object Tagging** | Supported | See :ref:`tag_policy` for Policy verbs | ++---------------------------------+-----------------+----------------------------------------+ +| **Bucket Tagging** | Supported | | ++---------------------------------+-----------------+----------------------------------------+ +| **Storage Class** | Supported | See :ref:`storage_classes` | ++---------------------------------+-----------------+----------------------------------------+ + +Unsupported Header Fields +------------------------- + +The following common request header fields are not supported: + ++----------------------------+------------+ +| Name | Type | ++============================+============+ +| **Server** | Response | ++----------------------------+------------+ +| **x-amz-delete-marker** | Response | ++----------------------------+------------+ +| **x-amz-id-2** | Response | ++----------------------------+------------+ +| **x-amz-version-id** | Response | ++----------------------------+------------+ + +.. _Amazon S3 API: http://docs.aws.amazon.com/AmazonS3/latest/API/APIRest.html +.. _S3 Notification Compatibility: ../s3-notification-compatibility diff --git a/doc/radosgw/s3/authentication.rst b/doc/radosgw/s3/authentication.rst new file mode 100644 index 000000000..10143290d --- /dev/null +++ b/doc/radosgw/s3/authentication.rst @@ -0,0 +1,231 @@ +========================= + Authentication and ACLs +========================= + +Requests to the RADOS Gateway (RGW) can be either authenticated or +unauthenticated. RGW assumes unauthenticated requests are sent by an anonymous +user. RGW supports canned ACLs. + +Authentication +-------------- +Authenticating a request requires including an access key and a Hash-based +Message Authentication Code (HMAC) in the request before it is sent to the +RGW server. RGW uses an S3-compatible authentication approach. + +:: + + HTTP/1.1 + PUT /buckets/bucket/object.mpeg + Host: cname.domain.com + Date: Mon, 2 Jan 2012 00:01:01 +0000 + Content-Encoding: mpeg + Content-Length: 9999999 + + Authorization: AWS {access-key}:{hash-of-header-and-secret} + +In the foregoing example, replace ``{access-key}`` with the value for your access +key ID followed by a colon (``:``). Replace ``{hash-of-header-and-secret}`` with +a hash of the header string and the secret corresponding to the access key ID. + +To generate the hash of the header string and secret, you must: + +#. Get the value of the header string. +#. Normalize the request header string into canonical form. +#. Generate an HMAC using a SHA-1 hashing algorithm. + See `RFC 2104`_ and `HMAC`_ for details. +#. Encode the ``hmac`` result as base-64. + +To normalize the header into canonical form: + +#. Get all fields beginning with ``x-amz-``. +#. Ensure that the fields are all lowercase. +#. Sort the fields lexicographically. +#. Combine multiple instances of the same field name into a + single field and separate the field values with a comma. +#. Replace white space and line breaks in field values with a single space. +#. Remove white space before and after colons. +#. Append a new line after each field. +#. Merge the fields back into the header. + +Replace the ``{hash-of-header-and-secret}`` with the base-64 encoded HMAC string. + +Authentication against OpenStack Keystone +----------------------------------------- + +In a radosgw instance that is configured with authentication against +OpenStack Keystone, it is possible to use Keystone as an authoritative +source for S3 API authentication. To do so, you must set: + +* the ``rgw keystone`` configuration options explained in :doc:`../keystone`, +* ``rgw s3 auth use keystone = true``. + +In addition, a user wishing to use the S3 API must obtain an AWS-style +access key and secret key. They can do so with the ``openstack ec2 +credentials create`` command:: + + $ openstack --os-interface public ec2 credentials create + +------------+---------------------------------------------------------------------------------------------------------------------------------------------+ + | Field | Value | + +------------+---------------------------------------------------------------------------------------------------------------------------------------------+ + | access | c921676aaabbccdeadbeef7e8b0eeb2c | + | links | {u'self': u'https://auth.example.com:5000/v3/users/7ecbebaffeabbddeadbeefa23267ccbb24/credentials/OS-EC2/c921676aaabbccdeadbeef7e8b0eeb2c'} | + | project_id | 5ed51981aab4679851adeadbeef6ebf7 | + | secret | ******************************** | + | trust_id | None | + | user_id | 7ecbebaffeabbddeadbeefa23267cc24 | + +------------+---------------------------------------------------------------------------------------------------------------------------------------------+ + +The thus-generated access and secret key can then be used for S3 API +access to radosgw. + +.. note:: Consider that most production radosgw deployments + authenticating against OpenStack Keystone are also set up + for :doc:`../multitenancy`, for which special + considerations apply with respect to S3 signed URLs and + public read ACLs. + +Access Control Lists (ACLs) +--------------------------- + +RGW supports S3-compatible ACL functionality. An ACL is a list of access grants +that specify which operations a user can perform on a bucket or on an object. +Each grant has a different meaning when applied to a bucket versus applied to +an object: + ++------------------+--------------------------------------------------------+----------------------------------------------+ +| Permission | Bucket | Object | ++==================+========================================================+==============================================+ +| ``READ`` | Grantee can list the objects in the bucket. | Grantee can read the object. | ++------------------+--------------------------------------------------------+----------------------------------------------+ +| ``WRITE`` | Grantee can write or delete objects in the bucket. | N/A | ++------------------+--------------------------------------------------------+----------------------------------------------+ +| ``READ_ACP`` | Grantee can read bucket ACL. | Grantee can read the object ACL. | ++------------------+--------------------------------------------------------+----------------------------------------------+ +| ``WRITE_ACP`` | Grantee can write bucket ACL. | Grantee can write to the object ACL. | ++------------------+--------------------------------------------------------+----------------------------------------------+ +| ``FULL_CONTROL`` | Grantee has full permissions for object in the bucket. | Grantee can read or write to the object ACL. | ++------------------+--------------------------------------------------------+----------------------------------------------+ + +Internally, S3 operations are mapped to ACL permissions thus: + ++---------------------------------------+---------------+ +| Operation | Permission | ++=======================================+===============+ +| ``s3:GetObject`` | ``READ`` | ++---------------------------------------+---------------+ +| ``s3:GetObjectTorrent`` | ``READ`` | ++---------------------------------------+---------------+ +| ``s3:GetObjectVersion`` | ``READ`` | ++---------------------------------------+---------------+ +| ``s3:GetObjectVersionTorrent`` | ``READ`` | ++---------------------------------------+---------------+ +| ``s3:GetObjectTagging`` | ``READ`` | ++---------------------------------------+---------------+ +| ``s3:GetObjectVersionTagging`` | ``READ`` | ++---------------------------------------+---------------+ +| ``s3:ListAllMyBuckets`` | ``READ`` | ++---------------------------------------+---------------+ +| ``s3:ListBucket`` | ``READ`` | ++---------------------------------------+---------------+ +| ``s3:ListBucketMultipartUploads`` | ``READ`` | ++---------------------------------------+---------------+ +| ``s3:ListBucketVersions`` | ``READ`` | ++---------------------------------------+---------------+ +| ``s3:ListMultipartUploadParts`` | ``READ`` | ++---------------------------------------+---------------+ +| ``s3:AbortMultipartUpload`` | ``WRITE`` | ++---------------------------------------+---------------+ +| ``s3:CreateBucket`` | ``WRITE`` | ++---------------------------------------+---------------+ +| ``s3:DeleteBucket`` | ``WRITE`` | ++---------------------------------------+---------------+ +| ``s3:DeleteObject`` | ``WRITE`` | ++---------------------------------------+---------------+ +| ``s3:s3DeleteObjectVersion`` | ``WRITE`` | ++---------------------------------------+---------------+ +| ``s3:PutObject`` | ``WRITE`` | ++---------------------------------------+---------------+ +| ``s3:PutObjectTagging`` | ``WRITE`` | ++---------------------------------------+---------------+ +| ``s3:PutObjectVersionTagging`` | ``WRITE`` | ++---------------------------------------+---------------+ +| ``s3:DeleteObjectTagging`` | ``WRITE`` | ++---------------------------------------+---------------+ +| ``s3:DeleteObjectVersionTagging`` | ``WRITE`` | ++---------------------------------------+---------------+ +| ``s3:RestoreObject`` | ``WRITE`` | ++---------------------------------------+---------------+ +| ``s3:GetAccelerateConfiguration`` | ``READ_ACP`` | ++---------------------------------------+---------------+ +| ``s3:GetBucketAcl`` | ``READ_ACP`` | ++---------------------------------------+---------------+ +| ``s3:GetBucketCORS`` | ``READ_ACP`` | ++---------------------------------------+---------------+ +| ``s3:GetBucketLocation`` | ``READ_ACP`` | ++---------------------------------------+---------------+ +| ``s3:GetBucketLogging`` | ``READ_ACP`` | ++---------------------------------------+---------------+ +| ``s3:GetBucketNotification`` | ``READ_ACP`` | ++---------------------------------------+---------------+ +| ``s3:GetBucketPolicy`` | ``READ_ACP`` | ++---------------------------------------+---------------+ +| ``s3:GetBucketRequestPayment`` | ``READ_ACP`` | ++---------------------------------------+---------------+ +| ``s3:GetBucketTagging`` | ``READ_ACP`` | ++---------------------------------------+---------------+ +| ``s3:GetBucketVersioning`` | ``READ_ACP`` | ++---------------------------------------+---------------+ +| ``s3:GetBucketWebsite`` | ``READ_ACP`` | ++---------------------------------------+---------------+ +| ``s3:GetLifecycleConfiguration`` | ``READ_ACP`` | ++---------------------------------------+---------------+ +| ``s3:GetObjectAcl`` | ``READ_ACP`` | ++---------------------------------------+---------------+ +| ``s3:GetObjectVersionAcl`` | ``READ_ACP`` | ++---------------------------------------+---------------+ +| ``s3:GetReplicationConfiguration`` | ``READ_ACP`` | ++---------------------------------------+---------------+ +| ``s3:DeleteBucketPolicy`` | ``WRITE_ACP`` | ++---------------------------------------+---------------+ +| ``s3:DeleteBucketWebsite`` | ``WRITE_ACP`` | ++---------------------------------------+---------------+ +| ``s3:DeleteReplicationConfiguration`` | ``WRITE_ACP`` | ++---------------------------------------+---------------+ +| ``s3:PutAccelerateConfiguration`` | ``WRITE_ACP`` | ++---------------------------------------+---------------+ +| ``s3:PutBucketAcl`` | ``WRITE_ACP`` | ++---------------------------------------+---------------+ +| ``s3:PutBucketCORS`` | ``WRITE_ACP`` | ++---------------------------------------+---------------+ +| ``s3:PutBucketLogging`` | ``WRITE_ACP`` | ++---------------------------------------+---------------+ +| ``s3:PutBucketNotification`` | ``WRITE_ACP`` | ++---------------------------------------+---------------+ +| ``s3:PutBucketPolicy`` | ``WRITE_ACP`` | ++---------------------------------------+---------------+ +| ``s3:PutBucketRequestPayment`` | ``WRITE_ACP`` | ++---------------------------------------+---------------+ +| ``s3:PutBucketTagging`` | ``WRITE_ACP`` | ++---------------------------------------+---------------+ +| ``s3:PutPutBucketVersioning`` | ``WRITE_ACP`` | ++---------------------------------------+---------------+ +| ``s3:PutBucketWebsite`` | ``WRITE_ACP`` | ++---------------------------------------+---------------+ +| ``s3:PutLifecycleConfiguration`` | ``WRITE_ACP`` | ++---------------------------------------+---------------+ +| ``s3:PutObjectAcl`` | ``WRITE_ACP`` | ++---------------------------------------+---------------+ +| ``s3:PutObjectVersionAcl`` | ``WRITE_ACP`` | ++---------------------------------------+---------------+ +| ``s3:PutReplicationConfiguration`` | ``WRITE_ACP`` | ++---------------------------------------+---------------+ + +Some mappings, (e.g. ``s3:CreateBucket`` to ``WRITE``) are not +applicable to S3 operation, but are required to allow Swift and S3 to +access the same resources when things like Swift user ACLs are in +play. This is one of the many reasons that you should use S3 bucket +policies rather than S3 ACLs when possible. + + +.. _RFC 2104: http://www.ietf.org/rfc/rfc2104.txt +.. _HMAC: https://en.wikipedia.org/wiki/HMAC diff --git a/doc/radosgw/s3/bucketops.rst b/doc/radosgw/s3/bucketops.rst new file mode 100644 index 000000000..378eb5f04 --- /dev/null +++ b/doc/radosgw/s3/bucketops.rst @@ -0,0 +1,706 @@ +=================== + Bucket Operations +=================== + +PUT Bucket +---------- +Creates a new bucket. To create a bucket, you must have a user ID and a valid AWS Access Key ID to authenticate requests. You may not +create buckets as an anonymous user. + +Constraints +~~~~~~~~~~~ +In general, bucket names should follow domain name constraints. + +- Bucket names must be unique. +- Bucket names cannot be formatted as IP address. +- Bucket names can be between 3 and 63 characters long. +- Bucket names must not contain uppercase characters or underscores. +- Bucket names must start with a lowercase letter or number. +- Bucket names must be a series of one or more labels. Adjacent labels are separated by a single period (.). Bucket names can contain lowercase letters, numbers, and hyphens. Each label must start and end with a lowercase letter or a number. + +.. note:: The above constraints are relaxed if the option 'rgw_relaxed_s3_bucket_names' is set to true except that the bucket names must still be unique, cannot be formatted as IP address and can contain letters, numbers, periods, dashes and underscores for up to 255 characters long. + +Syntax +~~~~~~ + +:: + + PUT /{bucket} HTTP/1.1 + Host: cname.domain.com + x-amz-acl: public-read-write + + Authorization: AWS {access-key}:{hash-of-header-and-secret} + +Parameters +~~~~~~~~~~ + + ++---------------+----------------------+-----------------------------------------------------------------------------+------------+ +| Name | Description | Valid Values | Required | ++===============+======================+=============================================================================+============+ +| ``x-amz-acl`` | Canned ACLs. | ``private``, ``public-read``, ``public-read-write``, ``authenticated-read`` | No | ++---------------+----------------------+-----------------------------------------------------------------------------+------------+ +| ``x-amz-bucket-object-lock-enabled`` | Enable object lock on bucket. | ``true``, ``false`` | No | ++--------------------------------------+-------------------------------+---------------------------------------------+------------+ + +Request Entities +~~~~~~~~~~~~~~~~ + ++-------------------------------+-----------+----------------------------------------------------------------+ +| Name | Type | Description | ++===============================+===========+================================================================+ +| ``CreateBucketConfiguration`` | Container | A container for the bucket configuration. | ++-------------------------------+-----------+----------------------------------------------------------------+ +| ``LocationConstraint`` | String | A zonegroup api name, with optional :ref:`s3_bucket_placement` | ++-------------------------------+-----------+----------------------------------------------------------------+ + + +HTTP Response +~~~~~~~~~~~~~ + +If the bucket name is unique, within constraints and unused, the operation will succeed. +If a bucket with the same name already exists and the user is the bucket owner, the operation will succeed. +If the bucket name is already in use, the operation will fail. + ++---------------+-----------------------+----------------------------------------------------------+ +| HTTP Status | Status Code | Description | ++===============+=======================+==========================================================+ +| ``409`` | BucketAlreadyExists | Bucket already exists under different user's ownership. | ++---------------+-----------------------+----------------------------------------------------------+ + +DELETE Bucket +------------- + +Deletes a bucket. You can reuse bucket names following a successful bucket removal. + +Syntax +~~~~~~ + +:: + + DELETE /{bucket} HTTP/1.1 + Host: cname.domain.com + + Authorization: AWS {access-key}:{hash-of-header-and-secret} + +HTTP Response +~~~~~~~~~~~~~ + ++---------------+---------------+------------------+ +| HTTP Status | Status Code | Description | ++===============+===============+==================+ +| ``204`` | No Content | Bucket removed. | ++---------------+---------------+------------------+ + +GET Bucket +---------- +Returns a list of bucket objects. + +Syntax +~~~~~~ + +:: + + GET /{bucket}?max-keys=25 HTTP/1.1 + Host: cname.domain.com + +Parameters +~~~~~~~~~~ + ++---------------------+-----------+-------------------------------------------------------------------------------------------------+ +| Name | Type | Description | ++=====================+===========+=================================================================================================+ +| ``prefix`` | String | Only returns objects that contain the specified prefix. | ++---------------------+-----------+-------------------------------------------------------------------------------------------------+ +| ``delimiter`` | String | The delimiter between the prefix and the rest of the object name. | ++---------------------+-----------+-------------------------------------------------------------------------------------------------+ +| ``marker`` | String | A beginning index for the list of objects returned. | ++---------------------+-----------+-------------------------------------------------------------------------------------------------+ +| ``max-keys`` | Integer | The maximum number of keys to return. Default is 1000. | ++---------------------+-----------+-------------------------------------------------------------------------------------------------+ +| ``allow-unordered`` | Boolean | Non-standard extension. Allows results to be returned unordered. Cannot be used with delimiter. | ++---------------------+-----------+-------------------------------------------------------------------------------------------------+ + +HTTP Response +~~~~~~~~~~~~~ + ++---------------+---------------+--------------------+ +| HTTP Status | Status Code | Description | ++===============+===============+====================+ +| ``200`` | OK | Buckets retrieved | ++---------------+---------------+--------------------+ + +Bucket Response Entities +~~~~~~~~~~~~~~~~~~~~~~~~ +``GET /{bucket}`` returns a container for buckets with the following fields. + ++------------------------+-----------+----------------------------------------------------------------------------------+ +| Name | Type | Description | ++========================+===========+==================================================================================+ +| ``ListBucketResult`` | Entity | The container for the list of objects. | ++------------------------+-----------+----------------------------------------------------------------------------------+ +| ``Name`` | String | The name of the bucket whose contents will be returned. | ++------------------------+-----------+----------------------------------------------------------------------------------+ +| ``Prefix`` | String | A prefix for the object keys. | ++------------------------+-----------+----------------------------------------------------------------------------------+ +| ``Marker`` | String | A beginning index for the list of objects returned. | ++------------------------+-----------+----------------------------------------------------------------------------------+ +| ``MaxKeys`` | Integer | The maximum number of keys returned. | ++------------------------+-----------+----------------------------------------------------------------------------------+ +| ``Delimiter`` | String | If set, objects with the same prefix will appear in the ``CommonPrefixes`` list. | ++------------------------+-----------+----------------------------------------------------------------------------------+ +| ``IsTruncated`` | Boolean | If ``true``, only a subset of the bucket's contents were returned. | ++------------------------+-----------+----------------------------------------------------------------------------------+ +| ``CommonPrefixes`` | Container | If multiple objects contain the same prefix, they will appear in this list. | ++------------------------+-----------+----------------------------------------------------------------------------------+ + +Object Response Entities +~~~~~~~~~~~~~~~~~~~~~~~~ +The ``ListBucketResult`` contains objects, where each object is within a ``Contents`` container. + ++------------------------+-----------+------------------------------------------+ +| Name | Type | Description | ++========================+===========+==========================================+ +| ``Contents`` | Object | A container for the object. | ++------------------------+-----------+------------------------------------------+ +| ``Key`` | String | The object's key. | ++------------------------+-----------+------------------------------------------+ +| ``LastModified`` | Date | The object's last-modified date/time. | ++------------------------+-----------+------------------------------------------+ +| ``ETag`` | String | An MD-5 hash of the object. (entity tag) | ++------------------------+-----------+------------------------------------------+ +| ``Size`` | Integer | The object's size. | ++------------------------+-----------+------------------------------------------+ +| ``StorageClass`` | String | Should always return ``STANDARD``. | ++------------------------+-----------+------------------------------------------+ +| ``Type`` | String | ``Appendable`` or ``Normal``. | ++------------------------+-----------+------------------------------------------+ + +Get Bucket Location +------------------- +Retrieves the bucket's region. The user needs to be the bucket owner +to call this. A bucket can be constrained to a region by providing +``LocationConstraint`` during a PUT request. + +Syntax +~~~~~~ +Add the ``location`` subresource to bucket resource as shown below + +:: + + GET /{bucket}?location HTTP/1.1 + Host: cname.domain.com + + Authorization: AWS {access-key}:{hash-of-header-and-secret} + +Response Entities +~~~~~~~~~~~~~~~~~~~~~~~~ + ++------------------------+-----------+------------------------------------------+ +| Name | Type | Description | ++========================+===========+==========================================+ +| ``LocationConstraint`` | String | The region where bucket resides, empty | +| | | string for default region | ++------------------------+-----------+------------------------------------------+ + + + +Get Bucket ACL +-------------- +Retrieves the bucket access control list. The user needs to be the bucket +owner or to have been granted ``READ_ACP`` permission on the bucket. + +Syntax +~~~~~~ +Add the ``acl`` subresource to the bucket request as shown below. + +:: + + GET /{bucket}?acl HTTP/1.1 + Host: cname.domain.com + + Authorization: AWS {access-key}:{hash-of-header-and-secret} + +Response Entities +~~~~~~~~~~~~~~~~~ + ++---------------------------+-------------+----------------------------------------------------------------------------------------------+ +| Name | Type | Description | ++===========================+=============+==============================================================================================+ +| ``AccessControlPolicy`` | Container | A container for the response. | ++---------------------------+-------------+----------------------------------------------------------------------------------------------+ +| ``AccessControlList`` | Container | A container for the ACL information. | ++---------------------------+-------------+----------------------------------------------------------------------------------------------+ +| ``Owner`` | Container | A container for the bucket owner's ``ID`` and ``DisplayName``. | ++---------------------------+-------------+----------------------------------------------------------------------------------------------+ +| ``ID`` | String | The bucket owner's ID. | ++---------------------------+-------------+----------------------------------------------------------------------------------------------+ +| ``DisplayName`` | String | The bucket owner's display name. | ++---------------------------+-------------+----------------------------------------------------------------------------------------------+ +| ``Grant`` | Container | A container for ``Grantee`` and ``Permission``. | ++---------------------------+-------------+----------------------------------------------------------------------------------------------+ +| ``Grantee`` | Container | A container for the ``DisplayName`` and ``ID`` of the user receiving a grant of permission. | ++---------------------------+-------------+----------------------------------------------------------------------------------------------+ +| ``Permission`` | String | The permission given to the ``Grantee`` bucket. | ++---------------------------+-------------+----------------------------------------------------------------------------------------------+ + +PUT Bucket ACL +-------------- +Sets an access control to an existing bucket. The user needs to be the bucket +owner or to have been granted ``WRITE_ACP`` permission on the bucket. + +Syntax +~~~~~~ +Add the ``acl`` subresource to the bucket request as shown below. + +:: + + PUT /{bucket}?acl HTTP/1.1 + +Request Entities +~~~~~~~~~~~~~~~~ + ++---------------------------+-------------+----------------------------------------------------------------------------------------------+ +| Name | Type | Description | ++===========================+=============+==============================================================================================+ +| ``AccessControlPolicy`` | Container | A container for the request. | ++---------------------------+-------------+----------------------------------------------------------------------------------------------+ +| ``AccessControlList`` | Container | A container for the ACL information. | ++---------------------------+-------------+----------------------------------------------------------------------------------------------+ +| ``Owner`` | Container | A container for the bucket owner's ``ID`` and ``DisplayName``. | ++---------------------------+-------------+----------------------------------------------------------------------------------------------+ +| ``ID`` | String | The bucket owner's ID. | ++---------------------------+-------------+----------------------------------------------------------------------------------------------+ +| ``DisplayName`` | String | The bucket owner's display name. | ++---------------------------+-------------+----------------------------------------------------------------------------------------------+ +| ``Grant`` | Container | A container for ``Grantee`` and ``Permission``. | ++---------------------------+-------------+----------------------------------------------------------------------------------------------+ +| ``Grantee`` | Container | A container for the ``DisplayName`` and ``ID`` of the user receiving a grant of permission. | ++---------------------------+-------------+----------------------------------------------------------------------------------------------+ +| ``Permission`` | String | The permission given to the ``Grantee`` bucket. | ++---------------------------+-------------+----------------------------------------------------------------------------------------------+ + +List Bucket Multipart Uploads +----------------------------- + +``GET /?uploads`` returns a list of the current in-progress multipart uploads--i.e., the application initiates a multipart upload, but +the service hasn't completed all the uploads yet. + +Syntax +~~~~~~ + +:: + + GET /{bucket}?uploads HTTP/1.1 + +Parameters +~~~~~~~~~~ + +You may specify parameters for ``GET /{bucket}?uploads``, but none of them are required. + ++------------------------+-----------+--------------------------------------------------------------------------------------+ +| Name | Type | Description | ++========================+===========+======================================================================================+ +| ``prefix`` | String | Returns in-progress uploads whose keys contains the specified prefix. | ++------------------------+-----------+--------------------------------------------------------------------------------------+ +| ``delimiter`` | String | The delimiter between the prefix and the rest of the object name. | ++------------------------+-----------+--------------------------------------------------------------------------------------+ +| ``key-marker`` | String | The beginning marker for the list of uploads. | ++------------------------+-----------+--------------------------------------------------------------------------------------+ +| ``max-keys`` | Integer | The maximum number of in-progress uploads. The default is 1000. | ++------------------------+-----------+--------------------------------------------------------------------------------------+ +| ``max-uploads`` | Integer | The maximum number of multipart uploads. The range from 1-1000. The default is 1000. | ++------------------------+-----------+--------------------------------------------------------------------------------------+ +| ``upload-id-marker`` | String | Ignored if ``key-marker`` is not specified. Specifies the ``ID`` of first | +| | | upload to list in lexicographical order at or following the ``ID``. | ++------------------------+-----------+--------------------------------------------------------------------------------------+ + + +Response Entities +~~~~~~~~~~~~~~~~~ + ++-----------------------------------------+-------------+----------------------------------------------------------------------------------------------------------+ +| Name | Type | Description | ++=========================================+=============+==========================================================================================================+ +| ``ListMultipartUploadsResult`` | Container | A container for the results. | ++-----------------------------------------+-------------+----------------------------------------------------------------------------------------------------------+ +| ``ListMultipartUploadsResult.Prefix`` | String | The prefix specified by the ``prefix`` request parameter (if any). | ++-----------------------------------------+-------------+----------------------------------------------------------------------------------------------------------+ +| ``Bucket`` | String | The bucket that will receive the bucket contents. | ++-----------------------------------------+-------------+----------------------------------------------------------------------------------------------------------+ +| ``KeyMarker`` | String | The key marker specified by the ``key-marker`` request parameter (if any). | ++-----------------------------------------+-------------+----------------------------------------------------------------------------------------------------------+ +| ``UploadIdMarker`` | String | The marker specified by the ``upload-id-marker`` request parameter (if any). | ++-----------------------------------------+-------------+----------------------------------------------------------------------------------------------------------+ +| ``NextKeyMarker`` | String | The key marker to use in a subsequent request if ``IsTruncated`` is ``true``. | ++-----------------------------------------+-------------+----------------------------------------------------------------------------------------------------------+ +| ``NextUploadIdMarker`` | String | The upload ID marker to use in a subsequent request if ``IsTruncated`` is ``true``. | ++-----------------------------------------+-------------+----------------------------------------------------------------------------------------------------------+ +| ``MaxUploads`` | Integer | The max uploads specified by the ``max-uploads`` request parameter. | ++-----------------------------------------+-------------+----------------------------------------------------------------------------------------------------------+ +| ``Delimiter`` | String | If set, objects with the same prefix will appear in the ``CommonPrefixes`` list. | ++-----------------------------------------+-------------+----------------------------------------------------------------------------------------------------------+ +| ``IsTruncated`` | Boolean | If ``true``, only a subset of the bucket's upload contents were returned. | ++-----------------------------------------+-------------+----------------------------------------------------------------------------------------------------------+ +| ``Upload`` | Container | A container for ``Key``, ``UploadId``, ``InitiatorOwner``, ``StorageClass``, and ``Initiated`` elements. | ++-----------------------------------------+-------------+----------------------------------------------------------------------------------------------------------+ +| ``Key`` | String | The key of the object once the multipart upload is complete. | ++-----------------------------------------+-------------+----------------------------------------------------------------------------------------------------------+ +| ``UploadId`` | String | The ``ID`` that identifies the multipart upload. | ++-----------------------------------------+-------------+----------------------------------------------------------------------------------------------------------+ +| ``Initiator`` | Container | Contains the ``ID`` and ``DisplayName`` of the user who initiated the upload. | ++-----------------------------------------+-------------+----------------------------------------------------------------------------------------------------------+ +| ``DisplayName`` | String | The initiator's display name. | ++-----------------------------------------+-------------+----------------------------------------------------------------------------------------------------------+ +| ``ID`` | String | The initiator's ID. | ++-----------------------------------------+-------------+----------------------------------------------------------------------------------------------------------+ +| ``Owner`` | Container | A container for the ``ID`` and ``DisplayName`` of the user who owns the uploaded object. | ++-----------------------------------------+-------------+----------------------------------------------------------------------------------------------------------+ +| ``StorageClass`` | String | The method used to store the resulting object. ``STANDARD`` or ``REDUCED_REDUNDANCY`` | ++-----------------------------------------+-------------+----------------------------------------------------------------------------------------------------------+ +| ``Initiated`` | Date | The date and time the user initiated the upload. | ++-----------------------------------------+-------------+----------------------------------------------------------------------------------------------------------+ +| ``CommonPrefixes`` | Container | If multiple objects contain the same prefix, they will appear in this list. | ++-----------------------------------------+-------------+----------------------------------------------------------------------------------------------------------+ +| ``CommonPrefixes.Prefix`` | String | The substring of the key after the prefix as defined by the ``prefix`` request parameter. | ++-----------------------------------------+-------------+----------------------------------------------------------------------------------------------------------+ + +ENABLE/SUSPEND BUCKET VERSIONING +-------------------------------- + +``PUT /?versioning`` This subresource set the versioning state of an existing bucket. To set the versioning state, you must be the bucket owner. + +You can set the versioning state with one of the following values: + +- Enabled : Enables versioning for the objects in the bucket, All objects added to the bucket receive a unique version ID. +- Suspended : Disables versioning for the objects in the bucket, All objects added to the bucket receive the version ID null. + +If the versioning state has never been set on a bucket, it has no versioning state; a GET versioning request does not return a versioning state value. + +Syntax +~~~~~~ + +:: + + PUT /{bucket}?versioning HTTP/1.1 + +REQUEST ENTITIES +~~~~~~~~~~~~~~~~ + ++-----------------------------+-----------+---------------------------------------------------------------------------+ +| Name | Type | Description | ++=============================+===========+===========================================================================+ +| ``VersioningConfiguration`` | Container | A container for the request. | ++-----------------------------+-----------+---------------------------------------------------------------------------+ +| ``Status`` | String | Sets the versioning state of the bucket. Valid Values: Suspended/Enabled | ++-----------------------------+-----------+---------------------------------------------------------------------------+ + +PUT BUCKET OBJECT LOCK +-------------------------------- + +Places an Object Lock configuration on the specified bucket. The rule specified in the Object Lock configuration will be +applied by default to every new object placed in the specified bucket. + +Syntax +~~~~~~ + +:: + + PUT /{bucket}?object-lock HTTP/1.1 + +Request Entities +~~~~~~~~~~~~~~~~ + ++-----------------------------+-------------+----------------------------------------------------------------------------------------+----------+ +| Name | Type | Description | Required | ++=============================+=============+========================================================================================+==========+ +| ``ObjectLockConfiguration`` | Container | A container for the request. | Yes | ++-----------------------------+-------------+----------------------------------------------------------------------------------------+----------+ +| ``ObjectLockEnabled`` | String | Indicates whether this bucket has an Object Lock configuration enabled. | Yes | ++-----------------------------+-------------+----------------------------------------------------------------------------------------+----------+ +| ``Rule`` | Container | The Object Lock rule in place for the specified bucket. | No | ++-----------------------------+-------------+----------------------------------------------------------------------------------------+----------+ +| ``DefaultRetention`` | Container | The default retention period applied to new objects placed in the specified bucket. | No | ++-----------------------------+-------------+----------------------------------------------------------------------------------------+----------+ +| ``Mode`` | String | The default Object Lock retention mode. Valid Values: GOVERNANCE/COMPLIANCE | Yes | ++-----------------------------+-------------+----------------------------------------------------------------------------------------+----------+ +| ``Days`` | Integer | The number of days specified for the default retention period. | No | ++-----------------------------+-------------+----------------------------------------------------------------------------------------+----------+ +| ``Years`` | Integer | The number of years specified for the default retention period. | No | ++-----------------------------+-------------+----------------------------------------------------------------------------------------+----------+ + +HTTP Response +~~~~~~~~~~~~~ + +If the bucket object lock is not enabled when creating the bucket, the operation will fail. + ++---------------+-----------------------+----------------------------------------------------------+ +| HTTP Status | Status Code | Description | ++===============+=======================+==========================================================+ +| ``400`` | MalformedXML | The XML is not well-formed | ++---------------+-----------------------+----------------------------------------------------------+ +| ``409`` | InvalidBucketState | The bucket object lock is not enabled | ++---------------+-----------------------+----------------------------------------------------------+ + +GET BUCKET OBJECT LOCK +-------------------------------- + +Gets the Object Lock configuration for a bucket. The rule specified in the Object Lock configuration will be applied by +default to every new object placed in the specified bucket. + +Syntax +~~~~~~ + +:: + + GET /{bucket}?object-lock HTTP/1.1 + + +Response Entities +~~~~~~~~~~~~~~~~~ + ++-----------------------------+-------------+----------------------------------------------------------------------------------------+----------+ +| Name | Type | Description | Required | ++=============================+=============+========================================================================================+==========+ +| ``ObjectLockConfiguration`` | Container | A container for the request. | Yes | ++-----------------------------+-------------+----------------------------------------------------------------------------------------+----------+ +| ``ObjectLockEnabled`` | String | Indicates whether this bucket has an Object Lock configuration enabled. | Yes | ++-----------------------------+-------------+----------------------------------------------------------------------------------------+----------+ +| ``Rule`` | Container | The Object Lock rule in place for the specified bucket. | No | ++-----------------------------+-------------+----------------------------------------------------------------------------------------+----------+ +| ``DefaultRetention`` | Container | The default retention period applied to new objects placed in the specified bucket. | No | ++-----------------------------+-------------+----------------------------------------------------------------------------------------+----------+ +| ``Mode`` | String | The default Object Lock retention mode. Valid Values: GOVERNANCE/COMPLIANCE | Yes | ++-----------------------------+-------------+----------------------------------------------------------------------------------------+----------+ +| ``Days`` | Integer | The number of days specified for the default retention period. | No | ++-----------------------------+-------------+----------------------------------------------------------------------------------------+----------+ +| ``Years`` | Integer | The number of years specified for the default retention period. | No | ++-----------------------------+-------------+----------------------------------------------------------------------------------------+----------+ + +Create Notification +------------------- + +Create a publisher for a specific bucket into a topic. + +Syntax +~~~~~~ + +:: + + PUT /<bucket name>?notification HTTP/1.1 + + +Request Entities +~~~~~~~~~~~~~~~~ + +Parameters are XML encoded in the body of the request, in the following format: + +:: + + <NotificationConfiguration xmlns="http://s3.amazonaws.com/doc/2006-03-01/"> + <TopicConfiguration> + <Id></Id> + <Topic></Topic> + <Event></Event> + <Filter> + <S3Key> + <FilterRule> + <Name></Name> + <Value></Value> + </FilterRule> + </S3Key> + <S3Metadata> + <FilterRule> + <Name></Name> + <Value></Value> + </FilterRule> + </S3Metadata> + <S3Tags> + <FilterRule> + <Name></Name> + <Value></Value> + </FilterRule> + </S3Tags> + </Filter> + </TopicConfiguration> + </NotificationConfiguration> + ++-------------------------------+-----------+--------------------------------------------------------------------------------------+----------+ +| Name | Type | Description | Required | ++===============================+===========+======================================================================================+==========+ +| ``NotificationConfiguration`` | Container | Holding list of ``TopicConfiguration`` entities | Yes | ++-------------------------------+-----------+--------------------------------------------------------------------------------------+----------+ +| ``TopicConfiguration`` | Container | Holding ``Id``, ``Topic`` and list of ``Event`` entities | Yes | ++-------------------------------+-----------+--------------------------------------------------------------------------------------+----------+ +| ``Id`` | String | Name of the notification | Yes | ++-------------------------------+-----------+--------------------------------------------------------------------------------------+----------+ +| ``Topic`` | String | Topic ARN. Topic must be created beforehand | Yes | ++-------------------------------+-----------+--------------------------------------------------------------------------------------+----------+ +| ``Event`` | String | List of supported events see: `S3 Notification Compatibility`_. Multiple ``Event`` | No | +| | | entities can be used. If omitted, all events are handled | | ++-------------------------------+-----------+--------------------------------------------------------------------------------------+----------+ +| ``Filter`` | Container | Holding ``S3Key``, ``S3Metadata`` and ``S3Tags`` entities | No | ++-------------------------------+-----------+--------------------------------------------------------------------------------------+----------+ +| ``S3Key`` | Container | Holding a list of ``FilterRule`` entities, for filtering based on object key. | No | +| | | At most, 3 entities may be in the list, with ``Name`` be ``prefix``, ``suffix`` or | | +| | | ``regex``. All filter rules in the list must match for the filter to match. | | ++-------------------------------+-----------+--------------------------------------------------------------------------------------+----------+ +| ``S3Metadata`` | Container | Holding a list of ``FilterRule`` entities, for filtering based on object metadata. | No | +| | | All filter rules in the list must match the metadata defined on the object. However, | | +| | | the object still match if it has other metadata entries not listed in the filter. | | ++-------------------------------+-----------+--------------------------------------------------------------------------------------+----------+ +| ``S3Tags`` | Container | Holding a list of ``FilterRule`` entities, for filtering based on object tags. | No | +| | | All filter rules in the list must match the tags defined on the object. However, | | +| | | the object still match it it has other tags not listed in the filter. | | ++-------------------------------+-----------+--------------------------------------------------------------------------------------+----------+ +| ``S3Key.FilterRule`` | Container | Holding ``Name`` and ``Value`` entities. ``Name`` would be: ``prefix``, ``suffix`` | Yes | +| | | or ``regex``. The ``Value`` would hold the key prefix, key suffix or a regular | | +| | | expression for matching the key, accordingly. | | ++-------------------------------+-----------+--------------------------------------------------------------------------------------+----------+ +| ``S3Metadata.FilterRule`` | Container | Holding ``Name`` and ``Value`` entities. ``Name`` would be the name of the metadata | Yes | +| | | attribute (e.g. ``x-amz-meta-xxx``). The ``Value`` would be the expected value for | | +| | | this attribute. | | ++-------------------------------+-----------+--------------------------------------------------------------------------------------+----------+ +| ``S3Tags.FilterRule`` | Container | Holding ``Name`` and ``Value`` entities. ``Name`` would be the tag key, | Yes | +| | | and ``Value`` would be the tag value. | | ++-------------------------------+-----------+--------------------------------------------------------------------------------------+----------+ + + +HTTP Response +~~~~~~~~~~~~~ + ++---------------+-----------------------+----------------------------------------------------------+ +| HTTP Status | Status Code | Description | ++===============+=======================+==========================================================+ +| ``400`` | MalformedXML | The XML is not well-formed | ++---------------+-----------------------+----------------------------------------------------------+ +| ``400`` | InvalidArgument | Missing Id; Missing/Invalid Topic ARN; Invalid Event | ++---------------+-----------------------+----------------------------------------------------------+ +| ``404`` | NoSuchBucket | The bucket does not exist | ++---------------+-----------------------+----------------------------------------------------------+ +| ``404`` | NoSuchKey | The topic does not exist | ++---------------+-----------------------+----------------------------------------------------------+ + + +Delete Notification +------------------- + +Delete a specific, or all, notifications from a bucket. + +.. note:: + + - Notification deletion is an extension to the S3 notification API + - When the bucket is deleted, any notification defined on it is also deleted + - Deleting an unknown notification (e.g. double delete) is not considered an error + +Syntax +~~~~~~ + +:: + + DELETE /bucket?notification[=<notification-id>] HTTP/1.1 + + +Parameters +~~~~~~~~~~ + ++------------------------+-----------+----------------------------------------------------------------------------------------+ +| Name | Type | Description | ++========================+===========+========================================================================================+ +| ``notification-id`` | String | Name of the notification. If not provided, all notifications on the bucket are deleted | ++------------------------+-----------+----------------------------------------------------------------------------------------+ + +HTTP Response +~~~~~~~~~~~~~ + ++---------------+-----------------------+----------------------------------------------------------+ +| HTTP Status | Status Code | Description | ++===============+=======================+==========================================================+ +| ``404`` | NoSuchBucket | The bucket does not exist | ++---------------+-----------------------+----------------------------------------------------------+ + +Get/List Notification +--------------------- + +Get a specific notification, or list all notifications configured on a bucket. + +Syntax +~~~~~~ + +:: + + GET /bucket?notification[=<notification-id>] HTTP/1.1 + + +Parameters +~~~~~~~~~~ + ++------------------------+-----------+----------------------------------------------------------------------------------------+ +| Name | Type | Description | ++========================+===========+========================================================================================+ +| ``notification-id`` | String | Name of the notification. If not provided, all notifications on the bucket are listed | ++------------------------+-----------+----------------------------------------------------------------------------------------+ + +Response Entities +~~~~~~~~~~~~~~~~~ + +Response is XML encoded in the body of the request, in the following format: + +:: + + <NotificationConfiguration xmlns="http://s3.amazonaws.com/doc/2006-03-01/"> + <TopicConfiguration> + <Id></Id> + <Topic></Topic> + <Event></Event> + <Filter> + <S3Key> + <FilterRule> + <Name></Name> + <Value></Value> + </FilterRule> + </S3Key> + <S3Metadata> + <FilterRule> + <Name></Name> + <Value></Value> + </FilterRule> + </S3Metadata> + <S3Tags> + <FilterRule> + <Name></Name> + <Value></Value> + </FilterRule> + </S3Tags> + </Filter> + </TopicConfiguration> + </NotificationConfiguration> + ++-------------------------------+-----------+--------------------------------------------------------------------------------------+----------+ +| Name | Type | Description | Required | ++===============================+===========+======================================================================================+==========+ +| ``NotificationConfiguration`` | Container | Holding list of ``TopicConfiguration`` entities | Yes | ++-------------------------------+-----------+--------------------------------------------------------------------------------------+----------+ +| ``TopicConfiguration`` | Container | Holding ``Id``, ``Topic`` and list of ``Event`` entities | Yes | ++-------------------------------+-----------+--------------------------------------------------------------------------------------+----------+ +| ``Id`` | String | Name of the notification | Yes | ++-------------------------------+-----------+--------------------------------------------------------------------------------------+----------+ +| ``Topic`` | String | Topic ARN | Yes | ++-------------------------------+-----------+--------------------------------------------------------------------------------------+----------+ +| ``Event`` | String | Handled event. Multiple ``Event`` entities may exist | Yes | ++-------------------------------+-----------+--------------------------------------------------------------------------------------+----------+ +| ``Filter`` | Container | Holding the filters configured for this notification | No | ++-------------------------------+-----------+--------------------------------------------------------------------------------------+----------+ + +HTTP Response +~~~~~~~~~~~~~ + ++---------------+-----------------------+----------------------------------------------------------+ +| HTTP Status | Status Code | Description | ++===============+=======================+==========================================================+ +| ``404`` | NoSuchBucket | The bucket does not exist | ++---------------+-----------------------+----------------------------------------------------------+ +| ``404`` | NoSuchKey | The notification does not exist (if provided) | ++---------------+-----------------------+----------------------------------------------------------+ + +.. _S3 Notification Compatibility: ../../s3-notification-compatibility diff --git a/doc/radosgw/s3/commons.rst b/doc/radosgw/s3/commons.rst new file mode 100644 index 000000000..ca848bc27 --- /dev/null +++ b/doc/radosgw/s3/commons.rst @@ -0,0 +1,111 @@ +================= + Common Entities +================= + +.. toctree:: + :maxdepth: -1 + +Bucket and Host Name +-------------------- +There are two different modes of accessing the buckets. The first (preferred) method +identifies the bucket as the top-level directory in the URI. :: + + GET /mybucket HTTP/1.1 + Host: cname.domain.com + +The second method identifies the bucket via a virtual bucket host name. For example:: + + GET / HTTP/1.1 + Host: mybucket.cname.domain.com + +To configure virtual hosted buckets, you can either set ``rgw_dns_name = cname.domain.com`` in ceph.conf, or add ``cname.domain.com`` to the list of ``hostnames`` in your zonegroup configuration. See `Ceph Object Gateway - Multisite Configuration`_ for more on zonegroups. + +.. tip:: We prefer the first method, because the second method requires expensive domain certification and DNS wild cards. + +Common Request Headers +---------------------- + ++--------------------+------------------------------------------+ +| Request Header | Description | ++====================+==========================================+ +| ``CONTENT_LENGTH`` | Length of the request body. | ++--------------------+------------------------------------------+ +| ``DATE`` | Request time and date (in UTC). | ++--------------------+------------------------------------------+ +| ``HOST`` | The name of the host server. | ++--------------------+------------------------------------------+ +| ``AUTHORIZATION`` | Authorization token. | ++--------------------+------------------------------------------+ + +Common Response Status +---------------------- + ++---------------+-----------------------------------+ +| HTTP Status | Response Code | ++===============+===================================+ +| ``100`` | Continue | ++---------------+-----------------------------------+ +| ``200`` | Success | ++---------------+-----------------------------------+ +| ``201`` | Created | ++---------------+-----------------------------------+ +| ``202`` | Accepted | ++---------------+-----------------------------------+ +| ``204`` | NoContent | ++---------------+-----------------------------------+ +| ``206`` | Partial content | ++---------------+-----------------------------------+ +| ``304`` | NotModified | ++---------------+-----------------------------------+ +| ``400`` | InvalidArgument | ++---------------+-----------------------------------+ +| ``400`` | InvalidDigest | ++---------------+-----------------------------------+ +| ``400`` | BadDigest | ++---------------+-----------------------------------+ +| ``400`` | InvalidBucketName | ++---------------+-----------------------------------+ +| ``400`` | InvalidObjectName | ++---------------+-----------------------------------+ +| ``400`` | UnresolvableGrantByEmailAddress | ++---------------+-----------------------------------+ +| ``400`` | InvalidPart | ++---------------+-----------------------------------+ +| ``400`` | InvalidPartOrder | ++---------------+-----------------------------------+ +| ``400`` | RequestTimeout | ++---------------+-----------------------------------+ +| ``400`` | EntityTooLarge | ++---------------+-----------------------------------+ +| ``403`` | AccessDenied | ++---------------+-----------------------------------+ +| ``403`` | UserSuspended | ++---------------+-----------------------------------+ +| ``403`` | RequestTimeTooSkewed | ++---------------+-----------------------------------+ +| ``404`` | NoSuchKey | ++---------------+-----------------------------------+ +| ``404`` | NoSuchBucket | ++---------------+-----------------------------------+ +| ``404`` | NoSuchUpload | ++---------------+-----------------------------------+ +| ``405`` | MethodNotAllowed | ++---------------+-----------------------------------+ +| ``408`` | RequestTimeout | ++---------------+-----------------------------------+ +| ``409`` | BucketAlreadyExists | ++---------------+-----------------------------------+ +| ``409`` | BucketNotEmpty | ++---------------+-----------------------------------+ +| ``411`` | MissingContentLength | ++---------------+-----------------------------------+ +| ``412`` | PreconditionFailed | ++---------------+-----------------------------------+ +| ``416`` | InvalidRange | ++---------------+-----------------------------------+ +| ``422`` | UnprocessableEntity | ++---------------+-----------------------------------+ +| ``500`` | InternalError | ++---------------+-----------------------------------+ + +.. _`Ceph Object Gateway - Multisite Configuration`: ../../multisite diff --git a/doc/radosgw/s3/cpp.rst b/doc/radosgw/s3/cpp.rst new file mode 100644 index 000000000..089c9c53a --- /dev/null +++ b/doc/radosgw/s3/cpp.rst @@ -0,0 +1,337 @@ +.. _cpp: + +C++ S3 Examples +=============== + +Setup +----- + +The following contains includes and globals that will be used in later examples: + +.. code-block:: cpp + + #include "libs3.h" + #include <stdlib.h> + #include <iostream> + #include <fstream> + + const char access_key[] = "ACCESS_KEY"; + const char secret_key[] = "SECRET_KEY"; + const char host[] = "HOST"; + const char sample_bucket[] = "sample_bucket"; + const char sample_key[] = "hello.txt"; + const char sample_file[] = "resource/hello.txt"; + const char *security_token = NULL; + const char *auth_region = NULL; + + S3BucketContext bucketContext = + { + host, + sample_bucket, + S3ProtocolHTTP, + S3UriStylePath, + access_key, + secret_key, + security_token, + auth_region + }; + + S3Status responsePropertiesCallback( + const S3ResponseProperties *properties, + void *callbackData) + { + return S3StatusOK; + } + + static void responseCompleteCallback( + S3Status status, + const S3ErrorDetails *error, + void *callbackData) + { + return; + } + + S3ResponseHandler responseHandler = + { + &responsePropertiesCallback, + &responseCompleteCallback + }; + + +Creating (and Closing) a Connection +----------------------------------- + +This creates a connection so that you can interact with the server. + +.. code-block:: cpp + + S3_initialize("s3", S3_INIT_ALL, host); + // Do stuff... + S3_deinitialize(); + + +Listing Owned Buckets +--------------------- + +This gets a list of Buckets that you own. +This also prints out the bucket name, owner ID, and display name +for each bucket. + +.. code-block:: cpp + + static S3Status listServiceCallback( + const char *ownerId, + const char *ownerDisplayName, + const char *bucketName, + int64_t creationDate, void *callbackData) + { + bool *header_printed = (bool*) callbackData; + if (!*header_printed) { + *header_printed = true; + printf("%-22s", " Bucket"); + printf(" %-20s %-12s", " Owner ID", "Display Name"); + printf("\n"); + printf("----------------------"); + printf(" --------------------" " ------------"); + printf("\n"); + } + + printf("%-22s", bucketName); + printf(" %-20s %-12s", ownerId ? ownerId : "", ownerDisplayName ? ownerDisplayName : ""); + printf("\n"); + + return S3StatusOK; + } + + S3ListServiceHandler listServiceHandler = + { + responseHandler, + &listServiceCallback + }; + bool header_printed = false; + S3_list_service(S3ProtocolHTTP, access_key, secret_key, security_token, host, + auth_region, NULL, 0, &listServiceHandler, &header_printed); + + +Creating a Bucket +----------------- + +This creates a new bucket. + +.. code-block:: cpp + + S3_create_bucket(S3ProtocolHTTP, access_key, secret_key, NULL, host, sample_bucket, S3CannedAclPrivate, NULL, NULL, &responseHandler, NULL); + + +Listing a Bucket's Content +-------------------------- + +This gets a list of objects in the bucket. +This also prints out each object's name, the file size, and +last modified date. + +.. code-block:: cpp + + static S3Status listBucketCallback( + int isTruncated, + const char *nextMarker, + int contentsCount, + const S3ListBucketContent *contents, + int commonPrefixesCount, + const char **commonPrefixes, + void *callbackData) + { + printf("%-22s", " Object Name"); + printf(" %-5s %-20s", "Size", " Last Modified"); + printf("\n"); + printf("----------------------"); + printf(" -----" " --------------------"); + printf("\n"); + + for (int i = 0; i < contentsCount; i++) { + char timebuf[256]; + char sizebuf[16]; + const S3ListBucketContent *content = &(contents[i]); + time_t t = (time_t) content->lastModified; + + strftime(timebuf, sizeof(timebuf), "%Y-%m-%dT%H:%M:%SZ", gmtime(&t)); + sprintf(sizebuf, "%5llu", (unsigned long long) content->size); + printf("%-22s %s %s\n", content->key, sizebuf, timebuf); + } + + return S3StatusOK; + } + + S3ListBucketHandler listBucketHandler = + { + responseHandler, + &listBucketCallback + }; + S3_list_bucket(&bucketContext, NULL, NULL, NULL, 0, NULL, 0, &listBucketHandler, NULL); + +The output will look something like this:: + + myphoto1.jpg 251262 2011-08-08T21:35:48.000Z + myphoto2.jpg 262518 2011-08-08T21:38:01.000Z + + +Deleting a Bucket +----------------- + +.. note:: + + The Bucket must be empty! Otherwise it won't work! + +.. code-block:: cpp + + S3_delete_bucket(S3ProtocolHTTP, S3UriStylePath, access_key, secret_key, 0, host, sample_bucket, NULL, NULL, 0, &responseHandler, NULL); + + +Creating an Object (from a file) +-------------------------------- + +This creates a file ``hello.txt``. + +.. code-block:: cpp + + #include <sys/stat.h> + typedef struct put_object_callback_data + { + FILE *infile; + uint64_t contentLength; + } put_object_callback_data; + + + static int putObjectDataCallback(int bufferSize, char *buffer, void *callbackData) + { + put_object_callback_data *data = (put_object_callback_data *) callbackData; + + int ret = 0; + + if (data->contentLength) { + int toRead = ((data->contentLength > (unsigned) bufferSize) ? (unsigned) bufferSize : data->contentLength); + ret = fread(buffer, 1, toRead, data->infile); + } + data->contentLength -= ret; + return ret; + } + + put_object_callback_data data; + struct stat statbuf; + if (stat(sample_file, &statbuf) == -1) { + fprintf(stderr, "\nERROR: Failed to stat file %s: ", sample_file); + perror(0); + exit(-1); + } + + int contentLength = statbuf.st_size; + data.contentLength = contentLength; + + if (!(data.infile = fopen(sample_file, "r"))) { + fprintf(stderr, "\nERROR: Failed to open input file %s: ", sample_file); + perror(0); + exit(-1); + } + + S3PutObjectHandler putObjectHandler = + { + responseHandler, + &putObjectDataCallback + }; + + S3_put_object(&bucketContext, sample_key, contentLength, NULL, NULL, 0, &putObjectHandler, &data); + fclose(data.infile); + + +Download an Object (to a file) +------------------------------ + +This downloads a file and prints the contents. + +.. code-block:: cpp + + static S3Status getObjectDataCallback(int bufferSize, const char *buffer, void *callbackData) + { + FILE *outfile = (FILE *) callbackData; + size_t wrote = fwrite(buffer, 1, bufferSize, outfile); + return ((wrote < (size_t) bufferSize) ? S3StatusAbortedByCallback : S3StatusOK); + } + + S3GetObjectHandler getObjectHandler = + { + responseHandler, + &getObjectDataCallback + }; + FILE *outfile = stdout; + S3_get_object(&bucketContext, sample_key, NULL, 0, 0, NULL, 0, &getObjectHandler, outfile); + + +Delete an Object +---------------- + +This deletes an object. + +.. code-block:: cpp + + S3ResponseHandler deleteResponseHandler = + { + 0, + &responseCompleteCallback + }; + S3_delete_object(&bucketContext, sample_key, 0, 0, &deleteResponseHandler, 0); + + +Change an Object's ACL +---------------------- + +This changes an object's ACL to grant full control to another user. + + +.. code-block:: cpp + + #include <string.h> + char ownerId[] = "owner"; + char ownerDisplayName[] = "owner"; + char granteeId[] = "grantee"; + char granteeDisplayName[] = "grantee"; + + S3AclGrant grants[] = { + { + S3GranteeTypeCanonicalUser, + {{}}, + S3PermissionFullControl + }, + { + S3GranteeTypeCanonicalUser, + {{}}, + S3PermissionReadACP + }, + { + S3GranteeTypeAllUsers, + {{}}, + S3PermissionRead + } + }; + + strncpy(grants[0].grantee.canonicalUser.id, ownerId, S3_MAX_GRANTEE_USER_ID_SIZE); + strncpy(grants[0].grantee.canonicalUser.displayName, ownerDisplayName, S3_MAX_GRANTEE_DISPLAY_NAME_SIZE); + + strncpy(grants[1].grantee.canonicalUser.id, granteeId, S3_MAX_GRANTEE_USER_ID_SIZE); + strncpy(grants[1].grantee.canonicalUser.displayName, granteeDisplayName, S3_MAX_GRANTEE_DISPLAY_NAME_SIZE); + + S3_set_acl(&bucketContext, sample_key, ownerId, ownerDisplayName, 3, grants, 0, &responseHandler, 0); + + +Generate Object Download URL (signed) +------------------------------------- + +This generates a signed download URL that will be valid for 5 minutes. + +.. code-block:: cpp + + #include <time.h> + char buffer[S3_MAX_AUTHENTICATED_QUERY_STRING_SIZE]; + int64_t expires = time(NULL) + 60 * 5; // Current time + 5 minutes + + S3_generate_authenticated_query_string(buffer, &bucketContext, sample_key, expires, NULL, "GET"); + diff --git a/doc/radosgw/s3/csharp.rst b/doc/radosgw/s3/csharp.rst new file mode 100644 index 000000000..af1c6e4b5 --- /dev/null +++ b/doc/radosgw/s3/csharp.rst @@ -0,0 +1,199 @@ +.. _csharp: + +C# S3 Examples +============== + +Creating a Connection +--------------------- + +This creates a connection so that you can interact with the server. + +.. code-block:: csharp + + using System; + using Amazon; + using Amazon.S3; + using Amazon.S3.Model; + + string accessKey = "put your access key here!"; + string secretKey = "put your secret key here!"; + + AmazonS3Config config = new AmazonS3Config(); + config.ServiceURL = "objects.dreamhost.com"; + + AmazonS3Client s3Client = new AmazonS3Client( + accessKey, + secretKey, + config + ); + + +Listing Owned Buckets +--------------------- + +This gets a list of Buckets that you own. +This also prints out the bucket name and creation date of each bucket. + +.. code-block:: csharp + + ListBucketsResponse response = client.ListBuckets(); + foreach (S3Bucket b in response.Buckets) + { + Console.WriteLine("{0}\t{1}", b.BucketName, b.CreationDate); + } + +The output will look something like this:: + + mahbuckat1 2011-04-21T18:05:39.000Z + mahbuckat2 2011-04-21T18:05:48.000Z + mahbuckat3 2011-04-21T18:07:18.000Z + + +Creating a Bucket +----------------- +This creates a new bucket called ``my-new-bucket`` + +.. code-block:: csharp + + PutBucketRequest request = new PutBucketRequest(); + request.BucketName = "my-new-bucket"; + client.PutBucket(request); + +Listing a Bucket's Content +-------------------------- + +This gets a list of objects in the bucket. +This also prints out each object's name, the file size, and last +modified date. + +.. code-block:: csharp + + ListObjectsRequest request = new ListObjectsRequest(); + request.BucketName = "my-new-bucket"; + ListObjectsResponse response = client.ListObjects(request); + foreach (S3Object o in response.S3Objects) + { + Console.WriteLine("{0}\t{1}\t{2}", o.Key, o.Size, o.LastModified); + } + +The output will look something like this:: + + myphoto1.jpg 251262 2011-08-08T21:35:48.000Z + myphoto2.jpg 262518 2011-08-08T21:38:01.000Z + + +Deleting a Bucket +----------------- + +.. note:: + + The Bucket must be empty! Otherwise it won't work! + +.. code-block:: csharp + + DeleteBucketRequest request = new DeleteBucketRequest(); + request.BucketName = "my-new-bucket"; + client.DeleteBucket(request); + + +Forced Delete for Non-empty Buckets +----------------------------------- + +.. attention:: + + not available + + +Creating an Object +------------------ + +This creates a file ``hello.txt`` with the string ``"Hello World!"`` + +.. code-block:: csharp + + PutObjectRequest request = new PutObjectRequest(); + request.BucketName = "my-new-bucket"; + request.Key = "hello.txt"; + request.ContentType = "text/plain"; + request.ContentBody = "Hello World!"; + client.PutObject(request); + + +Change an Object's ACL +---------------------- + +This makes the object ``hello.txt`` to be publicly readable, and +``secret_plans.txt`` to be private. + +.. code-block:: csharp + + PutACLRequest request = new PutACLRequest(); + request.BucketName = "my-new-bucket"; + request.Key = "hello.txt"; + request.CannedACL = S3CannedACL.PublicRead; + client.PutACL(request); + + PutACLRequest request2 = new PutACLRequest(); + request2.BucketName = "my-new-bucket"; + request2.Key = "secret_plans.txt"; + request2.CannedACL = S3CannedACL.Private; + client.PutACL(request2); + + +Download an Object (to a file) +------------------------------ + +This downloads the object ``perl_poetry.pdf`` and saves it in +``C:\Users\larry\Documents`` + +.. code-block:: csharp + + GetObjectRequest request = new GetObjectRequest(); + request.BucketName = "my-new-bucket"; + request.Key = "perl_poetry.pdf"; + GetObjectResponse response = client.GetObject(request); + response.WriteResponseStreamToFile("C:\\Users\\larry\\Documents\\perl_poetry.pdf"); + + +Delete an Object +---------------- + +This deletes the object ``goodbye.txt`` + +.. code-block:: csharp + + DeleteObjectRequest request = new DeleteObjectRequest(); + request.BucketName = "my-new-bucket"; + request.Key = "goodbye.txt"; + client.DeleteObject(request); + + +Generate Object Download URLs (signed and unsigned) +--------------------------------------------------- + +This generates an unsigned download URL for ``hello.txt``. This works +because we made ``hello.txt`` public by setting the ACL above. +This then generates a signed download URL for ``secret_plans.txt`` that +will work for 1 hour. Signed download URLs will work for the time +period even if the object is private (when the time period is up, the +URL will stop working). + +.. note:: + + The C# S3 Library does not have a method for generating unsigned + URLs, so the following example only shows generating signed URLs. + +.. code-block:: csharp + + GetPreSignedUrlRequest request = new GetPreSignedUrlRequest(); + request.BucketName = "my-bucket-name"; + request.Key = "secret_plans.txt"; + request.Expires = DateTime.Now.AddHours(1); + request.Protocol = Protocol.HTTP; + string url = client.GetPreSignedURL(request); + Console.WriteLine(url); + +The output of this will look something like:: + + http://objects.dreamhost.com/my-bucket-name/secret_plans.txt?Signature=XXXXXXXXXXXXXXXXXXXXXXXXXXX&Expires=1316027075&AWSAccessKeyId=XXXXXXXXXXXXXXXXXXX + diff --git a/doc/radosgw/s3/java.rst b/doc/radosgw/s3/java.rst new file mode 100644 index 000000000..057c09c2c --- /dev/null +++ b/doc/radosgw/s3/java.rst @@ -0,0 +1,212 @@ +.. _java: + +Java S3 Examples +================ + +Setup +----- + +The following examples may require some or all of the following java +classes to be imported: + +.. code-block:: java + + import java.io.ByteArrayInputStream; + import java.io.File; + import java.util.List; + import com.amazonaws.auth.AWSCredentials; + import com.amazonaws.auth.BasicAWSCredentials; + import com.amazonaws.util.StringUtils; + import com.amazonaws.services.s3.AmazonS3; + import com.amazonaws.services.s3.AmazonS3Client; + import com.amazonaws.services.s3.model.Bucket; + import com.amazonaws.services.s3.model.CannedAccessControlList; + import com.amazonaws.services.s3.model.GeneratePresignedUrlRequest; + import com.amazonaws.services.s3.model.GetObjectRequest; + import com.amazonaws.services.s3.model.ObjectListing; + import com.amazonaws.services.s3.model.ObjectMetadata; + import com.amazonaws.services.s3.model.S3ObjectSummary; + + +If you are just testing the Ceph Object Storage services, consider +using HTTP protocol instead of HTTPS protocol. + +First, import the ``ClientConfiguration`` and ``Protocol`` classes. + +.. code-block:: java + + import com.amazonaws.ClientConfiguration; + import com.amazonaws.Protocol; + + +Then, define the client configuration, and add the client configuration +as an argument for the S3 client. + +.. code-block:: java + + AWSCredentials credentials = new BasicAWSCredentials(accessKey, secretKey); + + ClientConfiguration clientConfig = new ClientConfiguration(); + clientConfig.setProtocol(Protocol.HTTP); + + AmazonS3 conn = new AmazonS3Client(credentials, clientConfig); + conn.setEndpoint("endpoint.com"); + + +Creating a Connection +--------------------- + +This creates a connection so that you can interact with the server. + +.. code-block:: java + + String accessKey = "insert your access key here!"; + String secretKey = "insert your secret key here!"; + + AWSCredentials credentials = new BasicAWSCredentials(accessKey, secretKey); + AmazonS3 conn = new AmazonS3Client(credentials); + conn.setEndpoint("objects.dreamhost.com"); + + +Listing Owned Buckets +--------------------- + +This gets a list of Buckets that you own. +This also prints out the bucket name and creation date of each bucket. + +.. code-block:: java + + List<Bucket> buckets = conn.listBuckets(); + for (Bucket bucket : buckets) { + System.out.println(bucket.getName() + "\t" + + StringUtils.fromDate(bucket.getCreationDate())); + } + +The output will look something like this:: + + mahbuckat1 2011-04-21T18:05:39.000Z + mahbuckat2 2011-04-21T18:05:48.000Z + mahbuckat3 2011-04-21T18:07:18.000Z + + +Creating a Bucket +----------------- + +This creates a new bucket called ``my-new-bucket`` + +.. code-block:: java + + Bucket bucket = conn.createBucket("my-new-bucket"); + + +Listing a Bucket's Content +-------------------------- +This gets a list of objects in the bucket. +This also prints out each object's name, the file size, and last +modified date. + +.. code-block:: java + + ObjectListing objects = conn.listObjects(bucket.getName()); + do { + for (S3ObjectSummary objectSummary : objects.getObjectSummaries()) { + System.out.println(objectSummary.getKey() + "\t" + + objectSummary.getSize() + "\t" + + StringUtils.fromDate(objectSummary.getLastModified())); + } + objects = conn.listNextBatchOfObjects(objects); + } while (objects.isTruncated()); + +The output will look something like this:: + + myphoto1.jpg 251262 2011-08-08T21:35:48.000Z + myphoto2.jpg 262518 2011-08-08T21:38:01.000Z + + +Deleting a Bucket +----------------- + +.. note:: + The Bucket must be empty! Otherwise it won't work! + +.. code-block:: java + + conn.deleteBucket(bucket.getName()); + + +Forced Delete for Non-empty Buckets +----------------------------------- +.. attention:: + not available + + +Creating an Object +------------------ + +This creates a file ``hello.txt`` with the string ``"Hello World!"`` + +.. code-block:: java + + ByteArrayInputStream input = new ByteArrayInputStream("Hello World!".getBytes()); + conn.putObject(bucket.getName(), "hello.txt", input, new ObjectMetadata()); + + +Change an Object's ACL +---------------------- + +This makes the object ``hello.txt`` to be publicly readable, and +``secret_plans.txt`` to be private. + +.. code-block:: java + + conn.setObjectAcl(bucket.getName(), "hello.txt", CannedAccessControlList.PublicRead); + conn.setObjectAcl(bucket.getName(), "secret_plans.txt", CannedAccessControlList.Private); + + +Download an Object (to a file) +------------------------------ + +This downloads the object ``perl_poetry.pdf`` and saves it in +``/home/larry/documents`` + +.. code-block:: java + + conn.getObject( + new GetObjectRequest(bucket.getName(), "perl_poetry.pdf"), + new File("/home/larry/documents/perl_poetry.pdf") + ); + + +Delete an Object +---------------- + +This deletes the object ``goodbye.txt`` + +.. code-block:: java + + conn.deleteObject(bucket.getName(), "goodbye.txt"); + + +Generate Object Download URLs (signed and unsigned) +--------------------------------------------------- + +This generates an unsigned download URL for ``hello.txt``. This works +because we made ``hello.txt`` public by setting the ACL above. +This then generates a signed download URL for ``secret_plans.txt`` that +will work for 1 hour. Signed download URLs will work for the time +period even if the object is private (when the time period is up, the +URL will stop working). + +.. note:: + The java library does not have a method for generating unsigned + URLs, so the example below just generates a signed URL. + +.. code-block:: java + + GeneratePresignedUrlRequest request = new GeneratePresignedUrlRequest(bucket.getName(), "secret_plans.txt"); + System.out.println(conn.generatePresignedUrl(request)); + +The output will look something like this:: + + https://my-bucket-name.objects.dreamhost.com/secret_plans.txt?Signature=XXXXXXXXXXXXXXXXXXXXXXXXXXX&Expires=1316027075&AWSAccessKeyId=XXXXXXXXXXXXXXXXXXX + diff --git a/doc/radosgw/s3/objectops.rst b/doc/radosgw/s3/objectops.rst new file mode 100644 index 000000000..2ac52607f --- /dev/null +++ b/doc/radosgw/s3/objectops.rst @@ -0,0 +1,558 @@ +Object Operations +================= + +Put Object +---------- +Adds an object to a bucket. You must have write permissions on the bucket to perform this operation. + + +Syntax +~~~~~~ + +:: + + PUT /{bucket}/{object} HTTP/1.1 + +Request Headers +~~~~~~~~~~~~~~~ + ++----------------------+--------------------------------------------+-------------------------------------------------------------------------------+------------+ +| Name | Description | Valid Values | Required | ++======================+============================================+===============================================================================+============+ +| **content-md5** | A base64 encoded MD-5 hash of the message. | A string. No defaults or constraints. | No | ++----------------------+--------------------------------------------+-------------------------------------------------------------------------------+------------+ +| **content-type** | A standard MIME type. | Any MIME type. Default: ``binary/octet-stream`` | No | ++----------------------+--------------------------------------------+-------------------------------------------------------------------------------+------------+ +| **x-amz-meta-<...>** | User metadata. Stored with the object. | A string up to 8kb. No defaults. | No | ++----------------------+--------------------------------------------+-------------------------------------------------------------------------------+------------+ +| **x-amz-acl** | A canned ACL. | ``private``, ``public-read``, ``public-read-write``, ``authenticated-read`` | No | ++----------------------+--------------------------------------------+-------------------------------------------------------------------------------+------------+ + + +Copy Object +----------- +To copy an object, use ``PUT`` and specify a destination bucket and the object name. + +Syntax +~~~~~~ + +:: + + PUT /{dest-bucket}/{dest-object} HTTP/1.1 + x-amz-copy-source: {source-bucket}/{source-object} + +Request Headers +~~~~~~~~~~~~~~~ + ++--------------------------------------+-------------------------------------------------+------------------------+------------+ +| Name | Description | Valid Values | Required | ++======================================+=================================================+========================+============+ +| **x-amz-copy-source** | The source bucket name + object name. | {bucket}/{obj} | Yes | ++--------------------------------------+-------------------------------------------------+------------------------+------------+ +| **x-amz-acl** | A canned ACL. | ``private``, | No | +| | | ``public-read``, | | +| | | ``public-read-write``, | | +| | | ``authenticated-read`` | | ++--------------------------------------+-------------------------------------------------+------------------------+------------+ +| **x-amz-copy-if-modified-since** | Copies only if modified since the timestamp. | Timestamp | No | ++--------------------------------------+-------------------------------------------------+------------------------+------------+ +| **x-amz-copy-if-unmodified-since** | Copies only if unmodified since the timestamp. | Timestamp | No | ++--------------------------------------+-------------------------------------------------+------------------------+------------+ +| **x-amz-copy-if-match** | Copies only if object ETag matches ETag. | Entity Tag | No | ++--------------------------------------+-------------------------------------------------+------------------------+------------+ +| **x-amz-copy-if-none-match** | Copies only if object ETag doesn't match. | Entity Tag | No | ++--------------------------------------+-------------------------------------------------+------------------------+------------+ + +Response Entities +~~~~~~~~~~~~~~~~~ + ++------------------------+-------------+-----------------------------------------------+ +| Name | Type | Description | ++========================+=============+===============================================+ +| **CopyObjectResult** | Container | A container for the response elements. | ++------------------------+-------------+-----------------------------------------------+ +| **LastModified** | Date | The last modified date of the source object. | ++------------------------+-------------+-----------------------------------------------+ +| **Etag** | String | The ETag of the new object. | ++------------------------+-------------+-----------------------------------------------+ + +Remove Object +------------- + +Removes an object. Requires WRITE permission set on the containing bucket. + +Syntax +~~~~~~ + +:: + + DELETE /{bucket}/{object} HTTP/1.1 + + + +Get Object +---------- +Retrieves an object from a bucket within RADOS. + +Syntax +~~~~~~ + +:: + + GET /{bucket}/{object} HTTP/1.1 + +Request Headers +~~~~~~~~~~~~~~~ + ++---------------------------+------------------------------------------------+--------------------------------+------------+ +| Name | Description | Valid Values | Required | ++===========================+================================================+================================+============+ +| **range** | The range of the object to retrieve. | Range: bytes=beginbyte-endbyte | No | ++---------------------------+------------------------------------------------+--------------------------------+------------+ +| **if-modified-since** | Gets only if modified since the timestamp. | Timestamp | No | ++---------------------------+------------------------------------------------+--------------------------------+------------+ +| **if-unmodified-since** | Gets only if not modified since the timestamp. | Timestamp | No | ++---------------------------+------------------------------------------------+--------------------------------+------------+ +| **if-match** | Gets only if object ETag matches ETag. | Entity Tag | No | ++---------------------------+------------------------------------------------+--------------------------------+------------+ +| **if-none-match** | Gets only if object ETag matches ETag. | Entity Tag | No | ++---------------------------+------------------------------------------------+--------------------------------+------------+ + +Response Headers +~~~~~~~~~~~~~~~~ + ++-------------------+--------------------------------------------------------------------------------------------+ +| Name | Description | ++===================+============================================================================================+ +| **Content-Range** | Data range, will only be returned if the range header field was specified in the request | ++-------------------+--------------------------------------------------------------------------------------------+ + +Get Object Info +--------------- + +Returns information about object. This request will return the same +header information as with the Get Object request, but will include +the metadata only, not the object data payload. + +Syntax +~~~~~~ + +:: + + HEAD /{bucket}/{object} HTTP/1.1 + +Request Headers +~~~~~~~~~~~~~~~ + ++---------------------------+------------------------------------------------+--------------------------------+------------+ +| Name | Description | Valid Values | Required | ++===========================+================================================+================================+============+ +| **range** | The range of the object to retrieve. | Range: bytes=beginbyte-endbyte | No | ++---------------------------+------------------------------------------------+--------------------------------+------------+ +| **if-modified-since** | Gets only if modified since the timestamp. | Timestamp | No | ++---------------------------+------------------------------------------------+--------------------------------+------------+ +| **if-unmodified-since** | Gets only if not modified since the timestamp. | Timestamp | No | ++---------------------------+------------------------------------------------+--------------------------------+------------+ +| **if-match** | Gets only if object ETag matches ETag. | Entity Tag | No | ++---------------------------+------------------------------------------------+--------------------------------+------------+ +| **if-none-match** | Gets only if object ETag matches ETag. | Entity Tag | No | ++---------------------------+------------------------------------------------+--------------------------------+------------+ + +Get Object ACL +-------------- + +Syntax +~~~~~~ + +:: + + GET /{bucket}/{object}?acl HTTP/1.1 + +Response Entities +~~~~~~~~~~~~~~~~~ + ++---------------------------+-------------+----------------------------------------------------------------------------------------------+ +| Name | Type | Description | ++===========================+=============+==============================================================================================+ +| ``AccessControlPolicy`` | Container | A container for the response. | ++---------------------------+-------------+----------------------------------------------------------------------------------------------+ +| ``AccessControlList`` | Container | A container for the ACL information. | ++---------------------------+-------------+----------------------------------------------------------------------------------------------+ +| ``Owner`` | Container | A container for the object owner's ``ID`` and ``DisplayName``. | ++---------------------------+-------------+----------------------------------------------------------------------------------------------+ +| ``ID`` | String | The object owner's ID. | ++---------------------------+-------------+----------------------------------------------------------------------------------------------+ +| ``DisplayName`` | String | The object owner's display name. | ++---------------------------+-------------+----------------------------------------------------------------------------------------------+ +| ``Grant`` | Container | A container for ``Grantee`` and ``Permission``. | ++---------------------------+-------------+----------------------------------------------------------------------------------------------+ +| ``Grantee`` | Container | A container for the ``DisplayName`` and ``ID`` of the user receiving a grant of permission. | ++---------------------------+-------------+----------------------------------------------------------------------------------------------+ +| ``Permission`` | String | The permission given to the ``Grantee`` object. | ++---------------------------+-------------+----------------------------------------------------------------------------------------------+ + + + +Set Object ACL +-------------- + +Syntax +~~~~~~ + +:: + + PUT /{bucket}/{object}?acl + +Request Entities +~~~~~~~~~~~~~~~~ + ++---------------------------+-------------+----------------------------------------------------------------------------------------------+ +| Name | Type | Description | ++===========================+=============+==============================================================================================+ +| ``AccessControlPolicy`` | Container | A container for the response. | ++---------------------------+-------------+----------------------------------------------------------------------------------------------+ +| ``AccessControlList`` | Container | A container for the ACL information. | ++---------------------------+-------------+----------------------------------------------------------------------------------------------+ +| ``Owner`` | Container | A container for the object owner's ``ID`` and ``DisplayName``. | ++---------------------------+-------------+----------------------------------------------------------------------------------------------+ +| ``ID`` | String | The object owner's ID. | ++---------------------------+-------------+----------------------------------------------------------------------------------------------+ +| ``DisplayName`` | String | The object owner's display name. | ++---------------------------+-------------+----------------------------------------------------------------------------------------------+ +| ``Grant`` | Container | A container for ``Grantee`` and ``Permission``. | ++---------------------------+-------------+----------------------------------------------------------------------------------------------+ +| ``Grantee`` | Container | A container for the ``DisplayName`` and ``ID`` of the user receiving a grant of permission. | ++---------------------------+-------------+----------------------------------------------------------------------------------------------+ +| ``Permission`` | String | The permission given to the ``Grantee`` object. | ++---------------------------+-------------+----------------------------------------------------------------------------------------------+ + + + +Initiate Multi-part Upload +-------------------------- + +Initiate a multi-part upload process. + +Syntax +~~~~~~ + +:: + + POST /{bucket}/{object}?uploads + +Request Headers +~~~~~~~~~~~~~~~ + ++----------------------+--------------------------------------------+-------------------------------------------------------------------------------+------------+ +| Name | Description | Valid Values | Required | ++======================+============================================+===============================================================================+============+ +| **content-md5** | A base64 encoded MD-5 hash of the message. | A string. No defaults or constraints. | No | ++----------------------+--------------------------------------------+-------------------------------------------------------------------------------+------------+ +| **content-type** | A standard MIME type. | Any MIME type. Default: ``binary/octet-stream`` | No | ++----------------------+--------------------------------------------+-------------------------------------------------------------------------------+------------+ +| **x-amz-meta-<...>** | User metadata. Stored with the object. | A string up to 8kb. No defaults. | No | ++----------------------+--------------------------------------------+-------------------------------------------------------------------------------+------------+ +| **x-amz-acl** | A canned ACL. | ``private``, ``public-read``, ``public-read-write``, ``authenticated-read`` | No | ++----------------------+--------------------------------------------+-------------------------------------------------------------------------------+------------+ + + +Response Entities +~~~~~~~~~~~~~~~~~ + ++-----------------------------------------+-------------+----------------------------------------------------------------------------------------------------------+ +| Name | Type | Description | ++=========================================+=============+==========================================================================================================+ +| ``InitiatedMultipartUploadsResult`` | Container | A container for the results. | ++-----------------------------------------+-------------+----------------------------------------------------------------------------------------------------------+ +| ``Bucket`` | String | The bucket that will receive the object contents. | ++-----------------------------------------+-------------+----------------------------------------------------------------------------------------------------------+ +| ``Key`` | String | The key specified by the ``key`` request parameter (if any). | ++-----------------------------------------+-------------+----------------------------------------------------------------------------------------------------------+ +| ``UploadId`` | String | The ID specified by the ``upload-id`` request parameter identifying the multipart upload (if any). | ++-----------------------------------------+-------------+----------------------------------------------------------------------------------------------------------+ + + +Multipart Upload Part +--------------------- + +Syntax +~~~~~~ + +:: + + PUT /{bucket}/{object}?partNumber=&uploadId= HTTP/1.1 + +HTTP Response +~~~~~~~~~~~~~ + +The following HTTP response may be returned: + ++---------------+----------------+--------------------------------------------------------------------------+ +| HTTP Status | Status Code | Description | ++===============+================+==========================================================================+ +| **404** | NoSuchUpload | Specified upload-id does not match any initiated upload on this object | ++---------------+----------------+--------------------------------------------------------------------------+ + +List Multipart Upload Parts +--------------------------- + +Syntax +~~~~~~ + +:: + + GET /{bucket}/{object}?uploadId=123 HTTP/1.1 + +Response Entities +~~~~~~~~~~~~~~~~~ + ++-----------------------------------------+-------------+----------------------------------------------------------------------------------------------------------+ +| Name | Type | Description | ++=========================================+=============+==========================================================================================================+ +| ``ListPartsResult`` | Container | A container for the results. | ++-----------------------------------------+-------------+----------------------------------------------------------------------------------------------------------+ +| ``Bucket`` | String | The bucket that will receive the object contents. | ++-----------------------------------------+-------------+----------------------------------------------------------------------------------------------------------+ +| ``Key`` | String | The key specified by the ``key`` request parameter (if any). | ++-----------------------------------------+-------------+----------------------------------------------------------------------------------------------------------+ +| ``UploadId`` | String | The ID specified by the ``upload-id`` request parameter identifying the multipart upload (if any). | ++-----------------------------------------+-------------+----------------------------------------------------------------------------------------------------------+ +| ``Initiator`` | Container | Contains the ``ID`` and ``DisplayName`` of the user who initiated the upload. | ++-----------------------------------------+-------------+----------------------------------------------------------------------------------------------------------+ +| ``ID`` | String | The initiator's ID. | ++-----------------------------------------+-------------+----------------------------------------------------------------------------------------------------------+ +| ``DisplayName`` | String | The initiator's display name. | ++-----------------------------------------+-------------+----------------------------------------------------------------------------------------------------------+ +| ``Owner`` | Container | A container for the ``ID`` and ``DisplayName`` of the user who owns the uploaded object. | ++-----------------------------------------+-------------+----------------------------------------------------------------------------------------------------------+ +| ``StorageClass`` | String | The method used to store the resulting object. ``STANDARD`` or ``REDUCED_REDUNDANCY`` | ++-----------------------------------------+-------------+----------------------------------------------------------------------------------------------------------+ +| ``PartNumberMarker`` | String | The part marker to use in a subsequent request if ``IsTruncated`` is ``true``. Precedes the list. | ++-----------------------------------------+-------------+----------------------------------------------------------------------------------------------------------+ +| ``NextPartNumberMarker`` | String | The next part marker to use in a subsequent request if ``IsTruncated`` is ``true``. The end of the list. | ++-----------------------------------------+-------------+----------------------------------------------------------------------------------------------------------+ +| ``MaxParts`` | Integer | The max parts allowed in the response as specified by the ``max-parts`` request parameter. | ++-----------------------------------------+-------------+----------------------------------------------------------------------------------------------------------+ +| ``IsTruncated`` | Boolean | If ``true``, only a subset of the object's upload contents were returned. | ++-----------------------------------------+-------------+----------------------------------------------------------------------------------------------------------+ +| ``Part`` | Container | A container for ``LastModified``, ``PartNumber``, ``ETag`` and ``Size`` elements. | ++-----------------------------------------+-------------+----------------------------------------------------------------------------------------------------------+ +| ``LastModified`` | Date | Date and time at which the part was uploaded. | ++-----------------------------------------+-------------+----------------------------------------------------------------------------------------------------------+ +| ``PartNumber`` | Integer | The identification number of the part. | ++-----------------------------------------+-------------+----------------------------------------------------------------------------------------------------------+ +| ``ETag`` | String | The part's entity tag. | ++-----------------------------------------+-------------+----------------------------------------------------------------------------------------------------------+ +| ``Size`` | Integer | The size of the uploaded part. | ++-----------------------------------------+-------------+----------------------------------------------------------------------------------------------------------+ + + + +Complete Multipart Upload +------------------------- +Assembles uploaded parts and creates a new object, thereby completing a multipart upload. + +Syntax +~~~~~~ + +:: + + POST /{bucket}/{object}?uploadId= HTTP/1.1 + +Request Entities +~~~~~~~~~~~~~~~~ + ++----------------------------------+-------------+-----------------------------------------------------+----------+ +| Name | Type | Description | Required | ++==================================+=============+=====================================================+==========+ +| ``CompleteMultipartUpload`` | Container | A container consisting of one or more parts. | Yes | ++----------------------------------+-------------+-----------------------------------------------------+----------+ +| ``Part`` | Container | A container for the ``PartNumber`` and ``ETag``. | Yes | ++----------------------------------+-------------+-----------------------------------------------------+----------+ +| ``PartNumber`` | Integer | The identifier of the part. | Yes | ++----------------------------------+-------------+-----------------------------------------------------+----------+ +| ``ETag`` | String | The part's entity tag. | Yes | ++----------------------------------+-------------+-----------------------------------------------------+----------+ + + +Response Entities +~~~~~~~~~~~~~~~~~ + ++-------------------------------------+-------------+-------------------------------------------------------+ +| Name | Type | Description | ++=====================================+=============+=======================================================+ +| **CompleteMultipartUploadResult** | Container | A container for the response. | ++-------------------------------------+-------------+-------------------------------------------------------+ +| **Location** | URI | The resource identifier (path) of the new object. | ++-------------------------------------+-------------+-------------------------------------------------------+ +| **Bucket** | String | The name of the bucket that contains the new object. | ++-------------------------------------+-------------+-------------------------------------------------------+ +| **Key** | String | The object's key. | ++-------------------------------------+-------------+-------------------------------------------------------+ +| **ETag** | String | The entity tag of the new object. | ++-------------------------------------+-------------+-------------------------------------------------------+ + +Abort Multipart Upload +---------------------- + +Syntax +~~~~~~ + +:: + + DELETE /{bucket}/{object}?uploadId= HTTP/1.1 + + + +Append Object +------------- +Append data to an object. You must have write permissions on the bucket to perform this operation. +It is used to upload files in appending mode. The type of the objects created by the Append Object +operation is Appendable Object, and the type of the objects uploaded with the Put Object operation is Normal Object. +**Append Object can't be used if bucket versioning is enabled or suspended.** +**Synced object will become normal in multisite, but you can still append to the original object.** +**Compression and encryption features are disabled for Appendable objects.** + +Syntax +~~~~~~ + +:: + + PUT /{bucket}/{object}?append&position= HTTP/1.1 + +Request Headers +~~~~~~~~~~~~~~~ + ++----------------------+--------------------------------------------+-------------------------------------------------------------------------------+------------+ +| Name | Description | Valid Values | Required | ++======================+============================================+===============================================================================+============+ +| **content-md5** | A base64 encoded MD-5 hash of the message. | A string. No defaults or constraints. | No | ++----------------------+--------------------------------------------+-------------------------------------------------------------------------------+------------+ +| **content-type** | A standard MIME type. | Any MIME type. Default: ``binary/octet-stream`` | No | ++----------------------+--------------------------------------------+-------------------------------------------------------------------------------+------------+ +| **x-amz-meta-<...>** | User metadata. Stored with the object. | A string up to 8kb. No defaults. | No | ++----------------------+--------------------------------------------+-------------------------------------------------------------------------------+------------+ +| **x-amz-acl** | A canned ACL. | ``private``, ``public-read``, ``public-read-write``, ``authenticated-read`` | No | ++----------------------+--------------------------------------------+-------------------------------------------------------------------------------+------------+ + +Response Headers +~~~~~~~~~~~~~~~~ + ++--------------------------------+------------------------------------------------------------------+ +| Name | Description | ++================================+==================================================================+ +| **x-rgw-next-append-position** | Next position to append object | ++--------------------------------+------------------------------------------------------------------+ + +HTTP Response +~~~~~~~~~~~~~ + +The following HTTP response may be returned: + ++---------------+----------------------------+---------------------------------------------------+ +| HTTP Status | Status Code | Description | ++===============+============================+===================================================+ +| **409** | PositionNotEqualToLength | Specified position does not match object length | ++---------------+----------------------------+---------------------------------------------------+ +| **409** | ObjectNotAppendable | Specified object can not be appended | ++---------------+----------------------------+---------------------------------------------------+ +| **409** | InvalidBucketstate | Bucket versioning is enabled or suspended | ++---------------+----------------------------+---------------------------------------------------+ + + +Put Object Retention +-------------------- +Places an Object Retention configuration on an object. + +Syntax +~~~~~~ + +:: + + PUT /{bucket}/{object}?retention&versionId= HTTP/1.1 + +Request Entities +~~~~~~~~~~~~~~~~ + ++---------------------+-------------+-------------------------------------------------------------------------------+------------+ +| Name | Type | Description | Required | ++=====================+=============+===============================================================================+============+ +| ``Retention`` | Container | A container for the request. | Yes | ++---------------------+-------------+-------------------------------------------------------------------------------+------------+ +| ``Mode`` | String | Retention mode for the specified object. Valid Values: GOVERNANCE/COMPLIANCE | Yes | ++---------------------+-------------+--------------------------------------------------------------------------------------------+ +| ``RetainUntilDate`` | Timestamp | Retention date. Format: 2020-01-05T00:00:00.000Z | Yes | ++---------------------+-------------+--------------------------------------------------------------------------------------------+ + + +Get Object Retention +-------------------- +Gets an Object Retention configuration on an object. + + +Syntax +~~~~~~ + +:: + + GET /{bucket}/{object}?retention&versionId= HTTP/1.1 + +Response Entities +~~~~~~~~~~~~~~~~~ + ++---------------------+-------------+-------------------------------------------------------------------------------+------------+ +| Name | Type | Description | Required | ++=====================+=============+===============================================================================+============+ +| ``Retention`` | Container | A container for the request. | Yes | ++---------------------+-------------+-------------------------------------------------------------------------------+------------+ +| ``Mode`` | String | Retention mode for the specified object. Valid Values: GOVERNANCE/COMPLIANCE | Yes | ++---------------------+-------------+--------------------------------------------------------------------------------------------+ +| ``RetainUntilDate`` | Timestamp | Retention date. Format: 2020-01-05T00:00:00.000Z | Yes | ++---------------------+-------------+--------------------------------------------------------------------------------------------+ + + +Put Object Legal Hold +--------------------- +Applies a Legal Hold configuration to the specified object. + +Syntax +~~~~~~ + +:: + + PUT /{bucket}/{object}?legal-hold&versionId= HTTP/1.1 + +Request Entities +~~~~~~~~~~~~~~~~ + ++----------------+-------------+----------------------------------------------------------------------------------------+------------+ +| Name | Type | Description | Required | ++================+=============+========================================================================================+============+ +| ``LegalHold`` | Container | A container for the request. | Yes | ++----------------+-------------+----------------------------------------------------------------------------------------+------------+ +| ``Status`` | String | Indicates whether the specified object has a Legal Hold in place. Valid Values: ON/OFF | Yes | ++----------------+-------------+----------------------------------------------------------------------------------------+------------+ + + +Get Object Legal Hold +--------------------- +Gets an object's current Legal Hold status. + +Syntax +~~~~~~ + +:: + + GET /{bucket}/{object}?legal-hold&versionId= HTTP/1.1 + +Response Entities +~~~~~~~~~~~~~~~~~ + ++----------------+-------------+----------------------------------------------------------------------------------------+------------+ +| Name | Type | Description | Required | ++================+=============+========================================================================================+============+ +| ``LegalHold`` | Container | A container for the request. | Yes | ++----------------+-------------+----------------------------------------------------------------------------------------+------------+ +| ``Status`` | String | Indicates whether the specified object has a Legal Hold in place. Valid Values: ON/OFF | Yes | ++----------------+-------------+----------------------------------------------------------------------------------------+------------+ + diff --git a/doc/radosgw/s3/perl.rst b/doc/radosgw/s3/perl.rst new file mode 100644 index 000000000..f12e5c698 --- /dev/null +++ b/doc/radosgw/s3/perl.rst @@ -0,0 +1,192 @@ +.. _perl: + +Perl S3 Examples +================ + +Creating a Connection +--------------------- + +This creates a connection so that you can interact with the server. + +.. code-block:: perl + + use Amazon::S3; + my $access_key = 'put your access key here!'; + my $secret_key = 'put your secret key here!'; + + my $conn = Amazon::S3->new({ + aws_access_key_id => $access_key, + aws_secret_access_key => $secret_key, + host => 'objects.dreamhost.com', + secure => 1, + retry => 1, + }); + + +Listing Owned Buckets +--------------------- + +This gets a list of `Amazon::S3::Bucket`_ objects that you own. +We'll also print out the bucket name and creation date of each bucket. + +.. code-block:: perl + + my @buckets = @{$conn->buckets->{buckets} || []}; + foreach my $bucket (@buckets) { + print $bucket->bucket . "\t" . $bucket->creation_date . "\n"; + } + +The output will look something like this:: + + mahbuckat1 2011-04-21T18:05:39.000Z + mahbuckat2 2011-04-21T18:05:48.000Z + mahbuckat3 2011-04-21T18:07:18.000Z + + +Creating a Bucket +----------------- + +This creates a new bucket called ``my-new-bucket`` + +.. code-block:: perl + + my $bucket = $conn->add_bucket({ bucket => 'my-new-bucket' }); + + +Listing a Bucket's Content +-------------------------- + +This gets a list of hashes with info about each object in the bucket. +We'll also print out each object's name, the file size, and last +modified date. + +.. code-block:: perl + + my @keys = @{$bucket->list_all->{keys} || []}; + foreach my $key (@keys) { + print "$key->{key}\t$key->{size}\t$key->{last_modified}\n"; + } + +The output will look something like this:: + + myphoto1.jpg 251262 2011-08-08T21:35:48.000Z + myphoto2.jpg 262518 2011-08-08T21:38:01.000Z + + +Deleting a Bucket +----------------- + +.. note:: + The Bucket must be empty! Otherwise it won't work! + +.. code-block:: perl + + $conn->delete_bucket($bucket); + + +Forced Delete for Non-empty Buckets +----------------------------------- + +.. attention:: + + not available in the `Amazon::S3`_ perl module + + +Creating an Object +------------------ + +This creates a file ``hello.txt`` with the string ``"Hello World!"`` + +.. code-block:: perl + + $bucket->add_key( + 'hello.txt', 'Hello World!', + { content_type => 'text/plain' }, + ); + + +Change an Object's ACL +---------------------- + +This makes the object ``hello.txt`` to be publicly readable and +``secret_plans.txt`` to be private. + +.. code-block:: perl + + $bucket->set_acl({ + key => 'hello.txt', + acl_short => 'public-read', + }); + $bucket->set_acl({ + key => 'secret_plans.txt', + acl_short => 'private', + }); + + +Download an Object (to a file) +------------------------------ + +This downloads the object ``perl_poetry.pdf`` and saves it in +``/home/larry/documents/`` + +.. code-block:: perl + + $bucket->get_key_filename('perl_poetry.pdf', undef, + '/home/larry/documents/perl_poetry.pdf'); + + +Delete an Object +---------------- + +This deletes the object ``goodbye.txt`` + +.. code-block:: perl + + $bucket->delete_key('goodbye.txt'); + +Generate Object Download URLs (signed and unsigned) +--------------------------------------------------- +This generates an unsigned download URL for ``hello.txt``. This works +because we made ``hello.txt`` public by setting the ACL above. +Then this generates a signed download URL for ``secret_plans.txt`` that +will work for 1 hour. Signed download URLs will work for the time +period even if the object is private (when the time period is up, the +URL will stop working). + +.. note:: + The `Amazon::S3`_ module does not have a way to generate download + URLs, so we are going to be using another module instead. Unfortunately, + most modules for generating these URLs assume that you are using Amazon, + so we have had to go with using a more obscure module, `Muck::FS::S3`_. This + should be the same as Amazon's sample S3 perl module, but this sample + module is not in CPAN. So, you can either use CPAN to install + `Muck::FS::S3`_, or install Amazon's sample S3 module manually. If you go + the manual route, you can remove ``Muck::FS::`` from the example below. + +.. code-block:: perl + + use Muck::FS::S3::QueryStringAuthGenerator; + my $generator = Muck::FS::S3::QueryStringAuthGenerator->new( + $access_key, + $secret_key, + 0, # 0 means use 'http'. set this to 1 for 'https' + 'objects.dreamhost.com', + ); + + my $hello_url = $generator->make_bare_url($bucket->bucket, 'hello.txt'); + print $hello_url . "\n"; + + $generator->expires_in(3600); # 1 hour = 3600 seconds + my $plans_url = $generator->get($bucket->bucket, 'secret_plans.txt'); + print $plans_url . "\n"; + +The output will look something like this:: + + http://objects.dreamhost.com:80/my-bucket-name/hello.txt + http://objects.dreamhost.com:80/my-bucket-name/secret_plans.txt?Signature=XXXXXXXXXXXXXXXXXXXXXXXXXXX&Expires=1316027075&AWSAccessKeyId=XXXXXXXXXXXXXXXXXXX + + +.. _`Amazon::S3`: http://search.cpan.org/~tima/Amazon-S3-0.441/lib/Amazon/S3.pm +.. _`Amazon::S3::Bucket`: http://search.cpan.org/~tima/Amazon-S3-0.441/lib/Amazon/S3/Bucket.pm +.. _`Muck::FS::S3`: http://search.cpan.org/~mike/Muck-0.02/ + diff --git a/doc/radosgw/s3/php.rst b/doc/radosgw/s3/php.rst new file mode 100644 index 000000000..4878a3489 --- /dev/null +++ b/doc/radosgw/s3/php.rst @@ -0,0 +1,214 @@ +.. _php: + +PHP S3 Examples +=============== + +Installing AWS PHP SDK +---------------------- + +This installs AWS PHP SDK using composer (see here_ how to install composer). + +.. _here: https://getcomposer.org/download/ + +.. code-block:: bash + + $ composer install aws/aws-sdk-php + +Creating a Connection +--------------------- + +This creates a connection so that you can interact with the server. + +.. note:: + + The client initialization requires a region so we use ``''``. + +.. code-block:: php + + <?php + + use Aws\S3\S3Client; + + define('AWS_KEY', 'place access key here'); + define('AWS_SECRET_KEY', 'place secret key here'); + $ENDPOINT = 'http://objects.dreamhost.com'; + + // require the amazon sdk from your composer vendor dir + require __DIR__.'/vendor/autoload.php'; + + // Instantiate the S3 class and point it at the desired host + $client = new S3Client([ + 'region' => '', + 'version' => '2006-03-01', + 'endpoint' => $ENDPOINT, + 'credentials' => [ + 'key' => AWS_KEY, + 'secret' => AWS_SECRET_KEY + ], + // Set the S3 class to use objects.dreamhost.com/bucket + // instead of bucket.objects.dreamhost.com + 'use_path_style_endpoint' => true + ]); + +Listing Owned Buckets +--------------------- +This gets a ``AWS\Result`` instance that is more convenient to visit using array access way. +This also prints out the bucket name and creation date of each bucket. + +.. code-block:: php + + <?php + $listResponse = $client->listBuckets(); + $buckets = $listResponse['Buckets']; + foreach ($buckets as $bucket) { + echo $bucket['Name'] . "\t" . $bucket['CreationDate'] . "\n"; + } + +The output will look something like this:: + + mahbuckat1 2011-04-21T18:05:39.000Z + mahbuckat2 2011-04-21T18:05:48.000Z + mahbuckat3 2011-04-21T18:07:18.000Z + + +Creating a Bucket +----------------- + +This creates a new bucket called ``my-new-bucket`` and returns a +``AWS\Result`` object. + +.. code-block:: php + + <?php + $client->createBucket(['Bucket' => 'my-new-bucket']); + + +List a Bucket's Content +----------------------- + +This gets a ``AWS\Result`` instance that is more convenient to visit using array access way. +This then prints out each object's name, the file size, and last modified date. + +.. code-block:: php + + <?php + $objectsListResponse = $client->listObjects(['Bucket' => $bucketname]); + $objects = $objectsListResponse['Contents'] ?? []; + foreach ($objects as $object) { + echo $object['Key'] . "\t" . $object['Size'] . "\t" . $object['LastModified'] . "\n"; + } + +.. note:: + + If there are more than 1000 objects in this bucket, + you need to check $objectsListResponse['isTruncated'] + and run again with the name of the last key listed. + Keep doing this until isTruncated is not true. + +The output will look something like this if the bucket has some files:: + + myphoto1.jpg 251262 2011-08-08T21:35:48.000Z + myphoto2.jpg 262518 2011-08-08T21:38:01.000Z + + +Deleting a Bucket +----------------- + +This deletes the bucket called ``my-old-bucket`` and returns a +``AWS\Result`` object + +.. note:: + + The Bucket must be empty! Otherwise it won't work! + +.. code-block:: php + + <?php + $client->deleteBucket(['Bucket' => 'my-old-bucket']); + + +Creating an Object +------------------ + +This creates an object ``hello.txt`` with the string ``"Hello World!"`` + +.. code-block:: php + + <?php + $client->putObject([ + 'Bucket' => 'my-bucket-name', + 'Key' => 'hello.txt', + 'Body' => "Hello World!" + ]); + + +Change an Object's ACL +---------------------- + +This makes the object ``hello.txt`` to be publicly readable and +``secret_plans.txt`` to be private. + +.. code-block:: php + + <?php + $client->putObjectAcl([ + 'Bucket' => 'my-bucket-name', + 'Key' => 'hello.txt', + 'ACL' => 'public-read' + ]); + $client->putObjectAcl([ + 'Bucket' => 'my-bucket-name', + 'Key' => 'secret_plans.txt', + 'ACL' => 'private' + ]); + + +Delete an Object +---------------- + +This deletes the object ``goodbye.txt`` + +.. code-block:: php + + <?php + $client->deleteObject(['Bucket' => 'my-bucket-name', 'Key' => 'goodbye.txt']); + + +Download an Object (to a file) +------------------------------ + +This downloads the object ``poetry.pdf`` and saves it in +``/home/larry/documents/`` + +.. code-block:: php + + <?php + $object = $client->getObject(['Bucket' => 'my-bucket-name', 'Key' => 'poetry.pdf']); + file_put_contents('/home/larry/documents/poetry.pdf', $object['Body']->getContents()); + +Generate Object Download URLs (signed and unsigned) +--------------------------------------------------- + +This generates an unsigned download URL for ``hello.txt``. +This works because we made ``hello.txt`` public by setting +the ACL above. This then generates a signed download URL +for ``secret_plans.txt`` that will work for 1 hour. +Signed download URLs will work for the time period even +if the object is private (when the time period is up, +the URL will stop working). + +.. code-block:: php + + <?php + $hello_url = $client->getObjectUrl('my-bucket-name', 'hello.txt'); + echo $hello_url."\n"; + + $secret_plans_cmd = $client->getCommand('GetObject', ['Bucket' => 'my-bucket-name', 'Key' => 'secret_plans.txt']); + $request = $client->createPresignedRequest($secret_plans_cmd, '+1 hour'); + echo $request->getUri()."\n"; + +The output of this will look something like:: + + http://objects.dreamhost.com/my-bucket-name/hello.txt + http://objects.dreamhost.com/my-bucket-name/secret_plans.txt?X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=sandboxAccessKey%2F20190116%2F%2Fs3%2Faws4_request&X-Amz-Date=20190116T125520Z&X-Amz-SignedHeaders=host&X-Amz-Expires=3600&X-Amz-Signature=61921f07c73d7695e47a2192cf55ae030f34c44c512b2160bb5a936b2b48d923 + diff --git a/doc/radosgw/s3/python.rst b/doc/radosgw/s3/python.rst new file mode 100644 index 000000000..4b9faef11 --- /dev/null +++ b/doc/radosgw/s3/python.rst @@ -0,0 +1,186 @@ +.. _python: + +Python S3 Examples +================== + +Creating a Connection +--------------------- + +This creates a connection so that you can interact with the server. + +.. code-block:: python + + import boto + import boto.s3.connection + access_key = 'put your access key here!' + secret_key = 'put your secret key here!' + + conn = boto.connect_s3( + aws_access_key_id = access_key, + aws_secret_access_key = secret_key, + host = 'objects.dreamhost.com', + #is_secure=False, # uncomment if you are not using ssl + calling_format = boto.s3.connection.OrdinaryCallingFormat(), + ) + + +Listing Owned Buckets +--------------------- + +This gets a list of Buckets that you own. +This also prints out the bucket name and creation date of each bucket. + +.. code-block:: python + + for bucket in conn.get_all_buckets(): + print "{name}\t{created}".format( + name = bucket.name, + created = bucket.creation_date, + ) + +The output will look something like this:: + + mahbuckat1 2011-04-21T18:05:39.000Z + mahbuckat2 2011-04-21T18:05:48.000Z + mahbuckat3 2011-04-21T18:07:18.000Z + + +Creating a Bucket +----------------- + +This creates a new bucket called ``my-new-bucket`` + +.. code-block:: python + + bucket = conn.create_bucket('my-new-bucket') + + +Listing a Bucket's Content +-------------------------- + +This gets a list of objects in the bucket. +This also prints out each object's name, the file size, and last +modified date. + +.. code-block:: python + + for key in bucket.list(): + print "{name}\t{size}\t{modified}".format( + name = key.name, + size = key.size, + modified = key.last_modified, + ) + +The output will look something like this:: + + myphoto1.jpg 251262 2011-08-08T21:35:48.000Z + myphoto2.jpg 262518 2011-08-08T21:38:01.000Z + + +Deleting a Bucket +----------------- + +.. note:: + + The Bucket must be empty! Otherwise it won't work! + +.. code-block:: python + + conn.delete_bucket(bucket.name) + + +Forced Delete for Non-empty Buckets +----------------------------------- + +.. attention:: + + not available in python + + +Creating an Object +------------------ + +This creates a file ``hello.txt`` with the string ``"Hello World!"`` + +.. code-block:: python + + key = bucket.new_key('hello.txt') + key.set_contents_from_string('Hello World!') + + +Change an Object's ACL +---------------------- + +This makes the object ``hello.txt`` to be publicly readable, and +``secret_plans.txt`` to be private. + +.. code-block:: python + + hello_key = bucket.get_key('hello.txt') + hello_key.set_canned_acl('public-read') + plans_key = bucket.get_key('secret_plans.txt') + plans_key.set_canned_acl('private') + + +Download an Object (to a file) +------------------------------ + +This downloads the object ``perl_poetry.pdf`` and saves it in +``/home/larry/documents/`` + +.. code-block:: python + + key = bucket.get_key('perl_poetry.pdf') + key.get_contents_to_filename('/home/larry/documents/perl_poetry.pdf') + + +Delete an Object +---------------- + +This deletes the object ``goodbye.txt`` + +.. code-block:: python + + bucket.delete_key('goodbye.txt') + + +Generate Object Download URLs (signed and unsigned) +--------------------------------------------------- + +This generates an unsigned download URL for ``hello.txt``. This works +because we made ``hello.txt`` public by setting the ACL above. +This then generates a signed download URL for ``secret_plans.txt`` that +will work for 1 hour. Signed download URLs will work for the time +period even if the object is private (when the time period is up, the +URL will stop working). + +.. code-block:: python + + hello_key = bucket.get_key('hello.txt') + hello_url = hello_key.generate_url(0, query_auth=False, force_http=True) + print hello_url + + plans_key = bucket.get_key('secret_plans.txt') + plans_url = plans_key.generate_url(3600, query_auth=True, force_http=True) + print plans_url + +The output of this will look something like:: + + http://objects.dreamhost.com/my-bucket-name/hello.txt + http://objects.dreamhost.com/my-bucket-name/secret_plans.txt?Signature=XXXXXXXXXXXXXXXXXXXXXXXXXXX&Expires=1316027075&AWSAccessKeyId=XXXXXXXXXXXXXXXXXXX + +Using S3 API Extensions +----------------------- + +To use the boto3 client to tests the RadosGW extensions to the S3 API, the `extensions file`_ should be placed under: ``~/.aws/models/s3/2006-03-01/`` directory. +For example, unordered list of objects could be fetched using: + +.. code-block:: python + + print conn.list_objects(Bucket='my-new-bucket', AllowUnordered=True) + + +Without the extensions file, in the above example, boto3 would complain that the ``AllowUnordered`` argument is invalid. + + +.. _extensions file: https://github.com/ceph/ceph/blob/master/examples/boto3/service-2.sdk-extras.json diff --git a/doc/radosgw/s3/ruby.rst b/doc/radosgw/s3/ruby.rst new file mode 100644 index 000000000..435b3c630 --- /dev/null +++ b/doc/radosgw/s3/ruby.rst @@ -0,0 +1,364 @@ +.. _ruby: + +Ruby `AWS::SDK`_ Examples (aws-sdk gem ~>2) +=========================================== + +Settings +--------------------- + +You can setup the connection on global way: + +.. code-block:: ruby + + Aws.config.update( + endpoint: 'https://objects.dreamhost.com.', + access_key_id: 'my-access-key', + secret_access_key: 'my-secret-key', + force_path_style: true, + region: 'us-east-1' + ) + + +and instantiate a client object: + +.. code-block:: ruby + + s3_client = Aws::S3::Client.new + +Listing Owned Buckets +--------------------- + +This gets a list of buckets that you own. +This also prints out the bucket name and creation date of each bucket. + +.. code-block:: ruby + + s3_client.list_buckets.buckets.each do |bucket| + puts "#{bucket.name}\t#{bucket.creation_date}" + end + +The output will look something like this:: + + mahbuckat1 2011-04-21T18:05:39.000Z + mahbuckat2 2011-04-21T18:05:48.000Z + mahbuckat3 2011-04-21T18:07:18.000Z + + +Creating a Bucket +----------------- + +This creates a new bucket called ``my-new-bucket`` + +.. code-block:: ruby + + s3_client.create_bucket(bucket: 'my-new-bucket') + +If you want a private bucket: + +`acl` option accepts: # private, public-read, public-read-write, authenticated-read + +.. code-block:: ruby + + s3_client.create_bucket(bucket: 'my-new-bucket', acl: 'private') + + +Listing a Bucket's Content +-------------------------- + +This gets a list of hashes with the contents of each object +This also prints out each object's name, the file size, and last +modified date. + +.. code-block:: ruby + + s3_client.get_objects(bucket: 'my-new-bucket').contents.each do |object| + puts "#{object.key}\t#{object.size}\t#{object.last-modified}" + end + +The output will look something like this if the bucket has some files:: + + myphoto1.jpg 251262 2011-08-08T21:35:48.000Z + myphoto2.jpg 262518 2011-08-08T21:38:01.000Z + + +Deleting a Bucket +----------------- +.. note:: + The Bucket must be empty! Otherwise it won't work! + +.. code-block:: ruby + + s3_client.delete_bucket(bucket: 'my-new-bucket') + + +Forced Delete for Non-empty Buckets +----------------------------------- +First, you need to clear the bucket: + +.. code-block:: ruby + + Aws::S3::Bucket.new('my-new-bucket', client: s3_client).clear! + +after, you can destroy the bucket + +.. code-block:: ruby + + s3_client.delete_bucket(bucket: 'my-new-bucket') + + +Creating an Object +------------------ + +This creates a file ``hello.txt`` with the string ``"Hello World!"`` + +.. code-block:: ruby + + s3_client.put_object( + key: 'hello.txt', + body: 'Hello World!', + bucket: 'my-new-bucket', + content_type: 'text/plain' + ) + + +Change an Object's ACL +---------------------- + +This makes the object ``hello.txt`` to be publicly readable, and ``secret_plans.txt`` +to be private. + +.. code-block:: ruby + + s3_client.put_object_acl(bucket: 'my-new-bucket', key: 'hello.txt', acl: 'public-read') + + s3_client.put_object_acl(bucket: 'my-new-bucket', key: 'private.txt', acl: 'private') + + +Download an Object (to a file) +------------------------------ + +This downloads the object ``poetry.pdf`` and saves it in +``/home/larry/documents/`` + +.. code-block:: ruby + + s3_client.get_object(bucket: 'my-new-bucket', key: 'poetry.pdf', response_target: '/home/larry/documents/poetry.pdf') + + +Delete an Object +---------------- + +This deletes the object ``goodbye.txt`` + +.. code-block:: ruby + + s3_client.delete_object(key: 'goodbye.txt', bucket: 'my-new-bucket') + + +Generate Object Download URLs (signed and unsigned) +--------------------------------------------------- + +This generates an unsigned download URL for ``hello.txt``. This works +because we made ``hello.txt`` public by setting the ACL above. +This then generates a signed download URL for ``secret_plans.txt`` that +will work for 1 hour. Signed download URLs will work for the time +period even if the object is private (when the time period is up, the +URL will stop working). + +.. code-block:: ruby + + puts Aws::S3::Object.new( + key: 'hello.txt', + bucket_name: 'my-new-bucket', + client: s3_client + ).public_url + + puts Aws::S3::Object.new( + key: 'secret_plans.txt', + bucket_name: 'hermes_ceph_gem', + client: s3_client + ).presigned_url(:get, expires_in: 60 * 60) + +The output of this will look something like:: + + http://objects.dreamhost.com/my-bucket-name/hello.txt + http://objects.dreamhost.com/my-bucket-name/secret_plans.txt?Signature=XXXXXXXXXXXXXXXXXXXXXXXXXXX&Expires=1316027075&AWSAccessKeyId=XXXXXXXXXXXXXXXXXXX + +.. _`AWS::SDK`: http://docs.aws.amazon.com/sdkforruby/api/Aws/S3/Client.html + + + +Ruby `AWS::S3`_ Examples (aws-s3 gem) +===================================== + +Creating a Connection +--------------------- + +This creates a connection so that you can interact with the server. + +.. code-block:: ruby + + AWS::S3::Base.establish_connection!( + :server => 'objects.dreamhost.com', + :use_ssl => true, + :access_key_id => 'my-access-key', + :secret_access_key => 'my-secret-key' + ) + + +Listing Owned Buckets +--------------------- + +This gets a list of `AWS::S3::Bucket`_ objects that you own. +This also prints out the bucket name and creation date of each bucket. + +.. code-block:: ruby + + AWS::S3::Service.buckets.each do |bucket| + puts "#{bucket.name}\t#{bucket.creation_date}" + end + +The output will look something like this:: + + mahbuckat1 2011-04-21T18:05:39.000Z + mahbuckat2 2011-04-21T18:05:48.000Z + mahbuckat3 2011-04-21T18:07:18.000Z + + +Creating a Bucket +----------------- + +This creates a new bucket called ``my-new-bucket`` + +.. code-block:: ruby + + AWS::S3::Bucket.create('my-new-bucket') + + +Listing a Bucket's Content +-------------------------- + +This gets a list of hashes with the contents of each object +This also prints out each object's name, the file size, and last +modified date. + +.. code-block:: ruby + + new_bucket = AWS::S3::Bucket.find('my-new-bucket') + new_bucket.each do |object| + puts "#{object.key}\t#{object.about['content-length']}\t#{object.about['last-modified']}" + end + +The output will look something like this if the bucket has some files:: + + myphoto1.jpg 251262 2011-08-08T21:35:48.000Z + myphoto2.jpg 262518 2011-08-08T21:38:01.000Z + + +Deleting a Bucket +----------------- +.. note:: + The Bucket must be empty! Otherwise it won't work! + +.. code-block:: ruby + + AWS::S3::Bucket.delete('my-new-bucket') + + +Forced Delete for Non-empty Buckets +----------------------------------- + +.. code-block:: ruby + + AWS::S3::Bucket.delete('my-new-bucket', :force => true) + + +Creating an Object +------------------ + +This creates a file ``hello.txt`` with the string ``"Hello World!"`` + +.. code-block:: ruby + + AWS::S3::S3Object.store( + 'hello.txt', + 'Hello World!', + 'my-new-bucket', + :content_type => 'text/plain' + ) + + +Change an Object's ACL +---------------------- + +This makes the object ``hello.txt`` to be publicly readable, and ``secret_plans.txt`` +to be private. + +.. code-block:: ruby + + policy = AWS::S3::S3Object.acl('hello.txt', 'my-new-bucket') + policy.grants = [ AWS::S3::ACL::Grant.grant(:public_read) ] + AWS::S3::S3Object.acl('hello.txt', 'my-new-bucket', policy) + + policy = AWS::S3::S3Object.acl('secret_plans.txt', 'my-new-bucket') + policy.grants = [] + AWS::S3::S3Object.acl('secret_plans.txt', 'my-new-bucket', policy) + + +Download an Object (to a file) +------------------------------ + +This downloads the object ``poetry.pdf`` and saves it in +``/home/larry/documents/`` + +.. code-block:: ruby + + open('/home/larry/documents/poetry.pdf', 'w') do |file| + AWS::S3::S3Object.stream('poetry.pdf', 'my-new-bucket') do |chunk| + file.write(chunk) + end + end + + +Delete an Object +---------------- + +This deletes the object ``goodbye.txt`` + +.. code-block:: ruby + + AWS::S3::S3Object.delete('goodbye.txt', 'my-new-bucket') + + +Generate Object Download URLs (signed and unsigned) +--------------------------------------------------- + +This generates an unsigned download URL for ``hello.txt``. This works +because we made ``hello.txt`` public by setting the ACL above. +This then generates a signed download URL for ``secret_plans.txt`` that +will work for 1 hour. Signed download URLs will work for the time +period even if the object is private (when the time period is up, the +URL will stop working). + +.. code-block:: ruby + + puts AWS::S3::S3Object.url_for( + 'hello.txt', + 'my-new-bucket', + :authenticated => false + ) + + puts AWS::S3::S3Object.url_for( + 'secret_plans.txt', + 'my-new-bucket', + :expires_in => 60 * 60 + ) + +The output of this will look something like:: + + http://objects.dreamhost.com/my-bucket-name/hello.txt + http://objects.dreamhost.com/my-bucket-name/secret_plans.txt?Signature=XXXXXXXXXXXXXXXXXXXXXXXXXXX&Expires=1316027075&AWSAccessKeyId=XXXXXXXXXXXXXXXXXXX + +.. _`AWS::S3`: http://amazon.rubyforge.org/ +.. _`AWS::S3::Bucket`: http://amazon.rubyforge.org/doc/ + diff --git a/doc/radosgw/s3/serviceops.rst b/doc/radosgw/s3/serviceops.rst new file mode 100644 index 000000000..54b6ca375 --- /dev/null +++ b/doc/radosgw/s3/serviceops.rst @@ -0,0 +1,69 @@ +Service Operations +================== + +List Buckets +------------ +``GET /`` returns a list of buckets created by the user making the request. ``GET /`` only +returns buckets created by an authenticated user. You cannot make an anonymous request. + +Syntax +~~~~~~ +:: + + GET / HTTP/1.1 + Host: cname.domain.com + + Authorization: AWS {access-key}:{hash-of-header-and-secret} + +Response Entities +~~~~~~~~~~~~~~~~~ + ++----------------------------+-------------+-----------------------------------------------------------------+ +| Name | Type | Description | ++============================+=============+=================================================================+ +| ``Buckets`` | Container | Container for list of buckets. | ++----------------------------+-------------+-----------------------------------------------------------------+ +| ``Bucket`` | Container | Container for bucket information. | ++----------------------------+-------------+-----------------------------------------------------------------+ +| ``Name`` | String | Bucket name. | ++----------------------------+-------------+-----------------------------------------------------------------+ +| ``CreationDate`` | Date | UTC time when the bucket was created. | ++----------------------------+-------------+-----------------------------------------------------------------+ +| ``ListAllMyBucketsResult`` | Container | A container for the result. | ++----------------------------+-------------+-----------------------------------------------------------------+ +| ``Owner`` | Container | A container for the bucket owner's ``ID`` and ``DisplayName``. | ++----------------------------+-------------+-----------------------------------------------------------------+ +| ``ID`` | String | The bucket owner's ID. | ++----------------------------+-------------+-----------------------------------------------------------------+ +| ``DisplayName`` | String | The bucket owner's display name. | ++----------------------------+-------------+-----------------------------------------------------------------+ + + +Get Usage Stats +--------------- + +Gets usage stats per user, similar to the admin command :ref:`rgw_user_usage_stats`. + +Syntax +~~~~~~ +:: + + GET /?usage HTTP/1.1 + Host: cname.domain.com + + Authorization: AWS {access-key}:{hash-of-header-and-secret} + +Response Entities +~~~~~~~~~~~~~~~~~ + ++----------------------------+-------------+-----------------------------------------------------------------+ +| Name | Type | Description | ++============================+=============+=================================================================+ +| ``Summary`` | Container | Summary of total stats by user. | ++----------------------------+-------------+-----------------------------------------------------------------+ +| ``TotalBytes`` | Integer | Bytes used by user | ++----------------------------+-------------+-----------------------------------------------------------------+ +| ``TotalBytesRounded`` | Integer | Bytes rounded to the nearest 4k boundary | ++----------------------------+-------------+-----------------------------------------------------------------+ +| ``TotalEntries`` | Integer | Total object entries | ++----------------------------+-------------+-----------------------------------------------------------------+ diff --git a/doc/radosgw/s3select.rst b/doc/radosgw/s3select.rst new file mode 100644 index 000000000..fc7ae339c --- /dev/null +++ b/doc/radosgw/s3select.rst @@ -0,0 +1,267 @@ +=============== + Ceph s3 select +=============== + +.. contents:: + +Overview +-------- + + | The purpose of the **s3 select** engine is to create an efficient pipe between user client and storage nodes (the engine should be close as possible to storage). + | It enables selection of a restricted subset of (structured) data stored in an S3 object using an SQL-like syntax. + | It also enables for higher level analytic-applications (such as SPARK-SQL) , using that feature to improve their latency and throughput. + + | For example, a s3-object of several GB (CSV file), a user needs to extract a single column which filtered by another column. + | As the following query: + | ``select customer-id from s3Object where age>30 and age<65;`` + + | Currently the whole s3-object must retrieve from OSD via RGW before filtering and extracting data. + | By "pushing down" the query into OSD , it's possible to save a lot of network and CPU(serialization / deserialization). + + | **The bigger the object, and the more accurate the query, the better the performance**. + +Basic workflow +-------------- + + | S3-select query is sent to RGW via `AWS-CLI <https://docs.aws.amazon.com/cli/latest/reference/s3api/select-object-content.html>`_ + + | It passes the authentication and permission process as an incoming message (POST). + | **RGWSelectObj_ObjStore_S3::send_response_data** is the “entry point”, it handles each fetched chunk according to input object-key. + | **send_response_data** is first handling the input query, it extracts the query and other CLI parameters. + + | Per each new fetched chunk (~4m), RGW executes s3-select query on it. + | The current implementation supports CSV objects and since chunks are randomly “cutting” the CSV rows in the middle, those broken-lines (first or last per chunk) are skipped while processing the query. + | Those “broken” lines are stored and later merged with the next broken-line (belong to the next chunk), and finally processed. + + | Per each processed chunk an output message is formatted according to `AWS specification <https://docs.aws.amazon.com/AmazonS3/latest/API/archive-RESTObjectSELECTContent.html#archive-RESTObjectSELECTContent-responses>`_ and sent back to the client. + | RGW supports the following response: ``{:event-type,records} {:content-type,application/octet-stream} {:message-type,event}``. + | For aggregation queries the last chunk should be identified as the end of input, following that the s3-select-engine initiates end-of-process and produces an aggregate result. + + +Basic functionalities +~~~~~~~~~~~~~~~~~~~~~ + + | **S3select** has a definite set of functionalities that should be implemented (if we wish to stay compliant with AWS), currently only a portion of it is implemented. + + | The implemented software architecture supports basic arithmetic expressions, logical and compare expressions, including nested function calls and casting operators, that alone enables the user reasonable flexibility. + | review the below s3-select-feature-table_. + + +Error Handling +~~~~~~~~~~~~~~ + + | Any error occurs while the input query processing, i.e. parsing phase or execution phase, is returned to client as response error message. + + | Fatal severity (attached to the exception) will end query execution immediately, other error severity are counted, upon reaching 100, it ends query execution with an error message. + + + + +.. _s3-select-feature-table: + +Features Support +---------------- + + | Currently only part of `AWS select command <https://docs.aws.amazon.com/AmazonS3/latest/dev/s3-glacier-select-sql-reference-select.html>`_ is implemented, table bellow describes what is currently supported. + | The following table describes the current implementation for s3-select functionalities: + ++---------------------------------+-----------------+-----------------------------------------------------------------------+ +| Feature | Detailed | Example | ++=================================+=================+=======================================================================+ +| Arithmetic operators | ^ * / + - ( ) | select (int(_1)+int(_2))*int(_9) from stdin; | ++---------------------------------+-----------------+-----------------------------------------------------------------------+ +| | | select ((1+2)*3.14) ^ 2 from stdin; | ++---------------------------------+-----------------+-----------------------------------------------------------------------+ +| Compare operators | > < >= <= == != | select _1,_2 from stdin where (int(1)+int(_3))>int(_5); | ++---------------------------------+-----------------+-----------------------------------------------------------------------+ +| logical operator | AND OR | select count(*) from stdin where int(1)>123 and int(_5)<200; | ++---------------------------------+-----------------+-----------------------------------------------------------------------+ +| casting operator | int(expression) | select int(_1),int( 1.2 + 3.4) from stdin; | ++---------------------------------+-----------------+-----------------------------------------------------------------------+ +| |float(expression)| select float(1.2) from stdin; | ++---------------------------------+-----------------+-----------------------------------------------------------------------+ +| | timestamp(...) | select timestamp("1999:10:10-12:23:44") from stdin; | ++---------------------------------+-----------------+-----------------------------------------------------------------------+ +| Aggregation Function | sum | select sum(int(_1)) from stdin; | ++---------------------------------+-----------------+-----------------------------------------------------------------------+ +| Aggregation Function | min | select min( int(_1) * int(_5) ) from stdin; | ++---------------------------------+-----------------+-----------------------------------------------------------------------+ +| Aggregation Function | max | select max(float(_1)),min(int(_5)) from stdin; | ++---------------------------------+-----------------+-----------------------------------------------------------------------+ +| Aggregation Function | count | select count(*) from stdin where (int(1)+int(_3))>int(_5); | ++---------------------------------+-----------------+-----------------------------------------------------------------------+ +| Timestamp Functions | extract | select count(*) from stdin where | +| | | extract("year",timestamp(_2)) > 1950 | +| | | and extract("year",timestamp(_1)) < 1960; | ++---------------------------------+-----------------+-----------------------------------------------------------------------+ +| Timestamp Functions | dateadd | select count(0) from stdin where | +| | | datediff("year",timestamp(_1),dateadd("day",366,timestamp(_1))) == 1; | ++---------------------------------+-----------------+-----------------------------------------------------------------------+ +| Timestamp Functions | datediff | select count(0) from stdin where | +| | | datediff("month",timestamp(_1),timestamp(_2))) == 2; | ++---------------------------------+-----------------+-----------------------------------------------------------------------+ +| Timestamp Functions | utcnow | select count(0) from stdin where | +| | | datediff("hours",utcnow(),dateadd("day",1,utcnow())) == 24 ; | ++---------------------------------+-----------------+-----------------------------------------------------------------------+ +| String Functions | substr | select count(0) from stdin where | +| | | int(substr(_1,1,4))>1950 and int(substr(_1,1,4))<1960; | ++---------------------------------+-----------------+-----------------------------------------------------------------------+ +| alias support | | select int(_1) as a1, int(_2) as a2 , (a1+a2) as a3 | +| | | from stdin where a3>100 and a3<300; | ++---------------------------------+-----------------+-----------------------------------------------------------------------+ + +s3-select function interfaces +----------------------------- + +Timestamp functions +~~~~~~~~~~~~~~~~~~~ + | The `timestamp functionalities <https://docs.aws.amazon.com/AmazonS3/latest/dev/s3-glacier-select-sql-reference-date.html>`_ is partially implemented. + | the casting operator( ``timestamp( string )`` ), converts string to timestamp basic type. + | Currently it can convert the following pattern ``yyyy:mm:dd hh:mi:dd`` + + | ``extract( date-part , timestamp)`` : function return integer according to date-part extract from input timestamp. + | supported date-part : year,month,week,day. + + | ``dateadd(date-part , integer,timestamp)`` : function return timestamp, a calculation results of input timestamp and date-part. + | supported data-part : year,month,day. + + | ``datediff(date-part,timestamp,timestamp)`` : function return an integer, a calculated result for difference between 2 timestamps according to date-part. + | supported date-part : year,month,day,hours. + + + | ``utcnow()`` : return timestamp of current time. + +Aggregation functions +~~~~~~~~~~~~~~~~~~~~~ + + | ``count()`` : return integer according to number of rows matching condition(if such exist). + + | ``sum(expression)`` : return a summary of expression per all rows matching condition(if such exist). + + | ``max(expression)`` : return the maximal result for all expressions matching condition(if such exist). + + | ``min(expression)`` : return the minimal result for all expressions matching condition(if such exist). + +String functions +~~~~~~~~~~~~~~~~ + + | ``substr(string,from,to)`` : return a string extract from input string according to from,to inputs. + + +Alias +~~~~~ + | **Alias** programming-construct is an essential part of s3-select language, it enables much better programming especially with objects containing many columns or in the case of complex queries. + + | Upon parsing the statement containing alias construct, it replaces alias with reference to correct projection column, on query execution time the reference is evaluated as any other expression. + + | There is a risk that self(or cyclic) reference may occur causing stack-overflow(endless-loop), for that concern upon evaluating an alias, it is validated for cyclic reference. + + | Alias also maintains result-cache, meaning upon using the same alias more than once, it’s not evaluating the same expression again(it will return the same result),instead it uses the result from cache. + + | Of Course, per each new row the cache is invalidated. + +Sending Query to RGW +-------------------- + + | Any http-client can send s3-select request to RGW, it must be compliant with `AWS Request syntax <https://docs.aws.amazon.com/AmazonS3/latest/API/API_SelectObjectContent.html#API_SelectObjectContent_RequestSyntax>`_. + + + + | Sending s3-select request to RGW using AWS cli, should follow `AWS command reference <https://docs.aws.amazon.com/cli/latest/reference/s3api/select-object-content.html>`_. + | bellow is an example for it. + +:: + + aws --endpoint-url http://localhost:8000 s3api select-object-content + --bucket {BUCKET-NAME} + --expression-type 'SQL' + --input-serialization + '{"CSV": {"FieldDelimiter": "," , "QuoteCharacter": "\"" , "RecordDelimiter" : "\n" , "QuoteEscapeCharacter" : "\\" , "FileHeaderInfo": "USE" }, "CompressionType": "NONE"}' + --output-serialization '{"CSV": {}}' + --key {OBJECT-NAME} + --expression "select count(0) from stdin where int(_1)<10;" output.csv + +Syntax +~~~~~~ + + | **Input serialization** (Implemented), it let the user define the CSV definitions; the default values are {\\n} for row-delimiter {,} for field delimiter, {"} for quote, {\\} for escape characters. + | it handle the **csv-header-info**, the first row in input object containing the schema. + | **Output serialization** is currently not implemented, the same for **compression-type**. + + | s3-select engine contain a CSV parser, which parse s3-objects as follows. + | - each row ends with row-delimiter. + | - field-separator separates between adjacent columns, successive field separator define NULL column. + | - quote-character overrides field separator, meaning , field separator become as any character between quotes. + | - escape character disables any special characters, except for row delimiter. + + | Below are examples for CSV parsing rules. + + +CSV parsing behavior +-------------------- + ++---------------------------------+-----------------+-----------------------------------------------------------------------+ +| Feature | Description | input ==> tokens | ++=================================+=================+=======================================================================+ +| NULL | successive | ,,1,,2, ==> {null}{null}{1}{null}{2}{null} | +| | field delimiter | | ++---------------------------------+-----------------+-----------------------------------------------------------------------+ +| QUOTE | quote character | 11,22,"a,b,c,d",last ==> {11}{22}{"a,b,c,d"}{last} | +| | overrides | | +| | field delimiter | | ++---------------------------------+-----------------+-----------------------------------------------------------------------+ +| Escape | escape char | 11,22,str=\\"abcd\\"\\,str2=\\"123\\",last | +| | overrides | ==> {11}{22}{str="abcd",str2="123"}{last} | +| | meta-character. | | +| | escape removed | | ++---------------------------------+-----------------+-----------------------------------------------------------------------+ +| row delimiter | no close quote, | 11,22,a="str,44,55,66 | +| | row delimiter is| ==> {11}{22}{a="str,44,55,66} | +| | closing line | | ++---------------------------------+-----------------+-----------------------------------------------------------------------+ +| csv header info | FileHeaderInfo | "**USE**" value means each token on first line is column-name, | +| | tag | "**IGNORE**" value means to skip the first line | ++---------------------------------+-----------------+-----------------------------------------------------------------------+ + + +BOTO3 +----- + + | using BOTO3 is "natural" and easy due to AWS-cli support. + +:: + + + def run_s3select(bucket,key,query,column_delim=",",row_delim="\n",quot_char='"',esc_char='\\',csv_header_info="NONE"): + s3 = boto3.client('s3', + endpoint_url=endpoint, + aws_access_key_id=access_key, + region_name=region_name, + aws_secret_access_key=secret_key) + + + + r = s3.select_object_content( + Bucket=bucket, + Key=key, + ExpressionType='SQL', + InputSerialization = {"CSV": {"RecordDelimiter" : row_delim, "FieldDelimiter" : column_delim,"QuoteEscapeCharacter": esc_char, "QuoteCharacter": quot_char, "FileHeaderInfo": csv_header_info}, "CompressionType": "NONE"}, + OutputSerialization = {"CSV": {}}, + Expression=query,) + + result = "" + for event in r['Payload']: + if 'Records' in event: + records = event['Records']['Payload'].decode('utf-8') + result += records + + return result + + + + + run_s3select( + "my_bucket", + "my_csv_object", + "select int(_1) as a1, int(_2) as a2 , (a1+a2) as a3 from stdin where a3>100 and a3<300;") + diff --git a/doc/radosgw/session-tags.rst b/doc/radosgw/session-tags.rst new file mode 100644 index 000000000..abfb89e87 --- /dev/null +++ b/doc/radosgw/session-tags.rst @@ -0,0 +1,424 @@ +======================================================= +Session tags for Attribute Based Access Control in STS +======================================================= + +Session tags are key-value pairs that can be passed while federating a user (currently it +is only supported as part of the web token passed to AssumeRoleWithWebIdentity). The session +tags are passed along as aws:PrincipalTag in the session credentials (temporary credentials) +that is returned back by STS. These Principal Tags consists of the session tags that come in +as part of the web token and the tags that are attached to the role being assumed. Please note +that the tags have to be always specified in the following namespace: https://aws.amazon.com/tags. + +An example of the session tags that are passed in by the IDP in the web token is as follows: + +.. code-block:: python + + { + "jti": "947960a3-7e91-4027-99f6-da719b0d4059", + "exp": 1627438044, + "nbf": 0, + "iat": 1627402044, + "iss": "http://localhost:8080/auth/realms/quickstart", + "aud": "app-profile-jsp", + "sub": "test", + "typ": "ID", + "azp": "app-profile-jsp", + "auth_time": 0, + "session_state": "3a46e3e7-d198-4a64-8b51-69682bcfc670", + "preferred_username": "test", + "email_verified": false, + "acr": "1", + "https://aws.amazon.com/tags": [ + { + "principal_tags": { + "Department": [ + "Engineering", + "Marketing" + ] + } + } + ], + "client_id": "app-profile-jsp", + "username": "test", + "active": true + } + +Steps to configure Keycloak to pass tags in the web token are described here:doc:`keycloak`. + +The trust policy must have 'sts:TagSession' permission if the web token passed in by the federated user contains session tags, otherwise +the AssumeRoleWithWebIdentity action will fail. An example of the trust policy with sts:TagSession is as follows: + +.. code-block:: python + + { + "Version":"2012-10-17", + "Statement":[ + { + "Effect":"Allow", + "Action":["sts:AssumeRoleWithWebIdentity","sts:TagSession"], + "Principal":{"Federated":["arn:aws:iam:::oidc-provider/localhost:8080/auth/realms/quickstart"]}, + "Condition":{"StringEquals":{"localhost:8080/auth/realms/quickstart:sub":"test"}} + }] + } + +Tag Keys +======== + +The following are the tag keys that can be used in the role's trust policy or the role's permission policy: + +1. aws:RequestTag: This key is used to compare the key-value pair passed in the request with the key-value pair +in the role's trust policy. In case of AssumeRoleWithWebIdentity, the session tags that are passed by the idp +in the web token can be used as aws:RequestTag in the role's trust policy based on which a federated user can be +allowed to assume a role. + +An example of a role trust policy that uses aws:RequestTag is as follows: + +.. code-block:: python + + { + "Version":"2012-10-17", + "Statement":[ + { + "Effect":"Allow", + "Action":["sts:AssumeRoleWithWebIdentity","sts:TagSession"], + "Principal":{"Federated":["arn:aws:iam:::oidc-provider/localhost:8080/auth/realms/quickstart"]}, + "Condition":{"StringEquals":{"aws:RequestTag/Department":"Engineering"}} + }] + } + +2. aws:PrincipalTag: This key is used to compare the key-value pair attached to the principal with the key-value pair +in the policy. In case of AssumeRoleWithWebIdentity, the session tags that are passed by the idp in the web token appear +as Principal tags in the temporary credentials once a user has been authenticated, and these tags can be used as +aws:PrincipalTag in the role's permission policy. + +An example of a role permission policy that uses aws:PrincipalTag is as follows: + +.. code-block:: python + + { + "Version":"2012-10-17", + "Statement":[ + { + "Effect":"Allow", + "Action":["s3:*"], + "Resource":["arn:aws:s3::t1tenant:my-test-bucket","arn:aws:s3::t1tenant:my-test-bucket/*],"+ + "Condition":{"StringEquals":{"aws:PrincipalTag/Department":"Engineering"}} + }] + } + +3. iam:ResourceTag: This key is used to compare the key-value pair attached to the resource with the key-value pair +in the policy. In case of AssumeRoleWithWebIdentity, tags attached to the role can be used to compare with that in +the trust policy to allow a user to assume a role. +RGW now supports REST APIs for tagging, listing tags and untagging actions on a role. More information related to +role tagging can be found here :doc:`role`. + +An example of a role's trust policy that uses aws:ResourceTag is as follows: + +.. code-block:: python + + { + "Version":"2012-10-17", + "Statement":[ + { + "Effect":"Allow", + "Action":["sts:AssumeRoleWithWebIdentity","sts:TagSession"], + "Principal":{"Federated":["arn:aws:iam:::oidc-provider/localhost:8080/auth/realms/quickstart"]}, + "Condition":{"StringEquals":{"iam:ResourceTag/Department":"Engineering"}} + }] + } + +For the above to work, you need to attach 'Department=Engineering' tag to the role. + +4. aws:TagKeys: This key is used to compare tags in the request with the tags in the policy. In case of +AssumeRoleWithWebIdentity this can be used to check the tag keys in a role's trust policy before a user +is allowed to assume a role. +This can also be used in the role's permission policy. + +An example of a role's trust policy that uses aws:TagKeys is as follows: + +.. code-block:: python + + { + "Version":"2012-10-17", + "Statement":[ + { + "Effect":"Allow", + "Action":["sts:AssumeRoleWithWebIdentity","sts:TagSession"], + "Principal":{"Federated":["arn:aws:iam:::oidc-provider/localhost:8080/auth/realms/quickstart"]}, + "Condition":{"ForAllValues:StringEquals":{"aws:TagKeys":["Marketing,Engineering"]}} + }] + } + +'ForAllValues:StringEquals' tests whether every tag key in the request is a subset of the tag keys in the policy. So the above +condition restricts the tag keys passed in the request. + +5. s3:ResourceTag: This key is used to compare tags present on the s3 resource (bucket or object) with the tags in +the role's permission policy. + +An example of a role's permission policy that uses s3:ResourceTag is as follows: + +.. code-block:: python + + { + "Version":"2012-10-17", + "Statement":[ + { + "Effect":"Allow", + "Action":["s3:PutBucketTagging"], + "Resource":["arn:aws:s3::t1tenant:my-test-bucket\","arn:aws:s3::t1tenant:my-test-bucket/*"] + }, + { + "Effect":"Allow", + "Action":["s3:*"], + "Resource":["*"], + "Condition":{"StringEquals":{"s3:ResourceTag/Department":\"Engineering"}} + } + } + +For the above to work, you need to attach 'Department=Engineering' tag to the bucket (and on the object too) on which you want this policy +to be applied. + +More examples of policies using tags +==================================== + +1. To assume a role by matching the tags in the incoming request with the tag attached to the role. +aws:RequestTag is the incoming tag in the JWT (access token) and iam:ResourceTag is the tag attached to the role being assumed: + +.. code-block:: python + + { + "Version":"2012-10-17", + "Statement":[ + { + "Effect":"Allow", + "Action":["sts:AssumeRoleWithWebIdentity","sts:TagSession"], + "Principal":{"Federated":["arn:aws:iam:::oidc-provider/localhost:8080/auth/realms/quickstart"]}, + "Condition":{"StringEquals":{"aws:RequestTag/Department":"${iam:ResourceTag/Department}"}} + }] + } + +2. To evaluate a role's permission policy by matching principal tags with s3 resource tags. +aws:PrincipalTag is the tag passed in along with the temporary credentials and s3:ResourceTag is the tag attached to +the s3 resource (object/ bucket): + +.. code-block:: python + + + { + "Version":"2012-10-17", + "Statement":[ + { + "Effect":"Allow", + "Action":["s3:PutBucketTagging"], + "Resource":["arn:aws:s3::t1tenant:my-test-bucket\","arn:aws:s3::t1tenant:my-test-bucket/*"] + }, + { + "Effect":"Allow", + "Action":["s3:*"], + "Resource":["*"], + "Condition":{"StringEquals":{"s3:ResourceTag/Department":"${aws:PrincipalTag/Department}"}} + } + } + +Properties of Session Tags +========================== + +1. Session Tags can be multi-valued. (Multi-valued session tags are not supported in AWS) +2. A maximum of 50 session tags are allowed to be passed in by the IDP. +3. The maximum size of a key allowed is 128 characters. +4. The maximum size of a value allowed is 256 characters. +5. The tag or the value can not start with "aws:". + +s3 Resource Tags +================ + +As stated above 's3:ResourceTag' key can be used for authorizing an s3 operation in RGW (this is not allowed in AWS). + +s3:ResourceTag is a key used to refer to tags that have been attached to an object or a bucket. Tags can be attached to an object or +a bucket using REST APIs available for the same. + +The following table shows which s3 resource tag type (bucket/object) are supported for authorizing a particular operation. + ++-----------------------------------+-------------------+ +| Operation | Tag type | ++===================================+===================+ +| **GetObject** | Object tags | +| **GetObjectTags** | | +| **DeleteObjectTags** | | +| **DeleteObject** | | +| **PutACLs** | | +| **InitMultipart** | | +| **AbortMultipart** | | +| **ListMultipart** | | +| **GetAttrs** | | +| **PutObjectRetention** | | +| **GetObjectRetention** | | +| **PutObjectLegalHold** | | +| **GetObjectLegalHold** | | ++-----------------------------------+-------------------+ +| **PutObjectTags** | Bucket tags | +| **GetBucketTags** | | +| **PutBucketTags** | | +| **DeleteBucketTags** | | +| **GetBucketReplication** | | +| **DeleteBucketReplication** | | +| **GetBucketVersioning** | | +| **SetBucketVersioning** | | +| **GetBucketWebsite** | | +| **SetBucketWebsite** | | +| **DeleteBucketWebsite** | | +| **StatBucket** | | +| **ListBucket** | | +| **GetBucketLogging** | | +| **GetBucketLocation** | | +| **DeleteBucket** | | +| **GetLC** | | +| **PutLC** | | +| **DeleteLC** | | +| **GetCORS** | | +| **PutCORS** | | +| **GetRequestPayment** | | +| **SetRequestPayment** | | +| **PutBucketPolicy** | | +| **GetBucketPolicy** | | +| **DeleteBucketPolicy** | | +| **PutBucketObjectLock** | | +| **GetBucketObjectLock** | | +| **GetBucketPolicyStatus** | | +| **PutBucketPublicAccessBlock** | | +| **GetBucketPublicAccessBlock** | | +| **DeleteBucketPublicAccessBlock** | | ++-----------------------------------+-------------------+ +| **GetACLs** | Bucket tags for | +| **PutACLs** | bucket ACLs | +| | Object tags for | +| | object ACLs | ++-----------------------------------+-------------------+ +| **PutObject** | Object tags of | +| **CopyObject** | source object | +| | Bucket tags of | +| | destination bucket| ++-----------------------------------+-------------------+ + + +Sample code demonstrating usage of session tags +=============================================== + +The following is a sample code for tagging a role, a bucket, an object in it and using tag keys in a role's +trust policy and its permission policy, assuming that a tag 'Department=Engineering' is passed in the +JWT (access token) by the IDP + +.. code-block:: python + + # -*- coding: utf-8 -*- + + import boto3 + import json + from nose.tools import eq_ as eq + + access_key = 'TESTER' + secret_key = 'test123' + endpoint = 'http://s3.us-east.localhost:8000' + + s3client = boto3.client('s3', + aws_access_key_id = access_key, + aws_secret_access_key = secret_key, + endpoint_url = endpoint, + region_name='',) + + s3res = boto3.resource('s3', + aws_access_key_id = access_key, + aws_secret_access_key = secret_key, + endpoint_url = endpoint, + region_name='',) + + iam_client = boto3.client('iam', + aws_access_key_id=access_key, + aws_secret_access_key=secret_key, + endpoint_url=endpoint, + region_name='' + ) + + bucket_name = 'test-bucket' + s3bucket = s3client.create_bucket(Bucket=bucket_name) + + bucket_tagging = s3res.BucketTagging(bucket_name) + Set_Tag = bucket_tagging.put(Tagging={'TagSet':[{'Key':'Department', 'Value': 'Engineering'}]}) + try: + response = iam_client.create_open_id_connect_provider( + Url='http://localhost:8080/auth/realms/quickstart', + ClientIDList=[ + 'app-profile-jsp', + 'app-jee-jsp' + ], + ThumbprintList=[ + 'F7D7B3515DD0D319DD219A43A9EA727AD6065287' + ] + ) + except ClientError as e: + print ("Provider already exists") + + policy_document = "{\"Version\":\"2012-10-17\",\"Statement\":[{\"Effect\":\"Allow\",\"Principal\":{\"Federated\":[\"arn:aws:iam:::oidc-provider/localhost:8080/auth/realms/quickstart\"]},\"Action\":[\"sts:AssumeRoleWithWebIdentity\",\"sts:TagSession\"],\"Condition\":{\"StringEquals\":{\"aws:RequestTag/Department\":\"${iam:ResourceTag/Department}\"}}}]}" + role_response = "" + + print ("\n Getting Role \n") + + try: + role_response = iam_client.get_role( + RoleName='S3Access' + ) + print (role_response) + except ClientError as e: + if e.response['Code'] == 'NoSuchEntity': + print ("\n Creating Role \n") + tags_list = [ + {'Key':'Department','Value':'Engineering'}, + ] + role_response = iam_client.create_role( + AssumeRolePolicyDocument=policy_document, + Path='/', + RoleName='S3Access', + Tags=tags_list, + ) + print (role_response) + else: + print("Unexpected error: %s" % e) + + role_policy = "{\"Version\":\"2012-10-17\",\"Statement\":{\"Effect\":\"Allow\",\"Action\":\"s3:*\",\"Resource\":\"arn:aws:s3:::*\",\"Condition\":{\"StringEquals\":{\"s3:ResourceTag/Department\":[\"${aws:PrincipalTag/Department}\"]}}}}" + + response = iam_client.put_role_policy( + RoleName='S3Access', + PolicyName='Policy1', + PolicyDocument=role_policy + ) + + sts_client = boto3.client('sts', + aws_access_key_id='abc', + aws_secret_access_key='def', + endpoint_url = endpoint, + region_name = '', + ) + + + print ("\n Assuming Role with Web Identity\n") + response = sts_client.assume_role_with_web_identity( + RoleArn=role_response['Role']['Arn'], + RoleSessionName='Bob', + DurationSeconds=900, + WebIdentityToken='<web-token>') + + s3client2 = boto3.client('s3', + aws_access_key_id = response['Credentials']['AccessKeyId'], + aws_secret_access_key = response['Credentials']['SecretAccessKey'], + aws_session_token = response['Credentials']['SessionToken'], + endpoint_url='http://s3.us-east.localhost:8000', + region_name='',) + + bucket_body = 'this is a test file' + tags = 'Department=Engineering' + key = "test-1.txt" + s3_put_obj = s3client2.put_object(Body=bucket_body, Bucket=bucket_name, Key=key, Tagging=tags) + eq(s3_put_obj['ResponseMetadata']['HTTPStatusCode'],200) + + s3_get_obj = s3client2.get_object(Bucket=bucket_name, Key=key) + eq(s3_get_obj['ResponseMetadata']['HTTPStatusCode'],200)
\ No newline at end of file diff --git a/doc/radosgw/swift.rst b/doc/radosgw/swift.rst new file mode 100644 index 000000000..2cb2dde67 --- /dev/null +++ b/doc/radosgw/swift.rst @@ -0,0 +1,77 @@ +=============================== + Ceph Object Gateway Swift API +=============================== + +Ceph supports a RESTful API that is compatible with the basic data access model of the `Swift API`_. + +API +--- + +.. toctree:: + :maxdepth: 1 + + Authentication <swift/auth> + Service Ops <swift/serviceops> + Container Ops <swift/containerops> + Object Ops <swift/objectops> + Temp URL Ops <swift/tempurl> + Tutorial <swift/tutorial> + Java <swift/java> + Python <swift/python> + Ruby <swift/ruby> + + +Features Support +---------------- + +The following table describes the support status for current Swift functional features: + ++---------------------------------+-----------------+----------------------------------------+ +| Feature | Status | Remarks | ++=================================+=================+========================================+ +| **Authentication** | Supported | | ++---------------------------------+-----------------+----------------------------------------+ +| **Get Account Metadata** | Supported | | ++---------------------------------+-----------------+----------------------------------------+ +| **Swift ACLs** | Supported | Supports a subset of Swift ACLs | ++---------------------------------+-----------------+----------------------------------------+ +| **List Containers** | Supported | | ++---------------------------------+-----------------+----------------------------------------+ +| **Delete Container** | Supported | | ++---------------------------------+-----------------+----------------------------------------+ +| **Create Container** | Supported | | ++---------------------------------+-----------------+----------------------------------------+ +| **Get Container Metadata** | Supported | | ++---------------------------------+-----------------+----------------------------------------+ +| **Update Container Metadata** | Supported | | ++---------------------------------+-----------------+----------------------------------------+ +| **Delete Container Metadata** | Supported | | ++---------------------------------+-----------------+----------------------------------------+ +| **List Objects** | Supported | | ++---------------------------------+-----------------+----------------------------------------+ +| **Static Website** | Supported | | ++---------------------------------+-----------------+----------------------------------------+ +| **Create Object** | Supported | | ++---------------------------------+-----------------+----------------------------------------+ +| **Create Large Object** | Supported | | ++---------------------------------+-----------------+----------------------------------------+ +| **Delete Object** | Supported | | ++---------------------------------+-----------------+----------------------------------------+ +| **Get Object** | Supported | | ++---------------------------------+-----------------+----------------------------------------+ +| **Copy Object** | Supported | | ++---------------------------------+-----------------+----------------------------------------+ +| **Get Object Metadata** | Supported | | ++---------------------------------+-----------------+----------------------------------------+ +| **Update Object Metadata** | Supported | | ++---------------------------------+-----------------+----------------------------------------+ +| **Expiring Objects** | Supported | | ++---------------------------------+-----------------+----------------------------------------+ +| **Temporary URLs** | Partial Support | No support for container-level keys | ++---------------------------------+-----------------+----------------------------------------+ +| **Object Versioning** | Partial Support | No support for ``X-History-Location`` | ++---------------------------------+-----------------+----------------------------------------+ +| **CORS** | Not Supported | | ++---------------------------------+-----------------+----------------------------------------+ + +.. _Swift API: https://developer.openstack.org/api-ref/object-store/index.html diff --git a/doc/radosgw/swift/auth.rst b/doc/radosgw/swift/auth.rst new file mode 100644 index 000000000..12d6b23ff --- /dev/null +++ b/doc/radosgw/swift/auth.rst @@ -0,0 +1,82 @@ +================ + Authentication +================ + +Swift API requests that require authentication must contain an +``X-Storage-Token`` authentication token in the request header. +The token may be retrieved from RADOS Gateway, or from another authenticator. +To obtain a token from RADOS Gateway, you must create a user. For example:: + + sudo radosgw-admin user create --subuser="{username}:{subusername}" --uid="{username}" + --display-name="{Display Name}" --key-type=swift --secret="{password}" --access=full + +For details on RADOS Gateway administration, see `radosgw-admin`_. + +.. _radosgw-admin: ../../../man/8/radosgw-admin/ + +.. note:: + For those used to the Swift API this is implementing the Swift auth v1.0 API, as such + `{username}` above is generally equivalent to a Swift `account` and `{subusername}` + is a user under that account. + +Auth Get +-------- + +To authenticate a user, make a request containing an ``X-Auth-User`` and a +``X-Auth-Key`` in the header. + +Syntax +~~~~~~ + +:: + + GET /auth HTTP/1.1 + Host: swift.radosgwhost.com + X-Auth-User: johndoe + X-Auth-Key: R7UUOLFDI2ZI9PRCQ53K + +Request Headers +~~~~~~~~~~~~~~~ + +``X-Auth-User`` + +:Description: The key RADOS GW username to authenticate. +:Type: String +:Required: Yes + +``X-Auth-Key`` + +:Description: The key associated to a RADOS GW username. +:Type: String +:Required: Yes + + +Response Headers +~~~~~~~~~~~~~~~~ + +The response from the server should include an ``X-Auth-Token`` value. The +response may also contain a ``X-Storage-Url`` that provides the +``{api version}/{account}`` prefix that is specified in other requests +throughout the API documentation. + + +``X-Storage-Token`` + +:Description: The authorization token for the ``X-Auth-User`` specified in the request. +:Type: String + + +``X-Storage-Url`` + +:Description: The URL and ``{api version}/{account}`` path for the user. +:Type: String + +A typical response looks like this:: + + HTTP/1.1 204 No Content + Date: Mon, 16 Jul 2012 11:05:33 GMT + Server: swift + X-Storage-Url: https://swift.radosgwhost.com/v1/ACCT-12345 + X-Auth-Token: UOlCCC8TahFKlWuv9DB09TWHF0nDjpPElha0kAa + Content-Length: 0 + Content-Type: text/plain; charset=UTF-8 diff --git a/doc/radosgw/swift/containerops.rst b/doc/radosgw/swift/containerops.rst new file mode 100644 index 000000000..17ef4658b --- /dev/null +++ b/doc/radosgw/swift/containerops.rst @@ -0,0 +1,341 @@ +====================== + Container Operations +====================== + +A container is a mechanism for storing data objects. An account may +have many containers, but container names must be unique. This API enables a +client to create a container, set access controls and metadata, +retrieve a container's contents, and delete a container. Since this API +makes requests related to information in a particular user's account, all +requests in this API must be authenticated unless a container's access control +is deliberately made publicly accessible (i.e., allows anonymous requests). + +.. note:: The Amazon S3 API uses the term 'bucket' to describe a data container. + When you hear someone refer to a 'bucket' within the Swift API, the term + 'bucket' may be construed as the equivalent of the term 'container.' + +One facet of object storage is that it does not support hierarchical paths +or directories. Instead, it supports one level consisting of one or more +containers, where each container may have objects. The RADOS Gateway's +Swift-compatible API supports the notion of 'pseudo-hierarchical containers,' +which is a means of using object naming to emulate a container (or directory) +hierarchy without actually implementing one in the storage system. You may +name objects with pseudo-hierarchical names +(e.g., photos/buildings/empire-state.jpg), but container names cannot +contain a forward slash (``/``) character. + + +Create a Container +================== + +To create a new container, make a ``PUT`` request with the API version, account, +and the name of the new container. The container name must be unique, must not +contain a forward-slash (/) character, and should be less than 256 bytes. You +may include access control headers and metadata headers in the request. The +operation is idempotent; that is, if you make a request to create a container +that already exists, it will return with a HTTP 202 return code, but will not +create another container. + + +Syntax +~~~~~~ + +:: + + PUT /{api version}/{account}/{container} HTTP/1.1 + Host: {fqdn} + X-Auth-Token: {auth-token} + X-Container-Read: {comma-separated-uids} + X-Container-Write: {comma-separated-uids} + X-Container-Meta-{key}: {value} + + +Headers +~~~~~~~ + +``X-Container-Read`` + +:Description: The user IDs with read permissions for the container. +:Type: Comma-separated string values of user IDs. +:Required: No + +``X-Container-Write`` + +:Description: The user IDs with write permissions for the container. +:Type: Comma-separated string values of user IDs. +:Required: No + +``X-Container-Meta-{key}`` + +:Description: A user-defined meta data key that takes an arbitrary string value. +:Type: String +:Required: No + + +HTTP Response +~~~~~~~~~~~~~ + +If a container with the same name already exists, and the user is the +container owner then the operation will succeed. Otherwise the operation +will fail. + +``409`` + +:Description: The container already exists under a different user's ownership. +:Status Code: ``BucketAlreadyExists`` + + + + +List a Container's Objects +========================== + +To list the objects within a container, make a ``GET`` request with the with the +API version, account, and the name of the container. You can specify query +parameters to filter the full list, or leave out the parameters to return a list +of the first 10,000 object names stored in the container. + + +Syntax +~~~~~~ + +:: + + GET /{api version}/{container} HTTP/1.1 + Host: {fqdn} + X-Auth-Token: {auth-token} + + +Parameters +~~~~~~~~~~ + +``format`` + +:Description: Defines the format of the result. +:Type: String +:Valid Values: ``json`` | ``xml`` +:Required: No + +``prefix`` + +:Description: Limits the result set to objects beginning with the specified prefix. +:Type: String +:Required: No + +``marker`` + +:Description: Returns a list of results greater than the marker value. +:Type: String +:Required: No + +``limit`` + +:Description: Limits the number of results to the specified value. +:Type: Integer +:Valid Range: 0 - 10,000 +:Required: No + +``delimiter`` + +:Description: The delimiter between the prefix and the rest of the object name. +:Type: String +:Required: No + +``path`` + +:Description: The pseudo-hierarchical path of the objects. +:Type: String +:Required: No + +``allow_unordered`` + +:Description: Allows the results to be returned unordered to reduce computation overhead. Cannot be used with ``delimiter``. +:Type: Boolean +:Required: No +:Non-Standard Extension: Yes + + +Response Entities +~~~~~~~~~~~~~~~~~ + +``container`` + +:Description: The container. +:Type: Container + +``object`` + +:Description: An object within the container. +:Type: Container + +``name`` + +:Description: The name of an object within the container. +:Type: String + +``hash`` + +:Description: A hash code of the object's contents. +:Type: String + +``last_modified`` + +:Description: The last time the object's contents were modified. +:Type: Date + +``content_type`` + +:Description: The type of content within the object. +:Type: String + + + +Update a Container's ACLs +========================= + +When a user creates a container, the user has read and write access to the +container by default. To allow other users to read a container's contents or +write to a container, you must specifically enable the user. +You may also specify ``*`` in the ``X-Container-Read`` or ``X-Container-Write`` +settings, which effectively enables all users to either read from or write +to the container. Setting ``*`` makes the container public. That is it +enables anonymous users to either read from or write to the container. + +.. note:: If you are planning to expose public read ACL functionality + for the Swift API, it is strongly recommended to include the + Swift account name in the endpoint definition, so as to most + closely emulate the behavior of native OpenStack Swift. To + do so, set the ``ceph.conf`` configuration option ``rgw + swift account in url = true``, and update your Keystone + endpoint to the URL suffix ``/v1/AUTH_%(tenant_id)s`` + (instead of just ``/v1``). + + +Syntax +~~~~~~ + +:: + + POST /{api version}/{account}/{container} HTTP/1.1 + Host: {fqdn} + X-Auth-Token: {auth-token} + X-Container-Read: * + X-Container-Write: {uid1}, {uid2}, {uid3} + +Request Headers +~~~~~~~~~~~~~~~ + +``X-Container-Read`` + +:Description: The user IDs with read permissions for the container. +:Type: Comma-separated string values of user IDs. +:Required: No + +``X-Container-Write`` + +:Description: The user IDs with write permissions for the container. +:Type: Comma-separated string values of user IDs. +:Required: No + + +Add/Update Container Metadata +============================= + +To add metadata to a container, make a ``POST`` request with the API version, +account, and container name. You must have write permissions on the +container to add or update metadata. + +Syntax +~~~~~~ + +:: + + POST /{api version}/{account}/{container} HTTP/1.1 + Host: {fqdn} + X-Auth-Token: {auth-token} + X-Container-Meta-Color: red + X-Container-Meta-Taste: salty + +Request Headers +~~~~~~~~~~~~~~~ + +``X-Container-Meta-{key}`` + +:Description: A user-defined meta data key that takes an arbitrary string value. +:Type: String +:Required: No + + +Enable Object Versioning for a Container +======================================== + +To enable object versioning a container, make a ``POST`` request with +the API version, account, and container name. You must have write +permissions on the container to add or update metadata. + +.. note:: Object versioning support is not enabled in radosgw by + default; you must set ``rgw swift versioning enabled = + true`` in ``ceph.conf`` to enable this feature. + +Syntax +~~~~~~ + +:: + + POST /{api version}/{account}/{container} HTTP/1.1 + Host: {fqdn} + X-Auth-Token: {auth-token} + X-Versions-Location: {archive-container} + +Request Headers +~~~~~~~~~~~~~~~ + +``X-Versions-Location`` + +:Description: The name of a container (the "archive container") that + will be used to store versions of the objects in the + container that the ``POST`` request is made on (the + "current container"). The archive container need not + exist at the time it is being referenced, but once + ``X-Versions-Location`` is set on the current container, + and object versioning is thus enabled, the archive + container must exist before any further objects are + updated or deleted in the current container. + + .. note:: ``X-Versions-Location`` is the only + versioning-related header that radosgw + interprets. ``X-History-Location``, supported + by native OpenStack Swift, is currently not + supported by radosgw. +:Type: String +:Required: No (if this header is passed with an empty value, object + versioning on the current container is disabled, but the + archive container continues to exist.) + + +Delete a Container +================== + +To delete a container, make a ``DELETE`` request with the API version, account, +and the name of the container. The container must be empty. If you'd like to check +if the container is empty, execute a ``HEAD`` request against the container. Once +you have successfully removed the container, you will be able to reuse the container name. + +Syntax +~~~~~~ + +:: + + DELETE /{api version}/{account}/{container} HTTP/1.1 + Host: {fqdn} + X-Auth-Token: {auth-token} + + +HTTP Response +~~~~~~~~~~~~~ + +``204`` + +:Description: The container was removed. +:Status Code: ``NoContent`` + diff --git a/doc/radosgw/swift/java.rst b/doc/radosgw/swift/java.rst new file mode 100644 index 000000000..8977a3b16 --- /dev/null +++ b/doc/radosgw/swift/java.rst @@ -0,0 +1,175 @@ +.. _java_swift: + +===================== + Java Swift Examples +===================== + +Setup +===== + +The following examples may require some or all of the following Java +classes to be imported: + +.. code-block:: java + + import org.javaswift.joss.client.factory.AccountConfig; + import org.javaswift.joss.client.factory.AccountFactory; + import org.javaswift.joss.client.factory.AuthenticationMethod; + import org.javaswift.joss.model.Account; + import org.javaswift.joss.model.Container; + import org.javaswift.joss.model.StoredObject; + import java.io.File; + import java.io.IOException; + import java.util.*; + + +Create a Connection +=================== + +This creates a connection so that you can interact with the server: + +.. code-block:: java + + String username = "USERNAME"; + String password = "PASSWORD"; + String authUrl = "https://radosgw.endpoint/auth/1.0"; + + AccountConfig config = new AccountConfig(); + config.setUsername(username); + config.setPassword(password); + config.setAuthUrl(authUrl); + config.setAuthenticationMethod(AuthenticationMethod.BASIC); + Account account = new AccountFactory(config).createAccount(); + + +Create a Container +================== + +This creates a new container called ``my-new-container``: + +.. code-block:: java + + Container container = account.getContainer("my-new-container"); + container.create(); + + +Create an Object +================ + +This creates an object ``foo.txt`` from the file named ``foo.txt`` in +the container ``my-new-container``: + +.. code-block:: java + + Container container = account.getContainer("my-new-container"); + StoredObject object = container.getObject("foo.txt"); + object.uploadObject(new File("foo.txt")); + + +Add/Update Object Metadata +========================== + +This adds the metadata key-value pair ``key``:``value`` to the object named +``foo.txt`` in the container ``my-new-container``: + +.. code-block:: java + + Container container = account.getContainer("my-new-container"); + StoredObject object = container.getObject("foo.txt"); + Map<String, Object> metadata = new TreeMap<String, Object>(); + metadata.put("key", "value"); + object.setMetadata(metadata); + + +List Owned Containers +===================== + +This gets a list of Containers that you own. +This also prints out the container name. + +.. code-block:: java + + Collection<Container> containers = account.list(); + for (Container currentContainer : containers) { + System.out.println(currentContainer.getName()); + } + +The output will look something like this:: + + mahbuckat1 + mahbuckat2 + mahbuckat3 + + +List a Container's Content +========================== + +This gets a list of objects in the container ``my-new-container``; and, it also +prints out each object's name, the file size, and last modified date: + +.. code-block:: java + + Container container = account.getContainer("my-new-container"); + Collection<StoredObject> objects = container.list(); + for (StoredObject currentObject : objects) { + System.out.println(currentObject.getName()); + } + +The output will look something like this:: + + myphoto1.jpg + myphoto2.jpg + + +Retrieve an Object's Metadata +============================= + +This retrieves metadata and gets the MIME type for an object named ``foo.txt`` +in a container named ``my-new-container``: + +.. code-block:: java + + Container container = account.getContainer("my-new-container"); + StoredObject object = container.getObject("foo.txt"); + Map<String, Object> returnedMetadata = object.getMetadata(); + for (String name : returnedMetadata.keySet()) { + System.out.println("META / "+name+": "+returnedMetadata.get(name)); + } + + +Retrieve an Object +================== + +This downloads the object ``foo.txt`` in the container ``my-new-container`` +and saves it in ``./outfile.txt``: + +.. code-block:: java + + Container container = account.getContainer("my-new-container"); + StoredObject object = container.getObject("foo.txt"); + object.downloadObject(new File("outfile.txt")); + + +Delete an Object +================ + +This deletes the object ``goodbye.txt`` in the container "my-new-container": + +.. code-block:: java + + Container container = account.getContainer("my-new-container"); + StoredObject object = container.getObject("foo.txt"); + object.delete(); + + +Delete a Container +================== + +This deletes a container named "my-new-container": + +.. code-block:: java + + Container container = account.getContainer("my-new-container"); + container.delete(); + +.. note:: The container must be empty! Otherwise it won't work! diff --git a/doc/radosgw/swift/objectops.rst b/doc/radosgw/swift/objectops.rst new file mode 100644 index 000000000..fc8d21967 --- /dev/null +++ b/doc/radosgw/swift/objectops.rst @@ -0,0 +1,271 @@ +=================== + Object Operations +=================== + +An object is a container for storing data and metadata. A container may +have many objects, but the object names must be unique. This API enables a +client to create an object, set access controls and metadata, retrieve an +object's data and metadata, and delete an object. Since this API makes requests +related to information in a particular user's account, all requests in this API +must be authenticated unless the container or object's access control is +deliberately made publicly accessible (i.e., allows anonymous requests). + + +Create/Update an Object +======================= + +To create a new object, make a ``PUT`` request with the API version, account, +container name and the name of the new object. You must have write permission +on the container to create or update an object. The object name must be +unique within the container. The ``PUT`` request is not idempotent, so if you +do not use a unique name, the request will update the object. However, you may +use pseudo-hierarchical syntax in your object name to distinguish it from +another object of the same name if it is under a different pseudo-hierarchical +directory. You may include access control headers and metadata headers in the +request. + + +Syntax +~~~~~~ + +:: + + PUT /{api version}/{account}/{container}/{object} HTTP/1.1 + Host: {fqdn} + X-Auth-Token: {auth-token} + + +Request Headers +~~~~~~~~~~~~~~~ + +``ETag`` + +:Description: An MD5 hash of the object's contents. Recommended. +:Type: String +:Required: No + + +``Content-Type`` + +:Description: The type of content the object contains. +:Type: String +:Required: No + + +``Transfer-Encoding`` + +:Description: Indicates whether the object is part of a larger aggregate object. +:Type: String +:Valid Values: ``chunked`` +:Required: No + + +Copy an Object +============== + +Copying an object allows you to make a server-side copy of an object, so that +you don't have to download it and upload it under another container/name. +To copy the contents of one object to another object, you may make either a +``PUT`` request or a ``COPY`` request with the API version, account, and the +container name. For a ``PUT`` request, use the destination container and object +name in the request, and the source container and object in the request header. +For a ``Copy`` request, use the source container and object in the request, and +the destination container and object in the request header. You must have write +permission on the container to copy an object. The destination object name must be +unique within the container. The request is not idempotent, so if you do not use +a unique name, the request will update the destination object. However, you may +use pseudo-hierarchical syntax in your object name to distinguish the destination +object from the source object of the same name if it is under a different +pseudo-hierarchical directory. You may include access control headers and metadata +headers in the request. + +Syntax +~~~~~~ + +:: + + PUT /{api version}/{account}/{dest-container}/{dest-object} HTTP/1.1 + X-Copy-From: {source-container}/{source-object} + Host: {fqdn} + X-Auth-Token: {auth-token} + + +or alternatively: + +:: + + COPY /{api version}/{account}/{source-container}/{source-object} HTTP/1.1 + Destination: {dest-container}/{dest-object} + +Request Headers +~~~~~~~~~~~~~~~ + +``X-Copy-From`` + +:Description: Used with a ``PUT`` request to define the source container/object path. +:Type: String +:Required: Yes, if using ``PUT`` + + +``Destination`` + +:Description: Used with a ``COPY`` request to define the destination container/object path. +:Type: String +:Required: Yes, if using ``COPY`` + + +``If-Modified-Since`` + +:Description: Only copies if modified since the date/time of the source object's ``last_modified`` attribute. +:Type: Date +:Required: No + + +``If-Unmodified-Since`` + +:Description: Only copies if not modified since the date/time of the source object's ``last_modified`` attribute. +:Type: Date +:Required: No + +``Copy-If-Match`` + +:Description: Copies only if the ETag in the request matches the source object's ETag. +:Type: ETag. +:Required: No + + +``Copy-If-None-Match`` + +:Description: Copies only if the ETag in the request does not match the source object's ETag. +:Type: ETag. +:Required: No + + +Delete an Object +================ + +To delete an object, make a ``DELETE`` request with the API version, account, +container and object name. You must have write permissions on the container to delete +an object within it. Once you have successfully deleted the object, you will be able to +reuse the object name. + +Syntax +~~~~~~ + +:: + + DELETE /{api version}/{account}/{container}/{object} HTTP/1.1 + Host: {fqdn} + X-Auth-Token: {auth-token} + + +Get an Object +============= + +To retrieve an object, make a ``GET`` request with the API version, account, +container and object name. You must have read permissions on the container to +retrieve an object within it. + +Syntax +~~~~~~ + +:: + + GET /{api version}/{account}/{container}/{object} HTTP/1.1 + Host: {fqdn} + X-Auth-Token: {auth-token} + + + +Request Headers +~~~~~~~~~~~~~~~ + +``range`` + +:Description: To retrieve a subset of an object's contents, you may specify a byte range. +:Type: Date +:Required: No + + +``If-Modified-Since`` + +:Description: Only copies if modified since the date/time of the source object's ``last_modified`` attribute. +:Type: Date +:Required: No + + +``If-Unmodified-Since`` + +:Description: Only copies if not modified since the date/time of the source object's ``last_modified`` attribute. +:Type: Date +:Required: No + +``Copy-If-Match`` + +:Description: Copies only if the ETag in the request matches the source object's ETag. +:Type: ETag. +:Required: No + + +``Copy-If-None-Match`` + +:Description: Copies only if the ETag in the request does not match the source object's ETag. +:Type: ETag. +:Required: No + + + +Response Headers +~~~~~~~~~~~~~~~~ + +``Content-Range`` + +:Description: The range of the subset of object contents. Returned only if the range header field was specified in the request + + +Get Object Metadata +=================== + +To retrieve an object's metadata, make a ``HEAD`` request with the API version, +account, container and object name. You must have read permissions on the +container to retrieve metadata from an object within the container. This request +returns the same header information as the request for the object itself, but +it does not return the object's data. + +Syntax +~~~~~~ + +:: + + HEAD /{api version}/{account}/{container}/{object} HTTP/1.1 + Host: {fqdn} + X-Auth-Token: {auth-token} + + + +Add/Update Object Metadata +========================== + +To add metadata to an object, make a ``POST`` request with the API version, +account, container and object name. You must have write permissions on the +parent container to add or update metadata. + + +Syntax +~~~~~~ + +:: + + POST /{api version}/{account}/{container}/{object} HTTP/1.1 + Host: {fqdn} + X-Auth-Token: {auth-token} + +Request Headers +~~~~~~~~~~~~~~~ + +``X-Object-Meta-{key}`` + +:Description: A user-defined meta data key that takes an arbitrary string value. +:Type: String +:Required: No + diff --git a/doc/radosgw/swift/python.rst b/doc/radosgw/swift/python.rst new file mode 100644 index 000000000..28d92d7cc --- /dev/null +++ b/doc/radosgw/swift/python.rst @@ -0,0 +1,114 @@ +.. _python_swift: + +===================== +Python Swift Examples +===================== + +Create a Connection +=================== + +This creates a connection so that you can interact with the server: + +.. code-block:: python + + import swiftclient + user = 'account_name:username' + key = 'your_api_key' + + conn = swiftclient.Connection( + user=user, + key=key, + authurl='https://objects.dreamhost.com/auth', + ) + + +Create a Container +================== + +This creates a new container called ``my-new-container``: + +.. code-block:: python + + container_name = 'my-new-container' + conn.put_container(container_name) + + +Create an Object +================ + +This creates a file ``hello.txt`` from the file named ``my_hello.txt``: + +.. code-block:: python + + with open('hello.txt', 'r') as hello_file: + conn.put_object(container_name, 'hello.txt', + contents= hello_file.read(), + content_type='text/plain') + + +List Owned Containers +===================== + +This gets a list of containers that you own, and prints out the container name: + +.. code-block:: python + + for container in conn.get_account()[1]: + print container['name'] + +The output will look something like this:: + + mahbuckat1 + mahbuckat2 + mahbuckat3 + +List a Container's Content +========================== + +This gets a list of objects in the container, and prints out each +object's name, the file size, and last modified date: + +.. code-block:: python + + for data in conn.get_container(container_name)[1]: + print '{0}\t{1}\t{2}'.format(data['name'], data['bytes'], data['last_modified']) + +The output will look something like this:: + + myphoto1.jpg 251262 2011-08-08T21:35:48.000Z + myphoto2.jpg 262518 2011-08-08T21:38:01.000Z + + +Retrieve an Object +================== + +This downloads the object ``hello.txt`` and saves it in +``./my_hello.txt``: + +.. code-block:: python + + obj_tuple = conn.get_object(container_name, 'hello.txt') + with open('my_hello.txt', 'w') as my_hello: + my_hello.write(obj_tuple[1]) + + +Delete an Object +================ + +This deletes the object ``hello.txt``: + +.. code-block:: python + + conn.delete_object(container_name, 'hello.txt') + +Delete a Container +================== + +.. note:: + + The container must be empty! Otherwise the request won't work! + +.. code-block:: python + + conn.delete_container(container_name) + diff --git a/doc/radosgw/swift/ruby.rst b/doc/radosgw/swift/ruby.rst new file mode 100644 index 000000000..a20b66d88 --- /dev/null +++ b/doc/radosgw/swift/ruby.rst @@ -0,0 +1,119 @@ +.. _ruby_swift: + +===================== + Ruby Swift Examples +===================== + +Create a Connection +=================== + +This creates a connection so that you can interact with the server: + +.. code-block:: ruby + + require 'cloudfiles' + username = 'account_name:user_name' + api_key = 'your_secret_key' + + conn = CloudFiles::Connection.new( + :username => username, + :api_key => api_key, + :auth_url => 'http://objects.dreamhost.com/auth' + ) + + +Create a Container +================== + +This creates a new container called ``my-new-container`` + +.. code-block:: ruby + + container = conn.create_container('my-new-container') + + +Create an Object +================ + +This creates a file ``hello.txt`` from the file named ``my_hello.txt`` + +.. code-block:: ruby + + obj = container.create_object('hello.txt') + obj.load_from_filename('./my_hello.txt') + obj.content_type = 'text/plain' + + + +List Owned Containers +===================== + +This gets a list of Containers that you own, and also prints out +the container name: + +.. code-block:: ruby + + conn.containers.each do |container| + puts container + end + +The output will look something like this:: + + mahbuckat1 + mahbuckat2 + mahbuckat3 + + +List a Container's Contents +=========================== + +This gets a list of objects in the container, and prints out each +object's name, the file size, and last modified date: + +.. code-block:: ruby + + require 'date' # not necessary in the next version + + container.objects_detail.each do |name, data| + puts "#{name}\t#{data[:bytes]}\t#{data[:last_modified]}" + end + +The output will look something like this:: + + myphoto1.jpg 251262 2011-08-08T21:35:48.000Z + myphoto2.jpg 262518 2011-08-08T21:38:01.000Z + + + +Retrieve an Object +================== + +This downloads the object ``hello.txt`` and saves it in +``./my_hello.txt``: + +.. code-block:: ruby + + obj = container.object('hello.txt') + obj.save_to_filename('./my_hello.txt') + + +Delete an Object +================ + +This deletes the object ``goodbye.txt``: + +.. code-block:: ruby + + container.delete_object('goodbye.txt') + + +Delete a Container +================== + +.. note:: + + The container must be empty! Otherwise the request won't work! + +.. code-block:: ruby + + container.delete_container('my-new-container') diff --git a/doc/radosgw/swift/serviceops.rst b/doc/radosgw/swift/serviceops.rst new file mode 100644 index 000000000..a00f3d807 --- /dev/null +++ b/doc/radosgw/swift/serviceops.rst @@ -0,0 +1,76 @@ +==================== + Service Operations +==================== + +To retrieve data about our Swift-compatible service, you may execute ``GET`` +requests using the ``X-Storage-Url`` value retrieved during authentication. + +List Containers +=============== + +A ``GET`` request that specifies the API version and the account will return +a list of containers for a particular user account. Since the request returns +a particular user's containers, the request requires an authentication token. +The request cannot be made anonymously. + +Syntax +~~~~~~ + +:: + + GET /{api version}/{account} HTTP/1.1 + Host: {fqdn} + X-Auth-Token: {auth-token} + + + +Request Parameters +~~~~~~~~~~~~~~~~~~ + +``limit`` + +:Description: Limits the number of results to the specified value. +:Type: Integer +:Required: No + +``format`` + +:Description: Defines the format of the result. +:Type: String +:Valid Values: ``json`` | ``xml`` +:Required: No + + +``marker`` + +:Description: Returns a list of results greater than the marker value. +:Type: String +:Required: No + + + +Response Entities +~~~~~~~~~~~~~~~~~ + +The response contains a list of containers, or returns with an HTTP +204 response code + +``account`` + +:Description: A list for account information. +:Type: Container + +``container`` + +:Description: The list of containers. +:Type: Container + +``name`` + +:Description: The name of a container. +:Type: String + +``bytes`` + +:Description: The size of the container. +:Type: Integer
\ No newline at end of file diff --git a/doc/radosgw/swift/tempurl.rst b/doc/radosgw/swift/tempurl.rst new file mode 100644 index 000000000..79b392de6 --- /dev/null +++ b/doc/radosgw/swift/tempurl.rst @@ -0,0 +1,102 @@ +==================== + Temp URL Operations +==================== + +To allow temporary access (for eg for `GET` requests) to objects +without the need to share credentials, temp url functionality is +supported by swift endpoint of radosgw. For this functionality, +initially the value of `X-Account-Meta-Temp-URL-Key` and optionally +`X-Account-Meta-Temp-URL-Key-2` should be set. The Temp URL +functionality relies on a HMAC-SHA1 signature against these secret +keys. + +.. note:: If you are planning to expose Temp URL functionality for the + Swift API, it is strongly recommended to include the Swift + account name in the endpoint definition, so as to most + closely emulate the behavior of native OpenStack Swift. To + do so, set the ``ceph.conf`` configuration option ``rgw + swift account in url = true``, and update your Keystone + endpoint to the URL suffix ``/v1/AUTH_%(tenant_id)s`` + (instead of just ``/v1``). + + +POST Temp-URL Keys +================== + +A ``POST`` request to the Swift account with the required key will set +the secret temp URL key for the account, against which temporary URL +access can be provided to accounts. Up to two keys are supported, and +signatures are checked against both the keys, if present, so that keys +can be rotated without invalidating the temporary URLs. + +.. note:: Native OpenStack Swift also supports the option to set + temporary URL keys at the container level, issuing a + ``POST`` or ``PUT`` request against a container that sets + ``X-Container-Meta-Temp-URL-Key`` or + ``X-Container-Meta-Temp-URL-Key-2``. This functionality is + not supported in radosgw; temporary URL keys can only be set + and used at the account level. + +Syntax +~~~~~~ + +:: + + POST /{api version}/{account} HTTP/1.1 + Host: {fqdn} + X-Auth-Token: {auth-token} + +Request Headers +~~~~~~~~~~~~~~~ + +``X-Account-Meta-Temp-URL-Key`` + +:Description: A user-defined key that takes an arbitrary string value. +:Type: String +:Required: Yes + +``X-Account-Meta-Temp-URL-Key-2`` + +:Description: A user-defined key that takes an arbitrary string value. +:Type: String +:Required: No + + +GET Temp-URL Objects +==================== + +Temporary URL uses a cryptographic HMAC-SHA1 signature, which includes +the following elements: + +#. The value of the Request method, "GET" for instance +#. The expiry time, in format of seconds since the epoch, ie Unix time +#. The request path starting from "v1" onwards + +The above items are normalized with newlines appended between them, +and a HMAC is generated using the SHA-1 hashing algorithm against one +of the Temp URL Keys posted earlier. + +A sample python script to demonstrate the above is given below: + + +.. code-block:: python + + import hmac + from hashlib import sha1 + from time import time + + method = 'GET' + host = 'https://objectstore.example.com/swift' + duration_in_seconds = 300 # Duration for which the url is valid + expires = int(time() + duration_in_seconds) + path = '/v1/your-bucket/your-object' + key = 'secret' + hmac_body = '%s\n%s\n%s' % (method, expires, path) + sig = hmac.new(key, hmac_body, sha1).hexdigest() + rest_uri = "{host}{path}?temp_url_sig={sig}&temp_url_expires={expires}".format( + host=host, path=path, sig=sig, expires=expires) + print rest_uri + + # Example Output + # https://objectstore.example.com/swift/v1/your-bucket/your-object?temp_url_sig=ff4657876227fc6025f04fcf1e82818266d022c6&temp_url_expires=1423200992 + diff --git a/doc/radosgw/swift/tutorial.rst b/doc/radosgw/swift/tutorial.rst new file mode 100644 index 000000000..5d2889b19 --- /dev/null +++ b/doc/radosgw/swift/tutorial.rst @@ -0,0 +1,62 @@ +========== + Tutorial +========== + +The Swift-compatible API tutorials follow a simple container-based object +lifecycle. The first step requires you to setup a connection between your +client and the RADOS Gateway server. Then, you may follow a natural +container and object lifecycle, including adding and retrieving object +metadata. See example code for the following languages: + +- `Java`_ +- `Python`_ +- `Ruby`_ + + +.. ditaa:: + + +----------------------------+ +-----------------------------+ + | | | | + | Create a Connection |------->| Create a Container | + | | | | + +----------------------------+ +-----------------------------+ + | + +--------------------------------------+ + | + v + +----------------------------+ +-----------------------------+ + | | | | + | Create an Object |------->| Add/Update Object Metadata | + | | | | + +----------------------------+ +-----------------------------+ + | + +--------------------------------------+ + | + v + +----------------------------+ +-----------------------------+ + | | | | + | List Owned Containers |------->| List a Container's Contents | + | | | | + +----------------------------+ +-----------------------------+ + | + +--------------------------------------+ + | + v + +----------------------------+ +-----------------------------+ + | | | | + | Get an Object's Metadata |------->| Retrieve an Object | + | | | | + +----------------------------+ +-----------------------------+ + | + +--------------------------------------+ + | + v + +----------------------------+ +-----------------------------+ + | | | | + | Delete an Object |------->| Delete a Container | + | | | | + +----------------------------+ +-----------------------------+ + +.. _Java: ../java +.. _Python: ../python +.. _Ruby: ../ruby diff --git a/doc/radosgw/sync-modules.rst b/doc/radosgw/sync-modules.rst new file mode 100644 index 000000000..ca201e90a --- /dev/null +++ b/doc/radosgw/sync-modules.rst @@ -0,0 +1,99 @@ +============ +Sync Modules +============ + +.. versionadded:: Kraken + +The :ref:`multisite` functionality of RGW introduced in Jewel allowed the ability to +create multiple zones and mirror data and metadata between them. ``Sync Modules`` +are built atop of the multisite framework that allows for forwarding data and +metadata to a different external tier. A sync module allows for a set of actions +to be performed whenever a change in data occurs (metadata ops like bucket or +user creation etc. are also regarded as changes in data). As the rgw multisite +changes are eventually consistent at remote sites, changes are propagated +asynchronously. This would allow for unlocking use cases such as backing up the +object storage to an external cloud cluster or a custom backup solution using +tape drives, indexing metadata in ElasticSearch etc. + +A sync module configuration is local to a zone. The sync module determines +whether the zone exports data or can only consume data that was modified in +another zone. As of luminous the supported sync plugins are `elasticsearch`_, +``rgw``, which is the default sync plugin that synchronises data between the +zones and ``log`` which is a trivial sync plugin that logs the metadata +operation that happens in the remote zones. The following docs are written with +the example of a zone using `elasticsearch sync module`_, the process would be similar +for configuring any sync plugin + +.. toctree:: + :maxdepth: 1 + + ElasticSearch Sync Module <elastic-sync-module> + Cloud Sync Module <cloud-sync-module> + PubSub Module <pubsub-module> + Archive Sync Module <archive-sync-module> + +.. note ``rgw`` is the default sync plugin and there is no need to explicitly + configure this + +Requirements and Assumptions +---------------------------- + +Let us assume a simple multisite configuration as described in the :ref:`multisite` +docs, of 2 zones ``us-east`` and ``us-west``, let's add a third zone +``us-east-es`` which is a zone that only processes metadata from the other +sites. This zone can be in the same or a different ceph cluster as ``us-east``. +This zone would only consume metadata from other zones and RGWs in this zone +will not serve any end user requests directly. + + +Configuring Sync Modules +------------------------ + +Create the third zone similar to the :ref:`multisite` docs, for example + +:: + + # radosgw-admin zone create --rgw-zonegroup=us --rgw-zone=us-east-es \ + --access-key={system-key} --secret={secret} --endpoints=http://rgw-es:80 + + + +A sync module can be configured for this zone via the following + +:: + + # radosgw-admin zone modify --rgw-zone={zone-name} --tier-type={tier-type} --tier-config={set of key=value pairs} + + +For example in the ``elasticsearch`` sync module + +:: + + # radosgw-admin zone modify --rgw-zone={zone-name} --tier-type=elasticsearch \ + --tier-config=endpoint=http://localhost:9200,num_shards=10,num_replicas=1 + + +For the various supported tier-config options refer to the `elasticsearch sync module`_ docs + +Finally update the period + + +:: + + # radosgw-admin period update --commit + + +Now start the radosgw in the zone + +:: + + # systemctl start ceph-radosgw@rgw.`hostname -s` + # systemctl enable ceph-radosgw@rgw.`hostname -s` + + + +.. _`elasticsearch sync module`: ../elastic-sync-module +.. _`elasticsearch`: ../elastic-sync-module +.. _`cloud sync module`: ../cloud-sync-module +.. _`pubsub module`: ../pubsub-module +.. _`archive sync module`: ../archive-sync-module diff --git a/doc/radosgw/troubleshooting.rst b/doc/radosgw/troubleshooting.rst new file mode 100644 index 000000000..4a084e82a --- /dev/null +++ b/doc/radosgw/troubleshooting.rst @@ -0,0 +1,208 @@ +================= + Troubleshooting +================= + + +The Gateway Won't Start +======================= + +If you cannot start the gateway (i.e., there is no existing ``pid``), +check to see if there is an existing ``.asok`` file from another +user. If an ``.asok`` file from another user exists and there is no +running ``pid``, remove the ``.asok`` file and try to start the +process again. This may occur when you start the process as a ``root`` user and +the startup script is trying to start the process as a +``www-data`` or ``apache`` user and an existing ``.asok`` is +preventing the script from starting the daemon. + +The radosgw init script (/etc/init.d/radosgw) also has a verbose argument that +can provide some insight as to what could be the issue:: + + /etc/init.d/radosgw start -v + +or :: + + /etc/init.d radosgw start --verbose + +HTTP Request Errors +=================== + +Examining the access and error logs for the web server itself is +probably the first step in identifying what is going on. If there is +a 500 error, that usually indicates a problem communicating with the +``radosgw`` daemon. Ensure the daemon is running, its socket path is +configured, and that the web server is looking for it in the proper +location. + + +Crashed ``radosgw`` process +=========================== + +If the ``radosgw`` process dies, you will normally see a 500 error +from the web server (apache, nginx, etc.). In that situation, simply +restarting radosgw will restore service. + +To diagnose the cause of the crash, check the log in ``/var/log/ceph`` +and/or the core file (if one was generated). + + +Blocked ``radosgw`` Requests +============================ + +If some (or all) radosgw requests appear to be blocked, you can get +some insight into the internal state of the ``radosgw`` daemon via +its admin socket. By default, there will be a socket configured to +reside in ``/var/run/ceph``, and the daemon can be queried with:: + + ceph daemon /var/run/ceph/client.rgw help + + help list available commands + objecter_requests show in-progress osd requests + perfcounters_dump dump perfcounters value + perfcounters_schema dump perfcounters schema + version get protocol version + +Of particular interest:: + + ceph daemon /var/run/ceph/client.rgw objecter_requests + ... + +will dump information about current in-progress requests with the +RADOS cluster. This allows one to identify if any requests are blocked +by a non-responsive OSD. For example, one might see:: + + { "ops": [ + { "tid": 1858, + "pg": "2.d2041a48", + "osd": 1, + "last_sent": "2012-03-08 14:56:37.949872", + "attempts": 1, + "object_id": "fatty_25647_object1857", + "object_locator": "@2", + "snapid": "head", + "snap_context": "0=[]", + "mtime": "2012-03-08 14:56:37.949813", + "osd_ops": [ + "write 0~4096"]}, + { "tid": 1873, + "pg": "2.695e9f8e", + "osd": 1, + "last_sent": "2012-03-08 14:56:37.970615", + "attempts": 1, + "object_id": "fatty_25647_object1872", + "object_locator": "@2", + "snapid": "head", + "snap_context": "0=[]", + "mtime": "2012-03-08 14:56:37.970555", + "osd_ops": [ + "write 0~4096"]}], + "linger_ops": [], + "pool_ops": [], + "pool_stat_ops": [], + "statfs_ops": []} + +In this dump, two requests are in progress. The ``last_sent`` field is +the time the RADOS request was sent. If this is a while ago, it suggests +that the OSD is not responding. For example, for request 1858, you could +check the OSD status with:: + + ceph pg map 2.d2041a48 + + osdmap e9 pg 2.d2041a48 (2.0) -> up [1,0] acting [1,0] + +This tells us to look at ``osd.1``, the primary copy for this PG:: + + ceph daemon osd.1 ops + { "num_ops": 651, + "ops": [ + { "description": "osd_op(client.4124.0:1858 fatty_25647_object1857 [write 0~4096] 2.d2041a48)", + "received_at": "1331247573.344650", + "age": "25.606449", + "flag_point": "waiting for sub ops", + "client_info": { "client": "client.4124", + "tid": 1858}}, + ... + +The ``flag_point`` field indicates that the OSD is currently waiting +for replicas to respond, in this case ``osd.0``. + + +Java S3 API Troubleshooting +=========================== + + +Peer Not Authenticated +---------------------- + +You may receive an error that looks like this:: + + [java] INFO: Unable to execute HTTP request: peer not authenticated + +The Java SDK for S3 requires a valid certificate from a recognized certificate +authority, because it uses HTTPS by default. If you are just testing the Ceph +Object Storage services, you can resolve this problem in a few ways: + +#. Prepend the IP address or hostname with ``http://``. For example, change this:: + + conn.setEndpoint("myserver"); + + To:: + + conn.setEndpoint("http://myserver") + +#. After setting your credentials, add a client configuration and set the + protocol to ``Protocol.HTTP``. :: + + AWSCredentials credentials = new BasicAWSCredentials(accessKey, secretKey); + + ClientConfiguration clientConfig = new ClientConfiguration(); + clientConfig.setProtocol(Protocol.HTTP); + + AmazonS3 conn = new AmazonS3Client(credentials, clientConfig); + + + +405 MethodNotAllowed +-------------------- + +If you receive an 405 error, check to see if you have the S3 subdomain set up correctly. +You will need to have a wild card setting in your DNS record for subdomain functionality +to work properly. + +Also, check to ensure that the default site is disabled. :: + + [java] Exception in thread "main" Status Code: 405, AWS Service: Amazon S3, AWS Request ID: null, AWS Error Code: MethodNotAllowed, AWS Error Message: null, S3 Extended Request ID: null + + + +Numerous objects in default.rgw.meta pool +========================================= + +Clusters created prior to *jewel* have a metadata archival feature enabled by default, using the ``default.rgw.meta`` pool. +This archive keeps all old versions of user and bucket metadata, resulting in large numbers of objects in the ``default.rgw.meta`` pool. + +Disabling the Metadata Heap +--------------------------- + +Users who want to disable this feature going forward should set the ``metadata_heap`` field to an empty string ``""``:: + + $ radosgw-admin zone get --rgw-zone=default > zone.json + [edit zone.json, setting "metadata_heap": ""] + $ radosgw-admin zone set --rgw-zone=default --infile=zone.json + $ radosgw-admin period update --commit + +This will stop new metadata from being written to the ``default.rgw.meta`` pool, but does not remove any existing objects or pool. + +Cleaning the Metadata Heap Pool +------------------------------- + +Clusters created prior to *jewel* normally use ``default.rgw.meta`` only for the metadata archival feature. + +However, from *luminous* onwards, radosgw uses :ref:`Pool Namespaces <radosgw-pool-namespaces>` within ``default.rgw.meta`` for an entirely different purpose, that is, to store ``user_keys`` and other critical metadata. + +Users should check zone configuration before proceeding any cleanup procedures:: + + $ radosgw-admin zone get --rgw-zone=default | grep default.rgw.meta + [should not match any strings] + +Having confirmed that the pool is not used for any purpose, users may safely delete all objects in the ``default.rgw.meta`` pool, or optionally, delete the entire pool itself. diff --git a/doc/radosgw/vault.rst b/doc/radosgw/vault.rst new file mode 100644 index 000000000..0f3cb8fd1 --- /dev/null +++ b/doc/radosgw/vault.rst @@ -0,0 +1,482 @@ +=========================== +HashiCorp Vault Integration +=========================== + +HashiCorp `Vault`_ can be used as a secure key management service for +`Server-Side Encryption`_ (SSE-KMS). + +.. ditaa:: + + +---------+ +---------+ +-------+ +-------+ + | Client | | RadosGW | | Vault | | OSD | + +---------+ +---------+ +-------+ +-------+ + | create secret | | | + | key for key ID | | | + |-----------------+---------------->| | + | | | | + | upload object | | | + | with key ID | | | + |---------------->| request secret | | + | | key for key ID | | + | |---------------->| | + | |<----------------| | + | | return secret | | + | | key | | + | | | | + | | encrypt object | | + | | with secret key | | + | |--------------+ | | + | | | | | + | |<-------------+ | | + | | | | + | | store encrypted | | + | | object | | + | |------------------------------>| + +#. `Vault secrets engines`_ +#. `Vault authentication`_ +#. `Vault namespaces`_ +#. `Create a key in Vault`_ +#. `Configure the Ceph Object Gateway`_ +#. `Upload object`_ + +Some examples below use the Vault command line utility to interact with +Vault. You may need to set the following environment variable with the correct +address of your Vault server to use this utility:: + + export VAULT_ADDR='https://vault-server-fqdn:8200' + +Vault secrets engines +===================== + +Vault provides several secrets engines, which can store, generate, and encrypt +data. Currently, the Object Gateway supports: + +- `KV secrets engine`_ version 2 +- `Transit engine`_ + +KV secrets engine +----------------- + +The KV secrets engine is used to store arbitrary key/value secrets in Vault. To +enable the KV engine version 2 in Vault, use the following command:: + + vault secrets enable -path secret kv-v2 + +The Object Gateway can be configured to use the KV engine version 2 with the +following setting:: + + rgw crypt vault secret engine = kv + +Transit secrets engine +---------------------- + +The transit engine handles cryptographic functions on data in-transit. To enable +it in Vault, use the following command:: + + vault secrets enable transit + +The Object Gateway can be configured to use the transit engine with the +following setting:: + + rgw crypt vault secret engine = transit + +Vault authentication +==================== + +Vault supports several authentication mechanisms. Currently, the Object +Gateway can be configured to authenticate to Vault using the +`Token authentication method`_ or a `Vault agent`_. + +Most tokens in Vault have limited lifetimes and powers. The only +sort of Vault token that does not have a lifetime are root tokens. +For all other tokens, it is necesary to periodically refresh them, +either by performing initial authentication, or by renewing the token. +Ceph does not have any logic to perform either operation. +The simplest best way to use Vault tokens with ceph is to +also run the Vault agent and have it refresh the token file. +When the Vault agent is used in this mode, file system permissions +can be used to restrict who has the use of tokens. + +Instead of having Vault agent refresh a token file, it can be told +to act as a proxy server. In this mode, Vault will add a token when +necessary and add it to requests passed to it before forwarding them on +to the real server. Vault agent will still handle token renewal just +as it would when storing a token in the filesystem. In this mode, it +is necessary to properly secure the network path rgw uses to reach the +Vault agent, such as having the Vault agent listen only to localhost. + +Token policies for the object gateway +------------------------------------- + +All Vault tokens have powers as specified by the polices attached +to that token. Multiple policies may be associated with one +token. You should only use the policy necessary for your +configuration. + +When using the kv secret engine with the object gateway:: + + vault policy write rgw-kv-policy -<<EOF + path "secret/data/*" { + capabilities = ["read"] + } + EOF + +When using the transit secret engine with the object gateway:: + + vault policy write rgw-transit-policy -<<EOF + path "transit/keys/*" { + capabilities = [ "create", "update" ] + denied_parameters = {"exportable" = [], "allow_plaintext_backup" = [] } + } + + path "transit/keys/*" { + capabilities = ["read", "delete"] + } + + path "transit/keys/" { + capabilities = ["list"] + } + + path "transit/keys/+/rotate" { + capabilities = [ "update" ] + } + + path "transit/*" { + capabilities = [ "update" ] + } + EOF + +If you had previously used an older version of ceph with the +transit secret engine, you might need the following policy:: + + vault policy write old-rgw-transit-policy -<<EOF + path "transit/export/encryption-key/*" { + capabilities = ["read"] + } + EOF + + +Token authentication +-------------------- + +.. note: Never use root tokens with ceph in production environments. + +The token authentication method expects a Vault token to be present in a +plaintext file. The Object Gateway can be configured to use token authentication +with the following settings:: + + rgw crypt vault auth = token + rgw crypt vault token file = /run/.rgw-vault-token + rgw crypt vault addr = https://vault-server-fqdn:8200 + +Adjust these settinsg to match your configuration. +For security reasons, the token file must be readable by the Object Gateway +only. + +You might set up vault agent as follows:: + + vault write auth/approle/role/rgw-ap \ + token_policies=rgw-transit-policy,default \ + token_max_ttl=60m + +Change the policy here to match your configuration. + +Get the role-id:: + + vault read auth/approle/role/rgw-ap/role-id -format=json | \ + jq -r .data.role_id + +Store the output in some file, such as /usr/local/etc/vault/.rgw-ap-role-id + +Get the secret-id:: + + vault read auth/approle/role/rgw-ap/role-id -format=json | \ + jq -r .data.role_id + +Store the output in some file, such as /usr/local/etc/vault/.rgw-ap-secret-id + +Create configuration for the Vault agent, such as:: + + pid_file = "/run/rgw-vault-agent-pid" + auto_auth { + method "AppRole" { + mount_path = "auth/approle" + config = { + role_id_file_path ="/usr/local/etc/vault/.rgw-ap-role-id" + secret_id_file_path ="/usr/local/etc/vault/.rgw-ap-secret-id" + remove_secret_id_file_after_reading ="false" + } + } + sink "file" { + config = { + path = "/run/.rgw-vault-token" + } + } + } + vault { + address = "https://vault-server-fqdn:8200" + } + +Then use systemctl or another method of your choice to run +a persistent daemon with the following arguments:: + + /usr/local/bin/vault agent -config=/usr/local/etc/vault/rgw-agent.hcl + +Once the vault agent is running, the token file should be populated +with a valid token. + +Vault agent +----------- + +The Vault agent is a client daemon that provides authentication to Vault and +manages token renewal and caching. It typically runs on the same host as the +Object Gateway. With a Vault agent, it is possible to use other Vault +authentication mechanism such as AppRole, AWS, Certs, JWT, and Azure. + +The Object Gateway can be configured to use a Vault agent with the following +settings:: + + rgw crypt vault auth = agent + rgw crypt vault addr = http://127.0.0.1:8100 + +You might set up vault agent as follows:: + + vault write auth/approle/role/rgw-ap \ + token_policies=rgw-transit-policy,default \ + token_max_ttl=60m + +Change the policy here to match your configuration. + +Get the role-id: + vault read auth/approle/role/rgw-ap/role-id -format=json | \ + jq -r .data.role_id + +Store the output in some file, such as /usr/local/etc/vault/.rgw-ap-role-id + +Get the secret-id: + vault read auth/approle/role/rgw-ap/role-id -format=json | \ + jq -r .data.role_id + +Store the output in some file, such as /usr/local/etc/vault/.rgw-ap-secret-id + +Create configuration for the Vault agent, such as:: + + pid_file = "/run/rgw-vault-agent-pid" + auto_auth { + method "AppRole" { + mount_path = "auth/approle" + config = { + role_id_file_path ="/usr/local/etc/vault/.rgw-ap-role-id" + secret_id_file_path ="/usr/local/etc/vault/.rgw-ap-secret-id" + remove_secret_id_file_after_reading ="false" + } + } + } + cache { + use_auto_auth_token = true + } + listener "tcp" { + address = "127.0.0.1:8100" + tls_disable = true + } + vault { + address = "https://vault-server-fqdn:8200" + } + +Then use systemctl or another method of your choice to run +a persistent daemon with the following arguments:: + + /usr/local/bin/vault agent -config=/usr/local/etc/vault/rgw-agent.hcl + +Once the vault agent is running, you should find it listening +to port 8100 on localhost, and you should be able to interact +with it using the vault command. + +Vault namespaces +================ + +In the Enterprise version, Vault supports the concept of `namespaces`_, which +allows centralized management for teams within an organization while ensuring +that those teams operate within isolated environments known as tenants. + +The Object Gateway can be configured to access Vault within a particular +namespace using the following configuration setting:: + + rgw crypt vault namespace = tenant1 + +Create a key in Vault +===================== + +.. note:: Keys for server-side encryption must be 256-bit long and base-64 + encoded. + +Using the KV engine +------------------- + +A key for server-side encryption can be created in the KV version 2 engine using +the command line utility, as in the following example:: + + vault kv put secret/myproject/mybucketkey key=$(openssl rand -base64 32) + +Sample output:: + + ====== Metadata ====== + Key Value + --- ----- + created_time 2019-08-29T17:01:09.095824999Z + deletion_time n/a + destroyed false + version 1 + +Note that in the KV secrets engine, secrets are stored as key-value pairs, and +the Gateway expects the key name to be ``key``, i.e. the secret must be in the +form ``key=<secret key>``. + +Using the Transit engine +------------------------ + +Keys created for use with the Transit engine should no longer be marked +exportable. They can be created with:: + + vault write -f transit/keys/mybucketkey + +The command above creates a keyring, which contains a key of type +``aes256-gcm96`` by default. To verify that the key was correctly created, use +the following command:: + + vault read transit/mybucketkey + +Sample output:: + + Key Value + --- ----- + derived false + exportable false + name mybucketkey + type aes256-gcm96 + +Configure the Ceph Object Gateway +================================= + +Edit the Ceph configuration file to enable Vault as a KMS backend for +server-side encryption:: + + rgw crypt s3 kms backend = vault + +Choose the Vault authentication method, e.g.:: + + rgw crypt vault auth = token + rgw crypt vault token file = /run/.rgw-vault-token + rgw crypt vault addr = https://vault-server-fqdn:8200 + +Or:: + + rgw crypt vault auth = agent + rgw crypt vault addr = http://localhost:8100 + +Choose the secrets engine:: + + rgw crypt vault secret engine = kv + +Or:: + + rgw crypt vault secret engine = transit + +Optionally, set the Vault namespace where encryption keys will be fetched from:: + + rgw crypt vault namespace = tenant1 + +Finally, the URLs where the Gateway will retrieve encryption keys from Vault can +be restricted by setting a path prefix. For instance, the Gateway can be +restricted to fetch KV keys as follows:: + + rgw crypt vault prefix = /v1/secret/data + +Or, when using the transit secret engine:: + + rgw crypt vault prefix = /v1/transit + +In the example above, the Gateway would only fetch transit encryption keys under +``https://vault-server:8200/v1/transit``. + +You can use custom ssl certs to authenticate with vault with help of +following options:: + + rgw crypt vault verify ssl = true + rgw crypt vault ssl cacert = /etc/ceph/vault.ca + rgw crypt vault ssl clientcert = /etc/ceph/vault.crt + rgw crypt vault ssl clientkey = /etc/ceph/vault.key + +where vault.ca is CA certificate and vault.key/vault.crt are private key and ssl +ceritificate generated for RGW to access the vault server. It highly recommended to +set this option true, setting false is very dangerous and need to avoid since this +runs in very secured enviroments. + +Transit engine compatibility support +------------------------------------ +The transit engine has compatibility support for previous +versions of ceph, which used the transit engine as a simple key store. + +There is a a "compat" option which can be given to the transit +engine to configure the compatibility support, + +To entirely disable backwards support, use:: + + rgw crypt vault secret engine = transit compat=0 + +This will be the default in future verisons. and is safe to use +for new installs using the current version. + +This is the normal default with the current version:: + + rgw crypt vault secret engine = transit compat=1 + +This enables the new engine for newly created objects, +but still allows the old engine to be used for old objects. +In order to access old and new objects, the vault token given +to ceph must have both the old and new transit policies. + +To force use of only the old engine, use:: + + rgw crypt vault secret engine = transit compat=2 + +This mode is automatically selected if the vault prefix +ends in export/encryption-key, which was the previously +documented setting. + +Upload object +============= + +When uploading an object to the Gateway, provide the SSE key ID in the request. +As an example, for the kv engine, using the AWS command-line client:: + + aws --endpoint=http://radosgw:8000 s3 cp plaintext.txt s3://mybucket/encrypted.txt --sse=aws:kms --sse-kms-key-id myproject/mybucketkey + +As an example, for the transit engine (new flavor), using the AWS command-line client:: + + aws --endpoint=http://radosgw:8000 s3 cp plaintext.txt s3://mybucket/encrypted.txt --sse=aws:kms --sse-kms-key-id mybucketkey + +The Object Gateway will fetch the key from Vault, encrypt the object and store +it in the bucket. Any request to download the object will make the Gateway +automatically retrieve the correspondent key from Vault and decrypt the object. + +Note that the secret will be fetched from Vault using a URL constructed by +concatenating the base address (``rgw crypt vault addr``), the (optional) +URL prefix (``rgw crypt vault prefix``), and finally the key ID. + +In the kv engine example above, the Gateway would fetch the secret from:: + + http://vaultserver:8200/v1/secret/data/myproject/mybucketkey + +In the transit engine example above, the Gateway would encrypt the secret using this key:: + + http://vaultserver:8200/v1/transit/mybucketkey + +.. _Server-Side Encryption: ../encryption +.. _Vault: https://www.vaultproject.io/docs/ +.. _Token authentication method: https://www.vaultproject.io/docs/auth/token.html +.. _Vault agent: https://www.vaultproject.io/docs/agent/index.html +.. _KV Secrets engine: https://www.vaultproject.io/docs/secrets/kv/ +.. _Transit engine: https://www.vaultproject.io/docs/secrets/transit +.. _namespaces: https://www.vaultproject.io/docs/enterprise/namespaces/index.html |