From e6918187568dbd01842d8d1d2c808ce16a894239 Mon Sep 17 00:00:00 2001 From: Daniel Baumann Date: Sun, 21 Apr 2024 13:54:28 +0200 Subject: Adding upstream version 18.2.2. Signed-off-by: Daniel Baumann --- doc/architecture.rst | 1688 ++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 1688 insertions(+) create mode 100644 doc/architecture.rst (limited to 'doc/architecture.rst') diff --git a/doc/architecture.rst b/doc/architecture.rst new file mode 100644 index 000000000..852225ce6 --- /dev/null +++ b/doc/architecture.rst @@ -0,0 +1,1688 @@ +============== + Architecture +============== + +:term:`Ceph` uniquely delivers **object, block, and file storage** in one +unified system. Ceph is highly reliable, easy to manage, and free. The power of +Ceph can transform your company's IT infrastructure and your ability to manage +vast amounts of data. Ceph delivers extraordinary scalability–thousands of +clients accessing petabytes to exabytes of data. A :term:`Ceph Node` leverages +commodity hardware and intelligent daemons, and a :term:`Ceph Storage Cluster` +accommodates large numbers of nodes, which communicate with each other to +replicate and redistribute data dynamically. + +.. image:: images/stack.png + +.. _arch-ceph-storage-cluster: + +The Ceph Storage Cluster +======================== + +Ceph provides an infinitely scalable :term:`Ceph Storage Cluster` based upon +:abbr:`RADOS (Reliable Autonomic Distributed Object Store)`, which you can read +about in `RADOS - A Scalable, Reliable Storage Service for Petabyte-scale +Storage Clusters`_. + +A Ceph Storage Cluster consists of multiple types of daemons: + +- :term:`Ceph Monitor` +- :term:`Ceph OSD Daemon` +- :term:`Ceph Manager` +- :term:`Ceph Metadata Server` + +.. _arch_monitor: + +Ceph Monitors maintain the master copy of the cluster map, which they provide +to Ceph clients. Provisioning multiple monitors within the Ceph cluster ensures +availability in the event that one of the monitor daemons or its host fails. +The Ceph monitor provides copies of the cluster map to storage cluster clients. + +A Ceph OSD Daemon checks its own state and the state of other OSDs and reports +back to monitors. + +A Ceph Manager serves as an endpoint for monitoring, orchestration, and plug-in +modules. + +A Ceph Metadata Server (MDS) manages file metadata when CephFS is used to +provide file services. + +Storage cluster clients and :term:`Ceph OSD Daemon`\s use the CRUSH algorithm +to compute information about data location. This means that clients and OSDs +are not bottlenecked by a central lookup table. Ceph's high-level features +include a native interface to the Ceph Storage Cluster via ``librados``, and a +number of service interfaces built on top of ``librados``. + +Storing Data +------------ + +The Ceph Storage Cluster receives data from :term:`Ceph Client`\s--whether it +comes through a :term:`Ceph Block Device`, :term:`Ceph Object Storage`, the +:term:`Ceph File System`, or a custom implementation that you create by using +``librados``. The data received by the Ceph Storage Cluster is stored as RADOS +objects. Each object is stored on an :term:`Object Storage Device` (this is +also called an "OSD"). Ceph OSDs control read, write, and replication +operations on storage drives. The default BlueStore back end stores objects +in a monolithic, database-like fashion. + +.. 
ditaa:: + + /------\ +-----+ +-----+ + | obj |------>| {d} |------>| {s} | + \------/ +-----+ +-----+ + + Object OSD Drive + +Ceph OSD Daemons store data as objects in a flat namespace. This means that +objects are not stored in a hierarchy of directories. An object has an +identifier, binary data, and metadata consisting of name/value pairs. +:term:`Ceph Client`\s determine the semantics of the object data. For example, +CephFS uses metadata to store file attributes such as the file owner, the +created date, and the last modified date. + + +.. ditaa:: + + /------+------------------------------+----------------\ + | ID | Binary Data | Metadata | + +------+------------------------------+----------------+ + | 1234 | 0101010101010100110101010010 | name1 = value1 | + | | 0101100001010100110101010010 | name2 = value2 | + | | 0101100001010100110101010010 | nameN = valueN | + \------+------------------------------+----------------/ + +.. note:: An object ID is unique across the entire cluster, not just the local + filesystem. + + +.. index:: architecture; high availability, scalability + +.. _arch_scalability_and_high_availability: + +Scalability and High Availability +--------------------------------- + +In traditional architectures, clients talk to a centralized component. This +centralized component might be a gateway, a broker, an API, or a facade. A +centralized component of this kind acts as a single point of entry to a complex +subsystem. Architectures that rely upon such a centralized component have a +single point of failure and incur limits to performance and scalability. If +the centralized component goes down, the whole system becomes unavailable. + +Ceph eliminates this centralized component. This enables clients to interact +with Ceph OSDs directly. Ceph OSDs create object replicas on other Ceph Nodes +to ensure data safety and high availability. Ceph also uses a cluster of +monitors to ensure high availability. To eliminate centralization, Ceph uses an +algorithm called :abbr:`CRUSH (Controlled Replication Under Scalable Hashing)`. + + +.. index:: CRUSH; architecture + +CRUSH Introduction +~~~~~~~~~~~~~~~~~~ + +Ceph Clients and Ceph OSD Daemons both use the :abbr:`CRUSH (Controlled +Replication Under Scalable Hashing)` algorithm to compute information about +object location instead of relying upon a central lookup table. CRUSH provides +a better data management mechanism than do older approaches, and CRUSH enables +massive scale by distributing the work to all the OSD daemons in the cluster +and all the clients that communicate with them. CRUSH uses intelligent data +replication to ensure resiliency, which is better suited to hyper-scale +storage. The following sections provide additional details on how CRUSH works. +For a detailed discussion of CRUSH, see `CRUSH - Controlled, Scalable, +Decentralized Placement of Replicated Data`_. + +.. index:: architecture; cluster map + +.. _architecture_cluster_map: + +Cluster Map +~~~~~~~~~~~ + +In order for a Ceph cluster to function properly, Ceph Clients and Ceph OSDs +must have current information about the cluster's topology. Current information +is stored in the "Cluster Map", which is in fact a collection of five maps. The +five maps that constitute the cluster map are: + +#. **The Monitor Map:** Contains the cluster ``fsid``, the position, the name, + the address, and the TCP port of each monitor. 
The monitor map specifies the + current epoch, the time of the monitor map's creation, and the time of the + monitor map's last modification. To view a monitor map, run ``ceph mon + dump``. + +#. **The OSD Map:** Contains the cluster ``fsid``, the time of the OSD map's + creation, the time of the OSD map's last modification, a list of pools, a + list of replica sizes, a list of PG numbers, and a list of OSDs and their + statuses (for example, ``up``, ``in``). To view an OSD map, run ``ceph + osd dump``. + +#. **The PG Map:** Contains the PG version, its time stamp, the last OSD map + epoch, the full ratios, and the details of each placement group. This + includes the PG ID, the `Up Set`, the `Acting Set`, the state of the PG (for + example, ``active + clean``), and data usage statistics for each pool. + +#. **The CRUSH Map:** Contains a list of storage devices, the failure domain + hierarchy (for example, ``device``, ``host``, ``rack``, ``row``, ``room``), + and rules for traversing the hierarchy when storing data. To view a CRUSH + map, run ``ceph osd getcrushmap -o {filename}`` and then decompile it by + running ``crushtool -d {comp-crushmap-filename} -o + {decomp-crushmap-filename}``. Use a text editor or ``cat`` to view the + decompiled map. + +#. **The MDS Map:** Contains the current MDS map epoch, when the map was + created, and the last time it changed. It also contains the pool for + storing metadata, a list of metadata servers, and which metadata servers + are ``up`` and ``in``. To view an MDS map, execute ``ceph fs dump``. + +Each map maintains a history of changes to its operating state. Ceph Monitors +maintain a master copy of the cluster map. This master copy includes the +cluster members, the state of the cluster, changes to the cluster, and +information recording the overall health of the Ceph Storage Cluster. + +.. index:: high availability; monitor architecture + +High Availability Monitors +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +A Ceph Client must contact a Ceph Monitor and obtain a current copy of the +cluster map in order to read data from or to write data to the Ceph cluster. + +It is possible for a Ceph cluster to function properly with only a single +monitor, but a Ceph cluster that has only a single monitor has a single point +of failure: if the monitor goes down, Ceph clients will be unable to read data +from or write data to the cluster. + +Ceph leverages a cluster of monitors in order to increase reliability and fault +tolerance. When a cluster of monitors is used, however, one or more of the +monitors in the cluster can fall behind due to latency or other faults. Ceph +mitigates these negative effects by requiring multiple monitor instances to +agree about the state of the cluster. To establish consensus among the monitors +regarding the state of the cluster, Ceph uses the `Paxos`_ algorithm and a +majority of monitors (for example, one in a cluster that contains only one +monitor, two in a cluster that contains three monitors, three in a cluster that +contains five monitors, four in a cluster that contains six monitors, and so +on). + +See the `Monitor Config Reference`_ for more detail on configuring monitors. + +.. index:: architecture; high availability authentication + +.. _arch_high_availability_authentication: + +High Availability Authentication +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The ``cephx`` authentication system is used by Ceph to authenticate users and +daemons and to protect against man-in-the-middle attacks. + +.. 
note:: The ``cephx`` protocol does not address data encryption in transport + (for example, SSL/TLS) or encryption at rest. + +``cephx`` uses shared secret keys for authentication. This means that both the +client and the monitor cluster keep a copy of the client's secret key. + +The ``cephx`` protocol makes it possible for each party to prove to the other +that it has a copy of the key without revealing it. This provides mutual +authentication and allows the cluster to confirm (1) that the user has the +secret key and (2) that the user can be confident that the cluster has a copy +of the secret key. + +As stated in :ref:`Scalability and High Availability +`, Ceph does not have any centralized +interface between clients and the Ceph object store. By avoiding such a +centralized interface, Ceph avoids the bottlenecks that attend such centralized +interfaces. However, this means that clients must interact directly with OSDs. +Direct interactions between Ceph clients and OSDs require authenticated +connections. The ``cephx`` authentication system establishes and sustains these +authenticated connections. + +The ``cephx`` protocol operates in a manner similar to `Kerberos`_. + +A user invokes a Ceph client to contact a monitor. Unlike Kerberos, each +monitor can authenticate users and distribute keys, which means that there is +no single point of failure and no bottleneck when using ``cephx``. The monitor +returns an authentication data structure that is similar to a Kerberos ticket. +This authentication data structure contains a session key for use in obtaining +Ceph services. The session key is itself encrypted with the user's permanent +secret key, which means that only the user can request services from the Ceph +Monitors. The client then uses the session key to request services from the +monitors, and the monitors provide the client with a ticket that authenticates +the client against the OSDs that actually handle data. Ceph Monitors and OSDs +share a secret, which means that the clients can use the ticket provided by the +monitors to authenticate against any OSD or metadata server in the cluster. + +Like Kerberos tickets, ``cephx`` tickets expire. An attacker cannot use an +expired ticket or session key that has been obtained surreptitiously. This form +of authentication prevents attackers who have access to the communications +medium from creating bogus messages under another user's identity and prevents +attackers from altering another user's legitimate messages, as long as the +user's secret key is not divulged before it expires. + +An administrator must set up users before using ``cephx``. In the following +diagram, the ``client.admin`` user invokes ``ceph auth get-or-create-key`` from +the command line to generate a username and secret key. Ceph's ``auth`` +subsystem generates the username and key, stores a copy on the monitor(s), and +transmits the user's secret back to the ``client.admin`` user. This means that +the client and the monitor share a secret key. + +.. note:: The ``client.admin`` user must provide the user ID and + secret key to the user in a secure manner. + +.. ditaa:: + + +---------+ +---------+ + | Client | | Monitor | + +---------+ +---------+ + | request to | + | create a user | + |-------------->|----------+ create user + | | | and + |<--------------|<---------+ store key + | transmit key | + | | + +Here is how a client authenticates with a monitor. The client passes the user +name to the monitor. 
The monitor generates a session key that is encrypted with +the secret key associated with the ``username``. The monitor transmits the +encrypted ticket to the client. The client uses the shared secret key to +decrypt the payload. The session key identifies the user, and this act of +identification will last for the duration of the session. The client requests +a ticket for the user, and the ticket is signed with the session key. The +monitor generates a ticket and uses the user's secret key to encrypt it. The +encrypted ticket is transmitted to the client. The client decrypts the ticket +and uses it to sign requests to OSDs and to metadata servers in the cluster. + +.. ditaa:: + + +---------+ +---------+ + | Client | | Monitor | + +---------+ +---------+ + | authenticate | + |-------------->|----------+ generate and + | | | encrypt + |<--------------|<---------+ session key + | transmit | + | encrypted | + | session key | + | | + |-----+ decrypt | + | | session | + |<----+ key | + | | + | req. ticket | + |-------------->|----------+ generate and + | | | encrypt + |<--------------|<---------+ ticket + | recv. ticket | + | | + |-----+ decrypt | + | | ticket | + |<----+ | + + +The ``cephx`` protocol authenticates ongoing communications between the clients +and Ceph daemons. After initial authentication, each message sent between a +client and a daemon is signed using a ticket that can be verified by monitors, +OSDs, and metadata daemons. This ticket is verified by using the secret shared +between the client and the daemon. + +.. ditaa:: + + +---------+ +---------+ +-------+ +-------+ + | Client | | Monitor | | MDS | | OSD | + +---------+ +---------+ +-------+ +-------+ + | request to | | | + | create a user | | | + |-------------->| mon and | | + |<--------------| client share | | + | receive | a secret. | | + | shared secret | | | + | |<------------>| | + | |<-------------+------------>| + | | mon, mds, | | + | authenticate | and osd | | + |-------------->| share | | + |<--------------| a secret | | + | session key | | | + | | | | + | req. ticket | | | + |-------------->| | | + |<--------------| | | + | recv. ticket | | | + | | | | + | make request (CephFS only) | | + |----------------------------->| | + |<-----------------------------| | + | receive response (CephFS only) | + | | + | make request | + |------------------------------------------->| + |<-------------------------------------------| + receive response + +This authentication protects only the connections between Ceph clients and Ceph +daemons. The authentication is not extended beyond the Ceph client. If a user +accesses the Ceph client from a remote host, cephx authentication will not be +applied to the connection between the user's host and the client host. + +See `Cephx Config Guide`_ for more on configuration details. + +See `User Management`_ for more on user management. + +See :ref:`A Detailed Description of the Cephx Authentication Protocol +` for more on the distinction between authorization and +authentication and for a step-by-step explanation of the setup of ``cephx`` +tickets and session keys. + +.. index:: architecture; smart daemons and scalability + +Smart Daemons Enable Hyperscale +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +A feature of many storage clusters is a centralized interface that keeps track +of the nodes that clients are permitted to access. Such centralized +architectures provide services to clients by means of a double dispatch. 
At the +petabyte-to-exabyte scale, such double dispatches are a significant +bottleneck. + +Ceph obviates this bottleneck: Ceph's OSD Daemons AND Ceph clients are +cluster-aware. Like Ceph clients, each Ceph OSD Daemon is aware of other Ceph +OSD Daemons in the cluster. This enables Ceph OSD Daemons to interact directly +with other Ceph OSD Daemons and to interact directly with Ceph Monitors. Being +cluster-aware makes it possible for Ceph clients to interact directly with Ceph +OSD Daemons. + +Because Ceph clients, Ceph monitors, and Ceph OSD daemons interact with one +another directly, Ceph OSD daemons can make use of the aggregate CPU and RAM +resources of the nodes in the Ceph cluster. This means that a Ceph cluster can +easily perform tasks that a cluster with a centralized interface would struggle +to perform. The ability of Ceph nodes to make use of the computing power of +the greater cluster provides several benefits: + +#. **OSDs Service Clients Directly:** Network devices can support only a + limited number of concurrent connections. Because Ceph clients contact + Ceph OSD daemons directly without first connecting to a central interface, + Ceph enjoys improved perfomance and increased system capacity relative to + storage redundancy strategies that include a central interface. Ceph clients + maintain sessions only when needed, and maintain those sessions with only + particular Ceph OSD daemons, not with a centralized interface. + +#. **OSD Membership and Status**: When Ceph OSD Daemons join a cluster, they + report their status. At the lowest level, the Ceph OSD Daemon status is + ``up`` or ``down``: this reflects whether the Ceph OSD daemon is running and + able to service Ceph Client requests. If a Ceph OSD Daemon is ``down`` and + ``in`` the Ceph Storage Cluster, this status may indicate the failure of the + Ceph OSD Daemon. If a Ceph OSD Daemon is not running because it has crashed, + the Ceph OSD Daemon cannot notify the Ceph Monitor that it is ``down``. The + OSDs periodically send messages to the Ceph Monitor (in releases prior to + Luminous, this was done by means of ``MPGStats``, and beginning with the + Luminous release, this has been done with ``MOSDBeacon``). If the Ceph + Monitors receive no such message after a configurable period of time, + then they mark the OSD ``down``. This mechanism is a failsafe, however. + Normally, Ceph OSD Daemons determine if a neighboring OSD is ``down`` and + report it to the Ceph Monitors. This contributes to making Ceph Monitors + lightweight processes. See `Monitoring OSDs`_ and `Heartbeats`_ for + additional details. + +#. **Data Scrubbing:** To maintain data consistency, Ceph OSD Daemons scrub + RADOS objects. Ceph OSD Daemons compare the metadata of their own local + objects against the metadata of the replicas of those objects, which are + stored on other OSDs. Scrubbing occurs on a per-Placement-Group basis, finds + mismatches in object size and finds metadata mismatches, and is usually + performed daily. Ceph OSD Daemons perform deeper scrubbing by comparing the + data in objects, bit-for-bit, against their checksums. Deep scrubbing finds + bad sectors on drives that are not detectable with light scrubs. See `Data + Scrubbing`_ for details on configuring scrubbing. + +#. **Replication:** Data replication involves a collaboration between Ceph + Clients and Ceph OSD Daemons. Ceph OSD Daemons use the CRUSH algorithm to + determine the storage location of object replicas. 
Ceph clients use the + CRUSH algorithm to determine the storage location of an object, then the + object is mapped to a pool and to a placement group, and then the client + consults the CRUSH map to identify the placement group's primary OSD. + + After identifying the target placement group, the client writes the object + to the identified placement group's primary OSD. The primary OSD then + consults its own copy of the CRUSH map to identify secondary and tertiary + OSDS, replicates the object to the placement groups in those secondary and + tertiary OSDs, confirms that the object was stored successfully in the + secondary and tertiary OSDs, and reports to the client that the object + was stored successfully. + +.. ditaa:: + + +----------+ + | Client | + | | + +----------+ + * ^ + Write (1) | | Ack (6) + | | + v * + +-------------+ + | Primary OSD | + | | + +-------------+ + * ^ ^ * + Write (2) | | | | Write (3) + +------+ | | +------+ + | +------+ +------+ | + | | Ack (4) Ack (5)| | + v * * v + +---------------+ +---------------+ + | Secondary OSD | | Tertiary OSD | + | | | | + +---------------+ +---------------+ + +By performing this act of data replication, Ceph OSD Daemons relieve Ceph +clients of the burden of replicating data. + +Dynamic Cluster Management +-------------------------- + +In the `Scalability and High Availability`_ section, we explained how Ceph uses +CRUSH, cluster topology, and intelligent daemons to scale and maintain high +availability. Key to Ceph's design is the autonomous, self-healing, and +intelligent Ceph OSD Daemon. Let's take a deeper look at how CRUSH works to +enable modern cloud storage infrastructures to place data, rebalance the +cluster, and adaptively place and balance data and recover from faults. + +.. index:: architecture; pools + +About Pools +~~~~~~~~~~~ + +The Ceph storage system supports the notion of 'Pools', which are logical +partitions for storing objects. + +Ceph Clients retrieve a `Cluster Map`_ from a Ceph Monitor, and write RADOS +objects to pools. The way that Ceph places the data in the pools is determined +by the pool's ``size`` or number of replicas, the CRUSH rule, and the number of +placement groups in the pool. + +.. ditaa:: + + +--------+ Retrieves +---------------+ + | Client |------------>| Cluster Map | + +--------+ +---------------+ + | + v Writes + /-----\ + | obj | + \-----/ + | To + v + +--------+ +---------------+ + | Pool |---------->| CRUSH Rule | + +--------+ Selects +---------------+ + + +Pools set at least the following parameters: + +- Ownership/Access to Objects +- The Number of Placement Groups, and +- The CRUSH Rule to Use. + +See `Set Pool Values`_ for details. + + +.. index: architecture; placement group mapping + +Mapping PGs to OSDs +~~~~~~~~~~~~~~~~~~~ + +Each pool has a number of placement groups (PGs) within it. CRUSH dynamically +maps PGs to OSDs. When a Ceph Client stores objects, CRUSH maps each RADOS +object to a PG. + +This mapping of RADOS objects to PGs implements an abstraction and indirection +layer between Ceph OSD Daemons and Ceph Clients. The Ceph Storage Cluster must +be able to grow (or shrink) and redistribute data adaptively when the internal +topology changes. + +If the Ceph Client "knew" which Ceph OSD Daemons were storing which objects, a +tight coupling would exist between the Ceph Client and the Ceph OSD Daemon. +But Ceph avoids any such tight coupling. 
Instead, the CRUSH algorithm maps each +RADOS object to a placement group and then maps each placement group to one or +more Ceph OSD Daemons. This "layer of indirection" allows Ceph to rebalance +dynamically when new Ceph OSD Daemons and their underlying OSD devices come +online. The following diagram shows how the CRUSH algorithm maps objects to +placement groups, and how it maps placement groups to OSDs. + +.. ditaa:: + + /-----\ /-----\ /-----\ /-----\ /-----\ + | obj | | obj | | obj | | obj | | obj | + \-----/ \-----/ \-----/ \-----/ \-----/ + | | | | | + +--------+--------+ +---+----+ + | | + v v + +-----------------------+ +-----------------------+ + | Placement Group #1 | | Placement Group #2 | + | | | | + +-----------------------+ +-----------------------+ + | | + | +-----------------------+---+ + +------+------+-------------+ | + | | | | + v v v v + /----------\ /----------\ /----------\ /----------\ + | | | | | | | | + | OSD #1 | | OSD #2 | | OSD #3 | | OSD #4 | + | | | | | | | | + \----------/ \----------/ \----------/ \----------/ + +The client uses its copy of the cluster map and the CRUSH algorithm to compute +precisely which OSD it will use when reading or writing a particular object. + +.. index:: architecture; calculating PG IDs + +Calculating PG IDs +~~~~~~~~~~~~~~~~~~ + +When a Ceph Client binds to a Ceph Monitor, it retrieves the latest version of +the `Cluster Map`_. When a client has been equipped with a copy of the cluster +map, it is aware of all the monitors, OSDs, and metadata servers in the +cluster. **However, even equipped with a copy of the latest version of the +cluster map, the client doesn't know anything about object locations.** + +**Object locations must be computed.** + +The client requies only the object ID and the name of the pool in order to +compute the object location. + +Ceph stores data in named pools (for example, "liverpool"). When a client +stores a named object (for example, "john", "paul", "george", or "ringo") it +calculates a placement group by using the object name, a hash code, the number +of PGs in the pool, and the pool name. Ceph clients use the following steps to +compute PG IDs. + +#. The client inputs the pool name and the object ID. (for example: pool = + "liverpool" and object-id = "john") +#. Ceph hashes the object ID. +#. Ceph calculates the hash, modulo the number of PGs (for example: ``58``), to + get a PG ID. +#. Ceph uses the pool name to retrieve the pool ID: (for example: "liverpool" = + ``4``) +#. Ceph prepends the pool ID to the PG ID (for example: ``4.58``). + +It is much faster to compute object locations than to perform object location +query over a chatty session. The :abbr:`CRUSH (Controlled Replication Under +Scalable Hashing)` algorithm allows a client to compute where objects are +expected to be stored, and enables the client to contact the primary OSD to +store or retrieve the objects. + +.. index:: architecture; PG Peering + +Peering and Sets +~~~~~~~~~~~~~~~~ + +In previous sections, we noted that Ceph OSD Daemons check each other's +heartbeats and report back to Ceph Monitors. Ceph OSD daemons also 'peer', +which is the process of bringing all of the OSDs that store a Placement Group +(PG) into agreement about the state of all of the RADOS objects (and their +metadata) in that PG. Ceph OSD Daemons `Report Peering Failure`_ to the Ceph +Monitors. Peering issues usually resolve themselves; however, if the problem +persists, you may need to refer to the `Troubleshooting Peering Failure`_ +section. + +.. 
Note:: PGs that agree on the state of the cluster do not necessarily have + the current data yet. + +The Ceph Storage Cluster was designed to store at least two copies of an object +(that is, ``size = 2``), which is the minimum requirement for data safety. For +high availability, a Ceph Storage Cluster should store more than two copies of +an object (that is, ``size = 3`` and ``min size = 2``) so that it can continue +to run in a ``degraded`` state while maintaining data safety. + +.. warning:: Although we say here that R2 (replication with two copies) is the + minimum requirement for data safety, R3 (replication with three copies) is + recommended. On a long enough timeline, data stored with an R2 strategy will + be lost. + +As explained in the diagram in `Smart Daemons Enable Hyperscale`_, we do not +name the Ceph OSD Daemons specifically (for example, ``osd.0``, ``osd.1``, +etc.), but rather refer to them as *Primary*, *Secondary*, and so forth. By +convention, the *Primary* is the first OSD in the *Acting Set*, and is +responsible for orchestrating the peering process for each placement group +where it acts as the *Primary*. The *Primary* is the **ONLY** OSD in a given +placement group that accepts client-initiated writes to objects. + +The set of OSDs that is responsible for a placement group is called the +*Acting Set*. The term "*Acting Set*" can refer either to the Ceph OSD Daemons +that are currently responsible for the placement group, or to the Ceph OSD +Daemons that were responsible for a particular placement group as of some +epoch. + +The Ceph OSD daemons that are part of an *Acting Set* might not always be +``up``. When an OSD in the *Acting Set* is ``up``, it is part of the *Up Set*. +The *Up Set* is an important distinction, because Ceph can remap PGs to other +Ceph OSD Daemons when an OSD fails. + +.. note:: Consider a hypothetical *Acting Set* for a PG that contains + ``osd.25``, ``osd.32`` and ``osd.61``. The first OSD (``osd.25``), is the + *Primary*. If that OSD fails, the Secondary (``osd.32``), becomes the + *Primary*, and ``osd.25`` is removed from the *Up Set*. + +.. index:: architecture; Rebalancing + +Rebalancing +~~~~~~~~~~~ + +When you add a Ceph OSD Daemon to a Ceph Storage Cluster, the cluster map gets +updated with the new OSD. Referring back to `Calculating PG IDs`_, this changes +the cluster map. Consequently, it changes object placement, because it changes +an input for the calculations. The following diagram depicts the rebalancing +process (albeit rather crudely, since it is substantially less impactful with +large clusters) where some, but not all of the PGs migrate from existing OSDs +(OSD 1, and OSD 2) to the new OSD (OSD 3). Even when rebalancing, CRUSH is +stable. Many of the placement groups remain in their original configuration, +and each OSD gets some added capacity, so there are no load spikes on the +new OSD after rebalancing is complete. + + +.. ditaa:: + + +--------+ +--------+ + Before | OSD 1 | | OSD 2 | + +--------+ +--------+ + | PG #1 | | PG #6 | + | PG #2 | | PG #7 | + | PG #3 | | PG #8 | + | PG #4 | | PG #9 | + | PG #5 | | PG #10 | + +--------+ +--------+ + + +--------+ +--------+ +--------+ + After | OSD 1 | | OSD 2 | | OSD 3 | + +--------+ +--------+ +--------+ + | PG #1 | | PG #7 | | PG #3 | + | PG #2 | | PG #8 | | PG #6 | + | PG #4 | | PG #10 | | PG #9 | + | PG #5 | | | | | + | | | | | | + +--------+ +--------+ +--------+ + + +.. 
index:: architecture; Data Scrubbing + +Data Consistency +~~~~~~~~~~~~~~~~ + +As part of maintaining data consistency and cleanliness, Ceph OSDs also scrub +objects within placement groups. That is, Ceph OSDs compare object metadata in +one placement group with its replicas in placement groups stored in other +OSDs. Scrubbing (usually performed daily) catches OSD bugs or filesystem +errors, often as a result of hardware issues. OSDs also perform deeper +scrubbing by comparing data in objects bit-for-bit. Deep scrubbing (by default +performed weekly) finds bad blocks on a drive that weren't apparent in a light +scrub. + +See `Data Scrubbing`_ for details on configuring scrubbing. + + + + + +.. index:: erasure coding + +Erasure Coding +-------------- + +An erasure coded pool stores each object as ``K+M`` chunks. It is divided into +``K`` data chunks and ``M`` coding chunks. The pool is configured to have a size +of ``K+M`` so that each chunk is stored in an OSD in the acting set. The rank of +the chunk is stored as an attribute of the object. + +For instance an erasure coded pool can be created to use five OSDs (``K+M = 5``) and +sustain the loss of two of them (``M = 2``). + +Reading and Writing Encoded Chunks +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +When the object **NYAN** containing ``ABCDEFGHI`` is written to the pool, the erasure +encoding function splits the content into three data chunks simply by dividing +the content in three: the first contains ``ABC``, the second ``DEF`` and the +last ``GHI``. The content will be padded if the content length is not a multiple +of ``K``. The function also creates two coding chunks: the fourth with ``YXY`` +and the fifth with ``QGC``. Each chunk is stored in an OSD in the acting set. +The chunks are stored in objects that have the same name (**NYAN**) but reside +on different OSDs. The order in which the chunks were created must be preserved +and is stored as an attribute of the object (``shard_t``), in addition to its +name. Chunk 1 contains ``ABC`` and is stored on **OSD5** while chunk 4 contains +``YXY`` and is stored on **OSD3**. + + +.. ditaa:: + + +-------------------+ + name | NYAN | + +-------------------+ + content | ABCDEFGHI | + +--------+----------+ + | + | + v + +------+------+ + +---------------+ encode(3,2) +-----------+ + | +--+--+---+---+ | + | | | | | + | +-------+ | +-----+ | + | | | | | + +--v---+ +--v---+ +--v---+ +--v---+ +--v---+ + name | NYAN | | NYAN | | NYAN | | NYAN | | NYAN | + +------+ +------+ +------+ +------+ +------+ + shard | 1 | | 2 | | 3 | | 4 | | 5 | + +------+ +------+ +------+ +------+ +------+ + content | ABC | | DEF | | GHI | | YXY | | QGC | + +--+---+ +--+---+ +--+---+ +--+---+ +--+---+ + | | | | | + | | v | | + | | +--+---+ | | + | | | OSD1 | | | + | | +------+ | | + | | | | + | | +------+ | | + | +------>| OSD2 | | | + | +------+ | | + | | | + | +------+ | | + | | OSD3 |<----+ | + | +------+ | + | | + | +------+ | + | | OSD4 |<--------------+ + | +------+ + | + | +------+ + +----------------->| OSD5 | + +------+ + + +When the object **NYAN** is read from the erasure coded pool, the decoding +function reads three chunks: chunk 1 containing ``ABC``, chunk 3 containing +``GHI`` and chunk 4 containing ``YXY``. Then, it rebuilds the original content +of the object ``ABCDEFGHI``. The decoding function is informed that the chunks 2 +and 5 are missing (they are called 'erasures'). The chunk 5 could not be read +because the **OSD4** is out. 
The decoding function can be called as soon as +three chunks are read: **OSD2** was the slowest and its chunk was not taken into +account. + +.. ditaa:: + + +-------------------+ + name | NYAN | + +-------------------+ + content | ABCDEFGHI | + +---------+---------+ + ^ + | + | + +-------+-------+ + | decode(3,2) | + +------------->+ erasures 2,5 +<-+ + | | | | + | +-------+-------+ | + | ^ | + | | | + | | | + +--+---+ +------+ +---+--+ +---+--+ + name | NYAN | | NYAN | | NYAN | | NYAN | + +------+ +------+ +------+ +------+ + shard | 1 | | 2 | | 3 | | 4 | + +------+ +------+ +------+ +------+ + content | ABC | | DEF | | GHI | | YXY | + +--+---+ +--+---+ +--+---+ +--+---+ + ^ . ^ ^ + | TOO . | | + | SLOW . +--+---+ | + | ^ | OSD1 | | + | | +------+ | + | | | + | | +------+ | + | +-------| OSD2 | | + | +------+ | + | | + | +------+ | + | | OSD3 |------+ + | +------+ + | + | +------+ + | | OSD4 | OUT + | +------+ + | + | +------+ + +------------------| OSD5 | + +------+ + + +Interrupted Full Writes +~~~~~~~~~~~~~~~~~~~~~~~ + +In an erasure coded pool, the primary OSD in the up set receives all write +operations. It is responsible for encoding the payload into ``K+M`` chunks and +sends them to the other OSDs. It is also responsible for maintaining an +authoritative version of the placement group logs. + +In the following diagram, an erasure coded placement group has been created with +``K = 2, M = 1`` and is supported by three OSDs, two for ``K`` and one for +``M``. The acting set of the placement group is made of **OSD 1**, **OSD 2** and +**OSD 3**. An object has been encoded and stored in the OSDs : the chunk +``D1v1`` (i.e. Data chunk number 1, version 1) is on **OSD 1**, ``D2v1`` on +**OSD 2** and ``C1v1`` (i.e. Coding chunk number 1, version 1) on **OSD 3**. The +placement group logs on each OSD are identical (i.e. ``1,1`` for epoch 1, +version 1). + + +.. ditaa:: + + Primary OSD + + +-------------+ + | OSD 1 | +-------------+ + | log | Write Full | | + | +----+ |<------------+ Ceph Client | + | |D1v1| 1,1 | v1 | | + | +----+ | +-------------+ + +------+------+ + | + | + | +-------------+ + | | OSD 2 | + | | log | + +--------->+ +----+ | + | | |D2v1| 1,1 | + | | +----+ | + | +-------------+ + | + | +-------------+ + | | OSD 3 | + | | log | + +--------->| +----+ | + | |C1v1| 1,1 | + | +----+ | + +-------------+ + +**OSD 1** is the primary and receives a **WRITE FULL** from a client, which +means the payload is to replace the object entirely instead of overwriting a +portion of it. Version 2 (v2) of the object is created to override version 1 +(v1). **OSD 1** encodes the payload into three chunks: ``D1v2`` (i.e. Data +chunk number 1 version 2) will be on **OSD 1**, ``D2v2`` on **OSD 2** and +``C1v2`` (i.e. Coding chunk number 1 version 2) on **OSD 3**. Each chunk is sent +to the target OSD, including the primary OSD which is responsible for storing +chunks in addition to handling write operations and maintaining an authoritative +version of the placement group logs. When an OSD receives the message +instructing it to write the chunk, it also creates a new entry in the placement +group logs to reflect the change. For instance, as soon as **OSD 3** stores +``C1v2``, it adds the entry ``1,2`` ( i.e. epoch 1, version 2 ) to its logs. +Because the OSDs work asynchronously, some chunks may still be in flight ( such +as ``D2v2`` ) while others are acknowledged and persisted to storage drives +(such as ``C1v1`` and ``D1v1``). + +.. 
ditaa:: + + Primary OSD + + +-------------+ + | OSD 1 | + | log | + | +----+ | +-------------+ + | |D1v2| 1,2 | Write Full | | + | +----+ +<------------+ Ceph Client | + | | v2 | | + | +----+ | +-------------+ + | |D1v1| 1,1 | + | +----+ | + +------+------+ + | + | + | +------+------+ + | | OSD 2 | + | +------+ | log | + +->| D2v2 | | +----+ | + | +------+ | |D2v1| 1,1 | + | | +----+ | + | +-------------+ + | + | +-------------+ + | | OSD 3 | + | | log | + | | +----+ | + | | |C1v2| 1,2 | + +---------->+ +----+ | + | | + | +----+ | + | |C1v1| 1,1 | + | +----+ | + +-------------+ + + +If all goes well, the chunks are acknowledged on each OSD in the acting set and +the logs' ``last_complete`` pointer can move from ``1,1`` to ``1,2``. + +.. ditaa:: + + Primary OSD + + +-------------+ + | OSD 1 | + | log | + | +----+ | +-------------+ + | |D1v2| 1,2 | Write Full | | + | +----+ +<------------+ Ceph Client | + | | v2 | | + | +----+ | +-------------+ + | |D1v1| 1,1 | + | +----+ | + +------+------+ + | + | +-------------+ + | | OSD 2 | + | | log | + | | +----+ | + | | |D2v2| 1,2 | + +---------->+ +----+ | + | | | + | | +----+ | + | | |D2v1| 1,1 | + | | +----+ | + | +-------------+ + | + | +-------------+ + | | OSD 3 | + | | log | + | | +----+ | + | | |C1v2| 1,2 | + +---------->+ +----+ | + | | + | +----+ | + | |C1v1| 1,1 | + | +----+ | + +-------------+ + + +Finally, the files used to store the chunks of the previous version of the +object can be removed: ``D1v1`` on **OSD 1**, ``D2v1`` on **OSD 2** and ``C1v1`` +on **OSD 3**. + +.. ditaa:: + + Primary OSD + + +-------------+ + | OSD 1 | + | log | + | +----+ | + | |D1v2| 1,2 | + | +----+ | + +------+------+ + | + | + | +-------------+ + | | OSD 2 | + | | log | + +--------->+ +----+ | + | | |D2v2| 1,2 | + | | +----+ | + | +-------------+ + | + | +-------------+ + | | OSD 3 | + | | log | + +--------->| +----+ | + | |C1v2| 1,2 | + | +----+ | + +-------------+ + + +But accidents happen. If **OSD 1** goes down while ``D2v2`` is still in flight, +the object's version 2 is partially written: **OSD 3** has one chunk but that is +not enough to recover. It lost two chunks: ``D1v2`` and ``D2v2`` and the +erasure coding parameters ``K = 2``, ``M = 1`` require that at least two chunks are +available to rebuild the third. **OSD 4** becomes the new primary and finds that +the ``last_complete`` log entry (i.e., all objects before this entry were known +to be available on all OSDs in the previous acting set ) is ``1,1`` and that +will be the head of the new authoritative log. + +.. ditaa:: + + +-------------+ + | OSD 1 | + | (down) | + | c333 | + +------+------+ + | + | +-------------+ + | | OSD 2 | + | | log | + | | +----+ | + +---------->+ |D2v1| 1,1 | + | | +----+ | + | | | + | +-------------+ + | + | +-------------+ + | | OSD 3 | + | | log | + | | +----+ | + | | |C1v2| 1,2 | + +---------->+ +----+ | + | | + | +----+ | + | |C1v1| 1,1 | + | +----+ | + +-------------+ + Primary OSD + +-------------+ + | OSD 4 | + | log | + | | + | 1,1 | + | | + +------+------+ + + + +The log entry 1,2 found on **OSD 3** is divergent from the new authoritative log +provided by **OSD 4**: it is discarded and the file containing the ``C1v2`` +chunk is removed. The ``D1v1`` chunk is rebuilt with the ``decode`` function of +the erasure coding library during scrubbing and stored on the new primary +**OSD 4**. + + +.. 
ditaa:: + + Primary OSD + + +-------------+ + | OSD 4 | + | log | + | +----+ | + | |D1v1| 1,1 | + | +----+ | + +------+------+ + ^ + | + | +-------------+ + | | OSD 2 | + | | log | + +----------+ +----+ | + | | |D2v1| 1,1 | + | | +----+ | + | +-------------+ + | + | +-------------+ + | | OSD 3 | + | | log | + +----------| +----+ | + | |C1v1| 1,1 | + | +----+ | + +-------------+ + + +-------------+ + | OSD 1 | + | (down) | + | c333 | + +-------------+ + +See `Erasure Code Notes`_ for additional details. + + + +Cache Tiering +------------- + +.. note:: Cache tiering is deprecated in Reef. + +A cache tier provides Ceph Clients with better I/O performance for a subset of +the data stored in a backing storage tier. Cache tiering involves creating a +pool of relatively fast/expensive storage devices (e.g., solid state drives) +configured to act as a cache tier, and a backing pool of either erasure-coded +or relatively slower/cheaper devices configured to act as an economical storage +tier. The Ceph objecter handles where to place the objects and the tiering +agent determines when to flush objects from the cache to the backing storage +tier. So the cache tier and the backing storage tier are completely transparent +to Ceph clients. + + +.. ditaa:: + + +-------------+ + | Ceph Client | + +------+------+ + ^ + Tiering is | + Transparent | Faster I/O + to Ceph | +---------------+ + Client Ops | | | + | +----->+ Cache Tier | + | | | | + | | +-----+---+-----+ + | | | ^ + v v | | Active Data in Cache Tier + +------+----+--+ | | + | Objecter | | | + +-----------+--+ | | + ^ | | Inactive Data in Storage Tier + | v | + | +-----+---+-----+ + | | | + +----->| Storage Tier | + | | + +---------------+ + Slower I/O + +See `Cache Tiering`_ for additional details. Note that Cache Tiers can be +tricky and their use is now discouraged. + + +.. index:: Extensibility, Ceph Classes + +Extending Ceph +-------------- + +You can extend Ceph by creating shared object classes called 'Ceph Classes'. +Ceph loads ``.so`` classes stored in the ``osd class dir`` directory dynamically +(i.e., ``$libdir/rados-classes`` by default). When you implement a class, you +can create new object methods that have the ability to call the native methods +in the Ceph Object Store, or other class methods you incorporate via libraries +or create yourself. + +On writes, Ceph Classes can call native or class methods, perform any series of +operations on the inbound data and generate a resulting write transaction that +Ceph will apply atomically. + +On reads, Ceph Classes can call native or class methods, perform any series of +operations on the outbound data and return the data to the client. + +.. topic:: Ceph Class Example + + A Ceph class for a content management system that presents pictures of a + particular size and aspect ratio could take an inbound bitmap image, crop it + to a particular aspect ratio, resize it and embed an invisible copyright or + watermark to help protect the intellectual property; then, save the + resulting bitmap image to the object store. + +See ``src/objclass/objclass.h``, ``src/fooclass.cc`` and ``src/barclass`` for +exemplary implementations. + + +Summary +------- + +Ceph Storage Clusters are dynamic--like a living organism. Whereas, many storage +appliances do not fully utilize the CPU and RAM of a typical commodity server, +Ceph does. 
From heartbeats, to peering, to rebalancing the cluster or +recovering from faults, Ceph offloads work from clients (and from a centralized +gateway which doesn't exist in the Ceph architecture) and uses the computing +power of the OSDs to perform the work. When referring to `Hardware +Recommendations`_ and the `Network Config Reference`_, be cognizant of the +foregoing concepts to understand how Ceph utilizes computing resources. + +.. index:: Ceph Protocol, librados + +Ceph Protocol +============= + +Ceph Clients use the native protocol for interacting with the Ceph Storage +Cluster. Ceph packages this functionality into the ``librados`` library so that +you can create your own custom Ceph Clients. The following diagram depicts the +basic architecture. + +.. ditaa:: + + +---------------------------------+ + | Ceph Storage Cluster Protocol | + | (librados) | + +---------------------------------+ + +---------------+ +---------------+ + | OSDs | | Monitors | + +---------------+ +---------------+ + + +Native Protocol and ``librados`` +-------------------------------- + +Modern applications need a simple object storage interface with asynchronous +communication capability. The Ceph Storage Cluster provides a simple object +storage interface with asynchronous communication capability. The interface +provides direct, parallel access to objects throughout the cluster. + + +- Pool Operations +- Snapshots and Copy-on-write Cloning +- Read/Write Objects + - Create or Remove + - Entire Object or Byte Range + - Append or Truncate +- Create/Set/Get/Remove XATTRs +- Create/Set/Get/Remove Key/Value Pairs +- Compound operations and dual-ack semantics +- Object Classes + + +.. index:: architecture; watch/notify + +Object Watch/Notify +------------------- + +A client can register a persistent interest with an object and keep a session to +the primary OSD open. The client can send a notification message and a payload to +all watchers and receive notification when the watchers receive the +notification. This enables a client to use any object as a +synchronization/communication channel. + + +.. ditaa:: + + +----------+ +----------+ +----------+ +---------------+ + | Client 1 | | Client 2 | | Client 3 | | OSD:Object ID | + +----------+ +----------+ +----------+ +---------------+ + | | | | + | | | | + | | Watch Object | | + |--------------------------------------------------->| + | | | | + |<---------------------------------------------------| + | | Ack/Commit | | + | | | | + | | Watch Object | | + | |---------------------------------->| + | | | | + | |<----------------------------------| + | | Ack/Commit | | + | | | Watch Object | + | | |----------------->| + | | | | + | | |<-----------------| + | | | Ack/Commit | + | | Notify | | + |--------------------------------------------------->| + | | | | + |<---------------------------------------------------| + | | Notify | | + | | | | + | |<----------------------------------| + | | Notify | | + | | |<-----------------| + | | | Notify | + | | Ack | | + |----------------+---------------------------------->| + | | | | + | | Ack | | + | +---------------------------------->| + | | | | + | | | Ack | + | | |----------------->| + | | | | + |<---------------+----------------+------------------| + | Complete + +.. index:: architecture; Striping + +Data Striping +------------- + +Storage devices have throughput limitations, which impact performance and +scalability. 
So storage systems often support `striping`_--storing sequential +pieces of information across multiple storage devices--to increase throughput +and performance. The most common form of data striping comes from `RAID`_. +The RAID type most similar to Ceph's striping is `RAID 0`_, or a 'striped +volume'. Ceph's striping offers the throughput of RAID 0 striping, the +reliability of n-way RAID mirroring and faster recovery. + +Ceph provides three types of clients: Ceph Block Device, Ceph File System, and +Ceph Object Storage. A Ceph Client converts its data from the representation +format it provides to its users (a block device image, RESTful objects, CephFS +filesystem directories) into objects for storage in the Ceph Storage Cluster. + +.. tip:: The objects Ceph stores in the Ceph Storage Cluster are not striped. + Ceph Object Storage, Ceph Block Device, and the Ceph File System stripe their + data over multiple Ceph Storage Cluster objects. Ceph Clients that write + directly to the Ceph Storage Cluster via ``librados`` must perform the + striping (and parallel I/O) for themselves to obtain these benefits. + +The simplest Ceph striping format involves a stripe count of 1 object. Ceph +Clients write stripe units to a Ceph Storage Cluster object until the object is +at its maximum capacity, and then create another object for additional stripes +of data. The simplest form of striping may be sufficient for small block device +images, S3 or Swift objects and CephFS files. However, this simple form doesn't +take maximum advantage of Ceph's ability to distribute data across placement +groups, and consequently doesn't improve performance very much. The following +diagram depicts the simplest form of striping: + +.. ditaa:: + + +---------------+ + | Client Data | + | Format | + | cCCC | + +---------------+ + | + +--------+-------+ + | | + v v + /-----------\ /-----------\ + | Begin cCCC| | Begin cCCC| + | Object 0 | | Object 1 | + +-----------+ +-----------+ + | stripe | | stripe | + | unit 1 | | unit 5 | + +-----------+ +-----------+ + | stripe | | stripe | + | unit 2 | | unit 6 | + +-----------+ +-----------+ + | stripe | | stripe | + | unit 3 | | unit 7 | + +-----------+ +-----------+ + | stripe | | stripe | + | unit 4 | | unit 8 | + +-----------+ +-----------+ + | End cCCC | | End cCCC | + | Object 0 | | Object 1 | + \-----------/ \-----------/ + + +If you anticipate large images sizes, large S3 or Swift objects (e.g., video), +or large CephFS directories, you may see considerable read/write performance +improvements by striping client data over multiple objects within an object set. +Significant write performance occurs when the client writes the stripe units to +their corresponding objects in parallel. Since objects get mapped to different +placement groups and further mapped to different OSDs, each write occurs in +parallel at the maximum write speed. A write to a single drive would be limited +by the head movement (e.g. 6ms per seek) and bandwidth of that one device (e.g. +100MB/s). By spreading that write over multiple objects (which map to different +placement groups and OSDs) Ceph can reduce the number of seeks per drive and +combine the throughput of multiple drives to achieve much faster write (or read) +speeds. + +.. note:: Striping is independent of object replicas. Since CRUSH + replicates objects across OSDs, stripes get replicated automatically. 
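+
+The arithmetic behind this striping scheme can be sketched in a few lines.
+The following example is only an illustration, not the actual ``librados``
+or ``libcephfs`` code path, and its object size, stripe unit, and stripe
+count values are assumptions chosen for demonstration. It maps a logical
+byte offset to an object within an object set and to an offset inside that
+object::
+
+   def locate(byte_offset, object_size=4 * 2**20,
+              stripe_unit=64 * 2**10, stripe_count=4):
+       """Map a logical byte offset to (object index, offset in object)."""
+       units_per_object = object_size // stripe_unit
+
+       unit_no = byte_offset // stripe_unit        # which stripe unit overall
+       offset_in_unit = byte_offset % stripe_unit
+
+       stripe_no = unit_no // stripe_count         # which stripe (row) overall
+       object_in_set = unit_no % stripe_count      # which object in the set
+       object_set = stripe_no // units_per_object
+       unit_in_object = stripe_no % units_per_object
+
+       object_index = object_set * stripe_count + object_in_set
+       object_offset = unit_in_object * stripe_unit + offset_in_unit
+       return object_index, object_offset
+
+   # With 4 MB objects, 64 KB stripe units, and a stripe count of 4, the
+   # first four stripe units land in objects 0-3; the fifth wraps back to
+   # object 0 at an offset of one stripe unit.
+   for offset in (0, 64 * 2**10, 3 * 64 * 2**10, 4 * 64 * 2**10):
+       print(offset, locate(offset))
+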
+ +In the following diagram, client data gets striped across an object set +(``object set 1`` in the following diagram) consisting of 4 objects, where the +first stripe unit is ``stripe unit 0`` in ``object 0``, and the fourth stripe +unit is ``stripe unit 3`` in ``object 3``. After writing the fourth stripe, the +client determines if the object set is full. If the object set is not full, the +client begins writing a stripe to the first object again (``object 0`` in the +following diagram). If the object set is full, the client creates a new object +set (``object set 2`` in the following diagram), and begins writing to the first +stripe (``stripe unit 16``) in the first object in the new object set (``object +4`` in the diagram below). + +.. ditaa:: + + +---------------+ + | Client Data | + | Format | + | cCCC | + +---------------+ + | + +-----------------+--------+--------+-----------------+ + | | | | +--\ + v v v v | + /-----------\ /-----------\ /-----------\ /-----------\ | + | Begin cCCC| | Begin cCCC| | Begin cCCC| | Begin cCCC| | + | Object 0 | | Object 1 | | Object 2 | | Object 3 | | + +-----------+ +-----------+ +-----------+ +-----------+ | + | stripe | | stripe | | stripe | | stripe | | + | unit 0 | | unit 1 | | unit 2 | | unit 3 | | + +-----------+ +-----------+ +-----------+ +-----------+ | + | stripe | | stripe | | stripe | | stripe | +-\ + | unit 4 | | unit 5 | | unit 6 | | unit 7 | | Object + +-----------+ +-----------+ +-----------+ +-----------+ +- Set + | stripe | | stripe | | stripe | | stripe | | 1 + | unit 8 | | unit 9 | | unit 10 | | unit 11 | +-/ + +-----------+ +-----------+ +-----------+ +-----------+ | + | stripe | | stripe | | stripe | | stripe | | + | unit 12 | | unit 13 | | unit 14 | | unit 15 | | + +-----------+ +-----------+ +-----------+ +-----------+ | + | End cCCC | | End cCCC | | End cCCC | | End cCCC | | + | Object 0 | | Object 1 | | Object 2 | | Object 3 | | + \-----------/ \-----------/ \-----------/ \-----------/ | + | + +--/ + + +--\ + | + /-----------\ /-----------\ /-----------\ /-----------\ | + | Begin cCCC| | Begin cCCC| | Begin cCCC| | Begin cCCC| | + | Object 4 | | Object 5 | | Object 6 | | Object 7 | | + +-----------+ +-----------+ +-----------+ +-----------+ | + | stripe | | stripe | | stripe | | stripe | | + | unit 16 | | unit 17 | | unit 18 | | unit 19 | | + +-----------+ +-----------+ +-----------+ +-----------+ | + | stripe | | stripe | | stripe | | stripe | +-\ + | unit 20 | | unit 21 | | unit 22 | | unit 23 | | Object + +-----------+ +-----------+ +-----------+ +-----------+ +- Set + | stripe | | stripe | | stripe | | stripe | | 2 + | unit 24 | | unit 25 | | unit 26 | | unit 27 | +-/ + +-----------+ +-----------+ +-----------+ +-----------+ | + | stripe | | stripe | | stripe | | stripe | | + | unit 28 | | unit 29 | | unit 30 | | unit 31 | | + +-----------+ +-----------+ +-----------+ +-----------+ | + | End cCCC | | End cCCC | | End cCCC | | End cCCC | | + | Object 4 | | Object 5 | | Object 6 | | Object 7 | | + \-----------/ \-----------/ \-----------/ \-----------/ | + | + +--/ + +Three important variables determine how Ceph stripes data: + +- **Object Size:** Objects in the Ceph Storage Cluster have a maximum + configurable size (e.g., 2MB, 4MB, etc.). The object size should be large + enough to accommodate many stripe units, and should be a multiple of + the stripe unit. + +- **Stripe Width:** Stripes have a configurable unit size (e.g., 64kb). 
+ The Ceph Client divides the data it will write to objects into equally + sized stripe units, except for the last stripe unit. A stripe width, + should be a fraction of the Object Size so that an object may contain + many stripe units. + +- **Stripe Count:** The Ceph Client writes a sequence of stripe units + over a series of objects determined by the stripe count. The series + of objects is called an object set. After the Ceph Client writes to + the last object in the object set, it returns to the first object in + the object set. + +.. important:: Test the performance of your striping configuration before + putting your cluster into production. You CANNOT change these striping + parameters after you stripe the data and write it to objects. + +Once the Ceph Client has striped data to stripe units and mapped the stripe +units to objects, Ceph's CRUSH algorithm maps the objects to placement groups, +and the placement groups to Ceph OSD Daemons before the objects are stored as +files on a storage drive. + +.. note:: Since a client writes to a single pool, all data striped into objects + get mapped to placement groups in the same pool. So they use the same CRUSH + map and the same access controls. + + +.. index:: architecture; Ceph Clients + +Ceph Clients +============ + +Ceph Clients include a number of service interfaces. These include: + +- **Block Devices:** The :term:`Ceph Block Device` (a.k.a., RBD) service + provides resizable, thin-provisioned block devices that can be snapshotted + and cloned. Ceph stripes a block device across the cluster for high + performance. Ceph supports both kernel objects (KO) and a QEMU hypervisor + that uses ``librbd`` directly--avoiding the kernel object overhead for + virtualized systems. + +- **Object Storage:** The :term:`Ceph Object Storage` (a.k.a., RGW) service + provides RESTful APIs with interfaces that are compatible with Amazon S3 + and OpenStack Swift. + +- **Filesystem**: The :term:`Ceph File System` (CephFS) service provides + a POSIX compliant filesystem usable with ``mount`` or as + a filesystem in user space (FUSE). + +Ceph can run additional instances of OSDs, MDSs, and monitors for scalability +and high availability. The following diagram depicts the high-level +architecture. + +.. ditaa:: + + +--------------+ +----------------+ +-------------+ + | Block Device | | Object Storage | | CephFS | + +--------------+ +----------------+ +-------------+ + + +--------------+ +----------------+ +-------------+ + | librbd | | librgw | | libcephfs | + +--------------+ +----------------+ +-------------+ + + +---------------------------------------------------+ + | Ceph Storage Cluster Protocol (librados) | + +---------------------------------------------------+ + + +---------------+ +---------------+ +---------------+ + | OSDs | | MDSs | | Monitors | + +---------------+ +---------------+ +---------------+ + + +.. index:: architecture; Ceph Object Storage + +Ceph Object Storage +------------------- + +The Ceph Object Storage daemon, ``radosgw``, is a FastCGI service that provides +a RESTful_ HTTP API to store objects and metadata. It layers on top of the Ceph +Storage Cluster with its own data formats, and maintains its own user database, +authentication, and access control. The RADOS Gateway uses a unified namespace, +which means you can use either the OpenStack Swift-compatible API or the Amazon +S3-compatible API. 
.. topic:: S3/Swift Objects and Storage Cluster Objects Compared

   Ceph Object Storage uses the term *object* to describe the data that it
   stores. S3 and Swift objects are not the same as the objects that Ceph
   writes to the Ceph Storage Cluster: Ceph Object Storage objects are mapped
   onto Ceph Storage Cluster objects, and an S3 or Swift object does not
   necessarily correspond 1:1 with an object stored in the storage cluster.
   It is possible for an S3 or Swift object to map to multiple Ceph objects.

See `Ceph Object Storage`_ for details.


.. index:: Ceph Block Device; block device; RBD; Rados Block Device

Ceph Block Device
-----------------

A Ceph Block Device stripes a block device image over multiple objects in the
Ceph Storage Cluster. Each object is mapped to a placement group and
distributed, and the placement groups are spread across separate ``ceph-osd``
daemons throughout the cluster.

.. important:: Striping allows RBD block devices to perform better than a
   single server could!

Thin-provisioned, snapshottable Ceph Block Devices are an attractive option
for virtualization and cloud computing. In virtual machine scenarios, people
typically deploy a Ceph Block Device with the ``rbd`` network storage driver
in QEMU/KVM, where the host machine uses ``librbd`` to provide a block device
service to the guest. Many cloud computing stacks use ``libvirt`` to integrate
with hypervisors. You can use thin-provisioned Ceph Block Devices with QEMU
and ``libvirt`` to support OpenStack, CloudStack, and other solutions.

While we do not provide ``librbd`` support with other hypervisors at this
time, you may also use Ceph Block Device kernel objects to provide a block
device to a client. Other virtualization technologies, such as Xen, can access
the Ceph Block Device kernel object(s). This is done with the command-line
tool ``rbd``.


.. index:: CephFS; Ceph File System; libcephfs; MDS; metadata server; ceph-mds

.. _arch-cephfs:

Ceph File System
----------------

The Ceph File System (CephFS) provides a POSIX-compliant filesystem as a
service that is layered on top of the object-based Ceph Storage Cluster.
CephFS files are mapped to objects that Ceph stores in the Ceph Storage
Cluster. Ceph Clients mount a CephFS filesystem as a kernel object or as
a Filesystem in User Space (FUSE).

.. ditaa::

   +-----------------------+  +------------------------+
   |  CephFS Kernel Object |  |      CephFS FUSE       |
   +-----------------------+  +------------------------+

   +---------------------------------------------------+
   |            CephFS Library (libcephfs)             |
   +---------------------------------------------------+

   +---------------------------------------------------+
   |     Ceph Storage Cluster Protocol (librados)      |
   +---------------------------------------------------+

   +---------------+ +---------------+ +---------------+
   |      OSDs     | |      MDSs     | |    Monitors   |
   +---------------+ +---------------+ +---------------+


The Ceph File System service includes the Ceph Metadata Server (MDS), which is
deployed with the Ceph Storage Cluster. The purpose of the MDS is to store all
of the filesystem metadata (directories, file ownership, access modes, etc.)
in high-availability Ceph Metadata Servers, where the metadata resides in
memory.
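
The following sketch is illustrative only: the paths are placeholders, and the
``python3-cephfs`` (libcephfs) bindings together with a reachable cluster and
a valid ``/etc/ceph/ceph.conf`` are assumptions for this example. It is meant
to show the division of labor: creating a directory and opening a file are
metadata operations handled by the MDS, while the bytes written to the file
are stored by the OSDs.

.. code-block:: python

   # Illustrative sketch, not code from the Ceph documentation: the directory
   # and file names are placeholders, and the python3-cephfs bindings plus a
   # reachable cluster are assumed.
   import cephfs

   fs = cephfs.LibCephFS(conffile="/etc/ceph/ceph.conf")
   fs.mount()                                    # attach to the CephFS filesystem
   try:
       # Creating a directory and opening a file are metadata operations;
       # they are handled by the MDS rather than by the OSDs.
       fs.mkdir("/reports", 0o755)
       fd = fs.open("/reports/2024.txt", "w", 0o644)

       # The file contents themselves are stored as RADOS objects on the OSDs.
       fs.write(fd, b"file data stored as RADOS objects", 0)
       fs.close(fd)
   finally:
       fs.unmount()
       fs.shutdown()
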
The MDS (a daemon called ``ceph-mds``) exists because simple filesystem
operations, such as listing a directory or changing into a directory (``ls``,
``cd``), would otherwise tax the Ceph OSD Daemons unnecessarily. Separating
the metadata from the data means that the Ceph File System can provide
high-performance services without taxing the Ceph Storage Cluster.

CephFS separates the metadata from the data, storing the metadata in the MDS
and storing the file data in one or more objects in the Ceph Storage Cluster.
The Ceph filesystem aims for POSIX compatibility. ``ceph-mds`` can run as a
single process, or it can be distributed across multiple physical machines,
either for high availability or for scalability.

- **High Availability**: The extra ``ceph-mds`` instances can be `standby`,
  ready to take over the duties of any failed ``ceph-mds`` that was
  `active`. This is possible because all of the data, including the journal,
  is stored on RADOS. The transition is triggered automatically by
  ``ceph-mon``.

- **Scalability**: Multiple ``ceph-mds`` instances can be `active`, and they
  will split the directory tree into subtrees (and shards of a single busy
  directory), effectively balancing the load amongst all `active` servers.

Combinations of `standby` and `active` instances are possible, for example
running three `active` ``ceph-mds`` instances for scaling and one `standby`
instance for high availability.



.. _RADOS - A Scalable, Reliable Storage Service for Petabyte-scale Storage Clusters: https://ceph.io/assets/pdfs/weil-rados-pdsw07.pdf
.. _Paxos: https://en.wikipedia.org/wiki/Paxos_(computer_science)
.. _Monitor Config Reference: ../rados/configuration/mon-config-ref
.. _Monitoring OSDs and PGs: ../rados/operations/monitoring-osd-pg
.. _Heartbeats: ../rados/configuration/mon-osd-interaction
.. _Monitoring OSDs: ../rados/operations/monitoring-osd-pg/#monitoring-osds
.. _CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data: https://ceph.io/assets/pdfs/weil-crush-sc06.pdf
.. _Data Scrubbing: ../rados/configuration/osd-config-ref#scrubbing
.. _Report Peering Failure: ../rados/configuration/mon-osd-interaction#osds-report-peering-failure
.. _Troubleshooting Peering Failure: ../rados/troubleshooting/troubleshooting-pg#placement-group-down-peering-failure
.. _Ceph Authentication and Authorization: ../rados/operations/auth-intro/
.. _Hardware Recommendations: ../start/hardware-recommendations
.. _Network Config Reference: ../rados/configuration/network-config-ref
.. _striping: https://en.wikipedia.org/wiki/Data_striping
.. _RAID: https://en.wikipedia.org/wiki/RAID
.. _RAID 0: https://en.wikipedia.org/wiki/RAID_0#RAID_0
.. _Ceph Object Storage: ../radosgw/
.. _RESTful: https://en.wikipedia.org/wiki/RESTful
.. _Erasure Code Notes: https://github.com/ceph/ceph/blob/40059e12af88267d0da67d8fd8d9cd81244d8f93/doc/dev/osd_internals/erasure_coding/developer_notes.rst
.. _Cache Tiering: ../rados/operations/cache-tiering
.. _Set Pool Values: ../rados/operations/pools#set-pool-values
.. _Kerberos: https://en.wikipedia.org/wiki/Kerberos_(protocol)
.. _Cephx Config Guide: ../rados/configuration/auth-config-ref
.. _User Management: ../rados/operations/user-management
--
cgit v1.2.3