=========================
 Monitoring OSDs and PGs
=========================

High availability and high reliability require a fault-tolerant approach to
managing hardware and software issues. Ceph has no single point of failure, and
it can service requests for data in a "degraded" mode. Ceph's `data placement`_
introduces a layer of indirection to ensure that data doesn't bind directly to
particular OSD addresses. This means that tracking down system faults requires
finding the `placement group`_ and the underlying OSDs at the root of the
problem.

.. tip:: A fault in one part of the cluster may prevent you from accessing a
   particular object, but that doesn't mean that you cannot access other
   objects. When you run into a fault, don't panic. Just follow the steps for
   monitoring your OSDs and placement groups. Then, begin troubleshooting.

Ceph is generally self-repairing. However, when problems persist, monitoring
OSDs and placement groups will help you identify the problem.


Monitoring OSDs
===============

An OSD's status is either in the cluster (``in``) or out of the cluster
(``out``), and it is either up and running (``up``) or down and not running
(``down``). If an OSD is ``up``, it may be either ``in`` the cluster (you can
read and write data) or ``out`` of the cluster. If it was ``in`` the cluster
and recently moved ``out`` of the cluster, Ceph will migrate placement groups
to other OSDs. If an OSD is ``out`` of the cluster, CRUSH will not assign
placement groups to the OSD. If an OSD is ``down``, it should also be ``out``.

.. note:: If an OSD is ``down`` and ``in``, there is a problem and the cluster
   will not be in a healthy state.

.. ditaa::

           +----------------+        +----------------+
           |                |        |                |
           |   OSD #n In    |        |   OSD #n Up    |
           |                |        |                |
           +----------------+        +----------------+
                   ^                         ^
                   |                         |
                   |                         |
                   v                         v
           +----------------+        +----------------+
           |                |        |                |
           |   OSD #n Out   |        |   OSD #n Down  |
           |                |        |                |
           +----------------+        +----------------+

If you execute a command such as ``ceph health``, ``ceph -s`` or ``ceph -w``,
you may notice that the cluster does not always echo back ``HEALTH_OK``. Don't
panic. With respect to OSDs, you should expect that the cluster will **NOT**
echo ``HEALTH_OK`` in a few expected circumstances:

#. You haven't started the cluster yet (it won't respond).
#. You have just started or restarted the cluster and it's not ready yet,
   because the placement groups are being created and the OSDs are in
   the process of peering.
#. You have just added or removed an OSD.
#. You have just modified your cluster map.

An important aspect of monitoring OSDs is to ensure that when the cluster
is up and running, all OSDs that are ``in`` the cluster are ``up`` and
running, too. To see if all OSDs are running, execute::

    ceph osd stat

The result should tell you the total number of OSDs (x), how many are ``up``
(y), how many are ``in`` (z), and the map epoch (eNNNN)::

    x osds: y up, z in; epoch: eNNNN
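Most ``ceph`` commands also accept ``--format=json``, which is convenient for
scripting. As a minimal sketch (assuming the ``jq`` utility is installed; the
exact field names and nesting of the JSON output vary by Ceph release, with
older releases nesting the counters under an ``osdmap`` key)::

    # Print the OSD counters from the JSON output of `ceph osd stat`.
    # The `.osdmap // .` fallback handles both the nested and flat layouts.
    ceph osd stat --format=json | \
        jq '.osdmap // . | {num_osds, num_up_osds, num_in_osds}'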
If the number of OSDs that are ``in`` the cluster is greater than the number of
OSDs that are ``up``, execute the following command to identify the ``ceph-osd``
daemons that are not running::

    ceph osd tree

::

    #ID CLASS WEIGHT  TYPE NAME             STATUS REWEIGHT PRI-AFF
     -1       2.00000 pool openstack
     -3       2.00000 rack dell-2950-rack-A
     -2       2.00000 host dell-2950-A1
      0   ssd 1.00000      osd.0                up  1.00000 1.00000
      1   ssd 1.00000      osd.1              down  1.00000 1.00000

.. tip:: The ability to search through a well-designed CRUSH hierarchy may help
   you troubleshoot your cluster by identifying the physical locations of
   faulty OSDs faster.

If an OSD is ``down``, start it::

    sudo systemctl start ceph-osd@1

See `OSD Not Running`_ for problems associated with OSDs that have stopped or
won't restart.
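To narrow the tree down to the daemons that need attention, recent Ceph
releases let you pass a state filter to ``ceph osd tree``; a plain ``grep``
works as a fallback on older releases (a quick sketch, not an exhaustive
triage procedure)::

    # Newer releases: show only OSDs that are currently down.
    ceph osd tree down

    # Portable fallback: filter the full tree for down entries.
    ceph osd tree | grep down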
PG Sets
=======

When CRUSH assigns placement groups to OSDs, it looks at the number of replicas
for the pool and assigns the placement group to OSDs such that each replica of
the placement group gets assigned to a different OSD. For example, if the pool
requires three replicas of a placement group, CRUSH may assign them to
``osd.1``, ``osd.2`` and ``osd.3``. CRUSH actually seeks a pseudo-random
placement that takes into account the failure domains you set in your `CRUSH
map`_, so you will rarely see placement groups assigned to nearest-neighbor
OSDs in a large cluster. We refer to the set of OSDs that should contain the
replicas of a particular placement group as the **Acting Set**. In some cases,
an OSD in the Acting Set is ``down`` or otherwise not able to service requests
for objects in the placement group. When these situations arise, don't panic.
Common examples include:

- You added or removed an OSD. Then, CRUSH reassigned the placement group to
  other OSDs, thereby changing the composition of the Acting Set and spawning
  the migration of data with a "backfill" process.
- An OSD was ``down``, was restarted, and is now ``recovering``.
- An OSD in the Acting Set is ``down`` or unable to service requests,
  and another OSD has temporarily assumed its duties.

Ceph processes a client request using the **Up Set**, which is the set of OSDs
that will actually handle the requests. In most cases, the Up Set and the
Acting Set are virtually identical. When they are not, it may indicate that
Ceph is migrating data, an OSD is recovering, or that there is a problem (in
such scenarios, Ceph usually echoes a ``HEALTH_WARN`` state with a "stuck
stale" message).

To retrieve a list of placement groups, execute::

    ceph pg dump

To view which OSDs are within the Acting Set or the Up Set for a given placement
group, execute::

    ceph pg map {pg-num}

The result should tell you the osdmap epoch (eNNN), the placement group number
({pg-num}), the OSDs in the Up Set (up[]), and the OSDs in the Acting Set
(acting[])::

    osdmap eNNN pg {raw-pg-num} ({pg-num}) -> up [0,1,2] acting [0,1,2]

.. note:: If the Up Set and Acting Set do not match, this may be an indicator
   that the cluster is rebalancing itself or of a potential problem with
   the cluster.
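To spot such mismatches across the whole cluster at once, you can compare the
``up`` and ``acting`` fields in the JSON dump. A minimal sketch (assuming
``jq``; the top-level layout of the dump varies by release, with newer
releases nesting the statistics under a ``pg_map`` key)::

    # Print the IDs of all PGs whose Up Set differs from their Acting Set.
    ceph pg dump --format=json | \
        jq -r '(.pg_map.pg_stats // .pg_stats)[]
               | select(.up != .acting) | .pgid'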
Peering
=======

Before you can write data to a placement group, it must be in an ``active``
state, and it **should** be in a ``clean`` state. For Ceph to determine the
current state of a placement group, the primary OSD of the placement group
(i.e., the first OSD in the Acting Set) peers with the secondary and tertiary
OSDs to establish agreement on the current state of the placement group
(assuming a pool with 3 replicas of the PG).


.. ditaa::

           +---------+     +---------+     +-------+
           |  OSD 1  |     |  OSD 2  |     | OSD 3 |
           +---------+     +---------+     +-------+
                |               |              |
                |  Request To   |              |
                |     Peer      |              |
                |-------------->|              |
                |<--------------|              |
                |    Peering                   |
                |                              |
                |         Request To           |
                |            Peer              |
                |----------------------------->|
                |<-----------------------------|
                |          Peering             |

The OSDs also report their status to the monitor. See `Configuring Monitor/OSD
Interaction`_ for details. To troubleshoot peering issues, see `Peering
Failure`_.


Monitoring Placement Group States
=================================

If you execute a command such as ``ceph health``, ``ceph -s`` or ``ceph -w``,
you may notice that the cluster does not always echo back ``HEALTH_OK``. After
you check to see if the OSDs are running, you should also check placement group
states. You should expect that the cluster will **NOT** echo ``HEALTH_OK`` in a
number of placement group peering-related circumstances:

#. You have just created a pool and placement groups haven't peered yet.
#. The placement groups are recovering.
#. You have just added an OSD to or removed an OSD from the cluster.
#. You have just modified your CRUSH map and your placement groups are migrating.
#. There is inconsistent data in different replicas of a placement group.
#. Ceph is scrubbing a placement group's replicas.
#. Ceph doesn't have enough storage capacity to complete backfilling operations.

If one of the foregoing circumstances causes Ceph to echo ``HEALTH_WARN``, don't
panic. In many cases, the cluster will recover on its own. In some cases, you
may need to take action. An important aspect of monitoring placement groups is
to ensure that when the cluster is up and running, all placement groups are
``active``, and preferably in the ``clean`` state. To see the status of all
placement groups, execute::

    ceph pg stat

The result should tell you the total number of placement groups (x), how many
placement groups are in a particular state such as ``active+clean`` (y), and
the amount of data stored (z)::

    x pgs: y active+clean; z bytes data, aa MB used, bb GB / cc GB avail

.. note:: It is common for Ceph to report multiple states for placement groups.

In addition to the placement group states, Ceph will also echo back the amount
of storage capacity used (aa), the amount of storage capacity remaining (bb),
and the total storage capacity (cc) of the cluster. These numbers can be
important in a few cases:

- You are reaching your ``near full ratio`` or ``full ratio``.
- Your data is not getting distributed across the cluster due to an
  error in your CRUSH configuration.


.. topic:: Placement Group IDs

   Placement group IDs consist of the pool number (not the pool name) followed
   by a period (.) and the placement group ID, a hexadecimal number. You can
   view pool numbers and their names in the output of ``ceph osd lspools``.
   For example, the first pool created corresponds to pool number ``1``. A
   fully qualified placement group ID has the following form::

       {pool-num}.{pg-id}

   And it typically looks like this::

       1.1f


To retrieve a list of placement groups, execute the following::

    ceph pg dump

You can also format the output as JSON and save it to a file::

    ceph pg dump -o {filename} --format=json

To query a particular placement group, execute the following::

    ceph pg {poolnum}.{pg-id} query

Ceph will output the query in JSON format.
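The query output is verbose; piping it through ``jq`` makes it easy to pull
out individual fields. A small sketch (assuming ``jq`` and reusing the example
placement group ID ``1.1f`` from above; the key names reflect common releases
and may differ in yours)::

    # Show just the state and the Up/Acting Sets of one placement group.
    ceph pg 1.1f query | jq '{state, up, acting}'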
The following subsections describe the common placement group states in detail.

Creating
--------

When you create a pool, Ceph will create the number of placement groups you
specified. Ceph will echo ``creating`` when it is creating one or more
placement groups. Once they are created, the OSDs that are part of a placement
group's Acting Set will peer. When peering is complete, the placement group
status should be ``active+clean``, which means a Ceph client can begin writing
to the placement group.

.. ditaa::

       /-----------\       /-----------\       /-----------\
       |  Creating |------>|  Peering  |------>|  Active   |
       \-----------/       \-----------/       \-----------/

Peering
-------

When Ceph is peering a placement group, it is bringing the OSDs that
store the replicas of the placement group into **agreement about the state**
of the objects and metadata in the placement group. When Ceph completes
peering, the OSDs that store the placement group agree about its current
state. However, completion of the peering process does **NOT** mean that each
replica has the latest contents.

.. topic:: Authoritative History

   Ceph will **NOT** acknowledge a write operation to a client until
   all OSDs of the Acting Set persist the write operation. This practice
   ensures that at least one member of the Acting Set will have a record
   of every acknowledged write operation since the last successful
   peering operation.

   With an accurate record of each acknowledged write operation, Ceph can
   construct and disseminate a new authoritative history of the placement
   group: a complete, fully ordered set of operations that, if performed,
   would bring an OSD's copy of a placement group up to date.


Active
------

Once Ceph completes the peering process, a placement group may become
``active``. The ``active`` state means that the data in the placement group is
generally available for read and write operations in the primary OSD and its
replicas.


Clean
-----

When a placement group is in the ``clean`` state, the primary OSD and the
replica OSDs have successfully peered and there are no stray replicas for the
placement group. Ceph has replicated all objects in the placement group the
correct number of times.


Degraded
--------

When a client writes an object to the primary OSD, the primary OSD is
responsible for writing the replicas to the replica OSDs. After the primary OSD
writes the object to storage, the placement group will remain in a ``degraded``
state until the primary OSD has received an acknowledgement from the replica
OSDs that Ceph created the replica objects successfully.

The reason a placement group can be ``active+degraded`` is that an OSD may be
``active`` even though it doesn't hold all of the objects yet. If an OSD goes
``down``, Ceph marks each placement group assigned to the OSD as ``degraded``.
The OSDs must peer again when the OSD comes back online. However, a client can
still write a new object to a ``degraded`` placement group if it is ``active``.

If an OSD is ``down`` and the ``degraded`` condition persists, Ceph may mark the
``down`` OSD as ``out`` of the cluster and remap the data from the ``down`` OSD
to another OSD. The time between being marked ``down`` and being marked ``out``
is controlled by ``mon osd down out interval``, which is set to ``600`` seconds
by default.

A placement group can also be ``degraded`` because Ceph cannot find one or more
objects that it expects to be in the placement group. While you cannot read or
write to unfound objects, you can still access all of the other objects in the
``degraded`` placement group.


Recovering
----------

Ceph was designed for fault-tolerance at a scale where hardware and software
problems are ongoing. When an OSD goes ``down``, its contents may fall behind
the current state of other replicas in the placement groups. When the OSD is
back ``up``, the contents of the placement groups must be updated to reflect the
current state. During that time period, the OSD may reflect a ``recovering``
state.

Recovery is not always trivial, because a hardware failure might cause a
cascading failure of multiple OSDs. For example, a network switch for a rack or
cabinet may fail, which can cause the OSDs of a number of host machines to fall
behind the current state of the cluster. Each one of the OSDs must recover once
the fault is resolved.

Ceph provides a number of settings to balance the resource contention between
new service requests and the need to recover data objects and restore the
placement groups to the current state. The ``osd recovery delay start`` setting
allows an OSD to restart, re-peer and even process some replay requests before
starting the recovery process. The ``osd recovery thread timeout`` sets a
thread timeout, because multiple OSDs may fail, restart and re-peer at
staggered rates. The ``osd recovery max active`` setting limits the number of
recovery requests an OSD will entertain simultaneously, to prevent the OSD from
failing to serve requests. The ``osd recovery max chunk`` setting limits the
size of the recovered data chunks to prevent network congestion.
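On releases with the centralized configuration database (Mimic and later),
these options can be inspected and adjusted at runtime. A minimal sketch
(option names use underscores in the configuration database; check the
documentation for your release before changing defaults)::

    # Inspect the current value for all OSDs, then raise it.
    ceph config get osd osd_recovery_max_active
    ceph config set osd osd_recovery_max_active 3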
Back Filling
------------

When a new OSD joins the cluster, CRUSH will reassign placement groups from
OSDs in the cluster to the newly added OSD. Forcing the new OSD to accept the
reassigned placement groups immediately can put excessive load on the new OSD.
Backfilling the OSD with the placement groups allows this process to begin in
the background. Once backfilling is complete, the new OSD will begin serving
requests when it is ready.

During the backfill operations, you may see one of several states:
``backfill_wait`` indicates that a backfill operation is pending, but is not
underway yet; ``backfilling`` indicates that a backfill operation is underway;
and ``backfill_toofull`` indicates that a backfill operation was requested,
but couldn't be completed due to insufficient storage capacity. When a
placement group cannot be backfilled, it may be considered ``incomplete``.

The ``backfill_toofull`` state may be transient. It is possible that as PGs
are moved around, space will become available. ``backfill_toofull`` is similar
to ``backfill_wait`` in that backfill can proceed as soon as conditions change.

Ceph provides a number of settings to manage the load spike associated with
reassigning placement groups to an OSD (especially a new OSD). By default,
``osd_max_backfills`` sets the maximum number of concurrent backfills to and
from an OSD to 1. The backfill full ratio enables an OSD to refuse a backfill
request if the OSD is approaching its full ratio (90%, by default); it can be
changed with the ``ceph osd set-backfillfull-ratio`` command. If an OSD
refuses a backfill request, the ``osd backfill retry interval`` setting
enables an OSD to retry the request (after 30 seconds, by default). OSDs can
also set ``osd backfill scan min`` and ``osd backfill scan max`` to manage
scan intervals (64 and 512, by default).


Remapped
--------

When the Acting Set that services a placement group changes, the data migrates
from the old Acting Set to the new Acting Set. It may take some time for the
new primary OSD to be able to service requests, so it may ask the old primary
to continue servicing requests until the placement group migration is
complete. Once data migration completes, the mapping uses the primary OSD of
the new Acting Set.


Stale
-----

While Ceph uses heartbeats to ensure that hosts and daemons are running, the
``ceph-osd`` daemons may also get into a ``stuck`` state where they are not
reporting statistics in a timely manner (e.g., due to a temporary network
fault). By default, OSD daemons report their placement group, up thru, boot,
and failure statistics every half second (i.e., ``0.5``), which is more
frequent than the heartbeat thresholds. If the **Primary OSD** of a placement
group's Acting Set fails to report to the monitor, or if other OSDs have
reported the primary OSD ``down``, the monitors will mark the placement group
``stale``.

When you start your cluster, it is common to see the ``stale`` state until
the peering process completes. After your cluster has been running for a while,
seeing placement groups in the ``stale`` state indicates that the primary OSD
for those placement groups is ``down`` or not reporting placement group
statistics to the monitor.


Identifying Troubled PGs
========================

As previously noted, a placement group is not necessarily problematic just
because its state is not ``active+clean``. Generally, when placement groups
get stuck, Ceph's ability to self-repair may not be working. The stuck states
include:

- **Unclean**: Placement groups contain objects that are not replicated the
  desired number of times. They should be recovering.
- **Inactive**: Placement groups cannot process reads or writes because they
  are waiting for an OSD with the most up-to-date data to come back ``up``.
- **Stale**: Placement groups are in an unknown state, because the OSDs that
  host them have not reported to the monitor cluster in a while (configured
  by ``mon osd report timeout``).

To identify stuck placement groups, execute the following::

    ceph pg dump_stuck [unclean|inactive|stale|undersized|degraded]

See `Placement Group Subsystem`_ for additional details. To troubleshoot
stuck placement groups, see `Troubleshooting PG Errors`_.
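In practice, a quick triage pass combines the overall health report, which
names troubled placement groups and OSDs directly, with ``dump_stuck`` for
each state of interest. A sketch::

    # Summarize current problems, then enumerate stuck PGs by state.
    ceph health detail
    ceph pg dump_stuck stale
    ceph pg dump_stuck inactive
    ceph pg dump_stuck unclean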
Finding an Object Location
==========================

To store object data in the Ceph Object Store, a Ceph client must:

#. Set an object name
#. Specify a `pool`_

The Ceph client retrieves the latest cluster map, and the CRUSH algorithm
calculates how to map the object to a `placement group`_, and then calculates
how to assign the placement group to an OSD dynamically. To find the object
location, all you need is the object name and the pool name. For example::

    ceph osd map {poolname} {object-name} [namespace]

.. topic:: Exercise: Locate an Object

   As an exercise, let's create an object. Specify an object name, a path to
   a test file containing some object data, and a pool name using the
   ``rados put`` command on the command line. For example::

       rados put {object-name} {file-path} --pool=data
       rados put test-object-1 testfile.txt --pool=data

   To verify that the Ceph Object Store stored the object, execute the
   following::

       rados -p data ls

   Now, identify the object location::

       ceph osd map {pool-name} {object-name}
       ceph osd map data test-object-1

   Ceph should output the object's location. For example::

       osdmap e537 pool 'data' (1) object 'test-object-1' -> pg 1.d1743484 (1.4) -> up ([0,1], p0) acting ([0,1], p0)

   To remove the test object, simply delete it using the ``rados rm`` command.
   For example::

       rados rm test-object-1 --pool=data


As the cluster evolves, the object location may change dynamically. One benefit
of Ceph's dynamic rebalancing is that Ceph relieves you from having to perform
the migration manually. See the `Architecture`_ section for details.

.. _data placement: ../data-placement
.. _pool: ../pools
.. _placement group: ../placement-groups
.. _Architecture: ../../../architecture
.. _OSD Not Running: ../../troubleshooting/troubleshooting-osd#osd-not-running
.. _Troubleshooting PG Errors: ../../troubleshooting/troubleshooting-pg#troubleshooting-pg-errors
.. _Peering Failure: ../../troubleshooting/troubleshooting-pg#failures-osd-peering
.. _CRUSH map: ../crush-map
.. _Configuring Monitor/OSD Interaction: ../../configuration/mon-osd-interaction/
.. _Placement Group Subsystem: ../control#placement-group-subsystem