author    Daniel Baumann <daniel.baumann@progress-linux.org> 2024-04-27 18:24:20 +0000
committer Daniel Baumann <daniel.baumann@progress-linux.org> 2024-04-27 18:24:20 +0000
commit    483eb2f56657e8e7f419ab1a4fab8dce9ade8609 (patch)
tree      e5d88d25d870d5dedacb6bbdbe2a966086a0a5cf /doc/cephfs
parent    Initial commit. (diff)
Adding upstream version 14.2.21.
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'doc/cephfs')
-rw-r--r--  doc/cephfs/.gitignore                     |   1
-rw-r--r--  doc/cephfs/Makefile                       |   7
-rw-r--r--  doc/cephfs/add-remove-mds.rst             | 106
-rw-r--r--  doc/cephfs/administration.rst             | 295
-rw-r--r--  doc/cephfs/app-best-practices.rst         |  86
-rw-r--r--  doc/cephfs/best-practices.rst             |  88
-rw-r--r--  doc/cephfs/cache-size-limits.rst          |  15
-rw-r--r--  doc/cephfs/capabilities.rst               | 112
-rw-r--r--  doc/cephfs/cephfs-journal-tool.rst        | 238
-rw-r--r--  doc/cephfs/cephfs-shell.rst               | 348
-rw-r--r--  doc/cephfs/client-auth.rst                | 148
-rw-r--r--  doc/cephfs/client-config-ref.rst          | 220
-rw-r--r--  doc/cephfs/createfs.rst                   |  92
-rw-r--r--  doc/cephfs/dirfrags.rst                   |  95
-rw-r--r--  doc/cephfs/disaster-recovery-experts.rst  | 254
-rw-r--r--  doc/cephfs/disaster-recovery.rst          |  61
-rw-r--r--  doc/cephfs/eviction.rst                   | 190
-rw-r--r--  doc/cephfs/experimental-features.rst      | 111
-rw-r--r--  doc/cephfs/file-layouts.rst               | 230
-rw-r--r--  doc/cephfs/fs-volumes.rst                 | 369
-rw-r--r--  doc/cephfs/fstab.rst                      |  47
-rw-r--r--  doc/cephfs/full.rst                       |  60
-rw-r--r--  doc/cephfs/fuse.rst                       |  52
-rw-r--r--  doc/cephfs/hadoop.rst                     | 202
-rw-r--r--  doc/cephfs/health-messages.rst            | 131
-rw-r--r--  doc/cephfs/index.rst                      | 133
-rw-r--r--  doc/cephfs/journaler.rst                  |  41
-rw-r--r--  doc/cephfs/kernel-features.rst            |  40
-rw-r--r--  doc/cephfs/kernel.rst                     |  41
-rw-r--r--  doc/cephfs/lazyio.rst                     |  23
-rw-r--r--  doc/cephfs/mantle.rst                     | 263
-rw-r--r--  doc/cephfs/mds-config-ref.rst             | 546
-rw-r--r--  doc/cephfs/mds-state-diagram.dot          |  71
-rw-r--r--  doc/cephfs/mds-state-diagram.svg          | 311
-rw-r--r--  doc/cephfs/mds-states.rst                 | 227
-rw-r--r--  doc/cephfs/multimds.rst                   | 137
-rw-r--r--  doc/cephfs/nfs.rst                        |  81
-rw-r--r--  doc/cephfs/posix.rst                      | 101
-rw-r--r--  doc/cephfs/quota.rst                      |  76
-rw-r--r--  doc/cephfs/scrub.rst                      | 136
-rw-r--r--  doc/cephfs/standby.rst                    | 103
-rw-r--r--  doc/cephfs/troubleshooting.rst            | 160
-rw-r--r--  doc/cephfs/upgrading.rst                  |  92
43 files changed, 6140 insertions, 0 deletions
diff --git a/doc/cephfs/.gitignore b/doc/cephfs/.gitignore
new file mode 100644
index 00000000..e8232139
--- /dev/null
+++ b/doc/cephfs/.gitignore
@@ -0,0 +1 @@
+mds-state-diagram.svg
diff --git a/doc/cephfs/Makefile b/doc/cephfs/Makefile
new file mode 100644
index 00000000..eee2fa57
--- /dev/null
+++ b/doc/cephfs/Makefile
@@ -0,0 +1,7 @@
+TARGETS=mds-state-diagram.svg
+
+%.svg: %.dot
+ dot -Tsvg -o $@ $^
+
+
+all: $(TARGETS)
diff --git a/doc/cephfs/add-remove-mds.rst b/doc/cephfs/add-remove-mds.rst
new file mode 100644
index 00000000..c695fbbb
--- /dev/null
+++ b/doc/cephfs/add-remove-mds.rst
@@ -0,0 +1,106 @@
+============================
+ Deploying Metadata Servers
+============================
+
+Each CephFS file system requires at least one MDS. The cluster operator will
+generally use their automated deployment tool to launch required MDS servers as
+needed. Rook and ansible (via the ceph-ansible playbooks) are recommended
+tools for doing this. For clarity, we also show the systemd commands here which
+may be run by the deployment technology if executed on bare-metal.
+
+See `MDS Config Reference`_ for details on configuring metadata servers.
+
+
+Provisioning Hardware for an MDS
+================================
+
+The present version of the MDS is single-threaded and CPU-bound for most
+activities, including responding to client requests. Even so, an MDS under the
+most aggressive client loads still uses about 2 to 3 CPU cores. This is due to
+the other miscellaneous upkeep threads working in tandem.
+
+Nevertheless, it is recommended that an MDS server be well provisioned with an
+advanced CPU with sufficient cores. Development is on-going to make better use
+of available CPU cores in the MDS; it is expected that in future versions of
+Ceph the MDS server will improve performance by taking advantage of more cores.
+
+The other dimension to MDS performance is the available RAM for caching. The
+MDS necessarily manages a distributed and cooperative metadata cache among all
+clients and other active MDSs. Therefore it is essential to provide the MDS
+with sufficient RAM to enable faster metadata access and mutation.
+
+Generally, an MDS serving a large cluster of clients (1000 or more) will use at
+least 64GB of cache (see also :doc:`/cephfs/cache-size-limits`). An MDS with a larger
+cache is not well explored in the largest known community clusters; there may
+be diminishing returns where management of such a large cache negatively
+impacts performance in surprising ways. It would be best to do analysis with
+expected workloads to determine if provisioning more RAM is worthwhile.
+
+In a bare-metal cluster, the best practice is to over-provision hardware for
+the MDS server. Even if a single MDS daemon is unable to fully utilize the
+hardware, it may be desirable later on to start more active MDS daemons on the
+same node to fully utilize the available cores and memory. Additionally, it may
+become clear with workloads on the cluster that performance improves with
+multiple active MDS on the same node rather than over-provisioning a single
+MDS.
+
+Finally, be aware that CephFS is a highly available file system: it supports
+standby MDS daemons (see also :ref:`mds-standby`) for rapid failover. To get a
+real benefit from deploying standbys, it is usually necessary to distribute MDS
+daemons across at least two nodes in the cluster. Otherwise, a hardware failure
+on a single node may result in the file system becoming unavailable.
+
+Co-locating the MDS with other Ceph daemons (hyperconverged) is an effective
+and recommended way to accomplish this so long as all daemons are configured to
+use available hardware within certain limits. For the MDS, this generally
+means limiting its cache size.
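+
+A minimal illustration of such a limit in ``ceph.conf`` (the 8 GiB value below
+is only an example; size it to the RAM actually available on the node)::
+
+    [mds]
+    # cap the metadata cache so the MDS coexists with other daemons
+    mds_cache_memory_limit = 8589934592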
+
+
+Adding an MDS
+=============
+
+#. Create an MDS data directory ``/var/lib/ceph/mds/ceph-${id}``. The daemon only uses this directory to store its keyring. (An example command is shown after this list.)
+
+#. Edit ``ceph.conf`` and add an MDS section. ::
+
+ [mds.${id}]
+ host = {hostname}
+
+#. Create the authentication key, if you use CephX. ::
+
+ $ sudo ceph auth get-or-create mds.${id} mon 'profile mds' mgr 'profile mds' mds 'allow *' osd 'allow *' > /var/lib/ceph/mds/ceph-${id}/keyring
+
+#. Start the service. ::
+
+ $ sudo systemctl start mds.${id}
+
+#. The status of the cluster should show: ::
+
+ mds: ${id}:1 {0=${id}=up:active} 2 up:standby
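+
+For reference, step 1 above can be carried out with a command such as the
+following (the path comes from the step itself; adjust ownership and
+permissions to suit your deployment)::
+
+    $ sudo mkdir -p /var/lib/ceph/mds/ceph-${id}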
+
+Removing an MDS
+===============
+
+If you have a metadata server in your cluster that you'd like to remove, you may use
+the following method.
+
+#. (Optional:) Create a new replacement Metadata Server. If there is no
+   replacement MDS to take over once this one is removed, the file system will
+   become unavailable to clients. If that is not desirable, consider adding a
+   replacement metadata server before tearing down the metadata server you
+   would like to take offline.
+
+#. Stop the MDS to be removed. ::
+
+ $ sudo systemctl stop mds.${id}
+
+ The MDS will automatically notify the Ceph monitors that it is going down.
+ This enables the monitors to perform instantaneous failover to an available
+ standby, if one exists. It is unnecessary to use administrative commands to
+ effect this failover, e.g. through the use of ``ceph mds fail mds.${id}``.
+
+#. Remove the ``/var/lib/ceph/mds/ceph-${id}`` directory on the MDS. ::
+
+ $ sudo rm -rf /var/lib/ceph/mds/ceph-${id}
+
+.. _MDS Config Reference: ../mds-config-ref
diff --git a/doc/cephfs/administration.rst b/doc/cephfs/administration.rst
new file mode 100644
index 00000000..8646606f
--- /dev/null
+++ b/doc/cephfs/administration.rst
@@ -0,0 +1,295 @@
+.. _cephfs-administration:
+
+CephFS Administrative commands
+==============================
+
+Filesystems
+-----------
+
+.. note:: The names of the file systems, metadata pools, and data pools can
+ only have characters in the set [a-zA-Z0-9\_-.].
+
+These commands operate on the CephFS filesystems in your Ceph cluster.
+Note that by default only one filesystem is permitted: to enable
+creation of multiple filesystems use ``ceph fs flag set enable_multiple true``.
+
+::
+
+ fs new <filesystem name> <metadata pool name> <data pool name>
+
+This command creates a new file system. The file system name and metadata pool
+name are self-explanatory. The specified data pool is the default data pool and
+cannot be changed once set. Each file system has its own set of MDS daemons
+assigned to ranks so ensure that you have sufficient standby daemons available
+to accommodate the new file system.
+
+::
+
+ fs ls
+
+List all file systems by name.
+
+::
+
+ fs dump [epoch]
+
+This dumps the FSMap at the given epoch (default: current) which includes all
+file system settings, MDS daemons and the ranks they hold, and the list of
+standby MDS daemons.
+
+
+::
+
+ fs rm <filesystem name> [--yes-i-really-mean-it]
+
+Destroy a CephFS file system. This wipes information about the state of the
+file system from the FSMap. The metadata pool and data pools are untouched and
+must be destroyed separately.
+
+::
+
+ fs get <filesystem name>
+
+Get information about the named file system, including settings and ranks. This
+is a subset of the same information from the ``fs dump`` command.
+
+::
+
+ fs set <filesystem name> <var> <val>
+
+Change a setting on a file system. These settings are specific to the named
+file system and do not affect other file systems.
+
+::
+
+ fs add_data_pool <filesystem name> <pool name/id>
+
+Add a data pool to the file system. This pool can be used for file layouts
+as an alternate location to store file data.
+
+::
+
+ fs rm_data_pool <filesystem name> <pool name/id>
+
+This command removes the specified pool from the list of data pools for the
+file system. If any files have layouts for the removed data pool, the file
+data will become unavailable. The default data pool (when creating the file
+system) cannot be removed.
+
+
+Settings
+--------
+
+::
+
+ fs set <fs name> max_file_size <size in bytes>
+
+CephFS has a configurable maximum file size, which is 1 TB by default.
+You may wish to set this limit higher if you expect to store large files
+in CephFS. It is a 64-bit field.
+
+Setting ``max_file_size`` to 0 does not disable the limit. It would
+simply limit clients to only creating empty files.
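+
+For example, to raise the limit to 4 TB on a file system named ``cephfs``
+(both the name and the size here are illustrative):
+
+::
+
+    fs set cephfs max_file_size 4398046511104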
+
+
+Maximum file sizes and performance
+----------------------------------
+
+CephFS enforces the maximum file size limit at the point of appending to
+files or setting their size. It does not affect how anything is stored.
+
+When a user creates a file of enormous size (without necessarily writing any
+data to it), some operations (such as deletes) force the MDS to perform a
+large number of operations in order to check whether any of the RADOS objects
+that could exist within that range (according to the file size) actually
+exist.
+
+The ``max_file_size`` setting prevents users from creating files that appear
+to be, e.g., exabytes in size, causing load on the MDS as it tries to
+enumerate the objects during operations like stats or deletes.
+
+
+Taking the cluster down
+-----------------------
+
+Taking a CephFS cluster down is done by setting the down flag:
+
+::
+
+ fs set <fs_name> down true
+
+To bring the cluster back online:
+
+::
+
+ fs set <fs_name> down false
+
+This will also restore the previous value of max_mds. MDS daemons are brought
+down in a way such that journals are flushed to the metadata pool and all
+client I/O is stopped.
+
+
+Taking the cluster down rapidly for deletion or disaster recovery
+-----------------------------------------------------------------
+
+To allow rapidly deleting a file system (for testing) or to quickly bring the
+file system and MDS daemons down, use the ``fs fail`` command:
+
+::
+
+ fs fail <fs_name>
+
+This command sets a file system flag to prevent standbys from
+activating on the file system (the ``joinable`` flag).
+
+This process can also be done manually by doing the following:
+
+::
+
+ fs set <fs_name> joinable false
+
+Then the operator can fail all of the ranks, which causes the MDS daemons to
+respawn as standbys. The file system will be left in a degraded state.
+
+::
+
+ # For all ranks, 0-N:
+ mds fail <fs_name>:<n>
+
+Once all ranks are inactive, the file system may also be deleted or left in
+this state for other purposes (perhaps disaster recovery).
+
+To bring the cluster back up, simply set the joinable flag:
+
+::
+
+ fs set <fs_name> joinable true
+
+
+Daemons
+-------
+
+Most commands manipulating MDSs take a ``<role>`` argument which can take one
+of three forms:
+
+::
+
+ <fs_name>:<rank>
+ <fs_id>:<rank>
+ <rank>
+
+Commands to manipulate MDS daemons:
+
+::
+
+ mds fail <gid/name/role>
+
+Mark an MDS daemon as failed. This is equivalent to what the cluster
+would do if an MDS daemon had failed to send a message to the mon
+for ``mds_beacon_grace`` seconds. If the daemon was active and a suitable
+standby is available, using ``mds fail`` will force a failover to the standby.
+
+If the MDS daemon was in reality still running, then using ``mds fail``
+will cause the daemon to restart. If it was active and a standby was
+available, then the "failed" daemon will return as a standby.
+
+
+::
+
+ tell mds.<daemon name> command ...
+
+Send a command to the MDS daemon(s). Use ``mds.*`` to send a command to all
+daemons. Use ``ceph tell mds.* help`` to learn available commands.
+
+::
+
+ mds metadata <gid/name/role>
+
+Get metadata about the given MDS known to the Monitors.
+
+::
+
+ mds repaired <role>
+
+Mark the file system rank as repaired. Despite what the name suggests, this
+command does not change an MDS; it manipulates the file system rank that has
+been marked damaged.
+
+
+Minimum Client Version
+----------------------
+
+It is sometimes desirable to set the minimum version of Ceph that a client must be
+running to connect to a CephFS cluster. Older clients may sometimes still be
+running with bugs that can cause locking issues between clients (due to
+capability release). CephFS provides a mechanism to set the minimum
+client version:
+
+::
+
+ fs set <fs name> min_compat_client <release>
+
+For example, to only allow Nautilus clients, use:
+
+::
+
+ fs set cephfs min_compat_client nautilus
+
+Clients running an older version will be automatically evicted.
+
+
+Global settings
+---------------
+
+
+::
+
+ fs flag set <flag name> <flag val> [<confirmation string>]
+
+Sets a global CephFS flag (i.e. not specific to a particular file system).
+Currently, the only flag setting is 'enable_multiple' which allows having
+multiple CephFS file systems.
+
+Some flags require you to confirm your intentions with "--yes-i-really-mean-it"
+or a similar string they will prompt you with. Consider these actions carefully
+before proceeding; they are placed on especially dangerous activities.
+
+
+Advanced
+--------
+
+These commands are not required in normal operation, and exist
+for use in exceptional circumstances. Incorrect use of these
+commands may cause serious problems, such as an inaccessible
+filesystem.
+
+::
+
+ mds compat rm_compat
+
+Removes a compatibility feature flag.
+
+::
+
+ mds compat rm_incompat
+
+Removes an incompatibility feature flag.
+
+::
+
+ mds compat show
+
+Show MDS compatibility flags.
+
+::
+
+ mds rmfailed
+
+This removes a rank from the failed set.
+
+::
+
+ fs reset <filesystem name>
+
+This command resets the file system state to defaults, except for the name and
+pools. Non-zero ranks are saved in the stopped set.
diff --git a/doc/cephfs/app-best-practices.rst b/doc/cephfs/app-best-practices.rst
new file mode 100644
index 00000000..d916e184
--- /dev/null
+++ b/doc/cephfs/app-best-practices.rst
@@ -0,0 +1,86 @@
+
+Application best practices for distributed filesystems
+======================================================
+
+CephFS is POSIX compatible, and therefore should work with any existing
+applications that expect a POSIX filesystem. However, because it is a
+network filesystem (unlike e.g. XFS) and it is highly consistent (unlike
+e.g. NFS), there are some consequences that application authors may
+benefit from knowing about.
+
+The following sections describe some areas where distributed filesystems
+may have noticeably different performance behaviours compared with
+local filesystems.
+
+
+ls -l
+-----
+
+When you run "ls -l", the ``ls`` program
+is first doing a directory listing, and then calling ``stat`` on every
+file in the directory.
+
+This is usually far in excess of what an application really needs, and
+it can be slow for large directories. If you don't really need all
+this metadata for each file, then use a plain ``ls``.
+
+ls/stat on files being extended
+-------------------------------
+
+If another client is currently extending files in the listed directory,
+then an ``ls -l`` may take an exceptionally long time to complete, as
+the lister must wait for the writer to flush data in order to do a valid
+read of every file's size. So unless you *really* need to know the
+exact size of every file in the directory, just don't do it!
+
+This would also apply to any application code that was directly
+issuing ``stat`` system calls on files being appended from
+another node.
+
+Very large directories
+----------------------
+
+Do you really need that 10,000,000 file directory? While directory
+fragmentation enables CephFS to handle it, it is always going to be
+less efficient than splitting your files into more modest-sized directories.
+
+Even standard userspace tools can become quite slow when operating on very
+large directories. For example, the default behaviour of ``ls``
+is to give an alphabetically ordered result, but ``readdir`` system
+calls do not give an ordered result (this is true in general, not just
+with CephFS). So when you ``ls`` on a million file directory, it is
+loading a list of a million names into memory, sorting the list, then writing
+it out to the display.
+
+Hard links
+----------
+
+Hard links have an intrinsic cost in terms of the internal housekeeping
+that a filesystem has to do to keep two references to the same data. In
+CephFS there is a particular performance cost, because with normal files
+the inode is embedded in the directory (i.e. there is no extra fetch of
+the inode after looking up the path).
+
+Working set size
+----------------
+
+The MDS acts as a cache for the metadata stored in RADOS. Metadata
+performance is very different for workloads whose metadata fits within
+that cache than for workloads whose metadata does not.
+
+If your workload has more files than fit in your cache (configured using
+``mds_cache_memory_limit`` or ``mds_cache_size`` settings), then
+make sure you test it appropriately: don't test your system with a small
+number of files and then expect equivalent performance when you move
+to a much larger number of files.
+
+Do you need a filesystem?
+-------------------------
+
+Remember that Ceph also includes an object storage interface. If your
+application needs to store huge flat collections of files where you just
+read and write whole files at once, then you might well be better off
+using the :ref:`Object Gateway <object-gateway>`.
+
+
+
diff --git a/doc/cephfs/best-practices.rst b/doc/cephfs/best-practices.rst
new file mode 100644
index 00000000..06a14ec6
--- /dev/null
+++ b/doc/cephfs/best-practices.rst
@@ -0,0 +1,88 @@
+
+CephFS best practices
+=====================
+
+This guide provides recommendations for best results when deploying CephFS.
+
+For the actual configuration guide for CephFS, please see the instructions
+at :doc:`/cephfs/index`.
+
+Which Ceph version?
+-------------------
+
+Use at least the Jewel (v10.2.0) release of Ceph. This is the first
+release to include stable CephFS code and fsck/repair tools. Make sure
+you are using the latest point release to get bug fixes.
+
+Note that Ceph releases do not include a kernel; the kernel is versioned
+and released separately. See below for guidance on choosing an
+appropriate kernel version if you are using the kernel client
+for CephFS.
+
+Most stable configuration
+-------------------------
+
+Some features in CephFS are still experimental. See
+:doc:`/cephfs/experimental-features` for guidance on these.
+
+For the best chance of a happy healthy filesystem, use a **single active MDS**
+and **do not use snapshots**. Both of these are the default.
+
+Note that creating multiple MDS daemons is fine, as these will simply be
+used as standbys. However, for best stability you should avoid
+adjusting ``max_mds`` upwards, as this would cause multiple MDS
+daemons to be active at once.
+
+Which client?
+-------------
+
+The FUSE client is the most accessible and the easiest to upgrade to the
+version of Ceph used by the storage cluster, while the kernel client will
+often give better performance.
+
+The clients do not always provide equivalent functionality, for example
+the fuse client supports client-enforced quotas while the kernel client
+does not.
+
+When encountering bugs or performance issues, it is often instructive to
+try using the other client, in order to find out whether the bug was
+client-specific or not (and then to let the developers know).
+
+Which kernel version?
+---------------------
+
+Because the kernel client is distributed as part of the linux kernel (not
+as part of packaged ceph releases),
+you will need to consider which kernel version to use on your client nodes.
+Older kernels are known to include buggy ceph clients, and may not support
+features that more recent Ceph clusters support.
+
+Remember that the "latest" kernel in a stable linux distribution is likely
+to be years behind the latest upstream linux kernel where Ceph development
+takes place (including bug fixes).
+
+As a rough guide, as of Ceph 10.x (Jewel), you should be using at least a
+4.x kernel. If you absolutely have to use an older kernel, you should use
+the fuse client instead of the kernel client.
+
+This advice does not apply if you are using a linux distribution that
+includes CephFS support, as in this case the distributor will be responsible
+for backporting fixes to their stable kernel: check with your vendor.
+
+Reporting issues
+----------------
+
+If you have identified a specific issue, please report it with as much
+information as possible. Especially important information:
+
+* Ceph versions installed on client and server
+* Whether you are using the kernel or fuse client
+* If you are using the kernel client, what kernel version?
+* How many clients are in play, doing what kind of workload?
+* If a system is 'stuck', is that affecting all clients or just one?
+* Any ceph health messages
+* Any backtraces in the ceph logs from crashes
+
+If you are satisfied that you have found a bug, please file it on
+`the tracker <http://tracker.ceph.com>`_. For more general queries please write
+to the `ceph-users mailing list <http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com/>`_.
diff --git a/doc/cephfs/cache-size-limits.rst b/doc/cephfs/cache-size-limits.rst
new file mode 100644
index 00000000..4ea41443
--- /dev/null
+++ b/doc/cephfs/cache-size-limits.rst
@@ -0,0 +1,15 @@
+Understanding MDS Cache Size Limits
+===================================
+
+This section describes ways to limit MDS cache size.
+
+You can limit the size of the Metadata Server (MDS) cache by:
+
+* *A memory limit*: A new behavior introduced in the Luminous release. Use the `mds_cache_memory_limit` parameter. We recommend using memory limits instead of inode count limits.
+* *Inode count*: Use the `mds_cache_size` parameter. By default, limiting the MDS cache by inode count is disabled.
+
+In addition, you can specify a cache reservation by using the `mds_cache_reservation` parameter for MDS operations. The cache reservation is expressed as a percentage of the memory or inode limit and is set to 5% by default. The intent of this parameter is to have the MDS maintain an extra reserve of memory for its cache for new metadata operations to use. As a consequence, the MDS should in general operate below its memory limit because it will recall old state from clients in order to drop unused metadata in its cache.
+
+The `mds_cache_reservation` parameter replaces the `mds_health_cache_threshold` in all situations except when an MDS node sends a health alert to the Monitors indicating the cache is too large. By default, `mds_health_cache_threshold` is 150% of the maximum cache size.
+
+Be aware that the cache limit is not a hard limit. Potential bugs in the CephFS client or MDS or misbehaving applications might cause the MDS to exceed its cache size. The `mds_health_cache_threshold` configures the cluster health warning message so that operators can investigate why the MDS cannot shrink its cache.
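+
+As a quick sketch, the memory limit can be changed at runtime through the
+``ceph config`` interface (the 16 GiB value below is purely illustrative)::
+
+    ceph config set mds mds_cache_memory_limit 17179869184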
diff --git a/doc/cephfs/capabilities.rst b/doc/cephfs/capabilities.rst
new file mode 100644
index 00000000..335b053a
--- /dev/null
+++ b/doc/cephfs/capabilities.rst
@@ -0,0 +1,112 @@
+======================
+Capabilities in CephFS
+======================
+When a client wants to operate on an inode, it will query the MDS in various
+ways, which will then grant the client a set of **capabilities**. These
+grant the client permissions to operate on the inode in various ways. One
+of the major differences from other network filesystems (e.g. NFS or SMB) is
+that the capabilities granted are quite granular, and it's possible that
+multiple clients can hold different capabilities on the same inodes.
+
+Types of Capabilities
+---------------------
+There are several "generic" capability bits. These denote what sort of ability
+the capability grants.
+
+::
+
+ /* generic cap bits */
+ #define CEPH_CAP_GSHARED 1 /* client can read (s) */
+ #define CEPH_CAP_GEXCL 2 /* client can read and update (x) */
+ #define CEPH_CAP_GCACHE 4 /* (file) client can cache reads (c) */
+ #define CEPH_CAP_GRD 8 /* (file) client can read (r) */
+ #define CEPH_CAP_GWR 16 /* (file) client can write (w) */
+ #define CEPH_CAP_GBUFFER 32 /* (file) client can buffer writes (b) */
+ #define CEPH_CAP_GWREXTEND 64 /* (file) client can extend EOF (a) */
+ #define CEPH_CAP_GLAZYIO 128 /* (file) client can perform lazy io (l) */
+
+These are then shifted by a particular number of bits. These denote a part of
+the inode's data or metadata on which the capability is being granted:
+
+::
+
+ /* per-lock shift */
+ #define CEPH_CAP_SAUTH 2 /* A */
+ #define CEPH_CAP_SLINK 4 /* L */
+ #define CEPH_CAP_SXATTR 6 /* X */
+ #define CEPH_CAP_SFILE 8 /* F */
+
+Only certain generic cap types are ever granted for some of those "shifts",
+however. In particular, only the FILE shift ever has more than the first two
+bits.
+
+::
+
+ | AUTH | LINK | XATTR | FILE
+ 2 4 6 8
+
+From the above, we get a number of constants that are generated by taking
+each bit value and shifting it to the correct bit in the word:
+
+::
+
+ #define CEPH_CAP_AUTH_SHARED (CEPH_CAP_GSHARED << CEPH_CAP_SAUTH)
+
+These bits can then be or'ed together to make a bitmask denoting a set of
+capabilities.
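+
+As a worked example derived from the definitions above, the shared and read
+capabilities on the FILE lock come out to the following bit values:
+
+::
+
+    Fs = CEPH_CAP_GSHARED << CEPH_CAP_SFILE = 1 << 8 = 0x100
+    Fr = CEPH_CAP_GRD     << CEPH_CAP_SFILE = 8 << 8 = 0x800
+
+    Fs | Fr = 0x900   (a client holding both carries this mask)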
+
+There is one exception:
+
+::
+
+ #define CEPH_CAP_PIN 1 /* no specific capabilities beyond the pin */
+
+The "pin" just pins the inode into memory, without granting any other caps.
+
+Graphically:
+
+::
+
+ +---+---+---+---+---+---+---+---+
+ | p | _ |As x |Ls x |Xs x |
+ +---+---+---+---+---+---+---+---+
+ |Fs x c r w b a l |
+ +---+---+---+---+---+---+---+---+
+
+The second bit is currently unused.
+
+Abilities granted by each cap
+-----------------------------
+While that is how capabilities are granted (and communicated), the important
+bit is what they actually allow the client to do:
+
+* PIN: this just pins the inode into memory. This is sufficient to allow the
+ client to get to the inode number, as well as other immutable things like
+ major or minor numbers in a device inode, or symlink contents.
+
+* AUTH: this grants the ability to get to the authentication-related metadata.
+ In particular, the owner, group and mode. Note that doing a full permission
+ check may require getting at ACLs as well, which are stored in xattrs.
+
+* LINK: the link count of the inode
+
+* XATTR: ability to access or manipulate xattrs. Note that since ACLs are
+ stored in xattrs, it's also sometimes necessary to access them when checking
+ permissions.
+
+* FILE: this is the big one. These allow the client to access and manipulate
+ file data. It also covers certain metadata relating to file data -- the
+ size, mtime, atime and ctime, in particular.
+
+Shorthand
+---------
+Note that the client logging can also present a compact representation of the
+capabilities. For example:
+
+::
+
+ pAsLsXsFs
+
+The 'p' represents the pin. Each capital letter corresponds to the shift
+values, and the lowercase letters after each shift are for the actual
+capabilities granted in each shift.
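+
+As an illustration, the example string above decodes as:
+
+::
+
+    p    pin
+    As   AUTH  shared
+    Ls   LINK  shared
+    Xs   XATTR shared
+    Fs   FILE  shared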
diff --git a/doc/cephfs/cephfs-journal-tool.rst b/doc/cephfs/cephfs-journal-tool.rst
new file mode 100644
index 00000000..bd70e817
--- /dev/null
+++ b/doc/cephfs/cephfs-journal-tool.rst
@@ -0,0 +1,238 @@
+
+cephfs-journal-tool
+===================
+
+Purpose
+-------
+
+If a CephFS journal has become damaged, expert intervention may be required
+to restore the filesystem to a working state.
+
+The ``cephfs-journal-tool`` utility provides functionality to aid experts in
+examining, modifying, and extracting data from journals.
+
+.. warning::
+
+ This tool is **dangerous** because it directly modifies internal
+ data structures of the filesystem. Make backups, be careful, and
+ seek expert advice. If you are unsure, do not run this tool.
+
+Syntax
+------
+
+::
+
+ cephfs-journal-tool journal <inspect|import|export|reset>
+ cephfs-journal-tool header <get|set>
+ cephfs-journal-tool event <get|splice|apply> [filter] <list|json|summary|binary>
+
+
+The tool operates in three modes: ``journal``, ``header`` and ``event``,
+meaning the whole journal, the header, and the events within the journal
+respectively.
+
+Journal mode
+------------
+
+This should be your starting point to assess the state of a journal.
+
+* ``inspect`` reports on the health of the journal. This will identify any
+ missing objects or corruption in the stored journal. Note that this does
+ not identify inconsistencies in the events themselves, just that events are
+ present and can be decoded.
+
+* ``import`` and ``export`` read and write binary dumps of the journal
+ in a sparse file format. Pass the filename as the last argument. The
+ export operation may not work reliably for journals which are damaged (missing
+ objects).
+
+* ``reset`` truncates a journal, discarding any information within it.
+
+
+Example: journal inspect
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+::
+
+ # cephfs-journal-tool journal inspect
+ Overall journal integrity: DAMAGED
+ Objects missing:
+ 0x1
+ Corrupt regions:
+ 0x400000-ffffffffffffffff
+
+Example: Journal import/export
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+::
+
+ # cephfs-journal-tool journal export myjournal.bin
+ journal is 4194304~80643
+ read 80643 bytes at offset 4194304
+ wrote 80643 bytes at offset 4194304 to myjournal.bin
+ NOTE: this is a _sparse_ file; you can
+ $ tar cSzf myjournal.bin.tgz myjournal.bin
+ to efficiently compress it while preserving sparseness.
+
+ # cephfs-journal-tool journal import myjournal.bin
+ undump myjournal.bin
+ start 4194304 len 80643
+ writing header 200.00000000
+ writing 4194304~80643
+ done.
+
+.. note::
+
+ It is wise to use the ``journal export <backup file>`` command to make a journal backup
+ before any further manipulation.
+
+Header mode
+-----------
+
+* ``get`` outputs the current content of the journal header
+
+* ``set`` modifies an attribute of the header. Allowed attributes are
+ ``trimmed_pos``, ``expire_pos`` and ``write_pos``.
+
+Example: header get/set
+~~~~~~~~~~~~~~~~~~~~~~~
+
+::
+
+ # cephfs-journal-tool header get
+ { "magic": "ceph fs volume v011",
+ "write_pos": 4274947,
+ "expire_pos": 4194304,
+ "trimmed_pos": 4194303,
+ "layout": { "stripe_unit": 4194304,
+ "stripe_count": 4194304,
+ "object_size": 4194304,
+ "cas_hash": 4194304,
+ "object_stripe_unit": 4194304,
+ "pg_pool": 4194304}}
+
+ # cephfs-journal-tool header set trimmed_pos 4194303
+ Updating trimmed_pos 0x400000 -> 0x3fffff
+ Successfully updated header.
+
+
+Event mode
+----------
+
+Event mode allows detailed examination and manipulation of the contents of the journal. Event
+mode can operate on all events in the journal, or filters may be applied.
+
+The arguments following ``cephfs-journal-tool event`` consist of an action, optional filter
+parameters, and an output mode:
+
+::
+
+ cephfs-journal-tool event <action> [filter] <output>
+
+Actions:
+
+* ``get`` read the events from the log
+* ``splice`` erase events or regions in the journal
+* ``apply`` extract filesystem metadata from events and attempt to apply it to the metadata store.
+
+Filtering:
+
+* ``--range <int begin>..[int end]`` only include events within the range begin (inclusive) to end (exclusive)
+* ``--path <path substring>`` only include events referring to metadata containing the specified string
+* ``--inode <int>`` only include events referring to metadata containing the specified inode
+* ``--type <type string>`` only include events of this type
+* ``--frag <ino>[.frag id]`` only include events referring to this directory fragment
+* ``--dname <string>`` only include events referring to this named dentry within a directory
+ fragment (may only be used in conjunction with ``--frag``
+* ``--client <int>`` only include events from this client session ID
+
+Filters may be combined on an AND basis (i.e. only the intersection of events matching every filter is included).
+
+Output modes:
+
+* ``binary``: write each event as a binary file, within a folder whose name is controlled by ``--path``
+* ``json``: write all events to a single file, as a JSON serialized list of objects
+* ``summary``: write a human readable summary of the events read to standard out
+* ``list``: write a human readable terse listing of the type of each event, and
+ which file paths the event affects.
+
+
+Example: event mode
+~~~~~~~~~~~~~~~~~~~
+
+::
+
+ # cephfs-journal-tool event get json --path output.json
+ Wrote output to JSON file 'output.json'
+
+ # cephfs-journal-tool event get summary
+ Events by type:
+ NOOP: 2
+ OPEN: 2
+ SESSION: 2
+ SUBTREEMAP: 1
+ UPDATE: 43
+
+ # cephfs-journal-tool event get list
+ 0x400000 SUBTREEMAP: ()
+ 0x400308 SESSION: ()
+ 0x4003de UPDATE: (setattr)
+ /
+ 0x40068b UPDATE: (mkdir)
+ diralpha
+ 0x400d1b UPDATE: (mkdir)
+ diralpha/filealpha1
+ 0x401666 UPDATE: (unlink_local)
+ stray0/10000000001
+ diralpha/filealpha1
+ 0x40228d UPDATE: (unlink_local)
+ diralpha
+ stray0/10000000000
+ 0x402bf9 UPDATE: (scatter_writebehind)
+ stray0
+ 0x403150 UPDATE: (mkdir)
+ dirbravo
+ 0x4037e0 UPDATE: (openc)
+ dirbravo/.filebravo1.swp
+ 0x404032 UPDATE: (openc)
+ dirbravo/.filebravo1.swpx
+
+ # cephfs-journal-tool event get --path filebravo1 list
+ 0x40785a UPDATE: (openc)
+ dirbravo/filebravo1
+ 0x4103ee UPDATE: (cap update)
+ dirbravo/filebravo1
+
+ # cephfs-journal-tool event splice --range 0x40f754..0x410bf1 summary
+ Events by type:
+ OPEN: 1
+ UPDATE: 2
+
+ # cephfs-journal-tool event apply --range 0x410bf1.. summary
+ Events by type:
+ NOOP: 1
+ SESSION: 1
+ UPDATE: 9
+
+ # cephfs-journal-tool event get --inode=1099511627776 list
+ 0x40068b UPDATE: (mkdir)
+ diralpha
+ 0x400d1b UPDATE: (mkdir)
+ diralpha/filealpha1
+ 0x401666 UPDATE: (unlink_local)
+ stray0/10000000001
+ diralpha/filealpha1
+ 0x40228d UPDATE: (unlink_local)
+ diralpha
+ stray0/10000000000
+
+ # cephfs-journal-tool event get --frag=1099511627776 --dname=filealpha1 list
+ 0x400d1b UPDATE: (mkdir)
+ diralpha/filealpha1
+ 0x401666 UPDATE: (unlink_local)
+ stray0/10000000001
+ diralpha/filealpha1
+
+ # cephfs-journal-tool event get binary --path bin_events
+ Wrote output to binary files in directory 'bin_events'
+
diff --git a/doc/cephfs/cephfs-shell.rst b/doc/cephfs/cephfs-shell.rst
new file mode 100644
index 00000000..dfe8e232
--- /dev/null
+++ b/doc/cephfs/cephfs-shell.rst
@@ -0,0 +1,348 @@
+
+=============
+CephFS Shell
+=============
+
+The File System (FS) shell includes various shell-like commands that directly interact with the :term:`Ceph Filesystem`.
+
+Usage :
+
+ cephfs-shell [-options] -- [command, command,...]
+
+Options :
+ -c, --config FILE Set Configuration file.
+ -b, --batch FILE Process a batch file.
+ -t, --test FILE Test against transcript(s) in FILE
+
+Commands
+========
+
+mkdir
+-----
+
+Create the directory(ies), if they do not already exist.
+
+Usage :
+
+ mkdir [-option] <directory>...
+
+* directory - name of the directory to be created.
+
+Options :
+ -m MODE Sets the access mode for the new directory.
+ -p, --parent Create parent directories as necessary. When this option is specified, no error is reported if a directory already exists.
+
+put
+---
+
+Copy a file/directory to Ceph Filesystem from Local Filesystem.
+
+Usage :
+
+ put [options] <source_path> [target_path]
+
+* source_path - local file/directory path to be copied to cephfs.
+ * if `.` copies all the files/directories in the local working directory.
+ * if `-` Reads the input from stdin.
+
+* target_path - remote directory path where the files/directories are to be copied to.
+ * if `.` files/directories are copied to the remote working directory.
+
+Options :
+ -f, --force Overwrites the destination if it already exists.
+
+
+get
+---
+
+Copy a file from Ceph Filesystem to Local Filesystem.
+
+Usage :
+
+ get [options] <source_path> [target_path]
+
+* source_path - remote file/directory path which is to be copied to local filesystem.
+ * if `.` copies all the files/directories in the remote working directory.
+
+* target_path - local directory path where the files/directories are to be copied to.
+ * if `.` files/directories are copied to the local working directory.
+ * if `-` Writes output to stdout.
+
+Options:
+ -f, --force Overwrites the destination if it already exists.
+
+ls
+--
+
+List all the files and directories in the current working directory.
+
+Usage :
+
+ ls [option] [directory]...
+
+* directory - name of directory whose files/directories are to be listed.
+ * By default current working directory's files/directories are listed.
+
+Options:
+ -l, --long list with long format - show permissions
+ -r, --reverse reverse sort
+ -H human readable
+ -a, -all ignore entries starting with .
+ -S Sort by file_size
+
+
+cat
+---
+
+Concatenate files and print on the standard output
+
+Usage :
+
+ cat <file>....
+
+* file - name of the file
+
+cd
+--
+
+Change current working directory.
+
+Usage :
+
+ cd [directory]
+
+* directory - path/directory name. If no directory is mentioned it is changed to the root directory.
+ * If '..' moves to the parent directory of the current directory.
+
+cwd
+---
+
+Get current working directory.
+
+Usage :
+
+ cwd
+
+
+quit/Ctrl + D
+-------------
+
+Close the shell.
+
+chmod
+-----
+
+Change the permissions of file/directory.
+
+Usage :
+
+ chmod <mode> <file/directory>
+
+mv
+--
+
+Moves files/directories from source to destination.
+
+Usage :
+
+ mv <source_path> <destination_path>
+
+rmdir
+-----
+
+Delete a directory(ies).
+
+Usage :
+
+ rmdir <directory_name>.....
+
+rm
+--
+
+Remove one or more files.
+
+Usage :
+
+ rm <file_name/pattern>...
+
+
+write
+-----
+
+Create and Write a file.
+
+Usage :
+
+ write <file_name>
+ <Enter Data>
+ Ctrl+D Exit.
+
+lls
+---
+
+Lists all files and directories in the specified local directory. If no path is given, the files and directories of the current local directory are listed.
+
+Usage:
+
+ lls <path>.....
+
+lcd
+---
+
+Moves into the given local directory.
+
+Usage :
+
+ lcd <path>
+
+lpwd
+----
+
+Prints the absolute path of the current local directory.
+
+Usage :
+
+ lpwd
+
+
+umask
+-----
+
+Set and get the file mode creation mask
+
+Usage :
+
+ umask [mode]
+
+alias
+-----
+
+Define or display aliases
+
+Usage:
+
+ alias [name] | [<name> <value>]
+
+* name - name of the alias being looked up, added, or replaced
+* value - what the alias will be resolved to (if adding or replacing); this can contain spaces and does not need to be quoted
+
+pyscript
+--------
+
+Runs a python script file inside the console
+
+Usage:
+
+ pyscript <script_path> [script_arguments]
+
+* Console commands can be executed inside this script with cmd("your command").
+ However, you cannot run nested "py" or "pyscript" commands from within this script.
+ Paths or arguments that contain spaces must be enclosed in quotes.
+
+py
+--
+
+Invoke python command, shell, or script
+
+Usage :
+
+ py <command>: Executes a Python command.
+ py: Enters interactive Python mode.
+
+shortcuts
+---------
+
+Lists shortcuts (aliases) available
+
+history
+-------
+
+View, run, edit, and save previously entered commands.
+
+Usage :
+
+ history [-h] [-r | -e | -s | -o FILE | -t TRANSCRIPT] [arg]
+
+Options:
+ -h show this help message and exit
+ -r run selected history items
+ -e edit and then run selected history items
+ -s script format; no separation lines
+ -o FILE output commands to a script file
+ -t TRANSCRIPT output commands and results to a transcript file
+
+unalias
+-------
+
+Unsets aliases
+
+Usage :
+
+ unalias [-a] name [name ...]
+
+* name - name of the alias being unset
+
+Options:
+ -a remove all alias definitions
+
+set
+---
+
+Sets a settable parameter or shows current settings of parameters.
+
+Usage :
+
+ set [-h] [-a] [-l] [settable [settable ...]]
+
+* Call without arguments for a list of settable parameters with their values.
+
+ Options :
+ -h show this help message and exit
+ -a display read-only settings as well
+ -l describe function of parameter
+
+edit
+----
+
+Edit a file in a text editor.
+
+Usage:
+
+ edit [file_path]
+
+* file_path - path to a file to open in editor
+
+load
+----
+
+Runs commands in a script file that is encoded as either ASCII or UTF-8 text.
+
+Usage:
+
+ load <file_path>
+
+* file_path - a file path pointing to a script
+
+* The script should contain one command per line, just as the command would be typed in the console.
+
+shell
+-----
+
+Execute a command as if at the OS prompt.
+
+Usage:
+
+ shell <command> [arguments]
+
+locate
+------
+
+Find an item in the filesystem.
+
+Usage:
+ locate [options] <name>
+
+Options :
+ -c Count number of items found
+ -i Ignore case
+
diff --git a/doc/cephfs/client-auth.rst b/doc/cephfs/client-auth.rst
new file mode 100644
index 00000000..12876194
--- /dev/null
+++ b/doc/cephfs/client-auth.rst
@@ -0,0 +1,148 @@
+================================
+CephFS Client Capabilities
+================================
+
+Use Ceph authentication capabilities to restrict your filesystem clients
+to the lowest possible level of authority needed.
+
+.. note::
+
+ Path restriction and layout modification restriction are new features
+ in the Jewel release of Ceph.
+
+Path restriction
+================
+
+By default, clients are not restricted in what paths they are allowed to mount.
+Further, when clients mount a subdirectory, e.g., /home/user, the MDS does not
+by default verify that subsequent operations
+are ‘locked’ within that directory.
+
+To restrict clients to only mount and work within a certain directory, use
+path-based MDS authentication capabilities.
+
+Syntax
+------
+
+To grant rw access to the specified directory only, we mention the specified
+directory while creating the key for a client, using the following syntax. ::
+
+ ceph fs authorize *filesystem_name* client.*client_name* /*specified_directory* rw
+
+For example, to restrict client ``foo`` to writing only in the ``bar`` directory of filesystem ``cephfs_a``, use ::
+
+ ceph fs authorize cephfs_a client.foo / r /bar rw
+
+ results in:
+
+ client.foo
+ key: *key*
+ caps: [mds] allow r, allow rw path=/bar
+ caps: [mon] allow r
+ caps: [osd] allow rw tag cephfs data=cephfs_a
+
+To completely restrict the client to the ``bar`` directory, omit the
+root directory ::
+
+ ceph fs authorize cephfs_a client.foo /bar rw
+
+Note that if a client's read access is restricted to a path, they will only
+be able to mount the filesystem when specifying a readable path in the
+mount command (see below).
+
+Supplying ``all`` or ``*`` as the filesystem name will grant access to every
+file system. Note that it is usually necessary to quote ``*`` to protect it from
+the shell.
+
+See `User Management - Add a User to a Keyring`_. for additional details on user management
+
+To restrict a client to the specified sub-directory only, we mention the specified
+directory while mounting using the following syntax. ::
+
+ ./ceph-fuse -n client.*client_name* *mount_path* -r *directory_to_be_mounted*
+
+For example, to restrict client ``foo`` to the ``bar`` directory beneath the mount point ``mnt``, we would use. ::
+
+ ./ceph-fuse -n client.foo mnt -r /bar
+
+Free space reporting
+--------------------
+
+By default, when a client is mounting a sub-directory, the used space (``df``)
+will be calculated from the quota on that sub-directory, rather than reporting
+the overall amount of space used on the cluster.
+
+If you would like the client to report the overall usage of the filesystem,
+and not just the quota usage on the sub-directory mounted, then set the
+following config option on the client:
+
+::
+
+ client quota df = false
+
+If quotas are not enabled, or no quota is set on the sub-directory mounted,
+then the overall usage of the filesystem will be reported irrespective of
+the value of this setting.
+
+Layout and Quota restriction (the 'p' flag)
+===========================================
+
+To set layouts or quotas, clients require the 'p' flag in addition to 'rw'.
+This restricts all the attributes that are set by special extended attributes
+with a "ceph." prefix, as well as restricting other means of setting
+these fields (such as openc operations with layouts).
+
+For example, in the following snippet client.0 can modify layouts and quotas
+on the filesystem cephfs_a, but client.1 cannot.
+
+::
+
+ client.0
+ key: AQAz7EVWygILFRAAdIcuJ12opU/JKyfFmxhuaw==
+ caps: [mds] allow rwp
+ caps: [mon] allow r
+ caps: [osd] allow rw tag cephfs data=cephfs_a
+
+ client.1
+ key: AQAz7EVWygILFRAAdIcuJ12opU/JKyfFmxhuaw==
+ caps: [mds] allow rw
+ caps: [mon] allow r
+ caps: [osd] allow rw tag cephfs data=cephfs_a
+
+
+Snapshot restriction (the 's' flag)
+===========================================
+
+To create or delete snapshots, clients require the 's' flag in addition to 'rw'.
+Note that when the capability string also contains the 'p' flag, the 's' flag must
+appear after it (all flags except 'rw' must be specified in alphabetical order).
+
+For example, in the following snippet client.0 can create or delete snapshots
+in the ``bar`` directory of filesystem ``cephfs_a``.
+
+::
+
+ client.0
+ key: AQAz7EVWygILFRAAdIcuJ12opU/JKyfFmxhuaw==
+ caps: [mds] allow rw, allow rws path=/bar
+ caps: [mon] allow r
+ caps: [osd] allow rw tag cephfs data=cephfs_a
+
+
+.. _User Management - Add a User to a Keyring: ../../rados/operations/user-management/#add-a-user-to-a-keyring
+
+Network restriction
+===================
+
+For example, the following capabilities restrict a client to connecting from
+the 10.0.0.0/8 network only:
+
+::
+
+ client.foo
+ key: *key*
+ caps: [mds] allow r network 10.0.0.0/8, allow rw path=/bar network 10.0.0.0/8
+ caps: [mon] allow r network 10.0.0.0/8
+ caps: [osd] allow rw tag cephfs data=cephfs_a network 10.0.0.0/8
+
+The optional ``{network/prefix}`` is a standard network name and
+prefix length in CIDR notation (e.g., ``10.3.0.0/16``). If present,
+the use of this capability is restricted to clients connecting from
+this network.
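+
+One way to apply capabilities like those shown above is with ``ceph auth caps``;
+the client name, path, pool tag and network below are taken from the example
+and should be adapted to your cluster. ::
+
+    ceph auth caps client.foo \
+        mds 'allow r network 10.0.0.0/8, allow rw path=/bar network 10.0.0.0/8' \
+        mon 'allow r network 10.0.0.0/8' \
+        osd 'allow rw tag cephfs data=cephfs_a network 10.0.0.0/8'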
diff --git a/doc/cephfs/client-config-ref.rst b/doc/cephfs/client-config-ref.rst
new file mode 100644
index 00000000..93175644
--- /dev/null
+++ b/doc/cephfs/client-config-ref.rst
@@ -0,0 +1,220 @@
+========================
+ Client Config Reference
+========================
+
+``client acl type``
+
+:Description: Set the ACL type. Currently, the only possible value is ``"posix_acl"`` to enable POSIX ACL, or an empty string. This option only takes effect when ``fuse_default_permissions`` is set to ``false``.
+
+:Type: String
+:Default: ``""`` (no ACL enforcement)
+
+``client cache mid``
+
+:Description: Set client cache midpoint. The midpoint splits the least recently used lists into a hot and warm list.
+:Type: Float
+:Default: ``0.75``
+
+``client cache size``
+
+:Description: Set the number of inodes that the client keeps in the metadata cache.
+:Type: Integer
+:Default: ``16384``
+
+``client caps release delay``
+
+:Description: Set the delay between capability releases in seconds. The delay sets how many seconds a client waits to release capabilities that it no longer needs in case the capabilities are needed for another user space operation.
+:Type: Integer
+:Default: ``5`` (seconds)
+
+``client debug force sync read``
+
+:Description: If set to ``true``, clients read data directly from OSDs instead of using a local page cache.
+:Type: Boolean
+:Default: ``false``
+
+``client dirsize rbytes``
+
+:Description: If set to ``true``, use the recursive size of a directory (that is, total of all descendants).
+:Type: Boolean
+:Default: ``true``
+
+``client max inline size``
+
+:Description: Set the maximum size of inlined data stored in a file inode rather than in a separate data object in RADOS. This setting only applies if the ``inline_data`` flag is set on the MDS map.
+:Type: Integer
+:Default: ``4096``
+
+``client metadata``
+
+:Description: Comma-delimited strings for client metadata sent to each MDS, in addition to the automatically generated version, host name, and other metadata.
+:Type: String
+:Default: ``""`` (no additional metadata)
+
+``client mount gid``
+
+:Description: Set the group ID of CephFS mount.
+:Type: Integer
+:Default: ``-1``
+
+``client mount timeout``
+
+:Description: Set the timeout for CephFS mount in seconds.
+:Type: Float
+:Default: ``300.0``
+
+``client mount uid``
+
+:Description: Set the user ID of CephFS mount.
+:Type: Integer
+:Default: ``-1``
+
+``client mountpoint``
+
+:Description: Directory to mount on the CephFS file system. An alternative to the ``-r`` option of the ``ceph-fuse`` command.
+:Type: String
+:Default: ``"/"``
+
+``client oc``
+
+:Description: Enable object caching.
+:Type: Boolean
+:Default: ``true``
+
+``client oc max dirty``
+
+:Description: Set the maximum number of dirty bytes in the object cache.
+:Type: Integer
+:Default: ``104857600`` (100MB)
+
+``client oc max dirty age``
+
+:Description: Set the maximum age in seconds of dirty data in the object cache before writeback.
+:Type: Float
+:Default: ``5.0`` (seconds)
+
+``client oc max objects``
+
+:Description: Set the maximum number of objects in the object cache.
+:Type: Integer
+:Default: ``1000``
+
+``client oc size``
+
+:Description: Set how many bytes of data the client will cache.
+:Type: Integer
+:Default: ``209715200`` (200 MB)
+
+``client oc target dirty``
+
+:Description: Set the target size of dirty data. We recommend keeping this number low.
+:Type: Integer
+:Default: ``8388608`` (8MB)
+
+``client permissions``
+
+:Description: Check client permissions on all I/O operations.
+:Type: Boolean
+:Default: ``true``
+
+``client quota``
+
+:Description: Enable client quota checking if set to ``true``.
+:Type: Boolean
+:Default: ``true``
+
+``client quota df``
+
+:Description: Report root directory quota for the ``statfs`` operation.
+:Type: Boolean
+:Default: ``true``
+
+``client readahead max bytes``
+
+:Description: Set the maximum number of bytes that the client reads ahead for future read operations. Overridden by the ``client_readahead_max_periods`` setting.
+:Type: Integer
+:Default: ``0`` (unlimited)
+
+``client readahead max periods``
+
+:Description: Set the number of file layout periods (object size * number of stripes) that the client reads ahead. Overrides the ``client_readahead_max_bytes`` setting.
+:Type: Integer
+:Default: ``4``
+
+``client readahead min``
+
+:Description: Set the minimum number of bytes that the client reads ahead.
+:Type: Integer
+:Default: ``131072`` (128KB)
+
+``client reconnect stale``
+
+:Description: Automatically reconnect stale session.
+:Type: Boolean
+:Default: ``false``
+
+``client snapdir``
+
+:Description: Set the snapshot directory name.
+:Type: String
+:Default: ``".snap"``
+
+``client tick interval``
+
+:Description: Set the interval in seconds between capability renewal and other upkeep.
+:Type: Float
+:Default: ``1.0`` (seconds)
+
+``client use random mds``
+
+:Description: Choose random MDS for each request.
+:Type: Boolean
+:Default: ``false``
+
+``fuse default permissions``
+
+:Description: When set to ``false``, the ``ceph-fuse`` utility does its own permission checking instead of relying on the permission enforcement in FUSE. Set to ``false`` together with the ``client acl type=posix_acl`` option to enable POSIX ACL.
+:Type: Boolean
+:Default: ``true``
+
+``fuse max write``
+
+:Description: Set the maximum number of bytes in a single write operation. Because the FUSE default is 128 kbytes, ``fuse_max_write`` defaults to ``0``, which leaves that FUSE default in effect.
+:Type: Integer
+:Default: ``0``
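+
+As an illustration, these options are typically placed in the ``[client]``
+section of ``ceph.conf``; the values below are examples only, not
+recommendations::
+
+    [client]
+    client cache size = 32768
+    client oc size = 419430400
+    client acl type = posix_acl
+    fuse default permissions = false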
+
+Developer Options
+#################
+
+.. important:: These options are internal. They are listed here only to complete the list of options.
+
+``client debug getattr caps``
+
+:Description: Check if the reply from the MDS contains required capabilities.
+:Type: Boolean
+:Default: ``false``
+
+``client debug inject tick delay``
+
+:Description: Add artificial delay between client ticks.
+:Type: Integer
+:Default: ``0``
+
+``client inject fixed oldest tid``
+
+:Description:
+:Type: Boolean
+:Default: ``false``
+
+``client inject release failure``
+
+:Description:
+:Type: Boolean
+:Default: ``false``
+
+``client trace``
+
+:Description: The path to the trace file for all file operations. The output is designed to be used by the Ceph `synthetic client <../../man/8/ceph-syn>`_.
+:Type: String
+:Default: ``""`` (disabled)
+
diff --git a/doc/cephfs/createfs.rst b/doc/cephfs/createfs.rst
new file mode 100644
index 00000000..cad3aa1b
--- /dev/null
+++ b/doc/cephfs/createfs.rst
@@ -0,0 +1,92 @@
+========================
+Create a Ceph filesystem
+========================
+
+Creating pools
+==============
+
+A Ceph filesystem requires at least two RADOS pools, one for data and one for metadata.
+When configuring these pools, you might consider:
+
+- Using a higher replication level for the metadata pool, as any data loss in
+ this pool can render the whole filesystem inaccessible.
+- Using lower-latency storage such as SSDs for the metadata pool, as this will
+ directly affect the observed latency of filesystem operations on clients.
+- The data pool used to create the file system is the "default" data pool and
+ the location for storing all inode backtrace information, used for hard link
+ management and disaster recovery. For this reason, all inodes created in
+ CephFS have at least one object in the default data pool. If erasure-coded
+ pools are planned for the file system, it is usually better to use a
+ replicated pool for the default data pool to improve small-object write and
+ read performance for updating backtraces. Separately, another erasure-coded
+ data pool can be added (see also :ref:`ecpool`) that can be used on an entire
+ hierarchy of directories and files (see also :ref:`file-layouts`).
+
+Refer to :doc:`/rados/operations/pools` to learn more about managing pools. For
+example, to create two pools with default settings for use with a filesystem, you
+might run the following commands:
+
+.. code:: bash
+
+ $ ceph osd pool create cephfs_data <pg_num>
+ $ ceph osd pool create cephfs_metadata <pg_num>
+
+Generally, the metadata pool will have at most a few gigabytes of data. For
+this reason, a smaller PG count is usually recommended. 64 or 128 is commonly
+used in practice for large clusters.
+
+.. note:: The names of the file systems, metadata pools, and data pools can
+ only have characters in the set [a-zA-Z0-9\_-.].
+
+Creating a filesystem
+=====================
+
+Once the pools are created, you may enable the filesystem using the ``fs new`` command:
+
+.. code:: bash
+
+ $ ceph fs new <fs_name> <metadata> <data>
+
+For example:
+
+.. code:: bash
+
+ $ ceph fs new cephfs cephfs_metadata cephfs_data
+ $ ceph fs ls
+ name: cephfs, metadata pool: cephfs_metadata, data pools: [cephfs_data ]
+
+Once a filesystem has been created, your MDS(s) will be able to enter
+an *active* state. For example, in a single MDS system:
+
+.. code:: bash
+
+ $ ceph mds stat
+ cephfs-1/1/1 up {0=a=up:active}
+
+Once the filesystem is created and the MDS is active, you are ready to mount
+the filesystem. If you have created more than one filesystem, you will
+choose which to use when mounting.
+
+ - `Mount CephFS`_
+ - `Mount CephFS as FUSE`_
+
+.. _Mount CephFS: ../../cephfs/kernel
+.. _Mount CephFS as FUSE: ../../cephfs/fuse
+
+If you have created more than one filesystem, and a client does not
+specify a filesystem when mounting, you can control which filesystem
+they will see by using the `ceph fs set-default` command.
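+
+For example, to make the ``cephfs`` filesystem created above the default for
+clients that do not specify one:
+
+.. code:: bash
+
+    $ ceph fs set-default cephfs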
+
+Using Erasure Coded pools with CephFS
+=====================================
+
+You may use Erasure Coded pools as CephFS data pools as long as they have overwrites enabled, which is done as follows:
+
+.. code:: bash
+
+ ceph osd pool set my_ec_pool allow_ec_overwrites true
+
+Note that EC overwrites are only supported when using OSDS with the BlueStore backend.
+
+You may not use Erasure Coded pools as CephFS metadata pools, because CephFS metadata is stored using RADOS *OMAP* data structures, which EC pools cannot store.
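+
+As a sketch of the overall workflow (the pool name ``cephfs_data_ec``, the PG
+count and the use of the default erasure code profile are illustrative
+assumptions), an erasure-coded pool might be created, have overwrites enabled,
+and then be attached to the filesystem as an additional data pool:
+
+.. code:: bash
+
+    $ ceph osd pool create cephfs_data_ec 128 128 erasure
+    $ ceph osd pool set cephfs_data_ec allow_ec_overwrites true
+    $ ceph fs add_data_pool cephfs cephfs_data_ec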
+
diff --git a/doc/cephfs/dirfrags.rst b/doc/cephfs/dirfrags.rst
new file mode 100644
index 00000000..7f421307
--- /dev/null
+++ b/doc/cephfs/dirfrags.rst
@@ -0,0 +1,95 @@
+
+===================================
+Configuring Directory fragmentation
+===================================
+
+In CephFS, directories are *fragmented* when they become very large
+or very busy. This splits up the metadata so that it can be shared
+between multiple MDS daemons, and between multiple objects in the
+metadata pool.
+
+In normal operation, directory fragmentation is invisible to
+users and administrators, and all the configuration settings mentioned
+here should be left at their default values.
+
+While directory fragmentation enables CephFS to handle very large
+numbers of entries in a single directory, application programmers should
+remain conservative about creating very large directories, as they still
+have a resource cost in situations such as a CephFS client listing
+the directory, where all the fragments must be loaded at once.
+
+All directories are initially created as a single fragment. This fragment
+may be *split* to divide up the directory into more fragments, and these
+fragments may be *merged* to reduce the number of fragments in the directory.
+
+Splitting and merging
+=====================
+
+When an MDS identifies a directory fragment to be split, it does not
+do the split immediately. Because splitting interrupts metadata IO,
+a short delay is used to allow short bursts of client IO to complete
+before the split begins. This delay is configured with
+``mds_bal_fragment_interval``, which defaults to 5 seconds.
+
+When the split is done, the directory fragment is broken up into
+a power of two number of new fragments. The number of new
+fragments is given by two to the power ``mds_bal_split_bits``, i.e.
+if ``mds_bal_split_bits`` is 2, then four new fragments will be
+created. The default setting is 3, i.e. splits create 8 new fragments.
+
+The criteria for initiating a split or a merge are described in the
+following sections.
+
+Size thresholds
+===============
+
+A directory fragment is eligible for splitting when its size exceeds
+``mds_bal_split_size`` (default 10000). Ordinarily this split is
+delayed by ``mds_bal_fragment_interval``, but if the fragment size
+exceeds ``mds_bal_fragment_fast_factor`` times the split size,
+the split will happen immediately (holding up any client metadata
+IO on the directory).
+
+``mds_bal_fragment_size_max`` is the hard limit on the size of
+directory fragments. If it is reached, clients will receive
+ENOSPC errors if they try to create files in the fragment. On
+a properly configured system, this limit should never be reached on
+ordinary directories, as they will have split long before. By default,
+this is set to 10 times the split size, giving a dirfrag size limit of
+100000. Increasing this limit may lead to oversized directory fragment
+objects in the metadata pool, which the OSDs may not be able to handle.
+
+A directory fragment is eligible for merging when its size is less
+than ``mds_bal_merge_size``. There is no merge equivalent of the
+"fast splitting" explained above: fast splitting exists to avoid
+creating oversized directory fragments, there is no equivalent issue
+to avoid when merging. The default merge size is 50.
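+
+All of the thresholds above are ordinary configuration options. If you need to
+inspect the values in effect on a running MDS, they can be queried through the
+admin socket on the MDS host; the daemon name ``mds.a`` below is an assumption
+for illustration::
+
+    ceph daemon mds.a config get mds_bal_split_size
+    ceph daemon mds.a config get mds_bal_merge_size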
+
+Activity thresholds
+===================
+
+In addition to splitting fragments based
+on their size, the MDS may split directory fragments if their
+activity exceeds a threshold.
+
+The MDS maintains separate time-decaying load counters for read and write
+operations on directory fragments. The decaying load counters have an
+exponential decay based on the ``mds_decay_halflife`` setting.
+
+On writes, the write counter is
+incremented, and compared with ``mds_bal_split_wr``, triggering a
+split if the threshold is exceeded. Write operations include metadata IO
+such as renames, unlinks and creations.
+
+The ``mds_bal_split_rd`` threshold is applied based on the read operation
+load counter, which tracks readdir operations.
+
+By default, the read threshold is 25000 and the write threshold is
+10000, i.e. 2.5x as many reads as writes would be required to trigger
+a split.
+
+After fragments are split due to the activity thresholds, they are only
+merged based on the size threshold (``mds_bal_merge_size``), so
+a spike in activity may cause a directory to stay fragmented
+forever unless some entries are unlinked.
+
diff --git a/doc/cephfs/disaster-recovery-experts.rst b/doc/cephfs/disaster-recovery-experts.rst
new file mode 100644
index 00000000..75c03f03
--- /dev/null
+++ b/doc/cephfs/disaster-recovery-experts.rst
@@ -0,0 +1,254 @@
+
+.. _disaster-recovery-experts:
+
+Advanced: Metadata repair tools
+===============================
+
+.. warning::
+
+ If you do not have expert knowledge of CephFS internals, you will
+ need to seek assistance before using any of these tools.
+
+ The tools mentioned here can easily cause damage as well as fixing it.
+
+ It is essential to understand exactly what has gone wrong with your
+ filesystem before attempting to repair it.
+
+ If you do not have access to professional support for your cluster,
+ consult the ceph-users mailing list or the #ceph IRC channel.
+
+
+Journal export
+--------------
+
+Before attempting dangerous operations, make a copy of the journal like so:
+
+::
+
+ cephfs-journal-tool journal export backup.bin
+
+Note that this command may not always work if the journal is badly corrupted,
+in which case a RADOS-level copy should be made (http://tracker.ceph.com/issues/9902).
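+
+A minimal sketch of such a RADOS-level copy is shown below. It assumes the
+metadata pool is named ``cephfs_metadata`` and that you are backing up rank 0,
+whose journal objects carry the ``200.`` prefix; adjust both to match your
+cluster.
+
+::
+
+    # Copy each of rank 0's journal objects out of the metadata pool
+    rados -p cephfs_metadata ls | grep '^200\.' | while read obj; do
+        rados -p cephfs_metadata get "$obj" "journal-backup.$obj"
+    done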
+
+
+Dentry recovery from journal
+----------------------------
+
+If a journal is damaged or for any reason an MDS is incapable of replaying it,
+attempt to recover what file metadata we can like so:
+
+::
+
+ cephfs-journal-tool event recover_dentries summary
+
+By default, this command acts on MDS rank 0; pass ``--rank=<n>`` to operate on other ranks.
+
+This command will write any inodes/dentries recoverable from the journal
+into the backing store, if these inodes/dentries are higher-versioned
+than the previous contents of the backing store. If any regions of the journal
+are missing/damaged, they will be skipped.
+
+Note that in addition to writing out dentries and inodes, this command will update
+the InoTables of each 'in' MDS rank, to indicate that any written inodes' numbers
+are now in use. In simple cases, this will result in an entirely valid backing
+store state.
+
+.. warning::
+
+ The resulting state of the backing store is not guaranteed to be self-consistent,
+ and an online MDS scrub will be required afterwards. The journal contents
+ will not be modified by this command, you should truncate the journal
+ separately after recovering what you can.
+
+Journal truncation
+------------------
+
+If the journal is corrupt or MDSs cannot replay it for any reason, you can
+truncate it like so:
+
+::
+
+ cephfs-journal-tool journal reset
+
+.. warning::
+
+ Resetting the journal *will* lose metadata unless you have extracted
+ it by other means such as ``recover_dentries``. It is likely to leave
+ some orphaned objects in the data pool. It may result in re-allocation
+ of already-written inodes, such that permissions rules could be violated.
+
+MDS table wipes
+---------------
+
+After the journal has been reset, it may no longer be consistent with respect
+to the contents of the MDS tables (InoTable, SessionMap, SnapServer).
+
+To reset the SessionMap (erase all sessions), use:
+
+::
+
+ cephfs-table-tool all reset session
+
+This command acts on the tables of all 'in' MDS ranks. Replace 'all' with an MDS
+rank to operate on that rank only.
+
+The session table is the table most likely to need resetting, but if you know you
+also need to reset the other tables then replace 'session' with 'snap' or 'inode'.
+
+MDS map reset
+-------------
+
+Once the in-RADOS state of the filesystem (i.e. contents of the metadata pool)
+is somewhat recovered, it may be necessary to update the MDS map to reflect
+the contents of the metadata pool. Use the following command to reset the MDS
+map to a single MDS:
+
+::
+
+ ceph fs reset <fs name> --yes-i-really-mean-it
+
+Once this is run, any in-RADOS state for MDS ranks other than 0 will be ignored:
+as a result it is possible for this to result in data loss.
+
+One might wonder what the difference is between 'fs reset' and 'fs remove; fs new'. The
+key distinction is that doing a remove/new will leave rank 0 in 'creating' state, such
+that it would overwrite any existing root inode on disk and orphan any existing files. In
+contrast, the 'reset' command will leave rank 0 in 'active' state such that the next MDS
+daemon to claim the rank will go ahead and use the existing in-RADOS metadata.
+
+Recovery from missing metadata objects
+--------------------------------------
+
+Depending on what objects are missing or corrupt, you may need to
+run various commands to regenerate default versions of the
+objects.
+
+::
+
+ # Session table
+ cephfs-table-tool 0 reset session
+ # SnapServer
+ cephfs-table-tool 0 reset snap
+ # InoTable
+ cephfs-table-tool 0 reset inode
+ # Journal
+ cephfs-journal-tool --rank=0 journal reset
+ # Root inodes ("/" and MDS directory)
+ cephfs-data-scan init
+
+Finally, you can regenerate metadata objects for missing files
+and directories based on the contents of a data pool. This is
+a three-phase process. First, scanning *all* objects to calculate
+size and mtime metadata for inodes. Second, scanning the first
+object from every file to collect this metadata and inject it into
+the metadata pool. Third, checking inode linkages and fixing found
+errors.
+
+::
+
+ cephfs-data-scan scan_extents <data pool>
+ cephfs-data-scan scan_inodes <data pool>
+ cephfs-data-scan scan_links
+
+'scan_extents' and 'scan_inodes' commands may take a *very long* time
+if there are many files or very large files in the data pool.
+
+To accelerate the process, run multiple instances of the tool.
+
+Decide on a number of workers, and pass each worker a number within
+the range 0-(worker_m - 1).
+
+The example below shows how to run 4 workers simultaneously:
+
+::
+
+ # Worker 0
+ cephfs-data-scan scan_extents --worker_n 0 --worker_m 4 <data pool>
+ # Worker 1
+ cephfs-data-scan scan_extents --worker_n 1 --worker_m 4 <data pool>
+ # Worker 2
+ cephfs-data-scan scan_extents --worker_n 2 --worker_m 4 <data pool>
+ # Worker 3
+ cephfs-data-scan scan_extents --worker_n 3 --worker_m 4 <data pool>
+
+ # Worker 0
+ cephfs-data-scan scan_inodes --worker_n 0 --worker_m 4 <data pool>
+ # Worker 1
+ cephfs-data-scan scan_inodes --worker_n 1 --worker_m 4 <data pool>
+ # Worker 2
+ cephfs-data-scan scan_inodes --worker_n 2 --worker_m 4 <data pool>
+ # Worker 3
+ cephfs-data-scan scan_inodes --worker_n 3 --worker_m 4 <data pool>
+
+It is **important** to ensure that all workers have completed the
+scan_extents phase before any workers enter the scan_inodes phase.
+
+After completing the metadata recovery, you may want to run a cleanup
+operation to delete the ancillary data generated during recovery.
+
+::
+
+ cephfs-data-scan cleanup <data pool>
+
+
+
+Using an alternate metadata pool for recovery
+---------------------------------------------
+
+.. warning::
+
+ There has not been extensive testing of this procedure. It should be
+ undertaken with great care.
+
+If an existing filesystem is damaged and inoperative, it is possible to create
+a fresh metadata pool and attempt to reconstruct the filesystem metadata
+into this new pool, leaving the old metadata in place. This could be used to
+make a safer attempt at recovery since the existing metadata pool would not be
+overwritten.
+
+.. caution::
+
+ During this process, multiple metadata pools will contain data referring to
+ the same data pool. Extreme caution must be exercised to avoid changing the
+ data pool contents while this is the case. Once recovery is complete, the
+ damaged metadata pool should be deleted.
+
+To begin this process, first create the fresh metadata pool and initialize
+it with empty file system data structures:
+
+::
+
+ ceph fs flag set enable_multiple true --yes-i-really-mean-it
+ ceph osd pool create recovery <pg-num> replicated <crush-rule-name>
+ ceph fs new recovery-fs recovery <data pool> --allow-dangerous-metadata-overlay
+ cephfs-data-scan init --force-init --filesystem recovery-fs --alternate-pool recovery
+ ceph fs reset recovery-fs --yes-i-really-mean-it
+ cephfs-table-tool recovery-fs:all reset session
+ cephfs-table-tool recovery-fs:all reset snap
+ cephfs-table-tool recovery-fs:all reset inode
+
+Next, run the recovery toolset using the --alternate-pool argument to output
+results to the alternate pool:
+
+::
+
+ cephfs-data-scan scan_extents --alternate-pool recovery --filesystem <original filesystem name> <original data pool name>
+ cephfs-data-scan scan_inodes --alternate-pool recovery --filesystem <original filesystem name> --force-corrupt --force-init <original data pool name>
+ cephfs-data-scan scan_links --filesystem recovery-fs
+
+If the damaged filesystem contains dirty journal data, it may be recovered next
+with:
+
+::
+
+ cephfs-journal-tool --rank=<original filesystem name>:0 event recover_dentries list --alternate-pool recovery
+ cephfs-journal-tool --rank recovery-fs:0 journal reset --force
+
+After recovery, some recovered directories will have incorrect statistics.
+Ensure the parameters mds_verify_scatter and mds_debug_scatterstat are set
+to false (the default) to prevent the MDS from checking the statistics, then
+run a forward scrub to repair them. Ensure you have an MDS running and issue:
+
+::
+
+ ceph tell mds.a scrub start / recursive repair
diff --git a/doc/cephfs/disaster-recovery.rst b/doc/cephfs/disaster-recovery.rst
new file mode 100644
index 00000000..71344e90
--- /dev/null
+++ b/doc/cephfs/disaster-recovery.rst
@@ -0,0 +1,61 @@
+.. _cephfs-disaster-recovery:
+
+Disaster recovery
+=================
+
+Metadata damage and repair
+--------------------------
+
+If a filesystem has inconsistent or missing metadata, it is considered
+*damaged*. You may find out about damage from a health message, or in some
+unfortunate cases from an assertion in a running MDS daemon.
+
+Metadata damage can result either from data loss in the underlying RADOS
+layer (e.g. multiple disk failures that lose all copies of a PG), or from
+software bugs.
+
+CephFS includes some tools that may be able to recover a damaged filesystem,
+but to use them safely requires a solid understanding of CephFS internals.
+The documentation for these potentially dangerous operations is on a
+separate page: :ref:`disaster-recovery-experts`.
+
+Data pool damage (files affected by lost data PGs)
+--------------------------------------------------
+
+If a PG is lost in a *data* pool, then the filesystem will continue
+to operate normally, but some parts of some files will simply
+be missing (reads will return zeros).
+
+Losing a data PG may affect many files. Files are split into many objects,
+so identifying which files are affected by loss of particular PGs requires
+a full scan over all object IDs that may exist within the size of a file.
+This type of scan may be useful for identifying which files require
+restoring from a backup.
+
+.. danger::
+
+ This command does not repair any metadata, so when restoring files in
+ this case you must *remove* the damaged file, and replace it in order
+ to have a fresh inode. Do not overwrite damaged files in place.
+
+If you know that objects have been lost from PGs, use the ``pg_files``
+subcommand to scan for files that may have been damaged as a result:
+
+::
+
+ cephfs-data-scan pg_files <path> <pg id> [<pg id>...]
+
+For example, if you have lost data from PGs 1.4 and 4.5, and you would like
+to know which files under /home/bob might have been damaged:
+
+::
+
+ cephfs-data-scan pg_files /home/bob 1.4 4.5
+
+The output will be a list of paths to potentially damaged files, one
+per line.
+
+Note that this command acts as a normal CephFS client to find all the
+files in the filesystem and read their layouts, so the MDS must be
+up and running.
+
diff --git a/doc/cephfs/eviction.rst b/doc/cephfs/eviction.rst
new file mode 100644
index 00000000..c0d54f41
--- /dev/null
+++ b/doc/cephfs/eviction.rst
@@ -0,0 +1,190 @@
+
+===============================
+Ceph filesystem client eviction
+===============================
+
+When a filesystem client is unresponsive or otherwise misbehaving, it
+may be necessary to forcibly terminate its access to the filesystem. This
+process is called *eviction*.
+
+Evicting a CephFS client prevents it from communicating further with MDS
+daemons and OSD daemons. If a client was doing buffered IO to the filesystem,
+any un-flushed data will be lost.
+
+Clients may either be evicted automatically (if they fail to communicate
+promptly with the MDS), or manually (by the system administrator).
+
+The client eviction process applies to clients of all kinds: FUSE mounts,
+kernel mounts, nfs-ganesha gateways, and any process using libcephfs.
+
+Automatic client eviction
+=========================
+
+There are three situations in which a client may be evicted automatically.
+
+#. On an active MDS daemon, if a client has not communicated with the MDS for over
+ ``session_autoclose`` (a file system variable) seconds (300 seconds by
+ default), then it will be evicted automatically.
+
+#. On an active MDS daemon, if a client has not responded to cap revoke messages
+ for over ``mds_cap_revoke_eviction_timeout`` (configuration option) seconds.
+ This is disabled by default.
+
+#. During MDS startup (including on failover), the MDS passes through a
+ state called ``reconnect``. During this state, it waits for all the
+ clients to connect to the new MDS daemon. If any clients fail to do
+ so within the time window (``mds_reconnect_timeout``, 45 seconds by default)
+ then they will be evicted.
+
+A warning message is sent to the cluster log if any of these situations
+arises.
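+
+If necessary, these timeouts can be adjusted. The following sketch assumes a
+filesystem named ``cephfs`` and a desired timeout of 600 seconds for both
+settings::
+
+    # Adjust the session timeout for the filesystem
+    ceph fs set cephfs session_autoclose 600
+    # Evict clients that ignore cap revoke messages for 600 seconds
+    ceph config set mds mds_cap_revoke_eviction_timeout 600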
+
+Manual client eviction
+======================
+
+Sometimes, the administrator may want to evict a client manually. This
+could happen if a client has died and the administrator does not
+want to wait for its session to time out, or it could happen if
+a client is misbehaving and the administrator does not have access to
+the client node to unmount it.
+
+It is useful to inspect the list of clients first:
+
+::
+
+ ceph tell mds.0 client ls
+
+ [
+ {
+ "id": 4305,
+ "num_leases": 0,
+ "num_caps": 3,
+ "state": "open",
+ "replay_requests": 0,
+ "completed_requests": 0,
+ "reconnecting": false,
+ "inst": "client.4305 172.21.9.34:0/422650892",
+ "client_metadata": {
+ "ceph_sha1": "ae81e49d369875ac8b569ff3e3c456a31b8f3af5",
+ "ceph_version": "ceph version 12.0.0-1934-gae81e49 (ae81e49d369875ac8b569ff3e3c456a31b8f3af5)",
+ "entity_id": "0",
+ "hostname": "senta04",
+ "mount_point": "/tmp/tmpcMpF1b/mnt.0",
+ "pid": "29377",
+ "root": "/"
+ }
+ }
+ ]
+
+
+
+Once you have identified the client you want to evict, you can
+do that using its unique ID, or various other attributes to identify it:
+
+::
+
+ # These all work
+ ceph tell mds.0 client evict id=4305
+ ceph tell mds.0 client evict client_metadata.entity_id=0
+
+
+Advanced: Un-blacklisting a client
+==================================
+
+Ordinarily, a blacklisted client may not reconnect to the servers: it
+must be unmounted and then mounted anew.
+
+However, in some situations it may be useful to permit a client that
+was evicted to attempt to reconnect.
+
+Because CephFS uses the RADOS OSD blacklist to control client eviction,
+CephFS clients can be permitted to reconnect by removing them from
+the blacklist:
+
+::
+
+ $ ceph osd blacklist ls
+ listed 1 entries
+ 127.0.0.1:0/3710147553 2018-03-19 11:32:24.716146
+ $ ceph osd blacklist rm 127.0.0.1:0/3710147553
+ un-blacklisting 127.0.0.1:0/3710147553
+
+
+Doing this may put data integrity at risk if other clients have accessed
+files that the blacklisted client was doing buffered IO to. It is also not
+guaranteed to result in a fully functional client -- the best way to get
+a fully healthy client back after an eviction is to unmount the client
+and do a fresh mount.
+
+If you are trying to reconnect clients in this way, you may also
+find it useful to set ``client_reconnect_stale`` to true in the
+FUSE client, to prompt the client to try to reconnect.
+
+Advanced: Configuring blacklisting
+==================================
+
+If you are experiencing frequent client evictions, due to slow
+client hosts or an unreliable network, and you cannot fix the underlying
+issue, then you may want to ask the MDS to be less strict.
+
+It is possible to respond to slow clients by simply dropping their
+MDS sessions while still permitting them to re-open sessions and to
+continue talking to OSDs. To enable this mode, set
+``mds_session_blacklist_on_timeout`` to false on your MDS nodes.
+
+For the equivalent behaviour on manual evictions, set
+``mds_session_blacklist_on_evict`` to false.
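+
+For example, both settings can be applied cluster-wide through the centralized
+configuration (a sketch; consider carefully whether this relaxed behaviour is
+appropriate before applying it)::
+
+    ceph config set mds mds_session_blacklist_on_timeout false
+    ceph config set mds mds_session_blacklist_on_evict false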
+
+Note that if blacklisting is disabled, then evicting a client will
+only have an effect on the MDS you send the command to. On a system
+with multiple active MDS daemons, you would need to send an
+eviction command to each active daemon. When blacklisting is enabled
+(the default), sending an eviction command to just a single
+MDS is sufficient, because the blacklist propagates it to the others.
+
+.. _background_blacklisting_and_osd_epoch_barrier:
+
+Background: Blacklisting and OSD epoch barrier
+==============================================
+
+After a client is blacklisted, it is necessary to make sure that
+other clients and MDS daemons have the latest OSDMap (including
+the blacklist entry) before they try to access any data objects
+that the blacklisted client might have been accessing.
+
+This is ensured using an internal "osdmap epoch barrier" mechanism.
+
+The purpose of the barrier is to ensure that when we hand out any
+capabilities which might allow touching the same RADOS objects, the
+clients we hand out the capabilities to must have a sufficiently recent
+OSD map to not race with cancelled operations (from ENOSPC) or
+blacklisted clients (from evictions).
+
+More specifically, the cases where an epoch barrier is set are:
+
+ * Client eviction (where the client is blacklisted and other clients
+ must wait for a post-blacklist epoch to touch the same objects).
+ * OSD map full flag handling in the client (where the client may
+ cancel some OSD ops from a pre-full epoch, so other clients must
+ wait until the full epoch or later before touching the same objects).
+ * MDS startup, because we don't persist the barrier epoch, so must
+ assume that latest OSD map is always required after a restart.
+
+Note that this is a global value for simplicity. We could maintain this on
+a per-inode basis. But we don't, because:
+
+ * It would be more complicated.
+ * It would use an extra 4 bytes of memory for every inode.
+ * It would not be much more efficient as, almost always, everyone has
+ the latest OSD map. And, in most cases everyone will breeze through this
+ barrier rather than waiting.
+ * This barrier is done in very rare cases, so any benefit from per-inode
+ granularity would only very rarely be seen.
+
+The epoch barrier is transmitted along with all capability messages, and
+instructs the receiver of the message to avoid sending any more RADOS
+operations to OSDs until it has seen this OSD epoch. This mainly applies
+to clients (doing their data writes directly to files), but also applies
+to the MDS because things like file size probing and file deletion are
+done directly from the MDS.
diff --git a/doc/cephfs/experimental-features.rst b/doc/cephfs/experimental-features.rst
new file mode 100644
index 00000000..3344a34d
--- /dev/null
+++ b/doc/cephfs/experimental-features.rst
@@ -0,0 +1,111 @@
+
+Experimental Features
+=====================
+
+CephFS includes a number of experimental features which are not fully stabilized
+or qualified for users to turn on in real deployments. We generally do our best
+to clearly demarcate these and fence them off so they cannot be used by mistake.
+
+Some of these features are closer to being done than others, though. We describe
+each of them with an approximation of how risky they are and briefly describe
+what is required to enable them. Note that doing so will *irrevocably* flag maps
+in the monitor as having once enabled this flag to improve debugging and
+support processes.
+
+Inline data
+-----------
+By default, all CephFS file data is stored in RADOS objects. The inline data
+feature enables small files (generally <2KB) to be stored in the inode
+and served out of the MDS. This may improve small-file performance but increases
+load on the MDS. It is not sufficiently tested for us to support its use at this
+time, although failures within it are unlikely to make non-inlined data inaccessible.
+
+Inline data has always been off by default and requires setting
+the ``inline_data`` flag.
+
+Mantle: Programmable Metadata Load Balancer
+-------------------------------------------
+
+Mantle is a programmable metadata balancer built into the MDS. The idea is to
+protect the mechanisms for balancing load (migration, replication,
+fragmentation) but stub out the balancing policies using Lua. For details, see
+:doc:`/cephfs/mantle`.
+
+Snapshots
+---------
+Like multiple active MDSes, CephFS is designed from the ground up to support
+snapshotting of arbitrary directories. There are no known bugs at the time of
+writing, but there is insufficient testing to provide stability guarantees and
+every expansion of testing has generally revealed new issues. If you do enable
+snapshots and experience failure, manual intervention will be needed.
+
+Snapshots are known not to work properly with multiple filesystems (below) in
+some cases. Specifically, if you share a pool for multiple FSes and delete
+a snapshot in one FS, expect to lose snapshotted file data in any other FS using
+snapshots. See the :doc:`/dev/cephfs-snapshots` page for more information.
+
+For somewhat obscure implementation reasons, the kernel client only supports up
+to 400 snapshots (http://tracker.ceph.com/issues/21420).
+
+Snapshotting was blocked off with the ``allow_new_snaps`` flag prior to Mimic.
+
+Multiple filesystems within a Ceph cluster
+------------------------------------------
+Code was merged prior to the Jewel release which enables administrators
+to create multiple independent CephFS filesystems within a single Ceph cluster.
+These independent filesystems have their own set of active MDSes, cluster maps,
+and data. But the feature required extensive changes to data structures which
+are not yet fully qualified, and has security implications which are not all
+apparent nor resolved.
+
+There are no known bugs, but any failures which do result from having multiple
+active filesystems in your cluster will require manual intervention and, so far,
+will not have been experienced by anybody else -- knowledgeable help will be
+extremely limited. You also probably do not have the security or isolation
+guarantees you want or think you have upon doing so.
+
+Note that snapshots and multiple filesystems are *not* tested in combination
+and may not work together; see above.
+
+Multiple filesystems were available starting in the Jewel release candidates
+but must be turned on via the ``enable_multiple`` flag until declared stable.
+
+LazyIO
+------
+LazyIO relaxes POSIX semantics. Buffered reads/writes are allowed even when a
+file is opened by multiple applications on multiple clients. Applications are
+responsible for managing cache coherency themselves.
+
+Previously experimental features
+================================
+
+Directory Fragmentation
+-----------------------
+
+Directory fragmentation was considered experimental prior to the *Luminous*
+(12.2.x) release. It is now enabled by default on new filesystems. To enable directory
+fragmentation on filesystems created with older versions of Ceph, set
+the ``allow_dirfrags`` flag on the filesystem:
+
+::
+
+ ceph fs set <filesystem name> allow_dirfrags 1
+
+Multiple active metadata servers
+--------------------------------
+
+Prior to the *Luminous* (12.2.x) release, running multiple active metadata
+servers within a single filesystem was considered experimental. Creating
+multiple active metadata servers is now permitted by default on new
+filesystems.
+
+Filesystems created with older versions of Ceph still require explicitly
+enabling multiple active metadata servers as follows:
+
+::
+
+ ceph fs set <filesystem name> allow_multimds 1
+
+Note that the default size of the active mds cluster (``max_mds``) is
+still set to 1 initially.
+
diff --git a/doc/cephfs/file-layouts.rst b/doc/cephfs/file-layouts.rst
new file mode 100644
index 00000000..6ff834b0
--- /dev/null
+++ b/doc/cephfs/file-layouts.rst
@@ -0,0 +1,230 @@
+.. _file-layouts:
+
+File layouts
+============
+
+The layout of a file controls how its contents are mapped to Ceph RADOS objects. You can
+read and write a file's layout using *virtual extended attributes* or xattrs.
+
+The name of the layout xattrs depends on whether a file is a regular file or a directory. Regular
+files' layout xattrs are called ``ceph.file.layout``, whereas directories' layout xattrs are called
+``ceph.dir.layout``. Where subsequent examples refer to ``ceph.file.layout``, substitute ``dir`` as appropriate
+when dealing with directories.
+
+.. tip::
+
+ Your Linux distribution may not ship with the commands for manipulating xattrs by default;
+ the required package is usually called ``attr``.
+
+Layout fields
+-------------
+
+pool
+ String, giving the ID or name of the RADOS pool in which a file's data objects will be stored. The string can only contain characters in the set [a-zA-Z0-9\_-.].
+
+pool_namespace
+ String containing only characters in the set [a-zA-Z0-9\_-.], naming the RADOS namespace within the data pool that
+ the objects will be written to. Empty by default (i.e. the default namespace).
+
+stripe_unit
+ Integer in bytes. The size (in bytes) of a block of data used in the RAID 0 distribution of a file. All stripe units for a file have equal size. The last stripe unit is typically incomplete, i.e. it represents the data at the end of the file as well as the unused "space" beyond it, up to the end of the fixed stripe unit size.
+
+stripe_count
+ Integer. The number of consecutive stripe units that constitute a RAID 0 “stripe” of file data.
+
+object_size
+ Integer in bytes. File data is chunked into RADOS objects of this size.
+
+.. tip::
+
+ RADOS enforces a configurable limit on object sizes: if you increase CephFS
+ object sizes beyond that limit then writes may not succeed. The OSD
+ setting is ``osd_max_object_size``, which is 128MB by default.
+ Very large RADOS objects may prevent smooth operation of the cluster,
+ so increasing the object size limit past the default is not recommended.
+
+Reading layouts with ``getfattr``
+---------------------------------
+
+Read the layout information as a single string:
+
+.. code-block:: bash
+
+ $ touch file
+ $ getfattr -n ceph.file.layout file
+ # file: file
+ ceph.file.layout="stripe_unit=4194304 stripe_count=1 object_size=4194304 pool=cephfs_data"
+
+Read individual layout fields:
+
+.. code-block:: bash
+
+ $ getfattr -n ceph.file.layout.pool file
+ # file: file
+ ceph.file.layout.pool="cephfs_data"
+ $ getfattr -n ceph.file.layout.stripe_unit file
+ # file: file
+ ceph.file.layout.stripe_unit="4194304"
+ $ getfattr -n ceph.file.layout.stripe_count file
+ # file: file
+ ceph.file.layout.stripe_count="1"
+ $ getfattr -n ceph.file.layout.object_size file
+ # file: file
+ ceph.file.layout.object_size="4194304"
+
+.. note::
+
+ When reading layouts, the pool will usually be indicated by name. However, in
+ rare cases when pools have only just been created, the ID may be output instead.
+
+Directories do not have an explicit layout until it is customized. Attempts to read
+the layout will fail if it has never been modified: this indicates that the layout of the
+closest ancestor directory with an explicit layout will be used.
+
+.. code-block:: bash
+
+ $ mkdir dir
+ $ getfattr -n ceph.dir.layout dir
+ dir: ceph.dir.layout: No such attribute
+ $ setfattr -n ceph.dir.layout.stripe_count -v 2 dir
+ $ getfattr -n ceph.dir.layout dir
+ # file: dir
+ ceph.dir.layout="stripe_unit=4194304 stripe_count=2 object_size=4194304 pool=cephfs_data"
+
+
+Writing layouts with ``setfattr``
+---------------------------------
+
+Layout fields are modified using ``setfattr``:
+
+.. code-block:: bash
+
+ $ ceph osd lspools
+ 0 rbd
+ 1 cephfs_data
+ 2 cephfs_metadata
+
+ $ setfattr -n ceph.file.layout.stripe_unit -v 1048576 file2
+ $ setfattr -n ceph.file.layout.stripe_count -v 8 file2
+ $ setfattr -n ceph.file.layout.object_size -v 10485760 file2
+ $ setfattr -n ceph.file.layout.pool -v 1 file2 # Setting pool by ID
+ $ setfattr -n ceph.file.layout.pool -v cephfs_data file2 # Setting pool by name
+
+.. note::
+
+ When the layout fields of a file are modified using ``setfattr``, this file must be empty, otherwise an error will occur.
+
+.. code-block:: bash
+
+ # touch an empty file
+ $ touch file1
+ # modify layout field successfully
+ $ setfattr -n ceph.file.layout.stripe_count -v 3 file1
+
+ # write something to file1
+ $ echo "hello world" > file1
+ $ setfattr -n ceph.file.layout.stripe_count -v 4 file1
+ setfattr: file1: Directory not empty
+
+Clearing layouts
+----------------
+
+If you wish to remove an explicit layout from a directory, to revert to
+inheriting the layout of its ancestor, you can do so:
+
+.. code-block:: bash
+
+ setfattr -x ceph.dir.layout mydir
+
+Similarly, if you have set the ``pool_namespace`` attribute and wish
+to modify the layout to use the default namespace instead:
+
+.. code-block:: bash
+
+ # Create a dir and set a namespace on it
+ mkdir mydir
+ setfattr -n ceph.dir.layout.pool_namespace -v foons mydir
+ getfattr -n ceph.dir.layout mydir
+ ceph.dir.layout="stripe_unit=4194304 stripe_count=1 object_size=4194304 pool=cephfs_data_a pool_namespace=foons"
+
+ # Clear the namespace from the directory's layout
+ setfattr -x ceph.dir.layout.pool_namespace mydir
+ getfattr -n ceph.dir.layout mydir
+ ceph.dir.layout="stripe_unit=4194304 stripe_count=1 object_size=4194304 pool=cephfs_data_a"
+
+
+Inheritance of layouts
+----------------------
+
+Files inherit the layout of their parent directory at creation time. However, subsequent
+changes to the parent directory's layout do not affect children.
+
+.. code-block:: bash
+
+ $ getfattr -n ceph.dir.layout dir
+ # file: dir
+ ceph.dir.layout="stripe_unit=4194304 stripe_count=2 object_size=4194304 pool=cephfs_data"
+
+ # Demonstrate file1 inheriting its parent's layout
+ $ touch dir/file1
+ $ getfattr -n ceph.file.layout dir/file1
+ # file: dir/file1
+ ceph.file.layout="stripe_unit=4194304 stripe_count=2 object_size=4194304 pool=cephfs_data"
+
+ # Now update the layout of the directory before creating a second file
+ $ setfattr -n ceph.dir.layout.stripe_count -v 4 dir
+ $ touch dir/file2
+
+ # Demonstrate that file1's layout is unchanged
+ $ getfattr -n ceph.file.layout dir/file1
+ # file: dir/file1
+ ceph.file.layout="stripe_unit=4194304 stripe_count=2 object_size=4194304 pool=cephfs_data"
+
+ # ...while file2 has the parent directory's new layout
+ $ getfattr -n ceph.file.layout dir/file2
+ # file: dir/file2
+ ceph.file.layout="stripe_unit=4194304 stripe_count=4 object_size=4194304 pool=cephfs_data"
+
+
+Files created as descendants of the directory also inherit the layout, if the intermediate
+directories do not have layouts set:
+
+.. code-block:: bash
+
+ $ getfattr -n ceph.dir.layout dir
+ # file: dir
+ ceph.dir.layout="stripe_unit=4194304 stripe_count=4 object_size=4194304 pool=cephfs_data"
+ $ mkdir dir/childdir
+ $ getfattr -n ceph.dir.layout dir/childdir
+ dir/childdir: ceph.dir.layout: No such attribute
+ $ touch dir/childdir/grandchild
+ $ getfattr -n ceph.file.layout dir/childdir/grandchild
+ # file: dir/childdir/grandchild
+ ceph.file.layout="stripe_unit=4194304 stripe_count=4 object_size=4194304 pool=cephfs_data"
+
+
+Adding a data pool to the MDS
+-----------------------------
+
+Before you can use a pool with CephFS you have to add it to the Metadata Servers.
+
+.. code-block:: bash
+
+ $ ceph fs add_data_pool cephfs cephfs_data_ssd
+ $ ceph fs ls # Pool should now show up
+ .... data pools: [cephfs_data cephfs_data_ssd ]
+
+Make sure that your cephx keys allow the client to access this new pool.
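+
+For example, a client's capabilities might be updated to cover both data pools
+like so (the client name ``client.foo`` and the cap strings are illustrative
+assumptions; note that ``ceph auth caps`` replaces all of the client's existing
+capabilities, so include everything the client still needs):
+
+.. code-block:: bash
+
+    $ ceph auth caps client.foo mon 'allow r' mds 'allow rw' osd 'allow rw pool=cephfs_data, allow rw pool=cephfs_data_ssd'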
+
+You can then update the layout on a directory in CephFS to use the pool you added:
+
+.. code-block:: bash
+
+ $ mkdir /mnt/cephfs/myssddir
+ $ setfattr -n ceph.dir.layout.pool -v cephfs_data_ssd /mnt/cephfs/myssddir
+
+All new files created within that directory will now inherit its layout and place their data in your newly added pool.
+
+You may notice that object counts in your primary data pool (the one passed to ``fs new``) continue to increase, even if files are being created in the pool you added. This is normal: the file data is stored in the pool specified by the layout, but a small amount of metadata is kept in the primary data pool for all files.
+
+
diff --git a/doc/cephfs/fs-volumes.rst b/doc/cephfs/fs-volumes.rst
new file mode 100644
index 00000000..01510d02
--- /dev/null
+++ b/doc/cephfs/fs-volumes.rst
@@ -0,0 +1,369 @@
+.. _fs-volumes-and-subvolumes:
+
+FS volumes and subvolumes
+=========================
+
+A single source of truth for CephFS exports is implemented in the volumes
+module of the :term:`Ceph Manager` daemon (ceph-mgr). The OpenStack shared
+file system service (manila_), the Ceph Container Storage Interface (CSI_),
+and storage administrators, among others, can use the common CLI provided by
+the ceph-mgr volumes module to manage CephFS exports.
+
+The ceph-mgr volumes module implements the following file system export
+abstractions:
+
+* FS volumes, an abstraction for CephFS file systems
+
+* FS subvolumes, an abstraction for independent CephFS directory trees
+
+* FS subvolume groups, an abstraction for a directory level higher than FS
+ subvolumes to effect policies (e.g., :doc:`/cephfs/file-layouts`) across a
+ set of subvolumes
+
+Some possible use-cases for the export abstractions:
+
+* FS subvolumes used as manila shares or CSI volumes
+
+* FS subvolume groups used as manila share groups
+
+Requirements
+------------
+
+* Nautilus (14.2.x) or a later version of Ceph
+
+* Cephx client user (see :doc:`/rados/operations/user-management`) with
+ the following minimum capabilities (see the example below)::
+
+ mon 'allow r'
+ mgr 'allow rw'
+
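+Such a user could be created, for example, as follows (the client name
+``client.volumes`` is an assumption chosen for illustration)::
+
+    ceph auth get-or-create client.volumes mon 'allow r' mgr 'allow rw'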
+
+FS Volumes
+----------
+
+Create a volume using::
+
+ $ ceph fs volume create <vol_name>
+
+This creates a CephFS file system and its data and metadata pools. It also tries
+to create MDSes for the filesystem using the enabled ceph-mgr orchestrator
+module (see :doc:`/mgr/orchestrator_cli`), e.g., rook.
+
+Remove a volume using::
+
+ $ ceph fs volume rm <vol_name> [--yes-i-really-mean-it]
+
+This removes a file system and its data and metadata pools. It also tries to
+remove MDSes using the enabled ceph-mgr orchestrator module.
+
+List volumes using::
+
+ $ ceph fs volume ls
+
+FS Subvolume groups
+-------------------
+
+Create a subvolume group using::
+
+ $ ceph fs subvolumegroup create <vol_name> <group_name> [--pool_layout <data_pool_name>] [--uid <uid>] [--gid <gid>] [--mode <octal_mode>]
+
+The command succeeds even if the subvolume group already exists.
+
+When creating a subvolume group you can specify its data pool layout (see
+:doc:`/cephfs/file-layouts`), uid, gid, and file mode in octal numerals. By default, the
+subvolume group is created with an octal file mode '755', uid '0', gid '0' and data pool
+layout of its parent directory.
+
+
+Remove a subvolume group using::
+
+ $ ceph fs subvolumegroup rm <vol_name> <group_name> [--force]
+
+The removal of a subvolume group fails if it is not empty or does not exist.
+The '--force' flag allows the remove command to succeed even if the subvolume group does not exist.
+
+
+Fetch the absolute path of a subvolume group using::
+
+ $ ceph fs subvolumegroup getpath <vol_name> <group_name>
+
+List subvolume groups using::
+
+ $ ceph fs subvolumegroup ls <vol_name>
+
+.. note:: Subvolume group snapshot feature is no longer supported in nautilus CephFS (existing group
+ snapshots can still be listed and deleted)
+
+Remove a snapshot of a subvolume group using::
+
+ $ ceph fs subvolumegroup snapshot rm <vol_name> <group_name> <snap_name> [--force]
+
+Using the '--force' flag allows the command to succeed when it would otherwise
+fail because the snapshot does not exist.
+
+List snapshots of a subvolume group using::
+
+ $ ceph fs subvolumegroup snapshot ls <vol_name> <group_name>
+
+
+FS Subvolumes
+-------------
+
+Create a subvolume using::
+
+ $ ceph fs subvolume create <vol_name> <subvol_name> [--size <size_in_bytes>] [--group_name <subvol_group_name>] [--pool_layout <data_pool_name>] [--uid <uid>] [--gid <gid>] [--mode <octal_mode>] [--namespace-isolated]
+
+
+The command succeeds even if the subvolume already exists.
+
+When creating a subvolume you can specify its subvolume group, data pool layout,
+uid, gid, file mode in octal numerals, and size in bytes. The size of the subvolume is
+specified by setting a quota on it (see :doc:`/cephfs/quota`). The subvolume can be
+created in a separate RADOS namespace by specifying --namespace-isolated option. By
+default a subvolume is created within the default subvolume group, and with an octal file
+mode '755', uid of its subvolume group, gid of its subvolume group, data pool layout of
+its parent directory and no size limit.
+
+Remove a subvolume using::
+
+ $ ceph fs subvolume rm <vol_name> <subvol_name> [--group_name <subvol_group_name>] [--force] [--retain-snapshots]
+
+
+The command removes the subvolume and its contents. It does this in two steps.
+First, it moves the subvolume to a trash folder, and then asynchronously purges
+its contents.
+
+The removal of a subvolume fails if it has snapshots or does not exist.
+The '--force' flag allows the remove command to succeed even if the subvolume does not exist.
+
+A subvolume can be removed retaining existing snapshots of the subvolume using the
+'--retain-snapshots' option. If snapshots are retained, the subvolume is considered
+empty for all operations not involving the retained snapshots.
+
+.. note:: Snapshot retained subvolumes can be recreated using 'ceph fs subvolume create'
+
+.. note:: Retained snapshots can be used as a clone source to recreate the subvolume, or clone to a newer subvolume.
+
+Resize a subvolume using::
+
+ $ ceph fs subvolume resize <vol_name> <subvol_name> <new_size> [--group_name <subvol_group_name>] [--no_shrink]
+
+The command resizes the subvolume quota using the size specified by 'new_size'.
+The '--no_shrink' flag prevents the subvolume from shrinking below its currently used size.
+
+The subvolume can be resized to an infinite size by passing 'inf' or 'infinite' as the new_size.
+
+Authorize cephx auth IDs with read or read-write access to fs subvolumes::
+
+ $ ceph fs subvolume authorize <vol_name> <sub_name> <auth_id> [--group_name=<group_name>] [--access_level=<access_level>]
+
+The 'access_level' takes 'r' or 'rw' as value.
+
+Deauthorize a cephx auth ID's read or read-write access to fs subvolumes::
+
+ $ ceph fs subvolume deauthorize <vol_name> <sub_name> <auth_id> [--group_name=<group_name>]
+
+List cephx auth IDs authorized to access fs subvolume::
+
+ $ ceph fs subvolume authorized_list <vol_name> <sub_name> [--group_name=<group_name>]
+
+Evict fs clients based on auth ID and subvolume mounted::
+
+ $ ceph fs subvolume evict <vol_name> <sub_name> <auth_id> [--group_name=<group_name>]
+
+Fetch the absolute path of a subvolume using::
+
+ $ ceph fs subvolume getpath <vol_name> <subvol_name> [--group_name <subvol_group_name>]
+
+Fetch the metadata of a subvolume using::
+
+ $ ceph fs subvolume info <vol_name> <subvol_name> [--group_name <subvol_group_name>]
+
+The output format is json and contains fields as follows.
+
+* atime: access time of subvolume path in the format "YYYY-MM-DD HH:MM:SS"
+* mtime: modification time of subvolume path in the format "YYYY-MM-DD HH:MM:SS"
+* ctime: change time of subvolume path in the format "YYYY-MM-DD HH:MM:SS"
+* uid: uid of subvolume path
+* gid: gid of subvolume path
+* mode: mode of subvolume path
+* mon_addrs: list of monitor addresses
+* bytes_pcent: quota used in percentage if quota is set, else displays "undefined"
+* bytes_quota: quota size in bytes if quota is set, else displays "infinite"
+* bytes_used: current used size of the subvolume in bytes
+* created_at: time of creation of subvolume in the format "YYYY-MM-DD HH:MM:SS"
+* data_pool: data pool the subvolume belongs to
+* path: absolute path of a subvolume
+* type: subvolume type indicating whether it's clone or subvolume
+* pool_namespace: RADOS namespace of the subvolume
+* features: features supported by the subvolume
+* state: current state of the subvolume
+
+If a subvolume has been removed retaining its snapshots, the output only contains fields as follows.
+
+* type: subvolume type indicating whether it's clone or subvolume
+* features: features supported by the subvolume
+* state: current state of the subvolume
+
+The subvolume "features" are based on the internal version of the subvolume and is a list containing
+a subset of the following features,
+
+* "snapshot-clone": supports cloning using a subvolumes snapshot as the source
+* "snapshot-autoprotect": supports automatically protecting snapshots, that are active clone sources, from deletion
+* "snapshot-retention": supports removing subvolume contents, retaining any existing snapshots
+
+The subvolume "state" is based on the current state of the subvolume and contains one of the following values.
+
+* "complete": subvolume is ready for all operations
+* "snapshot-retained": subvolume is removed but its snapshots are retained
+
+List subvolumes using::
+
+ $ ceph fs subvolume ls <vol_name> [--group_name <subvol_group_name>]
+
+.. note:: Subvolumes that are removed but have snapshots retained are also listed.
+
+Create a snapshot of a subvolume using::
+
+ $ ceph fs subvolume snapshot create <vol_name> <subvol_name> <snap_name> [--group_name <subvol_group_name>]
+
+
+Remove a snapshot of a subvolume using::
+
+ $ ceph fs subvolume snapshot rm <vol_name> <subvol_name> <snap_name> [--group_name <subvol_group_name>] [--force]
+
+Using the '--force' flag allows the command to succeed when it would otherwise
+fail because the snapshot does not exist.
+
+.. note:: if the last snapshot within a snapshot retained subvolume is removed, the subvolume is also removed
+
+List snapshots of a subvolume using::
+
+ $ ceph fs subvolume snapshot ls <vol_name> <subvol_name> [--group_name <subvol_group_name>]
+
+Fetch the metadata of a snapshot using::
+
+ $ ceph fs subvolume snapshot info <vol_name> <subvol_name> <snap_name> [--group_name <subvol_group_name>]
+
+The output format is json and contains fields as follows.
+
+* created_at: time of creation of snapshot in the format "YYYY-MM-DD HH:MM:SS:ffffff"
+* data_pool: data pool the snapshot belongs to
+* has_pending_clones: "yes" if snapshot clone is in progress otherwise "no"
+* size: snapshot size in bytes
+
+Cloning Snapshots
+-----------------
+
+Subvolumes can be created by cloning subvolume snapshots. Cloning is an asynchronous operation involving copying
+data from a snapshot to a subvolume. Due to this bulk copy nature, cloning is currently inefficient for very large
+data sets.
+
+.. note:: Removing a snapshot (source subvolume) would fail if there are pending or in progress clone operations.
+
+Protecting snapshots prior to cloning was a pre-requisite in the Nautilus release, and the commands to protect/unprotect
+snapshots were introduced for this purpose. This pre-requisite, and hence the commands to protect/unprotect, is being
+deprecated in mainline CephFS, and may be removed in a future release.
+
+The commands being deprecated are::
+
+ $ ceph fs subvolume snapshot protect <vol_name> <subvol_name> <snap_name> [--group_name <subvol_group_name>]
+ $ ceph fs subvolume snapshot unprotect <vol_name> <subvol_name> <snap_name> [--group_name <subvol_group_name>]
+
+.. note:: Using the above commands would not result in an error, but they serve no useful function.
+
+.. note:: Use subvolume info command to fetch subvolume metadata regarding supported "features" to help decide if protect/unprotect of snapshots is required, based on the "snapshot-autoprotect" feature availability.
+
+To initiate a clone operation use::
+
+ $ ceph fs subvolume snapshot clone <vol_name> <subvol_name> <snap_name> <target_subvol_name>
+
+If a snapshot (source subvolume) is a part of a non-default group, the group name needs to be specified::
+
+ $ ceph fs subvolume snapshot clone <vol_name> <subvol_name> <snap_name> <target_subvol_name> --group_name <subvol_group_name>
+
+Cloned subvolumes can be a part of a different group than the source snapshot (by default, cloned subvolumes are created in default group). To clone to a particular group use::
+
+ $ ceph fs subvolume snapshot clone <vol_name> <subvol_name> <snap_name> <target_subvol_name> --target_group_name <subvol_group_name>
+
+Similar to specifying a pool layout when creating a subvolume, pool layout can be specified when creating a cloned subvolume. To create a cloned subvolume with a specific pool layout use::
+
+ $ ceph fs subvolume snapshot clone <vol_name> <subvol_name> <snap_name> <target_subvol_name> --pool_layout <pool_layout>
+
+Configure the maximum number of concurrent clones. The default is 4::
+
+ $ ceph config set mgr mgr/volumes/max_concurrent_clones <value>
+
+To check the status of a clone operation use::
+
+ $ ceph fs clone status <vol_name> <clone_name> [--group_name <group_name>]
+
+A clone can be in one of the following states:
+
+#. `pending` : Clone operation has not started
+#. `in-progress` : Clone operation is in progress
+#. `complete` : Clone operation has successfully finished
+#. `failed` : Clone operation has failed
+
+Sample output from an `in-progress` clone operation::
+
+ $ ceph fs subvolume snapshot clone cephfs subvol1 snap1 clone1
+ $ ceph fs clone status cephfs clone1
+ {
+ "status": {
+ "state": "in-progress",
+ "source": {
+ "volume": "cephfs",
+ "subvolume": "subvol1",
+ "snapshot": "snap1"
+ }
+ }
+ }
+
+(NOTE: since `subvol1` is in the default group, the `source` section in `clone status` does not include the group name)
+
+.. note:: Cloned subvolumes are accessible only after the clone operation has successfully completed.
+
+For a successful clone operation, `clone status` would look like so::
+
+ $ ceph fs clone status cephfs clone1
+ {
+ "status": {
+ "state": "complete"
+ }
+ }
+
+or `failed` state when clone is unsuccessful.
+
+On failure of a clone operation, the partial clone needs to be deleted and the clone operation needs to be retriggered.
+To delete a partial clone use::
+
+ $ ceph fs subvolume rm <vol_name> <clone_name> [--group_name <group_name>] --force
+
+.. note:: Cloning only synchronizes directories, regular files and symbolic links. Also, inode timestamps (access and
+ modification times) are synchronized up to seconds granularity.
+
+An `in-progress` or a `pending` clone operation can be canceled. To cancel a clone operation use the `clone cancel` command::
+
+ $ ceph fs clone cancel <vol_name> <clone_name> [--group_name <group_name>]
+
+On successful cancellation, the cloned subvolume is moved to the `canceled` state::
+
+ $ ceph fs subvolume snapshot clone cephfs subvol1 snap1 clone1
+ $ ceph fs clone cancel cephfs clone1
+ $ ceph fs clone status cephfs clone1
+ {
+ "status": {
+ "state": "canceled",
+ "source": {
+ "volume": "cephfs",
+ "subvolume": "subvol1",
+ "snapshot": "snap1"
+ }
+ }
+ }
+
+.. note:: A canceled clone can be deleted by using the --force option with the `fs subvolume rm` command.
+
+.. _manila: https://github.com/openstack/manila
+.. _CSI: https://github.com/ceph/ceph-csi
diff --git a/doc/cephfs/fstab.rst b/doc/cephfs/fstab.rst
new file mode 100644
index 00000000..344c80d9
--- /dev/null
+++ b/doc/cephfs/fstab.rst
@@ -0,0 +1,47 @@
+========================================
+ Mount CephFS in your File Systems Table
+========================================
+
+If you mount CephFS in your file systems table, the Ceph file system will mount
+automatically on startup.
+
+Kernel Driver
+=============
+
+To mount CephFS in your file systems table as a kernel driver, add the
+following to ``/etc/fstab``::
+
+ {ipaddress}:{port}:/ {mount}/{mountpoint} {filesystem-name} [name=username,secret=secretkey|secretfile=/path/to/secretfile],[{mount.options}]
+
+For example::
+
+ 10.10.10.10:6789:/ /mnt/ceph ceph name=admin,noatime,_netdev 0 2
+
+The default for the ``name=`` parameter is ``guest``. If the ``secret`` or
+``secretfile`` options are not specified then the mount helper will attempt to
+find a secret for the given ``name`` in one of the configured keyrings.
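+
+For example, to supply the key via a secret file instead of a keyring (the path
+``/etc/ceph/admin.secret`` is an assumed location containing only the base64
+key for ``client.admin``)::
+
+    10.10.10.10:6789:/     /mnt/ceph    ceph    name=admin,secretfile=/etc/ceph/admin.secret,noatime,_netdev    0       2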
+
+See `User Management`_ for details.
+
+
+FUSE
+====
+
+To mount CephFS in your file systems table as a filesystem in user space, add the
+following to ``/etc/fstab``::
+
+ #DEVICE PATH TYPE OPTIONS
+ none /mnt/ceph fuse.ceph ceph.id={user-ID}[,ceph.conf={path/to/conf.conf}],_netdev,defaults 0 0
+
+For example::
+
+ none /mnt/ceph fuse.ceph ceph.id=myuser,_netdev,defaults 0 0
+ none /mnt/ceph fuse.ceph ceph.id=myuser,ceph.conf=/etc/ceph/foo.conf,_netdev,defaults 0 0
+
+Ensure you use the ID (e.g., ``admin``, not ``client.admin``). You can pass any valid
+``ceph-fuse`` option to the command line this way.
+
+See `User Management`_ for details.
+
+
+.. _User Management: ../../rados/operations/user-management/
diff --git a/doc/cephfs/full.rst b/doc/cephfs/full.rst
new file mode 100644
index 00000000..cc9eb596
--- /dev/null
+++ b/doc/cephfs/full.rst
@@ -0,0 +1,60 @@
+
+Handling a full Ceph filesystem
+===============================
+
+When a RADOS cluster reaches its ``mon_osd_full_ratio`` (default
+95%) capacity, it is marked with the OSD full flag. This flag causes
+most normal RADOS clients to pause all operations until it is resolved
+(for example by adding more capacity to the cluster).
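+
+To check whether the flag is currently set, and which ratios are configured,
+something like the following can be used::
+
+    # Cluster health will report the full condition when the flag is set
+    ceph health detail
+    # Show the configured full/backfillfull/nearfull ratios
+    ceph osd dump | grep ratio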
+
+The filesystem has some special handling of the full flag, explained below.
+
+Hammer and later
+----------------
+
+Since the hammer release, a full filesystem will lead to ENOSPC
+results from:
+
+ * Data writes on the client
+ * Metadata operations other than deletes and truncates
+
+Because the full condition may not be encountered until
+data is flushed to disk (sometime after a ``write`` call has already
+returned successfully), the ENOSPC error may not be seen until the application
+calls ``fsync`` or ``fclose`` (or equivalent) on the file handle.
+
+Calling ``fsync`` is guaranteed to reliably indicate whether the data
+made it to disk, and will return an error if it doesn't. ``fclose`` will
+only return an error if buffered data happened to be flushed since
+the last write -- a successful ``fclose`` does not guarantee that the
+data made it to disk, and in a full-space situation, buffered data
+may be discarded after an ``fclose`` if no space is available to persist it.
+
+.. warning::
+ If an application appears to be misbehaving on a full filesystem,
+ check that it is performing ``fsync()`` calls as necessary to ensure
+ data is on disk before proceeding.
+
+Data writes may be cancelled by the client if they are in flight at the
+time the OSD full flag is sent. Clients update the ``osd_epoch_barrier``
+when releasing capabilities on files affected by cancelled operations, in
+order to ensure that these cancelled operations do not interfere with
+subsequent access to the data objects by the MDS or other clients. For
+more on the epoch barrier mechanism, see :ref:`background_blacklisting_and_osd_epoch_barrier`.
+
+Legacy (pre-hammer) behavior
+----------------------------
+
+In versions of Ceph earlier than hammer, the MDS would ignore
+the full status of the RADOS cluster, and any data writes from
+clients would stall until the cluster ceased to be full.
+
+There are two dangerous conditions to watch for with this behaviour:
+
+* If a client had pending writes to a file, then it was not possible
+ for the client to release the file to the MDS for deletion: this could
+ lead to difficulty clearing space on a full filesystem
+* If clients continued to create a large number of empty files, the
+ resulting metadata writes from the MDS could lead to total exhaustion
+ of space on the OSDs such that no further deletions could be performed.
+
diff --git a/doc/cephfs/fuse.rst b/doc/cephfs/fuse.rst
new file mode 100644
index 00000000..25125370
--- /dev/null
+++ b/doc/cephfs/fuse.rst
@@ -0,0 +1,52 @@
+=======================
+Mount CephFS using FUSE
+=======================
+
+Before mounting a Ceph File System in User Space (FUSE), ensure that the client
+host has a copy of the Ceph configuration file and a keyring with CAPS for the
+Ceph metadata server.
+
+#. From your client host, copy the Ceph configuration file from the monitor host
+ to the ``/etc/ceph`` directory. ::
+
+ sudo mkdir -p /etc/ceph
+ sudo scp {user}@{server-machine}:/etc/ceph/ceph.conf /etc/ceph/ceph.conf
+
+#. From your client host, copy the Ceph keyring from the monitor host to
+   the ``/etc/ceph`` directory. ::
+
+ sudo scp {user}@{server-machine}:/etc/ceph/ceph.keyring /etc/ceph/ceph.keyring
+
+#. Ensure that the Ceph configuration file and the keyring have appropriate
+ permissions set on your client machine (e.g., ``chmod 644``).
+
+For additional details on ``cephx`` configuration, see
+`CEPHX Config Reference`_.
+
+To mount the Ceph file system as a FUSE, you may use the ``ceph-fuse`` command.
+For example::
+
+ sudo mkdir /home/username/cephfs
+ sudo ceph-fuse -m 192.168.0.1:6789 /home/username/cephfs
+
+If you have more than one filesystem, specify which one to mount using
+the ``--client_mds_namespace`` command line argument, or add a
+``client_mds_namespace`` setting to your ``ceph.conf``.
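+
+For example, assuming a second file system named ``mycephfs2``::
+
+    sudo ceph-fuse -m 192.168.0.1:6789 --client_mds_namespace=mycephfs2 /mnt/mycephfs2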
+
+See `ceph-fuse`_ for additional details.
+
+To automate mounting ceph-fuse, you may add an entry to the system fstab_.
+Additionally, ``ceph-fuse@.service`` and ``ceph-fuse.target`` systemd units are
+available. As usual, these unit files declare the default dependencies and
+recommended execution context for ``ceph-fuse``. An example ceph-fuse mount on
+``/mnt`` would be::
+
+ sudo systemctl start ceph-fuse@/mnt.service
+
+A persistent mount point can be set up via::
+
+ sudo systemctl enable ceph-fuse@/mnt.service
+
+.. _ceph-fuse: ../../man/8/ceph-fuse/
+.. _fstab: ../fstab/#fuse
+.. _CEPHX Config Reference: ../../rados/configuration/auth-config-ref
diff --git a/doc/cephfs/hadoop.rst b/doc/cephfs/hadoop.rst
new file mode 100644
index 00000000..566c6030
--- /dev/null
+++ b/doc/cephfs/hadoop.rst
@@ -0,0 +1,202 @@
+========================
+Using Hadoop with CephFS
+========================
+
+The Ceph file system can be used as a drop-in replacement for the Hadoop File
+System (HDFS). This page describes the installation and configuration process
+of using Ceph with Hadoop.
+
+Dependencies
+============
+
+* CephFS Java Interface
+* Hadoop CephFS Plugin
+
+.. important:: Currently requires Hadoop 1.1.X stable series
+
+Installation
+============
+
+There are three requirements for using CephFS with Hadoop. First, a running
+Ceph installation is required. The details of setting up a Ceph cluster and
+the file system are beyond the scope of this document. Please refer to the
+Ceph documentation for installing Ceph.
+
+The remaining two requirements are a Hadoop installation and the Ceph file
+system Java packages, including the Java CephFS Hadoop plugin. The high-level
+steps are to add the dependencies to the Hadoop installation ``CLASSPATH``
+and to configure Hadoop to use the Ceph file system.
+
+CephFS Java Packages
+--------------------
+
+* CephFS Hadoop plugin (`hadoop-cephfs.jar <https://download.ceph.com/tarballs/hadoop-cephfs.jar>`_)
+
+Adding these dependencies to a Hadoop installation will depend on your
+particular deployment. In general the dependencies must be present on each
+node in the system that will be part of the Hadoop cluster, and must be in the
+``CLASSPATH`` searched for by Hadoop. Typical approaches are to place the
+additional ``jar`` files into the ``hadoop/lib`` directory, or to edit the
+``HADOOP_CLASSPATH`` variable in ``hadoop-env.sh``.
+
+The native Ceph file system client must be installed on each participating
+node in the Hadoop cluster.
+
+Hadoop Configuration
+====================
+
+This section describes the Hadoop configuration options used to control Ceph.
+These options are intended to be set in the Hadoop configuration file
+`conf/core-site.xml`.
+
++---------------------+--------------------------+----------------------------+
+|Property |Value |Notes |
+| | | |
++=====================+==========================+============================+
+|fs.default.name |Ceph URI |ceph://[monaddr:port]/ |
+| | | |
+| | | |
++---------------------+--------------------------+----------------------------+
+|ceph.conf.file |Local path to ceph.conf |/etc/ceph/ceph.conf |
+| | | |
+| | | |
+| | | |
++---------------------+--------------------------+----------------------------+
+|ceph.conf.options |Comma separated list of |opt1=val1,opt2=val2 |
+| |Ceph configuration | |
+| |key/value pairs | |
+| | | |
++---------------------+--------------------------+----------------------------+
+|ceph.root.dir |Mount root directory |Default value: / |
+| | | |
+| | | |
++---------------------+--------------------------+----------------------------+
+|ceph.mon.address |Monitor address |host:port |
+| | | |
+| | | |
+| | | |
++---------------------+--------------------------+----------------------------+
+|ceph.auth.id |Ceph user id |Example: admin |
+| | | |
+| | | |
+| | | |
++---------------------+--------------------------+----------------------------+
+|ceph.auth.keyfile |Ceph key file | |
+| | | |
+| | | |
+| | | |
++---------------------+--------------------------+----------------------------+
+|ceph.auth.keyring |Ceph keyring file | |
+| | | |
+| | | |
+| | | |
++---------------------+--------------------------+----------------------------+
+|ceph.object.size |Default file object size |Default value (64MB): |
+| |in bytes |67108864 |
+| | | |
+| | | |
++---------------------+--------------------------+----------------------------+
+|ceph.data.pools |List of Ceph data pools |Default value: default Ceph |
+| |for storing file. |pool. |
+| | | |
+| | | |
++---------------------+--------------------------+----------------------------+
+|ceph.localize.reads |Allow reading from file |Default value: true |
+| |replica objects | |
+| | | |
+| | | |
++---------------------+--------------------------+----------------------------+
+
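+For example, a minimal set of Ceph-related properties in
+``conf/core-site.xml`` might look like the following (the monitor address and
+user id are placeholders)::
+
+    <property>
+      <name>fs.default.name</name>
+      <value>ceph://192.168.0.1:6789/</value>
+    </property>
+    <property>
+      <name>ceph.conf.file</name>
+      <value>/etc/ceph/ceph.conf</value>
+    </property>
+    <property>
+      <name>ceph.auth.id</name>
+      <value>admin</value>
+    </property>
+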
+Support For Per-file Custom Replication
+---------------------------------------
+
+The Hadoop file system interface allows users to specify a custom replication
+factor (e.g. 3 copies of each block) when creating a file. However, object
+replication factors in the Ceph file system are controlled on a per-pool
+basis, and by default a Ceph file system will contain only a single
+pre-configured pool. Thus, in order to support per-file replication with
+Hadoop over Ceph, additional storage pools with non-default replications
+factors must be created, and Hadoop must be configured to choose from these
+additional pools.
+
+Additional data pools can be specified using the ``ceph.data.pools``
+configuration option. The value of the option is a comma separated list of
+pool names. The default Ceph pool will be used automatically if this
+configuration option is omitted or the value is empty. For example, the
+following configuration setting will consider the pools ``pool1``, ``pool2``, and
+``pool5`` when selecting a target pool to store a file. ::
+
+ <property>
+ <name>ceph.data.pools</name>
+ <value>pool1,pool2,pool5</value>
+ </property>
+
+Hadoop will not create pools automatically. In order to create a new pool with
+a specific replication factor use the ``ceph osd pool create`` command, and then
+set the ``size`` property on the pool using the ``ceph osd pool set`` command. For
+more information on creating and configuring pools see the `RADOS Pool
+documentation`_.
+
+.. _RADOS Pool documentation: ../../rados/operations/pools
+
+Once a pool has been created and configured, the metadata service must be told
+that the new pool may be used to store file data. A pool is made available
+for storing file system data using the ``ceph fs add_data_pool`` command.
+
+First, create the pool. In this example we create the ``hadoop1`` pool with
+replication factor 1. ::
+
+ ceph osd pool create hadoop1 100
+ ceph osd pool set hadoop1 size 1
+
+Next, determine the pool id. This can be done by examining the output of the
+``ceph osd dump`` command. For example, we can look for the newly created
+``hadoop1`` pool. ::
+
+ ceph osd dump | grep hadoop1
+
+The output should resemble::
+
+ pool 3 'hadoop1' rep size 1 min_size 1 crush_rule 0...
+
+where ``3`` is the pool id. Next we will use the pool id reference to register
+the pool as a data pool for storing file system data. ::
+
+ ceph fs add_data_pool cephfs 3
+
+The final step is to configure Hadoop to consider this data pool when
+selecting the target pool for new files. ::
+
+ <property>
+ <name>ceph.data.pools</name>
+ <value>hadoop1</value>
+ </property>
+
+Pool Selection Rules
+~~~~~~~~~~~~~~~~~~~~
+
+The following rules describe how Hadoop chooses a pool given a desired
+replication factor and the set of pools specified using the
+``ceph.data.pools`` configuration option.
+
+1. When no custom pools are specified the default Ceph data pool is used.
+2. A custom pool with the same replication factor as the default Ceph data
+ pool will override the default.
+3. A pool with a replication factor that matches the desired replication will
+ be chosen if it exists.
+4. Otherwise, a pool with at least the desired replication factor will be
+ chosen, or the maximum possible.
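+
+For example (pool names and sizes below are hypothetical): if
+``ceph.data.pools`` lists ``poolA`` with replication 2 and ``poolB`` with
+replication 3, then a request for 3 copies selects ``poolB``, a request for 2
+copies selects ``poolA``, and a request for 4 copies falls back to ``poolB``,
+the pool with the largest available replication factor.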
+
+Debugging Pool Selection
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+Hadoop will produce a log file entry when it cannot determine the replication
+factor of a pool (e.g. it is not configured as a data pool). The log message
+will appear as follows::
+
+ Error looking up replication of pool: <pool name>
+
+Hadoop will also produce a log entry when it is unable to select an exact
+match for the requested replication. This log entry will appear as follows::
+
+ selectDataPool path=<path> pool:repl=<name>:<value> wanted=<value>
diff --git a/doc/cephfs/health-messages.rst b/doc/cephfs/health-messages.rst
new file mode 100644
index 00000000..b096e124
--- /dev/null
+++ b/doc/cephfs/health-messages.rst
@@ -0,0 +1,131 @@
+
+.. _cephfs-health-messages:
+
+======================
+CephFS health messages
+======================
+
+Cluster health checks
+=====================
+
+The Ceph monitor daemons will generate health messages in response
+to certain states of the filesystem map structure (and the enclosed MDS maps).
+
+Message: mds rank(s) *ranks* have failed
+Description: One or more MDS ranks are not currently assigned to
+an MDS daemon; the cluster will not recover until a suitable replacement
+daemon starts.
+
+Message: mds rank(s) *ranks* are damaged
+Description: One or more MDS ranks has encountered severe damage to
+its stored metadata, and cannot start again until it is repaired.
+
+Message: mds cluster is degraded
+Description: One or more MDS ranks are not currently up and running; clients
+may pause metadata IO until this situation is resolved. This includes
+ranks being failed or damaged, and additionally includes ranks
+which are running on an MDS but have not yet made it to the *active*
+state (e.g. ranks currently in *replay* state).
+
+Message: mds *names* are laggy
+Description: The named MDS daemons have failed to send beacon messages
+to the monitor for at least ``mds_beacon_grace`` (default 15s), while
+they are supposed to send beacon messages every ``mds_beacon_interval``
+(default 4s). The daemons may have crashed. The Ceph monitor will
+automatically replace laggy daemons with standbys if any are available.
+
+Message: insufficient standby daemons available
+Description: One or more file systems are configured to have a certain number
+of standby daemons available (including daemons in standby-replay) but the
+cluster does not have enough standby daemons. The standby daemons not in replay
+count towards any file system (i.e. they may overlap). This warning can
+configured by setting ``ceph fs set <fs> standby_count_wanted <count>``. Use
+zero for ``count`` to disable.
+
+
+Daemon-reported health checks
+=============================
+
+MDS daemons can identify a variety of unwanted conditions, and
+indicate these to the operator in the output of ``ceph status``.
+These conditions have human-readable messages, and additionally
+a unique code starting with MDS_HEALTH, which appears in JSON output.
+
+Message: "Behind on trimming..."
+Code: MDS_HEALTH_TRIM
+Description: CephFS maintains a metadata journal that is divided into
+*log segments*. The length of the journal (in number of segments) is controlled
+by the setting ``mds_log_max_segments``, and when the number of segments
+exceeds that setting the MDS starts writing back metadata so that it
+can remove (trim) the oldest segments. If this writeback is happening
+too slowly, or a software bug is preventing trimming, then this health
+message may appear. The threshold for this message to appear is controlled by
+the config option ``mds_log_warn_factor``; the default is 2.0.
+
+Message: "Client *name* failing to respond to capability release"
+Code: MDS_HEALTH_CLIENT_LATE_RELEASE, MDS_HEALTH_CLIENT_LATE_RELEASE_MANY
+Description: CephFS clients are issued *capabilities* by the MDS, which
+are like locks. Sometimes, for example when another client needs access,
+the MDS will request clients release their capabilities. If the client
+is unresponsive or buggy, it might fail to do so promptly or fail to do
+so at all. This message appears if a client has taken longer than
+``session_timeout`` (default 60s) to comply.
+
+Message: "Client *name* failing to respond to cache pressure"
+Code: MDS_HEALTH_CLIENT_RECALL, MDS_HEALTH_CLIENT_RECALL_MANY
+Description: Clients maintain a metadata cache. Items (such as inodes) in the
+client cache are also pinned in the MDS cache, so when the MDS needs to shrink
+its cache (to stay within ``mds_cache_size`` or ``mds_cache_memory_limit``), it
+sends messages to clients to shrink their caches too. If the client is
+unresponsive or buggy, this can prevent the MDS from properly staying within
+its cache limits and it may eventually run out of memory and crash. This
+message appears if a client has failed to release more than
+``mds_recall_warning_threshold`` capabilities (decaying with a half-life of
+``mds_recall_max_decay_rate``) within the last
+``mds_recall_warning_decay_rate`` seconds.
+
+Message: "Client *name* failing to advance its oldest client/flush tid"
+Code: MDS_HEALTH_CLIENT_OLDEST_TID, MDS_HEALTH_CLIENT_OLDEST_TID_MANY
+Description: The CephFS client-MDS protocol uses a field called the
+*oldest tid* to inform the MDS of which client requests are fully
+complete and may therefore be forgotten about by the MDS. If a buggy
+client is failing to advance this field, then the MDS may be prevented
+from properly cleaning up resources used by client requests. This message
+appears if a client appears to have more than ``max_completed_requests``
+(default 100000) requests that are complete on the MDS side but haven't
+yet been accounted for in the client's *oldest tid* value.
+
+Message: "Metadata damage detected"
+Code: MDS_HEALTH_DAMAGE,
+Description: Corrupt or missing metadata was encountered when reading
+from the metadata pool. This message indicates that the damage was
+sufficiently isolated for the MDS to continue operating, although
+client accesses to the damaged subtree will return IO errors. Use
+the ``damage ls`` admin socket command to get more detail on the damage.
+This message appears as soon as any damage is encountered.
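+
+For example, assuming an MDS daemon named ``mds.a``, the damage table can be
+inspected through its admin socket with::
+
+    ceph daemon mds.a damage ls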
+
+Message: "MDS in read-only mode"
+Code: MDS_HEALTH_READ_ONLY,
+Description: The MDS has gone into readonly mode and will return EROFS
+error codes to client operations that attempt to modify any metadata. The
+MDS will go into readonly mode if it encounters a write error while
+writing to the metadata pool, or if forced to by an administrator using
+the *force_readonly* admin socket command.
+
+Message: "*N* slow requests are blocked"
+Code: MDS_HEALTH_SLOW_REQUEST,
+Description: One or more client requests have not been completed promptly,
+indicating that the MDS is either running very slowly, or that the RADOS
+cluster is not acknowledging journal writes promptly, or that there is a bug.
+Use the ``ops`` admin socket command to list outstanding metadata operations.
+This message appears if any client requests have taken longer than
+``mds_op_complaint_time`` (default 30s).
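+
+For example, assuming an MDS daemon named ``mds.a``, outstanding operations
+can be listed with::
+
+    ceph daemon mds.a ops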
+
+Message: "Too many inodes in cache"
+Code: MDS_HEALTH_CACHE_OVERSIZED
+Description: The MDS is not succeeding in trimming its cache to comply with the
+limit set by the administrator. If the MDS cache becomes too large, the daemon
+may exhaust available memory and crash. By default, this message appears if
+the actual cache size (in inodes or memory) is at least 50% greater than
+``mds_cache_size`` (default 100000) or ``mds_cache_memory_limit`` (default
+1GB). Modify ``mds_health_cache_threshold`` to set the warning ratio.
diff --git a/doc/cephfs/index.rst b/doc/cephfs/index.rst
new file mode 100644
index 00000000..d9494a2b
--- /dev/null
+++ b/doc/cephfs/index.rst
@@ -0,0 +1,133 @@
+.. _ceph-filesystem:
+
+=================
+ Ceph Filesystem
+=================
+
+The Ceph Filesystem (CephFS) is a POSIX-compliant filesystem that uses
+a Ceph Storage Cluster to store its data. The Ceph filesystem uses the same Ceph
+Storage Cluster system as Ceph Block Devices, Ceph Object Storage with its S3
+and Swift APIs, or native bindings (librados).
+
+.. note:: If you are evaluating CephFS for the first time, please review
+ the best practices for deployment: :doc:`/cephfs/best-practices`
+
+.. ditaa::
+ +-----------------------+ +------------------------+
+ | | | CephFS FUSE |
+ | | +------------------------+
+ | |
+ | | +------------------------+
+ | CephFS Kernel Object | | CephFS Library |
+ | | +------------------------+
+ | |
+ | | +------------------------+
+ | | | librados |
+ +-----------------------+ +------------------------+
+
+ +---------------+ +---------------+ +---------------+
+ | OSDs | | MDSs | | Monitors |
+ +---------------+ +---------------+ +---------------+
+
+
+Using CephFS
+============
+
+Using the Ceph Filesystem requires at least one :term:`Ceph Metadata Server` in
+your Ceph Storage Cluster.
+
+
+
+.. raw:: html
+
+ <style type="text/css">div.body h3{margin:5px 0px 0px 0px;}</style>
+ <table cellpadding="10"><colgroup><col width="33%"><col width="33%"><col width="33%"></colgroup><tbody valign="top"><tr><td><h3>Step 1: Metadata Server</h3>
+
+To run the Ceph Filesystem, you must have a running Ceph Storage Cluster with at
+least one :term:`Ceph Metadata Server` running.
+
+
+.. toctree::
+ :maxdepth: 1
+
+ Provision/Add/Remove MDS(s) <add-remove-mds>
+ MDS failover and standby configuration <standby>
+ MDS Configuration Settings <mds-config-ref>
+ Client Configuration Settings <client-config-ref>
+ Journaler Configuration <journaler>
+ Manpage ceph-mds <../../man/8/ceph-mds>
+
+.. raw:: html
+
+ </td><td><h3>Step 2: Mount CephFS</h3>
+
+Once you have a healthy Ceph Storage Cluster with at least
+one Ceph Metadata Server, you may create and mount your Ceph Filesystem.
+Ensure that your client has network connectivity and the proper
+authentication keyring.
+
+.. toctree::
+ :maxdepth: 1
+
+ Create a CephFS file system <createfs>
+ Mount CephFS <kernel>
+ Mount CephFS as FUSE <fuse>
+ Mount CephFS in fstab <fstab>
+ Use the CephFS Shell <cephfs-shell>
+ Supported Features of Kernel Driver <kernel-features>
+ Manpage ceph-fuse <../../man/8/ceph-fuse>
+ Manpage mount.ceph <../../man/8/mount.ceph>
+ Manpage mount.fuse.ceph <../../man/8/mount.fuse.ceph>
+
+
+.. raw:: html
+
+ </td><td><h3>Additional Details</h3>
+
+.. toctree::
+ :maxdepth: 1
+
+ Deployment best practices <best-practices>
+ MDS States <mds-states>
+ Administrative commands <administration>
+ Understanding MDS Cache Size Limits <cache-size-limits>
+ POSIX compatibility <posix>
+ Experimental Features <experimental-features>
+ CephFS Quotas <quota>
+ Using Ceph with Hadoop <hadoop>
+ cephfs-journal-tool <cephfs-journal-tool>
+ File layouts <file-layouts>
+ Client eviction <eviction>
+ Handling full filesystems <full>
+ Health messages <health-messages>
+ Troubleshooting <troubleshooting>
+ Disaster recovery <disaster-recovery>
+ Client authentication <client-auth>
+ Upgrading old filesystems <upgrading>
+ Configuring directory fragmentation <dirfrags>
+ Configuring multiple active MDS daemons <multimds>
+ Export over NFS <nfs>
+ Application best practices <app-best-practices>
+ Scrub <scrub>
+ LazyIO <lazyio>
+ FS volume and subvolumes <fs-volumes>
+
+.. toctree::
+ :hidden:
+
+ Advanced: Metadata repair <disaster-recovery-experts>
+
+.. raw:: html
+
+ </td></tr></tbody></table>
+
+For developers
+==============
+
+.. toctree::
+ :maxdepth: 1
+
+ Client's Capabilities <capabilities>
+ libcephfs <../../api/libcephfs-java/>
+ Mantle <mantle>
+
diff --git a/doc/cephfs/journaler.rst b/doc/cephfs/journaler.rst
new file mode 100644
index 00000000..2121532f
--- /dev/null
+++ b/doc/cephfs/journaler.rst
@@ -0,0 +1,41 @@
+===========
+ Journaler
+===========
+
+``journaler write head interval``
+
+:Description: How frequently to update the journal head object
+:Type: Integer
+:Required: No
+:Default: ``15``
+
+
+``journaler prefetch periods``
+
+:Description: How many stripe periods to read-ahead on journal replay
+:Type: Integer
+:Required: No
+:Default: ``10``
+
+
+``journaler prezero periods``
+
+:Description: How many stripe periods to zero ahead of write position
+:Type: Integer
+:Required: No
+:Default: ``10``
+
+``journaler batch interval``
+
+:Description: Maximum additional latency in seconds we incur artificially.
+:Type: Double
+:Required: No
+:Default: ``.001``
+
+
+``journaler batch max``
+
+:Description: Maximum bytes we will delay flushing.
+:Type: 64-bit Unsigned Integer
+:Required: No
+:Default: ``0``
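+
+These options can be set like any other Ceph option, for example in the
+``[mds]`` section of ``ceph.conf`` (the values below are illustrative only)::
+
+    [mds]
+    journaler write head interval = 30
+    journaler batch interval = 0.01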
diff --git a/doc/cephfs/kernel-features.rst b/doc/cephfs/kernel-features.rst
new file mode 100644
index 00000000..edd27bcd
--- /dev/null
+++ b/doc/cephfs/kernel-features.rst
@@ -0,0 +1,40 @@
+
+Supported Features of Kernel Driver
+========================================
+
+Inline data
+-----------
+Inline data was introduced by the Firefly release. Linux kernel clients >= 3.19
+can read inline data and can convert existing inline data to RADOS objects when
+file data is modified. At present, Linux kernel clients do not store file data
+as inline data.
+
+See `Experimental Features`_ for more information.
+
+Quotas
+------
+Quotas were first introduced by the Hammer release. The quota disk format was
+renewed by the Mimic release. Linux kernel clients >= 4.17 support the new
+format quotas. At present, no Linux kernel client supports the old format quotas.
+
+See `Quotas`_ for more information.
+
+Multiple filesystems within a Ceph cluster
+------------------------------------------
+The feature was introduced by the Jewel release. Linux kernel clients >= 4.7
+can support it.
+
+See `Experimental Features`_ for more information.
+
+Multiple active metadata servers
+--------------------------------
+The feature has been supported since the Luminous release. It is recommended to
+use Linux kernel clients >= 4.14 when there are multiple active MDS.
+
+Snapshots
+---------
+The feature has been supported since the Mimic release. It is recommended to
+use Linux kernel clients >= 4.17 if snapshot is used.
+
+.. _Experimental Features: ../experimental-features
+.. _Quotas: ../quota
diff --git a/doc/cephfs/kernel.rst b/doc/cephfs/kernel.rst
new file mode 100644
index 00000000..89f481f9
--- /dev/null
+++ b/doc/cephfs/kernel.rst
@@ -0,0 +1,41 @@
+====================================
+ Mount CephFS with the Kernel Driver
+====================================
+
+To mount the Ceph file system you may use the ``mount`` command if you know the
+monitor host IP address(es), or use the ``mount.ceph`` utility to resolve the
+monitor host name(s) into IP address(es) for you. For example::
+
+ sudo mkdir /mnt/mycephfs
+ sudo mount -t ceph 192.168.0.1:6789:/ /mnt/mycephfs
+
+To mount the Ceph file system with ``cephx`` authentication enabled, the kernel
+must authenticate with the cluster. The default ``name=`` option is ``guest``.
+The mount.ceph helper will automatically attempt to find a secret key in the
+keyring.
+
+The secret can also be specified manually with the ``secret=`` option. ::
+
+ sudo mount -t ceph 192.168.0.1:6789:/ /mnt/mycephfs -o name=admin,secret=AQATSKdNGBnwLhAAnNDKnH65FmVKpXZJVasUeQ==
+
+The foregoing usage leaves the secret in the Bash history. A more secure
+approach reads the secret from a file. For example::
+
+ sudo mount -t ceph 192.168.0.1:6789:/ /mnt/mycephfs -o name=admin,secretfile=/etc/ceph/admin.secret
+
+See `User Management`_ for details on cephx.
+
+If you have more than one file system, specify which one to mount using
+the ``mds_namespace`` option, e.g. ``-o mds_namespace=myfs``.
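+
+For example::
+
+    sudo mount -t ceph 192.168.0.1:6789:/ /mnt/mycephfs -o name=admin,secretfile=/etc/ceph/admin.secret,mds_namespace=myfs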
+
+To unmount the Ceph file system, you may use the ``umount`` command. For example::
+
+ sudo umount /mnt/mycephfs
+
+.. tip:: Ensure that you are not within the file system directories before
+ executing this command.
+
+See `mount.ceph`_ for details.
+
+.. _mount.ceph: ../../man/8/mount.ceph/
+.. _User Management: ../../rados/operations/user-management/
diff --git a/doc/cephfs/lazyio.rst b/doc/cephfs/lazyio.rst
new file mode 100644
index 00000000..6737932a
--- /dev/null
+++ b/doc/cephfs/lazyio.rst
@@ -0,0 +1,23 @@
+======
+LazyIO
+======
+
+LazyIO relaxes POSIX semantics. Buffered reads/writes are allowed even when a
+file is opened by multiple applications on multiple clients. Applications are
+responsible for managing cache coherency themselves.
+
+Libcephfs supports LazyIO since the Nautilus release.
+
+Enable LazyIO
+=============
+
+LazyIO can be enabled in the following ways:
+
+- The ``client_force_lazyio`` option enables LAZY_IO globally for libcephfs and
+  ceph-fuse mounts.
+
+- ``ceph_lazyio(...)`` and ``ceph_ll_lazyio(...)`` enable LAZY_IO for a file
+  handle in libcephfs.
+
+- ``ioctl(fd, CEPH_IOC_LAZYIO, 1UL)`` enables LAZY_IO for a file handle in a
+  ceph-fuse mount (see the sketch below).
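+
+For illustration, a minimal C sketch of the ``ioctl`` approach on a ceph-fuse
+mount might look like the following (``CEPH_IOC_LAZYIO`` is defined in the
+``ioctl.h`` header shipped with the Ceph client sources; error handling is
+abbreviated)::
+
+    #include <fcntl.h>
+    #include <sys/ioctl.h>
+    #include "ioctl.h"   /* Ceph client header defining CEPH_IOC_LAZYIO */
+
+    /* Open a file on a ceph-fuse mount and enable LazyIO on its handle. */
+    int open_with_lazyio(const char *path)
+    {
+        int fd = open(path, O_RDWR);
+        if (fd < 0)
+            return -1;
+        if (ioctl(fd, CEPH_IOC_LAZYIO, 1UL) < 0) {
+            /* LazyIO could not be enabled; normal semantics still apply. */
+        }
+        /* With LazyIO enabled, buffered reads/writes are allowed even with
+         * multiple writers; the application manages cache coherency itself. */
+        return fd;
+    }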
diff --git a/doc/cephfs/mantle.rst b/doc/cephfs/mantle.rst
new file mode 100644
index 00000000..6d3d40d6
--- /dev/null
+++ b/doc/cephfs/mantle.rst
@@ -0,0 +1,263 @@
+Mantle
+======
+
+.. warning::
+
+ Mantle is for research and development of metadata balancer algorithms,
+ not for use on production CephFS clusters.
+
+Multiple, active MDSs can migrate directories to balance metadata load. The
+policies for when, where, and how much to migrate are hard-coded into the
+metadata balancing module. Mantle is a programmable metadata balancer built
+into the MDS. The idea is to protect the mechanisms for balancing load
+(migration, replication, fragmentation) but stub out the balancing policies
+using Lua. Mantle is based on [1] but the current implementation does *NOT*
+have the following features from that paper:
+
+1. Balancing API: in the paper, the user fills in when, where, how much, and
+ load calculation policies; currently, Mantle only requires that Lua policies
+ return a table of target loads (e.g., how much load to send to each MDS)
+2. "How much" hook: in the paper, there was a hook that let the user control
+ the fragment selector policy; currently, Mantle does not have this hook
+3. Instantaneous CPU utilization as a metric
+
+[1] Supercomputing '15 Paper:
+http://sc15.supercomputing.org/schedule/event_detail-evid=pap168.html
+
+Quickstart with vstart
+----------------------
+
+.. warning::
+
+ Developing balancers with vstart is difficult because running all daemons
+ and clients on one node can overload the system. Let it run for a while, even
+ though you will likely see a bunch of lost heartbeat and laggy MDS warnings.
+ Most of the time this guide will work but sometimes all MDSs lock up and you
+ cannot actually see them spill. It is much better to run this on a cluster.
+
+As a prerequisite, we assume you have installed `mdtest
+<https://sourceforge.net/projects/mdtest/>`_ or pulled the `Docker image
+<https://hub.docker.com/r/michaelsevilla/mdtest/>`_. We use mdtest because we
+need to generate enough load to get over the MIN_OFFLOAD threshold that is
+arbitrarily set in the balancer. For example, this does not create enough
+metadata load:
+
+::
+
+ while true; do
+ touch "/cephfs/blah-`date`"
+ done
+
+
+Mantle with `vstart.sh`
+~~~~~~~~~~~~~~~~~~~~~~~
+
+1. Start Ceph and tune the logging so we can see migrations happen:
+
+::
+
+ cd build
+ ../src/vstart.sh -n -l
+ for i in a b c; do
+ bin/ceph --admin-daemon out/mds.$i.asok config set debug_ms 0
+ bin/ceph --admin-daemon out/mds.$i.asok config set debug_mds 2
+ bin/ceph --admin-daemon out/mds.$i.asok config set mds_beacon_grace 1500
+ done
+
+
+2. Put the balancer into RADOS:
+
+::
+
+ bin/rados put --pool=cephfs_metadata_a greedyspill.lua ../src/mds/balancers/greedyspill.lua
+
+
+3. Activate Mantle:
+
+::
+
+ bin/ceph fs set cephfs max_mds 5
+ bin/ceph fs set cephfs_a balancer greedyspill.lua
+
+
+4. Mount CephFS in another window:
+
+::
+
+ bin/ceph-fuse /cephfs -o allow_other &
+ tail -f out/mds.a.log
+
+
+ Note that if you look at the last MDS (which could be a, b, or c -- it's
+ random), you will see an attempt to index a nil value. This is because the
+ last MDS tries to check the load of its neighbor, which does not exist.
+
+5. Run a simple benchmark. In our case, we use the Docker mdtest image to
+ create load:
+
+::
+
+ for i in 0 1 2; do
+ docker run -d \
+ --name=client$i \
+ -v /cephfs:/cephfs \
+ michaelsevilla/mdtest \
+ -F -C -n 100000 -d "/cephfs/client-test$i"
+ done
+
+
+6. When you are done, you can kill all the clients with:
+
+::
+
+ for i in 0 1 2; do docker rm -f client$i; done
+
+
+Output
+~~~~~~
+
+Looking at the log for the first MDS (could be a, b, or c), we see that
+everyone has no load:
+
+::
+
+ 2016-08-21 06:44:01.763930 7fd03aaf7700 0 lua.balancer MDS0: < auth.meta_load=0.0 all.meta_load=0.0 req_rate=1.0 queue_len=0.0 cpu_load_avg=1.35 > load=0.0
+ 2016-08-21 06:44:01.763966 7fd03aaf7700 0 lua.balancer MDS1: < auth.meta_load=0.0 all.meta_load=0.0 req_rate=0.0 queue_len=0.0 cpu_load_avg=1.35 > load=0.0
+ 2016-08-21 06:44:01.763982 7fd03aaf7700 0 lua.balancer MDS2: < auth.meta_load=0.0 all.meta_load=0.0 req_rate=0.0 queue_len=0.0 cpu_load_avg=1.35 > load=0.0
+ 2016-08-21 06:44:01.764010 7fd03aaf7700 2 lua.balancer when: not migrating! my_load=0.0 hisload=0.0
+ 2016-08-21 06:44:01.764033 7fd03aaf7700 2 mds.0.bal mantle decided that new targets={}
+
+
+After the job starts, MDS0 gets about 1953 units of load. The greedy spill
+balancer dictates that half the load goes to your neighbor MDS, so we see that
+Mantle tries to send about 977 load units to MDS1.
+
+::
+
+ 2016-08-21 06:45:21.869994 7fd03aaf7700 0 lua.balancer MDS0: < auth.meta_load=5834.188908912 all.meta_load=1953.3492228857 req_rate=12591.0 queue_len=1075.0 cpu_load_avg=3.05 > load=1953.3492228857
+ 2016-08-21 06:45:21.870017 7fd03aaf7700 0 lua.balancer MDS1: < auth.meta_load=0.0 all.meta_load=0.0 req_rate=0.0 queue_len=0.0 cpu_load_avg=3.05 > load=0.0
+ 2016-08-21 06:45:21.870027 7fd03aaf7700 0 lua.balancer MDS2: < auth.meta_load=0.0 all.meta_load=0.0 req_rate=0.0 queue_len=0.0 cpu_load_avg=3.05 > load=0.0
+ 2016-08-21 06:45:21.870034 7fd03aaf7700 2 lua.balancer when: migrating! my_load=1953.3492228857 hisload=0.0
+ 2016-08-21 06:45:21.870050 7fd03aaf7700 2 mds.0.bal mantle decided that new targets={0=0,1=976.675,2=0}
+ 2016-08-21 06:45:21.870094 7fd03aaf7700 0 mds.0.bal - exporting [0,0.52287 1.04574] 1030.88 to mds.1 [dir 100000006ab /client-test2/ [2,head] auth pv=33 v=32 cv=32/0 ap=2+3+4 state=1610612802|complete f(v0 m2016-08-21 06:44:20.366935 1=0+1) n(v2 rc2016-08-21 06:44:30.946816 3790=3788+2) hs=1+0,ss=0+0 dirty=1 | child=1 dirty=1 authpin=1 0x55d2762fd690]
+ 2016-08-21 06:45:21.870151 7fd03aaf7700 0 mds.0.migrator nicely exporting to mds.1 [dir 100000006ab /client-test2/ [2,head] auth pv=33 v=32 cv=32/0 ap=2+3+4 state=1610612802|complete f(v0 m2016-08-21 06:44:20.366935 1=0+1) n(v2 rc2016-08-21 06:44:30.946816 3790=3788+2) hs=1+0,ss=0+0 dirty=1 | child=1 dirty=1 authpin=1 0x55d2762fd690]
+
+
+Eventually load moves around:
+
+::
+
+ 2016-08-21 06:47:10.210253 7fd03aaf7700 0 lua.balancer MDS0: < auth.meta_load=415.77414300449 all.meta_load=415.79000078186 req_rate=82813.0 queue_len=0.0 cpu_load_avg=11.97 > load=415.79000078186
+ 2016-08-21 06:47:10.210277 7fd03aaf7700 0 lua.balancer MDS1: < auth.meta_load=228.72023977691 all.meta_load=186.5606496623 req_rate=28580.0 queue_len=0.0 cpu_load_avg=11.97 > load=186.5606496623
+ 2016-08-21 06:47:10.210290 7fd03aaf7700 0 lua.balancer MDS2: < auth.meta_load=0.0 all.meta_load=0.0 req_rate=1.0 queue_len=0.0 cpu_load_avg=11.97 > load=0.0
+ 2016-08-21 06:47:10.210298 7fd03aaf7700 2 lua.balancer when: not migrating! my_load=415.79000078186 hisload=186.5606496623
+ 2016-08-21 06:47:10.210311 7fd03aaf7700 2 mds.0.bal mantle decided that new targets={}
+
+
+Implementation Details
+----------------------
+
+Most of the implementation is in MDBalancer. Metrics are passed to the balancer
+policies via the Lua stack and a list of loads is returned back to MDBalancer.
+It sits alongside the current balancer implementation and it's enabled with a
+Ceph CLI command ("ceph fs set cephfs balancer mybalancer.lua"). If the Lua policy
+fails (for whatever reason), we fall back to the original metadata load
+balancer. The balancer is stored in the RADOS metadata pool and a string in the
+MDSMap tells the MDSs which balancer to use.
+
+Exposing Metrics to Lua
+~~~~~~~~~~~~~~~~~~~~~~~
+
+Metrics are exposed directly to the Lua code as global variables instead of
+using a well-defined function signature. There is a global "mds" table, where
+each index is an MDS number (e.g., 0) and each value is a dictionary of metrics
+and values. The Lua code can grab metrics using something like this:
+
+::
+
+ mds[0]["queue_len"]
+
+
+This is in contrast to cls-lua in the OSDs, which has well-defined arguments
+(e.g., input/output bufferlists). Exposing the metrics directly makes it easier
+to add new metrics without having to change the API on the Lua side; we want
+the API to grow and shrink as we explore which metrics matter. The downside of
+this approach is that the person programming Lua balancer policies has to look
+at the Ceph source code to see which metrics are exposed. We figure that the
+Mantle developer will be in touch with MDS internals anyway.
+
+The metrics exposed to the Lua policy are the same ones that are already stored
+in mds_load_t: auth.meta_load(), all.meta_load(), req_rate, queue_length,
+cpu_load_avg.
+
+Compile/Execute the Balancer
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Here we use `lua_pcall` instead of `lua_call` because we want to handle errors
+in the MDBalancer. We do not want the error propagating up the call chain. The
+cls_lua class wants to handle the error itself because it must fail gracefully.
+For Mantle, we don't care if a Lua error crashes our balancer -- in that case,
+we will fall back to the original balancer.
+
+The performance improvement of using `lua_call` over `lua_pcall` would not be
+leveraged here because the balancer is invoked every 10 seconds by default.
+
+Returning Policy Decision to C++
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+We force the Lua policy engine to return a table of values, corresponding to
+the amount of load to send to each MDS. These loads are inserted directly into
+the MDBalancer "my_targets" vector. We do not allow the MDS to return a table
+of MDSs and metrics because we want the decision to be completely made on the
+Lua side.
+
+Iterating through tables returned by Lua is done through the stack. In Lua
+jargon: a dummy value is pushed onto the stack and the next iterator replaces
+the top of the stack with a (k, v) pair. After reading each value, pop that
+value but keep the key for the next call to `lua_next`.
+
+Reading from RADOS
+~~~~~~~~~~~~~~~~~~
+
+All MDSs will read balancing code from RADOS when the balancer version changes
+in the MDS Map. The balancer pulls the Lua code from RADOS synchronously. We do
+this with a timeout: if the asynchronous read does not come back within half
+the balancing tick interval the operation is cancelled and a Connection Timeout
+error is returned. By default, the balancing tick interval is 10 seconds, so
+Mantle will use a 5 second timeout. This design allows Mantle to
+immediately return an error if anything RADOS-related goes wrong.
+
+We use this implementation because we do not want to do a blocking OSD read
+from inside the global MDS lock. Doing so would bring down the MDS cluster if
+any of the OSDs are not responsive -- this is tested in the ceph-qa-suite by
+setting all OSDs to down/out and making sure the MDS cluster stays active.
+
+One approach would be to asynchronously fire the read when handling the MDS Map
+and fill in the Lua code in the background. We cannot do this because the MDS
+does not support daemon-local fallbacks and the balancer assumes that all MDSs
+come to the same decision at the same time (e.g., importers, exporters, etc.).
+
+Debugging
+~~~~~~~~~
+
+Logging in a Lua policy will appear in the MDS log. The syntax is the same as
+the cls logging interface:
+
+::
+
+ BAL_LOG(0, "this is a log message")
+
+
+It is implemented by passing a function that wraps the `dout` logging framework
+(`dout_wrapper`) to Lua with the `lua_register()` primitive. The Lua code is
+actually calling the `dout` function in C++.
+
+Warning and Info messages are centralized using the clog/Beacon. Successful
+messages are only sent on version changes by the first MDS to avoid spamming
+the `ceph -w` utility. These messages are used for the integration tests.
+
+Testing
+~~~~~~~
+
+Testing is done with the ceph-qa-suite (tasks.cephfs.test_mantle). We do not
+test invalid balancer logging and loading the actual Lua VM.
diff --git a/doc/cephfs/mds-config-ref.rst b/doc/cephfs/mds-config-ref.rst
new file mode 100644
index 00000000..b91a4424
--- /dev/null
+++ b/doc/cephfs/mds-config-ref.rst
@@ -0,0 +1,546 @@
+======================
+ MDS Config Reference
+======================
+
+``mds cache memory limit``
+
+:Description: The memory limit the MDS should enforce for its cache.
+ Administrators should use this instead of ``mds cache size``.
+:Type: 64-bit Integer Unsigned
+:Default: ``1073741824``
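+
+For example, the limit can be raised at runtime through the centralized
+configuration interface (the 4 GiB value below is illustrative only)::
+
+    ceph config set mds mds_cache_memory_limit 4294967296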
+
+``mds cache reservation``
+
+:Description: The cache reservation (memory or inodes) for the MDS cache to maintain.
+ Once the MDS begins dipping into its reservation, it will recall
+ client state until its cache size shrinks to restore the
+ reservation.
+:Type: Float
+:Default: ``0.05``
+
+``mds cache size``
+
+:Description: The number of inodes to cache. A value of 0 indicates an
+ unlimited number. It is recommended to use
+ ``mds_cache_memory_limit`` to limit the amount of memory the MDS
+ cache uses.
+:Type: 32-bit Integer
+:Default: ``0``
+
+``mds cache mid``
+
+:Description: The insertion point for new items in the cache LRU
+ (from the top).
+
+:Type: Float
+:Default: ``0.7``
+
+
+``mds dir commit ratio``
+
+:Description: The fraction of a directory that is dirty before Ceph commits using
+ a full update (instead of a partial update).
+
+:Type: Float
+:Default: ``0.5``
+
+
+``mds dir max commit size``
+
+:Description: The maximum size of a directory update before Ceph breaks it into
+ smaller transactions (MB).
+
+:Type: 32-bit Integer
+:Default: ``90``
+
+
+``mds decay halflife``
+
+:Description: The half-life of MDS cache temperature.
+:Type: Float
+:Default: ``5``
+
+``mds beacon interval``
+
+:Description: The frequency (in seconds) of beacon messages sent
+ to the monitor.
+
+:Type: Float
+:Default: ``4``
+
+
+``mds beacon grace``
+
+:Description: The interval without beacons before Ceph declares an MDS laggy
+ (and possibly replaces it).
+
+:Type: Float
+:Default: ``15``
+
+
+``mds blacklist interval``
+
+:Description: The blacklist duration for failed MDSs in the OSD map. Note,
+ this controls how long failed MDS daemons will stay in the
+ OSDMap blacklist. It has no effect on how long something is
+ blacklisted when the administrator blacklists it manually. For
+ example, ``ceph osd blacklist add`` will still use the default
+ blacklist time.
+:Type: Float
+:Default: ``24.0*60.0``
+
+
+``mds reconnect timeout``
+
+:Description: The interval (in seconds) to wait for clients to reconnect
+ during MDS restart.
+
+:Type: Float
+:Default: ``45``
+
+
+``mds tick interval``
+
+:Description: How frequently the MDS performs internal periodic tasks.
+:Type: Float
+:Default: ``5``
+
+
+``mds dirstat min interval``
+
+:Description: The minimum interval (in seconds) to try to avoid propagating
+ recursive stats up the tree.
+
+:Type: Float
+:Default: ``1``
+
+``mds scatter nudge interval``
+
+:Description: How quickly dirstat changes propagate up.
+:Type: Float
+:Default: ``5``
+
+
+``mds client prealloc inos``
+
+:Description: The number of inode numbers to preallocate per client session.
+:Type: 32-bit Integer
+:Default: ``1000``
+
+
+``mds early reply``
+
+:Description: Determines whether the MDS should allow clients to see request
+ results before they commit to the journal.
+
+:Type: Boolean
+:Default: ``true``
+
+
+``mds default dir hash``
+
+:Description: The function to use for hashing files across directory fragments.
+:Type: 32-bit Integer
+:Default: ``2`` (i.e., rjenkins)
+
+
+``mds log skip corrupt events``
+
+:Description: Determines whether the MDS should try to skip corrupt journal
+ events during journal replay.
+
+:Type: Boolean
+:Default: ``false``
+
+
+``mds log max events``
+
+:Description: The maximum number of events in the journal before we initiate trimming.
+ Set to ``-1`` to disable limits.
+
+:Type: 32-bit Integer
+:Default: ``-1``
+
+
+``mds log max segments``
+
+:Description: The maximum number of segments (objects) in the journal before
+ we initiate trimming. Set to ``-1`` to disable limits.
+
+:Type: 32-bit Integer
+:Default: ``128``
+
+
+``mds bal sample interval``
+
+:Description: Determines how frequently to sample directory temperature
+ (for fragmentation decisions).
+
+:Type: Float
+:Default: ``3``
+
+
+``mds bal replicate threshold``
+
+:Description: The maximum temperature before Ceph attempts to replicate
+ metadata to other nodes.
+
+:Type: Float
+:Default: ``8000``
+
+
+``mds bal unreplicate threshold``
+
+:Description: The minimum temperature before Ceph stops replicating
+ metadata to other nodes.
+
+:Type: Float
+:Default: ``0``
+
+
+``mds bal split size``
+
+:Description: The maximum directory size before the MDS will split a directory
+ fragment into smaller bits.
+
+:Type: 32-bit Integer
+:Default: ``10000``
+
+
+``mds bal split rd``
+
+:Description: The maximum directory read temperature before Ceph splits
+ a directory fragment.
+
+:Type: Float
+:Default: ``25000``
+
+
+``mds bal split wr``
+
+:Description: The maximum directory write temperature before Ceph splits
+ a directory fragment.
+
+:Type: Float
+:Default: ``10000``
+
+
+``mds bal split bits``
+
+:Description: The number of bits by which to split a directory fragment.
+:Type: 32-bit Integer
+:Default: ``3``
+
+
+``mds bal merge size``
+
+:Description: The minimum directory size before Ceph tries to merge
+ adjacent directory fragments.
+
+:Type: 32-bit Integer
+:Default: ``50``
+
+
+``mds bal interval``
+
+:Description: The frequency (in seconds) of workload exchanges between MDSs.
+:Type: 32-bit Integer
+:Default: ``10``
+
+
+``mds bal fragment interval``
+
+:Description: The delay (in seconds) between a fragment being eligible for split
+ or merge and executing the fragmentation change.
+:Type: 32-bit Integer
+:Default: ``5``
+
+
+``mds bal fragment fast factor``
+
+:Description: The ratio by which frags may exceed the split size before
+ a split is executed immediately (skipping the fragment interval).
+:Type: Float
+:Default: ``1.5``
+
+``mds bal fragment size max``
+
+:Description: The maximum size of a fragment before any new entries
+ are rejected with ENOSPC.
+:Type: 32-bit Integer
+:Default: ``100000``
+
+``mds bal idle threshold``
+
+:Description: The minimum temperature before Ceph migrates a subtree
+ back to its parent.
+
+:Type: Float
+:Default: ``0``
+
+
+``mds bal max``
+
+:Description: The number of iterations to run the balancer before Ceph stops.
+ (used for testing purposes only)
+
+:Type: 32-bit Integer
+:Default: ``-1``
+
+
+``mds bal max until``
+
+:Description: The number of seconds to run the balancer before Ceph stops.
+ (used for testing purposes only)
+
+:Type: 32-bit Integer
+:Default: ``-1``
+
+
+``mds bal mode``
+
+:Description: The method for calculating MDS load.
+
+ - ``0`` = Hybrid.
+ - ``1`` = Request rate and latency.
+ - ``2`` = CPU load.
+
+:Type: 32-bit Integer
+:Default: ``0``
+
+
+``mds bal min rebalance``
+
+:Description: The minimum subtree temperature before Ceph migrates.
+:Type: Float
+:Default: ``0.1``
+
+
+``mds bal min start``
+
+:Description: The minimum subtree temperature before Ceph searches a subtree.
+:Type: Float
+:Default: ``0.2``
+
+
+``mds bal need min``
+
+:Description: The minimum fraction of target subtree size to accept.
+:Type: Float
+:Default: ``0.8``
+
+
+``mds bal need max``
+
+:Description: The maximum fraction of target subtree size to accept.
+:Type: Float
+:Default: ``1.2``
+
+
+``mds bal midchunk``
+
+:Description: Ceph will migrate any subtree that is larger than this fraction
+ of the target subtree size.
+
+:Type: Float
+:Default: ``0.3``
+
+
+``mds bal minchunk``
+
+:Description: Ceph will ignore any subtree that is smaller than this fraction
+ of the target subtree size.
+
+:Type: Float
+:Default: ``0.001``
+
+
+``mds bal target removal min``
+
+:Description: The minimum number of balancer iterations before Ceph removes
+ an old MDS target from the MDS map.
+
+:Type: 32-bit Integer
+:Default: ``5``
+
+
+``mds bal target removal max``
+
+:Description: The maximum number of balancer iterations before Ceph removes
+ an old MDS target from the MDS map.
+
+:Type: 32-bit Integer
+:Default: ``10``
+
+
+``mds replay interval``
+
+:Description: The journal poll interval when in standby-replay mode.
+ ("hot standby")
+
+:Type: Float
+:Default: ``1``
+
+
+``mds shutdown check``
+
+:Description: The interval for polling the cache during MDS shutdown.
+:Type: 32-bit Integer
+:Default: ``0``
+
+
+``mds thrash exports``
+
+:Description: Ceph will randomly export subtrees between nodes (testing only).
+:Type: 32-bit Integer
+:Default: ``0``
+
+
+``mds thrash fragments``
+
+:Description: Ceph will randomly fragment or merge directories.
+:Type: 32-bit Integer
+:Default: ``0``
+
+
+``mds dump cache on map``
+
+:Description: Ceph will dump the MDS cache contents to a file on each MDSMap.
+:Type: Boolean
+:Default: ``false``
+
+
+``mds dump cache after rejoin``
+
+:Description: Ceph will dump MDS cache contents to a file after
+ rejoining the cache (during recovery).
+
+:Type: Boolean
+:Default: ``false``
+
+
+``mds verify scatter``
+
+:Description: Ceph will assert that various scatter/gather invariants
+ are ``true`` (developers only).
+
+:Type: Boolean
+:Default: ``false``
+
+
+``mds debug scatterstat``
+
+:Description: Ceph will assert that various recursive stat invariants
+ are ``true`` (for developers only).
+
+:Type: Boolean
+:Default: ``false``
+
+
+``mds debug frag``
+
+:Description: Ceph will verify directory fragmentation invariants
+ when convenient (developers only).
+
+:Type: Boolean
+:Default: ``false``
+
+
+``mds debug auth pins``
+
+:Description: The debug auth pin invariants (for developers only).
+:Type: Boolean
+:Default: ``false``
+
+
+``mds debug subtrees``
+
+:Description: The debug subtree invariants (for developers only).
+:Type: Boolean
+:Default: ``false``
+
+
+``mds kill mdstable at``
+
+:Description: Ceph will inject MDS failure in MDSTable code
+ (for developers only).
+
+:Type: 32-bit Integer
+:Default: ``0``
+
+
+``mds kill export at``
+
+:Description: Ceph will inject MDS failure in the subtree export code
+ (for developers only).
+
+:Type: 32-bit Integer
+:Default: ``0``
+
+
+``mds kill import at``
+
+:Description: Ceph will inject MDS failure in the subtree import code
+ (for developers only).
+
+:Type: 32-bit Integer
+:Default: ``0``
+
+
+``mds kill link at``
+
+:Description: Ceph will inject MDS failure in hard link code
+ (for developers only).
+
+:Type: 32-bit Integer
+:Default: ``0``
+
+
+``mds kill rename at``
+
+:Description: Ceph will inject MDS failure in the rename code
+ (for developers only).
+
+:Type: 32-bit Integer
+:Default: ``0``
+
+
+``mds wipe sessions``
+
+:Description: Ceph will delete all client sessions on startup
+ (for testing only).
+
+:Type: Boolean
+:Default: ``false``
+
+
+``mds wipe ino prealloc``
+
+:Description: Ceph will delete ino preallocation metadata on startup
+ (for testing only).
+
+:Type: Boolean
+:Default: ``false``
+
+
+``mds skip ino``
+
+:Description: The number of inode numbers to skip on startup
+ (for testing only).
+
+:Type: 32-bit Integer
+:Default: ``0``
+
+
+``mds min caps per client``
+
+:Description: Set the minimum number of capabilities a client may hold.
+:Type: Integer
+:Default: ``100``
+
+
+``mds max ratio caps per client``
+
+:Description: Set the maximum ratio of current caps that may be recalled during MDS cache pressure.
+:Type: Float
+:Default: ``0.8``
diff --git a/doc/cephfs/mds-state-diagram.dot b/doc/cephfs/mds-state-diagram.dot
new file mode 100644
index 00000000..dee82506
--- /dev/null
+++ b/doc/cephfs/mds-state-diagram.dot
@@ -0,0 +1,71 @@
+digraph {
+
+node [shape=circle,style=unfilled,fixedsize=true,width=2.0]
+
+node [color=blue,peripheries=1];
+N0 [label="up:boot"]
+
+node [color=orange,peripheries=2];
+N1 [label="up:creating"]
+N0 -> N1 [color=orange,penwidth=2.0];
+N2 [label="up:starting"]
+N0 -> N2 [color=orange,penwidth=2.0];
+N3 [label="up:replay"]
+N0 -> N3 [color=orange,penwidth=2.0];
+N4 [label="up:resolve"]
+N3 -> N4 [color=orange,penwidth=2.0];
+N5 [label="up:reconnect"]
+N3 -> N5 [color=orange,penwidth=2.0];
+N4 -> N5 [color=orange,penwidth=2.0];
+N6 [label="up:rejoin"]
+N5 -> N6 [color=orange,penwidth=2.0];
+N7 [label="up:clientreplay"]
+N6 -> N7 [color=orange,penwidth=2.0];
+
+node [color=green,peripheries=2];
+S0 [label="up:active"]
+N7 -> S0 [color=green,penwidth=2.0];
+N1 -> S0 [color=green,penwidth=2.0];
+N2 -> S0 [color=green,penwidth=2.0];
+N6 -> S0 [color=green,penwidth=2.0];
+node [color=green,peripheries=1];
+S1 [label="up:standby"]
+N0 -> S1 [color=green,penwidth=2.0];
+S2 [label="up:standby_replay"]
+N0 -> S2 [color=green,penwidth=2.0];
+
+// going down but still accessible by clients
+node [color=purple,peripheries=2];
+S3 [label="up:stopping"]
+S0 -> S3 [color=purple,penwidth=2.0];
+
+// terminal (but "in")
+node [shape=polygon,sides=6,color=red,peripheries=2];
+D0 [label="down:failed"]
+N2 -> D0 [color=red,penwidth=2.0];
+N3 -> D0 [color=red,penwidth=2.0];
+N4 -> D0 [color=red,penwidth=2.0];
+N5 -> D0 [color=red,penwidth=2.0];
+N6 -> D0 [color=red,penwidth=2.0];
+N7 -> D0 [color=red,penwidth=2.0];
+S0 -> D0 [color=red,penwidth=2.0];
+S3 -> D0 [color=red,penwidth=2.0];
+D0 -> N3 [color=red,penwidth=2.0];
+
+// terminal (but not "in")
+node [shape=polygon,sides=6,color=black,peripheries=1];
+D1 [label="down:damaged"]
+N3 -> D1 [color=black,penwidth=2.0];
+N4 -> D1 [color=black,penwidth=2.0];
+N5 -> D1 [color=black,penwidth=2.0];
+N6 -> D1 [color=black,penwidth=2.0];
+N7 -> D1 [color=black,penwidth=2.0];
+S0 -> D1 [color=black,penwidth=2.0];
+S3 -> D1 [color=black,penwidth=2.0];
+D1 -> D0 [color=red,penwidth=2.0]
+
+node [shape=polygon,sides=6,color=purple,peripheries=1];
+D3 [label="down:stopped"]
+S3 -> D3 [color=purple,penwidth=2.0];
+
+}
diff --git a/doc/cephfs/mds-state-diagram.svg b/doc/cephfs/mds-state-diagram.svg
new file mode 100644
index 00000000..6c3127a3
--- /dev/null
+++ b/doc/cephfs/mds-state-diagram.svg
@@ -0,0 +1,311 @@
+<?xml version="1.0" encoding="UTF-8" standalone="no"?>
+<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN"
+ "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
+<!-- Generated by graphviz version 2.40.1 (20161225.0304)
+ -->
+<!-- Title: %3 Pages: 1 -->
+<svg width="783pt" height="1808pt"
+ viewBox="0.00 0.00 783.00 1808.00" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink">
+<g id="graph0" class="graph" transform="scale(1 1) rotate(0) translate(4 1804)">
+<title>%3</title>
+<polygon fill="#ffffff" stroke="transparent" points="-4,4 -4,-1804 779,-1804 779,4 -4,4"/>
+<!-- N0 -->
+<g id="node1" class="node">
+<title>N0</title>
+<ellipse fill="none" stroke="#0000ff" cx="375" cy="-1728" rx="72" ry="72"/>
+<text text-anchor="middle" x="375" y="-1724.3" font-family="Times,serif" font-size="14.00" fill="#000000">up:boot</text>
+</g>
+<!-- N1 -->
+<g id="node2" class="node">
+<title>N1</title>
+<ellipse fill="none" stroke="#ffa500" cx="375" cy="-1544" rx="72" ry="72"/>
+<ellipse fill="none" stroke="#ffa500" cx="375" cy="-1544" rx="76" ry="76"/>
+<text text-anchor="middle" x="375" y="-1540.3" font-family="Times,serif" font-size="14.00" fill="#000000">up:creating</text>
+</g>
+<!-- N0&#45;&gt;N1 -->
+<g id="edge1" class="edge">
+<title>N0&#45;&gt;N1</title>
+<path fill="none" stroke="#ffa500" stroke-width="2" d="M375,-1655.8064C375,-1647.5034 375,-1638.9744 375,-1630.5077"/>
+<polygon fill="#ffa500" stroke="#ffa500" stroke-width="2" points="378.5001,-1630.2303 375,-1620.2304 371.5001,-1630.2304 378.5001,-1630.2303"/>
+</g>
+<!-- N2 -->
+<g id="node3" class="node">
+<title>N2</title>
+<ellipse fill="none" stroke="#ffa500" cx="205" cy="-1544" rx="72" ry="72"/>
+<ellipse fill="none" stroke="#ffa500" cx="205" cy="-1544" rx="76" ry="76"/>
+<text text-anchor="middle" x="205" y="-1540.3" font-family="Times,serif" font-size="14.00" fill="#000000">up:starting</text>
+</g>
+<!-- N0&#45;&gt;N2 -->
+<g id="edge2" class="edge">
+<title>N0&#45;&gt;N2</title>
+<path fill="none" stroke="#ffa500" stroke-width="2" d="M325.829,-1674.7796C306.3584,-1653.7056 283.8178,-1629.3086 263.5164,-1607.3354"/>
+<polygon fill="#ffa500" stroke="#ffa500" stroke-width="2" points="266.082,-1604.9547 256.7251,-1599.9848 260.9405,-1609.705 266.082,-1604.9547"/>
+</g>
+<!-- N3 -->
+<g id="node4" class="node">
+<title>N3</title>
+<ellipse fill="none" stroke="#ffa500" cx="98" cy="-900" rx="72" ry="72"/>
+<ellipse fill="none" stroke="#ffa500" cx="98" cy="-900" rx="76" ry="76"/>
+<text text-anchor="middle" x="98" y="-896.3" font-family="Times,serif" font-size="14.00" fill="#000000">up:replay</text>
+</g>
+<!-- N0&#45;&gt;N3 -->
+<g id="edge3" class="edge">
+<title>N0&#45;&gt;N3</title>
+<path fill="none" stroke="#ffa500" stroke-width="2" d="M303.6813,-1718.116C244.3606,-1705.8591 162.9365,-1678.8371 120,-1620 50.549,-1524.8294 96,-1473.8172 96,-1356 96,-1356 96,-1356 96,-1168 96,-1107.3668 96.5182,-1039.0195 97.0271,-986.5224"/>
+<polygon fill="#ffa500" stroke="#ffa500" stroke-width="2" points="100.5296,-986.2755 97.1286,-976.2414 93.53,-986.2063 100.5296,-986.2755"/>
+</g>
+<!-- S1 -->
+<g id="node10" class="node">
+<title>S1</title>
+<ellipse fill="none" stroke="#00ff00" cx="541" cy="-1544" rx="72" ry="72"/>
+<text text-anchor="middle" x="541" y="-1540.3" font-family="Times,serif" font-size="14.00" fill="#000000">up:standby</text>
+</g>
+<!-- N0&#45;&gt;S1 -->
+<g id="edge13" class="edge">
+<title>N0&#45;&gt;S1</title>
+<path fill="none" stroke="#00ff00" stroke-width="2" d="M423.4603,-1674.285C443.0634,-1652.5563 465.795,-1627.3597 486.014,-1604.9483"/>
+<polygon fill="#00ff00" stroke="#00ff00" stroke-width="2" points="488.6666,-1607.2332 492.7664,-1597.4637 483.4691,-1602.5441 488.6666,-1607.2332"/>
+</g>
+<!-- S2 -->
+<g id="node11" class="node">
+<title>S2</title>
+<ellipse fill="none" stroke="#00ff00" cx="703" cy="-1544" rx="72" ry="72"/>
+<text text-anchor="middle" x="703" y="-1540.3" font-family="Times,serif" font-size="14.00" fill="#000000">up:standby_replay</text>
+</g>
+<!-- N0&#45;&gt;S2 -->
+<g id="edge14" class="edge">
+<title>N0&#45;&gt;S2</title>
+<path fill="none" stroke="#00ff00" stroke-width="2" d="M443.4232,-1705.2759C494.8612,-1686.5285 565.831,-1657.0318 622,-1620 629.8031,-1614.8555 637.5551,-1608.9346 644.9954,-1602.7159"/>
+<polygon fill="#00ff00" stroke="#00ff00" stroke-width="2" points="647.6445,-1605.0534 652.9345,-1595.8736 643.0746,-1599.7509 647.6445,-1605.0534"/>
+</g>
+<!-- S0 -->
+<g id="node9" class="node">
+<title>S0</title>
+<ellipse fill="none" stroke="#00ff00" cx="375" cy="-1356" rx="72" ry="72"/>
+<ellipse fill="none" stroke="#00ff00" cx="375" cy="-1356" rx="76" ry="76"/>
+<text text-anchor="middle" x="375" y="-1352.3" font-family="Times,serif" font-size="14.00" fill="#000000">up:active</text>
+</g>
+<!-- N1&#45;&gt;S0 -->
+<g id="edge10" class="edge">
+<title>N1&#45;&gt;S0</title>
+<path fill="none" stroke="#00ff00" stroke-width="2" d="M375,-1467.8042C375,-1459.4826 375,-1450.9678 375,-1442.5337"/>
+<polygon fill="#00ff00" stroke="#00ff00" stroke-width="2" points="378.5001,-1442.3042 375,-1432.3043 371.5001,-1442.3043 378.5001,-1442.3042"/>
+</g>
+<!-- N2&#45;&gt;S0 -->
+<g id="edge11" class="edge">
+<title>N2&#45;&gt;S0</title>
+<path fill="none" stroke="#00ff00" stroke-width="2" d="M256.0056,-1487.5938C275.1652,-1466.4055 297.0838,-1442.1662 316.8451,-1420.3125"/>
+<polygon fill="#00ff00" stroke="#00ff00" stroke-width="2" points="319.6258,-1422.4558 323.7368,-1412.691 314.4337,-1417.7608 319.6258,-1422.4558"/>
+</g>
+<!-- D0 -->
+<g id="node13" class="node">
+<title>D0</title>
+<polygon fill="none" stroke="#ff0000" points="276.9505,-1034 240.9752,-1052 169.0248,-1052 133.0495,-1034 169.0248,-1016 240.9752,-1016 276.9505,-1034"/>
+<polygon fill="none" stroke="#ff0000" points="285.8886,-1034 241.9189,-1056 168.0811,-1056 124.1114,-1034 168.0811,-1012 241.9189,-1012 285.8886,-1034"/>
+<text text-anchor="middle" x="205" y="-1030.3" font-family="Times,serif" font-size="14.00" fill="#000000">down:failed</text>
+</g>
+<!-- N2&#45;&gt;D0 -->
+<g id="edge16" class="edge">
+<title>N2&#45;&gt;D0</title>
+<path fill="none" stroke="#ff0000" stroke-width="2" d="M205,-1467.9234C205,-1354.6806 205,-1146.5379 205,-1066.5209"/>
+<polygon fill="#ff0000" stroke="#ff0000" stroke-width="2" points="208.5001,-1066.1037 205,-1056.1037 201.5001,-1066.1037 208.5001,-1066.1037"/>
+</g>
+<!-- N4 -->
+<g id="node5" class="node">
+<title>N4</title>
+<ellipse fill="none" stroke="#ffa500" cx="142" cy="-712" rx="72" ry="72"/>
+<ellipse fill="none" stroke="#ffa500" cx="142" cy="-712" rx="76" ry="76"/>
+<text text-anchor="middle" x="142" y="-708.3" font-family="Times,serif" font-size="14.00" fill="#000000">up:resolve</text>
+</g>
+<!-- N3&#45;&gt;N4 -->
+<g id="edge4" class="edge">
+<title>N3&#45;&gt;N4</title>
+<path fill="none" stroke="#ffa500" stroke-width="2" d="M115.3268,-825.9673C117.6174,-816.1801 119.9789,-806.0901 122.3056,-796.1487"/>
+<polygon fill="#ffa500" stroke="#ffa500" stroke-width="2" points="125.7458,-796.8081 124.6168,-786.2736 118.93,-795.2128 125.7458,-796.8081"/>
+</g>
+<!-- N5 -->
+<g id="node6" class="node">
+<title>N5</title>
+<ellipse fill="none" stroke="#ffa500" cx="180" cy="-524" rx="72" ry="72"/>
+<ellipse fill="none" stroke="#ffa500" cx="180" cy="-524" rx="76" ry="76"/>
+<text text-anchor="middle" x="180" y="-520.3" font-family="Times,serif" font-size="14.00" fill="#000000">up:reconnect</text>
+</g>
+<!-- N3&#45;&gt;N5 -->
+<g id="edge5" class="edge">
+<title>N3&#45;&gt;N5</title>
+<path fill="none" stroke="#ffa500" stroke-width="2" d="M66.8389,-830.5319C46.9319,-775.6268 29.5354,-698.153 57,-636 68.0235,-611.0535 87.6518,-589.2711 108.0365,-571.7132"/>
+<polygon fill="#ffa500" stroke="#ffa500" stroke-width="2" points="110.423,-574.2805 115.8697,-565.193 105.9448,-568.9004 110.423,-574.2805"/>
+</g>
+<!-- N3&#45;&gt;D0 -->
+<g id="edge17" class="edge">
+<title>N3&#45;&gt;D0</title>
+<path fill="none" stroke="#ff0000" stroke-width="2" d="M149.6142,-956.282C162.8016,-972.5015 176.1362,-989.5814 186.4752,-1003.5856"/>
+<polygon fill="#ff0000" stroke="#ff0000" stroke-width="2" points="183.7873,-1005.8407 192.4974,-1011.8726 189.45,-1001.7255 183.7873,-1005.8407"/>
+</g>
+<!-- D1 -->
+<g id="node14" class="node">
+<title>D1</title>
+<polygon fill="none" stroke="#000000" points="326,-18 290,-36 218,-36 182,-18 218,0 290,0 326,-18"/>
+<text text-anchor="middle" x="254" y="-14.3" font-family="Times,serif" font-size="14.00" fill="#000000">down:damaged</text>
+</g>
+<!-- N3&#45;&gt;D1 -->
+<g id="edge25" class="edge">
+<title>N3&#45;&gt;D1</title>
+<path fill="none" stroke="#000000" stroke-width="2" d="M58.9011,-834.7037C51.302,-819.7688 44.1186,-803.6682 39,-788 2.1684,-675.257 0,-642.6067 0,-524 0,-524 0,-524 0,-336 0,-208.8889 18.5148,-160.2479 110,-72 131.8215,-50.9507 162.6171,-37.9503 190.132,-29.9999"/>
+<polygon fill="#000000" stroke="#000000" stroke-width="2" points="191.3646,-33.292 200.0979,-27.2937 189.5302,-26.5367 191.3646,-33.292"/>
+</g>
+<!-- N4&#45;&gt;N5 -->
+<g id="edge6" class="edge">
+<title>N4&#45;&gt;N5</title>
+<path fill="none" stroke="#ffa500" stroke-width="2" d="M157.0732,-637.4274C158.9802,-627.9927 160.9424,-618.2849 162.8783,-608.7071"/>
+<polygon fill="#ffa500" stroke="#ffa500" stroke-width="2" points="166.3322,-609.2852 164.8829,-598.79 159.471,-607.8983 166.3322,-609.2852"/>
+</g>
+<!-- N4&#45;&gt;D0 -->
+<g id="edge18" class="edge">
+<title>N4&#45;&gt;D0</title>
+<path fill="none" stroke="#ff0000" stroke-width="2" d="M170.8211,-782.5993C175.5042,-796.1615 179.8456,-810.3838 183,-824 197.3463,-885.9266 202.3514,-960.3122 204.0876,-1001.8649"/>
+<polygon fill="#ff0000" stroke="#ff0000" stroke-width="2" points="200.5932,-1002.0802 204.4707,-1011.94 207.5881,-1001.8142 200.5932,-1002.0802"/>
+</g>
+<!-- N4&#45;&gt;D1 -->
+<g id="edge26" class="edge">
+<title>N4&#45;&gt;D1</title>
+<path fill="none" stroke="#000000" stroke-width="2" d="M108.3518,-643.7605C102.8165,-629.615 97.934,-614.574 95,-600 81.6674,-533.7732 87.5907,-515.148 95,-448 113.78,-277.8024 84.6917,-214.919 179,-72 186.8932,-60.0384 198.4086,-49.85 210.0577,-41.6536"/>
+<polygon fill="#000000" stroke="#000000" stroke-width="2" points="212.1179,-44.487 218.5139,-36.0405 208.2466,-38.6549 212.1179,-44.487"/>
+</g>
+<!-- N6 -->
+<g id="node7" class="node">
+<title>N6</title>
+<ellipse fill="none" stroke="#ffa500" cx="334" cy="-336" rx="72" ry="72"/>
+<ellipse fill="none" stroke="#ffa500" cx="334" cy="-336" rx="76" ry="76"/>
+<text text-anchor="middle" x="334" y="-332.3" font-family="Times,serif" font-size="14.00" fill="#000000">up:rejoin</text>
+</g>
+<!-- N5&#45;&gt;N6 -->
+<g id="edge7" class="edge">
+<title>N5&#45;&gt;N6</title>
+<path fill="none" stroke="#ffa500" stroke-width="2" d="M228.304,-465.0314C244.4236,-445.353 262.5011,-423.2844 279.0854,-403.0386"/>
+<polygon fill="#ffa500" stroke="#ffa500" stroke-width="2" points="281.9735,-405.0361 285.6028,-395.0823 276.5583,-400.6003 281.9735,-405.0361"/>
+</g>
+<!-- N5&#45;&gt;D0 -->
+<g id="edge19" class="edge">
+<title>N5&#45;&gt;D0</title>
+<path fill="none" stroke="#ff0000" stroke-width="2" d="M213.6482,-592.2395C219.1835,-606.385 224.066,-621.426 227,-636 253.9006,-769.6228 226.3843,-933.2287 212.2624,-1001.5818"/>
+<polygon fill="#ff0000" stroke="#ff0000" stroke-width="2" points="208.7642,-1001.2075 210.1213,-1011.715 215.613,-1002.6546 208.7642,-1001.2075"/>
+</g>
+<!-- N5&#45;&gt;D1 -->
+<g id="edge27" class="edge">
+<title>N5&#45;&gt;D1</title>
+<path fill="none" stroke="#000000" stroke-width="2" d="M174.0783,-447.8601C169.6026,-355.8973 170.5717,-197.8787 216,-72 219.5236,-62.2364 225.3085,-52.624 231.3143,-44.2773"/>
+<polygon fill="#000000" stroke="#000000" stroke-width="2" points="234.2827,-46.1569 237.5673,-36.0841 228.7181,-41.91 234.2827,-46.1569"/>
+</g>
+<!-- N7 -->
+<g id="node8" class="node">
+<title>N7</title>
+<ellipse fill="none" stroke="#ffa500" cx="401" cy="-148" rx="72" ry="72"/>
+<ellipse fill="none" stroke="#ffa500" cx="401" cy="-148" rx="76" ry="76"/>
+<text text-anchor="middle" x="401" y="-144.3" font-family="Times,serif" font-size="14.00" fill="#000000">up:clientreplay</text>
+</g>
+<!-- N6&#45;&gt;N7 -->
+<g id="edge8" class="edge">
+<title>N6&#45;&gt;N7</title>
+<path fill="none" stroke="#ffa500" stroke-width="2" d="M359.521,-264.389C363.5946,-252.9584 367.8354,-241.0588 371.9829,-229.4212"/>
+<polygon fill="#ffa500" stroke="#ffa500" stroke-width="2" points="375.3789,-230.3179 375.4391,-219.7232 368.7851,-227.9679 375.3789,-230.3179"/>
+</g>
+<!-- N6&#45;&gt;S0 -->
+<g id="edge12" class="edge">
+<title>N6&#45;&gt;S0</title>
+<path fill="none" stroke="#00ff00" stroke-width="2" d="M374.6915,-400.4499C416.6522,-473.2042 476,-596.5177 476,-712 476,-1034 476,-1034 476,-1034 476,-1127.7146 490.1099,-1156.3293 457,-1244 450.8791,-1260.2075 441.7293,-1276.1446 431.7345,-1290.685"/>
+<polygon fill="#00ff00" stroke="#00ff00" stroke-width="2" points="428.7105,-1288.8981 425.7686,-1299.0763 434.4156,-1292.9542 428.7105,-1288.8981"/>
+</g>
+<!-- N6&#45;&gt;D0 -->
+<g id="edge20" class="edge">
+<title>N6&#45;&gt;D0</title>
+<path fill="none" stroke="#ff0000" stroke-width="2" d="M350.4852,-410.3817C374.2134,-536.5952 404.3269,-796.4598 298,-976 289.2477,-990.7789 275.1051,-1002.3467 260.4616,-1011.1304"/>
+<polygon fill="#ff0000" stroke="#ff0000" stroke-width="2" points="258.3409,-1008.3063 251.323,-1016.2436 261.7588,-1014.4152 258.3409,-1008.3063"/>
+</g>
+<!-- N6&#45;&gt;D1 -->
+<g id="edge28" class="edge">
+<title>N6&#45;&gt;D1</title>
+<path fill="none" stroke="#000000" stroke-width="2" d="M315.4453,-262.2451C298.1087,-193.3322 273.1414,-94.0872 261.0685,-46.0972"/>
+<polygon fill="#000000" stroke="#000000" stroke-width="2" points="264.3934,-44.9675 258.5594,-36.1235 257.6049,-46.6753 264.3934,-44.9675"/>
+</g>
+<!-- N7&#45;&gt;S0 -->
+<g id="edge9" class="edge">
+<title>N7&#45;&gt;S0</title>
+<path fill="none" stroke="#00ff00" stroke-width="2" d="M443.6651,-211.1162C488.3175,-283.4252 552,-407.0513 552,-524 552,-1034 552,-1034 552,-1034 552,-1132.382 532.7639,-1159.7264 482,-1244 470.3882,-1263.2768 454.9343,-1281.9188 439.391,-1298.2812"/>
+<polygon fill="#00ff00" stroke="#00ff00" stroke-width="2" points="436.7092,-1296.0189 432.2566,-1305.6327 441.7326,-1300.8939 436.7092,-1296.0189"/>
+</g>
+<!-- N7&#45;&gt;D0 -->
+<g id="edge21" class="edge">
+<title>N7&#45;&gt;D0</title>
+<path fill="none" stroke="#ff0000" stroke-width="2" d="M414.0091,-223.1818C415.8671,-235.4615 417.6144,-248.0669 419,-260 432.5684,-376.8517 438,-406.3632 438,-524 438,-712 438,-712 438,-712 438,-837.0027 427.1077,-885.3846 341,-976 322.1123,-995.8765 295.3079,-1009.2638 270.482,-1018.1167"/>
+<polygon fill="#ff0000" stroke="#ff0000" stroke-width="2" points="269.0809,-1014.8955 260.7233,-1021.4071 271.3175,-1021.5286 269.0809,-1014.8955"/>
+</g>
+<!-- N7&#45;&gt;D1 -->
+<g id="edge29" class="edge">
+<title>N7&#45;&gt;D1</title>
+<path fill="none" stroke="#000000" stroke-width="2" d="M343.745,-97.3664C322.6326,-78.6955 299.6377,-58.3599 282.21,-42.9476"/>
+<polygon fill="#000000" stroke="#000000" stroke-width="2" points="284.4307,-40.2392 274.6212,-36.2364 279.7935,-45.4829 284.4307,-40.2392"/>
+</g>
+<!-- S3 -->
+<g id="node12" class="node">
+<title>S3</title>
+<ellipse fill="none" stroke="#a020f0" cx="372" cy="-1168" rx="72" ry="72"/>
+<ellipse fill="none" stroke="#a020f0" cx="372" cy="-1168" rx="76" ry="76"/>
+<text text-anchor="middle" x="372" y="-1164.3" font-family="Times,serif" font-size="14.00" fill="#000000">up:stopping</text>
+</g>
+<!-- S0&#45;&gt;S3 -->
+<g id="edge15" class="edge">
+<title>S0&#45;&gt;S3</title>
+<path fill="none" stroke="#a020f0" stroke-width="2" d="M373.7841,-1279.8042C373.6487,-1271.318 373.5101,-1262.6309 373.3729,-1254.0333"/>
+<polygon fill="#a020f0" stroke="#a020f0" stroke-width="2" points="376.872,-1253.9418 373.2127,-1243.9989 369.8728,-1254.0536 376.872,-1253.9418"/>
+</g>
+<!-- S0&#45;&gt;D0 -->
+<g id="edge22" class="edge">
+<title>S0&#45;&gt;D0</title>
+<path fill="none" stroke="#ff0000" stroke-width="2" d="M324.7658,-1298.7259C311.4118,-1281.8155 297.7347,-1262.7868 287,-1244 253.0065,-1184.5078 227.1586,-1108.1765 214.2126,-1065.8585"/>
+<polygon fill="#ff0000" stroke="#ff0000" stroke-width="2" points="217.5124,-1064.678 211.2761,-1056.113 210.8101,-1066.6976 217.5124,-1064.678"/>
+</g>
+<!-- S0&#45;&gt;D1 -->
+<g id="edge30" class="edge">
+<title>S0&#45;&gt;D1</title>
+<path fill="none" stroke="#000000" stroke-width="2" d="M442.7372,-1320.9248C473.4011,-1301.9277 507.694,-1275.8005 530,-1244 585.7414,-1164.5323 590,-1131.0681 590,-1034 590,-1034 590,-1034 590,-336 590,-214.8369 609.9264,-155.3634 522,-72 494.3419,-45.7772 395.9102,-31.2035 326.3899,-23.9832"/>
+<polygon fill="#000000" stroke="#000000" stroke-width="2" points="326.6553,-20.4923 316.3544,-22.9708 325.9526,-27.4569 326.6553,-20.4923"/>
+</g>
+<!-- S3&#45;&gt;D0 -->
+<g id="edge23" class="edge">
+<title>S3&#45;&gt;D0</title>
+<path fill="none" stroke="#ff0000" stroke-width="2" d="M312.6518,-1120.3793C288.6501,-1101.1204 261.7353,-1079.5241 240.8219,-1062.7433"/>
+<polygon fill="#ff0000" stroke="#ff0000" stroke-width="2" points="242.8216,-1059.8605 232.8316,-1056.3319 238.4408,-1065.3202 242.8216,-1059.8605"/>
+</g>
+<!-- S3&#45;&gt;D1 -->
+<g id="edge31" class="edge">
+<title>S3&#45;&gt;D1</title>
+<path fill="none" stroke="#000000" stroke-width="2" d="M422.8642,-1111.3791C463.6096,-1060.0328 514,-980.2995 514,-900 514,-900 514,-900 514,-336 514,-218.0086 564.146,-160.4035 486,-72 464.8015,-48.019 384.4146,-33.2954 324.1685,-25.4019"/>
+<polygon fill="#000000" stroke="#000000" stroke-width="2" points="324.2918,-21.8895 313.9302,-24.1001 323.4088,-28.8336 324.2918,-21.8895"/>
+</g>
+<!-- D3 -->
+<g id="node15" class="node">
+<title>D3</title>
+<polygon fill="none" stroke="#a020f0" points="448,-1034 412,-1052 340,-1052 304,-1034 340,-1016 412,-1016 448,-1034"/>
+<text text-anchor="middle" x="376" y="-1030.3" font-family="Times,serif" font-size="14.00" fill="#000000">down:stopped</text>
+</g>
+<!-- S3&#45;&gt;D3 -->
+<g id="edge33" class="edge">
+<title>S3&#45;&gt;D3</title>
+<path fill="none" stroke="#a020f0" stroke-width="2" d="M374.2688,-1091.995C374.5847,-1081.4121 374.8915,-1071.1346 375.1569,-1062.2444"/>
+<polygon fill="#a020f0" stroke="#a020f0" stroke-width="2" points="378.6578,-1062.2611 375.4579,-1052.1611 371.661,-1062.0522 378.6578,-1062.2611"/>
+</g>
+<!-- D0&#45;&gt;N3 -->
+<g id="edge24" class="edge">
+<title>D0&#45;&gt;N3</title>
+<path fill="none" stroke="#ff0000" stroke-width="2" d="M182.1648,-1011.8726C171.9859,-1000.3518 159.6454,-985.5549 147.5698,-970.521"/>
+<polygon fill="#ff0000" stroke="#ff0000" stroke-width="2" points="150.2651,-968.2872 141.2921,-962.6537 144.7935,-972.6532 150.2651,-968.2872"/>
+</g>
+<!-- D1&#45;&gt;D0 -->
+<g id="edge32" class="edge">
+<title>D1&#45;&gt;D0</title>
+<path fill="none" stroke="#ff0000" stroke-width="2" d="M253.4952,-36.2127C252.406,-76.5308 249.8638,-176.379 249,-260 248.3022,-327.552 234.976,-345.9161 249,-412 252.6347,-429.1277 261.3653,-430.8723 265,-448 279.024,-514.0839 267.0633,-532.476 265,-600 259.8865,-767.3454 293.786,-816.7869 242,-976 238.9016,-985.5258 233.8623,-995.035 228.4863,-1003.5163"/>
+<polygon fill="#ff0000" stroke="#ff0000" stroke-width="2" points="225.4987,-1001.687 222.8293,-1011.94 231.31,-1005.5896 225.4987,-1001.687"/>
+</g>
+</g>
+</svg>
diff --git a/doc/cephfs/mds-states.rst b/doc/cephfs/mds-states.rst
new file mode 100644
index 00000000..ecd5686c
--- /dev/null
+++ b/doc/cephfs/mds-states.rst
@@ -0,0 +1,227 @@
+
+MDS States
+==========
+
+
+The Metadata Server (MDS) goes through several states during normal operation
+in CephFS. For example, some states indicate that the MDS is recovering from a
+failover by a previous instance of the MDS. Here we'll document all of these
+states and include a state diagram to visualize the transitions.
+
+State Descriptions
+------------------
+
+Common states
+~~~~~~~~~~~~~~
+
+
+::
+
+ up:active
+
+This is the normal operating state of the MDS. It indicates that the MDS
+and its rank in the file system are available.
+
+
+::
+
+ up:standby
+
+The MDS is available to take over for a failed rank (see also :ref:`mds-standby`).
+The monitor will automatically assign an MDS in this state to a failed rank
+once available.
+
+
+::
+
+ up:standby_replay
+
+The MDS is following the journal of another ``up:active`` MDS. Should the
+active MDS fail, having a standby MDS in replay mode is desirable because it
+is already replaying the live journal and can take over more quickly. A
+downside of standby-replay MDSs is that they are not available to take over
+for any MDS other than the one they follow.
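+
+To see which state each MDS daemon and rank currently holds, the FSMap can be
+inspected; for example (the exact output format varies by release):
+
+::
+
+    ceph fs dump    # full FSMap, including the state of every daemon and rank
+    ceph mds stat   # one-line summary of the fsmap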
+
+
+Less common or transitory states
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+
+::
+
+ up:boot
+
+This state is broadcast to the Ceph monitors during startup. This state is
+never visible because the monitors immediately assign the MDS to an available
+rank or command the MDS to operate as a standby. The state is documented here
+for completeness.
+
+
+::
+
+ up:creating
+
+The MDS is creating a new rank (perhaps rank 0) by constructing some per-rank
+metadata (like the journal) and entering the MDS cluster.
+
+
+::
+
+ up:starting
+
+The MDS is restarting a stopped rank. It opens associated per-rank metadata
+and enters the MDS cluster.
+
+
+::
+
+ up:stopping
+
+When a rank is stopped, the monitors command an active MDS to enter the
+``up:stopping`` state. In this state, the MDS accepts no new client
+connections, migrates all subtrees to other ranks in the file system, flushes
+its metadata journal, and, if it holds the last rank (0), evicts all clients
+and shuts down (see also :ref:`cephfs-administration`).
+
+
+::
+
+ up:replay
+
+The MDS is taking over a failed rank. In this state, the MDS is recovering its
+journal and other metadata.
+
+
+::
+
+ up:resolve
+
+The MDS enters this state from ``up:replay`` if the Ceph file system has
+multiple ranks (including this one), i.e. it's not a single active MDS cluster.
+The MDS is resolving any uncommitted inter-MDS operations. All ranks in the
+file system must be in this state or later for progress to be made, i.e. no
+rank can be failed/damaged or ``up:replay``.
+
+
+::
+
+ up:reconnect
+
+The MDS enters this state from ``up:replay`` or ``up:resolve``. In this state,
+the MDS solicits reconnections from clients. Any client which had a session
+with this rank must reconnect within the time window, which is configurable
+via ``mds_reconnect_timeout``.
+
+
+::
+
+ up:rejoin
+
+The MDS enters this state from ``up:reconnect``. In this state, the MDS is
+rejoining the MDS cluster cache. In particular, all inter-MDS locks on metadata
+are reestablished.
+
+If there are no known client requests to be replayed, the MDS directly becomes
+``up:active`` from this state.
+
+
+::
+
+ up:clientreplay
+
+The MDS may enter this state from ``up:rejoin``. The MDS is replaying any
+client requests which were replied to but not yet durable (not journaled).
+Clients resend these requests during ``up:reconnect`` and the requests are
+replayed once again. The MDS enters ``up:active`` after completing replay.
+
+
+Failed states
+~~~~~~~~~~~~~
+
+::
+
+ down:failed
+
+No MDS actually holds this state. Instead, it is applied to the rank in the file system. For example:
+
+::
+
+ $ ceph fs dump
+ ...
+ max_mds 1
+ in 0
+ up {}
+ failed 0
+ ...
+
+Rank 0 is part of the failed set.
+
+
+::
+
+ down:damaged
+
+No MDS actually holds this state. Instead, it is applied to the rank in the file system. For example:
+
+::
+
+ $ ceph fs dump
+ ...
+ max_mds 1
+ in 0
+ up {}
+ failed
+ damaged 0
+ ...
+
+Rank 0 has become damaged (see also :ref:`cephfs-disaster-recovery`) and has
+been placed in the ``damaged`` set. An MDS which was running as rank 0 found
+metadata damage that could not be automatically recovered. Operator
+intervention is required.
+
+
+::
+
+ down:stopped
+
+No MDS actually holds this state. Instead, it is applied to the rank in the file system. For example:
+
+::
+
+ $ ceph fs dump
+ ...
+ max_mds 1
+ in 0
+ up {}
+ failed
+ damaged
+ stopped 1
+ ...
+
+The rank has been stopped by reducing ``max_mds`` (see also :ref:`cephfs-multimds`).
+
+State Diagram
+-------------
+
+This state diagram shows the possible state transitions for the MDS/rank. The legend is as follows:
+
+Color
+~~~~~
+
+- Green: MDS is active.
+- Orange: MDS is in a transient state, trying to become active.
+- Red: MDS is indicating a state that causes the rank to be marked failed.
+- Purple: MDS and rank are stopping.
+- Black: MDS is indicating a state that causes the rank to be marked damaged.
+
+Shape
+~~~~~
+
+- Circle: an MDS holds this state.
+- Hexagon: no MDS holds this state (it is applied to the rank).
+
+Lines
+~~~~~
+
+- A double-lined shape indicates the rank is "in".
+
+.. image:: mds-state-diagram.svg
diff --git a/doc/cephfs/multimds.rst b/doc/cephfs/multimds.rst
new file mode 100644
index 00000000..8ed5bd07
--- /dev/null
+++ b/doc/cephfs/multimds.rst
@@ -0,0 +1,137 @@
+.. _cephfs-multimds:
+
+Configuring multiple active MDS daemons
+---------------------------------------
+
+*Also known as: multi-mds, active-active MDS*
+
+Each CephFS filesystem is configured for a single active MDS daemon
+by default. To scale metadata performance for large scale systems, you
+may enable multiple active MDS daemons, which will share the metadata
+workload with one another.
+
+When should I use multiple active MDS daemons?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+You should configure multiple active MDS daemons when your metadata performance
+is bottlenecked on the single MDS that runs by default.
+
+Adding more daemons may not increase performance on all workloads. Typically,
+a single application running on a single client will not benefit from an
+increased number of MDS daemons unless the application is doing a lot of
+metadata operations in parallel.
+
+Workloads that typically benefit from a larger number of active MDS daemons
+are those with many clients, perhaps working on many separate directories.
+
+
+Increasing the MDS active cluster size
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Each CephFS filesystem has a *max_mds* setting, which controls how many ranks
+will be created. The actual number of ranks in the filesystem will only be
+increased if a spare daemon is available to take on the new rank. For example,
+if there is only one MDS daemon running, and max_mds is set to two, no second
+rank will be created. (Note that such a configuration is not Highly Available
+(HA) because no standby is available to take over for a failed rank. The
+cluster will complain via health warnings when configured this way.)
+
+Set ``max_mds`` to the desired number of ranks. In the following examples
+the "fsmap" line of "ceph status" is shown to illustrate the expected
+result of commands.
+
+::
+
+ # fsmap e5: 1/1/1 up {0=a=up:active}, 2 up:standby
+
+ ceph fs set <fs_name> max_mds 2
+
+ # fsmap e8: 2/2/2 up {0=a=up:active,1=c=up:creating}, 1 up:standby
+ # fsmap e9: 2/2/2 up {0=a=up:active,1=c=up:active}, 1 up:standby
+
+The newly created rank (1) will pass through the 'creating' state
+and then enter the 'active' state.
+
+Standby daemons
+~~~~~~~~~~~~~~~
+
+Even with multiple active MDS daemons, a highly available system **still
+requires standby daemons** to take over if any of the servers running
+an active daemon fail.
+
+Consequently, the practical maximum of ``max_mds`` for highly available systems
+is one less than the total number of MDS servers in your system.
+
+To remain available in the event of multiple server failures, increase the
+number of standby daemons in the system to match the number of server failures
+you wish to withstand.
+
+Decreasing the number of ranks
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Reducing the number of ranks is as simple as reducing ``max_mds``:
+
+::
+
+ # fsmap e9: 2/2/2 up {0=a=up:active,1=c=up:active}, 1 up:standby
+ ceph fs set <fs_name> max_mds 1
+ # fsmap e10: 2/2/1 up {0=a=up:active,1=c=up:stopping}, 1 up:standby
+ # fsmap e10: 2/2/1 up {0=a=up:active,1=c=up:stopping}, 1 up:standby
+ ...
+ # fsmap e10: 1/1/1 up {0=a=up:active}, 2 up:standby
+
+The cluster will automatically stop extra ranks incrementally until ``max_mds``
+is reached.
+
+See :doc:`/cephfs/administration` for more details on which forms ``<role>``
+can take.
+
+Note: a stopped rank will first enter the stopping state for a period of
+time while it hands off its share of the metadata to the remaining active
+daemons. This phase can take from seconds to minutes. If the MDS appears to
+be stuck in the stopping state then that should be investigated as a possible
+bug.
+
+If an MDS daemon crashes or is killed while in the ``up:stopping`` state, a
+standby will take over and the cluster monitors will again try to stop
+the daemon.
+
+When a daemon finishes stopping, it will respawn itself and go back to being a
+standby.
+
+
+Manually pinning directory trees to a particular rank
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In multiple active metadata server configurations, a balancer runs which works
+to spread metadata load evenly across the cluster. This usually works well
+enough for most users but sometimes it is desirable to override the dynamic
+balancer with explicit mappings of metadata to particular ranks. This can allow
+the administrator or users to evenly spread application load or limit impact of
+users' metadata requests on the entire cluster.
+
+The mechanism provided for this purpose is called an ``export pin``, an
+extended attribute of directories. The name of this extended attribute is
+``ceph.dir.pin``. Users can set this attribute using standard commands:
+
+::
+
+ setfattr -n ceph.dir.pin -v 2 path/to/dir
+
+The value of the extended attribute is the rank to assign the directory subtree
+to. A default value of ``-1`` indicates the directory is not pinned.
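+
+To check a directory's current pin, the extended attribute can usually be read
+back with ``getfattr``; for example:
+
+::
+
+    getfattr -n ceph.dir.pin path/to/dir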
+
+A directory's export pin is inherited from its closest parent with a set export
+pin. In this way, setting the export pin on a directory affects all of its
+children. However, the parent's pin can be overridden by setting the child
+directory's export pin. For example:
+
+::
+
+ mkdir -p a/b
+ # "a" and "a/b" both start without an export pin set
+ setfattr -n ceph.dir.pin -v 1 a/
+ # a and b are now pinned to rank 1
+ setfattr -n ceph.dir.pin -v 0 a/b
+ # a/b is now pinned to rank 0 and a/ and the rest of its children are still pinned to rank 1
+
diff --git a/doc/cephfs/nfs.rst b/doc/cephfs/nfs.rst
new file mode 100644
index 00000000..3485d33d
--- /dev/null
+++ b/doc/cephfs/nfs.rst
@@ -0,0 +1,81 @@
+===
+NFS
+===
+
+CephFS namespaces can be exported over NFS protocol using the
+`NFS-Ganesha NFS server <https://github.com/nfs-ganesha/nfs-ganesha/wiki>`_.
+
+Requirements
+============
+
+- Ceph filesystem (preferably latest stable luminous or higher versions)
+- In the NFS server host machine, 'libcephfs2' (preferably latest stable
+ luminous or higher), 'nfs-ganesha' and 'nfs-ganesha-ceph' packages (latest
+ ganesha v2.5 stable or higher versions)
+- NFS-Ganesha server host connected to the Ceph public network
+
+Configuring NFS-Ganesha to export CephFS
+========================================
+
+NFS-Ganesha provides a File System Abstraction Layer (FSAL) to plug in different
+storage backends. `FSAL_CEPH <https://github.com/nfs-ganesha/nfs-ganesha/tree/next/src/FSAL/FSAL_CEPH>`_
+is the plugin FSAL for CephFS. For each NFS-Ganesha export, FSAL_CEPH uses a
+libcephfs client (a user-space CephFS client) to mount the CephFS path that
+NFS-Ganesha exports.
+
+Setting up NFS-Ganesha with CephFS involves setting up NFS-Ganesha's
+configuration file, as well as a Ceph configuration file and cephx
+access credentials for the Ceph clients created by NFS-Ganesha to access
+CephFS.
+
+NFS-Ganesha configuration
+-------------------------
+
+A sample ganesha.conf configured with FSAL_CEPH can be found here,
+`<https://github.com/nfs-ganesha/nfs-ganesha/blob/next/src/config_samples/ceph.conf>`_.
+It is suitable for a standalone NFS-Ganesha server, or an active/passive
+configuration of NFS-Ganesha servers managed by some sort of clustering
+software (e.g., Pacemaker). Important details about the options are
+added as comments in the sample conf. There are options to do the following:
+
+- minimize Ganesha caching wherever possible since the libcephfs clients
+ (of FSAL_CEPH) also cache aggressively
+
+- read from Ganesha config files stored in RADOS objects
+
+- store client recovery data in RADOS OMAP key-value interface
+
+- mandate NFSv4.1+ access
+
+- enable read delegations (need at least v13.0.1 'libcephfs2' package
+ and v2.6.0 stable 'nfs-ganesha' and 'nfs-ganesha-ceph' packages)
+
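+A heavily trimmed sketch of a single FSAL_CEPH export is shown below; the
+export ID, paths, and squash setting are placeholders, and the sample
+configuration linked above covers the recommended caching and recovery
+options in full:
+
+::
+
+    EXPORT
+    {
+        Export_ID = 100;
+        Path = "/";            # CephFS path being exported
+        Pseudo = "/cephfs";    # NFSv4 pseudo path presented to clients
+        Access_Type = RW;
+        Squash = No_Root_Squash;
+        FSAL {
+            Name = CEPH;
+        }
+    }
+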
+Configuration for libcephfs clients
+-----------------------------------
+
+Required ceph.conf for libcephfs clients includes:
+
+* a [client] section with ``mon_host`` option set to let the clients connect
+ to the Ceph cluster's monitors, e.g., ::
+
+ [client]
+ mon host = 192.168.1.7:6789, 192.168.1.8:6789, 192.168.1.9:6789
+
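+The cephx credentials mentioned above can be created in the usual way. A
+minimal sketch, assuming a client named ``client.ganesha`` and a filesystem
+named ``cephfs`` (adjust the names and the exported path to match your
+environment):
+
+::
+
+    ceph fs authorize cephfs client.ganesha / rw
+    ceph auth get client.ganesha    # keyring to place on the NFS-Ganesha host
+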
+Mount using NFSv4 clients
+=========================
+
+It is preferred to mount the NFS-Ganesha exports using NFSv4.1+ protocols
+to get the benefit of sessions.
+
+Conventions for mounting NFS resources are platform-specific. The
+following conventions work on Linux and some Unix platforms:
+
+From the command line::
+
+ mount -t nfs -o nfsvers=4.1,proto=tcp <ganesha-host-name>:<ganesha-pseudo-path> <mount-point>
+
+Current limitations
+===================
+
+- Each running ganesha daemon can export only one Ceph filesystem via
+  FSAL_CEPH, although multiple directories within that filesystem may be
+  exported.
diff --git a/doc/cephfs/posix.rst b/doc/cephfs/posix.rst
new file mode 100644
index 00000000..34c2b44a
--- /dev/null
+++ b/doc/cephfs/posix.rst
@@ -0,0 +1,101 @@
+========================
+ Differences from POSIX
+========================
+
+CephFS aims to adhere to POSIX semantics wherever possible. For
+example, in contrast to many other common network file systems like
+NFS, CephFS maintains strong cache coherency across clients. The goal
+is for processes communicating via the file system to behave the same
+when they are on different hosts as when they are on the same host.
+
+However, there are a few places where CephFS diverges from strict
+POSIX semantics for various reasons:
+
+- If a client is writing to a file and fails, its writes are not
+  necessarily atomic. That is, the client may call write(2) on a file
+  opened with O_SYNC with an 8 MB buffer, then crash, and the write
+  may be only partially applied. (Almost all file systems, even local
+  file systems, have this behavior.)
+- In shared simultaneous writer situations, a write that crosses
+ object boundaries is not necessarily atomic. This means that you
+ could have writer A write "aa|aa" and writer B write "bb|bb"
+ simultaneously (where | is the object boundary), and end up with
+ "aa|bb" rather than the proper "aa|aa" or "bb|bb".
+- Sparse files propagate incorrectly to the stat(2) st_blocks field.
+  Because CephFS does not explicitly track which parts of a file are
+  allocated/written, the st_blocks field is always populated by the
+  file size divided by the block size. This will cause tools like
+  du(1) to overestimate consumed space. (The recursive size field,
+  maintained by CephFS, also includes file "holes" in its count; see
+  the example after this list.)
+- When a file is mapped into memory via mmap(2) on multiple hosts,
+  writes are not coherently propagated to other clients' caches. That
+  is, if a page is cached on host A, and then updated on host B, host
+  A's page is not coherently invalidated. (Shared writable mmap
+  appears to be quite rare--we have yet to hear any complaints about this
+  behavior, and implementing cache coherency properly is complex.)
+- CephFS clients present a hidden ``.snap`` directory that is used to
+ access, create, delete, and rename snapshots. Although the virtual
+ directory is excluded from readdir(2), any process that tries to
+ create a file or directory with the same name will get an error
+ code. The name of this hidden directory can be changed at mount
+ time with ``-o snapdirname=.somethingelse`` (Linux) or the config
+ option ``client_snapdir`` (libcephfs, ceph-fuse).
+
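+The recursive statistics mentioned above can be read through CephFS's virtual
+extended attributes. As a rough sketch (assuming a directory ``some/dir``
+inside a CephFS mount), the recursive byte count can be compared with the
+block-based estimate from ``du``:
+
+::
+
+    getfattr -n ceph.dir.rbytes some/dir    # recursive byte count maintained by CephFS
+    du -sh some/dir                         # block-based estimate, which may differ
+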
+Perspective
+-----------
+
+People talk a lot about "POSIX compliance," but in reality most file
+system implementations do not strictly adhere to the spec, including
+local Linux file systems like ext4 and XFS. For example, for
+performance reasons, the atomicity requirements for reads are relaxed:
+processes reading from a file that is also being written may see torn
+results.
+
+Similarly, NFS has extremely weak consistency semantics when multiple
+clients are interacting with the same files or directories, opting
+instead for "close-to-open". In the world of network attached
+storage, where most environments use NFS, whether or not the server's
+file system is "fully POSIX" may not be relevant, and whether client
+applications notice depends on whether data is being shared between
+clients or not. NFS may also "tear" the results of concurrent writers
+as client data may not even be flushed to the server until the file is
+closed (and more generally writes will be significantly more
+time-shifted than CephFS, leading to less predictable results).
+
+However, all of these are very close to POSIX, and most of the time
+applications don't notice too much. Many other storage systems (e.g.,
+HDFS) claim to be "POSIX-like" but diverge significantly from the
+standard by dropping support for things like in-place file
+modifications, truncate, or directory renames.
+
+
+Bottom line
+-----------
+
+CephFS relaxes more than local Linux kernel file systems (e.g., writes
+spanning object boundaries may be torn). It relaxes strictly less
+than NFS when it comes to multiclient consistency, and generally less
+than NFS when it comes to write atomicity.
+
+In other words, when it comes to POSIX, ::
+
+ HDFS < NFS < CephFS < {XFS, ext4}
+
+
+fsync() and error reporting
+---------------------------
+
+POSIX is somewhat vague about the state of an inode after fsync reports
+an error. In general, CephFS uses the standard error-reporting
+mechanisms in the client's kernel, and therefore follows the same
+conventions as other filesystems.
+
+In modern Linux kernels (v4.17 or later), writeback errors are reported
+once to every file description that is open at the time of the error. In
+addition, unreported errors that occurred before the file description was
+opened will also be returned on fsync.
+
+See `PostgreSQL's summary of fsync() error reporting across operating systems
+<https://wiki.postgresql.org/wiki/Fsync_Errors>`_ and `Matthew Wilcox's
+presentation on Linux IO error handling
+<https://www.youtube.com/watch?v=74c19hwY2oE>`_ for more information.
diff --git a/doc/cephfs/quota.rst b/doc/cephfs/quota.rst
new file mode 100644
index 00000000..951982d1
--- /dev/null
+++ b/doc/cephfs/quota.rst
@@ -0,0 +1,76 @@
+Quotas
+======
+
+CephFS allows quotas to be set on any directory in the system. The
+quota can restrict the number of *bytes* or the number of *files*
+stored beneath that point in the directory hierarchy.
+
+Limitations
+-----------
+
+#. *Quotas are cooperative and non-adversarial.* CephFS quotas rely on
+ the cooperation of the client who is mounting the file system to
+ stop writers when a limit is reached. A modified or adversarial
+ client cannot be prevented from writing as much data as it needs.
+ Quotas should not be relied on to prevent filling the system in
+ environments where the clients are fully untrusted.
+
+#. *Quotas are imprecise.* Processes that are writing to the file
+   system will be stopped a short time after the quota limit is
+   reached. They will inevitably be allowed to write some amount of
+   data over the configured limit. How far over the quota they are
+   able to go depends primarily on the amount of time, not the amount
+   of data. Generally speaking, writers will be stopped within tens of
+   seconds of crossing the configured limit.
+
+#. *Quotas are implemented in the kernel client 4.17 and higher.*
+   Quotas are supported by the userspace client (libcephfs, ceph-fuse).
+   Linux kernel clients >= 4.17 support CephFS quotas, but only on
+   mimic+ clusters. Kernel clients (even recent versions) will fail
+   to handle quotas on older clusters, even if they are able to set
+   the quota extended attributes.
+
+#. *Quotas must be configured carefully when used with path-based
+   mount restrictions.* The client needs to have access to the
+   directory inode on which quotas are configured in order to enforce
+   them. If the client has restricted access to a specific path
+   (e.g., ``/home/user``) based on the MDS capability, and a quota is
+   configured on an ancestor directory they do not have access to
+   (e.g., ``/home``), the client will not enforce it. When using
+   path-based access restrictions, be sure to configure the quota on
+   the directory the client is restricted to (e.g., ``/home/user``)
+   or something nested beneath it (see the sketch after this list).
+
+#. *Snapshot file data which has since been deleted or changed does not count
+ towards the quota.* See also: http://tracker.ceph.com/issues/24284
+
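+As a sketch of the path-restriction point above, the quota is set on the same
+directory that the client is restricted to; the client name, filesystem name,
+and mount point below are placeholders:
+
+::
+
+    ceph fs authorize cephfs client.user /home/user rw
+    setfattr -n ceph.quota.max_bytes -v 107374182400 /mnt/cephfs/home/user   # 100 GiB
+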
+Configuration
+-------------
+
+Like most other things in CephFS, quotas are configured using virtual
+extended attributes:
+
+ * ``ceph.quota.max_files`` -- file limit
+ * ``ceph.quota.max_bytes`` -- byte limit
+
+If the attributes appear on a directory inode that means a quota is
+configured there. If they are not present then no quota is set on
+that directory (although one may still be configured on a parent directory).
+
+To set a quota::
+
+ setfattr -n ceph.quota.max_bytes -v 100000000 /some/dir # 100 MB
+ setfattr -n ceph.quota.max_files -v 10000 /some/dir # 10,000 files
+
+To view quota settings::
+
+ getfattr -n ceph.quota.max_bytes /some/dir
+ getfattr -n ceph.quota.max_files /some/dir
+
+Note that if the value of the extended attribute is ``0`` that means
+the quota is not set.
+
+To remove a quota::
+
+ setfattr -n ceph.quota.max_bytes -v 0 /some/dir
+ setfattr -n ceph.quota.max_files -v 0 /some/dir
diff --git a/doc/cephfs/scrub.rst b/doc/cephfs/scrub.rst
new file mode 100644
index 00000000..7a28311b
--- /dev/null
+++ b/doc/cephfs/scrub.rst
@@ -0,0 +1,136 @@
+.. _mds-scrub:
+
+=====================
+Ceph Filesystem Scrub
+=====================
+
+CephFS provides the cluster admin (operator) with a set of scrub commands to
+check the consistency of a filesystem. Scrub can be classified into two parts:
+
+#. Forward Scrub: In which the scrub operation starts at the root of the filesystem
+ (or a sub directory) and looks at everything that can be touched in the hierarchy
+ to ensure consistency.
+
+#. Backward Scrub: In which the scrub operation looks at every RADOS object in the
+ filesystem pools and maps it back to the filesystem hierarchy.
+
+This document details commands to initiate and control forward scrub (referred
+to as scrub hereafter).
+
+Initiate Filesystem Scrub
+=========================
+
+To start a scrub operation for a directory tree, use the following command:
+
+::
+
+ ceph tell mds.a scrub start / recursive
+ {
+ "return_code": 0,
+ "scrub_tag": "6f0d204c-6cfd-4300-9e02-73f382fd23c1",
+ "mode": "asynchronous"
+ }
+
+Recursive scrub is asynchronous (as hinted by `mode` in the output above). The
+scrub tag is a random string that can be used to monitor the progress of the
+scrub operation (explained further in this document).
+
+A custom tag can also be specified when initiating the scrub operation. Custom
+tags are persisted in the metadata object for every inode in the filesystem
+tree that is being scrubbed.
+
+::
+
+ ceph tell mds.a scrub start /a/b/c recursive tag0
+ {
+ "return_code": 0,
+ "scrub_tag": "tag0",
+ "mode": "asynchronous"
+ }
+
+
+Monitor (ongoing) Filesystem Scrubs
+===================================
+
+The status of ongoing scrubs can be monitored using the `scrub status` command.
+This command lists ongoing scrubs (identified by tag) along with the path and
+options used to initiate the scrub.
+
+::
+
+ ceph tell mds.a scrub status
+ {
+ "status": "scrub active (85 inodes in the stack)",
+ "scrubs": {
+ "6f0d204c-6cfd-4300-9e02-73f382fd23c1": {
+ "path": "/",
+ "options": "recursive"
+ }
+ }
+ }
+
+`status` shows the number of inodes that are scheduled to be scrubbed at any
+point in time, and hence can change between subsequent `scrub status`
+invocations. Also, a high-level summary of the scrub operation (which includes
+the operation state and the paths on which scrub was triggered) is displayed in
+`ceph status`.
+
+::
+
+ ceph status
+ [...]
+
+ task status:
+ scrub status:
+ mds.0: active [paths:/]
+
+ [...]
+
+Control (ongoing) Filesystem Scrubs
+===================================
+
+- Pause: Pausing ongoing scrub operations results in no new or pending inodes being
+ scrubbed after in-flight RADOS ops (for the inodes that are currently being scrubbed)
+ finish.
+
+::
+
+ ceph tell mds.a scrub pause
+ {
+ "return_code": 0
+ }
+
+`scrub status` after pausing reflects the paused state. At this point, initiating new scrub
+operations (via `scrub start`) would just queue the inode for scrub.
+
+::
+
+ ceph tell mds.a scrub status
+ {
+ "status": "PAUSED (66 inodes in the stack)",
+ "scrubs": {
+ "6f0d204c-6cfd-4300-9e02-73f382fd23c1": {
+ "path": "/",
+ "options": "recursive"
+ }
+ }
+ }
+
+- Resume: Resuming restarts a paused scrub operation.
+
+::
+
+    ceph tell mds.a scrub resume
+ {
+ "return_code": 0
+ }
+
+- Abort: Aborting ongoing scrub operations removes pending inodes from the scrub
+ queue (thereby aborting the scrub) after in-flight RADOS ops (for the inodes that
+ are currently being scrubbed) finish.
+
+::
+
+    ceph tell mds.a scrub abort
+ {
+ "return_code": 0
+ }
diff --git a/doc/cephfs/standby.rst b/doc/cephfs/standby.rst
new file mode 100644
index 00000000..8983415d
--- /dev/null
+++ b/doc/cephfs/standby.rst
@@ -0,0 +1,103 @@
+.. _mds-standby:
+
+Terminology
+-----------
+
+A Ceph cluster may have zero or more CephFS *filesystems*. CephFS
+filesystems have a human readable name (set in ``fs new``)
+and an integer ID. The ID is called the filesystem cluster ID,
+or *FSCID*.
+
+Each CephFS filesystem has a number of *ranks*, one by default,
+which start at zero. A rank may be thought of as a metadata shard.
+Controlling the number of ranks in a filesystem is described
+in :doc:`/cephfs/multimds`.
+
+Each CephFS ceph-mds process (a *daemon*) initially starts up
+without a rank. It may be assigned one by the monitor cluster.
+A daemon may only hold one rank at a time. Daemons only give up
+a rank when the ceph-mds process stops.
+
+If a rank is not associated with a daemon, the rank is
+considered *failed*. Once a rank is assigned to a daemon,
+the rank is considered *up*.
+
+A daemon has a *name* that is set statically by the administrator
+when the daemon is first configured. Typical configurations
+use the hostname where the daemon runs as the daemon name.
+
+Each time a daemon starts up, it is also assigned a *GID*, which
+is unique to this particular process lifetime of the daemon. The
+GID is an integer.
+
+Referring to MDS daemons
+------------------------
+
+Most of the administrative commands that refer to an MDS daemon
+accept a flexible argument format that may contain a rank, a GID
+or a name.
+
+Where a rank is used, this may optionally be qualified with
+a leading filesystem name or ID. If a daemon is a standby (i.e.
+it is not currently assigned a rank), then it may only be
+referred to by GID or name.
+
+For example, if we had an MDS daemon which was called 'myhost',
+had GID 5446, and was assigned rank 0 in the filesystem 'myfs'
+which had FSCID 3, then any of the following would be suitable
+forms of the 'fail' command:
+
+::
+
+ ceph mds fail 5446 # GID
+ ceph mds fail myhost # Daemon name
+ ceph mds fail 0 # Unqualified rank
+ ceph mds fail 3:0 # FSCID and rank
+ ceph mds fail myfs:0 # Filesystem name and rank
+
+Managing failover
+-----------------
+
+If an MDS daemon stops communicating with the monitor, the monitor will wait
+``mds_beacon_grace`` seconds (default 15 seconds) before marking the daemon as
+*laggy*. If a standby is available, the monitor will immediately replace the
+laggy daemon.
+
+Each file system may specify a number of standby daemons to be considered
+healthy. This number includes daemons in standby-replay waiting for a rank to
+fail (remember that a standby-replay daemon will not be assigned to take over a
+failure for another rank or a failure in another CephFS file system). The
+pool of standby daemons not in replay counts towards any file system count.
+Each file system may set the number of standby daemons wanted using:
+
+::
+
+ ceph fs set <fs name> standby_count_wanted <count>
+
+Setting ``count`` to 0 will disable the health check.
+
+
+.. _mds-standby-replay:
+
+Configuring standby-replay
+--------------------------
+
+Each CephFS file system may be configured to add standby-replay daemons. These
+standby daemons follow the active MDS's metadata journal to reduce failover
+time in the event the active MDS becomes unavailable. Each active MDS may have
+only one standby-replay daemon following it.
+
+Configuring standby-replay on a file system is done using:
+
+::
+
+ ceph fs set <fs name> allow_standby_replay <bool>
+
+Once set, the monitors will assign available standby daemons to follow the
+active MDSs in that file system.
+
+Once an MDS has entered the standby-replay state, it will only be used as a
+standby for the rank that it is following. If another rank fails, this
+standby-replay daemon will not be used as a replacement, even if no other
+standbys are available. For this reason, it is advised that if standby-replay
+is used then every active MDS should have a standby-replay daemon.
diff --git a/doc/cephfs/troubleshooting.rst b/doc/cephfs/troubleshooting.rst
new file mode 100644
index 00000000..d13914a1
--- /dev/null
+++ b/doc/cephfs/troubleshooting.rst
@@ -0,0 +1,160 @@
+=================
+ Troubleshooting
+=================
+
+Slow/stuck operations
+=====================
+
+If you are experiencing apparent hung operations, the first task is to identify
+where the problem is occurring: in the client, the MDS, or the network connecting
+them. Start by looking to see if either side has stuck operations
+(:ref:`slow_requests`, below), and narrow it down from there.
+
+RADOS Health
+============
+
+If part of the CephFS metadata or data pools is unavailable and CephFS is not
+responding, it is probably because RADOS itself is unhealthy. Resolve those
+problems first (:doc:`../../rados/troubleshooting/index`).
+
+The MDS
+=======
+
+If an operation is hung inside the MDS, it will eventually show up in ``ceph health``,
+identifying "slow requests are blocked". It may also identify clients as
+"failing to respond" or misbehaving in other ways. If the MDS identifies
+specific clients as misbehaving, you should investigate why they are doing so.
+Generally it will be the result of:
+
+#. Overloading the system (if you have extra RAM, increase the
+   "mds cache size" config from its default 100000; having a larger active file
+   set than your MDS cache is the #1 cause of this! See also the sketch below.)
+#. Running an older (misbehaving) client.
+#. Underlying RADOS issues.
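+
+As a rough sketch of the first point, the cache limit can be raised at runtime
+through the centralized configuration store; the option name and value here are
+only an example, so size the cache to your working set and available RAM:
+
+::
+
+    ceph config set mds mds_cache_size 500000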
+
+Otherwise, you have probably discovered a new bug and should report it to
+the developers!
+
+.. _slow_requests:
+
+Slow requests (MDS)
+-------------------
+You can list current operations via the admin socket by running::
+
+ ceph daemon mds.<name> dump_ops_in_flight
+
+from the MDS host. Identify the stuck commands and examine why they are stuck.
+Usually the last "event" will have been an attempt to gather locks, or sending
+the operation off to the MDS log. If it is waiting on the OSDs, fix them. If
+operations are stuck on a specific inode, you probably have a client holding
+caps which prevent others from using it, either because the client is trying
+to flush out dirty data or because you have encountered a bug in CephFS'
+distributed file lock code (the file "capabilities" ["caps"] system).
+
+If it's a result of a bug in the capabilities code, restarting the MDS
+is likely to resolve the problem.
+
+If there are no slow requests reported on the MDS, and it is not reporting
+that clients are misbehaving, either the client has a problem or its
+requests are not reaching the MDS.
+
+ceph-fuse debugging
+===================
+
+ceph-fuse also supports the ``dump_ops_in_flight`` admin socket command. Check
+whether it has any in-flight operations and where they are stuck.
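+
+A minimal sketch, assuming the default admin socket location and a single
+ceph-fuse instance on the host (the socket name includes the client name and
+PID, so adjust the glob as needed):
+
+::
+
+    ceph daemon /var/run/ceph/ceph-client.*.asok dump_ops_in_flight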
+
+Debug output
+------------
+
+To get more debugging information from ceph-fuse, try running in the foreground
+with logging to the console (``-d``) and enabling client debug
+(``--debug-client=20``), enabling prints for each message sent
+(``--debug-ms=1``).
+
+If you suspect a potential monitor issue, enable monitor debugging as well
+(``--debug-monc=20``).
+
+
+Kernel mount debugging
+======================
+
+Slow requests
+-------------
+
+Unfortunately the kernel client does not support the admin socket, but it has
+similar (if limited) interfaces if your kernel has debugfs enabled. There
+will be a folder in ``/sys/kernel/debug/ceph/``, and that folder (whose name will
+look something like ``28f7427e-5558-4ffd-ae1a-51ec3042759a.client25386880``)
+will contain a variety of files that print useful information when you ``cat``
+them. These files are described below; the most interesting when debugging
+slow requests are probably the ``mdsc`` and ``osdc`` files (see the example
+after the list).
+
+* bdi: BDI info about the Ceph system (blocks dirtied, written, etc)
+* caps: counts of file "caps" structures in-memory and used
+* client_options: dumps the options provided to the CephFS mount
+* dentry_lru: Dumps the CephFS dentries currently in-memory
+* mdsc: Dumps current requests to the MDS
+* mdsmap: Dumps the current MDSMap epoch and MDSes
+* mds_sessions: Dumps the current sessions to MDSes
+* monc: Dumps the current maps from the monitor, and any "subscriptions" held
+* monmap: Dumps the current monitor map epoch and monitors
+* osdc: Dumps the current ops in-flight to OSDs (i.e., file data IO)
+* osdmap: Dumps the current OSDMap epoch, pools, and OSDs
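+
+For example, to dump the requests currently outstanding against the MDSs and
+OSDs from the kernel client (the folder name will differ on your system):
+
+::
+
+    cat /sys/kernel/debug/ceph/*/mdsc    # in-flight MDS requests
+    cat /sys/kernel/debug/ceph/*/osdc    # in-flight OSD requests (file data IO)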
+
+If there are no stuck requests but you have file IO which is not progressing,
+you might have a...
+
+Disconnected+Remounted FS
+=========================
+Because CephFS has a "consistent cache", if your network connection is
+disrupted for a long enough time, the client will be forcibly
+disconnected from the system. At this point, the kernel client is in
+a bind: it cannot safely write back dirty data, and many applications
+do not handle IO errors correctly on close().
+At the moment, the kernel client will remount the FS, but outstanding filesystem
+IO may or may not be satisfied. In these cases, you may need to reboot your
+client system.
+
+You can identify that you are in this situation if dmesg/kern.log report something like::
+
+ Jul 20 08:14:38 teuthology kernel: [3677601.123718] ceph: mds0 closed our session
+ Jul 20 08:14:38 teuthology kernel: [3677601.128019] ceph: mds0 reconnect start
+ Jul 20 08:14:39 teuthology kernel: [3677602.093378] ceph: mds0 reconnect denied
+ Jul 20 08:14:39 teuthology kernel: [3677602.098525] ceph: dropping dirty+flushing Fw state for ffff8802dc150518 1099935956631
+ Jul 20 08:14:39 teuthology kernel: [3677602.107145] ceph: dropping dirty+flushing Fw state for ffff8801008e8518 1099935946707
+ Jul 20 08:14:39 teuthology kernel: [3677602.196747] libceph: mds0 172.21.5.114:6812 socket closed (con state OPEN)
+ Jul 20 08:14:40 teuthology kernel: [3677603.126214] libceph: mds0 172.21.5.114:6812 connection reset
+ Jul 20 08:14:40 teuthology kernel: [3677603.132176] libceph: reset on mds0
+
+This is an area of ongoing work to improve the behavior. Kernels will soon
+be reliably issuing error codes to in-progress IO, although your application(s)
+may not deal with them well. In the longer-term, we hope to allow reconnect
+and reclaim of data in cases where it won't violate POSIX semantics (generally,
+data which hasn't been accessed or modified by other clients).
+
+Mounting
+========
+
+Mount 5 Error
+-------------
+
+A mount 5 error typically occurs if an MDS server is laggy or if it crashed.
+Ensure at least one MDS is up and running, and the cluster is ``active +
+healthy``.
+
+Mount 12 Error
+--------------
+
+A mount 12 error with ``cannot allocate memory`` usually occurs if you have a
+version mismatch between the :term:`Ceph Client` version and the :term:`Ceph
+Storage Cluster` version. Check the versions using::
+
+ ceph -v
+
+If the Ceph Client is behind the Ceph cluster, try to upgrade it::
+
+ sudo apt-get update && sudo apt-get install ceph-common
+
+You may need to uninstall, autoclean and autoremove ``ceph-common``
+and then reinstall it so that you have the latest version.
+
diff --git a/doc/cephfs/upgrading.rst b/doc/cephfs/upgrading.rst
new file mode 100644
index 00000000..0a62fdb3
--- /dev/null
+++ b/doc/cephfs/upgrading.rst
@@ -0,0 +1,92 @@
+Upgrading the MDS Cluster
+=========================
+
+Currently the MDS cluster does not have built-in versioning or file system
+flags to support seamless upgrades of the MDSs without potentially causing
+assertions or other faults due to incompatible messages or other functional
+differences. For this reason, it's necessary during any cluster upgrade to
+reduce the number of active MDSs for a file system to one first, so that two
+active MDSs running different versions do not communicate. Further, it's also
+necessary to take standbys offline, as any new CompatSet flags will propagate
+via the MDSMap to all MDSs and cause older MDSs to suicide.
+
+The proper sequence for upgrading the MDS cluster is:
+
+1. Reduce the number of ranks to 1:
+
+::
+
+ ceph fs set <fs_name> max_mds 1
+
+2. Wait for the cluster to stop the non-zero ranks, so that only rank 0 is active and the rest are standbys.
+
+::
+
+ ceph status # wait for MDS to finish stopping
+
+3. Take all standbys offline, e.g. using systemctl:
+
+::
+
+ systemctl stop ceph-mds.target
+
+4. Confirm only one MDS is online and is rank 0 for your FS:
+
+::
+
+ ceph status
+
+5. Upgrade the single active MDS, e.g. using systemctl:
+
+::
+
+ # use package manager to update cluster
+ systemctl restart ceph-mds.target
+
+6. Upgrade/start the standby daemons.
+
+::
+
+ # use package manager to update cluster
+ systemctl restart ceph-mds.target
+
+7. Restore the previous max_mds for your cluster:
+
+::
+
+ ceph fs set <fs_name> max_mds <old_max_mds>
+
+
+Upgrading pre-Firefly filesystems past Jewel
+============================================
+
+.. tip::
+
+ This advice only applies to users with filesystems
+ created using versions of Ceph older than *Firefly* (0.80).
+ Users creating new filesystems may disregard this advice.
+
+Pre-Firefly versions of Ceph used a now-deprecated format
+for storing CephFS directory objects, called TMAPs. Support
+for reading these in RADOS will be removed after the Jewel
+release of Ceph, so for upgrading CephFS users it is important
+to ensure that any old directory objects have been converted.
+
+After installing Jewel on all your MDS and OSD servers, and restarting
+the services, run the following command:
+
+::
+
+ cephfs-data-scan tmap_upgrade <metadata pool name>
+
+This only needs to be run once, and it is not necessary to
+stop any other services while it runs. The command may take some
+time to execute, as it iterates over all objects in your metadata
+pool. It is safe to continue using your filesystem as normal while
+it executes. If the command aborts for any reason, it is safe
+to simply run it again.
+
+If you are upgrading a pre-Firefly CephFS filesystem to a newer Ceph version
+than Jewel, you must first upgrade to Jewel and run the ``tmap_upgrade``
+command before completing your upgrade to the latest version.
+