summaryrefslogtreecommitdiffstats
path: root/doc/cephfs/mdcache.rst
diff options
context:
space:
mode:
Diffstat (limited to 'doc/cephfs/mdcache.rst')
-rw-r--r--doc/cephfs/mdcache.rst77
1 files changed, 77 insertions, 0 deletions
diff --git a/doc/cephfs/mdcache.rst b/doc/cephfs/mdcache.rst
new file mode 100644
index 000000000..f2e20238c
--- /dev/null
+++ b/doc/cephfs/mdcache.rst
@@ -0,0 +1,77 @@
+=================================
+CephFS Distributed Metadata Cache
+=================================
+While the data for inodes in a Ceph file system is stored in RADOS and
+accessed by the clients directly, inode metadata and directory
+information is managed by the Ceph metadata server (MDS). The MDS's
+act as mediator for all metadata related activity, storing the resulting
+information in a separate RADOS pool from the file data.
+
+CephFS clients can request that the MDS fetch or change inode metadata
+on its behalf, but an MDS can also grant the client **capabilities**
+(aka **caps**) for each inode (see :doc:`/cephfs/capabilities`).
+
+A capability grants the client the ability to cache and possibly
+manipulate some portion of the data or metadata associated with the
+inode. When another client needs access to the same information, the MDS
+will revoke the capability and the client will eventually return it,
+along with an updated version of the inode's metadata (in the event that
+it made changes to it while it held the capability).
+
+Clients can request capabilities and will generally get them, but when
+there is competing access or memory pressure on the MDS, they may be
+**revoked**. When a capability is revoked, the client is responsible for
+returning it as soon as it is able. Clients that fail to do so in a
+timely fashion may end up **blocklisted** and unable to communicate with
+the cluster.
+
+Since the cache is distributed, the MDS must take great care to ensure
+that no client holds capabilities that may conflict with other clients'
+capabilities, or operations that it does itself. This allows cephfs
+clients to rely on much greater cache coherence than a filesystem like
+NFS, where the client may cache data and metadata beyond the point where
+it has changed on the server.
+
+Client Metadata Requests
+------------------------
+When a client needs to query/change inode metadata or perform an
+operation on a directory, it has two options. It can make a request to
+the MDS directly, or serve the information out of its cache. With
+CephFS, the latter is only possible if the client has the necessary
+caps.
+
+Clients can send simple requests to the MDS to query or request changes
+to certain metadata. The replies to these requests may also grant the
+client a certain set of caps for the inode, allowing it to perform
+subsequent requests without consulting the MDS.
+
+Clients can also request caps directly from the MDS, which is necessary
+in order to read or write file data.
+
+Distributed Locks in an MDS Cluster
+-----------------------------------
+When an MDS wants to read or change information about an inode, it must
+gather the appropriate locks for it. The MDS cluster may have a series
+of different types of locks on the given inode and each MDS may have
+disjoint sets of locks.
+
+If there are outstanding caps that would conflict with these locks, then
+they must be revoked before the lock can be acquired. Once the competing
+caps are returned to the MDS, then it can get the locks and do the
+operation.
+
+On a filesystem served by multiple MDS', the metadata cache is also
+distributed among the MDS' in the cluster. For every inode, at any given
+time, only one MDS in the cluster is considered **authoritative**. Any
+requests to change that inode must be done by the authoritative MDS,
+though non-authoritative MDS can forward requests to the authoritative
+one.
+
+Non-auth MDS' can also obtain read locks that prevent the auth MDS from
+changing the data until the lock is dropped, so that they can serve
+inode info to the clients.
+
+The auth MDS for an inode can change over time as well. The MDS' will
+actively balance responsibility for the inode cache amongst
+themselves, but this can be overridden by **pinning** certain subtrees
+to a single MDS.