summaryrefslogtreecommitdiffstats
path: root/doc/cephfs/eviction.rst
diff options
context:
space:
mode:
authorDaniel Baumann <daniel.baumann@progress-linux.org>2024-04-21 11:54:28 +0000
committerDaniel Baumann <daniel.baumann@progress-linux.org>2024-04-21 11:54:28 +0000
commite6918187568dbd01842d8d1d2c808ce16a894239 (patch)
tree64f88b554b444a49f656b6c656111a145cbbaa28 /doc/cephfs/eviction.rst
parentInitial commit. (diff)
downloadceph-b26c4052f3542036551aa9dec9caa4226e456195.tar.xz
ceph-b26c4052f3542036551aa9dec9caa4226e456195.zip
Adding upstream version 18.2.2.upstream/18.2.2
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to '')
-rw-r--r--doc/cephfs/eviction.rst190
1 files changed, 190 insertions, 0 deletions
diff --git a/doc/cephfs/eviction.rst b/doc/cephfs/eviction.rst
new file mode 100644
index 000000000..eb6f70a8e
--- /dev/null
+++ b/doc/cephfs/eviction.rst
@@ -0,0 +1,190 @@
+
+================================
+Ceph file system client eviction
+================================
+
+When a file system client is unresponsive or otherwise misbehaving, it
+may be necessary to forcibly terminate its access to the file system. This
+process is called *eviction*.
+
+Evicting a CephFS client prevents it from communicating further with MDS
+daemons and OSD daemons. If a client was doing buffered IO to the file system,
+any un-flushed data will be lost.
+
+Clients may either be evicted automatically (if they fail to communicate
+promptly with the MDS), or manually (by the system administrator).
+
+The client eviction process applies to clients of all kinds, this includes
+FUSE mounts, kernel mounts, nfs-ganesha gateways, and any process using
+libcephfs.
+
+Automatic client eviction
+=========================
+
+There are three situations in which a client may be evicted automatically.
+
+#. On an active MDS daemon, if a client has not communicated with the MDS for over
+ ``session_autoclose`` (a file system variable) seconds (300 seconds by
+ default), then it will be evicted automatically.
+
+#. On an active MDS daemon, if a client has not responded to cap revoke messages
+ for over ``mds_cap_revoke_eviction_timeout`` (configuration option) seconds.
+ This is disabled by default.
+
+#. During MDS startup (including on failover), the MDS passes through a
+ state called ``reconnect``. During this state, it waits for all the
+ clients to connect to the new MDS daemon. If any clients fail to do
+ so within the time window (``mds_reconnect_timeout``, 45 seconds by default)
+ then they will be evicted.
+
+A warning message is sent to the cluster log if either of these situations
+arises.
+
+Manual client eviction
+======================
+
+Sometimes, the administrator may want to evict a client manually. This
+could happen if a client has died and the administrator does not
+want to wait for its session to time out, or it could happen if
+a client is misbehaving and the administrator does not have access to
+the client node to unmount it.
+
+It is useful to inspect the list of clients first:
+
+::
+
+ ceph tell mds.0 client ls
+
+ [
+ {
+ "id": 4305,
+ "num_leases": 0,
+ "num_caps": 3,
+ "state": "open",
+ "replay_requests": 0,
+ "completed_requests": 0,
+ "reconnecting": false,
+ "inst": "client.4305 172.21.9.34:0/422650892",
+ "client_metadata": {
+ "ceph_sha1": "ae81e49d369875ac8b569ff3e3c456a31b8f3af5",
+ "ceph_version": "ceph version 12.0.0-1934-gae81e49 (ae81e49d369875ac8b569ff3e3c456a31b8f3af5)",
+ "entity_id": "0",
+ "hostname": "senta04",
+ "mount_point": "/tmp/tmpcMpF1b/mnt.0",
+ "pid": "29377",
+ "root": "/"
+ }
+ }
+ ]
+
+
+
+Once you have identified the client you want to evict, you can
+do that using its unique ID, or various other attributes to identify it:
+
+::
+
+ # These all work
+ ceph tell mds.0 client evict id=4305
+ ceph tell mds.0 client evict client_metadata.=4305
+
+
+Advanced: Un-blocklisting a client
+==================================
+
+Ordinarily, a blocklisted client may not reconnect to the servers: it
+must be unmounted and then mounted anew.
+
+However, in some situations it may be useful to permit a client that
+was evicted to attempt to reconnect.
+
+Because CephFS uses the RADOS OSD blocklist to control client eviction,
+CephFS clients can be permitted to reconnect by removing them from
+the blocklist:
+
+::
+
+ $ ceph osd blocklist ls
+ listed 1 entries
+ 127.0.0.1:0/3710147553 2018-03-19 11:32:24.716146
+ $ ceph osd blocklist rm 127.0.0.1:0/3710147553
+ un-blocklisting 127.0.0.1:0/3710147553
+
+
+Doing this may put data integrity at risk if other clients have accessed
+files that the blocklisted client was doing buffered IO to. It is also not
+guaranteed to result in a fully functional client -- the best way to get
+a fully healthy client back after an eviction is to unmount the client
+and do a fresh mount.
+
+If you are trying to reconnect clients in this way, you may also
+find it useful to set ``client_reconnect_stale`` to true in the
+FUSE client, to prompt the client to try to reconnect.
+
+Advanced: Configuring blocklisting
+==================================
+
+If you are experiencing frequent client evictions, due to slow
+client hosts or an unreliable network, and you cannot fix the underlying
+issue, then you may want to ask the MDS to be less strict.
+
+It is possible to respond to slow clients by simply dropping their
+MDS sessions, but permit them to re-open sessions and permit them
+to continue talking to OSDs. To enable this mode, set
+``mds_session_blocklist_on_timeout`` to false on your MDS nodes.
+
+For the equivalent behaviour on manual evictions, set
+``mds_session_blocklist_on_evict`` to false.
+
+Note that if blocklisting is disabled, then evicting a client will
+only have an effect on the MDS you send the command to. On a system
+with multiple active MDS daemons, you would need to send an
+eviction command to each active daemon. When blocklisting is enabled
+(the default), sending an eviction command to just a single
+MDS is sufficient, because the blocklist propagates it to the others.
+
+.. _background_blocklisting_and_osd_epoch_barrier:
+
+Background: Blocklisting and OSD epoch barrier
+==============================================
+
+After a client is blocklisted, it is necessary to make sure that
+other clients and MDS daemons have the latest OSDMap (including
+the blocklist entry) before they try to access any data objects
+that the blocklisted client might have been accessing.
+
+This is ensured using an internal "osdmap epoch barrier" mechanism.
+
+The purpose of the barrier is to ensure that when we hand out any
+capabilities which might allow touching the same RADOS objects, the
+clients we hand out the capabilities to must have a sufficiently recent
+OSD map to not race with cancelled operations (from ENOSPC) or
+blocklisted clients (from evictions).
+
+More specifically, the cases where an epoch barrier is set are:
+
+ * Client eviction (where the client is blocklisted and other clients
+ must wait for a post-blocklist epoch to touch the same objects).
+ * OSD map full flag handling in the client (where the client may
+ cancel some OSD ops from a pre-full epoch, so other clients must
+ wait until the full epoch or later before touching the same objects).
+ * MDS startup, because we don't persist the barrier epoch, so must
+ assume that latest OSD map is always required after a restart.
+
+Note that this is a global value for simplicity. We could maintain this on
+a per-inode basis. But we don't, because:
+
+ * It would be more complicated.
+ * It would use an extra 4 bytes of memory for every inode.
+ * It would not be much more efficient as, almost always, everyone has
+ the latest OSD map. And, in most cases everyone will breeze through this
+ barrier rather than waiting.
+ * This barrier is done in very rare cases, so any benefit from per-inode
+ granularity would only very rarely be seen.
+
+The epoch barrier is transmitted along with all capability messages, and
+instructs the receiver of the message to avoid sending any more RADOS
+operations to OSDs until it has seen this OSD epoch. This mainly applies
+to clients (doing their data writes directly to files), but also applies
+to the MDS because things like file size probing and file deletion are
+done directly from the MDS.