diff options
Diffstat (limited to 'doc/cephfs/cache-configuration.rst')
-rw-r--r-- | doc/cephfs/cache-configuration.rst | 227 |
1 files changed, 227 insertions, 0 deletions
diff --git a/doc/cephfs/cache-configuration.rst b/doc/cephfs/cache-configuration.rst new file mode 100644 index 000000000..eabc61cc1 --- /dev/null +++ b/doc/cephfs/cache-configuration.rst @@ -0,0 +1,227 @@ +======================= +MDS Cache Configuration +======================= + +The Metadata Server coordinates a distributed cache among all MDS and CephFS +clients. The cache serves to improve metadata access latency and allow clients +to safely (coherently) mutate metadata state (e.g. via `chmod`). The MDS issues +**capabilities** and **directory entry leases** to indicate what state clients +may cache and what manipulations clients may perform (e.g. writing to a file). + +The MDS and clients both try to enforce a cache size. The mechanism for +specifying the MDS cache size is described below. Note that the MDS cache size +is not a hard limit. The MDS always allows clients to lookup new metadata +which is loaded into the cache. This is an essential policy as it avoids +deadlock in client requests (some requests may rely on held capabilities before +capabilities are released). + +When the MDS cache is too large, the MDS will **recall** client state so cache +items become unpinned and eligible to be dropped. The MDS can only drop cache +state when no clients refer to the metadata to be dropped. Also described below +is how to configure the MDS recall settings for your workload's needs. This is +necessary if the internal throttles on the MDS recall can not keep up with the +client workload. + + +MDS Cache Size +-------------- + +You can limit the size of the Metadata Server (MDS) cache by a byte count. This +is done through the `mds_cache_memory_limit` configuration. For example:: + + ceph config set mds mds_cache_memory_limit 8GB + +In addition, you can specify a cache reservation by using the +`mds_cache_reservation` parameter for MDS operations. The cache reservation is +limited as a percentage of the memory and is set to 5% by default. The intent +of this parameter is to have the MDS maintain an extra reserve of memory for +its cache for new metadata operations to use. As a consequence, the MDS should +in general operate below its memory limit because it will recall old state from +clients in order to drop unused metadata in its cache. + +If the MDS cannot keep its cache under the target size, the MDS will send a +health alert to the Monitors indicating the cache is too large. This is +controlled by the `mds_health_cache_threshold` configuration which is by +default 150% of the maximum cache size. + +Because the cache limit is not a hard limit, potential bugs in the CephFS +client, MDS, or misbehaving applications might cause the MDS to exceed its +cache size. The health warnings are intended to help the operator detect this +situation and make necessary adjustments or investigate buggy clients. + +MDS Cache Trimming +------------------ + +There are two configurations for throttling the rate of cache trimming in the MDS: + +:: + + mds_cache_trim_threshold (default 64k) + + +and + +:: + + mds_cache_trim_decay_rate (default 1) + + +The intent of the throttle is to prevent the MDS from spending too much time +trimming its cache. This may limit its ability to handle client requests or +perform other upkeep. + +The trim configurations control an internal **decay counter**. Anytime metadata +is trimmed from the cache, the counter is incremented. The threshold sets the +maximum size of the counter while the decay rate indicates the exponential half +life for the counter. If the MDS is continually removing items from its cache, +it will reach a steady state of ``-ln(0.5)/rate*threshold`` items removed per +second. + +.. note:: Increasing the value of the confguration setting + ``mds_cache_trim_decay_rate`` leads to the MDS spending less time + trimming the cache. To increase the cache trimming rate, set a lower + value. + +The defaults are conservative and may need to be changed for production MDS with +large cache sizes. + + +MDS Recall +---------- + +MDS limits its recall of client state (capabilities/leases) to prevent creating +too much work for itself handling release messages from clients. This is controlled +via the following configurations: + + +The maximum number of capabilities to recall from a single client in a given recall +event:: + + mds_recall_max_caps (default: 5000) + +The threshold and decay rate for the decay counter on a session:: + + mds_recall_max_decay_threshold (default: 16k) + +and:: + + mds_recall_max_decay_rate (default: 2.5 seconds) + +The session decay counter controls the rate of recall for an individual +session. The behavior of the counter works the same as for cache trimming +above. Each capability that is recalled increments the counter. + +There is also a global decay counter that throttles for all session recall:: + + mds_recall_global_max_decay_threshold (default: 64k) + +its decay rate is the same as ``mds_recall_max_decay_rate``. Any recalled +capability for any session also increments this counter. + +If clients are slow to release state, the warning "failing to respond to cache +pressure" or ``MDS_HEALTH_CLIENT_RECALL`` will be reported. Each session's rate +of release is monitored by another decay counter configured by:: + + mds_recall_warning_threshold (default: 32k) + +and:: + + mds_recall_warning_decay_rate (default: 60.0 seconds) + +Each time a capability is released, the counter is incremented. If clients do +not release capabilities quickly enough and there is cache pressure, the +counter will indicate if the client is slow to release state. + +Some workloads and client behaviors may require faster recall of client state +to keep up with capability acquisition. It is recommended to increase the above +counters as needed to resolve any slow recall warnings in the cluster health +state. + + +MDS Cap Acquisition Throttle +---------------------------- + +A trivial "find" command on a large directory hierarchy will cause the client +to receive caps significantly faster than it will release. The MDS will try +to have the client reduce its caps below the ``mds_max_caps_per_client`` limit +but the recall throttles prevent it from catching up to the pace of acquisition. +So the readdir is throttled to control cap acquisition via the following +configurations: + + +The threshold and decay rate for the readdir cap acquisition decay counter:: + + mds_session_cap_acquisition_throttle (default: 500K) + +and:: + + mds_session_cap_acquisition_decay_rate (default: 10 seconds) + +The cap acquisition decay counter controls the rate of cap acquisition via +readdir. The behavior of the decay counter is the same as for cache trimming or +caps recall. Each readdir call increments the counter by the number of files in +the result. + +The ratio of ``mds_max_maps_per_client`` that client must exceed before readdir +maybe throttled by cap acquisition throttle:: + + mds_session_max_caps_throttle_ratio (default: 1.1) + +The timeout in seconds after which a client request is retried due to cap +acquisition throttling:: + + mds_cap_acquisition_throttle_retry_request_timeout (default: 0.5 seconds) + +If the number of caps acquired by the client per session is greater than the +``mds_session_max_caps_throttle_ratio`` and cap acquisition decay counter is +greater than ``mds_session_cap_acquisition_throttle``, the readdir is throttled. +The readdir request is retried after ``mds_cap_acquisition_throttle_retry_request_timeout`` +seconds. + + +Session Liveness +---------------- + +The MDS also keeps track of whether sessions are quiescent. If a client session +is not utilizing its capabilities or is otherwise quiet, the MDS will begin +recalling state from the session even if it's not under cache pressure. This +helps the MDS avoid future work when the cluster workload is hot and cache +pressure is forcing the MDS to recall state. The expectation is that a client +not utilizing its capabilities is unlikely to use those capabilities anytime +in the near future. + +Determining whether a given session is quiescent is controlled by the following +configuration variables:: + + mds_session_cache_liveness_magnitude (default: 10) + +and:: + + mds_session_cache_liveness_decay_rate (default: 5min) + +The configuration ``mds_session_cache_liveness_decay_rate`` indicates the +half-life for the decay counter tracking the use of capabilities by the client. +Each time a client manipulates or acquires a capability, the MDS will increment +the counter. This is a rough but effective way to monitor the utilization of the +client cache. + +The ``mds_session_cache_liveness_magnitude`` is a base-2 magnitude difference +of the liveness decay counter and the number of capabilities outstanding for +the session. So if the client has ``1*2^20`` (1M) capabilities outstanding and +only uses **less** than ``1*2^(20-mds_session_cache_liveness_magnitude)`` (1K +using defaults), the MDS will consider the client to be quiescent and begin +recall. + + +Capability Limit +---------------- + +The MDS also tries to prevent a single client from acquiring too many +capabilities. This helps prevent recovery from taking a long time in some +situations. It is not generally necessary for a client to have such a large +cache. The limit is configured via:: + + mds_max_caps_per_client (default: 1M) + +It is not recommended to set this value above 5M but it may be helpful with +some workloads. |