author     Daniel Baumann <daniel.baumann@progress-linux.org>  2024-04-21 11:54:28 +0000
committer  Daniel Baumann <daniel.baumann@progress-linux.org>  2024-04-21 11:54:28 +0000
commit     e6918187568dbd01842d8d1d2c808ce16a894239 (patch)
tree       64f88b554b444a49f656b6c656111a145cbbaa28 /doc/cephfs/dynamic-metadata-management.rst
parent     Initial commit. (diff)
Adding upstream version 18.2.2.

Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>

Diffstat (limited to 'doc/cephfs/dynamic-metadata-management.rst')

 -rw-r--r--  doc/cephfs/dynamic-metadata-management.rst | 90
 1 file changed, 90 insertions, 0 deletions
diff --git a/doc/cephfs/dynamic-metadata-management.rst b/doc/cephfs/dynamic-metadata-management.rst
new file mode 100644
index 000000000..6e7ada9fc
--- /dev/null
+++ b/doc/cephfs/dynamic-metadata-management.rst
@@ -0,0 +1,90 @@

==================================
CephFS Dynamic Metadata Management
==================================

Metadata operations usually account for more than 50 percent of all
file system operations. Metadata also scales in a more complex fashion
than storage, which scales I/O throughput more or less linearly,
because of the hierarchical and interdependent nature of file system
metadata. In CephFS the metadata workload is therefore decoupled from
the data workload, so that it does not place unnecessary strain on the
RADOS cluster: metadata is handled by a separate cluster of Metadata
Servers (MDSs). CephFS distributes metadata across the MDSs via
`Dynamic Subtree Partitioning <https://ceph.com/assets/pdfs/weil-mds-sc04.pdf>`__.

Dynamic Subtree Partitioning
----------------------------

In traditional subtree partitioning, subtrees of the file system
hierarchy are statically assigned to individual MDSs. This distribution
strategy provides good hierarchical locality, linear cache growth,
horizontal scaling across MDSs, and a fairly good distribution of
metadata across MDSs.

.. image:: subtree-partitioning.svg

The problem with traditional subtree partitioning is that workload
growth in depth, confined to a single MDS, creates a hotspot of
activity on that MDS. The result is a lack of vertical scaling and
wasted capacity on the MDSs that are not busy.

This led to the adoption of a more dynamic way of handling metadata:
Dynamic Subtree Partitioning, in which load-intensive portions of the
directory hierarchy are migrated from busy MDSs to non-busy MDSs.

This strategy ensures that activity hotspots are relieved as they
appear, and so provides vertical scaling of the metadata workload in
addition to horizontal scaling.
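The balancing policy can be illustrated with a small, self-contained
sketch. Everything below is hypothetical (the ``Subtree`` and ``MDS``
classes, the load metric, and the ``rebalance`` function do not
correspond to Ceph's actual MDS balancer); it only models the policy
stated above: when one MDS becomes much busier than another, its
hottest subtree is moved to the least busy MDS.

.. code-block:: python

   from dataclasses import dataclass, field

   @dataclass
   class Subtree:
       path: str
       load: float                  # e.g. recent metadata-op rate in this subtree

   @dataclass
   class MDS:
       rank: int
       subtrees: list = field(default_factory=list)

       @property
       def load(self):
           return sum(s.load for s in self.subtrees)

   def rebalance(mdss, threshold=1.5):
       """Move the hottest subtree from the busiest MDS to the least busy
       one whenever the load imbalance exceeds ``threshold``."""
       busiest = max(mdss, key=lambda m: m.load)
       idlest = min(mdss, key=lambda m: m.load)
       if busiest is idlest or busiest.load <= threshold * idlest.load:
           return None              # load is balanced enough, do nothing
       if len(busiest.subtrees) < 2:
           return None              # nothing to split off in this toy model
       hottest = max(busiest.subtrees, key=lambda s: s.load)
       busiest.subtrees.remove(hottest)   # "export" from the busy MDS ...
       idlest.subtrees.append(hottest)    # ... and "import" on the idle one
       return hottest

   # Example: rank 0 is overloaded, rank 1 is idle.
   mds0 = MDS(0, [Subtree("/home/alice", 900.0), Subtree("/home/bob", 50.0)])
   mds1 = MDS(1, [])
   moved = rebalance([mds0, mds1])   # migrates /home/alice to rank 1

In the real system the remove/append step is of course not a simple
list operation; it is the multi-phase export protocol described in the
next section.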
Export Process During Subtree Migration
----------------------------------------

Once the exporter verifies that the subtree may be exported (the
cluster is not degraded and the subtree root is not frozen), the
subtree root directory is temporarily auth pinned, the subtree freeze
is initiated, and the exporter is committed to the subtree migration,
barring an intervening failure of the importer or of the exporter
itself.

The MExportDiscover message is exchanged to ensure that the inode for
the base directory being exported is open on the destination node. It
is auth pinned by the importer to prevent it from being trimmed. This
occurs before the exporter completes the freeze of the subtree, to
ensure that the importer is able to replicate the necessary metadata.
When the exporter receives the MDiscoverAck, it allows the freeze to
proceed by removing its temporary auth pin.

A warning stage occurs only if the base subtree directory is open by
nodes other than the importer and exporter. If it is not, this implies
that no metadata within or nested beneath the subtree is replicated by
any node other than the importer and exporter. If it is, an
MExportWarning message informs any bystanders that the authority for
the region is temporarily ambiguous, and lists both the exporter and
importer as authoritative MDS nodes. In particular, bystanders who are
trimming items from their cache must send MCacheExpire messages to
both the old and new authorities. This is necessary to ensure that the
surviving authority reliably receives all expirations even if the
importer or exporter fails. While the subtree is frozen (on both the
importer and exporter), expirations are not processed immediately;
instead, they are queued until the region is unfrozen and it can be
determined whether or not the node is authoritative.

The exporter then packages an MExport message containing all metadata
of the subtree and flags the objects as non-authoritative. The MExport
message carries the actual subtree metadata to the importer. Upon
receipt, the importer inserts the data into its cache, marks all
objects as authoritative, and logs a copy of all metadata in an
EImportStart journal message. Once that has safely flushed, it replies
with an MExportAck. The exporter can now log an EExport journal entry,
which ultimately records that the export was a success. In the
presence of failures, it is the existence of the EExport entry alone
that disambiguates authority during recovery.

Once the EExport entry is logged, the exporter sends an MExportNotify
to any bystanders, informing them that the authority is no longer
ambiguous and that cache expirations should be sent only to the new
authority (the importer). Once these notifications are acknowledged
back to the exporter, implicitly flushing any stray expiration notices
out of the bystander-to-exporter message streams, the exporter
unfreezes the subtree, cleans up its migration-related state, and
sends a final MExportFinish to the importer. Upon receipt, the
importer logs an EImportFinish(true) entry (noting locally that the
export was indeed a success), unfreezes its subtree, processes any
queued cache expirations, and cleans up its state.
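To make the ordering easier to follow, the sketch below condenses the
message flow described in this section into plain Python. The class
and method names are illustrative only and do not mirror Ceph's C++
implementation; journal flushing, failure handling, and the
acknowledgement round-trips are reduced to comments. What it preserves
is the sequence: discover, freeze, warn, export, journal, notify,
finish.

.. code-block:: python

   class Importer:
       """The MDS that receives the subtree."""

       def __init__(self):
           self.cache = {}
           self.journal = []
           self.frozen = False

       def handle_discover(self, subtree):
           # Open and auth-pin the base inode so it cannot be trimmed, then
           # acknowledge so that the exporter can let its freeze complete.
           self.cache[subtree] = "replica (auth-pinned)"
           return "MDiscoverAck"

       def handle_export(self, subtree, metadata):
           # Insert the metadata, mark it authoritative, and journal
           # EImportStart before acknowledging.
           self.frozen = True
           self.cache[subtree] = ("authoritative", metadata)
           self.journal.append(("EImportStart", subtree))
           return "MExportAck"

       def handle_export_finish(self, subtree):
           # The export is final: journal EImportFinish(true), unfreeze,
           # and (in the real system) process any queued cache expirations.
           self.journal.append(("EImportFinish", subtree, True))
           self.frozen = False


   class Bystander:
       """Any other MDS that replicates metadata inside the subtree."""

       def __init__(self):
           self.expire_targets = {}

       def note_ambiguous_authority(self, subtree):
           # MExportWarning: send MCacheExpire to both old and new authority.
           self.expire_targets[subtree] = ("exporter", "importer")

       def note_new_authority(self, subtree, importer):
           # MExportNotify: from now on expire only to the importer.
           self.expire_targets[subtree] = (importer,)


   class Exporter:
       """The MDS that gives the subtree away."""

       def __init__(self):
           self.journal = []
           self.state = "idle"

       def export(self, subtree, metadata, importer, bystanders=()):
           # 1. Auth-pin the subtree root and start freezing it; the
           #    exporter is now committed unless it or the importer fails.
           self.state = "freezing"
           # 2. MExportDiscover: make sure the importer has the base inode
           #    open and pinned before the freeze completes.
           assert importer.handle_discover(subtree) == "MDiscoverAck"
           self.state = "frozen"
           # 3. MExportWarning (only if bystanders exist): authority is
           #    temporarily ambiguous, expire to both exporter and importer.
           for b in bystanders:
               b.note_ambiguous_authority(subtree)
           # 4. MExport: ship the subtree metadata; the local copy becomes
           #    non-authoritative.
           assert importer.handle_export(subtree, metadata) == "MExportAck"
           # 5. EExport: during recovery, this journal entry alone decides
           #    that the export succeeded.
           self.journal.append(("EExport", subtree))
           # 6. MExportNotify: bystanders switch to the new authority (their
           #    acknowledgements are awaited before unfreezing in reality).
           for b in bystanders:
               b.note_new_authority(subtree, importer)
           # 7. Unfreeze, clean up, and send MExportFinish to the importer.
           self.state = "idle"
           importer.handle_export_finish(subtree)


   # A migration of /home/alice from one MDS to another, with one bystander.
   exporter, importer, bystander = Exporter(), Importer(), Bystander()
   exporter.export("/home/alice", {"dentries": ["f1", "f2"]}, importer, [bystander])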