From 19fcec84d8d7d21e796c7624e521b60d28ee21ed Mon Sep 17 00:00:00 2001
From: Daniel Baumann
Date: Sun, 7 Apr 2024 20:45:59 +0200
Subject: Adding upstream version 16.2.11+ds.

Signed-off-by: Daniel Baumann
---
 doc/dev/mds_internals/data-structures.rst | 44 ++++++++++++++++++
 doc/dev/mds_internals/exports.rst         | 76 +++++++++++++++++++++++++++++++
 doc/dev/mds_internals/index.rst           | 10 ++++
 3 files changed, 130 insertions(+)
 create mode 100644 doc/dev/mds_internals/data-structures.rst
 create mode 100644 doc/dev/mds_internals/exports.rst
 create mode 100644 doc/dev/mds_internals/index.rst

(limited to 'doc/dev/mds_internals')

diff --git a/doc/dev/mds_internals/data-structures.rst b/doc/dev/mds_internals/data-structures.rst
new file mode 100644
index 000000000..c77175a16
--- /dev/null
+++ b/doc/dev/mds_internals/data-structures.rst
@@ -0,0 +1,44 @@
+MDS internal data structures
+==============================
+
+*CInode*
+  CInode contains the metadata of a file; there is one CInode for each file.
+  The CInode stores information such as who owns the file and how big it is.
+
+*CDentry*
+  CDentry is the glue that holds inodes and files together by relating inodes to
+  file/directory names. A CDentry links to at most one CInode (it may not link
+  to any CInode). A CInode may be linked by multiple CDentries.
+
+*CDir*
+  CDir exists only for directory inodes; it is used to link the CDentries under
+  the directory. A CInode can have multiple CDirs when the directory is fragmented.
+
+These data structures are linked together as::
+
+  CInode
+  CDir
+   |   \
+   |      \
+   |         \
+ CDentry   CDentry
+ CInode    CInode
+ CDir      CDir
+  |         |  \
+  |         |     \
+  |         |        \
+ CDentry   CDentry  CDentry
+ CInode    CInode   CInode
+
+As this doc is being written, a CInode is about 1400 bytes, a CDentry is about
+400 bytes, and a CDir is about 700 bytes. These data structures are
+quite large. Please be careful if you want to add new fields to them.
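The linkage pictured above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the C++ classes used by the MDS: the field names (``ino``, ``owner``, ``size``, ``dirfrags``), the fragment keys, and the ``lookup`` helper are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class CInode:
    """Per-file metadata. Field names are illustrative assumptions."""
    ino: int
    owner: str = "root"
    size: int = 0
    # A directory inode has one CDir per fragment; a regular file has none.
    dirfrags: Dict[str, "CDir"] = field(default_factory=dict)


@dataclass
class CDentry:
    """A name inside a directory; links to at most one CInode."""
    name: str
    inode: Optional[CInode] = None  # None models a dentry with no linked inode


@dataclass
class CDir:
    """One fragment of a directory; holds the dentries of that fragment."""
    frag: str
    dentries: Dict[str, CDentry] = field(default_factory=dict)

    def link(self, name: str, inode: Optional[CInode]) -> CDentry:
        dn = CDentry(name, inode)
        self.dentries[name] = dn
        return dn


def lookup(root: CInode, path: str) -> Optional[CInode]:
    """Walk CInode -> CDir -> CDentry -> CInode for each path component."""
    cur = root
    for name in filter(None, path.split("/")):
        nxt = None
        for cdir in cur.dirfrags.values():  # search every fragment
            dn = cdir.dentries.get(name)
            if dn is not None:
                nxt = dn.inode
                break
        if nxt is None:
            return None
        cur = nxt
    return cur


# Build a tiny tree: / -> home (split into two fragments) -> notes.txt
root = CInode(ino=1)
root.dirfrags["*"] = CDir("*")
home = CInode(ino=2)
root.dirfrags["*"].link("home", home)
home.dirfrags["0"] = CDir("0")
home.dirfrags["1"] = CDir("1")
f = CInode(ino=3, size=4096)
home.dirfrags["1"].link("notes.txt", f)
home.dirfrags["0"].link("alias", f)  # a second CDentry linking the same CInode
```

Here ``lookup(root, "/home/notes.txt")`` and ``lookup(root, "/home/alias")`` resolve to the same ``CInode`` object, which mirrors the statement that one CInode may be linked by multiple CDentries.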
+
+*OpenFileTable*
+  The open file table tracks open files and their ancestor directories. A
+  recovering MDS can easily get open files' paths, significantly reducing the
+  time spent loading inodes for open files. Each entry in the table corresponds
+  to an inode and records that inode's linkage information (parent inode and
+  dentry name). The MDS can construct an inode's path by recursively looking up
+  the parent inode's linkage. The open file table is stored in the omap of RADOS
+  objects; table entries correspond to KV pairs in the omap.
diff --git a/doc/dev/mds_internals/exports.rst b/doc/dev/mds_internals/exports.rst
new file mode 100644
index 000000000..c5b0e3915
--- /dev/null
+++ b/doc/dev/mds_internals/exports.rst
@@ -0,0 +1,76 @@
+
+===============
+Subtree exports
+===============
+
+Normal Migration
+----------------
+
+The exporter begins by doing some checks in export_dir() to verify
+that it is permissible to export the subtree at this time. In
+particular, the cluster must not be degraded, the subtree root may not
+be freezing or frozen (i.e., already exporting, or nested beneath
+something that is exporting), and the path must be pinned (i.e., not
+conflicted with a rename). If these conditions are met, the subtree
+freeze is initiated, and the exporter is committed to the subtree
+migration, barring an intervening failure of the importer or itself.
+
+The MExportDirDiscover serves simply to ensure that the base directory
+being exported is open on the destination node. It is pinned by the
+importer to prevent it from being trimmed. This occurs before the
+exporter completes the freeze of the subtree to ensure that the
+importer is able to replicate the necessary metadata. When the
+exporter receives the MExportDirDiscoverAck, it allows the freeze to proceed.
+
+The MExportDirPrep message then follows to populate a spanning tree that
+includes all dirs, inodes, and dentries necessary to reach any nested
+exports within the exported region. This replicates metadata as well,
+but it is pushed out by the exporter, avoiding deadlock with the
+regular discover and replication process. The importer is responsible
+for opening the bounding directories from any third parties before
+acknowledging. This ensures that the importer has correct dir_auth
+information about where authority is delegated for all points nested
+within the subtree being migrated. While processing the MExportDirPrep,
+the importer freezes the entire subtree region to prevent any new
+replication or cache expiration.
+
+The warning stage occurs only if the base subtree directory is open by
+nodes other than the importer and exporter. If so, then a
+MExportDirNotify message informs any bystanders that the authority for
+the region is temporarily ambiguous. In particular, bystanders who
+are trimming items from their cache must send MCacheExpire messages to
+both the old and new authorities. This is necessary to ensure that
+the surviving authority reliably receives all expirations even if the
+importer or exporter fails. While the subtree is frozen (on both the
+importer and exporter), expirations will not be immediately processed;
+instead, they will be queued until the region is unfrozen and it can
+be determined that the node is or is not authoritative for the region.
+
+The MExportDir message sends the actual subtree metadata to the importer.
+Upon receipt, the importer inserts the data into its cache, logs a
+copy in the EImportStart, and replies with an MExportDirAck. The exporter
+can now log an EExport, which ultimately specifies that
+the export was a success. In the presence of failures, it is the
+existence of the EExport that disambiguates authority during recovery.
+
+Once logged, the exporter will send an MExportDirNotify to any
+bystanders, informing them that the authority is no longer ambiguous
+and cache expirations should be sent only to the new authority (the
+importer). Once these are acknowledged, implicitly flushing the
+bystander-to-exporter message streams of any stray expiration notices,
+the exporter unfreezes the subtree, cleans up its state, and sends a
+final MExportDirFinish to the importer. Upon receipt, the importer logs
+an EImportFinish(true), unfreezes its subtree, and cleans up its
+state.
+
+
+PARTIAL FAILURE RECOVERY
+
+
+
+RECOVERY FROM JOURNAL
+
+
+
+
+
diff --git a/doc/dev/mds_internals/index.rst b/doc/dev/mds_internals/index.rst
new file mode 100644
index 000000000..c8c82ad10
--- /dev/null
+++ b/doc/dev/mds_internals/index.rst
@@ -0,0 +1,10 @@
+==============================
+MDS developer documentation
+==============================
+
+.. rubric:: Contents
+
+.. toctree::
+   :glob:
+
+   *
--
cgit v1.2.3
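As a reading aid for the "Normal Migration" section of exports.rst in this patch, the message order it describes can be written down as a small Python table. This is only a sketch: the message and journal event names come from the text, but the ``Step`` record, the ``NORMAL_MIGRATION`` list, and the ``trace`` helper are hypothetical, and any acknowledgements the text does not name are left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Step:
    sender: str
    receiver: str
    message: str
    effect: str


# Order follows the prose of "Normal Migration"; nothing here is taken
# from the MDS source code itself.
NORMAL_MIGRATION: List[Step] = [
    Step("exporter", "importer", "MExportDirDiscover",
         "importer opens and pins the base directory"),
    Step("importer", "exporter", "MExportDirDiscoverAck",
         "exporter allows the freeze to proceed"),
    Step("exporter", "importer", "MExportDirPrep",
         "spanning tree pushed; importer freezes the region"),
    Step("exporter", "bystanders", "MExportDirNotify",
         "warning stage: authority is temporarily ambiguous"),
    Step("exporter", "importer", "MExportDir",
         "subtree metadata sent; importer logs EImportStart"),
    Step("importer", "exporter", "MExportDirAck",
         "exporter logs EExport"),
    Step("exporter", "bystanders", "MExportDirNotify",
         "authority is now the importer only"),
    Step("exporter", "importer", "MExportDirFinish",
         "importer logs EImportFinish(true) and unfreezes"),
]


def trace(has_bystanders: bool) -> List[str]:
    """Return the message names in order, skipping bystander
    notifications when no third node has the base directory open."""
    return [
        s.message
        for s in NORMAL_MIGRATION
        if has_bystanders or s.receiver != "bystanders"
    ]
```

Calling ``trace(False)`` gives the two-party exchange between exporter and importer, while ``trace(True)`` adds the two MExportDirNotify rounds that bracket the period of ambiguous authority.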