CephFS Snapshots
================

CephFS supports snapshots, generally created by invoking mkdir within the
``.snap`` directory. Note this is a hidden, special directory, not visible
during a directory listing.

Overview
-----------

Generally, snapshots do what they sound like: they create an immutable view
of the filesystem at the point in time they're taken. There are some headline
features that make CephFS snapshots different from what you might expect:

* Arbitrary subtrees. Snapshots are created within any directory you choose,
  and cover all data in the filesystem under that directory.
* Asynchronous. If you create a snapshot, buffered data is flushed out lazily,
  including from other clients. As a result, "creating" the snapshot is
  very fast.

Important Data Structures
-------------------------
* SnapRealm: A `SnapRealm` is created whenever you create a snapshot at a new
  point in the hierarchy (or when a snapshotted inode is moved outside of its
  parent snapshot). SnapRealms contain an `sr_t srnode` and `inodes_with_caps`
  that are part of the snapshot. Clients also have a SnapRealm concept that
  maintains less data but is used to associate a `SnapContext` with each open
  file for writing.
* sr_t: An `sr_t` is the on-disk snapshot metadata. It is part of the
  containing directory and contains sequence counters, timestamps, the list of
  associated snapshot IDs, and `past_parent_snaps`.
* SnapServer: The SnapServer manages snapshot ID allocation and snapshot
  deletion, and tracks the list of effective snapshots in the filesystem. A
  filesystem has only one snapserver instance.
* SnapClient: The SnapClient is used to communicate with the snapserver; each
  MDS rank has its own snapclient instance. The SnapClient also caches
  effective snapshots locally.

Creating a snapshot
-------------------
The CephFS snapshot feature is enabled by default on new filesystems. To
enable it on an existing filesystem, use the command below.

.. code::

    $ ceph fs set <fs_name> allow_new_snaps true

When snapshots are enabled, all directories in CephFS will have a special
``.snap`` directory. (You may configure a different name with the ``client
snapdir`` setting if you wish.)

To create a CephFS snapshot, create a subdirectory under ``.snap`` with a name
of your choice. For example, to create a snapshot on directory "/1/2/3/",
invoke ``mkdir /1/2/3/.snap/my-snapshot-name``.
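For illustration, the steps below show this as ordinary shell commands on a
client; the ``/mnt/cephfs`` mount point is a hypothetical example, not a
required path:

.. code::

    $ cd /mnt/cephfs/1/2/3
    $ mkdir .snap/my-snapshot-name      # take the snapshot
    $ ls .snap                          # list snapshots rooted at this directory
    my-snapshot-name
    $ ls .snap/my-snapshot-name         # browse the immutable view of "/1/2/3" as of the snapshot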
The mkdir is transmitted to the MDS as a CEPH_MDS_OP_MKSNAP-tagged
`MClientRequest`, and is initially handled in Server::handle_client_mksnap().
It allocates a `snapid` from the `SnapServer`, projects a new inode with the
new SnapRealm, and commits it to the MDLog as usual. When committed, it invokes
`MDCache::do_realm_invalidate_and_update_notify()`, which notifies all clients
with caps on files under "/1/2/3/" about the new SnapRealm. When clients get
the notifications, they update the client-side SnapRealm hierarchy, link files
under "/1/2/3/" to the new SnapRealm, and generate a `SnapContext` for the
new SnapRealm.

Note that this *is not* a synchronous part of the snapshot creation!

Updating a snapshot
-------------------
If you delete a snapshot, a similar process is followed. If you remove an
inode out of its parent SnapRealm, the rename code creates a new SnapRealm for
the renamed inode (if a SnapRealm does not already exist), saves the IDs of
snapshots that are effective on the original parent SnapRealm into the
`past_parent_snaps` of the new SnapRealm, and then follows a process similar
to creating a snapshot.

Generating a SnapContext
------------------------
A RADOS `SnapContext` consists of a snapshot sequence ID (`snapid`) and all
the snapshot IDs that an object is already part of. To generate that list, we
combine the `snapids` associated with the SnapRealm and all valid `snapids` in
`past_parent_snaps`. Stale `snapids` are filtered out by the SnapClient's
cached effective snapshots.

Storing snapshot data
---------------------
File data is stored in RADOS "self-managed" snapshots. Clients are careful to
use the correct `SnapContext` when writing file data to the OSDs.

Storing snapshot metadata
-------------------------
Snapshotted dentries (and their inodes) are stored in-line as part of the
directory they were in at the time of the snapshot. *All dentries* include a
`first` and `last` snapid for which they are valid. (Non-snapshotted dentries
have their `last` set to CEPH_NOSNAP.)

Snapshot writeback
------------------
There is a great deal of code to handle writeback efficiently. When a Client
receives an `MClientSnap` message, it updates the local `SnapRealm`
representation and its links to specific `Inodes`, and generates a `CapSnap`
for the `Inode`. The `CapSnap` is flushed out as part of capability writeback,
and if there is dirty data the `CapSnap` is used to block fresh data writes
until the snapshot is completely flushed to the OSDs.

In the MDS, we generate snapshot-representing dentries as part of the regular
process for flushing them. Dentries with outstanding `CapSnap` data are kept
pinned and in the journal.

Deleting snapshots
------------------
Snapshots are deleted by invoking "rmdir" on the ".snap" directory they are
rooted in. (Attempts to delete a directory which roots snapshots *will fail*;
you must delete the snapshots first. A short shell illustration of this
appears at the end of this document.) Once deleted, snapshots are entered into
the `OSDMap` list of deleted snapshots and the file data is removed by the
OSDs. Metadata is cleaned up as the directory objects are read in and written
back out again.

Hard links
----------
An inode with multiple hard links is moved to a dummy global SnapRealm. The
dummy SnapRealm covers all snapshots in the filesystem. The inode's data is
preserved for any new snapshot, and the preserved data covers snapshots taken
on any link of the inode.

Multi-FS
---------
Snapshots and multiple filesystems don't interact well. Specifically, each
MDS cluster allocates `snapids` independently; if you have multiple
filesystems sharing a single pool (via namespaces), their snapshots *will*
collide and deleting one will result in missing file data for others. (This
may even be invisible, not throwing errors to the user.) If each FS gets its
own pool, things probably work, but this isn't tested and may not be true.
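As referenced in the "Deleting snapshots" section above, a small shell
illustration of the deletion behaviour follows, again using the hypothetical
``/mnt/cephfs`` mount point and the snapshot created in the earlier example:

.. code::

    $ rm -rf /mnt/cephfs/1/2/3                          # fails while a snapshot is still rooted at "3"
    $ rmdir /mnt/cephfs/1/2/3/.snap/my-snapshot-name    # delete the snapshot first
    $ rm -rf /mnt/cephfs/1/2/3                          # now the directory tree can be removed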