1 files changed, 133 insertions, 0 deletions
diff --git a/doc/dev/cephfs-snapshots.rst b/doc/dev/cephfs-snapshots.rst
new file mode 100644
index 00000000..509b5ff3
--- /dev/null
+++ b/doc/dev/cephfs-snapshots.rst
@@ -0,0 +1,133 @@
+CephFS Snapshots
+================
+
+CephFS supports snapshots, generally created by invoking mkdir within the
+``.snap`` directory. Note this is a hidden, special directory, not visible
+during a directory listing.
+
+Overview
+-----------
+
+Generally, snapshots do what they sound like: they create an immutable view
+of the filesystem at the point in time they're taken. There are some headline
+features that make CephFS snapshots different from what you might expect:
+
+* Arbitrary subtrees. Snapshots are created within any directory you choose,
+  and cover all data in the filesystem under that directory.
+* Asynchronous. If you create a snapshot, buffered data is flushed out lazily,
+  including from other clients. As a result, "creating" the snapshot is
+  very fast.
+
+Important Data Structures
+-------------------------
+* SnapRealm: A `SnapRealm` is created whenever you create a snapshot at a new
+  point in the hierarchy (or, when a snapshotted inode is move outside of its
+  parent snapshot). SnapRealms contain an `sr_t srnode`, and `inodes_with_caps`
+  that are part of the snapshot. Clients also have a SnapRealm concept that
+  maintains less data but is used to associate a `SnapContext` with each open
+  file for writing.
+* sr_t: An `sr_t` is the on-disk snapshot metadata. It is part of the containing
+  directory and contains sequence counters, timestamps, the list of associated
+  snapshot IDs, and `past_parent_snaps`.
+* SnapServer: SnapServer manages snapshot ID allocation, snapshot deletion and
+  tracks list of effective snapshots in the filesystem. A filesystem only has
+  one instance of snapserver.
+* SnapClient: SnapClient is used to communicate with snapserver, each MDS rank
+  has its own snapclient instance. SnapClient also caches effective snapshots
+  locally.
+
+Creating a snapshot
+-------------------
+CephFS snapshot feature is enabled by default on new filesystem. To enable it
+on existing filesystems, use command below.
+
+.. code::
+
+       $ ceph fs set <fs_name> allow_new_snaps true
+
+When snapshots are enabled, all directories in CephFS will have a special
+``.snap`` directory. (You may configure a different name with the ``client
+snapdir`` setting if you wish.)
+
+To create a CephFS snapshot, create a subdirectory under
+``.snap`` with a name of your choice. For example, to create a snapshot on
+directory "/1/2/3/", invoke ``mkdir /1/2/3/.snap/my-snapshot-name`` .
+
+This is transmitted to the MDS Server as a
+CEPH_MDS_OP_MKSNAP-tagged `MClientRequest`, and initially handled in
+Server::handle_client_mksnap(). It allocates a `snapid` from the `SnapServer`,
+projects a new inode with the new SnapRealm, and commits it to the MDLog as
+usual. When committed, it invokes
+`MDCache::do_realm_invalidate_and_update_notify()`, which notifies all clients
+with caps on files under "/1/2/3/", about the new SnapRealm. When clients get
+the notifications, they update client-side SnapRealm hierarchy, link files
+under "/1/2/3/" to the new SnapRealm and generate a `SnapContext` for the
+new SnapRealm.
+
+Note that this *is not* a synchronous part of the snapshot creation!
+
+Updating a snapshot
+-------------------
+If you delete a snapshot, a similar process is followed. If you remove an inode
+out of its parent SnapRealm, the rename code creates a new SnapRealm for the
+renamed inode (if SnapRealm does not already exist), saves IDs of snapshots that
+are effective on the original parent SnapRealm into `past_parent_snaps` of the
+new SnapRealm, then follows a process similar to creating snapshot.
+
+Generating a SnapContext
+------------------------
+A RADOS `SnapContext` consists of a snapshot sequence ID (`snapid`) and all
+the snapshot IDs that an object is already part of. To generate that list, we
+combine `snapids` associated with the SnapRealm and all valid `snapids` in
+`past_parent_snaps`. Stale `snapids` are filtered out by SnapClient's cached
+effective snapshots.
+
+Storing snapshot data
+---------------------
+File data is stored in RADOS "self-managed" snapshots. Clients are careful to
+use the correct `SnapContext` when writing file data to the OSDs.
+
+Storing snapshot metadata
+-------------------------
+Snapshotted dentries (and their inodes) are stored in-line as part of the
+directory they were in at the time of the snapshot. *All dentries* include a
+`first` and `last` snapid for which they are valid. (Non-snapshotted dentries
+will have their `last` set to CEPH_NOSNAP).
+
+Snapshot writeback
+------------------
+There is a great deal of code to handle writeback efficiently. When a Client
+receives an `MClientSnap` message, it updates the local `SnapRealm`
+representation and its links to specific `Inodes`, and generates a `CapSnap`
+for the `Inode`. The `CapSnap` is flushed out as part of capability writeback,
+and if there is dirty data the `CapSnap` is used to block fresh data writes
+until the snapshot is completely flushed to the OSDs.
+
+In the MDS, we generate snapshot-representing dentries as part of the regular
+process for flushing them. Dentries with outstanding `CapSnap` data is kept
+pinned and in the journal.
+
+Deleting snapshots
+------------------
+Snapshots are deleted by invoking "rmdir" on the ".snap" directory they are
+rooted in. (Attempts to delete a directory which roots snapshots *will fail*;
+you must delete the snapshots first.) Once deleted, they are entered into the
+`OSDMap` list of deleted snapshots and the file data is removed by the OSDs.
+Metadata is cleaned up as the directory objects are read in and written back
+out again.
+
+Hard links
+----------
+Inode with multiple hard links is moved to a dummy global SnapRealm. The
+dummy SnapRealm covers all snapshots in the filesystem. The inode's data
+will be preserved for any new snapshot. These preserved data will cover
+snapshots on any linkage of the inode.
+
+Multi-FS
+---------
+Snapshots and multiple filesystems don't interact well. Specifically, each
+MDS cluster allocates `snapids` independently; if you have multiple filesystems
+sharing a single pool (via namespaces), their snapshots *will* collide and
+deleting one will result in missing file data for others. (This may even be
+invisible, not throwing errors to the user.) If each FS gets its own
+pool things probably work, but this isn't tested and may not be true.