1 files changed, 281 insertions, 0 deletions
diff --git a/doc/dev/rbd-layering.rst b/doc/dev/rbd-layering.rst
new file mode 100644
index 000000000..e6e224ce4
--- /dev/null
+++ b/doc/dev/rbd-layering.rst
@@ -0,0 +1,281 @@
+============
+RBD Layering
+============
+
+RBD layering refers to the creation of copy-on-write clones of block
+devices. This allows for fast image creation, for example to clone a
+golden master image of a virtual machine into a new instance. To
+simplify the semantics, you can only create a clone of a snapshot -
+snapshots are always read-only, so the rest of the image is
+unaffected, and there's no possibility of writing to them
+accidentally.
+
+From a user's perspective, a clone is just like any other rbd image.
+You can take snapshots of them, read/write them, resize them, etc.
+There are no restrictions on clones from a user's viewpoint.
+
+Note: the terms `child` and `parent` below mean an rbd image created
+by cloning, and the rbd image snapshot a child was cloned from.
+
+Command line interface
+----------------------
+
+Before cloning a snapshot, you must mark it as protected, to prevent
+it from being deleted while child images refer to it:
+::
+
+    $ rbd snap protect pool/image@snap
+
+Then you can perform the clone:
+::
+
+    $ rbd clone [--parent] pool/parent@snap [--image] pool2/child1
+
+You can create a clone with different object sizes from the parent:
+::
+
+    $ rbd clone --order 25 pool/parent@snap pool2/child2
+
+To delete the parent, you must first mark it unprotected, which checks
+that there are no children left:
+::
+
+    $ rbd snap unprotect pool/image@snap
+    Cannot unprotect: Still in use by pool2/image2
+    $ rbd children pool/image@snap
+    pool2/child1
+    pool2/child2
+    $ rbd flatten pool2/child1
+    $ rbd rm pool2/child2
+    $ rbd snap rm pool/image@snap
+    Cannot remove a protected snapshot: pool/image@snap
+    $ rbd snap unprotect pool/image@snap
+
+Then the snapshot can be deleted like normal:
+::
+
+    $ rbd snap rm pool/image@snap
+
+Implementation
+--------------
+
+Data Flow
+^^^^^^^^^
+
+In the initial implementation, called 'trivial layering', there will
+be no tracking of which objects exist in a clone. A read that hits a
+non-existent object will attempt to read from the parent snapshot, and
+this will continue recursively until an object exists or an image with
+no parent is found. This is done through the normal read path from
+the parent, so differing object sizes between parents and children
+do not matter.
+
+Before a write to an object is performed, the object is checked for
+existence. If it doesn't exist, a copy-up operation is performed,
+which means reading the relevant range of data from the parent
+snapshot and writing it (plus the original write) to the child
+image. To prevent races with multiple writes trying to copy-up the
+same object, this copy-up operation will include an atomic create. If
+the atomic create fails, the original write is done instead. This
+copy-up operation is implemented as a class method so that extra
+metadata can be stored by it in the future. In trivial layering, the
+copy-up operation copies the entire range needed to the child object
+(that is, the full size of the child object). A future optimization
+could make this copy-up more fine-grained.
+
+Another future optimization could be storing a bitmap of which objects
+actually exist in a child. This would obviate the check for existence
+before each write, and let reads go directly to the parent if needed.
+
+These optimizations are discussed in:
+
+http://marc.info/?l=ceph-devel&m=129867273303846
+
+Parent/Child relationships
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Children store a reference to their parent in their header, as a tuple
+of (pool id, image id, snapshot id). This is enough information to
+open the parent and read from it.
+
+In addition to knowing which parent a given image has, we want to be
+able to tell if a protected snapshot still has children. This is
+accomplished with a new per-pool object, `rbd_children`, which maps
+(parent pool id, parent image id, parent snapshot id) to a list of
+child image ids. This is stored in the same pool as the child image
+because the client creating a clone already has read/write access to
+everything in this pool, but may not have write access to the parent's
+pool. This lets a client with read-only access to one pool clone a
+snapshot from that pool into a pool they have full access to. It
+increases the cost of unprotecting an image, since this needs to check
+for children in every pool, but this is a rare operation. It would
+likely only be done before removing old images, which is already much
+more expensive because it involves deleting every data object in the
+image.
+
+Protection
+^^^^^^^^^^
+
+Internally, protection_state is a field in the header object that
+can be in three states. "protected", "unprotected", and
+"unprotecting". The first two are set as the result of "rbd
+protect/unprotect". The "unprotecting" state is set while the "rbd
+unprotect" command checks for any child images. Only snapshots in the
+"protected" state may be cloned, so the "unprotected" state prevents
+a race like:
+
+1. A: walk through all pools, look for clones, find none
+2. B: create a clone
+3. A: unprotect parent
+4. A: rbd snap rm pool/parent@snap
+
+Resizing
+^^^^^^^^
+
+Resizing an rbd image is like truncating a sparse file. New space is
+treated as zeroes, and shrinking an rbd image deletes the contents
+beyond the old bounds. This means that if you have a 10G image full of
+data, and you resize it down to 5G and then up to 10G again, the last
+5G is treated as zeroes (and any objects that held that data were
+removed when the image was shrunk).
+
+Layering complicates this because the absence of an object no longer
+implies it should be treated as zeroes - if the object is part of a
+clone, it may mean that some data needs to be read from the parent.
+
+To preserve the resizing behavior for clones, we need to keep track of
+which objects could be stored in the parent. We can track this as the
+amount of overlap the child has with the parent, since resizing only
+changes the end of an image. When a child is created, its overlap
+is the size of the parent snapshot. On each subsequent resize, the
+overlap is `min(overlap, new_size)`. That is, shrinking the image
+may shrinks the overlap, but increasing the image's size does not
+change the overlap.
+
+Objects that do not exist past the overlap are treated as zeroes.
+Objects that do not exist before that point fall back to reading
+from the parent.
+
+Since this overlap changes over time, we store it as part of the
+metadata for a snapshot as well.
+
+Renaming
+^^^^^^^^
+
+Currently the rbd header object (that stores all the metadata about an
+image) is named after the name of the image. This makes renaming
+disrupt clients who have the image open (such as children reading from
+a parent). To avoid this, we can name the header object by the
+id of the image, which does not change. That is, the name of the
+header object could be `rbd_header.$id`, where $id is a unique id for
+the image in the pool.
+
+When a client opens an image, all it knows is the name. There is
+already a per-pool `rbd_directory` object that maps image names to
+ids, but if we relied on it to get the id, we could not open any
+images in that pool if that single object was unavailable. To avoid
+this dependency, we can store the id of an image in an object called
+`rbd_id.$image_name`, where $image_name is the name of the image. The
+per-pool `rbd_directory` object is still useful for listing all images
+in a pool, however.
+
+Header changes
+--------------
+
+The header needs a few new fields:
+
+* int64_t parent_pool_id
+* string parent_image_id
+* uint64_t parent_snap_id
+* uint64_t overlap (how much of the image may be referring to the parent)
+
+These are stored in a "parent" key, which is only present if the image
+has a parent.
+
+cls_rbd
+^^^^^^^
+
+Some new methods are needed:
+::
+
+    /***************** methods on the rbd header *********************/
+    /**
+     * Sets the parent and overlap keys.
+     * Fails if any of these keys exist, since the image already
+     * had a parent.
+     */
+    set_parent(uint64_t pool_id, string image_id, uint64_t snap_id)
+
+    /**
+     * returns the parent pool id, image id, snap id, and overlap, or -ENOENT
+     * if parent_pool_id does not exist or is -1
+     */
+    get_parent(uint64_t snapid)
+
+    /**
+     * Removes the parent key
+     */
+    remove_parent() // after all parent data is copied to the child
+
+    /*************** methods on the rbd_children object *****************/
+
+    add_child(uint64_t parent_pool_id, string parent_image_id,
+              uint64_t parent_snap_id, string image_id);
+    remove_child(uint64_t parent_pool_id, string parent_image_id,
+                 uint64_t parent_snap_id, string image_id);
+    /**
+     * List ids of a given parent
+     */
+    get_children(uint64_t parent_pool_id, string parent_image_id,
+                 uint64_t parent_snap_id, uint64_t max_return,
+                 string start);
+    /**
+     * list parent
+     */
+    get_parents(uint64_t max_return, uint64_t start_pool_id,
+                string start_image_id, string start_snap_id);
+
+
+    /************ methods on the rbd_id.$image_name object **************/
+
+    set_id(string id)
+    get_id()
+
+    /************** methods on the rbd_directory object *****************/
+
+    dir_get_id(string name);
+    dir_get_name(string id);
+    dir_list(string start_after, uint64_t max_return);
+    dir_add_image(string name, string id);
+    dir_remove_image(string name, string id);
+    dir_rename_image(string src, string dest, string id);
+
+Two existing methods will change if the image supports
+layering:
+::
+
+    snapshot_add - stores current overlap and has_parent with
+                   other snapshot metadata (images that don't have
+                   layering enabled aren't affected)
+
+    set_size     - will adjust the parent overlap down as needed.
+
+librbd
+^^^^^^
+
+Opening a child image opens its parent (and this will continue
+recursively as needed). This means that an ImageCtx will contain a
+pointer to the parent image context. Differing object sizes won't
+matter, since reading from the parent will go through the parent
+image context.
+
+Discard will need to change for layered images so that it only
+truncates objects, and does not remove them. If we removed objects, we
+could not tell if we needed to read them from the parent.
+
+A new clone method will be added, which takes the same arguments as
+create except size (size of the parent image is used).
+
+Instead of expanding the rbd_info struct, we will break the metadata
+retrieval into several API calls.  Right now, the only users of
+rbd_stat() other than 'rbd info' only use it to retrieve image size.