 doc/dev/seastore.rst | 162 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 162 insertions(+), 0 deletions(-)
diff --git a/doc/dev/seastore.rst b/doc/dev/seastore.rst
new file mode 100644
index 00000000..ae2b014a
--- /dev/null
+++ b/doc/dev/seastore.rst
@@ -0,0 +1,162 @@
+==========
+ SeaStore
+==========
+
+This is a rough design doc for a new ObjectStore implementation
+designed to facilitate higher performance on solid state devices.
+
+Name
+====
+
+SeaStore maximizes the opportunity for confusion (seastar? seashore?)
+and associated fun. Alternative suggestions welcome.
+
+
+Goals
+=====
+
+* Target NVMe devices. Not primarily concerned with pmem or HDD.
+* Make use of SPDK for user-space driven IO.
+* Use the Seastar futures programming model to facilitate run-to-completion and a sharded memory/processing model.
+* Allow zero (or minimal) data copying on the read and write paths when combined with a seastar-based messenger using DPDK.
+
+
+Motivation and background
+=========================
+
+All flash devices are internally structured in terms of segments that
+can be written efficiently but must be erased in their entirety. The
+NVMe device generally has limited knowledge about what data in a
+segment is still "live" (hasn't been logically discarded), making the
+inevitable garbage collection within the device inefficient. We can
+design an on-disk layout that is friendly to GC at lower layers and
+drive garbage collection at higher layers.
+
+In principle a fine-grained discard could communicate our intent to
+the device, but in practice discard is poorly implemented in the
+device and intervening software layers.
+
+
+Basics
+======
+
+The basic idea is that all data will be streamed out sequentially to
+large segments on the device. In the SSD hardware, segments are
+likely to be on the order of hundreds of MB to tens of GB.
+
+SeaStore's logical segments would ideally be perfectly aligned with
+the hardware segments. In practice, it may be challenging to
+determine geometry and to sufficiently hint to the device that LBAs
+being written should be aligned to the underlying hardware. In the
+worst case, we can structure our logical segments to correspond to
+e.g. 5x the physical segment size: a misaligned logical segment then
+fully covers four physical segments and only partially covers the two
+at its ends, so roughly one fifth (~20%) of our data is misaligned.
+
+When we reach some utilization threshold, we mix cleaning work in with
+the ongoing write workload in order to evacuate live data from
+previously written segments. Once a segment is completely free, we can
+discard it in its entirety so that it can be erased and reclaimed by
+the device.
+
+The key is to mix a small amount of cleaning work into every write
+transaction to avoid spikes and variance in write latency.
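+
+As a minimal sketch of that idea, assuming illustrative types like
+``Transaction`` and ``SegmentCleaner`` (these are not actual SeaStore
+interfaces), the cleaner could attach a small, bounded amount of
+evacuation work to each write once utilization crosses a threshold:
+
+.. code-block:: cpp
+
+   #include <cstdint>
+   #include <vector>
+
+   // Illustrative types only; the real SeaStore structures will differ.
+   struct Extent { uint64_t segment_id; uint64_t offset; uint32_t len; };
+
+   struct Transaction {
+     std::vector<Extent> writes;        // new data from the client op
+     std::vector<Extent> cleaning_work; // live extents rewritten for GC
+   };
+
+   struct SegmentCleaner {
+     double utilization = 0.0;          // fraction of the device in use
+     double threshold = 0.8;            // start cleaning above this
+     std::vector<Extent> live_backlog;  // live extents awaiting evacuation
+
+     // Attach a small, bounded amount of cleaning work to every write so
+     // that GC cost is amortized and write latency stays smooth.
+     void add_cleaning_work(Transaction& t, unsigned max_extents = 2) {
+       if (utilization < threshold)
+         return;
+       for (unsigned i = 0; i < max_extents && !live_backlog.empty(); ++i) {
+         t.cleaning_work.push_back(live_backlog.back());
+         live_backlog.pop_back();
+       }
+     }
+   };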
+
+
+
+Data layout basics
+==================
+
+One or more cores/shards will be reading and writing to the device at
+once. Each shard will operate on its own independent data and stream
+it out to its own open segments. Devices that support streams can be
+hinted accordingly so that data from different shards is not mixed on
+the underlying media.
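+
+A rough sketch of the per-shard writer state this implies; the
+``stream_id`` hint and field names are assumptions for illustration:
+
+.. code-block:: cpp
+
+   #include <cstdint>
+   #include <optional>
+   #include <vector>
+
+   // Per-shard writer state, illustrative only.
+   struct OpenSegment {
+     uint64_t segment_id;
+     uint64_t write_offset = 0;  // next append position in the segment
+   };
+
+   struct ShardWriter {
+     unsigned shard_id;
+     std::vector<OpenSegment> open_segments;  // segments this shard is filling
+     std::optional<uint32_t> stream_id;       // stream/placement hint, if any
+   };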
+
+Global state
+------------
+
+There will be a simple global table of segments and their usage/empty
+status. Each shard will occasionally claim new empty segments for
+writing as needed, or return cleaned segments to the global free list.
+
+At a high level, all metadata will be structured as a b-tree. The
+root for the metadata btree will also be stored centrally (along with
+the segment allocation table).
+
+This is hand-wavey, but it is probably sufficient to update the root
+pointer for the btree either as each segment is sealed or as a new
+segment is opened.
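+
+A sketch of what that global state might look like; the names and the
+exact contents are illustrative, not the actual implementation:
+
+.. code-block:: cpp
+
+   #include <cstdint>
+   #include <vector>
+
+   // Illustrative global state shared across shards.
+   enum class SegmentState { EMPTY, OPEN, CLOSED, CLEANING };
+
+   struct SegmentInfo {
+     SegmentState state = SegmentState::EMPTY;
+     unsigned owner_shard = 0;  // which shard is writing it, if OPEN
+     uint64_t live_bytes = 0;   // rough utilization, for choosing GC victims
+   };
+
+   struct SuperBlock {
+     // Root of the metadata b-tree, updated as segments are sealed/opened.
+     uint64_t btree_root_addr = 0;
+     std::vector<SegmentInfo> segments;  // the segment allocation table
+
+     uint64_t claim_empty_segment(unsigned shard) {
+       for (uint64_t i = 0; i < segments.size(); ++i) {
+         if (segments[i].state == SegmentState::EMPTY) {
+           segments[i].state = SegmentState::OPEN;
+           segments[i].owner_shard = shard;
+           return i;
+         }
+       }
+       return UINT64_MAX;  // no free segment; caller must clean first
+     }
+   };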
+
+
+Writing segments
+----------------
+
+Each segment will be written sequentially as a sequence of
+transactions. Each transaction will be an on-disk expression of an
+ObjectStore::Transaction. It will consist of:
+
+* new data blocks
+* some metadata describing changes to b-tree metadata blocks. This
+  will be written compactly as a delta: which keys are removed and
+  which keys/values are inserted into the b-tree block.
+
+As each b-tree block is modified, we update the block in memory and
+put it on a 'dirty' list. However, again, only the (compact) delta is journaled
+to the segment.
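+
+A rough sketch of what such an on-disk transaction record and its
+compact b-tree deltas might contain; field names and types are
+illustrative assumptions, not the actual encoding:
+
+.. code-block:: cpp
+
+   #include <cstdint>
+   #include <string>
+   #include <utility>
+   #include <vector>
+
+   struct DataBlock {
+     uint64_t logical_addr;       // where the new block lives logically
+     std::vector<uint8_t> bytes;  // the data itself
+   };
+
+   struct BTreeDelta {
+     uint64_t block_addr;                    // which b-tree block this modifies
+     std::vector<std::string> removed_keys;  // keys deleted from the block
+     std::vector<std::pair<std::string, std::string>> inserted;  // keys/values added
+   };
+
+   struct TransactionRecord {
+     std::vector<DataBlock> data_blocks;   // new data written by the op
+     std::vector<BTreeDelta> deltas;       // compact journal of metadata changes
+     std::vector<uint8_t> flushed_blocks;  // serialized dirty b-tree blocks
+                                           // written out near the segment end
+   };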
+
+As we approach the end of the segment, the goal is to undirty all of
+our dirty blocks in memory. Based on the number of dirty blocks and
+the remaining space, we include a proportional number of dirty blocks
+in each transaction write so that we undirty some of the b-tree
+blocks. Eventually, the last transaction written to the segment will
+include all of the remaining dirty b-tree blocks.
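+
+The pacing could be as simple as writing dirty blocks in proportion to
+how much of the segment remains; a sketch under that assumption (this
+helper is illustrative, not actual SeaStore code):
+
+.. code-block:: cpp
+
+   #include <algorithm>
+   #include <cstdint>
+
+   // How many dirty b-tree blocks to fold into the next transaction so
+   // that all of them have been written by the time the segment fills up.
+   uint64_t dirty_blocks_to_write(uint64_t dirty_blocks,
+                                  uint64_t txn_bytes,        // size of this transaction
+                                  uint64_t remaining_bytes)  // space left in the segment
+   {
+     if (remaining_bytes <= txn_bytes)
+       return dirty_blocks;  // last transaction: flush everything left
+     // Undirty blocks at the same rate we are consuming the segment.
+     uint64_t n = dirty_blocks * txn_bytes / remaining_bytes;
+     return std::min(std::max<uint64_t>(n, 1), dirty_blocks);
+   }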
+
+Segment inventory
+-----------------
+
+At the end of each segment, an inventory will be written that includes
+any metadata needed to test whether blocks in the segment are still
+live. For data blocks, that means an object id (e.g., ino number) and
+offset to test whether the block is still referenced. For metadata
+blocks, it would be at least one metadata key that lands in any b-tree
+block that is modified (via a delta) in the segment--enough for us to
+do a forward lookup in the b-tree to check whether the b-tree block is
+still referenced. Once this is written, the segment is sealed and read-only.
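+
+A sketch of the kind of inventory entries this implies; the structures
+and names are illustrative only:
+
+.. code-block:: cpp
+
+   #include <cstdint>
+   #include <string>
+   #include <variant>
+   #include <vector>
+
+   // One entry per block in the segment, written at seal time so the
+   // cleaner can later test whether each block is still live.
+   struct DataBackref {
+     uint64_t ino;     // owning object id
+     uint64_t offset;  // logical offset within the object
+   };
+
+   struct MetadataBackref {
+     std::string key;  // any key landing in the modified b-tree block; a
+                       // forward lookup tells us if that block is still live
+   };
+
+   struct SegmentInventory {
+     std::vector<std::variant<DataBackref, MetadataBackref>> entries;
+   };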
+
+Crash recovery
+--------------
+
+On any crash, we simply "replay" the currently open segment in memory.
+For any b-tree delta encountered, we load the original block, modify
+it in memory, and mark it dirty. Once we continue writing, the normal "write
+dirty blocks as we near the end of the segment" behavior will pick up where
+we left off.
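+
+Roughly, replay of the open segment might look like the following
+sketch; ``read_next_record`` and the cache call are assumed helpers,
+not actual SeaStore interfaces:
+
+.. code-block:: cpp
+
+   #include <cstdint>
+   #include <optional>
+   #include <vector>
+
+   struct TransactionRecord {
+     std::vector<uint8_t> deltas;  // journaled b-tree deltas (opaque here)
+   };
+
+   struct BlockCache {
+     // Load the targeted b-tree block, apply the delta in memory, and
+     // mark the block dirty again.
+     void apply_delta_and_mark_dirty(const TransactionRecord& rec);
+   };
+
+   // Read the next valid record from the open segment, or nullopt at end.
+   std::optional<TransactionRecord> read_next_record(uint64_t segment_id,
+                                                     uint64_t& offset);
+
+   // Replay the single open segment after a crash; the normal "flush dirty
+   // blocks as the segment fills" logic then picks up where we left off.
+   void replay_open_segment(uint64_t segment_id, BlockCache& cache)
+   {
+     uint64_t offset = 0;
+     while (auto rec = read_next_record(segment_id, offset)) {
+       cache.apply_delta_and_mark_dirty(*rec);
+     }
+   }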
+
+
+
+ObjectStore considerations
+==========================
+
+Splits, merges, and sharding
+----------------------------
+
+One of the current ObjectStore requirements is to be able to split a
+collection (PG) in O(1) time. Starting in mimic, we also need to be
+able to merge two collections into one (i.e., exactly the reverse of a
+split).
+
+However, the PGs that we split into would hash to different shards of
+the OSD in the current sharding scheme. One can imagine replacing
+that sharding scheme with a temporary mapping that directs the smaller
+child PG to the right shard, since we generally then migrate that PG to
+another OSD anyway. However, this wouldn't help us in the merge case,
+where the constituent pieces may start out on different shards and
+ultimately need to be handled in the same collection (and be operated
+on via single transactions).
+
+This suggests that we likely need a way for data written via one shard
+to "switch ownership" and later be read and managed by a different
+shard.
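+
+One hypothetical shape for such a temporary mapping, layered in front
+of the normal hash-based shard assignment (illustrative only):
+
+.. code-block:: cpp
+
+   #include <cstdint>
+   #include <unordered_map>
+
+   // Hypothetical override table over the usual hash-based sharding.
+   struct ShardMapping {
+     unsigned num_shards;
+     // Temporary overrides, e.g. keeping a freshly split child PG on its
+     // parent's shard until it migrates, or co-locating PGs about to merge.
+     std::unordered_map<uint64_t, unsigned> overrides;  // pgid -> shard
+
+     unsigned shard_for(uint64_t pgid) const {
+       auto it = overrides.find(pgid);
+       if (it != overrides.end())
+         return it->second;
+       return pgid % num_shards;  // default placement
+     }
+   };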
+
+
+