summaryrefslogtreecommitdiffstats
path: root/src/spdk/doc/compression.md
diff options
context:
space:
mode:
Diffstat (limited to 'src/spdk/doc/compression.md')
-rw-r--r--src/spdk/doc/compression.md286
1 files changed, 286 insertions, 0 deletions
diff --git a/src/spdk/doc/compression.md b/src/spdk/doc/compression.md
new file mode 100644
index 000000000..32c076300
--- /dev/null
+++ b/src/spdk/doc/compression.md
@@ -0,0 +1,286 @@
+# SPDK "Reduce" Block Compression Algorithm {#reduce}
+
+## Overview
+
+The SPDK "reduce" block compression scheme is based on using SSDs for storing compressed blocks of
+storage and persistent memory for metadata. This metadata includes mappings of logical blocks
+requested by a user to the compressed blocks on SSD. The scheme described in this document
+is generic and not tied to any specific block device framework such as the SPDK block device (bdev)
+framework. This algorithm will be implemented in a library called "libreduce". Higher-level
+software modules can built on top of this library to create and present block devices in a
+specific block device framework. For SPDK, a bdev_reduce module will serve as a wrapper around
+the libreduce library, to present the compressed block devices as an SPDK bdev.
+
+This scheme only describes how compressed blocks are stored on an SSD and the metadata for tracking
+those compressed blocks. It relies on the higher-software module to perform the compression
+algorithm itself. For SPDK, the bdev_reduce module will utilize the DPDK compressdev framework
+to perform compression and decompression on behalf of the libreduce library.
+
+(Note that in some cases, blocks of storage may not be compressible, or cannot be compressed enough
+to realize savings from the compression. In these cases, the data may be stored uncompressed on
+disk. The phrase "compressed blocks of storage" includes these uncompressed blocks.)
+
+A compressed block device is a logical entity built on top of a similarly-sized backing storage
+device. The backing storage device must be thin-provisioned to realize any savings from
+compression for reasons described later in this document. This algorithm has no direct knowledge
+of the implementation of the backing storage device, except that it will always use the
+lowest-numbered blocks available on the backing storage device. This will ensure that when this
+algorithm is used on a thin-provisioned backing storage device, blocks will not be allocated until
+they are actually needed.
+
+The backing storage device must be sized for the worst case scenario, where no data can be
+compressed. In this case, the size of the backing storage device would be the same as the
+compressed block device. Since this algorithm ensures atomicity by never overwriting data
+in place, some additional backing storage is required to temporarily store data for writes in
+progress before the associated metadata is updated.
+
+Storage from the backing storage device will be allocated, read, and written to in 4KB units for
+best NVMe performance. These 4KB units are called "backing IO units". They are indexed from 0 to N-1
+with the indices called "backing IO unit indices". At start, the full set of indices represent the
+"free backing IO unit list".
+
+A compressed block device compresses and decompresses data in units of chunks, where a chunk is a
+multiple of at least two 4KB backing IO units. The number of backing IO units per chunk determines
+the chunk size and is specified when the compressed block device is created. A chunk
+consumes a number of 4KB backing IO units between 1 and the number of 4KB units in the chunk. For
+example, a 16KB chunk consumes 1, 2, 3 or 4 backing IO units. The number of backing IO units depends on how
+much the chunk was able to be compressed. The blocks on disk associated with a chunk are stored in a
+"chunk map" in persistent memory. Each chunk map consists of N 64-bit values, where N is the maximum
+number of backing IO units in the chunk. Each 64-bit value corresponds to a backing IO unit index. A
+special value (for example, 2^64-1) is used for backing IO units not needed due to compression. The
+number of chunk maps allocated is equal to the size of the compressed block device divided by its chunk
+size, plus some number of extra chunk maps. These extra chunk maps are used to ensure atomicity on
+writes and will be explained later in this document. At start, all of the chunk maps represent the
+"free chunk map list".
+
+Finally, the logical view of the compressed block device is represented by the "logical map". The
+logical map is a mapping of chunk offsets into the compressed block device to the corresponding
+chunk map. Each entry in the logical map is a 64-bit value, denoting the associated chunk map.
+A special value (UINT64_MAX) is used if there is no associated chunk map. The mapping is
+determined by dividing the byte offset by the chunk size to get an index, which is used as an
+array index into the array of chunk map entries. At start, all entries in the logical map have no
+associated chunk map. Note that while access to the backing storage device is in 4KB units, the
+logical view may allow 4KB or 512B unit access and should perform similarly.
+
+## Example
+
+To illustrate this algorithm, we will use a real example at a very small scale.
+
+The size of the compressed block device is 64KB, with a chunk size of 16KB. This will
+realize the following:
+
+* "Backing storage" will consist of an 80KB thin-provisioned logical volume. This
+ corresponds to the 64KB size of the compressed block device, plus an extra 16KB to handle
+ additional write operations under a worst-case compression scenario.
+* "Free backing IO unit list" will consist of indices 0 through 19 (inclusive). These represent
+ the 20 4KB IO units in the backing storage.
+* A "chunk map" will be 32 bytes in size. This corresponds to 4 backing IO units per chunk
+ (16KB / 4KB), and 8B (64b) per backing IO unit index.
+* 5 chunk maps will be allocated in 160B of persistent memory. This corresponds to 4 chunk maps
+ for the 4 chunks in the compressed block device (64KB / 16KB), plus an extra chunk map for use
+ when overwriting an existing chunk.
+* "Free chunk map list" will consist of indices 0 through 4 (inclusive). These represent the
+ 5 allocated chunk maps.
+* The "logical map" will be allocated in 32B of persistent memory. This corresponds to
+ 4 entries for the 4 chunks in the compressed block device and 8B (64b) per entry.
+
+In these examples, the value "X" will represent the special value (2^64-1) described above.
+
+### Initial Creation
+
+```
+ +--------------------+
+ Backing Device | |
+ +--------------------+
+
+ Free Backing IO Unit List 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19
+
+ +------------+------------+------------+------------+------------+
+ Chunk Maps | | | | | |
+ +------------+------------+------------+------------+------------+
+
+ Free Chunk Map List 0, 1, 2, 3, 4
+
+ +---+---+---+---+
+ Logical Map | X | X | X | X |
+ +---+---+---+---+
+```
+
+### Write 16KB at Offset 32KB
+
+* Find the corresponding index into the logical map. Offset 32KB divided by the chunk size
+ (16KB) is 2.
+* Entry 2 in the logical map is "X". This means no part of this 16KB has been written to yet.
+* Allocate a 16KB buffer in memory
+* Compress the incoming 16KB of data into this allocated buffer
+* Assume this data compresses to 6KB. This requires 2 4KB backing IO units.
+* Allocate 2 blocks (0 and 1) from the free backing IO unit list. Always use the lowest numbered
+ entries in the free backing IO unit list - this ensures that unnecessary backing storage
+ is not allocated in the thin-provisioned logical volume holding the backing storage.
+* Write the 6KB of data to backing IO units 0 and 1.
+* Allocate a chunk map (0) from the free chunk map list.
+* Write (0, 1, X, X) to the chunk map. This represents that only 2 backing IO units were used to
+ store the 16KB of data.
+* Write the chunk map index to entry 2 in the logical map.
+
+```
+ +--------------------+
+ Backing Device |01 |
+ +--------------------+
+
+ Free Backing IO Unit List 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19
+
+ +------------+------------+------------+------------+------------+
+ Chunk Maps | 0 1 X X | | | | |
+ +------------+------------+------------+------------+------------+
+
+ Free Chunk Map List 1, 2, 3, 4
+
+ +---+---+---+---+
+ Logical Map | X | X | 0 | X |
+ +---+---+---+---+
+```
+
+### Write 4KB at Offset 8KB
+
+* Find the corresponding index into the logical map. Offset 8KB divided by the chunk size is 0.
+* Entry 0 in the logical map is "X". This means no part of this 16KB has been written to yet.
+* The write is not for the entire 16KB chunk, so we must allocate a 16KB chunk-sized buffer for
+ source data.
+* Copy the incoming 4KB data to offset 8KB of this 16KB buffer. Zero the rest of the 16KB buffer.
+* Allocate a 16KB destination buffer.
+* Compress the 16KB source data buffer into the 16KB destination buffer
+* Assume this data compresses to 3KB. This requires 1 4KB backing IO unit.
+* Allocate 1 block (2) from the free backing IO unit list.
+* Write the 3KB of data to block 2.
+* Allocate a chunk map (1) from the free chunk map list.
+* Write (2, X, X, X) to the chunk map.
+* Write the chunk map index to entry 0 in the logical map.
+
+```
+ +--------------------+
+ Backing Device |012 |
+ +--------------------+
+
+ Free Backing IO Unit List 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19
+
+ +------------+------------+------------+------------+------------+
+ Chunk Maps | 0 1 X X | 2 X X X | | | |
+ +------------+------------+------------+------------+------------+
+
+ Free Chunk Map List 2, 3, 4
+
+ +---+---+---+---+
+ Logical Map | 1 | X | 0 | X |
+ +---+---+---+---+
+```
+
+### Read 16KB at Offset 16KB
+
+* Offset 16KB maps to index 1 in the logical map.
+* Entry 1 in the logical map is "X". This means no part of this 16KB has been written to yet.
+* Since no data has been written to this chunk, return all 0's to satisfy the read I/O.
+
+### Write 4KB at Offset 4KB
+
+* Offset 4KB maps to index 0 in the logical map.
+* Entry 0 in the logical map is "1". Since we are not overwriting the entire chunk, we must
+ do a read-modify-write.
+* Chunk map 1 only specifies one backing IO unit (2). Allocate a 16KB buffer and read block
+ 2 into it. This will be called the compressed data buffer. Note that 16KB is allocated
+ instead of 4KB so that we can reuse this buffer to hold the compressed data that will
+ be written later back to disk.
+* Allocate a 16KB buffer for the uncompressed data for this chunk. Decompress the data from
+ the compressed data buffer into this buffer.
+* Copy the incoming 4KB of data to offset 4KB of the uncompressed data buffer.
+* Compress the 16KB uncompressed data buffer into the compressed data buffer.
+* Assume this data compresses to 5KB. This requires 2 4KB backing IO units.
+* Allocate blocks 3 and 4 from the free backing IO unit list.
+* Write the 5KB of data to blocks 3 and 4.
+* Allocate chunk map 2 from the free chunk map list.
+* Write (3, 4, X, X) to chunk map 2. Note that at this point, the chunk map is not referenced
+ by the logical map. If there was a power fail at this point, the previous data for this chunk
+ would still be fully valid.
+* Write chunk map 2 to entry 0 in the logical map.
+* Free chunk map 1 back to the free chunk map list.
+* Free backing IO unit 2 back to the free backing IO unit list.
+
+```
+ +--------------------+
+ Backing Device |01 34 |
+ +--------------------+
+
+ Free Backing IO Unit List 2, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19
+
+ +------------+------------+------------+------------+------------+
+ Chunk Maps | 0 1 X X | | 3 4 X X | | |
+ +------------+------------+------------+------------+------------+
+
+ Free Chunk Map List 1, 3, 4
+
+ +---+---+---+---+
+ Logical Map | 2 | X | 0 | X |
+ +---+---+---+---+
+```
+
+### Operations that span across multiple chunks
+
+Operations that span a chunk boundary are logically split into multiple operations, each of
+which is associated with a single chunk.
+
+Example: 20KB write at offset 4KB
+
+In this case, the write operation is split into a 12KB write at offset 4KB (affecting only
+chunk 0 in the logical map) and a 8KB write at offset 16KB (affecting only chunk 1 in the
+logical map). Each write is processed independently using the algorithm described above.
+Completion of the 20KB write does not occur until both operations have completed.
+
+### Unmap Operations
+
+Unmap operations on an entire chunk are achieved by removing the chunk map entry (if any) from
+the logical map. The chunk map is returned to the free chunk map list, and any backing IO units
+associated with the chunk map are returned to the free backing IO unit list.
+
+Unmap operations that affect only part of a chunk can be treated as writing zeroes to that
+region of the chunk. If the entire chunk is unmapped via several operations, it can be
+detected via the uncompressed data equaling all zeroes. When this occurs, the chunk map entry
+may be removed from the logical map.
+
+After an entire chunk has been unmapped, subsequent reads to the chunk will return all zeroes.
+This is similar to the "Read 16KB at offset 16KB" example above.
+
+### Write Zeroes Operations
+
+Write zeroes operations are handled similarly to unmap operations. If a write zeroes
+operation covers an entire chunk, we can remove the chunk's entry in the logical map
+completely. Then subsequent reads to that chunk will return all zeroes.
+
+### Restart
+
+An application using libreduce will periodically exit and need to be restarted. When the
+application restarts, it will reload compressed volumes so they can be used again from the
+same state as when the application exited.
+
+When the compressed volume is reloaded, the free chunk map list and free backing IO unit list
+are reconstructed by walking the logical map. The logical map will only point to valid
+chunk maps, and the valid chunk maps will only point to valid backing IO units. Any chunk maps
+and backing IO units not referenced go into their respective free lists.
+
+This ensures that if a system crashes in the middle of a write operation - i.e. during or
+after a chunk map is updated, but before it is written to the logical map - that everything
+related to that in-progress write will be ignored after the compressed volume is restarted.
+
+### Overlapping operations on same chunk
+
+Implementations must take care to handle overlapping operations on the same chunk. For example,
+operation 1 writes some data to chunk A, and while this is in progress, operation 2 also writes
+some data to chunk A. In this case, operation 2 should not start until operation 1 has
+completed. Further optimizations are outside the scope of this document.
+
+### Thin provisioned backing storage
+
+Backing storage must be thin provisioned to realize any savings from compression. This algorithm
+will always use (and reuse) backing IO units available closest to offset 0 on the backing device.
+This ensures that even though backing storage device may have been sized similarly to the size of
+the compressed volume, storage for the backing storage device will not actually be allocated
+until the backing IO units are actually needed.