diff options
Diffstat (limited to 'src/spdk/doc/compression.md')
-rw-r--r-- | src/spdk/doc/compression.md | 286 |
1 files changed, 286 insertions, 0 deletions
diff --git a/src/spdk/doc/compression.md b/src/spdk/doc/compression.md new file mode 100644 index 000000000..32c076300 --- /dev/null +++ b/src/spdk/doc/compression.md @@ -0,0 +1,286 @@ +# SPDK "Reduce" Block Compression Algorithm {#reduce} + +## Overview + +The SPDK "reduce" block compression scheme is based on using SSDs for storing compressed blocks of +storage and persistent memory for metadata. This metadata includes mappings of logical blocks +requested by a user to the compressed blocks on SSD. The scheme described in this document +is generic and not tied to any specific block device framework such as the SPDK block device (bdev) +framework. This algorithm will be implemented in a library called "libreduce". Higher-level +software modules can built on top of this library to create and present block devices in a +specific block device framework. For SPDK, a bdev_reduce module will serve as a wrapper around +the libreduce library, to present the compressed block devices as an SPDK bdev. + +This scheme only describes how compressed blocks are stored on an SSD and the metadata for tracking +those compressed blocks. It relies on the higher-software module to perform the compression +algorithm itself. For SPDK, the bdev_reduce module will utilize the DPDK compressdev framework +to perform compression and decompression on behalf of the libreduce library. + +(Note that in some cases, blocks of storage may not be compressible, or cannot be compressed enough +to realize savings from the compression. In these cases, the data may be stored uncompressed on +disk. The phrase "compressed blocks of storage" includes these uncompressed blocks.) + +A compressed block device is a logical entity built on top of a similarly-sized backing storage +device. The backing storage device must be thin-provisioned to realize any savings from +compression for reasons described later in this document. This algorithm has no direct knowledge +of the implementation of the backing storage device, except that it will always use the +lowest-numbered blocks available on the backing storage device. This will ensure that when this +algorithm is used on a thin-provisioned backing storage device, blocks will not be allocated until +they are actually needed. + +The backing storage device must be sized for the worst case scenario, where no data can be +compressed. In this case, the size of the backing storage device would be the same as the +compressed block device. Since this algorithm ensures atomicity by never overwriting data +in place, some additional backing storage is required to temporarily store data for writes in +progress before the associated metadata is updated. + +Storage from the backing storage device will be allocated, read, and written to in 4KB units for +best NVMe performance. These 4KB units are called "backing IO units". They are indexed from 0 to N-1 +with the indices called "backing IO unit indices". At start, the full set of indices represent the +"free backing IO unit list". + +A compressed block device compresses and decompresses data in units of chunks, where a chunk is a +multiple of at least two 4KB backing IO units. The number of backing IO units per chunk determines +the chunk size and is specified when the compressed block device is created. A chunk +consumes a number of 4KB backing IO units between 1 and the number of 4KB units in the chunk. For +example, a 16KB chunk consumes 1, 2, 3 or 4 backing IO units. The number of backing IO units depends on how +much the chunk was able to be compressed. The blocks on disk associated with a chunk are stored in a +"chunk map" in persistent memory. Each chunk map consists of N 64-bit values, where N is the maximum +number of backing IO units in the chunk. Each 64-bit value corresponds to a backing IO unit index. A +special value (for example, 2^64-1) is used for backing IO units not needed due to compression. The +number of chunk maps allocated is equal to the size of the compressed block device divided by its chunk +size, plus some number of extra chunk maps. These extra chunk maps are used to ensure atomicity on +writes and will be explained later in this document. At start, all of the chunk maps represent the +"free chunk map list". + +Finally, the logical view of the compressed block device is represented by the "logical map". The +logical map is a mapping of chunk offsets into the compressed block device to the corresponding +chunk map. Each entry in the logical map is a 64-bit value, denoting the associated chunk map. +A special value (UINT64_MAX) is used if there is no associated chunk map. The mapping is +determined by dividing the byte offset by the chunk size to get an index, which is used as an +array index into the array of chunk map entries. At start, all entries in the logical map have no +associated chunk map. Note that while access to the backing storage device is in 4KB units, the +logical view may allow 4KB or 512B unit access and should perform similarly. + +## Example + +To illustrate this algorithm, we will use a real example at a very small scale. + +The size of the compressed block device is 64KB, with a chunk size of 16KB. This will +realize the following: + +* "Backing storage" will consist of an 80KB thin-provisioned logical volume. This + corresponds to the 64KB size of the compressed block device, plus an extra 16KB to handle + additional write operations under a worst-case compression scenario. +* "Free backing IO unit list" will consist of indices 0 through 19 (inclusive). These represent + the 20 4KB IO units in the backing storage. +* A "chunk map" will be 32 bytes in size. This corresponds to 4 backing IO units per chunk + (16KB / 4KB), and 8B (64b) per backing IO unit index. +* 5 chunk maps will be allocated in 160B of persistent memory. This corresponds to 4 chunk maps + for the 4 chunks in the compressed block device (64KB / 16KB), plus an extra chunk map for use + when overwriting an existing chunk. +* "Free chunk map list" will consist of indices 0 through 4 (inclusive). These represent the + 5 allocated chunk maps. +* The "logical map" will be allocated in 32B of persistent memory. This corresponds to + 4 entries for the 4 chunks in the compressed block device and 8B (64b) per entry. + +In these examples, the value "X" will represent the special value (2^64-1) described above. + +### Initial Creation + +``` + +--------------------+ + Backing Device | | + +--------------------+ + + Free Backing IO Unit List 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19 + + +------------+------------+------------+------------+------------+ + Chunk Maps | | | | | | + +------------+------------+------------+------------+------------+ + + Free Chunk Map List 0, 1, 2, 3, 4 + + +---+---+---+---+ + Logical Map | X | X | X | X | + +---+---+---+---+ +``` + +### Write 16KB at Offset 32KB + +* Find the corresponding index into the logical map. Offset 32KB divided by the chunk size + (16KB) is 2. +* Entry 2 in the logical map is "X". This means no part of this 16KB has been written to yet. +* Allocate a 16KB buffer in memory +* Compress the incoming 16KB of data into this allocated buffer +* Assume this data compresses to 6KB. This requires 2 4KB backing IO units. +* Allocate 2 blocks (0 and 1) from the free backing IO unit list. Always use the lowest numbered + entries in the free backing IO unit list - this ensures that unnecessary backing storage + is not allocated in the thin-provisioned logical volume holding the backing storage. +* Write the 6KB of data to backing IO units 0 and 1. +* Allocate a chunk map (0) from the free chunk map list. +* Write (0, 1, X, X) to the chunk map. This represents that only 2 backing IO units were used to + store the 16KB of data. +* Write the chunk map index to entry 2 in the logical map. + +``` + +--------------------+ + Backing Device |01 | + +--------------------+ + + Free Backing IO Unit List 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19 + + +------------+------------+------------+------------+------------+ + Chunk Maps | 0 1 X X | | | | | + +------------+------------+------------+------------+------------+ + + Free Chunk Map List 1, 2, 3, 4 + + +---+---+---+---+ + Logical Map | X | X | 0 | X | + +---+---+---+---+ +``` + +### Write 4KB at Offset 8KB + +* Find the corresponding index into the logical map. Offset 8KB divided by the chunk size is 0. +* Entry 0 in the logical map is "X". This means no part of this 16KB has been written to yet. +* The write is not for the entire 16KB chunk, so we must allocate a 16KB chunk-sized buffer for + source data. +* Copy the incoming 4KB data to offset 8KB of this 16KB buffer. Zero the rest of the 16KB buffer. +* Allocate a 16KB destination buffer. +* Compress the 16KB source data buffer into the 16KB destination buffer +* Assume this data compresses to 3KB. This requires 1 4KB backing IO unit. +* Allocate 1 block (2) from the free backing IO unit list. +* Write the 3KB of data to block 2. +* Allocate a chunk map (1) from the free chunk map list. +* Write (2, X, X, X) to the chunk map. +* Write the chunk map index to entry 0 in the logical map. + +``` + +--------------------+ + Backing Device |012 | + +--------------------+ + + Free Backing IO Unit List 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19 + + +------------+------------+------------+------------+------------+ + Chunk Maps | 0 1 X X | 2 X X X | | | | + +------------+------------+------------+------------+------------+ + + Free Chunk Map List 2, 3, 4 + + +---+---+---+---+ + Logical Map | 1 | X | 0 | X | + +---+---+---+---+ +``` + +### Read 16KB at Offset 16KB + +* Offset 16KB maps to index 1 in the logical map. +* Entry 1 in the logical map is "X". This means no part of this 16KB has been written to yet. +* Since no data has been written to this chunk, return all 0's to satisfy the read I/O. + +### Write 4KB at Offset 4KB + +* Offset 4KB maps to index 0 in the logical map. +* Entry 0 in the logical map is "1". Since we are not overwriting the entire chunk, we must + do a read-modify-write. +* Chunk map 1 only specifies one backing IO unit (2). Allocate a 16KB buffer and read block + 2 into it. This will be called the compressed data buffer. Note that 16KB is allocated + instead of 4KB so that we can reuse this buffer to hold the compressed data that will + be written later back to disk. +* Allocate a 16KB buffer for the uncompressed data for this chunk. Decompress the data from + the compressed data buffer into this buffer. +* Copy the incoming 4KB of data to offset 4KB of the uncompressed data buffer. +* Compress the 16KB uncompressed data buffer into the compressed data buffer. +* Assume this data compresses to 5KB. This requires 2 4KB backing IO units. +* Allocate blocks 3 and 4 from the free backing IO unit list. +* Write the 5KB of data to blocks 3 and 4. +* Allocate chunk map 2 from the free chunk map list. +* Write (3, 4, X, X) to chunk map 2. Note that at this point, the chunk map is not referenced + by the logical map. If there was a power fail at this point, the previous data for this chunk + would still be fully valid. +* Write chunk map 2 to entry 0 in the logical map. +* Free chunk map 1 back to the free chunk map list. +* Free backing IO unit 2 back to the free backing IO unit list. + +``` + +--------------------+ + Backing Device |01 34 | + +--------------------+ + + Free Backing IO Unit List 2, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19 + + +------------+------------+------------+------------+------------+ + Chunk Maps | 0 1 X X | | 3 4 X X | | | + +------------+------------+------------+------------+------------+ + + Free Chunk Map List 1, 3, 4 + + +---+---+---+---+ + Logical Map | 2 | X | 0 | X | + +---+---+---+---+ +``` + +### Operations that span across multiple chunks + +Operations that span a chunk boundary are logically split into multiple operations, each of +which is associated with a single chunk. + +Example: 20KB write at offset 4KB + +In this case, the write operation is split into a 12KB write at offset 4KB (affecting only +chunk 0 in the logical map) and a 8KB write at offset 16KB (affecting only chunk 1 in the +logical map). Each write is processed independently using the algorithm described above. +Completion of the 20KB write does not occur until both operations have completed. + +### Unmap Operations + +Unmap operations on an entire chunk are achieved by removing the chunk map entry (if any) from +the logical map. The chunk map is returned to the free chunk map list, and any backing IO units +associated with the chunk map are returned to the free backing IO unit list. + +Unmap operations that affect only part of a chunk can be treated as writing zeroes to that +region of the chunk. If the entire chunk is unmapped via several operations, it can be +detected via the uncompressed data equaling all zeroes. When this occurs, the chunk map entry +may be removed from the logical map. + +After an entire chunk has been unmapped, subsequent reads to the chunk will return all zeroes. +This is similar to the "Read 16KB at offset 16KB" example above. + +### Write Zeroes Operations + +Write zeroes operations are handled similarly to unmap operations. If a write zeroes +operation covers an entire chunk, we can remove the chunk's entry in the logical map +completely. Then subsequent reads to that chunk will return all zeroes. + +### Restart + +An application using libreduce will periodically exit and need to be restarted. When the +application restarts, it will reload compressed volumes so they can be used again from the +same state as when the application exited. + +When the compressed volume is reloaded, the free chunk map list and free backing IO unit list +are reconstructed by walking the logical map. The logical map will only point to valid +chunk maps, and the valid chunk maps will only point to valid backing IO units. Any chunk maps +and backing IO units not referenced go into their respective free lists. + +This ensures that if a system crashes in the middle of a write operation - i.e. during or +after a chunk map is updated, but before it is written to the logical map - that everything +related to that in-progress write will be ignored after the compressed volume is restarted. + +### Overlapping operations on same chunk + +Implementations must take care to handle overlapping operations on the same chunk. For example, +operation 1 writes some data to chunk A, and while this is in progress, operation 2 also writes +some data to chunk A. In this case, operation 2 should not start until operation 1 has +completed. Further optimizations are outside the scope of this document. + +### Thin provisioned backing storage + +Backing storage must be thin provisioned to realize any savings from compression. This algorithm +will always use (and reuse) backing IO units available closest to offset 0 on the backing device. +This ensures that even though backing storage device may have been sized similarly to the size of +the compressed volume, storage for the backing storage device will not actually be allocated +until the backing IO units are actually needed. |