diff options
author | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-04-27 18:24:20 +0000 |
---|---|---|
committer | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-04-27 18:24:20 +0000 |
commit | 483eb2f56657e8e7f419ab1a4fab8dce9ade8609 (patch) | |
tree | e5d88d25d870d5dedacb6bbdbe2a966086a0a5cf /src/spdk/doc/ssd_internals.md | |
parent | Initial commit. (diff) | |
download | ceph-483eb2f56657e8e7f419ab1a4fab8dce9ade8609.tar.xz ceph-483eb2f56657e8e7f419ab1a4fab8dce9ade8609.zip |
Adding upstream version 14.2.21.upstream/14.2.21upstream
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'src/spdk/doc/ssd_internals.md')
-rw-r--r-- | src/spdk/doc/ssd_internals.md | 96 |
1 files changed, 96 insertions, 0 deletions
diff --git a/src/spdk/doc/ssd_internals.md b/src/spdk/doc/ssd_internals.md new file mode 100644 index 00000000..532290fc --- /dev/null +++ b/src/spdk/doc/ssd_internals.md @@ -0,0 +1,96 @@ +# SSD Internals {#ssd_internals} + +Solid State Devices (SSD) are complex devices and their performance depends on +how they're used. The following description is intended to help software +developers understand what is occurring inside the SSD, so that they can come +up with better software designs. It should not be thought of as a strictly +accurate guide to how SSD hardware really works. + + As of this writing, SSDs are generally implemented on top of + [NAND Flash](https://en.wikipedia.org/wiki/Flash_memory) memory. At a + very high level, this media has a few important properties: + +* The media is grouped onto chips called NAND dies and each die can + operate in parallel. +* Flipping a bit is a highly asymmetric process. Flipping it one way is + easy, but flipping it back is quite hard. + +NAND Flash media is grouped into large units often referred to as **erase +blocks**. The size of an erase block is highly implementation specific, but +can be thought of as somewhere between 1MiB and 8MiB. For each erase block, +each bit may be written to (i.e. have its bit flipped from 0 to 1) with +bit-granularity once. In order to write to the erase block a second time, the +entire block must be erased (i.e. all bits in the block are flipped back to +0). This is the asymmetry part from above. Erasing a block causes a measurable +amount of wear and each block may only be erased a limited number of times. + +SSDs expose an interface to the host system that makes it appear as if the +drive is composed of a set of fixed size **logical blocks** which are usually +512B or 4KiB in size. These blocks are entirely logical constructs of the +device firmware and they do not statically map to a location on the backing +media. Instead, upon each write to a logical block, a new location on the NAND +Flash is selected and written and the mapping of the logical block to its +physical location is updated. The algorithm for choosing this location is a +key part of overall SSD performance and is often called the **flash +translation layer** or FTL. This algorithm must correctly distribute the +blocks to account for wear (called **wear-leveling**) and spread them across +NAND dies to improve total available performance. The simplest model is to +group all of the physical media on each die together using an algorithm +similar to RAID and then write to that set sequentially. Real SSDs are far +more complicated, but this is an excellent simple model for software +developers - imagine they are simply logging to a RAID volume and updating an +in-memory hash-table. + +One consequence of the flash translation layer is that logical blocks do not +necessarily correspond to physical locations on the NAND at all times. In +fact, there is a command that clears the translation for a block. In NVMe, +this command is called deallocate, in SCSI it is called unmap, and in SATA it +is called trim. When a user attempts to read a block that doesn't have a +mapping to a physical location, drives will do one of two things: + +1. Immediately complete the read request successfully, without performing any + data transfer. This is acceptable because the data the drive would return + is no more valid than the data already in the user's data buffer. +2. Return all 0's as the data. + +Choice #1 is much more common and performing reads to a fully deallocated +device will often show performance far beyond what the drive claims to be +capable of precisely because it is not actually transferring any data. Write +to all blocks prior to reading them when benchmarking! + +As SSDs are written to, the internal log will eventually consume all of the +available erase blocks. In order to continue writing, the SSD must free some +of them. This process is often called **garbage collection**. All SSDs reserve +some number of erase blocks so that they can guarantee there are free erase +blocks available for garbage collection. Garbage collection generally proceeds +by: + +1. Selecting a target erase block (a good mental model is that it picks the least recently used erase block) +2. Walking through each entry in the erase block and determining if it is still a valid logical block. +3. Moving valid logical blocks by reading them and writing them to a different erase block (i.e. the current head of the log) +4. Erasing the entire erase block and marking it available for use. + +Garbage collection is clearly far more efficient when step #3 can be skipped +because the erase block is already empty. There are two ways to make it much +more likely that step #3 can be skipped. The first is that SSDs reserve +additional erase blocks beyond their reported capacity (called +**over-provisioning**), so that statistically its much more likely that an +erase block will not contain valid data. The second is software can write to +the blocks on the device in sequential order in a circular pattern, throwing +away old data when it is no longer needed. In this case, the software +guarantees that the least recently used erase blocks will not contain any +valid data that must be moved. + +The amount of over-provisioning a device has can dramatically impact the +performance on random read and write workloads, if the workload is filling up +the entire device. However, the same effect can typically be obtained by +simply reserving a given amount of space on the device in software. This +understanding is critical to producing consistent benchmarks. In particular, +if background garbage collection cannot keep up and the drive must switch to +on-demand garbage collection, the latency of writes will increase +dramatically. Therefore the internal state of the device must be forced into +some known state prior to running benchmarks for consistency. This is usually +accomplished by writing to the device sequentially two times, from start to +finish. For a highly detailed description of exactly how to force an SSD into +a known state for benchmarking see this +[SNIA Article](http://www.snia.org/sites/default/files/SSS_PTS_Enterprise_v1.1.pdf). |