From e6918187568dbd01842d8d1d2c808ce16a894239 Mon Sep 17 00:00:00 2001
From: Daniel Baumann
Date: Sun, 21 Apr 2024 13:54:28 +0200
Subject: Adding upstream version 18.2.2.

Signed-off-by: Daniel Baumann
---
 src/spdk/doc/nvme_spec.md | 123 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 123 insertions(+)
 create mode 100644 src/spdk/doc/nvme_spec.md

diff --git a/src/spdk/doc/nvme_spec.md b/src/spdk/doc/nvme_spec.md
new file mode 100644
index 000000000..123e631e4
--- /dev/null
+++ b/src/spdk/doc/nvme_spec.md
@@ -0,0 +1,123 @@

# Submitting I/O to an NVMe Device {#nvme_spec}

## The NVMe Specification

The NVMe specification describes a hardware interface for interacting with
storage devices. The specification includes network transport definitions for
remote storage as well as a hardware register layout for local PCIe devices.
What follows here is an overview of how an I/O is submitted to a local PCIe
device through SPDK.

NVMe devices allow host software (in our case, the SPDK NVMe driver) to allocate
queue pairs in host memory. The term "host" is used a lot, so to clarify, that's
the system that the NVMe SSD is plugged into. A queue pair consists of two
queues - a submission queue and a completion queue. These queues are more
accurately described as circular rings of fixed-size entries. The submission
queue is an array of 64 byte command structures, plus 2 integers (head and tail
indices). The completion queue is similarly an array of 16 byte completion
structures, plus 2 integers (head and tail indices). There are also two 32-bit
registers involved that are called doorbells.

An I/O is submitted to an NVMe device by constructing a 64 byte command, placing
it into the submission queue at the current location of the submission queue
tail index, and then writing the new index of the submission queue tail to the
submission queue tail doorbell register. It's actually valid to copy a whole set
of commands into open slots in the ring and then write the doorbell just one
time to submit the whole batch.

There is a very detailed description of the command submission and completion
process in the NVMe specification, which is conveniently available from the main
page over at [NVM Express](https://nvmexpress.org).

Most importantly, the command itself describes the operation and also, if
necessary, a location in host memory containing a descriptor for the host memory
associated with the command. This host memory is the data to be written on a
write command, or the location to place the data on a read command. Data is
transferred to or from this location using a DMA engine on the NVMe device.

The completion queue works similarly, but the device is instead the one writing
entries into the ring. Each entry contains a "phase" bit that toggles between 0
and 1 on each loop through the entire ring. When a queue pair is set up to
generate interrupts, the interrupt contains the index of the completion queue
head. However, SPDK doesn't enable interrupts and instead polls on the phase
bit to detect completions. Interrupts are very heavy operations, so polling this
phase bit is often far more efficient.

## The SPDK NVMe Driver I/O Path

Now that we know how the ring structures work, let's cover how the SPDK NVMe
driver uses them. The user is going to construct a queue pair at some early time
in the life cycle of the program, so that's not part of the "hot" path. Then,
they'll call functions like `spdk_nvme_ns_cmd_read()` to perform an I/O
operation. The user supplies a data buffer, the target LBA, and the length, as
well as other information like which NVMe namespace the command is targeted at
and which NVMe queue pair to use. Finally, the user provides a callback function
and context pointer that will be called when a completion for the resulting
command is discovered during a later call to
`spdk_nvme_qpair_process_completions()`.
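To make that user-facing path concrete, here is a minimal, hedged sketch of
issuing one read through the public API and polling for its completion. It
assumes a controller and namespace have already been attached (for example via
`spdk_nvme_probe()`); the buffer size, LBA, and the `read_lba0()` function name
are illustrative choices, not anything prescribed by the driver.

```c
#include <stdbool.h>
#include <stdio.h>

#include "spdk/nvme.h"
#include "spdk/env.h"

struct read_ctx {
	bool done;
};

/* Completion callback: invoked from spdk_nvme_qpair_process_completions(). */
static void
read_complete(void *cb_arg, const struct spdk_nvme_cpl *cpl)
{
	struct read_ctx *ctx = cb_arg;

	if (spdk_nvme_cpl_is_error(cpl)) {
		fprintf(stderr, "read failed\n");
	}
	ctx->done = true;
}

/* Issue a one-block read of LBA 0 and poll until it completes.
 * 'ctrlr' and 'ns' are assumed to come from a prior probe/attach step. */
static int
read_lba0(struct spdk_nvme_ctrlr *ctrlr, struct spdk_nvme_ns *ns)
{
	struct spdk_nvme_qpair *qpair;
	struct read_ctx ctx = { .done = false };
	void *buf;
	int rc;

	/* Queue pairs are created once, outside the hot path. */
	qpair = spdk_nvme_ctrlr_alloc_io_qpair(ctrlr, NULL, 0);
	if (qpair == NULL) {
		return -1;
	}

	/* The data buffer must be DMA-able, so allocate it from SPDK's
	 * pinned, physically addressable memory. */
	buf = spdk_dma_zmalloc(4096, 4096, NULL);
	if (buf == NULL) {
		spdk_nvme_ctrlr_free_io_qpair(qpair);
		return -1;
	}

	/* Build and submit the read: buffer, starting LBA, LBA count,
	 * completion callback, callback context, I/O flags. */
	rc = spdk_nvme_ns_cmd_read(ns, qpair, buf, 0 /* LBA */, 1 /* count */,
				   read_complete, &ctx, 0);
	if (rc == 0) {
		/* Poll the completion queue; 0 means "drain whatever is there". */
		while (!ctx.done) {
			spdk_nvme_qpair_process_completions(qpair, 0);
		}
	}

	spdk_dma_free(buf);
	spdk_nvme_ctrlr_free_io_qpair(qpair);
	return rc;
}
```

A real application would typically drive `spdk_nvme_qpair_process_completions()`
from a poller rather than busy-waiting, but the sequence of calls mirrors the
description above.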
The first stage in the driver is allocating a request object to track the
operation. The operations are asynchronous, so the driver can't simply track the
state of the request on the call stack. Allocating a new request object on the
heap would be far too slow, so SPDK keeps a pre-allocated set of request objects
inside of the NVMe queue pair object - `struct spdk_nvme_qpair`. The number of
requests allocated to the queue pair is larger than the actual queue depth of
the NVMe submission queue because SPDK supports a couple of key convenience
features. The first is software queueing - SPDK will allow the user to submit
more requests than the hardware queue can actually hold, and SPDK will
automatically queue them in software. The second is splitting. SPDK will split a
request for many reasons, some of which are outlined below. The number of
request objects is configurable at queue pair creation time, and if not
specified, SPDK will pick a sensible number based on the hardware queue depth.

The second stage is building the 64 byte NVMe command itself. The command is
built into memory embedded into the request object - not directly into an NVMe
submission queue slot. Once the command has been constructed, SPDK attempts to
obtain an open slot in the NVMe submission queue. For each element in the
submission queue, an object called a tracker is allocated. The trackers are
allocated in an array, so they can be quickly looked up by index. The tracker
itself contains a pointer to the request currently occupying that slot. When a
particular tracker is obtained, the command's CID value is updated with the
index of the tracker. The NVMe device returns that CID value in the completion
entry, so the request can be recovered by looking up the tracker via the CID
value and then following the pointer.

Once a tracker (slot) is obtained, the data buffer associated with it is
processed to build a PRP list. That's essentially an NVMe scatter-gather list,
although it is a bit more restricted. The user provides SPDK with the virtual
address of the buffer, so SPDK has to do a page table lookup to find the
physical addresses (PA) or I/O virtual addresses (IOVA) backing that virtual
memory. A virtually contiguous memory region may not be physically contiguous,
so this may result in a PRP list with multiple elements. Sometimes this may
result in a set of physical addresses that can't actually be expressed as a
single PRP list, so SPDK will transparently split the user operation into two
separate requests. For more information on how memory is managed, see
@ref memory.

The reason the PRP list is not built until a tracker is obtained is that the
PRP list description must be allocated in DMA-able memory and can be quite
large. Since SPDK typically allocates a large number of requests, we didn't want
to allocate enough space to pre-build the worst-case PRP list, especially given
that the common case does not require a separate PRP list at all.

Each NVMe command has two PRP list elements embedded into it, so a separate PRP
list isn't required if the request is 4 KiB (or if it is 8 KiB and aligned
perfectly). Profiling shows that this section of the code is not a major
contributor to the overall CPU use.
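A rough sketch of the tracker and PRP bookkeeping described above follows. This
is not SPDK's internal code: `struct tracker`, `struct nvme_cmd`, the fixed
128-entry overflow list, and the `vtophys()` helper (a hypothetical stand-in for
an address translation such as SPDK's `spdk_vtophys()`) are simplified
assumptions for illustration, and the PRP-list page-crossing rules are ignored.

```c
#include <stdint.h>
#include <stddef.h>

#define PAGE_SIZE 4096ULL

/* Hypothetical virtual-to-physical translation; SPDK performs an
 * equivalent lookup against its registered, pinned memory regions. */
extern uint64_t vtophys(const void *vaddr);

/* One tracker per submission queue slot; the array index doubles as
 * the command identifier (CID) placed in the NVMe command. */
struct tracker {
	struct request *req;      /* request currently occupying this slot */
	uint64_t prp_list[128];   /* DMA-able page for an overflow PRP list */
};

struct nvme_cmd {
	uint16_t cid;             /* echoed back in the completion entry */
	uint64_t prp1;            /* first data page */
	uint64_t prp2;            /* second page, or pointer to a PRP list */
};

/* Fill in PRP1/PRP2 for a virtually contiguous buffer. Returns 0 on
 * success, -1 if the transfer needs more pages than this sketch handles. */
static int
build_prps(struct nvme_cmd *cmd, struct tracker *tr, void *buf, size_t len)
{
	uint64_t vaddr = (uint64_t)(uintptr_t)buf;
	uint64_t first_len = PAGE_SIZE - (vaddr & (PAGE_SIZE - 1));
	size_t i, npages;

	cmd->prp1 = vtophys(buf);
	if (len <= first_len) {
		cmd->prp2 = 0;            /* fits within the first page */
		return 0;
	}

	len -= first_len;
	vaddr += first_len;
	npages = (len + PAGE_SIZE - 1) / PAGE_SIZE;

	if (npages == 1) {
		/* Two pages total: PRP2 points directly at the second page. */
		cmd->prp2 = vtophys((void *)(uintptr_t)vaddr);
		return 0;
	}

	/* More than two pages: PRP2 points at a list of page addresses,
	 * built in DMA-able memory owned by the tracker. */
	if (npages > 128) {
		return -1;  /* SPDK would split the request instead */
	}
	for (i = 0; i < npages; i++) {
		tr->prp_list[i] = vtophys((void *)(uintptr_t)(vaddr + i * PAGE_SIZE));
	}
	cmd->prp2 = vtophys(tr->prp_list);
	return 0;
}
```

In the real driver the overflow list lives in memory allocated alongside the
tracker, which is why the list is not built until a submission queue slot (and
therefore a tracker) has been obtained.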
With a tracker filled out, SPDK copies the 64 byte command into the actual NVMe
submission queue slot and then rings the submission queue tail doorbell to tell
the device to go process it. SPDK then returns to the user, without waiting for
a completion.

The user can periodically call `spdk_nvme_qpair_process_completions()` to tell
SPDK to examine the completion queue. Specifically, it reads the phase bit of
the next expected completion slot and, when it flips, looks at the CID value to
find the tracker, which points at the request object. The request object
contains a function pointer that the user provided initially, which is then
called to complete the command.

The `spdk_nvme_qpair_process_completions()` function will keep advancing to the
next completion slot until it runs out of completions, at which point it writes
the completion queue head doorbell to let the device know that it can reuse
those slots for new completions, and then returns.
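The completion path can be sketched the same way. The structures and the
`process_completions()` function below are simplified stand-ins for the real
`struct spdk_nvme_qpair` internals, showing only the phase-bit check, the
CID-to-tracker lookup, and the head doorbell update described above.

```c
#include <stdint.h>

/* Simplified 16-byte NVMe completion entry; the phase bit is bit 0 of
 * the status field and toggles on every pass through the ring. */
struct nvme_cpl {
	uint32_t cdw0;
	uint32_t rsvd;
	uint16_t sqhd;
	uint16_t sqid;
	uint16_t cid;      /* command identifier chosen at submission time */
	uint16_t status;   /* bit 0: phase */
};

struct request {
	void (*cb_fn)(void *cb_arg, const struct nvme_cpl *cpl);
	void *cb_arg;
};

struct tracker {
	struct request *req;
};

/* Simplified queue pair state. */
struct qpair {
	volatile struct nvme_cpl *cpl_ring;  /* completion queue memory */
	struct tracker *trackers;            /* one per submission slot */
	volatile uint32_t *cq_head_doorbell; /* device register */
	uint32_t cq_size;
	uint32_t cq_head;
	uint8_t phase;                       /* expected phase bit, starts at 1 */
};

/* Poll the completion queue, invoking user callbacks for each new entry. */
static int
process_completions(struct qpair *q)
{
	int count = 0;

	for (;;) {
		volatile struct nvme_cpl *cpl = &q->cpl_ring[q->cq_head];

		/* No new entry yet: the phase bit still has the old value. */
		if ((cpl->status & 0x1) != q->phase) {
			break;
		}

		/* CID -> tracker -> request -> user callback. */
		struct tracker *tr = &q->trackers[cpl->cid];
		struct request *req = tr->req;
		req->cb_fn(req->cb_arg, (const struct nvme_cpl *)cpl);
		count++;

		/* Advance the head, flipping the expected phase on wrap. */
		if (++q->cq_head == q->cq_size) {
			q->cq_head = 0;
			q->phase ^= 1;
		}
	}

	if (count > 0) {
		/* Tell the device these slots may be reused for new completions. */
		*q->cq_head_doorbell = q->cq_head;
	}
	return count;
}
```

Real hardware access would also need the proper memory barriers and MMIO write
helpers around the doorbell update; those details are omitted from this sketch.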