# NVMe over Fabrics Target Programming Guide {#nvmf_tgt_pg}

## Target Audience

This programming guide is intended for developers authoring applications that
use the SPDK NVMe-oF target library (`lib/nvmf`). It is intended to provide
background context, architectural insight, and design recommendations. This
guide will not cover how to use the SPDK NVMe-oF target application. For a
guide on how to use the existing application as-is, see @ref nvmf.

## Introduction

The SPDK NVMe-oF target library is located in `lib/nvmf`. The library
implements all logic required to create an NVMe-oF target application. It is
used in the implementation of the example NVMe-oF target application in
`app/nvmf_tgt`, but is intended to be consumed independently.

This guide is written assuming that the reader is familiar with both NVMe and
NVMe over Fabrics. The best way to become familiar with those is to read their
[specifications](http://nvmexpress.org/resources/specifications/).

## Primitives

The library exposes a number of primitives - basic objects that the user
creates and interacts with. They are:

`struct spdk_nvmf_tgt`: An NVMe-oF target. This concept, surprisingly, does
not appear in the NVMe-oF specification. SPDK defines this to mean the
collection of subsystems with the associated namespaces, plus the set of
transports and their associated network connections. This will be referred to
throughout this guide as a **target**.

`struct spdk_nvmf_subsystem`: An NVMe-oF subsystem, as defined by the NVMe-oF
specification. Subsystems contain namespaces and controllers and perform
access control. This will be referred to throughout this guide as a
**subsystem**.

`struct spdk_nvmf_ns`: An NVMe-oF namespace, as defined by the NVMe-oF
specification. Namespaces are **bdevs**. See @ref bdev for an explanation of
the SPDK bdev layer. This will be referred to throughout this guide as a
**namespace**.

`struct spdk_nvmf_qpair`: An NVMe-oF queue pair, as defined by the NVMe-oF
specification. These map 1:1 to network connections. This will be referred to
throughout this guide as a **qpair**.

`struct spdk_nvmf_transport`: An abstraction for a network fabric, as defined
by the NVMe-oF specification. The specification is designed to allow for many
different network fabrics, so the code mirrors that and implements a plugin
system. Currently, only the RDMA transport is available. This will be referred
to throughout this guide as a **transport**.

`struct spdk_nvmf_poll_group`: An abstraction for a collection of network
connections that can be polled as a unit. This is an SPDK-defined concept that
does not appear in the NVMe-oF specification. Often, network transports have
facilities to check for incoming data on groups of connections more
efficiently than checking each one individually (e.g. epoll), so poll groups
provide a generic abstraction for that. This will be referred to throughout
this guide as a **poll group**.

`struct spdk_nvmf_listener`: A network address at which the target will accept
new connections.

`struct spdk_nvmf_host`: An NVMe-oF NQN representing a host (initiator)
system. This is used for access control.

## The Basics

A user of the NVMe-oF target library begins by creating a target using
spdk_nvmf_tgt_create(), setting up a set of addresses on which to accept
connections by calling spdk_nvmf_tgt_listen(), then creating a subsystem
using spdk_nvmf_subsystem_create().

Subsystems begin in an inactive state and must be activated by calling
spdk_nvmf_subsystem_start(). Subsystems may be modified at run time, but only
when in the paused or inactive state. A running subsystem may be paused by
calling spdk_nvmf_subsystem_pause() and resumed by calling
spdk_nvmf_subsystem_resume().

Namespaces may be added to the subsystem by calling
spdk_nvmf_subsystem_add_ns() when the subsystem is inactive or paused.
Namespaces are bdevs. See @ref bdev for more information about the SPDK bdev
layer. A bdev may be obtained by calling spdk_bdev_get_by_name().

Once a subsystem exists and the target is listening on an address, new
connections may be accepted by polling spdk_nvmf_tgt_accept().

All I/O to a subsystem is driven by a poll group, which polls for incoming
network I/O. Poll groups may be created by calling
spdk_nvmf_poll_group_create(). Upon creation, a poll group automatically
begins polling on the thread from which it was created. Most importantly, *a
poll group may only be accessed from the thread on which it was created.*

When spdk_nvmf_tgt_accept() detects a new connection, it will construct a new
struct spdk_nvmf_qpair object and call the user-provided `new_qpair_fn`
callback for each new qpair. In response to this callback, the user must
assign the qpair to a poll group by calling spdk_nvmf_poll_group_add().
Remember, a poll group may only be accessed from the thread on which it was
created, so making a call to spdk_nvmf_poll_group_add() may require passing a
message to the appropriate thread.
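To make that sequence concrete, here is a minimal sketch that wires the calls
above together for one subsystem backed by a single namespace. It assumes a
target has already been created with spdk_nvmf_tgt_create(); the NQN, the
`Malloc0` bdev name, the `setup_example_subsystem()` helper, and the exact
argument lists are illustrative only, since several of these signatures have
changed between SPDK releases and should be checked against
`include/spdk/nvmf.h` for the version in use. Handling of new qpairs is shown
later, in the threading discussion.

```c
/*
 * Minimal sketch of the setup flow described above, for a target that has
 * already been created. Names and argument lists are illustrative.
 */
#include "spdk/bdev.h"
#include "spdk/nvmf.h"

static void
subsystem_start_done(struct spdk_nvmf_subsystem *subsystem, void *cb_arg, int status)
{
	/* The subsystem is active once this callback fires with status == 0. */
}

static int
setup_example_subsystem(struct spdk_nvmf_tgt *tgt, struct spdk_nvme_transport_id *trid)
{
	struct spdk_nvmf_subsystem *subsystem;
	struct spdk_bdev *bdev;

	/* Accept new connections at the given transport address. */
	spdk_nvmf_tgt_listen(tgt, trid);

	/* Subsystems are created in the inactive state. */
	subsystem = spdk_nvmf_subsystem_create(tgt, "nqn.2016-06.io.spdk:cnode1",
					       SPDK_NVMF_SUBTYPE_NVME, 1);
	if (subsystem == NULL) {
		return -1;
	}

	/* Namespaces are bdevs; attach one while the subsystem is inactive. */
	bdev = spdk_bdev_get_by_name("Malloc0");
	if (bdev == NULL) {
		return -1;
	}
	spdk_nvmf_subsystem_add_ns(subsystem, bdev, NULL, 0);

	/* Activate the subsystem. State changes complete asynchronously. */
	return spdk_nvmf_subsystem_start(subsystem, subsystem_start_done, NULL);
}
```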
## Access Control

Access control is performed at the subsystem level by adding allowed listen
addresses and hosts to a subsystem (see spdk_nvmf_subsystem_add_listener() and
spdk_nvmf_subsystem_add_host()). By default, a subsystem will not accept
connections from any host or over any established listen address. Listeners
and hosts may only be added to inactive or paused subsystems.

## Discovery Subsystems

A discovery subsystem, as defined by the NVMe-oF specification, is
automatically created for each NVMe-oF target constructed. Connections to the
discovery subsystem are handled in the same way as any other subsystem - new
qpairs are created in response to spdk_nvmf_tgt_accept() and they must be
assigned to a poll group.

## Transports

The NVMe-oF specification defines multiple network transports (the "Fabrics"
in NVMe over Fabrics) and has an extensible system for adding new fabrics in
the future. The SPDK NVMe-oF target library implements a plugin system for
network transports to mirror the specification. The API a new transport must
implement is located in `lib/nvmf/transport.h`. As of this writing, only an
RDMA transport has been implemented.

The SPDK NVMe-oF target is designed to be able to process I/O from multiple
fabrics simultaneously.

## Choosing a Threading Model

The SPDK NVMe-oF target library does not strictly dictate threading model, but
poll groups do all of their polling and I/O processing on the thread they are
created on. Given that, it almost always makes sense to create one poll group
per thread used in the application. New qpairs created in response to
spdk_nvmf_tgt_accept() can be handed out round-robin to the poll groups. This
is how the SPDK NVMe-oF target application currently functions.
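As an illustration of that round-robin pattern, the sketch below assumes one
poll group has already been created on each participating thread with
spdk_nvmf_poll_group_create(), and hands each new qpair to the next group in
turn. The bookkeeping structures (`g_pgs`, `g_next_pg`, `add_ctx`) are
hypothetical, spdk_thread_send_msg() from `spdk/thread.h` is shown as one
possible way to reach the owning thread, and the single-argument
`new_qpair_fn` form matches the callback described earlier in this guide.

```c
/*
 * Illustrative sketch: one poll group per thread, round-robin assignment of
 * new qpairs. Helper structures are hypothetical; error handling is abbreviated.
 */
#include <stdint.h>
#include <stdlib.h>

#include "spdk/nvmf.h"
#include "spdk/thread.h"

#define NUM_POLL_GROUPS 4

struct pg_slot {
	struct spdk_nvmf_poll_group	*group;		/* created on `thread` */
	struct spdk_thread		*thread;	/* the only thread allowed to touch it */
};

static struct pg_slot g_pgs[NUM_POLL_GROUPS];	/* populated at application startup */
static uint32_t g_next_pg;

struct add_ctx {
	struct spdk_nvmf_qpair		*qpair;
	struct spdk_nvmf_poll_group	*group;
};

/* Runs on the poll group's owning thread. */
static void
_add_qpair(void *arg)
{
	struct add_ctx *ctx = arg;

	spdk_nvmf_poll_group_add(ctx->group, ctx->qpair);
	free(ctx);
}

/* Passed to spdk_nvmf_tgt_accept() as the new_qpair_fn callback. */
static void
new_qpair_cb(struct spdk_nvmf_qpair *qpair)
{
	struct pg_slot *slot = &g_pgs[g_next_pg];
	struct add_ctx *ctx;

	g_next_pg = (g_next_pg + 1) % NUM_POLL_GROUPS;

	ctx = calloc(1, sizeof(*ctx));
	if (ctx == NULL) {
		return;	/* real code would reject the qpair here */
	}
	ctx->qpair = qpair;
	ctx->group = slot->group;

	/*
	 * The poll group may only be accessed from its own thread, so send a
	 * message there instead of calling spdk_nvmf_poll_group_add() directly.
	 */
	spdk_thread_send_msg(slot->thread, _add_qpair, ctx);
}
```

Deferring spdk_nvmf_poll_group_add() to the owning thread keeps each poll
group single-threaded, which is what allows the I/O path described below to
remain lock-free.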
More advanced algorithms for distributing qpairs to poll groups are possible.
For instance, a NUMA-aware algorithm would be an improvement over basic
round-robin, where NUMA-aware means assigning qpairs to poll groups running on
CPU cores that are on the same NUMA node as the network adapter and storage
device. Load-aware algorithms may also have benefits.

## Scaling Across CPU Cores

Incoming I/O requests are picked up by the poll group polling their assigned
qpair. For regular NVMe commands such as READ and WRITE, the I/O request is
processed on the initial thread from start to the point where it is submitted
to the backing storage device, without interruption. Completions are
discovered by polling the backing storage device and are also processed to
completion on the polling thread. **Regular NVMe commands (READ, WRITE, etc.)
do not require any cross-thread coordination, and therefore take no locks.**

NVMe ADMIN commands, which are used for managing the NVMe device itself, may
modify global state in the subsystem. For instance, an NVMe ADMIN command may
perform namespace management, such as shrinking a namespace. For these
commands, the subsystem will temporarily enter a paused state by sending a
message to each thread in the system. All new incoming I/O on any thread
targeting the subsystem will be queued during this time. Once the subsystem is
fully paused, the state change will occur, and messages will be sent to each
thread to release queued I/O and resume. Management commands are rare, so this
style of coordination is preferable to forcing all commands to take locks in
the I/O path.

## Zero Copy Support

For the RDMA transport, data is transferred from the RDMA NIC to host memory
and then from host memory to the SSD (or vice versa), without any intermediate
copies. Data is never moved from one location in host memory to another. Other
transports in the future may require data copies.

## RDMA

The SPDK NVMe-oF RDMA transport is implemented on top of the libibverbs and
rdmacm libraries, which are packaged and available on most Linux
distributions. It does not use a user-space RDMA driver stack through DPDK.

In order to scale to large numbers of connections, the SPDK NVMe-oF RDMA
transport allocates a single RDMA completion queue per poll group. All new
qpairs assigned to the poll group are given their own RDMA send and receive
queues, but share this common completion queue. This allows the poll group to
poll a single queue for incoming messages instead of iterating through each
one.

Each RDMA request is handled by a state machine that walks the request through
a number of states. This keeps the code organized and makes all of the corner
cases much more obvious.

RDMA SEND, READ, and WRITE operations are ordered with respect to one another,
but RDMA RECVs are not necessarily ordered with SEND acknowledgements. For
instance, it is possible to detect an incoming RDMA RECV message containing a
new NVMe-oF capsule prior to detecting the acknowledgement of a previous SEND
containing an NVMe completion. This is problematic at full queue depth because
there may not yet be a free request structure. To handle this, the RDMA
request structure is broken into two parts - an rdma_recv and an rdma_request.
New RDMA RECVs will always grab a free rdma_recv, but may need to wait in a
queue for a SEND acknowledgement before they can acquire a full rdma_request
object.
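A simplified, hypothetical view of that split is sketched below. The real
definitions and state machine live in the RDMA transport inside `lib/nvmf` and
differ in detail; the field and state names here are assumptions chosen to
illustrate the idea rather than SPDK's actual internals.

```c
/*
 * Hypothetical sketch of the two-part request tracking described above.
 * Names and fields are illustrative only.
 */
#include <infiniband/verbs.h>
#include <sys/queue.h>

/* Tracks one posted RECV: always available to absorb an incoming capsule. */
struct rdma_recv {
	struct ibv_recv_wr	wr;
	struct ibv_sge		sgl;
	STAILQ_ENTRY(rdma_recv)	link;	/* capsules waiting for a free request slot */
};

enum rdma_request_state {
	RDMA_REQUEST_STATE_FREE = 0,
	RDMA_REQUEST_STATE_NEW,			/* capsule received, request slot acquired */
	RDMA_REQUEST_STATE_TRANSFERRING_H2C,	/* RDMA READ of host data in flight */
	RDMA_REQUEST_STATE_EXECUTING,		/* submitted to the backing bdev */
	RDMA_REQUEST_STATE_TRANSFERRING_C2H,	/* RDMA WRITE of controller data in flight */
	RDMA_REQUEST_STATE_COMPLETING,		/* SEND of the NVMe completion in flight */
};

/*
 * Full request context. Only acquired once a slot is free, i.e. once the SEND
 * acknowledgement for a previous completion has been seen.
 */
struct rdma_request {
	enum rdma_request_state	state;
	struct rdma_recv	*recv;		/* the capsule that started this request */
	struct ibv_send_wr	send_wr;
	struct ibv_sge		send_sgl;
};
```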
Further, RDMA NICs expose different queue depths for READ/WRITE operations
than they do for SEND/RECV operations. The RDMA transport reports available
queue depth based on SEND/RECV operation limits and will queue in software as
necessary to accommodate (usually lower) limits on READ/WRITE operations.
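In rough terms, that software queueing is a small gate in front of the NIC's
READ/WRITE credits: post the transfer if a slot is free, otherwise park it
until a completion frees one. The sketch below is purely illustrative; the
structure and function names are hypothetical and do not correspond to SPDK's
internal accounting.

```c
/*
 * Purely illustrative: a per-qpair gate on outstanding RDMA READ/WRITE work.
 * post_rdma_read_or_write() is a stub standing in for the real submission path.
 */
#include <stdbool.h>
#include <stdint.h>
#include <sys/queue.h>

struct pending_xfer {
	STAILQ_ENTRY(pending_xfer) link;
	/* ...enough context to post the RDMA READ or WRITE later... */
};

struct rw_gate {
	uint16_t	max_rw_depth;	/* NIC limit for READ/WRITE operations */
	uint16_t	outstanding_rw;	/* READs/WRITEs currently posted */
	STAILQ_HEAD(, pending_xfer) pending;	/* STAILQ_INIT() this at qpair creation */
};

static void
post_rdma_read_or_write(struct pending_xfer *xfer)
{
	/* Stub: the transport's real RDMA READ/WRITE submission goes here. */
	(void)xfer;
}

/* Returns true if the transfer was posted, false if it was queued in software. */
static bool
try_post_rdma_rw(struct rw_gate *gate, struct pending_xfer *xfer)
{
	if (gate->outstanding_rw < gate->max_rw_depth) {
		gate->outstanding_rw++;
		post_rdma_read_or_write(xfer);
		return true;
	}
	STAILQ_INSERT_TAIL(&gate->pending, xfer, link);
	return false;
}

/* Called when a READ or WRITE completion is reaped from the completion queue. */
static void
on_rdma_rw_complete(struct rw_gate *gate)
{
	struct pending_xfer *next = STAILQ_FIRST(&gate->pending);

	gate->outstanding_rw--;
	if (next != NULL) {
		STAILQ_REMOVE_HEAD(&gate->pending, link);
		gate->outstanding_rw++;
		post_rdma_read_or_write(next);
	}
}
```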