# NVMe over Fabrics Target Programming Guide {#nvmf_tgt_pg}
## Target Audience
This programming guide is intended for developers authoring applications that
use the SPDK NVMe-oF target library (`lib/nvmf`). It provides background
context, architectural insight, and design recommendations. This
guide will not cover how to use the SPDK NVMe-oF target application. For a
guide on how to use the existing application as-is, see @ref nvmf.
## Introduction
The SPDK NVMe-oF target library is located in `lib/nvmf`. The library
implements all logic required to create an NVMe-oF target application. It is
used in the implementation of the example NVMe-oF target application in
`app/nvmf_tgt`, but is intended to be consumed independently.

This guide is written assuming that the reader is familiar with both NVMe and
NVMe over Fabrics. The best way to become familiar with those is to read their
[specifications](http://nvmexpress.org/resources/specifications/).
## Primitives
The library exposes a number of primitives - basic objects that the user
creates and interacts with. They are:

`struct spdk_nvmf_tgt`: An NVMe-oF target. This concept, surprisingly, does
not appear in the NVMe-oF specification. SPDK defines this to mean the
collection of subsystems with the associated namespaces, plus the set of
transports and their associated network connections. This will be referred to
throughout this guide as a **target**.

`struct spdk_nvmf_subsystem`: An NVMe-oF subsystem, as defined by the NVMe-oF
specification. Subsystems contain namespaces and controllers and perform
access control. This will be referred to throughout this guide as a
**subsystem**.

`struct spdk_nvmf_ns`: An NVMe-oF namespace, as defined by the NVMe-oF
specification. Namespaces are **bdevs**. See @ref bdev for an explanation of
the SPDK bdev layer. This will be referred to throughout this guide as a
**namespace**.

`struct spdk_nvmf_qpair`: An NVMe-oF queue pair, as defined by the NVMe-oF
specification. These map 1:1 to network connections. This will be referred to
throughout this guide as a **qpair**.

`struct spdk_nvmf_transport`: An abstraction for a network fabric, as defined
by the NVMe-oF specification. The specification is designed to allow for many
different network fabrics, so the code mirrors that and implements a plugin
system. Currently, only the RDMA transport is available. This will be referred
to throughout this guide as a **transport**.

`struct spdk_nvmf_poll_group`: An abstraction for a collection of network
connections that can be polled as a unit. This is an SPDK-defined concept that
does not appear in the NVMe-oF specification. Often, network transports have
facilities to check for incoming data on groups of connections more
efficiently than checking each one individually (e.g. epoll), so poll groups
provide a generic abstraction for that. This will be referred to throughout
this guide as a **poll group**.

`struct spdk_nvmf_listener`: A network address at which the target will accept
new connections.

`struct spdk_nvmf_host`: An NVMe-oF NQN representing a host (initiator)
system. This is used for access control.
## The Basics
A user of the NVMe-oF target library begins by creating a target using
spdk_nvmf_tgt_create(), setting up a set of addresses on which to accept
connections by calling spdk_nvmf_tgt_listen(), then creating a subsystem
using spdk_nvmf_subsystem_create().

Subsystems begin in an inactive state and must be activated by calling
spdk_nvmf_subsystem_start(). Subsystems may be modified at run time, but only
when in the paused or inactive state. A running subsystem may be paused by
calling spdk_nvmf_subsystem_pause() and resumed by calling
spdk_nvmf_subsystem_resume().

Namespaces may be added to the subsystem by calling
spdk_nvmf_subsystem_add_ns() when the subsystem is inactive or paused.
Namespaces are bdevs. See @ref bdev for more information about the SPDK bdev
layer. A bdev may be obtained by calling spdk_bdev_get_by_name().
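
The life cycle rules above can be pictured with a small stand-alone model. This is only a sketch: `struct subsys_model` and the `model_*` functions are hypothetical names invented for illustration and are not part of the SPDK API, and the transitions here complete immediately, which the real library calls may not do.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of the subsystem life cycle; not the SPDK types. */
enum subsys_state { SUBSYS_INACTIVE, SUBSYS_ACTIVE, SUBSYS_PAUSED };

struct subsys_model {
	enum subsys_state state;
	int num_ns;
};

/* Configuration changes are only legal while the subsystem is not running. */
static bool model_can_modify(const struct subsys_model *s)
{
	return s->state == SUBSYS_INACTIVE || s->state == SUBSYS_PAUSED;
}

/* inactive -> active */
static int model_start(struct subsys_model *s)
{
	if (s->state != SUBSYS_INACTIVE) {
		return -EINVAL;
	}
	s->state = SUBSYS_ACTIVE;
	return 0;
}

/* active -> paused */
static int model_pause(struct subsys_model *s)
{
	if (s->state != SUBSYS_ACTIVE) {
		return -EINVAL;
	}
	s->state = SUBSYS_PAUSED;
	return 0;
}

/* paused -> active */
static int model_resume(struct subsys_model *s)
{
	if (s->state != SUBSYS_PAUSED) {
		return -EINVAL;
	}
	s->state = SUBSYS_ACTIVE;
	return 0;
}

/* Adding a namespace is rejected while the subsystem is active. */
static int model_add_ns(struct subsys_model *s)
{
	if (!model_can_modify(s)) {
		return -EBUSY;
	}
	s->num_ns++;
	return 0;
}
```

In this model, adding a namespace to a running subsystem fails with `-EBUSY` until the subsystem has been paused.
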
Once a subsystem exists and the target is listening on an address, new
connections may be accepted by polling spdk_nvmf_tgt_accept().

All I/O to a subsystem is driven by a poll group, which polls for incoming
network I/O. Poll groups may be created by calling
spdk_nvmf_poll_group_create(). They automatically request to begin polling
upon creation on the thread from which they were created. Most importantly, *a
poll group may only be accessed from the thread on which it was created.*

When spdk_nvmf_tgt_accept() detects a new connection, it will construct a new
`struct spdk_nvmf_qpair` object and call the user-provided `new_qpair_fn`
callback for each new qpair. In response to this callback, the user must
assign the qpair to a poll group by calling spdk_nvmf_poll_group_add().
Remember, a poll group may only be accessed from the thread on which it was created,
so making a call to spdk_nvmf_poll_group_add() may require passing a message
to the appropriate thread.
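
The thread-ownership rule can be sketched as follows. This is a single-threaded simulation under stated assumptions: threads are represented by plain integer IDs, the inbox is an ordinary array rather than a thread-safe ring, and `struct pg_model` and the `pg_*` functions are hypothetical names, not SPDK calls.

```c
#include <errno.h>

#define PG_MAX_QPAIRS 8

/* Hypothetical poll group model: only owner_thread may touch qpairs[]. */
struct pg_model {
	int owner_thread;
	int qpairs[PG_MAX_QPAIRS];
	int num_qpairs;
	int inbox[PG_MAX_QPAIRS];   /* pending "add qpair" messages */
	int inbox_len;
};

/* Direct add: refused unless called from the owning thread. */
static int pg_add(struct pg_model *pg, int cur_thread, int qpair_id)
{
	if (cur_thread != pg->owner_thread) {
		return -EPERM;
	}
	if (pg->num_qpairs == PG_MAX_QPAIRS) {
		return -ENOMEM;
	}
	pg->qpairs[pg->num_qpairs++] = qpair_id;
	return 0;
}

/* Callable from any thread: leave a message for the owner instead. */
static int pg_send_add(struct pg_model *pg, int qpair_id)
{
	if (pg->inbox_len == PG_MAX_QPAIRS) {
		return -ENOMEM;
	}
	pg->inbox[pg->inbox_len++] = qpair_id;
	return 0;
}

/* Runs on the owning thread; returns how many qpairs were added. */
static int pg_drain(struct pg_model *pg, int cur_thread)
{
	int i, added = 0;

	if (cur_thread != pg->owner_thread) {
		return -EPERM;
	}
	for (i = 0; i < pg->inbox_len; i++) {
		if (pg_add(pg, cur_thread, pg->inbox[i]) == 0) {
			added++;
		}
	}
	pg->inbox_len = 0;
	return added;
}
```

The accepting thread never modifies the poll group directly. It only posts a message, and the qpair becomes part of the group once the owning thread processes its inbox.
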
## Access Control
Access control is performed at the subsystem level by adding allowed listen
addresses and hosts to a subsystem (see spdk_nvmf_subsystem_add_listener() and
spdk_nvmf_subsystem_add_host()). By default, a subsystem will not accept
connections from any host or over any established listen address. Listeners
and hosts may only be added to inactive or paused subsystems.
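
A minimal sketch of the default-deny behavior described above, assuming a connection is admitted only when both its host NQN and its listen address have been added to the subsystem. `struct acl_model` and `acl_allows()` are hypothetical names, and the NQN and address strings in the usage note are made-up example values.

```c
#include <stdbool.h>
#include <string.h>

#define ACL_MAX_ENTRIES 4

/* Hypothetical per-subsystem allow lists; both start out empty. */
struct acl_model {
	const char *hosts[ACL_MAX_ENTRIES];
	int num_hosts;
	const char *listeners[ACL_MAX_ENTRIES];
	int num_listeners;
};

static bool acl_list_contains(const char *const *list, int n, const char *key)
{
	for (int i = 0; i < n; i++) {
		if (strcmp(list[i], key) == 0) {
			return true;
		}
	}
	return false;
}

/* Empty lists match nothing, so a fresh subsystem rejects every connection. */
static bool acl_allows(const struct acl_model *acl, const char *hostnqn,
		       const char *listen_addr)
{
	return acl_list_contains(acl->hosts, acl->num_hosts, hostnqn) &&
	       acl_list_contains(acl->listeners, acl->num_listeners, listen_addr);
}
```

With an empty `acl_model`, a connection from `nqn.2016-06.io.spdk:host1` on `192.168.0.10:4420` is rejected. It is accepted only after both strings have been appended to their respective lists.
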
## Discovery Subsystems
A discovery subsystem, as defined by the NVMe-oF specification, is
automatically created for each NVMe-oF target constructed. Connections to the
discovery subsystem are handled in the same way as any other subsystem - new
qpairs are created in response to spdk_nvmf_tgt_accept() and they must be
assigned to a poll group.
## Transports
The NVMe-oF specification defines multiple network transports (the "Fabrics"
in NVMe over Fabrics) and has an extensible system for adding new fabrics
in the future. The SPDK NVMe-oF target library implements a plugin system for
network transports to mirror the specification. The API a new transport must
implement is located in `lib/nvmf/transport.h`. As of this writing, only an RDMA
transport has been implemented.

The SPDK NVMe-oF target is designed to be able to process I/O from multiple
fabrics simultaneously.
## Choosing a Threading Model
The SPDK NVMe-oF target library does not strictly dictate a threading model, but
poll groups do all of their polling and I/O processing on the thread they are
created on. Given that, it almost always makes sense to create one poll group
per thread used in the application. New qpairs created in response to
spdk_nvmf_tgt_accept() can be handed out round-robin to the poll groups. This
is how the SPDK NVMe-oF target application currently functions.
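
Round-robin distribution needs nothing more than a counter. The sketch below assumes poll groups are identified by an index into an application-owned array. `struct rr_assigner` and `rr_next_group()` are hypothetical names, not part of the library.

```c
#include <stddef.h>

/* Hypothetical round-robin cursor over num_groups poll groups. */
struct rr_assigner {
	size_t num_groups;   /* must be greater than zero */
	size_t next;
};

/* Returns the index of the poll group that should receive the next qpair. */
static size_t rr_next_group(struct rr_assigner *rr)
{
	size_t chosen = rr->next;

	rr->next = (rr->next + 1) % rr->num_groups;
	return chosen;
}
```

The new-qpair callback would call `rr_next_group()` once per qpair and then hand the qpair to the thread that owns the chosen poll group.
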
More advanced algorithms for distributing qpairs to poll groups are possible.
For instance, a NUMA-aware algorithm would be an improvement over basic
round-robin, where NUMA-aware means assigning qpairs to poll groups running on
CPU cores that are on the same NUMA node as the network adapter and storage
device. Load-aware algorithms also may have benefits.
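
One possible shape for a NUMA-aware policy is sketched below. It assumes the application already knows the NUMA node of each poll group's core and of the network adapter. How those values are discovered is outside the scope of this sketch, and `struct numa_assigner` and `numa_next_group()` are hypothetical names.

```c
#include <stddef.h>

/* Hypothetical description of one poll group. */
struct group_info {
	int numa_node;   /* NUMA node of the core running this poll group */
};

struct numa_assigner {
	const struct group_info *groups;
	size_t num_groups;   /* must be greater than zero */
	size_t next;         /* where the next search starts */
};

/*
 * Round-robin among the poll groups on preferred_node. If no poll group
 * runs on that node, fall back to plain round-robin over all groups.
 */
static size_t numa_next_group(struct numa_assigner *a, int preferred_node)
{
	size_t i, idx;

	for (i = 0; i < a->num_groups; i++) {
		idx = (a->next + i) % a->num_groups;
		if (a->groups[idx].numa_node == preferred_node) {
			a->next = (idx + 1) % a->num_groups;
			return idx;
		}
	}

	idx = a->next;
	a->next = (a->next + 1) % a->num_groups;
	return idx;
}
```

The fallback keeps the policy usable on systems where no poll group runs on the adapter's node.
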
## Scaling Across CPU Cores
Incoming I/O requests are picked up by the poll group polling their assigned
qpair. For regular NVMe commands such as READ and WRITE, the I/O request is
processed on the initial thread from start to the point where it is submitted
to the backing storage device, without interruption. Completions are
discovered by polling the backing storage device and also processed to
completion on the polling thread. **Regular NVMe commands (READ, WRITE, etc.)
do not require any cross-thread coordination, and therefore take no locks.**

NVMe ADMIN commands, which are used for managing the NVMe device itself, may
modify global state in the subsystem. For instance, an NVMe ADMIN command may
perform namespace management, such as shrinking a namespace. For these
commands, the subsystem will temporarily enter a paused state by sending a
message to each thread in the system. All new incoming I/O on any thread
targeting the subsystem will be queued during this time. Once the subsystem is
fully paused, the state change will occur, and messages will be sent to each
thread to release queued I/O and resume. Management commands are rare, so this
style of coordination is preferable to forcing all commands to take locks in
the I/O path.
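
The pause-and-resume coordination can be modeled in a few lines. The sketch is a single-threaded simulation: where the real target sends a message to each thread, the model simply sets a per-thread flag in a loop. All names (`struct paused_subsys`, `ps_submit_io()`, and so on) are hypothetical.

```c
#include <stdbool.h>

#define PS_NUM_THREADS 3
#define PS_MAX_QUEUED 8

/* Each thread's private view of the subsystem. */
struct ps_thread {
	bool paused;
	int queued[PS_MAX_QUEUED];   /* I/O held back while paused */
	int num_queued;
	int num_submitted;
};

struct paused_subsys {
	struct ps_thread threads[PS_NUM_THREADS];
	int num_ns;   /* global state changed only while fully paused */
};

/*
 * I/O path: checks only this thread's own flag and takes no lock.
 * Returns 0 if submitted, 1 if queued, -1 if the queue is full.
 */
static int ps_submit_io(struct paused_subsys *s, int thread, int io_id)
{
	struct ps_thread *t = &s->threads[thread];

	if (!t->paused) {
		t->num_submitted++;
		return 0;
	}
	if (t->num_queued == PS_MAX_QUEUED) {
		return -1;
	}
	t->queued[t->num_queued++] = io_id;
	return 1;
}

/* Stand-in for "send a pause message to each thread". */
static void ps_pause_all(struct paused_subsys *s)
{
	for (int i = 0; i < PS_NUM_THREADS; i++) {
		s->threads[i].paused = true;
	}
}

/* Stand-in for "send a resume message": release the queued I/O. */
static void ps_resume_all(struct paused_subsys *s)
{
	for (int i = 0; i < PS_NUM_THREADS; i++) {
		struct ps_thread *t = &s->threads[i];

		t->paused = false;
		t->num_submitted += t->num_queued;
		t->num_queued = 0;
	}
}
```

A management operation in this model is `ps_pause_all()`, an update to `num_ns`, then `ps_resume_all()`. I/O that arrives in between is delayed rather than failed.
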
## Zero Copy Support
For the RDMA transport, data is transferred from the RDMA NIC to host memory
and then host memory to the SSD (or vice versa), without any intermediate
copies. Data is never moved from one location in host memory to another. Other
transports in the future may require data copies.
## RDMA
The SPDK NVMe-oF RDMA transport is implemented on top of the libibverbs and
rdmacm libraries, which are packaged and available on most Linux
distributions. It does not use a user-space RDMA driver stack through DPDK.

In order to scale to large numbers of connections, the SPDK NVMe-oF RDMA
transport allocates a single RDMA completion queue per poll group. All new
qpairs assigned to the poll group are given their own RDMA send and receive
queues, but share this common completion queue. This allows the poll group to
poll a single queue for incoming messages instead of iterating through each
one.
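
The benefit of a shared completion queue can be seen in a toy model that does not use libibverbs at all. Completions from every qpair land in one ring, and a single poll drains them regardless of how many qpairs exist. `struct cq_model`, `struct cq_poll_group`, and the functions below are hypothetical names.

```c
#include <stddef.h>

#define CQ_DEPTH 16
#define CQ_MAX_QPAIRS 4

/* One completion entry, tagged with the qpair it belongs to. */
struct cq_entry {
	int qpair_id;
	int request_id;
};

/* Ring buffer shared by every qpair in the poll group. */
struct cq_model {
	struct cq_entry entries[CQ_DEPTH];
	size_t head;
	size_t tail;
};

struct cq_poll_group {
	struct cq_model cq;
	int completed[CQ_MAX_QPAIRS];   /* per-qpair completion counters */
};

/* Called on behalf of any qpair; returns -1 if the ring is full. */
static int cq_push(struct cq_model *cq, int qpair_id, int request_id)
{
	if (cq->tail - cq->head == CQ_DEPTH) {
		return -1;
	}
	cq->entries[cq->tail % CQ_DEPTH] = (struct cq_entry){ qpair_id, request_id };
	cq->tail++;
	return 0;
}

/* One poll handles completions for all qpairs; returns the number drained. */
static int cq_poll(struct cq_poll_group *pg)
{
	int n = 0;

	while (pg->cq.head != pg->cq.tail) {
		const struct cq_entry *c = &pg->cq.entries[pg->cq.head % CQ_DEPTH];

		pg->cq.head++;
		if (c->qpair_id >= 0 && c->qpair_id < CQ_MAX_QPAIRS) {
			pg->completed[c->qpair_id]++;
		}
		n++;
	}
	return n;
}
```

The cost of `cq_poll()` depends on the number of pending completions, not on the number of qpairs in the group.
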
Each RDMA request is handled by a state machine that walks the request through
a number of states. This keeps the code organized and makes all of the corner
cases much more obvious.
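
A request state machine of this kind might look like the following sketch. The state names and transitions are illustrative assumptions chosen for this example. They are not the states used by the SPDK RDMA transport.

```c
#include <stdbool.h>

/* Illustrative states only; not the transport's actual state list. */
enum req_state {
	REQ_NEW,
	REQ_NEED_BUFFER,
	REQ_XFER_FROM_HOST,
	REQ_EXECUTING,
	REQ_XFER_TO_HOST,
	REQ_COMPLETING,
	REQ_DONE
};

enum req_dir { DIR_NONE, DIR_HOST_TO_CTRLR, DIR_CTRLR_TO_HOST };

struct req_model {
	enum req_state state;
	enum req_dir dir;
};

/* Advance the request by one state; returns false once it is finished. */
static bool req_step(struct req_model *r)
{
	switch (r->state) {
	case REQ_NEW:
		/* Commands without data skip buffer allocation. */
		r->state = (r->dir == DIR_NONE) ? REQ_EXECUTING : REQ_NEED_BUFFER;
		return true;
	case REQ_NEED_BUFFER:
		/* Host-to-controller data must be pulled in before executing. */
		r->state = (r->dir == DIR_HOST_TO_CTRLR) ? REQ_XFER_FROM_HOST
							 : REQ_EXECUTING;
		return true;
	case REQ_XFER_FROM_HOST:
		r->state = REQ_EXECUTING;
		return true;
	case REQ_EXECUTING:
		/* Controller-to-host data is pushed out after executing. */
		r->state = (r->dir == DIR_CTRLR_TO_HOST) ? REQ_XFER_TO_HOST
							 : REQ_COMPLETING;
		return true;
	case REQ_XFER_TO_HOST:
		r->state = REQ_COMPLETING;
		return true;
	case REQ_COMPLETING:
		r->state = REQ_DONE;
		return true;
	case REQ_DONE:
	default:
		return false;
	}
}
```

Because every transition lives in one `switch`, a corner case such as a command with no data shows up as an explicit branch rather than as a condition scattered across several functions.
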
RDMA SEND, READ, and WRITE operations are ordered with respect to one another,
but RDMA RECVs are not necessarily ordered with SEND acknowledgements. For
instance, it is possible to detect an incoming RDMA RECV message containing a
new NVMe-oF capsule prior to detecting the acknowledgement of a previous SEND
containing an NVMe completion. This is problematic at full queue depth because
there may not yet be a free request structure. To handle this, the RDMA
request structure is broken into two parts - an rdma_recv and an rdma_request.
New RDMA RECVs will always grab a free rdma_recv, but may need to wait in a
queue for a SEND acknowledgement before they can acquire a full rdma_request
object.
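
The two-part split can be illustrated with simple counters. The sketch assumes that a receive descriptor becomes available again once the response has been sent, while the full request object is only released when the SEND acknowledgement arrives. That timing is a simplification made for this model, and `struct split_qpair` and the `split_*` functions are hypothetical names.

```c
#define SPLIT_QUEUE_DEPTH 2

/* Hypothetical per-qpair resource counters. */
struct split_qpair {
	int free_recvs;        /* descriptors available for incoming capsules */
	int free_requests;     /* full request objects */
	int pending_recvs;     /* capsules waiting for a request object */
	int active_requests;   /* request objects currently in use */
};

/*
 * A new capsule arrived. Returns 1 if processing started, 0 if the
 * capsule was queued for a request object, -1 if no descriptor was free.
 */
static int split_on_recv(struct split_qpair *q)
{
	if (q->free_recvs == 0) {
		return -1;
	}
	q->free_recvs--;
	if (q->free_requests == 0) {
		q->pending_recvs++;
		return 0;
	}
	q->free_requests--;
	q->active_requests++;
	return 1;
}

/* The response SEND was posted: the receive descriptor is available again. */
static void split_on_send_posted(struct split_qpair *q)
{
	q->free_recvs++;
}

/* The SEND was acknowledged: release the request, then serve any waiter. */
static void split_on_send_ack(struct split_qpair *q)
{
	if (q->active_requests == 0) {
		return;
	}
	q->active_requests--;
	q->free_requests++;
	if (q->pending_recvs > 0) {
		q->pending_recvs--;
		q->free_requests--;
		q->active_requests++;
	}
}
```

At full queue depth, a capsule that arrives before the previous SEND acknowledgement is parked in `pending_recvs` instead of being dropped, and it starts as soon as the acknowledgement frees a request object.
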
Further, RDMA NICs expose different queue depths for READ/WRITE operations
than they do for SEND/RECV operations. The RDMA transport reports available
queue depth based on SEND/RECV operation limits and will queue in software as
necessary to accommodate (usually lower) limits on READ/WRITE operations.
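
Software queueing against a lower READ/WRITE limit can be sketched with two counters. The limit value used in the usage note and the names `struct rw_limiter`, `rw_submit()`, and `rw_complete()` are assumptions made for illustration.

```c
/* Hypothetical limiter for outstanding RDMA READ/WRITE operations. */
struct rw_limiter {
	int max_rw_depth;   /* NIC limit on outstanding READ/WRITE operations */
	int outstanding;    /* operations currently posted to the NIC */
	int waiting;        /* transfers queued in software */
};

/* Returns 1 if the transfer was posted, 0 if it was queued in software. */
static int rw_submit(struct rw_limiter *l)
{
	if (l->outstanding < l->max_rw_depth) {
		l->outstanding++;
		return 1;
	}
	l->waiting++;
	return 0;
}

/* A READ/WRITE finished: free its slot and start a waiting transfer, if any. */
static void rw_complete(struct rw_limiter *l)
{
	if (l->outstanding > 0) {
		l->outstanding--;
	}
	if (l->waiting > 0 && l->outstanding < l->max_rw_depth) {
		l->waiting--;
		l->outstanding++;
	}
}
```

With `max_rw_depth` set to 2, a third transfer waits in software until one of the first two completes, so the host can still be offered the larger SEND/RECV queue depth.
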