1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
|
# Notes to Linux distributors
This document contains information about the packaging of nvme-stas.
## Compile-time dependencies
nvme-stas is a Python 3 project and does not require compile-time libraries per se. However, we use the meson build system for installation and testing.
| Library / Program | Purpose | Mandatory / Optional |
| ----------------- | ------------------------------------------------- | -------------------- |
| meson | Project configuration, installation, and testing. | Mandatory |
## Unit tests dependencies
nvme-stas provides static code analysis (pylint, pyflakes), which can be run with "`meson test`".
| Library / Program | Purpose | Mandatory / Optional |
| ----------------- | ------------------------------------------------------------ | -------------------- |
| pylint | Static code analysis | Optional |
| python3-pyflakes | Static code analysis | Optional |
| python3-pyfakefs | Static code analysis | Optional |
| vermin | Check that code meets minimum Python version requirement (3.6) | Optional |
## Run-time dependencies
Python 3.6 is the minimum version required to run nvme-stas. nvme-stas is built on top of libnvme, which is used to interact with the kernel's NVMe driver (i.e. `drivers/nvme/host/`). To support all the features of nvme-stas, several changes to the Linux kernel are required. nvme-stas can also operate with older kernels, but with limited functionality. Kernel 5.18 provides all the features needed by nvme-stas. nvme-stas can also work with older kernels that include back-ported changes to the NVMe driver.
The next table shows different features that were added to the NVMe driver and in which version of the Linux kernel they were added (the list of git patches can be found in addendum). Note that the ability to query the NVMe driver to determine what options it supports was added in 5.17. This is needed if nvme-stas is to make the right decision on whether a feature is supported. Otherwise, nvme-stas can only rely on the kernel version to decide what is supported. This can greatly limit the features supported on back-ported kernels.
| Feature | Introduced in kernel version |
| ------------------------------------------------------------ | ---------------------------- |
| **`host-iface` option** - Ability to force TCP connections over a specific interface. Needed for zeroconf provisioning. | 5.14 |
| **TP8013 Support** - Discovery Controller (DC) Unique NQN. Allow the creation of connections to DC with a NQN other than the default `nqn.2014-08.org.nvmexpress.discovery` | 5.16 |
| **Query supported options** - Allow user-space applications to query which options the NVMe driver supports | 5.17 |
| **TP8010 Support** - Ability for a Host to register with a Discovery Controller. This version of the kernel introduces a new event to indicate to user-space apps (e.g. nvme-stas) when a connection to a DC is restored. This is used to trigger a re-registration of the host. This kernel also exposes the DC Type (dctype) attribute through the sysfs, which is needed to determine whether registration is supported. | 5.18 |
| - Print actual source IP address (`src_addr`) through sysfs "address" attr. This is needed to verify that TCP connections were made on the right interface.<br />- Consider also `host_iface` when checking IP options.<br />- Send a rediscover uevent when a persistent discovery controller reconnects. | 6.1 |
nvme-stas also depends on the following run-time libraries and modules. Note that versions listed are the versions that were tested with at the time the code was developed.
| Package / Module | Min version | stafd | stacd | How to determine the currently installed version |
| ---------------------------------------------------------- | ----------- | ------------- | ------------- | ------------------------------------------------------------ |
| python3 | 3.6 | **Mandatory** | **Mandatory** | `python3 --version`<br />`nvme-stas` requires Python 3.6 as a minimum. |
| python3-dasbus | 1.6 | **Mandatory** | **Mandatory** | pip list \| grep dasbus |
| python3-pyudev | 0.22.0 | **Mandatory** | **Mandatory** | `python3 -c 'import pyudev; print(f"{pyudev.__version__}")'` |
| python3-systemd | 240 | **Mandatory** | **Mandatory** | `systemd --version` |
| python3-gi (Debian) OR python3-gobject (Fedora) | 3.36.0 | **Mandatory** | **Mandatory** | `python3 -c 'import gi; print(f"{gi.__version__}")'` |
| nvme-tcp (kernel module) | 5.18 * | **Mandatory** | **Mandatory** | N/A |
| dbus-daemon | 1.12.2 | **Mandatory** | **Mandatory** | `dbus-daemon --version` |
| avahi-daemon | 0.7 | **Mandatory** | Not required | `avahi-daemon --version` |
| python3-libnvme | 1.3 | **Mandatory** | **Mandatory** | `python3 -c 'import libnvme; print(f"{libnvme.__version__}")'` |
| importlib.resources.files() OR importlib_resources.files() | *** | Optional | Optional | `importlib.resources.files()` was introduced in Python 3.9 and backported to earlier versions as `importlib_resources.files()`. If neither modules can be found, `nvme-stas` will default to using the less efficient `pkg_resources.resource_string()` instead. When `nvme-stas` is no longer required to support Python 3.6 and is allowed a minimum of 3.9 or later, only `importlib.resources.files()` will be required. |
* Kernel 5.18 provides full functionality. nvme-stas can work with older kernels, but with limited functionality, unless the kernels contain back-ported features (see Addendum for the list of kernel patches that could be back-ported to an older kernel).
## Things to do post installation
### D-Bus configuration
We install D-Bus configuration files under `/usr/share/dbus-1/system.d`. One needs to run **`systemctl reload dbus-broker.service`** (Fedora) OR **`systemctl reload dbus.service`** (SuSE, Debian) for the new configuration to take effect.
### Configuration shared with `libnvme` and `nvme-cli`
`stafd` and `stacd` use the `libnvme` library to interact with the Linux kernel. And `libnvme` as well as `nvme-cli` rely on two configuration files, `/etc/nvme/hostnqn` and `/etc/nvme/hostid`, to retrieve the Host NQN and ID respectively. These files should be created post installation with the help of the `stadadm` utility. Here's an example for Debian-based systems:
```
if [ "$1" = "configure" ]; then
if [ ! -d "/etc/nvme" ]
mkdir /etc/nvme
fi
if [ ! -s "/etc/nvme/hostnqn" ]; then
stasadm hostnqn -f /etc/nvme/hostnqn
fi
if [ ! -s "/etc/nvme/hostid" ]; then
stasadm hostid -f /etc/nvme/hostid
fi
fi
```
The utility program `stasadm` gets installed with `nvme-stas`. `stasadm` also manages the creation (and updating) of `/etc/stas/sys.conf`, the `nvme-stas` system configuration file.
### Configuration specific to nvme-stas
The [README](./README.md) file defines the following three configuration files:
- `/etc/stas/sys.conf`
- `/etc/stas/stafd.conf`
- `/etc/stas/stacd.conf`
Care should be taken during upgrades to preserve customer configuration and not simply overwrite it. The process to migrate the configuration data and the list of parameters to migrate is still to be defined.
### Enabling and starting the daemons
Lastly, the two daemons, `stafd` and `stacd`, should be enabled (e.g. `systemctl enable stafd.service stacd.service`) and started (e.g. `systemctl start stafd.service stacd.service`).
# Compatibility between nvme-stas and nvme-cli
Udev rules are installed along with `nvme-cli` (e.g. `/usr/lib/udev/rules.d/70-nvmf-autoconnect.rules`). These udev rules allow `nvme-cli` to perform tasks similar to those performed by `nvme-stas`. However, the udev rules in `nvme-cli` version 2.1.2 and prior drop the `host-iface` parameter when making TCP connections to I/O controllers. `nvme-stas`, on the other hand, always makes sure that TCP connections to I/O controllers are made over the right interface using the `host-iface` parameter.
We essentially have a race condition because `nvme-stas` and `nvme-cli` react to the same kernel events. Both try to perform the same task in parallel, which is to connect to I/O controllers. Because `nvme-stas` is written in Python and the udevd daemon (i.e. the process running the udev rules) in C, `nvme-stas` usually loses the race and TCP connections are made by the udev rules without specifying the `host-iface`.
To remedy to this problem, `nvme-stas` disables `nvme-cli` udev rules and assumes the tasks performed by the udev rules. This way, only one process will take action on kernel events eliminating any race conditions. This also ensure that the right `host-iface` is used when making TCP connections.
# Addendum
## Kernel patches for nvme-stas 1.x
Here's the list of kernel patches (added in kernels 5.14 to 5.18) that will enable all features of nvme-stas.
```
commit e3448b134426741902b6e2c07cbaf5f66cfd2ebc
Author: Martin Belanger <martin.belanger@dell.com>
Date: Tue Feb 8 14:18:02 2022 -0500
nvme: Expose cntrltype and dctype through sysfs
TP8010 introduces the Discovery Controller Type attribute (dctype).
The dctype is returned in the response to the Identify command. This
patch exposes the dctype through the sysfs. Since the dctype depends on
the Controller Type (cntrltype), another attribute of the Identify
response, the patch also exposes the cntrltype as well. The dctype will
only be displayed for discovery controllers.
A note about the naming of this attribute:
Although TP8010 calls this attribute the Discovery Controller Type,
note that the dctype is now part of the response to the Identify
command for all controller types. I/O, Discovery, and Admin controllers
all share the same Identify response PDU structure. Non-discovery
controllers as well as pre-TP8010 discovery controllers will continue
to set this field to 0 (which has always been the default for reserved
bytes). Per TP8010, the value 0 now means "Discovery controller type is
not reported" instead of "Reserved". One could argue that this
definition is correct even for non-discovery controllers, and by
extension, exposing it in the sysfs for non-discovery controllers is
appropriate.
Signed-off-by: Martin Belanger <martin.belanger@dell.com>
commit 68c483a105ce7107f1cf8e1ed6c2c2abb5baa551
Author: Martin Belanger <martin.belanger@dell.com>
Date: Thu Feb 3 16:04:29 2022 -0500
nvme: send uevent on connection up
When connectivity with a controller is lost, the driver will keep
trying to reconnect once every 10 sec. When connection is restored,
user-space apps need to be informed so that they can take proper
action. For example, TP8010 introduces the DIM PDU, which is used to
register with a discovery controller (DC). The DIM PDU is sent from
user-space. The DIM PDU must be sent every time a connection is
established with a DC. Therefore, the kernel must tell user-space apps
when connection is restored so that registration can happen.
The uevent sent is a "change" uevent with environmental data
set to: "NVME_EVENT=connected".
Signed-off-by: Martin Belanger <martin.belanger@dell.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
commit f18ee3d988157ebcadc9b7e5fd34811938f50223
Author: Hannes Reinecke <hare@suse.de>
Date: Tue Dec 7 14:55:49 2021 +0100
nvme-fabrics: print out valid arguments when reading from /dev/nvme-fabrics
Currently applications have a hard time figuring out which
nvme-over-fabrics arguments are supported for any given kernel;
the ioctl will return an error code on failure, and the application
has to guess whether this was due to an invalid argument or due
to a connection or controller error.
With this patch applications can read a list of supported
arguments by simply reading from /dev/nvme-fabrics, allowing
them to validate the connection string.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
commit e5ea42faa773c6a6bb5d9e9f5c2cc808940b5a55
Author: Hannes Reinecke <hare@suse.de>
Date: Wed Sep 22 08:35:25 2021 +0200
nvme: display correct subsystem NQN
With discovery controllers supporting unique subsystem NQNs the
actual subsystem NQN might be different from that one passed in
via the connect args. So add a helper to display the resulting
subsystem NQN.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
commit 20e8b689c9088027b7495ffd6f80812c11ecc872
Author: Hannes Reinecke <hare@suse.de>
Date: Wed Sep 22 08:35:24 2021 +0200
nvme: Add connect option 'discovery'
Add a connect option 'discovery' to specify that the connection
should be made to a discovery controller, not a normal I/O controller.
With discovery controllers supporting unique subsystem NQNs we
cannot easily distinguish by the subsystem NQN if this should be
a discovery connection, but we need this information to blank out
options not supported by discovery controllers.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
commit 954ae16681f6bdf684f016ca626329302a38e177
Author: Hannes Reinecke <hare@suse.de>
Date: Wed Sep 22 08:35:23 2021 +0200
nvme: expose subsystem type in sysfs attribute 'subsystype'
With unique discovery controller NQNs we cannot distinguish the
subsystem type by the NQN alone, but need to check the subsystem
type, too.
So expose the subsystem type in a new sysfs attribute 'subsystype'.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
commit 3ede8f72a9a2825efca23a3552e80a1202ea88fd
Author: Martin Belanger <martin.belanger@dell.com>
Date: Thu May 20 15:09:34 2021 -0400
nvme-tcp: allow selecting the network interface for connections
In our application, we need a way to force TCP connections to go out a
specific IP interface instead of letting Linux select the interface
based on the routing tables.
Add the 'host-iface' option to allow specifying the interface to use.
When the option host-iface is specified, the driver uses the specified
interface to set the option SO_BINDTODEVICE on the TCP socket before
connecting.
This new option is needed in addtion to the existing host-traddr for
the following reasons:
Specifying an IP interface by its associated IP address is less
intuitive than specifying the actual interface name and, in some cases,
simply doesn't work. That's because the association between interfaces
and IP addresses is not predictable. IP addresses can be changed or can
change by themselves over time (e.g. DHCP). Interface names are
predictable [1] and will persist over time. Consider the following
configuration.
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state ...
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 100.0.0.100/24 scope global lo
valid_lft forever preferred_lft forever
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc ...
link/ether 08:00:27:21:65:ec brd ff:ff:ff:ff:ff:ff
inet 100.0.0.100/24 scope global enp0s3
valid_lft forever preferred_lft forever
3: enp0s8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc ...
link/ether 08:00:27:4f:95:5c brd ff:ff:ff:ff:ff:ff
inet 100.0.0.100/24 scope global enp0s8
valid_lft forever preferred_lft forever
The above is a VM that I configured with the same IP address
(100.0.0.100) on all interfaces. Doing a reverse lookup to identify the
unique interface associated with 100.0.0.100 does not work here. And
this is why the option host_iface is required. I understand that the
above config does not represent a standard host system, but I'm using
this to prove a point: "We can never know how users will configure
their systems". By te way, The above configuration is perfectly fine
by Linux.
The current TCP implementation for host_traddr performs a
bind()-before-connect(). This is a common construct to set the source
IP address on a TCP socket before connecting. This has no effect on how
Linux selects the interface for the connection. That's because Linux
uses the Weak End System model as described in RFC1122 [2]. On the other
hand, setting the Source IP Address has benefits and should be supported
by linux-nvme. In fact, setting the Source IP Address is a mandatory
FedGov requirement (e.g. connection to a RADIUS/TACACS+ server).
Consider the following configuration.
$ ip addr list dev enp0s8
3: enp0s8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc ...
link/ether 08:00:27:4f:95:5c brd ff:ff:ff:ff:ff:ff
inet 192.168.56.101/24 brd 192.168.56.255 scope global enp0s8
valid_lft 426sec preferred_lft 426sec
inet 192.168.56.102/24 scope global secondary enp0s8
valid_lft forever preferred_lft forever
inet 192.168.56.103/24 scope global secondary enp0s8
valid_lft forever preferred_lft forever
inet 192.168.56.104/24 scope global secondary enp0s8
valid_lft forever preferred_lft forever
Here we can see that several addresses are associated with interface
enp0s8. By default, Linux always selects the default IP address,
192.168.56.101, as the source address when connecting over interface
enp0s8. Some users, however, want the ability to specify a different
source address (e.g., 192.168.56.102, 192.168.56.103, ...). The option
host_traddr can be used as-is to perform this function.
In conclusion, I believe that we need 2 options for TCP connections.
One that can be used to specify an interface (host-iface). And one that
can be used to set the source address (host-traddr). Users should be
allowed to use one or the other, or both, or none. Of course, the
documentation for host_traddr will need some clarification. It should
state that when used for TCP connection, this option only sets the
source address. And the documentation for host_iface should say that
this option is only available for TCP connections.
References:
[1] https://www.freedesktop.org/wiki/Software/systemd/PredictableNetworkInterfaceNames/
[2] https://tools.ietf.org/html/rfc1122
Tested both IPv4 and IPv6 connections.
Signed-off-by: Martin Belanger <martin.belanger@dell.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
```
## Kernel patches for nvme-stas 2.x
These patches are not essential for nvme-stas 2.x, but they allow nvme-stas to operate better.
```
commit f46ef9e87c9e8941b7acee45611c7c6a322592bb
Author: Sagi Grimberg <sagi@grimberg.me>
Date: Thu Sep 22 11:15:37 2022 +0300
nvme: send a rediscover uevent when a persistent discovery controller reconnects
When a discovery controller is disconnected, no AENs will arrive to
notify the host about discovery log change events.
In order to solve this, send a uevent notification when a
persistent discovery controller reconnects. We add a new ctrl
flag NVME_CTRL_STARTED_ONCE that will be set on the first
start, and consecutive calls will find it set, and send the
event to userspace if the controller is a discovery controller.
Upon the event reception, userspace will re-read the discovery
log page and will act upon changes as it sees fit.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
commit 02c57a82c0081141abc19150beab48ef47f97f18 (tag: nvme-6.1-2022-09-20)
Author: Martin Belanger <martin.belanger@dell.com>
Date: Wed Sep 7 08:27:37 2022 -0400
nvme-tcp: print actual source IP address through sysfs "address" attr
TCP transport relies on the routing table to determine which source
address and interface to use when making a connection. Currently, there
is no way to tell from userspace where a connection was made. This
patch exposes the actual source address using a new field named
"src_addr=" in the "address" attribute.
This is needed to diagnose and identify connectivity issues. With the
source address we can infer the interface associated with each
connection.
This was tested with nvme-cli 2.0 to verify it does not have any
adverse effect. The new "src_addr=" field will simply be displayed in
the output of the "list-subsys" or "list -v" commands as shown here.
$ nvme list-subsys
nvme-subsys0 - NQN=nqn.2014-08.org.nvmexpress.discovery
\
+- nvme0 tcp traddr=192.168.56.1,trsvcid=8009,src_addr=192.168.56.101 live
Signed-off-by: Martin Belanger <martin.belanger@dell.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
commit 4cde03d82e2d0056d20fd5af6a264c7f5e6a3e76
Author: Daniel Wagner <dwagner@suse.de>
Date: Fri Jul 29 16:26:30 2022 +0200
nvme: consider also host_iface when checking ip options
It's perfectly fine to use the same traddr and trsvcid more than once
as long we use different host interface. This is used in setups where
the host has more than one interface but the target exposes only one
traddr/trsvcid combination.
Use the same acceptance rules for host_iface as we have for
host_traddr.
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Chao Leng <lengchao@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
```
|