diff options
author | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-04-07 18:49:45 +0000 |
---|---|---|
committer | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-04-07 18:49:45 +0000 |
commit | 2c3c1048746a4622d8c89a29670120dc8fab93c4 (patch) | |
tree | 848558de17fb3008cdf4d861b01ac7781903ce39 /Documentation/driver-api/nvdimm | |
parent | Initial commit. (diff) | |
download | linux-2c3c1048746a4622d8c89a29670120dc8fab93c4.tar.xz linux-2c3c1048746a4622d8c89a29670120dc8fab93c4.zip |
Adding upstream version 6.1.76.upstream/6.1.76upstream
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to '')
-rw-r--r-- | Documentation/driver-api/nvdimm/btt.rst | 285 | ||||
-rw-r--r-- | Documentation/driver-api/nvdimm/firmware-activate.rst | 86 | ||||
-rw-r--r-- | Documentation/driver-api/nvdimm/index.rst | 13 | ||||
-rw-r--r-- | Documentation/driver-api/nvdimm/nvdimm.rst | 657 | ||||
-rw-r--r-- | Documentation/driver-api/nvdimm/security.rst | 143 |
5 files changed, 1184 insertions, 0 deletions
diff --git a/Documentation/driver-api/nvdimm/btt.rst b/Documentation/driver-api/nvdimm/btt.rst new file mode 100644 index 000000000..107395c04 --- /dev/null +++ b/Documentation/driver-api/nvdimm/btt.rst @@ -0,0 +1,285 @@ +============================= +BTT - Block Translation Table +============================= + + +1. Introduction +=============== + +Persistent memory based storage is able to perform IO at byte (or more +accurately, cache line) granularity. However, we often want to expose such +storage as traditional block devices. The block drivers for persistent memory +will do exactly this. However, they do not provide any atomicity guarantees. +Traditional SSDs typically provide protection against torn sectors in hardware, +using stored energy in capacitors to complete in-flight block writes, or perhaps +in firmware. We don't have this luxury with persistent memory - if a write is in +progress, and we experience a power failure, the block will contain a mix of old +and new data. Applications may not be prepared to handle such a scenario. + +The Block Translation Table (BTT) provides atomic sector update semantics for +persistent memory devices, so that applications that rely on sector writes not +being torn can continue to do so. The BTT manifests itself as a stacked block +device, and reserves a portion of the underlying storage for its metadata. At +the heart of it, is an indirection table that re-maps all the blocks on the +volume. It can be thought of as an extremely simple file system that only +provides atomic sector updates. + + +2. Static Layout +================ + +The underlying storage on which a BTT can be laid out is not limited in any way. +The BTT, however, splits the available space into chunks of up to 512 GiB, +called "Arenas". + +Each arena follows the same layout for its metadata, and all references in an +arena are internal to it (with the exception of one field that points to the +next arena). The following depicts the "On-disk" metadata layout:: + + + Backing Store +-------> Arena + +---------------+ | +------------------+ + | | | | Arena info block | + | Arena 0 +---+ | 4K | + | 512G | +------------------+ + | | | | + +---------------+ | | + | | | | + | Arena 1 | | Data Blocks | + | 512G | | | + | | | | + +---------------+ | | + | . | | | + | . | | | + | . | | | + | | | | + | | | | + +---------------+ +------------------+ + | | + | BTT Map | + | | + | | + +------------------+ + | | + | BTT Flog | + | | + +------------------+ + | Info block copy | + | 4K | + +------------------+ + + +3. Theory of Operation +====================== + + +a. The BTT Map +-------------- + +The map is a simple lookup/indirection table that maps an LBA to an internal +block. Each map entry is 32 bits. The two most significant bits are special +flags, and the remaining form the internal block number. + +======== ============================================================= +Bit Description +======== ============================================================= +31 - 30 Error and Zero flags - Used in the following way:: + + == == ==================================================== + 31 30 Description + == == ==================================================== + 0 0 Initial state. Reads return zeroes; Premap = Postmap + 0 1 Zero state: Reads return zeroes + 1 0 Error state: Reads fail; Writes clear 'E' bit + 1 1 Normal Block – has valid postmap + == == ==================================================== + +29 - 0 Mappings to internal 'postmap' blocks +======== ============================================================= + + +Some of the terminology that will be subsequently used: + +============ ================================================================ +External LBA LBA as made visible to upper layers. +ABA Arena Block Address - Block offset/number within an arena +Premap ABA The block offset into an arena, which was decided upon by range + checking the External LBA +Postmap ABA The block number in the "Data Blocks" area obtained after + indirection from the map +nfree The number of free blocks that are maintained at any given time. + This is the number of concurrent writes that can happen to the + arena. +============ ================================================================ + + +For example, after adding a BTT, we surface a disk of 1024G. We get a read for +the external LBA at 768G. This falls into the second arena, and of the 512G +worth of blocks that this arena contributes, this block is at 256G. Thus, the +premap ABA is 256G. We now refer to the map, and find out the mapping for block +'X' (256G) points to block 'Y', say '64'. Thus the postmap ABA is 64. + + +b. The BTT Flog +--------------- + +The BTT provides sector atomicity by making every write an "allocating write", +i.e. Every write goes to a "free" block. A running list of free blocks is +maintained in the form of the BTT flog. 'Flog' is a combination of the words +"free list" and "log". The flog contains 'nfree' entries, and an entry contains: + +======== ===================================================================== +lba The premap ABA that is being written to +old_map The old postmap ABA - after 'this' write completes, this will be a + free block. +new_map The new postmap ABA. The map will up updated to reflect this + lba->postmap_aba mapping, but we log it here in case we have to + recover. +seq Sequence number to mark which of the 2 sections of this flog entry is + valid/newest. It cycles between 01->10->11->01 (binary) under normal + operation, with 00 indicating an uninitialized state. +lba' alternate lba entry +old_map' alternate old postmap entry +new_map' alternate new postmap entry +seq' alternate sequence number. +======== ===================================================================== + +Each of the above fields is 32-bit, making one entry 32 bytes. Entries are also +padded to 64 bytes to avoid cache line sharing or aliasing. Flog updates are +done such that for any entry being written, it: +a. overwrites the 'old' section in the entry based on sequence numbers +b. writes the 'new' section such that the sequence number is written last. + + +c. The concept of lanes +----------------------- + +While 'nfree' describes the number of concurrent IOs an arena can process +concurrently, 'nlanes' is the number of IOs the BTT device as a whole can +process:: + + nlanes = min(nfree, num_cpus) + +A lane number is obtained at the start of any IO, and is used for indexing into +all the on-disk and in-memory data structures for the duration of the IO. If +there are more CPUs than the max number of available lanes, than lanes are +protected by spinlocks. + + +d. In-memory data structure: Read Tracking Table (RTT) +------------------------------------------------------ + +Consider a case where we have two threads, one doing reads and the other, +writes. We can hit a condition where the writer thread grabs a free block to do +a new IO, but the (slow) reader thread is still reading from it. In other words, +the reader consulted a map entry, and started reading the corresponding block. A +writer started writing to the same external LBA, and finished the write updating +the map for that external LBA to point to its new postmap ABA. At this point the +internal, postmap block that the reader is (still) reading has been inserted +into the list of free blocks. If another write comes in for the same LBA, it can +grab this free block, and start writing to it, causing the reader to read +incorrect data. To prevent this, we introduce the RTT. + +The RTT is a simple, per arena table with 'nfree' entries. Every reader inserts +into rtt[lane_number], the postmap ABA it is reading, and clears it after the +read is complete. Every writer thread, after grabbing a free block, checks the +RTT for its presence. If the postmap free block is in the RTT, it waits till the +reader clears the RTT entry, and only then starts writing to it. + + +e. In-memory data structure: map locks +-------------------------------------- + +Consider a case where two writer threads are writing to the same LBA. There can +be a race in the following sequence of steps:: + + free[lane] = map[premap_aba] + map[premap_aba] = postmap_aba + +Both threads can update their respective free[lane] with the same old, freed +postmap_aba. This has made the layout inconsistent by losing a free entry, and +at the same time, duplicating another free entry for two lanes. + +To solve this, we could have a single map lock (per arena) that has to be taken +before performing the above sequence, but we feel that could be too contentious. +Instead we use an array of (nfree) map_locks that is indexed by +(premap_aba modulo nfree). + + +f. Reconstruction from the Flog +------------------------------- + +On startup, we analyze the BTT flog to create our list of free blocks. We walk +through all the entries, and for each lane, of the set of two possible +'sections', we always look at the most recent one only (based on the sequence +number). The reconstruction rules/steps are simple: + +- Read map[log_entry.lba]. +- If log_entry.new matches the map entry, then log_entry.old is free. +- If log_entry.new does not match the map entry, then log_entry.new is free. + (This case can only be caused by power-fails/unsafe shutdowns) + + +g. Summarizing - Read and Write flows +------------------------------------- + +Read: + +1. Convert external LBA to arena number + pre-map ABA +2. Get a lane (and take lane_lock) +3. Read map to get the entry for this pre-map ABA +4. Enter post-map ABA into RTT[lane] +5. If TRIM flag set in map, return zeroes, and end IO (go to step 8) +6. If ERROR flag set in map, end IO with EIO (go to step 8) +7. Read data from this block +8. Remove post-map ABA entry from RTT[lane] +9. Release lane (and lane_lock) + +Write: + +1. Convert external LBA to Arena number + pre-map ABA +2. Get a lane (and take lane_lock) +3. Use lane to index into in-memory free list and obtain a new block, next flog + index, next sequence number +4. Scan the RTT to check if free block is present, and spin/wait if it is. +5. Write data to this free block +6. Read map to get the existing post-map ABA entry for this pre-map ABA +7. Write flog entry: [premap_aba / old postmap_aba / new postmap_aba / seq_num] +8. Write new post-map ABA into map. +9. Write old post-map entry into the free list +10. Calculate next sequence number and write into the free list entry +11. Release lane (and lane_lock) + + +4. Error Handling +================= + +An arena would be in an error state if any of the metadata is corrupted +irrecoverably, either due to a bug or a media error. The following conditions +indicate an error: + +- Info block checksum does not match (and recovering from the copy also fails) +- All internal available blocks are not uniquely and entirely addressed by the + sum of mapped blocks and free blocks (from the BTT flog). +- Rebuilding free list from the flog reveals missing/duplicate/impossible + entries +- A map entry is out of bounds + +If any of these error conditions are encountered, the arena is put into a read +only state using a flag in the info block. + + +5. Usage +======== + +The BTT can be set up on any disk (namespace) exposed by the libnvdimm subsystem +(pmem, or blk mode). The easiest way to set up such a namespace is using the +'ndctl' utility [1]: + +For example, the ndctl command line to setup a btt with a 4k sector size is:: + + ndctl create-namespace -f -e namespace0.0 -m sector -l 4k + +See ndctl create-namespace --help for more options. + +[1]: https://github.com/pmem/ndctl diff --git a/Documentation/driver-api/nvdimm/firmware-activate.rst b/Documentation/driver-api/nvdimm/firmware-activate.rst new file mode 100644 index 000000000..7ee7decbb --- /dev/null +++ b/Documentation/driver-api/nvdimm/firmware-activate.rst @@ -0,0 +1,86 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================================== +NVDIMM Runtime Firmware Activation +================================== + +Some persistent memory devices run a firmware locally on the device / +"DIMM" to perform tasks like media management, capacity provisioning, +and health monitoring. The process of updating that firmware typically +involves a reboot because it has implications for in-flight memory +transactions. However, reboots are disruptive and at least the Intel +persistent memory platform implementation, described by the Intel ACPI +DSM specification [1], has added support for activating firmware at +runtime. + +A native sysfs interface is implemented in libnvdimm to allow platform +to advertise and control their local runtime firmware activation +capability. + +The libnvdimm bus object, ndbusX, implements an ndbusX/firmware/activate +attribute that shows the state of the firmware activation as one of 'idle', +'armed', 'overflow', and 'busy'. + +- idle: + No devices are set / armed to activate firmware + +- armed: + At least one device is armed + +- busy: + In the busy state armed devices are in the process of transitioning + back to idle and completing an activation cycle. + +- overflow: + If the platform has a concept of incremental work needed to perform + the activation it could be the case that too many DIMMs are armed for + activation. In that scenario the potential for firmware activation to + timeout is indicated by the 'overflow' state. + +The 'ndbusX/firmware/activate' property can be written with a value of +either 'live', or 'quiesce'. A value of 'quiesce' triggers the kernel to +run firmware activation from within the equivalent of the hibernation +'freeze' state where drivers and applications are notified to stop their +modifications of system memory. A value of 'live' attempts +firmware activation without this hibernation cycle. The +'ndbusX/firmware/activate' property will be elided completely if no +firmware activation capability is detected. + +Another property 'ndbusX/firmware/capability' indicates a value of +'live' or 'quiesce', where 'live' indicates that the firmware +does not require or inflict any quiesce period on the system to update +firmware. A capability value of 'quiesce' indicates that firmware does +expect and injects a quiet period for the memory controller, but 'live' +may still be written to 'ndbusX/firmware/activate' as an override to +assume the risk of racing firmware update with in-flight device and +application activity. The 'ndbusX/firmware/capability' property will be +elided completely if no firmware activation capability is detected. + +The libnvdimm memory-device / DIMM object, nmemX, implements +'nmemX/firmware/activate' and 'nmemX/firmware/result' attributes to +communicate the per-device firmware activation state. Similar to the +'ndbusX/firmware/activate' attribute, the 'nmemX/firmware/activate' +attribute indicates 'idle', 'armed', or 'busy'. The state transitions +from 'armed' to 'idle' when the system is prepared to activate firmware, +firmware staged + state set to armed, and 'ndbusX/firmware/activate' is +triggered. After that activation event the nmemX/firmware/result +attribute reflects the state of the last activation as one of: + +- none: + No runtime activation triggered since the last time the device was reset + +- success: + The last runtime activation completed successfully. + +- fail: + The last runtime activation failed for device-specific reasons. + +- not_staged: + The last runtime activation failed due to a sequencing error of the + firmware image not being staged. + +- need_reset: + Runtime firmware activation failed, but the firmware can still be + activated via the legacy method of power-cycling the system. + +[1]: https://docs.pmem.io/persistent-memory/ diff --git a/Documentation/driver-api/nvdimm/index.rst b/Documentation/driver-api/nvdimm/index.rst new file mode 100644 index 000000000..5863bd04f --- /dev/null +++ b/Documentation/driver-api/nvdimm/index.rst @@ -0,0 +1,13 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=================================== +Non-Volatile Memory Device (NVDIMM) +=================================== + +.. toctree:: + :maxdepth: 1 + + nvdimm + btt + security + firmware-activate diff --git a/Documentation/driver-api/nvdimm/nvdimm.rst b/Documentation/driver-api/nvdimm/nvdimm.rst new file mode 100644 index 000000000..be8587a55 --- /dev/null +++ b/Documentation/driver-api/nvdimm/nvdimm.rst @@ -0,0 +1,657 @@ +=============================== +LIBNVDIMM: Non-Volatile Devices +=============================== + +libnvdimm - kernel / libndctl - userspace helper library + +nvdimm@lists.linux.dev + +Version 13 + +.. contents: + + Glossary + Overview + Supporting Documents + Git Trees + LIBNVDIMM PMEM + PMEM-REGIONs, Atomic Sectors, and DAX + Example NVDIMM Platform + LIBNVDIMM Kernel Device Model and LIBNDCTL Userspace API + LIBNDCTL: Context + libndctl: instantiate a new library context example + LIBNVDIMM/LIBNDCTL: Bus + libnvdimm: control class device in /sys/class + libnvdimm: bus + libndctl: bus enumeration example + LIBNVDIMM/LIBNDCTL: DIMM (NMEM) + libnvdimm: DIMM (NMEM) + libndctl: DIMM enumeration example + LIBNVDIMM/LIBNDCTL: Region + libnvdimm: region + libndctl: region enumeration example + Why Not Encode the Region Type into the Region Name? + How Do I Determine the Major Type of a Region? + LIBNVDIMM/LIBNDCTL: Namespace + libnvdimm: namespace + libndctl: namespace enumeration example + libndctl: namespace creation example + Why the Term "namespace"? + LIBNVDIMM/LIBNDCTL: Block Translation Table "btt" + libnvdimm: btt layout + libndctl: btt creation example + Summary LIBNDCTL Diagram + + +Glossary +======== + +PMEM: + A system-physical-address range where writes are persistent. A + block device composed of PMEM is capable of DAX. A PMEM address range + may span an interleave of several DIMMs. + +DPA: + DIMM Physical Address, is a DIMM-relative offset. With one DIMM in + the system there would be a 1:1 system-physical-address:DPA association. + Once more DIMMs are added a memory controller interleave must be + decoded to determine the DPA associated with a given + system-physical-address. + +DAX: + File system extensions to bypass the page cache and block layer to + mmap persistent memory, from a PMEM block device, directly into a + process address space. + +DSM: + Device Specific Method: ACPI method to control specific + device - in this case the firmware. + +DCR: + NVDIMM Control Region Structure defined in ACPI 6 Section 5.2.25.5. + It defines a vendor-id, device-id, and interface format for a given DIMM. + +BTT: + Block Translation Table: Persistent memory is byte addressable. + Existing software may have an expectation that the power-fail-atomicity + of writes is at least one sector, 512 bytes. The BTT is an indirection + table with atomic update semantics to front a PMEM block device + driver and present arbitrary atomic sector sizes. + +LABEL: + Metadata stored on a DIMM device that partitions and identifies + (persistently names) capacity allocated to different PMEM namespaces. It + also indicates whether an address abstraction like a BTT is applied to + the namepsace. Note that traditional partition tables, GPT/MBR, are + layered on top of a PMEM namespace, or an address abstraction like BTT + if present, but partition support is deprecated going forward. + + +Overview +======== + +The LIBNVDIMM subsystem provides support for PMEM described by platform +firmware or a device driver. On ACPI based systems the platform firmware +conveys persistent memory resource via the ACPI NFIT "NVDIMM Firmware +Interface Table" in ACPI 6. While the LIBNVDIMM subsystem implementation +is generic and supports pre-NFIT platforms, it was guided by the +superset of capabilities need to support this ACPI 6 definition for +NVDIMM resources. The original implementation supported the +block-window-aperture capability described in the NFIT, but that support +has since been abandoned and never shipped in a product. + +Supporting Documents +-------------------- + +ACPI 6: + https://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf +NVDIMM Namespace: + https://pmem.io/documents/NVDIMM_Namespace_Spec.pdf +DSM Interface Example: + https://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf +Driver Writer's Guide: + https://pmem.io/documents/NVDIMM_Driver_Writers_Guide.pdf + +Git Trees +--------- + +LIBNVDIMM: + https://git.kernel.org/cgit/linux/kernel/git/nvdimm/nvdimm.git +LIBNDCTL: + https://github.com/pmem/ndctl.git + + +LIBNVDIMM PMEM +============== + +Prior to the arrival of the NFIT, non-volatile memory was described to a +system in various ad-hoc ways. Usually only the bare minimum was +provided, namely, a single system-physical-address range where writes +are expected to be durable after a system power loss. Now, the NFIT +specification standardizes not only the description of PMEM, but also +platform message-passing entry points for control and configuration. + +PMEM (nd_pmem.ko): Drives a system-physical-address range. This range is +contiguous in system memory and may be interleaved (hardware memory controller +striped) across multiple DIMMs. When interleaved the platform may optionally +provide details of which DIMMs are participating in the interleave. + +It is worth noting that when the labeling capability is detected (a EFI +namespace label index block is found), then no block device is created +by default as userspace needs to do at least one allocation of DPA to +the PMEM range. In contrast ND_NAMESPACE_IO ranges, once registered, +can be immediately attached to nd_pmem. This latter mode is called +label-less or "legacy". + +PMEM-REGIONs, Atomic Sectors, and DAX +------------------------------------- + +For the cases where an application or filesystem still needs atomic sector +update guarantees it can register a BTT on a PMEM device or partition. See +LIBNVDIMM/NDCTL: Block Translation Table "btt" + + +Example NVDIMM Platform +======================= + +For the remainder of this document the following diagram will be +referenced for any example sysfs layouts:: + + + (a) (b) DIMM + +-------------------+--------+--------+--------+ + +------+ | pm0.0 | free | pm1.0 | free | 0 + | imc0 +--+- - - region0- - - +--------+ +--------+ + +--+---+ | pm0.0 | free | pm1.0 | free | 1 + | +-------------------+--------v v--------+ + +--+---+ | | + | cpu0 | region1 + +--+---+ | | + | +----------------------------^ ^--------+ + +--+---+ | free | pm1.0 | free | 2 + | imc1 +--+----------------------------| +--------+ + +------+ | free | pm1.0 | free | 3 + +----------------------------+--------+--------+ + +In this platform we have four DIMMs and two memory controllers in one +socket. Each PMEM interleave set is identified by a region device with +a dynamically assigned id. + + 1. The first portion of DIMM0 and DIMM1 are interleaved as REGION0. A + single PMEM namespace is created in the REGION0-SPA-range that spans most + of DIMM0 and DIMM1 with a user-specified name of "pm0.0". Some of that + interleaved system-physical-address range is left free for + another PMEM namespace to be defined. + + 2. In the last portion of DIMM0 and DIMM1 we have an interleaved + system-physical-address range, REGION1, that spans those two DIMMs as + well as DIMM2 and DIMM3. Some of REGION1 is allocated to a PMEM namespace + named "pm1.0". + + This bus is provided by the kernel under the device + /sys/devices/platform/nfit_test.0 when the nfit_test.ko module from + tools/testing/nvdimm is loaded. This module is a unit test for + LIBNVDIMM and the acpi_nfit.ko driver. + + +LIBNVDIMM Kernel Device Model and LIBNDCTL Userspace API +======================================================== + +What follows is a description of the LIBNVDIMM sysfs layout and a +corresponding object hierarchy diagram as viewed through the LIBNDCTL +API. The example sysfs paths and diagrams are relative to the Example +NVDIMM Platform which is also the LIBNVDIMM bus used in the LIBNDCTL unit +test. + +LIBNDCTL: Context +----------------- + +Every API call in the LIBNDCTL library requires a context that holds the +logging parameters and other library instance state. The library is +based on the libabc template: + + https://git.kernel.org/cgit/linux/kernel/git/kay/libabc.git + +LIBNDCTL: instantiate a new library context example +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +:: + + struct ndctl_ctx *ctx; + + if (ndctl_new(&ctx) == 0) + return ctx; + else + return NULL; + +LIBNVDIMM/LIBNDCTL: Bus +----------------------- + +A bus has a 1:1 relationship with an NFIT. The current expectation for +ACPI based systems is that there is only ever one platform-global NFIT. +That said, it is trivial to register multiple NFITs, the specification +does not preclude it. The infrastructure supports multiple busses and +we use this capability to test multiple NFIT configurations in the unit +test. + +LIBNVDIMM: control class device in /sys/class +--------------------------------------------- + +This character device accepts DSM messages to be passed to DIMM +identified by its NFIT handle:: + + /sys/class/nd/ndctl0 + |-- dev + |-- device -> ../../../ndbus0 + |-- subsystem -> ../../../../../../../class/nd + + + +LIBNVDIMM: bus +-------------- + +:: + + struct nvdimm_bus *nvdimm_bus_register(struct device *parent, + struct nvdimm_bus_descriptor *nfit_desc); + +:: + + /sys/devices/platform/nfit_test.0/ndbus0 + |-- commands + |-- nd + |-- nfit + |-- nmem0 + |-- nmem1 + |-- nmem2 + |-- nmem3 + |-- power + |-- provider + |-- region0 + |-- region1 + |-- region2 + |-- region3 + |-- region4 + |-- region5 + |-- uevent + `-- wait_probe + +LIBNDCTL: bus enumeration example +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Find the bus handle that describes the bus from Example NVDIMM Platform:: + + static struct ndctl_bus *get_bus_by_provider(struct ndctl_ctx *ctx, + const char *provider) + { + struct ndctl_bus *bus; + + ndctl_bus_foreach(ctx, bus) + if (strcmp(provider, ndctl_bus_get_provider(bus)) == 0) + return bus; + + return NULL; + } + + bus = get_bus_by_provider(ctx, "nfit_test.0"); + + +LIBNVDIMM/LIBNDCTL: DIMM (NMEM) +------------------------------- + +The DIMM device provides a character device for sending commands to +hardware, and it is a container for LABELs. If the DIMM is defined by +NFIT then an optional 'nfit' attribute sub-directory is available to add +NFIT-specifics. + +Note that the kernel device name for "DIMMs" is "nmemX". The NFIT +describes these devices via "Memory Device to System Physical Address +Range Mapping Structure", and there is no requirement that they actually +be physical DIMMs, so we use a more generic name. + +LIBNVDIMM: DIMM (NMEM) +^^^^^^^^^^^^^^^^^^^^^^ + +:: + + struct nvdimm *nvdimm_create(struct nvdimm_bus *nvdimm_bus, void *provider_data, + const struct attribute_group **groups, unsigned long flags, + unsigned long *dsm_mask); + +:: + + /sys/devices/platform/nfit_test.0/ndbus0 + |-- nmem0 + | |-- available_slots + | |-- commands + | |-- dev + | |-- devtype + | |-- driver -> ../../../../../bus/nd/drivers/nvdimm + | |-- modalias + | |-- nfit + | | |-- device + | | |-- format + | | |-- handle + | | |-- phys_id + | | |-- rev_id + | | |-- serial + | | `-- vendor + | |-- state + | |-- subsystem -> ../../../../../bus/nd + | `-- uevent + |-- nmem1 + [..] + + +LIBNDCTL: DIMM enumeration example +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Note, in this example we are assuming NFIT-defined DIMMs which are +identified by an "nfit_handle" a 32-bit value where: + + - Bit 3:0 DIMM number within the memory channel + - Bit 7:4 memory channel number + - Bit 11:8 memory controller ID + - Bit 15:12 socket ID (within scope of a Node controller if node + controller is present) + - Bit 27:16 Node Controller ID + - Bit 31:28 Reserved + +:: + + static struct ndctl_dimm *get_dimm_by_handle(struct ndctl_bus *bus, + unsigned int handle) + { + struct ndctl_dimm *dimm; + + ndctl_dimm_foreach(bus, dimm) + if (ndctl_dimm_get_handle(dimm) == handle) + return dimm; + + return NULL; + } + + #define DIMM_HANDLE(n, s, i, c, d) \ + (((n & 0xfff) << 16) | ((s & 0xf) << 12) | ((i & 0xf) << 8) \ + | ((c & 0xf) << 4) | (d & 0xf)) + + dimm = get_dimm_by_handle(bus, DIMM_HANDLE(0, 0, 0, 0, 0)); + +LIBNVDIMM/LIBNDCTL: Region +-------------------------- + +A generic REGION device is registered for each PMEM interleave-set / +range. Per the example there are 2 PMEM regions on the "nfit_test.0" +bus. The primary role of regions are to be a container of "mappings". A +mapping is a tuple of <DIMM, DPA-start-offset, length>. + +LIBNVDIMM provides a built-in driver for REGION devices. This driver +is responsible for all parsing LABELs, if present, and then emitting NAMESPACE +devices for the nd_pmem driver to consume. + +In addition to the generic attributes of "mapping"s, "interleave_ways" +and "size" the REGION device also exports some convenience attributes. +"nstype" indicates the integer type of namespace-device this region +emits, "devtype" duplicates the DEVTYPE variable stored by udev at the +'add' event, "modalias" duplicates the MODALIAS variable stored by udev +at the 'add' event, and finally, the optional "spa_index" is provided in +the case where the region is defined by a SPA. + +LIBNVDIMM: region:: + + struct nd_region *nvdimm_pmem_region_create(struct nvdimm_bus *nvdimm_bus, + struct nd_region_desc *ndr_desc); + +:: + + /sys/devices/platform/nfit_test.0/ndbus0 + |-- region0 + | |-- available_size + | |-- btt0 + | |-- btt_seed + | |-- devtype + | |-- driver -> ../../../../../bus/nd/drivers/nd_region + | |-- init_namespaces + | |-- mapping0 + | |-- mapping1 + | |-- mappings + | |-- modalias + | |-- namespace0.0 + | |-- namespace_seed + | |-- numa_node + | |-- nfit + | | `-- spa_index + | |-- nstype + | |-- set_cookie + | |-- size + | |-- subsystem -> ../../../../../bus/nd + | `-- uevent + |-- region1 + [..] + +LIBNDCTL: region enumeration example +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Sample region retrieval routines based on NFIT-unique data like +"spa_index" (interleave set id). + +:: + + static struct ndctl_region *get_pmem_region_by_spa_index(struct ndctl_bus *bus, + unsigned int spa_index) + { + struct ndctl_region *region; + + ndctl_region_foreach(bus, region) { + if (ndctl_region_get_type(region) != ND_DEVICE_REGION_PMEM) + continue; + if (ndctl_region_get_spa_index(region) == spa_index) + return region; + } + return NULL; + } + + +LIBNVDIMM/LIBNDCTL: Namespace +----------------------------- + +A REGION, after resolving DPA aliasing and LABEL specified boundaries, surfaces +one or more "namespace" devices. The arrival of a "namespace" device currently +triggers the nd_pmem driver to load and register a disk/block device. + +LIBNVDIMM: namespace +^^^^^^^^^^^^^^^^^^^^ + +Here is a sample layout from the 2 major types of NAMESPACE where namespace0.0 +represents DIMM-info-backed PMEM (note that it has a 'uuid' attribute), and +namespace1.0 represents an anonymous PMEM namespace (note that has no 'uuid' +attribute due to not support a LABEL) + +:: + + /sys/devices/platform/nfit_test.0/ndbus0/region0/namespace0.0 + |-- alt_name + |-- devtype + |-- dpa_extents + |-- force_raw + |-- modalias + |-- numa_node + |-- resource + |-- size + |-- subsystem -> ../../../../../../bus/nd + |-- type + |-- uevent + `-- uuid + /sys/devices/platform/nfit_test.1/ndbus1/region1/namespace1.0 + |-- block + | `-- pmem0 + |-- devtype + |-- driver -> ../../../../../../bus/nd/drivers/pmem + |-- force_raw + |-- modalias + |-- numa_node + |-- resource + |-- size + |-- subsystem -> ../../../../../../bus/nd + |-- type + `-- uevent + +LIBNDCTL: namespace enumeration example +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +Namespaces are indexed relative to their parent region, example below. +These indexes are mostly static from boot to boot, but subsystem makes +no guarantees in this regard. For a static namespace identifier use its +'uuid' attribute. + +:: + + static struct ndctl_namespace + *get_namespace_by_id(struct ndctl_region *region, unsigned int id) + { + struct ndctl_namespace *ndns; + + ndctl_namespace_foreach(region, ndns) + if (ndctl_namespace_get_id(ndns) == id) + return ndns; + + return NULL; + } + +LIBNDCTL: namespace creation example +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Idle namespaces are automatically created by the kernel if a given +region has enough available capacity to create a new namespace. +Namespace instantiation involves finding an idle namespace and +configuring it. For the most part the setting of namespace attributes +can occur in any order, the only constraint is that 'uuid' must be set +before 'size'. This enables the kernel to track DPA allocations +internally with a static identifier:: + + static int configure_namespace(struct ndctl_region *region, + struct ndctl_namespace *ndns, + struct namespace_parameters *parameters) + { + char devname[50]; + + snprintf(devname, sizeof(devname), "namespace%d.%d", + ndctl_region_get_id(region), paramaters->id); + + ndctl_namespace_set_alt_name(ndns, devname); + /* 'uuid' must be set prior to setting size! */ + ndctl_namespace_set_uuid(ndns, paramaters->uuid); + ndctl_namespace_set_size(ndns, paramaters->size); + /* unlike pmem namespaces, blk namespaces have a sector size */ + if (parameters->lbasize) + ndctl_namespace_set_sector_size(ndns, parameters->lbasize); + ndctl_namespace_enable(ndns); + } + + +Why the Term "namespace"? +^^^^^^^^^^^^^^^^^^^^^^^^^ + + 1. Why not "volume" for instance? "volume" ran the risk of confusing + ND (libnvdimm subsystem) to a volume manager like device-mapper. + + 2. The term originated to describe the sub-devices that can be created + within a NVME controller (see the nvme specification: + https://www.nvmexpress.org/specifications/), and NFIT namespaces are + meant to parallel the capabilities and configurability of + NVME-namespaces. + + +LIBNVDIMM/LIBNDCTL: Block Translation Table "btt" +------------------------------------------------- + +A BTT (design document: https://pmem.io/2014/09/23/btt.html) is a +personality driver for a namespace that fronts entire namespace as an +'address abstraction'. + +LIBNVDIMM: btt layout +^^^^^^^^^^^^^^^^^^^^^ + +Every region will start out with at least one BTT device which is the +seed device. To activate it set the "namespace", "uuid", and +"sector_size" attributes and then bind the device to the nd_pmem or +nd_blk driver depending on the region type:: + + /sys/devices/platform/nfit_test.1/ndbus0/region0/btt0/ + |-- namespace + |-- delete + |-- devtype + |-- modalias + |-- numa_node + |-- sector_size + |-- subsystem -> ../../../../../bus/nd + |-- uevent + `-- uuid + +LIBNDCTL: btt creation example +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Similar to namespaces an idle BTT device is automatically created per +region. Each time this "seed" btt device is configured and enabled a new +seed is created. Creating a BTT configuration involves two steps of +finding and idle BTT and assigning it to consume a namespace. + +:: + + static struct ndctl_btt *get_idle_btt(struct ndctl_region *region) + { + struct ndctl_btt *btt; + + ndctl_btt_foreach(region, btt) + if (!ndctl_btt_is_enabled(btt) + && !ndctl_btt_is_configured(btt)) + return btt; + + return NULL; + } + + static int configure_btt(struct ndctl_region *region, + struct btt_parameters *parameters) + { + btt = get_idle_btt(region); + + ndctl_btt_set_uuid(btt, parameters->uuid); + ndctl_btt_set_sector_size(btt, parameters->sector_size); + ndctl_btt_set_namespace(btt, parameters->ndns); + /* turn off raw mode device */ + ndctl_namespace_disable(parameters->ndns); + /* turn on btt access */ + ndctl_btt_enable(btt); + } + +Once instantiated a new inactive btt seed device will appear underneath +the region. + +Once a "namespace" is removed from a BTT that instance of the BTT device +will be deleted or otherwise reset to default values. This deletion is +only at the device model level. In order to destroy a BTT the "info +block" needs to be destroyed. Note, that to destroy a BTT the media +needs to be written in raw mode. By default, the kernel will autodetect +the presence of a BTT and disable raw mode. This autodetect behavior +can be suppressed by enabling raw mode for the namespace via the +ndctl_namespace_set_raw_mode() API. + + +Summary LIBNDCTL Diagram +------------------------ + +For the given example above, here is the view of the objects as seen by the +LIBNDCTL API:: + + +---+ + |CTX| + +-+-+ + | + +-------+ | + | DIMM0 <-+ | +---------+ +--------------+ +---------------+ + +-------+ | | +-> REGION0 +---> NAMESPACE0.0 +--> PMEM8 "pm0.0" | + | DIMM1 <-+ +-v--+ | +---------+ +--------------+ +---------------+ + +-------+ +-+BUS0+-| +---------+ +--------------+ +----------------------+ + | DIMM2 <-+ +----+ +-> REGION1 +---> NAMESPACE1.0 +--> PMEM6 "pm1.0" | BTT1 | + +-------+ | | +---------+ +--------------+ +---------------+------+ + | DIMM3 <-+ + +-------+ diff --git a/Documentation/driver-api/nvdimm/security.rst b/Documentation/driver-api/nvdimm/security.rst new file mode 100644 index 000000000..7aab71524 --- /dev/null +++ b/Documentation/driver-api/nvdimm/security.rst @@ -0,0 +1,143 @@ +=============== +NVDIMM Security +=============== + +1. Introduction +--------------- + +With the introduction of Intel Device Specific Methods (DSM) v1.8 +specification [1], security DSMs are introduced. The spec added the following +security DSMs: "get security state", "set passphrase", "disable passphrase", +"unlock unit", "freeze lock", "secure erase", and "overwrite". A security_ops +data structure has been added to struct dimm in order to support the security +operations and generic APIs are exposed to allow vendor neutral operations. + +2. Sysfs Interface +------------------ +The "security" sysfs attribute is provided in the nvdimm sysfs directory. For +example: +/sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/nmem0/security + +The "show" attribute of that attribute will display the security state for +that DIMM. The following states are available: disabled, unlocked, locked, +frozen, and overwrite. If security is not supported, the sysfs attribute +will not be visible. + +The "store" attribute takes several commands when it is being written to +in order to support some of the security functionalities: +update <old_keyid> <new_keyid> - enable or update passphrase. +disable <keyid> - disable enabled security and remove key. +freeze - freeze changing of security states. +erase <keyid> - delete existing user encryption key. +overwrite <keyid> - wipe the entire nvdimm. +master_update <keyid> <new_keyid> - enable or update master passphrase. +master_erase <keyid> - delete existing user encryption key. + +3. Key Management +----------------- + +The key is associated to the payload by the DIMM id. For example: +# cat /sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/nmem0/nfit/id +8089-a2-1740-00000133 +The DIMM id would be provided along with the key payload (passphrase) to +the kernel. + +The security keys are managed on the basis of a single key per DIMM. The +key "passphrase" is expected to be 32bytes long. This is similar to the ATA +security specification [2]. A key is initially acquired via the request_key() +kernel API call during nvdimm unlock. It is up to the user to make sure that +all the keys are in the kernel user keyring for unlock. + +A nvdimm encrypted-key of format enc32 has the description format of: +nvdimm:<bus-provider-specific-unique-id> + +See file ``Documentation/security/keys/trusted-encrypted.rst`` for creating +encrypted-keys of enc32 format. TPM usage with a master trusted key is +preferred for sealing the encrypted-keys. + +4. Unlocking +------------ +When the DIMMs are being enumerated by the kernel, the kernel will attempt to +retrieve the key from the kernel user keyring. This is the only time +a locked DIMM can be unlocked. Once unlocked, the DIMM will remain unlocked +until reboot. Typically an entity (i.e. shell script) will inject all the +relevant encrypted-keys into the kernel user keyring during the initramfs phase. +This provides the unlock function access to all the related keys that contain +the passphrase for the respective nvdimms. It is also recommended that the +keys are injected before libnvdimm is loaded by modprobe. + +5. Update +--------- +When doing an update, it is expected that the existing key is removed from +the kernel user keyring and reinjected as different (old) key. It's irrelevant +what the key description is for the old key since we are only interested in the +keyid when doing the update operation. It is also expected that the new key +is injected with the description format described from earlier in this +document. The update command written to the sysfs attribute will be with +the format: +update <old keyid> <new keyid> + +If there is no old keyid due to a security enabling, then a 0 should be +passed in. + +6. Freeze +--------- +The freeze operation does not require any keys. The security config can be +frozen by a user with root privelege. + +7. Disable +---------- +The security disable command format is: +disable <keyid> + +An key with the current passphrase payload that is tied to the nvdimm should be +in the kernel user keyring. + +8. Secure Erase +--------------- +The command format for doing a secure erase is: +erase <keyid> + +An key with the current passphrase payload that is tied to the nvdimm should be +in the kernel user keyring. + +9. Overwrite +------------ +The command format for doing an overwrite is: +overwrite <keyid> + +Overwrite can be done without a key if security is not enabled. A key serial +of 0 can be passed in to indicate no key. + +The sysfs attribute "security" can be polled to wait on overwrite completion. +Overwrite can last tens of minutes or more depending on nvdimm size. + +An encrypted-key with the current user passphrase that is tied to the nvdimm +should be injected and its keyid should be passed in via sysfs. + +10. Master Update +----------------- +The command format for doing a master update is: +update <old keyid> <new keyid> + +The operating mechanism for master update is identical to update except the +master passphrase key is passed to the kernel. The master passphrase key +is just another encrypted-key. + +This command is only available when security is disabled. + +11. Master Erase +---------------- +The command format for doing a master erase is: +master_erase <current keyid> + +This command has the same operating mechanism as erase except the master +passphrase key is passed to the kernel. The master passphrase key is just +another encrypted-key. + +This command is only available when the master security is enabled, indicated +by the extended security status. + +[1]: https://pmem.io/documents/NVDIMM_DSM_Interface-V1.8.pdf + +[2]: http://www.t13.org/documents/UploadedDocuments/docs2006/e05179r4-ACS-SecurityClarifications.pdf |