diff options
author | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-04-27 18:24:20 +0000 |
---|---|---|
committer | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-04-27 18:24:20 +0000 |
commit | 483eb2f56657e8e7f419ab1a4fab8dce9ade8609 (patch) | |
tree | e5d88d25d870d5dedacb6bbdbe2a966086a0a5cf /src/spdk/doc/memory.md | |
parent | Initial commit. (diff) | |
download | ceph-upstream.tar.xz ceph-upstream.zip |
Adding upstream version 14.2.21.upstream/14.2.21upstream
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'src/spdk/doc/memory.md')
-rw-r--r-- | src/spdk/doc/memory.md | 118 |
1 files changed, 118 insertions, 0 deletions
diff --git a/src/spdk/doc/memory.md b/src/spdk/doc/memory.md new file mode 100644 index 00000000..abe81ffa --- /dev/null +++ b/src/spdk/doc/memory.md @@ -0,0 +1,118 @@ +# Memory Management for User Space Drivers {#memory} + +The following is an attempt to explain why all data buffers passed to SPDK must +be allocated using spdk_dma_malloc() or its siblings, and why SPDK relies on +DPDK's proven base functionality to implement memory management. + +Computing platforms generally carve physical memory up into 4KiB segments +called pages. They number the pages from 0 to N starting from the beginning of +addressable memory. Operating systems then overlay 4KiB virtual memory pages on +top of these physical pages using arbitrarily complex mappings. See +[Virtual Memory](https://en.wikipedia.org/wiki/Virtual_memory) for an overview. + +Physical memory is attached on channels, where each memory channel provides +some fixed amount of bandwidth. To optimize total memory bandwidth, the +physical addressing is often set up to automatically interleave between +channels. For instance, page 0 may be located on channel 0, page 1 on channel +1, page 2 on channel 2, etc. This is so that writing to memory sequentially +automatically utilizes all available channels. In practice, interleaving is +done at a much more granular level than a full page. + +Modern computing platforms support hardware acceleration for virtual to +physical translation inside of their Memory Management Unit (MMU). The MMU +often supports multiple different page sizes. On recent x86_64 systems, 4KiB, +2MiB, and 1GiB pages are supported. Typically, operating systems use 4KiB pages +by default. + +NVMe devices transfer data to and from system memory using Direct Memory Access +(DMA). Specifically, they send messages across the PCI bus requesting data +transfers. In the absence of an IOMMU, these messages contain *physical* memory +addresses. These data transfers happen without involving the CPU, and the MMU +is responsible for making access to memory coherent. + +NVMe devices also may place additional requirements on the physical layout of +memory for these transfers. The NVMe 1.0 specification requires all physical +memory to be describable by what is called a *PRP list*. To be described by a +PRP list, memory must have the following properties: + +* The memory is broken into physical 4KiB pages, which we'll call device pages. +* The first device page can be a partial page starting at any 4-byte aligned + address. It may extend up to the end of the current physical page, but not + beyond. +* If there is more than one device page, the first device page must end on a + physical 4KiB page boundary. +* The last device page begins on a physical 4KiB page boundary, but is not + required to end on a physical 4KiB page boundary. + +The specification allows for device pages to be other sizes than 4KiB, but all +known devices as of this writing use 4KiB. + +The NVMe 1.1 specification added support for fully flexible scatter gather lists, +but the feature is optional and most devices available today do not support it. + +User space drivers run in the context of a regular process and so have access +to virtual memory. In order to correctly program the device with physical +addresses, some method for address translation must be implemented. + +The simplest way to do this on Linux is to inspect `/proc/self/pagemap` from +within a process. This file contains the virtual address to physical address +mappings. As of Linux 4.0, accessing these mappings requires root privileges. +However, operating systems make absolutely no guarantee that the mapping of +virtual to physical pages is static. The operating system has no visibility +into whether a PCI device is directly transferring data to a set of physical +addresses, so great care must be taken to coordinate DMA requests with page +movement. When an operating system flags a page such that the virtual to +physical address mapping cannot be modified, this is called **pinning** the +page. + +There are several reasons why the virtual to physical mappings may change, too. +By far the most common reason is due to page swapping to disk. However, the +operating system also moves pages during a process called compaction, which +collapses identical virtual pages onto the same physical page to save memory. +Some operating systems are also capable of doing transparent memory +compression. It is also increasingly possible to hot-add additional memory, +which may trigger a physical address rebalance to optimize interleaving. + +POSIX provides the `mlock` call that forces a virtual page of memory to always +be backed by a physical page. In effect, this is disabling swapping. This does +*not* guarantee, however, that the virtual to physical address mapping is +static. The `mlock` call should not be confused with a **pin** call, and it +turns out that POSIX does not define an API for pinning memory. Therefore, the +mechanism to allocate pinned memory is operating system specific. + +SPDK relies on DPDK to allocate pinned memory. On Linux, DPDK does this by +allocating `hugepages` (by default, 2MiB). The Linux kernel treats hugepages +differently than regular 4KiB pages. Specifically, the operating system will +never change their physical location. This is not by intent, and so things +could change in future versions, but it is true today and has been for a number +of years (see the later section on the IOMMU for a future-proof solution). DPDK +goes through great pains to allocate hugepages such that it can string together +the longest runs of physical pages possible, such that it can accommodate +physically contiguous allocations larger than a single page. + +With this explanation, hopefully it is now clear why all data buffers passed to +SPDK must be allocated using spdk_dma_malloc() or its siblings. The buffers +must be allocated specifically so that they are pinned and so that physical +addresses are known. + +# IOMMU Support + +Many platforms contain an extra piece of hardware called an I/O Memory +Management Unit (IOMMU). An IOMMU is much like a regular MMU, except it +provides virtualized address spaces to peripheral devices (i.e. on the PCI +bus). The MMU knows about virtual to physical mappings per process on the +system, so the IOMMU associates a particular device with one of these mappings +and then allows the user to assign arbitrary *bus addresses* to virtual +addresses in their process. All DMA operations between the PCI device and +system memory are then translated through the IOMMU by converting the bus +address to a virtual address and then the virtual address to the physical +address. This allows the operating system to freely modify the virtual to +physical address mapping without breaking ongoing DMA operations. Linux +provides a device driver, `vfio-pci`, that allows a user to configure the IOMMU +with their current process. + +This is a future-proof, hardware-accelerated solution for performing DMA +operations into and out of a user space process and forms the long-term +foundation for SPDK and DPDK's memory management strategy. We highly recommend +that applications are deployed using vfio and the IOMMU enabled, which is fully +supported today. |