Diffstat (limited to 'Documentation/powerpc/pci_iov_resource_on_powernv.rst')
-rw-r--r-- | Documentation/powerpc/pci_iov_resource_on_powernv.rst | 312 |
1 file changed, 0 insertions, 312 deletions
diff --git a/Documentation/powerpc/pci_iov_resource_on_powernv.rst b/Documentation/powerpc/pci_iov_resource_on_powernv.rst
deleted file mode 100644
index f5a5793e16..0000000000
--- a/Documentation/powerpc/pci_iov_resource_on_powernv.rst
+++ /dev/null
@@ -1,312 +0,0 @@

===================================================
PCI Express I/O Virtualization Resource on PowerNV
===================================================

Wei Yang <weiyang@linux.vnet.ibm.com>

Benjamin Herrenschmidt <benh@au1.ibm.com>

Bjorn Helgaas <bhelgaas@google.com>

26 Aug 2014

This document describes the hardware requirements for PCI MMIO resource
sizing and assignment on PowerKVM and how the generic PCI code handles
them. The first two sections describe the concept of Partitionable
Endpoints and their implementation on P8 (IODA2). The last two sections
discuss the considerations for enabling SR-IOV on IODA2.

1. Introduction to Partitionable Endpoints
==========================================

A Partitionable Endpoint (PE) is a way to group the various resources
associated with a device or a set of devices to provide isolation between
partitions (i.e., filtering of DMA, MSIs etc.) and to provide a mechanism
to freeze a device that is causing errors in order to limit the
possibility of propagation of bad data.

There is thus, in HW, a table of PE states that contains a pair of
"frozen" state bits (one for MMIO and one for DMA; they get set together
but can be cleared independently) for each PE.

When a PE is frozen, all stores in any direction are dropped and all
loads return all 1's. MSIs are also blocked. There's a bit more state
that captures things like the details of the error that caused the
freeze etc., but that's not critical.

The interesting part is how the various PCIe transactions (MMIO, DMA,
...) are matched to their corresponding PEs.

The following section provides a rough description of what we have on P8
(IODA2). Keep in mind that this is all per PHB (PCI host bridge). Each
PHB is a completely separate HW entity that replicates the entire logic,
so it has its own set of PEs, etc.

2. Implementation of Partitionable Endpoints on P8 (IODA2)
==========================================================

P8 supports up to 256 Partitionable Endpoints per PHB.

* Inbound

  For DMA, MSIs and inbound PCIe error messages, we have a table (in
  memory, but accessed in HW by the chip) that provides a direct
  correspondence between a PCIe RID (bus/dev/fn) and a PE number. We
  call this the RTT; a sketch of the lookup follows the list below.

  - For DMA we then provide an entire address space for each PE that can
    contain two "windows", depending on the value of PCI address bit 59.
    Each window can be configured to be remapped via a "TCE table" (IOMMU
    translation table), which has various configurable characteristics
    not described here.

  - For MSIs, we have two windows in the address space (one at the top of
    the 32-bit space and one much higher) which, via a combination of the
    address and MSI value, will result in one of the 2048 interrupts per
    bridge being triggered. There's a PE# in the interrupt controller
    descriptor table as well, which is compared with the PE# obtained
    from the RTT to "authorize" the device to emit that specific
    interrupt.

  - Error messages just use the RTT.
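  As an illustration only (the real RTT lives in system memory and is
  walked by the PHB hardware, not by software), the RID-to-PE lookup can
  be modeled as a flat array indexed by the 16-bit RID; all names here
  are hypothetical::

    #include <stdint.h>

    #define RTT_ENTRIES   (1 << 16)   /* one entry per possible RID */
    #define IODA2_MAX_PES 256         /* PEs per PHB on P8/IODA2 */

    static uint8_t rtt[RTT_ENTRIES];  /* RID -> PE#, 0 .. IODA2_MAX_PES - 1 */

    /* Group a function of a device into the PE that isolates it. */
    static void rtt_map(uint8_t bus, uint8_t devfn, uint8_t pe)
    {
            rtt[((uint16_t)bus << 8) | devfn] = pe;
    }

    /* What the HW conceptually does for each DMA, MSI or error message. */
    static uint8_t rtt_lookup(uint16_t rid)
    {
            return rtt[rid];
    }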
* Outbound. That's where the tricky part is.

  Like other PCI host bridges, the Power8 IODA2 PHB supports "windows"
  from the CPU address space to the PCI address space. There is one M32
  window and sixteen M64 windows. They have different characteristics.
  First, what they have in common: they forward a configurable portion
  of the CPU address space to the PCIe bus, and they must be
  power-of-two sized and naturally aligned. The rest is different:

  - The M32 window:

    * Is limited to 4GB in size.

    * Drops the top bits of the address (above the size) and replaces
      them with a configurable value. This is typically used to generate
      32-bit PCIe accesses. We configure that window at boot from FW and
      don't touch it from Linux; it's usually set to forward a 2GB
      portion of address space from the CPU to PCIe -
      0x8000_0000..0xffff_ffff. (Note: The top 64KB are actually
      reserved for MSIs, but this is not a problem at this point; we
      just need to ensure Linux doesn't assign anything there. The M32
      logic ignores that, however, and will forward in that space if we
      try.)

    * It is divided into 256 segments of equal size. A table in the chip
      maps each segment to a PE#. That allows portions of the MMIO space
      to be assigned to PEs on a segment granularity. For a 2GB window,
      the segment granularity is 2GB/256 = 8MB.

    Now, this is the "main" window we use in Linux today (excluding
    SR-IOV). We basically use the trick of forcing the bridge MMIO
    windows onto a segment alignment/granularity so that the space
    behind a bridge can be assigned to a PE.

    Ideally we would like to be able to have individual functions in
    their own PEs, but that would mean using a completely different
    address allocation scheme where individual function BARs can be
    "grouped" to fit in one or more segments.

  - The M64 windows:

    * Must be at least 256MB in size.

    * Do not translate addresses (the address on PCIe is the same as the
      address on the PowerBus). There is a way to also set the top 14
      bits, which are not conveyed by PowerBus, but we don't use this.

    * Can be configured to be segmented. When not segmented, we can
      specify the PE# for the entire window. When segmented, a window
      has 256 segments; however, there is no table for mapping a segment
      to a PE#. The segment number *is* the PE# (see the sketch below).

    * Support overlaps. If an address is covered by multiple windows,
      there's a defined ordering for which window applies.
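    As an illustration of the difference (not the kernel's actual code;
    all names are made up), resolving an MMIO address that falls inside
    a window to a PE# is a table lookup for M32 but pure arithmetic for
    a segmented M64 window::

      #include <stdint.h>

      #define SEGS_PER_WINDOW 256

      /* M32: a table in the chip maps segment number -> PE#. */
      static uint8_t m32_seg_to_pe[SEGS_PER_WINDOW];

      /* Both helpers assume 'addr' lies inside the window. */
      static uint8_t m32_pe_for(uint64_t addr, uint64_t base, uint64_t size)
      {
              uint64_t seg = (addr - base) / (size / SEGS_PER_WINDOW);

              return m32_seg_to_pe[seg];
      }

      /* Segmented M64: no table; the segment number *is* the PE#. */
      static uint8_t m64_pe_for(uint64_t addr, uint64_t base, uint64_t size)
      {
              return (addr - base) / (size / SEGS_PER_WINDOW);
      }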
    We have code (fairly new compared to the M32 stuff) that exploits
    that for large BARs in 64-bit space:

    We configure an M64 window to cover the entire region of address
    space that has been assigned by FW for the PHB (about 64GB, ignoring
    the space for the M32; it comes out of a different "reserve"). We
    configure it as segmented.

    Then we do the same thing as with M32, using the bridge alignment
    trick, to match to those giant segments.

    Since we cannot remap, we have two additional constraints:

    - We do the PE# allocation *after* the 64-bit space has been
      assigned, because the addresses we use directly determine the PE#.
      We then update the M32 PE# for the devices that use both 32-bit
      and 64-bit spaces, or assign the remaining PE#s to 32-bit-only
      devices.

    - We cannot "group" segments in HW, so if a device ends up using
      more than one segment, we end up with more than one PE#. There is
      a HW mechanism to make the freeze state cascade to "companion"
      PEs, but that only works for PCIe error messages (typically used
      so that if you freeze a switch, it freezes all its children). So
      we do it in SW: when any of the PEs freezes, we freeze the other
      ones for that "domain". We lose a bit of effectiveness of EEH in
      that case, but that's the best we found. We thus introduce the
      concept of a "master PE", which is the one used for DMA, MSIs,
      etc., and "secondary PEs" that are used for the remaining M64
      segments.

    We would like to investigate using additional M64 windows in "single
    PE" mode to overlay over specific BARs to work around some of that,
    for example for devices with very large BARs, e.g., GPUs. It would
    make sense, but we haven't done it yet.

3. Considerations for SR-IOV on PowerKVM
========================================

* SR-IOV Background

  The PCIe SR-IOV feature allows a single Physical Function (PF) to
  support several Virtual Functions (VFs). Registers in the PF's SR-IOV
  Capability control the number of VFs and whether they are enabled.

  When VFs are enabled, they appear in Configuration Space like normal
  PCI devices, but the BARs in VF config space headers are unusual. For
  a non-VF device, software uses BARs in the config space header to
  discover the BAR sizes and assign addresses for them. For VF devices,
  software uses VF BAR registers in the *PF* SR-IOV Capability to
  discover sizes and assign addresses. The BARs in the VF's config space
  header are read-only zeros.

  When a VF BAR in the PF SR-IOV Capability is programmed, it sets the
  base address for all the corresponding VF(n) BARs. For example, if the
  PF SR-IOV Capability is programmed to enable eight VFs, and it has a
  1MB VF BAR0, the address in that VF BAR sets the base of an 8MB
  region. This region is divided into eight contiguous 1MB regions, each
  of which is a BAR0 for one of the VFs. Note that even though the VF
  BAR describes an 8MB region, the alignment requirement is for a single
  VF, i.e., 1MB in this example. (A small sketch of this layout follows
  the list of strategies below.)

  There are several strategies for isolating VFs in PEs:

  - M32 window: There's one M32 window, and it is split into 256
    equally-sized segments. The finest granularity possible is a 256MB
    window with 1MB segments. VF BARs that are 1MB or larger could be
    mapped to separate PEs in this window. Each segment can be
    individually mapped to a PE via the lookup table, so this is quite
    flexible, but it works best when all the VF BARs are the same size.
    If they are different sizes, the entire window has to be small
    enough that the segment size matches the smallest VF BAR, which
    means larger VF BARs span several segments.

  - Non-segmented M64 window: A non-segmented M64 window is mapped
    entirely to a single PE, so it could only isolate one VF.

  - Single segmented M64 window: A segmented M64 window could be used
    just like the M32 window, but the segments can't be individually
    mapped to PEs (the segment number is the PE#), so there isn't as
    much flexibility. A VF with multiple BARs would have to be in a
    "domain" of multiple PEs, which is not as well isolated as a single
    PE.

  - Multiple segmented M64 windows: As usual, each window is split into
    256 equally-sized segments, and the segment number is the PE#. But
    if we use several M64 windows, they can be set to different base
    addresses and different segment sizes. If we have VFs that each
    have a 1MB BAR and a 32MB BAR, we could use one M64 window to
    assign 1MB segments and another M64 window to assign 32MB segments.
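  The VF(n) BAR layout described above is simple arithmetic; a
  hypothetical helper (names made up, not a kernel API) shows both the
  layout and why the alignment requirement is only a single VF BAR
  size::

    #include <stdint.h>

    /*
     * MMIO base of VF n's copy of one VF BAR, for n = 0 .. num_vfs - 1.
     * With eight VFs and a 1MB VF BAR0, vf_bar_addr(base, 1 << 20, 3)
     * is base + 3MB: the per-VF slices are contiguous, so only
     * 'vf_bar_base' itself needs the 1MB (single-VF) alignment.
     */
    static uint64_t vf_bar_addr(uint64_t vf_bar_base, uint64_t vf_bar_size,
                                unsigned int n)
    {
            return vf_bar_base + (uint64_t)n * vf_bar_size;
    }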
Finally, we plan to use M64 windows for SR-IOV, which will be described
in more detail in the next two sections. For a given VF BAR, we need to
effectively reserve the entire 256 segments (256 * VF BAR size) and
position the VF BAR to start at the beginning of a free range of
segments/PEs inside that M64 window.

The goal is, of course, to be able to give a separate PE to each VF.

The IODA2 platform has 16 M64 windows, which are used to map MMIO
ranges to PE#s. Each M64 window defines one MMIO range, and this range
is divided into 256 segments, with each segment corresponding to one
PE.

We decided to leverage these M64 windows to map VFs to individual PEs,
since all the VF(n) BARs behind a given VF BAR are the same size.

But doing so introduces another problem: total_VFs is usually smaller
than the number of M64 window segments, so if we map one VF BAR
directly to one M64 window, some part of the M64 window will map to
another device's MMIO range.

IODA supports 256 PEs, so segmented windows contain 256 segments;
therefore, if total_VFs is less than 256, we have the situation in
Figure 1.0, where segments [total_VFs, 255] of the M64 window may map
to some MMIO range on other devices::

   0      1                     total_VFs - 1
   +------+------+-     -+------+------+
   |      |      |  ...  |      |      |
   +------+------+-     -+------+------+

                         VF(n) BAR space

   0      1                     total_VFs - 1                   255
   +------+------+-     -+------+------+-      -+------+------+
   |      |      |  ...  |      |      |  ...   |      |      |
   +------+------+-     -+------+------+-      -+------+------+

                         M64 window

          Figure 1.0 Direct map VF(n) BAR space

Our current solution is to allocate 256 segments even if the VF(n) BAR
space doesn't need that much, as shown in Figure 1.1::

   0      1                     total_VFs - 1                   255
   +------+------+-     -+------+------+-      -+------+------+
   |      |      |  ...  |      |      |  ...   |      |      |
   +------+------+-     -+------+------+-      -+------+------+

                         VF(n) BAR space + extra

   0      1                     total_VFs - 1                   255
   +------+------+-     -+------+------+-      -+------+------+
   |      |      |  ...  |      |      |  ...   |      |      |
   +------+------+-     -+------+------+-      -+------+------+

                         M64 window

          Figure 1.1 Map VF(n) BAR space + extra

Allocating the extra space ensures that the entire M64 window will be
assigned to this one SR-IOV device and none of the space will be
available for other devices. Note that this only expands the space
reserved in software; there are still only total_VFs VFs, and they only
respond to segments [0, total_VFs - 1]. There's nothing in hardware
that responds to segments [total_VFs, 255].
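A hypothetical sizing helper (illustrative only; names are made up)
captures the rule of Figure 1.1: reserve the whole 256-segment window
even though only the first total_VFs segments will ever respond::

  #include <stdint.h>

  #define SEGS_PER_WINDOW 256

  /* Space the enabled VFs actually use: total_VFs * VF BAR size. */
  static uint64_t vf_bar_space_used(uint64_t vf_bar_size,
                                    unsigned int total_vfs)
  {
          return vf_bar_size * total_vfs;
  }

  /* Space we reserve instead, so the whole M64 window is ours. */
  static uint64_t vf_bar_space_reserved(uint64_t vf_bar_size)
  {
          return vf_bar_size * SEGS_PER_WINDOW;
  }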
4. Implications for the Generic PCI Code
========================================

The PCIe SR-IOV spec requires that the base of the VF(n) BAR space be
aligned to the size of an individual VF BAR.

In IODA2, the MMIO address determines the PE#. If the address is in an
M32 window, we can set the PE# by updating the table that translates
segments to PE#s. Similarly, if the address is in an unsegmented M64
window, we can set the PE# for the window. But if it's in a segmented
M64 window, the segment number is the PE#.

Therefore, the only way to control the PE# for a VF is to change the
base of the VF(n) BAR space in the VF BAR. If the PCI core allocates
the exact amount of space required for the VF(n) BAR space, the VF BAR
value is fixed and cannot be changed.

On the other hand, if the PCI core allocates additional space, the VF
BAR value can be changed as long as the entire VF(n) BAR space remains
inside the space allocated by the core.

Ideally the segment size will be the same as an individual VF BAR size.
Then each VF will be in its own PE. The VF BARs (and therefore the
PE#s) are contiguous: if VF0 is in PE(x), then VF(n) is in PE(x+n). If
we allocate 256 segments, there are (256 - numVFs) choices for the PE#
of VF0, as sketched below.

If the segment size is smaller than the VF BAR size, it will take
several segments to cover a VF BAR, and a VF will be in several PEs.
This is possible, but the isolation isn't as good, and it reduces the
number of PE# choices, because instead of consuming only numVFs
segments, the VF(n) BAR space will consume (numVFs * n) segments, where
n is the number of segments needed to cover one VF BAR. That means
there aren't as many available segments for adjusting the base of the
VF(n) BAR space.
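To make the mechanism concrete, here is a hypothetical helper (names
made up; a sketch of the constraint, not the kernel's implementation)
for the ideal case where the segment size equals the VF BAR size::

  #include <stdint.h>

  /*
   * In a segmented M64 window the segment number is the PE#, so the
   * only way to choose VF0's PE# is to place the VF(n) BAR space at
   * the matching segment offset inside the 256-segment window.  VF n
   * then lands in PE (pe0 + n).
   */
  static uint64_t vf_bar_base_for_pe(uint64_t m64_base,
                                     uint64_t vf_bar_size,
                                     unsigned int pe0)
  {
          return m64_base + (uint64_t)pe0 * vf_bar_size;
  }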