diff options
Diffstat (limited to 'src/cmd/compile/abi-internal.md')
-rw-r--r-- | src/cmd/compile/abi-internal.md | 923 |
1 files changed, 923 insertions, 0 deletions
diff --git a/src/cmd/compile/abi-internal.md b/src/cmd/compile/abi-internal.md new file mode 100644 index 0000000..14464ed --- /dev/null +++ b/src/cmd/compile/abi-internal.md @@ -0,0 +1,923 @@ +# Go internal ABI specification + +Self-link: [go.dev/s/regabi](https://go.dev/s/regabi) + +This document describes Go’s internal application binary interface +(ABI), known as ABIInternal. +Go's ABI defines the layout of data in memory and the conventions for +calling between Go functions. +This ABI is *unstable* and will change between Go versions. +If you’re writing assembly code, please instead refer to Go’s +[assembly documentation](/doc/asm.html), which describes Go’s stable +ABI, known as ABI0. + +All functions defined in Go source follow ABIInternal. +However, ABIInternal and ABI0 functions are able to call each other +through transparent *ABI wrappers*, described in the [internal calling +convention proposal](https://golang.org/design/27539-internal-abi). + +Go uses a common ABI design across all architectures. +We first describe the common ABI, and then cover per-architecture +specifics. + +*Rationale*: For the reasoning behind using a common ABI across +architectures instead of the platform ABI, see the [register-based Go +calling convention proposal](https://golang.org/design/40724-register-calling). + +## Memory layout + +Go's built-in types have the following sizes and alignments. +Many, though not all, of these sizes are guaranteed by the [language +specification](/doc/go_spec.html#Size_and_alignment_guarantees). +Those that aren't guaranteed may change in future versions of Go (for +example, we've considered changing the alignment of int64 on 32-bit). + +| Type | 64-bit | | 32-bit | | +|-----------------------------|--------|-------|--------|-------| +| | Size | Align | Size | Align | +| bool, uint8, int8 | 1 | 1 | 1 | 1 | +| uint16, int16 | 2 | 2 | 2 | 2 | +| uint32, int32 | 4 | 4 | 4 | 4 | +| uint64, int64 | 8 | 8 | 8 | 4 | +| int, uint | 8 | 8 | 4 | 4 | +| float32 | 4 | 4 | 4 | 4 | +| float64 | 8 | 8 | 8 | 4 | +| complex64 | 8 | 4 | 8 | 4 | +| complex128 | 16 | 8 | 16 | 4 | +| uintptr, *T, unsafe.Pointer | 8 | 8 | 4 | 4 | + +The types `byte` and `rune` are aliases for `uint8` and `int32`, +respectively, and hence have the same size and alignment as these +types. + +The layout of `map`, `chan`, and `func` types is equivalent to *T. + +To describe the layout of the remaining composite types, we first +define the layout of a *sequence* S of N fields with types +t<sub>1</sub>, t<sub>2</sub>, ..., t<sub>N</sub>. +We define the byte offset at which each field begins relative to a +base address of 0, as well as the size and alignment of the sequence +as follows: + +``` +offset(S, i) = 0 if i = 1 + = align(offset(S, i-1) + sizeof(t_(i-1)), alignof(t_i)) +alignof(S) = 1 if N = 0 + = max(alignof(t_i) | 1 <= i <= N) +sizeof(S) = 0 if N = 0 + = align(offset(S, N) + sizeof(t_N), alignof(S)) +``` + +Where sizeof(T) and alignof(T) are the size and alignment of type T, +respectively, and align(x, y) rounds x up to a multiple of y. + +The `interface{}` type is a sequence of 1. a pointer to the runtime type +description for the interface's dynamic type and 2. an `unsafe.Pointer` +data field. +Any other interface type (besides the empty interface) is a sequence +of 1. a pointer to the runtime "itab" that gives the method pointers and +the type of the data field and 2. an `unsafe.Pointer` data field. +An interface can be "direct" or "indirect" depending on the dynamic +type: a direct interface stores the value directly in the data field, +and an indirect interface stores a pointer to the value in the data +field. +An interface can only be direct if the value consists of a single +pointer word. + +An array type `[N]T` is a sequence of N fields of type T. + +The slice type `[]T` is a sequence of a `*[cap]T` pointer to the slice +backing store, an `int` giving the `len` of the slice, and an `int` +giving the `cap` of the slice. + +The `string` type is a sequence of a `*[len]byte` pointer to the +string backing store, and an `int` giving the `len` of the string. + +A struct type `struct { f1 t1; ...; fM tM }` is laid out as the +sequence t1, ..., tM, tP, where tP is either: + +- Type `byte` if sizeof(tM) = 0 and any of sizeof(t*i*) ≠ 0. +- Empty (size 0 and align 1) otherwise. + +The padding byte prevents creating a past-the-end pointer by taking +the address of the final, empty fN field. + +Note that user-written assembly code should generally not depend on Go +type layout and should instead use the constants defined in +[`go_asm.h`](/doc/asm.html#data-offsets). + +## Function call argument and result passing + +Function calls pass arguments and results using a combination of the +stack and machine registers. +Each argument or result is passed either entirely in registers or +entirely on the stack. +Because access to registers is generally faster than access to the +stack, arguments and results are preferentially passed in registers. +However, any argument or result that contains a non-trivial array or +does not fit entirely in the remaining available registers is passed +on the stack. + +Each architecture defines a sequence of integer registers and a +sequence of floating-point registers. +At a high level, arguments and results are recursively broken down +into values of base types and these base values are assigned to +registers from these sequences. + +Arguments and results can share the same registers, but do not share +the same stack space. +Beyond the arguments and results passed on the stack, the caller also +reserves spill space on the stack for all register-based arguments +(but does not populate this space). + +The receiver, arguments, and results of function or method F are +assigned to registers or the stack using the following algorithm: + +1. Let NI and NFP be the length of integer and floating-point register + sequences defined by the architecture. + Let I and FP be 0; these are the indexes of the next integer and + floating-pointer register. + Let S, the type sequence defining the stack frame, be empty. +1. If F is a method, assign F’s receiver. +1. For each argument A of F, assign A. +1. Add a pointer-alignment field to S. This has size 0 and the same + alignment as `uintptr`. +1. Reset I and FP to 0. +1. For each result R of F, assign R. +1. Add a pointer-alignment field to S. +1. For each register-assigned receiver and argument of F, let T be its + type and add T to the stack sequence S. + This is the argument's (or receiver's) spill space and will be + uninitialized at the call. +1. Add a pointer-alignment field to S. + +Assigning a receiver, argument, or result V of underlying type T works +as follows: + +1. Remember I and FP. +1. If T has zero size, add T to the stack sequence S and return. +1. Try to register-assign V. +1. If step 3 failed, reset I and FP to the values from step 1, add T + to the stack sequence S, and assign V to this field in S. + +Register-assignment of a value V of underlying type T works as follows: + +1. If T is a boolean or integral type that fits in an integer + register, assign V to register I and increment I. +1. If T is an integral type that fits in two integer registers, assign + the least significant and most significant halves of V to registers + I and I+1, respectively, and increment I by 2 +1. If T is a floating-point type and can be represented without loss + of precision in a floating-point register, assign V to register FP + and increment FP. +1. If T is a complex type, recursively register-assign its real and + imaginary parts. +1. If T is a pointer type, map type, channel type, or function type, + assign V to register I and increment I. +1. If T is a string type, interface type, or slice type, recursively + register-assign V’s components (2 for strings and interfaces, 3 for + slices). +1. If T is a struct type, recursively register-assign each field of V. +1. If T is an array type of length 0, do nothing. +1. If T is an array type of length 1, recursively register-assign its + one element. +1. If T is an array type of length > 1, fail. +1. If I > NI or FP > NFP, fail. +1. If any recursive assignment above fails, fail. + +The above algorithm produces an assignment of each receiver, argument, +and result to registers or to a field in the stack sequence. +The final stack sequence looks like: stack-assigned receiver, +stack-assigned arguments, pointer-alignment, stack-assigned results, +pointer-alignment, spill space for each register-assigned argument, +pointer-alignment. +The following diagram shows what this stack frame looks like on the +stack, using the typical convention where address 0 is at the bottom: + + +------------------------------+ + | . . . | + | 2nd reg argument spill space | + | 1st reg argument spill space | + | <pointer-sized alignment> | + | . . . | + | 2nd stack-assigned result | + | 1st stack-assigned result | + | <pointer-sized alignment> | + | . . . | + | 2nd stack-assigned argument | + | 1st stack-assigned argument | + | stack-assigned receiver | + +------------------------------+ ↓ lower addresses + +To perform a call, the caller reserves space starting at the lowest +address in its stack frame for the call stack frame, stores arguments +in the registers and argument stack fields determined by the above +algorithm, and performs the call. +At the time of a call, spill space, result stack fields, and result +registers are left uninitialized. +Upon return, the callee must have stored results to all result +registers and result stack fields determined by the above algorithm. + +There are no callee-save registers, so a call may overwrite any +register that doesn’t have a fixed meaning, including argument +registers. + +### Example + +Consider the function `func f(a1 uint8, a2 [2]uintptr, a3 uint8) (r1 +struct { x uintptr; y [2]uintptr }, r2 string)` on a 64-bit +architecture with hypothetical integer registers R0–R9. + +On entry, `a1` is assigned to `R0`, `a3` is assigned to `R1` and the +stack frame is laid out in the following sequence: + + a2 [2]uintptr + r1.x uintptr + r1.y [2]uintptr + a1Spill uint8 + a3Spill uint8 + _ [6]uint8 // alignment padding + +In the stack frame, only the `a2` field is initialized on entry; the +rest of the frame is left uninitialized. + +On exit, `r2.base` is assigned to `R0`, `r2.len` is assigned to `R1`, +and `r1.x` and `r1.y` are initialized in the stack frame. + +There are several things to note in this example. +First, `a2` and `r1` are stack-assigned because they contain arrays. +The other arguments and results are register-assigned. +Result `r2` is decomposed into its components, which are individually +register-assigned. +On the stack, the stack-assigned arguments appear at lower addresses +than the stack-assigned results, which appear at lower addresses than +the argument spill area. +Only arguments, not results, are assigned a spill area on the stack. + +### Rationale + +Each base value is assigned to its own register to optimize +construction and access. +An alternative would be to pack multiple sub-word values into +registers, or to simply map an argument's in-memory layout to +registers (this is common in C ABIs), but this typically adds cost to +pack and unpack these values. +Modern architectures have more than enough registers to pass all +arguments and results this way for nearly all functions (see the +appendix), so there’s little downside to spreading base values across +registers. + +Arguments that can’t be fully assigned to registers are passed +entirely on the stack in case the callee takes the address of that +argument. +If an argument could be split across the stack and registers and the +callee took its address, it would need to be reconstructed in memory, +a process that would be proportional to the size of the argument. + +Non-trivial arrays are always passed on the stack because indexing +into an array typically requires a computed offset, which generally +isn’t possible with registers. +Arrays in general are rare in function signatures (only 0.7% of +functions in the Go 1.15 standard library and 0.2% in kubelet). +We considered allowing array fields to be passed on the stack while +the rest of an argument’s fields are passed in registers, but this +creates the same problems as other large structs if the callee takes +the address of an argument, and would benefit <0.1% of functions in +kubelet (and even these very little). + +We make exceptions for 0 and 1-element arrays because these don’t +require computed offsets, and 1-element arrays are already decomposed +in the compiler’s SSA representation. + +The ABI assignment algorithm above is equivalent to Go’s stack-based +ABI0 calling convention if there are zero architecture registers. +This is intended to ease the transition to the register-based internal +ABI and make it easy for the compiler to generate either calling +convention. +An architecture may still define register meanings that aren’t +compatible with ABI0, but these differences should be easy to account +for in the compiler. + +The assignment algorithm assigns zero-sized values to the stack +(assignment step 2) in order to support ABI0-equivalence. +While these values take no space themselves, they do result in +alignment padding on the stack in ABI0. +Without this step, the internal ABI would register-assign zero-sized +values even on architectures that provide no argument registers +because they don't consume any registers, and hence not add alignment +padding to the stack. + +The algorithm reserves spill space for arguments in the caller’s frame +so that the compiler can generate a stack growth path that spills into +this reserved space. +If the callee has to grow the stack, it may not be able to reserve +enough additional stack space in its own frame to spill these, which +is why it’s important that the caller do so. +These slots also act as the home location if these arguments need to +be spilled for any other reason, which simplifies traceback printing. + +There are several options for how to lay out the argument spill space. +We chose to lay out each argument according to its type's usual memory +layout but to separate the spill space from the regular argument +space. +Using the usual memory layout simplifies the compiler because it +already understands this layout. +Also, if a function takes the address of a register-assigned argument, +the compiler must spill that argument to memory in its usual memory +layout and it's more convenient to use the argument spill space for +this purpose. + +Alternatively, the spill space could be structured around argument +registers. +In this approach, the stack growth spill path would spill each +argument register to a register-sized stack word. +However, if the function takes the address of a register-assigned +argument, the compiler would have to reconstruct it in memory layout +elsewhere on the stack. + +The spill space could also be interleaved with the stack-assigned +arguments so the arguments appear in order whether they are register- +or stack-assigned. +This would be close to ABI0, except that register-assigned arguments +would be uninitialized on the stack and there's no need to reserve +stack space for register-assigned results. +We expect separating the spill space to perform better because of +memory locality. +Separating the space is also potentially simpler for `reflect` calls +because this allows `reflect` to summarize the spill space as a single +number. +Finally, the long-term intent is to remove reserved spill slots +entirely – allowing most functions to be called without any stack +setup and easing the introduction of callee-save registers – and +separating the spill space makes that transition easier. + +## Closures + +A func value (e.g., `var x func()`) is a pointer to a closure object. +A closure object begins with a pointer-sized program counter +representing the entry point of the function, followed by zero or more +bytes containing the closed-over environment. + +Closure calls follow the same conventions as static function and +method calls, with one addition. Each architecture specifies a +*closure context pointer* register and calls to closures store the +address of the closure object in the closure context pointer register +prior to the call. + +## Software floating-point mode + +In "softfloat" mode, the ABI simply treats the hardware as having zero +floating-point registers. +As a result, any arguments containing floating-point values will be +passed on the stack. + +*Rationale*: Softfloat mode is about compatibility over performance +and is not commonly used. +Hence, we keep the ABI as simple as possible in this case, rather than +adding additional rules for passing floating-point values in integer +registers. + +## Architecture specifics + +This section describes per-architecture register mappings, as well as +other per-architecture special cases. + +### amd64 architecture + +The amd64 architecture uses the following sequence of 9 registers for +integer arguments and results: + + RAX, RBX, RCX, RDI, RSI, R8, R9, R10, R11 + +It uses X0 – X14 for floating-point arguments and results. + +*Rationale*: These sequences are chosen from the available registers +to be relatively easy to remember. + +Registers R12 and R13 are permanent scratch registers. +R15 is a scratch register except in dynamically linked binaries. + +*Rationale*: Some operations such as stack growth and reflection calls +need dedicated scratch registers in order to manipulate call frames +without corrupting arguments or results. + +Special-purpose registers are as follows: + +| Register | Call meaning | Return meaning | Body meaning | +| --- | --- | --- | --- | +| RSP | Stack pointer | Same | Same | +| RBP | Frame pointer | Same | Same | +| RDX | Closure context pointer | Scratch | Scratch | +| R12 | Scratch | Scratch | Scratch | +| R13 | Scratch | Scratch | Scratch | +| R14 | Current goroutine | Same | Same | +| R15 | GOT reference temporary if dynlink | Same | Same | +| X15 | Zero value (*) | Same | Scratch | + +(*) Except on Plan 9, where X15 is a scratch register because SSE +registers cannot be used in note handlers (so the compiler avoids +using them except when absolutely necessary). + +*Rationale*: These register meanings are compatible with Go’s +stack-based calling convention except for R14 and X15, which will have +to be restored on transitions from ABI0 code to ABIInternal code. +In ABI0, these are undefined, so transitions from ABIInternal to ABI0 +can ignore these registers. + +*Rationale*: For the current goroutine pointer, we chose a register +that requires an additional REX byte. +While this adds one byte to every function prologue, it is hardly ever +accessed outside the function prologue and we expect making more +single-byte registers available to be a net win. + +*Rationale*: We could allow R14 (the current goroutine pointer) to be +a scratch register in function bodies because it can always be +restored from TLS on amd64. +However, we designate it as a fixed register for simplicity and for +consistency with other architectures that may not have a copy of the +current goroutine pointer in TLS. + +*Rationale*: We designate X15 as a fixed zero register because +functions often have to bulk zero their stack frames, and this is more +efficient with a designated zero register. + +*Implementation note*: Registers with fixed meaning at calls but not +in function bodies must be initialized by "injected" calls such as +signal-based panics. + +#### Stack layout + +The stack pointer, RSP, grows down and is always aligned to 8 bytes. + +The amd64 architecture does not use a link register. + +A function's stack frame is laid out as follows: + + +------------------------------+ + | return PC | + | RBP on entry | + | ... locals ... | + | ... outgoing arguments ... | + +------------------------------+ ↓ lower addresses + +The "return PC" is pushed as part of the standard amd64 `CALL` +operation. +On entry, a function subtracts from RSP to open its stack frame and +saves the value of RBP directly below the return PC. +A leaf function that does not require any stack space may omit the +saved RBP. + +The Go ABI's use of RBP as a frame pointer register is compatible with +amd64 platform conventions so that Go can inter-operate with platform +debuggers and profilers. + +#### Flags + +The direction flag (D) is always cleared (set to the “forward” +direction) at a call. +The arithmetic status flags are treated like scratch registers and not +preserved across calls. +All other bits in RFLAGS are system flags. + +At function calls and returns, the CPU is in x87 mode (not MMX +technology mode). + +*Rationale*: Go on amd64 does not use either the x87 registers or MMX +registers. Hence, we follow the SysV platform conventions in order to +simplify transitions to and from the C ABI. + +At calls, the MXCSR control bits are always set as follows: + +| Flag | Bit | Value | Meaning | +| --- | --- | --- | --- | +| FZ | 15 | 0 | Do not flush to zero | +| RC | 14/13 | 0 (RN) | Round to nearest | +| PM | 12 | 1 | Precision masked | +| UM | 11 | 1 | Underflow masked | +| OM | 10 | 1 | Overflow masked | +| ZM | 9 | 1 | Divide-by-zero masked | +| DM | 8 | 1 | Denormal operations masked | +| IM | 7 | 1 | Invalid operations masked | +| DAZ | 6 | 0 | Do not zero de-normals | + +The MXCSR status bits are callee-save. + +*Rationale*: Having a fixed MXCSR control configuration allows Go +functions to use SSE operations without modifying or saving the MXCSR. +Functions are allowed to modify it between calls (as long as they +restore it), but as of this writing Go code never does. +The above fixed configuration matches the process initialization +control bits specified by the ELF AMD64 ABI. + +The x87 floating-point control word is not used by Go on amd64. + +### arm64 architecture + +The arm64 architecture uses R0 – R15 for integer arguments and results. + +It uses F0 – F15 for floating-point arguments and results. + +*Rationale*: 16 integer registers and 16 floating-point registers are +more than enough for passing arguments and results for practically all +functions (see Appendix). While there are more registers available, +using more registers provides little benefit. Additionally, it will add +overhead on code paths where the number of arguments are not statically +known (e.g. reflect call), and will consume more stack space when there +is only limited stack space available to fit in the nosplit limit. + +Registers R16 and R17 are permanent scratch registers. They are also +used as scratch registers by the linker (Go linker and external +linker) in trampolines. + +Register R18 is reserved and never used. It is reserved for the OS +on some platforms (e.g. macOS). + +Registers R19 – R25 are permanent scratch registers. In addition, +R27 is a permanent scratch register used by the assembler when +expanding instructions. + +Floating-point registers F16 – F31 are also permanent scratch +registers. + +Special-purpose registers are as follows: + +| Register | Call meaning | Return meaning | Body meaning | +| --- | --- | --- | --- | +| RSP | Stack pointer | Same | Same | +| R30 | Link register | Same | Scratch (non-leaf functions) | +| R29 | Frame pointer | Same | Same | +| R28 | Current goroutine | Same | Same | +| R27 | Scratch | Scratch | Scratch | +| R26 | Closure context pointer | Scratch | Scratch | +| R18 | Reserved (not used) | Same | Same | +| ZR | Zero value | Same | Same | + +*Rationale*: These register meanings are compatible with Go’s +stack-based calling convention. + +*Rationale*: The link register, R30, holds the function return +address at the function entry. For functions that have frames +(including most non-leaf functions), R30 is saved to stack in the +function prologue and restored in the epilogue. Within the function +body, R30 can be used as a scratch register. + +*Implementation note*: Registers with fixed meaning at calls but not +in function bodies must be initialized by "injected" calls such as +signal-based panics. + +#### Stack layout + +The stack pointer, RSP, grows down and is always aligned to 16 bytes. + +*Rationale*: The arm64 architecture requires the stack pointer to be +16-byte aligned. + +A function's stack frame, after the frame is created, is laid out as +follows: + + +------------------------------+ + | ... locals ... | + | ... outgoing arguments ... | + | return PC | ← RSP points to + | frame pointer on entry | + +------------------------------+ ↓ lower addresses + +The "return PC" is loaded to the link register, R30, as part of the +arm64 `CALL` operation. + +On entry, a function subtracts from RSP to open its stack frame, and +saves the values of R30 and R29 at the bottom of the frame. +Specifically, R30 is saved at 0(RSP) and R29 is saved at -8(RSP), +after RSP is updated. + +A leaf function that does not require any stack space may omit the +saved R30 and R29. + +The Go ABI's use of R29 as a frame pointer register is compatible with +arm64 architecture requirement so that Go can inter-operate with platform +debuggers and profilers. + +This stack layout is used by both register-based (ABIInternal) and +stack-based (ABI0) calling conventions. + +#### Flags + +The arithmetic status flags (NZCV) are treated like scratch registers +and not preserved across calls. +All other bits in PSTATE are system flags and are not modified by Go. + +The floating-point status register (FPSR) is treated like scratch +registers and not preserved across calls. + +At calls, the floating-point control register (FPCR) bits are always +set as follows: + +| Flag | Bit | Value | Meaning | +| --- | --- | --- | --- | +| DN | 25 | 0 | Propagate NaN operands | +| FZ | 24 | 0 | Do not flush to zero | +| RC | 23/22 | 0 (RN) | Round to nearest, choose even if tied | +| IDE | 15 | 0 | Denormal operations trap disabled | +| IXE | 12 | 0 | Inexact trap disabled | +| UFE | 11 | 0 | Underflow trap disabled | +| OFE | 10 | 0 | Overflow trap disabled | +| DZE | 9 | 0 | Divide-by-zero trap disabled | +| IOE | 8 | 0 | Invalid operations trap disabled | +| NEP | 2 | 0 | Scalar operations do not affect higher elements in vector registers | +| AH | 1 | 0 | No alternate handling of de-normal inputs | +| FIZ | 0 | 0 | Do not zero de-normals | + +*Rationale*: Having a fixed FPCR control configuration allows Go +functions to use floating-point and vector (SIMD) operations without +modifying or saving the FPCR. +Functions are allowed to modify it between calls (as long as they +restore it), but as of this writing Go code never does. + +### ppc64 architecture + +The ppc64 architecture uses R3 – R10 and R14 – R17 for integer arguments +and results. + +It uses F1 – F12 for floating-point arguments and results. + +Register R31 is a permanent scratch register in Go. + +Special-purpose registers used within Go generated code and Go +assembly code are as follows: + +| Register | Call meaning | Return meaning | Body meaning | +| --- | --- | --- | --- | +| R0 | Zero value | Same | Same | +| R1 | Stack pointer | Same | Same | +| R2 | TOC register | Same | Same | +| R11 | Closure context pointer | Scratch | Scratch | +| R12 | Function address on indirect calls | Scratch | Scratch | +| R13 | TLS pointer | Same | Same | +| R20,R21 | Scratch | Scratch | Used by duffcopy, duffzero | +| R30 | Current goroutine | Same | Same | +| R31 | Scratch | Scratch | Scratch | +| LR | Link register | Link register | Scratch | +*Rationale*: These register meanings are compatible with Go’s +stack-based calling convention. + +The link register, LR, holds the function return +address at the function entry and is set to the correct return +address before exiting the function. It is also used +in some cases as the function address when doing an indirect call. + +The register R2 contains the address of the TOC (table of contents) which +contains data or code addresses used when generating position independent +code. Non-Go code generated when using cgo contains TOC-relative addresses +which depend on R2 holding a valid TOC. Go code compiled with -shared or +-dynlink initializes and maintains R2 and uses it in some cases for +function calls; Go code compiled without these options does not modify R2. + +When making a function call R12 contains the function address for use by the +code to generate R2 at the beginning of the function. R12 can be used for +other purposes within the body of the function, such as trampoline generation. + +R20 and R21 are used in duffcopy and duffzero which could be generated +before arguments are saved so should not be used for register arguments. + +The Count register CTR can be used as the call target for some branch instructions. +It holds the return address when preemption has occurred. + +On PPC64 when a float32 is loaded it becomes a float64 in the register, which is +different from other platforms and that needs to be recognized by the internal +implementation of reflection so that float32 arguments are passed correctly. + +Registers R18 - R29 and F13 - F31 are considered scratch registers. + +#### Stack layout + +The stack pointer, R1, grows down and is aligned to 8 bytes in Go, but changed +to 16 bytes when calling cgo. + +A function's stack frame, after the frame is created, is laid out as +follows: + + +------------------------------+ + | ... locals ... | + | ... outgoing arguments ... | + | 24 TOC register R2 save | When compiled with -shared/-dynlink + | 16 Unused in Go | Not used in Go + | 8 CR save | nonvolatile CR fields + | 0 return PC | ← R1 points to + +------------------------------+ ↓ lower addresses + +The "return PC" is loaded to the link register, LR, as part of the +ppc64 `BL` operations. + +On entry to a non-leaf function, the stack frame size is subtracted from R1 to +create its stack frame, and saves the value of LR at the bottom of the frame. + +A leaf function that does not require any stack space does not modify R1 and +does not save LR. + +*NOTE*: We might need to save the frame pointer on the stack as +in the PPC64 ELF v2 ABI so Go can inter-operate with platform debuggers +and profilers. + +This stack layout is used by both register-based (ABIInternal) and +stack-based (ABI0) calling conventions. + +#### Flags + +The condition register consists of 8 condition code register fields +CR0-CR7. Go generated code only sets and uses CR0, commonly set by +compare functions and use to determine the target of a conditional +branch. The generated code does not set or use CR1-CR7. + +The floating point status and control register (FPSCR) is initialized +to 0 by the kernel at startup of the Go program and not changed by +the Go generated code. + +### riscv64 architecture + +The riscv64 architecture uses X10 – X17, X8, X9, X18 – X23 for integer arguments +and results. + +It uses F10 – F17, F8, F9, F18 – F23 for floating-point arguments and results. + +Special-purpose registers used within Go generated code and Go +assembly code are as follows: + +| Register | Call meaning | Return meaning | Body meaning | +| --- | --- | --- | --- | +| X0 | Zero value | Same | Same | +| X1 | Link register | Link register | Scratch | +| X2 | Stack pointer | Same | Same | +| X3 | Global pointer | Same | Used by dynamic linker | +| X4 | TLS (thread pointer) | TLS | Scratch | +| X24,X25 | Scratch | Scratch | Used by duffcopy, duffzero | +| X26 | Closure context pointer | Scratch | Scratch | +| X27 | Current goroutine | Same | Same | +| X31 | Scratch | Scratch | Scratch | + +*Rationale*: These register meanings are compatible with Go’s +stack-based calling convention. Context register X20 will change to X26, +duffcopy, duffzero register will change to X24, X25 before this register ABI been adopted. +X10 – X17, X8, X9, X18 – X23, is the same order as A0 – A7, S0 – S7 in platform ABI. +F10 – F17, F8, F9, F18 – F23, is the same order as FA0 – FA7, FS0 – FS7 in platform ABI. +X8 – X23, F8 – F15 are used for compressed instruction (RVC) which will benefit code size in the future. + +#### Stack layout + +The stack pointer, X2, grows down and is aligned to 8 bytes. + +A function's stack frame, after the frame is created, is laid out as +follows: + + +------------------------------+ + | ... locals ... | + | ... outgoing arguments ... | + | return PC | ← X2 points to + +------------------------------+ ↓ lower addresses + +The "return PC" is loaded to the link register, X1, as part of the +riscv64 `CALL` operation. + +#### Flags + +The riscv64 has Zicsr extension for control and status register (CSR) and +treated as scratch register. +All bits in CSR are system flags and are not modified by Go. + +## Future directions + +### Spill path improvements + +The ABI currently reserves spill space for argument registers so the +compiler can statically generate an argument spill path before calling +into `runtime.morestack` to grow the stack. +This ensures there will be sufficient spill space even when the stack +is nearly exhausted and keeps stack growth and stack scanning +essentially unchanged from ABI0. + +However, this wastes stack space (the median wastage is 16 bytes per +call), resulting in larger stacks and increased cache footprint. +A better approach would be to reserve stack space only when spilling. +One way to ensure enough space is available to spill would be for +every function to ensure there is enough space for the function's own +frame *as well as* the spill space of all functions it calls. +For most functions, this would change the threshold for the prologue +stack growth check. +For `nosplit` functions, this would change the threshold used in the +linker's static stack size check. + +Allocating spill space in the callee rather than the caller may also +allow for faster reflection calls in the common case where a function +takes only register arguments, since it would allow reflection to make +these calls directly without allocating any frame. + +The statically-generated spill path also increases code size. +It is possible to instead have a generic spill path in the runtime, as +part of `morestack`. +However, this complicates reserving the spill space, since spilling +all possible register arguments would, in most cases, take +significantly more space than spilling only those used by a particular +function. +Some options are to spill to a temporary space and copy back only the +registers used by the function, or to grow the stack if necessary +before spilling to it (using a temporary space if necessary), or to +use a heap-allocated space if insufficient stack space is available. +These options all add enough complexity that we will have to make this +decision based on the actual code size growth caused by the static +spill paths. + +### Clobber sets + +As defined, the ABI does not use callee-save registers. +This significantly simplifies the garbage collector and the compiler's +register allocator, but at some performance cost. +A potentially better balance for Go code would be to use *clobber +sets*: for each function, the compiler records the set of registers it +clobbers (including those clobbered by functions it calls) and any +register not clobbered by function F can remain live across calls to +F. + +This is generally a good fit for Go because Go's package DAG allows +function metadata like the clobber set to flow up the call graph, even +across package boundaries. +Clobber sets would require relatively little change to the garbage +collector, unlike general callee-save registers. +One disadvantage of clobber sets over callee-save registers is that +they don't help with indirect function calls or interface method +calls, since static information isn't available in these cases. + +### Large aggregates + +Go encourages passing composite values by value, and this simplifies +reasoning about mutation and races. +However, this comes at a performance cost for large composite values. +It may be possible to instead transparently pass large composite +values by reference and delay copying until it is actually necessary. + +## Appendix: Register usage analysis + +In order to understand the impacts of the above design on register +usage, we +[analyzed](https://github.com/aclements/go-misc/tree/master/abi) the +impact of the above ABI on a large code base: cmd/kubelet from +[Kubernetes](https://github.com/kubernetes/kubernetes) at tag v1.18.8. + +The following table shows the impact of different numbers of available +integer and floating-point registers on argument assignment: + +``` +| | | | stack args | spills | stack total | +| ints | floats | % fit | p50 | p95 | p99 | p50 | p95 | p99 | p50 | p95 | p99 | +| 0 | 0 | 6.3% | 32 | 152 | 256 | 0 | 0 | 0 | 32 | 152 | 256 | +| 0 | 8 | 6.4% | 32 | 152 | 256 | 0 | 0 | 0 | 32 | 152 | 256 | +| 1 | 8 | 21.3% | 24 | 144 | 248 | 8 | 8 | 8 | 32 | 152 | 256 | +| 2 | 8 | 38.9% | 16 | 128 | 224 | 8 | 16 | 16 | 24 | 136 | 240 | +| 3 | 8 | 57.0% | 0 | 120 | 224 | 16 | 24 | 24 | 24 | 136 | 240 | +| 4 | 8 | 73.0% | 0 | 120 | 216 | 16 | 32 | 32 | 24 | 136 | 232 | +| 5 | 8 | 83.3% | 0 | 112 | 216 | 16 | 40 | 40 | 24 | 136 | 232 | +| 6 | 8 | 87.5% | 0 | 112 | 208 | 16 | 48 | 48 | 24 | 136 | 232 | +| 7 | 8 | 89.8% | 0 | 112 | 208 | 16 | 48 | 56 | 24 | 136 | 232 | +| 8 | 8 | 91.3% | 0 | 112 | 200 | 16 | 56 | 64 | 24 | 136 | 232 | +| 9 | 8 | 92.1% | 0 | 112 | 192 | 16 | 56 | 72 | 24 | 136 | 232 | +| 10 | 8 | 92.6% | 0 | 104 | 192 | 16 | 56 | 72 | 24 | 136 | 232 | +| 11 | 8 | 93.1% | 0 | 104 | 184 | 16 | 56 | 80 | 24 | 128 | 232 | +| 12 | 8 | 93.4% | 0 | 104 | 176 | 16 | 56 | 88 | 24 | 128 | 232 | +| 13 | 8 | 94.0% | 0 | 88 | 176 | 16 | 56 | 96 | 24 | 128 | 232 | +| 14 | 8 | 94.4% | 0 | 80 | 152 | 16 | 64 | 104 | 24 | 128 | 232 | +| 15 | 8 | 94.6% | 0 | 80 | 152 | 16 | 64 | 112 | 24 | 128 | 232 | +| 16 | 8 | 94.9% | 0 | 16 | 152 | 16 | 64 | 112 | 24 | 128 | 232 | +| ∞ | 8 | 99.8% | 0 | 0 | 0 | 24 | 112 | 216 | 24 | 120 | 216 | +``` + +The first two columns show the number of available integer and +floating-point registers. +The first row shows the results for 0 integer and 0 floating-point +registers, which is equivalent to ABI0. +We found that any reasonable number of floating-point registers has +the same effect, so we fixed it at 8 for all other rows. + +The “% fit” column gives the fraction of functions where all arguments +and results are register-assigned and no arguments are passed on the +stack. +The three “stack args” columns give the median, 95th and 99th +percentile number of bytes of stack arguments. +The “spills” columns likewise summarize the number of bytes in +on-stack spill space. +And “stack total” summarizes the sum of stack arguments and on-stack +spill slots. +Note that these are three different distributions; for example, +there’s no single function that takes 0 stack argument bytes, 16 spill +bytes, and 24 total stack bytes. + +From this, we can see that the fraction of functions that fit entirely +in registers grows very slowly once it reaches about 90%, though +curiously there is a small minority of functions that could benefit +from a huge number of registers. +Making 9 integer registers available on amd64 puts it in this realm. +We also see that the stack space required for most functions is fairly +small. +While the increasing space required for spills largely balances out +the decreasing space required for stack arguments as the number of +available registers increases, there is a general reduction in the +total stack space required with more available registers. +This does, however, suggest that eliminating spill slots in the future +would noticeably reduce stack requirements. |