Diffstat (limited to '')
-rw-r--r-- | Documentation/ia64/fsys.txt | 286 |
1 files changed, 286 insertions, 0 deletions
diff --git a/Documentation/ia64/fsys.txt b/Documentation/ia64/fsys.txt
new file mode 100644
index 000000000..59dd689d9
--- /dev/null
+++ b/Documentation/ia64/fsys.txt
@@ -0,0 +1,286 @@
+-*-Mode: outline-*-
+
+                Light-weight System Calls for IA-64
+                -----------------------------------
+
+                        Started: 13-Jan-2003
+                    Last update: 27-Sep-2003
+
+                      David Mosberger-Tang
+                      <davidm@hpl.hp.com>
+
+Using the "epc" instruction effectively introduces a new mode of
+execution to the ia64 linux kernel.  We call this mode the
+"fsys-mode".  To recap, the normal states of execution are:
+
+  - kernel mode:
+        Both the register stack and the memory stack have been
+        switched over to kernel memory.  The user-level state is saved
+        in a pt_regs structure at the top of the kernel memory stack.
+
+  - user mode:
+        Both the register stack and the memory stack are in
+        user memory.  The user-level state is contained in the
+        CPU registers.
+
+  - bank 0 interruption-handling mode:
+        This is the non-interruptible state which all
+        interruption-handlers start execution in.  The user-level
+        state remains in the CPU registers and some kernel state may
+        be stored in bank 0 of registers r16-r31.
+
+In contrast, fsys-mode has the following special properties:
+
+  - execution is at privilege level 0 (most-privileged)
+
+  - CPU registers may contain a mixture of user-level and kernel-level
+    state (it is the responsibility of the kernel to ensure that no
+    security-sensitive kernel-level state is leaked back to
+    user-level)
+
+  - execution is interruptible and preemptible (an fsys-mode handler
+    can disable interrupts and avoid all other interruption-sources
+    to avoid preemption)
+
+  - neither the memory-stack nor the register-stack can be trusted while
+    in fsys-mode (they point to the user-level stacks, which may
+    be invalid, or completely bogus addresses)
+
+In summary, fsys-mode is much more similar to running in user-mode
+than it is to running in kernel-mode.
+Of course, given that the
+privilege level is at level 0, this means that fsys-mode requires some
+care (see below).
+
+
+* How to tell fsys-mode
+
+Linux operates in fsys-mode when (a) the privilege level is 0 (most
+privileged) and (b) the stacks have NOT been switched to kernel memory
+yet.  For convenience, the header file <asm-ia64/ptrace.h> provides
+three macros:
+
+        user_mode(regs)
+        user_stack(task,regs)
+        fsys_mode(task,regs)
+
+The "regs" argument is a pointer to a pt_regs structure.  The "task"
+argument is a pointer to the task structure to which the "regs"
+pointer belongs.  user_mode() returns TRUE if the CPU state pointed
+to by "regs" was executing in user mode (privilege level 3).
+user_stack() returns TRUE if the state pointed to by "regs" was
+executing on the user-level stack(s).  Finally, fsys_mode() returns
+TRUE if the CPU state pointed to by "regs" was executing in fsys-mode.
+The fsys_mode() macro is equivalent to the expression:
+
+        !user_mode(regs) && user_stack(task,regs)
+
+* How to write an fsyscall handler
+
+The file arch/ia64/kernel/fsys.S contains a table of fsyscall-handlers
+(fsyscall_table).  This table contains one entry for each system call.
+By default, a system call is handled by fsys_fallback_syscall().  This
+routine takes care of entering (full) kernel mode and calling the
+normal Linux system call handler.  For performance-critical system
+calls, it is possible to write a hand-tuned fsyscall handler.  For
+example, fsys.S contains fsys_getpid(), which is a hand-tuned version
+of the getpid() system call.
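The mode test described under "How to tell fsys-mode" can be sketched in plain C as shown here. This is only an illustration of the boolean logic, not the kernel's macros: struct sketch_state and every sketch_* name are hypothetical stand-ins for what the real macros derive from pt_regs and the task structure, and only privilege levels 0 and 3 are modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the two facts the real macros extract from
 * pt_regs and the task structure.  This is NOT the kernel's data layout. */
struct sketch_state {
    unsigned int cpl;     /* privilege level: 0 = most privileged, 3 = user */
    bool on_user_stacks;  /* true while the stacks are still in user memory */
};

enum sketch_mode { SKETCH_KERNEL_MODE, SKETCH_USER_MODE, SKETCH_FSYS_MODE };

/* Mirrors user_mode(regs): was the state executing at privilege level 3? */
static bool sketch_user_mode(const struct sketch_state *s)
{
    assert(s != NULL);
    return s->cpl == 3;
}

/* Mirrors user_stack(task,regs): was it executing on the user-level stacks? */
static bool sketch_user_stack(const struct sketch_state *s)
{
    assert(s != NULL);
    return s->on_user_stacks;
}

/* Mirrors fsys_mode(task,regs): !user_mode(regs) && user_stack(task,regs). */
static bool sketch_fsys_mode(const struct sketch_state *s)
{
    return !sketch_user_mode(s) && sketch_user_stack(s);
}

/* Classifies a state; bank 0 interruption handling is not modelled here. */
static enum sketch_mode sketch_classify(const struct sketch_state *s)
{
    if (sketch_user_mode(s))
        return SKETCH_USER_MODE;
    if (sketch_fsys_mode(s))
        return SKETCH_FSYS_MODE;
    return SKETCH_KERNEL_MODE;
}
```

A state with cpl 0 that is still on the user stacks classifies as fsys-mode; the same privilege level after the stack switch classifies as kernel mode.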
+
+The entry and exit state of an fsyscall handler are as follows:
+
+** Machine state on entry to fsyscall handler:
+
+ - r10    = 0
+ - r11    = saved ar.pfs (a user-level value)
+ - r15    = system call number
+ - r16    = "current" task pointer (in normal kernel-mode, this is in r13)
+ - r32-r39 = system call arguments
+ - b6     = return address (a user-level value)
+ - ar.pfs = previous frame-state (a user-level value)
+ - PSR.be = cleared to zero (i.e., little-endian byte order is in effect)
+ - all other registers may contain values passed in from user-mode
+
+** Required machine state on exit from fsyscall handler:
+
+ - r11    = saved ar.pfs (as passed into the fsyscall handler)
+ - r15    = system call number (as passed into the fsyscall handler)
+ - r32-r39 = system call arguments (as passed into the fsyscall handler)
+ - b6     = return address (as passed into the fsyscall handler)
+ - ar.pfs = previous frame-state (as passed into the fsyscall handler)
+
+Fsyscall handlers can execute with very little overhead, but with that
+speed comes a set of restrictions:
+
+ o Fsyscall-handlers MUST check for any pending work in the flags
+   member of the thread-info structure and if any of the
+   TIF_ALLWORK_MASK flags are set, the handler needs to fall back on
+   doing a full system call (by calling fsys_fallback_syscall).
+
+ o Fsyscall-handlers MUST preserve incoming arguments (r32-r39, r11,
+   r15, b6, and ar.pfs) because they will be needed in case of a
+   system call restart.  Of course, all "preserved" registers also
+   must be preserved, in accordance with the normal calling conventions.
+
+ o Fsyscall-handlers MUST check argument registers for containing a
+   NaT value before using them in any way that could trigger a
+   NaT-consumption fault.  If a system call argument is found to
+   contain a NaT value, an fsyscall-handler may return immediately
+   with r8=EINVAL, r10=-1.
+
+ o Fsyscall-handlers MUST NOT use the "alloc" instruction or perform
+   any other operation that would trigger mandatory RSE
+   (register-stack engine) traffic.
+
+ o Fsyscall-handlers MUST NOT write to any stacked registers because
+   it is not safe to assume that user-level called a handler with the
+   proper number of arguments.
+
+ o Fsyscall-handlers need to be careful when accessing per-CPU variables:
+   unless proper safe-guards are taken (e.g., interruptions are avoided),
+   execution may be pre-empted and resumed on another CPU at any given
+   time.
+
+ o Fsyscall-handlers must be careful not to leak sensitive kernel
+   information back to user-level.  In particular, before returning to
+   user-level, care needs to be taken to clear any scratch registers
+   that could contain sensitive information (note that the current
+   task pointer is not considered sensitive: it's already exposed
+   through ar.k6).
+
+ o Fsyscall-handlers MUST NOT access user-memory without first
+   validating access-permission (this can be done typically via
+   probe.r.fault and/or probe.w.fault) and without guarding against
+   memory access exceptions (this can be done with the EX() macros
+   defined by asmmacro.h).
+
+The above restrictions may seem draconian, but remember that it's
+possible to trade off some of the restrictions by paying a slightly
+higher overhead.  For example, if an fsyscall-handler could benefit
+from the shadow register bank, it could temporarily disable PSR.i and
+PSR.ic, switch to bank 0 (bsw.0) and then use the shadow registers as
+needed.  In other words, following the above rules yields extremely
+fast system call execution (while fully preserving system call
+semantics), but there is also a lot of flexibility in handling more
+complicated cases.
+
+* Signal handling
+
+The delivery of (asynchronous) signals must be delayed until fsys-mode
+is exited.
+This is accomplished with the help of the lower-privilege
+transfer trap: arch/ia64/kernel/process.c:do_notify_resume_user()
+checks whether the interrupted task was in fsys-mode and, if so, sets
+PSR.lp and returns immediately.  When fsys-mode is exited via the
+"br.ret" instruction that lowers the privilege level, a trap will
+occur.  The trap handler clears PSR.lp again and returns immediately.
+The kernel exit path then checks for and delivers any pending signals.
+
+* PSR Handling
+
+The "epc" instruction doesn't change the contents of PSR at all.  This
+is in contrast to a regular interruption, which clears almost all
+bits.  Because of that, some care needs to be taken to ensure things
+work as expected.  The following discussion describes how each PSR bit
+is handled.
+
+PSR.be  Cleared when entering fsys-mode.  A srlz.d instruction is used
+        to ensure the CPU is in little-endian mode before the first
+        load/store instruction is executed.  PSR.be is normally NOT
+        restored upon return from an fsys-mode handler.  In other
+        words, user-level code must not rely on PSR.be being preserved
+        across a system call.
+PSR.up  Unchanged.
+PSR.ac  Unchanged.
+PSR.mfl Unchanged.  Note: fsys-mode handlers must not write-registers!
+PSR.mfh Unchanged.  Note: fsys-mode handlers must not write-registers!
+PSR.ic  Unchanged.  Note: fsys-mode handlers can clear the bit, if needed.
+PSR.i   Unchanged.  Note: fsys-mode handlers can clear the bit, if needed.
+PSR.pk  Unchanged.
+PSR.dt  Unchanged.
+PSR.dfl Unchanged.  Note: fsys-mode handlers must not write-registers!
+PSR.dfh Unchanged.  Note: fsys-mode handlers must not write-registers!
+PSR.sp  Unchanged.
+PSR.pp  Unchanged.
+PSR.di  Unchanged.
+PSR.si  Unchanged.
+PSR.db  Unchanged.  The kernel prevents user-level from setting a hardware
+        breakpoint that triggers at any privilege level other than 3 (user-mode).
+PSR.lp  Unchanged.
+PSR.tb  Lazy redirect.
+        If a taken-branch trap occurs while in
+        fsys-mode, the trap-handler modifies the saved machine state
+        such that execution resumes in the gate page at
+        syscall_via_break(), with privilege level 3.  Note: the
+        taken branch would occur on the branch invoking the
+        fsyscall-handler, at which point, by definition, a syscall
+        restart is still safe.  If the system call number is invalid,
+        the fsys-mode handler will return directly to user-level.  This
+        return will trigger a taken-branch trap, but since the trap is
+        taken _after_ restoring the privilege level, the CPU has already
+        left fsys-mode, so no special treatment is needed.
+PSR.rt  Unchanged.
+PSR.cpl Cleared to 0.
+PSR.is  Unchanged (guaranteed to be 0 on entry to the gate page).
+PSR.mc  Unchanged.
+PSR.it  Unchanged (guaranteed to be 1).
+PSR.id  Unchanged.  Note: the ia64 linux kernel never sets this bit.
+PSR.da  Unchanged.  Note: the ia64 linux kernel never sets this bit.
+PSR.dd  Unchanged.  Note: the ia64 linux kernel never sets this bit.
+PSR.ss  Lazy redirect.  If set, "epc" will cause a Single Step Trap to
+        be taken.  The trap handler then modifies the saved machine
+        state such that execution resumes in the gate page at
+        syscall_via_break(), with privilege level 3.
+PSR.ri  Unchanged.
+PSR.ed  Unchanged.  Note: This bit could only have an effect if an fsys-mode
+        handler performed a speculative load that gets NaTted.  If so, this
+        would be the normal & expected behavior, so no special treatment is
+        needed.
+PSR.bn  Unchanged.  Note: fsys-mode handlers may clear the bit, if needed.
+        Doing so requires clearing PSR.i and PSR.ic as well.
+PSR.ia  Unchanged.  Note: the ia64 linux kernel never sets this bit.
+
+* Using fast system calls
+
+To use fast system calls, userspace applications need simply call
+__kernel_syscall_via_epc().
+For example:
+
+-- example fgettimeofday() call --
+-- fgettimeofday.S --
+
+#include <asm/asmmacro.h>
+
+GLOBAL_ENTRY(fgettimeofday)
+.prologue
+.save ar.pfs, r11
+mov r11 = ar.pfs
+.body
+
+movl r2 = 0xa000000000020660;; // gate address (movl: 64-bit immediate)
+                               // found by inspection of System.map for the
+                               // __kernel_syscall_via_epc() function.  See
+                               // below for how to do this for real.
+
+mov b7 = r2
+mov r15 = 1087                 // gettimeofday syscall
+;;
+br.call.sptk.many b6 = b7
+;;
+
+.restore sp
+
+mov ar.pfs = r11
+br.ret.sptk.many rp;;          // return to caller
+END(fgettimeofday)
+
+-- end fgettimeofday.S --
+
+In reality, getting the gate address is accomplished by two extra
+values passed via the ELF auxiliary vector (include/asm-ia64/elf.h):
+
+ o AT_SYSINFO      : the address of __kernel_syscall_via_epc()
+ o AT_SYSINFO_EHDR : the address of the kernel gate ELF DSO
+
+The ELF DSO is a pre-linked library that is mapped in by the kernel at
+the gate page.  It is a proper ELF shared object so, with a dynamic
+loader that recognises the library, you should be able to make calls to
+the exported functions within it as with any other shared library.
+AT_SYSINFO points into the kernel DSO at the
+__kernel_syscall_via_epc() function for historical reasons (it was
+used before the kernel DSO) and as a convenience.
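As a rough userspace illustration of reading the two auxiliary-vector values named above, the sketch below uses getauxval() from <sys/auxv.h>. That function is an assumption relative to this document: it was added to glibc (2.16) well after this text was written, and a program of that era would have had to walk the auxiliary vector itself. The sketch_* helper names are hypothetical. On architectures whose kernel does not supply AT_SYSINFO, getauxval() returns 0 for it.

```c
#include <assert.h>
#include <elf.h>       /* AT_SYSINFO, AT_SYSINFO_EHDR */
#include <stdio.h>
#include <sys/auxv.h>  /* getauxval(); assumes glibc 2.16 or later */

/* Hypothetical helper: address of the fast system call entry point
 * (__kernel_syscall_via_epc() on ia64), or 0 if the kernel did not
 * pass AT_SYSINFO to this process. */
static unsigned long sketch_gate_entry(void)
{
    return getauxval(AT_SYSINFO);
}

/* Hypothetical helper: base address of the kernel gate ELF DSO,
 * or 0 if AT_SYSINFO_EHDR is absent. */
static unsigned long sketch_gate_dso(void)
{
    return getauxval(AT_SYSINFO_EHDR);
}

/* Prints both values and returns how many of them were present (0..2). */
static int sketch_report_gate(FILE *out)
{
    unsigned long entry = sketch_gate_entry();
    unsigned long dso = sketch_gate_dso();

    assert(out != NULL);
    fprintf(out, "AT_SYSINFO      = %#lx\n", entry);
    fprintf(out, "AT_SYSINFO_EHDR = %#lx\n", dso);
    return (entry != 0) + (dso != 0);
}
```

Calling sketch_report_gate(stdout) from a test program shows which of the two entries the running kernel provides; a caller should treat a zero value as "not available" and use the regular system call path instead.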