diff options
Diffstat (limited to 'man/man7/vdso.7')
-rw-r--r-- | man/man7/vdso.7 | 612 |
1 files changed, 612 insertions, 0 deletions
diff --git a/man/man7/vdso.7 b/man/man7/vdso.7 new file mode 100644 index 0000000..2a9ae35 --- /dev/null +++ b/man/man7/vdso.7 @@ -0,0 +1,612 @@ +'\" t +.\" Written by Mike Frysinger <vapier@gentoo.org> +.\" +.\" %%%LICENSE_START(PUBLIC_DOMAIN) +.\" This page is in the public domain. +.\" %%%LICENSE_END +.\" +.\" Useful background: +.\" http://articles.manugarg.com/systemcallinlinux2_6.html +.\" https://lwn.net/Articles/446528/ +.\" http://www.linuxjournal.com/content/creating-vdso-colonels-other-chicken +.\" http://www.trilithium.com/johan/2005/08/linux-gate/ +.\" +.TH vDSO 7 2024-05-02 "Linux man-pages (unreleased)" +.SH NAME +vdso \- overview of the virtual ELF dynamic shared object +.SH SYNOPSIS +.nf +.B #include <sys/auxv.h> +.P +.B void *vdso = (uintptr_t) getauxval(AT_SYSINFO_EHDR); +.fi +.SH DESCRIPTION +The "vDSO" (virtual dynamic shared object) is a small shared library that +the kernel automatically maps into the +address space of all user-space applications. +Applications usually do not need to concern themselves with these details +as the vDSO is most commonly called by the C library. +This way you can code in the normal way using standard functions +and the C library will take care +of using any functionality that is available via the vDSO. +.P +Why does the vDSO exist at all? +There are some system calls the kernel provides that +user-space code ends up using frequently, +to the point that such calls can dominate overall performance. +This is due both to the frequency of the call as well as the +context-switch overhead that results +from exiting user space and entering the kernel. +.P +The rest of this documentation is geared toward the curious and/or +C library writers rather than general developers. +If you're trying to call the vDSO in your own application rather than using +the C library, you're most likely doing it wrong. +.SS Example background +Making system calls can be slow. +In x86 32-bit systems, you can trigger a software interrupt +.RI ( "int $0x80" ) +to tell the kernel you wish to make a system call. +However, this instruction is expensive: it goes through +the full interrupt-handling paths +in the processor's microcode as well as in the kernel. +Newer processors have faster (but backward incompatible) instructions to +initiate system calls. +Rather than require the C library to figure out if this functionality is +available at run time, +the C library can use functions provided by the kernel in +the vDSO. +.P +Note that the terminology can be confusing. +On x86 systems, the vDSO function +used to determine the preferred method of making a system call is +named "__kernel_vsyscall", but on x86-64, +the term "vsyscall" also refers to an obsolete way to ask the kernel +what time it is or what CPU the caller is on. +.P +One frequently used system call is +.BR gettimeofday (2). +This system call is called both directly by user-space applications +as well as indirectly by +the C library. +Think timestamps or timing loops or polling\[em]all of these +frequently need to know what time it is right now. +This information is also not secret\[em]any application in any +privilege mode (root or any unprivileged user) will get the same answer. +Thus the kernel arranges for the information required to answer +this question to be placed in memory the process can access. +Now a call to +.BR gettimeofday (2) +changes from a system call to a normal function +call and a few memory accesses. +.SS Finding the vDSO +The base address of the vDSO (if one exists) is passed by the kernel to +each program in the initial auxiliary vector (see +.BR getauxval (3)), +via the +.B AT_SYSINFO_EHDR +tag. +.P +You must not assume the vDSO is mapped at any particular location in the +user's memory map. +The base address will usually be randomized at run time every time a new +process image is created (at +.BR execve (2) +time). +This is done for security reasons, +to prevent "return-to-libc" attacks. +.P +For some architectures, there is also an +.B AT_SYSINFO +tag. +This is used only for locating the vsyscall entry point and is frequently +omitted or set to 0 (meaning it's not available). +This tag is a throwback to the initial vDSO work (see +.I History +below) and its use should be avoided. +.SS File format +Since the vDSO is a fully formed ELF image, you can do symbol lookups on it. +This allows new symbols to be added with newer kernel releases, +and allows the C library to detect available functionality at +run time when running under different kernel versions. +Oftentimes the C library will do detection with the first call and then +cache the result for subsequent calls. +.P +All symbols are also versioned (using the GNU version format). +This allows the kernel to update the function signature without breaking +backward compatibility. +This means changing the arguments that the function accepts as well as the +return value. +Thus, when looking up a symbol in the vDSO, +you must always include the version +to match the ABI you expect. +.P +Typically the vDSO follows the naming convention of prefixing +all symbols with "__vdso_" or "__kernel_" +so as to distinguish them from other standard symbols. +For example, the "gettimeofday" function is named "__vdso_gettimeofday". +.P +You use the standard C calling conventions when calling +any of these functions. +No need to worry about weird register or stack behavior. +.SH NOTES +.SS Source +When you compile the kernel, +it will automatically compile and link the vDSO code for you. +You will frequently find it under the architecture-specific directory: +.P +.in +4n +.EX +find arch/$ARCH/ \-name \[aq]*vdso*.so*\[aq] \-o \-name \[aq]*gate*.so*\[aq] +.EE +.in +.\" +.SS vDSO names +The name of the vDSO varies across architectures. +It will often show up in things like glibc's +.BR ldd (1) +output. +The exact name should not matter to any code, so do not hardcode it. +.if t \{\ +.ft CW +\} +.TS +l l. +user ABI vDSO name +_ +aarch64 linux\-vdso.so.1 +arm linux\-vdso.so.1 +ia64 linux\-gate.so.1 +mips linux\-vdso.so.1 +ppc/32 linux\-vdso32.so.1 +ppc/64 linux\-vdso64.so.1 +riscv linux\-vdso.so.1 +s390 linux\-vdso32.so.1 +s390x linux\-vdso64.so.1 +sh linux\-gate.so.1 +i386 linux\-gate.so.1 +x86-64 linux\-vdso.so.1 +x86/x32 linux\-vdso.so.1 +.TE +.if t \{\ +.in +.ft P +\} +.SS strace(1), seccomp(2), and the vDSO +When tracing system calls with +.BR strace (1), +symbols (system calls) that are exported by the vDSO will +.I not +appear in the trace output. +Those system calls will likewise not be visible to +.BR seccomp (2) +filters. +.SH ARCHITECTURE-SPECIFIC NOTES +The subsections below provide architecture-specific notes +on the vDSO. +.P +Note that the vDSO that is used is based on the ABI of your user-space code +and not the ABI of the kernel. +Thus, for example, +when you run an i386 32-bit ELF binary, +you'll get the same vDSO regardless of whether you run it under +an i386 32-bit kernel or under an x86-64 64-bit kernel. +Therefore, the name of the user-space ABI should be used to determine +which of the sections below is relevant. +.SS ARM functions +.\" See linux/arch/arm/vdso/vdso.lds.S +.\" Commit: 8512287a8165592466cb9cb347ba94892e9c56a5 +The table below lists the symbols exported by the vDSO. +.if t \{\ +.ft CW +\} +.TS +l l. +symbol version +_ +__vdso_gettimeofday LINUX_2.6 (exported since Linux 4.1) +__vdso_clock_gettime LINUX_2.6 (exported since Linux 4.1) +.TE +.if t \{\ +.in +.ft P +\} +.P +.\" See linux/arch/arm/kernel/entry-armv.S +.\" See linux/Documentation/arm/kernel_user_helpers.rst +Additionally, the ARM port has a code page full of utility functions. +Since it's just a raw page of code, there is no ELF information for doing +symbol lookups or versioning. +It does provide support for different versions though. +.P +For information on this code page, +it's best to refer to the kernel documentation +as it's extremely detailed and covers everything you need to know: +.IR Documentation/arm/kernel_user_helpers.rst . +.SS aarch64 functions +.\" See linux/arch/arm64/kernel/vdso/vdso.lds.S +The table below lists the symbols exported by the vDSO. +.if t \{\ +.ft CW +\} +.TS +l l. +symbol version +_ +__kernel_rt_sigreturn LINUX_2.6.39 +__kernel_gettimeofday LINUX_2.6.39 +__kernel_clock_gettime LINUX_2.6.39 +__kernel_clock_getres LINUX_2.6.39 +.TE +.if t \{\ +.in +.ft P +\} +.SS bfin (Blackfin) functions (port removed in Linux 4.17) +.\" See linux/arch/blackfin/kernel/fixed_code.S +.\" See http://docs.blackfin.uclinux.org/doku.php?id=linux-kernel:fixed-code +As this CPU lacks a memory management unit (MMU), +it doesn't set up a vDSO in the normal sense. +Instead, it maps at boot time a few raw functions into +a fixed location in memory. +User-space applications then call directly into that region. +There is no provision for backward compatibility +beyond sniffing raw opcodes, +but as this is an embedded CPU, it can get away with things\[em]some of the +object formats it runs aren't even ELF based (they're bFLT/FLAT). +.P +For information on this code page, +it's best to refer to the public documentation: +.br +http://docs.blackfin.uclinux.org/doku.php?id=linux\-kernel:fixed\-code +.SS mips functions +.\" See linux/arch/mips/vdso/vdso.ld.S +The table below lists the symbols exported by the vDSO. +.if t \{\ +.ft CW +\} +.TS +l l. +symbol version +_ +__kernel_gettimeofday LINUX_2.6 (exported since Linux 4.4) +__kernel_clock_gettime LINUX_2.6 (exported since Linux 4.4) +.TE +.if t \{\ +.in +.ft P +\} +.SS ia64 (Itanium) functions +.\" See linux/arch/ia64/kernel/gate.lds.S +.\" Also linux/arch/ia64/kernel/fsys.S and linux/Documentation/ia64/fsys.rst +The table below lists the symbols exported by the vDSO. +.if t \{\ +.ft CW +\} +.TS +l l. +symbol version +_ +__kernel_sigtramp LINUX_2.5 +__kernel_syscall_via_break LINUX_2.5 +__kernel_syscall_via_epc LINUX_2.5 +.TE +.if t \{\ +.in +.ft P +\} +.P +The Itanium port is somewhat tricky. +In addition to the vDSO above, it also has "light-weight system calls" +(also known as "fast syscalls" or "fsys"). +You can invoke these via the +.I __kernel_syscall_via_epc +vDSO helper. +The system calls listed here have the same semantics as if you called them +directly via +.BR syscall (2), +so refer to the relevant +documentation for each. +The table below lists the functions available via this mechanism. +.if t \{\ +.ft CW +\} +.TS +l. +function +_ +clock_gettime +getcpu +getpid +getppid +gettimeofday +set_tid_address +.TE +.if t \{\ +.in +.ft P +\} +.SS parisc (hppa) functions +.\" See linux/arch/parisc/kernel/syscall.S +.\" See linux/Documentation/parisc/registers.rst +The parisc port has a code page with utility functions +called a gateway page. +Rather than use the normal ELF auxiliary vector approach, +it passes the address of +the page to the process via the SR2 register. +The permissions on the page are such that merely executing those addresses +automatically executes with kernel privileges and not in user space. +This is done to match the way HP-UX works. +.P +Since it's just a raw page of code, there is no ELF information for doing +symbol lookups or versioning. +Simply call into the appropriate offset via the branch instruction, +for example: +.P +.in +4n +.EX +ble <offset>(%sr2, %r0) +.EE +.in +.if t \{\ +.ft CW +\} +.TS +l l. +offset function +_ +00b0 lws_entry (CAS operations) +00e0 set_thread_pointer (used by glibc) +0100 linux_gateway_entry (syscall) +.TE +.if t \{\ +.in +.ft P +\} +.SS ppc/32 functions +.\" See linux/arch/powerpc/kernel/vdso32/vdso32.lds.S +The table below lists the symbols exported by the vDSO. +The functions marked with a +.I * +are available only when the kernel is +a PowerPC64 (64-bit) kernel. +.if t \{\ +.ft CW +\} +.TS +l l. +symbol version +_ +__kernel_clock_getres LINUX_2.6.15 +__kernel_clock_gettime LINUX_2.6.15 +__kernel_clock_gettime64 LINUX_5.11 +__kernel_datapage_offset LINUX_2.6.15 +__kernel_get_syscall_map LINUX_2.6.15 +__kernel_get_tbfreq LINUX_2.6.15 +__kernel_getcpu \fI*\fR LINUX_2.6.15 +__kernel_gettimeofday LINUX_2.6.15 +__kernel_sigtramp_rt32 LINUX_2.6.15 +__kernel_sigtramp32 LINUX_2.6.15 +__kernel_sync_dicache LINUX_2.6.15 +__kernel_sync_dicache_p5 LINUX_2.6.15 +.TE +.if t \{\ +.in +.ft P +\} +.P +Before Linux 5.6, +.\" commit 654abc69ef2e69712e6d4e8a6cb9292b97a4aa39 +the +.B CLOCK_REALTIME_COARSE +and +.B CLOCK_MONOTONIC_COARSE +clocks are +.I not +supported by the +.I __kernel_clock_getres +and +.I __kernel_clock_gettime +interfaces; +the kernel falls back to the real system call. +.SS ppc/64 functions +.\" See linux/arch/powerpc/kernel/vdso64/vdso64.lds.S +The table below lists the symbols exported by the vDSO. +.if t \{\ +.ft CW +\} +.TS +l l. +symbol version +_ +__kernel_clock_getres LINUX_2.6.15 +__kernel_clock_gettime LINUX_2.6.15 +__kernel_datapage_offset LINUX_2.6.15 +__kernel_get_syscall_map LINUX_2.6.15 +__kernel_get_tbfreq LINUX_2.6.15 +__kernel_getcpu LINUX_2.6.15 +__kernel_gettimeofday LINUX_2.6.15 +__kernel_sigtramp_rt64 LINUX_2.6.15 +__kernel_sync_dicache LINUX_2.6.15 +__kernel_sync_dicache_p5 LINUX_2.6.15 +.TE +.if t \{\ +.in +.ft P +\} +.P +Before Linux 4.16, +.\" commit 5c929885f1bb4b77f85b1769c49405a0e0f154a1 +the +.B CLOCK_REALTIME_COARSE +and +.B CLOCK_MONOTONIC_COARSE +clocks are +.I not +supported by the +.I __kernel_clock_getres +and +.I __kernel_clock_gettime +interfaces; +the kernel falls back to the real system call. +.SS riscv functions +.\" See linux/arch/riscv/kernel/vdso/vdso.lds.S +The table below lists the symbols exported by the vDSO. +.if t \{\ +.ft CW +\} +.TS +l l. +symbol version +_ +__vdso_rt_sigreturn LINUX_4.15 +__vdso_gettimeofday LINUX_4.15 +__vdso_clock_gettime LINUX_4.15 +__vdso_clock_getres LINUX_4.15 +__vdso_getcpu LINUX_4.15 +__vdso_flush_icache LINUX_4.15 +.TE +.if t \{\ +.in +.ft P +\} +.SS s390 functions +.\" See linux/arch/s390/kernel/vdso32/vdso32.lds.S +The table below lists the symbols exported by the vDSO. +.if t \{\ +.ft CW +\} +.TS +l l. +symbol version +_ +__kernel_clock_getres LINUX_2.6.29 +__kernel_clock_gettime LINUX_2.6.29 +__kernel_gettimeofday LINUX_2.6.29 +.TE +.if t \{\ +.in +.ft P +\} +.SS s390x functions +.\" See linux/arch/s390/kernel/vdso64/vdso64.lds.S +The table below lists the symbols exported by the vDSO. +.if t \{\ +.ft CW +\} +.TS +l l. +symbol version +_ +__kernel_clock_getres LINUX_2.6.29 +__kernel_clock_gettime LINUX_2.6.29 +__kernel_gettimeofday LINUX_2.6.29 +.TE +.if t \{\ +.in +.ft P +\} +.SS sh (SuperH) functions +.\" See linux/arch/sh/kernel/vsyscall/vsyscall.lds.S +The table below lists the symbols exported by the vDSO. +.if t \{\ +.ft CW +\} +.TS +l l. +symbol version +_ +__kernel_rt_sigreturn LINUX_2.6 +__kernel_sigreturn LINUX_2.6 +__kernel_vsyscall LINUX_2.6 +.TE +.if t \{\ +.in +.ft P +\} +.SS i386 functions +.\" See linux/arch/x86/vdso/vdso32/vdso32.lds.S +The table below lists the symbols exported by the vDSO. +.if t \{\ +.ft CW +\} +.TS +l l. +symbol version +_ +__kernel_sigreturn LINUX_2.5 +__kernel_rt_sigreturn LINUX_2.5 +__kernel_vsyscall LINUX_2.5 +.\" Added in 7a59ed415f5b57469e22e41fc4188d5399e0b194 and updated +.\" in 37c975545ec63320789962bf307f000f08fabd48. +__vdso_clock_gettime LINUX_2.6 (exported since Linux 3.15) +__vdso_gettimeofday LINUX_2.6 (exported since Linux 3.15) +__vdso_time LINUX_2.6 (exported since Linux 3.15) +.TE +.if t \{\ +.in +.ft P +\} +.SS x86-64 functions +.\" See linux/arch/x86/vdso/vdso.lds.S +The table below lists the symbols exported by the vDSO. +All of these symbols are also available without the "__vdso_" prefix, but +you should ignore those and stick to the names below. +.if t \{\ +.ft CW +\} +.TS +l l. +symbol version +_ +__vdso_clock_gettime LINUX_2.6 +__vdso_getcpu LINUX_2.6 +__vdso_gettimeofday LINUX_2.6 +__vdso_time LINUX_2.6 +.TE +.if t \{\ +.in +.ft P +\} +.SS x86/x32 functions +.\" See linux/arch/x86/vdso/vdso32.lds.S +The table below lists the symbols exported by the vDSO. +.if t \{\ +.ft CW +\} +.TS +l l. +symbol version +_ +__vdso_clock_gettime LINUX_2.6 +__vdso_getcpu LINUX_2.6 +__vdso_gettimeofday LINUX_2.6 +__vdso_time LINUX_2.6 +.TE +.if t \{\ +.in +.ft P +\} +.SS History +The vDSO was originally just a single function\[em]the vsyscall. +In older kernels, you might see that name +in a process's memory map rather than "vdso". +Over time, people realized that this mechanism +was a great way to pass more functionality +to user space, so it was reconceived as a vDSO in the current format. +.SH SEE ALSO +.BR syscalls (2), +.BR getauxval (3), +.BR proc (5) +.P +The documents, examples, and source code in the Linux source code tree: +.P +.in +4n +.EX +Documentation/ABI/stable/vdso +Documentation/ia64/fsys.rst +Documentation/vDSO/* (includes examples of using the vDSO) +.P +find arch/ \-iname \[aq]*vdso*\[aq] \-o \-iname \[aq]*gate*\[aq] +.EE +.in |