From 399644e47874bff147afb19c89228901ac39340e Mon Sep 17 00:00:00 2001
From: Daniel Baumann <daniel.baumann@progress-linux.org>
Date: Mon, 15 Apr 2024 21:40:15 +0200
Subject: Adding upstream version 6.05.01.

Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
---
 man2/open.2 | 1934 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 1934 insertions(+)
 create mode 100644 man2/open.2

diff --git a/man2/open.2 b/man2/open.2
new file mode 100644
index 0000000..52286f6
--- /dev/null
+++ b/man2/open.2
@@ -0,0 +1,1934 @@
+.\" This manpage is Copyright (C) 1992 Drew Eckhardt;
+.\" and Copyright (C) 1993 Michael Haardt, Ian Jackson.
+.\" and Copyright (C) 2008 Greg Banks
+.\" and Copyright (C) 2006, 2008, 2013, 2014 Michael Kerrisk <mtk.manpages@gmail.com>
+.\"
+.\" SPDX-License-Identifier: Linux-man-pages-copyleft
+.\"
+.\" Modified 1993-07-21 by Rik Faith <faith@cs.unc.edu>
+.\" Modified 1994-08-21 by Michael Haardt
+.\" Modified 1996-04-13 by Andries Brouwer <aeb@cwi.nl>
+.\" Modified 1996-05-13 by Thomas Koenig
+.\" Modified 1996-12-20 by Michael Haardt
+.\" Modified 1999-02-19 by Andries Brouwer <aeb@cwi.nl>
+.\" Modified 1998-11-28 by Joseph S. Myers <jsm28@hermes.cam.ac.uk>
+.\" Modified 1999-06-03 by Michael Haardt
+.\" Modified 2002-05-07 by Michael Kerrisk <mtk.manpages@gmail.com>
+.\" Modified 2004-06-23 by Michael Kerrisk <mtk.manpages@gmail.com>
+.\" 2004-12-08, mtk, reordered flags list alphabetically
+.\" 2004-12-08, Martin Pool <mbp@sourcefrog.net> (& mtk), added O_NOATIME
+.\" 2007-09-18, mtk, Added description of O_CLOEXEC + other minor edits
+.\" 2008-01-03, mtk, with input from Trond Myklebust
+.\"     <trond.myklebust@fys.uio.no> and Timo Sirainen <tss@iki.fi>
+.\"     Rewrite description of O_EXCL.
+.\" 2008-01-11, Greg Banks <gnb@melbourne.sgi.com>: add more detail
+.\"     on O_DIRECT.
+.\" 2008-02-26, Michael Haardt: Reorganized text for O_CREAT and mode
+.\"
+.\" FIXME . Apr 08: The next POSIX revision has O_EXEC, O_SEARCH, and
+.\" O_TTYINIT.  Eventually these may need to be documented.  --mtk
+.\"
+.TH open 2 2023-05-20 "Linux man-pages 6.05.01"
+.SH NAME
+open, openat, creat \- open and possibly create a file
+.SH LIBRARY
+Standard C library
+.RI ( libc ", " \-lc )
+.SH SYNOPSIS
+.nf
+.B #include <fcntl.h>
+.PP
+.BI "int open(const char *" pathname ", int " flags ", ..."
+.BI "           \fR/*\fP mode_t " mode " \fR*/\fP );"
+.PP
+.BI "int creat(const char *" pathname ", mode_t " mode );
+.PP
+.BI "int openat(int " dirfd ", const char *" pathname ", int " flags ", ..."
+.BI "           \fR/*\fP mode_t " mode " \fR*/\fP );"
+.PP
+/* Documented separately, in \fBopenat2\fP(2): */
+.BI "int openat2(int " dirfd ", const char *" pathname ,
+.BI "           const struct open_how *" how ", size_t " size );
+.fi
+.PP
+.RS -4
+Feature Test Macro Requirements for glibc (see
+.BR feature_test_macros (7)):
+.RE
+.PP
+.BR openat ():
+.nf
+    Since glibc 2.10:
+        _POSIX_C_SOURCE >= 200809L
+    Before glibc 2.10:
+        _ATFILE_SOURCE
+.fi
+.SH DESCRIPTION
+The
+.BR open ()
+system call opens the file specified by
+.IR pathname .
+If the specified file does not exist,
+it may optionally (if
+.B O_CREAT
+is specified in
+.IR flags )
+be created by
+.BR open ().
+.PP
+The return value of
+.BR open ()
+is a file descriptor, a small, nonnegative integer that is an index
+to an entry in the process's table of open file descriptors.
+The file descriptor is used
+in subsequent system calls
+.RB ( read "(2), " write "(2), " lseek "(2), " fcntl (2),
+etc.) to refer to the open file.
+The file descriptor returned by a successful call will be
+the lowest-numbered file descriptor not currently open for the process.
+.PP
+By default, the new file descriptor is set to remain open across an
+.BR execve (2)
+(i.e., the
+.B FD_CLOEXEC
+file descriptor flag described in
+.BR fcntl (2)
+is initially disabled); the
+.B O_CLOEXEC
+flag, described below, can be used to change this default.
+The file offset is set to the beginning of the file (see
+.BR lseek (2)).
+.PP
+A call to
+.BR open ()
+creates a new
+.IR "open file description" ,
+an entry in the system-wide table of open files.
+The open file description records the file offset and the file status flags
+(see below).
+A file descriptor is a reference to an open file description;
+this reference is unaffected if
+.I pathname
+is subsequently removed or modified to refer to a different file.
+For further details on open file descriptions, see NOTES.
+.PP
+The argument
+.I flags
+must include one of the following
+.IR "access modes" :
+.BR O_RDONLY ", " O_WRONLY ", or " O_RDWR .
+These request opening the file read-only, write-only, or read/write,
+respectively.
+.PP
+In addition, zero or more file creation flags and file status flags
+can be
+bitwise ORed
+in
+.IR flags .
+The
+.I file creation flags
+are
+.BR O_CLOEXEC ,
+.BR O_CREAT ,
+.BR O_DIRECTORY ,
+.BR O_EXCL ,
+.BR O_NOCTTY ,
+.BR O_NOFOLLOW ,
+.BR O_TMPFILE ,
+and
+.BR O_TRUNC .
+The
+.I file status flags
+are all of the remaining flags listed below.
+.\" SUSv4 divides the flags into:
+.\" * Access mode
+.\" * File creation
+.\" * File status
+.\" * Other (O_CLOEXEC, O_DIRECTORY, O_NOFOLLOW)
+.\" though it's not clear what the difference between "other" and
+.\" "File creation" flags is.  I raised an Aardvark to see if this
+.\" can be clarified in SUSv4; 10 Oct 2008.
+.\" http://thread.gmane.org/gmane.comp.standards.posix.austin.general/64/focus=67
+.\" TC1 (balloted in 2013), resolved this, so that those three constants
+.\" are also categorized as file status flags.
+.\"
+The distinction between these two groups of flags is that
+the file creation flags affect the semantics of the open operation itself,
+while the file status flags affect the semantics of subsequent I/O operations.
+The file status flags can be retrieved and (in some cases)
+modified; see
+.BR fcntl (2)
+for details.
+.PP
+The full list of file creation flags and file status flags is as follows:
+.TP
+.B O_APPEND
+The file is opened in append mode.
+Before each
+.BR write (2),
+the file offset is positioned at the end of the file,
+as if with
+.BR lseek (2).
+The modification of the file offset and the write operation
+are performed as a single atomic step.
+.IP
+.B O_APPEND
+may lead to corrupted files on NFS filesystems if more than one process
+appends data to a file at once.
+.\" For more background, see
+.\" http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=453946
+.\" http://nfs.sourceforge.net/
+This is because NFS does not support
+appending to a file, so the client kernel has to simulate it, which
+can't be done without a race condition.
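+.IP
+As a minimal sketch (the pathname
+.I app.log
+and the message text are placeholders,
+and error handling via
+.BR err (3)
+is an assumption),
+a process appending records to a log file on a local filesystem
+might open it as follows:
+.IP
+.in +4n
+.EX
+int fd;
+\&
+fd = open("app.log", O_WRONLY | O_APPEND | O_CREAT,
+          S_IRUSR | S_IWUSR);
+if (fd == \-1)
+    err(EXIT_FAILURE, "open");
+\&
+/* The offset is moved to end-of-file atomically
+   as part of each write(2). */
+if (write(fd, "started\en", 8) == \-1)
+    err(EXIT_FAILURE, "write");
+.EE
+.in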
+.TP
+.B O_ASYNC
+Enable signal-driven I/O:
+generate a signal
+.RB ( SIGIO
+by default, but this can be changed via
+.BR fcntl (2))
+when input or output becomes possible on this file descriptor.
+This feature is available only for terminals, pseudoterminals,
+sockets, and (since Linux 2.6) pipes and FIFOs.
+See
+.BR fcntl (2)
+for further details.
+See also BUGS, below.
+.TP
+.BR O_CLOEXEC " (since Linux 2.6.23)"
+.\" NOTE! several other man pages refer to this text
+Enable the close-on-exec flag for the new file descriptor.
+.\" FIXME . for later review when Issue 8 is one day released...
+.\" POSIX proposes to fix many APIs that provide hidden FDs
+.\" http://austingroupbugs.net/tag_view_page.php?tag_id=8
+.\" http://austingroupbugs.net/view.php?id=368
+Specifying this flag permits a program to avoid additional
+.BR fcntl (2)
+.B F_SETFD
+operations to set the
+.B FD_CLOEXEC
+flag.
+.IP
+Note that the use of this flag is essential in some multithreaded programs,
+because using a separate
+.BR fcntl (2)
+.B F_SETFD
+operation to set the
+.B FD_CLOEXEC
+flag does not suffice to avoid race conditions
+where one thread opens a file descriptor and
+attempts to set its close-on-exec flag using
+.BR fcntl (2)
+at the same time as another thread does a
+.BR fork (2)
+plus
+.BR execve (2).
+Depending on the order of execution,
+the race may lead to the file descriptor returned by
+.BR open ()
+being unintentionally leaked to the program executed by the child process
+created by
+.BR fork (2).
+(This kind of race is in principle possible for any system call
+that creates a file descriptor whose close-on-exec flag should be set,
+and various other Linux system calls provide an equivalent of the
+.B O_CLOEXEC
+flag to deal with this problem.)
+.\" This flag fixes only one form of the race condition;
+.\" The race can also occur with, for example, file descriptors
+.\" returned by accept(), pipe(), etc.
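+.IP
+The difference between the two approaches can be sketched as follows
+(the pathname
+.I data.txt
+is a placeholder):
+.IP
+.in +4n
+.EX
+/* Racy in a multithreaded program: another thread may call
+   fork(2) and execve(2) between these two calls. */
+fd = open("data.txt", O_RDONLY);
+if (fd != \-1)
+    fcntl(fd, F_SETFD, FD_CLOEXEC);
+\&
+/* No window: the flag is set as part of the open itself. */
+fd = open("data.txt", O_RDONLY | O_CLOEXEC);
+.EE
+.in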
+.TP
+.B O_CREAT
+If
+.I pathname
+does not exist, create it as a regular file.
+.IP
+The owner (user ID) of the new file is set to the effective user ID
+of the process.
+.IP
+The group ownership (group ID) of the new file is set either to
+the effective group ID of the process (System V semantics)
+or to the group ID of the parent directory (BSD semantics).
+On Linux, the behavior depends on whether the
+set-group-ID mode bit is set on the parent directory:
+if that bit is set, then BSD semantics apply;
+otherwise, System V semantics apply.
+For some filesystems, the behavior also depends on the
+.I bsdgroups
+and
+.I sysvgroups
+mount options described in
+.BR mount (8).
+.\" As at Linux 2.6.25, bsdgroups is supported by ext2, ext3, ext4, and
+.\" XFS (since Linux 2.6.14).
+.IP
+The
+.I mode
+argument specifies the file mode bits to be applied when a new file is created.
+If neither
+.B O_CREAT
+nor
+.B O_TMPFILE
+is specified in
+.IR flags ,
+then
+.I mode
+is ignored (and can thus be specified as 0, or simply omitted).
+The
+.I mode
+argument
+.B must
+be supplied if
+.B O_CREAT
+or
+.B O_TMPFILE
+is specified in
+.IR flags ;
+if it is not supplied,
+some arbitrary bytes from the stack will be applied as the file mode.
+.IP
+The effective mode is modified by the process's
+.I umask
+in the usual way: in the absence of a default ACL, the mode of the
+created file is
+.IR "(mode\ &\ \[ti]umask)" .
+.IP
+Note that
+.I mode
+applies only to future accesses of the
+newly created file; the
+.BR open ()
+call that creates a read-only file may well return a read/write
+file descriptor.
+.IP
+The following symbolic constants are provided for
+.IR mode :
+.RS
+.TP 9
+.B S_IRWXU
+00700 user (file owner) has read, write, and execute permission
+.TP
+.B S_IRUSR
+00400 user has read permission
+.TP
+.B S_IWUSR
+00200 user has write permission
+.TP
+.B S_IXUSR
+00100 user has execute permission
+.TP
+.B S_IRWXG
+00070 group has read, write, and execute permission
+.TP
+.B S_IRGRP
+00040 group has read permission
+.TP
+.B S_IWGRP
+00020 group has write permission
+.TP
+.B S_IXGRP
+00010 group has execute permission
+.TP
+.B S_IRWXO
+00007 others have read, write, and execute permission
+.TP
+.B S_IROTH
+00004 others have read permission
+.TP
+.B S_IWOTH
+00002 others have write permission
+.TP
+.B S_IXOTH
+00001 others have execute permission
+.RE
+.IP
+According to POSIX, the effect when other bits are set in
+.I mode
+is unspecified.
+On Linux, the following bits are also honored in
+.IR mode :
+.RS
+.TP 9
+.B S_ISUID
+0004000 set-user-ID bit
+.TP
+.B S_ISGID
+0002000 set-group-ID bit (see
+.BR inode (7)).
+.TP
+.B S_ISVTX
+0001000 sticky bit (see
+.BR inode (7)).
+.RE
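+.IP
+For illustration (the pathname
+.I notes.txt
+and a umask of 022 are assumptions),
+the following call requests mode 0666 for a new file;
+with that umask, and in the absence of a default ACL,
+the file is created with mode 0644:
+.IP
+.in +4n
+.EX
+/* The six constants below OR together to 0666;
+   (0666 & \[ti]022) == 0644 */
+fd = open("notes.txt", O_WRONLY | O_CREAT,
+          S_IRUSR | S_IWUSR | S_IRGRP | S_IWGRP |
+          S_IROTH | S_IWOTH);
+if (fd == \-1)
+    err(EXIT_FAILURE, "open");
+.EE
+.in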
+.TP
+.BR O_DIRECT " (since Linux 2.4.10)"
+Try to minimize cache effects of the I/O to and from this file.
+In general this will degrade performance, but it is useful in
+special situations, such as when applications do their own caching.
+File I/O is done directly to/from user-space buffers.
+The
+.B O_DIRECT
+flag on its own makes an effort to transfer data synchronously,
+but does not give the guarantees of the
+.B O_SYNC
+flag that data and necessary metadata are transferred.
+To guarantee synchronous I/O,
+.B O_SYNC
+must be used in addition to
+.BR O_DIRECT .
+See NOTES below for further discussion.
+.IP
+A semantically similar (but deprecated) interface for block devices
+is described in
+.BR raw (8).
+.TP
+.B O_DIRECTORY
+If \fIpathname\fP is not a directory, cause the open to fail.
+.\" But see the following and its replies:
+.\" http://marc.theaimsgroup.com/?t=112748702800001&r=1&w=2
+.\" [PATCH] open: O_DIRECTORY and O_CREAT together should fail
+.\" O_DIRECTORY | O_CREAT causes O_DIRECTORY to be ignored.
+This flag was added in Linux 2.1.126, to
+avoid denial-of-service problems if
+.BR opendir (3)
+is called on a
+FIFO or tape device.
+.TP
+.B O_DSYNC
+Write operations on the file will complete according to the requirements of
+synchronized I/O
+.I data
+integrity completion.
+.IP
+By the time
+.BR write (2)
+(and similar)
+return, the output data
+has been transferred to the underlying hardware,
+along with any file metadata that would be required to retrieve that data
+(i.e., as though each
+.BR write (2)
+was followed by a call to
+.BR fdatasync (2)).
+.IR "See NOTES below" .
+.TP
+.B O_EXCL
+Ensure that this call creates the file:
+if this flag is specified in conjunction with
+.BR O_CREAT ,
+and
+.I pathname
+already exists, then
+.BR open ()
+fails with the error
+.BR EEXIST .
+.IP
+When these two flags are specified, symbolic links are not followed:
+.\" POSIX.1-2001 explicitly requires this behavior.
+if
+.I pathname
+is a symbolic link, then
+.BR open ()
+fails regardless of where the symbolic link points.
+.IP
+In general, the behavior of
+.B O_EXCL
+is undefined if it is used without
+.BR O_CREAT .
+There is one exception: on Linux 2.6 and later,
+.B O_EXCL
+can be used without
+.B O_CREAT
+if
+.I pathname
+refers to a block device.
+If the block device is in use by the system (e.g., mounted),
+.BR open ()
+fails with the error
+.BR EBUSY .
+.IP
+On NFS,
+.B O_EXCL
+is supported only when using NFSv3 or later on kernel 2.6 or later.
+In NFS environments where
+.B O_EXCL
+support is not provided, programs that rely on it
+for performing locking tasks will contain a race condition.
+Portable programs that want to perform atomic file locking using a lockfile,
+and need to avoid reliance on NFS support for
+.BR O_EXCL ,
+can create a unique file on
+the same filesystem (e.g., incorporating hostname and PID), and use
+.BR link (2)
+to make a link to the lockfile.
+If
+.BR link (2)
+returns 0, the lock is successful.
+Otherwise, use
+.BR stat (2)
+on the unique file to check if its link count has increased to 2,
+in which case the lock is also successful.
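+.IP
+The procedure above might be sketched as follows.
+The names
+.I lockfile
+and
+.I unique
+are hypothetical, and
+.I unique
+is assumed to already hold a pathname on the same filesystem
+that incorporates the hostname and PID:
+.IP
+.in +4n
+.EX
+struct stat  sb;
+int          fd, locked;
+\&
+fd = open(unique, O_WRONLY | O_CREAT, S_IRUSR | S_IWUSR);
+if (fd == \-1)
+    err(EXIT_FAILURE, "open");
+close(fd);
+\&
+locked = 0;
+if (link(unique, "lockfile") == 0)
+    locked = 1;
+else if (stat(unique, &sb) == 0 && sb.st_nlink == 2)
+    locked = 1;
+\&
+unlink(unique);    /* The lock, if held, is "lockfile" itself */
+.EE
+.in
+.IP
+In this sketch, the lock would later be released by removing
+.I lockfile
+with
+.BR unlink (2).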
+.TP
+.B O_LARGEFILE
+(LFS)
+Allow files whose sizes cannot be represented in an
+.I off_t
+(but can be represented in an
+.IR off64_t )
+to be opened.
+The
+.B _LARGEFILE64_SOURCE
+macro must be defined
+(before including
+.I any
+header files)
+in order to obtain this definition.
+Setting the
+.B _FILE_OFFSET_BITS
+feature test macro to 64 (rather than using
+.BR O_LARGEFILE )
+is the preferred
+method of accessing large files on 32-bit systems (see
+.BR feature_test_macros (7)).
+.TP
+.BR O_NOATIME " (since Linux 2.6.8)"
+Do not update the file last access time
+.RI ( st_atime
+in the inode)
+when the file is
+.BR read (2).
+.IP
+This flag can be employed only if one of the following conditions is true:
+.RS
+.IP \[bu] 3
+The effective UID of the process
+.\" Strictly speaking: the filesystem UID
+matches the owner UID of the file.
+.IP \[bu]
+The calling process has the
+.B CAP_FOWNER
+capability in its user namespace and
+the owner UID of the file has a mapping in the namespace.
+.RE
+.IP
+This flag is intended for use by indexing or backup programs,
+where its use can significantly reduce the amount of disk activity.
+This flag may not be effective on all filesystems.
+One example is NFS, where the server maintains the access time.
+.\" The O_NOATIME flag also affects the treatment of st_atime
+.\" by mmap() and readdir(2), MTK, Dec 04.
+.TP
+.B O_NOCTTY
+If
+.I pathname
+refers to a terminal device\[em]see
+.BR tty (4)\[em]it
+will not become the process's controlling terminal even if the
+process does not have one.
+.TP
+.B O_NOFOLLOW
+If the trailing component (i.e., basename) of
+.I pathname
+is a symbolic link, then the open fails, with the error
+.BR ELOOP .
+Symbolic links in earlier components of the pathname will still be
+followed.
+(Note that the
+.B ELOOP
+error that can occur in this case is indistinguishable from the case where
+an open fails because there are too many symbolic links found
+while resolving components in the prefix part of the pathname.)
+.IP
+This flag is a FreeBSD extension, which was added in Linux 2.1.126,
+and has subsequently been standardized in POSIX.1-2008.
+.IP
+See also
+.B O_PATH
+below.
+.\" The headers from glibc 2.0.100 and later include a
+.\" definition of this flag; \fIkernels before Linux 2.1.126 will ignore it if
+.\" used\fP.
+.TP
+.BR O_NONBLOCK " or " O_NDELAY
+When possible, the file is opened in nonblocking mode.
+Neither the
+.BR open ()
+nor any subsequent I/O operations on the file descriptor which is
+returned will cause the calling process to wait.
+.IP
+Note that the setting of this flag has no effect on the operation of
+.BR poll (2),
+.BR select (2),
+.BR epoll (7),
+and similar,
+since those interfaces merely inform the caller about whether
+a file descriptor is "ready",
+meaning that an I/O operation performed on
+the file descriptor with the
+.B O_NONBLOCK
+flag
+.I clear
+would not block.
+.IP
+Note that this flag has no effect for regular files and block devices;
+that is, I/O operations will (briefly) block when device activity
+is required, regardless of whether
+.B O_NONBLOCK
+is set.
+Since
+.B O_NONBLOCK
+semantics might eventually be implemented,
+applications should not depend upon blocking behavior
+when specifying this flag for regular files and block devices.
+.IP
+For the handling of FIFOs (named pipes), see also
+.BR fifo (7).
+For a discussion of the effect of
+.B O_NONBLOCK
+in conjunction with mandatory file locks and with file leases, see
+.BR fcntl (2).
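+.IP
+As a small sketch (the pathname
+.I myfifo
+is a placeholder),
+a reader can open a FIFO without waiting for a writer to appear:
+.IP
+.in +4n
+.EX
+/* Returns immediately, even if no process has the
+   FIFO open for writing; see fifo(7). */
+fd = open("myfifo", O_RDONLY | O_NONBLOCK);
+if (fd == \-1)
+    err(EXIT_FAILURE, "open");
+.EE
+.in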
+.TP
+.BR O_PATH " (since Linux 2.6.39)"
+.\" commit 1abf0c718f15a56a0a435588d1b104c7a37dc9bd
+.\" commit 326be7b484843988afe57566b627fb7a70beac56
+.\" commit 65cfc6722361570bfe255698d9cd4dccaf47570d
+.\"
+.\" http://thread.gmane.org/gmane.linux.man/2790/focus=3496
+.\"	Subject: Re: [PATCH] open(2): document O_PATH
+.\"	Newsgroups: gmane.linux.man, gmane.linux.kernel
+.\"
+Obtain a file descriptor that can be used for two purposes:
+to indicate a location in the filesystem tree and
+to perform operations that act purely at the file descriptor level.
+The file itself is not opened, and other file operations (e.g.,
+.BR read (2),
+.BR write (2),
+.BR fchmod (2),
+.BR fchown (2),
+.BR fgetxattr (2),
+.BR ioctl (2),
+.BR mmap (2))
+fail with the error
+.BR EBADF .
+.IP
+The following operations
+.I can
+be performed on the resulting file descriptor:
+.RS
+.IP \[bu] 3
+.BR close (2).
+.IP \[bu]
+.BR fchdir (2),
+if the file descriptor refers to a directory
+(since Linux 3.5).
+.\" commit 332a2e1244bd08b9e3ecd378028513396a004a24
+.IP \[bu]
+.BR fstat (2)
+(since Linux 3.6).
+.IP \[bu]
+.\" fstat(): commit 55815f70147dcfa3ead5738fd56d3574e2e3c1c2
+.BR fstatfs (2)
+(since Linux 3.12).
+.\" fstatfs(): commit 9d05746e7b16d8565dddbe3200faa1e669d23bbf
+.IP \[bu]
+Duplicating the file descriptor
+.RB ( dup (2),
+.BR fcntl (2)
+.BR F_DUPFD ,
+etc.).
+.IP \[bu]
+Getting and setting file descriptor flags
+.RB ( fcntl (2)
+.B F_GETFD
+and
+.BR F_SETFD ).
+.IP \[bu]
+Retrieving open file status flags using the
+.BR fcntl (2)
+.B F_GETFL
+operation: the returned flags will include the bit
+.BR O_PATH .
+.IP \[bu]
+Passing the file descriptor as the
+.I dirfd
+argument of
+.BR openat ()
+and the other "*at()" system calls.
+This includes
+.BR linkat (2)
+with
+.B AT_EMPTY_PATH
+(or via procfs using
+.BR AT_SYMLINK_FOLLOW )
+even if the file is not a directory.
+.IP \[bu]
+Passing the file descriptor to another process via a UNIX domain socket
+(see
+.B SCM_RIGHTS
+in
+.BR unix (7)).
+.RE
+.IP
+When
+.B O_PATH
+is specified in
+.IR flags ,
+flag bits other than
+.BR O_CLOEXEC ,
+.BR O_DIRECTORY ,
+and
+.B O_NOFOLLOW
+are ignored.
+.IP
+Opening a file or directory with the
+.B O_PATH
+flag requires no permissions on the object itself
+(but does require execute permission on the directories in the path prefix).
+Depending on the subsequent operation,
+a check for suitable file permissions may be performed (e.g.,
+.BR fchdir (2)
+requires execute permission on the directory referred to
+by its file descriptor argument).
+By contrast,
+obtaining a reference to a filesystem object by opening it with the
+.B O_RDONLY
+flag requires that the caller have read permission on the object,
+even when the subsequent operation (e.g.,
+.BR fchdir (2),
+.BR fstat (2))
+does not require read permission on the object.
+.IP
+If
+.I pathname
+is a symbolic link and the
+.B O_NOFOLLOW
+flag is also specified,
+then the call returns a file descriptor referring to the symbolic link.
+This file descriptor can be used as the
+.I dirfd
+argument in calls to
+.BR fchownat (2),
+.BR fstatat (2),
+.BR linkat (2),
+and
+.BR readlinkat (2)
+with an empty pathname to have the calls operate on the symbolic link.
+.IP
+If
+.I pathname
+refers to an automount point that has not yet been triggered, so no
+other filesystem is mounted on it, then the call returns a file
+descriptor referring to the automount directory without triggering a mount.
+.BR fstatfs (2)
+can then be used to determine if it is, in fact, an untriggered
+automount point
+.RB ( ".f_type == AUTOFS_SUPER_MAGIC" ).
+.IP
+One use of
+.B O_PATH
+for regular files is to provide the equivalent of POSIX.1's
+.B O_EXEC
+functionality.
+This permits us to open a file for which we have execute
+permission but not read permission, and then execute that file,
+with steps something like the following:
+.IP
+.in +4n
+.EX
+char buf[PATH_MAX];
+int  fd;
+/* O_PATH: no read permission on "some_prog" is needed */
+fd = open("some_prog", O_PATH);
+if (fd != \-1) {
+    snprintf(buf, sizeof(buf), "/proc/self/fd/%d", fd);
+    execl(buf, "some_prog", (char *) NULL);
+}
+.EE
+.in
+.IP
+An
+.B O_PATH
+file descriptor can also be passed as the argument of
+.BR fexecve (3).
+.TP
+.B O_SYNC
+Write operations on the file will complete according to the requirements of
+synchronized I/O
+.I file
+integrity completion
+(by contrast with the
+synchronized I/O
+.I data
+integrity completion
+provided by
+.BR O_DSYNC ).
+.IP
+By the time
+.BR write (2)
+(or similar)
+returns, the output data and associated file metadata
+have been transferred to the underlying hardware
+(i.e., as though each
+.BR write (2)
+was followed by a call to
+.BR fsync (2)).
+.IR "See NOTES below" .
+.TP
+.BR O_TMPFILE " (since Linux 3.11)"
+.\" commit 60545d0d4610b02e55f65d141c95b18ccf855b6e
+.\" commit f4e0c30c191f87851c4a53454abb55ee276f4a7e
+.\" commit bb458c644a59dbba3a1fe59b27106c5e68e1c4bd
+Create an unnamed temporary regular file.
+The
+.I pathname
+argument specifies a directory;
+an unnamed inode will be created in that directory's filesystem.
+Anything written to the resulting file will be lost when
+the last file descriptor is closed, unless the file is given a name.
+.IP
+.B O_TMPFILE
+must be specified with one of
+.B O_RDWR
+or
+.B O_WRONLY
+and, optionally,
+.BR O_EXCL .
+If
+.B O_EXCL
+is not specified, then
+.BR linkat (2)
+can be used to link the temporary file into the filesystem, making it
+permanent, using code like the following:
+.IP
+.in +4n
+.EX
+char path[PATH_MAX];
+fd = open("/path/to/dir", O_TMPFILE | O_RDWR,
+                        S_IRUSR | S_IWUSR);
+\&
+/* File I/O on \[aq]fd\[aq]... */
+\&
+linkat(fd, "", AT_FDCWD, "/path/for/file", AT_EMPTY_PATH);
+\&
+/* If the caller doesn\[aq]t have the CAP_DAC_READ_SEARCH
+   capability (needed to use AT_EMPTY_PATH with linkat(2)),
+   and there is a proc(5) filesystem mounted, then the
+   linkat(2) call above can be replaced with:
+\&
+snprintf(path, PATH_MAX, "/proc/self/fd/%d", fd);
+linkat(AT_FDCWD, path, AT_FDCWD, "/path/for/file",
+                        AT_SYMLINK_FOLLOW);
+*/
+.EE
+.in
+.IP
+In this case,
+the
+.BR open ()
+.I mode
+argument determines the file permission mode, as with
+.BR O_CREAT .
+.IP
+Specifying
+.B O_EXCL
+in conjunction with
+.B O_TMPFILE
+prevents a temporary file from being linked into the filesystem
+in the above manner.
+(Note that the meaning of
+.B O_EXCL
+in this case is different from the meaning of
+.B O_EXCL
+otherwise.)
+.IP
+There are two main use cases for
+.\" Inspired by http://lwn.net/Articles/559147/
+.BR O_TMPFILE :
+.RS
+.IP \[bu] 3
+Improved
+.BR tmpfile (3)
+functionality: race-free creation of temporary files that
+(1) are automatically deleted when closed;
+(2) can never be reached via any pathname;
+(3) are not subject to symlink attacks; and
+(4) do not require the caller to devise unique names.
+.IP \[bu]
+Creating a file that is initially invisible, which is then populated
+with data and adjusted to have appropriate filesystem attributes
+.RB ( fchown (2),
+.BR fchmod (2),
+.BR fsetxattr (2),
+etc.)
+before being atomically linked into the filesystem
+in a fully formed state (using
+.BR linkat (2)
+as described above).
+.RE
+.IP
+.B O_TMPFILE
+requires support by the underlying filesystem;
+only a subset of Linux filesystems provide that support.
+In the initial implementation, support was provided in
+the ext2, ext3, ext4, UDF, Minix, and tmpfs filesystems.
+.\" To check for support, grep for "tmpfile" in kernel sources
+Support for other filesystems has subsequently been added as follows:
+XFS (Linux 3.15);
+.\" commit 99b6436bc29e4f10e4388c27a3e4810191cc4788
+.\" commit ab29743117f9f4c22ac44c13c1647fb24fb2bafe
+Btrfs (Linux 3.16);
+.\" commit ef3b9af50bfa6a1f02cd7b3f5124b712b1ba3e3c
+F2FS (Linux 3.16);
+.\" commit 50732df02eefb39ab414ef655979c2c9b64ad21c
+and ubifs (Linux 4.9).
+.TP
+.B O_TRUNC
+If the file already exists and is a regular file and the access mode allows
+writing (i.e., is
+.B O_RDWR
+or
+.BR O_WRONLY ),
+it will be truncated to length 0.
+If the file is a FIFO or terminal device file, the
+.B O_TRUNC
+flag is ignored.
+Otherwise, the effect of
+.B O_TRUNC
+is unspecified.
+.SS creat()
+A call to
+.BR creat ()
+is equivalent to calling
+.BR open ()
+with
+.I flags
+equal to
+.BR O_CREAT|O_WRONLY|O_TRUNC .
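+.PP
+In other words, assuming
+.I pathname
+and
+.I mode
+have already been set up by the caller,
+the following two calls are equivalent:
+.PP
+.in +4n
+.EX
+fd = creat(pathname, mode);
+fd = open(pathname, O_CREAT | O_WRONLY | O_TRUNC, mode);
+.EE
+.in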
+.SS openat()
+The
+.BR openat ()
+system call operates in exactly the same way as
+.BR open (),
+except for the differences described here.
+.PP
+The
+.I dirfd
+argument is used in conjunction with the
+.I pathname
+argument as follows:
+.IP \[bu] 3
+If the pathname given in
+.I pathname
+is absolute, then
+.I dirfd
+is ignored.
+.IP \[bu]
+If the pathname given in
+.I pathname
+is relative and
+.I dirfd
+is the special value
+.BR AT_FDCWD ,
+then
+.I pathname
+is interpreted relative to the current working
+directory of the calling process (like
+.BR open ()).
+.IP \[bu]
+If the pathname given in
+.I pathname
+is relative and
+.I dirfd
+is a file descriptor other than
+.BR AT_FDCWD ,
+then
+.I pathname
+is interpreted relative to the directory referred to by
+.I dirfd
+(rather than relative to the current working directory of
+the calling process, as is done by
+.BR open ()
+for a relative pathname).
+In this case,
+.I dirfd
+must be a directory that was opened for reading
+.RB ( O_RDONLY )
+or using the
+.B O_PATH
+flag.
+.PP
+If the pathname given in
+.I pathname
+is relative, and
+.I dirfd
+is not a valid file descriptor, an error
+.RB ( EBADF )
+results.
+(Specifying an invalid file descriptor number in
+.I dirfd
+can be used as a means to ensure that
+.I pathname
+is absolute.)
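+.PP
+As a rough sketch (the directory
+.I /etc
+and the filename
+.I hostname
+are chosen only for illustration),
+a file can be opened relative to a directory file descriptor as follows:
+.PP
+.in +4n
+.EX
+int dirfd, fd;
+\&
+dirfd = open("/etc", O_RDONLY | O_DIRECTORY);
+if (dirfd == \-1)
+    err(EXIT_FAILURE, "open");
+\&
+/* Resolved relative to /etc, not the current working directory */
+fd = openat(dirfd, "hostname", O_RDONLY);
+if (fd == \-1)
+    err(EXIT_FAILURE, "openat");
+.EE
+.in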
+.\"
+.SS openat2(2)
+The
+.BR openat2 (2)
+system call is an extension of
+.BR openat (),
+and provides a superset of the features of
+.BR openat ().
+It is documented separately, in
+.BR openat2 (2).
+.SH RETURN VALUE
+On success,
+.BR open (),
+.BR openat (),
+and
+.BR creat ()
+return the new file descriptor (a nonnegative integer).
+On error, \-1 is returned and
+.I errno
+is set to indicate the error.
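+.PP
+A caller would typically check for failure along these lines
+(a sketch; the pathname
+.I input.dat
+is a placeholder):
+.PP
+.in +4n
+.EX
+fd = open("input.dat", O_RDONLY);
+if (fd == \-1) {
+    if (errno == ENOENT)
+        fprintf(stderr, "input.dat does not exist\en");
+    else
+        perror("open");
+    exit(EXIT_FAILURE);
+}
+.EE
+.in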
+.SH ERRORS
+.BR open (),
+.BR openat (),
+and
+.BR creat ()
+can fail with the following errors:
+.TP
+.B EACCES
+The requested access to the file is not allowed, or search permission
+is denied for one of the directories in the path prefix of
+.IR pathname ,
+or the file did not exist yet and write access to the parent directory
+is not allowed.
+(See also
+.BR path_resolution (7).)
+.TP
+.B EACCES
+.\" commit 30aba6656f61ed44cba445a3c0d38b296fa9e8f5
+Where
+.B O_CREAT
+is specified, the
+.I protected_fifos
+or
+.I protected_regular
+sysctl is enabled, the file already exists and is a FIFO or regular file, the
+owner of the file is neither the current user nor the owner of the
+containing directory, and the containing directory is both world- or
+group-writable and sticky.
+For details, see the descriptions of
+.I /proc/sys/fs/protected_fifos
+and
+.I /proc/sys/fs/protected_regular
+in
+.BR proc (5).
+.TP
+.B EBADF
+.RB ( openat ())
+.I pathname
+is relative but
+.I dirfd
+is neither
+.B AT_FDCWD
+nor a valid file descriptor.
+.TP
+.B EBUSY
+.B O_EXCL
+was specified in
+.I flags
+and
+.I pathname
+refers to a block device that is in use by the system (e.g., it is mounted).
+.TP
+.B EDQUOT
+Where
+.B O_CREAT
+is specified, the file does not exist, and the user's quota of disk
+blocks or inodes on the filesystem has been exhausted.
+.TP
+.B EEXIST
+.I pathname
+already exists and
+.BR O_CREAT " and " O_EXCL
+were used.
+.TP
+.B EFAULT
+.I pathname
+points outside your accessible address space.
+.TP
+.B EFBIG
+See
+.BR EOVERFLOW .
+.TP
+.B EINTR
+While blocked waiting to complete an open of a slow device
+(e.g., a FIFO; see
+.BR fifo (7)),
+the call was interrupted by a signal handler; see
+.BR signal (7).
+.TP
+.B EINVAL
+The filesystem does not support the
+.B O_DIRECT
+flag.
+See
+.B NOTES
+for more information.
+.TP
+.B EINVAL
+Invalid value in
+.\" In particular, __O_TMPFILE instead of O_TMPFILE
+.IR flags .
+.TP
+.B EINVAL
+.B O_TMPFILE
+was specified in
+.IR flags ,
+but neither
+.B O_WRONLY
+nor
+.B O_RDWR
+was specified.
+.TP
+.B EINVAL
+.B O_CREAT
+was specified in
+.I flags
+and the final component ("basename") of the new file's
+.I pathname
+is invalid
+(e.g., it contains characters not permitted by the underlying filesystem).
+.TP
+.B EINVAL
+The final component ("basename") of
+.I pathname
+is invalid
+(e.g., it contains characters not permitted by the underlying filesystem).
+.TP
+.B EISDIR
+.I pathname
+refers to a directory and the access requested involved writing
+(that is,
+.B O_WRONLY
+or
+.B O_RDWR
+is set).
+.TP
+.B EISDIR
+.I pathname
+refers to an existing directory,
+.B O_TMPFILE
+and one of
+.B O_WRONLY
+or
+.B O_RDWR
+were specified in
+.IR flags ,
+but this kernel version does not provide the
+.B O_TMPFILE
+functionality.
+.TP
+.B ELOOP
+Too many symbolic links were encountered in resolving
+.IR pathname .
+.TP
+.B ELOOP
+.I pathname
+was a symbolic link, and
+.I flags
+specified
+.B O_NOFOLLOW
+but not
+.BR O_PATH .
+.TP
+.B EMFILE
+The per-process limit on the number of open file descriptors has been reached
+(see the description of
+.B RLIMIT_NOFILE
+in
+.BR getrlimit (2)).
+.TP
+.B ENAMETOOLONG
+.I pathname
+was too long.
+.TP
+.B ENFILE
+The system-wide limit on the total number of open files has been reached.
+.TP
+.B ENODEV
+.I pathname
+refers to a device special file and no corresponding device exists.
+(This is a Linux kernel bug; in this situation
+.B ENXIO
+must be returned.)
+.TP
+.B ENOENT
+.B O_CREAT
+is not set and the named file does not exist.
+.TP
+.B ENOENT
+A directory component in
+.I pathname
+does not exist or is a dangling symbolic link.
+.TP
+.B ENOENT
+.I pathname
+refers to a nonexistent directory,
+.B O_TMPFILE
+and one of
+.B O_WRONLY
+or
+.B O_RDWR
+were specified in
+.IR flags ,
+but this kernel version does not provide the
+.B O_TMPFILE
+functionality.
+.TP
+.B ENOMEM
+The named file is a FIFO,
+but memory for the FIFO buffer can't be allocated because
+the per-user hard limit on memory allocation for pipes has been reached
+and the caller is not privileged; see
+.BR pipe (7).
+.TP
+.B ENOMEM
+Insufficient kernel memory was available.
+.TP
+.B ENOSPC
+.I pathname
+was to be created but the device containing
+.I pathname
+has no room for the new file.
+.TP
+.B ENOTDIR
+A component used as a directory in
+.I pathname
+is not, in fact, a directory, or \fBO_DIRECTORY\fP was specified and
+.I pathname
+was not a directory.
+.TP
+.B ENOTDIR
+.RB ( openat ())
+.I pathname
+is a relative pathname and
+.I dirfd
+is a file descriptor referring to a file other than a directory.
+.TP
+.B ENXIO
+.BR O_NONBLOCK " | " O_WRONLY
+is set, the named file is a FIFO, and
+no process has the FIFO open for reading.
+.TP
+.B ENXIO
+The file is a device special file and no corresponding device exists.
+.TP
+.B ENXIO
+The file is a UNIX domain socket.
+.TP
+.B EOPNOTSUPP
+The filesystem containing
+.I pathname
+does not support
+.BR O_TMPFILE .
+.TP
+.B EOVERFLOW
+.I pathname
+refers to a regular file that is too large to be opened.
+The usual scenario here is that an application compiled
+on a 32-bit platform without
+.I \-D_FILE_OFFSET_BITS=64
+tried to open a file whose size exceeds
+.I (1<<31)\-1
+bytes;
+see also
+.B O_LARGEFILE
+above.
+This is the error specified by POSIX.1;
+before Linux 2.6.24, Linux gave the error
+.B EFBIG
+for this case.
+.\" See http://bugzilla.kernel.org/show_bug.cgi?id=7253
+.\" "Open of a large file on 32-bit fails with EFBIG, should be EOVERFLOW"
+.\" Reported 2006-10-03
+.TP
+.B EPERM
+The
+.B O_NOATIME
+flag was specified, but the effective user ID of the caller
+.\" Strictly speaking, it's the filesystem UID... (MTK)
+did not match the owner of the file and the caller was not privileged.
+.TP
+.B EPERM
+The operation was prevented by a file seal; see
+.BR fcntl (2).
+.TP
+.B EROFS
+.I pathname
+refers to a file on a read-only filesystem and write access was
+requested.
+.TP
+.B ETXTBSY
+.I pathname
+refers to an executable image which is currently being executed and
+write access was requested.
+.TP
+.B ETXTBSY
+.I pathname
+refers to a file that is currently in use as a swap file, and the
+.B O_TRUNC
+flag was specified.
+.TP
+.B ETXTBSY
+.I pathname
+refers to a file that is currently being read by the kernel (e.g., for
+module/firmware loading), and write access was requested.
+.TP
+.B EWOULDBLOCK
+The
+.B O_NONBLOCK
+flag was specified, and an incompatible lease was held on the file
+(see
+.BR fcntl (2)).
+.SH VERSIONS
+The (undefined) effect of
+.B O_RDONLY | O_TRUNC
+varies among implementations.
+On many systems the file is actually truncated.
+.\" Linux 2.0, 2.5: truncate
+.\" Solaris 5.7, 5.8: truncate
+.\" Irix 6.5: truncate
+.\" Tru64 5.1B: truncate
+.\" HP-UX 11.22: truncate
+.\" FreeBSD 4.7: truncate
+.SS Synchronized I/O
+The POSIX.1-2008 "synchronized I/O" option
+specifies different variants of synchronized I/O,
+and specifies the
+.BR open ()
+flags
+.BR O_SYNC ,
+.BR O_DSYNC ,
+and
+.B O_RSYNC
+for controlling the behavior.
+Regardless of whether an implementation supports this option,
+it must at least support the use of
+.B O_SYNC
+for regular files.
+.PP
+Linux implements
+.B O_SYNC
+and
+.BR O_DSYNC ,
+but not
+.BR O_RSYNC .
+Somewhat incorrectly, glibc defines
+.B O_RSYNC
+to have the same value as
+.BR O_SYNC .
+.RB ( O_RSYNC
+is defined in the Linux header file
+.I <asm/fcntl.h>
+on HP PA-RISC, but it is not used.)
+.PP
+.B O_SYNC
+provides synchronized I/O
+.I file
+integrity completion,
+meaning write operations will flush data and all associated metadata
+to the underlying hardware.
+.B O_DSYNC
+provides synchronized I/O
+.I data
+integrity completion,
+meaning write operations will flush data
+to the underlying hardware,
+but will only flush metadata updates that are required
+to allow a subsequent read operation to complete successfully.
+Data integrity completion can reduce the number of disk operations
+that are required for applications that don't need the guarantees
+of file integrity completion.
+.PP
+To understand the difference between the two types of completion,
+consider two pieces of file metadata:
+the file's last modification timestamp
+.RI ( st_mtime )
+and the file length.
+All write operations will update the last modification timestamp,
+but only writes that add data to the end of the
+file will change the file length.
+The last modification timestamp is not needed to ensure that
+a read completes successfully, but the file length is.
+Thus,
+.B O_DSYNC
+would only guarantee to flush updates to the file length metadata
+(whereas
+.B O_SYNC
+would also always flush the last modification timestamp metadata).
+.PP
+Before Linux 2.6.33, Linux implemented only the
+.B O_SYNC
+flag for
+.BR open ().
+However, when that flag was specified,
+most filesystems actually provided the equivalent of synchronized I/O
+.I data
+integrity completion (i.e.,
+.B O_SYNC
+was actually implemented as the equivalent of
+.BR O_DSYNC ).
+.PP
+Since Linux 2.6.33, proper
+.B O_SYNC
+support is provided.
+However, to ensure backward binary compatibility,
+.B O_DSYNC
+was defined with the same value as the historical
+.BR O_SYNC ,
+and
+.B O_SYNC
+was defined as a new (two-bit) flag value that includes the
+.B O_DSYNC
+flag value.
+This ensures that applications compiled against
+new headers get at least
+.B O_DSYNC
+semantics when run on kernels before Linux 2.6.33.
+.\"
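+.PP
+As an illustration of this encoding,
+the following minimal sketch
+(assuming the Linux and glibc definitions described above;
+other systems may define the flags differently)
+tests whether the bits of
+.B O_DSYNC
+are contained in
+.BR O_SYNC :
+.PP
+.in +4n
+.EX
+#include <fcntl.h>
+#include <stdio.h>
+\&
+int
+main(void)
+{
+    /* With the two\-bit encoding, O_SYNC contains all O_DSYNC bits. */
+    if ((O_SYNC & O_DSYNC) == O_DSYNC)
+        printf("O_SYNC includes the O_DSYNC bits\en");
+    else
+        printf("O_SYNC and O_DSYNC are encoded independently\en");
+\&
+    printf("O_SYNC = %#o, O_DSYNC = %#o\en",
+           (unsigned int) O_SYNC, (unsigned int) O_DSYNC);
+    return 0;
+}
+.EE
+.in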
+.SS C library/kernel differences
+Since glibc 2.26,
+the glibc wrapper function for
+.BR open ()
+employs the
+.BR openat ()
+system call, rather than the kernel's
+.BR open ()
+system call.
+For certain architectures, this is also true before glibc 2.26.
+.\"
+.SH STANDARDS
+.TP
+.BR open ()
+.TQ
+.BR creat ()
+.TQ
+.BR openat ()
+POSIX.1-2008.
+.TP
+.BR openat2 (2)
+Linux.
+.PP
+The
+.BR O_DIRECT ,
+.BR O_NOATIME ,
+.BR O_PATH ,
+and
+.B O_TMPFILE
+flags are Linux-specific.
+One must define
+.B _GNU_SOURCE
+to obtain their definitions.
+.PP
+The
+.BR O_CLOEXEC ,
+.BR O_DIRECTORY ,
+and
+.B O_NOFOLLOW
+flags are not specified in POSIX.1-2001,
+but are specified in POSIX.1-2008.
+Since glibc 2.12, one can obtain their definitions by defining either
+.B _POSIX_C_SOURCE
+with a value greater than or equal to 200809L or
+.B _XOPEN_SOURCE
+with a value greater than or equal to 700.
+In glibc 2.11 and earlier, one obtains the definitions by defining
+.BR _GNU_SOURCE .
+.SH HISTORY
+.TP
+.BR open ()
+.TQ
+.BR creat ()
+SVr4, 4.3BSD, POSIX.1-2001.
+.TP
+.BR openat ()
+POSIX.1-2008.
+Linux 2.6.16,
+glibc 2.4.
+.SH NOTES
+Under Linux, the
+.B O_NONBLOCK
+flag is sometimes used in cases where one wants to open a file
+but does not necessarily intend to read from or write to it.
+For example,
+this may be used to open a device in order to get a file descriptor
+for use with
+.BR ioctl (2).
+.PP
+Note that
+.BR open ()
+can open device special files, but
+.BR creat ()
+cannot create them; use
+.BR mknod (2)
+instead.
+.PP
+If the file is newly created, its
+.IR st_atime ,
+.IR st_ctime ,
+and
+.I st_mtime
+fields
+(respectively, time of last access, time of last status change, and
+time of last modification; see
+.BR stat (2))
+are set
+to the current time, and so are the
+.I st_ctime
+and
+.I st_mtime
+fields of the
+parent directory.
+Otherwise, if the file is modified because of the
+.B O_TRUNC
+flag, its
+.I st_ctime
+and
+.I st_mtime
+fields are set to the current time.
+.PP
+The files in the
+.IR /proc/ pid /fd
+directory show the open file descriptors of the process with the PID
+.IR pid .
+The files in the
+.IR /proc/ pid /fdinfo
+directory show even more information about these file descriptors.
+See
+.BR proc (5)
+for further details of both of these directories.
+.PP
+The Linux header file
+.B <asm/fcntl.h>
+doesn't define
+.BR O_ASYNC ;
+the (BSD-derived)
+.B FASYNC
+synonym is defined instead.
+.\"
+.\"
+.SS Open file descriptions
+The term open file description is the one used by POSIX to refer to the
+entries in the system-wide table of open files.
+In other contexts, this object is
+variously also called an "open file object",
+a "file handle", an "open file table entry",
+or\[em]in kernel-developer parlance\[em]a
+.IR "struct file" .
+.PP
+When a file descriptor is duplicated (using
+.BR dup (2)
+or similar),
+the duplicate refers to the same open file description
+as the original file descriptor,
+and the two file descriptors consequently share
+the file offset and file status flags.
+Such sharing can also occur between processes:
+a child process created via
+.BR fork (2)
+inherits duplicates of its parent's file descriptors,
+and those duplicates refer to the same open file descriptions.
+.PP
+Each
+.BR open ()
+of a file creates a new open file description;
+thus, there may be multiple open file descriptions
+corresponding to a file inode.
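+.PP
+The difference between duplicating a file descriptor and
+opening a file a second time can be observed through the file offset.
+In the following sketch, the pathname
+.I /etc/passwd
+is merely an illustrative assumption;
+any readable regular file would serve:
+.PP
+.in +4n
+.EX
+#include <fcntl.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+\&
+int
+main(void)
+{
+    int fd1, fd2, fd3;
+\&
+    fd1 = open("/etc/passwd", O_RDONLY);
+    if (fd1 == \-1) {
+        perror("open");
+        exit(EXIT_FAILURE);
+    }
+    fd2 = dup(fd1);                       /* Same open file description */
+    fd3 = open("/etc/passwd", O_RDONLY);  /* New open file description */
+    if (fd2 == \-1 || fd3 == \-1) {
+        perror("dup/open");
+        exit(EXIT_FAILURE);
+    }
+\&
+    /* Move the offset via fd1 only. */
+    if (lseek(fd1, 10, SEEK_SET) == \-1) {
+        perror("lseek");
+        exit(EXIT_FAILURE);
+    }
+\&
+    /* fd2 shares the offset set via fd1; fd3 keeps its own offset. */
+    printf("fd2 offset: %lld\en", (long long) lseek(fd2, 0, SEEK_CUR));
+    printf("fd3 offset: %lld\en", (long long) lseek(fd3, 0, SEEK_CUR));
+    exit(EXIT_SUCCESS);
+}
+.EE
+.in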
+.PP
+On Linux, one can use the
+.BR kcmp (2)
+.B KCMP_FILE
+operation to test whether two file descriptors
+(in the same process or in two different processes)
+refer to the same open file description.
+.\"
+.SS NFS
+There are many infelicities in the protocol underlying NFS, affecting,
+among others,
+.BR O_SYNC " and " O_NDELAY .
+.PP
+On NFS filesystems with UID mapping enabled,
+.BR open ()
+may
+return a file descriptor but, for example,
+.BR read (2)
+requests are denied
+with
+.BR EACCES .
+This is because the client performs
+.BR open ()
+by checking the
+permissions, whereas UID mapping is performed by the server when it
+handles read and write requests.
+.\"
+.\"
+.SS FIFOs
+Opening the read or write end of a FIFO blocks until the other
+end is also opened (by another process or thread).
+See
+.BR fifo (7)
+for further details.
+.\"
+.\"
+.SS File access mode
+Unlike the other values that can be specified in
+.IR flags ,
+the
+.I "access mode"
+values
+.BR O_RDONLY ", " O_WRONLY ", and " O_RDWR
+do not specify individual bits.
+Rather, they define the low order two bits of
+.IR flags ,
+and are defined respectively as 0, 1, and 2.
+In other words, the combination
+.B "O_RDONLY | O_WRONLY"
+is a logical error, and certainly does not have the same meaning as
+.BR O_RDWR .
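+.PP
+Consequently, the access mode is best extracted by masking
+rather than by testing individual bits.
+The following sketch assumes the
+.B O_ACCMODE
+mask, which is defined in
+.I <fcntl.h>
+but is not otherwise discussed in this section,
+and inspects the flags of standard input
+(chosen purely for illustration) using the
+.BR fcntl (2)
+.B F_GETFL
+operation:
+.PP
+.in +4n
+.EX
+#include <fcntl.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+\&
+int
+main(void)
+{
+    int flags;
+\&
+    flags = fcntl(STDIN_FILENO, F_GETFL);
+    if (flags == \-1) {
+        perror("fcntl");
+        exit(EXIT_FAILURE);
+    }
+\&
+    /* A test such as (flags & O_RDONLY) would always yield 0. */
+    switch (flags & O_ACCMODE) {
+    case O_RDONLY:  puts("read only");      break;
+    case O_WRONLY:  puts("write only");     break;
+    case O_RDWR:    puts("read and write"); break;
+    default:        puts("other");          break;
+    }
+    exit(EXIT_SUCCESS);
+}
+.EE
+.in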
+.PP
+Linux reserves the special, nonstandard access mode 3 (binary 11) in
+.I flags
+to mean:
+check for read and write permission on the file and return a file descriptor
+that can't be used for reading or writing.
+This nonstandard access mode is used by some Linux drivers to return a
+file descriptor that is to be used only for device-specific
+.BR ioctl (2)
+operations.
+.\" See for example util-linux's disk-utils/setfdprm.c
+.\" For some background on access mode 3, see
+.\" http://thread.gmane.org/gmane.linux.kernel/653123
+.\" "[RFC] correct flags to f_mode conversion in __dentry_open"
+.\" LKML, 12 Mar 2008
+.\"
+.\"
+.SS Rationale for openat() and other "directory file descriptor" APIs
+.BR openat ()
+and the other system calls and library functions that take
+a directory file descriptor argument
+(i.e.,
+.BR execveat (2),
+.BR faccessat (2),
+.BR fanotify_mark (2),
+.BR fchmodat (2),
+.BR fchownat (2),
+.BR fspick (2),
+.BR fstatat (2),
+.BR futimesat (2),
+.BR linkat (2),
+.BR mkdirat (2),
+.BR mknodat (2),
+.BR mount_setattr (2),
+.BR move_mount (2),
+.BR name_to_handle_at (2),
+.BR open_tree (2),
+.BR openat2 (2),
+.BR readlinkat (2),
+.BR renameat (2),
+.BR renameat2 (2),
+.BR statx (2),
+.BR symlinkat (2),
+.BR unlinkat (2),
+.BR utimensat (2),
+.BR mkfifoat (3),
+and
+.BR scandirat (3))
+address two problems with the older interfaces that preceded them.
+Here, the explanation is in terms of the
+.BR openat ()
+call, but the rationale is analogous for the other interfaces.
+.PP
+First,
+.BR openat ()
+allows an application to avoid race conditions that could
+occur when using
+.BR open ()
+to open files in directories other than the current working directory.
+These race conditions result from the fact that some component
+of the directory prefix given to
+.BR open ()
+could be changed in parallel with the call to
+.BR open ().
+Suppose, for example, that we wish to create the file
+.I dir1/dir2/xxx.dep
+if the file
+.I dir1/dir2/xxx
+exists.
+The problem is that between the existence check and the file-creation step,
+.I dir1
+or
+.I dir2
+(which might be symbolic links)
+could be modified to point to a different location.
+Such races can be avoided by
+opening a file descriptor for the target directory,
+and then specifying that file descriptor as the
+.I dirfd
+argument of (say)
+.BR fstatat (2)
+and
+.BR openat ().
+The use of the
+.I dirfd
+file descriptor also has other benefits:
+.IP \[bu] 3
+the file descriptor is a stable reference to the directory,
+even if the directory is renamed; and
+.IP \[bu]
+the open file descriptor prevents the underlying filesystem from
+being dismounted,
+just as when a process has a current working directory on a filesystem.
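+.PP
+The approach described above might look as follows.
+This is only a sketch:
+the names
+.IR dir1/dir2 ,
+.IR xxx ,
+and
+.I xxx.dep
+are taken from the example in the text,
+while the variable name
+.I dfd
+and the file mode 0644 are arbitrary choices:
+.PP
+.in +4n
+.EX
+#include <fcntl.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <sys/stat.h>
+#include <unistd.h>
+\&
+int
+main(void)
+{
+    int dfd, fd;
+    struct stat sb;
+\&
+    /* Resolve the directory prefix once. */
+    dfd = open("dir1/dir2", O_RDONLY | O_DIRECTORY);
+    if (dfd == \-1) {
+        perror("open");
+        exit(EXIT_FAILURE);
+    }
+\&
+    /* Both calls below interpret their pathname relative to dfd,
+       so later changes to dir1 or dir2 cannot redirect them. */
+    if (fstatat(dfd, "xxx", &sb, 0) == 0) {
+        fd = openat(dfd, "xxx.dep", O_WRONLY | O_CREAT | O_EXCL, 0644);
+        if (fd == \-1)
+            perror("openat");
+        else
+            close(fd);
+    }
+\&
+    close(dfd);
+    exit(EXIT_SUCCESS);
+}
+.EE
+.in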
+.PP
+Second,
+.BR openat ()
+allows the implementation of a per-thread "current working
+directory", via file descriptor(s) maintained by the application.
+(This functionality can also be obtained by tricks based
+on the use of
+.IR /proc/self/fd/ dirfd,
+but less efficiently.)
+.PP
+The
+.I dirfd
+argument for these APIs can be obtained by using
+.BR open ()
+or
+.BR openat ()
+to open a directory (with either the
+.B O_RDONLY
+or the
+.B O_PATH
+flag).
+Alternatively, such a file descriptor can be obtained by applying
+.BR dirfd (3)
+to a directory stream created using
+.BR opendir (3).
+.PP
+When these APIs are given a
+.I dirfd
+argument of
+.B AT_FDCWD
+or the specified pathname is absolute,
+then they handle their pathname argument in the same way as
+the corresponding conventional APIs.
+However, in this case, several of the APIs have a
+.I flags
+argument that provides access to functionality that is not available with
+the corresponding conventional APIs.
+.\"
+.\"
+.SS O_DIRECT
+The
+.B O_DIRECT
+flag may impose alignment restrictions on the length and address
+of user-space buffers and the file offset of I/Os.
+In Linux, alignment
+restrictions vary by filesystem and kernel version and might be
+absent entirely.
+The handling of misaligned
+.B O_DIRECT
+I/Os also varies;
+they can either fail with
+.B EINVAL
+or fall back to buffered I/O.
+.PP
+Since Linux 6.1,
+.B O_DIRECT
+support and alignment restrictions for a file can be queried using
+.BR statx (2),
+using the
+.B STATX_DIOALIGN
+flag.
+Support for
+.B STATX_DIOALIGN
+varies by filesystem;
+see
+.BR statx (2).
+.PP
+Some filesystems provide their own interfaces for querying
+.B O_DIRECT
+alignment restrictions,
+for example the
+.B XFS_IOC_DIOINFO
+operation in
+.BR xfsctl (3).
+.B STATX_DIOALIGN
+should be used instead when it is available.
+.PP
+If none of the above is available,
+then direct I/O support and alignment restrictions
+can only be assumed from known characteristics of the filesystem,
+the individual file,
+the underlying storage device(s),
+and the kernel version.
+In Linux 2.4,
+most filesystems based on block devices require that
+the file offset and the length and memory address of all I/O segments
+be multiples of the filesystem block size
+(typically 4096 bytes).
+In Linux 2.6.0,
+this was relaxed to the logical block size of the block device
+(typically 512 bytes).
+A block device's logical block size can be determined using the
+.BR ioctl (2)
+.B BLKSSZGET
+operation or from the shell using the command:
+.PP
+.in +4n
+.EX
+blockdev \-\-getss
+.EE
+.in
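+.PP
+As a hedged illustration of the
+.B BLKSSZGET
+operation,
+the following sketch opens a block device named on the command line,
+queries its logical block size,
+and performs one read whose buffer address, length, and file offset
+are all multiples of that size.
+It assumes that aligning to the logical block size is sufficient,
+which, as noted above, need not hold for every filesystem or kernel version:
+.PP
+.in +4n
+.EX
+#define _GNU_SOURCE             /* For O_DIRECT */
+#include <fcntl.h>
+#include <linux/fs.h>           /* For BLKSSZGET */
+#include <stdio.h>
+#include <stdlib.h>
+#include <sys/ioctl.h>
+#include <unistd.h>
+\&
+int
+main(int argc, char *argv[])
+{
+    int fd, ssz;
+    void *buf;
+\&
+    if (argc != 2) {
+        fprintf(stderr, "Usage: %s block\-device\en", argv[0]);
+        exit(EXIT_FAILURE);
+    }
+\&
+    fd = open(argv[1], O_RDONLY | O_DIRECT);
+    if (fd == \-1) {
+        perror("open");
+        exit(EXIT_FAILURE);
+    }
+\&
+    if (ioctl(fd, BLKSSZGET, &ssz) == \-1) {
+        perror("ioctl");
+        exit(EXIT_FAILURE);
+    }
+\&
+    /* Align the buffer address and the I/O length to the logical
+       block size; the file offset (0) is trivially aligned. */
+    if (posix_memalign(&buf, ssz, ssz) != 0) {
+        fprintf(stderr, "posix_memalign failed\en");
+        exit(EXIT_FAILURE);
+    }
+\&
+    if (read(fd, buf, ssz) == \-1)
+        perror("read");
+\&
+    free(buf);
+    close(fd);
+    exit(EXIT_SUCCESS);
+}
+.EE
+.in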
+.PP
+.B O_DIRECT
+I/Os should never be run concurrently with the
+.BR fork (2)
+system call,
+if the memory buffer is a private mapping
+(i.e., any mapping created with the
+.BR mmap (2)
+.B MAP_PRIVATE
+flag;
+this includes memory allocated on the heap and statically allocated buffers).
+Any such I/Os, whether submitted via an asynchronous I/O interface or from
+another thread in the process,
+should be completed before
+.BR fork (2)
+is called.
+Failure to do so can result in data corruption and undefined behavior in
+parent and child processes.
+This restriction does not apply when the memory buffer for the
+.B O_DIRECT
+I/Os was created using
+.BR shmat (2)
+or
+.BR mmap (2)
+with the
+.B MAP_SHARED
+flag.
+Nor does this restriction apply when the memory buffer has been advised as
+.B MADV_DONTFORK
+with
+.BR madvise (2),
+ensuring that it will not be available
+to the child after
+.BR fork (2).
+.PP
+The
+.B O_DIRECT
+flag was introduced in SGI IRIX, where it has alignment
+restrictions similar to those of Linux 2.4.
+IRIX also has an
+.BR fcntl (2)
+call to query appropriate alignments and sizes.
+FreeBSD 4.x introduced
+a flag of the same name, but without alignment restrictions.
+.PP
+.B O_DIRECT
+support was added in Linux 2.4.10.
+Older Linux kernels simply ignore this flag.
+Some filesystems may not implement the flag, in which case
+.BR open ()
+fails with the error
+.B EINVAL
+if it is used.
+.PP
+Applications should avoid mixing
+.B O_DIRECT
+and normal I/O to the same file,
+and especially to overlapping byte regions in the same file.
+Even when the filesystem correctly handles the coherency issues in
+this situation, overall I/O throughput is likely to be slower than
+using either mode alone.
+Likewise, applications should avoid mixing
+.BR mmap (2)
+of files with direct I/O to the same files.
+.PP
+The behavior of
+.B O_DIRECT
+with NFS will differ from local filesystems.
+Older kernels, or
+kernels configured in certain ways, may not support this combination.
+The NFS protocol does not support passing the flag to the server, so
+.B O_DIRECT
+I/O will bypass the page cache only on the client; the server may
+still cache the I/O.
+The client asks the server to make the I/O
+synchronous to preserve the synchronous semantics of
+.BR O_DIRECT .
+Some servers will perform poorly under these circumstances, especially
+if the I/O size is small.
+Some servers may also be configured to
+lie to clients about the I/O having reached stable storage; this
+will avoid the performance penalty at some risk to data integrity
+in the event of server power failure.
+The Linux NFS client places no alignment restrictions on
+.B O_DIRECT
+I/O.
+.PP
+In summary,
+.B O_DIRECT
+is a potentially powerful tool that should be used with caution.
+It is recommended that applications treat use of
+.B O_DIRECT
+as a performance option which is disabled by default.
+.SH BUGS
+Currently, it is not possible to enable signal-driven
+I/O by specifying
+.B O_ASYNC
+when calling
+.BR open ();
+use
+.BR fcntl (2)
+to enable this flag.
+.\" FIXME . Check bugzilla report on open(O_ASYNC)
+.\" See http://bugzilla.kernel.org/show_bug.cgi?id=5993
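+.PP
+A minimal sketch of the
+.BR fcntl (2)
+approach follows.
+Standard input is used purely for illustration,
+and the use of
+.B F_SETOWN
+to direct the signal to the calling process
+is an assumption about what the application wants:
+.PP
+.in +4n
+.EX
+#define _GNU_SOURCE             /* For O_ASYNC */
+#include <fcntl.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+\&
+int
+main(void)
+{
+    int flags;
+\&
+    /* Ask for the I/O signal to be delivered to this process. */
+    if (fcntl(STDIN_FILENO, F_SETOWN, getpid()) == \-1) {
+        perror("fcntl(F_SETOWN)");
+        exit(EXIT_FAILURE);
+    }
+\&
+    /* Set O_ASYNC after the file is open, preserving other flags. */
+    flags = fcntl(STDIN_FILENO, F_GETFL);
+    if (flags == \-1) {
+        perror("fcntl(F_GETFL)");
+        exit(EXIT_FAILURE);
+    }
+    if (fcntl(STDIN_FILENO, F_SETFL, flags | O_ASYNC) == \-1) {
+        perror("fcntl(F_SETFL)");
+        exit(EXIT_FAILURE);
+    }
+    exit(EXIT_SUCCESS);
+}
+.EE
+.in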
+.PP
+One must check for two different error codes,
+.B EISDIR
+and
+.BR ENOENT ,
+when trying to determine whether the kernel supports
+.B O_TMPFILE
+functionality.
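+.PP
+Such a check might be sketched as follows.
+The use of the current directory (".") as the target
+and the mode 0600 are illustrative assumptions,
+and the messages are not part of any interface:
+.PP
+.in +4n
+.EX
+#define _GNU_SOURCE             /* For O_TMPFILE */
+#include <errno.h>
+#include <fcntl.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+\&
+int
+main(void)
+{
+    int fd;
+\&
+    fd = open(".", O_TMPFILE | O_RDWR, 0600);
+    if (fd != \-1) {
+        puts("O_TMPFILE is supported");
+        close(fd);
+    } else if (errno == EISDIR || errno == ENOENT) {
+        /* Either error can indicate a kernel without O_TMPFILE. */
+        puts("kernel does not provide O_TMPFILE");
+    } else if (errno == EOPNOTSUPP) {
+        puts("filesystem does not support O_TMPFILE");
+    } else {
+        perror("open");
+    }
+    exit(EXIT_SUCCESS);
+}
+.EE
+.in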
+.PP
+When both
+.B O_CREAT
+and
+.B O_DIRECTORY
+are specified in
+.I flags
+and the file specified by
+.I pathname
+does not exist,
+.BR open ()
+will create a regular file (i.e.,
+.B O_DIRECTORY
+is ignored).
+.SH SEE ALSO
+.BR chmod (2),
+.BR chown (2),
+.BR close (2),
+.BR dup (2),
+.BR fcntl (2),
+.BR link (2),
+.BR lseek (2),
+.BR mknod (2),
+.BR mmap (2),
+.BR mount (2),
+.BR open_by_handle_at (2),
+.BR openat2 (2),
+.BR read (2),
+.BR socket (2),
+.BR stat (2),
+.BR umask (2),
+.BR unlink (2),
+.BR write (2),
+.BR fopen (3),
+.BR acl (5),
+.BR fifo (7),
+.BR inode (7),
+.BR path_resolution (7),
+.BR symlink (7)
-- 
cgit v1.2.3