Adding upstream version 6.1.76.

Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
author: Daniel Baumann <daniel.baumann@progress-linux.org> 2024-04-07 18:49:45 +0000
committer: Daniel Baumann <daniel.baumann@progress-linux.org> 2024-04-07 18:49:45 +0000
commit: 2c3c1048746a4622d8c89a29670120dc8fab93c4 (patch)
tree: 848558de17fb3008cdf4d861b01ac7781903ce39 /Documentation/filesystems/nfs
parent: Initial commit. (diff)
9 files changed, 1331 insertions, 0 deletions
diff --git a/Documentation/filesystems/nfs/client-identifier.rst b/Documentation/filesystems/nfs/client-identifier.rst
new file mode 100644
index 000000000..5147e1581
--- /dev/null
+++ b/Documentation/filesystems/nfs/client-identifier.rst
@@ -0,0 +1,216 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=======================
+NFSv4 client identifier
+=======================
+
+This document explains how the NFSv4 protocol identifies client
+instances in order to maintain file open and lock state during
+system restarts. A special identifier and principal are maintained
+on each client. These can be set by administrators, scripts
+provided by site administrators, or tools provided by Linux
+distributors.
+
+There are risks if a client's NFSv4 identifier and its principal
+are not chosen carefully.
+
+
+Introduction
+------------
+
+The NFSv4 protocol uses "lease-based file locking". Leases help
+NFSv4 servers provide file lock guarantees and manage their
+resources.
+
+Simply put, an NFSv4 server creates a lease for each NFSv4 client.
+The server collects each client's file open and lock state under
+the lease for that client.
+
+The client is responsible for periodically renewing its leases.
+While a lease remains valid, the server holding that lease
+guarantees the file locks the client has created remain in place.
+
+If a client stops renewing its lease (for example, if it crashes),
+the NFSv4 protocol allows the server to remove the client's open
+and lock state after a certain period of time. When a client
+restarts, it indicates to servers that open and lock state
+associated with its previous leases is no longer valid and can be
+destroyed immediately.
+
+In addition, each NFSv4 server manages a persistent list of client
+leases. When the server restarts and clients attempt to recover
+their state, the server uses this list to distinguish amongst
+clients that held state before the server restarted and clients
+sending fresh OPEN and LOCK requests. This enables file locks to
+persist safely across server restarts.
+
+NFSv4 client identifiers
+------------------------
+
+Each NFSv4 client presents an identifier to NFSv4 servers so that
+they can associate the client with its lease. Each client's
+identifier consists of two elements:
+
+  - co_ownerid: An arbitrary but fixed string.
+
+  - boot verifier: A 64-bit incarnation verifier that enables a
+    server to distinguish successive boot epochs of the same client.
+
+The NFSv4.0 specification refers to these two items as an
+"nfs_client_id4". The NFSv4.1 specification refers to these two
+items as a "client_owner4".
+
+NFSv4 servers tie this identifier to the principal and security
+flavor that the client used when presenting it. Servers use this
+principal to authorize subsequent lease modification operations
+sent by the client. Effectively this principal is a third element of
+the identifier.
+
+As part of the identity presented to servers, a good
+"co_ownerid" string has several important properties:
+
+  - The "co_ownerid" string identifies the client during reboot
+    recovery, therefore the string is persistent across client
+    reboots.
+  - The "co_ownerid" string helps servers distinguish the client
+    from others, therefore the string is globally unique. Note
+    that there is no central authority that assigns "co_ownerid"
+    strings.
+  - Because it often appears on the network in the clear, the
+    "co_ownerid" string does not reveal private information about
+    the client itself.
+  - The content of the "co_ownerid" string is set and unchanging
+    before the client attempts NFSv4 mounts after a restart.
+  - The NFSv4 protocol places a 1024-byte limit on the size of the
+    "co_ownerid" string.
+
+Protecting NFSv4 lease state
+----------------------------
+
+NFSv4 servers utilize the "client_owner4" as described above to
+assign a unique lease to each client. Under this scheme, there are
+circumstances where clients can interfere with each other. This is
+referred to as "lease stealing".
+
+If distinct clients present the same "co_ownerid" string and use
+the same principal (for example, AUTH_SYS and UID 0), a server is
+unable to tell that the clients are not the same. Each distinct
+client presents a different boot verifier, so it appears to the
+server as if there is one client that is rebooting frequently.
+Neither client can maintain open or lock state in this scenario.
+
+If distinct clients present the same "co_ownerid" string and use
+distinct principals, the server is likely to allow the first client
+to operate normally but reject subsequent clients with the same
+"co_ownerid" string.
+
+If a client's "co_ownerid" string or principal is not stable,
+state recovery after a server or client reboot is not guaranteed.
+If a client unexpectedly restarts but presents a different
+"co_ownerid" string or principal to the server, the server orphans
+the client's previous open and lock state. This blocks access to
+locked files until the server removes the orphaned state.
+
+If the server restarts and a client presents a changed "co_ownerid"
+string or principal to the server, the server will not allow the
+client to reclaim its open and lock state, and may give those locks
+to other clients in the meantime. This is referred to as "lock
+stealing".
+
+Lease stealing and lock stealing increase the potential for denial
+of service and in rare cases even data corruption.
+
+Selecting an appropriate client identifier
+------------------------------------------
+
+By default, the Linux NFSv4 client implementation constructs its
+"co_ownerid" string starting with the words "Linux NFS" followed by
+the client's UTS node name (the same node name, incidentally, that
+is used as the "machine name" in an AUTH_SYS credential). In small
+deployments, this construction is usually adequate. Often, however,
+the node name by itself is not adequately unique, and can change
+unexpectedly. Problematic situations include:
+
+  - NFS-root (diskless) clients, where the local DHCP server (or
+    equivalent) does not provide a unique host name.
+
+  - "Containers" within a single Linux host.  If each container has
+    a separate network namespace, but does not use the UTS namespace
+    to provide a unique host name, then there can be multiple NFS
+    client instances with the same host name.
+
+  - Clients across multiple administrative domains that access a
+    common NFS server. If hostnames are not assigned centrally
+    then uniqueness cannot be guaranteed unless a domain name is
+    included in the hostname.
+
+Linux provides two mechanisms to add uniqueness to its "co_ownerid"
+string:
+
+    nfs.nfs4_unique_id
+      This module parameter can set an arbitrary uniquifier string
+      via the kernel command line, or when the "nfs" module is
+      loaded.
+
+    /sys/fs/nfs/client/net/identifier
+      This virtual file, available since Linux 5.3, is local to the
+      network namespace in which it is accessed and so can provide
+      distinction between network namespaces (containers) when the
+      hostname remains uniform.
+
+Note that this file is empty on namespace creation. If the
+container system has access to some sort of per-container identity
+then that uniquifier can be used. For example, a uniquifier might
+be formed at boot using the container's internal identifier::
+
+    sha256sum /etc/machine-id | awk '{print $1}' \
+        > /sys/fs/nfs/client/net/identifier
+
+Security considerations
+-----------------------
+
+The use of cryptographic security for lease management operations
+is strongly encouraged.
+
+If NFS with Kerberos is not configured, a Linux NFSv4 client uses
+AUTH_SYS and UID 0 as the principal part of its client identity.
+This configuration is not only insecure, it increases the risk of
+lease and lock stealing. However, it might be the only choice for
+client configurations that have no local persistent storage.
+"co_ownerid" string uniqueness and persistence are critical in this
+case.
+
+When a Kerberos keytab is present on a Linux NFS client, the client
+attempts to use one of the principals in that keytab when
+identifying itself to servers. The "sec=" mount option does not
+control this behavior. Alternatively, a single-user client with a
+Kerberos principal can use that principal in place of the client's
+host principal.
+
+Using Kerberos for this purpose enables the client and server to
+use the same lease for operations covered by all "sec=" settings.
+Additionally, the Linux NFS client uses the RPCSEC_GSS security
+flavor with Kerberos and the integrity QOS to prevent in-transit
+modification of lease modification requests.
+
+Additional notes
+----------------
+The Linux NFSv4 client establishes a single lease on each NFSv4
+server it accesses. NFSv4 mounts from a Linux NFSv4 client of a
+particular server then share that lease.
+
+Once a client establishes open and lock state, the NFSv4 protocol
+enables lease state to transition to other servers, following data
+that has been migrated. This hides data migration completely from
+running applications. The Linux NFSv4 client facilitates state
+migration by presenting the same "client_owner4" to all servers it
+encounters.
+
+========
+See Also
+========
+
+  - nfs(5)
+  - kerberos(7)
+  - RFC 7530 for the NFSv4.0 specification
+  - RFC 8881 for the NFSv4.1 specification
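Editorial note on client-identifier.rst: the cases described under "Protecting NFSv4 lease state" can be illustrated with a toy model. The sketch below is not the logic of any real NFSv4 server. The `ToyServer` class, its method names, and its return strings are hypothetical, and it assumes for simplicity that the server keys leases by the "co_ownerid" string alone.

```python
from dataclasses import dataclass, field


@dataclass
class Lease:
    principal: str   # principal that established the lease
    verifier: int    # boot verifier presented by the client
    state: list = field(default_factory=list)  # open/lock state under the lease


class ToyServer:
    """Hypothetical model of lease bookkeeping; not real server code."""

    def __init__(self):
        self.leases = {}  # co_ownerid string -> Lease

    def establish(self, co_ownerid, principal, verifier):
        lease = self.leases.get(co_ownerid)
        if lease is None:
            self.leases[co_ownerid] = Lease(principal, verifier)
            return "new"
        if lease.principal != principal:
            # Same string, different principal: later clients are refused.
            return "rejected"
        if lease.verifier != verifier:
            # Same string and principal, new verifier: this looks like a
            # reboot, so the previous open and lock state is discarded.
            self.leases[co_ownerid] = Lease(principal, verifier)
            return "rebooted"
        return "existing"

    def lock(self, co_ownerid, name):
        self.leases[co_ownerid].state.append(name)
```

In this model, two distinct hosts that share a "co_ownerid" string and both use AUTH_SYS with UID 0 keep triggering the "rebooted" branch against each other, which is the lease-stealing pattern the document warns about.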
diff --git a/Documentation/filesystems/nfs/exporting.rst b/Documentation/filesystems/nfs/exporting.rst
new file mode 100644
index 000000000..0e98edd35
--- /dev/null
+++ b/Documentation/filesystems/nfs/exporting.rst
@@ -0,0 +1,217 @@
+:orphan:
+
+Making Filesystems Exportable
+=============================
+
+Overview
+--------
+
+All filesystem operations require a dentry (or two) as a starting
+point.  Local applications have a reference-counted hold on suitable
+dentries via open file descriptors or cwd/root.  However, remote
+applications that access a filesystem via a remote filesystem protocol
+such as NFS may not be able to hold such a reference, and so need a
+different way to refer to a particular dentry.  As the alternative
+form of reference needs to be stable across renames, truncates, and
+server-reboot (among other things, though these tend to be the most
+problematic), there is no simple answer like 'filename'.
+
+The mechanism discussed here allows each filesystem implementation to
+specify how to generate an opaque (outside of the filesystem) byte
+string for any dentry, and how to find an appropriate dentry for any
+given opaque byte string.
+This byte string will be called a "filehandle fragment" as it
+corresponds to part of an NFS filehandle.
+
+A filesystem which supports the mapping between filehandle fragments
+and dentries will be termed "exportable".
+
+
+
+Dcache Issues
+-------------
+
+The dcache normally contains a proper prefix of any given filesystem
+tree.  This means that if any filesystem object is in the dcache, then
+all of the ancestors of that filesystem object are also in the dcache.
+As normal access is by filename this prefix is created naturally and
+maintained easily (by each object maintaining a reference count on
+its parent).
+
+However, when objects are included into the dcache by interpreting a
+filehandle fragment, there is no automatic creation of a path prefix
+for the object.  This leads to two related but distinct features of
+the dcache that are not needed for normal filesystem access.
+
+1. The dcache must sometimes contain objects that are not part of the
+   proper prefix, i.e. that are not connected to the root.
+2. The dcache must be prepared for a newly found (via ->lookup) directory
+   to already have a (non-connected) dentry, and must be able to move
+   that dentry into place (based on the parent and name in the
+   ->lookup).   This is particularly needed for directories as
+   it is a dcache invariant that directories only have one dentry.
+
+To implement these features, the dcache has:
+
+a. A dentry flag DCACHE_DISCONNECTED which is set on
+   any dentry that might not be part of the proper prefix.
+   This is set when anonymous dentries are created, and cleared when a
+   dentry is noticed to be a child of a dentry which is in the proper
+   prefix.  If the refcount on a dentry with this flag set
+   becomes zero, the dentry is immediately discarded, rather than being
+   kept in the dcache.  If a dentry that is not already in the dcache
+   is repeatedly accessed by filehandle (as NFSD might do), a new dentry
+   will be allocated for each access, and discarded at the end of
+   the access.
+
+   Note that such a dentry can acquire children, name, ancestors, etc.
+   without losing DCACHE_DISCONNECTED - that flag is only cleared when
+   the subtree is successfully reconnected to root.  Until then dentries
+   in such a subtree are retained only as long as there are references;
+   refcount reaching zero means immediate eviction, same as for unhashed
+   dentries.  That guarantees that we won't need to hunt them down upon
+   umount.
+
+b. A primitive for creation of secondary roots - d_obtain_root(inode).
+   Those do _not_ bear DCACHE_DISCONNECTED.  They are placed on the
+   per-superblock list (->s_roots), so they can be located at umount
+   time for eviction purposes.
+
+c. Helper routines to allocate anonymous dentries, and to help attach
+   loose directory dentries at lookup time. They are:
+
+    d_obtain_alias(inode) will return a dentry for the given inode.
+      If the inode already has a dentry, one of those is returned.
+
+      If it doesn't, a new anonymous (IS_ROOT and
+      DCACHE_DISCONNECTED) dentry is allocated and attached.
+
+      In the case of a directory, care is taken that only one dentry
+      can ever be attached.
+
+    d_splice_alias(inode, dentry) will introduce a new dentry into the tree;
+      either the passed-in dentry or a preexisting alias for the given inode
+      (such as an anonymous one created by d_obtain_alias), if appropriate.
+      It returns NULL when the passed-in dentry is used, following the calling
+      convention of ->lookup.
+
+Filesystem Issues
+-----------------
+
+For a filesystem to be exportable it must:
+
+   1. provide the filehandle fragment routines described below.
+   2. make sure that d_splice_alias is used rather than d_add
+      when ->lookup finds an inode for a given parent and name.
+
+      If inode is NULL, d_splice_alias(inode, dentry) is equivalent to::
+
+		d_add(dentry, inode), NULL
+
+      Similarly, d_splice_alias(ERR_PTR(err), dentry) = ERR_PTR(err)
+
+      Typically the ->lookup routine will simply end with a::
+
+		return d_splice_alias(inode, dentry);
+	}
+
+
+
+A file system implementation declares that instances of the filesystem
+are exportable by setting the s_export_op field in the struct
+super_block.  This field must point to a "struct export_operations"
+struct which has the following members:
+
+  encode_fh (optional)
+    Takes a dentry and creates a filehandle fragment which can later be used
+    to find or create a dentry for the same object.  The default
+    implementation creates a filehandle fragment that encodes a 32-bit inode
+    number and generation number for the inode, and if necessary the
+    same information for the parent.
+
+  fh_to_dentry (mandatory)
+    Given a filehandle fragment, this should find the implied object and
+    create a dentry for it (possibly with d_obtain_alias).
+
+  fh_to_parent (optional but strongly recommended)
+    Given a filehandle fragment, this should find the parent of the
+    implied object and create a dentry for it (possibly with
+    d_obtain_alias).  May fail if the filehandle fragment is too small.
+
+  get_parent (optional but strongly recommended)
+    When given a dentry for a directory, this should return a dentry for
+    the parent.  Quite possibly the parent dentry will have been allocated
+    by d_obtain_alias.  The default get_parent function just returns an error
+    so any filehandle lookup that requires finding a parent will fail.
+    ->lookup("..") is *not* used as a default as it can leave ".." entries
+    in the dcache which are too messy to work with.
+
+  get_name (optional)
+    When given a parent dentry and a child dentry, this should find a name
+    in the directory identified by the parent dentry, which leads to the
+    object identified by the child dentry.  If no get_name function is
+    supplied, a default implementation is provided which uses iterate_dir
+    to find potential names, and matches inode numbers to find the correct
+    match.
+
+  flags
+    Some filesystems may need to be handled differently than others. The
+    export_operations struct also includes a flags field that allows the
+    filesystem to communicate such information to nfsd. See the Export
+    Operations Flags section below for more explanation.
+
+A filehandle fragment consists of an array of 1 or more 4-byte words,
+together with a one-byte "type".
+The decode routines (fh_to_dentry and fh_to_parent) should not depend on
+the stated size that is passed to them.  This size may be larger than the
+original filehandle generated by encode_fh, in which case it will have
+been padded with NULs.  Rather, the encode_fh routine should choose a
+"type" which indicates to the decode routines how much of the filehandle
+is valid, and how it should be interpreted.
+
+Export Operations Flags
+-----------------------
+In addition to the operation vector pointers, struct export_operations also
+contains a "flags" field that allows the filesystem to communicate to nfsd
+that it may want to do things differently when dealing with it. The
+following flags are defined:
+
+  EXPORT_OP_NOWCC - disable NFSv3 WCC attributes on this filesystem
+    RFC 1813 recommends that servers always send weak cache consistency
+    (WCC) data to the client after each operation. The server should
+    atomically collect attributes about the inode, do an operation on it,
+    and then collect the attributes afterward. This allows the client to
+    skip issuing GETATTRs in some situations but means that the server
+    is calling vfs_getattr for almost all RPCs. On some filesystems
+    (particularly those that are clustered or networked) this is expensive
+    and atomicity is difficult to guarantee. This flag indicates to nfsd
+    that it should skip providing WCC attributes to the client in NFSv3
+    replies when doing operations on this filesystem. Consider enabling
+    this on filesystems that have an expensive ->getattr inode operation,
+    or when atomicity between pre and post operation attribute collection
+    is impossible to guarantee.
+
+  EXPORT_OP_NOSUBTREECHK - disallow subtree checking on this fs
+    Many NFS operations deal with filehandles, which the server must then
+    vet to ensure that they live inside of an exported tree. When the
+    export consists of an entire filesystem, this is trivial. nfsd can just
+    ensure that the filehandle lives on the filesystem. When only part of a
+    filesystem is exported however, then nfsd must walk the ancestors of the
+    inode to ensure that it's within an exported subtree. This is an
+    expensive operation and not all filesystems can support it properly.
+    This flag exempts the filesystem from subtree checking and causes
+    exportfs to get back an error if it tries to enable subtree checking
+    on it.
+
+  EXPORT_OP_CLOSE_BEFORE_UNLINK - always close cached files before unlinking
+    On some exportable filesystems (such as NFS) unlinking a file that
+    is still open can cause a fair bit of extra work. For instance,
+    the NFS client will do a "sillyrename" to ensure that the file
+    sticks around while it's still open. When reexporting, that open
+    file is held by nfsd so we usually end up doing a sillyrename, and
+    then immediately deleting the sillyrenamed file just afterward when
+    the link count actually goes to zero. Sometimes this delete can race
+    with other operations (for instance an rmdir of the parent directory).
+    This flag causes nfsd to close any open files for this inode _before_
+    calling into the vfs to do an unlink or a rename that would replace
+    an existing file.
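Editorial note on exporting.rst: the filehandle fragment rules above (an array of 4-byte words plus a one-byte "type", with decoding driven by the type and not by the stated size) can be sketched as follows. This is an illustration in Python, not kernel code. The type constants, the function names, and the little-endian byte order are assumptions made for the example.

```python
import struct

# Illustrative type values (assumptions for this sketch).
TYPE_INO_GEN = 1          # two words: inode number, generation
TYPE_INO_GEN_PARENT = 2   # four words: the above plus the parent's pair

WORDS_BY_TYPE = {TYPE_INO_GEN: 2, TYPE_INO_GEN_PARENT: 4}


def encode_fragment(ino, gen, parent=None):
    """Return (type, bytes) for an object and, optionally, its parent."""
    if parent is None:
        return TYPE_INO_GEN, struct.pack("<II", ino, gen)
    p_ino, p_gen = parent
    return TYPE_INO_GEN_PARENT, struct.pack("<IIII", ino, gen, p_ino, p_gen)


def decode_fragment(fh_type, data):
    """Decode using the type only; trailing NUL padding is ignored."""
    nwords = WORDS_BY_TYPE.get(fh_type)
    if nwords is None or len(data) < 4 * nwords:
        raise ValueError("unknown type or fragment too small")
    fields = struct.unpack_from("<%dI" % nwords, data)
    child = fields[0:2]
    parent = fields[2:4] if nwords == 4 else None
    return child, parent
```

Because the decoder derives the number of valid words from the type, a fragment that was padded with NULs to a larger size decodes to the same result as the unpadded one.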
diff --git a/Documentation/filesystems/nfs/index.rst b/Documentation/filesystems/nfs/index.rst
new file mode 100644
index 000000000..8536134f3
--- /dev/null
+++ b/Documentation/filesystems/nfs/index.rst
@@ -0,0 +1,16 @@
+===============================
+NFS
+===============================
+
+
+.. toctree::
+   :maxdepth: 1
+
+   client-identifier
+   exporting
+   pnfs
+   rpc-cache
+   rpc-server-gss
+   nfs41-server
+   knfsd-stats
+   reexport
diff --git a/Documentation/filesystems/nfs/knfsd-stats.rst b/Documentation/filesystems/nfs/knfsd-stats.rst
new file mode 100644
index 000000000..80bcf1355
--- /dev/null
+++ b/Documentation/filesystems/nfs/knfsd-stats.rst
@@ -0,0 +1,122 @@
+============================
+Kernel NFS Server Statistics
+============================
+
+:Authors: Greg Banks <gnb@sgi.com> - 26 Mar 2009
+
+This document describes the format and semantics of the statistics
+which the kernel NFS server makes available to userspace.  These
+statistics are available in several text-form pseudo-files, each of
+which is described separately below.
+
+In most cases you don't need to know these formats, as the nfsstat(8)
+program from the nfs-utils distribution provides a helpful command-line
+interface for extracting and printing them.
+
+All the files described here are formatted as a sequence of text lines,
+separated by newline '\n' characters.  Lines beginning with a hash
+'#' character are comments intended for humans and should be ignored
+by parsing routines.  All other lines contain a sequence of fields
+separated by whitespace.
+
+/proc/fs/nfsd/pool_stats
+========================
+
+This file is available in kernels from 2.6.30 onwards, if the
+/proc/fs/nfsd filesystem is mounted (it almost always should be).
+
+The first line is a comment which describes the fields present in
+all the other lines.  The other lines present the following data as
+a sequence of unsigned decimal numeric fields.  One line is shown
+for each NFS thread pool.
+
+All counters are 64 bits wide and wrap naturally.  There is no way
+to zero these counters; instead, applications should do their own
+rate conversion.
+
+pool
+	The id number of the NFS thread pool to which this line applies.
+	This number does not change.
+
+	Thread pool ids are a contiguous set of small integers starting
+	at zero.  The maximum value depends on the thread pool mode, but
+	currently cannot be larger than the number of CPUs in the system.
+	Note that in the default case there will be a single thread pool
+	which contains all the nfsd threads and all the CPUs in the system,
+	and thus this file will have a single line with a pool id of "0".
+
+packets-arrived
+	Counts how many NFS packets have arrived.  More precisely, this
+	is the number of times that the network stack has notified the
+	sunrpc server layer that new data may be available on a transport
+	(e.g. a TCP or UDP socket or an NFS/RDMA endpoint).
+
+	Depending on the NFS workload patterns and various network stack
+	effects (such as Large Receive Offload) which can combine packets
+	on the wire, this may be either more or less than the number
+	of NFS calls received (which statistic is available elsewhere).
+	However this is a more accurate and less workload-dependent measure
+	of how much CPU load is being placed on the sunrpc server layer
+	due to NFS network traffic.
+
+sockets-enqueued
+	Counts how many times an NFS transport is enqueued to wait for
+	an nfsd thread to service it, i.e. no nfsd thread was considered
+	available.
+
+	The circumstance this statistic tracks indicates that there was NFS
+	network-facing work to be done but it couldn't be done immediately,
+	thus introducing a small delay in servicing NFS calls.  The ideal
+	rate of change for this counter is zero; significantly non-zero
+	values may indicate a performance limitation.
+
+	This can happen because there are too few nfsd threads in the thread
+	pool for the NFS workload (the workload is thread-limited), in which
+	case configuring more nfsd threads will probably improve the
+	performance of the NFS workload.
+
+threads-woken
+	Counts how many times an idle nfsd thread is woken to try to
+	receive some data from an NFS transport.
+
+	This statistic tracks the circumstance where incoming
+	network-facing NFS work is being handled quickly, which is a good
+	thing.  The ideal rate of change for this counter will be close
+	to but less than the rate of change of the packets-arrived counter.
+
+threads-timedout
+	Counts how many times an nfsd thread triggered an idle timeout,
+	i.e. was not woken to handle any incoming network packets for
+	some time.
+
+	This statistic counts a circumstance where there are more nfsd
+	threads configured than can be used by the NFS workload.  This is
+	a clue that the number of nfsd threads can be reduced without
+	affecting performance.  Unfortunately, it's only a clue and not
+	a strong indication, for a couple of reasons:
+
+	 - Currently the rate at which the counter is incremented is quite
+	   slow; the idle timeout is 60 minutes.  Unless the NFS workload
+	   remains constant for hours at a time, this counter is unlikely
+	   to be providing information that is still useful.
+
+	 - It is usually a wise policy to provide some slack,
+	   i.e. configure a few more nfsds than are currently needed,
+	   to allow for future spikes in load.
+
+
+Note that incoming packets on NFS transports will be dealt with in
+one of three ways.  An nfsd thread can be woken (threads-woken counts
+this case), or the transport can be enqueued for later attention
+(sockets-enqueued counts this case), or the packet can be temporarily
+deferred because the transport is currently being used by an nfsd
+thread.  This last case is not very interesting and is not explicitly
+counted, but can be inferred from the other counters thus::
+
+	packets-deferred = packets-arrived - ( sockets-enqueued + threads-woken )
+
+
+More
+====
+
+Descriptions of the other statistics files should go here.
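Editorial note on knfsd-stats.rst: the parsing rules and the packets-deferred formula above can be sketched in a few lines of Python. The function names are hypothetical, and the field order is assumed to match the order in which the document describes the fields. Applications that need to rely on the layout should check the file's own comment line.

```python
# Field order assumed from the order of the descriptions above.
FIELDS = ("pool", "packets-arrived", "sockets-enqueued",
          "threads-woken", "threads-timedout")


def parse_pool_stats(text):
    """Parse pool_stats text into one dict per thread pool."""
    pools = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # comment lines are for humans; parsers skip them
        values = [int(v) for v in line.split()]
        pools.append(dict(zip(FIELDS, values)))
    return pools


def packets_deferred(row):
    """Infer the uncounted 'deferred' case from the other counters."""
    return row["packets-arrived"] - (
        row["sockets-enqueued"] + row["threads-woken"])


def counter_delta(old, new, bits=64):
    """Difference between two samples of a counter that wraps naturally."""
    return (new - old) % (1 << bits)
```

Since the counters cannot be zeroed, a monitoring tool would sample the file twice and divide `counter_delta()` by the sampling interval to obtain a rate.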
diff --git a/Documentation/filesystems/nfs/nfs41-server.rst b/Documentation/filesystems/nfs/nfs41-server.rst
new file mode 100644
index 000000000..16b5f02f8
--- /dev/null
+++ b/Documentation/filesystems/nfs/nfs41-server.rst
@@ -0,0 +1,256 @@
+=============================
+NFSv4.1 Server Implementation
+=============================
+
+Server support for minorversion 1 can be controlled using the
+/proc/fs/nfsd/versions control file.  The string output returned
+by reading this file will contain either "+4.1" or "-4.1"
+correspondingly.
+
+Currently, server support for minorversion 1 is enabled by default.
+It can be disabled at run time by writing the string "-4.1" to
+the /proc/fs/nfsd/versions control file.  Note that to write this
+control file, the nfsd service must be taken down.  You can use rpc.nfsd
+for this; see rpc.nfsd(8).
+
+(Warning: older servers will interpret "+4.1" and "-4.1" as "+4" and
+"-4", respectively.  Therefore, code meant to work on both new and old
+kernels must turn 4.1 on or off *before* turning support for version 4
+on or off; rpc.nfsd does this correctly.)
+
+The NFSv4 minorversion 1 (NFSv4.1) implementation in nfsd is based
+on RFC 5661.
+
+Of the many new features in NFSv4.1, the current implementation
+focuses on the mandatory-to-implement NFSv4.1 Sessions, providing
+"exactly once" semantics and better control and throttling of the
+resources allocated for each client.
+
+The table below, taken from the NFSv4.1 document, lists
+the operations that are mandatory to implement (REQ), optional
+(OPT), and NFSv4.0 operations that must not be implemented (MNI)
+in minor version 1.  The first column indicates the implementation
+status of each operation in the Linux server.
+
+The OPTIONAL features identified and their abbreviations are as follows:
+
+- **pNFS**	Parallel NFS
+- **FDELG**	File Delegations
+- **DDELG**	Directory Delegations
+
+The following abbreviations indicate the Linux server implementation status.
+
+- **I**	Implemented NFSv4.1 operations.
+- **NS**	Not Supported.
+- **NS\***	Unimplemented optional feature.
+
+Operations
+==========
+
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| Implementation status | Operation            | REQ,REC, OPT or MNI | Feature (REQ, REC or OPT) | Definition     |
++=======================+======================+=====================+===========================+================+
+|                       | ACCESS               | REQ                 |                           | Section 18.1   |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| I                     | BACKCHANNEL_CTL      | REQ                 |                           | Section 18.33  |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| I                     | BIND_CONN_TO_SESSION | REQ                 |                           | Section 18.34  |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+|                       | CLOSE                | REQ                 |                           | Section 18.2   |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+|                       | COMMIT               | REQ                 |                           | Section 18.3   |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+|                       | CREATE               | REQ                 |                           | Section 18.4   |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| I                     | CREATE_SESSION       | REQ                 |                           | Section 18.36  |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| NS*                   | DELEGPURGE           | OPT                 | FDELG (REQ)               | Section 18.5   |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+|                       | DELEGRETURN          | OPT                 | FDELG,                    | Section 18.6   |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+|                       |                      |                     | DDELG, pNFS               |                |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+|                       |                      |                     | (REQ)                     |                |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| I                     | DESTROY_CLIENTID     | REQ                 |                           | Section 18.50  |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| I                     | DESTROY_SESSION      | REQ                 |                           | Section 18.37  |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| I                     | EXCHANGE_ID          | REQ                 |                           | Section 18.35  |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| I                     | FREE_STATEID         | REQ                 |                           | Section 18.38  |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+|                       | GETATTR              | REQ                 |                           | Section 18.7   |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| I                     | GETDEVICEINFO        | OPT                 | pNFS (REQ)                | Section 18.40  |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| NS*                   | GETDEVICELIST        | OPT                 | pNFS (OPT)                | Section 18.41  |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+|                       | GETFH                | REQ                 |                           | Section 18.8   |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| NS*                   | GET_DIR_DELEGATION   | OPT                 | DDELG (REQ)               | Section 18.39  |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| I                     | LAYOUTCOMMIT         | OPT                 | pNFS (REQ)                | Section 18.42  |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| I                     | LAYOUTGET            | OPT                 | pNFS (REQ)                | Section 18.43  |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| I                     | LAYOUTRETURN         | OPT                 | pNFS (REQ)                | Section 18.44  |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+|                       | LINK                 | OPT                 |                           | Section 18.9   |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+|                       | LOCK                 | REQ                 |                           | Section 18.10  |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+|                       | LOCKT                | REQ                 |                           | Section 18.11  |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+|                       | LOCKU                | REQ                 |                           | Section 18.12  |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+|                       | LOOKUP               | REQ                 |                           | Section 18.13  |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+|                       | LOOKUPP              | REQ                 |                           | Section 18.14  |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+|                       | NVERIFY              | REQ                 |                           | Section 18.15  |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+|                       | OPEN                 | REQ                 |                           | Section 18.16  |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| NS*                   | OPENATTR             | OPT                 |                           | Section 18.17  |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+|                       | OPEN_CONFIRM         | MNI                 |                           | N/A            |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+|                       | OPEN_DOWNGRADE       | REQ                 |                           | Section 18.18  |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+|                       | PUTFH                | REQ                 |                           | Section 18.19  |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+|                       | PUTPUBFH             | REQ                 |                           | Section 18.20  |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+|                       | PUTROOTFH            | REQ                 |                           | Section 18.21  |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+|                       | READ                 | REQ                 |                           | Section 18.22  |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+|                       | READDIR              | REQ                 |                           | Section 18.23  |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+|                       | READLINK             | OPT                 |                           | Section 18.24  |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+|                       | RECLAIM_COMPLETE     | REQ                 |                           | Section 18.51  |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+|                       | RELEASE_LOCKOWNER    | MNI                 |                           | N/A            |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+|                       | REMOVE               | REQ                 |                           | Section 18.25  |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+|                       | RENAME               | REQ                 |                           | Section 18.26  |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+|                       | RENEW                | MNI                 |                           | N/A            |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+|                       | RESTOREFH            | REQ                 |                           | Section 18.27  |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+|                       | SAVEFH               | REQ                 |                           | Section 18.28  |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+|                       | SECINFO              | REQ                 |                           | Section 18.29  |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| I                     | SECINFO_NO_NAME      | REC                 | pNFS files                | Section 18.45, |
+|                       |                      |                     | layout (REQ)              | Section 13.12  |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| I                     | SEQUENCE             | REQ                 |                           | Section 18.46  |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+|                       | SETATTR              | REQ                 |                           | Section 18.30  |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+|                       | SETCLIENTID          | MNI                 |                           | N/A            |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+|                       | SETCLIENTID_CONFIRM  | MNI                 |                           | N/A            |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| NS                    | SET_SSV              | REQ                 |                           | Section 18.47  |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| I                     | TEST_STATEID         | REQ                 |                           | Section 18.48  |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+|                       | VERIFY               | REQ                 |                           | Section 18.31  |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| NS*                   | WANT_DELEGATION      | OPT                 | FDELG (OPT)               | Section 18.49  |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+|                       | WRITE                | REQ                 |                           | Section 18.32  |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+
+
+Callback Operations
+===================
++-----------------------+-------------------------+---------------------+---------------------------+---------------+
+| Implementation status | Operation               | REQ,REC, OPT or MNI | Feature (REQ, REC or OPT) | Definition    |
++=======================+=========================+=====================+===========================+===============+
+|                       | CB_GETATTR              | OPT                 | FDELG (REQ)               | Section 20.1  |
++-----------------------+-------------------------+---------------------+---------------------------+---------------+
+| I                     | CB_LAYOUTRECALL         | OPT                 | pNFS (REQ)                | Section 20.3  |
++-----------------------+-------------------------+---------------------+---------------------------+---------------+
+| NS*                   | CB_NOTIFY               | OPT                 | DDELG (REQ)               | Section 20.4  |
++-----------------------+-------------------------+---------------------+---------------------------+---------------+
+| NS*                   | CB_NOTIFY_DEVICEID      | OPT                 | pNFS (OPT)                | Section 20.12 |
++-----------------------+-------------------------+---------------------+---------------------------+---------------+
+| NS*                   | CB_NOTIFY_LOCK          | OPT                 |                           | Section 20.11 |
++-----------------------+-------------------------+---------------------+---------------------------+---------------+
+| NS*                   | CB_PUSH_DELEG           | OPT                 | FDELG (OPT)               | Section 20.5  |
++-----------------------+-------------------------+---------------------+---------------------------+---------------+
+|                       | CB_RECALL               | OPT                 | FDELG,                    | Section 20.2  |
+|                       |                         |                     | DDELG, pNFS               |               |
+|                       |                         |                     | (REQ)                     |               |
++-----------------------+-------------------------+---------------------+---------------------------+---------------+
+| NS*                   | CB_RECALL_ANY           | OPT                 | FDELG,                    | Section 20.6  |
+|                       |                         |                     | DDELG, pNFS               |               |
+|                       |                         |                     | (REQ)                     |               |
++-----------------------+-------------------------+---------------------+---------------------------+---------------+
+| NS                    | CB_RECALL_SLOT          | REQ                 |                           | Section 20.8  |
++-----------------------+-------------------------+---------------------+---------------------------+---------------+
+| NS*                   | CB_RECALLABLE_OBJ_AVAIL | OPT                 | DDELG, pNFS               | Section 20.7  |
+|                       |                         |                     | (REQ)                     |               |
++-----------------------+-------------------------+---------------------+---------------------------+---------------+
+| I                     | CB_SEQUENCE             | OPT                 | FDELG,                    | Section 20.9  |
+|                       |                         |                     | DDELG, pNFS               |               |
+|                       |                         |                     | (REQ)                     |               |
++-----------------------+-------------------------+---------------------+---------------------------+---------------+
+| NS*                   | CB_WANTS_CANCELLED      | OPT                 | FDELG,                    | Section 20.10 |
+|                       |                         |                     | DDELG, pNFS               |               |
+|                       |                         |                     | (REQ)                     |               |
++-----------------------+-------------------------+---------------------+---------------------------+---------------+
+
+
+Implementation notes:
+=====================
+
+SSV:
+  The spec claims this is mandatory, but we don't actually know of any
+  implementations, so we're ignoring it for now.  The server returns
+  NFS4ERR_ENCR_ALG_UNSUPP on EXCHANGE_ID, which should be future-proof.
+
+GSS on the backchannel:
+  Again, theoretically required but not widely implemented (in
+  particular, the current Linux client doesn't request it).  We return
+  NFS4ERR_ENCR_ALG_UNSUPP on CREATE_SESSION.
+
+DELEGPURGE:
+  mandatory only for servers that support CLAIM_DELEGATE_PREV and/or
+  CLAIM_DELEG_PREV_FH (which allows clients to keep delegations that
+  persist across client reboots).  Thus we need not implement this for
+  now.
+
+EXCHANGE_ID:
+  implementation ids are ignored
+
+CREATE_SESSION:
+  backchannel attributes are ignored
+
+SEQUENCE:
+  no support for dynamic slot table renegotiation (optional)
+
+Nonstandard compound limitations:
+  No support for a session's fore channel RPC compound that requires both a
+  ca_maxrequestsize request and a ca_maxresponsesize reply, so we may
+  fail to live up to the promise we made in CREATE_SESSION fore channel
+  negotiation.
+
+See also http://wiki.linux-nfs.org/wiki/index.php/Server_4.0_and_4.1_issues.
diff --git a/Documentation/filesystems/nfs/pnfs.rst b/Documentation/filesystems/nfs/pnfs.rst
new file mode 100644
index 000000000..7c470ecdc
--- /dev/null
+++ b/Documentation/filesystems/nfs/pnfs.rst
@@ -0,0 +1,78 @@
+==========================
+Reference counting in pnfs
+==========================
+
+There are several inter-related caches.  We have layouts which can
+reference multiple devices, each of which can reference multiple data servers.
+Each data server can be referenced by multiple devices.  Each device
+can be referenced by multiple layouts. To keep all of this straight,
+we need to reference count.
+
+
+struct pnfs_layout_hdr
+======================
+
+The on-the-wire command LAYOUTGET corresponds to struct
+pnfs_layout_segment, usually referred to by the variable name lseg.
+Each nfs_inode may hold a pointer to a cache of these layout
+segments in nfsi->layout, of type struct pnfs_layout_hdr.
+
+We reference the header for the inode pointing to it, across each
+outstanding RPC call that references it (LAYOUTGET, LAYOUTRETURN,
+LAYOUTCOMMIT), and for each lseg held within.
+
+Each header is also (when non-empty) put on a list associated with
+struct nfs_client (cl_layouts).  Being put on this list does not bump
+the reference count, as the layout is kept around by the lseg that
+keeps it in the list.
+
+deviceid_cache
+==============
+
+lsegs reference device ids, which are resolved per nfs_client and
+layout driver type.  The device ids are held in an RCU cache (struct
+nfs4_deviceid_cache).  The cache itself is referenced across each
+mount.  The entries (struct nfs4_deviceid) themselves are held across
+the lifetime of each lseg referencing them.
+
+RCU is used because the deviceid is basically a write-once, read-many
+data structure.  The hlist size of 32 buckets needs better
+justification, but seems reasonable given that we can have multiple
+deviceids per filesystem, and multiple filesystems per nfs_client.
+
+The hash code is copied from the nfsd code base.  A discussion of
+hashing and variations of this algorithm can be found `here.
+<http://groups.google.com/group/comp.lang.c/browse_thread/thread/9522965e2b8d3809>`_
+
+data server cache
+=================
+
+file driver devices refer to data servers, which are kept in a module
+level cache.  Its reference is held over the lifetime of the deviceid
+pointing to it.
+
+lseg
+====
+
+lseg maintains an extra reference corresponding to the NFS_LSEG_VALID
+bit which holds it in the pnfs_layout_hdr's list.  When the final lseg
+is removed from the pnfs_layout_hdr's list, the NFS_LAYOUT_DESTROYED
+bit is set, preventing any new lsegs from being added.
+
+layout drivers
+==============
+
+pNFS relies on layout drivers. The standard defines four basic
+layout types: "files", "objects", "blocks", and "flexfiles". For each
+of these types there is a layout driver that provides a common table
+of function vectors, which the NFS client's pNFS core calls to
+implement the different layout types.
+
+- Files-layout-driver code is in: fs/nfs/filelayout/.. directory
+- Blocks-layout-driver code is in: fs/nfs/blocklayout/.. directory
+- Flexfiles-layout-driver code is in: fs/nfs/flexfilelayout/.. directory
+
+blocks-layout setup
+===================
+
+TODO: Document the setup needs of the blocks layout driver
diff --git a/Documentation/filesystems/nfs/reexport.rst b/Documentation/filesystems/nfs/reexport.rst
new file mode 100644
index 000000000..ff9ae4a46
--- /dev/null
+++ b/Documentation/filesystems/nfs/reexport.rst
@@ -0,0 +1,113 @@
+Reexporting NFS filesystems
+===========================
+
+Overview
+--------
+
+It is possible to reexport an NFS filesystem over NFS.  However, this
+feature comes with a number of limitations.  Before trying it, we
+recommend some careful research to determine whether it will work for
+your purposes.
+
+A discussion of current known limitations follows.
+
+"fsid=" required, crossmnt broken
+---------------------------------
+
+We require the "fsid=" export option on any reexport of an NFS
+filesystem.  You can use "uuidgen -r" to generate a unique argument.
+
+The "crossmnt" export option does not propagate "fsid=", so it will not
+allow traversing into further NFS filesystems; if you wish to export NFS
+filesystems mounted under the exported filesystem, you'll need to export
+them explicitly, assigning each its own unique "fsid=" option.
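+
+As a rough illustration of this advice, the following Python sketch
+builds one exports-style line per reexported NFS mount, each with its
+own random UUID (the same kind of value "uuidgen -r" prints).  The mount
+paths, the client specification and every option other than "fsid=" are
+hypothetical placeholders, not recommendations.
+

```python
import uuid


def reexport_lines(mount_points, clients="*", options="rw"):
    """Return one exports-style line per mount point, each with a unique fsid."""
    lines = []
    for path in mount_points:
        fsid = uuid.uuid4()  # random (version 4) UUID, like "uuidgen -r"
        lines.append(f"{path} {clients}({options},fsid={fsid})")
    return lines


# Hypothetical layout: an NFS mount with a second NFS mount nested below it.
# Both get an explicit entry because "crossmnt" will not carry "fsid=" along.
for line in reexport_lines(["/srv/reexport", "/srv/reexport/projects"]):
    print(line)
```

+
+The generated values should be kept stable once chosen; the sketch only
+shows that each nested NFS mount gets its own entry and its own value.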
+
+Reboot recovery
+---------------
+
+The NFS protocol's normal reboot recovery mechanisms don't work for the
+case when the reexport server reboots.  Clients will lose any locks
+they held before the reboot, and further IO will result in errors.
+Closing and reopening files should clear the errors.
+
+Filehandle limits
+-----------------
+
+If the original server uses an X byte filehandle for a given object, the
+reexport server's filehandle for the reexported object will be X+22
+bytes, rounded up to the nearest multiple of four bytes.
+
+The result must fit into the RFC-mandated filehandle size limits:
+
++-------+-----------+
+| NFSv2 |  32 bytes |
++-------+-----------+
+| NFSv3 |  64 bytes |
++-------+-----------+
+| NFSv4 | 128 bytes |
++-------+-----------+
+
+So, for example, you will only be able to reexport a filesystem over
+NFSv2 if the original server gives you filehandles that fit in 10
+bytes--which is unlikely.
+
+In general there's no way to know the maximum filehandle size given out
+by an NFS server without asking the server vendor.
+
+But the following table gives a few examples.  The first column is the
+typical length of the filehandle from a Linux server exporting the given
+filesystem, the second is the length after that nfs export is reexported
+by another Linux host:
+
++--------+-------------------+----------------+
+|        | filehandle length | after reexport |
++========+===================+================+
+| ext4:  | 28 bytes          | 52 bytes       |
++--------+-------------------+----------------+
+| xfs:   | 32 bytes          | 56 bytes       |
++--------+-------------------+----------------+
+| btrfs: | 40 bytes          | 64 bytes       |
++--------+-------------------+----------------+
+
+All will therefore fit in an NFSv3 or NFSv4 filehandle after reexport,
+but none are reexportable over NFSv2.
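+
+The size rule and the limits above can be checked with a short
+calculation.  This is only a sketch of the arithmetic stated in this
+section (X+22 bytes, rounded up to a multiple of four); the function
+names are made up for the example.
+

```python
# Filehandle size limits from the table above, in bytes.
LIMITS = {"NFSv2": 32, "NFSv3": 64, "NFSv4": 128}


def reexported_fh_size(original_size):
    """Original size plus 22 bytes, rounded up to a multiple of four."""
    return (original_size + 22 + 3) // 4 * 4


def usable_versions(original_size):
    """NFS versions whose filehandle limit can hold the reexported handle."""
    size = reexported_fh_size(original_size)
    return [version for version, limit in LIMITS.items() if size <= limit]


for fs, size in [("ext4", 28), ("xfs", 32), ("btrfs", 40)]:
    print(fs, reexported_fh_size(size), usable_versions(size))
```

+
+The same function shows why NFSv2 needs an original filehandle of at
+most 10 bytes: 10 + 22 is exactly 32, while anything larger rounds past
+the limit.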
+
+Linux server filehandles are a bit more complicated than this, though;
+for example:
+
+        - The (non-default) "subtreecheck" export option generally
+          requires another 4 to 8 bytes in the filehandle.
+        - If you export a subdirectory of a filesystem (instead of
+          exporting the filesystem root), that also usually adds 4 to 8
+          bytes.
+        - If you export over NFSv2, knfsd usually uses a shorter
+          filesystem identifier that saves 8 bytes.
+        - The root directory of an export uses a filehandle that is
+          shorter.
+
+As you can see, the 128-byte NFSv4 filehandle is large enough that
+you're unlikely to have trouble using NFSv4 to reexport any filesystem
+exported from a Linux server.  In general, if the original server is
+something that also supports NFSv3, you're *probably* OK.  Re-exporting
+over NFSv3 may be dicier, and reexporting over NFSv2 will probably
+never work.
+
+For more details of Linux filehandle structure, the best reference is
+the source code and comments; see in particular:
+
+        - include/linux/exportfs.h:enum fid_type
+        - include/uapi/linux/nfsd/nfsfh.h:struct nfs_fhbase_new
+        - fs/nfsd/nfsfh.c:set_version_and_fsid_type
+        - fs/nfs/export.c:nfs_encode_fh
+
+Open DENY bits ignored
+----------------------
+
+NFS since NFSv4 supports ALLOW and DENY bits taken from Windows, which
+allow you, for example, to open a file in a mode which forbids other
+read opens or write opens. The Linux client doesn't use them, and the
+server's support has always been incomplete: they are enforced only
+against other NFS users, not against processes accessing the exported
+filesystem locally. A reexport server will also not pass them along to
+the original server, so they will not be enforced between clients of
+different reexport servers.
diff --git a/Documentation/filesystems/nfs/rpc-cache.rst b/Documentation/filesystems/nfs/rpc-cache.rst
new file mode 100644
index 000000000..bb164eea9
--- /dev/null
+++ b/Documentation/filesystems/nfs/rpc-cache.rst
@@ -0,0 +1,220 @@
+=========
+RPC Cache
+=========
+
+This document gives a brief introduction to the caching
+mechanisms in the sunrpc layer that are used, in particular,
+for NFS authentication.
+
+Caches
+======
+
+The caching replaces the old exports table and allows for
+a wide variety of values to be cached.
+
+There are a number of caches that are similar in structure though
+quite possibly very different in content and use.  There is a corpus
+of common code for managing these caches.
+
+Examples of caches that are likely to be needed are:
+
+  - mapping from IP address to client name
+  - mapping from client name and filesystem to export options
+  - mapping from UID to list of GIDs, to work around NFS's limitation
+    of 16 gids.
+  - mappings between local UID/GID and remote UID/GID for sites that
+    do not have uniform uid assignment
+  - mapping from network identity to public key for crypto authentication.
+
+The common code handles such things as:
+
+   - general cache lookup with correct locking
+   - supporting 'NEGATIVE' as well as positive entries
+   - allowing an EXPIRED time on cache items, and removing
+     items after they expire and are no longer in use.
+   - making requests to user-space to fill in cache entries
+   - allowing user-space to directly set entries in the cache
+   - delaying RPC requests that depend on as-yet incomplete
+     cache entries, and replaying those requests when the cache entry
+     is complete.
+   - cleaning out old entries as they expire.
+
+Creating a Cache
+----------------
+
+-  A cache needs a datum to store.  This is in the form of a
+   structure definition that must contain a struct cache_head
+   as an element, usually the first.
+   It will also contain a key and some content.
+   Each cache element is reference counted and contains
+   expiry and update times for use in cache management.
+-  A cache needs a "cache_detail" structure that
+   describes the cache.  This stores the hash table, some
+   parameters for cache management, and some operations detailing how
+   to work with particular cache items.
+
+   The operations are:
+
+    struct cache_head \*alloc(void)
+      This simply allocates appropriate memory and returns
+      a pointer to the cache_head embedded within the
+      structure.
+
+    void cache_put(struct kref \*)
+      This is called when the last reference to an item is
+      dropped.  The pointer passed is to the 'ref' field
+      in the cache_head.  cache_put should release any
+      references created by 'cache_init' and, if CACHE_VALID
+      is set, any references created by cache_update.
+      It should then release the memory allocated by
+      'alloc'.
+
+    int match(struct cache_head \*orig, struct cache_head \*new)
+      Test if the keys in the two structures match.  Return
+      1 if they do, 0 if they don't.
+
+    void init(struct cache_head \*orig, struct cache_head \*new)
+      Set the 'key' fields in 'new' from 'orig'.  This may
+      include taking references to shared objects.
+
+    void update(struct cache_head \*orig, struct cache_head \*new)
+      Set the 'content' fields in 'new' from 'orig'.
+
+    int cache_show(struct seq_file \*m, struct cache_detail \*cd, struct cache_head \*h)
+      Optional.  Used to provide a /proc file that lists the
+      contents of a cache.  This should show one item,
+      usually on just one line.
+
+    int cache_request(struct cache_detail \*cd, struct cache_head \*h, char \*\*bpp, int \*blen)
+      Format a request to be sent to user-space for an item
+      to be instantiated.  \*bpp is a buffer of size \*blen.
+      bpp should be moved forward over the encoded message,
+      and \*blen should be reduced to show how much free
+      space remains.  Return 0 on success or <0 if not
+      enough room or other problem.
+
+    int cache_parse(struct cache_detail \*cd, char \*buf, int len)
+      A message from user space has arrived to fill out a
+      cache entry.  It is in 'buf' of length 'len'.
+      cache_parse should parse this, find the item in the
+      cache with sunrpc_cache_lookup_rcu, and update the item
+      with sunrpc_cache_update.
+
+
+-  A cache needs to be registered using cache_register().  This
+   includes it on a list of caches that will be regularly
+   cleaned to discard old data.
+
+Using a cache
+-------------
+
+To find a value in a cache, call sunrpc_cache_lookup_rcu passing a pointer
+to the cache_head in a sample item with the 'key' fields filled in.
+This will be passed to ->match to identify the target entry.  If no
+entry is found, a new entry will be created, added to the cache, and
+marked as not containing valid data.
+
+The item returned is typically passed to cache_check which will check
+if the data is valid, and may initiate an up-call to get fresh data.
+cache_check will return -ENOENT if the entry is negative or if an
+upcall is needed but not possible, -EAGAIN if an upcall is pending,
+or 0 if the data is valid.
+
+cache_check can be passed a "struct cache_req\*".  This structure is
+typically embedded in the actual request and can be used to create a
+deferred copy of the request (struct cache_deferred_req).  This is
+done when the found cache item is not up to date, but there is reason to
+believe that userspace might provide information soon.  When the cache
+item does become valid, the deferred copy of the request will be
+revisited (->revisit).  It is expected that this method will
+reschedule the request for processing.
+
+The value returned by sunrpc_cache_lookup_rcu can also be passed to
+sunrpc_cache_update to set the content for the item.  A second item is
+passed which should hold the content.  If the item found by _lookup
+has valid data, then it is discarded and a new item is created.  This
+saves any user of an item from worrying about content changing while
+it is being inspected.  If the item found by _lookup does not contain
+valid data, then the content is copied across and CACHE_VALID is set.
+
+Populating a cache
+------------------
+
+Each cache has a name, and when the cache is registered, a directory
+with that name is created in /proc/net/rpc.
+
+This directory contains a file called 'channel' which is a channel
+for communicating between kernel and user for populating the cache.
+This directory may later contain other files for interacting
+with the cache.
+
+The 'channel' works a bit like a datagram socket. Each 'write' is
+passed as a whole to the cache for parsing and interpretation.
+Each cache can treat the write requests differently, but it is
+expected that a message written will contain:
+
+  - a key
+  - an expiry time
+  - a content.
+
+with the intention that an item in the cache with the given key
+should be created or updated to have the given content, and the
+expiry time should be set on that item.
+
+Reading from a channel is a bit more interesting.  When a cache
+lookup fails, or when it succeeds but finds an entry that may soon
+expire, a request is lodged for that cache item to be updated by
+user-space.  These requests appear in the channel file.
+
+Successive reads will return successive requests.
+If there are no more requests to return, read will return EOF, but a
+select or poll for read will block waiting for another request to be
+added.
+
+Thus a user-space helper is likely to::
+
+  open the channel.
+    select for readable
+    read a request
+    write a response
+  loop.
+
+If it dies and needs to be restarted, any requests that have not been
+answered will still appear in the file and will be read by the new
+instance of the helper.
+
+Each cache should define a "cache_parse" method which takes a message
+written from user-space and processes it.  It should return an error
+(which propagates back to the write syscall) or 0.
+
+Each cache should also define a "cache_request" method which
+takes a cache item and encodes a request into the buffer
+provided.
+
+.. note::
+  If a cache has no active readers on the channel, and has had no
+  active readers for more than 60 seconds, further requests will not be
+  added to the channel but instead all lookups that do not find a valid
+  entry will fail.  This is partly for backward compatibility: The
+  previous nfs exports table was deemed to be authoritative and a
+  failed lookup meant a definite 'no'.
+
+Request/response format
+-----------------------
+
+While each cache is free to use its own format for requests
+and responses over the channel, the following is recommended as
+appropriate, and support routines are available to help:
+Each request or response record should be printable ASCII
+with precisely one newline character, which should be at the end.
+Fields within the record should be separated by spaces, normally one.
+If spaces, newlines, or nul characters are needed in a field they
+must be quoted.  Two mechanisms are available:
+
+-  If a field begins '\x' then it must contain an even number of
+   hex digits, and pairs of these digits provide the bytes in the
+   field.
+-  Otherwise a \ in the field must be followed by 3 octal digits
+   which give the code for a byte.  Other characters are treated
+   as themselves.  At the very least, space, newline, nul, and
+   '\' must be quoted in this way.
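+
+As an illustration of the second mechanism, a user-space program
+could quote a field along the following lines.  This is a sketch
+under assumptions rather than one of the support routines mentioned
+above: quote_field is a hypothetical name, and the choice to also
+quote every control and non-ASCII byte (so that the record stays
+printable ASCII) goes beyond the minimum set listed above.
+

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: copy 'len' bytes from 'in' to 'out', replacing
 * each byte that needs quoting with a backslash and 3 octal digits.
 * 'out' is NUL-terminated.  Returns the number of characters written
 * (not counting the NUL), or -1 if 'out' is too small. */
int quote_field(const unsigned char *in, size_t len, char *out, size_t cap)
{
    size_t o = 0;

    if (cap == 0)
        return -1;

    for (size_t i = 0; i < len; i++) {
        unsigned char c = in[i];
        /* space, control bytes (including newline and nul),
         * backslash, and anything outside printable ASCII */
        int must_quote = (c <= ' ' || c == '\\' || c > 0x7e);
        size_t need = must_quote ? 4 : 1;

        /* keep one byte free for the terminating NUL */
        if (o + need + 1 > cap)
            return -1;

        if (must_quote)
            snprintf(out + o, 5, "\\%03o", (unsigned int)c);
        else
            out[o] = (char)c;
        o += need;
    }
    out[o] = '\0';
    return (int)o;
}
```

+
+With this sketch, the three-byte field "a b" is written to the
+channel as the six characters a\040b.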
diff --git a/Documentation/filesystems/nfs/rpc-server-gss.rst b/Documentation/filesystems/nfs/rpc-server-gss.rst
new file mode 100644
index 000000000..ccaea9e7c
--- /dev/null
+++ b/Documentation/filesystems/nfs/rpc-server-gss.rst
@@ -0,0 +1,93 @@
+=========================================
+rpcsec_gss support for kernel RPC servers
+=========================================
+
+This document gives references to the standards and protocols used to
+implement RPCGSS authentication in kernel RPC servers such as the NFS
+server and the NFS client's NFSv4.0 callback server.  (But note that
+NFSv4.1 and higher don't require the client to act as a server for the
+purposes of authentication.)
+
+RPCGSS is specified in a few IETF documents:
+
+ - RFC2203 v1: https://tools.ietf.org/rfc/rfc2203.txt
+ - RFC5403 v2: https://tools.ietf.org/rfc/rfc5403.txt
+
+There is a third version that we don't currently implement:
+
+ - RFC7861 v3: https://tools.ietf.org/rfc/rfc7861.txt
+
+Background
+==========
+
+The RPCGSS Authentication method describes a way to perform GSSAPI
+Authentication for NFS.  Although GSSAPI is itself completely mechanism
+agnostic, in many cases only the KRB5 mechanism is supported by NFS
+implementations.
+
+The Linux kernel, at the moment, supports only the KRB5 mechanism, and
+depends on GSSAPI extensions that are KRB5 specific.
+
+GSSAPI is a complex library, and implementing it completely in kernel is
+unwarranted. However, GSSAPI operations are fundamentally separable into 2
+parts:
+
+- initial context establishment
+- integrity/privacy protection (signing and encrypting of individual
+  packets)
+
+The former is more complex and policy-independent, but less
+performance-sensitive.  The latter is simpler and needs to be very fast.
+
+Therefore, we perform per-packet integrity and privacy protection in the
+kernel, but leave the initial context establishment to userspace.  We
+need upcalls to request userspace to perform context establishment.
+
+NFS Server Legacy Upcall Mechanism
+==================================
+
+The classic upcall mechanism uses a custom text-based upcall mechanism
+to talk to a custom daemon called rpc.svcgssd that is provided by the
+nfs-utils package.
+
+This upcall mechanism has 2 limitations:
+
+A) It can handle tokens that are no bigger than 2KiB
+
+In some Kerberos deployments GSSAPI tokens can be quite big, up to and
+beyond 64KiB in size, due to various authorization extensions attached to
+the Kerberos tickets that need to be sent through the GSS layer in
+order to perform context establishment.
+
+B) It does not properly handle creds where the user is a member of more
+than a few thousand groups (the current hard limit in the kernel is 65K
+groups) due to a limitation on the size of the buffer that can be sent
+back to the kernel (4KiB).
+
+NFS Server New RPC Upcall Mechanism
+===================================
+
+The newer upcall mechanism uses RPC over a unix socket to a daemon
+called gss-proxy, implemented by a userspace program called Gssproxy.
+
+The gss_proxy RPC protocol is currently documented `here
+<https://fedorahosted.org/gss-proxy/wiki/ProtocolDocumentation>`_.
+
+This upcall mechanism uses the kernel rpc client and connects to the gssproxy
+userspace program over a regular unix socket. The gssproxy protocol does not
+suffer from the size limitations of the legacy protocol.
+
+Negotiating Upcall Mechanisms
+=============================
+
+To provide backward compatibility, the kernel defaults to using the
+legacy mechanism.  To switch to the new mechanism, gss-proxy must bind
+to /var/run/gssproxy.sock and then write "1" to
+/proc/net/rpc/use-gss-proxy.  If gss-proxy dies, it must repeat both
+steps.
+
+Once the upcall mechanism is chosen, it cannot be changed.  To prevent
+being locked into the legacy mechanism, the above steps must be performed
+before starting nfsd.  Whoever starts nfsd can guarantee this by reading
+from /proc/net/rpc/use-gss-proxy and checking that it contains a
+"1"--the read will block until gss-proxy has done its write to the file.
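+
+The check described above could look roughly like this in whatever
+program starts nfsd.  It is a sketch only: gss_proxy_in_use is a
+hypothetical name, and the path is passed in by the caller rather
+than hard-coded.
+

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: read the first byte of the file at 'path'.
 * Returns 1 if it is '1', 0 if the file holds anything else (or is
 * empty), and -1 if the file cannot be opened or read. */
int gss_proxy_in_use(const char *path)
{
    char c = 0;
    ssize_t n;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;

    /* On the real proc file this read is the one that blocks
     * until gss-proxy has done its write. */
    do {
        n = read(fd, &c, 1);
    } while (n < 0 && errno == EINTR);

    close(fd);
    if (n < 0)
        return -1;
    return (n == 1 && c == '1') ? 1 : 0;
}
```

+
+The caller would pass /proc/net/rpc/use-gss-proxy and start nfsd only
+once the function has returned 1.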