summaryrefslogtreecommitdiffstats
path: root/Documentation/filesystems/nfs
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/filesystems/nfs')
-rw-r--r--Documentation/filesystems/nfs/exporting.rst165
-rw-r--r--Documentation/filesystems/nfs/index.rst13
-rw-r--r--Documentation/filesystems/nfs/knfsd-stats.rst122
-rw-r--r--Documentation/filesystems/nfs/nfs41-server.rst256
-rw-r--r--Documentation/filesystems/nfs/pnfs.rst78
-rw-r--r--Documentation/filesystems/nfs/rpc-cache.rst220
-rw-r--r--Documentation/filesystems/nfs/rpc-server-gss.rst93
7 files changed, 947 insertions, 0 deletions
diff --git a/Documentation/filesystems/nfs/exporting.rst b/Documentation/filesystems/nfs/exporting.rst
new file mode 100644
index 000000000..33d588a01
--- /dev/null
+++ b/Documentation/filesystems/nfs/exporting.rst
@@ -0,0 +1,165 @@
+:orphan:
+
+Making Filesystems Exportable
+=============================
+
+Overview
+--------
+
+All filesystem operations require a dentry (or two) as a starting
+point. Local applications have a reference-counted hold on suitable
+dentries via open file descriptors or cwd/root. However remote
+applications that access a filesystem via a remote filesystem protocol
+such as NFS may not be able to hold such a reference, and so need a
+different way to refer to a particular dentry. As the alternative
+form of reference needs to be stable across renames, truncates, and
+server-reboot (among other things, though these tend to be the most
+problematic), there is no simple answer like 'filename'.
+
+The mechanism discussed here allows each filesystem implementation to
+specify how to generate an opaque (outside of the filesystem) byte
+string for any dentry, and how to find an appropriate dentry for any
+given opaque byte string.
+This byte string will be called a "filehandle fragment" as it
+corresponds to part of an NFS filehandle.
+
+A filesystem which supports the mapping between filehandle fragments
+and dentries will be termed "exportable".
+
+
+
+Dcache Issues
+-------------
+
+The dcache normally contains a proper prefix of any given filesystem
+tree. This means that if any filesystem object is in the dcache, then
+all of the ancestors of that filesystem object are also in the dcache.
+As normal access is by filename this prefix is created naturally and
+maintained easily (by each object maintaining a reference count on
+its parent).
+
+However when objects are included into the dcache by interpreting a
+filehandle fragment, there is no automatic creation of a path prefix
+for the object. This leads to two related but distinct features of
+the dcache that are not needed for normal filesystem access.
+
+1. The dcache must sometimes contain objects that are not part of the
+ proper prefix. i.e that are not connected to the root.
+2. The dcache must be prepared for a newly found (via ->lookup) directory
+ to already have a (non-connected) dentry, and must be able to move
+ that dentry into place (based on the parent and name in the
+ ->lookup). This is particularly needed for directories as
+ it is a dcache invariant that directories only have one dentry.
+
+To implement these features, the dcache has:
+
+a. A dentry flag DCACHE_DISCONNECTED which is set on
+ any dentry that might not be part of the proper prefix.
+ This is set when anonymous dentries are created, and cleared when a
+ dentry is noticed to be a child of a dentry which is in the proper
+ prefix. If the refcount on a dentry with this flag set
+ becomes zero, the dentry is immediately discarded, rather than being
+ kept in the dcache. If a dentry that is not already in the dcache
+ is repeatedly accessed by filehandle (as NFSD might do), an new dentry
+ will be a allocated for each access, and discarded at the end of
+ the access.
+
+ Note that such a dentry can acquire children, name, ancestors, etc.
+ without losing DCACHE_DISCONNECTED - that flag is only cleared when
+ subtree is successfully reconnected to root. Until then dentries
+ in such subtree are retained only as long as there are references;
+ refcount reaching zero means immediate eviction, same as for unhashed
+ dentries. That guarantees that we won't need to hunt them down upon
+ umount.
+
+b. A primitive for creation of secondary roots - d_obtain_root(inode).
+ Those do _not_ bear DCACHE_DISCONNECTED. They are placed on the
+ per-superblock list (->s_roots), so they can be located at umount
+ time for eviction purposes.
+
+c. Helper routines to allocate anonymous dentries, and to help attach
+ loose directory dentries at lookup time. They are:
+
+ d_obtain_alias(inode) will return a dentry for the given inode.
+ If the inode already has a dentry, one of those is returned.
+
+ If it doesn't, a new anonymous (IS_ROOT and
+ DCACHE_DISCONNECTED) dentry is allocated and attached.
+
+ In the case of a directory, care is taken that only one dentry
+ can ever be attached.
+
+ d_splice_alias(inode, dentry) will introduce a new dentry into the tree;
+ either the passed-in dentry or a preexisting alias for the given inode
+ (such as an anonymous one created by d_obtain_alias), if appropriate.
+ It returns NULL when the passed-in dentry is used, following the calling
+ convention of ->lookup.
+
+Filesystem Issues
+-----------------
+
+For a filesystem to be exportable it must:
+
+ 1. provide the filehandle fragment routines described below.
+ 2. make sure that d_splice_alias is used rather than d_add
+ when ->lookup finds an inode for a given parent and name.
+
+ If inode is NULL, d_splice_alias(inode, dentry) is equivalent to::
+
+ d_add(dentry, inode), NULL
+
+ Similarly, d_splice_alias(ERR_PTR(err), dentry) = ERR_PTR(err)
+
+ Typically the ->lookup routine will simply end with a::
+
+ return d_splice_alias(inode, dentry);
+ }
+
+
+
+A file system implementation declares that instances of the filesystem
+are exportable by setting the s_export_op field in the struct
+super_block. This field must point to a "struct export_operations"
+struct which has the following members:
+
+ encode_fh (optional)
+ Takes a dentry and creates a filehandle fragment which can later be used
+ to find or create a dentry for the same object. The default
+ implementation creates a filehandle fragment that encodes a 32bit inode
+ and generation number for the inode encoded, and if necessary the
+ same information for the parent.
+
+ fh_to_dentry (mandatory)
+ Given a filehandle fragment, this should find the implied object and
+ create a dentry for it (possibly with d_obtain_alias).
+
+ fh_to_parent (optional but strongly recommended)
+ Given a filehandle fragment, this should find the parent of the
+ implied object and create a dentry for it (possibly with
+ d_obtain_alias). May fail if the filehandle fragment is too small.
+
+ get_parent (optional but strongly recommended)
+ When given a dentry for a directory, this should return a dentry for
+ the parent. Quite possibly the parent dentry will have been allocated
+ by d_alloc_anon. The default get_parent function just returns an error
+ so any filehandle lookup that requires finding a parent will fail.
+ ->lookup("..") is *not* used as a default as it can leave ".." entries
+ in the dcache which are too messy to work with.
+
+ get_name (optional)
+ When given a parent dentry and a child dentry, this should find a name
+ in the directory identified by the parent dentry, which leads to the
+ object identified by the child dentry. If no get_name function is
+ supplied, a default implementation is provided which uses vfs_readdir
+ to find potential names, and matches inode numbers to find the correct
+ match.
+
+
+A filehandle fragment consists of an array of 1 or more 4byte words,
+together with a one byte "type".
+The decode_fh routine should not depend on the stated size that is
+passed to it. This size may be larger than the original filehandle
+generated by encode_fh, in which case it will have been padded with
+nuls. Rather, the encode_fh routine should choose a "type" which
+indicates the decode_fh how much of the filehandle is valid, and how
+it should be interpreted.
diff --git a/Documentation/filesystems/nfs/index.rst b/Documentation/filesystems/nfs/index.rst
new file mode 100644
index 000000000..65805624e
--- /dev/null
+++ b/Documentation/filesystems/nfs/index.rst
@@ -0,0 +1,13 @@
+===============================
+NFS
+===============================
+
+
+.. toctree::
+ :maxdepth: 1
+
+ pnfs
+ rpc-cache
+ rpc-server-gss
+ nfs41-server
+ knfsd-stats
diff --git a/Documentation/filesystems/nfs/knfsd-stats.rst b/Documentation/filesystems/nfs/knfsd-stats.rst
new file mode 100644
index 000000000..80bcf1355
--- /dev/null
+++ b/Documentation/filesystems/nfs/knfsd-stats.rst
@@ -0,0 +1,122 @@
+============================
+Kernel NFS Server Statistics
+============================
+
+:Authors: Greg Banks <gnb@sgi.com> - 26 Mar 2009
+
+This document describes the format and semantics of the statistics
+which the kernel NFS server makes available to userspace. These
+statistics are available in several text form pseudo files, each of
+which is described separately below.
+
+In most cases you don't need to know these formats, as the nfsstat(8)
+program from the nfs-utils distribution provides a helpful command-line
+interface for extracting and printing them.
+
+All the files described here are formatted as a sequence of text lines,
+separated by newline '\n' characters. Lines beginning with a hash
+'#' character are comments intended for humans and should be ignored
+by parsing routines. All other lines contain a sequence of fields
+separated by whitespace.
+
+/proc/fs/nfsd/pool_stats
+========================
+
+This file is available in kernels from 2.6.30 onwards, if the
+/proc/fs/nfsd filesystem is mounted (it almost always should be).
+
+The first line is a comment which describes the fields present in
+all the other lines. The other lines present the following data as
+a sequence of unsigned decimal numeric fields. One line is shown
+for each NFS thread pool.
+
+All counters are 64 bits wide and wrap naturally. There is no way
+to zero these counters, instead applications should do their own
+rate conversion.
+
+pool
+ The id number of the NFS thread pool to which this line applies.
+ This number does not change.
+
+ Thread pool ids are a contiguous set of small integers starting
+ at zero. The maximum value depends on the thread pool mode, but
+ currently cannot be larger than the number of CPUs in the system.
+ Note that in the default case there will be a single thread pool
+ which contains all the nfsd threads and all the CPUs in the system,
+ and thus this file will have a single line with a pool id of "0".
+
+packets-arrived
+ Counts how many NFS packets have arrived. More precisely, this
+ is the number of times that the network stack has notified the
+ sunrpc server layer that new data may be available on a transport
+ (e.g. an NFS or UDP socket or an NFS/RDMA endpoint).
+
+ Depending on the NFS workload patterns and various network stack
+ effects (such as Large Receive Offload) which can combine packets
+ on the wire, this may be either more or less than the number
+ of NFS calls received (which statistic is available elsewhere).
+ However this is a more accurate and less workload-dependent measure
+ of how much CPU load is being placed on the sunrpc server layer
+ due to NFS network traffic.
+
+sockets-enqueued
+ Counts how many times an NFS transport is enqueued to wait for
+ an nfsd thread to service it, i.e. no nfsd thread was considered
+ available.
+
+ The circumstance this statistic tracks indicates that there was NFS
+ network-facing work to be done but it couldn't be done immediately,
+ thus introducing a small delay in servicing NFS calls. The ideal
+ rate of change for this counter is zero; significantly non-zero
+ values may indicate a performance limitation.
+
+ This can happen because there are too few nfsd threads in the thread
+ pool for the NFS workload (the workload is thread-limited), in which
+ case configuring more nfsd threads will probably improve the
+ performance of the NFS workload.
+
+threads-woken
+ Counts how many times an idle nfsd thread is woken to try to
+ receive some data from an NFS transport.
+
+ This statistic tracks the circumstance where incoming
+ network-facing NFS work is being handled quickly, which is a good
+ thing. The ideal rate of change for this counter will be close
+ to but less than the rate of change of the packets-arrived counter.
+
+threads-timedout
+ Counts how many times an nfsd thread triggered an idle timeout,
+ i.e. was not woken to handle any incoming network packets for
+ some time.
+
+ This statistic counts a circumstance where there are more nfsd
+ threads configured than can be used by the NFS workload. This is
+ a clue that the number of nfsd threads can be reduced without
+ affecting performance. Unfortunately, it's only a clue and not
+ a strong indication, for a couple of reasons:
+
+ - Currently the rate at which the counter is incremented is quite
+ slow; the idle timeout is 60 minutes. Unless the NFS workload
+ remains constant for hours at a time, this counter is unlikely
+ to be providing information that is still useful.
+
+ - It is usually a wise policy to provide some slack,
+ i.e. configure a few more nfsds than are currently needed,
+ to allow for future spikes in load.
+
+
+Note that incoming packets on NFS transports will be dealt with in
+one of three ways. An nfsd thread can be woken (threads-woken counts
+this case), or the transport can be enqueued for later attention
+(sockets-enqueued counts this case), or the packet can be temporarily
+deferred because the transport is currently being used by an nfsd
+thread. This last case is not very interesting and is not explicitly
+counted, but can be inferred from the other counters thus::
+
+ packets-deferred = packets-arrived - ( sockets-enqueued + threads-woken )
+
+
+More
+====
+
+Descriptions of the other statistics file should go here.
diff --git a/Documentation/filesystems/nfs/nfs41-server.rst b/Documentation/filesystems/nfs/nfs41-server.rst
new file mode 100644
index 000000000..16b5f02f8
--- /dev/null
+++ b/Documentation/filesystems/nfs/nfs41-server.rst
@@ -0,0 +1,256 @@
+=============================
+NFSv4.1 Server Implementation
+=============================
+
+Server support for minorversion 1 can be controlled using the
+/proc/fs/nfsd/versions control file. The string output returned
+by reading this file will contain either "+4.1" or "-4.1"
+correspondingly.
+
+Currently, server support for minorversion 1 is enabled by default.
+It can be disabled at run time by writing the string "-4.1" to
+the /proc/fs/nfsd/versions control file. Note that to write this
+control file, the nfsd service must be taken down. You can use rpc.nfsd
+for this; see rpc.nfsd(8).
+
+(Warning: older servers will interpret "+4.1" and "-4.1" as "+4" and
+"-4", respectively. Therefore, code meant to work on both new and old
+kernels must turn 4.1 on or off *before* turning support for version 4
+on or off; rpc.nfsd does this correctly.)
+
+The NFSv4 minorversion 1 (NFSv4.1) implementation in nfsd is based
+on RFC 5661.
+
+From the many new features in NFSv4.1 the current implementation
+focuses on the mandatory-to-implement NFSv4.1 Sessions, providing
+"exactly once" semantics and better control and throttling of the
+resources allocated for each client.
+
+The table below, taken from the NFSv4.1 document, lists
+the operations that are mandatory to implement (REQ), optional
+(OPT), and NFSv4.0 operations that are required not to implement (MNI)
+in minor version 1. The first column indicates the operations that
+are not supported yet by the linux server implementation.
+
+The OPTIONAL features identified and their abbreviations are as follows:
+
+- **pNFS** Parallel NFS
+- **FDELG** File Delegations
+- **DDELG** Directory Delegations
+
+The following abbreviations indicate the linux server implementation status.
+
+- **I** Implemented NFSv4.1 operations.
+- **NS** Not Supported.
+- **NS\*** Unimplemented optional feature.
+
+Operations
+==========
+
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| Implementation status | Operation | REQ,REC, OPT or NMI | Feature (REQ, REC or OPT) | Definition |
++=======================+======================+=====================+===========================+================+
+| | ACCESS | REQ | | Section 18.1 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| I | BACKCHANNEL_CTL | REQ | | Section 18.33 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| I | BIND_CONN_TO_SESSION | REQ | | Section 18.34 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| | CLOSE | REQ | | Section 18.2 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| | COMMIT | REQ | | Section 18.3 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| | CREATE | REQ | | Section 18.4 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| I | CREATE_SESSION | REQ | | Section 18.36 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| NS* | DELEGPURGE | OPT | FDELG (REQ) | Section 18.5 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| | DELEGRETURN | OPT | FDELG, | Section 18.6 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| | | | DDELG, pNFS | |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| | | | (REQ) | |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| I | DESTROY_CLIENTID | REQ | | Section 18.50 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| I | DESTROY_SESSION | REQ | | Section 18.37 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| I | EXCHANGE_ID | REQ | | Section 18.35 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| I | FREE_STATEID | REQ | | Section 18.38 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| | GETATTR | REQ | | Section 18.7 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| I | GETDEVICEINFO | OPT | pNFS (REQ) | Section 18.40 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| NS* | GETDEVICELIST | OPT | pNFS (OPT) | Section 18.41 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| | GETFH | REQ | | Section 18.8 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| NS* | GET_DIR_DELEGATION | OPT | DDELG (REQ) | Section 18.39 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| I | LAYOUTCOMMIT | OPT | pNFS (REQ) | Section 18.42 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| I | LAYOUTGET | OPT | pNFS (REQ) | Section 18.43 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| I | LAYOUTRETURN | OPT | pNFS (REQ) | Section 18.44 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| | LINK | OPT | | Section 18.9 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| | LOCK | REQ | | Section 18.10 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| | LOCKT | REQ | | Section 18.11 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| | LOCKU | REQ | | Section 18.12 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| | LOOKUP | REQ | | Section 18.13 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| | LOOKUPP | REQ | | Section 18.14 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| | NVERIFY | REQ | | Section 18.15 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| | OPEN | REQ | | Section 18.16 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| NS* | OPENATTR | OPT | | Section 18.17 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| | OPEN_CONFIRM | MNI | | N/A |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| | OPEN_DOWNGRADE | REQ | | Section 18.18 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| | PUTFH | REQ | | Section 18.19 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| | PUTPUBFH | REQ | | Section 18.20 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| | PUTROOTFH | REQ | | Section 18.21 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| | READ | REQ | | Section 18.22 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| | READDIR | REQ | | Section 18.23 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| | READLINK | OPT | | Section 18.24 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| | RECLAIM_COMPLETE | REQ | | Section 18.51 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| | RELEASE_LOCKOWNER | MNI | | N/A |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| | REMOVE | REQ | | Section 18.25 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| | RENAME | REQ | | Section 18.26 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| | RENEW | MNI | | N/A |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| | RESTOREFH | REQ | | Section 18.27 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| | SAVEFH | REQ | | Section 18.28 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| | SECINFO | REQ | | Section 18.29 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| I | SECINFO_NO_NAME | REC | pNFS files | Section 18.45, |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| | | | layout (REQ) | Section 13.12 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| I | SEQUENCE | REQ | | Section 18.46 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| | SETATTR | REQ | | Section 18.30 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| | SETCLIENTID | MNI | | N/A |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| | SETCLIENTID_CONFIRM | MNI | | N/A |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| NS | SET_SSV | REQ | | Section 18.47 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| I | TEST_STATEID | REQ | | Section 18.48 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| | VERIFY | REQ | | Section 18.31 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| NS* | WANT_DELEGATION | OPT | FDELG (OPT) | Section 18.49 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| | WRITE | REQ | | Section 18.32 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+
+
+Callback Operations
+===================
++-----------------------+-------------------------+---------------------+---------------------------+---------------+
+| Implementation status | Operation | REQ,REC, OPT or NMI | Feature (REQ, REC or OPT) | Definition |
++=======================+=========================+=====================+===========================+===============+
+| | CB_GETATTR | OPT | FDELG (REQ) | Section 20.1 |
++-----------------------+-------------------------+---------------------+---------------------------+---------------+
+| I | CB_LAYOUTRECALL | OPT | pNFS (REQ) | Section 20.3 |
++-----------------------+-------------------------+---------------------+---------------------------+---------------+
+| NS* | CB_NOTIFY | OPT | DDELG (REQ) | Section 20.4 |
++-----------------------+-------------------------+---------------------+---------------------------+---------------+
+| NS* | CB_NOTIFY_DEVICEID | OPT | pNFS (OPT) | Section 20.12 |
++-----------------------+-------------------------+---------------------+---------------------------+---------------+
+| NS* | CB_NOTIFY_LOCK | OPT | | Section 20.11 |
++-----------------------+-------------------------+---------------------+---------------------------+---------------+
+| NS* | CB_PUSH_DELEG | OPT | FDELG (OPT) | Section 20.5 |
++-----------------------+-------------------------+---------------------+---------------------------+---------------+
+| | CB_RECALL | OPT | FDELG, | Section 20.2 |
++-----------------------+-------------------------+---------------------+---------------------------+---------------+
+| | | | DDELG, pNFS | |
++-----------------------+-------------------------+---------------------+---------------------------+---------------+
+| | | | (REQ) | |
++-----------------------+-------------------------+---------------------+---------------------------+---------------+
+| NS* | CB_RECALL_ANY | OPT | FDELG, | Section 20.6 |
++-----------------------+-------------------------+---------------------+---------------------------+---------------+
+| | | | DDELG, pNFS | |
++-----------------------+-------------------------+---------------------+---------------------------+---------------+
+| | | | (REQ) | |
++-----------------------+-------------------------+---------------------+---------------------------+---------------+
+| NS | CB_RECALL_SLOT | REQ | | Section 20.8 |
++-----------------------+-------------------------+---------------------+---------------------------+---------------+
+| NS* | CB_RECALLABLE_OBJ_AVAIL | OPT | DDELG, pNFS | Section 20.7 |
++-----------------------+-------------------------+---------------------+---------------------------+---------------+
+| | | | (REQ) | |
++-----------------------+-------------------------+---------------------+---------------------------+---------------+
+| I | CB_SEQUENCE | OPT | FDELG, | Section 20.9 |
++-----------------------+-------------------------+---------------------+---------------------------+---------------+
+| | | | DDELG, pNFS | |
++-----------------------+-------------------------+---------------------+---------------------------+---------------+
+| | | | (REQ) | |
++-----------------------+-------------------------+---------------------+---------------------------+---------------+
+| NS* | CB_WANTS_CANCELLED | OPT | FDELG, | Section 20.10 |
++-----------------------+-------------------------+---------------------+---------------------------+---------------+
+| | | | DDELG, pNFS | |
++-----------------------+-------------------------+---------------------+---------------------------+---------------+
+| | | | (REQ) | |
++-----------------------+-------------------------+---------------------+---------------------------+---------------+
+
+
+Implementation notes:
+=====================
+
+SSV:
+ The spec claims this is mandatory, but we don't actually know of any
+ implementations, so we're ignoring it for now. The server returns
+ NFS4ERR_ENCR_ALG_UNSUPP on EXCHANGE_ID, which should be future-proof.
+
+GSS on the backchannel:
+ Again, theoretically required but not widely implemented (in
+ particular, the current Linux client doesn't request it). We return
+ NFS4ERR_ENCR_ALG_UNSUPP on CREATE_SESSION.
+
+DELEGPURGE:
+ mandatory only for servers that support CLAIM_DELEGATE_PREV and/or
+ CLAIM_DELEG_PREV_FH (which allows clients to keep delegations that
+ persist across client reboots). Thus we need not implement this for
+ now.
+
+EXCHANGE_ID:
+ implementation ids are ignored
+
+CREATE_SESSION:
+ backchannel attributes are ignored
+
+SEQUENCE:
+ no support for dynamic slot table renegotiation (optional)
+
+Nonstandard compound limitations:
+ No support for a sessions fore channel RPC compound that requires both a
+ ca_maxrequestsize request and a ca_maxresponsesize reply, so we may
+ fail to live up to the promise we made in CREATE_SESSION fore channel
+ negotiation.
+
+See also http://wiki.linux-nfs.org/wiki/index.php/Server_4.0_and_4.1_issues.
diff --git a/Documentation/filesystems/nfs/pnfs.rst b/Documentation/filesystems/nfs/pnfs.rst
new file mode 100644
index 000000000..7c470ecdc
--- /dev/null
+++ b/Documentation/filesystems/nfs/pnfs.rst
@@ -0,0 +1,78 @@
+==========================
+Reference counting in pnfs
+==========================
+
+The are several inter-related caches. We have layouts which can
+reference multiple devices, each of which can reference multiple data servers.
+Each data server can be referenced by multiple devices. Each device
+can be referenced by multiple layouts. To keep all of this straight,
+we need to reference count.
+
+
+struct pnfs_layout_hdr
+======================
+
+The on-the-wire command LAYOUTGET corresponds to struct
+pnfs_layout_segment, usually referred to by the variable name lseg.
+Each nfs_inode may hold a pointer to a cache of these layout
+segments in nfsi->layout, of type struct pnfs_layout_hdr.
+
+We reference the header for the inode pointing to it, across each
+outstanding RPC call that references it (LAYOUTGET, LAYOUTRETURN,
+LAYOUTCOMMIT), and for each lseg held within.
+
+Each header is also (when non-empty) put on a list associated with
+struct nfs_client (cl_layouts). Being put on this list does not bump
+the reference count, as the layout is kept around by the lseg that
+keeps it in the list.
+
+deviceid_cache
+==============
+
+lsegs reference device ids, which are resolved per nfs_client and
+layout driver type. The device ids are held in a RCU cache (struct
+nfs4_deviceid_cache). The cache itself is referenced across each
+mount. The entries (struct nfs4_deviceid) themselves are held across
+the lifetime of each lseg referencing them.
+
+RCU is used because the deviceid is basically a write once, read many
+data structure. The hlist size of 32 buckets needs better
+justification, but seems reasonable given that we can have multiple
+deviceid's per filesystem, and multiple filesystems per nfs_client.
+
+The hash code is copied from the nfsd code base. A discussion of
+hashing and variations of this algorithm can be found `here.
+<http://groups.google.com/group/comp.lang.c/browse_thread/thread/9522965e2b8d3809>`_
+
+data server cache
+=================
+
+file driver devices refer to data servers, which are kept in a module
+level cache. Its reference is held over the lifetime of the deviceid
+pointing to it.
+
+lseg
+====
+
+lseg maintains an extra reference corresponding to the NFS_LSEG_VALID
+bit which holds it in the pnfs_layout_hdr's list. When the final lseg
+is removed from the pnfs_layout_hdr's list, the NFS_LAYOUT_DESTROYED
+bit is set, preventing any new lsegs from being added.
+
+layout drivers
+==============
+
+PNFS utilizes what is called layout drivers. The STD defines 4 basic
+layout types: "files", "objects", "blocks", and "flexfiles". For each
+of these types there is a layout-driver with a common function-vectors
+table which are called by the nfs-client pnfs-core to implement the
+different layout types.
+
+Files-layout-driver code is in: fs/nfs/filelayout/.. directory
+Blocks-layout-driver code is in: fs/nfs/blocklayout/.. directory
+Flexfiles-layout-driver code is in: fs/nfs/flexfilelayout/.. directory
+
+blocks-layout setup
+===================
+
+TODO: Document the setup needs of the blocks layout driver
diff --git a/Documentation/filesystems/nfs/rpc-cache.rst b/Documentation/filesystems/nfs/rpc-cache.rst
new file mode 100644
index 000000000..bb164eea9
--- /dev/null
+++ b/Documentation/filesystems/nfs/rpc-cache.rst
@@ -0,0 +1,220 @@
+=========
+RPC Cache
+=========
+
+This document gives a brief introduction to the caching
+mechanisms in the sunrpc layer that is used, in particular,
+for NFS authentication.
+
+Caches
+======
+
+The caching replaces the old exports table and allows for
+a wide variety of values to be caches.
+
+There are a number of caches that are similar in structure though
+quite possibly very different in content and use. There is a corpus
+of common code for managing these caches.
+
+Examples of caches that are likely to be needed are:
+
+ - mapping from IP address to client name
+ - mapping from client name and filesystem to export options
+ - mapping from UID to list of GIDs, to work around NFS's limitation
+ of 16 gids.
+ - mappings between local UID/GID and remote UID/GID for sites that
+ do not have uniform uid assignment
+ - mapping from network identify to public key for crypto authentication.
+
+The common code handles such things as:
+
+ - general cache lookup with correct locking
+ - supporting 'NEGATIVE' as well as positive entries
+ - allowing an EXPIRED time on cache items, and removing
+ items after they expire, and are no longer in-use.
+ - making requests to user-space to fill in cache entries
+ - allowing user-space to directly set entries in the cache
+ - delaying RPC requests that depend on as-yet incomplete
+ cache entries, and replaying those requests when the cache entry
+ is complete.
+ - clean out old entries as they expire.
+
+Creating a Cache
+----------------
+
+- A cache needs a datum to store. This is in the form of a
+ structure definition that must contain a struct cache_head
+ as an element, usually the first.
+ It will also contain a key and some content.
+ Each cache element is reference counted and contains
+ expiry and update times for use in cache management.
+- A cache needs a "cache_detail" structure that
+ describes the cache. This stores the hash table, some
+ parameters for cache management, and some operations detailing how
+ to work with particular cache items.
+
+ The operations are:
+
+ struct cache_head \*alloc(void)
+ This simply allocates appropriate memory and returns
+ a pointer to the cache_detail embedded within the
+ structure
+
+ void cache_put(struct kref \*)
+ This is called when the last reference to an item is
+ dropped. The pointer passed is to the 'ref' field
+ in the cache_head. cache_put should release any
+ references create by 'cache_init' and, if CACHE_VALID
+ is set, any references created by cache_update.
+ It should then release the memory allocated by
+ 'alloc'.
+
+ int match(struct cache_head \*orig, struct cache_head \*new)
+ test if the keys in the two structures match. Return
+ 1 if they do, 0 if they don't.
+
+ void init(struct cache_head \*orig, struct cache_head \*new)
+ Set the 'key' fields in 'new' from 'orig'. This may
+ include taking references to shared objects.
+
+ void update(struct cache_head \*orig, struct cache_head \*new)
+ Set the 'content' fileds in 'new' from 'orig'.
+
+ int cache_show(struct seq_file \*m, struct cache_detail \*cd, struct cache_head \*h)
+ Optional. Used to provide a /proc file that lists the
+ contents of a cache. This should show one item,
+ usually on just one line.
+
+ int cache_request(struct cache_detail \*cd, struct cache_head \*h, char \*\*bpp, int \*blen)
+ Format a request to be send to user-space for an item
+ to be instantiated. \*bpp is a buffer of size \*blen.
+ bpp should be moved forward over the encoded message,
+ and \*blen should be reduced to show how much free
+ space remains. Return 0 on success or <0 if not
+ enough room or other problem.
+
+ int cache_parse(struct cache_detail \*cd, char \*buf, int len)
+ A message from user space has arrived to fill out a
+ cache entry. It is in 'buf' of length 'len'.
+ cache_parse should parse this, find the item in the
+ cache with sunrpc_cache_lookup_rcu, and update the item
+ with sunrpc_cache_update.
+
+
+- A cache needs to be registered using cache_register(). This
+ includes it on a list of caches that will be regularly
+ cleaned to discard old data.
+
+Using a cache
+-------------
+
+To find a value in a cache, call sunrpc_cache_lookup_rcu passing a pointer
+to the cache_head in a sample item with the 'key' fields filled in.
+This will be passed to ->match to identify the target entry. If no
+entry is found, a new entry will be create, added to the cache, and
+marked as not containing valid data.
+
+The item returned is typically passed to cache_check which will check
+if the data is valid, and may initiate an up-call to get fresh data.
+cache_check will return -ENOENT in the entry is negative or if an up
+call is needed but not possible, -EAGAIN if an upcall is pending,
+or 0 if the data is valid;
+
+cache_check can be passed a "struct cache_req\*". This structure is
+typically embedded in the actual request and can be used to create a
+deferred copy of the request (struct cache_deferred_req). This is
+done when the found cache item is not uptodate, but the is reason to
+believe that userspace might provide information soon. When the cache
+item does become valid, the deferred copy of the request will be
+revisited (->revisit). It is expected that this method will
+reschedule the request for processing.
+
+The value returned by sunrpc_cache_lookup_rcu can also be passed to
+sunrpc_cache_update to set the content for the item. A second item is
+passed which should hold the content. If the item found by _lookup
+has valid data, then it is discarded and a new item is created. This
+saves any user of an item from worrying about content changing while
+it is being inspected. If the item found by _lookup does not contain
+valid data, then the content is copied across and CACHE_VALID is set.
+
+Populating a cache
+------------------
+
+Each cache has a name, and when the cache is registered, a directory
+with that name is created in /proc/net/rpc
+
+This directory contains a file called 'channel' which is a channel
+for communicating between kernel and user for populating the cache.
+This directory may later contain other files of interacting
+with the cache.
+
+The 'channel' works a bit like a datagram socket. Each 'write' is
+passed as a whole to the cache for parsing and interpretation.
+Each cache can treat the write requests differently, but it is
+expected that a message written will contain:
+
+ - a key
+ - an expiry time
+ - a content.
+
+with the intention that an item in the cache with the give key
+should be create or updated to have the given content, and the
+expiry time should be set on that item.
+
+Reading from a channel is a bit more interesting. When a cache
+lookup fails, or when it succeeds but finds an entry that may soon
+expire, a request is lodged for that cache item to be updated by
+user-space. These requests appear in the channel file.
+
+Successive reads will return successive requests.
+If there are no more requests to return, read will return EOF, but a
+select or poll for read will block waiting for another request to be
+added.
+
+Thus a user-space helper is likely to::
+
+ open the channel.
+ select for readable
+ read a request
+ write a response
+ loop.
+
+If it dies and needs to be restarted, any requests that have not been
+answered will still appear in the file and will be read by the new
+instance of the helper.
+
+Each cache should define a "cache_parse" method which takes a message
+written from user-space and processes it. It should return an error
+(which propagates back to the write syscall) or 0.
+
+Each cache should also define a "cache_request" method which
+takes a cache item and encodes a request into the buffer
+provided.
+
+.. note::
+ If a cache has no active readers on the channel, and has had not
+ active readers for more than 60 seconds, further requests will not be
+ added to the channel but instead all lookups that do not find a valid
+ entry will fail. This is partly for backward compatibility: The
+ previous nfs exports table was deemed to be authoritative and a
+ failed lookup meant a definite 'no'.
+
+request/response format
+-----------------------
+
+While each cache is free to use its own format for requests
+and responses over channel, the following is recommended as
+appropriate and support routines are available to help:
+Each request or response record should be printable ASCII
+with precisely one newline character which should be at the end.
+Fields within the record should be separated by spaces, normally one.
+If spaces, newlines, or nul characters are needed in a field they
+much be quoted. two mechanisms are available:
+
+- If a field begins '\x' then it must contain an even number of
+ hex digits, and pairs of these digits provide the bytes in the
+ field.
+- otherwise a \ in the field must be followed by 3 octal digits
+ which give the code for a byte. Other characters are treated
+ as them selves. At the very least, space, newline, nul, and
+ '\' must be quoted in this way.
diff --git a/Documentation/filesystems/nfs/rpc-server-gss.rst b/Documentation/filesystems/nfs/rpc-server-gss.rst
new file mode 100644
index 000000000..ccaea9e7c
--- /dev/null
+++ b/Documentation/filesystems/nfs/rpc-server-gss.rst
@@ -0,0 +1,93 @@
+=========================================
+rpcsec_gss support for kernel RPC servers
+=========================================
+
+This document gives references to the standards and protocols used to
+implement RPCGSS authentication in kernel RPC servers such as the NFS
+server and the NFS client's NFSv4.0 callback server. (But note that
+NFSv4.1 and higher don't require the client to act as a server for the
+purposes of authentication.)
+
+RPCGSS is specified in a few IETF documents:
+
+ - RFC2203 v1: https://tools.ietf.org/rfc/rfc2203.txt
+ - RFC5403 v2: https://tools.ietf.org/rfc/rfc5403.txt
+
+There is a third version that we don't currently implement:
+
+ - RFC7861 v3: https://tools.ietf.org/rfc/rfc7861.txt
+
+Background
+==========
+
+The RPCGSS Authentication method describes a way to perform GSSAPI
+Authentication for NFS. Although GSSAPI is itself completely mechanism
+agnostic, in many cases only the KRB5 mechanism is supported by NFS
+implementations.
+
+The Linux kernel, at the moment, supports only the KRB5 mechanism, and
+depends on GSSAPI extensions that are KRB5 specific.
+
+GSSAPI is a complex library, and implementing it completely in kernel is
+unwarranted. However GSSAPI operations are fundementally separable in 2
+parts:
+
+- initial context establishment
+- integrity/privacy protection (signing and encrypting of individual
+ packets)
+
+The former is more complex and policy-independent, but less
+performance-sensitive. The latter is simpler and needs to be very fast.
+
+Therefore, we perform per-packet integrity and privacy protection in the
+kernel, but leave the initial context establishment to userspace. We
+need upcalls to request userspace to perform context establishment.
+
+NFS Server Legacy Upcall Mechanism
+==================================
+
+The classic upcall mechanism uses a custom text based upcall mechanism
+to talk to a custom daemon called rpc.svcgssd that is provide by the
+nfs-utils package.
+
+This upcall mechanism has 2 limitations:
+
+A) It can handle tokens that are no bigger than 2KiB
+
+In some Kerberos deployment GSSAPI tokens can be quite big, up and
+beyond 64KiB in size due to various authorization extensions attacked to
+the Kerberos tickets, that needs to be sent through the GSS layer in
+order to perform context establishment.
+
+B) It does not properly handle creds where the user is member of more
+than a few thousand groups (the current hard limit in the kernel is 65K
+groups) due to limitation on the size of the buffer that can be send
+back to the kernel (4KiB).
+
+NFS Server New RPC Upcall Mechanism
+===================================
+
+The newer upcall mechanism uses RPC over a unix socket to a daemon
+called gss-proxy, implemented by a userspace program called Gssproxy.
+
+The gss_proxy RPC protocol is currently documented `here
+<https://fedorahosted.org/gss-proxy/wiki/ProtocolDocumentation>`_.
+
+This upcall mechanism uses the kernel rpc client and connects to the gssproxy
+userspace program over a regular unix socket. The gssproxy protocol does not
+suffer from the size limitations of the legacy protocol.
+
+Negotiating Upcall Mechanisms
+=============================
+
+To provide backward compatibility, the kernel defaults to using the
+legacy mechanism. To switch to the new mechanism, gss-proxy must bind
+to /var/run/gssproxy.sock and then write "1" to
+/proc/net/rpc/use-gss-proxy. If gss-proxy dies, it must repeat both
+steps.
+
+Once the upcall mechanism is chosen, it cannot be changed. To prevent
+locking into the legacy mechanisms, the above steps must be performed
+before starting nfsd. Whoever starts nfsd can guarantee this by reading
+from /proc/net/rpc/use-gss-proxy and checking that it contains a
+"1"--the read will block until gss-proxy has done its write to the file.