From 2c3c1048746a4622d8c89a29670120dc8fab93c4 Mon Sep 17 00:00:00 2001 From: Daniel Baumann Date: Sun, 7 Apr 2024 20:49:45 +0200 Subject: Adding upstream version 6.1.76. Signed-off-by: Daniel Baumann --- Documentation/admin-guide/sysctl/abi.rst | 34 + Documentation/admin-guide/sysctl/fs.rst | 384 +++++++ Documentation/admin-guide/sysctl/index.rst | 98 ++ Documentation/admin-guide/sysctl/kernel.rst | 1596 +++++++++++++++++++++++++++ Documentation/admin-guide/sysctl/net.rst | 499 +++++++++ Documentation/admin-guide/sysctl/sunrpc.rst | 25 + Documentation/admin-guide/sysctl/user.rst | 84 ++ Documentation/admin-guide/sysctl/vm.rst | 1046 ++++++++++++++++++ 8 files changed, 3766 insertions(+) create mode 100644 Documentation/admin-guide/sysctl/abi.rst create mode 100644 Documentation/admin-guide/sysctl/fs.rst create mode 100644 Documentation/admin-guide/sysctl/index.rst create mode 100644 Documentation/admin-guide/sysctl/kernel.rst create mode 100644 Documentation/admin-guide/sysctl/net.rst create mode 100644 Documentation/admin-guide/sysctl/sunrpc.rst create mode 100644 Documentation/admin-guide/sysctl/user.rst create mode 100644 Documentation/admin-guide/sysctl/vm.rst (limited to 'Documentation/admin-guide/sysctl') diff --git a/Documentation/admin-guide/sysctl/abi.rst b/Documentation/admin-guide/sysctl/abi.rst new file mode 100644 index 000000000..4e6db0a2a --- /dev/null +++ b/Documentation/admin-guide/sysctl/abi.rst @@ -0,0 +1,34 @@ +.. SPDX-License-Identifier: GPL-2.0+ + +================================ +Documentation for /proc/sys/abi/ +================================ + +.. See scripts/check-sysctl-docs to keep this up to date: +.. scripts/check-sysctl-docs -vtable="abi" \ +.. Documentation/admin-guide/sysctl/abi.rst \ +.. $(git grep -l register_sysctl_) + +Copyright (c) 2020, Stephen Kitt + +For general info, see Documentation/admin-guide/sysctl/index.rst. + +------------------------------------------------------------------------------ + +The files in ``/proc/sys/abi`` can be used to see and modify +ABI-related settings. + +Currently, these files might (depending on your configuration) +show up in ``/proc/sys/kernel``: + +.. contents:: :local: + +vsyscall32 (x86) +================ + +Determines whether the kernels maps a vDSO page into 32-bit processes; +can be set to 1 to enable, or 0 to disable. Defaults to enabled if +``CONFIG_COMPAT_VDSO`` is set, disabled otherwise. + +This controls the same setting as the ``vdso32`` kernel boot +parameter. diff --git a/Documentation/admin-guide/sysctl/fs.rst b/Documentation/admin-guide/sysctl/fs.rst new file mode 100644 index 000000000..2a501c9dd --- /dev/null +++ b/Documentation/admin-guide/sysctl/fs.rst @@ -0,0 +1,384 @@ +=============================== +Documentation for /proc/sys/fs/ +=============================== + +kernel version 2.2.10 + +Copyright (c) 1998, 1999, Rik van Riel + +Copyright (c) 2009, Shen Feng + +For general info and legal blurb, please look in intro.rst. + +------------------------------------------------------------------------------ + +This file contains documentation for the sysctl files in +/proc/sys/fs/ and is valid for Linux kernel version 2.2. + +The files in this directory can be used to tune and monitor +miscellaneous and general things in the operation of the Linux +kernel. Since some of the files _can_ be used to screw up your +system, it is advisable to read both documentation and source +before actually making adjustments. + +1. /proc/sys/fs +=============== + +Currently, these files are in /proc/sys/fs: + +- aio-max-nr +- aio-nr +- dentry-state +- dquot-max +- dquot-nr +- file-max +- file-nr +- inode-max +- inode-nr +- inode-state +- nr_open +- overflowuid +- overflowgid +- pipe-user-pages-hard +- pipe-user-pages-soft +- protected_fifos +- protected_hardlinks +- protected_regular +- protected_symlinks +- suid_dumpable +- super-max +- super-nr + + +aio-nr & aio-max-nr +------------------- + +aio-nr is the running total of the number of events specified on the +io_setup system call for all currently active aio contexts. If aio-nr +reaches aio-max-nr then io_setup will fail with EAGAIN. Note that +raising aio-max-nr does not result in the pre-allocation or re-sizing +of any kernel data structures. + + +dentry-state +------------ + +From linux/include/linux/dcache.h:: + + struct dentry_stat_t dentry_stat { + int nr_dentry; + int nr_unused; + int age_limit; /* age in seconds */ + int want_pages; /* pages requested by system */ + int nr_negative; /* # of unused negative dentries */ + int dummy; /* Reserved for future use */ + }; + +Dentries are dynamically allocated and deallocated. + +nr_dentry shows the total number of dentries allocated (active ++ unused). nr_unused shows the number of dentries that are not +actively used, but are saved in the LRU list for future reuse. + +Age_limit is the age in seconds after which dcache entries +can be reclaimed when memory is short and want_pages is +nonzero when shrink_dcache_pages() has been called and the +dcache isn't pruned yet. + +nr_negative shows the number of unused dentries that are also +negative dentries which do not map to any files. Instead, +they help speeding up rejection of non-existing files provided +by the users. + + +dquot-max & dquot-nr +-------------------- + +The file dquot-max shows the maximum number of cached disk +quota entries. + +The file dquot-nr shows the number of allocated disk quota +entries and the number of free disk quota entries. + +If the number of free cached disk quotas is very low and +you have some awesome number of simultaneous system users, +you might want to raise the limit. + + +file-max & file-nr +------------------ + +The value in file-max denotes the maximum number of file- +handles that the Linux kernel will allocate. When you get lots +of error messages about running out of file handles, you might +want to increase this limit. + +Historically,the kernel was able to allocate file handles +dynamically, but not to free them again. The three values in +file-nr denote the number of allocated file handles, the number +of allocated but unused file handles, and the maximum number of +file handles. Linux 2.6 always reports 0 as the number of free +file handles -- this is not an error, it just means that the +number of allocated file handles exactly matches the number of +used file handles. + +Attempts to allocate more file descriptors than file-max are +reported with printk, look for "VFS: file-max limit +reached". + + +nr_open +------- + +This denotes the maximum number of file-handles a process can +allocate. Default value is 1024*1024 (1048576) which should be +enough for most machines. Actual limit depends on RLIMIT_NOFILE +resource limit. + + +inode-max, inode-nr & inode-state +--------------------------------- + +As with file handles, the kernel allocates the inode structures +dynamically, but can't free them yet. + +The value in inode-max denotes the maximum number of inode +handlers. This value should be 3-4 times larger than the value +in file-max, since stdin, stdout and network sockets also +need an inode struct to handle them. When you regularly run +out of inodes, you need to increase this value. + +The file inode-nr contains the first two items from +inode-state, so we'll skip to that file... + +Inode-state contains three actual numbers and four dummies. +The actual numbers are, in order of appearance, nr_inodes, +nr_free_inodes and preshrink. + +Nr_inodes stands for the number of inodes the system has +allocated, this can be slightly more than inode-max because +Linux allocates them one pageful at a time. + +Nr_free_inodes represents the number of free inodes (?) and +preshrink is nonzero when the nr_inodes > inode-max and the +system needs to prune the inode list instead of allocating +more. + + +overflowgid & overflowuid +------------------------- + +Some filesystems only support 16-bit UIDs and GIDs, although in Linux +UIDs and GIDs are 32 bits. When one of these filesystems is mounted +with writes enabled, any UID or GID that would exceed 65535 is translated +to a fixed value before being written to disk. + +These sysctls allow you to change the value of the fixed UID and GID. +The default is 65534. + + +pipe-user-pages-hard +-------------------- + +Maximum total number of pages a non-privileged user may allocate for pipes. +Once this limit is reached, no new pipes may be allocated until usage goes +below the limit again. When set to 0, no limit is applied, which is the default +setting. + + +pipe-user-pages-soft +-------------------- + +Maximum total number of pages a non-privileged user may allocate for pipes +before the pipe size gets limited to a single page. Once this limit is reached, +new pipes will be limited to a single page in size for this user in order to +limit total memory usage, and trying to increase them using fcntl() will be +denied until usage goes below the limit again. The default value allows to +allocate up to 1024 pipes at their default size. When set to 0, no limit is +applied. + + +protected_fifos +--------------- + +The intent of this protection is to avoid unintentional writes to +an attacker-controlled FIFO, where a program expected to create a regular +file. + +When set to "0", writing to FIFOs is unrestricted. + +When set to "1" don't allow O_CREAT open on FIFOs that we don't own +in world writable sticky directories, unless they are owned by the +owner of the directory. + +When set to "2" it also applies to group writable sticky directories. + +This protection is based on the restrictions in Openwall. + + +protected_hardlinks +-------------------- + +A long-standing class of security issues is the hardlink-based +time-of-check-time-of-use race, most commonly seen in world-writable +directories like /tmp. The common method of exploitation of this flaw +is to cross privilege boundaries when following a given hardlink (i.e. a +root process follows a hardlink created by another user). Additionally, +on systems without separated partitions, this stops unauthorized users +from "pinning" vulnerable setuid/setgid files against being upgraded by +the administrator, or linking to special files. + +When set to "0", hardlink creation behavior is unrestricted. + +When set to "1" hardlinks cannot be created by users if they do not +already own the source file, or do not have read/write access to it. + +This protection is based on the restrictions in Openwall and grsecurity. + + +protected_regular +----------------- + +This protection is similar to protected_fifos, but it +avoids writes to an attacker-controlled regular file, where a program +expected to create one. + +When set to "0", writing to regular files is unrestricted. + +When set to "1" don't allow O_CREAT open on regular files that we +don't own in world writable sticky directories, unless they are +owned by the owner of the directory. + +When set to "2" it also applies to group writable sticky directories. + + +protected_symlinks +------------------ + +A long-standing class of security issues is the symlink-based +time-of-check-time-of-use race, most commonly seen in world-writable +directories like /tmp. The common method of exploitation of this flaw +is to cross privilege boundaries when following a given symlink (i.e. a +root process follows a symlink belonging to another user). For a likely +incomplete list of hundreds of examples across the years, please see: +https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=/tmp + +When set to "0", symlink following behavior is unrestricted. + +When set to "1" symlinks are permitted to be followed only when outside +a sticky world-writable directory, or when the uid of the symlink and +follower match, or when the directory owner matches the symlink's owner. + +This protection is based on the restrictions in Openwall and grsecurity. + + +suid_dumpable: +-------------- + +This value can be used to query and set the core dump mode for setuid +or otherwise protected/tainted binaries. The modes are + += ========== =============================================================== +0 (default) traditional behaviour. Any process which has changed + privilege levels or is execute only will not be dumped. +1 (debug) all processes dump core when possible. The core dump is + owned by the current user and no security is applied. This is + intended for system debugging situations only. + Ptrace is unchecked. + This is insecure as it allows regular users to examine the + memory contents of privileged processes. +2 (suidsafe) any binary which normally would not be dumped is dumped + anyway, but only if the "core_pattern" kernel sysctl is set to + either a pipe handler or a fully qualified path. (For more + details on this limitation, see CVE-2006-2451.) This mode is + appropriate when administrators are attempting to debug + problems in a normal environment, and either have a core dump + pipe handler that knows to treat privileged core dumps with + care, or specific directory defined for catching core dumps. + If a core dump happens without a pipe handler or fully + qualified path, a message will be emitted to syslog warning + about the lack of a correct setting. += ========== =============================================================== + + +super-max & super-nr +-------------------- + +These numbers control the maximum number of superblocks, and +thus the maximum number of mounted filesystems the kernel +can have. You only need to increase super-max if you need to +mount more filesystems than the current value in super-max +allows you to. + + +aio-nr & aio-max-nr +------------------- + +aio-nr shows the current system-wide number of asynchronous io +requests. aio-max-nr allows you to change the maximum value +aio-nr can grow to. + + +mount-max +--------- + +This denotes the maximum number of mounts that may exist +in a mount namespace. + + + +2. /proc/sys/fs/binfmt_misc +=========================== + +Documentation for the files in /proc/sys/fs/binfmt_misc is +in Documentation/admin-guide/binfmt-misc.rst. + + +3. /proc/sys/fs/mqueue - POSIX message queues filesystem +======================================================== + + +The "mqueue" filesystem provides the necessary kernel features to enable the +creation of a user space library that implements the POSIX message queues +API (as noted by the MSG tag in the POSIX 1003.1-2001 version of the System +Interfaces specification.) + +The "mqueue" filesystem contains values for determining/setting the amount of +resources used by the file system. + +/proc/sys/fs/mqueue/queues_max is a read/write file for setting/getting the +maximum number of message queues allowed on the system. + +/proc/sys/fs/mqueue/msg_max is a read/write file for setting/getting the +maximum number of messages in a queue value. In fact it is the limiting value +for another (user) limit which is set in mq_open invocation. This attribute of +a queue must be less or equal then msg_max. + +/proc/sys/fs/mqueue/msgsize_max is a read/write file for setting/getting the +maximum message size value (it is every message queue's attribute set during +its creation). + +/proc/sys/fs/mqueue/msg_default is a read/write file for setting/getting the +default number of messages in a queue value if attr parameter of mq_open(2) is +NULL. If it exceed msg_max, the default value is initialized msg_max. + +/proc/sys/fs/mqueue/msgsize_default is a read/write file for setting/getting +the default message size value if attr parameter of mq_open(2) is NULL. If it +exceed msgsize_max, the default value is initialized msgsize_max. + +4. /proc/sys/fs/epoll - Configuration options for the epoll interface +===================================================================== + +This directory contains configuration options for the epoll(7) interface. + +max_user_watches +---------------- + +Every epoll file descriptor can store a number of files to be monitored +for event readiness. Each one of these monitored files constitutes a "watch". +This configuration option sets the maximum number of "watches" that are +allowed for each user. +Each "watch" costs roughly 90 bytes on a 32bit kernel, and roughly 160 bytes +on a 64bit one. +The current default value for max_user_watches is the 1/25 (4%) of the +available low memory, divided for the "watch" cost in bytes. diff --git a/Documentation/admin-guide/sysctl/index.rst b/Documentation/admin-guide/sysctl/index.rst new file mode 100644 index 000000000..03346f98c --- /dev/null +++ b/Documentation/admin-guide/sysctl/index.rst @@ -0,0 +1,98 @@ +=========================== +Documentation for /proc/sys +=========================== + +Copyright (c) 1998, 1999, Rik van Riel + +------------------------------------------------------------------------------ + +'Why', I hear you ask, 'would anyone even _want_ documentation +for them sysctl files? If anybody really needs it, it's all in +the source...' + +Well, this documentation is written because some people either +don't know they need to tweak something, or because they don't +have the time or knowledge to read the source code. + +Furthermore, the programmers who built sysctl have built it to +be actually used, not just for the fun of programming it :-) + +------------------------------------------------------------------------------ + +Legal blurb: + +As usual, there are two main things to consider: + +1. you get what you pay for +2. it's free + +The consequences are that I won't guarantee the correctness of +this document, and if you come to me complaining about how you +screwed up your system because of wrong documentation, I won't +feel sorry for you. I might even laugh at you... + +But of course, if you _do_ manage to screw up your system using +only the sysctl options used in this file, I'd like to hear of +it. Not only to have a great laugh, but also to make sure that +you're the last RTFMing person to screw up. + +In short, e-mail your suggestions, corrections and / or horror +stories to: + +Rik van Riel. + +-------------------------------------------------------------- + +Introduction +============ + +Sysctl is a means of configuring certain aspects of the kernel +at run-time, and the /proc/sys/ directory is there so that you +don't even need special tools to do it! +In fact, there are only four things needed to use these config +facilities: + +- a running Linux system +- root access +- common sense (this is especially hard to come by these days) +- knowledge of what all those values mean + +As a quick 'ls /proc/sys' will show, the directory consists of +several (arch-dependent?) subdirs. Each subdir is mainly about +one part of the kernel, so you can do configuration on a piece +by piece basis, or just some 'thematic frobbing'. + +This documentation is about: + +=============== =============================================================== +abi/ execution domains & personalities +debug/ +dev/ device specific information (eg dev/cdrom/info) +fs/ specific filesystems + filehandle, inode, dentry and quota tuning + binfmt_misc +kernel/ global kernel info / tuning + miscellaneous stuff +net/ networking stuff, for documentation look in: + +proc/ +sunrpc/ SUN Remote Procedure Call (NFS) +vm/ memory management tuning + buffer and cache management +user/ Per user per user namespace limits +=============== =============================================================== + +These are the subdirs I have on my system. There might be more +or other subdirs in another setup. If you see another dir, I'd +really like to hear about it :-) + +.. toctree:: + :maxdepth: 1 + + abi + fs + kernel + net + sunrpc + user + vm diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst new file mode 100644 index 000000000..b3588fff1 --- /dev/null +++ b/Documentation/admin-guide/sysctl/kernel.rst @@ -0,0 +1,1596 @@ +=================================== +Documentation for /proc/sys/kernel/ +=================================== + +.. See scripts/check-sysctl-docs to keep this up to date + + +Copyright (c) 1998, 1999, Rik van Riel + +Copyright (c) 2009, Shen Feng + +For general info and legal blurb, please look in +Documentation/admin-guide/sysctl/index.rst. + +------------------------------------------------------------------------------ + +This file contains documentation for the sysctl files in +``/proc/sys/kernel/``. + +The files in this directory can be used to tune and monitor +miscellaneous and general things in the operation of the Linux +kernel. Since some of the files *can* be used to screw up your +system, it is advisable to read both documentation and source +before actually making adjustments. + +Currently, these files might (depending on your configuration) +show up in ``/proc/sys/kernel``: + +.. contents:: :local: + + +acct +==== + +:: + + highwater lowwater frequency + +If BSD-style process accounting is enabled these values control +its behaviour. If free space on filesystem where the log lives +goes below ``lowwater``\ % accounting suspends. If free space gets +above ``highwater``\ % accounting resumes. ``frequency`` determines +how often do we check the amount of free space (value is in +seconds). Default: + +:: + + 4 2 30 + +That is, suspend accounting if free space drops below 2%; resume it +if it increases to at least 4%; consider information about amount of +free space valid for 30 seconds. + + +acpi_video_flags +================ + +See Documentation/power/video.rst. This allows the video resume mode to be set, +in a similar fashion to the ``acpi_sleep`` kernel parameter, by +combining the following values: + += ======= +1 s3_bios +2 s3_mode +4 s3_beep += ======= + +arch +==== + +The machine hardware name, the same output as ``uname -m`` +(e.g. ``x86_64`` or ``aarch64``). + +auto_msgmni +=========== + +This variable has no effect and may be removed in future kernel +releases. Reading it always returns 0. +Up to Linux 3.17, it enabled/disabled automatic recomputing of +`msgmni`_ +upon memory add/remove or upon IPC namespace creation/removal. +Echoing "1" into this file enabled msgmni automatic recomputing. +Echoing "0" turned it off. The default value was 1. + + +bootloader_type (x86 only) +========================== + +This gives the bootloader type number as indicated by the bootloader, +shifted left by 4, and OR'd with the low four bits of the bootloader +version. The reason for this encoding is that this used to match the +``type_of_loader`` field in the kernel header; the encoding is kept for +backwards compatibility. That is, if the full bootloader type number +is 0x15 and the full version number is 0x234, this file will contain +the value 340 = 0x154. + +See the ``type_of_loader`` and ``ext_loader_type`` fields in +Documentation/x86/boot.rst for additional information. + + +bootloader_version (x86 only) +============================= + +The complete bootloader version number. In the example above, this +file will contain the value 564 = 0x234. + +See the ``type_of_loader`` and ``ext_loader_ver`` fields in +Documentation/x86/boot.rst for additional information. + + +bpf_stats_enabled +================= + +Controls whether the kernel should collect statistics on BPF programs +(total time spent running, number of times run...). Enabling +statistics causes a slight reduction in performance on each program +run. The statistics can be seen using ``bpftool``. + += =================================== +0 Don't collect statistics (default). +1 Collect statistics. += =================================== + + +cad_pid +======= + +This is the pid which will be signalled on reboot (notably, by +Ctrl-Alt-Delete). Writing a value to this file which doesn't +correspond to a running process will result in ``-ESRCH``. + +See also `ctrl-alt-del`_. + + +cap_last_cap +============ + +Highest valid capability of the running kernel. Exports +``CAP_LAST_CAP`` from the kernel. + + +core_pattern +============ + +``core_pattern`` is used to specify a core dumpfile pattern name. + +* max length 127 characters; default value is "core" +* ``core_pattern`` is used as a pattern template for the output + filename; certain string patterns (beginning with '%') are + substituted with their actual values. +* backward compatibility with ``core_uses_pid``: + + If ``core_pattern`` does not include "%p" (default does not) + and ``core_uses_pid`` is set, then .PID will be appended to + the filename. + +* corename format specifiers + + ======== ========================================== + % '%' is dropped + %% output one '%' + %p pid + %P global pid (init PID namespace) + %i tid + %I global tid (init PID namespace) + %u uid (in initial user namespace) + %g gid (in initial user namespace) + %d dump mode, matches ``PR_SET_DUMPABLE`` and + ``/proc/sys/fs/suid_dumpable`` + %s signal number + %t UNIX time of dump + %h hostname + %e executable filename (may be shortened, could be changed by prctl etc) + %f executable filename + %E executable path + %c maximum size of core file by resource limit RLIMIT_CORE + % both are dropped + ======== ========================================== + +* If the first character of the pattern is a '|', the kernel will treat + the rest of the pattern as a command to run. The core dump will be + written to the standard input of that program instead of to a file. + + +core_pipe_limit +=============== + +This sysctl is only applicable when `core_pattern`_ is configured to +pipe core files to a user space helper (when the first character of +``core_pattern`` is a '|', see above). +When collecting cores via a pipe to an application, it is occasionally +useful for the collecting application to gather data about the +crashing process from its ``/proc/pid`` directory. +In order to do this safely, the kernel must wait for the collecting +process to exit, so as not to remove the crashing processes proc files +prematurely. +This in turn creates the possibility that a misbehaving userspace +collecting process can block the reaping of a crashed process simply +by never exiting. +This sysctl defends against that. +It defines how many concurrent crashing processes may be piped to user +space applications in parallel. +If this value is exceeded, then those crashing processes above that +value are noted via the kernel log and their cores are skipped. +0 is a special value, indicating that unlimited processes may be +captured in parallel, but that no waiting will take place (i.e. the +collecting process is not guaranteed access to ``/proc//``). +This value defaults to 0. + + +core_uses_pid +============= + +The default coredump filename is "core". By setting +``core_uses_pid`` to 1, the coredump filename becomes core.PID. +If `core_pattern`_ does not include "%p" (default does not) +and ``core_uses_pid`` is set, then .PID will be appended to +the filename. + + +ctrl-alt-del +============ + +When the value in this file is 0, ctrl-alt-del is trapped and +sent to the ``init(1)`` program to handle a graceful restart. +When, however, the value is > 0, Linux's reaction to a Vulcan +Nerve Pinch (tm) will be an immediate reboot, without even +syncing its dirty buffers. + +Note: + when a program (like dosemu) has the keyboard in 'raw' + mode, the ctrl-alt-del is intercepted by the program before it + ever reaches the kernel tty layer, and it's up to the program + to decide what to do with it. + + +dmesg_restrict +============== + +This toggle indicates whether unprivileged users are prevented +from using ``dmesg(8)`` to view messages from the kernel's log +buffer. +When ``dmesg_restrict`` is set to 0 there are no restrictions. +When ``dmesg_restrict`` is set to 1, users must have +``CAP_SYSLOG`` to use ``dmesg(8)``. + +The kernel config option ``CONFIG_SECURITY_DMESG_RESTRICT`` sets the +default value of ``dmesg_restrict``. + + +domainname & hostname +===================== + +These files can be used to set the NIS/YP domainname and the +hostname of your box in exactly the same way as the commands +domainname and hostname, i.e.:: + + # echo "darkstar" > /proc/sys/kernel/hostname + # echo "mydomain" > /proc/sys/kernel/domainname + +has the same effect as:: + + # hostname "darkstar" + # domainname "mydomain" + +Note, however, that the classic darkstar.frop.org has the +hostname "darkstar" and DNS (Internet Domain Name Server) +domainname "frop.org", not to be confused with the NIS (Network +Information Service) or YP (Yellow Pages) domainname. These two +domain names are in general different. For a detailed discussion +see the ``hostname(1)`` man page. + + +firmware_config +=============== + +See Documentation/driver-api/firmware/fallback-mechanisms.rst. + +The entries in this directory allow the firmware loader helper +fallback to be controlled: + +* ``force_sysfs_fallback``, when set to 1, forces the use of the + fallback; +* ``ignore_sysfs_fallback``, when set to 1, ignores any fallback. + + +ftrace_dump_on_oops +=================== + +Determines whether ``ftrace_dump()`` should be called on an oops (or +kernel panic). This will output the contents of the ftrace buffers to +the console. This is very useful for capturing traces that lead to +crashes and outputting them to a serial console. + += =================================================== +0 Disabled (default). +1 Dump buffers of all CPUs. +2 Dump the buffer of the CPU that triggered the oops. += =================================================== + + +ftrace_enabled, stack_tracer_enabled +==================================== + +See Documentation/trace/ftrace.rst. + + +hardlockup_all_cpu_backtrace +============================ + +This value controls the hard lockup detector behavior when a hard +lockup condition is detected as to whether or not to gather further +debug information. If enabled, arch-specific all-CPU stack dumping +will be initiated. + += ============================================ +0 Do nothing. This is the default behavior. +1 On detection capture more debug information. += ============================================ + + +hardlockup_panic +================ + +This parameter can be used to control whether the kernel panics +when a hard lockup is detected. + += =========================== +0 Don't panic on hard lockup. +1 Panic on hard lockup. += =========================== + +See Documentation/admin-guide/lockup-watchdogs.rst for more information. +This can also be set using the nmi_watchdog kernel parameter. + + +hotplug +======= + +Path for the hotplug policy agent. +Default value is ``CONFIG_UEVENT_HELPER_PATH``, which in turn defaults +to the empty string. + +This file only exists when ``CONFIG_UEVENT_HELPER`` is enabled. Most +modern systems rely exclusively on the netlink-based uevent source and +don't need this. + + +hung_task_all_cpu_backtrace +=========================== + +If this option is set, the kernel will send an NMI to all CPUs to dump +their backtraces when a hung task is detected. This file shows up if +CONFIG_DETECT_HUNG_TASK and CONFIG_SMP are enabled. + +0: Won't show all CPUs backtraces when a hung task is detected. +This is the default behavior. + +1: Will non-maskably interrupt all CPUs and dump their backtraces when +a hung task is detected. + + +hung_task_panic +=============== + +Controls the kernel's behavior when a hung task is detected. +This file shows up if ``CONFIG_DETECT_HUNG_TASK`` is enabled. + += ================================================= +0 Continue operation. This is the default behavior. +1 Panic immediately. += ================================================= + + +hung_task_check_count +===================== + +The upper bound on the number of tasks that are checked. +This file shows up if ``CONFIG_DETECT_HUNG_TASK`` is enabled. + + +hung_task_timeout_secs +====================== + +When a task in D state did not get scheduled +for more than this value report a warning. +This file shows up if ``CONFIG_DETECT_HUNG_TASK`` is enabled. + +0 means infinite timeout, no checking is done. + +Possible values to set are in range {0:``LONG_MAX``/``HZ``}. + + +hung_task_check_interval_secs +============================= + +Hung task check interval. If hung task checking is enabled +(see `hung_task_timeout_secs`_), the check is done every +``hung_task_check_interval_secs`` seconds. +This file shows up if ``CONFIG_DETECT_HUNG_TASK`` is enabled. + +0 (default) means use ``hung_task_timeout_secs`` as checking +interval. + +Possible values to set are in range {0:``LONG_MAX``/``HZ``}. + + +hung_task_warnings +================== + +The maximum number of warnings to report. During a check interval +if a hung task is detected, this value is decreased by 1. +When this value reaches 0, no more warnings will be reported. +This file shows up if ``CONFIG_DETECT_HUNG_TASK`` is enabled. + +-1: report an infinite number of warnings. + + +hyperv_record_panic_msg +======================= + +Controls whether the panic kmsg data should be reported to Hyper-V. + += ========================================================= +0 Do not report panic kmsg data. +1 Report the panic kmsg data. This is the default behavior. += ========================================================= + + +ignore-unaligned-usertrap +========================= + +On architectures where unaligned accesses cause traps, and where this +feature is supported (``CONFIG_SYSCTL_ARCH_UNALIGN_NO_WARN``; +currently, ``arc`` and ``ia64``), controls whether all unaligned traps +are logged. + += ============================================================= +0 Log all unaligned accesses. +1 Only warn the first time a process traps. This is the default + setting. += ============================================================= + +See also `unaligned-trap`_ and `unaligned-dump-stack`_. On ``ia64``, +this allows system administrators to override the +``IA64_THREAD_UAC_NOPRINT`` ``prctl`` and avoid logs being flooded. + + +kexec_load_disabled +=================== + +A toggle indicating if the ``kexec_load`` syscall has been disabled. +This value defaults to 0 (false: ``kexec_load`` enabled), but can be +set to 1 (true: ``kexec_load`` disabled). +Once true, kexec can no longer be used, and the toggle cannot be set +back to false. +This allows a kexec image to be loaded before disabling the syscall, +allowing a system to set up (and later use) an image without it being +altered. +Generally used together with the `modules_disabled`_ sysctl. + + +kptr_restrict +============= + +This toggle indicates whether restrictions are placed on +exposing kernel addresses via ``/proc`` and other interfaces. + +When ``kptr_restrict`` is set to 0 (the default) the address is hashed +before printing. +(This is the equivalent to %p.) + +When ``kptr_restrict`` is set to 1, kernel pointers printed using the +%pK format specifier will be replaced with 0s unless the user has +``CAP_SYSLOG`` and effective user and group ids are equal to the real +ids. +This is because %pK checks are done at read() time rather than open() +time, so if permissions are elevated between the open() and the read() +(e.g via a setuid binary) then %pK will not leak kernel pointers to +unprivileged users. +Note, this is a temporary solution only. +The correct long-term solution is to do the permission checks at +open() time. +Consider removing world read permissions from files that use %pK, and +using `dmesg_restrict`_ to protect against uses of %pK in ``dmesg(8)`` +if leaking kernel pointer values to unprivileged users is a concern. + +When ``kptr_restrict`` is set to 2, kernel pointers printed using +%pK will be replaced with 0s regardless of privileges. + + +modprobe +======== + +The full path to the usermode helper for autoloading kernel modules, +by default ``CONFIG_MODPROBE_PATH``, which in turn defaults to +"/sbin/modprobe". This binary is executed when the kernel requests a +module. For example, if userspace passes an unknown filesystem type +to mount(), then the kernel will automatically request the +corresponding filesystem module by executing this usermode helper. +This usermode helper should insert the needed module into the kernel. + +This sysctl only affects module autoloading. It has no effect on the +ability to explicitly insert modules. + +This sysctl can be used to debug module loading requests:: + + echo '#! /bin/sh' > /tmp/modprobe + echo 'echo "$@" >> /tmp/modprobe.log' >> /tmp/modprobe + echo 'exec /sbin/modprobe "$@"' >> /tmp/modprobe + chmod a+x /tmp/modprobe + echo /tmp/modprobe > /proc/sys/kernel/modprobe + +Alternatively, if this sysctl is set to the empty string, then module +autoloading is completely disabled. The kernel will not try to +execute a usermode helper at all, nor will it call the +kernel_module_request LSM hook. + +If CONFIG_STATIC_USERMODEHELPER=y is set in the kernel configuration, +then the configured static usermode helper overrides this sysctl, +except that the empty string is still accepted to completely disable +module autoloading as described above. + +modules_disabled +================ + +A toggle value indicating if modules are allowed to be loaded +in an otherwise modular kernel. This toggle defaults to off +(0), but can be set true (1). Once true, modules can be +neither loaded nor unloaded, and the toggle cannot be set back +to false. Generally used with the `kexec_load_disabled`_ toggle. + + +.. _msgmni: + +msgmax, msgmnb, and msgmni +========================== + +``msgmax`` is the maximum size of an IPC message, in bytes. 8192 by +default (``MSGMAX``). + +``msgmnb`` is the maximum size of an IPC queue, in bytes. 16384 by +default (``MSGMNB``). + +``msgmni`` is the maximum number of IPC queues. 32000 by default +(``MSGMNI``). + + +msg_next_id, sem_next_id, and shm_next_id (System V IPC) +======================================================== + +These three toggles allows to specify desired id for next allocated IPC +object: message, semaphore or shared memory respectively. + +By default they are equal to -1, which means generic allocation logic. +Possible values to set are in range {0:``INT_MAX``}. + +Notes: + 1) kernel doesn't guarantee, that new object will have desired id. So, + it's up to userspace, how to handle an object with "wrong" id. + 2) Toggle with non-default value will be set back to -1 by kernel after + successful IPC object allocation. If an IPC object allocation syscall + fails, it is undefined if the value remains unmodified or is reset to -1. + + +ngroups_max +=========== + +Maximum number of supplementary groups, _i.e._ the maximum size which +``setgroups`` will accept. Exports ``NGROUPS_MAX`` from the kernel. + + + +nmi_watchdog +============ + +This parameter can be used to control the NMI watchdog +(i.e. the hard lockup detector) on x86 systems. + += ================================= +0 Disable the hard lockup detector. +1 Enable the hard lockup detector. += ================================= + +The hard lockup detector monitors each CPU for its ability to respond to +timer interrupts. The mechanism utilizes CPU performance counter registers +that are programmed to generate Non-Maskable Interrupts (NMIs) periodically +while a CPU is busy. Hence, the alternative name 'NMI watchdog'. + +The NMI watchdog is disabled by default if the kernel is running as a guest +in a KVM virtual machine. This default can be overridden by adding:: + + nmi_watchdog=1 + +to the guest kernel command line (see +Documentation/admin-guide/kernel-parameters.rst). + + +nmi_wd_lpm_factor (PPC only) +============================ + +Factor to apply to the NMI watchdog timeout (only when ``nmi_watchdog`` is +set to 1). This factor represents the percentage added to +``watchdog_thresh`` when calculating the NMI watchdog timeout during an +LPM. The soft lockup timeout is not impacted. + +A value of 0 means no change. The default value is 200 meaning the NMI +watchdog is set to 30s (based on ``watchdog_thresh`` equal to 10). + + +numa_balancing +============== + +Enables/disables and configures automatic page fault based NUMA memory +balancing. Memory is moved automatically to nodes that access it often. +The value to set can be the result of ORing the following: + += ================================= +0 NUMA_BALANCING_DISABLED +1 NUMA_BALANCING_NORMAL +2 NUMA_BALANCING_MEMORY_TIERING += ================================= + +Or NUMA_BALANCING_NORMAL to optimize page placement among different +NUMA nodes to reduce remote accessing. On NUMA machines, there is a +performance penalty if remote memory is accessed by a CPU. When this +feature is enabled the kernel samples what task thread is accessing +memory by periodically unmapping pages and later trapping a page +fault. At the time of the page fault, it is determined if the data +being accessed should be migrated to a local memory node. + +The unmapping of pages and trapping faults incur additional overhead that +ideally is offset by improved memory locality but there is no universal +guarantee. If the target workload is already bound to NUMA nodes then this +feature should be disabled. + +Or NUMA_BALANCING_MEMORY_TIERING to optimize page placement among +different types of memory (represented as different NUMA nodes) to +place the hot pages in the fast memory. This is implemented based on +unmapping and page fault too. + +numa_balancing_promote_rate_limit_MBps +====================================== + +Too high promotion/demotion throughput between different memory types +may hurt application latency. This can be used to rate limit the +promotion throughput. The per-node max promotion throughput in MB/s +will be limited to be no more than the set value. + +A rule of thumb is to set this to less than 1/10 of the PMEM node +write bandwidth. + +oops_all_cpu_backtrace +====================== + +If this option is set, the kernel will send an NMI to all CPUs to dump +their backtraces when an oops event occurs. It should be used as a last +resort in case a panic cannot be triggered (to protect VMs running, for +example) or kdump can't be collected. This file shows up if CONFIG_SMP +is enabled. + +0: Won't show all CPUs backtraces when an oops is detected. +This is the default behavior. + +1: Will non-maskably interrupt all CPUs and dump their backtraces when +an oops event is detected. + + +oops_limit +========== + +Number of kernel oopses after which the kernel should panic when +``panic_on_oops`` is not set. Setting this to 0 disables checking +the count. Setting this to 1 has the same effect as setting +``panic_on_oops=1``. The default value is 10000. + + +osrelease, ostype & version +=========================== + +:: + + # cat osrelease + 2.1.88 + # cat ostype + Linux + # cat version + #5 Wed Feb 25 21:49:24 MET 1998 + +The files ``osrelease`` and ``ostype`` should be clear enough. +``version`` +needs a little more clarification however. The '#5' means that +this is the fifth kernel built from this source base and the +date behind it indicates the time the kernel was built. +The only way to tune these values is to rebuild the kernel :-) + + +overflowgid & overflowuid +========================= + +if your architecture did not always support 32-bit UIDs (i.e. arm, +i386, m68k, sh, and sparc32), a fixed UID and GID will be returned to +applications that use the old 16-bit UID/GID system calls, if the +actual UID or GID would exceed 65535. + +These sysctls allow you to change the value of the fixed UID and GID. +The default is 65534. + + +panic +===== + +The value in this file determines the behaviour of the kernel on a +panic: + +* if zero, the kernel will loop forever; +* if negative, the kernel will reboot immediately; +* if positive, the kernel will reboot after the corresponding number + of seconds. + +When you use the software watchdog, the recommended setting is 60. + + +panic_on_io_nmi +=============== + +Controls the kernel's behavior when a CPU receives an NMI caused by +an IO error. + += ================================================================== +0 Try to continue operation (default). +1 Panic immediately. The IO error triggered an NMI. This indicates a + serious system condition which could result in IO data corruption. + Rather than continuing, panicking might be a better choice. Some + servers issue this sort of NMI when the dump button is pushed, + and you can use this option to take a crash dump. += ================================================================== + + +panic_on_oops +============= + +Controls the kernel's behaviour when an oops or BUG is encountered. + += =================================================================== +0 Try to continue operation. +1 Panic immediately. If the `panic` sysctl is also non-zero then the + machine will be rebooted. += =================================================================== + + +panic_on_stackoverflow +====================== + +Controls the kernel's behavior when detecting the overflows of +kernel, IRQ and exception stacks except a user stack. +This file shows up if ``CONFIG_DEBUG_STACKOVERFLOW`` is enabled. + += ========================== +0 Try to continue operation. +1 Panic immediately. += ========================== + + +panic_on_unrecovered_nmi +======================== + +The default Linux behaviour on an NMI of either memory or unknown is +to continue operation. For many environments such as scientific +computing it is preferable that the box is taken out and the error +dealt with than an uncorrected parity/ECC error get propagated. + +A small number of systems do generate NMIs for bizarre random reasons +such as power management so the default is off. That sysctl works like +the existing panic controls already in that directory. + + +panic_on_warn +============= + +Calls panic() in the WARN() path when set to 1. This is useful to avoid +a kernel rebuild when attempting to kdump at the location of a WARN(). + += ================================================ +0 Only WARN(), default behaviour. +1 Call panic() after printing out WARN() location. += ================================================ + + +panic_print +=========== + +Bitmask for printing system info when panic happens. User can chose +combination of the following bits: + +===== ============================================ +bit 0 print all tasks info +bit 1 print system memory info +bit 2 print timer info +bit 3 print locks info if ``CONFIG_LOCKDEP`` is on +bit 4 print ftrace buffer +bit 5 print all printk messages in buffer +bit 6 print all CPUs backtrace (if available in the arch) +===== ============================================ + +So for example to print tasks and memory info on panic, user can:: + + echo 3 > /proc/sys/kernel/panic_print + + +panic_on_rcu_stall +================== + +When set to 1, calls panic() after RCU stall detection messages. This +is useful to define the root cause of RCU stalls using a vmcore. + += ============================================================ +0 Do not panic() when RCU stall takes place, default behavior. +1 panic() after printing RCU stall messages. += ============================================================ + +max_rcu_stall_to_panic +====================== + +When ``panic_on_rcu_stall`` is set to 1, this value determines the +number of times that RCU can stall before panic() is called. + +When ``panic_on_rcu_stall`` is set to 0, this value is has no effect. + +perf_cpu_time_max_percent +========================= + +Hints to the kernel how much CPU time it should be allowed to +use to handle perf sampling events. If the perf subsystem +is informed that its samples are exceeding this limit, it +will drop its sampling frequency to attempt to reduce its CPU +usage. + +Some perf sampling happens in NMIs. If these samples +unexpectedly take too long to execute, the NMIs can become +stacked up next to each other so much that nothing else is +allowed to execute. + +===== ======================================================== +0 Disable the mechanism. Do not monitor or correct perf's + sampling rate no matter how CPU time it takes. + +1-100 Attempt to throttle perf's sample rate to this + percentage of CPU. Note: the kernel calculates an + "expected" length of each sample event. 100 here means + 100% of that expected length. Even if this is set to + 100, you may still see sample throttling if this + length is exceeded. Set to 0 if you truly do not care + how much CPU is consumed. +===== ======================================================== + + +perf_event_paranoid +=================== + +Controls use of the performance events system by unprivileged +users (without CAP_PERFMON). The default value is 2. + +For backward compatibility reasons access to system performance +monitoring and observability remains open for CAP_SYS_ADMIN +privileged processes but CAP_SYS_ADMIN usage for secure system +performance monitoring and observability operations is discouraged +with respect to CAP_PERFMON use cases. + +=== ================================================================== + -1 Allow use of (almost) all events by all users. + + Ignore mlock limit after perf_event_mlock_kb without + ``CAP_IPC_LOCK``. + +>=0 Disallow ftrace function tracepoint by users without + ``CAP_PERFMON``. + + Disallow raw tracepoint access by users without ``CAP_PERFMON``. + +>=1 Disallow CPU event access by users without ``CAP_PERFMON``. + +>=2 Disallow kernel profiling by users without ``CAP_PERFMON``. +=== ================================================================== + + +perf_event_max_stack +==================== + +Controls maximum number of stack frames to copy for (``attr.sample_type & +PERF_SAMPLE_CALLCHAIN``) configured events, for instance, when using +'``perf record -g``' or '``perf trace --call-graph fp``'. + +This can only be done when no events are in use that have callchains +enabled, otherwise writing to this file will return ``-EBUSY``. + +The default value is 127. + + +perf_event_mlock_kb +=================== + +Control size of per-cpu ring buffer not counted against mlock limit. + +The default value is 512 + 1 page + + +perf_event_max_contexts_per_stack +================================= + +Controls maximum number of stack frame context entries for +(``attr.sample_type & PERF_SAMPLE_CALLCHAIN``) configured events, for +instance, when using '``perf record -g``' or '``perf trace --call-graph fp``'. + +This can only be done when no events are in use that have callchains +enabled, otherwise writing to this file will return ``-EBUSY``. + +The default value is 8. + + +perf_user_access (arm64 only) +================================= + +Controls user space access for reading perf event counters. When set to 1, +user space can read performance monitor counter registers directly. + +The default value is 0 (access disabled). + +See Documentation/arm64/perf.rst for more information. + + +pid_max +======= + +PID allocation wrap value. When the kernel's next PID value +reaches this value, it wraps back to a minimum PID value. +PIDs of value ``pid_max`` or larger are not allocated. + + +ns_last_pid +=========== + +The last pid allocated in the current (the one task using this sysctl +lives in) pid namespace. When selecting a pid for a next task on fork +kernel tries to allocate a number starting from this one. + + +powersave-nap (PPC only) +======================== + +If set, Linux-PPC will use the 'nap' mode of powersaving, +otherwise the 'doze' mode will be used. + + +============================================================== + +printk +====== + +The four values in printk denote: ``console_loglevel``, +``default_message_loglevel``, ``minimum_console_loglevel`` and +``default_console_loglevel`` respectively. + +These values influence printk() behavior when printing or +logging error messages. See '``man 2 syslog``' for more info on +the different loglevels. + +======================== ===================================== +console_loglevel messages with a higher priority than + this will be printed to the console +default_message_loglevel messages without an explicit priority + will be printed with this priority +minimum_console_loglevel minimum (highest) value to which + console_loglevel can be set +default_console_loglevel default value for console_loglevel +======================== ===================================== + + +printk_delay +============ + +Delay each printk message in ``printk_delay`` milliseconds + +Value from 0 - 10000 is allowed. + + +printk_ratelimit +================ + +Some warning messages are rate limited. ``printk_ratelimit`` specifies +the minimum length of time between these messages (in seconds). +The default value is 5 seconds. + +A value of 0 will disable rate limiting. + + +printk_ratelimit_burst +====================== + +While long term we enforce one message per `printk_ratelimit`_ +seconds, we do allow a burst of messages to pass through. +``printk_ratelimit_burst`` specifies the number of messages we can +send before ratelimiting kicks in. + +The default value is 10 messages. + + +printk_devkmsg +============== + +Control the logging to ``/dev/kmsg`` from userspace: + +========= ============================================= +ratelimit default, ratelimited +on unlimited logging to /dev/kmsg from userspace +off logging to /dev/kmsg disabled +========= ============================================= + +The kernel command line parameter ``printk.devkmsg=`` overrides this and is +a one-time setting until next reboot: once set, it cannot be changed by +this sysctl interface anymore. + +============================================================== + + +pty +=== + +See Documentation/filesystems/devpts.rst. + + +random +====== + +This is a directory, with the following entries: + +* ``boot_id``: a UUID generated the first time this is retrieved, and + unvarying after that; + +* ``uuid``: a UUID generated every time this is retrieved (this can + thus be used to generate UUIDs at will); + +* ``entropy_avail``: the pool's entropy count, in bits; + +* ``poolsize``: the entropy pool size, in bits; + +* ``urandom_min_reseed_secs``: obsolete (used to determine the minimum + number of seconds between urandom pool reseeding). This file is + writable for compatibility purposes, but writing to it has no effect + on any RNG behavior; + +* ``write_wakeup_threshold``: when the entropy count drops below this + (as a number of bits), processes waiting to write to ``/dev/random`` + are woken up. This file is writable for compatibility purposes, but + writing to it has no effect on any RNG behavior. + + +randomize_va_space +================== + +This option can be used to select the type of process address +space randomization that is used in the system, for architectures +that support this feature. + +== =========================================================================== +0 Turn the process address space randomization off. This is the + default for architectures that do not support this feature anyways, + and kernels that are booted with the "norandmaps" parameter. + +1 Make the addresses of mmap base, stack and VDSO page randomized. + This, among other things, implies that shared libraries will be + loaded to random addresses. Also for PIE-linked binaries, the + location of code start is randomized. This is the default if the + ``CONFIG_COMPAT_BRK`` option is enabled. + +2 Additionally enable heap randomization. This is the default if + ``CONFIG_COMPAT_BRK`` is disabled. + + There are a few legacy applications out there (such as some ancient + versions of libc.so.5 from 1996) that assume that brk area starts + just after the end of the code+bss. These applications break when + start of the brk area is randomized. There are however no known + non-legacy applications that would be broken this way, so for most + systems it is safe to choose full randomization. + + Systems with ancient and/or broken binaries should be configured + with ``CONFIG_COMPAT_BRK`` enabled, which excludes the heap from process + address space randomization. +== =========================================================================== + + +real-root-dev +============= + +See Documentation/admin-guide/initrd.rst. + + +reboot-cmd (SPARC only) +======================= + +??? This seems to be a way to give an argument to the Sparc +ROM/Flash boot loader. Maybe to tell it what to do after +rebooting. ??? + + +sched_energy_aware +================== + +Enables/disables Energy Aware Scheduling (EAS). EAS starts +automatically on platforms where it can run (that is, +platforms with asymmetric CPU topologies and having an Energy +Model available). If your platform happens to meet the +requirements for EAS but you do not want to use it, change +this value to 0. + +task_delayacct +=============== + +Enables/disables task delay accounting (see +Documentation/accounting/delay-accounting.rst. Enabling this feature incurs +a small amount of overhead in the scheduler but is useful for debugging +and performance tuning. It is required by some tools such as iotop. + +sched_schedstats +================ + +Enables/disables scheduler statistics. Enabling this feature +incurs a small amount of overhead in the scheduler but is +useful for debugging and performance tuning. + +sched_util_clamp_min +==================== + +Max allowed *minimum* utilization. + +Default value is 1024, which is the maximum possible value. + +It means that any requested uclamp.min value cannot be greater than +sched_util_clamp_min, i.e., it is restricted to the range +[0:sched_util_clamp_min]. + +sched_util_clamp_max +==================== + +Max allowed *maximum* utilization. + +Default value is 1024, which is the maximum possible value. + +It means that any requested uclamp.max value cannot be greater than +sched_util_clamp_max, i.e., it is restricted to the range +[0:sched_util_clamp_max]. + +sched_util_clamp_min_rt_default +=============================== + +By default Linux is tuned for performance. Which means that RT tasks always run +at the highest frequency and most capable (highest capacity) CPU (in +heterogeneous systems). + +Uclamp achieves this by setting the requested uclamp.min of all RT tasks to +1024 by default, which effectively boosts the tasks to run at the highest +frequency and biases them to run on the biggest CPU. + +This knob allows admins to change the default behavior when uclamp is being +used. In battery powered devices particularly, running at the maximum +capacity and frequency will increase energy consumption and shorten the battery +life. + +This knob is only effective for RT tasks which the user hasn't modified their +requested uclamp.min value via sched_setattr() syscall. + +This knob will not escape the range constraint imposed by sched_util_clamp_min +defined above. + +For example if + + sched_util_clamp_min_rt_default = 800 + sched_util_clamp_min = 600 + +Then the boost will be clamped to 600 because 800 is outside of the permissible +range of [0:600]. This could happen for instance if a powersave mode will +restrict all boosts temporarily by modifying sched_util_clamp_min. As soon as +this restriction is lifted, the requested sched_util_clamp_min_rt_default +will take effect. + +seccomp +======= + +See Documentation/userspace-api/seccomp_filter.rst. + + +sg-big-buff +=========== + +This file shows the size of the generic SCSI (sg) buffer. +You can't tune it just yet, but you could change it on +compile time by editing ``include/scsi/sg.h`` and changing +the value of ``SG_BIG_BUFF``. + +There shouldn't be any reason to change this value. If +you can come up with one, you probably know what you +are doing anyway :) + + +shmall +====== + +This parameter sets the total amount of shared memory pages that +can be used system wide. Hence, ``shmall`` should always be at least +``ceil(shmmax/PAGE_SIZE)``. + +If you are not sure what the default ``PAGE_SIZE`` is on your Linux +system, you can run the following command:: + + # getconf PAGE_SIZE + + +shmmax +====== + +This value can be used to query and set the run time limit +on the maximum shared memory segment size that can be created. +Shared memory segments up to 1Gb are now supported in the +kernel. This value defaults to ``SHMMAX``. + + +shmmni +====== + +This value determines the maximum number of shared memory segments. +4096 by default (``SHMMNI``). + + +shm_rmid_forced +=============== + +Linux lets you set resource limits, including how much memory one +process can consume, via ``setrlimit(2)``. Unfortunately, shared memory +segments are allowed to exist without association with any process, and +thus might not be counted against any resource limits. If enabled, +shared memory segments are automatically destroyed when their attach +count becomes zero after a detach or a process termination. It will +also destroy segments that were created, but never attached to, on exit +from the process. The only use left for ``IPC_RMID`` is to immediately +destroy an unattached segment. Of course, this breaks the way things are +defined, so some applications might stop working. Note that this +feature will do you no good unless you also configure your resource +limits (in particular, ``RLIMIT_AS`` and ``RLIMIT_NPROC``). Most systems don't +need this. + +Note that if you change this from 0 to 1, already created segments +without users and with a dead originative process will be destroyed. + + +sysctl_writes_strict +==================== + +Control how file position affects the behavior of updating sysctl values +via the ``/proc/sys`` interface: + + == ====================================================================== + -1 Legacy per-write sysctl value handling, with no printk warnings. + Each write syscall must fully contain the sysctl value to be + written, and multiple writes on the same sysctl file descriptor + will rewrite the sysctl value, regardless of file position. + 0 Same behavior as above, but warn about processes that perform writes + to a sysctl file descriptor when the file position is not 0. + 1 (default) Respect file position when writing sysctl strings. Multiple + writes will append to the sysctl value buffer. Anything past the max + length of the sysctl value buffer will be ignored. Writes to numeric + sysctl entries must always be at file position 0 and the value must + be fully contained in the buffer sent in the write syscall. + == ====================================================================== + + +softlockup_all_cpu_backtrace +============================ + +This value controls the soft lockup detector thread's behavior +when a soft lockup condition is detected as to whether or not +to gather further debug information. If enabled, each cpu will +be issued an NMI and instructed to capture stack trace. + +This feature is only applicable for architectures which support +NMI. + += ============================================ +0 Do nothing. This is the default behavior. +1 On detection capture more debug information. += ============================================ + + +softlockup_panic +================= + +This parameter can be used to control whether the kernel panics +when a soft lockup is detected. + += ============================================ +0 Don't panic on soft lockup. +1 Panic on soft lockup. += ============================================ + +This can also be set using the softlockup_panic kernel parameter. + + +soft_watchdog +============= + +This parameter can be used to control the soft lockup detector. + += ================================= +0 Disable the soft lockup detector. +1 Enable the soft lockup detector. += ================================= + +The soft lockup detector monitors CPUs for threads that are hogging the CPUs +without rescheduling voluntarily, and thus prevent the 'migration/N' threads +from running, causing the watchdog work fail to execute. The mechanism depends +on the CPUs ability to respond to timer interrupts which are needed for the +watchdog work to be queued by the watchdog timer function, otherwise the NMI +watchdog — if enabled — can detect a hard lockup condition. + + +split_lock_mitigate (x86 only) +============================== + +On x86, each "split lock" imposes a system-wide performance penalty. On larger +systems, large numbers of split locks from unprivileged users can result in +denials of service to well-behaved and potentially more important users. + +The kernel mitigates these bad users by detecting split locks and imposing +penalties: forcing them to wait and only allowing one core to execute split +locks at a time. + +These mitigations can make those bad applications unbearably slow. Setting +split_lock_mitigate=0 may restore some application performance, but will also +increase system exposure to denial of service attacks from split lock users. + += =================================================================== +0 Disable the mitigation mode - just warns the split lock on kernel log + and exposes the system to denials of service from the split lockers. +1 Enable the mitigation mode (this is the default) - penalizes the split + lockers with intentional performance degradation. += =================================================================== + + +stack_erasing +============= + +This parameter can be used to control kernel stack erasing at the end +of syscalls for kernels built with ``CONFIG_GCC_PLUGIN_STACKLEAK``. + +That erasing reduces the information which kernel stack leak bugs +can reveal and blocks some uninitialized stack variable attacks. +The tradeoff is the performance impact: on a single CPU system kernel +compilation sees a 1% slowdown, other systems and workloads may vary. + += ==================================================================== +0 Kernel stack erasing is disabled, STACKLEAK_METRICS are not updated. +1 Kernel stack erasing is enabled (default), it is performed before + returning to the userspace at the end of syscalls. += ==================================================================== + + +stop-a (SPARC only) +=================== + +Controls Stop-A: + += ==================================== +0 Stop-A has no effect. +1 Stop-A breaks to the PROM (default). += ==================================== + +Stop-A is always enabled on a panic, so that the user can return to +the boot PROM. + + +sysrq +===== + +See Documentation/admin-guide/sysrq.rst. + + +tainted +======= + +Non-zero if the kernel has been tainted. Numeric values, which can be +ORed together. The letters are seen in "Tainted" line of Oops reports. + +====== ===== ============================================================== + 1 `(P)` proprietary module was loaded + 2 `(F)` module was force loaded + 4 `(S)` kernel running on an out of specification system + 8 `(R)` module was force unloaded + 16 `(M)` processor reported a Machine Check Exception (MCE) + 32 `(B)` bad page referenced or some unexpected page flags + 64 `(U)` taint requested by userspace application + 128 `(D)` kernel died recently, i.e. there was an OOPS or BUG + 256 `(A)` an ACPI table was overridden by user + 512 `(W)` kernel issued warning + 1024 `(C)` staging driver was loaded + 2048 `(I)` workaround for bug in platform firmware applied + 4096 `(O)` externally-built ("out-of-tree") module was loaded + 8192 `(E)` unsigned module was loaded + 16384 `(L)` soft lockup occurred + 32768 `(K)` kernel has been live patched + 65536 `(X)` Auxiliary taint, defined and used by for distros +131072 `(T)` The kernel was built with the struct randomization plugin +====== ===== ============================================================== + +See Documentation/admin-guide/tainted-kernels.rst for more information. + +Note: + writes to this sysctl interface will fail with ``EINVAL`` if the kernel is + booted with the command line option ``panic_on_taint=,nousertaint`` + and any of the ORed together values being written to ``tainted`` match with + the bitmask declared on panic_on_taint. + See Documentation/admin-guide/kernel-parameters.rst for more details on + that particular kernel command line option and its optional + ``nousertaint`` switch. + +threads-max +=========== + +This value controls the maximum number of threads that can be created +using ``fork()``. + +During initialization the kernel sets this value such that even if the +maximum number of threads is created, the thread structures occupy only +a part (1/8th) of the available RAM pages. + +The minimum value that can be written to ``threads-max`` is 1. + +The maximum value that can be written to ``threads-max`` is given by the +constant ``FUTEX_TID_MASK`` (0x3fffffff). + +If a value outside of this range is written to ``threads-max`` an +``EINVAL`` error occurs. + + +traceoff_on_warning +=================== + +When set, disables tracing (see Documentation/trace/ftrace.rst) when a +``WARN()`` is hit. + + +tracepoint_printk +================= + +When tracepoints are sent to printk() (enabled by the ``tp_printk`` +boot parameter), this entry provides runtime control:: + + echo 0 > /proc/sys/kernel/tracepoint_printk + +will stop tracepoints from being sent to printk(), and:: + + echo 1 > /proc/sys/kernel/tracepoint_printk + +will send them to printk() again. + +This only works if the kernel was booted with ``tp_printk`` enabled. + +See Documentation/admin-guide/kernel-parameters.rst and +Documentation/trace/boottime-trace.rst. + + +.. _unaligned-dump-stack: + +unaligned-dump-stack (ia64) +=========================== + +When logging unaligned accesses, controls whether the stack is +dumped. + += =================================================== +0 Do not dump the stack. This is the default setting. +1 Dump the stack. += =================================================== + +See also `ignore-unaligned-usertrap`_. + + +unaligned-trap +============== + +On architectures where unaligned accesses cause traps, and where this +feature is supported (``CONFIG_SYSCTL_ARCH_UNALIGN_ALLOW``; currently, +``arc`` and ``parisc``), controls whether unaligned traps are caught +and emulated (instead of failing). + += ======================================================== +0 Do not emulate unaligned accesses. +1 Emulate unaligned accesses. This is the default setting. += ======================================================== + +See also `ignore-unaligned-usertrap`_. + + +unknown_nmi_panic +================= + +The value in this file affects behavior of handling NMI. When the +value is non-zero, unknown NMI is trapped and then panic occurs. At +that time, kernel debugging information is displayed on console. + +NMI switch that most IA32 servers have fires unknown NMI up, for +example. If a system hangs up, try pressing the NMI switch. + + +unprivileged_bpf_disabled +========================= + +Writing 1 to this entry will disable unprivileged calls to ``bpf()``; +once disabled, calling ``bpf()`` without ``CAP_SYS_ADMIN`` or ``CAP_BPF`` +will return ``-EPERM``. Once set to 1, this can't be cleared from the +running kernel anymore. + +Writing 2 to this entry will also disable unprivileged calls to ``bpf()``, +however, an admin can still change this setting later on, if needed, by +writing 0 or 1 to this entry. + +If ``BPF_UNPRIV_DEFAULT_OFF`` is enabled in the kernel config, then this +entry will default to 2 instead of 0. + += ============================================================= +0 Unprivileged calls to ``bpf()`` are enabled +1 Unprivileged calls to ``bpf()`` are disabled without recovery +2 Unprivileged calls to ``bpf()`` are disabled += ============================================================= + + +warn_limit +========== + +Number of kernel warnings after which the kernel should panic when +``panic_on_warn`` is not set. Setting this to 0 disables checking +the warning count. Setting this to 1 has the same effect as setting +``panic_on_warn=1``. The default value is 0. + + +watchdog +======== + +This parameter can be used to disable or enable the soft lockup detector +*and* the NMI watchdog (i.e. the hard lockup detector) at the same time. + += ============================== +0 Disable both lockup detectors. +1 Enable both lockup detectors. += ============================== + +The soft lockup detector and the NMI watchdog can also be disabled or +enabled individually, using the ``soft_watchdog`` and ``nmi_watchdog`` +parameters. +If the ``watchdog`` parameter is read, for example by executing:: + + cat /proc/sys/kernel/watchdog + +the output of this command (0 or 1) shows the logical OR of +``soft_watchdog`` and ``nmi_watchdog``. + + +watchdog_cpumask +================ + +This value can be used to control on which cpus the watchdog may run. +The default cpumask is all possible cores, but if ``NO_HZ_FULL`` is +enabled in the kernel config, and cores are specified with the +``nohz_full=`` boot argument, those cores are excluded by default. +Offline cores can be included in this mask, and if the core is later +brought online, the watchdog will be started based on the mask value. + +Typically this value would only be touched in the ``nohz_full`` case +to re-enable cores that by default were not running the watchdog, +if a kernel lockup was suspected on those cores. + +The argument value is the standard cpulist format for cpumasks, +so for example to enable the watchdog on cores 0, 2, 3, and 4 you +might say:: + + echo 0,2-4 > /proc/sys/kernel/watchdog_cpumask + + +watchdog_thresh +=============== + +This value can be used to control the frequency of hrtimer and NMI +events and the soft and hard lockup thresholds. The default threshold +is 10 seconds. + +The softlockup threshold is (``2 * watchdog_thresh``). Setting this +tunable to zero will disable lockup detection altogether. diff --git a/Documentation/admin-guide/sysctl/net.rst b/Documentation/admin-guide/sysctl/net.rst new file mode 100644 index 000000000..6394f5dc2 --- /dev/null +++ b/Documentation/admin-guide/sysctl/net.rst @@ -0,0 +1,499 @@ +================================ +Documentation for /proc/sys/net/ +================================ + +Copyright + +Copyright (c) 1999 + + - Terrehon Bowden + - Bodo Bauer + +Copyright (c) 2000 + + - Jorge Nerin + +Copyright (c) 2009 + + - Shen Feng + +For general info and legal blurb, please look in index.rst. + +------------------------------------------------------------------------------ + +This file contains the documentation for the sysctl files in +/proc/sys/net + +The interface to the networking parts of the kernel is located in +/proc/sys/net. The following table shows all possible subdirectories. You may +see only some of them, depending on your kernel's configuration. + + +Table : Subdirectories in /proc/sys/net + + ========= =================== = ========== =================== + Directory Content Directory Content + ========= =================== = ========== =================== + 802 E802 protocol mptcp Multipath TCP + appletalk Appletalk protocol netfilter Network Filter + ax25 AX25 netrom NET/ROM + bridge Bridging rose X.25 PLP layer + core General parameter tipc TIPC + ethernet Ethernet protocol unix Unix domain sockets + ipv4 IP version 4 x25 X.25 protocol + ipv6 IP version 6 + ========= =================== = ========== =================== + +1. /proc/sys/net/core - Network core options +============================================ + +bpf_jit_enable +-------------- + +This enables the BPF Just in Time (JIT) compiler. BPF is a flexible +and efficient infrastructure allowing to execute bytecode at various +hook points. It is used in a number of Linux kernel subsystems such +as networking (e.g. XDP, tc), tracing (e.g. kprobes, uprobes, tracepoints) +and security (e.g. seccomp). LLVM has a BPF back end that can compile +restricted C into a sequence of BPF instructions. After program load +through bpf(2) and passing a verifier in the kernel, a JIT will then +translate these BPF proglets into native CPU instructions. There are +two flavors of JITs, the newer eBPF JIT currently supported on: + + - x86_64 + - x86_32 + - arm64 + - arm32 + - ppc64 + - ppc32 + - sparc64 + - mips64 + - s390x + - riscv64 + - riscv32 + +And the older cBPF JIT supported on the following archs: + + - mips + - sparc + +eBPF JITs are a superset of cBPF JITs, meaning the kernel will +migrate cBPF instructions into eBPF instructions and then JIT +compile them transparently. Older cBPF JITs can only translate +tcpdump filters, seccomp rules, etc, but not mentioned eBPF +programs loaded through bpf(2). + +Values: + + - 0 - disable the JIT (default value) + - 1 - enable the JIT + - 2 - enable the JIT and ask the compiler to emit traces on kernel log. + +bpf_jit_harden +-------------- + +This enables hardening for the BPF JIT compiler. Supported are eBPF +JIT backends. Enabling hardening trades off performance, but can +mitigate JIT spraying. + +Values: + + - 0 - disable JIT hardening (default value) + - 1 - enable JIT hardening for unprivileged users only + - 2 - enable JIT hardening for all users + +where "privileged user" in this context means a process having +CAP_BPF or CAP_SYS_ADMIN in the root user name space. + +bpf_jit_kallsyms +---------------- + +When BPF JIT compiler is enabled, then compiled images are unknown +addresses to the kernel, meaning they neither show up in traces nor +in /proc/kallsyms. This enables export of these addresses, which can +be used for debugging/tracing. If bpf_jit_harden is enabled, this +feature is disabled. + +Values : + + - 0 - disable JIT kallsyms export (default value) + - 1 - enable JIT kallsyms export for privileged users only + +bpf_jit_limit +------------- + +This enforces a global limit for memory allocations to the BPF JIT +compiler in order to reject unprivileged JIT requests once it has +been surpassed. bpf_jit_limit contains the value of the global limit +in bytes. + +dev_weight +---------- + +The maximum number of packets that kernel can handle on a NAPI interrupt, +it's a Per-CPU variable. For drivers that support LRO or GRO_HW, a hardware +aggregated packet is counted as one packet in this context. + +Default: 64 + +dev_weight_rx_bias +------------------ + +RPS (e.g. RFS, aRFS) processing is competing with the registered NAPI poll function +of the driver for the per softirq cycle netdev_budget. This parameter influences +the proportion of the configured netdev_budget that is spent on RPS based packet +processing during RX softirq cycles. It is further meant for making current +dev_weight adaptable for asymmetric CPU needs on RX/TX side of the network stack. +(see dev_weight_tx_bias) It is effective on a per CPU basis. Determination is based +on dev_weight and is calculated multiplicative (dev_weight * dev_weight_rx_bias). + +Default: 1 + +dev_weight_tx_bias +------------------ + +Scales the maximum number of packets that can be processed during a TX softirq cycle. +Effective on a per CPU basis. Allows scaling of current dev_weight for asymmetric +net stack processing needs. Be careful to avoid making TX softirq processing a CPU hog. + +Calculation is based on dev_weight (dev_weight * dev_weight_tx_bias). + +Default: 1 + +default_qdisc +------------- + +The default queuing discipline to use for network devices. This allows +overriding the default of pfifo_fast with an alternative. Since the default +queuing discipline is created without additional parameters so is best suited +to queuing disciplines that work well without configuration like stochastic +fair queue (sfq), CoDel (codel) or fair queue CoDel (fq_codel). Don't use +queuing disciplines like Hierarchical Token Bucket or Deficit Round Robin +which require setting up classes and bandwidths. Note that physical multiqueue +interfaces still use mq as root qdisc, which in turn uses this default for its +leaves. Virtual devices (like e.g. lo or veth) ignore this setting and instead +default to noqueue. + +Default: pfifo_fast + +busy_read +--------- + +Low latency busy poll timeout for socket reads. (needs CONFIG_NET_RX_BUSY_POLL) +Approximate time in us to busy loop waiting for packets on the device queue. +This sets the default value of the SO_BUSY_POLL socket option. +Can be set or overridden per socket by setting socket option SO_BUSY_POLL, +which is the preferred method of enabling. If you need to enable the feature +globally via sysctl, a value of 50 is recommended. + +Will increase power usage. + +Default: 0 (off) + +busy_poll +---------------- +Low latency busy poll timeout for poll and select. (needs CONFIG_NET_RX_BUSY_POLL) +Approximate time in us to busy loop waiting for events. +Recommended value depends on the number of sockets you poll on. +For several sockets 50, for several hundreds 100. +For more than that you probably want to use epoll. +Note that only sockets with SO_BUSY_POLL set will be busy polled, +so you want to either selectively set SO_BUSY_POLL on those sockets or set +sysctl.net.busy_read globally. + +Will increase power usage. + +Default: 0 (off) + +rmem_default +------------ + +The default setting of the socket receive buffer in bytes. + +rmem_max +-------- + +The maximum receive socket buffer size in bytes. + +tstamp_allow_data +----------------- +Allow processes to receive tx timestamps looped together with the original +packet contents. If disabled, transmit timestamp requests from unprivileged +processes are dropped unless socket option SOF_TIMESTAMPING_OPT_TSONLY is set. + +Default: 1 (on) + + +wmem_default +------------ + +The default setting (in bytes) of the socket send buffer. + +wmem_max +-------- + +The maximum send socket buffer size in bytes. + +message_burst and message_cost +------------------------------ + +These parameters are used to limit the warning messages written to the kernel +log from the networking code. They enforce a rate limit to make a +denial-of-service attack impossible. A higher message_cost factor, results in +fewer messages that will be written. Message_burst controls when messages will +be dropped. The default settings limit warning messages to one every five +seconds. + +warnings +-------- + +This sysctl is now unused. + +This was used to control console messages from the networking stack that +occur because of problems on the network like duplicate address or bad +checksums. + +These messages are now emitted at KERN_DEBUG and can generally be enabled +and controlled by the dynamic_debug facility. + +netdev_budget +------------- + +Maximum number of packets taken from all interfaces in one polling cycle (NAPI +poll). In one polling cycle interfaces which are registered to polling are +probed in a round-robin manner. Also, a polling cycle may not exceed +netdev_budget_usecs microseconds, even if netdev_budget has not been +exhausted. + +netdev_budget_usecs +--------------------- + +Maximum number of microseconds in one NAPI polling cycle. Polling +will exit when either netdev_budget_usecs have elapsed during the +poll cycle or the number of packets processed reaches netdev_budget. + +netdev_max_backlog +------------------ + +Maximum number of packets, queued on the INPUT side, when the interface +receives packets faster than kernel can process them. + +netdev_rss_key +-------------- + +RSS (Receive Side Scaling) enabled drivers use a 40 bytes host key that is +randomly generated. +Some user space might need to gather its content even if drivers do not +provide ethtool -x support yet. + +:: + + myhost:~# cat /proc/sys/net/core/netdev_rss_key + 84:50:f4:00:a8:15:d1:a7:e9:7f:1d:60:35:c7:47:25:42:97:74:ca:56:bb:b6:a1:d8: ... (52 bytes total) + +File contains nul bytes if no driver ever called netdev_rss_key_fill() function. + +Note: + /proc/sys/net/core/netdev_rss_key contains 52 bytes of key, + but most drivers only use 40 bytes of it. + +:: + + myhost:~# ethtool -x eth0 + RX flow hash indirection table for eth0 with 8 RX ring(s): + 0: 0 1 2 3 4 5 6 7 + RSS hash key: + 84:50:f4:00:a8:15:d1:a7:e9:7f:1d:60:35:c7:47:25:42:97:74:ca:56:bb:b6:a1:d8:43:e3:c9:0c:fd:17:55:c2:3a:4d:69:ed:f1:42:89 + +netdev_tstamp_prequeue +---------------------- + +If set to 0, RX packet timestamps can be sampled after RPS processing, when +the target CPU processes packets. It might give some delay on timestamps, but +permit to distribute the load on several cpus. + +If set to 1 (default), timestamps are sampled as soon as possible, before +queueing. + +netdev_unregister_timeout_secs +------------------------------ + +Unregister network device timeout in seconds. +This option controls the timeout (in seconds) used to issue a warning while +waiting for a network device refcount to drop to 0 during device +unregistration. A lower value may be useful during bisection to detect +a leaked reference faster. A larger value may be useful to prevent false +warnings on slow/loaded systems. +Default value is 10, minimum 1, maximum 3600. + +skb_defer_max +------------- + +Max size (in skbs) of the per-cpu list of skbs being freed +by the cpu which allocated them. Used by TCP stack so far. + +Default: 64 + +optmem_max +---------- + +Maximum ancillary buffer size allowed per socket. Ancillary data is a sequence +of struct cmsghdr structures with appended data. + +fb_tunnels_only_for_init_net +---------------------------- + +Controls if fallback tunnels (like tunl0, gre0, gretap0, erspan0, +sit0, ip6tnl0, ip6gre0) are automatically created. There are 3 possibilities +(a) value = 0; respective fallback tunnels are created when module is +loaded in every net namespaces (backward compatible behavior). +(b) value = 1; [kcmd value: initns] respective fallback tunnels are +created only in init net namespace and every other net namespace will +not have them. +(c) value = 2; [kcmd value: none] fallback tunnels are not created +when a module is loaded in any of the net namespace. Setting value to +"2" is pointless after boot if these modules are built-in, so there is +a kernel command-line option that can change this default. Please refer to +Documentation/admin-guide/kernel-parameters.txt for additional details. + +Not creating fallback tunnels gives control to userspace to create +whatever is needed only and avoid creating devices which are redundant. + +Default : 0 (for compatibility reasons) + +devconf_inherit_init_net +------------------------ + +Controls if a new network namespace should inherit all current +settings under /proc/sys/net/{ipv4,ipv6}/conf/{all,default}/. By +default, we keep the current behavior: for IPv4 we inherit all current +settings from init_net and for IPv6 we reset all settings to default. + +If set to 1, both IPv4 and IPv6 settings are forced to inherit from +current ones in init_net. If set to 2, both IPv4 and IPv6 settings are +forced to reset to their default values. If set to 3, both IPv4 and IPv6 +settings are forced to inherit from current ones in the netns where this +new netns has been created. + +Default : 0 (for compatibility reasons) + +txrehash +-------- + +Controls default hash rethink behaviour on listening socket when SO_TXREHASH +option is set to SOCK_TXREHASH_DEFAULT (i. e. not overridden by setsockopt). + +If set to 1 (default), hash rethink is performed on listening socket. +If set to 0, hash rethink is not performed. + +gro_normal_batch +---------------- + +Maximum number of the segments to batch up on output of GRO. When a packet +exits GRO, either as a coalesced superframe or as an original packet which +GRO has decided not to coalesce, it is placed on a per-NAPI list. This +list is then passed to the stack when the number of segments reaches the +gro_normal_batch limit. + +high_order_alloc_disable +------------------------ + +By default the allocator for page frags tries to use high order pages (order-3 +on x86). While the default behavior gives good results in most cases, some users +might have hit a contention in page allocations/freeing. This was especially +true on older kernels (< 5.14) when high-order pages were not stored on per-cpu +lists. This allows to opt-in for order-0 allocation instead but is now mostly of +historical importance. + +Default: 0 + +2. /proc/sys/net/unix - Parameters for Unix domain sockets +---------------------------------------------------------- + +There is only one file in this directory. +unix_dgram_qlen limits the max number of datagrams queued in Unix domain +socket's buffer. It will not take effect unless PF_UNIX flag is specified. + + +3. /proc/sys/net/ipv4 - IPV4 settings +------------------------------------- +Please see: Documentation/networking/ip-sysctl.rst and +Documentation/admin-guide/sysctl/net.rst for descriptions of these entries. + + +4. Appletalk +------------ + +The /proc/sys/net/appletalk directory holds the Appletalk configuration data +when Appletalk is loaded. The configurable parameters are: + +aarp-expiry-time +---------------- + +The amount of time we keep an ARP entry before expiring it. Used to age out +old hosts. + +aarp-resolve-time +----------------- + +The amount of time we will spend trying to resolve an Appletalk address. + +aarp-retransmit-limit +--------------------- + +The number of times we will retransmit a query before giving up. + +aarp-tick-time +-------------- + +Controls the rate at which expires are checked. + +The directory /proc/net/appletalk holds the list of active Appletalk sockets +on a machine. + +The fields indicate the DDP type, the local address (in network:node format) +the remote address, the size of the transmit pending queue, the size of the +received queue (bytes waiting for applications to read) the state and the uid +owning the socket. + +/proc/net/atalk_iface lists all the interfaces configured for appletalk.It +shows the name of the interface, its Appletalk address, the network range on +that address (or network number for phase 1 networks), and the status of the +interface. + +/proc/net/atalk_route lists each known network route. It lists the target +(network) that the route leads to, the router (may be directly connected), the +route flags, and the device the route is using. + +5. TIPC +------- + +tipc_rmem +--------- + +The TIPC protocol now has a tunable for the receive memory, similar to the +tcp_rmem - i.e. a vector of 3 INTEGERs: (min, default, max) + +:: + + # cat /proc/sys/net/tipc/tipc_rmem + 4252725 34021800 68043600 + # + +The max value is set to CONN_OVERLOAD_LIMIT, and the default and min values +are scaled (shifted) versions of that same value. Note that the min value +is not at this point in time used in any meaningful way, but the triplet is +preserved in order to be consistent with things like tcp_rmem. + +named_timeout +------------- + +TIPC name table updates are distributed asynchronously in a cluster, without +any form of transaction handling. This means that different race scenarios are +possible. One such is that a name withdrawal sent out by one node and received +by another node may arrive after a second, overlapping name publication already +has been accepted from a third node, although the conflicting updates +originally may have been issued in the correct sequential order. +If named_timeout is nonzero, failed topology updates will be placed on a defer +queue until another event arrives that clears the error, or until the timeout +expires. Value is in milliseconds. diff --git a/Documentation/admin-guide/sysctl/sunrpc.rst b/Documentation/admin-guide/sysctl/sunrpc.rst new file mode 100644 index 000000000..09780a682 --- /dev/null +++ b/Documentation/admin-guide/sysctl/sunrpc.rst @@ -0,0 +1,25 @@ +=================================== +Documentation for /proc/sys/sunrpc/ +=================================== + +kernel version 2.2.10 + +Copyright (c) 1998, 1999, Rik van Riel + +For general info and legal blurb, please look in index.rst. + +------------------------------------------------------------------------------ + +This file contains the documentation for the sysctl files in +/proc/sys/sunrpc and is valid for Linux kernel version 2.2. + +The files in this directory can be used to (re)set the debug +flags of the SUN Remote Procedure Call (RPC) subsystem in +the Linux kernel. This stuff is used for NFS, KNFSD and +maybe a few other things as well. + +The files in there are used to control the debugging flags: +rpc_debug, nfs_debug, nfsd_debug and nlm_debug. + +These flags are for kernel hackers only. You should read the +source code in net/sunrpc/ for more information. diff --git a/Documentation/admin-guide/sysctl/user.rst b/Documentation/admin-guide/sysctl/user.rst new file mode 100644 index 000000000..c45824589 --- /dev/null +++ b/Documentation/admin-guide/sysctl/user.rst @@ -0,0 +1,84 @@ +================================= +Documentation for /proc/sys/user/ +================================= + +kernel version 4.9.0 + +Copyright (c) 2016 Eric Biederman + +------------------------------------------------------------------------------ + +This file contains the documentation for the sysctl files in +/proc/sys/user. + +The files in this directory can be used to override the default +limits on the number of namespaces and other objects that have +per user per user namespace limits. + +The primary purpose of these limits is to stop programs that +malfunction and attempt to create a ridiculous number of objects, +before the malfunction becomes a system wide problem. It is the +intention that the defaults of these limits are set high enough that +no program in normal operation should run into these limits. + +The creation of per user per user namespace objects are charged to +the user in the user namespace who created the object and +verified to be below the per user limit in that user namespace. + +The creation of objects is also charged to all of the users +who created user namespaces the creation of the object happens +in (user namespaces can be nested) and verified to be below the per user +limits in the user namespaces of those users. + +This recursive counting of created objects ensures that creating a +user namespace does not allow a user to escape their current limits. + +Currently, these files are in /proc/sys/user: + +max_cgroup_namespaces +===================== + + The maximum number of cgroup namespaces that any user in the current + user namespace may create. + +max_ipc_namespaces +================== + + The maximum number of ipc namespaces that any user in the current + user namespace may create. + +max_mnt_namespaces +================== + + The maximum number of mount namespaces that any user in the current + user namespace may create. + +max_net_namespaces +================== + + The maximum number of network namespaces that any user in the + current user namespace may create. + +max_pid_namespaces +================== + + The maximum number of pid namespaces that any user in the current + user namespace may create. + +max_time_namespaces +=================== + + The maximum number of time namespaces that any user in the current + user namespace may create. + +max_user_namespaces +=================== + + The maximum number of user namespaces that any user in the current + user namespace may create. + +max_uts_namespaces +================== + + The maximum number of user namespaces that any user in the current + user namespace may create. diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst new file mode 100644 index 000000000..988f6a4c8 --- /dev/null +++ b/Documentation/admin-guide/sysctl/vm.rst @@ -0,0 +1,1046 @@ +=============================== +Documentation for /proc/sys/vm/ +=============================== + +kernel version 2.6.29 + +Copyright (c) 1998, 1999, Rik van Riel + +Copyright (c) 2008 Peter W. Morreale + +For general info and legal blurb, please look in index.rst. + +------------------------------------------------------------------------------ + +This file contains the documentation for the sysctl files in +/proc/sys/vm and is valid for Linux kernel version 2.6.29. + +The files in this directory can be used to tune the operation +of the virtual memory (VM) subsystem of the Linux kernel and +the writeout of dirty data to disk. + +Default values and initialization routines for most of these +files can be found in mm/swap.c. + +Currently, these files are in /proc/sys/vm: + +- admin_reserve_kbytes +- compact_memory +- compaction_proactiveness +- compact_unevictable_allowed +- dirty_background_bytes +- dirty_background_ratio +- dirty_bytes +- dirty_expire_centisecs +- dirty_ratio +- dirtytime_expire_seconds +- dirty_writeback_centisecs +- drop_caches +- extfrag_threshold +- highmem_is_dirtyable +- hugetlb_shm_group +- laptop_mode +- legacy_va_layout +- lowmem_reserve_ratio +- max_map_count +- memory_failure_early_kill +- memory_failure_recovery +- min_free_kbytes +- min_slab_ratio +- min_unmapped_ratio +- mmap_min_addr +- mmap_rnd_bits +- mmap_rnd_compat_bits +- nr_hugepages +- nr_hugepages_mempolicy +- nr_overcommit_hugepages +- nr_trim_pages (only if CONFIG_MMU=n) +- numa_zonelist_order +- oom_dump_tasks +- oom_kill_allocating_task +- overcommit_kbytes +- overcommit_memory +- overcommit_ratio +- page-cluster +- page_lock_unfairness +- panic_on_oom +- percpu_pagelist_high_fraction +- stat_interval +- stat_refresh +- numa_stat +- swappiness +- unprivileged_userfaultfd +- user_reserve_kbytes +- vfs_cache_pressure +- watermark_boost_factor +- watermark_scale_factor +- zone_reclaim_mode + + +admin_reserve_kbytes +==================== + +The amount of free memory in the system that should be reserved for users +with the capability cap_sys_admin. + +admin_reserve_kbytes defaults to min(3% of free pages, 8MB) + +That should provide enough for the admin to log in and kill a process, +if necessary, under the default overcommit 'guess' mode. + +Systems running under overcommit 'never' should increase this to account +for the full Virtual Memory Size of programs used to recover. Otherwise, +root may not be able to log in to recover the system. + +How do you calculate a minimum useful reserve? + +sshd or login + bash (or some other shell) + top (or ps, kill, etc.) + +For overcommit 'guess', we can sum resident set sizes (RSS). +On x86_64 this is about 8MB. + +For overcommit 'never', we can take the max of their virtual sizes (VSZ) +and add the sum of their RSS. +On x86_64 this is about 128MB. + +Changing this takes effect whenever an application requests memory. + + +compact_memory +============== + +Available only when CONFIG_COMPACTION is set. When 1 is written to the file, +all zones are compacted such that free memory is available in contiguous +blocks where possible. This can be important for example in the allocation of +huge pages although processes will also directly compact memory as required. + +compaction_proactiveness +======================== + +This tunable takes a value in the range [0, 100] with a default value of +20. This tunable determines how aggressively compaction is done in the +background. Write of a non zero value to this tunable will immediately +trigger the proactive compaction. Setting it to 0 disables proactive compaction. + +Note that compaction has a non-trivial system-wide impact as pages +belonging to different processes are moved around, which could also lead +to latency spikes in unsuspecting applications. The kernel employs +various heuristics to avoid wasting CPU cycles if it detects that +proactive compaction is not being effective. + +Be careful when setting it to extreme values like 100, as that may +cause excessive background compaction activity. + +compact_unevictable_allowed +=========================== + +Available only when CONFIG_COMPACTION is set. When set to 1, compaction is +allowed to examine the unevictable lru (mlocked pages) for pages to compact. +This should be used on systems where stalls for minor page faults are an +acceptable trade for large contiguous free memory. Set to 0 to prevent +compaction from moving pages that are unevictable. Default value is 1. +On CONFIG_PREEMPT_RT the default value is 0 in order to avoid a page fault, due +to compaction, which would block the task from becoming active until the fault +is resolved. + + +dirty_background_bytes +====================== + +Contains the amount of dirty memory at which the background kernel +flusher threads will start writeback. + +Note: + dirty_background_bytes is the counterpart of dirty_background_ratio. Only + one of them may be specified at a time. When one sysctl is written it is + immediately taken into account to evaluate the dirty memory limits and the + other appears as 0 when read. + + +dirty_background_ratio +====================== + +Contains, as a percentage of total available memory that contains free pages +and reclaimable pages, the number of pages at which the background kernel +flusher threads will start writing out dirty data. + +The total available memory is not equal to total system memory. + + +dirty_bytes +=========== + +Contains the amount of dirty memory at which a process generating disk writes +will itself start writeback. + +Note: dirty_bytes is the counterpart of dirty_ratio. Only one of them may be +specified at a time. When one sysctl is written it is immediately taken into +account to evaluate the dirty memory limits and the other appears as 0 when +read. + +Note: the minimum value allowed for dirty_bytes is two pages (in bytes); any +value lower than this limit will be ignored and the old configuration will be +retained. + + +dirty_expire_centisecs +====================== + +This tunable is used to define when dirty data is old enough to be eligible +for writeout by the kernel flusher threads. It is expressed in 100'ths +of a second. Data which has been dirty in-memory for longer than this +interval will be written out next time a flusher thread wakes up. + + +dirty_ratio +=========== + +Contains, as a percentage of total available memory that contains free pages +and reclaimable pages, the number of pages at which a process which is +generating disk writes will itself start writing out dirty data. + +The total available memory is not equal to total system memory. + + +dirtytime_expire_seconds +======================== + +When a lazytime inode is constantly having its pages dirtied, the inode with +an updated timestamp will never get chance to be written out. And, if the +only thing that has happened on the file system is a dirtytime inode caused +by an atime update, a worker will be scheduled to make sure that inode +eventually gets pushed out to disk. This tunable is used to define when dirty +inode is old enough to be eligible for writeback by the kernel flusher threads. +And, it is also used as the interval to wakeup dirtytime_writeback thread. + + +dirty_writeback_centisecs +========================= + +The kernel flusher threads will periodically wake up and write `old` data +out to disk. This tunable expresses the interval between those wakeups, in +100'ths of a second. + +Setting this to zero disables periodic writeback altogether. + + +drop_caches +=========== + +Writing to this will cause the kernel to drop clean caches, as well as +reclaimable slab objects like dentries and inodes. Once dropped, their +memory becomes free. + +To free pagecache:: + + echo 1 > /proc/sys/vm/drop_caches + +To free reclaimable slab objects (includes dentries and inodes):: + + echo 2 > /proc/sys/vm/drop_caches + +To free slab objects and pagecache:: + + echo 3 > /proc/sys/vm/drop_caches + +This is a non-destructive operation and will not free any dirty objects. +To increase the number of objects freed by this operation, the user may run +`sync` prior to writing to /proc/sys/vm/drop_caches. This will minimize the +number of dirty objects on the system and create more candidates to be +dropped. + +This file is not a means to control the growth of the various kernel caches +(inodes, dentries, pagecache, etc...) These objects are automatically +reclaimed by the kernel when memory is needed elsewhere on the system. + +Use of this file can cause performance problems. Since it discards cached +objects, it may cost a significant amount of I/O and CPU to recreate the +dropped objects, especially if they were under heavy use. Because of this, +use outside of a testing or debugging environment is not recommended. + +You may see informational messages in your kernel log when this file is +used:: + + cat (1234): drop_caches: 3 + +These are informational only. They do not mean that anything is wrong +with your system. To disable them, echo 4 (bit 2) into drop_caches. + + +extfrag_threshold +================= + +This parameter affects whether the kernel will compact memory or direct +reclaim to satisfy a high-order allocation. The extfrag/extfrag_index file in +debugfs shows what the fragmentation index for each order is in each zone in +the system. Values tending towards 0 imply allocations would fail due to lack +of memory, values towards 1000 imply failures are due to fragmentation and -1 +implies that the allocation will succeed as long as watermarks are met. + +The kernel will not compact memory in a zone if the +fragmentation index is <= extfrag_threshold. The default value is 500. + + +highmem_is_dirtyable +==================== + +Available only for systems with CONFIG_HIGHMEM enabled (32b systems). + +This parameter controls whether the high memory is considered for dirty +writers throttling. This is not the case by default which means that +only the amount of memory directly visible/usable by the kernel can +be dirtied. As a result, on systems with a large amount of memory and +lowmem basically depleted writers might be throttled too early and +streaming writes can get very slow. + +Changing the value to non zero would allow more memory to be dirtied +and thus allow writers to write more data which can be flushed to the +storage more effectively. Note this also comes with a risk of pre-mature +OOM killer because some writers (e.g. direct block device writes) can +only use the low memory and they can fill it up with dirty data without +any throttling. + + +hugetlb_shm_group +================= + +hugetlb_shm_group contains group id that is allowed to create SysV +shared memory segment using hugetlb page. + + +laptop_mode +=========== + +laptop_mode is a knob that controls "laptop mode". All the things that are +controlled by this knob are discussed in Documentation/admin-guide/laptops/laptop-mode.rst. + + +legacy_va_layout +================ + +If non-zero, this sysctl disables the new 32-bit mmap layout - the kernel +will use the legacy (2.4) layout for all processes. + + +lowmem_reserve_ratio +==================== + +For some specialised workloads on highmem machines it is dangerous for +the kernel to allow process memory to be allocated from the "lowmem" +zone. This is because that memory could then be pinned via the mlock() +system call, or by unavailability of swapspace. + +And on large highmem machines this lack of reclaimable lowmem memory +can be fatal. + +So the Linux page allocator has a mechanism which prevents allocations +which *could* use highmem from using too much lowmem. This means that +a certain amount of lowmem is defended from the possibility of being +captured into pinned user memory. + +(The same argument applies to the old 16 megabyte ISA DMA region. This +mechanism will also defend that region from allocations which could use +highmem or lowmem). + +The `lowmem_reserve_ratio` tunable determines how aggressive the kernel is +in defending these lower zones. + +If you have a machine which uses highmem or ISA DMA and your +applications are using mlock(), or if you are running with no swap then +you probably should change the lowmem_reserve_ratio setting. + +The lowmem_reserve_ratio is an array. You can see them by reading this file:: + + % cat /proc/sys/vm/lowmem_reserve_ratio + 256 256 32 + +But, these values are not used directly. The kernel calculates # of protection +pages for each zones from them. These are shown as array of protection pages +in /proc/zoneinfo like followings. (This is an example of x86-64 box). +Each zone has an array of protection pages like this:: + + Node 0, zone DMA + pages free 1355 + min 3 + low 3 + high 4 + : + : + numa_other 0 + protection: (0, 2004, 2004, 2004) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + pagesets + cpu: 0 pcp: 0 + : + +These protections are added to score to judge whether this zone should be used +for page allocation or should be reclaimed. + +In this example, if normal pages (index=2) are required to this DMA zone and +watermark[WMARK_HIGH] is used for watermark, the kernel judges this zone should +not be used because pages_free(1355) is smaller than watermark + protection[2] +(4 + 2004 = 2008). If this protection value is 0, this zone would be used for +normal page requirement. If requirement is DMA zone(index=0), protection[0] +(=0) is used. + +zone[i]'s protection[j] is calculated by following expression:: + + (i < j): + zone[i]->protection[j] + = (total sums of managed_pages from zone[i+1] to zone[j] on the node) + / lowmem_reserve_ratio[i]; + (i = j): + (should not be protected. = 0; + (i > j): + (not necessary, but looks 0) + +The default values of lowmem_reserve_ratio[i] are + + === ==================================== + 256 (if zone[i] means DMA or DMA32 zone) + 32 (others) + === ==================================== + +As above expression, they are reciprocal number of ratio. +256 means 1/256. # of protection pages becomes about "0.39%" of total managed +pages of higher zones on the node. + +If you would like to protect more pages, smaller values are effective. +The minimum value is 1 (1/1 -> 100%). The value less than 1 completely +disables protection of the pages. + + +max_map_count: +============== + +This file contains the maximum number of memory map areas a process +may have. Memory map areas are used as a side-effect of calling +malloc, directly by mmap, mprotect, and madvise, and also when loading +shared libraries. + +While most applications need less than a thousand maps, certain +programs, particularly malloc debuggers, may consume lots of them, +e.g., up to one or two maps per allocation. + +The default value is 65530. + + +memory_failure_early_kill: +========================== + +Control how to kill processes when uncorrected memory error (typically +a 2bit error in a memory module) is detected in the background by hardware +that cannot be handled by the kernel. In some cases (like the page +still having a valid copy on disk) the kernel will handle the failure +transparently without affecting any applications. But if there is +no other uptodate copy of the data it will kill to prevent any data +corruptions from propagating. + +1: Kill all processes that have the corrupted and not reloadable page mapped +as soon as the corruption is detected. Note this is not supported +for a few types of pages, like kernel internally allocated data or +the swap cache, but works for the majority of user pages. + +0: Only unmap the corrupted page from all processes and only kill a process +who tries to access it. + +The kill is done using a catchable SIGBUS with BUS_MCEERR_AO, so processes can +handle this if they want to. + +This is only active on architectures/platforms with advanced machine +check handling and depends on the hardware capabilities. + +Applications can override this setting individually with the PR_MCE_KILL prctl + + +memory_failure_recovery +======================= + +Enable memory failure recovery (when supported by the platform) + +1: Attempt recovery. + +0: Always panic on a memory failure. + + +min_free_kbytes +=============== + +This is used to force the Linux VM to keep a minimum number +of kilobytes free. The VM uses this number to compute a +watermark[WMARK_MIN] value for each lowmem zone in the system. +Each lowmem zone gets a number of reserved free pages based +proportionally on its size. + +Some minimal amount of memory is needed to satisfy PF_MEMALLOC +allocations; if you set this to lower than 1024KB, your system will +become subtly broken, and prone to deadlock under high loads. + +Setting this too high will OOM your machine instantly. + + +min_slab_ratio +============== + +This is available only on NUMA kernels. + +A percentage of the total pages in each zone. On Zone reclaim +(fallback from the local zone occurs) slabs will be reclaimed if more +than this percentage of pages in a zone are reclaimable slab pages. +This insures that the slab growth stays under control even in NUMA +systems that rarely perform global reclaim. + +The default is 5 percent. + +Note that slab reclaim is triggered in a per zone / node fashion. +The process of reclaiming slab memory is currently not node specific +and may not be fast. + + +min_unmapped_ratio +================== + +This is available only on NUMA kernels. + +This is a percentage of the total pages in each zone. Zone reclaim will +only occur if more than this percentage of pages are in a state that +zone_reclaim_mode allows to be reclaimed. + +If zone_reclaim_mode has the value 4 OR'd, then the percentage is compared +against all file-backed unmapped pages including swapcache pages and tmpfs +files. Otherwise, only unmapped pages backed by normal files but not tmpfs +files and similar are considered. + +The default is 1 percent. + + +mmap_min_addr +============= + +This file indicates the amount of address space which a user process will +be restricted from mmapping. Since kernel null dereference bugs could +accidentally operate based on the information in the first couple of pages +of memory userspace processes should not be allowed to write to them. By +default this value is set to 0 and no protections will be enforced by the +security module. Setting this value to something like 64k will allow the +vast majority of applications to work correctly and provide defense in depth +against future potential kernel bugs. + + +mmap_rnd_bits +============= + +This value can be used to select the number of bits to use to +determine the random offset to the base address of vma regions +resulting from mmap allocations on architectures which support +tuning address space randomization. This value will be bounded +by the architecture's minimum and maximum supported values. + +This value can be changed after boot using the +/proc/sys/vm/mmap_rnd_bits tunable + + +mmap_rnd_compat_bits +==================== + +This value can be used to select the number of bits to use to +determine the random offset to the base address of vma regions +resulting from mmap allocations for applications run in +compatibility mode on architectures which support tuning address +space randomization. This value will be bounded by the +architecture's minimum and maximum supported values. + +This value can be changed after boot using the +/proc/sys/vm/mmap_rnd_compat_bits tunable + + +nr_hugepages +============ + +Change the minimum size of the hugepage pool. + +See Documentation/admin-guide/mm/hugetlbpage.rst + + +hugetlb_optimize_vmemmap +======================== + +This knob is not available when the size of 'struct page' (a structure defined +in include/linux/mm_types.h) is not power of two (an unusual system config could +result in this). + +Enable (set to 1) or disable (set to 0) HugeTLB Vmemmap Optimization (HVO). + +Once enabled, the vmemmap pages of subsequent allocation of HugeTLB pages from +buddy allocator will be optimized (7 pages per 2MB HugeTLB page and 4095 pages +per 1GB HugeTLB page), whereas already allocated HugeTLB pages will not be +optimized. When those optimized HugeTLB pages are freed from the HugeTLB pool +to the buddy allocator, the vmemmap pages representing that range needs to be +remapped again and the vmemmap pages discarded earlier need to be rellocated +again. If your use case is that HugeTLB pages are allocated 'on the fly' (e.g. +never explicitly allocating HugeTLB pages with 'nr_hugepages' but only set +'nr_overcommit_hugepages', those overcommitted HugeTLB pages are allocated 'on +the fly') instead of being pulled from the HugeTLB pool, you should weigh the +benefits of memory savings against the more overhead (~2x slower than before) +of allocation or freeing HugeTLB pages between the HugeTLB pool and the buddy +allocator. Another behavior to note is that if the system is under heavy memory +pressure, it could prevent the user from freeing HugeTLB pages from the HugeTLB +pool to the buddy allocator since the allocation of vmemmap pages could be +failed, you have to retry later if your system encounter this situation. + +Once disabled, the vmemmap pages of subsequent allocation of HugeTLB pages from +buddy allocator will not be optimized meaning the extra overhead at allocation +time from buddy allocator disappears, whereas already optimized HugeTLB pages +will not be affected. If you want to make sure there are no optimized HugeTLB +pages, you can set "nr_hugepages" to 0 first and then disable this. Note that +writing 0 to nr_hugepages will make any "in use" HugeTLB pages become surplus +pages. So, those surplus pages are still optimized until they are no longer +in use. You would need to wait for those surplus pages to be released before +there are no optimized pages in the system. + + +nr_hugepages_mempolicy +====================== + +Change the size of the hugepage pool at run-time on a specific +set of NUMA nodes. + +See Documentation/admin-guide/mm/hugetlbpage.rst + + +nr_overcommit_hugepages +======================= + +Change the maximum size of the hugepage pool. The maximum is +nr_hugepages + nr_overcommit_hugepages. + +See Documentation/admin-guide/mm/hugetlbpage.rst + + +nr_trim_pages +============= + +This is available only on NOMMU kernels. + +This value adjusts the excess page trimming behaviour of power-of-2 aligned +NOMMU mmap allocations. + +A value of 0 disables trimming of allocations entirely, while a value of 1 +trims excess pages aggressively. Any value >= 1 acts as the watermark where +trimming of allocations is initiated. + +The default value is 1. + +See Documentation/admin-guide/mm/nommu-mmap.rst for more information. + + +numa_zonelist_order +=================== + +This sysctl is only for NUMA and it is deprecated. Anything but +Node order will fail! + +'where the memory is allocated from' is controlled by zonelists. + +(This documentation ignores ZONE_HIGHMEM/ZONE_DMA32 for simple explanation. +you may be able to read ZONE_DMA as ZONE_DMA32...) + +In non-NUMA case, a zonelist for GFP_KERNEL is ordered as following. +ZONE_NORMAL -> ZONE_DMA +This means that a memory allocation request for GFP_KERNEL will +get memory from ZONE_DMA only when ZONE_NORMAL is not available. + +In NUMA case, you can think of following 2 types of order. +Assume 2 node NUMA and below is zonelist of Node(0)'s GFP_KERNEL:: + + (A) Node(0) ZONE_NORMAL -> Node(0) ZONE_DMA -> Node(1) ZONE_NORMAL + (B) Node(0) ZONE_NORMAL -> Node(1) ZONE_NORMAL -> Node(0) ZONE_DMA. + +Type(A) offers the best locality for processes on Node(0), but ZONE_DMA +will be used before ZONE_NORMAL exhaustion. This increases possibility of +out-of-memory(OOM) of ZONE_DMA because ZONE_DMA is tend to be small. + +Type(B) cannot offer the best locality but is more robust against OOM of +the DMA zone. + +Type(A) is called as "Node" order. Type (B) is "Zone" order. + +"Node order" orders the zonelists by node, then by zone within each node. +Specify "[Nn]ode" for node order + +"Zone Order" orders the zonelists by zone type, then by node within each +zone. Specify "[Zz]one" for zone order. + +Specify "[Dd]efault" to request automatic configuration. + +On 32-bit, the Normal zone needs to be preserved for allocations accessible +by the kernel, so "zone" order will be selected. + +On 64-bit, devices that require DMA32/DMA are relatively rare, so "node" +order will be selected. + +Default order is recommended unless this is causing problems for your +system/application. + + +oom_dump_tasks +============== + +Enables a system-wide task dump (excluding kernel threads) to be produced +when the kernel performs an OOM-killing and includes such information as +pid, uid, tgid, vm size, rss, pgtables_bytes, swapents, oom_score_adj +score, and name. This is helpful to determine why the OOM killer was +invoked, to identify the rogue task that caused it, and to determine why +the OOM killer chose the task it did to kill. + +If this is set to zero, this information is suppressed. On very +large systems with thousands of tasks it may not be feasible to dump +the memory state information for each one. Such systems should not +be forced to incur a performance penalty in OOM conditions when the +information may not be desired. + +If this is set to non-zero, this information is shown whenever the +OOM killer actually kills a memory-hogging task. + +The default value is 1 (enabled). + + +oom_kill_allocating_task +======================== + +This enables or disables killing the OOM-triggering task in +out-of-memory situations. + +If this is set to zero, the OOM killer will scan through the entire +tasklist and select a task based on heuristics to kill. This normally +selects a rogue memory-hogging task that frees up a large amount of +memory when killed. + +If this is set to non-zero, the OOM killer simply kills the task that +triggered the out-of-memory condition. This avoids the expensive +tasklist scan. + +If panic_on_oom is selected, it takes precedence over whatever value +is used in oom_kill_allocating_task. + +The default value is 0. + + +overcommit_kbytes +================= + +When overcommit_memory is set to 2, the committed address space is not +permitted to exceed swap plus this amount of physical RAM. See below. + +Note: overcommit_kbytes is the counterpart of overcommit_ratio. Only one +of them may be specified at a time. Setting one disables the other (which +then appears as 0 when read). + + +overcommit_memory +================= + +This value contains a flag that enables memory overcommitment. + +When this flag is 0, the kernel attempts to estimate the amount +of free memory left when userspace requests more memory. + +When this flag is 1, the kernel pretends there is always enough +memory until it actually runs out. + +When this flag is 2, the kernel uses a "never overcommit" +policy that attempts to prevent any overcommit of memory. +Note that user_reserve_kbytes affects this policy. + +This feature can be very useful because there are a lot of +programs that malloc() huge amounts of memory "just-in-case" +and don't use much of it. + +The default value is 0. + +See Documentation/mm/overcommit-accounting.rst and +mm/util.c::__vm_enough_memory() for more information. + + +overcommit_ratio +================ + +When overcommit_memory is set to 2, the committed address +space is not permitted to exceed swap plus this percentage +of physical RAM. See above. + + +page-cluster +============ + +page-cluster controls the number of pages up to which consecutive pages +are read in from swap in a single attempt. This is the swap counterpart +to page cache readahead. +The mentioned consecutivity is not in terms of virtual/physical addresses, +but consecutive on swap space - that means they were swapped out together. + +It is a logarithmic value - setting it to zero means "1 page", setting +it to 1 means "2 pages", setting it to 2 means "4 pages", etc. +Zero disables swap readahead completely. + +The default value is three (eight pages at a time). There may be some +small benefits in tuning this to a different value if your workload is +swap-intensive. + +Lower values mean lower latencies for initial faults, but at the same time +extra faults and I/O delays for following faults if they would have been part of +that consecutive pages readahead would have brought in. + + +page_lock_unfairness +==================== + +This value determines the number of times that the page lock can be +stolen from under a waiter. After the lock is stolen the number of times +specified in this file (default is 5), the "fair lock handoff" semantics +will apply, and the waiter will only be awakened if the lock can be taken. + +panic_on_oom +============ + +This enables or disables panic on out-of-memory feature. + +If this is set to 0, the kernel will kill some rogue process, +called oom_killer. Usually, oom_killer can kill rogue processes and +system will survive. + +If this is set to 1, the kernel panics when out-of-memory happens. +However, if a process limits using nodes by mempolicy/cpusets, +and those nodes become memory exhaustion status, one process +may be killed by oom-killer. No panic occurs in this case. +Because other nodes' memory may be free. This means system total status +may be not fatal yet. + +If this is set to 2, the kernel panics compulsorily even on the +above-mentioned. Even oom happens under memory cgroup, the whole +system panics. + +The default value is 0. + +1 and 2 are for failover of clustering. Please select either +according to your policy of failover. + +panic_on_oom=2+kdump gives you very strong tool to investigate +why oom happens. You can get snapshot. + + +percpu_pagelist_high_fraction +============================= + +This is the fraction of pages in each zone that are can be stored to +per-cpu page lists. It is an upper boundary that is divided depending +on the number of online CPUs. The min value for this is 8 which means +that we do not allow more than 1/8th of pages in each zone to be stored +on per-cpu page lists. This entry only changes the value of hot per-cpu +page lists. A user can specify a number like 100 to allocate 1/100th of +each zone between per-cpu lists. + +The batch value of each per-cpu page list remains the same regardless of +the value of the high fraction so allocation latencies are unaffected. + +The initial value is zero. Kernel uses this value to set the high pcp->high +mark based on the low watermark for the zone and the number of local +online CPUs. If the user writes '0' to this sysctl, it will revert to +this default behavior. + + +stat_interval +============= + +The time interval between which vm statistics are updated. The default +is 1 second. + + +stat_refresh +============ + +Any read or write (by root only) flushes all the per-cpu vm statistics +into their global totals, for more accurate reports when testing +e.g. cat /proc/sys/vm/stat_refresh /proc/meminfo + +As a side-effect, it also checks for negative totals (elsewhere reported +as 0) and "fails" with EINVAL if any are found, with a warning in dmesg. +(At time of writing, a few stats are known sometimes to be found negative, +with no ill effects: errors and warnings on these stats are suppressed.) + + +numa_stat +========= + +This interface allows runtime configuration of numa statistics. + +When page allocation performance becomes a bottleneck and you can tolerate +some possible tool breakage and decreased numa counter precision, you can +do:: + + echo 0 > /proc/sys/vm/numa_stat + +When page allocation performance is not a bottleneck and you want all +tooling to work, you can do:: + + echo 1 > /proc/sys/vm/numa_stat + + +swappiness +========== + +This control is used to define the rough relative IO cost of swapping +and filesystem paging, as a value between 0 and 200. At 100, the VM +assumes equal IO cost and will thus apply memory pressure to the page +cache and swap-backed pages equally; lower values signify more +expensive swap IO, higher values indicates cheaper. + +Keep in mind that filesystem IO patterns under memory pressure tend to +be more efficient than swap's random IO. An optimal value will require +experimentation and will also be workload-dependent. + +The default value is 60. + +For in-memory swap, like zram or zswap, as well as hybrid setups that +have swap on faster devices than the filesystem, values beyond 100 can +be considered. For example, if the random IO against the swap device +is on average 2x faster than IO from the filesystem, swappiness should +be 133 (x + 2x = 200, 2x = 133.33). + +At 0, the kernel will not initiate swap until the amount of free and +file-backed pages is less than the high watermark in a zone. + + +unprivileged_userfaultfd +======================== + +This flag controls the mode in which unprivileged users can use the +userfaultfd system calls. Set this to 0 to restrict unprivileged users +to handle page faults in user mode only. In this case, users without +SYS_CAP_PTRACE must pass UFFD_USER_MODE_ONLY in order for userfaultfd to +succeed. Prohibiting use of userfaultfd for handling faults from kernel +mode may make certain vulnerabilities more difficult to exploit. + +Set this to 1 to allow unprivileged users to use the userfaultfd system +calls without any restrictions. + +The default value is 0. + +Another way to control permissions for userfaultfd is to use +/dev/userfaultfd instead of userfaultfd(2). See +Documentation/admin-guide/mm/userfaultfd.rst. + +user_reserve_kbytes +=================== + +When overcommit_memory is set to 2, "never overcommit" mode, reserve +min(3% of current process size, user_reserve_kbytes) of free memory. +This is intended to prevent a user from starting a single memory hogging +process, such that they cannot recover (kill the hog). + +user_reserve_kbytes defaults to min(3% of the current process size, 128MB). + +If this is reduced to zero, then the user will be allowed to allocate +all free memory with a single process, minus admin_reserve_kbytes. +Any subsequent attempts to execute a command will result in +"fork: Cannot allocate memory". + +Changing this takes effect whenever an application requests memory. + + +vfs_cache_pressure +================== + +This percentage value controls the tendency of the kernel to reclaim +the memory which is used for caching of directory and inode objects. + +At the default value of vfs_cache_pressure=100 the kernel will attempt to +reclaim dentries and inodes at a "fair" rate with respect to pagecache and +swapcache reclaim. Decreasing vfs_cache_pressure causes the kernel to prefer +to retain dentry and inode caches. When vfs_cache_pressure=0, the kernel will +never reclaim dentries and inodes due to memory pressure and this can easily +lead to out-of-memory conditions. Increasing vfs_cache_pressure beyond 100 +causes the kernel to prefer to reclaim dentries and inodes. + +Increasing vfs_cache_pressure significantly beyond 100 may have negative +performance impact. Reclaim code needs to take various locks to find freeable +directory and inode objects. With vfs_cache_pressure=1000, it will look for +ten times more freeable objects than there are. + + +watermark_boost_factor +====================== + +This factor controls the level of reclaim when memory is being fragmented. +It defines the percentage of the high watermark of a zone that will be +reclaimed if pages of different mobility are being mixed within pageblocks. +The intent is that compaction has less work to do in the future and to +increase the success rate of future high-order allocations such as SLUB +allocations, THP and hugetlbfs pages. + +To make it sensible with respect to the watermark_scale_factor +parameter, the unit is in fractions of 10,000. The default value of +15,000 means that up to 150% of the high watermark will be reclaimed in the +event of a pageblock being mixed due to fragmentation. The level of reclaim +is determined by the number of fragmentation events that occurred in the +recent past. If this value is smaller than a pageblock then a pageblocks +worth of pages will be reclaimed (e.g. 2MB on 64-bit x86). A boost factor +of 0 will disable the feature. + + +watermark_scale_factor +====================== + +This factor controls the aggressiveness of kswapd. It defines the +amount of memory left in a node/system before kswapd is woken up and +how much memory needs to be free before kswapd goes back to sleep. + +The unit is in fractions of 10,000. The default value of 10 means the +distances between watermarks are 0.1% of the available memory in the +node/system. The maximum value is 3000, or 30% of memory. + +A high rate of threads entering direct reclaim (allocstall) or kswapd +going to sleep prematurely (kswapd_low_wmark_hit_quickly) can indicate +that the number of free pages kswapd maintains for latency reasons is +too small for the allocation bursts occurring in the system. This knob +can then be used to tune kswapd aggressiveness accordingly. + + +zone_reclaim_mode +================= + +Zone_reclaim_mode allows someone to set more or less aggressive approaches to +reclaim memory when a zone runs out of memory. If it is set to zero then no +zone reclaim occurs. Allocations will be satisfied from other zones / nodes +in the system. + +This is value OR'ed together of + += =================================== +1 Zone reclaim on +2 Zone reclaim writes dirty pages out +4 Zone reclaim swaps pages += =================================== + +zone_reclaim_mode is disabled by default. For file servers or workloads +that benefit from having their data cached, zone_reclaim_mode should be +left disabled as the caching effect is likely to be more important than +data locality. + +Consider enabling one or more zone_reclaim mode bits if it's known that the +workload is partitioned such that each partition fits within a NUMA node +and that accessing remote memory would cause a measurable performance +reduction. The page allocator will take additional actions before +allocating off node pages. + +Allowing zone reclaim to write out pages stops processes that are +writing large amounts of data from dirtying pages on other nodes. Zone +reclaim will write out dirty pages if a zone fills up and so effectively +throttle the process. This may decrease the performance of a single process +since it cannot use all of system memory to buffer the outgoing writes +anymore but it preserve the memory on other nodes so that the performance +of other processes running on other nodes will not be affected. + +Allowing regular swap effectively restricts allocations to the local +node unless explicitly overridden by memory policies or cpuset +configurations. -- cgit v1.2.3