From 5d1646d90e1f2cceb9f0828f4b28318cd0ec7744 Mon Sep 17 00:00:00 2001
From: Daniel Baumann
Date: Sat, 27 Apr 2024 12:05:51 +0200
Subject: Adding upstream version 5.10.209.

Signed-off-by: Daniel Baumann
---
 Documentation/accounting/cgroupstats.rst      |  31 ++++
 Documentation/accounting/delay-accounting.rst | 126 ++++++++++++++++
 Documentation/accounting/index.rst            |  14 ++
 Documentation/accounting/psi.rst              | 185 ++++++++++++++++++++++++
 Documentation/accounting/taskstats-struct.rst | 199 ++++++++++++++++++++++++++
 Documentation/accounting/taskstats.rst        | 180 +++++++++++++++++++++++
 6 files changed, 735 insertions(+)
 create mode 100644 Documentation/accounting/cgroupstats.rst
 create mode 100644 Documentation/accounting/delay-accounting.rst
 create mode 100644 Documentation/accounting/index.rst
 create mode 100644 Documentation/accounting/psi.rst
 create mode 100644 Documentation/accounting/taskstats-struct.rst
 create mode 100644 Documentation/accounting/taskstats.rst

diff --git a/Documentation/accounting/cgroupstats.rst b/Documentation/accounting/cgroupstats.rst
new file mode 100644
index 000000000..b9afc48f4
--- /dev/null
+++ b/Documentation/accounting/cgroupstats.rst
@@ -0,0 +1,31 @@
+==================
+Control Groupstats
+==================
+
+Control Groupstats is inspired by the discussion at
+http://lkml.org/lkml/2007/4/11/187 and implements per cgroup statistics as
+suggested by Andrew Morton in http://lkml.org/lkml/2007/4/11/263.
+
+The per cgroup statistics infrastructure re-uses code from the taskstats
+interface. A new set of cgroup operations is registered with commands
+and attributes specific to cgroups. It should be very easy to
+extend per cgroup statistics by adding members to the cgroupstats
+structure.
+
+The current model for cgroupstats is a pull; a push model (to post
+statistics on interesting events) should be very easy to add. Currently,
+user space requests statistics by passing the cgroup path.
+Statistics about the state of all the tasks in the cgroup are returned to
+user space.
+
+NOTE: We currently rely on delay accounting for extracting information
+about tasks blocked on I/O. If CONFIG_TASK_DELAY_ACCT is disabled, this
+information will not be available.
+
+To extract cgroup statistics, a utility very similar to getdelays.c
+has been developed; sample output of the utility is shown below::
+
+  ~/balbir/cgroupstats # ./getdelays -C "/sys/fs/cgroup/a"
+  sleeping 1, blocked 0, running 1, stopped 0, uninterruptible 0
+  ~/balbir/cgroupstats # ./getdelays -C "/sys/fs/cgroup"
+  sleeping 155, blocked 0, running 1, stopped 0, uninterruptible 2

diff --git a/Documentation/accounting/delay-accounting.rst b/Documentation/accounting/delay-accounting.rst
new file mode 100644
index 000000000..7cc7f5852
--- /dev/null
+++ b/Documentation/accounting/delay-accounting.rst
@@ -0,0 +1,126 @@
+================
+Delay accounting
+================
+
+Tasks encounter delays in execution when they wait
+for some kernel resource to become available, e.g. a
+runnable task may wait for a free CPU to run on.
+
+The per-task delay accounting functionality measures
+the delays experienced by a task while
+
+a) waiting for a CPU (while being runnable)
+b) waiting for the completion of synchronous block I/O initiated by the task
+c) swapping in pages
+d) reclaiming memory
+
+and makes these statistics available to userspace through
+the taskstats interface.
+
+Such delays provide feedback for setting a task's cpu priority,
+io priority and rss limit values appropriately. Long delays for
+important tasks could be a trigger for raising their priorities.
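+
+Each of these delay classes is exported as a pair of fields in struct
+taskstats: an operation count and a cumulative delay in nanoseconds. As a
+purely illustrative sketch (not part of the upstream tooling), the fragment
+below assumes a populated struct taskstats has already been obtained
+through the interface described later in this document and simply prints
+those counters::
+
+  /* Illustrative only: "ts" is assumed to have been filled in via the
+   * taskstats interface. Delay totals are cumulative nanoseconds.
+   */
+  #include <stdio.h>
+  #include <linux/taskstats.h>
+
+  static void print_delays(const struct taskstats *ts)
+  {
+          printf("CPU     ops %llu delay %llu ns\n",
+                 (unsigned long long)ts->cpu_count,
+                 (unsigned long long)ts->cpu_delay_total);
+          printf("IO      ops %llu delay %llu ns\n",
+                 (unsigned long long)ts->blkio_count,
+                 (unsigned long long)ts->blkio_delay_total);
+          printf("SWAP    ops %llu delay %llu ns\n",
+                 (unsigned long long)ts->swapin_count,
+                 (unsigned long long)ts->swapin_delay_total);
+          printf("RECLAIM ops %llu delay %llu ns\n",
+                 (unsigned long long)ts->freepages_count,
+                 (unsigned long long)ts->freepages_delay_total);
+  }
+
+The getdelays.c utility described below performs the same decoding and
+additionally derives per-operation averages from these counters.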
+
+The functionality, through its use of the taskstats interface, also provides
+delay statistics aggregated for all tasks (or threads) belonging to a
+thread group (corresponding to a traditional Unix process). This is a commonly
+needed aggregation that is more efficiently done by the kernel.
+
+Userspace utilities, particularly resource management applications, can also
+aggregate delay statistics into arbitrary groups. To enable this, the delay
+statistics of a task are available both during its lifetime as well as on its
+exit, ensuring continuous and complete monitoring can be done.
+
+
+Interface
+---------
+
+Delay accounting uses the taskstats interface, which is described
+in detail in a separate document in this directory. Taskstats returns a
+generic data structure to userspace corresponding to per-pid and per-tgid
+statistics. The delay accounting functionality populates specific fields of
+this structure. See
+
+  include/linux/taskstats.h
+
+for a description of the fields pertaining to delay accounting.
+These will generally be in the form of counters returning the cumulative
+delay seen for cpu, sync block I/O, swapin, memory reclaim, etc.
+
+Taking the difference of two successive readings of a given
+counter (say cpu_delay_total) for a task will give the delay
+experienced by the task waiting for the corresponding resource
+in that interval.
+
+When a task exits, records containing the per-task statistics
+are sent to userspace without requiring a command. If it is the last exiting
+task of a thread group, the per-tgid statistics are also sent. More details
+are given in the taskstats interface description.
+
+The getdelays.c userspace utility in the tools/accounting directory allows
+simple commands to be run and the corresponding delay statistics to be
+displayed. It also serves as an example of using the taskstats interface.
+
+Usage
+-----
+
+Compile the kernel with::
+
+  CONFIG_TASK_DELAY_ACCT=y
+  CONFIG_TASKSTATS=y
+
+Delay accounting is enabled by default at boot up.
+To disable, add::
+
+  nodelayacct
+
+to the kernel boot options. The rest of the instructions
+below assume this has not been done.
+
+After the system has booted up, use a utility
+similar to getdelays.c to access the delays
+seen by a given task or a task group (tgid).
+The utility also allows a given command to be
+executed and the corresponding delays to be
+seen.
+
+General format of the getdelays command::
+
+  getdelays [-t tgid] [-p pid] [-c cmd...]
+
+
+Get delays, since system boot, for pid 10::
+
+  # ./getdelays -p 10
+  (output similar to next case)
+
+Get sum of delays, since system boot, for all pids with tgid 5::
+
+  # ./getdelays -t 5
+
+
+  CPU      count     real total     virtual total     delay total
+           7876      92005750       100000000         24001500
+  IO       count     delay total
+           0         0
+  SWAP     count     delay total
+           0         0
+  RECLAIM  count     delay total
+           0         0
+
+Get delays seen in executing a given simple command::
+
+  # ./getdelays -c ls /
+
+  bin   data1  data3  data5  dev  home  media  opt   root  srv        sys  usr
+  boot  data2  data4  data6  etc  lib   mnt    proc  sbin  subdomain  tmp  var
+
+
+  CPU      count     real total     virtual total     delay total
+           6         4000250        4000000           0
+  IO       count     delay total
+           0         0
+  SWAP     count     delay total
+           0         0
+  RECLAIM  count     delay total
+           0         0
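+
+The same statistics can be fetched programmatically. The sketch below is an
+illustrative, heavily trimmed alternative to getdelays.c: it assumes the
+libnl-3/libnl-genl-3 userspace libraries are available (getdelays.c itself
+uses raw netlink sockets) and omits almost all error handling::
+
+  /* Query per-pid delay statistics over the taskstats generic netlink
+   * family. Illustrative sketch; link with -lnl-3 -lnl-genl-3.
+   */
+  #include <stdio.h>
+  #include <stdlib.h>
+  #include <stdint.h>
+  #include <netlink/netlink.h>
+  #include <netlink/genl/genl.h>
+  #include <netlink/genl/ctrl.h>
+  #include <linux/taskstats.h>
+
+  static int show_delays(struct nl_msg *msg, void *arg)
+  {
+          struct nlattr *attrs[TASKSTATS_TYPE_MAX + 1];
+          struct nlattr *nested[TASKSTATS_TYPE_MAX + 1];
+          struct taskstats *ts;
+
+          /* The reply wraps the pid and the stats payload inside a
+           * TASKSTATS_TYPE_AGGR_PID attribute. */
+          genlmsg_parse(nlmsg_hdr(msg), 0, attrs, TASKSTATS_TYPE_MAX, NULL);
+          if (!attrs[TASKSTATS_TYPE_AGGR_PID])
+                  return NL_SKIP;
+          nla_parse_nested(nested, TASKSTATS_TYPE_MAX,
+                           attrs[TASKSTATS_TYPE_AGGR_PID], NULL);
+          if (!nested[TASKSTATS_TYPE_STATS])
+                  return NL_SKIP;
+
+          ts = nla_data(nested[TASKSTATS_TYPE_STATS]);
+          printf("cpu delay total:   %llu ns\n",
+                 (unsigned long long)ts->cpu_delay_total);
+          printf("blkio delay total: %llu ns\n",
+                 (unsigned long long)ts->blkio_delay_total);
+          return NL_OK;
+  }
+
+  int main(int argc, char *argv[])
+  {
+          struct nl_sock *sk = nl_socket_alloc();
+          struct nl_msg *msg = nlmsg_alloc();
+          int family;
+
+          if (argc < 2)
+                  return 1;
+
+          genl_connect(sk);
+          family = genl_ctrl_resolve(sk, TASKSTATS_GENL_NAME);
+
+          /* TASKSTATS_CMD_GET with CMD_ATTR_PID requests the per-pid
+           * statistics of a single task. */
+          genlmsg_put(msg, NL_AUTO_PORT, NL_AUTO_SEQ, family, 0, 0,
+                      TASKSTATS_CMD_GET, TASKSTATS_GENL_VERSION);
+          nla_put_u32(msg, TASKSTATS_CMD_ATTR_PID,
+                      (uint32_t)atoi(argv[1]));
+
+          nl_socket_modify_cb(sk, NL_CB_VALID, NL_CB_CUSTOM,
+                              show_delays, NULL);
+          nl_send_auto(sk, msg);
+          nl_recvmsgs_default(sk);
+
+          nlmsg_free(msg);
+          nl_socket_free(sk);
+          return 0;
+  }
+
+Note that querying another user's task may require appropriate privileges,
+and the exit records mentioned above are delivered separately to registered
+listeners; neither is covered by this sketch.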
diff --git a/Documentation/accounting/index.rst b/Documentation/accounting/index.rst
new file mode 100644
index 000000000..9369d8bf3
--- /dev/null
+++ b/Documentation/accounting/index.rst
@@ -0,0 +1,14 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==========
+Accounting
+==========
+
+.. toctree::
+   :maxdepth: 1
+
+   cgroupstats
+   delay-accounting
+   psi
+   taskstats
+   taskstats-struct

diff --git a/Documentation/accounting/psi.rst b/Documentation/accounting/psi.rst
new file mode 100644
index 000000000..860fe651d
--- /dev/null
+++ b/Documentation/accounting/psi.rst
@@ -0,0 +1,185 @@
+.. _psi:
+
+================================
+PSI - Pressure Stall Information
+================================
+
+:Date: April, 2018
+:Author: Johannes Weiner
+
+When CPU, memory or IO devices are contended, workloads experience
+latency spikes, throughput losses, and run the risk of OOM kills.
+
+Without an accurate measure of such contention, users are forced to
+either play it safe and under-utilize their hardware resources, or
+roll the dice and frequently suffer the disruptions resulting from
+excessive overcommit.
+
+The psi feature identifies and quantifies the disruptions caused by
+such resource crunches and the time impact they have on complex workloads
+or even entire systems.
+
+Having an accurate measure of productivity losses caused by resource
+scarcity aids users in sizing workloads to hardware--or provisioning
+hardware according to workload demand.
+
+As psi aggregates this information in realtime, systems can be managed
+dynamically using techniques such as load shedding, migrating jobs to
+other systems or data centers, or strategically pausing or killing low
+priority or restartable batch jobs.
+
+This allows maximizing hardware utilization without sacrificing
+workload health or risking major disruptions such as OOM kills.
+
+Pressure interface
+==================
+
+Pressure information for each resource is exported through the
+respective file in /proc/pressure/ -- cpu, memory, and io.
+
+The format for CPU is as follows::
+
+  some avg10=0.00 avg60=0.00 avg300=0.00 total=0
+
+and for memory and IO::
+
+  some avg10=0.00 avg60=0.00 avg300=0.00 total=0
+  full avg10=0.00 avg60=0.00 avg300=0.00 total=0
+
+The "some" line indicates the share of time in which at least some
+tasks are stalled on a given resource.
+
+The "full" line indicates the share of time in which all non-idle
+tasks are stalled on a given resource simultaneously. In this state
+actual CPU cycles are going to waste, and a workload that spends
+extended time in this state is considered to be thrashing. This has
+severe impact on performance, and it's useful to distinguish this
+situation from a state where some tasks are stalled but the CPU is
+still doing productive work. As such, time spent in this subset of the
+stall state is tracked separately and exported in the "full" averages.
+
+The ratios (in %) are tracked as recent trends over ten, sixty, and
+three hundred second windows, which gives insight into short term events
+as well as medium and long term trends. The total absolute stall time
+(in us) is tracked and exported as well, to allow detection of latency
+spikes which wouldn't necessarily make a dent in the time averages,
+or to average trends over custom time frames.
+
+Monitoring for pressure thresholds
+==================================
+
+Users can register triggers and use poll() to be woken up when resource
+pressure exceeds certain thresholds.
+
+A trigger describes the maximum cumulative stall time over a specific
+time window, e.g. 100ms of total stall time within any 500ms window to
+generate a wakeup event.
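+
+Before any triggers are involved, the pressure state can be observed by
+simply reading the files shown in the previous section. A minimal,
+illustrative reader for /proc/pressure/memory, assuming the two-line
+some/full format described above, could look like this::
+
+  /* Illustrative sketch: print the 10-second averages and the
+   * cumulative stall time from /proc/pressure/memory.
+   */
+  #include <stdio.h>
+
+  int main(void)
+  {
+          FILE *f = fopen("/proc/pressure/memory", "r");
+          char line[256];
+
+          if (!f)
+                  return 1;
+
+          while (fgets(line, sizeof(line), f)) {
+                  char kind[8];
+                  float avg10, avg60, avg300;
+                  unsigned long long total;
+
+                  if (sscanf(line,
+                             "%7s avg10=%f avg60=%f avg300=%f total=%llu",
+                             kind, &avg10, &avg60, &avg300, &total) == 5)
+                          printf("%s: avg10=%.2f%% total=%llu us\n",
+                                 kind, avg10, total);
+          }
+
+          fclose(f);
+          return 0;
+  }
+
+Sampling the averages like this is sufficient for coarse monitoring; the
+trigger mechanism described next avoids having to read the files repeatedly.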
+
+To register a trigger, a user has to open the psi interface file under
+/proc/pressure/ representing the resource to be monitored and write the
+desired threshold and time window. The open file descriptor should be
+used to wait for trigger events using select(), poll() or epoll().
+The following format is used::
+