1
0
Fork 0

Adding upstream version 6.12.33.

Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
This commit is contained in:
Daniel Baumann 2025-06-22 12:14:28 +02:00
parent 89eabb05c2
commit 79d69e5050
Signed by: daniel.baumann
GPG key ID: BCC918A2ABD66424
86698 changed files with 39662057 additions and 0 deletions

View file

@ -0,0 +1,804 @@
.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>
===============================================
``amd-pstate`` CPU Performance Scaling Driver
===============================================
:Copyright: |copy| 2021 Advanced Micro Devices, Inc.
:Author: Huang Rui <ray.huang@amd.com>
Introduction
===================
``amd-pstate`` is the AMD CPU performance scaling driver that introduces a
new CPU frequency control mechanism on modern AMD APU and CPU series in
Linux kernel. The new mechanism is based on Collaborative Processor
Performance Control (CPPC) which provides finer grain frequency management
than legacy ACPI hardware P-States. Current AMD CPU/APU platforms are using
the ACPI P-states driver to manage CPU frequency and clocks with switching
only in 3 P-states. CPPC replaces the ACPI P-states controls and allows a
flexible, low-latency interface for the Linux kernel to directly
communicate the performance hints to hardware.
``amd-pstate`` leverages the Linux kernel governors such as ``schedutil``,
``ondemand``, etc. to manage the performance hints which are provided by
CPPC hardware functionality that internally follows the hardware
specification (for details refer to AMD64 Architecture Programmer's Manual
Volume 2: System Programming [1]_). Currently, ``amd-pstate`` supports basic
frequency control function according to kernel governors on some of the
Zen2 and Zen3 processors, and we will implement more AMD specific functions
in future after we verify them on the hardware and SBIOS.
AMD CPPC Overview
=======================
Collaborative Processor Performance Control (CPPC) interface enumerates a
continuous, abstract, and unit-less performance value in a scale that is
not tied to a specific performance state / frequency. This is an ACPI
standard [2]_ which software can specify application performance goals and
hints as a relative target to the infrastructure limits. AMD processors
provide the low latency register model (MSR) instead of an AML code
interpreter for performance adjustments. ``amd-pstate`` will initialize a
``struct cpufreq_driver`` instance, ``amd_pstate_driver``, with the callbacks
to manage each performance update behavior. ::
Highest Perf ------>+-----------------------+ +-----------------------+
| | | |
| | | |
| | Max Perf ---->| |
| | | |
| | | |
Nominal Perf ------>+-----------------------+ +-----------------------+
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | Desired Perf ---->| |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
Lowest non- | | | |
linear perf ------>+-----------------------+ +-----------------------+
| | | |
| | Lowest perf ---->| |
| | | |
Lowest perf ------>+-----------------------+ +-----------------------+
| | | |
| | | |
| | | |
0 ------>+-----------------------+ +-----------------------+
AMD P-States Performance Scale
.. _perf_cap:
AMD CPPC Performance Capability
--------------------------------
Highest Performance (RO)
.........................
This is the absolute maximum performance an individual processor may reach,
assuming ideal conditions. This performance level may not be sustainable
for long durations and may only be achievable if other platform components
are in a specific state; for example, it may require other processors to be in
an idle state. This would be equivalent to the highest frequencies
supported by the processor.
Nominal (Guaranteed) Performance (RO)
......................................
This is the maximum sustained performance level of the processor, assuming
ideal operating conditions. In the absence of an external constraint (power,
thermal, etc.), this is the performance level the processor is expected to
be able to maintain continuously. All cores/processors are expected to be
able to sustain their nominal performance state simultaneously.
Lowest non-linear Performance (RO)
...................................
This is the lowest performance level at which nonlinear power savings are
achieved, for example, due to the combined effects of voltage and frequency
scaling. Above this threshold, lower performance levels should be generally
more energy efficient than higher performance levels. This register
effectively conveys the most efficient performance level to ``amd-pstate``.
Lowest Performance (RO)
........................
This is the absolute lowest performance level of the processor. Selecting a
performance level lower than the lowest nonlinear performance level may
cause an efficiency penalty but should reduce the instantaneous power
consumption of the processor.
AMD CPPC Performance Control
------------------------------
``amd-pstate`` passes performance goals through these registers. The
register drives the behavior of the desired performance target.
Minimum requested performance (RW)
...................................
``amd-pstate`` specifies the minimum allowed performance level.
Maximum requested performance (RW)
...................................
``amd-pstate`` specifies a limit the maximum performance that is expected
to be supplied by the hardware.
Desired performance target (RW)
...................................
``amd-pstate`` specifies a desired target in the CPPC performance scale as
a relative number. This can be expressed as percentage of nominal
performance (infrastructure max). Below the nominal sustained performance
level, desired performance expresses the average performance level of the
processor subject to hardware. Above the nominal performance level,
the processor must provide at least nominal performance requested and go higher
if current operating conditions allow.
Energy Performance Preference (EPP) (RW)
.........................................
This attribute provides a hint to the hardware if software wants to bias
toward performance (0x0) or energy efficiency (0xff).
Key Governors Support
=======================
``amd-pstate`` can be used with all the (generic) scaling governors listed
by the ``scaling_available_governors`` policy attribute in ``sysfs``. Then,
it is responsible for the configuration of policy objects corresponding to
CPUs and provides the ``CPUFreq`` core (and the scaling governors attached
to the policy objects) with accurate information on the maximum and minimum
operating frequencies supported by the hardware. Users can check the
``scaling_cur_freq`` information comes from the ``CPUFreq`` core.
``amd-pstate`` mainly supports ``schedutil`` and ``ondemand`` for dynamic
frequency control. It is to fine tune the processor configuration on
``amd-pstate`` to the ``schedutil`` with CPU CFS scheduler. ``amd-pstate``
registers the adjust_perf callback to implement performance update behavior
similar to CPPC. It is initialized by ``sugov_start`` and then populates the
CPU's update_util_data pointer to assign ``sugov_update_single_perf`` as the
utilization update callback function in the CPU scheduler. The CPU scheduler
will call ``cpufreq_update_util`` and assigns the target performance according
to the ``struct sugov_cpu`` that the utilization update belongs to.
Then, ``amd-pstate`` updates the desired performance according to the CPU
scheduler assigned.
.. _processor_support:
Processor Support
=======================
The ``amd-pstate`` initialization will fail if the ``_CPC`` entry in the ACPI
SBIOS does not exist in the detected processor. It uses ``acpi_cpc_valid``
to check the existence of ``_CPC``. All Zen based processors support the legacy
ACPI hardware P-States function, so when ``amd-pstate`` fails initialization,
the kernel will fall back to initialize the ``acpi-cpufreq`` driver.
There are two types of hardware implementations for ``amd-pstate``: one is
`Full MSR Support <perf_cap_>`_ and another is `Shared Memory Support
<perf_cap_>`_. It can use the :c:macro:`X86_FEATURE_CPPC` feature flag to
indicate the different types. (For details, refer to the Processor Programming
Reference (PPR) for AMD Family 19h Model 51h, Revision A1 Processors [3]_.)
``amd-pstate`` is to register different ``static_call`` instances for different
hardware implementations.
Currently, some of the Zen2 and Zen3 processors support ``amd-pstate``. In the
future, it will be supported on more and more AMD processors.
Full MSR Support
-----------------
Some new Zen3 processors such as Cezanne provide the MSR registers directly
while the :c:macro:`X86_FEATURE_CPPC` CPU feature flag is set.
``amd-pstate`` can handle the MSR register to implement the fast switch
function in ``CPUFreq`` that can reduce the latency of frequency control in
interrupt context. The functions with a ``pstate_xxx`` prefix represent the
operations on MSR registers.
Shared Memory Support
----------------------
If the :c:macro:`X86_FEATURE_CPPC` CPU feature flag is not set, the
processor supports the shared memory solution. In this case, ``amd-pstate``
uses the ``cppc_acpi`` helper methods to implement the callback functions
that are defined on ``static_call``. The functions with the ``cppc_xxx`` prefix
represent the operations of ACPI CPPC helpers for the shared memory solution.
AMD P-States and ACPI hardware P-States always can be supported in one
processor. But AMD P-States has the higher priority and if it is enabled
with :c:macro:`MSR_AMD_CPPC_ENABLE` or ``cppc_set_enable``, it will respond
to the request from AMD P-States.
User Space Interface in ``sysfs`` - Per-policy control
======================================================
``amd-pstate`` exposes several global attributes (files) in ``sysfs`` to
control its functionality at the system level. They are located in the
``/sys/devices/system/cpu/cpufreq/policyX/`` directory and affect all CPUs. ::
root@hr-test1:/home/ray# ls /sys/devices/system/cpu/cpufreq/policy0/*amd*
/sys/devices/system/cpu/cpufreq/policy0/amd_pstate_highest_perf
/sys/devices/system/cpu/cpufreq/policy0/amd_pstate_lowest_nonlinear_freq
/sys/devices/system/cpu/cpufreq/policy0/amd_pstate_max_freq
``amd_pstate_highest_perf / amd_pstate_max_freq``
Maximum CPPC performance and CPU frequency that the driver is allowed to
set, in percent of the maximum supported CPPC performance level (the highest
performance supported in `AMD CPPC Performance Capability <perf_cap_>`_).
In some ASICs, the highest CPPC performance is not the one in the ``_CPC``
table, so we need to expose it to sysfs. If boost is not active, but
still supported, this maximum frequency will be larger than the one in
``cpuinfo``. On systems that support preferred core, the driver will have
different values for some cores than others and this will reflect the values
advertised by the platform at bootup.
This attribute is read-only.
``amd_pstate_lowest_nonlinear_freq``
The lowest non-linear CPPC CPU frequency that the driver is allowed to set,
in percent of the maximum supported CPPC performance level. (Please see the
lowest non-linear performance in `AMD CPPC Performance Capability
<perf_cap_>`_.)
This attribute is read-only.
``amd_pstate_hw_prefcore``
Whether the platform supports the preferred core feature and it has been
enabled. This attribute is read-only.
``amd_pstate_prefcore_ranking``
The performance ranking of the core. This number doesn't have any unit, but
larger numbers are preferred at the time of reading. This can change at
runtime based on platform conditions. This attribute is read-only.
``energy_performance_available_preferences``
A list of all the supported EPP preferences that could be used for
``energy_performance_preference`` on this system.
These profiles represent different hints that are provided
to the low-level firmware about the user's desired energy vs efficiency
tradeoff. ``default`` represents the epp value is set by platform
firmware. This attribute is read-only.
``energy_performance_preference``
The current energy performance preference can be read from this attribute.
and user can change current preference according to energy or performance needs
Please get all support profiles list from
``energy_performance_available_preferences`` attribute, all the profiles are
integer values defined between 0 to 255 when EPP feature is enabled by platform
firmware, if EPP feature is disabled, driver will ignore the written value
This attribute is read-write.
``boost``
The `boost` sysfs attribute provides control over the CPU core
performance boost, allowing users to manage the maximum frequency limitation
of the CPU. This attribute can be used to enable or disable the boost feature
on individual CPUs.
When the boost feature is enabled, the CPU can dynamically increase its frequency
beyond the base frequency, providing enhanced performance for demanding workloads.
On the other hand, disabling the boost feature restricts the CPU to operate at the
base frequency, which may be desirable in certain scenarios to prioritize power
efficiency or manage temperature.
To manipulate the `boost` attribute, users can write a value of `0` to disable the
boost or `1` to enable it, for the respective CPU using the sysfs path
`/sys/devices/system/cpu/cpuX/cpufreq/boost`, where `X` represents the CPU number.
Other performance and frequency values can be read back from
``/sys/devices/system/cpu/cpuX/acpi_cppc/``, see :ref:`cppc_sysfs`.
``amd-pstate`` vs ``acpi-cpufreq``
======================================
On the majority of AMD platforms supported by ``acpi-cpufreq``, the ACPI tables
provided by the platform firmware are used for CPU performance scaling, but
only provide 3 P-states on AMD processors.
However, on modern AMD APU and CPU series, hardware provides the Collaborative
Processor Performance Control according to the ACPI protocol and customizes this
for AMD platforms. That is, fine-grained and continuous frequency ranges
instead of the legacy hardware P-states. ``amd-pstate`` is the kernel
module which supports the new AMD P-States mechanism on most of the future AMD
platforms. The AMD P-States mechanism is the more performance and energy
efficiency frequency management method on AMD processors.
``amd-pstate`` Driver Operation Modes
======================================
``amd_pstate`` CPPC has 3 operation modes: autonomous (active) mode,
non-autonomous (passive) mode and guided autonomous (guided) mode.
Active/passive/guided mode can be chosen by different kernel parameters.
- In autonomous mode, platform ignores the desired performance level request
and takes into account only the values set to the minimum, maximum and energy
performance preference registers.
- In non-autonomous mode, platform gets desired performance level
from OS directly through Desired Performance Register.
- In guided-autonomous mode, platform sets operating performance level
autonomously according to the current workload and within the limits set by
OS through min and max performance registers.
Active Mode
------------
``amd_pstate=active``
This is the low-level firmware control mode which is implemented by ``amd_pstate_epp``
driver with ``amd_pstate=active`` passed to the kernel in the command line.
In this mode, ``amd_pstate_epp`` driver provides a hint to the hardware if software
wants to bias toward performance (0x0) or energy efficiency (0xff) to the CPPC firmware.
then CPPC power algorithm will calculate the runtime workload and adjust the realtime
cores frequency according to the power supply and thermal, core voltage and some other
hardware conditions.
Passive Mode
------------
``amd_pstate=passive``
It will be enabled if the ``amd_pstate=passive`` is passed to the kernel in the command line.
In this mode, ``amd_pstate`` driver software specifies a desired QoS target in the CPPC
performance scale as a relative number. This can be expressed as percentage of nominal
performance (infrastructure max). Below the nominal sustained performance level,
desired performance expresses the average performance level of the processor subject
to the Performance Reduction Tolerance register. Above the nominal performance level,
processor must provide at least nominal performance requested and go higher if current
operating conditions allow.
Guided Mode
-----------
``amd_pstate=guided``
If ``amd_pstate=guided`` is passed to kernel command line option then this mode
is activated. In this mode, driver requests minimum and maximum performance
level and the platform autonomously selects a performance level in this range
and appropriate to the current workload.
``amd-pstate`` Preferred Core
=================================
The core frequency is subjected to the process variation in semiconductors.
Not all cores are able to reach the maximum frequency respecting the
infrastructure limits. Consequently, AMD has redefined the concept of
maximum frequency of a part. This means that a fraction of cores can reach
maximum frequency. To find the best process scheduling policy for a given
scenario, OS needs to know the core ordering informed by the platform through
highest performance capability register of the CPPC interface.
``amd-pstate`` preferred core enables the scheduler to prefer scheduling on
cores that can achieve a higher frequency with lower voltage. The preferred
core rankings can dynamically change based on the workload, platform conditions,
thermals and ageing.
The priority metric will be initialized by the ``amd-pstate`` driver. The ``amd-pstate``
driver will also determine whether or not ``amd-pstate`` preferred core is
supported by the platform.
``amd-pstate`` driver will provide an initial core ordering when the system boots.
The platform uses the CPPC interfaces to communicate the core ranking to the
operating system and scheduler to make sure that OS is choosing the cores
with highest performance firstly for scheduling the process. When ``amd-pstate``
driver receives a message with the highest performance change, it will
update the core ranking and set the cpu's priority.
``amd-pstate`` Preferred Core Switch
=====================================
Kernel Parameters
-----------------
``amd-pstate`` peferred core`` has two states: enable and disable.
Enable/disable states can be chosen by different kernel parameters.
Default enable ``amd-pstate`` preferred core.
``amd_prefcore=disable``
For systems that support ``amd-pstate`` preferred core, the core rankings will
always be advertised by the platform. But OS can choose to ignore that via the
kernel parameter ``amd_prefcore=disable``.
User Space Interface in ``sysfs`` - General
===========================================
Global Attributes
-----------------
``amd-pstate`` exposes several global attributes (files) in ``sysfs`` to
control its functionality at the system level. They are located in the
``/sys/devices/system/cpu/amd_pstate/`` directory and affect all CPUs.
``status``
Operation mode of the driver: "active", "passive", "guided" or "disable".
"active"
The driver is functional and in the ``active mode``
"passive"
The driver is functional and in the ``passive mode``
"guided"
The driver is functional and in the ``guided mode``
"disable"
The driver is unregistered and not functional now.
This attribute can be written to in order to change the driver's
operation mode or to unregister it. The string written to it must be
one of the possible values of it and, if successful, writing one of
these values to the sysfs file will cause the driver to switch over
to the operation mode represented by that string - or to be
unregistered in the "disable" case.
``prefcore``
Preferred core state of the driver: "enabled" or "disabled".
"enabled"
Enable the ``amd-pstate`` preferred core.
"disabled"
Disable the ``amd-pstate`` preferred core
This attribute is read-only to check the state of preferred core set
by the kernel parameter.
``cpupower`` tool support for ``amd-pstate``
===============================================
``amd-pstate`` is supported by the ``cpupower`` tool, which can be used to dump
frequency information. Development is in progress to support more and more
operations for the new ``amd-pstate`` module with this tool. ::
root@hr-test1:/home/ray# cpupower frequency-info
analyzing CPU 0:
driver: amd-pstate
CPUs which run at the same hardware frequency: 0
CPUs which need to have their frequency coordinated by software: 0
maximum transition latency: 131 us
hardware limits: 400 MHz - 4.68 GHz
available cpufreq governors: ondemand conservative powersave userspace performance schedutil
current policy: frequency should be within 400 MHz and 4.68 GHz.
The governor "schedutil" may decide which speed to use
within this range.
current CPU frequency: Unable to call hardware
current CPU frequency: 4.02 GHz (asserted by call to kernel)
boost state support:
Supported: yes
Active: yes
AMD PSTATE Highest Performance: 166. Maximum Frequency: 4.68 GHz.
AMD PSTATE Nominal Performance: 117. Nominal Frequency: 3.30 GHz.
AMD PSTATE Lowest Non-linear Performance: 39. Lowest Non-linear Frequency: 1.10 GHz.
AMD PSTATE Lowest Performance: 15. Lowest Frequency: 400 MHz.
Diagnostics and Tuning
=======================
Trace Events
--------------
There are two static trace events that can be used for ``amd-pstate``
diagnostics. One of them is the ``cpu_frequency`` trace event generally used
by ``CPUFreq``, and the other one is the ``amd_pstate_perf`` trace event
specific to ``amd-pstate``. The following sequence of shell commands can
be used to enable them and see their output (if the kernel is
configured to support event tracing). ::
root@hr-test1:/home/ray# cd /sys/kernel/tracing/
root@hr-test1:/sys/kernel/tracing# echo 1 > events/amd_cpu/enable
root@hr-test1:/sys/kernel/tracing# cat trace
# tracer: nop
#
# entries-in-buffer/entries-written: 47827/42233061 #P:2
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
<idle>-0 [015] dN... 4995.979886: amd_pstate_perf: amd_min_perf=85 amd_des_perf=85 amd_max_perf=166 cpu_id=15 changed=false fast_switch=true
<idle>-0 [007] d.h.. 4995.979893: amd_pstate_perf: amd_min_perf=85 amd_des_perf=85 amd_max_perf=166 cpu_id=7 changed=false fast_switch=true
cat-2161 [000] d.... 4995.980841: amd_pstate_perf: amd_min_perf=85 amd_des_perf=85 amd_max_perf=166 cpu_id=0 changed=false fast_switch=true
sshd-2125 [004] d.s.. 4995.980968: amd_pstate_perf: amd_min_perf=85 amd_des_perf=85 amd_max_perf=166 cpu_id=4 changed=false fast_switch=true
<idle>-0 [007] d.s.. 4995.980968: amd_pstate_perf: amd_min_perf=85 amd_des_perf=85 amd_max_perf=166 cpu_id=7 changed=false fast_switch=true
<idle>-0 [003] d.s.. 4995.980971: amd_pstate_perf: amd_min_perf=85 amd_des_perf=85 amd_max_perf=166 cpu_id=3 changed=false fast_switch=true
<idle>-0 [011] d.s.. 4995.980996: amd_pstate_perf: amd_min_perf=85 amd_des_perf=85 amd_max_perf=166 cpu_id=11 changed=false fast_switch=true
The ``cpu_frequency`` trace event will be triggered either by the ``schedutil`` scaling
governor (for the policies it is attached to), or by the ``CPUFreq`` core (for the
policies with other scaling governors).
Tracer Tool
-------------
``amd_pstate_tracer.py`` can record and parse ``amd-pstate`` trace log, then
generate performance plots. This utility can be used to debug and tune the
performance of ``amd-pstate`` driver. The tracer tool needs to import intel
pstate tracer.
Tracer tool located in ``linux/tools/power/x86/amd_pstate_tracer``. It can be
used in two ways. If trace file is available, then directly parse the file
with command ::
./amd_pstate_trace.py [-c cpus] -t <trace_file> -n <test_name>
Or generate trace file with root privilege, then parse and plot with command ::
sudo ./amd_pstate_trace.py [-c cpus] -n <test_name> -i <interval> [-m kbytes]
The test result can be found in ``results/test_name``. Following is the example
about part of the output. ::
common_cpu common_secs common_usecs min_perf des_perf max_perf freq mperf apef tsc load duration_ms sample_num elapsed_time common_comm
CPU_005 712 116384 39 49 166 0.7565 9645075 2214891 38431470 25.1 11.646 469 2.496 kworker/5:0-40
CPU_006 712 116408 39 49 166 0.6769 8950227 1839034 37192089 24.06 11.272 470 2.496 kworker/6:0-1264
Unit Tests for amd-pstate
-------------------------
``amd-pstate-ut`` is a test module for testing the ``amd-pstate`` driver.
* It can help all users to verify their processor support (SBIOS/Firmware or Hardware).
* Kernel can have a basic function test to avoid the kernel regression during the update.
* We can introduce more functional or performance tests to align the result together, it will benefit power and performance scale optimization.
1. Test case descriptions
1). Basic tests
Test prerequisite and basic functions for the ``amd-pstate`` driver.
+---------+--------------------------------+------------------------------------------------------------------------------------+
| Index | Functions | Description |
+=========+================================+====================================================================================+
| 1 | amd_pstate_ut_acpi_cpc_valid || Check whether the _CPC object is present in SBIOS. |
| | || |
| | || The detail refer to `Processor Support <processor_support_>`_. |
+---------+--------------------------------+------------------------------------------------------------------------------------+
| 2 | amd_pstate_ut_check_enabled || Check whether AMD P-State is enabled. |
| | || |
| | || AMD P-States and ACPI hardware P-States always can be supported in one processor. |
| | | But AMD P-States has the higher priority and if it is enabled with |
| | | :c:macro:`MSR_AMD_CPPC_ENABLE` or ``cppc_set_enable``, it will respond to the |
| | | request from AMD P-States. |
+---------+--------------------------------+------------------------------------------------------------------------------------+
| 3 | amd_pstate_ut_check_perf || Check if the each performance values are reasonable. |
| | || highest_perf >= nominal_perf > lowest_nonlinear_perf > lowest_perf > 0. |
+---------+--------------------------------+------------------------------------------------------------------------------------+
| 4 | amd_pstate_ut_check_freq || Check if the each frequency values and max freq when set support boost mode |
| | | are reasonable. |
| | || max_freq >= nominal_freq > lowest_nonlinear_freq > min_freq > 0 |
| | || If boost is not active but supported, this maximum frequency will be larger than |
| | | the one in ``cpuinfo``. |
+---------+--------------------------------+------------------------------------------------------------------------------------+
2). Tbench test
Test and monitor the cpu changes when running tbench benchmark under the specified governor.
These changes include desire performance, frequency, load, performance, energy etc.
The specified governor is ondemand or schedutil.
Tbench can also be tested on the ``acpi-cpufreq`` kernel driver for comparison.
3). Gitsource test
Test and monitor the cpu changes when running gitsource benchmark under the specified governor.
These changes include desire performance, frequency, load, time, energy etc.
The specified governor is ondemand or schedutil.
Gitsource can also be tested on the ``acpi-cpufreq`` kernel driver for comparison.
#. How to execute the tests
We use test module in the kselftest frameworks to implement it.
We create ``amd-pstate-ut`` module and tie it into kselftest.(for
details refer to Linux Kernel Selftests [4]_).
1). Build
+ open the :c:macro:`CONFIG_X86_AMD_PSTATE` configuration option.
+ set the :c:macro:`CONFIG_X86_AMD_PSTATE_UT` configuration option to M.
+ make project
+ make selftest ::
$ cd linux
$ make -C tools/testing/selftests
+ make perf ::
$ cd tools/perf/
$ make
2). Installation & Steps ::
$ make -C tools/testing/selftests install INSTALL_PATH=~/kselftest
$ cp tools/perf/perf /usr/bin/perf
$ sudo ./kselftest/run_kselftest.sh -c amd-pstate
3). Specified test case ::
$ cd ~/kselftest/amd-pstate
$ sudo ./run.sh -t basic
$ sudo ./run.sh -t tbench
$ sudo ./run.sh -t tbench -m acpi-cpufreq
$ sudo ./run.sh -t gitsource
$ sudo ./run.sh -t gitsource -m acpi-cpufreq
$ ./run.sh --help
./run.sh: illegal option -- -
Usage: ./run.sh [OPTION...]
[-h <help>]
[-o <output-file-for-dump>]
[-c <all: All testing,
basic: Basic testing,
tbench: Tbench testing,
gitsource: Gitsource testing.>]
[-t <tbench time limit>]
[-p <tbench process number>]
[-l <loop times for tbench>]
[-i <amd tracer interval>]
[-m <comparative test: acpi-cpufreq>]
4). Results
+ basic
When you finish test, you will get the following log info ::
$ dmesg | grep "amd_pstate_ut" | tee log.txt
[12977.570663] amd_pstate_ut: 1 amd_pstate_ut_acpi_cpc_valid success!
[12977.570673] amd_pstate_ut: 2 amd_pstate_ut_check_enabled success!
[12977.571207] amd_pstate_ut: 3 amd_pstate_ut_check_perf success!
[12977.571212] amd_pstate_ut: 4 amd_pstate_ut_check_freq success!
+ tbench
When you finish test, you will get selftest.tbench.csv and png images.
The selftest.tbench.csv file contains the raw data and the drop of the comparative test.
The png images shows the performance, energy and performan per watt of each test.
Open selftest.tbench.csv :
+-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ Governor | Round | Des-perf | Freq | Load | Performance | Energy | Performance Per Watt |
+-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ Unit | | | GHz | | MB/s | J | MB/J |
+=================================================+==============+==========+=========+==========+=============+=========+======================+
+ amd-pstate-ondemand | 1 | | | | 2504.05 | 1563.67 | 158.5378 |
+-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ amd-pstate-ondemand | 2 | | | | 2243.64 | 1430.32 | 155.2941 |
+-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ amd-pstate-ondemand | 3 | | | | 2183.88 | 1401.32 | 154.2860 |
+-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ amd-pstate-ondemand | Average | | | | 2310.52 | 1465.1 | 156.1268 |
+-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ amd-pstate-schedutil | 1 | 165.329 | 1.62257 | 99.798 | 2136.54 | 1395.26 | 151.5971 |
+-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ amd-pstate-schedutil | 2 | 166 | 1.49761 | 99.9993 | 2100.56 | 1380.5 | 150.6377 |
+-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ amd-pstate-schedutil | 3 | 166 | 1.47806 | 99.9993 | 2084.12 | 1375.76 | 149.9737 |
+-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ amd-pstate-schedutil | Average | 165.776 | 1.53275 | 99.9322 | 2107.07 | 1383.84 | 150.7399 |
+-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ acpi-cpufreq-ondemand | 1 | | | | 2529.9 | 1564.4 | 160.0997 |
+-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ acpi-cpufreq-ondemand | 2 | | | | 2249.76 | 1432.97 | 155.4297 |
+-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ acpi-cpufreq-ondemand | 3 | | | | 2181.46 | 1406.88 | 153.5060 |
+-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ acpi-cpufreq-ondemand | Average | | | | 2320.37 | 1468.08 | 156.4741 |
+-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ acpi-cpufreq-schedutil | 1 | | | | 2137.64 | 1385.24 | 152.7723 |
+-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ acpi-cpufreq-schedutil | 2 | | | | 2107.05 | 1372.23 | 152.0138 |
+-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ acpi-cpufreq-schedutil | 3 | | | | 2085.86 | 1365.35 | 151.2433 |
+-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ acpi-cpufreq-schedutil | Average | | | | 2110.18 | 1374.27 | 152.0136 |
+-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ acpi-cpufreq-ondemand VS acpi-cpufreq-schedutil | Comprison(%) | | | | -9.0584 | -6.3899 | -2.8506 |
+-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ amd-pstate-ondemand VS amd-pstate-schedutil | Comprison(%) | | | | 8.8053 | -5.5463 | -3.4503 |
+-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ acpi-cpufreq-ondemand VS amd-pstate-ondemand | Comprison(%) | | | | -0.4245 | -0.2029 | -0.2219 |
+-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ acpi-cpufreq-schedutil VS amd-pstate-schedutil | Comprison(%) | | | | -0.1473 | 0.6963 | -0.8378 |
+-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ gitsource
When you finish test, you will get selftest.gitsource.csv and png images.
The selftest.gitsource.csv file contains the raw data and the drop of the comparative test.
The png images shows the performance, energy and performan per watt of each test.
Open selftest.gitsource.csv :
+-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ Governor | Round | Des-perf | Freq | Load | Time | Energy | Performance Per Watt |
+-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ Unit | | | GHz | | s | J | 1/J |
+=================================================+==============+==========+==========+==========+=============+=========+======================+
+ amd-pstate-ondemand | 1 | 50.119 | 2.10509 | 23.3076 | 475.69 | 865.78 | 0.001155027 |
+-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ amd-pstate-ondemand | 2 | 94.8006 | 1.98771 | 56.6533 | 467.1 | 839.67 | 0.001190944 |
+-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ amd-pstate-ondemand | 3 | 76.6091 | 2.53251 | 43.7791 | 467.69 | 855.85 | 0.001168429 |
+-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ amd-pstate-ondemand | Average | 73.8429 | 2.20844 | 41.2467 | 470.16 | 853.767 | 0.001171279 |
+-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ amd-pstate-schedutil | 1 | 165.919 | 1.62319 | 98.3868 | 464.17 | 866.8 | 0.001153668 |
+-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ amd-pstate-schedutil | 2 | 165.97 | 1.31309 | 99.5712 | 480.15 | 880.4 | 0.001135847 |
+-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ amd-pstate-schedutil | 3 | 165.973 | 1.28448 | 99.9252 | 481.79 | 867.02 | 0.001153375 |
+-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ amd-pstate-schedutil | Average | 165.954 | 1.40692 | 99.2944 | 475.37 | 871.407 | 0.001147569 |
+-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ acpi-cpufreq-ondemand | 1 | | | | 2379.62 | 742.96 | 0.001345967 |
+-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ acpi-cpufreq-ondemand | 2 | | | | 441.74 | 817.49 | 0.001223256 |
+-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ acpi-cpufreq-ondemand | 3 | | | | 455.48 | 820.01 | 0.001219497 |
+-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ acpi-cpufreq-ondemand | Average | | | | 425.613 | 793.487 | 0.001260260 |
+-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ acpi-cpufreq-schedutil | 1 | | | | 459.69 | 838.54 | 0.001192548 |
+-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ acpi-cpufreq-schedutil | 2 | | | | 466.55 | 830.89 | 0.001203528 |
+-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ acpi-cpufreq-schedutil | 3 | | | | 470.38 | 837.32 | 0.001194286 |
+-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ acpi-cpufreq-schedutil | Average | | | | 465.54 | 835.583 | 0.001196769 |
+-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ acpi-cpufreq-ondemand VS acpi-cpufreq-schedutil | Comprison(%) | | | | 9.3810 | 5.3051 | -5.0379 |
+-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ amd-pstate-ondemand VS amd-pstate-schedutil | Comprison(%) | 124.7392 | -36.2934 | 140.7329 | 1.1081 | 2.0661 | -2.0242 |
+-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ acpi-cpufreq-ondemand VS amd-pstate-ondemand | Comprison(%) | | | | 10.4665 | 7.5968 | -7.0605 |
+-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ acpi-cpufreq-schedutil VS amd-pstate-schedutil | Comprison(%) | | | | 2.1115 | 4.2873 | -4.1110 |
+-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
Reference
===========
.. [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming,
https://www.amd.com/system/files/TechDocs/24593.pdf
.. [2] Advanced Configuration and Power Interface Specification,
https://uefi.org/sites/default/files/resources/ACPI_Spec_6_4_Jan22.pdf
.. [3] Processor Programming Reference (PPR) for AMD Family 19h Model 51h, Revision A1 Processors
https://www.amd.com/system/files/TechDocs/56569-A1-PUB.zip
.. [4] Linux Kernel Selftests,
https://www.kernel.org/doc/html/latest/dev-tools/kselftest.html

View file

@ -0,0 +1,712 @@
.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>
.. |intel_pstate| replace:: :doc:`intel_pstate <intel_pstate>`
=======================
CPU Performance Scaling
=======================
:Copyright: |copy| 2017 Intel Corporation
:Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The Concept of CPU Performance Scaling
======================================
The majority of modern processors are capable of operating in a number of
different clock frequency and voltage configurations, often referred to as
Operating Performance Points or P-states (in ACPI terminology). As a rule,
the higher the clock frequency and the higher the voltage, the more instructions
can be retired by the CPU over a unit of time, but also the higher the clock
frequency and the higher the voltage, the more energy is consumed over a unit of
time (or the more power is drawn) by the CPU in the given P-state. Therefore
there is a natural tradeoff between the CPU capacity (the number of instructions
that can be executed over a unit of time) and the power drawn by the CPU.
In some situations it is desirable or even necessary to run the program as fast
as possible and then there is no reason to use any P-states different from the
highest one (i.e. the highest-performance frequency/voltage configuration
available). In some other cases, however, it may not be necessary to execute
instructions so quickly and maintaining the highest available CPU capacity for a
relatively long time without utilizing it entirely may be regarded as wasteful.
It also may not be physically possible to maintain maximum CPU capacity for too
long for thermal or power supply capacity reasons or similar. To cover those
cases, there are hardware interfaces allowing CPUs to be switched between
different frequency/voltage configurations or (in the ACPI terminology) to be
put into different P-states.
Typically, they are used along with algorithms to estimate the required CPU
capacity, so as to decide which P-states to put the CPUs into. Of course, since
the utilization of the system generally changes over time, that has to be done
repeatedly on a regular basis. The activity by which this happens is referred
to as CPU performance scaling or CPU frequency scaling (because it involves
adjusting the CPU clock frequency).
CPU Performance Scaling in Linux
================================
The Linux kernel supports CPU performance scaling by means of the ``CPUFreq``
(CPU Frequency scaling) subsystem that consists of three layers of code: the
core, scaling governors and scaling drivers.
The ``CPUFreq`` core provides the common code infrastructure and user space
interfaces for all platforms that support CPU performance scaling. It defines
the basic framework in which the other components operate.
Scaling governors implement algorithms to estimate the required CPU capacity.
As a rule, each governor implements one, possibly parametrized, scaling
algorithm.
Scaling drivers talk to the hardware. They provide scaling governors with
information on the available P-states (or P-state ranges in some cases) and
access platform-specific hardware interfaces to change CPU P-states as requested
by scaling governors.
In principle, all available scaling governors can be used with every scaling
driver. That design is based on the observation that the information used by
performance scaling algorithms for P-state selection can be represented in a
platform-independent form in the majority of cases, so it should be possible
to use the same performance scaling algorithm implemented in exactly the same
way regardless of which scaling driver is used. Consequently, the same set of
scaling governors should be suitable for every supported platform.
However, that observation may not hold for performance scaling algorithms
based on information provided by the hardware itself, for example through
feedback registers, as that information is typically specific to the hardware
interface it comes from and may not be easily represented in an abstract,
platform-independent way. For this reason, ``CPUFreq`` allows scaling drivers
to bypass the governor layer and implement their own performance scaling
algorithms. That is done by the |intel_pstate| scaling driver.
``CPUFreq`` Policy Objects
==========================
In some cases the hardware interface for P-state control is shared by multiple
CPUs. That is, for example, the same register (or set of registers) is used to
control the P-state of multiple CPUs at the same time and writing to it affects
all of those CPUs simultaneously.
Sets of CPUs sharing hardware P-state control interfaces are represented by
``CPUFreq`` as struct cpufreq_policy objects. For consistency,
struct cpufreq_policy is also used when there is only one CPU in the given
set.
The ``CPUFreq`` core maintains a pointer to a struct cpufreq_policy object for
every CPU in the system, including CPUs that are currently offline. If multiple
CPUs share the same hardware P-state control interface, all of the pointers
corresponding to them point to the same struct cpufreq_policy object.
``CPUFreq`` uses struct cpufreq_policy as its basic data type and the design
of its user space interface is based on the policy concept.
CPU Initialization
==================
First of all, a scaling driver has to be registered for ``CPUFreq`` to work.
It is only possible to register one scaling driver at a time, so the scaling
driver is expected to be able to handle all CPUs in the system.
The scaling driver may be registered before or after CPU registration. If
CPUs are registered earlier, the driver core invokes the ``CPUFreq`` core to
take a note of all of the already registered CPUs during the registration of the
scaling driver. In turn, if any CPUs are registered after the registration of
the scaling driver, the ``CPUFreq`` core will be invoked to take note of them
at their registration time.
In any case, the ``CPUFreq`` core is invoked to take note of any logical CPU it
has not seen so far as soon as it is ready to handle that CPU. [Note that the
logical CPU may be a physical single-core processor, or a single core in a
multicore processor, or a hardware thread in a physical processor or processor
core. In what follows "CPU" always means "logical CPU" unless explicitly stated
otherwise and the word "processor" is used to refer to the physical part
possibly including multiple logical CPUs.]
Once invoked, the ``CPUFreq`` core checks if the policy pointer is already set
for the given CPU and if so, it skips the policy object creation. Otherwise,
a new policy object is created and initialized, which involves the creation of
a new policy directory in ``sysfs``, and the policy pointer corresponding to
the given CPU is set to the new policy object's address in memory.
Next, the scaling driver's ``->init()`` callback is invoked with the policy
pointer of the new CPU passed to it as the argument. That callback is expected
to initialize the performance scaling hardware interface for the given CPU (or,
more precisely, for the set of CPUs sharing the hardware interface it belongs
to, represented by its policy object) and, if the policy object it has been
called for is new, to set parameters of the policy, like the minimum and maximum
frequencies supported by the hardware, the table of available frequencies (if
the set of supported P-states is not a continuous range), and the mask of CPUs
that belong to the same policy (including both online and offline CPUs). That
mask is then used by the core to populate the policy pointers for all of the
CPUs in it.
The next major initialization step for a new policy object is to attach a
scaling governor to it (to begin with, that is the default scaling governor
determined by the kernel command line or configuration, but it may be changed
later via ``sysfs``). First, a pointer to the new policy object is passed to
the governor's ``->init()`` callback which is expected to initialize all of the
data structures necessary to handle the given policy and, possibly, to add
a governor ``sysfs`` interface to it. Next, the governor is started by
invoking its ``->start()`` callback.
That callback is expected to register per-CPU utilization update callbacks for
all of the online CPUs belonging to the given policy with the CPU scheduler.
The utilization update callbacks will be invoked by the CPU scheduler on
important events, like task enqueue and dequeue, on every iteration of the
scheduler tick or generally whenever the CPU utilization may change (from the
scheduler's perspective). They are expected to carry out computations needed
to determine the P-state to use for the given policy going forward and to
invoke the scaling driver to make changes to the hardware in accordance with
the P-state selection. The scaling driver may be invoked directly from
scheduler context or asynchronously, via a kernel thread or workqueue, depending
on the configuration and capabilities of the scaling driver and the governor.
Similar steps are taken for policy objects that are not new, but were "inactive"
previously, meaning that all of the CPUs belonging to them were offline. The
only practical difference in that case is that the ``CPUFreq`` core will attempt
to use the scaling governor previously used with the policy that became
"inactive" (and is re-initialized now) instead of the default governor.
In turn, if a previously offline CPU is being brought back online, but some
other CPUs sharing the policy object with it are online already, there is no
need to re-initialize the policy object at all. In that case, it only is
necessary to restart the scaling governor so that it can take the new online CPU
into account. That is achieved by invoking the governor's ``->stop`` and
``->start()`` callbacks, in this order, for the entire policy.
As mentioned before, the |intel_pstate| scaling driver bypasses the scaling
governor layer of ``CPUFreq`` and provides its own P-state selection algorithms.
Consequently, if |intel_pstate| is used, scaling governors are not attached to
new policy objects. Instead, the driver's ``->setpolicy()`` callback is invoked
to register per-CPU utilization update callbacks for each policy. These
callbacks are invoked by the CPU scheduler in the same way as for scaling
governors, but in the |intel_pstate| case they both determine the P-state to
use and change the hardware configuration accordingly in one go from scheduler
context.
The policy objects created during CPU initialization and other data structures
associated with them are torn down when the scaling driver is unregistered
(which happens when the kernel module containing it is unloaded, for example) or
when the last CPU belonging to the given policy in unregistered.
Policy Interface in ``sysfs``
=============================
During the initialization of the kernel, the ``CPUFreq`` core creates a
``sysfs`` directory (kobject) called ``cpufreq`` under
:file:`/sys/devices/system/cpu/`.
That directory contains a ``policyX`` subdirectory (where ``X`` represents an
integer number) for every policy object maintained by the ``CPUFreq`` core.
Each ``policyX`` directory is pointed to by ``cpufreq`` symbolic links
under :file:`/sys/devices/system/cpu/cpuY/` (where ``Y`` represents an integer
that may be different from the one represented by ``X``) for all of the CPUs
associated with (or belonging to) the given policy. The ``policyX`` directories
in :file:`/sys/devices/system/cpu/cpufreq` each contain policy-specific
attributes (files) to control ``CPUFreq`` behavior for the corresponding policy
objects (that is, for all of the CPUs associated with them).
Some of those attributes are generic. They are created by the ``CPUFreq`` core
and their behavior generally does not depend on what scaling driver is in use
and what scaling governor is attached to the given policy. Some scaling drivers
also add driver-specific attributes to the policy directories in ``sysfs`` to
control policy-specific aspects of driver behavior.
The generic attributes under :file:`/sys/devices/system/cpu/cpufreq/policyX/`
are the following:
``affected_cpus``
List of online CPUs belonging to this policy (i.e. sharing the hardware
performance scaling interface represented by the ``policyX`` policy
object).
``bios_limit``
If the platform firmware (BIOS) tells the OS to apply an upper limit to
CPU frequencies, that limit will be reported through this attribute (if
present).
The existence of the limit may be a result of some (often unintentional)
BIOS settings, restrictions coming from a service processor or another
BIOS/HW-based mechanisms.
This does not cover ACPI thermal limitations which can be discovered
through a generic thermal driver.
This attribute is not present if the scaling driver in use does not
support it.
``cpuinfo_cur_freq``
Current frequency of the CPUs belonging to this policy as obtained from
the hardware (in KHz).
This is expected to be the frequency the hardware actually runs at.
If that frequency cannot be determined, this attribute should not
be present.
``cpuinfo_max_freq``
Maximum possible operating frequency the CPUs belonging to this policy
can run at (in kHz).
``cpuinfo_min_freq``
Minimum possible operating frequency the CPUs belonging to this policy
can run at (in kHz).
``cpuinfo_transition_latency``
The time it takes to switch the CPUs belonging to this policy from one
P-state to another, in nanoseconds.
If unknown or if known to be so high that the scaling driver does not
work with the `ondemand`_ governor, -1 (:c:macro:`CPUFREQ_ETERNAL`)
will be returned by reads from this attribute.
``related_cpus``
List of all (online and offline) CPUs belonging to this policy.
``scaling_available_frequencies``
List of available frequencies of the CPUs belonging to this policy
(in kHz).
``scaling_available_governors``
List of ``CPUFreq`` scaling governors present in the kernel that can
be attached to this policy or (if the |intel_pstate| scaling driver is
in use) list of scaling algorithms provided by the driver that can be
applied to this policy.
[Note that some governors are modular and it may be necessary to load a
kernel module for the governor held by it to become available and be
listed by this attribute.]
``scaling_cur_freq``
Current frequency of all of the CPUs belonging to this policy (in kHz).
In the majority of cases, this is the frequency of the last P-state
requested by the scaling driver from the hardware using the scaling
interface provided by it, which may or may not reflect the frequency
the CPU is actually running at (due to hardware design and other
limitations).
Some architectures (e.g. ``x86``) may attempt to provide information
more precisely reflecting the current CPU frequency through this
attribute, but that still may not be the exact current CPU frequency as
seen by the hardware at the moment.
``scaling_driver``
The scaling driver currently in use.
``scaling_governor``
The scaling governor currently attached to this policy or (if the
|intel_pstate| scaling driver is in use) the scaling algorithm
provided by the driver that is currently applied to this policy.
This attribute is read-write and writing to it will cause a new scaling
governor to be attached to this policy or a new scaling algorithm
provided by the scaling driver to be applied to it (in the
|intel_pstate| case), as indicated by the string written to this
attribute (which must be one of the names listed by the
``scaling_available_governors`` attribute described above).
``scaling_max_freq``
Maximum frequency the CPUs belonging to this policy are allowed to be
running at (in kHz).
This attribute is read-write and writing a string representing an
integer to it will cause a new limit to be set (it must not be lower
than the value of the ``scaling_min_freq`` attribute).
``scaling_min_freq``
Minimum frequency the CPUs belonging to this policy are allowed to be
running at (in kHz).
This attribute is read-write and writing a string representing a
non-negative integer to it will cause a new limit to be set (it must not
be higher than the value of the ``scaling_max_freq`` attribute).
``scaling_setspeed``
This attribute is functional only if the `userspace`_ scaling governor
is attached to the given policy.
It returns the last frequency requested by the governor (in kHz) or can
be written to in order to set a new frequency for the policy.
Generic Scaling Governors
=========================
``CPUFreq`` provides generic scaling governors that can be used with all
scaling drivers. As stated before, each of them implements a single, possibly
parametrized, performance scaling algorithm.
Scaling governors are attached to policy objects and different policy objects
can be handled by different scaling governors at the same time (although that
may lead to suboptimal results in some cases).
The scaling governor for a given policy object can be changed at any time with
the help of the ``scaling_governor`` policy attribute in ``sysfs``.
Some governors expose ``sysfs`` attributes to control or fine-tune the scaling
algorithms implemented by them. Those attributes, referred to as governor
tunables, can be either global (system-wide) or per-policy, depending on the
scaling driver in use. If the driver requires governor tunables to be
per-policy, they are located in a subdirectory of each policy directory.
Otherwise, they are located in a subdirectory under
:file:`/sys/devices/system/cpu/cpufreq/`. In either case the name of the
subdirectory containing the governor tunables is the name of the governor
providing them.
``performance``
---------------
When attached to a policy object, this governor causes the highest frequency,
within the ``scaling_max_freq`` policy limit, to be requested for that policy.
The request is made once at that time the governor for the policy is set to
``performance`` and whenever the ``scaling_max_freq`` or ``scaling_min_freq``
policy limits change after that.
``powersave``
-------------
When attached to a policy object, this governor causes the lowest frequency,
within the ``scaling_min_freq`` policy limit, to be requested for that policy.
The request is made once at that time the governor for the policy is set to
``powersave`` and whenever the ``scaling_max_freq`` or ``scaling_min_freq``
policy limits change after that.
``userspace``
-------------
This governor does not do anything by itself. Instead, it allows user space
to set the CPU frequency for the policy it is attached to by writing to the
``scaling_setspeed`` attribute of that policy.
``schedutil``
-------------
This governor uses CPU utilization data available from the CPU scheduler. It
generally is regarded as a part of the CPU scheduler, so it can access the
scheduler's internal data structures directly.
It runs entirely in scheduler context, although in some cases it may need to
invoke the scaling driver asynchronously when it decides that the CPU frequency
should be changed for a given policy (that depends on whether or not the driver
is capable of changing the CPU frequency from scheduler context).
The actions of this governor for a particular CPU depend on the scheduling class
invoking its utilization update callback for that CPU. If it is invoked by the
RT or deadline scheduling classes, the governor will increase the frequency to
the allowed maximum (that is, the ``scaling_max_freq`` policy limit). In turn,
if it is invoked by the CFS scheduling class, the governor will use the
Per-Entity Load Tracking (PELT) metric for the root control group of the
given CPU as the CPU utilization estimate (see the *Per-entity load tracking*
LWN.net article [1]_ for a description of the PELT mechanism). Then, the new
CPU frequency to apply is computed in accordance with the formula
f = 1.25 * ``f_0`` * ``util`` / ``max``
where ``util`` is the PELT number, ``max`` is the theoretical maximum of
``util``, and ``f_0`` is either the maximum possible CPU frequency for the given
policy (if the PELT number is frequency-invariant), or the current CPU frequency
(otherwise).
This governor also employs a mechanism allowing it to temporarily bump up the
CPU frequency for tasks that have been waiting on I/O most recently, called
"IO-wait boosting". That happens when the :c:macro:`SCHED_CPUFREQ_IOWAIT` flag
is passed by the scheduler to the governor callback which causes the frequency
to go up to the allowed maximum immediately and then draw back to the value
returned by the above formula over time.
This governor exposes only one tunable:
``rate_limit_us``
Minimum time (in microseconds) that has to pass between two consecutive
runs of governor computations (default: 1.5 times the scaling driver's
transition latency or the maximum 2ms).
The purpose of this tunable is to reduce the scheduler context overhead
of the governor which might be excessive without it.
This governor generally is regarded as a replacement for the older `ondemand`_
and `conservative`_ governors (described below), as it is simpler and more
tightly integrated with the CPU scheduler, its overhead in terms of CPU context
switches and similar is less significant, and it uses the scheduler's own CPU
utilization metric, so in principle its decisions should not contradict the
decisions made by the other parts of the scheduler.
``ondemand``
------------
This governor uses CPU load as a CPU frequency selection metric.
In order to estimate the current CPU load, it measures the time elapsed between
consecutive invocations of its worker routine and computes the fraction of that
time in which the given CPU was not idle. The ratio of the non-idle (active)
time to the total CPU time is taken as an estimate of the load.
If this governor is attached to a policy shared by multiple CPUs, the load is
estimated for all of them and the greatest result is taken as the load estimate
for the entire policy.
The worker routine of this governor has to run in process context, so it is
invoked asynchronously (via a workqueue) and CPU P-states are updated from
there if necessary. As a result, the scheduler context overhead from this
governor is minimum, but it causes additional CPU context switches to happen
relatively often and the CPU P-state updates triggered by it can be relatively
irregular. Also, it affects its own CPU load metric by running code that
reduces the CPU idle time (even though the CPU idle time is only reduced very
slightly by it).
It generally selects CPU frequencies proportional to the estimated load, so that
the value of the ``cpuinfo_max_freq`` policy attribute corresponds to the load of
1 (or 100%), and the value of the ``cpuinfo_min_freq`` policy attribute
corresponds to the load of 0, unless when the load exceeds a (configurable)
speedup threshold, in which case it will go straight for the highest frequency
it is allowed to use (the ``scaling_max_freq`` policy limit).
This governor exposes the following tunables:
``sampling_rate``
This is how often the governor's worker routine should run, in
microseconds.
Typically, it is set to values of the order of 2000 (2 ms). Its
default value is to add a 50% breathing room
to ``cpuinfo_transition_latency`` on each policy this governor is
attached to. The minimum is typically the length of two scheduler
ticks.
If this tunable is per-policy, the following shell command sets the time
represented by it to be 1.5 times as high as the transition latency
(the default)::
# echo `$(($(cat cpuinfo_transition_latency) * 3 / 2)) > ondemand/sampling_rate
``up_threshold``
If the estimated CPU load is above this value (in percent), the governor
will set the frequency to the maximum value allowed for the policy.
Otherwise, the selected frequency will be proportional to the estimated
CPU load.
``ignore_nice_load``
If set to 1 (default 0), it will cause the CPU load estimation code to
treat the CPU time spent on executing tasks with "nice" levels greater
than 0 as CPU idle time.
This may be useful if there are tasks in the system that should not be
taken into account when deciding what frequency to run the CPUs at.
Then, to make that happen it is sufficient to increase the "nice" level
of those tasks above 0 and set this attribute to 1.
``sampling_down_factor``
Temporary multiplier, between 1 (default) and 100 inclusive, to apply to
the ``sampling_rate`` value if the CPU load goes above ``up_threshold``.
This causes the next execution of the governor's worker routine (after
setting the frequency to the allowed maximum) to be delayed, so the
frequency stays at the maximum level for a longer time.
Frequency fluctuations in some bursty workloads may be avoided this way
at the cost of additional energy spent on maintaining the maximum CPU
capacity.
``powersave_bias``
Reduction factor to apply to the original frequency target of the
governor (including the maximum value used when the ``up_threshold``
value is exceeded by the estimated CPU load) or sensitivity threshold
for the AMD frequency sensitivity powersave bias driver
(:file:`drivers/cpufreq/amd_freq_sensitivity.c`), between 0 and 1000
inclusive.
If the AMD frequency sensitivity powersave bias driver is not loaded,
the effective frequency to apply is given by
f * (1 - ``powersave_bias`` / 1000)
where f is the governor's original frequency target. The default value
of this attribute is 0 in that case.
If the AMD frequency sensitivity powersave bias driver is loaded, the
value of this attribute is 400 by default and it is used in a different
way.
On Family 16h (and later) AMD processors there is a mechanism to get a
measured workload sensitivity, between 0 and 100% inclusive, from the
hardware. That value can be used to estimate how the performance of the
workload running on a CPU will change in response to frequency changes.
The performance of a workload with the sensitivity of 0 (memory-bound or
IO-bound) is not expected to increase at all as a result of increasing
the CPU frequency, whereas workloads with the sensitivity of 100%
(CPU-bound) are expected to perform much better if the CPU frequency is
increased.
If the workload sensitivity is less than the threshold represented by
the ``powersave_bias`` value, the sensitivity powersave bias driver
will cause the governor to select a frequency lower than its original
target, so as to avoid over-provisioning workloads that will not benefit
from running at higher CPU frequencies.
``conservative``
----------------
This governor uses CPU load as a CPU frequency selection metric.
It estimates the CPU load in the same way as the `ondemand`_ governor described
above, but the CPU frequency selection algorithm implemented by it is different.
Namely, it avoids changing the frequency significantly over short time intervals
which may not be suitable for systems with limited power supply capacity (e.g.
battery-powered). To achieve that, it changes the frequency in relatively
small steps, one step at a time, up or down - depending on whether or not a
(configurable) threshold has been exceeded by the estimated CPU load.
This governor exposes the following tunables:
``freq_step``
Frequency step in percent of the maximum frequency the governor is
allowed to set (the ``scaling_max_freq`` policy limit), between 0 and
100 (5 by default).
This is how much the frequency is allowed to change in one go. Setting
it to 0 will cause the default frequency step (5 percent) to be used
and setting it to 100 effectively causes the governor to periodically
switch the frequency between the ``scaling_min_freq`` and
``scaling_max_freq`` policy limits.
``down_threshold``
Threshold value (in percent, 20 by default) used to determine the
frequency change direction.
If the estimated CPU load is greater than this value, the frequency will
go up (by ``freq_step``). If the load is less than this value (and the
``sampling_down_factor`` mechanism is not in effect), the frequency will
go down. Otherwise, the frequency will not be changed.
``sampling_down_factor``
Frequency decrease deferral factor, between 1 (default) and 10
inclusive.
It effectively causes the frequency to go down ``sampling_down_factor``
times slower than it ramps up.
Frequency Boost Support
=======================
Background
----------
Some processors support a mechanism to raise the operating frequency of some
cores in a multicore package temporarily (and above the sustainable frequency
threshold for the whole package) under certain conditions, for example if the
whole chip is not fully utilized and below its intended thermal or power budget.
Different names are used by different vendors to refer to this functionality.
For Intel processors it is referred to as "Turbo Boost", AMD calls it
"Turbo-Core" or (in technical documentation) "Core Performance Boost" and so on.
As a rule, it also is implemented differently by different vendors. The simple
term "frequency boost" is used here for brevity to refer to all of those
implementations.
The frequency boost mechanism may be either hardware-based or software-based.
If it is hardware-based (e.g. on x86), the decision to trigger the boosting is
made by the hardware (although in general it requires the hardware to be put
into a special state in which it can control the CPU frequency within certain
limits). If it is software-based (e.g. on ARM), the scaling driver decides
whether or not to trigger boosting and when to do that.
The ``boost`` File in ``sysfs``
-------------------------------
This file is located under :file:`/sys/devices/system/cpu/cpufreq/` and controls
the "boost" setting for the whole system. It is not present if the underlying
scaling driver does not support the frequency boost mechanism (or supports it,
but provides a driver-specific interface for controlling it, like
|intel_pstate|).
If the value in this file is 1, the frequency boost mechanism is enabled. This
means that either the hardware can be put into states in which it is able to
trigger boosting (in the hardware-based case), or the software is allowed to
trigger boosting (in the software-based case). It does not mean that boosting
is actually in use at the moment on any CPUs in the system. It only means a
permission to use the frequency boost mechanism (which still may never be used
for other reasons).
If the value in this file is 0, the frequency boost mechanism is disabled and
cannot be used at all.
The only values that can be written to this file are 0 and 1.
Rationale for Boost Control Knob
--------------------------------
The frequency boost mechanism is generally intended to help to achieve optimum
CPU performance on time scales below software resolution (e.g. below the
scheduler tick interval) and it is demonstrably suitable for many workloads, but
it may lead to problems in certain situations.
For this reason, many systems make it possible to disable the frequency boost
mechanism in the platform firmware (BIOS) setup, but that requires the system to
be restarted for the setting to be adjusted as desired, which may not be
practical at least in some cases. For example:
1. Boosting means overclocking the processor, although under controlled
conditions. Generally, the processor's energy consumption increases
as a result of increasing its frequency and voltage, even temporarily.
That may not be desirable on systems that switch to power sources of
limited capacity, such as batteries, so the ability to disable the boost
mechanism while the system is running may help there (but that depends on
the workload too).
2. In some situations deterministic behavior is more important than
performance or energy consumption (or both) and the ability to disable
boosting while the system is running may be useful then.
3. To examine the impact of the frequency boost mechanism itself, it is useful
to be able to run tests with and without boosting, preferably without
restarting the system in the meantime.
4. Reproducible results are important when running benchmarks. Since
the boosting functionality depends on the load of the whole package,
single-thread performance may vary because of it which may lead to
unreproducible results sometimes. That can be avoided by disabling the
frequency boost mechanism before running benchmarks sensitive to that
issue.
Legacy AMD ``cpb`` Knob
-----------------------
The AMD powernow-k8 scaling driver supports a ``sysfs`` knob very similar to
the global ``boost`` one. It is used for disabling/enabling the "Core
Performance Boost" feature of some AMD processors.
If present, that knob is located in every ``CPUFreq`` policy directory in
``sysfs`` (:file:`/sys/devices/system/cpu/cpufreq/policyX/`) and is called
``cpb``, which indicates a more fine grained control interface. The actual
implementation, however, works on the system-wide basis and setting that knob
for one policy causes the same value of it to be set for all of the other
policies at the same time.
That knob is still supported on AMD processors that support its underlying
hardware feature, but it may be configured out of the kernel (via the
:c:macro:`CONFIG_X86_ACPI_CPUFREQ_CPB` configuration option) and the global
``boost`` knob is present regardless. Thus it is always possible use the
``boost`` knob instead of the ``cpb`` one which is highly recommended, as that
is more consistent with what all of the other systems do (and the ``cpb`` knob
may not be supported any more in the future).
The ``cpb`` knob is never present for any processors without the underlying
hardware feature (e.g. all Intel ones), even if the
:c:macro:`CONFIG_X86_ACPI_CPUFREQ_CPB` configuration option is set.
References
==========
.. [1] Jonathan Corbet, *Per-entity load tracking*,
https://lwn.net/Articles/531853/

View file

@ -0,0 +1,274 @@
.. SPDX-License-Identifier: GPL-2.0
=======================================================
Legacy Documentation of CPU Performance Scaling Drivers
=======================================================
Included below are historic documents describing assorted
:doc:`CPU performance scaling <cpufreq>` drivers. They are reproduced verbatim,
with the original white space formatting and indentation preserved, except for
the added leading space character in every line of text.
AMD PowerNow! Drivers
=====================
::
PowerNow! and Cool'n'Quiet are AMD names for frequency
management capabilities in AMD processors. As the hardware
implementation changes in new generations of the processors,
there is a different cpu-freq driver for each generation.
Note that the driver's will not load on the "wrong" hardware,
so it is safe to try each driver in turn when in doubt as to
which is the correct driver.
Note that the functionality to change frequency (and voltage)
is not available in all processors. The drivers will refuse
to load on processors without this capability. The capability
is detected with the cpuid instruction.
The drivers use BIOS supplied tables to obtain frequency and
voltage information appropriate for a particular platform.
Frequency transitions will be unavailable if the BIOS does
not supply these tables.
6th Generation: powernow-k6
7th Generation: powernow-k7: Athlon, Duron, Geode.
8th Generation: powernow-k8: Athlon, Athlon 64, Opteron, Sempron.
Documentation on this functionality in 8th generation processors
is available in the "BIOS and Kernel Developer's Guide", publication
26094, in chapter 9, available for download from www.amd.com.
BIOS supplied data, for powernow-k7 and for powernow-k8, may be
from either the PSB table or from ACPI objects. The ACPI support
is only available if the kernel config sets CONFIG_ACPI_PROCESSOR.
The powernow-k8 driver will attempt to use ACPI if so configured,
and fall back to PST if that fails.
The powernow-k7 driver will try to use the PSB support first, and
fall back to ACPI if the PSB support fails. A module parameter,
acpi_force, is provided to force ACPI support to be used instead
of PSB support.
``cpufreq-nforce2``
===================
::
The cpufreq-nforce2 driver changes the FSB on nVidia nForce2 platforms.
This works better than on other platforms, because the FSB of the CPU
can be controlled independently from the PCI/AGP clock.
The module has two options:
fid: multiplier * 10 (for example 8.5 = 85)
min_fsb: minimum FSB
If not set, fid is calculated from the current CPU speed and the FSB.
min_fsb defaults to FSB at boot time - 50 MHz.
IMPORTANT: The available range is limited downwards!
Also the minimum available FSB can differ, for systems
booting with 200 MHz, 150 should always work.
``pcc-cpufreq``
===============
::
/*
* pcc-cpufreq.txt - PCC interface documentation
*
* Copyright (C) 2009 Red Hat, Matthew Garrett <mjg@redhat.com>
* Copyright (C) 2009 Hewlett-Packard Development Company, L.P.
* Nagananda Chumbalkar <nagananda.chumbalkar@hp.com>
*/
Processor Clocking Control Driver
---------------------------------
Contents:
---------
1. Introduction
1.1 PCC interface
1.1.1 Get Average Frequency
1.1.2 Set Desired Frequency
1.2 Platforms affected
2. Driver and /sys details
2.1 scaling_available_frequencies
2.2 cpuinfo_transition_latency
2.3 cpuinfo_cur_freq
2.4 related_cpus
3. Caveats
1. Introduction:
----------------
Processor Clocking Control (PCC) is an interface between the platform
firmware and OSPM. It is a mechanism for coordinating processor
performance (ie: frequency) between the platform firmware and the OS.
The PCC driver (pcc-cpufreq) allows OSPM to take advantage of the PCC
interface.
OS utilizes the PCC interface to inform platform firmware what frequency the
OS wants for a logical processor. The platform firmware attempts to achieve
the requested frequency. If the request for the target frequency could not be
satisfied by platform firmware, then it usually means that power budget
conditions are in place, and "power capping" is taking place.
1.1 PCC interface:
------------------
The complete PCC specification is available here:
https://acpica.org/sites/acpica/files/Processor-Clocking-Control-v1p0.pdf
PCC relies on a shared memory region that provides a channel for communication
between the OS and platform firmware. PCC also implements a "doorbell" that
is used by the OS to inform the platform firmware that a command has been
sent.
The ACPI PCCH() method is used to discover the location of the PCC shared
memory region. The shared memory region header contains the "command" and
"status" interface. PCCH() also contains details on how to access the platform
doorbell.
The following commands are supported by the PCC interface:
* Get Average Frequency
* Set Desired Frequency
The ACPI PCCP() method is implemented for each logical processor and is
used to discover the offsets for the input and output buffers in the shared
memory region.
When PCC mode is enabled, the platform will not expose processor performance
or throttle states (_PSS, _TSS and related ACPI objects) to OSPM. Therefore,
the native P-state driver (such as acpi-cpufreq for Intel, powernow-k8 for
AMD) will not load.
However, OSPM remains in control of policy. The governor (eg: "ondemand")
computes the required performance for each processor based on server workload.
The PCC driver fills in the command interface, and the input buffer and
communicates the request to the platform firmware. The platform firmware is
responsible for delivering the requested performance.
Each PCC command is "global" in scope and can affect all the logical CPUs in
the system. Therefore, PCC is capable of performing "group" updates. With PCC
the OS is capable of getting/setting the frequency of all the logical CPUs in
the system with a single call to the BIOS.
1.1.1 Get Average Frequency:
----------------------------
This command is used by the OSPM to query the running frequency of the
processor since the last time this command was completed. The output buffer
indicates the average unhalted frequency of the logical processor expressed as
a percentage of the nominal (ie: maximum) CPU frequency. The output buffer
also signifies if the CPU frequency is limited by a power budget condition.
1.1.2 Set Desired Frequency:
----------------------------
This command is used by the OSPM to communicate to the platform firmware the
desired frequency for a logical processor. The output buffer is currently
ignored by OSPM. The next invocation of "Get Average Frequency" will inform
OSPM if the desired frequency was achieved or not.
1.2 Platforms affected:
-----------------------
The PCC driver will load on any system where the platform firmware:
* supports the PCC interface, and the associated PCCH() and PCCP() methods
* assumes responsibility for managing the hardware clocking controls in order
to deliver the requested processor performance
Currently, certain HP ProLiant platforms implement the PCC interface. On those
platforms PCC is the "default" choice.
However, it is possible to disable this interface via a BIOS setting. In
such an instance, as is also the case on platforms where the PCC interface
is not implemented, the PCC driver will fail to load silently.
2. Driver and /sys details:
---------------------------
When the driver loads, it merely prints the lowest and the highest CPU
frequencies supported by the platform firmware.
The PCC driver loads with a message such as:
pcc-cpufreq: (v1.00.00) driver loaded with frequency limits: 1600 MHz, 2933
MHz
This means that the OPSM can request the CPU to run at any frequency in
between the limits (1600 MHz, and 2933 MHz) specified in the message.
Internally, there is no need for the driver to convert the "target" frequency
to a corresponding P-state.
The VERSION number for the driver will be of the format v.xy.ab.
eg: 1.00.02
----- --
| |
| -- this will increase with bug fixes/enhancements to the driver
|-- this is the version of the PCC specification the driver adheres to
The following is a brief discussion on some of the fields exported via the
/sys filesystem and how their values are affected by the PCC driver:
2.1 scaling_available_frequencies:
----------------------------------
scaling_available_frequencies is not created in /sys. No intermediate
frequencies need to be listed because the BIOS will try to achieve any
frequency, within limits, requested by the governor. A frequency does not have
to be strictly associated with a P-state.
2.2 cpuinfo_transition_latency:
-------------------------------
The cpuinfo_transition_latency field is 0. The PCC specification does
not include a field to expose this value currently.
2.3 cpuinfo_cur_freq:
---------------------
A) Often cpuinfo_cur_freq will show a value different than what is declared
in the scaling_available_frequencies or scaling_cur_freq, or scaling_max_freq.
This is due to "turbo boost" available on recent Intel processors. If certain
conditions are met the BIOS can achieve a slightly higher speed than requested
by OSPM. An example:
scaling_cur_freq : 2933000
cpuinfo_cur_freq : 3196000
B) There is a round-off error associated with the cpuinfo_cur_freq value.
Since the driver obtains the current frequency as a "percentage" (%) of the
nominal frequency from the BIOS, sometimes, the values displayed by
scaling_cur_freq and cpuinfo_cur_freq may not match. An example:
scaling_cur_freq : 1600000
cpuinfo_cur_freq : 1583000
In this example, the nominal frequency is 2933 MHz. The driver obtains the
current frequency, cpuinfo_cur_freq, as 54% of the nominal frequency:
54% of 2933 MHz = 1583 MHz
Nominal frequency is the maximum frequency of the processor, and it usually
corresponds to the frequency of the P0 P-state.
2.4 related_cpus:
-----------------
The related_cpus field is identical to affected_cpus.
affected_cpus : 4
related_cpus : 4
Currently, the PCC driver does not evaluate _PSD. The platforms that support
PCC do not implement SW_ALL. So OSPM doesn't need to perform any coordination
to ensure that the same frequency is requested of all dependent CPUs.
3. Caveats:
-----------
The "cpufreq_stats" module in its present form cannot be loaded and
expected to work with the PCC driver. Since the "cpufreq_stats" module
provides information wrt each P-state, it is not applicable to the PCC driver.

View file

@ -0,0 +1,662 @@
.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>
.. |struct cpuidle_state| replace:: :c:type:`struct cpuidle_state <cpuidle_state>`
.. |cpufreq| replace:: :doc:`CPU Performance Scaling <cpufreq>`
========================
CPU Idle Time Management
========================
:Copyright: |copy| 2018 Intel Corporation
:Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Concepts
========
Modern processors are generally able to enter states in which the execution of
a program is suspended and instructions belonging to it are not fetched from
memory or executed. Those states are the *idle* states of the processor.
Since part of the processor hardware is not used in idle states, entering them
generally allows power drawn by the processor to be reduced and, in consequence,
it is an opportunity to save energy.
CPU idle time management is an energy-efficiency feature concerned about using
the idle states of processors for this purpose.
Logical CPUs
------------
CPU idle time management operates on CPUs as seen by the *CPU scheduler* (that
is the part of the kernel responsible for the distribution of computational
work in the system). In its view, CPUs are *logical* units. That is, they need
not be separate physical entities and may just be interfaces appearing to
software as individual single-core processors. In other words, a CPU is an
entity which appears to be fetching instructions that belong to one sequence
(program) from memory and executing them, but it need not work this way
physically. Generally, three different cases can be consider here.
First, if the whole processor can only follow one sequence of instructions (one
program) at a time, it is a CPU. In that case, if the hardware is asked to
enter an idle state, that applies to the processor as a whole.
Second, if the processor is multi-core, each core in it is able to follow at
least one program at a time. The cores need not be entirely independent of each
other (for example, they may share caches), but still most of the time they
work physically in parallel with each other, so if each of them executes only
one program, those programs run mostly independently of each other at the same
time. The entire cores are CPUs in that case and if the hardware is asked to
enter an idle state, that applies to the core that asked for it in the first
place, but it also may apply to a larger unit (say a "package" or a "cluster")
that the core belongs to (in fact, it may apply to an entire hierarchy of larger
units containing the core). Namely, if all of the cores in the larger unit
except for one have been put into idle states at the "core level" and the
remaining core asks the processor to enter an idle state, that may trigger it
to put the whole larger unit into an idle state which also will affect the
other cores in that unit.
Finally, each core in a multi-core processor may be able to follow more than one
program in the same time frame (that is, each core may be able to fetch
instructions from multiple locations in memory and execute them in the same time
frame, but not necessarily entirely in parallel with each other). In that case
the cores present themselves to software as "bundles" each consisting of
multiple individual single-core "processors", referred to as *hardware threads*
(or hyper-threads specifically on Intel hardware), that each can follow one
sequence of instructions. Then, the hardware threads are CPUs from the CPU idle
time management perspective and if the processor is asked to enter an idle state
by one of them, the hardware thread (or CPU) that asked for it is stopped, but
nothing more happens, unless all of the other hardware threads within the same
core also have asked the processor to enter an idle state. In that situation,
the core may be put into an idle state individually or a larger unit containing
it may be put into an idle state as a whole (if the other cores within the
larger unit are in idle states already).
Idle CPUs
---------
Logical CPUs, simply referred to as "CPUs" in what follows, are regarded as
*idle* by the Linux kernel when there are no tasks to run on them except for the
special "idle" task.
Tasks are the CPU scheduler's representation of work. Each task consists of a
sequence of instructions to execute, or code, data to be manipulated while
running that code, and some context information that needs to be loaded into the
processor every time the task's code is run by a CPU. The CPU scheduler
distributes work by assigning tasks to run to the CPUs present in the system.
Tasks can be in various states. In particular, they are *runnable* if there are
no specific conditions preventing their code from being run by a CPU as long as
there is a CPU available for that (for example, they are not waiting for any
events to occur or similar). When a task becomes runnable, the CPU scheduler
assigns it to one of the available CPUs to run and if there are no more runnable
tasks assigned to it, the CPU will load the given task's context and run its
code (from the instruction following the last one executed so far, possibly by
another CPU). [If there are multiple runnable tasks assigned to one CPU
simultaneously, they will be subject to prioritization and time sharing in order
to allow them to make some progress over time.]
The special "idle" task becomes runnable if there are no other runnable tasks
assigned to the given CPU and the CPU is then regarded as idle. In other words,
in Linux idle CPUs run the code of the "idle" task called *the idle loop*. That
code may cause the processor to be put into one of its idle states, if they are
supported, in order to save energy, but if the processor does not support any
idle states, or there is not enough time to spend in an idle state before the
next wakeup event, or there are strict latency constraints preventing any of the
available idle states from being used, the CPU will simply execute more or less
useless instructions in a loop until it is assigned a new task to run.
.. _idle-loop:
The Idle Loop
=============
The idle loop code takes two major steps in every iteration of it. First, it
calls into a code module referred to as the *governor* that belongs to the CPU
idle time management subsystem called ``CPUIdle`` to select an idle state for
the CPU to ask the hardware to enter. Second, it invokes another code module
from the ``CPUIdle`` subsystem, called the *driver*, to actually ask the
processor hardware to enter the idle state selected by the governor.
The role of the governor is to find an idle state most suitable for the
conditions at hand. For this purpose, idle states that the hardware can be
asked to enter by logical CPUs are represented in an abstract way independent of
the platform or the processor architecture and organized in a one-dimensional
(linear) array. That array has to be prepared and supplied by the ``CPUIdle``
driver matching the platform the kernel is running on at the initialization
time. This allows ``CPUIdle`` governors to be independent of the underlying
hardware and to work with any platforms that the Linux kernel can run on.
Each idle state present in that array is characterized by two parameters to be
taken into account by the governor, the *target residency* and the (worst-case)
*exit latency*. The target residency is the minimum time the hardware must
spend in the given state, including the time needed to enter it (which may be
substantial), in order to save more energy than it would save by entering one of
the shallower idle states instead. [The "depth" of an idle state roughly
corresponds to the power drawn by the processor in that state.] The exit
latency, in turn, is the maximum time it will take a CPU asking the processor
hardware to enter an idle state to start executing the first instruction after a
wakeup from that state. Note that in general the exit latency also must cover
the time needed to enter the given state in case the wakeup occurs when the
hardware is entering it and it must be entered completely to be exited in an
ordered manner.
There are two types of information that can influence the governor's decisions.
First of all, the governor knows the time until the closest timer event. That
time is known exactly, because the kernel programs timers and it knows exactly
when they will trigger, and it is the maximum time the hardware that the given
CPU depends on can spend in an idle state, including the time necessary to enter
and exit it. However, the CPU may be woken up by a non-timer event at any time
(in particular, before the closest timer triggers) and it generally is not known
when that may happen. The governor can only see how much time the CPU actually
was idle after it has been woken up (that time will be referred to as the *idle
duration* from now on) and it can use that information somehow along with the
time until the closest timer to estimate the idle duration in future. How the
governor uses that information depends on what algorithm is implemented by it
and that is the primary reason for having more than one governor in the
``CPUIdle`` subsystem.
There are four ``CPUIdle`` governors available, ``menu``, `TEO <teo-gov_>`_,
``ladder`` and ``haltpoll``. Which of them is used by default depends on the
configuration of the kernel and in particular on whether or not the scheduler
tick can be `stopped by the idle loop <idle-cpus-and-tick_>`_. Available
governors can be read from the :file:`available_governors`, and the governor
can be changed at runtime. The name of the ``CPUIdle`` governor currently
used by the kernel can be read from the :file:`current_governor_ro` or
:file:`current_governor` file under :file:`/sys/devices/system/cpu/cpuidle/`
in ``sysfs``.
Which ``CPUIdle`` driver is used, on the other hand, usually depends on the
platform the kernel is running on, but there are platforms with more than one
matching driver. For example, there are two drivers that can work with the
majority of Intel platforms, ``intel_idle`` and ``acpi_idle``, one with
hardcoded idle states information and the other able to read that information
from the system's ACPI tables, respectively. Still, even in those cases, the
driver chosen at the system initialization time cannot be replaced later, so the
decision on which one of them to use has to be made early (on Intel platforms
the ``acpi_idle`` driver will be used if ``intel_idle`` is disabled for some
reason or if it does not recognize the processor). The name of the ``CPUIdle``
driver currently used by the kernel can be read from the :file:`current_driver`
file under :file:`/sys/devices/system/cpu/cpuidle/` in ``sysfs``.
.. _idle-cpus-and-tick:
Idle CPUs and The Scheduler Tick
================================
The scheduler tick is a timer that triggers periodically in order to implement
the time sharing strategy of the CPU scheduler. Of course, if there are
multiple runnable tasks assigned to one CPU at the same time, the only way to
allow them to make reasonable progress in a given time frame is to make them
share the available CPU time. Namely, in rough approximation, each task is
given a slice of the CPU time to run its code, subject to the scheduling class,
prioritization and so on and when that time slice is used up, the CPU should be
switched over to running (the code of) another task. The currently running task
may not want to give the CPU away voluntarily, however, and the scheduler tick
is there to make the switch happen regardless. That is not the only role of the
tick, but it is the primary reason for using it.
The scheduler tick is problematic from the CPU idle time management perspective,
because it triggers periodically and relatively often (depending on the kernel
configuration, the length of the tick period is between 1 ms and 10 ms).
Thus, if the tick is allowed to trigger on idle CPUs, it will not make sense
for them to ask the hardware to enter idle states with target residencies above
the tick period length. Moreover, in that case the idle duration of any CPU
will never exceed the tick period length and the energy used for entering and
exiting idle states due to the tick wakeups on idle CPUs will be wasted.
Fortunately, it is not really necessary to allow the tick to trigger on idle
CPUs, because (by definition) they have no tasks to run except for the special
"idle" one. In other words, from the CPU scheduler perspective, the only user
of the CPU time on them is the idle loop. Since the time of an idle CPU need
not be shared between multiple runnable tasks, the primary reason for using the
tick goes away if the given CPU is idle. Consequently, it is possible to stop
the scheduler tick entirely on idle CPUs in principle, even though that may not
always be worth the effort.
Whether or not it makes sense to stop the scheduler tick in the idle loop
depends on what is expected by the governor. First, if there is another
(non-tick) timer due to trigger within the tick range, stopping the tick clearly
would be a waste of time, even though the timer hardware may not need to be
reprogrammed in that case. Second, if the governor is expecting a non-timer
wakeup within the tick range, stopping the tick is not necessary and it may even
be harmful. Namely, in that case the governor will select an idle state with
the target residency within the time until the expected wakeup, so that state is
going to be relatively shallow. The governor really cannot select a deep idle
state then, as that would contradict its own expectation of a wakeup in short
order. Now, if the wakeup really occurs shortly, stopping the tick would be a
waste of time and in this case the timer hardware would need to be reprogrammed,
which is expensive. On the other hand, if the tick is stopped and the wakeup
does not occur any time soon, the hardware may spend indefinite amount of time
in the shallow idle state selected by the governor, which will be a waste of
energy. Hence, if the governor is expecting a wakeup of any kind within the
tick range, it is better to allow the tick trigger. Otherwise, however, the
governor will select a relatively deep idle state, so the tick should be stopped
so that it does not wake up the CPU too early.
In any case, the governor knows what it is expecting and the decision on whether
or not to stop the scheduler tick belongs to it. Still, if the tick has been
stopped already (in one of the previous iterations of the loop), it is better
to leave it as is and the governor needs to take that into account.
The kernel can be configured to disable stopping the scheduler tick in the idle
loop altogether. That can be done through the build-time configuration of it
(by unsetting the ``CONFIG_NO_HZ_IDLE`` configuration option) or by passing
``nohz=off`` to it in the command line. In both cases, as the stopping of the
scheduler tick is disabled, the governor's decisions regarding it are simply
ignored by the idle loop code and the tick is never stopped.
The systems that run kernels configured to allow the scheduler tick to be
stopped on idle CPUs are referred to as *tickless* systems and they are
generally regarded as more energy-efficient than the systems running kernels in
which the tick cannot be stopped. If the given system is tickless, it will use
the ``menu`` governor by default and if it is not tickless, the default
``CPUIdle`` governor on it will be ``ladder``.
.. _menu-gov:
The ``menu`` Governor
=====================
The ``menu`` governor is the default ``CPUIdle`` governor for tickless systems.
It is quite complex, but the basic principle of its design is straightforward.
Namely, when invoked to select an idle state for a CPU (i.e. an idle state that
the CPU will ask the processor hardware to enter), it attempts to predict the
idle duration and uses the predicted value for idle state selection.
It first obtains the time until the closest timer event with the assumption
that the scheduler tick will be stopped. That time, referred to as the *sleep
length* in what follows, is the upper bound on the time before the next CPU
wakeup. It is used to determine the sleep length range, which in turn is needed
to get the sleep length correction factor.
The ``menu`` governor maintains two arrays of sleep length correction factors.
One of them is used when tasks previously running on the given CPU are waiting
for some I/O operations to complete and the other one is used when that is not
the case. Each array contains several correction factor values that correspond
to different sleep length ranges organized so that each range represented in the
array is approximately 10 times wider than the previous one.
The correction factor for the given sleep length range (determined before
selecting the idle state for the CPU) is updated after the CPU has been woken
up and the closer the sleep length is to the observed idle duration, the closer
to 1 the correction factor becomes (it must fall between 0 and 1 inclusive).
The sleep length is multiplied by the correction factor for the range that it
falls into to obtain the first approximation of the predicted idle duration.
Next, the governor uses a simple pattern recognition algorithm to refine its
idle duration prediction. Namely, it saves the last 8 observed idle duration
values and, when predicting the idle duration next time, it computes the average
and variance of them. If the variance is small (smaller than 400 square
milliseconds) or it is small relative to the average (the average is greater
that 6 times the standard deviation), the average is regarded as the "typical
interval" value. Otherwise, the longest of the saved observed idle duration
values is discarded and the computation is repeated for the remaining ones.
Again, if the variance of them is small (in the above sense), the average is
taken as the "typical interval" value and so on, until either the "typical
interval" is determined or too many data points are disregarded, in which case
the "typical interval" is assumed to equal "infinity" (the maximum unsigned
integer value). The "typical interval" computed this way is compared with the
sleep length multiplied by the correction factor and the minimum of the two is
taken as the predicted idle duration.
Then, the governor computes an extra latency limit to help "interactive"
workloads. It uses the observation that if the exit latency of the selected
idle state is comparable with the predicted idle duration, the total time spent
in that state probably will be very short and the amount of energy to save by
entering it will be relatively small, so likely it is better to avoid the
overhead related to entering that state and exiting it. Thus selecting a
shallower state is likely to be a better option then. The first approximation
of the extra latency limit is the predicted idle duration itself which
additionally is divided by a value depending on the number of tasks that
previously ran on the given CPU and now they are waiting for I/O operations to
complete. The result of that division is compared with the latency limit coming
from the power management quality of service, or `PM QoS <cpu-pm-qos_>`_,
framework and the minimum of the two is taken as the limit for the idle states'
exit latency.
Now, the governor is ready to walk the list of idle states and choose one of
them. For this purpose, it compares the target residency of each state with
the predicted idle duration and the exit latency of it with the computed latency
limit. It selects the state with the target residency closest to the predicted
idle duration, but still below it, and exit latency that does not exceed the
limit.
In the final step the governor may still need to refine the idle state selection
if it has not decided to `stop the scheduler tick <idle-cpus-and-tick_>`_. That
happens if the idle duration predicted by it is less than the tick period and
the tick has not been stopped already (in a previous iteration of the idle
loop). Then, the sleep length used in the previous computations may not reflect
the real time until the closest timer event and if it really is greater than
that time, the governor may need to select a shallower state with a suitable
target residency.
.. _teo-gov:
The Timer Events Oriented (TEO) Governor
========================================
The timer events oriented (TEO) governor is an alternative ``CPUIdle`` governor
for tickless systems. It follows the same basic strategy as the ``menu`` `one
<menu-gov_>`_: it always tries to find the deepest idle state suitable for the
given conditions. However, it applies a different approach to that problem.
.. kernel-doc:: drivers/cpuidle/governors/teo.c
:doc: teo-description
.. _idle-states-representation:
Representation of Idle States
=============================
For the CPU idle time management purposes all of the physical idle states
supported by the processor have to be represented as a one-dimensional array of
|struct cpuidle_state| objects each allowing an individual (logical) CPU to ask
the processor hardware to enter an idle state of certain properties. If there
is a hierarchy of units in the processor, one |struct cpuidle_state| object can
cover a combination of idle states supported by the units at different levels of
the hierarchy. In that case, the `target residency and exit latency parameters
of it <idle-loop_>`_, must reflect the properties of the idle state at the
deepest level (i.e. the idle state of the unit containing all of the other
units).
For example, take a processor with two cores in a larger unit referred to as
a "module" and suppose that asking the hardware to enter a specific idle state
(say "X") at the "core" level by one core will trigger the module to try to
enter a specific idle state of its own (say "MX") if the other core is in idle
state "X" already. In other words, asking for idle state "X" at the "core"
level gives the hardware a license to go as deep as to idle state "MX" at the
"module" level, but there is no guarantee that this is going to happen (the core
asking for idle state "X" may just end up in that state by itself instead).
Then, the target residency of the |struct cpuidle_state| object representing
idle state "X" must reflect the minimum time to spend in idle state "MX" of
the module (including the time needed to enter it), because that is the minimum
time the CPU needs to be idle to save any energy in case the hardware enters
that state. Analogously, the exit latency parameter of that object must cover
the exit time of idle state "MX" of the module (and usually its entry time too),
because that is the maximum delay between a wakeup signal and the time the CPU
will start to execute the first new instruction (assuming that both cores in the
module will always be ready to execute instructions as soon as the module
becomes operational as a whole).
There are processors without direct coordination between different levels of the
hierarchy of units inside them, however. In those cases asking for an idle
state at the "core" level does not automatically affect the "module" level, for
example, in any way and the ``CPUIdle`` driver is responsible for the entire
handling of the hierarchy. Then, the definition of the idle state objects is
entirely up to the driver, but still the physical properties of the idle state
that the processor hardware finally goes into must always follow the parameters
used by the governor for idle state selection (for instance, the actual exit
latency of that idle state must not exceed the exit latency parameter of the
idle state object selected by the governor).
In addition to the target residency and exit latency idle state parameters
discussed above, the objects representing idle states each contain a few other
parameters describing the idle state and a pointer to the function to run in
order to ask the hardware to enter that state. Also, for each
|struct cpuidle_state| object, there is a corresponding
:c:type:`struct cpuidle_state_usage <cpuidle_state_usage>` one containing usage
statistics of the given idle state. That information is exposed by the kernel
via ``sysfs``.
For each CPU in the system, there is a :file:`/sys/devices/system/cpu/cpu<N>/cpuidle/`
directory in ``sysfs``, where the number ``<N>`` is assigned to the given
CPU at the initialization time. That directory contains a set of subdirectories
called :file:`state0`, :file:`state1` and so on, up to the number of idle state
objects defined for the given CPU minus one. Each of these directories
corresponds to one idle state object and the larger the number in its name, the
deeper the (effective) idle state represented by it. Each of them contains
a number of files (attributes) representing the properties of the idle state
object corresponding to it, as follows:
``above``
Total number of times this idle state had been asked for, but the
observed idle duration was certainly too short to match its target
residency.
``below``
Total number of times this idle state had been asked for, but certainly
a deeper idle state would have been a better match for the observed idle
duration.
``desc``
Description of the idle state.
``disable``
Whether or not this idle state is disabled.
``default_status``
The default status of this state, "enabled" or "disabled".
``latency``
Exit latency of the idle state in microseconds.
``name``
Name of the idle state.
``power``
Power drawn by hardware in this idle state in milliwatts (if specified,
0 otherwise).
``residency``
Target residency of the idle state in microseconds.
``time``
Total time spent in this idle state by the given CPU (as measured by the
kernel) in microseconds.
``usage``
Total number of times the hardware has been asked by the given CPU to
enter this idle state.
``rejected``
Total number of times a request to enter this idle state on the given
CPU was rejected.
The :file:`desc` and :file:`name` files both contain strings. The difference
between them is that the name is expected to be more concise, while the
description may be longer and it may contain white space or special characters.
The other files listed above contain integer numbers.
The :file:`disable` attribute is the only writeable one. If it contains 1, the
given idle state is disabled for this particular CPU, which means that the
governor will never select it for this particular CPU and the ``CPUIdle``
driver will never ask the hardware to enter it for that CPU as a result.
However, disabling an idle state for one CPU does not prevent it from being
asked for by the other CPUs, so it must be disabled for all of them in order to
never be asked for by any of them. [Note that, due to the way the ``ladder``
governor is implemented, disabling an idle state prevents that governor from
selecting any idle states deeper than the disabled one too.]
If the :file:`disable` attribute contains 0, the given idle state is enabled for
this particular CPU, but it still may be disabled for some or all of the other
CPUs in the system at the same time. Writing 1 to it causes the idle state to
be disabled for this particular CPU and writing 0 to it allows the governor to
take it into consideration for the given CPU and the driver to ask for it,
unless that state was disabled globally in the driver (in which case it cannot
be used at all).
The :file:`power` attribute is not defined very well, especially for idle state
objects representing combinations of idle states at different levels of the
hierarchy of units in the processor, and it generally is hard to obtain idle
state power numbers for complex hardware, so :file:`power` often contains 0 (not
available) and if it contains a nonzero number, that number may not be very
accurate and it should not be relied on for anything meaningful.
The number in the :file:`time` file generally may be greater than the total time
really spent by the given CPU in the given idle state, because it is measured by
the kernel and it may not cover the cases in which the hardware refused to enter
this idle state and entered a shallower one instead of it (or even it did not
enter any idle state at all). The kernel can only measure the time span between
asking the hardware to enter an idle state and the subsequent wakeup of the CPU
and it cannot say what really happened in the meantime at the hardware level.
Moreover, if the idle state object in question represents a combination of idle
states at different levels of the hierarchy of units in the processor,
the kernel can never say how deep the hardware went down the hierarchy in any
particular case. For these reasons, the only reliable way to find out how
much time has been spent by the hardware in different idle states supported by
it is to use idle state residency counters in the hardware, if available.
Generally, an interrupt received when trying to enter an idle state causes the
idle state entry request to be rejected, in which case the ``CPUIdle`` driver
may return an error code to indicate that this was the case. The :file:`usage`
and :file:`rejected` files report the number of times the given idle state
was entered successfully or rejected, respectively.
.. _cpu-pm-qos:
Power Management Quality of Service for CPUs
============================================
The power management quality of service (PM QoS) framework in the Linux kernel
allows kernel code and user space processes to set constraints on various
energy-efficiency features of the kernel to prevent performance from dropping
below a required level.
CPU idle time management can be affected by PM QoS in two ways, through the
global CPU latency limit and through the resume latency constraints for
individual CPUs. Kernel code (e.g. device drivers) can set both of them with
the help of special internal interfaces provided by the PM QoS framework. User
space can modify the former by opening the :file:`cpu_dma_latency` special
device file under :file:`/dev/` and writing a binary value (interpreted as a
signed 32-bit integer) to it. In turn, the resume latency constraint for a CPU
can be modified from user space by writing a string (representing a signed
32-bit integer) to the :file:`power/pm_qos_resume_latency_us` file under
:file:`/sys/devices/system/cpu/cpu<N>/` in ``sysfs``, where the CPU number
``<N>`` is allocated at the system initialization time. Negative values
will be rejected in both cases and, also in both cases, the written integer
number will be interpreted as a requested PM QoS constraint in microseconds.
The requested value is not automatically applied as a new constraint, however,
as it may be less restrictive (greater in this particular case) than another
constraint previously requested by someone else. For this reason, the PM QoS
framework maintains a list of requests that have been made so far for the
global CPU latency limit and for each individual CPU, aggregates them and
applies the effective (minimum in this particular case) value as the new
constraint.
In fact, opening the :file:`cpu_dma_latency` special device file causes a new
PM QoS request to be created and added to a global priority list of CPU latency
limit requests and the file descriptor coming from the "open" operation
represents that request. If that file descriptor is then used for writing, the
number written to it will be associated with the PM QoS request represented by
it as a new requested limit value. Next, the priority list mechanism will be
used to determine the new effective value of the entire list of requests and
that effective value will be set as a new CPU latency limit. Thus requesting a
new limit value will only change the real limit if the effective "list" value is
affected by it, which is the case if it is the minimum of the requested values
in the list.
The process holding a file descriptor obtained by opening the
:file:`cpu_dma_latency` special device file controls the PM QoS request
associated with that file descriptor, but it controls this particular PM QoS
request only.
Closing the :file:`cpu_dma_latency` special device file or, more precisely, the
file descriptor obtained while opening it, causes the PM QoS request associated
with that file descriptor to be removed from the global priority list of CPU
latency limit requests and destroyed. If that happens, the priority list
mechanism will be used again, to determine the new effective value for the whole
list and that value will become the new limit.
In turn, for each CPU there is one resume latency PM QoS request associated with
the :file:`power/pm_qos_resume_latency_us` file under
:file:`/sys/devices/system/cpu/cpu<N>/` in ``sysfs`` and writing to it causes
this single PM QoS request to be updated regardless of which user space
process does that. In other words, this PM QoS request is shared by the entire
user space, so access to the file associated with it needs to be arbitrated
to avoid confusion. [Arguably, the only legitimate use of this mechanism in
practice is to pin a process to the CPU in question and let it use the
``sysfs`` interface to control the resume latency constraint for it.] It is
still only a request, however. It is an entry in a priority list used to
determine the effective value to be set as the resume latency constraint for the
CPU in question every time the list of requests is updated this way or another
(there may be other requests coming from kernel code in that list).
CPU idle time governors are expected to regard the minimum of the global
(effective) CPU latency limit and the effective resume latency constraint for
the given CPU as the upper limit for the exit latency of the idle states that
they are allowed to select for that CPU. They should never select any idle
states with exit latency beyond that limit.
Idle States Control Via Kernel Command Line
===========================================
In addition to the ``sysfs`` interface allowing individual idle states to be
`disabled for individual CPUs <idle-states-representation_>`_, there are kernel
command line parameters affecting CPU idle time management.
The ``cpuidle.off=1`` kernel command line option can be used to disable the
CPU idle time management entirely. It does not prevent the idle loop from
running on idle CPUs, but it prevents the CPU idle time governors and drivers
from being invoked. If it is added to the kernel command line, the idle loop
will ask the hardware to enter idle states on idle CPUs via the CPU architecture
support code that is expected to provide a default mechanism for this purpose.
That default mechanism usually is the least common denominator for all of the
processors implementing the architecture (i.e. CPU instruction set) in question,
however, so it is rather crude and not very energy-efficient. For this reason,
it is not recommended for production use.
The ``cpuidle.governor=`` kernel command line switch allows the ``CPUIdle``
governor to use to be specified. It has to be appended with a string matching
the name of an available governor (e.g. ``cpuidle.governor=menu``) and that
governor will be used instead of the default one. It is possible to force
the ``menu`` governor to be used on the systems that use the ``ladder`` governor
by default this way, for example.
The other kernel command line parameters controlling CPU idle time management
described below are only relevant for the *x86* architecture and references
to ``intel_idle`` affect Intel processors only.
The *x86* architecture support code recognizes three kernel command line
options related to CPU idle time management: ``idle=poll``, ``idle=halt``,
and ``idle=nomwait``. The first two of them disable the ``acpi_idle`` and
``intel_idle`` drivers altogether, which effectively causes the entire
``CPUIdle`` subsystem to be disabled and makes the idle loop invoke the
architecture support code to deal with idle CPUs. How it does that depends on
which of the two parameters is added to the kernel command line. In the
``idle=halt`` case, the architecture support code will use the ``HLT``
instruction of the CPUs (which, as a rule, suspends the execution of the program
and causes the hardware to attempt to enter the shallowest available idle state)
for this purpose, and if ``idle=poll`` is used, idle CPUs will execute a
more or less "lightweight" sequence of instructions in a tight loop. [Note
that using ``idle=poll`` is somewhat drastic in many cases, as preventing idle
CPUs from saving almost any energy at all may not be the only effect of it.
For example, on Intel hardware it effectively prevents CPUs from using
P-states (see |cpufreq|) that require any number of CPUs in a package to be
idle, so it very well may hurt single-thread computations performance as well as
energy-efficiency. Thus using it for performance reasons may not be a good idea
at all.]
The ``idle=nomwait`` option prevents the use of ``MWAIT`` instruction of
the CPU to enter idle states. When this option is used, the ``acpi_idle``
driver will use the ``HLT`` instruction instead of ``MWAIT``. On systems
running Intel processors, this option disables the ``intel_idle`` driver
and forces the use of the ``acpi_idle`` driver instead. Note that in either
case, ``acpi_idle`` driver will function only if all the information needed
by it is in the system's ACPI tables.
In addition to the architecture-level kernel command line options affecting CPU
idle time management, there are parameters affecting individual ``CPUIdle``
drivers that can be passed to them via the kernel command line. Specifically,
the ``intel_idle.max_cstate=<n>`` and ``processor.max_cstate=<n>`` parameters,
where ``<n>`` is an idle state index also used in the name of the given
state's directory in ``sysfs`` (see
`Representation of Idle States <idle-states-representation_>`_), causes the
``intel_idle`` and ``acpi_idle`` drivers, respectively, to discard all of the
idle states deeper than idle state ``<n>``. In that case, they will never ask
for any of those idle states or expose them to the governor. [The behavior of
the two drivers is different for ``<n>`` equal to ``0``. Adding
``intel_idle.max_cstate=0`` to the kernel command line disables the
``intel_idle`` driver and allows ``acpi_idle`` to be used, whereas
``processor.max_cstate=0`` is equivalent to ``processor.max_cstate=1``.
Also, the ``acpi_idle`` driver is part of the ``processor`` kernel module that
can be loaded separately and ``max_cstate=<n>`` can be passed to it as a module
parameter when it is loaded.]

View file

@ -0,0 +1,12 @@
.. SPDX-License-Identifier: GPL-2.0
================
Power Management
================
.. toctree::
:maxdepth: 2
strategies
system-wide
working-state

View file

@ -0,0 +1,939 @@
.. SPDX-License-Identifier: GPL-2.0
============================================================
Intel(R) Speed Select Technology User Guide
============================================================
The Intel(R) Speed Select Technology (Intel(R) SST) provides a powerful new
collection of features that give more granular control over CPU performance.
With Intel(R) SST, one server can be configured for power and performance for a
variety of diverse workload requirements.
Refer to the links below for an overview of the technology:
- https://www.intel.com/content/www/us/en/architecture-and-technology/speed-select-technology-article.html
- https://builders.intel.com/docs/networkbuilders/intel-speed-select-technology-base-frequency-enhancing-performance.pdf
These capabilities are further enhanced in some of the newer generations of
server platforms where these features can be enumerated and controlled
dynamically without pre-configuring via BIOS setup options. This dynamic
configuration is done via mailbox commands to the hardware. One way to enumerate
and configure these features is by using the Intel Speed Select utility.
This document explains how to use the Intel Speed Select tool to enumerate and
control Intel(R) SST features. This document gives example commands and explains
how these commands change the power and performance profile of the system under
test. Using this tool as an example, customers can replicate the messaging
implemented in the tool in their production software.
intel-speed-select configuration tool
======================================
Most Linux distribution packages may include the "intel-speed-select" tool. If not,
it can be built by downloading the Linux kernel tree from kernel.org. Once
downloaded, the tool can be built without building the full kernel.
From the kernel tree, run the following commands::
# cd tools/power/x86/intel-speed-select/
# make
# make install
Getting Help
------------
To get help with the tool, execute the command below::
# intel-speed-select --help
The top-level help describes arguments and features. Notice that there is a
multi-level help structure in the tool. For example, to get help for the feature "perf-profile"::
# intel-speed-select perf-profile --help
To get help on a command, another level of help is provided. For example for the command info "info"::
# intel-speed-select perf-profile info --help
Summary of platform capability
------------------------------
To check the current platform and driver capabilities, execute::
#intel-speed-select --info
For example on a test system::
# intel-speed-select --info
Intel(R) Speed Select Technology
Executing on CPU model: X
Platform: API version : 1
Platform: Driver version : 1
Platform: mbox supported : 1
Platform: mmio supported : 1
Intel(R) SST-PP (feature perf-profile) is supported
TDP level change control is unlocked, max level: 4
Intel(R) SST-TF (feature turbo-freq) is supported
Intel(R) SST-BF (feature base-freq) is not supported
Intel(R) SST-CP (feature core-power) is supported
Intel(R) Speed Select Technology - Performance Profile (Intel(R) SST-PP)
------------------------------------------------------------------------
This feature allows configuration of a server dynamically based on workload
performance requirements. This helps users during deployment as they do not have
to choose a specific server configuration statically. This Intel(R) Speed Select
Technology - Performance Profile (Intel(R) SST-PP) feature introduces a mechanism
that allows multiple optimized performance profiles per system. Each profile
defines a set of CPUs that need to be online and rest offline to sustain a
guaranteed base frequency. Once the user issues a command to use a specific
performance profile and meet CPU online/offline requirement, the user can expect
a change in the base frequency dynamically. This feature is called
"perf-profile" when using the Intel Speed Select tool.
Number or performance levels
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
There can be multiple performance profiles on a system. To get the number of
profiles, execute the command below::
# intel-speed-select perf-profile get-config-levels
Intel(R) Speed Select Technology
Executing on CPU model: X
package-0
die-0
cpu-0
get-config-levels:4
package-1
die-0
cpu-14
get-config-levels:4
On this system under test, there are 4 performance profiles in addition to the
base performance profile (which is performance level 0).
Lock/Unlock status
~~~~~~~~~~~~~~~~~~
Even if there are multiple performance profiles, it is possible that they
are locked. If they are locked, users cannot issue a command to change the
performance state. It is possible that there is a BIOS setup to unlock or check
with your system vendor.
To check if the system is locked, execute the following command::
# intel-speed-select perf-profile get-lock-status
Intel(R) Speed Select Technology
Executing on CPU model: X
package-0
die-0
cpu-0
get-lock-status:0
package-1
die-0
cpu-14
get-lock-status:0
In this case, lock status is 0, which means that the system is unlocked.
Properties of a performance level
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To get properties of a specific performance level (For example for the level 0, below), execute the command below::
# intel-speed-select perf-profile info -l 0
Intel(R) Speed Select Technology
Executing on CPU model: X
package-0
die-0
cpu-0
perf-profile-level-0
cpu-count:28
enable-cpu-mask:000003ff,f0003fff
enable-cpu-list:0,1,2,3,4,5,6,7,8,9,10,11,12,13,28,29,30,31,32,33,34,35,36,37,38,39,40,41
thermal-design-power-ratio:26
base-frequency(MHz):2600
speed-select-turbo-freq:disabled
speed-select-base-freq:disabled
...
...
Here -l option is used to specify a performance level.
If the option -l is omitted, then this command will print information about all
the performance levels. The above command is printing properties of the
performance level 0.
For this performance profile, the list of CPUs displayed by the
"enable-cpu-mask/enable-cpu-list" at the max can be "online." When that
condition is met, then base frequency of 2600 MHz can be maintained. To
understand more, execute "intel-speed-select perf-profile info" for performance
level 4::
# intel-speed-select perf-profile info -l 4
Intel(R) Speed Select Technology
Executing on CPU model: X
package-0
die-0
cpu-0
perf-profile-level-4
cpu-count:28
enable-cpu-mask:000000fa,f0000faf
enable-cpu-list:0,1,2,3,5,7,8,9,10,11,28,29,30,31,33,35,36,37,38,39
thermal-design-power-ratio:28
base-frequency(MHz):2800
speed-select-turbo-freq:disabled
speed-select-base-freq:unsupported
...
...
There are fewer CPUs in the "enable-cpu-mask/enable-cpu-list". Consequently, if
the user only keeps these CPUs online and the rest "offline," then the base
frequency is increased to 2.8 GHz compared to 2.6 GHz at performance level 0.
Get current performance level
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To get the current performance level, execute::
# intel-speed-select perf-profile get-config-current-level
Intel(R) Speed Select Technology
Executing on CPU model: X
package-0
die-0
cpu-0
get-config-current_level:0
First verify that the base_frequency displayed by the cpufreq sysfs is correct::
# cat /sys/devices/system/cpu/cpu0/cpufreq/base_frequency
2600000
This matches the base-frequency (MHz) field value displayed from the
"perf-profile info" command for performance level 0(cpufreq frequency is in
KHz).
To check if the average frequency is equal to the base frequency for a 100% busy
workload, disable turbo::
# echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
Then runs a busy workload on all CPUs, for example::
#stress -c 64
To verify the base frequency, run turbostat::
#turbostat -c 0-13 --show Package,Core,CPU,Bzy_MHz -i 1
Package Core CPU Bzy_MHz
- - 2600
0 0 0 2600
0 1 1 2600
0 2 2 2600
0 3 3 2600
0 4 4 2600
. . . .
Changing performance level
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To the change the performance level to 4, execute::
# intel-speed-select -d perf-profile set-config-level -l 4 -o
Intel(R) Speed Select Technology
Executing on CPU model: X
package-0
die-0
cpu-0
perf-profile
set_tdp_level:success
In the command above, "-o" is optional. If it is specified, then it will also
offline CPUs which are not present in the enable_cpu_mask for this performance
level.
Now if the base_frequency is checked::
#cat /sys/devices/system/cpu/cpu0/cpufreq/base_frequency
2800000
Which shows that the base frequency now increased from 2600 MHz at performance
level 0 to 2800 MHz at performance level 4. As a result, any workload, which can
use fewer CPUs, can see a boost of 200 MHz compared to performance level 0.
Changing performance level via BMC Interface
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
It is possible to change SST-PP level using out of band (OOB) agent (Via some
remote management console, through BMC "Baseboard Management Controller"
interface). This mode is supported from the Sapphire Rapids processor
generation. The kernel and tool change to support this mode is added to Linux
kernel version 5.18. To enable this feature, kernel config
"CONFIG_INTEL_HFI_THERMAL" is required. The minimum version of the tool
is "v1.12" to support this feature, which is part of Linux kernel version 5.18.
To support such configuration, this tool can be used as a daemon. Add
a command line option --oob::
# intel-speed-select --oob
Intel(R) Speed Select Technology
Executing on CPU model:143[0x8f]
OOB mode is enabled and will run as daemon
In this mode the tool will online/offline CPUs based on the new performance
level.
Check presence of other Intel(R) SST features
---------------------------------------------
Each of the performance profiles also specifies weather there is support of
other two Intel(R) SST features (Intel(R) Speed Select Technology - Base Frequency
(Intel(R) SST-BF) and Intel(R) Speed Select Technology - Turbo Frequency (Intel
SST-TF)).
For example, from the output of "perf-profile info" above, for level 0 and level
4:
For level 0::
speed-select-turbo-freq:disabled
speed-select-base-freq:disabled
For level 4::
speed-select-turbo-freq:disabled
speed-select-base-freq:unsupported
Given these results, the "speed-select-base-freq" (Intel(R) SST-BF) in level 4
changed from "disabled" to "unsupported" compared to performance level 0.
This means that at performance level 4, the "speed-select-base-freq" feature is
not supported. However, at performance level 0, this feature is "supported", but
currently "disabled", meaning the user has not activated this feature. Whereas
"speed-select-turbo-freq" (Intel(R) SST-TF) is supported at both performance
levels, but currently not activated by the user.
The Intel(R) SST-BF and the Intel(R) SST-TF features are built on a foundation
technology called Intel(R) Speed Select Technology - Core Power (Intel(R) SST-CP).
The platform firmware enables this feature when Intel(R) SST-BF or Intel(R) SST-TF
is supported on a platform.
Intel(R) Speed Select Technology Core Power (Intel(R) SST-CP)
---------------------------------------------------------------
Intel(R) Speed Select Technology Core Power (Intel(R) SST-CP) is an interface that
allows users to define per core priority. This defines a mechanism to distribute
power among cores when there is a power constrained scenario. This defines a
class of service (CLOS) configuration.
The user can configure up to 4 class of service configurations. Each CLOS group
configuration allows definitions of parameters, which affects how the frequency
can be limited and power is distributed. Each CPU core can be tied to a class of
service and hence an associated priority. The granularity is at core level not
at per CPU level.
Enable CLOS based prioritization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To use CLOS based prioritization feature, firmware must be informed to enable
and use a priority type. There is a default per platform priority type, which
can be changed with optional command line parameter.
To enable and check the options, execute::
# intel-speed-select core-power enable --help
Intel(R) Speed Select Technology
Executing on CPU model: X
Enable core-power for a package/die
Clos Enable: Specify priority type with [--priority|-p]
0: Proportional, 1: Ordered
There are two types of priority types:
- Ordered
Priority for ordered throttling is defined based on the index of the assigned
CLOS group. Where CLOS0 gets highest priority (throttled last).
Priority order is:
CLOS0 > CLOS1 > CLOS2 > CLOS3.
- Proportional
When proportional priority is used, there is an additional parameter called
frequency_weight, which can be specified per CLOS group. The goal of
proportional priority is to provide each core with the requested min., then
distribute all remaining (excess/deficit) budgets in proportion to a defined
weight. This proportional priority can be configured using "core-power config"
command.
To enable with the platform default priority type, execute::
# intel-speed-select core-power enable
Intel(R) Speed Select Technology
Executing on CPU model: X
package-0
die-0
cpu-0
core-power
enable:success
package-1
die-0
cpu-6
core-power
enable:success
The scope of this enable is per package or die scoped when a package contains
multiple dies. To check if CLOS is enabled and get priority type, "core-power
info" command can be used. For example to check the status of core-power feature
on CPU 0, execute::
# intel-speed-select -c 0 core-power info
Intel(R) Speed Select Technology
Executing on CPU model: X
package-0
die-0
cpu-0
core-power
support-status:supported
enable-status:enabled
clos-enable-status:enabled
priority-type:proportional
package-1
die-0
cpu-24
core-power
support-status:supported
enable-status:enabled
clos-enable-status:enabled
priority-type:proportional
Configuring CLOS groups
~~~~~~~~~~~~~~~~~~~~~~~
Each CLOS group has its own attributes including min, max, freq_weight and
desired. These parameters can be configured with "core-power config" command.
Defaults will be used if user skips setting a parameter except clos id, which is
mandatory. To check core-power config options, execute::
# intel-speed-select core-power config --help
Intel(R) Speed Select Technology
Executing on CPU model: X
Set core-power configuration for one of the four clos ids
Specify targeted clos id with [--clos|-c]
Specify clos Proportional Priority [--weight|-w]
Specify clos min in MHz with [--min|-n]
Specify clos max in MHz with [--max|-m]
For example::
# intel-speed-select core-power config -c 0
Intel(R) Speed Select Technology
Executing on CPU model: X
clos epp is not specified, default: 0
clos frequency weight is not specified, default: 0
clos min is not specified, default: 0 MHz
clos max is not specified, default: 25500 MHz
clos desired is not specified, default: 0
package-0
die-0
cpu-0
core-power
config:success
package-1
die-0
cpu-6
core-power
config:success
The user has the option to change defaults. For example, the user can change the
"min" and set the base frequency to always get guaranteed base frequency.
Get the current CLOS configuration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To check the current configuration, "core-power get-config" can be used. For
example, to get the configuration of CLOS 0::
# intel-speed-select core-power get-config -c 0
Intel(R) Speed Select Technology
Executing on CPU model: X
package-0
die-0
cpu-0
core-power
clos:0
epp:0
clos-proportional-priority:0
clos-min:0 MHz
clos-max:Max Turbo frequency
clos-desired:0 MHz
package-1
die-0
cpu-24
core-power
clos:0
epp:0
clos-proportional-priority:0
clos-min:0 MHz
clos-max:Max Turbo frequency
clos-desired:0 MHz
Associating a CPU with a CLOS group
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To associate a CPU to a CLOS group "core-power assoc" command can be used::
# intel-speed-select core-power assoc --help
Intel(R) Speed Select Technology
Executing on CPU model: X
Associate a clos id to a CPU
Specify targeted clos id with [--clos|-c]
For example to associate CPU 10 to CLOS group 3, execute::
# intel-speed-select -c 10 core-power assoc -c 3
Intel(R) Speed Select Technology
Executing on CPU model: X
package-0
die-0
cpu-10
core-power
assoc:success
Once a CPU is associated, its sibling CPUs are also associated to a CLOS group.
Once associated, avoid changing Linux "cpufreq" subsystem scaling frequency
limits.
To check the existing association for a CPU, "core-power get-assoc" command can
be used. For example, to get association of CPU 10, execute::
# intel-speed-select -c 10 core-power get-assoc
Intel(R) Speed Select Technology
Executing on CPU model: X
package-1
die-0
cpu-10
get-assoc
clos:3
This shows that CPU 10 is part of a CLOS group 3.
Disable CLOS based prioritization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To disable, execute::
# intel-speed-select core-power disable
Some features like Intel(R) SST-TF can only be enabled when CLOS based prioritization
is enabled. For this reason, disabling while Intel(R) SST-TF is enabled can cause
Intel(R) SST-TF to fail. This will cause the "disable" command to display an error
if Intel(R) SST-TF is already enabled. In turn, to disable, the Intel(R) SST-TF
feature must be disabled first.
Intel(R) Speed Select Technology - Base Frequency (Intel(R) SST-BF)
-------------------------------------------------------------------
The Intel(R) Speed Select Technology - Base Frequency (Intel(R) SST-BF) feature lets
the user control base frequency. If some critical workload threads demand
constant high guaranteed performance, then this feature can be used to execute
the thread at higher base frequency on specific sets of CPUs (high priority
CPUs) at the cost of lower base frequency (low priority CPUs) on other CPUs.
This feature does not require offline of the low priority CPUs.
The support of Intel(R) SST-BF depends on the Intel(R) Speed Select Technology -
Performance Profile (Intel(R) SST-PP) performance level configuration. It is
possible that only certain performance levels support Intel(R) SST-BF. It is also
possible that only base performance level (level = 0) has support of Intel
SST-BF. Consequently, first select the desired performance level to enable this
feature.
In the system under test here, Intel(R) SST-BF is supported at the base
performance level 0, but currently disabled. For example for the level 0::
# intel-speed-select -c 0 perf-profile info -l 0
Intel(R) Speed Select Technology
Executing on CPU model: X
package-0
die-0
cpu-0
perf-profile-level-0
...
speed-select-base-freq:disabled
...
Before enabling Intel(R) SST-BF and measuring its impact on a workload
performance, execute some workload and measure performance and get a baseline
performance to compare against.
Here the user wants more guaranteed performance. For this reason, it is likely
that turbo is disabled. To disable turbo, execute::
#echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
Based on the output of the "intel-speed-select perf-profile info -l 0" base
frequency of guaranteed frequency 2600 MHz.
Measure baseline performance for comparison
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To compare, pick a multi-threaded workload where each thread can be scheduled on
separate CPUs. "Hackbench pipe" test is a good example on how to improve
performance using Intel(R) SST-BF.
Below, the workload is measuring average scheduler wakeup latency, so a lower
number means better performance::
# taskset -c 3,4 perf bench -r 100 sched pipe
# Running 'sched/pipe' benchmark:
# Executed 1000000 pipe operations between two processes
Total time: 6.102 [sec]
6.102445 usecs/op
163868 ops/sec
While running the above test, if we take turbostat output, it will show us that
2 of the CPUs are busy and reaching max. frequency (which would be the base
frequency as the turbo is disabled). The turbostat output::
#turbostat -c 0-13 --show Package,Core,CPU,Bzy_MHz -i 1
Package Core CPU Bzy_MHz
0 0 0 1000
0 1 1 1005
0 2 2 1000
0 3 3 2600
0 4 4 2600
0 5 5 1000
0 6 6 1000
0 7 7 1005
0 8 8 1005
0 9 9 1000
0 10 10 1000
0 11 11 995
0 12 12 1000
0 13 13 1000
From the above turbostat output, both CPU 3 and 4 are very busy and reaching
full guaranteed frequency of 2600 MHz.
Intel(R) SST-BF Capabilities
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To get capabilities of Intel(R) SST-BF for the current performance level 0,
execute::
# intel-speed-select base-freq info -l 0
Intel(R) Speed Select Technology
Executing on CPU model: X
package-0
die-0
cpu-0
speed-select-base-freq
high-priority-base-frequency(MHz):3000
high-priority-cpu-mask:00000216,00002160
high-priority-cpu-list:5,6,8,13,33,34,36,41
low-priority-base-frequency(MHz):2400
tjunction-temperature(C):125
thermal-design-power(W):205
The above capabilities show that there are some CPUs on this system that can
offer base frequency of 3000 MHz compared to the standard base frequency at this
performance levels. Nevertheless, these CPUs are fixed, and they are presented
via high-priority-cpu-list/high-priority-cpu-mask. But if this Intel(R) SST-BF
feature is selected, the low priorities CPUs (which are not in
high-priority-cpu-list) can only offer up to 2400 MHz. As a result, if this
clipping of low priority CPUs is acceptable, then the user can enable Intel
SST-BF feature particularly for the above "sched pipe" workload since only two
CPUs are used, they can be scheduled on high priority CPUs and can get boost of
400 MHz.
Enable Intel(R) SST-BF
~~~~~~~~~~~~~~~~~~~~~~
To enable Intel(R) SST-BF feature, execute::
# intel-speed-select base-freq enable -a
Intel(R) Speed Select Technology
Executing on CPU model: X
package-0
die-0
cpu-0
base-freq
enable:success
package-1
die-0
cpu-14
base-freq
enable:success
In this case, -a option is optional. This not only enables Intel(R) SST-BF, but it
also adjusts the priority of cores using Intel(R) Speed Select Technology Core
Power (Intel(R) SST-CP) features. This option sets the minimum performance of each
Intel(R) Speed Select Technology - Performance Profile (Intel(R) SST-PP) class to
maximum performance so that the hardware will give maximum performance possible
for each CPU.
If -a option is not used, then the following steps are required before enabling
Intel(R) SST-BF:
- Discover Intel(R) SST-BF and note low and high priority base frequency
- Note the high priority CPU list
- Enable CLOS using core-power feature set
- Configure CLOS parameters. Use CLOS.min to set to minimum performance
- Subscribe desired CPUs to CLOS groups
With this configuration, if the same workload is executed by pinning the
workload to high priority CPUs (CPU 5 and 6 in this case)::
#taskset -c 5,6 perf bench -r 100 sched pipe
# Running 'sched/pipe' benchmark:
# Executed 1000000 pipe operations between two processes
Total time: 5.627 [sec]
5.627922 usecs/op
177685 ops/sec
This way, by enabling Intel(R) SST-BF, the performance of this benchmark is
improved (latency reduced) by 7.79%. From the turbostat output, it can be
observed that the high priority CPUs reached 3000 MHz compared to 2600 MHz.
The turbostat output::
#turbostat -c 0-13 --show Package,Core,CPU,Bzy_MHz -i 1
Package Core CPU Bzy_MHz
0 0 0 2151
0 1 1 2166
0 2 2 2175
0 3 3 2175
0 4 4 2175
0 5 5 3000
0 6 6 3000
0 7 7 2180
0 8 8 2662
0 9 9 2176
0 10 10 2175
0 11 11 2176
0 12 12 2176
0 13 13 2661
Disable Intel(R) SST-BF
~~~~~~~~~~~~~~~~~~~~~~~
To disable the Intel(R) SST-BF feature, execute::
# intel-speed-select base-freq disable -a
Intel(R) Speed Select Technology - Turbo Frequency (Intel(R) SST-TF)
--------------------------------------------------------------------
This feature enables the ability to set different "All core turbo ratio limits"
to cores based on the priority. By using this feature, some cores can be
configured to get higher turbo frequency by designating them as high priority at
the cost of lower or no turbo frequency on the low priority cores.
For this reason, this feature is only useful when system is busy utilizing all
CPUs, but the user wants some configurable option to get high performance on
some CPUs.
The support of Intel(R) Speed Select Technology - Turbo Frequency (Intel(R) SST-TF)
depends on the Intel(R) Speed Select Technology - Performance Profile (Intel
SST-PP) performance level configuration. It is possible that only a certain
performance level supports Intel(R) SST-TF. It is also possible that only the base
performance level (level = 0) has the support of Intel(R) SST-TF. Hence, first
select the desired performance level to enable this feature.
In the system under test here, Intel(R) SST-TF is supported at the base
performance level 0, but currently disabled::
# intel-speed-select -c 0 perf-profile info -l 0
Intel(R) Speed Select Technology
package-0
die-0
cpu-0
perf-profile-level-0
...
...
speed-select-turbo-freq:disabled
...
...
To check if performance can be improved using Intel(R) SST-TF feature, get the turbo
frequency properties with Intel(R) SST-TF enabled and compare to the base turbo
capability of this system.
Get Base turbo capability
~~~~~~~~~~~~~~~~~~~~~~~~~
To get the base turbo capability of performance level 0, execute::
# intel-speed-select perf-profile info -l 0
Intel(R) Speed Select Technology
Executing on CPU model: X
package-0
die-0
cpu-0
perf-profile-level-0
...
...
turbo-ratio-limits-sse
bucket-0
core-count:2
max-turbo-frequency(MHz):3200
bucket-1
core-count:4
max-turbo-frequency(MHz):3100
bucket-2
core-count:6
max-turbo-frequency(MHz):3100
bucket-3
core-count:8
max-turbo-frequency(MHz):3100
bucket-4
core-count:10
max-turbo-frequency(MHz):3100
bucket-5
core-count:12
max-turbo-frequency(MHz):3100
bucket-6
core-count:14
max-turbo-frequency(MHz):3100
bucket-7
core-count:16
max-turbo-frequency(MHz):3100
Based on the data above, when all the CPUS are busy, the max. frequency of 3100
MHz can be achieved. If there is some busy workload on cpu 0 - 11 (e.g. stress)
and on CPU 12 and 13, execute "hackbench pipe" workload::
# taskset -c 12,13 perf bench -r 100 sched pipe
# Running 'sched/pipe' benchmark:
# Executed 1000000 pipe operations between two processes
Total time: 5.705 [sec]
5.705488 usecs/op
175269 ops/sec
The turbostat output::
#turbostat -c 0-13 --show Package,Core,CPU,Bzy_MHz -i 1
Package Core CPU Bzy_MHz
0 0 0 3000
0 1 1 3000
0 2 2 3000
0 3 3 3000
0 4 4 3000
0 5 5 3100
0 6 6 3100
0 7 7 3000
0 8 8 3100
0 9 9 3000
0 10 10 3000
0 11 11 3000
0 12 12 3100
0 13 13 3100
Based on turbostat output, the performance is limited by frequency cap of 3100
MHz. To check if the hackbench performance can be improved for CPU 12 and CPU
13, first check the capability of the Intel(R) SST-TF feature for this performance
level.
Get Intel(R) SST-TF Capability
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To get the capability, the "turbo-freq info" command can be used::
# intel-speed-select turbo-freq info -l 0
Intel(R) Speed Select Technology
Executing on CPU model: X
package-0
die-0
cpu-0
speed-select-turbo-freq
bucket-0
high-priority-cores-count:2
high-priority-max-frequency(MHz):3200
high-priority-max-avx2-frequency(MHz):3200
high-priority-max-avx512-frequency(MHz):3100
bucket-1
high-priority-cores-count:4
high-priority-max-frequency(MHz):3100
high-priority-max-avx2-frequency(MHz):3000
high-priority-max-avx512-frequency(MHz):2900
bucket-2
high-priority-cores-count:6
high-priority-max-frequency(MHz):3100
high-priority-max-avx2-frequency(MHz):3000
high-priority-max-avx512-frequency(MHz):2900
speed-select-turbo-freq-clip-frequencies
low-priority-max-frequency(MHz):2600
low-priority-max-avx2-frequency(MHz):2400
low-priority-max-avx512-frequency(MHz):2100
Based on the output above, there is an Intel(R) SST-TF bucket for which there are
two high priority cores. If only two high priority cores are set, then max.
turbo frequency on those cores can be increased to 3200 MHz. This is 100 MHz
more than the base turbo capability for all cores.
In turn, for the hackbench workload, two CPUs can be set as high priority and
rest as low priority. One side effect is that once enabled, the low priority
cores will be clipped to a lower frequency of 2600 MHz.
Enable Intel(R) SST-TF
~~~~~~~~~~~~~~~~~~~~~~
To enable Intel(R) SST-TF, execute::
# intel-speed-select -c 12,13 turbo-freq enable -a
Intel(R) Speed Select Technology
Executing on CPU model: X
package-0
die-0
cpu-12
turbo-freq
enable:success
package-0
die-0
cpu-13
turbo-freq
enable:success
package--1
die-0
cpu-63
turbo-freq --auto
enable:success
In this case, the option "-a" is optional. If set, it enables Intel(R) SST-TF
feature and also sets the CPUs to high and low priority using Intel Speed
Select Technology Core Power (Intel(R) SST-CP) features. The CPU numbers passed
with "-c" arguments are marked as high priority, including its siblings.
If -a option is not used, then the following steps are required before enabling
Intel(R) SST-TF:
- Discover Intel(R) SST-TF and note buckets of high priority cores and maximum frequency
- Enable CLOS using core-power feature set - Configure CLOS parameters
- Subscribe desired CPUs to CLOS groups making sure that high priority cores are set to the maximum frequency
If the same hackbench workload is executed, schedule hackbench threads on high
priority CPUs::
#taskset -c 12,13 perf bench -r 100 sched pipe
# Running 'sched/pipe' benchmark:
# Executed 1000000 pipe operations between two processes
Total time: 5.510 [sec]
5.510165 usecs/op
180826 ops/sec
This improved performance by around 3.3% improvement on a busy system. Here the
turbostat output will show that the CPU 12 and CPU 13 are getting 100 MHz boost.
The turbostat output::
#turbostat -c 0-13 --show Package,Core,CPU,Bzy_MHz -i 1
Package Core CPU Bzy_MHz
...
0 12 12 3200
0 13 13 3200

View file

@ -0,0 +1,41 @@
.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>
======================================
Intel Performance and Energy Bias Hint
======================================
:Copyright: |copy| 2019 Intel Corporation
:Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
.. kernel-doc:: arch/x86/kernel/cpu/intel_epb.c
:doc: overview
Intel Performance and Energy Bias Attribute in ``sysfs``
========================================================
The Intel Performance and Energy Bias Hint (EPB) value for a given (logical) CPU
can be checked or updated through a ``sysfs`` attribute (file) under
:file:`/sys/devices/system/cpu/cpu<N>/power/`, where the CPU number ``<N>``
is allocated at the system initialization time:
``energy_perf_bias``
Shows the current EPB value for the CPU in a sliding scale 0 - 15, where
a value of 0 corresponds to a hint preference for highest performance
and a value of 15 corresponds to the maximum energy savings.
In order to update the EPB value for the CPU, this attribute can be
written to, either with a number in the 0 - 15 sliding scale above, or
with one of the strings: "performance", "balance-performance", "normal",
"balance-power", "power" that represent values reflected by their
meaning.
This attribute is present for all online CPUs supporting the EPB
feature.
Note that while the EPB interface to the processor is defined at the logical CPU
level, the physical register backing it may be shared by multiple CPUs (for
example, SMT siblings or cores in one package). For this reason, updating the
EPB value for one CPU may cause the EPB values for other CPUs to change.

View file

@ -0,0 +1,287 @@
.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>
==============================================
``intel_idle`` CPU Idle Time Management Driver
==============================================
:Copyright: |copy| 2020 Intel Corporation
:Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
General Information
===================
``intel_idle`` is a part of the
:doc:`CPU idle time management subsystem <cpuidle>` in the Linux kernel
(``CPUIdle``). It is the default CPU idle time management driver for the
Nehalem and later generations of Intel processors, but the level of support for
a particular processor model in it depends on whether or not it recognizes that
processor model and may also depend on information coming from the platform
firmware. [To understand ``intel_idle`` it is necessary to know how ``CPUIdle``
works in general, so this is the time to get familiar with
Documentation/admin-guide/pm/cpuidle.rst if you have not done that yet.]
``intel_idle`` uses the ``MWAIT`` instruction to inform the processor that the
logical CPU executing it is idle and so it may be possible to put some of the
processor's functional blocks into low-power states. That instruction takes two
arguments (passed in the ``EAX`` and ``ECX`` registers of the target CPU), the
first of which, referred to as a *hint*, can be used by the processor to
determine what can be done (for details refer to Intel Software Developers
Manual [1]_). Accordingly, ``intel_idle`` refuses to work with processors in
which the support for the ``MWAIT`` instruction has been disabled (for example,
via the platform firmware configuration menu) or which do not support that
instruction at all.
``intel_idle`` is not modular, so it cannot be unloaded, which means that the
only way to pass early-configuration-time parameters to it is via the kernel
command line.
.. _intel-idle-enumeration-of-states:
Enumeration of Idle States
==========================
Each ``MWAIT`` hint value is interpreted by the processor as a license to
reconfigure itself in a certain way in order to save energy. The processor
configurations (with reduced power draw) resulting from that are referred to
as C-states (in the ACPI terminology) or idle states. The list of meaningful
``MWAIT`` hint values and idle states (i.e. low-power configurations of the
processor) corresponding to them depends on the processor model and it may also
depend on the configuration of the platform.
In order to create a list of available idle states required by the ``CPUIdle``
subsystem (see :ref:`idle-states-representation` in
Documentation/admin-guide/pm/cpuidle.rst),
``intel_idle`` can use two sources of information: static tables of idle states
for different processor models included in the driver itself and the ACPI tables
of the system. The former are always used if the processor model at hand is
recognized by ``intel_idle`` and the latter are used if that is required for
the given processor model (which is the case for all server processor models
recognized by ``intel_idle``) or if the processor model is not recognized.
[There is a module parameter that can be used to make the driver use the ACPI
tables with any processor model recognized by it; see
`below <intel-idle-parameters_>`_.]
If the ACPI tables are going to be used for building the list of available idle
states, ``intel_idle`` first looks for a ``_CST`` object under one of the ACPI
objects corresponding to the CPUs in the system (refer to the ACPI specification
[2]_ for the description of ``_CST`` and its output package). Because the
``CPUIdle`` subsystem expects that the list of idle states supplied by the
driver will be suitable for all of the CPUs handled by it and ``intel_idle`` is
registered as the ``CPUIdle`` driver for all of the CPUs in the system, the
driver looks for the first ``_CST`` object returning at least one valid idle
state description and such that all of the idle states included in its return
package are of the FFH (Functional Fixed Hardware) type, which means that the
``MWAIT`` instruction is expected to be used to tell the processor that it can
enter one of them. The return package of that ``_CST`` is then assumed to be
applicable to all of the other CPUs in the system and the idle state
descriptions extracted from it are stored in a preliminary list of idle states
coming from the ACPI tables. [This step is skipped if ``intel_idle`` is
configured to ignore the ACPI tables; see `below <intel-idle-parameters_>`_.]
Next, the first (index 0) entry in the list of available idle states is
initialized to represent a "polling idle state" (a pseudo-idle state in which
the target CPU continuously fetches and executes instructions), and the
subsequent (real) idle state entries are populated as follows.
If the processor model at hand is recognized by ``intel_idle``, there is a
(static) table of idle state descriptions for it in the driver. In that case,
the "internal" table is the primary source of information on idle states and the
information from it is copied to the final list of available idle states. If
using the ACPI tables for the enumeration of idle states is not required
(depending on the processor model), all of the listed idle state are enabled by
default (so all of them will be taken into consideration by ``CPUIdle``
governors during CPU idle state selection). Otherwise, some of the listed idle
states may not be enabled by default if there are no matching entries in the
preliminary list of idle states coming from the ACPI tables. In that case user
space still can enable them later (on a per-CPU basis) with the help of
the ``disable`` idle state attribute in ``sysfs`` (see
:ref:`idle-states-representation` in
Documentation/admin-guide/pm/cpuidle.rst). This basically means that
the idle states "known" to the driver may not be enabled by default if they have
not been exposed by the platform firmware (through the ACPI tables).
If the given processor model is not recognized by ``intel_idle``, but it
supports ``MWAIT``, the preliminary list of idle states coming from the ACPI
tables is used for building the final list that will be supplied to the
``CPUIdle`` core during driver registration. For each idle state in that list,
the description, ``MWAIT`` hint and exit latency are copied to the corresponding
entry in the final list of idle states. The name of the idle state represented
by it (to be returned by the ``name`` idle state attribute in ``sysfs``) is
"CX_ACPI", where X is the index of that idle state in the final list (note that
the minimum value of X is 1, because 0 is reserved for the "polling" state), and
its target residency is based on the exit latency value. Specifically, for
C1-type idle states the exit latency value is also used as the target residency
(for compatibility with the majority of the "internal" tables of idle states for
various processor models recognized by ``intel_idle``) and for the other idle
state types (C2 and C3) the target residency value is 3 times the exit latency
(again, that is because it reflects the target residency to exit latency ratio
in the majority of cases for the processor models recognized by ``intel_idle``).
All of the idle states in the final list are enabled by default in this case.
.. _intel-idle-initialization:
Initialization
==============
The initialization of ``intel_idle`` starts with checking if the kernel command
line options forbid the use of the ``MWAIT`` instruction. If that is the case,
an error code is returned right away.
The next step is to check whether or not the processor model is known to the
driver, which determines the idle states enumeration method (see
`above <intel-idle-enumeration-of-states_>`_), and whether or not the processor
supports ``MWAIT`` (the initialization fails if that is not the case). Then,
the ``MWAIT`` support in the processor is enumerated through ``CPUID`` and the
driver initialization fails if the level of support is not as expected (for
example, if the total number of ``MWAIT`` substates returned is 0).
Next, if the driver is not configured to ignore the ACPI tables (see
`below <intel-idle-parameters_>`_), the idle states information provided by the
platform firmware is extracted from them.
Then, ``CPUIdle`` device objects are allocated for all CPUs and the list of
available idle states is created as explained
`above <intel-idle-enumeration-of-states_>`_.
Finally, ``intel_idle`` is registered with the help of cpuidle_register_driver()
as the ``CPUIdle`` driver for all CPUs in the system and a CPU online callback
for configuring individual CPUs is registered via cpuhp_setup_state(), which
(among other things) causes the callback routine to be invoked for all of the
CPUs present in the system at that time (each CPU executes its own instance of
the callback routine). That routine registers a ``CPUIdle`` device for the CPU
running it (which enables the ``CPUIdle`` subsystem to operate that CPU) and
optionally performs some CPU-specific initialization actions that may be
required for the given processor model.
.. _intel-idle-parameters:
Kernel Command Line Options and Module Parameters
=================================================
The *x86* architecture support code recognizes three kernel command line
options related to CPU idle time management: ``idle=poll``, ``idle=halt``,
and ``idle=nomwait``. If any of them is present in the kernel command line, the
``MWAIT`` instruction is not allowed to be used, so the initialization of
``intel_idle`` will fail.
Apart from that there are five module parameters recognized by ``intel_idle``
itself that can be set via the kernel command line (they cannot be updated via
sysfs, so that is the only way to change their values).
The ``max_cstate`` parameter value is the maximum idle state index in the list
of idle states supplied to the ``CPUIdle`` core during the registration of the
driver. It is also the maximum number of regular (non-polling) idle states that
can be used by ``intel_idle``, so the enumeration of idle states is terminated
after finding that number of usable idle states (the other idle states that
potentially might have been used if ``max_cstate`` had been greater are not
taken into consideration at all). Setting ``max_cstate`` can prevent
``intel_idle`` from exposing idle states that are regarded as "too deep" for
some reason to the ``CPUIdle`` core, but it does so by making them effectively
invisible until the system is shut down and started again which may not always
be desirable. In practice, it is only really necessary to do that if the idle
states in question cannot be enabled during system startup, because in the
working state of the system the CPU power management quality of service (PM
QoS) feature can be used to prevent ``CPUIdle`` from touching those idle states
even if they have been enumerated (see :ref:`cpu-pm-qos` in
Documentation/admin-guide/pm/cpuidle.rst).
Setting ``max_cstate`` to 0 causes the ``intel_idle`` initialization to fail.
The ``no_acpi`` and ``use_acpi`` module parameters (recognized by ``intel_idle``
if the kernel has been configured with ACPI support) can be set to make the
driver ignore the system's ACPI tables entirely or use them for all of the
recognized processor models, respectively (they both are unset by default and
``use_acpi`` has no effect if ``no_acpi`` is set).
The value of the ``states_off`` module parameter (0 by default) represents a
list of idle states to be disabled by default in the form of a bitmask.
Namely, the positions of the bits that are set in the ``states_off`` value are
the indices of idle states to be disabled by default (as reflected by the names
of the corresponding idle state directories in ``sysfs``, :file:`state0`,
:file:`state1` ... :file:`state<i>` ..., where ``<i>`` is the index of the given
idle state; see :ref:`idle-states-representation` in
Documentation/admin-guide/pm/cpuidle.rst).
For example, if ``states_off`` is equal to 3, the driver will disable idle
states 0 and 1 by default, and if it is equal to 8, idle state 3 will be
disabled by default and so on (bit positions beyond the maximum idle state index
are ignored).
The idle states disabled this way can be enabled (on a per-CPU basis) from user
space via ``sysfs``.
The ``ibrs_off`` module parameter is a boolean flag (defaults to
false). If set, it is used to control if IBRS (Indirect Branch Restricted
Speculation) should be turned off when the CPU enters an idle state.
This flag does not affect CPUs that use Enhanced IBRS which can remain
on with little performance impact.
For some CPUs, IBRS will be selected as mitigation for Spectre v2 and Retbleed
security vulnerabilities by default. Leaving the IBRS mode on while idling may
have a performance impact on its sibling CPU. The IBRS mode will be turned off
by default when the CPU enters into a deep idle state, but not in some
shallower ones. Setting the ``ibrs_off`` module parameter will force the IBRS
mode to off when the CPU is in any one of the available idle states. This may
help performance of a sibling CPU at the expense of a slightly higher wakeup
latency for the idle CPU.
.. _intel-idle-core-and-package-idle-states:
Core and Package Levels of Idle States
======================================
Typically, in a processor supporting the ``MWAIT`` instruction there are (at
least) two levels of idle states (or C-states). One level, referred to as
"core C-states", covers individual cores in the processor, whereas the other
level, referred to as "package C-states", covers the entire processor package
and it may also involve other components of the system (GPUs, memory
controllers, I/O hubs etc.).
Some of the ``MWAIT`` hint values allow the processor to use core C-states only
(most importantly, that is the case for the ``MWAIT`` hint value corresponding
to the ``C1`` idle state), but the majority of them give it a license to put
the target core (i.e. the core containing the logical CPU executing ``MWAIT``
with the given hint value) into a specific core C-state and then (if possible)
to enter a specific package C-state at the deeper level. For example, the
``MWAIT`` hint value representing the ``C3`` idle state allows the processor to
put the target core into the low-power state referred to as "core ``C3``" (or
``CC3``), which happens if all of the logical CPUs (SMT siblings) in that core
have executed ``MWAIT`` with the ``C3`` hint value (or with a hint value
representing a deeper idle state), and in addition to that (in the majority of
cases) it gives the processor a license to put the entire package (possibly
including some non-CPU components such as a GPU or a memory controller) into the
low-power state referred to as "package ``C3``" (or ``PC3``), which happens if
all of the cores have gone into the ``CC3`` state and (possibly) some additional
conditions are satisfied (for instance, if the GPU is covered by ``PC3``, it may
be required to be in a certain GPU-specific low-power state for ``PC3`` to be
reachable).
As a rule, there is no simple way to make the processor use core C-states only
if the conditions for entering the corresponding package C-states are met, so
the logical CPU executing ``MWAIT`` with a hint value that is not core-level
only (like for ``C1``) must always assume that this may cause the processor to
enter a package C-state. [That is why the exit latency and target residency
values corresponding to the majority of ``MWAIT`` hint values in the "internal"
tables of idle states in ``intel_idle`` reflect the properties of package
C-states.] If using package C-states is not desirable at all, either
:ref:`PM QoS <cpu-pm-qos>` or the ``max_cstate`` module parameter of
``intel_idle`` described `above <intel-idle-parameters_>`_ must be used to
restrict the range of permissible idle states to the ones with core-level only
``MWAIT`` hint values (like ``C1``).
References
==========
.. [1] *Intel® 64 and IA-32 Architectures Software Developers Manual Volume 2B*,
https://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-2b-manual.html
.. [2] *Advanced Configuration and Power Interface (ACPI) Specification*,
https://uefi.org/specifications

View file

@ -0,0 +1,770 @@
.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>
===============================================
``intel_pstate`` CPU Performance Scaling Driver
===============================================
:Copyright: |copy| 2017 Intel Corporation
:Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
General Information
===================
``intel_pstate`` is a part of the
:doc:`CPU performance scaling subsystem <cpufreq>` in the Linux kernel
(``CPUFreq``). It is a scaling driver for the Sandy Bridge and later
generations of Intel processors. Note, however, that some of those processors
may not be supported. [To understand ``intel_pstate`` it is necessary to know
how ``CPUFreq`` works in general, so this is the time to read
Documentation/admin-guide/pm/cpufreq.rst if you have not done that yet.]
For the processors supported by ``intel_pstate``, the P-state concept is broader
than just an operating frequency or an operating performance point (see the
LinuxCon Europe 2015 presentation by Kristen Accardi [1]_ for more
information about that). For this reason, the representation of P-states used
by ``intel_pstate`` internally follows the hardware specification (for details
refer to Intel Software Developers Manual [2]_). However, the ``CPUFreq`` core
uses frequencies for identifying operating performance points of CPUs and
frequencies are involved in the user space interface exposed by it, so
``intel_pstate`` maps its internal representation of P-states to frequencies too
(fortunately, that mapping is unambiguous). At the same time, it would not be
practical for ``intel_pstate`` to supply the ``CPUFreq`` core with a table of
available frequencies due to the possible size of it, so the driver does not do
that. Some functionality of the core is limited by that.
Since the hardware P-state selection interface used by ``intel_pstate`` is
available at the logical CPU level, the driver always works with individual
CPUs. Consequently, if ``intel_pstate`` is in use, every ``CPUFreq`` policy
object corresponds to one logical CPU and ``CPUFreq`` policies are effectively
equivalent to CPUs. In particular, this means that they become "inactive" every
time the corresponding CPU is taken offline and need to be re-initialized when
it goes back online.
``intel_pstate`` is not modular, so it cannot be unloaded, which means that the
only way to pass early-configuration-time parameters to it is via the kernel
command line. However, its configuration can be adjusted via ``sysfs`` to a
great extent. In some configurations it even is possible to unregister it via
``sysfs`` which allows another ``CPUFreq`` scaling driver to be loaded and
registered (see `below <status_attr_>`_).
Operation Modes
===============
``intel_pstate`` can operate in two different modes, active or passive. In the
active mode, it uses its own internal performance scaling governor algorithm or
allows the hardware to do performance scaling by itself, while in the passive
mode it responds to requests made by a generic ``CPUFreq`` governor implementing
a certain performance scaling algorithm. Which of them will be in effect
depends on what kernel command line options are used and on the capabilities of
the processor.
Active Mode
-----------
This is the default operation mode of ``intel_pstate`` for processors with
hardware-managed P-states (HWP) support. If it works in this mode, the
``scaling_driver`` policy attribute in ``sysfs`` for all ``CPUFreq`` policies
contains the string "intel_pstate".
In this mode the driver bypasses the scaling governors layer of ``CPUFreq`` and
provides its own scaling algorithms for P-state selection. Those algorithms
can be applied to ``CPUFreq`` policies in the same way as generic scaling
governors (that is, through the ``scaling_governor`` policy attribute in
``sysfs``). [Note that different P-state selection algorithms may be chosen for
different policies, but that is not recommended.]
They are not generic scaling governors, but their names are the same as the
names of some of those governors. Moreover, confusingly enough, they generally
do not work in the same way as the generic governors they share the names with.
For example, the ``powersave`` P-state selection algorithm provided by
``intel_pstate`` is not a counterpart of the generic ``powersave`` governor
(roughly, it corresponds to the ``schedutil`` and ``ondemand`` governors).
There are two P-state selection algorithms provided by ``intel_pstate`` in the
active mode: ``powersave`` and ``performance``. The way they both operate
depends on whether or not the hardware-managed P-states (HWP) feature has been
enabled in the processor and possibly on the processor model.
Which of the P-state selection algorithms is used by default depends on the
:c:macro:`CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE` kernel configuration option.
Namely, if that option is set, the ``performance`` algorithm will be used by
default, and the other one will be used by default if it is not set.
Active Mode With HWP
~~~~~~~~~~~~~~~~~~~~
If the processor supports the HWP feature, it will be enabled during the
processor initialization and cannot be disabled after that. It is possible
to avoid enabling it by passing the ``intel_pstate=no_hwp`` argument to the
kernel in the command line.
If the HWP feature has been enabled, ``intel_pstate`` relies on the processor to
select P-states by itself, but still it can give hints to the processor's
internal P-state selection logic. What those hints are depends on which P-state
selection algorithm has been applied to the given policy (or to the CPU it
corresponds to).
Even though the P-state selection is carried out by the processor automatically,
``intel_pstate`` registers utilization update callbacks with the CPU scheduler
in this mode. However, they are not used for running a P-state selection
algorithm, but for periodic updates of the current CPU frequency information to
be made available from the ``scaling_cur_freq`` policy attribute in ``sysfs``.
HWP + ``performance``
.....................
In this configuration ``intel_pstate`` will write 0 to the processor's
Energy-Performance Preference (EPP) knob (if supported) or its
Energy-Performance Bias (EPB) knob (otherwise), which means that the processor's
internal P-state selection logic is expected to focus entirely on performance.
This will override the EPP/EPB setting coming from the ``sysfs`` interface
(see `Energy vs Performance Hints`_ below). Moreover, any attempts to change
the EPP/EPB to a value different from 0 ("performance") via ``sysfs`` in this
configuration will be rejected.
Also, in this configuration the range of P-states available to the processor's
internal P-state selection logic is always restricted to the upper boundary
(that is, the maximum P-state that the driver is allowed to use).
HWP + ``powersave``
...................
In this configuration ``intel_pstate`` will set the processor's
Energy-Performance Preference (EPP) knob (if supported) or its
Energy-Performance Bias (EPB) knob (otherwise) to whatever value it was
previously set to via ``sysfs`` (or whatever default value it was
set to by the platform firmware). This usually causes the processor's
internal P-state selection logic to be less performance-focused.
Active Mode Without HWP
~~~~~~~~~~~~~~~~~~~~~~~
This operation mode is optional for processors that do not support the HWP
feature or when the ``intel_pstate=no_hwp`` argument is passed to the kernel in
the command line. The active mode is used in those cases if the
``intel_pstate=active`` argument is passed to the kernel in the command line.
In this mode ``intel_pstate`` may refuse to work with processors that are not
recognized by it. [Note that ``intel_pstate`` will never refuse to work with
any processor with the HWP feature enabled.]
In this mode ``intel_pstate`` registers utilization update callbacks with the
CPU scheduler in order to run a P-state selection algorithm, either
``powersave`` or ``performance``, depending on the ``scaling_governor`` policy
setting in ``sysfs``. The current CPU frequency information to be made
available from the ``scaling_cur_freq`` policy attribute in ``sysfs`` is
periodically updated by those utilization update callbacks too.
``performance``
...............
Without HWP, this P-state selection algorithm is always the same regardless of
the processor model and platform configuration.
It selects the maximum P-state it is allowed to use, subject to limits set via
``sysfs``, every time the driver configuration for the given CPU is updated
(e.g. via ``sysfs``).
This is the default P-state selection algorithm if the
:c:macro:`CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE` kernel configuration option
is set.
``powersave``
.............
Without HWP, this P-state selection algorithm is similar to the algorithm
implemented by the generic ``schedutil`` scaling governor except that the
utilization metric used by it is based on numbers coming from feedback
registers of the CPU. It generally selects P-states proportional to the
current CPU utilization.
This algorithm is run by the driver's utilization update callback for the
given CPU when it is invoked by the CPU scheduler, but not more often than
every 10 ms. Like in the ``performance`` case, the hardware configuration
is not touched if the new P-state turns out to be the same as the current
one.
This is the default P-state selection algorithm if the
:c:macro:`CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE` kernel configuration option
is not set.
Passive Mode
------------
This is the default operation mode of ``intel_pstate`` for processors without
hardware-managed P-states (HWP) support. It is always used if the
``intel_pstate=passive`` argument is passed to the kernel in the command line
regardless of whether or not the given processor supports HWP. [Note that the
``intel_pstate=no_hwp`` setting causes the driver to start in the passive mode
if it is not combined with ``intel_pstate=active``.] Like in the active mode
without HWP support, in this mode ``intel_pstate`` may refuse to work with
processors that are not recognized by it if HWP is prevented from being enabled
through the kernel command line.
If the driver works in this mode, the ``scaling_driver`` policy attribute in
``sysfs`` for all ``CPUFreq`` policies contains the string "intel_cpufreq".
Then, the driver behaves like a regular ``CPUFreq`` scaling driver. That is,
it is invoked by generic scaling governors when necessary to talk to the
hardware in order to change the P-state of a CPU (in particular, the
``schedutil`` governor can invoke it directly from scheduler context).
While in this mode, ``intel_pstate`` can be used with all of the (generic)
scaling governors listed by the ``scaling_available_governors`` policy attribute
in ``sysfs`` (and the P-state selection algorithms described above are not
used). Then, it is responsible for the configuration of policy objects
corresponding to CPUs and provides the ``CPUFreq`` core (and the scaling
governors attached to the policy objects) with accurate information on the
maximum and minimum operating frequencies supported by the hardware (including
the so-called "turbo" frequency ranges). In other words, in the passive mode
the entire range of available P-states is exposed by ``intel_pstate`` to the
``CPUFreq`` core. However, in this mode the driver does not register
utilization update callbacks with the CPU scheduler and the ``scaling_cur_freq``
information comes from the ``CPUFreq`` core (and is the last frequency selected
by the current scaling governor for the given policy).
.. _turbo:
Turbo P-states Support
======================
In the majority of cases, the entire range of P-states available to
``intel_pstate`` can be divided into two sub-ranges that correspond to
different types of processor behavior, above and below a boundary that
will be referred to as the "turbo threshold" in what follows.
The P-states above the turbo threshold are referred to as "turbo P-states" and
the whole sub-range of P-states they belong to is referred to as the "turbo
range". These names are related to the Turbo Boost technology allowing a
multicore processor to opportunistically increase the P-state of one or more
cores if there is enough power to do that and if that is not going to cause the
thermal envelope of the processor package to be exceeded.
Specifically, if software sets the P-state of a CPU core within the turbo range
(that is, above the turbo threshold), the processor is permitted to take over
performance scaling control for that core and put it into turbo P-states of its
choice going forward. However, that permission is interpreted differently by
different processor generations. Namely, the Sandy Bridge generation of
processors will never use any P-states above the last one set by software for
the given core, even if it is within the turbo range, whereas all of the later
processor generations will take it as a license to use any P-states from the
turbo range, even above the one set by software. In other words, on those
processors setting any P-state from the turbo range will enable the processor
to put the given core into all turbo P-states up to and including the maximum
supported one as it sees fit.
One important property of turbo P-states is that they are not sustainable. More
precisely, there is no guarantee that any CPUs will be able to stay in any of
those states indefinitely, because the power distribution within the processor
package may change over time or the thermal envelope it was designed for might
be exceeded if a turbo P-state was used for too long.
In turn, the P-states below the turbo threshold generally are sustainable. In
fact, if one of them is set by software, the processor is not expected to change
it to a lower one unless in a thermal stress or a power limit violation
situation (a higher P-state may still be used if it is set for another CPU in
the same package at the same time, for example).
Some processors allow multiple cores to be in turbo P-states at the same time,
but the maximum P-state that can be set for them generally depends on the number
of cores running concurrently. The maximum turbo P-state that can be set for 3
cores at the same time usually is lower than the analogous maximum P-state for
2 cores, which in turn usually is lower than the maximum turbo P-state that can
be set for 1 core. The one-core maximum turbo P-state is thus the maximum
supported one overall.
The maximum supported turbo P-state, the turbo threshold (the maximum supported
non-turbo P-state) and the minimum supported P-state are specific to the
processor model and can be determined by reading the processor's model-specific
registers (MSRs). Moreover, some processors support the Configurable TDP
(Thermal Design Power) feature and, when that feature is enabled, the turbo
threshold effectively becomes a configurable value that can be set by the
platform firmware.
Unlike ``_PSS`` objects in the ACPI tables, ``intel_pstate`` always exposes
the entire range of available P-states, including the whole turbo range, to the
``CPUFreq`` core and (in the passive mode) to generic scaling governors. This
generally causes turbo P-states to be set more often when ``intel_pstate`` is
used relative to ACPI-based CPU performance scaling (see `below <acpi-cpufreq_>`_
for more information).
Moreover, since ``intel_pstate`` always knows what the real turbo threshold is
(even if the Configurable TDP feature is enabled in the processor), its
``no_turbo`` attribute in ``sysfs`` (described `below <no_turbo_attr_>`_) should
work as expected in all cases (that is, if set to disable turbo P-states, it
always should prevent ``intel_pstate`` from using them).
Processor Support
=================
To handle a given processor ``intel_pstate`` requires a number of different
pieces of information on it to be known, including:
* The minimum supported P-state.
* The maximum supported `non-turbo P-state <turbo_>`_.
* Whether or not turbo P-states are supported at all.
* The maximum supported `one-core turbo P-state <turbo_>`_ (if turbo P-states
are supported).
* The scaling formula to translate the driver's internal representation
of P-states into frequencies and the other way around.
Generally, ways to obtain that information are specific to the processor model
or family. Although it often is possible to obtain all of it from the processor
itself (using model-specific registers), there are cases in which hardware
manuals need to be consulted to get to it too.
For this reason, there is a list of supported processors in ``intel_pstate`` and
the driver initialization will fail if the detected processor is not in that
list, unless it supports the HWP feature. [The interface to obtain all of the
information listed above is the same for all of the processors supporting the
HWP feature, which is why ``intel_pstate`` works with all of them.]
User Space Interface in ``sysfs``
=================================
Global Attributes
-----------------
``intel_pstate`` exposes several global attributes (files) in ``sysfs`` to
control its functionality at the system level. They are located in the
``/sys/devices/system/cpu/intel_pstate/`` directory and affect all CPUs.
Some of them are not present if the ``intel_pstate=per_cpu_perf_limits``
argument is passed to the kernel in the command line.
``max_perf_pct``
Maximum P-state the driver is allowed to set in percent of the
maximum supported performance level (the highest supported `turbo
P-state <turbo_>`_).
This attribute will not be exposed if the
``intel_pstate=per_cpu_perf_limits`` argument is present in the kernel
command line.
``min_perf_pct``
Minimum P-state the driver is allowed to set in percent of the
maximum supported performance level (the highest supported `turbo
P-state <turbo_>`_).
This attribute will not be exposed if the
``intel_pstate=per_cpu_perf_limits`` argument is present in the kernel
command line.
``num_pstates``
Number of P-states supported by the processor (between 0 and 255
inclusive) including both turbo and non-turbo P-states (see
`Turbo P-states Support`_).
This attribute is present only if the value exposed by it is the same
for all of the CPUs in the system.
The value of this attribute is not affected by the ``no_turbo``
setting described `below <no_turbo_attr_>`_.
This attribute is read-only.
``turbo_pct``
Ratio of the `turbo range <turbo_>`_ size to the size of the entire
range of supported P-states, in percent.
This attribute is present only if the value exposed by it is the same
for all of the CPUs in the system.
This attribute is read-only.
.. _no_turbo_attr:
``no_turbo``
If set (equal to 1), the driver is not allowed to set any turbo P-states
(see `Turbo P-states Support`_). If unset (equal to 0, which is the
default), turbo P-states can be set by the driver.
[Note that ``intel_pstate`` does not support the general ``boost``
attribute (supported by some other scaling drivers) which is replaced
by this one.]
This attribute does not affect the maximum supported frequency value
supplied to the ``CPUFreq`` core and exposed via the policy interface,
but it affects the maximum possible value of per-policy P-state limits
(see `Interpretation of Policy Attributes`_ below for details).
``hwp_dynamic_boost``
This attribute is only present if ``intel_pstate`` works in the
`active mode with the HWP feature enabled <Active Mode With HWP_>`_ in
the processor. If set (equal to 1), it causes the minimum P-state limit
to be increased dynamically for a short time whenever a task previously
waiting on I/O is selected to run on a given logical CPU (the purpose
of this mechanism is to improve performance).
This setting has no effect on logical CPUs whose minimum P-state limit
is directly set to the highest non-turbo P-state or above it.
.. _status_attr:
``status``
Operation mode of the driver: "active", "passive" or "off".
"active"
The driver is functional and in the `active mode
<Active Mode_>`_.
"passive"
The driver is functional and in the `passive mode
<Passive Mode_>`_.
"off"
The driver is not functional (it is not registered as a scaling
driver with the ``CPUFreq`` core).
This attribute can be written to in order to change the driver's
operation mode or to unregister it. The string written to it must be
one of the possible values of it and, if successful, the write will
cause the driver to switch over to the operation mode represented by
that string - or to be unregistered in the "off" case. [Actually,
switching over from the active mode to the passive mode or the other
way around causes the driver to be unregistered and registered again
with a different set of callbacks, so all of its settings (the global
as well as the per-policy ones) are then reset to their default
values, possibly depending on the target operation mode.]
``energy_efficiency``
This attribute is only present on platforms with CPUs matching the Kaby
Lake or Coffee Lake desktop CPU model. By default, energy-efficiency
optimizations are disabled on these CPU models if HWP is enabled.
Enabling energy-efficiency optimizations may limit maximum operating
frequency with or without the HWP feature. With HWP enabled, the
optimizations are done only in the turbo frequency range. Without it,
they are done in the entire available frequency range. Setting this
attribute to "1" enables the energy-efficiency optimizations and setting
to "0" disables them.
Interpretation of Policy Attributes
-----------------------------------
The interpretation of some ``CPUFreq`` policy attributes described in
Documentation/admin-guide/pm/cpufreq.rst is special with ``intel_pstate``
as the current scaling driver and it generally depends on the driver's
`operation mode <Operation Modes_>`_.
First of all, the values of the ``cpuinfo_max_freq``, ``cpuinfo_min_freq`` and
``scaling_cur_freq`` attributes are produced by applying a processor-specific
multiplier to the internal P-state representation used by ``intel_pstate``.
Also, the values of the ``scaling_max_freq`` and ``scaling_min_freq``
attributes are capped by the frequency corresponding to the maximum P-state that
the driver is allowed to set.
If the ``no_turbo`` `global attribute <no_turbo_attr_>`_ is set, the driver is
not allowed to use turbo P-states, so the maximum value of ``scaling_max_freq``
and ``scaling_min_freq`` is limited to the maximum non-turbo P-state frequency.
Accordingly, setting ``no_turbo`` causes ``scaling_max_freq`` and
``scaling_min_freq`` to go down to that value if they were above it before.
However, the old values of ``scaling_max_freq`` and ``scaling_min_freq`` will be
restored after unsetting ``no_turbo``, unless these attributes have been written
to after ``no_turbo`` was set.
If ``no_turbo`` is not set, the maximum possible value of ``scaling_max_freq``
and ``scaling_min_freq`` corresponds to the maximum supported turbo P-state,
which also is the value of ``cpuinfo_max_freq`` in either case.
Next, the following policy attributes have special meaning if
``intel_pstate`` works in the `active mode <Active Mode_>`_:
``scaling_available_governors``
List of P-state selection algorithms provided by ``intel_pstate``.
``scaling_governor``
P-state selection algorithm provided by ``intel_pstate`` currently in
use with the given policy.
``scaling_cur_freq``
Frequency of the average P-state of the CPU represented by the given
policy for the time interval between the last two invocations of the
driver's utilization update callback by the CPU scheduler for that CPU.
One more policy attribute is present if the HWP feature is enabled in the
processor:
``base_frequency``
Shows the base frequency of the CPU. Any frequency above this will be
in the turbo frequency range.
The meaning of these attributes in the `passive mode <Passive Mode_>`_ is the
same as for other scaling drivers.
Additionally, the value of the ``scaling_driver`` attribute for ``intel_pstate``
depends on the operation mode of the driver. Namely, it is either
"intel_pstate" (in the `active mode <Active Mode_>`_) or "intel_cpufreq" (in the
`passive mode <Passive Mode_>`_).
Coordination of P-State Limits
------------------------------
``intel_pstate`` allows P-state limits to be set in two ways: with the help of
the ``max_perf_pct`` and ``min_perf_pct`` `global attributes
<Global Attributes_>`_ or via the ``scaling_max_freq`` and ``scaling_min_freq``
``CPUFreq`` policy attributes. The coordination between those limits is based
on the following rules, regardless of the current operation mode of the driver:
1. All CPUs are affected by the global limits (that is, none of them can be
requested to run faster than the global maximum and none of them can be
requested to run slower than the global minimum).
2. Each individual CPU is affected by its own per-policy limits (that is, it
cannot be requested to run faster than its own per-policy maximum and it
cannot be requested to run slower than its own per-policy minimum). The
effective performance depends on whether the platform supports per core
P-states, hyper-threading is enabled and on current performance requests
from other CPUs. When platform doesn't support per core P-states, the
effective performance can be more than the policy limits set on a CPU, if
other CPUs are requesting higher performance at that moment. Even with per
core P-states support, when hyper-threading is enabled, if the sibling CPU
is requesting higher performance, the other siblings will get higher
performance than their policy limits.
3. The global and per-policy limits can be set independently.
In the `active mode with the HWP feature enabled <Active Mode With HWP_>`_, the
resulting effective values are written into hardware registers whenever the
limits change in order to request its internal P-state selection logic to always
set P-states within these limits. Otherwise, the limits are taken into account
by scaling governors (in the `passive mode <Passive Mode_>`_) and by the driver
every time before setting a new P-state for a CPU.
Additionally, if the ``intel_pstate=per_cpu_perf_limits`` command line argument
is passed to the kernel, ``max_perf_pct`` and ``min_perf_pct`` are not exposed
at all and the only way to set the limits is by using the policy attributes.
Energy vs Performance Hints
---------------------------
If the hardware-managed P-states (HWP) is enabled in the processor, additional
attributes, intended to allow user space to help ``intel_pstate`` to adjust the
processor's internal P-state selection logic by focusing it on performance or on
energy-efficiency, or somewhere between the two extremes, are present in every
``CPUFreq`` policy directory in ``sysfs``. They are :
``energy_performance_preference``
Current value of the energy vs performance hint for the given policy
(or the CPU represented by it).
The hint can be changed by writing to this attribute.
``energy_performance_available_preferences``
List of strings that can be written to the
``energy_performance_preference`` attribute.
They represent different energy vs performance hints and should be
self-explanatory, except that ``default`` represents whatever hint
value was set by the platform firmware.
Strings written to the ``energy_performance_preference`` attribute are
internally translated to integer values written to the processor's
Energy-Performance Preference (EPP) knob (if supported) or its
Energy-Performance Bias (EPB) knob. It is also possible to write a positive
integer value between 0 to 255, if the EPP feature is present. If the EPP
feature is not present, writing integer value to this attribute is not
supported. In this case, user can use the
"/sys/devices/system/cpu/cpu*/power/energy_perf_bias" interface.
[Note that tasks may by migrated from one CPU to another by the scheduler's
load-balancing algorithm and if different energy vs performance hints are
set for those CPUs, that may lead to undesirable outcomes. To avoid such
issues it is better to set the same energy vs performance hint for all CPUs
or to pin every task potentially sensitive to them to a specific CPU.]
.. _acpi-cpufreq:
``intel_pstate`` vs ``acpi-cpufreq``
====================================
On the majority of systems supported by ``intel_pstate``, the ACPI tables
provided by the platform firmware contain ``_PSS`` objects returning information
that can be used for CPU performance scaling (refer to the ACPI specification
[3]_ for details on the ``_PSS`` objects and the format of the information
returned by them).
The information returned by the ACPI ``_PSS`` objects is used by the
``acpi-cpufreq`` scaling driver. On systems supported by ``intel_pstate``
the ``acpi-cpufreq`` driver uses the same hardware CPU performance scaling
interface, but the set of P-states it can use is limited by the ``_PSS``
output.
On those systems each ``_PSS`` object returns a list of P-states supported by
the corresponding CPU which basically is a subset of the P-states range that can
be used by ``intel_pstate`` on the same system, with one exception: the whole
`turbo range <turbo_>`_ is represented by one item in it (the topmost one). By
convention, the frequency returned by ``_PSS`` for that item is greater by 1 MHz
than the frequency of the highest non-turbo P-state listed by it, but the
corresponding P-state representation (following the hardware specification)
returned for it matches the maximum supported turbo P-state (or is the
special value 255 meaning essentially "go as high as you can get").
The list of P-states returned by ``_PSS`` is reflected by the table of
available frequencies supplied by ``acpi-cpufreq`` to the ``CPUFreq`` core and
scaling governors and the minimum and maximum supported frequencies reported by
it come from that list as well. In particular, given the special representation
of the turbo range described above, this means that the maximum supported
frequency reported by ``acpi-cpufreq`` is higher by 1 MHz than the frequency
of the highest supported non-turbo P-state listed by ``_PSS`` which, of course,
affects decisions made by the scaling governors, except for ``powersave`` and
``performance``.
For example, if a given governor attempts to select a frequency proportional to
estimated CPU load and maps the load of 100% to the maximum supported frequency
(possibly multiplied by a constant), then it will tend to choose P-states below
the turbo threshold if ``acpi-cpufreq`` is used as the scaling driver, because
in that case the turbo range corresponds to a small fraction of the frequency
band it can use (1 MHz vs 1 GHz or more). In consequence, it will only go to
the turbo range for the highest loads and the other loads above 50% that might
benefit from running at turbo frequencies will be given non-turbo P-states
instead.
One more issue related to that may appear on systems supporting the
`Configurable TDP feature <turbo_>`_ allowing the platform firmware to set the
turbo threshold. Namely, if that is not coordinated with the lists of P-states
returned by ``_PSS`` properly, there may be more than one item corresponding to
a turbo P-state in those lists and there may be a problem with avoiding the
turbo range (if desirable or necessary). Usually, to avoid using turbo
P-states overall, ``acpi-cpufreq`` simply avoids using the topmost state listed
by ``_PSS``, but that is not sufficient when there are other turbo P-states in
the list returned by it.
Apart from the above, ``acpi-cpufreq`` works like ``intel_pstate`` in the
`passive mode <Passive Mode_>`_, except that the number of P-states it can set
is limited to the ones listed by the ACPI ``_PSS`` objects.
Kernel Command Line Options for ``intel_pstate``
================================================
Several kernel command line options can be used to pass early-configuration-time
parameters to ``intel_pstate`` in order to enforce specific behavior of it. All
of them have to be prepended with the ``intel_pstate=`` prefix.
``disable``
Do not register ``intel_pstate`` as the scaling driver even if the
processor is supported by it.
``active``
Register ``intel_pstate`` in the `active mode <Active Mode_>`_ to start
with.
``passive``
Register ``intel_pstate`` in the `passive mode <Passive Mode_>`_ to
start with.
``force``
Register ``intel_pstate`` as the scaling driver instead of
``acpi-cpufreq`` even if the latter is preferred on the given system.
This may prevent some platform features (such as thermal controls and
power capping) that rely on the availability of ACPI P-states
information from functioning as expected, so it should be used with
caution.
This option does not work with processors that are not supported by
``intel_pstate`` and on platforms where the ``pcc-cpufreq`` scaling
driver is used instead of ``acpi-cpufreq``.
``no_hwp``
Do not enable the hardware-managed P-states (HWP) feature even if it is
supported by the processor.
``hwp_only``
Register ``intel_pstate`` as the scaling driver only if the
hardware-managed P-states (HWP) feature is supported by the processor.
``support_acpi_ppc``
Take ACPI ``_PPC`` performance limits into account.
If the preferred power management profile in the FADT (Fixed ACPI
Description Table) is set to "Enterprise Server" or "Performance
Server", the ACPI ``_PPC`` limits are taken into account by default
and this option has no effect.
``per_cpu_perf_limits``
Use per-logical-CPU P-State limits (see `Coordination of P-state
Limits`_ for details).
Diagnostics and Tuning
======================
Trace Events
------------
There are two static trace events that can be used for ``intel_pstate``
diagnostics. One of them is the ``cpu_frequency`` trace event generally used
by ``CPUFreq``, and the other one is the ``pstate_sample`` trace event specific
to ``intel_pstate``. Both of them are triggered by ``intel_pstate`` only if
it works in the `active mode <Active Mode_>`_.
The following sequence of shell commands can be used to enable them and see
their output (if the kernel is generally configured to support event tracing)::
# cd /sys/kernel/tracing/
# echo 1 > events/power/pstate_sample/enable
# echo 1 > events/power/cpu_frequency/enable
# cat trace
gnome-terminal--4510 [001] ..s. 1177.680733: pstate_sample: core_busy=107 scaled=94 from=26 to=26 mperf=1143818 aperf=1230607 tsc=29838618 freq=2474476
cat-5235 [002] ..s. 1177.681723: cpu_frequency: state=2900000 cpu_id=2
If ``intel_pstate`` works in the `passive mode <Passive Mode_>`_, the
``cpu_frequency`` trace event will be triggered either by the ``schedutil``
scaling governor (for the policies it is attached to), or by the ``CPUFreq``
core (for the policies with other scaling governors).
``ftrace``
----------
The ``ftrace`` interface can be used for low-level diagnostics of
``intel_pstate``. For example, to check how often the function to set a
P-state is called, the ``ftrace`` filter can be set to
:c:func:`intel_pstate_set_pstate`::
# cd /sys/kernel/tracing/
# cat available_filter_functions | grep -i pstate
intel_pstate_set_pstate
intel_pstate_cpu_init
...
# echo intel_pstate_set_pstate > set_ftrace_filter
# echo function > current_tracer
# cat trace | head -15
# tracer: function
#
# entries-in-buffer/entries-written: 80/80 #P:4
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
Xorg-3129 [000] ..s. 2537.644844: intel_pstate_set_pstate <-intel_pstate_timer_func
gnome-terminal--4510 [002] ..s. 2537.649844: intel_pstate_set_pstate <-intel_pstate_timer_func
gnome-shell-3409 [001] ..s. 2537.650850: intel_pstate_set_pstate <-intel_pstate_timer_func
<idle>-0 [000] ..s. 2537.654843: intel_pstate_set_pstate <-intel_pstate_timer_func
References
==========
.. [1] Kristen Accardi, *Balancing Power and Performance in the Linux Kernel*,
https://events.static.linuxfound.org/sites/events/files/slides/LinuxConEurope_2015.pdf
.. [2] *Intel® 64 and IA-32 Architectures Software Developers Manual Volume 3: System Programming Guide*,
https://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-system-programming-manual-325384.html
.. [3] *Advanced Configuration and Power Interface Specification*,
https://uefi.org/sites/default/files/resources/ACPI_6_3_final_Jan30.pdf

View file

@ -0,0 +1,174 @@
.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>
==============================
Intel Uncore Frequency Scaling
==============================
:Copyright: |copy| 2022-2023 Intel Corporation
:Author: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Introduction
------------
The uncore can consume significant amount of power in Intel's Xeon servers based
on the workload characteristics. To optimize the total power and improve overall
performance, SoCs have internal algorithms for scaling uncore frequency. These
algorithms monitor workload usage of uncore and set a desirable frequency.
It is possible that users have different expectations of uncore performance and
want to have control over it. The objective is similar to allowing users to set
the scaling min/max frequencies via cpufreq sysfs to improve CPU performance.
Users may have some latency sensitive workloads where they do not want any
change to uncore frequency. Also, users may have workloads which require
different core and uncore performance at distinct phases and they may want to
use both cpufreq and the uncore scaling interface to distribute power and
improve overall performance.
Sysfs Interface
---------------
To control uncore frequency, a sysfs interface is provided in the directory:
`/sys/devices/system/cpu/intel_uncore_frequency/`.
There is one directory for each package and die combination as the scope of
uncore scaling control is per die in multiple die/package SoCs or per
package for single die per package SoCs. The name represents the
scope of control. For example: 'package_00_die_00' is for package id 0 and
die 0.
Each package_*_die_* contains the following attributes:
``initial_max_freq_khz``
Out of reset, this attribute represent the maximum possible frequency.
This is a read-only attribute. If users adjust max_freq_khz,
they can always go back to maximum using the value from this attribute.
``initial_min_freq_khz``
Out of reset, this attribute represent the minimum possible frequency.
This is a read-only attribute. If users adjust min_freq_khz,
they can always go back to minimum using the value from this attribute.
``max_freq_khz``
This attribute is used to set the maximum uncore frequency.
``min_freq_khz``
This attribute is used to set the minimum uncore frequency.
``current_freq_khz``
This attribute is used to get the current uncore frequency.
SoCs with TPMI (Topology Aware Register and PM Capsule Interface)
-----------------------------------------------------------------
An SoC can contain multiple power domains with individual or collection
of mesh partitions. This partition is called fabric cluster.
Certain type of meshes will need to run at the same frequency, they will
be placed in the same fabric cluster. Benefit of fabric cluster is that it
offers a scalable mechanism to deal with partitioned fabrics in a SoC.
The current sysfs interface supports controls at package and die level.
This interface is not enough to support more granular control at
fabric cluster level.
SoCs with the support of TPMI (Topology Aware Register and PM Capsule
Interface), can have multiple power domains. Each power domain can
contain one or more fabric clusters.
To represent controls at fabric cluster level in addition to the
controls at package and die level (like systems without TPMI
support), sysfs is enhanced. This granular interface is presented in the
sysfs with directories names prefixed with "uncore". For example:
uncore00, uncore01 etc.
The scope of control is specified by attributes "package_id", "domain_id"
and "fabric_cluster_id" in the directory.
Attributes in each directory:
``domain_id``
This attribute is used to get the power domain id of this instance.
``fabric_cluster_id``
This attribute is used to get the fabric cluster id of this instance.
``package_id``
This attribute is used to get the package id of this instance.
The other attributes are same as presented at package_*_die_* level.
In most of current use cases, the "max_freq_khz" and "min_freq_khz"
is updated at "package_*_die_*" level. This model will be still supported
with the following approach:
When user uses controls at "package_*_die_*" level, then every fabric
cluster is affected in that package and die. For example: user changes
"max_freq_khz" in the package_00_die_00, then "max_freq_khz" for uncore*
directory with the same package id will be updated. In this case user can
still update "max_freq_khz" at each uncore* level, which is more restrictive.
Similarly, user can update "min_freq_khz" at "package_*_die_*" level
to apply at each uncore* level.
Support for "current_freq_khz" is available only at each fabric cluster
level (i.e., in uncore* directory).
Efficiency vs. Latency Tradeoff
-------------------------------
The Efficiency Latency Control (ELC) feature improves performance
per watt. With this feature hardware power management algorithms
optimize trade-off between latency and power consumption. For some
latency sensitive workloads further tuning can be done by SW to
get desired performance.
The hardware monitors the average CPU utilization across all cores
in a power domain at regular intervals and decides an uncore frequency.
While this may result in the best performance per watt, workload may be
expecting higher performance at the expense of power. Consider an
application that intermittently wakes up to perform memory reads on an
otherwise idle system. In such cases, if hardware lowers uncore
frequency, then there may be delay in ramp up of frequency to meet
target performance.
The ELC control defines some parameters which can be changed from SW.
If the average CPU utilization is below a user-defined threshold
(elc_low_threshold_percent attribute below), the user-defined uncore
floor frequency will be used (elc_floor_freq_khz attribute below)
instead of hardware calculated minimum.
Similarly in high load scenario where the CPU utilization goes above
the high threshold value (elc_high_threshold_percent attribute below)
instead of jumping to maximum uncore frequency, frequency is increased
in 100MHz steps. This avoids consuming unnecessarily high power
immediately with CPU utilization spikes.
Attributes for efficiency latency control:
``elc_floor_freq_khz``
This attribute is used to get/set the efficiency latency floor frequency.
If this variable is lower than the 'min_freq_khz', it is ignored by
the firmware.
``elc_low_threshold_percent``
This attribute is used to get/set the efficiency latency control low
threshold. This attribute is in percentages of CPU utilization.
``elc_high_threshold_percent``
This attribute is used to get/set the efficiency latency control high
threshold. This attribute is in percentages of CPU utilization.
``elc_high_threshold_enable``
This attribute is used to enable/disable the efficiency latency control
high threshold. Write '1' to enable, '0' to disable.
Example system configuration below, which does following:
* when CPU utilization is less than 10%: sets uncore frequency to 800MHz
* when CPU utilization is higher than 95%: increases uncore frequency in
100MHz steps, until power limit is reached
elc_floor_freq_khz:800000
elc_high_threshold_percent:95
elc_high_threshold_enable:1
elc_low_threshold_percent:10

View file

@ -0,0 +1,291 @@
.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>
===================
System Sleep States
===================
:Copyright: |copy| 2017 Intel Corporation
:Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Sleep states are global low-power states of the entire system in which user
space code cannot be executed and the overall system activity is significantly
reduced.
Sleep States That Can Be Supported
==================================
Depending on its configuration and the capabilities of the platform it runs on,
the Linux kernel can support up to four system sleep states, including
hibernation and up to three variants of system suspend. The sleep states that
can be supported by the kernel are listed below.
.. _s2idle:
Suspend-to-Idle
---------------
This is a generic, pure software, light-weight variant of system suspend (also
referred to as S2I or S2Idle). It allows more energy to be saved relative to
runtime idle by freezing user space, suspending the timekeeping and putting all
I/O devices into low-power states (possibly lower-power than available in the
working state), such that the processors can spend time in their deepest idle
states while the system is suspended.
The system is woken up from this state by in-band interrupts, so theoretically
any devices that can cause interrupts to be generated in the working state can
also be set up as wakeup devices for S2Idle.
This state can be used on platforms without support for :ref:`standby <standby>`
or :ref:`suspend-to-RAM <s2ram>`, or it can be used in addition to any of the
deeper system suspend variants to provide reduced resume latency. It is always
supported if the :c:macro:`CONFIG_SUSPEND` kernel configuration option is set.
.. _standby:
Standby
-------
This state, if supported, offers moderate, but real, energy savings, while
providing a relatively straightforward transition back to the working state. No
operating state is lost (the system core logic retains power), so the system can
go back to where it left off easily enough.
In addition to freezing user space, suspending the timekeeping and putting all
I/O devices into low-power states, which is done for :ref:`suspend-to-idle
<s2idle>` too, nonboot CPUs are taken offline and all low-level system functions
are suspended during transitions into this state. For this reason, it should
allow more energy to be saved relative to :ref:`suspend-to-idle <s2idle>`, but
the resume latency will generally be greater than for that state.
The set of devices that can wake up the system from this state usually is
reduced relative to :ref:`suspend-to-idle <s2idle>` and it may be necessary to
rely on the platform for setting up the wakeup functionality as appropriate.
This state is supported if the :c:macro:`CONFIG_SUSPEND` kernel configuration
option is set and the support for it is registered by the platform with the
core system suspend subsystem. On ACPI-based systems this state is mapped to
the S1 system state defined by ACPI.
.. _s2ram:
Suspend-to-RAM
--------------
This state (also referred to as STR or S2RAM), if supported, offers significant
energy savings as everything in the system is put into a low-power state, except
for memory, which should be placed into the self-refresh mode to retain its
contents. All of the steps carried out when entering :ref:`standby <standby>`
are also carried out during transitions to S2RAM. Additional operations may
take place depending on the platform capabilities. In particular, on ACPI-based
systems the kernel passes control to the platform firmware (BIOS) as the last
step during S2RAM transitions and that usually results in powering down some
more low-level components that are not directly controlled by the kernel.
The state of devices and CPUs is saved and held in memory. All devices are
suspended and put into low-power states. In many cases, all peripheral buses
lose power when entering S2RAM, so devices must be able to handle the transition
back to the "on" state.
On ACPI-based systems S2RAM requires some minimal boot-strapping code in the
platform firmware to resume the system from it. This may be the case on other
platforms too.
The set of devices that can wake up the system from S2RAM usually is reduced
relative to :ref:`suspend-to-idle <s2idle>` and :ref:`standby <standby>` and it
may be necessary to rely on the platform for setting up the wakeup functionality
as appropriate.
S2RAM is supported if the :c:macro:`CONFIG_SUSPEND` kernel configuration option
is set and the support for it is registered by the platform with the core system
suspend subsystem. On ACPI-based systems it is mapped to the S3 system state
defined by ACPI.
.. _hibernation:
Hibernation
-----------
This state (also referred to as Suspend-to-Disk or STD) offers the greatest
energy savings and can be used even in the absence of low-level platform support
for system suspend. However, it requires some low-level code for resuming the
system to be present for the underlying CPU architecture.
Hibernation is significantly different from any of the system suspend variants.
It takes three system state changes to put it into hibernation and two system
state changes to resume it.
First, when hibernation is triggered, the kernel stops all system activity and
creates a snapshot image of memory to be written into persistent storage. Next,
the system goes into a state in which the snapshot image can be saved, the image
is written out and finally the system goes into the target low-power state in
which power is cut from almost all of its hardware components, including memory,
except for a limited set of wakeup devices.
Once the snapshot image has been written out, the system may either enter a
special low-power state (like ACPI S4), or it may simply power down itself.
Powering down means minimum power draw and it allows this mechanism to work on
any system. However, entering a special low-power state may allow additional
means of system wakeup to be used (e.g. pressing a key on the keyboard or
opening a laptop lid).
After wakeup, control goes to the platform firmware that runs a boot loader
which boots a fresh instance of the kernel (control may also go directly to
the boot loader, depending on the system configuration, but anyway it causes
a fresh instance of the kernel to be booted). That new instance of the kernel
(referred to as the ``restore kernel``) looks for a hibernation image in
persistent storage and if one is found, it is loaded into memory. Next, all
activity in the system is stopped and the restore kernel overwrites itself with
the image contents and jumps into a special trampoline area in the original
kernel stored in the image (referred to as the ``image kernel``), which is where
the special architecture-specific low-level code is needed. Finally, the
image kernel restores the system to the pre-hibernation state and allows user
space to run again.
Hibernation is supported if the :c:macro:`CONFIG_HIBERNATION` kernel
configuration option is set. However, this option can only be set if support
for the given CPU architecture includes the low-level code for system resume.
Basic ``sysfs`` Interfaces for System Suspend and Hibernation
=============================================================
The power management subsystem provides userspace with a unified ``sysfs``
interface for system sleep regardless of the underlying system architecture or
platform. That interface is located in the :file:`/sys/power/` directory
(assuming that ``sysfs`` is mounted at :file:`/sys`) and it consists of the
following attributes (files):
``state``
This file contains a list of strings representing sleep states supported
by the kernel. Writing one of these strings into it causes the kernel
to start a transition of the system into the sleep state represented by
that string.
In particular, the "disk", "freeze" and "standby" strings represent the
:ref:`hibernation <hibernation>`, :ref:`suspend-to-idle <s2idle>` and
:ref:`standby <standby>` sleep states, respectively. The "mem" string
is interpreted in accordance with the contents of the ``mem_sleep`` file
described below.
If the kernel does not support any system sleep states, this file is
not present.
``mem_sleep``
This file contains a list of strings representing supported system
suspend variants and allows user space to select the variant to be
associated with the "mem" string in the ``state`` file described above.
The strings that may be present in this file are "s2idle", "shallow"
and "deep". The "s2idle" string always represents :ref:`suspend-to-idle
<s2idle>` and, by convention, "shallow" and "deep" represent
:ref:`standby <standby>` and :ref:`suspend-to-RAM <s2ram>`,
respectively.
Writing one of the listed strings into this file causes the system
suspend variant represented by it to be associated with the "mem" string
in the ``state`` file. The string representing the suspend variant
currently associated with the "mem" string in the ``state`` file is
shown in square brackets.
If the kernel does not support system suspend, this file is not present.
``disk``
This file controls the operating mode of hibernation (Suspend-to-Disk).
Specifically, it tells the kernel what to do after creating a
hibernation image.
Reading from it returns a list of supported options encoded as:
``platform``
Put the system into a special low-power state (e.g. ACPI S4) to
make additional wakeup options available and possibly allow the
platform firmware to take a simplified initialization path after
wakeup.
It is only available if the platform provides a special
mechanism to put the system to sleep after creating a
hibernation image (platforms with ACPI do that as a rule, for
example).
``shutdown``
Power off the system.
``reboot``
Reboot the system (useful for diagnostics mostly).
``suspend``
Hybrid system suspend. Put the system into the suspend sleep
state selected through the ``mem_sleep`` file described above.
If the system is successfully woken up from that state, discard
the hibernation image and continue. Otherwise, use the image
to restore the previous state of the system.
It is available if system suspend is supported.
``test_resume``
Diagnostic operation. Load the image as though the system had
just woken up from hibernation and the currently running kernel
instance was a restore kernel and follow up with full system
resume.
Writing one of the strings listed above into this file causes the option
represented by it to be selected.
The currently selected option is shown in square brackets, which means
that the operation represented by it will be carried out after creating
and saving the image when hibernation is triggered by writing ``disk``
to :file:`/sys/power/state`.
If the kernel does not support hibernation, this file is not present.
``image_size``
This file controls the size of hibernation images.
It can be written a string representing a non-negative integer that will
be used as a best-effort upper limit of the image size, in bytes. The
hibernation core will do its best to ensure that the image size will not
exceed that number, but if that turns out to be impossible to achieve, a
hibernation image will still be created and its size will be as small as
possible. In particular, writing '0' to this file causes the size of
hibernation images to be minimum.
Reading from it returns the current image size limit, which is set to
around 2/5 of the available RAM size by default.
``pm_trace``
This file controls the "PM trace" mechanism saving the last suspend
or resume event point in the RTC memory across reboots. It helps to
debug hard lockups or reboots due to device driver failures that occur
during system suspend or resume (which is more common) more effectively.
If it contains "1", the fingerprint of each suspend/resume event point
in turn will be stored in the RTC memory (overwriting the actual RTC
information), so it will survive a system crash if one occurs right
after storing it and it can be used later to identify the driver that
caused the crash to happen.
It contains "0" by default, which may be changed to "1" by writing a
string representing a nonzero integer into it.
According to the above, there are two ways to make the system go into the
:ref:`suspend-to-idle <s2idle>` state. The first one is to write "freeze"
directly to :file:`/sys/power/state`. The second one is to write "s2idle" to
:file:`/sys/power/mem_sleep` and then to write "mem" to
:file:`/sys/power/state`. Likewise, there are two ways to make the system go
into the :ref:`standby <standby>` state (the strings to write to the control
files in that case are "standby" or "shallow" and "mem", respectively) if that
state is supported by the platform. However, there is only one way to make the
system go into the :ref:`suspend-to-RAM <s2ram>` state (write "deep" into
:file:`/sys/power/mem_sleep` and "mem" into :file:`/sys/power/state`).
The default suspend variant (ie. the one to be used without writing anything
into :file:`/sys/power/mem_sleep`) is either "deep" (on the majority of systems
supporting :ref:`suspend-to-RAM <s2ram>`) or "s2idle", but it can be overridden
by the value of the ``mem_sleep_default`` parameter in the kernel command line.
On some systems with ACPI, depending on the information in the ACPI tables, the
default may be "s2idle" even if :ref:`suspend-to-RAM <s2ram>` is supported in
principle.

View file

@ -0,0 +1,56 @@
.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>
===========================
Power Management Strategies
===========================
:Copyright: |copy| 2017 Intel Corporation
:Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The Linux kernel supports two major high-level power management strategies.
One of them is based on using global low-power states of the whole system in
which user space code cannot be executed and the overall system activity is
significantly reduced, referred to as :doc:`sleep states <sleep-states>`. The
kernel puts the system into one of these states when requested by user space
and the system stays in it until a special signal is received from one of
designated devices, triggering a transition to the ``working state`` in which
user space code can run. Because sleep states are global and the whole system
is affected by the state changes, this strategy is referred to as the
:doc:`system-wide power management <system-wide>`.
The other strategy, referred to as the :doc:`working-state power management
<working-state>`, is based on adjusting the power states of individual hardware
components of the system, as needed, in the working state. In consequence, if
this strategy is in use, the working state of the system usually does not
correspond to any particular physical configuration of it, but can be treated as
a metastate covering a range of different power states of the system in which
the individual components of it can be either ``active`` (in use) or
``inactive`` (idle). If they are active, they have to be in power states
allowing them to process data and to be accessed by software. In turn, if they
are inactive, ideally, they should be in low-power states in which they may not
be accessible.
If all of the system components are active, the system as a whole is regarded as
"runtime active" and that situation typically corresponds to the maximum power
draw (or maximum energy usage) of it. If all of them are inactive, the system
as a whole is regarded as "runtime idle" which may be very close to a sleep
state from the physical system configuration and power draw perspective, but
then it takes much less time and effort to start executing user space code than
for the same system in a sleep state. However, transitions from sleep states
back to the working state can only be started by a limited set of devices, so
typically the system can spend much more time in a sleep state than it can be
runtime idle in one go. For this reason, systems usually use less energy in
sleep states than when they are runtime idle most of the time.
Moreover, the two power management strategies address different usage scenarios.
Namely, if the user indicates that the system will not be in use going forward,
for example by closing its lid (if the system is a laptop), it probably should
go into a sleep state at that point. On the other hand, if the user simply goes
away from the laptop keyboard, it probably should stay in the working state and
use the working-state power management in case it becomes idle, because the user
may come back to it at any time and then may want the system to be immediately
accessible.

View file

@ -0,0 +1,270 @@
.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>
=========================
System Suspend Code Flows
=========================
:Copyright: |copy| 2020 Intel Corporation
:Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
At least one global system-wide transition needs to be carried out for the
system to get from the working state into one of the supported
:doc:`sleep states <sleep-states>`. Hibernation requires more than one
transition to occur for this purpose, but the other sleep states, commonly
referred to as *system-wide suspend* (or simply *system suspend*) states, need
only one.
For those sleep states, the transition from the working state of the system into
the target sleep state is referred to as *system suspend* too (in the majority
of cases, whether this means a transition or a sleep state of the system should
be clear from the context) and the transition back from the sleep state into the
working state is referred to as *system resume*.
The kernel code flows associated with the suspend and resume transitions for
different sleep states of the system are quite similar, but there are some
significant differences between the :ref:`suspend-to-idle <s2idle>` code flows
and the code flows related to the :ref:`suspend-to-RAM <s2ram>` and
:ref:`standby <standby>` sleep states.
The :ref:`suspend-to-RAM <s2ram>` and :ref:`standby <standby>` sleep states
cannot be implemented without platform support and the difference between them
boils down to the platform-specific actions carried out by the suspend and
resume hooks that need to be provided by the platform driver to make them
available. Apart from that, the suspend and resume code flows for these sleep
states are mostly identical, so they both together will be referred to as
*platform-dependent suspend* states in what follows.
.. _s2idle_suspend:
Suspend-to-idle Suspend Code Flow
=================================
The following steps are taken in order to transition the system from the working
state to the :ref:`suspend-to-idle <s2idle>` sleep state:
1. Invoking system-wide suspend notifiers.
Kernel subsystems can register callbacks to be invoked when the suspend
transition is about to occur and when the resume transition has finished.
That allows them to prepare for the change of the system state and to clean
up after getting back to the working state.
2. Freezing tasks.
Tasks are frozen primarily in order to avoid unchecked hardware accesses
from user space through MMIO regions or I/O registers exposed directly to
it and to prevent user space from entering the kernel while the next step
of the transition is in progress (which might have been problematic for
various reasons).
All user space tasks are intercepted as though they were sent a signal and
put into uninterruptible sleep until the end of the subsequent system resume
transition.
The kernel threads that choose to be frozen during system suspend for
specific reasons are frozen subsequently, but they are not intercepted.
Instead, they are expected to periodically check whether or not they need
to be frozen and to put themselves into uninterruptible sleep if so. [Note,
however, that kernel threads can use locking and other concurrency controls
available in kernel space to synchronize themselves with system suspend and
resume, which can be much more precise than the freezing, so the latter is
not a recommended option for kernel threads.]
3. Suspending devices and reconfiguring IRQs.
Devices are suspended in four phases called *prepare*, *suspend*,
*late suspend* and *noirq suspend* (see :ref:`driverapi_pm_devices` for more
information on what exactly happens in each phase).
Every device is visited in each phase, but typically it is not physically
accessed in more than two of them.
The runtime PM API is disabled for every device during the *late* suspend
phase and high-level ("action") interrupt handlers are prevented from being
invoked before the *noirq* suspend phase.
Interrupts are still handled after that, but they are only acknowledged to
interrupt controllers without performing any device-specific actions that
would be triggered in the working state of the system (those actions are
deferred till the subsequent system resume transition as described
`below <s2idle_resume_>`_).
IRQs associated with system wakeup devices are "armed" so that the resume
transition of the system is started when one of them signals an event.
4. Freezing the scheduler tick and suspending timekeeping.
When all devices have been suspended, CPUs enter the idle loop and are put
into the deepest available idle state. While doing that, each of them
"freezes" its own scheduler tick so that the timer events associated with
the tick do not occur until the CPU is woken up by another interrupt source.
The last CPU to enter the idle state also stops the timekeeping which
(among other things) prevents high resolution timers from triggering going
forward until the first CPU that is woken up restarts the timekeeping.
That allows the CPUs to stay in the deep idle state relatively long in one
go.
From this point on, the CPUs can only be woken up by non-timer hardware
interrupts. If that happens, they go back to the idle state unless the
interrupt that woke up one of them comes from an IRQ that has been armed for
system wakeup, in which case the system resume transition is started.
.. _s2idle_resume:
Suspend-to-idle Resume Code Flow
================================
The following steps are taken in order to transition the system from the
:ref:`suspend-to-idle <s2idle>` sleep state into the working state:
1. Resuming timekeeping and unfreezing the scheduler tick.
When one of the CPUs is woken up (by a non-timer hardware interrupt), it
leaves the idle state entered in the last step of the preceding suspend
transition, restarts the timekeeping (unless it has been restarted already
by another CPU that woke up earlier) and the scheduler tick on that CPU is
unfrozen.
If the interrupt that has woken up the CPU was armed for system wakeup,
the system resume transition begins.
2. Resuming devices and restoring the working-state configuration of IRQs.
Devices are resumed in four phases called *noirq resume*, *early resume*,
*resume* and *complete* (see :ref:`driverapi_pm_devices` for more
information on what exactly happens in each phase).
Every device is visited in each phase, but typically it is not physically
accessed in more than two of them.
The working-state configuration of IRQs is restored after the *noirq* resume
phase and the runtime PM API is re-enabled for every device whose driver
supports it during the *early* resume phase.
3. Thawing tasks.
Tasks frozen in step 2 of the preceding `suspend <s2idle_suspend_>`_
transition are "thawed", which means that they are woken up from the
uninterruptible sleep that they went into at that time and user space tasks
are allowed to exit the kernel.
4. Invoking system-wide resume notifiers.
This is analogous to step 1 of the `suspend <s2idle_suspend_>`_ transition
and the same set of callbacks is invoked at this point, but a different
"notification type" parameter value is passed to them.
Platform-dependent Suspend Code Flow
====================================
The following steps are taken in order to transition the system from the working
state to platform-dependent suspend state:
1. Invoking system-wide suspend notifiers.
This step is the same as step 1 of the suspend-to-idle suspend transition
described `above <s2idle_suspend_>`_.
2. Freezing tasks.
This step is the same as step 2 of the suspend-to-idle suspend transition
described `above <s2idle_suspend_>`_.
3. Suspending devices and reconfiguring IRQs.
This step is analogous to step 3 of the suspend-to-idle suspend transition
described `above <s2idle_suspend_>`_, but the arming of IRQs for system
wakeup generally does not have any effect on the platform.
There are platforms that can go into a very deep low-power state internally
when all CPUs in them are in sufficiently deep idle states and all I/O
devices have been put into low-power states. On those platforms,
suspend-to-idle can reduce system power very effectively.
On the other platforms, however, low-level components (like interrupt
controllers) need to be turned off in a platform-specific way (implemented
in the hooks provided by the platform driver) to achieve comparable power
reduction.
That usually prevents in-band hardware interrupts from waking up the system,
which must be done in a special platform-dependent way. Then, the
configuration of system wakeup sources usually starts when system wakeup
devices are suspended and is finalized by the platform suspend hooks later
on.
4. Disabling non-boot CPUs.
On some platforms the suspend hooks mentioned above must run in a one-CPU
configuration of the system (in particular, the hardware cannot be accessed
by any code running in parallel with the platform suspend hooks that may,
and often do, trap into the platform firmware in order to finalize the
suspend transition).
For this reason, the CPU offline/online (CPU hotplug) framework is used
to take all of the CPUs in the system, except for one (the boot CPU),
offline (typically, the CPUs that have been taken offline go into deep idle
states).
This means that all tasks are migrated away from those CPUs and all IRQs are
rerouted to the only CPU that remains online.
5. Suspending core system components.
This prepares the core system components for (possibly) losing power going
forward and suspends the timekeeping.
6. Platform-specific power removal.
This is expected to remove power from all of the system components except
for the memory controller and RAM (in order to preserve the contents of the
latter) and some devices designated for system wakeup.
In many cases control is passed to the platform firmware which is expected
to finalize the suspend transition as needed.
Platform-dependent Resume Code Flow
===================================
The following steps are taken in order to transition the system from a
platform-dependent suspend state into the working state:
1. Platform-specific system wakeup.
The platform is woken up by a signal from one of the designated system
wakeup devices (which need not be an in-band hardware interrupt) and
control is passed back to the kernel (the working configuration of the
platform may need to be restored by the platform firmware before the
kernel gets control again).
2. Resuming core system components.
The suspend-time configuration of the core system components is restored and
the timekeeping is resumed.
3. Re-enabling non-boot CPUs.
The CPUs disabled in step 4 of the preceding suspend transition are taken
back online and their suspend-time configuration is restored.
4. Resuming devices and restoring the working-state configuration of IRQs.
This step is the same as step 2 of the suspend-to-idle suspend transition
described `above <s2idle_resume_>`_.
5. Thawing tasks.
This step is the same as step 3 of the suspend-to-idle suspend transition
described `above <s2idle_resume_>`_.
6. Invoking system-wide resume notifiers.
This step is the same as step 4 of the suspend-to-idle suspend transition
described `above <s2idle_resume_>`_.

View file

@ -0,0 +1,11 @@
.. SPDX-License-Identifier: GPL-2.0
============================
System-Wide Power Management
============================
.. toctree::
:maxdepth: 2
sleep-states
suspend-flows

View file

@ -0,0 +1,18 @@
.. SPDX-License-Identifier: GPL-2.0
==============================
Working-State Power Management
==============================
.. toctree::
:maxdepth: 2
cpuidle
intel_idle
cpufreq
intel_pstate
amd-pstate
cpufreq_drivers
intel_epb
intel-speed-select
intel_uncore_frequency_scaling