Adding upstream version 6.12.33.
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
This commit is contained in:
parent
89eabb05c2
commit
79d69e5050
86698 changed files with 39662057 additions and 0 deletions
804
Documentation/admin-guide/pm/amd-pstate.rst
Normal file
804
Documentation/admin-guide/pm/amd-pstate.rst
Normal file
|
@ -0,0 +1,804 @@
|
|||
.. SPDX-License-Identifier: GPL-2.0
|
||||
.. include:: <isonum.txt>
|
||||
|
||||
===============================================
|
||||
``amd-pstate`` CPU Performance Scaling Driver
|
||||
===============================================
|
||||
|
||||
:Copyright: |copy| 2021 Advanced Micro Devices, Inc.
|
||||
|
||||
:Author: Huang Rui <ray.huang@amd.com>
|
||||
|
||||
|
||||
Introduction
|
||||
===================
|
||||
|
||||
``amd-pstate`` is the AMD CPU performance scaling driver that introduces a
|
||||
new CPU frequency control mechanism on modern AMD APU and CPU series in
|
||||
Linux kernel. The new mechanism is based on Collaborative Processor
|
||||
Performance Control (CPPC) which provides finer grain frequency management
|
||||
than legacy ACPI hardware P-States. Current AMD CPU/APU platforms are using
|
||||
the ACPI P-states driver to manage CPU frequency and clocks with switching
|
||||
only in 3 P-states. CPPC replaces the ACPI P-states controls and allows a
|
||||
flexible, low-latency interface for the Linux kernel to directly
|
||||
communicate the performance hints to hardware.
|
||||
|
||||
``amd-pstate`` leverages the Linux kernel governors such as ``schedutil``,
|
||||
``ondemand``, etc. to manage the performance hints which are provided by
|
||||
CPPC hardware functionality that internally follows the hardware
|
||||
specification (for details refer to AMD64 Architecture Programmer's Manual
|
||||
Volume 2: System Programming [1]_). Currently, ``amd-pstate`` supports basic
|
||||
frequency control function according to kernel governors on some of the
|
||||
Zen2 and Zen3 processors, and we will implement more AMD specific functions
|
||||
in future after we verify them on the hardware and SBIOS.
|
||||
|
||||
|
||||
AMD CPPC Overview
|
||||
=======================
|
||||
|
||||
Collaborative Processor Performance Control (CPPC) interface enumerates a
|
||||
continuous, abstract, and unit-less performance value in a scale that is
|
||||
not tied to a specific performance state / frequency. This is an ACPI
|
||||
standard [2]_ which software can specify application performance goals and
|
||||
hints as a relative target to the infrastructure limits. AMD processors
|
||||
provide the low latency register model (MSR) instead of an AML code
|
||||
interpreter for performance adjustments. ``amd-pstate`` will initialize a
|
||||
``struct cpufreq_driver`` instance, ``amd_pstate_driver``, with the callbacks
|
||||
to manage each performance update behavior. ::
|
||||
|
||||
Highest Perf ------>+-----------------------+ +-----------------------+
|
||||
| | | |
|
||||
| | | |
|
||||
| | Max Perf ---->| |
|
||||
| | | |
|
||||
| | | |
|
||||
Nominal Perf ------>+-----------------------+ +-----------------------+
|
||||
| | | |
|
||||
| | | |
|
||||
| | | |
|
||||
| | | |
|
||||
| | | |
|
||||
| | | |
|
||||
| | Desired Perf ---->| |
|
||||
| | | |
|
||||
| | | |
|
||||
| | | |
|
||||
| | | |
|
||||
| | | |
|
||||
| | | |
|
||||
| | | |
|
||||
| | | |
|
||||
| | | |
|
||||
Lowest non- | | | |
|
||||
linear perf ------>+-----------------------+ +-----------------------+
|
||||
| | | |
|
||||
| | Lowest perf ---->| |
|
||||
| | | |
|
||||
Lowest perf ------>+-----------------------+ +-----------------------+
|
||||
| | | |
|
||||
| | | |
|
||||
| | | |
|
||||
0 ------>+-----------------------+ +-----------------------+
|
||||
|
||||
AMD P-States Performance Scale
|
||||
|
||||
|
||||
.. _perf_cap:
|
||||
|
||||
AMD CPPC Performance Capability
|
||||
--------------------------------
|
||||
|
||||
Highest Performance (RO)
|
||||
.........................
|
||||
|
||||
This is the absolute maximum performance an individual processor may reach,
|
||||
assuming ideal conditions. This performance level may not be sustainable
|
||||
for long durations and may only be achievable if other platform components
|
||||
are in a specific state; for example, it may require other processors to be in
|
||||
an idle state. This would be equivalent to the highest frequencies
|
||||
supported by the processor.
|
||||
|
||||
Nominal (Guaranteed) Performance (RO)
|
||||
......................................
|
||||
|
||||
This is the maximum sustained performance level of the processor, assuming
|
||||
ideal operating conditions. In the absence of an external constraint (power,
|
||||
thermal, etc.), this is the performance level the processor is expected to
|
||||
be able to maintain continuously. All cores/processors are expected to be
|
||||
able to sustain their nominal performance state simultaneously.
|
||||
|
||||
Lowest non-linear Performance (RO)
|
||||
...................................
|
||||
|
||||
This is the lowest performance level at which nonlinear power savings are
|
||||
achieved, for example, due to the combined effects of voltage and frequency
|
||||
scaling. Above this threshold, lower performance levels should be generally
|
||||
more energy efficient than higher performance levels. This register
|
||||
effectively conveys the most efficient performance level to ``amd-pstate``.
|
||||
|
||||
Lowest Performance (RO)
|
||||
........................
|
||||
|
||||
This is the absolute lowest performance level of the processor. Selecting a
|
||||
performance level lower than the lowest nonlinear performance level may
|
||||
cause an efficiency penalty but should reduce the instantaneous power
|
||||
consumption of the processor.
|
||||
|
||||
AMD CPPC Performance Control
|
||||
------------------------------
|
||||
|
||||
``amd-pstate`` passes performance goals through these registers. The
|
||||
register drives the behavior of the desired performance target.
|
||||
|
||||
Minimum requested performance (RW)
|
||||
...................................
|
||||
|
||||
``amd-pstate`` specifies the minimum allowed performance level.
|
||||
|
||||
Maximum requested performance (RW)
|
||||
...................................
|
||||
|
||||
``amd-pstate`` specifies a limit the maximum performance that is expected
|
||||
to be supplied by the hardware.
|
||||
|
||||
Desired performance target (RW)
|
||||
...................................
|
||||
|
||||
``amd-pstate`` specifies a desired target in the CPPC performance scale as
|
||||
a relative number. This can be expressed as percentage of nominal
|
||||
performance (infrastructure max). Below the nominal sustained performance
|
||||
level, desired performance expresses the average performance level of the
|
||||
processor subject to hardware. Above the nominal performance level,
|
||||
the processor must provide at least nominal performance requested and go higher
|
||||
if current operating conditions allow.
|
||||
|
||||
Energy Performance Preference (EPP) (RW)
|
||||
.........................................
|
||||
|
||||
This attribute provides a hint to the hardware if software wants to bias
|
||||
toward performance (0x0) or energy efficiency (0xff).
|
||||
|
||||
|
||||
Key Governors Support
|
||||
=======================
|
||||
|
||||
``amd-pstate`` can be used with all the (generic) scaling governors listed
|
||||
by the ``scaling_available_governors`` policy attribute in ``sysfs``. Then,
|
||||
it is responsible for the configuration of policy objects corresponding to
|
||||
CPUs and provides the ``CPUFreq`` core (and the scaling governors attached
|
||||
to the policy objects) with accurate information on the maximum and minimum
|
||||
operating frequencies supported by the hardware. Users can check the
|
||||
``scaling_cur_freq`` information comes from the ``CPUFreq`` core.
|
||||
|
||||
``amd-pstate`` mainly supports ``schedutil`` and ``ondemand`` for dynamic
|
||||
frequency control. It is to fine tune the processor configuration on
|
||||
``amd-pstate`` to the ``schedutil`` with CPU CFS scheduler. ``amd-pstate``
|
||||
registers the adjust_perf callback to implement performance update behavior
|
||||
similar to CPPC. It is initialized by ``sugov_start`` and then populates the
|
||||
CPU's update_util_data pointer to assign ``sugov_update_single_perf`` as the
|
||||
utilization update callback function in the CPU scheduler. The CPU scheduler
|
||||
will call ``cpufreq_update_util`` and assigns the target performance according
|
||||
to the ``struct sugov_cpu`` that the utilization update belongs to.
|
||||
Then, ``amd-pstate`` updates the desired performance according to the CPU
|
||||
scheduler assigned.
|
||||
|
||||
.. _processor_support:
|
||||
|
||||
Processor Support
|
||||
=======================
|
||||
|
||||
The ``amd-pstate`` initialization will fail if the ``_CPC`` entry in the ACPI
|
||||
SBIOS does not exist in the detected processor. It uses ``acpi_cpc_valid``
|
||||
to check the existence of ``_CPC``. All Zen based processors support the legacy
|
||||
ACPI hardware P-States function, so when ``amd-pstate`` fails initialization,
|
||||
the kernel will fall back to initialize the ``acpi-cpufreq`` driver.
|
||||
|
||||
There are two types of hardware implementations for ``amd-pstate``: one is
|
||||
`Full MSR Support <perf_cap_>`_ and another is `Shared Memory Support
|
||||
<perf_cap_>`_. It can use the :c:macro:`X86_FEATURE_CPPC` feature flag to
|
||||
indicate the different types. (For details, refer to the Processor Programming
|
||||
Reference (PPR) for AMD Family 19h Model 51h, Revision A1 Processors [3]_.)
|
||||
``amd-pstate`` is to register different ``static_call`` instances for different
|
||||
hardware implementations.
|
||||
|
||||
Currently, some of the Zen2 and Zen3 processors support ``amd-pstate``. In the
|
||||
future, it will be supported on more and more AMD processors.
|
||||
|
||||
Full MSR Support
|
||||
-----------------
|
||||
|
||||
Some new Zen3 processors such as Cezanne provide the MSR registers directly
|
||||
while the :c:macro:`X86_FEATURE_CPPC` CPU feature flag is set.
|
||||
``amd-pstate`` can handle the MSR register to implement the fast switch
|
||||
function in ``CPUFreq`` that can reduce the latency of frequency control in
|
||||
interrupt context. The functions with a ``pstate_xxx`` prefix represent the
|
||||
operations on MSR registers.
|
||||
|
||||
Shared Memory Support
|
||||
----------------------
|
||||
|
||||
If the :c:macro:`X86_FEATURE_CPPC` CPU feature flag is not set, the
|
||||
processor supports the shared memory solution. In this case, ``amd-pstate``
|
||||
uses the ``cppc_acpi`` helper methods to implement the callback functions
|
||||
that are defined on ``static_call``. The functions with the ``cppc_xxx`` prefix
|
||||
represent the operations of ACPI CPPC helpers for the shared memory solution.
|
||||
|
||||
|
||||
AMD P-States and ACPI hardware P-States always can be supported in one
|
||||
processor. But AMD P-States has the higher priority and if it is enabled
|
||||
with :c:macro:`MSR_AMD_CPPC_ENABLE` or ``cppc_set_enable``, it will respond
|
||||
to the request from AMD P-States.
|
||||
|
||||
|
||||
User Space Interface in ``sysfs`` - Per-policy control
|
||||
======================================================
|
||||
|
||||
``amd-pstate`` exposes several global attributes (files) in ``sysfs`` to
|
||||
control its functionality at the system level. They are located in the
|
||||
``/sys/devices/system/cpu/cpufreq/policyX/`` directory and affect all CPUs. ::
|
||||
|
||||
root@hr-test1:/home/ray# ls /sys/devices/system/cpu/cpufreq/policy0/*amd*
|
||||
/sys/devices/system/cpu/cpufreq/policy0/amd_pstate_highest_perf
|
||||
/sys/devices/system/cpu/cpufreq/policy0/amd_pstate_lowest_nonlinear_freq
|
||||
/sys/devices/system/cpu/cpufreq/policy0/amd_pstate_max_freq
|
||||
|
||||
|
||||
``amd_pstate_highest_perf / amd_pstate_max_freq``
|
||||
|
||||
Maximum CPPC performance and CPU frequency that the driver is allowed to
|
||||
set, in percent of the maximum supported CPPC performance level (the highest
|
||||
performance supported in `AMD CPPC Performance Capability <perf_cap_>`_).
|
||||
In some ASICs, the highest CPPC performance is not the one in the ``_CPC``
|
||||
table, so we need to expose it to sysfs. If boost is not active, but
|
||||
still supported, this maximum frequency will be larger than the one in
|
||||
``cpuinfo``. On systems that support preferred core, the driver will have
|
||||
different values for some cores than others and this will reflect the values
|
||||
advertised by the platform at bootup.
|
||||
This attribute is read-only.
|
||||
|
||||
``amd_pstate_lowest_nonlinear_freq``
|
||||
|
||||
The lowest non-linear CPPC CPU frequency that the driver is allowed to set,
|
||||
in percent of the maximum supported CPPC performance level. (Please see the
|
||||
lowest non-linear performance in `AMD CPPC Performance Capability
|
||||
<perf_cap_>`_.)
|
||||
This attribute is read-only.
|
||||
|
||||
``amd_pstate_hw_prefcore``
|
||||
|
||||
Whether the platform supports the preferred core feature and it has been
|
||||
enabled. This attribute is read-only.
|
||||
|
||||
``amd_pstate_prefcore_ranking``
|
||||
|
||||
The performance ranking of the core. This number doesn't have any unit, but
|
||||
larger numbers are preferred at the time of reading. This can change at
|
||||
runtime based on platform conditions. This attribute is read-only.
|
||||
|
||||
``energy_performance_available_preferences``
|
||||
|
||||
A list of all the supported EPP preferences that could be used for
|
||||
``energy_performance_preference`` on this system.
|
||||
These profiles represent different hints that are provided
|
||||
to the low-level firmware about the user's desired energy vs efficiency
|
||||
tradeoff. ``default`` represents the epp value is set by platform
|
||||
firmware. This attribute is read-only.
|
||||
|
||||
``energy_performance_preference``
|
||||
|
||||
The current energy performance preference can be read from this attribute.
|
||||
and user can change current preference according to energy or performance needs
|
||||
Please get all support profiles list from
|
||||
``energy_performance_available_preferences`` attribute, all the profiles are
|
||||
integer values defined between 0 to 255 when EPP feature is enabled by platform
|
||||
firmware, if EPP feature is disabled, driver will ignore the written value
|
||||
This attribute is read-write.
|
||||
|
||||
``boost``
|
||||
The `boost` sysfs attribute provides control over the CPU core
|
||||
performance boost, allowing users to manage the maximum frequency limitation
|
||||
of the CPU. This attribute can be used to enable or disable the boost feature
|
||||
on individual CPUs.
|
||||
|
||||
When the boost feature is enabled, the CPU can dynamically increase its frequency
|
||||
beyond the base frequency, providing enhanced performance for demanding workloads.
|
||||
On the other hand, disabling the boost feature restricts the CPU to operate at the
|
||||
base frequency, which may be desirable in certain scenarios to prioritize power
|
||||
efficiency or manage temperature.
|
||||
|
||||
To manipulate the `boost` attribute, users can write a value of `0` to disable the
|
||||
boost or `1` to enable it, for the respective CPU using the sysfs path
|
||||
`/sys/devices/system/cpu/cpuX/cpufreq/boost`, where `X` represents the CPU number.
|
||||
|
||||
Other performance and frequency values can be read back from
|
||||
``/sys/devices/system/cpu/cpuX/acpi_cppc/``, see :ref:`cppc_sysfs`.
|
||||
|
||||
|
||||
``amd-pstate`` vs ``acpi-cpufreq``
|
||||
======================================
|
||||
|
||||
On the majority of AMD platforms supported by ``acpi-cpufreq``, the ACPI tables
|
||||
provided by the platform firmware are used for CPU performance scaling, but
|
||||
only provide 3 P-states on AMD processors.
|
||||
However, on modern AMD APU and CPU series, hardware provides the Collaborative
|
||||
Processor Performance Control according to the ACPI protocol and customizes this
|
||||
for AMD platforms. That is, fine-grained and continuous frequency ranges
|
||||
instead of the legacy hardware P-states. ``amd-pstate`` is the kernel
|
||||
module which supports the new AMD P-States mechanism on most of the future AMD
|
||||
platforms. The AMD P-States mechanism is the more performance and energy
|
||||
efficiency frequency management method on AMD processors.
|
||||
|
||||
|
||||
``amd-pstate`` Driver Operation Modes
|
||||
======================================
|
||||
|
||||
``amd_pstate`` CPPC has 3 operation modes: autonomous (active) mode,
|
||||
non-autonomous (passive) mode and guided autonomous (guided) mode.
|
||||
Active/passive/guided mode can be chosen by different kernel parameters.
|
||||
|
||||
- In autonomous mode, platform ignores the desired performance level request
|
||||
and takes into account only the values set to the minimum, maximum and energy
|
||||
performance preference registers.
|
||||
- In non-autonomous mode, platform gets desired performance level
|
||||
from OS directly through Desired Performance Register.
|
||||
- In guided-autonomous mode, platform sets operating performance level
|
||||
autonomously according to the current workload and within the limits set by
|
||||
OS through min and max performance registers.
|
||||
|
||||
Active Mode
|
||||
------------
|
||||
|
||||
``amd_pstate=active``
|
||||
|
||||
This is the low-level firmware control mode which is implemented by ``amd_pstate_epp``
|
||||
driver with ``amd_pstate=active`` passed to the kernel in the command line.
|
||||
In this mode, ``amd_pstate_epp`` driver provides a hint to the hardware if software
|
||||
wants to bias toward performance (0x0) or energy efficiency (0xff) to the CPPC firmware.
|
||||
then CPPC power algorithm will calculate the runtime workload and adjust the realtime
|
||||
cores frequency according to the power supply and thermal, core voltage and some other
|
||||
hardware conditions.
|
||||
|
||||
Passive Mode
|
||||
------------
|
||||
|
||||
``amd_pstate=passive``
|
||||
|
||||
It will be enabled if the ``amd_pstate=passive`` is passed to the kernel in the command line.
|
||||
In this mode, ``amd_pstate`` driver software specifies a desired QoS target in the CPPC
|
||||
performance scale as a relative number. This can be expressed as percentage of nominal
|
||||
performance (infrastructure max). Below the nominal sustained performance level,
|
||||
desired performance expresses the average performance level of the processor subject
|
||||
to the Performance Reduction Tolerance register. Above the nominal performance level,
|
||||
processor must provide at least nominal performance requested and go higher if current
|
||||
operating conditions allow.
|
||||
|
||||
Guided Mode
|
||||
-----------
|
||||
|
||||
``amd_pstate=guided``
|
||||
|
||||
If ``amd_pstate=guided`` is passed to kernel command line option then this mode
|
||||
is activated. In this mode, driver requests minimum and maximum performance
|
||||
level and the platform autonomously selects a performance level in this range
|
||||
and appropriate to the current workload.
|
||||
|
||||
``amd-pstate`` Preferred Core
|
||||
=================================
|
||||
|
||||
The core frequency is subjected to the process variation in semiconductors.
|
||||
Not all cores are able to reach the maximum frequency respecting the
|
||||
infrastructure limits. Consequently, AMD has redefined the concept of
|
||||
maximum frequency of a part. This means that a fraction of cores can reach
|
||||
maximum frequency. To find the best process scheduling policy for a given
|
||||
scenario, OS needs to know the core ordering informed by the platform through
|
||||
highest performance capability register of the CPPC interface.
|
||||
|
||||
``amd-pstate`` preferred core enables the scheduler to prefer scheduling on
|
||||
cores that can achieve a higher frequency with lower voltage. The preferred
|
||||
core rankings can dynamically change based on the workload, platform conditions,
|
||||
thermals and ageing.
|
||||
|
||||
The priority metric will be initialized by the ``amd-pstate`` driver. The ``amd-pstate``
|
||||
driver will also determine whether or not ``amd-pstate`` preferred core is
|
||||
supported by the platform.
|
||||
|
||||
``amd-pstate`` driver will provide an initial core ordering when the system boots.
|
||||
The platform uses the CPPC interfaces to communicate the core ranking to the
|
||||
operating system and scheduler to make sure that OS is choosing the cores
|
||||
with highest performance firstly for scheduling the process. When ``amd-pstate``
|
||||
driver receives a message with the highest performance change, it will
|
||||
update the core ranking and set the cpu's priority.
|
||||
|
||||
``amd-pstate`` Preferred Core Switch
|
||||
=====================================
|
||||
Kernel Parameters
|
||||
-----------------
|
||||
|
||||
``amd-pstate`` peferred core`` has two states: enable and disable.
|
||||
Enable/disable states can be chosen by different kernel parameters.
|
||||
Default enable ``amd-pstate`` preferred core.
|
||||
|
||||
``amd_prefcore=disable``
|
||||
|
||||
For systems that support ``amd-pstate`` preferred core, the core rankings will
|
||||
always be advertised by the platform. But OS can choose to ignore that via the
|
||||
kernel parameter ``amd_prefcore=disable``.
|
||||
|
||||
User Space Interface in ``sysfs`` - General
|
||||
===========================================
|
||||
|
||||
Global Attributes
|
||||
-----------------
|
||||
|
||||
``amd-pstate`` exposes several global attributes (files) in ``sysfs`` to
|
||||
control its functionality at the system level. They are located in the
|
||||
``/sys/devices/system/cpu/amd_pstate/`` directory and affect all CPUs.
|
||||
|
||||
``status``
|
||||
Operation mode of the driver: "active", "passive", "guided" or "disable".
|
||||
|
||||
"active"
|
||||
The driver is functional and in the ``active mode``
|
||||
|
||||
"passive"
|
||||
The driver is functional and in the ``passive mode``
|
||||
|
||||
"guided"
|
||||
The driver is functional and in the ``guided mode``
|
||||
|
||||
"disable"
|
||||
The driver is unregistered and not functional now.
|
||||
|
||||
This attribute can be written to in order to change the driver's
|
||||
operation mode or to unregister it. The string written to it must be
|
||||
one of the possible values of it and, if successful, writing one of
|
||||
these values to the sysfs file will cause the driver to switch over
|
||||
to the operation mode represented by that string - or to be
|
||||
unregistered in the "disable" case.
|
||||
|
||||
``prefcore``
|
||||
Preferred core state of the driver: "enabled" or "disabled".
|
||||
|
||||
"enabled"
|
||||
Enable the ``amd-pstate`` preferred core.
|
||||
|
||||
"disabled"
|
||||
Disable the ``amd-pstate`` preferred core
|
||||
|
||||
|
||||
This attribute is read-only to check the state of preferred core set
|
||||
by the kernel parameter.
|
||||
|
||||
``cpupower`` tool support for ``amd-pstate``
|
||||
===============================================
|
||||
|
||||
``amd-pstate`` is supported by the ``cpupower`` tool, which can be used to dump
|
||||
frequency information. Development is in progress to support more and more
|
||||
operations for the new ``amd-pstate`` module with this tool. ::
|
||||
|
||||
root@hr-test1:/home/ray# cpupower frequency-info
|
||||
analyzing CPU 0:
|
||||
driver: amd-pstate
|
||||
CPUs which run at the same hardware frequency: 0
|
||||
CPUs which need to have their frequency coordinated by software: 0
|
||||
maximum transition latency: 131 us
|
||||
hardware limits: 400 MHz - 4.68 GHz
|
||||
available cpufreq governors: ondemand conservative powersave userspace performance schedutil
|
||||
current policy: frequency should be within 400 MHz and 4.68 GHz.
|
||||
The governor "schedutil" may decide which speed to use
|
||||
within this range.
|
||||
current CPU frequency: Unable to call hardware
|
||||
current CPU frequency: 4.02 GHz (asserted by call to kernel)
|
||||
boost state support:
|
||||
Supported: yes
|
||||
Active: yes
|
||||
AMD PSTATE Highest Performance: 166. Maximum Frequency: 4.68 GHz.
|
||||
AMD PSTATE Nominal Performance: 117. Nominal Frequency: 3.30 GHz.
|
||||
AMD PSTATE Lowest Non-linear Performance: 39. Lowest Non-linear Frequency: 1.10 GHz.
|
||||
AMD PSTATE Lowest Performance: 15. Lowest Frequency: 400 MHz.
|
||||
|
||||
|
||||
Diagnostics and Tuning
|
||||
=======================
|
||||
|
||||
Trace Events
|
||||
--------------
|
||||
|
||||
There are two static trace events that can be used for ``amd-pstate``
|
||||
diagnostics. One of them is the ``cpu_frequency`` trace event generally used
|
||||
by ``CPUFreq``, and the other one is the ``amd_pstate_perf`` trace event
|
||||
specific to ``amd-pstate``. The following sequence of shell commands can
|
||||
be used to enable them and see their output (if the kernel is
|
||||
configured to support event tracing). ::
|
||||
|
||||
root@hr-test1:/home/ray# cd /sys/kernel/tracing/
|
||||
root@hr-test1:/sys/kernel/tracing# echo 1 > events/amd_cpu/enable
|
||||
root@hr-test1:/sys/kernel/tracing# cat trace
|
||||
# tracer: nop
|
||||
#
|
||||
# entries-in-buffer/entries-written: 47827/42233061 #P:2
|
||||
#
|
||||
# _-----=> irqs-off
|
||||
# / _----=> need-resched
|
||||
# | / _---=> hardirq/softirq
|
||||
# || / _--=> preempt-depth
|
||||
# ||| / delay
|
||||
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
|
||||
# | | | |||| | |
|
||||
<idle>-0 [015] dN... 4995.979886: amd_pstate_perf: amd_min_perf=85 amd_des_perf=85 amd_max_perf=166 cpu_id=15 changed=false fast_switch=true
|
||||
<idle>-0 [007] d.h.. 4995.979893: amd_pstate_perf: amd_min_perf=85 amd_des_perf=85 amd_max_perf=166 cpu_id=7 changed=false fast_switch=true
|
||||
cat-2161 [000] d.... 4995.980841: amd_pstate_perf: amd_min_perf=85 amd_des_perf=85 amd_max_perf=166 cpu_id=0 changed=false fast_switch=true
|
||||
sshd-2125 [004] d.s.. 4995.980968: amd_pstate_perf: amd_min_perf=85 amd_des_perf=85 amd_max_perf=166 cpu_id=4 changed=false fast_switch=true
|
||||
<idle>-0 [007] d.s.. 4995.980968: amd_pstate_perf: amd_min_perf=85 amd_des_perf=85 amd_max_perf=166 cpu_id=7 changed=false fast_switch=true
|
||||
<idle>-0 [003] d.s.. 4995.980971: amd_pstate_perf: amd_min_perf=85 amd_des_perf=85 amd_max_perf=166 cpu_id=3 changed=false fast_switch=true
|
||||
<idle>-0 [011] d.s.. 4995.980996: amd_pstate_perf: amd_min_perf=85 amd_des_perf=85 amd_max_perf=166 cpu_id=11 changed=false fast_switch=true
|
||||
|
||||
The ``cpu_frequency`` trace event will be triggered either by the ``schedutil`` scaling
|
||||
governor (for the policies it is attached to), or by the ``CPUFreq`` core (for the
|
||||
policies with other scaling governors).
|
||||
|
||||
|
||||
Tracer Tool
|
||||
-------------
|
||||
|
||||
``amd_pstate_tracer.py`` can record and parse ``amd-pstate`` trace log, then
|
||||
generate performance plots. This utility can be used to debug and tune the
|
||||
performance of ``amd-pstate`` driver. The tracer tool needs to import intel
|
||||
pstate tracer.
|
||||
|
||||
Tracer tool located in ``linux/tools/power/x86/amd_pstate_tracer``. It can be
|
||||
used in two ways. If trace file is available, then directly parse the file
|
||||
with command ::
|
||||
|
||||
./amd_pstate_trace.py [-c cpus] -t <trace_file> -n <test_name>
|
||||
|
||||
Or generate trace file with root privilege, then parse and plot with command ::
|
||||
|
||||
sudo ./amd_pstate_trace.py [-c cpus] -n <test_name> -i <interval> [-m kbytes]
|
||||
|
||||
The test result can be found in ``results/test_name``. Following is the example
|
||||
about part of the output. ::
|
||||
|
||||
common_cpu common_secs common_usecs min_perf des_perf max_perf freq mperf apef tsc load duration_ms sample_num elapsed_time common_comm
|
||||
CPU_005 712 116384 39 49 166 0.7565 9645075 2214891 38431470 25.1 11.646 469 2.496 kworker/5:0-40
|
||||
CPU_006 712 116408 39 49 166 0.6769 8950227 1839034 37192089 24.06 11.272 470 2.496 kworker/6:0-1264
|
||||
|
||||
Unit Tests for amd-pstate
|
||||
-------------------------
|
||||
|
||||
``amd-pstate-ut`` is a test module for testing the ``amd-pstate`` driver.
|
||||
|
||||
* It can help all users to verify their processor support (SBIOS/Firmware or Hardware).
|
||||
|
||||
* Kernel can have a basic function test to avoid the kernel regression during the update.
|
||||
|
||||
* We can introduce more functional or performance tests to align the result together, it will benefit power and performance scale optimization.
|
||||
|
||||
1. Test case descriptions
|
||||
|
||||
1). Basic tests
|
||||
|
||||
Test prerequisite and basic functions for the ``amd-pstate`` driver.
|
||||
|
||||
+---------+--------------------------------+------------------------------------------------------------------------------------+
|
||||
| Index | Functions | Description |
|
||||
+=========+================================+====================================================================================+
|
||||
| 1 | amd_pstate_ut_acpi_cpc_valid || Check whether the _CPC object is present in SBIOS. |
|
||||
| | || |
|
||||
| | || The detail refer to `Processor Support <processor_support_>`_. |
|
||||
+---------+--------------------------------+------------------------------------------------------------------------------------+
|
||||
| 2 | amd_pstate_ut_check_enabled || Check whether AMD P-State is enabled. |
|
||||
| | || |
|
||||
| | || AMD P-States and ACPI hardware P-States always can be supported in one processor. |
|
||||
| | | But AMD P-States has the higher priority and if it is enabled with |
|
||||
| | | :c:macro:`MSR_AMD_CPPC_ENABLE` or ``cppc_set_enable``, it will respond to the |
|
||||
| | | request from AMD P-States. |
|
||||
+---------+--------------------------------+------------------------------------------------------------------------------------+
|
||||
| 3 | amd_pstate_ut_check_perf || Check if the each performance values are reasonable. |
|
||||
| | || highest_perf >= nominal_perf > lowest_nonlinear_perf > lowest_perf > 0. |
|
||||
+---------+--------------------------------+------------------------------------------------------------------------------------+
|
||||
| 4 | amd_pstate_ut_check_freq || Check if the each frequency values and max freq when set support boost mode |
|
||||
| | | are reasonable. |
|
||||
| | || max_freq >= nominal_freq > lowest_nonlinear_freq > min_freq > 0 |
|
||||
| | || If boost is not active but supported, this maximum frequency will be larger than |
|
||||
| | | the one in ``cpuinfo``. |
|
||||
+---------+--------------------------------+------------------------------------------------------------------------------------+
|
||||
|
||||
2). Tbench test
|
||||
|
||||
Test and monitor the cpu changes when running tbench benchmark under the specified governor.
|
||||
These changes include desire performance, frequency, load, performance, energy etc.
|
||||
The specified governor is ondemand or schedutil.
|
||||
Tbench can also be tested on the ``acpi-cpufreq`` kernel driver for comparison.
|
||||
|
||||
3). Gitsource test
|
||||
|
||||
Test and monitor the cpu changes when running gitsource benchmark under the specified governor.
|
||||
These changes include desire performance, frequency, load, time, energy etc.
|
||||
The specified governor is ondemand or schedutil.
|
||||
Gitsource can also be tested on the ``acpi-cpufreq`` kernel driver for comparison.
|
||||
|
||||
#. How to execute the tests
|
||||
|
||||
We use test module in the kselftest frameworks to implement it.
|
||||
We create ``amd-pstate-ut`` module and tie it into kselftest.(for
|
||||
details refer to Linux Kernel Selftests [4]_).
|
||||
|
||||
1). Build
|
||||
|
||||
+ open the :c:macro:`CONFIG_X86_AMD_PSTATE` configuration option.
|
||||
+ set the :c:macro:`CONFIG_X86_AMD_PSTATE_UT` configuration option to M.
|
||||
+ make project
|
||||
+ make selftest ::
|
||||
|
||||
$ cd linux
|
||||
$ make -C tools/testing/selftests
|
||||
|
||||
+ make perf ::
|
||||
|
||||
$ cd tools/perf/
|
||||
$ make
|
||||
|
||||
|
||||
2). Installation & Steps ::
|
||||
|
||||
$ make -C tools/testing/selftests install INSTALL_PATH=~/kselftest
|
||||
$ cp tools/perf/perf /usr/bin/perf
|
||||
$ sudo ./kselftest/run_kselftest.sh -c amd-pstate
|
||||
|
||||
3). Specified test case ::
|
||||
|
||||
$ cd ~/kselftest/amd-pstate
|
||||
$ sudo ./run.sh -t basic
|
||||
$ sudo ./run.sh -t tbench
|
||||
$ sudo ./run.sh -t tbench -m acpi-cpufreq
|
||||
$ sudo ./run.sh -t gitsource
|
||||
$ sudo ./run.sh -t gitsource -m acpi-cpufreq
|
||||
$ ./run.sh --help
|
||||
./run.sh: illegal option -- -
|
||||
Usage: ./run.sh [OPTION...]
|
||||
[-h <help>]
|
||||
[-o <output-file-for-dump>]
|
||||
[-c <all: All testing,
|
||||
basic: Basic testing,
|
||||
tbench: Tbench testing,
|
||||
gitsource: Gitsource testing.>]
|
||||
[-t <tbench time limit>]
|
||||
[-p <tbench process number>]
|
||||
[-l <loop times for tbench>]
|
||||
[-i <amd tracer interval>]
|
||||
[-m <comparative test: acpi-cpufreq>]
|
||||
|
||||
|
||||
4). Results
|
||||
|
||||
+ basic
|
||||
|
||||
When you finish test, you will get the following log info ::
|
||||
|
||||
$ dmesg | grep "amd_pstate_ut" | tee log.txt
|
||||
[12977.570663] amd_pstate_ut: 1 amd_pstate_ut_acpi_cpc_valid success!
|
||||
[12977.570673] amd_pstate_ut: 2 amd_pstate_ut_check_enabled success!
|
||||
[12977.571207] amd_pstate_ut: 3 amd_pstate_ut_check_perf success!
|
||||
[12977.571212] amd_pstate_ut: 4 amd_pstate_ut_check_freq success!
|
||||
|
||||
+ tbench
|
||||
|
||||
When you finish test, you will get selftest.tbench.csv and png images.
|
||||
The selftest.tbench.csv file contains the raw data and the drop of the comparative test.
|
||||
The png images shows the performance, energy and performan per watt of each test.
|
||||
Open selftest.tbench.csv :
|
||||
|
||||
+-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
|
||||
+ Governor | Round | Des-perf | Freq | Load | Performance | Energy | Performance Per Watt |
|
||||
+-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
|
||||
+ Unit | | | GHz | | MB/s | J | MB/J |
|
||||
+=================================================+==============+==========+=========+==========+=============+=========+======================+
|
||||
+ amd-pstate-ondemand | 1 | | | | 2504.05 | 1563.67 | 158.5378 |
|
||||
+-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
|
||||
+ amd-pstate-ondemand | 2 | | | | 2243.64 | 1430.32 | 155.2941 |
|
||||
+-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
|
||||
+ amd-pstate-ondemand | 3 | | | | 2183.88 | 1401.32 | 154.2860 |
|
||||
+-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
|
||||
+ amd-pstate-ondemand | Average | | | | 2310.52 | 1465.1 | 156.1268 |
|
||||
+-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
|
||||
+ amd-pstate-schedutil | 1 | 165.329 | 1.62257 | 99.798 | 2136.54 | 1395.26 | 151.5971 |
|
||||
+-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
|
||||
+ amd-pstate-schedutil | 2 | 166 | 1.49761 | 99.9993 | 2100.56 | 1380.5 | 150.6377 |
|
||||
+-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
|
||||
+ amd-pstate-schedutil | 3 | 166 | 1.47806 | 99.9993 | 2084.12 | 1375.76 | 149.9737 |
|
||||
+-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
|
||||
+ amd-pstate-schedutil | Average | 165.776 | 1.53275 | 99.9322 | 2107.07 | 1383.84 | 150.7399 |
|
||||
+-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
|
||||
+ acpi-cpufreq-ondemand | 1 | | | | 2529.9 | 1564.4 | 160.0997 |
|
||||
+-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
|
||||
+ acpi-cpufreq-ondemand | 2 | | | | 2249.76 | 1432.97 | 155.4297 |
|
||||
+-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
|
||||
+ acpi-cpufreq-ondemand | 3 | | | | 2181.46 | 1406.88 | 153.5060 |
|
||||
+-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
|
||||
+ acpi-cpufreq-ondemand | Average | | | | 2320.37 | 1468.08 | 156.4741 |
|
||||
+-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
|
||||
+ acpi-cpufreq-schedutil | 1 | | | | 2137.64 | 1385.24 | 152.7723 |
|
||||
+-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
|
||||
+ acpi-cpufreq-schedutil | 2 | | | | 2107.05 | 1372.23 | 152.0138 |
|
||||
+-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
|
||||
+ acpi-cpufreq-schedutil | 3 | | | | 2085.86 | 1365.35 | 151.2433 |
|
||||
+-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
|
||||
+ acpi-cpufreq-schedutil | Average | | | | 2110.18 | 1374.27 | 152.0136 |
|
||||
+-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
|
||||
+ acpi-cpufreq-ondemand VS acpi-cpufreq-schedutil | Comprison(%) | | | | -9.0584 | -6.3899 | -2.8506 |
|
||||
+-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
|
||||
+ amd-pstate-ondemand VS amd-pstate-schedutil | Comprison(%) | | | | 8.8053 | -5.5463 | -3.4503 |
|
||||
+-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
|
||||
+ acpi-cpufreq-ondemand VS amd-pstate-ondemand | Comprison(%) | | | | -0.4245 | -0.2029 | -0.2219 |
|
||||
+-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
|
||||
+ acpi-cpufreq-schedutil VS amd-pstate-schedutil | Comprison(%) | | | | -0.1473 | 0.6963 | -0.8378 |
|
||||
+-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
|
||||
|
||||
+ gitsource
|
||||
|
||||
When you finish test, you will get selftest.gitsource.csv and png images.
|
||||
The selftest.gitsource.csv file contains the raw data and the drop of the comparative test.
|
||||
The png images shows the performance, energy and performan per watt of each test.
|
||||
Open selftest.gitsource.csv :
|
||||
|
||||
+-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
|
||||
+ Governor | Round | Des-perf | Freq | Load | Time | Energy | Performance Per Watt |
|
||||
+-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
|
||||
+ Unit | | | GHz | | s | J | 1/J |
|
||||
+=================================================+==============+==========+==========+==========+=============+=========+======================+
|
||||
+ amd-pstate-ondemand | 1 | 50.119 | 2.10509 | 23.3076 | 475.69 | 865.78 | 0.001155027 |
|
||||
+-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
|
||||
+ amd-pstate-ondemand | 2 | 94.8006 | 1.98771 | 56.6533 | 467.1 | 839.67 | 0.001190944 |
|
||||
+-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
|
||||
+ amd-pstate-ondemand | 3 | 76.6091 | 2.53251 | 43.7791 | 467.69 | 855.85 | 0.001168429 |
|
||||
+-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
|
||||
+ amd-pstate-ondemand | Average | 73.8429 | 2.20844 | 41.2467 | 470.16 | 853.767 | 0.001171279 |
|
||||
+-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
|
||||
+ amd-pstate-schedutil | 1 | 165.919 | 1.62319 | 98.3868 | 464.17 | 866.8 | 0.001153668 |
|
||||
+-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
|
||||
+ amd-pstate-schedutil | 2 | 165.97 | 1.31309 | 99.5712 | 480.15 | 880.4 | 0.001135847 |
|
||||
+-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
|
||||
+ amd-pstate-schedutil | 3 | 165.973 | 1.28448 | 99.9252 | 481.79 | 867.02 | 0.001153375 |
|
||||
+-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
|
||||
+ amd-pstate-schedutil | Average | 165.954 | 1.40692 | 99.2944 | 475.37 | 871.407 | 0.001147569 |
|
||||
+-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
|
||||
+ acpi-cpufreq-ondemand | 1 | | | | 2379.62 | 742.96 | 0.001345967 |
|
||||
+-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
|
||||
+ acpi-cpufreq-ondemand | 2 | | | | 441.74 | 817.49 | 0.001223256 |
|
||||
+-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
|
||||
+ acpi-cpufreq-ondemand | 3 | | | | 455.48 | 820.01 | 0.001219497 |
|
||||
+-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
|
||||
+ acpi-cpufreq-ondemand | Average | | | | 425.613 | 793.487 | 0.001260260 |
|
||||
+-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
|
||||
+ acpi-cpufreq-schedutil | 1 | | | | 459.69 | 838.54 | 0.001192548 |
|
||||
+-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
|
||||
+ acpi-cpufreq-schedutil | 2 | | | | 466.55 | 830.89 | 0.001203528 |
|
||||
+-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
|
||||
+ acpi-cpufreq-schedutil | 3 | | | | 470.38 | 837.32 | 0.001194286 |
|
||||
+-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
|
||||
+ acpi-cpufreq-schedutil | Average | | | | 465.54 | 835.583 | 0.001196769 |
|
||||
+-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
|
||||
+ acpi-cpufreq-ondemand VS acpi-cpufreq-schedutil | Comprison(%) | | | | 9.3810 | 5.3051 | -5.0379 |
|
||||
+-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
|
||||
+ amd-pstate-ondemand VS amd-pstate-schedutil | Comprison(%) | 124.7392 | -36.2934 | 140.7329 | 1.1081 | 2.0661 | -2.0242 |
|
||||
+-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
|
||||
+ acpi-cpufreq-ondemand VS amd-pstate-ondemand | Comprison(%) | | | | 10.4665 | 7.5968 | -7.0605 |
|
||||
+-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
|
||||
+ acpi-cpufreq-schedutil VS amd-pstate-schedutil | Comprison(%) | | | | 2.1115 | 4.2873 | -4.1110 |
|
||||
+-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
|
||||
|
||||
Reference
|
||||
===========
|
||||
|
||||
.. [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming,
|
||||
https://www.amd.com/system/files/TechDocs/24593.pdf
|
||||
|
||||
.. [2] Advanced Configuration and Power Interface Specification,
|
||||
https://uefi.org/sites/default/files/resources/ACPI_Spec_6_4_Jan22.pdf
|
||||
|
||||
.. [3] Processor Programming Reference (PPR) for AMD Family 19h Model 51h, Revision A1 Processors
|
||||
https://www.amd.com/system/files/TechDocs/56569-A1-PUB.zip
|
||||
|
||||
.. [4] Linux Kernel Selftests,
|
||||
https://www.kernel.org/doc/html/latest/dev-tools/kselftest.html
|
712
Documentation/admin-guide/pm/cpufreq.rst
Normal file
712
Documentation/admin-guide/pm/cpufreq.rst
Normal file
|
@ -0,0 +1,712 @@
|
|||
.. SPDX-License-Identifier: GPL-2.0
|
||||
.. include:: <isonum.txt>
|
||||
|
||||
.. |intel_pstate| replace:: :doc:`intel_pstate <intel_pstate>`
|
||||
|
||||
=======================
|
||||
CPU Performance Scaling
|
||||
=======================
|
||||
|
||||
:Copyright: |copy| 2017 Intel Corporation
|
||||
|
||||
:Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
||||
|
||||
|
||||
The Concept of CPU Performance Scaling
|
||||
======================================
|
||||
|
||||
The majority of modern processors are capable of operating in a number of
|
||||
different clock frequency and voltage configurations, often referred to as
|
||||
Operating Performance Points or P-states (in ACPI terminology). As a rule,
|
||||
the higher the clock frequency and the higher the voltage, the more instructions
|
||||
can be retired by the CPU over a unit of time, but also the higher the clock
|
||||
frequency and the higher the voltage, the more energy is consumed over a unit of
|
||||
time (or the more power is drawn) by the CPU in the given P-state. Therefore
|
||||
there is a natural tradeoff between the CPU capacity (the number of instructions
|
||||
that can be executed over a unit of time) and the power drawn by the CPU.
|
||||
|
||||
In some situations it is desirable or even necessary to run the program as fast
|
||||
as possible and then there is no reason to use any P-states different from the
|
||||
highest one (i.e. the highest-performance frequency/voltage configuration
|
||||
available). In some other cases, however, it may not be necessary to execute
|
||||
instructions so quickly and maintaining the highest available CPU capacity for a
|
||||
relatively long time without utilizing it entirely may be regarded as wasteful.
|
||||
It also may not be physically possible to maintain maximum CPU capacity for too
|
||||
long for thermal or power supply capacity reasons or similar. To cover those
|
||||
cases, there are hardware interfaces allowing CPUs to be switched between
|
||||
different frequency/voltage configurations or (in the ACPI terminology) to be
|
||||
put into different P-states.
|
||||
|
||||
Typically, they are used along with algorithms to estimate the required CPU
|
||||
capacity, so as to decide which P-states to put the CPUs into. Of course, since
|
||||
the utilization of the system generally changes over time, that has to be done
|
||||
repeatedly on a regular basis. The activity by which this happens is referred
|
||||
to as CPU performance scaling or CPU frequency scaling (because it involves
|
||||
adjusting the CPU clock frequency).
|
||||
|
||||
|
||||
CPU Performance Scaling in Linux
|
||||
================================
|
||||
|
||||
The Linux kernel supports CPU performance scaling by means of the ``CPUFreq``
|
||||
(CPU Frequency scaling) subsystem that consists of three layers of code: the
|
||||
core, scaling governors and scaling drivers.
|
||||
|
||||
The ``CPUFreq`` core provides the common code infrastructure and user space
|
||||
interfaces for all platforms that support CPU performance scaling. It defines
|
||||
the basic framework in which the other components operate.
|
||||
|
||||
Scaling governors implement algorithms to estimate the required CPU capacity.
|
||||
As a rule, each governor implements one, possibly parametrized, scaling
|
||||
algorithm.
|
||||
|
||||
Scaling drivers talk to the hardware. They provide scaling governors with
|
||||
information on the available P-states (or P-state ranges in some cases) and
|
||||
access platform-specific hardware interfaces to change CPU P-states as requested
|
||||
by scaling governors.
|
||||
|
||||
In principle, all available scaling governors can be used with every scaling
|
||||
driver. That design is based on the observation that the information used by
|
||||
performance scaling algorithms for P-state selection can be represented in a
|
||||
platform-independent form in the majority of cases, so it should be possible
|
||||
to use the same performance scaling algorithm implemented in exactly the same
|
||||
way regardless of which scaling driver is used. Consequently, the same set of
|
||||
scaling governors should be suitable for every supported platform.
|
||||
|
||||
However, that observation may not hold for performance scaling algorithms
|
||||
based on information provided by the hardware itself, for example through
|
||||
feedback registers, as that information is typically specific to the hardware
|
||||
interface it comes from and may not be easily represented in an abstract,
|
||||
platform-independent way. For this reason, ``CPUFreq`` allows scaling drivers
|
||||
to bypass the governor layer and implement their own performance scaling
|
||||
algorithms. That is done by the |intel_pstate| scaling driver.
|
||||
|
||||
|
||||
``CPUFreq`` Policy Objects
|
||||
==========================
|
||||
|
||||
In some cases the hardware interface for P-state control is shared by multiple
|
||||
CPUs. That is, for example, the same register (or set of registers) is used to
|
||||
control the P-state of multiple CPUs at the same time and writing to it affects
|
||||
all of those CPUs simultaneously.
|
||||
|
||||
Sets of CPUs sharing hardware P-state control interfaces are represented by
|
||||
``CPUFreq`` as struct cpufreq_policy objects. For consistency,
|
||||
struct cpufreq_policy is also used when there is only one CPU in the given
|
||||
set.
|
||||
|
||||
The ``CPUFreq`` core maintains a pointer to a struct cpufreq_policy object for
|
||||
every CPU in the system, including CPUs that are currently offline. If multiple
|
||||
CPUs share the same hardware P-state control interface, all of the pointers
|
||||
corresponding to them point to the same struct cpufreq_policy object.
|
||||
|
||||
``CPUFreq`` uses struct cpufreq_policy as its basic data type and the design
|
||||
of its user space interface is based on the policy concept.
|
||||
|
||||
|
||||
CPU Initialization
|
||||
==================
|
||||
|
||||
First of all, a scaling driver has to be registered for ``CPUFreq`` to work.
|
||||
It is only possible to register one scaling driver at a time, so the scaling
|
||||
driver is expected to be able to handle all CPUs in the system.
|
||||
|
||||
The scaling driver may be registered before or after CPU registration. If
|
||||
CPUs are registered earlier, the driver core invokes the ``CPUFreq`` core to
|
||||
take a note of all of the already registered CPUs during the registration of the
|
||||
scaling driver. In turn, if any CPUs are registered after the registration of
|
||||
the scaling driver, the ``CPUFreq`` core will be invoked to take note of them
|
||||
at their registration time.
|
||||
|
||||
In any case, the ``CPUFreq`` core is invoked to take note of any logical CPU it
|
||||
has not seen so far as soon as it is ready to handle that CPU. [Note that the
|
||||
logical CPU may be a physical single-core processor, or a single core in a
|
||||
multicore processor, or a hardware thread in a physical processor or processor
|
||||
core. In what follows "CPU" always means "logical CPU" unless explicitly stated
|
||||
otherwise and the word "processor" is used to refer to the physical part
|
||||
possibly including multiple logical CPUs.]
|
||||
|
||||
Once invoked, the ``CPUFreq`` core checks if the policy pointer is already set
|
||||
for the given CPU and if so, it skips the policy object creation. Otherwise,
|
||||
a new policy object is created and initialized, which involves the creation of
|
||||
a new policy directory in ``sysfs``, and the policy pointer corresponding to
|
||||
the given CPU is set to the new policy object's address in memory.
|
||||
|
||||
Next, the scaling driver's ``->init()`` callback is invoked with the policy
|
||||
pointer of the new CPU passed to it as the argument. That callback is expected
|
||||
to initialize the performance scaling hardware interface for the given CPU (or,
|
||||
more precisely, for the set of CPUs sharing the hardware interface it belongs
|
||||
to, represented by its policy object) and, if the policy object it has been
|
||||
called for is new, to set parameters of the policy, like the minimum and maximum
|
||||
frequencies supported by the hardware, the table of available frequencies (if
|
||||
the set of supported P-states is not a continuous range), and the mask of CPUs
|
||||
that belong to the same policy (including both online and offline CPUs). That
|
||||
mask is then used by the core to populate the policy pointers for all of the
|
||||
CPUs in it.
|
||||
|
||||
The next major initialization step for a new policy object is to attach a
|
||||
scaling governor to it (to begin with, that is the default scaling governor
|
||||
determined by the kernel command line or configuration, but it may be changed
|
||||
later via ``sysfs``). First, a pointer to the new policy object is passed to
|
||||
the governor's ``->init()`` callback which is expected to initialize all of the
|
||||
data structures necessary to handle the given policy and, possibly, to add
|
||||
a governor ``sysfs`` interface to it. Next, the governor is started by
|
||||
invoking its ``->start()`` callback.
|
||||
|
||||
That callback is expected to register per-CPU utilization update callbacks for
|
||||
all of the online CPUs belonging to the given policy with the CPU scheduler.
|
||||
The utilization update callbacks will be invoked by the CPU scheduler on
|
||||
important events, like task enqueue and dequeue, on every iteration of the
|
||||
scheduler tick or generally whenever the CPU utilization may change (from the
|
||||
scheduler's perspective). They are expected to carry out computations needed
|
||||
to determine the P-state to use for the given policy going forward and to
|
||||
invoke the scaling driver to make changes to the hardware in accordance with
|
||||
the P-state selection. The scaling driver may be invoked directly from
|
||||
scheduler context or asynchronously, via a kernel thread or workqueue, depending
|
||||
on the configuration and capabilities of the scaling driver and the governor.
|
||||
|
||||
Similar steps are taken for policy objects that are not new, but were "inactive"
|
||||
previously, meaning that all of the CPUs belonging to them were offline. The
|
||||
only practical difference in that case is that the ``CPUFreq`` core will attempt
|
||||
to use the scaling governor previously used with the policy that became
|
||||
"inactive" (and is re-initialized now) instead of the default governor.
|
||||
|
||||
In turn, if a previously offline CPU is being brought back online, but some
|
||||
other CPUs sharing the policy object with it are online already, there is no
|
||||
need to re-initialize the policy object at all. In that case, it only is
|
||||
necessary to restart the scaling governor so that it can take the new online CPU
|
||||
into account. That is achieved by invoking the governor's ``->stop`` and
|
||||
``->start()`` callbacks, in this order, for the entire policy.
|
||||
|
||||
As mentioned before, the |intel_pstate| scaling driver bypasses the scaling
|
||||
governor layer of ``CPUFreq`` and provides its own P-state selection algorithms.
|
||||
Consequently, if |intel_pstate| is used, scaling governors are not attached to
|
||||
new policy objects. Instead, the driver's ``->setpolicy()`` callback is invoked
|
||||
to register per-CPU utilization update callbacks for each policy. These
|
||||
callbacks are invoked by the CPU scheduler in the same way as for scaling
|
||||
governors, but in the |intel_pstate| case they both determine the P-state to
|
||||
use and change the hardware configuration accordingly in one go from scheduler
|
||||
context.
|
||||
|
||||
The policy objects created during CPU initialization and other data structures
|
||||
associated with them are torn down when the scaling driver is unregistered
|
||||
(which happens when the kernel module containing it is unloaded, for example) or
|
||||
when the last CPU belonging to the given policy in unregistered.
|
||||
|
||||
|
||||
Policy Interface in ``sysfs``
|
||||
=============================
|
||||
|
||||
During the initialization of the kernel, the ``CPUFreq`` core creates a
|
||||
``sysfs`` directory (kobject) called ``cpufreq`` under
|
||||
:file:`/sys/devices/system/cpu/`.
|
||||
|
||||
That directory contains a ``policyX`` subdirectory (where ``X`` represents an
|
||||
integer number) for every policy object maintained by the ``CPUFreq`` core.
|
||||
Each ``policyX`` directory is pointed to by ``cpufreq`` symbolic links
|
||||
under :file:`/sys/devices/system/cpu/cpuY/` (where ``Y`` represents an integer
|
||||
that may be different from the one represented by ``X``) for all of the CPUs
|
||||
associated with (or belonging to) the given policy. The ``policyX`` directories
|
||||
in :file:`/sys/devices/system/cpu/cpufreq` each contain policy-specific
|
||||
attributes (files) to control ``CPUFreq`` behavior for the corresponding policy
|
||||
objects (that is, for all of the CPUs associated with them).
|
||||
|
||||
Some of those attributes are generic. They are created by the ``CPUFreq`` core
|
||||
and their behavior generally does not depend on what scaling driver is in use
|
||||
and what scaling governor is attached to the given policy. Some scaling drivers
|
||||
also add driver-specific attributes to the policy directories in ``sysfs`` to
|
||||
control policy-specific aspects of driver behavior.
|
||||
|
||||
The generic attributes under :file:`/sys/devices/system/cpu/cpufreq/policyX/`
|
||||
are the following:
|
||||
|
||||
``affected_cpus``
|
||||
List of online CPUs belonging to this policy (i.e. sharing the hardware
|
||||
performance scaling interface represented by the ``policyX`` policy
|
||||
object).
|
||||
|
||||
``bios_limit``
|
||||
If the platform firmware (BIOS) tells the OS to apply an upper limit to
|
||||
CPU frequencies, that limit will be reported through this attribute (if
|
||||
present).
|
||||
|
||||
The existence of the limit may be a result of some (often unintentional)
|
||||
BIOS settings, restrictions coming from a service processor or another
|
||||
BIOS/HW-based mechanisms.
|
||||
|
||||
This does not cover ACPI thermal limitations which can be discovered
|
||||
through a generic thermal driver.
|
||||
|
||||
This attribute is not present if the scaling driver in use does not
|
||||
support it.
|
||||
|
||||
``cpuinfo_cur_freq``
|
||||
Current frequency of the CPUs belonging to this policy as obtained from
|
||||
the hardware (in KHz).
|
||||
|
||||
This is expected to be the frequency the hardware actually runs at.
|
||||
If that frequency cannot be determined, this attribute should not
|
||||
be present.
|
||||
|
||||
``cpuinfo_max_freq``
|
||||
Maximum possible operating frequency the CPUs belonging to this policy
|
||||
can run at (in kHz).
|
||||
|
||||
``cpuinfo_min_freq``
|
||||
Minimum possible operating frequency the CPUs belonging to this policy
|
||||
can run at (in kHz).
|
||||
|
||||
``cpuinfo_transition_latency``
|
||||
The time it takes to switch the CPUs belonging to this policy from one
|
||||
P-state to another, in nanoseconds.
|
||||
|
||||
If unknown or if known to be so high that the scaling driver does not
|
||||
work with the `ondemand`_ governor, -1 (:c:macro:`CPUFREQ_ETERNAL`)
|
||||
will be returned by reads from this attribute.
|
||||
|
||||
``related_cpus``
|
||||
List of all (online and offline) CPUs belonging to this policy.
|
||||
|
||||
``scaling_available_frequencies``
|
||||
List of available frequencies of the CPUs belonging to this policy
|
||||
(in kHz).
|
||||
|
||||
``scaling_available_governors``
|
||||
List of ``CPUFreq`` scaling governors present in the kernel that can
|
||||
be attached to this policy or (if the |intel_pstate| scaling driver is
|
||||
in use) list of scaling algorithms provided by the driver that can be
|
||||
applied to this policy.
|
||||
|
||||
[Note that some governors are modular and it may be necessary to load a
|
||||
kernel module for the governor held by it to become available and be
|
||||
listed by this attribute.]
|
||||
|
||||
``scaling_cur_freq``
|
||||
Current frequency of all of the CPUs belonging to this policy (in kHz).
|
||||
|
||||
In the majority of cases, this is the frequency of the last P-state
|
||||
requested by the scaling driver from the hardware using the scaling
|
||||
interface provided by it, which may or may not reflect the frequency
|
||||
the CPU is actually running at (due to hardware design and other
|
||||
limitations).
|
||||
|
||||
Some architectures (e.g. ``x86``) may attempt to provide information
|
||||
more precisely reflecting the current CPU frequency through this
|
||||
attribute, but that still may not be the exact current CPU frequency as
|
||||
seen by the hardware at the moment.
|
||||
|
||||
``scaling_driver``
|
||||
The scaling driver currently in use.
|
||||
|
||||
``scaling_governor``
|
||||
The scaling governor currently attached to this policy or (if the
|
||||
|intel_pstate| scaling driver is in use) the scaling algorithm
|
||||
provided by the driver that is currently applied to this policy.
|
||||
|
||||
This attribute is read-write and writing to it will cause a new scaling
|
||||
governor to be attached to this policy or a new scaling algorithm
|
||||
provided by the scaling driver to be applied to it (in the
|
||||
|intel_pstate| case), as indicated by the string written to this
|
||||
attribute (which must be one of the names listed by the
|
||||
``scaling_available_governors`` attribute described above).
|
||||
|
||||
``scaling_max_freq``
|
||||
Maximum frequency the CPUs belonging to this policy are allowed to be
|
||||
running at (in kHz).
|
||||
|
||||
This attribute is read-write and writing a string representing an
|
||||
integer to it will cause a new limit to be set (it must not be lower
|
||||
than the value of the ``scaling_min_freq`` attribute).
|
||||
|
||||
``scaling_min_freq``
|
||||
Minimum frequency the CPUs belonging to this policy are allowed to be
|
||||
running at (in kHz).
|
||||
|
||||
This attribute is read-write and writing a string representing a
|
||||
non-negative integer to it will cause a new limit to be set (it must not
|
||||
be higher than the value of the ``scaling_max_freq`` attribute).
|
||||
|
||||
``scaling_setspeed``
|
||||
This attribute is functional only if the `userspace`_ scaling governor
|
||||
is attached to the given policy.
|
||||
|
||||
It returns the last frequency requested by the governor (in kHz) or can
|
||||
be written to in order to set a new frequency for the policy.
|
||||
|
||||
|
||||
Generic Scaling Governors
|
||||
=========================
|
||||
|
||||
``CPUFreq`` provides generic scaling governors that can be used with all
|
||||
scaling drivers. As stated before, each of them implements a single, possibly
|
||||
parametrized, performance scaling algorithm.
|
||||
|
||||
Scaling governors are attached to policy objects and different policy objects
|
||||
can be handled by different scaling governors at the same time (although that
|
||||
may lead to suboptimal results in some cases).
|
||||
|
||||
The scaling governor for a given policy object can be changed at any time with
|
||||
the help of the ``scaling_governor`` policy attribute in ``sysfs``.
|
||||
|
||||
Some governors expose ``sysfs`` attributes to control or fine-tune the scaling
|
||||
algorithms implemented by them. Those attributes, referred to as governor
|
||||
tunables, can be either global (system-wide) or per-policy, depending on the
|
||||
scaling driver in use. If the driver requires governor tunables to be
|
||||
per-policy, they are located in a subdirectory of each policy directory.
|
||||
Otherwise, they are located in a subdirectory under
|
||||
:file:`/sys/devices/system/cpu/cpufreq/`. In either case the name of the
|
||||
subdirectory containing the governor tunables is the name of the governor
|
||||
providing them.
|
||||
|
||||
``performance``
|
||||
---------------
|
||||
|
||||
When attached to a policy object, this governor causes the highest frequency,
|
||||
within the ``scaling_max_freq`` policy limit, to be requested for that policy.
|
||||
|
||||
The request is made once at that time the governor for the policy is set to
|
||||
``performance`` and whenever the ``scaling_max_freq`` or ``scaling_min_freq``
|
||||
policy limits change after that.
|
||||
|
||||
``powersave``
|
||||
-------------
|
||||
|
||||
When attached to a policy object, this governor causes the lowest frequency,
|
||||
within the ``scaling_min_freq`` policy limit, to be requested for that policy.
|
||||
|
||||
The request is made once at that time the governor for the policy is set to
|
||||
``powersave`` and whenever the ``scaling_max_freq`` or ``scaling_min_freq``
|
||||
policy limits change after that.
|
||||
|
||||
``userspace``
|
||||
-------------
|
||||
|
||||
This governor does not do anything by itself. Instead, it allows user space
|
||||
to set the CPU frequency for the policy it is attached to by writing to the
|
||||
``scaling_setspeed`` attribute of that policy.
|
||||
|
||||
``schedutil``
|
||||
-------------
|
||||
|
||||
This governor uses CPU utilization data available from the CPU scheduler. It
|
||||
generally is regarded as a part of the CPU scheduler, so it can access the
|
||||
scheduler's internal data structures directly.
|
||||
|
||||
It runs entirely in scheduler context, although in some cases it may need to
|
||||
invoke the scaling driver asynchronously when it decides that the CPU frequency
|
||||
should be changed for a given policy (that depends on whether or not the driver
|
||||
is capable of changing the CPU frequency from scheduler context).
|
||||
|
||||
The actions of this governor for a particular CPU depend on the scheduling class
|
||||
invoking its utilization update callback for that CPU. If it is invoked by the
|
||||
RT or deadline scheduling classes, the governor will increase the frequency to
|
||||
the allowed maximum (that is, the ``scaling_max_freq`` policy limit). In turn,
|
||||
if it is invoked by the CFS scheduling class, the governor will use the
|
||||
Per-Entity Load Tracking (PELT) metric for the root control group of the
|
||||
given CPU as the CPU utilization estimate (see the *Per-entity load tracking*
|
||||
LWN.net article [1]_ for a description of the PELT mechanism). Then, the new
|
||||
CPU frequency to apply is computed in accordance with the formula
|
||||
|
||||
f = 1.25 * ``f_0`` * ``util`` / ``max``
|
||||
|
||||
where ``util`` is the PELT number, ``max`` is the theoretical maximum of
|
||||
``util``, and ``f_0`` is either the maximum possible CPU frequency for the given
|
||||
policy (if the PELT number is frequency-invariant), or the current CPU frequency
|
||||
(otherwise).
|
||||
|
||||
This governor also employs a mechanism allowing it to temporarily bump up the
|
||||
CPU frequency for tasks that have been waiting on I/O most recently, called
|
||||
"IO-wait boosting". That happens when the :c:macro:`SCHED_CPUFREQ_IOWAIT` flag
|
||||
is passed by the scheduler to the governor callback which causes the frequency
|
||||
to go up to the allowed maximum immediately and then draw back to the value
|
||||
returned by the above formula over time.
|
||||
|
||||
This governor exposes only one tunable:
|
||||
|
||||
``rate_limit_us``
|
||||
Minimum time (in microseconds) that has to pass between two consecutive
|
||||
runs of governor computations (default: 1.5 times the scaling driver's
|
||||
transition latency or the maximum 2ms).
|
||||
|
||||
The purpose of this tunable is to reduce the scheduler context overhead
|
||||
of the governor which might be excessive without it.
|
||||
|
||||
This governor generally is regarded as a replacement for the older `ondemand`_
|
||||
and `conservative`_ governors (described below), as it is simpler and more
|
||||
tightly integrated with the CPU scheduler, its overhead in terms of CPU context
|
||||
switches and similar is less significant, and it uses the scheduler's own CPU
|
||||
utilization metric, so in principle its decisions should not contradict the
|
||||
decisions made by the other parts of the scheduler.
|
||||
|
||||
``ondemand``
|
||||
------------
|
||||
|
||||
This governor uses CPU load as a CPU frequency selection metric.
|
||||
|
||||
In order to estimate the current CPU load, it measures the time elapsed between
|
||||
consecutive invocations of its worker routine and computes the fraction of that
|
||||
time in which the given CPU was not idle. The ratio of the non-idle (active)
|
||||
time to the total CPU time is taken as an estimate of the load.
|
||||
|
||||
If this governor is attached to a policy shared by multiple CPUs, the load is
|
||||
estimated for all of them and the greatest result is taken as the load estimate
|
||||
for the entire policy.
|
||||
|
||||
The worker routine of this governor has to run in process context, so it is
|
||||
invoked asynchronously (via a workqueue) and CPU P-states are updated from
|
||||
there if necessary. As a result, the scheduler context overhead from this
|
||||
governor is minimum, but it causes additional CPU context switches to happen
|
||||
relatively often and the CPU P-state updates triggered by it can be relatively
|
||||
irregular. Also, it affects its own CPU load metric by running code that
|
||||
reduces the CPU idle time (even though the CPU idle time is only reduced very
|
||||
slightly by it).
|
||||
|
||||
It generally selects CPU frequencies proportional to the estimated load, so that
|
||||
the value of the ``cpuinfo_max_freq`` policy attribute corresponds to the load of
|
||||
1 (or 100%), and the value of the ``cpuinfo_min_freq`` policy attribute
|
||||
corresponds to the load of 0, unless when the load exceeds a (configurable)
|
||||
speedup threshold, in which case it will go straight for the highest frequency
|
||||
it is allowed to use (the ``scaling_max_freq`` policy limit).
|
||||
|
||||
This governor exposes the following tunables:
|
||||
|
||||
``sampling_rate``
|
||||
This is how often the governor's worker routine should run, in
|
||||
microseconds.
|
||||
|
||||
Typically, it is set to values of the order of 2000 (2 ms). Its
|
||||
default value is to add a 50% breathing room
|
||||
to ``cpuinfo_transition_latency`` on each policy this governor is
|
||||
attached to. The minimum is typically the length of two scheduler
|
||||
ticks.
|
||||
|
||||
If this tunable is per-policy, the following shell command sets the time
|
||||
represented by it to be 1.5 times as high as the transition latency
|
||||
(the default)::
|
||||
|
||||
# echo `$(($(cat cpuinfo_transition_latency) * 3 / 2)) > ondemand/sampling_rate
|
||||
|
||||
``up_threshold``
|
||||
If the estimated CPU load is above this value (in percent), the governor
|
||||
will set the frequency to the maximum value allowed for the policy.
|
||||
Otherwise, the selected frequency will be proportional to the estimated
|
||||
CPU load.
|
||||
|
||||
``ignore_nice_load``
|
||||
If set to 1 (default 0), it will cause the CPU load estimation code to
|
||||
treat the CPU time spent on executing tasks with "nice" levels greater
|
||||
than 0 as CPU idle time.
|
||||
|
||||
This may be useful if there are tasks in the system that should not be
|
||||
taken into account when deciding what frequency to run the CPUs at.
|
||||
Then, to make that happen it is sufficient to increase the "nice" level
|
||||
of those tasks above 0 and set this attribute to 1.
|
||||
|
||||
``sampling_down_factor``
|
||||
Temporary multiplier, between 1 (default) and 100 inclusive, to apply to
|
||||
the ``sampling_rate`` value if the CPU load goes above ``up_threshold``.
|
||||
|
||||
This causes the next execution of the governor's worker routine (after
|
||||
setting the frequency to the allowed maximum) to be delayed, so the
|
||||
frequency stays at the maximum level for a longer time.
|
||||
|
||||
Frequency fluctuations in some bursty workloads may be avoided this way
|
||||
at the cost of additional energy spent on maintaining the maximum CPU
|
||||
capacity.
|
||||
|
||||
``powersave_bias``
|
||||
Reduction factor to apply to the original frequency target of the
|
||||
governor (including the maximum value used when the ``up_threshold``
|
||||
value is exceeded by the estimated CPU load) or sensitivity threshold
|
||||
for the AMD frequency sensitivity powersave bias driver
|
||||
(:file:`drivers/cpufreq/amd_freq_sensitivity.c`), between 0 and 1000
|
||||
inclusive.
|
||||
|
||||
If the AMD frequency sensitivity powersave bias driver is not loaded,
|
||||
the effective frequency to apply is given by
|
||||
|
||||
f * (1 - ``powersave_bias`` / 1000)
|
||||
|
||||
where f is the governor's original frequency target. The default value
|
||||
of this attribute is 0 in that case.
|
||||
|
||||
If the AMD frequency sensitivity powersave bias driver is loaded, the
|
||||
value of this attribute is 400 by default and it is used in a different
|
||||
way.
|
||||
|
||||
On Family 16h (and later) AMD processors there is a mechanism to get a
|
||||
measured workload sensitivity, between 0 and 100% inclusive, from the
|
||||
hardware. That value can be used to estimate how the performance of the
|
||||
workload running on a CPU will change in response to frequency changes.
|
||||
|
||||
The performance of a workload with the sensitivity of 0 (memory-bound or
|
||||
IO-bound) is not expected to increase at all as a result of increasing
|
||||
the CPU frequency, whereas workloads with the sensitivity of 100%
|
||||
(CPU-bound) are expected to perform much better if the CPU frequency is
|
||||
increased.
|
||||
|
||||
If the workload sensitivity is less than the threshold represented by
|
||||
the ``powersave_bias`` value, the sensitivity powersave bias driver
|
||||
will cause the governor to select a frequency lower than its original
|
||||
target, so as to avoid over-provisioning workloads that will not benefit
|
||||
from running at higher CPU frequencies.
|
||||
|
||||
``conservative``
|
||||
----------------
|
||||
|
||||
This governor uses CPU load as a CPU frequency selection metric.
|
||||
|
||||
It estimates the CPU load in the same way as the `ondemand`_ governor described
|
||||
above, but the CPU frequency selection algorithm implemented by it is different.
|
||||
|
||||
Namely, it avoids changing the frequency significantly over short time intervals
|
||||
which may not be suitable for systems with limited power supply capacity (e.g.
|
||||
battery-powered). To achieve that, it changes the frequency in relatively
|
||||
small steps, one step at a time, up or down - depending on whether or not a
|
||||
(configurable) threshold has been exceeded by the estimated CPU load.
|
||||
|
||||
This governor exposes the following tunables:
|
||||
|
||||
``freq_step``
|
||||
Frequency step in percent of the maximum frequency the governor is
|
||||
allowed to set (the ``scaling_max_freq`` policy limit), between 0 and
|
||||
100 (5 by default).
|
||||
|
||||
This is how much the frequency is allowed to change in one go. Setting
|
||||
it to 0 will cause the default frequency step (5 percent) to be used
|
||||
and setting it to 100 effectively causes the governor to periodically
|
||||
switch the frequency between the ``scaling_min_freq`` and
|
||||
``scaling_max_freq`` policy limits.
|
||||
|
||||
``down_threshold``
|
||||
Threshold value (in percent, 20 by default) used to determine the
|
||||
frequency change direction.
|
||||
|
||||
If the estimated CPU load is greater than this value, the frequency will
|
||||
go up (by ``freq_step``). If the load is less than this value (and the
|
||||
``sampling_down_factor`` mechanism is not in effect), the frequency will
|
||||
go down. Otherwise, the frequency will not be changed.
|
||||
|
||||
``sampling_down_factor``
|
||||
Frequency decrease deferral factor, between 1 (default) and 10
|
||||
inclusive.
|
||||
|
||||
It effectively causes the frequency to go down ``sampling_down_factor``
|
||||
times slower than it ramps up.
|
||||
|
||||
|
||||
Frequency Boost Support
|
||||
=======================
|
||||
|
||||
Background
|
||||
----------
|
||||
|
||||
Some processors support a mechanism to raise the operating frequency of some
|
||||
cores in a multicore package temporarily (and above the sustainable frequency
|
||||
threshold for the whole package) under certain conditions, for example if the
|
||||
whole chip is not fully utilized and below its intended thermal or power budget.
|
||||
|
||||
Different names are used by different vendors to refer to this functionality.
|
||||
For Intel processors it is referred to as "Turbo Boost", AMD calls it
|
||||
"Turbo-Core" or (in technical documentation) "Core Performance Boost" and so on.
|
||||
As a rule, it also is implemented differently by different vendors. The simple
|
||||
term "frequency boost" is used here for brevity to refer to all of those
|
||||
implementations.
|
||||
|
||||
The frequency boost mechanism may be either hardware-based or software-based.
|
||||
If it is hardware-based (e.g. on x86), the decision to trigger the boosting is
|
||||
made by the hardware (although in general it requires the hardware to be put
|
||||
into a special state in which it can control the CPU frequency within certain
|
||||
limits). If it is software-based (e.g. on ARM), the scaling driver decides
|
||||
whether or not to trigger boosting and when to do that.
|
||||
|
||||
The ``boost`` File in ``sysfs``
|
||||
-------------------------------
|
||||
|
||||
This file is located under :file:`/sys/devices/system/cpu/cpufreq/` and controls
|
||||
the "boost" setting for the whole system. It is not present if the underlying
|
||||
scaling driver does not support the frequency boost mechanism (or supports it,
|
||||
but provides a driver-specific interface for controlling it, like
|
||||
|intel_pstate|).
|
||||
|
||||
If the value in this file is 1, the frequency boost mechanism is enabled. This
|
||||
means that either the hardware can be put into states in which it is able to
|
||||
trigger boosting (in the hardware-based case), or the software is allowed to
|
||||
trigger boosting (in the software-based case). It does not mean that boosting
|
||||
is actually in use at the moment on any CPUs in the system. It only means a
|
||||
permission to use the frequency boost mechanism (which still may never be used
|
||||
for other reasons).
|
||||
|
||||
If the value in this file is 0, the frequency boost mechanism is disabled and
|
||||
cannot be used at all.
|
||||
|
||||
The only values that can be written to this file are 0 and 1.
|
||||
|
||||
Rationale for Boost Control Knob
|
||||
--------------------------------
|
||||
|
||||
The frequency boost mechanism is generally intended to help to achieve optimum
|
||||
CPU performance on time scales below software resolution (e.g. below the
|
||||
scheduler tick interval) and it is demonstrably suitable for many workloads, but
|
||||
it may lead to problems in certain situations.
|
||||
|
||||
For this reason, many systems make it possible to disable the frequency boost
|
||||
mechanism in the platform firmware (BIOS) setup, but that requires the system to
|
||||
be restarted for the setting to be adjusted as desired, which may not be
|
||||
practical at least in some cases. For example:
|
||||
|
||||
1. Boosting means overclocking the processor, although under controlled
|
||||
conditions. Generally, the processor's energy consumption increases
|
||||
as a result of increasing its frequency and voltage, even temporarily.
|
||||
That may not be desirable on systems that switch to power sources of
|
||||
limited capacity, such as batteries, so the ability to disable the boost
|
||||
mechanism while the system is running may help there (but that depends on
|
||||
the workload too).
|
||||
|
||||
2. In some situations deterministic behavior is more important than
|
||||
performance or energy consumption (or both) and the ability to disable
|
||||
boosting while the system is running may be useful then.
|
||||
|
||||
3. To examine the impact of the frequency boost mechanism itself, it is useful
|
||||
to be able to run tests with and without boosting, preferably without
|
||||
restarting the system in the meantime.
|
||||
|
||||
4. Reproducible results are important when running benchmarks. Since
|
||||
the boosting functionality depends on the load of the whole package,
|
||||
single-thread performance may vary because of it which may lead to
|
||||
unreproducible results sometimes. That can be avoided by disabling the
|
||||
frequency boost mechanism before running benchmarks sensitive to that
|
||||
issue.
|
||||
|
||||
Legacy AMD ``cpb`` Knob
|
||||
-----------------------
|
||||
|
||||
The AMD powernow-k8 scaling driver supports a ``sysfs`` knob very similar to
|
||||
the global ``boost`` one. It is used for disabling/enabling the "Core
|
||||
Performance Boost" feature of some AMD processors.
|
||||
|
||||
If present, that knob is located in every ``CPUFreq`` policy directory in
|
||||
``sysfs`` (:file:`/sys/devices/system/cpu/cpufreq/policyX/`) and is called
|
||||
``cpb``, which indicates a more fine grained control interface. The actual
|
||||
implementation, however, works on the system-wide basis and setting that knob
|
||||
for one policy causes the same value of it to be set for all of the other
|
||||
policies at the same time.
|
||||
|
||||
That knob is still supported on AMD processors that support its underlying
|
||||
hardware feature, but it may be configured out of the kernel (via the
|
||||
:c:macro:`CONFIG_X86_ACPI_CPUFREQ_CPB` configuration option) and the global
|
||||
``boost`` knob is present regardless. Thus it is always possible use the
|
||||
``boost`` knob instead of the ``cpb`` one which is highly recommended, as that
|
||||
is more consistent with what all of the other systems do (and the ``cpb`` knob
|
||||
may not be supported any more in the future).
|
||||
|
||||
The ``cpb`` knob is never present for any processors without the underlying
|
||||
hardware feature (e.g. all Intel ones), even if the
|
||||
:c:macro:`CONFIG_X86_ACPI_CPUFREQ_CPB` configuration option is set.
|
||||
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
.. [1] Jonathan Corbet, *Per-entity load tracking*,
|
||||
https://lwn.net/Articles/531853/
|
274
Documentation/admin-guide/pm/cpufreq_drivers.rst
Normal file
274
Documentation/admin-guide/pm/cpufreq_drivers.rst
Normal file
|
@ -0,0 +1,274 @@
|
|||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
=======================================================
|
||||
Legacy Documentation of CPU Performance Scaling Drivers
|
||||
=======================================================
|
||||
|
||||
Included below are historic documents describing assorted
|
||||
:doc:`CPU performance scaling <cpufreq>` drivers. They are reproduced verbatim,
|
||||
with the original white space formatting and indentation preserved, except for
|
||||
the added leading space character in every line of text.
|
||||
|
||||
|
||||
AMD PowerNow! Drivers
|
||||
=====================
|
||||
|
||||
::
|
||||
|
||||
PowerNow! and Cool'n'Quiet are AMD names for frequency
|
||||
management capabilities in AMD processors. As the hardware
|
||||
implementation changes in new generations of the processors,
|
||||
there is a different cpu-freq driver for each generation.
|
||||
|
||||
Note that the driver's will not load on the "wrong" hardware,
|
||||
so it is safe to try each driver in turn when in doubt as to
|
||||
which is the correct driver.
|
||||
|
||||
Note that the functionality to change frequency (and voltage)
|
||||
is not available in all processors. The drivers will refuse
|
||||
to load on processors without this capability. The capability
|
||||
is detected with the cpuid instruction.
|
||||
|
||||
The drivers use BIOS supplied tables to obtain frequency and
|
||||
voltage information appropriate for a particular platform.
|
||||
Frequency transitions will be unavailable if the BIOS does
|
||||
not supply these tables.
|
||||
|
||||
6th Generation: powernow-k6
|
||||
|
||||
7th Generation: powernow-k7: Athlon, Duron, Geode.
|
||||
|
||||
8th Generation: powernow-k8: Athlon, Athlon 64, Opteron, Sempron.
|
||||
Documentation on this functionality in 8th generation processors
|
||||
is available in the "BIOS and Kernel Developer's Guide", publication
|
||||
26094, in chapter 9, available for download from www.amd.com.
|
||||
|
||||
BIOS supplied data, for powernow-k7 and for powernow-k8, may be
|
||||
from either the PSB table or from ACPI objects. The ACPI support
|
||||
is only available if the kernel config sets CONFIG_ACPI_PROCESSOR.
|
||||
The powernow-k8 driver will attempt to use ACPI if so configured,
|
||||
and fall back to PST if that fails.
|
||||
The powernow-k7 driver will try to use the PSB support first, and
|
||||
fall back to ACPI if the PSB support fails. A module parameter,
|
||||
acpi_force, is provided to force ACPI support to be used instead
|
||||
of PSB support.
|
||||
|
||||
|
||||
``cpufreq-nforce2``
|
||||
===================
|
||||
|
||||
::
|
||||
|
||||
The cpufreq-nforce2 driver changes the FSB on nVidia nForce2 platforms.
|
||||
|
||||
This works better than on other platforms, because the FSB of the CPU
|
||||
can be controlled independently from the PCI/AGP clock.
|
||||
|
||||
The module has two options:
|
||||
|
||||
fid: multiplier * 10 (for example 8.5 = 85)
|
||||
min_fsb: minimum FSB
|
||||
|
||||
If not set, fid is calculated from the current CPU speed and the FSB.
|
||||
min_fsb defaults to FSB at boot time - 50 MHz.
|
||||
|
||||
IMPORTANT: The available range is limited downwards!
|
||||
Also the minimum available FSB can differ, for systems
|
||||
booting with 200 MHz, 150 should always work.
|
||||
|
||||
|
||||
``pcc-cpufreq``
|
||||
===============
|
||||
|
||||
::
|
||||
|
||||
/*
|
||||
* pcc-cpufreq.txt - PCC interface documentation
|
||||
*
|
||||
* Copyright (C) 2009 Red Hat, Matthew Garrett <mjg@redhat.com>
|
||||
* Copyright (C) 2009 Hewlett-Packard Development Company, L.P.
|
||||
* Nagananda Chumbalkar <nagananda.chumbalkar@hp.com>
|
||||
*/
|
||||
|
||||
|
||||
Processor Clocking Control Driver
|
||||
---------------------------------
|
||||
|
||||
Contents:
|
||||
---------
|
||||
1. Introduction
|
||||
1.1 PCC interface
|
||||
1.1.1 Get Average Frequency
|
||||
1.1.2 Set Desired Frequency
|
||||
1.2 Platforms affected
|
||||
2. Driver and /sys details
|
||||
2.1 scaling_available_frequencies
|
||||
2.2 cpuinfo_transition_latency
|
||||
2.3 cpuinfo_cur_freq
|
||||
2.4 related_cpus
|
||||
3. Caveats
|
||||
|
||||
1. Introduction:
|
||||
----------------
|
||||
Processor Clocking Control (PCC) is an interface between the platform
|
||||
firmware and OSPM. It is a mechanism for coordinating processor
|
||||
performance (ie: frequency) between the platform firmware and the OS.
|
||||
|
||||
The PCC driver (pcc-cpufreq) allows OSPM to take advantage of the PCC
|
||||
interface.
|
||||
|
||||
OS utilizes the PCC interface to inform platform firmware what frequency the
|
||||
OS wants for a logical processor. The platform firmware attempts to achieve
|
||||
the requested frequency. If the request for the target frequency could not be
|
||||
satisfied by platform firmware, then it usually means that power budget
|
||||
conditions are in place, and "power capping" is taking place.
|
||||
|
||||
1.1 PCC interface:
|
||||
------------------
|
||||
The complete PCC specification is available here:
|
||||
https://acpica.org/sites/acpica/files/Processor-Clocking-Control-v1p0.pdf
|
||||
|
||||
PCC relies on a shared memory region that provides a channel for communication
|
||||
between the OS and platform firmware. PCC also implements a "doorbell" that
|
||||
is used by the OS to inform the platform firmware that a command has been
|
||||
sent.
|
||||
|
||||
The ACPI PCCH() method is used to discover the location of the PCC shared
|
||||
memory region. The shared memory region header contains the "command" and
|
||||
"status" interface. PCCH() also contains details on how to access the platform
|
||||
doorbell.
|
||||
|
||||
The following commands are supported by the PCC interface:
|
||||
* Get Average Frequency
|
||||
* Set Desired Frequency
|
||||
|
||||
The ACPI PCCP() method is implemented for each logical processor and is
|
||||
used to discover the offsets for the input and output buffers in the shared
|
||||
memory region.
|
||||
|
||||
When PCC mode is enabled, the platform will not expose processor performance
|
||||
or throttle states (_PSS, _TSS and related ACPI objects) to OSPM. Therefore,
|
||||
the native P-state driver (such as acpi-cpufreq for Intel, powernow-k8 for
|
||||
AMD) will not load.
|
||||
|
||||
However, OSPM remains in control of policy. The governor (eg: "ondemand")
|
||||
computes the required performance for each processor based on server workload.
|
||||
The PCC driver fills in the command interface, and the input buffer and
|
||||
communicates the request to the platform firmware. The platform firmware is
|
||||
responsible for delivering the requested performance.
|
||||
|
||||
Each PCC command is "global" in scope and can affect all the logical CPUs in
|
||||
the system. Therefore, PCC is capable of performing "group" updates. With PCC
|
||||
the OS is capable of getting/setting the frequency of all the logical CPUs in
|
||||
the system with a single call to the BIOS.
|
||||
|
||||
1.1.1 Get Average Frequency:
|
||||
----------------------------
|
||||
This command is used by the OSPM to query the running frequency of the
|
||||
processor since the last time this command was completed. The output buffer
|
||||
indicates the average unhalted frequency of the logical processor expressed as
|
||||
a percentage of the nominal (ie: maximum) CPU frequency. The output buffer
|
||||
also signifies if the CPU frequency is limited by a power budget condition.
|
||||
|
||||
1.1.2 Set Desired Frequency:
|
||||
----------------------------
|
||||
This command is used by the OSPM to communicate to the platform firmware the
|
||||
desired frequency for a logical processor. The output buffer is currently
|
||||
ignored by OSPM. The next invocation of "Get Average Frequency" will inform
|
||||
OSPM if the desired frequency was achieved or not.
|
||||
|
||||
1.2 Platforms affected:
|
||||
-----------------------
|
||||
The PCC driver will load on any system where the platform firmware:
|
||||
* supports the PCC interface, and the associated PCCH() and PCCP() methods
|
||||
* assumes responsibility for managing the hardware clocking controls in order
|
||||
to deliver the requested processor performance
|
||||
|
||||
Currently, certain HP ProLiant platforms implement the PCC interface. On those
|
||||
platforms PCC is the "default" choice.
|
||||
|
||||
However, it is possible to disable this interface via a BIOS setting. In
|
||||
such an instance, as is also the case on platforms where the PCC interface
|
||||
is not implemented, the PCC driver will fail to load silently.
|
||||
|
||||
2. Driver and /sys details:
|
||||
---------------------------
|
||||
When the driver loads, it merely prints the lowest and the highest CPU
|
||||
frequencies supported by the platform firmware.
|
||||
|
||||
The PCC driver loads with a message such as:
|
||||
pcc-cpufreq: (v1.00.00) driver loaded with frequency limits: 1600 MHz, 2933
|
||||
MHz
|
||||
|
||||
This means that the OPSM can request the CPU to run at any frequency in
|
||||
between the limits (1600 MHz, and 2933 MHz) specified in the message.
|
||||
|
||||
Internally, there is no need for the driver to convert the "target" frequency
|
||||
to a corresponding P-state.
|
||||
|
||||
The VERSION number for the driver will be of the format v.xy.ab.
|
||||
eg: 1.00.02
|
||||
----- --
|
||||
| |
|
||||
| -- this will increase with bug fixes/enhancements to the driver
|
||||
|-- this is the version of the PCC specification the driver adheres to
|
||||
|
||||
|
||||
The following is a brief discussion on some of the fields exported via the
|
||||
/sys filesystem and how their values are affected by the PCC driver:
|
||||
|
||||
2.1 scaling_available_frequencies:
|
||||
----------------------------------
|
||||
scaling_available_frequencies is not created in /sys. No intermediate
|
||||
frequencies need to be listed because the BIOS will try to achieve any
|
||||
frequency, within limits, requested by the governor. A frequency does not have
|
||||
to be strictly associated with a P-state.
|
||||
|
||||
2.2 cpuinfo_transition_latency:
|
||||
-------------------------------
|
||||
The cpuinfo_transition_latency field is 0. The PCC specification does
|
||||
not include a field to expose this value currently.
|
||||
|
||||
2.3 cpuinfo_cur_freq:
|
||||
---------------------
|
||||
A) Often cpuinfo_cur_freq will show a value different than what is declared
|
||||
in the scaling_available_frequencies or scaling_cur_freq, or scaling_max_freq.
|
||||
This is due to "turbo boost" available on recent Intel processors. If certain
|
||||
conditions are met the BIOS can achieve a slightly higher speed than requested
|
||||
by OSPM. An example:
|
||||
|
||||
scaling_cur_freq : 2933000
|
||||
cpuinfo_cur_freq : 3196000
|
||||
|
||||
B) There is a round-off error associated with the cpuinfo_cur_freq value.
|
||||
Since the driver obtains the current frequency as a "percentage" (%) of the
|
||||
nominal frequency from the BIOS, sometimes, the values displayed by
|
||||
scaling_cur_freq and cpuinfo_cur_freq may not match. An example:
|
||||
|
||||
scaling_cur_freq : 1600000
|
||||
cpuinfo_cur_freq : 1583000
|
||||
|
||||
In this example, the nominal frequency is 2933 MHz. The driver obtains the
|
||||
current frequency, cpuinfo_cur_freq, as 54% of the nominal frequency:
|
||||
|
||||
54% of 2933 MHz = 1583 MHz
|
||||
|
||||
Nominal frequency is the maximum frequency of the processor, and it usually
|
||||
corresponds to the frequency of the P0 P-state.
|
||||
|
||||
2.4 related_cpus:
|
||||
-----------------
|
||||
The related_cpus field is identical to affected_cpus.
|
||||
|
||||
affected_cpus : 4
|
||||
related_cpus : 4
|
||||
|
||||
Currently, the PCC driver does not evaluate _PSD. The platforms that support
|
||||
PCC do not implement SW_ALL. So OSPM doesn't need to perform any coordination
|
||||
to ensure that the same frequency is requested of all dependent CPUs.
|
||||
|
||||
3. Caveats:
|
||||
-----------
|
||||
The "cpufreq_stats" module in its present form cannot be loaded and
|
||||
expected to work with the PCC driver. Since the "cpufreq_stats" module
|
||||
provides information wrt each P-state, it is not applicable to the PCC driver.
|
662
Documentation/admin-guide/pm/cpuidle.rst
Normal file
662
Documentation/admin-guide/pm/cpuidle.rst
Normal file
|
@ -0,0 +1,662 @@
|
|||
.. SPDX-License-Identifier: GPL-2.0
|
||||
.. include:: <isonum.txt>
|
||||
|
||||
.. |struct cpuidle_state| replace:: :c:type:`struct cpuidle_state <cpuidle_state>`
|
||||
.. |cpufreq| replace:: :doc:`CPU Performance Scaling <cpufreq>`
|
||||
|
||||
========================
|
||||
CPU Idle Time Management
|
||||
========================
|
||||
|
||||
:Copyright: |copy| 2018 Intel Corporation
|
||||
|
||||
:Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
||||
|
||||
|
||||
Concepts
|
||||
========
|
||||
|
||||
Modern processors are generally able to enter states in which the execution of
|
||||
a program is suspended and instructions belonging to it are not fetched from
|
||||
memory or executed. Those states are the *idle* states of the processor.
|
||||
|
||||
Since part of the processor hardware is not used in idle states, entering them
|
||||
generally allows power drawn by the processor to be reduced and, in consequence,
|
||||
it is an opportunity to save energy.
|
||||
|
||||
CPU idle time management is an energy-efficiency feature concerned about using
|
||||
the idle states of processors for this purpose.
|
||||
|
||||
Logical CPUs
|
||||
------------
|
||||
|
||||
CPU idle time management operates on CPUs as seen by the *CPU scheduler* (that
|
||||
is the part of the kernel responsible for the distribution of computational
|
||||
work in the system). In its view, CPUs are *logical* units. That is, they need
|
||||
not be separate physical entities and may just be interfaces appearing to
|
||||
software as individual single-core processors. In other words, a CPU is an
|
||||
entity which appears to be fetching instructions that belong to one sequence
|
||||
(program) from memory and executing them, but it need not work this way
|
||||
physically. Generally, three different cases can be consider here.
|
||||
|
||||
First, if the whole processor can only follow one sequence of instructions (one
|
||||
program) at a time, it is a CPU. In that case, if the hardware is asked to
|
||||
enter an idle state, that applies to the processor as a whole.
|
||||
|
||||
Second, if the processor is multi-core, each core in it is able to follow at
|
||||
least one program at a time. The cores need not be entirely independent of each
|
||||
other (for example, they may share caches), but still most of the time they
|
||||
work physically in parallel with each other, so if each of them executes only
|
||||
one program, those programs run mostly independently of each other at the same
|
||||
time. The entire cores are CPUs in that case and if the hardware is asked to
|
||||
enter an idle state, that applies to the core that asked for it in the first
|
||||
place, but it also may apply to a larger unit (say a "package" or a "cluster")
|
||||
that the core belongs to (in fact, it may apply to an entire hierarchy of larger
|
||||
units containing the core). Namely, if all of the cores in the larger unit
|
||||
except for one have been put into idle states at the "core level" and the
|
||||
remaining core asks the processor to enter an idle state, that may trigger it
|
||||
to put the whole larger unit into an idle state which also will affect the
|
||||
other cores in that unit.
|
||||
|
||||
Finally, each core in a multi-core processor may be able to follow more than one
|
||||
program in the same time frame (that is, each core may be able to fetch
|
||||
instructions from multiple locations in memory and execute them in the same time
|
||||
frame, but not necessarily entirely in parallel with each other). In that case
|
||||
the cores present themselves to software as "bundles" each consisting of
|
||||
multiple individual single-core "processors", referred to as *hardware threads*
|
||||
(or hyper-threads specifically on Intel hardware), that each can follow one
|
||||
sequence of instructions. Then, the hardware threads are CPUs from the CPU idle
|
||||
time management perspective and if the processor is asked to enter an idle state
|
||||
by one of them, the hardware thread (or CPU) that asked for it is stopped, but
|
||||
nothing more happens, unless all of the other hardware threads within the same
|
||||
core also have asked the processor to enter an idle state. In that situation,
|
||||
the core may be put into an idle state individually or a larger unit containing
|
||||
it may be put into an idle state as a whole (if the other cores within the
|
||||
larger unit are in idle states already).
|
||||
|
||||
Idle CPUs
|
||||
---------
|
||||
|
||||
Logical CPUs, simply referred to as "CPUs" in what follows, are regarded as
|
||||
*idle* by the Linux kernel when there are no tasks to run on them except for the
|
||||
special "idle" task.
|
||||
|
||||
Tasks are the CPU scheduler's representation of work. Each task consists of a
|
||||
sequence of instructions to execute, or code, data to be manipulated while
|
||||
running that code, and some context information that needs to be loaded into the
|
||||
processor every time the task's code is run by a CPU. The CPU scheduler
|
||||
distributes work by assigning tasks to run to the CPUs present in the system.
|
||||
|
||||
Tasks can be in various states. In particular, they are *runnable* if there are
|
||||
no specific conditions preventing their code from being run by a CPU as long as
|
||||
there is a CPU available for that (for example, they are not waiting for any
|
||||
events to occur or similar). When a task becomes runnable, the CPU scheduler
|
||||
assigns it to one of the available CPUs to run and if there are no more runnable
|
||||
tasks assigned to it, the CPU will load the given task's context and run its
|
||||
code (from the instruction following the last one executed so far, possibly by
|
||||
another CPU). [If there are multiple runnable tasks assigned to one CPU
|
||||
simultaneously, they will be subject to prioritization and time sharing in order
|
||||
to allow them to make some progress over time.]
|
||||
|
||||
The special "idle" task becomes runnable if there are no other runnable tasks
|
||||
assigned to the given CPU and the CPU is then regarded as idle. In other words,
|
||||
in Linux idle CPUs run the code of the "idle" task called *the idle loop*. That
|
||||
code may cause the processor to be put into one of its idle states, if they are
|
||||
supported, in order to save energy, but if the processor does not support any
|
||||
idle states, or there is not enough time to spend in an idle state before the
|
||||
next wakeup event, or there are strict latency constraints preventing any of the
|
||||
available idle states from being used, the CPU will simply execute more or less
|
||||
useless instructions in a loop until it is assigned a new task to run.
|
||||
|
||||
|
||||
.. _idle-loop:
|
||||
|
||||
The Idle Loop
|
||||
=============
|
||||
|
||||
The idle loop code takes two major steps in every iteration of it. First, it
|
||||
calls into a code module referred to as the *governor* that belongs to the CPU
|
||||
idle time management subsystem called ``CPUIdle`` to select an idle state for
|
||||
the CPU to ask the hardware to enter. Second, it invokes another code module
|
||||
from the ``CPUIdle`` subsystem, called the *driver*, to actually ask the
|
||||
processor hardware to enter the idle state selected by the governor.
|
||||
|
||||
The role of the governor is to find an idle state most suitable for the
|
||||
conditions at hand. For this purpose, idle states that the hardware can be
|
||||
asked to enter by logical CPUs are represented in an abstract way independent of
|
||||
the platform or the processor architecture and organized in a one-dimensional
|
||||
(linear) array. That array has to be prepared and supplied by the ``CPUIdle``
|
||||
driver matching the platform the kernel is running on at the initialization
|
||||
time. This allows ``CPUIdle`` governors to be independent of the underlying
|
||||
hardware and to work with any platforms that the Linux kernel can run on.
|
||||
|
||||
Each idle state present in that array is characterized by two parameters to be
|
||||
taken into account by the governor, the *target residency* and the (worst-case)
|
||||
*exit latency*. The target residency is the minimum time the hardware must
|
||||
spend in the given state, including the time needed to enter it (which may be
|
||||
substantial), in order to save more energy than it would save by entering one of
|
||||
the shallower idle states instead. [The "depth" of an idle state roughly
|
||||
corresponds to the power drawn by the processor in that state.] The exit
|
||||
latency, in turn, is the maximum time it will take a CPU asking the processor
|
||||
hardware to enter an idle state to start executing the first instruction after a
|
||||
wakeup from that state. Note that in general the exit latency also must cover
|
||||
the time needed to enter the given state in case the wakeup occurs when the
|
||||
hardware is entering it and it must be entered completely to be exited in an
|
||||
ordered manner.
|
||||
|
||||
There are two types of information that can influence the governor's decisions.
|
||||
First of all, the governor knows the time until the closest timer event. That
|
||||
time is known exactly, because the kernel programs timers and it knows exactly
|
||||
when they will trigger, and it is the maximum time the hardware that the given
|
||||
CPU depends on can spend in an idle state, including the time necessary to enter
|
||||
and exit it. However, the CPU may be woken up by a non-timer event at any time
|
||||
(in particular, before the closest timer triggers) and it generally is not known
|
||||
when that may happen. The governor can only see how much time the CPU actually
|
||||
was idle after it has been woken up (that time will be referred to as the *idle
|
||||
duration* from now on) and it can use that information somehow along with the
|
||||
time until the closest timer to estimate the idle duration in future. How the
|
||||
governor uses that information depends on what algorithm is implemented by it
|
||||
and that is the primary reason for having more than one governor in the
|
||||
``CPUIdle`` subsystem.
|
||||
|
||||
There are four ``CPUIdle`` governors available, ``menu``, `TEO <teo-gov_>`_,
|
||||
``ladder`` and ``haltpoll``. Which of them is used by default depends on the
|
||||
configuration of the kernel and in particular on whether or not the scheduler
|
||||
tick can be `stopped by the idle loop <idle-cpus-and-tick_>`_. Available
|
||||
governors can be read from the :file:`available_governors`, and the governor
|
||||
can be changed at runtime. The name of the ``CPUIdle`` governor currently
|
||||
used by the kernel can be read from the :file:`current_governor_ro` or
|
||||
:file:`current_governor` file under :file:`/sys/devices/system/cpu/cpuidle/`
|
||||
in ``sysfs``.
|
||||
|
||||
Which ``CPUIdle`` driver is used, on the other hand, usually depends on the
|
||||
platform the kernel is running on, but there are platforms with more than one
|
||||
matching driver. For example, there are two drivers that can work with the
|
||||
majority of Intel platforms, ``intel_idle`` and ``acpi_idle``, one with
|
||||
hardcoded idle states information and the other able to read that information
|
||||
from the system's ACPI tables, respectively. Still, even in those cases, the
|
||||
driver chosen at the system initialization time cannot be replaced later, so the
|
||||
decision on which one of them to use has to be made early (on Intel platforms
|
||||
the ``acpi_idle`` driver will be used if ``intel_idle`` is disabled for some
|
||||
reason or if it does not recognize the processor). The name of the ``CPUIdle``
|
||||
driver currently used by the kernel can be read from the :file:`current_driver`
|
||||
file under :file:`/sys/devices/system/cpu/cpuidle/` in ``sysfs``.
|
||||
|
||||
|
||||
.. _idle-cpus-and-tick:
|
||||
|
||||
Idle CPUs and The Scheduler Tick
|
||||
================================
|
||||
|
||||
The scheduler tick is a timer that triggers periodically in order to implement
|
||||
the time sharing strategy of the CPU scheduler. Of course, if there are
|
||||
multiple runnable tasks assigned to one CPU at the same time, the only way to
|
||||
allow them to make reasonable progress in a given time frame is to make them
|
||||
share the available CPU time. Namely, in rough approximation, each task is
|
||||
given a slice of the CPU time to run its code, subject to the scheduling class,
|
||||
prioritization and so on and when that time slice is used up, the CPU should be
|
||||
switched over to running (the code of) another task. The currently running task
|
||||
may not want to give the CPU away voluntarily, however, and the scheduler tick
|
||||
is there to make the switch happen regardless. That is not the only role of the
|
||||
tick, but it is the primary reason for using it.
|
||||
|
||||
The scheduler tick is problematic from the CPU idle time management perspective,
|
||||
because it triggers periodically and relatively often (depending on the kernel
|
||||
configuration, the length of the tick period is between 1 ms and 10 ms).
|
||||
Thus, if the tick is allowed to trigger on idle CPUs, it will not make sense
|
||||
for them to ask the hardware to enter idle states with target residencies above
|
||||
the tick period length. Moreover, in that case the idle duration of any CPU
|
||||
will never exceed the tick period length and the energy used for entering and
|
||||
exiting idle states due to the tick wakeups on idle CPUs will be wasted.
|
||||
|
||||
Fortunately, it is not really necessary to allow the tick to trigger on idle
|
||||
CPUs, because (by definition) they have no tasks to run except for the special
|
||||
"idle" one. In other words, from the CPU scheduler perspective, the only user
|
||||
of the CPU time on them is the idle loop. Since the time of an idle CPU need
|
||||
not be shared between multiple runnable tasks, the primary reason for using the
|
||||
tick goes away if the given CPU is idle. Consequently, it is possible to stop
|
||||
the scheduler tick entirely on idle CPUs in principle, even though that may not
|
||||
always be worth the effort.
|
||||
|
||||
Whether or not it makes sense to stop the scheduler tick in the idle loop
|
||||
depends on what is expected by the governor. First, if there is another
|
||||
(non-tick) timer due to trigger within the tick range, stopping the tick clearly
|
||||
would be a waste of time, even though the timer hardware may not need to be
|
||||
reprogrammed in that case. Second, if the governor is expecting a non-timer
|
||||
wakeup within the tick range, stopping the tick is not necessary and it may even
|
||||
be harmful. Namely, in that case the governor will select an idle state with
|
||||
the target residency within the time until the expected wakeup, so that state is
|
||||
going to be relatively shallow. The governor really cannot select a deep idle
|
||||
state then, as that would contradict its own expectation of a wakeup in short
|
||||
order. Now, if the wakeup really occurs shortly, stopping the tick would be a
|
||||
waste of time and in this case the timer hardware would need to be reprogrammed,
|
||||
which is expensive. On the other hand, if the tick is stopped and the wakeup
|
||||
does not occur any time soon, the hardware may spend indefinite amount of time
|
||||
in the shallow idle state selected by the governor, which will be a waste of
|
||||
energy. Hence, if the governor is expecting a wakeup of any kind within the
|
||||
tick range, it is better to allow the tick trigger. Otherwise, however, the
|
||||
governor will select a relatively deep idle state, so the tick should be stopped
|
||||
so that it does not wake up the CPU too early.
|
||||
|
||||
In any case, the governor knows what it is expecting and the decision on whether
|
||||
or not to stop the scheduler tick belongs to it. Still, if the tick has been
|
||||
stopped already (in one of the previous iterations of the loop), it is better
|
||||
to leave it as is and the governor needs to take that into account.
|
||||
|
||||
The kernel can be configured to disable stopping the scheduler tick in the idle
|
||||
loop altogether. That can be done through the build-time configuration of it
|
||||
(by unsetting the ``CONFIG_NO_HZ_IDLE`` configuration option) or by passing
|
||||
``nohz=off`` to it in the command line. In both cases, as the stopping of the
|
||||
scheduler tick is disabled, the governor's decisions regarding it are simply
|
||||
ignored by the idle loop code and the tick is never stopped.
|
||||
|
||||
The systems that run kernels configured to allow the scheduler tick to be
|
||||
stopped on idle CPUs are referred to as *tickless* systems and they are
|
||||
generally regarded as more energy-efficient than the systems running kernels in
|
||||
which the tick cannot be stopped. If the given system is tickless, it will use
|
||||
the ``menu`` governor by default and if it is not tickless, the default
|
||||
``CPUIdle`` governor on it will be ``ladder``.
|
||||
|
||||
|
||||
.. _menu-gov:
|
||||
|
||||
The ``menu`` Governor
|
||||
=====================
|
||||
|
||||
The ``menu`` governor is the default ``CPUIdle`` governor for tickless systems.
|
||||
It is quite complex, but the basic principle of its design is straightforward.
|
||||
Namely, when invoked to select an idle state for a CPU (i.e. an idle state that
|
||||
the CPU will ask the processor hardware to enter), it attempts to predict the
|
||||
idle duration and uses the predicted value for idle state selection.
|
||||
|
||||
It first obtains the time until the closest timer event with the assumption
|
||||
that the scheduler tick will be stopped. That time, referred to as the *sleep
|
||||
length* in what follows, is the upper bound on the time before the next CPU
|
||||
wakeup. It is used to determine the sleep length range, which in turn is needed
|
||||
to get the sleep length correction factor.
|
||||
|
||||
The ``menu`` governor maintains two arrays of sleep length correction factors.
|
||||
One of them is used when tasks previously running on the given CPU are waiting
|
||||
for some I/O operations to complete and the other one is used when that is not
|
||||
the case. Each array contains several correction factor values that correspond
|
||||
to different sleep length ranges organized so that each range represented in the
|
||||
array is approximately 10 times wider than the previous one.
|
||||
|
||||
The correction factor for the given sleep length range (determined before
|
||||
selecting the idle state for the CPU) is updated after the CPU has been woken
|
||||
up and the closer the sleep length is to the observed idle duration, the closer
|
||||
to 1 the correction factor becomes (it must fall between 0 and 1 inclusive).
|
||||
The sleep length is multiplied by the correction factor for the range that it
|
||||
falls into to obtain the first approximation of the predicted idle duration.
|
||||
|
||||
Next, the governor uses a simple pattern recognition algorithm to refine its
|
||||
idle duration prediction. Namely, it saves the last 8 observed idle duration
|
||||
values and, when predicting the idle duration next time, it computes the average
|
||||
and variance of them. If the variance is small (smaller than 400 square
|
||||
milliseconds) or it is small relative to the average (the average is greater
|
||||
that 6 times the standard deviation), the average is regarded as the "typical
|
||||
interval" value. Otherwise, the longest of the saved observed idle duration
|
||||
values is discarded and the computation is repeated for the remaining ones.
|
||||
Again, if the variance of them is small (in the above sense), the average is
|
||||
taken as the "typical interval" value and so on, until either the "typical
|
||||
interval" is determined or too many data points are disregarded, in which case
|
||||
the "typical interval" is assumed to equal "infinity" (the maximum unsigned
|
||||
integer value). The "typical interval" computed this way is compared with the
|
||||
sleep length multiplied by the correction factor and the minimum of the two is
|
||||
taken as the predicted idle duration.
|
||||
|
||||
Then, the governor computes an extra latency limit to help "interactive"
|
||||
workloads. It uses the observation that if the exit latency of the selected
|
||||
idle state is comparable with the predicted idle duration, the total time spent
|
||||
in that state probably will be very short and the amount of energy to save by
|
||||
entering it will be relatively small, so likely it is better to avoid the
|
||||
overhead related to entering that state and exiting it. Thus selecting a
|
||||
shallower state is likely to be a better option then. The first approximation
|
||||
of the extra latency limit is the predicted idle duration itself which
|
||||
additionally is divided by a value depending on the number of tasks that
|
||||
previously ran on the given CPU and now they are waiting for I/O operations to
|
||||
complete. The result of that division is compared with the latency limit coming
|
||||
from the power management quality of service, or `PM QoS <cpu-pm-qos_>`_,
|
||||
framework and the minimum of the two is taken as the limit for the idle states'
|
||||
exit latency.
|
||||
|
||||
Now, the governor is ready to walk the list of idle states and choose one of
|
||||
them. For this purpose, it compares the target residency of each state with
|
||||
the predicted idle duration and the exit latency of it with the computed latency
|
||||
limit. It selects the state with the target residency closest to the predicted
|
||||
idle duration, but still below it, and exit latency that does not exceed the
|
||||
limit.
|
||||
|
||||
In the final step the governor may still need to refine the idle state selection
|
||||
if it has not decided to `stop the scheduler tick <idle-cpus-and-tick_>`_. That
|
||||
happens if the idle duration predicted by it is less than the tick period and
|
||||
the tick has not been stopped already (in a previous iteration of the idle
|
||||
loop). Then, the sleep length used in the previous computations may not reflect
|
||||
the real time until the closest timer event and if it really is greater than
|
||||
that time, the governor may need to select a shallower state with a suitable
|
||||
target residency.
|
||||
|
||||
|
||||
.. _teo-gov:
|
||||
|
||||
The Timer Events Oriented (TEO) Governor
|
||||
========================================
|
||||
|
||||
The timer events oriented (TEO) governor is an alternative ``CPUIdle`` governor
|
||||
for tickless systems. It follows the same basic strategy as the ``menu`` `one
|
||||
<menu-gov_>`_: it always tries to find the deepest idle state suitable for the
|
||||
given conditions. However, it applies a different approach to that problem.
|
||||
|
||||
.. kernel-doc:: drivers/cpuidle/governors/teo.c
|
||||
:doc: teo-description
|
||||
|
||||
.. _idle-states-representation:
|
||||
|
||||
Representation of Idle States
|
||||
=============================
|
||||
|
||||
For the CPU idle time management purposes all of the physical idle states
|
||||
supported by the processor have to be represented as a one-dimensional array of
|
||||
|struct cpuidle_state| objects each allowing an individual (logical) CPU to ask
|
||||
the processor hardware to enter an idle state of certain properties. If there
|
||||
is a hierarchy of units in the processor, one |struct cpuidle_state| object can
|
||||
cover a combination of idle states supported by the units at different levels of
|
||||
the hierarchy. In that case, the `target residency and exit latency parameters
|
||||
of it <idle-loop_>`_, must reflect the properties of the idle state at the
|
||||
deepest level (i.e. the idle state of the unit containing all of the other
|
||||
units).
|
||||
|
||||
For example, take a processor with two cores in a larger unit referred to as
|
||||
a "module" and suppose that asking the hardware to enter a specific idle state
|
||||
(say "X") at the "core" level by one core will trigger the module to try to
|
||||
enter a specific idle state of its own (say "MX") if the other core is in idle
|
||||
state "X" already. In other words, asking for idle state "X" at the "core"
|
||||
level gives the hardware a license to go as deep as to idle state "MX" at the
|
||||
"module" level, but there is no guarantee that this is going to happen (the core
|
||||
asking for idle state "X" may just end up in that state by itself instead).
|
||||
Then, the target residency of the |struct cpuidle_state| object representing
|
||||
idle state "X" must reflect the minimum time to spend in idle state "MX" of
|
||||
the module (including the time needed to enter it), because that is the minimum
|
||||
time the CPU needs to be idle to save any energy in case the hardware enters
|
||||
that state. Analogously, the exit latency parameter of that object must cover
|
||||
the exit time of idle state "MX" of the module (and usually its entry time too),
|
||||
because that is the maximum delay between a wakeup signal and the time the CPU
|
||||
will start to execute the first new instruction (assuming that both cores in the
|
||||
module will always be ready to execute instructions as soon as the module
|
||||
becomes operational as a whole).
|
||||
|
||||
There are processors without direct coordination between different levels of the
|
||||
hierarchy of units inside them, however. In those cases asking for an idle
|
||||
state at the "core" level does not automatically affect the "module" level, for
|
||||
example, in any way and the ``CPUIdle`` driver is responsible for the entire
|
||||
handling of the hierarchy. Then, the definition of the idle state objects is
|
||||
entirely up to the driver, but still the physical properties of the idle state
|
||||
that the processor hardware finally goes into must always follow the parameters
|
||||
used by the governor for idle state selection (for instance, the actual exit
|
||||
latency of that idle state must not exceed the exit latency parameter of the
|
||||
idle state object selected by the governor).
|
||||
|
||||
In addition to the target residency and exit latency idle state parameters
|
||||
discussed above, the objects representing idle states each contain a few other
|
||||
parameters describing the idle state and a pointer to the function to run in
|
||||
order to ask the hardware to enter that state. Also, for each
|
||||
|struct cpuidle_state| object, there is a corresponding
|
||||
:c:type:`struct cpuidle_state_usage <cpuidle_state_usage>` one containing usage
|
||||
statistics of the given idle state. That information is exposed by the kernel
|
||||
via ``sysfs``.
|
||||
|
||||
For each CPU in the system, there is a :file:`/sys/devices/system/cpu/cpu<N>/cpuidle/`
|
||||
directory in ``sysfs``, where the number ``<N>`` is assigned to the given
|
||||
CPU at the initialization time. That directory contains a set of subdirectories
|
||||
called :file:`state0`, :file:`state1` and so on, up to the number of idle state
|
||||
objects defined for the given CPU minus one. Each of these directories
|
||||
corresponds to one idle state object and the larger the number in its name, the
|
||||
deeper the (effective) idle state represented by it. Each of them contains
|
||||
a number of files (attributes) representing the properties of the idle state
|
||||
object corresponding to it, as follows:
|
||||
|
||||
``above``
|
||||
Total number of times this idle state had been asked for, but the
|
||||
observed idle duration was certainly too short to match its target
|
||||
residency.
|
||||
|
||||
``below``
|
||||
Total number of times this idle state had been asked for, but certainly
|
||||
a deeper idle state would have been a better match for the observed idle
|
||||
duration.
|
||||
|
||||
``desc``
|
||||
Description of the idle state.
|
||||
|
||||
``disable``
|
||||
Whether or not this idle state is disabled.
|
||||
|
||||
``default_status``
|
||||
The default status of this state, "enabled" or "disabled".
|
||||
|
||||
``latency``
|
||||
Exit latency of the idle state in microseconds.
|
||||
|
||||
``name``
|
||||
Name of the idle state.
|
||||
|
||||
``power``
|
||||
Power drawn by hardware in this idle state in milliwatts (if specified,
|
||||
0 otherwise).
|
||||
|
||||
``residency``
|
||||
Target residency of the idle state in microseconds.
|
||||
|
||||
``time``
|
||||
Total time spent in this idle state by the given CPU (as measured by the
|
||||
kernel) in microseconds.
|
||||
|
||||
``usage``
|
||||
Total number of times the hardware has been asked by the given CPU to
|
||||
enter this idle state.
|
||||
|
||||
``rejected``
|
||||
Total number of times a request to enter this idle state on the given
|
||||
CPU was rejected.
|
||||
|
||||
The :file:`desc` and :file:`name` files both contain strings. The difference
|
||||
between them is that the name is expected to be more concise, while the
|
||||
description may be longer and it may contain white space or special characters.
|
||||
The other files listed above contain integer numbers.
|
||||
|
||||
The :file:`disable` attribute is the only writeable one. If it contains 1, the
|
||||
given idle state is disabled for this particular CPU, which means that the
|
||||
governor will never select it for this particular CPU and the ``CPUIdle``
|
||||
driver will never ask the hardware to enter it for that CPU as a result.
|
||||
However, disabling an idle state for one CPU does not prevent it from being
|
||||
asked for by the other CPUs, so it must be disabled for all of them in order to
|
||||
never be asked for by any of them. [Note that, due to the way the ``ladder``
|
||||
governor is implemented, disabling an idle state prevents that governor from
|
||||
selecting any idle states deeper than the disabled one too.]
|
||||
|
||||
If the :file:`disable` attribute contains 0, the given idle state is enabled for
|
||||
this particular CPU, but it still may be disabled for some or all of the other
|
||||
CPUs in the system at the same time. Writing 1 to it causes the idle state to
|
||||
be disabled for this particular CPU and writing 0 to it allows the governor to
|
||||
take it into consideration for the given CPU and the driver to ask for it,
|
||||
unless that state was disabled globally in the driver (in which case it cannot
|
||||
be used at all).
|
||||
|
||||
The :file:`power` attribute is not defined very well, especially for idle state
|
||||
objects representing combinations of idle states at different levels of the
|
||||
hierarchy of units in the processor, and it generally is hard to obtain idle
|
||||
state power numbers for complex hardware, so :file:`power` often contains 0 (not
|
||||
available) and if it contains a nonzero number, that number may not be very
|
||||
accurate and it should not be relied on for anything meaningful.
|
||||
|
||||
The number in the :file:`time` file generally may be greater than the total time
|
||||
really spent by the given CPU in the given idle state, because it is measured by
|
||||
the kernel and it may not cover the cases in which the hardware refused to enter
|
||||
this idle state and entered a shallower one instead of it (or even it did not
|
||||
enter any idle state at all). The kernel can only measure the time span between
|
||||
asking the hardware to enter an idle state and the subsequent wakeup of the CPU
|
||||
and it cannot say what really happened in the meantime at the hardware level.
|
||||
Moreover, if the idle state object in question represents a combination of idle
|
||||
states at different levels of the hierarchy of units in the processor,
|
||||
the kernel can never say how deep the hardware went down the hierarchy in any
|
||||
particular case. For these reasons, the only reliable way to find out how
|
||||
much time has been spent by the hardware in different idle states supported by
|
||||
it is to use idle state residency counters in the hardware, if available.
|
||||
|
||||
Generally, an interrupt received when trying to enter an idle state causes the
|
||||
idle state entry request to be rejected, in which case the ``CPUIdle`` driver
|
||||
may return an error code to indicate that this was the case. The :file:`usage`
|
||||
and :file:`rejected` files report the number of times the given idle state
|
||||
was entered successfully or rejected, respectively.
|
||||
|
||||
.. _cpu-pm-qos:
|
||||
|
||||
Power Management Quality of Service for CPUs
|
||||
============================================
|
||||
|
||||
The power management quality of service (PM QoS) framework in the Linux kernel
|
||||
allows kernel code and user space processes to set constraints on various
|
||||
energy-efficiency features of the kernel to prevent performance from dropping
|
||||
below a required level.
|
||||
|
||||
CPU idle time management can be affected by PM QoS in two ways, through the
|
||||
global CPU latency limit and through the resume latency constraints for
|
||||
individual CPUs. Kernel code (e.g. device drivers) can set both of them with
|
||||
the help of special internal interfaces provided by the PM QoS framework. User
|
||||
space can modify the former by opening the :file:`cpu_dma_latency` special
|
||||
device file under :file:`/dev/` and writing a binary value (interpreted as a
|
||||
signed 32-bit integer) to it. In turn, the resume latency constraint for a CPU
|
||||
can be modified from user space by writing a string (representing a signed
|
||||
32-bit integer) to the :file:`power/pm_qos_resume_latency_us` file under
|
||||
:file:`/sys/devices/system/cpu/cpu<N>/` in ``sysfs``, where the CPU number
|
||||
``<N>`` is allocated at the system initialization time. Negative values
|
||||
will be rejected in both cases and, also in both cases, the written integer
|
||||
number will be interpreted as a requested PM QoS constraint in microseconds.
|
||||
|
||||
The requested value is not automatically applied as a new constraint, however,
|
||||
as it may be less restrictive (greater in this particular case) than another
|
||||
constraint previously requested by someone else. For this reason, the PM QoS
|
||||
framework maintains a list of requests that have been made so far for the
|
||||
global CPU latency limit and for each individual CPU, aggregates them and
|
||||
applies the effective (minimum in this particular case) value as the new
|
||||
constraint.
|
||||
|
||||
In fact, opening the :file:`cpu_dma_latency` special device file causes a new
|
||||
PM QoS request to be created and added to a global priority list of CPU latency
|
||||
limit requests and the file descriptor coming from the "open" operation
|
||||
represents that request. If that file descriptor is then used for writing, the
|
||||
number written to it will be associated with the PM QoS request represented by
|
||||
it as a new requested limit value. Next, the priority list mechanism will be
|
||||
used to determine the new effective value of the entire list of requests and
|
||||
that effective value will be set as a new CPU latency limit. Thus requesting a
|
||||
new limit value will only change the real limit if the effective "list" value is
|
||||
affected by it, which is the case if it is the minimum of the requested values
|
||||
in the list.
|
||||
|
||||
The process holding a file descriptor obtained by opening the
|
||||
:file:`cpu_dma_latency` special device file controls the PM QoS request
|
||||
associated with that file descriptor, but it controls this particular PM QoS
|
||||
request only.
|
||||
|
||||
Closing the :file:`cpu_dma_latency` special device file or, more precisely, the
|
||||
file descriptor obtained while opening it, causes the PM QoS request associated
|
||||
with that file descriptor to be removed from the global priority list of CPU
|
||||
latency limit requests and destroyed. If that happens, the priority list
|
||||
mechanism will be used again, to determine the new effective value for the whole
|
||||
list and that value will become the new limit.
|
||||
|
||||
In turn, for each CPU there is one resume latency PM QoS request associated with
|
||||
the :file:`power/pm_qos_resume_latency_us` file under
|
||||
:file:`/sys/devices/system/cpu/cpu<N>/` in ``sysfs`` and writing to it causes
|
||||
this single PM QoS request to be updated regardless of which user space
|
||||
process does that. In other words, this PM QoS request is shared by the entire
|
||||
user space, so access to the file associated with it needs to be arbitrated
|
||||
to avoid confusion. [Arguably, the only legitimate use of this mechanism in
|
||||
practice is to pin a process to the CPU in question and let it use the
|
||||
``sysfs`` interface to control the resume latency constraint for it.] It is
|
||||
still only a request, however. It is an entry in a priority list used to
|
||||
determine the effective value to be set as the resume latency constraint for the
|
||||
CPU in question every time the list of requests is updated this way or another
|
||||
(there may be other requests coming from kernel code in that list).
|
||||
|
||||
CPU idle time governors are expected to regard the minimum of the global
|
||||
(effective) CPU latency limit and the effective resume latency constraint for
|
||||
the given CPU as the upper limit for the exit latency of the idle states that
|
||||
they are allowed to select for that CPU. They should never select any idle
|
||||
states with exit latency beyond that limit.
|
||||
|
||||
|
||||
Idle States Control Via Kernel Command Line
|
||||
===========================================
|
||||
|
||||
In addition to the ``sysfs`` interface allowing individual idle states to be
|
||||
`disabled for individual CPUs <idle-states-representation_>`_, there are kernel
|
||||
command line parameters affecting CPU idle time management.
|
||||
|
||||
The ``cpuidle.off=1`` kernel command line option can be used to disable the
|
||||
CPU idle time management entirely. It does not prevent the idle loop from
|
||||
running on idle CPUs, but it prevents the CPU idle time governors and drivers
|
||||
from being invoked. If it is added to the kernel command line, the idle loop
|
||||
will ask the hardware to enter idle states on idle CPUs via the CPU architecture
|
||||
support code that is expected to provide a default mechanism for this purpose.
|
||||
That default mechanism usually is the least common denominator for all of the
|
||||
processors implementing the architecture (i.e. CPU instruction set) in question,
|
||||
however, so it is rather crude and not very energy-efficient. For this reason,
|
||||
it is not recommended for production use.
|
||||
|
||||
The ``cpuidle.governor=`` kernel command line switch allows the ``CPUIdle``
|
||||
governor to use to be specified. It has to be appended with a string matching
|
||||
the name of an available governor (e.g. ``cpuidle.governor=menu``) and that
|
||||
governor will be used instead of the default one. It is possible to force
|
||||
the ``menu`` governor to be used on the systems that use the ``ladder`` governor
|
||||
by default this way, for example.
|
||||
|
||||
The other kernel command line parameters controlling CPU idle time management
|
||||
described below are only relevant for the *x86* architecture and references
|
||||
to ``intel_idle`` affect Intel processors only.
|
||||
|
||||
The *x86* architecture support code recognizes three kernel command line
|
||||
options related to CPU idle time management: ``idle=poll``, ``idle=halt``,
|
||||
and ``idle=nomwait``. The first two of them disable the ``acpi_idle`` and
|
||||
``intel_idle`` drivers altogether, which effectively causes the entire
|
||||
``CPUIdle`` subsystem to be disabled and makes the idle loop invoke the
|
||||
architecture support code to deal with idle CPUs. How it does that depends on
|
||||
which of the two parameters is added to the kernel command line. In the
|
||||
``idle=halt`` case, the architecture support code will use the ``HLT``
|
||||
instruction of the CPUs (which, as a rule, suspends the execution of the program
|
||||
and causes the hardware to attempt to enter the shallowest available idle state)
|
||||
for this purpose, and if ``idle=poll`` is used, idle CPUs will execute a
|
||||
more or less "lightweight" sequence of instructions in a tight loop. [Note
|
||||
that using ``idle=poll`` is somewhat drastic in many cases, as preventing idle
|
||||
CPUs from saving almost any energy at all may not be the only effect of it.
|
||||
For example, on Intel hardware it effectively prevents CPUs from using
|
||||
P-states (see |cpufreq|) that require any number of CPUs in a package to be
|
||||
idle, so it very well may hurt single-thread computations performance as well as
|
||||
energy-efficiency. Thus using it for performance reasons may not be a good idea
|
||||
at all.]
|
||||
|
||||
The ``idle=nomwait`` option prevents the use of ``MWAIT`` instruction of
|
||||
the CPU to enter idle states. When this option is used, the ``acpi_idle``
|
||||
driver will use the ``HLT`` instruction instead of ``MWAIT``. On systems
|
||||
running Intel processors, this option disables the ``intel_idle`` driver
|
||||
and forces the use of the ``acpi_idle`` driver instead. Note that in either
|
||||
case, ``acpi_idle`` driver will function only if all the information needed
|
||||
by it is in the system's ACPI tables.
|
||||
|
||||
In addition to the architecture-level kernel command line options affecting CPU
|
||||
idle time management, there are parameters affecting individual ``CPUIdle``
|
||||
drivers that can be passed to them via the kernel command line. Specifically,
|
||||
the ``intel_idle.max_cstate=<n>`` and ``processor.max_cstate=<n>`` parameters,
|
||||
where ``<n>`` is an idle state index also used in the name of the given
|
||||
state's directory in ``sysfs`` (see
|
||||
`Representation of Idle States <idle-states-representation_>`_), causes the
|
||||
``intel_idle`` and ``acpi_idle`` drivers, respectively, to discard all of the
|
||||
idle states deeper than idle state ``<n>``. In that case, they will never ask
|
||||
for any of those idle states or expose them to the governor. [The behavior of
|
||||
the two drivers is different for ``<n>`` equal to ``0``. Adding
|
||||
``intel_idle.max_cstate=0`` to the kernel command line disables the
|
||||
``intel_idle`` driver and allows ``acpi_idle`` to be used, whereas
|
||||
``processor.max_cstate=0`` is equivalent to ``processor.max_cstate=1``.
|
||||
Also, the ``acpi_idle`` driver is part of the ``processor`` kernel module that
|
||||
can be loaded separately and ``max_cstate=<n>`` can be passed to it as a module
|
||||
parameter when it is loaded.]
|
12
Documentation/admin-guide/pm/index.rst
Normal file
12
Documentation/admin-guide/pm/index.rst
Normal file
|
@ -0,0 +1,12 @@
|
|||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
================
|
||||
Power Management
|
||||
================
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 2
|
||||
|
||||
strategies
|
||||
system-wide
|
||||
working-state
|
939
Documentation/admin-guide/pm/intel-speed-select.rst
Normal file
939
Documentation/admin-guide/pm/intel-speed-select.rst
Normal file
|
@ -0,0 +1,939 @@
|
|||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
============================================================
|
||||
Intel(R) Speed Select Technology User Guide
|
||||
============================================================
|
||||
|
||||
The Intel(R) Speed Select Technology (Intel(R) SST) provides a powerful new
|
||||
collection of features that give more granular control over CPU performance.
|
||||
With Intel(R) SST, one server can be configured for power and performance for a
|
||||
variety of diverse workload requirements.
|
||||
|
||||
Refer to the links below for an overview of the technology:
|
||||
|
||||
- https://www.intel.com/content/www/us/en/architecture-and-technology/speed-select-technology-article.html
|
||||
- https://builders.intel.com/docs/networkbuilders/intel-speed-select-technology-base-frequency-enhancing-performance.pdf
|
||||
|
||||
These capabilities are further enhanced in some of the newer generations of
|
||||
server platforms where these features can be enumerated and controlled
|
||||
dynamically without pre-configuring via BIOS setup options. This dynamic
|
||||
configuration is done via mailbox commands to the hardware. One way to enumerate
|
||||
and configure these features is by using the Intel Speed Select utility.
|
||||
|
||||
This document explains how to use the Intel Speed Select tool to enumerate and
|
||||
control Intel(R) SST features. This document gives example commands and explains
|
||||
how these commands change the power and performance profile of the system under
|
||||
test. Using this tool as an example, customers can replicate the messaging
|
||||
implemented in the tool in their production software.
|
||||
|
||||
intel-speed-select configuration tool
|
||||
======================================
|
||||
|
||||
Most Linux distribution packages may include the "intel-speed-select" tool. If not,
|
||||
it can be built by downloading the Linux kernel tree from kernel.org. Once
|
||||
downloaded, the tool can be built without building the full kernel.
|
||||
|
||||
From the kernel tree, run the following commands::
|
||||
|
||||
# cd tools/power/x86/intel-speed-select/
|
||||
# make
|
||||
# make install
|
||||
|
||||
Getting Help
|
||||
------------
|
||||
|
||||
To get help with the tool, execute the command below::
|
||||
|
||||
# intel-speed-select --help
|
||||
|
||||
The top-level help describes arguments and features. Notice that there is a
|
||||
multi-level help structure in the tool. For example, to get help for the feature "perf-profile"::
|
||||
|
||||
# intel-speed-select perf-profile --help
|
||||
|
||||
To get help on a command, another level of help is provided. For example for the command info "info"::
|
||||
|
||||
# intel-speed-select perf-profile info --help
|
||||
|
||||
Summary of platform capability
|
||||
------------------------------
|
||||
To check the current platform and driver capabilities, execute::
|
||||
|
||||
#intel-speed-select --info
|
||||
|
||||
For example on a test system::
|
||||
|
||||
# intel-speed-select --info
|
||||
Intel(R) Speed Select Technology
|
||||
Executing on CPU model: X
|
||||
Platform: API version : 1
|
||||
Platform: Driver version : 1
|
||||
Platform: mbox supported : 1
|
||||
Platform: mmio supported : 1
|
||||
Intel(R) SST-PP (feature perf-profile) is supported
|
||||
TDP level change control is unlocked, max level: 4
|
||||
Intel(R) SST-TF (feature turbo-freq) is supported
|
||||
Intel(R) SST-BF (feature base-freq) is not supported
|
||||
Intel(R) SST-CP (feature core-power) is supported
|
||||
|
||||
Intel(R) Speed Select Technology - Performance Profile (Intel(R) SST-PP)
|
||||
------------------------------------------------------------------------
|
||||
|
||||
This feature allows configuration of a server dynamically based on workload
|
||||
performance requirements. This helps users during deployment as they do not have
|
||||
to choose a specific server configuration statically. This Intel(R) Speed Select
|
||||
Technology - Performance Profile (Intel(R) SST-PP) feature introduces a mechanism
|
||||
that allows multiple optimized performance profiles per system. Each profile
|
||||
defines a set of CPUs that need to be online and rest offline to sustain a
|
||||
guaranteed base frequency. Once the user issues a command to use a specific
|
||||
performance profile and meet CPU online/offline requirement, the user can expect
|
||||
a change in the base frequency dynamically. This feature is called
|
||||
"perf-profile" when using the Intel Speed Select tool.
|
||||
|
||||
Number or performance levels
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
There can be multiple performance profiles on a system. To get the number of
|
||||
profiles, execute the command below::
|
||||
|
||||
# intel-speed-select perf-profile get-config-levels
|
||||
Intel(R) Speed Select Technology
|
||||
Executing on CPU model: X
|
||||
package-0
|
||||
die-0
|
||||
cpu-0
|
||||
get-config-levels:4
|
||||
package-1
|
||||
die-0
|
||||
cpu-14
|
||||
get-config-levels:4
|
||||
|
||||
On this system under test, there are 4 performance profiles in addition to the
|
||||
base performance profile (which is performance level 0).
|
||||
|
||||
Lock/Unlock status
|
||||
~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Even if there are multiple performance profiles, it is possible that they
|
||||
are locked. If they are locked, users cannot issue a command to change the
|
||||
performance state. It is possible that there is a BIOS setup to unlock or check
|
||||
with your system vendor.
|
||||
|
||||
To check if the system is locked, execute the following command::
|
||||
|
||||
# intel-speed-select perf-profile get-lock-status
|
||||
Intel(R) Speed Select Technology
|
||||
Executing on CPU model: X
|
||||
package-0
|
||||
die-0
|
||||
cpu-0
|
||||
get-lock-status:0
|
||||
package-1
|
||||
die-0
|
||||
cpu-14
|
||||
get-lock-status:0
|
||||
|
||||
In this case, lock status is 0, which means that the system is unlocked.
|
||||
|
||||
Properties of a performance level
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
To get properties of a specific performance level (For example for the level 0, below), execute the command below::
|
||||
|
||||
# intel-speed-select perf-profile info -l 0
|
||||
Intel(R) Speed Select Technology
|
||||
Executing on CPU model: X
|
||||
package-0
|
||||
die-0
|
||||
cpu-0
|
||||
perf-profile-level-0
|
||||
cpu-count:28
|
||||
enable-cpu-mask:000003ff,f0003fff
|
||||
enable-cpu-list:0,1,2,3,4,5,6,7,8,9,10,11,12,13,28,29,30,31,32,33,34,35,36,37,38,39,40,41
|
||||
thermal-design-power-ratio:26
|
||||
base-frequency(MHz):2600
|
||||
speed-select-turbo-freq:disabled
|
||||
speed-select-base-freq:disabled
|
||||
...
|
||||
...
|
||||
|
||||
Here -l option is used to specify a performance level.
|
||||
|
||||
If the option -l is omitted, then this command will print information about all
|
||||
the performance levels. The above command is printing properties of the
|
||||
performance level 0.
|
||||
|
||||
For this performance profile, the list of CPUs displayed by the
|
||||
"enable-cpu-mask/enable-cpu-list" at the max can be "online." When that
|
||||
condition is met, then base frequency of 2600 MHz can be maintained. To
|
||||
understand more, execute "intel-speed-select perf-profile info" for performance
|
||||
level 4::
|
||||
|
||||
# intel-speed-select perf-profile info -l 4
|
||||
Intel(R) Speed Select Technology
|
||||
Executing on CPU model: X
|
||||
package-0
|
||||
die-0
|
||||
cpu-0
|
||||
perf-profile-level-4
|
||||
cpu-count:28
|
||||
enable-cpu-mask:000000fa,f0000faf
|
||||
enable-cpu-list:0,1,2,3,5,7,8,9,10,11,28,29,30,31,33,35,36,37,38,39
|
||||
thermal-design-power-ratio:28
|
||||
base-frequency(MHz):2800
|
||||
speed-select-turbo-freq:disabled
|
||||
speed-select-base-freq:unsupported
|
||||
...
|
||||
...
|
||||
|
||||
There are fewer CPUs in the "enable-cpu-mask/enable-cpu-list". Consequently, if
|
||||
the user only keeps these CPUs online and the rest "offline," then the base
|
||||
frequency is increased to 2.8 GHz compared to 2.6 GHz at performance level 0.
|
||||
|
||||
Get current performance level
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
To get the current performance level, execute::
|
||||
|
||||
# intel-speed-select perf-profile get-config-current-level
|
||||
Intel(R) Speed Select Technology
|
||||
Executing on CPU model: X
|
||||
package-0
|
||||
die-0
|
||||
cpu-0
|
||||
get-config-current_level:0
|
||||
|
||||
First verify that the base_frequency displayed by the cpufreq sysfs is correct::
|
||||
|
||||
# cat /sys/devices/system/cpu/cpu0/cpufreq/base_frequency
|
||||
2600000
|
||||
|
||||
This matches the base-frequency (MHz) field value displayed from the
|
||||
"perf-profile info" command for performance level 0(cpufreq frequency is in
|
||||
KHz).
|
||||
|
||||
To check if the average frequency is equal to the base frequency for a 100% busy
|
||||
workload, disable turbo::
|
||||
|
||||
# echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
|
||||
|
||||
Then runs a busy workload on all CPUs, for example::
|
||||
|
||||
#stress -c 64
|
||||
|
||||
To verify the base frequency, run turbostat::
|
||||
|
||||
#turbostat -c 0-13 --show Package,Core,CPU,Bzy_MHz -i 1
|
||||
|
||||
Package Core CPU Bzy_MHz
|
||||
- - 2600
|
||||
0 0 0 2600
|
||||
0 1 1 2600
|
||||
0 2 2 2600
|
||||
0 3 3 2600
|
||||
0 4 4 2600
|
||||
. . . .
|
||||
|
||||
|
||||
Changing performance level
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
To the change the performance level to 4, execute::
|
||||
|
||||
# intel-speed-select -d perf-profile set-config-level -l 4 -o
|
||||
Intel(R) Speed Select Technology
|
||||
Executing on CPU model: X
|
||||
package-0
|
||||
die-0
|
||||
cpu-0
|
||||
perf-profile
|
||||
set_tdp_level:success
|
||||
|
||||
In the command above, "-o" is optional. If it is specified, then it will also
|
||||
offline CPUs which are not present in the enable_cpu_mask for this performance
|
||||
level.
|
||||
|
||||
Now if the base_frequency is checked::
|
||||
|
||||
#cat /sys/devices/system/cpu/cpu0/cpufreq/base_frequency
|
||||
2800000
|
||||
|
||||
Which shows that the base frequency now increased from 2600 MHz at performance
|
||||
level 0 to 2800 MHz at performance level 4. As a result, any workload, which can
|
||||
use fewer CPUs, can see a boost of 200 MHz compared to performance level 0.
|
||||
|
||||
Changing performance level via BMC Interface
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
It is possible to change SST-PP level using out of band (OOB) agent (Via some
|
||||
remote management console, through BMC "Baseboard Management Controller"
|
||||
interface). This mode is supported from the Sapphire Rapids processor
|
||||
generation. The kernel and tool change to support this mode is added to Linux
|
||||
kernel version 5.18. To enable this feature, kernel config
|
||||
"CONFIG_INTEL_HFI_THERMAL" is required. The minimum version of the tool
|
||||
is "v1.12" to support this feature, which is part of Linux kernel version 5.18.
|
||||
|
||||
To support such configuration, this tool can be used as a daemon. Add
|
||||
a command line option --oob::
|
||||
|
||||
# intel-speed-select --oob
|
||||
Intel(R) Speed Select Technology
|
||||
Executing on CPU model:143[0x8f]
|
||||
OOB mode is enabled and will run as daemon
|
||||
|
||||
In this mode the tool will online/offline CPUs based on the new performance
|
||||
level.
|
||||
|
||||
Check presence of other Intel(R) SST features
|
||||
---------------------------------------------
|
||||
|
||||
Each of the performance profiles also specifies weather there is support of
|
||||
other two Intel(R) SST features (Intel(R) Speed Select Technology - Base Frequency
|
||||
(Intel(R) SST-BF) and Intel(R) Speed Select Technology - Turbo Frequency (Intel
|
||||
SST-TF)).
|
||||
|
||||
For example, from the output of "perf-profile info" above, for level 0 and level
|
||||
4:
|
||||
|
||||
For level 0::
|
||||
speed-select-turbo-freq:disabled
|
||||
speed-select-base-freq:disabled
|
||||
|
||||
For level 4::
|
||||
speed-select-turbo-freq:disabled
|
||||
speed-select-base-freq:unsupported
|
||||
|
||||
Given these results, the "speed-select-base-freq" (Intel(R) SST-BF) in level 4
|
||||
changed from "disabled" to "unsupported" compared to performance level 0.
|
||||
|
||||
This means that at performance level 4, the "speed-select-base-freq" feature is
|
||||
not supported. However, at performance level 0, this feature is "supported", but
|
||||
currently "disabled", meaning the user has not activated this feature. Whereas
|
||||
"speed-select-turbo-freq" (Intel(R) SST-TF) is supported at both performance
|
||||
levels, but currently not activated by the user.
|
||||
|
||||
The Intel(R) SST-BF and the Intel(R) SST-TF features are built on a foundation
|
||||
technology called Intel(R) Speed Select Technology - Core Power (Intel(R) SST-CP).
|
||||
The platform firmware enables this feature when Intel(R) SST-BF or Intel(R) SST-TF
|
||||
is supported on a platform.
|
||||
|
||||
Intel(R) Speed Select Technology Core Power (Intel(R) SST-CP)
|
||||
---------------------------------------------------------------
|
||||
|
||||
Intel(R) Speed Select Technology Core Power (Intel(R) SST-CP) is an interface that
|
||||
allows users to define per core priority. This defines a mechanism to distribute
|
||||
power among cores when there is a power constrained scenario. This defines a
|
||||
class of service (CLOS) configuration.
|
||||
|
||||
The user can configure up to 4 class of service configurations. Each CLOS group
|
||||
configuration allows definitions of parameters, which affects how the frequency
|
||||
can be limited and power is distributed. Each CPU core can be tied to a class of
|
||||
service and hence an associated priority. The granularity is at core level not
|
||||
at per CPU level.
|
||||
|
||||
Enable CLOS based prioritization
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
To use CLOS based prioritization feature, firmware must be informed to enable
|
||||
and use a priority type. There is a default per platform priority type, which
|
||||
can be changed with optional command line parameter.
|
||||
|
||||
To enable and check the options, execute::
|
||||
|
||||
# intel-speed-select core-power enable --help
|
||||
Intel(R) Speed Select Technology
|
||||
Executing on CPU model: X
|
||||
Enable core-power for a package/die
|
||||
Clos Enable: Specify priority type with [--priority|-p]
|
||||
0: Proportional, 1: Ordered
|
||||
|
||||
There are two types of priority types:
|
||||
|
||||
- Ordered
|
||||
|
||||
Priority for ordered throttling is defined based on the index of the assigned
|
||||
CLOS group. Where CLOS0 gets highest priority (throttled last).
|
||||
|
||||
Priority order is:
|
||||
CLOS0 > CLOS1 > CLOS2 > CLOS3.
|
||||
|
||||
- Proportional
|
||||
|
||||
When proportional priority is used, there is an additional parameter called
|
||||
frequency_weight, which can be specified per CLOS group. The goal of
|
||||
proportional priority is to provide each core with the requested min., then
|
||||
distribute all remaining (excess/deficit) budgets in proportion to a defined
|
||||
weight. This proportional priority can be configured using "core-power config"
|
||||
command.
|
||||
|
||||
To enable with the platform default priority type, execute::
|
||||
|
||||
# intel-speed-select core-power enable
|
||||
Intel(R) Speed Select Technology
|
||||
Executing on CPU model: X
|
||||
package-0
|
||||
die-0
|
||||
cpu-0
|
||||
core-power
|
||||
enable:success
|
||||
package-1
|
||||
die-0
|
||||
cpu-6
|
||||
core-power
|
||||
enable:success
|
||||
|
||||
The scope of this enable is per package or die scoped when a package contains
|
||||
multiple dies. To check if CLOS is enabled and get priority type, "core-power
|
||||
info" command can be used. For example to check the status of core-power feature
|
||||
on CPU 0, execute::
|
||||
|
||||
# intel-speed-select -c 0 core-power info
|
||||
Intel(R) Speed Select Technology
|
||||
Executing on CPU model: X
|
||||
package-0
|
||||
die-0
|
||||
cpu-0
|
||||
core-power
|
||||
support-status:supported
|
||||
enable-status:enabled
|
||||
clos-enable-status:enabled
|
||||
priority-type:proportional
|
||||
package-1
|
||||
die-0
|
||||
cpu-24
|
||||
core-power
|
||||
support-status:supported
|
||||
enable-status:enabled
|
||||
clos-enable-status:enabled
|
||||
priority-type:proportional
|
||||
|
||||
Configuring CLOS groups
|
||||
~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Each CLOS group has its own attributes including min, max, freq_weight and
|
||||
desired. These parameters can be configured with "core-power config" command.
|
||||
Defaults will be used if user skips setting a parameter except clos id, which is
|
||||
mandatory. To check core-power config options, execute::
|
||||
|
||||
# intel-speed-select core-power config --help
|
||||
Intel(R) Speed Select Technology
|
||||
Executing on CPU model: X
|
||||
Set core-power configuration for one of the four clos ids
|
||||
Specify targeted clos id with [--clos|-c]
|
||||
Specify clos Proportional Priority [--weight|-w]
|
||||
Specify clos min in MHz with [--min|-n]
|
||||
Specify clos max in MHz with [--max|-m]
|
||||
|
||||
For example::
|
||||
|
||||
# intel-speed-select core-power config -c 0
|
||||
Intel(R) Speed Select Technology
|
||||
Executing on CPU model: X
|
||||
clos epp is not specified, default: 0
|
||||
clos frequency weight is not specified, default: 0
|
||||
clos min is not specified, default: 0 MHz
|
||||
clos max is not specified, default: 25500 MHz
|
||||
clos desired is not specified, default: 0
|
||||
package-0
|
||||
die-0
|
||||
cpu-0
|
||||
core-power
|
||||
config:success
|
||||
package-1
|
||||
die-0
|
||||
cpu-6
|
||||
core-power
|
||||
config:success
|
||||
|
||||
The user has the option to change defaults. For example, the user can change the
|
||||
"min" and set the base frequency to always get guaranteed base frequency.
|
||||
|
||||
Get the current CLOS configuration
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
To check the current configuration, "core-power get-config" can be used. For
|
||||
example, to get the configuration of CLOS 0::
|
||||
|
||||
# intel-speed-select core-power get-config -c 0
|
||||
Intel(R) Speed Select Technology
|
||||
Executing on CPU model: X
|
||||
package-0
|
||||
die-0
|
||||
cpu-0
|
||||
core-power
|
||||
clos:0
|
||||
epp:0
|
||||
clos-proportional-priority:0
|
||||
clos-min:0 MHz
|
||||
clos-max:Max Turbo frequency
|
||||
clos-desired:0 MHz
|
||||
package-1
|
||||
die-0
|
||||
cpu-24
|
||||
core-power
|
||||
clos:0
|
||||
epp:0
|
||||
clos-proportional-priority:0
|
||||
clos-min:0 MHz
|
||||
clos-max:Max Turbo frequency
|
||||
clos-desired:0 MHz
|
||||
|
||||
Associating a CPU with a CLOS group
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
To associate a CPU to a CLOS group "core-power assoc" command can be used::
|
||||
|
||||
# intel-speed-select core-power assoc --help
|
||||
Intel(R) Speed Select Technology
|
||||
Executing on CPU model: X
|
||||
Associate a clos id to a CPU
|
||||
Specify targeted clos id with [--clos|-c]
|
||||
|
||||
|
||||
For example to associate CPU 10 to CLOS group 3, execute::
|
||||
|
||||
# intel-speed-select -c 10 core-power assoc -c 3
|
||||
Intel(R) Speed Select Technology
|
||||
Executing on CPU model: X
|
||||
package-0
|
||||
die-0
|
||||
cpu-10
|
||||
core-power
|
||||
assoc:success
|
||||
|
||||
Once a CPU is associated, its sibling CPUs are also associated to a CLOS group.
|
||||
Once associated, avoid changing Linux "cpufreq" subsystem scaling frequency
|
||||
limits.
|
||||
|
||||
To check the existing association for a CPU, "core-power get-assoc" command can
|
||||
be used. For example, to get association of CPU 10, execute::
|
||||
|
||||
# intel-speed-select -c 10 core-power get-assoc
|
||||
Intel(R) Speed Select Technology
|
||||
Executing on CPU model: X
|
||||
package-1
|
||||
die-0
|
||||
cpu-10
|
||||
get-assoc
|
||||
clos:3
|
||||
|
||||
This shows that CPU 10 is part of a CLOS group 3.
|
||||
|
||||
|
||||
Disable CLOS based prioritization
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
To disable, execute::
|
||||
|
||||
# intel-speed-select core-power disable
|
||||
|
||||
Some features like Intel(R) SST-TF can only be enabled when CLOS based prioritization
|
||||
is enabled. For this reason, disabling while Intel(R) SST-TF is enabled can cause
|
||||
Intel(R) SST-TF to fail. This will cause the "disable" command to display an error
|
||||
if Intel(R) SST-TF is already enabled. In turn, to disable, the Intel(R) SST-TF
|
||||
feature must be disabled first.
|
||||
|
||||
Intel(R) Speed Select Technology - Base Frequency (Intel(R) SST-BF)
|
||||
-------------------------------------------------------------------
|
||||
|
||||
The Intel(R) Speed Select Technology - Base Frequency (Intel(R) SST-BF) feature lets
|
||||
the user control base frequency. If some critical workload threads demand
|
||||
constant high guaranteed performance, then this feature can be used to execute
|
||||
the thread at higher base frequency on specific sets of CPUs (high priority
|
||||
CPUs) at the cost of lower base frequency (low priority CPUs) on other CPUs.
|
||||
This feature does not require offline of the low priority CPUs.
|
||||
|
||||
The support of Intel(R) SST-BF depends on the Intel(R) Speed Select Technology -
|
||||
Performance Profile (Intel(R) SST-PP) performance level configuration. It is
|
||||
possible that only certain performance levels support Intel(R) SST-BF. It is also
|
||||
possible that only base performance level (level = 0) has support of Intel
|
||||
SST-BF. Consequently, first select the desired performance level to enable this
|
||||
feature.
|
||||
|
||||
In the system under test here, Intel(R) SST-BF is supported at the base
|
||||
performance level 0, but currently disabled. For example for the level 0::
|
||||
|
||||
# intel-speed-select -c 0 perf-profile info -l 0
|
||||
Intel(R) Speed Select Technology
|
||||
Executing on CPU model: X
|
||||
package-0
|
||||
die-0
|
||||
cpu-0
|
||||
perf-profile-level-0
|
||||
...
|
||||
|
||||
speed-select-base-freq:disabled
|
||||
...
|
||||
|
||||
Before enabling Intel(R) SST-BF and measuring its impact on a workload
|
||||
performance, execute some workload and measure performance and get a baseline
|
||||
performance to compare against.
|
||||
|
||||
Here the user wants more guaranteed performance. For this reason, it is likely
|
||||
that turbo is disabled. To disable turbo, execute::
|
||||
|
||||
#echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
|
||||
|
||||
Based on the output of the "intel-speed-select perf-profile info -l 0" base
|
||||
frequency of guaranteed frequency 2600 MHz.
|
||||
|
||||
|
||||
Measure baseline performance for comparison
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
To compare, pick a multi-threaded workload where each thread can be scheduled on
|
||||
separate CPUs. "Hackbench pipe" test is a good example on how to improve
|
||||
performance using Intel(R) SST-BF.
|
||||
|
||||
Below, the workload is measuring average scheduler wakeup latency, so a lower
|
||||
number means better performance::
|
||||
|
||||
# taskset -c 3,4 perf bench -r 100 sched pipe
|
||||
# Running 'sched/pipe' benchmark:
|
||||
# Executed 1000000 pipe operations between two processes
|
||||
Total time: 6.102 [sec]
|
||||
6.102445 usecs/op
|
||||
163868 ops/sec
|
||||
|
||||
While running the above test, if we take turbostat output, it will show us that
|
||||
2 of the CPUs are busy and reaching max. frequency (which would be the base
|
||||
frequency as the turbo is disabled). The turbostat output::
|
||||
|
||||
#turbostat -c 0-13 --show Package,Core,CPU,Bzy_MHz -i 1
|
||||
Package Core CPU Bzy_MHz
|
||||
0 0 0 1000
|
||||
0 1 1 1005
|
||||
0 2 2 1000
|
||||
0 3 3 2600
|
||||
0 4 4 2600
|
||||
0 5 5 1000
|
||||
0 6 6 1000
|
||||
0 7 7 1005
|
||||
0 8 8 1005
|
||||
0 9 9 1000
|
||||
0 10 10 1000
|
||||
0 11 11 995
|
||||
0 12 12 1000
|
||||
0 13 13 1000
|
||||
|
||||
From the above turbostat output, both CPU 3 and 4 are very busy and reaching
|
||||
full guaranteed frequency of 2600 MHz.
|
||||
|
||||
Intel(R) SST-BF Capabilities
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
To get capabilities of Intel(R) SST-BF for the current performance level 0,
|
||||
execute::
|
||||
|
||||
# intel-speed-select base-freq info -l 0
|
||||
Intel(R) Speed Select Technology
|
||||
Executing on CPU model: X
|
||||
package-0
|
||||
die-0
|
||||
cpu-0
|
||||
speed-select-base-freq
|
||||
high-priority-base-frequency(MHz):3000
|
||||
high-priority-cpu-mask:00000216,00002160
|
||||
high-priority-cpu-list:5,6,8,13,33,34,36,41
|
||||
low-priority-base-frequency(MHz):2400
|
||||
tjunction-temperature(C):125
|
||||
thermal-design-power(W):205
|
||||
|
||||
The above capabilities show that there are some CPUs on this system that can
|
||||
offer base frequency of 3000 MHz compared to the standard base frequency at this
|
||||
performance levels. Nevertheless, these CPUs are fixed, and they are presented
|
||||
via high-priority-cpu-list/high-priority-cpu-mask. But if this Intel(R) SST-BF
|
||||
feature is selected, the low priorities CPUs (which are not in
|
||||
high-priority-cpu-list) can only offer up to 2400 MHz. As a result, if this
|
||||
clipping of low priority CPUs is acceptable, then the user can enable Intel
|
||||
SST-BF feature particularly for the above "sched pipe" workload since only two
|
||||
CPUs are used, they can be scheduled on high priority CPUs and can get boost of
|
||||
400 MHz.
|
||||
|
||||
Enable Intel(R) SST-BF
|
||||
~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
To enable Intel(R) SST-BF feature, execute::
|
||||
|
||||
# intel-speed-select base-freq enable -a
|
||||
Intel(R) Speed Select Technology
|
||||
Executing on CPU model: X
|
||||
package-0
|
||||
die-0
|
||||
cpu-0
|
||||
base-freq
|
||||
enable:success
|
||||
package-1
|
||||
die-0
|
||||
cpu-14
|
||||
base-freq
|
||||
enable:success
|
||||
|
||||
In this case, -a option is optional. This not only enables Intel(R) SST-BF, but it
|
||||
also adjusts the priority of cores using Intel(R) Speed Select Technology Core
|
||||
Power (Intel(R) SST-CP) features. This option sets the minimum performance of each
|
||||
Intel(R) Speed Select Technology - Performance Profile (Intel(R) SST-PP) class to
|
||||
maximum performance so that the hardware will give maximum performance possible
|
||||
for each CPU.
|
||||
|
||||
If -a option is not used, then the following steps are required before enabling
|
||||
Intel(R) SST-BF:
|
||||
|
||||
- Discover Intel(R) SST-BF and note low and high priority base frequency
|
||||
- Note the high priority CPU list
|
||||
- Enable CLOS using core-power feature set
|
||||
- Configure CLOS parameters. Use CLOS.min to set to minimum performance
|
||||
- Subscribe desired CPUs to CLOS groups
|
||||
|
||||
With this configuration, if the same workload is executed by pinning the
|
||||
workload to high priority CPUs (CPU 5 and 6 in this case)::
|
||||
|
||||
#taskset -c 5,6 perf bench -r 100 sched pipe
|
||||
# Running 'sched/pipe' benchmark:
|
||||
# Executed 1000000 pipe operations between two processes
|
||||
Total time: 5.627 [sec]
|
||||
5.627922 usecs/op
|
||||
177685 ops/sec
|
||||
|
||||
This way, by enabling Intel(R) SST-BF, the performance of this benchmark is
|
||||
improved (latency reduced) by 7.79%. From the turbostat output, it can be
|
||||
observed that the high priority CPUs reached 3000 MHz compared to 2600 MHz.
|
||||
The turbostat output::
|
||||
|
||||
#turbostat -c 0-13 --show Package,Core,CPU,Bzy_MHz -i 1
|
||||
Package Core CPU Bzy_MHz
|
||||
0 0 0 2151
|
||||
0 1 1 2166
|
||||
0 2 2 2175
|
||||
0 3 3 2175
|
||||
0 4 4 2175
|
||||
0 5 5 3000
|
||||
0 6 6 3000
|
||||
0 7 7 2180
|
||||
0 8 8 2662
|
||||
0 9 9 2176
|
||||
0 10 10 2175
|
||||
0 11 11 2176
|
||||
0 12 12 2176
|
||||
0 13 13 2661
|
||||
|
||||
Disable Intel(R) SST-BF
|
||||
~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
To disable the Intel(R) SST-BF feature, execute::
|
||||
|
||||
# intel-speed-select base-freq disable -a
|
||||
|
||||
|
||||
Intel(R) Speed Select Technology - Turbo Frequency (Intel(R) SST-TF)
|
||||
--------------------------------------------------------------------
|
||||
|
||||
This feature enables the ability to set different "All core turbo ratio limits"
|
||||
to cores based on the priority. By using this feature, some cores can be
|
||||
configured to get higher turbo frequency by designating them as high priority at
|
||||
the cost of lower or no turbo frequency on the low priority cores.
|
||||
|
||||
For this reason, this feature is only useful when system is busy utilizing all
|
||||
CPUs, but the user wants some configurable option to get high performance on
|
||||
some CPUs.
|
||||
|
||||
The support of Intel(R) Speed Select Technology - Turbo Frequency (Intel(R) SST-TF)
|
||||
depends on the Intel(R) Speed Select Technology - Performance Profile (Intel
|
||||
SST-PP) performance level configuration. It is possible that only a certain
|
||||
performance level supports Intel(R) SST-TF. It is also possible that only the base
|
||||
performance level (level = 0) has the support of Intel(R) SST-TF. Hence, first
|
||||
select the desired performance level to enable this feature.
|
||||
|
||||
In the system under test here, Intel(R) SST-TF is supported at the base
|
||||
performance level 0, but currently disabled::
|
||||
|
||||
# intel-speed-select -c 0 perf-profile info -l 0
|
||||
Intel(R) Speed Select Technology
|
||||
package-0
|
||||
die-0
|
||||
cpu-0
|
||||
perf-profile-level-0
|
||||
...
|
||||
...
|
||||
speed-select-turbo-freq:disabled
|
||||
...
|
||||
...
|
||||
|
||||
|
||||
To check if performance can be improved using Intel(R) SST-TF feature, get the turbo
|
||||
frequency properties with Intel(R) SST-TF enabled and compare to the base turbo
|
||||
capability of this system.
|
||||
|
||||
Get Base turbo capability
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
To get the base turbo capability of performance level 0, execute::
|
||||
|
||||
# intel-speed-select perf-profile info -l 0
|
||||
Intel(R) Speed Select Technology
|
||||
Executing on CPU model: X
|
||||
package-0
|
||||
die-0
|
||||
cpu-0
|
||||
perf-profile-level-0
|
||||
...
|
||||
...
|
||||
turbo-ratio-limits-sse
|
||||
bucket-0
|
||||
core-count:2
|
||||
max-turbo-frequency(MHz):3200
|
||||
bucket-1
|
||||
core-count:4
|
||||
max-turbo-frequency(MHz):3100
|
||||
bucket-2
|
||||
core-count:6
|
||||
max-turbo-frequency(MHz):3100
|
||||
bucket-3
|
||||
core-count:8
|
||||
max-turbo-frequency(MHz):3100
|
||||
bucket-4
|
||||
core-count:10
|
||||
max-turbo-frequency(MHz):3100
|
||||
bucket-5
|
||||
core-count:12
|
||||
max-turbo-frequency(MHz):3100
|
||||
bucket-6
|
||||
core-count:14
|
||||
max-turbo-frequency(MHz):3100
|
||||
bucket-7
|
||||
core-count:16
|
||||
max-turbo-frequency(MHz):3100
|
||||
|
||||
Based on the data above, when all the CPUS are busy, the max. frequency of 3100
|
||||
MHz can be achieved. If there is some busy workload on cpu 0 - 11 (e.g. stress)
|
||||
and on CPU 12 and 13, execute "hackbench pipe" workload::
|
||||
|
||||
# taskset -c 12,13 perf bench -r 100 sched pipe
|
||||
# Running 'sched/pipe' benchmark:
|
||||
# Executed 1000000 pipe operations between two processes
|
||||
Total time: 5.705 [sec]
|
||||
5.705488 usecs/op
|
||||
175269 ops/sec
|
||||
|
||||
The turbostat output::
|
||||
|
||||
#turbostat -c 0-13 --show Package,Core,CPU,Bzy_MHz -i 1
|
||||
Package Core CPU Bzy_MHz
|
||||
0 0 0 3000
|
||||
0 1 1 3000
|
||||
0 2 2 3000
|
||||
0 3 3 3000
|
||||
0 4 4 3000
|
||||
0 5 5 3100
|
||||
0 6 6 3100
|
||||
0 7 7 3000
|
||||
0 8 8 3100
|
||||
0 9 9 3000
|
||||
0 10 10 3000
|
||||
0 11 11 3000
|
||||
0 12 12 3100
|
||||
0 13 13 3100
|
||||
|
||||
Based on turbostat output, the performance is limited by frequency cap of 3100
|
||||
MHz. To check if the hackbench performance can be improved for CPU 12 and CPU
|
||||
13, first check the capability of the Intel(R) SST-TF feature for this performance
|
||||
level.
|
||||
|
||||
Get Intel(R) SST-TF Capability
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
To get the capability, the "turbo-freq info" command can be used::
|
||||
|
||||
# intel-speed-select turbo-freq info -l 0
|
||||
Intel(R) Speed Select Technology
|
||||
Executing on CPU model: X
|
||||
package-0
|
||||
die-0
|
||||
cpu-0
|
||||
speed-select-turbo-freq
|
||||
bucket-0
|
||||
high-priority-cores-count:2
|
||||
high-priority-max-frequency(MHz):3200
|
||||
high-priority-max-avx2-frequency(MHz):3200
|
||||
high-priority-max-avx512-frequency(MHz):3100
|
||||
bucket-1
|
||||
high-priority-cores-count:4
|
||||
high-priority-max-frequency(MHz):3100
|
||||
high-priority-max-avx2-frequency(MHz):3000
|
||||
high-priority-max-avx512-frequency(MHz):2900
|
||||
bucket-2
|
||||
high-priority-cores-count:6
|
||||
high-priority-max-frequency(MHz):3100
|
||||
high-priority-max-avx2-frequency(MHz):3000
|
||||
high-priority-max-avx512-frequency(MHz):2900
|
||||
speed-select-turbo-freq-clip-frequencies
|
||||
low-priority-max-frequency(MHz):2600
|
||||
low-priority-max-avx2-frequency(MHz):2400
|
||||
low-priority-max-avx512-frequency(MHz):2100
|
||||
|
||||
Based on the output above, there is an Intel(R) SST-TF bucket for which there are
|
||||
two high priority cores. If only two high priority cores are set, then max.
|
||||
turbo frequency on those cores can be increased to 3200 MHz. This is 100 MHz
|
||||
more than the base turbo capability for all cores.
|
||||
|
||||
In turn, for the hackbench workload, two CPUs can be set as high priority and
|
||||
rest as low priority. One side effect is that once enabled, the low priority
|
||||
cores will be clipped to a lower frequency of 2600 MHz.
|
||||
|
||||
Enable Intel(R) SST-TF
|
||||
~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
To enable Intel(R) SST-TF, execute::
|
||||
|
||||
# intel-speed-select -c 12,13 turbo-freq enable -a
|
||||
Intel(R) Speed Select Technology
|
||||
Executing on CPU model: X
|
||||
package-0
|
||||
die-0
|
||||
cpu-12
|
||||
turbo-freq
|
||||
enable:success
|
||||
package-0
|
||||
die-0
|
||||
cpu-13
|
||||
turbo-freq
|
||||
enable:success
|
||||
package--1
|
||||
die-0
|
||||
cpu-63
|
||||
turbo-freq --auto
|
||||
enable:success
|
||||
|
||||
In this case, the option "-a" is optional. If set, it enables Intel(R) SST-TF
|
||||
feature and also sets the CPUs to high and low priority using Intel Speed
|
||||
Select Technology Core Power (Intel(R) SST-CP) features. The CPU numbers passed
|
||||
with "-c" arguments are marked as high priority, including its siblings.
|
||||
|
||||
If -a option is not used, then the following steps are required before enabling
|
||||
Intel(R) SST-TF:
|
||||
|
||||
- Discover Intel(R) SST-TF and note buckets of high priority cores and maximum frequency
|
||||
|
||||
- Enable CLOS using core-power feature set - Configure CLOS parameters
|
||||
|
||||
- Subscribe desired CPUs to CLOS groups making sure that high priority cores are set to the maximum frequency
|
||||
|
||||
If the same hackbench workload is executed, schedule hackbench threads on high
|
||||
priority CPUs::
|
||||
|
||||
#taskset -c 12,13 perf bench -r 100 sched pipe
|
||||
# Running 'sched/pipe' benchmark:
|
||||
# Executed 1000000 pipe operations between two processes
|
||||
Total time: 5.510 [sec]
|
||||
5.510165 usecs/op
|
||||
180826 ops/sec
|
||||
|
||||
This improved performance by around 3.3% improvement on a busy system. Here the
|
||||
turbostat output will show that the CPU 12 and CPU 13 are getting 100 MHz boost.
|
||||
The turbostat output::
|
||||
|
||||
#turbostat -c 0-13 --show Package,Core,CPU,Bzy_MHz -i 1
|
||||
Package Core CPU Bzy_MHz
|
||||
...
|
||||
0 12 12 3200
|
||||
0 13 13 3200
|
41
Documentation/admin-guide/pm/intel_epb.rst
Normal file
41
Documentation/admin-guide/pm/intel_epb.rst
Normal file
|
@ -0,0 +1,41 @@
|
|||
.. SPDX-License-Identifier: GPL-2.0
|
||||
.. include:: <isonum.txt>
|
||||
|
||||
======================================
|
||||
Intel Performance and Energy Bias Hint
|
||||
======================================
|
||||
|
||||
:Copyright: |copy| 2019 Intel Corporation
|
||||
|
||||
:Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
||||
|
||||
|
||||
.. kernel-doc:: arch/x86/kernel/cpu/intel_epb.c
|
||||
:doc: overview
|
||||
|
||||
Intel Performance and Energy Bias Attribute in ``sysfs``
|
||||
========================================================
|
||||
|
||||
The Intel Performance and Energy Bias Hint (EPB) value for a given (logical) CPU
|
||||
can be checked or updated through a ``sysfs`` attribute (file) under
|
||||
:file:`/sys/devices/system/cpu/cpu<N>/power/`, where the CPU number ``<N>``
|
||||
is allocated at the system initialization time:
|
||||
|
||||
``energy_perf_bias``
|
||||
Shows the current EPB value for the CPU in a sliding scale 0 - 15, where
|
||||
a value of 0 corresponds to a hint preference for highest performance
|
||||
and a value of 15 corresponds to the maximum energy savings.
|
||||
|
||||
In order to update the EPB value for the CPU, this attribute can be
|
||||
written to, either with a number in the 0 - 15 sliding scale above, or
|
||||
with one of the strings: "performance", "balance-performance", "normal",
|
||||
"balance-power", "power" that represent values reflected by their
|
||||
meaning.
|
||||
|
||||
This attribute is present for all online CPUs supporting the EPB
|
||||
feature.
|
||||
|
||||
Note that while the EPB interface to the processor is defined at the logical CPU
|
||||
level, the physical register backing it may be shared by multiple CPUs (for
|
||||
example, SMT siblings or cores in one package). For this reason, updating the
|
||||
EPB value for one CPU may cause the EPB values for other CPUs to change.
|
287
Documentation/admin-guide/pm/intel_idle.rst
Normal file
287
Documentation/admin-guide/pm/intel_idle.rst
Normal file
|
@ -0,0 +1,287 @@
|
|||
.. SPDX-License-Identifier: GPL-2.0
|
||||
.. include:: <isonum.txt>
|
||||
|
||||
==============================================
|
||||
``intel_idle`` CPU Idle Time Management Driver
|
||||
==============================================
|
||||
|
||||
:Copyright: |copy| 2020 Intel Corporation
|
||||
|
||||
:Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
||||
|
||||
|
||||
General Information
|
||||
===================
|
||||
|
||||
``intel_idle`` is a part of the
|
||||
:doc:`CPU idle time management subsystem <cpuidle>` in the Linux kernel
|
||||
(``CPUIdle``). It is the default CPU idle time management driver for the
|
||||
Nehalem and later generations of Intel processors, but the level of support for
|
||||
a particular processor model in it depends on whether or not it recognizes that
|
||||
processor model and may also depend on information coming from the platform
|
||||
firmware. [To understand ``intel_idle`` it is necessary to know how ``CPUIdle``
|
||||
works in general, so this is the time to get familiar with
|
||||
Documentation/admin-guide/pm/cpuidle.rst if you have not done that yet.]
|
||||
|
||||
``intel_idle`` uses the ``MWAIT`` instruction to inform the processor that the
|
||||
logical CPU executing it is idle and so it may be possible to put some of the
|
||||
processor's functional blocks into low-power states. That instruction takes two
|
||||
arguments (passed in the ``EAX`` and ``ECX`` registers of the target CPU), the
|
||||
first of which, referred to as a *hint*, can be used by the processor to
|
||||
determine what can be done (for details refer to Intel Software Developer’s
|
||||
Manual [1]_). Accordingly, ``intel_idle`` refuses to work with processors in
|
||||
which the support for the ``MWAIT`` instruction has been disabled (for example,
|
||||
via the platform firmware configuration menu) or which do not support that
|
||||
instruction at all.
|
||||
|
||||
``intel_idle`` is not modular, so it cannot be unloaded, which means that the
|
||||
only way to pass early-configuration-time parameters to it is via the kernel
|
||||
command line.
|
||||
|
||||
|
||||
.. _intel-idle-enumeration-of-states:
|
||||
|
||||
Enumeration of Idle States
|
||||
==========================
|
||||
|
||||
Each ``MWAIT`` hint value is interpreted by the processor as a license to
|
||||
reconfigure itself in a certain way in order to save energy. The processor
|
||||
configurations (with reduced power draw) resulting from that are referred to
|
||||
as C-states (in the ACPI terminology) or idle states. The list of meaningful
|
||||
``MWAIT`` hint values and idle states (i.e. low-power configurations of the
|
||||
processor) corresponding to them depends on the processor model and it may also
|
||||
depend on the configuration of the platform.
|
||||
|
||||
In order to create a list of available idle states required by the ``CPUIdle``
|
||||
subsystem (see :ref:`idle-states-representation` in
|
||||
Documentation/admin-guide/pm/cpuidle.rst),
|
||||
``intel_idle`` can use two sources of information: static tables of idle states
|
||||
for different processor models included in the driver itself and the ACPI tables
|
||||
of the system. The former are always used if the processor model at hand is
|
||||
recognized by ``intel_idle`` and the latter are used if that is required for
|
||||
the given processor model (which is the case for all server processor models
|
||||
recognized by ``intel_idle``) or if the processor model is not recognized.
|
||||
[There is a module parameter that can be used to make the driver use the ACPI
|
||||
tables with any processor model recognized by it; see
|
||||
`below <intel-idle-parameters_>`_.]
|
||||
|
||||
If the ACPI tables are going to be used for building the list of available idle
|
||||
states, ``intel_idle`` first looks for a ``_CST`` object under one of the ACPI
|
||||
objects corresponding to the CPUs in the system (refer to the ACPI specification
|
||||
[2]_ for the description of ``_CST`` and its output package). Because the
|
||||
``CPUIdle`` subsystem expects that the list of idle states supplied by the
|
||||
driver will be suitable for all of the CPUs handled by it and ``intel_idle`` is
|
||||
registered as the ``CPUIdle`` driver for all of the CPUs in the system, the
|
||||
driver looks for the first ``_CST`` object returning at least one valid idle
|
||||
state description and such that all of the idle states included in its return
|
||||
package are of the FFH (Functional Fixed Hardware) type, which means that the
|
||||
``MWAIT`` instruction is expected to be used to tell the processor that it can
|
||||
enter one of them. The return package of that ``_CST`` is then assumed to be
|
||||
applicable to all of the other CPUs in the system and the idle state
|
||||
descriptions extracted from it are stored in a preliminary list of idle states
|
||||
coming from the ACPI tables. [This step is skipped if ``intel_idle`` is
|
||||
configured to ignore the ACPI tables; see `below <intel-idle-parameters_>`_.]
|
||||
|
||||
Next, the first (index 0) entry in the list of available idle states is
|
||||
initialized to represent a "polling idle state" (a pseudo-idle state in which
|
||||
the target CPU continuously fetches and executes instructions), and the
|
||||
subsequent (real) idle state entries are populated as follows.
|
||||
|
||||
If the processor model at hand is recognized by ``intel_idle``, there is a
|
||||
(static) table of idle state descriptions for it in the driver. In that case,
|
||||
the "internal" table is the primary source of information on idle states and the
|
||||
information from it is copied to the final list of available idle states. If
|
||||
using the ACPI tables for the enumeration of idle states is not required
|
||||
(depending on the processor model), all of the listed idle state are enabled by
|
||||
default (so all of them will be taken into consideration by ``CPUIdle``
|
||||
governors during CPU idle state selection). Otherwise, some of the listed idle
|
||||
states may not be enabled by default if there are no matching entries in the
|
||||
preliminary list of idle states coming from the ACPI tables. In that case user
|
||||
space still can enable them later (on a per-CPU basis) with the help of
|
||||
the ``disable`` idle state attribute in ``sysfs`` (see
|
||||
:ref:`idle-states-representation` in
|
||||
Documentation/admin-guide/pm/cpuidle.rst). This basically means that
|
||||
the idle states "known" to the driver may not be enabled by default if they have
|
||||
not been exposed by the platform firmware (through the ACPI tables).
|
||||
|
||||
If the given processor model is not recognized by ``intel_idle``, but it
|
||||
supports ``MWAIT``, the preliminary list of idle states coming from the ACPI
|
||||
tables is used for building the final list that will be supplied to the
|
||||
``CPUIdle`` core during driver registration. For each idle state in that list,
|
||||
the description, ``MWAIT`` hint and exit latency are copied to the corresponding
|
||||
entry in the final list of idle states. The name of the idle state represented
|
||||
by it (to be returned by the ``name`` idle state attribute in ``sysfs``) is
|
||||
"CX_ACPI", where X is the index of that idle state in the final list (note that
|
||||
the minimum value of X is 1, because 0 is reserved for the "polling" state), and
|
||||
its target residency is based on the exit latency value. Specifically, for
|
||||
C1-type idle states the exit latency value is also used as the target residency
|
||||
(for compatibility with the majority of the "internal" tables of idle states for
|
||||
various processor models recognized by ``intel_idle``) and for the other idle
|
||||
state types (C2 and C3) the target residency value is 3 times the exit latency
|
||||
(again, that is because it reflects the target residency to exit latency ratio
|
||||
in the majority of cases for the processor models recognized by ``intel_idle``).
|
||||
All of the idle states in the final list are enabled by default in this case.
|
||||
|
||||
|
||||
.. _intel-idle-initialization:
|
||||
|
||||
Initialization
|
||||
==============
|
||||
|
||||
The initialization of ``intel_idle`` starts with checking if the kernel command
|
||||
line options forbid the use of the ``MWAIT`` instruction. If that is the case,
|
||||
an error code is returned right away.
|
||||
|
||||
The next step is to check whether or not the processor model is known to the
|
||||
driver, which determines the idle states enumeration method (see
|
||||
`above <intel-idle-enumeration-of-states_>`_), and whether or not the processor
|
||||
supports ``MWAIT`` (the initialization fails if that is not the case). Then,
|
||||
the ``MWAIT`` support in the processor is enumerated through ``CPUID`` and the
|
||||
driver initialization fails if the level of support is not as expected (for
|
||||
example, if the total number of ``MWAIT`` substates returned is 0).
|
||||
|
||||
Next, if the driver is not configured to ignore the ACPI tables (see
|
||||
`below <intel-idle-parameters_>`_), the idle states information provided by the
|
||||
platform firmware is extracted from them.
|
||||
|
||||
Then, ``CPUIdle`` device objects are allocated for all CPUs and the list of
|
||||
available idle states is created as explained
|
||||
`above <intel-idle-enumeration-of-states_>`_.
|
||||
|
||||
Finally, ``intel_idle`` is registered with the help of cpuidle_register_driver()
|
||||
as the ``CPUIdle`` driver for all CPUs in the system and a CPU online callback
|
||||
for configuring individual CPUs is registered via cpuhp_setup_state(), which
|
||||
(among other things) causes the callback routine to be invoked for all of the
|
||||
CPUs present in the system at that time (each CPU executes its own instance of
|
||||
the callback routine). That routine registers a ``CPUIdle`` device for the CPU
|
||||
running it (which enables the ``CPUIdle`` subsystem to operate that CPU) and
|
||||
optionally performs some CPU-specific initialization actions that may be
|
||||
required for the given processor model.
|
||||
|
||||
|
||||
.. _intel-idle-parameters:
|
||||
|
||||
Kernel Command Line Options and Module Parameters
|
||||
=================================================
|
||||
|
||||
The *x86* architecture support code recognizes three kernel command line
|
||||
options related to CPU idle time management: ``idle=poll``, ``idle=halt``,
|
||||
and ``idle=nomwait``. If any of them is present in the kernel command line, the
|
||||
``MWAIT`` instruction is not allowed to be used, so the initialization of
|
||||
``intel_idle`` will fail.
|
||||
|
||||
Apart from that there are five module parameters recognized by ``intel_idle``
|
||||
itself that can be set via the kernel command line (they cannot be updated via
|
||||
sysfs, so that is the only way to change their values).
|
||||
|
||||
The ``max_cstate`` parameter value is the maximum idle state index in the list
|
||||
of idle states supplied to the ``CPUIdle`` core during the registration of the
|
||||
driver. It is also the maximum number of regular (non-polling) idle states that
|
||||
can be used by ``intel_idle``, so the enumeration of idle states is terminated
|
||||
after finding that number of usable idle states (the other idle states that
|
||||
potentially might have been used if ``max_cstate`` had been greater are not
|
||||
taken into consideration at all). Setting ``max_cstate`` can prevent
|
||||
``intel_idle`` from exposing idle states that are regarded as "too deep" for
|
||||
some reason to the ``CPUIdle`` core, but it does so by making them effectively
|
||||
invisible until the system is shut down and started again which may not always
|
||||
be desirable. In practice, it is only really necessary to do that if the idle
|
||||
states in question cannot be enabled during system startup, because in the
|
||||
working state of the system the CPU power management quality of service (PM
|
||||
QoS) feature can be used to prevent ``CPUIdle`` from touching those idle states
|
||||
even if they have been enumerated (see :ref:`cpu-pm-qos` in
|
||||
Documentation/admin-guide/pm/cpuidle.rst).
|
||||
Setting ``max_cstate`` to 0 causes the ``intel_idle`` initialization to fail.
|
||||
|
||||
The ``no_acpi`` and ``use_acpi`` module parameters (recognized by ``intel_idle``
|
||||
if the kernel has been configured with ACPI support) can be set to make the
|
||||
driver ignore the system's ACPI tables entirely or use them for all of the
|
||||
recognized processor models, respectively (they both are unset by default and
|
||||
``use_acpi`` has no effect if ``no_acpi`` is set).
|
||||
|
||||
The value of the ``states_off`` module parameter (0 by default) represents a
|
||||
list of idle states to be disabled by default in the form of a bitmask.
|
||||
|
||||
Namely, the positions of the bits that are set in the ``states_off`` value are
|
||||
the indices of idle states to be disabled by default (as reflected by the names
|
||||
of the corresponding idle state directories in ``sysfs``, :file:`state0`,
|
||||
:file:`state1` ... :file:`state<i>` ..., where ``<i>`` is the index of the given
|
||||
idle state; see :ref:`idle-states-representation` in
|
||||
Documentation/admin-guide/pm/cpuidle.rst).
|
||||
|
||||
For example, if ``states_off`` is equal to 3, the driver will disable idle
|
||||
states 0 and 1 by default, and if it is equal to 8, idle state 3 will be
|
||||
disabled by default and so on (bit positions beyond the maximum idle state index
|
||||
are ignored).
|
||||
|
||||
The idle states disabled this way can be enabled (on a per-CPU basis) from user
|
||||
space via ``sysfs``.
|
||||
|
||||
The ``ibrs_off`` module parameter is a boolean flag (defaults to
|
||||
false). If set, it is used to control if IBRS (Indirect Branch Restricted
|
||||
Speculation) should be turned off when the CPU enters an idle state.
|
||||
This flag does not affect CPUs that use Enhanced IBRS which can remain
|
||||
on with little performance impact.
|
||||
|
||||
For some CPUs, IBRS will be selected as mitigation for Spectre v2 and Retbleed
|
||||
security vulnerabilities by default. Leaving the IBRS mode on while idling may
|
||||
have a performance impact on its sibling CPU. The IBRS mode will be turned off
|
||||
by default when the CPU enters into a deep idle state, but not in some
|
||||
shallower ones. Setting the ``ibrs_off`` module parameter will force the IBRS
|
||||
mode to off when the CPU is in any one of the available idle states. This may
|
||||
help performance of a sibling CPU at the expense of a slightly higher wakeup
|
||||
latency for the idle CPU.
|
||||
|
||||
|
||||
.. _intel-idle-core-and-package-idle-states:
|
||||
|
||||
Core and Package Levels of Idle States
|
||||
======================================
|
||||
|
||||
Typically, in a processor supporting the ``MWAIT`` instruction there are (at
|
||||
least) two levels of idle states (or C-states). One level, referred to as
|
||||
"core C-states", covers individual cores in the processor, whereas the other
|
||||
level, referred to as "package C-states", covers the entire processor package
|
||||
and it may also involve other components of the system (GPUs, memory
|
||||
controllers, I/O hubs etc.).
|
||||
|
||||
Some of the ``MWAIT`` hint values allow the processor to use core C-states only
|
||||
(most importantly, that is the case for the ``MWAIT`` hint value corresponding
|
||||
to the ``C1`` idle state), but the majority of them give it a license to put
|
||||
the target core (i.e. the core containing the logical CPU executing ``MWAIT``
|
||||
with the given hint value) into a specific core C-state and then (if possible)
|
||||
to enter a specific package C-state at the deeper level. For example, the
|
||||
``MWAIT`` hint value representing the ``C3`` idle state allows the processor to
|
||||
put the target core into the low-power state referred to as "core ``C3``" (or
|
||||
``CC3``), which happens if all of the logical CPUs (SMT siblings) in that core
|
||||
have executed ``MWAIT`` with the ``C3`` hint value (or with a hint value
|
||||
representing a deeper idle state), and in addition to that (in the majority of
|
||||
cases) it gives the processor a license to put the entire package (possibly
|
||||
including some non-CPU components such as a GPU or a memory controller) into the
|
||||
low-power state referred to as "package ``C3``" (or ``PC3``), which happens if
|
||||
all of the cores have gone into the ``CC3`` state and (possibly) some additional
|
||||
conditions are satisfied (for instance, if the GPU is covered by ``PC3``, it may
|
||||
be required to be in a certain GPU-specific low-power state for ``PC3`` to be
|
||||
reachable).
|
||||
|
||||
As a rule, there is no simple way to make the processor use core C-states only
|
||||
if the conditions for entering the corresponding package C-states are met, so
|
||||
the logical CPU executing ``MWAIT`` with a hint value that is not core-level
|
||||
only (like for ``C1``) must always assume that this may cause the processor to
|
||||
enter a package C-state. [That is why the exit latency and target residency
|
||||
values corresponding to the majority of ``MWAIT`` hint values in the "internal"
|
||||
tables of idle states in ``intel_idle`` reflect the properties of package
|
||||
C-states.] If using package C-states is not desirable at all, either
|
||||
:ref:`PM QoS <cpu-pm-qos>` or the ``max_cstate`` module parameter of
|
||||
``intel_idle`` described `above <intel-idle-parameters_>`_ must be used to
|
||||
restrict the range of permissible idle states to the ones with core-level only
|
||||
``MWAIT`` hint values (like ``C1``).
|
||||
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
.. [1] *Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 2B*,
|
||||
https://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-2b-manual.html
|
||||
|
||||
.. [2] *Advanced Configuration and Power Interface (ACPI) Specification*,
|
||||
https://uefi.org/specifications
|
770
Documentation/admin-guide/pm/intel_pstate.rst
Normal file
770
Documentation/admin-guide/pm/intel_pstate.rst
Normal file
|
@ -0,0 +1,770 @@
|
|||
.. SPDX-License-Identifier: GPL-2.0
|
||||
.. include:: <isonum.txt>
|
||||
|
||||
===============================================
|
||||
``intel_pstate`` CPU Performance Scaling Driver
|
||||
===============================================
|
||||
|
||||
:Copyright: |copy| 2017 Intel Corporation
|
||||
|
||||
:Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
||||
|
||||
|
||||
General Information
|
||||
===================
|
||||
|
||||
``intel_pstate`` is a part of the
|
||||
:doc:`CPU performance scaling subsystem <cpufreq>` in the Linux kernel
|
||||
(``CPUFreq``). It is a scaling driver for the Sandy Bridge and later
|
||||
generations of Intel processors. Note, however, that some of those processors
|
||||
may not be supported. [To understand ``intel_pstate`` it is necessary to know
|
||||
how ``CPUFreq`` works in general, so this is the time to read
|
||||
Documentation/admin-guide/pm/cpufreq.rst if you have not done that yet.]
|
||||
|
||||
For the processors supported by ``intel_pstate``, the P-state concept is broader
|
||||
than just an operating frequency or an operating performance point (see the
|
||||
LinuxCon Europe 2015 presentation by Kristen Accardi [1]_ for more
|
||||
information about that). For this reason, the representation of P-states used
|
||||
by ``intel_pstate`` internally follows the hardware specification (for details
|
||||
refer to Intel Software Developer’s Manual [2]_). However, the ``CPUFreq`` core
|
||||
uses frequencies for identifying operating performance points of CPUs and
|
||||
frequencies are involved in the user space interface exposed by it, so
|
||||
``intel_pstate`` maps its internal representation of P-states to frequencies too
|
||||
(fortunately, that mapping is unambiguous). At the same time, it would not be
|
||||
practical for ``intel_pstate`` to supply the ``CPUFreq`` core with a table of
|
||||
available frequencies due to the possible size of it, so the driver does not do
|
||||
that. Some functionality of the core is limited by that.
|
||||
|
||||
Since the hardware P-state selection interface used by ``intel_pstate`` is
|
||||
available at the logical CPU level, the driver always works with individual
|
||||
CPUs. Consequently, if ``intel_pstate`` is in use, every ``CPUFreq`` policy
|
||||
object corresponds to one logical CPU and ``CPUFreq`` policies are effectively
|
||||
equivalent to CPUs. In particular, this means that they become "inactive" every
|
||||
time the corresponding CPU is taken offline and need to be re-initialized when
|
||||
it goes back online.
|
||||
|
||||
``intel_pstate`` is not modular, so it cannot be unloaded, which means that the
|
||||
only way to pass early-configuration-time parameters to it is via the kernel
|
||||
command line. However, its configuration can be adjusted via ``sysfs`` to a
|
||||
great extent. In some configurations it even is possible to unregister it via
|
||||
``sysfs`` which allows another ``CPUFreq`` scaling driver to be loaded and
|
||||
registered (see `below <status_attr_>`_).
|
||||
|
||||
|
||||
Operation Modes
|
||||
===============
|
||||
|
||||
``intel_pstate`` can operate in two different modes, active or passive. In the
|
||||
active mode, it uses its own internal performance scaling governor algorithm or
|
||||
allows the hardware to do performance scaling by itself, while in the passive
|
||||
mode it responds to requests made by a generic ``CPUFreq`` governor implementing
|
||||
a certain performance scaling algorithm. Which of them will be in effect
|
||||
depends on what kernel command line options are used and on the capabilities of
|
||||
the processor.
|
||||
|
||||
Active Mode
|
||||
-----------
|
||||
|
||||
This is the default operation mode of ``intel_pstate`` for processors with
|
||||
hardware-managed P-states (HWP) support. If it works in this mode, the
|
||||
``scaling_driver`` policy attribute in ``sysfs`` for all ``CPUFreq`` policies
|
||||
contains the string "intel_pstate".
|
||||
|
||||
In this mode the driver bypasses the scaling governors layer of ``CPUFreq`` and
|
||||
provides its own scaling algorithms for P-state selection. Those algorithms
|
||||
can be applied to ``CPUFreq`` policies in the same way as generic scaling
|
||||
governors (that is, through the ``scaling_governor`` policy attribute in
|
||||
``sysfs``). [Note that different P-state selection algorithms may be chosen for
|
||||
different policies, but that is not recommended.]
|
||||
|
||||
They are not generic scaling governors, but their names are the same as the
|
||||
names of some of those governors. Moreover, confusingly enough, they generally
|
||||
do not work in the same way as the generic governors they share the names with.
|
||||
For example, the ``powersave`` P-state selection algorithm provided by
|
||||
``intel_pstate`` is not a counterpart of the generic ``powersave`` governor
|
||||
(roughly, it corresponds to the ``schedutil`` and ``ondemand`` governors).
|
||||
|
||||
There are two P-state selection algorithms provided by ``intel_pstate`` in the
|
||||
active mode: ``powersave`` and ``performance``. The way they both operate
|
||||
depends on whether or not the hardware-managed P-states (HWP) feature has been
|
||||
enabled in the processor and possibly on the processor model.
|
||||
|
||||
Which of the P-state selection algorithms is used by default depends on the
|
||||
:c:macro:`CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE` kernel configuration option.
|
||||
Namely, if that option is set, the ``performance`` algorithm will be used by
|
||||
default, and the other one will be used by default if it is not set.
|
||||
|
||||
Active Mode With HWP
|
||||
~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
If the processor supports the HWP feature, it will be enabled during the
|
||||
processor initialization and cannot be disabled after that. It is possible
|
||||
to avoid enabling it by passing the ``intel_pstate=no_hwp`` argument to the
|
||||
kernel in the command line.
|
||||
|
||||
If the HWP feature has been enabled, ``intel_pstate`` relies on the processor to
|
||||
select P-states by itself, but still it can give hints to the processor's
|
||||
internal P-state selection logic. What those hints are depends on which P-state
|
||||
selection algorithm has been applied to the given policy (or to the CPU it
|
||||
corresponds to).
|
||||
|
||||
Even though the P-state selection is carried out by the processor automatically,
|
||||
``intel_pstate`` registers utilization update callbacks with the CPU scheduler
|
||||
in this mode. However, they are not used for running a P-state selection
|
||||
algorithm, but for periodic updates of the current CPU frequency information to
|
||||
be made available from the ``scaling_cur_freq`` policy attribute in ``sysfs``.
|
||||
|
||||
HWP + ``performance``
|
||||
.....................
|
||||
|
||||
In this configuration ``intel_pstate`` will write 0 to the processor's
|
||||
Energy-Performance Preference (EPP) knob (if supported) or its
|
||||
Energy-Performance Bias (EPB) knob (otherwise), which means that the processor's
|
||||
internal P-state selection logic is expected to focus entirely on performance.
|
||||
|
||||
This will override the EPP/EPB setting coming from the ``sysfs`` interface
|
||||
(see `Energy vs Performance Hints`_ below). Moreover, any attempts to change
|
||||
the EPP/EPB to a value different from 0 ("performance") via ``sysfs`` in this
|
||||
configuration will be rejected.
|
||||
|
||||
Also, in this configuration the range of P-states available to the processor's
|
||||
internal P-state selection logic is always restricted to the upper boundary
|
||||
(that is, the maximum P-state that the driver is allowed to use).
|
||||
|
||||
HWP + ``powersave``
|
||||
...................
|
||||
|
||||
In this configuration ``intel_pstate`` will set the processor's
|
||||
Energy-Performance Preference (EPP) knob (if supported) or its
|
||||
Energy-Performance Bias (EPB) knob (otherwise) to whatever value it was
|
||||
previously set to via ``sysfs`` (or whatever default value it was
|
||||
set to by the platform firmware). This usually causes the processor's
|
||||
internal P-state selection logic to be less performance-focused.
|
||||
|
||||
Active Mode Without HWP
|
||||
~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
This operation mode is optional for processors that do not support the HWP
|
||||
feature or when the ``intel_pstate=no_hwp`` argument is passed to the kernel in
|
||||
the command line. The active mode is used in those cases if the
|
||||
``intel_pstate=active`` argument is passed to the kernel in the command line.
|
||||
In this mode ``intel_pstate`` may refuse to work with processors that are not
|
||||
recognized by it. [Note that ``intel_pstate`` will never refuse to work with
|
||||
any processor with the HWP feature enabled.]
|
||||
|
||||
In this mode ``intel_pstate`` registers utilization update callbacks with the
|
||||
CPU scheduler in order to run a P-state selection algorithm, either
|
||||
``powersave`` or ``performance``, depending on the ``scaling_governor`` policy
|
||||
setting in ``sysfs``. The current CPU frequency information to be made
|
||||
available from the ``scaling_cur_freq`` policy attribute in ``sysfs`` is
|
||||
periodically updated by those utilization update callbacks too.
|
||||
|
||||
``performance``
|
||||
...............
|
||||
|
||||
Without HWP, this P-state selection algorithm is always the same regardless of
|
||||
the processor model and platform configuration.
|
||||
|
||||
It selects the maximum P-state it is allowed to use, subject to limits set via
|
||||
``sysfs``, every time the driver configuration for the given CPU is updated
|
||||
(e.g. via ``sysfs``).
|
||||
|
||||
This is the default P-state selection algorithm if the
|
||||
:c:macro:`CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE` kernel configuration option
|
||||
is set.
|
||||
|
||||
``powersave``
|
||||
.............
|
||||
|
||||
Without HWP, this P-state selection algorithm is similar to the algorithm
|
||||
implemented by the generic ``schedutil`` scaling governor except that the
|
||||
utilization metric used by it is based on numbers coming from feedback
|
||||
registers of the CPU. It generally selects P-states proportional to the
|
||||
current CPU utilization.
|
||||
|
||||
This algorithm is run by the driver's utilization update callback for the
|
||||
given CPU when it is invoked by the CPU scheduler, but not more often than
|
||||
every 10 ms. Like in the ``performance`` case, the hardware configuration
|
||||
is not touched if the new P-state turns out to be the same as the current
|
||||
one.
|
||||
|
||||
This is the default P-state selection algorithm if the
|
||||
:c:macro:`CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE` kernel configuration option
|
||||
is not set.
|
||||
|
||||
Passive Mode
|
||||
------------
|
||||
|
||||
This is the default operation mode of ``intel_pstate`` for processors without
|
||||
hardware-managed P-states (HWP) support. It is always used if the
|
||||
``intel_pstate=passive`` argument is passed to the kernel in the command line
|
||||
regardless of whether or not the given processor supports HWP. [Note that the
|
||||
``intel_pstate=no_hwp`` setting causes the driver to start in the passive mode
|
||||
if it is not combined with ``intel_pstate=active``.] Like in the active mode
|
||||
without HWP support, in this mode ``intel_pstate`` may refuse to work with
|
||||
processors that are not recognized by it if HWP is prevented from being enabled
|
||||
through the kernel command line.
|
||||
|
||||
If the driver works in this mode, the ``scaling_driver`` policy attribute in
|
||||
``sysfs`` for all ``CPUFreq`` policies contains the string "intel_cpufreq".
|
||||
Then, the driver behaves like a regular ``CPUFreq`` scaling driver. That is,
|
||||
it is invoked by generic scaling governors when necessary to talk to the
|
||||
hardware in order to change the P-state of a CPU (in particular, the
|
||||
``schedutil`` governor can invoke it directly from scheduler context).
|
||||
|
||||
While in this mode, ``intel_pstate`` can be used with all of the (generic)
|
||||
scaling governors listed by the ``scaling_available_governors`` policy attribute
|
||||
in ``sysfs`` (and the P-state selection algorithms described above are not
|
||||
used). Then, it is responsible for the configuration of policy objects
|
||||
corresponding to CPUs and provides the ``CPUFreq`` core (and the scaling
|
||||
governors attached to the policy objects) with accurate information on the
|
||||
maximum and minimum operating frequencies supported by the hardware (including
|
||||
the so-called "turbo" frequency ranges). In other words, in the passive mode
|
||||
the entire range of available P-states is exposed by ``intel_pstate`` to the
|
||||
``CPUFreq`` core. However, in this mode the driver does not register
|
||||
utilization update callbacks with the CPU scheduler and the ``scaling_cur_freq``
|
||||
information comes from the ``CPUFreq`` core (and is the last frequency selected
|
||||
by the current scaling governor for the given policy).
|
||||
|
||||
|
||||
.. _turbo:
|
||||
|
||||
Turbo P-states Support
|
||||
======================
|
||||
|
||||
In the majority of cases, the entire range of P-states available to
|
||||
``intel_pstate`` can be divided into two sub-ranges that correspond to
|
||||
different types of processor behavior, above and below a boundary that
|
||||
will be referred to as the "turbo threshold" in what follows.
|
||||
|
||||
The P-states above the turbo threshold are referred to as "turbo P-states" and
|
||||
the whole sub-range of P-states they belong to is referred to as the "turbo
|
||||
range". These names are related to the Turbo Boost technology allowing a
|
||||
multicore processor to opportunistically increase the P-state of one or more
|
||||
cores if there is enough power to do that and if that is not going to cause the
|
||||
thermal envelope of the processor package to be exceeded.
|
||||
|
||||
Specifically, if software sets the P-state of a CPU core within the turbo range
|
||||
(that is, above the turbo threshold), the processor is permitted to take over
|
||||
performance scaling control for that core and put it into turbo P-states of its
|
||||
choice going forward. However, that permission is interpreted differently by
|
||||
different processor generations. Namely, the Sandy Bridge generation of
|
||||
processors will never use any P-states above the last one set by software for
|
||||
the given core, even if it is within the turbo range, whereas all of the later
|
||||
processor generations will take it as a license to use any P-states from the
|
||||
turbo range, even above the one set by software. In other words, on those
|
||||
processors setting any P-state from the turbo range will enable the processor
|
||||
to put the given core into all turbo P-states up to and including the maximum
|
||||
supported one as it sees fit.
|
||||
|
||||
One important property of turbo P-states is that they are not sustainable. More
|
||||
precisely, there is no guarantee that any CPUs will be able to stay in any of
|
||||
those states indefinitely, because the power distribution within the processor
|
||||
package may change over time or the thermal envelope it was designed for might
|
||||
be exceeded if a turbo P-state was used for too long.
|
||||
|
||||
In turn, the P-states below the turbo threshold generally are sustainable. In
|
||||
fact, if one of them is set by software, the processor is not expected to change
|
||||
it to a lower one unless in a thermal stress or a power limit violation
|
||||
situation (a higher P-state may still be used if it is set for another CPU in
|
||||
the same package at the same time, for example).
|
||||
|
||||
Some processors allow multiple cores to be in turbo P-states at the same time,
|
||||
but the maximum P-state that can be set for them generally depends on the number
|
||||
of cores running concurrently. The maximum turbo P-state that can be set for 3
|
||||
cores at the same time usually is lower than the analogous maximum P-state for
|
||||
2 cores, which in turn usually is lower than the maximum turbo P-state that can
|
||||
be set for 1 core. The one-core maximum turbo P-state is thus the maximum
|
||||
supported one overall.
|
||||
|
||||
The maximum supported turbo P-state, the turbo threshold (the maximum supported
|
||||
non-turbo P-state) and the minimum supported P-state are specific to the
|
||||
processor model and can be determined by reading the processor's model-specific
|
||||
registers (MSRs). Moreover, some processors support the Configurable TDP
|
||||
(Thermal Design Power) feature and, when that feature is enabled, the turbo
|
||||
threshold effectively becomes a configurable value that can be set by the
|
||||
platform firmware.
|
||||
|
||||
Unlike ``_PSS`` objects in the ACPI tables, ``intel_pstate`` always exposes
|
||||
the entire range of available P-states, including the whole turbo range, to the
|
||||
``CPUFreq`` core and (in the passive mode) to generic scaling governors. This
|
||||
generally causes turbo P-states to be set more often when ``intel_pstate`` is
|
||||
used relative to ACPI-based CPU performance scaling (see `below <acpi-cpufreq_>`_
|
||||
for more information).
|
||||
|
||||
Moreover, since ``intel_pstate`` always knows what the real turbo threshold is
|
||||
(even if the Configurable TDP feature is enabled in the processor), its
|
||||
``no_turbo`` attribute in ``sysfs`` (described `below <no_turbo_attr_>`_) should
|
||||
work as expected in all cases (that is, if set to disable turbo P-states, it
|
||||
always should prevent ``intel_pstate`` from using them).
|
||||
|
||||
|
||||
Processor Support
|
||||
=================
|
||||
|
||||
To handle a given processor ``intel_pstate`` requires a number of different
|
||||
pieces of information on it to be known, including:
|
||||
|
||||
* The minimum supported P-state.
|
||||
|
||||
* The maximum supported `non-turbo P-state <turbo_>`_.
|
||||
|
||||
* Whether or not turbo P-states are supported at all.
|
||||
|
||||
* The maximum supported `one-core turbo P-state <turbo_>`_ (if turbo P-states
|
||||
are supported).
|
||||
|
||||
* The scaling formula to translate the driver's internal representation
|
||||
of P-states into frequencies and the other way around.
|
||||
|
||||
Generally, ways to obtain that information are specific to the processor model
|
||||
or family. Although it often is possible to obtain all of it from the processor
|
||||
itself (using model-specific registers), there are cases in which hardware
|
||||
manuals need to be consulted to get to it too.
|
||||
|
||||
For this reason, there is a list of supported processors in ``intel_pstate`` and
|
||||
the driver initialization will fail if the detected processor is not in that
|
||||
list, unless it supports the HWP feature. [The interface to obtain all of the
|
||||
information listed above is the same for all of the processors supporting the
|
||||
HWP feature, which is why ``intel_pstate`` works with all of them.]
|
||||
|
||||
|
||||
User Space Interface in ``sysfs``
|
||||
=================================
|
||||
|
||||
Global Attributes
|
||||
-----------------
|
||||
|
||||
``intel_pstate`` exposes several global attributes (files) in ``sysfs`` to
|
||||
control its functionality at the system level. They are located in the
|
||||
``/sys/devices/system/cpu/intel_pstate/`` directory and affect all CPUs.
|
||||
|
||||
Some of them are not present if the ``intel_pstate=per_cpu_perf_limits``
|
||||
argument is passed to the kernel in the command line.
|
||||
|
||||
``max_perf_pct``
|
||||
Maximum P-state the driver is allowed to set in percent of the
|
||||
maximum supported performance level (the highest supported `turbo
|
||||
P-state <turbo_>`_).
|
||||
|
||||
This attribute will not be exposed if the
|
||||
``intel_pstate=per_cpu_perf_limits`` argument is present in the kernel
|
||||
command line.
|
||||
|
||||
``min_perf_pct``
|
||||
Minimum P-state the driver is allowed to set in percent of the
|
||||
maximum supported performance level (the highest supported `turbo
|
||||
P-state <turbo_>`_).
|
||||
|
||||
This attribute will not be exposed if the
|
||||
``intel_pstate=per_cpu_perf_limits`` argument is present in the kernel
|
||||
command line.
|
||||
|
||||
``num_pstates``
|
||||
Number of P-states supported by the processor (between 0 and 255
|
||||
inclusive) including both turbo and non-turbo P-states (see
|
||||
`Turbo P-states Support`_).
|
||||
|
||||
This attribute is present only if the value exposed by it is the same
|
||||
for all of the CPUs in the system.
|
||||
|
||||
The value of this attribute is not affected by the ``no_turbo``
|
||||
setting described `below <no_turbo_attr_>`_.
|
||||
|
||||
This attribute is read-only.
|
||||
|
||||
``turbo_pct``
|
||||
Ratio of the `turbo range <turbo_>`_ size to the size of the entire
|
||||
range of supported P-states, in percent.
|
||||
|
||||
This attribute is present only if the value exposed by it is the same
|
||||
for all of the CPUs in the system.
|
||||
|
||||
This attribute is read-only.
|
||||
|
||||
.. _no_turbo_attr:
|
||||
|
||||
``no_turbo``
|
||||
If set (equal to 1), the driver is not allowed to set any turbo P-states
|
||||
(see `Turbo P-states Support`_). If unset (equal to 0, which is the
|
||||
default), turbo P-states can be set by the driver.
|
||||
[Note that ``intel_pstate`` does not support the general ``boost``
|
||||
attribute (supported by some other scaling drivers) which is replaced
|
||||
by this one.]
|
||||
|
||||
This attribute does not affect the maximum supported frequency value
|
||||
supplied to the ``CPUFreq`` core and exposed via the policy interface,
|
||||
but it affects the maximum possible value of per-policy P-state limits
|
||||
(see `Interpretation of Policy Attributes`_ below for details).
|
||||
|
||||
``hwp_dynamic_boost``
|
||||
This attribute is only present if ``intel_pstate`` works in the
|
||||
`active mode with the HWP feature enabled <Active Mode With HWP_>`_ in
|
||||
the processor. If set (equal to 1), it causes the minimum P-state limit
|
||||
to be increased dynamically for a short time whenever a task previously
|
||||
waiting on I/O is selected to run on a given logical CPU (the purpose
|
||||
of this mechanism is to improve performance).
|
||||
|
||||
This setting has no effect on logical CPUs whose minimum P-state limit
|
||||
is directly set to the highest non-turbo P-state or above it.
|
||||
|
||||
.. _status_attr:
|
||||
|
||||
``status``
|
||||
Operation mode of the driver: "active", "passive" or "off".
|
||||
|
||||
"active"
|
||||
The driver is functional and in the `active mode
|
||||
<Active Mode_>`_.
|
||||
|
||||
"passive"
|
||||
The driver is functional and in the `passive mode
|
||||
<Passive Mode_>`_.
|
||||
|
||||
"off"
|
||||
The driver is not functional (it is not registered as a scaling
|
||||
driver with the ``CPUFreq`` core).
|
||||
|
||||
This attribute can be written to in order to change the driver's
|
||||
operation mode or to unregister it. The string written to it must be
|
||||
one of the possible values of it and, if successful, the write will
|
||||
cause the driver to switch over to the operation mode represented by
|
||||
that string - or to be unregistered in the "off" case. [Actually,
|
||||
switching over from the active mode to the passive mode or the other
|
||||
way around causes the driver to be unregistered and registered again
|
||||
with a different set of callbacks, so all of its settings (the global
|
||||
as well as the per-policy ones) are then reset to their default
|
||||
values, possibly depending on the target operation mode.]
|
||||
|
||||
``energy_efficiency``
|
||||
This attribute is only present on platforms with CPUs matching the Kaby
|
||||
Lake or Coffee Lake desktop CPU model. By default, energy-efficiency
|
||||
optimizations are disabled on these CPU models if HWP is enabled.
|
||||
Enabling energy-efficiency optimizations may limit maximum operating
|
||||
frequency with or without the HWP feature. With HWP enabled, the
|
||||
optimizations are done only in the turbo frequency range. Without it,
|
||||
they are done in the entire available frequency range. Setting this
|
||||
attribute to "1" enables the energy-efficiency optimizations and setting
|
||||
to "0" disables them.
|
||||
|
||||
Interpretation of Policy Attributes
|
||||
-----------------------------------
|
||||
|
||||
The interpretation of some ``CPUFreq`` policy attributes described in
|
||||
Documentation/admin-guide/pm/cpufreq.rst is special with ``intel_pstate``
|
||||
as the current scaling driver and it generally depends on the driver's
|
||||
`operation mode <Operation Modes_>`_.
|
||||
|
||||
First of all, the values of the ``cpuinfo_max_freq``, ``cpuinfo_min_freq`` and
|
||||
``scaling_cur_freq`` attributes are produced by applying a processor-specific
|
||||
multiplier to the internal P-state representation used by ``intel_pstate``.
|
||||
Also, the values of the ``scaling_max_freq`` and ``scaling_min_freq``
|
||||
attributes are capped by the frequency corresponding to the maximum P-state that
|
||||
the driver is allowed to set.
|
||||
|
||||
If the ``no_turbo`` `global attribute <no_turbo_attr_>`_ is set, the driver is
|
||||
not allowed to use turbo P-states, so the maximum value of ``scaling_max_freq``
|
||||
and ``scaling_min_freq`` is limited to the maximum non-turbo P-state frequency.
|
||||
Accordingly, setting ``no_turbo`` causes ``scaling_max_freq`` and
|
||||
``scaling_min_freq`` to go down to that value if they were above it before.
|
||||
However, the old values of ``scaling_max_freq`` and ``scaling_min_freq`` will be
|
||||
restored after unsetting ``no_turbo``, unless these attributes have been written
|
||||
to after ``no_turbo`` was set.
|
||||
|
||||
If ``no_turbo`` is not set, the maximum possible value of ``scaling_max_freq``
|
||||
and ``scaling_min_freq`` corresponds to the maximum supported turbo P-state,
|
||||
which also is the value of ``cpuinfo_max_freq`` in either case.
|
||||
|
||||
Next, the following policy attributes have special meaning if
|
||||
``intel_pstate`` works in the `active mode <Active Mode_>`_:
|
||||
|
||||
``scaling_available_governors``
|
||||
List of P-state selection algorithms provided by ``intel_pstate``.
|
||||
|
||||
``scaling_governor``
|
||||
P-state selection algorithm provided by ``intel_pstate`` currently in
|
||||
use with the given policy.
|
||||
|
||||
``scaling_cur_freq``
|
||||
Frequency of the average P-state of the CPU represented by the given
|
||||
policy for the time interval between the last two invocations of the
|
||||
driver's utilization update callback by the CPU scheduler for that CPU.
|
||||
|
||||
One more policy attribute is present if the HWP feature is enabled in the
|
||||
processor:
|
||||
|
||||
``base_frequency``
|
||||
Shows the base frequency of the CPU. Any frequency above this will be
|
||||
in the turbo frequency range.
|
||||
|
||||
The meaning of these attributes in the `passive mode <Passive Mode_>`_ is the
|
||||
same as for other scaling drivers.
|
||||
|
||||
Additionally, the value of the ``scaling_driver`` attribute for ``intel_pstate``
|
||||
depends on the operation mode of the driver. Namely, it is either
|
||||
"intel_pstate" (in the `active mode <Active Mode_>`_) or "intel_cpufreq" (in the
|
||||
`passive mode <Passive Mode_>`_).
|
||||
|
||||
Coordination of P-State Limits
|
||||
------------------------------
|
||||
|
||||
``intel_pstate`` allows P-state limits to be set in two ways: with the help of
|
||||
the ``max_perf_pct`` and ``min_perf_pct`` `global attributes
|
||||
<Global Attributes_>`_ or via the ``scaling_max_freq`` and ``scaling_min_freq``
|
||||
``CPUFreq`` policy attributes. The coordination between those limits is based
|
||||
on the following rules, regardless of the current operation mode of the driver:
|
||||
|
||||
1. All CPUs are affected by the global limits (that is, none of them can be
|
||||
requested to run faster than the global maximum and none of them can be
|
||||
requested to run slower than the global minimum).
|
||||
|
||||
2. Each individual CPU is affected by its own per-policy limits (that is, it
|
||||
cannot be requested to run faster than its own per-policy maximum and it
|
||||
cannot be requested to run slower than its own per-policy minimum). The
|
||||
effective performance depends on whether the platform supports per core
|
||||
P-states, hyper-threading is enabled and on current performance requests
|
||||
from other CPUs. When platform doesn't support per core P-states, the
|
||||
effective performance can be more than the policy limits set on a CPU, if
|
||||
other CPUs are requesting higher performance at that moment. Even with per
|
||||
core P-states support, when hyper-threading is enabled, if the sibling CPU
|
||||
is requesting higher performance, the other siblings will get higher
|
||||
performance than their policy limits.
|
||||
|
||||
3. The global and per-policy limits can be set independently.
|
||||
|
||||
In the `active mode with the HWP feature enabled <Active Mode With HWP_>`_, the
|
||||
resulting effective values are written into hardware registers whenever the
|
||||
limits change in order to request its internal P-state selection logic to always
|
||||
set P-states within these limits. Otherwise, the limits are taken into account
|
||||
by scaling governors (in the `passive mode <Passive Mode_>`_) and by the driver
|
||||
every time before setting a new P-state for a CPU.
|
||||
|
||||
Additionally, if the ``intel_pstate=per_cpu_perf_limits`` command line argument
|
||||
is passed to the kernel, ``max_perf_pct`` and ``min_perf_pct`` are not exposed
|
||||
at all and the only way to set the limits is by using the policy attributes.
|
||||
|
||||
|
||||
Energy vs Performance Hints
|
||||
---------------------------
|
||||
|
||||
If the hardware-managed P-states (HWP) is enabled in the processor, additional
|
||||
attributes, intended to allow user space to help ``intel_pstate`` to adjust the
|
||||
processor's internal P-state selection logic by focusing it on performance or on
|
||||
energy-efficiency, or somewhere between the two extremes, are present in every
|
||||
``CPUFreq`` policy directory in ``sysfs``. They are :
|
||||
|
||||
``energy_performance_preference``
|
||||
Current value of the energy vs performance hint for the given policy
|
||||
(or the CPU represented by it).
|
||||
|
||||
The hint can be changed by writing to this attribute.
|
||||
|
||||
``energy_performance_available_preferences``
|
||||
List of strings that can be written to the
|
||||
``energy_performance_preference`` attribute.
|
||||
|
||||
They represent different energy vs performance hints and should be
|
||||
self-explanatory, except that ``default`` represents whatever hint
|
||||
value was set by the platform firmware.
|
||||
|
||||
Strings written to the ``energy_performance_preference`` attribute are
|
||||
internally translated to integer values written to the processor's
|
||||
Energy-Performance Preference (EPP) knob (if supported) or its
|
||||
Energy-Performance Bias (EPB) knob. It is also possible to write a positive
|
||||
integer value between 0 to 255, if the EPP feature is present. If the EPP
|
||||
feature is not present, writing integer value to this attribute is not
|
||||
supported. In this case, user can use the
|
||||
"/sys/devices/system/cpu/cpu*/power/energy_perf_bias" interface.
|
||||
|
||||
[Note that tasks may by migrated from one CPU to another by the scheduler's
|
||||
load-balancing algorithm and if different energy vs performance hints are
|
||||
set for those CPUs, that may lead to undesirable outcomes. To avoid such
|
||||
issues it is better to set the same energy vs performance hint for all CPUs
|
||||
or to pin every task potentially sensitive to them to a specific CPU.]
|
||||
|
||||
.. _acpi-cpufreq:
|
||||
|
||||
``intel_pstate`` vs ``acpi-cpufreq``
|
||||
====================================
|
||||
|
||||
On the majority of systems supported by ``intel_pstate``, the ACPI tables
|
||||
provided by the platform firmware contain ``_PSS`` objects returning information
|
||||
that can be used for CPU performance scaling (refer to the ACPI specification
|
||||
[3]_ for details on the ``_PSS`` objects and the format of the information
|
||||
returned by them).
|
||||
|
||||
The information returned by the ACPI ``_PSS`` objects is used by the
|
||||
``acpi-cpufreq`` scaling driver. On systems supported by ``intel_pstate``
|
||||
the ``acpi-cpufreq`` driver uses the same hardware CPU performance scaling
|
||||
interface, but the set of P-states it can use is limited by the ``_PSS``
|
||||
output.
|
||||
|
||||
On those systems each ``_PSS`` object returns a list of P-states supported by
|
||||
the corresponding CPU which basically is a subset of the P-states range that can
|
||||
be used by ``intel_pstate`` on the same system, with one exception: the whole
|
||||
`turbo range <turbo_>`_ is represented by one item in it (the topmost one). By
|
||||
convention, the frequency returned by ``_PSS`` for that item is greater by 1 MHz
|
||||
than the frequency of the highest non-turbo P-state listed by it, but the
|
||||
corresponding P-state representation (following the hardware specification)
|
||||
returned for it matches the maximum supported turbo P-state (or is the
|
||||
special value 255 meaning essentially "go as high as you can get").
|
||||
|
||||
The list of P-states returned by ``_PSS`` is reflected by the table of
|
||||
available frequencies supplied by ``acpi-cpufreq`` to the ``CPUFreq`` core and
|
||||
scaling governors and the minimum and maximum supported frequencies reported by
|
||||
it come from that list as well. In particular, given the special representation
|
||||
of the turbo range described above, this means that the maximum supported
|
||||
frequency reported by ``acpi-cpufreq`` is higher by 1 MHz than the frequency
|
||||
of the highest supported non-turbo P-state listed by ``_PSS`` which, of course,
|
||||
affects decisions made by the scaling governors, except for ``powersave`` and
|
||||
``performance``.
|
||||
|
||||
For example, if a given governor attempts to select a frequency proportional to
|
||||
estimated CPU load and maps the load of 100% to the maximum supported frequency
|
||||
(possibly multiplied by a constant), then it will tend to choose P-states below
|
||||
the turbo threshold if ``acpi-cpufreq`` is used as the scaling driver, because
|
||||
in that case the turbo range corresponds to a small fraction of the frequency
|
||||
band it can use (1 MHz vs 1 GHz or more). In consequence, it will only go to
|
||||
the turbo range for the highest loads and the other loads above 50% that might
|
||||
benefit from running at turbo frequencies will be given non-turbo P-states
|
||||
instead.
|
||||
|
||||
One more issue related to that may appear on systems supporting the
|
||||
`Configurable TDP feature <turbo_>`_ allowing the platform firmware to set the
|
||||
turbo threshold. Namely, if that is not coordinated with the lists of P-states
|
||||
returned by ``_PSS`` properly, there may be more than one item corresponding to
|
||||
a turbo P-state in those lists and there may be a problem with avoiding the
|
||||
turbo range (if desirable or necessary). Usually, to avoid using turbo
|
||||
P-states overall, ``acpi-cpufreq`` simply avoids using the topmost state listed
|
||||
by ``_PSS``, but that is not sufficient when there are other turbo P-states in
|
||||
the list returned by it.
|
||||
|
||||
Apart from the above, ``acpi-cpufreq`` works like ``intel_pstate`` in the
|
||||
`passive mode <Passive Mode_>`_, except that the number of P-states it can set
|
||||
is limited to the ones listed by the ACPI ``_PSS`` objects.
|
||||
|
||||
|
||||
Kernel Command Line Options for ``intel_pstate``
|
||||
================================================
|
||||
|
||||
Several kernel command line options can be used to pass early-configuration-time
|
||||
parameters to ``intel_pstate`` in order to enforce specific behavior of it. All
|
||||
of them have to be prepended with the ``intel_pstate=`` prefix.
|
||||
|
||||
``disable``
|
||||
Do not register ``intel_pstate`` as the scaling driver even if the
|
||||
processor is supported by it.
|
||||
|
||||
``active``
|
||||
Register ``intel_pstate`` in the `active mode <Active Mode_>`_ to start
|
||||
with.
|
||||
|
||||
``passive``
|
||||
Register ``intel_pstate`` in the `passive mode <Passive Mode_>`_ to
|
||||
start with.
|
||||
|
||||
``force``
|
||||
Register ``intel_pstate`` as the scaling driver instead of
|
||||
``acpi-cpufreq`` even if the latter is preferred on the given system.
|
||||
|
||||
This may prevent some platform features (such as thermal controls and
|
||||
power capping) that rely on the availability of ACPI P-states
|
||||
information from functioning as expected, so it should be used with
|
||||
caution.
|
||||
|
||||
This option does not work with processors that are not supported by
|
||||
``intel_pstate`` and on platforms where the ``pcc-cpufreq`` scaling
|
||||
driver is used instead of ``acpi-cpufreq``.
|
||||
|
||||
``no_hwp``
|
||||
Do not enable the hardware-managed P-states (HWP) feature even if it is
|
||||
supported by the processor.
|
||||
|
||||
``hwp_only``
|
||||
Register ``intel_pstate`` as the scaling driver only if the
|
||||
hardware-managed P-states (HWP) feature is supported by the processor.
|
||||
|
||||
``support_acpi_ppc``
|
||||
Take ACPI ``_PPC`` performance limits into account.
|
||||
|
||||
If the preferred power management profile in the FADT (Fixed ACPI
|
||||
Description Table) is set to "Enterprise Server" or "Performance
|
||||
Server", the ACPI ``_PPC`` limits are taken into account by default
|
||||
and this option has no effect.
|
||||
|
||||
``per_cpu_perf_limits``
|
||||
Use per-logical-CPU P-State limits (see `Coordination of P-state
|
||||
Limits`_ for details).
|
||||
|
||||
|
||||
Diagnostics and Tuning
|
||||
======================
|
||||
|
||||
Trace Events
|
||||
------------
|
||||
|
||||
There are two static trace events that can be used for ``intel_pstate``
|
||||
diagnostics. One of them is the ``cpu_frequency`` trace event generally used
|
||||
by ``CPUFreq``, and the other one is the ``pstate_sample`` trace event specific
|
||||
to ``intel_pstate``. Both of them are triggered by ``intel_pstate`` only if
|
||||
it works in the `active mode <Active Mode_>`_.
|
||||
|
||||
The following sequence of shell commands can be used to enable them and see
|
||||
their output (if the kernel is generally configured to support event tracing)::
|
||||
|
||||
# cd /sys/kernel/tracing/
|
||||
# echo 1 > events/power/pstate_sample/enable
|
||||
# echo 1 > events/power/cpu_frequency/enable
|
||||
# cat trace
|
||||
gnome-terminal--4510 [001] ..s. 1177.680733: pstate_sample: core_busy=107 scaled=94 from=26 to=26 mperf=1143818 aperf=1230607 tsc=29838618 freq=2474476
|
||||
cat-5235 [002] ..s. 1177.681723: cpu_frequency: state=2900000 cpu_id=2
|
||||
|
||||
If ``intel_pstate`` works in the `passive mode <Passive Mode_>`_, the
|
||||
``cpu_frequency`` trace event will be triggered either by the ``schedutil``
|
||||
scaling governor (for the policies it is attached to), or by the ``CPUFreq``
|
||||
core (for the policies with other scaling governors).
|
||||
|
||||
``ftrace``
|
||||
----------
|
||||
|
||||
The ``ftrace`` interface can be used for low-level diagnostics of
|
||||
``intel_pstate``. For example, to check how often the function to set a
|
||||
P-state is called, the ``ftrace`` filter can be set to
|
||||
:c:func:`intel_pstate_set_pstate`::
|
||||
|
||||
# cd /sys/kernel/tracing/
|
||||
# cat available_filter_functions | grep -i pstate
|
||||
intel_pstate_set_pstate
|
||||
intel_pstate_cpu_init
|
||||
...
|
||||
# echo intel_pstate_set_pstate > set_ftrace_filter
|
||||
# echo function > current_tracer
|
||||
# cat trace | head -15
|
||||
# tracer: function
|
||||
#
|
||||
# entries-in-buffer/entries-written: 80/80 #P:4
|
||||
#
|
||||
# _-----=> irqs-off
|
||||
# / _----=> need-resched
|
||||
# | / _---=> hardirq/softirq
|
||||
# || / _--=> preempt-depth
|
||||
# ||| / delay
|
||||
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
|
||||
# | | | |||| | |
|
||||
Xorg-3129 [000] ..s. 2537.644844: intel_pstate_set_pstate <-intel_pstate_timer_func
|
||||
gnome-terminal--4510 [002] ..s. 2537.649844: intel_pstate_set_pstate <-intel_pstate_timer_func
|
||||
gnome-shell-3409 [001] ..s. 2537.650850: intel_pstate_set_pstate <-intel_pstate_timer_func
|
||||
<idle>-0 [000] ..s. 2537.654843: intel_pstate_set_pstate <-intel_pstate_timer_func
|
||||
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
.. [1] Kristen Accardi, *Balancing Power and Performance in the Linux Kernel*,
|
||||
https://events.static.linuxfound.org/sites/events/files/slides/LinuxConEurope_2015.pdf
|
||||
|
||||
.. [2] *Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 3: System Programming Guide*,
|
||||
https://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-system-programming-manual-325384.html
|
||||
|
||||
.. [3] *Advanced Configuration and Power Interface Specification*,
|
||||
https://uefi.org/sites/default/files/resources/ACPI_6_3_final_Jan30.pdf
|
174
Documentation/admin-guide/pm/intel_uncore_frequency_scaling.rst
Normal file
174
Documentation/admin-guide/pm/intel_uncore_frequency_scaling.rst
Normal file
|
@ -0,0 +1,174 @@
|
|||
.. SPDX-License-Identifier: GPL-2.0
|
||||
.. include:: <isonum.txt>
|
||||
|
||||
==============================
|
||||
Intel Uncore Frequency Scaling
|
||||
==============================
|
||||
|
||||
:Copyright: |copy| 2022-2023 Intel Corporation
|
||||
|
||||
:Author: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
|
||||
|
||||
Introduction
|
||||
------------
|
||||
|
||||
The uncore can consume significant amount of power in Intel's Xeon servers based
|
||||
on the workload characteristics. To optimize the total power and improve overall
|
||||
performance, SoCs have internal algorithms for scaling uncore frequency. These
|
||||
algorithms monitor workload usage of uncore and set a desirable frequency.
|
||||
|
||||
It is possible that users have different expectations of uncore performance and
|
||||
want to have control over it. The objective is similar to allowing users to set
|
||||
the scaling min/max frequencies via cpufreq sysfs to improve CPU performance.
|
||||
Users may have some latency sensitive workloads where they do not want any
|
||||
change to uncore frequency. Also, users may have workloads which require
|
||||
different core and uncore performance at distinct phases and they may want to
|
||||
use both cpufreq and the uncore scaling interface to distribute power and
|
||||
improve overall performance.
|
||||
|
||||
Sysfs Interface
|
||||
---------------
|
||||
|
||||
To control uncore frequency, a sysfs interface is provided in the directory:
|
||||
`/sys/devices/system/cpu/intel_uncore_frequency/`.
|
||||
|
||||
There is one directory for each package and die combination as the scope of
|
||||
uncore scaling control is per die in multiple die/package SoCs or per
|
||||
package for single die per package SoCs. The name represents the
|
||||
scope of control. For example: 'package_00_die_00' is for package id 0 and
|
||||
die 0.
|
||||
|
||||
Each package_*_die_* contains the following attributes:
|
||||
|
||||
``initial_max_freq_khz``
|
||||
Out of reset, this attribute represent the maximum possible frequency.
|
||||
This is a read-only attribute. If users adjust max_freq_khz,
|
||||
they can always go back to maximum using the value from this attribute.
|
||||
|
||||
``initial_min_freq_khz``
|
||||
Out of reset, this attribute represent the minimum possible frequency.
|
||||
This is a read-only attribute. If users adjust min_freq_khz,
|
||||
they can always go back to minimum using the value from this attribute.
|
||||
|
||||
``max_freq_khz``
|
||||
This attribute is used to set the maximum uncore frequency.
|
||||
|
||||
``min_freq_khz``
|
||||
This attribute is used to set the minimum uncore frequency.
|
||||
|
||||
``current_freq_khz``
|
||||
This attribute is used to get the current uncore frequency.
|
||||
|
||||
SoCs with TPMI (Topology Aware Register and PM Capsule Interface)
|
||||
-----------------------------------------------------------------
|
||||
|
||||
An SoC can contain multiple power domains with individual or collection
|
||||
of mesh partitions. This partition is called fabric cluster.
|
||||
|
||||
Certain type of meshes will need to run at the same frequency, they will
|
||||
be placed in the same fabric cluster. Benefit of fabric cluster is that it
|
||||
offers a scalable mechanism to deal with partitioned fabrics in a SoC.
|
||||
|
||||
The current sysfs interface supports controls at package and die level.
|
||||
This interface is not enough to support more granular control at
|
||||
fabric cluster level.
|
||||
|
||||
SoCs with the support of TPMI (Topology Aware Register and PM Capsule
|
||||
Interface), can have multiple power domains. Each power domain can
|
||||
contain one or more fabric clusters.
|
||||
|
||||
To represent controls at fabric cluster level in addition to the
|
||||
controls at package and die level (like systems without TPMI
|
||||
support), sysfs is enhanced. This granular interface is presented in the
|
||||
sysfs with directories names prefixed with "uncore". For example:
|
||||
uncore00, uncore01 etc.
|
||||
|
||||
The scope of control is specified by attributes "package_id", "domain_id"
|
||||
and "fabric_cluster_id" in the directory.
|
||||
|
||||
Attributes in each directory:
|
||||
|
||||
``domain_id``
|
||||
This attribute is used to get the power domain id of this instance.
|
||||
|
||||
``fabric_cluster_id``
|
||||
This attribute is used to get the fabric cluster id of this instance.
|
||||
|
||||
``package_id``
|
||||
This attribute is used to get the package id of this instance.
|
||||
|
||||
The other attributes are same as presented at package_*_die_* level.
|
||||
|
||||
In most of current use cases, the "max_freq_khz" and "min_freq_khz"
|
||||
is updated at "package_*_die_*" level. This model will be still supported
|
||||
with the following approach:
|
||||
|
||||
When user uses controls at "package_*_die_*" level, then every fabric
|
||||
cluster is affected in that package and die. For example: user changes
|
||||
"max_freq_khz" in the package_00_die_00, then "max_freq_khz" for uncore*
|
||||
directory with the same package id will be updated. In this case user can
|
||||
still update "max_freq_khz" at each uncore* level, which is more restrictive.
|
||||
Similarly, user can update "min_freq_khz" at "package_*_die_*" level
|
||||
to apply at each uncore* level.
|
||||
|
||||
Support for "current_freq_khz" is available only at each fabric cluster
|
||||
level (i.e., in uncore* directory).
|
||||
|
||||
Efficiency vs. Latency Tradeoff
|
||||
-------------------------------
|
||||
|
||||
The Efficiency Latency Control (ELC) feature improves performance
|
||||
per watt. With this feature hardware power management algorithms
|
||||
optimize trade-off between latency and power consumption. For some
|
||||
latency sensitive workloads further tuning can be done by SW to
|
||||
get desired performance.
|
||||
|
||||
The hardware monitors the average CPU utilization across all cores
|
||||
in a power domain at regular intervals and decides an uncore frequency.
|
||||
While this may result in the best performance per watt, workload may be
|
||||
expecting higher performance at the expense of power. Consider an
|
||||
application that intermittently wakes up to perform memory reads on an
|
||||
otherwise idle system. In such cases, if hardware lowers uncore
|
||||
frequency, then there may be delay in ramp up of frequency to meet
|
||||
target performance.
|
||||
|
||||
The ELC control defines some parameters which can be changed from SW.
|
||||
If the average CPU utilization is below a user-defined threshold
|
||||
(elc_low_threshold_percent attribute below), the user-defined uncore
|
||||
floor frequency will be used (elc_floor_freq_khz attribute below)
|
||||
instead of hardware calculated minimum.
|
||||
|
||||
Similarly in high load scenario where the CPU utilization goes above
|
||||
the high threshold value (elc_high_threshold_percent attribute below)
|
||||
instead of jumping to maximum uncore frequency, frequency is increased
|
||||
in 100MHz steps. This avoids consuming unnecessarily high power
|
||||
immediately with CPU utilization spikes.
|
||||
|
||||
Attributes for efficiency latency control:
|
||||
|
||||
``elc_floor_freq_khz``
|
||||
This attribute is used to get/set the efficiency latency floor frequency.
|
||||
If this variable is lower than the 'min_freq_khz', it is ignored by
|
||||
the firmware.
|
||||
|
||||
``elc_low_threshold_percent``
|
||||
This attribute is used to get/set the efficiency latency control low
|
||||
threshold. This attribute is in percentages of CPU utilization.
|
||||
|
||||
``elc_high_threshold_percent``
|
||||
This attribute is used to get/set the efficiency latency control high
|
||||
threshold. This attribute is in percentages of CPU utilization.
|
||||
|
||||
``elc_high_threshold_enable``
|
||||
This attribute is used to enable/disable the efficiency latency control
|
||||
high threshold. Write '1' to enable, '0' to disable.
|
||||
|
||||
Example system configuration below, which does following:
|
||||
* when CPU utilization is less than 10%: sets uncore frequency to 800MHz
|
||||
* when CPU utilization is higher than 95%: increases uncore frequency in
|
||||
100MHz steps, until power limit is reached
|
||||
|
||||
elc_floor_freq_khz:800000
|
||||
elc_high_threshold_percent:95
|
||||
elc_high_threshold_enable:1
|
||||
elc_low_threshold_percent:10
|
291
Documentation/admin-guide/pm/sleep-states.rst
Normal file
291
Documentation/admin-guide/pm/sleep-states.rst
Normal file
|
@ -0,0 +1,291 @@
|
|||
.. SPDX-License-Identifier: GPL-2.0
|
||||
.. include:: <isonum.txt>
|
||||
|
||||
===================
|
||||
System Sleep States
|
||||
===================
|
||||
|
||||
:Copyright: |copy| 2017 Intel Corporation
|
||||
|
||||
:Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
||||
|
||||
|
||||
Sleep states are global low-power states of the entire system in which user
|
||||
space code cannot be executed and the overall system activity is significantly
|
||||
reduced.
|
||||
|
||||
|
||||
Sleep States That Can Be Supported
|
||||
==================================
|
||||
|
||||
Depending on its configuration and the capabilities of the platform it runs on,
|
||||
the Linux kernel can support up to four system sleep states, including
|
||||
hibernation and up to three variants of system suspend. The sleep states that
|
||||
can be supported by the kernel are listed below.
|
||||
|
||||
.. _s2idle:
|
||||
|
||||
Suspend-to-Idle
|
||||
---------------
|
||||
|
||||
This is a generic, pure software, light-weight variant of system suspend (also
|
||||
referred to as S2I or S2Idle). It allows more energy to be saved relative to
|
||||
runtime idle by freezing user space, suspending the timekeeping and putting all
|
||||
I/O devices into low-power states (possibly lower-power than available in the
|
||||
working state), such that the processors can spend time in their deepest idle
|
||||
states while the system is suspended.
|
||||
|
||||
The system is woken up from this state by in-band interrupts, so theoretically
|
||||
any devices that can cause interrupts to be generated in the working state can
|
||||
also be set up as wakeup devices for S2Idle.
|
||||
|
||||
This state can be used on platforms without support for :ref:`standby <standby>`
|
||||
or :ref:`suspend-to-RAM <s2ram>`, or it can be used in addition to any of the
|
||||
deeper system suspend variants to provide reduced resume latency. It is always
|
||||
supported if the :c:macro:`CONFIG_SUSPEND` kernel configuration option is set.
|
||||
|
||||
.. _standby:
|
||||
|
||||
Standby
|
||||
-------
|
||||
|
||||
This state, if supported, offers moderate, but real, energy savings, while
|
||||
providing a relatively straightforward transition back to the working state. No
|
||||
operating state is lost (the system core logic retains power), so the system can
|
||||
go back to where it left off easily enough.
|
||||
|
||||
In addition to freezing user space, suspending the timekeeping and putting all
|
||||
I/O devices into low-power states, which is done for :ref:`suspend-to-idle
|
||||
<s2idle>` too, nonboot CPUs are taken offline and all low-level system functions
|
||||
are suspended during transitions into this state. For this reason, it should
|
||||
allow more energy to be saved relative to :ref:`suspend-to-idle <s2idle>`, but
|
||||
the resume latency will generally be greater than for that state.
|
||||
|
||||
The set of devices that can wake up the system from this state usually is
|
||||
reduced relative to :ref:`suspend-to-idle <s2idle>` and it may be necessary to
|
||||
rely on the platform for setting up the wakeup functionality as appropriate.
|
||||
|
||||
This state is supported if the :c:macro:`CONFIG_SUSPEND` kernel configuration
|
||||
option is set and the support for it is registered by the platform with the
|
||||
core system suspend subsystem. On ACPI-based systems this state is mapped to
|
||||
the S1 system state defined by ACPI.
|
||||
|
||||
.. _s2ram:
|
||||
|
||||
Suspend-to-RAM
|
||||
--------------
|
||||
|
||||
This state (also referred to as STR or S2RAM), if supported, offers significant
|
||||
energy savings as everything in the system is put into a low-power state, except
|
||||
for memory, which should be placed into the self-refresh mode to retain its
|
||||
contents. All of the steps carried out when entering :ref:`standby <standby>`
|
||||
are also carried out during transitions to S2RAM. Additional operations may
|
||||
take place depending on the platform capabilities. In particular, on ACPI-based
|
||||
systems the kernel passes control to the platform firmware (BIOS) as the last
|
||||
step during S2RAM transitions and that usually results in powering down some
|
||||
more low-level components that are not directly controlled by the kernel.
|
||||
|
||||
The state of devices and CPUs is saved and held in memory. All devices are
|
||||
suspended and put into low-power states. In many cases, all peripheral buses
|
||||
lose power when entering S2RAM, so devices must be able to handle the transition
|
||||
back to the "on" state.
|
||||
|
||||
On ACPI-based systems S2RAM requires some minimal boot-strapping code in the
|
||||
platform firmware to resume the system from it. This may be the case on other
|
||||
platforms too.
|
||||
|
||||
The set of devices that can wake up the system from S2RAM usually is reduced
|
||||
relative to :ref:`suspend-to-idle <s2idle>` and :ref:`standby <standby>` and it
|
||||
may be necessary to rely on the platform for setting up the wakeup functionality
|
||||
as appropriate.
|
||||
|
||||
S2RAM is supported if the :c:macro:`CONFIG_SUSPEND` kernel configuration option
|
||||
is set and the support for it is registered by the platform with the core system
|
||||
suspend subsystem. On ACPI-based systems it is mapped to the S3 system state
|
||||
defined by ACPI.
|
||||
|
||||
.. _hibernation:
|
||||
|
||||
Hibernation
|
||||
-----------
|
||||
|
||||
This state (also referred to as Suspend-to-Disk or STD) offers the greatest
|
||||
energy savings and can be used even in the absence of low-level platform support
|
||||
for system suspend. However, it requires some low-level code for resuming the
|
||||
system to be present for the underlying CPU architecture.
|
||||
|
||||
Hibernation is significantly different from any of the system suspend variants.
|
||||
It takes three system state changes to put it into hibernation and two system
|
||||
state changes to resume it.
|
||||
|
||||
First, when hibernation is triggered, the kernel stops all system activity and
|
||||
creates a snapshot image of memory to be written into persistent storage. Next,
|
||||
the system goes into a state in which the snapshot image can be saved, the image
|
||||
is written out and finally the system goes into the target low-power state in
|
||||
which power is cut from almost all of its hardware components, including memory,
|
||||
except for a limited set of wakeup devices.
|
||||
|
||||
Once the snapshot image has been written out, the system may either enter a
|
||||
special low-power state (like ACPI S4), or it may simply power down itself.
|
||||
Powering down means minimum power draw and it allows this mechanism to work on
|
||||
any system. However, entering a special low-power state may allow additional
|
||||
means of system wakeup to be used (e.g. pressing a key on the keyboard or
|
||||
opening a laptop lid).
|
||||
|
||||
After wakeup, control goes to the platform firmware that runs a boot loader
|
||||
which boots a fresh instance of the kernel (control may also go directly to
|
||||
the boot loader, depending on the system configuration, but anyway it causes
|
||||
a fresh instance of the kernel to be booted). That new instance of the kernel
|
||||
(referred to as the ``restore kernel``) looks for a hibernation image in
|
||||
persistent storage and if one is found, it is loaded into memory. Next, all
|
||||
activity in the system is stopped and the restore kernel overwrites itself with
|
||||
the image contents and jumps into a special trampoline area in the original
|
||||
kernel stored in the image (referred to as the ``image kernel``), which is where
|
||||
the special architecture-specific low-level code is needed. Finally, the
|
||||
image kernel restores the system to the pre-hibernation state and allows user
|
||||
space to run again.
|
||||
|
||||
Hibernation is supported if the :c:macro:`CONFIG_HIBERNATION` kernel
|
||||
configuration option is set. However, this option can only be set if support
|
||||
for the given CPU architecture includes the low-level code for system resume.
|
||||
|
||||
|
||||
Basic ``sysfs`` Interfaces for System Suspend and Hibernation
|
||||
=============================================================
|
||||
|
||||
The power management subsystem provides userspace with a unified ``sysfs``
|
||||
interface for system sleep regardless of the underlying system architecture or
|
||||
platform. That interface is located in the :file:`/sys/power/` directory
|
||||
(assuming that ``sysfs`` is mounted at :file:`/sys`) and it consists of the
|
||||
following attributes (files):
|
||||
|
||||
``state``
|
||||
This file contains a list of strings representing sleep states supported
|
||||
by the kernel. Writing one of these strings into it causes the kernel
|
||||
to start a transition of the system into the sleep state represented by
|
||||
that string.
|
||||
|
||||
In particular, the "disk", "freeze" and "standby" strings represent the
|
||||
:ref:`hibernation <hibernation>`, :ref:`suspend-to-idle <s2idle>` and
|
||||
:ref:`standby <standby>` sleep states, respectively. The "mem" string
|
||||
is interpreted in accordance with the contents of the ``mem_sleep`` file
|
||||
described below.
|
||||
|
||||
If the kernel does not support any system sleep states, this file is
|
||||
not present.
|
||||
|
||||
``mem_sleep``
|
||||
This file contains a list of strings representing supported system
|
||||
suspend variants and allows user space to select the variant to be
|
||||
associated with the "mem" string in the ``state`` file described above.
|
||||
|
||||
The strings that may be present in this file are "s2idle", "shallow"
|
||||
and "deep". The "s2idle" string always represents :ref:`suspend-to-idle
|
||||
<s2idle>` and, by convention, "shallow" and "deep" represent
|
||||
:ref:`standby <standby>` and :ref:`suspend-to-RAM <s2ram>`,
|
||||
respectively.
|
||||
|
||||
Writing one of the listed strings into this file causes the system
|
||||
suspend variant represented by it to be associated with the "mem" string
|
||||
in the ``state`` file. The string representing the suspend variant
|
||||
currently associated with the "mem" string in the ``state`` file is
|
||||
shown in square brackets.
|
||||
|
||||
If the kernel does not support system suspend, this file is not present.
|
||||
|
||||
``disk``
|
||||
This file controls the operating mode of hibernation (Suspend-to-Disk).
|
||||
Specifically, it tells the kernel what to do after creating a
|
||||
hibernation image.
|
||||
|
||||
Reading from it returns a list of supported options encoded as:
|
||||
|
||||
``platform``
|
||||
Put the system into a special low-power state (e.g. ACPI S4) to
|
||||
make additional wakeup options available and possibly allow the
|
||||
platform firmware to take a simplified initialization path after
|
||||
wakeup.
|
||||
|
||||
It is only available if the platform provides a special
|
||||
mechanism to put the system to sleep after creating a
|
||||
hibernation image (platforms with ACPI do that as a rule, for
|
||||
example).
|
||||
|
||||
``shutdown``
|
||||
Power off the system.
|
||||
|
||||
``reboot``
|
||||
Reboot the system (useful for diagnostics mostly).
|
||||
|
||||
``suspend``
|
||||
Hybrid system suspend. Put the system into the suspend sleep
|
||||
state selected through the ``mem_sleep`` file described above.
|
||||
If the system is successfully woken up from that state, discard
|
||||
the hibernation image and continue. Otherwise, use the image
|
||||
to restore the previous state of the system.
|
||||
|
||||
It is available if system suspend is supported.
|
||||
|
||||
``test_resume``
|
||||
Diagnostic operation. Load the image as though the system had
|
||||
just woken up from hibernation and the currently running kernel
|
||||
instance was a restore kernel and follow up with full system
|
||||
resume.
|
||||
|
||||
Writing one of the strings listed above into this file causes the option
|
||||
represented by it to be selected.
|
||||
|
||||
The currently selected option is shown in square brackets, which means
|
||||
that the operation represented by it will be carried out after creating
|
||||
and saving the image when hibernation is triggered by writing ``disk``
|
||||
to :file:`/sys/power/state`.
|
||||
|
||||
If the kernel does not support hibernation, this file is not present.
|
||||
|
||||
``image_size``
|
||||
This file controls the size of hibernation images.
|
||||
|
||||
It can be written a string representing a non-negative integer that will
|
||||
be used as a best-effort upper limit of the image size, in bytes. The
|
||||
hibernation core will do its best to ensure that the image size will not
|
||||
exceed that number, but if that turns out to be impossible to achieve, a
|
||||
hibernation image will still be created and its size will be as small as
|
||||
possible. In particular, writing '0' to this file causes the size of
|
||||
hibernation images to be minimum.
|
||||
|
||||
Reading from it returns the current image size limit, which is set to
|
||||
around 2/5 of the available RAM size by default.
|
||||
|
||||
``pm_trace``
|
||||
This file controls the "PM trace" mechanism saving the last suspend
|
||||
or resume event point in the RTC memory across reboots. It helps to
|
||||
debug hard lockups or reboots due to device driver failures that occur
|
||||
during system suspend or resume (which is more common) more effectively.
|
||||
|
||||
If it contains "1", the fingerprint of each suspend/resume event point
|
||||
in turn will be stored in the RTC memory (overwriting the actual RTC
|
||||
information), so it will survive a system crash if one occurs right
|
||||
after storing it and it can be used later to identify the driver that
|
||||
caused the crash to happen.
|
||||
|
||||
It contains "0" by default, which may be changed to "1" by writing a
|
||||
string representing a nonzero integer into it.
|
||||
|
||||
According to the above, there are two ways to make the system go into the
|
||||
:ref:`suspend-to-idle <s2idle>` state. The first one is to write "freeze"
|
||||
directly to :file:`/sys/power/state`. The second one is to write "s2idle" to
|
||||
:file:`/sys/power/mem_sleep` and then to write "mem" to
|
||||
:file:`/sys/power/state`. Likewise, there are two ways to make the system go
|
||||
into the :ref:`standby <standby>` state (the strings to write to the control
|
||||
files in that case are "standby" or "shallow" and "mem", respectively) if that
|
||||
state is supported by the platform. However, there is only one way to make the
|
||||
system go into the :ref:`suspend-to-RAM <s2ram>` state (write "deep" into
|
||||
:file:`/sys/power/mem_sleep` and "mem" into :file:`/sys/power/state`).
|
||||
|
||||
The default suspend variant (ie. the one to be used without writing anything
|
||||
into :file:`/sys/power/mem_sleep`) is either "deep" (on the majority of systems
|
||||
supporting :ref:`suspend-to-RAM <s2ram>`) or "s2idle", but it can be overridden
|
||||
by the value of the ``mem_sleep_default`` parameter in the kernel command line.
|
||||
On some systems with ACPI, depending on the information in the ACPI tables, the
|
||||
default may be "s2idle" even if :ref:`suspend-to-RAM <s2ram>` is supported in
|
||||
principle.
|
56
Documentation/admin-guide/pm/strategies.rst
Normal file
56
Documentation/admin-guide/pm/strategies.rst
Normal file
|
@ -0,0 +1,56 @@
|
|||
.. SPDX-License-Identifier: GPL-2.0
|
||||
.. include:: <isonum.txt>
|
||||
|
||||
===========================
|
||||
Power Management Strategies
|
||||
===========================
|
||||
|
||||
:Copyright: |copy| 2017 Intel Corporation
|
||||
|
||||
:Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
||||
|
||||
|
||||
The Linux kernel supports two major high-level power management strategies.
|
||||
|
||||
One of them is based on using global low-power states of the whole system in
|
||||
which user space code cannot be executed and the overall system activity is
|
||||
significantly reduced, referred to as :doc:`sleep states <sleep-states>`. The
|
||||
kernel puts the system into one of these states when requested by user space
|
||||
and the system stays in it until a special signal is received from one of
|
||||
designated devices, triggering a transition to the ``working state`` in which
|
||||
user space code can run. Because sleep states are global and the whole system
|
||||
is affected by the state changes, this strategy is referred to as the
|
||||
:doc:`system-wide power management <system-wide>`.
|
||||
|
||||
The other strategy, referred to as the :doc:`working-state power management
|
||||
<working-state>`, is based on adjusting the power states of individual hardware
|
||||
components of the system, as needed, in the working state. In consequence, if
|
||||
this strategy is in use, the working state of the system usually does not
|
||||
correspond to any particular physical configuration of it, but can be treated as
|
||||
a metastate covering a range of different power states of the system in which
|
||||
the individual components of it can be either ``active`` (in use) or
|
||||
``inactive`` (idle). If they are active, they have to be in power states
|
||||
allowing them to process data and to be accessed by software. In turn, if they
|
||||
are inactive, ideally, they should be in low-power states in which they may not
|
||||
be accessible.
|
||||
|
||||
If all of the system components are active, the system as a whole is regarded as
|
||||
"runtime active" and that situation typically corresponds to the maximum power
|
||||
draw (or maximum energy usage) of it. If all of them are inactive, the system
|
||||
as a whole is regarded as "runtime idle" which may be very close to a sleep
|
||||
state from the physical system configuration and power draw perspective, but
|
||||
then it takes much less time and effort to start executing user space code than
|
||||
for the same system in a sleep state. However, transitions from sleep states
|
||||
back to the working state can only be started by a limited set of devices, so
|
||||
typically the system can spend much more time in a sleep state than it can be
|
||||
runtime idle in one go. For this reason, systems usually use less energy in
|
||||
sleep states than when they are runtime idle most of the time.
|
||||
|
||||
Moreover, the two power management strategies address different usage scenarios.
|
||||
Namely, if the user indicates that the system will not be in use going forward,
|
||||
for example by closing its lid (if the system is a laptop), it probably should
|
||||
go into a sleep state at that point. On the other hand, if the user simply goes
|
||||
away from the laptop keyboard, it probably should stay in the working state and
|
||||
use the working-state power management in case it becomes idle, because the user
|
||||
may come back to it at any time and then may want the system to be immediately
|
||||
accessible.
|
270
Documentation/admin-guide/pm/suspend-flows.rst
Normal file
270
Documentation/admin-guide/pm/suspend-flows.rst
Normal file
|
@ -0,0 +1,270 @@
|
|||
.. SPDX-License-Identifier: GPL-2.0
|
||||
.. include:: <isonum.txt>
|
||||
|
||||
=========================
|
||||
System Suspend Code Flows
|
||||
=========================
|
||||
|
||||
:Copyright: |copy| 2020 Intel Corporation
|
||||
|
||||
:Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
||||
|
||||
At least one global system-wide transition needs to be carried out for the
|
||||
system to get from the working state into one of the supported
|
||||
:doc:`sleep states <sleep-states>`. Hibernation requires more than one
|
||||
transition to occur for this purpose, but the other sleep states, commonly
|
||||
referred to as *system-wide suspend* (or simply *system suspend*) states, need
|
||||
only one.
|
||||
|
||||
For those sleep states, the transition from the working state of the system into
|
||||
the target sleep state is referred to as *system suspend* too (in the majority
|
||||
of cases, whether this means a transition or a sleep state of the system should
|
||||
be clear from the context) and the transition back from the sleep state into the
|
||||
working state is referred to as *system resume*.
|
||||
|
||||
The kernel code flows associated with the suspend and resume transitions for
|
||||
different sleep states of the system are quite similar, but there are some
|
||||
significant differences between the :ref:`suspend-to-idle <s2idle>` code flows
|
||||
and the code flows related to the :ref:`suspend-to-RAM <s2ram>` and
|
||||
:ref:`standby <standby>` sleep states.
|
||||
|
||||
The :ref:`suspend-to-RAM <s2ram>` and :ref:`standby <standby>` sleep states
|
||||
cannot be implemented without platform support and the difference between them
|
||||
boils down to the platform-specific actions carried out by the suspend and
|
||||
resume hooks that need to be provided by the platform driver to make them
|
||||
available. Apart from that, the suspend and resume code flows for these sleep
|
||||
states are mostly identical, so they both together will be referred to as
|
||||
*platform-dependent suspend* states in what follows.
|
||||
|
||||
|
||||
.. _s2idle_suspend:
|
||||
|
||||
Suspend-to-idle Suspend Code Flow
|
||||
=================================
|
||||
|
||||
The following steps are taken in order to transition the system from the working
|
||||
state to the :ref:`suspend-to-idle <s2idle>` sleep state:
|
||||
|
||||
1. Invoking system-wide suspend notifiers.
|
||||
|
||||
Kernel subsystems can register callbacks to be invoked when the suspend
|
||||
transition is about to occur and when the resume transition has finished.
|
||||
|
||||
That allows them to prepare for the change of the system state and to clean
|
||||
up after getting back to the working state.
|
||||
|
||||
2. Freezing tasks.
|
||||
|
||||
Tasks are frozen primarily in order to avoid unchecked hardware accesses
|
||||
from user space through MMIO regions or I/O registers exposed directly to
|
||||
it and to prevent user space from entering the kernel while the next step
|
||||
of the transition is in progress (which might have been problematic for
|
||||
various reasons).
|
||||
|
||||
All user space tasks are intercepted as though they were sent a signal and
|
||||
put into uninterruptible sleep until the end of the subsequent system resume
|
||||
transition.
|
||||
|
||||
The kernel threads that choose to be frozen during system suspend for
|
||||
specific reasons are frozen subsequently, but they are not intercepted.
|
||||
Instead, they are expected to periodically check whether or not they need
|
||||
to be frozen and to put themselves into uninterruptible sleep if so. [Note,
|
||||
however, that kernel threads can use locking and other concurrency controls
|
||||
available in kernel space to synchronize themselves with system suspend and
|
||||
resume, which can be much more precise than the freezing, so the latter is
|
||||
not a recommended option for kernel threads.]
|
||||
|
||||
3. Suspending devices and reconfiguring IRQs.
|
||||
|
||||
Devices are suspended in four phases called *prepare*, *suspend*,
|
||||
*late suspend* and *noirq suspend* (see :ref:`driverapi_pm_devices` for more
|
||||
information on what exactly happens in each phase).
|
||||
|
||||
Every device is visited in each phase, but typically it is not physically
|
||||
accessed in more than two of them.
|
||||
|
||||
The runtime PM API is disabled for every device during the *late* suspend
|
||||
phase and high-level ("action") interrupt handlers are prevented from being
|
||||
invoked before the *noirq* suspend phase.
|
||||
|
||||
Interrupts are still handled after that, but they are only acknowledged to
|
||||
interrupt controllers without performing any device-specific actions that
|
||||
would be triggered in the working state of the system (those actions are
|
||||
deferred till the subsequent system resume transition as described
|
||||
`below <s2idle_resume_>`_).
|
||||
|
||||
IRQs associated with system wakeup devices are "armed" so that the resume
|
||||
transition of the system is started when one of them signals an event.
|
||||
|
||||
4. Freezing the scheduler tick and suspending timekeeping.
|
||||
|
||||
When all devices have been suspended, CPUs enter the idle loop and are put
|
||||
into the deepest available idle state. While doing that, each of them
|
||||
"freezes" its own scheduler tick so that the timer events associated with
|
||||
the tick do not occur until the CPU is woken up by another interrupt source.
|
||||
|
||||
The last CPU to enter the idle state also stops the timekeeping which
|
||||
(among other things) prevents high resolution timers from triggering going
|
||||
forward until the first CPU that is woken up restarts the timekeeping.
|
||||
That allows the CPUs to stay in the deep idle state relatively long in one
|
||||
go.
|
||||
|
||||
From this point on, the CPUs can only be woken up by non-timer hardware
|
||||
interrupts. If that happens, they go back to the idle state unless the
|
||||
interrupt that woke up one of them comes from an IRQ that has been armed for
|
||||
system wakeup, in which case the system resume transition is started.
|
||||
|
||||
|
||||
.. _s2idle_resume:
|
||||
|
||||
Suspend-to-idle Resume Code Flow
|
||||
================================
|
||||
|
||||
The following steps are taken in order to transition the system from the
|
||||
:ref:`suspend-to-idle <s2idle>` sleep state into the working state:
|
||||
|
||||
1. Resuming timekeeping and unfreezing the scheduler tick.
|
||||
|
||||
When one of the CPUs is woken up (by a non-timer hardware interrupt), it
|
||||
leaves the idle state entered in the last step of the preceding suspend
|
||||
transition, restarts the timekeeping (unless it has been restarted already
|
||||
by another CPU that woke up earlier) and the scheduler tick on that CPU is
|
||||
unfrozen.
|
||||
|
||||
If the interrupt that has woken up the CPU was armed for system wakeup,
|
||||
the system resume transition begins.
|
||||
|
||||
2. Resuming devices and restoring the working-state configuration of IRQs.
|
||||
|
||||
Devices are resumed in four phases called *noirq resume*, *early resume*,
|
||||
*resume* and *complete* (see :ref:`driverapi_pm_devices` for more
|
||||
information on what exactly happens in each phase).
|
||||
|
||||
Every device is visited in each phase, but typically it is not physically
|
||||
accessed in more than two of them.
|
||||
|
||||
The working-state configuration of IRQs is restored after the *noirq* resume
|
||||
phase and the runtime PM API is re-enabled for every device whose driver
|
||||
supports it during the *early* resume phase.
|
||||
|
||||
3. Thawing tasks.
|
||||
|
||||
Tasks frozen in step 2 of the preceding `suspend <s2idle_suspend_>`_
|
||||
transition are "thawed", which means that they are woken up from the
|
||||
uninterruptible sleep that they went into at that time and user space tasks
|
||||
are allowed to exit the kernel.
|
||||
|
||||
4. Invoking system-wide resume notifiers.
|
||||
|
||||
This is analogous to step 1 of the `suspend <s2idle_suspend_>`_ transition
|
||||
and the same set of callbacks is invoked at this point, but a different
|
||||
"notification type" parameter value is passed to them.
|
||||
|
||||
|
||||
Platform-dependent Suspend Code Flow
|
||||
====================================
|
||||
|
||||
The following steps are taken in order to transition the system from the working
|
||||
state to platform-dependent suspend state:
|
||||
|
||||
1. Invoking system-wide suspend notifiers.
|
||||
|
||||
This step is the same as step 1 of the suspend-to-idle suspend transition
|
||||
described `above <s2idle_suspend_>`_.
|
||||
|
||||
2. Freezing tasks.
|
||||
|
||||
This step is the same as step 2 of the suspend-to-idle suspend transition
|
||||
described `above <s2idle_suspend_>`_.
|
||||
|
||||
3. Suspending devices and reconfiguring IRQs.
|
||||
|
||||
This step is analogous to step 3 of the suspend-to-idle suspend transition
|
||||
described `above <s2idle_suspend_>`_, but the arming of IRQs for system
|
||||
wakeup generally does not have any effect on the platform.
|
||||
|
||||
There are platforms that can go into a very deep low-power state internally
|
||||
when all CPUs in them are in sufficiently deep idle states and all I/O
|
||||
devices have been put into low-power states. On those platforms,
|
||||
suspend-to-idle can reduce system power very effectively.
|
||||
|
||||
On the other platforms, however, low-level components (like interrupt
|
||||
controllers) need to be turned off in a platform-specific way (implemented
|
||||
in the hooks provided by the platform driver) to achieve comparable power
|
||||
reduction.
|
||||
|
||||
That usually prevents in-band hardware interrupts from waking up the system,
|
||||
which must be done in a special platform-dependent way. Then, the
|
||||
configuration of system wakeup sources usually starts when system wakeup
|
||||
devices are suspended and is finalized by the platform suspend hooks later
|
||||
on.
|
||||
|
||||
4. Disabling non-boot CPUs.
|
||||
|
||||
On some platforms the suspend hooks mentioned above must run in a one-CPU
|
||||
configuration of the system (in particular, the hardware cannot be accessed
|
||||
by any code running in parallel with the platform suspend hooks that may,
|
||||
and often do, trap into the platform firmware in order to finalize the
|
||||
suspend transition).
|
||||
|
||||
For this reason, the CPU offline/online (CPU hotplug) framework is used
|
||||
to take all of the CPUs in the system, except for one (the boot CPU),
|
||||
offline (typically, the CPUs that have been taken offline go into deep idle
|
||||
states).
|
||||
|
||||
This means that all tasks are migrated away from those CPUs and all IRQs are
|
||||
rerouted to the only CPU that remains online.
|
||||
|
||||
5. Suspending core system components.
|
||||
|
||||
This prepares the core system components for (possibly) losing power going
|
||||
forward and suspends the timekeeping.
|
||||
|
||||
6. Platform-specific power removal.
|
||||
|
||||
This is expected to remove power from all of the system components except
|
||||
for the memory controller and RAM (in order to preserve the contents of the
|
||||
latter) and some devices designated for system wakeup.
|
||||
|
||||
In many cases control is passed to the platform firmware which is expected
|
||||
to finalize the suspend transition as needed.
|
||||
|
||||
|
||||
Platform-dependent Resume Code Flow
|
||||
===================================
|
||||
|
||||
The following steps are taken in order to transition the system from a
|
||||
platform-dependent suspend state into the working state:
|
||||
|
||||
1. Platform-specific system wakeup.
|
||||
|
||||
The platform is woken up by a signal from one of the designated system
|
||||
wakeup devices (which need not be an in-band hardware interrupt) and
|
||||
control is passed back to the kernel (the working configuration of the
|
||||
platform may need to be restored by the platform firmware before the
|
||||
kernel gets control again).
|
||||
|
||||
2. Resuming core system components.
|
||||
|
||||
The suspend-time configuration of the core system components is restored and
|
||||
the timekeeping is resumed.
|
||||
|
||||
3. Re-enabling non-boot CPUs.
|
||||
|
||||
The CPUs disabled in step 4 of the preceding suspend transition are taken
|
||||
back online and their suspend-time configuration is restored.
|
||||
|
||||
4. Resuming devices and restoring the working-state configuration of IRQs.
|
||||
|
||||
This step is the same as step 2 of the suspend-to-idle suspend transition
|
||||
described `above <s2idle_resume_>`_.
|
||||
|
||||
5. Thawing tasks.
|
||||
|
||||
This step is the same as step 3 of the suspend-to-idle suspend transition
|
||||
described `above <s2idle_resume_>`_.
|
||||
|
||||
6. Invoking system-wide resume notifiers.
|
||||
|
||||
This step is the same as step 4 of the suspend-to-idle suspend transition
|
||||
described `above <s2idle_resume_>`_.
|
11
Documentation/admin-guide/pm/system-wide.rst
Normal file
11
Documentation/admin-guide/pm/system-wide.rst
Normal file
|
@ -0,0 +1,11 @@
|
|||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
============================
|
||||
System-Wide Power Management
|
||||
============================
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 2
|
||||
|
||||
sleep-states
|
||||
suspend-flows
|
18
Documentation/admin-guide/pm/working-state.rst
Normal file
18
Documentation/admin-guide/pm/working-state.rst
Normal file
|
@ -0,0 +1,18 @@
|
|||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
==============================
|
||||
Working-State Power Management
|
||||
==============================
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 2
|
||||
|
||||
cpuidle
|
||||
intel_idle
|
||||
cpufreq
|
||||
intel_pstate
|
||||
amd-pstate
|
||||
cpufreq_drivers
|
||||
intel_epb
|
||||
intel-speed-select
|
||||
intel_uncore_frequency_scaling
|
Loading…
Add table
Add a link
Reference in a new issue