Adding upstream version 4.19.249.upstream/4.19.249

Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
author: Daniel Baumann <daniel.baumann@progress-linux.org> 2024-05-06 01:02:30 +0000
committer: Daniel Baumann <daniel.baumann@progress-linux.org> 2024-05-06 01:02:30 +0000
commit: 76cb841cb886eef6b3bee341a2266c76578724ad (patch)
tree: f5892e5ba6cc11949952a6ce4ecbe6d516d6ce58 /Documentation/admin-guide/pm
parent: Initial commit. (diff)
download: linux-76cb841cb886eef6b3bee341a2266c76578724ad.tar.xz
linux-76cb841cb886eef6b3bee341a2266c76578724ad.zip
7 files changed, 1743 insertions, 0 deletions
diff --git a/Documentation/admin-guide/pm/cpufreq.rst b/Documentation/admin-guide/pm/cpufreq.rst
new file mode 100644
index 000000000..47153e64d
--- /dev/null
+++ b/Documentation/admin-guide/pm/cpufreq.rst
@@ -0,0 +1,701 @@
+.. |struct cpufreq_policy| replace:: :c:type:`struct cpufreq_policy <cpufreq_policy>`
+.. |intel_pstate| replace:: :doc:`intel_pstate <intel_pstate>`
+
+=======================
+CPU Performance Scaling
+=======================
+
+::
+
+ Copyright (c) 2017 Intel Corp., Rafael J. Wysocki <rafael.j.wysocki@intel.com>
+
+The Concept of CPU Performance Scaling
+======================================
+
+The majority of modern processors are capable of operating in a number of
+different clock frequency and voltage configurations, often referred to as
+Operating Performance Points or P-states (in ACPI terminology).  As a rule,
+the higher the clock frequency and the higher the voltage, the more instructions
+can be retired by the CPU over a unit of time, but also the higher the clock
+frequency and the higher the voltage, the more energy is consumed over a unit of
+time (or the more power is drawn) by the CPU in the given P-state.  Therefore
+there is a natural tradeoff between the CPU capacity (the number of instructions
+that can be executed over a unit of time) and the power drawn by the CPU.
+
+In some situations it is desirable or even necessary to run the program as fast
+as possible and then there is no reason to use any P-states different from the
+highest one (i.e. the highest-performance frequency/voltage configuration
+available).  In some other cases, however, it may not be necessary to execute
+instructions so quickly and maintaining the highest available CPU capacity for a
+relatively long time without utilizing it entirely may be regarded as wasteful.
+It also may not be physically possible to maintain maximum CPU capacity for too
+long for thermal or power supply capacity reasons or similar.  To cover those
+cases, there are hardware interfaces allowing CPUs to be switched between
+different frequency/voltage configurations or (in the ACPI terminology) to be
+put into different P-states.
+
+Typically, they are used along with algorithms to estimate the required CPU
+capacity, so as to decide which P-states to put the CPUs into.  Of course, since
+the utilization of the system generally changes over time, that has to be done
+repeatedly on a regular basis.  The activity by which this happens is referred
+to as CPU performance scaling or CPU frequency scaling (because it involves
+adjusting the CPU clock frequency).
+
+
+CPU Performance Scaling in Linux
+================================
+
+The Linux kernel supports CPU performance scaling by means of the ``CPUFreq``
+(CPU Frequency scaling) subsystem that consists of three layers of code: the
+core, scaling governors and scaling drivers.
+
+The ``CPUFreq`` core provides the common code infrastructure and user space
+interfaces for all platforms that support CPU performance scaling.  It defines
+the basic framework in which the other components operate.
+
+Scaling governors implement algorithms to estimate the required CPU capacity.
+As a rule, each governor implements one, possibly parametrized, scaling
+algorithm.
+
+Scaling drivers talk to the hardware.  They provide scaling governors with
+information on the available P-states (or P-state ranges in some cases) and
+access platform-specific hardware interfaces to change CPU P-states as requested
+by scaling governors.
+
+In principle, all available scaling governors can be used with every scaling
+driver.  That design is based on the observation that the information used by
+performance scaling algorithms for P-state selection can be represented in a
+platform-independent form in the majority of cases, so it should be possible
+to use the same performance scaling algorithm implemented in exactly the same
+way regardless of which scaling driver is used.  Consequently, the same set of
+scaling governors should be suitable for every supported platform.
+
+However, that observation may not hold for performance scaling algorithms
+based on information provided by the hardware itself, for example through
+feedback registers, as that information is typically specific to the hardware
+interface it comes from and may not be easily represented in an abstract,
+platform-independent way.  For this reason, ``CPUFreq`` allows scaling drivers
+to bypass the governor layer and implement their own performance scaling
+algorithms.  That is done by the |intel_pstate| scaling driver.
+
+
+``CPUFreq`` Policy Objects
+==========================
+
+In some cases the hardware interface for P-state control is shared by multiple
+CPUs.  That is, for example, the same register (or set of registers) is used to
+control the P-state of multiple CPUs at the same time and writing to it affects
+all of those CPUs simultaneously.
+
+Sets of CPUs sharing hardware P-state control interfaces are represented by
+``CPUFreq`` as |struct cpufreq_policy| objects.  For consistency,
+|struct cpufreq_policy| is also used when there is only one CPU in the given
+set.
+
+The ``CPUFreq`` core maintains a pointer to a |struct cpufreq_policy| object for
+every CPU in the system, including CPUs that are currently offline.  If multiple
+CPUs share the same hardware P-state control interface, all of the pointers
+corresponding to them point to the same |struct cpufreq_policy| object.
+
+``CPUFreq`` uses |struct cpufreq_policy| as its basic data type and the design
+of its user space interface is based on the policy concept.
+
+
+CPU Initialization
+==================
+
+First of all, a scaling driver has to be registered for ``CPUFreq`` to work.
+It is only possible to register one scaling driver at a time, so the scaling
+driver is expected to be able to handle all CPUs in the system.
+
+The scaling driver may be registered before or after CPU registration.  If
+CPUs are registered earlier, the driver core invokes the ``CPUFreq`` core to
+take a note of all of the already registered CPUs during the registration of the
+scaling driver.  In turn, if any CPUs are registered after the registration of
+the scaling driver, the ``CPUFreq`` core will be invoked to take note of them
+at their registration time.
+
+In any case, the ``CPUFreq`` core is invoked to take note of any logical CPU it
+has not seen so far as soon as it is ready to handle that CPU.  [Note that the
+logical CPU may be a physical single-core processor, or a single core in a
+multicore processor, or a hardware thread in a physical processor or processor
+core.  In what follows "CPU" always means "logical CPU" unless explicitly stated
+otherwise and the word "processor" is used to refer to the physical part
+possibly including multiple logical CPUs.]
+
+Once invoked, the ``CPUFreq`` core checks if the policy pointer is already set
+for the given CPU and if so, it skips the policy object creation.  Otherwise,
+a new policy object is created and initialized, which involves the creation of
+a new policy directory in ``sysfs``, and the policy pointer corresponding to
+the given CPU is set to the new policy object's address in memory.
+
+Next, the scaling driver's ``->init()`` callback is invoked with the policy
+pointer of the new CPU passed to it as the argument.  That callback is expected
+to initialize the performance scaling hardware interface for the given CPU (or,
+more precisely, for the set of CPUs sharing the hardware interface it belongs
+to, represented by its policy object) and, if the policy object it has been
+called for is new, to set parameters of the policy, like the minimum and maximum
+frequencies supported by the hardware, the table of available frequencies (if
+the set of supported P-states is not a continuous range), and the mask of CPUs
+that belong to the same policy (including both online and offline CPUs).  That
+mask is then used by the core to populate the policy pointers for all of the
+CPUs in it.
+
+The next major initialization step for a new policy object is to attach a
+scaling governor to it (to begin with, that is the default scaling governor
+determined by the kernel configuration, but it may be changed later
+via ``sysfs``).  First, a pointer to the new policy object is passed to the
+governor's ``->init()`` callback which is expected to initialize all of the
+data structures necessary to handle the given policy and, possibly, to add
+a governor ``sysfs`` interface to it.  Next, the governor is started by
+invoking its ``->start()`` callback.
+
+That callback it expected to register per-CPU utilization update callbacks for
+all of the online CPUs belonging to the given policy with the CPU scheduler.
+The utilization update callbacks will be invoked by the CPU scheduler on
+important events, like task enqueue and dequeue, on every iteration of the
+scheduler tick or generally whenever the CPU utilization may change (from the
+scheduler's perspective).  They are expected to carry out computations needed
+to determine the P-state to use for the given policy going forward and to
+invoke the scaling driver to make changes to the hardware in accordance with
+the P-state selection.  The scaling driver may be invoked directly from
+scheduler context or asynchronously, via a kernel thread or workqueue, depending
+on the configuration and capabilities of the scaling driver and the governor.
+
+Similar steps are taken for policy objects that are not new, but were "inactive"
+previously, meaning that all of the CPUs belonging to them were offline.  The
+only practical difference in that case is that the ``CPUFreq`` core will attempt
+to use the scaling governor previously used with the policy that became
+"inactive" (and is re-initialized now) instead of the default governor.
+
+In turn, if a previously offline CPU is being brought back online, but some
+other CPUs sharing the policy object with it are online already, there is no
+need to re-initialize the policy object at all.  In that case, it only is
+necessary to restart the scaling governor so that it can take the new online CPU
+into account.  That is achieved by invoking the governor's ``->stop`` and
+``->start()`` callbacks, in this order, for the entire policy.
+
+As mentioned before, the |intel_pstate| scaling driver bypasses the scaling
+governor layer of ``CPUFreq`` and provides its own P-state selection algorithms.
+Consequently, if |intel_pstate| is used, scaling governors are not attached to
+new policy objects.  Instead, the driver's ``->setpolicy()`` callback is invoked
+to register per-CPU utilization update callbacks for each policy.  These
+callbacks are invoked by the CPU scheduler in the same way as for scaling
+governors, but in the |intel_pstate| case they both determine the P-state to
+use and change the hardware configuration accordingly in one go from scheduler
+context.
+
+The policy objects created during CPU initialization and other data structures
+associated with them are torn down when the scaling driver is unregistered
+(which happens when the kernel module containing it is unloaded, for example) or
+when the last CPU belonging to the given policy in unregistered.
+
+
+Policy Interface in ``sysfs``
+=============================
+
+During the initialization of the kernel, the ``CPUFreq`` core creates a
+``sysfs`` directory (kobject) called ``cpufreq`` under
+:file:`/sys/devices/system/cpu/`.
+
+That directory contains a ``policyX`` subdirectory (where ``X`` represents an
+integer number) for every policy object maintained by the ``CPUFreq`` core.
+Each ``policyX`` directory is pointed to by ``cpufreq`` symbolic links
+under :file:`/sys/devices/system/cpu/cpuY/` (where ``Y`` represents an integer
+that may be different from the one represented by ``X``) for all of the CPUs
+associated with (or belonging to) the given policy.  The ``policyX`` directories
+in :file:`/sys/devices/system/cpu/cpufreq` each contain policy-specific
+attributes (files) to control ``CPUFreq`` behavior for the corresponding policy
+objects (that is, for all of the CPUs associated with them).
+
+Some of those attributes are generic.  They are created by the ``CPUFreq`` core
+and their behavior generally does not depend on what scaling driver is in use
+and what scaling governor is attached to the given policy.  Some scaling drivers
+also add driver-specific attributes to the policy directories in ``sysfs`` to
+control policy-specific aspects of driver behavior.
+
+The generic attributes under :file:`/sys/devices/system/cpu/cpufreq/policyX/`
+are the following:
+
+``affected_cpus``
+	List of online CPUs belonging to this policy (i.e. sharing the hardware
+	performance scaling interface represented by the ``policyX`` policy
+	object).
+
+``bios_limit``
+	If the platform firmware (BIOS) tells the OS to apply an upper limit to
+	CPU frequencies, that limit will be reported through this attribute (if
+	present).
+
+	The existence of the limit may be a result of some (often unintentional)
+	BIOS settings, restrictions coming from a service processor or another
+	BIOS/HW-based mechanisms.
+
+	This does not cover ACPI thermal limitations which can be discovered
+	through a generic thermal driver.
+
+	This attribute is not present if the scaling driver in use does not
+	support it.
+
+``cpuinfo_cur_freq``
+	Current frequency of the CPUs belonging to this policy as obtained from
+	the hardware (in KHz).
+
+	This is expected to be the frequency the hardware actually runs at.
+	If that frequency cannot be determined, this attribute should not
+	be present.
+
+``cpuinfo_max_freq``
+	Maximum possible operating frequency the CPUs belonging to this policy
+	can run at (in kHz).
+
+``cpuinfo_min_freq``
+	Minimum possible operating frequency the CPUs belonging to this policy
+	can run at (in kHz).
+
+``cpuinfo_transition_latency``
+	The time it takes to switch the CPUs belonging to this policy from one
+	P-state to another, in nanoseconds.
+
+	If unknown or if known to be so high that the scaling driver does not
+	work with the `ondemand`_ governor, -1 (:c:macro:`CPUFREQ_ETERNAL`)
+	will be returned by reads from this attribute.
+
+``related_cpus``
+	List of all (online and offline) CPUs belonging to this policy.
+
+``scaling_available_governors``
+	List of ``CPUFreq`` scaling governors present in the kernel that can
+	be attached to this policy or (if the |intel_pstate| scaling driver is
+	in use) list of scaling algorithms provided by the driver that can be
+	applied to this policy.
+
+	[Note that some governors are modular and it may be necessary to load a
+	kernel module for the governor held by it to become available and be
+	listed by this attribute.]
+
+``scaling_cur_freq``
+	Current frequency of all of the CPUs belonging to this policy (in kHz).
+
+	In the majority of cases, this is the frequency of the last P-state
+	requested by the scaling driver from the hardware using the scaling
+	interface provided by it, which may or may not reflect the frequency
+	the CPU is actually running at (due to hardware design and other
+	limitations).
+
+	Some architectures (e.g. ``x86``) may attempt to provide information
+	more precisely reflecting the current CPU frequency through this
+	attribute, but that still may not be the exact current CPU frequency as
+	seen by the hardware at the moment.
+
+``scaling_driver``
+	The scaling driver currently in use.
+
+``scaling_governor``
+	The scaling governor currently attached to this policy or (if the
+	|intel_pstate| scaling driver is in use) the scaling algorithm
+	provided by the driver that is currently applied to this policy.
+
+	This attribute is read-write and writing to it will cause a new scaling
+	governor to be attached to this policy or a new scaling algorithm
+	provided by the scaling driver to be applied to it (in the
+	|intel_pstate| case), as indicated by the string written to this
+	attribute (which must be one of the names listed by the
+	``scaling_available_governors`` attribute described above).
+
+``scaling_max_freq``
+	Maximum frequency the CPUs belonging to this policy are allowed to be
+	running at (in kHz).
+
+	This attribute is read-write and writing a string representing an
+	integer to it will cause a new limit to be set (it must not be lower
+	than the value of the ``scaling_min_freq`` attribute).
+
+``scaling_min_freq``
+	Minimum frequency the CPUs belonging to this policy are allowed to be
+	running at (in kHz).
+
+	This attribute is read-write and writing a string representing a
+	non-negative integer to it will cause a new limit to be set (it must not
+	be higher than the value of the ``scaling_max_freq`` attribute).
+
+``scaling_setspeed``
+	This attribute is functional only if the `userspace`_ scaling governor
+	is attached to the given policy.
+
+	It returns the last frequency requested by the governor (in kHz) or can
+	be written to in order to set a new frequency for the policy.
+
+
+Generic Scaling Governors
+=========================
+
+``CPUFreq`` provides generic scaling governors that can be used with all
+scaling drivers.  As stated before, each of them implements a single, possibly
+parametrized, performance scaling algorithm.
+
+Scaling governors are attached to policy objects and different policy objects
+can be handled by different scaling governors at the same time (although that
+may lead to suboptimal results in some cases).
+
+The scaling governor for a given policy object can be changed at any time with
+the help of the ``scaling_governor`` policy attribute in ``sysfs``.
+
+Some governors expose ``sysfs`` attributes to control or fine-tune the scaling
+algorithms implemented by them.  Those attributes, referred to as governor
+tunables, can be either global (system-wide) or per-policy, depending on the
+scaling driver in use.  If the driver requires governor tunables to be
+per-policy, they are located in a subdirectory of each policy directory.
+Otherwise, they are located in a subdirectory under
+:file:`/sys/devices/system/cpu/cpufreq/`.  In either case the name of the
+subdirectory containing the governor tunables is the name of the governor
+providing them.
+
+``performance``
+---------------
+
+When attached to a policy object, this governor causes the highest frequency,
+within the ``scaling_max_freq`` policy limit, to be requested for that policy.
+
+The request is made once at that time the governor for the policy is set to
+``performance`` and whenever the ``scaling_max_freq`` or ``scaling_min_freq``
+policy limits change after that.
+
+``powersave``
+-------------
+
+When attached to a policy object, this governor causes the lowest frequency,
+within the ``scaling_min_freq`` policy limit, to be requested for that policy.
+
+The request is made once at that time the governor for the policy is set to
+``powersave`` and whenever the ``scaling_max_freq`` or ``scaling_min_freq``
+policy limits change after that.
+
+``userspace``
+-------------
+
+This governor does not do anything by itself.  Instead, it allows user space
+to set the CPU frequency for the policy it is attached to by writing to the
+``scaling_setspeed`` attribute of that policy.
+
+``schedutil``
+-------------
+
+This governor uses CPU utilization data available from the CPU scheduler.  It
+generally is regarded as a part of the CPU scheduler, so it can access the
+scheduler's internal data structures directly.
+
+It runs entirely in scheduler context, although in some cases it may need to
+invoke the scaling driver asynchronously when it decides that the CPU frequency
+should be changed for a given policy (that depends on whether or not the driver
+is capable of changing the CPU frequency from scheduler context).
+
+The actions of this governor for a particular CPU depend on the scheduling class
+invoking its utilization update callback for that CPU.  If it is invoked by the
+RT or deadline scheduling classes, the governor will increase the frequency to
+the allowed maximum (that is, the ``scaling_max_freq`` policy limit).  In turn,
+if it is invoked by the CFS scheduling class, the governor will use the
+Per-Entity Load Tracking (PELT) metric for the root control group of the
+given CPU as the CPU utilization estimate (see the `Per-entity load tracking`_
+LWN.net article for a description of the PELT mechanism).  Then, the new
+CPU frequency to apply is computed in accordance with the formula
+
+	f = 1.25 * ``f_0`` * ``util`` / ``max``
+
+where ``util`` is the PELT number, ``max`` is the theoretical maximum of
+``util``, and ``f_0`` is either the maximum possible CPU frequency for the given
+policy (if the PELT number is frequency-invariant), or the current CPU frequency
+(otherwise).
+
+This governor also employs a mechanism allowing it to temporarily bump up the
+CPU frequency for tasks that have been waiting on I/O most recently, called
+"IO-wait boosting".  That happens when the :c:macro:`SCHED_CPUFREQ_IOWAIT` flag
+is passed by the scheduler to the governor callback which causes the frequency
+to go up to the allowed maximum immediately and then draw back to the value
+returned by the above formula over time.
+
+This governor exposes only one tunable:
+
+``rate_limit_us``
+	Minimum time (in microseconds) that has to pass between two consecutive
+	runs of governor computations (default: 1000 times the scaling driver's
+	transition latency).
+
+	The purpose of this tunable is to reduce the scheduler context overhead
+	of the governor which might be excessive without it.
+
+This governor generally is regarded as a replacement for the older `ondemand`_
+and `conservative`_ governors (described below), as it is simpler and more
+tightly integrated with the CPU scheduler, its overhead in terms of CPU context
+switches and similar is less significant, and it uses the scheduler's own CPU
+utilization metric, so in principle its decisions should not contradict the
+decisions made by the other parts of the scheduler.
+
+``ondemand``
+------------
+
+This governor uses CPU load as a CPU frequency selection metric.
+
+In order to estimate the current CPU load, it measures the time elapsed between
+consecutive invocations of its worker routine and computes the fraction of that
+time in which the given CPU was not idle.  The ratio of the non-idle (active)
+time to the total CPU time is taken as an estimate of the load.
+
+If this governor is attached to a policy shared by multiple CPUs, the load is
+estimated for all of them and the greatest result is taken as the load estimate
+for the entire policy.
+
+The worker routine of this governor has to run in process context, so it is
+invoked asynchronously (via a workqueue) and CPU P-states are updated from
+there if necessary.  As a result, the scheduler context overhead from this
+governor is minimum, but it causes additional CPU context switches to happen
+relatively often and the CPU P-state updates triggered by it can be relatively
+irregular.  Also, it affects its own CPU load metric by running code that
+reduces the CPU idle time (even though the CPU idle time is only reduced very
+slightly by it).
+
+It generally selects CPU frequencies proportional to the estimated load, so that
+the value of the ``cpuinfo_max_freq`` policy attribute corresponds to the load of
+1 (or 100%), and the value of the ``cpuinfo_min_freq`` policy attribute
+corresponds to the load of 0, unless when the load exceeds a (configurable)
+speedup threshold, in which case it will go straight for the highest frequency
+it is allowed to use (the ``scaling_max_freq`` policy limit).
+
+This governor exposes the following tunables:
+
+``sampling_rate``
+	This is how often the governor's worker routine should run, in
+	microseconds.
+
+	Typically, it is set to values of the order of 10000 (10 ms).  Its
+	default value is equal to the value of ``cpuinfo_transition_latency``
+	for each policy this governor is attached to (but since the unit here
+	is greater by 1000, this means that the time represented by
+	``sampling_rate`` is 1000 times greater than the transition latency by
+	default).
+
+	If this tunable is per-policy, the following shell command sets the time
+	represented by it to be 750 times as high as the transition latency::
+
+	# echo `$(($(cat cpuinfo_transition_latency) * 750 / 1000)) > ondemand/sampling_rate
+
+``up_threshold``
+	If the estimated CPU load is above this value (in percent), the governor
+	will set the frequency to the maximum value allowed for the policy.
+	Otherwise, the selected frequency will be proportional to the estimated
+	CPU load.
+
+``ignore_nice_load``
+	If set to 1 (default 0), it will cause the CPU load estimation code to
+	treat the CPU time spent on executing tasks with "nice" levels greater
+	than 0 as CPU idle time.
+
+	This may be useful if there are tasks in the system that should not be
+	taken into account when deciding what frequency to run the CPUs at.
+	Then, to make that happen it is sufficient to increase the "nice" level
+	of those tasks above 0 and set this attribute to 1.
+
+``sampling_down_factor``
+	Temporary multiplier, between 1 (default) and 100 inclusive, to apply to
+	the ``sampling_rate`` value if the CPU load goes above ``up_threshold``.
+
+	This causes the next execution of the governor's worker routine (after
+	setting the frequency to the allowed maximum) to be delayed, so the
+	frequency stays at the maximum level for a longer time.
+
+	Frequency fluctuations in some bursty workloads may be avoided this way
+	at the cost of additional energy spent on maintaining the maximum CPU
+	capacity.
+
+``powersave_bias``
+	Reduction factor to apply to the original frequency target of the
+	governor (including the maximum value used when the ``up_threshold``
+	value is exceeded by the estimated CPU load) or sensitivity threshold
+	for the AMD frequency sensitivity powersave bias driver
+	(:file:`drivers/cpufreq/amd_freq_sensitivity.c`), between 0 and 1000
+	inclusive.
+
+	If the AMD frequency sensitivity powersave bias driver is not loaded,
+	the effective frequency to apply is given by
+
+		f * (1 - ``powersave_bias`` / 1000)
+
+	where f is the governor's original frequency target.  The default value
+	of this attribute is 0 in that case.
+
+	If the AMD frequency sensitivity powersave bias driver is loaded, the
+	value of this attribute is 400 by default and it is used in a different
+	way.
+
+	On Family 16h (and later) AMD processors there is a mechanism to get a
+	measured workload sensitivity, between 0 and 100% inclusive, from the
+	hardware.  That value can be used to estimate how the performance of the
+	workload running on a CPU will change in response to frequency changes.
+
+	The performance of a workload with the sensitivity of 0 (memory-bound or
+	IO-bound) is not expected to increase at all as a result of increasing
+	the CPU frequency, whereas workloads with the sensitivity of 100%
+	(CPU-bound) are expected to perform much better if the CPU frequency is
+	increased.
+
+	If the workload sensitivity is less than the threshold represented by
+	the ``powersave_bias`` value, the sensitivity powersave bias driver
+	will cause the governor to select a frequency lower than its original
+	target, so as to avoid over-provisioning workloads that will not benefit
+	from running at higher CPU frequencies.
+
+``conservative``
+----------------
+
+This governor uses CPU load as a CPU frequency selection metric.
+
+It estimates the CPU load in the same way as the `ondemand`_ governor described
+above, but the CPU frequency selection algorithm implemented by it is different.
+
+Namely, it avoids changing the frequency significantly over short time intervals
+which may not be suitable for systems with limited power supply capacity (e.g.
+battery-powered).  To achieve that, it changes the frequency in relatively
+small steps, one step at a time, up or down - depending on whether or not a
+(configurable) threshold has been exceeded by the estimated CPU load.
+
+This governor exposes the following tunables:
+
+``freq_step``
+	Frequency step in percent of the maximum frequency the governor is
+	allowed to set (the ``scaling_max_freq`` policy limit), between 0 and
+	100 (5 by default).
+
+	This is how much the frequency is allowed to change in one go.  Setting
+	it to 0 will cause the default frequency step (5 percent) to be used
+	and setting it to 100 effectively causes the governor to periodically
+	switch the frequency between the ``scaling_min_freq`` and
+	``scaling_max_freq`` policy limits.
+
+``down_threshold``
+	Threshold value (in percent, 20 by default) used to determine the
+	frequency change direction.
+
+	If the estimated CPU load is greater than this value, the frequency will
+	go up (by ``freq_step``).  If the load is less than this value (and the
+	``sampling_down_factor`` mechanism is not in effect), the frequency will
+	go down.  Otherwise, the frequency will not be changed.
+
+``sampling_down_factor``
+	Frequency decrease deferral factor, between 1 (default) and 10
+	inclusive.
+
+	It effectively causes the frequency to go down ``sampling_down_factor``
+	times slower than it ramps up.
+
+
+Frequency Boost Support
+=======================
+
+Background
+----------
+
+Some processors support a mechanism to raise the operating frequency of some
+cores in a multicore package temporarily (and above the sustainable frequency
+threshold for the whole package) under certain conditions, for example if the
+whole chip is not fully utilized and below its intended thermal or power budget.
+
+Different names are used by different vendors to refer to this functionality.
+For Intel processors it is referred to as "Turbo Boost", AMD calls it
+"Turbo-Core" or (in technical documentation) "Core Performance Boost" and so on.
+As a rule, it also is implemented differently by different vendors.  The simple
+term "frequency boost" is used here for brevity to refer to all of those
+implementations.
+
+The frequency boost mechanism may be either hardware-based or software-based.
+If it is hardware-based (e.g. on x86), the decision to trigger the boosting is
+made by the hardware (although in general it requires the hardware to be put
+into a special state in which it can control the CPU frequency within certain
+limits).  If it is software-based (e.g. on ARM), the scaling driver decides
+whether or not to trigger boosting and when to do that.
+
+The ``boost`` File in ``sysfs``
+-------------------------------
+
+This file is located under :file:`/sys/devices/system/cpu/cpufreq/` and controls
+the "boost" setting for the whole system.  It is not present if the underlying
+scaling driver does not support the frequency boost mechanism (or supports it,
+but provides a driver-specific interface for controlling it, like
+|intel_pstate|).
+
+If the value in this file is 1, the frequency boost mechanism is enabled.  This
+means that either the hardware can be put into states in which it is able to
+trigger boosting (in the hardware-based case), or the software is allowed to
+trigger boosting (in the software-based case).  It does not mean that boosting
+is actually in use at the moment on any CPUs in the system.  It only means a
+permission to use the frequency boost mechanism (which still may never be used
+for other reasons).
+
+If the value in this file is 0, the frequency boost mechanism is disabled and
+cannot be used at all.
+
+The only values that can be written to this file are 0 and 1.
+
+Rationale for Boost Control Knob
+--------------------------------
+
+The frequency boost mechanism is generally intended to help to achieve optimum
+CPU performance on time scales below software resolution (e.g. below the
+scheduler tick interval) and it is demonstrably suitable for many workloads, but
+it may lead to problems in certain situations.
+
+For this reason, many systems make it possible to disable the frequency boost
+mechanism in the platform firmware (BIOS) setup, but that requires the system to
+be restarted for the setting to be adjusted as desired, which may not be
+practical at least in some cases.  For example:
+
+  1. Boosting means overclocking the processor, although under controlled
+     conditions.  Generally, the processor's energy consumption increases
+     as a result of increasing its frequency and voltage, even temporarily.
+     That may not be desirable on systems that switch to power sources of
+     limited capacity, such as batteries, so the ability to disable the boost
+     mechanism while the system is running may help there (but that depends on
+     the workload too).
+
+  2. In some situations deterministic behavior is more important than
+     performance or energy consumption (or both) and the ability to disable
+     boosting while the system is running may be useful then.
+
+  3. To examine the impact of the frequency boost mechanism itself, it is useful
+     to be able to run tests with and without boosting, preferably without
+     restarting the system in the meantime.
+
+  4. Reproducible results are important when running benchmarks.  Since
+     the boosting functionality depends on the load of the whole package,
+     single-thread performance may vary because of it which may lead to
+     unreproducible results sometimes.  That can be avoided by disabling the
+     frequency boost mechanism before running benchmarks sensitive to that
+     issue.
+
+Legacy AMD ``cpb`` Knob
+-----------------------
+
+The AMD powernow-k8 scaling driver supports a ``sysfs`` knob very similar to
+the global ``boost`` one.  It is used for disabling/enabling the "Core
+Performance Boost" feature of some AMD processors.
+
+If present, that knob is located in every ``CPUFreq`` policy directory in
+``sysfs`` (:file:`/sys/devices/system/cpu/cpufreq/policyX/`) and is called
+``cpb``, which indicates a more fine grained control interface.  The actual
+implementation, however, works on the system-wide basis and setting that knob
+for one policy causes the same value of it to be set for all of the other
+policies at the same time.
+
+That knob is still supported on AMD processors that support its underlying
+hardware feature, but it may be configured out of the kernel (via the
+:c:macro:`CONFIG_X86_ACPI_CPUFREQ_CPB` configuration option) and the global
+``boost`` knob is present regardless.  Thus it is always possible use the
+``boost`` knob instead of the ``cpb`` one which is highly recommended, as that
+is more consistent with what all of the other systems do (and the ``cpb`` knob
+may not be supported any more in the future).
+
+The ``cpb`` knob is never present for any processors without the underlying
+hardware feature (e.g. all Intel ones), even if the
+:c:macro:`CONFIG_X86_ACPI_CPUFREQ_CPB` configuration option is set.
+
+
+.. _Per-entity load tracking: https://lwn.net/Articles/531853/
diff --git a/Documentation/admin-guide/pm/index.rst b/Documentation/admin-guide/pm/index.rst
new file mode 100644
index 000000000..49237ac73
--- /dev/null
+++ b/Documentation/admin-guide/pm/index.rst
@@ -0,0 +1,10 @@
+================
+Power Management
+================
+
+.. toctree::
+   :maxdepth: 2
+
+   strategies
+   system-wide
+   working-state
diff --git a/Documentation/admin-guide/pm/intel_pstate.rst b/Documentation/admin-guide/pm/intel_pstate.rst
new file mode 100644
index 000000000..8f1d3de44
--- /dev/null
+++ b/Documentation/admin-guide/pm/intel_pstate.rst
@@ -0,0 +1,718 @@
+===============================================
+``intel_pstate`` CPU Performance Scaling Driver
+===============================================
+
+::
+
+ Copyright (c) 2017 Intel Corp., Rafael J. Wysocki <rafael.j.wysocki@intel.com>
+
+
+General Information
+===================
+
+``intel_pstate`` is a part of the
+:doc:`CPU performance scaling subsystem <cpufreq>` in the Linux kernel
+(``CPUFreq``).  It is a scaling driver for the Sandy Bridge and later
+generations of Intel processors.  Note, however, that some of those processors
+may not be supported.  [To understand ``intel_pstate`` it is necessary to know
+how ``CPUFreq`` works in general, so this is the time to read :doc:`cpufreq` if
+you have not done that yet.]
+
+For the processors supported by ``intel_pstate``, the P-state concept is broader
+than just an operating frequency or an operating performance point (see the
+`LinuxCon Europe 2015 presentation by Kristen Accardi <LCEU2015_>`_ for more
+information about that).  For this reason, the representation of P-states used
+by ``intel_pstate`` internally follows the hardware specification (for details
+refer to `Intel® 64 and IA-32 Architectures Software Developer’s Manual
+Volume 3: System Programming Guide <SDM_>`_).  However, the ``CPUFreq`` core
+uses frequencies for identifying operating performance points of CPUs and
+frequencies are involved in the user space interface exposed by it, so
+``intel_pstate`` maps its internal representation of P-states to frequencies too
+(fortunately, that mapping is unambiguous).  At the same time, it would not be
+practical for ``intel_pstate`` to supply the ``CPUFreq`` core with a table of
+available frequencies due to the possible size of it, so the driver does not do
+that.  Some functionality of the core is limited by that.
+
+Since the hardware P-state selection interface used by ``intel_pstate`` is
+available at the logical CPU level, the driver always works with individual
+CPUs.  Consequently, if ``intel_pstate`` is in use, every ``CPUFreq`` policy
+object corresponds to one logical CPU and ``CPUFreq`` policies are effectively
+equivalent to CPUs.  In particular, this means that they become "inactive" every
+time the corresponding CPU is taken offline and need to be re-initialized when
+it goes back online.
+
+``intel_pstate`` is not modular, so it cannot be unloaded, which means that the
+only way to pass early-configuration-time parameters to it is via the kernel
+command line.  However, its configuration can be adjusted via ``sysfs`` to a
+great extent.  In some configurations it even is possible to unregister it via
+``sysfs`` which allows another ``CPUFreq`` scaling driver to be loaded and
+registered (see `below <status_attr_>`_).
+
+
+Operation Modes
+===============
+
+``intel_pstate`` can operate in three different modes: in the active mode with
+or without hardware-managed P-states support and in the passive mode.  Which of
+them will be in effect depends on what kernel command line options are used and
+on the capabilities of the processor.
+
+Active Mode
+-----------
+
+This is the default operation mode of ``intel_pstate``.  If it works in this
+mode, the ``scaling_driver`` policy attribute in ``sysfs`` for all ``CPUFreq``
+policies contains the string "intel_pstate".
+
+In this mode the driver bypasses the scaling governors layer of ``CPUFreq`` and
+provides its own scaling algorithms for P-state selection.  Those algorithms
+can be applied to ``CPUFreq`` policies in the same way as generic scaling
+governors (that is, through the ``scaling_governor`` policy attribute in
+``sysfs``).  [Note that different P-state selection algorithms may be chosen for
+different policies, but that is not recommended.]
+
+They are not generic scaling governors, but their names are the same as the
+names of some of those governors.  Moreover, confusingly enough, they generally
+do not work in the same way as the generic governors they share the names with.
+For example, the ``powersave`` P-state selection algorithm provided by
+``intel_pstate`` is not a counterpart of the generic ``powersave`` governor
+(roughly, it corresponds to the ``schedutil`` and ``ondemand`` governors).
+
+There are two P-state selection algorithms provided by ``intel_pstate`` in the
+active mode: ``powersave`` and ``performance``.  The way they both operate
+depends on whether or not the hardware-managed P-states (HWP) feature has been
+enabled in the processor and possibly on the processor model.
+
+Which of the P-state selection algorithms is used by default depends on the
+:c:macro:`CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE` kernel configuration option.
+Namely, if that option is set, the ``performance`` algorithm will be used by
+default, and the other one will be used by default if it is not set.
+
+Active Mode With HWP
+~~~~~~~~~~~~~~~~~~~~
+
+If the processor supports the HWP feature, it will be enabled during the
+processor initialization and cannot be disabled after that.  It is possible
+to avoid enabling it by passing the ``intel_pstate=no_hwp`` argument to the
+kernel in the command line.
+
+If the HWP feature has been enabled, ``intel_pstate`` relies on the processor to
+select P-states by itself, but still it can give hints to the processor's
+internal P-state selection logic.  What those hints are depends on which P-state
+selection algorithm has been applied to the given policy (or to the CPU it
+corresponds to).
+
+Even though the P-state selection is carried out by the processor automatically,
+``intel_pstate`` registers utilization update callbacks with the CPU scheduler
+in this mode.  However, they are not used for running a P-state selection
+algorithm, but for periodic updates of the current CPU frequency information to
+be made available from the ``scaling_cur_freq`` policy attribute in ``sysfs``.
+
+HWP + ``performance``
+.....................
+
+In this configuration ``intel_pstate`` will write 0 to the processor's
+Energy-Performance Preference (EPP) knob (if supported) or its
+Energy-Performance Bias (EPB) knob (otherwise), which means that the processor's
+internal P-state selection logic is expected to focus entirely on performance.
+
+This will override the EPP/EPB setting coming from the ``sysfs`` interface
+(see `Energy vs Performance Hints`_ below).
+
+Also, in this configuration the range of P-states available to the processor's
+internal P-state selection logic is always restricted to the upper boundary
+(that is, the maximum P-state that the driver is allowed to use).
+
+HWP + ``powersave``
+...................
+
+In this configuration ``intel_pstate`` will set the processor's
+Energy-Performance Preference (EPP) knob (if supported) or its
+Energy-Performance Bias (EPB) knob (otherwise) to whatever value it was
+previously set to via ``sysfs`` (or whatever default value it was
+set to by the platform firmware).  This usually causes the processor's
+internal P-state selection logic to be less performance-focused.
+
+Active Mode Without HWP
+~~~~~~~~~~~~~~~~~~~~~~~
+
+This is the default operation mode for processors that do not support the HWP
+feature.  It also is used by default with the ``intel_pstate=no_hwp`` argument
+in the kernel command line.  However, in this mode ``intel_pstate`` may refuse
+to work with the given processor if it does not recognize it.  [Note that
+``intel_pstate`` will never refuse to work with any processor with the HWP
+feature enabled.]
+
+In this mode ``intel_pstate`` registers utilization update callbacks with the
+CPU scheduler in order to run a P-state selection algorithm, either
+``powersave`` or ``performance``, depending on the ``scaling_governor`` policy
+setting in ``sysfs``.  The current CPU frequency information to be made
+available from the ``scaling_cur_freq`` policy attribute in ``sysfs`` is
+periodically updated by those utilization update callbacks too.
+
+``performance``
+...............
+
+Without HWP, this P-state selection algorithm is always the same regardless of
+the processor model and platform configuration.
+
+It selects the maximum P-state it is allowed to use, subject to limits set via
+``sysfs``, every time the driver configuration for the given CPU is updated
+(e.g. via ``sysfs``).
+
+This is the default P-state selection algorithm if the
+:c:macro:`CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE` kernel configuration option
+is set.
+
+``powersave``
+.............
+
+Without HWP, this P-state selection algorithm is similar to the algorithm
+implemented by the generic ``schedutil`` scaling governor except that the
+utilization metric used by it is based on numbers coming from feedback
+registers of the CPU.  It generally selects P-states proportional to the
+current CPU utilization.
+
+This algorithm is run by the driver's utilization update callback for the
+given CPU when it is invoked by the CPU scheduler, but not more often than
+every 10 ms.  Like in the ``performance`` case, the hardware configuration
+is not touched if the new P-state turns out to be the same as the current
+one.
+
+This is the default P-state selection algorithm if the
+:c:macro:`CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE` kernel configuration option
+is not set.
+
+Passive Mode
+------------
+
+This mode is used if the ``intel_pstate=passive`` argument is passed to the
+kernel in the command line (it implies the ``intel_pstate=no_hwp`` setting too).
+Like in the active mode without HWP support, in this mode ``intel_pstate`` may
+refuse to work with the given processor if it does not recognize it.
+
+If the driver works in this mode, the ``scaling_driver`` policy attribute in
+``sysfs`` for all ``CPUFreq`` policies contains the string "intel_cpufreq".
+Then, the driver behaves like a regular ``CPUFreq`` scaling driver.  That is,
+it is invoked by generic scaling governors when necessary to talk to the
+hardware in order to change the P-state of a CPU (in particular, the
+``schedutil`` governor can invoke it directly from scheduler context).
+
+While in this mode, ``intel_pstate`` can be used with all of the (generic)
+scaling governors listed by the ``scaling_available_governors`` policy attribute
+in ``sysfs`` (and the P-state selection algorithms described above are not
+used).  Then, it is responsible for the configuration of policy objects
+corresponding to CPUs and provides the ``CPUFreq`` core (and the scaling
+governors attached to the policy objects) with accurate information on the
+maximum and minimum operating frequencies supported by the hardware (including
+the so-called "turbo" frequency ranges).  In other words, in the passive mode
+the entire range of available P-states is exposed by ``intel_pstate`` to the
+``CPUFreq`` core.  However, in this mode the driver does not register
+utilization update callbacks with the CPU scheduler and the ``scaling_cur_freq``
+information comes from the ``CPUFreq`` core (and is the last frequency selected
+by the current scaling governor for the given policy).
+
+
+.. _turbo:
+
+Turbo P-states Support
+======================
+
+In the majority of cases, the entire range of P-states available to
+``intel_pstate`` can be divided into two sub-ranges that correspond to
+different types of processor behavior, above and below a boundary that
+will be referred to as the "turbo threshold" in what follows.
+
+The P-states above the turbo threshold are referred to as "turbo P-states" and
+the whole sub-range of P-states they belong to is referred to as the "turbo
+range".  These names are related to the Turbo Boost technology allowing a
+multicore processor to opportunistically increase the P-state of one or more
+cores if there is enough power to do that and if that is not going to cause the
+thermal envelope of the processor package to be exceeded.
+
+Specifically, if software sets the P-state of a CPU core within the turbo range
+(that is, above the turbo threshold), the processor is permitted to take over
+performance scaling control for that core and put it into turbo P-states of its
+choice going forward.  However, that permission is interpreted differently by
+different processor generations.  Namely, the Sandy Bridge generation of
+processors will never use any P-states above the last one set by software for
+the given core, even if it is within the turbo range, whereas all of the later
+processor generations will take it as a license to use any P-states from the
+turbo range, even above the one set by software.  In other words, on those
+processors setting any P-state from the turbo range will enable the processor
+to put the given core into all turbo P-states up to and including the maximum
+supported one as it sees fit.
+
+One important property of turbo P-states is that they are not sustainable.  More
+precisely, there is no guarantee that any CPUs will be able to stay in any of
+those states indefinitely, because the power distribution within the processor
+package may change over time  or the thermal envelope it was designed for might
+be exceeded if a turbo P-state was used for too long.
+
+In turn, the P-states below the turbo threshold generally are sustainable.  In
+fact, if one of them is set by software, the processor is not expected to change
+it to a lower one unless in a thermal stress or a power limit violation
+situation (a higher P-state may still be used if it is set for another CPU in
+the same package at the same time, for example).
+
+Some processors allow multiple cores to be in turbo P-states at the same time,
+but the maximum P-state that can be set for them generally depends on the number
+of cores running concurrently.  The maximum turbo P-state that can be set for 3
+cores at the same time usually is lower than the analogous maximum P-state for
+2 cores, which in turn usually is lower than the maximum turbo P-state that can
+be set for 1 core.  The one-core maximum turbo P-state is thus the maximum
+supported one overall.
+
+The maximum supported turbo P-state, the turbo threshold (the maximum supported
+non-turbo P-state) and the minimum supported P-state are specific to the
+processor model and can be determined by reading the processor's model-specific
+registers (MSRs).  Moreover, some processors support the Configurable TDP
+(Thermal Design Power) feature and, when that feature is enabled, the turbo
+threshold effectively becomes a configurable value that can be set by the
+platform firmware.
+
+Unlike ``_PSS`` objects in the ACPI tables, ``intel_pstate`` always exposes
+the entire range of available P-states, including the whole turbo range, to the
+``CPUFreq`` core and (in the passive mode) to generic scaling governors.  This
+generally causes turbo P-states to be set more often when ``intel_pstate`` is
+used relative to ACPI-based CPU performance scaling (see `below <acpi-cpufreq_>`_
+for more information).
+
+Moreover, since ``intel_pstate`` always knows what the real turbo threshold is
+(even if the Configurable TDP feature is enabled in the processor), its
+``no_turbo`` attribute in ``sysfs`` (described `below <no_turbo_attr_>`_) should
+work as expected in all cases (that is, if set to disable turbo P-states, it
+always should prevent ``intel_pstate`` from using them).
+
+
+Processor Support
+=================
+
+To handle a given processor ``intel_pstate`` requires a number of different
+pieces of information on it to be known, including:
+
+ * The minimum supported P-state.
+
+ * The maximum supported `non-turbo P-state <turbo_>`_.
+
+ * Whether or not turbo P-states are supported at all.
+
+ * The maximum supported `one-core turbo P-state <turbo_>`_ (if turbo P-states
+   are supported).
+
+ * The scaling formula to translate the driver's internal representation
+   of P-states into frequencies and the other way around.
+
+Generally, ways to obtain that information are specific to the processor model
+or family.  Although it often is possible to obtain all of it from the processor
+itself (using model-specific registers), there are cases in which hardware
+manuals need to be consulted to get to it too.
+
+For this reason, there is a list of supported processors in ``intel_pstate`` and
+the driver initialization will fail if the detected processor is not in that
+list, unless it supports the `HWP feature <Active Mode_>`_.  [The interface to
+obtain all of the information listed above is the same for all of the processors
+supporting the HWP feature, which is why they all are supported by
+``intel_pstate``.]
+
+
+User Space Interface in ``sysfs``
+=================================
+
+Global Attributes
+-----------------
+
+``intel_pstate`` exposes several global attributes (files) in ``sysfs`` to
+control its functionality at the system level.  They are located in the
+``/sys/devices/system/cpu/intel_pstate/`` directory and affect all CPUs.
+
+Some of them are not present if the ``intel_pstate=per_cpu_perf_limits``
+argument is passed to the kernel in the command line.
+
+``max_perf_pct``
+	Maximum P-state the driver is allowed to set in percent of the
+	maximum supported performance level (the highest supported `turbo
+	P-state <turbo_>`_).
+
+	This attribute will not be exposed if the
+	``intel_pstate=per_cpu_perf_limits`` argument is present in the kernel
+	command line.
+
+``min_perf_pct``
+	Minimum P-state the driver is allowed to set in percent of the
+	maximum supported performance level (the highest supported `turbo
+	P-state <turbo_>`_).
+
+	This attribute will not be exposed if the
+	``intel_pstate=per_cpu_perf_limits`` argument is present in the kernel
+	command line.
+
+``num_pstates``
+	Number of P-states supported by the processor (between 0 and 255
+	inclusive) including both turbo and non-turbo P-states (see
+	`Turbo P-states Support`_).
+
+	The value of this attribute is not affected by the ``no_turbo``
+	setting described `below <no_turbo_attr_>`_.
+
+	This attribute is read-only.
+
+``turbo_pct``
+	Ratio of the `turbo range <turbo_>`_ size to the size of the entire
+	range of supported P-states, in percent.
+
+	This attribute is read-only.
+
+.. _no_turbo_attr:
+
+``no_turbo``
+	If set (equal to 1), the driver is not allowed to set any turbo P-states
+	(see `Turbo P-states Support`_).  If unset (equalt to 0, which is the
+	default), turbo P-states can be set by the driver.
+	[Note that ``intel_pstate`` does not support the general ``boost``
+	attribute (supported by some other scaling drivers) which is replaced
+	by this one.]
+
+	This attrubute does not affect the maximum supported frequency value
+	supplied to the ``CPUFreq`` core and exposed via the policy interface,
+	but it affects the maximum possible value of per-policy P-state	limits
+	(see `Interpretation of Policy Attributes`_ below for details).
+
+``hwp_dynamic_boost``
+	This attribute is only present if ``intel_pstate`` works in the
+	`active mode with the HWP feature enabled <Active Mode With HWP_>`_ in
+	the processor.  If set (equal to 1), it causes the minimum P-state limit
+	to be increased dynamically for a short time whenever a task previously
+	waiting on I/O is selected to run on a given logical CPU (the purpose
+	of this mechanism is to improve performance).
+
+	This setting has no effect on logical CPUs whose minimum P-state limit
+	is directly set to the highest non-turbo P-state or above it.
+
+.. _status_attr:
+
+``status``
+	Operation mode of the driver: "active", "passive" or "off".
+
+	"active"
+		The driver is functional and in the `active mode
+		<Active Mode_>`_.
+
+	"passive"
+		The driver is functional and in the `passive mode
+		<Passive Mode_>`_.
+
+	"off"
+		The driver is not functional (it is not registered as a scaling
+		driver with the ``CPUFreq`` core).
+
+	This attribute can be written to in order to change the driver's
+	operation mode or to unregister it.  The string written to it must be
+	one of the possible values of it and, if successful, the write will
+	cause the driver to switch over to the operation mode represented by
+	that string - or to be unregistered in the "off" case.  [Actually,
+	switching over from the active mode to the passive mode or the other
+	way around causes the driver to be unregistered and registered again
+	with a different set of callbacks, so all of its settings (the global
+	as well as the per-policy ones) are then reset to their default
+	values, possibly depending on the target operation mode.]
+
+	That only is supported in some configurations, though (for example, if
+	the `HWP feature is enabled in the processor <Active Mode With HWP_>`_,
+	the operation mode of the driver cannot be changed), and if it is not
+	supported in the current configuration, writes to this attribute will
+	fail with an appropriate error.
+
+Interpretation of Policy Attributes
+-----------------------------------
+
+The interpretation of some ``CPUFreq`` policy attributes described in
+:doc:`cpufreq` is special with ``intel_pstate`` as the current scaling driver
+and it generally depends on the driver's `operation mode <Operation Modes_>`_.
+
+First of all, the values of the ``cpuinfo_max_freq``, ``cpuinfo_min_freq`` and
+``scaling_cur_freq`` attributes are produced by applying a processor-specific
+multiplier to the internal P-state representation used by ``intel_pstate``.
+Also, the values of the ``scaling_max_freq`` and ``scaling_min_freq``
+attributes are capped by the frequency corresponding to the maximum P-state that
+the driver is allowed to set.
+
+If the ``no_turbo`` `global attribute <no_turbo_attr_>`_ is set, the driver is
+not allowed to use turbo P-states, so the maximum value of ``scaling_max_freq``
+and ``scaling_min_freq`` is limited to the maximum non-turbo P-state frequency.
+Accordingly, setting ``no_turbo`` causes ``scaling_max_freq`` and
+``scaling_min_freq`` to go down to that value if they were above it before.
+However, the old values of ``scaling_max_freq`` and ``scaling_min_freq`` will be
+restored after unsetting ``no_turbo``, unless these attributes have been written
+to after ``no_turbo`` was set.
+
+If ``no_turbo`` is not set, the maximum possible value of ``scaling_max_freq``
+and ``scaling_min_freq`` corresponds to the maximum supported turbo P-state,
+which also is the value of ``cpuinfo_max_freq`` in either case.
+
+Next, the following policy attributes have special meaning if
+``intel_pstate`` works in the `active mode <Active Mode_>`_:
+
+``scaling_available_governors``
+	List of P-state selection algorithms provided by ``intel_pstate``.
+
+``scaling_governor``
+	P-state selection algorithm provided by ``intel_pstate`` currently in
+	use with the given policy.
+
+``scaling_cur_freq``
+	Frequency of the average P-state of the CPU represented by the given
+	policy for the time interval between the last two invocations of the
+	driver's utilization update callback by the CPU scheduler for that CPU.
+
+The meaning of these attributes in the `passive mode <Passive Mode_>`_ is the
+same as for other scaling drivers.
+
+Additionally, the value of the ``scaling_driver`` attribute for ``intel_pstate``
+depends on the operation mode of the driver.  Namely, it is either
+"intel_pstate" (in the `active mode <Active Mode_>`_) or "intel_cpufreq" (in the
+`passive mode <Passive Mode_>`_).
+
+Coordination of P-State Limits
+------------------------------
+
+``intel_pstate`` allows P-state limits to be set in two ways: with the help of
+the ``max_perf_pct`` and ``min_perf_pct`` `global attributes
+<Global Attributes_>`_ or via the ``scaling_max_freq`` and ``scaling_min_freq``
+``CPUFreq`` policy attributes.  The coordination between those limits is based
+on the following rules, regardless of the current operation mode of the driver:
+
+ 1. All CPUs are affected by the global limits (that is, none of them can be
+    requested to run faster than the global maximum and none of them can be
+    requested to run slower than the global minimum).
+
+ 2. Each individual CPU is affected by its own per-policy limits (that is, it
+    cannot be requested to run faster than its own per-policy maximum and it
+    cannot be requested to run slower than its own per-policy minimum).
+
+ 3. The global and per-policy limits can be set independently.
+
+If the `HWP feature is enabled in the processor <Active Mode With HWP_>`_, the
+resulting effective values are written into its registers whenever the limits
+change in order to request its internal P-state selection logic to always set
+P-states within these limits.  Otherwise, the limits are taken into account by
+scaling governors (in the `passive mode <Passive Mode_>`_) and by the driver
+every time before setting a new P-state for a CPU.
+
+Additionally, if the ``intel_pstate=per_cpu_perf_limits`` command line argument
+is passed to the kernel, ``max_perf_pct`` and ``min_perf_pct`` are not exposed
+at all and the only way to set the limits is by using the policy attributes.
+
+
+Energy vs Performance Hints
+---------------------------
+
+If ``intel_pstate`` works in the `active mode with the HWP feature enabled
+<Active Mode With HWP_>`_ in the processor, additional attributes are present
+in every ``CPUFreq`` policy directory in ``sysfs``.  They are intended to allow
+user space to help ``intel_pstate`` to adjust the processor's internal P-state
+selection logic by focusing it on performance or on energy-efficiency, or
+somewhere between the two extremes:
+
+``energy_performance_preference``
+	Current value of the energy vs performance hint for the given policy
+	(or the CPU represented by it).
+
+	The hint can be changed by writing to this attribute.
+
+``energy_performance_available_preferences``
+	List of strings that can be written to the
+	``energy_performance_preference`` attribute.
+
+	They represent different energy vs performance hints and should be
+	self-explanatory, except that ``default`` represents whatever hint
+	value was set by the platform firmware.
+
+Strings written to the ``energy_performance_preference`` attribute are
+internally translated to integer values written to the processor's
+Energy-Performance Preference (EPP) knob (if supported) or its
+Energy-Performance Bias (EPB) knob.
+
+[Note that tasks may by migrated from one CPU to another by the scheduler's
+load-balancing algorithm and if different energy vs performance hints are
+set for those CPUs, that may lead to undesirable outcomes.  To avoid such
+issues it is better to set the same energy vs performance hint for all CPUs
+or to pin every task potentially sensitive to them to a specific CPU.]
+
+.. _acpi-cpufreq:
+
+``intel_pstate`` vs ``acpi-cpufreq``
+====================================
+
+On the majority of systems supported by ``intel_pstate``, the ACPI tables
+provided by the platform firmware contain ``_PSS`` objects returning information
+that can be used for CPU performance scaling (refer to the `ACPI specification`_
+for details on the ``_PSS`` objects and the format of the information returned
+by them).
+
+The information returned by the ACPI ``_PSS`` objects is used by the
+``acpi-cpufreq`` scaling driver.  On systems supported by ``intel_pstate``
+the ``acpi-cpufreq`` driver uses the same hardware CPU performance scaling
+interface, but the set of P-states it can use is limited by the ``_PSS``
+output.
+
+On those systems each ``_PSS`` object returns a list of P-states supported by
+the corresponding CPU which basically is a subset of the P-states range that can
+be used by ``intel_pstate`` on the same system, with one exception: the whole
+`turbo range <turbo_>`_ is represented by one item in it (the topmost one).  By
+convention, the frequency returned by ``_PSS`` for that item is greater by 1 MHz
+than the frequency of the highest non-turbo P-state listed by it, but the
+corresponding P-state representation (following the hardware specification)
+returned for it matches the maximum supported turbo P-state (or is the
+special value 255 meaning essentially "go as high as you can get").
+
+The list of P-states returned by ``_PSS`` is reflected by the table of
+available frequencies supplied by ``acpi-cpufreq`` to the ``CPUFreq`` core and
+scaling governors and the minimum and maximum supported frequencies reported by
+it come from that list as well.  In particular, given the special representation
+of the turbo range described above, this means that the maximum supported
+frequency reported by ``acpi-cpufreq`` is higher by 1 MHz than the frequency
+of the highest supported non-turbo P-state listed by ``_PSS`` which, of course,
+affects decisions made by the scaling governors, except for ``powersave`` and
+``performance``.
+
+For example, if a given governor attempts to select a frequency proportional to
+estimated CPU load and maps the load of 100% to the maximum supported frequency
+(possibly multiplied by a constant), then it will tend to choose P-states below
+the turbo threshold if ``acpi-cpufreq`` is used as the scaling driver, because
+in that case the turbo range corresponds to a small fraction of the frequency
+band it can use (1 MHz vs 1 GHz or more).  In consequence, it will only go to
+the turbo range for the highest loads and the other loads above 50% that might
+benefit from running at turbo frequencies will be given non-turbo P-states
+instead.
+
+One more issue related to that may appear on systems supporting the
+`Configurable TDP feature <turbo_>`_ allowing the platform firmware to set the
+turbo threshold.  Namely, if that is not coordinated with the lists of P-states
+returned by ``_PSS`` properly, there may be more than one item corresponding to
+a turbo P-state in those lists and there may be a problem with avoiding the
+turbo range (if desirable or necessary).  Usually, to avoid using turbo
+P-states overall, ``acpi-cpufreq`` simply avoids using the topmost state listed
+by ``_PSS``, but that is not sufficient when there are other turbo P-states in
+the list returned by it.
+
+Apart from the above, ``acpi-cpufreq`` works like ``intel_pstate`` in the
+`passive mode <Passive Mode_>`_, except that the number of P-states it can set
+is limited to the ones listed by the ACPI ``_PSS`` objects.
+
+
+Kernel Command Line Options for ``intel_pstate``
+================================================
+
+Several kernel command line options can be used to pass early-configuration-time
+parameters to ``intel_pstate`` in order to enforce specific behavior of it.  All
+of them have to be prepended with the ``intel_pstate=`` prefix.
+
+``disable``
+	Do not register ``intel_pstate`` as the scaling driver even if the
+	processor is supported by it.
+
+``passive``
+	Register ``intel_pstate`` in the `passive mode <Passive Mode_>`_ to
+	start with.
+
+	This option implies the ``no_hwp`` one described below.
+
+``force``
+	Register ``intel_pstate`` as the scaling driver instead of
+	``acpi-cpufreq`` even if the latter is preferred on the given system.
+
+	This may prevent some platform features (such as thermal controls and
+	power capping) that rely on the availability of ACPI P-states
+	information from functioning as expected, so it should be used with
+	caution.
+
+	This option does not work with processors that are not supported by
+	``intel_pstate`` and on platforms where the ``pcc-cpufreq`` scaling
+	driver is used instead of ``acpi-cpufreq``.
+
+``no_hwp``
+	Do not enable the `hardware-managed P-states (HWP) feature
+	<Active Mode With HWP_>`_ even if it is supported by the processor.
+
+``hwp_only``
+	Register ``intel_pstate`` as the scaling driver only if the
+	`hardware-managed P-states (HWP) feature <Active Mode With HWP_>`_ is
+	supported by the processor.
+
+``support_acpi_ppc``
+	Take ACPI ``_PPC`` performance limits into account.
+
+	If the preferred power management profile in the FADT (Fixed ACPI
+	Description Table) is set to "Enterprise Server" or "Performance
+	Server", the ACPI ``_PPC`` limits are taken into account by default
+	and this option has no effect.
+
+``per_cpu_perf_limits``
+	Use per-logical-CPU P-State limits (see `Coordination of P-state
+	Limits`_ for details).
+
+
+Diagnostics and Tuning
+======================
+
+Trace Events
+------------
+
+There are two static trace events that can be used for ``intel_pstate``
+diagnostics.  One of them is the ``cpu_frequency`` trace event generally used
+by ``CPUFreq``, and the other one is the ``pstate_sample`` trace event specific
+to ``intel_pstate``.  Both of them are triggered by ``intel_pstate`` only if
+it works in the `active mode <Active Mode_>`_.
+
+The following sequence of shell commands can be used to enable them and see
+their output (if the kernel is generally configured to support event tracing)::
+
+ # cd /sys/kernel/debug/tracing/
+ # echo 1 > events/power/pstate_sample/enable
+ # echo 1 > events/power/cpu_frequency/enable
+ # cat trace
+ gnome-terminal--4510  [001] ..s.  1177.680733: pstate_sample: core_busy=107 scaled=94 from=26 to=26 mperf=1143818 aperf=1230607 tsc=29838618 freq=2474476
+ cat-5235  [002] ..s.  1177.681723: cpu_frequency: state=2900000 cpu_id=2
+
+If ``intel_pstate`` works in the `passive mode <Passive Mode_>`_, the
+``cpu_frequency`` trace event will be triggered either by the ``schedutil``
+scaling governor (for the policies it is attached to), or by the ``CPUFreq``
+core (for the policies with other scaling governors).
+
+``ftrace``
+----------
+
+The ``ftrace`` interface can be used for low-level diagnostics of
+``intel_pstate``.  For example, to check how often the function to set a
+P-state is called, the ``ftrace`` filter can be set to to
+:c:func:`intel_pstate_set_pstate`::
+
+ # cd /sys/kernel/debug/tracing/
+ # cat available_filter_functions | grep -i pstate
+ intel_pstate_set_pstate
+ intel_pstate_cpu_init
+ ...
+ # echo intel_pstate_set_pstate > set_ftrace_filter
+ # echo function > current_tracer
+ # cat trace | head -15
+ # tracer: function
+ #
+ # entries-in-buffer/entries-written: 80/80   #P:4
+ #
+ #                              _-----=> irqs-off
+ #                             / _----=> need-resched
+ #                            | / _---=> hardirq/softirq
+ #                            || / _--=> preempt-depth
+ #                            ||| /     delay
+ #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
+ #              | |       |   ||||       |         |
+             Xorg-3129  [000] ..s.  2537.644844: intel_pstate_set_pstate <-intel_pstate_timer_func
+  gnome-terminal--4510  [002] ..s.  2537.649844: intel_pstate_set_pstate <-intel_pstate_timer_func
+      gnome-shell-3409  [001] ..s.  2537.650850: intel_pstate_set_pstate <-intel_pstate_timer_func
+           <idle>-0     [000] ..s.  2537.654843: intel_pstate_set_pstate <-intel_pstate_timer_func
+
+
+.. _LCEU2015: http://events.linuxfoundation.org/sites/events/files/slides/LinuxConEurope_2015.pdf
+.. _SDM: http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-system-programming-manual-325384.html
+.. _ACPI specification: http://www.uefi.org/sites/default/files/resources/ACPI_6_1.pdf
diff --git a/Documentation/admin-guide/pm/sleep-states.rst b/Documentation/admin-guide/pm/sleep-states.rst
new file mode 100644
index 000000000..dbf5acd49
--- /dev/null
+++ b/Documentation/admin-guide/pm/sleep-states.rst
@@ -0,0 +1,245 @@
+===================
+System Sleep States
+===================
+
+::
+
+ Copyright (c) 2017 Intel Corp., Rafael J. Wysocki <rafael.j.wysocki@intel.com>
+
+Sleep states are global low-power states of the entire system in which user
+space code cannot be executed and the overall system activity is significantly
+reduced.
+
+
+Sleep States That Can Be Supported
+==================================
+
+Depending on its configuration and the capabilities of the platform it runs on,
+the Linux kernel can support up to four system sleep states, including
+hibernation and up to three variants of system suspend.  The sleep states that
+can be supported by the kernel are listed below.
+
+.. _s2idle:
+
+Suspend-to-Idle
+---------------
+
+This is a generic, pure software, light-weight variant of system suspend (also
+referred to as S2I or S2Idle).  It allows more energy to be saved relative to
+runtime idle by freezing user space, suspending the timekeeping and putting all
+I/O devices into low-power states (possibly lower-power than available in the
+working state), such that the processors can spend time in their deepest idle
+states while the system is suspended.
+
+The system is woken up from this state by in-band interrupts, so theoretically
+any devices that can cause interrupts to be generated in the working state can
+also be set up as wakeup devices for S2Idle.
+
+This state can be used on platforms without support for :ref:`standby <standby>`
+or :ref:`suspend-to-RAM <s2ram>`, or it can be used in addition to any of the
+deeper system suspend variants to provide reduced resume latency.  It is always
+supported if the :c:macro:`CONFIG_SUSPEND` kernel configuration option is set.
+
+.. _standby:
+
+Standby
+-------
+
+This state, if supported, offers moderate, but real, energy savings, while
+providing a relatively straightforward transition back to the working state.  No
+operating state is lost (the system core logic retains power), so the system can
+go back to where it left off easily enough.
+
+In addition to freezing user space, suspending the timekeeping and putting all
+I/O devices into low-power states, which is done for :ref:`suspend-to-idle
+<s2idle>` too, nonboot CPUs are taken offline and all low-level system functions
+are suspended during transitions into this state.  For this reason, it should
+allow more energy to be saved relative to :ref:`suspend-to-idle <s2idle>`, but
+the resume latency will generally be greater than for that state.
+
+The set of devices that can wake up the system from this state usually is
+reduced relative to :ref:`suspend-to-idle <s2idle>` and it may be necessary to
+rely on the platform for setting up the wakeup functionality as appropriate.
+
+This state is supported if the :c:macro:`CONFIG_SUSPEND` kernel configuration
+option is set and the support for it is registered by the platform with the
+core system suspend subsystem.  On ACPI-based systems this state is mapped to
+the S1 system state defined by ACPI.
+
+.. _s2ram:
+
+Suspend-to-RAM
+--------------
+
+This state (also referred to as STR or S2RAM), if supported, offers significant
+energy savings as everything in the system is put into a low-power state, except
+for memory, which should be placed into the self-refresh mode to retain its
+contents.  All of the steps carried out when entering :ref:`standby <standby>`
+are also carried out during transitions to S2RAM.  Additional operations may
+take place depending on the platform capabilities.  In particular, on ACPI-based
+systems the kernel passes control to the platform firmware (BIOS) as the last
+step during S2RAM transitions and that usually results in powering down some
+more low-level components that are not directly controlled by the kernel.
+
+The state of devices and CPUs is saved and held in memory.  All devices are
+suspended and put into low-power states.  In many cases, all peripheral buses
+lose power when entering S2RAM, so devices must be able to handle the transition
+back to the "on" state.
+
+On ACPI-based systems S2RAM requires some minimal boot-strapping code in the
+platform firmware to resume the system from it.  This may be the case on other
+platforms too.
+
+The set of devices that can wake up the system from S2RAM usually is reduced
+relative to :ref:`suspend-to-idle <s2idle>` and :ref:`standby <standby>` and it
+may be necessary to rely on the platform for setting up the wakeup functionality
+as appropriate.
+
+S2RAM is supported if the :c:macro:`CONFIG_SUSPEND` kernel configuration option
+is set and the support for it is registered by the platform with the core system
+suspend subsystem.  On ACPI-based systems it is mapped to the S3 system state
+defined by ACPI.
+
+.. _hibernation:
+
+Hibernation
+-----------
+
+This state (also referred to as Suspend-to-Disk or STD) offers the greatest
+energy savings and can be used even in the absence of low-level platform support
+for system suspend.  However, it requires some low-level code for resuming the
+system to be present for the underlying CPU architecture.
+
+Hibernation is significantly different from any of the system suspend variants.
+It takes three system state changes to put it into hibernation and two system
+state changes to resume it.
+
+First, when hibernation is triggered, the kernel stops all system activity and
+creates a snapshot image of memory to be written into persistent storage.  Next,
+the system goes into a state in which the snapshot image can be saved, the image
+is written out and finally the system goes into the target low-power state in
+which power is cut from almost all of its hardware components, including memory,
+except for a limited set of wakeup devices.
+
+Once the snapshot image has been written out, the system may either enter a
+special low-power state (like ACPI S4), or it may simply power down itself.
+Powering down means minimum power draw and it allows this mechanism to work on
+any system.  However, entering a special low-power state may allow additional
+means of system wakeup to be used  (e.g. pressing a key on the keyboard or
+opening a laptop lid).
+
+After wakeup, control goes to the platform firmware that runs a boot loader
+which boots a fresh instance of the kernel (control may also go directly to
+the boot loader, depending on the system configuration, but anyway it causes
+a fresh instance of the kernel to be booted).  That new instance of the kernel
+(referred to as the ``restore kernel``) looks for a hibernation image in
+persistent storage and if one is found, it is loaded into memory.  Next, all
+activity in the system is stopped and the restore kernel overwrites itself with
+the image contents and jumps into a special trampoline area in the original
+kernel stored in the image (referred to as the ``image kernel``), which is where
+the special architecture-specific low-level code is needed.  Finally, the
+image kernel restores the system to the pre-hibernation state and allows user
+space to run again.
+
+Hibernation is supported if the :c:macro:`CONFIG_HIBERNATION` kernel
+configuration option is set.  However, this option can only be set if support
+for the given CPU architecture includes the low-level code for system resume.
+
+
+Basic ``sysfs`` Interfaces for System Suspend and Hibernation
+=============================================================
+
+The following files located in the :file:`/sys/power/` directory can be used by
+user space for sleep states control.
+
+``state``
+	This file contains a list of strings representing sleep states supported
+	by the kernel.  Writing one of these strings into it causes the kernel
+	to start a transition of the system into the sleep state represented by
+	that string.
+
+	In particular, the strings "disk", "freeze" and "standby" represent the
+	:ref:`hibernation <hibernation>`, :ref:`suspend-to-idle <s2idle>` and
+	:ref:`standby <standby>` sleep states, respectively.  The string "mem"
+	is interpreted in accordance with the contents of the ``mem_sleep`` file
+	described below.
+
+	If the kernel does not support any system sleep states, this file is
+	not present.
+
+``mem_sleep``
+	This file contains a list of strings representing supported system
+	suspend	variants and allows user space to select the variant to be
+	associated with the "mem" string in the ``state`` file described above.
+
+	The strings that may be present in this file are "s2idle", "shallow"
+	and "deep".  The string "s2idle" always represents :ref:`suspend-to-idle
+	<s2idle>` and, by convention, "shallow" and "deep" represent
+	:ref:`standby <standby>` and :ref:`suspend-to-RAM <s2ram>`,
+	respectively.
+
+	Writing one of the listed strings into this file causes the system
+	suspend variant represented by it to be associated with the "mem" string
+	in the ``state`` file.  The string representing the suspend variant
+	currently associated with the "mem" string in the ``state`` file
+	is listed in square brackets.
+
+	If the kernel does not support system suspend, this file is not present.
+
+``disk``
+	This file contains a list of strings representing different operations
+	that can be carried out after the hibernation image has been saved.  The
+	possible options are as follows:
+
+	``platform``
+		Put the system into a special low-power state (e.g. ACPI S4) to
+		make additional wakeup options available and possibly allow the
+		platform firmware to take a simplified initialization path after
+		wakeup.
+
+	``shutdown``
+		Power off the system.
+
+	``reboot``
+		Reboot the system (useful for diagnostics mostly).
+
+	``suspend``
+		Hybrid system suspend.  Put the system into the suspend sleep
+		state selected through the ``mem_sleep`` file described above.
+		If the system is successfully woken up from that state, discard
+		the hibernation image and continue.  Otherwise, use the image
+		to restore the previous state of the system.
+
+	``test_resume``
+		Diagnostic operation.  Load the image as though the system had
+		just woken up from hibernation and the currently running kernel
+		instance was a restore kernel and follow up with full system
+		resume.
+
+	Writing one of the listed strings into this file causes the option
+	represented by it to be selected.
+
+	The currently selected option is shown in square brackets which means
+	that the operation represented by it will be carried out after creating
+	and saving the image next time hibernation is triggered by writing
+	``disk`` to :file:`/sys/power/state`.
+
+	If the kernel does not support hibernation, this file is not present.
+
+According to the above, there are two ways to make the system go into the
+:ref:`suspend-to-idle <s2idle>` state.  The first one is to write "freeze"
+directly to :file:`/sys/power/state`.  The second one is to write "s2idle" to
+:file:`/sys/power/mem_sleep` and then to write "mem" to
+:file:`/sys/power/state`.  Likewise, there are two ways to make the system go
+into the :ref:`standby <standby>` state (the strings to write to the control
+files in that case are "standby" or "shallow" and "mem", respectively) if that
+state is supported by the platform.  However, there is only one way to make the
+system go into the :ref:`suspend-to-RAM <s2ram>` state (write "deep" into
+:file:`/sys/power/mem_sleep` and "mem" into :file:`/sys/power/state`).
+
+The default suspend variant (ie. the one to be used without writing anything
+into :file:`/sys/power/mem_sleep`) is either "deep" (on the majority of systems
+supporting :ref:`suspend-to-RAM <s2ram>`) or "s2idle", but it can be overridden
+by the value of the "mem_sleep_default" parameter in the kernel command line.
+On some ACPI-based systems, depending on the information in the ACPI tables, the
+default may be "s2idle" even if :ref:`suspend-to-RAM <s2ram>` is supported.
diff --git a/Documentation/admin-guide/pm/strategies.rst b/Documentation/admin-guide/pm/strategies.rst
new file mode 100644
index 000000000..afe4d3f83
--- /dev/null
+++ b/Documentation/admin-guide/pm/strategies.rst
@@ -0,0 +1,52 @@
+===========================
+Power Management Strategies
+===========================
+
+::
+
+ Copyright (c) 2017 Intel Corp., Rafael J. Wysocki <rafael.j.wysocki@intel.com>
+
+The Linux kernel supports two major high-level power management strategies.
+
+One of them is based on using global low-power states of the whole system in
+which user space code cannot be executed and the overall system activity is
+significantly reduced, referred to as :doc:`sleep states <sleep-states>`.  The
+kernel puts the system into one of these states when requested by user space
+and the system stays in it until a special signal is received from one of
+designated devices, triggering a transition to the ``working state`` in which
+user space code can run.  Because sleep states are global and the whole system
+is affected by the state changes, this strategy is referred to as the
+:doc:`system-wide power management <system-wide>`.
+
+The other strategy, referred to as the :doc:`working-state power management
+<working-state>`, is based on adjusting the power states of individual hardware
+components of the system, as needed, in the working state.  In consequence, if
+this strategy is in use, the working state of the system usually does not
+correspond to any particular physical configuration of it, but can be treated as
+a metastate covering a range of different power states of the system in which
+the individual components of it can be either ``active`` (in use) or
+``inactive`` (idle).  If they are active, they have to be in power states
+allowing them to process data and to be accessed by software.  In turn, if they
+are inactive, ideally, they should be in low-power states in which they may not
+be accessible.
+
+If all of the system components are active, the system as a whole is regarded as
+"runtime active" and that situation typically corresponds to the maximum power
+draw (or maximum energy usage) of it.  If all of them are inactive, the system
+as a whole is regarded as "runtime idle" which may be very close to a sleep
+state from the physical system configuration and power draw perspective, but
+then it takes much less time and effort to start executing user space code than
+for the same system in a sleep state.  However, transitions from sleep states
+back to the working state can only be started by a limited set of devices, so
+typically the system can spend much more time in a sleep state than it can be
+runtime idle in one go.  For this reason, systems usually use less energy in
+sleep states than when they are runtime idle most of the time.
+
+Moreover, the two power management strategies address different usage scenarios.
+Namely, if the user indicates that the system will not be in use going forward,
+for example by closing its lid (if the system is a laptop), it probably should
+go into a sleep state at that point.  On the other hand, if the user simply goes
+away from the laptop keyboard, it probably should stay in the working state and
+use the working-state power management in case it becomes idle, because the user
+may come back to it at any time and then may want the system to be immediately
+accessible.
diff --git a/Documentation/admin-guide/pm/system-wide.rst b/Documentation/admin-guide/pm/system-wide.rst
new file mode 100644
index 000000000..0c81e4c5d
--- /dev/null
+++ b/Documentation/admin-guide/pm/system-wide.rst
@@ -0,0 +1,8 @@
+============================
+System-Wide Power Management
+============================
+
+.. toctree::
+   :maxdepth: 2
+
+   sleep-states
diff --git a/Documentation/admin-guide/pm/working-state.rst b/Documentation/admin-guide/pm/working-state.rst
new file mode 100644
index 000000000..fa01bf083
--- /dev/null
+++ b/Documentation/admin-guide/pm/working-state.rst
@@ -0,0 +1,9 @@
+==============================
+Working-State Power Management
+==============================
+
+.. toctree::
+   :maxdepth: 2
+
+   cpufreq
+   intel_pstate
author	Daniel Baumann <daniel.baumann@progress-linux.org>	2024-05-06 01:02:30 +0000
committer	Daniel Baumann <daniel.baumann@progress-linux.org>	2024-05-06 01:02:30 +0000
commit	76cb841cb886eef6b3bee341a2266c76578724ad (patch)
tree	f5892e5ba6cc11949952a6ce4ecbe6d516d6ce58 /Documentation/admin-guide/pm
parent	Initial commit. (diff)
download	linux-76cb841cb886eef6b3bee341a2266c76578724ad.tar.xz linux-76cb841cb886eef6b3bee341a2266c76578724ad.zip