7 files changed, 1203 insertions, 0 deletions
diff --git a/docs/perf/index.rst b/docs/perf/index.rst
new file mode 100644
index 0000000..0938a17
--- /dev/null
+++ b/docs/perf/index.rst
@@ -0,0 +1,17 @@
+Performance & Testing
+=====================
+
+.. toctree::
+   :maxdepth: 1
+   :caption: Contents
+
+   psci-performance-instr
+   psci-performance-juno
+   psci-performance-n1sdp
+   psci-performance-methodology
+   tsp
+   performance-monitoring-unit
+
+--------------
+
+*Copyright (c) 2019-2023, Arm Limited. All rights reserved.*
diff --git a/docs/perf/performance-monitoring-unit.rst b/docs/perf/performance-monitoring-unit.rst
new file mode 100644
index 0000000..5dd1af5
--- /dev/null
+++ b/docs/perf/performance-monitoring-unit.rst
@@ -0,0 +1,158 @@
+Performance Monitoring Unit
+===========================
+
+The Performance Monitoring Unit (PMU) allows recording of architectural and
+microarchitectural events for profiling purposes.
+
+This document gives an overview of the PMU counter configuration to assist with
+implementation and to complement the PMU security guidelines given in the
+:ref:`Secure Development Guidelines` document.
+
+.. note::
+   This section applies to Armv8-A implementations which have version 3
+   of the Performance Monitors Extension (PMUv3).
+
+PMU Counters
+------------
+
+The PMU makes 32 counters available at all privilege levels:
+
+-  31 programmable event counters: ``PMEVCNTR<n>``, where ``n`` is ``0`` to
+   ``30``.
+-  A dedicated cycle counter: ``PMCCNTR``.
+
+Architectural mappings
+~~~~~~~~~~~~~~~~~~~~~~
+
++--------------+---------+----------------------------+
+| Counters     | State   | System Register Name       |
++==============+=========+============================+
+|              | AArch64 | ``PMEVCNTR<n>_EL0[63*:0]`` |
+| Programmable +---------+----------------------------+
+|              | AArch32 | ``PMEVCNTR<n>[31:0]``      |
++--------------+---------+----------------------------+
+|              | AArch64 | ``PMCCNTR_EL0[63:0]``      |
+| Cycle        +---------+----------------------------+
+|              | AArch32 | ``PMCCNTR[63:0]``          |
++--------------+---------+----------------------------+
+
+.. note::
+   Bits [63:32] are only available if ARMv8.5-PMU is implemented. Refer to the
+   `Arm ARM`_ for a detailed description of ARMv8.5-PMU features.
+
+Configuring the PMU for counting events
+---------------------------------------
+
+Each programmable counter has an associated register, ``PMEVTYPER<n>``, which
+configures it. The cycle counter has the ``PMCCFILTR_EL0`` register, which has
+the same function and bit field layout as ``PMEVTYPER<n>``. In addition,
+the counters are enabled (permitted to increment) via the ``PMCNTENSET`` and
+``PMCR`` registers. These can be accessed at all privilege levels.
+
+Architectural mappings
+~~~~~~~~~~~~~~~~~~~~~~
+
++-----------------------------+------------------------+
+| AArch64                     | AArch32                |
++=============================+========================+
+| ``PMEVTYPER<n>_EL0[63*:0]`` | ``PMEVTYPER<n>[31:0]`` |
++-----------------------------+------------------------+
+| ``PMCCFILTR_EL0[63*:0]``    | ``PMCCFILTR[31:0]``    |
++-----------------------------+------------------------+
+| ``PMCNTENSET_EL0[63*:0]``   | ``PMCNTENSET[31:0]``   |
++-----------------------------+------------------------+
+| ``PMCR_EL0[63*:0]``         | ``PMCR[31:0]``         |
++-----------------------------+------------------------+
+
+.. note::
+   Bits [63:32] are reserved.
+
+Relevant register fields
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+For ``PMEVTYPER<n>_EL0``/``PMEVTYPER<n>`` and ``PMCCFILTR_EL0``/``PMCCFILTR``,
+the most important fields are:
+
+-  ``P``:
+
+   -  Bit 31.
+   -  If set to ``0``, will increment the associated ``PMEVCNTR<n>`` at EL1.
+
+-  ``NSK``:
+
+   -  Bit 29.
+   -  If equal to the ``P`` bit it enables the associated ``PMEVCNTR<n>`` at
+      Non-secure EL1.
+   -  Reserved if EL3 not implemented.
+
+-  ``NSH``:
+
+   -  Bit 27.
+   -  If set to ``1``, will increment the associated ``PMEVCNTR<n>`` at EL2.
+   -  Reserved if EL2 not implemented.
+
+-  ``SH``:
+
+   -  Bit 24.
+   -  If different to the ``NSH`` bit it enables the associated ``PMEVCNTR<n>``
+      at Secure EL2.
+   -  Reserved if Secure EL2 not implemented.
+
+-  ``M``:
+
+   -  Bit 26.
+   -  If equal to the ``P`` bit it enables the associated ``PMEVCNTR<n>`` at
+      EL3.
+
+-  ``evtCount[15:10]``:
+
+   -  Extension to ``evtCount[9:0]``. Reserved unless ARMv8.1-PMU implemented.
+
+-  ``evtCount[9:0]``:
+
+   -  The event number that the associated ``PMEVCNTR<n>`` will count.
+
+For ``PMCNTENSET_EL0``/``PMCNTENSET``, the most important fields are:
+
+-  ``P[30:0]``:
+
+   -  Setting bit ``P[n]`` to ``1`` enables counter ``PMEVCNTR<n>``.
+   -  The effects of ``PMEVTYPER<n>`` are applied on top of this.
+      In other words, the counter will not increment at any privilege level or
+      security state unless it is enabled here.
+
+-  ``C``:
+
+   -  Bit 31.
+   -  If set to ``1`` enables the cycle counter ``PMCCNTR``.
+
+For ``PMCR_EL0``/``PMCR``, the most important fields are:
+
+-  ``DP``:
+
+   -  Bit 5.
+   -  If set to ``1`` it disables the cycle counter ``PMCCNTR`` where event
+      counting (by ``PMEVCNTR<n>``) is prohibited (e.g. EL2 and the Secure
+      world).
+   -  If set to ``0``, ``PMCCNTR`` will not be affected by this bit and
+      therefore will be able to count where the programmable counters are
+      prohibited.
+
+-  ``E``:
+
+   -  Bit 0.
+   -  Enables/disables counting altogether.
+   -  The effects of ``PMCNTENSET`` and ``PMCR.DP`` are applied on top of this.
+      In other words, if this bit is ``0`` then no counters will increment
+      regardless of how the other PMU system registers or bit fields are
+      configured.
+
+.. rubric:: References
+
+-  `Arm ARM`_
+
+--------------
+
+*Copyright (c) 2019-2020, Arm Limited and Contributors. All rights reserved.*
+
+.. _Arm ARM: https://developer.arm.com/docs/ddi0487/latest
diff --git a/docs/perf/psci-performance-instr.rst b/docs/perf/psci-performance-instr.rst
new file mode 100644
index 0000000..41094b2
--- /dev/null
+++ b/docs/perf/psci-performance-instr.rst
@@ -0,0 +1,116 @@
+PSCI Performance Measurement
+============================
+
+TF-A provides two instrumentation tools for performing analysis of the PSCI
+implementation:
+
+* PSCI STAT
+* Runtime Instrumentation
+
+This page explains how each tool is enabled and how it can be used to analyse
+the PSCI implementation.
+
+Performance Measurement Framework
+---------------------------------
+
+The Performance Measurement Framework (:ref:`PMF <firmware_design_pmf>`)
+provides mechanisms for collecting and retrieving timestamps at runtime from
+the Performance Monitoring Unit
+(:ref:`PMU <Performance Monitoring Unit>`).
+The PMU is a generalized abstraction for accessing CPU hardware registers used to
+measure hardware events. This means, for instance, that the PMU might be used to
+place instrumentation points at logical locations in code for tracing purposes.
+
+TF-A utilises the PMF as a backend for the two instrumentation services it
+provides--PSCI Statistics and Runtime Instrumentation. The PMF is used by
+these services to facilitate collection and retrieval of timestamps. For
+instance, the PSCI Statistics service registers the PMF service
+``psci_svc`` to track its residency statistics.
+
+The PMF reserves a unique ID, a name, and space in memory for this service. The
+framework provides a convenient interface for PSCI Statistics to retrieve
+values from ``psci_svc`` at runtime. Alternatively, the service may be
+configured such that the PMF dumps those values to the console. A platform may
+choose to expose SMCs that allow retrieval of these timestamps from the
+service.
+
+This feature is enabled with the Boolean flag ``ENABLE_PMF``.
+
+PSCI Statistics
+---------------
+
+PSCI Statistics is a runtime service that provides residency statistics for
+power states used by the platform. The service tracks residency time and
+entry count. Residency time is the total time spent in a particular power
+state by a PE. The entry count is the number of times the PE has entered
+the power state. PSCI Statistics implements the optional functions
+``PSCI_STAT_RESIDENCY`` and ``PSCI_STAT_COUNT`` from the `PSCI`_
+specification.
+
+
+.. c:macro:: PSCI_STAT_RESIDENCY
+
+    :param target_cpu: Contains a copy of the affinity fields in the MPIDR
+      register for identifying the target core (see section 5.1.4 of the
+      `PSCI`_ specification for more details).
+    :param power_state: Identifier for a specific local
+      state. Generally, this parameter takes the same form as the power_state
+      parameter described for CPU_SUSPEND in section 5.4.2.
+
+    :returns: Time spent in ``power_state``, in microseconds, by ``target_cpu``
+      and the highest level expressed in ``power_state``.
+
+
+.. c:macro:: PSCI_STAT_COUNT
+
+    :param target_cpu: follows the same format as ``PSCI_STAT_RESIDENCY``.
+    :param power_state: follows the same format as ``PSCI_STAT_RESIDENCY``.
+
+    :returns: Number of times the state expressed in ``power_state`` has been
+      used by ``target_cpu`` and the highest level expressed in
+      ``power_state``.
+
+The implementation provides residency statistics only for low power states,
+and does this regardless of the entry mechanism into those states. The
+statistics it collects are set to 0 during shutdown or reset.
+
+PSCI Statistics is enabled with the Boolean build flag
+``ENABLE_PSCI_STAT``. All Arm platforms utilise the PMF unless another
+collection backend is provided (``ENABLE_PMF`` is implicitly enabled).
+
+Runtime Instrumentation
+-----------------------
+
+The Runtime Instrumentation Service is an instrumentation tool that wraps
+around the PMF to provide timestamp data. Although the service is not
+restricted to PSCI, it is used primarily in TF-A to quantify the total time
+spent in the PSCI implementation. The tool can be used to instrument other
+components in TF-A as well. It is enabled with the Boolean flag
+``ENABLE_RUNTIME_INSTRUMENTATION``, and as with PSCI STAT, requires PMF to
+be enabled.
+
+In PSCI, this service provides instrumentation points in the
+following code paths:
+
+* Entry into the PSCI SMC handler
+* Exit from the PSCI SMC handler
+* Entry to low power state
+* Exit from low power state
+* Entry into cache maintenance operations in PSCI
+* Exit from cache maintenance operations in PSCI
+
+The service captures the cycle count, which allows the time spent in the
+implementation to be calculated, given the counter frequency.
+
+PSCI SMC Handler Instrumentation
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The timestamp during entry into the handler is captured as early as possible
+during the runtime exception, prior to entry into the handler itself. All
+timestamps are stored in memory for later retrieval. The exit timestamp is
+captured after normal return from the PSCI SMC handler, or, if a low power state
+was requested, it is captured in the warm boot path.
+
+*Copyright (c) 2023, Arm Limited. All rights reserved.*
+
+.. _PSCI: https://developer.arm.com/documentation/den0022/latest/
diff --git a/docs/perf/psci-performance-juno.rst b/docs/perf/psci-performance-juno.rst
new file mode 100644
index 0000000..bab1086
--- /dev/null
+++ b/docs/perf/psci-performance-juno.rst
@@ -0,0 +1,533 @@
+PSCI Performance Measurements on Arm Juno Development Platform
+==============================================================
+
+This document summarises the findings of performance measurements of key
+operations in the Trusted Firmware-A Power State Coordination Interface (PSCI)
+implementation, using the in-built Performance Measurement Framework (PMF) and
+runtime instrumentation timestamps.
+
+Method
+------
+
+We used the `Juno R1 platform`_ for these tests, which has 4 x Cortex-A53 cores
+and 2 x Cortex-A57 cores in two clusters, running at the following frequencies:
+
++-----------------+--------------------+
+| Domain          | Frequency (MHz)    |
++=================+====================+
+| Cortex-A57      | 900 (nominal)      |
++-----------------+--------------------+
+| Cortex-A53      | 650 (underdrive)   |
++-----------------+--------------------+
+| AXI subsystem   | 533                |
++-----------------+--------------------+
+
+Juno supports CPU, cluster and system power down states, corresponding to power
+levels 0, 1 and 2 respectively. It does not support any retention states.
+
+Given that runtime instrumentation using PMF is invasive, there is a small
+(unquantified) overhead on the results. PMF uses the generic counter for
+timestamps, which runs at 50MHz on Juno.
+
+The following source trees and binaries were used:
+
+- TF-A [`v2.9-rc0`_]
+- TFTF [`v2.9-rc0`_]
+
+Please see the Runtime Instrumentation :ref:`Testing Methodology
+<Runtime Instrumentation Methodology>`
+page for more details.
+
+Procedure
+---------
+
+#. Build TFTF with runtime instrumentation enabled:
+
+    .. code:: shell
+
+        make CROSS_COMPILE=aarch64-none-elf- PLAT=juno \
+            TESTS=runtime-instrumentation all
+
+#. Fetch Juno's SCP binary from TF-A's archive:
+
+    .. code:: shell
+
+        curl --fail --connect-timeout 5 --retry 5 -sLS -o scp_bl2.bin \
+            https://downloads.trustedfirmware.org/tf-a/css_scp_2.12.0/juno/release/juno-bl2.bin
+
+#. Build TF-A with the following build options:
+
+    .. code:: shell
+
+        make CROSS_COMPILE=aarch64-none-elf- PLAT=juno \
+            BL33="/path/to/tftf.bin" SCP_BL2="scp_bl2.bin" \
+            ENABLE_RUNTIME_INSTRUMENTATION=1 fiptool all fip
+
+#. Load the following images onto the development board: ``fip.bin``,
+   ``scp_bl2.bin``.
+
+Results
+-------
+
+``CPU_SUSPEND`` to deepest power level
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. table:: ``CPU_SUSPEND`` latencies (µs) to deepest power level in
+        parallel (v2.9)
+
+    +---------+------+-----------+--------+-------------+
+    | Cluster | Core | Powerdown | Wakeup | Cache Flush |
+    +---------+------+-----------+--------+-------------+
+    |    0    |  0   |   104.58  | 241.20 |     5.26    |
+    +---------+------+-----------+--------+-------------+
+    |    0    |  1   |   384.24  | 22.50  |    138.76   |
+    +---------+------+-----------+--------+-------------+
+    |    1    |  0   |   244.56  | 22.18  |     5.16    |
+    +---------+------+-----------+--------+-------------+
+    |    1    |  1   |   670.56  | 18.58  |     4.44    |
+    +---------+------+-----------+--------+-------------+
+    |    1    |  2   |   809.36  | 269.28 |     4.44    |
+    +---------+------+-----------+--------+-------------+
+    |    1    |  3   |   984.96  | 219.70 |    79.62    |
+    +---------+------+-----------+--------+-------------+
+
+.. table:: ``CPU_SUSPEND`` latencies (µs) to deepest power level in
+        parallel (v2.10)
+
+    +---------+------+-------------------+--------+-------------+
+    | Cluster | Core |     Powerdown     | Wakeup | Cache Flush |
+    +---------+------+-------------------+--------+-------------+
+    |    0    |  0   | 242.66 (+132.03%) | 245.1  |     5.4     |
+    +---------+------+-------------------+--------+-------------+
+    |    0    |  1   |  522.08 (+35.87%) | 26.24  |    138.32   |
+    +---------+------+-------------------+--------+-------------+
+    |    1    |  0   |  104.36 (-57.33%) |  27.1  |     5.32    |
+    +---------+------+-------------------+--------+-------------+
+    |    1    |  1   |  382.56 (-42.95%) | 23.34  |     4.42    |
+    +---------+------+-------------------+--------+-------------+
+    |    1    |  2   |       807.74      | 271.54 |     4.64    |
+    +---------+------+-------------------+--------+-------------+
+    |    1    |  3   |       981.36      | 221.8  |    79.48    |
+    +---------+------+-------------------+--------+-------------+
+
+.. table:: ``CPU_SUSPEND`` latencies (µs) to deepest power level in
+        serial (v2.9)
+
+    +---------+------+-----------+--------+-------------+
+    | Cluster | Core | Powerdown | Wakeup | Cache Flush |
+    +---------+------+-----------+--------+-------------+
+    |    0    |  0   |   236.56  | 23.24  |    138.18   |
+    +---------+------+-----------+--------+-------------+
+    |    0    |  1   |   236.86  | 23.28  |    138.10   |
+    +---------+------+-----------+--------+-------------+
+    |    1    |  0   |   281.04  | 22.80  |    77.24    |
+    +---------+------+-----------+--------+-------------+
+    |    1    |  1   |   100.28  | 18.52  |     4.54    |
+    +---------+------+-----------+--------+-------------+
+    |    1    |  2   |   100.12  | 18.78  |     4.50    |
+    +---------+------+-----------+--------+-------------+
+    |    1    |  3   |   100.36  | 18.94  |     4.44    |
+    +---------+------+-----------+--------+-------------+
+
+.. table:: ``CPU_SUSPEND`` latencies (µs) to deepest power level in
+        serial (v2.10)
+
+    +---------+------+-----------+--------+-------------+
+    | Cluster | Core | Powerdown | Wakeup | Cache Flush |
+    +---------+------+-----------+--------+-------------+
+    |    0    |  0   |   236.84  |  27.1  |    138.36   |
+    +---------+------+-----------+--------+-------------+
+    |    0    |  1   |   236.96  |  27.1  |    138.32   |
+    +---------+------+-----------+--------+-------------+
+    |    1    |  0   |   280.06  | 26.94  |     77.5    |
+    +---------+------+-----------+--------+-------------+
+    |    1    |  1   |   100.76  | 23.42  |     4.36    |
+    +---------+------+-----------+--------+-------------+
+    |    1    |  2   |   100.02  | 23.42  |     4.44    |
+    +---------+------+-----------+--------+-------------+
+    |    1    |  3   |   100.08  |  23.2  |     4.4     |
+    +---------+------+-----------+--------+-------------+
+
+``CPU_SUSPEND`` to power level 0
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. table:: ``CPU_SUSPEND`` latencies (µs) to power level 0 in
+        parallel (v2.9)
+
+    +---------+------+-----------+--------+-------------+
+    | Cluster | Core | Powerdown | Wakeup | Cache Flush |
+    +---------+------+-----------+--------+-------------+
+    |    0    |  0   |   662.34  | 15.22  |     8.08    |
+    +---------+------+-----------+--------+-------------+
+    |    0    |  1   |   802.00  | 15.50  |     8.16    |
+    +---------+------+-----------+--------+-------------+
+    |    1    |  0   |   385.22  | 15.74  |     7.88    |
+    +---------+------+-----------+--------+-------------+
+    |    1    |  1   |   106.16  | 16.06  |     7.44    |
+    +---------+------+-----------+--------+-------------+
+    |    1    |  2   |   524.38  | 15.64  |     7.34    |
+    +---------+------+-----------+--------+-------------+
+    |    1    |  3   |   246.00  | 15.78  |     7.72    |
+    +---------+------+-----------+--------+-------------+
+
+.. table:: ``CPU_SUSPEND`` latencies (µs) to power level 0 in
+        parallel (v2.10)
+
+    +---------+------+-------------------+--------+-------------+
+    | Cluster | Core |     Powerdown     | Wakeup | Cache Flush |
+    +---------+------+-------------------+--------+-------------+
+    |    0    |  0   |       801.04      | 18.66  |     8.22    |
+    +---------+------+-------------------+--------+-------------+
+    |    0    |  1   |       661.28      | 19.08  |     7.88    |
+    +---------+------+-------------------+--------+-------------+
+    |    1    |  0   |  105.9 (-72.51%)  |  20.3  |     7.58    |
+    +---------+------+-------------------+--------+-------------+
+    |    1    |  1   | 383.58 (+261.32%) |  20.4  |     7.42    |
+    +---------+------+-------------------+--------+-------------+
+    |    1    |  2   |       523.52      |  20.1  |     7.74    |
+    +---------+------+-------------------+--------+-------------+
+    |    1    |  3   |       244.5       | 20.16  |     7.56    |
+    +---------+------+-------------------+--------+-------------+
+
+.. table:: ``CPU_SUSPEND`` latencies (µs) to power level 0 in serial (v2.9)
+
+    +---------+------+-----------+--------+-------------+
+    | Cluster | Core | Powerdown | Wakeup | Cache Flush |
+    +---------+------+-----------+--------+-------------+
+    |    0    |  0   |   99.80   | 15.94  |     5.42    |
+    +---------+------+-----------+--------+-------------+
+    |    0    |  1   |   99.76   | 15.80  |     5.24    |
+    +---------+------+-----------+--------+-------------+
+    |    1    |  0   |   278.26  | 16.16  |     4.58    |
+    +---------+------+-----------+--------+-------------+
+    |    1    |  1   |   96.88   | 16.00  |     4.52    |
+    +---------+------+-----------+--------+-------------+
+    |    1    |  2   |   96.80   | 16.12  |     4.54    |
+    +---------+------+-----------+--------+-------------+
+    |    1    |  3   |   96.88   | 16.12  |     4.54    |
+    +---------+------+-----------+--------+-------------+
+
+.. table:: ``CPU_SUSPEND`` latencies (µs) to power level 0 in serial (v2.10)
+
+    +---------+------+-----------+--------+-------------+
+    | Cluster | Core | Powerdown | Wakeup | Cache Flush |
+    +---------+------+-----------+--------+-------------+
+    |    0    |  0   |   99.84   | 18.86  |     5.54    |
+    +---------+------+-----------+--------+-------------+
+    |    0    |  1   |   100.2   | 18.82  |     5.66    |
+    +---------+------+-----------+--------+-------------+
+    |    1    |  0   |   278.12  | 20.56  |     4.48    |
+    +---------+------+-----------+--------+-------------+
+    |    1    |  1   |   96.68   | 20.62  |     4.3     |
+    +---------+------+-----------+--------+-------------+
+    |    1    |  2   |   96.94   | 20.14  |     4.42    |
+    +---------+------+-----------+--------+-------------+
+    |    1    |  3   |   96.68   | 20.46  |     4.32    |
+    +---------+------+-----------+--------+-------------+
+
+``CPU_OFF`` on all non-lead CPUs
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+``CPU_OFF`` on all non-lead CPUs in sequence, then ``CPU_SUSPEND`` on the lead
+core to the deepest power level.
+
+.. table:: ``CPU_OFF`` latencies (µs) on all non-lead CPUs (v2.9)
+
+    +---------+------+-----------+--------+-------------+
+    | Cluster | Core | Powerdown | Wakeup | Cache Flush |
+    +---------+------+-----------+--------+-------------+
+    |    0    |  0   |   235.76  | 26.14  |    137.80   |
+    +---------+------+-----------+--------+-------------+
+    |    0    |  1   |   235.40  | 25.72  |    137.62   |
+    +---------+------+-----------+--------+-------------+
+    |    1    |  0   |   174.70  | 22.40  |    77.26    |
+    +---------+------+-----------+--------+-------------+
+    |    1    |  1   |   100.92  | 24.04  |     4.52    |
+    +---------+------+-----------+--------+-------------+
+    |    1    |  2   |   100.68  | 22.44  |     4.36    |
+    +---------+------+-----------+--------+-------------+
+    |    1    |  3   |   101.36  | 22.70  |     4.52    |
+    +---------+------+-----------+--------+-------------+
+
+.. table:: ``CPU_OFF`` latencies (µs) on all non-lead CPUs (v2.10)
+
+    +---------------------------------------------------+
+    |       test_rt_instr_cpu_off_serial (latest)       |
+    +---------+------+-----------+--------+-------------+
+    | Cluster | Core | Powerdown | Wakeup | Cache Flush |
+    +---------+------+-----------+--------+-------------+
+    |    0    |  0   |   236.04  | 30.02  |    137.9    |
+    +---------+------+-----------+--------+-------------+
+    |    0    |  1   |   235.38  |  29.7  |    137.72   |
+    +---------+------+-----------+--------+-------------+
+    |    1    |  0   |   175.18  | 26.96  |    77.26    |
+    +---------+------+-----------+--------+-------------+
+    |    1    |  1   |   100.56  | 28.34  |     4.32    |
+    +---------+------+-----------+--------+-------------+
+    |    1    |  2   |   100.38  | 26.82  |     4.3     |
+    +---------+------+-----------+--------+-------------+
+    |    1    |  3   |   100.86  | 26.98  |     4.42    |
+    +---------+------+-----------+--------+-------------+
+
+``CPU_VERSION`` in parallel
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. table:: ``PSCI_VERSION`` latency (µs) in parallel on all cores (v2.9)
+
+    +-------------+--------+-------------+
+    |   Cluster   |  Core  |   Latency   |
+    +-------------+--------+-------------+
+    |      0      |   0    |     1.48    |
+    +-------------+--------+-------------+
+    |      0      |   1    |     1.04    |
+    +-------------+--------+-------------+
+    |      1      |   0    |     0.56    |
+    +-------------+--------+-------------+
+    |      1      |   1    |     0.92    |
+    +-------------+--------+-------------+
+    |      1      |   2    |     0.96    |
+    +-------------+--------+-------------+
+    |      1      |   3    |     0.96    |
+    +-------------+--------+-------------+
+
+.. table:: ``PSCI_VERSION`` latency (µs) in parallel on all cores (v2.10)
+
+    +-------------+--------+----------------------+
+    |   Cluster   |  Core  |       Latency        |
+    +-------------+--------+----------------------+
+    |      0      |   0    |    1.1 (-25.68%)     |
+    +-------------+--------+----------------------+
+    |      0      |   1    |         1.06         |
+    +-------------+--------+----------------------+
+    |      1      |   0    |         0.58         |
+    +-------------+--------+----------------------+
+    |      1      |   1    |         0.88         |
+    +-------------+--------+----------------------+
+    |      1      |   2    |         0.92         |
+    +-------------+--------+----------------------+
+    |      1      |   3    |         0.9          |
+    +-------------+--------+----------------------+
+
+Annotated Historic Results
+--------------------------
+
+The following results are based on the upstream `TF master as of 31/01/2017`_.
+TF-A was built using the same build instructions as detailed in the procedure
+above.
+
+In the results below, CPUs 0-3 refer to CPUs in the little cluster (A53) and
+CPUs 4-5 refer to CPUs in the big cluster (A57). In all cases CPU 4 is the lead
+CPU.
+
+``PSCI_ENTRY`` corresponds to the powerdown latency, ``PSCI_EXIT`` the wakeup latency, and
+``CFLUSH_OVERHEAD`` the latency of the cache flush operation.
+
+``CPU_SUSPEND`` to deepest power level on all CPUs in parallel
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
++-------+---------------------+--------------------+--------------------------+
+| CPU   | ``PSCI_ENTRY`` (us) | ``PSCI_EXIT`` (us) | ``CFLUSH_OVERHEAD`` (us) |
++=======+=====================+====================+==========================+
+| 0     | 27                  | 20                 | 5                        |
++-------+---------------------+--------------------+--------------------------+
+| 1     | 114                 | 86                 | 5                        |
++-------+---------------------+--------------------+--------------------------+
+| 2     | 202                 | 58                 | 5                        |
++-------+---------------------+--------------------+--------------------------+
+| 3     | 375                 | 29                 | 94                       |
++-------+---------------------+--------------------+--------------------------+
+| 4     | 20                  | 22                 | 6                        |
++-------+---------------------+--------------------+--------------------------+
+| 5     | 290                 | 18                 | 206                      |
++-------+---------------------+--------------------+--------------------------+
+
+A large variance in ``PSCI_ENTRY`` and ``PSCI_EXIT`` times across CPUs is
+observed due to TF PSCI lock contention. In the worst case, CPU 3 has to wait
+for the 3 other CPUs in the cluster (0-2) to complete ``PSCI_ENTRY`` and release
+the lock before proceeding.
+
+The ``CFLUSH_OVERHEAD`` times for CPUs 3 and 5 are higher because they are the
+last CPUs in their respective clusters to power down, therefore both the L1 and
+L2 caches are flushed.
+
+The ``CFLUSH_OVERHEAD`` time for CPU 5 is a lot larger than that for CPU 3
+because the L2 cache of the big cluster (2MB) is twice the size of the L2 cache
+of the little cluster (1MB).
+
+``CPU_SUSPEND`` to power level 0 on all CPUs in parallel
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
++-------+---------------------+--------------------+--------------------------+
+| CPU   | ``PSCI_ENTRY`` (us) | ``PSCI_EXIT`` (us) | ``CFLUSH_OVERHEAD`` (us) |
++=======+=====================+====================+==========================+
+| 0     | 116                 | 14                 | 8                        |
++-------+---------------------+--------------------+--------------------------+
+| 1     | 204                 | 14                 | 8                        |
++-------+---------------------+--------------------+--------------------------+
+| 2     | 287                 | 13                 | 8                        |
++-------+---------------------+--------------------+--------------------------+
+| 3     | 376                 | 13                 | 9                        |
++-------+---------------------+--------------------+--------------------------+
+| 4     | 29                  | 15                 | 7                        |
++-------+---------------------+--------------------+--------------------------+
+| 5     | 21                  | 15                 | 8                        |
++-------+---------------------+--------------------+--------------------------+
+
+There is no lock contention in TF generic code at power level 0 but the large
+variance in ``PSCI_ENTRY`` times across CPUs is due to lock contention in Juno
+platform code. The platform lock is used to mediate access to a single SCP
+communication channel. This is compounded by the SCP firmware waiting for each
+AP CPU to enter WFI before making the channel available to other CPUs, which
+effectively serializes the SCP power down commands from all CPUs.
+
+On platforms with a more efficient CPU power down mechanism, it should be
+possible to make the ``PSCI_ENTRY`` times smaller and consistent.
+
+The ``PSCI_EXIT`` times are consistent across all CPUs because TF does not
+require locks at power level 0.
+
+The ``CFLUSH_OVERHEAD`` times for all CPUs are small and consistent since only
+the cache associated with power level 0 is flushed (L1).
+
+``CPU_SUSPEND`` to deepest power level on all CPUs in sequence
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
++-------+---------------------+--------------------+--------------------------+
+| CPU   | ``PSCI_ENTRY`` (us) | ``PSCI_EXIT`` (us) | ``CFLUSH_OVERHEAD`` (us) |
++=======+=====================+====================+==========================+
+| 0     | 114                 | 20                 | 94                       |
++-------+---------------------+--------------------+--------------------------+
+| 1     | 114                 | 20                 | 94                       |
++-------+---------------------+--------------------+--------------------------+
+| 2     | 114                 | 20                 | 94                       |
++-------+---------------------+--------------------+--------------------------+
+| 3     | 114                 | 20                 | 94                       |
++-------+---------------------+--------------------+--------------------------+
+| 4     | 195                 | 22                 | 180                      |
++-------+---------------------+--------------------+--------------------------+
+| 5     | 21                  | 17                 | 6                        |
++-------+---------------------+--------------------+--------------------------+
+
+The ``CFLUSH_OVERHEAD`` times for lead CPU 4 and all CPUs in the non-lead cluster
+are large because all other CPUs in the cluster are powered down during the
+test. The ``CPU_SUSPEND`` call powers down to the cluster level, requiring a
+flush of both L1 and L2 caches.
+
+The ``CFLUSH_OVERHEAD`` time for CPU 4 is much larger than those for the little
+CPUs because the L2 cache of the big cluster (2MB) is twice the size of the L2
+cache of the little cluster (1MB).
+
+The ``PSCI_ENTRY`` and ``CFLUSH_OVERHEAD`` times for CPU 5 are low because lead
+CPU 4 continues to run while CPU 5 is suspended. Hence CPU 5 only powers down to
+level 0, which only requires an L1 cache flush.
+
+``CPU_SUSPEND`` to power level 0 on all CPUs in sequence
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
++-------+---------------------+--------------------+--------------------------+
+| CPU   | ``PSCI_ENTRY`` (us) | ``PSCI_EXIT`` (us) | ``CFLUSH_OVERHEAD`` (us) |
++=======+=====================+====================+==========================+
+| 0     | 22                  | 14                 | 5                        |
++-------+---------------------+--------------------+--------------------------+
+| 1     | 22                  | 14                 | 5                        |
++-------+---------------------+--------------------+--------------------------+
+| 2     | 21                  | 14                 | 5                        |
++-------+---------------------+--------------------+--------------------------+
+| 3     | 22                  | 14                 | 5                        |
++-------+---------------------+--------------------+--------------------------+
+| 4     | 17                  | 14                 | 6                        |
++-------+---------------------+--------------------+--------------------------+
+| 5     | 18                  | 15                 | 6                        |
++-------+---------------------+--------------------+--------------------------+
+
+Here the times are small and consistent since there is no contention and it is
+only necessary to flush the cache to power level 0 (L1). This is the best case
+scenario.
+
+The ``PSCI_ENTRY`` times for CPUs in the big cluster are slightly smaller than
+for the CPUs in the little cluster due to greater CPU performance.
+
+The ``PSCI_EXIT`` times are generally lower than in the previous test because
+the cluster remains powered on throughout the test and there is less code to
+execute on power on (for example, no need to enter CCI coherency).
+
+``CPU_OFF`` on all non-lead CPUs in sequence then ``CPU_SUSPEND`` on lead CPU to deepest power level
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The test sequence here is as follows:
+
+1. Call ``CPU_ON`` and ``CPU_OFF`` on each non-lead CPU in sequence.
+
+2. Program the wake-up timer and suspend the lead CPU to the deepest power level.
+
+3. Call ``CPU_ON`` on each non-lead CPU to get the timestamps from each CPU.
+
++-------+---------------------+--------------------+--------------------------+
+| CPU   | ``PSCI_ENTRY`` (us) | ``PSCI_EXIT`` (us) | ``CFLUSH_OVERHEAD`` (us) |
++=======+=====================+====================+==========================+
+| 0     | 110                 | 28                 | 93                       |
++-------+---------------------+--------------------+--------------------------+
+| 1     | 110                 | 28                 | 93                       |
++-------+---------------------+--------------------+--------------------------+
+| 2     | 110                 | 28                 | 93                       |
++-------+---------------------+--------------------+--------------------------+
+| 3     | 111                 | 28                 | 93                       |
++-------+---------------------+--------------------+--------------------------+
+| 4     | 195                 | 22                 | 181                      |
++-------+---------------------+--------------------+--------------------------+
+| 5     | 20                  | 23                 | 6                        |
++-------+---------------------+--------------------+--------------------------+
+
+The ``CFLUSH_OVERHEAD`` times for all little CPUs are large because all other
+CPUs in that cluster are powered down during the test. The ``CPU_OFF`` call
+powers down to the cluster level, requiring a flush of both L1 and L2 caches.
+
+The ``PSCI_ENTRY`` and ``CFLUSH_OVERHEAD`` times for CPU 5 are small because
+lead CPU 4 is running and CPU 5 only powers down to level 0, which only requires
+an L1 cache flush.
+
+The ``CFLUSH_OVERHEAD`` time for CPU 4 is much larger than those for the little
+CPUs because the L2 cache of the big cluster (2MB) is twice the size of the L2
+cache of the little cluster (1MB).
+
+The ``PSCI_EXIT`` times for CPUs in the big cluster are slightly smaller than
+for CPUs in the little cluster due to greater CPU performance. These times
+are generally greater than the ``PSCI_EXIT`` times in the ``CPU_SUSPEND`` tests
+because there is more code to execute in the "on finisher" compared to the
+"suspend finisher" (for example, GIC redistributor register programming).
+
+``PSCI_VERSION`` on all CPUs in parallel
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Since very little code is associated with ``PSCI_VERSION``, this test
+approximates the round trip latency for handling a fast SMC at EL3 in TF.
+
++-------+-------------------+
+| CPU   | TOTAL TIME (ns)   |
++=======+===================+
+| 0     | 3020              |
++-------+-------------------+
+| 1     | 2940              |
++-------+-------------------+
+| 2     | 2980              |
++-------+-------------------+
+| 3     | 3060              |
++-------+-------------------+
+| 4     | 520               |
++-------+-------------------+
+| 5     | 720               |
++-------+-------------------+
+
+The times for the big CPUs are less than the little CPUs due to greater CPU
+performance.
+
+We suspect the time for lead CPU 4 is shorter than for CPU 5 due to subtle cache
+effects, given that these measurements are at the nanosecond level.
+
+--------------
+
+*Copyright (c) 2019-2023, Arm Limited and Contributors. All rights reserved.*
+
+.. _Juno R1 platform: https://developer.arm.com/documentation/100122/latest/
+.. _TF master as of 31/01/2017: https://git.trustedfirmware.org/TF-A/trusted-firmware-a.git/tree/?id=c38b36d
+.. _v2.9-rc0: https://git.trustedfirmware.org/TF-A/trusted-firmware-a.git/tree/?h=v2.9-rc0
diff --git a/docs/perf/psci-performance-methodology.rst b/docs/perf/psci-performance-methodology.rst
new file mode 100644
index 0000000..a9f379d
--- /dev/null
+++ b/docs/perf/psci-performance-methodology.rst
@@ -0,0 +1,55 @@
+Runtime Instrumentation Methodology
+===================================
+
+This document outlines steps for undertaking performance measurements of key
+operations in the Trusted Firmware-A Power State Coordination Interface (PSCI)
+implementation, using the in-built Performance Measurement Framework (PMF) and
+runtime instrumentation timestamps.
+
+Framework
+~~~~~~~~~
+
+The tests are based on the ``runtime-instrumentation`` test suite provided by
+the Trusted Firmware Test Framework (TFTF). The release build of this framework
+was used because the results in the debug build became skewed; the console
+output prevented some of the tests from executing in parallel.
+
+The tests consist of both parallel and sequential tests, which are broadly
+described as follows:
+
+- **Parallel Tests** This type of test powers on all the non-lead CPUs and
+  brings them and the lead CPU to a common synchronization point.  The lead CPU
+  then initiates the test on all CPUs in parallel.
+
+- **Sequential Tests** This type of test powers on each non-lead CPU in
+  sequence. The lead CPU initiates the test on a non-lead CPU then waits for the
+  test to complete before proceeding to the next non-lead CPU. The lead CPU then
+  executes the test on itself.
+
+Note there is very little variance observed in the values given (~1us), although
+the values for each CPU are sometimes interchanged, depending on the order in
+which locks are acquired. Also, there is very little variance observed between
+executing the tests sequentially in a single boot or rebooting between tests.
+
+Given that runtime instrumentation using PMF is invasive, there is a small
+(unquantified) overhead on the results. PMF uses the generic counter for
+timestamps, which runs at 50MHz on Juno.
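+
+As a rough illustration of the timestamp resolution, the sketch below converts
+generic counter ticks to nanoseconds, assuming the 50MHz frequency quoted above.
+The helper name ``ticks_to_ns`` and the sample tick counts are hypothetical and
+are not part of PMF or TFTF.
+
+.. code:: python
+
+    # Generic counter frequency on Juno, as stated above (50MHz).
+    COUNTER_FREQ_HZ = 50_000_000
+
+    def ticks_to_ns(ticks, freq_hz=COUNTER_FREQ_HZ):
+        """Convert a counter tick delta to whole nanoseconds."""
+        return ticks * 1_000_000_000 // freq_hz
+
+    # Hypothetical tick deltas, chosen only to show the arithmetic.
+    for delta in (1, 26, 151):
+        print(delta, "ticks =", ticks_to_ns(delta), "ns")
+
+At this frequency one tick corresponds to 20ns, so intervals measured on Juno
+are multiples of 20ns.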
+
+Metrics
+~~~~~~~
+
+.. glossary::
+
+   Powerdown Latency
+        Time taken from entering the TF PSCI implementation to the point the hardware
+        enters the low power state (WFI). Referring to the TF runtime instrumentation points, this
+        corresponds to: ``(RT_INSTR_ENTER_HW_LOW_PWR - RT_INSTR_ENTER_PSCI)``.
+
+   Wakeup Latency
+        Time taken from the point the hardware exits the low power state to exiting
+        the TF PSCI implementation. This corresponds to: ``(RT_INSTR_EXIT_PSCI -
+        RT_INSTR_EXIT_HW_LOW_PWR)``.
+
+   Cache Flush Latency
+        Time taken to flush the caches during powerdown. This corresponds to:
+        ``(RT_INSTR_EXIT_CFLUSH - RT_INSTR_ENTER_CFLUSH)``.
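+
+The three metrics above can be sketched as simple timestamp differences. This
+is a minimal illustration only: the function name ``psci_latencies``, the
+dictionary layout and the sample values are assumptions made for this example,
+not the way TFTF stores or reports its timestamps.
+
+.. code:: python
+
+    def psci_latencies(ts):
+        """Derive the three latencies from one CPU's instrumentation timestamps.
+
+        ``ts`` maps each runtime instrumentation point name to a timestamp;
+        the results are in the same unit as the inputs.
+        """
+        return {
+            "powerdown": ts["RT_INSTR_ENTER_HW_LOW_PWR"] - ts["RT_INSTR_ENTER_PSCI"],
+            "wakeup": ts["RT_INSTR_EXIT_PSCI"] - ts["RT_INSTR_EXIT_HW_LOW_PWR"],
+            "cache_flush": ts["RT_INSTR_EXIT_CFLUSH"] - ts["RT_INSTR_ENTER_CFLUSH"],
+        }
+
+    # Hypothetical timestamps (arbitrary units), not measured data.
+    sample = {
+        "RT_INSTR_ENTER_PSCI": 1000,
+        "RT_INSTR_ENTER_CFLUSH": 1010,
+        "RT_INSTR_EXIT_CFLUSH": 1040,
+        "RT_INSTR_ENTER_HW_LOW_PWR": 1100,
+        "RT_INSTR_EXIT_HW_LOW_PWR": 5000,
+        "RT_INSTR_EXIT_PSCI": 5050,
+    }
+    print(psci_latencies(sample))
+
+Note that the cache flush interval lies inside the powerdown interval, so the
+cache flush latency is a component of the powerdown latency rather than an
+addition to it.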
diff --git a/docs/perf/psci-performance-n1sdp.rst b/docs/perf/psci-performance-n1sdp.rst
new file mode 100644
index 0000000..fd3c9c9
--- /dev/null
+++ b/docs/perf/psci-performance-n1sdp.rst
@@ -0,0 +1,297 @@
+Runtime Instrumentation Testing - N1SDP
+=======================================
+
+For this test we used the N1 System Development Platform (`N1SDP`_), which
+contains an SoC consisting of two dual-core Arm N1 clusters.
+
+The following source trees and binaries were used:
+
+- TF-A [`v2.9-rc0-16-g666aec401`_]
+- TFTF [`v2.9-rc0`_]
+- SCP/MCP `Prebuilt Images`_
+
+Please see the Runtime Instrumentation :ref:`Testing Methodology
+<Runtime Instrumentation Methodology>` page for more details.
+
+Procedure
+---------
+
+#. Build TFTF with runtime instrumentation enabled:
+
+    .. code:: shell
+
+        make CROSS_COMPILE=aarch64-none-elf- PLAT=n1sdp \
+            TESTS=runtime-instrumentation all
+
+#. Build TF-A with the following build options:
+
+    .. code:: shell
+
+        make CROSS_COMPILE=aarch64-none-elf- PLAT=n1sdp \
+            ENABLE_RUNTIME_INSTRUMENTATION=1 fiptool all
+
+#. Fetch the SCP firmware images:
+
+    .. code:: shell
+
+        curl --fail --connect-timeout 5 --retry 5 \
+            -sLS -o build/n1sdp/release/scp_rom.bin \
+            https://downloads.trustedfirmware.org/tf-a/css_scp_2.12.0/n1sdp/release/n1sdp-bl1.bin
+        curl --fail --connect-timeout 5 \
+            --retry 5 -sLS -o build/n1sdp/release/scp_ram.bin \
+            https://downloads.trustedfirmware.org/tf-a/css_scp_2.12.0/n1sdp/release/n1sdp-bl2.bin
+
+#. Fetch the MCP firmware images:
+
+    .. code:: shell
+
+        curl --fail --connect-timeout 5 --retry 5 \
+            -sLS -o build/n1sdp/release/mcp_rom.bin \
+            https://downloads.trustedfirmware.org/tf-a/css_scp_2.12.0/n1sdp/release/n1sdp-mcp-bl1.bin
+        curl --fail --connect-timeout 5 --retry 5 \
+            -sLS -o build/n1sdp/release/mcp_ram.bin \
+            https://downloads.trustedfirmware.org/tf-a/css_scp_2.12.0/n1sdp/release/n1sdp-mcp-bl2.bin
+
+#. Using the fiptool, create a new FIP package and append the SCP RAM image onto
+   it:
+
+    .. code:: shell
+
+        ./tools/fiptool/fiptool create --blob \
+                uuid=cfacc2c4-15e8-4668-82be-430a38fad705,file=build/n1sdp/release/bl1.bin \
+                --scp-fw build/n1sdp/release/scp_ram.bin build/n1sdp/release/scp_fw.bin
+
+#. Append the MCP image to the FIP.
+
+    .. code:: shell
+
+        ./tools/fiptool/fiptool create \
+            --blob uuid=54464222-a4cf-4bf8-b1b6-cee7dade539e,file=build/n1sdp/release/mcp_ram.bin \
+            build/n1sdp/release/mcp_fw.bin
+
+#. Then, add TFTF as the Non-Secure workload in the FIP image:
+
+    .. code:: shell
+
+        make CROSS_COMPILE=aarch64-none-elf- PLAT=n1sdp \
+            ENABLE_RUNTIME_INSTRUMENTATION=1 SCP_BL2=/dev/null \
+            BL33=<path/to/tftf.bin>  fip
+
+#. Load the following images onto the development board: ``fip.bin``,
+   ``scp_rom.bin``, ``scp_ram.bin``, ``mcp_rom.bin``, and ``mcp_ram.bin``.
+
+.. note::
+
+    These instructions presume you have a complete firmware stack. The N1SDP
+    `user guide`_ provides a detailed explanation of how to get set up from
+    scratch.
+
+Results
+-------
+
+``CPU_SUSPEND`` to deepest power level
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. table:: ``CPU_SUSPEND`` latencies (µs) to deepest power level in
+        parallel (v2.9)
+
+    +---------+------+-----------+--------+-------------+
+    | Cluster | Core | Powerdown | Wakeup | Cache Flush |
+    +---------+------+-----------+--------+-------------+
+    |    0    |  0   |    2.80   | 10.08  |     0.80    |
+    +---------+------+-----------+--------+-------------+
+    |    0    |  0   |    4.14   | 15.92  |     0.16    |
+    +---------+------+-----------+--------+-------------+
+    |    1    |  0   |    3.68   | 12.96  |     0.16    |
+    +---------+------+-----------+--------+-------------+
+    |    1    |  0   |    3.36   | 18.58  |     0.18    |
+    +---------+------+-----------+--------+-------------+
+
+.. table:: ``CPU_SUSPEND`` latencies (µs) to deepest power level in
+        parallel (v2.10)
+
+    +---------+------+----------------+------------------+-----------------+
+    | Cluster | Core |   Powerdown    |      Wakeup      |   Cache Flush   |
+    +---------+------+----------------+------------------+-----------------+
+    |    0    |  0   |      2.12      | 23.94 (+137.50%) |  0.42 (-47.50%) |
+    +---------+------+----------------+------------------+-----------------+
+    |    0    |  0   |      3.52      | 42.08 (+164.32%) |  0.26 (+62.50%) |
+    +---------+------+----------------+------------------+-----------------+
+    |    1    |  0   | 2.76 (-25.00%) | 38.3 (+195.52%)  |  0.26 (+62.50%) |
+    +---------+------+----------------+------------------+-----------------+
+    |    1    |  0   |      2.64      | 44.56 (+139.83%) | 0.36 (+100.00%) |
+    +---------+------+----------------+------------------+-----------------+
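+
+The percentages in parentheses in the v2.10 tables appear to be the change
+relative to the corresponding v2.9 value. The sketch below reproduces that
+arithmetic for the first row of the parallel tables above; the helper name
+``pct_change`` is hypothetical, and the criterion used to decide which cells
+carry a percentage is not stated in this document.
+
+.. code:: python
+
+    def pct_change(baseline, latest):
+        """Relative change of ``latest`` versus ``baseline``, in percent."""
+        return round((latest - baseline) / baseline * 100.0, 2)
+
+    # Values copied from the first row of the v2.9 and v2.10 parallel tables.
+    wakeup = pct_change(10.08, 23.94)
+    cache_flush = pct_change(0.80, 0.42)
+    print(f"Wakeup: {wakeup:+.2f}%")
+    print(f"Cache Flush: {cache_flush:+.2f}%")
+
+For these inputs the results match the ``+137.50%`` and ``-47.50%`` figures
+shown in the first row of the v2.10 table.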
+
+.. table:: ``CPU_SUSPEND`` latencies (µs) to deepest power level in
+        serial (v2.9)
+
+    +---------+------+-----------+--------+-------------+
+    | Cluster | Core | Powerdown | Wakeup | Cache Flush |
+    +---------+------+-----------+--------+-------------+
+    |    0    |  0   |    1.86   |  9.92  |     0.32    |
+    +---------+------+-----------+--------+-------------+
+    |    0    |  0   |    2.70   | 10.48  |     0.36    |
+    +---------+------+-----------+--------+-------------+
+    |    1    |  0   |    1.78   |  9.72  |     0.16    |
+    +---------+------+-----------+--------+-------------+
+    |    1    |  0   |    1.94   | 10.44  |     0.16    |
+    +---------+------+-----------+--------+-------------+
+
+.. table:: ``CPU_SUSPEND`` latencies (µs) to deepest power level in
+        serial (v2.10)
+
+    +---------+------+-----------+------------------+----------------+
+    | Cluster | Core | Powerdown |      Wakeup      |  Cache Flush   |
+    +---------+------+-----------+------------------+----------------+
+    |    0    |  0   |    1.74   | 23.7 (+138.91%)  |      0.3       |
+    +---------+------+-----------+------------------+----------------+
+    |    0    |  0   |    2.08   | 23.96 (+128.63%) | 0.26 (-27.78%) |
+    +---------+------+-----------+------------------+----------------+
+    |    1    |  0   |    1.9    | 23.62 (+143.00%) | 0.28 (+75.00%) |
+    +---------+------+-----------+------------------+----------------+
+    |    1    |  0   |    2.06   | 23.92 (+129.12%) | 0.26 (+62.50%) |
+    +---------+------+-----------+------------------+----------------+
+
+``CPU_SUSPEND`` to power level 0
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. table:: ``CPU_SUSPEND`` latencies (µs) to power level 0 in
+        parallel (v2.9)
+
+    +---------------------------------------------------+
+    |          test_rt_instr_cpu_susp_parallel          |
+    +---------+------+-----------+--------+-------------+
+    | Cluster | Core | Powerdown | Wakeup | Cache Flush |
+    +---------+------+-----------+--------+-------------+
+    |    0    |  0   |    0.88   | 12.32  |     0.26    |
+    +---------+------+-----------+--------+-------------+
+    |    0    |  0   |    2.12   | 14.62  |     0.26    |
+    +---------+------+-----------+--------+-------------+
+    |    1    |  0   |    1.86   | 14.14  |     0.16    |
+    +---------+------+-----------+--------+-------------+
+    |    1    |  0   |    1.92   |  9.44  |     0.18    |
+    +---------+------+-----------+--------+-------------+
+
+.. table:: ``CPU_SUSPEND`` latencies (µs) to power level 0 in
+        parallel (v2.10)
+
+    +---------+------+---------------+------------------+----------------+
+    | Cluster | Core |   Powerdown   |      Wakeup      |  Cache Flush   |
+    +---------+------+---------------+------------------+----------------+
+    |    0    |  0   | 1.5 (+70.45%) | 35.02 (+184.25%) |      0.24      |
+    +---------+------+---------------+------------------+----------------+
+    |    0    |  0   |      1.92     | 38.12 (+160.74%) |      0.28      |
+    +---------+------+---------------+------------------+----------------+
+    |    1    |  0   |      1.88     | 38.1 (+169.45%)  | 0.26 (+62.50%) |
+    +---------+------+---------------+------------------+----------------+
+    |    1    |  0   |      2.04     | 23.1 (+144.70%)  |      0.24      |
+    +---------+------+---------------+------------------+----------------+
+
+.. table:: ``CPU_SUSPEND`` latencies (µs) to power level 0 in serial (v2.9)
+
+    +---------------------------------------------------+
+    |           test_rt_instr_cpu_susp_serial           |
+    +---------+------+-----------+--------+-------------+
+    | Cluster | Core | Powerdown | Wakeup | Cache Flush |
+    +---------+------+-----------+--------+-------------+
+    |    0    |  0   |    1.52   |  9.40  |     0.30    |
+    +---------+------+-----------+--------+-------------+
+    |    0    |  0   |    1.92   |  9.80  |     0.18    |
+    +---------+------+-----------+--------+-------------+
+    |    1    |  0   |    2.20   |  9.60  |     0.14    |
+    +---------+------+-----------+--------+-------------+
+    |    1    |  0   |    1.82   |  9.78  |     0.18    |
+    +---------+------+-----------+--------+-------------+
+
+.. table:: ``CPU_SUSPEND`` latencies (µs) to power level 0 in serial (v2.10)
+
+    +---------+------+-----------+------------------+-----------------+
+    | Cluster | Core | Powerdown |      Wakeup      |   Cache Flush   |
+    +---------+------+-----------+------------------+-----------------+
+    |    0    |  0   |    1.52   | 23.08 (+145.53%) |       0.3       |
+    +---------+------+-----------+------------------+-----------------+
+    |    0    |  0   |    1.98   | 23.68 (+141.63%) |  0.28 (+55.56%) |
+    +---------+------+-----------+------------------+-----------------+
+    |    1    |  0   |    1.84   | 23.86 (+148.54%) | 0.28 (+100.00%) |
+    +---------+------+-----------+------------------+-----------------+
+    |    1    |  0   |    1.98   | 23.68 (+142.13%) |  0.28 (+55.56%) |
+    +---------+------+-----------+------------------+-----------------+
+
+``CPU_OFF`` on all non-lead CPUs
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+``CPU_OFF`` on all non-lead CPUs in sequence, then ``CPU_SUSPEND`` on the lead
+core to the deepest power level.
+
+.. table:: ``CPU_OFF`` latencies (µs) on all non-lead CPUs (v2.9)
+
+    +---------+------+-----------+--------+-------------+
+    | Cluster | Core | Powerdown | Wakeup | Cache Flush |
+    +---------+------+-----------+--------+-------------+
+    |    0    |  0   |    1.84   |  9.94  |     0.32    |
+    +---------+------+-----------+--------+-------------+
+    |    0    |  0   |   14.20   | 13.10  |     0.50    |
+    +---------+------+-----------+--------+-------------+
+    |    1    |  0   |   13.88   | 12.36  |     0.42    |
+    +---------+------+-----------+--------+-------------+
+    |    1    |  0   |   14.40   | 13.26  |     0.52    |
+    +---------+------+-----------+--------+-------------+
+
+.. table:: ``CPU_OFF`` latencies (µs) on all non-lead CPUs (v2.10)
+
+    +---------+------+-----------+------------------+----------------+
+    | Cluster | Core | Powerdown |      Wakeup      |  Cache Flush   |
+    +---------+------+-----------+------------------+----------------+
+    |    0    |  0   |    1.78   | 23.7 (+138.43%)  |      0.3       |
+    +---------+------+-----------+------------------+----------------+
+    |    0    |  0   |   13.96   | 31.16 (+137.86%) | 0.34 (-32.00%) |
+    +---------+------+-----------+------------------+----------------+
+    |    1    |  0   |   13.54   | 30.24 (+144.66%) | 0.26 (-38.10%) |
+    +---------+------+-----------+------------------+----------------+
+    |    1    |  0   |   14.46   | 31.12 (+134.69%) | 0.7 (+34.62%)  |
+    +---------+------+-----------+------------------+----------------+
+
+``CPU_VERSION`` in parallel
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. table:: ``PSCI_VERSION`` latency (µs) in parallel on all cores (v2.9)
+
+    +------------------------------------+
+    | test_rt_instr_psci_version_parallel|
+    +-------------+--------+-------------+
+    |   Cluster   |  Core  |   Latency   |
+    +-------------+--------+-------------+
+    |      0      |   0    |     0.08    |
+    +-------------+--------+-------------+
+    |      0      |   0    |     0.26    |
+    +-------------+--------+-------------+
+    |      1      |   0    |     0.20    |
+    +-------------+--------+-------------+
+    |      1      |   0    |     0.26    |
+    +-------------+--------+-------------+
+
+.. table:: ``PSCI_VERSION`` latency (µs) in parallel on all cores (v2.10)
+
+    +----------------------------------------------+
+    | test_rt_instr_psci_version_parallel (latest) |
+    +-------------+--------+-----------------------+
+    |   Cluster   |  Core  |        Latency        |
+    +-------------+--------+-----------------------+
+    |      0      |   0    |     0.14 (+75.00%)    |
+    +-------------+--------+-----------------------+
+    |      0      |   0    |          0.22         |
+    +-------------+--------+-----------------------+
+    |      1      |   0    |          0.2          |
+    +-------------+--------+-----------------------+
+    |      1      |   0    |          0.26         |
+    +-------------+--------+-----------------------+
+
+--------------
+
+*Copyright (c) 2023, Arm Limited. All rights reserved.*
+
+.. _v2.9-rc0-16-g666aec401: https://review.trustedfirmware.org/plugins/gitiles/TF-A/trusted-firmware-a/+/refs/heads/v2.9-rc0-16-g666aec401
+.. _v2.9-rc0: https://review.trustedfirmware.org/plugins/gitiles/TF-A/tf-a-tests/+/refs/tags/v2.9-rc0
+.. _user guide: https://gitlab.arm.com/arm-reference-solutions/arm-reference-solutions-docs/-/blob/master/docs/n1sdp/user-guide.rst
+.. _Prebuilt Images:  https://downloads.trustedfirmware.org/tf-a/css_scp_2.11.0/n1sdp/release/
+.. _N1SDP: https://developer.arm.com/documentation/101489/latest
diff --git a/docs/perf/tsp.rst b/docs/perf/tsp.rst
new file mode 100644
index 0000000..f8b0048
--- /dev/null
+++ b/docs/perf/tsp.rst
@@ -0,0 +1,27 @@
+Test Secure Payload (TSP) and Dispatcher (TSPD)
+===============================================
+
+Building the Test Secure Payload
+--------------------------------
+
+The TSP is coupled with a companion runtime service in the BL31 firmware,
+called the TSPD. Therefore, if you intend to use the TSP, the BL31 image
+must be recompiled as well. For more information on SPs and SPDs, see the
+:ref:`firmware_design_sel1_spd` section in the :ref:`Firmware Design`.
+
+First clean the TF-A build directory to get rid of any previous BL31 binary.
+Then to build the TSP image use:
+
+.. code:: shell
+
+    make PLAT=<platform> SPD=tspd all
+
+An additional boot loader binary file is created in the ``build`` directory:
+
+::
+
+    build/<platform>/<build-type>/bl32.bin
+
+--------------
+
+*Copyright (c) 2019, Arm Limited. All rights reserved.*