author | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-04-21 17:43:51 +0000 |
---|---|---|
committer | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-04-21 17:43:51 +0000 |
commit | be58c81aff4cd4c0ccf43dbd7998da4a6a08c03b | |
tree | 779c248fb61c83f65d1f0dc867f2053d76b4e03a /docs/perf | |
parent | Initial commit. | |
Adding upstream version 2.10.0+dfsg. (refs: upstream/2.10.0+dfsg, upstream)
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'docs/perf')
-rw-r--r-- | docs/perf/index.rst | 17 | ||||
-rw-r--r-- | docs/perf/performance-monitoring-unit.rst | 158 | ||||
-rw-r--r-- | docs/perf/psci-performance-instr.rst | 116 | ||||
-rw-r--r-- | docs/perf/psci-performance-juno.rst | 533 | ||||
-rw-r--r-- | docs/perf/psci-performance-methodology.rst | 55 | ||||
-rw-r--r-- | docs/perf/psci-performance-n1sdp.rst | 297 | ||||
-rw-r--r-- | docs/perf/tsp.rst | 27 |
7 files changed, 1203 insertions, 0 deletions
diff --git a/docs/perf/index.rst b/docs/perf/index.rst new file mode 100644 index 0000000..0938a17 --- /dev/null +++ b/docs/perf/index.rst @@ -0,0 +1,17 @@ +Performance & Testing +===================== + +.. toctree:: + :maxdepth: 1 + :caption: Contents + + psci-performance-instr + psci-performance-juno + psci-performance-n1sdp + psci-performance-methodology + tsp + performance-monitoring-unit + +-------------- + +*Copyright (c) 2019-2023, Arm Limited. All rights reserved.* diff --git a/docs/perf/performance-monitoring-unit.rst b/docs/perf/performance-monitoring-unit.rst new file mode 100644 index 0000000..5dd1af5 --- /dev/null +++ b/docs/perf/performance-monitoring-unit.rst @@ -0,0 +1,158 @@ +Performance Monitoring Unit +=========================== + +The Performance Monitoring Unit (PMU) allows recording of architectural and +microarchitectural events for profiling purposes. + +This document gives an overview of the PMU counter configuration to assist with +implementation and to complement the PMU security guidelines given in the +:ref:`Secure Development Guidelines` document. + +.. note:: + This section applies to Armv8-A implementations which have version 3 + of the Performance Monitors Extension (PMUv3). + +PMU Counters +------------ + +The PMU makes 32 counters available at all privilege levels: + +- 31 programmable event counters: ``PMEVCNTR<n>``, where ``n`` is ``0`` to + ``30``. +- A dedicated cycle counter: ``PMCCNTR``. 
+ +Architectural mappings +~~~~~~~~~~~~~~~~~~~~~~ + ++--------------+---------+----------------------------+ +| Counters | State | System Register Name | ++==============+=========+============================+ +| | AArch64 | ``PMEVCNTR<n>_EL0[63*:0]`` | +| Programmable +---------+----------------------------+ +| | AArch32 | ``PMEVCNTR<n>[31:0]`` | ++--------------+---------+----------------------------+ +| | AArch64 | ``PMCCNTR_EL0[63:0]`` | +| Cycle +---------+----------------------------+ +| | AArch32 | ``PMCCNTR[63:0]`` | ++--------------+---------+----------------------------+ + +.. note:: + Bits [63:32] are only available if ARMv8.5-PMU is implemented. Refer to the + `Arm ARM`_ for a detailed description of ARMv8.5-PMU features. + +Configuring the PMU for counting events +--------------------------------------- + +Each programmable counter has an associated register, ``PMEVTYPER<n>`` which +configures it. The cycle counter has the ``PMCCFILTR_EL0`` register, which has +an identical function and bit field layout as ``PMEVTYPER<n>``. In addition, +the counters are enabled (permitted to increment) via the ``PMCNTENSET`` and +``PMCR`` registers. These can be accessed at all privilege levels. + +Architectural mappings +~~~~~~~~~~~~~~~~~~~~~~ + ++-----------------------------+------------------------+ +| AArch64 | AArch32 | ++=============================+========================+ +| ``PMEVTYPER<n>_EL0[63*:0]`` | ``PMEVTYPER<n>[31:0]`` | ++-----------------------------+------------------------+ +| ``PMCCFILTR_EL0[63*:0]`` | ``PMCCFILTR[31:0]`` | ++-----------------------------+------------------------+ +| ``PMCNTENSET_EL0[63*:0]`` | ``PMCNTENSET[31:0]`` | ++-----------------------------+------------------------+ +| ``PMCR_EL0[63*:0]`` | ``PMCR[31:0]`` | ++-----------------------------+------------------------+ + +.. note:: + Bits [63:32] are reserved. 
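Editorial sketch (not part of the committed file): the layering of the enable bits can be modelled in a few lines of Python. The bit positions are the ones this file lists under "Relevant register fields" (``PMCR.E`` at bit 0, ``PMCNTENSET.C`` at bit 31, ``PMCNTENSET.P[n]`` at bit ``n``). The function names are hypothetical, and the model deliberately ignores the per-exception-level filtering done by ``PMEVTYPER<n>`` and the effect of ``PMCR.DP``.

```python
# Toy model of the PMU enable layering; it does not touch real hardware.
# Bit positions follow the "Relevant register fields" list in this file.
PMCR_E = 1 << 0          # PMCR.E: global enable for all counters
PMCNTENSET_C = 1 << 31   # PMCNTENSET.C: enable for the cycle counter


def event_counter_enabled(pmcr: int, pmcntenset: int, n: int) -> bool:
    """Return True if PMEVCNTR<n> is permitted to increment.

    Only the PMCR.E and PMCNTENSET.P[n] layers are modelled; the
    PMEVTYPER<n> filters would be applied on top of this result.
    """
    if not 0 <= n <= 30:
        raise ValueError("programmable counters are numbered 0 to 30")
    return bool(pmcr & PMCR_E) and bool(pmcntenset & (1 << n))


def cycle_counter_enabled(pmcr: int, pmcntenset: int) -> bool:
    """Return True if PMCCNTR is permitted to increment (PMCR.DP ignored)."""
    return bool(pmcr & PMCR_E) and bool(pmcntenset & PMCNTENSET_C)
```

With ``PMCR.E`` clear, both helpers return ``False`` whatever ``PMCNTENSET`` holds, which matches the statement that no counter increments while the global enable is ``0``.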
+ +Relevant register fields +~~~~~~~~~~~~~~~~~~~~~~~~ + +For ``PMEVTYPER<n>_EL0``/``PMEVTYPER<n>`` and ``PMCCFILTR_EL0/PMCCFILTR``, the +most important fields are: + +- ``P``: + + - Bit 31. + - If set to ``0``, will increment the associated ``PMEVCNTR<n>`` at EL1. + +- ``NSK``: + + - Bit 29. + - If equal to the ``P`` bit it enables the associated ``PMEVCNTR<n>`` at + Non-secure EL1. + - Reserved if EL3 not implemented. + +- ``NSH``: + + - Bit 27. + - If set to ``1``, will increment the associated ``PMEVCNTR<n>`` at EL2. + - Reserved if EL2 not implemented. + +- ``SH``: + + - Bit 24. + - If different to the ``NSH`` bit it enables the associated ``PMEVCNTR<n>`` + at Secure EL2. + - Reserved if Secure EL2 not implemented. + +- ``M``: + + - Bit 26. + - If equal to the ``P`` bit it enables the associated ``PMEVCNTR<n>`` at + EL3. + +- ``evtCount[15:10]``: + + - Extension to ``evtCount[9:0]``. Reserved unless ARMv8.1-PMU implemented. + +- ``evtCount[9:0]``: + + - The event number that the associated ``PMEVCNTR<n>`` will count. + +For ``PMCNTENSET_EL0``/``PMCNTENSET``, the most important fields are: + +- ``P[30:0]``: + + - Setting bit ``P[n]`` to ``1`` enables counter ``PMEVCNTR<n>``. + - The effects of ``PMEVTYPER<n>`` are applied on top of this. + In other words, the counter will not increment at any privilege level or + security state unless it is enabled here. + +- ``C``: + + - Bit 31. + - If set to ``1`` enables the cycle counter ``PMCCNTR``. + +For ``PMCR``/``PMCR_EL0``, the most important fields are: + +- ``DP``: + + - Bit 5. + - If set to ``1`` it disables the cycle counter ``PMCCNTR`` where event + counting (by ``PMEVCNTR<n>``) is prohibited (e.g. EL2 and the Secure + world). + - If set to ``0``, ``PMCCNTR`` will not be affected by this bit and + therefore will be able to count where the programmable counters are + prohibited. + +- ``E``: + + - Bit 0. + - Enables/disables counting altogether. 
+ - The effects of ``PMCNTENSET`` and ``PMCR.DP`` are applied on top of this. + In other words, if this bit is ``0`` then no counters will increment + regardless of how the other PMU system registers or bit fields are + configured. + +.. rubric:: References + +- `Arm ARM`_ + +-------------- + +*Copyright (c) 2019-2020, Arm Limited and Contributors. All rights reserved.* + +.. _Arm ARM: https://developer.arm.com/docs/ddi0487/latest diff --git a/docs/perf/psci-performance-instr.rst b/docs/perf/psci-performance-instr.rst new file mode 100644 index 0000000..41094b2 --- /dev/null +++ b/docs/perf/psci-performance-instr.rst @@ -0,0 +1,116 @@ +PSCI Performance Measurement +============================ + +TF-A provides two instrumentation tools for performing analysis of the PSCI +implementation: + +* PSCI STAT +* Runtime Instrumentation + +This page explains how they may be enabled and used to perform all varieties of +analysis. + +Performance Measurement Framework +--------------------------------- + +The Performance Measurement Framework :ref:`PMF <firmware_design_pmf>` +is a framework that provides mechanisms for collecting and retrieving timestamps +at runtime from the Performance Measurement Unit +(:ref:`PMU <Performance Monitoring Unit>`). +The PMU is a generalized abstraction for accessing CPU hardware registers used to +measure hardware events. This means, for instance, that the PMU might be used to +place instrumentation points at logical locations in code for tracing purposes. + +TF-A utilises the PMF as a backend for the two instrumentation services it +provides--PSCI Statistics and Runtime Instrumentation. The PMF is used by +these services to facilitate collection and retrieval of timestamps. For +instance, the PSCI Statistics service registers the PMF service +``psci_svc`` to track its residency statistics. + +This is reserved a unique ID, name, and space in memory by the PMF. 
The +framework provides a convenient interface for PSCI Statistics to retrieve +values from ``psci_svc`` at runtime. Alternatively, the service may be +configured such that the PMF dumps those values to the console. A platform may +choose to expose SMCs that allow retrieval of these timestamps from the +service. + +This feature is enabled with the Boolean flag ``ENABLE_PMF``. + +PSCI Statistics +--------------- + +PSCI Statistics is a runtime service that provides residency statistics for +power states used by the platform. The service tracks residency time and +entry count. Residency time is the total time spent in a particular power +state by a PE. The entry count is the number of times the PE has entered +the power state. PSCI Statistics implements the optional functions +``PSCI_STAT_RESIDENCY`` and ``PSCI_STAT_COUNT`` from the `PSCI`_ +specification. + + +.. c:macro:: PSCI_STAT_RESIDENCY + + :param target_cpu: Contains copy of affinity fields in the MPIDR register + for identifying the target core (See section 5.1.4 of `PSCI`_ + specifications for more details). + :param power_state: identifier for a specific local + state. Generally, this parameter takes the same form as the power_state + parameter described for CPU_SUSPEND in section 5.4.2. + + :returns: Time spent in ``power_state``, in microseconds, by ``target_cpu`` + and the highest level expressed in ``power_state``. + + +.. c:macro:: PSCI_STAT_COUNT + + :param target_cpu: follows the same format as ``PSCI_STAT_RESIDENCY``. + :param power_state: follows the same format as ``PSCI_STAT_RESIDENCY``. + + :returns: Number of times the state expressed in ``power_state`` has been + used by ``target_cpu`` and the highest level expressed in + ``power_state``. + +The implementation provides residency statistics only for low power states, +and does this regardless of the entry mechanism into those states. The +statistics it collects are set to 0 during shutdown or reset. 
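Editorial sketch (not part of the committed file): the semantics of the two calls can be illustrated with a small Python model. The class and method names are hypothetical and this is not how TF-A stores its statistics; the sketch only shows that residency is an accumulated time, the count is a number of entries, and both return to 0 on reset.

```python
class ResidencyStats:
    """Toy model of per-CPU, per-power-state residency statistics."""

    def __init__(self) -> None:
        self.reset()

    def reset(self) -> None:
        """Clear all statistics, as happens on shutdown or reset."""
        self._residency_us = {}
        self._count = {}

    def record(self, target_cpu: int, power_state: int,
               entry_us: int, exit_us: int) -> None:
        """Account for one stay in a low power state."""
        if exit_us < entry_us:
            raise ValueError("exit time precedes entry time")
        key = (target_cpu, power_state)
        self._residency_us[key] = self._residency_us.get(key, 0) + (exit_us - entry_us)
        self._count[key] = self._count.get(key, 0) + 1

    def stat_residency(self, target_cpu: int, power_state: int) -> int:
        """Total time in microseconds, in the spirit of PSCI_STAT_RESIDENCY."""
        return self._residency_us.get((target_cpu, power_state), 0)

    def stat_count(self, target_cpu: int, power_state: int) -> int:
        """Number of entries, in the spirit of PSCI_STAT_COUNT."""
        return self._count.get((target_cpu, power_state), 0)
```

For example, two recorded stays of 100 µs and 250 µs in the same state yield a residency of 350 and a count of 2 for that CPU, and both values read back as 0 after ``reset()``.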
+ +PSCI Statistics is enabled with the Boolean build flag +``ENABLE_PSCI_STAT``. All Arm platforms utilise the PMF unless another +collection backend is provided (``ENABLE_PMF`` is implicitly enabled). + +Runtime Instrumentation +----------------------- + +The Runtime Instrumentation Service is an instrumentation tool that wraps +around the PMF to provide timestamp data. Although the service is not +restricted to PSCI, it is used primarily in TF-A to quantify the total time +spent in the PSCI implementation. The tool can be used to instrument other +components in TF-A as well. It is enabled with the Boolean flag +``ENABLE_RUNTIME_INSTRUMENTATION``, and as with PSCI STAT, requires PMF to +be enabled. + +In PSCI, this service provides instrumentation points in the +following code paths: + +* Entry into the PSCI SMC handler +* Exit from the PSCI SMC handler +* Entry to low power state +* Exit from low power state +* Entry into cache maintenance operations in PSCI +* Exit from cache maintenance operations in PSCI + +The service captures the cycle count, which allows for the time spent in the +implementation to be calculated, given the frequency counter. + +PSCI SMC Handler Instrumentation +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The timestamp during entry into the handler is captured as early as possible +during the runtime exception, prior to entry into the handler itself. All +timestamps are stored in memory for later retrieval. The exit timestamp is +captured after normal return from the PSCI SMC handler, or, if a low power state +was requested, it is captured in the warm boot path. + +*Copyright (c) 2023, Arm Limited. All rights reserved.* + +.. 
_PSCI: https://developer.arm.com/documentation/den0022/latest/ diff --git a/docs/perf/psci-performance-juno.rst b/docs/perf/psci-performance-juno.rst new file mode 100644 index 0000000..bab1086 --- /dev/null +++ b/docs/perf/psci-performance-juno.rst @@ -0,0 +1,533 @@ +PSCI Performance Measurements on Arm Juno Development Platform +============================================================== + +This document summarises the findings of performance measurements of key +operations in the Trusted Firmware-A Power State Coordination Interface (PSCI) +implementation, using the in-built Performance Measurement Framework (PMF) and +runtime instrumentation timestamps. + +Method +------ + +We used the `Juno R1 platform`_ for these tests, which has 4 x Cortex-A53 and 2 +x Cortex-A57 clusters running at the following frequencies: + ++-----------------+--------------------+ +| Domain | Frequency (MHz) | ++=================+====================+ +| Cortex-A57 | 900 (nominal) | ++-----------------+--------------------+ +| Cortex-A53 | 650 (underdrive) | ++-----------------+--------------------+ +| AXI subsystem | 533 | ++-----------------+--------------------+ + +Juno supports CPU, cluster and system power down states, corresponding to power +levels 0, 1 and 2 respectively. It does not support any retention states. + +Given that runtime instrumentation using PMF is invasive, there is a small +(unquantified) overhead on the results. PMF uses the generic counter for +timestamps, which runs at 50MHz on Juno. + +The following source trees and binaries were used: + +- TF-A [`v2.9-rc0`_] +- TFTF [`v2.9-rc0`_] + +Please see the Runtime Instrumentation :ref:`Testing Methodology +<Runtime Instrumentation Methodology>` +page for more details. + +Procedure +--------- + +#. Build TFTF with runtime instrumentation enabled: + + .. code:: shell + + make CROSS_COMPILE=aarch64-none-elf- PLAT=juno \ + TESTS=runtime-instrumentation all + +#. Fetch Juno's SCP binary from TF-A's archive: + + .. 
code:: shell + + curl --fail --connect-timeout 5 --retry 5 -sLS -o scp_bl2.bin \ + https://downloads.trustedfirmware.org/tf-a/css_scp_2.12.0/juno/release/juno-bl2.bin + +#. Build TF-A with the following build options: + + .. code:: shell + + make CROSS_COMPILE=aarch64-none-elf- PLAT=juno \ + BL33="/path/to/tftf.bin" SCP_BL2="scp_bl2.bin" \ + ENABLE_RUNTIME_INSTRUMENTATION=1 fiptool all fip + +#. Load the following images onto the development board: ``fip.bin``, + ``scp_bl2.bin``. + +Results +------- + +``CPU_SUSPEND`` to deepest power level +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. table:: ``CPU_SUSPEND`` latencies (µs) to deepest power level in + parallel (v2.9) + + +---------+------+-----------+--------+-------------+ + | Cluster | Core | Powerdown | Wakeup | Cache Flush | + +---------+------+-----------+--------+-------------+ + | 0 | 0 | 104.58 | 241.20 | 5.26 | + +---------+------+-----------+--------+-------------+ + | 0 | 1 | 384.24 | 22.50 | 138.76 | + +---------+------+-----------+--------+-------------+ + | 1 | 0 | 244.56 | 22.18 | 5.16 | + +---------+------+-----------+--------+-------------+ + | 1 | 1 | 670.56 | 18.58 | 4.44 | + +---------+------+-----------+--------+-------------+ + | 1 | 2 | 809.36 | 269.28 | 4.44 | + +---------+------+-----------+--------+-------------+ + | 1 | 3 | 984.96 | 219.70 | 79.62 | + +---------+------+-----------+--------+-------------+ + +.. 
table:: ``CPU_SUSPEND`` latencies (µs) to deepest power level in + parallel (v2.10) + + +---------+------+-------------------+--------+-------------+ + | Cluster | Core | Powerdown | Wakeup | Cache Flush | + +---------+------+-------------------+--------+-------------+ + | 0 | 0 | 242.66 (+132.03%) | 245.1 | 5.4 | + +---------+------+-------------------+--------+-------------+ + | 0 | 1 | 522.08 (+35.87%) | 26.24 | 138.32 | + +---------+------+-------------------+--------+-------------+ + | 1 | 0 | 104.36 (-57.33%) | 27.1 | 5.32 | + +---------+------+-------------------+--------+-------------+ + | 1 | 1 | 382.56 (-42.95%) | 23.34 | 4.42 | + +---------+------+-------------------+--------+-------------+ + | 1 | 2 | 807.74 | 271.54 | 4.64 | + +---------+------+-------------------+--------+-------------+ + | 1 | 3 | 981.36 | 221.8 | 79.48 | + +---------+------+-------------------+--------+-------------+ + +.. table:: ``CPU_SUSPEND`` latencies (µs) to deepest power level in + serial (v2.9) + + +---------+------+-----------+--------+-------------+ + | Cluster | Core | Powerdown | Wakeup | Cache Flush | + +---------+------+-----------+--------+-------------+ + | 0 | 0 | 236.56 | 23.24 | 138.18 | + +---------+------+-----------+--------+-------------+ + | 0 | 1 | 236.86 | 23.28 | 138.10 | + +---------+------+-----------+--------+-------------+ + | 1 | 0 | 281.04 | 22.80 | 77.24 | + +---------+------+-----------+--------+-------------+ + | 1 | 1 | 100.28 | 18.52 | 4.54 | + +---------+------+-----------+--------+-------------+ + | 1 | 2 | 100.12 | 18.78 | 4.50 | + +---------+------+-----------+--------+-------------+ + | 1 | 3 | 100.36 | 18.94 | 4.44 | + +---------+------+-----------+--------+-------------+ + +.. 
table:: ``CPU_SUSPEND`` latencies (µs) to deepest power level in + serial (v2.10) + + +---------+------+-----------+--------+-------------+ + | Cluster | Core | Powerdown | Wakeup | Cache Flush | + +---------+------+-----------+--------+-------------+ + | 0 | 0 | 236.84 | 27.1 | 138.36 | + +---------+------+-----------+--------+-------------+ + | 0 | 1 | 236.96 | 27.1 | 138.32 | + +---------+------+-----------+--------+-------------+ + | 1 | 0 | 280.06 | 26.94 | 77.5 | + +---------+------+-----------+--------+-------------+ + | 1 | 1 | 100.76 | 23.42 | 4.36 | + +---------+------+-----------+--------+-------------+ + | 1 | 2 | 100.02 | 23.42 | 4.44 | + +---------+------+-----------+--------+-------------+ + | 1 | 3 | 100.08 | 23.2 | 4.4 | + +---------+------+-----------+--------+-------------+ + +``CPU_SUSPEND`` to power level 0 +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. table:: ``CPU_SUSPEND`` latencies (µs) to power level 0 in + parallel (v2.9) + + +---------+------+-----------+--------+-------------+ + | Cluster | Core | Powerdown | Wakeup | Cache Flush | + +---------+------+-----------+--------+-------------+ + | 0 | 0 | 662.34 | 15.22 | 8.08 | + +---------+------+-----------+--------+-------------+ + | 0 | 1 | 802.00 | 15.50 | 8.16 | + +---------+------+-----------+--------+-------------+ + | 1 | 0 | 385.22 | 15.74 | 7.88 | + +---------+------+-----------+--------+-------------+ + | 1 | 1 | 106.16 | 16.06 | 7.44 | + +---------+------+-----------+--------+-------------+ + | 1 | 2 | 524.38 | 15.64 | 7.34 | + +---------+------+-----------+--------+-------------+ + | 1 | 3 | 246.00 | 15.78 | 7.72 | + +---------+------+-----------+--------+-------------+ + +.. 
table:: ``CPU_SUSPEND`` latencies (µs) to power level 0 in + parallel (v2.10) + + +---------+------+-------------------+--------+-------------+ + | Cluster | Core | Powerdown | Wakeup | Cache Flush | + +---------+------+-------------------+--------+-------------+ + | 0 | 0 | 801.04 | 18.66 | 8.22 | + +---------+------+-------------------+--------+-------------+ + | 0 | 1 | 661.28 | 19.08 | 7.88 | + +---------+------+-------------------+--------+-------------+ + | 1 | 0 | 105.9 (-72.51%) | 20.3 | 7.58 | + +---------+------+-------------------+--------+-------------+ + | 1 | 1 | 383.58 (+261.32%) | 20.4 | 7.42 | + +---------+------+-------------------+--------+-------------+ + | 1 | 2 | 523.52 | 20.1 | 7.74 | + +---------+------+-------------------+--------+-------------+ + | 1 | 3 | 244.5 | 20.16 | 7.56 | + +---------+------+-------------------+--------+-------------+ + +.. table:: ``CPU_SUSPEND`` latencies (µs) to power level 0 in serial (v2.9) + + +---------+------+-----------+--------+-------------+ + | Cluster | Core | Powerdown | Wakeup | Cache Flush | + +---------+------+-----------+--------+-------------+ + | 0 | 0 | 99.80 | 15.94 | 5.42 | + +---------+------+-----------+--------+-------------+ + | 0 | 1 | 99.76 | 15.80 | 5.24 | + +---------+------+-----------+--------+-------------+ + | 1 | 0 | 278.26 | 16.16 | 4.58 | + +---------+------+-----------+--------+-------------+ + | 1 | 1 | 96.88 | 16.00 | 4.52 | + +---------+------+-----------+--------+-------------+ + | 1 | 2 | 96.80 | 16.12 | 4.54 | + +---------+------+-----------+--------+-------------+ + | 1 | 3 | 96.88 | 16.12 | 4.54 | + +---------+------+-----------+--------+-------------+ + +.. 
table:: ``CPU_SUSPEND`` latencies (µs) to power level 0 in serial (v2.10) + + +---------+------+-----------+--------+-------------+ + | Cluster | Core | Powerdown | Wakeup | Cache Flush | + +---------+------+-----------+--------+-------------+ + | 0 | 0 | 99.84 | 18.86 | 5.54 | + +---------+------+-----------+--------+-------------+ + | 0 | 1 | 100.2 | 18.82 | 5.66 | + +---------+------+-----------+--------+-------------+ + | 1 | 0 | 278.12 | 20.56 | 4.48 | + +---------+------+-----------+--------+-------------+ + | 1 | 1 | 96.68 | 20.62 | 4.3 | + +---------+------+-----------+--------+-------------+ + | 1 | 2 | 96.94 | 20.14 | 4.42 | + +---------+------+-----------+--------+-------------+ + | 1 | 3 | 96.68 | 20.46 | 4.32 | + +---------+------+-----------+--------+-------------+ + +``CPU_OFF`` on all non-lead CPUs +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +``CPU_OFF`` on all non-lead CPUs in sequence then, ``CPU_SUSPEND`` on the lead +core to the deepest power level. + +.. table:: ``CPU_OFF`` latencies (µs) on all non-lead CPUs (v2.9) + + +---------+------+-----------+--------+-------------+ + | Cluster | Core | Powerdown | Wakeup | Cache Flush | + +---------+------+-----------+--------+-------------+ + | 0 | 0 | 235.76 | 26.14 | 137.80 | + +---------+------+-----------+--------+-------------+ + | 0 | 1 | 235.40 | 25.72 | 137.62 | + +---------+------+-----------+--------+-------------+ + | 1 | 0 | 174.70 | 22.40 | 77.26 | + +---------+------+-----------+--------+-------------+ + | 1 | 1 | 100.92 | 24.04 | 4.52 | + +---------+------+-----------+--------+-------------+ + | 1 | 2 | 100.68 | 22.44 | 4.36 | + +---------+------+-----------+--------+-------------+ + | 1 | 3 | 101.36 | 22.70 | 4.52 | + +---------+------+-----------+--------+-------------+ + +.. 
table:: ``CPU_OFF`` latencies (µs) on all non-lead CPUs (v2.10) + + +---------------------------------------------------+ + | test_rt_instr_cpu_off_serial (latest) | + +---------+------+-----------+--------+-------------+ + | Cluster | Core | Powerdown | Wakeup | Cache Flush | + +---------+------+-----------+--------+-------------+ + | 0 | 0 | 236.04 | 30.02 | 137.9 | + +---------+------+-----------+--------+-------------+ + | 0 | 1 | 235.38 | 29.7 | 137.72 | + +---------+------+-----------+--------+-------------+ + | 1 | 0 | 175.18 | 26.96 | 77.26 | + +---------+------+-----------+--------+-------------+ + | 1 | 1 | 100.56 | 28.34 | 4.32 | + +---------+------+-----------+--------+-------------+ + | 1 | 2 | 100.38 | 26.82 | 4.3 | + +---------+------+-----------+--------+-------------+ + | 1 | 3 | 100.86 | 26.98 | 4.42 | + +---------+------+-----------+--------+-------------+ + +``CPU_VERSION`` in parallel +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. table:: ``CPU_VERSION`` latency (µs) in parallel on all cores (2.9) + + +-------------+--------+-------------+ + | Cluster | Core | Latency | + +-------------+--------+-------------+ + | 0 | 0 | 1.48 | + +-------------+--------+-------------+ + | 0 | 1 | 1.04 | + +-------------+--------+-------------+ + | 1 | 0 | 0.56 | + +-------------+--------+-------------+ + | 1 | 1 | 0.92 | + +-------------+--------+-------------+ + | 1 | 2 | 0.96 | + +-------------+--------+-------------+ + | 1 | 3 | 0.96 | + +-------------+--------+-------------+ + +.. 
table:: ``CPU_VERSION`` latency (µs) in parallel on all cores (2.10) + + +-------------+--------+----------------------+ + | Cluster | Core | Latency | + +-------------+--------+----------------------+ + | 0 | 0 | 1.1 (-25.68%) | + +-------------+--------+----------------------+ + | 0 | 1 | 1.06 | + +-------------+--------+----------------------+ + | 1 | 0 | 0.58 | + +-------------+--------+----------------------+ + | 1 | 1 | 0.88 | + +-------------+--------+----------------------+ + | 1 | 2 | 0.92 | + +-------------+--------+----------------------+ + | 1 | 3 | 0.9 | + +-------------+--------+----------------------+ + +Annotated Historic Results +-------------------------- + +The following results are based on the upstream `TF master as of 31/01/2017`_. +TF-A was built using the same build instructions as detailed in the procedure +above. + +In the results below, CPUs 0-3 refer to CPUs in the little cluster (A53) and +CPUs 4-5 refer to CPUs in the big cluster (A57). In all cases CPU 4 is the lead +CPU. + +``PSCI_ENTRY`` corresponds to the powerdown latency, ``PSCI_EXIT`` the wakeup latency, and +``CFLUSH_OVERHEAD`` the latency of the cache flush operation. 
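Editorial sketch (not part of the committed file): each latency in these tables is a difference between two timestamps, converted to time using the counter frequency. The Python below assumes the 50 MHz generic counter stated under "Method"; whether the 2017 measurements used the same frequency is an assumption. The function names and the sample tick values are hypothetical.

```python
COUNTER_HZ = 50_000_000  # generic counter frequency stated for Juno


def ticks_to_us(ticks: int, counter_hz: int = COUNTER_HZ) -> float:
    """Convert a number of counter ticks to microseconds."""
    return ticks * 1_000_000 / counter_hz


def latency_us(start_ticks: int, end_ticks: int,
               counter_hz: int = COUNTER_HZ) -> float:
    """Latency between two timestamps captured from the same counter."""
    if end_ticks < start_ticks:
        raise ValueError("end timestamp precedes start timestamp")
    return ticks_to_us(end_ticks - start_ticks, counter_hz)


# Hypothetical pair of timestamps that are 5229 ticks apart.
powerdown = latency_us(1_000_000, 1_005_229)
```

At 50 MHz one tick lasts 0.02 µs (20 ns), so every latency derived this way is a multiple of 0.02 µs.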
+ +``CPU_SUSPEND`` to deepest power level on all CPUs in parallel +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + ++-------+---------------------+--------------------+--------------------------+ +| CPU | ``PSCI_ENTRY`` (us) | ``PSCI_EXIT`` (us) | ``CFLUSH_OVERHEAD`` (us) | ++=======+=====================+====================+==========================+ +| 0 | 27 | 20 | 5 | ++-------+---------------------+--------------------+--------------------------+ +| 1 | 114 | 86 | 5 | ++-------+---------------------+--------------------+--------------------------+ +| 2 | 202 | 58 | 5 | ++-------+---------------------+--------------------+--------------------------+ +| 3 | 375 | 29 | 94 | ++-------+---------------------+--------------------+--------------------------+ +| 4 | 20 | 22 | 6 | ++-------+---------------------+--------------------+--------------------------+ +| 5 | 290 | 18 | 206 | ++-------+---------------------+--------------------+--------------------------+ + +A large variance in ``PSCI_ENTRY`` and ``PSCI_EXIT`` times across CPUs is +observed due to TF PSCI lock contention. In the worst case, CPU 3 has to wait +for the 3 other CPUs in the cluster (0-2) to complete ``PSCI_ENTRY`` and release +the lock before proceeding. + +The ``CFLUSH_OVERHEAD`` times for CPUs 3 and 5 are higher because they are the +last CPUs in their respective clusters to power down, therefore both the L1 and +L2 caches are flushed. + +The ``CFLUSH_OVERHEAD`` time for CPU 5 is a lot larger than that for CPU 3 +because the L2 cache size for the big cluster is lot larger (2MB) compared to +the little cluster (1MB). 
+ +``CPU_SUSPEND`` to power level 0 on all CPUs in parallel +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + ++-------+---------------------+--------------------+--------------------------+ +| CPU | ``PSCI_ENTRY`` (us) | ``PSCI_EXIT`` (us) | ``CFLUSH_OVERHEAD`` (us) | ++=======+=====================+====================+==========================+ +| 0 | 116 | 14 | 8 | ++-------+---------------------+--------------------+--------------------------+ +| 1 | 204 | 14 | 8 | ++-------+---------------------+--------------------+--------------------------+ +| 2 | 287 | 13 | 8 | ++-------+---------------------+--------------------+--------------------------+ +| 3 | 376 | 13 | 9 | ++-------+---------------------+--------------------+--------------------------+ +| 4 | 29 | 15 | 7 | ++-------+---------------------+--------------------+--------------------------+ +| 5 | 21 | 15 | 8 | ++-------+---------------------+--------------------+--------------------------+ + +There is no lock contention in TF generic code at power level 0 but the large +variance in ``PSCI_ENTRY`` times across CPUs is due to lock contention in Juno +platform code. The platform lock is used to mediate access to a single SCP +communication channel. This is compounded by the SCP firmware waiting for each +AP CPU to enter WFI before making the channel available to other CPUs, which +effectively serializes the SCP power down commands from all CPUs. + +On platforms with a more efficient CPU power down mechanism, it should be +possible to make the ``PSCI_ENTRY`` times smaller and consistent. + +The ``PSCI_EXIT`` times are consistent across all CPUs because TF does not +require locks at power level 0. + +The ``CFLUSH_OVERHEAD`` times for all CPUs are small and consistent since only +the cache associated with power level 0 is flushed (L1). 
+ +``CPU_SUSPEND`` to deepest power level on all CPUs in sequence +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + ++-------+---------------------+--------------------+--------------------------+ +| CPU | ``PSCI_ENTRY`` (us) | ``PSCI_EXIT`` (us) | ``CFLUSH_OVERHEAD`` (us) | ++=======+=====================+====================+==========================+ +| 0 | 114 | 20 | 94 | ++-------+---------------------+--------------------+--------------------------+ +| 1 | 114 | 20 | 94 | ++-------+---------------------+--------------------+--------------------------+ +| 2 | 114 | 20 | 94 | ++-------+---------------------+--------------------+--------------------------+ +| 3 | 114 | 20 | 94 | ++-------+---------------------+--------------------+--------------------------+ +| 4 | 195 | 22 | 180 | ++-------+---------------------+--------------------+--------------------------+ +| 5 | 21 | 17 | 6 | ++-------+---------------------+--------------------+--------------------------+ + +The ``CFLUSH_OVERHEAD`` times for lead CPU 4 and all CPUs in the non-lead cluster +are large because all other CPUs in the cluster are powered down during the +test. The ``CPU_SUSPEND`` call powers down to the cluster level, requiring a +flush of both L1 and L2 caches. + +The ``CFLUSH_OVERHEAD`` time for CPU 4 is a lot larger than those for the little +CPUs because the L2 cache size for the big cluster is lot larger (2MB) compared +to the little cluster (1MB). + +The ``PSCI_ENTRY`` and ``CFLUSH_OVERHEAD`` times for CPU 5 are low because lead +CPU 4 continues to run while CPU 5 is suspended. Hence CPU 5 only powers down to +level 0, which only requires L1 cache flush. 
+ +``CPU_SUSPEND`` to power level 0 on all CPUs in sequence +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + ++-------+---------------------+--------------------+--------------------------+ +| CPU | ``PSCI_ENTRY`` (us) | ``PSCI_EXIT`` (us) | ``CFLUSH_OVERHEAD`` (us) | ++=======+=====================+====================+==========================+ +| 0 | 22 | 14 | 5 | ++-------+---------------------+--------------------+--------------------------+ +| 1 | 22 | 14 | 5 | ++-------+---------------------+--------------------+--------------------------+ +| 2 | 21 | 14 | 5 | ++-------+---------------------+--------------------+--------------------------+ +| 3 | 22 | 14 | 5 | ++-------+---------------------+--------------------+--------------------------+ +| 4 | 17 | 14 | 6 | ++-------+---------------------+--------------------+--------------------------+ +| 5 | 18 | 15 | 6 | ++-------+---------------------+--------------------+--------------------------+ + +Here the times are small and consistent since there is no contention and it is +only necessary to flush the cache to power level 0 (L1). This is the best case +scenario. + +The ``PSCI_ENTRY`` times for CPUs in the big cluster are slightly smaller than +for the CPUs in little cluster due to greater CPU performance. + +The ``PSCI_EXIT`` times are generally lower than in the last test because the +cluster remains powered on throughout the test and there is less code to execute +on power on (for example, no need to enter CCI coherency) + +``CPU_OFF`` on all non-lead CPUs in sequence then ``CPU_SUSPEND`` on lead CPU to deepest power level +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The test sequence here is as follows: + +1. Call ``CPU_ON`` and ``CPU_OFF`` on each non-lead CPU in sequence. + +2. Program wake up timer and suspend the lead CPU to the deepest power level. + +3. Call ``CPU_ON`` on non-lead CPU to get the timestamps from each CPU. 
+
++-------+---------------------+--------------------+--------------------------+
+| CPU   | ``PSCI_ENTRY`` (us) | ``PSCI_EXIT`` (us) | ``CFLUSH_OVERHEAD`` (us) |
++=======+=====================+====================+==========================+
+| 0     | 110                 | 28                 | 93                       |
++-------+---------------------+--------------------+--------------------------+
+| 1     | 110                 | 28                 | 93                       |
++-------+---------------------+--------------------+--------------------------+
+| 2     | 110                 | 28                 | 93                       |
++-------+---------------------+--------------------+--------------------------+
+| 3     | 111                 | 28                 | 93                       |
++-------+---------------------+--------------------+--------------------------+
+| 4     | 195                 | 22                 | 181                      |
++-------+---------------------+--------------------+--------------------------+
+| 5     | 20                  | 23                 | 6                        |
++-------+---------------------+--------------------+--------------------------+
+
+The ``CFLUSH_OVERHEAD`` times for all little CPUs are large because all other
+CPUs in that cluster are powered down during the test. The ``CPU_OFF`` call
+powers down to the cluster level, requiring a flush of both L1 and L2 caches.
+
+The ``PSCI_ENTRY`` and ``CFLUSH_OVERHEAD`` times for CPU 5 are small because
+lead CPU 4 is running and CPU 5 only powers down to level 0, which only requires
+an L1 cache flush.
+
+The ``CFLUSH_OVERHEAD`` time for CPU 4 is a lot larger than those for the little
+CPUs because the L2 cache of the big cluster (2MB) is twice the size of the L2
+cache of the little cluster (1MB).
+
+The ``PSCI_EXIT`` times for CPUs in the big cluster are slightly smaller than
+for CPUs in the little cluster due to greater CPU performance. These times
+generally are greater than the ``PSCI_EXIT`` times in the ``CPU_SUSPEND`` tests
+because there is more code to execute in the "on finisher" compared to the
+"suspend finisher" (for example, GIC redistributor register programming).
+
+``PSCI_VERSION`` on all CPUs in parallel
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Since very little code is associated with ``PSCI_VERSION``, this test
+approximates the round trip latency for handling a fast SMC at EL3 in TF.
+
++-------+-------------------+
+| CPU   | TOTAL TIME (ns)   |
++=======+===================+
+| 0     | 3020              |
++-------+-------------------+
+| 1     | 2940              |
++-------+-------------------+
+| 2     | 2980              |
++-------+-------------------+
+| 3     | 3060              |
++-------+-------------------+
+| 4     | 520               |
++-------+-------------------+
+| 5     | 720               |
++-------+-------------------+
+
+The times for the big CPUs are less than those for the little CPUs due to
+greater CPU performance.
+
+We suspect the time for lead CPU 4 is shorter than CPU 5 due to subtle cache
+effects, given that these measurements are at the nanosecond level.
+
+--------------
+
+*Copyright (c) 2019-2023, Arm Limited and Contributors. All rights reserved.*
+
+.. _Juno R1 platform: https://developer.arm.com/documentation/100122/latest/
+.. _TF master as of 31/01/2017: https://git.trustedfirmware.org/TF-A/trusted-firmware-a.git/tree/?id=c38b36d
+.. _v2.9-rc0: https://git.trustedfirmware.org/TF-A/trusted-firmware-a.git/tree/?h=v2.9-rc0
diff --git a/docs/perf/psci-performance-methodology.rst b/docs/perf/psci-performance-methodology.rst
new file mode 100644
index 0000000..a9f379d
--- /dev/null
+++ b/docs/perf/psci-performance-methodology.rst
@@ -0,0 +1,55 @@
+Runtime Instrumentation Methodology
+===================================
+
+This document outlines steps for undertaking performance measurements of key
+operations in the Trusted Firmware-A Power State Coordination Interface (PSCI)
+implementation, using the in-built Performance Measurement Framework (PMF) and
+runtime instrumentation timestamps.
+
+Framework
+~~~~~~~~~
+
+The tests are based on the ``runtime-instrumentation`` test suite provided by
+the Trusted Firmware Test Framework (TFTF). The release build of this framework
+was used because the results in the debug build became skewed; the console
+output prevented some of the tests from executing in parallel.
+
+The tests consist of both parallel and sequential tests, which are broadly
+described as follows:
+
+- **Parallel Tests** This type of test powers on all the non-lead CPUs and
+  brings them and the lead CPU to a common synchronization point. The lead CPU
+  then initiates the test on all CPUs in parallel.
+
+- **Sequential Tests** This type of test powers on each non-lead CPU in
+  sequence. The lead CPU initiates the test on a non-lead CPU then waits for the
+  test to complete before proceeding to the next non-lead CPU. The lead CPU then
+  executes the test on itself.
+
+Note there is very little variance observed in the values given (~1us), although
+the values for each CPU are sometimes interchanged, depending on the order in
+which locks are acquired. Also, there is very little variance observed between
+executing the tests sequentially in a single boot or rebooting between tests.
+
+Given that runtime instrumentation using PMF is invasive, there is a small
+(unquantified) overhead on the results. PMF uses the generic counter for
+timestamps, which runs at 50MHz on Juno.
+
+Metrics
+~~~~~~~
+
+.. glossary::
+
+   Powerdown Latency
+      Time taken from entering the TF PSCI implementation to the point the hardware
+      enters the low power state (WFI). Referring to the TF runtime instrumentation points, this
+      corresponds to: ``(RT_INSTR_ENTER_HW_LOW_PWR - RT_INSTR_ENTER_PSCI)``.
+
+   Wakeup Latency
+      Time taken from the point the hardware exits the low power state to exiting
+      the TF PSCI implementation. This corresponds to: ``(RT_INSTR_EXIT_PSCI -
+      RT_INSTR_EXIT_HW_LOW_PWR)``.
+
+   Cache Flush Latency
+      Time taken to flush the caches during powerdown. This corresponds to:
+      ``(RT_INSTR_EXIT_CFLUSH - RT_INSTR_ENTER_CFLUSH)``.
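As an illustration of the three metrics defined above, the following Python sketch converts raw generic counter timestamps into latencies. It assumes the 50MHz counter frequency quoted for Juno. The dictionary layout, the function names, and the sample tick values are hypothetical and are not TFTF output.

```python
# Sketch only: the keys mirror the RT_INSTR_* instrumentation points, but the
# dict layout and the sample tick values below are hypothetical.

COUNTER_FREQ_HZ = 50_000_000  # generic counter frequency quoted for Juno


def ticks_to_us(ticks, freq_hz=COUNTER_FREQ_HZ):
    """Convert a generic counter tick delta to microseconds."""
    return ticks * 1_000_000 / freq_hz


def psci_latencies(ts, freq_hz=COUNTER_FREQ_HZ):
    """Derive the three metrics from one CPU's raw timestamps (in ticks)."""
    return {
        "powerdown_us": ticks_to_us(
            ts["RT_INSTR_ENTER_HW_LOW_PWR"] - ts["RT_INSTR_ENTER_PSCI"], freq_hz),
        "wakeup_us": ticks_to_us(
            ts["RT_INSTR_EXIT_PSCI"] - ts["RT_INSTR_EXIT_HW_LOW_PWR"], freq_hz),
        "cache_flush_us": ticks_to_us(
            ts["RT_INSTR_EXIT_CFLUSH"] - ts["RT_INSTR_ENTER_CFLUSH"], freq_hz),
    }


# Hypothetical tick values for a single CPU_SUSPEND call.
sample = {
    "RT_INSTR_ENTER_PSCI": 1_000,
    "RT_INSTR_ENTER_CFLUSH": 1_200,
    "RT_INSTR_EXIT_CFLUSH": 1_500,
    "RT_INSTR_ENTER_HW_LOW_PWR": 2_100,
    "RT_INSTR_EXIT_HW_LOW_PWR": 9_000,
    "RT_INSTR_EXIT_PSCI": 9_700,
}

result = psci_latencies(sample)
print(result)
```

At 50MHz one counter tick is 20ns, so latencies derived this way are resolved in steps of 0.02us.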
diff --git a/docs/perf/psci-performance-n1sdp.rst b/docs/perf/psci-performance-n1sdp.rst
new file mode 100644
index 0000000..fd3c9c9
--- /dev/null
+++ b/docs/perf/psci-performance-n1sdp.rst
@@ -0,0 +1,297 @@
+Runtime Instrumentation Testing - N1SDP
+=======================================
+
+For this test we used the N1 System Development Platform (`N1SDP`_), which
+contains an SoC consisting of two dual-core Arm N1 clusters.
+
+The following source trees and binaries were used:
+
+- TF-A [`v2.9-rc0-16-g666aec401`_]
+- TFTF [`v2.9-rc0`_]
+- SCP/MCP `Prebuilt Images`_
+
+Please see the Runtime Instrumentation :ref:`Testing Methodology
+<Runtime Instrumentation Methodology>` page for more details.
+
+Procedure
+---------
+
+#. Build TFTF with runtime instrumentation enabled:
+
+   .. code:: shell
+
+       make CROSS_COMPILE=aarch64-none-elf- PLAT=n1sdp \
+           TESTS=runtime-instrumentation all
+
+#. Build TF-A with the following build options:
+
+   .. code:: shell
+
+       make CROSS_COMPILE=aarch64-none-elf- PLAT=n1sdp \
+           ENABLE_RUNTIME_INSTRUMENTATION=1 fiptool all
+
+#. Fetch the SCP firmware images:
+
+   .. code:: shell
+
+       curl --fail --connect-timeout 5 --retry 5 \
+           -sLS -o build/n1sdp/release/scp_rom.bin \
+           https://downloads.trustedfirmware.org/tf-a/css_scp_2.12.0/n1sdp/release/n1sdp-bl1.bin
+       curl --fail --connect-timeout 5 \
+           --retry 5 -sLS -o build/n1sdp/release/scp_ram.bin \
+           https://downloads.trustedfirmware.org/tf-a/css_scp_2.12.0/n1sdp/release/n1sdp-bl2.bin
+
+#. Fetch the MCP firmware images:
+
+   .. code:: shell
+
+       curl --fail --connect-timeout 5 --retry 5 \
+           -sLS -o build/n1sdp/release/mcp_rom.bin \
+           https://downloads.trustedfirmware.org/tf-a/css_scp_2.12.0/n1sdp/release/n1sdp-mcp-bl1.bin
+       curl --fail --connect-timeout 5 --retry 5 \
+           -sLS -o build/n1sdp/release/mcp_ram.bin \
+           https://downloads.trustedfirmware.org/tf-a/css_scp_2.12.0/n1sdp/release/n1sdp-mcp-bl2.bin
+
+#. Using the fiptool, create a new FIP package and append the SCP ram image onto
+   it.
+
+   .. code:: shell
+
+       ./tools/fiptool/fiptool create --blob \
+           uuid=cfacc2c4-15e8-4668-82be-430a38fad705,file=build/n1sdp/release/bl1.bin \
+           --scp-fw build/n1sdp/release/scp_ram.bin build/n1sdp/release/scp_fw.bin
+
+#. Append the MCP image to the FIP.
+
+   .. code:: shell
+
+       ./tools/fiptool/fiptool create \
+           --blob uuid=54464222-a4cf-4bf8-b1b6-cee7dade539e,file=build/n1sdp/release/mcp_ram.bin \
+           build/n1sdp/release/mcp_fw.bin
+
+#. Then, add TFTF as the Non-Secure workload in the FIP image:
+
+   .. code:: shell
+
+       make CROSS_COMPILE=aarch64-none-elf- PLAT=n1sdp \
+           ENABLE_RUNTIME_INSTRUMENTATION=1 SCP_BL2=/dev/null \
+           BL33=<path/to/tftf.bin> fip
+
+#. Load the following images onto the development board: ``fip.bin``,
+   ``scp_rom.bin``, ``scp_ram.bin``, ``mcp_rom.bin``, and ``mcp_ram.bin``.
+
+.. note::
+
+    These instructions presume you have a complete firmware stack. The N1SDP
+    `user guide`_ provides a detailed explanation of how to get set up from
+    scratch.
+
+Results
+-------
+
+``CPU_SUSPEND`` to deepest power level
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. table:: ``CPU_SUSPEND`` latencies (µs) to deepest power level in
+        parallel (v2.9)
+
+    +---------+------+-----------+--------+-------------+
+    | Cluster | Core | Powerdown | Wakeup | Cache Flush |
+    +---------+------+-----------+--------+-------------+
+    | 0       | 0    | 2.80      | 10.08  | 0.80        |
+    +---------+------+-----------+--------+-------------+
+    | 0       | 0    | 4.14      | 15.92  | 0.16        |
+    +---------+------+-----------+--------+-------------+
+    | 1       | 0    | 3.68      | 12.96  | 0.16        |
+    +---------+------+-----------+--------+-------------+
+    | 1       | 0    | 3.36      | 18.58  | 0.18        |
+    +---------+------+-----------+--------+-------------+
+
+.. table:: ``CPU_SUSPEND`` latencies (µs) to deepest power level in
+        parallel (v2.10)
+
+    +---------+------+----------------+------------------+-----------------+
+    | Cluster | Core | Powerdown      | Wakeup           | Cache Flush     |
+    +---------+------+----------------+------------------+-----------------+
+    | 0       | 0    | 2.12           | 23.94 (+137.50%) | 0.42 (-47.50%)  |
+    +---------+------+----------------+------------------+-----------------+
+    | 0       | 0    | 3.52           | 42.08 (+164.32%) | 0.26 (+62.50%)  |
+    +---------+------+----------------+------------------+-----------------+
+    | 1       | 0    | 2.76 (-25.00%) | 38.3 (+195.52%)  | 0.26 (+62.50%)  |
+    +---------+------+----------------+------------------+-----------------+
+    | 1       | 0    | 2.64           | 44.56 (+139.83%) | 0.36 (+100.00%) |
+    +---------+------+----------------+------------------+-----------------+
+
+.. table:: ``CPU_SUSPEND`` latencies (µs) to deepest power level in
+        serial (v2.9)
+
+    +---------+------+-----------+--------+-------------+
+    | Cluster | Core | Powerdown | Wakeup | Cache Flush |
+    +---------+------+-----------+--------+-------------+
+    | 0       | 0    | 1.86      | 9.92   | 0.32        |
+    +---------+------+-----------+--------+-------------+
+    | 0       | 0    | 2.70      | 10.48  | 0.36        |
+    +---------+------+-----------+--------+-------------+
+    | 1       | 0    | 1.78      | 9.72   | 0.16        |
+    +---------+------+-----------+--------+-------------+
+    | 1       | 0    | 1.94      | 10.44  | 0.16        |
+    +---------+------+-----------+--------+-------------+
+
+.. table:: ``CPU_SUSPEND`` latencies (µs) to deepest power level in
+        serial (v2.10)
+
+    +---------+------+-----------+------------------+----------------+
+    | Cluster | Core | Powerdown | Wakeup           | Cache Flush    |
+    +---------+------+-----------+------------------+----------------+
+    | 0       | 0    | 1.74      | 23.7 (+138.91%)  | 0.3            |
+    +---------+------+-----------+------------------+----------------+
+    | 0       | 0    | 2.08      | 23.96 (+128.63%) | 0.26 (-27.78%) |
+    +---------+------+-----------+------------------+----------------+
+    | 1       | 0    | 1.9       | 23.62 (+143.00%) | 0.28 (+75.00%) |
+    +---------+------+-----------+------------------+----------------+
+    | 1       | 0    | 2.06      | 23.92 (+129.12%) | 0.26 (+62.50%) |
+    +---------+------+-----------+------------------+----------------+
+
+``CPU_SUSPEND`` to power level 0
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. table:: ``CPU_SUSPEND`` latencies (µs) to power level 0 in
+        parallel (v2.9)
+
+    +---------------------------------------------------+
+    | test_rt_instr_cpu_susp_parallel                   |
+    +---------+------+-----------+--------+-------------+
+    | Cluster | Core | Powerdown | Wakeup | Cache Flush |
+    +---------+------+-----------+--------+-------------+
+    | 0       | 0    | 0.88      | 12.32  | 0.26        |
+    +---------+------+-----------+--------+-------------+
+    | 0       | 0    | 2.12      | 14.62  | 0.26        |
+    +---------+------+-----------+--------+-------------+
+    | 1       | 0    | 1.86      | 14.14  | 0.16        |
+    +---------+------+-----------+--------+-------------+
+    | 1       | 0    | 1.92      | 9.44   | 0.18        |
+    +---------+------+-----------+--------+-------------+
+
+.. table:: ``CPU_SUSPEND`` latencies (µs) to power level 0 in
+        parallel (v2.10)
+
+    +---------+------+---------------+------------------+----------------+
+    | Cluster | Core | Powerdown     | Wakeup           | Cache Flush    |
+    +---------+------+---------------+------------------+----------------+
+    | 0       | 0    | 1.5 (+70.45%) | 35.02 (+184.25%) | 0.24           |
+    +---------+------+---------------+------------------+----------------+
+    | 0       | 0    | 1.92          | 38.12 (+160.74%) | 0.28           |
+    +---------+------+---------------+------------------+----------------+
+    | 1       | 0    | 1.88          | 38.1 (+169.45%)  | 0.26 (+62.50%) |
+    +---------+------+---------------+------------------+----------------+
+    | 1       | 0    | 2.04          | 23.1 (+144.70%)  | 0.24           |
+    +---------+------+---------------+------------------+----------------+
+
+.. table:: ``CPU_SUSPEND`` latencies (µs) to power level 0 in serial (v2.9)
+
+    +---------------------------------------------------+
+    | test_rt_instr_cpu_susp_serial                     |
+    +---------+------+-----------+--------+-------------+
+    | Cluster | Core | Powerdown | Wakeup | Cache Flush |
+    +---------+------+-----------+--------+-------------+
+    | 0       | 0    | 1.52      | 9.40   | 0.30        |
+    +---------+------+-----------+--------+-------------+
+    | 0       | 0    | 1.92      | 9.80   | 0.18        |
+    +---------+------+-----------+--------+-------------+
+    | 1       | 0    | 2.20      | 9.60   | 0.14        |
+    +---------+------+-----------+--------+-------------+
+    | 1       | 0    | 1.82      | 9.78   | 0.18        |
+    +---------+------+-----------+--------+-------------+
+
+.. table:: ``CPU_SUSPEND`` latencies (µs) to power level 0 in serial (v2.10)
+
+    +---------+------+-----------+------------------+-----------------+
+    | Cluster | Core | Powerdown | Wakeup           | Cache Flush     |
+    +---------+------+-----------+------------------+-----------------+
+    | 0       | 0    | 1.52      | 23.08 (+145.53%) | 0.3             |
+    +---------+------+-----------+------------------+-----------------+
+    | 0       | 0    | 1.98      | 23.68 (+141.63%) | 0.28 (+55.56%)  |
+    +---------+------+-----------+------------------+-----------------+
+    | 1       | 0    | 1.84      | 23.86 (+148.54%) | 0.28 (+100.00%) |
+    +---------+------+-----------+------------------+-----------------+
+    | 1       | 0    | 1.98      | 23.68 (+142.13%) | 0.28 (+55.56%)  |
+    +---------+------+-----------+------------------+-----------------+
+
+``CPU_OFF`` on all non-lead CPUs
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+``CPU_OFF`` on all non-lead CPUs in sequence, then ``CPU_SUSPEND`` on the lead
+core to the deepest power level.
+
+.. table:: ``CPU_OFF`` latencies (µs) on all non-lead CPUs (v2.9)
+
+    +---------+------+-----------+--------+-------------+
+    | Cluster | Core | Powerdown | Wakeup | Cache Flush |
+    +---------+------+-----------+--------+-------------+
+    | 0       | 0    | 1.84      | 9.94   | 0.32        |
+    +---------+------+-----------+--------+-------------+
+    | 0       | 0    | 14.20     | 13.10  | 0.50        |
+    +---------+------+-----------+--------+-------------+
+    | 1       | 0    | 13.88     | 12.36  | 0.42        |
+    +---------+------+-----------+--------+-------------+
+    | 1       | 0    | 14.40     | 13.26  | 0.52        |
+    +---------+------+-----------+--------+-------------+
+
+.. table:: ``CPU_OFF`` latencies (µs) on all non-lead CPUs (v2.10)
+
+    +---------+------+-----------+------------------+----------------+
+    | Cluster | Core | Powerdown | Wakeup           | Cache Flush    |
+    +---------+------+-----------+------------------+----------------+
+    | 0       | 0    | 1.78      | 23.7 (+138.43%)  | 0.3            |
+    +---------+------+-----------+------------------+----------------+
+    | 0       | 0    | 13.96     | 31.16 (+137.86%) | 0.34 (-32.00%) |
+    +---------+------+-----------+------------------+----------------+
+    | 1       | 0    | 13.54     | 30.24 (+144.66%) | 0.26 (-38.10%) |
+    +---------+------+-----------+------------------+----------------+
+    | 1       | 0    | 14.46     | 31.12 (+134.69%) | 0.7 (+34.62%)  |
+    +---------+------+-----------+------------------+----------------+
+
+``PSCI_VERSION`` in parallel
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. table:: ``PSCI_VERSION`` latency (µs) in parallel on all cores (v2.9)
+
+    +------------------------------------+
+    | test_rt_instr_psci_version_parallel|
+    +-------------+--------+-------------+
+    | Cluster     | Core   | Latency     |
+    +-------------+--------+-------------+
+    | 0           | 0      | 0.08        |
+    +-------------+--------+-------------+
+    | 0           | 0      | 0.26        |
+    +-------------+--------+-------------+
+    | 1           | 0      | 0.20        |
+    +-------------+--------+-------------+
+    | 1           | 0      | 0.26        |
+    +-------------+--------+-------------+
+
+.. table:: ``PSCI_VERSION`` latency (µs) in parallel on all cores (v2.10)
+
+    +----------------------------------------------+
+    | test_rt_instr_psci_version_parallel (latest) |
+    +-------------+--------+-----------------------+
+    | Cluster     | Core   | Latency               |
+    +-------------+--------+-----------------------+
+    | 0           | 0      | 0.14 (+75.00%)        |
+    +-------------+--------+-----------------------+
+    | 0           | 0      | 0.22                  |
+    +-------------+--------+-----------------------+
+    | 1           | 0      | 0.2                   |
+    +-------------+--------+-----------------------+
+    | 1           | 0      | 0.26                  |
+    +-------------+--------+-----------------------+
+
+--------------
+
+*Copyright (c) 2023, Arm Limited. All rights reserved.*
+
+.. _v2.9-rc0-16-g666aec401: https://review.trustedfirmware.org/plugins/gitiles/TF-A/trusted-firmware-a/+/refs/heads/v2.9-rc0-16-g666aec401
+.. _v2.9-rc0: https://review.trustedfirmware.org/plugins/gitiles/TF-A/tf-a-tests/+/refs/tags/v2.9-rc0
+.. _user guide: https://gitlab.arm.com/arm-reference-solutions/arm-reference-solutions-docs/-/blob/master/docs/n1sdp/user-guide.rst
+.. _Prebuilt Images: https://downloads.trustedfirmware.org/tf-a/css_scp_2.11.0/n1sdp/release/
+.. _N1SDP: https://developer.arm.com/documentation/101489/latest
diff --git a/docs/perf/tsp.rst b/docs/perf/tsp.rst
new file mode 100644
index 0000000..f8b0048
--- /dev/null
+++ b/docs/perf/tsp.rst
@@ -0,0 +1,27 @@
+Test Secure Payload (TSP) and Dispatcher (TSPD)
+===============================================
+
+Building the Test Secure Payload
+--------------------------------
+
+The TSP is coupled with a companion runtime service in the BL31 firmware,
+called the TSPD. Therefore, if you intend to use the TSP, the BL31 image
+must be recompiled as well. For more information on SPs and SPDs, see the
+:ref:`firmware_design_sel1_spd` section in the :ref:`Firmware Design`.
+
+First clean the TF-A build directory to get rid of any previous BL31 binary.
+Then to build the TSP image use:
+
+.. code:: shell
+
+    make PLAT=<platform> SPD=tspd all
+
+An additional boot loader binary file is created in the ``build`` directory:
+
+::
+
+    build/<platform>/<build-type>/bl32.bin
+
+--------------
+
+*Copyright (c) 2019, Arm Limited. All rights reserved.*