4 files changed, 491 insertions, 0 deletions
diff --git a/docs/perf/index.rst b/docs/perf/index.rst
new file mode 100644
index 0000000..bccad00
--- /dev/null
+++ b/docs/perf/index.rst
@@ -0,0 +1,14 @@
+Performance & Testing
+=====================
+
+.. toctree::
+   :maxdepth: 1
+   :caption: Contents
+
+   psci-performance-juno
+   tsp
+   performance-monitoring-unit
+
+--------------
+
+*Copyright (c) 2019-2020, Arm Limited. All rights reserved.*
diff --git a/docs/perf/performance-monitoring-unit.rst b/docs/perf/performance-monitoring-unit.rst
new file mode 100644
index 0000000..5dd1af5
--- /dev/null
+++ b/docs/perf/performance-monitoring-unit.rst
@@ -0,0 +1,158 @@
+Performance Monitoring Unit
+===========================
+
+The Performance Monitoring Unit (PMU) allows recording of architectural and
+microarchitectural events for profiling purposes.
+
+This document gives an overview of the PMU counter configuration to assist with
+implementation and to complement the PMU security guidelines given in the
+:ref:`Secure Development Guidelines` document.
+
+.. note::
+   This section applies to Armv8-A implementations which have version 3
+   of the Performance Monitors Extension (PMUv3).
+
+PMU Counters
+------------
+
+The PMU makes 32 counters available at all privilege levels:
+
+-  31 programmable event counters: ``PMEVCNTR<n>``, where ``n`` is ``0`` to
+   ``30``.
+-  A dedicated cycle counter: ``PMCCNTR``.
+
+Architectural mappings
+~~~~~~~~~~~~~~~~~~~~~~
+
++--------------+---------+----------------------------+
+| Counters     | State   | System Register Name       |
++==============+=========+============================+
+|              | AArch64 | ``PMEVCNTR<n>_EL0[63*:0]`` |
+| Programmable +---------+----------------------------+
+|              | AArch32 | ``PMEVCNTR<n>[31:0]``      |
++--------------+---------+----------------------------+
+|              | AArch64 | ``PMCCNTR_EL0[63:0]``      |
+| Cycle        +---------+----------------------------+
+|              | AArch32 | ``PMCCNTR[63:0]``          |
++--------------+---------+----------------------------+
+
+.. note::
+   Bits [63:32] are only available if ARMv8.5-PMU is implemented. Refer to the
+   `Arm ARM`_ for a detailed description of ARMv8.5-PMU features.
+
+Configuring the PMU for counting events
+---------------------------------------
+
+Each programmable counter has an associated register, ``PMEVTYPER<n>``, which
+configures it. The cycle counter has the ``PMCCFILTR_EL0`` register, which has
+the same function and bit field layout as ``PMEVTYPER<n>``. In addition, the
+counters are enabled (permitted to increment) via the ``PMCNTENSET`` and
+``PMCR`` registers. These can be accessed at all privilege levels.
+
+Architectural mappings
+~~~~~~~~~~~~~~~~~~~~~~
+
++-----------------------------+------------------------+
+| AArch64                     | AArch32                |
++=============================+========================+
+| ``PMEVTYPER<n>_EL0[63*:0]`` | ``PMEVTYPER<n>[31:0]`` |
++-----------------------------+------------------------+
+| ``PMCCFILTR_EL0[63*:0]``    | ``PMCCFILTR[31:0]``    |
++-----------------------------+------------------------+
+| ``PMCNTENSET_EL0[63*:0]``   | ``PMCNTENSET[31:0]``   |
++-----------------------------+------------------------+
+| ``PMCR_EL0[63*:0]``         | ``PMCR[31:0]``         |
++-----------------------------+------------------------+
+
+.. note::
+   Bits [63:32] are reserved.
+
+Relevant register fields
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+For ``PMEVTYPER<n>_EL0``/``PMEVTYPER<n>`` and ``PMCCFILTR_EL0``/``PMCCFILTR``,
+the most important fields are:
+
+-  ``P``:
+
+   -  Bit 31.
+   -  If set to ``0``, will increment the associated ``PMEVCNTR<n>`` at EL1.
+
+-  ``NSK``:
+
+   -  Bit 29.
+   -  If equal to the ``P`` bit it enables the associated ``PMEVCNTR<n>`` at
+      Non-secure EL1.
+   -  Reserved if EL3 not implemented.
+
+-  ``NSH``:
+
+   -  Bit 27.
+   -  If set to ``1``, will increment the associated ``PMEVCNTR<n>`` at EL2.
+   -  Reserved if EL2 not implemented.
+
+-  ``SH``:
+
+   -  Bit 24.
+   -  If different to the ``NSH`` bit it enables the associated ``PMEVCNTR<n>``
+      at Secure EL2.
+   -  Reserved if Secure EL2 not implemented.
+
+-  ``M``:
+
+   -  Bit 26.
+   -  If equal to the ``P`` bit it enables the associated ``PMEVCNTR<n>`` at
+      EL3.
+
+-  ``evtCount[15:10]``:
+
+   -  Extension to ``evtCount[9:0]``. Reserved unless ARMv8.1-PMU implemented.
+
+-  ``evtCount[9:0]``:
+
+   -  The event number that the associated ``PMEVCNTR<n>`` will count.
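+
+The filter rules above can be sketched in code. The following is a minimal,
+hypothetical Python model; the names ``counts_at`` and ``EXAMPLE`` are
+illustrative and not part of TF-A. It assumes EL3 and Secure EL2 are
+implemented, does not model EL0 filtering, and reads ``P`` alone as controlling
+Secure EL1:
+
+.. code:: python
+
+    # Bit positions of the filter fields described above.
+    P_BIT, NSK_BIT, NSH_BIT, M_BIT, SH_BIT = 31, 29, 27, 26, 24
+
+    def bit(value, pos):
+        """Return the single bit of value at position pos."""
+        return (value >> pos) & 1
+
+    def counts_at(evtyper, el, secure):
+        """Return True if the filter lets the counter increment at this EL."""
+        p = bit(evtyper, P_BIT)
+        if el == 1:
+            # Assumption: Secure EL1 follows P alone, while Non-secure EL1
+            # is counted only when NSK equals P.
+            return (p == 0) if secure else (bit(evtyper, NSK_BIT) == p)
+        if el == 2:
+            nsh = bit(evtyper, NSH_BIT)
+            return (bit(evtyper, SH_BIT) != nsh) if secure else (nsh == 1)
+        if el == 3:
+            # EL3 is always Secure, so the secure argument is ignored here.
+            return bit(evtyper, M_BIT) == p
+        raise ValueError("only EL1, EL2 and EL3 are modelled")
+
+    # Illustrative value only: NSH, M and SH set, evtCount = 0x11.
+    EXAMPLE = (1 << NSH_BIT) | (1 << M_BIT) | (1 << SH_BIT) | 0x11
+
+With ``EXAMPLE`` (``0x0D000011``), this model reports counting at Secure and
+Non-secure EL1 and at Non-secure EL2, but not at Secure EL2 or EL3.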
+
+For ``PMCNTENSET_EL0``/``PMCNTENSET``, the most important fields are:
+
+-  ``P[30:0]``:
+
+   -  Setting bit ``P[n]`` to ``1`` enables counter ``PMEVCNTR<n>``.
+   -  The effects of ``PMEVTYPER<n>`` are applied on top of this.
+      In other words, the counter will not increment at any privilege level or
+      security state unless it is enabled here.
+
+-  ``C``:
+
+   -  Bit 31.
+   -  If set to ``1`` enables the cycle counter ``PMCCNTR``.
+
+For ``PMCR_EL0``/``PMCR``, the most important fields are:
+
+-  ``DP``:
+
+   -  Bit 5.
+   -  If set to ``1`` it disables the cycle counter ``PMCCNTR`` where event
+      counting (by ``PMEVCNTR<n>``) is prohibited (e.g. EL2 and the Secure
+      world).
+   -  If set to ``0``, ``PMCCNTR`` will not be affected by this bit and
+      therefore will be able to count where the programmable counters are
+      prohibited.
+
+-  ``E``:
+
+   -  Bit 0.
+   -  Enables/disables counting altogether.
+   -  The effects of ``PMCNTENSET`` and ``PMCR.DP`` are applied on top of this.
+      In other words, if this bit is ``0`` then no counters will increment
+      regardless of how the other PMU system registers or bit fields are
+      configured.
+
+.. rubric:: References
+
+-  `Arm ARM`_
+
+--------------
+
+*Copyright (c) 2019-2020, Arm Limited and Contributors. All rights reserved.*
+
+.. _Arm ARM: https://developer.arm.com/docs/ddi0487/latest
diff --git a/docs/perf/psci-performance-juno.rst b/docs/perf/psci-performance-juno.rst
new file mode 100644
index 0000000..eab3e4d
--- /dev/null
+++ b/docs/perf/psci-performance-juno.rst
@@ -0,0 +1,292 @@
+PSCI Performance Measurements on Arm Juno Development Platform
+==============================================================
+
+This document summarises the findings of performance measurements of key
+operations in the Trusted Firmware-A Power State Coordination Interface (PSCI)
+implementation, using the in-built Performance Measurement Framework (PMF) and
+runtime instrumentation timestamps.
+
+Method
+------
+
+We used the `Juno R1 platform`_ for these tests, which has a 4 x Cortex-A53
+cluster and a 2 x Cortex-A57 cluster running at the following frequencies:
+
++-----------------+--------------------+
+| Domain          | Frequency (MHz)    |
++=================+====================+
+| Cortex-A57      | 900 (nominal)      |
++-----------------+--------------------+
+| Cortex-A53      | 650 (underdrive)   |
++-----------------+--------------------+
+| AXI subsystem   | 533                |
++-----------------+--------------------+
+
+Juno supports CPU, cluster and system power down states, corresponding to power
+levels 0, 1 and 2 respectively. It does not support any retention states.
+
+We used the upstream `TF master as of 31/01/2017`_, building the platform using
+the ``ENABLE_RUNTIME_INSTRUMENTATION`` option:
+
+.. code:: shell
+
+    make PLAT=juno ENABLE_RUNTIME_INSTRUMENTATION=1 \
+        SCP_BL2=<path/to/scp-fw.bin>                \
+        BL33=<path/to/test-fw.bin>                  \
+        all fip
+
+When using the debug build of TF, there was no noticeable difference in the
+results.
+
+The tests are based on an Arm-internal test framework. The release build of this
+framework was used because the results in the debug build became skewed; the
+console output prevented some of the tests from executing in parallel.
+
+The tests consist of both parallel and sequential tests, which are broadly
+described as follows:
+
+- **Parallel Tests** This type of test powers on all the non-lead CPUs and
+  brings them and the lead CPU to a common synchronization point.  The lead CPU
+  then initiates the test on all CPUs in parallel.
+
+- **Sequential Tests** This type of test powers on each non-lead CPU in
+  sequence. The lead CPU initiates the test on a non-lead CPU then waits for the
+  test to complete before proceeding to the next non-lead CPU. The lead CPU then
+  executes the test on itself.
+
+In the results below, CPUs 0-3 refer to CPUs in the little cluster (A53) and
+CPUs 4-5 refer to CPUs in the big cluster (A57). In all cases CPU 4 is the lead
+CPU.
+
+``PSCI_ENTRY`` refers to the time taken from entering the TF PSCI implementation
+to the point the hardware enters the low power state (WFI). Referring to the TF
+runtime instrumentation points, this corresponds to:
+``(RT_INSTR_ENTER_HW_LOW_PWR - RT_INSTR_ENTER_PSCI)``.
+
+``PSCI_EXIT`` refers to the time taken from the point the hardware exits the low
+power state to exiting the TF PSCI implementation. This corresponds to:
+``(RT_INSTR_EXIT_PSCI - RT_INSTR_EXIT_HW_LOW_PWR)``.
+
+``CFLUSH_OVERHEAD`` refers to the part of ``PSCI_ENTRY`` taken to flush the
+caches. This corresponds to: ``(RT_INSTR_EXIT_CFLUSH - RT_INSTR_ENTER_CFLUSH)``.
+
+Note there is very little variance observed in the values given (~1us), although
+the values for each CPU are sometimes interchanged, depending on the order in
+which locks are acquired. Also, there is very little variance observed between
+executing the tests sequentially in a single boot or rebooting between tests.
+
+Given that runtime instrumentation using PMF is invasive, there is a small
+(unquantified) overhead on the results. PMF uses the generic counter for
+timestamps, which runs at 50MHz on Juno.
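+
+As a worked example of what this implies, assuming the reported times are
+derived directly from generic counter ticks, one tick at 50MHz corresponds to:
+
+.. math::
+
+   t_{tick} = \frac{1}{50\,\mathrm{MHz}} = 20\,\mathrm{ns}
+
+so an interval of :math:`N` ticks is :math:`20N` ns. This is consistent with
+the nanosecond-level ``PSCI_VERSION`` results in this document, which are all
+multiples of 20.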
+
+Results and Commentary
+----------------------
+
+``CPU_SUSPEND`` to deepest power level on all CPUs in parallel
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
++-------+---------------------+--------------------+--------------------------+
+| CPU   | ``PSCI_ENTRY`` (us) | ``PSCI_EXIT`` (us) | ``CFLUSH_OVERHEAD`` (us) |
++=======+=====================+====================+==========================+
+| 0     | 27                  | 20                 | 5                        |
++-------+---------------------+--------------------+--------------------------+
+| 1     | 114                 | 86                 | 5                        |
++-------+---------------------+--------------------+--------------------------+
+| 2     | 202                 | 58                 | 5                        |
++-------+---------------------+--------------------+--------------------------+
+| 3     | 375                 | 29                 | 94                       |
++-------+---------------------+--------------------+--------------------------+
+| 4     | 20                  | 22                 | 6                        |
++-------+---------------------+--------------------+--------------------------+
+| 5     | 290                 | 18                 | 206                      |
++-------+---------------------+--------------------+--------------------------+
+
+A large variance in ``PSCI_ENTRY`` and ``PSCI_EXIT`` times across CPUs is
+observed due to TF PSCI lock contention. In the worst case, CPU 3 has to wait
+for the 3 other CPUs in the cluster (0-2) to complete ``PSCI_ENTRY`` and release
+the lock before proceeding.
+
+The ``CFLUSH_OVERHEAD`` times for CPUs 3 and 5 are higher because they are the
+last CPUs in their respective clusters to power down, therefore both the L1 and
+L2 caches are flushed.
+
+The ``CFLUSH_OVERHEAD`` time for CPU 5 is a lot larger than that for CPU 3
+because the L2 cache for the big cluster is a lot larger (2MB) than that for
+the little cluster (1MB).
+
+``CPU_SUSPEND`` to power level 0 on all CPUs in parallel
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
++-------+---------------------+--------------------+--------------------------+
+| CPU   | ``PSCI_ENTRY`` (us) | ``PSCI_EXIT`` (us) | ``CFLUSH_OVERHEAD`` (us) |
++=======+=====================+====================+==========================+
+| 0     | 116                 | 14                 | 8                        |
++-------+---------------------+--------------------+--------------------------+
+| 1     | 204                 | 14                 | 8                        |
++-------+---------------------+--------------------+--------------------------+
+| 2     | 287                 | 13                 | 8                        |
++-------+---------------------+--------------------+--------------------------+
+| 3     | 376                 | 13                 | 9                        |
++-------+---------------------+--------------------+--------------------------+
+| 4     | 29                  | 15                 | 7                        |
++-------+---------------------+--------------------+--------------------------+
+| 5     | 21                  | 15                 | 8                        |
++-------+---------------------+--------------------+--------------------------+
+
+There is no lock contention in TF generic code at power level 0 but the large
+variance in ``PSCI_ENTRY`` times across CPUs is due to lock contention in Juno
+platform code. The platform lock is used to mediate access to a single SCP
+communication channel. This is compounded by the SCP firmware waiting for each
+AP CPU to enter WFI before making the channel available to other CPUs, which
+effectively serializes the SCP power down commands from all CPUs.
+
+On platforms with a more efficient CPU power down mechanism, it should be
+possible to make the ``PSCI_ENTRY`` times smaller and consistent.
+
+The ``PSCI_EXIT`` times are consistent across all CPUs because TF does not
+require locks at power level 0.
+
+The ``CFLUSH_OVERHEAD`` times for all CPUs are small and consistent since only
+the cache associated with power level 0 is flushed (L1).
+
+``CPU_SUSPEND`` to deepest power level on all CPUs in sequence
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
++-------+---------------------+--------------------+--------------------------+
+| CPU   | ``PSCI_ENTRY`` (us) | ``PSCI_EXIT`` (us) | ``CFLUSH_OVERHEAD`` (us) |
++=======+=====================+====================+==========================+
+| 0     | 114                 | 20                 | 94                       |
++-------+---------------------+--------------------+--------------------------+
+| 1     | 114                 | 20                 | 94                       |
++-------+---------------------+--------------------+--------------------------+
+| 2     | 114                 | 20                 | 94                       |
++-------+---------------------+--------------------+--------------------------+
+| 3     | 114                 | 20                 | 94                       |
++-------+---------------------+--------------------+--------------------------+
+| 4     | 195                 | 22                 | 180                      |
++-------+---------------------+--------------------+--------------------------+
+| 5     | 21                  | 17                 | 6                        |
++-------+---------------------+--------------------+--------------------------+
+
+The ``CFLUSH_OVERHEAD`` times for lead CPU 4 and all CPUs in the non-lead cluster
+are large because all other CPUs in the cluster are powered down during the
+test. The ``CPU_SUSPEND`` call powers down to the cluster level, requiring a
+flush of both L1 and L2 caches.
+
+The ``CFLUSH_OVERHEAD`` time for CPU 4 is a lot larger than those for the little
+CPUs because the L2 cache for the big cluster is a lot larger (2MB) than that
+for the little cluster (1MB).
+
+The ``PSCI_ENTRY`` and ``CFLUSH_OVERHEAD`` times for CPU 5 are low because lead
+CPU 4 continues to run while CPU 5 is suspended. Hence CPU 5 only powers down to
+level 0, which only requires an L1 cache flush.
+
+``CPU_SUSPEND`` to power level 0 on all CPUs in sequence
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
++-------+---------------------+--------------------+--------------------------+
+| CPU   | ``PSCI_ENTRY`` (us) | ``PSCI_EXIT`` (us) | ``CFLUSH_OVERHEAD`` (us) |
++=======+=====================+====================+==========================+
+| 0     | 22                  | 14                 | 5                        |
++-------+---------------------+--------------------+--------------------------+
+| 1     | 22                  | 14                 | 5                        |
++-------+---------------------+--------------------+--------------------------+
+| 2     | 21                  | 14                 | 5                        |
++-------+---------------------+--------------------+--------------------------+
+| 3     | 22                  | 14                 | 5                        |
++-------+---------------------+--------------------+--------------------------+
+| 4     | 17                  | 14                 | 6                        |
++-------+---------------------+--------------------+--------------------------+
+| 5     | 18                  | 15                 | 6                        |
++-------+---------------------+--------------------+--------------------------+
+
+Here the times are small and consistent since there is no contention and it is
+only necessary to flush the cache to power level 0 (L1). This is the best case
+scenario.
+
+The ``PSCI_ENTRY`` times for CPUs in the big cluster are slightly smaller than
+for the CPUs in the little cluster due to greater CPU performance.
+
+The ``PSCI_EXIT`` times are generally lower than in the previous test because
+the cluster remains powered on throughout the test and there is less code to
+execute on power on (for example, no need to enter CCI coherency).
+
+``CPU_OFF`` on all non-lead CPUs in sequence then ``CPU_SUSPEND`` on lead CPU to deepest power level
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The test sequence here is as follows:
+
+1. Call ``CPU_ON`` and ``CPU_OFF`` on each non-lead CPU in sequence.
+
+2. Program wake up timer and suspend the lead CPU to the deepest power level.
+
+3. Call ``CPU_ON`` on each non-lead CPU to get the timestamps from each CPU.
+
++-------+---------------------+--------------------+--------------------------+
+| CPU   | ``PSCI_ENTRY`` (us) | ``PSCI_EXIT`` (us) | ``CFLUSH_OVERHEAD`` (us) |
++=======+=====================+====================+==========================+
+| 0     | 110                 | 28                 | 93                       |
++-------+---------------------+--------------------+--------------------------+
+| 1     | 110                 | 28                 | 93                       |
++-------+---------------------+--------------------+--------------------------+
+| 2     | 110                 | 28                 | 93                       |
++-------+---------------------+--------------------+--------------------------+
+| 3     | 111                 | 28                 | 93                       |
++-------+---------------------+--------------------+--------------------------+
+| 4     | 195                 | 22                 | 181                      |
++-------+---------------------+--------------------+--------------------------+
+| 5     | 20                  | 23                 | 6                        |
++-------+---------------------+--------------------+--------------------------+
+
+The ``CFLUSH_OVERHEAD`` times for all little CPUs are large because all other
+CPUs in that cluster are powered down during the test. The ``CPU_OFF`` call
+powers down to the cluster level, requiring a flush of both L1 and L2 caches.
+
+The ``PSCI_ENTRY`` and ``CFLUSH_OVERHEAD`` times for CPU 5 are small because
+lead CPU 4 is running and CPU 5 only powers down to level 0, which only requires
+an L1 cache flush.
+
+The ``CFLUSH_OVERHEAD`` time for CPU 4 is a lot larger than those for the little
+CPUs because the L2 cache for the big cluster is a lot larger (2MB) than that
+for the little cluster (1MB).
+
+The ``PSCI_EXIT`` times for CPUs in the big cluster are slightly smaller than
+for CPUs in the little cluster due to greater CPU performance.  These times
+generally are greater than the ``PSCI_EXIT`` times in the ``CPU_SUSPEND`` tests
+because there is more code to execute in the "on finisher" compared to the
+"suspend finisher" (for example, GIC redistributor register programming).
+
+``PSCI_VERSION`` on all CPUs in parallel
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Since very little code is associated with ``PSCI_VERSION``, this test
+approximates the round trip latency for handling a fast SMC at EL3 in TF.
+
++-------+-------------------+
+| CPU   | TOTAL TIME (ns)   |
++=======+===================+
+| 0     | 3020              |
++-------+-------------------+
+| 1     | 2940              |
++-------+-------------------+
+| 2     | 2980              |
++-------+-------------------+
+| 3     | 3060              |
++-------+-------------------+
+| 4     | 520               |
++-------+-------------------+
+| 5     | 720               |
++-------+-------------------+
+
+The times for the big CPUs are less than those for the little CPUs due to
+greater CPU performance.
+
+We suspect the time for lead CPU 4 is shorter than that for CPU 5 due to subtle
+cache effects, given that these measurements are at the nanosecond level.
+
+--------------
+
+*Copyright (c) 2019-2020, Arm Limited and Contributors. All rights reserved.*
+
+.. _Juno R1 platform: https://static.docs.arm.com/100122/0100/arm_versatile_express_juno_r1_development_platform_(v2m_juno_r1)_technical_reference_manual_100122_0100_05_en.pdf
+.. _TF master as of 31/01/2017: https://git.trustedfirmware.org/TF-A/trusted-firmware-a.git/tree/?id=c38b36d
diff --git a/docs/perf/tsp.rst b/docs/perf/tsp.rst
new file mode 100644
index 0000000..f8b0048
--- /dev/null
+++ b/docs/perf/tsp.rst
@@ -0,0 +1,27 @@
+Test Secure Payload (TSP) and Dispatcher (TSPD)
+===============================================
+
+Building the Test Secure Payload
+--------------------------------
+
+The TSP is coupled with a companion runtime service in the BL31 firmware,
+called the TSPD. Therefore, if you intend to use the TSP, the BL31 image
+must be recompiled as well. For more information on SPs and SPDs, see the
+:ref:`firmware_design_sel1_spd` section in the :ref:`Firmware Design`.
+
+First, clean the TF-A build directory to get rid of any previous BL31 binary.
+Then, to build the TSP image, use:
+
+.. code:: shell
+
+    make PLAT=<platform> SPD=tspd all
+
+An additional boot loader binary file is created in the ``build`` directory:
+
+::
+
+    build/<platform>/<build-type>/bl32.bin
+
+--------------
+
+*Copyright (c) 2019, Arm Limited. All rights reserved.*