Diffstat (limited to 'docs/components/ras.rst')
-rw-r--r-- | docs/components/ras.rst | 346 |
1 files changed, 346 insertions, 0 deletions
diff --git a/docs/components/ras.rst b/docs/components/ras.rst
new file mode 100644
index 0000000..747367a
--- /dev/null
+++ b/docs/components/ras.rst
@@ -0,0 +1,346 @@
+Reliability, Availability, and Serviceability (RAS) Extensions
+**************************************************************
+
+This document describes |TF-A| support for Arm Reliability, Availability, and
+Serviceability (RAS) extensions. RAS is a mandatory extension for Armv8.2 and
+later CPUs, and also an optional extension to the base Armv8.0 architecture.
+
+For the description of Arm RAS extensions, Standard Error Records, and the
+precise definition of RAS terminology, please refer to the Arm Architecture
+Reference Manual and `RAS Supplement`_. The rest of this document assumes
+familiarity with the architecture and terminology.
+
+**IMPORTANT NOTE**: The TF-A implementation assumes that if the RAS extension is
+present then FEAT_IESB is also implemented.
+
+There are two philosophies for handling RAS errors from the Non-secure world's
+point of view:
+
+- :ref:`Firmware First Handling (FFH)`
+- :ref:`Kernel First Handling (KFH)`
+
+.. _Firmware First Handling (FFH):
+
+Firmware First Handling (FFH)
+=============================
+
+Introduction
+------------
+
+EAs and Error interrupts corresponding to NS nodes are handled first in firmware:
+
+- Errors are signaled back to the NS world via a suitable mechanism
+- The kernel is prohibited from accessing the RAS error records directly
+- Firmware creates CPER records for the kernel to navigate and process
+- Firmware signals the error back to the kernel via SDEI
+
+Overview
+--------
+
+FFH works in conjunction with `Exception Handling Framework`. Exceptions resulting from
+errors in the Non-secure world are routed to and handled in EL3. Said errors are Synchronous
+External Abort (SEA), Asynchronous External Abort (signalled as SErrors), Fault Handling
+and Error Recovery interrupts.
+RAS Framework in TF-A allows the platform to define an external abort handler and to
+register RAS nodes and interrupts. It also provides `helpers`__ for accessing Standard
+Error Records as introduced by the RAS extensions.
+
+
+.. __: `Standard Error Record helpers`_
+
+.. _Kernel First Handling (KFH):
+
+Kernel First Handling (KFH)
+===========================
+
+Introduction
+------------
+
+EAs originating from or attributed to the NS world are handled first in the NS world,
+and the kernel navigates the standard error records directly.
+
+- KFH is the default handling mode if the platform does not explicitly enable FFH mode.
+- KFH mode does not need any EL3 involvement except for the reflection of errors back
+  to the lower EL. This happens when there is an error (EA) in the system which is not yet
+  signaled to the PE while executing at the lower EL. During entry into EL3 the errors (EA)
+  are synchronized, causing the async EA to pend at EL3.
+
+Error Synchronization at EL3 entry
+==================================
+
+During entry to EL3 from a lower EL, any pending async EAs are either
+reflected back to the lower EL (KFH) or handled in EL3 itself (FFH).
+
+|Image 1|
+
+TF-A build options
+==================
+
+- **ENABLE_FEAT_RAS**: Enable RAS extension feature at EL3.
+- **HANDLE_EA_EL3_FIRST_NS**: Required for FFH.
+- **RAS_TRAP_NS_ERR_REC_ACCESS**: Trap Non-secure access of RAS error record registers.
+- **RAS_EXTENSION**: Deprecated macro, equivalent to ENABLE_FEAT_RAS and
+  HANDLE_EA_EL3_FIRST_NS put together.
+
+RAS internal macros:
+
+- **FFH_SUPPORT**: Gets enabled if **HANDLE_EA_EL3_FIRST_NS** is enabled.
+
+The RAS feature depends on some other TF-A build flags:
+
+- **EL3_EXCEPTION_HANDLING**: Required for FFH.
+- **FAULT_INJECTION_SUPPORT**: Required for testing the RAS feature on the FVP platform.
+
+TF-A Tests
+==========
+
+RAS functionality is regularly tested in TF-A CI using the `RAS test group`_, which has
+multiple configurations for testing lower EL External Aborts.
+
+All the tests are written in TF-A Tests, which runs as an NS-EL2 payload.
+
+- **FFH without RAS extension**
+
+  *fvp-ea-ffh,fvp-ea-ffh:fvp-tftf-fip.tftf-aemv8a-debug*
+
+  A couple of tests, one each for sync EA and async EA from a lower EL, both handled in EL3.
+  They inject External Aborts (sync/async) which trap to EL3; FVP has a handler which
+  gracefully handles these errors and returns to TF-A Tests.
+
+  Build Configs: **HANDLE_EA_EL3_FIRST_NS**, **PLATFORM_TEST_EA_FFH**
+
+- **FFH with RAS extension**
+
+  Three tests:
+
+  - *fvp-ras-ffh,fvp-single-fault:fvp-tftf-fip.tftf-aemv8a.fi-debug*
+
+    Inject an unrecoverable RAS error, which gets handled in EL3.
+
+  - *fvp-ras-ffh,fvp-uncontainable:fvp-tftf.fault-fip.tftf-aemv8a.fi-debug*
+
+    Inject uncontainable RAS errors, which cause the platform to panic.
+
+  - *fvp-ras-ffh,fvp-ras-ffh-nested:fvp-tftf-fip.tftf-ras_ffh_nested-aemv8a.fi-debug*
+
+    Test nested exception handling at EL3 for synchronized async EAs. Inject an SError in a lower
+    EL which remains pending until EL3 is entered through an SMC call. At EL3 entry, on finding the
+    pending async EA, EL3 handles it first (nested exception) before handling the original SMC call.
+
+- **KFH with RAS extension**
+
+  A couple of tests in the group:
+
+  - *fvp-ras-kfh,fvp-ras-kfh:fvp-tftf-fip.tftf-aemv8a.fi-debug*
+
+    Inject and handle RAS errors in TF-A Tests (no EL3 involvement).
+
+  - *fvp-ras-kfh,fvp-ras-kfh-reflect:fvp-tftf-fip.tftf-ras_kfh_reflection-aemv8a.fi-debug*
+
+    Reflection of synchronized errors from EL3 to TF-A Tests; two tests, one each for reflection
+    in the IRQ and SMC paths.
+
+RAS Framework
+=============
+
+
+.. _ras-figure:
+
+.. image:: ../resources/diagrams/draw.io/ras.svg
+
+Platform APIs
+-------------
+
+The RAS framework allows the platform to define handlers for External Abort,
+Uncontainable Errors, Double Fault, and errors arising from EL3 execution. Please
+refer to :ref:`RAS Porting Guide <External Abort handling and RAS Support>`.
+
+Registering RAS error records
+-----------------------------
+
+RAS nodes are components in the system capable of signalling errors to PEs
+through one of the notification mechanisms: SEAs, SErrors, or interrupts. RAS
+nodes contain one or more error records, which are registers through which the
+nodes advertise various properties of the signalled error. Arm recommends that
+error records are implemented in the Standard Error Record format. The RAS
+architecture allows for error records to be accessible via system or
+memory-mapped registers.
+
+The platform should enumerate the error records, providing for each of them:
+
+- A handler to probe error records for errors;
+- When the probing identifies an error, a handler to handle it;
+- For a memory-mapped error record, its base address and size in KB; for a system
+  register-accessed record, the start index of the record and number of
+  contiguous records from that index;
+- Any node-specific auxiliary data.
+
+With this information supplied, when the run time firmware receives one of these
+notifications, the RAS framework can iterate through and probe error
+records for errors, and invoke the appropriate handler to handle them.
+
+The RAS framework provides macros to populate error record information. The
+macros are versioned, and the latest version as of this writing is 1. Each
+macro creates a structure of type ``struct err_record_info`` from its arguments,
+which is later passed to the probe and error handlers.
+
+For memory-mapped error records:
+
+.. code:: c
+
+    ERR_RECORD_MEMMAP_V1(base_addr, size_num_k, probe, handler, aux)
+
+And, for system register ones:
+
+.. code:: c
+
+    ERR_RECORD_SYSREG_V1(idx_start, num_idx, probe, handler, aux)
+
+The probe handler must have the following prototype:
+
+.. code:: c
+
+    typedef int (*err_record_probe_t)(const struct err_record_info *info,
+                    int *probe_data);
+
+The probe handler must return a non-zero value if an error was detected, or 0 otherwise.
+The ``probe_data`` output parameter can be used to pass any useful information
+resulting from the probe to the error handler (see `below`__). For example, it
+could return the index of the record.
+
+.. __: `Standard Error Record helpers`_
+
+The error handler must have the following prototype:
+
+.. code:: c
+
+    typedef int (*err_record_handler_t)(const struct err_record_info *info,
+                    int probe_data, const struct err_handler_data *const data);
+
+The ``data`` constant parameter describes the various properties of the error,
+including the reason for the error, exception syndrome, and also ``flags``,
+``cookie``, and ``handle`` parameters from the :ref:`top-level exception handler
+<EL3 interrupts>`.
+
+The platform is expected to populate an array using the macros above, and register
+it with the RAS framework using the macro ``REGISTER_ERR_RECORD_INFO()``,
+passing it the name of the array describing the records. Note that the macro
+must be used in the same file where the array is defined.
+
+Standard Error Record helpers
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The |TF-A| RAS framework provides probe handlers for Standard Error Records, for
+both memory-mapped and System Register accesses:
+
+.. code:: c
+
+    int ras_err_ser_probe_memmap(const struct err_record_info *info,
+                    int *probe_data);
+
+    int ras_err_ser_probe_sysreg(const struct err_record_info *info,
+                    int *probe_data);
+
+When the platform enumerates error records, for those records in the Standard
+Error Record format, these helpers may be used instead of the platform rolling its own.
+Both helpers above:
+
+- Return a non-zero value when an error is detected in a Standard Error Record;
+- Set ``probe_data`` to the index of the error record upon detecting an error.
+
+Registering RAS interrupts
+--------------------------
+
+RAS nodes can signal errors to the PE by raising Fault Handling and/or Error
+Recovery interrupts.
+For the firmware-first handling paradigm for interrupts to work, the platform must
+set up and register with |EHF|. See `Interaction with Exception Handling Framework`_.
+
+For each RAS interrupt, the platform has to provide a structure of type ``struct
+ras_interrupt``, containing:
+
+- Interrupt number;
+- The associated error record information (pointer to the corresponding
+  ``struct err_record_info``);
+- Optionally, a cookie.
+
+The platform is expected to define an array of ``struct ras_interrupt``, and
+register it with the RAS framework using the macro
+``REGISTER_RAS_INTERRUPTS()``, passing it the name of the array. Note that the
+macro must be used in the same file where the array is defined.
+
+The array of ``struct ras_interrupt`` must be sorted in increasing order of
+interrupt number. This allows for fast look-up of handlers in order to service RAS
+interrupts.
+
+Double-fault handling
+---------------------
+
+A Double Fault condition arises when an error is signalled to the PE while
+handling of a previously signalled error is still underway. When a Double Fault
+condition arises, the Arm RAS extensions only require the handler to perform an
+orderly shutdown of the system, as recovery may be impossible.
+
+The RAS extensions in Armv8.4 introduced new architectural features to deal
+with Double Fault conditions, specifically the ``NMEA`` and ``EASE`` bits in the
+``SCR_EL3`` register. These were introduced to assist EL3 software which runs
+part of its entry/exit routines with exceptions momentarily masked. In such
+systems, External Aborts/SErrors are not handled immediately when they occur,
+but only after the exceptions are unmasked again.
+
+|TF-A|, for legacy reasons, executes the entirety of EL3 with all exceptions unmasked.
+This means that all exceptions routed to EL3 are handled immediately.
+|TF-A| is thus able to detect Double Fault conditions in software, without needing
+the intended advantages of the Armv8.4 Double Fault architecture extensions.
+
+Double faults are fatal: handling terminates at the platform double fault handler,
+which does not return.
+
+Engaging the RAS framework
+--------------------------
+
+Enabling RAS support is a platform choice.
+
+The RAS support in |TF-A| introduces a default implementation of
+``plat_ea_handler``, the External Abort handler in EL3. When ``ENABLE_FEAT_RAS``
+is set to ``1``, it first calls the ``ras_ea_handler()`` function, which is the
+top-level RAS exception handler. ``ras_ea_handler`` is responsible for iterating
+through platform-supplied error records, probing them, and, when an error is
+identified, looking up and invoking the corresponding error handler.
+
+Note that, if the platform chooses to override the ``plat_ea_handler`` function
+and intends to use the RAS framework, it must explicitly call
+``ras_ea_handler()`` from within.
+
+Similarly, for RAS interrupts, the framework defines
+``ras_interrupt_handler()``. The RAS framework arranges for it to be invoked
+when a RAS interrupt is taken at EL3. The function bisects the platform-supplied
+sorted array of interrupts to look up the error record information associated
+with the interrupt number. The error handler for that record is then invoked to
+handle the error.
+
+Interaction with Exception Handling Framework
+---------------------------------------------
+
+As mentioned in earlier sections, the RAS framework interacts with the |EHF| to
+arbitrate handling of RAS exceptions with others that are routed to EL3. This
+means that the platform must partition a :ref:`priority level <Partitioning
+priority levels>` for handling RAS exceptions. The platform must then define
+the macro ``PLAT_RAS_PRI`` to the priority level used for RAS exceptions.
+Platforms would typically want to allocate the highest secure priority for
+RAS handling.
+
+Handling of both :ref:`interrupt <interrupt-flow>` and :ref:`non-interrupt
+<non-interrupt-flow>` exceptions follows the sequences outlined in the |EHF|
+documentation. I.e., for interrupts, the priority management is implicit; but
+for non-interrupt exceptions, it is explicit using :ref:`EHF APIs
+<Activating and Deactivating priorities>`.
+
+--------------
+
+*Copyright (c) 2018-2023, Arm Limited and Contributors. All rights reserved.*
+
+.. _RAS Supplement: https://developer.arm.com/documentation/ddi0587/latest
+.. _RAS Test group: https://git.trustedfirmware.org/ci/tf-a-ci-scripts.git/tree/group/tf-l3-boot-tests-ras?h=refs/heads/master
+
+.. |Image 1| image:: ../resources/diagrams/bl31-exception-entry-error-synchronization.png
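
The sequence that the added document describes for RAS interrupts (bisect a table sorted by interrupt number, probe the associated error records, then invoke the error handler with the probe result) can be modelled outside of TF-A in plain C. The sketch below is a stand-alone illustration only and is not part of the patch. It includes no TF-A header, and every name in it (``record_info``, ``interrupt_entry``, ``model_probe``, ``model_handler``, ``find_interrupt``, ``dispatch_interrupt``, ``sample_table``, and the interrupt numbers 34 and 47) is a hypothetical stand-in, not the real ``struct err_record_info`` or ``struct ras_interrupt`` layout and not the real ``ras_interrupt_handler()``. Error records are modelled as a simple array of "error pending" flags.

```c
/* Stand-alone model of the look-up, probe and handle flow. NOT TF-A code:
 * all types, fields and functions here are hypothetical stand-ins. */
#include <assert.h>
#include <stddef.h>

struct record_info;

typedef int (*probe_fn)(const struct record_info *info, int *probe_data);
typedef int (*handler_fn)(const struct record_info *info, int probe_data);

/* Hypothetical analogue of one registered group of error records. */
struct record_info {
    probe_fn probe;
    handler_fn handler;
    const int *status;      /* models per-record "error pending" flags */
    int num_records;
};

/* Hypothetical analogue of one registered RAS interrupt. */
struct interrupt_entry {
    unsigned int intr_number;
    const struct record_info *info;
};

/* Probe: return non-zero and report the record index if an error is pending. */
static int model_probe(const struct record_info *info, int *probe_data)
{
    for (int i = 0; i < info->num_records; i++) {
        if (info->status[i] != 0) {
            *probe_data = i;
            return 1;
        }
    }
    return 0;
}

static int last_handled_index = -1;

/* Handler: a real one would act on the error; this one only records the index. */
static int model_handler(const struct record_info *info, int probe_data)
{
    (void)info;
    last_handled_index = probe_data;
    return 0;
}

/* Binary search; requires the table to be sorted by increasing intr_number. */
const struct interrupt_entry *find_interrupt(const struct interrupt_entry *table,
                                             size_t count, unsigned int intr)
{
    size_t lo = 0, hi = count;

    assert(table != NULL || count == 0);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (table[mid].intr_number == intr)
            return &table[mid];
        if (table[mid].intr_number < intr)
            lo = mid + 1;
        else
            hi = mid;
    }
    return NULL;
}

/* Returns 1 if an error was handled, 0 if none was found, -1 if unknown IRQ. */
int dispatch_interrupt(const struct interrupt_entry *table, size_t count,
                       unsigned int intr)
{
    const struct interrupt_entry *entry = find_interrupt(table, count, intr);
    int probe_data = 0;

    if (entry == NULL)
        return -1;
    if (entry->info->probe(entry->info, &probe_data) == 0)
        return 0;
    entry->info->handler(entry->info, probe_data);
    return 1;
}

/* Sample data: node A has an error pending in record 2, node B has none. */
static const int node_a_status[] = { 0, 0, 1 };
static const int node_b_status[] = { 0, 0 };
static const struct record_info node_a = { model_probe, model_handler, node_a_status, 3 };
static const struct record_info node_b = { model_probe, model_handler, node_b_status, 2 };
const struct interrupt_entry sample_table[] = { { 34U, &node_a }, { 47U, &node_b } };
```

Calling ``dispatch_interrupt(sample_table, 2, 34)`` in this model finds the entry for interrupt 34, probes node A, and passes index 2 to the handler. The sorted-table requirement stated in the document is what makes the bisection valid: with an unsorted table, ``find_interrupt`` could miss an entry that is present.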