author    | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-04-11 08:27:49 +0000
committer | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-04-11 08:27:49 +0000
commit    | ace9429bb58fd418f0c81d4c2835699bddf6bde6 (patch)
tree      | b2d64bc10158fdd5497876388cd68142ca374ed3 /Documentation/admin-guide/perf
parent    | Initial commit. (diff)
download  | linux-ace9429bb58fd418f0c81d4c2835699bddf6bde6.tar.xz, linux-ace9429bb58fd418f0c81d4c2835699bddf6bde6.zip
Adding upstream version 6.6.15. (upstream/6.6.15)
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'Documentation/admin-guide/perf')
-rw-r--r-- | Documentation/admin-guide/perf/alibaba_pmu.rst | 105
-rw-r--r-- | Documentation/admin-guide/perf/arm-ccn.rst | 61
-rw-r--r-- | Documentation/admin-guide/perf/arm-cmn.rst | 65
-rw-r--r-- | Documentation/admin-guide/perf/arm_dsu_pmu.rst | 29
-rw-r--r-- | Documentation/admin-guide/perf/cxl.rst | 68
-rw-r--r-- | Documentation/admin-guide/perf/hisi-pcie-pmu.rst | 130
-rw-r--r-- | Documentation/admin-guide/perf/hisi-pmu.rst | 126
-rw-r--r-- | Documentation/admin-guide/perf/hns3-pmu.rst | 136
-rw-r--r-- | Documentation/admin-guide/perf/imx-ddr.rst | 71
-rw-r--r-- | Documentation/admin-guide/perf/index.rst | 24
-rw-r--r-- | Documentation/admin-guide/perf/meson-ddr-pmu.rst | 70
-rw-r--r-- | Documentation/admin-guide/perf/nvidia-pmu.rst | 299
-rw-r--r-- | Documentation/admin-guide/perf/qcom_l2_pmu.rst | 39
-rw-r--r-- | Documentation/admin-guide/perf/qcom_l3_pmu.rst | 26
-rw-r--r-- | Documentation/admin-guide/perf/thunderx2-pmu.rst | 44
-rw-r--r-- | Documentation/admin-guide/perf/xgene-pmu.rst | 49 |
16 files changed, 1342 insertions, 0 deletions
diff --git a/Documentation/admin-guide/perf/alibaba_pmu.rst b/Documentation/admin-guide/perf/alibaba_pmu.rst new file mode 100644 index 0000000000..7d84002390 --- /dev/null +++ b/Documentation/admin-guide/perf/alibaba_pmu.rst @@ -0,0 +1,105 @@ +============================================================= +Alibaba's T-Head SoC Uncore Performance Monitoring Unit (PMU) +============================================================= + +The Yitian 710, custom-built by Alibaba Group's chip development business, +T-Head, implements uncore PMU for performance and functional debugging to +facilitate system maintenance. + +DDR Sub-System Driveway (DRW) PMU Driver +========================================= + +Yitian 710 employs eight DDR5/4 channels, four on each die. Each DDR5 channel +is independent of others to service system memory requests. And one DDR5 +channel is split into two independent sub-channels. The DDR Sub-System Driveway +implements separate PMUs for each sub-channel to monitor various performance +metrics. + +The Driveway PMU devices are named as ali_drw_<sys_base_addr> with perf. +For example, ali_drw_21000 and ali_drw_21080 are two PMU devices for two +sub-channels of the same channel in die 0. And the PMU device of die 1 is +prefixed with ali_drw_400XXXXX, e.g. ali_drw_40021000. + +Each sub-channel has 36 PMU counters in total, which is classified into +four groups: + +- Group 0: PMU Cycle Counter. This group has one pair of counters + pmu_cycle_cnt_low and pmu_cycle_cnt_high, that is used as the cycle count + based on DDRC core clock. + +- Group 1: PMU Bandwidth Counters. This group has 8 counters that are used + to count the total access number of either the eight bank groups in a + selected rank, or four ranks separately in the first 4 counters. The base + transfer unit is 64B. + +- Group 2: PMU Retry Counters. This group has 10 counters, that intend to + count the total retry number of each type of uncorrectable error. + +- Group 3: PMU Common Counters. This group has 16 counters, that are used + to count the common events. + +For now, the Driveway PMU driver only uses counters in group 0 and group 3. + +The DDR Controller (DDRCTL) and DDR PHY combine to create a complete solution +for connecting an SoC application bus to DDR memory devices. The DDRCTL +receives transactions Host Interface (HIF) which is custom-defined by Synopsys. +These transactions are queued internally and scheduled for access while +satisfying the SDRAM protocol timing requirements, transaction priorities, and +dependencies between the transactions. The DDRCTL in turn issues commands on +the DDR PHY Interface (DFI) to the PHY module, which launches and captures data +to and from the SDRAM. The driveway PMUs have hardware logic to gather +statistics and performance logging signals on HIF, DFI, etc. + +By counting the READ, WRITE and RMW commands sent to the DDRC through the HIF +interface, we could calculate the bandwidth. 
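A quick sanity check on a single sub-channel is often enough before running the full per-channel form shown in the next example (the ali_drw_21000 instance name is taken from the naming example above and will differ per system)::

    perf stat \
      -e ali_drw_21000/hif_wr/ \
      -e ali_drw_21000/hif_rd/ \
      -e ali_drw_21000/hif_rmw/ \
      -e ali_drw_21000/cycle/ -- sleep 10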
Example usage of counting memory +data bandwidth:: + + perf stat \ + -e ali_drw_21000/hif_wr/ \ + -e ali_drw_21000/hif_rd/ \ + -e ali_drw_21000/hif_rmw/ \ + -e ali_drw_21000/cycle/ \ + -e ali_drw_21080/hif_wr/ \ + -e ali_drw_21080/hif_rd/ \ + -e ali_drw_21080/hif_rmw/ \ + -e ali_drw_21080/cycle/ \ + -e ali_drw_23000/hif_wr/ \ + -e ali_drw_23000/hif_rd/ \ + -e ali_drw_23000/hif_rmw/ \ + -e ali_drw_23000/cycle/ \ + -e ali_drw_23080/hif_wr/ \ + -e ali_drw_23080/hif_rd/ \ + -e ali_drw_23080/hif_rmw/ \ + -e ali_drw_23080/cycle/ \ + -e ali_drw_25000/hif_wr/ \ + -e ali_drw_25000/hif_rd/ \ + -e ali_drw_25000/hif_rmw/ \ + -e ali_drw_25000/cycle/ \ + -e ali_drw_25080/hif_wr/ \ + -e ali_drw_25080/hif_rd/ \ + -e ali_drw_25080/hif_rmw/ \ + -e ali_drw_25080/cycle/ \ + -e ali_drw_27000/hif_wr/ \ + -e ali_drw_27000/hif_rd/ \ + -e ali_drw_27000/hif_rmw/ \ + -e ali_drw_27000/cycle/ \ + -e ali_drw_27080/hif_wr/ \ + -e ali_drw_27080/hif_rd/ \ + -e ali_drw_27080/hif_rmw/ \ + -e ali_drw_27080/cycle/ -- sleep 10 + +Example usage of counting all memory read/write bandwidth by metric:: + + perf stat -M ddr_read_bandwidth.all -- sleep 10 + perf stat -M ddr_write_bandwidth.all -- sleep 10 + +The average DRAM bandwidth can be calculated as follows: + +- Read Bandwidth = perf_hif_rd * DDRC_WIDTH * DDRC_Freq / DDRC_Cycle +- Write Bandwidth = (perf_hif_wr + perf_hif_rmw) * DDRC_WIDTH * DDRC_Freq / DDRC_Cycle + +Here, DDRC_WIDTH = 64 bytes. + +The current driver does not support sampling. So "perf record" is +unsupported. Also attach to a task is unsupported as the events are all +uncore. diff --git a/Documentation/admin-guide/perf/arm-ccn.rst b/Documentation/admin-guide/perf/arm-ccn.rst new file mode 100644 index 0000000000..f62f7fe50e --- /dev/null +++ b/Documentation/admin-guide/perf/arm-ccn.rst @@ -0,0 +1,61 @@ +========================== +ARM Cache Coherent Network +========================== + +CCN-504 is a ring-bus interconnect consisting of 11 crosspoints +(XPs), with each crosspoint supporting up to two device ports, +so nodes (devices) 0 and 1 are connected to crosspoint 0, +nodes 2 and 3 to crosspoint 1 etc. + +PMU (perf) driver +----------------- + +The CCN driver registers a perf PMU driver, which provides +description of available events and configuration options +in sysfs, see /sys/bus/event_source/devices/ccn*. + +The "format" directory describes format of the config, config1 +and config2 fields of the perf_event_attr structure. The "events" +directory provides configuration templates for all documented +events, that can be used with perf tool. For example "xp_valid_flit" +is an equivalent of "type=0x8,event=0x4". Other parameters must be +explicitly specified. + +For events originating from device, "node" defines its index. + +Crosspoint PMU events require "xp" (index), "bus" (bus number) +and "vc" (virtual channel ID). + +Crosspoint watchpoint-based events (special "event" value 0xfe) +require "xp" and "vc" as above plus "port" (device port index), +"dir" (transmit/receive direction), comparator values ("cmp_l" +and "cmp_h") and "mask", being index of the comparator mask. + +Masks are defined separately from the event description +(due to limited number of the config values) in the "cmp_mask" +directory, with first 8 configurable by user and additional +4 hardcoded for the most frequent use cases. + +Cycle counter is described by a "type" value 0xff and does +not require any other settings. 
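For instance, following the raw-config equivalence described above, the cycle counter can be requested either by its named template or by the raw "type" value (a minimal sketch; the named form also appears in the perf listing below)::

    perf stat -a -e ccn/cycles/ sleep 1
    perf stat -a -e ccn/type=0xff/ sleep 1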
+ +The driver also provides a "cpumask" sysfs attribute, which contains +a single CPU ID, of the processor which will be used to handle all +the CCN PMU events. It is recommended that the user space tools +request the events on this processor (if not, the perf_event->cpu value +will be overwritten anyway). In case of this processor being offlined, +the events are migrated to another one and the attribute is updated. + +Example of perf tool use:: + + / # perf list | grep ccn + ccn/cycles/ [Kernel PMU event] + <...> + ccn/xp_valid_flit,xp=?,port=?,vc=?,dir=?/ [Kernel PMU event] + <...> + + / # perf stat -a -e ccn/cycles/,ccn/xp_valid_flit,xp=1,port=0,vc=1,dir=1/ \ + sleep 1 + +The driver does not support sampling, therefore "perf record" will +not work. Per-task (without "-a") perf sessions are not supported. diff --git a/Documentation/admin-guide/perf/arm-cmn.rst b/Documentation/admin-guide/perf/arm-cmn.rst new file mode 100644 index 0000000000..796e25b702 --- /dev/null +++ b/Documentation/admin-guide/perf/arm-cmn.rst @@ -0,0 +1,65 @@ +============================= +Arm Coherent Mesh Network PMU +============================= + +CMN-600 is a configurable mesh interconnect consisting of a rectangular +grid of crosspoints (XPs), with each crosspoint supporting up to two +device ports to which various AMBA CHI agents are attached. + +CMN implements a distributed PMU design as part of its debug and trace +functionality. This consists of a local monitor (DTM) at every XP, which +counts up to 4 event signals from the connected device nodes and/or the +XP itself. Overflow from these local counters is accumulated in up to 8 +global counters implemented by the main controller (DTC), which provides +overall PMU control and interrupts for global counter overflow. + +PMU events +---------- + +The PMU driver registers a single PMU device for the whole interconnect, +see /sys/bus/event_source/devices/arm_cmn_0. Multi-chip systems may link +more than one CMN together via external CCIX links - in this situation, +each mesh counts its own events entirely independently, and additional +PMU devices will be named arm_cmn_{1..n}. + +Most events are specified in a format based directly on the TRM +definitions - "type" selects the respective node type, and "eventid" the +event number. Some events require an additional occupancy ID, which is +specified by "occupid". + +* Since RN-D nodes do not have any distinct events from RN-I nodes, they + are treated as the same type (0xa), and the common event templates are + named "rnid_*". + +* The cycle counter is treated as a synthetic event belonging to the DTC + node ("type" == 0x3, "eventid" is ignored). + +* XP events also encode the port and channel in the "eventid" field, to + match the underlying pmu_event0_id encoding for the pmu_event_sel + register. The event templates are named with prefixes to cover all + permutations. + +By default each event provides an aggregate count over all nodes of the +given type. To target a specific node, "bynodeid" must be set to 1 and +"nodeid" to the appropriate value derived from the CMN configuration +(as defined in the "Node ID Mapping" section of the TRM). + +Watchpoints +----------- + +The PMU can also count watchpoint events to monitor specific flit +traffic. Watchpoints are treated as a synthetic event type, and like PMU +events can be global or targeted with a particular XP's "nodeid" value. 
+Since the watchpoint direction is otherwise implicit in the underlying +register selection, separate events are provided for flit uploads and +downloads. + +The flit match value and mask are passed in config1 and config2 ("val" +and "mask" respectively). "wp_dev_sel", "wp_chn_sel", "wp_grp" and +"wp_exclusive" are specified per the TRM definitions for dtm_wp_config0. +Where a watchpoint needs to match fields from both match groups on the +REQ or SNP channel, it can be specified as two events - one for each +group - with the same nonzero "combine" value. The count for such a +pair of combined events will be attributed to the primary match. +Watchpoint events with a "combine" value of 0 are considered independent +and will count individually. diff --git a/Documentation/admin-guide/perf/arm_dsu_pmu.rst b/Documentation/admin-guide/perf/arm_dsu_pmu.rst new file mode 100644 index 0000000000..7fd34db75d --- /dev/null +++ b/Documentation/admin-guide/perf/arm_dsu_pmu.rst @@ -0,0 +1,29 @@ +================================== +ARM DynamIQ Shared Unit (DSU) PMU +================================== + +ARM DynamIQ Shared Unit integrates one or more cores with an L3 memory system, +control logic and external interfaces to form a multicore cluster. The PMU +allows counting the various events related to the L3 cache, Snoop Control Unit +etc, using 32bit independent counters. It also provides a 64bit cycle counter. + +The PMU can only be accessed via CPU system registers and are common to the +cores connected to the same DSU. Like most of the other uncore PMUs, DSU +PMU doesn't support process specific events and cannot be used in sampling mode. + +The DSU provides a bitmap for a subset of implemented events via hardware +registers. There is no way for the driver to determine if the other events +are available or not. Hence the driver exposes only those events advertised +by the DSU, in "events" directory under:: + + /sys/bus/event_sources/devices/arm_dsu_<N>/ + +The user should refer to the TRM of the product to figure out the supported events +and use the raw event code for the unlisted events. + +The driver also exposes the CPUs connected to the DSU instance in "associated_cpus". + + +e.g usage:: + + perf stat -a -e arm_dsu_0/cycles/ diff --git a/Documentation/admin-guide/perf/cxl.rst b/Documentation/admin-guide/perf/cxl.rst new file mode 100644 index 0000000000..9233ea0d0b --- /dev/null +++ b/Documentation/admin-guide/perf/cxl.rst @@ -0,0 +1,68 @@ +.. SPDX-License-Identifier: GPL-2.0 + +====================================== +CXL Performance Monitoring Unit (CPMU) +====================================== + +The CXL rev 3.0 specification provides a definition of CXL Performance +Monitoring Unit in section 13.2: Performance Monitoring. + +CXL components (e.g. Root Port, Switch Upstream Port, End Point) may have +any number of CPMU instances. CPMU capabilities are fully discoverable from +the devices. The specification provides event definitions for all CXL protocol +message types and a set of additional events for things commonly counted on +CXL devices (e.g. DRAM events). + +CPMU driver +=========== + +The CPMU driver registers a perf PMU with the name pmu_mem<X>.<Y> on the CXL bus +representing the Yth CPMU for memX. 
+ + /sys/bus/cxl/device/pmu_mem<X>.<Y> + +The associated PMU is registered as + + /sys/bus/event_sources/devices/cxl_pmu_mem<X>.<Y> + +In common with other CXL bus devices, the id has no specific meaning and the +relationship to specific CXL device should be established via the device parent +of the device on the CXL bus. + +PMU driver provides description of available events and filter options in sysfs. + +The "format" directory describes all formats of the config (event vendor id, +group id and mask) config1 (threshold, filter enables) and config2 (filter +parameters) fields of the perf_event_attr structure. The "events" directory +describes all documented events show in perf list. + +The events shown in perf list are the most fine grained events with a single +bit of the event mask set. More general events may be enable by setting +multiple mask bits in config. For example, all Device to Host Read Requests +may be captured on a single counter by setting the bits for all of + +* d2h_req_rdcurr +* d2h_req_rdown +* d2h_req_rdshared +* d2h_req_rdany +* d2h_req_rdownnodata + +Example of usage:: + + $#perf list + cxl_pmu_mem0.0/clock_ticks/ [Kernel PMU event] + cxl_pmu_mem0.0/d2h_req_rdshared/ [Kernel PMU event] + cxl_pmu_mem0.0/h2d_req_snpcur/ [Kernel PMU event] + cxl_pmu_mem0.0/h2d_req_snpdata/ [Kernel PMU event] + cxl_pmu_mem0.0/h2d_req_snpinv/ [Kernel PMU event] + ----------------------------------------------------------- + + $# perf stat -a -e cxl_pmu_mem0.0/clock_ticks/ -e cxl_pmu_mem0.0/d2h_req_rdshared/ + +Vendor specific events may also be available and if so can be used via + + $# perf stat -a -e cxl_pmu_mem0.0/vid=VID,gid=GID,mask=MASK/ + +The driver does not support sampling so "perf record" is unsupported. +It only supports system-wide counting so attaching to a task is +unsupported. diff --git a/Documentation/admin-guide/perf/hisi-pcie-pmu.rst b/Documentation/admin-guide/perf/hisi-pcie-pmu.rst new file mode 100644 index 0000000000..7e863662e2 --- /dev/null +++ b/Documentation/admin-guide/perf/hisi-pcie-pmu.rst @@ -0,0 +1,130 @@ +================================================ +HiSilicon PCIe Performance Monitoring Unit (PMU) +================================================ + +On Hip09, HiSilicon PCIe Performance Monitoring Unit (PMU) could monitor +bandwidth, latency, bus utilization and buffer occupancy data of PCIe. + +Each PCIe Core has a PMU to monitor multi Root Ports of this PCIe Core and +all Endpoints downstream these Root Ports. + + +HiSilicon PCIe PMU driver +========================= + +The PCIe PMU driver registers a perf PMU with the name of its sicl-id and PCIe +Core id.:: + + /sys/bus/event_source/hisi_pcie<sicl>_core<core> + +PMU driver provides description of available events and filter options in sysfs, +see /sys/bus/event_source/devices/hisi_pcie<sicl>_core<core>. + +The "format" directory describes all formats of the config (events) and config1 +(filter options) fields of the perf_event_attr structure. The "events" directory +describes all documented events shown in perf list. + +The "identifier" sysfs file allows users to identify the version of the +PMU hardware device. + +The "bus" sysfs file allows users to get the bus number of Root Ports +monitored by PMU. 
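Both attributes can be read straight from sysfs (a sketch assuming the hisi_pcie0_core0 instance used in the examples that follow)::

    cat /sys/bus/event_source/devices/hisi_pcie0_core0/identifier
    cat /sys/bus/event_source/devices/hisi_pcie0_core0/bus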
+ +Example usage of perf:: + + $# perf list + hisi_pcie0_core0/rx_mwr_latency/ [kernel PMU event] + hisi_pcie0_core0/rx_mwr_cnt/ [kernel PMU event] + ------------------------------------------ + + $# perf stat -e hisi_pcie0_core0/rx_mwr_latency/ + $# perf stat -e hisi_pcie0_core0/rx_mwr_cnt/ + $# perf stat -g -e hisi_pcie0_core0/rx_mwr_latency/ -e hisi_pcie0_core0/rx_mwr_cnt/ + +The current driver does not support sampling. So "perf record" is unsupported. +Also attach to a task is unsupported for PCIe PMU. + +Filter options +-------------- + +1. Target filter + + PMU could only monitor the performance of traffic downstream target Root + Ports or downstream target Endpoint. PCIe PMU driver support "port" and + "bdf" interfaces for users, and these two interfaces aren't supported at the + same time. + + - port + + "port" filter can be used in all PCIe PMU events, target Root Port can be + selected by configuring the 16-bits-bitmap "port". Multi ports can be + selected for AP-layer-events, and only one port can be selected for + TL/DL-layer-events. + + For example, if target Root Port is 0000:00:00.0 (x8 lanes), bit0 of + bitmap should be set, port=0x1; if target Root Port is 0000:00:04.0 (x4 + lanes), bit8 is set, port=0x100; if these two Root Ports are both + monitored, port=0x101. + + Example usage of perf:: + + $# perf stat -e hisi_pcie0_core0/rx_mwr_latency,port=0x1/ sleep 5 + + - bdf + + "bdf" filter can only be used in bandwidth events, target Endpoint is + selected by configuring BDF to "bdf". Counter only counts the bandwidth of + message requested by target Endpoint. + + For example, "bdf=0x3900" means BDF of target Endpoint is 0000:39:00.0. + + Example usage of perf:: + + $# perf stat -e hisi_pcie0_core0/rx_mrd_flux,bdf=0x3900/ sleep 5 + +2. Trigger filter + + Event statistics start when the first time TLP length is greater/smaller + than trigger condition. You can set the trigger condition by writing + "trig_len", and set the trigger mode by writing "trig_mode". This filter can + only be used in bandwidth events. + + For example, "trig_len=4" means trigger condition is 2^4 DW, "trig_mode=0" + means statistics start when TLP length > trigger condition, "trig_mode=1" + means start when TLP length < condition. + + Example usage of perf:: + + $# perf stat -e hisi_pcie0_core0/rx_mrd_flux,trig_len=0x4,trig_mode=1/ sleep 5 + +3. Threshold filter + + Counter counts when TLP length within the specified range. You can set the + threshold by writing "thr_len", and set the threshold mode by writing + "thr_mode". This filter can only be used in bandwidth events. + + For example, "thr_len=4" means threshold is 2^4 DW, "thr_mode=0" means + counter counts when TLP length >= threshold, and "thr_mode=1" means counts + when TLP length < threshold. + + Example usage of perf:: + + $# perf stat -e hisi_pcie0_core0/rx_mrd_flux,thr_len=0x4,thr_mode=1/ sleep 5 + +4. TLP Length filter + + When counting bandwidth, the data can be composed of certain parts of TLP + packets. You can specify it through "len_mode": + + - 2'b00: Reserved (Do not use this since the behaviour is undefined) + - 2'b01: Bandwidth of TLP payloads + - 2'b10: Bandwidth of TLP headers + - 2'b11: Bandwidth of both TLP payloads and headers + + For example, "len_mode=2" means only counting the bandwidth of TLP headers + and "len_mode=3" means the final bandwidth data is composed of both TLP + headers and payloads. Default value if not specified is 2'b11. 
+ + Example usage of perf:: + + $# perf stat -e hisi_pcie0_core0/rx_mrd_flux,len_mode=0x1/ sleep 5 diff --git a/Documentation/admin-guide/perf/hisi-pmu.rst b/Documentation/admin-guide/perf/hisi-pmu.rst new file mode 100644 index 0000000000..e0174d2080 --- /dev/null +++ b/Documentation/admin-guide/perf/hisi-pmu.rst @@ -0,0 +1,126 @@ +====================================================== +HiSilicon SoC uncore Performance Monitoring Unit (PMU) +====================================================== + +The HiSilicon SoC chip includes various independent system device PMUs +such as L3 cache (L3C), Hydra Home Agent (HHA) and DDRC. These PMUs are +independent and have hardware logic to gather statistics and performance +information. + +The HiSilicon SoC encapsulates multiple CPU and IO dies. Each CPU cluster +(CCL) is made up of 4 cpu cores sharing one L3 cache; each CPU die is +called Super CPU cluster (SCCL) and is made up of 6 CCLs. Each SCCL has +two HHAs (0 - 1) and four DDRCs (0 - 3), respectively. + +HiSilicon SoC uncore PMU driver +------------------------------- + +Each device PMU has separate registers for event counting, control and +interrupt, and the PMU driver shall register perf PMU drivers like L3C, +HHA and DDRC etc. The available events and configuration options shall +be described in the sysfs, see: + +/sys/devices/hisi_sccl{X}_<l3c{Y}/hha{Y}/ddrc{Y}>/, or +/sys/bus/event_source/devices/hisi_sccl{X}_<l3c{Y}/hha{Y}/ddrc{Y}>. +The "perf list" command shall list the available events from sysfs. + +Each L3C, HHA and DDRC is registered as a separate PMU with perf. The PMU +name will appear in event listing as hisi_sccl<sccl-id>_module<index-id>. +where "sccl-id" is the identifier of the SCCL and "index-id" is the index of +module. + +e.g. hisi_sccl3_l3c0/rd_hit_cpipe is READ_HIT_CPIPE event of L3C index #0 in +SCCL ID #3. + +e.g. hisi_sccl1_hha0/rx_operations is RX_OPERATIONS event of HHA index #0 in +SCCL ID #1. + +The driver also provides a "cpumask" sysfs attribute, which shows the CPU core +ID used to count the uncore PMU event. + +Example usage of perf:: + + $# perf list + hisi_sccl3_l3c0/rd_hit_cpipe/ [kernel PMU event] + ------------------------------------------ + hisi_sccl3_l3c0/wr_hit_cpipe/ [kernel PMU event] + ------------------------------------------ + hisi_sccl1_l3c0/rd_hit_cpipe/ [kernel PMU event] + ------------------------------------------ + hisi_sccl1_l3c0/wr_hit_cpipe/ [kernel PMU event] + ------------------------------------------ + + $# perf stat -a -e hisi_sccl3_l3c0/rd_hit_cpipe/ sleep 5 + $# perf stat -a -e hisi_sccl3_l3c0/config=0x02/ sleep 5 + +For HiSilicon uncore PMU v2 whose identifier is 0x30, the topology is the same +as PMU v1, but some new functions are added to the hardware. + +1. L3C PMU supports filtering by core/thread within the cluster which can be +specified as a bitmap:: + + $# perf stat -a -e hisi_sccl3_l3c0/config=0x02,tt_core=0x3/ sleep 5 + +This will only count the operations from core/thread 0 and 1 in this cluster. + +2. Tracetag allow the user to chose to count only read, write or atomic +operations via the tt_req parameeter in perf. The default value counts all +operations. tt_req is 3bits, 3'b100 represents read operations, 3'b101 +represents write operations, 3'b110 represents atomic store operations and +3'b111 represents atomic non-store operations, other values are reserved:: + + $# perf stat -a -e hisi_sccl3_l3c0/config=0x02,tt_req=0x4/ sleep 5 + +This will only count the read operations in this cluster. + +3. 
Datasrc allows the user to check where the data comes from. It is 5 bits. +Some important codes are as follows: + +- 5'b00001: comes from L3C in this die; +- 5'b01000: comes from L3C in the cross-die; +- 5'b01001: comes from L3C which is in another socket; +- 5'b01110: comes from the local DDR; +- 5'b01111: comes from the cross-die DDR; +- 5'b10000: comes from cross-socket DDR; + +etc, it is mainly helpful to find that the data source is nearest from the CPU +cores. If datasrc_cfg is used in the multi-chips, the datasrc_skt shall be +configured in perf command:: + + $# perf stat -a -e hisi_sccl3_l3c0/config=0xb9,datasrc_cfg=0xE/, + hisi_sccl3_l3c0/config=0xb9,datasrc_cfg=0xF/ sleep 5 + +4. Some HiSilicon SoCs encapsulate multiple CPU and IO dies. Each CPU die +contains several Compute Clusters (CCLs). The I/O dies are called Super I/O +clusters (SICL) containing multiple I/O clusters (ICLs). Each CCL/ICL in the +SoC has a unique ID. Each ID is 11bits, include a 6-bit SCCL-ID and 5-bit +CCL/ICL-ID. For I/O die, the ICL-ID is followed by: + +- 5'b00000: I/O_MGMT_ICL; +- 5'b00001: Network_ICL; +- 5'b00011: HAC_ICL; +- 5'b10000: PCIe_ICL; + +5. uring_channel: UC PMU events 0x47~0x59 supports filtering by tx request +uring channel. It is 2 bits. Some important codes are as follows: + +- 2'b11: count the events which sent to the uring_ext (MATA) channel; +- 2'b01: is the same as 2'b11; +- 2'b10: count the events which sent to the uring (non-MATA) channel; +- 2'b00: default value, count the events which sent to the both uring and + uring_ext channel; + +Users could configure IDs to count data come from specific CCL/ICL, by setting +srcid_cmd & srcid_msk, and data desitined for specific CCL/ICL by setting +tgtid_cmd & tgtid_msk. A set bit in srcid_msk/tgtid_msk means the PMU will not +check the bit when matching against the srcid_cmd/tgtid_cmd. + +If all of these options are disabled, it can works by the default value that +doesn't distinguish the filter condition and ID information and will return +the total counter values in the PMU counters. + +The current driver does not support sampling. So "perf record" is unsupported. +Also attach to a task is unsupported as the events are all uncore. + +Note: Please contact the maintainer for a complete list of events supported for +the PMU devices in the SoC and its information if needed. diff --git a/Documentation/admin-guide/perf/hns3-pmu.rst b/Documentation/admin-guide/perf/hns3-pmu.rst new file mode 100644 index 0000000000..75a40846d4 --- /dev/null +++ b/Documentation/admin-guide/perf/hns3-pmu.rst @@ -0,0 +1,136 @@ +====================================== +HNS3 Performance Monitoring Unit (PMU) +====================================== + +HNS3(HiSilicon network system 3) Performance Monitoring Unit (PMU) is an +End Point device to collect performance statistics of HiSilicon SoC NIC. +On Hip09, each SICL(Super I/O cluster) has one PMU device. + +HNS3 PMU supports collection of performance statistics such as bandwidth, +latency, packet rate and interrupt rate. + +Each HNS3 PMU supports 8 hardware events. + +HNS3 PMU driver +=============== + +The HNS3 PMU driver registers a perf PMU with the name of its sicl id.:: + + /sys/devices/hns3_pmu_sicl_<sicl_id> + +PMU driver provides description of available events, filter modes, format, +identifier and cpumask in sysfs. + +The "events" directory describes the event code of all supported events +shown in perf list. + +The "filtermode" directory describes the supported filter modes of each +event. 
+ +The "format" directory describes all formats of the config (events) and +config1 (filter options) fields of the perf_event_attr structure. + +The "identifier" file shows version of PMU hardware device. + +The "bdf_min" and "bdf_max" files show the supported bdf range of each +pmu device. + +The "hw_clk_freq" file shows the hardware clock frequency of each pmu +device. + +Example usage of checking event code and subevent code:: + + $# cat /sys/devices/hns3_pmu_sicl_0/events/dly_tx_normal_to_mac_time + config=0x00204 + $# cat /sys/devices/hns3_pmu_sicl_0/events/dly_tx_normal_to_mac_packet_num + config=0x10204 + +Each performance statistic has a pair of events to get two values to +calculate real performance data in userspace. + +The bits 0~15 of config (here 0x0204) are the true hardware event code. If +two events have same value of bits 0~15 of config, that means they are +event pair. And the bit 16 of config indicates getting counter 0 or +counter 1 of hardware event. + +After getting two values of event pair in userspace, the formula of +computation to calculate real performance data is::: + + counter 0 / counter 1 + +Example usage of checking supported filter mode:: + + $# cat /sys/devices/hns3_pmu_sicl_0/filtermode/bw_ssu_rpu_byte_num + filter mode supported: global/port/port-tc/func/func-queue/ + +Example usage of perf:: + + $# perf list + hns3_pmu_sicl_0/bw_ssu_rpu_byte_num/ [kernel PMU event] + hns3_pmu_sicl_0/bw_ssu_rpu_time/ [kernel PMU event] + ------------------------------------------ + + $# perf stat -g -e hns3_pmu_sicl_0/bw_ssu_rpu_byte_num,global=1/ -e hns3_pmu_sicl_0/bw_ssu_rpu_time,global=1/ -I 1000 + or + $# perf stat -g -e hns3_pmu_sicl_0/config=0x00002,global=1/ -e hns3_pmu_sicl_0/config=0x10002,global=1/ -I 1000 + + +Filter modes +-------------- + +1. global mode +PMU collect performance statistics for all HNS3 PCIe functions of IO DIE. +Set the "global" filter option to 1 will enable this mode. +Example usage of perf:: + + $# perf stat -a -e hns3_pmu_sicl_0/config=0x1020F,global=1/ -I 1000 + +2. port mode +PMU collect performance statistic of one whole physical port. The port id +is same as mac id. The "tc" filter option must be set to 0xF in this mode, +here tc stands for traffic class. + +Example usage of perf:: + + $# perf stat -a -e hns3_pmu_sicl_0/config=0x1020F,port=0,tc=0xF/ -I 1000 + +3. port-tc mode +PMU collect performance statistic of one tc of physical port. The port id +is same as mac id. The "tc" filter option must be set to 0 ~ 7 in this +mode. +Example usage of perf:: + + $# perf stat -a -e hns3_pmu_sicl_0/config=0x1020F,port=0,tc=0/ -I 1000 + +4. func mode +PMU collect performance statistic of one PF/VF. The function id is BDF of +PF/VF, its conversion formula:: + + func = (bus << 8) + (device << 3) + (function) + +for example: + BDF func + 35:00.0 0x3500 + 35:00.1 0x3501 + 35:01.0 0x3508 + +In this mode, the "queue" filter option must be set to 0xFFFF. +Example usage of perf:: + + $# perf stat -a -e hns3_pmu_sicl_0/config=0x1020F,bdf=0x3500,queue=0xFFFF/ -I 1000 + +5. func-queue mode +PMU collect performance statistic of one queue of PF/VF. The function id +is BDF of PF/VF, the "queue" filter option must be set to the exact queue +id of function. +Example usage of perf:: + + $# perf stat -a -e hns3_pmu_sicl_0/config=0x1020F,bdf=0x3500,queue=0/ -I 1000 + +6. func-intr mode +PMU collect performance statistic of one interrupt of PF/VF. The function +id is BDF of PF/VF, the "intr" filter option must be set to the exact +interrupt id of function. 
+Example usage of perf:: + + $# perf stat -a -e hns3_pmu_sicl_0/config=0x00301,bdf=0x3500,intr=0/ -I 1000 diff --git a/Documentation/admin-guide/perf/imx-ddr.rst b/Documentation/admin-guide/perf/imx-ddr.rst new file mode 100644 index 0000000000..90926d0fb8 --- /dev/null +++ b/Documentation/admin-guide/perf/imx-ddr.rst @@ -0,0 +1,71 @@ +===================================================== +Freescale i.MX8 DDR Performance Monitoring Unit (PMU) +===================================================== + +There are no performance counters inside the DRAM controller, so performance +signals are brought out to the edge of the controller where a set of 4 x 32 bit +counters is implemented. This is controlled by the CSV modes programmed in counter +control register which causes a large number of PERF signals to be generated. + +Selection of the value for each counter is done via the config registers. There +is one register for each counter. Counter 0 is special in that it always counts +“time” and when expired causes a lock on itself and the other counters and an +interrupt is raised. If any other counter overflows, it continues counting, and +no interrupt is raised. + +The "format" directory describes format of the config (event ID) and config1 +(AXI filtering) fields of the perf_event_attr structure, see /sys/bus/event_source/ +devices/imx8_ddr0/format/. The "events" directory describes the events types +hardware supported that can be used with perf tool, see /sys/bus/event_source/ +devices/imx8_ddr0/events/. The "caps" directory describes filter features implemented +in DDR PMU, see /sys/bus/events_source/devices/imx8_ddr0/caps/. + + .. code-block:: bash + + perf stat -a -e imx8_ddr0/cycles/ cmd + perf stat -a -e imx8_ddr0/read/,imx8_ddr0/write/ cmd + +AXI filtering is only used by CSV modes 0x41 (axid-read) and 0x42 (axid-write) +to count reading or writing matches filter setting. Filter setting is various +from different DRAM controller implementations, which is distinguished by quirks +in the driver. You also can dump info from userspace, filter in "caps" directory +indicates whether PMU supports AXI ID filter or not; enhanced_filter indicates +whether PMU supports enhanced AXI ID filter or not. Value 0 for un-supported, and +value 1 for supported. + +* With DDR_CAP_AXI_ID_FILTER quirk(filter: 1, enhanced_filter: 0). + Filter is defined with two configuration parts: + --AXI_ID defines AxID matching value. + --AXI_MASKING defines which bits of AxID are meaningful for the matching. + + - 0: corresponding bit is masked. + - 1: corresponding bit is not masked, i.e. used to do the matching. + + AXI_ID and AXI_MASKING are mapped on DPCR1 register in performance counter. + When non-masked bits are matching corresponding AXI_ID bits then counter is + incremented. Perf counter is incremented if:: + + AxID && AXI_MASKING == AXI_ID && AXI_MASKING + + This filter doesn't support filter different AXI ID for axid-read and axid-write + event at the same time as this filter is shared between counters. + + .. code-block:: bash + + perf stat -a -e imx8_ddr0/axid-read,axi_mask=0xMMMM,axi_id=0xDDDD/ cmd + perf stat -a -e imx8_ddr0/axid-write,axi_mask=0xMMMM,axi_id=0xDDDD/ cmd + + .. note:: + + axi_mask is inverted in userspace(i.e. set bits are bits to mask), and + it will be reverted in driver automatically. so that the user can just specify + axi_id to monitor a specific id, rather than having to specify axi_mask. + + .. 
code-block:: bash + + perf stat -a -e imx8_ddr0/axid-read,axi_id=0x12/ cmd, which will monitor ARID=0x12 + +* With DDR_CAP_AXI_ID_FILTER_ENHANCED quirk(filter: 1, enhanced_filter: 1). + This is an extension to the DDR_CAP_AXI_ID_FILTER quirk which permits + counting the number of bytes (as opposed to the number of bursts) from DDR + read and write transactions concurrently with another set of data counters. diff --git a/Documentation/admin-guide/perf/index.rst b/Documentation/admin-guide/perf/index.rst new file mode 100644 index 0000000000..f60be04e4e --- /dev/null +++ b/Documentation/admin-guide/perf/index.rst @@ -0,0 +1,24 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=========================== +Performance monitor support +=========================== + +.. toctree:: + :maxdepth: 1 + + hisi-pmu + hisi-pcie-pmu + hns3-pmu + imx-ddr + qcom_l2_pmu + qcom_l3_pmu + arm-ccn + arm-cmn + xgene-pmu + arm_dsu_pmu + thunderx2-pmu + alibaba_pmu + nvidia-pmu + meson-ddr-pmu + cxl diff --git a/Documentation/admin-guide/perf/meson-ddr-pmu.rst b/Documentation/admin-guide/perf/meson-ddr-pmu.rst new file mode 100644 index 0000000000..8e71be1d63 --- /dev/null +++ b/Documentation/admin-guide/perf/meson-ddr-pmu.rst @@ -0,0 +1,70 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=========================================================== +Amlogic SoC DDR Bandwidth Performance Monitoring Unit (PMU) +=========================================================== + +The Amlogic Meson G12 SoC contains a bandwidth monitor inside DRAM controller. +The monitor includes 4 channels. Each channel can count the request accessing +DRAM. The channel can count up to 3 AXI port simultaneously. It can be helpful +to show if the performance bottleneck is on DDR bandwidth. + +Currently, this driver supports the following 5 perf events: + ++ meson_ddr_bw/total_rw_bytes/ ++ meson_ddr_bw/chan_1_rw_bytes/ ++ meson_ddr_bw/chan_2_rw_bytes/ ++ meson_ddr_bw/chan_3_rw_bytes/ ++ meson_ddr_bw/chan_4_rw_bytes/ + +meson_ddr_bw/chan_{1,2,3,4}_rw_bytes/ events are channel-specific events. +Each channel support filtering, which can let the channel to monitor +individual IP module in SoC. + +Below are DDR access request event filter keywords: + ++ arm - from CPU ++ vpu_read1 - from OSD + VPP read ++ gpu - from 3D GPU ++ pcie - from PCIe controller ++ hdcp - from HDCP controller ++ hevc_front - from HEVC codec front end ++ usb3_0 - from USB3.0 controller ++ hevc_back - from HEVC codec back end ++ h265enc - from HEVC encoder ++ vpu_read2 - from DI read ++ vpu_write1 - from VDIN write ++ vpu_write2 - from di write ++ vdec - from legacy codec video decoder ++ hcodec - from H264 encoder ++ ge2d - from ge2d ++ spicc1 - from SPI controller 1 ++ usb0 - from USB2.0 controller 0 ++ dma - from system DMA controller 1 ++ arb0 - from arb0 ++ sd_emmc_b - from SD eMMC b controller ++ usb1 - from USB2.0 controller 1 ++ audio - from Audio module ++ sd_emmc_c - from SD eMMC c controller ++ spicc2 - from SPI controller 2 ++ ethernet - from Ethernet controller + + +Examples: + + + Show the total DDR bandwidth per seconds: + + .. code-block:: bash + + perf stat -a -e meson_ddr_bw/total_rw_bytes/ -I 1000 sleep 10 + + + + Show individual DDR bandwidth from CPU and GPU respectively, as well as + sum of them: + + .. 
code-block:: bash + + perf stat -a -e meson_ddr_bw/chan_1_rw_bytes,arm=1/ -I 1000 sleep 10 + perf stat -a -e meson_ddr_bw/chan_2_rw_bytes,gpu=1/ -I 1000 sleep 10 + perf stat -a -e meson_ddr_bw/chan_3_rw_bytes,arm=1,gpu=1/ -I 1000 sleep 10 + diff --git a/Documentation/admin-guide/perf/nvidia-pmu.rst b/Documentation/admin-guide/perf/nvidia-pmu.rst new file mode 100644 index 0000000000..2e0d47cfe7 --- /dev/null +++ b/Documentation/admin-guide/perf/nvidia-pmu.rst @@ -0,0 +1,299 @@ +========================================================= +NVIDIA Tegra SoC Uncore Performance Monitoring Unit (PMU) +========================================================= + +The NVIDIA Tegra SoC includes various system PMUs to measure key performance +metrics like memory bandwidth, latency, and utilization: + +* Scalable Coherency Fabric (SCF) +* NVLink-C2C0 +* NVLink-C2C1 +* CNVLink +* PCIE + +PMU Driver +---------- + +The PMUs in this document are based on ARM CoreSight PMU Architecture as +described in document: ARM IHI 0091. Since this is a standard architecture, the +PMUs are managed by a common driver "arm-cs-arch-pmu". This driver describes +the available events and configuration of each PMU in sysfs. Please see the +sections below to get the sysfs path of each PMU. Like other uncore PMU drivers, +the driver provides "cpumask" sysfs attribute to show the CPU id used to handle +the PMU event. There is also "associated_cpus" sysfs attribute, which contains a +list of CPUs associated with the PMU instance. + +.. _SCF_PMU_Section: + +SCF PMU +------- + +The SCF PMU monitors system level cache events, CPU traffic, and +strongly-ordered (SO) PCIE write traffic to local/remote memory. Please see +:ref:`NVIDIA_Uncore_PMU_Traffic_Coverage_Section` for more info about the PMU +traffic coverage. + +The events and configuration options of this PMU device are described in sysfs, +see /sys/bus/event_sources/devices/nvidia_scf_pmu_<socket-id>. + +Example usage: + +* Count event id 0x0 in socket 0:: + + perf stat -a -e nvidia_scf_pmu_0/event=0x0/ + +* Count event id 0x0 in socket 1:: + + perf stat -a -e nvidia_scf_pmu_1/event=0x0/ + +NVLink-C2C0 PMU +-------------------- + +The NVLink-C2C0 PMU monitors incoming traffic from a GPU/CPU connected with +NVLink-C2C (Chip-2-Chip) interconnect. The type of traffic captured by this PMU +varies dependent on the chip configuration: + +* NVIDIA Grace Hopper Superchip: Hopper GPU is connected with Grace SoC. + + In this config, the PMU captures GPU ATS translated or EGM traffic from the GPU. + +* NVIDIA Grace CPU Superchip: two Grace CPU SoCs are connected. + + In this config, the PMU captures read and relaxed ordered (RO) writes from + PCIE device of the remote SoC. + +Please see :ref:`NVIDIA_Uncore_PMU_Traffic_Coverage_Section` for more info about +the PMU traffic coverage. + +The events and configuration options of this PMU device are described in sysfs, +see /sys/bus/event_sources/devices/nvidia_nvlink_c2c0_pmu_<socket-id>. 
+ +Example usage: + +* Count event id 0x0 from the GPU/CPU connected with socket 0:: + + perf stat -a -e nvidia_nvlink_c2c0_pmu_0/event=0x0/ + +* Count event id 0x0 from the GPU/CPU connected with socket 1:: + + perf stat -a -e nvidia_nvlink_c2c0_pmu_1/event=0x0/ + +* Count event id 0x0 from the GPU/CPU connected with socket 2:: + + perf stat -a -e nvidia_nvlink_c2c0_pmu_2/event=0x0/ + +* Count event id 0x0 from the GPU/CPU connected with socket 3:: + + perf stat -a -e nvidia_nvlink_c2c0_pmu_3/event=0x0/ + +NVLink-C2C1 PMU +------------------- + +The NVLink-C2C1 PMU monitors incoming traffic from a GPU connected with +NVLink-C2C (Chip-2-Chip) interconnect. This PMU captures untranslated GPU +traffic, in contrast with NvLink-C2C0 PMU that captures ATS translated traffic. +Please see :ref:`NVIDIA_Uncore_PMU_Traffic_Coverage_Section` for more info about +the PMU traffic coverage. + +The events and configuration options of this PMU device are described in sysfs, +see /sys/bus/event_sources/devices/nvidia_nvlink_c2c1_pmu_<socket-id>. + +Example usage: + +* Count event id 0x0 from the GPU connected with socket 0:: + + perf stat -a -e nvidia_nvlink_c2c1_pmu_0/event=0x0/ + +* Count event id 0x0 from the GPU connected with socket 1:: + + perf stat -a -e nvidia_nvlink_c2c1_pmu_1/event=0x0/ + +* Count event id 0x0 from the GPU connected with socket 2:: + + perf stat -a -e nvidia_nvlink_c2c1_pmu_2/event=0x0/ + +* Count event id 0x0 from the GPU connected with socket 3:: + + perf stat -a -e nvidia_nvlink_c2c1_pmu_3/event=0x0/ + +CNVLink PMU +--------------- + +The CNVLink PMU monitors traffic from GPU and PCIE device on remote sockets +to local memory. For PCIE traffic, this PMU captures read and relaxed ordered +(RO) write traffic. Please see :ref:`NVIDIA_Uncore_PMU_Traffic_Coverage_Section` +for more info about the PMU traffic coverage. + +The events and configuration options of this PMU device are described in sysfs, +see /sys/bus/event_sources/devices/nvidia_cnvlink_pmu_<socket-id>. + +Each SoC socket can be connected to one or more sockets via CNVLink. The user can +use "rem_socket" bitmap parameter to select the remote socket(s) to monitor. +Each bit represents the socket number, e.g. "rem_socket=0xE" corresponds to +socket 1 to 3. +/sys/bus/event_sources/devices/nvidia_cnvlink_pmu_<socket-id>/format/rem_socket +shows the valid bits that can be set in the "rem_socket" parameter. + +The PMU can not distinguish the remote traffic initiator, therefore it does not +provide filter to select the traffic source to monitor. It reports combined +traffic from remote GPU and PCIE devices. + +Example usage: + +* Count event id 0x0 for the traffic from remote socket 1, 2, and 3 to socket 0:: + + perf stat -a -e nvidia_cnvlink_pmu_0/event=0x0,rem_socket=0xE/ + +* Count event id 0x0 for the traffic from remote socket 0, 2, and 3 to socket 1:: + + perf stat -a -e nvidia_cnvlink_pmu_1/event=0x0,rem_socket=0xD/ + +* Count event id 0x0 for the traffic from remote socket 0, 1, and 3 to socket 2:: + + perf stat -a -e nvidia_cnvlink_pmu_2/event=0x0,rem_socket=0xB/ + +* Count event id 0x0 for the traffic from remote socket 0, 1, and 2 to socket 3:: + + perf stat -a -e nvidia_cnvlink_pmu_3/event=0x0,rem_socket=0x7/ + + +PCIE PMU +------------ + +The PCIE PMU monitors all read/write traffic from PCIE root ports to +local/remote memory. Please see :ref:`NVIDIA_Uncore_PMU_Traffic_Coverage_Section` +for more info about the PMU traffic coverage. 
+ +The events and configuration options of this PMU device are described in sysfs, +see /sys/bus/event_sources/devices/nvidia_pcie_pmu_<socket-id>. + +Each SoC socket can support multiple root ports. The user can use +"root_port" bitmap parameter to select the port(s) to monitor, i.e. +"root_port=0xF" corresponds to root port 0 to 3. +/sys/bus/event_sources/devices/nvidia_pcie_pmu_<socket-id>/format/root_port +shows the valid bits that can be set in the "root_port" parameter. + +Example usage: + +* Count event id 0x0 from root port 0 and 1 of socket 0:: + + perf stat -a -e nvidia_pcie_pmu_0/event=0x0,root_port=0x3/ + +* Count event id 0x0 from root port 0 and 1 of socket 1:: + + perf stat -a -e nvidia_pcie_pmu_1/event=0x0,root_port=0x3/ + +.. _NVIDIA_Uncore_PMU_Traffic_Coverage_Section: + +Traffic Coverage +---------------- + +The PMU traffic coverage may vary dependent on the chip configuration: + +* **NVIDIA Grace Hopper Superchip**: Hopper GPU is connected with Grace SoC. + + Example configuration with two Grace SoCs:: + + ********************************* ********************************* + * SOCKET-A * * SOCKET-B * + * * * * + * :::::::: * * :::::::: * + * : PCIE : * * : PCIE : * + * :::::::: * * :::::::: * + * | * * | * + * | * * | * + * ::::::: ::::::::: * * ::::::::: ::::::: * + * : : : : * * : : : : * + * : GPU :<--NVLink-->: Grace :<---CNVLink--->: Grace :<--NVLink-->: GPU : * + * : : C2C : SoC : * * : SoC : C2C : : * + * ::::::: ::::::::: * * ::::::::: ::::::: * + * | | * * | | * + * | | * * | | * + * &&&&&&&& &&&&&&&& * * &&&&&&&& &&&&&&&& * + * & GMEM & & CMEM & * * & CMEM & & GMEM & * + * &&&&&&&& &&&&&&&& * * &&&&&&&& &&&&&&&& * + * * * * + ********************************* ********************************* + + GMEM = GPU Memory (e.g. HBM) + CMEM = CPU Memory (e.g. LPDDR5X) + + | + | Following table contains traffic coverage of Grace SoC PMU in socket-A: + + :: + + +--------------+-------+-----------+-----------+-----+----------+----------+ + | | Source | + + +-------+-----------+-----------+-----+----------+----------+ + | Destination | |GPU ATS |GPU Not-ATS| | Socket-B | Socket-B | + | |PCI R/W|Translated,|Translated | CPU | CPU/PCIE1| GPU/PCIE2| + | | |EGM | | | | | + +==============+=======+===========+===========+=====+==========+==========+ + | Local | PCIE |NVLink-C2C0|NVLink-C2C1| SCF | SCF PMU | CNVLink | + | SYSRAM/CMEM | PMU |PMU |PMU | PMU | | PMU | + +--------------+-------+-----------+-----------+-----+----------+----------+ + | Local GMEM | PCIE | N/A |NVLink-C2C1| SCF | SCF PMU | CNVLink | + | | PMU | |PMU | PMU | | PMU | + +--------------+-------+-----------+-----------+-----+----------+----------+ + | Remote | PCIE |NVLink-C2C0|NVLink-C2C1| SCF | | | + | SYSRAM/CMEM | PMU |PMU |PMU | PMU | N/A | N/A | + | over CNVLink | | | | | | | + +--------------+-------+-----------+-----------+-----+----------+----------+ + | Remote GMEM | PCIE |NVLink-C2C0|NVLink-C2C1| SCF | | | + | over CNVLink | PMU |PMU |PMU | PMU | N/A | N/A | + +--------------+-------+-----------+-----------+-----+----------+----------+ + + PCIE1 traffic represents strongly ordered (SO) writes. + PCIE2 traffic represents reads and relaxed ordered (RO) writes. + +* **NVIDIA Grace CPU Superchip**: two Grace CPU SoCs are connected. 
+ + Example configuration with two Grace SoCs:: + + ******************* ******************* + * SOCKET-A * * SOCKET-B * + * * * * + * :::::::: * * :::::::: * + * : PCIE : * * : PCIE : * + * :::::::: * * :::::::: * + * | * * | * + * | * * | * + * ::::::::: * * ::::::::: * + * : : * * : : * + * : Grace :<--------NVLink------->: Grace : * + * : SoC : * C2C * : SoC : * + * ::::::::: * * ::::::::: * + * | * * | * + * | * * | * + * &&&&&&&& * * &&&&&&&& * + * & CMEM & * * & CMEM & * + * &&&&&&&& * * &&&&&&&& * + * * * * + ******************* ******************* + + GMEM = GPU Memory (e.g. HBM) + CMEM = CPU Memory (e.g. LPDDR5X) + + | + | Following table contains traffic coverage of Grace SoC PMU in socket-A: + + :: + + +-----------------+-----------+---------+----------+-------------+ + | | Source | + + +-----------+---------+----------+-------------+ + | Destination | | | Socket-B | Socket-B | + | | PCI R/W | CPU | CPU/PCIE1| PCIE2 | + | | | | | | + +=================+===========+=========+==========+=============+ + | Local | PCIE PMU | SCF PMU | SCF PMU | NVLink-C2C0 | + | SYSRAM/CMEM | | | | PMU | + +-----------------+-----------+---------+----------+-------------+ + | Remote | | | | | + | SYSRAM/CMEM | PCIE PMU | SCF PMU | N/A | N/A | + | over NVLink-C2C | | | | | + +-----------------+-----------+---------+----------+-------------+ + + PCIE1 traffic represents strongly ordered (SO) writes. + PCIE2 traffic represents reads and relaxed ordered (RO) writes. diff --git a/Documentation/admin-guide/perf/qcom_l2_pmu.rst b/Documentation/admin-guide/perf/qcom_l2_pmu.rst new file mode 100644 index 0000000000..c130178a4a --- /dev/null +++ b/Documentation/admin-guide/perf/qcom_l2_pmu.rst @@ -0,0 +1,39 @@ +===================================================================== +Qualcomm Technologies Level-2 Cache Performance Monitoring Unit (PMU) +===================================================================== + +This driver supports the L2 cache clusters found in Qualcomm Technologies +Centriq SoCs. There are multiple physical L2 cache clusters, each with their +own PMU. Each cluster has one or more CPUs associated with it. + +There is one logical L2 PMU exposed, which aggregates the results from +the physical PMUs. + +The driver provides a description of its available events and configuration +options in sysfs, see /sys/devices/l2cache_0. + +The "format" directory describes the format of the events. + +Events can be envisioned as a 2-dimensional array. Each column represents +a group of events. There are 8 groups. Only one entry from each +group can be in use at a time. If multiple events from the same group +are specified, the conflicting events cannot be counted at the same time. + +Events are specified as 0xCCG, where CC is 2 hex digits specifying +the code (array row) and G specifies the group (column) 0-7. + +In addition there is a cycle counter event specified by the value 0xFE +which is outside the above scheme. + +The driver provides a "cpumask" sysfs attribute which contains a mask +consisting of one CPU per cluster which will be used to handle all the PMU +events on that cluster. + +Examples for use with perf:: + + perf stat -e l2cache_0/config=0x001/,l2cache_0/config=0x042/ -a sleep 1 + + perf stat -e l2cache_0/config=0xfe/ -C 2 sleep 1 + +The driver does not support sampling, therefore "perf record" will +not work. Per-task perf sessions are not supported. 
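As a final note on the 0xCCG encoding described above, events whose codes end in the same group digit conflict and cannot be scheduled together, while events from different groups can (the event codes here are hypothetical and chosen only to illustrate the rule)::

    # 0x031 and 0x041 both select group 1 - they conflict
    perf stat -e l2cache_0/config=0x031/,l2cache_0/config=0x041/ -a sleep 1
    # 0x031 (group 1) and 0x042 (group 2) can be counted at the same time
    perf stat -e l2cache_0/config=0x031/,l2cache_0/config=0x042/ -a sleep 1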
diff --git a/Documentation/admin-guide/perf/qcom_l3_pmu.rst b/Documentation/admin-guide/perf/qcom_l3_pmu.rst new file mode 100644 index 0000000000..a3d014a46b --- /dev/null +++ b/Documentation/admin-guide/perf/qcom_l3_pmu.rst @@ -0,0 +1,26 @@ +=========================================================================== +Qualcomm Datacenter Technologies L3 Cache Performance Monitoring Unit (PMU) +=========================================================================== + +This driver supports the L3 cache PMUs found in Qualcomm Datacenter Technologies +Centriq SoCs. The L3 cache on these SOCs is composed of multiple slices, shared +by all cores within a socket. Each slice is exposed as a separate uncore perf +PMU with device name l3cache_<socket>_<instance>. User space is responsible +for aggregating across slices. + +The driver provides a description of its available events and configuration +options in sysfs, see /sys/devices/l3cache*. Given that these are uncore PMUs +the driver also exposes a "cpumask" sysfs attribute which contains a mask +consisting of one CPU per socket which will be used to handle all the PMU +events on that socket. + +The hardware implements 32bit event counters and has a flat 8bit event space +exposed via the "event" format attribute. In addition to the 32bit physical +counters the driver supports virtual 64bit hardware counters by using hardware +counter chaining. This feature is exposed via the "lc" (long counter) format +flag. E.g.:: + + perf stat -e l3cache_0_0/read-miss,lc/ + +Given that these are uncore PMUs the driver does not support sampling, therefore +"perf record" will not work. Per-task perf sessions are not supported. diff --git a/Documentation/admin-guide/perf/thunderx2-pmu.rst b/Documentation/admin-guide/perf/thunderx2-pmu.rst new file mode 100644 index 0000000000..01f158238a --- /dev/null +++ b/Documentation/admin-guide/perf/thunderx2-pmu.rst @@ -0,0 +1,44 @@ +============================================================= +Cavium ThunderX2 SoC Performance Monitoring Unit (PMU UNCORE) +============================================================= + +The ThunderX2 SoC PMU consists of independent, system-wide, per-socket +PMUs such as the Level 3 Cache (L3C), DDR4 Memory Controller (DMC) and +Cavium Coherent Processor Interconnect (CCPI2). + +The DMC has 8 interleaved channels and the L3C has 16 interleaved tiles. +Events are counted for the default channel (i.e. channel 0) and prorated +to the total number of channels/tiles. + +The DMC and L3C support up to 4 counters, while the CCPI2 supports up to 8 +counters. Counters are independently programmable to different events and +can be started and stopped individually. None of the counters support an +overflow interrupt. DMC and L3C counters are 32-bit and read every 2 seconds. +The CCPI2 counters are 64-bit and assumed not to overflow in normal operation. + +PMU UNCORE (perf) driver: + +The thunderx2_pmu driver registers per-socket perf PMUs for the DMC and +L3C devices. Each PMU can be used to count up to 4 (DMC/L3C) or up to 8 +(CCPI2) events simultaneously. The PMUs provide a description of their +available events and configuration options under sysfs, see +/sys/devices/uncore_<l3c_S/dmc_S/ccpi2_S/>; S is the socket id. + +The driver does not support sampling, therefore "perf record" will not +work. Per-task perf sessions are also not supported. 
+ +Examples:: + + # perf stat -a -e uncore_dmc_0/cnt_cycles/ sleep 1 + + # perf stat -a -e \ + uncore_dmc_0/cnt_cycles/,\ + uncore_dmc_0/data_transfers/,\ + uncore_dmc_0/read_txns/,\ + uncore_dmc_0/write_txns/ sleep 1 + + # perf stat -a -e \ + uncore_l3c_0/read_request/,\ + uncore_l3c_0/read_hit/,\ + uncore_l3c_0/inv_request/,\ + uncore_l3c_0/inv_hit/ sleep 1 diff --git a/Documentation/admin-guide/perf/xgene-pmu.rst b/Documentation/admin-guide/perf/xgene-pmu.rst new file mode 100644 index 0000000000..644f8ed891 --- /dev/null +++ b/Documentation/admin-guide/perf/xgene-pmu.rst @@ -0,0 +1,49 @@ +================================================ +APM X-Gene SoC Performance Monitoring Unit (PMU) +================================================ + +X-Gene SoC PMU consists of various independent system device PMUs such as +L3 cache(s), I/O bridge(s), memory controller bridge(s) and memory +controller(s). These PMU devices are loosely architected to follow the +same model as the PMU for ARM cores. The PMUs share the same top level +interrupt and status CSR region. + +PMU (perf) driver +----------------- + +The xgene-pmu driver registers several perf PMU drivers. Each of the perf +driver provides description of its available events and configuration options +in sysfs, see /sys/devices/<l3cX/iobX/mcbX/mcX>/. + +The "format" directory describes format of the config (event ID), +config1 (agent ID) fields of the perf_event_attr structure. The "events" +directory provides configuration templates for all supported event types that +can be used with perf tool. For example, "l3c0/bank-fifo-full/" is an +equivalent of "l3c0/config=0x0b/". + +Most of the SoC PMU has a specific list of agent ID used for monitoring +performance of a specific datapath. For example, agents of a L3 cache can be +a specific CPU or an I/O bridge. Each PMU has a set of 2 registers capable of +masking the agents from which the request come from. If the bit with +the bit number corresponding to the agent is set, the event is counted only if +it is caused by a request from that agent. Each agent ID bit is inversely mapped +to a corresponding bit in "config1" field. By default, the event will be +counted for all agent requests (config1 = 0x0). For all the supported agents of +each PMU, please refer to APM X-Gene User Manual. + +Each perf driver also provides a "cpumask" sysfs attribute, which contains a +single CPU ID of the processor which will be used to handle all the PMU events. + +Example for perf tool use:: + + / # perf list | grep -e l3c -e iob -e mcb -e mc + l3c0/ackq-full/ [Kernel PMU event] + <...> + mcb1/mcb-csw-stall/ [Kernel PMU event] + + / # perf stat -a -e l3c0/read-miss/,mcb1/csw-write-request/ sleep 1 + + / # perf stat -a -e l3c0/read-miss,config1=0xfffffffffffffffe/ sleep 1 + +The driver does not support sampling, therefore "perf record" will +not work. Per-task (without "-a") perf sessions are not supported. |