Diffstat (limited to 'tools/perf/Documentation/perf-c2c.txt')
-rw-r--r-- | tools/perf/Documentation/perf-c2c.txt | 346 |
1 file changed, 346 insertions, 0 deletions
diff --git a/tools/perf/Documentation/perf-c2c.txt b/tools/perf/Documentation/perf-c2c.txt
new file mode 100644
index 0000000000..856f0dfb8e
--- /dev/null
+++ b/tools/perf/Documentation/perf-c2c.txt
@@ -0,0 +1,346 @@

perf-c2c(1)
===========

NAME
----
perf-c2c - Shared Data C2C/HITM Analyzer.

SYNOPSIS
--------
[verse]
'perf c2c record' [<options>] <command>
'perf c2c record' [<options>] \-- [<record command options>] <command>
'perf c2c report' [<options>]

DESCRIPTION
-----------
C2C stands for Cache To Cache.

The perf c2c tool provides means for Shared Data C2C/HITM analysis. It allows
you to track down cacheline contentions.

On Intel, the tool is based on the load latency and precise store facility
events provided by Intel CPUs. On PowerPC, the tool uses random instruction
sampling with the thresholding feature. On AMD, the tool uses the IBS op PMU
(due to hardware limitations, perf c2c is not supported on Zen3 CPUs). On
Arm64 it uses SPE to sample load and store operations, therefore hardware and
kernel support is required. See linkperf:perf-arm-spe[1] for a setup guide.
Due to the statistical nature of Arm SPE sampling, not every memory operation
will be sampled.

These events provide:
  - memory address of the access
  - type of the access (load and store details)
  - latency (in cycles) of the load access

The c2c tool provides means to record this data and report back access details
for the cachelines with the highest contention - the highest number of HITM
accesses.

The basic workflow with this tool follows the standard record/report phases.
The user uses the record command to record the events data and the report
command to display it.


RECORD OPTIONS
--------------
-e::
--event=::
	Select the PMU event. Use 'perf c2c record -e list'
	to list available events.

-v::
--verbose::
	Be more verbose (show counter open errors, etc).

-l::
--ldlat::
	Configure mem-loads latency. Supported on Intel and Arm64 processors
	only. Ignored on other archs.

-k::
--all-kernel::
	Configure all used events to run in kernel space.

-u::
--all-user::
	Configure all used events to run in user space.

REPORT OPTIONS
--------------
-k::
--vmlinux=<file>::
	vmlinux pathname

-v::
--verbose::
	Be more verbose (show counter open errors, etc).

-i::
--input::
	Specify the input file to process.

-N::
--node-info::
	Show extra node info in the report (see NODE INFO section).

-c::
--coalesce::
	Specify sorting fields for single cacheline display.
	The following fields are available: tid,pid,iaddr,dso
	(see COALESCE).

-g::
--call-graph::
	Setup callchain parameters.
	Please refer to the perf-report man page for details.

--stdio::
	Force the stdio output (see STDIO OUTPUT).

--stats::
	Display only statistic tables and force stdio mode.

--full-symbols::
	Display full length of symbols.

--no-source::
	Do not display the Source:Line column.

--show-all::
	Show all captured HITM lines, with no regard to the HITM % 0.0005 limit.

-f::
--force::
	Don't do ownership validation.

-d::
--display::
	Switch to HITM type (rmt, lcl) or peer snooping type (peer) to display
	and sort on. Total HITMs (tot) is the default, except on Arm64, which
	uses peer mode as the default.

--stitch-lbr::
	Show the callgraph with stitched LBRs, which may give a more complete
	callgraph. The perf.data file must have been obtained using
	perf c2c record --call-graph lbr.
	Disabled by default. In common cases with call stack overflows,
	it can recreate better call stacks than the default lbr call stack
	output. But this approach is not foolproof. There can be cases
	where it creates incorrect call stacks from incorrect matches.
	The known limitations include exception handling, such as
	setjmp/longjmp, which will have calls/returns that do not match.

--double-cl::
	Group the detection of shared cacheline events into double cacheline
	granularity. Some architectures have an Adjacent Cacheline Prefetch
	feature, which causes cacheline sharing to behave as if the cacheline
	size were doubled.

C2C RECORD
----------
The perf c2c record command sets up options related to HITM cacheline analysis
and calls the standard perf record command.

The following perf record options are configured by default:
(check the perf record man page for details)

  -W,-d,--phys-data,--sample-cpu

Unless specified otherwise with the '-e' option, the following events are
monitored by default on Intel:

  cpu/mem-loads,ldlat=30/P
  cpu/mem-stores/P

the following on AMD:

  ibs_op//

and the following on PowerPC:

  cpu/mem-loads/
  cpu/mem-stores/

The user can pass any 'perf record' option behind the '--' mark, like (to
enable callchains and system wide monitoring):

  $ perf c2c record -- -g -a

Please check the RECORD OPTIONS section for specific c2c record options.

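For instance, using only the options documented above, the load-latency
threshold can be raised (on processors where --ldlat applies) while keeping
callchains and system wide monitoring enabled. The workload command here,
'sleep 10', is just a placeholder that bounds the sampling window; any command
can be profiled instead:

  # sample loads with latency >= 50 cycles, callchains, system wide;
  # 'sleep 10' is a placeholder workload, not a required command
  $ perf c2c record --ldlat 50 -- -g -a sleep 10
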
C2C REPORT
----------
The perf c2c report command displays the shared data analysis. It comes in two
display modes: stdio and tui (default).

The report command workflow is as follows:
  - sort all the data based on the cacheline address
  - store access details for each cacheline
  - sort all cachelines based on user settings
  - display data

In general the perf c2c report output consists of 2 basic views:
  1) most expensive cachelines list
  2) offsets details for each cacheline

For each cacheline in the 1) list we display the following data:
(Both stdio and TUI modes follow the same fields output)

  Index
  - zero based index to identify the cacheline

  Cacheline
  - cacheline address (hex number)

  Rmt/Lcl Hitm (Display with HITM types)
  - cacheline percentage of all Remote/Local HITM accesses

  Peer Snoop (Display with peer type)
  - cacheline percentage of all peer accesses

  LLC Load Hitm - Total, LclHitm, RmtHitm (For display with HITM types)
  - count of Total/Local/Remote load HITMs

  Load Peer - Total, Local, Remote (For display with peer type)
  - count of Total/Local/Remote loads from peer cache or DRAM

  Total records
  - sum of all cacheline accesses

  Total loads
  - sum of all load accesses

  Total stores
  - sum of all store accesses

  Store Reference - L1Hit, L1Miss, N/A
    L1Hit  - store accesses that hit L1
    L1Miss - store accesses that missed L1
    N/A    - store accesses for which the memory level is not available

  Core Load Hit - FB, L1, L2
  - count of load hits in FB (Fill Buffer), L1 and L2 cache

  LLC Load Hit - LlcHit, LclHitm
  - count of LLC load accesses, includes LLC hits and LLC HITMs

  RMT Load Hit - RmtHit, RmtHitm
  - count of remote load accesses, includes remote hits and remote HITMs;
    on Arm Neoverse cores, RmtHit is used to account for remote accesses,
    which include remote DRAM or any upward cache level in the remote node

  Load Dram - Lcl, Rmt
  - count of local and remote DRAM accesses

For each offset in the 2) list we display the following data:

  HITM - Rmt, Lcl (Display with HITM types)
  - % of Remote/Local HITM accesses for the given offset within the cacheline

  Peer Snoop - Rmt, Lcl (Display with peer type)
  - % of Remote/Local peer accesses for the given offset within the cacheline

  Store Refs - L1 Hit, L1 Miss, N/A
  - % of store accesses that hit L1, missed L1, or have no available (N/A)
    memory level, for the given offset within the cacheline

  Data address - Offset
  - offset address

  Pid
  - pid of the process responsible for the accesses

  Tid
  - tid of the process responsible for the accesses

  Code address
  - code address responsible for the accesses

  cycles - rmt hitm, lcl hitm, load (Display with HITM types)
  - sum of cycles for the given accesses - Remote/Local HITM and generic load

  cycles - rmt peer, lcl peer, load (Display with peer type)
  - sum of cycles for the given accesses - Remote/Local peer load and generic load

  cpu cnt
  - number of cpus that participated in the access

  Symbol
  - code symbol related to the 'Code address' value

  Shared Object
  - shared object name related to the 'Code address' value

  Source:Line
  - source information related to the 'Code address' value

  Node
  - nodes participating in the access (see NODE INFO section)

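As an illustration of combining the report options documented above (assuming
a perf.data file produced by a previous 'perf c2c record' run), a typical
non-interactive invocation could be:

  # force stdio output, show node info, coalesce offsets by tid and code address
  $ perf c2c report --stdio -N -c tid,iaddr

  # print only the statistic tables
  $ perf c2c report --stats
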
NODE INFO
---------
The 'Node' field displays the nodes that access the given cacheline
offset. Its output comes in 3 flavors:
  - node IDs separated by ','
  - node IDs with stats for each ID, in the following format:
      Node{cpus %hitms %stores} (Display with HITM types)
      Node{cpus %peers %stores} (Display with peer type)
  - node IDs with the list of affected CPUs in the following format:
      Node{cpu list}

The user can switch between the above flavors with the -N option or
use the 'n' key to switch interactively in TUI mode.

COALESCE
--------
The user can specify how to sort offsets for a cacheline.

The following fields are available and govern the final
output fields set for the cacheline offsets output:

  tid   - coalesced by process TIDs
  pid   - coalesced by process PIDs
  iaddr - coalesced by code address; the following fields are displayed:
            Code address, Code symbol, Shared Object, Source line
  dso   - coalesced by shared object

By default the coalescing is set up with 'pid,iaddr'.

STDIO OUTPUT
------------
The stdio output displays data on standard output.

The following tables are displayed:
  Trace Event Information
  - overall statistics of memory accesses

  Global Shared Cache Line Event Information
  - overall statistics on shared cachelines

  Shared Data Cache Line Table
  - list of most expensive cachelines

  Shared Cache Line Distribution Pareto
  - list of all accessed offsets for each cacheline

TUI OUTPUT
----------
The TUI output provides an interactive interface to navigate
through the cachelines list and to display offset details.

For details please refer to the help window by pressing the '?' key.

CREDITS
-------
Although Don Zickus, Dick Fowles and Joe Mario worked together
to get this implemented, we got lots of early help from Arnaldo
Carvalho de Melo, Stephane Eranian, Jiri Olsa and Andi Kleen.

C2C BLOG
--------
Check Joe's blog on the c2c tool for a detailed use case explanation:
  https://joemario.github.io/blog/2016/09/01/c2c-blog/

SEE ALSO
--------
linkperf:perf-record[1], linkperf:perf-mem[1], linkperf:perf-arm-spe[1]