From 8dd16259287f58f9273002717ec4d27e97127719 Mon Sep 17 00:00:00 2001
From: Daniel Baumann
Date: Wed, 12 Jun 2024 07:43:14 +0200
Subject: Merging upstream version 127.0.

Signed-off-by: Daniel Baumann
---
 .../rust/packed_simd/perf-guide/src/prof/mca.md | 100 ---------------------
 1 file changed, 100 deletions(-)
 delete mode 100644 third_party/rust/packed_simd/perf-guide/src/prof/mca.md

(limited to 'third_party/rust/packed_simd/perf-guide/src/prof/mca.md')

diff --git a/third_party/rust/packed_simd/perf-guide/src/prof/mca.md b/third_party/rust/packed_simd/perf-guide/src/prof/mca.md
deleted file mode 100644
index 65ddf1a4eb..0000000000
--- a/third_party/rust/packed_simd/perf-guide/src/prof/mca.md
+++ /dev/null
@@ -1,100 +0,0 @@
-# Machine code analysis tools
-
-## The microarchitecture of modern CPUs
-
-While you might have heard of Instruction Set Architectures, such as `x86` or
-`arm` or `mips`, the term _microarchitecture_ (also written here as _µ-arch_),
-refers to the internal details of an actual family of CPUs, such as Intel's
-_Haswell_ or AMD's _Jaguar_.
-
-Replacing scalar code with SIMD code will improve performance on all CPUs
-supporting the required vector extensions.
-However, due to microarchitectural differences, the actual speed-up at
-runtime might vary.
-
-**Example**: a simple example arises when optimizing for AMD K8 CPUs.
-The assembly generated for an empty function should look like this:
-
-```asm
-nop
-ret
-```
-
-The `nop` is used to align the `ret` instruction for better performance.
-However, the compiler will actually generated the following code:
-
-```asm
-repz ret
-```
-
-The `repz` instruction will repeat the following instruction until a certain
-condition. Of course, in this situation, the function will simply immediately
-return, and the `ret` instruction is still aligned.
-However, AMD K8's branch predictor performs better with the latter code.
-
-For those looking to absolutely maximize performance for a certain target µ-arch,
-you will have to read some CPU manuals, or ask the compiler to do it for you
-with `-C target-cpu`.
-
-### Summary of CPU internals
-
-Modern processors are able to execute instructions out-of-order for better performance,
-by utilizing tricks such as [branch prediction], [instruction pipelining],
-or [superscalar execution].
-
-[branch prediction]: https://en.wikipedia.org/wiki/Branch_predictor
-[instruction pipelining]: https://en.wikipedia.org/wiki/Instruction_pipelining
-[superscalar execution]: https://en.wikipedia.org/wiki/Superscalar_processor
-
-SIMD instructions are also subject to these optimizations, meaning it can get pretty
-difficult to determine where the slowdown happens.
-For example, if the profiler reports a store operation is slow, one of two things
-could be happening:
-
-- the store is limited by the CPU's memory bandwidth, which is actually an ideal
-  scenario, all things considered;
-
-- memory bandwidth is nowhere near its peak, but the value to be stored is at the
-  end of a long chain of operations, and this store is where the profiler
-  encountered the pipeline stall;
-
-Since most profilers are simple tools which don't understand the subtleties of
-instruction scheduling, you
-
-## Analyzing the machine code
-
-Certain tools have knowledge of internal CPU microarchitecture, i.e. they know
-
-- how many physical [register files] a CPU actually has
-
-- what is the latency / throughtput of an instruction
-
-- what [µ-ops] are generated for a set of instructions
-
-and many other architectural details.
-
-[register files]: https://en.wikipedia.org/wiki/Register_file
-[µ-ops]: https://en.wikipedia.org/wiki/Micro-operation
-
-These tools are therefore able to provide accurate information as to why some
-instructions are inefficient, and where the bottleneck is.
-
-The disadvantage is that the output of these tools requires advanced knowledge
-of the target architecture to understand, i.e. they **cannot** point out what
-the cause of the issue is explicitly.
-
-## Intel's Architecture Code Analyzer (IACA)
-
-[IACA] is a free tool offered by Intel for analyzing the performance of various
-computational kernels.
-
-Being a proprietary, closed source tool, it _only_ supports Intel's µ-arches.
-
-[IACA]: https://software.intel.com/en-us/articles/intel-architecture-code-analyzer
-
-## llvm-mca
-
-
-- 
cgit v1.2.3