Diffstat (limited to 'third_party/rust/packed_simd/perf-guide/src/prof')
3 files changed, 0 insertions, 221 deletions
diff --git a/third_party/rust/packed_simd/perf-guide/src/prof/linux.md b/third_party/rust/packed_simd/perf-guide/src/prof/linux.md
deleted file mode 100644
index 96c7d67bc4..0000000000
--- a/third_party/rust/packed_simd/perf-guide/src/prof/linux.md
+++ /dev/null
@@ -1,107 +0,0 @@
-# Performance profiling on Linux
-
-## Using `perf`
-
-[perf](https://perf.wiki.kernel.org/) is the most powerful performance profiler
-for Linux, featuring support for various hardware Performance Monitoring Units,
-as well as integration with the kernel's performance events framework.
-
-We will only look at how the `perf` command can be used to profile SIMD code.
-Full system profiling is outside of the scope of this book.
-
-### Recording
-
-The first step is to record a program's execution during an average workload.
-It helps if you can isolate the parts of your program which have performance
-issues, and set up a benchmark which can be easily (re)run.
-
-Build the benchmark binary in release mode, after having enabled debug info:
-
-```sh
-$ cargo build --release
-Finished release [optimized + debuginfo] target(s) in 0.02s
-```
-
-Then use the `perf record` subcommand:
-
-```sh
-$ perf record --call-graph=dwarf ./target/release/my-program
-[ perf record: Woken up 10 times to write data ]
-[ perf record: Captured and wrote 2,356 MB perf.data (292 samples) ]
-```
-
-Instead of using `--call-graph=dwarf`, which can become pretty slow, you can use
-`--call-graph=lbr` if you have a processor with support for Last Branch Record
-(i.e. Intel Haswell and newer).
-
-`perf` will, by default, record the count of CPU cycles it takes to execute
-various parts of your program. You can use the `-e` command line option
-to enable other performance events, such as `cache-misses`. Use `perf list`
-to get a list of all hardware counters supported by your CPU.
-
-### Viewing the report
-
-The next step is getting a bird's eye view of the program's execution.
-`perf` provides an `ncurses`-based interface which will get you started.
-
-Use `perf report` to open a visualization of your program's performance:
-
-```sh
-perf report --hierarchy -M intel
-```
-
-`--hierarchy` will display a tree-like structure of where your program spent
-most of its time. `-M intel` enables disassembly output with Intel syntax, which
-is subjectively more readable than the default AT&T syntax.
-
-Here is the output from profiling the `nbody` benchmark:
-
-```
-- 100,00% nbody
-  - 94,18% nbody
-    + 93,48% [.] nbody_lib::simd::advance
-    + 0,70% [.] nbody_lib::run
-  + 5,06% libc-2.28.so
-```
-
-If you move with the arrow keys to any node in the tree, you can then press `a`
-to have `perf` _annotate_ that node. This means it will:
-
-- disassemble the function
-
-- associate every instruction with the percentage of time which was spent executing it
-
-- interleave the disassembly with the source code,
-  assuming it found the debug symbols
-  (you can use `s` to toggle this behaviour)
-
-`perf` will, by default, open the instruction which it identified as being the
-hottest spot in the function:
-
-```
-0,76  │ movapd xmm2,xmm0
-0,38  │ movhlps xmm2,xmm0
-      │ addpd xmm2,xmm0
-      │ unpcklpd xmm1,xmm2
-12,50 │ sqrtpd xmm0,xmm1
-1,52  │ mulpd xmm0,xmm1
-```
-
-In this case, `sqrtpd` will be highlighted in red, since that's the instruction
-which the CPU spends most of its time executing.
-
-## Using Valgrind
-
-Valgrind is a set of tools which initially helped C/C++ programmers find unsafe
-memory accesses in their code.
-Nowadays the project also has
-
-- a heap profiler called `massif`
-
-- a cache utilization profiler called `cachegrind`
-
-- a call-graph performance profiler called `callgrind`
-
-<!--
-TODO: explain valgrind's dynamic binary translation, warn about massive
-slowdown, talk about `kcachegrind` for a GUI
--->
diff --git a/third_party/rust/packed_simd/perf-guide/src/prof/mca.md b/third_party/rust/packed_simd/perf-guide/src/prof/mca.md
deleted file mode 100644
index 65ddf1a4eb..0000000000
--- a/third_party/rust/packed_simd/perf-guide/src/prof/mca.md
+++ /dev/null
@@ -1,100 +0,0 @@
-# Machine code analysis tools
-
-## The microarchitecture of modern CPUs
-
-While you might have heard of Instruction Set Architectures, such as `x86` or
-`arm` or `mips`, the term _microarchitecture_ (also written here as _µ-arch_)
-refers to the internal details of an actual family of CPUs, such as Intel's
-_Haswell_ or AMD's _Jaguar_.
-
-Replacing scalar code with SIMD code will generally improve performance on all CPUs
-supporting the required vector extensions.
-However, due to microarchitectural differences, the actual speed-up at
-runtime might vary.
-
-**Example**: a simple example arises when optimizing for AMD K8 CPUs.
-The assembly generated for an empty function should look like this:
-
-```asm
-nop
-ret
-```
-
-The `nop` is used to align the `ret` instruction for better performance.
-However, the compiler will actually generate the following code:
-
-```asm
-repz ret
-```
-
-The `repz` prefix repeats the following instruction until a certain
-condition is met. Of course, in this situation, the function will simply immediately
-return, and the `ret` instruction is still aligned.
-However, AMD K8's branch predictor performs better with the latter code.
-
-For those looking to absolutely maximize performance for a certain target µ-arch,
-you will have to read some CPU manuals, or ask the compiler to do it for you
-with `-C target-cpu`.
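As a hedged illustration of the `-C target-cpu` remark above: a value such as `-C target-cpu=native` changes which target features the compiler may assume, and a program can check what it was built with. The sketch below is not part of the deleted guide. It assumes AVX2 as the feature of interest, and the function names are hypothetical.

```rust
/// Reports whether AVX2 was enabled at compile time, for example because the
/// crate was built with `RUSTFLAGS="-C target-cpu=native"` on an AVX2 machine.
/// (AVX2 is only an assumed example feature.)
fn avx2_enabled_at_compile_time() -> bool {
    cfg!(target_feature = "avx2")
}

/// Reports whether the CPU running this binary supports AVX2.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn avx2_available_at_runtime() -> bool {
    std::is_x86_feature_detected!("avx2")
}

/// On non-x86 targets AVX2 does not exist, so the answer is always `false`.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
fn avx2_available_at_runtime() -> bool {
    false
}

fn main() {
    println!("AVX2 at compile time: {}", avx2_enabled_at_compile_time());
    println!("AVX2 at runtime:      {}", avx2_available_at_runtime());
}
```

If the first line prints `false` while the second prints `true`, the binary was built for a more conservative target than the machine it runs on, which is one situation where `-C target-cpu` may help.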
-
-### Summary of CPU internals
-
-Modern processors are able to execute instructions out-of-order for better performance,
-by utilizing tricks such as [branch prediction], [instruction pipelining],
-or [superscalar execution].
-
-[branch prediction]: https://en.wikipedia.org/wiki/Branch_predictor
-[instruction pipelining]: https://en.wikipedia.org/wiki/Instruction_pipelining
-[superscalar execution]: https://en.wikipedia.org/wiki/Superscalar_processor
-
-SIMD instructions are also subject to these optimizations, meaning it can get pretty
-difficult to determine where the slowdown happens.
-For example, if the profiler reports a store operation is slow, one of two things
-could be happening:
-
-- the store is limited by the CPU's memory bandwidth, which is actually an ideal
-  scenario, all things considered;
-
-- memory bandwidth is nowhere near its peak, but the value to be stored is at the
-  end of a long chain of operations, and this store is where the profiler
-  encountered the pipeline stall;
-
-Since most profilers are simple tools which don't understand the subtleties of
-instruction scheduling, you
-
-## Analyzing the machine code
-
-Certain tools have knowledge of internal CPU microarchitecture, i.e. they know
-
-- how many physical [register files] a CPU actually has
-
-- what is the latency / throughput of an instruction
-
-- what [µ-ops] are generated for a set of instructions
-
-and many other architectural details.
-
-[register files]: https://en.wikipedia.org/wiki/Register_file
-[µ-ops]: https://en.wikipedia.org/wiki/Micro-operation
-
-These tools are therefore able to provide accurate information as to why some
-instructions are inefficient, and where the bottleneck is.
-
-The disadvantage is that the output of these tools requires advanced knowledge
-of the target architecture to understand, i.e. they **cannot** point out what
-the cause of the issue is explicitly.
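The "long chain of operations" case from the summary of CPU internals can be sketched in a few lines of Rust. This example is not from the deleted guide, and the function names are hypothetical. The first function makes every addition wait on the previous one, so it is bound by instruction latency. The second keeps four independent accumulators, which gives an out-of-order core (or the auto-vectorizer) room to overlap work. Whether this yields a measurable speed-up depends on the compiler and the target µ-arch, so treat it as an illustration, not a benchmark.

```rust
/// Sums a slice with a single accumulator: each addition depends on the
/// result of the previous one, forming one long dependency chain.
fn sum_single_chain(xs: &[f32]) -> f32 {
    let mut acc = 0.0f32;
    for &x in xs {
        acc += x;
    }
    acc
}

/// Sums a slice with four independent accumulators, so up to four additions
/// can be in flight at once. The chains are combined only at the end.
fn sum_four_chains(xs: &[f32]) -> f32 {
    let mut acc = [0.0f32; 4];
    let chunks = xs.chunks_exact(4);
    // Elements left over when the length is not a multiple of four.
    let tail = chunks.remainder();
    for c in chunks {
        for i in 0..4 {
            acc[i] += c[i];
        }
    }
    let mut total = (acc[0] + acc[1]) + (acc[2] + acc[3]);
    for &x in tail {
        total += x;
    }
    total
}

fn main() {
    // Small integer-valued floats, so both summation orders are exact.
    let xs: Vec<f32> = (0..1024).map(|i| i as f32).collect();
    println!("single chain: {}", sum_single_chain(&xs));
    println!("four chains:  {}", sum_four_chains(&xs));
}
```

Note that floating-point addition is not associative, so for general inputs the two functions may differ in the last bits. The input in `main` is chosen so that every partial sum is exactly representable.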
-
-## Intel's Architecture Code Analyzer (IACA)
-
-[IACA] is a free tool offered by Intel for analyzing the performance of various
-computational kernels.
-
-Being a proprietary, closed source tool, it _only_ supports Intel's µ-arches.
-
-[IACA]: https://software.intel.com/en-us/articles/intel-architecture-code-analyzer
-
-## llvm-mca
-
-<!--
-TODO: once LLVM 7 gets released, write a chapter on using llvm-mca
-with SIMD disassembly.
--->
diff --git a/third_party/rust/packed_simd/perf-guide/src/prof/profiling.md b/third_party/rust/packed_simd/perf-guide/src/prof/profiling.md
deleted file mode 100644
index 02ba78d2f2..0000000000
--- a/third_party/rust/packed_simd/perf-guide/src/prof/profiling.md
+++ /dev/null
@@ -1,14 +0,0 @@
-# Performance profiling
-
-While the rest of the book provides practical advice on how to improve the performance
-of SIMD code, this chapter is dedicated to [**performance profiling**][profiling].
-Profiling consists of recording a program's execution in order to identify program
-hotspots.
-
-**Important**: most profilers require debug information in order to accurately
-link the program hotspots back to the corresponding source code lines. Rust will
-disable debug info generation by default for optimized builds, but you can change
-that [in your `Cargo.toml`][cargo-ref].
-
-[profiling]: https://en.wikipedia.org/wiki/Profiling_(computer_programming)
-[cargo-ref]: https://doc.rust-lang.org/cargo/reference/manifest.html#the-profile-sections