Diffstat (limited to 'third_party/rust/packed_simd/perf-guide/src/prof')
3 files changed, 221 insertions, 0 deletions
diff --git a/third_party/rust/packed_simd/perf-guide/src/prof/linux.md b/third_party/rust/packed_simd/perf-guide/src/prof/linux.md
new file mode 100644
index 0000000000..96c7d67bc4
--- /dev/null
+++ b/third_party/rust/packed_simd/perf-guide/src/prof/linux.md
@@ -0,0 +1,107 @@
+# Performance profiling on Linux
+
+## Using `perf`
+
+[perf](https://perf.wiki.kernel.org/) is the most powerful performance profiler
+for Linux, featuring support for various hardware Performance Monitoring Units,
+as well as integration with the kernel's performance events framework.
+
+We will only look at how the `perf` command can be used to profile SIMD code.
+Full system profiling is outside of the scope of this book.
+
+### Recording
+
+The first step is to record a program's execution during an average workload.
+It helps if you can isolate the parts of your program which have performance
+issues, and set up a benchmark which can be easily (re)run.
+
+Build the benchmark binary in release mode, after having enabled debug info:
+
+```sh
+$ cargo build --release
+Finished release [optimized + debuginfo] target(s) in 0.02s
+```
+
+Then use the `perf record` subcommand:
+
+```sh
+$ perf record --call-graph=dwarf ./target/release/my-program
+[ perf record: Woken up 10 times to write data ]
+[ perf record: Captured and wrote 2,356 MB perf.data (292 samples) ]
+```
+
+Instead of using `--call-graph=dwarf`, which can become pretty slow, you can use
+`--call-graph=lbr` if you have a processor with support for Last Branch Record
+(i.e. Intel Haswell and newer).
+
+`perf` will, by default, record the count of CPU cycles it takes to execute
+various parts of your program. You can use the `-e` command line option
+to enable other performance events, such as `cache-misses`. Use `perf list`
+to get a list of all hardware counters supported by your CPU.
+
+### Viewing the report
+
+The next step is getting a bird's eye view of the program's execution.
+`perf` provides an `ncurses`-based interface which will get you started.
+
+Use `perf report` to open a visualization of your program's performance:
+
+```sh
+perf report --hierarchy -M intel
+```
+
+`--hierarchy` will display a tree-like structure of where your program spent
+most of its time. `-M intel` enables disassembly output with Intel syntax, which
+is subjectively more readable than the default AT&T syntax.
+
+Here is the output from profiling the `nbody` benchmark:
+
+```
+- 100,00% nbody
+  - 94,18% nbody
+    + 93,48% [.] nbody_lib::simd::advance
+    + 0,70% [.] nbody_lib::run
+  + 5,06% libc-2.28.so
+```
+
+If you move with the arrow keys to any node in the tree, you can then press `a`
+to have `perf` _annotate_ that node. This means it will:
+
+- disassemble the function
+
+- associate every instruction with the percentage of time which was spent executing it
+
+- interleave the disassembly with the source code,
+  assuming it found the debug symbols
+  (you can use `s` to toggle this behaviour)
+
+`perf` will, by default, open the instruction which it identified as being the
+hottest spot in the function:
+
+```
+  0,76 │ movapd xmm2,xmm0
+  0,38 │ movhlps xmm2,xmm0
+       │ addpd xmm2,xmm0
+       │ unpcklpd xmm1,xmm2
+ 12,50 │ sqrtpd xmm0,xmm1
+  1,52 │ mulpd xmm0,xmm1
+```
+
+In this case, `sqrtpd` will be highlighted in red, since that's the instruction
+which the CPU spends most of its time executing.
+
+## Using Valgrind
+
+Valgrind is a set of tools which initially helped C/C++ programmers find unsafe
+memory accesses in their code. Nowadays the project also has
+
+- a heap profiler called `massif`
+
+- a cache utilization profiler called `cachegrind`
+
+- a call-graph performance profiler called `callgrind`
+
+<!--
+TODO: explain valgrind's dynamic binary translation, warn about massive
+slowdown, talk about `kcachegrind` for a GUI
+-->
diff --git a/third_party/rust/packed_simd/perf-guide/src/prof/mca.md b/third_party/rust/packed_simd/perf-guide/src/prof/mca.md
new file mode 100644
index 0000000000..65ddf1a4eb
--- /dev/null
+++ b/third_party/rust/packed_simd/perf-guide/src/prof/mca.md
@@ -0,0 +1,100 @@
+# Machine code analysis tools
+
+## The microarchitecture of modern CPUs
+
+While you might have heard of Instruction Set Architectures, such as `x86` or
+`arm` or `mips`, the term _microarchitecture_ (also written here as _µ-arch_)
+refers to the internal details of an actual family of CPUs, such as Intel's
+_Haswell_ or AMD's _Jaguar_.
+
+Replacing scalar code with SIMD code will generally improve performance on all CPUs
+supporting the required vector extensions.
+However, due to microarchitectural differences, the actual speed-up at
+runtime might vary.
+
+**Example**: a simple example arises when optimizing for AMD K8 CPUs.
+The assembly generated for an empty function should look like this:
+
+```asm
+nop
+ret
+```
+
+The `nop` is used to align the `ret` instruction for better performance.
+However, the compiler will actually generate the following code:
+
+```asm
+repz ret
+```
+
+The `repz` prefix will repeat the following instruction until a certain
+condition is met. Of course, in this situation, the function will simply immediately
+return, and the `ret` instruction is still aligned.
+However, AMD K8's branch predictor performs better with the latter code.
+
+If you are looking to absolutely maximize performance for a certain target µ-arch,
+you will have to read some CPU manuals, or ask the compiler to do it for you
+with `-C target-cpu`.
+
+### Summary of CPU internals
+
+Modern processors are able to execute instructions out-of-order for better performance,
+by utilizing tricks such as [branch prediction], [instruction pipelining],
+or [superscalar execution].
+
+[branch prediction]: https://en.wikipedia.org/wiki/Branch_predictor
+[instruction pipelining]: https://en.wikipedia.org/wiki/Instruction_pipelining
+[superscalar execution]: https://en.wikipedia.org/wiki/Superscalar_processor
+
+SIMD instructions are also subject to these optimizations, meaning it can get pretty
+difficult to determine where the slowdown happens.
+For example, if the profiler reports a store operation is slow, one of two things
+could be happening:
+
+- the store is limited by the CPU's memory bandwidth, which is actually an ideal
+  scenario, all things considered;
+
+- memory bandwidth is nowhere near its peak, but the value to be stored is at the
+  end of a long chain of operations, and this store is where the profiler
+  encountered the pipeline stall;
+
+Since most profilers are simple tools which don't understand the subtleties of
+instruction scheduling, you
+
+## Analyzing the machine code
+
+Certain tools have knowledge of internal CPU microarchitecture, i.e. they know
+
+- how many physical [register files] a CPU actually has
+
+- what the latency / throughput of an instruction is
+
+- what [µ-ops] are generated for a set of instructions
+
+and many other architectural details.
+
+[register files]: https://en.wikipedia.org/wiki/Register_file
+[µ-ops]: https://en.wikipedia.org/wiki/Micro-operation
+
+These tools are therefore able to provide accurate information as to why some
+instructions are inefficient, and where the bottleneck is.
+
+The disadvantage is that the output of these tools requires advanced knowledge
+of the target architecture to understand, i.e. they **cannot** explicitly point out
+what the cause of the issue is.
+
+## Intel's Architecture Code Analyzer (IACA)
+
+[IACA] is a free tool offered by Intel for analyzing the performance of various
+computational kernels.
+
+Being a proprietary, closed-source tool, it _only_ supports Intel's µ-arches.
+
+[IACA]: https://software.intel.com/en-us/articles/intel-architecture-code-analyzer
+
+## llvm-mca
+
+<!--
+TODO: once LLVM 7 gets released, write a chapter on using llvm-mca
+with SIMD disassembly.
+-->
diff --git a/third_party/rust/packed_simd/perf-guide/src/prof/profiling.md b/third_party/rust/packed_simd/perf-guide/src/prof/profiling.md
new file mode 100644
index 0000000000..02ba78d2f2
--- /dev/null
+++ b/third_party/rust/packed_simd/perf-guide/src/prof/profiling.md
@@ -0,0 +1,14 @@
+# Performance profiling
+
+While the rest of the book provides practical advice on how to improve the performance
+of SIMD code, this chapter is dedicated to [**performance profiling**][profiling].
+Profiling consists of recording a program's execution in order to identify program
+hotspots.
+
+**Important**: most profilers require debug information in order to accurately
+link the program hotspots back to the corresponding source code lines. Rust will
+disable debug info generation by default for optimized builds, but you can change
+that [in your `Cargo.toml`][cargo-ref].
+
+[profiling]: https://en.wikipedia.org/wiki/Profiling_(computer_programming)
+[cargo-ref]: https://doc.rust-lang.org/cargo/reference/manifest.html#the-profile-sections
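Note on the `profiling.md` hunk: the patch says debug info for optimized builds can be enabled in `Cargo.toml` but does not show the setting. A minimal sketch of what that could look like follows. It is not part of this patch, and it assumes the crate being profiled uses the default `release` profile:

```toml
# Cargo.toml of the crate being profiled (assumed; not a file touched by this diff)
[profile.release]
# Emit debug info in optimized builds so profilers can map samples to source lines.
debug = true
```

With a setting like this, `cargo build --release` should report the profile as `[optimized + debuginfo]`, which matches the build output quoted in the `linux.md` hunk.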