summaryrefslogtreecommitdiffstats
path: root/third_party/rust/packed_simd/perf-guide/src/prof
diff options
context:
space:
mode:
Diffstat (limited to 'third_party/rust/packed_simd/perf-guide/src/prof')
-rw-r--r--third_party/rust/packed_simd/perf-guide/src/prof/linux.md107
-rw-r--r--third_party/rust/packed_simd/perf-guide/src/prof/mca.md100
-rw-r--r--third_party/rust/packed_simd/perf-guide/src/prof/profiling.md14
3 files changed, 221 insertions, 0 deletions
diff --git a/third_party/rust/packed_simd/perf-guide/src/prof/linux.md b/third_party/rust/packed_simd/perf-guide/src/prof/linux.md
new file mode 100644
index 0000000000..96c7d67bc4
--- /dev/null
+++ b/third_party/rust/packed_simd/perf-guide/src/prof/linux.md
@@ -0,0 +1,107 @@
+# Performance profiling on Linux
+
+## Using `perf`
+
+[perf](https://perf.wiki.kernel.org/) is the most powerful performance profiler
+for Linux, featuring support for various hardware Performance Monitoring Units,
+as well as integration with the kernel's performance events framework.
+
+We will only look at how can the `perf` command can be used to profile SIMD code.
+Full system profiling is outside of the scope of this book.
+
+### Recording
+
+The first step is to record a program's execution during an average workload.
+It helps if you can isolate the parts of your program which have performance
+issues, and set up a benchmark which can be easily (re)run.
+
+Build the benchmark binary in release mode, after having enabled debug info:
+
+```sh
+$ cargo build --release
+Finished release [optimized + debuginfo] target(s) in 0.02s
+```
+
+Then use the `perf record` subcommand:
+
+```sh
+$ perf record --call-graph=dwarf ./target/release/my-program
+[ perf record: Woken up 10 times to write data ]
+[ perf record: Captured and wrote 2,356 MB perf.data (292 samples) ]
+```
+
+Instead of using `--call-graph=dwarf`, which can become pretty slow, you can use
+`--call-graph=lbr` if you have a processor with support for Last Branch Record
+(i.e. Intel Haswell and newer).
+
+`perf` will, by default, record the count of CPU cycles it takes to execute
+various parts of your program. You can use the `-e` command line option
+to enable other performance events, such as `cache-misses`. Use `perf list`
+to get a list of all hardware counters supported by your CPU.
+
+### Viewing the report
+
+The next step is getting a bird's eye view of the program's execution.
+`perf` provides a `ncurses`-based interface which will get you started.
+
+Use `perf report` to open a visualization of your program's performance:
+
+```sh
+perf report --hierarchy -M intel
+```
+
+`--hierarchy` will display a tree-like structure of where your program spent
+most of its time. `-M intel` enables disassembly output with Intel syntax, which
+is subjectively more readable than the default AT&T syntax.
+
+Here is the output from profiling the `nbody` benchmark:
+
+```
+- 100,00% nbody
+ - 94,18% nbody
+ + 93,48% [.] nbody_lib::simd::advance
+ + 0,70% [.] nbody_lib::run
+ + 5,06% libc-2.28.so
+```
+
+If you move with the arrow keys to any node in the tree, you can the press `a`
+to have `perf` _annotate_ that node. This means it will:
+
+- disassemble the function
+
+- associate every instruction with the percentage of time which was spent executing it
+
+- interleaves the disassembly with the source code,
+ assuming it found the debug symbols
+ (you can use `s` to toggle this behaviour)
+
+`perf` will, by default, open the instruction which it identified as being the
+hottest spot in the function:
+
+```
+0,76 │ movapd xmm2,xmm0
+0,38 │ movhlps xmm2,xmm0
+ │ addpd xmm2,xmm0
+ │ unpcklpd xmm1,xmm2
+12,50 │ sqrtpd xmm0,xmm1
+1,52 │ mulpd xmm0,xmm1
+```
+
+In this case, `sqrtpd` will be highlighted in red, since that's the instruction
+which the CPU spends most of its time executing.
+
+## Using Valgrind
+
+Valgrind is a set of tools which initially helped C/C++ programmers find unsafe
+memory accesses in their code. Nowadays the project also has
+
+- a heap profiler called `massif`
+
+- a cache utilization profiler called `cachegrind`
+
+- a call-graph performance profiler called `callgrind`
+
+<!--
+TODO: explain valgrind's dynamic binary translation, warn about massive
+slowdown, talk about `kcachegrind` for a GUI
+-->
diff --git a/third_party/rust/packed_simd/perf-guide/src/prof/mca.md b/third_party/rust/packed_simd/perf-guide/src/prof/mca.md
new file mode 100644
index 0000000000..65ddf1a4eb
--- /dev/null
+++ b/third_party/rust/packed_simd/perf-guide/src/prof/mca.md
@@ -0,0 +1,100 @@
+# Machine code analysis tools
+
+## The microarchitecture of modern CPUs
+
+While you might have heard of Instruction Set Architectures, such as `x86` or
+`arm` or `mips`, the term _microarchitecture_ (also written here as _µ-arch_),
+refers to the internal details of an actual family of CPUs, such as Intel's
+_Haswell_ or AMD's _Jaguar_.
+
+Replacing scalar code with SIMD code will improve performance on all CPUs
+supporting the required vector extensions.
+However, due to microarchitectural differences, the actual speed-up at
+runtime might vary.
+
+**Example**: a simple example arises when optimizing for AMD K8 CPUs.
+The assembly generated for an empty function should look like this:
+
+```asm
+nop
+ret
+```
+
+The `nop` is used to align the `ret` instruction for better performance.
+However, the compiler will actually generated the following code:
+
+```asm
+repz ret
+```
+
+The `repz` instruction will repeat the following instruction until a certain
+condition. Of course, in this situation, the function will simply immediately
+return, and the `ret` instruction is still aligned.
+However, AMD K8's branch predictor performs better with the latter code.
+
+For those looking to absolutely maximize performance for a certain target µ-arch,
+you will have to read some CPU manuals, or ask the compiler to do it for you
+with `-C target-cpu`.
+
+### Summary of CPU internals
+
+Modern processors are able to execute instructions out-of-order for better performance,
+by utilizing tricks such as [branch prediction], [instruction pipelining],
+or [superscalar execution].
+
+[branch prediction]: https://en.wikipedia.org/wiki/Branch_predictor
+[instruction pipelining]: https://en.wikipedia.org/wiki/Instruction_pipelining
+[superscalar execution]: https://en.wikipedia.org/wiki/Superscalar_processor
+
+SIMD instructions are also subject to these optimizations, meaning it can get pretty
+difficult to determine where the slowdown happens.
+For example, if the profiler reports a store operation is slow, one of two things
+could be happening:
+
+- the store is limited by the CPU's memory bandwidth, which is actually an ideal
+ scenario, all things considered;
+
+- memory bandwidth is nowhere near its peak, but the value to be stored is at the
+ end of a long chain of operations, and this store is where the profiler
+ encountered the pipeline stall;
+
+Since most profilers are simple tools which don't understand the subtleties of
+instruction scheduling, you
+
+## Analyzing the machine code
+
+Certain tools have knowledge of internal CPU microarchitecture, i.e. they know
+
+- how many physical [register files] a CPU actually has
+
+- what is the latency / throughtput of an instruction
+
+- what [µ-ops] are generated for a set of instructions
+
+and many other architectural details.
+
+[register files]: https://en.wikipedia.org/wiki/Register_file
+[µ-ops]: https://en.wikipedia.org/wiki/Micro-operation
+
+These tools are therefore able to provide accurate information as to why some
+instructions are inefficient, and where the bottleneck is.
+
+The disadvantage is that the output of these tools requires advanced knowledge
+of the target architecture to understand, i.e. they **cannot** point out what
+the cause of the issue is explicitly.
+
+## Intel's Architecture Code Analyzer (IACA)
+
+[IACA] is a free tool offered by Intel for analyzing the performance of various
+computational kernels.
+
+Being a proprietary, closed source tool, it _only_ supports Intel's µ-arches.
+
+[IACA]: https://software.intel.com/en-us/articles/intel-architecture-code-analyzer
+
+## llvm-mca
+
+<!--
+TODO: once LLVM 7 gets released, write a chapter on using llvm-mca
+with SIMD disassembly.
+-->
diff --git a/third_party/rust/packed_simd/perf-guide/src/prof/profiling.md b/third_party/rust/packed_simd/perf-guide/src/prof/profiling.md
new file mode 100644
index 0000000000..02ba78d2f2
--- /dev/null
+++ b/third_party/rust/packed_simd/perf-guide/src/prof/profiling.md
@@ -0,0 +1,14 @@
+# Performance profiling
+
+While the rest of the book provides practical advice on how to improve the performance
+of SIMD code, this chapter is dedicated to [**performance profiling**][profiling].
+Profiling consists of recording a program's execution in order to identify program
+hotspots.
+
+**Important**: most profilers require debug information in order to accurately
+link the program hotspots back to the corresponding source code lines. Rust will
+disable debug info generation by default for optimized builds, but you can change
+that [in your `Cargo.toml`][cargo-ref].
+
+[profiling]: https://en.wikipedia.org/wiki/Profiling_(computer_programming)
+[cargo-ref]: https://doc.rust-lang.org/cargo/reference/manifest.html#the-profile-sections