diff options
author | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-04-19 00:47:55 +0000 |
---|---|---|
committer | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-04-19 00:47:55 +0000 |
commit | 26a029d407be480d791972afb5975cf62c9360a6 (patch) | |
tree | f435a8308119effd964b339f76abb83a57c29483 /third_party/rust/packed_simd/perf-guide/src/prof/linux.md | |
parent | Initial commit. (diff) | |
download | firefox-26a029d407be480d791972afb5975cf62c9360a6.tar.xz firefox-26a029d407be480d791972afb5975cf62c9360a6.zip |
Adding upstream version 124.0.1.upstream/124.0.1
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'third_party/rust/packed_simd/perf-guide/src/prof/linux.md')
-rw-r--r-- | third_party/rust/packed_simd/perf-guide/src/prof/linux.md | 107 |
1 files changed, 107 insertions, 0 deletions
diff --git a/third_party/rust/packed_simd/perf-guide/src/prof/linux.md b/third_party/rust/packed_simd/perf-guide/src/prof/linux.md new file mode 100644 index 0000000000..96c7d67bc4 --- /dev/null +++ b/third_party/rust/packed_simd/perf-guide/src/prof/linux.md @@ -0,0 +1,107 @@ +# Performance profiling on Linux + +## Using `perf` + +[perf](https://perf.wiki.kernel.org/) is the most powerful performance profiler +for Linux, featuring support for various hardware Performance Monitoring Units, +as well as integration with the kernel's performance events framework. + +We will only look at how can the `perf` command can be used to profile SIMD code. +Full system profiling is outside of the scope of this book. + +### Recording + +The first step is to record a program's execution during an average workload. +It helps if you can isolate the parts of your program which have performance +issues, and set up a benchmark which can be easily (re)run. + +Build the benchmark binary in release mode, after having enabled debug info: + +```sh +$ cargo build --release +Finished release [optimized + debuginfo] target(s) in 0.02s +``` + +Then use the `perf record` subcommand: + +```sh +$ perf record --call-graph=dwarf ./target/release/my-program +[ perf record: Woken up 10 times to write data ] +[ perf record: Captured and wrote 2,356 MB perf.data (292 samples) ] +``` + +Instead of using `--call-graph=dwarf`, which can become pretty slow, you can use +`--call-graph=lbr` if you have a processor with support for Last Branch Record +(i.e. Intel Haswell and newer). + +`perf` will, by default, record the count of CPU cycles it takes to execute +various parts of your program. You can use the `-e` command line option +to enable other performance events, such as `cache-misses`. Use `perf list` +to get a list of all hardware counters supported by your CPU. + +### Viewing the report + +The next step is getting a bird's eye view of the program's execution. +`perf` provides a `ncurses`-based interface which will get you started. + +Use `perf report` to open a visualization of your program's performance: + +```sh +perf report --hierarchy -M intel +``` + +`--hierarchy` will display a tree-like structure of where your program spent +most of its time. `-M intel` enables disassembly output with Intel syntax, which +is subjectively more readable than the default AT&T syntax. + +Here is the output from profiling the `nbody` benchmark: + +``` +- 100,00% nbody + - 94,18% nbody + + 93,48% [.] nbody_lib::simd::advance + + 0,70% [.] nbody_lib::run + + 5,06% libc-2.28.so +``` + +If you move with the arrow keys to any node in the tree, you can the press `a` +to have `perf` _annotate_ that node. This means it will: + +- disassemble the function + +- associate every instruction with the percentage of time which was spent executing it + +- interleaves the disassembly with the source code, + assuming it found the debug symbols + (you can use `s` to toggle this behaviour) + +`perf` will, by default, open the instruction which it identified as being the +hottest spot in the function: + +``` +0,76 │ movapd xmm2,xmm0 +0,38 │ movhlps xmm2,xmm0 + │ addpd xmm2,xmm0 + │ unpcklpd xmm1,xmm2 +12,50 │ sqrtpd xmm0,xmm1 +1,52 │ mulpd xmm0,xmm1 +``` + +In this case, `sqrtpd` will be highlighted in red, since that's the instruction +which the CPU spends most of its time executing. + +## Using Valgrind + +Valgrind is a set of tools which initially helped C/C++ programmers find unsafe +memory accesses in their code. Nowadays the project also has + +- a heap profiler called `massif` + +- a cache utilization profiler called `cachegrind` + +- a call-graph performance profiler called `callgrind` + +<!-- +TODO: explain valgrind's dynamic binary translation, warn about massive +slowdown, talk about `kcachegrind` for a GUI +--> |