# Performance profiling on Linux

## Using `perf`

[perf](https://perf.wiki.kernel.org/) is the most powerful performance profiler
for Linux, featuring support for various hardware Performance Monitoring Units,
as well as integration with the kernel's performance events framework.

We will only look at how the `perf` command can be used to profile SIMD code.
Full system profiling is outside the scope of this book.

### Recording

The first step is to record a program's execution during an average workload.
It helps if you can isolate the parts of your program which have performance
issues, and set up a benchmark which can be easily (re)run.

Build the benchmark binary in release mode, after having enabled debug info (for
example, by setting `debug = true` in the `[profile.release]` section of your
`Cargo.toml`):

```sh
$ cargo build --release
Finished release [optimized + debuginfo] target(s) in 0.02s
```

Then use the `perf record` subcommand:

```sh
$ perf record --call-graph=dwarf ./target/release/my-program
[ perf record: Woken up 10 times to write data ]
[ perf record: Captured and wrote 2,356 MB perf.data (292 samples) ]
```

Instead of using `--call-graph=dwarf`, which can become pretty slow, you can use
`--call-graph=lbr` if you have a processor with support for Last Branch Record
(i.e. Intel Haswell and newer).

`perf` will, by default, record the count of CPU cycles it takes to execute
various parts of your program. You can use the `-e` command line option
to enable other performance events, such as `cache-misses`. Use `perf list`
to see the hardware and software events available on your machine.

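For example, a recording that samples cache misses instead of cycles could look
like this (a sketch: event names and availability vary between CPUs, so check
`perf list` first):

```sh
# Sample cache misses instead of the default cycles event.
$ perf record -e cache-misses --call-graph=dwarf ./target/release/my-program
```
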
### Viewing the report

The next step is getting a bird's eye view of the program's execution.
`perf` provides an `ncurses`-based interface which will get you started.

Use `perf report` to open a visualization of your program's performance:

```sh
perf report --hierarchy -M intel
```

`--hierarchy` will display a tree-like structure of where your program spent
most of its time. `-M intel` enables disassembly output with Intel syntax, which
is subjectively more readable than the default AT&T syntax.

Here is the output from profiling the `nbody` benchmark:

```
- 100,00% nbody
  - 94,18% nbody
    + 93,48% [.] nbody_lib::simd::advance
    + 0,70% [.] nbody_lib::run
  + 5,06% libc-2.28.so
```

If you move with the arrow keys to any node in the tree, you can then press `a`
to have `perf` _annotate_ that node. This means it will:

- disassemble the function

- associate every instruction with the percentage of time spent executing it

- interleave the disassembly with the source code,
  assuming it found the debug symbols
  (you can use `s` to toggle this behaviour)

`perf` will, by default, open the instruction which it identified as being the
hottest spot in the function:

```
0,76  │ movapd xmm2,xmm0
0,38  │ movhlps xmm2,xmm0
      │ addpd  xmm2,xmm0
      │ unpcklpd xmm1,xmm2
12,50 │ sqrtpd xmm0,xmm1
1,52  │ mulpd  xmm0,xmm1
```

In this case, `sqrtpd` will be highlighted in red, since that's the instruction
which the CPU spends most of its time executing.

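The annotated view can also be opened directly from the command line with
`perf annotate`. A minimal sketch, assuming `perf` resolves the demangled symbol
name shown in the report above:

```sh
# Annotate a single symbol without going through the report UI.
$ perf annotate -M intel nbody_lib::simd::advance
```
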
## Using Valgrind

Valgrind is a set of tools which initially helped C/C++ programmers find unsafe
memory accesses in their code. Nowadays the project also includes:

- a heap profiler called `massif`

- a cache utilization profiler called `cachegrind`

- a call-graph performance profiler called `callgrind`

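As a rough sketch (reusing the benchmark binary from above), a `callgrind` run
and a plain-text summary of its results could look like this; note that programs
run many times slower under Valgrind:

```sh
# Collect a call-graph profile; writes a callgrind.out.<pid> file.
$ valgrind --tool=callgrind ./target/release/my-program

# Print a per-function summary of the collected profile.
$ callgrind_annotate callgrind.out.*
```
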
<!--
TODO: explain valgrind's dynamic binary translation, warn about massive
slowdown, talk about `kcachegrind` for a GUI
-->