Diffstat (limited to 'third_party/rust/packed_simd/perf-guide/src/prof')
-rw-r--r-- third_party/rust/packed_simd/perf-guide/src/prof/linux.md 107
-rw-r--r-- third_party/rust/packed_simd/perf-guide/src/prof/mca.md 100
-rw-r--r-- third_party/rust/packed_simd/perf-guide/src/prof/profiling.md 14
3 files changed, 0 insertions, 221 deletions
diff --git a/third_party/rust/packed_simd/perf-guide/src/prof/linux.md b/third_party/rust/packed_simd/perf-guide/src/prof/linux.md
deleted file mode 100644
index 96c7d67bc4..0000000000
--- a/third_party/rust/packed_simd/perf-guide/src/prof/linux.md
+++ /dev/null
@@ -1,107 +0,0 @@
-# Performance profiling on Linux
-
-## Using `perf`
-
-[perf](https://perf.wiki.kernel.org/) is the most powerful performance profiler
-for Linux, featuring support for various hardware Performance Monitoring Units,
-as well as integration with the kernel's performance events framework.
-
-We will only look at how the `perf` command can be used to profile SIMD code.
-Full system profiling is outside of the scope of this book.
-
-### Recording
-
-The first step is to record a program's execution during an average workload.
-It helps if you can isolate the parts of your program which have performance
-issues, and set up a benchmark which can be easily (re)run.
-
-Build the benchmark binary in release mode, after having enabled debug info:
-
-```sh
-$ cargo build --release
-Finished release [optimized + debuginfo] target(s) in 0.02s
-```
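-
-If the release profile does not enable debug info yet, one way to turn it on
-for a single build is Cargo's profile environment override. This is a sketch
-which assumes a Cargo version that supports `CARGO_PROFILE_RELEASE_DEBUG`;
-setting `debug = true` under `[profile.release]` in `Cargo.toml` has the same
-effect:
-
-```sh
-# Assumption: overrides `profile.release.debug` for this invocation only.
-$ CARGO_PROFILE_RELEASE_DEBUG=true cargo build --release
-```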
-
-Then use the `perf record` subcommand:
-
-```sh
-$ perf record --call-graph=dwarf ./target/release/my-program
-[ perf record: Woken up 10 times to write data ]
-[ perf record: Captured and wrote 2,356 MB perf.data (292 samples) ]
-```
-
-Instead of using `--call-graph=dwarf`, which can become pretty slow, you can use
-`--call-graph=lbr` if you have a processor with support for Last Branch Record
-(i.e. Intel Haswell and newer).
-
-`perf` will, by default, record the count of CPU cycles it takes to execute
-various parts of your program. You can use the `-e` command line option
-to enable other performance events, such as `cache-misses`. Use `perf list`
-to get a list of all hardware counters supported by your CPU.
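-
-For example, recording cache misses instead of CPU cycles could look like the
-sketch below. `my-program` is the placeholder binary from the previous step,
-and whether `cache-misses` is available depends on your CPU and kernel:
-
-```sh
-$ perf record -e cache-misses --call-graph=dwarf ./target/release/my-program
-```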
-
-### Viewing the report
-
-The next step is getting a bird's eye view of the program's execution.
-`perf` provides a `ncurses`-based interface which will get you started.
-
-Use `perf report` to open a visualization of your program's performance:
-
-```sh
-perf report --hierarchy -M intel
-```
-
-`--hierarchy` will display a tree-like structure of where your program spent
-most of its time. `-M intel` enables disassembly output with Intel syntax, which
-is subjectively more readable than the default AT&T syntax.
-
-Here is the output from profiling the `nbody` benchmark:
-
-```
-- 100,00% nbody
- - 94,18% nbody
- + 93,48% [.] nbody_lib::simd::advance
- + 0,70% [.] nbody_lib::run
- + 5,06% libc-2.28.so
-```
-
-If you move with the arrow keys to any node in the tree, you can then press `a`
-to have `perf` _annotate_ that node. This means it will:
-
-- disassemble the function
-
-- associate every instruction with the percentage of time which was spent executing it
-
-- interleave the disassembly with the source code,
- assuming it found the debug symbols
- (you can use `s` to toggle this behaviour)
-
-`perf` will, by default, open the instruction which it identified as being the
-hottest spot in the function:
-
-```
-0,76 │ movapd xmm2,xmm0
-0,38 │ movhlps xmm2,xmm0
- │ addpd xmm2,xmm0
- │ unpcklpd xmm1,xmm2
-12,50 │ sqrtpd xmm0,xmm1
-1,52 │ mulpd xmm0,xmm1
-```
-
-In this case, `sqrtpd` will be highlighted in red, since that's the instruction
-on which the CPU spends the largest share of its time in this function.
-
-## Using Valgrind
-
-Valgrind is a set of tools which initially helped C/C++ programmers find unsafe
-memory accesses in their code. Nowadays the project also has
-
-- a heap profiler called `massif`
-
-- a cache utilization profiler called `cachegrind`
-
-- a call-graph performance profiler called `callgrind`
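-
-Each tool is selected with `--tool`. As a sketch, reusing the placeholder
-binary name from the `perf` section above:
-
-```sh
-$ valgrind --tool=massif ./target/release/my-program
-$ valgrind --tool=cachegrind ./target/release/my-program
-$ valgrind --tool=callgrind ./target/release/my-program
-```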
-
-<!--
-TODO: explain valgrind's dynamic binary translation, warn about massive
-slowdown, talk about `kcachegrind` for a GUI
--->
diff --git a/third_party/rust/packed_simd/perf-guide/src/prof/mca.md b/third_party/rust/packed_simd/perf-guide/src/prof/mca.md
deleted file mode 100644
index 65ddf1a4eb..0000000000
--- a/third_party/rust/packed_simd/perf-guide/src/prof/mca.md
+++ /dev/null
@@ -1,100 +0,0 @@
-# Machine code analysis tools
-
-## The microarchitecture of modern CPUs
-
-While you might have heard of Instruction Set Architectures, such as `x86` or
-`arm` or `mips`, the term _microarchitecture_ (also written here as _µ-arch_),
-refers to the internal details of an actual family of CPUs, such as Intel's
-_Haswell_ or AMD's _Jaguar_.
-
-Replacing scalar code with SIMD code will improve performance on all CPUs
-supporting the required vector extensions.
-However, due to microarchitectural differences, the actual speed-up at
-runtime might vary.
-
-**Example**: consider optimizing for AMD K8 CPUs.
-The assembly generated for an empty function should look like this:
-
-```asm
-nop
-ret
-```
-
-The `nop` is used to align the `ret` instruction for better performance.
-However, the compiler will actually generate the following code:
-
-```asm
-repz ret
-```
-
-`repz` is a prefix which repeats the instruction that follows it until a certain
-condition is met. Here it has no effect on `ret`: the function simply returns
-immediately, and the `ret` instruction is still aligned.
-However, AMD K8's branch predictor performs better with the latter code.
-
-To absolutely maximize performance for a certain target µ-arch,
-you will have to read some CPU manuals, or ask the compiler to do it for you
-with `-C target-cpu`.
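-
-A minimal sketch of the compiler-driven option, assuming the code is built
-through Cargo and that the machine doing the build is also the one running the
-benchmark (`native` selects the host CPU; any name printed by
-`rustc --print target-cpus` can be used instead):
-
-```sh
-# List the CPU names known to the compiler for the current target.
-$ rustc --print target-cpus
-# Build with code generation tuned for the host CPU.
-$ RUSTFLAGS="-C target-cpu=native" cargo build --release
-```
-
-Binaries built this way may use instructions which older CPUs do not support.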
-
-### Summary of CPU internals
-
-Modern processors are able to execute instructions out of order for better performance,
-and combine this with techniques such as [branch prediction], [instruction pipelining],
-and [superscalar execution].
-
-[branch prediction]: https://en.wikipedia.org/wiki/Branch_predictor
-[instruction pipelining]: https://en.wikipedia.org/wiki/Instruction_pipelining
-[superscalar execution]: https://en.wikipedia.org/wiki/Superscalar_processor
-
-SIMD instructions are also subject to these optimizations, meaning it can get pretty
-difficult to determine where the slowdown happens.
-For example, if the profiler reports a store operation is slow, one of two things
-could be happening:
-
-- the store is limited by the CPU's memory bandwidth, which is actually an ideal
- scenario, all things considered;
-
-- memory bandwidth is nowhere near its peak, but the value to be stored is at the
- end of a long chain of operations, and this store is where the profiler
- encountered the pipeline stall.
-
-Since most profilers are simple tools which don't understand the subtleties of
-instruction scheduling, you
-
-## Analyzing the machine code
-
-Certain tools have knowledge of internal CPU microarchitecture, i.e. they know
-
-- how many physical [register files] a CPU actually has
-
-- what the latency / throughput of an instruction is
-
-- what [µ-ops] are generated for a set of instructions
-
-and many other architectural details.
-
-[register files]: https://en.wikipedia.org/wiki/Register_file
-[µ-ops]: https://en.wikipedia.org/wiki/Micro-operation
-
-These tools are therefore able to provide accurate information as to why some
-instructions are inefficient, and where the bottleneck is.
-
-The disadvantage is that the output of these tools requires advanced knowledge
-of the target architecture to understand, i.e. they **cannot** point out what
-the cause of the issue is explicitly.
-
-## Intel's Architecture Code Analyzer (IACA)
-
-[IACA] is a free tool offered by Intel for analyzing the performance of various
-computational kernels.
-
-Being a proprietary, closed source tool, it _only_ supports Intel's µ-arches.
-
-[IACA]: https://software.intel.com/en-us/articles/intel-architecture-code-analyzer
-
-## llvm-mca
-
-<!--
-TODO: once LLVM 7 gets released, write a chapter on using llvm-mca
-with SIMD disassembly.
--->
diff --git a/third_party/rust/packed_simd/perf-guide/src/prof/profiling.md b/third_party/rust/packed_simd/perf-guide/src/prof/profiling.md
deleted file mode 100644
index 02ba78d2f2..0000000000
--- a/third_party/rust/packed_simd/perf-guide/src/prof/profiling.md
+++ /dev/null
@@ -1,14 +0,0 @@
-# Performance profiling
-
-While the rest of the book provides practical advice on how to improve the performance
-of SIMD code, this chapter is dedicated to [**performance profiling**][profiling].
-Profiling consists of recording a program's execution in order to identify program
-hotspots.
-
-**Important**: most profilers require debug information in order to accurately
-link the program hotspots back to the corresponding source code lines. Rust will
-disable debug info generation by default for optimized builds, but you can change
-that [in your `Cargo.toml`][cargo-ref].
-
-[profiling]: https://en.wikipedia.org/wiki/Profiling_(computer_programming)
-[cargo-ref]: https://doc.rust-lang.org/cargo/reference/manifest.html#the-profile-sections