diff options
Diffstat (limited to 'third_party/rust/packed_simd/perf-guide')
20 files changed, 543 insertions, 0 deletions
diff --git a/third_party/rust/packed_simd/perf-guide/.gitignore b/third_party/rust/packed_simd/perf-guide/.gitignore new file mode 100644 index 0000000000..5a0bf0317d --- /dev/null +++ b/third_party/rust/packed_simd/perf-guide/.gitignore @@ -0,0 +1 @@ +/book diff --git a/third_party/rust/packed_simd/perf-guide/book.toml b/third_party/rust/packed_simd/perf-guide/book.toml new file mode 100644 index 0000000000..69ba3053ca --- /dev/null +++ b/third_party/rust/packed_simd/perf-guide/book.toml @@ -0,0 +1,12 @@ +[book] +authors = ["Gonzalo Brito Gadeschi", "Gabriel Majeri"] +multilingual = false +src = "src" +title = "Rust SIMD Performance Guide" +description = "This book describes how to write performant SIMD code in Rust." + +[build] +create-missing = false + +[output.html] +additional-css = ["./src/ascii.css"] diff --git a/third_party/rust/packed_simd/perf-guide/src/SUMMARY.md b/third_party/rust/packed_simd/perf-guide/src/SUMMARY.md new file mode 100644 index 0000000000..1e76898865 --- /dev/null +++ b/third_party/rust/packed_simd/perf-guide/src/SUMMARY.md @@ -0,0 +1,21 @@ +# Summary + +[Introduction](./introduction.md) + +- [Floating-point Math](./float-math/fp.md) + - [Short-vector Math Library](./float-math/svml.md) + - [Approximate functions](./float-math/approx.md) + - [Fused multiply-accumulate](./float-math/fma.md) + +- [Target features](./target-feature/features.md) + - [Using `RUSTFLAGS`](./target-feature/rustflags.md) + - [Using the `target_feature` attribute](./target-feature/attribute.md) + - [Interaction with inlining](./target-feature/inlining.md) + - [Detecting features at runtime](./target-feature/runtime.md) + +- [Bounds checking](./bound_checks.md) +- [Vertical and horizontal operations](./vert-hor-ops.md) + +- [Performance profiling](./prof/profiling.md) + - [Profiling on Linux](./prof/linux.md) + - [Using machine code analyzers](./prof/mca.md) diff --git a/third_party/rust/packed_simd/perf-guide/src/ascii.css b/third_party/rust/packed_simd/perf-guide/src/ascii.css new file mode 100644 index 0000000000..4c02651195 --- /dev/null +++ b/third_party/rust/packed_simd/perf-guide/src/ascii.css @@ -0,0 +1,4 @@ +code { + /* "Source Code Pro" breaks ASCII art */ + font-family: Consolas, "Ubuntu Mono", Menlo, "DejaVu Sans Mono", monospace; +} diff --git a/third_party/rust/packed_simd/perf-guide/src/bound_checks.md b/third_party/rust/packed_simd/perf-guide/src/bound_checks.md new file mode 100644 index 0000000000..2eeedb5ac8 --- /dev/null +++ b/third_party/rust/packed_simd/perf-guide/src/bound_checks.md @@ -0,0 +1,22 @@ +# Bounds checking + +Reading and writing packed vectors to/from slices is checked by default. +Independently of the configuration options used, the safe functions: + +* `Simd<[T; N]>::from_slice_aligned(& s[..])` +* `Simd<[T; N]>::write_to_slice_aligned(&mut s[..])` + +always check that: + +* the slice is big enough to hold the vector +* the slice is suitably aligned to perform an aligned load/store for a `Simd<[T; + N]>` (this alignment is often much larger than that of `T`). + +There are `_unaligned` versions that use unaligned load and stores, as well as +`unsafe` `_unchecked` that do not perform any checks iff `debug-assertions = +false` / `debug = false`. That is, the `_unchecked` methods do still assert size +and alignment in debug builds and could also do so in release builds depending +on the configuration options. + +These assertions do often significantly impact performance and you should be +aware of them. diff --git a/third_party/rust/packed_simd/perf-guide/src/float-math/approx.md b/third_party/rust/packed_simd/perf-guide/src/float-math/approx.md new file mode 100644 index 0000000000..2237c67ec4 --- /dev/null +++ b/third_party/rust/packed_simd/perf-guide/src/float-math/approx.md @@ -0,0 +1,8 @@ +# Approximate functions + +<!-- TODO: + +Explain that they exists, that they are often _much_ faster, how to use them, +that people should check whether the error is good enough for their +applications. Explain that this error is currently unstable and might change. +--> diff --git a/third_party/rust/packed_simd/perf-guide/src/float-math/fma.md b/third_party/rust/packed_simd/perf-guide/src/float-math/fma.md new file mode 100644 index 0000000000..357748383d --- /dev/null +++ b/third_party/rust/packed_simd/perf-guide/src/float-math/fma.md @@ -0,0 +1,6 @@ +# Fused Multiply Add + +<!-- TODO: +Explain that this is a compound operation, infinite precision, difference +between `mul_add` and `mul_adde`, that LLVM cannot do this by itself, etc. +--> diff --git a/third_party/rust/packed_simd/perf-guide/src/float-math/fp.md b/third_party/rust/packed_simd/perf-guide/src/float-math/fp.md new file mode 100644 index 0000000000..711fcc4fd5 --- /dev/null +++ b/third_party/rust/packed_simd/perf-guide/src/float-math/fp.md @@ -0,0 +1,3 @@ +# Floating-point math + +This chapter contains information pertaining to working with floating-point numbers. diff --git a/third_party/rust/packed_simd/perf-guide/src/float-math/svml.md b/third_party/rust/packed_simd/perf-guide/src/float-math/svml.md new file mode 100644 index 0000000000..266c2531cc --- /dev/null +++ b/third_party/rust/packed_simd/perf-guide/src/float-math/svml.md @@ -0,0 +1,7 @@ +# Short Vector Math Library + +<!-- TODO: +Explain how is short-vector math performed by default (just scalarized libm calls). + +Explain how to enable `sleef`, etc. +--> diff --git a/third_party/rust/packed_simd/perf-guide/src/introduction.md b/third_party/rust/packed_simd/perf-guide/src/introduction.md new file mode 100644 index 0000000000..7243e19c8a --- /dev/null +++ b/third_party/rust/packed_simd/perf-guide/src/introduction.md @@ -0,0 +1,26 @@ +# Introduction + +## What is SIMD + +<!-- TODO: +describe what SIMD is, which algorithms can benefit from it, +give usage examples +--> + +## History of SIMD in Rust + +<!-- TODO: +discuss history of unstable std::simd, +stabilization of std::arch, etc. +--> + +## Discover packed_simd + +<!-- TODO: describe scope of this project --> + +Writing fast and portable SIMD algorithms using `packed_simd` is, unfortunately, +not trivial. There are many pitfals that one should be aware of, and some idioms +that help avoid those pitfalls. + +This book attempts to document these best practices and provides practical examples +on how to apply the tips to _your_ code. diff --git a/third_party/rust/packed_simd/perf-guide/src/prof/linux.md b/third_party/rust/packed_simd/perf-guide/src/prof/linux.md new file mode 100644 index 0000000000..96c7d67bc4 --- /dev/null +++ b/third_party/rust/packed_simd/perf-guide/src/prof/linux.md @@ -0,0 +1,107 @@ +# Performance profiling on Linux + +## Using `perf` + +[perf](https://perf.wiki.kernel.org/) is the most powerful performance profiler +for Linux, featuring support for various hardware Performance Monitoring Units, +as well as integration with the kernel's performance events framework. + +We will only look at how can the `perf` command can be used to profile SIMD code. +Full system profiling is outside of the scope of this book. + +### Recording + +The first step is to record a program's execution during an average workload. +It helps if you can isolate the parts of your program which have performance +issues, and set up a benchmark which can be easily (re)run. + +Build the benchmark binary in release mode, after having enabled debug info: + +```sh +$ cargo build --release +Finished release [optimized + debuginfo] target(s) in 0.02s +``` + +Then use the `perf record` subcommand: + +```sh +$ perf record --call-graph=dwarf ./target/release/my-program +[ perf record: Woken up 10 times to write data ] +[ perf record: Captured and wrote 2,356 MB perf.data (292 samples) ] +``` + +Instead of using `--call-graph=dwarf`, which can become pretty slow, you can use +`--call-graph=lbr` if you have a processor with support for Last Branch Record +(i.e. Intel Haswell and newer). + +`perf` will, by default, record the count of CPU cycles it takes to execute +various parts of your program. You can use the `-e` command line option +to enable other performance events, such as `cache-misses`. Use `perf list` +to get a list of all hardware counters supported by your CPU. + +### Viewing the report + +The next step is getting a bird's eye view of the program's execution. +`perf` provides a `ncurses`-based interface which will get you started. + +Use `perf report` to open a visualization of your program's performance: + +```sh +perf report --hierarchy -M intel +``` + +`--hierarchy` will display a tree-like structure of where your program spent +most of its time. `-M intel` enables disassembly output with Intel syntax, which +is subjectively more readable than the default AT&T syntax. + +Here is the output from profiling the `nbody` benchmark: + +``` +- 100,00% nbody + - 94,18% nbody + + 93,48% [.] nbody_lib::simd::advance + + 0,70% [.] nbody_lib::run + + 5,06% libc-2.28.so +``` + +If you move with the arrow keys to any node in the tree, you can the press `a` +to have `perf` _annotate_ that node. This means it will: + +- disassemble the function + +- associate every instruction with the percentage of time which was spent executing it + +- interleaves the disassembly with the source code, + assuming it found the debug symbols + (you can use `s` to toggle this behaviour) + +`perf` will, by default, open the instruction which it identified as being the +hottest spot in the function: + +``` +0,76 │ movapd xmm2,xmm0 +0,38 │ movhlps xmm2,xmm0 + │ addpd xmm2,xmm0 + │ unpcklpd xmm1,xmm2 +12,50 │ sqrtpd xmm0,xmm1 +1,52 │ mulpd xmm0,xmm1 +``` + +In this case, `sqrtpd` will be highlighted in red, since that's the instruction +which the CPU spends most of its time executing. + +## Using Valgrind + +Valgrind is a set of tools which initially helped C/C++ programmers find unsafe +memory accesses in their code. Nowadays the project also has + +- a heap profiler called `massif` + +- a cache utilization profiler called `cachegrind` + +- a call-graph performance profiler called `callgrind` + +<!-- +TODO: explain valgrind's dynamic binary translation, warn about massive +slowdown, talk about `kcachegrind` for a GUI +--> diff --git a/third_party/rust/packed_simd/perf-guide/src/prof/mca.md b/third_party/rust/packed_simd/perf-guide/src/prof/mca.md new file mode 100644 index 0000000000..65ddf1a4eb --- /dev/null +++ b/third_party/rust/packed_simd/perf-guide/src/prof/mca.md @@ -0,0 +1,100 @@ +# Machine code analysis tools + +## The microarchitecture of modern CPUs + +While you might have heard of Instruction Set Architectures, such as `x86` or +`arm` or `mips`, the term _microarchitecture_ (also written here as _µ-arch_), +refers to the internal details of an actual family of CPUs, such as Intel's +_Haswell_ or AMD's _Jaguar_. + +Replacing scalar code with SIMD code will improve performance on all CPUs +supporting the required vector extensions. +However, due to microarchitectural differences, the actual speed-up at +runtime might vary. + +**Example**: a simple example arises when optimizing for AMD K8 CPUs. +The assembly generated for an empty function should look like this: + +```asm +nop +ret +``` + +The `nop` is used to align the `ret` instruction for better performance. +However, the compiler will actually generated the following code: + +```asm +repz ret +``` + +The `repz` instruction will repeat the following instruction until a certain +condition. Of course, in this situation, the function will simply immediately +return, and the `ret` instruction is still aligned. +However, AMD K8's branch predictor performs better with the latter code. + +For those looking to absolutely maximize performance for a certain target µ-arch, +you will have to read some CPU manuals, or ask the compiler to do it for you +with `-C target-cpu`. + +### Summary of CPU internals + +Modern processors are able to execute instructions out-of-order for better performance, +by utilizing tricks such as [branch prediction], [instruction pipelining], +or [superscalar execution]. + +[branch prediction]: https://en.wikipedia.org/wiki/Branch_predictor +[instruction pipelining]: https://en.wikipedia.org/wiki/Instruction_pipelining +[superscalar execution]: https://en.wikipedia.org/wiki/Superscalar_processor + +SIMD instructions are also subject to these optimizations, meaning it can get pretty +difficult to determine where the slowdown happens. +For example, if the profiler reports a store operation is slow, one of two things +could be happening: + +- the store is limited by the CPU's memory bandwidth, which is actually an ideal + scenario, all things considered; + +- memory bandwidth is nowhere near its peak, but the value to be stored is at the + end of a long chain of operations, and this store is where the profiler + encountered the pipeline stall; + +Since most profilers are simple tools which don't understand the subtleties of +instruction scheduling, you + +## Analyzing the machine code + +Certain tools have knowledge of internal CPU microarchitecture, i.e. they know + +- how many physical [register files] a CPU actually has + +- what is the latency / throughtput of an instruction + +- what [µ-ops] are generated for a set of instructions + +and many other architectural details. + +[register files]: https://en.wikipedia.org/wiki/Register_file +[µ-ops]: https://en.wikipedia.org/wiki/Micro-operation + +These tools are therefore able to provide accurate information as to why some +instructions are inefficient, and where the bottleneck is. + +The disadvantage is that the output of these tools requires advanced knowledge +of the target architecture to understand, i.e. they **cannot** point out what +the cause of the issue is explicitly. + +## Intel's Architecture Code Analyzer (IACA) + +[IACA] is a free tool offered by Intel for analyzing the performance of various +computational kernels. + +Being a proprietary, closed source tool, it _only_ supports Intel's µ-arches. + +[IACA]: https://software.intel.com/en-us/articles/intel-architecture-code-analyzer + +## llvm-mca + +<!-- +TODO: once LLVM 7 gets released, write a chapter on using llvm-mca +with SIMD disassembly. +--> diff --git a/third_party/rust/packed_simd/perf-guide/src/prof/profiling.md b/third_party/rust/packed_simd/perf-guide/src/prof/profiling.md new file mode 100644 index 0000000000..02ba78d2f2 --- /dev/null +++ b/third_party/rust/packed_simd/perf-guide/src/prof/profiling.md @@ -0,0 +1,14 @@ +# Performance profiling + +While the rest of the book provides practical advice on how to improve the performance +of SIMD code, this chapter is dedicated to [**performance profiling**][profiling]. +Profiling consists of recording a program's execution in order to identify program +hotspots. + +**Important**: most profilers require debug information in order to accurately +link the program hotspots back to the corresponding source code lines. Rust will +disable debug info generation by default for optimized builds, but you can change +that [in your `Cargo.toml`][cargo-ref]. + +[profiling]: https://en.wikipedia.org/wiki/Profiling_(computer_programming) +[cargo-ref]: https://doc.rust-lang.org/cargo/reference/manifest.html#the-profile-sections diff --git a/third_party/rust/packed_simd/perf-guide/src/target-feature/attribute.md b/third_party/rust/packed_simd/perf-guide/src/target-feature/attribute.md new file mode 100644 index 0000000000..ee670fea5b --- /dev/null +++ b/third_party/rust/packed_simd/perf-guide/src/target-feature/attribute.md @@ -0,0 +1,5 @@ +# The `target_feature` attribute + +<!-- TODO: +Explain the `#[target_feature]` attribute +--> diff --git a/third_party/rust/packed_simd/perf-guide/src/target-feature/features.md b/third_party/rust/packed_simd/perf-guide/src/target-feature/features.md new file mode 100644 index 0000000000..b93030ca67 --- /dev/null +++ b/third_party/rust/packed_simd/perf-guide/src/target-feature/features.md @@ -0,0 +1,13 @@ +# Enabling target features + +Not all processors of a certain architecture will have SIMD processing units, +and using a SIMD instruction which is not supported will trigger undefined behavior. + +To allow building safe, portable programs, the Rust compiler will **not**, by default, +generate any sort of vector instructions, unless it can statically determine +they are supported. For example, on AMD64, SSE2 support is architecturally guaranteed. +The `x86_64-apple-darwin` target enables up to SSSE3. The get a defintive list of +which features are enabled by default on various platforms, refer to the target +specifications [in the compiler's source code][targets]. + +[targets]: https://github.com/rust-lang/rust/tree/master/src/librustc_target/spec diff --git a/third_party/rust/packed_simd/perf-guide/src/target-feature/inlining.md b/third_party/rust/packed_simd/perf-guide/src/target-feature/inlining.md new file mode 100644 index 0000000000..86705102a7 --- /dev/null +++ b/third_party/rust/packed_simd/perf-guide/src/target-feature/inlining.md @@ -0,0 +1,5 @@ +# Inlining + +<!-- TODO: +Explain how the `#[target_feature]` attribute interacts with inlining +--> diff --git a/third_party/rust/packed_simd/perf-guide/src/target-feature/practice.md b/third_party/rust/packed_simd/perf-guide/src/target-feature/practice.md new file mode 100644 index 0000000000..5b55c61c26 --- /dev/null +++ b/third_party/rust/packed_simd/perf-guide/src/target-feature/practice.md @@ -0,0 +1,31 @@ +# Target features in practice + +Using `RUSTFLAGS` will allow the crate being compiled, as well as all its +transitive dependencies to use certain target features. + +A tehnique used to avoid undefined behavior at runtime is to compile and +ship multiple binaries, each compiled with a certain set of features. +This might not be feasible in some cases, and can quickly get out of hand +as more and more vector extensions are added to an architecture. + +Rust can be more flexible: you can build a single binary/library which automatically +picks the best supported vector instructions depending on the host machine. +The trick consists of monomorphizing parts of the code during building, and then +using run-time feature detection to select the right code path when running. + +<!-- TODO +Explain how to create efficient functions that dispatch to different +implementations at run-time without issues (e.g. using `#[inline(always)]` for +the impls, wrapping in `#[target_feature]`, and the wrapping those in a function +that does run-time feature detection). +--> + +**NOTE** (x86 specific): because the AVX (256-bit) registers extend the existing +SSE (128-bit) registers, mixing SSE and AVX instructions in a program can cause +performance issues. + +The solution is to compile all code, even the code written with 128-bit vectors, +with the AVX target feature enabled. This will cause the compiler to prefix the +generated instructions with the [VEX] prefix. + +[VEX]: https://en.wikipedia.org/wiki/VEX_prefix diff --git a/third_party/rust/packed_simd/perf-guide/src/target-feature/runtime.md b/third_party/rust/packed_simd/perf-guide/src/target-feature/runtime.md new file mode 100644 index 0000000000..47ddcc8660 --- /dev/null +++ b/third_party/rust/packed_simd/perf-guide/src/target-feature/runtime.md @@ -0,0 +1,5 @@ +# Detecting host features at runtime + +<!-- TODO: +Explain cost (how it works). +--> diff --git a/third_party/rust/packed_simd/perf-guide/src/target-feature/rustflags.md b/third_party/rust/packed_simd/perf-guide/src/target-feature/rustflags.md new file mode 100644 index 0000000000..f4c1d1304a --- /dev/null +++ b/third_party/rust/packed_simd/perf-guide/src/target-feature/rustflags.md @@ -0,0 +1,77 @@ +# Using RUSTFLAGS + +One of the easiest ways to benefit from SIMD is to allow the compiler +to generate code using certain vector instruction extensions. + +The environment variable `RUSTFLAGS` can be used to pass options for code +generation to the Rust compiler. These flags will affect **all** compiled crates. + +There are two flags which can be used to enable specific vector extensions: + +## target-feature + +- Syntax: `-C target-feature=<features>` + +- Provides the compiler with a comma-separated set of instruction extensions + to enable. + + **Example**: Use `-C target-feature=+sse3,+avx` to enable generating instructions + for [Streaming SIMD Extensions 3](https://en.wikipedia.org/wiki/SSE3) and + [Advanced Vector Extensions](https://en.wikipedia.org/wiki/Advanced_Vector_Extensions). + +- To list target triples for all targets supported by Rust, use: + + ```sh + rustc --print target-list + ``` + +- To list all support target features for a certain target triple, use: + + ```sh + rustc --target=${TRIPLE} --print target-features + ``` + +- Note that all CPU features are independent, and will have to be enabled individually. + + **Example**: Setting `-C target-feature=+avx2` will _not_ enable `fma`, even though + all CPUs which support AVX2 also support FMA. To enable both, one has to use + `-C target-feature=+avx2,+fma` + +- Some features also depend on other features, which need to be enabled for the + target instructions to be generated. + + **Example**: Unless `v7` is specified as the target CPU (see below), to enable + NEON on ARM it is necessary to use `-C target-feature=+v7,+neon`. + +## target-cpu + +- Syntax: `-C target-cpu=<cpu>` + +- Sets the identifier of a CPU family / model for which to build and optimize the code. + + **Example**: `RUSTFLAGS='-C target-cpu=cortex-a75'` + +- To list all supported target CPUs for a certain target triple, use: + + ```sh + rustc --target=${TRIPLE} --print target-cpus + ``` + + **Example**: + + ```sh + rustc --target=i686-pc-windows-msvc --print target-cpus + ``` + +- The compiler will translate this into a list of target features. Therefore, + individual feature checks (`#[cfg(target_feature = "...")]`) will still + work properly. + +- It will cause the code generator to optimize the generated code for that + specific CPU model. + +- Using `native` as the CPU model will cause Rust to generate and optimize code + for the CPU running the compiler. It is useful when building programs which you + plan to only use locally. This should never be used when the generated programs + are meant to be run on other computers, such as when packaging for distribution + or cross-compiling. diff --git a/third_party/rust/packed_simd/perf-guide/src/vert-hor-ops.md b/third_party/rust/packed_simd/perf-guide/src/vert-hor-ops.md new file mode 100644 index 0000000000..d0dd1be12a --- /dev/null +++ b/third_party/rust/packed_simd/perf-guide/src/vert-hor-ops.md @@ -0,0 +1,76 @@ +# Vertical and horizontal operations + +In SIMD terminology, each vector has a certain "width" (number of lanes). +A vector processor is able to perform two kinds of operations on a vector: + +- Vertical operations: + operate on two vectors of the same width, result has same width + +**Example**: vertical addition of two `f32x4` vectors + + %0 == | 2 | -3.5 | 0 | 7 | + + + + + + %1 == | 4 | 1.5 | -1 | 0 | + = = = = + %0 + %1 == | 6 | -2 | -1 | 7 | + +- Horizontal operations: + reduce the elements of two vectors in some way, + the result's elements combine information from the two original ones + +**Example**: horizontal addition of two `u64x2` vectors + + %0 == | 1 | 3 | + └─+───┘ + └───────┐ + │ + %1 == | 4 | -1 | │ + └─+──┘ │ + └───┐ │ + │ │ + ┌─────│───┘ + ▼ ▼ + %0 + %1 == | 4 | 3 | + +## Performance consideration of horizontal operations + +The result of vertical operations, like vector negation: `-a`, for a given lane, +does not depend on the result of the operation for the other lanes. The result +of horizontal operations, like the vector `sum` reduction: `a.sum()`, depends on +the value of all vector lanes. + +In virtually all architectures vertical operations are fast, while horizontal +operations are, by comparison, very slow. + +Consider the following two functions for computing the sum of all `f32` values +in a slice: + +```rust +fn fast_sum(x: &[f32]) -> f32 { + assert!(x.len() % 4 == 0); + let mut sum = f32x4::splat(0.); // [0., 0., 0., 0.] + for i in (0..x.len()).step_by(4) { + sum += f32x4::from_slice_unaligned(&x[i..]); + } + sum.sum() +} + +fn slow_sum(x: &[f32]) -> f32 { + assert!(x.len() % 4 == 0); + let mut sum: f32 = 0.; + for i in (0..x.len()).step_by(4) { + sum += f32x4::from_slice_unaligned(&x[i..]).sum(); + } + sum +} +``` + +The inner loop over the slice is where the bulk of the work actually happens. +There, the `fast_sum` function perform vertical operations into a vector, doing +a single horizontal reduction at the end, while the `slow_sum` function performs +horizontal vector operations inside of the loop. + +On all widely-used architectures, `fast_sum` is a large constant factor faster +than `slow_sum`. You can run the [slice_sum]() example and see for yourself. On +the particular machine tested there the algorithm using the horizontal vector +addition is 2.7x slower than the one using vertical vector operations! |