diff options
Diffstat (limited to 'third_party/rust/packed_simd/perf-guide')
19 files changed, 0 insertions, 542 deletions
diff --git a/third_party/rust/packed_simd/perf-guide/book.toml b/third_party/rust/packed_simd/perf-guide/book.toml deleted file mode 100644 index 69ba3053ca..0000000000 --- a/third_party/rust/packed_simd/perf-guide/book.toml +++ /dev/null @@ -1,12 +0,0 @@ -[book] -authors = ["Gonzalo Brito Gadeschi", "Gabriel Majeri"] -multilingual = false -src = "src" -title = "Rust SIMD Performance Guide" -description = "This book describes how to write performant SIMD code in Rust." - -[build] -create-missing = false - -[output.html] -additional-css = ["./src/ascii.css"] diff --git a/third_party/rust/packed_simd/perf-guide/src/SUMMARY.md b/third_party/rust/packed_simd/perf-guide/src/SUMMARY.md deleted file mode 100644 index 1e76898865..0000000000 --- a/third_party/rust/packed_simd/perf-guide/src/SUMMARY.md +++ /dev/null @@ -1,21 +0,0 @@ -# Summary - -[Introduction](./introduction.md) - -- [Floating-point Math](./float-math/fp.md) - - [Short-vector Math Library](./float-math/svml.md) - - [Approximate functions](./float-math/approx.md) - - [Fused multiply-accumulate](./float-math/fma.md) - -- [Target features](./target-feature/features.md) - - [Using `RUSTFLAGS`](./target-feature/rustflags.md) - - [Using the `target_feature` attribute](./target-feature/attribute.md) - - [Interaction with inlining](./target-feature/inlining.md) - - [Detecting features at runtime](./target-feature/runtime.md) - -- [Bounds checking](./bound_checks.md) -- [Vertical and horizontal operations](./vert-hor-ops.md) - -- [Performance profiling](./prof/profiling.md) - - [Profiling on Linux](./prof/linux.md) - - [Using machine code analyzers](./prof/mca.md) diff --git a/third_party/rust/packed_simd/perf-guide/src/ascii.css b/third_party/rust/packed_simd/perf-guide/src/ascii.css deleted file mode 100644 index 4c02651195..0000000000 --- a/third_party/rust/packed_simd/perf-guide/src/ascii.css +++ /dev/null @@ -1,4 +0,0 @@ -code { - /* "Source Code Pro" breaks ASCII art */ - font-family: Consolas, "Ubuntu Mono", Menlo, "DejaVu Sans Mono", monospace; -} diff --git a/third_party/rust/packed_simd/perf-guide/src/bound_checks.md b/third_party/rust/packed_simd/perf-guide/src/bound_checks.md deleted file mode 100644 index 2eeedb5ac8..0000000000 --- a/third_party/rust/packed_simd/perf-guide/src/bound_checks.md +++ /dev/null @@ -1,22 +0,0 @@ -# Bounds checking - -Reading and writing packed vectors to/from slices is checked by default. -Independently of the configuration options used, the safe functions: - -* `Simd<[T; N]>::from_slice_aligned(& s[..])` -* `Simd<[T; N]>::write_to_slice_aligned(&mut s[..])` - -always check that: - -* the slice is big enough to hold the vector -* the slice is suitably aligned to perform an aligned load/store for a `Simd<[T; - N]>` (this alignment is often much larger than that of `T`). - -There are `_unaligned` versions that use unaligned load and stores, as well as -`unsafe` `_unchecked` that do not perform any checks iff `debug-assertions = -false` / `debug = false`. That is, the `_unchecked` methods do still assert size -and alignment in debug builds and could also do so in release builds depending -on the configuration options. - -These assertions do often significantly impact performance and you should be -aware of them. diff --git a/third_party/rust/packed_simd/perf-guide/src/float-math/approx.md b/third_party/rust/packed_simd/perf-guide/src/float-math/approx.md deleted file mode 100644 index 2237c67ec4..0000000000 --- a/third_party/rust/packed_simd/perf-guide/src/float-math/approx.md +++ /dev/null @@ -1,8 +0,0 @@ -# Approximate functions - -<!-- TODO: - -Explain that they exists, that they are often _much_ faster, how to use them, -that people should check whether the error is good enough for their -applications. Explain that this error is currently unstable and might change. ---> diff --git a/third_party/rust/packed_simd/perf-guide/src/float-math/fma.md b/third_party/rust/packed_simd/perf-guide/src/float-math/fma.md deleted file mode 100644 index 357748383d..0000000000 --- a/third_party/rust/packed_simd/perf-guide/src/float-math/fma.md +++ /dev/null @@ -1,6 +0,0 @@ -# Fused Multiply Add - -<!-- TODO: -Explain that this is a compound operation, infinite precision, difference -between `mul_add` and `mul_adde`, that LLVM cannot do this by itself, etc. ---> diff --git a/third_party/rust/packed_simd/perf-guide/src/float-math/fp.md b/third_party/rust/packed_simd/perf-guide/src/float-math/fp.md deleted file mode 100644 index 711fcc4fd5..0000000000 --- a/third_party/rust/packed_simd/perf-guide/src/float-math/fp.md +++ /dev/null @@ -1,3 +0,0 @@ -# Floating-point math - -This chapter contains information pertaining to working with floating-point numbers. diff --git a/third_party/rust/packed_simd/perf-guide/src/float-math/svml.md b/third_party/rust/packed_simd/perf-guide/src/float-math/svml.md deleted file mode 100644 index 266c2531cc..0000000000 --- a/third_party/rust/packed_simd/perf-guide/src/float-math/svml.md +++ /dev/null @@ -1,7 +0,0 @@ -# Short Vector Math Library - -<!-- TODO: -Explain how is short-vector math performed by default (just scalarized libm calls). - -Explain how to enable `sleef`, etc. ---> diff --git a/third_party/rust/packed_simd/perf-guide/src/introduction.md b/third_party/rust/packed_simd/perf-guide/src/introduction.md deleted file mode 100644 index 7243e19c8a..0000000000 --- a/third_party/rust/packed_simd/perf-guide/src/introduction.md +++ /dev/null @@ -1,26 +0,0 @@ -# Introduction - -## What is SIMD - -<!-- TODO: -describe what SIMD is, which algorithms can benefit from it, -give usage examples ---> - -## History of SIMD in Rust - -<!-- TODO: -discuss history of unstable std::simd, -stabilization of std::arch, etc. ---> - -## Discover packed_simd - -<!-- TODO: describe scope of this project --> - -Writing fast and portable SIMD algorithms using `packed_simd` is, unfortunately, -not trivial. There are many pitfals that one should be aware of, and some idioms -that help avoid those pitfalls. - -This book attempts to document these best practices and provides practical examples -on how to apply the tips to _your_ code. diff --git a/third_party/rust/packed_simd/perf-guide/src/prof/linux.md b/third_party/rust/packed_simd/perf-guide/src/prof/linux.md deleted file mode 100644 index 96c7d67bc4..0000000000 --- a/third_party/rust/packed_simd/perf-guide/src/prof/linux.md +++ /dev/null @@ -1,107 +0,0 @@ -# Performance profiling on Linux - -## Using `perf` - -[perf](https://perf.wiki.kernel.org/) is the most powerful performance profiler -for Linux, featuring support for various hardware Performance Monitoring Units, -as well as integration with the kernel's performance events framework. - -We will only look at how can the `perf` command can be used to profile SIMD code. -Full system profiling is outside of the scope of this book. - -### Recording - -The first step is to record a program's execution during an average workload. -It helps if you can isolate the parts of your program which have performance -issues, and set up a benchmark which can be easily (re)run. - -Build the benchmark binary in release mode, after having enabled debug info: - -```sh -$ cargo build --release -Finished release [optimized + debuginfo] target(s) in 0.02s -``` - -Then use the `perf record` subcommand: - -```sh -$ perf record --call-graph=dwarf ./target/release/my-program -[ perf record: Woken up 10 times to write data ] -[ perf record: Captured and wrote 2,356 MB perf.data (292 samples) ] -``` - -Instead of using `--call-graph=dwarf`, which can become pretty slow, you can use -`--call-graph=lbr` if you have a processor with support for Last Branch Record -(i.e. Intel Haswell and newer). - -`perf` will, by default, record the count of CPU cycles it takes to execute -various parts of your program. You can use the `-e` command line option -to enable other performance events, such as `cache-misses`. Use `perf list` -to get a list of all hardware counters supported by your CPU. - -### Viewing the report - -The next step is getting a bird's eye view of the program's execution. -`perf` provides a `ncurses`-based interface which will get you started. - -Use `perf report` to open a visualization of your program's performance: - -```sh -perf report --hierarchy -M intel -``` - -`--hierarchy` will display a tree-like structure of where your program spent -most of its time. `-M intel` enables disassembly output with Intel syntax, which -is subjectively more readable than the default AT&T syntax. - -Here is the output from profiling the `nbody` benchmark: - -``` -- 100,00% nbody - - 94,18% nbody - + 93,48% [.] nbody_lib::simd::advance - + 0,70% [.] nbody_lib::run - + 5,06% libc-2.28.so -``` - -If you move with the arrow keys to any node in the tree, you can the press `a` -to have `perf` _annotate_ that node. This means it will: - -- disassemble the function - -- associate every instruction with the percentage of time which was spent executing it - -- interleaves the disassembly with the source code, - assuming it found the debug symbols - (you can use `s` to toggle this behaviour) - -`perf` will, by default, open the instruction which it identified as being the -hottest spot in the function: - -``` -0,76 │ movapd xmm2,xmm0 -0,38 │ movhlps xmm2,xmm0 - │ addpd xmm2,xmm0 - │ unpcklpd xmm1,xmm2 -12,50 │ sqrtpd xmm0,xmm1 -1,52 │ mulpd xmm0,xmm1 -``` - -In this case, `sqrtpd` will be highlighted in red, since that's the instruction -which the CPU spends most of its time executing. - -## Using Valgrind - -Valgrind is a set of tools which initially helped C/C++ programmers find unsafe -memory accesses in their code. Nowadays the project also has - -- a heap profiler called `massif` - -- a cache utilization profiler called `cachegrind` - -- a call-graph performance profiler called `callgrind` - -<!-- -TODO: explain valgrind's dynamic binary translation, warn about massive -slowdown, talk about `kcachegrind` for a GUI ---> diff --git a/third_party/rust/packed_simd/perf-guide/src/prof/mca.md b/third_party/rust/packed_simd/perf-guide/src/prof/mca.md deleted file mode 100644 index 65ddf1a4eb..0000000000 --- a/third_party/rust/packed_simd/perf-guide/src/prof/mca.md +++ /dev/null @@ -1,100 +0,0 @@ -# Machine code analysis tools - -## The microarchitecture of modern CPUs - -While you might have heard of Instruction Set Architectures, such as `x86` or -`arm` or `mips`, the term _microarchitecture_ (also written here as _µ-arch_), -refers to the internal details of an actual family of CPUs, such as Intel's -_Haswell_ or AMD's _Jaguar_. - -Replacing scalar code with SIMD code will improve performance on all CPUs -supporting the required vector extensions. -However, due to microarchitectural differences, the actual speed-up at -runtime might vary. - -**Example**: a simple example arises when optimizing for AMD K8 CPUs. -The assembly generated for an empty function should look like this: - -```asm -nop -ret -``` - -The `nop` is used to align the `ret` instruction for better performance. -However, the compiler will actually generated the following code: - -```asm -repz ret -``` - -The `repz` instruction will repeat the following instruction until a certain -condition. Of course, in this situation, the function will simply immediately -return, and the `ret` instruction is still aligned. -However, AMD K8's branch predictor performs better with the latter code. - -For those looking to absolutely maximize performance for a certain target µ-arch, -you will have to read some CPU manuals, or ask the compiler to do it for you -with `-C target-cpu`. - -### Summary of CPU internals - -Modern processors are able to execute instructions out-of-order for better performance, -by utilizing tricks such as [branch prediction], [instruction pipelining], -or [superscalar execution]. - -[branch prediction]: https://en.wikipedia.org/wiki/Branch_predictor -[instruction pipelining]: https://en.wikipedia.org/wiki/Instruction_pipelining -[superscalar execution]: https://en.wikipedia.org/wiki/Superscalar_processor - -SIMD instructions are also subject to these optimizations, meaning it can get pretty -difficult to determine where the slowdown happens. -For example, if the profiler reports a store operation is slow, one of two things -could be happening: - -- the store is limited by the CPU's memory bandwidth, which is actually an ideal - scenario, all things considered; - -- memory bandwidth is nowhere near its peak, but the value to be stored is at the - end of a long chain of operations, and this store is where the profiler - encountered the pipeline stall; - -Since most profilers are simple tools which don't understand the subtleties of -instruction scheduling, you - -## Analyzing the machine code - -Certain tools have knowledge of internal CPU microarchitecture, i.e. they know - -- how many physical [register files] a CPU actually has - -- what is the latency / throughtput of an instruction - -- what [µ-ops] are generated for a set of instructions - -and many other architectural details. - -[register files]: https://en.wikipedia.org/wiki/Register_file -[µ-ops]: https://en.wikipedia.org/wiki/Micro-operation - -These tools are therefore able to provide accurate information as to why some -instructions are inefficient, and where the bottleneck is. - -The disadvantage is that the output of these tools requires advanced knowledge -of the target architecture to understand, i.e. they **cannot** point out what -the cause of the issue is explicitly. - -## Intel's Architecture Code Analyzer (IACA) - -[IACA] is a free tool offered by Intel for analyzing the performance of various -computational kernels. - -Being a proprietary, closed source tool, it _only_ supports Intel's µ-arches. - -[IACA]: https://software.intel.com/en-us/articles/intel-architecture-code-analyzer - -## llvm-mca - -<!-- -TODO: once LLVM 7 gets released, write a chapter on using llvm-mca -with SIMD disassembly. ---> diff --git a/third_party/rust/packed_simd/perf-guide/src/prof/profiling.md b/third_party/rust/packed_simd/perf-guide/src/prof/profiling.md deleted file mode 100644 index 02ba78d2f2..0000000000 --- a/third_party/rust/packed_simd/perf-guide/src/prof/profiling.md +++ /dev/null @@ -1,14 +0,0 @@ -# Performance profiling - -While the rest of the book provides practical advice on how to improve the performance -of SIMD code, this chapter is dedicated to [**performance profiling**][profiling]. -Profiling consists of recording a program's execution in order to identify program -hotspots. - -**Important**: most profilers require debug information in order to accurately -link the program hotspots back to the corresponding source code lines. Rust will -disable debug info generation by default for optimized builds, but you can change -that [in your `Cargo.toml`][cargo-ref]. - -[profiling]: https://en.wikipedia.org/wiki/Profiling_(computer_programming) -[cargo-ref]: https://doc.rust-lang.org/cargo/reference/manifest.html#the-profile-sections diff --git a/third_party/rust/packed_simd/perf-guide/src/target-feature/attribute.md b/third_party/rust/packed_simd/perf-guide/src/target-feature/attribute.md deleted file mode 100644 index ee670fea5b..0000000000 --- a/third_party/rust/packed_simd/perf-guide/src/target-feature/attribute.md +++ /dev/null @@ -1,5 +0,0 @@ -# The `target_feature` attribute - -<!-- TODO: -Explain the `#[target_feature]` attribute ---> diff --git a/third_party/rust/packed_simd/perf-guide/src/target-feature/features.md b/third_party/rust/packed_simd/perf-guide/src/target-feature/features.md deleted file mode 100644 index b93030ca67..0000000000 --- a/third_party/rust/packed_simd/perf-guide/src/target-feature/features.md +++ /dev/null @@ -1,13 +0,0 @@ -# Enabling target features - -Not all processors of a certain architecture will have SIMD processing units, -and using a SIMD instruction which is not supported will trigger undefined behavior. - -To allow building safe, portable programs, the Rust compiler will **not**, by default, -generate any sort of vector instructions, unless it can statically determine -they are supported. For example, on AMD64, SSE2 support is architecturally guaranteed. -The `x86_64-apple-darwin` target enables up to SSSE3. The get a defintive list of -which features are enabled by default on various platforms, refer to the target -specifications [in the compiler's source code][targets]. - -[targets]: https://github.com/rust-lang/rust/tree/master/src/librustc_target/spec diff --git a/third_party/rust/packed_simd/perf-guide/src/target-feature/inlining.md b/third_party/rust/packed_simd/perf-guide/src/target-feature/inlining.md deleted file mode 100644 index 86705102a7..0000000000 --- a/third_party/rust/packed_simd/perf-guide/src/target-feature/inlining.md +++ /dev/null @@ -1,5 +0,0 @@ -# Inlining - -<!-- TODO: -Explain how the `#[target_feature]` attribute interacts with inlining ---> diff --git a/third_party/rust/packed_simd/perf-guide/src/target-feature/practice.md b/third_party/rust/packed_simd/perf-guide/src/target-feature/practice.md deleted file mode 100644 index 5b55c61c26..0000000000 --- a/third_party/rust/packed_simd/perf-guide/src/target-feature/practice.md +++ /dev/null @@ -1,31 +0,0 @@ -# Target features in practice - -Using `RUSTFLAGS` will allow the crate being compiled, as well as all its -transitive dependencies to use certain target features. - -A tehnique used to avoid undefined behavior at runtime is to compile and -ship multiple binaries, each compiled with a certain set of features. -This might not be feasible in some cases, and can quickly get out of hand -as more and more vector extensions are added to an architecture. - -Rust can be more flexible: you can build a single binary/library which automatically -picks the best supported vector instructions depending on the host machine. -The trick consists of monomorphizing parts of the code during building, and then -using run-time feature detection to select the right code path when running. - -<!-- TODO -Explain how to create efficient functions that dispatch to different -implementations at run-time without issues (e.g. using `#[inline(always)]` for -the impls, wrapping in `#[target_feature]`, and the wrapping those in a function -that does run-time feature detection). ---> - -**NOTE** (x86 specific): because the AVX (256-bit) registers extend the existing -SSE (128-bit) registers, mixing SSE and AVX instructions in a program can cause -performance issues. - -The solution is to compile all code, even the code written with 128-bit vectors, -with the AVX target feature enabled. This will cause the compiler to prefix the -generated instructions with the [VEX] prefix. - -[VEX]: https://en.wikipedia.org/wiki/VEX_prefix diff --git a/third_party/rust/packed_simd/perf-guide/src/target-feature/runtime.md b/third_party/rust/packed_simd/perf-guide/src/target-feature/runtime.md deleted file mode 100644 index 47ddcc8660..0000000000 --- a/third_party/rust/packed_simd/perf-guide/src/target-feature/runtime.md +++ /dev/null @@ -1,5 +0,0 @@ -# Detecting host features at runtime - -<!-- TODO: -Explain cost (how it works). ---> diff --git a/third_party/rust/packed_simd/perf-guide/src/target-feature/rustflags.md b/third_party/rust/packed_simd/perf-guide/src/target-feature/rustflags.md deleted file mode 100644 index f4c1d1304a..0000000000 --- a/third_party/rust/packed_simd/perf-guide/src/target-feature/rustflags.md +++ /dev/null @@ -1,77 +0,0 @@ -# Using RUSTFLAGS - -One of the easiest ways to benefit from SIMD is to allow the compiler -to generate code using certain vector instruction extensions. - -The environment variable `RUSTFLAGS` can be used to pass options for code -generation to the Rust compiler. These flags will affect **all** compiled crates. - -There are two flags which can be used to enable specific vector extensions: - -## target-feature - -- Syntax: `-C target-feature=<features>` - -- Provides the compiler with a comma-separated set of instruction extensions - to enable. - - **Example**: Use `-C target-feature=+sse3,+avx` to enable generating instructions - for [Streaming SIMD Extensions 3](https://en.wikipedia.org/wiki/SSE3) and - [Advanced Vector Extensions](https://en.wikipedia.org/wiki/Advanced_Vector_Extensions). - -- To list target triples for all targets supported by Rust, use: - - ```sh - rustc --print target-list - ``` - -- To list all support target features for a certain target triple, use: - - ```sh - rustc --target=${TRIPLE} --print target-features - ``` - -- Note that all CPU features are independent, and will have to be enabled individually. - - **Example**: Setting `-C target-feature=+avx2` will _not_ enable `fma`, even though - all CPUs which support AVX2 also support FMA. To enable both, one has to use - `-C target-feature=+avx2,+fma` - -- Some features also depend on other features, which need to be enabled for the - target instructions to be generated. - - **Example**: Unless `v7` is specified as the target CPU (see below), to enable - NEON on ARM it is necessary to use `-C target-feature=+v7,+neon`. - -## target-cpu - -- Syntax: `-C target-cpu=<cpu>` - -- Sets the identifier of a CPU family / model for which to build and optimize the code. - - **Example**: `RUSTFLAGS='-C target-cpu=cortex-a75'` - -- To list all supported target CPUs for a certain target triple, use: - - ```sh - rustc --target=${TRIPLE} --print target-cpus - ``` - - **Example**: - - ```sh - rustc --target=i686-pc-windows-msvc --print target-cpus - ``` - -- The compiler will translate this into a list of target features. Therefore, - individual feature checks (`#[cfg(target_feature = "...")]`) will still - work properly. - -- It will cause the code generator to optimize the generated code for that - specific CPU model. - -- Using `native` as the CPU model will cause Rust to generate and optimize code - for the CPU running the compiler. It is useful when building programs which you - plan to only use locally. This should never be used when the generated programs - are meant to be run on other computers, such as when packaging for distribution - or cross-compiling. diff --git a/third_party/rust/packed_simd/perf-guide/src/vert-hor-ops.md b/third_party/rust/packed_simd/perf-guide/src/vert-hor-ops.md deleted file mode 100644 index d0dd1be12a..0000000000 --- a/third_party/rust/packed_simd/perf-guide/src/vert-hor-ops.md +++ /dev/null @@ -1,76 +0,0 @@ -# Vertical and horizontal operations - -In SIMD terminology, each vector has a certain "width" (number of lanes). -A vector processor is able to perform two kinds of operations on a vector: - -- Vertical operations: - operate on two vectors of the same width, result has same width - -**Example**: vertical addition of two `f32x4` vectors - - %0 == | 2 | -3.5 | 0 | 7 | - + + + + - %1 == | 4 | 1.5 | -1 | 0 | - = = = = - %0 + %1 == | 6 | -2 | -1 | 7 | - -- Horizontal operations: - reduce the elements of two vectors in some way, - the result's elements combine information from the two original ones - -**Example**: horizontal addition of two `u64x2` vectors - - %0 == | 1 | 3 | - └─+───┘ - └───────┐ - │ - %1 == | 4 | -1 | │ - └─+──┘ │ - └───┐ │ - │ │ - ┌─────│───┘ - ▼ ▼ - %0 + %1 == | 4 | 3 | - -## Performance consideration of horizontal operations - -The result of vertical operations, like vector negation: `-a`, for a given lane, -does not depend on the result of the operation for the other lanes. The result -of horizontal operations, like the vector `sum` reduction: `a.sum()`, depends on -the value of all vector lanes. - -In virtually all architectures vertical operations are fast, while horizontal -operations are, by comparison, very slow. - -Consider the following two functions for computing the sum of all `f32` values -in a slice: - -```rust -fn fast_sum(x: &[f32]) -> f32 { - assert!(x.len() % 4 == 0); - let mut sum = f32x4::splat(0.); // [0., 0., 0., 0.] - for i in (0..x.len()).step_by(4) { - sum += f32x4::from_slice_unaligned(&x[i..]); - } - sum.sum() -} - -fn slow_sum(x: &[f32]) -> f32 { - assert!(x.len() % 4 == 0); - let mut sum: f32 = 0.; - for i in (0..x.len()).step_by(4) { - sum += f32x4::from_slice_unaligned(&x[i..]).sum(); - } - sum -} -``` - -The inner loop over the slice is where the bulk of the work actually happens. -There, the `fast_sum` function perform vertical operations into a vector, doing -a single horizontal reduction at the end, while the `slow_sum` function performs -horizontal vector operations inside of the loop. - -On all widely-used architectures, `fast_sum` is a large constant factor faster -than `slow_sum`. You can run the [slice_sum]() example and see for yourself. On -the particular machine tested there the algorithm using the horizontal vector -addition is 2.7x slower than the one using vertical vector operations! |