Diffstat (limited to 'vendor/packed_simd/perf-guide/src')
-rw-r--r--  vendor/packed_simd/perf-guide/src/SUMMARY.md                      21
-rw-r--r--  vendor/packed_simd/perf-guide/src/ascii.css                        4
-rw-r--r--  vendor/packed_simd/perf-guide/src/bound_checks.md                 22
-rw-r--r--  vendor/packed_simd/perf-guide/src/float-math/approx.md             8
-rw-r--r--  vendor/packed_simd/perf-guide/src/float-math/fma.md                6
-rw-r--r--  vendor/packed_simd/perf-guide/src/float-math/fp.md                 3
-rw-r--r--  vendor/packed_simd/perf-guide/src/float-math/svml.md               7
-rw-r--r--  vendor/packed_simd/perf-guide/src/introduction.md                 26
-rw-r--r--  vendor/packed_simd/perf-guide/src/prof/linux.md                  107
-rw-r--r--  vendor/packed_simd/perf-guide/src/prof/mca.md                    100
-rw-r--r--  vendor/packed_simd/perf-guide/src/prof/profiling.md               14
-rw-r--r--  vendor/packed_simd/perf-guide/src/target-feature/attribute.md      5
-rw-r--r--  vendor/packed_simd/perf-guide/src/target-feature/features.md      13
-rw-r--r--  vendor/packed_simd/perf-guide/src/target-feature/inlining.md       5
-rw-r--r--  vendor/packed_simd/perf-guide/src/target-feature/practice.md      31
-rw-r--r--  vendor/packed_simd/perf-guide/src/target-feature/runtime.md        5
-rw-r--r--  vendor/packed_simd/perf-guide/src/target-feature/rustflags.md     77
-rw-r--r--  vendor/packed_simd/perf-guide/src/vert-hor-ops.md                 76
18 files changed, 530 insertions, 0 deletions
diff --git a/vendor/packed_simd/perf-guide/src/SUMMARY.md b/vendor/packed_simd/perf-guide/src/SUMMARY.md
new file mode 100644
index 000000000..1e7689886
--- /dev/null
+++ b/vendor/packed_simd/perf-guide/src/SUMMARY.md
@@ -0,0 +1,21 @@
+# Summary
+
+[Introduction](./introduction.md)
+
+- [Floating-point Math](./float-math/fp.md)
+ - [Short-vector Math Library](./float-math/svml.md)
+ - [Approximate functions](./float-math/approx.md)
+ - [Fused multiply-accumulate](./float-math/fma.md)
+
+- [Target features](./target-feature/features.md)
+ - [Using `RUSTFLAGS`](./target-feature/rustflags.md)
+ - [Using the `target_feature` attribute](./target-feature/attribute.md)
+ - [Interaction with inlining](./target-feature/inlining.md)
+ - [Detecting features at runtime](./target-feature/runtime.md)
+
+- [Bounds checking](./bound_checks.md)
+- [Vertical and horizontal operations](./vert-hor-ops.md)
+
+- [Performance profiling](./prof/profiling.md)
+ - [Profiling on Linux](./prof/linux.md)
+ - [Using machine code analyzers](./prof/mca.md)
diff --git a/vendor/packed_simd/perf-guide/src/ascii.css b/vendor/packed_simd/perf-guide/src/ascii.css
new file mode 100644
index 000000000..4c0265119
--- /dev/null
+++ b/vendor/packed_simd/perf-guide/src/ascii.css
@@ -0,0 +1,4 @@
+code {
+ /* "Source Code Pro" breaks ASCII art */
+ font-family: Consolas, "Ubuntu Mono", Menlo, "DejaVu Sans Mono", monospace;
+}
diff --git a/vendor/packed_simd/perf-guide/src/bound_checks.md b/vendor/packed_simd/perf-guide/src/bound_checks.md
new file mode 100644
index 000000000..2eeedb5ac
--- /dev/null
+++ b/vendor/packed_simd/perf-guide/src/bound_checks.md
@@ -0,0 +1,22 @@
+# Bounds checking
+
+Reading and writing packed vectors to/from slices is checked by default.
+Regardless of the configuration options used, the safe functions:
+
+* `Simd<[T; N]>::from_slice_aligned(& s[..])`
+* `Simd<[T; N]>::write_to_slice_aligned(&mut s[..])`
+
+always check that:
+
+* the slice is big enough to hold the vector
+* the slice is suitably aligned to perform an aligned load/store for a `Simd<[T;
+ N]>` (this alignment is often much larger than that of `T`).
+
+There are `_unaligned` versions that use unaligned loads and stores, as well as
+`unsafe` `_unchecked` versions that do not perform any checks iff `debug-assertions =
+false` / `debug = false`. That is, the `_unchecked` methods do still assert size
+and alignment in debug builds, and could also do so in release builds, depending
+on the configuration options.
+
+These assertions often have a significant performance impact, and you should be
+aware of them.
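+
+For illustration, here is a minimal sketch (not from the crate's docs; the
+function name is ours) contrasting the three kinds of loads:
+
+```rust
+use packed_simd::f32x4;
+
+fn three_loads(s: &[f32]) -> f32x4 {
+    // Checked: panics if `s` has fewer than 4 elements or is not aligned
+    // to the 16-byte alignment of `f32x4`.
+    let a = f32x4::from_slice_aligned(s);
+
+    // Checked for length only; accepts any alignment.
+    let u = f32x4::from_slice_unaligned(s);
+
+    // Unchecked: skips the length and alignment checks in release builds
+    // (with `debug-assertions = false`); the caller must uphold both.
+    let n = unsafe { f32x4::from_slice_aligned_unchecked(s) };
+
+    a + u + n
+}
+```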
diff --git a/vendor/packed_simd/perf-guide/src/float-math/approx.md b/vendor/packed_simd/perf-guide/src/float-math/approx.md
new file mode 100644
index 000000000..2237c67ec
--- /dev/null
+++ b/vendor/packed_simd/perf-guide/src/float-math/approx.md
@@ -0,0 +1,8 @@
+# Approximate functions
+
+<!-- TODO:
+
+Explain that they exist, that they are often _much_ faster, how to use them,
+that people should check whether the error is good enough for their
+applications. Explain that this error is currently unstable and might change.
+-->
diff --git a/vendor/packed_simd/perf-guide/src/float-math/fma.md b/vendor/packed_simd/perf-guide/src/float-math/fma.md
new file mode 100644
index 000000000..357748383
--- /dev/null
+++ b/vendor/packed_simd/perf-guide/src/float-math/fma.md
@@ -0,0 +1,6 @@
+# Fused Multiply Add
+
+<!-- TODO:
+Explain that this is a compound operation, infinite precision, difference
+between `mul_add` and `mul_adde`, that LLVM cannot do this by itself, etc.
+-->
diff --git a/vendor/packed_simd/perf-guide/src/float-math/fp.md b/vendor/packed_simd/perf-guide/src/float-math/fp.md
new file mode 100644
index 000000000..711fcc4fd
--- /dev/null
+++ b/vendor/packed_simd/perf-guide/src/float-math/fp.md
@@ -0,0 +1,3 @@
+# Floating-point math
+
+This chapter contains information pertaining to working with floating-point numbers.
diff --git a/vendor/packed_simd/perf-guide/src/float-math/svml.md b/vendor/packed_simd/perf-guide/src/float-math/svml.md
new file mode 100644
index 000000000..266c2531c
--- /dev/null
+++ b/vendor/packed_simd/perf-guide/src/float-math/svml.md
@@ -0,0 +1,7 @@
+# Short Vector Math Library
+
+<!-- TODO:
+Explain how is short-vector math performed by default (just scalarized libm calls).
+
+Explain how to enable `sleef`, etc.
+-->
diff --git a/vendor/packed_simd/perf-guide/src/introduction.md b/vendor/packed_simd/perf-guide/src/introduction.md
new file mode 100644
index 000000000..7243e19c8
--- /dev/null
+++ b/vendor/packed_simd/perf-guide/src/introduction.md
@@ -0,0 +1,26 @@
+# Introduction
+
+## What is SIMD
+
+<!-- TODO:
+describe what SIMD is, which algorithms can benefit from it,
+give usage examples
+-->
+
+## History of SIMD in Rust
+
+<!-- TODO:
+discuss history of unstable std::simd,
+stabilization of std::arch, etc.
+-->
+
+## Discover packed_simd
+
+<!-- TODO: describe scope of this project -->
+
+Writing fast and portable SIMD algorithms using `packed_simd` is, unfortunately,
+not trivial. There are many pitfalls that one should be aware of, and some idioms
+that help avoid those pitfalls.
+
+This book attempts to document these best practices, and provides practical
+examples of how to apply them to _your_ code.
diff --git a/vendor/packed_simd/perf-guide/src/prof/linux.md b/vendor/packed_simd/perf-guide/src/prof/linux.md
new file mode 100644
index 000000000..96c7d67bc
--- /dev/null
+++ b/vendor/packed_simd/perf-guide/src/prof/linux.md
@@ -0,0 +1,107 @@
+# Performance profiling on Linux
+
+## Using `perf`
+
+[perf](https://perf.wiki.kernel.org/) is the most powerful performance profiler
+for Linux, featuring support for various hardware Performance Monitoring Units,
+as well as integration with the kernel's performance events framework.
+
+We will only look at how the `perf` command can be used to profile SIMD code.
+Full system profiling is outside the scope of this book.
+
+### Recording
+
+The first step is to record a program's execution during an average workload.
+It helps if you can isolate the parts of your program which have performance
+issues, and set up a benchmark which can be easily (re)run.
+
+Build the benchmark binary in release mode, after having enabled debug info:
+
+```sh
+$ cargo build --release
+Finished release [optimized + debuginfo] target(s) in 0.02s
+```
+
+Then use the `perf record` subcommand:
+
+```sh
+$ perf record --call-graph=dwarf ./target/release/my-program
+[ perf record: Woken up 10 times to write data ]
+[ perf record: Captured and wrote 2,356 MB perf.data (292 samples) ]
+```
+
+Instead of using `--call-graph=dwarf`, which can become pretty slow, you can use
+`--call-graph=lbr` if you have a processor with support for Last Branch Record
+(i.e. Intel Haswell and newer).
+
+`perf` will, by default, record the count of CPU cycles it takes to execute
+various parts of your program. You can use the `-e` command line option
+to enable other performance events, such as `cache-misses`. Use `perf list`
+to get a list of all hardware counters supported by your CPU.
+
+### Viewing the report
+
+The next step is getting a bird's eye view of the program's execution.
+`perf` provides an `ncurses`-based interface which will get you started.
+
+Use `perf report` to open a visualization of your program's performance:
+
+```sh
+perf report --hierarchy -M intel
+```
+
+`--hierarchy` will display a tree-like structure of where your program spent
+most of its time. `-M intel` enables disassembly output with Intel syntax, which
+is subjectively more readable than the default AT&T syntax.
+
+Here is the output from profiling the `nbody` benchmark:
+
+```
+- 100,00% nbody
+ - 94,18% nbody
+ + 93,48% [.] nbody_lib::simd::advance
+ + 0,70% [.] nbody_lib::run
+ + 5,06% libc-2.28.so
+```
+
+If you move with the arrow keys to any node in the tree, you can then press `a`
+to have `perf` _annotate_ that node. This means it will:
+
+- disassemble the function
+
+- associate every instruction with the percentage of time which was spent executing it
+
+- interleave the disassembly with the source code,
+ assuming it found the debug symbols
+ (you can use `s` to toggle this behaviour)
+
+`perf` will, by default, open the instruction which it identified as being the
+hottest spot in the function:
+
+```
+0,76 │ movapd xmm2,xmm0
+0,38 │ movhlps xmm2,xmm0
+ │ addpd xmm2,xmm0
+ │ unpcklpd xmm1,xmm2
+12,50 │ sqrtpd xmm0,xmm1
+1,52 │ mulpd xmm0,xmm1
+```
+
+In this case, `sqrtpd` will be highlighted in red, since that's the instruction
+which the CPU spends most of its time executing.
+
+## Using Valgrind
+
+Valgrind is a set of tools which initially helped C/C++ programmers find unsafe
+memory accesses in their code. Nowadays the project also has
+
+- a heap profiler called `massif`
+
+- a cache utilization profiler called `cachegrind`
+
+- a call-graph performance profiler called `callgrind`
+
+<!--
+TODO: explain valgrind's dynamic binary translation, warn about massive
+slowdown, talk about `kcachegrind` for a GUI
+-->
diff --git a/vendor/packed_simd/perf-guide/src/prof/mca.md b/vendor/packed_simd/perf-guide/src/prof/mca.md
new file mode 100644
index 000000000..65ddf1a4e
--- /dev/null
+++ b/vendor/packed_simd/perf-guide/src/prof/mca.md
@@ -0,0 +1,100 @@
+# Machine code analysis tools
+
+## The microarchitecture of modern CPUs
+
+While you might have heard of Instruction Set Architectures, such as `x86` or
+`arm` or `mips`, the term _microarchitecture_ (also written here as _µ-arch_)
+refers to the internal details of an actual family of CPUs, such as Intel's
+_Haswell_ or AMD's _Jaguar_.
+
+Replacing scalar code with SIMD code will improve performance on all CPUs
+supporting the required vector extensions.
+However, due to microarchitectural differences, the actual speed-up at
+runtime might vary.
+
+**Example**: a simple case arises when optimizing for AMD K8 CPUs.
+The assembly generated for an empty function should look like this:
+
+```asm
+nop
+ret
+```
+
+The `nop` is used to align the `ret` instruction for better performance.
+However, the compiler will actually generate the following code:
+
+```asm
+repz ret
+```
+
+The `repz` instruction will repeat the following instruction until a certain
+condition is met. Of course, in this situation the function still returns
+immediately, and the `ret` instruction is still aligned.
+However, AMD K8's branch predictor performs better with the latter code.
+
+If you are looking to absolutely maximize performance for a certain target
+µ-arch, you will have to read the CPU's manuals, or ask the compiler to do it
+for you with `-C target-cpu`.
+
+### Summary of CPU internals
+
+Modern processors are able to execute instructions out-of-order for better performance,
+by utilizing tricks such as [branch prediction], [instruction pipelining],
+or [superscalar execution].
+
+[branch prediction]: https://en.wikipedia.org/wiki/Branch_predictor
+[instruction pipelining]: https://en.wikipedia.org/wiki/Instruction_pipelining
+[superscalar execution]: https://en.wikipedia.org/wiki/Superscalar_processor
+
+SIMD instructions are also subject to these optimizations, meaning it can get pretty
+difficult to determine where the slowdown happens.
+For example, if the profiler reports a store operation is slow, one of two things
+could be happening:
+
+- the store is limited by the CPU's memory bandwidth, which is actually an ideal
+ scenario, all things considered;
+
+- memory bandwidth is nowhere near its peak, but the value to be stored is at the
+ end of a long chain of operations, and this store is where the profiler
+ encountered the pipeline stall.
+
+Since most profilers are simple tools which don't understand the subtleties of
+instruction scheduling, you will need more specialized tools to tell these two
+situations apart.
+
+## Analyzing the machine code
+
+Certain tools have knowledge of internal CPU microarchitecture, i.e. they know
+
+- how many physical [register files] a CPU actually has
+
+- what the latency / throughput of an instruction is
+
+- what [µ-ops] are generated for a set of instructions
+
+and many other architectural details.
+
+[register files]: https://en.wikipedia.org/wiki/Register_file
+[µ-ops]: https://en.wikipedia.org/wiki/Micro-operation
+
+These tools are therefore able to provide accurate information as to why some
+instructions are inefficient, and where the bottleneck is.
+
+The disadvantage is that the output of these tools requires advanced knowledge
+of the target architecture to interpret, i.e. they **cannot** explicitly point
+out the cause of an issue.
+
+## Intel's Architecture Code Analyzer (IACA)
+
+[IACA] is a free tool offered by Intel for analyzing the performance of various
+computational kernels.
+
+Being a proprietary, closed source tool, it _only_ supports Intel's µ-arches.
+
+[IACA]: https://software.intel.com/en-us/articles/intel-architecture-code-analyzer
+
+## llvm-mca
+
+<!--
+TODO: once LLVM 7 gets released, write a chapter on using llvm-mca
+with SIMD disassembly.
+-->
diff --git a/vendor/packed_simd/perf-guide/src/prof/profiling.md b/vendor/packed_simd/perf-guide/src/prof/profiling.md
new file mode 100644
index 000000000..02ba78d2f
--- /dev/null
+++ b/vendor/packed_simd/perf-guide/src/prof/profiling.md
@@ -0,0 +1,14 @@
+# Performance profiling
+
+While the rest of the book provides practical advice on how to improve the performance
+of SIMD code, this chapter is dedicated to [**performance profiling**][profiling].
+Profiling consists of recording a program's execution in order to identify program
+hotspots.
+
+**Important**: most profilers require debug information in order to accurately
+link the program hotspots back to the corresponding source code lines. Rust will
+disable debug info generation by default for optimized builds, but you can change
+that [in your `Cargo.toml`][cargo-ref].
+
+[profiling]: https://en.wikipedia.org/wiki/Profiling_(computer_programming)
+[cargo-ref]: https://doc.rust-lang.org/cargo/reference/manifest.html#the-profile-sections
diff --git a/vendor/packed_simd/perf-guide/src/target-feature/attribute.md b/vendor/packed_simd/perf-guide/src/target-feature/attribute.md
new file mode 100644
index 000000000..ee670fea5
--- /dev/null
+++ b/vendor/packed_simd/perf-guide/src/target-feature/attribute.md
@@ -0,0 +1,5 @@
+# The `target_feature` attribute
+
+<!-- TODO:
+Explain the `#[target_feature]` attribute
+-->
diff --git a/vendor/packed_simd/perf-guide/src/target-feature/features.md b/vendor/packed_simd/perf-guide/src/target-feature/features.md
new file mode 100644
index 000000000..b93030ca6
--- /dev/null
+++ b/vendor/packed_simd/perf-guide/src/target-feature/features.md
@@ -0,0 +1,13 @@
+# Enabling target features
+
+Not all processors of a certain architecture will have SIMD processing units,
+and using a SIMD instruction which is not supported will trigger undefined behavior.
+
+To allow building safe, portable programs, the Rust compiler will **not**, by default,
+generate any sort of vector instructions, unless it can statically determine
+they are supported. For example, on AMD64, SSE2 support is architecturally guaranteed.
+The `x86_64-apple-darwin` target enables up to SSSE3. To get a definitive list
+of which features are enabled by default on various platforms, refer to the
+target specifications [in the compiler's source code][targets].
+
+[targets]: https://github.com/rust-lang/rust/tree/master/src/librustc_target/spec
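+
+For illustration, here is a minimal sketch (the function name is ours) of
+checking at compile time whether a feature was statically enabled, e.g. by the
+target's defaults or via `RUSTFLAGS`:
+
+```rust
+fn which_code_path() -> &'static str {
+    // `cfg!` expands to `true` or `false` at compile time, depending on
+    // the target features the compiler was invoked with.
+    if cfg!(target_feature = "avx2") {
+        "compiled with AVX2 statically enabled"
+    } else {
+        "compiled without AVX2; run-time detection is needed to use it"
+    }
+}
+```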
diff --git a/vendor/packed_simd/perf-guide/src/target-feature/inlining.md b/vendor/packed_simd/perf-guide/src/target-feature/inlining.md
new file mode 100644
index 000000000..86705102a
--- /dev/null
+++ b/vendor/packed_simd/perf-guide/src/target-feature/inlining.md
@@ -0,0 +1,5 @@
+# Inlining
+
+<!-- TODO:
+Explain how the `#[target_feature]` attribute interacts with inlining
+-->
diff --git a/vendor/packed_simd/perf-guide/src/target-feature/practice.md b/vendor/packed_simd/perf-guide/src/target-feature/practice.md
new file mode 100644
index 000000000..5b55c61c2
--- /dev/null
+++ b/vendor/packed_simd/perf-guide/src/target-feature/practice.md
@@ -0,0 +1,31 @@
+# Target features in practice
+
+Using `RUSTFLAGS` will allow the crate being compiled, as well as all its
+transitive dependencies, to use certain target features.
+
+A technique used to avoid undefined behavior at runtime is to compile and
+ship multiple binaries, each compiled with a certain set of features.
+This might not be feasible in some cases, and can quickly get out of hand
+as more and more vector extensions are added to an architecture.
+
+Rust can be more flexible: you can build a single binary/library which automatically
+picks the best supported vector instructions depending on the host machine.
+The trick consists of monomorphizing parts of the code at build time, and then
+using run-time feature detection to select the right code path (a sketch of
+this pattern follows below).
+
+<!-- TODO
+Explain how to create efficient functions that dispatch to different
+implementations at run-time without issues (e.g. using `#[inline(always)]` for
+the impls, wrapping in `#[target_feature]`, and the wrapping those in a function
+that does run-time feature detection).
+-->
+
+**NOTE** (x86 specific): because the AVX (256-bit) registers extend the existing
+SSE (128-bit) registers, mixing SSE and AVX instructions in a program can cause
+performance issues.
+
+The solution is to compile all code, even the code written with 128-bit vectors,
+with the AVX target feature enabled. This will cause the compiler to prefix the
+generated instructions with the [VEX] prefix.
+
+[VEX]: https://en.wikipedia.org/wiki/VEX_prefix
diff --git a/vendor/packed_simd/perf-guide/src/target-feature/runtime.md b/vendor/packed_simd/perf-guide/src/target-feature/runtime.md
new file mode 100644
index 000000000..47ddcc866
--- /dev/null
+++ b/vendor/packed_simd/perf-guide/src/target-feature/runtime.md
@@ -0,0 +1,5 @@
+# Detecting host features at runtime
+
+<!-- TODO:
+Explain cost (how it works).
+-->
diff --git a/vendor/packed_simd/perf-guide/src/target-feature/rustflags.md b/vendor/packed_simd/perf-guide/src/target-feature/rustflags.md
new file mode 100644
index 000000000..f4c1d1304
--- /dev/null
+++ b/vendor/packed_simd/perf-guide/src/target-feature/rustflags.md
@@ -0,0 +1,77 @@
+# Using RUSTFLAGS
+
+One of the easiest ways to benefit from SIMD is to allow the compiler
+to generate code using certain vector instruction extensions.
+
+The environment variable `RUSTFLAGS` can be used to pass options for code
+generation to the Rust compiler. These flags will affect **all** compiled crates.
+
+There are two flags which can be used to enable specific vector extensions:
+
+## target-feature
+
+- Syntax: `-C target-feature=<features>`
+
+- Provides the compiler with a comma-separated set of instruction extensions
+ to enable.
+
+ **Example**: Use `-C target-feature=+sse3,+avx` to enable generating instructions
+ for [Streaming SIMD Extensions 3](https://en.wikipedia.org/wiki/SSE3) and
+ [Advanced Vector Extensions](https://en.wikipedia.org/wiki/Advanced_Vector_Extensions).
+
+- To list target triples for all targets supported by Rust, use:
+
+ ```sh
+ rustc --print target-list
+ ```
+
+- To list all supported target features for a certain target triple, use:
+
+ ```sh
+ rustc --target=${TRIPLE} --print target-features
+ ```
+
+- Note that all CPU features are independent, and will have to be enabled individually.
+
+ **Example**: Setting `-C target-feature=+avx2` will _not_ enable `fma`, even though
+ all CPUs which support AVX2 also support FMA. To enable both, one has to use
+ `-C target-feature=+avx2,+fma`.
+
+- Some features also depend on other features, which need to be enabled for the
+ target instructions to be generated.
+
+ **Example**: Unless `v7` is specified as the target CPU (see below), to enable
+ NEON on ARM it is necessary to use `-C target-feature=+v7,+neon`.
+
+## target-cpu
+
+- Syntax: `-C target-cpu=<cpu>`
+
+- Sets the identifier of a CPU family / model for which to build and optimize the code.
+
+ **Example**: `RUSTFLAGS='-C target-cpu=cortex-a75'`
+
+- To list all supported target CPUs for a certain target triple, use:
+
+ ```sh
+ rustc --target=${TRIPLE} --print target-cpus
+ ```
+
+ **Example**:
+
+ ```sh
+ rustc --target=i686-pc-windows-msvc --print target-cpus
+ ```
+
+- The compiler will translate this into a list of target features. Therefore,
+ individual feature checks (`#[cfg(target_feature = "...")]`) will still
+ work properly.
+
+- It will cause the code generator to optimize the generated code for that
+ specific CPU model.
+
+- Using `native` as the CPU model will cause Rust to generate and optimize code
+ for the CPU running the compiler. It is useful when building programs which you
+ plan to only use locally. This should never be used when the generated programs
+ are meant to be run on other computers, such as when packaging for distribution
+ or cross-compiling.
diff --git a/vendor/packed_simd/perf-guide/src/vert-hor-ops.md b/vendor/packed_simd/perf-guide/src/vert-hor-ops.md
new file mode 100644
index 000000000..d0dd1be12
--- /dev/null
+++ b/vendor/packed_simd/perf-guide/src/vert-hor-ops.md
@@ -0,0 +1,76 @@
+# Vertical and horizontal operations
+
+In SIMD terminology, each vector has a certain "width" (number of lanes).
+A vector processor is able to perform two kinds of operations on a vector:
+
+- Vertical operations:
+ operate on two vectors of the same width; the result has the same width
+
+**Example**: vertical addition of two `f32x4` vectors
+
+ %0 == | 2 | -3.5 | 0 | 7 |
+ + + + +
+ %1 == | 4 | 1.5 | -1 | 0 |
+ = = = =
+ %0 + %1 == | 6 | -2 | -1 | 7 |
+
+- Horizontal operations:
+ reduce the elements of two vectors in some way; the result's
+ elements combine information from the two original vectors
+
+**Example**: horizontal addition of two `i64x2` vectors
+
+ %0 == | 1 | 3 |
+ └─+───┘
+ └───────┐
+ │
+ %1 == | 4 | -1 | │
+ └─+──┘ │
+ └───┐ │
+ │ │
+ ┌─────│───┘
+ ▼ ▼
+ %0 + %1 == | 4 | 3 |
+
+## Performance consideration of horizontal operations
+
+The result of vertical operations, like vector negation: `-a`, for a given lane,
+does not depend on the result of the operation for the other lanes. The result
+of horizontal operations, like the vector `sum` reduction: `a.sum()`, depends on
+the value of all vector lanes.
+
+On virtually all architectures vertical operations are fast, while horizontal
+operations are, by comparison, very slow.
+
+Consider the following two functions for computing the sum of all `f32` values
+in a slice:
+
+```rust
+use packed_simd::f32x4;
+
+fn fast_sum(x: &[f32]) -> f32 {
+ assert!(x.len() % 4 == 0);
+ let mut sum = f32x4::splat(0.); // [0., 0., 0., 0.]
+ for i in (0..x.len()).step_by(4) {
+ sum += f32x4::from_slice_unaligned(&x[i..]);
+ }
+ sum.sum()
+}
+
+fn slow_sum(x: &[f32]) -> f32 {
+ assert!(x.len() % 4 == 0);
+ let mut sum: f32 = 0.;
+ for i in (0..x.len()).step_by(4) {
+ sum += f32x4::from_slice_unaligned(&x[i..]).sum();
+ }
+ sum
+}
+```
+
+The inner loop over the slice is where the bulk of the work actually happens.
+There, the `fast_sum` function performs vertical operations on a vector, doing
+a single horizontal reduction at the end, while the `slow_sum` function performs
+horizontal vector operations inside the loop.
+
+On all widely-used architectures, `fast_sum` is a large constant factor faster
+than `slow_sum`. You can run the [slice_sum]() example and see for yourself. On
+the particular machine tested, the algorithm using horizontal vector
+addition is 2.7x slower than the one using vertical vector operations!