path: root/vendor/packed_simd_2/perf-guide
Diffstat (limited to 'vendor/packed_simd_2/perf-guide')
-rw-r--r-- vendor/packed_simd_2/perf-guide/book.toml 12
-rw-r--r-- vendor/packed_simd_2/perf-guide/src/SUMMARY.md 21
-rw-r--r-- vendor/packed_simd_2/perf-guide/src/ascii.css 4
-rw-r--r-- vendor/packed_simd_2/perf-guide/src/bound_checks.md 22
-rw-r--r-- vendor/packed_simd_2/perf-guide/src/float-math/approx.md 8
-rw-r--r-- vendor/packed_simd_2/perf-guide/src/float-math/fma.md 6
-rw-r--r-- vendor/packed_simd_2/perf-guide/src/float-math/fp.md 3
-rw-r--r-- vendor/packed_simd_2/perf-guide/src/float-math/svml.md 7
-rw-r--r-- vendor/packed_simd_2/perf-guide/src/introduction.md 26
-rw-r--r-- vendor/packed_simd_2/perf-guide/src/prof/linux.md 107
-rw-r--r-- vendor/packed_simd_2/perf-guide/src/prof/mca.md 100
-rw-r--r-- vendor/packed_simd_2/perf-guide/src/prof/profiling.md 14
-rw-r--r-- vendor/packed_simd_2/perf-guide/src/target-feature/attribute.md 5
-rw-r--r-- vendor/packed_simd_2/perf-guide/src/target-feature/features.md 13
-rw-r--r-- vendor/packed_simd_2/perf-guide/src/target-feature/inlining.md 5
-rw-r--r-- vendor/packed_simd_2/perf-guide/src/target-feature/practice.md 31
-rw-r--r-- vendor/packed_simd_2/perf-guide/src/target-feature/runtime.md 5
-rw-r--r-- vendor/packed_simd_2/perf-guide/src/target-feature/rustflags.md 77
-rw-r--r-- vendor/packed_simd_2/perf-guide/src/vert-hor-ops.md 76
19 files changed, 0 insertions, 542 deletions
diff --git a/vendor/packed_simd_2/perf-guide/book.toml b/vendor/packed_simd_2/perf-guide/book.toml
deleted file mode 100644
index 69ba3053c..000000000
--- a/vendor/packed_simd_2/perf-guide/book.toml
+++ /dev/null
@@ -1,12 +0,0 @@
-[book]
-authors = ["Gonzalo Brito Gadeschi", "Gabriel Majeri"]
-multilingual = false
-src = "src"
-title = "Rust SIMD Performance Guide"
-description = "This book describes how to write performant SIMD code in Rust."
-
-[build]
-create-missing = false
-
-[output.html]
-additional-css = ["./src/ascii.css"]
diff --git a/vendor/packed_simd_2/perf-guide/src/SUMMARY.md b/vendor/packed_simd_2/perf-guide/src/SUMMARY.md
deleted file mode 100644
index 1e7689886..000000000
--- a/vendor/packed_simd_2/perf-guide/src/SUMMARY.md
+++ /dev/null
@@ -1,21 +0,0 @@
-# Summary
-
-[Introduction](./introduction.md)
-
-- [Floating-point Math](./float-math/fp.md)
- - [Short-vector Math Library](./float-math/svml.md)
- - [Approximate functions](./float-math/approx.md)
- - [Fused multiply-accumulate](./float-math/fma.md)
-
-- [Target features](./target-feature/features.md)
- - [Using `RUSTFLAGS`](./target-feature/rustflags.md)
- - [Using the `target_feature` attribute](./target-feature/attribute.md)
- - [Interaction with inlining](./target-feature/inlining.md)
- - [Detecting features at runtime](./target-feature/runtime.md)
-
-- [Bounds checking](./bound_checks.md)
-- [Vertical and horizontal operations](./vert-hor-ops.md)
-
-- [Performance profiling](./prof/profiling.md)
- - [Profiling on Linux](./prof/linux.md)
- - [Using machine code analyzers](./prof/mca.md)
diff --git a/vendor/packed_simd_2/perf-guide/src/ascii.css b/vendor/packed_simd_2/perf-guide/src/ascii.css
deleted file mode 100644
index 4c0265119..000000000
--- a/vendor/packed_simd_2/perf-guide/src/ascii.css
+++ /dev/null
@@ -1,4 +0,0 @@
-code {
- /* "Source Code Pro" breaks ASCII art */
- font-family: Consolas, "Ubuntu Mono", Menlo, "DejaVu Sans Mono", monospace;
-}
diff --git a/vendor/packed_simd_2/perf-guide/src/bound_checks.md b/vendor/packed_simd_2/perf-guide/src/bound_checks.md
deleted file mode 100644
index 2eeedb5ac..000000000
--- a/vendor/packed_simd_2/perf-guide/src/bound_checks.md
+++ /dev/null
@@ -1,22 +0,0 @@
-# Bounds checking
-
-Reading and writing packed vectors to/from slices is checked by default.
-Independently of the configuration options used, the safe functions:
-
-* `Simd<[T; N]>::from_slice_aligned(& s[..])`
-* `Simd<[T; N]>::write_to_slice_aligned(&mut s[..])`
-
-always check that:
-
-* the slice is big enough to hold the vector
-* the slice is suitably aligned to perform an aligned load/store for a `Simd<[T;
- N]>` (this alignment is often much larger than that of `T`).
-
-There are `_unaligned` versions that use unaligned loads and stores, as well as
-`unsafe` `_unchecked` versions that perform no checks iff `debug-assertions =
-false` / `debug = false`. That is, the `_unchecked` methods still assert size
-and alignment in debug builds, and could also do so in release builds depending
-on the configuration options.
-
-These assertions often have a significant impact on performance, and you should
-be aware of them.
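The two conditions listed in `bound_checks.md` can be illustrated without `packed_simd` itself. The following is a minimal sketch using only the standard library; `can_load_aligned`, `LANES`, `VECTOR_ALIGN`, and `Aligned` are hypothetical names, and the 16-byte alignment is an assumption for a 128-bit vector such as `f32x4`, not a value taken from the crate.

```rust
// Hypothetical helper, not part of `packed_simd`: it mirrors the two
// conditions that the checked aligned load/store functions are described
// as asserting (slice size and slice alignment).
const LANES: usize = 4; // number of lanes in an `f32x4`
const VECTOR_ALIGN: usize = 16; // assumed alignment of a 128-bit vector, in bytes

fn can_load_aligned(s: &[f32]) -> bool {
    // The slice must hold at least one full vector.
    let big_enough = s.len() >= LANES;
    // The first element must sit on a vector-aligned address, which is
    // stricter than the 4-byte alignment of `f32`.
    let aligned = (s.as_ptr() as usize) % VECTOR_ALIGN == 0;
    big_enough && aligned
}

// Wrapper that forces 16-byte alignment of the backing storage.
#[repr(align(16))]
struct Aligned([f32; 8]);
```

With a buffer such as `Aligned([0.0; 8])`, the full slice passes both checks, while a slice starting at element 1 fails the alignment check and a two-element slice fails the size check.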
diff --git a/vendor/packed_simd_2/perf-guide/src/float-math/approx.md b/vendor/packed_simd_2/perf-guide/src/float-math/approx.md
deleted file mode 100644
index 2237c67ec..000000000
--- a/vendor/packed_simd_2/perf-guide/src/float-math/approx.md
+++ /dev/null
@@ -1,8 +0,0 @@
-# Approximate functions
-
-<!-- TODO:
-
-Explain that they exist, that they are often _much_ faster, how to use them,
-that people should check whether the error is good enough for their
-applications. Explain that this error is currently unstable and might change.
--->
diff --git a/vendor/packed_simd_2/perf-guide/src/float-math/fma.md b/vendor/packed_simd_2/perf-guide/src/float-math/fma.md
deleted file mode 100644
index 357748383..000000000
--- a/vendor/packed_simd_2/perf-guide/src/float-math/fma.md
+++ /dev/null
@@ -1,6 +0,0 @@
-# Fused Multiply Add
-
-<!-- TODO:
-Explain that this is a compound operation, infinite precision, difference
-between `mul_add` and `mul_adde`, that LLVM cannot do this by itself, etc.
--->
diff --git a/vendor/packed_simd_2/perf-guide/src/float-math/fp.md b/vendor/packed_simd_2/perf-guide/src/float-math/fp.md
deleted file mode 100644
index 711fcc4fd..000000000
--- a/vendor/packed_simd_2/perf-guide/src/float-math/fp.md
+++ /dev/null
@@ -1,3 +0,0 @@
-# Floating-point math
-
-This chapter contains information pertaining to working with floating-point numbers.
diff --git a/vendor/packed_simd_2/perf-guide/src/float-math/svml.md b/vendor/packed_simd_2/perf-guide/src/float-math/svml.md
deleted file mode 100644
index 266c2531c..000000000
--- a/vendor/packed_simd_2/perf-guide/src/float-math/svml.md
+++ /dev/null
@@ -1,7 +0,0 @@
-# Short Vector Math Library
-
-<!-- TODO:
-Explain how short-vector math is performed by default (just scalarized libm calls).
-
-Explain how to enable `sleef`, etc.
--->
diff --git a/vendor/packed_simd_2/perf-guide/src/introduction.md b/vendor/packed_simd_2/perf-guide/src/introduction.md
deleted file mode 100644
index 7243e19c8..000000000
--- a/vendor/packed_simd_2/perf-guide/src/introduction.md
+++ /dev/null
@@ -1,26 +0,0 @@
-# Introduction
-
-## What is SIMD
-
-<!-- TODO:
-describe what SIMD is, which algorithms can benefit from it,
-give usage examples
--->
-
-## History of SIMD in Rust
-
-<!-- TODO:
-discuss history of unstable std::simd,
-stabilization of std::arch, etc.
--->
-
-## Discover packed_simd
-
-<!-- TODO: describe scope of this project -->
-
-Writing fast and portable SIMD algorithms using `packed_simd` is, unfortunately,
-not trivial. There are many pitfalls that one should be aware of, and some idioms
-that help avoid those pitfalls.
-
-This book attempts to document these best practices and provides practical examples
-on how to apply the tips to _your_ code.
diff --git a/vendor/packed_simd_2/perf-guide/src/prof/linux.md b/vendor/packed_simd_2/perf-guide/src/prof/linux.md
deleted file mode 100644
index 96c7d67bc..000000000
--- a/vendor/packed_simd_2/perf-guide/src/prof/linux.md
+++ /dev/null
@@ -1,107 +0,0 @@
-# Performance profiling on Linux
-
-## Using `perf`
-
-[perf](https://perf.wiki.kernel.org/) is the most powerful performance profiler
-for Linux, featuring support for various hardware Performance Monitoring Units,
-as well as integration with the kernel's performance events framework.
-
-We will only look at how the `perf` command can be used to profile SIMD code.
-Full system profiling is outside of the scope of this book.
-
-### Recording
-
-The first step is to record a program's execution during an average workload.
-It helps if you can isolate the parts of your program which have performance
-issues, and set up a benchmark which can be easily (re)run.
-
-Build the benchmark binary in release mode, after having enabled debug info:
-
-```sh
-$ cargo build --release
-Finished release [optimized + debuginfo] target(s) in 0.02s
-```
-
-Then use the `perf record` subcommand:
-
-```sh
-$ perf record --call-graph=dwarf ./target/release/my-program
-[ perf record: Woken up 10 times to write data ]
-[ perf record: Captured and wrote 2,356 MB perf.data (292 samples) ]
-```
-
-Instead of using `--call-graph=dwarf`, which can become pretty slow, you can use
-`--call-graph=lbr` if you have a processor with support for Last Branch Record
-(i.e. Intel Haswell and newer).
-
-`perf` will, by default, record the count of CPU cycles it takes to execute
-various parts of your program. You can use the `-e` command line option
-to enable other performance events, such as `cache-misses`. Use `perf list`
-to get a list of all hardware counters supported by your CPU.
-
-### Viewing the report
-
-The next step is getting a bird's eye view of the program's execution.
-`perf` provides a `ncurses`-based interface which will get you started.
-
-Use `perf report` to open a visualization of your program's performance:
-
-```sh
-perf report --hierarchy -M intel
-```
-
-`--hierarchy` will display a tree-like structure of where your program spent
-most of its time. `-M intel` enables disassembly output with Intel syntax, which
-is subjectively more readable than the default AT&T syntax.
-
-Here is the output from profiling the `nbody` benchmark:
-
-```
-- 100,00% nbody
- - 94,18% nbody
- + 93,48% [.] nbody_lib::simd::advance
- + 0,70% [.] nbody_lib::run
- + 5,06% libc-2.28.so
-```
-
-If you move with the arrow keys to any node in the tree, you can then press `a`
-to have `perf` _annotate_ that node. This means it will:
-
-- disassemble the function
-
-- associate every instruction with the percentage of time which was spent executing it
-
-- interleave the disassembly with the source code,
- assuming it found the debug symbols
- (you can use `s` to toggle this behaviour)
-
-`perf` will, by default, open the instruction which it identified as being the
-hottest spot in the function:
-
-```
-0,76 │ movapd xmm2,xmm0
-0,38 │ movhlps xmm2,xmm0
- │ addpd xmm2,xmm0
- │ unpcklpd xmm1,xmm2
-12,50 │ sqrtpd xmm0,xmm1
-1,52 │ mulpd xmm0,xmm1
-```
-
-In this case, `sqrtpd` will be highlighted in red, since that's the instruction
-which the CPU spends most of its time executing.
-
-## Using Valgrind
-
-Valgrind is a set of tools which initially helped C/C++ programmers find unsafe
-memory accesses in their code. Nowadays the project also has
-
-- a heap profiler called `massif`
-
-- a cache utilization profiler called `cachegrind`
-
-- a call-graph performance profiler called `callgrind`
-
-<!--
-TODO: explain valgrind's dynamic binary translation, warn about massive
-slowdown, talk about `kcachegrind` for a GUI
--->
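The recording step in `linux.md` suggests isolating the slow part of a program in a benchmark that is easy to rerun. A minimal sketch of such a harness follows, using only the standard library; `workload` and `run_benchmark` are hypothetical names, and the square-root loop is only a stand-in for the code you actually want to profile.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

// Hypothetical workload standing in for the code you want to profile.
fn workload(n: usize) -> f64 {
    (1..=n).map(|i| (i as f64).sqrt()).sum()
}

// Repeats the workload so that `perf record` can collect enough samples.
fn run_benchmark(iterations: usize, n: usize) -> Duration {
    let start = Instant::now();
    for _ in 0..iterations {
        // `black_box` discourages the optimizer from removing the call
        // or from precomputing its result.
        black_box(workload(black_box(n)));
    }
    start.elapsed()
}
```

Calling `run_benchmark` from `main` with a large enough iteration count gives a binary that can be passed to `perf record` as described in the chapter.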
diff --git a/vendor/packed_simd_2/perf-guide/src/prof/mca.md b/vendor/packed_simd_2/perf-guide/src/prof/mca.md
deleted file mode 100644
index 65ddf1a4e..000000000
--- a/vendor/packed_simd_2/perf-guide/src/prof/mca.md
+++ /dev/null
@@ -1,100 +0,0 @@
-# Machine code analysis tools
-
-## The microarchitecture of modern CPUs
-
-While you might have heard of Instruction Set Architectures, such as `x86` or
-`arm` or `mips`, the term _microarchitecture_ (also written here as _µ-arch_),
-refers to the internal details of an actual family of CPUs, such as Intel's
-_Haswell_ or AMD's _Jaguar_.
-
-Replacing scalar code with SIMD code will improve performance on all CPUs
-supporting the required vector extensions.
-However, due to microarchitectural differences, the actual speed-up at
-runtime might vary.
-
-**Example**: a simple example arises when optimizing for AMD K8 CPUs.
-The assembly generated for an empty function should look like this:
-
-```asm
-nop
-ret
-```
-
-The `nop` is used to align the `ret` instruction for better performance.
-However, the compiler will actually generate the following code:
-
-```asm
-repz ret
-```
-
-The `repz` prefix normally repeats the following string instruction until a
-certain condition is met. In this situation, the function simply returns
-immediately, and the `ret` instruction is still aligned.
-However, AMD K8's branch predictor performs better with the latter code.
-
-If you want to absolutely maximize performance for a certain target µ-arch,
-you will have to read some CPU manuals, or ask the compiler to do it for you
-with `-C target-cpu`.
-
-### Summary of CPU internals
-
-Modern processors are able to execute instructions out of order for better performance,
-and also rely on techniques such as [branch prediction], [instruction pipelining],
-and [superscalar execution].
-
-[branch prediction]: https://en.wikipedia.org/wiki/Branch_predictor
-[instruction pipelining]: https://en.wikipedia.org/wiki/Instruction_pipelining
-[superscalar execution]: https://en.wikipedia.org/wiki/Superscalar_processor
-
-SIMD instructions are also subject to these optimizations, meaning it can get pretty
-difficult to determine where the slowdown happens.
-For example, if the profiler reports a store operation is slow, one of two things
-could be happening:
-
-- the store is limited by the CPU's memory bandwidth, which is actually an ideal
- scenario, all things considered;
-
-- memory bandwidth is nowhere near its peak, but the value to be stored is at the
- end of a long chain of operations, and this store is where the profiler
- encountered the pipeline stall;
-
-Since most profilers are simple tools which don't understand the subtleties of
-instruction scheduling, you
-
-## Analyzing the machine code
-
-Certain tools have knowledge of internal CPU microarchitecture, i.e. they know
-
-- how many physical [register files] a CPU actually has
-
-- what the latency / throughput of an instruction is
-
-- what [µ-ops] are generated for a set of instructions
-
-and many other architectural details.
-
-[register files]: https://en.wikipedia.org/wiki/Register_file
-[µ-ops]: https://en.wikipedia.org/wiki/Micro-operation
-
-These tools are therefore able to provide accurate information as to why some
-instructions are inefficient, and where the bottleneck is.
-
-The disadvantage is that the output of these tools requires advanced knowledge
-of the target architecture to understand, i.e. they **cannot** point out what
-the cause of the issue is explicitly.
-
-## Intel's Architecture Code Analyzer (IACA)
-
-[IACA] is a free tool offered by Intel for analyzing the performance of various
-computational kernels.
-
-Being a proprietary, closed source tool, it _only_ supports Intel's µ-arches.
-
-[IACA]: https://software.intel.com/en-us/articles/intel-architecture-code-analyzer
-
-## llvm-mca
-
-<!--
-TODO: once LLVM 7 gets released, write a chapter on using llvm-mca
-with SIMD disassembly.
--->
diff --git a/vendor/packed_simd_2/perf-guide/src/prof/profiling.md b/vendor/packed_simd_2/perf-guide/src/prof/profiling.md
deleted file mode 100644
index 02ba78d2f..000000000
--- a/vendor/packed_simd_2/perf-guide/src/prof/profiling.md
+++ /dev/null
@@ -1,14 +0,0 @@
-# Performance profiling
-
-While the rest of the book provides practical advice on how to improve the performance
-of SIMD code, this chapter is dedicated to [**performance profiling**][profiling].
-Profiling consists of recording a program's execution in order to identify program
-hotspots.
-
-**Important**: most profilers require debug information in order to accurately
-link the program hotspots back to the corresponding source code lines. Rust will
-disable debug info generation by default for optimized builds, but you can change
-that [in your `Cargo.toml`][cargo-ref].
-
-[profiling]: https://en.wikipedia.org/wiki/Profiling_(computer_programming)
-[cargo-ref]: https://doc.rust-lang.org/cargo/reference/manifest.html#the-profile-sections
diff --git a/vendor/packed_simd_2/perf-guide/src/target-feature/attribute.md b/vendor/packed_simd_2/perf-guide/src/target-feature/attribute.md
deleted file mode 100644
index ee670fea5..000000000
--- a/vendor/packed_simd_2/perf-guide/src/target-feature/attribute.md
+++ /dev/null
@@ -1,5 +0,0 @@
-# The `target_feature` attribute
-
-<!-- TODO:
-Explain the `#[target_feature]` attribute
--->
diff --git a/vendor/packed_simd_2/perf-guide/src/target-feature/features.md b/vendor/packed_simd_2/perf-guide/src/target-feature/features.md
deleted file mode 100644
index b93030ca6..000000000
--- a/vendor/packed_simd_2/perf-guide/src/target-feature/features.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# Enabling target features
-
-Not all processors of a certain architecture will have SIMD processing units,
-and using a SIMD instruction which is not supported will trigger undefined behavior.
-
-To allow building safe, portable programs, the Rust compiler will **not**, by default,
-generate any sort of vector instructions, unless it can statically determine
-they are supported. For example, on AMD64, SSE2 support is architecturally guaranteed.
-The `x86_64-apple-darwin` target enables up to SSSE3. To get a definitive list of
-which features are enabled by default on various platforms, refer to the target
-specifications [in the compiler's source code][targets].
-
-[targets]: https://github.com/rust-lang/rust/tree/master/src/librustc_target/spec
diff --git a/vendor/packed_simd_2/perf-guide/src/target-feature/inlining.md b/vendor/packed_simd_2/perf-guide/src/target-feature/inlining.md
deleted file mode 100644
index 86705102a..000000000
--- a/vendor/packed_simd_2/perf-guide/src/target-feature/inlining.md
+++ /dev/null
@@ -1,5 +0,0 @@
-# Inlining
-
-<!-- TODO:
-Explain how the `#[target_feature]` attribute interacts with inlining
--->
diff --git a/vendor/packed_simd_2/perf-guide/src/target-feature/practice.md b/vendor/packed_simd_2/perf-guide/src/target-feature/practice.md
deleted file mode 100644
index 5b55c61c2..000000000
--- a/vendor/packed_simd_2/perf-guide/src/target-feature/practice.md
+++ /dev/null
@@ -1,31 +0,0 @@
-# Target features in practice
-
-Using `RUSTFLAGS` will allow the crate being compiled, as well as all its
-transitive dependencies to use certain target features.
-
-A technique used to avoid undefined behavior at runtime is to compile and
-ship multiple binaries, each compiled with a certain set of features.
-This might not be feasible in some cases, and can quickly get out of hand
-as more and more vector extensions are added to an architecture.
-
-Rust can be more flexible: you can build a single binary/library which automatically
-picks the best supported vector instructions depending on the host machine.
-The trick consists of monomorphizing parts of the code during building, and then
-using run-time feature detection to select the right code path when running.
-
-<!-- TODO
-Explain how to create efficient functions that dispatch to different
-implementations at run-time without issues (e.g. using `#[inline(always)]` for
-the impls, wrapping in `#[target_feature]`, and then wrapping those in a function
-that does run-time feature detection).
--->
-
-**NOTE** (x86 specific): because the AVX (256-bit) registers extend the existing
-SSE (128-bit) registers, mixing SSE and AVX instructions in a program can cause
-performance issues.
-
-The solution is to compile all code, even the code written with 128-bit vectors,
-with the AVX target feature enabled. This will cause the compiler to prefix the
-generated instructions with the [VEX] prefix.
-
-[VEX]: https://en.wikipedia.org/wiki/VEX_prefix
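The dispatch pattern sketched in the TODO of `practice.md` (an `#[inline(always)]` implementation, wrapped in a `#[target_feature]` function, wrapped again in a function that does run-time detection) could look roughly like the following. This is a sketch under assumptions, not the guide's own code: `sum`, `sum_avx2`, and `sum_generic` are hypothetical names, and the body is plain scalar Rust that the compiler may or may not vectorize.

```rust
// Generic implementation. `#[inline(always)]` lets it be inlined into the
// feature-specific wrapper below and be compiled with that wrapper's features.
#[inline(always)]
fn sum_generic(xs: &[f32]) -> f32 {
    xs.iter().sum()
}

// Monomorphized copy of the implementation, compiled with AVX2 enabled.
// Calling it on a CPU without AVX2 is undefined behavior, hence `unsafe`.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "avx2")]
unsafe fn sum_avx2(xs: &[f32]) -> f32 {
    sum_generic(xs)
}

// Safe entry point: picks a code path at run time.
pub fn sum(xs: &[f32]) -> f32 {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx2") {
            // Sound because the feature was just detected on the host CPU.
            return unsafe { sum_avx2(xs) };
        }
    }
    // Fallback for other architectures and for x86 CPUs without AVX2.
    sum_generic(xs)
}
```

Callers only ever see the safe `sum` function; the detection is done once per call here, so for very small inputs it may be worth hoisting the check out of hot loops.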
diff --git a/vendor/packed_simd_2/perf-guide/src/target-feature/runtime.md b/vendor/packed_simd_2/perf-guide/src/target-feature/runtime.md
deleted file mode 100644
index 47ddcc866..000000000
--- a/vendor/packed_simd_2/perf-guide/src/target-feature/runtime.md
+++ /dev/null
@@ -1,5 +0,0 @@
-# Detecting host features at runtime
-
-<!-- TODO:
-Explain cost (how it works).
--->
diff --git a/vendor/packed_simd_2/perf-guide/src/target-feature/rustflags.md b/vendor/packed_simd_2/perf-guide/src/target-feature/rustflags.md
deleted file mode 100644
index f4c1d1304..000000000
--- a/vendor/packed_simd_2/perf-guide/src/target-feature/rustflags.md
+++ /dev/null
@@ -1,77 +0,0 @@
-# Using RUSTFLAGS
-
-One of the easiest ways to benefit from SIMD is to allow the compiler
-to generate code using certain vector instruction extensions.
-
-The environment variable `RUSTFLAGS` can be used to pass options for code
-generation to the Rust compiler. These flags will affect **all** compiled crates.
-
-There are two flags which can be used to enable specific vector extensions:
-
-## target-feature
-
-- Syntax: `-C target-feature=<features>`
-
-- Provides the compiler with a comma-separated set of instruction extensions
- to enable.
-
- **Example**: Use `-C target-feature=+sse3,+avx` to enable generating instructions
- for [Streaming SIMD Extensions 3](https://en.wikipedia.org/wiki/SSE3) and
- [Advanced Vector Extensions](https://en.wikipedia.org/wiki/Advanced_Vector_Extensions).
-
-- To list target triples for all targets supported by Rust, use:
-
- ```sh
- rustc --print target-list
- ```
-
-- To list all supported target features for a certain target triple, use:
-
- ```sh
- rustc --target=${TRIPLE} --print target-features
- ```
-
-- Note that all CPU features are independent, and will have to be enabled individually.
-
- **Example**: Setting `-C target-feature=+avx2` will _not_ enable `fma`, even though
- all CPUs which support AVX2 also support FMA. To enable both, one has to use
- `-C target-feature=+avx2,+fma`.
-
-- Some features also depend on other features, which need to be enabled for the
- target instructions to be generated.
-
- **Example**: Unless `v7` is specified as the target CPU (see below), to enable
- NEON on ARM it is necessary to use `-C target-feature=+v7,+neon`.
-
-## target-cpu
-
-- Syntax: `-C target-cpu=<cpu>`
-
-- Sets the identifier of a CPU family / model for which to build and optimize the code.
-
- **Example**: `RUSTFLAGS='-C target-cpu=cortex-a75'`
-
-- To list all supported target CPUs for a certain target triple, use:
-
- ```sh
- rustc --target=${TRIPLE} --print target-cpus
- ```
-
- **Example**:
-
- ```sh
- rustc --target=i686-pc-windows-msvc --print target-cpus
- ```
-
-- The compiler will translate this into a list of target features. Therefore,
- individual feature checks (`#[cfg(target_feature = "...")]`) will still
- work properly.
-
-- It will cause the code generator to optimize the generated code for that
- specific CPU model.
-
-- Using `native` as the CPU model will cause Rust to generate and optimize code
- for the CPU running the compiler. It is useful when building programs which you
- plan to only use locally. This should never be used when the generated programs
- are meant to be run on other computers, such as when packaging for distribution
- or cross-compiling.
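`rustflags.md` notes that both `-C target-feature` and `-C target-cpu` end up as a list of target features that `#[cfg(target_feature = "...")]` can observe. A small sketch of such a compile-time check follows; `enabled_at_compile_time` is a hypothetical helper, and the three feature names are just examples for x86.

```rust
/// Returns which of a few example x86 features were enabled when this
/// code was compiled (for instance through `RUSTFLAGS`).
fn enabled_at_compile_time() -> Vec<&'static str> {
    let mut features = Vec::new();
    // `cfg!` is evaluated by the compiler, not at run time.
    if cfg!(target_feature = "sse2") {
        features.push("sse2");
    }
    if cfg!(target_feature = "avx2") {
        features.push("avx2");
    }
    if cfg!(target_feature = "fma") {
        features.push("fma");
    }
    features
}
```

Printing the result from a binary built with and without `RUSTFLAGS='-C target-feature=+avx2,+fma'` is one way to confirm that the flags reached the compiler.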
diff --git a/vendor/packed_simd_2/perf-guide/src/vert-hor-ops.md b/vendor/packed_simd_2/perf-guide/src/vert-hor-ops.md
deleted file mode 100644
index d0dd1be12..000000000
--- a/vendor/packed_simd_2/perf-guide/src/vert-hor-ops.md
+++ /dev/null
@@ -1,76 +0,0 @@
-# Vertical and horizontal operations
-
-In SIMD terminology, each vector has a certain "width" (number of lanes).
-A vector processor is able to perform two kinds of operations on a vector:
-
-- Vertical operations:
- operate on two vectors of the same width, result has same width
-
-**Example**: vertical addition of two `f32x4` vectors
-
- %0 == | 2 | -3.5 | 0 | 7 |
- + + + +
- %1 == | 4 | 1.5 | -1 | 0 |
- = = = =
- %0 + %1 == | 6 | -2 | -1 | 7 |
-
-- Horizontal operations:
- reduce the elements of two vectors in some way,
- the result's elements combine information from the two original ones
-
-**Example**: horizontal addition of two `i64x2` vectors
-
- %0 == | 1 | 3 |
- └─+───┘
- └───────┐
- │
- %1 == | 4 | -1 | │
- └─+──┘ │
- └───┐ │
- │ │
- ┌─────│───┘
- ▼ ▼
- %0 + %1 == | 4 | 3 |
-
-## Performance consideration of horizontal operations
-
-The result of vertical operations, like vector negation: `-a`, for a given lane,
-does not depend on the result of the operation for the other lanes. The result
-of horizontal operations, like the vector `sum` reduction: `a.sum()`, depends on
-the value of all vector lanes.
-
-In virtually all architectures vertical operations are fast, while horizontal
-operations are, by comparison, very slow.
-
-Consider the following two functions for computing the sum of all `f32` values
-in a slice:
-
-```rust
-fn fast_sum(x: &[f32]) -> f32 {
- assert!(x.len() % 4 == 0);
- let mut sum = f32x4::splat(0.); // [0., 0., 0., 0.]
- for i in (0..x.len()).step_by(4) {
- sum += f32x4::from_slice_unaligned(&x[i..]);
- }
- sum.sum()
-}
-
-fn slow_sum(x: &[f32]) -> f32 {
- assert!(x.len() % 4 == 0);
- let mut sum: f32 = 0.;
- for i in (0..x.len()).step_by(4) {
- sum += f32x4::from_slice_unaligned(&x[i..]).sum();
- }
- sum
-}
-```
-
-The inner loop over the slice is where the bulk of the work actually happens.
-There, the `fast_sum` function performs vertical operations into a vector, doing
-a single horizontal reduction at the end, while the `slow_sum` function performs
-horizontal vector operations inside of the loop.
-
-On all widely-used architectures, `fast_sum` is a large constant factor faster
-than `slow_sum`. You can run the [slice_sum]() example and see for yourself. On
-the particular machine tested there the algorithm using the horizontal vector
-addition is 2.7x slower than the one using vertical vector operations!
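The `fast_sum` idea from `vert-hor-ops.md` (accumulate vertically, reduce horizontally once) can also be written without `packed_simd`. The sketch below emulates the four lanes with a plain `[f32; 4]` array; `lane_sum` is a hypothetical name, whether the compiler turns the inner loop into vector instructions is not guaranteed, and unlike `fast_sum` it handles lengths that are not a multiple of 4.

```rust
fn lane_sum(x: &[f32]) -> f32 {
    // One partial sum per emulated lane.
    let mut acc = [0.0f32; 4];
    let chunks = x.chunks_exact(4);
    // Elements left over when the length is not a multiple of 4.
    let tail = chunks.remainder();
    for chunk in chunks {
        // Vertical step: lane `i` only ever touches lane `i`.
        for lane in 0..4 {
            acc[lane] += chunk[lane];
        }
    }
    // Single horizontal reduction after the loop.
    let lanes_total: f32 = acc.iter().sum();
    lanes_total + tail.iter().sum::<f32>()
}
```

Because the lanes are summed independently, the floating-point additions happen in a different order than in a simple left-to-right loop, so results can differ slightly for inputs that are not exactly representable.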