summaryrefslogtreecommitdiffstats
path: root/third_party/rust/packed_simd/perf-guide/src/vert-hor-ops.md
diff options
context:
space:
mode:
Diffstat (limited to 'third_party/rust/packed_simd/perf-guide/src/vert-hor-ops.md')
-rw-r--r--third_party/rust/packed_simd/perf-guide/src/vert-hor-ops.md76
1 files changed, 0 insertions, 76 deletions
diff --git a/third_party/rust/packed_simd/perf-guide/src/vert-hor-ops.md b/third_party/rust/packed_simd/perf-guide/src/vert-hor-ops.md
deleted file mode 100644
index d0dd1be12a..0000000000
--- a/third_party/rust/packed_simd/perf-guide/src/vert-hor-ops.md
+++ /dev/null
@@ -1,76 +0,0 @@
-# Vertical and horizontal operations
-
-In SIMD terminology, each vector has a certain "width" (number of lanes).
-A vector processor is able to perform two kinds of operations on a vector:
-
-- Vertical operations:
- operate on two vectors of the same width, result has same width
-
-**Example**: vertical addition of two `f32x4` vectors
-
- %0 == | 2 | -3.5 | 0 | 7 |
- + + + +
- %1 == | 4 | 1.5 | -1 | 0 |
- = = = =
- %0 + %1 == | 6 | -2 | -1 | 7 |
-
-- Horizontal operations:
- reduce the elements of two vectors in some way,
- the result's elements combine information from the two original ones
-
-**Example**: horizontal addition of two `u64x2` vectors
-
- %0 == | 1 | 3 |
- └─+───┘
- └───────┐
- │
- %1 == | 4 | -1 | │
- └─+──┘ │
- └───┐ │
- │ │
- ┌─────│───┘
- ▼ ▼
- %0 + %1 == | 4 | 3 |
-
-## Performance consideration of horizontal operations
-
-The result of vertical operations, like vector negation: `-a`, for a given lane,
-does not depend on the result of the operation for the other lanes. The result
-of horizontal operations, like the vector `sum` reduction: `a.sum()`, depends on
-the value of all vector lanes.
-
-In virtually all architectures vertical operations are fast, while horizontal
-operations are, by comparison, very slow.
-
-Consider the following two functions for computing the sum of all `f32` values
-in a slice:
-
-```rust
-fn fast_sum(x: &[f32]) -> f32 {
- assert!(x.len() % 4 == 0);
- let mut sum = f32x4::splat(0.); // [0., 0., 0., 0.]
- for i in (0..x.len()).step_by(4) {
- sum += f32x4::from_slice_unaligned(&x[i..]);
- }
- sum.sum()
-}
-
-fn slow_sum(x: &[f32]) -> f32 {
- assert!(x.len() % 4 == 0);
- let mut sum: f32 = 0.;
- for i in (0..x.len()).step_by(4) {
- sum += f32x4::from_slice_unaligned(&x[i..]).sum();
- }
- sum
-}
-```
-
-The inner loop over the slice is where the bulk of the work actually happens.
-There, the `fast_sum` function perform vertical operations into a vector, doing
-a single horizontal reduction at the end, while the `slow_sum` function performs
-horizontal vector operations inside of the loop.
-
-On all widely-used architectures, `fast_sum` is a large constant factor faster
-than `slow_sum`. You can run the [slice_sum]() example and see for yourself. On
-the particular machine tested there the algorithm using the horizontal vector
-addition is 2.7x slower than the one using vertical vector operations!