summaryrefslogtreecommitdiffstats
path: root/vendor/packed_simd/perf-guide/src/vert-hor-ops.md
diff options
context:
space:
mode:
Diffstat (limited to 'vendor/packed_simd/perf-guide/src/vert-hor-ops.md')
-rw-r--r--vendor/packed_simd/perf-guide/src/vert-hor-ops.md76
1 files changed, 76 insertions, 0 deletions
diff --git a/vendor/packed_simd/perf-guide/src/vert-hor-ops.md b/vendor/packed_simd/perf-guide/src/vert-hor-ops.md
new file mode 100644
index 000000000..d0dd1be12
--- /dev/null
+++ b/vendor/packed_simd/perf-guide/src/vert-hor-ops.md
@@ -0,0 +1,76 @@
+# Vertical and horizontal operations
+
+In SIMD terminology, each vector has a certain "width" (number of lanes).
+A vector processor is able to perform two kinds of operations on a vector:
+
+- Vertical operations:
+ operate on two vectors of the same width, result has same width
+
+**Example**: vertical addition of two `f32x4` vectors
+
+ %0 == | 2 | -3.5 | 0 | 7 |
+ + + + +
+ %1 == | 4 | 1.5 | -1 | 0 |
+ = = = =
+ %0 + %1 == | 6 | -2 | -1 | 7 |
+
+- Horizontal operations:
+ reduce the elements of two vectors in some way,
+ the result's elements combine information from the two original ones
+
+**Example**: horizontal addition of two `u64x2` vectors
+
+ %0 == | 1 | 3 |
+ └─+───┘
+ └───────┐
+ │
+ %1 == | 4 | -1 | │
+ └─+──┘ │
+ └───┐ │
+ │ │
+ ┌─────│───┘
+ ▼ ▼
+ %0 + %1 == | 4 | 3 |
+
+## Performance consideration of horizontal operations
+
+The result of vertical operations, like vector negation: `-a`, for a given lane,
+does not depend on the result of the operation for the other lanes. The result
+of horizontal operations, like the vector `sum` reduction: `a.sum()`, depends on
+the value of all vector lanes.
+
+In virtually all architectures vertical operations are fast, while horizontal
+operations are, by comparison, very slow.
+
+Consider the following two functions for computing the sum of all `f32` values
+in a slice:
+
+```rust
+fn fast_sum(x: &[f32]) -> f32 {
+ assert!(x.len() % 4 == 0);
+ let mut sum = f32x4::splat(0.); // [0., 0., 0., 0.]
+ for i in (0..x.len()).step_by(4) {
+ sum += f32x4::from_slice_unaligned(&x[i..]);
+ }
+ sum.sum()
+}
+
+fn slow_sum(x: &[f32]) -> f32 {
+ assert!(x.len() % 4 == 0);
+ let mut sum: f32 = 0.;
+ for i in (0..x.len()).step_by(4) {
+ sum += f32x4::from_slice_unaligned(&x[i..]).sum();
+ }
+ sum
+}
+```
+
+The inner loop over the slice is where the bulk of the work actually happens.
+There, the `fast_sum` function perform vertical operations into a vector, doing
+a single horizontal reduction at the end, while the `slow_sum` function performs
+horizontal vector operations inside of the loop.
+
+On all widely-used architectures, `fast_sum` is a large constant factor faster
+than `slow_sum`. You can run the [slice_sum]() example and see for yourself. On
+the particular machine tested there the algorithm using the horizontal vector
+addition is 2.7x slower than the one using vertical vector operations!