1 files changed, 76 insertions, 0 deletions
diff --git a/vendor/packed_simd/perf-guide/src/vert-hor-ops.md b/vendor/packed_simd/perf-guide/src/vert-hor-ops.md
new file mode 100644
index 000000000..d0dd1be12
--- /dev/null
+++ b/vendor/packed_simd/perf-guide/src/vert-hor-ops.md
@@ -0,0 +1,76 @@
+# Vertical and horizontal operations
+
+In SIMD terminology, each vector has a certain "width" (number of lanes).
+A vector processor is able to perform two kinds of operations on a vector:
+
+- Vertical operations:
+  operate on two vectors of the same width, result has same width
+
+**Example**: vertical addition of two `f32x4` vectors
+
+      %0     == | 2 | -3.5 |  0 | 7 |
+                  +     +     +   +
+      %1     == | 4 |  1.5 | -1 | 0 |
+                  =     =     =   =
+    %0 + %1  == | 6 |  -2  | -1 | 7 |
+
+- Horizontal operations:
+  reduce the elements of one or more vectors in some way,
+  each element of the result combines information from several lanes of an input
+
+**Example**: horizontal addition of two `i64x2` vectors
+
+      %0     == | 1 |  3 |
+                  └─+───┘
+                    └───────┐
+                            │
+      %1     == | 4 | -1 |  │
+                  └─+──┘    │
+                    └───┐   │
+                        │   │
+                  ┌─────│───┘
+                  ▼     ▼
+    %0 + %1  == | 4 |   3 |
+
+## Performance consideration of horizontal operations
+
+For a vertical operation, like vector negation `-a`, the result in a given lane
+does not depend on the result of the operation in the other lanes. The result
+of a horizontal operation, like the vector `sum` reduction `a.sum()`, depends on
+the values of all vector lanes.
+
+On virtually all architectures, vertical operations are fast, while horizontal
+operations are, by comparison, very slow.
+
+Consider the following two functions for computing the sum of all `f32` values
+in a slice:
+
+```rust
+fn fast_sum(x: &[f32]) -> f32 {
+    assert!(x.len() % 4 == 0);
+    let mut sum = f32x4::splat(0.); // [0., 0., 0., 0.]
+    for i in (0..x.len()).step_by(4) {
+        sum += f32x4::from_slice_unaligned(&x[i..]); // vertical: lane-wise add
+    }
+    sum.sum() // one horizontal reduction, after the loop
+}
+
+fn slow_sum(x: &[f32]) -> f32 {
+    assert!(x.len() % 4 == 0);
+    let mut sum: f32 = 0.;
+    for i in (0..x.len()).step_by(4) {
+        sum += f32x4::from_slice_unaligned(&x[i..]).sum(); // horizontal reduction on every iteration
+    }
+    sum
+}
+```
+
+The inner loop over the slice is where the bulk of the work actually happens.
+There, the `fast_sum` function performs vertical operations into a vector, doing
+a single horizontal reduction at the end, while the `slow_sum` function performs
+horizontal vector operations inside of the loop.
+
+On all widely-used architectures, `fast_sum` is a large constant factor faster
+than `slow_sum`. You can run the [slice_sum]() example and see for yourself. On
+the particular machine tested there, the algorithm using the horizontal vector
+addition is 2.7x slower than the one using vertical vector operations!