Diffstat (limited to 'third_party/rust/packed_simd/perf-guide/src/vert-hor-ops.md')
-rw-r--r-- | third_party/rust/packed_simd/perf-guide/src/vert-hor-ops.md | 76 |
1 files changed, 0 insertions, 76 deletions
diff --git a/third_party/rust/packed_simd/perf-guide/src/vert-hor-ops.md b/third_party/rust/packed_simd/perf-guide/src/vert-hor-ops.md
deleted file mode 100644
index d0dd1be12a..0000000000
--- a/third_party/rust/packed_simd/perf-guide/src/vert-hor-ops.md
+++ /dev/null
@@ -1,76 +0,0 @@
-# Vertical and horizontal operations
-
-In SIMD terminology, each vector has a certain "width" (number of lanes).
-A vector processor is able to perform two kinds of operations on a vector:
-
-- Vertical operations:
-  operate on two vectors of the same width, result has same width
-
-**Example**: vertical addition of two `f32x4` vectors
-
-      %0     == | 2 | -3.5 |  0 | 7 |
-                  +     +     +   +
-      %1     == | 4 |  1.5 | -1 | 0 |
-                  =     =     =   =
-    %0 + %1  == | 6 |  -2  | -1 | 7 |
-
-- Horizontal operations:
-  reduce the elements of two vectors in some way,
-  the result's elements combine information from the two original ones
-
-**Example**: horizontal addition of two `u64x2` vectors
-
-      %0     == | 1 |  3 |
-                  └─+───┘
-                    └───────┐
-                            │
-      %1     == | 4 | -1 |  │
-                  └─+──┘    │
-                    └───┐   │
-                        │   │
-                  ┌─────│───┘
-                  ▼     ▼
-    %0 + %1  == | 4 |   3  |
-
-## Performance consideration of horizontal operations
-
-The result of vertical operations, like vector negation: `-a`, for a given lane,
-does not depend on the result of the operation for the other lanes. The result
-of horizontal operations, like the vector `sum` reduction: `a.sum()`, depends on
-the value of all vector lanes.
-
-In virtually all architectures vertical operations are fast, while horizontal
-operations are, by comparison, very slow.
-
-Consider the following two functions for computing the sum of all `f32` values
-in a slice:
-
-```rust
-fn fast_sum(x: &[f32]) -> f32 {
-    assert!(x.len() % 4 == 0);
-    let mut sum = f32x4::splat(0.); // [0., 0., 0., 0.]
-    for i in (0..x.len()).step_by(4) {
-        sum += f32x4::from_slice_unaligned(&x[i..]);
-    }
-    sum.sum()
-}
-
-fn slow_sum(x: &[f32]) -> f32 {
-    assert!(x.len() % 4 == 0);
-    let mut sum: f32 = 0.;
-    for i in (0..x.len()).step_by(4) {
-        sum += f32x4::from_slice_unaligned(&x[i..]).sum();
-    }
-    sum
-}
-```
-
-The inner loop over the slice is where the bulk of the work actually happens.
-There, the `fast_sum` function perform vertical operations into a vector, doing
-a single horizontal reduction at the end, while the `slow_sum` function performs
-horizontal vector operations inside of the loop.
-
-On all widely-used architectures, `fast_sum` is a large constant factor faster
-than `slow_sum`. You can run the [slice_sum]() example and see for yourself. On
-the particular machine tested there the algorithm using the horizontal vector
-addition is 2.7x slower than the one using vertical vector operations!
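
Note on the removed guide: its `fast_sum` / `slow_sum` pair depends on `packed_simd`'s `f32x4`, so it does not build on its own. The sketch below is not taken from the deleted file. It models a 4-lane vector as a plain `[f32; 4]` so the same two loop shapes can be read and run on stable Rust. The names `Lanes`, `vertical_add`, `horizontal_sum`, `lane_wise_sum` and `per_chunk_sum` are hypothetical, and the code only illustrates the data dependencies; it makes no claim about what the compiler vectorizes or about the 2.7x figure quoted in the guide.

```rust
/// Hypothetical stand-in for a 4-lane `f32` vector (NOT the `packed_simd` type).
type Lanes = [f32; 4];

/// Vertical operation: lane `i` of the result depends only on lane `i` of the inputs.
fn vertical_add(a: Lanes, b: Lanes) -> Lanes {
    [a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3]]
}

/// Horizontal operation: the single result depends on every lane of the input.
fn horizontal_sum(a: Lanes) -> f32 {
    (a[0] + a[1]) + (a[2] + a[3])
}

/// Same shape as the guide's `fast_sum`: accumulate lane-wise,
/// then perform one horizontal reduction after the loop.
fn lane_wise_sum(x: &[f32]) -> f32 {
    assert!(x.len() % 4 == 0);
    let mut acc: Lanes = [0.0; 4];
    for chunk in x.chunks_exact(4) {
        acc = vertical_add(acc, [chunk[0], chunk[1], chunk[2], chunk[3]]);
    }
    horizontal_sum(acc)
}

/// Same shape as the guide's `slow_sum`: one horizontal reduction
/// per chunk, inside the loop.
fn per_chunk_sum(x: &[f32]) -> f32 {
    assert!(x.len() % 4 == 0);
    let mut sum = 0.0_f32;
    for chunk in x.chunks_exact(4) {
        sum += horizontal_sum([chunk[0], chunk[1], chunk[2], chunk[3]]);
    }
    sum
}
```

Both functions add the same values, but in a different order, so for general floating-point input their results may differ in the last bits. Calling `vertical_add` on the two `f32x4` operands from the guide's first diagram reproduces its result row.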