# Vertical and horizontal operations

In SIMD terminology, each vector has a certain "width" (number of lanes).
A vector processor can perform two kinds of operations on vectors:

- Vertical operations:
  operate lane-wise on one or two vectors of the same width; the result
  has the same width

**Example**: vertical addition of two `f32x4` vectors

      %0     == | 2 | -3.5 |  0 | 7 |
                  +     +     +   +
      %1     == | 4 |  1.5 | -1 | 0 |
                  =     =     =   =
    %0 + %1  == | 6 |  -2  | -1 | 7 |

- Horizontal operations:
  reduce the elements of one or two vectors in some way; each element of
  the result combines information from multiple lanes of the original
  vectors

**Example**: horizontal addition of two `i64x2` vectors

          %0     == | 1 |  3 |
                      └─+───┘
                        └───────┐
                                │
          %1     == | 4 | -1 |  │
                      └─+──┘    │
                        └───┐   │
                            │   │
                      ┌─────│───┘
                      ▼     ▼
    hadd(%0, %1) == | 4 |   3 |
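
For concreteness, here is a minimal sketch of both kinds of operations using
`packed_simd`. The lane values mirror the two diagrams above; since the rest
of this chapter uses full-lane reductions rather than the pairwise `hadd`
pictured, the horizontal operation shown is `wrapping_sum`:

```rust
use packed_simd::{f32x4, i64x2};

fn main() {
    // Vertical addition: each result lane depends only on the
    // corresponding lanes of the inputs (first diagram above).
    let a = f32x4::new(2., -3.5, 0., 7.);
    let b = f32x4::new(4., 1.5, -1., 0.);
    assert_eq!(a + b, f32x4::new(6., -2., -1., 7.));

    // Horizontal addition: the reduction combines all lanes of a
    // single vector into one scalar (second diagram above).
    let c = i64x2::new(1, 3);
    let d = i64x2::new(4, -1);
    assert_eq!(c.wrapping_sum(), 4);
    assert_eq!(d.wrapping_sum(), 3);
}
```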

## Performance considerations of horizontal operations

The result of a vertical operation, like vector negation `-a`, for a given
lane does not depend on the values of the other lanes. The result of a
horizontal operation, like the vector `sum` reduction `a.sum()`, depends on
the values of all vector lanes.

On virtually all architectures vertical operations are fast, because the
lanes can be processed in parallel, while horizontal operations are, by
comparison, very slow: they require communication between lanes, typically
via a sequence of shuffles.

Consider the following two functions for computing the sum of all `f32` values
in a slice:

```rust
use packed_simd::f32x4;

fn fast_sum(x: &[f32]) -> f32 {
    assert!(x.len() % 4 == 0);
    let mut sum = f32x4::splat(0.); // [0., 0., 0., 0.]
    for i in (0..x.len()).step_by(4) {
        // Vertical addition: four independent partial sums.
        sum += f32x4::from_slice_unaligned(&x[i..]);
    }
    sum.sum() // single horizontal reduction at the very end
}

fn slow_sum(x: &[f32]) -> f32 {
    assert!(x.len() % 4 == 0);
    let mut sum: f32 = 0.;
    for i in (0..x.len()).step_by(4) {
        // Horizontal reduction on every iteration of the loop.
        sum += f32x4::from_slice_unaligned(&x[i..]).sum();
    }
    sum
}
```

The inner loop over the slice is where the bulk of the work actually happens.
There, the `fast_sum` function performs vertical operations, accumulating into
a vector and doing a single horizontal reduction at the end, while the
`slow_sum` function performs a horizontal vector operation on every iteration
of the loop.

On all widely-used architectures, `fast_sum` is faster than `slow_sum` by a
large constant factor. You can run the [slice_sum]() example and see for
yourself. On the particular machine it was tested on, the algorithm using
horizontal vector addition is 2.7x slower than the one using vertical vector
operations!