1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
|
# Efficient and performance-portable vector software
[//]: # (placeholder, do not remove)
Highway is a C++ library that provides portable SIMD/vector intrinsics.
## Why
We are passionate about high-performance software. We see major untapped
potential in CPUs (servers, mobile, desktops). Highway is for engineers who want
to reliably and economically push the boundaries of what is possible in
software.
## How
CPUs provide SIMD/vector instructions that apply the same operation to multiple
data items. This can reduce energy usage e.g. *fivefold* because fewer
instructions are executed. We also often see *5-10x* speedups.
Highway makes SIMD/vector programming practical and workable according to these
guiding principles:
**Does what you expect**: Highway is a C++ library with carefully-chosen
functions that map well to CPU instructions without extensive compiler
transformations. The resulting code is more predictable and robust to code
changes/compiler updates than autovectorization.
**Works on widely-used platforms**: Highway supports four architectures; the
same application code can target eight instruction sets, including those with
'scalable' vectors (size unknown at compile time). Highway only requires C++11
and supports four families of compilers. If you would like to use Highway on
other platforms, please raise an issue.
**Flexible to deploy**: Applications using Highway can run on heterogeneous
clouds or client devices, choosing the best available instruction set at
runtime. Alternatively, developers may choose to target a single instruction set
without any runtime overhead. In both cases, the application code is the same
except for swapping `HWY_STATIC_DISPATCH` with `HWY_DYNAMIC_DISPATCH` plus one
line of code.
**Suitable for a variety of domains**: Highway provides an extensive set of
operations, used for image processing (floating-point), compression, video
analysis, linear algebra, cryptography, sorting and random generation. We
recognise that new use-cases may require additional ops and are happy to add
them where it makes sense (e.g. no performance cliffs on some architectures). If
you would like to discuss, please file an issue.
**Rewards data-parallel design**: Highway provides tools such as Gather,
MaskedLoad, and FixedTag to enable speedups for legacy data structures. However,
the biggest gains are unlocked by designing algorithms and data structures for
scalable vectors. Helpful techniques include batching, structure-of-array
layouts, and aligned/padded allocations.
## Examples
Online demos using Compiler Explorer:
- [multiple targets with dynamic dispatch](https://gcc.godbolt.org/z/zP7MYe9Yf)
(recommended)
- [single target using -m flags](https://gcc.godbolt.org/z/rGnjMevKG)
Projects using Highway: (to add yours, feel free to raise an issue or contact us
via the below email)
* [iresearch database index](https://github.com/iresearch-toolkit/iresearch/blob/e7638e7a4b99136ca41f82be6edccf01351a7223/core/utils/simd_utils.hpp)
* [JPEG XL image codec](https://github.com/libjxl/libjxl)
* [Grok JPEG 2000 image codec](https://github.com/GrokImageCompression/grok)
* [vectorized Quicksort](https://github.com/google/highway/tree/master/hwy/contrib/sort) ([paper](https://arxiv.org/abs/2205.05982))
## Current status
### Targets
Supported targets: scalar, S-SSE3, SSE4, AVX2, AVX-512, AVX3_DL (~Icelake,
requires opt-in by defining `HWY_WANT_AVX3_DL`), NEON (ARMv7 and v8), SVE, SVE2,
WASM SIMD, RISC-V V.
`HWY_WASM_EMU256` is a 2x unrolled version of wasm128 and is enabled if
`HWY_WANT_WASM2` is defined. This will remain supported until it is potentially
superseded by a future version of WASM.
SVE was initially tested using farm_sve (see acknowledgments).
### Versioning
Highway releases aim to follow the semver.org system (MAJOR.MINOR.PATCH),
incrementing MINOR after backward-compatible additions and PATCH after
backward-compatible fixes. We recommend using releases (rather than the Git tip)
because they are tested more extensively, see below.
The current version 1.0 signals an increased focus on backwards compatibility.
Applications using documented functionality will remain compatible with future
updates that have the same major version number.
### Testing
Continuous integration tests build with a recent version of Clang (running on
native x86, or QEMU for RVV and ARM) and MSVC 2019 (v19.28, running on native
x86).
Before releases, we also test on x86 with Clang and GCC, and ARMv7/8 via GCC
cross-compile. See the [testing process](g3doc/release_testing_process.md) for
details.
### Related modules
The `contrib` directory contains SIMD-related utilities: an image class with
aligned rows, a math library (16 functions already implemented, mostly
trigonometry), and functions for computing dot products and sorting.
## Installation
This project uses CMake to generate and build. In a Debian-based system you can
install it via:
```bash
sudo apt install cmake
```
Highway's unit tests use [googletest](https://github.com/google/googletest).
By default, Highway's CMake downloads this dependency at configuration time.
You can disable this by setting the `HWY_SYSTEM_GTEST` CMake variable to ON and
installing gtest separately:
```bash
sudo apt install libgtest-dev
```
To build Highway as a shared or static library (depending on BUILD_SHARED_LIBS),
the standard CMake workflow can be used:
```bash
mkdir -p build && cd build
cmake ..
make -j && make test
```
Or you can run `run_tests.sh` (`run_tests.bat` on Windows).
Bazel is also supported for building, but it is not as widely used/tested.
When building for Arm v7, a limitation of current compilers requires you to add
`-DHWY_CMAKE_ARM7:BOOL=ON` to the CMake command line; see #834 and #1032. We
understand that work is underway to remove this limitation.
## Quick start
You can use the `benchmark` inside examples/ as a starting point.
A [quick-reference page](g3doc/quick_reference.md) briefly lists all operations
and their parameters, and the [instruction_matrix](g3doc/instruction_matrix.pdf)
indicates the number of instructions per operation.
The [FAQ](g3doc/faq.md) answers questions about portability, API design and
where to find more information.
We recommend using full SIMD vectors whenever possible for maximum performance
portability. To obtain them, pass a `ScalableTag<float>` (or equivalently
`HWY_FULL(float)`) tag to functions such as `Zero/Set/Load`. There are two
alternatives for use-cases requiring an upper bound on the lanes:
- For up to `N` lanes, specify `CappedTag<T, N>` or the equivalent
`HWY_CAPPED(T, N)`. The actual number of lanes will be `N` rounded down to
the nearest power of two, such as 4 if `N` is 5, or 8 if `N` is 8. This is
useful for data structures such as a narrow matrix. A loop is still required
because vectors may actually have fewer than `N` lanes.
- For exactly a power of two `N` lanes, specify `FixedTag<T, N>`. The largest
supported `N` depends on the target, but is guaranteed to be at least
`16/sizeof(T)`.
Due to ADL restrictions, user code calling Highway ops must either:
* Reside inside `namespace hwy { namespace HWY_NAMESPACE {`; or
* prefix each op with an alias such as `namespace hn = hwy::HWY_NAMESPACE;
hn::Add()`; or
* add using-declarations for each op used: `using hwy::HWY_NAMESPACE::Add;`.
Additionally, each function that calls Highway ops (such as `Load`) must either
be prefixed with `HWY_ATTR`, OR reside between `HWY_BEFORE_NAMESPACE()` and
`HWY_AFTER_NAMESPACE()`. Lambda functions currently require `HWY_ATTR` before
their opening brace.
The entry points into code using Highway differ slightly depending on whether
they use static or dynamic dispatch.
* For static dispatch, `HWY_TARGET` will be the best available target among
`HWY_BASELINE_TARGETS`, i.e. those allowed for use by the compiler (see
[quick-reference](g3doc/quick_reference.md)). Functions inside
`HWY_NAMESPACE` can be called using `HWY_STATIC_DISPATCH(func)(args)` within
the same module they are defined in. You can call the function from other
modules by wrapping it in a regular function and declaring the regular
function in a header.
* For dynamic dispatch, a table of function pointers is generated via the
`HWY_EXPORT` macro that is used by `HWY_DYNAMIC_DISPATCH(func)(args)` to
call the best function pointer for the current CPU's supported targets. A
module is automatically compiled for each target in `HWY_TARGETS` (see
[quick-reference](g3doc/quick_reference.md)) if `HWY_TARGET_INCLUDE` is
defined and `foreach_target.h` is included.
When using dynamic dispatch, `foreach_target.h` is included from translation
units (.cc files), not headers. Headers containing vector code shared between
several translation units require a special include guard, for example the
following taken from `examples/skeleton-inl.h`:
```
#if defined(HIGHWAY_HWY_EXAMPLES_SKELETON_INL_H_) == defined(HWY_TARGET_TOGGLE)
#ifdef HIGHWAY_HWY_EXAMPLES_SKELETON_INL_H_
#undef HIGHWAY_HWY_EXAMPLES_SKELETON_INL_H_
#else
#define HIGHWAY_HWY_EXAMPLES_SKELETON_INL_H_
#endif
#include "hwy/highway.h"
// Your vector code
#endif
```
By convention, we name such headers `-inl.h` because their contents (often
function templates) are usually inlined.
## Compiler flags
Applications should be compiled with optimizations enabled - without inlining,
SIMD code may slow down by factors of 10 to 100. For clang and GCC, `-O2` is
generally sufficient.
For MSVC, we recommend compiling with `/Gv` to allow non-inlined functions to
pass vector arguments in registers. If intending to use the AVX2 target together
with half-width vectors (e.g. for `PromoteTo`), it is also important to compile
with `/arch:AVX2`. This seems to be the only way to generate VEX-encoded SSE4
instructions on MSVC. Otherwise, mixing VEX-encoded AVX2 instructions and
non-VEX SSE4 may cause severe performance degradation. Unfortunately, the
resulting binary will then require AVX2. Note that no such flag is needed for
clang and GCC because they support target-specific attributes, which we use to
ensure proper VEX code generation for AVX2 targets.
## Strip-mining loops
To vectorize a loop, "strip-mining" transforms it into an outer loop and inner
loop with number of iterations matching the preferred vector width.
In this section, let `T` denote the element type, `d = ScalableTag<T>`, `count`
the number of elements to process, and `N = Lanes(d)` the number of lanes in a
full vector. Assume the loop body is given as a function `template<bool partial,
class D> void LoopBody(D d, size_t index, size_t max_n)`.
Highway offers several ways to express loops where `N` need not divide `count`:
* Ensure all inputs/outputs are padded. Then the loop is simply
```
for (size_t i = 0; i < count; i += N) LoopBody<false>(d, i, 0);
```
Here, the template parameter and second function argument are not needed.
This is the preferred option, unless `N` is in the thousands and vector
operations are pipelined with long latencies. This was the case for
supercomputers in the 90s, but nowadays ALUs are cheap and we see most
implementations split vectors into 1, 2 or 4 parts, so there is little cost
to processing entire vectors even if we do not need all their lanes. Indeed
this avoids the (potentially large) cost of predication or partial
loads/stores on older targets, and does not duplicate code.
* Use the `Transform*` functions in hwy/contrib/algo/transform-inl.h. This
takes care of the loop and remainder handling and you simply define a
generic lambda function (C++14) or functor which receives the current vector
from the input/output array, plus optionally vectors from up to two extra
input arrays, and returns the value to write to the input/output array.
Here is an example implementing the BLAS function SAXPY (`alpha * x + y`):
```
Transform1(d, x, n, y, [](auto d, const auto v, const auto v1) HWY_ATTR {
return MulAdd(Set(d, alpha), v, v1);
});
```
* Process whole vectors as above, followed by a scalar loop:
```
size_t i = 0;
for (; i + N <= count; i += N) LoopBody<false>(d, i, 0);
for (; i < count; ++i) LoopBody<false>(CappedTag<T, 1>(), i, 0);
```
The template parameter and second function arguments are again not needed.
This avoids duplicating code, and is reasonable if `count` is large.
If `count` is small, the second loop may be slower than the next option.
* Process whole vectors as above, followed by a single call to a modified
`LoopBody` with masking:
```
size_t i = 0;
for (; i + N <= count; i += N) {
LoopBody<false>(d, i, 0);
}
if (i < count) {
LoopBody<true>(d, i, count - i);
}
```
Now the template parameter and third function argument can be used inside
`LoopBody` to non-atomically 'blend' the first `num_remaining` lanes of `v`
with the previous contents of memory at subsequent locations:
`BlendedStore(v, FirstN(d, num_remaining), d, pointer);`. Similarly,
`MaskedLoad(FirstN(d, num_remaining), d, pointer)` loads the first
`num_remaining` elements and returns zero in other lanes.
This is a good default when it is infeasible to ensure vectors are padded,
but is only safe `#if !HWY_MEM_OPS_MIGHT_FAULT`!
In contrast to the scalar loop, only a single final iteration is needed.
The increased code size from two loop bodies is expected to be worthwhile
because it avoids the cost of masking in all but the final iteration.
## Additional resources
* [Highway introduction (slides)](g3doc/highway_intro.pdf)
* [Overview of instructions per operation on different architectures](g3doc/instruction_matrix.pdf)
* [Design philosophy and comparison](g3doc/design_philosophy.md)
* [Implementation details](g3doc/impl_details.md)
## Acknowledgments
We have used [farm-sve](https://gitlab.inria.fr/bramas/farm-sve) by Berenger
Bramas; it has proved useful for checking the SVE port on an x86 development
machine.
This is not an officially supported Google product.
Contact: janwas@google.com
|