Diffstat (limited to 'dom/webgpu/tests/cts/checkout/docs/fp_primer.md')
 dom/webgpu/tests/cts/checkout/docs/fp_primer.md | 871 ++++++++++++++++++++++++
 1 file changed, 871 insertions(+), 0 deletions(-)
diff --git a/dom/webgpu/tests/cts/checkout/docs/fp_primer.md b/dom/webgpu/tests/cts/checkout/docs/fp_primer.md
new file mode 100644
index 0000000000..a8302fb461
--- /dev/null
+++ b/dom/webgpu/tests/cts/checkout/docs/fp_primer.md
@@ -0,0 +1,871 @@
+# Floating Point Primer
+
+This document is meant to be a primer on the floating point concepts that need
+to be understood when working on tests in WebGPU's CTS.
+
+WebGPU's CTS is responsible for testing if an implementation of WebGPU
+satisfies the spec, and thus meets the expectations of programmers based on the
+contract defined by the spec.
+
+Floating point math makes up a significant portion of the WGSL spec, and has
+many subtle corner cases to get correct.
+
+Additionally, floating point math, unlike integer math, is broadly not exact.
+How inaccurate a calculation is allowed to be must therefore be stated in the
+spec and tested in the CTS, as opposed to testing for a single correct
+response.
+
+Thus, the WebGPU CTS has a significant amount of machinery around how to
+correctly test floating point expectations in a fluent manner.
+
+## Floating Point Numbers
+
+For most of the following discussion, 32-bit floating point numbers are
+assumed, also known as single precision IEEE floating point numbers or `f32`s.
+Most of the discussion that applies to this format also applies to the other
+concrete format that is handled, i.e. 16-bit/f16/half-precision. There are some
+significant differences with respect to AbstractFloats, which are discussed in
+their own section.
+
+Details of how these formats work are discussed as needed below, but for a more
+involved discussion, please see the references in the Resources sections.
+
+Additionally, in the Appendix there are tables of interesting/common values
+that are often referenced in tests or this document.
+
+A floating point number system defines
+- A finite set of values to stand as representatives for the infinite set of
+ real numbers, and
+- Arithmetic operations on those representatives, trying to approximate the
+ ideal operations on real numbers.
+
+The cardinality mismatch alone implies that any floating point number system necessarily loses information.
+
+This means that not all numbers in the bounds can be exactly represented as a
+floating point value.
+
+For example, the integer `1` is exactly represented as a f32 as `0x3f800000`,
+but the next representable f32 value, `0x3f800001`, is `1.00000011920928955`.
+
+So any number strictly between `1` and `1.00000011920928955` is not exactly
+representable as a f32 and instead is approximated as either `1` or
+`1.00000011920928955`.
+
+When a number X is not exactly representable by a floating point value, there
+are normally two neighbouring numbers that could reasonably represent X: the
+nearest floating point value above X, and the nearest floating point value below
+X. Which of these values gets used is dictated by the rounding mode being used,
+which may be something like always round towards 0 or go to the nearest
+neighbour, or something else entirely.
+
+The process of converting numbers between different precisions is called
+quantization. WGSL does not prescribe a specific rounding mode when
+quantizing, so either of the neighbouring values is considered valid when
+converting a non-exactly representable value to a floating point value. This has
+significant implications on the CTS that are discussed later.
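+
+To make this concrete, below is a minimal TypeScript sketch (not CTS code;
+`f32Neighbours` is a hypothetical helper name) of finding the two rounding
+candidates for a value, using typed arrays to reinterpret bits:
+
+```
+const f32Buf = new Float32Array(1);
+const u32Buf = new Uint32Array(f32Buf.buffer);
+
+// Nearest f32 at-or-below and at-or-above x (finite, positive x for brevity).
+function f32Neighbours(x: number): [number, number] {
+  f32Buf[0] = x; // quantizes to the nearest f32
+  const nearest = f32Buf[0];
+  if (nearest === x) return [x, x]; // exactly representable
+  // Step one representable value towards x's other side via the bit pattern.
+  u32Buf[0] += nearest < x ? 1 : -1;
+  const other = f32Buf[0];
+  return nearest < x ? [nearest, other] : [other, nearest];
+}
+
+// Anything strictly between 1 and 1.00000011920928955 may quantize either way:
+f32Neighbours(1.00000001); // [1, 1.00000011920928955]
+```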
+
+From here on, we assume you are familiar with the internal structure of a
+floating point number (a sign bit, a biased exponent, and a mantissa). For
+reference, see
+[binary64 on Wikipedia](https://en.wikipedia.org/wiki/Double-precision_floating-point_format),
+[binary32 on Wikipedia](https://en.wikipedia.org/wiki/Single-precision_floating-point_format),
+and
+[binary16 on Wikipedia](https://en.wikipedia.org/wiki/Half-precision_floating-point_format).
+
+In the floating points formats described above, there are two possible zero
+values, one with all bits being 0, called positive zero, and one all the same
+except with the sign bit being 1, called negative zero.
+
+For WGSL, and thus the CTS's purposes, these values are considered equivalent.
+TypeScript, which the CTS is written in, treats all zeros as positive zeros
+unless you use an explicit escape hatch to differentiate between them, so most
+of the time having two zeros doesn't materially affect code.
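+
+For example, in TypeScript:
+
+```
+-0 === 0;         // true: most comparisons treat the zeros as equal
+Object.is(-0, 0); // false: one of the few explicit ways to tell them apart
+1 / -0;           // -Infinity: division is another way to expose the sign
+```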
+
+### Normal Numbers
+
+Normal numbers are floating point numbers whose biased exponent is not all 0s or
+all 1s. When working with normal numbers the mantissa starts with an implied
+leading 1. For WGSL these numbers behave as you expect for floating point values
+with no interesting caveats.
+
+### Subnormal Numbers
+
+Subnormal numbers are finite non-zero numbers whose biased exponent is all 0s,
+sometimes called denorms.
+
+These are the closest numbers to zero, both positive and negative, and fill in
+the gap between the normal numbers with the smallest magnitude, and 0.
+
+Some devices, for performance reasons, do not handle operations on
+subnormal numbers, and instead treat them as being zero. This is called *flush
+to zero* or FTZ behaviour.
+
+This means in the CTS that when a subnormal number is consumed or produced by an
+operation, an implementation may choose to replace it with zero.
+
+Like the rounding mode for quantization, this adds significant complexity to the
+CTS, which will be discussed later.
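+
+As a sketch of the consequence for testing (hypothetical helpers, not the CTS
+implementation): any subnormal result may legally be flushed, so it adds zero
+as an extra acceptable output.
+
+```
+const F32_MIN_POSITIVE_NORMAL = 2 ** -126;
+
+function isSubnormalF32(x: number): boolean {
+  return x !== 0 && Math.abs(x) < F32_MIN_POSITIVE_NORMAL;
+}
+
+// Acceptable outputs from an implementation that may flush to zero.
+function withFTZ(result: number): number[] {
+  return isSubnormalF32(result) ? [result, 0] : [result];
+}
+```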
+
+### Inf & NaNs
+
+Floating point numbers include positive and negative infinity to represent
+values that are out of the bounds supported by the current precision.
+
+Implementations may assume that infinities are not present. When an evaluation
+at runtime would produce an infinity, an indeterminate value is produced
+instead.
+
+When a value goes out of bounds for a specific precision there are special
+rounding rules that apply. If it is 'near' the edge of the finite values for
+that precision, it is considered to be near-overflowing, and the implementation
+may choose to round it to the edge value or the appropriate infinity. If it is
+not near the finite values, in which case it is considered to be
+far-overflowing, then it must be rounded to the appropriate infinity.
+
+This of course is vague, but the spec does have a precise definition of where
+the transition from near to far overflow is.
+
+Let `x` be our value.
+
+Let `exp_max` be the (unbiased) exponent of the largest finite value for the
+floating point type.
+
+If `|x|` < `2 ** (exp_max + 1)`, but not in
+the finite range, then it is considered to be near-overflowing for the
+floating point type.
+
+If the magnitude is equal to or greater than this limit, then it is
+far-overflowing for the floating point type.
+
+This concept of near-overflow vs far-overflow divides the real number line into
+5 distinct regions.
+
+| Region | Rule |
+|-----------------------------------------------|---------------------------------|
+| -∞ < `x` <= `-(2 ** (exp_max + 1))` | must round to -∞ |
+| `-(2 ** (exp_max + 1))` < `x` <= min fp value | must round to -∞ or min value |
+| min fp value < `x` < max fp value | round as discussed below |
+| max fp value <= `x` < `2 ** (exp_max + 1)` | must round to max value or ∞ |
+| `2 ** (exp_max + 1)` <= `x` | must round to ∞ |
+
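+A sketch of these regions in code for f32, where `exp_max` is 127 so the
+cut-off is `2 ** 128` (`nearestF32s` is a hypothetical stand-in for the
+in-range rounding discussed later):
+
+```
+const F32_MAX = 3.4028234663852886e38; // largest finite f32
+const F32_OVERFLOW_CUTOFF = 2 ** 128;  // 2 ** (exp_max + 1)
+
+declare function nearestF32s(x: number): number[]; // hypothetical
+
+function acceptableRoundings(x: number): number[] {
+  if (x <= -F32_OVERFLOW_CUTOFF) return [-Infinity];       // far-overflow
+  if (x <= -F32_MAX) return [-Infinity, -F32_MAX];         // near-overflow
+  if (x < F32_MAX) return nearestF32s(x);                  // finite
+  if (x < F32_OVERFLOW_CUTOFF) return [F32_MAX, Infinity]; // near-overflow
+  return [Infinity];                                       // far-overflow
+}
+```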
+
+The CTS encodes the least restrictive interpretation of the rules in the spec,
+i.e. assuming someone has made a slightly adversarial implementation that always
+chooses the thing with the least accuracy.
+
+This means that the above rules about infinities and overflow combine to say
+that any time a non-finite value for the specific floating point type is seen,
+any finite value is acceptable afterward. This is because the non-finite value
+may be converted to an infinity and then an indeterminate value can be used
+instead of the infinity.
+
+(This comes with the caveat that this is only for runtime execution on a GPU,
+the rules for compile time execution will be discussed below.)
+
+Signaling NaNs are treated as quiet NaNs in the WGSL spec, and quiet NaNs have
+the same "may-convert-to-indeterminate-value" behaviour that infinities have,
+so for the purposes of the CTS they are handled by the infinite/out of bounds
+logic normally.
+
+## Notation/Terminology
+
+When discussing floating point values in the CTS, there are a few terms used
+with precise meanings, which will be elaborated here.
+
+Additionally, any specific notation used will be specified here to avoid
+confusion.
+
+### Operations
+
+The CTS tests for the proper execution of builtins, e.g. `sin`, `sqrt`, `abs`,
+etc., and expressions, e.g. `*`, `/`, `<`, etc., when provided with floating
+point inputs. These collectively can be referred to as floating point
+operations.
+
+Operations, which can be thought of as mathematical functions, are mappings from
+a set of inputs to a set of outputs.
+
+Denoted `f(x, y) = X`, where `f` is a placeholder or the name of the operation,
+lower case variables are the inputs to the function, and uppercase variables are
+the outputs of the function.
+
+Operations have one or more inputs and an output value.
+
+Values are generally defined as floats, integers, booleans, vectors, and
+matrices. Consult the [WGSL Spec](https://www.w3.org/TR/WGSL/) for the exact
+list of types and their definitions.
+
+Most operations' inputs and output are the same type of value. There are some
+exceptions that accept or emit heterogeneous data types, normally a floating
+point type and an integer type or a boolean.
+
+There are a couple of builtins (`frexp` and `modf`) that return composite
+outputs, where multiple values are returned as a single result value made of
+structured data. Composite inputs, in contrast, are handled by having multiple
+input parameters.
+
+Some examples of different types of operations:
+
+`multiplication(x, y) = X`, which represents the WGSL expression `x * y`, takes
+in floating point values, `x` and `y`, and produces a floating point value `X`.
+
+`lessThan(x, y) = X`, which represents the WGSL expression `x < y`, again takes
+in floating point values, but in this case returns a boolean value.
+
+`ldexp(x, y) = X`, which builds a floating point value, takes in a floating
+point value `x` and a restricted integer `y`.
+
+### Domain, Range, and Intervals
+
+For an operation `f(x) = X`, the interval of valid values for the input, `x`, is
+called the *domain*, and the interval for valid results, `X`, is called the
+*range*.
+
+An interval, `[a, b]`, is a set of real numbers that contains `a`, `b`, and all
+the real numbers between them.
+
+Open-ended intervals, i.e. ones that don't include `a` and/or `b`, are avoided,
+and are called out explicitly when they occur.
+
+The convention in this doc and the CTS code is that `a <= b`, so `a` can be
+referred to as the beginning of the interval and `b` as the end of the interval.
+
+When talking about intervals, this doc and the code endeavour to avoid using
+the term **range** to refer to the span of values that an interval covers,
+instead using the term **bounds**, to avoid confusion with the terminology
+around the output of operations.
+
+## Accuracy
+
+As mentioned above, floating point numbers are not able to represent all the
+possible values over their bounds; instead they represent discrete values in
+that interval, and approximate the remainder.
+
+Additionally, floating point numbers are not evenly distributed over the real
+number line, but instead are more densely clustered around zero, with the space
+between values increasing in steps as the magnitude increases.
+
+When discussing operations on floating point numbers, there is often reference
+to a true value. This is the value that, given no performance constraints and
+infinite precision, you would get, i.e. `acos(-1) = π`, where π has infinite
+digits of precision.
+
+For the CTS it is often sufficient to calculate the true value using
+TypeScript, since its native number format is higher precision
+(double-precision/f64), so all f32 and f16 values can be represented exactly in
+it. Where this breaks down will be discussed in the section on compile time vs
+runtime execution.
+
+The true value is sometimes representable exactly as a floating point value, but
+often is not.
+
+Additionally, many operations are implemented using approximations from
+numerical analysis, where there is a tradeoff between the precision of the
+result and the cost.
+
+Thus, the spec specifies what the accuracy constraints for specific operations
+are, i.e. how close to the true value an implementation is required to be, to
+be considered conforming.
+
+There are 5 different ways that accuracy requirements are defined in the spec:
+
+1. *Exact*
+
+ This is the situation where the true value for an operation is always
+ expected to be exactly representable. This doesn't happen for any of the
+ operations that return floating point values, but does occur for logical
+ operations that return boolean values.
+
+
+2. *Correctly Rounded*
+
+ For the case that the true value is exactly representable as a floating
+ point value, this is the equivalent of Exact from above. In the event that
+ the true value is not exact, then the acceptable answer for most numbers is
+ either the nearest representable value above or below the true value.
+
+ For values near the subnormal range, e.g. close to zero, this becomes more
+ complex, since an implementation may FTZ at any point. So if the exact
+ solution is subnormal or either of the neighbours of the true value are
+ subnormal, zero becomes a possible result, thus the acceptance interval is
+ wider than naively expected.
+
+ On the edge of and beyond the bounds of a floating point type the definition
+ of correctly rounded becomes complex, which is discussed in detail in the
+ section on overflow.
+
+
+3. *Absolute Error*
+
+ This type of accuracy specifies an error value, ε, and the calculated result
+ is expected to be within that distance from the true value, i.e.
+ `[ X - ε, X + ε ]`.
+
+ The main drawback with this manner of specifying accuracy is that it doesn't
+ scale with the level of precision in floating point numbers themselves at a
+ specific value. Thus, it tends to only be used for specifying accuracy over
+ specific limited intervals, e.g. [-π, π].
+
+
+4. *Units of Least Precision (ULP)*
+
+ The solution to the issue of not scaling with precision of floating point is
+ to use units of least precision.
+
+ ULP(X) is the minimum of (b - a) over all pairs (a, b) of representable
+ floating point numbers such that a <= X <= b and a =/= b. For a more formal
+ discussion of ULP see
+ [On the definition of ulp(x)](https://hal.inria.fr/inria-00070503/document).
+
+ n * ULP or nULP means `[X - n * ULP(X), X + n * ULP(X)]` (see the sketch
+ after this list).
+
+
+5. *Inherited*
+
+ When an operation's accuracy is defined in terms of other operations, then
+ its accuracy is said to be inherited. Handling of inherited accuracies is one
+ of the main driving factors in the design of the testing framework, so it
+ will need to be discussed in detail.
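+
+As referenced in the ULP item above, here is a minimal TypeScript sketch of
+`ULP(X)` for f32 (finite, positive values only for brevity; not the CTS
+implementation, which also handles signs, zeros, and format edges):
+
+```
+const ulpBuf = new Float32Array(1);
+const ulpBits = new Uint32Array(ulpBuf.buffer);
+
+function ulpF32(x: number): number {
+  ulpBuf[0] = x;
+  if (ulpBuf[0] > x) ulpBits[0] -= 1; // now holds the f32 at or below x
+  const below = ulpBuf[0];
+  ulpBits[0] += 1;                    // the next f32 above it
+  const above = ulpBuf[0];
+  if (below !== x) return above - below; // x sits strictly between two values
+  // x is representable: the tightest bracketing pair may be just below it.
+  ulpBits[0] -= 2;
+  return Math.min(above - x, x - ulpBuf[0]);
+}
+```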
+
+## Acceptance Intervals
+
+The first four accuracy types (Exact, Correctly Rounded, Absolute Error, and
+ULP), sometimes called simple accuracies, can be defined in isolation from each
+other, and by association can be implemented using relatively independent
+implementations.
+
+The original implementation of the floating point framework did this as it was
+being built out, but ran into difficulties when defining the inherited
+accuracies.
+
+For example, `tan(x)` inherits from `sin(x)/cos(x)`; one can take the defined
+rules and manually build up a bespoke solution for checking the results, but
+this is tedious, error-prone, and doesn't allow for code re-use.
+
+Instead, it would be better if there was a single conceptual framework that one
+can express all the 'simple' accuracy requirements in, and then have a mechanism
+for composing them to define inherited accuracies.
+
+In the WebGPU CTS this is done via the concept of acceptance intervals, which is
+derived from a similar concept in the Vulkan CTS, though implemented
+significantly differently.
+
+The core of this idea is that each of the different accuracy types can be
+integrated into the definition of the operation, so that instead of
+transforming an input from the domain to a point in the range, the operation
+produces an interval in the range that contains the acceptable values an
+implementation may emit.
+
+
+The simple accuracies can be defined as follows:
+
+1. *Exact*
+
+ `f(x) => [X, X]`
+
+
+2. *Correctly Rounded*
+
+ If `X` is precisely defined as a floating point value
+
+ `f(x) => [X, X]`
+
+ otherwise,
+
+ `[a, b]` where `a` is the largest representable number with `a <= X`, and `b`
+ is the smallest representable number with `X <= b`
+
+
+3. *Absolute Error*
+
+ `f(x) => [ X - ε, X + ε ]`, where ε is the absolute error value
+
+
+4. *ULP Error*
+
+ `f(x) = X => [X - n*ULP(X), X + n*ULP(X)]`
+
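+A sketch of these four constructors in TypeScript, assuming the hypothetical
+`f32Neighbours` and `ulpF32` helpers from the earlier sketches:
+
+```
+type Interval = [number, number]; // [begin, end], begin <= end
+
+const exact = (X: number): Interval => [X, X];
+
+// Collapses to [X, X] when X is exactly representable.
+const correctlyRounded = (X: number): Interval => f32Neighbours(X);
+
+const absoluteError = (X: number, eps: number): Interval => [X - eps, X + eps];
+
+const ulpError = (X: number, n: number): Interval => [
+  X - n * ulpF32(X),
+  X + n * ulpF32(X),
+];
+```
+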
+As defined, these accuracies handle mapping from a point in the domain into an
+interval in the range.
+
+This is insufficient for implementing inherited accuracies, since inheritance
+sometimes involves mapping domain intervals to range intervals.
+
+Here we use the convention for naturally extending a function on real numbers
+into a function on intervals of real numbers, i.e. `f([a, b]) = [A, B]`.
+
+Given that floating point numbers have a finite number of precise values for
+any given interval, one could implement this by just running the accuracy
+computation for every point in the interval and then spanning together the
+resultant intervals. That would be very inefficient though, and make your
+reviewer sad to read.
+
+For mapping intervals to intervals the key insight is that we only need to be
+concerned with the extrema of the operation in the interval, since the
+acceptance interval is the bounds of the possible outputs.
+
+In more precise terms:
+```
+ f(x) => X, x = [a, b] and X = [A, B]
+
+ X = [min(f(x)), max(f(x))]
+ X = [min(f([a, b])), max(f([a, b]))]
+ X = [f(m), f(n)]
+```
+where `m` and `n` are in `[a, b]`, `m <= n`, and produce the min and max results
+for `f` on the interval, respectively.
+
+So how do we find the minima and maxima for our operation in the domain?
+
+The common general solution for this requires using calculus to calculate the
+derivative of `f`, `f'`, and then find the zeroes `f'` to find inflection
+points of `f`.
+
+This solution wouldn't be sufficient for all builtins, i.e. `step` which is not
+differentiable at edge values.
+
+Thankfully we do not need a general solution for the CTS, since all the builtin
+operations are defined in the spec, so `f` is from a known set of options.
+
+These operations can be divided into two broad categories: monotonic, and
+non-monotonic, with respect to an interval.
+
+The monotonic operations are ones that preserve the order of inputs in their
+outputs (or reverse it). Their graph only ever decreases or increases, never
+changing from one to the other, though it can have flat sections.
+
+The non-monotonic operations are ones whose graph would have both regions of
+increase and decrease.
+
+The monotonic operations, when mapping an interval to an interval, are simple to
+handle, since the extrema are guaranteed to be the ends of the domain, `a` and
+`b`.
+
+So `f([a, b])` = `[f(a), f(b)]` or `[f(b), f(a)]`. We could figure out if `f` is
+increasing or decreasing beforehand to determine if it should be `[f(a), f(b)]`
+or `[f(b), f(a)]`.
+
+It is simpler to just use min & max to have an implementation that is agnostic
+to the details of `f`.
+```
+ A = f(a), B = f(b)
+ X = [min(A, B), max(A, B)]
+```
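+
+In TypeScript this might look like the following sketch, reusing the
+`Interval` type from above:
+
+```
+// Map a domain interval through a monotonic operation f.
+function mapMonotonic(f: (x: number) => number, [a, b]: Interval): Interval {
+  const A = f(a);
+  const B = f(b);
+  return [Math.min(A, B), Math.max(A, B)];
+}
+
+mapMonotonic(Math.exp, [0, 1]);  // increasing: [1, 2.718…]
+mapMonotonic((x) => -x, [0, 1]); // decreasing: [-1, 0]
+```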
+
+The non-monotonic functions that we need to handle for interval-to-interval
+mappings are more complex. Thankfully they are a small number of the overall
+operations that need to be handled, since they are only the operations that are
+used in an inherited accuracy and take in the output of another operation as
+part of that inherited accuracy.
+
+So in the CTS we just have bespoke implementations for each of them.
+
+Part of the operation definition in the CTS is a function that takes in the
+domain interval, and returns a sub-interval such that the subject function is
+monotonic over that sub-interval, and hence the function's minima and maxima are
+at the ends.
+
+This adjusted domain interval can then be fed through the same machinery as the
+monotonic functions.
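+
+A hypothetical sketch of that idea for `cos` restricted to `[-π, π]`: `cos`
+increases on `[-π, 0]` and decreases on `[0, π]`, so a sub-interval over which
+it is monotonic, with the same extrema, runs from the point farthest from 0
+(the minimum of `cos`) to the point closest to 0 (the maximum, which is 0
+itself when the interval straddles it).
+
+```
+// Sub-interval of [a, b] (within [-π, π]) over which cos is monotonic and
+// has the same extrema as over all of [a, b].
+function cosMonotonicSubinterval([a, b]: Interval): Interval {
+  const atMin = Math.abs(a) >= Math.abs(b) ? a : b; // farthest from 0
+  const atMax =
+    a <= 0 && 0 <= b ? 0                            // 0 if straddled
+    : Math.abs(a) < Math.abs(b) ? a : b;            // else the nearest end
+  return atMin <= atMax ? [atMin, atMax] : [atMax, atMin];
+}
+
+mapMonotonic(Math.cos, cosMonotonicSubinterval([-1, 2])); // [cos(2), 1]
+```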
+
+### Inherited Accuracy
+
+So with all of that background out of the way, we can now define an inherited
+accuracy in terms of acceptance intervals.
+
+The crux of this is the insight that the range of one operation can become the
+domain of another operation to compose them together.
+
+And since we have defined how to do this interval to interval mapping above,
+transforming things becomes mechanical and thus implementable in reusable code.
+
+When talking about inherited accuracies `f(x) => g(x)` is used to denote that
+`f`'s accuracy is defined as `g`.
+
+An example to illustrate inherited accuracies, in f32:
+
+```
+ tan(x) => sin(x)/cos(x)
+
+ sin(x) => [sin(x) - 2 ** -11, sin(x) + 2 ** -11]
+ cos(x) => [cos(x) - 2 ** -11, cos(x) + 2 ** -11]
+
+ x/y => [x/y - 2.5 * ULP(x/y), x/y + 2.5 * ULP(x/y)]
+```
+
+`sin(x)` and `cos(x)` are non-monotonic, so calculating out a closed generic
+form over an interval is a pain, since the min and max vary depending on the
+value of x. Let's isolate this to a single point, so you don't have to read
+literally pages of expanded intervals.
+
+```
+ x = π
+
+ sin(π) => [sin(π) - 2 ** -11, sin(π) + 2 ** -11]
+        => [0 - 2 ** -11, 0 + 2 ** -11]
+        => [-0.000488…, 0.000488…]
+ cos(π) => [cos(π) - 2 ** -11, cos(π) + 2 ** -11]
+        => [-1.000488…, -0.999511…]
+
+ tan(π) => sin(π)/cos(π)
+        => [-0.000488…, 0.000488…]/[-1.000488…, -0.999511…]
+        => [min(-0.000488…/-1.000488…, -0.000488…/-0.999511…, 0.000488…/-1.000488…, 0.000488…/-0.999511…),
+            max(-0.000488…/-1.000488…, -0.000488…/-0.999511…, 0.000488…/-1.000488…, 0.000488…/-0.999511…)]
+        => [0.000488…/-0.999511…, -0.000488…/-0.999511…]
+        => [-0.0004885…, 0.0004885…]
+```
+
+For clarity this has omitted a bunch of complexity around FTZ behaviours, the
+2.5 ULP error on the division itself, and the fact that these operations are
+only defined for specific domains, but the high-level concepts hold.
+
+For each of the inherited operations we could implement a manually written-out
+closed form solution, but that would be quite error-prone and would not re-use
+code between builtins.
+
+Instead, the CTS takes advantage of the fact that, in addition to testing
+implementations of `tan(x)`, we are going to be testing implementations of
+`sin(x)`, `cos(x)` and `x/y`, so there should be functions to generate
+acceptance intervals for those operations.
+
+The `tan(x)` acceptance interval can be constructed by generating the acceptance
+intervals for `sin(x)`, `cos(x)` and `x/y` via function calls and composing the
+results.
+
+This algorithmically looks something like this:
+
+```
+ tan(x):
+ Calculate sin(x) interval
+ Calculate cos(x) interval
+ Calculate sin(x) result divided by cos(x) result
+ Return division result
+```
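+
+In TypeScript this might look like the following sketch, where `sinInterval`
+and `cosInterval` are hypothetical point-to-interval generators, and the
+division takes the min/max over the four endpoint quotients as in the worked
+example (FTZ, domain checks, and divisors straddling zero are omitted):
+
+```
+declare function sinInterval(x: number): Interval; // hypothetical
+declare function cosInterval(x: number): Interval; // hypothetical
+
+function divideIntervals([a, b]: Interval, [c, d]: Interval): Interval {
+  const q = [a / c, a / d, b / c, b / d];
+  return [Math.min(...q), Math.max(...q)];
+}
+
+function tanInterval(x: number): Interval {
+  return divideIntervals(sinInterval(x), cosInterval(x));
+}
+```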
+
+## Compile vs Run Time Evaluation
+
+The above discussions have been primarily agnostic to when and where a
+calculation is occurring, with an implicit bias to runtime execution on a GPU.
+
+In reality where/when a computation is occurring has a significant impact on the
+expected outcome when dealing with edge cases.
+
+### Terminology
+
+There are two related axes that will be referred to when it comes to evaluation.
+These are compile vs run time, and CPU vs GPU. Broadly speaking compile time
+execution happens on the host CPU, and run time evaluation occurs on a dedicated
+GPU.
+
+(Software graphics implementations like WARP and SwiftShader technically break this by
+being a software emulation of a GPU that runs on the CPU, but conceptually one can
+think of these implementations being a type of GPU in this context, since it has
+similar constraints when it comes to precision, etc.)
+
+Compile time evaluation is execution that occurs when setting up a shader
+module, i.e. when compiling WGSL to a platform specific shading language. It is
+part of resolving values for things like constants, and occurs once before the
+shader is run by the caller. It includes constant evaluation and override
+evaluation. All AbstractFloat operations are compile time evaluated.
+
+Runtime evaluation is execution that occurs every time the shader is run, and
+may include dynamic data that is provided between invocations. It is work that
+is sent to the GPU for execution in the shader.
+
+WGSL const-expressions and override-expressions are evaluated before runtime and
+both are considered "compile time" in this discussion. WGSL runtime-expressions
+are evaluated at runtime.
+
+### Behavioural Differences
+
+For a well-defined operation with a finite result, runtime and compile time
+evaluation should be indistinguishable.
+
+For example:
+```
+// runtime
+@group(0) @binding(0) var<uniform> a : f32;
+@group(0) @binding(1) var<uniform> b : f32;
+
+let c: f32 = a + b;
+```
+and
+```
+// compile time
+const c: f32 = 1.0f + 2.0f;
+```
+should produce the same result of `3.0` in the variable `c`, assuming `1.0` and `2.0`
+were passed in as `a` and `b`.
+
+The only difference is when/where the execution occurs.
+
+The difference in behaviour between these two occurs when the result of the
+operation is not finite for the underlying floating point type.
+
+If instead of `1.0` and `2.0` we had `10.0` and `f32.max`, so that the true
+result is `f32.max + 10.0`, the behaviours differ. Specifically the runtime
+evaluated version will still run, but the result in `c` will be an
+indeterminate value, which is any finite f32 value. For the compile time
+example instead, compiling the shader will fail validation.
+
+This applies to any operation, and isn't restricted to just addition. Any time
+a value goes outside the finite range, the shader will hit these behaviours:
+an indeterminate value for runtime execution and a validation failure for
+compile time execution.
+
+Unfortunately we are dealing with intervals of results and not precise
+results, and this leads to even more conceptual complexity. For runtime
+evaluation this isn't too bad, because the rule becomes: if any part of the
+interval is non-finite then an indeterminate value can be a result, and the
+interval for an indeterminate result, `[fp min, fp max]`, will include any
+finite portions of the interval.
+
+Compile time evaluation becomes significantly more complex, because the
+difference isn't what interval is returned, but whether the shader compiles or
+not, which are mutually exclusive outcomes. This is compounded even further by
+having to consider near-overflow vs far-overflow behaviour. Thankfully this can
+be broken down on a case by case basis, based on where an interval falls.
+
+Assuming `X` is the well-defined result of an operation, i.e. not
+indeterminate because the operation isn't defined for the inputs:
+
+| Region | Description | Result |
+|------------------------------|------------------------------------------------------|--------------------------------|
+| `abs(X) <= fp max` | interval falls completely in the finite bounds | validation succeeds |
+| `abs(X) >= 2 ** (exp_max+1)` | interval falls completely in the far-overflow bounds | validation fails |
+| Otherwise | interval intersects the near-overflow region | validation may succeed or fail |
+
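+A sketch of this decision in code, reusing the f32 constants and `Interval`
+type from the earlier sketches:
+
+```
+type Validation = 'pass' | 'fail' | 'either';
+
+function expectedValidation([lo, hi]: Interval): Validation {
+  if (Math.max(Math.abs(lo), Math.abs(hi)) <= F32_MAX) {
+    return 'pass'; // entirely finite
+  }
+  const sameSign = lo > 0 || hi < 0; // straddling zero contains finite values
+  if (sameSign && Math.min(Math.abs(lo), Math.abs(hi)) >= F32_OVERFLOW_CUTOFF) {
+    return 'fail'; // entirely far-overflow
+  }
+  return 'either'; // intersects the near-overflow region
+}
+```
+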
+The final case is somewhat difficult from a CTS perspective, because it is no
+longer sufficient to know that a non-finite result has occurred; the specific
+result needs to be tracked. Additionally, the expected outcome is somewhat
+ambiguous, since a shader may or may not compile. This could in theory still be
+tested by the CTS, via switching logic that determines, in this region: if the
+shader compiles, expect these results; otherwise, pass the test. This would add
+a significant amount of complexity to the testing code for thoroughly testing a
+relatively small segment of values. Other environments do not have the
+behaviour in this region as rigorously defined nor tested, so fully testing
+here would likely find lots of issues that would just need to be mitigated in
+the CTS.
+
+Currently, we choose to avoid testing validation of near-overflow scenarios.
+
+### Additional Technical Limitations
+
+The above description of compile and runtime evaluation was somewhat based in
+a theoretical world where the intervals being used for testing are infinitely
+precise, when in actuality they are implemented with the ECMAScript `number`
+type, which is a f64 value.
+
+For the vast majority of cases, even out of bounds and overflow, this is
+sufficient. There is one small slice where this breaks down: specifically, when
+the result is just outside the finite range by less than 1 f64 ULP of the edge
+value. An example of this is `2 ** -11 + f32.max`. This will be between
+`f32.max` and `f32.max + ULPF64(f32.max)`. This becomes a problem, because this
+value technically falls into the out-of-bounds region, but depending on how
+quantization to f64 is handled in the test runner it will be either `f32.max`
+or `f32.max + ULPF64(f32.max)`. So for compile time evaluation we either expect
+an implementation to always handle this, or accept that it might fail, but we
+cannot easily detect which, since this is pushing hard on the limits of
+precision of the testing environment.
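+
+This cliff is easy to demonstrate, since f64 arithmetic in TypeScript itself
+quantizes the sum away under round-to-nearest:
+
+```
+const F32_MAX_VALUE = 3.4028234663852886e38;
+// 2 ** -11 is far below one f64 ULP at this magnitude (2 ** 75), so the sum
+// rounds straight back to f32.max:
+F32_MAX_VALUE + 2 ** -11 === F32_MAX_VALUE; // true
+```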
+
+(A parallel version of this probably exists on the other side of the
+out-of-bounds region, but I don't have a proven example of this)
+
+The high road fix to this problem is to use an arbitrary precision floating
+point implementation. Unfortunately such a library is not on the standards
+track for ECMAScript at this time, so we would have to evaluate and pick a
+third party dependency to use. Beyond the selection process, this would also
+require a significant refactoring of the existing framework code for fixing a
+very marginal case.
+
+(This differs from Float16 support, where the prototyped version of the
+proposed API has been pulled in, and the long term plan is to use the
+ECMAScript implementation's version, once all the major runtimes support it. So
+it can be viewed as a polyfill.)
+
+This region currently is not tested, as part of the decision to defer testing
+on the entire out-of-bounds but not far-overflowing region.
+
+In the future if we decide to add testing to the out-of-bounds region, to avoid
+perfect being the enemy of good here, it is likely the CTS would still avoid
+testing these regions where f64 precision breaks down. If someone is interested
+in taking on the effort needed to migrate to an arbitrary precision float
+library, or if this turns out to be a significant issue in the future, this
+decision can be revisited.
+
+## Abstract Float
+
+### Accuracy
+
+For the concrete floating point types (f32 & f16) the accuracy of operations
+is defined in terms of their own type. Specifically for f32, correctly rounded
+refers to the nearest f32 values, and ULP is in terms of the distance between
+f32 values.
+
+AbstractFloat internally is defined as a f64, and this applies for exact and
+correctly rounded accuracies. Thus, correctly rounded refers to the nearest f64
+values. However, AbstractFloat differs for ULP and absolute errors. Reading
+the spec strictly, these all have unbounded accuracies, but it is recommended
+that their accuracies be at least as good as the f32 equivalent.
+
+The difference between f32 and f64 ULP at a specific value X is significant,
+so the 'at least as good as f32' requirement is always less strict than if it
+was calculated in terms of f64. Similarly, for absolute accuracies the interval
+`[x - epsilon, x + epsilon]` is always as wide or wider if calculated in terms
+of f32s vs f64s.
+
+If an inherited accuracy is only defined in terms of correctly rounded
+accuracies, then the interval is calculated in terms of f64s. If any of the
+defining accuracies are ULP or absolute errors, then the result falls into the
+'unbounded, but recommended to be at least as good as f32' bucket.
+
+What this means from a CTS implementation perspective is that for these "at
+least as good as f32" error intervals, if the infinitely accurate result is
+finite for f32, then the error interval for f64 is just the f32 interval. If
+the result is not finite for f32, then the accuracy interval is just the
+unbounded interval.
+
+How this is implemented in the CTS is by having the FPTraits for AbstractFloat
+forward to the f32 implementation for the operations that are tested to be as
+good as f32.
+
+### Implementation
+
+AbstractFloats are a compile time construct that exists in WGSL. They are
+expressible as literal values or the result of operations that return them, but
+a variable cannot be typed as an AbstractFloat. Instead, the variable needs to
+be a concrete type, i.e. f32 or f16, and the AbstractFloat value will be
+quantized on assignment.
+
+Because they cannot be stored nor passed via buffers, it is tricky to test
+them. There are two approaches that have been proposed for testing the results
+of operations that return AbstractFloats, described below.
+
+As of the writing of this doc, the second option, extracting bits, is the one
+being pursued in the CTS.
+
+#### const_assert
+
+The first proposal is to lean on the `const_assert` statement that exists in
+WGSL. For each test case a snippet of code would be written out that has a form
+something like this
+
+```
+// foo(x) is the operation under test
+const_assert lower <= foo(x) // fails if the result is below the acceptance interval
+const_assert upper >= foo(x) // fails if the result is above the acceptance interval
+```
+
+where lower and upper would actually be string replaced with literals for the
+bounds of the acceptance interval when generating the shader text.
+
+This approach has a number of limitations that made it unacceptable for the
+CTS. First, how errors are reported makes them a pain to debug. Someone working
+with the CTS would either get a report of a failed shader compile, or a failed
+compile with a line number, but they will not get the result of `foo(x)`, just
+that it is out of range. Additionally, if you place many of these stanzas in
+the same shader to optimize dispatch, you will not get a report that, say, 3 of
+32 cases failed with specific results; you will just get that the batch failed.
+All of this makes for a very poor experience in attempting to understand what
+is failing.
+
+Beyond the lack of ergonomics, this approach also makes things like
+AbstractFloat comparison and const_assert itself very load bearing for the CTS.
+It is possible that a bug could exist in an implementation of const_assert, for
+example, that would cause it to not fail shader compilation, which could lead
+to silent passing of tests. Conceptually, instead of depending on a signal that
+something is working, we would be depending on a signal that it isn't working,
+and assuming that if we don't receive that signal everything is good, not that
+our signal mechanism was broken.
+
+#### Extracting Bits
+
+The other proposal that was developed depends on the fact that AbstractFloat is
+spec'd to be a f64 internally. So the CTS could store the result of an operation
+as two 32-bit unsigned integers (or broken up into sign, exponent, and
+mantissa). These stored integers could be exported to the testing framework via
+a buffer, which could in turn rebuild the f64 values.
+
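+On the framework side, rebuilding the f64 from the two 32-bit words read back
+from a buffer is straightforward; a sketch (assuming the shader wrote the low
+word first, and a little-endian host):
+
+```
+// Rebuild an f64 from two u32 words read back from a storage buffer.
+function f64FromWords(lo: number, hi: number): number {
+  const words = new Uint32Array([lo, hi]);
+  return new Float64Array(words.buffer)[0];
+}
+
+f64FromWords(0x00000000, 0x3ff00000); // 1.0
+```
+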
+This approach allows the CTS to test values directly in the testing framework,
+thus providing the same diagnostics as other tests, as well as reusing the same
+test harness.
+
+The major downsides come from actually implementing the extraction of the
+bits. Due to the restrictions on AbstractFloats, the actual code to extract the
+bits is tricky. Specifically, there is no simple bit cast to something like an
+AbstractInt that can be used. Instead, `frexp` needs to be used with additional
+operations. This leads to problems, since `frexp` as specified is not defined
+for subnormal values, so it is impossible to extract a subnormal AbstractFloat,
+though 0 could be returned when one is encountered.
+
+Tests that do try to extract bits to determine the result should either avoid
+cases with subnormal results, or check for the nearest normal number or zero.
+
+The inability to store AbstractFloats in a non-lossy fashion also has
+additional issues, since it means that user defined functions that take in or
+return them do not exist in WGSL. Thus, the snippet of code for extracting
+AbstractFloats cannot just be inserted as a function at the top of a testing
+shader, and then invoked on each test case. Instead, it needs to be inlined
+into the shader at each call-site. Actually implementing this in the CTS isn't
+difficult, but it does make the shaders significantly longer and more difficult
+to read. It also may have an impact on how many test cases can be in a batch,
+since runtime for some backends is sensitive to the length of the shader being
+run.
+
+# Appendix
+
+### Significant f64 Values
+
+| Name | Decimal (~) | Hex | Sign Bit | Exponent Bits | Significand Bits |
+|------------------------|----------------:|----------------------:|---------:|--------------:|-----------------------------------------------------------------:|
+| Negative Infinity | -∞ | 0xfff0 0000 0000 0000 | 1 | 111 1111 1111 | 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 |
+| Min Negative Normal | -1.79769313E308 | 0xffef ffff ffff ffff | 1 | 111 1111 1110 | 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 |
+| Max Negative Normal | -2.2250738E−308 | 0x8010 0000 0000 0000 | 1 | 000 0000 0001 | 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 |
+| Min Negative Subnormal | -2.2250738E−308 | 0x800f ffff ffff ffff | 1 | 000 0000 0000 | 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 |
+| Max Negative Subnormal | -4.9406564E−324 | 0x8000 0000 0000 0001 | 1 | 000 0000 0000 | 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0001 |
+| Negative Zero | -0 | 0x8000 0000 0000 0000 | 1 | 000 0000 0000 | 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 |
+| Positive Zero | 0 | 0x0000 0000 0000 0000 | 0 | 000 0000 0000 | 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 |
+| Min Positive Subnormal | 4.9406564E−324 | 0x0000 0000 0000 0001 | 0 | 000 0000 0000 | 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0001 |
+| Max Positive Subnormal | 2.2250738E−308 | 0x000f ffff ffff ffff | 0 | 000 0000 0000 | 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 |
+| Min Positive Normal | 2.2250738E−308 | 0x0010 0000 0000 0000 | 0 | 000 0000 0001 | 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 |
+| Max Positive Normal | 1.79769313E308 | 0x7fef ffff ffff ffff | 0 | 111 1111 1110 | 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 |
+| Positive Infinity | ∞ | 0x7ff0 0000 0000 0000 | 0 | 111 1111 1111 | 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 |
+
+### Significant f32 Values
+
+| Name | Decimal (~) | Hex | Sign Bit | Exponent Bits | Significand Bits |
+|------------------------|---------------:|------------:|---------:|--------------:|-----------------------------:|
+| Negative Infinity | -∞ | 0xff80 0000 | 1 | 1111 1111 | 0000 0000 0000 0000 0000 000 |
+| Min Negative Normal | -3.40282346E38 | 0xff7f ffff | 1 | 1111 1110 | 1111 1111 1111 1111 1111 111 |
+| Max Negative Normal | -1.1754943E−38 | 0x8080 0000 | 1 | 0000 0001 | 0000 0000 0000 0000 0000 000 |
+| Min Negative Subnormal | -1.1754942E-38 | 0x807f ffff | 1 | 0000 0000 | 1111 1111 1111 1111 1111 111 |
+| Max Negative Subnormal | -1.4012984E−45 | 0x8000 0001 | 1 | 0000 0000 | 0000 0000 0000 0000 0000 001 |
+| Negative Zero | -0 | 0x8000 0000 | 1 | 0000 0000 | 0000 0000 0000 0000 0000 000 |
+| Positive Zero | 0 | 0x0000 0000 | 0 | 0000 0000 | 0000 0000 0000 0000 0000 000 |
+| Min Positive Subnormal | 1.4012984E−45 | 0x0000 0001 | 0 | 0000 0000 | 0000 0000 0000 0000 0000 001 |
+| Max Positive Subnormal | 1.1754942E-38 | 0x007f ffff | 0 | 0000 0000 | 1111 1111 1111 1111 1111 111 |
+| Min Positive Normal | 1.1754943E−38 | 0x0080 0000 | 0 | 0000 0001 | 0000 0000 0000 0000 0000 000 |
+| Max Positive Normal | 3.40282346E38 | 0x7f7f ffff | 0 | 1111 1110 | 1111 1111 1111 1111 1111 111 |
+| Positive Infinity | ∞ | 0x7f80 0000 | 0 | 1111 1111 | 0000 0000 0000 0000 0000 000 |
+
+### Significant f16 Values
+
+| Name | Decimal (~) | Hex | Sign Bit | Exponent Bits | Significand Bits |
+|------------------------|--------------:|-------:|---------:|--------------:|-----------------:|
+| Negative Infinity | -∞ | 0xfc00 | 1 | 111 11 | 00 0000 0000 |
+| Min Negative Normal | -65504 | 0xfbff | 1 | 111 10 | 11 1111 1111 |
+| Max Negative Normal | -6.1035156E−5 | 0x8400 | 1 | 000 01 | 00 0000 0000 |
+| Min Negative Subnormal | -6.0975552E−5 | 0x83ff | 1 | 000 00 | 11 1111 1111 |
+| Max Negative Subnormal | -5.9604645E−8 | 0x8001 | 1 | 000 00 | 00 0000 0001 |
+| Negative Zero | -0 | 0x8000 | 1 | 000 00 | 00 0000 0000 |
+| Positive Zero | 0 | 0x0000 | 0 | 000 00 | 00 0000 0000 |
+| Min Positive Subnormal | 5.9604645E−8 | 0x0001 | 0 | 000 00 | 00 0000 0001 |
+| Max Positive Subnormal | 6.0975552E−5 | 0x03ff | 0 | 000 00 | 11 1111 1111 |
+| Min Positive Normal | 6.1035156E−5 | 0x0400 | 0 | 000 01 | 00 0000 0000 |
+| Max Positive Normal | 65504 | 0x7bff | 0 | 111 10 | 11 1111 1111 |
+| Positive Infinity | ∞ | 0x7c00 | 0 | 111 11 | 00 0000 0000 |
+
+# Resources
+- [WebGPU Spec](https://www.w3.org/TR/webgpu/)
+- [WGSL Spec](https://www.w3.org/TR/WGSL/)
+- [binary64 on Wikipedia](https://en.wikipedia.org/wiki/Double-precision_floating-point_format)
+- [binary32 on Wikipedia](https://en.wikipedia.org/wiki/Single-precision_floating-point_format)
+- [binary16 on Wikipedia](https://en.wikipedia.org/wiki/Half-precision_floating-point_format)
+- [IEEE-754 Floating Point Converter](https://www.h-schmidt.net/FloatConverter/IEEE754.html)
+- [IEEE 754 Calculator](http://weitz.de/ieee/)
+- [On the definition of ulp(x)](https://hal.inria.fr/inria-00070503/document)
+- [Float Exposed](https://float.exposed/)