From 16f504a9dca3fe3b70568f67b7d41241ae485288 Mon Sep 17 00:00:00 2001 From: Daniel Baumann Date: Sun, 7 Apr 2024 18:49:04 +0200 Subject: Adding upstream version 7.0.6-dfsg. Signed-off-by: Daniel Baumann --- src/libs/softfloat-3e/doc/SoftFloat.html | 1527 ++++++++++++++++++++++++++++++ 1 file changed, 1527 insertions(+) create mode 100644 src/libs/softfloat-3e/doc/SoftFloat.html (limited to 'src/libs/softfloat-3e/doc/SoftFloat.html') diff --git a/src/libs/softfloat-3e/doc/SoftFloat.html b/src/libs/softfloat-3e/doc/SoftFloat.html new file mode 100644 index 00000000..b72b407f --- /dev/null +++ b/src/libs/softfloat-3e/doc/SoftFloat.html @@ -0,0 +1,1527 @@ + + + + +Berkeley SoftFloat Library Interface + + + + +

Berkeley SoftFloat Release 3e: Library Interface

+ +

+John R. Hauser
+2018 January 20
+

+ + +

+ +

+ +++ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +

1. Introduction
2. Limitations
3. Acknowledgments and License
4. Types and Functions
4.1. Boolean and Integer Types
4.2. Floating-Point Types
4.3. Supported Floating-Point Functions
4.4. Non-canonical Representations in extFloat80_t
4.5. Conventions for Passing Arguments and Results
5. Reserved Names
6. Mode Variables
6.1. Rounding Mode
6.2. Underflow Detection
6.3. Rounding Precision for the 80-Bit Extended Format
7. Exceptions and Exception Flags
8. Function Details
8.1. Conversions from Integer to Floating-Point
8.2. Conversions from Floating-Point to Integer
8.3. Conversions Among Floating-Point Types
8.4. Basic Arithmetic Functions
8.5. Fused Multiply-Add Functions
8.6. Remainder Functions
8.7. Round-to-Integer Functions
8.8. Comparison Functions
8.9. Signaling NaN Test Functions
8.10. Raise-Exception Function
9. Changes from SoftFloat Release 2
9.1. Name Changes
9.2. Changes to Function Arguments
9.3. Added Capabilities
9.4. Better Compatibility with the C Language
9.5. New Organization as a Library
9.6. Optimization Gains (and Losses)
10. Future Directions
11. Contact Information
+

+ + +

1. Introduction

+ +

+Berkeley SoftFloat is a software implementation of binary floating-point that +conforms to the IEEE Standard for Floating-Point Arithmetic. +The current release supports five binary formats: 16-bit +half-precision, 32-bit single-precision, 64-bit +double-precision, 80-bit double-extended-precision, and +128-bit quadruple-precision. +The following functions are supported for each format: +

+addition, subtraction, multiplication, division, and square root; +
+fused multiply-add as defined by the IEEE Standard, except for +80-bit double-extended-precision; +
+remainder as defined by the IEEE Standard; +
+round to integral value; +
+comparisons; +
+conversions to/from other supported formats; and +
+conversions to/from 32-bit and 64-bit integers, +signed and unsigned. +

+All operations required by the original 1985 version of the IEEE Floating-Point +Standard are implemented, except for conversions to and from decimal. +

+ +

+This document gives information about the types defined and the routines +implemented by SoftFloat. +It does not attempt to define or explain the IEEE Floating-Point Standard. +Information about the standard is available elsewhere. +

+ +

+The current version of SoftFloat is Release 3e. +This release modifies the behavior of the rarely used odd rounding mode +(round to odd, also known as jamming), and also adds some new +specialization and optimization examples for those compiling SoftFloat. +

+ +

+The previous Release 3d fixed bugs that were found in the square +root functions for the 64-bit, 80-bit, and +128-bit floating-point formats. +(Thanks to Alexei Sibidanov at the University of Victoria for reporting an +incorrect result.) +The bugs affected all prior Release-3 versions of SoftFloat +through 3c. +The flaw in the 64-bit floating-point square root function was of +very minor impact, causing a 1-ulp error (1 unit in +the last place) a few times out of a billion. +The bugs in the 80-bit and 128-bit square root +functions were more serious. +Although incorrect results again occurred only a few times out of a billion, +when they did occur a large portion of the less-significant bits could be +wrong. +

+ +

+Among earlier releases, 3b was notable for adding support for the +16-bit half-precision format. +For more about the evolution of SoftFloat releases, see +SoftFloat-history.html. +

+ +

+The functional interface of SoftFloat Release 3 and later differs +in many details from the releases that came before. +For specifics of these differences, see section 9 below, +Changes from SoftFloat Release 2. +

+ + +

2. Limitations

+ +

+SoftFloat assumes the computer has an addressable byte size of 8 or +16 bits. +(Nearly all computers in use today have 8-bit bytes.) +

+ +

+SoftFloat is written in C and is designed to work with other C code. +The C compiler used must conform at a minimum to the 1989 ANSI standard for the +C language (same as the 1990 ISO standard) and must in addition support basic +arithmetic on 64-bit integers. +Earlier releases of SoftFloat included implementations of 32-bit +single-precision and 64-bit double-precision floating-point that +did not require 64-bit integers, but this option is not supported +starting with Release 3. +Since 1999, ISO standards for C have mandated compiler support for +64-bit integers. +A compiler conforming to the 1999 C Standard or later is recommended but not +strictly required. +

+ +

+Most operations not required by the original 1985 version of the IEEE +Floating-Point Standard but added in the 2008 version are not yet supported in +SoftFloat Release 3e. +

+ + +

3. Acknowledgments and License

+ +

+The SoftFloat package was written by me, John R. Hauser. +Release 3 of SoftFloat was a completely new implementation +supplanting earlier releases. +The project to create Release 3 (now through 3e) was +done in the employ of the University of California, Berkeley, within the +Department of Electrical Engineering and Computer Sciences, first for the +Parallel Computing Laboratory (Par Lab) and then for the ASPIRE Lab. +The work was officially overseen by Prof. Krste Asanovic, with funding provided +by these sources: +

+ ++++ + + + + + + + + + +

Par Lab: +Microsoft (Award #024263), Intel (Award #024894), and U.C. Discovery +(Award #DIG07-10227), with additional support from Par Lab affiliates Nokia, +NVIDIA, Oracle, and Samsung. +
ASPIRE Lab: +DARPA PERFECT program (Award #HR0011-12-2-0016), with additional support from +ASPIRE industrial sponsor Intel and ASPIRE affiliates Google, Nokia, NVIDIA, +Oracle, and Samsung. +
+

+ +

+The following applies to the whole of SoftFloat Release 3e as well +as to each source file individually. +

+ +

+Redistribution and use in source and binary forms, with or without +modification, are permitted provided that the following conditions are met: +

+
+Redistributions of source code must retain the above copyright notice, this +list of conditions, and the following disclaimer. +
+ +
+
+Redistributions in binary form must reproduce the above copyright notice, this +list of conditions, and the following disclaimer in the documentation and/or +other materials provided with the distribution. +
+ +
+
+Neither the name of the University nor the names of its contributors may be +used to endorse or promote products derived from this software without specific +prior written permission. +
+ +

+ +

+THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS “AS IS”, +AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE, ARE +DISCLAIMED. +IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, +INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, +BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, +DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF +LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE +OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF +ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. +

+ + +

4. Types and Functions

+ +

+The types and functions of SoftFloat are declared in header file +softfloat.h. +

+ +

4.1. Boolean and Integer Types

+ +

+Header file softfloat.h depends on standard headers +<stdbool.h> and <stdint.h> to define type +bool and several integer types. +These standard headers have been part of the ISO C Standard Library since 1999. +With any recent compiler, they are likely to be supported, even if the compiler +does not claim complete conformance to the latest ISO C Standard. +For older or nonstandard compilers, a port of SoftFloat may have substitutes +for these headers. +Header softfloat.h depends only on the name bool from +<stdbool.h> and on these type names from +<stdint.h>: +

+uint16_t
+uint32_t
+uint64_t
+int32_t
+int64_t
+uint_fast8_t
+uint_fast32_t
+uint_fast64_t
+int_fast32_t
+int_fast64_t
+

+ + +

4.2. Floating-Point Types

+ +

+The softfloat.h header defines five floating-point types: +

+ + + + + + + + + + + + + + + + + + + + + +
float16_t 16-bit half-precision binary format
float32_t 32-bit single-precision binary format
float64_t 64-bit double-precision binary format
extFloat80_t 80-bit double-extended-precision binary format (old Intel or +Motorola format)
float128_t 128-bit quadruple-precision binary format
+

+The non-extended types are each exactly the size specified: +16 bits for float16_t, 32 bits for +float32_t, 64 bits for float64_t, and +128 bits for float128_t. +Aside from these size requirements, the definitions of all these types may +differ for different ports of SoftFloat to specific systems. +A given port of SoftFloat may or may not define some of the floating-point +types as aliases for the C standard types float, +double, and long double. +

+ +

+Header file softfloat.h also defines a structure, +struct extFloat80M, for the representation of +80-bit double-extended-precision floating-point values in memory. +This structure is the same size as type extFloat80_t and contains +at least these two fields (not necessarily in this order): +

+uint16_t signExp;
+uint64_t signif;
+

+Field signExp contains the sign and exponent of the floating-point +value, with the sign in the most significant bit (bit 15) and the +encoded exponent in the other 15 bits. +Field signif is the complete 64-bit significand of +the floating-point value. +(In the usual encoding for 80-bit extended floating-point, the +leading 1 bit of normalized numbers is not implicit but is stored +in the most significant bit of the significand.) +

+ +

4.3. Supported Floating-Point Functions

+ +

+SoftFloat implements these arithmetic operations for its floating-point types: +

+conversions between any two floating-point formats; +
+for each floating-point format, conversions to and from signed and unsigned +32-bit and 64-bit integers; +
+for each format, the usual addition, subtraction, multiplication, division, and +square root operations; +
+for each format except extFloat80_t, the fused multiply-add +operation defined by the IEEE Standard; +
+for each format, the floating-point remainder operation defined by the IEEE +Standard; +
+for each format, a “round to integer” operation that rounds to the +nearest integer value in the same format; and +
+comparisons between two values in the same floating-point format. +

+ +

+The following operations required by the 2008 IEEE Floating-Point Standard are +not supported in SoftFloat Release 3e: +

+nextUp, nextDown, minNum, maxNum, minNumMag, +maxNumMag, scaleB, and logB; +
+conversions between floating-point formats and decimal or hexadecimal character +sequences; +
+all “quiet-computation” operations (copy, negate, +abs, and copySign, which all involve only simple copying and/or +manipulation of the floating-point sign bit); and +
+all “non-computational” operations other than isSignaling +(which is supported). +

+ +

4.4. Non-canonical Representations in `extFloat80_t`

+ +

+Because the 80-bit double-extended-precision format, +extFloat80_t, stores an explicit leading significand bit, many +finite floating-point numbers are encodable in this type in multiple equivalent +forms. +Of these multiple encodings, there is always a unique one with the least +encoded exponent value, and this encoding is considered the canonical +representation of the floating-point number. +Any other equivalent representations (having a higher encoded exponent value) +are non-canonical. +For a value in the subnormal range (including zero), the canonical +representation always has an encoded exponent of zero and a leading significand +bit of 0. +For finite values outside the subnormal range, the canonical representation +always has an encoded exponent that is nonzero and a leading significand bit +of 1. +

+ +

+For an infinity or NaN, the leading significand bit is similarly expected to +be 1. +An infinity or NaN with a leading significand bit of 0 is again +considered non-canonical. +Hence, altogether, to be canonical, a value of type extFloat80_t +must have a leading significand bit of 1, unless the value is +subnormal or zero, in which case the leading significand bit and the encoded +exponent must both be zero. +

+ +

+SoftFloat’s functions are not guaranteed to operate as expected when +inputs of type extFloat80_t are non-canonical. +Assuming all of a function’s extFloat80_t inputs (if any) +are canonical, function outputs of type extFloat80_t will always +be canonical. +

+ +

4.5. Conventions for Passing Arguments and Results

+ +

+Values that are at most 64 bits in size (i.e., not the +80-bit or 128-bit floating-point formats) are in all +cases passed as function arguments by value. +Likewise, when an output of a function is no more than 64 bits, it +is always returned directly as the function result. +Thus, for example, the SoftFloat function for adding two 64-bit +floating-point values has this simple signature: +

+float64_t f64_add( float64_t, float64_t ); +

+ +

+The story is more complex when function inputs and outputs are +80-bit and 128-bit floating-point. +For these types, SoftFloat always provides a function that passes these larger +values into or out of the function indirectly, via pointers. +For example, for adding two 128-bit floating-point values, +SoftFloat supplies this function: +

+void f128M_add( const float128_t *, const float128_t *, float128_t * ); +

+The first two arguments point to the values to be added, and the last argument +points to the location where the sum will be stored. +The M in the name f128M_add is mnemonic for the fact +that the 128-bit inputs and outputs are “in memory”, +pointed to by pointer arguments. +

+ +

+All ports of SoftFloat implement these pass-by-pointer functions for +types extFloat80_t and float128_t. +At the same time, SoftFloat ports may also implement alternate versions of +these same functions that pass extFloat80_t and +float128_t by value, like the smaller formats. +Thus, besides the function with name f128M_add shown above, a +SoftFloat port may also supply an equivalent function with this signature: +

+float128_t f128_add( float128_t, float128_t ); +

+ +

+As a general rule, on computers where the machine word size is +32 bits or smaller, only the pass-by-pointer versions of functions +(e.g., f128M_add) are provided for types extFloat80_t +and float128_t, because passing such large types directly can have +significant extra cost. +On computers where the word size is 64 bits or larger, both +function versions (f128M_add and f128_add) are +provided, because the cost of passing by value is then more reasonable. +Applications that must be portable accross both classes of computers must use +the pointer-based functions, as these are always implemented. +However, if it is known that SoftFloat includes the by-value functions for all +platforms of interest, programmers can use whichever version they prefer. +

+ + +

5. Reserved Names

+ +

+In addition to the variables and functions documented here, SoftFloat defines +some symbol names for its own private use. +These private names always begin with the prefix +‘softfloat_’. +When a program includes header softfloat.h or links with the +SoftFloat library, all names with prefix ‘softfloat_’ +are reserved for possible use by SoftFloat. +Applications that use SoftFloat should not define their own names with this +prefix, and should reference only such names as are documented. +

+ + +

6. Mode Variables

+ +

+The following global variables control rounding mode, underflow detection, and +the 80-bit extended format’s rounding precision: +

+softfloat_roundingMode
+softfloat_detectTininess
+extF80_roundingPrecision +

+These mode variables are covered in the next several subsections. +For some SoftFloat ports, these variables may be per-thread (declared +thread_local), meaning that different execution threads have their +own separate copies of the variables. +

+ +

6.1. Rounding Mode

+ +

+All five rounding modes defined by the 2008 IEEE Floating-Point Standard are +implemented for all operations that require rounding. +Some ports of SoftFloat may also implement the round-to-odd mode. +

+ +

+The rounding mode is selected by the global variable +

+uint_fast8_t softfloat_roundingMode; +

+This variable may be set to one of the values +

+ + + + + + + + + + + + + + + + + + + + + + + + + +
softfloat_round_near_even round to nearest, with ties to even
softfloat_round_near_maxMag round to nearest, with ties to maximum magnitude (away from zero)
softfloat_round_minMag round to minimum magnitude (toward zero)
softfloat_round_min round to minimum (down)
softfloat_round_max round to maximum (up)
softfloat_round_odd round to odd (jamming), if supported by the SoftFloat port
+

+Variable softfloat_roundingMode is initialized to +softfloat_round_near_even. +

+ +

+When softfloat_round_odd is the rounding mode for a function that +rounds to an integer value (either conversion to an integer format or a +‘roundToInt’ function), if the input is not already an +integer, the rounded result is the closest odd integer. +For other operations, this rounding mode acts as though the floating-point +result is first rounded to minimum magnitude, the same as +softfloat_round_minMag, and then, if the result is inexact, the +least-significant bit of the result is set to 1. +Rounding to odd is also known as jamming. +

+ +

6.2. Underflow Detection

+ +

+In the terminology of the IEEE Standard, SoftFloat can detect tininess for +underflow either before or after rounding. +The choice is made by the global variable +

+uint_fast8_t softfloat_detectTininess; +

+which can be set to either +

+softfloat_tininess_beforeRounding
+softfloat_tininess_afterRounding +

+Detecting tininess after rounding is usually better because it results in fewer +spurious underflow signals. +The other option is provided for compatibility with some systems. +Like most systems (and as required by the newer 2008 IEEE Standard), SoftFloat +always detects loss of accuracy for underflow as an inexact result. +

+ +

6.3. Rounding Precision for the 80-Bit Extended Format

+ +

+For extFloat80_t only, the rounding precision of the basic +arithmetic operations is controlled by the global variable +

+uint_fast8_t extF80_roundingPrecision; +

+The operations affected are: +

+extF80_add
+extF80_sub
+extF80_mul
+extF80_div
+extF80_sqrt +

+When extF80_roundingPrecision is set to its default value of 80, +these operations are rounded to the full precision of the 80-bit +double-extended-precision format, like occurs for other formats. +Setting extF80_roundingPrecision to 32 or to 64 causes the +operations listed to be rounded to 32-bit precision (equivalent to +float32_t) or to 64-bit precision (equivalent to +float64_t), respectively. +When rounding to reduced precision, additional bits in the result significand +beyond the rounding point are set to zero. +The consequences of setting extF80_roundingPrecision to a value +other than 32, 64, or 80 is not specified. +Operations other than the ones listed above are not affected by +extF80_roundingPrecision. +

+ + +

7. Exceptions and Exception Flags

+ +

+All five exception flags required by the IEEE Floating-Point Standard are +implemented. +Each flag is stored as a separate bit in the global variable +

+uint_fast8_t softfloat_exceptionFlags; +

+The positions of the exception flag bits within this variable are determined by +the bit masks +

+softfloat_flag_inexact
+softfloat_flag_underflow
+softfloat_flag_overflow
+softfloat_flag_infinite
+softfloat_flag_invalid +

+Variable softfloat_exceptionFlags is initialized to all zeros, +meaning no exceptions. +

+ +

+For some SoftFloat ports, softfloat_exceptionFlags may be +per-thread (declared thread_local), meaning that different +execution threads have their own separate instances of it. +

+ +

+An individual exception flag can be cleared with the statement +

+softfloat_exceptionFlags &= ~softfloat_flag_<exception>; +

+where <exception> is the appropriate name. +To raise a floating-point exception, function softfloat_raiseFlags +should normally be used. +

+ +

+When SoftFloat detects an exception other than inexact, it calls +softfloat_raiseFlags. +The default version of this function simply raises the corresponding exception +flags. +Particular ports of SoftFloat may support alternate behavior, such as exception +traps, by modifying the default softfloat_raiseFlags. +A program may also supply its own softfloat_raiseFlags function to +override the one from the SoftFloat library. +

+ +

+Because inexact results occur frequently under most circumstances (and thus are +hardly exceptional), SoftFloat does not ordinarily call +softfloat_raiseFlags for inexact exceptions. +It does always raise the inexact exception flag as required. +

+ + +

8. Function Details

+ +

+In this section, <float> appears in function names as +a substitute for one of these abbreviations: +

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
f16 indicates float16_t, passed by value
f32 indicates float32_t, passed by value
f64 indicates float64_t, passed by value
extF80M indicates extFloat80_t, passed indirectly via pointers
extF80 indicates extFloat80_t, passed by value
f128M indicates float128_t, passed indirectly via pointers
f128 indicates float128_t, passed by value
+

+The circumstances under which values of floating-point types +extFloat80_t and float128_t may be passed either by +value or indirectly via pointers was discussed earlier in +section 4.5, Conventions for Passing Arguments and Results. +

+ +

8.1. Conversions from Integer to Floating-Point

+ +

+All conversions from a 32-bit or 64-bit integer, +signed or unsigned, to a floating-point format are supported. +Functions performing these conversions have these names: +

+ui32_to_<float>
+ui64_to_<float>
+i32_to_<float>
+i64_to_<float> +

+Conversions from 32-bit integers to 64-bit +double-precision and larger formats are always exact, and likewise conversions +from 64-bit integers to 80-bit +double-extended-precision and 128-bit quadruple-precision are also +always exact. +

+ +

+Each conversion function takes one input of the appropriate type and generates +one output. +The following illustrates the signatures of these functions in cases when the +floating-point result is passed either by value or via pointers: +

+float64_t i32_to_f64( int32_t a );
+

+void i32_to_f128M( int32_t a, float128_t *destPtr );
+

+ +

8.2. Conversions from Floating-Point to Integer

+ +

+Conversions from a floating-point format to a 32-bit or +64-bit integer, signed or unsigned, are supported with these +functions: +

+<float>_to_ui32
+<float>_to_ui64
+<float>_to_i32
+<float>_to_i64 +

+The functions have signatures as follows, depending on whether the +floating-point input is passed by value or via pointers: +

+int_fast32_t f64_to_i32( float64_t a, uint_fast8_t roundingMode, bool exact );
+

+int_fast32_t
+ f128M_to_i32( const float128_t *aPtr, uint_fast8_t roundingMode, bool exact );
+

+ +

+The roundingMode argument specifies the rounding mode for +the conversion. +The variable that usually indicates rounding mode, +softfloat_roundingMode, is ignored. +Argument exact determines whether the inexact +exception flag is raised if the conversion is not exact. +If exact is true, the inexact flag may +be raised; +otherwise, it will not be, even if the conversion is inexact. +

+ +

+A conversion from floating-point to integer format raises the invalid +exception if the source value cannot be rounded to a representable integer of +the desired size (32 or 64 bits). +In such circumstances, the integer result returned is determined by the +particular port of SoftFloat, although typically this value will be either the +maximum or minimum value of the integer format. +The functions that convert to integer types never raise the floating-point +overflow exception. +

+ +

+Because languages such as C require that conversions to integers +be rounded toward zero, the following functions are provided for improved speed +and convenience: +

+<float>_to_ui32_r_minMag
+<float>_to_ui64_r_minMag
+<float>_to_i32_r_minMag
+<float>_to_i64_r_minMag +

+These functions round only toward zero (to minimum magnitude). +The signatures for these functions are the same as above without the redundant +roundingMode argument: +

+int_fast32_t f64_to_i32_r_minMag( float64_t a, bool exact );
+

+int_fast32_t f128M_to_i32_r_minMag( const float128_t *aPtr, bool exact );
+

+ +

8.3. Conversions Among Floating-Point Types

+ +

+Conversions between floating-point formats are done by functions with these +names: +

+<float>_to_<float> +

+All combinations of source and result type are supported where the source and +result are different formats. +There are four different styles of signature for these functions, depending on +whether the input and the output floating-point values are passed by value or +via pointers: +

+float32_t f64_to_f32( float64_t a );
+

+float32_t f128M_to_f32( const float128_t *aPtr );
+

+void f32_to_f128M( float32_t a, float128_t *destPtr );
+

+void extF80M_to_f128M( const extFloat80_t *aPtr, float128_t *destPtr );
+

+ +

+Conversions from a smaller to a larger floating-point format are always exact +and so require no rounding. +

+ +

8.4. Basic Arithmetic Functions

+ +

+The following basic arithmetic functions are provided: +

+<float>_add
+<float>_sub
+<float>_mul
+<float>_div
+<float>_sqrt +

+Each floating-point operation takes two operands, except for sqrt +(square root) which takes only one. +The operands and result are all of the same floating-point format. +Signatures for these functions take the following forms: +

+float64_t f64_add( float64_t a, float64_t b );
+

+void
+ f128M_add(
+     const float128_t *aPtr, const float128_t *bPtr, float128_t *destPtr );
+

+float64_t f64_sqrt( float64_t a );
+

+void f128M_sqrt( const float128_t *aPtr, float128_t *destPtr );
+

+When floating-point values are passed indirectly through pointers, arguments +aPtr and bPtr point to the input +operands, and the last argument, destPtr, points to the +location where the result is stored. +

+ +

+Rounding of the 80-bit double-extended-precision +(extFloat80_t) functions is affected by variable +extF80_roundingPrecision, as explained earlier in +section 6.3, +Rounding Precision for the 80-Bit Extended Format. +

+ +

8.5. Fused Multiply-Add Functions

+ +

+The 2008 version of the IEEE Floating-Point Standard defines a fused +multiply-add operation that does a combined multiplication and addition +with only a single rounding. +SoftFloat implements fused multiply-add with functions +

+<float>_mulAdd +

+Unlike other operations, fused multiple-add is not supported for the +80-bit double-extended-precision format, +extFloat80_t. +

+ +

+Depending on whether floating-point values are passed by value or via pointers, +the fused multiply-add functions have signatures of these forms: +

+float64_t f64_mulAdd( float64_t a, float64_t b, float64_t c );
+

+void
+ f128M_mulAdd(
+     const float128_t *aPtr,
+     const float128_t *bPtr,
+     const float128_t *cPtr,
+     float128_t *destPtr
+ );
+

+The functions compute +(a × b) + + c +with a single rounding. +When floating-point values are passed indirectly through pointers, arguments +aPtr, bPtr, and +cPtr point to operands a, +b, and c respectively, and +destPtr points to the location where the result is stored. +

+ +

+If one of the multiplication operands a and +b is infinite and the other is zero, these functions raise +the invalid exception even if operand c is a quiet NaN. +

+ +

8.6. Remainder Functions

+ +

+For each format, SoftFloat implements the remainder operation defined by the +IEEE Floating-Point Standard. +The remainder functions have names +

+<float>_rem +

+Each remainder operation takes two floating-point operands of the same format +and returns a result in the same format. +Depending on whether floating-point values are passed by value or via pointers, +the remainder functions have signatures of these forms: +

+float64_t f64_rem( float64_t a, float64_t b );
+

+void
+ f128M_rem(
+     const float128_t *aPtr, const float128_t *bPtr, float128_t *destPtr );
+

+When floating-point values are passed indirectly through pointers, arguments +aPtr and bPtr point to operands +a and b respectively, and +destPtr points to the location where the result is stored. +

+ +

+The IEEE Standard remainder operation computes the value +a + − n × b, +where n is the integer closest to +a ÷ b. +If a ÷ b is exactly +halfway between two integers, n is the even integer closest to +a ÷ b. +The IEEE Standard’s remainder operation is always exact and so requires +no rounding. +

+ +

+Depending on the relative magnitudes of the operands, the remainder +functions can take considerably longer to execute than the other SoftFloat +functions. +This is an inherent characteristic of the remainder operation itself and is not +a flaw in the SoftFloat implementation. +

+ +

8.7. Round-to-Integer Functions

+ +

+For each format, SoftFloat implements the round-to-integer operation specified +by the IEEE Floating-Point Standard. +These functions are named +

+<float>_roundToInt +

+Each round-to-integer operation takes a single floating-point operand. +This operand is rounded to an integer according to a specified rounding mode, +and the resulting integer value is returned in the same floating-point format. +(Note that the result is not an integer type.) +

+ +

+The signatures of the round-to-integer functions are similar to those for +conversions to an integer type: +

+float64_t f64_roundToInt( float64_t a, uint_fast8_t roundingMode, bool exact );
+

+void
+ f128M_roundToInt(
+     const float128_t *aPtr,
+     uint_fast8_t roundingMode,
+     bool exact,
+     float128_t *destPtr
+ );
+

+When floating-point values are passed indirectly through pointers, +aPtr points to the input operand and +destPtr points to the location where the result is stored. +

+ +

+The roundingMode argument specifies the rounding mode to +apply. +The variable that usually indicates rounding mode, +softfloat_roundingMode, is ignored. +Argument exact determines whether the inexact +exception flag is raised if the conversion is not exact. +If exact is true, the inexact flag may +be raised; +otherwise, it will not be, even if the conversion is inexact. +

+ +

8.8. Comparison Functions

+ +

+For each format, the following floating-point comparison functions are +provided: +

+<float>_eq
+<float>_le
+<float>_lt +

+Each comparison takes two operands of the same type and returns a Boolean. +The abbreviation eq stands for “equal” (=); +le stands for “less than or equal” (≤); +and lt stands for “less than” (<). +Depending on whether the floating-point operands are passed by value or via +pointers, the comparison functions have signatures of these forms: +

+bool f64_eq( float64_t a, float64_t b );
+

+bool f128M_eq( const float128_t *aPtr, const float128_t *bPtr );
+

+ +

+The usual greater-than (>), greater-than-or-equal (≥), and not-equal +(≠) comparisons are easily obtained from the functions provided. +The not-equal function is just the logical complement of the equal function. +The greater-than-or-equal function is identical to the less-than-or-equal +function with the arguments in reverse order, and likewise the greater-than +function is identical to the less-than function with the arguments reversed. +

+ +

+The IEEE Floating-Point Standard specifies that the less-than-or-equal and +less-than comparisons by default raise the invalid exception if either +operand is any kind of NaN. +Equality comparisons, on the other hand, are defined by default to raise the +invalid exception only for signaling NaNs, not quiet NaNs. +For completeness, SoftFloat provides these complementary functions: +

+<float>_eq_signaling
+<float>_le_quiet
+<float>_lt_quiet +

+The signaling equality comparisons are identical to the default +equality comparisons except that the invalid exception is raised for any +NaN input, not just for signaling NaNs. +Similarly, the quiet comparison functions are identical to their +default counterparts except that the invalid exception is not raised for +quiet NaNs. +

+ +

8.9. Signaling NaN Test Functions

+ +

+Functions for testing whether a floating-point value is a signaling NaN are +provided with these names: +

+<float>_isSignalingNaN +

+The functions take one floating-point operand and return a Boolean indicating +whether the operand is a signaling NaN. +Accordingly, the functions have the forms +

+bool f64_isSignalingNaN( float64_t a );
+

+bool f128M_isSignalingNaN( const float128_t *aPtr );
+

+ +

8.10. Raise-Exception Function

+ +

+SoftFloat provides a single function for raising floating-point exceptions: +

+void softfloat_raiseFlags( uint_fast8_t exceptions );
+

+The exceptions argument is a mask indicating the set of +exceptions to raise. +(See earlier section 7, Exceptions and Exception Flags.) +In addition to setting the specified exception flags in variable +softfloat_exceptionFlags, the softfloat_raiseFlags +function may cause a trap or abort appropriate for the current system. +

+ + +

9. Changes from SoftFloat Release 2

+ +

+Apart from a change in the legal use license, Release 3 of +SoftFloat introduced numerous technical differences compared to earlier +releases. +

+ +

9.1. Name Changes

+ +

+The most obvious and pervasive difference compared to Release 2 +is that the names of most functions and variables have changed, even when the +behavior has not. +First, the floating-point types, the mode variables, the exception flags +variable, the function to raise exceptions, and various associated constants +have been renamed as follows: +

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
old name, Release 2: new name, Release 3:
float32 float32_t
float64 float64_t
floatx80 extFloat80_t
float128 float128_t
float_rounding_mode softfloat_roundingMode
float_round_nearest_even softfloat_round_near_even
float_round_to_zero softfloat_round_minMag
float_round_down softfloat_round_min
float_round_up softfloat_round_max
float_detect_tininess softfloat_detectTininess
float_tininess_before_rounding softfloat_tininess_beforeRounding
float_tininess_after_rounding softfloat_tininess_afterRounding
floatx80_rounding_precision extF80_roundingPrecision
float_exception_flags softfloat_exceptionFlags
float_flag_inexact softfloat_flag_inexact
float_flag_underflow softfloat_flag_underflow
float_flag_overflow softfloat_flag_overflow
float_flag_divbyzero softfloat_flag_infinite
float_flag_invalid softfloat_flag_invalid
float_raise softfloat_raiseFlags
+

+ +

+Furthermore, Release 3 adopted the following new abbreviations for +function names: +

+ + + + + + + + + + + +
used in names in Release 2: used in names in Release 3:
int32 i32
int64 i64
float32 f32
float64 f64
floatx80 extF80
float128 f128
+

+Thus, for example, the function to add two 32-bit floating-point +numbers, previously called float32_add in Release 2, +is now f32_add. +Lastly, there have been a few other changes to function names: +

+ + + + + + + + + + + + + + + + + + + + + +
used in names in Release 2: used in names in Release 3: relevant functions:
_round_to_zero _r_minMag conversions from floating-point to integer (section 8.2)
round_to_int roundToInt round-to-integer functions (section 8.7)
is_signaling_nan isSignalingNaN signaling NaN test functions (section 8.9)
+

+ +

9.2. Changes to Function Arguments

+ +

+Besides simple name changes, some operations were given a different interface +in Release 3 than they had in Release 2: +

+
+Since Release 3, integer arguments and results of functions have +standard types from header <stdint.h>, such as +uint32_t, whereas previously their types could be defined +differently for each port of SoftFloat, usually using traditional C types such +as unsigned int. +Likewise, functions in Release 3 and later pass Booleans as +standard type bool from <stdbool.h>, whereas +previously these were again passed as a port-specific type (usually +int). +
+ +
+
+As explained earlier in section 4.5, Conventions for Passing +Arguments and Results, SoftFloat functions in Release 3 and +later may pass 80-bit and 128-bit floating-point +values through pointers, meaning that functions take pointer arguments and then +read or write floating-point values at the locations indicated by the pointers. +In Release 2, floating-point arguments and results were always +passed by value, regardless of their size. +
+ +
+
+Functions that round to an integer have additional +roundingMode and exact arguments that +they did not have in Release 2. +Refer to sections 8.2 and 8.7 for descriptions of these functions +since Release 3. +For Release 2, the rounding mode, when needed, was taken from the +same global variable that affects the basic arithmetic operations (now called +softfloat_roundingMode but previously known as +float_rounding_mode). +Also, for Release 2, if the original floating-point input was not +an exact integer value, and if the invalid exception was not raised by +the function, the inexact exception was always raised. +Release 2 had no option to suppress raising inexact in this +case. +Applications using SoftFloat Release 3 or later can get the same +effect as Release 2 by passing variable +softfloat_roundingMode for argument +roundingMode and true for argument +exact. +
+ +

+ +

9.3. Added Capabilities

+ +

+With Release 3, some new features have been added that were not +present in Release 2: +

+
+A port of SoftFloat can now define any of the floating-point types +float32_t, float64_t, extFloat80_t, and +float128_t as aliases for C’s standard floating-point types +float, double, and long +double, using either #define or typedef. +This potential convenience was not supported under Release 2. +
+ +
+(Note, however, that there may be a performance cost to defining +SoftFloat’s floating-point types this way, depending on the platform and +the applications using SoftFloat. +Ports of SoftFloat may choose to forgo the convenience in favor of better +speed.) +
+ +
+
+As of Release 3b, 16-bit half-precision, +float16_t, is supported. +
+ +
+
+Functions have been added for converting between the floating-point types and +unsigned integers. +Release 2 supported only signed integers, not unsigned. +
+ +
+
+Fused multiply-add functions have been added for all floating-point formats +except 80-bit double-extended-precision, +extFloat80_t. +
+ +
+
+New rounding modes are supported: +softfloat_round_near_maxMag (round to nearest, with ties to +maximum magnitude, away from zero), and, as of Release 3c, +optional softfloat_round_odd (round to odd, also known as +jamming). +
+ +

+ +

9.4. Better Compatibility with the C Language

+ +

+Release 3 of SoftFloat was written to conform better to the ISO C +Standard’s rules for portability. +For example, older releases of SoftFloat employed type conversions in ways +that, while commonly practiced, are not fully defined by the C Standard. +Such problematic type conversions have generally been replaced by the use of +unions, the behavior around which is more strictly regulated these days. +

+ +

9.5. New Organization as a Library

+ +

+Starting with Release 3, SoftFloat now builds as a library. +Previously, SoftFloat compiled into a single, monolithic object file containing +all the SoftFloat functions, with the consequence that a program linking with +SoftFloat would get every SoftFloat function in its binary file even if only a +few functions were actually used. +With SoftFloat in the form of a library, a program that is linked by a standard +linker will include only those functions of SoftFloat that it needs and no +others. +

+ +

9.6. Optimization Gains (and Losses)

+ +

+Individual SoftFloat functions have been variously improved in +Release 3 compared to earlier releases. +In particular, better, faster algorithms have been deployed for the operations +of division, square root, and remainder. +For functions operating on the larger 80-bit and +128-bit formats, extFloat80_t and +float128_t, code size has also generally been reduced. +

+ +

+However, because Release 2 compiled all of SoftFloat together as a +single object file, compilers could make optimizations across function calls +when one SoftFloat function calls another. +Now that the functions of SoftFloat are compiled separately and only afterward +linked together into a program, there is not usually the same opportunity to +optimize across function calls. +Some loss of speed has been observed due to this change. +

+ + +

10. Future Directions

+ +

+The following improvements are anticipated for future releases of SoftFloat: +

+more functions from the 2008 version of the IEEE Floating-Point Standard; +
+consistent, defined behavior for non-canonical representations of extended +format extFloat80_t (discussed in section 4.4, +Non-canonical Representations in extFloat80_t). + +

+ + +

11. Contact Information

+ +

+At the time of this writing, the most up-to-date information about SoftFloat +and the latest release can be found at the Web page +http://www.jhauser.us/arithmetic/SoftFloat.html. +

+ + + + -- cgit v1.2.3

Par Lab:		+Microsoft (Award #024263), Intel (Award #024894), and U.C. Discovery +(Award #DIG07-10227), with additional support from Par Lab affiliates Nokia, +NVIDIA, Oracle, and Samsung. +
ASPIRE Lab:		+DARPA PERFECT program (Award #HR0011-12-2-0016), with additional support from +ASPIRE industrial sponsor Intel and ASPIRE affiliates Google, Nokia, NVIDIA, +Oracle, and Samsung. +

`float16_t`	16-bit half-precision binary format
`float32_t`	32-bit single-precision binary format
`float64_t`	64-bit double-precision binary format
`extFloat80_t`	80-bit double-extended-precision binary format (old Intel or +Motorola format)
`float128_t`	128-bit quadruple-precision binary format

`softfloat_round_near_even`	round to nearest, with ties to even
`softfloat_round_near_maxMag`	round to nearest, with ties to maximum magnitude (away from zero)
`softfloat_round_minMag`	round to minimum magnitude (toward zero)
`softfloat_round_min`	round to minimum (down)
`softfloat_round_max`	round to maximum (up)
`softfloat_round_odd`	round to odd (jamming), if supported by the SoftFloat port

`f16`	indicates `float16_t`, passed by value
`f32`	indicates `float32_t`, passed by value
`f64`	indicates `float64_t`, passed by value
`extF80M`	indicates `extFloat80_t`, passed indirectly via pointers
`extF80`	indicates `extFloat80_t`, passed by value
`f128M`	indicates `float128_t`, passed indirectly via pointers
`f128`	indicates `float128_t`, passed by value

old name, Release 2:	new name, Release 3:
`float32`	`float32_t`
`float64`	`float64_t`
`floatx80`	`extFloat80_t`
`float128`	`float128_t`
`float_rounding_mode`	`softfloat_roundingMode`
`float_round_nearest_even`	`softfloat_round_near_even`
`float_round_to_zero`	`softfloat_round_minMag`
`float_round_down`	`softfloat_round_min`
`float_round_up`	`softfloat_round_max`
`float_detect_tininess`	`softfloat_detectTininess`
`float_tininess_before_rounding`	`softfloat_tininess_beforeRounding`
`float_tininess_after_rounding`	`softfloat_tininess_afterRounding`
`floatx80_rounding_precision`	`extF80_roundingPrecision`
`float_exception_flags`	`softfloat_exceptionFlags`
`float_flag_inexact`	`softfloat_flag_inexact`
`float_flag_underflow`	`softfloat_flag_underflow`
`float_flag_overflow`	`softfloat_flag_overflow`
`float_flag_divbyzero`	`softfloat_flag_infinite`
`float_flag_invalid`	`softfloat_flag_invalid`
`float_raise`	`softfloat_raiseFlags`

used in names in Release 2:	used in names in Release 3:
`int32`	`i32`
`int64`	`i64`
`float32`	`f32`
`float64`	`f64`
`floatx80`	`extF80`
`float128`	`f128`

used in names in Release 2:	used in names in Release 3:	relevant functions:
`_round_to_zero`	`_r_minMag`	conversions from floating-point to integer (section 8.2)
`round_to_int`	`roundToInt`	round-to-integer functions (section 8.7)
`is_signaling_nan`	`isSignalingNaN`	signaling NaN test functions (section 8.9)

Berkeley SoftFloat Release 3e: Library Interface

Contents

1. Introduction

2. Limitations

3. Acknowledgments and License

4. Types and Functions

4.1. Boolean and Integer Types

4.2. Floating-Point Types

4.3. Supported Floating-Point Functions

4.4. Non-canonical Representations in extFloat80_t

4.5. Conventions for Passing Arguments and Results

5. Reserved Names

6. Mode Variables

6.1. Rounding Mode

6.2. Underflow Detection

6.3. Rounding Precision for the 80-Bit Extended Format

7. Exceptions and Exception Flags

8. Function Details

8.1. Conversions from Integer to Floating-Point

8.2. Conversions from Floating-Point to Integer

8.3. Conversions Among Floating-Point Types

8.4. Basic Arithmetic Functions

8.5. Fused Multiply-Add Functions

8.6. Remainder Functions

8.7. Round-to-Integer Functions

8.8. Comparison Functions

8.9. Signaling NaN Test Functions

8.10. Raise-Exception Function

9. Changes from SoftFloat Release 2

9.1. Name Changes

9.2. Changes to Function Arguments

9.3. Added Capabilities

9.4. Better Compatibility with the C Language

9.5. New Organization as a Library

9.6. Optimization Gains (and Losses)

10. Future Directions

11. Contact Information

4.4. Non-canonical Representations in `extFloat80_t`