author | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-04-21 11:54:28 +0000 |
---|---|---|
committer | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-04-21 11:54:28 +0000 |
commit | e6918187568dbd01842d8d1d2c808ce16a894239 (patch) | |
tree | 64f88b554b444a49f656b6c656111a145cbbaa28 /src/arrow/docs/source/format/Integration.rst | |
parent | Initial commit. (diff) | |
Adding upstream version 18.2.2. (upstream/18.2.2)
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'src/arrow/docs/source/format/Integration.rst')
-rw-r--r-- | src/arrow/docs/source/format/Integration.rst | 398 |
1 files changed, 398 insertions, 0 deletions
diff --git a/src/arrow/docs/source/format/Integration.rst b/src/arrow/docs/source/format/Integration.rst
new file mode 100644
index 000000000..22d595e99
--- /dev/null
+++ b/src/arrow/docs/source/format/Integration.rst
@@ -0,0 +1,398 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements. See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership. The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License. You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied. See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. _format_integration_testing:
+
+Integration Testing
+===================
+
+Our strategy for integration testing between Arrow implementations is:
+
+* Test datasets are specified in a custom human-readable, JSON-based format
+  designed exclusively for Arrow's integration tests
+* Each implementation provides a testing executable capable of converting
+  between the JSON and the binary Arrow file representation
+* The test executable is also capable of validating the contents of a binary
+  file against a corresponding JSON file
+
+Running integration tests
+-------------------------
+
+The integration test data generator and runner are implemented inside
+the :ref:`Archery <archery>` utility.
+
+The integration tests are run using the ``archery integration`` command.
+
+.. code-block:: shell
+
+   archery integration --help
+
+In order to run integration tests, you'll first need to build each component
+you want to include.
+See the respective developer docs for C++, Java, etc.
+for instructions on building those.
+
+Some languages may require additional build options to enable integration
+testing. For C++, for example, you need to add ``-DARROW_BUILD_INTEGRATION=ON``
+to your cmake command.
+
+Depending on which components you have built, you can enable and add them to
+the archery test run. For example, if you only have the C++ project built, run:
+
+.. code-block:: shell
+
+   archery integration --with-cpp=1
+
+
+For Java, it may look like:
+
+.. code-block:: shell
+
+   VERSION=0.11.0-SNAPSHOT
+   export ARROW_JAVA_INTEGRATION_JAR=$JAVA_DIR/tools/target/arrow-tools-$VERSION-jar-with-dependencies.jar
+   archery integration --with-cpp=1 --with-java=1
+
+To run all tests, including Flight integration tests, do:
+
+.. code-block:: shell
+
+   archery integration --with-all --run-flight
+
+Note that we run these tests in continuous integration, and the CI job uses
+docker-compose. You may also run the docker-compose job locally, or at least
+refer to it if you have questions about how to build other languages or enable
+certain tests.
+
+See :ref:`docker-builds` for more information about the project's
+``docker-compose`` configuration.
+
+JSON test data format
+---------------------
+
+A JSON representation of Arrow columnar data is provided for
+cross-language integration testing purposes.
+This representation is `not canonical <https://lists.apache.org/thread.html/6947fb7666a0f9cc27d9677d2dad0fb5990f9063b7cf3d80af5e270f%40%3Cdev.arrow.apache.org%3E>`_
+but it provides a human-readable way of verifying language implementations.
+
+See `here <https://github.com/apache/arrow/tree/master/docs/source/format/integration_json_examples>`_
+for some examples of this JSON data.
+
+.. can we check in more examples, e.g. from the generated_*.json test files?
+
+The high level structure of a JSON integration test file is as follows:
+
+**Data file** ::
+
+    {
+      "schema": /*Schema*/,
+      "batches": [ /*RecordBatch*/ ],
+      "dictionaries": [ /*DictionaryBatch*/ ]
+    }
+
+All files contain ``schema`` and ``batches``, while ``dictionaries`` is only
+present if there are dictionary type fields in the schema.
+
+**Schema** ::
+
+    {
+      "fields" : [
+        /* Field */
+      ],
+      "metadata" : /* Metadata */
+    }
+
+**Field** ::
+
+    {
+      "name" : "name_of_the_field",
+      "nullable" : /* boolean */,
+      "type" : /* Type */,
+      "children" : [ /* Field */ ],
+      "dictionary": {
+        "id": /* integer */,
+        "indexType": /* Type */,
+        "isOrdered": /* boolean */
+      },
+      "metadata" : /* Metadata */
+    }
+
+The ``dictionary`` attribute is present if and only if the ``Field`` corresponds to a
+dictionary type, and its ``id`` maps onto a column in the ``DictionaryBatch``. In this
+case the ``type`` attribute describes the value type of the dictionary.
+
+For primitive types, ``children`` is an empty array.
+
+**Metadata** ::
+
+    null |
+    [ {
+      "key": /* string */,
+      "value": /* string */
+    } ]
+
+A key-value mapping of custom metadata. It may be omitted or null, in which case it is
+considered equivalent to ``[]`` (no metadata). Duplicated keys are not forbidden here.
+
+**Type**: ::
+
+    {
+      "name" : "null|struct|list|largelist|fixedsizelist|union|int|floatingpoint|utf8|largeutf8|binary|largebinary|fixedsizebinary|bool|decimal|date|time|timestamp|interval|duration|map"
+    }
+
+A ``Type`` will have other fields as defined in
+`Schema.fbs <https://github.com/apache/arrow/tree/master/format/Schema.fbs>`_
+depending on its name.
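As a rough illustration of the structure above, the following Python sketch assembles a minimal data file with a single non-dictionary field and normalizes the ``metadata`` member. The helper names (``normalize_metadata``, ``make_field``, ``make_data_file``) are hypothetical and are not part of Archery or any Arrow library. The ``bool`` type is used because it needs no attributes beyond its name, and the empty ``batches`` list is only a placeholder.

```python
import json


def normalize_metadata(metadata):
    """Treat omitted or null metadata as an empty list of key/value pairs."""
    if metadata is None:
        return []
    # Keep a list of pairs rather than a dict: duplicate keys are allowed.
    return [{"key": str(kv["key"]), "value": str(kv["value"])} for kv in metadata]


def make_field(name, type_, nullable=True, children=None, metadata=None):
    """Build a Field object for a non-dictionary column (hypothetical helper)."""
    field = {
        "name": name,
        "nullable": nullable,
        "type": type_,
        # Primitive types carry an empty "children" array.
        "children": children or [],
    }
    if metadata is not None:
        field["metadata"] = normalize_metadata(metadata)
    return field


def make_data_file(fields, batches, dictionaries=None):
    """Assemble the top-level object; "dictionaries" is added only when given."""
    data_file = {"schema": {"fields": fields}, "batches": batches}
    if dictionaries:
        data_file["dictionaries"] = dictionaries
    return data_file


flag = make_field("flag", {"name": "bool"}, metadata=[{"key": "k", "value": "v"}])
doc = make_data_file([flag], batches=[])
print(json.dumps(doc, indent=2))
```

Metadata is kept as a list of pairs so that duplicated keys survive a round trip, which a plain dictionary would not allow.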
+
+Int: ::
+
+    {
+      "name" : "int",
+      "bitWidth" : /* integer */,
+      "isSigned" : /* boolean */
+    }
+
+FloatingPoint: ::
+
+    {
+      "name" : "floatingpoint",
+      "precision" : "HALF|SINGLE|DOUBLE"
+    }
+
+FixedSizeBinary: ::
+
+    {
+      "name" : "fixedsizebinary",
+      "byteWidth" : /* byte width */
+    }
+
+Decimal: ::
+
+    {
+      "name" : "decimal",
+      "precision" : /* integer */,
+      "scale" : /* integer */
+    }
+
+Timestamp: ::
+
+    {
+      "name" : "timestamp",
+      "unit" : "$TIME_UNIT",
+      "timezone": "$timezone"
+    }
+
+``$TIME_UNIT`` is one of ``"SECOND|MILLISECOND|MICROSECOND|NANOSECOND"``
+
+"timezone" is an optional string.
+
+Duration: ::
+
+    {
+      "name" : "duration",
+      "unit" : "$TIME_UNIT"
+    }
+
+Date: ::
+
+    {
+      "name" : "date",
+      "unit" : "DAY|MILLISECOND"
+    }
+
+Time: ::
+
+    {
+      "name" : "time",
+      "unit" : "$TIME_UNIT",
+      "bitWidth": /* integer: 32 or 64 */
+    }
+
+Interval: ::
+
+    {
+      "name" : "interval",
+      "unit" : "YEAR_MONTH|DAY_TIME"
+    }
+
+Union: ::
+
+    {
+      "name" : "union",
+      "mode" : "SPARSE|DENSE",
+      "typeIds" : [ /* integer */ ]
+    }
+
+The ``typeIds`` field in ``Union`` lists the codes used to denote which member of
+the union is active in each array slot. Note that in general these discriminants are not identical
+to the index of the corresponding child array.
+
+List: ::
+
+    {
+      "name": "list"
+    }
+
+The type that the list is a "list of" will be included in the ``Field``'s
+"children" member, as a single ``Field`` there. For example, for a list of
+``int32``, ::
+
+    {
+      "name": "list_nullable",
+      "type": {
+        "name": "list"
+      },
+      "nullable": true,
+      "children": [
+        {
+          "name": "item",
+          "type": {
+            "name": "int",
+            "isSigned": true,
+            "bitWidth": 32
+          },
+          "nullable": true,
+          "children": []
+        }
+      ]
+    }
+
+FixedSizeList: ::
+
+    {
+      "name": "fixedsizelist",
+      "listSize": /* integer */
+    }
+
+This type likewise comes with a length-1 "children" array.
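The list layouts above can be sketched in Python as follows. This is a minimal sketch, not an Arrow API: ``int_type`` and ``list_field`` are hypothetical helpers, and the child name ``item`` simply mirrors the example above rather than a stated requirement.

```python
def int_type(bit_width, signed=True):
    """Type object for an integer of the given bit width."""
    return {"name": "int", "bitWidth": bit_width, "isSigned": signed}


def list_field(name, child, list_size=None, nullable=True):
    """Field for "list" or, when list_size is given, "fixedsizelist"."""
    if list_size is None:
        type_ = {"name": "list"}
    else:
        type_ = {"name": "fixedsizelist", "listSize": list_size}
    return {
        "name": name,
        "type": type_,
        "nullable": nullable,
        # The element type lives in a length-1 "children" array.
        "children": [child],
    }


item = {"name": "item", "type": int_type(32), "nullable": True, "children": []}
variable = list_field("list_nullable", item)
fixed = list_field("fixed_list", item, list_size=4)
```

Both variants describe their element type only through the single child ``Field``; the ``Type`` object itself carries at most ``listSize``.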
+
+Struct: ::
+
+    {
+      "name": "struct"
+    }
+
+The ``Field``'s "children" contains an array of ``Fields`` with meaningful
+names and types.
+
+Map: ::
+
+    {
+      "name": "map",
+      "keysSorted": /* boolean */
+    }
+
+The ``Field``'s "children" contains a single ``struct`` field, which itself
+contains 2 children, named "key" and "value".
+
+Null: ::
+
+    {
+      "name": "null"
+    }
+
+Extension types are, as in the IPC format, represented as their underlying
+storage type plus some dedicated field metadata to reconstruct the extension
+type. For example, assuming a "uuid" extension type backed by a
+FixedSizeBinary(16) storage, here is how a "uuid" field would be represented::
+
+    {
+      "name" : "name_of_the_field",
+      "nullable" : /* boolean */,
+      "type" : {
+        "name" : "fixedsizebinary",
+        "byteWidth" : 16
+      },
+      "children" : [],
+      "metadata" : [
+        {"key": "ARROW:extension:name", "value": "uuid"},
+        {"key": "ARROW:extension:metadata", "value": "uuid-serialized"}
+      ]
+    }
+
+**RecordBatch**::
+
+    {
+      "count": /* integer number of rows */,
+      "columns": [ /* FieldData */ ]
+    }
+
+**DictionaryBatch**::
+
+    {
+      "id": /* integer */,
+      "data": [ /* RecordBatch */ ]
+    }
+
+**FieldData**::
+
+    {
+      "name": "field_name",
+      "count": "field_length",
+      "$BUFFER_TYPE": /* BufferData */
+      ...
+      "$BUFFER_TYPE": /* BufferData */
+      "children": [ /* FieldData */ ]
+    }
+
+The "name" member of a ``Field`` in the ``Schema`` corresponds to the "name"
+of a ``FieldData`` contained in the "columns" of a ``RecordBatch``.
+For nested types (list, struct, etc.), ``Field``'s "children" each have a
+"name" that corresponds to the "name" of a ``FieldData`` inside the
+"children" of that ``FieldData``.
+For ``FieldData`` inside of a ``DictionaryBatch``, the "name" field does not
+correspond to anything.
+
+Here ``$BUFFER_TYPE`` is one of ``VALIDITY``, ``OFFSET`` (for
+variable-length types, such as strings and lists), ``TYPE_ID`` (for unions),
+or ``DATA``.
+
+``BufferData`` is encoded based on the type of buffer:
+
+* ``VALIDITY``: a JSON array of 1 (valid) and 0 (null). Data for non-nullable
+  ``Field`` still has a ``VALIDITY`` array, even though all values are 1.
+* ``OFFSET``: a JSON array of integers for 32-bit offsets or
+  string-formatted integers for 64-bit offsets
+* ``TYPE_ID``: a JSON array of integers
+* ``DATA``: a JSON array of encoded values
+
+The value encoding for ``DATA`` is different depending on the logical
+type:
+
+* For boolean type: an array of 1 (true) and 0 (false).
+* For integer-based types (including timestamps): an array of JSON numbers.
+* For 64-bit integers: an array of integers formatted as JSON strings,
+  so as to avoid loss of precision.
+* For floating point types: an array of JSON numbers. Values are limited
+  to 3 decimal places to avoid loss of precision.
+* For binary types, an array of uppercase hex-encoded strings, so as
+  to represent arbitrary binary data.
+* For UTF-8 string types, an array of JSON strings.
+
+For "list" and "largelist" types, ``BufferData`` has ``VALIDITY`` and
+``OFFSET``, and the rest of the data is inside "children". These child
+``FieldData`` contain all of the same attributes as non-child data, so in
+the example of a list of ``int32``, the child data has ``VALIDITY`` and
+``DATA``.
+
+For "fixedsizelist", there is no ``OFFSET`` member because the offsets are
+implied by the field's "listSize".
+
+Note that the "count" for these child data may not match the parent "count".
+For example, if a ``RecordBatch`` has 7 rows and contains a ``FixedSizeList``
+of ``listSize`` 4, then the data inside the "children" of that ``FieldData``
+will have count 28.
+
+For "null" type, ``BufferData`` does not contain any buffers.
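The buffer encodings above can be illustrated with a short Python sketch. All function names here are hypothetical and not part of Archery. The placeholder written for null slots (``"0"`` or an empty string) is an assumption made for this example, since the text above does not say what null slots should contain. The binary helper covers only the ``DATA`` member; a complete binary column would also need ``OFFSET``.

```python
def encode_validity(values):
    """VALIDITY buffer: 1 for a valid slot, 0 for a null slot."""
    return [0 if v is None else 1 for v in values]


def encode_int64(name, values):
    """FieldData for an int64 column; 64-bit values become JSON strings."""
    return {
        "name": name,
        "count": len(values),
        "VALIDITY": encode_validity(values),
        # Assumption: null slots get the arbitrary placeholder "0".
        "DATA": [str(0 if v is None else v) for v in values],
    }


def encode_binary_data(values):
    """DATA entries for a binary column: uppercase hex strings."""
    # Assumption: null slots are written as an empty string.
    return [(b"" if v is None else v).hex().upper() for v in values]


def fixed_size_list_child_count(row_count, list_size):
    """Number of values in the child FieldData of a FixedSizeList column."""
    return row_count * list_size


column = encode_int64("big", [2**53 + 1, None, -7])
```

Formatting 64-bit integers as strings matters for values such as ``2**53 + 1``, which a JSON parser that stores numbers as doubles cannot represent exactly.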