authorDaniel Baumann <daniel.baumann@progress-linux.org>2024-04-21 11:54:28 +0000
committerDaniel Baumann <daniel.baumann@progress-linux.org>2024-04-21 11:54:28 +0000
commite6918187568dbd01842d8d1d2c808ce16a894239 (patch)
tree64f88b554b444a49f656b6c656111a145cbbaa28 /src/arrow/docs/source/format/Integration.rst
parentInitial commit. (diff)
Adding upstream version 18.2.2. (upstream/18.2.2)
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'src/arrow/docs/source/format/Integration.rst')
-rw-r--r--  src/arrow/docs/source/format/Integration.rst  |  398
1 file changed, 398 insertions, 0 deletions
diff --git a/src/arrow/docs/source/format/Integration.rst b/src/arrow/docs/source/format/Integration.rst
new file mode 100644
index 000000000..22d595e99
--- /dev/null
+++ b/src/arrow/docs/source/format/Integration.rst
@@ -0,0 +1,398 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements. See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership. The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License. You may obtain a copy of the License at
+
+.. http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied. See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. _format_integration_testing:
+
+Integration Testing
+===================
+
+Our strategy for integration testing between Arrow implementations is:
+
+* Test datasets are specified in a custom human-readable, JSON-based format
+ designed exclusively for Arrow's integration tests
+* Each implementation provides a testing executable capable of converting
+ between the JSON and the binary Arrow file representation
+* The test executable is also capable of validating the contents of a binary
+ file against a corresponding JSON file
+
+Running integration tests
+-------------------------
+
+The integration test data generator and runner are implemented inside
+the :ref:`Archery <archery>` utility.
+
+The integration tests are run using the ``archery integration`` command.
+
+.. code-block:: shell
+
+ archery integration --help
+
+In order to run integration tests, you'll first need to build each component
+you want to include. See the respective developer docs for C++, Java, etc.
+for instructions on building those.
+
+Some languages may require additional build options to enable integration
+testing. For C++, for example, you need to add ``-DARROW_BUILD_INTEGRATION=ON``
+to your cmake command.
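+
+As an illustrative sketch (the build directory layout and any additional
+options depend on your local setup), the C++ configure step could look like:
+
+.. code-block:: shell
+
+   mkdir -p cpp/build && cd cpp/build
+   cmake .. -DARROW_BUILD_INTEGRATION=ON
+   cmake --build .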
+
+Depending on which components you have built, you can enable and add them to
+the archery test run. For example, if you only have the C++ project built, run:
+
+.. code-block:: shell
+
+ archery integration --with-cpp=1
+
+
+For Java, it may look like:
+
+.. code-block:: shell
+
+ VERSION=0.11.0-SNAPSHOT
+ export ARROW_JAVA_INTEGRATION_JAR=$JAVA_DIR/tools/target/arrow-tools-$VERSION-jar-with-dependencies.jar
+ archery integration --with-cpp=1 --with-java=1
+
+To run all tests, including Flight integration tests, do:
+
+.. code-block:: shell
+
+ archery integration --with-all --run-flight
+
+Note that we run these tests in continuous integration, and the CI job uses
+docker-compose. You may also run the docker-compose job locally, or at least
+refer to it if you have questions about how to build other languages or enable
+certain tests.
+
+See :ref:`docker-builds` for more information about the project's
+``docker-compose`` configuration.
+
+JSON test data format
+---------------------
+
+A JSON representation of Arrow columnar data is provided for
+cross-language integration testing purposes.
+This representation is `not canonical <https://lists.apache.org/thread.html/6947fb7666a0f9cc27d9677d2dad0fb5990f9063b7cf3d80af5e270f%40%3Cdev.arrow.apache.org%3E>`_
+but it provides a human-readable way of verifying language implementations.
+
+See `here <https://github.com/apache/arrow/tree/master/docs/source/format/integration_json_examples>`_
+for some examples of this JSON data.
+
+.. can we check in more examples, e.g. from the generated_*.json test files?
+
+The high-level structure of a JSON integration test file is as follows:
+
+**Data file** ::
+
+  {
+    "schema": /*Schema*/,
+    "batches": [ /*RecordBatch*/ ],
+    "dictionaries": [ /*DictionaryBatch*/ ]
+  }
+
+All files contain ``schema`` and ``batches``, while ``dictionaries`` is only
+present if the schema contains dictionary-encoded fields.
+
+**Schema** ::
+
+ {
+ "fields" : [
+ /* Field */
+ ],
+ "metadata" : /* Metadata */
+ }
+
+**Field** ::
+
+ {
+ "name" : "name_of_the_field",
+ "nullable" : /* boolean */,
+ "type" : /* Type */,
+ "children" : [ /* Field */ ],
+ "dictionary": {
+ "id": /* integer */,
+ "indexType": /* Type */,
+ "isOrdered": /* boolean */
+ },
+ "metadata" : /* Metadata */
+ }
+
+The ``dictionary`` attribute is present if and only if the ``Field`` corresponds to a
+dictionary type, and its ``id`` maps onto a column in the ``DictionaryBatch``. In this
+case the ``type`` attribute describes the value type of the dictionary.
+
+For primitive types, ``children`` is an empty array.
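+
+For example, a dictionary-encoded ``utf8`` field with 8-bit signed indices
+could be described as follows (an illustrative, hand-written sketch; the field
+name and the dictionary ``id`` are arbitrary)::
+
+  {
+    "name": "dict_nullable",
+    "nullable": true,
+    "type": { "name": "utf8" },
+    "children": [],
+    "dictionary": {
+      "id": 0,
+      "indexType": { "name": "int", "isSigned": true, "bitWidth": 8 },
+      "isOrdered": false
+    }
+  }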
+
+**Metadata** ::
+
+ null |
+ [ {
+ "key": /* string */,
+ "value": /* string */
+ } ]
+
+A key-value mapping of custom metadata. It may be omitted or null, in which case it is
+considered equivalent to ``[]`` (no metadata). Duplicated keys are not forbidden here.
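+
+For example, a field carrying two custom key-value pairs (the keys and values
+here are purely illustrative) would use::
+
+  [
+    { "key": "k1", "value": "v1" },
+    { "key": "k2", "value": "v2" }
+  ]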
+
+**Type**: ::
+
+ {
+ "name" : "null|struct|list|largelist|fixedsizelist|union|int|floatingpoint|utf8|largeutf8|binary|largebinary|fixedsizebinary|bool|decimal|date|time|timestamp|interval|duration|map"
+ }
+
+A ``Type`` will have other fields as defined in
+`Schema.fbs <https://github.com/apache/arrow/tree/master/format/Schema.fbs>`_
+depending on its name.
+
+Int: ::
+
+ {
+ "name" : "int",
+ "bitWidth" : /* integer */,
+ "isSigned" : /* boolean */
+ }
+
+FloatingPoint: ::
+
+ {
+ "name" : "floatingpoint",
+ "precision" : "HALF|SINGLE|DOUBLE"
+ }
+
+FixedSizeBinary: ::
+
+ {
+ "name" : "fixedsizebinary",
+ "byteWidth" : /* byte width */
+ }
+
+Decimal: ::
+
+ {
+ "name" : "decimal",
+ "precision" : /* integer */,
+ "scale" : /* integer */
+ }
+
+Timestamp: ::
+
+ {
+ "name" : "timestamp",
+ "unit" : "$TIME_UNIT",
+ "timezone": "$timezone"
+ }
+
+``$TIME_UNIT`` is one of ``"SECOND|MILLISECOND|MICROSECOND|NANOSECOND"``.
+
+The ``timezone`` field is an optional string.
+
+Duration: ::
+
+ {
+ "name" : "duration",
+ "unit" : "$TIME_UNIT"
+ }
+
+Date: ::
+
+ {
+ "name" : "date",
+ "unit" : "DAY|MILLISECOND"
+ }
+
+Time: ::
+
+ {
+ "name" : "time",
+ "unit" : "$TIME_UNIT",
+ "bitWidth": /* integer: 32 or 64 */
+ }
+
+Interval: ::
+
+ {
+ "name" : "interval",
+ "unit" : "YEAR_MONTH|DAY_TIME"
+ }
+
+Union: ::
+
+ {
+ "name" : "union",
+ "mode" : "SPARSE|DENSE",
+ "typeIds" : [ /* integer */ ]
+ }
+
+The ``typeIds`` field of a ``Union`` lists the type codes used to denote which
+member of the union is active in each array slot. Note that, in general, these
+discriminants are not identical to the indices of the corresponding child arrays.
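+
+For example, a dense union of ``int32`` and ``utf8`` children that uses the
+type codes 5 and 7 (rather than the child indices 0 and 1) could be declared
+as follows (an illustrative, hand-written sketch)::
+
+  {
+    "name": "union_nullable",
+    "nullable": true,
+    "type": { "name": "union", "mode": "DENSE", "typeIds": [5, 7] },
+    "children": [
+      {
+        "name": "ints",
+        "nullable": true,
+        "type": { "name": "int", "isSigned": true, "bitWidth": 32 },
+        "children": []
+      },
+      {
+        "name": "strs",
+        "nullable": true,
+        "type": { "name": "utf8" },
+        "children": []
+      }
+    ]
+  }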
+
+List: ::
+
+ {
+ "name": "list"
+ }
+
+The type that the list is a "list of" will be included in the ``Field``'s
+"children" member, as a single ``Field`` there. For example, for a list of
+``int32``, ::
+
+ {
+ "name": "list_nullable",
+ "type": {
+ "name": "list"
+ },
+ "nullable": true,
+ "children": [
+ {
+ "name": "item",
+ "type": {
+ "name": "int",
+ "isSigned": true,
+ "bitWidth": 32
+ },
+ "nullable": true,
+ "children": []
+ }
+ ]
+ }
+
+FixedSizeList: ::
+
+ {
+ "name": "fixedsizelist",
+ "listSize": /* integer */
+ }
+
+This type likewise comes with a length-1 "children" array.
+
+Struct: ::
+
+ {
+ "name": "struct"
+ }
+
+The ``Field``'s "children" contains an array of ``Fields`` with meaningful
+names and types.
+
+Map: ::
+
+ {
+ "name": "map",
+ "keysSorted": /* boolean */
+ }
+
+The ``Field``'s "children" contains a single ``struct`` field, which itself
+contains 2 children, named "key" and "value".
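+
+For example, a map from ``utf8`` keys to nullable ``int32`` values could be
+declared as follows (an illustrative sketch; the "entries" name of the inner
+struct field is a conventional choice, not mandated here)::
+
+  {
+    "name": "map_nullable",
+    "nullable": true,
+    "type": { "name": "map", "keysSorted": false },
+    "children": [
+      {
+        "name": "entries",
+        "nullable": false,
+        "type": { "name": "struct" },
+        "children": [
+          {
+            "name": "key",
+            "nullable": false,
+            "type": { "name": "utf8" },
+            "children": []
+          },
+          {
+            "name": "value",
+            "nullable": true,
+            "type": { "name": "int", "isSigned": true, "bitWidth": 32 },
+            "children": []
+          }
+        ]
+      }
+    ]
+  }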
+
+Null: ::
+
+ {
+ "name": "null"
+ }
+
+Extension types are, as in the IPC format, represented as their underlying
+storage type plus some dedicated field metadata to reconstruct the extension
+type. For example, assuming a "uuid" extension type backed by a
+FixedSizeBinary(16) storage, here is how a "uuid" field would be represented::
+
+ {
+ "name" : "name_of_the_field",
+ "nullable" : /* boolean */,
+ "type" : {
+ "name" : "fixedsizebinary",
+ "byteWidth" : 16
+ },
+ "children" : [],
+ "metadata" : [
+ {"key": "ARROW:extension:name", "value": "uuid"},
+ {"key": "ARROW:extension:metadata", "value": "uuid-serialized"}
+ ]
+ }
+
+**RecordBatch**::
+
+ {
+ "count": /* integer number of rows */,
+ "columns": [ /* FieldData */ ]
+ }
+
+**DictionaryBatch**::
+
+ {
+ "id": /* integer */,
+ "data": [ /* RecordBatch */ ]
+ }
+
+**FieldData**::
+
+  {
+    "name": "field_name",
+    "count": /* integer number of values */,
+    "$BUFFER_TYPE": /* BufferData */,
+    ...
+    "$BUFFER_TYPE": /* BufferData */,
+    "children": [ /* FieldData */ ]
+  }
+
+The "name" member of a ``Field`` in the ``Schema`` corresponds to the "name"
+of a ``FieldData`` contained in the "columns" of a ``RecordBatch``.
+For nested types (list, struct, etc.), ``Field``'s "children" each have a
+"name" that corresponds to the "name" of a ``FieldData`` inside the
+"children" of that ``FieldData``.
+For ``FieldData`` inside of a ``DictionaryBatch``, the "name" field does not
+correspond to anything.
+
+Here ``$BUFFER_TYPE`` is one of ``VALIDITY``, ``OFFSET`` (for
+variable-length types, such as strings and lists), ``TYPE_ID`` (for unions),
+or ``DATA``.
+
+``BufferData`` is encoded based on the type of buffer:
+
+* ``VALIDITY``: a JSON array of 1 (valid) and 0 (null). Data for non-nullable
+ ``Field`` still has a ``VALIDITY`` array, even though all values are 1.
+* ``OFFSET``: a JSON array of integers for 32-bit offsets or
+ string-formatted integers for 64-bit offsets
+* ``TYPE_ID``: a JSON array of integers
+* ``DATA``: a JSON array of encoded values
+
+The value encoding for ``DATA`` is different depending on the logical
+type:
+
+* For boolean type: an array of 1 (true) and 0 (false).
+* For integer-based types (including timestamps): an array of JSON numbers.
+* For 64-bit integers: an array of integers formatted as JSON strings,
+ so as to avoid loss of precision.
+* For floating point types: an array of JSON numbers. Values are limited
+ to 3 decimal places to avoid loss of precision.
+* For binary types, an array of uppercase hex-encoded strings, so as
+ to represent arbitrary binary data.
+* For UTF-8 string types, an array of JSON strings.
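+
+As an illustration of these buffer encodings, the ``FieldData`` for a nullable
+``utf8`` column holding the values ``"ab"``, null, ``"cde"`` could look like
+the following sketch (the value stored in ``DATA`` for a null slot is
+arbitrary; an empty string is used here)::
+
+  {
+    "name": "strs_nullable",
+    "count": 3,
+    "VALIDITY": [1, 0, 1],
+    "OFFSET": [0, 2, 2, 5],
+    "DATA": ["ab", "", "cde"]
+  }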
+
+For "list" and "largelist" types, ``BufferData`` has ``VALIDITY`` and
+``OFFSET``, and the rest of the data is inside "children". These child
+``FieldData`` contain all of the same attributes as non-child data, so in
+the example of a list of ``int32``, the child data has ``VALIDITY`` and
+``DATA``.
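+
+Continuing the "list of ``int32``" field shown earlier, the corresponding
+``FieldData`` for the values ``[1, 2]``, null, ``[3]`` might look like this
+sketch::
+
+  {
+    "name": "list_nullable",
+    "count": 3,
+    "VALIDITY": [1, 0, 1],
+    "OFFSET": [0, 2, 2, 3],
+    "children": [
+      {
+        "name": "item",
+        "count": 3,
+        "VALIDITY": [1, 1, 1],
+        "DATA": [1, 2, 3]
+      }
+    ]
+  }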
+
+For "fixedsizelist", there is no ``OFFSET`` member because the offsets are
+implied by the field's "listSize".
+
+Note that the "count" for these child data may not match the parent "count".
+For example, if a ``RecordBatch`` has 7 rows and contains a ``FixedSizeList``
+of ``listSize`` 4, then the data inside the "children" of that ``FieldData``
+will have count 28.
+
+For "null" type, ``BufferData`` does not contain any buffers.