diff options
Diffstat (limited to 'src/arrow/docs/source/format/CDataInterface.rst')
-rw-r--r-- | src/arrow/docs/source/format/CDataInterface.rst | 948 |
1 files changed, 948 insertions, 0 deletions
diff --git a/src/arrow/docs/source/format/CDataInterface.rst b/src/arrow/docs/source/format/CDataInterface.rst new file mode 100644 index 000000000..20446411a --- /dev/null +++ b/src/arrow/docs/source/format/CDataInterface.rst @@ -0,0 +1,948 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +.. or more contributor license agreements. See the NOTICE file +.. distributed with this work for additional information +.. regarding copyright ownership. The ASF licenses this file +.. to you under the Apache License, Version 2.0 (the +.. "License"); you may not use this file except in compliance +.. with the License. You may obtain a copy of the License at + +.. http://www.apache.org/licenses/LICENSE-2.0 + +.. Unless required by applicable law or agreed to in writing, +.. software distributed under the License is distributed on an +.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +.. KIND, either express or implied. See the License for the +.. specific language governing permissions and limitations +.. under the License. + +.. _c-data-interface: + +========================== +The Arrow C data interface +========================== + +Rationale +========= + +Apache Arrow is designed to be a universal in-memory format for the representation +of tabular ("columnar") data. However, some projects may face a difficult +choice between either depending on a fast-evolving project such as the +Arrow C++ library, or having to reimplement adapters for data interchange, +which may require significant, redundant development effort. + +The Arrow C data interface defines a very small, stable set of C definitions +that can be easily *copied* in any project's source code and used for columnar +data interchange in the Arrow format. For non-C/C++ languages and runtimes, +it should be almost as easy to translate the C definitions into the +corresponding C FFI declarations. + +Applications and libraries can therefore work with Arrow memory without +necessarily using Arrow libraries or reinventing the wheel. Developers can +choose between tight integration +with the Arrow *software project* (benefitting from the growing array of +facilities exposed by e.g. the C++ or Java implementations of Apache Arrow, +but with the cost of a dependency) or minimal integration with the Arrow +*format* only. + +Goals +----- + +* Expose an ABI-stable interface. +* Make it easy for third-party projects to implement support for (including partial + support where sufficient), with little initial investment. +* Allow zero-copy sharing of Arrow data between independent runtimes + and components running in the same process. +* Match the Arrow array concepts closely to avoid the development of + yet another marshalling layer. +* Avoid the need for one-to-one adaptation layers such as the limited + JPype-based bridge between Java and Python. +* Enable integration without an explicit dependency (either at compile-time + or runtime) on the Arrow software project. + +Ideally, the Arrow C data interface can become a low-level *lingua franca* +for sharing columnar data at runtime and establish Arrow as the universal +building block in the columnar processing ecosystem. + +Non-goals +--------- + +* Expose a C API mimicking operations available in higher-level runtimes + (such as C++, Java...). +* Data sharing between distinct processes or storage persistence. + + +Comparison with the Arrow IPC format +------------------------------------ + +Pros of the C data interface vs. the IPC format: + +* No dependency on Flatbuffers. +* No buffer reassembly (data is already exposed in logical Arrow format). +* Zero-copy by design. +* Easy to reimplement from scratch. +* Minimal C definition that can be easily copied into other codebases. +* Resource lifetime management through a custom release callback. + +Pros of the IPC format vs. the data interface: + +* Works across processes and machines. +* Allows data storage and persistence. +* Being a streamable format, the IPC format has room for composing more features + (such as integrity checks, compression...). +* Does not require explicit C data access. + +Data type description -- format strings +======================================= + +A data type is described using a format string. The format string only +encodes information about the top-level type; for nested type, child types +are described separately. Also, metadata is encoded in a separate string. + +The format strings are designed to be easily parsable, even from a language +such as C. The most common primitive formats have one-character format +strings: + ++-----------------+--------------------------+------------+ +| Format string | Arrow data type | Notes | ++=================+==========================+============+ +| ``n`` | null | | ++-----------------+--------------------------+------------+ +| ``b`` | boolean | | ++-----------------+--------------------------+------------+ +| ``c`` | int8 | | ++-----------------+--------------------------+------------+ +| ``C`` | uint8 | | ++-----------------+--------------------------+------------+ +| ``s`` | int16 | | ++-----------------+--------------------------+------------+ +| ``S`` | uint16 | | ++-----------------+--------------------------+------------+ +| ``i`` | int32 | | ++-----------------+--------------------------+------------+ +| ``I`` | uint32 | | ++-----------------+--------------------------+------------+ +| ``l`` | int64 | | ++-----------------+--------------------------+------------+ +| ``L`` | uint64 | | ++-----------------+--------------------------+------------+ +| ``e`` | float16 | | ++-----------------+--------------------------+------------+ +| ``f`` | float32 | | ++-----------------+--------------------------+------------+ +| ``g`` | float64 | | ++-----------------+--------------------------+------------+ + ++-----------------+---------------------------------------------------+------------+ +| Format string | Arrow data type | Notes | ++=================+===================================================+============+ +| ``z`` | binary | | ++-----------------+---------------------------------------------------+------------+ +| ``Z`` | large binary | | ++-----------------+---------------------------------------------------+------------+ +| ``u`` | utf-8 string | | ++-----------------+---------------------------------------------------+------------+ +| ``U`` | large utf-8 string | | ++-----------------+---------------------------------------------------+------------+ +| ``d:19,10`` | decimal128 [precision 19, scale 10] | | ++-----------------+---------------------------------------------------+------------+ +| ``d:19,10,NNN`` | decimal bitwidth = NNN [precision 19, scale 10] | | ++-----------------+---------------------------------------------------+------------+ +| ``w:42`` | fixed-width binary [42 bytes] | | ++-----------------+---------------------------------------------------+------------+ + +Temporal types have multi-character format strings starting with ``t``: + ++-----------------+---------------------------------------------------+------------+ +| Format string | Arrow data type | Notes | ++=================+===================================================+============+ +| ``tdD`` | date32 [days] | | ++-----------------+---------------------------------------------------+------------+ +| ``tdm`` | date64 [milliseconds] | | ++-----------------+---------------------------------------------------+------------+ +| ``tts`` | time32 [seconds] | | ++-----------------+---------------------------------------------------+------------+ +| ``ttm`` | time32 [milliseconds] | | ++-----------------+---------------------------------------------------+------------+ +| ``ttu`` | time64 [microseconds] | | ++-----------------+---------------------------------------------------+------------+ +| ``ttn`` | time64 [nanoseconds] | | ++-----------------+---------------------------------------------------+------------+ +| ``tss:...`` | timestamp [seconds] with timezone "..." | \(1) | ++-----------------+---------------------------------------------------+------------+ +| ``tsm:...`` | timestamp [milliseconds] with timezone "..." | \(1) | ++-----------------+---------------------------------------------------+------------+ +| ``tsu:...`` | timestamp [microseconds] with timezone "..." | \(1) | ++-----------------+---------------------------------------------------+------------+ +| ``tsn:...`` | timestamp [nanoseconds] with timezone "..." | \(1) | ++-----------------+---------------------------------------------------+------------+ +| ``tDs`` | duration [seconds] | | ++-----------------+---------------------------------------------------+------------+ +| ``tDm`` | duration [milliseconds] | | ++-----------------+---------------------------------------------------+------------+ +| ``tDu`` | duration [microseconds] | | ++-----------------+---------------------------------------------------+------------+ +| ``tDn`` | duration [nanoseconds] | | ++-----------------+---------------------------------------------------+------------+ +| ``tiM`` | interval [months] | | ++-----------------+---------------------------------------------------+------------+ +| ``tiD`` | interval [days, time] | | ++-----------------+---------------------------------------------------+------------+ +| ``tin`` | interval [month, day, nanoseconds] | | ++-----------------+---------------------------------------------------+------------+ + + +Dictionary-encoded types do not have a specific format string. Instead, the +format string of the base array represents the dictionary index type, and the +value type can be read from the dependent dictionary array (see below +"Dictionary-encoded arrays"). + +Nested types have multiple-character format strings starting with ``+``. The +names and types of child fields are read from the child arrays. + ++------------------------+---------------------------------------------------+------------+ +| Format string | Arrow data type | Notes | ++========================+===================================================+============+ +| ``+l`` | list | | ++------------------------+---------------------------------------------------+------------+ +| ``+L`` | large list | | ++------------------------+---------------------------------------------------+------------+ +| ``+w:123`` | fixed-sized list [123 items] | | ++------------------------+---------------------------------------------------+------------+ +| ``+s`` | struct | | ++------------------------+---------------------------------------------------+------------+ +| ``+m`` | map | \(2) | ++------------------------+---------------------------------------------------+------------+ +| ``+ud:I,J,...`` | dense union with type ids I,J... | | ++------------------------+---------------------------------------------------+------------+ +| ``+us:I,J,...`` | sparse union with type ids I,J... | | ++------------------------+---------------------------------------------------+------------+ + +Notes: + +(1) + The timezone string is appended as-is after the colon character ``:``, without + any quotes. If the timezone is empty, the colon ``:`` must still be included. + +(2) + As specified in the Arrow columnar format, the map type has a single child type + named ``entries``, itself a 2-child struct type of ``(key, value)``. + +Examples +-------- + +* A dictionary-encoded ``decimal128(precision = 12, scale = 5)`` array + with ``int16`` indices has format string ``s``, and its dependent dictionary + array has format string ``d:12,5``. +* A ``list<uint64>`` array has format string ``+l``, and its single child + has format string ``L``. +* A ``struct<ints: int32, floats: float32>`` has format string ``+s``; its two + children have names ``ints`` and ``floats``, and format strings ``i`` and + ``f`` respectively. +* A ``map<string, float64>`` array has format string ``+m``; its single child + has name ``entries`` and format string ``+s``; its two grandchildren have names + ``key`` and ``value``, and format strings ``u`` and ``g`` respectively. +* A ``sparse_union<ints: int32, floats: float32>`` with type ids ``4, 5`` + has format string ``+us:4,5``; its two children have names ``ints`` and + ``floats``, and format strings ``i`` and ``f`` respectively. + + +Structure definitions +===================== + +The following free-standing definitions are enough to support the Arrow +C data interface in your project. Like the rest of the Arrow project, they +are available under the Apache License 2.0. + +.. code-block:: c + + #define ARROW_FLAG_DICTIONARY_ORDERED 1 + #define ARROW_FLAG_NULLABLE 2 + #define ARROW_FLAG_MAP_KEYS_SORTED 4 + + struct ArrowSchema { + // Array type description + const char* format; + const char* name; + const char* metadata; + int64_t flags; + int64_t n_children; + struct ArrowSchema** children; + struct ArrowSchema* dictionary; + + // Release callback + void (*release)(struct ArrowSchema*); + // Opaque producer-specific data + void* private_data; + }; + + struct ArrowArray { + // Array data description + int64_t length; + int64_t null_count; + int64_t offset; + int64_t n_buffers; + int64_t n_children; + const void** buffers; + struct ArrowArray** children; + struct ArrowArray* dictionary; + + // Release callback + void (*release)(struct ArrowArray*); + // Opaque producer-specific data + void* private_data; + }; + +The ArrowSchema structure +------------------------- + +The ``ArrowSchema`` structure describes the type and metadata of an exported +array or record batch. It has the following fields: + +.. c:member:: const char* ArrowSchema.format + + Mandatory. A null-terminated, UTF8-encoded string describing + the data type. If the data type is nested, child types are not + encoded here but in the :c:member:`ArrowSchema.children` structures. + + Consumers MAY decide not to support all data types, but they + should document this limitation. + +.. c:member:: const char* ArrowSchema.name + + Optional. A null-terminated, UTF8-encoded string of the field + or array name. This is mainly used to reconstruct child fields + of nested types. + + Producers MAY decide not to provide this information, and consumers + MAY decide to ignore it. If omitted, MAY be NULL or an empty string. + +.. c:member:: const char* ArrowSchema.metadata + + Optional. A binary string describing the type's metadata. + If the data type is nested, child types are not encoded here but + in the :c:member:`ArrowSchema.children` structures. + + This string is not null-terminated but follows a specific format:: + + int32: number of key/value pairs (noted N below) + int32: byte length of key 0 + key 0 (not null-terminated) + int32: byte length of value 0 + value 0 (not null-terminated) + ... + int32: byte length of key N - 1 + key N - 1 (not null-terminated) + int32: byte length of value N - 1 + value N - 1 (not null-terminated) + + Integers are stored in native endianness. For example, the metadata + ``[('key1', 'value1')]`` is encoded on a little-endian machine as:: + + \x01\x00\x00\x00\x04\x00\x00\x00key1\x06\x00\x00\x00value1 + + On a big-endian machine, the same example would be encoded as:: + + \x00\x00\x00\x01\x00\x00\x00\x04key1\x00\x00\x00\x06value1 + + If omitted, this field MUST be NULL (not an empty string). + + Consumers MAY choose to ignore this information. + +.. c:member:: int64_t ArrowSchema.flags + + Optional. A bitfield of flags enriching the type description. + Its value is computed by OR'ing together the flag values. + The following flags are available: + + * ``ARROW_FLAG_NULLABLE``: whether this field is semantically nullable + (regardless of whether it actually has null values). + * ``ARROW_FLAG_DICTIONARY_ORDERED``: for dictionary-encoded types, + whether the ordering of dictionary indices is semantically meaningful. + * ``ARROW_FLAG_MAP_KEYS_SORTED``: for map types, whether the keys within + each map value are sorted. + + If omitted, MUST be 0. + + Consumers MAY choose to ignore some or all of the flags. Even then, + they SHOULD keep this value around so as to propagate its information + to their own consumers. + +.. c:member:: int64_t ArrowSchema.n_children + + Mandatory. The number of children this type has. + +.. c:member:: ArrowSchema** ArrowSchema.children + + Optional. A C array of pointers to each child type of this type. + There must be :c:member:`ArrowSchema.n_children` pointers. + + MAY be NULL only if :c:member:`ArrowSchema.n_children` is 0. + +.. c:member:: ArrowSchema* ArrowSchema.dictionary + + Optional. A pointer to the type of dictionary values. + + MUST be present if the ArrowSchema represents a dictionary-encoded type. + MUST be NULL otherwise. + +.. c:member:: void (*ArrowSchema.release)(struct ArrowSchema*) + + Mandatory. A pointer to a producer-provided release callback. + + See below for memory management and release callback semantics. + +.. c:member:: void* ArrowSchema.private_data + + Optional. An opaque pointer to producer-provided private data. + + Consumers MUST not process this member. Lifetime of this member + is handled by the producer, and especially by the release callback. + + +The ArrowArray structure +------------------------ + +The ``ArrowArray`` describes the data of an exported array or record batch. +For the ``ArrowArray`` structure to be interpreted type, the array type +or record batch schema must already be known. This is either done by +convention -- for example a producer API that always produces the same data +type -- or by passing a ``ArrowSchema`` on the side. + +It has the following fields: + +.. c:member:: int64_t ArrowArray.length + + Mandatory. The logical length of the array (i.e. its number of items). + +.. c:member:: int64_t ArrowArray.null_count + + Mandatory. The number of null items in the array. MAY be -1 if not + yet computed. + +.. c:member:: int64_t ArrowArray.offset + + Mandatory. The logical offset inside the array (i.e. the number of items + from the physical start of the buffers). MUST be 0 or positive. + + Producers MAY specify that they will only produce 0-offset arrays to + ease implementation of consumer code. + Consumers MAY decide not to support non-0-offset arrays, but they + should document this limitation. + +.. c:member:: int64_t ArrowArray.n_buffers + + Mandatory. The number of physical buffers backing this array. The + number of buffers is a function of the data type, as described in the + :ref:`Columnar format specification <format_columnar>`. + + Buffers of children arrays are not included. + +.. c:member:: const void** ArrowArray.buffers + + Mandatory. A C array of pointers to the start of each physical buffer + backing this array. Each `void*` pointer is the physical start of + a contiguous buffer. There must be :c:member:`ArrowArray.n_buffers` pointers. + + The producer MUST ensure that each contiguous buffer is large enough to + represent `length + offset` values encoded according to the + :ref:`Columnar format specification <format_columnar>`. + + It is recommended, but not required, that the memory addresses of the + buffers be aligned at least according to the type of primitive data that + they contain. Consumers MAY decide not to support unaligned memory. + + The pointer to the null bitmap buffer, if the data type specifies one, + MAY be NULL only if :c:member:`ArrowArray.null_count` is 0. + + Buffers of children arrays are not included. + +.. c:member:: int64_t ArrowArray.n_children + + Mandatory. The number of children this array has. The number of children + is a function of the data type, as described in the + :ref:`Columnar format specification <format_columnar>`. + +.. c:member:: ArrowArray** ArrowArray.children + + Optional. A C array of pointers to each child array of this array. + There must be :c:member:`ArrowArray.n_children` pointers. + + MAY be NULL only if :c:member:`ArrowArray.n_children` is 0. + +.. c:member:: ArrowArray* ArrowArray.dictionary + + Optional. A pointer to the underlying array of dictionary values. + + MUST be present if the ArrowArray represents a dictionary-encoded array. + MUST be NULL otherwise. + +.. c:member:: void (*ArrowArray.release)(struct ArrowArray*) + + Mandatory. A pointer to a producer-provided release callback. + + See below for memory management and release callback semantics. + +.. c:member:: void* ArrowArray.private_data + + Optional. An opaque pointer to producer-provided private data. + + Consumers MUST not process this member. Lifetime of this member + is handled by the producer, and especially by the release callback. + + +Dictionary-encoded arrays +------------------------- + +For dictionary-encoded arrays, the :c:member:`ArrowSchema.format` string +encodes the *index* type. The dictionary *value* type can be read +from the :c:member:`ArrowSchema.dictionary` structure. + +The same holds for :c:member:`ArrowArray` structure: while the parent +structure points to the index data, the :c:member:`ArrowArray.dictionary` +points to the dictionary values array. + +Extension arrays +---------------- + +For extension arrays, the :c:member:`ArrowSchema.format` string encodes the +*storage* type. Information about the extension type is encoded in the +:c:member:`ArrowSchema.metadata` string, similarly to the +:ref:`IPC format <format_metadata_extension_types>`. Specifically, the +metadata key ``ARROW:extension:name`` encodes the extension type name, +and the metadata key ``ARROW:extension:metadata`` encodes the +implementation-specific serialization of the extension type (for +parameterized extension types). The base64 encoding of metadata values +ensures that any possible serialization is representable. + +The ``ArrowArray`` structure exported from an extension array simply points +to the storage data of the extension array. + +Memory management +----------------- + +The ``ArrowSchema`` and ``ArrowArray`` structures follow the same conventions +for memory management. The term *"base structure"* below refers to the +``ArrowSchema`` or ``ArrowArray`` that is passed between producer and consumer +-- not any child structure thereof. + +Member allocation +''''''''''''''''' + +It is intended for the base structure to be stack- or heap-allocated by the +consumer. In this case, the producer API should take a pointer to the +consumer-allocated structure. + +However, any data pointed to by the struct MUST be allocated and maintained +by the producer. This includes the format and metadata strings, the arrays +of buffer and children pointers, etc. + +Therefore, the consumer MUST not try to interfere with the producer's +handling of these members' lifetime. The only way the consumer influences +data lifetime is by calling the base structure's ``release`` callback. + +.. _c-data-interface-released: + +Released structure +'''''''''''''''''' + +A released structure is indicated by setting its ``release`` callback to NULL. +Before reading and interpreting a structure's data, consumers SHOULD check +for a NULL release callback and treat it accordingly (probably by erroring +out). + +Release callback semantics -- for consumers +''''''''''''''''''''''''''''''''''''''''''' + +Consumers MUST call a base structure's release callback when they won't be using +it anymore, but they MUST not call any of its children's release callbacks +(including the optional dictionary). The producer is responsible for releasing +the children. + +In any case, a consumer MUST not try to access the base structure anymore +after calling its release callback -- including any associated data such +as its children. + +Release callback semantics -- for producers +''''''''''''''''''''''''''''''''''''''''''' + +If producers need additional information for lifetime handling (for +example, a C++ producer may want to use ``shared_ptr`` for array and +buffer lifetime), they MUST use the ``private_data`` member to locate the +required bookkeeping information. + +The release callback MUST not assume that the structure will be located +at the same memory location as when it was originally produced. The consumer +is free to move the structure around (see "Moving an array"). + +The release callback MUST walk all children structures (including the optional +dictionary) and call their own release callbacks. + +The release callback MUST free any data area directly owned by the structure +(such as the buffers and children members). + +The release callback MUST mark the structure as released, by setting +its ``release`` member to NULL. + +Below is a good starting point for implementing a release callback, where the +TODO area must be filled with producer-specific deallocation code: + +.. code-block:: c + + static void ReleaseExportedArray(struct ArrowArray* array) { + // This should not be called on already released array + assert(array->format != NULL); + + // Release children + for (int64_t i = 0; i < array->n_children; ++i) { + struct ArrowArray* child = array->children[i]; + if (child->release != NULL) { + child->release(child); + assert(child->release == NULL); + } + } + + // Release dictionary + struct ArrowArray* dict = array->dictionary; + if (dict != NULL && dict->release != NULL) { + dict->release(dict); + assert(dict->release == NULL); + } + + // TODO here: release and/or deallocate all data directly owned by + // the ArrowArray struct, such as the private_data. + + // Mark array released + array->release = NULL; + } + + +Moving an array +''''''''''''''' + +The consumer can *move* the ``ArrowArray`` structure by bitwise copying or +shallow member-wise copying. Then it MUST mark the source structure released +(see "released structure" above for how to do it) but *without* calling the +release callback. This ensures that only one live copy of the struct is +active at any given time and that lifetime is correctly communicated to +the producer. + +As usual, the release callback will be called on the destination structure +when it is not needed anymore. + +Moving child arrays +~~~~~~~~~~~~~~~~~~~ + +It is also possible to move one or several child arrays, but the parent +``ArrowArray`` structure MUST be released immediately afterwards, as it +won't point to valid child arrays anymore. + +The main use case for this is to keep alive only a subset of child arrays +(for example if you are only interested in certain columns of the data), +while releasing the others. + +.. note:: + + For moving to work correctly, the ``ArrowArray`` structure has to be + trivially relocatable. Therefore, pointer members inside the ``ArrowArray`` + structure (including ``private_data``) MUST not point inside the structure + itself. Also, external pointers to the structure MUST not be separately + stored by the producer. Instead, the producer MUST use the ``private_data`` + member so as to remember any necessary bookkeeping information. + +Record batches +-------------- + +A record batch can be trivially considered as an equivalent struct array with +additional top-level metadata. + +Example use case +================ + +A C++ database engine wants to provide the option to deliver results in Arrow +format, but without imposing themselves a dependency on the Arrow software +libraries. With the Arrow C data interface, the engine can let the caller pass +a pointer to a ``ArrowArray`` structure, and fill it with the next chunk of +results. + +It can do so without including the Arrow C++ headers or linking with the +Arrow DLLs. Furthermore, the database engine's C API can benefit other +runtimes and libraries that know about the Arrow C data interface, +through e.g. a C FFI layer. + +C producer examples +=================== + +Exporting a simple ``int32`` array +---------------------------------- + +Export a non-nullable ``int32`` type with empty metadata. In this case, +all ``ArrowSchema`` members point to statically-allocated data, so the +release callback is trivial. + +.. code-block:: c + + static void release_int32_type(struct ArrowSchema* schema) { + // Mark released + schema->release = NULL; + } + + void export_int32_type(struct ArrowSchema* schema) { + *schema = (struct ArrowSchema) { + // Type description + .format = "i", + .name = "", + .metadata = NULL, + .flags = 0, + .n_children = 0, + .children = NULL, + .dictionary = NULL, + // Bookkeeping + .release = &release_int32_type + }; + } + +Export a C-malloc()ed array of the same type as a Arrow array, transferring +ownership to the consumer through the release callback: + +.. code-block:: c + + static void release_int32_array(struct ArrowArray* array) { + assert(array->n_buffers == 2); + // Free the buffers and the buffers array + free((void *) array->buffers[1]); + free(array->buffers); + // Mark released + array->release = NULL; + } + + void export_int32_array(const int32_t* data, int64_t nitems, + struct ArrowArray* array) { + // Initialize primitive fields + *array = (struct ArrowArray) { + // Data description + .length = nitems, + .offset = 0, + .null_count = 0, + .n_buffers = 2, + .n_children = 0, + .children = NULL, + .dictionary = NULL, + // Bookkeeping + .release = &release_int32_array + }; + // Allocate list of buffers + array->buffers = (const void**) malloc(sizeof(void*) * array->n_buffers); + assert(array->buffers != NULL); + array->buffers[0] = NULL; // no nulls, null bitmap can be omitted + array->buffers[1] = data; + } + +Exporting a ``struct<float32, utf8>`` array +------------------------------------------- + +Export the array type as a ``ArrowSchema`` with C-malloc()ed children: + +.. code-block:: c + + static void release_malloced_type(struct ArrowSchema* schema) { + int i; + for (i = 0; i < schema->n_children; ++i) { + struct ArrowSchema* child = schema->children[i]; + if (child->release != NULL) { + child->release(child); + } + } + free(schema->children); + // Mark released + schema->release = NULL; + } + + void export_float32_utf8_type(struct ArrowSchema* schema) { + struct ArrowSchema* child; + + // + // Initialize parent type + // + *schema = (struct ArrowSchema) { + // Type description + .format = "+s", + .name = "", + .metadata = NULL, + .flags = 0, + .n_children = 2, + .dictionary = NULL, + // Bookkeeping + .release = &release_malloced_type + }; + // Allocate list of children types + schema->children = malloc(sizeof(struct ArrowSchema*) * schema->n_children); + + // + // Initialize child type #0 + // + child = schema->children[0] = malloc(sizeof(struct ArrowSchema)); + *child = (struct ArrowSchema) { + // Type description + .format = "f", + .name = "floats", + .metadata = NULL, + .flags = ARROW_FLAG_NULLABLE, + .n_children = 0, + .dictionary = NULL, + .children = NULL, + // Bookkeeping + .release = &release_malloced_type + }; + + // + // Initialize child type #1 + // + child = schema->children[1] = malloc(sizeof(struct ArrowSchema)); + *child = (struct ArrowSchema) { + // Type description + .format = "u", + .name = "strings", + .metadata = NULL, + .flags = ARROW_FLAG_NULLABLE, + .n_children = 0, + .dictionary = NULL, + .children = NULL, + // Bookkeeping + .release = &release_malloced_type + }; + } + +Export C-malloc()ed arrays in Arrow-compatible layout as an Arrow struct array, +transferring ownership to the consumer: + +.. code-block:: c + + static void release_malloced_array(struct ArrowArray* array) { + int i; + // Free children + for (i = 0; i < array->n_children; ++i) { + struct ArrowArray* child = array->children[i]; + if (child->release != NULL) { + child->release(child); + } + } + free(array->children); + // Free buffers + for (i = 0; i < array->n_buffers; ++i) { + free((void *) array->buffers[i]); + } + free(array->buffers); + // Mark released + array->release = NULL; + } + + void export_float32_utf8_array( + int64_t nitems, + const uint8_t* float32_nulls, const float* float32_data, + const uint8_t* utf8_nulls, const int32_t* utf8_offsets, const uint8_t* utf8_data, + struct ArrowArray* array) { + struct ArrowArray* child; + + // + // Initialize parent array + // + *array = (struct ArrowArray) { + // Data description + .length = nitems, + .offset = 0, + .null_count = 0, + .n_buffers = 1, + .n_children = 2, + .dictionary = NULL, + // Bookkeeping + .release = &release_malloced_array + }; + // Allocate list of parent buffers + array->buffers = malloc(sizeof(void*) * array->n_buffers); + array->buffers[0] = NULL; // no nulls, null bitmap can be omitted + // Allocate list of children arrays + array->children = malloc(sizeof(struct ArrowArray*) * array->n_children); + + // + // Initialize child array #0 + // + child = array->children[0] = malloc(sizeof(struct ArrowArray)); + *child = (struct ArrowArray) { + // Data description + .length = nitems, + .offset = 0, + .null_count = -1, + .n_buffers = 2, + .n_children = 0, + .dictionary = NULL, + .children = NULL, + // Bookkeeping + .release = &release_malloced_array + }; + child->buffers = malloc(sizeof(void*) * array->n_buffers); + child->buffers[0] = float32_nulls; + child->buffers[1] = float32_data; + + // + // Initialize child array #1 + // + child = array->children[1] = malloc(sizeof(struct ArrowArray)); + *child = (struct ArrowArray) { + // Data description + .length = nitems, + .offset = 0, + .null_count = -1, + .n_buffers = 3, + .n_children = 0, + .dictionary = NULL, + .children = NULL, + // Bookkeeping + .release = &release_malloced_array + }; + child->buffers = malloc(sizeof(void*) * array->n_buffers); + child->buffers[0] = utf8_nulls; + child->buffers[1] = utf8_offsets; + child->buffers[2] = utf8_data; + } + + +Why two distinct structures? +============================ + +In many cases, the same type or schema description applies to multiple, +possibly short, batches of data. To avoid paying the cost of exporting +and importing the type description for each batch, the ``ArrowSchema`` +can be passed once, separately, at the beginning of the conversation between +producer and consumer. + +In other cases yet, the data type is fixed by the producer API, and may not +need to be communicated at all. + +However, if a producer is focused on one-shot exchange of data, it can +communicate the ``ArrowSchema`` and ``ArrowArray`` structures in the same +API call. + +Updating this specification +=========================== + +Once this specification is supported in an official Arrow release, the C +ABI is frozen. This means the ``ArrowSchema`` and ``ArrowArray`` structure +definitions should not change in any way -- including adding new members. + +Backwards-compatible changes are allowed, for example new +:c:member:`ArrowSchema.flags` values or expanded possibilities for +the :c:member:`ArrowSchema.format` string. + +Any incompatible changes should be part of a new specification, for example +"Arrow C data interface v2". + +Inspiration +=========== + +The Arrow C data interface is inspired by the `Python buffer protocol`_, +which has proven immensely successful in allowing various Python libraries +exchange numerical data with no knowledge of each other and near-zero +adaptation cost. + + +.. _Python buffer protocol: https://www.python.org/dev/peps/pep-3118/ |