Diffstat (limited to 'src/arrow/docs/source/format/CDataInterface.rst')
-rw-r--r-- src/arrow/docs/source/format/CDataInterface.rst 948
1 files changed, 948 insertions, 0 deletions
diff --git a/src/arrow/docs/source/format/CDataInterface.rst b/src/arrow/docs/source/format/CDataInterface.rst
new file mode 100644
index 000000000..20446411a
--- /dev/null
+++ b/src/arrow/docs/source/format/CDataInterface.rst
@@ -0,0 +1,948 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements. See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership. The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License. You may obtain a copy of the License at
+
+.. http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied. See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. _c-data-interface:
+
+==========================
+The Arrow C data interface
+==========================
+
+Rationale
+=========
+
+Apache Arrow is designed to be a universal in-memory format for the representation
+of tabular ("columnar") data. However, some projects may face a difficult
+choice between either depending on a fast-evolving project such as the
+Arrow C++ library, or having to reimplement adapters for data interchange,
+which may require significant, redundant development effort.
+
+The Arrow C data interface defines a very small, stable set of C definitions
+that can be easily *copied* in any project's source code and used for columnar
+data interchange in the Arrow format. For non-C/C++ languages and runtimes,
+it should be almost as easy to translate the C definitions into the
+corresponding C FFI declarations.
+
+Applications and libraries can therefore work with Arrow memory without
+necessarily using Arrow libraries or reinventing the wheel. Developers can
+choose between tight integration
+with the Arrow *software project* (benefitting from the growing array of
+facilities exposed by e.g. the C++ or Java implementations of Apache Arrow,
+but with the cost of a dependency) or minimal integration with the Arrow
+*format* only.
+
+Goals
+-----
+
+* Expose an ABI-stable interface.
+* Make it easy for third-party projects to implement support (including partial
+ support where sufficient), with little initial investment.
+* Allow zero-copy sharing of Arrow data between independent runtimes
+ and components running in the same process.
+* Match the Arrow array concepts closely to avoid the development of
+ yet another marshalling layer.
+* Avoid the need for one-to-one adaptation layers such as the limited
+ JPype-based bridge between Java and Python.
+* Enable integration without an explicit dependency (either at compile-time
+ or runtime) on the Arrow software project.
+
+Ideally, the Arrow C data interface can become a low-level *lingua franca*
+for sharing columnar data at runtime and establish Arrow as the universal
+building block in the columnar processing ecosystem.
+
+Non-goals
+---------
+
+* Expose a C API mimicking operations available in higher-level runtimes
+ (such as C++, Java...).
+* Data sharing between distinct processes or storage persistence.
+
+
+Comparison with the Arrow IPC format
+------------------------------------
+
+Pros of the C data interface vs. the IPC format:
+
+* No dependency on Flatbuffers.
+* No buffer reassembly (data is already exposed in logical Arrow format).
+* Zero-copy by design.
+* Easy to reimplement from scratch.
+* Minimal C definition that can be easily copied into other codebases.
+* Resource lifetime management through a custom release callback.
+
+Pros of the IPC format vs. the data interface:
+
+* Works across processes and machines.
+* Allows data storage and persistence.
+* Being a streamable format, the IPC format has room for composing more features
+ (such as integrity checks, compression...).
+* Does not require explicit C data access.
+
+Data type description -- format strings
+=======================================
+
+A data type is described using a format string. The format string only
+encodes information about the top-level type; for nested types, child types
+are described separately. Also, metadata is encoded in a separate string.
+
+The format strings are designed to be easily parsable, even from a language
+such as C. The most common primitive formats have one-character format
+strings:
+
++-----------------+--------------------------+------------+
+| Format string | Arrow data type | Notes |
++=================+==========================+============+
+| ``n`` | null | |
++-----------------+--------------------------+------------+
+| ``b`` | boolean | |
++-----------------+--------------------------+------------+
+| ``c`` | int8 | |
++-----------------+--------------------------+------------+
+| ``C`` | uint8 | |
++-----------------+--------------------------+------------+
+| ``s`` | int16 | |
++-----------------+--------------------------+------------+
+| ``S`` | uint16 | |
++-----------------+--------------------------+------------+
+| ``i`` | int32 | |
++-----------------+--------------------------+------------+
+| ``I`` | uint32 | |
++-----------------+--------------------------+------------+
+| ``l`` | int64 | |
++-----------------+--------------------------+------------+
+| ``L`` | uint64 | |
++-----------------+--------------------------+------------+
+| ``e`` | float16 | |
++-----------------+--------------------------+------------+
+| ``f`` | float32 | |
++-----------------+--------------------------+------------+
+| ``g`` | float64 | |
++-----------------+--------------------------+------------+
+
++-----------------+---------------------------------------------------+------------+
+| Format string | Arrow data type | Notes |
++=================+===================================================+============+
+| ``z`` | binary | |
++-----------------+---------------------------------------------------+------------+
+| ``Z`` | large binary | |
++-----------------+---------------------------------------------------+------------+
+| ``u`` | utf-8 string | |
++-----------------+---------------------------------------------------+------------+
+| ``U`` | large utf-8 string | |
++-----------------+---------------------------------------------------+------------+
+| ``d:19,10`` | decimal128 [precision 19, scale 10] | |
++-----------------+---------------------------------------------------+------------+
+| ``d:19,10,NNN`` | decimal bitwidth = NNN [precision 19, scale 10] | |
++-----------------+---------------------------------------------------+------------+
+| ``w:42`` | fixed-width binary [42 bytes] | |
++-----------------+---------------------------------------------------+------------+
+
+Temporal types have multi-character format strings starting with ``t``:
+
++-----------------+---------------------------------------------------+------------+
+| Format string | Arrow data type | Notes |
++=================+===================================================+============+
+| ``tdD`` | date32 [days] | |
++-----------------+---------------------------------------------------+------------+
+| ``tdm`` | date64 [milliseconds] | |
++-----------------+---------------------------------------------------+------------+
+| ``tts`` | time32 [seconds] | |
++-----------------+---------------------------------------------------+------------+
+| ``ttm`` | time32 [milliseconds] | |
++-----------------+---------------------------------------------------+------------+
+| ``ttu`` | time64 [microseconds] | |
++-----------------+---------------------------------------------------+------------+
+| ``ttn`` | time64 [nanoseconds] | |
++-----------------+---------------------------------------------------+------------+
+| ``tss:...`` | timestamp [seconds] with timezone "..." | \(1) |
++-----------------+---------------------------------------------------+------------+
+| ``tsm:...`` | timestamp [milliseconds] with timezone "..." | \(1) |
++-----------------+---------------------------------------------------+------------+
+| ``tsu:...`` | timestamp [microseconds] with timezone "..." | \(1) |
++-----------------+---------------------------------------------------+------------+
+| ``tsn:...`` | timestamp [nanoseconds] with timezone "..." | \(1) |
++-----------------+---------------------------------------------------+------------+
+| ``tDs`` | duration [seconds] | |
++-----------------+---------------------------------------------------+------------+
+| ``tDm`` | duration [milliseconds] | |
++-----------------+---------------------------------------------------+------------+
+| ``tDu`` | duration [microseconds] | |
++-----------------+---------------------------------------------------+------------+
+| ``tDn`` | duration [nanoseconds] | |
++-----------------+---------------------------------------------------+------------+
+| ``tiM`` | interval [months] | |
++-----------------+---------------------------------------------------+------------+
+| ``tiD`` | interval [days, time] | |
++-----------------+---------------------------------------------------+------------+
+| ``tin`` | interval [month, day, nanoseconds] | |
++-----------------+---------------------------------------------------+------------+
+
+
+Dictionary-encoded types do not have a specific format string. Instead, the
+format string of the base array represents the dictionary index type, and the
+value type can be read from the dependent dictionary array (see below
+"Dictionary-encoded arrays").
+
+Nested types have multiple-character format strings starting with ``+``. The
+names and types of child fields are read from the child arrays.
+
++------------------------+---------------------------------------------------+------------+
+| Format string | Arrow data type | Notes |
++========================+===================================================+============+
+| ``+l`` | list | |
++------------------------+---------------------------------------------------+------------+
+| ``+L`` | large list | |
++------------------------+---------------------------------------------------+------------+
+| ``+w:123`` | fixed-sized list [123 items] | |
++------------------------+---------------------------------------------------+------------+
+| ``+s`` | struct | |
++------------------------+---------------------------------------------------+------------+
+| ``+m`` | map | \(2) |
++------------------------+---------------------------------------------------+------------+
+| ``+ud:I,J,...`` | dense union with type ids I,J... | |
++------------------------+---------------------------------------------------+------------+
+| ``+us:I,J,...`` | sparse union with type ids I,J... | |
++------------------------+---------------------------------------------------+------------+
+
+Notes:
+
+(1)
+ The timezone string is appended as-is after the colon character ``:``, without
+ any quotes. If the timezone is empty, the colon ``:`` must still be included.
+
+(2)
+ As specified in the Arrow columnar format, the map type has a single child type
+ named ``entries``, itself a 2-child struct type of ``(key, value)``.
+
+Examples
+--------
+
+* A dictionary-encoded ``decimal128(precision = 12, scale = 5)`` array
+ with ``int16`` indices has format string ``s``, and its dependent dictionary
+ array has format string ``d:12,5``.
+* A ``list<uint64>`` array has format string ``+l``, and its single child
+ has format string ``L``.
+* A ``struct<ints: int32, floats: float32>`` has format string ``+s``; its two
+ children have names ``ints`` and ``floats``, and format strings ``i`` and
+ ``f`` respectively.
+* A ``map<string, float64>`` array has format string ``+m``; its single child
+ has name ``entries`` and format string ``+s``; its two grandchildren have names
+ ``key`` and ``value``, and format strings ``u`` and ``g`` respectively.
+* A ``sparse_union<ints: int32, floats: float32>`` with type ids ``4, 5``
+ has format string ``+us:4,5``; its two children have names ``ints`` and
+ ``floats``, and format strings ``i`` and ``f`` respectively.
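+
+As an illustration of how parameterized format strings might be parsed from C,
+here is a minimal sketch. The helper names ``parse_decimal_format`` and
+``timestamp_timezone`` are hypothetical and not part of this specification, and
+the decimal parser does not reject trailing characters.
+

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

// Hypothetical helper (not part of the specification): parse a decimal format
// string, either "d:P,S" or "d:P,S,BITS". Returns 0 on success, -1 otherwise.
// Trailing characters after the last number are not rejected in this sketch.
static int parse_decimal_format(const char* format, int* precision, int* scale,
                                int* bitwidth) {
  int n = sscanf(format, "d:%d,%d,%d", precision, scale, bitwidth);
  if (n == 2) {
    *bitwidth = 128;  // the two-field form denotes decimal128
    return 0;
  }
  return (n == 3) ? 0 : -1;
}

// Hypothetical helper: return the timezone of a timestamp format string
// ("tss:", "tsm:", "tsu:" or "tsn:" followed by the timezone), or NULL if
// `format` is not a timestamp. The result points into `format`; it is an
// empty string when no timezone is set.
static const char* timestamp_timezone(const char* format) {
  if (strlen(format) < 4 || format[0] != 't' || format[1] != 's' ||
      strchr("smun", format[2]) == NULL || format[3] != ':') {
    return NULL;
  }
  return format + 4;
}
```

+
+Because the timezone is appended as-is after the colon, the returned pointer
+can be used directly as a null-terminated string without copying.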
+
+
+Structure definitions
+=====================
+
+The following free-standing definitions are enough to support the Arrow
+C data interface in your project. Like the rest of the Arrow project, they
+are available under the Apache License 2.0.
+
+.. code-block:: c
+
+ #define ARROW_FLAG_DICTIONARY_ORDERED 1
+ #define ARROW_FLAG_NULLABLE 2
+ #define ARROW_FLAG_MAP_KEYS_SORTED 4
+
+ struct ArrowSchema {
+ // Array type description
+ const char* format;
+ const char* name;
+ const char* metadata;
+ int64_t flags;
+ int64_t n_children;
+ struct ArrowSchema** children;
+ struct ArrowSchema* dictionary;
+
+ // Release callback
+ void (*release)(struct ArrowSchema*);
+ // Opaque producer-specific data
+ void* private_data;
+ };
+
+ struct ArrowArray {
+ // Array data description
+ int64_t length;
+ int64_t null_count;
+ int64_t offset;
+ int64_t n_buffers;
+ int64_t n_children;
+ const void** buffers;
+ struct ArrowArray** children;
+ struct ArrowArray* dictionary;
+
+ // Release callback
+ void (*release)(struct ArrowArray*);
+ // Opaque producer-specific data
+ void* private_data;
+ };
+
+The ArrowSchema structure
+-------------------------
+
+The ``ArrowSchema`` structure describes the type and metadata of an exported
+array or record batch. It has the following fields:
+
+.. c:member:: const char* ArrowSchema.format
+
+ Mandatory. A null-terminated, UTF8-encoded string describing
+ the data type. If the data type is nested, child types are not
+ encoded here but in the :c:member:`ArrowSchema.children` structures.
+
+ Consumers MAY decide not to support all data types, but they
+ should document this limitation.
+
+.. c:member:: const char* ArrowSchema.name
+
+ Optional. A null-terminated, UTF8-encoded string of the field
+ or array name. This is mainly used to reconstruct child fields
+ of nested types.
+
+ Producers MAY decide not to provide this information, and consumers
+ MAY decide to ignore it. If omitted, MAY be NULL or an empty string.
+
+.. c:member:: const char* ArrowSchema.metadata
+
+ Optional. A binary string describing the type's metadata.
+ If the data type is nested, child types are not encoded here but
+ in the :c:member:`ArrowSchema.children` structures.
+
+ This string is not null-terminated but follows a specific format::
+
+ int32: number of key/value pairs (noted N below)
+ int32: byte length of key 0
+ key 0 (not null-terminated)
+ int32: byte length of value 0
+ value 0 (not null-terminated)
+ ...
+ int32: byte length of key N - 1
+ key N - 1 (not null-terminated)
+ int32: byte length of value N - 1
+ value N - 1 (not null-terminated)
+
+ Integers are stored in native endianness. For example, the metadata
+ ``[('key1', 'value1')]`` is encoded on a little-endian machine as::
+
+ \x01\x00\x00\x00\x04\x00\x00\x00key1\x06\x00\x00\x00value1
+
+ On a big-endian machine, the same example would be encoded as::
+
+ \x00\x00\x00\x01\x00\x00\x00\x04key1\x00\x00\x00\x06value1
+
+ If omitted, this field MUST be NULL (not an empty string).
+
+ Consumers MAY choose to ignore this information.
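+
+The metadata layout described above can be walked with a few lines of C. The
+following is a sketch only: ``metadata_find`` and ``read_int32`` are
+hypothetical names, and the code trusts the producer to have encoded the
+string correctly (no bounds checking is possible, since the total size is not
+stored).
+

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Read one native-endian int32 at *pos and advance *pos past it.
// memcpy is used because the integers are not necessarily aligned.
static int32_t read_int32(const char** pos) {
  int32_t value;
  memcpy(&value, *pos, sizeof(value));
  *pos += sizeof(value);
  return value;
}

// Hypothetical helper: look up `key` in an ArrowSchema.metadata string.
// Returns a pointer to the value bytes (not null-terminated) and stores
// their length in *value_len, or returns NULL if the key is absent or if
// `metadata` is NULL (i.e. omitted).
static const char* metadata_find(const char* metadata, const char* key,
                                 int32_t* value_len) {
  if (metadata == NULL) {
    return NULL;
  }
  const size_t wanted_len = strlen(key);
  const char* pos = metadata;
  const int32_t n_pairs = read_int32(&pos);
  for (int32_t i = 0; i < n_pairs; ++i) {
    const int32_t key_len = read_int32(&pos);
    const char* cur_key = pos;
    pos += key_len;
    const int32_t cur_value_len = read_int32(&pos);
    const char* cur_value = pos;
    pos += cur_value_len;
    if ((size_t) key_len == wanted_len &&
        memcmp(cur_key, key, wanted_len) == 0) {
      *value_len = cur_value_len;
      return cur_value;
    }
  }
  return NULL;
}
```

+
+Keys and values are length-prefixed rather than null-terminated, so the lookup
+compares with ``memcmp`` and reports the value length separately.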
+
+.. c:member:: int64_t ArrowSchema.flags
+
+ Optional. A bitfield of flags enriching the type description.
+ Its value is computed by OR'ing together the flag values.
+ The following flags are available:
+
+ * ``ARROW_FLAG_NULLABLE``: whether this field is semantically nullable
+ (regardless of whether it actually has null values).
+ * ``ARROW_FLAG_DICTIONARY_ORDERED``: for dictionary-encoded types,
+ whether the ordering of dictionary indices is semantically meaningful.
+ * ``ARROW_FLAG_MAP_KEYS_SORTED``: for map types, whether the keys within
+ each map value are sorted.
+
+ If omitted, MUST be 0.
+
+ Consumers MAY choose to ignore some or all of the flags. Even then,
+ they SHOULD keep this value around so as to propagate its information
+ to their own consumers.
+
+.. c:member:: int64_t ArrowSchema.n_children
+
+ Mandatory. The number of children this type has.
+
+.. c:member:: ArrowSchema** ArrowSchema.children
+
+ Optional. A C array of pointers to each child type of this type.
+ There must be :c:member:`ArrowSchema.n_children` pointers.
+
+ MAY be NULL only if :c:member:`ArrowSchema.n_children` is 0.
+
+.. c:member:: ArrowSchema* ArrowSchema.dictionary
+
+ Optional. A pointer to the type of dictionary values.
+
+ MUST be present if the ArrowSchema represents a dictionary-encoded type.
+ MUST be NULL otherwise.
+
+.. c:member:: void (*ArrowSchema.release)(struct ArrowSchema*)
+
+ Mandatory. A pointer to a producer-provided release callback.
+
+ See below for memory management and release callback semantics.
+
+.. c:member:: void* ArrowSchema.private_data
+
+ Optional. An opaque pointer to producer-provided private data.
+
+ Consumers MUST not process this member. Lifetime of this member
+ is handled by the producer, and especially by the release callback.
+
+
+The ArrowArray structure
+------------------------
+
+The ``ArrowArray`` structure describes the data of an exported array or record batch.
+For the ``ArrowArray`` structure to be interpreted, the array type
+or record batch schema must already be known. This is either done by
+convention -- for example a producer API that always produces the same data
+type -- or by passing an ``ArrowSchema`` on the side.
+
+It has the following fields:
+
+.. c:member:: int64_t ArrowArray.length
+
+ Mandatory. The logical length of the array (i.e. its number of items).
+
+.. c:member:: int64_t ArrowArray.null_count
+
+ Mandatory. The number of null items in the array. MAY be -1 if not
+ yet computed.
+
+.. c:member:: int64_t ArrowArray.offset
+
+ Mandatory. The logical offset inside the array (i.e. the number of items
+ from the physical start of the buffers). MUST be 0 or positive.
+
+ Producers MAY specify that they will only produce 0-offset arrays to
+ ease implementation of consumer code.
+ Consumers MAY decide not to support non-0-offset arrays, but they
+ should document this limitation.
+
+.. c:member:: int64_t ArrowArray.n_buffers
+
+ Mandatory. The number of physical buffers backing this array. The
+ number of buffers is a function of the data type, as described in the
+ :ref:`Columnar format specification <format_columnar>`.
+
+ Buffers of children arrays are not included.
+
+.. c:member:: const void** ArrowArray.buffers
+
+ Mandatory. A C array of pointers to the start of each physical buffer
+ backing this array. Each ``void*`` pointer is the physical start of
+ a contiguous buffer. There must be :c:member:`ArrowArray.n_buffers` pointers.
+
+ The producer MUST ensure that each contiguous buffer is large enough to
+ represent ``length + offset`` values encoded according to the
+ :ref:`Columnar format specification <format_columnar>`.
+
+ It is recommended, but not required, that the memory addresses of the
+ buffers be aligned at least according to the type of primitive data that
+ they contain. Consumers MAY decide not to support unaligned memory.
+
+ The pointer to the null bitmap buffer, if the data type specifies one,
+ MAY be NULL only if :c:member:`ArrowArray.null_count` is 0.
+
+ Buffers of children arrays are not included.
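+
+To show how ``length``, ``offset`` and the buffers fit together, here is a
+sketch of a consumer summing the non-null items of an ``int32`` array. The
+function names are hypothetical. The code assumes the layout given by the
+columnar format for primitive arrays: ``buffers[0]`` is the validity bitmap
+(least-significant bit first, a set bit meaning "valid") and ``buffers[1]``
+holds the values.
+

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Returns 1 if logical item `i` is valid (non-null), 0 otherwise.
// `validity` is buffers[0] and `offset` is ArrowArray.offset.
static int item_is_valid(const uint8_t* validity, int64_t offset, int64_t i) {
  if (validity == NULL) {
    return 1;  // the bitmap may be omitted when there are no nulls
  }
  const int64_t bit = offset + i;  // position from the physical buffer start
  return (validity[bit / 8] >> (bit % 8)) & 1;
}

// Sum the non-null items of an int32 array described by its buffers,
// logical length and logical offset.
static int64_t sum_int32(const void** buffers, int64_t length, int64_t offset) {
  const uint8_t* validity = (const uint8_t*) buffers[0];
  const int32_t* values = (const int32_t*) buffers[1];
  int64_t total = 0;
  for (int64_t i = 0; i < length; ++i) {
    if (item_is_valid(validity, offset, i)) {
      total += values[offset + i];
    }
  }
  return total;
}
```

+
+Note that the offset applies to both the bitmap and the values buffer, which
+is why each buffer must be able to hold ``length + offset`` items.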
+
+.. c:member:: int64_t ArrowArray.n_children
+
+ Mandatory. The number of children this array has. The number of children
+ is a function of the data type, as described in the
+ :ref:`Columnar format specification <format_columnar>`.
+
+.. c:member:: ArrowArray** ArrowArray.children
+
+ Optional. A C array of pointers to each child array of this array.
+ There must be :c:member:`ArrowArray.n_children` pointers.
+
+ MAY be NULL only if :c:member:`ArrowArray.n_children` is 0.
+
+.. c:member:: ArrowArray* ArrowArray.dictionary
+
+ Optional. A pointer to the underlying array of dictionary values.
+
+ MUST be present if the ArrowArray represents a dictionary-encoded array.
+ MUST be NULL otherwise.
+
+.. c:member:: void (*ArrowArray.release)(struct ArrowArray*)
+
+ Mandatory. A pointer to a producer-provided release callback.
+
+ See below for memory management and release callback semantics.
+
+.. c:member:: void* ArrowArray.private_data
+
+ Optional. An opaque pointer to producer-provided private data.
+
+ Consumers MUST not process this member. Lifetime of this member
+ is handled by the producer, and especially by the release callback.
+
+
+Dictionary-encoded arrays
+-------------------------
+
+For dictionary-encoded arrays, the :c:member:`ArrowSchema.format` string
+encodes the *index* type. The dictionary *value* type can be read
+from the :c:member:`ArrowSchema.dictionary` structure.
+
+The same holds for the ``ArrowArray`` structure: while the parent
+structure points to the index data, the :c:member:`ArrowArray.dictionary`
+points to the dictionary values array.
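+
+As a sketch of how a consumer might resolve a dictionary-encoded item, the
+hypothetical function below assumes ``int16`` indices (format string ``s``),
+an ``int32`` dictionary (format string ``i``) and no nulls in either array.
+The ``ArrowArray`` definition is repeated so that the example stands alone.
+

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct ArrowArray {
  int64_t length;
  int64_t null_count;
  int64_t offset;
  int64_t n_buffers;
  int64_t n_children;
  const void** buffers;
  struct ArrowArray** children;
  struct ArrowArray* dictionary;
  void (*release)(struct ArrowArray*);
  void* private_data;
};

// Hypothetical helper: return the dictionary value referenced by logical
// item `i` of a dictionary-encoded array. Assumes int16 indices, int32
// dictionary values, and no nulls (so the validity bitmaps are ignored).
static int32_t dict_value_at(const struct ArrowArray* array, int64_t i) {
  const int16_t* indices = (const int16_t*) array->buffers[1];
  const struct ArrowArray* dict = array->dictionary;
  const int32_t* values = (const int32_t*) dict->buffers[1];
  const int16_t index = indices[array->offset + i];
  // The dictionary array has its own offset, independent of the parent's.
  return values[dict->offset + index];
}
```

+
+The parent structure only ever holds index data; the values are always reached
+through the ``dictionary`` pointer.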
+
+Extension arrays
+----------------
+
+For extension arrays, the :c:member:`ArrowSchema.format` string encodes the
+*storage* type. Information about the extension type is encoded in the
+:c:member:`ArrowSchema.metadata` string, similarly to the
+:ref:`IPC format <format_metadata_extension_types>`. Specifically, the
+metadata key ``ARROW:extension:name`` encodes the extension type name,
+and the metadata key ``ARROW:extension:metadata`` encodes the
+implementation-specific serialization of the extension type (for
+parameterized extension types). The base64 encoding of metadata values
+ensures that any possible serialization is representable.
+
+The ``ArrowArray`` structure exported from an extension array simply points
+to the storage data of the extension array.
+
+Memory management
+-----------------
+
+The ``ArrowSchema`` and ``ArrowArray`` structures follow the same conventions
+for memory management. The term *"base structure"* below refers to the
+``ArrowSchema`` or ``ArrowArray`` that is passed between producer and consumer
+-- not any child structure thereof.
+
+Member allocation
+'''''''''''''''''
+
+It is intended for the base structure to be stack- or heap-allocated by the
+consumer. In this case, the producer API should take a pointer to the
+consumer-allocated structure.
+
+However, any data pointed to by the struct MUST be allocated and maintained
+by the producer. This includes the format and metadata strings, the arrays
+of buffer and children pointers, etc.
+
+Therefore, the consumer MUST not try to interfere with the producer's
+handling of these members' lifetime. The only way the consumer influences
+data lifetime is by calling the base structure's ``release`` callback.
+
+.. _c-data-interface-released:
+
+Released structure
+''''''''''''''''''
+
+A released structure is indicated by setting its ``release`` callback to NULL.
+Before reading and interpreting a structure's data, consumers SHOULD check
+for a NULL release callback and treat it accordingly (probably by erroring
+out).
+
+Release callback semantics -- for consumers
+'''''''''''''''''''''''''''''''''''''''''''
+
+Consumers MUST call a base structure's release callback when they won't be using
+it anymore, but they MUST not call any of its children's release callbacks
+(including the optional dictionary). The producer is responsible for releasing
+the children.
+
+In any case, a consumer MUST not try to access the base structure anymore
+after calling its release callback -- including any associated data such
+as its children.
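+
+The consumer-side rules can be sketched as follows. ``consume_length`` is a
+hypothetical helper, and ``toy_release`` stands in for a real producer
+callback purely for demonstration. The ``ArrowArray`` definition is repeated
+so that the example stands alone.
+

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct ArrowArray {
  int64_t length;
  int64_t null_count;
  int64_t offset;
  int64_t n_buffers;
  int64_t n_children;
  const void** buffers;
  struct ArrowArray** children;
  struct ArrowArray* dictionary;
  void (*release)(struct ArrowArray*);
  void* private_data;
};

// Toy producer callback, for demonstration only: it owns nothing, so it
// just marks the structure as released.
static void toy_release(struct ArrowArray* array) { array->release = NULL; }

// Hypothetical consumer helper: read the length of `array`, then release it.
// Returns 0 on success, -1 if the structure was already released.
static int consume_length(struct ArrowArray* array, int64_t* out_length) {
  if (array->release == NULL) {
    return -1;  // released structure: its contents must not be interpreted
  }
  *out_length = array->length;
  // Release the base structure only. The producer's callback is responsible
  // for releasing the children and the dictionary.
  array->release(array);
  return 0;
}
```

+
+The consumer never loops over ``children`` to release them; a single call on
+the base structure is the whole of its responsibility.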
+
+Release callback semantics -- for producers
+'''''''''''''''''''''''''''''''''''''''''''
+
+If producers need additional information for lifetime handling (for
+example, a C++ producer may want to use ``shared_ptr`` for array and
+buffer lifetime), they MUST use the ``private_data`` member to locate the
+required bookkeeping information.
+
+The release callback MUST not assume that the structure will be located
+at the same memory location as when it was originally produced. The consumer
+is free to move the structure around (see "Moving an array").
+
+The release callback MUST walk all children structures (including the optional
+dictionary) and call their own release callbacks.
+
+The release callback MUST free any data area directly owned by the structure
+(such as the buffers and children members).
+
+The release callback MUST mark the structure as released, by setting
+its ``release`` member to NULL.
+
+Below is a good starting point for implementing a release callback, where the
+TODO area must be filled with producer-specific deallocation code:
+
+.. code-block:: c
+
+ static void ReleaseExportedArray(struct ArrowArray* array) {
+ // This should not be called on an already released array
+ assert(array->release != NULL);
+
+ // Release children
+ for (int64_t i = 0; i < array->n_children; ++i) {
+ struct ArrowArray* child = array->children[i];
+ if (child->release != NULL) {
+ child->release(child);
+ assert(child->release == NULL);
+ }
+ }
+
+ // Release dictionary
+ struct ArrowArray* dict = array->dictionary;
+ if (dict != NULL && dict->release != NULL) {
+ dict->release(dict);
+ assert(dict->release == NULL);
+ }
+
+ // TODO here: release and/or deallocate all data directly owned by
+ // the ArrowArray struct, such as the private_data.
+
+ // Mark array released
+ array->release = NULL;
+ }
+
+
+Moving an array
+'''''''''''''''
+
+The consumer can *move* the ``ArrowArray`` structure by bitwise copying or
+shallow member-wise copying. Then it MUST mark the source structure released
+(see "released structure" above for how to do it) but *without* calling the
+release callback. This ensures that only one live copy of the struct is
+active at any given time and that lifetime is correctly communicated to
+the producer.
+
+As usual, the release callback will be called on the destination structure
+when it is not needed anymore.
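+
+A move as described above might look like the sketch below. ``move_array`` is
+a hypothetical name, and ``toy_release`` again stands in for a real producer
+callback. The ``ArrowArray`` definition is repeated so that the example stands
+alone.
+

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct ArrowArray {
  int64_t length;
  int64_t null_count;
  int64_t offset;
  int64_t n_buffers;
  int64_t n_children;
  const void** buffers;
  struct ArrowArray** children;
  struct ArrowArray* dictionary;
  void (*release)(struct ArrowArray*);
  void* private_data;
};

// Toy producer callback, for demonstration only.
static void toy_release(struct ArrowArray* array) { array->release = NULL; }

// Hypothetical helper: move `src` into `dest` by bitwise copy, then mark
// `src` as released WITHOUT calling its release callback.
static void move_array(struct ArrowArray* src, struct ArrowArray* dest) {
  assert(src != dest);
  assert(src->release != NULL);  // moving a released structure is an error
  memcpy(dest, src, sizeof(struct ArrowArray));
  src->release = NULL;
}
```

+
+After the move, only ``dest`` is live, and its release callback is the one that
+eventually hands the data back to the producer.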
+
+Moving child arrays
+~~~~~~~~~~~~~~~~~~~
+
+It is also possible to move one or several child arrays, but the parent
+``ArrowArray`` structure MUST be released immediately afterwards, as it
+won't point to valid child arrays anymore.
+
+The main use case for this is to keep alive only a subset of child arrays
+(for example if you are only interested in certain columns of the data),
+while releasing the others.
+
+.. note::
+
+ For moving to work correctly, the ``ArrowArray`` structure has to be
+ trivially relocatable. Therefore, pointer members inside the ``ArrowArray``
+ structure (including ``private_data``) MUST not point inside the structure
+ itself. Also, external pointers to the structure MUST not be separately
+ stored by the producer. Instead, the producer MUST use the ``private_data``
+ member so as to remember any necessary bookkeeping information.
+
+Record batches
+--------------
+
+A record batch can be trivially considered as an equivalent struct array with
+additional top-level metadata.
+
+Example use case
+================
+
+A C++ database engine wants to provide the option to deliver results in Arrow
+format, but without imposing on itself a dependency on the Arrow software
+libraries. With the Arrow C data interface, the engine can let the caller pass
+a pointer to an ``ArrowArray`` structure, and fill it with the next chunk of
+results.
+
+It can do so without including the Arrow C++ headers or linking with the
+Arrow DLLs. Furthermore, the database engine's C API can benefit other
+runtimes and libraries that know about the Arrow C data interface,
+through e.g. a C FFI layer.
+
+C producer examples
+===================
+
+Exporting a simple ``int32`` array
+----------------------------------
+
+Export a non-nullable ``int32`` type with empty metadata. In this case,
+all ``ArrowSchema`` members point to statically-allocated data, so the
+release callback is trivial.
+
+.. code-block:: c
+
+ static void release_int32_type(struct ArrowSchema* schema) {
+ // Mark released
+ schema->release = NULL;
+ }
+
+ void export_int32_type(struct ArrowSchema* schema) {
+ *schema = (struct ArrowSchema) {
+ // Type description
+ .format = "i",
+ .name = "",
+ .metadata = NULL,
+ .flags = 0,
+ .n_children = 0,
+ .children = NULL,
+ .dictionary = NULL,
+ // Bookkeeping
+ .release = &release_int32_type
+ };
+ }
+
+Export a C-malloc()ed array of the same type as an Arrow array, transferring
+ownership to the consumer through the release callback:
+
+.. code-block:: c
+
+ static void release_int32_array(struct ArrowArray* array) {
+ assert(array->n_buffers == 2);
+ // Free the buffers and the buffers array
+ free((void *) array->buffers[1]);
+ free(array->buffers);
+ // Mark released
+ array->release = NULL;
+ }
+
+ void export_int32_array(const int32_t* data, int64_t nitems,
+ struct ArrowArray* array) {
+ // Initialize primitive fields
+ *array = (struct ArrowArray) {
+ // Data description
+ .length = nitems,
+ .offset = 0,
+ .null_count = 0,
+ .n_buffers = 2,
+ .n_children = 0,
+ .children = NULL,
+ .dictionary = NULL,
+ // Bookkeeping
+ .release = &release_int32_array
+ };
+ // Allocate list of buffers
+ array->buffers = (const void**) malloc(sizeof(void*) * array->n_buffers);
+ assert(array->buffers != NULL);
+ array->buffers[0] = NULL; // no nulls, null bitmap can be omitted
+ array->buffers[1] = data;
+ }
+
+Exporting a ``struct<float32, utf8>`` array
+-------------------------------------------
+
+Export the array type as an ``ArrowSchema`` with C-malloc()ed children:
+
+.. code-block:: c
+
+ static void release_malloced_type(struct ArrowSchema* schema) {
+ int i;
+ for (i = 0; i < schema->n_children; ++i) {
+ struct ArrowSchema* child = schema->children[i];
+ if (child->release != NULL) {
+ child->release(child);
+ }
+ }
+ free(schema->children);
+ // Mark released
+ schema->release = NULL;
+ }
+
+ void export_float32_utf8_type(struct ArrowSchema* schema) {
+ struct ArrowSchema* child;
+
+ //
+ // Initialize parent type
+ //
+ *schema = (struct ArrowSchema) {
+ // Type description
+ .format = "+s",
+ .name = "",
+ .metadata = NULL,
+ .flags = 0,
+ .n_children = 2,
+ .dictionary = NULL,
+ // Bookkeeping
+ .release = &release_malloced_type
+ };
+ // Allocate list of children types
+ schema->children = malloc(sizeof(struct ArrowSchema*) * schema->n_children);
+
+ //
+ // Initialize child type #0
+ //
+ child = schema->children[0] = malloc(sizeof(struct ArrowSchema));
+ *child = (struct ArrowSchema) {
+ // Type description
+ .format = "f",
+ .name = "floats",
+ .metadata = NULL,
+ .flags = ARROW_FLAG_NULLABLE,
+ .n_children = 0,
+ .dictionary = NULL,
+ .children = NULL,
+ // Bookkeeping
+ .release = &release_malloced_type
+ };
+
+ //
+ // Initialize child type #1
+ //
+ child = schema->children[1] = malloc(sizeof(struct ArrowSchema));
+ *child = (struct ArrowSchema) {
+ // Type description
+ .format = "u",
+ .name = "strings",
+ .metadata = NULL,
+ .flags = ARROW_FLAG_NULLABLE,
+ .n_children = 0,
+ .dictionary = NULL,
+ .children = NULL,
+ // Bookkeeping
+ .release = &release_malloced_type
+ };
+ }
+
+Export C-malloc()ed arrays in Arrow-compatible layout as an Arrow struct array,
+transferring ownership to the consumer:
+
+.. code-block:: c
+
+ static void release_malloced_array(struct ArrowArray* array) {
+ int i;
+ // Free children
+ for (i = 0; i < array->n_children; ++i) {
+ struct ArrowArray* child = array->children[i];
+ if (child->release != NULL) {
+ child->release(child);
+ }
+ }
+ free(array->children);
+ // Free buffers
+ for (i = 0; i < array->n_buffers; ++i) {
+ free((void *) array->buffers[i]);
+ }
+ free(array->buffers);
+ // Mark released
+ array->release = NULL;
+ }
+
+ void export_float32_utf8_array(
+ int64_t nitems,
+ const uint8_t* float32_nulls, const float* float32_data,
+ const uint8_t* utf8_nulls, const int32_t* utf8_offsets, const uint8_t* utf8_data,
+ struct ArrowArray* array) {
+ struct ArrowArray* child;
+
+ //
+ // Initialize parent array
+ //
+ *array = (struct ArrowArray) {
+ // Data description
+ .length = nitems,
+ .offset = 0,
+ .null_count = 0,
+ .n_buffers = 1,
+ .n_children = 2,
+ .dictionary = NULL,
+ // Bookkeeping
+ .release = &release_malloced_array
+ };
+ // Allocate list of parent buffers
+ array->buffers = malloc(sizeof(void*) * array->n_buffers);
+ array->buffers[0] = NULL; // no nulls, null bitmap can be omitted
+ // Allocate list of children arrays
+ array->children = malloc(sizeof(struct ArrowArray*) * array->n_children);
+
+ //
+ // Initialize child array #0
+ //
+ child = array->children[0] = malloc(sizeof(struct ArrowArray));
+ *child = (struct ArrowArray) {
+ // Data description
+ .length = nitems,
+ .offset = 0,
+ .null_count = -1,
+ .n_buffers = 2,
+ .n_children = 0,
+ .dictionary = NULL,
+ .children = NULL,
+ // Bookkeeping
+ .release = &release_malloced_array
+ };
+ child->buffers = malloc(sizeof(void*) * child->n_buffers);
+ child->buffers[0] = float32_nulls;
+ child->buffers[1] = float32_data;
+
+ //
+ // Initialize child array #1
+ //
+ child = array->children[1] = malloc(sizeof(struct ArrowArray));
+ *child = (struct ArrowArray) {
+ // Data description
+ .length = nitems,
+ .offset = 0,
+ .null_count = -1,
+ .n_buffers = 3,
+ .n_children = 0,
+ .dictionary = NULL,
+ .children = NULL,
+ // Bookkeeping
+ .release = &release_malloced_array
+ };
+ child->buffers = malloc(sizeof(void*) * child->n_buffers);
+ child->buffers[0] = utf8_nulls;
+ child->buffers[1] = utf8_offsets;
+ child->buffers[2] = utf8_data;
+ }
+
+
+Why two distinct structures?
+============================
+
+In many cases, the same type or schema description applies to multiple,
+possibly short, batches of data. To avoid paying the cost of exporting
+and importing the type description for each batch, the ``ArrowSchema``
+can be passed once, separately, at the beginning of the conversation between
+producer and consumer.
+
+In other cases yet, the data type is fixed by the producer API, and may not
+need to be communicated at all.
+
+However, if a producer is focused on one-shot exchange of data, it can
+communicate the ``ArrowSchema`` and ``ArrowArray`` structures in the same
+API call.
+
+Updating this specification
+===========================
+
+Once this specification is supported in an official Arrow release, the C
+ABI is frozen. This means the ``ArrowSchema`` and ``ArrowArray`` structure
+definitions should not change in any way -- including adding new members.
+
+Backwards-compatible changes are allowed, for example new
+:c:member:`ArrowSchema.flags` values or expanded possibilities for
+the :c:member:`ArrowSchema.format` string.
+
+Any incompatible changes should be part of a new specification, for example
+"Arrow C data interface v2".
+
+Inspiration
+===========
+
+The Arrow C data interface is inspired by the `Python buffer protocol`_,
+which has proven immensely successful in allowing various Python libraries
+to exchange numerical data with no knowledge of each other and near-zero
+adaptation cost.
+
+
+.. _Python buffer protocol: https://www.python.org/dev/peps/pep-3118/