.. Licensed to the Apache Software Foundation (ASF) under one
.. or more contributor license agreements.  See the NOTICE file
.. distributed with this work for additional information
.. regarding copyright ownership.  The ASF licenses this file
.. to you under the Apache License, Version 2.0 (the
.. "License"); you may not use this file except in compliance
.. with the License.  You may obtain a copy of the License at

..   http://www.apache.org/licenses/LICENSE-2.0

.. Unless required by applicable law or agreed to in writing,
.. software distributed under the License is distributed on an
.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
.. KIND, either express or implied.  See the License for the
.. specific language governing permissions and limitations
.. under the License.

.. currentmodule:: pyarrow
.. _data:

Data Types and In-Memory Data Model
===================================

Apache Arrow defines columnar array data structures by composing type metadata
with memory buffers, like the ones explained in the documentation on
:ref:`Memory and IO <io>`. These data structures are exposed in Python through
a series of interrelated classes:

* **Type Metadata**: Instances of ``pyarrow.DataType``, which describe a logical
  array type
* **Schemas**: Instances of ``pyarrow.Schema``, which describe a named
  collection of types. These can be thought of as the column types in a
  table-like object.
* **Arrays**: Instances of ``pyarrow.Array``, which are atomic, contiguous
  columnar data structures composed from Arrow Buffer objects
* **Record Batches**: Instances of ``pyarrow.RecordBatch``, which are a
  collection of Array objects with a particular Schema
* **Tables**: Instances of ``pyarrow.Table``, a logical table data structure in
  which each column consists of one or more ``pyarrow.Array`` objects of the
  same type.

We will examine these in the sections below in a series of examples.

.. _data.types:

Type Metadata
-------------

Apache Arrow defines language-agnostic column-oriented data structures for
array data. These include:

* **Fixed-length primitive types**: numbers, booleans, dates and times,
  fixed-size binary, decimals, and other values that fit into a given number
  of bits
* **Variable-length primitive types**: binary, string
* **Nested types**: list, struct, and union
* **Dictionary type**: An encoded categorical type (more on this later)

Each logical data type in Arrow has a corresponding factory function for
creating an instance of that type object in Python:

.. ipython:: python

    import pyarrow as pa
    t1 = pa.int32()
    t2 = pa.string()
    t3 = pa.binary()
    t4 = pa.binary(10)
    t5 = pa.timestamp('ms')

    t1
    print(t1)
    print(t4)
    print(t5)

We use the name **logical type** because the **physical** storage may be the
same for one or more types. For example, ``int64``, ``float64``, and
``timestamp[ms]`` all occupy 64 bits per value.

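As a quick check of that claim, all three types report the same physical
width. This is only a small sketch; it assumes the ``bit_width`` attribute
that PyArrow exposes on fixed-width data types:

.. ipython:: python

    # All three logical types use 64 bits of storage per value
    pa.int64().bit_width
    pa.float64().bit_width
    pa.timestamp('ms').bit_width
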
These objects are `metadata`; they are used for describing the data in arrays,
schemas, and record batches. In Python, they can be used in functions where the
input data (e.g. Python objects) may be coerced to more than one Arrow type.

The :class:`~pyarrow.Field` type is a type plus a name and optional
user-defined metadata:

.. ipython:: python

    f0 = pa.field('int32_field', t1)
    f0
    f0.name
    f0.type

Arrow supports **nested value types** like list, struct, and union. When
creating these, you must pass types or fields to indicate the data types of the
types' children. For example, we can define a list of int32 values with:

.. ipython:: python

    t6 = pa.list_(t1)
    t6

A `struct` is a collection of named fields:

.. ipython:: python

    fields = [
        pa.field('s0', t1),
        pa.field('s1', t2),
        pa.field('s2', t4),
        pa.field('s3', t6),
    ]

    t7 = pa.struct(fields)
    print(t7)

For convenience, you can pass ``(name, type)`` tuples directly instead of
:class:`~pyarrow.Field` instances:

.. ipython:: python

    t8 = pa.struct([('s0', t1), ('s1', t2), ('s2', t4), ('s3', t6)])
    print(t8)
    t8 == t7

See :ref:`Data Types API <api.types>` for a full listing of data type
functions.

.. _data.schema:

Schemas
-------

The :class:`~pyarrow.Schema` type is similar to the ``struct`` array type; it
defines the column names and types in a record batch or table data
structure. The :func:`pyarrow.schema` factory function makes new Schema objects
in Python:

.. ipython:: python

    my_schema = pa.schema([('field0', t1),
                           ('field1', t2),
                           ('field2', t4),
                           ('field3', t6)])
    my_schema

In some applications, you may not create schemas directly, but only use the
ones that are embedded in :ref:`IPC messages <ipc>`.

.. _data.array:

Arrays
------

For each data type, there is an accompanying array data structure for holding
memory buffers that define a single contiguous chunk of columnar array
data. When you are using PyArrow, this data may come from IPC tools, though it
can also be created from various types of Python sequences (lists, NumPy
arrays, pandas data).

A simple way to create arrays is with ``pyarrow.array``, which is similar to
the ``numpy.array`` function. By default PyArrow will infer the data type
for you:

.. ipython:: python

    arr = pa.array([1, 2, None, 3])
    arr

But you may also pass a specific data type to override type inference:

.. ipython:: python

    pa.array([1, 2], type=pa.uint16())

The array's ``type`` attribute is the corresponding piece of type metadata:

.. ipython:: python

    arr.type

Each in-memory array has a known length and null count (which will be 0 if
there are no null values):

.. ipython:: python

    len(arr)
    arr.null_count

Scalar values can be selected with normal indexing. ``pyarrow.array`` converts
``None`` values to Arrow nulls; we return the special ``pyarrow.NA`` value for
nulls:

.. ipython:: python

    arr[0]
    arr[2]

Arrow data is immutable, so values can be selected but not assigned.

Arrays can be sliced without copying:

.. ipython:: python

    arr[1:3]

None values and NaN handling
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

As mentioned in the above section, the Python object ``None`` is always
converted to an Arrow null element when converting to ``pyarrow.Array``. The
float NaN value, which can be represented by the Python object
``float('nan')`` or by ``numpy.nan``, is normally converted to a *valid* float
value during the conversion. If an integer input that contains ``np.nan`` is
supplied to ``pyarrow.array``, a ``ValueError`` is raised.

For better compatibility with pandas, we support interpreting NaN values as
null elements. This is enabled automatically on all ``from_pandas`` functions
and can be enabled on the other conversion functions by passing
``from_pandas=True`` as a function parameter.

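As a minimal sketch of that behaviour (assuming NumPy is available in the
session), compare the null counts with and without ``from_pandas=True``:

.. ipython:: python

    import numpy as np

    # By default, NaN is stored as a regular float value, not a null
    pa.array([1.0, np.nan]).null_count

    # With from_pandas=True, NaN is interpreted as a null element
    pa.array([1.0, np.nan], from_pandas=True).null_count
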
List arrays
~~~~~~~~~~~

``pyarrow.array`` is able to infer the type of simple nested data structures
like lists:

.. ipython:: python

    nested_arr = pa.array([[], None, [1, 2], [None, 1]])
    print(nested_arr.type)

Struct arrays
~~~~~~~~~~~~~

For other kinds of nested arrays, such as struct arrays, you currently need
to pass the type explicitly. Struct arrays can be initialized from a
sequence of Python dicts or tuples:

.. ipython:: python

    ty = pa.struct([('x', pa.int8()),
                    ('y', pa.bool_())])
    pa.array([{'x': 1, 'y': True}, {'x': 2, 'y': False}], type=ty)
    pa.array([(3, True), (4, False)], type=ty)

When initializing a struct array, nulls are allowed both at the struct
level and at the individual field level. If initializing from a sequence
of Python dicts, a missing dict key is handled as a null value:

.. ipython:: python

    pa.array([{'x': 1}, None, {'y': None}], type=ty)

You can also construct a struct array from existing arrays for each of the
struct's components. In this case, data storage will be shared with the
individual arrays, and no copy is involved:

.. ipython:: python

    xs = pa.array([5, 6, 7], type=pa.int16())
    ys = pa.array([False, True, True])
    arr = pa.StructArray.from_arrays((xs, ys), names=('x', 'y'))
    arr.type
    arr

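The child arrays can also be retrieved from the struct array. The following
sketch assumes the ``StructArray.field`` accessor, which looks up a child by
position and should likewise not copy the underlying data:

.. ipython:: python

    # Retrieve the first child array (the 'x' field) from the struct array
    arr.field(0)
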
Union arrays
~~~~~~~~~~~~

The union type represents a nested array type where each value can be one
(and only one) of a set of possible types. There are two possible
storage types for union arrays: sparse and dense.

In a sparse union array, each of the child arrays has the same length
as the resulting union array. The child arrays are accompanied by an ``int8``
"types" array that tells, for each value, from which child array it must be
selected:

.. ipython:: python

    xs = pa.array([5, 6, 7])
    ys = pa.array([False, False, True])
    types = pa.array([0, 1, 1], type=pa.int8())
    union_arr = pa.UnionArray.from_sparse(types, [xs, ys])
    union_arr.type
    union_arr

In a dense union array, you also pass, in addition to the ``int8`` "types"
array, an ``int32`` "offsets" array that tells, for each value, at which
offset in the selected child array it can be found:

.. ipython:: python

    xs = pa.array([5, 6, 7])
    ys = pa.array([False, True])
    types = pa.array([0, 1, 1, 0, 0], type=pa.int8())
    offsets = pa.array([0, 0, 1, 1, 2], type=pa.int32())
    union_arr = pa.UnionArray.from_dense(types, offsets, [xs, ys])
    union_arr.type
    union_arr

.. _data.dictionary:

Dictionary Arrays
~~~~~~~~~~~~~~~~~

The **Dictionary** type in PyArrow is a special array type that is similar to a
factor in R or a ``pandas.Categorical``. It enables one or more record batches
in a file or stream to transmit integer *indices* referencing a shared
**dictionary** containing the distinct values in the logical array. This is
particularly often used with strings to save memory and improve performance.

The way that dictionaries are handled in the Apache Arrow format and the way
they appear in C++ and Python is slightly different. We define a special
:class:`~.DictionaryArray` type with a corresponding dictionary type. Let's
consider an example:

.. ipython:: python

    indices = pa.array([0, 1, 0, 1, 2, 0, None, 2])
    dictionary = pa.array(['foo', 'bar', 'baz'])

    dict_array = pa.DictionaryArray.from_arrays(indices, dictionary)
    dict_array

Here we have:

.. ipython:: python

    print(dict_array.type)
    dict_array.indices
    dict_array.dictionary

When using :class:`~.DictionaryArray` with pandas, the analogue is
``pandas.Categorical`` (more on this later):

.. ipython:: python

    dict_array.to_pandas()

.. _data.record_batch:

Record Batches
--------------

A **Record Batch** in Apache Arrow is a collection of equal-length array
instances. Let's consider a collection of arrays:

.. ipython:: python

    data = [
        pa.array([1, 2, 3, 4]),
        pa.array(['foo', 'bar', 'baz', None]),
        pa.array([True, None, False, True])
    ]

A record batch can be created from this list of arrays using
``RecordBatch.from_arrays``:

.. ipython:: python

    batch = pa.RecordBatch.from_arrays(data, ['f0', 'f1', 'f2'])
    batch.num_columns
    batch.num_rows
    batch.schema

    batch[1]

A record batch can be sliced without copying memory, like an array:

.. ipython:: python

    batch2 = batch.slice(1, 3)
    batch2[1]

.. _data.table:

Tables
------

The PyArrow :class:`~.Table` type is not part of the Apache Arrow
specification, but is rather a tool to help with wrangling multiple record
batches and array pieces as a single logical dataset. As a relevant example, we
may receive multiple small record batches in a socket stream, then need to
concatenate them into contiguous memory for use in NumPy or pandas. The Table
object makes this efficient without requiring additional memory copying.

Considering the record batch we created above, we can create a Table containing
one or more copies of the batch using ``Table.from_batches``:

.. ipython:: python

    batches = [batch] * 5
    table = pa.Table.from_batches(batches)
    table
    table.num_rows

The table's columns are instances of :class:`~.ChunkedArray`, which is a
container for one or more arrays of the same type.

.. ipython:: python

    c = table[0]
    c
    c.num_chunks
    c.chunk(0)

As you'll see in the :ref:`pandas section <pandas_interop>`, we can convert
these objects to contiguous NumPy arrays for use in pandas:

.. ipython:: python

    c.to_pandas()

Multiple tables can also be concatenated together to form a single table using
``pyarrow.concat_tables``, if the schemas are equal:

.. ipython:: python

    tables = [table] * 2
    table_all = pa.concat_tables(tables)
    table_all.num_rows
    c = table_all[0]
    c.num_chunks

This is similar to ``Table.from_batches``, but uses tables as input instead of
record batches. Record batches can be made into tables, but not the other way
around, so if your data is already in table form, then use
``pyarrow.concat_tables``.

Custom Schema and Field Metadata
--------------------------------

TODO