diff --git a/src/arrow/docs/source/python/data.rst b/src/arrow/docs/source/python/data.rst
new file mode 100644
index 000000000..b8a90039f
--- /dev/null
+++ b/src/arrow/docs/source/python/data.rst
@@ -0,0 +1,434 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements. See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership. The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License. You may obtain a copy of the License at
+
+.. http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied. See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. currentmodule:: pyarrow
+.. _data:
+
+Data Types and In-Memory Data Model
+===================================
+
+Apache Arrow defines columnar array data structures by composing type metadata
+with memory buffers, like the ones explained in the documentation on
+:ref:`Memory and IO <io>`. These data structures are exposed in Python through
+a series of interrelated classes:
+
+* **Type Metadata**: Instances of ``pyarrow.DataType``, which describe a logical
+ array type
+* **Schemas**: Instances of ``pyarrow.Schema``, which describe a named
+ collection of types. These can be thought of as the column types in a
+ table-like object.
+* **Arrays**: Instances of ``pyarrow.Array``, which are atomic, contiguous
+ columnar data structures composed from Arrow Buffer objects
+* **Record Batches**: Instances of ``pyarrow.RecordBatch``, which are a
+ collection of Array objects with a particular Schema
+* **Tables**: Instances of ``pyarrow.Table``, a logical table data structure in
+ which each column consists of one or more ``pyarrow.Array`` objects of the
+ same type.
+
+We will examine these in the sections below in a series of examples.
+
+.. _data.types:
+
+Type Metadata
+-------------
+
+Apache Arrow defines language-agnostic column-oriented data structures for
+array data. These include:
+
+* **Fixed-length primitive types**: numbers, booleans, dates and times, fixed
+  size binary, decimals, and other values that fit into a fixed number of bits
+* **Variable-length primitive types**: binary, string
+* **Nested types**: list, struct, and union
+* **Dictionary type**: An encoded categorical type (more on this later)
+
+Each logical data type in Arrow has a corresponding factory function for
+creating an instance of that type object in Python:
+
+.. ipython:: python
+
+ import pyarrow as pa
+ t1 = pa.int32()
+ t2 = pa.string()
+ t3 = pa.binary()
+ t4 = pa.binary(10)
+ t5 = pa.timestamp('ms')
+
+ t1
+ print(t1)
+ print(t4)
+ print(t5)
+
+We use the name **logical type** because the **physical** storage may be the
+same for one or more types. For example, ``int64``, ``float64``, and
+``timestamp[ms]`` all occupy 64 bits per value.
+
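+As an illustration, the ``bit_width`` attribute of fixed-width types lets you
+verify this:
+
+.. ipython:: python
+
+    pa.int64().bit_width
+    pa.float64().bit_width
+    pa.timestamp('ms').bit_width
+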
+These objects are `metadata`; they are used for describing the data in arrays,
+schemas, and record batches. In Python, they can be used in functions where the
+input data (e.g. Python objects) may be coerced to more than one Arrow type.
+
+The :class:`~pyarrow.Field` type is a type plus a name and optional
+user-defined metadata:
+
+.. ipython:: python
+
+ f0 = pa.field('int32_field', t1)
+ f0
+ f0.name
+ f0.type
+
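+A field can also carry nullability information and optional user-defined
+metadata; for example (the metadata keys and values are stored as raw bytes):
+
+.. ipython:: python
+
+    f1 = pa.field('int32_field', t1, nullable=False,
+                  metadata={'origin': 'example'})
+    f1.nullable
+    f1.metadata
+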
+Arrow supports **nested value types** like list, struct, and union. When
+creating one of these, you must pass types or fields to indicate the data
+types of its children. For example, we can define a list of int32 values
+with:
+
+.. ipython:: python
+
+ t6 = pa.list_(t1)
+ t6
+
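+The element type can be retrieved back from the list type:
+
+.. ipython:: python
+
+    t6.value_type
+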
+A `struct` is a collection of named fields:
+
+.. ipython:: python
+
+ fields = [
+ pa.field('s0', t1),
+ pa.field('s1', t2),
+ pa.field('s2', t4),
+ pa.field('s3', t6),
+ ]
+
+ t7 = pa.struct(fields)
+ print(t7)
+
+For convenience, you can pass ``(name, type)`` tuples directly instead of
+:class:`~pyarrow.Field` instances:
+
+.. ipython:: python
+
+ t8 = pa.struct([('s0', t1), ('s1', t2), ('s2', t4), ('s3', t6)])
+ print(t8)
+ t8 == t7
+
+
+See :ref:`Data Types API <api.types>` for a full listing of data type
+functions.
+
+.. _data.schema:
+
+Schemas
+-------
+
+The :class:`~pyarrow.Schema` type is similar to the ``struct`` array type; it
+defines the column names and types in a record batch or table data
+structure. The :func:`pyarrow.schema` factory function makes new Schema objects in
+Python:
+
+.. ipython:: python
+
+ my_schema = pa.schema([('field0', t1),
+ ('field1', t2),
+ ('field2', t4),
+ ('field3', t6)])
+ my_schema
+
+In some applications, you may not create schemas directly, instead only using
+the ones that are embedded in :ref:`IPC messages <ipc>`.
+
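+A schema can be inspected much like a field; for example, you can list the
+column names or fetch a particular field by name:
+
+.. ipython:: python
+
+    my_schema.names
+    my_schema.field('field1')
+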
+.. _data.array:
+
+Arrays
+------
+
+For each data type, there is an accompanying array data structure for holding
+memory buffers that define a single contiguous chunk of columnar array
+data. When you are using PyArrow, this data may come from IPC tools, though it
+can also be created from various types of Python sequences (lists, NumPy
+arrays, pandas data).
+
+A simple way to create arrays is with ``pyarrow.array``, which is similar to
+the ``numpy.array`` function. By default PyArrow will infer the data type
+for you:
+
+.. ipython:: python
+
+ arr = pa.array([1, 2, None, 3])
+ arr
+
+But you may also pass a specific data type to override type inference:
+
+.. ipython:: python
+
+ pa.array([1, 2], type=pa.uint16())
+
+The array's ``type`` attribute is the corresponding piece of type metadata:
+
+.. ipython:: python
+
+ arr.type
+
+Each in-memory array has a known length and null count (which will be 0 if
+there are no null values):
+
+.. ipython:: python
+
+ len(arr)
+ arr.null_count
+
+Scalar values can be selected with normal indexing. ``pyarrow.array`` converts
+``None`` values to Arrow nulls; we return the special ``pyarrow.NA`` value for
+nulls:
+
+.. ipython:: python
+
+ arr[0]
+ arr[2]
+
+Arrow data is immutable, so values can be selected but not assigned.
+
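+Attempting to assign to an element therefore raises an error (a minimal
+illustration; the exact exception message may vary across versions):
+
+.. ipython:: python
+    :okexcept:
+
+    arr[0] = 100
+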
+Arrays can be sliced without copying:
+
+.. ipython:: python
+
+ arr[1:3]
+
+None values and NaN handling
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+As mentioned in the above section, the Python object ``None`` is always
+converted to an Arrow null element when converting to ``pyarrow.Array``. The
+float NaN value, which can be represented as either the Python object
+``float('nan')`` or ``numpy.nan``, is normally converted to a *valid* float
+value during the conversion. If an integer input that contains ``np.nan`` is
+supplied to ``pyarrow.array``, a ``ValueError`` is raised.
+
+For better compatibility with pandas, we support interpreting NaN values as
+null elements. This is enabled automatically on all ``from_pandas`` functions
+and can be enabled on the other conversion functions by passing
+``from_pandas=True`` as a function parameter.
+
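+For example, with ``from_pandas=True`` a NaN input becomes a null element
+instead of a (valid) float value:
+
+.. ipython:: python
+
+    import numpy as np
+    pa.array([1., np.nan])
+    pa.array([1., np.nan], from_pandas=True)
+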
+List arrays
+~~~~~~~~~~~
+
+``pyarrow.array`` is able to infer the type of simple nested data structures
+like lists:
+
+.. ipython:: python
+
+ nested_arr = pa.array([[], None, [1, 2], [None, 1]])
+ print(nested_arr.type)
+
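+The child values of a list array can also be accessed as a single flat array;
+note that ``flatten`` skips over null lists:
+
+.. ipython:: python
+
+    nested_arr.flatten()
+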
+Struct arrays
+~~~~~~~~~~~~~
+
+For other kinds of nested arrays, such as struct arrays, you currently need
+to pass the type explicitly. Struct arrays can be initialized from a
+sequence of Python dicts or tuples:
+
+.. ipython:: python
+
+ ty = pa.struct([('x', pa.int8()),
+ ('y', pa.bool_())])
+ pa.array([{'x': 1, 'y': True}, {'x': 2, 'y': False}], type=ty)
+ pa.array([(3, True), (4, False)], type=ty)
+
+When initializing a struct array, nulls are allowed both at the struct
+level and at the individual field level. If initializing from a sequence
+of Python dicts, a missing dict key is handled as a null value:
+
+.. ipython:: python
+
+ pa.array([{'x': 1}, None, {'y': None}], type=ty)
+
+You can also construct a struct array from existing arrays for each of the
+struct's components. In this case, data storage will be shared with the
+individual arrays, and no copy is involved:
+
+.. ipython:: python
+
+ xs = pa.array([5, 6, 7], type=pa.int16())
+ ys = pa.array([False, True, True])
+ arr = pa.StructArray.from_arrays((xs, ys), names=('x', 'y'))
+ arr.type
+ arr
+
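+The individual child arrays can be retrieved back from the struct array with
+the ``field`` method:
+
+.. ipython:: python
+
+    arr.field('x')
+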
+Union arrays
+~~~~~~~~~~~~
+
+The union type represents a nested array type where each value can be one
+(and only one) of a set of possible types. There are two possible
+storage types for union arrays: sparse and dense.
+
+In a sparse union array, each of the child arrays has the same length
+as the resulting union array. They are accompanied by an ``int8`` "types"
+array that tells, for each value, from which child array it must be
+selected:
+
+.. ipython:: python
+
+ xs = pa.array([5, 6, 7])
+ ys = pa.array([False, False, True])
+ types = pa.array([0, 1, 1], type=pa.int8())
+ union_arr = pa.UnionArray.from_sparse(types, [xs, ys])
+ union_arr.type
+ union_arr
+
+In a dense union array, you also pass, in addition to the ``int8`` "types"
+array, an ``int32`` "offsets" array that tells, for each value, at
+which offset in the selected child array it can be found:
+
+.. ipython:: python
+
+ xs = pa.array([5, 6, 7])
+ ys = pa.array([False, True])
+ types = pa.array([0, 1, 1, 0, 0], type=pa.int8())
+ offsets = pa.array([0, 0, 1, 1, 2], type=pa.int32())
+ union_arr = pa.UnionArray.from_dense(types, offsets, [xs, ys])
+ union_arr.type
+ union_arr
+
+.. _data.dictionary:
+
+Dictionary Arrays
+~~~~~~~~~~~~~~~~~
+
+The **Dictionary** type in PyArrow is a special array type that is similar to a
+factor in R or a ``pandas.Categorical``. It enables one or more record batches
+in a file or stream to transmit integer *indices* referencing a shared
+**dictionary** containing the distinct values in the logical array. This is
+most often used with strings to save memory and improve performance.
+
+The way that dictionaries are handled in the Apache Arrow format and the way
+they appear in C++ and Python is slightly different. We define a special
+:class:`~.DictionaryArray` type with a corresponding dictionary type. Let's
+consider an example:
+
+.. ipython:: python
+
+ indices = pa.array([0, 1, 0, 1, 2, 0, None, 2])
+ dictionary = pa.array(['foo', 'bar', 'baz'])
+
+ dict_array = pa.DictionaryArray.from_arrays(indices, dictionary)
+ dict_array
+
+Here we have:
+
+.. ipython:: python
+
+ print(dict_array.type)
+ dict_array.indices
+ dict_array.dictionary
+
+When using :class:`~.DictionaryArray` with pandas, the analogue is
+``pandas.Categorical`` (more on this later):
+
+.. ipython:: python
+
+ dict_array.to_pandas()
+
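+Conversely, passing a ``pandas.Categorical`` to ``pyarrow.array`` yields a
+:class:`~.DictionaryArray` (assuming pandas is available):
+
+.. ipython:: python
+
+    import pandas as pd
+    pa.array(pd.Categorical(['foo', 'bar', 'foo']))
+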
+.. _data.record_batch:
+
+Record Batches
+--------------
+
+A **Record Batch** in Apache Arrow is a collection of equal-length array
+instances. Let's consider a collection of arrays:
+
+.. ipython:: python
+
+ data = [
+ pa.array([1, 2, 3, 4]),
+ pa.array(['foo', 'bar', 'baz', None]),
+ pa.array([True, None, False, True])
+ ]
+
+A record batch can be created from this list of arrays using
+``RecordBatch.from_arrays``:
+
+.. ipython:: python
+
+ batch = pa.RecordBatch.from_arrays(data, ['f0', 'f1', 'f2'])
+ batch.num_columns
+ batch.num_rows
+ batch.schema
+
+ batch[1]
+
+Like an array, a record batch can be sliced without copying memory:
+
+.. ipython:: python
+
+ batch2 = batch.slice(1, 3)
+ batch2[1]
+
+.. _data.table:
+
+Tables
+------
+
+The PyArrow :class:`~.Table` type is not part of the Apache Arrow
+specification, but is rather a tool to help with wrangling multiple record
+batches and array pieces as a single logical dataset. As a relevant example, we
+may receive multiple small record batches in a socket stream, then need to
+concatenate them into contiguous memory for use in NumPy or pandas. The Table
+object makes this efficient without requiring additional memory copying.
+
+Considering the record batch we created above, we can create a Table containing
+one or more copies of the batch using ``Table.from_batches``:
+
+.. ipython:: python
+
+ batches = [batch] * 5
+ table = pa.Table.from_batches(batches)
+ table
+ table.num_rows
+
+The table's columns are instances of :class:`~.ChunkedArray`, which is a
+container for one or more arrays of the same type.
+
+.. ipython:: python
+
+ c = table[0]
+ c
+ c.num_chunks
+ c.chunk(0)
+
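+A chunked array can also be processed one chunk at a time; each chunk is a
+regular :class:`~.Array`:
+
+.. ipython:: python
+
+    for chunk in c.iterchunks():
+        print(len(chunk))
+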
+As you'll see in the :ref:`pandas section <pandas_interop>`, we can convert
+these objects to contiguous NumPy arrays for use in pandas:
+
+.. ipython:: python
+
+ c.to_pandas()
+
+Multiple tables can also be concatenated together to form a single table using
+``pyarrow.concat_tables``, if the schemas are equal:
+
+.. ipython:: python
+
+ tables = [table] * 2
+ table_all = pa.concat_tables(tables)
+ table_all.num_rows
+ c = table_all[0]
+ c.num_chunks
+
+This is similar to ``Table.from_batches``, but uses tables as input instead of
+record batches. Record batches can be made into tables, but not the other way
+around, so if your data is already in table form, then use
+``pyarrow.concat_tables``.
+
+Custom Schema and Field Metadata
+--------------------------------
+
+TODO