diff --git a/src/arrow/docs/source/python/memory.rst b/src/arrow/docs/source/python/memory.rst
new file mode 100644
index 000000000..4febc668c
--- /dev/null
+++ b/src/arrow/docs/source/python/memory.rst
@@ -0,0 +1,298 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements. See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership. The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License. You may obtain a copy of the License at
+
+.. http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied. See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. currentmodule:: pyarrow
+.. highlight:: python
+
+.. _io:
+
+========================
+Memory and IO Interfaces
+========================
+
+This section will introduce you to the major concepts in PyArrow's memory
+management and IO systems:
+
+* Buffers
+* Memory pools
+* File-like and stream-like objects
+
+Referencing and Allocating Memory
+=================================
+
+pyarrow.Buffer
+--------------
+
+The :class:`Buffer` object wraps the C++ :cpp:class:`arrow::Buffer` type,
+which is the primary tool for memory management in the Arrow C++ library. It
+permits higher-level array classes to safely interact with memory which they
+may or may not own. ``arrow::Buffer`` can be zero-copy sliced to permit Buffers
+to cheaply reference other Buffers, while preserving memory lifetime and clean
+parent-child relationships.
+
+There are many implementations of ``arrow::Buffer``, but they all provide a
+standard interface: a data pointer and length. This is similar to Python's
+built-in buffer protocol and ``memoryview`` objects.
+
+A :class:`Buffer` can be created from any Python object implementing
+the buffer protocol by calling the :func:`py_buffer` function. Let's consider
+a bytes object:
+
+.. ipython:: python
+
+ import pyarrow as pa
+
+ data = b'abcdefghijklmnopqrstuvwxyz'
+ buf = pa.py_buffer(data)
+ buf
+ buf.size
+
+Creating a Buffer in this way does not allocate any memory; it is a zero-copy
+view on the memory exported from the ``data`` bytes object.
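+
+The zero-copy slicing described earlier can be sketched as follows. This is a
+minimal illustration, assuming the :meth:`Buffer.slice` method with ``offset``
+and ``length`` arguments; the variable names are chosen only for the example:
+
+.. code-block:: python
+
+   import pyarrow as pa
+
+   data = b'abcdefghijklmnopqrstuvwxyz'
+   buf = pa.py_buffer(data)
+
+   # Take a 3-byte view starting at offset 2; no bytes are copied here.
+   sliced = buf.slice(2, 3)
+   sliced_size = sliced.size
+
+   # to_pybytes() copies the viewed bytes into a new Python bytes object.
+   sliced_bytes = sliced.to_pybytes()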
+
+External memory, in the form of a raw pointer and size, can also be
+referenced using the :func:`foreign_buffer` function.
+
+Buffers can be used in circumstances where a Python buffer or memoryview is
+required, and such conversions are zero-copy:
+
+.. ipython:: python
+
+ memoryview(buf)
+
+The Buffer's :meth:`~Buffer.to_pybytes` method converts the Buffer's data to a
+Python bytestring (thus making a copy of the data):
+
+.. ipython:: python
+
+ buf.to_pybytes()
+
+Memory Pools
+------------
+
+All memory allocations and deallocations (like ``malloc`` and ``free`` in C)
+are tracked in an instance of :class:`MemoryPool`. This means that we can
+precisely track the amount of memory that has been allocated:
+
+.. ipython:: python
+
+ pa.total_allocated_bytes()
+
+Let's allocate a resizable :class:`Buffer` from the default pool:
+
+.. ipython:: python
+
+ buf = pa.allocate_buffer(1024, resizable=True)
+ pa.total_allocated_bytes()
+ buf.resize(2048)
+ pa.total_allocated_bytes()
+
+The default allocator requests memory in a minimum increment of 64 bytes. If
+the buffer is garbage-collected, all of the memory is freed:
+
+.. ipython:: python
+
+ buf = None
+ pa.total_allocated_bytes()
+
+Besides the default built-in memory pool, there may be additional memory pools
+to choose from (such as `mimalloc <https://github.com/microsoft/mimalloc>`_),
+depending on how Arrow was built. One can get the backend
+name for a memory pool::
+
+ >>> pa.default_memory_pool().backend_name
+ 'jemalloc'
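+
+An allocation can also be directed to a specific pool. The following is a
+minimal sketch, assuming the ``memory_pool`` argument of
+:func:`allocate_buffer` and the :meth:`MemoryPool.bytes_allocated` method;
+``pa.system_memory_pool()`` is used here on the assumption that it is
+available regardless of how Arrow was built:
+
+.. code-block:: python
+
+   import pyarrow as pa
+
+   pool = pa.system_memory_pool()
+   before = pool.bytes_allocated()
+
+   # Request 1024 bytes from this pool rather than from the default one.
+   buf = pa.allocate_buffer(1024, memory_pool=pool)
+   allocated = pool.bytes_allocated() - before
+
+   # Dropping the last reference returns the memory to the pool.
+   buf = None
+   released = pool.bytes_allocated() - before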
+
+.. seealso::
+ :ref:`API documentation for memory pools <api.memory_pool>`.
+
+.. seealso::
+ On-GPU buffers using Arrow's optional :doc:`CUDA integration <cuda>`.
+
+
+Input and Output
+================
+
+.. _io.native_file:
+
+The Arrow C++ libraries have several abstract interfaces for different kinds of
+IO objects:
+
+* Read-only streams
+* Read-only files supporting random access
+* Write-only streams
+* Write-only files supporting random access
+* Files supporting reads, writes, and random access
+
+In the interest of making these objects behave more like Python's built-in
+``file`` objects, we have defined a :class:`~pyarrow.NativeFile` base class
+which implements the same API as regular Python file objects.
+
+:class:`~pyarrow.NativeFile` has some important features which make it
+preferable to using Python files with PyArrow where possible:
+
+* Other Arrow classes can access the internal C++ IO objects natively, and do
+  not need to acquire the Python GIL
+* Native C++ IO may be able to do zero-copy IO, such as with memory maps
+
+There are several kinds of :class:`~pyarrow.NativeFile` options available:
+
+* :class:`~pyarrow.OSFile`, a native file that uses your operating system's
+  file descriptors
+* :class:`~pyarrow.MemoryMappedFile`, for reading (zero-copy) and writing with
+  memory maps
+* :class:`~pyarrow.BufferReader`, for reading :class:`~pyarrow.Buffer` objects
+  as a file
+* :class:`~pyarrow.BufferOutputStream`, for writing data in-memory, producing a
+  Buffer at the end
+* :class:`~pyarrow.FixedSizeBufferWriter`, for writing data into an already
+  allocated Buffer
+* :class:`~pyarrow.HdfsFile`, for reading and writing data to the Hadoop Filesystem
+* :class:`~pyarrow.PythonFile`, for interfacing with Python file objects in C++
+* :class:`~pyarrow.CompressedInputStream` and
+  :class:`~pyarrow.CompressedOutputStream`, for on-the-fly compression or
+  decompression to/from another stream
+
+There are also high-level APIs to make instantiating common kinds of streams
+easier.
+
+High-Level API
+--------------
+
+Input Streams
+~~~~~~~~~~~~~
+
+The :func:`~pyarrow.input_stream` function allows creating a readable
+:class:`~pyarrow.NativeFile` from various kinds of sources.
+
+* If passed a :class:`~pyarrow.Buffer` or a ``memoryview`` object, a
+  :class:`~pyarrow.BufferReader` will be returned:
+
+  .. ipython:: python
+
+     buf = memoryview(b"some data")
+     stream = pa.input_stream(buf)
+     stream.read(4)
+
+* If passed a string or file path, it will open the given file on disk
+  for reading, creating a :class:`~pyarrow.OSFile`. Optionally, the file
+  can be compressed: if its filename ends with a recognized extension
+  such as ``.gz``, its contents will automatically be decompressed on
+  reading.
+
+  .. ipython:: python
+
+     import gzip
+     with gzip.open('example.gz', 'wb') as f:
+         f.write(b'some data\n' * 3)
+
+     stream = pa.input_stream('example.gz')
+     stream.read()
+
+* If passed a Python file object, it will be wrapped in a :class:`PythonFile`
+  such that the Arrow C++ libraries can read data from it (at the expense
+  of a slight overhead).
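+
+The last case above can be sketched with an in-memory Python file object.
+This is a minimal illustration; ``io.BytesIO`` stands in for any Python file
+object, and the variable names are chosen only for the example:
+
+.. code-block:: python
+
+   import io
+
+   import pyarrow as pa
+
+   py_file = io.BytesIO(b"some data")
+
+   # input_stream() wraps the Python file object so Arrow can read from it.
+   stream = pa.input_stream(py_file)
+   first = stream.read(4)
+   rest = stream.read(5)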
+
+Output Streams
+~~~~~~~~~~~~~~
+
+:func:`~pyarrow.output_stream` is the equivalent function for output streams
+and allows creating a writable :class:`~pyarrow.NativeFile`. It has the same
+features as explained above for :func:`~pyarrow.input_stream`, such as being
+able to write to buffers or do on-the-fly compression.
+
+.. ipython:: python
+
+   with pa.output_stream('example1.dat') as stream:
+       stream.write(b'some data')
+
+   f = open('example1.dat', 'rb')
+   f.read()
+   f.close()
+
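+On-the-fly compression on output can be sketched as follows. This is a minimal
+illustration, assuming that the extension-based detection described for
+:func:`~pyarrow.input_stream` also applies to a ``.gz`` output path; the
+temporary directory and the file name ``example_out.gz`` are hypothetical, and
+the standard ``gzip`` module is used only to check the result:
+
+.. code-block:: python
+
+   import gzip
+   import os
+   import tempfile
+
+   import pyarrow as pa
+
+   with tempfile.TemporaryDirectory() as tmp:
+       path = os.path.join(tmp, 'example_out.gz')
+
+       # The '.gz' suffix is expected to select gzip compression.
+       with pa.output_stream(path) as stream:
+           stream.write(b'some data')
+
+       # Read the file back with the standard library to decompress it.
+       with gzip.open(path, 'rb') as f:
+           roundtrip = f.read()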
+
+On-Disk and Memory Mapped Files
+-------------------------------
+
+PyArrow includes two ways to interact with data on disk: standard operating
+system-level file APIs, and memory-mapped files. In regular Python we can
+write:
+
+.. ipython:: python
+
+   with open('example2.dat', 'wb') as f:
+       f.write(b'some example data')
+
+Using pyarrow's :class:`~pyarrow.OSFile` class, you can write:
+
+.. ipython:: python
+
+   with pa.OSFile('example3.dat', 'wb') as f:
+       f.write(b'some example data')
+
+For reading files, you can use :class:`~pyarrow.OSFile` or
+:class:`~pyarrow.MemoryMappedFile`. The difference between these is that
+:class:`~pyarrow.OSFile` allocates new memory on each read, like Python file
+objects. In reads from memory maps, the library constructs a buffer referencing
+the mapped memory without any memory allocation or copying:
+
+.. ipython:: python
+
+ file_obj = pa.OSFile('example2.dat')
+ mmap = pa.memory_map('example3.dat')
+ file_obj.read(4)
+ mmap.read(4)
+
+The ``read`` method implements the standard Python file ``read`` API. To read
+into Arrow Buffer objects, use ``read_buffer``:
+
+.. ipython:: python
+
+ mmap.seek(0)
+ buf = mmap.read_buffer(4)
+ print(buf)
+ buf.to_pybytes()
+
+Many tools in PyArrow, particularly the Apache Parquet interface and the file and
+stream messaging tools, are more efficient when used with these ``NativeFile``
+types than with normal Python file objects.
+
+.. ipython:: python
+ :suppress:
+
+ buf = mmap = file_obj = None
+ !rm example.gz
+ !rm example1.dat
+ !rm example2.dat
+ !rm example3.dat
+
+In-Memory Reading and Writing
+-----------------------------
+
+To assist with serialization and deserialization of in-memory data, we have
+file interfaces that can read and write to Arrow Buffers.
+
+.. ipython:: python
+
+ writer = pa.BufferOutputStream()
+ writer.write(b'hello, friends')
+
+ buf = writer.getvalue()
+ buf
+ buf.size
+ reader = pa.BufferReader(buf)
+ reader.seek(7)
+ reader.read(7)
+
+These have similar semantics to Python's built-in ``io.BytesIO``.
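+
+Data can also be written into a buffer that was allocated beforehand, using
+the :class:`~pyarrow.FixedSizeBufferWriter` listed earlier. The following is a
+minimal sketch, assuming the writer accepts a mutable buffer obtained from
+:func:`allocate_buffer`:
+
+.. code-block:: python
+
+   import pyarrow as pa
+
+   # Pre-allocate exactly as many bytes as will be written.
+   buf = pa.allocate_buffer(14)
+
+   with pa.FixedSizeBufferWriter(buf) as writer:
+       writer.write(b'hello, friends')
+
+   # Copy the buffer contents out as a Python bytes object.
+   contents = buf.to_pybytes()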