Diffstat (limited to 'src/arrow/docs/source/python/memory.rst')
-rw-r--r-- | src/arrow/docs/source/python/memory.rst | 298 |
1 files changed, 298 insertions, 0 deletions
diff --git a/src/arrow/docs/source/python/memory.rst b/src/arrow/docs/source/python/memory.rst
new file mode 100644
index 000000000..4febc668c
--- /dev/null
+++ b/src/arrow/docs/source/python/memory.rst
@@ -0,0 +1,298 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. currentmodule:: pyarrow
+.. highlight:: python
+
+.. _io:
+
+========================
+Memory and IO Interfaces
+========================
+
+This section will introduce you to the major concepts in PyArrow's memory
+management and IO systems:
+
+* Buffers
+* Memory pools
+* File-like and stream-like objects
+
+Referencing and Allocating Memory
+=================================
+
+pyarrow.Buffer
+--------------
+
+The :class:`Buffer` object wraps the C++ :cpp:class:`arrow::Buffer` type,
+which is the primary tool for memory management in Apache Arrow in C++. It permits
+higher-level array classes to safely interact with memory which they may or may
+not own. ``arrow::Buffer`` can be zero-copy sliced to permit Buffers to cheaply
+reference other Buffers, while preserving memory lifetime and clean
+parent-child relationships.
+
+There are many implementations of ``arrow::Buffer``, but they all provide a
+standard interface: a data pointer and length. This is similar to Python's
+built-in `buffer protocol` and ``memoryview`` objects.
+
+A :class:`Buffer` can be created from any Python object implementing
+the buffer protocol by calling the :func:`py_buffer` function. Let's consider
+a bytes object:
+
+.. ipython:: python
+
+   import pyarrow as pa
+
+   data = b'abcdefghijklmnopqrstuvwxyz'
+   buf = pa.py_buffer(data)
+   buf
+   buf.size
+
+Creating a Buffer in this way does not allocate any memory; it is a zero-copy
+view on the memory exported from the ``data`` bytes object.
+
+External memory, in the form of a raw pointer and size, can also be
+referenced using the :func:`foreign_buffer` function.
+
+Buffers can be used in circumstances where a Python buffer or memoryview is
+required, and such conversions are zero-copy:
+
+.. ipython:: python
+
+   memoryview(buf)
+
+The Buffer's :meth:`~Buffer.to_pybytes` method converts the Buffer's data to a
+Python bytestring (thus making a copy of the data):
+
+.. ipython:: python
+
+   buf.to_pybytes()
+
+Memory Pools
+------------
+
+All memory allocations and deallocations (like ``malloc`` and ``free`` in C)
+are tracked in an instance of :class:`MemoryPool`. This means that we can
+then precisely track the amount of memory that has been allocated:
+
+.. ipython:: python
+
+   pa.total_allocated_bytes()
+
+Let's allocate a resizable :class:`Buffer` from the default pool:
+
+.. ipython:: python
+
+   buf = pa.allocate_buffer(1024, resizable=True)
+   pa.total_allocated_bytes()
+   buf.resize(2048)
+   pa.total_allocated_bytes()
+
+The default allocator requests memory in a minimum increment of 64 bytes. If
+the buffer is garbage-collected, all of the memory is freed:
+
+.. ipython:: python
+
+   buf = None
+   pa.total_allocated_bytes()
+
+Besides the default built-in memory pool, there may be additional memory pools
+to choose from (such as `mimalloc <https://github.com/microsoft/mimalloc>`_),
+depending on how Arrow was built. One can get the backend
+name for a memory pool::
+
+   >>> pa.default_memory_pool().backend_name
+   'jemalloc'
+
+.. seealso::
+   :ref:`API documentation for memory pools <api.memory_pool>`.
+
+.. seealso::
+   On-GPU buffers using Arrow's optional :doc:`CUDA integration <cuda>`.
+
+
+Input and Output
+================
+
+.. _io.native_file:
+
+The Arrow C++ libraries have several abstract interfaces for different kinds of
+IO objects:
+
+* Read-only streams
+* Read-only files supporting random access
+* Write-only streams
+* Write-only files supporting random access
+* Files supporting reads, writes, and random access
+
+In the interest of making these objects behave more like Python's built-in
+``file`` objects, we have defined a :class:`~pyarrow.NativeFile` base class
+which implements the same API as regular Python file objects.
+
+:class:`~pyarrow.NativeFile` has some important features which make it
+preferable to using Python files with PyArrow where possible:
+
+* Other Arrow classes can access the internal C++ IO objects natively, and do
+  not need to acquire the Python GIL
+* Native C++ IO may be able to do zero-copy IO, such as with memory maps
+
+There are several kinds of :class:`~pyarrow.NativeFile` options available:
+
+* :class:`~pyarrow.OSFile`, a native file that uses your operating system's
+  file descriptors
+* :class:`~pyarrow.MemoryMappedFile`, for reading (zero-copy) and writing with
+  memory maps
+* :class:`~pyarrow.BufferReader`, for reading :class:`~pyarrow.Buffer` objects
+  as a file
+* :class:`~pyarrow.BufferOutputStream`, for writing data in-memory, producing a
+  Buffer at the end
+* :class:`~pyarrow.FixedSizeBufferWriter`, for writing data into an already
+  allocated Buffer
+* :class:`~pyarrow.HdfsFile`, for reading and writing data to the Hadoop Filesystem
+* :class:`~pyarrow.PythonFile`, for interfacing with Python file objects in C++
+* :class:`~pyarrow.CompressedInputStream` and
+  :class:`~pyarrow.CompressedOutputStream`, for on-the-fly compression or
+  decompression to/from another stream
+
+There are also high-level APIs to make instantiating common kinds of streams
+easier.
+
+High-Level API
+--------------
+
+Input Streams
+~~~~~~~~~~~~~
+
+The :func:`~pyarrow.input_stream` function allows creating a readable
+:class:`~pyarrow.NativeFile` from various kinds of sources.
+
+* If passed a :class:`~pyarrow.Buffer` or a ``memoryview`` object, a
+  :class:`~pyarrow.BufferReader` will be returned:
+
+  .. ipython:: python
+
+     buf = memoryview(b"some data")
+     stream = pa.input_stream(buf)
+     stream.read(4)
+
+* If passed a string or file path, it will open the given file on disk
+  for reading, creating a :class:`~pyarrow.OSFile`.  Optionally, the file
+  can be compressed: if its filename ends with a recognized extension
+  such as ``.gz``, its contents will automatically be decompressed on
+  reading.
+
+  .. ipython:: python
+
+     import gzip
+     with gzip.open('example.gz', 'wb') as f:
+         f.write(b'some data\n' * 3)
+
+     stream = pa.input_stream('example.gz')
+     stream.read()
+
+* If passed a Python file object, it will be wrapped in a :class:`PythonFile`
+  such that the Arrow C++ libraries can read data from it (at the expense
+  of a slight overhead).
+
+Output Streams
+~~~~~~~~~~~~~~
+
+:func:`~pyarrow.output_stream` is the equivalent function for output streams
+and allows creating a writable :class:`~pyarrow.NativeFile`.  It has the same
+features as explained above for :func:`~pyarrow.input_stream`, such as being
+able to write to buffers or do on-the-fly compression.
+
+.. ipython:: python
+
+   with pa.output_stream('example1.dat') as stream:
+       stream.write(b'some data')
+
+   f = open('example1.dat', 'rb')
+   f.read()
+
+
+On-Disk and Memory Mapped Files
+-------------------------------
+
+PyArrow includes two ways to interact with data on disk: standard operating
+system-level file APIs, and memory-mapped files. In regular Python we can
+write:
+
+.. ipython:: python
+
+   with open('example2.dat', 'wb') as f:
+       f.write(b'some example data')
+
+Using pyarrow's :class:`~pyarrow.OSFile` class, you can write:
+
+.. ipython:: python
+
+   with pa.OSFile('example3.dat', 'wb') as f:
+       f.write(b'some example data')
+
+For reading files, you can use :class:`~pyarrow.OSFile` or
+:class:`~pyarrow.MemoryMappedFile`. The difference between these is that
+:class:`~pyarrow.OSFile` allocates new memory on each read, like Python file
+objects. In reads from memory maps, the library constructs a buffer referencing
+the mapped memory without any memory allocation or copying:
+
+.. ipython:: python
+
+   file_obj = pa.OSFile('example2.dat')
+   mmap = pa.memory_map('example3.dat')
+   file_obj.read(4)
+   mmap.read(4)
+
+The ``read`` method implements the standard Python file ``read`` API. To read
+into Arrow Buffer objects, use ``read_buffer``:
+
+.. ipython:: python
+
+   mmap.seek(0)
+   buf = mmap.read_buffer(4)
+   print(buf)
+   buf.to_pybytes()
+
+Many tools in PyArrow, particularly the Apache Parquet interface and the file and
+stream messaging tools, are more efficient when used with these ``NativeFile``
+types than with normal Python file objects.
+
+.. ipython:: python
+   :suppress:
+
+   buf = mmap = file_obj = None
+   !rm example3.dat
+   !rm example2.dat
+
+In-Memory Reading and Writing
+-----------------------------
+
+To assist with serialization and deserialization of in-memory data, we have
+file interfaces that can read and write to Arrow Buffers.
+
+.. ipython:: python
+
+   writer = pa.BufferOutputStream()
+   writer.write(b'hello, friends')
+
+   buf = writer.getvalue()
+   buf
+   buf.size
+   reader = pa.BufferReader(buf)
+   reader.seek(7)
+   reader.read(7)
+
+These have similar semantics to Python's built-in ``io.BytesIO``.
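
For comparison, the write-then-read sequence from the in-memory example can be reproduced with the standard library's ``io.BytesIO``. This is a minimal sketch of the shared file-like semantics, not part of the patch; the variable names ``data``, ``size`` and ``tail`` are illustrative. One difference to keep in mind: ``BytesIO.getvalue()`` returns a ``bytes`` object, whereas ``BufferOutputStream.getvalue()`` returns an Arrow ``Buffer``.

```python
import io

# Standard-library counterpart of the BufferOutputStream / BufferReader example.
writer = io.BytesIO()
writer.write(b'hello, friends')

# getvalue() returns everything written so far as a bytes object.
data = writer.getvalue()
size = len(data)

# Re-open the bytes for reading, skip the first 7 bytes, then read 7 more.
reader = io.BytesIO(data)
reader.seek(7)
tail = reader.read(7)

print(size, tail)
```

Both ``io.BytesIO`` and the Arrow classes expose ``write``, ``seek`` and ``read`` with the same meaning, so code written against the file-like interface can often accept either.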