summaryrefslogtreecommitdiffstats
path: root/src/arrow/docs/source/python/memory.rst
blob: 4febc668c6eb744ad206fb4a4c682e8a389e1f7f (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
.. Licensed to the Apache Software Foundation (ASF) under one
.. or more contributor license agreements.  See the NOTICE file
.. distributed with this work for additional information
.. regarding copyright ownership.  The ASF licenses this file
.. to you under the Apache License, Version 2.0 (the
.. "License"); you may not use this file except in compliance
.. with the License.  You may obtain a copy of the License at

..   http://www.apache.org/licenses/LICENSE-2.0

.. Unless required by applicable law or agreed to in writing,
.. software distributed under the License is distributed on an
.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
.. KIND, either express or implied.  See the License for the
.. specific language governing permissions and limitations
.. under the License.

.. currentmodule:: pyarrow
.. highlight:: python

.. _io:

========================
Memory and IO Interfaces
========================

This section will introduce you to the major concepts in PyArrow's memory
management and IO systems:

* Buffers
* Memory pools
* File-like and stream-like objects

Referencing and Allocating Memory
=================================

pyarrow.Buffer
--------------

The :class:`Buffer` object wraps the C++ :cpp:class:`arrow::Buffer` type
which is the primary tool for memory management in Apache Arrow in C++. It permits
higher-level array classes to safely interact with memory which they may or may
not own. ``arrow::Buffer`` can be zero-copy sliced to permit Buffers to cheaply
reference other Buffers, while preserving memory lifetime and clean
parent-child relationships.

There are many implementations of ``arrow::Buffer``, but they all provide a
standard interface: a data pointer and length. This is similar to Python's
built-in `buffer protocol` and ``memoryview`` objects.

A :class:`Buffer` can be created from any Python object implementing
the buffer protocol by calling the :func:`py_buffer` function. Let's consider
a bytes object:

.. ipython:: python

   import pyarrow as pa

   data = b'abcdefghijklmnopqrstuvwxyz'
   buf = pa.py_buffer(data)
   buf
   buf.size

Creating a Buffer in this way does not allocate any memory; it is a zero-copy
view on the memory exported from the ``data`` bytes object.

External memory, under the form of a raw pointer and size, can also be
referenced using the :func:`foreign_buffer` function.

Buffers can be used in circumstances where a Python buffer or memoryview is
required, and such conversions are zero-copy:

.. ipython:: python

   memoryview(buf)

The Buffer's :meth:`~Buffer.to_pybytes` method converts the Buffer's data to a
Python bytestring (thus making a copy of the data):

.. ipython:: python

   buf.to_pybytes()

Memory Pools
------------

All memory allocations and deallocations (like ``malloc`` and ``free`` in C)
are tracked in an instance of :class:`MemoryPool`. This means that we can
then precisely track amount of memory that has been allocated:

.. ipython:: python

   pa.total_allocated_bytes()

Let's allocate a resizable :class:`Buffer` from the default pool:

.. ipython:: python

   buf = pa.allocate_buffer(1024, resizable=True)
   pa.total_allocated_bytes()
   buf.resize(2048)
   pa.total_allocated_bytes()

The default allocator requests memory in a minimum increment of 64 bytes. If
the buffer is garbaged-collected, all of the memory is freed:

.. ipython:: python

   buf = None
   pa.total_allocated_bytes()

Besides the default built-in memory pool, there may be additional memory pools
to choose (such as `mimalloc <https://github.com/microsoft/mimalloc>`_)
from depending on how Arrow was built.  One can get the backend
name for a memory pool::

   >>> pa.default_memory_pool().backend_name
   'jemalloc'

.. seealso::
   :ref:`API documentation for memory pools <api.memory_pool>`.

.. seealso::
   On-GPU buffers using Arrow's optional :doc:`CUDA integration <cuda>`.


Input and Output
================

.. _io.native_file:

The Arrow C++ libraries have several abstract interfaces for different kinds of
IO objects:

* Read-only streams
* Read-only files supporting random access
* Write-only streams
* Write-only files supporting random access
* File supporting reads, writes, and random access

In the interest of making these objects behave more like Python's built-in
``file`` objects, we have defined a :class:`~pyarrow.NativeFile` base class
which implements the same API as regular Python file objects.

:class:`~pyarrow.NativeFile` has some important features which make it
preferable to using Python files with PyArrow where possible:

* Other Arrow classes can access the internal C++ IO objects natively, and do
  not need to acquire the Python GIL
* Native C++ IO may be able to do zero-copy IO, such as with memory maps

There are several kinds of :class:`~pyarrow.NativeFile` options available:

* :class:`~pyarrow.OSFile`, a native file that uses your operating system's
  file descriptors
* :class:`~pyarrow.MemoryMappedFile`, for reading (zero-copy) and writing with
  memory maps
* :class:`~pyarrow.BufferReader`, for reading :class:`~pyarrow.Buffer` objects
  as a file
* :class:`~pyarrow.BufferOutputStream`, for writing data in-memory, producing a
  Buffer at the end
* :class:`~pyarrow.FixedSizeBufferWriter`, for writing data into an already
  allocated Buffer
* :class:`~pyarrow.HdfsFile`, for reading and writing data to the Hadoop Filesystem
* :class:`~pyarrow.PythonFile`, for interfacing with Python file objects in C++
* :class:`~pyarrow.CompressedInputStream` and
  :class:`~pyarrow.CompressedOutputStream`, for on-the-fly compression or
  decompression to/from another stream

There are also high-level APIs to make instantiating common kinds of streams
easier.

High-Level API
--------------

Input Streams
~~~~~~~~~~~~~

The :func:`~pyarrow.input_stream` function allows creating a readable
:class:`~pyarrow.NativeFile` from various kinds of sources.

* If passed a :class:`~pyarrow.Buffer` or a ``memoryview`` object, a
  :class:`~pyarrow.BufferReader` will be returned:

   .. ipython:: python

      buf = memoryview(b"some data")
      stream = pa.input_stream(buf)
      stream.read(4)

* If passed a string or file path, it will open the given file on disk
  for reading, creating a :class:`~pyarrow.OSFile`.  Optionally, the file
  can be compressed: if its filename ends with a recognized extension
  such as ``.gz``, its contents will automatically be decompressed on
  reading.

  .. ipython:: python

     import gzip
     with gzip.open('example.gz', 'wb') as f:
         f.write(b'some data\n' * 3)

     stream = pa.input_stream('example.gz')
     stream.read()

* If passed a Python file object, it will wrapped in a :class:`PythonFile`
  such that the Arrow C++ libraries can read data from it (at the expense
  of a slight overhead).

Output Streams
~~~~~~~~~~~~~~

:func:`~pyarrow.output_stream` is the equivalent function for output streams
and allows creating a writable :class:`~pyarrow.NativeFile`.  It has the same
features as explained above for :func:`~pyarrow.input_stream`, such as being
able to write to buffers or do on-the-fly compression.

.. ipython:: python

   with pa.output_stream('example1.dat') as stream:
       stream.write(b'some data')

   f = open('example1.dat', 'rb')
   f.read()


On-Disk and Memory Mapped Files
-------------------------------

PyArrow includes two ways to interact with data on disk: standard operating
system-level file APIs, and memory-mapped files. In regular Python we can
write:

.. ipython:: python

   with open('example2.dat', 'wb') as f:
       f.write(b'some example data')

Using pyarrow's :class:`~pyarrow.OSFile` class, you can write:

.. ipython:: python

   with pa.OSFile('example3.dat', 'wb') as f:
       f.write(b'some example data')

For reading files, you can use :class:`~pyarrow.OSFile` or
:class:`~pyarrow.MemoryMappedFile`. The difference between these is that
:class:`~pyarrow.OSFile` allocates new memory on each read, like Python file
objects. In reads from memory maps, the library constructs a buffer referencing
the mapped memory without any memory allocation or copying:

.. ipython:: python

   file_obj = pa.OSFile('example2.dat')
   mmap = pa.memory_map('example3.dat')
   file_obj.read(4)
   mmap.read(4)

The ``read`` method implements the standard Python file ``read`` API. To read
into Arrow Buffer objects, use ``read_buffer``:

.. ipython:: python

   mmap.seek(0)
   buf = mmap.read_buffer(4)
   print(buf)
   buf.to_pybytes()

Many tools in PyArrow, particular the Apache Parquet interface and the file and
stream messaging tools, are more efficient when used with these ``NativeFile``
types than with normal Python file objects.

.. ipython:: python
   :suppress:

   buf = mmap = file_obj = None
   !rm example.dat
   !rm example2.dat

In-Memory Reading and Writing
-----------------------------

To assist with serialization and deserialization of in-memory data, we have
file interfaces that can read and write to Arrow Buffers.

.. ipython:: python

   writer = pa.BufferOutputStream()
   writer.write(b'hello, friends')

   buf = writer.getvalue()
   buf
   buf.size
   reader = pa.BufferReader(buf)
   reader.seek(7)
   reader.read(7)

These have similar semantics to Python's built-in ``io.BytesIO``.