authorDaniel Baumann <daniel.baumann@progress-linux.org>2024-04-21 11:54:28 +0000
committerDaniel Baumann <daniel.baumann@progress-linux.org>2024-04-21 11:54:28 +0000
commite6918187568dbd01842d8d1d2c808ce16a894239 (patch)
tree64f88b554b444a49f656b6c656111a145cbbaa28 /src/arrow/docs/source/python/filesystems.rst
parentInitial commit. (diff)
Adding upstream version 18.2.2.upstream/18.2.2
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'src/arrow/docs/source/python/filesystems.rst')
-rw-r--r-- src/arrow/docs/source/python/filesystems.rst | 305
1 file changed, 305 insertions(+), 0 deletions(-)
diff --git a/src/arrow/docs/source/python/filesystems.rst b/src/arrow/docs/source/python/filesystems.rst
new file mode 100644
index 000000000..1ddb4dfa2
--- /dev/null
+++ b/src/arrow/docs/source/python/filesystems.rst
@@ -0,0 +1,305 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements. See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership. The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License. You may obtain a copy of the License at
+
+.. http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied. See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. _filesystem:
+
+.. currentmodule:: pyarrow.fs
+
+Filesystem Interface
+====================
+
+PyArrow comes with an abstract filesystem interface, as well as concrete
+implementations for various storage types.
+
+The filesystem interface provides input and output streams as well as
+directory operations. It exposes a simplified view of the underlying data
+storage: data paths are represented as *abstract paths*, which are
+``/``-separated, even on Windows, and should not include special path
+components such as ``.`` and ``..``. Symbolic links, if supported by the
+underlying storage, are automatically dereferenced. Only basic
+:class:`metadata <FileInfo>` about file entries, such as the file size
+and modification time, is made available.
+
+The core interface is represented by the base class :class:`FileSystem`.
+
+PyArrow natively implements the following filesystem subclasses:
+
+* :ref:`filesystem-localfs` (:class:`LocalFileSystem`)
+* :ref:`filesystem-s3` (:class:`S3FileSystem`)
+* :ref:`filesystem-hdfs` (:class:`HadoopFileSystem`)
+
+It is also possible to use your own fsspec-compliant filesystem with PyArrow
+functionality, as described in the section :ref:`filesystem-fsspec`.
+
+
+.. _filesystem-usage:
+
+Usage
+-----
+
+Instantiating a filesystem
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+A FileSystem object can be created with one of the constructors (check the
+respective constructor for its available options)::
+
+ >>> from pyarrow import fs
+ >>> local = fs.LocalFileSystem()
+
+or alternatively inferred from a URI::
+
+ >>> s3, path = fs.FileSystem.from_uri("s3://my-bucket")
+ >>> s3
+ <pyarrow._s3fs.S3FileSystem at 0x7f6760cbf4f0>
+ >>> path
+ 'my-bucket'
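+
+URIs for other schemes work the same way; for example, a ``file://`` URI
+yields a :class:`LocalFileSystem` (a minimal sketch; the path is a
+placeholder)::
+
+    >>> local, path = fs.FileSystem.from_uri("file:///tmp/data")
+    >>> path
+    '/tmp/data'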
+
+
+Reading and writing files
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Several of the IO-related functions in PyArrow accept either a URI (and infer
+the filesystem) or an explicit ``filesystem`` argument to specify the
+filesystem to read from or write to. For example, the
+:meth:`pyarrow.parquet.read_table` function can be used in the following ways::
+
+    import pyarrow.parquet as pq
+    from pyarrow import fs
+
+    # using a URI -> filesystem is inferred
+    pq.read_table("s3://my-bucket/data.parquet")
+    # using a path and filesystem
+    s3 = fs.S3FileSystem(...)
+    pq.read_table("my-bucket/data.parquet", filesystem=s3)
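+
+Writing a table works analogously; for example,
+:meth:`pyarrow.parquet.write_table` also accepts a ``filesystem`` argument
+(a sketch; ``table`` is assumed to be an existing :class:`pyarrow.Table` and
+the output path is a placeholder)::
+
+    # write to S3, reusing the filesystem object from above
+    pq.write_table(table, "my-bucket/data_out.parquet", filesystem=s3)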
+
+The filesystem interface further allows opening files for reading (input) or
+writing (output) directly, which can be combined with functions that work with
+file-like objects. For example::
+
+    import pyarrow as pa
+    from pyarrow import fs
+
+    # a small example table to write (any table works here)
+    table = pa.table({"col": [1, 2, 3]})
+
+    local = fs.LocalFileSystem()
+
+    with local.open_output_stream("test.arrow") as file:
+        with pa.RecordBatchFileWriter(file, table.schema) as writer:
+            writer.write_table(table)
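+
+Reading the file back works the same way; the Arrow IPC file format needs
+random access, so :meth:`FileSystem.open_input_file` is used here (a sketch
+continuing the example above)::
+
+    with local.open_input_file("test.arrow") as file:
+        with pa.ipc.open_file(file) as reader:
+            table = reader.read_all()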
+
+
+Listing files
+~~~~~~~~~~~~~
+
+Inspecting the directories and files on a filesystem can be done with the
+:meth:`FileSystem.get_file_info` method. To list the contents of a directory,
+use the :class:`FileSelector` object to specify the selection::
+
+ >>> local.get_file_info(fs.FileSelector("dataset/", recursive=True))
+ [<FileInfo for 'dataset/part=B': type=FileType.Directory>,
+ <FileInfo for 'dataset/part=B/data0.parquet': type=FileType.File, size=1564>,
+ <FileInfo for 'dataset/part=A': type=FileType.Directory>,
+ <FileInfo for 'dataset/part=A/data0.parquet': type=FileType.File, size=1564>]
+
+This returns a list of :class:`FileInfo` objects, containing information about
+the type (file or directory), the size, the date last modified, etc.
+
+You can also get this information for a single explicit path (or list of
+paths)::
+
+ >>> local.get_file_info('test.arrow')
+ <FileInfo for 'test.arrow': type=FileType.File, size=3250>
+
+ >>> local.get_file_info('non_existent')
+ <FileInfo for 'non_existent': type=FileType.NotFound>
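+
+The returned :class:`FileInfo` attributes can also be inspected
+programmatically, for example to branch on the entry type (a small sketch,
+reusing the local filesystem from above)::
+
+    info = local.get_file_info("test.arrow")
+    if info.type == fs.FileType.File:
+        print(info.size, info.mtime)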
+
+
+.. _filesystem-localfs:
+
+Local FS
+--------
+
+The :class:`LocalFileSystem` allows you to access files on the local machine.
+
+Example of how to write to disk and read it back::
+
+ >>> from pyarrow import fs
+ >>> local = fs.LocalFileSystem()
+    >>> with local.open_output_stream('/tmp/pyarrowtest.dat') as stream:
+    ...     stream.write(b'data')
+    ...
+    4
+    >>> with local.open_input_stream('/tmp/pyarrowtest.dat') as stream:
+    ...     print(stream.readall())
+    ...
+    b'data'
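+
+The filesystem also exposes the directory operations mentioned earlier, such
+as :meth:`FileSystem.create_dir`, :meth:`FileSystem.copy_file` and
+:meth:`FileSystem.delete_dir` (a sketch with placeholder paths)::
+
+    >>> local.create_dir('/tmp/pyarrowdir/subdir', recursive=True)
+    >>> local.copy_file('/tmp/pyarrowtest.dat', '/tmp/pyarrowdir/copy.dat')
+    >>> local.delete_dir('/tmp/pyarrowdir')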
+
+
+.. _filesystem-s3:
+
+S3
+--
+
+PyArrow natively implements an S3 filesystem for S3-compatible storage.
+
+The :class:`S3FileSystem` constructor has several options to configure the S3
+connection (e.g. credentials, the region, an endpoint override, etc.). In
+addition, the constructor will also inspect configured S3 credentials as
+supported by AWS (for example the ``AWS_ACCESS_KEY_ID`` and
+``AWS_SECRET_ACCESS_KEY`` environment variables).
+
+Example of how to read the contents of an S3 bucket::
+
+ >>> from pyarrow import fs
+ >>> s3 = fs.S3FileSystem(region='eu-west-3')
+
+ # List all contents in a bucket, recursively
+ >>> s3.get_file_info(fs.FileSelector('my-test-bucket', recursive=True))
+ [<FileInfo for 'my-test-bucket/File1': type=FileType.File, size=10>,
+ <FileInfo for 'my-test-bucket/File5': type=FileType.File, size=10>,
+ <FileInfo for 'my-test-bucket/Dir1': type=FileType.Directory>,
+ <FileInfo for 'my-test-bucket/Dir2': type=FileType.Directory>,
+ <FileInfo for 'my-test-bucket/EmptyDir': type=FileType.Directory>,
+ <FileInfo for 'my-test-bucket/Dir1/File2': type=FileType.File, size=11>,
+ <FileInfo for 'my-test-bucket/Dir1/Subdir': type=FileType.Directory>,
+ <FileInfo for 'my-test-bucket/Dir2/Subdir': type=FileType.Directory>,
+ <FileInfo for 'my-test-bucket/Dir2/Subdir/File3': type=FileType.File, size=10>]
+
+ # Open a file for reading and download its contents
+ >>> f = s3.open_input_stream('my-test-bucket/Dir1/File2')
+ >>> f.readall()
+ b'some data'
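+
+The connection options can also be passed explicitly, which is useful for
+S3-compatible stores other than AWS (the endpoint and credentials below are
+placeholders, for illustration only)::
+
+    minio = fs.S3FileSystem(
+        access_key='my-access-key',
+        secret_key='my-secret-key',
+        endpoint_override='http://localhost:9000',
+    )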
+
+.. seealso::
+
+ See the `AWS docs <https://docs.aws.amazon.com/sdk-for-cpp/v1/developer-guide/credentials.html>`__
+ for the different ways to configure the AWS credentials.
+
+
+.. _filesystem-hdfs:
+
+Hadoop Distributed File System (HDFS)
+-------------------------------------
+
+PyArrow comes with bindings to the Hadoop File System (implemented in C++ on
+top of ``libhdfs``, a JNI-based interface to the Java Hadoop client). You
+connect using the :class:`HadoopFileSystem` constructor:
+
+.. code-block:: python
+
+ from pyarrow import fs
+ hdfs = fs.HadoopFileSystem(host, port, user=user, kerb_ticket=ticket_cache_path)
+
+The ``libhdfs`` library is loaded **at runtime** (rather than at link / library
+load time, since the library may not be in your ``LD_LIBRARY_PATH``), and
+relies on some environment variables:
+
+* ``HADOOP_HOME``: the root of your installed Hadoop distribution. Often
+  contains ``lib/native/libhdfs.so``.
+
+* ``JAVA_HOME``: the location of your Java SDK installation.
+
+* ``ARROW_LIBHDFS_DIR`` (optional): explicit location of ``libhdfs.so`` if it is
+ installed somewhere other than ``$HADOOP_HOME/lib/native``.
+
+* ``CLASSPATH``: must contain the Hadoop jars. You can set these using:
+
+ .. code-block:: shell
+
+      export CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath --glob`
+      # or on Windows (in cmd.exe)
+      for /f "delims=" %i in ('%HADOOP_HOME%\bin\hadoop classpath --glob') do set CLASSPATH=%i
+
+  In contrast to the legacy HDFS filesystem with ``pa.hdfs.connect``, setting
+  ``CLASSPATH`` is not optional (PyArrow will not attempt to infer it).
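+
+Alternatively, once the environment variables above are set, a connection can
+also be created from a URI (the host, port and path below are placeholders)::
+
+    from pyarrow import fs
+    hdfs, path = fs.FileSystem.from_uri("hdfs://namenode:8020/user/data")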
+
+.. _filesystem-fsspec:
+
+Using fsspec-compatible filesystems with Arrow
+----------------------------------------------
+
+The filesystems mentioned above are natively supported by Arrow C++ / PyArrow.
+The Python ecosystem, however, also has several filesystem packages. Those
+packages following the `fsspec`_ interface can be used in PyArrow as well.
+
+Functions accepting a filesystem object will also accept an fsspec subclass.
+For example::
+
+ # creating an fsspec-based filesystem object for Google Cloud Storage
+ import gcsfs
+ fs = gcsfs.GCSFileSystem(project='my-google-project')
+
+ # using this to read a partitioned dataset
+ import pyarrow.dataset as ds
+ ds.dataset("data/", filesystem=fs)
+
+Similarly for Azure Blob Storage::
+
+ import adlfs
+ # ... load your credentials and configure the filesystem
+ fs = adlfs.AzureBlobFileSystem(account_name=account_name, account_key=account_key)
+
+ import pyarrow.dataset as ds
+ ds.dataset("mycontainer/data/", filesystem=fs)
+
+Under the hood, the fsspec filesystem object is wrapped into a Python-based
+PyArrow filesystem (:class:`PyFileSystem`) using :class:`FSSpecHandler`.
+You can also do this manually to get an object with the PyArrow FileSystem
+interface::
+
+ from pyarrow.fs import PyFileSystem, FSSpecHandler
+ pa_fs = PyFileSystem(FSSpecHandler(fs))
+
+Then all the functionalities of :class:`FileSystem` are accessible::
+
+    # write data
+    with pa_fs.open_output_stream('mycontainer/pyarrowtest.dat') as stream:
+        stream.write(b'data')
+
+    # read data
+    with pa_fs.open_input_stream('mycontainer/pyarrowtest.dat') as stream:
+        print(stream.readall())
+    # b'data'
+
+    # read a partitioned dataset
+    ds.dataset("data/", filesystem=pa_fs)
+
+
+Using Arrow filesystems with fsspec
+-----------------------------------
+
+The Arrow FileSystem interface has a limited, developer-oriented API surface.
+This is sufficient for basic interactions and for use with Arrow's IO
+functionality. The `fsspec`_ interface, on the other hand, provides a very
+large API with many helper methods. If you want to use those, or if you need
+to interact with a package that expects fsspec-compatible filesystem objects,
+you can wrap an Arrow FileSystem object with fsspec.
+
+Starting with ``fsspec`` version 2021.09, the ``ArrowFSWrapper`` can be used
+for this::
+
+ >>> from pyarrow import fs
+ >>> local = fs.LocalFileSystem()
+ >>> from fsspec.implementations.arrow import ArrowFSWrapper
+ >>> local_fsspec = ArrowFSWrapper(local)
+
+The resulting object now has an fsspec-compatible interface, while being backed
+by the Arrow FileSystem under the hood.
+Example usage: create a directory and a file, and list the contents::
+
+ >>> local_fsspec.mkdir("./test")
+ >>> local_fsspec.touch("./test/file.txt")
+ >>> local_fsspec.ls("./test/")
+ ['./test/file.txt']
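+
+Other fsspec helper methods, such as ``cat`` and ``find``, then work against
+the Arrow-backed filesystem as well (continuing the session above; the exact
+outputs may differ)::
+
+    >>> local_fsspec.cat("./test/file.txt")
+    b''
+    >>> local_fsspec.find("./test")
+    ['./test/file.txt']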
+
+For more information, see the `fsspec`_ documentation.
+
+
+.. _fsspec: https://filesystem-spec.readthedocs.io/en/latest/