.. Licensed to the Apache Software Foundation (ASF) under one
.. or more contributor license agreements.  See the NOTICE file
.. distributed with this work for additional information
.. regarding copyright ownership.  The ASF licenses this file
.. to you under the Apache License, Version 2.0 (the
.. "License"); you may not use this file except in compliance
.. with the License.  You may obtain a copy of the License at

..   http://www.apache.org/licenses/LICENSE-2.0

.. Unless required by applicable law or agreed to in writing,
.. software distributed under the License is distributed on an
.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
.. KIND, either express or implied.  See the License for the
.. specific language governing permissions and limitations
.. under the License.

Filesystem Interface (legacy)
=============================

.. warning::
   This section documents the deprecated filesystem layer.  You should
   use the :ref:`new filesystem layer <filesystem>` instead.

.. _hdfs:

Hadoop File System (HDFS)
-------------------------

PyArrow comes with bindings to a C++-based interface to the Hadoop File
System. You can connect as follows:

.. code-block:: python

   import pyarrow as pa

   # host, port, user, ticket_cache_path and path are placeholders
   fs = pa.hdfs.connect(host, port, user=user, kerb_ticket=ticket_cache_path)
   with fs.open(path, 'rb') as f:
       data = f.read()  # Do something with f

By default, ``pyarrow.hdfs.HadoopFileSystem`` uses libhdfs, a JNI-based
interface to the Java Hadoop client. This library is loaded **at runtime**
(rather than at link / library load time, since the library may not be in your
``LD_LIBRARY_PATH``), and relies on the following environment variables:

* ``HADOOP_HOME``: the root of your installed Hadoop distribution. It often
  contains ``lib/native/libhdfs.so``.

* ``JAVA_HOME``: the location of your Java SDK installation.

* ``ARROW_LIBHDFS_DIR`` (optional): explicit location of ``libhdfs.so`` if it is
  installed somewhere other than ``$HADOOP_HOME/lib/native``.

* ``CLASSPATH``: must contain the Hadoop jars. You can set it using:

.. code-block:: shell

    export CLASSPATH=$($HADOOP_HOME/bin/hdfs classpath --glob)

If ``CLASSPATH`` is not set, then it will be set automatically if the
``hadoop`` executable is in your system path, or if ``HADOOP_HOME`` is set.
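
The environment lookup described above can be sketched in Python as a quick
pre-flight check before connecting. This is only an illustration under stated
assumptions: ``describe_hdfs_environment`` is a hypothetical helper that is
not part of PyArrow, and the search order it uses (``ARROW_LIBHDFS_DIR``
first, then ``$HADOOP_HOME/lib/native``) is inferred from the variable
descriptions above rather than taken from the loader itself.

.. code-block:: python

   import os


   def describe_hdfs_environment(env=None):
       """Summarize the variables that the libhdfs driver relies on.

       Hypothetical helper for illustration only; not part of PyArrow.
       """
       env = os.environ if env is None else env

       # Assumed search order: an explicit ARROW_LIBHDFS_DIR first, then
       # the conventional location under HADOOP_HOME.
       libhdfs_dirs = []
       if env.get("ARROW_LIBHDFS_DIR"):
           libhdfs_dirs.append(env["ARROW_LIBHDFS_DIR"])
       if env.get("HADOOP_HOME"):
           libhdfs_dirs.append(
               os.path.join(env["HADOOP_HOME"], "lib", "native"))

       return {
           "libhdfs_dirs": libhdfs_dirs,
           "java_home_set": bool(env.get("JAVA_HOME")),
           "classpath_set": bool(env.get("CLASSPATH")),
       }

Calling ``describe_hdfs_environment()`` with no arguments inspects
``os.environ``; passing a plain dictionary instead makes it possible to try
the helper without touching the real environment.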

You can also use libhdfs3, a third-party C++ library for HDFS from Pivotal Labs:

.. code-block:: python

   fs = pa.hdfs.connect(host, port, user=user, kerb_ticket=ticket_cache_path,
                       driver='libhdfs3')

HDFS API
~~~~~~~~

.. currentmodule:: pyarrow

.. autosummary::
   :toctree: generated/

   hdfs.connect
   HadoopFileSystem.cat
   HadoopFileSystem.chmod
   HadoopFileSystem.chown
   HadoopFileSystem.delete
   HadoopFileSystem.df
   HadoopFileSystem.disk_usage
   HadoopFileSystem.download
   HadoopFileSystem.exists
   HadoopFileSystem.get_capacity
   HadoopFileSystem.get_space_used
   HadoopFileSystem.info
   HadoopFileSystem.ls
   HadoopFileSystem.mkdir
   HadoopFileSystem.open
   HadoopFileSystem.rename
   HadoopFileSystem.rm
   HadoopFileSystem.upload
   HdfsFile