<!---
  Licensed to the Apache Software Foundation (ASF) under one
  or more contributor license agreements.  See the NOTICE file
  distributed with this work for additional information
  regarding copyright ownership.  The ASF licenses this file
  to you under the Apache License, Version 2.0 (the
  "License"); you may not use this file except in compliance
  with the License.  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing,
  software distributed under the License is distributed on an
  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  KIND, either express or implied.  See the License for the
  specific language governing permissions and limitations
  under the License.
-->

## Using Arrow's HDFS (Apache Hadoop Distributed File System) interface

### Build requirements

To build the integration, pass the following option to CMake:

```shell
-DARROW_HDFS=on
```
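
For example, an out-of-source build with the integration enabled might look like the following sketch (the build directory layout is illustrative):

```shell
# Configure and build Arrow C++ with the HDFS integration enabled.
# The directory layout here is only an example.
mkdir -p cpp/build && cd cpp/build
cmake -DARROW_HDFS=on ..
make
```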

For convenience, we have bundled `hdfs.h` for libhdfs from Apache Hadoop in
Arrow's thirdparty. If you wish to build against the `hdfs.h` in your installed
Hadoop distribution, set the `$HADOOP_HOME` environment variable.
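
For example (the Hadoop install path below is purely illustrative):

```shell
# Build against the hdfs.h shipped with an existing Hadoop installation
# (the path is only an example).
export HADOOP_HOME=/opt/hadoop
cmake -DARROW_HDFS=on ..
```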

### Runtime requirements

By default, the HDFS client C++ class in `libarrow_io` uses the libhdfs JNI
interface to the Java Hadoop client. This library is loaded **at runtime**
(rather than at link / library load time, since the library may not be in your
`LD_LIBRARY_PATH`), and relies on the following environment variables:

* `HADOOP_HOME`: the root of your installed Hadoop distribution. This
directory often contains `lib/native/libhdfs.so`.
* `JAVA_HOME`: the location of your Java SDK installation.
* `CLASSPATH`: must contain the Hadoop jars. You can set this using:

```shell
export CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath --glob`
```

* `ARROW_LIBHDFS_DIR` (optional): explicit location of `libhdfs.so` if it is
installed somewhere other than `$HADOOP_HOME/lib/native`.

To accommodate distribution-specific nuances, the `JAVA_HOME` variable may be
set to the root path for the Java SDK, the JRE path itself, or to the directory
containing the `libjvm` library.
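
Putting these together, a typical environment setup might look like the following sketch (all paths are illustrative and will differ on your system):

```shell
# Illustrative paths only; adjust them to match your installation.
export HADOOP_HOME=/opt/hadoop
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath --glob`
# Only needed if libhdfs.so lives outside $HADOOP_HOME/lib/native
export ARROW_LIBHDFS_DIR=/usr/local/lib
```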

### Mac Specifics

The installed location of Java on OS X can vary; the following snippet will
set `JAVA_HOME` automatically for you:

```shell
export JAVA_HOME=$(/usr/libexec/java_home)
```

Homebrew's Hadoop does not ship the native libraries, and Apache does not
build them for OS X, so users must build Hadoop themselves to obtain them.
See this Stack Overflow answer for details:

http://stackoverflow.com/a/40051353/478288

Be sure to include the path to the native libs in `JAVA_LIBRARY_PATH`:

```shell
export JAVA_LIBRARY_PATH=$HADOOP_HOME/lib/native:$JAVA_LIBRARY_PATH
```

If you get an error about needing to install Java 6, then add *BundledApp* and
*JNI* to the `JVMCapabilities` array in `$JAVA_HOME/../Info.plist`. See:

https://oliverdowling.com.au/2015/10/09/oracles-jre-8-on-mac-os-x-el-capitan/

https://derflounder.wordpress.com/2015/08/08/modifying-oracles-java-sdk-to-run-java-applications-on-os-x/
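
If you prefer to make that edit from the command line, one possible approach uses `PlistBuddy`; the key path below assumes Oracle's JDK `Info.plist` layout and may need adjusting for your installation:

```shell
# Assumes Oracle's JDK Info.plist layout (a JavaVM > JVMCapabilities array);
# verify the key path on your system before running.
sudo /usr/libexec/PlistBuddy -c "Add :JavaVM:JVMCapabilities:0 string JNI" "$JAVA_HOME/../Info.plist"
sudo /usr/libexec/PlistBuddy -c "Add :JavaVM:JVMCapabilities:0 string BundledApp" "$JAVA_HOME/../Info.plist"
```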