---
title: "Installing the Arrow Package on Linux"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Installing the Arrow Package on Linux}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

On macOS and Windows, when you `install.packages("arrow")`,
you get a binary package that contains Arrow’s C++ dependencies along with it.
On Linux, `install.packages()` retrieves a source package that has to be compiled locally,
and C++ dependencies need to be resolved as well.
Generally for R packages with C++ dependencies,
this requires either installing system packages, which you may not have privileges to do,
or building the C++ dependencies separately,
which introduces all sorts of additional ways for things to go wrong.

Our goal is to make `install.packages("arrow")` "just work" for as many Linux distributions,
versions, and configurations as possible.
This document describes how it works and the options for fine-tuning Linux installation.
The intended audience for this document is `arrow` R package users on Linux, not developers.
If you're contributing to the Arrow project, see `vignette("developing", package = "arrow")` for guidance on setting up your development environment.

Note also that if you use `conda` to manage your R environment, this document does not apply.
You can `conda install -c conda-forge --strict-channel-priority r-arrow` and you'll get the latest official
release of the R package along with any C++ dependencies.

> Having trouble installing `arrow`? See the "Troubleshooting" section below.

# Installation basics

Install the latest release of `arrow` from CRAN with

```r
install.packages("arrow")
```

Daily development builds, which are not official releases,
can be installed from the Ursa Labs repository:

```r
install.packages("arrow", repos = "https://arrow-r-nightly.s3.amazonaws.com")
```

or for conda users via:

```shell
conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
```

You can also install the R package from a git checkout:

```shell
git clone https://github.com/apache/arrow
cd arrow/r
R CMD INSTALL .
```

If you don't already have the Arrow C++ libraries on your system,
when installing the R package from source, it will also download and build
the Arrow C++ libraries for you. To speed installation up, you can set

```shell
export LIBARROW_BINARY=true
```

to look for C++ binaries prebuilt for your Linux distribution/version.
Alternatively, you can set

```shell
export LIBARROW_MINIMAL=false
```

to build the Arrow libraries from source with optional features such as compression libraries
enabled. This will increase the build time but provides many useful features.
Prebuilt binaries are built with this flag enabled, so you get the full
functionality by using them as well.

Both of these variables are also set this way if you have the `NOT_CRAN=true`
environment variable set.
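
As a rough illustration of how these settings interact, the shell sketch below reproduces the rule described in this document: `NOT_CRAN=true` supplies `LIBARROW_BINARY=true` and `LIBARROW_MINIMAL=false` only when you have not set those variables yourself. The function name `effective_libarrow_settings` is hypothetical and is not part of the package; the real logic lives in the package's install scripts.

```shell
# Hypothetical helper (not part of arrow): print the values that would be in
# effect for LIBARROW_BINARY and LIBARROW_MINIMAL under the NOT_CRAN rule.
effective_libarrow_settings() {
  binary="${LIBARROW_BINARY:-}"
  minimal="${LIBARROW_MINIMAL:-}"
  # Boolean variables are case-insensitive, so normalize NOT_CRAN first.
  not_cran=$(printf '%s' "${NOT_CRAN:-}" | tr '[:upper:]' '[:lower:]')
  if [ "$not_cran" = "true" ]; then
    # Explicit user settings win; NOT_CRAN only fills in the gaps.
    binary="${binary:-true}"
    minimal="${minimal:-false}"
  fi
  echo "LIBARROW_BINARY=${binary:-unset} LIBARROW_MINIMAL=${minimal:-unset}"
}
```

With `NOT_CRAN=true` and neither of the other two variables set, calling `effective_libarrow_settings` prints `LIBARROW_BINARY=true LIBARROW_MINIMAL=false`.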

## Helper function: install_arrow()

If you already have `arrow` installed and want to upgrade to a different version,
install a development build, or try to reinstall and fix issues with Linux
C++ binaries, you can call `install_arrow()`.
`install_arrow()` provides some convenience wrappers around the various
environment variables described below.
This function is part of the `arrow` package,
and it is also available as a standalone script, so you can
access it for convenience without first installing the package:

```r
source("https://raw.githubusercontent.com/apache/arrow/master/r/R/install-arrow.R")
```

`install_arrow()` will install from CRAN,
while `install_arrow(nightly = TRUE)` will give you a development build.
`install_arrow()` does not require environment variables to be set in order to
satisfy C++ dependencies.

> Note that, unlike packages like `tensorflow`, `blogdown`, and others that require external dependencies, you do not need to run `install_arrow()` after a successful `arrow` installation.

## Offline installation

The `install-arrow.R` file also includes the `create_package_with_all_dependencies()`
function. Normally, when installing on a computer with internet access, the
build process will download third-party dependencies as needed.
This function provides a way to download them in advance.
Doing so may be useful when installing Arrow on a computer without internet access.
Note that Arrow _can_ be installed on a computer without internet access without doing this, but
many useful features will be disabled, as they depend on third-party components.
More precisely, `arrow::arrow_info()$capabilities` will be `FALSE` for every
capability.
One approach to add more capabilities in an offline install is to prepare a
package with pre-downloaded dependencies. The
`create_package_with_all_dependencies()` function does this preparation.

If you're using binary packages you shouldn't need to follow these steps. You
should download the appropriate binary from your package repository, transfer
that to the offline computer, and install that. Any OS can create the source
bundle, but it cannot be installed on Windows. (Instead, use a standard
Windows binary package.)

Note if you're using RStudio Package Manager on Linux: If you still want to
make a source bundle with this function, make sure to set the first repo in
`options("repos")` to be a mirror that contains source packages (that is:
something other than the RSPM binary mirror URLs).

### Using a computer with internet access, pre-download the dependencies:
* Install the `arrow` package _or_ run
  `source("https://raw.githubusercontent.com/apache/arrow/master/r/R/install-arrow.R")`
* Run `create_package_with_all_dependencies("my_arrow_pkg.tar.gz")`
* Copy the newly created `my_arrow_pkg.tar.gz` to the computer without internet access

### On the computer without internet access, install the prepared package:
* Install the `arrow` package from the copied file
  * `install.packages("my_arrow_pkg.tar.gz", dependencies = c("Depends", "Imports", "LinkingTo"))`
  * This installation will build from source, so `cmake` must be available
* Run `arrow_info()` to check installed capabilities

#### Alternative, hands-on approach
* Download the dependency files (`cpp/thirdparty/download_dependencies.sh` may be helpful)
* Copy the directory of dependencies to the offline computer
* Create the environment variable `ARROW_THIRDPARTY_DEPENDENCY_DIR` on the offline computer, pointing to the copied directory.
* Install the `arrow` package as usual.
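
Before installing on the offline computer, it can help to confirm that the variable points at a populated directory. The check below is a hedged sketch: `check_thirdparty_dir` is a hypothetical helper, not something the `arrow` package provides, and it only verifies that the directory exists and is non-empty, not that its contents are complete.

```shell
# Hypothetical pre-flight check (not part of arrow): verify that
# ARROW_THIRDPARTY_DEPENDENCY_DIR names an existing, non-empty directory.
check_thirdparty_dir() {
  dir="${ARROW_THIRDPARTY_DEPENDENCY_DIR:-}"
  if [ -z "$dir" ]; then
    echo "ARROW_THIRDPARTY_DEPENDENCY_DIR is not set"
    return 1
  fi
  if [ ! -d "$dir" ]; then
    echo "not a directory: $dir"
    return 1
  fi
  # ls -A lists everything except . and .., so empty output means empty dir.
  if [ -z "$(ls -A "$dir")" ]; then
    echo "directory is empty: $dir"
    return 1
  fi
  echo "ok: $dir"
}
```

After exporting the variable, you could run `check_thirdparty_dir` and proceed with the installation only if it reports `ok`.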

## S3 support

The `arrow` package allows you to work with data in AWS S3 or in other cloud
storage systems that emulate S3. However, support for working with S3 is not
enabled in the default build, and it has additional system requirements. To
enable it, set the environment variable `LIBARROW_MINIMAL=false` or
`NOT_CRAN=true` to choose the full-featured build, or more selectively set
`ARROW_S3=ON`. You also need the following system dependencies:

* `gcc` >= 4.9 or `clang` >= 3.3; note that the default compiler on CentOS 7 is gcc 4.8.5, which is not sufficient
* CURL: install `libcurl-devel` (rpm) or `libcurl4-openssl-dev` (deb)
* OpenSSL >= 1.0.2: install `openssl-devel` (rpm) or `libssl-dev` (deb)

The prebuilt C++ binaries come with S3 support enabled, so you will need to meet
these system requirements in order to use them--the package will not install
without them. If you're building everything from source, the install script
will check for the presence of these dependencies and turn off S3 support in the
build if the prerequisites are not met--installation will succeed but without
S3 functionality. If afterwards you install the missing system requirements,
you'll need to reinstall the package in order to enable S3 support.
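
To see in advance whether your compiler meets the minimum above, you could compare version strings in the shell. This is a sketch, not what the install script runs: `version_ge` and `gcc_ok_for_s3` are hypothetical helpers, and only the `gcc` threshold of 4.9 comes from the list above.

```shell
# Hypothetical helper: succeed if dotted version $1 >= $2, comparing numeric
# fields left to right and treating missing fields as 0.
version_ge() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    n = split(a, x, "."); m = split(b, y, ".")
    max = (n > m) ? n : m
    for (i = 1; i <= max; i++) {
      xi = (i <= n) ? x[i] + 0 : 0
      yi = (i <= m) ? y[i] + 0 : 0
      if (xi > yi) exit 0
      if (xi < yi) exit 1
    }
    exit 0
  }'
}

# Hypothetical check against the gcc minimum listed above (4.9).
gcc_ok_for_s3() {
  command -v gcc >/dev/null 2>&1 || return 1
  version_ge "$(gcc -dumpversion)" 4.9
}
```

With the CentOS 7 default of gcc 4.8.5, `version_ge 4.8.5 4.9` fails, which is consistent with the note in the requirements list.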

# How dependencies are resolved

In order for the `arrow` R package to work, it needs the Arrow C++ library.
There are a number of ways you can get it: a system package; a library you've
built yourself outside of the context of installing the R package;
or, if you don't already have it, the R package will attempt to resolve it
automatically when it installs.

If you are authorized to install system packages and you're installing a CRAN release,
you may want to use the official Apache Arrow release packages corresponding to the R package version (though there are some drawbacks: see "Troubleshooting" below).
See the [Arrow project installation page](https://arrow.apache.org/install/)
to find pre-compiled binary packages for some common Linux distributions,
including Debian, Ubuntu, and CentOS.
You'll need to install `libparquet-dev` on Debian and Ubuntu, or `parquet-devel` on CentOS.
This will also automatically install the Arrow C++ library as a dependency.

When you install the `arrow` R package on Linux,
it will first attempt to find the Arrow C++ libraries on your system using
the `pkg-config` command.
This will find either installed system packages or libraries you've built yourself.
In order for `install.packages("arrow")` to work with these system packages,
you'll need to install them before installing the R package.
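
You can run a similar check yourself to see what the installer is likely to find. The sketch below assumes the Arrow C++ pkg-config module is named `arrow` and honors the conventional `PKG_CONFIG` variable for choosing the binary; `system_arrow_version` is a hypothetical helper, not part of the package.

```shell
# Hypothetical helper: print the version of a system Arrow C++ library if
# pkg-config can find one, and fail otherwise.
system_arrow_version() {
  pc="${PKG_CONFIG:-pkg-config}"
  # Assumption: the Arrow C++ pkg-config module is called "arrow".
  if "$pc" --exists arrow 2>/dev/null; then
    "$pc" --modversion arrow
  else
    return 1
  fi
}
```

For example, `system_arrow_version || echo "no system Arrow C++ found"` reports either the detected version or the fallback message.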

If no Arrow C++ libraries are found on the system,
the R package installation script will next attempt to download
prebuilt static Arrow C++ libraries
that match both your local operating system and `arrow` R package version.
C++ binaries will only be retrieved if you have set the environment variable
`LIBARROW_BINARY` or `NOT_CRAN`.
If found, they will be downloaded and bundled when your R package compiles.
For a list of supported distributions and versions,
see the [arrow-r-nightly](https://github.com/ursa-labs/arrow-r-nightly/blob/master/README.md) project.

If no C++ library binary is found, it will attempt to build it locally.
First, it checks whether you are in
a checkout of the `apache/arrow` git repository and thus have the C++ source there.
Otherwise, it builds from the C++ files included in the package.
Depending on your system, building Arrow C++ from source may be slow.
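
The order of attempts described in this section can be summarized in a small shell sketch. It is illustrative only: it mirrors the prose in this document rather than the actual code in `configure` or `tools/nixlibs.R`, and the function names `lower` and `libarrow_source`, as well as the `yes`/`no` arguments, are hypothetical.

```shell
# Illustrative only: mirrors the order described in this document,
# not the actual logic in configure or tools/nixlibs.R.
lower() { printf '%s' "$1" | tr '[:upper:]' '[:lower:]'; }

# $1: "yes" if pkg-config finds Arrow C++ on the system, "no" otherwise
# $2: "yes" if a prebuilt binary exists for this OS/version, "no" otherwise
libarrow_source() {
  system_found="$1"
  binary_available="$2"
  use_pkg_config=$(lower "${ARROW_USE_PKG_CONFIG:-true}")
  binary=$(lower "${LIBARROW_BINARY:-}")
  build=$(lower "${LIBARROW_BUILD:-true}")
  # NOT_CRAN=true turns on binary downloads unless LIBARROW_BINARY is set.
  if [ -z "$binary" ] && [ "$(lower "${NOT_CRAN:-}")" = "true" ]; then
    binary="true"
  fi

  if [ "$use_pkg_config" != "false" ] && [ "$system_found" = "yes" ]; then
    echo "system"
  elif [ -n "$binary" ] && [ "$binary" != "false" ] && [ "$binary_available" = "yes" ]; then
    echo "prebuilt-binary"
  elif [ "$build" != "false" ]; then
    echo "source-build"
  else
    echo "none"
  fi
}
```

With no environment variables set and no system library, `libarrow_source no yes` prints `source-build`, because binaries are only downloaded when `LIBARROW_BINARY` or `NOT_CRAN` is set.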

For the specific mechanics of how all this works, see the R package `configure` script,
which calls `tools/nixlibs.R`.

If the C++ library is built from source, `inst/build_arrow_static.sh` is executed.
This build script is also what is used to generate the prebuilt binaries.

## How the package is installed - advanced

This subsection contains information that is likely to be relevant mostly
to Arrow developers and is not necessary for Arrow users to install Arrow.

There are a number of scripts that are triggered when `R CMD INSTALL .` is run.
For Arrow users, these should all just work without configuration and pull in
the most complete pieces (e.g. official binaries that we host).

An overview of these scripts is shown below:

* `configure` and `configure.win` - these scripts are triggered during
`R CMD INSTALL .` on non-Windows and Windows platforms, respectively. They
handle finding the Arrow library, setting up the build variables necessary, and
writing the package Makevars file that is used to compile the C++ code in the R
package.

* `tools/nixlibs.R` - this script is sometimes called by `configure` on Linux
(or on any non-Windows OS with the environment variable
`FORCE_BUNDLED_BUILD=true`). This sets up the build process for our bundled
builds (which is the default on Linux). The operative logic is at the end of
the script. It tries the following steps in order and stops at the first one
that succeeds; some steps are only attempted if they are enabled via an
environment variable:
  * Check if there is an already built libarrow in `arrow/r/libarrow-{version}`,
  and use that to link against if it exists.
  * Check if a binary is available from our hosted unofficial builds.
  * Download the Arrow source and build the Arrow library from source.
  * `*** Proceed without C++ dependencies` (this is an error and the package
  will not work, but if you see this message you know the previous steps have
  not succeeded or were not enabled).

* `inst/build_arrow_static.sh` - called by `tools/nixlibs.R` when the Arrow
library is being built.  It builds Arrow for a bundled, static build, and
mirrors the steps described in the ["Arrow R Developer Guide" vignette](./developing.html).

# Troubleshooting

The intent is that `install.packages("arrow")` will just work and handle all C++
dependencies, but depending on your system, you may have better results if you
tune one of several parameters. Here are some known complications and ways to address them.

## Package failed to build C++ dependencies

If you see a message like

```
------------------------- NOTE ---------------------------
There was an issue preparing the Arrow C++ libraries.
See https://arrow.apache.org/docs/r/articles/install.html
---------------------------------------------------------
```

in the output when the package fails to install,
that means that installation failed to retrieve or build C++ libraries
compatible with the current version of the R package.

It is expected that C++ dependencies should be built successfully
on all Linux distributions, so you should not see this message. If you do,
please check the "Known installation issues" below to see if any apply.
If none apply, set the environment variable `ARROW_R_DEV=TRUE`
so that details on what failed are shown, and try installing again. Then,
please [report an issue](https://issues.apache.org/jira/projects/ARROW/issues)
and include the full verbose installation output.

## Using system libraries

If a system library or other installed Arrow is found but it doesn't match the R package version
(for example, you have libarrow 1.0.0 on your system and are installing R package 2.0.0),
it is likely that the R bindings will fail to compile.
Because the Apache Arrow project is under active development,
it is essential that versions of the C++ and R libraries match.
When `install.packages("arrow")` has to download the C++ libraries,
the install script ensures that you fetch the C++ libraries that correspond to your R package version.
However, if you are using Arrow libraries already on your system, version match isn't guaranteed.

To fix version mismatch, you can either update your system packages to match the R package version,
or set the environment variable `ARROW_USE_PKG_CONFIG=FALSE`
to tell the configure script not to look for system Arrow packages.
(The latter is the default of `install_arrow()`.)
System packages are available corresponding to all CRAN releases
but not for nightly or dev versions, so depending on the R package version you're installing,
system packages may not be an option.

Note also that once you have a working R package installation based on system (shared) libraries,
if you update your system Arrow, you'll need to reinstall the R package to match its version.
Similarly, if you're using Arrow system libraries, running `update.packages()`
after a new release of the `arrow` package will likely fail unless you first
update the system packages.

## Using prebuilt binaries

If the R package finds and downloads a prebuilt binary of the C++ library,
but then the `arrow` package can't be loaded, perhaps with "undefined symbols" errors,
please [report an issue](https://issues.apache.org/jira/projects/ARROW/issues).
This is likely a compiler mismatch and may be resolvable by setting some
environment variables to instruct R to compile the packages to match the C++ library.

A workaround would be to set the environment variable `LIBARROW_BINARY=FALSE`
and retry installation: this value instructs the package to build the C++ library from source
instead of downloading the prebuilt binary.
That should guarantee that the compiler settings match.

If a prebuilt binary wasn't found for your operating system but you think it should have been,
check the logs for a message that says `*** Unable to identify current OS/version`,
or a message that says `*** No C++ binaries found for` an invalid OS.
If you see either, please [report an issue](https://issues.apache.org/jira/projects/ARROW/issues).
You may also set the environment variable `ARROW_R_DEV=TRUE` for additional
debug messages.

A workaround would be to set the environment variable `LIBARROW_BINARY`
to a `distribution-version` that exists in the Ursa Labs repository.
Setting `LIBARROW_BINARY` is also an option when there's not an exact match
for your OS but a similar version would work,
such as if you're on `ubuntu-18.10` and there's only a binary for `ubuntu-18.04`.
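
To find a candidate value, you could derive a `distribution-version` string from `/etc/os-release`. This is a sketch under assumptions: `distro_version` is a hypothetical helper, the identifier the install script computes for your system may differ, and you should confirm that binaries exist for whatever value you choose.

```shell
# Hypothetical helper: build a "distribution-version" string such as
# "ubuntu-18.04" from an os-release file (defaults to /etc/os-release).
distro_version() {
  file="${1:-/etc/os-release}"
  [ -r "$file" ] || return 1
  # Source the file in a subshell so its variables do not leak out.
  ( . "$file" && printf '%s-%s\n' "$ID" "$VERSION_ID" )
}
```

If the result has no matching binary but a close relative does, you could set `LIBARROW_BINARY` to that relative's identifier, for example `ubuntu-18.04`, before installing.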

If that workaround works for you, and you believe that it should work for everyone else too,
you may propose [adding an entry to this lookup table](https://github.com/ursa-labs/arrow-r-nightly/edit/master/linux/distro-map.csv).
This table is checked during the installation process
and tells the script to use binaries built on a different operating system/version
because they're known to work.

## Building C++ from source

If building the C++ library from source fails, check the error message.
(If you don't see an error message, only the `----- NOTE -----`,
set the environment variable `ARROW_R_DEV=TRUE` to increase verbosity and retry installation.)
The install script should work everywhere, so if the C++ library fails to compile,
please [report an issue](https://issues.apache.org/jira/projects/ARROW/issues)
so that we can improve the script.

## Known installation issues

* On CentOS, if you are using a more modern `devtoolset`, you may need to set
the environment variables `CC` and `CXX` either in the shell or in R's `Makeconf`.
For CentOS 7 and above, both the Arrow system packages and the C++ binaries
for R are built with the default system compilers. If you want to use either of these
and you have a `devtoolset` installed, set `CC=/usr/bin/gcc CXX=/usr/bin/g++`
to use the system compilers instead of the `devtoolset`.
Alternatively, if you want to build `arrow` with the newer `devtoolset` compilers,
set both `ARROW_USE_PKG_CONFIG` and `LIBARROW_BINARY` to `false` so that
you build the Arrow C++ from source using those compilers.
Compiler mismatch between the Arrow system libraries and the R
package may cause R to segfault when `arrow` package functions are used.
See discussions [here](https://issues.apache.org/jira/browse/ARROW-8586)
and [here](https://issues.apache.org/jira/browse/ARROW-10780).

* If you have multiple versions of `zstd` installed on your system,
installation by building the C++ from source may fail with an undefined symbols
error. Workarounds include (1) setting `LIBARROW_BINARY` to use a C++ binary; (2)
setting `ARROW_WITH_ZSTD=OFF` to build without `zstd`; or (3) uninstalling
the conflicting `zstd`.
See discussion [here](https://issues.apache.org/jira/browse/ARROW-8556).

## Summary of build environment variables

Some features are optional when you build Arrow from source. With the exception of `ARROW_S3`, these are all `ON` by default in the bundled C++ build, but you can set them to `OFF` to disable them.

* `ARROW_S3`: If set to `ON`, S3 support will be built as long as the
  dependencies are met; if they are not met, the build script will turn this `OFF`.
* `ARROW_JEMALLOC` for the `jemalloc` memory allocator
* `ARROW_MIMALLOC` for the `mimalloc` memory allocator
* `ARROW_PARQUET`
* `ARROW_DATASET`
* `ARROW_JSON` for the JSON parsing library
* `ARROW_WITH_RE2` for the RE2 regular expression library, used in some string compute functions
* `ARROW_WITH_UTF8PROC` for the UTF8Proc string library, used in many other string compute functions
* `ARROW_WITH_BROTLI`, `ARROW_WITH_BZ2`, `ARROW_WITH_LZ4`, `ARROW_WITH_SNAPPY`, `ARROW_WITH_ZLIB`, and `ARROW_WITH_ZSTD` for various compression algorithms


There are a number of other variables that affect the `configure` script and the bundled build script.
By default, these are all unset. All boolean variables are case-insensitive.

* `ARROW_USE_PKG_CONFIG`: If set to `false`, the configure script
  won't look for Arrow libraries on your system and instead will look to download/build them.
  Use this if you have a version mismatch between installed system libraries
  and the version of the R package you're installing.
* `LIBARROW_BINARY`: If set to `true`, the script will try to download a binary
  C++ library built for your operating system.
  You may also set it to some other string,
  a related "distro-version" that has binaries built that work for your OS.
  If no binary is found, installation will fall back to building C++
  dependencies from source.
* `LIBARROW_BUILD`: If set to `false`, the build script
  will not attempt to build the C++ from source. This means you will only get
  a working `arrow` R package if a prebuilt binary is found.
  Use this if you want to avoid compiling the C++ library, which may be slow
  and resource-intensive, and ensure that you only use a prebuilt binary.
* `LIBARROW_MINIMAL`: If set to `false`, the build script
  will enable some optional features, including compression libraries, S3
  support, and additional alternative memory allocators. This will increase the
  source build time but results in a more fully functional library.
* `NOT_CRAN`: If this variable is set to `true`, as the `devtools` package does,
  the build script will set `LIBARROW_BINARY=true` and `LIBARROW_MINIMAL=false`
  unless those environment variables are already set. This provides for a more
  complete and fast installation experience for users who already have
  `NOT_CRAN=true` as part of their workflow, without requiring additional
  environment variables to be set.
* `ARROW_R_DEV`: If set to `true`, more verbose messaging will be printed
  in the build script. `arrow::install_arrow(verbose = TRUE)` sets this.
  This variable also is needed if you're modifying C++
  code in the package: see the developer guide vignette.
* `LIBARROW_DEBUG_DIR`: If building the C++ library from source fails (in `cmake`),
  there may be messages telling you to check some log file in the build directory.
  However, when the library is built during R package installation,
  that location is in a temp directory that has already been deleted.
  To capture those logs, set this variable to an absolute (not relative) path
  and the log files will be copied there.
  The directory will be created if it does not exist.
* `CMAKE`: When building the C++ library from source, you can specify a
  `/path/to/cmake` to use a different version than whatever is found on the `$PATH`.
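
Putting the debugging variables together, a verbose source build that keeps its logs might be set up as follows. The directory `/tmp/arrow-build-logs` is an arbitrary example path chosen for illustration; any absolute path should work.

```shell
# Print verbose output from the build script.
export ARROW_R_DEV=true
# Copy build logs here; the path must be absolute. Example location only.
export LIBARROW_DEBUG_DIR=/tmp/arrow-build-logs
# Then install as usual from the same shell session, for example:
#   R CMD INSTALL .
```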

# Contributing

As mentioned above, please [report an issue](https://issues.apache.org/jira/projects/ARROW/issues)
if you find ways to improve the installation process. If you find that your Linux distribution
or version is not supported, we welcome the contribution of Docker images
(hosted on Docker Hub) that we can use in our continuous integration. These
Docker images should be minimal, containing only R and the dependencies it
requires. (For reference, see the images that
[R-hub](https://github.com/r-hub/rhub-linux-builders) uses.)

You can test the `arrow` R package installation using the `docker-compose`
setup included in the `apache/arrow` git repository. For example,

```shell
R_ORG=rhub R_IMAGE=ubuntu-gcc-release R_TAG=latest docker-compose build r
R_ORG=rhub R_IMAGE=ubuntu-gcc-release R_TAG=latest docker-compose run r
```

installs the `arrow` R package, including the C++ source build, on the
[rhub/ubuntu-gcc-release](https://hub.docker.com/r/rhub/ubuntu-gcc-release)
image.