diff options
Diffstat (limited to 'src/arrow/r/vignettes/install.Rmd')
-rw-r--r-- | src/arrow/r/vignettes/install.Rmd | 448 |
1 files changed, 448 insertions, 0 deletions
diff --git a/src/arrow/r/vignettes/install.Rmd b/src/arrow/r/vignettes/install.Rmd new file mode 100644 index 000000000..5bd76a371 --- /dev/null +++ b/src/arrow/r/vignettes/install.Rmd @@ -0,0 +1,448 @@ +--- +title: "Installing the Arrow Package on Linux" +output: rmarkdown::html_vignette +vignette: > + %\VignetteIndexEntry{Installing the Arrow Package on Linux} + %\VignetteEngine{knitr::rmarkdown} + %\VignetteEncoding{UTF-8} +--- + +On macOS and Windows, when you `install.packages("arrow")`, +you get a binary package that contains Arrow’s C++ dependencies along with it. +On Linux, `install.packages()` retrieves a source package that has to be compiled locally, +and C++ dependencies need to be resolved as well. +Generally for R packages with C++ dependencies, +this requires either installing system packages, which you may not have privileges to do, +or building the C++ dependencies separately, +which introduces all sorts of additional ways for things to go wrong. + +Our goal is to make `install.packages("arrow")` "just work" for as many Linux distributions, +versions, and configurations as possible. +This document describes how it works and the options for fine-tuning Linux installation. +The intended audience for this document is `arrow` R package users on Linux, not developers. +If you're contributing to the Arrow project, see `vignette("developing", package = "arrow") for guidance on setting up your development environment. + +Note also that if you use `conda` to manage your R environment, this document does not apply. +You can `conda install -c conda-forge --strict-channel-priority r-arrow` and you'll get the latest official +release of the R package along with any C++ dependencies. + +> Having trouble installing `arrow`? See the "Troubleshooting" section below. + +# Installation basics + +Install the latest release of `arrow` from CRAN with + +```r +install.packages("arrow") +``` + +Daily development builds, which are not official releases, +can be installed from the Ursa Labs repository: + +```r +install.packages("arrow", repos = "https://arrow-r-nightly.s3.amazonaws.com") +``` + +or for conda users via: + +``` +conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow +``` + +You can also install the R package from a git checkout: + +```shell +git clone https://github.com/apache/arrow +cd arrow/r +R CMD INSTALL . +``` + +If you don't already have the Arrow C++ libraries on your system, +when installing the R package from source, it will also download and build +the Arrow C++ libraries for you. To speed installation up, you can set + +```shell +export LIBARROW_BINARY=true +``` + +to look for C++ binaries prebuilt for your Linux distribution/version. +Alternatively, you can set + +```shell +export LIBARROW_MINIMAL=false +``` + +to build the Arrow libraries from source with optional features such as compression libraries +enabled. This will increase the build time but provides many useful features. +Prebuilt binaries are built with this flag enabled, so you get the full +functionality by using them as well. + +Both of these variables are also set this way if you have the `NOT_CRAN=true` +environment variable set. + +## Helper function: install_arrow() + +If you already have `arrow` installed and want to upgrade to a different version, +install a development build, or try to reinstall and fix issues with Linux +C++ binaries, you can call `install_arrow()`. +`install_arrow()` provides some convenience wrappers around the various +environment variables described below. +This function is part of the `arrow` package, +and it is also available as a standalone script, so you can +access it for convenience without first installing the package: + +```r +source("https://raw.githubusercontent.com/apache/arrow/master/r/R/install-arrow.R") +``` + +`install_arrow()` will install from CRAN, +while `install_arrow(nightly = TRUE)` will give you a development build. +`install_arrow()` does not require environment variables to be set in order to +satisfy C++ dependencies. + +> Note that, unlike packages like `tensorflow`, `blogdown`, and others that require external dependencies, you do not need to run `install_arrow()` after a successful `arrow` installation. + +## Offline installation + +The `install-arrow.R` file also includes the `create_package_with_all_dependencies()` +function. Normally, when installing on a computer with internet access, the +build process will download third-party dependencies as needed. +This function provides a way to download them in advance. +Doing so may be useful when installing Arrow on a computer without internet access. +Note that Arrow _can_ be installed on a computer without internet access without doing this, but +many useful features will be disabled, as they depend on third-party components. +More precisely, `arrow::arrow_info()$capabilities()` will be `FALSE` for every +capability. +One approach to add more capabilities in an offline install is to prepare a +package with pre-downloaded dependencies. The +`create_package_with_all_dependencies()` function does this preparation. + +If you're using binary packages you shouldn't need to follow these steps. You +should download the appropriate binary from your package repository, transfer +that to the offline computer, and install that. Any OS can create the source +bundle, but it cannot be installed on Windows. (Instead, use a standard +Windows binary package.) + +Note if you're using RStudio Package Manager on Linux: If you still want to +make a source bundle with this function, make sure to set the first repo in +`options("repos")` to be a mirror that contains source packages (that is: +something other than the RSPM binary mirror URLs). + +### Using a computer with internet access, pre-download the dependencies: +* Install the `arrow` package _or_ run + `source("https://raw.githubusercontent.com/apache/arrow/master/r/R/install-arrow.R")` +* Run `create_package_with_all_dependencies("my_arrow_pkg.tar.gz")` +* Copy the newly created `my_arrow_pkg.tar.gz` to the computer without internet access + +### On the computer without internet access, install the prepared package: +* Install the `arrow` package from the copied file + * `install.packages("my_arrow_pkg.tar.gz", dependencies = c("Depends", "Imports", "LinkingTo"))` + * This installation will build from source, so `cmake` must be available +* Run `arrow_info()` to check installed capabilities + +#### Alternative, hands-on approach +* Download the dependency files (`cpp/thirdparty/download_dependencies.sh` may be helpful) +* Copy the directory of dependencies to the offline computer +* Create the environment variable `ARROW_THIRDPARTY_DEPENDENCY_DIR` on the offline computer, pointing to the copied directory. +* Install the `arrow` package as usual. + +## S3 support + +The `arrow` package allows you to work with data in AWS S3 or in other cloud +storage system that emulate S3. However, support for working with S3 is not +enabled in the default build, and it has additional system requirements. To +enable it, set the environment variable `LIBARROW_MINIMAL=false` or +`NOT_CRAN=true` to choose the full-featured build, or more selectively set +`ARROW_S3=ON`. You also need the following system dependencies: + +* `gcc` >= 4.9 or `clang` >= 3.3; note that the default compiler on CentOS 7 is gcc 4.8.5, which is not sufficient +* CURL: install `libcurl-devel` (rpm) or `libcurl4-openssl-dev` (deb) +* OpenSSL >= 1.0.2: install `openssl-devel` (rpm) or `libssl-dev` (deb) + +The prebuilt C++ binaries come with S3 support enabled, so you will need to meet +these system requirements in order to use them--the package will not install +without them. If you're building everything from source, the install script +will check for the presence of these dependencies and turn off S3 support in the +build if the prerequisites are not met--installation will succeed but without +S3 functionality. If afterwards you install the missing system requirements, +you'll need to reinstall the package in order to enable S3 support. + +# How dependencies are resolved + +In order for the `arrow` R package to work, it needs the Arrow C++ library. +There are a number of ways you can get it: a system package; a library you've +built yourself outside of the context of installing the R package; +or, if you don't already have it, the R package will attempt to resolve it +automatically when it installs. + +If you are authorized to install system packages and you're installing a CRAN release, +you may want to use the official Apache Arrow release packages corresponding to the R package version (though there are some drawbacks: see "Troubleshooting" below). +See the [Arrow project installation page](https://arrow.apache.org/install/) +to find pre-compiled binary packages for some common Linux distributions, +including Debian, Ubuntu, and CentOS. +You'll need to install `libparquet-dev` on Debian and Ubuntu, or `parquet-devel` on CentOS. +This will also automatically install the Arrow C++ library as a dependency. + +When you install the `arrow` R package on Linux, +it will first attempt to find the Arrow C++ libraries on your system using +the `pkg-config` command. +This will find either installed system packages or libraries you've built yourself. +In order for `install.packages("arrow")` to work with these system packages, +you'll need to install them before installing the R package. + +If no Arrow C++ libraries are found on the system, +the R package installation script will next attempt to download +prebuilt static Arrow C++ libraries +that match your both your local operating system and `arrow` R package version. +C++ binaries will only be retrieved if you have set the environment variable +`LIBARROW_BINARY` or `NOT_CRAN`. +If found, they will be downloaded and bundled when your R package compiles. +For a list of supported distributions and versions, +see the [arrow-r-nightly](https://github.com/ursa-labs/arrow-r-nightly/blob/master/README.md) project. + +If no C++ library binary is found, it will attempt to build it locally. +First, it will also look to see if you are in +a checkout of the `apache/arrow` git repository and thus have the C++ source there. +Otherwise, it builds from the C++ files included in the package. +Depending on your system, building Arrow C++ from source may be slow. + +For the specific mechanics of how all this works, see the R package `configure` script, +which calls `tools/nixlibs.R`. + +If the C++ library is built from source, `inst/build_arrow_static.sh` is executed. +This build script is also what is used to generate the prebuilt binaries. + +## How the package is installed - advanced + +This subsection contains information which is likely to be most relevant mostly +to Arrow developers and is not necessary for Arrow users to install Arrow. + +There are a number of scripts that are triggered when `R CMD INSTALL .` is run. +For Arrow users, these should all just work without configuration and pull in +the most complete pieces (e.g. official binaries that we host). + +An overview of these scripts is shown below: + +* `configure` and `configure.win` - these scripts are triggered during +`R CMD INSTALL .` on non-Windows and Windows platforms, respectively. They +handle finding the Arrow library, setting up the build variables necessary, and +writing the package Makevars file that is used to compile the C++ code in the R +package. + +* `tools/nixlibs.R` - this script is sometimes called by `configure` on Linux +(or on any non-windows OS with the environment variable +`FORCE_BUNDLED_BUILD=true`). This sets up the build process for our bundled +builds (which is the default on linux). The operative logic is at the end of +the script, but it will do the following (and it will stop with the first one +that succeeds and some of the steps are only checked if they are enabled via an +environment variable): + * Check if there is an already built libarrow in `arrow/r/libarrow-{version}`, + use that to link against if it exists. + * Check if a binary is available from our hosted unofficial builds. + * Download the Arrow source and build the Arrow Library from source. + * `*** Proceed without C++` dependencies (this is an error and the package + will not work, but if you see this message you know the previous steps have + not succeeded/were not enabled) + +* `inst/build_arrow_static.sh` - called by `tools/nixlibs.R` when the Arrow +library is being built. It builds Arrow for a bundled, static build, and +mirrors the steps described in the ["Arrow R Developer Guide" vignette](./developing.html) + +# Troubleshooting + +The intent is that `install.packages("arrow")` will just work and handle all C++ +dependencies, but depending on your system, you may have better results if you +tune one of several parameters. Here are some known complications and ways to address them. + +## Package failed to build C++ dependencies + +If you see a message like + +``` +------------------------- NOTE --------------------------- +There was an issue preparing the Arrow C++ libraries. +See https://arrow.apache.org/docs/r/articles/install.html +--------------------------------------------------------- +``` + +in the output when the package fails to install, +that means that installation failed to retrieve or build C++ libraries +compatible with the current version of the R package. + +It is expected that C++ dependencies should be built successfully +on all Linux distributions, so you should not see this message. If you do, +please check the "Known installation issues" below to see if any apply. +If none apply, set the environment variable `ARROW_R_DEV=TRUE` +so that details on what failed are shown, and try installing again. Then, +please [report an issue](https://issues.apache.org/jira/projects/ARROW/issues) +and include the full verbose installation output. + +## Using system libraries + +If a system library or other installed Arrow is found but it doesn't match the R package version +(for example, you have libarrow 1.0.0 on your system and are installing R package 2.0.0), +it is likely that the R bindings will fail to compile. +Because the Apache Arrow project is under active development, +is it essential that versions of the C++ and R libraries match. +When `install.packages("arrow")` has to download the C++ libraries, +the install script ensures that you fetch the C++ libraries that correspond to your R package version. +However, if you are using Arrow libraries already on your system, version match isn't guaranteed. + +To fix version mismatch, you can either update your system packages to match the R package version, +or set the environment variable `ARROW_USE_PKG_CONFIG=FALSE` +to tell the configure script not to look for system Arrow packages. +(The latter is the default of `install_arrow()`.) +System packages are available corresponding to all CRAN releases +but not for nightly or dev versions, so depending on the R package version you're installing, +system packages may not be an option. + +Note also that once you have a working R package installation based on system (shared) libraries, +if you update your system Arrow, you'll need to reinstall the R package to match its version. +Similarly, if you're using Arrow system libraries, running `update.packages()` +after a new release of the `arrow` package will likely fail unless you first +update the system packages. + +## Using prebuilt binaries + +If the R package finds and downloads a prebuilt binary of the C++ library, +but then the `arrow` package can't be loaded, perhaps with "undefined symbols" errors, +please [report an issue](https://issues.apache.org/jira/projects/ARROW/issues). +This is likely a compiler mismatch and may be resolvable by setting some +environment variables to instruct R to compile the packages to match the C++ library. + +A workaround would be to set the environment variable `LIBARROW_BINARY=FALSE` +and retry installation: this value instructs the package to build the C++ library from source +instead of downloading the prebuilt binary. +That should guarantee that the compiler settings match. + +If a prebuilt binary wasn't found for your operating system but you think it should have been, +check the logs for a message that says `*** Unable to identify current OS/version`, +or a message that says `*** No C++ binaries found for` an invalid OS. +If you see either, please [report an issue](https://issues.apache.org/jira/projects/ARROW/issues). +You may also set the environment variable `ARROW_R_DEV=TRUE` for additional +debug messages. + +A workaround would be to set the environment variable `LIBARROW_BINARY` +to a `distribution-version` that exists in the Ursa Labs repository. +Setting `LIBARROW_BINARY` is also an option when there's not an exact match +for your OS but a similar version would work, +such as if you're on `ubuntu-18.10` and there's only a binary for `ubuntu-18.04`. + +If that workaround works for you, and you believe that it should work for everyone else too, +you may propose [adding an entry to this lookup table](https://github.com/ursa-labs/arrow-r-nightly/edit/master/linux/distro-map.csv). +This table is checked during the installation process +and tells the script to use binaries built on a different operating system/version +because they're known to work. + +## Building C++ from source + +If building the C++ library from source fails, check the error message. +(If you don't see an error message, only the `----- NOTE -----`, +set the environment variable `ARROW_R_DEV=TRUE` to increase verbosity and retry installation.) +The install script should work everywhere, so if the C++ library fails to compile, +please [report an issue](https://issues.apache.org/jira/projects/ARROW/issues) +so that we can improve the script. + +## Known installation issues + +* On CentOS, if you are using a more modern `devtoolset`, you may need to set +the environment variables `CC` and `CXX` either in the shell or in R's `Makeconf`. +For CentOS 7 and above, both the Arrow system packages and the C++ binaries +for R are built with the default system compilers. If you want to use either of these +and you have a `devtoolset` installed, set `CC=/usr/bin/gcc CXX=/usr/bin/g++` +to use the system compilers instead of the `devtoolset`. +Alternatively, if you want to build `arrow` with the newer `devtoolset` compilers, +set both `ARROW_USE_PKG_CONFIG` and `LIBARROW_BINARY` to `false` so that +you build the Arrow C++ from source using those compilers. +Compiler mismatch between the arrow system libraries and the R +package may cause R to segfault when `arrow` package functions are used. +See discussions [here](https://issues.apache.org/jira/browse/ARROW-8586) +and [here](https://issues.apache.org/jira/browse/ARROW-10780). + +* If you have multiple versions of `zstd` installed on your system, +installation by building the C++ from source may fail with an undefined symbols +error. Workarounds include (1) setting `LIBARROW_BINARY` to use a C++ binary; (2) +setting `ARROW_WITH_ZSTD=OFF` to build without `zstd`; or (3) uninstalling +the conflicting `zstd`. +See discussion [here](https://issues.apache.org/jira/browse/ARROW-8556). + +## Summary of build environment variables + +Some features are optional when you build Arrow from source. With the exception of `ARROW_S3`, these are all `ON` by default in the bundled C++ build, but you can set them to `OFF` to disable them. + +* `ARROW_S3`: If set to `ON` S3 support will be built as long as the + dependencies are met; if they are not met, the build script will turn this `OFF` +* `ARROW_JEMALLOC` for the `jemalloc` memory allocator +* `ARROW_MIMALLOC` for the `mimalloc` memmory allocator +* `ARROW_PARQUET` +* `ARROW_DATASET` +* `ARROW_JSON` for the JSON parsing library +* `ARROW_WITH_RE2` for the RE2 regular expression library, used in some string compute functions +* `ARROW_WITH_UTF8PROC` for the UTF8Proc string library, used in many other string compute functions +* `ARROW_JSON` for JSON parsing +* `ARROW_WITH_BROTLI`, `ARROW_WITH_BZ2`, `ARROW_WITH_LZ4`, `ARROW_WITH_SNAPPY`, `ARROW_WITH_ZLIB`, and `ARROW_WITH_ZSTD` for various compression algorithms + + +There are a number of other variables that affect the `configure` script and the bundled build script. +By default, these are all unset. All boolean variables are case-insensitive. + +* `ARROW_USE_PKG_CONFIG`: If set to `false`, the configure script + won't look for Arrow libraries on your system and instead will look to download/build them. + Use this if you have a version mismatch between installed system libraries + and the version of the R package you're installing. +* `LIBARROW_BINARY`: If set to `true`, the script will try to download a binary + C++ library built for your operating system. + You may also set it to some other string, + a related "distro-version" that has binaries built that work for your OS. + If no binary is found, installation will fall back to building C++ + dependencies from source. +* `LIBARROW_BUILD`: If set to `false`, the build script + will not attempt to build the C++ from source. This means you will only get + a working `arrow` R package if a prebuilt binary is found. + Use this if you want to avoid compiling the C++ library, which may be slow + and resource-intensive, and ensure that you only use a prebuilt binary. +* `LIBARROW_MINIMAL`: If set to `false`, the build script + will enable some optional features, including compression libraries, S3 + support, and additional alternative memory allocators. This will increase the + source build time but results in a more fully functional library. +* `NOT_CRAN`: If this variable is set to `true`, as the `devtools` package does, + the build script will set `LIBARROW_BINARY=true` and `LIBARROW_MINIMAL=false` + unless those environment variables are already set. This provides for a more + complete and fast installation experience for users who already have + `NOT_CRAN=true` as part of their workflow, without requiring additional + environment variables to be set. +* `ARROW_R_DEV`: If set to `true`, more verbose messaging will be printed + in the build script. `arrow::install_arrow(verbose = TRUE)` sets this. + This variable also is needed if you're modifying C++ + code in the package: see the developer guide vignette. +* `LIBARROW_DEBUG_DIR`: If the C++ library building from source fails (`cmake`), + there may be messages telling you to check some log file in the build directory. + However, when the library is built during R package installation, + that location is in a temp directory that is already deleted. + To capture those logs, set this variable to an absolute (not relative) path + and the log files will be copied there. + The directory will be created if it does not exist. +* `CMAKE`: When building the C++ library from source, you can specify a + `/path/to/cmake` to use a different version than whatever is found on the `$PATH` + +# Contributing + +As mentioned above, please [report an issue](https://issues.apache.org/jira/projects/ARROW/issues) +if you encounter ways to improve this. If you find that your Linux distribution +or version is not supported, we welcome the contribution of Docker images +(hosted on Docker Hub) that we can use in our continuous integration. These +Docker images should be minimal, containing only R and the dependencies it +requires. (For reference, see the images that +[R-hub](https://github.com/r-hub/rhub-linux-builders) uses.) + +You can test the `arrow` R package installation using the `docker-compose` +setup included in the `apache/arrow` git repository. For example, + +``` +R_ORG=rhub R_IMAGE=ubuntu-gcc-release R_TAG=latest docker-compose build r +R_ORG=rhub R_IMAGE=ubuntu-gcc-release R_TAG=latest docker-compose run r +``` + +installs the `arrow` R package, including the C++ source build, on the +[rhub/ubuntu-gcc-release](https://hub.docker.com/r/rhub/ubuntu-gcc-release) +image. |