Diffstat (limited to 'src/arrow/r/vignettes/install.Rmd')
-rw-r--r-- src/arrow/r/vignettes/install.Rmd | 448
1 files changed, 448 insertions, 0 deletions
diff --git a/src/arrow/r/vignettes/install.Rmd b/src/arrow/r/vignettes/install.Rmd
new file mode 100644
index 000000000..5bd76a371
--- /dev/null
+++ b/src/arrow/r/vignettes/install.Rmd
@@ -0,0 +1,448 @@
+---
+title: "Installing the Arrow Package on Linux"
+output: rmarkdown::html_vignette
+vignette: >
+ %\VignetteIndexEntry{Installing the Arrow Package on Linux}
+ %\VignetteEngine{knitr::rmarkdown}
+ %\VignetteEncoding{UTF-8}
+---
+
+On macOS and Windows, when you `install.packages("arrow")`,
+you get a binary package that contains Arrow’s C++ dependencies along with it.
+On Linux, `install.packages()` retrieves a source package that has to be compiled locally,
+and C++ dependencies need to be resolved as well.
+Generally for R packages with C++ dependencies,
+this requires either installing system packages, which you may not have privileges to do,
+or building the C++ dependencies separately,
+which introduces all sorts of additional ways for things to go wrong.
+
+Our goal is to make `install.packages("arrow")` "just work" for as many Linux distributions,
+versions, and configurations as possible.
+This document describes how it works and the options for fine-tuning Linux installation.
+The intended audience for this document is `arrow` R package users on Linux, not developers.
+If you're contributing to the Arrow project, see `vignette("developing", package = "arrow")` for guidance on setting up your development environment.
+
+Note also that if you use `conda` to manage your R environment, this document does not apply.
+You can `conda install -c conda-forge --strict-channel-priority r-arrow` and you'll get the latest official
+release of the R package along with any C++ dependencies.
+
+> Having trouble installing `arrow`? See the "Troubleshooting" section below.
+
+# Installation basics
+
+Install the latest release of `arrow` from CRAN with
+
+```r
+install.packages("arrow")
+```
+
+Daily development builds, which are not official releases,
+can be installed from the Ursa Labs repository:
+
+```r
+install.packages("arrow", repos = "https://arrow-r-nightly.s3.amazonaws.com")
+```
+
+or for conda users via:
+
+```
+conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
+```
+
+You can also install the R package from a git checkout:
+
+```shell
+git clone https://github.com/apache/arrow
+cd arrow/r
+R CMD INSTALL .
+```
+
+If you don't already have the Arrow C++ libraries on your system,
+when installing the R package from source, it will also download and build
+the Arrow C++ libraries for you. To speed installation up, you can set
+
+```shell
+export LIBARROW_BINARY=true
+```
+
+to look for C++ binaries prebuilt for your Linux distribution/version.
+Alternatively, you can set
+
+```shell
+export LIBARROW_MINIMAL=false
+```
+
+to build the Arrow libraries from source with optional features such as compression libraries
+enabled. This will increase the build time but provides many useful features.
+Prebuilt binaries are built with this flag enabled, so you get the full
+functionality by using them as well.
+
+Both of these variables are also set this way if you have the `NOT_CRAN=true`
+environment variable set.
+
+## Helper function: install_arrow()
+
+If you already have `arrow` installed and want to upgrade to a different version,
+install a development build, or try to reinstall and fix issues with Linux
+C++ binaries, you can call `install_arrow()`.
+`install_arrow()` provides some convenience wrappers around the various
+environment variables described below.
+This function is part of the `arrow` package,
+and it is also available as a standalone script, so you can
+access it for convenience without first installing the package:
+
+```r
+source("https://raw.githubusercontent.com/apache/arrow/master/r/R/install-arrow.R")
+```
+
+`install_arrow()` will install from CRAN,
+while `install_arrow(nightly = TRUE)` will give you a development build.
+`install_arrow()` does not require environment variables to be set in order to
+satisfy C++ dependencies.
+
+> Note that, unlike packages like `tensorflow`, `blogdown`, and others that require external dependencies, you do not need to run `install_arrow()` after a successful `arrow` installation.
+
+## Offline installation
+
+The `install-arrow.R` file also includes the `create_package_with_all_dependencies()`
+function. Normally, when installing on a computer with internet access, the
+build process will download third-party dependencies as needed.
+This function provides a way to download them in advance.
+Doing so may be useful when installing Arrow on a computer without internet access.
+Note that Arrow _can_ be installed on a computer without internet access without doing this, but
+many useful features will be disabled, as they depend on third-party components.
+More precisely, `arrow::arrow_info()$capabilities` will be `FALSE` for every
+capability.
+One approach to add more capabilities in an offline install is to prepare a
+package with pre-downloaded dependencies. The
+`create_package_with_all_dependencies()` function does this preparation.
+
+If you're using binary packages you shouldn't need to follow these steps. You
+should download the appropriate binary from your package repository, transfer
+that to the offline computer, and install that. Any OS can create the source
+bundle, but it cannot be installed on Windows. (Instead, use a standard
+Windows binary package.)
+
+Note if you're using RStudio Package Manager on Linux: If you still want to
+make a source bundle with this function, make sure to set the first repo in
+`options("repos")` to be a mirror that contains source packages (that is:
+something other than the RSPM binary mirror URLs).
+
+### Using a computer with internet access, pre-download the dependencies:
+* Install the `arrow` package _or_ run
+ `source("https://raw.githubusercontent.com/apache/arrow/master/r/R/install-arrow.R")`
+* Run `create_package_with_all_dependencies("my_arrow_pkg.tar.gz")`
+* Copy the newly created `my_arrow_pkg.tar.gz` to the computer without internet access
+
+### On the computer without internet access, install the prepared package:
+* Install the `arrow` package from the copied file
+ * `install.packages("my_arrow_pkg.tar.gz", dependencies = c("Depends", "Imports", "LinkingTo"))`
+ * This installation will build from source, so `cmake` must be available
+* Run `arrow_info()` to check installed capabilities
+
+#### Alternative, hands-on approach
+* Download the dependency files (`cpp/thirdparty/download_dependencies.sh` may be helpful)
+* Copy the directory of dependencies to the offline computer
+* Create the environment variable `ARROW_THIRDPARTY_DEPENDENCY_DIR` on the offline computer, pointing to the copied directory.
+* Install the `arrow` package as usual.
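+
+The hands-on steps above might look like the following sketch. The directory `~/arrow-thirdparty` and the tarball name `arrow_src.tar.gz` are placeholders, and passing the destination directory as the first argument to `download_dependencies.sh` is an assumption to check against the script itself.
+
+```shell
+# On a computer with internet access, from the root of an apache/arrow checkout
+# (the destination-directory argument is assumed; check the script's usage):
+cpp/thirdparty/download_dependencies.sh "$HOME/arrow-thirdparty"
+
+# After copying the directory to the offline computer, run there:
+export ARROW_THIRDPARTY_DEPENDENCY_DIR="$HOME/arrow-thirdparty"
+R CMD INSTALL arrow_src.tar.gz  # placeholder name for the arrow source package
+```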
+
+## S3 support
+
+The `arrow` package allows you to work with data in AWS S3 or in other cloud
+storage systems that emulate S3. However, support for working with S3 is not
+enabled in the default build, and it has additional system requirements. To
+enable it, set the environment variable `LIBARROW_MINIMAL=false` or
+`NOT_CRAN=true` to choose the full-featured build, or more selectively set
+`ARROW_S3=ON`. You also need the following system dependencies:
+
+* `gcc` >= 4.9 or `clang` >= 3.3; note that the default compiler on CentOS 7 is gcc 4.8.5, which is not sufficient
+* CURL: install `libcurl-devel` (rpm) or `libcurl4-openssl-dev` (deb)
+* OpenSSL >= 1.0.2: install `openssl-devel` (rpm) or `libssl-dev` (deb)
+
+The prebuilt C++ binaries come with S3 support enabled, so you will need to meet
+these system requirements in order to use them--the package will not install
+without them. If you're building everything from source, the install script
+will check for the presence of these dependencies and turn off S3 support in the
+build if the prerequisites are not met--installation will succeed but without
+S3 functionality. If afterwards you install the missing system requirements,
+you'll need to reinstall the package in order to enable S3 support.
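+
+As a rough sketch of a source install with S3 enabled on a Debian-like system, using the package names and variables listed above. The use of `sudo apt-get` and the CRAN mirror URL are assumptions about your setup; adjust them as needed.
+
+```shell
+# System requirements for S3 support (deb package names from the list above)
+sudo apt-get install -y libcurl4-openssl-dev libssl-dev
+
+# Option 1: choose the full-featured build, which includes S3
+export LIBARROW_MINIMAL=false
+# Option 2: enable only S3 on top of the default build
+# export ARROW_S3=ON
+
+Rscript -e 'install.packages("arrow", repos = "https://cloud.r-project.org")'
+```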
+
+# How dependencies are resolved
+
+In order for the `arrow` R package to work, it needs the Arrow C++ library.
+There are a number of ways you can get it: a system package; a library you've
+built yourself outside of the context of installing the R package;
+or, if you don't already have it, the R package will attempt to resolve it
+automatically when it installs.
+
+If you are authorized to install system packages and you're installing a CRAN release,
+you may want to use the official Apache Arrow release packages corresponding to the R package version (though there are some drawbacks: see "Troubleshooting" below).
+See the [Arrow project installation page](https://arrow.apache.org/install/)
+to find pre-compiled binary packages for some common Linux distributions,
+including Debian, Ubuntu, and CentOS.
+You'll need to install `libparquet-dev` on Debian and Ubuntu, or `parquet-devel` on CentOS.
+This will also automatically install the Arrow C++ library as a dependency.
+
+When you install the `arrow` R package on Linux,
+it will first attempt to find the Arrow C++ libraries on your system using
+the `pkg-config` command.
+This will find either installed system packages or libraries you've built yourself.
+In order for `install.packages("arrow")` to work with these system packages,
+you'll need to install them before installing the R package.
+
+If no Arrow C++ libraries are found on the system,
+the R package installation script will next attempt to download
+prebuilt static Arrow C++ libraries
+that match both your local operating system and `arrow` R package version.
+C++ binaries will only be retrieved if you have set the environment variable
+`LIBARROW_BINARY` or `NOT_CRAN`.
+If found, they will be downloaded and bundled when your R package compiles.
+For a list of supported distributions and versions,
+see the [arrow-r-nightly](https://github.com/ursa-labs/arrow-r-nightly/blob/master/README.md) project.
+
+If no C++ library binary is found, it will attempt to build it locally.
+First, it will also look to see if you are in
+a checkout of the `apache/arrow` git repository and thus have the C++ source there.
+Otherwise, it builds from the C++ files included in the package.
+Depending on your system, building Arrow C++ from source may be slow.
+
+For the specific mechanics of how all this works, see the R package `configure` script,
+which calls `tools/nixlibs.R`.
+
+If the C++ library is built from source, `inst/build_arrow_static.sh` is executed.
+This build script is also what is used to generate the prebuilt binaries.
+
+## How the package is installed - advanced
+
+This subsection contains information that is likely to be relevant mostly
+to Arrow developers and is not necessary for Arrow users to install Arrow.
+
+There are a number of scripts that are triggered when `R CMD INSTALL .` is run.
+For Arrow users, these should all just work without configuration and pull in
+the most complete pieces (e.g. official binaries that we host).
+
+An overview of these scripts is shown below:
+
+* `configure` and `configure.win` - these scripts are triggered during
+`R CMD INSTALL .` on non-Windows and Windows platforms, respectively. They
+handle finding the Arrow library, setting up the build variables necessary, and
+writing the package Makevars file that is used to compile the C++ code in the R
+package.
+
+* `tools/nixlibs.R` - this script is sometimes called by `configure` on Linux
+(or on any non-Windows OS with the environment variable
+`FORCE_BUNDLED_BUILD=true`). This sets up the build process for our bundled
+builds (which is the default on Linux). The operative logic is at the end of
+the script, but it will do the following (it stops at the first step
+that succeeds, and some of the steps are only checked if they are enabled via an
+environment variable):
+ * Check if there is an already built libarrow in `arrow/r/libarrow-{version}`,
+ and link against it if it exists.
+ * Check if a binary is available from our hosted unofficial builds.
+ * Download the Arrow source and build the Arrow library from source.
+ * `*** Proceed without C++ dependencies` (this is an error and the package
+ will not work, but if you see this message you know the previous steps have
+ not succeeded/were not enabled)
+
+* `inst/build_arrow_static.sh` - called by `tools/nixlibs.R` when the Arrow
+library is being built. It builds Arrow for a bundled, static build, and
+mirrors the steps described in the ["Arrow R Developer Guide" vignette](./developing.html).
+
+# Troubleshooting
+
+The intent is that `install.packages("arrow")` will just work and handle all C++
+dependencies, but depending on your system, you may have better results if you
+tune one of several parameters. Here are some known complications and ways to address them.
+
+## Package failed to build C++ dependencies
+
+If you see a message like
+
+```
+------------------------- NOTE ---------------------------
+There was an issue preparing the Arrow C++ libraries.
+See https://arrow.apache.org/docs/r/articles/install.html
+---------------------------------------------------------
+```
+
+in the output when the package fails to install,
+that means that installation failed to retrieve or build C++ libraries
+compatible with the current version of the R package.
+
+It is expected that C++ dependencies should be built successfully
+on all Linux distributions, so you should not see this message. If you do,
+please check the "Known installation issues" below to see if any apply.
+If none apply, set the environment variable `ARROW_R_DEV=TRUE`
+so that details on what failed are shown, and try installing again. Then,
+please [report an issue](https://issues.apache.org/jira/projects/ARROW/issues)
+and include the full verbose installation output.
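+
+One way to capture the verbose output for an issue report is sketched below. The log file name `arrow-install.log` and the CRAN mirror URL are illustrative choices, not something the package requires.
+
+```shell
+# Show details about what failed while preparing the C++ libraries
+export ARROW_R_DEV=TRUE
+
+# Retry the installation and keep a copy of everything that was printed
+Rscript -e 'install.packages("arrow", repos = "https://cloud.r-project.org")' 2>&1 | tee arrow-install.log
+```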
+
+## Using system libraries
+
+If a system library or other installed Arrow is found but it doesn't match the R package version
+(for example, you have libarrow 1.0.0 on your system and are installing R package 2.0.0),
+it is likely that the R bindings will fail to compile.
+Because the Apache Arrow project is under active development,
+it is essential that versions of the C++ and R libraries match.
+When `install.packages("arrow")` has to download the C++ libraries,
+the install script ensures that you fetch the C++ libraries that correspond to your R package version.
+However, if you are using Arrow libraries already on your system, version match isn't guaranteed.
+
+To fix version mismatch, you can either update your system packages to match the R package version,
+or set the environment variable `ARROW_USE_PKG_CONFIG=FALSE`
+to tell the configure script not to look for system Arrow packages.
+(The latter is the default of `install_arrow()`.)
+System packages are available corresponding to all CRAN releases
+but not for nightly or dev versions, so depending on the R package version you're installing,
+system packages may not be an option.
+
+Note also that once you have a working R package installation based on system (shared) libraries,
+if you update your system Arrow, you'll need to reinstall the R package to match its version.
+Similarly, if you're using Arrow system libraries, running `update.packages()`
+after a new release of the `arrow` package will likely fail unless you first
+update the system packages.
+
+## Using prebuilt binaries
+
+If the R package finds and downloads a prebuilt binary of the C++ library,
+but then the `arrow` package can't be loaded, perhaps with "undefined symbols" errors,
+please [report an issue](https://issues.apache.org/jira/projects/ARROW/issues).
+This is likely a compiler mismatch and may be resolvable by setting some
+environment variables to instruct R to compile the packages to match the C++ library.
+
+A workaround would be to set the environment variable `LIBARROW_BINARY=FALSE`
+and retry installation: this value instructs the package to build the C++ library from source
+instead of downloading the prebuilt binary.
+That should guarantee that the compiler settings match.
+
+If a prebuilt binary wasn't found for your operating system but you think it should have been,
+check the logs for a message that says `*** Unable to identify current OS/version`,
+or a message that says `*** No C++ binaries found for` an invalid OS.
+If you see either, please [report an issue](https://issues.apache.org/jira/projects/ARROW/issues).
+You may also set the environment variable `ARROW_R_DEV=TRUE` for additional
+debug messages.
+
+A workaround would be to set the environment variable `LIBARROW_BINARY`
+to a `distribution-version` that exists in the Ursa Labs repository.
+Setting `LIBARROW_BINARY` is also an option when there's not an exact match
+for your OS but a similar version would work,
+such as if you're on `ubuntu-18.10` and there's only a binary for `ubuntu-18.04`.
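+
+For the `ubuntu-18.10` example above, the workaround might look like this sketch (the CRAN mirror URL is an assumption; any mirror you normally use works):
+
+```shell
+# Ask for the binary built for a similar distribution-version
+export LIBARROW_BINARY=ubuntu-18.04
+
+Rscript -e 'install.packages("arrow", repos = "https://cloud.r-project.org")'
+```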
+
+If that workaround works for you, and you believe that it should work for everyone else too,
+you may propose [adding an entry to this lookup table](https://github.com/ursa-labs/arrow-r-nightly/edit/master/linux/distro-map.csv).
+This table is checked during the installation process
+and tells the script to use binaries built on a different operating system/version
+because they're known to work.
+
+## Building C++ from source
+
+If building the C++ library from source fails, check the error message.
+(If you don't see an error message, only the `----- NOTE -----`,
+set the environment variable `ARROW_R_DEV=TRUE` to increase verbosity and retry installation.)
+The install script should work everywhere, so if the C++ library fails to compile,
+please [report an issue](https://issues.apache.org/jira/projects/ARROW/issues)
+so that we can improve the script.
+
+## Known installation issues
+
+* On CentOS, if you are using a more modern `devtoolset`, you may need to set
+the environment variables `CC` and `CXX` either in the shell or in R's `Makeconf`.
+For CentOS 7 and above, both the Arrow system packages and the C++ binaries
+for R are built with the default system compilers. If you want to use either of these
+and you have a `devtoolset` installed, set `CC=/usr/bin/gcc CXX=/usr/bin/g++`
+to use the system compilers instead of the `devtoolset`.
+Alternatively, if you want to build `arrow` with the newer `devtoolset` compilers,
+set both `ARROW_USE_PKG_CONFIG` and `LIBARROW_BINARY` to `false` so that
+you build the Arrow C++ from source using those compilers.
+Compiler mismatch between the arrow system libraries and the R
+package may cause R to segfault when `arrow` package functions are used.
+See discussions [here](https://issues.apache.org/jira/browse/ARROW-8586)
+and [here](https://issues.apache.org/jira/browse/ARROW-10780).
+
+* If you have multiple versions of `zstd` installed on your system,
+installation by building the C++ from source may fail with an undefined symbols
+error. Workarounds include (1) setting `LIBARROW_BINARY` to use a C++ binary; (2)
+setting `ARROW_WITH_ZSTD=OFF` to build without `zstd`; or (3) uninstalling
+the conflicting `zstd`.
+See discussion [here](https://issues.apache.org/jira/browse/ARROW-8556).
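+
+For the CentOS `devtoolset` case above, the two alternatives can be set in the shell before installing. This is only a sketch of the settings named in that item; pick one of the two options.
+
+```shell
+# Option 1: use the system compilers so they match the Arrow system packages
+# or the prebuilt C++ binaries
+export CC=/usr/bin/gcc CXX=/usr/bin/g++
+
+# Option 2: keep the devtoolset compilers and build Arrow C++ from source
+# export ARROW_USE_PKG_CONFIG=false LIBARROW_BINARY=false
+```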
+
+## Summary of build environment variables
+
+Some features are optional when you build Arrow from source. With the exception of `ARROW_S3`, these are all `ON` by default in the bundled C++ build, but you can set them to `OFF` to disable them.
+
+* `ARROW_S3`: If set to `ON` S3 support will be built as long as the
+ dependencies are met; if they are not met, the build script will turn this `OFF`
+* `ARROW_JEMALLOC` for the `jemalloc` memory allocator
+* `ARROW_MIMALLOC` for the `mimalloc` memory allocator
+* `ARROW_PARQUET`
+* `ARROW_DATASET`
+* `ARROW_JSON` for the JSON parsing library
+* `ARROW_WITH_RE2` for the RE2 regular expression library, used in some string compute functions
+* `ARROW_WITH_UTF8PROC` for the UTF8Proc string library, used in many other string compute functions
+* `ARROW_WITH_BROTLI`, `ARROW_WITH_BZ2`, `ARROW_WITH_LZ4`, `ARROW_WITH_SNAPPY`, `ARROW_WITH_ZLIB`, and `ARROW_WITH_ZSTD` for various compression algorithms
+
+
+There are a number of other variables that affect the `configure` script and the bundled build script.
+By default, these are all unset. All boolean variables are case-insensitive.
+
+* `ARROW_USE_PKG_CONFIG`: If set to `false`, the configure script
+ won't look for Arrow libraries on your system and instead will look to download/build them.
+ Use this if you have a version mismatch between installed system libraries
+ and the version of the R package you're installing.
+* `LIBARROW_BINARY`: If set to `true`, the script will try to download a binary
+ C++ library built for your operating system.
+ You may also set it to some other string,
+ a related "distro-version" that has binaries built that work for your OS.
+ If no binary is found, installation will fall back to building C++
+ dependencies from source.
+* `LIBARROW_BUILD`: If set to `false`, the build script
+ will not attempt to build the C++ from source. This means you will only get
+ a working `arrow` R package if a prebuilt binary is found.
+ Use this if you want to avoid compiling the C++ library, which may be slow
+ and resource-intensive, and ensure that you only use a prebuilt binary.
+* `LIBARROW_MINIMAL`: If set to `false`, the build script
+ will enable some optional features, including compression libraries, S3
+ support, and additional alternative memory allocators. This will increase the
+ source build time but results in a more fully functional library.
+* `NOT_CRAN`: If this variable is set to `true`, as the `devtools` package does,
+ the build script will set `LIBARROW_BINARY=true` and `LIBARROW_MINIMAL=false`
+ unless those environment variables are already set. This provides for a more
+ complete and fast installation experience for users who already have
+ `NOT_CRAN=true` as part of their workflow, without requiring additional
+ environment variables to be set.
+* `ARROW_R_DEV`: If set to `true`, more verbose messaging will be printed
+ in the build script. `arrow::install_arrow(verbose = TRUE)` sets this.
+ This variable also is needed if you're modifying C++
+ code in the package: see the developer guide vignette.
+* `LIBARROW_DEBUG_DIR`: If building the C++ library from source fails (`cmake`),
+ there may be messages telling you to check some log file in the build directory.
+ However, when the library is built during R package installation,
+ that location is in a temp directory that is already deleted.
+ To capture those logs, set this variable to an absolute (not relative) path
+ and the log files will be copied there.
+ The directory will be created if it does not exist.
+* `CMAKE`: When building the C++ library from source, you can specify a
+ `/path/to/cmake` to use a different version than whatever is found on the `$PATH`
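+
+As an illustration, several of these variables can be combined for a source build that is easier to debug. The log directory and the `cmake` path below are hypothetical; substitute paths that exist on your system.
+
+```shell
+# Skip system Arrow libraries and build the C++ library from source
+export ARROW_USE_PKG_CONFIG=false
+export LIBARROW_MINIMAL=false
+
+# Keep cmake logs after the temporary build directory is deleted (absolute path)
+export LIBARROW_DEBUG_DIR="$HOME/arrow-build-logs"
+
+# Use a specific cmake instead of the one on $PATH (hypothetical location)
+export CMAKE=/opt/cmake/bin/cmake
+```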
+
+# Contributing
+
+As mentioned above, please [report an issue](https://issues.apache.org/jira/projects/ARROW/issues)
+if you encounter ways to improve this. If you find that your Linux distribution
+or version is not supported, we welcome the contribution of Docker images
+(hosted on Docker Hub) that we can use in our continuous integration. These
+Docker images should be minimal, containing only R and the dependencies it
+requires. (For reference, see the images that
+[R-hub](https://github.com/r-hub/rhub-linux-builders) uses.)
+
+You can test the `arrow` R package installation using the `docker-compose`
+setup included in the `apache/arrow` git repository. For example,
+
+```
+R_ORG=rhub R_IMAGE=ubuntu-gcc-release R_TAG=latest docker-compose build r
+R_ORG=rhub R_IMAGE=ubuntu-gcc-release R_TAG=latest docker-compose run r
+```
+
+installs the `arrow` R package, including the C++ source build, on the
+[rhub/ubuntu-gcc-release](https://hub.docker.com/r/rhub/ubuntu-gcc-release)
+image.