author | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-04-21 11:54:28 +0000 |
---|---|---|
committer | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-04-21 11:54:28 +0000 |
commit | e6918187568dbd01842d8d1d2c808ce16a894239 (patch) | |
tree | 64f88b554b444a49f656b6c656111a145cbbaa28 /src/arrow/r/vignettes | |
parent | Initial commit. (diff) | |
Adding upstream version 18.2.2. upstream/18.2.2
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'src/arrow/r/vignettes')
-rw-r--r-- | src/arrow/r/vignettes/arrow.Rmd | 225 | ||||
-rw-r--r-- | src/arrow/r/vignettes/dataset.Rmd | 421 | ||||
-rw-r--r-- | src/arrow/r/vignettes/developing.Rmd | 605 | ||||
-rw-r--r-- | src/arrow/r/vignettes/flight.Rmd | 87 | ||||
-rw-r--r-- | src/arrow/r/vignettes/fs.Rmd | 130 | ||||
-rw-r--r-- | src/arrow/r/vignettes/install.Rmd | 448 | ||||
-rw-r--r-- | src/arrow/r/vignettes/python.Rmd | 131 |
7 files changed, 2047 insertions, 0 deletions
diff --git a/src/arrow/r/vignettes/arrow.Rmd b/src/arrow/r/vignettes/arrow.Rmd new file mode 100644 index 000000000..ff6bf7ce0 --- /dev/null +++ b/src/arrow/r/vignettes/arrow.Rmd @@ -0,0 +1,225 @@ +--- +title: "Using the Arrow C++ Library in R" +description: "This document describes the low-level interface to the Apache Arrow C++ library in R and reviews the patterns and conventions of the R package." +output: rmarkdown::html_vignette +vignette: > + %\VignetteIndexEntry{Using the Arrow C++ Library in R} + %\VignetteEngine{knitr::rmarkdown} + %\VignetteEncoding{UTF-8} +--- + +The Apache Arrow C++ library provides rich, powerful features for working with columnar data. The `arrow` R package provides both a low-level interface to the C++ library and some higher-level, R-flavored tools for working with it. This vignette provides an overview of how the pieces fit together, and it describes the conventions that the classes and methods follow in R. + +# Features + +## Multi-file datasets + +The `arrow` package lets you work efficiently with large, multi-file datasets +using `dplyr` methods. See `vignette("dataset", package = "arrow")` for an overview. + +## Reading and writing files + +`arrow` provides some simple functions for using the Arrow C++ library to read and write files. +These functions are designed to drop into your normal R workflow +without requiring any knowledge of the Arrow C++ library +and use naming conventions and arguments that follow popular R packages, particularly `readr`. +The readers return `data.frame`s +(or if you use the `tibble` package, they will act like `tbl_df`s), +and the writers take `data.frame`s. + +Importantly, `arrow` provides basic read and write support for the [Apache +Parquet](https://parquet.apache.org/) columnar data file format. 
+ +```r +library(arrow) +df <- read_parquet("path/to/file.parquet") +``` + +Just as you can read, you can write Parquet files: + +```r +write_parquet(df, "path/to/different_file.parquet") +``` + +The `arrow` package also includes a faster and more robust implementation of the +[Feather](https://github.com/wesm/feather) file format, providing `read_feather()` and +`write_feather()`. This implementation depends +on the same underlying C++ library as the Python version does, +resulting in more reliable and consistent behavior across the two languages, as +well as [improved performance](https://wesmckinney.com/blog/feather-arrow-future/). +`arrow` also by default writes the Feather V2 format, +which supports a wider range of data types, as well as compression. + +For CSV and line-delimited JSON, there are `read_csv_arrow()` and `read_json_arrow()`, respectively. +While `read_csv_arrow()` currently has fewer parsing options for dealing with +every CSV format variation in the wild, for the files it can read, it is +often significantly faster than other R CSV readers, such as +`base::read.csv`, `readr::read_csv`, and `data.table::fread`. + +## Working with Arrow data in Python + +Using [`reticulate`](https://rstudio.github.io/reticulate/), `arrow` lets you +share data between R and Python (`pyarrow`) efficiently, enabling you to take +advantage of the vibrant ecosystem of Python packages that build on top of +Apache Arrow. See `vignette("python", package = "arrow")` for details. + +## Access to Arrow messages, buffers, and streams + +The `arrow` package also provides many lower-level bindings to the C++ library, which enable you +to access and manipulate Arrow objects. You can use these to build connectors +to other applications and services that use Arrow. 
One example is Spark: the +[`sparklyr`](https://spark.rstudio.com/) package has support for using Arrow to +move data to and from Spark, yielding [significant performance +gains](http://arrow.apache.org/blog/2019/01/25/r-spark-improvements/). + +# Object hierarchy + +## Metadata objects + +Arrow defines the following classes for representing metadata: + +| Class | Description | How to create an instance | +| ---------- | -------------------------------------------------- | -------------------------------- | +| `DataType` | attribute controlling how values are represented | functions in `help("data-type")` | +| `Field` | a character string name and a `DataType` | `field(name, type)` | +| `Schema` | list of `Field`s | `schema(...)` | + +## Data objects + +Arrow defines the following classes for representing zero-dimensional (scalar), +one-dimensional (array/vector-like), and two-dimensional (tabular/data +frame-like) data: + +| Dim | Class | Description | How to create an instance | +| --- | -------------- | ----------------------------------------- | ------------------------------------------------------------------------------------------------------| +| 0 | `Scalar` | single value and its `DataType` | `Scalar$create(value, type)` | +| 1 | `Array` | vector of values and its `DataType` | `Array$create(vector, type)` | +| 1 | `ChunkedArray` | vectors of values and their `DataType` | `ChunkedArray$create(..., type)` or alias `chunked_array(..., type)` | +| 2 | `RecordBatch` | list of `Array`s with a `Schema` | `RecordBatch$create(...)` or alias `record_batch(...)` | +| 2 | `Table` | list of `ChunkedArray` with a `Schema` | `Table$create(...)`, alias `arrow_table(...)`, or `arrow::read_*(file, as_data_frame = FALSE)` | +| 2 | `Dataset` | list of `Table`s with the same `Schema` | `Dataset$create(sources, schema)` or alias `open_dataset(sources, schema)` | + +Each of these is defined as an `R6` class in the `arrow` R package and +corresponds to a class of the same name 
in the Arrow C++ library. The `arrow` +package provides a variety of `R6` and S3 methods for interacting with instances +of these classes. + +For convenience, the `arrow` package also defines several synthetic classes that +do not exist in the C++ library, including: + +* `ArrowDatum`: inherited by `Scalar`, `Array`, and `ChunkedArray` +* `ArrowTabular`: inherited by `RecordBatch` and `Table` +* `ArrowObject`: inherited by all Arrow objects + +# Internals + +## Mapping of R <--> Arrow types + +Arrow has a rich data type system that includes direct parallels with R's data types and much more. + +In the tables, entries with a `-` are not currently implemented. + +### R to Arrow + +| R type | Arrow type | +|--------------------------|------------| +| logical | boolean | +| integer | int32 | +| double ("numeric") | float64^1^ | +| character | utf8^2^ | +| factor | dictionary | +| raw | uint8 | +| Date | date32 | +| POSIXct | timestamp | +| POSIXlt | struct | +| data.frame | struct | +| list^3^ | list | +| bit64::integer64 | int64 | +| difftime | time32 | +| vctrs::vctrs_unspecified | null | + + + +^1^: `float64` and `double` are the same concept and data type in Arrow C++; +however, only `float64()` is used in arrow as the function `double()` already +exists in base R + +^2^: If the character vector exceeds 2GB of strings, it will be converted to a +`large_utf8` Arrow type + +^3^: Only lists where all elements are the same type are able to be translated +to Arrow list type (which is a "list of" some type). 
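As an illustration of the R to Arrow mapping in the table above, here is a minimal sketch that builds an `Array` from a few R vectors and prints the type each one receives. It assumes the `arrow` package is installed; the `r_values` and `arrow_types` names are hypothetical and exist only for this example.

```r
library(arrow)

# Hypothetical sample values, one per R type of interest.
r_values <- list(
  logical   = c(TRUE, FALSE),
  integer   = 1:3,
  double    = c(1.5, 2.5),
  character = c("a", "b"),
  factor    = factor(c("a", "b")),
  Date      = as.Date("2024-04-21")
)

# Build an Arrow Array from each R vector and record the Arrow type,
# rendered as a string, that the conversion chose.
arrow_types <- vapply(
  r_values,
  function(x) Array$create(x)$type$ToString(),
  character(1)
)
print(arrow_types)
```

The printed names come from the C++ library, so some differ from the R constructor names used in the table (see footnote 1 on `float64` and `double`).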
+ + +### Arrow to R + +| Arrow type | R type | +|-------------------|------------------------------| +| boolean | logical | +| int8 | integer | +| int16 | integer | +| int32 | integer | +| int64 | integer^1^ | +| uint8 | integer | +| uint16 | integer | +| uint32 | integer^1^ | +| uint64 | integer^1^ | +| float16 | -^2^ | +| float32 | double | +| float64 | double | +| utf8 | character | +| large_utf8 | character | +| binary | arrow_binary ^3^ | +| large_binary | arrow_large_binary ^3^ | +| fixed_size_binary | arrow_fixed_size_binary ^3^ | +| date32 | Date | +| date64 | POSIXct | +| time32 | hms::difftime | +| time64 | hms::difftime | +| timestamp | POSIXct | +| duration | -^2^ | +| decimal | double | +| dictionary | factor^4^ | +| list | arrow_list ^5^ | +| large_list | arrow_large_list ^5^ | +| fixed_size_list | arrow_fixed_size_list ^5^ | +| struct | data.frame | +| null | vctrs::vctrs_unspecified | +| map | -^2^ | +| union | -^2^ | + +^1^: These integer types may contain values that exceed the range of R's +`integer` type (32-bit signed integer). When they do, `uint32` and `uint64` are +converted to `double` ("numeric") and `int64` is converted to +`bit64::integer64`. This conversion can be disabled (so that `int64` always +yields a `bit64::integer64` vector) by setting `options(arrow.int64_downcast = FALSE)`. + +^2^: Some Arrow data types do not currently have an R equivalent and will raise an error +if cast to or mapped to via a schema. + +^3^: `arrow*_binary` classes are implemented as lists of raw vectors. + +^4^: Due to the limitation of R factors, Arrow `dictionary` values are coerced +to string when translated to R if they are not already strings. + +^5^: `arrow*_list` classes are implemented as subclasses of `vctrs_list_of` +with a `ptype` attribute set to what an empty Array of the value type converts to. + + +### R object attributes + +Arrow supports custom key-value metadata attached to Schemas. 
When we convert a `data.frame` to an Arrow Table or RecordBatch, the package stores any `attributes()` attached to the columns of the `data.frame` in the Arrow object's Schema. These attributes are stored under the "r" key; you can assign additional string metadata under any other key you wish, like `x$metadata$new_key <- "new value"`. + +This metadata is preserved when writing the table to Feather or Parquet, and when reading those files into R, or when calling `as.data.frame()` on a Table/RecordBatch, the column attributes are restored to the columns of the resulting `data.frame`. This means that custom data types, including `haven::labelled`, `vctrs` annotations, and others, are preserved when doing a round-trip through Arrow. + +Note that the `attributes()` stored in `$metadata$r` are only understood by R. If you write a `data.frame` with `haven` columns to a Feather file and read that in Pandas, the `haven` metadata won't be recognized there. (Similarly, Pandas writes its own custom metadata, which the R package does not consume.) You are free, however, to define custom metadata conventions for your application and assign any (string) values you want to other metadata keys. For more details, see the documentation for `schema()`. + +## Class structure and package conventions + +C++ is an object-oriented language, so the core logic of the Arrow library is encapsulated in classes and methods. In the R package, these classes are implemented as `R6` reference classes, most of which are exported from the namespace. + +In order to match the C++ naming conventions, the `R6` classes are in TitleCase, e.g. `RecordBatch`. This makes it easy to look up the relevant C++ implementations in the [code](https://github.com/apache/arrow/tree/master/cpp) or [documentation](https://arrow.apache.org/docs/cpp/). 
To simplify things in R, the C++ library namespaces are generally dropped or flattened; that is, where the C++ library has `arrow::io::FileOutputStream`, it is just `FileOutputStream` in the R package. One exception is for the file readers, where the namespace is necessary to disambiguate. So `arrow::csv::TableReader` becomes `CsvTableReader`, and `arrow::json::TableReader` becomes `JsonTableReader`. + +Some of these classes are not meant to be instantiated directly; they may be base classes or other kinds of helpers. For those that you should be able to create, use the `$create()` method to instantiate an object. For example, `rb <- RecordBatch$create(int = 1:10, dbl = as.numeric(1:10))` will create a `RecordBatch`. Many of these factory methods that an R user might most often encounter also have a `snake_case` alias, in order to be more familiar for contemporary R users. So `record_batch(int = 1:10, dbl = as.numeric(1:10))` would do the same as `RecordBatch$create()` above. + +The typical user of the `arrow` R package may never deal directly with the `R6` objects. We provide more R-friendly wrapper functions as a higher-level interface to the C++ library. An R user can call `read_parquet()` without knowing or caring that they're instantiating a `ParquetFileReader` object and calling the `$ReadFile()` method on it. The classes are there and available to the advanced programmer who wants fine-grained control over how the C++ library is used. diff --git a/src/arrow/r/vignettes/dataset.Rmd b/src/arrow/r/vignettes/dataset.Rmd new file mode 100644 index 000000000..3f33cbae4 --- /dev/null +++ b/src/arrow/r/vignettes/dataset.Rmd @@ -0,0 +1,421 @@ +--- +title: "Working with Arrow Datasets and dplyr" +output: rmarkdown::html_vignette +vignette: > + %\VignetteIndexEntry{Working with Arrow Datasets and dplyr} + %\VignetteEngine{knitr::rmarkdown} + %\VignetteEncoding{UTF-8} +--- + +Apache Arrow lets you work efficiently with large, multi-file datasets. 
+The arrow R package provides a [dplyr](https://dplyr.tidyverse.org/) interface to Arrow Datasets, +and other tools for interactive exploration of Arrow data. + +This vignette introduces Datasets and shows how to use dplyr to analyze them. + +## Example: NYC taxi data + +The [New York City taxi trip record data](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page) +is widely used in big data exercises and competitions. +For demonstration purposes, we have hosted a Parquet-formatted version +of about ten years of the trip data in a public Amazon S3 bucket. + +The total file size is around 37 gigabytes, even in the efficient Parquet file +format. That's bigger than memory on most people's computers, so you can't just +read it all in and stack it into a single data frame. + +In Windows (for R > 3.6) and macOS binary packages, S3 support is included. +On Linux, when installing from source, S3 support is not enabled by default, +and it has additional system requirements. +See `vignette("install", package = "arrow")` for details. +To see if your arrow installation has S3 support, run: + +```{r} +arrow::arrow_with_s3() +``` + +Even with S3 support enabled, network speed will be a bottleneck unless your +machine is located in the same AWS region as the data. So, for this vignette, +we assume that the NYC taxi dataset has been downloaded locally in an "nyc-taxi" +directory. 
+ +### Retrieving data from a public Amazon S3 bucket + +If your arrow build has S3 support, you can sync the data locally with: + +```{r, eval = FALSE} +arrow::copy_files("s3://ursa-labs-taxi-data", "nyc-taxi") +``` + +If your arrow build doesn't have S3 support, you can download the files +with some additional code: + +```{r, eval = FALSE} +bucket <- "https://ursa-labs-taxi-data.s3.us-east-2.amazonaws.com" +for (year in 2009:2019) { + if (year == 2019) { + # We only have through June 2019 there + months <- 1:6 + } else { + months <- 1:12 + } + for (month in sprintf("%02d", months)) { + dir.create(file.path("nyc-taxi", year, month), recursive = TRUE) + try(download.file( + paste(bucket, year, month, "data.parquet", sep = "/"), + file.path("nyc-taxi", year, month, "data.parquet"), + mode = "wb" + ), silent = TRUE) + } +} +``` + +Note that these download steps in the vignette are not executed: if you want to run +with live data, you'll have to do it yourself separately. +Given the size, if you're running this locally and don't have a fast connection, +feel free to grab only a year or two of data. + +If you don't have the taxi data downloaded, the vignette will still run and will +yield previously cached output for reference. To be explicit about which version +is running, let's check whether you're running with live data: + +```{r} +dir.exists("nyc-taxi") +``` + +## Opening the dataset + +Because dplyr is not necessary for many Arrow workflows, +it is an optional (`Suggests`) dependency. So, to work with Datasets, +you need to load both arrow and dplyr. + +```{r} +library(arrow, warn.conflicts = FALSE) +library(dplyr, warn.conflicts = FALSE) +``` + +The first step is to create a Dataset object, pointing at the directory of data. + +```{r, eval = file.exists("nyc-taxi")} +ds <- open_dataset("nyc-taxi", partitioning = c("year", "month")) +``` + +The file format for `open_dataset()` is controlled by the `format` parameter, +which has a default value of `"parquet"`. 
If you had a directory +of Arrow format files, you could instead specify `format = "arrow"` in the call. + +Other supported formats include: + +* `"feather"` or `"ipc"` (aliases for `"arrow"`, as Feather v2 is the Arrow file format) +* `"csv"` (comma-delimited files) and `"tsv"` (tab-delimited files) +* `"text"` (generic text-delimited files - use the `delimiter` argument to specify which to use) + +For text files, you can pass the following parsing options to `open_dataset()`: + +* `delim` +* `quote` +* `escape_double` +* `escape_backslash` +* `skip_empty_rows` + +For more information on the usage of these parameters, see `?read_delim_arrow()`. + +The `partitioning` argument lets you specify how the file paths provide information +about how the dataset is chunked into different files. The files in this example +have file paths like + +``` +2009/01/data.parquet +2009/02/data.parquet +... +``` + +By providing `c("year", "month")` to the `partitioning` argument, you're saying that the first +path segment gives the value for `year`, and the second segment is `month`. +Every row in `2009/01/data.parquet` has a value of 2009 for `year` +and 1 for `month`, even though those columns may not be present in the file. + +Indeed, when you look at the dataset, you can see that in addition to the columns present +in every file, there are also columns `year` and `month` even though they are not present in the files themselves. 
+ +```{r, eval = file.exists("nyc-taxi")} +ds +``` +```{r, echo = FALSE, eval = !file.exists("nyc-taxi")} +cat(" +FileSystemDataset with 125 Parquet files +vendor_id: string +pickup_at: timestamp[us] +dropoff_at: timestamp[us] +passenger_count: int8 +trip_distance: float +pickup_longitude: float +pickup_latitude: float +rate_code_id: null +store_and_fwd_flag: string +dropoff_longitude: float +dropoff_latitude: float +payment_type: string +fare_amount: float +extra: float +mta_tax: float +tip_amount: float +tolls_amount: float +total_amount: float +year: int32 +month: int32 + +See $metadata for additional Schema metadata +") +``` + +The other form of partitioning currently supported is [Hive](https://hive.apache.org/)-style, +in which the partition variable names are included in the path segments. +If you had saved your files in paths like: + +``` +year=2009/month=01/data.parquet +year=2009/month=02/data.parquet +... +``` + +you would not have had to provide the names in `partitioning`; +you could have just called `ds <- open_dataset("nyc-taxi")` and the partitions +would have been detected automatically. + +## Querying the dataset + +Up to this point, you haven't loaded any data. You've walked directories to find +files, you've parsed file paths to identify partitions, and you've read the +headers of the Parquet files to inspect their schemas so that you can make sure +they all are as expected. + +In the current release, arrow supports the dplyr verbs `mutate()`, +`transmute()`, `select()`, `rename()`, `relocate()`, `filter()`, and +`arrange()`. Aggregation is not yet supported, so before you call `summarise()` +or other verbs with aggregate functions, use `collect()` to pull the selected +subset of the data into an in-memory R data frame. + +Suppose you attempt to call unsupported dplyr verbs or unimplemented functions +in your query on an Arrow Dataset. In that case, the arrow package raises an error. 
However, +for dplyr queries on Arrow Table objects (which are already in memory), the +package automatically calls `collect()` before processing that dplyr verb. + +Here's an example: suppose that you are curious about tipping behavior among the +longest taxi rides. Let's find the median tip percentage for rides with +fares greater than $100 in 2015, broken down by the number of passengers: + +```{r, eval = file.exists("nyc-taxi")} +system.time(ds %>% + filter(total_amount > 100, year == 2015) %>% + select(tip_amount, total_amount, passenger_count) %>% + mutate(tip_pct = 100 * tip_amount / total_amount) %>% + group_by(passenger_count) %>% + collect() %>% + summarise( + median_tip_pct = median(tip_pct), + n = n() + ) %>% + print()) +``` + +```{r, echo = FALSE, eval = !file.exists("nyc-taxi")} +cat(" +# A tibble: 10 x 3 + passenger_count median_tip_pct n + <int> <dbl> <int> + 1 0 9.84 380 + 2 1 16.7 143087 + 3 2 16.6 34418 + 4 3 14.4 8922 + 5 4 11.4 4771 + 6 5 16.7 5806 + 7 6 16.7 3338 + 8 7 16.7 11 + 9 8 16.7 32 +10 9 16.7 42 + + user system elapsed + 4.436 1.012 1.402 +") +``` + +You've just selected a subset out of a dataset with around 2 billion rows, computed +a new column, and aggregated it in under 2 seconds on a modern laptop. How does +this work? + +First, `mutate()`/`transmute()`, `select()`/`rename()`/`relocate()`, `filter()`, +`group_by()`, and `arrange()` record their actions but don't evaluate on the +data until you run `collect()`. 
+ +```{r, eval = file.exists("nyc-taxi")} +ds %>% + filter(total_amount > 100, year == 2015) %>% + select(tip_amount, total_amount, passenger_count) %>% + mutate(tip_pct = 100 * tip_amount / total_amount) %>% + group_by(passenger_count) +``` + +```{r, echo = FALSE, eval = !file.exists("nyc-taxi")} +cat(" +FileSystemDataset (query) +tip_amount: float +total_amount: float +passenger_count: int8 +tip_pct: expr + +* Filter: ((total_amount > 100) and (year == 2015)) +* Grouped by passenger_count +See $.data for the source Arrow object +") +``` + +This code returns an output instantly and shows the manipulations you've made, without +loading data from the files. Because the evaluation of these queries is deferred, +you can build up a query that selects down to a small subset without generating +intermediate datasets that would potentially be large. + +Second, all work is pushed down to the individual data files, +and depending on the file format, chunks of data within the files. As a result, +you can select a subset of data from a much larger dataset by collecting the +smaller slices from each file—you don't have to load the whole dataset in +memory to slice from it. + +Third, because of partitioning, you can ignore some files entirely. +In this example, by filtering `year == 2015`, all files corresponding to other years +are immediately excluded: you don't have to load them in order to find that no +rows match the filter. Relatedly, since Parquet files contain row groups with +statistics on the data within, there may be entire chunks of data you can +avoid scanning because they have no rows where `total_amount > 100`. + +## More dataset options + +There are a few ways you can control the Dataset creation to adapt to special use cases. + +### Work with files in a directory + +If you are working with a single file or a set of files that are not all in the +same directory, you can provide a file path or a vector of multiple file paths +to `open_dataset()`. 
This is useful if, for example, you have a single CSV file +that is too big to read into memory. You could pass the file path to +`open_dataset()`, use `group_by()` to partition the Dataset into manageable chunks, +then use `write_dataset()` to write each chunk to a separate Parquet file—all +without needing to read the full CSV file into R. + +### Explicitly declare column names and data types + +You can specify the `schema` argument to `open_dataset()` to declare the columns +and their data types. This is useful if you have data files that have different +storage schema (for example, a column could be `int32` in one and `int8` in +another) and you want to ensure that the resulting Dataset has a specific type. + +To be clear, it's not necessary to specify a schema, even in this example of +mixed integer types, because the Dataset constructor will reconcile differences +like these. The schema specification just lets you declare what you want the +result to be. + +### Explicitly declare partition format + +Similarly, you can provide a Schema in the `partitioning` argument of `open_dataset()` +in order to declare the types of the virtual columns that define the partitions. +This would be useful, in the taxi dataset example, if you wanted to keep +`month` as a string instead of an integer. + +### Work with multiple data sources + +Another feature of Datasets is that they can be composed of multiple data sources. +That is, you may have a directory of partitioned Parquet files in one location, +and in another directory, files that haven't been partitioned. +Or, you could point to an S3 bucket of Parquet data and a directory +of CSVs on the local file system and query them together as a single dataset. +To create a multi-source dataset, provide a list of datasets to `open_dataset()` +instead of a file path, or simply concatenate them like `big_dataset <- c(ds1, ds2)`. 
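The multi-source case described above can be sketched as follows. This is only an illustration under stated assumptions: the two directories are hypothetical temporary locations filled with slices of the built-in `mtcars` data frame, not the taxi data, and both share the same schema.

```r
library(arrow)

# Hypothetical locations for two small Parquet datasets.
dir1 <- file.path(tempdir(), "part-a")
dir2 <- file.path(tempdir(), "part-b")

# Write one slice of mtcars to each location.
write_dataset(mtcars[1:16, ], dir1, format = "parquet")
write_dataset(mtcars[17:32, ], dir2, format = "parquet")

# Open each location as its own Dataset, then combine them by
# passing a list of Datasets to open_dataset().
ds1 <- open_dataset(dir1)
ds2 <- open_dataset(dir2)
big_dataset <- open_dataset(list(ds1, ds2))
big_dataset
```

According to the text, `c(ds1, ds2)` would be an equivalent way to build `big_dataset`.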
+ +## Writing datasets + +As you can see, querying a large dataset can be made quite fast by storage in an +efficient binary columnar format like Parquet or Feather and partitioning based on +columns commonly used for filtering. However, data isn't always stored that way. +Sometimes you might start with one giant CSV. The first step in analyzing data +is cleaning it up and reshaping it into a more usable form. + +The `write_dataset()` function allows you to take a Dataset or another tabular +data object (an Arrow Table or RecordBatch, or an R data frame) and write +it to a different file format, partitioned into multiple files. + +Assume that you have a version of the NYC Taxi data as CSV: + +```r +ds <- open_dataset("nyc-taxi/csv/", format = "csv") +``` + +You can write it to a new location and translate the files to the Feather format +by calling `write_dataset()` on it: + +```r +write_dataset(ds, "nyc-taxi/feather", format = "feather") +``` + +Next, let's imagine that the `payment_type` column is something you often filter +on, so you want to partition the data by that variable. By doing so you ensure +that a filter like `payment_type == "Cash"` will touch only a subset of files +where `payment_type` is always `"Cash"`. + +One natural way to express the columns you want to partition on is to use the +`group_by()` method: + +```r +ds %>% + group_by(payment_type) %>% + write_dataset("nyc-taxi/feather", format = "feather") +``` + +This will write files to a directory tree that looks like this: + +```r +system("tree nyc-taxi/feather") +``` + +``` +## feather +## ├── payment_type=1 +## │ └── part-18.feather +## ├── payment_type=2 +## │ └── part-19.feather +## ... +## └── payment_type=UNK +## └── part-17.feather +## +## 18 directories, 23 files +``` + +Note that the directory names are `payment_type=Cash` and similar: +this is the Hive-style partitioning described above. 
This means that when +you call `open_dataset()` on this directory, you don't have to declare what the +partitions are because they can be read from the file paths. +(To instead write bare values for partition segments, i.e. `Cash` rather than +`payment_type=Cash`, call `write_dataset()` with `hive_style = FALSE`.) + +Perhaps, though, `payment_type == "Cash"` is the only data you ever care about, +and you just want to drop the rest and have a smaller working set. +For this, you can `filter()` them out when writing: + +```r +ds %>% + filter(payment_type == "Cash") %>% + write_dataset("nyc-taxi/feather", format = "feather") +``` + +The other thing you can do when writing datasets is select a subset of columns +or reorder them. Suppose you never care about `vendor_id`, and being a string column, +it can take up a lot of space when you read it in, so let's drop it: + +```r +ds %>% + group_by(payment_type) %>% + select(-vendor_id) %>% + write_dataset("nyc-taxi/feather", format = "feather") +``` + +Note that while you can select a subset of columns, +you cannot currently rename columns when writing a dataset. 
diff --git a/src/arrow/r/vignettes/developing.Rmd b/src/arrow/r/vignettes/developing.Rmd new file mode 100644 index 000000000..5cff5e560 --- /dev/null +++ b/src/arrow/r/vignettes/developing.Rmd @@ -0,0 +1,605 @@ +--- +title: "Arrow R Developer Guide" +output: rmarkdown::html_vignette +vignette: > + %\VignetteIndexEntry{Arrow R Developer Guide} + %\VignetteEngine{knitr::rmarkdown} + %\VignetteEncoding{UTF-8} +--- + +```{r setup-options, include=FALSE} +knitr::opts_chunk$set(error = TRUE, eval = FALSE) +# Get environment variables describing what to evaluate +run <- tolower(Sys.getenv("RUN_DEVDOCS", "false")) == "true" +macos <- tolower(Sys.getenv("DEVDOCS_MACOS", "false")) == "true" +ubuntu <- tolower(Sys.getenv("DEVDOCS_UBUNTU", "false")) == "true" +sys_install <- tolower(Sys.getenv("DEVDOCS_SYSTEM_INSTALL", "false")) == "true" +# Update the source knit_hook to save the chunk (if it is marked to be saved) +knit_hooks_source <- knitr::knit_hooks$get("source") +knitr::knit_hooks$set(source = function(x, options) { + # Extra paranoia about when this will write the chunks to the script, we will + # only save when: + # * CI is true + # * RUN_DEVDOCS is true + # * options$save is TRUE (and a check that not NULL won't crash it) + if (as.logical(Sys.getenv("CI", FALSE)) && run && !is.null(options$save) && options$save) + cat(x, file = "script.sh", append = TRUE, sep = "\n") + # but hide the blocks we want hidden: + if (!is.null(options$hide) && options$hide) { + return(NULL) + } + knit_hooks_source(x, options) +}) +``` + +```{bash, save=run, hide=TRUE} +# Stop on failure, echo input as we go +set -e +set -x +``` + +If you're looking to contribute to arrow, this vignette can help you set up a development environment that will enable you to write code and run tests locally. 
It outlines: + +* how to build the components that make up the Arrow project and R package +* workflows that developers use +* some common troubleshooting steps and solutions + +This document is intended only for **developers** of Apache Arrow or the Arrow R package. R package users do not need to do any of this setup. If you're looking for how to install Arrow, see [the instructions in the readme](https://arrow.apache.org/docs/r/#installation). + +This document is a work in progress and will grow and change as the Apache Arrow project grows and changes. We have tried to make these steps as robust as possible (in fact, we even test exactly these instructions on our nightly CI to ensure they don't become stale!), but custom configurations might conflict with these instructions and there are differences of opinion across developers about how to set up development environments like this. + +We welcome any feedback you have about things that are confusing or additions you would like to see here - please [report an issue](https://issues.apache.org/jira/projects/ARROW/issues) if you have any suggestions or requests. + +# Developer environment setup + +## R-only {.tabset} + +Windows and macOS users who wish to contribute to the R package and +don't need to alter libarrow (Arrow's C++ library) may be able to obtain a +recent version of the library without building from source. + +### Linux + +On Linux, you can download a .zip file containing libarrow from the +nightly repository. + +To see what nightlies are available, you can use arrow's (or any other S3 client's) S3 listing functionality to see what is in the bucket `s3://arrow-r-nightly/libarrow/bin`: + +``` +nightly <- s3_bucket("arrow-r-nightly") +nightly$ls("libarrow/bin") +``` +Version numbers in that repository correspond to dates. + +You'll need to create a `libarrow` directory inside the R package directory and unzip the zip file containing the compiled libarrow binary files into it. 
+ +### macOS +On macOS, you can install libarrow using [Homebrew](https://brew.sh/): + +```bash +# For the released version: +brew install apache-arrow +# Or for a development version, you can try: +brew install apache-arrow --HEAD +``` + +### Windows + +On Windows, you can download a .zip file containing libarrow from the nightly repository. + +To see what nightlies are available, you can use arrow's (or any other S3 client's) S3 listing functionality to see what is in the bucket `s3://arrow-r-nightly/libarrow/bin`: + +``` +nightly <- s3_bucket("arrow-r-nightly") +nightly$ls("libarrow/bin") +``` +Version numbers in that repository correspond to dates. + +You can set the `RWINLIB_LOCAL` environment variable to point to the zip file containing libarrow before installing the arrow R package. + + +## R and C++ + +If you need to alter both libarrow and the R package code, or if you can't get a binary version of the latest libarrow elsewhere, you'll need to build it from source. This section discusses how to set up a C++ libarrow build configured to work with the R package. For more general resources, see the [Arrow C++ developer guide](https://arrow.apache.org/docs/developers/cpp/building.html). + +There are five major steps to the process. + +### Step 1 - Install dependencies {.tabset} + +When building libarrow, by default, system dependencies will be used if suitable versions are found. If system dependencies are not present, libarrow will build them during its own build process. The only dependencies that you need to install _outside_ of the build process are [cmake](https://cmake.org/) (for configuring the build) and [openssl](https://www.openssl.org/) if you are building with S3 support. + +For a faster build, you may choose to pre-install more C++ library dependencies (such as [lz4](http://lz4.github.io/lz4/), [zstd](https://facebook.github.io/zstd/), etc.) on the system so that they don't need to be built from source in the libarrow build. 
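Before installing anything, it can help to see which tools are already on your `PATH`. The sketch below uses a hypothetical `missing_tools` helper. It only checks for command-line programs, so it cannot tell you whether development headers (for example the OpenSSL headers needed for S3 support) are installed.

```shell
# Hypothetical helper: print the names of the given commands that are
# not found on PATH, separated by spaces (empty output means none missing).
missing_tools() {
  missing=""
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      missing="$missing $tool"
    fi
  done
  # Strip the leading space before printing.
  echo "${missing# }"
}

# cmake is always needed; openssl only matters if you build with S3 support.
missing_tools cmake openssl
```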
+ +#### Ubuntu +```{bash, save=run & ubuntu} +sudo apt install -y cmake libcurl4-openssl-dev libssl-dev +``` + +#### macOS +```{bash, save=run & macos} +brew install cmake openssl +``` + +#### Windows + +Currently, the R package cannot be made to work with a local libarrow build. This will be resolved in a future release. + +### Step 2 - Configure the libarrow build + +We recommend that you configure libarrow to be built to a user-level directory rather than a system directory for your development work. This is so that the development version you are using doesn't overwrite a released version of libarrow you may already have installed, and so that you are also able to work with more than one version of libarrow (by using different `ARROW_HOME` directories for the different versions). + +In the example below, libarrow is installed to a directory called `dist` that has the same parent directory as the `arrow` checkout. Your installation of the Arrow R package can point to any directory with any name, though we recommend *not* placing it inside of the `arrow` git checkout directory as unwanted changes could stop it working properly. + +```{bash, save=run & !sys_install} +export ARROW_HOME=$(pwd)/dist +mkdir $ARROW_HOME +``` + +_Special instructions on Linux:_ You will need to set `LD_LIBRARY_PATH` to the `lib` directory that is under where you set `$ARROW_HOME`, before launching R and using arrow. One way to do this is to add it to your profile (we use `~/.bash_profile` here, but you might need to put this in a different file depending on your setup, e.g. if you use a shell other than `bash`). On macOS you do not need to do this because the macOS shared library paths are hardcoded to their locations during build time. + +```{bash, save=run & ubuntu & !sys_install} +export LD_LIBRARY_PATH=$ARROW_HOME/lib:$LD_LIBRARY_PATH +echo "export LD_LIBRARY_PATH=$ARROW_HOME/lib:$LD_LIBRARY_PATH" >> ~/.bash_profile +``` + +Start by navigating in a terminal to the `arrow` repository. 
You will need to create a directory into which the C++ build will put its contents. We recommend that you make a `build` directory inside of the `cpp` directory of the Arrow git repository (it is git-ignored, so you won't accidentally check it in). Next, change directories to be inside `cpp/build`: + +```{bash, save=run & !sys_install} +pushd arrow +mkdir -p cpp/build +pushd cpp/build +``` + +You'll first call `cmake` to configure the build and then `make install`. For the R package, you'll need to enable several features in libarrow using `-D` flags: + +```{bash, save=run & !sys_install} +cmake \ + -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \ + -DCMAKE_INSTALL_LIBDIR=lib \ + -DARROW_COMPUTE=ON \ + -DARROW_CSV=ON \ + -DARROW_DATASET=ON \ + -DARROW_EXTRA_ERROR_CONTEXT=ON \ + -DARROW_FILESYSTEM=ON \ + -DARROW_INSTALL_NAME_RPATH=OFF \ + -DARROW_JEMALLOC=ON \ + -DARROW_JSON=ON \ + -DARROW_PARQUET=ON \ + -DARROW_WITH_SNAPPY=ON \ + -DARROW_WITH_ZLIB=ON \ + .. +``` + +`..` refers to the C++ source directory: you're in `cpp/build` and the source is in `cpp`. + +#### Enabling more Arrow features + +To enable optional features including: S3 support, an alternative memory allocator, and additional compression libraries, add some or all of these flags to your call to `cmake` (the trailing `\` makes them easier to paste into a bash shell on a new line): + +```bash + -DARROW_MIMALLOC=ON \ + -DARROW_S3=ON \ + -DARROW_WITH_BROTLI=ON \ + -DARROW_WITH_BZ2=ON \ + -DARROW_WITH_LZ4=ON \ + -DARROW_WITH_SNAPPY=ON \ + -DARROW_WITH_ZSTD=ON \ +``` + +Other flags that may be useful: + +* `-DBoost_SOURCE=BUNDLED` and `-DThrift_SOURCE=BUNDLED`, for example, or any other dependency `*_SOURCE`, if you have a system version of a C++ dependency that doesn't work correctly with Arrow. This tells the build to compile its own version of the dependency from source. + +* `-DCMAKE_BUILD_TYPE=debug` or `-DCMAKE_BUILD_TYPE=relwithdebinfo` can be useful for debugging. 
You probably don't want to do this generally because a debug build is much slower at runtime than the default `release` build. + +_Note:_ `cmake` is particularly sensitive to whitespace. If you see errors, check that you don't have any errant whitespace. + +### Step 3 - Building libarrow + +You can add `-j#` between `make` and `install` to speed up compilation by running in parallel (where `#` is the number of cores you have available). + +```{bash, save=run & !(sys_install & ubuntu)} +make -j8 install +``` + +### Step 4 - Build the Arrow R package + +Once you've built libarrow, you can install the R package and its +dependencies, along with additional dev dependencies, from the git +checkout: + +```{bash, save=run} +popd # To go back to the root directory of the project, from cpp/build +pushd r +R -e 'install.packages("remotes"); remotes::install_deps(dependencies = TRUE)' +R CMD INSTALL . +``` + +#### Compilation flags + +If you need to set any compilation flags while building the C++ +extensions, you can use the `ARROW_R_CXXFLAGS` environment variable. For +example, if you are using `perf` to profile the R extensions, you may +need to set + +```bash +export ARROW_R_CXXFLAGS=-fno-omit-frame-pointer +``` + +#### Recompiling the C++ code + +With the setup described here, you should not need to rebuild the Arrow library or even the C++ source in the R package as you iterate and work on the R package. The only time those should need to be rebuilt is if you have changed the C++ in the R package (and even then, `R CMD INSTALL .` should only need to recompile the files that have changed) _or_ if the libarrow C++ has changed and there is a mismatch between libarrow and the R package. If you find yourself rebuilding either or both each time you install the package or run tests, something is probably wrong with your setup. + +<details> +<summary>For a full build: a `cmake` command with all of the R-relevant optional dependencies turned on. 
Development with other languages might require different flags as well. For example, to develop Python, you would need to also add `-DARROW_PYTHON=ON` (though all of the other flags used for Python are already included here).</summary> +<p> + +```bash +cmake \ + -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \ + -DCMAKE_INSTALL_LIBDIR=lib \ + -DARROW_COMPUTE=ON \ + -DARROW_CSV=ON \ + -DARROW_DATASET=ON \ + -DARROW_EXTRA_ERROR_CONTEXT=ON \ + -DARROW_FILESYSTEM=ON \ + -DARROW_INSTALL_NAME_RPATH=OFF \ + -DARROW_JEMALLOC=ON \ + -DARROW_JSON=ON \ + -DARROW_MIMALLOC=ON \ + -DARROW_PARQUET=ON \ + -DARROW_S3=ON \ + -DARROW_WITH_BROTLI=ON \ + -DARROW_WITH_BZ2=ON \ + -DARROW_WITH_LZ4=ON \ + -DARROW_WITH_SNAPPY=ON \ + -DARROW_WITH_ZLIB=ON \ + -DARROW_WITH_ZSTD=ON \ + .. +``` +</p> +</details> + +## Installing a version of the R package with a specific git reference + +If you need an arrow installation from a specific repository or git reference, on most platforms except Windows, you can run: + +```{r} +remotes::install_github("apache/arrow/r", build = FALSE) +``` + +The `build = FALSE` argument is important so that the installation can access the +C++ source in the `cpp/` directory in `apache/arrow`. + +As with other installation methods, setting the environment variables `LIBARROW_MINIMAL=false` and `ARROW_R_DEV=true` will provide a more full-featured version of Arrow and provide more verbose output, respectively. + +For example, to install from the (fictional) branch `bugfix` from `apache/arrow` you could run: + +```r +Sys.setenv(LIBARROW_MINIMAL="false") +remotes::install_github("apache/arrow/r@bugfix", build = FALSE) +``` + +Developers may wish to use this method of installing a specific commit +separate from another Arrow development environment or system installation +(e.g. we use this in [arrowbench](https://github.com/ursacomputing/arrowbench) +to install development versions of libarrow isolated from the system install). 
If +you already have libarrow installed system-wide, you may need to set +some additional variables in order to isolate this build from your system libraries: + +* Setting the environment variable `FORCE_BUNDLED_BUILD` to `true` will skip the `pkg-config` search for libarrow and attempt to build from the same source at the repository+ref given. + +* You may also need to set the Makevars `CPPFLAGS` and `LDFLAGS` to `""` in order to prevent the installation process from attempting to link to already installed system versions of libarrow. One way to do this temporarily is wrapping your `remotes::install_github()` call like so: +```{r} +withr::with_makevars(list(CPPFLAGS = "", LDFLAGS = ""), remotes::install_github(...)) +``` + +# Common developer workflow tasks + +The `arrow/r` directory contains a `Makefile` to help with some common tasks from the command line (e.g. `make test`, `make doc`, `make clean`, etc.). + +## Loading arrow + +You can load the R package via `devtools::load_all()`. + +## Rebuilding the documentation + +The R documentation uses the [`@examplesIf`](https://roxygen2.r-lib.org/articles/rd.html#functions) tag introduced in `roxygen2` version 7.1.1.9001, which hasn't yet been released on CRAN at the time of writing. If you are making changes which require updating the documentation, please install the development version of `roxygen2` from GitHub. + +```{r} +remotes::install_github("r-lib/roxygen2") +``` + +You can use `devtools::document()` and `pkgdown::build_site()` to rebuild the documentation and preview the results. + +```r +# Update roxygen documentation +devtools::document() + +# To preview the documentation website +pkgdown::build_site(preview=TRUE) +``` + +## Styling and linting + +### R code + +The R code in the package follows [the tidyverse style](https://style.tidyverse.org/). On PR submission (and on pushes) our CI will run linting and will flag possible errors on the pull request with annotations. 
+ +To run [lintr](https://github.com/jimhester/lintr) locally, install the lintr package (note, we currently use a fork that includes fixes not yet accepted upstream, see how lintr is being installed in the file `ci/docker/linux-apt-lint.dockerfile` for the current status) and then run + +```{r} +lintr::lint_package("arrow/r") +``` + +You can automatically change the formatting of the code in the package using the [styler](https://styler.r-lib.org/) package. There are two ways to do this: + +1. Use the comment bot by commenting `@github-actions autotune` on a PR; it styles the code and commits the result back to the branch. + +2. Run the styler locally either via Makefile commands: + +```bash +make style # (for only the files changed) +make style-all # (for all files) +``` + +or in R: + +```{r} +# note the two excluded files which should not be styled +styler::style_pkg(exclude_files = c("tests/testthat/latin1.R", "data-raw/codegen.R")) +``` + +The styler package will fix many styling errors, though not all lintr errors are automatically fixable with styler. The list of files we intentionally do not style is in `r/.styler_excludes.R`. + +### C++ code + +The arrow package uses some customized tools on top of [cpp11](https://cpp11.r-lib.org/) to prepare its +C++ code in `src/`. This is because there are some features that are only enabled +and built conditionally during build time. If you change C++ code in the R +package, you will need to set the `ARROW_R_DEV` environment variable to `true` +(optionally, add it to your `~/.Renviron` file to persist across sessions) so +that the `data-raw/codegen.R` file is used for code generation. The `Makefile` +commands also handle this automatically. + +We use Google C++ style in our C++ code. The easiest way to accomplish this is to +use an editor/IDE that formats your code for you. Many popular editors/IDEs +have support for running `clang-format` on C++ files when you save them. 
+Installing/enabling the appropriate plugin may save you much frustration. + +Check for style errors with + +```bash +./lint.sh +``` + +Fix any style issues before committing with + +```bash +./lint.sh --fix +``` + +The lint script requires Python 3 and `clang-format-8`. If the command +isn't found, you can explicitly provide the path to it like: + +```bash +CLANG_FORMAT=$(which clang-format-8) ./lint.sh +``` + +On macOS, you can get this by installing LLVM via Homebrew and running the script as: +```bash +CLANG_FORMAT=$(brew --prefix llvm@8)/bin/clang-format ./lint.sh +``` + +_Note_ that the lint script requires Python 3 and the following Python dependencies +(`cmake_format` is pinned to a specific version): + +* autopep8 +* flake8 +* cmake_format==0.5.2 + +## Running tests + +Tests can be run either using `devtools::test()` or the Makefile alternative. + +```r +# Run the test suite, optionally filtering file names +devtools::test(filter="^regexp$") + +# or the Makefile alternative from the arrow/r directory in a shell: +make test file=regexp +``` + +Some tests are conditionally enabled based on the availability of certain +features in the package build (S3 support, compression libraries, etc.). +Others are generally skipped by default but can be enabled with environment +variables or other settings: + +* All tests are skipped on Linux if the package builds without the C++ libarrow. 
+ To make the build fail if libarrow is not available (as in, to test that + the C++ build was successful), set `TEST_R_WITH_ARROW=true` + +* Some tests are disabled unless `ARROW_R_DEV=true` + +* Tests that require allocating >2GB of memory to test Large types are disabled + unless `ARROW_LARGE_MEMORY_TESTS=true` + +* Integration tests against a real S3 bucket are disabled unless credentials + are set in `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`; these are available + on request + +* S3 tests using [MinIO](https://min.io/) locally are enabled if the + `minio server` process is found running. If you're running MinIO with custom + settings, you can set `MINIO_ACCESS_KEY`, `MINIO_SECRET_KEY`, and + `MINIO_PORT` to override the defaults. + +## Running checks + +You can run package checks by using `devtools::check()` and check test coverage +with `covr::package_coverage()`. + +```r +# All package checks +devtools::check() + +# See test coverage statistics +covr::report() +covr::package_coverage() +``` + +For full package validation, you can run the following commands from a terminal. + +``` +R CMD build . +R CMD check arrow_*.tar.gz --as-cran +``` + + +## Running additional CI checks + +On a pull request, there are some actions you can trigger by commenting on the +PR. We have additional CI checks that run nightly and can be requested on demand +using an internal tool called +[crossbow](https://arrow.apache.org/docs/developers/crossbow.html). +A few important GitHub comment commands are shown below. + +#### Run all extended R CI tasks +``` +@github-actions crossbow submit -g r +``` + +This runs each of the R-related CI tasks. + +#### Run a specific task +``` +@github-actions crossbow submit {task-name} +``` + +See the `r:` group definition near the beginning of the [crossbow configuration](https://github.com/apache/arrow/blob/master/dev/tasks/tasks.yml) +for a list of glob expression patterns that match names of items in the `tasks:` +list below it. 
+ +#### Run linting and documentation building tasks + +``` +@github-actions autotune +``` + +This will find and fix C++ linting errors, rebuild the R documentation (among other +cleanup tasks), run styler on any changed R code, and commit the resulting +updates to the branch. + +# Summary of environment variables + +* See the user-facing [Install vignette](install.html) for a large number of + environment variables that determine how the build works and what features + get built. +* `TEST_OFFLINE_BUILD`: When set to `true`, the build script will not download + a prebuilt C++ library binary. + It will turn off any features that require a download, unless they're available + in the `tools/cpp/thirdparty/download/` subfolder of the tar.gz file. + `create_package_with_all_dependencies()` creates that subfolder. + Regardless of this flag's value, `cmake` will be downloaded if it's unavailable. +* `TEST_R_WITHOUT_LIBARROW`: When set to `true`, skip tests that would require + the C++ Arrow library (that is, almost everything). + +# Troubleshooting + +Note that after any change to libarrow, you must reinstall it and +run `make clean` or `git clean -fdx .` to remove any cached object code +in the `r/src/` directory before reinstalling the R package. This is +only necessary if you make changes to libarrow source; you do not +need to manually purge object files if you are only editing R or C++ +code inside `r/`. 
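The order of operations described above can be sketched as a short script. The helper names (`run_step`, `rebuild_after_libarrow_change`) and the `DRY_RUN` variable are made up for this example, and the paths assume the `cpp/build` layout from Step 2. By default the script only prints the commands so you can check them before running anything.

```shell
# Hypothetical helpers, not part of the arrow repository.
# With DRY_RUN=1 (the default here) each command is printed, not executed.
# Set DRY_RUN=0 to actually run the steps.
run_step() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

rebuild_after_libarrow_change() {
  arrow_root="$1"   # path to the arrow git checkout

  # 1. Reinstall libarrow from the existing build directory.
  run_step make -C "$arrow_root/cpp/build" install &&
  # 2. Remove cached object code from the R package.
  run_step make -C "$arrow_root/r" clean &&
  # 3. Reinstall the R package against the fresh libarrow.
  run_step R CMD INSTALL "$arrow_root/r"
}

rebuild_after_libarrow_change "$(pwd)/arrow"
```

Chaining the steps with `&&` stops the sequence at the first failure, so the R package is never reinstalled against a libarrow build that did not complete.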
+ +## Arrow library - R package mismatches + +If libarrow and the R package have diverged, you will see errors like: + +``` +Error: package or namespace load failed for ‘arrow' in dyn.load(file, DLLpath = DLLpath, ...): + unable to load shared object '/Library/Frameworks/R.framework/Versions/4.0/Resources/library/00LOCK-r/00new/arrow/libs/arrow.so': + dlopen(/Library/Frameworks/R.framework/Versions/4.0/Resources/library/00LOCK-r/00new/arrow/libs/arrow.so, 6): Symbol not found: __ZN5arrow2io16RandomAccessFile9ReadAsyncERKNS0_9IOContextExx + Referenced from: /Library/Frameworks/R.framework/Versions/4.0/Resources/library/00LOCK-r/00new/arrow/libs/arrow.so + Expected in: flat namespace + in /Library/Frameworks/R.framework/Versions/4.0/Resources/library/00LOCK-r/00new/arrow/libs/arrow.so +Error: loading failed +Execution halted +ERROR: loading failed +``` + +To resolve this, try [rebuilding the Arrow library](#step-3-building-arrow). + +## Multiple versions of libarrow + +If you are installing from a user-level directory, and you already have a +previous installation of libarrow in a system directory, you may get +errors like the following when you install the R package: + +``` +Error: package or namespace load failed for ‘arrow' in dyn.load(file, DLLpath = DLLpath, ...): + unable to load shared object '/Library/Frameworks/R.framework/Versions/4.0/Resources/library/00LOCK-r/00new/arrow/libs/arrow.so': + dlopen(/Library/Frameworks/R.framework/Versions/4.0/Resources/library/00LOCK-r/00new/arrow/libs/arrow.so, 6): Library not loaded: /usr/local/lib/libarrow.400.dylib + Referenced from: /usr/local/lib/libparquet.400.dylib + Reason: image not found +``` + +If this happens, you need to make sure that you don't let R link to your system +library when building arrow. 
You can do this a number of different ways: + +* Setting the `MAKEFLAGS` environment variable to `"LDFLAGS="` (see below for an example) this is the recommended way to accomplish this +* Using {withr}'s `with_makevars(list(LDFLAGS = ""), ...)` +* adding `LDFLAGS=` to your `~/.R/Makevars` file (the least recommended way, though it is a common debugging approach suggested online) + +```{bash, save=run & !sys_install & macos, hide=TRUE} +# Setup troubleshooting section +# install a system-level arrow on macOS +brew install apache-arrow +``` + + +```{bash, save=run & !sys_install & ubuntu, hide=TRUE} +# Setup troubleshooting section +# install a system-level arrow on Ubuntu +sudo apt update +sudo apt install -y -V ca-certificates lsb-release wget +wget https://apache.jfrog.io/artifactory/arrow/$(lsb_release --id --short | tr 'A-Z' 'a-z')/apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb +sudo apt install -y -V ./apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb +sudo apt update +sudo apt install -y -V libarrow-dev +``` + +```{bash, save=run & !sys_install & macos} +MAKEFLAGS="LDFLAGS=" R CMD INSTALL . +``` + + +## `rpath` issues + +If the package fails to install/load with an error like this: + +``` + ** testing if installed package can be loaded from temporary location + Error: package or namespace load failed for 'arrow' in dyn.load(file, DLLpath = DLLpath, ...): + unable to load shared object '/Users/you/R/00LOCK-r/00new/arrow/libs/arrow.so': + dlopen(/Users/you/R/00LOCK-r/00new/arrow/libs/arrow.so, 6): Library not loaded: @rpath/libarrow.14.dylib +``` + +ensure that `-DARROW_INSTALL_NAME_RPATH=OFF` was passed (this is important on +macOS to prevent problems at link time and is a no-op on other platforms). +Alternatively, try setting the environment variable `R_LD_LIBRARY_PATH` to +wherever Arrow C++ was put in `make install`, e.g. `export +R_LD_LIBRARY_PATH=/usr/local/lib`, and retry installing the R package. 
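If you go the `R_LD_LIBRARY_PATH` route, one way to set it without clobbering an existing value is sketched below. `prepend_path` is a hypothetical helper, and the example assumes `ARROW_HOME` is the install prefix you chose in Step 2; `/usr/local` is used only as an illustrative fallback.

```shell
# Hypothetical helper: prepend a directory to a colon-separated path list
# without leaving a stray ":" when the existing list is empty.
prepend_path() {
  dir="$1"
  existing="$2"
  if [ -n "$existing" ]; then
    echo "$dir:$existing"
  else
    echo "$dir"
  fi
}

# Point R at the lib directory under the libarrow install prefix.
R_LD_LIBRARY_PATH=$(prepend_path "${ARROW_HOME:-/usr/local}/lib" "${R_LD_LIBRARY_PATH:-}")
export R_LD_LIBRARY_PATH
```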
+ +When installing from source, if the R and C++ library versions do not +match, installation may fail. If you've previously installed the +libraries and want to upgrade the R package, you'll need to update the +Arrow C++ library first. + +For any other build/configuration challenges, see the [C++ developer +guide](https://arrow.apache.org/docs/developers/cpp/building.html). + +## Other installation issues + +There are a number of scripts that are triggered when the arrow R package is installed. For package users who are not interacting with the underlying code, these should all just work without configuration and pull in the most complete pieces (e.g. official binaries that we host). However, knowing about these scripts can help package developers troubleshoot if things go wrong in them or things go wrong in an install. See [the installation vignette](./install.html#how-dependencies-are-resolved) for more information. diff --git a/src/arrow/r/vignettes/flight.Rmd b/src/arrow/r/vignettes/flight.Rmd new file mode 100644 index 000000000..e8af5cad6 --- /dev/null +++ b/src/arrow/r/vignettes/flight.Rmd @@ -0,0 +1,87 @@ +--- +title: "Connecting to Flight RPC Servers" +output: rmarkdown::html_vignette +vignette: > + %\VignetteIndexEntry{Connecting to Flight RPC Servers} + %\VignetteEngine{knitr::rmarkdown} + %\VignetteEncoding{UTF-8} +--- + +[**Flight**](https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/) +is a general-purpose client-server framework for high performance +transport of large datasets over network interfaces, built as part of the +[Apache Arrow](https://arrow.apache.org) project. + +Flight allows for highly efficient data transfer as it: + +* removes the need for deserialization during data transfer +* allows for parallel data streaming +* is highly optimized to take advantage of Arrow's columnar format. + +The arrow package provides methods for connecting to Flight RPC servers +to send and receive data. 
+ +## Getting Started + +The `flight` functions in the package use [reticulate](https://rstudio.github.io/reticulate/) to call methods in the +[pyarrow](https://arrow.apache.org/docs/python/api/flight.html) Python package. + +Before using them for the first time, +you'll need to be sure you have reticulate and pyarrow installed: + +```r +install.packages("reticulate") +arrow::install_pyarrow() +``` + +See `vignette("python", package = "arrow")` for more details on setting up +`pyarrow`. + +## Example + +The package includes methods for starting a Python-based Flight server, as well +as methods for connecting to a Flight server running elsewhere. + +To illustrate both sides, in one process let's start a demo server: + +```r +library(arrow) +demo_server <- load_flight_server("demo_flight_server") +server <- demo_server$DemoFlightServer(port = 8089) +server$serve() +``` + +We'll leave that one running. + +In a different R process, let's connect to it and put some data in it. + +```r +library(arrow) +client <- flight_connect(port = 8089) +# Upload some data to our server so there's something to demo +flight_put(client, iris, path = "test_data/iris") +``` + +Now, in a new R process, let's connect to the server and pull the data we +put there: + +```r +library(arrow) +library(dplyr) +client <- flight_connect(port = 8089) +client %>% + flight_get("test_data/iris") %>% + group_by(Species) %>% + summarize(max_petal = max(Petal.Length)) + +## # A tibble: 3 x 2 +## Species max_petal +## <fct> <dbl> +## 1 setosa 1.9 +## 2 versicolor 5.1 +## 3 virginica 6.9 +``` + +Because `flight_get()` returns an Arrow data structure, you can directly pipe +its result into a [dplyr](https://dplyr.tidyverse.org/) workflow. +See `vignette("dataset", package = "arrow")` for more information on working with Arrow objects via a dplyr interface. 
diff --git a/src/arrow/r/vignettes/fs.Rmd b/src/arrow/r/vignettes/fs.Rmd new file mode 100644 index 000000000..5d699c49d --- /dev/null +++ b/src/arrow/r/vignettes/fs.Rmd @@ -0,0 +1,130 @@ +--- +title: "Working with Cloud Storage (S3)" +output: rmarkdown::html_vignette +vignette: > + %\VignetteIndexEntry{Working with Cloud Storage (S3)} + %\VignetteEngine{knitr::rmarkdown} + %\VignetteEncoding{UTF-8} +--- + +The Arrow C++ library includes a generic filesystem interface and specific +implementations for some cloud storage systems. This setup allows various +parts of the project to be able to read and write data with different storage +backends. In the `arrow` R package, support has been enabled for AWS S3. +This vignette provides an overview of working with S3 data using Arrow. + +> In Windows and macOS binary packages, S3 support is included. On Linux when installing from source, S3 support is not enabled by default, and it has additional system requirements. See `vignette("install", package = "arrow")` for details. + +## URIs + +File readers and writers (`read_parquet()`, `write_feather()`, et al.) +accept an S3 URI as the source or destination file, +as do `open_dataset()` and `write_dataset()`. +An S3 URI looks like: + +``` +s3://[access_key:secret_key@]bucket/path[?region=] +``` + +For example, one of the NYC taxi data files used in `vignette("dataset", package = "arrow")` is found at + +``` +s3://ursa-labs-taxi-data/2019/06/data.parquet +``` + +Given this URI, we can pass it to `read_parquet()` just as if it were a local file path: + +```r +df <- read_parquet("s3://ursa-labs-taxi-data/2019/06/data.parquet") +``` + +Note that this will be slower to read than if the file were local, +though if you're running on a machine in the same AWS region as the file in S3, +the cost of reading the data over the network should be much lower. 
+ +## Creating a FileSystem object + +Another way to connect to S3 is to create a `FileSystem` object once and pass +that to the read/write functions. +`S3FileSystem` objects can be created with the `s3_bucket()` function, which +automatically detects the bucket's AWS region. Additionally, the resulting +`FileSystem` will consider paths relative to the bucket's path (so for example +you don't need to prefix the bucket path when listing a directory). +This may be convenient when dealing with +long URIs, and it's necessary for some options and authentication methods +that aren't supported in the URI format. + +With a `FileSystem` object, we can point to specific files in it with the `$path()` method. +In the previous example, this would look like: + +```r +bucket <- s3_bucket("ursa-labs-taxi-data") +df <- read_parquet(bucket$path("2019/06/data.parquet")) +``` + +See the help for `FileSystem` for a list of options that `s3_bucket()` and `S3FileSystem$create()` +can take. `region`, `scheme`, and `endpoint_override` can be encoded as query +parameters in the URI (though `region` will be auto-detected in `s3_bucket()` or from the URI if omitted). +`access_key` and `secret_key` can also be included, +but other options are not supported in the URI. + +The object that `s3_bucket()` returns is technically a `SubTreeFileSystem`, which holds a path and a file system to which it corresponds. `SubTreeFileSystem`s can be useful for holding a reference to a subdirectory somewhere, on S3 or elsewhere. + +One way to get a subtree is to call the `$cd()` method on a `FileSystem`: + +```r +june2019 <- bucket$cd("2019/06") +df <- read_parquet(june2019$path("data.parquet")) +``` + +`SubTreeFileSystem` can also be made from a URI: + +```r +june2019 <- SubTreeFileSystem$create("s3://ursa-labs-taxi-data/2019/06") +``` + +## Authentication + +To access private S3 buckets, you typically need two secret parameters: +an `access_key`, which is like a user id, +and a `secret_key`, like a token. 
+There are a few options for passing these credentials: + +1. Include them in the URI, like `s3://access_key:secret_key@bucket-name/path/to/file`. Be sure to [URL-encode](https://en.wikipedia.org/wiki/Percent-encoding) your secrets if they contain special characters like "/". + +2. Pass them as `access_key` and `secret_key` to `S3FileSystem$create()` or `s3_bucket()`. + +3. Set them as environment variables named `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`, respectively. + +4. Define them in a `~/.aws/credentials` file, according to the [AWS documentation](https://docs.aws.amazon.com/sdk-for-cpp/v1/developer-guide/credentials.html). + +You can also use an [AccessRole](https://docs.aws.amazon.com/STS/latest/APIReference/API_AssumeRole.html) +for temporary access by passing the `role_arn` identifier to `S3FileSystem$create()` or `s3_bucket()`. + +## File systems that emulate S3 + +The `S3FileSystem` machinery enables you to work with any file system that +provides an S3-compatible interface. For example, [MinIO](https://min.io/) is +an object-storage server that emulates the S3 API. If you were to +run `minio server` locally with its default settings, you could connect to +it with `arrow` using `S3FileSystem` like this: + +```r +minio <- S3FileSystem$create( + access_key = "minioadmin", + secret_key = "minioadmin", + scheme = "http", + endpoint_override = "localhost:9000" +) +``` + +or, as a URI, it would be + +``` +s3://minioadmin:minioadmin@?scheme=http&endpoint_override=localhost%3A9000 +``` + +(note the URL escaping of the `:` in `endpoint_override`). + +Among other applications, this can be useful for testing out code locally before +running on a remote S3 bucket. 
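The URL escaping in the MinIO URI above can be reproduced with a few lines of shell. This is a minimal sketch: `encode_uri_part` and `minio_uri` are hypothetical helpers, and the encoder only handles `%`, `:` and `/`, which is enough for a `host:port` endpoint and simple keys but is not a general percent-encoder.

```shell
# Hypothetical helper: percent-encode "%", ":" and "/" in a string.
# The "%" rule runs first so that already-encoded output is not re-encoded.
encode_uri_part() {
  printf '%s' "$1" | sed -e 's/%/%25/g' -e 's/:/%3A/g' -e 's|/|%2F|g'
}

# Build an s3:// URI for an S3-compatible server reachable over http.
minio_uri() {
  access_key="$1"
  secret_key="$2"
  endpoint="$3"
  printf 's3://%s:%s@?scheme=http&endpoint_override=%s\n' \
    "$(encode_uri_part "$access_key")" \
    "$(encode_uri_part "$secret_key")" \
    "$(encode_uri_part "$endpoint")"
}

minio_uri minioadmin minioadmin localhost:9000
# s3://minioadmin:minioadmin@?scheme=http&endpoint_override=localhost%3A9000
```

Encoding the keys as well as the endpoint covers the case mentioned under Authentication, where a secret containing "/" would otherwise break the URI.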
diff --git a/src/arrow/r/vignettes/install.Rmd b/src/arrow/r/vignettes/install.Rmd new file mode 100644 index 000000000..5bd76a371 --- /dev/null +++ b/src/arrow/r/vignettes/install.Rmd @@ -0,0 +1,448 @@ +--- +title: "Installing the Arrow Package on Linux" +output: rmarkdown::html_vignette +vignette: > + %\VignetteIndexEntry{Installing the Arrow Package on Linux} + %\VignetteEngine{knitr::rmarkdown} + %\VignetteEncoding{UTF-8} +--- + +On macOS and Windows, when you `install.packages("arrow")`, +you get a binary package that contains Arrow’s C++ dependencies along with it. +On Linux, `install.packages()` retrieves a source package that has to be compiled locally, +and C++ dependencies need to be resolved as well. +Generally for R packages with C++ dependencies, +this requires either installing system packages, which you may not have privileges to do, +or building the C++ dependencies separately, +which introduces all sorts of additional ways for things to go wrong. + +Our goal is to make `install.packages("arrow")` "just work" for as many Linux distributions, +versions, and configurations as possible. +This document describes how it works and the options for fine-tuning Linux installation. +The intended audience for this document is `arrow` R package users on Linux, not developers. +If you're contributing to the Arrow project, see `vignette("developing", package = "arrow")` for guidance on setting up your development environment. + +Note also that if you use `conda` to manage your R environment, this document does not apply. +You can `conda install -c conda-forge --strict-channel-priority r-arrow` and you'll get the latest official +release of the R package along with any C++ dependencies. + +> Having trouble installing `arrow`? See the "Troubleshooting" section below. 
+ +# Installation basics + +Install the latest release of `arrow` from CRAN with + +```r +install.packages("arrow") +``` + +Daily development builds, which are not official releases, +can be installed from the Ursa Labs repository: + +```r +install.packages("arrow", repos = "https://arrow-r-nightly.s3.amazonaws.com") +``` + +or for conda users via: + +``` +conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow +``` + +You can also install the R package from a git checkout: + +```shell +git clone https://github.com/apache/arrow +cd arrow/r +R CMD INSTALL . +``` + +If you don't already have the Arrow C++ libraries on your system, +when installing the R package from source, it will also download and build +the Arrow C++ libraries for you. To speed installation up, you can set + +```shell +export LIBARROW_BINARY=true +``` + +to look for C++ binaries prebuilt for your Linux distribution/version. +Alternatively, you can set + +```shell +export LIBARROW_MINIMAL=false +``` + +to build the Arrow libraries from source with optional features such as compression libraries +enabled. This will increase the build time but provides many useful features. +Prebuilt binaries are built with this flag enabled, so you get the full +functionality by using them as well. + +Both of these variables are also set this way if you have the `NOT_CRAN=true` +environment variable set. + +## Helper function: install_arrow() + +If you already have `arrow` installed and want to upgrade to a different version, +install a development build, or try to reinstall and fix issues with Linux +C++ binaries, you can call `install_arrow()`. +`install_arrow()` provides some convenience wrappers around the various +environment variables described below. 
+This function is part of the `arrow` package, +and it is also available as a standalone script, so you can +access it for convenience without first installing the package: + +```r +source("https://raw.githubusercontent.com/apache/arrow/master/r/R/install-arrow.R") +``` + +`install_arrow()` will install from CRAN, +while `install_arrow(nightly = TRUE)` will give you a development build. +`install_arrow()` does not require environment variables to be set in order to +satisfy C++ dependencies. + +> Note that, unlike packages like `tensorflow`, `blogdown`, and others that require external dependencies, you do not need to run `install_arrow()` after a successful `arrow` installation. + +## Offline installation + +The `install-arrow.R` file also includes the `create_package_with_all_dependencies()` +function. Normally, when installing on a computer with internet access, the +build process will download third-party dependencies as needed. +This function provides a way to download them in advance. +Doing so may be useful when installing Arrow on a computer without internet access. +Note that Arrow _can_ be installed on a computer without internet access without doing this, but +many useful features will be disabled, as they depend on third-party components. +More precisely, `arrow::arrow_info()$capabilities()` will be `FALSE` for every +capability. +One approach to add more capabilities in an offline install is to prepare a +package with pre-downloaded dependencies. The +`create_package_with_all_dependencies()` function does this preparation. + +If you're using binary packages you shouldn't need to follow these steps. You +should download the appropriate binary from your package repository, transfer +that to the offline computer, and install that. Any OS can create the source +bundle, but it cannot be installed on Windows. (Instead, use a standard +Windows binary package.) 
+ +Note if you're using RStudio Package Manager on Linux: If you still want to +make a source bundle with this function, make sure to set the first repo in +`options("repos")` to be a mirror that contains source packages (that is: +something other than the RSPM binary mirror URLs). + +### Using a computer with internet access, pre-download the dependencies: +* Install the `arrow` package _or_ run + `source("https://raw.githubusercontent.com/apache/arrow/master/r/R/install-arrow.R")` +* Run `create_package_with_all_dependencies("my_arrow_pkg.tar.gz")` +* Copy the newly created `my_arrow_pkg.tar.gz` to the computer without internet access + +### On the computer without internet access, install the prepared package: +* Install the `arrow` package from the copied file + * `install.packages("my_arrow_pkg.tar.gz", dependencies = c("Depends", "Imports", "LinkingTo"))` + * This installation will build from source, so `cmake` must be available +* Run `arrow_info()` to check installed capabilities + +#### Alternative, hands-on approach +* Download the dependency files (`cpp/thirdparty/download_dependencies.sh` may be helpful) +* Copy the directory of dependencies to the offline computer +* Create the environment variable `ARROW_THIRDPARTY_DEPENDENCY_DIR` on the offline computer, pointing to the copied directory. +* Install the `arrow` package as usual. + +## S3 support + +The `arrow` package allows you to work with data in AWS S3 or in other cloud +storage systems that emulate S3. However, support for working with S3 is not +enabled in the default build, and it has additional system requirements. To +enable it, set the environment variable `LIBARROW_MINIMAL=false` or +`NOT_CRAN=true` to choose the full-featured build, or more selectively set +`ARROW_S3=ON`. 
You also need the following system dependencies: + +* `gcc` >= 4.9 or `clang` >= 3.3; note that the default compiler on CentOS 7 is gcc 4.8.5, which is not sufficient +* CURL: install `libcurl-devel` (rpm) or `libcurl4-openssl-dev` (deb) +* OpenSSL >= 1.0.2: install `openssl-devel` (rpm) or `libssl-dev` (deb) + +The prebuilt C++ binaries come with S3 support enabled, so you will need to meet +these system requirements in order to use them--the package will not install +without them. If you're building everything from source, the install script +will check for the presence of these dependencies and turn off S3 support in the +build if the prerequisites are not met--installation will succeed but without +S3 functionality. If afterwards you install the missing system requirements, +you'll need to reinstall the package in order to enable S3 support. + +# How dependencies are resolved + +In order for the `arrow` R package to work, it needs the Arrow C++ library. +There are a number of ways you can get it: a system package; a library you've +built yourself outside of the context of installing the R package; +or, if you don't already have it, the R package will attempt to resolve it +automatically when it installs. + +If you are authorized to install system packages and you're installing a CRAN release, +you may want to use the official Apache Arrow release packages corresponding to the R package version (though there are some drawbacks: see "Troubleshooting" below). +See the [Arrow project installation page](https://arrow.apache.org/install/) +to find pre-compiled binary packages for some common Linux distributions, +including Debian, Ubuntu, and CentOS. +You'll need to install `libparquet-dev` on Debian and Ubuntu, or `parquet-devel` on CentOS. +This will also automatically install the Arrow C++ library as a dependency. 
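As a rough illustration of the package names just mentioned, the sketch below maps a distribution ID to the corresponding development package and then reports any Arrow C++ library that `pkg-config` can already see. `arrow_dev_package` is a hypothetical helper, not something shipped with `arrow`; the sketch assumes the distribution is identified by the `ID` field of `/etc/os-release` and that the Arrow C++ library registers itself with `pkg-config` under the module name `arrow`.

```shell
# Hypothetical helper: map a distribution ID to the Arrow development
# package named in the text above.
arrow_dev_package() {
  case "$1" in
    debian|ubuntu) echo "libparquet-dev" ;;
    centos)        echo "parquet-devel" ;;
    *)             echo "unknown distribution: $1" >&2; return 1 ;;
  esac
}

# Detect the current distribution when /etc/os-release is readable.
if [ -r /etc/os-release ]; then
  distro_id="$(. /etc/os-release && echo "$ID")"
  arrow_dev_package "$distro_id" || true
fi

# Report the version of any system Arrow that pkg-config can find
# (assumes the pkg-config module is called "arrow").
if command -v pkg-config >/dev/null 2>&1; then
  pkg-config --modversion arrow 2>/dev/null || echo "no system Arrow found"
fi
```

The printed package name still has to be installed with the system package manager, after adding the Apache Arrow package repository described on the installation page linked above.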
+ +When you install the `arrow` R package on Linux, +it will first attempt to find the Arrow C++ libraries on your system using +the `pkg-config` command. +This will find either installed system packages or libraries you've built yourself. +In order for `install.packages("arrow")` to work with these system packages, +you'll need to install them before installing the R package. + +If no Arrow C++ libraries are found on the system, +the R package installation script will next attempt to download +prebuilt static Arrow C++ libraries +that match both your local operating system and `arrow` R package version. +C++ binaries will only be retrieved if you have set the environment variable +`LIBARROW_BINARY` or `NOT_CRAN`. +If found, they will be downloaded and bundled when your R package compiles. +For a list of supported distributions and versions, +see the [arrow-r-nightly](https://github.com/ursa-labs/arrow-r-nightly/blob/master/README.md) project. + +If no C++ library binary is found, it will attempt to build it locally. +First, it will also look to see if you are in +a checkout of the `apache/arrow` git repository and thus have the C++ source there. +Otherwise, it builds from the C++ files included in the package. +Depending on your system, building Arrow C++ from source may be slow. + +For the specific mechanics of how all this works, see the R package `configure` script, +which calls `tools/nixlibs.R`. + +If the C++ library is built from source, `inst/build_arrow_static.sh` is executed. +This build script is also what is used to generate the prebuilt binaries. + +## How the package is installed - advanced + +This subsection contains information which is likely to be relevant mostly +to Arrow developers and is not necessary for Arrow users to install Arrow. + +There are a number of scripts that are triggered when `R CMD INSTALL .` is run. +For Arrow users, these should all just work without configuration and pull in +the most complete pieces (e.g. 
 official binaries that we host). + +An overview of these scripts is shown below: + +* `configure` and `configure.win` - these scripts are triggered during +`R CMD INSTALL .` on non-Windows and Windows platforms, respectively. They +handle finding the Arrow library, setting up the build variables necessary, and +writing the package Makevars file that is used to compile the C++ code in the R +package. + +* `tools/nixlibs.R` - this script is sometimes called by `configure` on Linux +(or on any non-Windows OS with the environment variable +`FORCE_BUNDLED_BUILD=true`). This sets up the build process for our bundled +builds (which is the default on Linux). The operative logic is at the end of +the script, but it will do the following (it stops at the first step +that succeeds, and some of the steps are only checked if they are enabled via an +environment variable): + * Check if there is an already built libarrow in `arrow/r/libarrow-{version}`, + use that to link against if it exists. + * Check if a binary is available from our hosted unofficial builds. + * Download the Arrow source and build the Arrow Library from source. + * `*** Proceed without C++` dependencies (this is an error and the package + will not work, but if you see this message you know the previous steps have + not succeeded/were not enabled) + +* `inst/build_arrow_static.sh` - called by `tools/nixlibs.R` when the Arrow +library is being built. It builds Arrow for a bundled, static build, and +mirrors the steps described in the ["Arrow R Developer Guide" vignette](./developing.html). + +# Troubleshooting + +The intent is that `install.packages("arrow")` will just work and handle all C++ +dependencies, but depending on your system, you may have better results if you +tune one of several parameters. Here are some known complications and ways to address them. 
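When deciding which parameter to tune, it helps to know which resolution steps your current environment enables. The following is only a sketch of the order described under "How dependencies are resolved", not the actual logic of `configure` or `tools/nixlibs.R`: the function names `lower` and `arrow_resolution_steps` are hypothetical, and the sketch looks only at the four environment variables named in this document.

```shell
# Lowercase a value so boolean variables compare case-insensitively.
lower() { printf '%s' "$1" | tr '[:upper:]' '[:lower:]'; }

# Hypothetical helper: print, in order, the steps that the current
# environment variables leave enabled.
arrow_resolution_steps() {
  local use_pkg binary build not_cran steps=""
  use_pkg="$(lower "${ARROW_USE_PKG_CONFIG:-true}")"
  binary="$(lower "${LIBARROW_BINARY:-}")"
  build="$(lower "${LIBARROW_BUILD:-true}")"
  not_cran="$(lower "${NOT_CRAN:-false}")"

  # NOT_CRAN=true enables binary downloads unless LIBARROW_BINARY is set.
  if [ -z "$binary" ] && [ "$not_cran" = "true" ]; then binary="true"; fi

  # 1. System libraries found through pkg-config.
  if [ "$use_pkg" != "false" ]; then steps="$steps pkg-config"; fi
  # 2. Prebuilt binary ("true" or a distro-version string both enable it).
  if [ -n "$binary" ] && [ "$binary" != "false" ]; then steps="$steps binary"; fi
  # 3. Build the C++ library from source.
  if [ "$build" != "false" ]; then steps="$steps source"; fi

  echo "${steps# }"
}

arrow_resolution_steps
```

With none of the variables set, this prints `pkg-config source`: system libraries are tried first, and a source build is the fallback.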
+ +## Package failed to build C++ dependencies + +If you see a message like + +``` +------------------------- NOTE --------------------------- +There was an issue preparing the Arrow C++ libraries. +See https://arrow.apache.org/docs/r/articles/install.html +--------------------------------------------------------- +``` + +in the output when the package fails to install, +that means that installation failed to retrieve or build C++ libraries +compatible with the current version of the R package. + +It is expected that C++ dependencies should be built successfully +on all Linux distributions, so you should not see this message. If you do, +please check the "Known installation issues" below to see if any apply. +If none apply, set the environment variable `ARROW_R_DEV=TRUE` +so that details on what failed are shown, and try installing again. Then, +please [report an issue](https://issues.apache.org/jira/projects/ARROW/issues) +and include the full verbose installation output. + +## Using system libraries + +If a system library or other installed Arrow is found but it doesn't match the R package version +(for example, you have libarrow 1.0.0 on your system and are installing R package 2.0.0), +it is likely that the R bindings will fail to compile. +Because the Apache Arrow project is under active development, +it is essential that versions of the C++ and R libraries match. +When `install.packages("arrow")` has to download the C++ libraries, +the install script ensures that you fetch the C++ libraries that correspond to your R package version. +However, if you are using Arrow libraries already on your system, version match isn't guaranteed. + +To fix version mismatch, you can either update your system packages to match the R package version, +or set the environment variable `ARROW_USE_PKG_CONFIG=FALSE` +to tell the configure script not to look for system Arrow packages. +(The latter is the default of `install_arrow()`.) 
+System packages are available corresponding to all CRAN releases +but not for nightly or dev versions, so depending on the R package version you're installing, +system packages may not be an option. + +Note also that once you have a working R package installation based on system (shared) libraries, +if you update your system Arrow, you'll need to reinstall the R package to match its version. +Similarly, if you're using Arrow system libraries, running `update.packages()` +after a new release of the `arrow` package will likely fail unless you first +update the system packages. + +## Using prebuilt binaries + +If the R package finds and downloads a prebuilt binary of the C++ library, +but then the `arrow` package can't be loaded, perhaps with "undefined symbols" errors, +please [report an issue](https://issues.apache.org/jira/projects/ARROW/issues). +This is likely a compiler mismatch and may be resolvable by setting some +environment variables to instruct R to compile the packages to match the C++ library. + +A workaround would be to set the environment variable `LIBARROW_BINARY=FALSE` +and retry installation: this value instructs the package to build the C++ library from source +instead of downloading the prebuilt binary. +That should guarantee that the compiler settings match. + +If a prebuilt binary wasn't found for your operating system but you think it should have been, +check the logs for a message that says `*** Unable to identify current OS/version`, +or a message that says `*** No C++ binaries found for` an invalid OS. +If you see either, please [report an issue](https://issues.apache.org/jira/projects/ARROW/issues). +You may also set the environment variable `ARROW_R_DEV=TRUE` for additional +debug messages. + +A workaround would be to set the environment variable `LIBARROW_BINARY` +to a `distribution-version` that exists in the Ursa Labs repository. 
+Setting `LIBARROW_BINARY` is also an option when there's not an exact match +for your OS but a similar version would work, +such as if you're on `ubuntu-18.10` and there's only a binary for `ubuntu-18.04`. + +If that workaround works for you, and you believe that it should work for everyone else too, +you may propose [adding an entry to this lookup table](https://github.com/ursa-labs/arrow-r-nightly/edit/master/linux/distro-map.csv). +This table is checked during the installation process +and tells the script to use binaries built on a different operating system/version +because they're known to work. + +## Building C++ from source + +If building the C++ library from source fails, check the error message. +(If you don't see an error message, only the `----- NOTE -----`, +set the environment variable `ARROW_R_DEV=TRUE` to increase verbosity and retry installation.) +The install script should work everywhere, so if the C++ library fails to compile, +please [report an issue](https://issues.apache.org/jira/projects/ARROW/issues) +so that we can improve the script. + +## Known installation issues + +* On CentOS, if you are using a more modern `devtoolset`, you may need to set +the environment variables `CC` and `CXX` either in the shell or in R's `Makeconf`. +For CentOS 7 and above, both the Arrow system packages and the C++ binaries +for R are built with the default system compilers. If you want to use either of these +and you have a `devtoolset` installed, set `CC=/usr/bin/gcc CXX=/usr/bin/g++` +to use the system compilers instead of the `devtoolset`. +Alternatively, if you want to build `arrow` with the newer `devtoolset` compilers, +set both `ARROW_USE_PKG_CONFIG` and `LIBARROW_BINARY` to `false` so that +you build the Arrow C++ from source using those compilers. +Compiler mismatch between the arrow system libraries and the R +package may cause R to segfault when `arrow` package functions are used. 
+See discussions [here](https://issues.apache.org/jira/browse/ARROW-8586) +and [here](https://issues.apache.org/jira/browse/ARROW-10780). + +* If you have multiple versions of `zstd` installed on your system, +installation by building the C++ from source may fail with an undefined symbols +error. Workarounds include (1) setting `LIBARROW_BINARY` to use a C++ binary; (2) +setting `ARROW_WITH_ZSTD=OFF` to build without `zstd`; or (3) uninstalling +the conflicting `zstd`. +See discussion [here](https://issues.apache.org/jira/browse/ARROW-8556). + +## Summary of build environment variables + +Some features are optional when you build Arrow from source. With the exception of `ARROW_S3`, these are all `ON` by default in the bundled C++ build, but you can set them to `OFF` to disable them. + +* `ARROW_S3`: If set to `ON` S3 support will be built as long as the + dependencies are met; if they are not met, the build script will turn this `OFF` +* `ARROW_JEMALLOC` for the `jemalloc` memory allocator +* `ARROW_MIMALLOC` for the `mimalloc` memory allocator +* `ARROW_PARQUET` +* `ARROW_DATASET` +* `ARROW_JSON` for the JSON parsing library +* `ARROW_WITH_RE2` for the RE2 regular expression library, used in some string compute functions +* `ARROW_WITH_UTF8PROC` for the UTF8Proc string library, used in many other string compute functions +* `ARROW_WITH_BROTLI`, `ARROW_WITH_BZ2`, `ARROW_WITH_LZ4`, `ARROW_WITH_SNAPPY`, `ARROW_WITH_ZLIB`, and `ARROW_WITH_ZSTD` for various compression algorithms + + +There are a number of other variables that affect the `configure` script and the bundled build script. +By default, these are all unset. All boolean variables are case-insensitive. + +* `ARROW_USE_PKG_CONFIG`: If set to `false`, the configure script + won't look for Arrow libraries on your system and instead will look to download/build them. 
+ Use this if you have a version mismatch between installed system libraries + and the version of the R package you're installing. +* `LIBARROW_BINARY`: If set to `true`, the script will try to download a binary + C++ library built for your operating system. + You may also set it to some other string, + a related "distro-version" that has binaries built that work for your OS. + If no binary is found, installation will fall back to building C++ + dependencies from source. +* `LIBARROW_BUILD`: If set to `false`, the build script + will not attempt to build the C++ from source. This means you will only get + a working `arrow` R package if a prebuilt binary is found. + Use this if you want to avoid compiling the C++ library, which may be slow + and resource-intensive, and ensure that you only use a prebuilt binary. +* `LIBARROW_MINIMAL`: If set to `false`, the build script + will enable some optional features, including compression libraries, S3 + support, and additional alternative memory allocators. This will increase the + source build time but results in a more fully functional library. +* `NOT_CRAN`: If this variable is set to `true`, as the `devtools` package does, + the build script will set `LIBARROW_BINARY=true` and `LIBARROW_MINIMAL=false` + unless those environment variables are already set. This provides for a more + complete and fast installation experience for users who already have + `NOT_CRAN=true` as part of their workflow, without requiring additional + environment variables to be set. +* `ARROW_R_DEV`: If set to `true`, more verbose messaging will be printed + in the build script. `arrow::install_arrow(verbose = TRUE)` sets this. + This variable also is needed if you're modifying C++ + code in the package: see the developer guide vignette. +* `LIBARROW_DEBUG_DIR`: If the C++ library building from source fails (`cmake`), + there may be messages telling you to check some log file in the build directory. 
+ However, when the library is built during R package installation, + that location is in a temp directory that is already deleted. + To capture those logs, set this variable to an absolute (not relative) path + and the log files will be copied there. + The directory will be created if it does not exist. +* `CMAKE`: When building the C++ library from source, you can specify a + `/path/to/cmake` to use a different version than whatever is found on the `$PATH` + +# Contributing + +As mentioned above, please [report an issue](https://issues.apache.org/jira/projects/ARROW/issues) +if you encounter ways to improve this. If you find that your Linux distribution +or version is not supported, we welcome the contribution of Docker images +(hosted on Docker Hub) that we can use in our continuous integration. These +Docker images should be minimal, containing only R and the dependencies it +requires. (For reference, see the images that +[R-hub](https://github.com/r-hub/rhub-linux-builders) uses.) + +You can test the `arrow` R package installation using the `docker-compose` +setup included in the `apache/arrow` git repository. For example, + +``` +R_ORG=rhub R_IMAGE=ubuntu-gcc-release R_TAG=latest docker-compose build r +R_ORG=rhub R_IMAGE=ubuntu-gcc-release R_TAG=latest docker-compose run r +``` + +installs the `arrow` R package, including the C++ source build, on the +[rhub/ubuntu-gcc-release](https://hub.docker.com/r/rhub/ubuntu-gcc-release) +image. 
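When preparing an issue report or testing a new image, several of the environment variables summarized earlier are typically combined. The sketch below sets up a verbose, from-source build whose CMake logs are kept. `arrow_debug_env` is a hypothetical helper name, and the particular combination of variables is just one reasonable choice, not an official recipe.

```shell
# Hypothetical helper: export variables for a verbose source build.
# $1 is the log directory; it is made absolute because LIBARROW_DEBUG_DIR
# must be an absolute path.
arrow_debug_env() {
  local dir="${1:-arrow-build-logs}"
  case "$dir" in
    /*) ;;                     # already absolute
    *)  dir="$PWD/$dir" ;;     # resolve relative to the current directory
  esac
  export ARROW_R_DEV=true            # more verbose build messages
  export ARROW_USE_PKG_CONFIG=false  # ignore system Arrow libraries
  export LIBARROW_BINARY=false       # build the C++ library from source
  export LIBARROW_MINIMAL=false      # enable optional features
  export LIBARROW_DEBUG_DIR="$dir"   # keep the build logs here
}

arrow_debug_env arrow-build-logs
echo "build logs will be copied to $LIBARROW_DEBUG_DIR"
```

After running this, installing from the same shell, for example with `Rscript -e 'install.packages("arrow")'`, picks up these settings.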
diff --git a/src/arrow/r/vignettes/python.Rmd b/src/arrow/r/vignettes/python.Rmd new file mode 100644 index 000000000..c05ee7dc7 --- /dev/null +++ b/src/arrow/r/vignettes/python.Rmd @@ -0,0 +1,131 @@ +--- +title: "Apache Arrow in Python and R with reticulate" +output: rmarkdown::html_vignette +vignette: > + %\VignetteIndexEntry{Apache Arrow in Python and R with reticulate} + %\VignetteEngine{knitr::rmarkdown} + %\VignetteEncoding{UTF-8} +--- + +The `arrow` package provides `reticulate` methods for passing data between +R and Python in the same process. This document provides a brief overview. + +## Installing + +To use `arrow` in Python, at a minimum you'll need the `pyarrow` library. +To install it in a virtualenv, + +```r +library(reticulate) +virtualenv_create("arrow-env") +install_pyarrow("arrow-env") +``` + +If you want to install a development version of `pyarrow`, +add `nightly = TRUE`: + +```r +install_pyarrow("arrow-env", nightly = TRUE) +``` + +`install_pyarrow()` also works with `conda` environments +(`conda_create()` instead of `virtualenv_create()`). + +For more on installing and configuring Python, +see the [reticulate docs](https://rstudio.github.io/reticulate/articles/python_packages.html). + +## Using + +To start, load `arrow` and `reticulate`, and then import `pyarrow`. + +```r +library(arrow) +library(reticulate) +use_virtualenv("arrow-env") +pa <- import("pyarrow") +``` + +The package includes support for sharing Arrow `Array` and `RecordBatch` +objects in-process between R and Python. For example, let's create an `Array` +in `pyarrow`. + +```r +a <- pa$array(c(1, 2, 3)) +a + +## Array +## <double> +## [ +## 1, +## 2, +## 3 +## ] +``` + +`a` is now an `Array` object in our R session, even though we created it in Python. +We can apply R methods on it: + +```r +a[a > 1] + +## Array +## <double> +## [ +## 2, +## 3 +## ] +``` + +We can send data both ways. 
 One reason we might want to use `pyarrow` in R is +to take advantage of functionality that is better supported in Python than in R. +For example, `pyarrow` has a `concat_arrays` function, but as of 0.17, this +function is not implemented in the `arrow` R package. We can use `reticulate` +to use it efficiently. + +```r +b <- Array$create(c(5, 6, 7, 8, 9)) +a_and_b <- pa$concat_arrays(list(a, b)) +a_and_b + +## Array +## <double> +## [ +## 1, +## 2, +## 3, +## 5, +## 6, +## 7, +## 8, +## 9 +## ] +``` + +Now we have a single `Array` in R. + +"Send", however, isn't the correct word. Internally, we're passing pointers to +the data between the R and Python interpreters running together in the same +process, without copying anything. Nothing is being sent: we're sharing and +accessing the same internal Arrow memory buffers. + +## Troubleshooting + +If you get an error like + +``` +Error in py_get_attr_impl(x, name, silent) : + AttributeError: 'pyarrow.lib.DoubleArray' object has no attribute '_export_to_c' +``` + +it means that the version of `pyarrow` you're using is too old. +Support for passing data to and from R is included in versions 0.17 and greater. +Check your pyarrow version like this: + +```r +pa$`__version__` + +## [1] "0.16.0" +``` + +Note that your `pyarrow` and `arrow` versions don't need themselves to match: +they just need to be 0.17 or greater.
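The same version requirement can be checked outside of R. The following is a minimal sketch: `version_at_least` is a hypothetical helper that compares only the major and minor components, and the check assumes that `python3` on the `PATH` is the interpreter of the environment that `reticulate` uses.

```shell
# Hypothetical helper: succeed when version $1 is at least MAJOR.MINOR $2.
version_at_least() {
  local have_major have_minor want_major want_minor rest
  have_major="${1%%.*}"
  case "$1" in
    *.*) rest="${1#*.}"; have_minor="${rest%%.*}" ;;
    *)   have_minor=0 ;;
  esac
  want_major="${2%%.*}"
  want_minor="${2#*.}"
  [ "$have_major" -gt "$want_major" ] ||
    { [ "$have_major" -eq "$want_major" ] && [ "$have_minor" -ge "$want_minor" ]; }
}

# Query pyarrow only if a Python interpreter is available.
if command -v python3 >/dev/null 2>&1; then
  v="$(python3 -c 'import pyarrow; print(pyarrow.__version__)' 2>/dev/null)"
  if [ -z "$v" ]; then
    echo "pyarrow is not installed for python3"
  elif version_at_least "$v" 0.17; then
    echo "pyarrow $v can share data with R"
  else
    echo "pyarrow $v is too old; 0.17 or greater is needed"
  fi
fi
```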