# arrow

[![cran](https://www.r-pkg.org/badges/version-last-release/arrow)](https://cran.r-project.org/package=arrow)
[![CI](https://github.com/apache/arrow/workflows/R/badge.svg?event=push)](https://github.com/apache/arrow/actions?query=workflow%3AR+branch%3Amaster+event%3Apush)
[![conda-forge](https://img.shields.io/conda/vn/conda-forge/r-arrow.svg)](https://anaconda.org/conda-forge/r-arrow)

**[Apache Arrow](https://arrow.apache.org/) is a cross-language
development platform for in-memory data.** It specifies a standardized
language-independent columnar memory format for flat and hierarchical
data, organized for efficient analytic operations on modern hardware. It
also provides computational libraries and zero-copy streaming messaging
and interprocess communication.

**The `arrow` package exposes an interface to the Arrow C++ library,
enabling access to many of its features in R.** It provides low-level
access to the Arrow C++ library API and higher-level access through a
`dplyr` backend and familiar R functions.

## What can the `arrow` package do?

-   Read and write **Parquet files** (`read_parquet()`,
    `write_parquet()`), an efficient and widely used columnar format
-   Read and write **Feather files** (`read_feather()`,
    `write_feather()`), a format optimized for speed and
    interoperability
-   Analyze, process, and write **multi-file, larger-than-memory
    datasets** (`open_dataset()`, `write_dataset()`)
-   Read **large CSV and JSON files** with excellent **speed and
    efficiency** (`read_csv_arrow()`, `read_json_arrow()`)
-   Write CSV files (`write_csv_arrow()`)
-   Manipulate and analyze Arrow data with **`dplyr` verbs**
-   Read and write files in **Amazon S3** buckets with no additional
    function calls
-   Exercise **fine control over column types** for seamless
    interoperability with databases and data warehouse systems
-   Use **compression codecs** including Snappy, gzip, Brotli,
    Zstandard, LZ4, LZO, and bzip2 for reading and writing data
-   Enable **zero-copy data sharing** between **R and Python**
-   Connect to **Arrow Flight** RPC servers to send and receive large
    datasets over networks
-   Access and manipulate Arrow objects through **low-level bindings**
    to the C++ library
-   Provide a **toolkit for building connectors** to other applications
    and services that use Arrow

## Installation

### Installing the latest release version

Install the latest release of `arrow` from CRAN with

``` r
install.packages("arrow")
```

Conda users can install `arrow` from conda-forge with

``` shell
conda install -c conda-forge --strict-channel-priority r-arrow
```

Installing a released version of the `arrow` package requires no
additional system dependencies. For macOS and Windows, CRAN hosts binary
packages that contain the Arrow C++ library. On Linux, source package
installation will also build necessary C++ dependencies. For a faster,
more complete installation, set the environment variable
`NOT_CRAN=true`. See `vignette("install", package = "arrow")` for
details.
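For example, on Linux you can opt in to the fuller build by setting the variable before installing (a minimal sketch; the exact set of optional features enabled still depends on your system, as described in the installation vignette):

``` r
# Opt in to the faster, more complete Linux build described above
Sys.setenv(NOT_CRAN = "true")
install.packages("arrow")
```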

For Windows users of R 3.6 and earlier, note that support for AWS S3 is not
available, and the 32-bit version does not support Arrow Datasets.
These features are only supported by the `rtools40` toolchain on Windows
and thus are only available in R >= 4.0.

### Installing a development version

Development versions of the package (binary and source) are built
nightly and hosted at <https://arrow-r-nightly.s3.amazonaws.com>. To
install from there:

``` r
install.packages("arrow", repos = "https://arrow-r-nightly.s3.amazonaws.com")
```

Conda users can install `arrow` nightly builds with

``` shell
conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
```

If you already have a version of `arrow` installed, you can switch to
the latest nightly development version with

``` r
arrow::install_arrow(nightly = TRUE)
```

These nightly package builds are not official Apache releases and are
not recommended for production use. They may be useful for testing bug
fixes and new features under active development.

## Usage

Among the many applications of the `arrow` package, two of the most accessible are:

-   High-performance reading and writing of data files with multiple
    file formats and compression codecs, including built-in support for
    cloud storage
-   Analyzing and manipulating bigger-than-memory data with `dplyr`
    verbs

The sections below describe these two uses and illustrate them with
basic examples. They mention two Arrow data structures:

-   `Table`: a tabular, column-oriented data structure capable of
    storing and processing large amounts of data more efficiently than
    R’s built-in `data.frame` and with SQL-like column data types that
    afford better interoperability with databases and data warehouse
    systems
-   `Dataset`: a data structure functionally similar to `Table` but with
    the capability to work on larger-than-memory data partitioned across
    multiple files

### Reading and writing data files with `arrow`

The `arrow` package provides functions for reading single data files in
several common formats. By default, calling any of these functions
returns an R `data.frame`. To return an Arrow `Table` instead, set the
argument `as_data_frame = FALSE`.

-   `read_parquet()`: read a file in Parquet format
-   `read_feather()`: read a file in Feather format (the Apache Arrow
    IPC format)
-   `read_delim_arrow()`: read a delimited text file (default delimiter
    is comma)
-   `read_csv_arrow()`: read a comma-separated values (CSV) file
-   `read_tsv_arrow()`: read a tab-separated values (TSV) file
-   `read_json_arrow()`: read a JSON data file

For writing data to single files, the `arrow` package provides the
functions `write_parquet()`, `write_feather()`, and `write_csv_arrow()`. 
These can be used with R `data.frame` and Arrow `Table` objects.

For example, let’s write the Star Wars characters data that’s included
in `dplyr` to a Parquet file, then read it back in. Parquet is a popular
choice for storing analytic data; it is optimized for reduced file sizes
and fast read performance, especially for column-based access patterns.
Parquet is widely supported by many tools and platforms.

First load the `arrow` and `dplyr` packages:

``` r
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
```

Then write the `data.frame` named `starwars` to a Parquet file at
`file_path`:

``` r
file_path <- tempfile()
write_parquet(starwars, file_path)
```

Then read the Parquet file into an R `data.frame` named `sw`:

``` r
sw <- read_parquet(file_path)
```

R object attributes are preserved when writing data to Parquet or
Feather files and when reading those files back into R. This enables
round-trip writing and reading of `sf::sf` objects, R `data.frame`s
with `haven::labelled` columns, and `data.frame`s with other custom
attributes.
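As a minimal sketch of this round-trip behavior (the attribute name `my_note` is purely illustrative):

``` r
library(arrow)

df <- data.frame(x = 1:3)
attr(df, "my_note") <- "kept across the round trip"  # custom attribute

tf <- tempfile()
write_parquet(df, tf)

# The custom attribute is restored on read, as described above
attr(read_parquet(tf), "my_note")
```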

For reading and writing larger files or sets of multiple files, `arrow`
defines `Dataset` objects and provides the functions `open_dataset()`
and `write_dataset()`, which enable analysis and processing of
bigger-than-memory data, including the ability to partition data into
smaller chunks without loading the full data into memory. For examples
of these functions, see `vignette("dataset", package = "arrow")`.
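A minimal sketch of the `Dataset` workflow, partitioning the `starwars` data from `dplyr` by `species` and querying it lazily:

``` r
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

# Write one directory of Parquet files per species
ds_path <- tempfile()
write_dataset(starwars, ds_path, partitioning = "species")

# open_dataset() scans file metadata only; rows are read at collect()
open_dataset(ds_path) %>%
  filter(homeworld == "Naboo") %>%
  select(name, species) %>%
  collect()
```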

All these functions can read and write files in the local filesystem or
in Amazon S3 (by passing S3 URIs beginning with `s3://`). For more
details, see `vignette("fs", package = "arrow")`.
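For example, the same read and write functions accept S3 URIs directly (the bucket and object paths below are hypothetical, and S3 support must be enabled in your build of `arrow`):

``` r
library(arrow)

# Hypothetical bucket and paths, shown for illustration only
df <- read_parquet("s3://my-bucket/raw/starwars.parquet")
write_dataset(df, "s3://my-bucket/partitioned/", partitioning = "species")
```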

### Using `dplyr` with `arrow`

The `arrow` package provides a `dplyr` backend enabling manipulation of
Arrow tabular data with `dplyr` verbs. To use it, first load both
packages `arrow` and `dplyr`. Then load data into an Arrow `Table` or
`Dataset` object. For example, read the Parquet file written in the
previous example into an Arrow `Table` named `sw`:

``` r
sw <- read_parquet(file_path, as_data_frame = FALSE)
```

Next, pipe on `dplyr` verbs:

``` r
result <- sw %>%
  filter(homeworld == "Tatooine") %>%
  rename(height_cm = height, mass_kg = mass) %>%
  mutate(height_in = height_cm / 2.54, mass_lbs = mass_kg * 2.2046) %>%
  arrange(desc(birth_year)) %>%
  select(name, height_in, mass_lbs)
```

The `arrow` package uses lazy evaluation to delay computation until the
result is required. This speeds up processing by enabling the Arrow C++
library to perform multiple computations in one operation. `result` is
an object with class `arrow_dplyr_query` which represents all the
computations to be performed:

``` r
result
#> Table (query)
#> name: string
#> height_in: expr
#> mass_lbs: expr
#>
#> * Filter: equal(homeworld, "Tatooine")
#> * Sorted by birth_year [desc]
#> See $.data for the source Arrow object
```

To perform these computations and materialize the result, call
`compute()` or `collect()`. `compute()` returns an Arrow `Table`,
suitable for passing to other `arrow` or `dplyr` functions:

``` r
result %>% compute()
#> Table
#> 10 rows x 3 columns
#> $name <string>
#> $height_in <double>
#> $mass_lbs <double>
```

`collect()` returns an R `data.frame`, suitable for viewing or passing
to other R functions for analysis or visualization:

``` r
result %>% collect()
#> # A tibble: 10 x 3
#>    name               height_in mass_lbs
#>    <chr>                  <dbl>    <dbl>
#>  1 C-3PO                   65.7    165.
#>  2 Cliegg Lars             72.0     NA
#>  3 Shmi Skywalker          64.2     NA
#>  4 Owen Lars               70.1    265.
#>  5 Beru Whitesun lars      65.0    165.
#>  6 Darth Vader             79.5    300.
#>  7 Anakin Skywalker        74.0    185.
#>  8 Biggs Darklighter       72.0    185.
#>  9 Luke Skywalker          67.7    170.
#> 10 R5-D4                   38.2     70.5
```

The `arrow` package works with most single-table `dplyr` verbs, including those
that compute aggregates.

```r
sw %>%
  group_by(species) %>%
  summarise(mean_height = mean(height, na.rm = TRUE)) %>%
  collect()
```

Additionally, equality joins (e.g. `left_join()`, `inner_join()`) are supported
for joining multiple tables. 

```r
jedi <- data.frame(
  name = c("C-3PO", "Luke Skywalker", "Obi-Wan Kenobi"),
  jedi = c(FALSE, TRUE, TRUE)
)

sw %>%
  select(1:11) %>%
  right_join(jedi) %>%
  collect()
```

Window functions (e.g. `ntile()`) are not yet
supported. Inside `dplyr` verbs, Arrow offers support for many functions and
operators, with common functions mapped to their base R and tidyverse
equivalents. The [changelog](https://arrow.apache.org/docs/r/news/index.html)
lists many of them. If there are additional functions you would like to see
implemented, please file an issue as described in the [Getting
help](#getting-help) section below.

For `dplyr` queries on `Table` objects, if the `arrow` package detects
an unimplemented function within a `dplyr` verb, it automatically calls
`collect()` to return the data as an R `data.frame` before processing
that `dplyr` verb. For queries on `Dataset` objects (which can be larger
than memory), it raises an error if the function is unimplemented;
you must explicitly call `collect()` first.
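For example, with a user-defined helper that `arrow` has no binding for (the function `first_word()` below is hypothetical), an explicit `collect()` pulls the data into R before the verb that uses it:

``` r
# A helper with no arrow binding, so it cannot run in the C++ engine
first_word <- function(x) vapply(strsplit(x, " "), `[`, character(1), 1)

sw %>%            # for a Dataset query, collect() must come first
  collect() %>%   # materialize the data as an R data.frame
  mutate(given = first_word(name))
```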

### Additional features

Other applications of `arrow` are described in the following vignettes:

-   `vignette("python", package = "arrow")`: use `arrow` and
    `reticulate` to pass data between R and Python
-   `vignette("flight", package = "arrow")`: connect to Arrow Flight RPC
    servers to send and receive data
-   `vignette("arrow", package = "arrow")`: access and manipulate Arrow
    objects through low-level bindings to the C++ library

## Getting help

If you encounter a bug, please file an issue with a minimal reproducible
example on the [Apache Jira issue
tracker](https://issues.apache.org/jira/projects/ARROW/issues). Create
an account or log in, then click **Create** to file an issue. Select the
project **Apache Arrow (ARROW)**, select the component **R**, and begin
the issue summary with **`[R]`** followed by a space. For more
information, see the **Report bugs and propose features** section of the
[Contributing to Apache
Arrow](https://arrow.apache.org/docs/developers/contributing.html) page
in the Arrow developer documentation.

We welcome questions, discussion, and contributions from users of the
`arrow` package. For information about mailing lists and other venues
for engaging with the Arrow developer and user communities, please see
the [Apache Arrow Community](https://arrow.apache.org/community/) page.

------------------------------------------------------------------------

All participation in the Apache Arrow project is governed by the Apache
Software Foundation’s [code of
conduct](https://www.apache.org/foundation/policies/conduct.html).