---
title: "Apache Arrow in Python and R with reticulate"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Apache Arrow in Python and R with reticulate}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

The `arrow` package provides `reticulate` methods for passing data between
R and Python in the same process. This document provides a brief overview.

## Installing

To use `arrow` in Python, at a minimum you'll need the `pyarrow` library.
To install it in a virtualenv:

```r
library(reticulate)
virtualenv_create("arrow-env")
install_pyarrow("arrow-env")
```

If you want to install a development version of `pyarrow`,
add `nightly = TRUE`:

```r
install_pyarrow("arrow-env", nightly = TRUE)
```

`install_pyarrow()` also works with `conda` environments
(use `conda_create()` instead of `virtualenv_create()`).

For more on installing and configuring Python,
see the [reticulate docs](https://rstudio.github.io/reticulate/articles/python_packages.html).

## Using

To start, load `arrow` and `reticulate`, and then import `pyarrow`.

```r
library(arrow)
library(reticulate)
use_virtualenv("arrow-env")
pa <- import("pyarrow")
```

The package includes support for sharing Arrow `Array` and `RecordBatch`
objects in-process between R and Python. For example, let's create an `Array`
in `pyarrow`.

```r
a <- pa$array(c(1, 2, 3))
a

## Array
## <double>
## [
##   1,
##   2,
##   3
## ]
```

`a` is now an `Array` object in our R session, even though we created it in Python.
We can apply R methods to it:

```r
a[a > 1]

## Array
## <double>
## [
##   2,
##   3
## ]
```

We can send data both ways. One reason we might want to use `pyarrow` in R is
to take advantage of functionality that is better supported in Python than in R.
For example, `pyarrow` has a `concat_arrays` function, but as of 0.17, this
function is not implemented in the `arrow` R package. With `reticulate`, we can
call it efficiently from R.

```r
b <- Array$create(c(5, 6, 7, 8, 9))
a_and_b <- pa$concat_arrays(list(a, b))
a_and_b

## Array
## <double>
## [
##   1,
##   2,
##   3,
##   5,
##   6,
##   7,
##   8,
##   9
## ]
```

Now we have a single `Array` in R.

"Send", however, isn't the correct word. Internally, we're passing pointers to
the data between the R and Python interpreters running together in the same
process, without copying anything. Nothing is being sent: we're sharing and
accessing the same internal Arrow memory buffers.

## Troubleshooting

If you get an error like

```
Error in py_get_attr_impl(x, name, silent) :
  AttributeError: 'pyarrow.lib.DoubleArray' object has no attribute '_export_to_c'
```

it means that the version of `pyarrow` you're using is too old.
Support for passing data to and from R is included in versions 0.17 and greater.
Check your `pyarrow` version like this:

```r
pa$`__version__`

## [1] "0.16.0"
```

Note that your `pyarrow` and `arrow` versions don't need to match each other:
both just need to be 0.17 or greater.
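
The 0.17 requirement above can also be checked programmatically rather than by eye. The following is a minimal sketch using base R's `package_version()`; the helper name `supports_r_interop()` and its `minimum` argument are hypothetical and not part of `arrow` or `reticulate`. It assumes a plain release version string such as `"0.16.0"`; strings with a development suffix may not parse with `package_version()`.

```r
# Hypothetical helper (not part of arrow or reticulate):
# returns TRUE when a version string meets the 0.17 minimum
# stated in this section.
supports_r_interop <- function(version_string, minimum = "0.17") {
  # package_version() compares component by component,
  # so "0.9" sorts before "0.17", unlike a string comparison
  package_version(version_string) >= package_version(minimum)
}

supports_r_interop("0.16.0")

## [1] FALSE

supports_r_interop("0.17.1")

## [1] TRUE
```

In a live session, the same check could be applied to the imported module with ``supports_r_interop(pa$`__version__`)``, assuming `pa` was created with `import("pyarrow")` as shown earlier.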