---
title: "Apache Arrow in Python and R with reticulate"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Apache Arrow in Python and R with reticulate}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

The `arrow` package provides `reticulate` methods for passing data between
R and Python in the same process. This document provides a brief overview.

## Installing

To use `arrow` in Python, at a minimum you'll need the `pyarrow` library.
To install it in a virtualenv:

```r
library(arrow)  # provides install_pyarrow()
library(reticulate)
virtualenv_create("arrow-env")
install_pyarrow("arrow-env")
```

If you want to install a development version of `pyarrow`,
add `nightly = TRUE`:

```r
install_pyarrow("arrow-env", nightly = TRUE)
```

`install_pyarrow()` also works with `conda` environments:
create the environment with `conda_create()` instead of `virtualenv_create()`.
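
As a rough sketch of the conda route, the steps might look like the following. The environment name `"arrow-conda"` is a hypothetical choice, picked only so that it does not collide with the virtualenv created earlier, and the sketch assumes a conda installation that `reticulate` can find.

```r
library(arrow)       # provides install_pyarrow()
library(reticulate)  # provides conda_create() and use_condaenv()

# "arrow-conda" is a hypothetical environment name, not one the package requires.
conda_create("arrow-conda")
install_pyarrow("arrow-conda")

# In later sessions, activate it with use_condaenv() rather than use_virtualenv().
use_condaenv("arrow-conda")
```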

For more on installing and configuring Python,
see the [reticulate docs](https://rstudio.github.io/reticulate/articles/python_packages.html).

## Using

To start, load `arrow` and `reticulate`, and then import `pyarrow`.

```r
library(arrow)
library(reticulate)
use_virtualenv("arrow-env")
pa <- import("pyarrow")
```

The package includes support for sharing Arrow `Array` and `RecordBatch`
objects in-process between R and Python. For example, let's create an `Array`
in `pyarrow`.

```r
a <- pa$array(c(1, 2, 3))
a

## Array
## <double>
## [
##   1,
##   2,
##   3
## ]
```

`a` is now an `Array` object in our R session, even though we created it in Python.
We can apply R methods to it:

```r
a[a > 1]

## Array
## <double>
## [
##   2,
##   3
## ]
```
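
The same sharing applies to `RecordBatch` objects. The sketch below is illustrative rather than definitive: it assumes the `"arrow-env"` virtualenv from the installation step, builds a small batch in R with made-up column names `x` and `y`, converts it explicitly with `reticulate`'s `r_to_py()`, and brings it back with `py_to_r()`.

```r
library(arrow)
library(reticulate)
use_virtualenv("arrow-env")

# Build a RecordBatch in R; the columns x and y are illustrative only.
rb <- record_batch(x = c(1, 2, 3), y = c("a", "b", "c"))

# Hand it to Python explicitly. The result is a pyarrow RecordBatch
# that refers to the same underlying Arrow buffers.
py_rb <- r_to_py(rb)
class(py_rb)

# Convert the Python object back into an arrow RecordBatch in R.
rb2 <- py_to_r(py_rb)

# Check that the round trip preserved the schema and the data.
rb2$Equals(rb)
```

Usually `reticulate` performs these conversions implicitly when an R object is passed to a Python function; calling `r_to_py()` and `py_to_r()` directly just makes the boundary visible.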

We can send data both ways. One reason we might want to use `pyarrow` in R is
to take advantage of functionality that is better supported in Python than in R.
For example, `pyarrow` has a `concat_arrays` function, but as of 0.17, this
function is not implemented in the `arrow` R package. We can call it from R
through `reticulate` without copying the data.

```r
b <- Array$create(c(5, 6, 7, 8, 9))
a_and_b <- pa$concat_arrays(list(a, b))
a_and_b

## Array
## <double>
## [
##   1,
##   2,
##   3,
##   5,
##   6,
##   7,
##   8,
##   9
## ]
```

Now we have a single `Array` in R.

"Send", however, isn't the correct word. Internally, we're passing pointers to
the data between the R and Python interpreters running together in the same
process, without copying anything. Nothing is being sent: we're sharing and
accessing the same internal Arrow memory buffers.

## Troubleshooting

If you get an error like

```
Error in py_get_attr_impl(x, name, silent) :
  AttributeError: 'pyarrow.lib.DoubleArray' object has no attribute '_export_to_c'
```

it means that the version of `pyarrow` you're using is too old.
Support for passing data to and from R is included in versions 0.17 and greater.
Check your pyarrow version like this:

```r
pa$`__version__`

## [1] "0.16.0"
```

Note that your `pyarrow` and `arrow` versions don't need to match each other:
both just need to be 0.17 or greater.
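
If you would rather check the requirement in code than read the version by eye, a small helper along these lines could work. `pyarrow_is_new_enough()` is a hypothetical name, not part of `arrow` or `reticulate`, and the sketch assumes a plain release version string such as `"0.17.0"`; development version strings may need extra handling.

```r
# Hypothetical helper: TRUE when a release version string meets the minimum.
pyarrow_is_new_enough <- function(version, minimum = "0.17") {
  # compareVersion() returns -1, 0, or 1 for lower, equal, or higher.
  utils::compareVersion(version, minimum) >= 0
}

pyarrow_is_new_enough("0.16.0")  # FALSE: too old to share data with R
pyarrow_is_new_enough("0.17.0")  # TRUE

# In a live session, pass in the version reported by pyarrow:
# pyarrow_is_new_enough(pa$`__version__`)
```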