author: Daniel Baumann <daniel.baumann@progress-linux.org> 2024-04-21 11:54:28 +0000
committer: Daniel Baumann <daniel.baumann@progress-linux.org> 2024-04-21 11:54:28 +0000
commit: e6918187568dbd01842d8d1d2c808ce16a894239
tree: 64f88b554b444a49f656b6c656111a145cbbaa28 /src/arrow/r/vignettes/python.Rmd
parent: Initial commit.
Adding upstream version 18.2.2. [upstream/18.2.2]
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to '')
-rw-r--r-- src/arrow/r/vignettes/python.Rmd 131
1 files changed, 131 insertions, 0 deletions
diff --git a/src/arrow/r/vignettes/python.Rmd b/src/arrow/r/vignettes/python.Rmd
new file mode 100644
index 000000000..c05ee7dc7
--- /dev/null
+++ b/src/arrow/r/vignettes/python.Rmd
@@ -0,0 +1,131 @@
+---
+title: "Apache Arrow in Python and R with reticulate"
+output: rmarkdown::html_vignette
+vignette: >
+ %\VignetteIndexEntry{Apache Arrow in Python and R with reticulate}
+ %\VignetteEngine{knitr::rmarkdown}
+ %\VignetteEncoding{UTF-8}
+---
+
+The `arrow` package provides `reticulate` methods for passing data between
+R and Python in the same process. This document provides a brief overview.
+
+## Installing
+
+To use `arrow` in Python, at a minimum you'll need the `pyarrow` library.
+To install it in a virtualenv,
+
+```r
+library(reticulate)
+virtualenv_create("arrow-env")
+install_pyarrow("arrow-env")
+```
+
+If you want to install a development version of `pyarrow`,
+add `nightly = TRUE`:
+
+```r
+install_pyarrow("arrow-env", nightly = TRUE)
+```
+
+`install_pyarrow()` also works with `conda` environments
+(`conda_create()` instead of `virtualenv_create()`).
+
+For more on installing and configuring Python,
+see the [reticulate docs](https://rstudio.github.io/reticulate/articles/python_packages.html).
+
+## Using
+
+To start, load `arrow` and `reticulate`, and then import `pyarrow`.
+
+```r
+library(arrow)
+library(reticulate)
+use_virtualenv("arrow-env")
+pa <- import("pyarrow")
+```
+
+The package includes support for sharing Arrow `Array` and `RecordBatch`
+objects in-process between R and Python. For example, let's create an `Array`
+in `pyarrow`.
+
+```r
+a <- pa$array(c(1, 2, 3))
+a
+
+## Array
+## <double>
+## [
+## 1,
+## 2,
+## 3
+## ]
+```
+
+`a` is now an `Array` object in our R session, even though we created it in Python.
+We can apply R methods to it:
+
+```r
+a[a > 1]
+
+## Array
+## <double>
+## [
+## 2,
+## 3
+## ]
+```
+
+We can send data both ways. One reason we might want to use `pyarrow` in R is
+to take advantage of functionality that is better supported in Python than in R.
+For example, `pyarrow` has a `concat_arrays` function, but as of 0.17, this
+function is not implemented in the `arrow` R package. We can still call it
+efficiently from R through `reticulate`.
+
+```r
+b <- Array$create(c(5, 6, 7, 8, 9))
+a_and_b <- pa$concat_arrays(list(a, b))
+a_and_b
+
+## Array
+## <double>
+## [
+## 1,
+## 2,
+## 3,
+## 5,
+## 6,
+## 7,
+## 8,
+## 9
+## ]
+```
+
+Now we have a single `Array` in R.
+
+"Send", however, isn't the correct word. Internally, we're passing pointers to
+the data between the R and Python interpreters running together in the same
+process, without copying anything. Nothing is being sent: we're sharing and
+accessing the same internal Arrow memory buffers.
+
+## Troubleshooting
+
+If you get an error like
+
+```
+Error in py_get_attr_impl(x, name, silent) :
+ AttributeError: 'pyarrow.lib.DoubleArray' object has no attribute '_export_to_c'
+```
+
+it means that the version of `pyarrow` you're using is too old.
+Support for passing data to and from R is included in versions 0.17 and greater.
+Check your pyarrow version like this:
+
+```r
+pa$`__version__`
+
+## [1] "0.16.0"
+```
+
+Note that your `pyarrow` and `arrow` versions don't need to match each other:
+each just needs to be 0.17 or greater.
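+
+The version rule above can be checked with a string comparison in R. This is a
+minimal sketch, not part of the `arrow` or `reticulate` API: the helper name
+`meets_minimum()` is hypothetical, and it assumes a plain numeric release
+string such as "0.17.1" (development version strings with non-numeric
+suffixes would need extra handling).
+
+```r
+# Hypothetical helper: TRUE when `version` is at least `minimum`.
+# numeric_version() compares component by component, so "0.16.0" < "0.17".
+meets_minimum <- function(version, minimum = "0.17") {
+  numeric_version(version) >= numeric_version(minimum)
+}
+
+meets_minimum("0.16.0")  # the too-old pyarrow from the example above
+meets_minimum("0.17.1")  # a release that includes R support
+
+# In a live session (assumes `pa` was imported as shown earlier):
+# meets_minimum(pa$`__version__`)
+# meets_minimum(as.character(packageVersion("arrow")))
+```
+
+Each side is checked against the minimum separately, because the two versions
+do not have to be equal.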