summaryrefslogtreecommitdiffstats
path: root/src/arrow/r/vignettes/flight.Rmd
blob: e8af5cad6f7788736851a4931dc1abd4e3c5cc48 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
---
title: "Connecting to Flight RPC Servers"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Connecting to Flight RPC Servers}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

[**Flight**](https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/)
is a general-purpose client-server framework for high performance
transport of large datasets over network interfaces, built as part of the
[Apache Arrow](https://arrow.apache.org) project.

Flight allows for highly efficient data transfer as it:

* removes the need for deserialization during data transfer
* allows for parallel data streaming
* is highly optimized to take advantage of Arrow's columnar format.

The arrow package provides methods for connecting to Flight RPC servers
to send and receive data.

## Getting Started

The `flight` functions in the package use [reticulate](https://rstudio.github.io/reticulate/) to call methods in the
[pyarrow](https://arrow.apache.org/docs/python/api/flight.html) Python package.

Before using them for the first time,
you'll need to be sure you have reticulate and pyarrow installed:

```r
install.packages("reticulate")
arrow::install_pyarrow()
```

See `vignette("python", package = "arrow")` for more details on setting up
`pyarrow`.

## Example

The package includes methods for starting a Python-based Flight server, as well
as methods for connecting to a Flight server running elsewhere.

To illustrate both sides, in one process let's start a demo server:

```r
library(arrow)
demo_server <- load_flight_server("demo_flight_server")
server <- demo_server$DemoFlightServer(port = 8089)
server$serve()
```

We'll leave that one running.

In a different R process, let's connect to it and put some data in it.

```r
library(arrow)
client <- flight_connect(port = 8089)
# Upload some data to our server so there's something to demo
flight_put(client, iris, path = "test_data/iris")
```

Now, in a new R process, let's connect to the server and pull the data we
put there:

```r
library(arrow)
library(dplyr)
client <- flight_connect(port = 8089)
client %>%
  flight_get("test_data/iris") %>%
  group_by(Species) %>%
  summarize(max_petal = max(Petal.Length))

## # A tibble: 3 x 2
##   Species    max_petal
##   <fct>          <dbl>
## 1 setosa           1.9
## 2 versicolor       5.1
## 3 virginica        6.9
```

Because `flight_get()` returns an Arrow data structure, you can directly pipe
its result into a [dplyr](https://dplyr.tidyverse.org/) workflow.
See `vignette("dataset", package = "arrow")` for more information on working with Arrow objects via a dplyr interface.