diff --git a/src/arrow/r/vignettes/fs.Rmd b/src/arrow/r/vignettes/fs.Rmd
new file mode 100644
index 000000000..5d699c49d
--- /dev/null
+++ b/src/arrow/r/vignettes/fs.Rmd
@@ -0,0 +1,130 @@
+---
+title: "Working with Cloud Storage (S3)"
+output: rmarkdown::html_vignette
+vignette: >
+ %\VignetteIndexEntry{Working with Cloud Storage (S3)}
+ %\VignetteEngine{knitr::rmarkdown}
+ %\VignetteEncoding{UTF-8}
+---
+
+The Arrow C++ library includes a generic filesystem interface and specific
+implementations for some cloud storage systems. This design lets different
+parts of the project read and write data using different storage
+backends. In the `arrow` R package, support has been enabled for AWS S3.
+This vignette provides an overview of working with S3 data using Arrow.
+
+> In Windows and macOS binary packages, S3 support is included. On Linux when installing from source, S3 support is not enabled by default, and it has additional system requirements. See `vignette("install", package = "arrow")` for details.
+
+## URIs
+
+File readers and writers (`read_parquet()`, `write_feather()`, et al.)
+accept an S3 URI as the source or destination file,
+as do `open_dataset()` and `write_dataset()`.
+An S3 URI looks like:
+
+```
+s3://[access_key:secret_key@]bucket/path[?region=]
+```
+
+For example, one of the NYC taxi data files used in `vignette("dataset", package = "arrow")` is found at
+
+```
+s3://ursa-labs-taxi-data/2019/06/data.parquet
+```
+
+Given this URI, we can pass it to `read_parquet()` just as if it were a local file path:
+
+```r
+df <- read_parquet("s3://ursa-labs-taxi-data/2019/06/data.parquet")
+```
+
+Note that this will be slower to read than if the file were local,
+though if you're running on a machine in the same AWS region as the file in S3,
+the cost of reading the data over the network should be much lower.
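+
+The same URI form works for multi-file datasets. The following is a minimal
+sketch, assuming the bucket is publicly readable and that its directory layout
+encodes `year` and `month` partitions, as described in
+`vignette("dataset", package = "arrow")`:
+
+```r
+library(arrow)
+
+# Point open_dataset() at the bucket root; only file metadata is read here,
+# not the data itself.
+# Assumption: directories are laid out as <year>/<month>/data.parquet
+ds <- open_dataset(
+  "s3://ursa-labs-taxi-data",
+  partitioning = c("year", "month")
+)
+```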
+
+## Creating a FileSystem object
+
+Another way to connect to S3 is to create a `FileSystem` object once and pass
+that to the read/write functions.
+`S3FileSystem` objects can be created with the `s3_bucket()` function, which
+automatically detects the bucket's AWS region. Additionally, the resulting
+`FileSystem` will consider paths relative to the bucket's path (so for example
+you don't need to prefix the bucket path when listing a directory).
+This may be convenient when dealing with
+long URIs, and it's necessary for some options and authentication methods
+that aren't supported in the URI format.
+
+With a `FileSystem` object, we can point to specific files in it with the `$path()` method.
+In the previous example, this would look like:
+
+```r
+bucket <- s3_bucket("ursa-labs-taxi-data")
+df <- read_parquet(bucket$path("2019/06/data.parquet"))
+```
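+
+Because paths are relative to the bucket, listing its contents does not require
+repeating the bucket name. A short sketch, assuming the `$ls()` method is
+available on the `FileSystem` in your version of `arrow`:
+
+```r
+library(arrow)
+
+bucket <- s3_bucket("ursa-labs-taxi-data")
+
+# List the entries under "2019/06"; paths are given relative to the bucket
+files <- bucket$ls("2019/06")
+files
+```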
+
+See the help for `FileSystem` for a list of options that `s3_bucket()` and `S3FileSystem$create()`
+can take. `region`, `scheme`, and `endpoint_override` can be encoded as query
+parameters in the URI (though `region` will be auto-detected in `s3_bucket()` or from the URI if omitted).
+`access_key` and `secret_key` can also be included,
+but other options are not supported in the URI.
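+
+For comparison, here is a sketch of creating an `S3FileSystem` directly instead
+of through `s3_bucket()`. It assumes the bucket allows anonymous reads and that
+it lives in the `us-east-2` region; both are assumptions to adjust for your own
+bucket. Note that paths must now include the bucket name:
+
+```r
+library(arrow)
+
+# Assumption: public bucket in us-east-2; no region auto-detection happens here
+s3 <- S3FileSystem$create(anonymous = TRUE, region = "us-east-2")
+
+# Unlike with s3_bucket(), the bucket name is part of the path
+files <- s3$ls("ursa-labs-taxi-data/2019/06")
+```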
+
+The object that `s3_bucket()` returns is technically a `SubTreeFileSystem`, which holds a path and a file system to which it corresponds. `SubTreeFileSystem`s can be useful for holding a reference to a subdirectory somewhere, on S3 or elsewhere.
+
+One way to get a subtree is to call the `$cd()` method on a `FileSystem`:
+
+```r
+june2019 <- bucket$cd("2019/06")
+df <- read_parquet(june2019$path("data.parquet"))
+```
+
+`SubTreeFileSystem` can also be made from a URI:
+
+```r
+june2019 <- SubTreeFileSystem$create("s3://ursa-labs-taxi-data/2019/06")
+```
+
+## Authentication
+
+To access private S3 buckets, you typically need two secret parameters:
+an `access_key`, which is like a user id,
+and a `secret_key`, which is like a token.
+There are a few options for passing these credentials:
+
+1. Include them in the URI, like `s3://access_key:secret_key@bucket-name/path/to/file`. Be sure to [URL-encode](https://en.wikipedia.org/wiki/Percent-encoding) your secrets if they contain special characters like "/".
+
+2. Pass them as `access_key` and `secret_key` to `S3FileSystem$create()` or `s3_bucket()`.
+
+3. Set them as environment variables named `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`, respectively.
+
+4. Define them in a `~/.aws/credentials` file, according to the [AWS documentation](https://docs.aws.amazon.com/sdk-for-cpp/v1/developer-guide/credentials.html).
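+
+For the first option, one way to URL-encode a secret is `utils::URLencode()`
+with `reserved = TRUE`. The credentials below are made up for illustration;
+`bucket-name/path/to/file` is the placeholder from the list above:
+
+```r
+# Hypothetical credentials, for illustration only
+access_key <- "AKIAEXAMPLE"
+secret_key <- "abc/def+ghi"
+
+# reserved = TRUE also escapes reserved characters such as "/" and "+"
+encoded_secret <- utils::URLencode(secret_key, reserved = TRUE)
+
+uri <- paste0(
+  "s3://", access_key, ":", encoded_secret, "@bucket-name/path/to/file"
+)
+```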
+
+You can also use an [AccessRole](https://docs.aws.amazon.com/STS/latest/APIReference/API_AssumeRole.html)
+for temporary access by passing the `role_arn` identifier to `S3FileSystem$create()` or `s3_bucket()`.
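+
+A sketch of the call shape, using a made-up role ARN and region. Both values
+are placeholders, and the role must already exist and be assumable with your
+base credentials:
+
+```r
+library(arrow)
+
+# Hypothetical ARN and region; replace with values from your AWS account
+fs <- S3FileSystem$create(
+  role_arn = "arn:aws:iam::123456789012:role/example-read-role",
+  region = "us-east-1"
+)
+```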
+
+## File systems that emulate S3
+
+The `S3FileSystem` machinery enables you to work with any file system that
+provides an S3-compatible interface. For example, [MinIO](https://min.io/) is
+an object-storage server that emulates the S3 API. If you were to
+run `minio server` locally with its default settings, you could connect to
+it with `arrow` using `S3FileSystem` like this:
+
+```r
+minio <- S3FileSystem$create(
+ access_key = "minioadmin",
+ secret_key = "minioadmin",
+ scheme = "http",
+ endpoint_override = "localhost:9000"
+)
+```
+
+or, as a URI, it would be
+
+```
+s3://minioadmin:minioadmin@?scheme=http&endpoint_override=localhost%3A9000
+```
+
+(note the URL escaping of the `:` in `endpoint_override`).
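+
+If you would rather not escape the endpoint by hand, the URI can be assembled
+in R. This sketch uses `utils::URLencode()`; the variable names are only
+illustrative:
+
+```r
+endpoint <- "localhost:9000"
+
+# reserved = TRUE escapes the ":" in the endpoint as "%3A"
+minio_uri <- paste0(
+  "s3://minioadmin:minioadmin@?scheme=http&endpoint_override=",
+  utils::URLencode(endpoint, reserved = TRUE)
+)
+```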
+
+Among other applications, this can be useful for testing out code locally before
+running on a remote S3 bucket.
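+
+As a sketch of such a local test, the following writes a data frame to MinIO
+and reads it back. It assumes a MinIO server is running on `localhost:9000`
+with default credentials and that a bucket named `test-bucket` (a hypothetical
+name) has already been created, for example through the MinIO console:
+
+```r
+library(arrow)
+
+minio <- S3FileSystem$create(
+  access_key = "minioadmin",
+  secret_key = "minioadmin",
+  scheme = "http",
+  endpoint_override = "localhost:9000"
+)
+
+# Assumption: "test-bucket" already exists on the local server
+target <- minio$path("test-bucket/mtcars.parquet")
+
+write_parquet(mtcars, target)
+roundtrip <- read_parquet(target)
+```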