Diffstat (limited to 'src/arrow/r/vignettes/fs.Rmd')
-rw-r--r-- | src/arrow/r/vignettes/fs.Rmd | 130 |
1 files changed, 130 insertions, 0 deletions
diff --git a/src/arrow/r/vignettes/fs.Rmd b/src/arrow/r/vignettes/fs.Rmd
new file mode 100644
index 000000000..5d699c49d
--- /dev/null
+++ b/src/arrow/r/vignettes/fs.Rmd
@@ -0,0 +1,130 @@
+---
+title: "Working with Cloud Storage (S3)"
+output: rmarkdown::html_vignette
+vignette: >
+  %\VignetteIndexEntry{Working with Cloud Storage (S3)}
+  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEncoding{UTF-8}
+---
+
+The Arrow C++ library includes a generic filesystem interface and specific
+implementations for some cloud storage systems. This setup allows various
+parts of the project to be able to read and write data with different storage
+backends. In the `arrow` R package, support has been enabled for AWS S3.
+This vignette provides an overview of working with S3 data using Arrow.
+
+> In Windows and macOS binary packages, S3 support is included. On Linux when installing from source, S3 support is not enabled by default, and it has additional system requirements. See `vignette("install", package = "arrow")` for details.
+
+## URIs
+
+File readers and writers (`read_parquet()`, `write_feather()`, et al.)
+accept an S3 URI as the source or destination file,
+as do `open_dataset()` and `write_dataset()`.
+An S3 URI looks like:
+
+```
+s3://[access_key:secret_key@]bucket/path[?region=]
+```
+
+For example, one of the NYC taxi data files used in `vignette("dataset", package = "arrow")` is found at
+
+```
+s3://ursa-labs-taxi-data/2019/06/data.parquet
+```
+
+Given this URI, we can pass it to `read_parquet()` just as if it were a local file path:
+
+```r
+df <- read_parquet("s3://ursa-labs-taxi-data/2019/06/data.parquet")
+```
+
+Note that this will be slower to read than if the file were local,
+though if you're running on a machine in the same AWS region as the file in S3,
+the cost of reading the data over the network should be much lower.
+
+## Creating a FileSystem object
+
+Another way to connect to S3 is to create a `FileSystem` object once and pass
+that to the read/write functions.
+`S3FileSystem` objects can be created with the `s3_bucket()` function, which
+automatically detects the bucket's AWS region. Additionally, the resulting
+`FileSystem` will consider paths relative to the bucket's path (so for example
+you don't need to prefix the bucket path when listing a directory).
+This may be convenient when dealing with
+long URIs, and it's necessary for some options and authentication methods
+that aren't supported in the URI format.
+
+With a `FileSystem` object, we can point to specific files in it with the `$path()` method.
+In the previous example, this would look like:
+
+```r
+bucket <- s3_bucket("ursa-labs-taxi-data")
+df <- read_parquet(bucket$path("2019/06/data.parquet"))
+```
+
+See the help for `FileSystem` for a list of options that `s3_bucket()` and `S3FileSystem$create()`
+can take. `region`, `scheme`, and `endpoint_override` can be encoded as query
+parameters in the URI (though `region` will be auto-detected in `s3_bucket()` or from the URI if omitted).
+`access_key` and `secret_key` can also be included,
+but other options are not supported in the URI.
+
+The object that `s3_bucket()` returns is technically a `SubTreeFileSystem`, which holds a path and a file system to which it corresponds. `SubTreeFileSystem`s can be useful for holding a reference to a subdirectory somewhere, on S3 or elsewhere.
+
+One way to get a subtree is to call the `$cd()` method on a `FileSystem`:
+
+```r
+june2019 <- bucket$cd("2019/06")
+df <- read_parquet(june2019$path("data.parquet"))
+```
+
+`SubTreeFileSystem` can also be made from a URI:
+
+```r
+june2019 <- SubTreeFileSystem$create("s3://ursa-labs-taxi-data/2019/06")
+```
+
+## Authentication
+
+To access private S3 buckets, you typically need two secret parameters:
+an `access_key`, which is like a user id,
+and a `secret_key`, which is like a token.
+There are a few options for passing these credentials:
+
+1. Include them in the URI, like `s3://access_key:secret_key@bucket-name/path/to/file`. Be sure to [URL-encode](https://en.wikipedia.org/wiki/Percent-encoding) your secrets if they contain special characters like "/".
+
+2. Pass them as `access_key` and `secret_key` to `S3FileSystem$create()` or `s3_bucket()`.
+
+3. Set them as environment variables named `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`, respectively.
+
+4. Define them in a `~/.aws/credentials` file, according to the [AWS documentation](https://docs.aws.amazon.com/sdk-for-cpp/v1/developer-guide/credentials.html).
+
+You can also use an [AccessRole](https://docs.aws.amazon.com/STS/latest/APIReference/API_AssumeRole.html)
+for temporary access by passing the `role_arn` identifier to `S3FileSystem$create()` or `s3_bucket()`.
+
+## File systems that emulate S3
+
+The `S3FileSystem` machinery enables you to work with any file system that
+provides an S3-compatible interface. For example, [MinIO](https://min.io/) is
+an object-storage server that emulates the S3 API. If you were to
+run `minio server` locally with its default settings, you could connect to
+it with `arrow` using `S3FileSystem` like this:
+
+```r
+minio <- S3FileSystem$create(
+  access_key = "minioadmin",
+  secret_key = "minioadmin",
+  scheme = "http",
+  endpoint_override = "localhost:9000"
+)
+```
+
+or, as a URI, it would be
+
+```
+s3://minioadmin:minioadmin@?scheme=http&endpoint_override=localhost%3A9000
+```
+
+(note the URL escaping of the `:` in `endpoint_override`).
+
+Among other applications, this can be useful for testing out code locally before
+running on a remote S3 bucket.
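
The vignette mentions URL escaping twice: for secrets that contain characters like "/", and for the `:` in `endpoint_override`. As a rough illustration, the sketch below assembles an S3 URI in base R and percent-encodes those pieces with `utils::URLencode()`. The helper `build_s3_uri()` is hypothetical and not part of `arrow`; it follows the URI shape shown in the vignette and has only been reasoned through for the cases shown here.

```r
# Hypothetical helper (an assumption for illustration, not an arrow function):
# assemble an S3 URI of the form
#   s3://[access_key:secret_key@]bucket/path[?key=value&...]
# and percent-encode the parts that may contain reserved characters.
build_s3_uri <- function(bucket = "", path = "", access_key = NULL,
                         secret_key = NULL, ...) {
  # reserved = TRUE also encodes characters such as ":" and "/";
  # repeated = TRUE encodes even if the input already contains "%XX" sequences
  enc <- function(x) utils::URLencode(x, reserved = TRUE, repeated = TRUE)

  # Optional "access_key:secret_key@" prefix
  creds <- ""
  if (!is.null(access_key) && !is.null(secret_key)) {
    creds <- paste0(enc(access_key), ":", enc(secret_key), "@")
  }

  # Optional query string, e.g. "?scheme=http&endpoint_override=..."
  opts <- list(...)
  query <- ""
  if (length(opts) > 0) {
    pairs <- paste0(names(opts), "=", vapply(opts, enc, character(1)))
    query <- paste0("?", paste(pairs, collapse = "&"))
  }

  # The path is left unencoded so that "/" keeps separating directories
  location <- if (nzchar(path)) paste0(bucket, "/", path) else bucket
  paste0("s3://", creds, location, query)
}

# The local MinIO settings from the vignette, expressed as a URI
minio_uri <- build_s3_uri(
  access_key = "minioadmin",
  secret_key = "minioadmin",
  scheme = "http",
  endpoint_override = "localhost:9000"
)
print(minio_uri)
```

With the MinIO defaults this should reproduce the URI given in the vignette, with `localhost:9000` escaped as `localhost%3A9000`. A secret such as `a/b` would be encoded as `a%2Fb` in the same way.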