summaryrefslogtreecommitdiffstats
path: root/src/arrow/r/man/dataset_factory.Rd
diff options
context:
space:
mode:
Diffstat (limited to 'src/arrow/r/man/dataset_factory.Rd')
-rw-r--r--src/arrow/r/man/dataset_factory.Rd76
1 files changed, 76 insertions, 0 deletions
diff --git a/src/arrow/r/man/dataset_factory.Rd b/src/arrow/r/man/dataset_factory.Rd
new file mode 100644
index 000000000..d119c150b
--- /dev/null
+++ b/src/arrow/r/man/dataset_factory.Rd
@@ -0,0 +1,76 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/dataset-factory.R
+\name{dataset_factory}
+\alias{dataset_factory}
+\title{Create a DatasetFactory}
+\usage{
+dataset_factory(
+ x,
+ filesystem = NULL,
+ format = c("parquet", "arrow", "ipc", "feather", "csv", "tsv", "text"),
+ partitioning = NULL,
+ ...
+)
+}
+\arguments{
+\item{x}{A string path to a directory containing data files, a vector of one
+one or more string paths to data files, or a list of \code{DatasetFactory} objects
+whose datasets should be combined. If this argument is specified it will be
+used to construct a \code{UnionDatasetFactory} and other arguments will be
+ignored.}
+
+\item{filesystem}{A \link{FileSystem} object; if omitted, the \code{FileSystem} will
+be detected from \code{x}}
+
+\item{format}{A \link{FileFormat} object, or a string identifier of the format of
+the files in \code{x}. Currently supported values:
+\itemize{
+\item "parquet"
+\item "ipc"/"arrow"/"feather", all aliases for each other; for Feather, note that
+only version 2 files are supported
+\item "csv"/"text", aliases for the same thing (because comma is the default
+delimiter for text files
+\item "tsv", equivalent to passing \verb{format = "text", delimiter = "\\t"}
+}
+
+Default is "parquet", unless a \code{delimiter} is also specified, in which case
+it is assumed to be "text".}
+
+\item{partitioning}{One of
+\itemize{
+\item A \code{Schema}, in which case the file paths relative to \code{sources} will be
+parsed, and path segments will be matched with the schema fields. For
+example, \code{schema(year = int16(), month = int8())} would create partitions
+for file paths like "2019/01/file.parquet", "2019/02/file.parquet", etc.
+\item A character vector that defines the field names corresponding to those
+path segments (that is, you're providing the names that would correspond
+to a \code{Schema} but the types will be autodetected)
+\item A \code{HivePartitioning} or \code{HivePartitioningFactory}, as returned
+by \code{\link[=hive_partition]{hive_partition()}} which parses explicit or autodetected fields from
+Hive-style path segments
+\item \code{NULL} for no partitioning
+}}
+
+\item{...}{Additional format-specific options, passed to
+\code{FileFormat$create()}. For CSV options, note that you can specify them either
+with the Arrow C++ library naming ("delimiter", "quoting", etc.) or the
+\code{readr}-style naming used in \code{\link[=read_csv_arrow]{read_csv_arrow()}} ("delim", "quote", etc.).
+Not all \code{readr} options are currently supported; please file an issue if you
+encounter one that \code{arrow} should support.}
+}
+\value{
+A \code{DatasetFactory} object. Pass this to \code{\link[=open_dataset]{open_dataset()}},
+in a list potentially with other \code{DatasetFactory} objects, to create
+a \code{Dataset}.
+}
+\description{
+A \link{Dataset} can constructed using one or more \link{DatasetFactory}s.
+This function helps you construct a \code{DatasetFactory} that you can pass to
+\code{\link[=open_dataset]{open_dataset()}}.
+}
+\details{
+If you would only have a single \code{DatasetFactory} (for example, you have a
+single directory containing Parquet files), you can call \code{open_dataset()}
+directly. Use \code{dataset_factory()} when you
+want to combine different directories, file systems, or file formats.
+}