diff --git a/src/arrow/r/man/Dataset.Rd b/src/arrow/r/man/Dataset.Rd
new file mode 100644
index 000000000..c19a0df6c
--- /dev/null
+++ b/src/arrow/r/man/Dataset.Rd
@@ -0,0 +1,81 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/dataset.R, R/dataset-factory.R
+\name{Dataset}
+\alias{Dataset}
+\alias{FileSystemDataset}
+\alias{UnionDataset}
+\alias{InMemoryDataset}
+\alias{DatasetFactory}
+\alias{FileSystemDatasetFactory}
+\title{Multi-file datasets}
+\description{
+Arrow Datasets allow you to query against data that has been split across
+multiple files. This sharding of data may indicate partitioning, which
+can accelerate queries that only touch some partitions (files).
+
+A \code{Dataset} contains one or more \code{Fragments}, such as files, of potentially
+differing type and partitioning.
+
+For \code{Dataset$create()}, see \code{\link[=open_dataset]{open_dataset()}}, which is an alias for it.
+
+\code{DatasetFactory} is used to provide finer control over the creation of \code{Dataset}s.
+}
+\section{Factory}{
+
+\code{DatasetFactory} is used to create a \code{Dataset}, inspect the \link{Schema} of the
+fragments contained in it, and declare a partitioning.
+\code{FileSystemDatasetFactory} is a subclass of \code{DatasetFactory} for
+discovering files in the local file system, the only currently supported
+file system.
+
+For the \code{DatasetFactory$create()} factory method, see \code{\link[=dataset_factory]{dataset_factory()}}, an
+alias for it. A \code{DatasetFactory} has the following methods:
+\itemize{
+\item \verb{$Inspect(unify_schemas)}: If \code{unify_schemas} is \code{TRUE}, all fragments
+will be scanned and a unified \link{Schema} will be created from them; if \code{FALSE}
+(default), only the first fragment will be inspected for its schema. Use the
+default fast path only when you know and trust that all fragments have an identical schema.
+\item \verb{$Finish(schema, unify_schemas)}: Returns a \code{Dataset}. If \code{schema} is provided,
+it will be used for the \code{Dataset}; if omitted, a \code{Schema} will be created from
+inspecting the fragments (files) in the dataset, following \code{unify_schemas}
+as described above.
+}
+
+\code{FileSystemDatasetFactory$create()} is a lower-level factory method and
+takes the following arguments:
+\itemize{
+\item \code{filesystem}: A \link{FileSystem}
+\item \code{selector}: Either a \link{FileSelector} or \code{NULL}
+\item \code{paths}: Either a character vector of file paths or \code{NULL}
+\item \code{format}: A \link{FileFormat}
+\item \code{partitioning}: One of \code{Partitioning}, \code{PartitioningFactory}, or \code{NULL}
+}
+}
+
+\section{Methods}{
+
+
+A \code{Dataset} has the following methods and active bindings:
+\itemize{
+\item \verb{$NewScan()}: Returns a \link{ScannerBuilder} for building a query
+\item \verb{$schema}: Active binding that returns the \link{Schema} of the Dataset; you
+may also replace the dataset's schema by using \code{ds$schema <- new_schema}.
+Replacement currently supports only adding, removing, or reordering
+fields in the schema: you cannot alter or cast the field types.
+}
+
+\code{FileSystemDataset} has the following active bindings:
+\itemize{
+\item \verb{$files}: Active binding, returns the files of the \code{FileSystemDataset}
+\item \verb{$format}: Active binding, returns the \link{FileFormat} of the \code{FileSystemDataset}
+}
+
+\code{UnionDataset} has the following active bindings:
+\itemize{
+\item \verb{$children}: Active binding, returns all child \code{Dataset}s.
+}
+}
+
+\seealso{
+\code{\link[=open_dataset]{open_dataset()}} for a simpler interface for creating a \code{Dataset}
+}