Diffstat (limited to 'src/arrow/r/man/Dataset.Rd')
-rw-r--r-- | src/arrow/r/man/Dataset.Rd | 81 |
1 files changed, 81 insertions, 0 deletions
diff --git a/src/arrow/r/man/Dataset.Rd b/src/arrow/r/man/Dataset.Rd
new file mode 100644
index 000000000..c19a0df6c
--- /dev/null
+++ b/src/arrow/r/man/Dataset.Rd
@@ -0,0 +1,81 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/dataset.R, R/dataset-factory.R
+\name{Dataset}
+\alias{Dataset}
+\alias{FileSystemDataset}
+\alias{UnionDataset}
+\alias{InMemoryDataset}
+\alias{DatasetFactory}
+\alias{FileSystemDatasetFactory}
+\title{Multi-file datasets}
+\description{
+Arrow Datasets allow you to query against data that has been split across
+multiple files. This sharding of data may indicate partitioning, which
+can accelerate queries that only touch some partitions (files).
+
+A \code{Dataset} contains one or more \code{Fragments}, such as files, of potentially
+differing type and partitioning.
+
+For \code{Dataset$create()}, see \code{\link[=open_dataset]{open_dataset()}}, which is an alias for it.
+
+\code{DatasetFactory} is used to provide finer control over the creation of \code{Dataset}s.
+}
+\section{Factory}{
+
+\code{DatasetFactory} is used to create a \code{Dataset}, inspect the \link{Schema} of the
+fragments contained in it, and declare a partitioning.
+\code{FileSystemDatasetFactory} is a subclass of \code{DatasetFactory} for
+discovering files in the local file system, the only currently supported
+file system.
+
+For the \code{DatasetFactory$create()} factory method, see \code{\link[=dataset_factory]{dataset_factory()}}, an
+alias for it. A \code{DatasetFactory} has:
+\itemize{
+\item \verb{$Inspect(unify_schemas)}: If \code{unify_schemas} is \code{TRUE}, all fragments
+will be scanned and a unified \link{Schema} will be created from them; if \code{FALSE}
+(default), only the first fragment will be inspected for its schema. Use this
+fast path when you know and trust that all fragments have an identical schema.
+\item \verb{$Finish(schema, unify_schemas)}: Returns a \code{Dataset}. If \code{schema} is provided,
+it will be used for the \code{Dataset}; if omitted, a \code{Schema} will be created from
+inspecting the fragments (files) in the dataset, following \code{unify_schemas}
+as described above.
+}
+
+\code{FileSystemDatasetFactory$create()} is a lower-level factory method and
+takes the following arguments:
+\itemize{
+\item \code{filesystem}: A \link{FileSystem}
+\item \code{selector}: Either a \link{FileSelector} or \code{NULL}
+\item \code{paths}: Either a character vector of file paths or \code{NULL}
+\item \code{format}: A \link{FileFormat}
+\item \code{partitioning}: Either \code{Partitioning}, \code{PartitioningFactory}, or \code{NULL}
+}
+}
+
+\section{Methods}{
+
+
+A \code{Dataset} has the following methods:
+\itemize{
+\item \verb{$NewScan()}: Returns a \link{ScannerBuilder} for building a query
+\item \verb{$schema}: Active binding that returns the \link{Schema} of the Dataset; you
+may also replace the dataset's schema by using \code{ds$schema <- new_schema}.
+This method currently supports only adding, removing, or reordering
+fields in the schema: you cannot alter or cast the field types.
+}
+
+\code{FileSystemDataset} has the following methods:
+\itemize{
+\item \verb{$files}: Active binding, returns the files of the \code{FileSystemDataset}
+\item \verb{$format}: Active binding, returns the \link{FileFormat} of the \code{FileSystemDataset}
+}
+
+\code{UnionDataset} has the following methods:
+\itemize{
+\item \verb{$children}: Active binding, returns all child \code{Dataset}s.
+}
+}
+
+\seealso{
+\code{\link[=open_dataset]{open_dataset()}} for a simple interface to creating a \code{Dataset}
+}
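
Note (not part of the diff): the factory workflow documented in the added man page can be tried out with a minimal sketch like the one below. It assumes the `arrow` R package is installed with dataset and Parquet support. The scratch directory, file names, and sample data are hypothetical and exist only for illustration. Only `open_dataset()`, `dataset_factory()`, `$Inspect()`, `$Finish()`, `$files`, and `$schema` come from the documentation above.

```r
library(arrow)

# Hypothetical scratch directory holding two small Parquet files
# (the directory and file names are illustrative, not from the diff).
data_dir <- tempfile("dataset_demo_")
dir.create(data_dir)
write_parquet(data.frame(x = 1:3, y = c("a", "b", "c")),
              file.path(data_dir, "part-1.parquet"))
write_parquet(data.frame(x = 4:6, y = c("d", "e", "f")),
              file.path(data_dir, "part-2.parquet"))

# Simple interface: open_dataset() discovers the files and builds the Dataset.
ds <- open_dataset(data_dir, format = "parquet")

# Finer control: create a DatasetFactory, inspect the schema across all
# fragments (unify_schemas = TRUE scans every file), then finish the Dataset
# with that schema.
factory <- dataset_factory(data_dir, format = "parquet")
unified <- factory$Inspect(unify_schemas = TRUE)
ds2 <- factory$Finish(schema = unified)

# Active bindings described in the Methods section.
n_files <- length(ds2$files)
field_names <- ds2$schema$names
```

With identical schemas in every file, as here, the default `unify_schemas = FALSE` inspects only the first fragment and is the faster choice. Scanning all fragments is useful mainly when files may differ in their columns.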