Diffstat (limited to 'src/arrow/r/man/Scanner.Rd')
-rw-r--r-- src/arrow/r/man/Scanner.Rd 51
1 file changed, 51 insertions, 0 deletions
diff --git a/src/arrow/r/man/Scanner.Rd b/src/arrow/r/man/Scanner.Rd
new file mode 100644
index 000000000..db6488f50
--- /dev/null
+++ b/src/arrow/r/man/Scanner.Rd
@@ -0,0 +1,51 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/dataset-scan.R
+\name{Scanner}
+\alias{Scanner}
+\alias{ScannerBuilder}
+\title{Scan the contents of a dataset}
+\description{
+A \code{Scanner} iterates over a \link{Dataset}'s fragments and returns data
+according to given row filtering and column projection. A \code{ScannerBuilder}
+can help create one.
+}
+\section{Factory}{
+
+\code{Scanner$create()} wraps the \code{ScannerBuilder} interface to make a \code{Scanner}.
+It takes the following arguments:
+\itemize{
+\item \code{dataset}: A \code{Dataset} or \code{arrow_dplyr_query} object, as returned by the
+\code{dplyr} methods on \code{Dataset}.
+\item \code{projection}: A character vector of column names to select, or a
+named list of expressions
+\item \code{filter}: An \code{Expression} to filter the scanned rows by, or \code{TRUE} (default)
+to keep all rows.
+\item \code{use_threads}: logical: should scanning use multithreading? Default \code{TRUE}
+\item \code{use_async}: logical: should the async scanner (performs better on
+high-latency/highly parallel filesystems like S3) be used? Default \code{FALSE}
+\item \code{...}: Additional arguments, currently ignored
+}
+}
+
+\section{Methods}{
+
+\code{ScannerBuilder} has the following methods:
+\itemize{
+\item \verb{$Project(cols)}: Indicate that the scan should only return columns given
+by \code{cols}, a character vector of column names
+\item \verb{$Filter(expr)}: Filter rows by an \link{Expression}.
+\item \verb{$UseThreads(threads)}: logical: should the scan use multithreading?
+The method's default input is \code{TRUE}, but you must call the method to enable
+multithreading because the scanner default is \code{FALSE}.
+\item \verb{$UseAsync(use_async)}: logical: should the async scanner be used?
+\item \verb{$BatchSize(batch_size)}: integer: Maximum row count of scanned record
+batches, default is 32K. If scanned record batches are overflowing memory
+then this method can be called to reduce their size.
+\item \verb{$schema}: Active binding, returns the \link{Schema} of the Dataset
+\item \verb{$Finish()}: Returns a \code{Scanner}
+}
+
+\code{Scanner} currently has a single method, \verb{$ToTable()}, which evaluates the
+query and returns an Arrow \link{Table}.
+}
+
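The two routes documented in this man page, the `Scanner$create()` factory and the `ScannerBuilder` methods, might be used roughly as in the sketch below. It is a minimal illustration rather than part of the patch, and it rests on several assumptions not stated in the page itself: that the `arrow` package is installed with dataset and Parquet support, that `write_dataset()` and `open_dataset()` are used to set up the data, that a `ScannerBuilder` is obtained with `ds$NewScan()`, and that the filter is built with `Expression$create()`, `Expression$field_ref()` and `Expression$scalar()`. The `mtcars` data and the temporary directory are purely illustrative.

```r
library(arrow)

# Illustrative setup (assumption): write a small dataset to a temp directory.
tmp <- tempfile()
write_dataset(mtcars, tmp)
ds <- open_dataset(tmp)

# A filter Expression: keep rows where mpg > 25.
# "greater" is the name of an Arrow compute function.
keep <- Expression$create(
  "greater",
  Expression$field_ref("mpg"),
  Expression$scalar(25)
)

# Factory route: Scanner$create() with a column projection and a row filter.
scanner <- Scanner$create(ds, projection = c("mpg", "cyl"), filter = keep)
tab <- scanner$ToTable()

# Builder route: the same scan assembled step by step.
# ds$NewScan() as the source of the ScannerBuilder is an assumption here.
builder <- ds$NewScan()
builder$Project(c("mpg", "cyl"))
builder$Filter(keep)
builder$UseThreads(TRUE)  # must be called explicitly to enable multithreading
tab2 <- builder$Finish()$ToTable()
```

Both `tab` and `tab2` are Arrow Tables holding only the projected columns and the rows that pass the filter; `as.data.frame()` converts either one to an R data frame.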