.. Licensed to the Apache Software Foundation (ASF) under one
.. or more contributor license agreements.  See the NOTICE file
.. distributed with this work for additional information
.. regarding copyright ownership.  The ASF licenses this file
.. to you under the Apache License, Version 2.0 (the
.. "License"); you may not use this file except in compliance
.. with the License.  You may obtain a copy of the License at

.. http://www.apache.org/licenses/LICENSE-2.0

.. Unless required by applicable law or agreed to in writing,
.. software distributed under the License is distributed on an
.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
.. KIND, either express or implied.  See the License for the
.. specific language governing permissions and limitations
.. under the License.

.. default-domain:: cpp
.. highlight:: cpp

================
Tabular Datasets
================

.. seealso:: :doc:`Dataset API reference <api/dataset>`

.. warning::

   The ``arrow::dataset`` namespace is experimental, and a stable API
   is not yet guaranteed.

The Arrow Datasets library provides functionality to efficiently work with
tabular, potentially larger-than-memory, and multi-file datasets. This
includes:

* A unified interface that supports different sources and file formats
  (currently, Parquet, ORC, Feather / Arrow IPC, and CSV files) and different
  file systems (local, cloud).
* Discovery of sources (crawling directories, handling partitioned datasets
  with various partitioning schemes, basic schema normalization, ...)
* Optimized reading with predicate pushdown (filtering rows), projection
  (selecting and deriving columns), and optionally parallel reading.

The goal is to expand support to other file formats and data sources
(e.g. database connections) in the future.

Reading Datasets
----------------

For the examples below, let's create a small dataset consisting
of a directory with two Parquet files:

.. literalinclude:: ../../../cpp/examples/arrow/dataset_documentation_example.cc
   :language: cpp
   :start-after: (Doc section: Reading Datasets)
   :end-before: (Doc section: Reading Datasets)
   :linenos:
   :lineno-match:

(See the full example at bottom: :ref:`cpp-dataset-full-example`.)

Dataset discovery
~~~~~~~~~~~~~~~~~

A :class:`arrow::dataset::Dataset` object can be created using the various
:class:`arrow::dataset::DatasetFactory` objects. Here, we'll use the
:class:`arrow::dataset::FileSystemDatasetFactory`, which can create a dataset
given a base directory path:

.. literalinclude:: ../../../cpp/examples/arrow/dataset_documentation_example.cc
   :language: cpp
   :start-after: (Doc section: Dataset discovery)
   :end-before: (Doc section: Dataset discovery)
   :emphasize-lines: 6-11
   :linenos:
   :lineno-match:

We're also passing the filesystem and the file format to use for reading. This
lets us choose between (for example) reading local files or files in Amazon
S3, or between Parquet and CSV. In addition to searching a base directory, we
can list file paths manually, as shown in the sketch below.
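Here is a minimal sketch of the manual-listing case, reusing the
``filesystem``, ``format``, and ``options`` variables from the snippet above;
the file names are hypothetical placeholders:

.. code-block:: cpp

   // Build a factory from an explicit list of files instead of
   // crawling a base directory. These paths are hypothetical.
   std::vector<std::string> paths = {"/tmp/parquet_dataset/file1.parquet",
                                     "/tmp/parquet_dataset/file2.parquet"};
   auto factory =
       ds::FileSystemDatasetFactory::Make(filesystem, paths, format, options)
           .ValueOrDie();
   auto dataset = factory->Finish().ValueOrDie();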
Creating a :class:`arrow::dataset::Dataset` does not begin reading the data
itself. It only crawls the directory to find all the files (if needed), which
can be retrieved with :func:`arrow::dataset::FileSystemDataset::files`:

.. code-block:: cpp

   // Print out the files crawled (only for FileSystemDataset)
   for (const auto& filename : dataset->files()) {
     std::cout << filename << std::endl;
   }

…and infers the dataset's schema (by default from the first file):

.. code-block:: cpp

   std::cout << dataset->schema()->ToString() << std::endl;

Using the :func:`arrow::dataset::Dataset::NewScan` method, we can build a
:class:`arrow::dataset::Scanner` and read the dataset (or a portion of it)
into a :class:`arrow::Table` with the
:func:`arrow::dataset::Scanner::ToTable` method:

.. literalinclude:: ../../../cpp/examples/arrow/dataset_documentation_example.cc
   :language: cpp
   :start-after: (Doc section: Dataset discovery)
   :end-before: (Doc section: Dataset discovery)
   :emphasize-lines: 16-19
   :linenos:
   :lineno-match:

.. TODO: iterative loading not documented pending API changes

.. note:: Depending on the size of your dataset, this can require a lot of
   memory; see :ref:`cpp-dataset-filtering-data` below on filtering/projecting.

Reading different file formats
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The above examples use Parquet files on local disk, but the Dataset API
provides a consistent interface across multiple file formats and filesystems.
(See :ref:`cpp-dataset-cloud-storage` for more information on the latter.)
Currently, Parquet, ORC, Feather / Arrow IPC, and CSV file formats are
supported; more formats are planned in the future.

If we save the table as Feather files instead of Parquet files:

.. literalinclude:: ../../../cpp/examples/arrow/dataset_documentation_example.cc
   :language: cpp
   :start-after: (Doc section: Reading different file formats)
   :end-before: (Doc section: Reading different file formats)
   :linenos:
   :lineno-match:

…then we can read the Feather file by passing an
:class:`arrow::dataset::IpcFileFormat`:

.. code-block:: cpp

   auto format = std::make_shared<ds::IpcFileFormat>();
   // ...
   auto factory =
       ds::FileSystemDatasetFactory::Make(filesystem, selector, format, options)
           .ValueOrDie();

Customizing file formats
~~~~~~~~~~~~~~~~~~~~~~~~

:class:`arrow::dataset::FileFormat` objects have properties that control how
files are read. For example::

   auto format = std::make_shared<ds::ParquetFileFormat>();
   format->reader_options.dict_columns.insert("a");

will configure column ``"a"`` to be dictionary-encoded when read. Similarly,
setting :member:`arrow::dataset::CsvFileFormat::parse_options` lets us change
things like reading comma-separated or tab-separated data.

Additionally, passing an :class:`arrow::dataset::FragmentScanOptions` to
:func:`arrow::dataset::ScannerBuilder::FragmentScanOptions` offers
fine-grained control over data scanning. For example, for CSV files, we can
change what values are converted into Boolean true and false at scan time.
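A minimal sketch of the CSV case, assuming a ``scan_builder`` obtained from
``dataset->NewScan()`` inside a function returning :class:`arrow::Status`;
the ``"yes"``/``"no"`` spellings are hypothetical:

.. code-block:: cpp

   // Convert the (hypothetical) strings "yes" and "no" into Boolean
   // true and false when scanning CSV fragments.
   auto csv_scan_options = std::make_shared<ds::CsvFragmentScanOptions>();
   csv_scan_options->convert_options.true_values = {"yes"};
   csv_scan_options->convert_options.false_values = {"no"};
   ARROW_RETURN_NOT_OK(scan_builder->FragmentScanOptions(csv_scan_options));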
.. _cpp-dataset-filtering-data:

Filtering data
--------------

So far, we've been reading the entire dataset, but if we need only a subset
of the data, this can waste time or memory reading data we don't need. The
:class:`arrow::dataset::Scanner` offers control over what data to read.

In this snippet, we use :func:`arrow::dataset::ScannerBuilder::Project` to
select which columns to read:

.. literalinclude:: ../../../cpp/examples/arrow/dataset_documentation_example.cc
   :language: cpp
   :start-after: (Doc section: Filtering data)
   :end-before: (Doc section: Filtering data)
   :emphasize-lines: 16
   :linenos:
   :lineno-match:

Some formats, such as Parquet, can reduce I/O costs here by reading only the
specified columns from the filesystem.

A filter can be provided with :func:`arrow::dataset::ScannerBuilder::Filter`,
so that rows which do not match the filter predicate will not be included in
the returned table. Again, some formats, such as Parquet, can use this filter
to reduce the amount of I/O needed.

.. literalinclude:: ../../../cpp/examples/arrow/dataset_documentation_example.cc
   :language: cpp
   :start-after: (Doc section: Filtering data)
   :end-before: (Doc section: Filtering data)
   :emphasize-lines: 17
   :linenos:
   :lineno-match:

.. TODO Expressions not documented pending renamespacing

Projecting columns
------------------

In addition to selecting columns,
:func:`arrow::dataset::ScannerBuilder::Project` can also be used for more
complex projections, such as renaming columns, casting them to other types,
and even deriving new columns based on evaluating expressions.

In this case, we pass a vector of expressions used to construct column values
and a vector of names for the columns:

.. literalinclude:: ../../../cpp/examples/arrow/dataset_documentation_example.cc
   :language: cpp
   :start-after: (Doc section: Projecting columns)
   :end-before: (Doc section: Projecting columns)
   :emphasize-lines: 18-28
   :linenos:
   :lineno-match:

This also determines the column selection; only the given columns will be
present in the resulting table. If you want to include a derived column in
*addition* to the existing columns, you can build up the expressions from the
dataset schema:

.. literalinclude:: ../../../cpp/examples/arrow/dataset_documentation_example.cc
   :language: cpp
   :start-after: (Doc section: Projecting columns #2)
   :end-before: (Doc section: Projecting columns #2)
   :emphasize-lines: 17-27
   :linenos:
   :lineno-match:

.. note:: When combining filters and projections, Arrow will determine all
   necessary columns to read. For instance, if you filter on a column that
   isn't ultimately selected, Arrow will still read the column to evaluate
   the filter.
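A minimal sketch of this behavior, assuming the ``cp`` alias for
``arrow::compute`` used in the full example and a function returning
:class:`arrow::Status`:

.. code-block:: cpp

   // Column "a" is still read from the files to evaluate the filter,
   // but it is dropped from the output since only "b" is selected.
   ARROW_ASSIGN_OR_RAISE(auto scan_builder, dataset->NewScan());
   ARROW_RETURN_NOT_OK(
       scan_builder->Filter(cp::greater(cp::field_ref("a"), cp::literal(5))));
   ARROW_RETURN_NOT_OK(scan_builder->Project({"b"}));
   ARROW_ASSIGN_OR_RAISE(auto scanner, scan_builder->Finish());
   ARROW_ASSIGN_OR_RAISE(auto table, scanner->ToTable());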
Reading and writing partitioned data
------------------------------------

So far, we've been working with datasets consisting of flat directories with
files. Oftentimes, a dataset will have one or more columns that are frequently
filtered on. Instead of having to read and then filter the data, we can
organize the files into a nested directory structure and define a partitioned
dataset, where sub-directory names hold information about which subset of the
data is stored in that directory. Then, we can more efficiently filter data by
using that information to avoid loading files that don't match the filter.

For example, a dataset partitioned by year and month may have the following
layout:

.. code-block:: text

   dataset_name/
     year=2007/
       month=01/
          data0.parquet
          data1.parquet
          ...
       month=02/
          data0.parquet
          data1.parquet
          ...
       month=03/
       ...
     year=2008/
       month=01/
       ...
     ...

The above partitioning scheme uses "/key=value/" directory names, as found in
Apache Hive. Under this convention, the file at
``dataset_name/year=2007/month=01/data0.parquet`` contains only data for
which ``year == 2007`` and ``month == 01``.

Let's create a small partitioned dataset. For this, we'll use Arrow's dataset
writing functionality.

.. literalinclude:: ../../../cpp/examples/arrow/dataset_documentation_example.cc
   :language: cpp
   :start-after: (Doc section: Reading and writing partitioned data)
   :end-before: (Doc section: Reading and writing partitioned data)
   :emphasize-lines: 25-42
   :linenos:
   :lineno-match:

The above created a directory with two subdirectories ("part=a" and
"part=b"), and the Parquet files written in those directories no longer
include the "part" column.

Reading this dataset, we now specify that the dataset should use a Hive-like
partitioning scheme:

.. literalinclude:: ../../../cpp/examples/arrow/dataset_documentation_example.cc
   :language: cpp
   :start-after: (Doc section: Reading and writing partitioned data #2)
   :end-before: (Doc section: Reading and writing partitioned data #2)
   :emphasize-lines: 7,9-11
   :linenos:
   :lineno-match:

Although the partition fields are not included in the actual Parquet files,
they will be added back to the resulting table when scanning this dataset:

.. code-block:: text

   $ ./debug/dataset_documentation_example file:///tmp parquet_hive partitioned
   Found fragment: /tmp/parquet_dataset/part=a/part0.parquet
   Partition expression: (part == "a")
   Found fragment: /tmp/parquet_dataset/part=b/part1.parquet
   Partition expression: (part == "b")
   Read 20 rows
   a: int64
     -- field metadata --
     PARQUET:field_id: '1'
   b: double
     -- field metadata --
     PARQUET:field_id: '2'
   c: int64
     -- field metadata --
     PARQUET:field_id: '3'
   part: string
   ----
   # snip...

We can now filter on the partition keys, which avoids loading files
altogether if they do not match the filter:

.. literalinclude:: ../../../cpp/examples/arrow/dataset_documentation_example.cc
   :language: cpp
   :start-after: (Doc section: Reading and writing partitioned data #3)
   :end-before: (Doc section: Reading and writing partitioned data #3)
   :emphasize-lines: 15-18
   :linenos:
   :lineno-match:

Different partitioning schemes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The above example uses a Hive-like directory scheme, such as
"/year=2009/month=11/day=15". We specified this by passing the Hive
partitioning factory. In this case, the types of the partition keys are
inferred from the file paths.

It is also possible to directly construct the partitioning and explicitly
define the schema of the partition keys. For example:

.. code-block:: cpp

   auto part = std::make_shared<ds::HivePartitioning>(arrow::schema({
       arrow::field("year", arrow::int16()),
       arrow::field("month", arrow::int8()),
       arrow::field("day", arrow::int32())
   }));

Arrow supports another partitioning scheme, "directory partitioning", where
the segments in the file path represent the values of the partition keys
without including the name (the field names are implicit in the segment's
index). For example, given field names "year", "month", and "day", one path
might be "/2019/11/15".

Since the names are not included in the file paths, these must be specified
when constructing a directory partitioning:

.. code-block:: cpp

   auto part = ds::DirectoryPartitioning::MakeFactory({"year", "month", "day"});

Directory partitioning also supports providing a full schema rather than
inferring types from file paths.

Reading from other data sources
-------------------------------

Reading in-memory data
~~~~~~~~~~~~~~~~~~~~~~

If you already have data in memory that you'd like to use with the Datasets
API (e.g. to filter/project data, or to write it out to a filesystem), you
can wrap it in an :class:`arrow::dataset::InMemoryDataset`:

.. code-block:: cpp

   auto table = arrow::Table::FromRecordBatches(...);
   auto dataset = std::make_shared<arrow::dataset::InMemoryDataset>(std::move(table));
   // Scan the dataset, filter it, etc.
   auto scanner_builder = dataset->NewScan();

In the full example, we use an InMemoryDataset to write our sample data to
local disk, which is then used in the rest of the example:

.. literalinclude:: ../../../cpp/examples/arrow/dataset_documentation_example.cc
   :language: cpp
   :start-after: (Doc section: Reading and writing partitioned data)
   :end-before: (Doc section: Reading and writing partitioned data)
   :emphasize-lines: 24-28
   :linenos:
   :lineno-match:

.. _cpp-dataset-cloud-storage:

Reading from cloud storage
~~~~~~~~~~~~~~~~~~~~~~~~~~

In addition to local files, Arrow Datasets also support reading from cloud
storage systems, such as Amazon S3, by passing a different filesystem. See
the :ref:`filesystem <cpp-filesystems>` docs for more details on the
available filesystems.
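For example, a filesystem instance can be derived from a URI. A minimal
sketch, assuming Arrow was built with S3 support enabled; the bucket name is
a hypothetical placeholder:

.. code-block:: cpp

   // Derive a filesystem (here, S3) from a URI; "my-bucket" is a
   // hypothetical placeholder. The path inside the bucket is returned
   // through the second argument.
   std::string path;
   auto filesystem =
       arrow::fs::FileSystemFromUri("s3://my-bucket/dataset_root", &path)
           .ValueOrDie();
   // `filesystem` and `path` can now be used with a
   // FileSystemDatasetFactory just like the local examples above.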
.. _cpp-dataset-full-example:

Full Example
------------

.. literalinclude:: ../../../cpp/examples/arrow/dataset_documentation_example.cc
   :language: cpp
   :linenos:
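To try the example yourself, one possible invocation is sketched below; the
CMake flags assume a from-source build of Arrow with the Datasets library,
Parquet support, and the bundled examples enabled:

.. code-block:: text

   $ cmake .. -DARROW_DATASET=ON -DARROW_PARQUET=ON -DARROW_BUILD_EXAMPLES=ON
   $ make
   $ ./debug/dataset_documentation_example file:///tmp parquet_hive partitioned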