author | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-04-21 11:54:28 +0000
committer | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-04-21 11:54:28 +0000
commit | e6918187568dbd01842d8d1d2c808ce16a894239 (patch)
tree | 64f88b554b444a49f656b6c656111a145cbbaa28 /src/arrow/docs/source/python/csv.rst
parent | Initial commit. (diff)
Adding upstream version 18.2.2. (tag: upstream/18.2.2)
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'src/arrow/docs/source/python/csv.rst')
-rw-r--r-- | src/arrow/docs/source/python/csv.rst | 170
1 file changed, 170 insertions, 0 deletions
diff --git a/src/arrow/docs/source/python/csv.rst b/src/arrow/docs/source/python/csv.rst
new file mode 100644
index 000000000..1724c63f4
--- /dev/null
+++ b/src/arrow/docs/source/python/csv.rst
@@ -0,0 +1,170 @@

.. Licensed to the Apache Software Foundation (ASF) under one
.. or more contributor license agreements.  See the NOTICE file
.. distributed with this work for additional information
.. regarding copyright ownership.  The ASF licenses this file
.. to you under the Apache License, Version 2.0 (the
.. "License"); you may not use this file except in compliance
.. with the License.  You may obtain a copy of the License at

..   http://www.apache.org/licenses/LICENSE-2.0

.. Unless required by applicable law or agreed to in writing,
.. software distributed under the License is distributed on an
.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
.. KIND, either express or implied.  See the License for the
.. specific language governing permissions and limitations
.. under the License.

.. currentmodule:: pyarrow.csv
.. _csv:

Reading and Writing CSV files
=============================

Arrow supports reading and writing columnar data from/to CSV files.
The features currently offered are the following:

* multi-threaded or single-threaded reading
* automatic decompression of input files (based on the filename extension,
  such as ``my_data.csv.gz``)
* fetching column names from the first row in the CSV file
* column-wise type inference and conversion to one of ``null``, ``int64``,
  ``float64``, ``date32``, ``time32[s]``, ``timestamp[s]``, ``timestamp[ns]``,
  ``string`` or ``binary`` data
* opportunistic dictionary encoding of ``string`` and ``binary`` columns
  (disabled by default)
* detecting various spellings of null values such as ``NaN`` or ``#N/A``
* writing CSV files with options to configure the exact output format

Usage
-----

CSV reading and writing functionality is available through the
:mod:`pyarrow.csv` module. In many cases, you will simply call the
:func:`read_csv` function with the file path you want to read from::

    >>> from pyarrow import csv
    >>> fn = 'tips.csv.gz'
    >>> table = csv.read_csv(fn)
    >>> table
    pyarrow.Table
    total_bill: double
    tip: double
    sex: string
    smoker: string
    day: string
    time: string
    size: int64
    >>> len(table)
    244
    >>> df = table.to_pandas()
    >>> df.head()
       total_bill   tip     sex smoker  day    time  size
    0       16.99  1.01  Female     No  Sun  Dinner     2
    1       10.34  1.66    Male     No  Sun  Dinner     3
    2       21.01  3.50    Male     No  Sun  Dinner     3
    3       23.68  3.31    Male     No  Sun  Dinner     2
    4       24.59  3.61  Female     No  Sun  Dinner     4

To write CSV files, just call :func:`write_csv` with a
:class:`pyarrow.RecordBatch` or :class:`pyarrow.Table` and a path or
file-like object::

    >>> import pyarrow as pa
    >>> import pyarrow.csv as csv
    >>> csv.write_csv(table, "tips.csv")
    >>> with pa.CompressedOutputStream("tips.csv.gz", "gzip") as out:
    ...     csv.write_csv(table, out)

.. note:: The writer does not yet support all Arrow types.

Customized parsing
------------------

To alter the default parsing settings when reading CSV files with an
unusual structure, create a :class:`ParseOptions` instance and pass it
to :func:`read_csv`.
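For example, here is a minimal sketch for reading semicolon-delimited data;
the file name ``data.csv`` and the ``;`` delimiter are illustrative
assumptions, not part of the example dataset used above::

    import pyarrow.csv as csv

    # Hypothetical file that uses ';' as the field delimiter.
    parse_options = csv.ParseOptions(delimiter=';')
    table = csv.read_csv('data.csv', parse_options=parse_options)

:class:`ParseOptions` also controls quoting, escaping, and newline
handling inside values.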
Customized conversion
---------------------

To alter how CSV data is converted to Arrow types and values, create a
:class:`ConvertOptions` instance and pass it to :func:`read_csv`::

    import pyarrow as pa
    import pyarrow.csv as csv

    table = csv.read_csv('tips.csv.gz', convert_options=pa.csv.ConvertOptions(
        column_types={
            'total_bill': pa.decimal128(precision=10, scale=2),
            'tip': pa.decimal128(precision=10, scale=2),
        }
    ))

Incremental reading
-------------------

For memory-constrained environments, it is also possible to read a CSV file
one batch at a time, using :func:`open_csv` (a usage sketch appears at the
end of this page).

There are a few caveats:

1. For now, the incremental reader is always single-threaded (regardless of
   :attr:`ReadOptions.use_threads`)

2. Type inference is done on the first block and types are frozen afterwards;
   to make sure the right data types are inferred, either set
   :attr:`ReadOptions.block_size` to a large enough value, or use
   :attr:`ConvertOptions.column_types` to set the desired data types explicitly.

Character encoding
------------------

By default, CSV files are expected to be encoded in UTF-8. Non-UTF-8 data
is accepted for ``binary`` columns. The encoding can be changed using
the :class:`ReadOptions` class.

Customized writing
------------------

To alter the default write settings when writing CSV files with different
conventions, create a :class:`WriteOptions` instance and pass it to
:func:`write_csv`::

    >>> import pyarrow as pa
    >>> import pyarrow.csv as csv
    >>> # Omit the header row (include_header=True is the default)
    >>> options = csv.WriteOptions(include_header=False)
    >>> csv.write_csv(table, "data.csv", options)

Incremental writing
-------------------

To write CSV files one batch at a time, create a :class:`CSVWriter`. This
requires the output (a path or file-like object), the schema of the data to
be written, and optionally write options as described above::

    >>> import pyarrow as pa
    >>> import pyarrow.csv as csv
    >>> with csv.CSVWriter("data.csv", table.schema) as writer:
    ...     writer.write_table(table)

Performance
-----------

Due to the structure of CSV files, one cannot expect the same levels of
performance as when reading dedicated binary formats like
:ref:`Parquet <Parquet>`. Nevertheless, Arrow strives to reduce the
overhead of reading CSV files. A reasonable expectation is at least
100 MB/s per core on a performant desktop or laptop computer (measured
in source CSV bytes, not target Arrow data bytes).

Performance options can be controlled through the :class:`ReadOptions` class.
Multi-threaded reading is the default for highest performance, distributing
the workload efficiently over all available cores.

.. note::
   The number of concurrent threads is automatically inferred by Arrow.
   You can inspect and change it using the :func:`~pyarrow.cpu_count`
   and :func:`~pyarrow.set_cpu_count` functions, respectively.
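As a concrete illustration of that note, the following sketch inspects and
caps the global thread count; the value ``4`` is an arbitrary choice for
demonstration::

    import pyarrow as pa
    import pyarrow.csv as csv

    print(pa.cpu_count())    # number of threads Arrow will use by default
    pa.set_cpu_count(4)      # e.g. limit parallelism on a shared machine
    table = csv.read_csv('tips.csv.gz')  # now reads with at most 4 threads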
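Finally, the incremental-reading sketch promised in the Incremental reading
section above: iterating over a CSV file batch by batch with
:func:`open_csv`. Printing each batch's row count is only a placeholder for
real per-batch processing::

    import pyarrow.csv as csv

    # Each iteration yields a pyarrow.RecordBatch whose size is bounded
    # by ReadOptions.block_size, keeping peak memory usage low.
    reader = csv.open_csv('tips.csv.gz')
    for batch in reader:
        print(batch.num_rows)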