path: root/src/arrow/docs/source/python/csv.rst
author    Daniel Baumann <daniel.baumann@progress-linux.org>  2024-04-21 11:54:28 +0000
committer Daniel Baumann <daniel.baumann@progress-linux.org>  2024-04-21 11:54:28 +0000
commit    e6918187568dbd01842d8d1d2c808ce16a894239 (patch)
tree      64f88b554b444a49f656b6c656111a145cbbaa28 /src/arrow/docs/source/python/csv.rst
parent    Initial commit. (diff)
Adding upstream version 18.2.2. (tag: upstream/18.2.2)
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'src/arrow/docs/source/python/csv.rst')
-rw-r--r--  src/arrow/docs/source/python/csv.rst | 170
1 file changed, 170 insertions, 0 deletions
diff --git a/src/arrow/docs/source/python/csv.rst b/src/arrow/docs/source/python/csv.rst
new file mode 100644
index 000000000..1724c63f4
--- /dev/null
+++ b/src/arrow/docs/source/python/csv.rst
@@ -0,0 +1,170 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements. See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership. The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License. You may obtain a copy of the License at
+
+.. http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied. See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. currentmodule:: pyarrow.csv
+.. _csv:
+
+Reading and Writing CSV files
+=============================
+
+Arrow supports reading and writing columnar data from/to CSV files.
+The features currently offered are the following:
+
+* multi-threaded or single-threaded reading
+* automatic decompression of input files (based on the filename extension,
+ such as ``my_data.csv.gz``)
+* fetching column names from the first row in the CSV file
+* column-wise type inference and conversion to one of ``null``, ``int64``,
+ ``float64``, ``date32``, ``time32[s]``, ``timestamp[s]``, ``timestamp[ns]``,
+ ``string`` or ``binary`` data
+* opportunistic dictionary encoding of ``string`` and ``binary`` columns
+ (disabled by default)
+* detecting various spellings of null values such as ``NaN`` or ``#N/A``
+* writing CSV files with options to configure the exact output format
+
+Usage
+-----
+
+CSV reading and writing functionality is available through the
+:mod:`pyarrow.csv` module. In many cases, you will simply call the
+:func:`read_csv` function with the file path you want to read from::
+
+ >>> from pyarrow import csv
+ >>> fn = 'tips.csv.gz'
+ >>> table = csv.read_csv(fn)
+ >>> table
+ pyarrow.Table
+ total_bill: double
+ tip: double
+ sex: string
+ smoker: string
+ day: string
+ time: string
+ size: int64
+ >>> len(table)
+ 244
+ >>> df = table.to_pandas()
+ >>> df.head()
+ total_bill tip sex smoker day time size
+ 0 16.99 1.01 Female No Sun Dinner 2
+ 1 10.34 1.66 Male No Sun Dinner 3
+ 2 21.01 3.50 Male No Sun Dinner 3
+ 3 23.68 3.31 Male No Sun Dinner 2
+ 4 24.59 3.61 Female No Sun Dinner 4
+
+To write CSV files, just call :func:`write_csv` with a
+:class:`pyarrow.RecordBatch` or :class:`pyarrow.Table` and a path or
+file-like object::
+
+ >>> import pyarrow as pa
+ >>> import pyarrow.csv as csv
+ >>> csv.write_csv(table, "tips.csv")
+ >>> with pa.CompressedOutputStream("tips.csv.gz", "gzip") as out:
+ ... csv.write_csv(table, out)
+
+.. note:: The writer does not yet support all Arrow types.
+
+Customized parsing
+------------------
+
+When reading CSV files with an unusual structure, you can alter the default
+parsing settings by creating a :class:`ParseOptions` instance and passing
+it to :func:`read_csv`.
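+
+For example, a minimal sketch for reading a pipe-delimited file (the
+filename ``data.psv`` here is hypothetical)::
+
+   import pyarrow.csv as csv
+
+   # Use '|' as the field delimiter instead of the default ','
+   parse_options = csv.ParseOptions(delimiter='|')
+   table = csv.read_csv('data.psv', parse_options=parse_options)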
+
+Customized conversion
+---------------------
+
+To alter how CSV data is converted to Arrow types and values, create
+a :class:`ConvertOptions` instance and pass it to :func:`read_csv`::
+
+ import pyarrow as pa
+ import pyarrow.csv as csv
+
+ table = csv.read_csv('tips.csv.gz', convert_options=pa.csv.ConvertOptions(
+ column_types={
+ 'total_bill': pa.decimal128(precision=10, scale=2),
+ 'tip': pa.decimal128(precision=10, scale=2),
+ }
+ ))
+
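+:class:`ConvertOptions` also controls the opportunistic dictionary encoding
+of string columns mentioned above (disabled by default); a minimal sketch::
+
+   import pyarrow.csv as csv
+
+   # Encode string columns as dictionary (categorical) values
+   convert_options = csv.ConvertOptions(auto_dict_encode=True)
+   table = csv.read_csv('tips.csv.gz', convert_options=convert_options)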
+
+Incremental reading
+-------------------
+
+For memory-constrained environments, it is also possible to read a CSV file
+one batch at a time, using :func:`open_csv`.
+
+There are a few caveats (a usage sketch follows the list):
+
+1. For now, the incremental reader is always single-threaded (regardless of
+   :attr:`ReadOptions.use_threads`).
+
+2. Type inference is done on the first block and types are frozen afterwards;
+ to make sure the right data types are inferred, either set
+ :attr:`ReadOptions.block_size` to a large enough value, or use
+ :attr:`ConvertOptions.column_types` to set the desired data types explicitly.
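+
+A minimal sketch of reading batch by batch and reassembling a table::
+
+   import pyarrow as pa
+   import pyarrow.csv as csv
+
+   batches = []
+   reader = csv.open_csv('tips.csv.gz')
+   while True:
+       try:
+           # read_next_batch() raises StopIteration at end of file
+           batches.append(reader.read_next_batch())
+       except StopIteration:
+           break
+   table = pa.Table.from_batches(batches)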
+
+Character encoding
+------------------
+
+By default, CSV files are expected to be encoded in UTF-8. Non-UTF-8 data
+is accepted for ``binary`` columns. The encoding can be changed using
+the :class:`ReadOptions` class.
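+
+For example, a minimal sketch for a Latin-1 encoded file (the filename
+``legacy.csv`` here is hypothetical)::
+
+   import pyarrow.csv as csv
+
+   # Decode the file as Latin-1 instead of the default UTF-8
+   read_options = csv.ReadOptions(encoding='latin-1')
+   table = csv.read_csv('legacy.csv', read_options=read_options)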
+
+Customized writing
+------------------
+
+To write CSV files with conventions that differ from the defaults, create
+a :class:`WriteOptions` instance and pass it to :func:`write_csv`::
+
+ >>> import pyarrow as pa
+ >>> import pyarrow.csv as csv
+ >>> # Omit the header row (include_header=True is the default)
+ >>> options = csv.WriteOptions(include_header=False)
+ >>> csv.write_csv(table, "data.csv", options)
+
+Incremental writing
+-------------------
+
+To write CSV files one batch at a time, create a :class:`CSVWriter`. This
+requires the output (a path or file-like object), the schema of the data to
+be written, and optionally write options as described above::
+
+ >>> import pyarrow as pa
+ >>> import pyarrow.csv as csv
+   >>> with csv.CSVWriter("data.csv", table.schema) as writer:
+   ...     writer.write_table(table)
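+
+The writer also accepts individual record batches, so a table can be
+streamed out batch by batch; a minimal sketch::
+
+   >>> with csv.CSVWriter("data.csv", table.schema) as writer:
+   ...     for batch in table.to_batches():
+   ...         writer.write_batch(batch)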
+
+Performance
+-----------
+
+Due to the structure of CSV files, one cannot expect the same levels of
+performance as when reading dedicated binary formats like
+:ref:`Parquet <Parquet>`. Nevertheless, Arrow strives to reduce the
+overhead of reading CSV files. A reasonable expectation is at least
+100 MB/s per core on a performant desktop or laptop computer (measured
+in source CSV bytes, not target Arrow data bytes).
+
+Performance options can be controlled through the :class:`ReadOptions` class.
+Multi-threaded reading is the default for highest performance, distributing
+the workload efficiently over all available cores.
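+
+For example, a minimal sketch that forces single-threaded reading, which
+can be useful when profiling or debugging::
+
+   import pyarrow.csv as csv
+
+   # Disable multi-threaded reads for this call only
+   read_options = csv.ReadOptions(use_threads=False)
+   table = csv.read_csv('tips.csv.gz', read_options=read_options)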
+
+.. note::
+ The number of concurrent threads is automatically inferred by Arrow.
+ You can inspect and change it using the :func:`~pyarrow.cpu_count()`
+ and :func:`~pyarrow.set_cpu_count()` functions, respectively.
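+
+   A minimal sketch (the reported count depends on your machine)::
+
+      >>> import pyarrow as pa
+      >>> pa.cpu_count()
+      8
+      >>> pa.set_cpu_count(4)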