Posted to commits@arrow.apache.org by ap...@apache.org on 2022/10/26 19:52:33 UTC
[arrow] branch master updated: ARROW-15328: [C++][Docs] Streaming CSV reader missing from documentation (#14452)
This is an automated email from the ASF dual-hosted git repository.
apitrou pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git
The following commit(s) were added to refs/heads/master by this push:
new c56934b579 ARROW-15328: [C++][Docs] Streaming CSV reader missing from documentation (#14452)
c56934b579 is described below
commit c56934b57922a6cbb46eaef097a36ed8d2473467
Author: Bryce Mecum <pe...@gmail.com>
AuthorDate: Wed Oct 26 11:52:26 2022 -0800
ARROW-15328: [C++][Docs] Streaming CSV reader missing from documentation (#14452)
Updates the C++ CSV reader docs to document the streaming CSV reader (StreamingReader) and add an example of its use, as suggested in [ARROW-15328](https://issues.apache.org/jira/browse/ARROW-15328).
@westonpace could you look at this and let me know if this is what you were thinking?
Authored-by: Bryce Mecum <pe...@gmail.com>
Signed-off-by: Antoine Pitrou <an...@python.org>
---
docs/source/cpp/csv.rst | 110 ++++++++++++++++++++++++++++++++++++++++++++----
1 file changed, 101 insertions(+), 9 deletions(-)
diff --git a/docs/source/cpp/csv.rst b/docs/source/cpp/csv.rst
index d6bb66ce49..6078ec5892 100644
--- a/docs/source/cpp/csv.rst
+++ b/docs/source/cpp/csv.rst
@@ -25,15 +25,26 @@ Reading and Writing CSV files
=============================
Arrow provides a fast CSV reader allowing ingestion of external data
-as Arrow tables.
+to create Arrow Tables or a stream of Arrow RecordBatches.
.. seealso::
:ref:`CSV reader/writer API reference <cpp-api-csv>`.
-Basic usage
-===========
+Reading CSV files
+=================
+
+Data in a CSV file can either be read in as a single Arrow Table using
+:class:`~arrow::csv::TableReader` or streamed as RecordBatches using
+:class:`~arrow::csv::StreamingReader`. See :ref:`Tradeoffs <cpp-csv-tradeoffs>` for a
+discussion of the tradeoffs between the two methods.
-A CSV file is read from a :class:`~arrow::io::InputStream`.
+Both these readers require an :class:`arrow::io::InputStream` instance
+representing the input file. Their behavior can be customized using a
+combination of :class:`~arrow::csv::ReadOptions`,
+:class:`~arrow::csv::ParseOptions`, and :class:`~arrow::csv::ConvertOptions`.
+
+TableReader
+-----------
.. code-block:: cpp
@@ -56,19 +67,98 @@ A CSV file is read from a :class:`~arrow::io::InputStream`.
parse_options,
convert_options);
if (!maybe_reader.ok()) {
- // Handle TableReader instantiation error...
+ // Handle TableReader instantiation error...
}
std::shared_ptr<arrow::csv::TableReader> reader = *maybe_reader;
// Read table from CSV file
auto maybe_table = reader->Read();
if (!maybe_table.ok()) {
- // Handle CSV read error
- // (for example a CSV syntax error or failed type conversion)
+ // Handle CSV read error
+ // (for example a CSV syntax error or failed type conversion)
}
std::shared_ptr<arrow::Table> table = *maybe_table;
}
+StreamingReader
+---------------
+
+.. code-block:: cpp
+
+ #include "arrow/csv/api.h"
+
+ {
+ // ...
+ arrow::io::IOContext io_context = arrow::io::default_io_context();
+ std::shared_ptr<arrow::io::InputStream> input = ...;
+
+ auto read_options = arrow::csv::ReadOptions::Defaults();
+ auto parse_options = arrow::csv::ParseOptions::Defaults();
+ auto convert_options = arrow::csv::ConvertOptions::Defaults();
+
+ // Instantiate StreamingReader from input stream and options
+ auto maybe_reader =
+ arrow::csv::StreamingReader::Make(io_context,
+ input,
+ read_options,
+ parse_options,
+ convert_options);
+ if (!maybe_reader.ok()) {
+ // Handle StreamingReader instantiation error...
+ }
+ std::shared_ptr<arrow::csv::StreamingReader> reader = *maybe_reader;
+
+ // Set aside a RecordBatch pointer for re-use while streaming
+ std::shared_ptr<arrow::RecordBatch> batch;
+
+ while (true) {
+ // Attempt to read the next RecordBatch
+ arrow::Status status = reader->ReadNext(&batch);
+
+ if (!status.ok()) {
+ // Handle read error
+ }
+
+ if (batch == nullptr) {
+ // Handle end of file
+ break;
+ }
+
+ // Do something with the batch
+ }
+ }
+
+.. _cpp-csv-tradeoffs:
+
+Tradeoffs
+---------
+
+The choice between using :class:`~arrow::csv::TableReader` or
+:class:`~arrow::csv::StreamingReader` will ultimately depend on the use case,
+but there are a few tradeoffs to be aware of:
+
+1. **Memory usage:** :class:`~arrow::csv::TableReader` loads all of the data
+ into memory at once and, depending on the amount of data, may require
+ considerably more memory than :class:`~arrow::csv::StreamingReader`, which
+ only loads one :class:`~arrow::RecordBatch` at a time. This is likely to be
+ the most significant tradeoff for users.
+2. **Speed:** When reading the entire contents of a CSV,
+ :class:`~arrow::csv::TableReader` will tend to be faster than
+ :class:`~arrow::csv::StreamingReader` because it makes better use of
+ available cores. See :ref:`Performance <cpp-csv-performance>` for more
+ details.
+3. **Flexibility:** :class:`~arrow::csv::StreamingReader` might be considered
+ less flexible than :class:`~arrow::csv::TableReader` because it performs type
+ inference only on the first block that's read in, after which point the types
+ are frozen and any data in subsequent blocks that cannot be converted to
+ those types will cause an error. Note that this can be remedied either by
+ setting :member:`ReadOptions::block_size` to a large enough value or by using
+ :member:`ConvertOptions::column_types` to set the desired data types
+ explicitly.
+
+Writing CSV files
+=================
+
A CSV file is written to a :class:`~arrow::io::OutputStream`.
.. code-block:: cpp
@@ -275,11 +365,13 @@ Write Options
The format of written CSV files can be customized via :class:`~arrow::csv::WriteOptions`.
Currently few options are available; more will be added in future releases.
+.. _cpp-csv-performance:
+
Performance
===========
-By default, the CSV reader will parallelize reads in order to exploit all
-CPU cores on your machine. You can change this setting in
+By default, :class:`~arrow::csv::TableReader` will parallelize reads in order to
+exploit all CPU cores on your machine. You can change this setting in
:member:`ReadOptions::use_threads`. A reasonable expectation is at least
100 MB/s per core on a performant desktop or laptop computer (measured in
source CSV bytes, not target Arrow data bytes).
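
Editorial aside on the Flexibility tradeoff described in the patch above: the following is a minimal, self-contained C++ sketch of what "infer types on the first block, then freeze them" means in practice. It is a toy model, not Arrow code. `ColumnType`, `Block`, `ParsesAsInt64`, `InferType`, and `FirstFailingBlock` are hypothetical names introduced only for illustration, and the two-type (integer or string) inference is a deliberate simplification of what a real CSV reader does.

```cpp
#include <cstddef>
#include <exception>
#include <string>
#include <vector>

// Hypothetical toy types for illustration only; none of this is Arrow API.
enum class ColumnType { kInt64, kString };

// One "block" of a single CSV column: the raw cell text, in row order.
using Block = std::vector<std::string>;

// Returns true if the entire cell parses as a signed integer.
bool ParsesAsInt64(const std::string& cell) {
  try {
    std::size_t consumed = 0;
    std::stoll(cell, &consumed);
    return consumed == cell.size();
  } catch (const std::exception&) {
    // Empty cells, non-numeric text, and out-of-range values land here.
    return false;
  }
}

// Infers the narrowest toy type that fits every cell in the given blocks.
ColumnType InferType(const std::vector<Block>& blocks) {
  for (const Block& block : blocks) {
    for (const std::string& cell : block) {
      if (!ParsesAsInt64(cell)) return ColumnType::kString;
    }
  }
  return ColumnType::kInt64;
}

// Given a type that has already been frozen, returns the index of the first
// block containing a cell that cannot be converted to it, or -1 if every
// block converts cleanly.
int FirstFailingBlock(const std::vector<Block>& blocks, ColumnType frozen) {
  if (frozen == ColumnType::kString) return -1;  // any text is a valid string
  for (std::size_t i = 0; i < blocks.size(); ++i) {
    for (const std::string& cell : blocks[i]) {
      if (!ParsesAsInt64(cell)) return static_cast<int>(i);
    }
  }
  return -1;
}
```

Under these assumptions, for a column split into the blocks `{"1", "2"}` and `{"3", "n/a"}`, inferring from the first block alone yields `kInt64`, and `FirstFailingBlock` then reports block 1 as the point where conversion fails. Inferring over both blocks yields `kString`, and nothing fails. This is the situation the patch addresses by suggesting a larger `ReadOptions::block_size` or explicit `ConvertOptions::column_types`.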