You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@arrow.apache.org by ap...@apache.org on 2022/10/26 19:52:33 UTC

[arrow] branch master updated: ARROW-15328: [C++][Docs] Streaming CSV reader missing from documentation (#14452)

This is an automated email from the ASF dual-hosted git repository.

apitrou pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git


The following commit(s) were added to refs/heads/master by this push:
     new c56934b579 ARROW-15328: [C++][Docs] Streaming CSV reader missing from documentation (#14452)
c56934b579 is described below

commit c56934b57922a6cbb46eaef097a36ed8d2473467
Author: Bryce Mecum <pe...@gmail.com>
AuthorDate: Wed Oct 26 11:52:26 2022 -0800

    ARROW-15328: [C++][Docs] Streaming CSV reader missing from documentation (#14452)
    
    Updates the C++ CSV reader docs to include documenting and an example of the streaming CSV reader (StreamingReader), as per the suggestion in [ARROW-15328](https://issues.apache.org/jira/browse/ARROW-15328).
    
    @ westonpace could you look at this and let me know if this is what you were thinking?
    
    Authored-by: Bryce Mecum <pe...@gmail.com>
    Signed-off-by: Antoine Pitrou <an...@python.org>
---
 docs/source/cpp/csv.rst | 110 ++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 101 insertions(+), 9 deletions(-)

diff --git a/docs/source/cpp/csv.rst b/docs/source/cpp/csv.rst
index d6bb66ce49..6078ec5892 100644
--- a/docs/source/cpp/csv.rst
+++ b/docs/source/cpp/csv.rst
@@ -25,15 +25,26 @@ Reading and Writing CSV files
 =============================
 
 Arrow provides a fast CSV reader allowing ingestion of external data
-as Arrow tables.
+to create Arrow Tables or a stream of Arrow RecordBatches.
 
 .. seealso::
    :ref:`CSV reader/writer API reference <cpp-api-csv>`.
 
-Basic usage
-===========
+Reading CSV files
+=================
+
+Data in a CSV file can either be read in as a single Arrow Table using
+:class:`~arrow::csv::TableReader` or streamed as RecordBatches using
+:class:`~arrow::csv::StreamingReader`. See :ref:`Tradeoffs <cpp-csv-tradeoffs>` for a
+discussion of the tradeoffs between the two methods.
 
-A CSV file is read from a :class:`~arrow::io::InputStream`.
+Both these readers require an :class:`arrow::io::InputStream` instance
+representing the input file. Their behavior can be customized using a
+combination of :class:`~arrow::csv::ReadOptions`,
+:class:`~arrow::csv::ParseOptions`, and :class:`~arrow::csv::ConvertOptions`.
+
+TableReader
+-----------
 
 .. code-block:: cpp
 
@@ -56,19 +67,98 @@ A CSV file is read from a :class:`~arrow::io::InputStream`.
                                       parse_options,
                                       convert_options);
       if (!maybe_reader.ok()) {
-         // Handle TableReader instantiation error...
+        // Handle TableReader instantiation error...
       }
       std::shared_ptr<arrow::csv::TableReader> reader = *maybe_reader;
 
       // Read table from CSV file
       auto maybe_table = reader->Read();
       if (!maybe_table.ok()) {
-         // Handle CSV read error
-         // (for example a CSV syntax error or failed type conversion)
+        // Handle CSV read error
+        // (for example a CSV syntax error or failed type conversion)
       }
       std::shared_ptr<arrow::Table> table = *maybe_table;
    }
 
+StreamingReader
+---------------
+
+.. code-block:: cpp
+
+   #include "arrow/csv/api.h"
+
+   {
+      // ...
+      arrow::io::IOContext io_context = arrow::io::default_io_context();
+      std::shared_ptr<arrow::io::InputStream> input = ...;
+
+      auto read_options = arrow::csv::ReadOptions::Defaults();
+      auto parse_options = arrow::csv::ParseOptions::Defaults();
+      auto convert_options = arrow::csv::ConvertOptions::Defaults();
+
+      // Instantiate StreamingReader from input stream and options
+      auto maybe_reader =
+        arrow::csv::StreamingReader::Make(io_context,
+                                          input,
+                                          read_options,
+                                          parse_options,
+                                          convert_options);
+      if (!maybe_reader.ok()) {
+        // Handle StreamingReader instantiation error...
+      }
+      std::shared_ptr<arrow::csv::StreamingReader> reader = *maybe_reader;
+
+      // Set aside a RecordBatch pointer for re-use while streaming
+      std::shared_ptr<RecordBatch> batch;
+
+      while (true) {
+          // Attempt to read the first RecordBatch
+          arrow::Status status = reader->ReadNext(&batch);
+
+          if (!status.ok()) {
+            // Handle read error
+          }
+
+          if (batch == NULL) {
+            // Handle end of file
+            break;
+          }
+
+          // Do something with the batch
+      }
+   }
+
+.. _cpp-csv-tradeoffs:
+
+Tradeoffs
+---------
+
+The choice between using :class:`~arrow::csv::TableReader` or
+:class:`~arrow::csv::StreamingReader` will ultimately depend on the use case
+but there are a few tradeoffs to be aware of:
+
+1. **Memory usage:** :class:`~arrow::csv::TableReader` loads all of the data
+   into memory at once and, depending on the amount of data, may require
+   considerably more memory than :class:`~arrow::csv::StreamingReader` which
+   only loads one :class:`~arrow::RecordBatch` at a time. This is likely to be
+   the most significant tradeoff for users.
+2. **Speed:** When reading the entire contents of a CSV,
+   :class:`~arrow::csv::TableReader` will tend to be faster than
+   :class:`~arrow::csv::StreamingReader` because it makes better use of
+   available cores. See :ref:`Performance <cpp-csv-performance>` for more
+   details.
+3. **Flexibility:** :class:`~arrow::csv::StreamingReader` might be considered
+   less flexible than :class:`~arrow::csv::TableReader` because it performs type
+   inference only on the first block that's read in, after which point the types
+   are frozen and any data in subsequent blocks that cannot be converted to
+   those types will cause an error. Note that this can be remedied either by
+   setting :member:`ReadOptions::block_size` to a large enough value or by using
+   :member:`ConvertOptions::column_types` to set the desired data types
+   explicitly.
+
+Writing CSV files
+=================
+
 A CSV file is written to a :class:`~arrow::io::OutputStream`.
 
 .. code-block:: cpp
@@ -275,11 +365,13 @@ Write Options
 The format of written CSV files can be customized via :class:`~arrow::csv::WriteOptions`.
 Currently few options are available; more will be added in future releases.
 
+.. _cpp-csv-performance:
+
 Performance
 ===========
 
-By default, the CSV reader will parallelize reads in order to exploit all
-CPU cores on your machine.  You can change this setting in
+By default, :class:`~arrow::csv::TableReader` will parallelize reads in order to
+exploit all CPU cores on your machine.  You can change this setting in
 :member:`ReadOptions::use_threads`.  A reasonable expectation is at least
 100 MB/s per core on a performant desktop or laptop computer (measured in
 source CSV bytes, not target Arrow data bytes).