Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/10/19 13:22:32 UTC

[GitHub] [arrow] pitrou commented on a diff in pull request #14452: ARROW-15328: [C++][Docs] Streaming CSV reader missing from documentation

pitrou commented on code in PR #14452:
URL: https://github.com/apache/arrow/pull/14452#discussion_r999437065


##########
docs/source/cpp/csv.rst:
##########
@@ -25,15 +25,24 @@ Reading and Writing CSV files
 =============================
 
 Arrow provides a fast CSV reader allowing ingestion of external data
-as Arrow tables.
+as Arrow Tables or streamed as Arrow RecordBatches.
 
 .. seealso::
    :ref:`CSV reader/writer API reference <cpp-api-csv>`.
 
-Basic usage
-===========
+Reading CSV files
+=================
+
+Data in a CSV file can either be read in as a single Arrow Table using
+:class:`~arrow::csv::TableReader` or streamed as RecordBatches using
+:class:`~arrow::csv::StreamingReader`. See :ref:`Tradeoffs <tradeoffs>` for a
+discussion of the tradeoffs between the two methods.
+
+TableReader
+-----------
 
-A CSV file is read from a :class:`~arrow::io::InputStream`.
+The :class:`~arrow::csv::TableReader` class requires an
+:class:`::arrow::io::InputStream` instance representing the input file.

Review Comment:
   Since this is true for both readers, why not move this up as a more general sentence in "Reading CSV files"?
   
   For example:
   ```rest
   Both these readers require an :class:`arrow::io::InputStream` instance
   representing the input file. Their behavior can be customized using a
   combination of :class:`~arrow::csv::ReadOptions`,
   :class:`~arrow::csv::ParseOptions`, and :class:`~arrow::csv::ConvertOptions`.
   ```
   



##########
docs/source/cpp/csv.rst:
##########
@@ -56,19 +65,88 @@ A CSV file is read from a :class:`~arrow::io::InputStream`.
                                       parse_options,
                                       convert_options);
       if (!maybe_reader.ok()) {
-         // Handle TableReader instantiation error...
+        // Handle TableReader instantiation error...
       }
       std::shared_ptr<arrow::csv::TableReader> reader = *maybe_reader;
 
       // Read table from CSV file
       auto maybe_table = reader->Read();
       if (!maybe_table.ok()) {
-         // Handle CSV read error
-         // (for example a CSV syntax error or failed type conversion)
+        // Handle CSV read error
+        // (for example a CSV syntax error or failed type conversion)
       }
       std::shared_ptr<arrow::Table> table = *maybe_table;
    }
 
+Behavior of :class:`~arrow::csv::TableReader` can be customized using a
+combination of :class:`~arrow::csv::ReadOptions`,
+:class:`~arrow::csv::ParseOptions`, and :class:`~arrow::csv::ConvertOptions`.
+
+StreamingReader
+---------------
+
+The :class:`~arrow::csv::StreamingReader` class requires an
+:class:`::arrow::io::InputStream` instance representing the input file.
+
+.. code-block:: cpp
+
+   #include "arrow/csv/api.h"
+
+   {
+      // ...
+      arrow::io::IOContext io_context = arrow::io::default_io_context();
+      std::shared_ptr<arrow::io::InputStream> input = ...;
+
+      auto read_options = arrow::csv::ReadOptions::Defaults();
+      auto parse_options = arrow::csv::ParseOptions::Defaults();
+      auto convert_options = arrow::csv::ConvertOptions::Defaults();
+
+      // Instantiate StreamingReader from input stream and options
+      auto maybe_reader =
+        arrow::csv::StreamingReader::Make(io_context,
+                                          input,
+                                          read_options,
+                                          parse_options,
+                                          convert_options);
+      if (!maybe_reader.ok()) {
+        // Handle StreamingReader instantiation error...
+      }
+      std::shared_ptr<arrow::csv::StreamingReader> reader = *maybe_reader;
+
+      // Set aside a RecordBatch pointer for re-use while streaming
+      std::shared_ptr<RecordBatch> batch;
+
+      // Attempt to read the first RecordBatch
+      reader->ReadNext(&batch);
+
+      if (batch == NULL) {
+        // Handle end of file
+      }
+   }
+
+Behavior of :class:`~arrow::csv::StreamingReader` can be customized using a
+combination of :class:`~arrow::csv::ReadOptrions`,
+:class:`~arrow::csv::ParseOptions`, and :class:`~arrow::csv::ConvertOptions`.
+
+.. _tradeoffs:
+
+Tradeoffs
+---------
+
+The choice between using :class:`~arrow::csv::TableReader` or :class:`~arrow::csv::StreamingReader`
+will depend on your use case but two caveats are worth pointing out:
+
+1. :class:`~arrow::csv::TableReader` is capable of using multiple threads (See
+   :ref:`Performance <performance>`) whereas
+   :class:`~arrow::csv::StreamingReader` is always single-threaded and will
+   ignore :member:`ReadOptions::use_threads`.
+2. :class:`~arrow::csv::StreamingReader` performs type inference off the first
+   block that's read in, after which point the types are frozen. Either set :member:`ReadOptions::block_size` to a large enough value or use :member:`ConvertOptions::column_types` to set the desired data types

Review Comment:
   Can you try to limit line length here?



##########
docs/source/cpp/csv.rst:
##########
@@ -275,11 +353,13 @@ Write Options
 The format of written CSV files can be customized via :class:`~arrow::csv::WriteOptions`.
 Currently few options are available; more will be added in future releases.
 
+.. _performance:

Review Comment:
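   Same here: like `tradeoffs`, this reference label is global to the entire docs, so it should be disambiguated.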



##########
docs/source/cpp/csv.rst:
##########
@@ -56,19 +65,88 @@ A CSV file is read from a :class:`~arrow::io::InputStream`.
                                       parse_options,
                                       convert_options);
       if (!maybe_reader.ok()) {
-         // Handle TableReader instantiation error...
+        // Handle TableReader instantiation error...
       }
       std::shared_ptr<arrow::csv::TableReader> reader = *maybe_reader;
 
       // Read table from CSV file
       auto maybe_table = reader->Read();
       if (!maybe_table.ok()) {
-         // Handle CSV read error
-         // (for example a CSV syntax error or failed type conversion)
+        // Handle CSV read error
+        // (for example a CSV syntax error or failed type conversion)
       }
       std::shared_ptr<arrow::Table> table = *maybe_table;
    }
 
+Behavior of :class:`~arrow::csv::TableReader` can be customized using a
+combination of :class:`~arrow::csv::ReadOptions`,
+:class:`~arrow::csv::ParseOptions`, and :class:`~arrow::csv::ConvertOptions`.
+
+StreamingReader
+---------------
+
+The :class:`~arrow::csv::StreamingReader` class requires an
+:class:`::arrow::io::InputStream` instance representing the input file.
+
+.. code-block:: cpp
+
+   #include "arrow/csv/api.h"
+
+   {
+      // ...
+      arrow::io::IOContext io_context = arrow::io::default_io_context();
+      std::shared_ptr<arrow::io::InputStream> input = ...;
+
+      auto read_options = arrow::csv::ReadOptions::Defaults();
+      auto parse_options = arrow::csv::ParseOptions::Defaults();
+      auto convert_options = arrow::csv::ConvertOptions::Defaults();
+
+      // Instantiate StreamingReader from input stream and options
+      auto maybe_reader =
+        arrow::csv::StreamingReader::Make(io_context,
+                                          input,
+                                          read_options,
+                                          parse_options,
+                                          convert_options);
+      if (!maybe_reader.ok()) {
+        // Handle StreamingReader instantiation error...
+      }
+      std::shared_ptr<arrow::csv::StreamingReader> reader = *maybe_reader;
+
+      // Set aside a RecordBatch pointer for re-use while streaming
+      std::shared_ptr<RecordBatch> batch;
+
+      // Attempt to read the first RecordBatch
+      reader->ReadNext(&batch);
+
+      if (batch == NULL) {
+        // Handle end of file
+      }
+   }
+
+Behavior of :class:`~arrow::csv::StreamingReader` can be customized using a
+combination of :class:`~arrow::csv::ReadOptrions`,
+:class:`~arrow::csv::ParseOptions`, and :class:`~arrow::csv::ConvertOptions`.
+
+.. _tradeoffs:

Review Comment:
   These references are global to the entire docs, so they should be disambiguated, for example `cpp-csv-tradeoffs` instead of `tradeoffs`.



##########
docs/source/cpp/csv.rst:
##########
@@ -56,19 +65,88 @@ A CSV file is read from a :class:`~arrow::io::InputStream`.
                                       parse_options,
                                       convert_options);
       if (!maybe_reader.ok()) {
-         // Handle TableReader instantiation error...
+        // Handle TableReader instantiation error...
       }
       std::shared_ptr<arrow::csv::TableReader> reader = *maybe_reader;
 
       // Read table from CSV file
       auto maybe_table = reader->Read();
       if (!maybe_table.ok()) {
-         // Handle CSV read error
-         // (for example a CSV syntax error or failed type conversion)
+        // Handle CSV read error
+        // (for example a CSV syntax error or failed type conversion)
       }
       std::shared_ptr<arrow::Table> table = *maybe_table;
    }
 
+Behavior of :class:`~arrow::csv::TableReader` can be customized using a
+combination of :class:`~arrow::csv::ReadOptions`,
+:class:`~arrow::csv::ParseOptions`, and :class:`~arrow::csv::ConvertOptions`.
+
+StreamingReader
+---------------
+
+The :class:`~arrow::csv::StreamingReader` class requires an
+:class:`::arrow::io::InputStream` instance representing the input file.
+
+.. code-block:: cpp
+
+   #include "arrow/csv/api.h"
+
+   {
+      // ...
+      arrow::io::IOContext io_context = arrow::io::default_io_context();
+      std::shared_ptr<arrow::io::InputStream> input = ...;
+
+      auto read_options = arrow::csv::ReadOptions::Defaults();
+      auto parse_options = arrow::csv::ParseOptions::Defaults();
+      auto convert_options = arrow::csv::ConvertOptions::Defaults();
+
+      // Instantiate StreamingReader from input stream and options
+      auto maybe_reader =
+        arrow::csv::StreamingReader::Make(io_context,
+                                          input,
+                                          read_options,
+                                          parse_options,
+                                          convert_options);
+      if (!maybe_reader.ok()) {
+        // Handle StreamingReader instantiation error...
+      }
+      std::shared_ptr<arrow::csv::StreamingReader> reader = *maybe_reader;
+
+      // Set aside a RecordBatch pointer for re-use while streaming
+      std::shared_ptr<RecordBatch> batch;
+
+      // Attempt to read the first RecordBatch
+      reader->ReadNext(&batch);

Review Comment:
   The `Status` returned by `ReadNext` should not be ignored here.
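
   A minimal sketch of checking that `Status`, assuming the input is a small in-memory CSV string wrapped in an `arrow::io::BufferReader`. The helper name `CountCsvRows` is hypothetical and is not part of the PR or of Arrow. It propagates any error from `StreamingReader::Make` or `ReadNext`, and treats a null batch as end of input only once the `Status` is known to be OK:

   ```cpp
   #include <cstdint>
   #include <memory>
   #include <string>

   #include "arrow/api.h"
   #include "arrow/csv/api.h"
   #include "arrow/io/api.h"

   // Hypothetical helper: stream CSV text batch by batch and return the
   // total number of data rows. Any non-OK Status is returned to the caller.
   arrow::Result<int64_t> CountCsvRows(const std::string& csv_text) {
     // Assumption: in-memory input, so the sketch needs no file on disk
     std::shared_ptr<arrow::io::InputStream> input =
         std::make_shared<arrow::io::BufferReader>(
             arrow::Buffer::FromString(csv_text));

     // Instantiate StreamingReader; an instantiation error is propagated
     ARROW_ASSIGN_OR_RAISE(
         auto reader,
         arrow::csv::StreamingReader::Make(arrow::io::default_io_context(),
                                           input,
                                           arrow::csv::ReadOptions::Defaults(),
                                           arrow::csv::ParseOptions::Defaults(),
                                           arrow::csv::ConvertOptions::Defaults()));

     int64_t num_rows = 0;
     std::shared_ptr<arrow::RecordBatch> batch;
     while (true) {
       // Check the Status first: read errors (for example a CSV syntax
       // error or failed type conversion) are reported through it
       arrow::Status st = reader->ReadNext(&batch);
       if (!st.ok()) {
         return st;
       }
       if (batch == nullptr) {
         break;  // Status was OK and no batch was produced: end of input
       }
       num_rows += batch->num_rows();
     }
     return num_rows;
   }
   ```

   With the default options the first CSV row is taken as the column names, so it is not counted as a data row.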


