Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/10/18 20:05:22 UTC

[GitHub] [arrow] amoeba opened a new pull request, #14452: ARROW-15328: [C++] [Docs] Streaming CSV reader missing from documentation

amoeba opened a new pull request, #14452:
URL: https://github.com/apache/arrow/pull/14452

   Updates the C++ CSV reader docs to document the streaming CSV reader (StreamingReader) and add an example of its use, as suggested in [ARROW-15328](https://issues.apache.org/jira/browse/ARROW-15328).
   
   @westonpace could you look at this and let me know if this is what you were thinking?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] amoeba commented on a diff in pull request #14452: ARROW-15328: [C++][Docs] Streaming CSV reader missing from documentation

Posted by GitBox <gi...@apache.org>.
amoeba commented on code in PR #14452:
URL: https://github.com/apache/arrow/pull/14452#discussion_r999981762


##########
docs/source/cpp/csv.rst:
##########
@@ -56,19 +65,88 @@ A CSV file is read from a :class:`~arrow::io::InputStream`.
                                       parse_options,
                                       convert_options);
       if (!maybe_reader.ok()) {
-         // Handle TableReader instantiation error...
+        // Handle TableReader instantiation error...
       }
       std::shared_ptr<arrow::csv::TableReader> reader = *maybe_reader;
 
       // Read table from CSV file
       auto maybe_table = reader->Read();
       if (!maybe_table.ok()) {
-         // Handle CSV read error
-         // (for example a CSV syntax error or failed type conversion)
+        // Handle CSV read error
+        // (for example a CSV syntax error or failed type conversion)
       }
       std::shared_ptr<arrow::Table> table = *maybe_table;
    }
 
+Behavior of :class:`~arrow::csv::TableReader` can be customized using a
+combination of :class:`~arrow::csv::ReadOptions`,
+:class:`~arrow::csv::ParseOptions`, and :class:`~arrow::csv::ConvertOptions`.
+
+StreamingReader
+---------------
+
+The :class:`~arrow::csv::StreamingReader` class requires an
+:class:`::arrow::io::InputStream` instance representing the input file.
+
+.. code-block:: cpp
+
+   #include "arrow/csv/api.h"
+
+   {
+      // ...
+      arrow::io::IOContext io_context = arrow::io::default_io_context();
+      std::shared_ptr<arrow::io::InputStream> input = ...;
+
+      auto read_options = arrow::csv::ReadOptions::Defaults();
+      auto parse_options = arrow::csv::ParseOptions::Defaults();
+      auto convert_options = arrow::csv::ConvertOptions::Defaults();
+
+      // Instantiate StreamingReader from input stream and options
+      auto maybe_reader =
+        arrow::csv::StreamingReader::Make(io_context,
+                                          input,
+                                          read_options,
+                                          parse_options,
+                                          convert_options);
+      if (!maybe_reader.ok()) {
+        // Handle StreamingReader instantiation error...
+      }
+      std::shared_ptr<arrow::csv::StreamingReader> reader = *maybe_reader;
+
+      // Set aside a RecordBatch pointer for re-use while streaming
+      std::shared_ptr<RecordBatch> batch;
+
+      // Attempt to read the first RecordBatch
+      reader->ReadNext(&batch);
+
+      if (batch == NULL) {
+        // Handle end of file
+      }
+   }
+
+Behavior of :class:`~arrow::csv::StreamingReader` can be customized using a
+combination of :class:`~arrow::csv::ReadOptrions`,
+:class:`~arrow::csv::ParseOptions`, and :class:`~arrow::csv::ConvertOptions`.
+
+.. _tradeoffs:
+
+Tradeoffs
+---------
+
+The choice between using :class:`~arrow::csv::TableReader` or :class:`~arrow::csv::StreamingReader`
+will depend on your use case but two caveats are worth pointing out:
+
+1. :class:`~arrow::csv::TableReader` is capable of using multiple threads (See
+   :ref:`Performance <performance>`) whereas
+   :class:`~arrow::csv::StreamingReader` is always single-threaded and will
+   ignore :member:`ReadOptions::use_threads`.
+2. :class:`~arrow::csv::StreamingReader` performs type inference off the first
+   block that's read in, after which point the types are frozen. Either set :member:`ReadOptions::block_size` to a large enough value or use :member:`ConvertOptions::column_types` to set the desired data types

Review Comment:
   Done in https://github.com/apache/arrow/pull/14452/commits/dc159736569a2eb8f5f07623df2d950f9f8d66b6.





[GitHub] [arrow] amoeba commented on a diff in pull request #14452: ARROW-15328: [C++][Docs] Streaming CSV reader missing from documentation

Posted by GitBox <gi...@apache.org>.
amoeba commented on code in PR #14452:
URL: https://github.com/apache/arrow/pull/14452#discussion_r1000031167


##########
docs/source/cpp/csv.rst:
##########
@@ -56,19 +67,84 @@ A CSV file is read from a :class:`~arrow::io::InputStream`.
                                       parse_options,
                                       convert_options);
       if (!maybe_reader.ok()) {
-         // Handle TableReader instantiation error...
+        // Handle TableReader instantiation error...
       }
       std::shared_ptr<arrow::csv::TableReader> reader = *maybe_reader;
 
       // Read table from CSV file
       auto maybe_table = reader->Read();
       if (!maybe_table.ok()) {
-         // Handle CSV read error
-         // (for example a CSV syntax error or failed type conversion)
+        // Handle CSV read error
+        // (for example a CSV syntax error or failed type conversion)
       }
       std::shared_ptr<arrow::Table> table = *maybe_table;
    }
 
+StreamingReader
+---------------
+
+.. code-block:: cpp
+
+   #include "arrow/csv/api.h"
+
+   {
+      // ...
+      arrow::io::IOContext io_context = arrow::io::default_io_context();
+      std::shared_ptr<arrow::io::InputStream> input = ...;
+
+      auto read_options = arrow::csv::ReadOptions::Defaults();
+      auto parse_options = arrow::csv::ParseOptions::Defaults();
+      auto convert_options = arrow::csv::ConvertOptions::Defaults();
+
+      // Instantiate StreamingReader from input stream and options
+      auto maybe_reader =
+        arrow::csv::StreamingReader::Make(io_context,
+                                          input,
+                                          read_options,
+                                          parse_options,
+                                          convert_options);
+      if (!maybe_reader.ok()) {
+        // Handle StreamingReader instantiation error...
+      }
+      std::shared_ptr<arrow::csv::StreamingReader> reader = *maybe_reader;
+
+      // Set aside a RecordBatch pointer for re-use while streaming
+      std::shared_ptr<RecordBatch> batch;
+
+      // Attempt to read the first RecordBatch
+      arrow::Status status = reader->ReadNext(&batch);
+
+      if (!status.ok()) {
+        // Handle read error
+      }
+
+      if (batch == NULL) {
+        // Handle end of file
+      }
+   }
+
+.. _cpp-csv-tradeoffs:
+
+Tradeoffs
+---------
+
+The choice between using :class:`~arrow::csv::TableReader` or
+:class:`~arrow::csv::StreamingReader` will depend on your use case but two
+caveats are worth pointing out:
+
+1. :class:`~arrow::csv::TableReader` is capable of using multiple threads (See
+   :ref:`Performance <cpp-csv-performance>`) whereas
+   :class:`~arrow::csv::StreamingReader` is always single-threaded and will
+   ignore :member:`ReadOptions::use_threads`.
+2. :class:`~arrow::csv::StreamingReader` performs type inference off the first
+   block that's read in, after which point the types are frozen. Either set
+   :member:`ReadOptions::block_size` to a large enough value or use
+   :member:`ConvertOptions::column_types` to set the desired data types
+   explicitly.

Review Comment:
   Good point here. I'll update this language.





[GitHub] [arrow] amoeba commented on a diff in pull request #14452: ARROW-15328: [C++][Docs] Streaming CSV reader missing from documentation

Posted by GitBox <gi...@apache.org>.
amoeba commented on code in PR #14452:
URL: https://github.com/apache/arrow/pull/14452#discussion_r1000030204


##########
docs/source/cpp/csv.rst:
##########
@@ -56,19 +67,84 @@ A CSV file is read from a :class:`~arrow::io::InputStream`.
                                       parse_options,
                                       convert_options);
       if (!maybe_reader.ok()) {
-         // Handle TableReader instantiation error...
+        // Handle TableReader instantiation error...
       }
       std::shared_ptr<arrow::csv::TableReader> reader = *maybe_reader;
 
       // Read table from CSV file
       auto maybe_table = reader->Read();
       if (!maybe_table.ok()) {
-         // Handle CSV read error
-         // (for example a CSV syntax error or failed type conversion)
+        // Handle CSV read error
+        // (for example a CSV syntax error or failed type conversion)
       }
       std::shared_ptr<arrow::Table> table = *maybe_table;
    }
 
+StreamingReader
+---------------
+
+.. code-block:: cpp
+
+   #include "arrow/csv/api.h"
+
+   {
+      // ...
+      arrow::io::IOContext io_context = arrow::io::default_io_context();
+      std::shared_ptr<arrow::io::InputStream> input = ...;
+
+      auto read_options = arrow::csv::ReadOptions::Defaults();
+      auto parse_options = arrow::csv::ParseOptions::Defaults();
+      auto convert_options = arrow::csv::ConvertOptions::Defaults();
+
+      // Instantiate StreamingReader from input stream and options
+      auto maybe_reader =
+        arrow::csv::StreamingReader::Make(io_context,
+                                          input,
+                                          read_options,
+                                          parse_options,
+                                          convert_options);
+      if (!maybe_reader.ok()) {
+        // Handle StreamingReader instantiation error...
+      }
+      std::shared_ptr<arrow::csv::StreamingReader> reader = *maybe_reader;
+
+      // Set aside a RecordBatch pointer for re-use while streaming
+      std::shared_ptr<RecordBatch> batch;
+
+      // Attempt to read the first RecordBatch
+      arrow::Status status = reader->ReadNext(&batch);
+
+      if (!status.ok()) {
+        // Handle read error
+      }
+
+      if (batch == NULL) {
+        // Handle end of file
+      }
+   }
+
+.. _cpp-csv-tradeoffs:
+
+Tradeoffs
+---------
+
+The choice between using :class:`~arrow::csv::TableReader` or
+:class:`~arrow::csv::StreamingReader` will depend on your use case but two
+caveats are worth pointing out:
+
+1. :class:`~arrow::csv::TableReader` is capable of using multiple threads (See
+   :ref:`Performance <cpp-csv-performance>`) whereas
+   :class:`~arrow::csv::StreamingReader` is always single-threaded and will
+   ignore :member:`ReadOptions::use_threads`.

Review Comment:
   I based my text here off of https://github.com/amoeba/arrow/blob/9640c3f7db50a1e95d397a8d9ec37ca33d10c733/cpp/src/arrow/csv/reader.h#L64-L71. Should we update that comment too or is this just nuance?
   
   Would better language be something like:
   
   > When reading the entire contents of a CSV, TableReader will tend to be more performant than StreamingReader because it makes better use of available cores.





[GitHub] [arrow] ursabot commented on pull request #14452: ARROW-15328: [C++][Docs] Streaming CSV reader missing from documentation

Posted by GitBox <gi...@apache.org>.
ursabot commented on PR #14452:
URL: https://github.com/apache/arrow/pull/14452#issuecomment-1295741067

   Benchmark runs are scheduled for baseline = f21c92a956df1a775b1110f49de917405957aa9c and contender = c56934b57922a6cbb46eaef097a36ed8d2473467. c56934b57922a6cbb46eaef097a36ed8d2473467 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/818825c2d7af4ebfa86a8f6ac4b41139...e1029ecb456a4d1f8415dfb0c0b40116/)
   [Failed :arrow_down:0.0% :arrow_up:0.0%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/86cf87554f1a4736893db82c7bdfffc4...b3293be4bc554cba93d21372aa1ad60d/)
   [Finished :arrow_down:0.27% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/4818f4d3f66049669d6c13bf69c28193...89865ddad95c4b588311139026ec1a1d/)
   [Finished :arrow_down:0.89% :arrow_up:0.0%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/298142f8c06d48119cc60391e5cd84b3...72eb00ef0abb4ba0abee61a8fa907ee3/)
   Buildkite builds:
   [Finished] [`c56934b5` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/1773)
   [Failed] [`c56934b5` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/1794)
   [Finished] [`c56934b5` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/1760)
   [Finished] [`c56934b5` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/1786)
   [Finished] [`f21c92a9` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/1772)
   [Failed] [`f21c92a9` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/1793)
   [Finished] [`f21c92a9` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/1759)
   [Finished] [`f21c92a9` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/1785)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   




[GitHub] [arrow] pitrou commented on pull request #14452: ARROW-15328: [C++][Docs] Streaming CSV reader missing from documentation

Posted by GitBox <gi...@apache.org>.
pitrou commented on PR #14452:
URL: https://github.com/apache/arrow/pull/14452#issuecomment-1284008530

   @wjones127 Would you like to take a look and review this?




[GitHub] [arrow] amoeba commented on a diff in pull request #14452: ARROW-15328: [C++][Docs] Streaming CSV reader missing from documentation

Posted by GitBox <gi...@apache.org>.
amoeba commented on code in PR #14452:
URL: https://github.com/apache/arrow/pull/14452#discussion_r1005075568


##########
docs/source/cpp/csv.rst:
##########
@@ -56,19 +67,84 @@ A CSV file is read from a :class:`~arrow::io::InputStream`.
                                       parse_options,
                                       convert_options);
       if (!maybe_reader.ok()) {
-         // Handle TableReader instantiation error...
+        // Handle TableReader instantiation error...
       }
       std::shared_ptr<arrow::csv::TableReader> reader = *maybe_reader;
 
       // Read table from CSV file
       auto maybe_table = reader->Read();
       if (!maybe_table.ok()) {
-         // Handle CSV read error
-         // (for example a CSV syntax error or failed type conversion)
+        // Handle CSV read error
+        // (for example a CSV syntax error or failed type conversion)
       }
       std::shared_ptr<arrow::Table> table = *maybe_table;
    }
 
+StreamingReader
+---------------
+
+.. code-block:: cpp
+
+   #include "arrow/csv/api.h"
+
+   {
+      // ...
+      arrow::io::IOContext io_context = arrow::io::default_io_context();
+      std::shared_ptr<arrow::io::InputStream> input = ...;
+
+      auto read_options = arrow::csv::ReadOptions::Defaults();
+      auto parse_options = arrow::csv::ParseOptions::Defaults();
+      auto convert_options = arrow::csv::ConvertOptions::Defaults();
+
+      // Instantiate StreamingReader from input stream and options
+      auto maybe_reader =
+        arrow::csv::StreamingReader::Make(io_context,
+                                          input,
+                                          read_options,
+                                          parse_options,
+                                          convert_options);
+      if (!maybe_reader.ok()) {
+        // Handle StreamingReader instantiation error...
+      }
+      std::shared_ptr<arrow::csv::StreamingReader> reader = *maybe_reader;
+
+      // Set aside a RecordBatch pointer for re-use while streaming
+      std::shared_ptr<RecordBatch> batch;
+
+      // Attempt to read the first RecordBatch
+      arrow::Status status = reader->ReadNext(&batch);
+
+      if (!status.ok()) {
+        // Handle read error
+      }
+
+      if (batch == NULL) {
+        // Handle end of file
+      }
+   }
+
+.. _cpp-csv-tradeoffs:
+
+Tradeoffs
+---------
+
+The choice between using :class:`~arrow::csv::TableReader` or
+:class:`~arrow::csv::StreamingReader` will depend on your use case but two
+caveats are worth pointing out:
+
+1. :class:`~arrow::csv::TableReader` is capable of using multiple threads (See
+   :ref:`Performance <cpp-csv-performance>`) whereas
+   :class:`~arrow::csv::StreamingReader` is always single-threaded and will
+   ignore :member:`ReadOptions::use_threads`.
+2. :class:`~arrow::csv::StreamingReader` performs type inference off the first
+   block that's read in, after which point the types are frozen. Either set
+   :member:`ReadOptions::block_size` to a large enough value or use
+   :member:`ConvertOptions::column_types` to set the desired data types
+   explicitly.

Review Comment:
   Updated in https://github.com/apache/arrow/commit/3d54059caa2f6dc888be670f98da33d526a12ba4 and 039d23f8aa093017e72446d405a5f45b849d2689. I didn't include specific error text but said,
   
   > after which point the types
      are frozen and any data in subsequent blocks that cannot be converted to
      those types will cause an error.
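
To make the quoted behavior concrete, here is a minimal toy model in plain C++ (standard library only). It is not Arrow's implementation and uses no Arrow types: `ColumnType`, `IsInteger`, `InferFromFirstBlock`, and `ConvertBlock` are hypothetical names invented for illustration. It only shows why a type chosen from the first block can make a later block fail to convert.

```cpp
#include <cassert>
#include <cstddef>
#include <exception>
#include <string>
#include <vector>

// Toy stand-in for a column type; hypothetical, not an Arrow type.
enum class ColumnType { kInt64, kString };

// Returns true if the whole string parses as a base-10 integer.
bool IsInteger(const std::string& s) {
  if (s.empty()) return false;
  try {
    std::size_t pos = 0;
    (void)std::stoll(s, &pos);
    return pos == s.size();
  } catch (const std::exception&) {
    return false;  // not a number, or out of range
  }
}

// Infers a type from the first block only: int64 if every value parses
// as an integer, otherwise string. After this call the type is "frozen".
ColumnType InferFromFirstBlock(const std::vector<std::string>& block) {
  assert(!block.empty());  // this toy model needs at least one value
  for (const std::string& value : block) {
    if (!IsInteger(value)) return ColumnType::kString;
  }
  return ColumnType::kInt64;
}

// Converts a later block under the frozen type. Returns false as soon as
// a value cannot be converted, which models the error described above.
bool ConvertBlock(const std::vector<std::string>& block, ColumnType frozen) {
  if (frozen == ColumnType::kString) return true;  // any value fits a string
  for (const std::string& value : block) {
    if (!IsInteger(value)) return false;
  }
  return true;
}
```

In this model, a first block of `{"1", "2", "3"}` freezes the column as an integer, so a later block containing `"n/a"` fails to convert. The same later block converts fine if the type is fixed to string up front, which is the toy analogue of setting the column types explicitly.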





[GitHub] [arrow] amoeba commented on a diff in pull request #14452: ARROW-15328: [C++][Docs] Streaming CSV reader missing from documentation

Posted by GitBox <gi...@apache.org>.
amoeba commented on code in PR #14452:
URL: https://github.com/apache/arrow/pull/14452#discussion_r999980886


##########
docs/source/cpp/csv.rst:
##########
@@ -56,19 +65,88 @@ A CSV file is read from a :class:`~arrow::io::InputStream`.
                                       parse_options,
                                       convert_options);
       if (!maybe_reader.ok()) {
-         // Handle TableReader instantiation error...
+        // Handle TableReader instantiation error...
       }
       std::shared_ptr<arrow::csv::TableReader> reader = *maybe_reader;
 
       // Read table from CSV file
       auto maybe_table = reader->Read();
       if (!maybe_table.ok()) {
-         // Handle CSV read error
-         // (for example a CSV syntax error or failed type conversion)
+        // Handle CSV read error
+        // (for example a CSV syntax error or failed type conversion)
       }
       std::shared_ptr<arrow::Table> table = *maybe_table;
    }
 
+Behavior of :class:`~arrow::csv::TableReader` can be customized using a
+combination of :class:`~arrow::csv::ReadOptions`,
+:class:`~arrow::csv::ParseOptions`, and :class:`~arrow::csv::ConvertOptions`.
+
+StreamingReader
+---------------
+
+The :class:`~arrow::csv::StreamingReader` class requires an
+:class:`::arrow::io::InputStream` instance representing the input file.
+
+.. code-block:: cpp
+
+   #include "arrow/csv/api.h"
+
+   {
+      // ...
+      arrow::io::IOContext io_context = arrow::io::default_io_context();
+      std::shared_ptr<arrow::io::InputStream> input = ...;
+
+      auto read_options = arrow::csv::ReadOptions::Defaults();
+      auto parse_options = arrow::csv::ParseOptions::Defaults();
+      auto convert_options = arrow::csv::ConvertOptions::Defaults();
+
+      // Instantiate StreamingReader from input stream and options
+      auto maybe_reader =
+        arrow::csv::StreamingReader::Make(io_context,
+                                          input,
+                                          read_options,
+                                          parse_options,
+                                          convert_options);
+      if (!maybe_reader.ok()) {
+        // Handle StreamingReader instantiation error...
+      }
+      std::shared_ptr<arrow::csv::StreamingReader> reader = *maybe_reader;
+
+      // Set aside a RecordBatch pointer for re-use while streaming
+      std::shared_ptr<RecordBatch> batch;
+
+      // Attempt to read the first RecordBatch
+      reader->ReadNext(&batch);

Review Comment:
   Good catch, changed to show an explicit block for handling status.
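
For readers following along, the check order can be sketched as a loop: test the returned status first, then test for the null batch that marks end of input. The sketch below is self-contained and does not use Arrow. `Status`, `Batch`, `MockReader`, and `CountRows` are hypothetical stand-ins that only mimic the shape of the `ReadNext(&batch)` call in the diff, and a negative row count is an invented way to simulate a read error.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins; these are NOT the Arrow classes.
struct Status {
  bool ok_flag;
  std::string message;
  bool ok() const { return ok_flag; }
};

struct Batch {
  int num_rows;
};

class MockReader {
 public:
  explicit MockReader(std::vector<int> rows_per_batch)
      : rows_(std::move(rows_per_batch)) {}

  // Sets *out to the next batch, or to null once the input is exhausted.
  // A negative entry simulates a read error (an assumption of this mock).
  Status ReadNext(std::shared_ptr<Batch>* out) {
    if (next_ >= rows_.size()) {
      out->reset();
      return Status{true, ""};
    }
    const int n = rows_[next_++];
    if (n < 0) {
      out->reset();
      return Status{false, "simulated conversion error"};
    }
    *out = std::make_shared<Batch>(Batch{n});
    return Status{true, ""};
  }

 private:
  std::vector<int> rows_;
  std::size_t next_ = 0;
};

// Drains the reader, summing rows. The status is checked before the
// end-of-input test so that an error is never mistaken for a clean end.
Status CountRows(MockReader& reader, long* total) {
  assert(total != nullptr);
  *total = 0;
  std::shared_ptr<Batch> batch;  // re-used across iterations
  while (true) {
    Status status = reader.ReadNext(&batch);
    if (!status.ok()) return status;  // handle read error
    if (batch == nullptr) break;      // handle end of input
    *total += batch->num_rows;
  }
  return Status{true, ""};
}
```

Checking the status before the null test matters in this mock because a failed read also leaves the batch pointer null, so testing only for null would silently treat an error as end of input.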





[GitHub] [arrow] amoeba commented on a diff in pull request #14452: ARROW-15328: [C++][Docs] Streaming CSV reader missing from documentation

Posted by GitBox <gi...@apache.org>.
amoeba commented on code in PR #14452:
URL: https://github.com/apache/arrow/pull/14452#discussion_r999981582


##########
docs/source/cpp/csv.rst:
##########
@@ -25,15 +25,24 @@ Reading and Writing CSV files
 =============================
 
 Arrow provides a fast CSV reader allowing ingestion of external data
-as Arrow tables.
+as Arrow Tables or streamed as Arrow RecordBatches.
 
 .. seealso::
    :ref:`CSV reader/writer API reference <cpp-api-csv>`.
 
-Basic usage
-===========
+Reading CSV files
+=================
+
+Data in a CSV file can either be read in as a single Arrow Table using
+:class:`~arrow::csv::TableReader` or streamed as RecordBatches using
+:class:`~arrow::csv::StreamingReader`. See :ref:`Tradeoffs <tradeoffs>` for a
+discussion of the tradeoffs between the two methods.
+
+TableReader
+-----------
 
-A CSV file is read from a :class:`~arrow::io::InputStream`.
+The :class:`~arrow::csv::TableReader` class requires an
+:class:`::arrow::io::InputStream` instance representing the input file.

Review Comment:
   This is a good improvement, thanks.



##########
docs/source/cpp/csv.rst:
##########
@@ -56,19 +65,88 @@ A CSV file is read from a :class:`~arrow::io::InputStream`.
                                       parse_options,
                                       convert_options);
       if (!maybe_reader.ok()) {
-         // Handle TableReader instantiation error...
+        // Handle TableReader instantiation error...
       }
       std::shared_ptr<arrow::csv::TableReader> reader = *maybe_reader;
 
       // Read table from CSV file
       auto maybe_table = reader->Read();
       if (!maybe_table.ok()) {
-         // Handle CSV read error
-         // (for example a CSV syntax error or failed type conversion)
+        // Handle CSV read error
+        // (for example a CSV syntax error or failed type conversion)
       }
       std::shared_ptr<arrow::Table> table = *maybe_table;
    }
 
+Behavior of :class:`~arrow::csv::TableReader` can be customized using a
+combination of :class:`~arrow::csv::ReadOptions`,
+:class:`~arrow::csv::ParseOptions`, and :class:`~arrow::csv::ConvertOptions`.
+
+StreamingReader
+---------------
+
+The :class:`~arrow::csv::StreamingReader` class requires an
+:class:`::arrow::io::InputStream` instance representing the input file.
+
+.. code-block:: cpp
+
+   #include "arrow/csv/api.h"
+
+   {
+      // ...
+      arrow::io::IOContext io_context = arrow::io::default_io_context();
+      std::shared_ptr<arrow::io::InputStream> input = ...;
+
+      auto read_options = arrow::csv::ReadOptions::Defaults();
+      auto parse_options = arrow::csv::ParseOptions::Defaults();
+      auto convert_options = arrow::csv::ConvertOptions::Defaults();
+
+      // Instantiate StreamingReader from input stream and options
+      auto maybe_reader =
+        arrow::csv::StreamingReader::Make(io_context,
+                                          input,
+                                          read_options,
+                                          parse_options,
+                                          convert_options);
+      if (!maybe_reader.ok()) {
+        // Handle StreamingReader instantiation error...
+      }
+      std::shared_ptr<arrow::csv::StreamingReader> reader = *maybe_reader;
+
+      // Set aside a RecordBatch pointer for re-use while streaming
+      std::shared_ptr<RecordBatch> batch;
+
+      // Attempt to read the first RecordBatch
+      reader->ReadNext(&batch);
+
+      if (batch == NULL) {
+        // Handle end of file
+      }
+   }
+
+Behavior of :class:`~arrow::csv::StreamingReader` can be customized using a
+combination of :class:`~arrow::csv::ReadOptrions`,
+:class:`~arrow::csv::ParseOptions`, and :class:`~arrow::csv::ConvertOptions`.
+
+.. _tradeoffs:

Review Comment:
   Done.





[GitHub] [arrow] pitrou merged pull request #14452: ARROW-15328: [C++][Docs] Streaming CSV reader missing from documentation

Posted by GitBox <gi...@apache.org>.
pitrou merged PR #14452:
URL: https://github.com/apache/arrow/pull/14452




[GitHub] [arrow] amoeba commented on a diff in pull request #14452: ARROW-15328: [C++][Docs] Streaming CSV reader missing from documentation

Posted by GitBox <gi...@apache.org>.
amoeba commented on code in PR #14452:
URL: https://github.com/apache/arrow/pull/14452#discussion_r1000024176


##########
docs/source/cpp/csv.rst:
##########
@@ -25,15 +25,26 @@ Reading and Writing CSV files
 =============================
 
 Arrow provides a fast CSV reader allowing ingestion of external data
-as Arrow tables.
+as Arrow Tables or streamed as Arrow RecordBatches.
 
 .. seealso::
    :ref:`CSV reader/writer API reference <cpp-api-csv>`.
 
-Basic usage
-===========
+Reading CSV files
+=================
+
+Data in a CSV file can either be read in as a single Arrow Table using
+:class:`~arrow::csv::TableReader` or streamed as RecordBatches using
+:class:`~arrow::csv::StreamingReader`. See :ref:`Tradeoffs <cpp-csv-tradeoffs>` for a
+discussion of the tradeoffs between the two methods.
+
+Both these readers require an :class:`arrow::io::InputStream` instance

Review Comment:
   Thanks for the heads up. I'll keep track of that PR.





[GitHub] [arrow] pitrou commented on a diff in pull request #14452: ARROW-15328: [C++][Docs] Streaming CSV reader missing from documentation

Posted by GitBox <gi...@apache.org>.
pitrou commented on code in PR #14452:
URL: https://github.com/apache/arrow/pull/14452#discussion_r1004124908


##########
docs/source/cpp/csv.rst:
##########
@@ -56,19 +67,84 @@ A CSV file is read from a :class:`~arrow::io::InputStream`.
                                       parse_options,
                                       convert_options);
       if (!maybe_reader.ok()) {
-         // Handle TableReader instantiation error...
+        // Handle TableReader instantiation error...
       }
       std::shared_ptr<arrow::csv::TableReader> reader = *maybe_reader;
 
       // Read table from CSV file
       auto maybe_table = reader->Read();
       if (!maybe_table.ok()) {
-         // Handle CSV read error
-         // (for example a CSV syntax error or failed type conversion)
+        // Handle CSV read error
+        // (for example a CSV syntax error or failed type conversion)
       }
       std::shared_ptr<arrow::Table> table = *maybe_table;
    }
 
+StreamingReader
+---------------
+
+.. code-block:: cpp
+
+   #include "arrow/csv/api.h"
+
+   {
+      // ...
+      arrow::io::IOContext io_context = arrow::io::default_io_context();
+      std::shared_ptr<arrow::io::InputStream> input = ...;
+
+      auto read_options = arrow::csv::ReadOptions::Defaults();
+      auto parse_options = arrow::csv::ParseOptions::Defaults();
+      auto convert_options = arrow::csv::ConvertOptions::Defaults();
+
+      // Instantiate StreamingReader from input stream and options
+      auto maybe_reader =
+        arrow::csv::StreamingReader::Make(io_context,
+                                          input,
+                                          read_options,
+                                          parse_options,
+                                          convert_options);
+      if (!maybe_reader.ok()) {
+        // Handle StreamingReader instantiation error...
+      }
+      std::shared_ptr<arrow::csv::StreamingReader> reader = *maybe_reader;
+
+      // Set aside a RecordBatch pointer for re-use while streaming
+      std::shared_ptr<RecordBatch> batch;
+
+      // Attempt to read the first RecordBatch
+      arrow::Status status = reader->ReadNext(&batch);
+
+      if (!status.ok()) {
+        // Handle read error
+      }
+
+      if (batch == NULL) {
+        // Handle end of file
+      }
+   }
+
+.. _cpp-csv-tradeoffs:
+
+Tradeoffs
+---------
+
+The choice between using :class:`~arrow::csv::TableReader` or
+:class:`~arrow::csv::StreamingReader` will depend on your use case but two
+caveats are worth pointing out:
+
+1. :class:`~arrow::csv::TableReader` is capable of using multiple threads (See
+   :ref:`Performance <cpp-csv-performance>`) whereas
+   :class:`~arrow::csv::StreamingReader` is always single-threaded and will
+   ignore :member:`ReadOptions::use_threads`.
+2. :class:`~arrow::csv::StreamingReader` performs type inference off the first
+   block that's read in, after which point the types are frozen. Either set
+   :member:`ReadOptions::block_size` to a large enough value or use
+   :member:`ConvertOptions::column_types` to set the desired data types
+   explicitly.
+

Review Comment:
   > Do you just mean to say they aren't really tradeoffs?
   
   As written, then yes, there aren't any :-)
   
   > Yep. I wasn't sure if that was too obvious and removed it. I'll put it back in.
   
   +1





[GitHub] [arrow] pitrou commented on a diff in pull request #14452: ARROW-15328: [C++][Docs] Streaming CSV reader missing from documentation

Posted by GitBox <gi...@apache.org>.
pitrou commented on code in PR #14452:
URL: https://github.com/apache/arrow/pull/14452#discussion_r1005514775


##########
docs/source/cpp/csv.rst:
##########
@@ -56,19 +67,97 @@ A CSV file is read from a :class:`~arrow::io::InputStream`.
                                       parse_options,
                                       convert_options);
       if (!maybe_reader.ok()) {
-         // Handle TableReader instantiation error...
+        // Handle TableReader instantiation error...
       }
       std::shared_ptr<arrow::csv::TableReader> reader = *maybe_reader;
 
       // Read table from CSV file
       auto maybe_table = reader->Read();
       if (!maybe_table.ok()) {
-         // Handle CSV read error
-         // (for example a CSV syntax error or failed type conversion)
+        // Handle CSV read error
+        // (for example a CSV syntax error or failed type conversion)
       }
       std::shared_ptr<arrow::Table> table = *maybe_table;
    }
 
+StreamingReader
+---------------
+
+.. code-block:: cpp
+
+   #include "arrow/csv/api.h"
+
+   {
+      // ...
+      arrow::io::IOContext io_context = arrow::io::default_io_context();
+      std::shared_ptr<arrow::io::InputStream> input = ...;
+
+      auto read_options = arrow::csv::ReadOptions::Defaults();
+      auto parse_options = arrow::csv::ParseOptions::Defaults();
+      auto convert_options = arrow::csv::ConvertOptions::Defaults();
+
+      // Instantiate StreamingReader from input stream and options
+      auto maybe_reader =
+        arrow::csv::StreamingReader::Make(io_context,
+                                          input,
+                                          read_options,
+                                          parse_options,
+                                          convert_options);
+      if (!maybe_reader.ok()) {
+        // Handle StreamingReader instantiation error...
+      }
+      std::shared_ptr<arrow::csv::StreamingReader> reader = *maybe_reader;
+
+      // Set aside a RecordBatch pointer for re-use while streaming
+      std::shared_ptr<arrow::RecordBatch> batch;
+
+      while (true) {
+          // Attempt to read the next RecordBatch
+          arrow::Status status = reader->ReadNext(&batch);
+
+          if (!status.ok()) {
+            // Handle read error
+          }
+
+          if (batch == nullptr) {
+            // Handle end of file
+            break;
+          }
+
+          // Do something with the batch
+      }
+   }
+
+.. _cpp-csv-tradeoffs:
+
+Tradeoffs
+---------
+
+The choice between using :class:`~arrow::csv::TableReader` or
+:class:`~arrow::csv::StreamingReader` will ultimately depend on the use case
+but there are a few tradeoffs to be aware of:
+
+1. **Memory usage:** :class:`~arrow::csv::TableReader` loads all of the data
+   into memory at once and, depending on the amount of data, may require
+   considerably more memory than :class:`~arrow::csv::StreamingReader` which
+   only loads one :class:`~arrow::RecordBatch` at a time. This is likely to be
+   the most significant tradeoff for users.
+3. **Speed:** When reading the entire contents of a CSV,
+   :class:`~arrow::csv::TableReader` will tend to be faster than
+   :class:`~arrow::csv::StreamingReader` because it makes better use of
+   available cores.

Review Comment:
   Link to the `cpp-csv-performance` reference here somehow?
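   
   One possible wording, sketched only as a suggestion (the added sentence is not from the PR; it assumes the `cpp-csv-performance` label is the one attached to the Performance section, and that this item ends up numbered 2):
   ```rest
   2. **Speed:** When reading the entire contents of a CSV,
      :class:`~arrow::csv::TableReader` will tend to be faster than
      :class:`~arrow::csv::StreamingReader` because it makes better use of
      available cores. See :ref:`Performance <cpp-csv-performance>` for
      details on multi-threading.
   ```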





[GitHub] [arrow] amoeba commented on a diff in pull request #14452: ARROW-15328: [C++][Docs] Streaming CSV reader missing from documentation

Posted by GitBox <gi...@apache.org>.
amoeba commented on code in PR #14452:
URL: https://github.com/apache/arrow/pull/14452#discussion_r1005064385


##########
docs/source/cpp/csv.rst:
##########
@@ -56,19 +67,84 @@ A CSV file is read from a :class:`~arrow::io::InputStream`.
                                       parse_options,
                                       convert_options);
       if (!maybe_reader.ok()) {
-         // Handle TableReader instantiation error...
+        // Handle TableReader instantiation error...
       }
       std::shared_ptr<arrow::csv::TableReader> reader = *maybe_reader;
 
       // Read table from CSV file
       auto maybe_table = reader->Read();
       if (!maybe_table.ok()) {
-         // Handle CSV read error
-         // (for example a CSV syntax error or failed type conversion)
+        // Handle CSV read error
+        // (for example a CSV syntax error or failed type conversion)
       }
       std::shared_ptr<arrow::Table> table = *maybe_table;
    }
 
+StreamingReader
+---------------
+
+.. code-block:: cpp
+
+   #include "arrow/csv/api.h"
+
+   {
+      // ...
+      arrow::io::IOContext io_context = arrow::io::default_io_context();
+      std::shared_ptr<arrow::io::InputStream> input = ...;
+
+      auto read_options = arrow::csv::ReadOptions::Defaults();
+      auto parse_options = arrow::csv::ParseOptions::Defaults();
+      auto convert_options = arrow::csv::ConvertOptions::Defaults();
+
+      // Instantiate StreamingReader from input stream and options
+      auto maybe_reader =
+        arrow::csv::StreamingReader::Make(io_context,
+                                          input,
+                                          read_options,
+                                          parse_options,
+                                          convert_options);
+      if (!maybe_reader.ok()) {
+        // Handle StreamingReader instantiation error...
+      }
+      std::shared_ptr<arrow::csv::StreamingReader> reader = *maybe_reader;
+
+      // Set aside a RecordBatch pointer for re-use while streaming
+      std::shared_ptr<RecordBatch> batch;
+
+      // Attempt to read the first RecordBatch
+      arrow::Status status = reader->ReadNext(&batch);
+
+      if (!status.ok()) {
+        // Handle read error
+      }
+
+      if (batch == NULL) {
+        // Handle end of file
+      }
+   }
+
+.. _cpp-csv-tradeoffs:
+
+Tradeoffs
+---------
+
+The choice between using :class:`~arrow::csv::TableReader` or
+:class:`~arrow::csv::StreamingReader` will depend on your use case but two
+caveats are worth pointing out:
+
+1. :class:`~arrow::csv::TableReader` is capable of using multiple threads (See
+   :ref:`Performance <cpp-csv-performance>`) whereas
+   :class:`~arrow::csv::StreamingReader` is always single-threaded and will
+   ignore :member:`ReadOptions::use_threads`.
+2. :class:`~arrow::csv::StreamingReader` performs type inference off the first
+   block that's read in, after which point the types are frozen. Either set
+   :member:`ReadOptions::block_size` to a large enough value or use
+   :member:`ConvertOptions::column_types` to set the desired data types
+   explicitly.
+

Review Comment:
   I made some substantial changes to the tradeoffs section following the above feedback in 3d54059caa2f6dc888be670f98da33d526a12ba4 and 039d23f8aa093017e72446d405a5f45b849d2689. All three are now actual tradeoffs. I was struggling a bit with the language so feel free to rewrite.
   





[GitHub] [arrow] amoeba commented on a diff in pull request #14452: ARROW-15328: [C++][Docs] Streaming CSV reader missing from documentation

Posted by GitBox <gi...@apache.org>.
amoeba commented on code in PR #14452:
URL: https://github.com/apache/arrow/pull/14452#discussion_r999980886


##########
docs/source/cpp/csv.rst:
##########
@@ -56,19 +65,88 @@ A CSV file is read from a :class:`~arrow::io::InputStream`.
                                       parse_options,
                                       convert_options);
       if (!maybe_reader.ok()) {
-         // Handle TableReader instantiation error...
+        // Handle TableReader instantiation error...
       }
       std::shared_ptr<arrow::csv::TableReader> reader = *maybe_reader;
 
       // Read table from CSV file
       auto maybe_table = reader->Read();
       if (!maybe_table.ok()) {
-         // Handle CSV read error
-         // (for example a CSV syntax error or failed type conversion)
+        // Handle CSV read error
+        // (for example a CSV syntax error or failed type conversion)
       }
       std::shared_ptr<arrow::Table> table = *maybe_table;
    }
 
+Behavior of :class:`~arrow::csv::TableReader` can be customized using a
+combination of :class:`~arrow::csv::ReadOptions`,
+:class:`~arrow::csv::ParseOptions`, and :class:`~arrow::csv::ConvertOptions`.
+
+StreamingReader
+---------------
+
+The :class:`~arrow::csv::StreamingReader` class requires an
+:class:`::arrow::io::InputStream` instance representing the input file.
+
+.. code-block:: cpp
+
+   #include "arrow/csv/api.h"
+
+   {
+      // ...
+      arrow::io::IOContext io_context = arrow::io::default_io_context();
+      std::shared_ptr<arrow::io::InputStream> input = ...;
+
+      auto read_options = arrow::csv::ReadOptions::Defaults();
+      auto parse_options = arrow::csv::ParseOptions::Defaults();
+      auto convert_options = arrow::csv::ConvertOptions::Defaults();
+
+      // Instantiate StreamingReader from input stream and options
+      auto maybe_reader =
+        arrow::csv::StreamingReader::Make(io_context,
+                                          input,
+                                          read_options,
+                                          parse_options,
+                                          convert_options);
+      if (!maybe_reader.ok()) {
+        // Handle StreamingReader instantiation error...
+      }
+      std::shared_ptr<arrow::csv::StreamingReader> reader = *maybe_reader;
+
+      // Set aside a RecordBatch pointer for re-use while streaming
+      std::shared_ptr<RecordBatch> batch;
+
+      // Attempt to read the first RecordBatch
+      reader->ReadNext(&batch);

Review Comment:
   Good catch, changed to show an explicit block for handling status in https://github.com/apache/arrow/pull/14452/commits/dafb8290682e1261e62d5af490e825820450c098.





[GitHub] [arrow] pitrou commented on a diff in pull request #14452: ARROW-15328: [C++][Docs] Streaming CSV reader missing from documentation

Posted by GitBox <gi...@apache.org>.
pitrou commented on code in PR #14452:
URL: https://github.com/apache/arrow/pull/14452#discussion_r999437065


##########
docs/source/cpp/csv.rst:
##########
@@ -25,15 +25,24 @@ Reading and Writing CSV files
 =============================
 
 Arrow provides a fast CSV reader allowing ingestion of external data
-as Arrow tables.
+as Arrow Tables or streamed as Arrow RecordBatches.
 
 .. seealso::
    :ref:`CSV reader/writer API reference <cpp-api-csv>`.
 
-Basic usage
-===========
+Reading CSV files
+=================
+
+Data in a CSV file can either be read in as a single Arrow Table using
+:class:`~arrow::csv::TableReader` or streamed as RecordBatches using
+:class:`~arrow::csv::StreamingReader`. See :ref:`Tradeoffs <tradeoffs>` for a
+discussion of the tradeoffs between the two methods.
+
+TableReader
+-----------
 
-A CSV file is read from a :class:`~arrow::io::InputStream`.
+The :class:`~arrow::csv::TableReader` class requires an
+:class:`::arrow::io::InputStream` instance representing the input file.

Review Comment:
   Since this is true for both readers, why not move this up as a more general sentence in "Reading CSV files"?
   
   For example:
   ```rest
   Both these readers require an :class:`arrow::io::InputStream` instance
   representing the input file. Their behavior can be customized using a
   combination of :class:`~arrow::csv::ReadOptions`,
   :class:`~arrow::csv::ParseOptions`, and :class:`~arrow::csv::ConvertOptions`.
   ```
   



##########
docs/source/cpp/csv.rst:
##########
@@ -56,19 +65,88 @@ A CSV file is read from a :class:`~arrow::io::InputStream`.
                                       parse_options,
                                       convert_options);
       if (!maybe_reader.ok()) {
-         // Handle TableReader instantiation error...
+        // Handle TableReader instantiation error...
       }
       std::shared_ptr<arrow::csv::TableReader> reader = *maybe_reader;
 
       // Read table from CSV file
       auto maybe_table = reader->Read();
       if (!maybe_table.ok()) {
-         // Handle CSV read error
-         // (for example a CSV syntax error or failed type conversion)
+        // Handle CSV read error
+        // (for example a CSV syntax error or failed type conversion)
       }
       std::shared_ptr<arrow::Table> table = *maybe_table;
    }
 
+Behavior of :class:`~arrow::csv::TableReader` can be customized using a
+combination of :class:`~arrow::csv::ReadOptions`,
+:class:`~arrow::csv::ParseOptions`, and :class:`~arrow::csv::ConvertOptions`.
+
+StreamingReader
+---------------
+
+The :class:`~arrow::csv::StreamingReader` class requires an
+:class:`::arrow::io::InputStream` instance representing the input file.
+
+.. code-block:: cpp
+
+   #include "arrow/csv/api.h"
+
+   {
+      // ...
+      arrow::io::IOContext io_context = arrow::io::default_io_context();
+      std::shared_ptr<arrow::io::InputStream> input = ...;
+
+      auto read_options = arrow::csv::ReadOptions::Defaults();
+      auto parse_options = arrow::csv::ParseOptions::Defaults();
+      auto convert_options = arrow::csv::ConvertOptions::Defaults();
+
+      // Instantiate StreamingReader from input stream and options
+      auto maybe_reader =
+        arrow::csv::StreamingReader::Make(io_context,
+                                          input,
+                                          read_options,
+                                          parse_options,
+                                          convert_options);
+      if (!maybe_reader.ok()) {
+        // Handle StreamingReader instantiation error...
+      }
+      std::shared_ptr<arrow::csv::StreamingReader> reader = *maybe_reader;
+
+      // Set aside a RecordBatch pointer for re-use while streaming
+      std::shared_ptr<RecordBatch> batch;
+
+      // Attempt to read the first RecordBatch
+      reader->ReadNext(&batch);
+
+      if (batch == NULL) {
+        // Handle end of file
+      }
+   }
+
+Behavior of :class:`~arrow::csv::StreamingReader` can be customized using a
+combination of :class:`~arrow::csv::ReadOptions`,
+:class:`~arrow::csv::ParseOptions`, and :class:`~arrow::csv::ConvertOptions`.
+
+.. _tradeoffs:
+
+Tradeoffs
+---------
+
+The choice between using :class:`~arrow::csv::TableReader` or :class:`~arrow::csv::StreamingReader`
+will depend on your use case but two caveats are worth pointing out:
+
+1. :class:`~arrow::csv::TableReader` is capable of using multiple threads (See
+   :ref:`Performance <performance>`) whereas
+   :class:`~arrow::csv::StreamingReader` is always single-threaded and will
+   ignore :member:`ReadOptions::use_threads`.
+2. :class:`~arrow::csv::StreamingReader` performs type inference off the first
+   block that's read in, after which point the types are frozen. Either set :member:`ReadOptions::block_size` to a large enough value or use :member:`ConvertOptions::column_types` to set the desired data types

Review Comment:
   Can you try to limit line length here?



##########
docs/source/cpp/csv.rst:
##########
@@ -275,11 +353,13 @@ Write Options
 The format of written CSV files can be customized via :class:`~arrow::csv::WriteOptions`.
 Currently few options are available; more will be added in future releases.
 
+.. _performance:

Review Comment:
   Same here.



##########
docs/source/cpp/csv.rst:
##########
@@ -56,19 +65,88 @@ A CSV file is read from a :class:`~arrow::io::InputStream`.
                                       parse_options,
                                       convert_options);
       if (!maybe_reader.ok()) {
-         // Handle TableReader instantiation error...
+        // Handle TableReader instantiation error...
       }
       std::shared_ptr<arrow::csv::TableReader> reader = *maybe_reader;
 
       // Read table from CSV file
       auto maybe_table = reader->Read();
       if (!maybe_table.ok()) {
-         // Handle CSV read error
-         // (for example a CSV syntax error or failed type conversion)
+        // Handle CSV read error
+        // (for example a CSV syntax error or failed type conversion)
       }
       std::shared_ptr<arrow::Table> table = *maybe_table;
    }
 
+Behavior of :class:`~arrow::csv::TableReader` can be customized using a
+combination of :class:`~arrow::csv::ReadOptions`,
+:class:`~arrow::csv::ParseOptions`, and :class:`~arrow::csv::ConvertOptions`.
+
+StreamingReader
+---------------
+
+The :class:`~arrow::csv::StreamingReader` class requires an
+:class:`::arrow::io::InputStream` instance representing the input file.
+
+.. code-block:: cpp
+
+   #include "arrow/csv/api.h"
+
+   {
+      // ...
+      arrow::io::IOContext io_context = arrow::io::default_io_context();
+      std::shared_ptr<arrow::io::InputStream> input = ...;
+
+      auto read_options = arrow::csv::ReadOptions::Defaults();
+      auto parse_options = arrow::csv::ParseOptions::Defaults();
+      auto convert_options = arrow::csv::ConvertOptions::Defaults();
+
+      // Instantiate StreamingReader from input stream and options
+      auto maybe_reader =
+        arrow::csv::StreamingReader::Make(io_context,
+                                          input,
+                                          read_options,
+                                          parse_options,
+                                          convert_options);
+      if (!maybe_reader.ok()) {
+        // Handle StreamingReader instantiation error...
+      }
+      std::shared_ptr<arrow::csv::StreamingReader> reader = *maybe_reader;
+
+      // Set aside a RecordBatch pointer for re-use while streaming
+      std::shared_ptr<RecordBatch> batch;
+
+      // Attempt to read the first RecordBatch
+      reader->ReadNext(&batch);
+
+      if (batch == NULL) {
+        // Handle end of file
+      }
+   }
+
+Behavior of :class:`~arrow::csv::StreamingReader` can be customized using a
+combination of :class:`~arrow::csv::ReadOptions`,
+:class:`~arrow::csv::ParseOptions`, and :class:`~arrow::csv::ConvertOptions`.
+
+.. _tradeoffs:

Review Comment:
   These references are global to the entire docs, so should be disambiguated, for example `cpp-csv-tradeoffs` instead of `tradeoffs`.
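   
   A short sketch of what that could look like, assuming the `cpp-csv-` prefix is used for both the label and the cross-reference (the elided section body is left as `...`):
   ```rest
   .. _cpp-csv-tradeoffs:
   
   Tradeoffs
   ---------
   
   ...
   
   See :ref:`Tradeoffs <cpp-csv-tradeoffs>` for a discussion of the tradeoffs
   between the two methods.
   ```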



##########
docs/source/cpp/csv.rst:
##########
@@ -56,19 +65,88 @@ A CSV file is read from a :class:`~arrow::io::InputStream`.
                                       parse_options,
                                       convert_options);
       if (!maybe_reader.ok()) {
-         // Handle TableReader instantiation error...
+        // Handle TableReader instantiation error...
       }
       std::shared_ptr<arrow::csv::TableReader> reader = *maybe_reader;
 
       // Read table from CSV file
       auto maybe_table = reader->Read();
       if (!maybe_table.ok()) {
-         // Handle CSV read error
-         // (for example a CSV syntax error or failed type conversion)
+        // Handle CSV read error
+        // (for example a CSV syntax error or failed type conversion)
       }
       std::shared_ptr<arrow::Table> table = *maybe_table;
    }
 
+Behavior of :class:`~arrow::csv::TableReader` can be customized using a
+combination of :class:`~arrow::csv::ReadOptions`,
+:class:`~arrow::csv::ParseOptions`, and :class:`~arrow::csv::ConvertOptions`.
+
+StreamingReader
+---------------
+
+The :class:`~arrow::csv::StreamingReader` class requires an
+:class:`::arrow::io::InputStream` instance representing the input file.
+
+.. code-block:: cpp
+
+   #include "arrow/csv/api.h"
+
+   {
+      // ...
+      arrow::io::IOContext io_context = arrow::io::default_io_context();
+      std::shared_ptr<arrow::io::InputStream> input = ...;
+
+      auto read_options = arrow::csv::ReadOptions::Defaults();
+      auto parse_options = arrow::csv::ParseOptions::Defaults();
+      auto convert_options = arrow::csv::ConvertOptions::Defaults();
+
+      // Instantiate StreamingReader from input stream and options
+      auto maybe_reader =
+        arrow::csv::StreamingReader::Make(io_context,
+                                          input,
+                                          read_options,
+                                          parse_options,
+                                          convert_options);
+      if (!maybe_reader.ok()) {
+        // Handle StreamingReader instantiation error...
+      }
+      std::shared_ptr<arrow::csv::StreamingReader> reader = *maybe_reader;
+
+      // Set aside a RecordBatch pointer for re-use while streaming
+      std::shared_ptr<RecordBatch> batch;
+
+      // Attempt to read the first RecordBatch
+      reader->ReadNext(&batch);

Review Comment:
   Should not ignore the Status returned here.
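   
   As a rough illustration of the pattern, here is a self-contained sketch. It is not Arrow code: `Status`, `Batch`, `FakeStreamingReader`, and `CountRows` are hypothetical stand-ins that only mimic the `ReadNext(&batch)` out-parameter contract, so the example compiles without Arrow. The point is that the returned status is checked before the batch is touched, and a null batch marks the end of the stream:
   ```cpp
   #include <cassert>
   #include <cstddef>
   #include <memory>
   #include <string>
   #include <utility>
   #include <vector>
   
   // Hypothetical stand-in for arrow::Status.
   class Status {
    public:
     static Status OK() { return Status(true, ""); }
     static Status Invalid(std::string msg) { return Status(false, std::move(msg)); }
     bool ok() const { return ok_; }
     const std::string& message() const { return message_; }
   
    private:
     Status(bool ok, std::string msg) : ok_(ok), message_(std::move(msg)) {}
     bool ok_;
     std::string message_;
   };
   
   // Hypothetical stand-in for arrow::RecordBatch.
   struct Batch {
     int num_rows;
   };
   
   // Hypothetical reader: yields one Batch per entry in rows_per_batch and
   // reports an error when asked for the batch at index fail_at (-1 = never).
   class FakeStreamingReader {
    public:
     FakeStreamingReader(std::vector<int> rows_per_batch, int fail_at)
         : rows_(std::move(rows_per_batch)), fail_at_(fail_at) {}
   
     Status ReadNext(std::shared_ptr<Batch>* out) {
       assert(out != nullptr);
       if (static_cast<int>(next_) == fail_at_) {
         return Status::Invalid("simulated CSV parse error");
       }
       if (next_ >= rows_.size()) {
         out->reset();  // a null batch signals end of stream
         return Status::OK();
       }
       *out = std::make_shared<Batch>(Batch{rows_[next_++]});
       return Status::OK();
     }
   
    private:
     std::vector<int> rows_;
     int fail_at_;
     std::size_t next_ = 0;
   };
   
   // Sums the rows of every batch, stopping at the first error instead of
   // silently ignoring it.
   Status CountRows(FakeStreamingReader& reader, long* total_rows) {
     *total_rows = 0;
     std::shared_ptr<Batch> batch;
     while (true) {
       Status status = reader.ReadNext(&batch);
       if (!status.ok()) {
         return status;  // propagate the error; do not use `batch`
       }
       if (batch == nullptr) {
         break;  // end of stream
       }
       *total_rows += batch->num_rows;
     }
     return Status::OK();
   }
   ```
   With the real reader, the same shape would apply to the `arrow::Status` returned by `ReadNext`; how the error is reported (returned, logged, thrown) is up to the caller.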





[GitHub] [arrow] pitrou commented on a diff in pull request #14452: ARROW-15328: [C++][Docs] Streaming CSV reader missing from documentation

Posted by GitBox <gi...@apache.org>.
pitrou commented on code in PR #14452:
URL: https://github.com/apache/arrow/pull/14452#discussion_r1005514077


##########
docs/source/cpp/csv.rst:
##########
@@ -56,19 +67,97 @@ A CSV file is read from a :class:`~arrow::io::InputStream`.
                                       parse_options,
                                       convert_options);
       if (!maybe_reader.ok()) {
-         // Handle TableReader instantiation error...
+        // Handle TableReader instantiation error...
       }
       std::shared_ptr<arrow::csv::TableReader> reader = *maybe_reader;
 
       // Read table from CSV file
       auto maybe_table = reader->Read();
       if (!maybe_table.ok()) {
-         // Handle CSV read error
-         // (for example a CSV syntax error or failed type conversion)
+        // Handle CSV read error
+        // (for example a CSV syntax error or failed type conversion)
       }
       std::shared_ptr<arrow::Table> table = *maybe_table;
    }
 
+StreamingReader
+---------------
+
+.. code-block:: cpp
+
+   #include "arrow/csv/api.h"
+
+   {
+      // ...
+      arrow::io::IOContext io_context = arrow::io::default_io_context();
+      std::shared_ptr<arrow::io::InputStream> input = ...;
+
+      auto read_options = arrow::csv::ReadOptions::Defaults();
+      auto parse_options = arrow::csv::ParseOptions::Defaults();
+      auto convert_options = arrow::csv::ConvertOptions::Defaults();
+
+      // Instantiate StreamingReader from input stream and options
+      auto maybe_reader =
+        arrow::csv::StreamingReader::Make(io_context,
+                                          input,
+                                          read_options,
+                                          parse_options,
+                                          convert_options);
+      if (!maybe_reader.ok()) {
+        // Handle StreamingReader instantiation error...
+      }
+      std::shared_ptr<arrow::csv::StreamingReader> reader = *maybe_reader;
+
+      // Set aside a RecordBatch pointer for re-use while streaming
+      std::shared_ptr<arrow::RecordBatch> batch;
+
+      while (true) {
+          // Attempt to read the next RecordBatch
+          arrow::Status status = reader->ReadNext(&batch);
+
+          if (!status.ok()) {
+            // Handle read error
+          }
+
+          if (batch == nullptr) {
+            // Handle end of file
+            break;
+          }
+
+          // Do something with the batch
+      }
+   }
+
+.. _cpp-csv-tradeoffs:
+
+Tradeoffs
+---------
+
+The choice between using :class:`~arrow::csv::TableReader` or
+:class:`~arrow::csv::StreamingReader` will ultimately depend on the use case
+but there are a few tradeoffs to be aware of:
+
+1. **Memory usage:** :class:`~arrow::csv::TableReader` loads all of the data
+   into memory at once and, depending on the amount of data, may require
+   considerably more memory than :class:`~arrow::csv::StreamingReader` which
+   only loads one :class:`~arrow::RecordBatch` at a time. This is likely to be
+   the most significant tradeoff for users.
+3. **Speed:** When reading the entire contents of a CSV,

Review Comment:
   Looks like the numbering is off :-)





[GitHub] [arrow] amoeba commented on a diff in pull request #14452: ARROW-15328: [C++][Docs] Streaming CSV reader missing from documentation

Posted by GitBox <gi...@apache.org>.
amoeba commented on code in PR #14452:
URL: https://github.com/apache/arrow/pull/14452#discussion_r1000025482


##########
docs/source/cpp/csv.rst:
##########
@@ -56,19 +67,84 @@ A CSV file is read from a :class:`~arrow::io::InputStream`.
                                       parse_options,
                                       convert_options);
       if (!maybe_reader.ok()) {
-         // Handle TableReader instantiation error...
+        // Handle TableReader instantiation error...
       }
       std::shared_ptr<arrow::csv::TableReader> reader = *maybe_reader;
 
       // Read table from CSV file
       auto maybe_table = reader->Read();
       if (!maybe_table.ok()) {
-         // Handle CSV read error
-         // (for example a CSV syntax error or failed type conversion)
+        // Handle CSV read error
+        // (for example a CSV syntax error or failed type conversion)
       }
       std::shared_ptr<arrow::Table> table = *maybe_table;
    }
 
+StreamingReader
+---------------
+
+.. code-block:: cpp
+
+   #include "arrow/csv/api.h"
+
+   {
+      // ...
+      arrow::io::IOContext io_context = arrow::io::default_io_context();
+      std::shared_ptr<arrow::io::InputStream> input = ...;
+
+      auto read_options = arrow::csv::ReadOptions::Defaults();
+      auto parse_options = arrow::csv::ParseOptions::Defaults();
+      auto convert_options = arrow::csv::ConvertOptions::Defaults();
+
+      // Instantiate StreamingReader from input stream and options
+      auto maybe_reader =
+        arrow::csv::StreamingReader::Make(io_context,
+                                          input,
+                                          read_options,
+                                          parse_options,
+                                          convert_options);
+      if (!maybe_reader.ok()) {
+        // Handle StreamingReader instantiation error...
+      }
+      std::shared_ptr<arrow::csv::StreamingReader> reader = *maybe_reader;
+
+      // Set aside a RecordBatch pointer for re-use while streaming
+      std::shared_ptr<RecordBatch> batch;
+
+      // Attempt to read the first RecordBatch
+      arrow::Status status = reader->ReadNext(&batch);
+
+      if (!status.ok()) {
+        // Handle read error
+      }
+
+      if (batch == NULL) {
+        // Handle end of file
+      }

Review Comment:
   Agreed. Accepted your change in 32dcec6de04b10111ec78413589013bd6646c2a5.





[GitHub] [arrow] github-actions[bot] commented on pull request #14452: ARROW-15328: [C++] [Docs] Streaming CSV reader missing from documentation

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #14452:
URL: https://github.com/apache/arrow/pull/14452#issuecomment-1283087275

   https://issues.apache.org/jira/browse/ARROW-15328




[GitHub] [arrow] amoeba commented on a diff in pull request #14452: ARROW-15328: [C++][Docs] Streaming CSV reader missing from documentation

Posted by GitBox <gi...@apache.org>.
amoeba commented on code in PR #14452:
URL: https://github.com/apache/arrow/pull/14452#discussion_r1000030908


##########
docs/source/cpp/csv.rst:
##########
@@ -56,19 +67,84 @@ A CSV file is read from a :class:`~arrow::io::InputStream`.
                                       parse_options,
                                       convert_options);
       if (!maybe_reader.ok()) {
-         // Handle TableReader instantiation error...
+        // Handle TableReader instantiation error...
       }
       std::shared_ptr<arrow::csv::TableReader> reader = *maybe_reader;
 
       // Read table from CSV file
       auto maybe_table = reader->Read();
       if (!maybe_table.ok()) {
-         // Handle CSV read error
-         // (for example a CSV syntax error or failed type conversion)
+        // Handle CSV read error
+        // (for example a CSV syntax error or failed type conversion)
       }
       std::shared_ptr<arrow::Table> table = *maybe_table;
    }
 
+StreamingReader
+---------------
+
+.. code-block:: cpp
+
+   #include "arrow/csv/api.h"
+
+   {
+      // ...
+      arrow::io::IOContext io_context = arrow::io::default_io_context();
+      std::shared_ptr<arrow::io::InputStream> input = ...;
+
+      auto read_options = arrow::csv::ReadOptions::Defaults();
+      auto parse_options = arrow::csv::ParseOptions::Defaults();
+      auto convert_options = arrow::csv::ConvertOptions::Defaults();
+
+      // Instantiate StreamingReader from input stream and options
+      auto maybe_reader =
+        arrow::csv::StreamingReader::Make(io_context,
+                                          input,
+                                          read_options,
+                                          parse_options,
+                                          convert_options);
+      if (!maybe_reader.ok()) {
+        // Handle StreamingReader instantiation error...
+      }
+      std::shared_ptr<arrow::csv::StreamingReader> reader = *maybe_reader;
+
+      // Set aside a RecordBatch pointer for re-use while streaming
+      std::shared_ptr<RecordBatch> batch;
+
+      // Attempt to read the first RecordBatch
+      arrow::Status status = reader->ReadNext(&batch);
+
+      if (!status.ok()) {
+        // Handle read error
+      }
+
+      if (batch == NULL) {
+        // Handle end of file
+      }
+   }
+
+.. _cpp-csv-tradeoffs:
+
+Tradeoffs
+---------
+
+The choice between using :class:`~arrow::csv::TableReader` or
+:class:`~arrow::csv::StreamingReader` will depend on your use case but two
+caveats are worth pointing out:
+
+1. :class:`~arrow::csv::TableReader` is capable of using multiple threads (See
+   :ref:`Performance <cpp-csv-performance>`) whereas
+   :class:`~arrow::csv::StreamingReader` is always single-threaded and will
+   ignore :member:`ReadOptions::use_threads`.
+2. :class:`~arrow::csv::StreamingReader` performs type inference off the first
+   block that's read in, after which point the types are frozen. Either set
+   :member:`ReadOptions::block_size` to a large enough value or use
+   :member:`ConvertOptions::column_types` to set the desired data types
+   explicitly.
+

Review Comment:
   > These are all cons :)
   
   Can you elaborate on this? Do you just mean to say they aren't really tradeoffs?
   
   > The table reader requires loading all of the data into memory. This is a pretty significant tradeoff we should point out here.
   
   Yep. I wasn't sure if that was too obvious and removed it. I'll put it back in.
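
The memory tradeoff mentioned in this comment follows from how a streaming reader is driven: the caller pulls one batch at a time and can drop it before asking for the next. The sketch below illustrates that loop under stated assumptions. `ToyBatch`, `ToyStreamingReader`, and `CountRowsStreaming` are hypothetical stand-ins written for this note, not Arrow classes; they only mimic the convention visible in the quoted hunk, where `ReadNext` fills an output pointer and leaves it null at end of input.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a record batch: just a few rows of text.
struct ToyBatch {
  std::vector<std::string> rows;
};

// Hypothetical stand-in for a streaming reader. It fills `out` with the next
// batch and resets it to nullptr once the input is exhausted. It returns
// false on a (simulated) read error.
class ToyStreamingReader {
 public:
  ToyStreamingReader(std::vector<std::string> rows, std::size_t batch_size)
      : rows_(std::move(rows)), batch_size_(batch_size) {}

  bool ReadNext(std::shared_ptr<ToyBatch>* out) {
    if (batch_size_ == 0) return false;  // simulated error: invalid option
    if (pos_ >= rows_.size()) {
      out->reset();  // end of input
      return true;
    }
    auto batch = std::make_shared<ToyBatch>();
    while (pos_ < rows_.size() && batch->rows.size() < batch_size_) {
      batch->rows.push_back(rows_[pos_++]);
    }
    *out = std::move(batch);
    return true;
  }

 private:
  std::vector<std::string> rows_;
  std::size_t batch_size_;
  std::size_t pos_ = 0;
};

// Drains the reader while keeping only one batch alive at a time.
// Returns the total number of rows seen, or -1 on a read error.
long CountRowsStreaming(ToyStreamingReader& reader) {
  long total = 0;
  std::shared_ptr<ToyBatch> batch;  // re-used on every iteration
  while (true) {
    if (!reader.ReadNext(&batch)) return -1;  // read error
    if (batch == nullptr) break;              // end of input
    total += static_cast<long>(batch->rows.size());
  }
  return total;
}
```

Because `batch` is overwritten on each iteration, the previous batch is released as soon as the next one arrives, so peak memory in this sketch is bounded by the batch size rather than by the size of the input.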





[GitHub] [arrow] amoeba commented on a diff in pull request #14452: ARROW-15328: [C++][Docs] Streaming CSV reader missing from documentation

Posted by GitBox <gi...@apache.org>.
amoeba commented on code in PR #14452:
URL: https://github.com/apache/arrow/pull/14452#discussion_r999981661


##########
docs/source/cpp/csv.rst:
##########
@@ -56,19 +65,88 @@ A CSV file is read from a :class:`~arrow::io::InputStream`.
                                       parse_options,
                                       convert_options);
       if (!maybe_reader.ok()) {
-         // Handle TableReader instantiation error...
+        // Handle TableReader instantiation error...
       }
       std::shared_ptr<arrow::csv::TableReader> reader = *maybe_reader;
 
       // Read table from CSV file
       auto maybe_table = reader->Read();
       if (!maybe_table.ok()) {
-         // Handle CSV read error
-         // (for example a CSV syntax error or failed type conversion)
+        // Handle CSV read error
+        // (for example a CSV syntax error or failed type conversion)
       }
       std::shared_ptr<arrow::Table> table = *maybe_table;
    }
 
+Behavior of :class:`~arrow::csv::TableReader` can be customized using a
+combination of :class:`~arrow::csv::ReadOptions`,
+:class:`~arrow::csv::ParseOptions`, and :class:`~arrow::csv::ConvertOptions`.
+
+StreamingReader
+---------------
+
+The :class:`~arrow::csv::StreamingReader` class requires an
+:class:`::arrow::io::InputStream` instance representing the input file.
+
+.. code-block:: cpp
+
+   #include "arrow/csv/api.h"
+
+   {
+      // ...
+      arrow::io::IOContext io_context = arrow::io::default_io_context();
+      std::shared_ptr<arrow::io::InputStream> input = ...;
+
+      auto read_options = arrow::csv::ReadOptions::Defaults();
+      auto parse_options = arrow::csv::ParseOptions::Defaults();
+      auto convert_options = arrow::csv::ConvertOptions::Defaults();
+
+      // Instantiate StreamingReader from input stream and options
+      auto maybe_reader =
+        arrow::csv::StreamingReader::Make(io_context,
+                                          input,
+                                          read_options,
+                                          parse_options,
+                                          convert_options);
+      if (!maybe_reader.ok()) {
+        // Handle StreamingReader instantiation error...
+      }
+      std::shared_ptr<arrow::csv::StreamingReader> reader = *maybe_reader;
+
+      // Set aside a RecordBatch pointer for re-use while streaming
+      std::shared_ptr<RecordBatch> batch;
+
+      // Attempt to read the first RecordBatch
+      reader->ReadNext(&batch);
+
+      if (batch == NULL) {
+        // Handle end of file
+      }
+   }
+
+Behavior of :class:`~arrow::csv::StreamingReader` can be customized using a
+combination of :class:`~arrow::csv::ReadOptions`,
+:class:`~arrow::csv::ParseOptions`, and :class:`~arrow::csv::ConvertOptions`.
+
+.. _tradeoffs:

Review Comment:
   Done in https://github.com/apache/arrow/pull/14452/commits/d3632aa444b8e96a3c620e94f0269a096ce72a6b.



##########
docs/source/cpp/csv.rst:
##########
@@ -275,11 +353,13 @@ Write Options
 The format of written CSV files can be customized via :class:`~arrow::csv::WriteOptions`.
 Currently few options are available; more will be added in future releases.
 
+.. _performance:

Review Comment:
   Done in https://github.com/apache/arrow/pull/14452/commits/d3632aa444b8e96a3c620e94f0269a096ce72a6b.





[GitHub] [arrow] amoeba commented on pull request #14452: ARROW-15328: [C++][Docs] Streaming CSV reader missing from documentation

Posted by GitBox <gi...@apache.org>.
amoeba commented on PR #14452:
URL: https://github.com/apache/arrow/pull/14452#issuecomment-1284633523

   Thanks so much @pitrou for taking a look. I made all your suggested changes in atomic commits.




[GitHub] [arrow] pitrou commented on a diff in pull request #14452: ARROW-15328: [C++][Docs] Streaming CSV reader missing from documentation

Posted by GitBox <gi...@apache.org>.
pitrou commented on code in PR #14452:
URL: https://github.com/apache/arrow/pull/14452#discussion_r1004124288


##########
docs/source/cpp/csv.rst:
##########
@@ -56,19 +67,84 @@ A CSV file is read from a :class:`~arrow::io::InputStream`.
                                       parse_options,
                                       convert_options);
       if (!maybe_reader.ok()) {
-         // Handle TableReader instantiation error...
+        // Handle TableReader instantiation error...
       }
       std::shared_ptr<arrow::csv::TableReader> reader = *maybe_reader;
 
       // Read table from CSV file
       auto maybe_table = reader->Read();
       if (!maybe_table.ok()) {
-         // Handle CSV read error
-         // (for example a CSV syntax error or failed type conversion)
+        // Handle CSV read error
+        // (for example a CSV syntax error or failed type conversion)
       }
       std::shared_ptr<arrow::Table> table = *maybe_table;
    }
 
+StreamingReader
+---------------
+
+.. code-block:: cpp
+
+   #include "arrow/csv/api.h"
+
+   {
+      // ...
+      arrow::io::IOContext io_context = arrow::io::default_io_context();
+      std::shared_ptr<arrow::io::InputStream> input = ...;
+
+      auto read_options = arrow::csv::ReadOptions::Defaults();
+      auto parse_options = arrow::csv::ParseOptions::Defaults();
+      auto convert_options = arrow::csv::ConvertOptions::Defaults();
+
+      // Instantiate StreamingReader from input stream and options
+      auto maybe_reader =
+        arrow::csv::StreamingReader::Make(io_context,
+                                          input,
+                                          read_options,
+                                          parse_options,
+                                          convert_options);
+      if (!maybe_reader.ok()) {
+        // Handle StreamingReader instantiation error...
+      }
+      std::shared_ptr<arrow::csv::StreamingReader> reader = *maybe_reader;
+
+      // Set aside a RecordBatch pointer for re-use while streaming
+      std::shared_ptr<RecordBatch> batch;
+
+      // Attempt to read the first RecordBatch
+      arrow::Status status = reader->ReadNext(&batch);
+
+      if (!status.ok()) {
+        // Handle read error
+      }
+
+      if (batch == NULL) {
+        // Handle end of file
+      }
+   }
+
+.. _cpp-csv-tradeoffs:
+
+Tradeoffs
+---------
+
+The choice between using :class:`~arrow::csv::TableReader` or
+:class:`~arrow::csv::StreamingReader` will depend on your use case but two
+caveats are worth pointing out:
+
+1. :class:`~arrow::csv::TableReader` is capable of using multiple threads (See
+   :ref:`Performance <cpp-csv-performance>`) whereas
+   :class:`~arrow::csv::StreamingReader` is always single-threaded and will
+   ignore :member:`ReadOptions::use_threads`.

Review Comment:
   @amoeba Yes, that sounds good to me.





[GitHub] [arrow] amoeba commented on pull request #14452: ARROW-15328: [C++][Docs] Streaming CSV reader missing from documentation

Posted by GitBox <gi...@apache.org>.
amoeba commented on PR #14452:
URL: https://github.com/apache/arrow/pull/14452#issuecomment-1291262689

   Thanks for the feedback so far @westonpace and @pitrou. This is ready for another look when/if you have time, specifically the section on Tradeoffs ([link to diff](https://github.com/apache/arrow/pull/14452/files#diff-2aabd416367bc0f2211ad7cca469bfdf0a7a5ec615f19b7db5bc5d77b4024a0fR133)).




[GitHub] [arrow] amoeba commented on a diff in pull request #14452: ARROW-15328: [C++][Docs] Streaming CSV reader missing from documentation

Posted by GitBox <gi...@apache.org>.
amoeba commented on code in PR #14452:
URL: https://github.com/apache/arrow/pull/14452#discussion_r999981695


##########
docs/source/cpp/csv.rst:
##########
@@ -275,11 +353,13 @@ Write Options
 The format of written CSV files can be customized via :class:`~arrow::csv::WriteOptions`.
 Currently few options are available; more will be added in future releases.
 
+.. _performance:

Review Comment:
   Done.



##########
docs/source/cpp/csv.rst:
##########
@@ -56,19 +65,88 @@ A CSV file is read from a :class:`~arrow::io::InputStream`.
                                       parse_options,
                                       convert_options);
       if (!maybe_reader.ok()) {
-         // Handle TableReader instantiation error...
+        // Handle TableReader instantiation error...
       }
       std::shared_ptr<arrow::csv::TableReader> reader = *maybe_reader;
 
       // Read table from CSV file
       auto maybe_table = reader->Read();
       if (!maybe_table.ok()) {
-         // Handle CSV read error
-         // (for example a CSV syntax error or failed type conversion)
+        // Handle CSV read error
+        // (for example a CSV syntax error or failed type conversion)
       }
       std::shared_ptr<arrow::Table> table = *maybe_table;
    }
 
+Behavior of :class:`~arrow::csv::TableReader` can be customized using a
+combination of :class:`~arrow::csv::ReadOptions`,
+:class:`~arrow::csv::ParseOptions`, and :class:`~arrow::csv::ConvertOptions`.
+
+StreamingReader
+---------------
+
+The :class:`~arrow::csv::StreamingReader` class requires an
+:class:`::arrow::io::InputStream` instance representing the input file.
+
+.. code-block:: cpp
+
+   #include "arrow/csv/api.h"
+
+   {
+      // ...
+      arrow::io::IOContext io_context = arrow::io::default_io_context();
+      std::shared_ptr<arrow::io::InputStream> input = ...;
+
+      auto read_options = arrow::csv::ReadOptions::Defaults();
+      auto parse_options = arrow::csv::ParseOptions::Defaults();
+      auto convert_options = arrow::csv::ConvertOptions::Defaults();
+
+      // Instantiate StreamingReader from input stream and options
+      auto maybe_reader =
+        arrow::csv::StreamingReader::Make(io_context,
+                                          input,
+                                          read_options,
+                                          parse_options,
+                                          convert_options);
+      if (!maybe_reader.ok()) {
+        // Handle StreamingReader instantiation error...
+      }
+      std::shared_ptr<arrow::csv::StreamingReader> reader = *maybe_reader;
+
+      // Set aside a RecordBatch pointer for re-use while streaming
+      std::shared_ptr<RecordBatch> batch;
+
+      // Attempt to read the first RecordBatch
+      reader->ReadNext(&batch);
+
+      if (batch == NULL) {
+        // Handle end of file
+      }
+   }
+
+Behavior of :class:`~arrow::csv::StreamingReader` can be customized using a
+combination of :class:`~arrow::csv::ReadOptions`,
+:class:`~arrow::csv::ParseOptions`, and :class:`~arrow::csv::ConvertOptions`.
+
+.. _tradeoffs:
+
+Tradeoffs
+---------
+
+The choice between using :class:`~arrow::csv::TableReader` or :class:`~arrow::csv::StreamingReader`
+will depend on your use case but two caveats are worth pointing out:
+
+1. :class:`~arrow::csv::TableReader` is capable of using multiple threads (See
+   :ref:`Performance <performance>`) whereas
+   :class:`~arrow::csv::StreamingReader` is always single-threaded and will
+   ignore :member:`ReadOptions::use_threads`.
+2. :class:`~arrow::csv::StreamingReader` performs type inference off the first
+   block that's read in, after which point the types are frozen. Either set :member:`ReadOptions::block_size` to a large enough value or use :member:`ConvertOptions::column_types` to set the desired data types

Review Comment:
   Done.





[GitHub] [arrow] amoeba commented on a diff in pull request #14452: ARROW-15328: [C++][Docs] Streaming CSV reader missing from documentation

Posted by GitBox <gi...@apache.org>.
amoeba commented on code in PR #14452:
URL: https://github.com/apache/arrow/pull/14452#discussion_r999981582


##########
docs/source/cpp/csv.rst:
##########
@@ -25,15 +25,24 @@ Reading and Writing CSV files
 =============================
 
 Arrow provides a fast CSV reader allowing ingestion of external data
-as Arrow tables.
+as Arrow Tables or streamed as Arrow RecordBatches.
 
 .. seealso::
    :ref:`CSV reader/writer API reference <cpp-api-csv>`.
 
-Basic usage
-===========
+Reading CSV files
+=================
+
+Data in a CSV file can either be read in as a single Arrow Table using
+:class:`~arrow::csv::TableReader` or streamed as RecordBatches using
+:class:`~arrow::csv::StreamingReader`. See :ref:`Tradeoffs <tradeoffs>` for a
+discussion of the tradeoffs between the two methods.
+
+TableReader
+-----------
 
-A CSV file is read from a :class:`~arrow::io::InputStream`.
+The :class:`~arrow::csv::TableReader` class requires an
+:class:`::arrow::io::InputStream` instance representing the input file.

Review Comment:
   This is a good improvement, thanks. Done in https://github.com/apache/arrow/pull/14452/commits/fd4c4190fc498023e8a2ccf92a8e35af6ccc664f.





[GitHub] [arrow] amoeba commented on a diff in pull request #14452: ARROW-15328: [C++][Docs] Streaming CSV reader missing from documentation

Posted by GitBox <gi...@apache.org>.
amoeba commented on code in PR #14452:
URL: https://github.com/apache/arrow/pull/14452#discussion_r1004868424


##########
docs/source/cpp/csv.rst:
##########
@@ -56,19 +67,84 @@ A CSV file is read from a :class:`~arrow::io::InputStream`.
                                       parse_options,
                                       convert_options);
       if (!maybe_reader.ok()) {
-         // Handle TableReader instantiation error...
+        // Handle TableReader instantiation error...
       }
       std::shared_ptr<arrow::csv::TableReader> reader = *maybe_reader;
 
       // Read table from CSV file
       auto maybe_table = reader->Read();
       if (!maybe_table.ok()) {
-         // Handle CSV read error
-         // (for example a CSV syntax error or failed type conversion)
+        // Handle CSV read error
+        // (for example a CSV syntax error or failed type conversion)
       }
       std::shared_ptr<arrow::Table> table = *maybe_table;
    }
 
+StreamingReader
+---------------
+
+.. code-block:: cpp
+
+   #include "arrow/csv/api.h"
+
+   {
+      // ...
+      arrow::io::IOContext io_context = arrow::io::default_io_context();
+      std::shared_ptr<arrow::io::InputStream> input = ...;
+
+      auto read_options = arrow::csv::ReadOptions::Defaults();
+      auto parse_options = arrow::csv::ParseOptions::Defaults();
+      auto convert_options = arrow::csv::ConvertOptions::Defaults();
+
+      // Instantiate StreamingReader from input stream and options
+      auto maybe_reader =
+        arrow::csv::StreamingReader::Make(io_context,
+                                          input,
+                                          read_options,
+                                          parse_options,
+                                          convert_options);
+      if (!maybe_reader.ok()) {
+        // Handle StreamingReader instantiation error...
+      }
+      std::shared_ptr<arrow::csv::StreamingReader> reader = *maybe_reader;
+
+      // Set aside a RecordBatch pointer for re-use while streaming
+      std::shared_ptr<RecordBatch> batch;
+
+      // Attempt to read the first RecordBatch
+      arrow::Status status = reader->ReadNext(&batch);
+
+      if (!status.ok()) {
+        // Handle read error
+      }
+
+      if (batch == NULL) {
+        // Handle end of file
+      }
+   }
+
+.. _cpp-csv-tradeoffs:
+
+Tradeoffs
+---------
+
+The choice between using :class:`~arrow::csv::TableReader` or
+:class:`~arrow::csv::StreamingReader` will depend on your use case but two
+caveats are worth pointing out:
+
+1. :class:`~arrow::csv::TableReader` is capable of using multiple threads (See
+   :ref:`Performance <cpp-csv-performance>`) whereas
+   :class:`~arrow::csv::StreamingReader` is always single-threaded and will
+   ignore :member:`ReadOptions::use_threads`.

Review Comment:
   Thanks, changed to the wording above in 0b04652b5f991e4480eac1be3c16274d48d6de94.





[GitHub] [arrow] amoeba commented on a diff in pull request #14452: ARROW-15328: [C++][Docs] Streaming CSV reader missing from documentation

Posted by GitBox <gi...@apache.org>.
amoeba commented on code in PR #14452:
URL: https://github.com/apache/arrow/pull/14452#discussion_r1005064385


##########
docs/source/cpp/csv.rst:
##########
@@ -56,19 +67,84 @@ A CSV file is read from a :class:`~arrow::io::InputStream`.
                                       parse_options,
                                       convert_options);
       if (!maybe_reader.ok()) {
-         // Handle TableReader instantiation error...
+        // Handle TableReader instantiation error...
       }
       std::shared_ptr<arrow::csv::TableReader> reader = *maybe_reader;
 
       // Read table from CSV file
       auto maybe_table = reader->Read();
       if (!maybe_table.ok()) {
-         // Handle CSV read error
-         // (for example a CSV syntax error or failed type conversion)
+        // Handle CSV read error
+        // (for example a CSV syntax error or failed type conversion)
       }
       std::shared_ptr<arrow::Table> table = *maybe_table;
    }
 
+StreamingReader
+---------------
+
+.. code-block:: cpp
+
+   #include "arrow/csv/api.h"
+
+   {
+      // ...
+      arrow::io::IOContext io_context = arrow::io::default_io_context();
+      std::shared_ptr<arrow::io::InputStream> input = ...;
+
+      auto read_options = arrow::csv::ReadOptions::Defaults();
+      auto parse_options = arrow::csv::ParseOptions::Defaults();
+      auto convert_options = arrow::csv::ConvertOptions::Defaults();
+
+      // Instantiate StreamingReader from input stream and options
+      auto maybe_reader =
+        arrow::csv::StreamingReader::Make(io_context,
+                                          input,
+                                          read_options,
+                                          parse_options,
+                                          convert_options);
+      if (!maybe_reader.ok()) {
+        // Handle StreamingReader instantiation error...
+      }
+      std::shared_ptr<arrow::csv::StreamingReader> reader = *maybe_reader;
+
+      // Set aside a RecordBatch pointer for re-use while streaming
+      std::shared_ptr<RecordBatch> batch;
+
+      // Attempt to read the first RecordBatch
+      arrow::Status status = reader->ReadNext(&batch);
+
+      if (!status.ok()) {
+        // Handle read error
+      }
+
+      if (batch == NULL) {
+        // Handle end of file
+      }
+   }
+
+.. _cpp-csv-tradeoffs:
+
+Tradeoffs
+---------
+
+The choice between using :class:`~arrow::csv::TableReader` or
+:class:`~arrow::csv::StreamingReader` will depend on your use case but two
+caveats are worth pointing out:
+
+1. :class:`~arrow::csv::TableReader` is capable of using multiple threads (See
+   :ref:`Performance <cpp-csv-performance>`) whereas
+   :class:`~arrow::csv::StreamingReader` is always single-threaded and will
+   ignore :member:`ReadOptions::use_threads`.
+2. :class:`~arrow::csv::StreamingReader` performs type inference off the first
+   block that's read in, after which point the types are frozen. Either set
+   :member:`ReadOptions::block_size` to a large enough value or use
+   :member:`ConvertOptions::column_types` to set the desired data types
+   explicitly.
+

Review Comment:
   Following the above feedback, I made some substantial changes to the tradeoffs section in 3d54059caa2f6dc888be670f98da33d526a12ba4. All three are now actual tradeoffs. I struggled a bit with the language, so feel free to rewrite.
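
Item 2 of the quoted hunk says that types are inferred from the first block and then frozen. The toy model below is a sketch of why that matters, not Arrow's inference code. `ToyType`, `LooksLikeInt`, `InferFromFirstBlock`, and `FirstConversionFailure` are hypothetical names introduced for this note, and the block size here counts values, whereas `ReadOptions::block_size` is a size in bytes.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Toy column types; real inference covers many more.
enum class ToyType { kInt64, kString };

// True if `s` is a non-empty run of decimal digits.
bool LooksLikeInt(const std::string& s) {
  if (s.empty()) return false;
  for (unsigned char c : s) {
    if (!std::isdigit(c)) return false;
  }
  return true;
}

// Infers the column type from the first `block_size` values only.
ToyType InferFromFirstBlock(const std::vector<std::string>& values,
                            std::size_t block_size) {
  for (std::size_t i = 0; i < values.size() && i < block_size; ++i) {
    if (!LooksLikeInt(values[i])) return ToyType::kString;
  }
  return ToyType::kInt64;
}

// Converts every value under the frozen type. Returns the index of the first
// value that cannot be converted, or -1 if the whole column converts.
long FirstConversionFailure(const std::vector<std::string>& values,
                            ToyType frozen_type) {
  if (frozen_type == ToyType::kString) return -1;  // any text is a valid string
  for (std::size_t i = 0; i < values.size(); ++i) {
    if (!LooksLikeInt(values[i])) return static_cast<long>(i);
  }
  return -1;
}
```

For the column `{"1", "2", "3", "n/a"}`, a first block of two values freezes the type as integer and conversion fails at index 3. Widening the block to four values (the analogue of a larger `block_size`) or passing `ToyType::kString` directly (the analogue of `column_types`) lets the whole column convert.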
   





[GitHub] [arrow] westonpace commented on a diff in pull request #14452: ARROW-15328: [C++][Docs] Streaming CSV reader missing from documentation

Posted by GitBox <gi...@apache.org>.
westonpace commented on code in PR #14452:
URL: https://github.com/apache/arrow/pull/14452#discussion_r1000008261


##########
docs/source/cpp/csv.rst:
##########
@@ -25,15 +25,26 @@ Reading and Writing CSV files
 =============================
 
 Arrow provides a fast CSV reader allowing ingestion of external data
-as Arrow tables.
+as Arrow Tables or streamed as Arrow RecordBatches.

Review Comment:
   ```suggestion
   to create Arrow Tables or a stream of Arrow RecordBatches.
   ```
   
   This sentence feels slightly off grammatically to me. It sounds like the input data is tables or batches, but really that is the output.



##########
docs/source/cpp/csv.rst:
##########
@@ -56,19 +67,84 @@ A CSV file is read from a :class:`~arrow::io::InputStream`.
                                       parse_options,
                                       convert_options);
       if (!maybe_reader.ok()) {
-         // Handle TableReader instantiation error...
+        // Handle TableReader instantiation error...
       }
       std::shared_ptr<arrow::csv::TableReader> reader = *maybe_reader;
 
       // Read table from CSV file
       auto maybe_table = reader->Read();
       if (!maybe_table.ok()) {
-         // Handle CSV read error
-         // (for example a CSV syntax error or failed type conversion)
+        // Handle CSV read error
+        // (for example a CSV syntax error or failed type conversion)
       }
       std::shared_ptr<arrow::Table> table = *maybe_table;
    }
 
+StreamingReader
+---------------
+
+.. code-block:: cpp
+
+   #include "arrow/csv/api.h"
+
+   {
+      // ...
+      arrow::io::IOContext io_context = arrow::io::default_io_context();
+      std::shared_ptr<arrow::io::InputStream> input = ...;
+
+      auto read_options = arrow::csv::ReadOptions::Defaults();
+      auto parse_options = arrow::csv::ParseOptions::Defaults();
+      auto convert_options = arrow::csv::ConvertOptions::Defaults();
+
+      // Instantiate StreamingReader from input stream and options
+      auto maybe_reader =
+        arrow::csv::StreamingReader::Make(io_context,
+                                          input,
+                                          read_options,
+                                          parse_options,
+                                          convert_options);
+      if (!maybe_reader.ok()) {
+        // Handle StreamingReader instantiation error...
+      }
+      std::shared_ptr<arrow::csv::StreamingReader> reader = *maybe_reader;
+
+      // Set aside a RecordBatch pointer for re-use while streaming
+      std::shared_ptr<RecordBatch> batch;
+
+      // Attempt to read the first RecordBatch
+      arrow::Status status = reader->ReadNext(&batch);
+
+      if (!status.ok()) {
+        // Handle read error
+      }
+
+      if (batch == NULL) {
+        // Handle end of file
+      }
+   }
+
+.. _cpp-csv-tradeoffs:
+
+Tradeoffs
+---------
+
+The choice between using :class:`~arrow::csv::TableReader` or
+:class:`~arrow::csv::StreamingReader` will depend on your use case but two
+caveats are worth pointing out:
+
+1. :class:`~arrow::csv::TableReader` is capable of using multiple threads (See
+   :ref:`Performance <cpp-csv-performance>`) whereas
+   :class:`~arrow::csv::StreamingReader` is always single-threaded and will
+   ignore :member:`ReadOptions::use_threads`.

Review Comment:
   Close enough, I suppose. The streaming reader does use some threads and will (hopefully) use more in the future, but I don't know that we can meaningfully explain that here without going too far into the details.



##########
docs/source/cpp/csv.rst:
##########
@@ -56,19 +67,84 @@ A CSV file is read from a :class:`~arrow::io::InputStream`.
                                       parse_options,
                                       convert_options);
       if (!maybe_reader.ok()) {
-         // Handle TableReader instantiation error...
+        // Handle TableReader instantiation error...
       }
       std::shared_ptr<arrow::csv::TableReader> reader = *maybe_reader;
 
       // Read table from CSV file
       auto maybe_table = reader->Read();
       if (!maybe_table.ok()) {
-         // Handle CSV read error
-         // (for example a CSV syntax error or failed type conversion)
+        // Handle CSV read error
+        // (for example a CSV syntax error or failed type conversion)
       }
       std::shared_ptr<arrow::Table> table = *maybe_table;
    }
 
+StreamingReader
+---------------
+
+.. code-block:: cpp
+
+   #include "arrow/csv/api.h"
+
+   {
+      // ...
+      arrow::io::IOContext io_context = arrow::io::default_io_context();
+      std::shared_ptr<arrow::io::InputStream> input = ...;
+
+      auto read_options = arrow::csv::ReadOptions::Defaults();
+      auto parse_options = arrow::csv::ParseOptions::Defaults();
+      auto convert_options = arrow::csv::ConvertOptions::Defaults();
+
+      // Instantiate StreamingReader from input stream and options
+      auto maybe_reader =
+        arrow::csv::StreamingReader::Make(io_context,
+                                          input,
+                                          read_options,
+                                          parse_options,
+                                          convert_options);
+      if (!maybe_reader.ok()) {
+        // Handle StreamingReader instantiation error...
+      }
+      std::shared_ptr<arrow::csv::StreamingReader> reader = *maybe_reader;
+
+      // Set aside a RecordBatch pointer for re-use while streaming
+      std::shared_ptr<RecordBatch> batch;
+
+      // Attempt to read the first RecordBatch
+      arrow::Status status = reader->ReadNext(&batch);
+
+      if (!status.ok()) {
+        // Handle read error
+      }
+
+      if (batch == NULL) {
+        // Handle end of file
+      }

Review Comment:
   ```suggestion
         while (true) {
           // Attempt to read the next RecordBatch
           arrow::Status status = reader->ReadNext(&batch);
   
           if (!status.ok()) {
             // Handle read error
             break;
           }
   
           if (batch == NULL) {
             // Handle end of file
             break;
           }
           
           // Do something with the batch
         }
   ```
   
   I know it is an example but I feel we should show actual streaming consumption.



##########
docs/source/cpp/csv.rst:
##########
@@ -56,19 +67,84 @@ A CSV file is read from a :class:`~arrow::io::InputStream`.
                                       parse_options,
                                       convert_options);
       if (!maybe_reader.ok()) {
-         // Handle TableReader instantiation error...
+        // Handle TableReader instantiation error...
       }
       std::shared_ptr<arrow::csv::TableReader> reader = *maybe_reader;
 
       // Read table from CSV file
       auto maybe_table = reader->Read();
       if (!maybe_table.ok()) {
-         // Handle CSV read error
-         // (for example a CSV syntax error or failed type conversion)
+        // Handle CSV read error
+        // (for example a CSV syntax error or failed type conversion)
       }
       std::shared_ptr<arrow::Table> table = *maybe_table;
    }
 
+StreamingReader
+---------------
+
+.. code-block:: cpp
+
+   #include "arrow/csv/api.h"
+
+   {
+      // ...
+      arrow::io::IOContext io_context = arrow::io::default_io_context();
+      std::shared_ptr<arrow::io::InputStream> input = ...;
+
+      auto read_options = arrow::csv::ReadOptions::Defaults();
+      auto parse_options = arrow::csv::ParseOptions::Defaults();
+      auto convert_options = arrow::csv::ConvertOptions::Defaults();
+
+      // Instantiate StreamingReader from input stream and options
+      auto maybe_reader =
+        arrow::csv::StreamingReader::Make(io_context,
+                                          input,
+                                          read_options,
+                                          parse_options,
+                                          convert_options);
+      if (!maybe_reader.ok()) {
+        // Handle StreamingReader instantiation error...
+      }
+      std::shared_ptr<arrow::csv::StreamingReader> reader = *maybe_reader;
+
+      // Set aside a RecordBatch pointer for re-use while streaming
+      std::shared_ptr<RecordBatch> batch;
+
+      // Attempt to read the first RecordBatch
+      arrow::Status status = reader->ReadNext(&batch);
+
+      if (!status.ok()) {
+        // Handle read error
+      }
+
+      if (batch == NULL) {
+        // Handle end of file
+      }
+   }
+
+.. _cpp-csv-tradeoffs:
+
+Tradeoffs
+---------
+
+The choice between using :class:`~arrow::csv::TableReader` or
+:class:`~arrow::csv::StreamingReader` will depend on your use case but two
+caveats are worth pointing out:
+
+1. :class:`~arrow::csv::TableReader` is capable of using multiple threads (See
+   :ref:`Performance <cpp-csv-performance>`) whereas
+   :class:`~arrow::csv::StreamingReader` is always single-threaded and will
+   ignore :member:`ReadOptions::use_threads`.
+2. :class:`~arrow::csv::StreamingReader` performs type inference off the first
+   block that's read in, after which point the types are frozen. Either set
+   :member:`ReadOptions::block_size` to a large enough value or use
+   :member:`ConvertOptions::column_types` to set the desired data types
+   explicitly.

Review Comment:
   Or else what?  It may not be clear to the user that, without one of these, a value in a later block that doesn't match the inferred type would lead to an error like "XYZ is not a valid float32".
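   
   To make the failure mode concrete, here is a toy sketch in plain C++. It is not Arrow code: `ToyType`, `InferFromFirstBlock`, and `FirstConversionFailure` are made-up names for illustration. It assumes inference only looks at the first block and that the frozen type is then applied to every later value:
   
   ```cpp
   #include <cassert>
   #include <cstddef>
   #include <cstdlib>
   #include <string>
   #include <vector>
   
   // Toy stand-in for a column type; not an Arrow data type.
   enum class ToyType { Int64, Double };
   
   // True if `s` parses completely as a base-10 integer.
   bool IsInt64(const std::string& s) {
     if (s.empty()) return false;
     char* end = nullptr;
     std::strtoll(s.c_str(), &end, 10);
     return *end == '\0';
   }
   
   // Infer a type from the first `block_rows` values only; later values
   // are never consulted, which models the "frozen after first block" rule.
   ToyType InferFromFirstBlock(const std::vector<std::string>& values,
                               std::size_t block_rows) {
     assert(block_rows > 0);
     for (std::size_t i = 0; i < values.size() && i < block_rows; ++i) {
       if (!IsInt64(values[i])) return ToyType::Double;
     }
     return ToyType::Int64;
   }
   
   // Index of the first value that does not fit the frozen type, or -1.
   // In this toy model every value is assumed to fit Double.
   int FirstConversionFailure(const std::vector<std::string>& values,
                              ToyType frozen) {
     if (frozen == ToyType::Double) return -1;
     for (std::size_t i = 0; i < values.size(); ++i) {
       if (!IsInt64(values[i])) return static_cast<int>(i);
     }
     return -1;
   }
   ```
   
   With the values `{"1", "2", "3", "4.5"}` and a block of two rows, the frozen type is `Int64` and the value at index 3 fails to convert. Widening the block to four rows, or fixing the type up front, avoids that.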



##########
docs/source/cpp/csv.rst:
##########
@@ -25,15 +25,26 @@ Reading and Writing CSV files
 =============================
 
 Arrow provides a fast CSV reader allowing ingestion of external data
-as Arrow tables.
+as Arrow Tables or streamed as Arrow RecordBatches.
 
 .. seealso::
    :ref:`CSV reader/writer API reference <cpp-api-csv>`.
 
-Basic usage
-===========
+Reading CSV files
+=================
+
+Data in a CSV file can either be read in as a single Arrow Table using
+:class:`~arrow::csv::TableReader` or streamed as RecordBatches using
+:class:`~arrow::csv::StreamingReader`. See :ref:`Tradeoffs <cpp-csv-tradeoffs>` for a
+discussion of the tradeoffs between the two methods.
+
+Both these readers require an :class:`arrow::io::InputStream` instance

Review Comment:
   Not a problem at the moment.  However, there is [a PR](https://github.com/apache/arrow/pull/14269) which would allow you to get slightly better performance if you use a random access file (typically if you're connected to cloud storage) instead of an input stream.  So we might need to update this text after that pull merges.



##########
docs/source/cpp/csv.rst:
##########
@@ -56,19 +67,84 @@ A CSV file is read from a :class:`~arrow::io::InputStream`.
                                       parse_options,
                                       convert_options);
       if (!maybe_reader.ok()) {
-         // Handle TableReader instantiation error...
+        // Handle TableReader instantiation error...
       }
       std::shared_ptr<arrow::csv::TableReader> reader = *maybe_reader;
 
       // Read table from CSV file
       auto maybe_table = reader->Read();
       if (!maybe_table.ok()) {
-         // Handle CSV read error
-         // (for example a CSV syntax error or failed type conversion)
+        // Handle CSV read error
+        // (for example a CSV syntax error or failed type conversion)
       }
       std::shared_ptr<arrow::Table> table = *maybe_table;
    }
 
+StreamingReader
+---------------
+
+.. code-block:: cpp
+
+   #include "arrow/csv/api.h"
+
+   {
+      // ...
+      arrow::io::IOContext io_context = arrow::io::default_io_context();
+      std::shared_ptr<arrow::io::InputStream> input = ...;
+
+      auto read_options = arrow::csv::ReadOptions::Defaults();
+      auto parse_options = arrow::csv::ParseOptions::Defaults();
+      auto convert_options = arrow::csv::ConvertOptions::Defaults();
+
+      // Instantiate StreamingReader from input stream and options
+      auto maybe_reader =
+        arrow::csv::StreamingReader::Make(io_context,
+                                          input,
+                                          read_options,
+                                          parse_options,
+                                          convert_options);
+      if (!maybe_reader.ok()) {
+        // Handle StreamingReader instantiation error...
+      }
+      std::shared_ptr<arrow::csv::StreamingReader> reader = *maybe_reader;
+
+      // Set aside a RecordBatch pointer for re-use while streaming
+      std::shared_ptr<RecordBatch> batch;
+
+      // Attempt to read the first RecordBatch
+      arrow::Status status = reader->ReadNext(&batch);
+
+      if (!status.ok()) {
+        // Handle read error
+      }
+
+      if (batch == NULL) {
+        // Handle end of file
+      }
+   }
+
+.. _cpp-csv-tradeoffs:
+
+Tradeoffs
+---------
+
+The choice between using :class:`~arrow::csv::TableReader` or
+:class:`~arrow::csv::StreamingReader` will depend on your use case but two
+caveats are worth pointing out:
+
+1. :class:`~arrow::csv::TableReader` is capable of using multiple threads (See
+   :ref:`Performance <cpp-csv-performance>`) whereas
+   :class:`~arrow::csv::StreamingReader` is always single-threaded and will
+   ignore :member:`ReadOptions::use_threads`.
+2. :class:`~arrow::csv::StreamingReader` performs type inference off the first
+   block that's read in, after which point the types are frozen. Either set
+   :member:`ReadOptions::block_size` to a large enough value or use
+   :member:`ConvertOptions::column_types` to set the desired data types
+   explicitly.
+

Review Comment:
   These are all cons :)
   
   The table reader requires loading all of the data into memory.  This is a pretty significant tradeoff we should point out here.
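   
   A rough way to picture that tradeoff, as a toy sketch in plain C++. It is not Arrow code: `ToyStats`, `ReadAllAtOnce`, and `ReadInBatches` are hypothetical names. The table-style function holds every row before returning, while the streaming-style one never holds more than one batch:
   
   ```cpp
   #include <algorithm>
   #include <cassert>
   #include <cstddef>
   #include <istream>
   #include <sstream>
   #include <string>
   #include <vector>
   
   struct ToyStats {
     std::size_t total_rows;      // rows seen overall
     std::size_t peak_rows_held;  // most rows resident in memory at once
   };
   
   // Table-style: materialize every row before doing any work.
   ToyStats ReadAllAtOnce(std::istream& in) {
     std::vector<std::string> rows;
     for (std::string line; std::getline(in, line);) rows.push_back(line);
     ToyStats stats;
     stats.total_rows = rows.size();
     stats.peak_rows_held = rows.size();
     return stats;
   }
   
   // Streaming-style: hold at most `batch_rows` rows, process, then discard.
   ToyStats ReadInBatches(std::istream& in, std::size_t batch_rows) {
     assert(batch_rows > 0);
     ToyStats stats;
     stats.total_rows = 0;
     stats.peak_rows_held = 0;
     std::vector<std::string> batch;
     for (std::string line; std::getline(in, line);) {
       batch.push_back(line);
       if (batch.size() == batch_rows) {
         stats.total_rows += batch.size();
         stats.peak_rows_held = std::max(stats.peak_rows_held, batch.size());
         batch.clear();  // the processed batch is released here
       }
     }
     // Account for a final, partially filled batch.
     stats.total_rows += batch.size();
     stats.peak_rows_held = std::max(stats.peak_rows_held, batch.size());
     return stats;
   }
   ```
   
   Both functions see the same number of rows, but only the first one's peak grows with the size of the input.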





[GitHub] [arrow] amoeba commented on pull request #14452: ARROW-15328: [C++][Docs] Streaming CSV reader missing from documentation

Posted by GitBox <gi...@apache.org>.
amoeba commented on PR #14452:
URL: https://github.com/apache/arrow/pull/14452#issuecomment-1284724655

   Thanks for the review @westonpace, this is really helpful. I have some language to work on and I'll update here when this is ready for another review.




[GitHub] [arrow] amoeba commented on a diff in pull request #14452: ARROW-15328: [C++][Docs] Streaming CSV reader missing from documentation

Posted by GitBox <gi...@apache.org>.
amoeba commented on code in PR #14452:
URL: https://github.com/apache/arrow/pull/14452#discussion_r1000023812


##########
docs/source/cpp/csv.rst:
##########
@@ -25,15 +25,26 @@ Reading and Writing CSV files
 =============================
 
 Arrow provides a fast CSV reader allowing ingestion of external data
-as Arrow tables.
+as Arrow Tables or streamed as Arrow RecordBatches.

Review Comment:
   Fair enough. Fixed in 381f8b29ba205f6e0040c632eb23cec8eb46764c.


