Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/04/22 18:28:10 UTC

[GitHub] [arrow] westonpace commented on a change in pull request #10132: ARROW-12482: [Doc][C++][Python] Mention CSVStreamingReader pitfalls with type inference

westonpace commented on a change in pull request #10132:
URL: https://github.com/apache/arrow/pull/10132#discussion_r618640819



##########
File path: docs/source/python/csv.rst
##########
@@ -92,8 +92,17 @@ Incremental reading
 -------------------
 
 For memory-constrained environments, it is also possible to read a CSV file
-one batch at a time, using :func:`open_csv`.  It currently doesn't support
-parallel reading.
+one batch at a time, using :func:`open_csv`.
+
+There are a few caveats:
+
+1. For now, the incremental reader is always single-threaded (regardless of
+   :attr:`ReadOptions.use_threads`)
+
+2. Type inference is done on the first block and types are frozen afterwards;
+   to make sure the right data types are inferred, either set
+   :attr:`ReadOptions.block_size` to a large enough value, or use
+   :attr:`ConvertOptions.column_types` to explicit set the desired data types.

Review comment:
       `explicit` -> `explicitly`

##########
File path: cpp/src/arrow/csv/reader.h
##########
@@ -59,7 +59,14 @@ class ARROW_EXPORT TableReader {
       const ReadOptions&, const ParseOptions&, const ConvertOptions&);
 };
 
-/// Experimental
+/// \brief A class that reads a CSV file incrementally
+///
+/// Caveats:
/// - For now, this is always single-threaded (regardless of `ReadOptions::use_threads`).
+/// - Type inference is done on the first block and types are frozen afterwards;
+///   to make sure the right data types are inferred, either set
+///   `ReadOptions::block_size` to a large enough value, or use
+///   `ConvertOptions::column_types` to explicit set the desired data types.

Review comment:
       `explicit` -> `explicitly`

##########
File path: cpp/src/arrow/csv/options.h
##########
@@ -119,15 +119,20 @@ struct ARROW_EXPORT ReadOptions {
 
   /// Whether to use the global CPU thread pool
   bool use_threads = true;
-  /// Block size we request from the IO layer; also determines the size of
-  /// chunks when use_threads is true
+
+  /// \brief Block size we request from the IO layer.
+  ///
+  /// This will determine multi-threading granularity as well as

Review comment:
       I'm not sure how multi-threading is related to block size at the moment.  In the future we could add parallel readahead, but currently neither the file nor the streaming CSV reader does this.  The threaded CSV reader parallelizes within a block, not across multiple blocks at once.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org