You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/05/07 21:01:02 UTC

[GitHub] [arrow] lidavidm opened a new pull request #10270: ARROW-12598: [C++][Dataset] Speed up CountRows for CSV

lidavidm opened a new pull request #10270:
URL: https://github.com/apache/arrow/pull/10270


   This does not implement a fast path for CSV. However, it does configure the CSV reader to not actually deserialize any data, resulting in a large gain. When scanning 85 million rows of the NYC Taxi dataset, scan time dropped from 11 seconds to 2.
   
   This also sneaks in an implementation of the fast path for InMemoryFragment.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #10270: ARROW-12598: [C++][Dataset] Speed up CountRows for CSV

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #10270:
URL: https://github.com/apache/arrow/pull/10270#discussion_r630951509



##########
File path: cpp/src/arrow/csv/options.h
##########
@@ -101,6 +101,9 @@ struct ARROW_EXPORT ConvertOptions {
   /// or null by default)
   /// This option is ignored if `include_columns` is empty.
   bool include_missing_columns = false;
+  /// If true, no columns are decoded (record batches will have a row count and no
+  /// columns). This is useful if you need only a row count.
+  bool skip_decoding = false;

Review comment:
       I'd rather if you exposed a separate API. Is it easily doable?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] lidavidm commented on a change in pull request #10270: ARROW-12598: [C++][Dataset] Speed up CountRows for CSV

Posted by GitBox <gi...@apache.org>.
lidavidm commented on a change in pull request #10270:
URL: https://github.com/apache/arrow/pull/10270#discussion_r630989906



##########
File path: cpp/src/arrow/csv/options.h
##########
@@ -101,6 +101,9 @@ struct ARROW_EXPORT ConvertOptions {
   /// or null by default)
   /// This option is ignored if `include_columns` is empty.
   bool include_missing_columns = false;
+  /// If true, no columns are decoded (record batches will have a row count and no
+  /// columns). This is useful if you need only a row count.
+  bool skip_decoding = false;

Review comment:
       It should be doable, we'll just have to be careful to respect all the same options (skip_rows, etc.). I'll make an attempt.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] lidavidm commented on a change in pull request #10270: ARROW-12598: [C++][Dataset] Speed up CountRows for CSV

Posted by GitBox <gi...@apache.org>.
lidavidm commented on a change in pull request #10270:
URL: https://github.com/apache/arrow/pull/10270#discussion_r628644912



##########
File path: cpp/src/arrow/dataset/file_csv.cc
##########
@@ -104,6 +115,13 @@ static inline Result<csv::ConvertOptions> GetConvertOptions(
     // Properly set conversion types
     convert_options.column_types[field->name()] = field->type();
   }
+  if (convert_options.include_columns.empty()) {
+    // We know which columns we want - so if none were specified, override the default
+    // behavior of reading all columns by reading a missing fake null column
+    convert_options.include_missing_columns = true;
+    convert_options.include_columns.push_back("fabricated");

Review comment:
       Hmm. I'll have to check, but I think this won't show up even if you do something like `to_table(columns=[])` because then it'll get projected away. Oh, but if you projected a column called `fabricated` that didn't depend on any physical columns…I'll have to rethink this.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #10270: ARROW-12598: [C++][Dataset] Speed up CountRows for CSV

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #10270:
URL: https://github.com/apache/arrow/pull/10270#discussion_r643253931



##########
File path: cpp/src/arrow/dataset/file_csv_test.cc
##########
@@ -50,7 +50,10 @@ class CsvFormatHelper {
   }
 
   static std::shared_ptr<CsvFileFormat> MakeFormat() {
-    return std::make_shared<CsvFileFormat>();
+    auto format = std::make_shared<CsvFileFormat>();
+    // Required for CountRows
+    format->parse_options.ignore_empty_lines = false;

Review comment:
       Hmm, but then why is this change necessary? Is it just a byproduct of how the tests are written?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou closed pull request #10270: ARROW-12598: [C++][Dataset] Speed up CountRows for CSV

Posted by GitBox <gi...@apache.org>.
pitrou closed pull request #10270:
URL: https://github.com/apache/arrow/pull/10270


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] lidavidm commented on pull request #10270: ARROW-12598: [C++][Dataset] Speed up CountRows for CSV

Posted by GitBox <gi...@apache.org>.
lidavidm commented on pull request #10270:
URL: https://github.com/apache/arrow/pull/10270#issuecomment-841264102


   Now there's a separate CSV row counter. It still reuses the reader mixin for handling the header (so we should get a proper row count that way). There is a segfault under the RTools35 toolchain; it occurs after running a future callback but before the future's ultimate consumer gets the value. I'm not likely going to look into it much further since RTools35 is going away soon :crossed_fingers: 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] lidavidm commented on a change in pull request #10270: ARROW-12598: [C++][Dataset] Speed up CountRows for CSV

Posted by GitBox <gi...@apache.org>.
lidavidm commented on a change in pull request #10270:
URL: https://github.com/apache/arrow/pull/10270#discussion_r629409104



##########
File path: cpp/src/arrow/dataset/file_csv.cc
##########
@@ -104,6 +115,13 @@ static inline Result<csv::ConvertOptions> GetConvertOptions(
     // Properly set conversion types
     convert_options.column_types[field->name()] = field->type();
   }
+  if (convert_options.include_columns.empty()) {
+    // We know which columns we want - so if none were specified, override the default
+    // behavior of reading all columns by reading a missing fake null column
+    convert_options.include_missing_columns = true;
+    convert_options.include_columns.push_back("fabricated");

Review comment:
       Ok, now I've added an explicit CSV reader option - it still uses a fabricated null column internally for convenience but this way that won't leak out into consumers.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #10270: ARROW-12598: [C++][Dataset] Speed up CountRows for CSV

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #10270:
URL: https://github.com/apache/arrow/pull/10270#discussion_r643145474



##########
File path: cpp/src/arrow/csv/reader.h
##########
@@ -96,5 +96,11 @@ class ARROW_EXPORT StreamingReader : public RecordBatchReader {
       const ConvertOptions& convert_options);
 };
 
+/// \brief Count the rows in a CSV file.
+ARROW_EXPORT
+Future<int64_t> CountRows(io::IOContext io_context,

Review comment:
       Nit: perhaps call this `CountRowsAsync`?

##########
File path: cpp/src/arrow/csv/reader.cc
##########
@@ -1081,6 +1145,16 @@ Future<std::shared_ptr<StreamingReader>> StreamingReader::MakeAsync(
                              parse_options, convert_options);
 }
 
+Future<int64_t> CountRows(io::IOContext io_context,
+                          std::shared_ptr<io::InputStream> input,
+                          const ReadOptions& read_options,
+                          const ParseOptions& parse_options) {
+  auto cpu_executor = internal::GetCpuThreadPool();

Review comment:
       Is there a reason the CPU executor isn't a `CountRows` parameter?

##########
File path: cpp/src/arrow/dataset/file_csv.h
##########
@@ -61,6 +61,10 @@ class ARROW_DS_EXPORT CsvFileFormat : public FileFormat {
       const std::shared_ptr<ScanOptions>& scan_options,
       const std::shared_ptr<FileFragment>& file) const override;
 
+  Future<util::optional<int64_t>> CountRows(
+      const std::shared_ptr<FileFragment>& file, compute::Expression predicate,
+      std::shared_ptr<ScanOptions> options) override;

Review comment:
       Is there a reason that some methods take `const std::shared_ptr<ScanOptions>&` and this one `std::shared_ptr<ScanOptions>`?

##########
File path: cpp/src/arrow/dataset/file_csv_test.cc
##########
@@ -50,7 +50,10 @@ class CsvFormatHelper {
   }
 
   static std::shared_ptr<CsvFileFormat> MakeFormat() {
-    return std::make_shared<CsvFileFormat>();
+    auto format = std::make_shared<CsvFileFormat>();
+    // Required for CountRows
+    format->parse_options.ignore_empty_lines = false;

Review comment:
       I'm not sure I understand this change. Is `CountRows` supposed to count logical rows of data, or physical rows inside the file (even if they may be ignored as empty)?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] lidavidm commented on a change in pull request #10270: ARROW-12598: [C++][Dataset] Speed up CountRows for CSV

Posted by GitBox <gi...@apache.org>.
lidavidm commented on a change in pull request #10270:
URL: https://github.com/apache/arrow/pull/10270#discussion_r628644999



##########
File path: cpp/src/arrow/dataset/file_csv_test.cc
##########
@@ -182,6 +183,18 @@ bar)");
     }
     ASSERT_EQ(rows, 5);
   }
+  {
+    SetSchema({field("custom_header", utf8())});
+    defaults->read_options.skip_rows = 0;

Review comment:
       This is resetting the value from above because the test is stateful - I'll come up with a better way to do this.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #10270: ARROW-12598: [C++][Dataset] Speed up CountRows for CSV

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #10270:
URL: https://github.com/apache/arrow/pull/10270#issuecomment-834776907


   https://issues.apache.org/jira/browse/ARROW-12598


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #10270: ARROW-12598: [C++][Dataset] Speed up CountRows for CSV

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #10270:
URL: https://github.com/apache/arrow/pull/10270#discussion_r643254456



##########
File path: cpp/src/arrow/dataset/file_csv_test.cc
##########
@@ -50,7 +50,10 @@ class CsvFormatHelper {
   }
 
   static std::shared_ptr<CsvFileFormat> MakeFormat() {
-    return std::make_shared<CsvFileFormat>();
+    auto format = std::make_shared<CsvFileFormat>();
+    // Required for CountRows
+    format->parse_options.ignore_empty_lines = false;

Review comment:
       Ah, I see you added an explanation, sorry!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] westonpace commented on a change in pull request #10270: ARROW-12598: [C++][Dataset] Speed up CountRows for CSV

Posted by GitBox <gi...@apache.org>.
westonpace commented on a change in pull request #10270:
URL: https://github.com/apache/arrow/pull/10270#discussion_r628569404



##########
File path: cpp/src/arrow/dataset/file_csv.cc
##########
@@ -48,18 +48,28 @@ using internal::SerialExecutor;
 using RecordBatchGenerator = std::function<Future<std::shared_ptr<RecordBatch>>()>;
 
 Result<std::unordered_set<std::string>> GetColumnNames(
-    const csv::ParseOptions& parse_options, util::string_view first_block,
-    MemoryPool* pool) {
+    const csv::ParseOptions& parse_options, const csv::ReadOptions& read_options,

Review comment:
       Minor nit: I think everywhere else `read_options` comes before `parse_options`.

##########
File path: cpp/src/arrow/dataset/file_csv.cc
##########
@@ -104,6 +115,13 @@ static inline Result<csv::ConvertOptions> GetConvertOptions(
     // Properly set conversion types
     convert_options.column_types[field->name()] = field->type();
   }
+  if (convert_options.include_columns.empty()) {
+    // We know which columns we want - so if none were specified, override the default
+    // behavior of reading all columns by reading a missing fake null column
+    convert_options.include_missing_columns = true;
+    convert_options.include_columns.push_back("fabricated");

Review comment:
       It's probably rare but could this cause an issue if the CSV happened to have a column named `fabricated`?  Although I suppose it'll be addressed by the newline scanning fast implementation.

##########
File path: cpp/src/arrow/dataset/file_csv_test.cc
##########
@@ -182,6 +183,18 @@ bar)");
     }
     ASSERT_EQ(rows, 5);
   }
+  {
+    SetSchema({field("custom_header", utf8())});
+    defaults->read_options.skip_rows = 0;

Review comment:
       0 is the default for `skip_rows`.  If you specify `column_names` then the CSV reader will automatically assume there is no header so you shouldn't have to manually set it to 0.  Maybe add a test case where `skip_rows` is non-zero to ensure you are passing it down correctly?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #10270: ARROW-12598: [C++][Dataset] Speed up CountRows for CSV

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #10270:
URL: https://github.com/apache/arrow/pull/10270#issuecomment-851973894


   @lidavidm Could you please rebase this to fix conflicts?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] lidavidm commented on pull request #10270: ARROW-12598: [C++][Dataset] Speed up CountRows for CSV

Posted by GitBox <gi...@apache.org>.
lidavidm commented on pull request #10270:
URL: https://github.com/apache/arrow/pull/10270#issuecomment-849573022


   Just to follow up - I've updated the PR here to implement a separate row counter (reusing the reader's base classes) instead of adding a parameter to the reader.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] lidavidm commented on a change in pull request #10270: ARROW-12598: [C++][Dataset] Speed up CountRows for CSV

Posted by GitBox <gi...@apache.org>.
lidavidm commented on a change in pull request #10270:
URL: https://github.com/apache/arrow/pull/10270#discussion_r643169048



##########
File path: cpp/src/arrow/dataset/file_csv.h
##########
@@ -61,6 +61,10 @@ class ARROW_DS_EXPORT CsvFileFormat : public FileFormat {
       const std::shared_ptr<ScanOptions>& scan_options,
       const std::shared_ptr<FileFragment>& file) const override;
 
+  Future<util::optional<int64_t>> CountRows(
+      const std::shared_ptr<FileFragment>& file, compute::Expression predicate,
+      std::shared_ptr<ScanOptions> options) override;

Review comment:
       I think I had patterned everything on Fragment::Scan (which takes it by value) but all the new methods take it by const reference so I'll adjust that.

##########
File path: cpp/src/arrow/dataset/file_csv_test.cc
##########
@@ -50,7 +50,10 @@ class CsvFormatHelper {
   }
 
   static std::shared_ptr<CsvFileFormat> MakeFormat() {
-    return std::make_shared<CsvFileFormat>();
+    auto format = std::make_shared<CsvFileFormat>();
+    // Required for CountRows
+    format->parse_options.ignore_empty_lines = false;

Review comment:
       It is logical rows of data; I'll clarify this.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org