You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/05/07 22:24:24 UTC

[GitHub] [arrow] westonpace commented on a change in pull request #10270: ARROW-12598: [C++][Dataset] Speed up CountRows for CSV

westonpace commented on a change in pull request #10270:
URL: https://github.com/apache/arrow/pull/10270#discussion_r628569404



##########
File path: cpp/src/arrow/dataset/file_csv.cc
##########
@@ -48,18 +48,28 @@ using internal::SerialExecutor;
 using RecordBatchGenerator = std::function<Future<std::shared_ptr<RecordBatch>>()>;
 
 Result<std::unordered_set<std::string>> GetColumnNames(
-    const csv::ParseOptions& parse_options, util::string_view first_block,
-    MemoryPool* pool) {
+    const csv::ParseOptions& parse_options, const csv::ReadOptions& read_options,

Review comment:
       Minor nit: I think everywhere else `read_options` comes before `parse_options`.

##########
File path: cpp/src/arrow/dataset/file_csv.cc
##########
@@ -104,6 +115,13 @@ static inline Result<csv::ConvertOptions> GetConvertOptions(
     // Properly set conversion types
     convert_options.column_types[field->name()] = field->type();
   }
+  if (convert_options.include_columns.empty()) {
+    // We know which columns we want - so if none were specified, override the default
+    // behavior of reading all columns by reading a missing fake null column
+    convert_options.include_missing_columns = true;
+    convert_options.include_columns.push_back("fabricated");

Review comment:
       It's probably rare but could this cause an issue if the CSV happened to have a column named `fabricated`?  Although I suppose it'll be addressed by the newline scanning fast implementation.

##########
File path: cpp/src/arrow/dataset/file_csv_test.cc
##########
@@ -182,6 +183,18 @@ bar)");
     }
     ASSERT_EQ(rows, 5);
   }
+  {
+    SetSchema({field("custom_header", utf8())});
+    defaults->read_options.skip_rows = 0;

Review comment:
       0 is the default for `skip_rows`.  If you specify `column_names` then the CSV reader will automatically assume there is no header so you shouldn't have to manually set it to 0.  Maybe add a test case where `skip_rows` is non-zero to ensure you are passing it down correctly?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org