Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/08/03 21:13:49 UTC

[GitHub] [arrow] bkietz opened a new pull request #7896: ARROW-9609: [C++][Dataset] CsvFileFormat reads all virtual columns as null

bkietz opened a new pull request #7896:
URL: https://github.com/apache/arrow/pull/7896


   `ConvertOptions::include_missing_columns = true` was insufficient to produce the required behavior with missing columns: we need to read the CSV file's header to find the names of the columns actually present in the file before instantiating a StreamingReader. Otherwise the StreamingReader fills absent columns with `null`, which prevents the projector from materializing them correctly later.
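To illustrate the idea described above, here is a minimal, library-free sketch. It is not the code from this pull request and does not use Arrow's API: `HeaderColumnNames` and `ColumnsToRead` are hypothetical names, and the header parsing is a naive split on the delimiter that assumes no quoted or escaped fields. It only shows the order of operations: collect the names in the header first, then ask the reader for just the requested columns that are really in the file, so that the absent ones are left for a later projection step rather than being filled with nulls.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper (not Arrow's GetColumnNames): returns the set of field
// names found on the first line of `csv_text`. Assumes a plain header with no
// quoting or escaping.
std::unordered_set<std::string> HeaderColumnNames(const std::string& csv_text,
                                                  char delimiter = ',') {
  assert(delimiter != '\n');  // a newline cannot separate fields within a line

  std::istringstream stream(csv_text);
  std::string header;
  std::getline(stream, header);
  // Tolerate CRLF line endings.
  if (!header.empty() && header.back() == '\r') header.pop_back();

  std::unordered_set<std::string> names;
  std::istringstream fields(header);
  std::string field;
  while (std::getline(fields, field, delimiter)) names.insert(field);
  return names;
}

// Hypothetical helper: keeps, in request order, only the requested columns
// that are present in the file. Columns dropped here are the ones a later
// projection step would have to materialize.
std::vector<std::string> ColumnsToRead(
    const std::vector<std::string>& requested,
    const std::unordered_set<std::string>& present) {
  std::vector<std::string> to_read;
  for (const auto& name : requested) {
    if (present.count(name) != 0) to_read.push_back(name);
  }
  return to_read;
}
```

With a file whose header is `a,b`, requesting `a`, `part`, and `b` would pass only `a` and `b` to the reader in this sketch; `part` never reaches the reader, so it is not pre-filled with nulls.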


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #7896: ARROW-9609: [C++][Dataset] CsvFileFormat reads all virtual columns as null

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #7896:
URL: https://github.com/apache/arrow/pull/7896#issuecomment-668246598


   https://issues.apache.org/jira/browse/ARROW-9609





[GitHub] [arrow] bkietz commented on pull request #7896: ARROW-9609: [C++][Dataset] CsvFileFormat reads all virtual columns as null

Posted by GitBox <gi...@apache.org>.
bkietz commented on pull request #7896:
URL: https://github.com/apache/arrow/pull/7896#issuecomment-668244882


   @pitrou





[GitHub] [arrow] bkietz closed pull request #7896: ARROW-9609: [C++][Dataset] CsvFileFormat reads all virtual columns as null

Posted by GitBox <gi...@apache.org>.
bkietz closed pull request #7896:
URL: https://github.com/apache/arrow/pull/7896


   





[GitHub] [arrow] bkietz commented on a change in pull request #7896: ARROW-9609: [C++][Dataset] CsvFileFormat reads all virtual columns as null

Posted by GitBox <gi...@apache.org>.
bkietz commented on a change in pull request #7896:
URL: https://github.com/apache/arrow/pull/7896#discussion_r464742390



##########
File path: cpp/src/arrow/dataset/file_csv.cc
##########
@@ -39,49 +41,92 @@ namespace dataset {
 using internal::checked_cast;
 using internal::checked_pointer_cast;
 
+Result<std::unordered_set<std::string>> GetColumnNames(

Review comment:
       I wrote this as a separate function in part so that it would be easy to extract as a public function if we wanted that. I'd like to leave that for a follow-up, though.







[GitHub] [arrow] nealrichardson commented on a change in pull request #7896: ARROW-9609: [C++][Dataset] CsvFileFormat reads all virtual columns as null

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on a change in pull request #7896:
URL: https://github.com/apache/arrow/pull/7896#discussion_r464676677



##########
File path: cpp/src/arrow/dataset/file_csv.cc
##########
@@ -39,49 +41,92 @@ namespace dataset {
 using internal::checked_cast;
 using internal::checked_pointer_cast;
 
+Result<std::unordered_set<std::string>> GetColumnNames(

Review comment:
       Would it make sense/be possible to move this somewhere more generally available for CSV reading? @romainfrancois encountered a similar need in https://github.com/apache/arrow/pull/7807#issuecomment-662045381



