
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/05/06 03:25:03 UTC

[GitHub] [arrow] n3world edited a comment on pull request #10255: ARROW-12661: [C++] Add ReaderOptions::skip_rows_after_names

n3world edited a comment on pull request #10255:
URL: https://github.com/apache/arrow/pull/10255#issuecomment-833185925

> That would be good. Eventually the dataset scanner will probably be getting a skip operation of some kind as well so that'll increase the pressure on [ARROW-8527](https://issues.apache.org/jira/browse/ARROW-8527). [ARROW-12598](https://issues.apache.org/jira/browse/ARROW-12598) is also (admittedly tangentially) related since you seem to be on a roll :smile:

The only tricky part about a count(*) implementation with this is that skip_rows is documented as skipping header rows, which shouldn't be counted as part of a data row count. I feel like the row count operation would have to be a little different, and maybe indicate the line on which the actual data rows start so that the header rows before that point could be skipped.

Maybe a simpler solution would be a set of two indexes: column names and first data row. While this doesn't allow arbitrary row skipping in the middle, it would cover the most common use cases, including skipping over valid rows to reach the first desired row. Another option or operation could then be used to count the number of data rows starting at the first data row. The defaults would be 0, 1 when column names are part of the CSV, or -1, 0 when they are not.
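
To make the two-index idea concrete, here is a rough sketch in Python using only the standard `csv` module. It is not Arrow code and not a proposed API: the names `read_with_indexes`, `count_data_rows`, `column_names_row` and `first_data_row` are hypothetical, and generating `f0`, `f1`, ... when there is no names row is just an assumption made for the illustration.

```python
import csv
import io


def read_with_indexes(text, column_names_row=0, first_data_row=1):
    """Split CSV text into (column names, data rows) using two row indexes.

    column_names_row: index of the row holding column names, or -1 if none.
    first_data_row:   index of the first row to treat as data.
    Both parameter names are hypothetical, chosen for this sketch only.
    """
    rows = list(csv.reader(io.StringIO(text)))
    data = rows[first_data_row:]
    if column_names_row >= 0:
        names = rows[column_names_row]
    else:
        # No names row: generate placeholder names from the first data row's width.
        width = len(data[0]) if data else 0
        names = [f"f{i}" for i in range(width)]
    return names, data


def count_data_rows(text, first_data_row=1):
    """Count rows from first_data_row onward, so header rows are never counted."""
    rows = list(csv.reader(io.StringIO(text)))
    return max(len(rows) - first_data_row, 0)


with_header = "a,b\n1,2\n3,4\n5,6\n"
without_header = "1,2\n3,4\n"

# Defaults 0, 1: names come from row 0, data starts at row 1.
names, data = read_with_indexes(with_header)

# Skipping over a valid row: names still from row 0, data starts at row 2.
_, skipped = read_with_indexes(with_header, column_names_row=0, first_data_row=2)

# -1, 0: no names in the file, data starts at row 0.
auto_names, raw = read_with_indexes(without_header, column_names_row=-1, first_data_row=0)
```

With this shape, the row count only needs `first_data_row`, which sidesteps the question of whether skipped header rows belong in the count.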
