You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/05/06 00:33:30 UTC

[GitHub] [arrow] n3world commented on pull request #10255: ARROW-12661: [C++] Add ReaderOptions::skip_rows_after_names

n3world commented on pull request #10255:
URL: https://github.com/apache/arrow/pull/10255#issuecomment-833140856


   > Thanks for submitting these PRs! I'm still brushing up on my parser knowledge but I'll take a look at the other one too.
   > 
   > This one looks pretty straightforward to me. It's a bit unfortunate that we have two `skip_rows` but it does seem that there are cases for skipping before and after. I do prefer this over the before/after boolean. Another option would be to take in a list of row indices. This is what pandas does. So [0:1] skips two rows before the header (if any) and [1:2] would skip two rows after the header or [0, 2] would skip a row before and after.
   > 
   > @pitrou should probably take a look at this as he's got the most experience with the CSV reader. He's out for the rest of the week so it might be next week.
   
   I like the idea of replacing the current skip_rows with a list of ranges to skip. This would be a more complex but a more robust solution. I assume that skip_rows would have to be kept but deprecated. The one difficulty would be if `ParseOptions::newlines_in_values` were true then the reader would have to be aware of the rows that are parsed to have a correct row index. I can start working on this solution ignoring the newlines_in_values issue now if that is the consensus.
   
   I might make it so any range starting with 0 is handled by the reader, as skip_row does now, and any range not starting with 0 is handled by the parser so that potentially quoted new lines are handled correctly.
   
   There might be a way to also tackle https://issues.apache.org/jira/browse/ARROW-8527 with this change.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org