Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/07/01 20:48:47 UTC

[GitHub] [arrow] westonpace commented on pull request #10202: ARROW-12673: [C++] Add parser handler for incorrect column counts

westonpace commented on pull request #10202:
URL: https://github.com/apache/arrow/pull/10202#issuecomment-872540450


   I'm not sure there is an obvious way to solve this problem in parallel.  The parser will start parsing block X before block X-1 has finished parsing.  The parser input (CSVBlock) doesn't know how many lines it has; that is not discovered until parsing time.  So, for example, the parser for block 2 might find an error on the third line of block 2.  But without knowing how many lines are in block 1 (and block 1 may not have finished parsing), it is hard to say what the line number of that error is.
   
   You could do a serial pass prior to parsing that just counts how many lines are in each block, but I suspect that would be too much overhead.
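
For illustration only, a serial counting pass might look like the Python sketch below. It is not Arrow code: `first_line_numbers` is a hypothetical helper, and it assumes each block ends on a row boundary and that no quoted field contains an embedded newline (a real CSV pass could not assume the latter).

```python
from itertools import accumulate

def first_line_numbers(blocks):
    """Return the 1-based line number at which each block starts.

    Assumption: every block ends on a row boundary and no quoted
    field contains a newline, so counting b"\n" counts rows.
    """
    counts = [block.count(b"\n") for block in blocks]
    # Running total of the lines that precede each block.
    offsets = list(accumulate(counts, initial=0))[:-1]
    return [1 + offset for offset in offsets]

blocks = [b"a,b\n1,2\n", b"3,4\n5\n6,7\n", b"8,9\n"]
starts = first_line_numbers(blocks)  # [1, 3, 6]
```

The cost is that every byte is scanned once on a single thread before any parallel parsing can begin, which is the overhead concern raised above.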
   
   You could delay reporting an error on block X until blocks 1 to (X-1) have finished parsing (so you can know what the line number is).  That would probably be the solution I would take if I needed to do this.  However, I don't know off-hand how to do that delay in a low-complexity way.
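
A rough Python sketch of that delay, to show the bookkeeping rather than how it would fit into the Arrow C++ reader: each parser task reports its line count and any block-relative errors when it finishes, and errors are only given absolute line numbers once every earlier block has reported. `LineNumberResolver` and `block_finished` are hypothetical names, and blocks are numbered from 1 as in the example above.

```python
import threading

class LineNumberResolver:
    """Hypothetical helper: holds block-relative errors until the
    line counts of all earlier blocks are known."""

    def __init__(self, first_line=1):
        self._lock = threading.Lock()
        self._next_block = 1            # lowest block not yet resolved
        self._next_start = first_line   # absolute line where that block begins
        self._finished = {}             # block -> (line_count, [(relative_line, message)])
        self.errors = []                # resolved (absolute_line, message) pairs

    def block_finished(self, block, line_count, errors=()):
        """Called by a parser task, in any order, once `block` is parsed."""
        with self._lock:
            self._finished[block] = (line_count, list(errors))
            # Resolve every block whose predecessors have all finished.
            while self._next_block in self._finished:
                count, pending = self._finished.pop(self._next_block)
                for relative_line, message in pending:
                    absolute = self._next_start + relative_line - 1
                    self.errors.append((absolute, message))
                self._next_start += count
                self._next_block += 1

resolver = LineNumberResolver()
# Block 2 finishes first with an error on its third line; nothing is reported yet.
resolver.block_finished(2, 3, [(3, "incorrect column count")])
pending_before = list(resolver.errors)
# Block 1 (two lines) finishes, so the error resolves to absolute line 5.
resolver.block_finished(1, 2)
```

The sketch leaves open how a fatal error in block X should stop the blocks still in flight, which is part of the complexity mentioned above.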

