Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/01/03 18:35:17 UTC

[GitHub] [arrow] Dandandan edited a comment on pull request #9084: ARROW-11119: [Rust] Expose functions to parse a single CSV column / StringRecord into an array / recordBatch

Dandandan edited a comment on pull request #9084:
URL: https://github.com/apache/arrow/pull/9084#issuecomment-753658416


   This looks cool @jorgecarleitao !
   
   Some thoughts for the future of CSV and other parsers:
   
   * It might be worth exploring whether we can use the `cast` (or a similar) kernel of Arrow to parse the data. The benefit would be that we can load the data (as bytes / strings) into arrays and reuse the existing parsing logic in Arrow. I think this is interesting because, from that point, the code can be vectorized / use SIMD, parallelized, etc. more easily; it reduces code duplication; and it creates more incentive to improve the `cast` kernels, which benefits more than "only" one parser.
   * For further optimization it might be worth moving away from `StringRecord`s at some point (and using `csv_core` directly), as they carry quite some overhead compared to "just" loading the bytes from the file. How would this fit into your suggestion?
   * For a user like DataFusion, parallelism within a single file often does not make sense (when there are many files to process in parallel), so I think the parser should not become slower or consume more resources in the single-threaded case. This is a more general point we should keep in mind.
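   To make the first bullet concrete, here is a minimal, self-contained sketch of the "load first, cast later" idea. Real code would build a `StringArray` and call Arrow's `cast` compute kernel; this sketch mimics the two phases with plain `Vec`s (the helper names `load_column` and `cast_to_i64` are hypothetical, for illustration only):

   ```rust
   // Phase 1: load a raw CSV field as strings, with no parsing yet
   // (in Arrow this would be a StringArray).
   fn load_column(raw_rows: &[&str], col: usize) -> Vec<String> {
       raw_rows
           .iter()
           .map(|row| row.split(',').nth(col).unwrap_or("").to_string())
           .collect()
   }

   // Phase 2: a "cast" pass over the whole column at once; invalid
   // values become nulls, like Arrow's cast kernel in safe mode.
   fn cast_to_i64(column: &[String]) -> Vec<Option<i64>> {
       column.iter().map(|s| s.trim().parse::<i64>().ok()).collect()
   }

   fn main() {
       let rows = ["1,foo", "2,bar", "x,baz"];
       let ints = cast_to_i64(&load_column(&rows, 0));
       assert_eq!(ints, vec![Some(1), Some(2), None]);
       println!("{:?}", ints);
   }
   ```

   The point of the two-phase split is that the whole parsing pass runs over one contiguous column, which is exactly the shape that vectorization and parallelization work well on.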
   
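   And for the second bullet, a sketch of what avoiding `StringRecord` buys: instead of allocating an owned record per row, the parser can borrow field slices straight out of the byte buffer, which is the level `csv_core` operates at. The `fields_of_line` helper is hypothetical and ignores quoting/escaping, which a real parser must handle:

   ```rust
   // Borrow field slices into the buffer instead of allocating
   // per-record Strings (the overhead StringRecord carries).
   fn fields_of_line(line: &[u8]) -> Vec<&[u8]> {
       line.split(|&b| b == b',').collect()
   }

   fn main() {
       let buf: &[u8] = b"10,north\n20,south\n";
       let mut first_col_sum = 0i64;
       for line in buf.split(|&b| b == b'\n').filter(|l| !l.is_empty()) {
           let fields = fields_of_line(line);
           // Parse directly from the borrowed bytes; no intermediate
           // owned record is ever built.
           let s = std::str::from_utf8(fields[0]).unwrap();
           first_col_sum += s.parse::<i64>().unwrap();
       }
       assert_eq!(first_col_sum, 30);
       println!("sum = {}", first_col_sum);
   }
   ```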

