Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/01/04 16:45:55 UTC

[GitHub] [arrow] jorgecarleitao commented on pull request #9090: ARROW-11123: [Rust] Use cast kernel to simplify csv parser

jorgecarleitao commented on pull request #9090:
URL: https://github.com/apache/arrow/pull/9090#issuecomment-754084930


   I'm curious about the perf implications. Even for integers or dates, we would always need to verify that the bytes are valid UTF-8 in order to create valid `StringArray` value buffers. We could store them as a `BinaryArray` instead. My hypothesis is that most of the `to_str / to_datetime / to_int` conversions are not SIMDed and are thus unlikely to benefit from Arrow, but it could also be that the columnar format helps the compiler.
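
To make the two paths concrete, here is a minimal std-only sketch. Both function names are hypothetical and are not part of the `arrow` crate. The first path validates UTF-8 before parsing, which is what going through a `StringArray` would imply. The second parses ASCII digits straight from raw bytes, which is closer to what a `BinaryArray`-backed path could do. It says nothing about which is faster; that would need a benchmark.

```rust
// Hypothetical helpers for illustration only; not the arrow crate's API.

/// Path 1: validate the field as UTF-8 first, then parse the resulting `&str`.
/// This mirrors the extra check needed to build a valid `StringArray` buffer.
fn parse_i64_via_str(field: &[u8]) -> Option<i64> {
    let s = std::str::from_utf8(field).ok()?;
    s.parse::<i64>().ok()
}

/// Path 2: parse ASCII digits directly from the raw bytes, with no separate
/// UTF-8 validation pass. Non-digit bytes (including invalid UTF-8) are rejected.
fn parse_i64_from_bytes(field: &[u8]) -> Option<i64> {
    // Split off an optional leading sign.
    let (negative, digits) = match field {
        [b'-', rest @ ..] => (true, rest),
        [b'+', rest @ ..] => (false, rest),
        _ => (false, field),
    };
    if digits.is_empty() {
        return None;
    }
    let mut value: i64 = 0;
    for &b in digits {
        if !b.is_ascii_digit() {
            return None;
        }
        let d = (b - b'0') as i64;
        // Checked arithmetic so overflow yields None instead of wrapping.
        value = value.checked_mul(10)?;
        // Accumulate on the negative side for negative input so i64::MIN parses.
        value = if negative {
            value.checked_sub(d)?
        } else {
            value.checked_add(d)?
        };
    }
    Some(value)
}
```

Both helpers return `None` for a field such as `[0xff, 0x31]`, but only the first one pays for a full UTF-8 validation of every field before looking at the digits.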
   
   Note that people can always pass a `DataType::Utf8` in the schema instead of inferring the type, and then cast the columns themselves. I have always understood the readers (csv, json, parquet) as ways to bypass that approach and create Arrow arrays directly from the format. It happens that CSV is a particularly poor format for this, as everything is a (not necessarily UTF-8) string with very few invariants.
   

