Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/11/21 09:33:26 UTC

[GitHub] [arrow] jorgecarleitao commented on pull request #8710: ARROW-10649: [Rust] Parse manually in infer_field_schema, remove lazy static dependency

jorgecarleitao commented on pull request #8710:
URL: https://github.com/apache/arrow/pull/8710#issuecomment-731554302


   * Converting `CSV -> StringArray -> [Type]Array` is not recommended, as it forces us to load everything in memory, even when a more compact representation exists. Therefore, we really need a way to build arrays directly out of CSV columns.
   
   * CSV is parsed as rows, but Arrow is column-based. Therefore, the data will need to be pivoted at some point.
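   
   To illustrate the pivot mentioned above, here is a minimal sketch. It uses a hypothetical `Row` alias as a stand-in for `csv::StringRecord`, and the `pivot` helper is an assumption for illustration, not something from the PR. Missing fields are mapped to `None`:
   
   ```rust
   /// Hypothetical stand-in for `csv::StringRecord`: one parsed CSV row.
   type Row = Vec<String>;
   
   /// Pivot row-major CSV data into column-major form.
   /// A row shorter than `n_cols` yields `None` for its missing fields.
   fn pivot(rows: &[Row], n_cols: usize) -> Vec<Vec<Option<&str>>> {
       (0..n_cols)
           .map(|col_idx| {
               rows.iter()
                   .map(|row| row.get(col_idx).map(|s| s.as_str()))
                   .collect()
           })
           .collect()
   }
   ```
   
   Each inner `Vec` then holds the values of a single column and could be handed to a per-column converter.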
   
   My feeling is that there are wildly different specs out there for how a CSV column should be converted into an Array. IMO we should not try to solve all those use-cases ourselves and should instead offer users the freedom to choose, as well as common utilities.
   
   As such, one idea is to offer a plugin mechanism that allows users to parse a CSV column into a `[Type]Array`, and to provide a default implementation.
   
   Since these are stateless, one simple idea is to have the CSV reader accept a trait with two functions:
   
   ```rust
   infer: Fn(rows: &[StringRecord], col_idx: usize) -> DataType;
   convert: Fn(data_type: &DataType, rows: &[StringRecord], col_idx: usize) -> Result<ArrayRef>;
   // or something like this
   ```
   
   This signature indicates that:
   
   1. the function traverses rows
   2. the function is fallible
   3. the resulting array is dynamically typed
   
   This allows the user to, e.g., turn unparsable values into nulls, adopt specific notations for CSV files that are (for them) interoperable with Arrow, etc.
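   
   A minimal sketch of what such a plugin could look like, under loud assumptions: `StringRecord`, `DataType`, `Array`, `ArrayRef` and `Result` below are simplified stand-ins so the example compiles without the `csv` and `arrow` crates, and the names `CsvColumnParser` and `NullOnError` are hypothetical. It shows one possible default that turns unparsable values into nulls:
   
   ```rust
   use std::sync::Arc;
   
   // Simplified stand-ins (assumptions), not the real `csv`/`arrow` types.
   type StringRecord = Vec<String>;
   type Result<T> = std::result::Result<T, String>;
   
   #[derive(Debug, PartialEq)]
   enum DataType {
       Int64,
       Float64,
       Utf8,
   }
   
   #[derive(Debug, PartialEq)]
   enum Array {
       Int64(Vec<Option<i64>>),
       Float64(Vec<Option<f64>>),
       Utf8(Vec<Option<String>>),
   }
   type ArrayRef = Arc<Array>;
   
   /// Hypothetical plugin trait carrying the two functions from the comment.
   trait CsvColumnParser {
       fn infer(&self, rows: &[StringRecord], col_idx: usize) -> DataType;
       fn convert(
           &self,
           data_type: &DataType,
           rows: &[StringRecord],
           col_idx: usize,
       ) -> Result<ArrayRef>;
   }
   
   /// One possible default: unparsable or missing values become nulls.
   struct NullOnError;
   
   impl CsvColumnParser for NullOnError {
       fn infer(&self, rows: &[StringRecord], col_idx: usize) -> DataType {
           // Ignore missing and empty fields when guessing the type.
           let values: Vec<&str> = rows
               .iter()
               .filter_map(|r| r.get(col_idx))
               .map(|s| s.as_str())
               .filter(|s| !s.is_empty())
               .collect();
           if values.is_empty() {
               DataType::Utf8
           } else if values.iter().all(|v| v.parse::<i64>().is_ok()) {
               DataType::Int64
           } else if values.iter().all(|v| v.parse::<f64>().is_ok()) {
               DataType::Float64
           } else {
               DataType::Utf8
           }
       }
   
       fn convert(
           &self,
           data_type: &DataType,
           rows: &[StringRecord],
           col_idx: usize,
       ) -> Result<ArrayRef> {
           let cells = rows.iter().map(|r| r.get(col_idx).map(|s| s.as_str()));
           let array = match data_type {
               DataType::Int64 => {
                   Array::Int64(cells.map(|c| c.and_then(|s| s.parse::<i64>().ok())).collect())
               }
               DataType::Float64 => {
                   Array::Float64(cells.map(|c| c.and_then(|s| s.parse::<f64>().ok())).collect())
               }
               DataType::Utf8 => Array::Utf8(cells.map(|c| c.map(str::to_string)).collect()),
           };
           // This lenient parser never fails; a strict one could return `Err` instead.
           Ok(Arc::new(array))
       }
   }
   ```
   
   A stricter implementation of the same trait could return an error from `convert` on the first unparsable value, which is why the signature stays fallible.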
   

