You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/01/04 07:48:34 UTC

[GitHub] [arrow] Dandandan opened a new pull request #9090: ARROW-11123: [Rust] Use cast kernel to simplify csv parser

Dandandan opened a new pull request #9090:
URL: https://github.com/apache/arrow/pull/9090


   This came up in https://github.com/apache/arrow/pull/9084 when discussing the PR with @jorgecarleitao 
   
   One of the ideas here is that we could take advantage of the `cast` kernel in Arrow to parse the csv column, deduplicating code and making more use of Arrow. We can get rid of the custom parsers and parsing / building primitive arrays.
   
   On loading the CSV in the TCPH benchmark looks like this has a small negative effect on performance.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #9090: ARROW-11123: [Rust] Use cast kernel to simplify csv parser

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #9090:
URL: https://github.com/apache/arrow/pull/9090#issuecomment-754181158


   I also like the general idea of using the `cast` kernel for consistency. Nice idea @Dandandan 
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan edited a comment on pull request #9090: ARROW-11123: [Rust] Use cast kernel to simplify csv parser

Posted by GitBox <gi...@apache.org>.
Dandandan edited a comment on pull request #9090:
URL: https://github.com/apache/arrow/pull/9090#issuecomment-754102662


   @jorgecarleitao note that the `csv` `StringRecord` also verifies whether strings are utf8. It adds a bit of overhead, but the utf8 checking itself is not much for now, it is mostly the logic surrounding `StringRecord` that adds the most overhead.
   I think eventually we could use a `StringArray` or `BinaryArray` as buffer so we can remove the `StringRecords` which is internally a `Vec<u8>` (by using `ByteRecord`) and a `Vec<usize>` for the rows.
   
   The current performance penalty between master and this branch currently is ~10% as we introduce an extra intermediate step which I think could be more than compensated for by removing the `StringRecord` abstraction, and trying to write to a string or binary array without intermediate steps.
   
   This is the structure the csv crate is using per row:
   
   ```rust
   struct ByteRecordInner {
       /// The position of this byte record.
       pos: Option<Position>,
       /// All fields in this record, stored contiguously.
       fields: Vec<u8>,
       /// The number of and location of each field in this record.
       bounds: Bounds,
   }
   ```
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #9090: ARROW-11123: [Rust] Use cast kernel to simplify csv parser

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #9090:
URL: https://github.com/apache/arrow/pull/9090#issuecomment-753821198


   https://issues.apache.org/jira/browse/ARROW-11123


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan closed pull request #9090: ARROW-11123: [Rust] Use cast kernel to simplify csv parser

Posted by GitBox <gi...@apache.org>.
Dandandan closed pull request #9090:
URL: https://github.com/apache/arrow/pull/9090


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on pull request #9090: ARROW-11123: [Rust] Use cast kernel to simplify csv parser

Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #9090:
URL: https://github.com/apache/arrow/pull/9090#issuecomment-754074455


   I think this needs to be done to fully implement this PR:
   
   - [ ] Implement cast that returns an error on parsing errors instead of null
   - [ ] Implement `cast(utf8, bool)` to remove remaining parser function
   - [ ] Some (cast) / stringarray ( https://github.com/apache/arrow/pull/9016 ) perf. improvements to avoid performance hit
   
   I think an eventual performance benefit is that it makes it easier to get rid of the (slow) `StringRecord` eventually as we can just load bytes on the go, maybe we could even directly append to the array as a buffer without intermediate copies and extra allocations as we do currently.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #9090: ARROW-11123: [Rust] Use cast kernel to simplify csv parser

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #9090:
URL: https://github.com/apache/arrow/pull/9090#issuecomment-778610906


   @Dandandan 
   What is the status of this PR?
   
   As part of trying to clean up the backlog of Rust PRs in this repo, I am going through seemingly stale PRs and pinging the authors to see if there are any plans to continue the work or conversation.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan edited a comment on pull request #9090: ARROW-11123: [Rust] Use cast kernel to simplify csv parser

Posted by GitBox <gi...@apache.org>.
Dandandan edited a comment on pull request #9090:
URL: https://github.com/apache/arrow/pull/9090#issuecomment-754102662


   @jorgecarleitao note that the `csv` `StringRecord` also verifies whether strings are utf8. It adds a bit of overhead, but the utf8 checking itself is not much for now, it is mostly the logic surrounding `StringRecord` that adds the most overhead.
   I think eventually we could use a `StringArray` or `BinaryArray` as buffer so we can remove the `StringRecords` which is internally a `Vec<u8>` (by using `ByteRecord`) and a `Vec<usize>` for the rows.
   
   The current performance penalty between master and this branch currently is ~10% as we introduce an extra intermediate step which I think could be more than compensated for by removing the `StringRecord` abstraction, and trying to write to a string or binary array without intermediate steps.
   
   ```rust
   struct ByteRecordInner {
       /// The position of this byte record.
       pos: Option<Position>,
       /// All fields in this record, stored contiguously.
       fields: Vec<u8>,
       /// The number of and location of each field in this record.
       bounds: Bounds,
   }
   ```
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan edited a comment on pull request #9090: ARROW-11123: [Rust] Use cast kernel to simplify csv parser

Posted by GitBox <gi...@apache.org>.
Dandandan edited a comment on pull request #9090:
URL: https://github.com/apache/arrow/pull/9090#issuecomment-754102662


   @jorgecarleitao note that the `csv` `StringRecord` also verifies whether strings are utf8. It adds a bit of overhead, but the utf8 checking itself is not much for now, it is mostly the logic surrounding `StringRecord` that adds some overhead.
   I think eventually we could use a `StringArray` or `BinaryArray` as buffer so we can remove the `StringRecords` which is internally a `Vec<u8>` (by using `ByteRecord`) and a `Vec<usize>` for the rows.
   
   The current performance penalty between master and this branch currently is ~10% as we introduce an extra intermediate step which I think could be more than compensated for by removing the `StringRecord` abstraction, and trying to write to a string or binary array without intermediate steps.
   
   ```rust
   struct ByteRecordInner {
       /// The position of this byte record.
       pos: Option<Position>,
       /// All fields in this record, stored contiguously.
       fields: Vec<u8>,
       /// The number of and location of each field in this record.
       bounds: Bounds,
   }
   ```
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on pull request #9090: ARROW-11123: [Rust] Use cast kernel to simplify csv parser

Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #9090:
URL: https://github.com/apache/arrow/pull/9090#issuecomment-778611338


   Closing for now, will reopen when the other parts are done @alamb 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] andygrove commented on pull request #9090: ARROW-11123: [Rust] Use cast kernel to simplify csv parser

Posted by GitBox <gi...@apache.org>.
andygrove commented on pull request #9090:
URL: https://github.com/apache/arrow/pull/9090#issuecomment-754025100


   I like the idea of using `CAST` here for consistency.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on pull request #9090: ARROW-11123: [Rust] Use cast kernel to simplify csv parser

Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #9090:
URL: https://github.com/apache/arrow/pull/9090#issuecomment-754208396


   > * Implement cast that returns an error on parsing errors instead of null
   
   See https://issues.apache.org/jira/browse/ARROW-7364


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on pull request #9090: ARROW-11123: [Rust] Use cast kernel to simplify csv parser

Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #9090:
URL: https://github.com/apache/arrow/pull/9090#issuecomment-754102662


   @jorgecarleitao note that the `csv` `StringRecord` also verifies whether strings are utf8. It adds a bit of overhead, but the utf8 checking itself is not much for now, it is mostly the logic surrounding `StringRecord` that adds some overhead.
   I think eventually we could use a `StringArray` or `BinaryArray` as buffer so we can remove the `StringRecords` which is internally a `Vec<u8>` (by using `ByteRecord`) and a `Vec<usize>` for the rows.
   
   The current performance penalty between master and this branch currently is ~10% as we introduce an extra intermediate step which I think could be more than compensated for by removing the `StringRecord` abstraction, and trying to write to a string or byte array without intermediate steps.
   
   ```rust
   struct ByteRecordInner {
       /// The position of this byte record.
       pos: Option<Position>,
       /// All fields in this record, stored contiguously.
       fields: Vec<u8>,
       /// The number of and location of each field in this record.
       bounds: Bounds,
   }
   ```
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #9090: ARROW-11123: [Rust] Use cast kernel to simplify csv parser

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #9090:
URL: https://github.com/apache/arrow/pull/9090#issuecomment-754084930


   I'm curious about the perf implications. Even for integers or dates, we will always need to verify that they are utf8 compliant to create valid `StringArray` value buffers. We could store then as `BinaryArray` instead. My hypothesis is that most of the `to_str / to_datetime / to_int` are not SIMDed and thus unlikely to benefit from Arrow, but it could also be that the columnar format helps the compiler.
   
   Note that people can always pass a `DataType::Utf8` to the schema instead of inferring it and cast the types themselves. I always understood the readers (csv, json, parquet) as ways to bypass that approach and create Arrow arrays directly from the format. It happens that CSV is a particularly poor format for this, as everything is a (not-necessarily utf8) string with very little invariants.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #9090: ARROW-11123: [Rust] Use cast kernel to simplify csv parser

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #9090:
URL: https://github.com/apache/arrow/pull/9090#issuecomment-754125519


   Thanks for the explanation. I see what you mean. I agree that it is worth the effort to try it out 👍


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org