Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/01/04 17:57:24 UTC

[GitHub] [arrow] Dandandan edited a comment on pull request #9090: ARROW-11123: [Rust] Use cast kernel to simplify csv parser

Dandandan edited a comment on pull request #9090:
URL: https://github.com/apache/arrow/pull/9090#issuecomment-754102662

@jorgecarleitao note that the `csv` crate's `StringRecord` also verifies that strings are valid UTF-8. That adds a bit of overhead, but the UTF-8 check itself is not much for now; most of the overhead comes from the logic surrounding `StringRecord`.
I think eventually we could use a `StringArray` or `BinaryArray` as the buffer, so we can remove the `StringRecord`s, each of which is internally a `Vec<u8>` (via `ByteRecord`) plus a `Vec<usize>` for the field boundaries.

The performance penalty of this branch relative to master is currently ~10%, because we introduce an extra intermediate step. I think that could be more than compensated for by removing the `StringRecord` abstraction and writing to a string or binary array directly, without intermediate steps.

```rust
struct ByteRecordInner {
    /// The position of this byte record.
    pos: Option<Position>,
    /// All fields in this record, stored contiguously.
    fields: Vec<u8>,
    /// The number of and location of each field in this record.
    bounds: Bounds,
}
```
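
To make the layout concrete, here is a minimal sketch of the "contiguous bytes plus offsets" idea using only the standard library. The names `FlatFields`, `push_field`, `get` and `get_str` are hypothetical and are not part of `csv` or Arrow; the sketch only assumes that fields are appended to one flat byte buffer and located through a list of end offsets, roughly the shape a string or binary array would need.

```rust
use std::str::Utf8Error;

/// Hypothetical flat buffer: all field bytes stored back to back,
/// plus offsets that mark where each field ends.
#[derive(Debug)]
struct FlatFields {
    /// Bytes of every field, stored contiguously.
    values: Vec<u8>,
    /// `offsets[i]..offsets[i + 1]` is the byte range of field `i`.
    /// Always starts with a single 0.
    offsets: Vec<usize>,
}

impl FlatFields {
    fn new() -> Self {
        FlatFields { values: Vec::new(), offsets: vec![0] }
    }

    /// Append one field without allocating a per-field container.
    fn push_field(&mut self, field: &[u8]) {
        self.values.extend_from_slice(field);
        self.offsets.push(self.values.len());
    }

    /// Number of fields stored so far.
    fn len(&self) -> usize {
        self.offsets.len() - 1
    }

    /// Raw bytes of field `i`, or `None` if `i` is out of range.
    fn get(&self, i: usize) -> Option<&[u8]> {
        let start = *self.offsets.get(i)?;
        let end = *self.offsets.get(i.checked_add(1)?)?;
        Some(&self.values[start..end])
    }

    /// Field `i` as a string slice; UTF-8 is checked per field, on access.
    fn get_str(&self, i: usize) -> Option<Result<&str, Utf8Error>> {
        self.get(i).map(std::str::from_utf8)
    }
}
```

In this sketch the UTF-8 check is deferred to `get_str`, so a caller that only needs bytes (for example, to parse numbers) does not pay for it. Whether that trade-off holds for the actual reader is something that would need to be benchmarked.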

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org