Posted to jira@arrow.apache.org by "Yordan Pavlov (Jira)" <ji...@apache.org> on 2021/03/07 11:02:00 UTC

[jira] [Comment Edited] (ARROW-11897) [Rust][Parquet] Use iterators to increase performance of creating Arrow arrays

    [ https://issues.apache.org/jira/browse/ARROW-11897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17296838#comment-17296838 ] 

Yordan Pavlov edited comment on ARROW-11897 at 3/7/21, 11:01 AM:
-----------------------------------------------------------------

[~Dandandan] regarding the count values, yes, you are right - in the case of string arrays, the generated count values will always equal 1. But count values may still be useful in cases where a primitive array is split across multiple non-contiguous slices, e.g. due to page boundaries. It may be possible to derive the count values from the data type (I have to think more about that), but at the moment I still like how they make the expected value count explicit. This could change during implementation, though.


was (Author: yordan-pavlov):
[~Dandandan] regarding the count values, yes you are right - in the case of string arrays, the generated count values will always equal 1. But the count values may still be useful in cases where a primitive array is split across multiple non-contiguous slices, e.g. due to page boundaries. It may be possible to calculate the count values based on the data type (I have to think more about that), but at the moment I still like how they make the expected value count explicit.

> [Rust][Parquet] Use iterators to increase performance of creating Arrow arrays
> ------------------------------------------------------------------------------
>
>                 Key: ARROW-11897
>                 URL: https://issues.apache.org/jira/browse/ARROW-11897
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Rust
>            Reporter: Yordan Pavlov
>            Priority: Major
>
> The overall goal is to create an efficient pipeline from Parquet page data into Arrow arrays, with as little intermediate conversion and memory allocation as possible. It is assumed, that for best performance, we favor doing fewer but larger copy operations (rather than more but smaller). 
> Such a pipeline would need to be flexible in order to enable high performance implementations in several different cases:
>  (1) In some cases, such as a plain-encoded number array, it might even be possible to copy / create the array from a single contiguous section of a page buffer.
>  (2) In other cases, such as a plain-encoded string array, values are encoded as non-contiguous slices (where value bytes are separated by length bytes) in a page buffer containing multiple values; individual values will have to be copied separately, and it's not obvious how this can be avoided.
>  (3) Finally, in the case of bit-packed encoding and smaller numeric values, page buffer data has to be decoded / expanded before it is ready to copy into an Arrow array, so a `Vec<u8>` will have to be returned instead of a slice pointing into a page buffer.
> I propose that the implementation is split into three layers - (1) decoder, (2) column reader and (3) array converter layers (not too dissimilar from the current implementation, except it would be based on Iterators), as follows:
> *(1) Decoder layer:*
> A decoder output abstraction that enables all of the above cases and minimizes intermediate memory allocation is `Iterator<Item = (count, AsRef<[u8]>)>`.
>  Then in case (1) above, where a numeric array could be created from a single contiguous byte slice, such an iterator could return a single item such as `(1024, &[u8])`. 
>  In case (2) above, where each string value is encoded as an individual byte slice, but it is still possible to copy directly from a page buffer, a decoder iterator could return a sequence of items such as `(1, &[u8])`. 
>  And finally, in case (3) above, where bit-packed values have to be unpacked / expanded and it's NOT possible to copy value bytes directly from a page buffer, a decoder iterator could return items representing chunks of values, such as `(32, Vec<u8>)`, where bit-packed values have been unpacked and the chunk size is configured for best performance.
> Another benefit of an `Iterator`-based abstraction is that it would prepare the parquet crate for migration to `async` `Stream`s (my understanding is that a `Stream` is effectively an async `Iterator`).
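> The decoder abstraction above could be sketched roughly as follows. This is illustrative only - the `Bytes` enum and `total_values` function are hypothetical names, not the actual parquet crate API - but it shows how one item type can cover both the borrowed (zero-copy) and owned (unpacked) cases:

```rust
// Hypothetical sketch of the proposed decoder output abstraction.
// One item type covers both zero-copy slices into a page buffer
// (cases 1 and 2) and owned, unpacked chunks (case 3).
enum Bytes<'a> {
    Borrowed(&'a [u8]), // slice pointing directly into a page buffer
    Owned(Vec<u8>),     // values that had to be unpacked / expanded
}

impl<'a> AsRef<[u8]> for Bytes<'a> {
    fn as_ref(&self) -> &[u8] {
        match self {
            Bytes::Borrowed(b) => b,
            Bytes::Owned(v) => v.as_slice(),
        }
    }
}

// A generic consumer only needs `Iterator<Item = (count, AsRef<[u8]>)>`;
// the explicit count matters because byte length alone cannot always
// reveal how many logical values an item holds.
fn total_values<B: AsRef<[u8]>>(iter: impl Iterator<Item = (usize, B)>) -> usize {
    iter.map(|(count, _)| count).sum()
}

fn main() {
    // Case (1): 1024 plain-encoded i32 values in one contiguous slice.
    let page = vec![0u8; 1024 * 4];
    let contiguous = std::iter::once((1024usize, Bytes::Borrowed(&page)));
    assert_eq!(total_values(contiguous), 1024);

    // Case (3): bit-packed values unpacked into owned chunks of 32.
    let chunks = (0..4).map(|_| (32usize, Bytes::Owned(vec![0u8; 32 * 4])));
    assert_eq!(total_values(chunks), 128);
}
```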
> *(2) Column reader layer:*
> Then a higher-level iterator could combine a value iterator with a (def) level iterator to produce a sequence of `ValueSequence(count, AsRef<[u8]>)` and `NullSequence(count)` items, from which an Arrow array can be created efficiently.
> In the future, a higher-level iterator (for the keys) could be combined with a dictionary value iterator to create a dictionary array.
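> A rough sketch of how the reader layer might collapse def levels and value bytes into such sequence items (hypothetical names, with simplifying assumptions: fixed-width values and a def level of 1 meaning "present"):

```rust
// Hypothetical sketch of the column reader layer, not the actual
// parquet crate API. Runs of equal def levels become one item each,
// so downstream code can copy value bytes in large chunks.
#[derive(Debug, PartialEq)]
enum SequenceItem {
    ValueSequence(usize, Vec<u8>), // value count + their raw bytes
    NullSequence(usize),           // run of nulls, no bytes needed
}

fn combine(def_levels: &[i16], values: &[u8], value_width: usize) -> Vec<SequenceItem> {
    let mut out = Vec::new();
    let mut offset = 0; // byte offset into `values`
    let mut i = 0;
    while i < def_levels.len() {
        let level = def_levels[i];
        let start = i;
        // Extend the run while the def level stays the same.
        while i < def_levels.len() && def_levels[i] == level {
            i += 1;
        }
        let run = i - start;
        if level == 1 {
            let bytes = values[offset..offset + run * value_width].to_vec();
            offset += run * value_width;
            out.push(SequenceItem::ValueSequence(run, bytes));
        } else {
            out.push(SequenceItem::NullSequence(run));
        }
    }
    out
}

fn main() {
    // Two present values, one null, one present value (1-byte values).
    let items = combine(&[1, 1, 0, 1], &[1, 2, 3], 1);
    assert_eq!(
        items,
        vec![
            SequenceItem::ValueSequence(2, vec![1, 2]),
            SequenceItem::NullSequence(1),
            SequenceItem::ValueSequence(1, vec![3]),
        ]
    );
}
```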
> *(3) Array converter layer:*
> Finally, Arrow arrays would be created from a (generic) higher-level iterator, using a layer of array converters that know what the value bytes and nulls mean for each type of array.
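> For a fixed-width primitive array, such a converter might look roughly like this (a hypothetical sketch; a real implementation would build Arrow buffers and a packed validity bitmap rather than a `Vec<bool>`):

```rust
// Hypothetical converter sketch, not the actual parquet/arrow API.
// It accumulates raw value bytes plus per-value validity from the
// ValueSequence / NullSequence items produced by the reader layer.
struct PrimitiveConverter {
    value_width: usize,
    values: Vec<u8>,     // raw value bytes, including null placeholders
    validity: Vec<bool>, // one flag per logical slot
}

impl PrimitiveConverter {
    fn new(value_width: usize) -> Self {
        Self { value_width, values: Vec::new(), validity: Vec::new() }
    }

    // A `ValueSequence(count, bytes)` item: one large copy operation.
    fn push_values(&mut self, count: usize, bytes: &[u8]) {
        assert_eq!(bytes.len(), count * self.value_width);
        self.values.extend_from_slice(bytes);
        self.validity.extend(std::iter::repeat(true).take(count));
    }

    // A `NullSequence(count)` item: placeholder bytes keep slots aligned.
    fn push_nulls(&mut self, count: usize) {
        self.values.extend(std::iter::repeat(0u8).take(count * self.value_width));
        self.validity.extend(std::iter::repeat(false).take(count));
    }

    fn len(&self) -> usize {
        self.validity.len()
    }
}

fn main() {
    let mut conv = PrimitiveConverter::new(4); // e.g. i32 values
    conv.push_values(2, &[1, 0, 0, 0, 2, 0, 0, 0]);
    conv.push_nulls(1);
    assert_eq!(conv.len(), 3);
    assert_eq!(conv.values.len(), 12);
    assert_eq!(conv.validity, vec![true, true, false]);
}
```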
>  
> [~nevime], [~Dandandan], [~jorgecarleitao] let me know what you think
> Next steps:
>  * split work into smaller tasks that could be done over time



--
This message was sent by Atlassian Jira
(v8.3.4#803005)