Posted to issues@arrow.apache.org by "Ji Liu (JIRA)" <ji...@apache.org> on 2019/07/03 07:42:00 UTC

[jira] [Comment Edited] (ARROW-5821) [Java] Support compact fixed-width vectors

    [ https://issues.apache.org/jira/browse/ARROW-5821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16877410#comment-16877410 ] 

Ji Liu edited comment on ARROW-5821 at 7/3/19 7:41 AM:
-------------------------------------------------------

Thanks a lot for your feedback. [~jnadeau] [~wesmckinn]

More precisely, I suggest providing a utility class in the arrow-algorithm module, so that the IPC format is no longer affected. Given a fixed-width vector with many null values (e.g. valueCount=1000, nullCount=990), the utility would move the non-null values to the front so that valueCount=10, and create a BitVector to record the null value indices. Conversely, given a compacted vector and its BitVector, it could recover the original data layout (e.g. valueCount=1000, nullCount=990).
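
To illustrate the compact and recover steps described above, here is a minimal sketch in plain Java. It deliberately does not use Arrow classes: an Integer[] stands in for a nullable fixed-width vector, and java.util.BitSet stands in for the BitVector. The names NullCompactionSketch, Compacted, compact and expand are hypothetical and are not part of the proposal or of any Arrow API.

```java
import java.util.Arrays;
import java.util.BitSet;

/** Plain-Java stand-in for the proposed utility; hypothetical names, not Arrow API. */
final class NullCompactionSketch {

    /** Result of compaction: the dense non-null values plus a validity bitmap. */
    static final class Compacted {
        final int[] values;       // only the non-null values, in original order
        final BitSet validity;    // bit i set => original slot i was non-null
        final int originalCount;  // valueCount of the original vector

        Compacted(int[] values, BitSet validity, int originalCount) {
            this.values = values;
            this.validity = validity;
            this.originalCount = originalCount;
        }
    }

    /** Moves non-null values to the front and records which slots they came from. */
    static Compacted compact(Integer[] input) {
        BitSet validity = new BitSet(input.length);
        int[] dense = new int[input.length];
        int n = 0;
        for (int i = 0; i < input.length; i++) {
            if (input[i] != null) {
                validity.set(i);
                dense[n++] = input[i];
            }
        }
        return new Compacted(Arrays.copyOf(dense, n), validity, input.length);
    }

    /** Restores the original layout, with nulls back in their original slots. */
    static Integer[] expand(Compacted c) {
        Integer[] out = new Integer[c.originalCount];
        int next = 0;
        // Walk the set bits in ascending order; each one receives the next dense value.
        for (int i = c.validity.nextSetBit(0); i >= 0; i = c.validity.nextSetBit(i + 1)) {
            out[i] = c.values[next++];
        }
        return out;
    }

    public static void main(String[] args) {
        // Mirrors the example in the comment: valueCount=1000, nullCount=990.
        Integer[] original = new Integer[1000];
        for (int i = 0; i < 10; i++) {
            original[i * 100] = i;
        }
        Compacted c = compact(original);
        System.out.println("compacted valueCount = " + c.values.length);
        System.out.println("round trip equal = " + Arrays.equals(original, expand(c)));
    }
}
```

The bitmap here tracks the non-null slots; tracking the null slots instead would be equivalent, since either bitmap is the complement of the other.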

In some cases, applying this kind of utility before and after a shuffle would greatly reduce the data size. Moreover, control stays in the hands of users, and we do not need to worry about the IPC format since it is no longer changed.
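
As a rough illustration of the size reduction, the sketch below estimates buffer sizes for the valueCount=1000, nullCount=990 example. It assumes a 4-byte value type and counts only a data buffer plus a one-bit-per-slot bitmap; buffer padding, message metadata, and the compacted vector's own validity buffer are ignored. The class and method names are hypothetical.

```java
/** Back-of-the-envelope size estimate; hypothetical helper, not Arrow API. */
final class CompactionSizeEstimate {

    /** Bytes needed for a one-bit-per-slot bitmap over the given number of slots. */
    static long bitmapBytes(long slots) {
        return (slots + 7) / 8;
    }

    /** Data buffer plus validity bitmap for the uncompacted vector. */
    static long plainBytes(long valueCount, int typeWidth) {
        return valueCount * typeWidth + bitmapBytes(valueCount);
    }

    /** Dense non-null values plus a bitmap over the original slots. */
    static long compactedBytes(long valueCount, long nullCount, int typeWidth) {
        return (valueCount - nullCount) * typeWidth + bitmapBytes(valueCount);
    }

    public static void main(String[] args) {
        System.out.println("plain     = " + plainBytes(1000, 4));
        System.out.println("compacted = " + compactedBytes(1000, 990, 4));
    }
}
```

Under these assumptions the uncompacted vector takes 4000 + 125 = 4125 bytes and the compacted form takes 40 + 125 = 165 bytes. The saving shrinks as the share of non-null values grows.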

Thanks!


was (Author: tianchen92):
Thanks a lot for your feedback. [~jnadeau] [~wesmckinn]

More precisely, I suggest providing a utility class in the arrow-algorithm module, so that the IPC format is no longer affected. Given a fixed-width vector with many null values (e.g. valueCount=1000, nullCount=990), the utility would create a new fixed-width vector with valueCount=10 and a BitVector to record the null value indices. Conversely, given a compacted vector and its BitVector, it could recover the original data layout (e.g. valueCount=1000, nullCount=990).

In some cases, applying this kind of utility before and after a shuffle would greatly reduce the data size. Moreover, control stays in the hands of users, and we do not need to worry about the IPC format since it is no longer changed.

Thanks!

> [Java] Support compact fixed-width vectors
> ------------------------------------------
>
>                 Key: ARROW-5821
>                 URL: https://issues.apache.org/jira/browse/ARROW-5821
>             Project: Apache Arrow
>          Issue Type: New Feature
>            Reporter: Ji Liu
>            Assignee: Ji Liu
>            Priority: Minor
>
> In the shuffle stage of some applications, FixedWidthVectors may contain very little non-null data.
> In this case, serializing the vectors directly is not a good choice. Generally, we can compact the vector so that it holds only non-null values, and create a BitVector to record the indices of the non-null values so that it can be deserialized properly.


