Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/04/26 15:34:20 UTC

[GitHub] [arrow] emkornfield commented on pull request #4815: [DISCUSS] Add strawman proposal for sparseness and data integrity

emkornfield commented on pull request #4815:
URL: https://github.com/apache/arrow/pull/4815#issuecomment-826935381

@alamb thanks for the comments.

> You would encode that, like a dictionary array with one buffer of values ([1,3]) and another buffer of run lengths ([2, 4]).

> This proposal seems to add another dimension (as a new type of SparseRecordBatch at a lower level. )

Hi Andrew, thanks. The encoding here is slightly different than encoding the lengths: it encodes [cumulative run lengths](https://github.com/apache/arrow/pull/4815/files#r308298514). The main reason for proposing cumulative values is that they still allow sublinear (for example, via binary search), though not O(1), access to elements in a batch. So instead of the run lengths `[2, 4]` you would have `[2, 6]`.
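
To make the access pattern concrete, here is a minimal Python sketch of looking up an element by binary search over cumulative run lengths. It is only an illustration and not code from the proposal: the names `to_cumulative`, `rle_get`, `values`, and `run_ends` are hypothetical, and plain Python lists stand in for Arrow buffers.

```python
from bisect import bisect_right
from itertools import accumulate


def to_cumulative(run_lengths):
    """Convert plain run lengths, e.g. [2, 4], to cumulative ones, e.g. [2, 6]."""
    return list(accumulate(run_lengths))


def rle_get(values, run_ends, i):
    """Return logical element i of a run-length encoded array.

    run_ends holds cumulative run lengths, so the run containing i is the
    first one whose cumulative length is strictly greater than i. Binary
    search finds it in O(log r) time for r runs.
    """
    length = run_ends[-1] if run_ends else 0
    if not 0 <= i < length:
        raise IndexError(i)
    return values[bisect_right(run_ends, i)]


# Hypothetical example data matching the discussion above.
values = [1, 3]
run_ends = to_cumulative([2, 4])  # [2, 6]

decoded = [rle_get(values, run_ends, i) for i in range(run_ends[-1])]
print(decoded)  # [1, 1, 3, 3, 3, 3]
```

With plain run lengths, the same lookup would need a linear scan that sums lengths until it passes the index.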

The new message type SparseRecordBatch is motivated by two related concerns:
1. Previously, the overhead that extra metadata adds to RecordBatch was found to be too high for some scenarios. Adding a new RLE type probably wouldn't be too bad in terms of metadata cost, but I think having a more general, extensible framework would serve us better in the long run.
2. Adding a new message type makes it less likely to break existing IPC code paths and can be bolted on (there are pros and cons to this).

So if the sole goal is supporting RLE I could see potentially modifying existing RecordBatches.
