You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@arrow.apache.org by Tobias Zagorni <to...@zagorni.eu.INVALID> on 2022/08/25 17:34:13 UTC

PRs for RLE support

Hello Everyone,

Recently, I have implemented support for run-length encoding in Arrow
C++. So far my implementation is split into different subtasks of
ARROW-16771 (https://issues.apache.org/jira/browse/ARROW-16771).

I have (draft) PRs available for:
- general handling of RLE in arrow C++, Type, Arrow, Builder
subclasses, etc.
  (subtasks 1-9)
- encode, decode kernels (fixed size only):
  (https://issues.apache.org/jira/browse/ARROW-16772)
- filter kernel (fixed size only):
  (https://issues.apache.org/jira/browse/ARROW-16774)
- simple benchmark for the RLE kernels
  (https://issues.apache.org/jira/browse/ARROW-17026)
- adding RLE to Arrow Columnar format document
  (https://issues.apache.org/jira/browse/ARROW-16773)

What is not yet implemented:
- converting RLE to formats like Parquet, JSON, IPC.
- caching of physical offsets when working with sliced arrays - finding
these offsets is an  O(log(n)) binary search which could be avoided in
a lot of cases 

I'm interested in any feedback on the code and I'm wondering what would
be the best way to get this merged.

A lot of the PRs depend on earlier ones. I ordered the subtasks in a
way they could be merged. The first would be:
1. Handling of array-only types using VisitTypeInline:
   https://issues.apache.org/jira/browse/ARROW-17258
2. Adding RLE type / array class (only builds on #1):
   https://issues.apache.org/jira/browse/ARROW-17261
-  also, since it has no dependencies: adding RLE to Arrow Columnar
format document
   https://issues.apache.org/jira/browse/ARROW-16773

Best,
Tobias

Re: PRs for RLE support

Posted by Dewey Dunnington <de...@voltrondata.com.INVALID>.

Thanks for the clarification! The (probably very common) slicing case makes
a lot of sense.

On Wed, Sep 14, 2022 at 3:19 PM Weston Pace <we...@gmail.com> wrote:

> I will clarify the offset problem.  It essentially boils down to "if
> you don't have constant access to elements then an array length offset
> does not give you constant access to buffer offsets".
>
> We start with an RLE<int64> array of length 200.  We slice it with
> (start=10, length=100) to get an RLE<int64> array of length 100 and an
> offset of 10.
>
> Now we want to write an IPC file (or access the values for whatever
> reason).  The values buffer has 400 bytes and the run ends buffer has
> 200 bytes (these numbers could be anything less than 1600/800 so I'm
> picking these at random).  We need to copy a portion of the "run ends"
> buffer into the file.  What bytes are these?  The only way to tell
> would be to do a binary search on the 200 bytes run ends buffer.
>
> On the other hand, if there were two child arrays then an
> implementation, when slicing, could choose to always keep the offset
> of the parent array at 0 and instead put the offsets in the child
> arrays.  Now you have a parent array with offset 0, a run ends (int32)
> array with offset 74 and length 5 and a values (int64) array with
> offset 74 and length 5.  We can clearly say that we want to grab bytes
> 296-316 from the run ends buffer and bytes 592-632 from the values
> buffer.  Of course, other implementations would always be free to use
> offsets in the parent array.  So I think the log(N) approach would
> still be needed as a fallback.
>
> Other options:
>
>  * Just eat the log(n) cost, it's not that expensive and any
> application doing excessive slicing could cache the offsets
> themselves.
>  * Add an optional buffer offset to the spec that can be populated in
> cases where random array access is not possible.
>
> On Wed, Sep 14, 2022 at 10:53 AM Dewey Dunnington
> <de...@voltrondata.com.invalid> wrote:
> >
> > >  * Should we encode "run lengths" or "run ends"?
> >
> > In addition to the points mentioned above, this seems the most consistent
> > with the variable-length binary/list layouts
> >
> > > encoding the run ends as a buffer (similar to list array for example)
> > makes it difficult to calculate offsets
> >
> > I don't have a strong opinion about this, but I also don't understand the
> > logic. Surely the implementation is just generating/reading a buffer of
> > integers and there's some overhead/indirection to wrapping it in an Array
> > (that must then be validated).
> >
> > As a matter of curiosity, was a dictionary approach ever considered? If
> one
> > new layout was added (one buffer containing the run ends of a RLE 0:N
> int32
> > array), the dictionary member could be the values array and perhaps make
> it
> > easier for implementations that already handle dictionaries.
> >
> >
> >
> >
> >
> > On Wed, Sep 14, 2022 at 2:04 PM Matthew Topol
> <ma...@voltrondata.com.invalid>
> > wrote:
> >
> > > Just wanted to chime in here that I also have several draft PRs for
> > > implementing the RLE arrays in Go as the second implementation (since
> > > we use two implementations as a requirement to vote on
> > > changes/additions to the format).
> > >
> > > They can be found here:
> > >
> > > <https://github.com/apache/arrow/pull/14111>
> > > <https://github.com/apache/arrow/pull/14114>
> > > <https://github.com/apache/arrow/pull/14126>
> > >
> > > --Matt
> > >
> > > On Wed, Sep 14 2022 at 09:44:15 AM -0700, Micah Kornfield
> > > <em...@gmail.com> wrote:
> > > >>
> > > >>   * Should we encode "run lengths" or "run ends"?
> > > >
> > > >
> > > > I think the project has leaned towards sublinear access, so run ends
> > > > make
> > > > sense.  The downside is that we run into similar issues with
> > > > List/LargeList
> > > > where the total number of elements is limited by bit-width (which can
> > > > also
> > > > cause space wastage, e.g. with run ends it might be reasonable to
> > > > limit
> > > > bit-width to 16).
> > > >
> > > > The values are definitely a child array.  However, encoding the run
> > > >>  ends as a buffer (similar to list array for example) makes it
> > > >>  difficult to calculate offsets.  Translating an array offset to a
> > > >>  buffer offset takes O(log(N)) time.  If the run ends are encoded as
> > > >> a
> > > >>  child array (so the RLE array has no buffers and two child arrays)
> > > >>  then this problem goes away.
> > > >
> > > >
> > > > I'm not sure I understand this, could you provide an example of the
> > > > problem
> > > > that the child array solves?
> > > >
> > > >
> > > >
> > > >
> > > > On Wed, Sep 14, 2022 at 9:36 AM Weston Pace <weston.pace@gmail.com
> > > > <ma...@gmail.com>> wrote:
> > > >
> > > >>  I'm going to bump this because it would be good to get feedback.
> In
> > > >>  particular it would be nice to get feedback on the suggested format
> > > >>  change[1].  We are currently moving forward on coming up with an
> IPC
> > > >>  format proposal which we will share when ready.
> > > >>
> > > >>  The two interesting points that jump out to me are:
> > > >>
> > > >>   * Should we encode "run lengths" or "run ends"?
> > > >>
> > > >>  For example, should 5,5,5,6,6,7,7,7,7 be encoded with "run lengths"
> > > >> 3,
> > > >>  2, 4 or "run ends" 3, 5, 9.  In the proposal the latter is
> preferred
> > > >>  as that leads to O(log(N)) random access (instead of O(N)) and it's
> > > >>  not clear there are any disadvantages.
> > > >>
> > > >>   * Should the run ends be encoded as a buffer or a child array?
> > > >>
> > > >>  The values are definitely a child array.  However, encoding the run
> > > >>  ends as a buffer (similar to list array for example) makes it
> > > >>  difficult to calculate offsets.  Translating an array offset to a
> > > >>  buffer offset takes O(log(N)) time.  If the run ends are encoded as
> > > >> a
> > > >>  child array (so the RLE array has no buffers and two child arrays)
> > > >>  then this problem goes away.
> > > >>
> > > >>  [1] <https://github.com/apache/arrow/pull/13333/files>
> > > >>
> > > >>  On Thu, Aug 25, 2022 at 10:35 AM Tobias Zagorni
> > > >>  <tobias@zagorni.eu.invalid <ma...@zagorni.eu.invalid>>
> > > >> wrote:
> > > >>  >
> > > >>  > Hello Everyone,
> > > >>  >
> > > >>  > Recently, I have implemented support for run-length encoding in
> > > >> Arrow
> > > >>  > C++. So far my implementation is split into different subtasks of
> > > >>  > ARROW-16771 (<https://issues.apache.org/jira/browse/ARROW-16771
> >).
> > > >>  >
> > > >>  > I have (draft) PRs available for:
> > > >>  > - general handling of RLE in arrow C++, Type, Arrow, Builder
> > > >>  > subclasses, etc.
> > > >>  >   (subtasks 1-9)
> > > >>  > - encode, decode kernels (fixed size only):
> > > >>  >   (<https://issues.apache.org/jira/browse/ARROW-16772>)
> > > >>  > - filter kernel (fixed size only):
> > > >>  >   (<https://issues.apache.org/jira/browse/ARROW-16774>)
> > > >>  > - simple benchmark for the RLE kernels
> > > >>  >   (<https://issues.apache.org/jira/browse/ARROW-17026>)
> > > >>  > - adding RLE to Arrow Columnar format document
> > > >>  >   (<https://issues.apache.org/jira/browse/ARROW-16773>)
> > > >>  >
> > > >>  > What is not yet implemented:
> > > >>  > - converting RLE to formats like Parquet, JSON, IPC.
> > > >>  > - caching of physical offsets when working with sliced arrays -
> > > >> finding
> > > >>  > these offsets is an  O(log(n)) binary search which could be
> > > >> avoided in
> > > >>  > a lot of cases
> > > >>  >
> > > >>  > I'm interested in any feedback on the code and I'm wondering what
> > > >> would
> > > >>  > be the best way to get this merged.
> > > >>  >
> > > >>  > A lot of the PRs depend on earlier ones. I ordered the subtasks
> > > >> in a
> > > >>  > way they could be merged. The first would be:
> > > >>  > 1. Handling of array-only types using VisitTypeInline:
> > > >>  >    <https://issues.apache.org/jira/browse/ARROW-17258>
> > > >>  > 2. Adding RLE type / array class (only builds on #1):
> > > >>  >    <https://issues.apache.org/jira/browse/ARROW-17261>
> > > >>  > -  also, since it has no dependencies: adding RLE to Arrow
> > > >> Columnar
> > > >>  > format document
> > > >>  >    <https://issues.apache.org/jira/browse/ARROW-16773>
> > > >>  >
> > > >>  > Best,
> > > >>  > Tobias
> > > >>
> > >
> > >
>

Re: RLE array slicing

Posted by Weston Pace <we...@gmail.com>.

Thank you everyone, I think I was pretty far off base in representing the
work Tobias had done and both Tobias and Matt have clarified well.

 * There are two child arrays not necessarily for slicing but more to help
distinguish between the logical length (there are no buffers with the
logical length but there is an array) and the physical length (all of the
buffers are in these child arrays which are sized to the physical length).
 * There may be some caching tricks that could help with slicing but for
the moment it is not a big deal.  The extra log(N) is unlikely to be
significant and, if it were, then an implementation or user library is free
to use whatever caching mechanism needed to help.

On Thu, Sep 15, 2022 at 8:16 AM Matt Topol <zo...@gmail.com> wrote:

> > why would the run ends and values have the same offset?
>
> That's why I liked the idea of the children arrays and having the parent
> offset being a "logical offset" and children being "physical offsets"
> because it maintains the independence of the arrays. Slicing the RLE is
> simply just setting the length and offset of the parent (at the cost of
> O(Log(n)) to do a lookup without caching) and you can use run ends and
> values with different offsets as the children. The only requirement is that
> the "length" of the run ends and values arrays need to be the same.
>
>
>
> On Thu, Sep 15, 2022, 8:23 AM Tobias Zagorni <to...@zagorni.eu.invalid>
> wrote:
>
> > > {
> > >     length: 2
> > >     offset: 6
> > >     rle: {
> > >         length: 1 // actually physical length
> > >         offset: 2
> > >         buffer: [3, 5,8]
> > >     }
> > >     values: {
> > >        length: 1
> > >        offset: 2
> > >        buffer: [5, 6, 7]
> > >     }
> > > }
> > > Does this make sense?
> >
> > I think this is a valid way of doing it. There are 2 problems I see
> > with this variant:
> >
> > - The Slice function would need to be modified to to specially handle
> > RLE. Slicing RLE based on a logical offset will now always be O(log(n))
> > . Slicing by just setting offset/length of the main array will no
> > longer work. (which may not be that bad performance-wise, since in a
> > lot of cases we can't get around resolving the phyiscal offset at some
> > point).
> >
> > - How would we slice nested types, i.e a struct where some fields are
> > run-length encoded?
> >   - We don't touch the rle fields since they are struct fields and
> >   these are not touched during slicing -> for anything where RLE is
> >   nested there is still only a logical offset. Now performance
> >   characteristics are also inconsistent between rle and struct<rle,...>
> >   - We change the RLE child's run ends/values arrays. Now we have a
> >     struct field that is no longer a valid array on its own, likely
> >     breaking a lot of things
> >
> > > Apologies Weston, if this is what you were getting at, but I think it
> > > is
> > > slightly different because you talked about 0 offsets in your
> > > example.
> > >
> > >
> > > > > So a slice at 4 would be:
> > > > > Run ends: [5, 8]
> > > > > Values: [6, 7]
> > >
> > >
> > > How do you interpret that?
> > > > Naively, that means the logical values [6, 6, 6, 6, 6, 7, 7, 7]
> > > > It doesn't seem right...
> >
> > we could find the logical offset by looking into run_ends_array[-1] if
> > run_ends_array.offset is not 0. That said the current code does not
> > support that.
> > >
> >
> > Best,
> > Tobias
> >
> >
>

Re: RLE array slicing

Posted by Matt Topol <zo...@gmail.com>.

> why would the run ends and values have the same offset?

That's why I liked the idea of the children arrays and having the parent
offset being a "logical offset" and children being "physical offsets"
because it maintains the independence of the arrays. Slicing the RLE is
simply just setting the length and offset of the parent (at the cost of
O(Log(n)) to do a lookup without caching) and you can use run ends and
values with different offsets as the children. The only requirement is that
the "length" of the run ends and values arrays need to be the same.



On Thu, Sep 15, 2022, 8:23 AM Tobias Zagorni <to...@zagorni.eu.invalid>
wrote:

> > {
> >     length: 2
> >     offset: 6
> >     rle: {
> >         length: 1 // actually physical length
> >         offset: 2
> >         buffer: [3, 5,8]
> >     }
> >     values: {
> >        length: 1
> >        offset: 2
> >        buffer: [5, 6, 7]
> >     }
> > }
> > Does this make sense?
>
> I think this is a valid way of doing it. There are 2 problems I see
> with this variant:
>
> - The Slice function would need to be modified to to specially handle
> RLE. Slicing RLE based on a logical offset will now always be O(log(n))
> . Slicing by just setting offset/length of the main array will no
> longer work. (which may not be that bad performance-wise, since in a
> lot of cases we can't get around resolving the phyiscal offset at some
> point).
>
> - How would we slice nested types, i.e a struct where some fields are
> run-length encoded?
>   - We don't touch the rle fields since they are struct fields and
>   these are not touched during slicing -> for anything where RLE is
>   nested there is still only a logical offset. Now performance
>   characteristics are also inconsistent between rle and struct<rle,...>
>   - We change the RLE child's run ends/values arrays. Now we have a
>     struct field that is no longer a valid array on its own, likely
>     breaking a lot of things
>
> > Apologies Weston, if this is what you were getting at, but I think it
> > is
> > slightly different because you talked about 0 offsets in your
> > example.
> >
> >
> > > > So a slice at 4 would be:
> > > > Run ends: [5, 8]
> > > > Values: [6, 7]
> >
> >
> > How do you interpret that?
> > > Naively, that means the logical values [6, 6, 6, 6, 6, 7, 7, 7]
> > > It doesn't seem right...
>
> we could find the logical offset by looking into run_ends_array[-1] if
> run_ends_array.offset is not 0. That said the current code does not
> support that.
> >
>
> Best,
> Tobias
>
>

Re: RLE array slicing

Posted by Tobias Zagorni <to...@zagorni.eu.INVALID>.

> {
>     length: 2
>     offset: 6
>     rle: {
>         length: 1 // actually physical length
>         offset: 2
>         buffer: [3, 5,8]
>     }
>     values: {
>        length: 1
>        offset: 2
>        buffer: [5, 6, 7]
>     }
> }
> Does this make sense?

I think this is a valid way of doing it. There are 2 problems I see
with this variant:

- The Slice function would need to be modified to to specially handle 
RLE. Slicing RLE based on a logical offset will now always be O(log(n))
. Slicing by just setting offset/length of the main array will no
longer work. (which may not be that bad performance-wise, since in a
lot of cases we can't get around resolving the phyiscal offset at some
point).

- How would we slice nested types, i.e a struct where some fields are
run-length encoded?
  - We don't touch the rle fields since they are struct fields and
  these are not touched during slicing -> for anything where RLE is
  nested there is still only a logical offset. Now performance
  characteristics are also inconsistent between rle and struct<rle,...>
  - We change the RLE child's run ends/values arrays. Now we have a
    struct field that is no longer a valid array on its own, likely
    breaking a lot of things

> Apologies Weston, if this is what you were getting at, but I think it
> is
> slightly different because you talked about 0 offsets in your
> example.
> 
> 
> > > So a slice at 4 would be:
> > > Run ends: [5, 8]
> > > Values: [6, 7]
> 
> 
> How do you interpret that?
> > Naively, that means the logical values [6, 6, 6, 6, 6, 7, 7, 7]
> > It doesn't seem right...

we could find the logical offset by looking into run_ends_array[-1] if
run_ends_array.offset is not 0. That said the current code does not
support that.
> 

Best,
Tobias

Re: RLE array slicing

Posted by Micah Kornfield <em...@gmail.com>.

>
> Slicing is part of the C data interface (with the offset member).

OK, so refreshing myself for the C data interface, IIUC, I think one needs
to hack RLE at a parent Array with two children arrays, because otherwise
in general, I don't think I see a way of actually communicating buffer size
at all (unless we want to revisit the API, which might not be worth it).
But it would still require special logic for decoding.  For the slice
example I think it could be represented as (changing slice to '6" since
"4" make some things non-obvious):
{
    length: 2
    offset: 6
    rle: {
        length: 1 // actually physical length
        offset: 2
        buffer: [3, 5,8]
    }
    values: {
       length: 1
       offset: 2
       buffer: [5, 6, 7]
    }
}
Does this make sense?
Apologies Weston, if this is what you were getting at, but I think it is
slightly different because you talked about 0 offsets in your example.


> > So a slice at 4 would be:
> > Run ends: [5, 8]
> > Values: [6, 7]


How do you interpret that?
> Naively, that means the logical values [6, 6, 6, 6, 6, 7, 7, 7]
> It doesn't seem right...

I think the key is you need to keep the logical offset around as part of
you metadata so you can do the subtract.

On Thu, Sep 15, 2022 at 1:19 AM Antoine Pitrou <an...@python.org> wrote:

>
> Le 15/09/2022 à 10:14, Micah Kornfield a écrit :
> > I agree slicing can be tricky here.  Since slicing is not part of the
> > specification, maybe there should be two separate discussions here.  I'll
> > be honest, I forget exactly how slicing works in the C++ implementation,
> > but is
>
> Slicing is part of the C data interface (with the offset member).
>
> >> Say you have the logical values: [5, 5, 5, 6, 6, 7, 7, 7]
> >>
> >> Run ends: [3, 5, 8]
> >> Values: [5, 6, 7]
> >
> > So a slice at 4 would be:
> > Run ends: [5, 8]
> > Values: [6, 7]
>
> How do you interpret that?
> Naively, that means the logical values [6, 6, 6, 6, 6, 7, 7, 7]
> It doesn't seem right...
>
> > For Lookup of elements one could add the logical offset to the index and
> to
> > the binary search as normal.
>
> Right. But that implies we just have a logical offset and no physical
> offset. Which is probably fine to me, but not what Weston proposed AFAIU
> :-)
>
> > I guess this might be harder to implement based on the current slicing
> > implementation?  Or I might be missing something obvious?
>
> Well, in any case, slicing will be more involved for RLE than it is for
> other data types or encodings...
>
> Regards
>
> Antoine.
>

Re: RLE array slicing

Posted by Antoine Pitrou <an...@python.org>.

Le 15/09/2022 à 10:14, Micah Kornfield a écrit :
> I agree slicing can be tricky here.  Since slicing is not part of the
> specification, maybe there should be two separate discussions here.  I'll
> be honest, I forget exactly how slicing works in the C++ implementation,
> but is

Slicing is part of the C data interface (with the offset member).

>> Say you have the logical values: [5, 5, 5, 6, 6, 7, 7, 7]
>>
>> Run ends: [3, 5, 8]
>> Values: [5, 6, 7]
> 
> So a slice at 4 would be:
> Run ends: [5, 8]
> Values: [6, 7]

How do you interpret that?
Naively, that means the logical values [6, 6, 6, 6, 6, 7, 7, 7]
It doesn't seem right...

> For Lookup of elements one could add the logical offset to the index and to
> the binary search as normal.

Right. But that implies we just have a logical offset and no physical 
offset. Which is probably fine to me, but not what Weston proposed AFAIU :-)

> I guess this might be harder to implement based on the current slicing
> implementation?  Or I might be missing something obvious?

Well, in any case, slicing will be more involved for RLE than it is for 
other data types or encodings...

Regards

Antoine.

Re: RLE array slicing

Posted by Micah Kornfield <em...@gmail.com>.

I agree slicing can be tricky here.  Since slicing is not part of the
specification, maybe there should be two separate discussions here.  I'll
be honest, I forget exactly how slicing works in the C++ implementation,
but is

> Say you want to slice the RLE array from Logical Offset 4 (which doesn't
> > fall on a run boundary). How do you represent that with Physical Offsets
> > into Run ends and Values?
>
Do we need to solve this problem, can we keep the logical offset as part of
the  RLE array and slice the Run Buffer and the Value Array at the same
time?

> Say you have the logical values: [5, 5, 5, 6, 6, 7, 7, 7]
>
> Run ends: [3, 5, 8]
> Values: [5, 6, 7]

So a slice at 4 would be:
Run ends: [5, 8]
Values: [6, 7]
This can be done in LOG(N) the physical slice offset for the array is the
same physical slice offset for Run ends (the first element greater than
then the logical offset)

When writing to the IPC format: you subtract the logical offset from the
run ends in the sliced buffer and write that.  Arrays are written as normal:
Run ends: [1, 4]
Values: [6, 7]

Which would reconstruct [6, 7, 7, 7].

For Lookup of elements one could add the logical offset to the index and to
the binary search as normal.

I guess this might be harder to implement based on the current slicing
implementation?  Or I might be missing something obvious?

Cheers,
Micah

On Thu, Sep 15, 2022 at 12:41 AM Antoine Pitrou <an...@python.org> wrote:

> On Thu, 15 Sep 2022 09:25:53 +0200
> Antoine Pitrou <an...@python.org> wrote:
> >
> > Why would the run ends and the values have the same offset?
> > Also, how do you interpret the run ends if you have a physical offset
> > into the values array?
> >
> >
> > Say you have the logical values: [5, 5, 5, 6, 6, 7, 7, 7]
> >
> > Run ends: [3, 5, 8]
> > Values: [5, 6, 7]
> >
> > Say you want to slice the RLE array from Logical Offset 4 (which doesn't
> > fall on a run boundary). How do you represent that with Physical Offsets
> > into Run ends and Values?
> >
> > As soon as you set a Physical Offset on the Values, the Run ends don't
> > match anymore.
>
> Hmm, part of my message does not make sense, sorry.
>
> That said, the question about representing a Logical Offset into the
> RLE array purely as Physical Offsets into Run ends and Values still
> holds :-)
>
> Regards
>
> Antoine.
>
>
>

Re: RLE array slicing

Posted by Antoine Pitrou <an...@python.org>.

On Thu, 15 Sep 2022 09:25:53 +0200
Antoine Pitrou <an...@python.org> wrote:
> 
> Why would the run ends and the values have the same offset?
> Also, how do you interpret the run ends if you have a physical offset 
> into the values array?
> 
> 
> Say you have the logical values: [5, 5, 5, 6, 6, 7, 7, 7]
> 
> Run ends: [3, 5, 8]
> Values: [5, 6, 7]
> 
> Say you want to slice the RLE array from Logical Offset 4 (which doesn't 
> fall on a run boundary). How do you represent that with Physical Offsets 
> into Run ends and Values?
> 
> As soon as you set a Physical Offset on the Values, the Run ends don't 
> match anymore.

Hmm, part of my message does not make sense, sorry.

That said, the question about representing a Logical Offset into the
RLE array purely as Physical Offsets into Run ends and Values still
holds :-)

Regards

Antoine.

Re: RLE array slicing

Posted by Antoine Pitrou <an...@python.org>.

Le 14/09/2022 à 20:18, Weston Pace a écrit :
> I will clarify the offset problem.  It essentially boils down to "if
> you don't have constant access to elements then an array length offset
> does not give you constant access to buffer offsets".
> 
> We start with an RLE<int64> array of length 200.  We slice it with
> (start=10, length=100) to get an RLE<int64> array of length 100 and an
> offset of 10.
> 
> Now we want to write an IPC file (or access the values for whatever
> reason).  The values buffer has 400 bytes and the run ends buffer has
> 200 bytes (these numbers could be anything less than 1600/800 so I'm
> picking these at random).  We need to copy a portion of the "run ends"
> buffer into the file.  What bytes are these?  The only way to tell
> would be to do a binary search on the 200 bytes run ends buffer.
> 
> On the other hand, if there were two child arrays then an
> implementation, when slicing, could choose to always keep the offset
> of the parent array at 0 and instead put the offsets in the child
> arrays.  Now you have a parent array with offset 0, a run ends (int32)
> array with offset 74 and length 5 and a values (int64) array with
> offset 74 and length 5.

Why would the run ends and the values have the same offset?
Also, how do you interpret the run ends if you have a physical offset 
into the values array?


Say you have the logical values: [5, 5, 5, 6, 6, 7, 7, 7]

Run ends: [3, 5, 8]
Values: [5, 6, 7]

Say you want to slice the RLE array from Logical Offset 4 (which doesn't 
fall on a run boundary). How do you represent that with Physical Offsets 
into Run ends and Values?

As soon as you set a Physical Offset on the Values, the Run ends don't 
match anymore.

Regards

Antoine.

Re: PRs for RLE support

Posted by Matt Topol <zo...@gmail.com>.

IMHO I think it's worth parameterizing for the 16/32-bit case. Despite it
being nice to be able to just assume it's a 32bit signed int in terms of
code simplicity, I think it would be a good benefit for memory usage of RLE
arrays.

That said I don't have anything to back that up as I don't regularly use
them. Maybe Velox or DuckDB have some stats we can look at in terms of
average sizes of runs for their RLE encodings? It would be great if we
could base this decision on any usage stats we could find.

On Thu, Sep 15, 2022, 2:32 AM Weston Pace <we...@gmail.com> wrote:

> > The downside is that we run into similar issues with List/LargeList
> > where the total number of elements is limited by bit-width (which can
> also
> > cause space wastage, e.g. with run ends it might be reasonable to limit
> > bit-width to 16).
>
> True.  I guess another question would be whether we want to parameterize
> the type for run ends.  The 32-bit/64-bit case isn't terribly interesting
> to me personally but I suppose we have LargeList for a reason.  The
> 16-bit/32-bit case is more compelling I think.
>
> On Wed, Sep 14, 2022 at 1:44 PM Tobias Zagorni <to...@zagorni.eu.invalid>
> wrote:
>
> > Am Mittwoch, dem 14.09.2022 um 14:33 -0400 schrieb Matthew Topol:
> > > Doesn't this explanation conflate the Logical Offset (the parent's
> > > offset) and the Physical Offset (the offset of the run ends / values
> > > children)? Or am I missing something? If you have an RLE<int64> array
> > > of length 200, and you slice it with {start: 10, length: 100}, I
> > > don't
> > > understand how you can represent that while keeping the parent's
> > > offset
> > > as "0" because the offset of "10" in the slice could represent
> > > anywhere
> > > from 0 - 10 physical values you have to offset by. You'd still keep
> > > an
> > > offset of 10 in the parent, and potentially cache the *physical*
> > > offset
> > > too, right?
> >
> > I could see this working if we require that the whole Buffer of the run
> > ends array (and not just the Array) holds valid run end values. We
> > could look into run_ends_array[-1] to find the logical offset, and
> > assume 0 if run_ends_array.offset is 0. This actually seems like an
> > interesting property to me, even if it makes logic more complex because
> > another offset to consider.
> >
> > What I like about the custom Slicing approach is that it works without
> > a cache and for that reason also across library boundaries without
> > losing cache information.
> >
> > With the current format, adding an efficient Slice function with
> > physical length inputs would still be possible, but rely on the cache
> > to store the phyiscal start and end of the slice.
> >
> > > The benefit I see for having the child arrays over the run-ends being
> > > a
> > > buffer in the RLE array directly is that you can easily tell how
> > > large
> > > the buffers *should* be as the length and offset of the run ends
> > > (int32) array and the values array will always represent *physical*
> > > values while the parent's offset and length can represent the
> > > *logical*
> > > values of the RLE. If the run ends were a buffer in the RLE array
> > > directly, it would be much more difficult to maintain the separation
> > > of
> > > the "physical" and "logical" offset/lengths.
> > >
> > > Does that make sense?
> > >
> > > --Matt
> >
> > This makes perfect sense to me. The primary reason for me was that
> > Buffers do not have a length field in the Arrow C interface so thier
> > length is not really known, while in Arrow C++ they have a size. For
> > all existing data types this can be easiliy resolved from either the
> > length field or offsets buffer. For RLE the the run ends buffer is
> > physical size of the array (or larger) which cannot be easily
> > determined without iterating over the whole buffer.
> >
> > But we need a valid buffer size, so we can resolve logical to physical
> > offsets using binary search. Also I'm not sure if it is required to be
> > known for each buffer in Arrow C++ anyways.
> >
> > - Tobias
> >
> > >
> > > On Wed, Sep 14 2022 at 11:18:27 AM -0700, Weston Pace
> > > <we...@gmail.com> wrote:
> > > > I will clarify the offset problem.  It essentially boils down to
> > > > "if
> > > > you don't have constant access to elements then an array length
> > > > offset
> > > > does not give you constant access to buffer offsets".
> > > >
> > > > We start with an RLE<int64> array of length 200.  We slice it with
> > > > (start=10, length=100) to get an RLE<int64> array of length 100 and
> > > > an
> > > > offset of 10.
> > > >
> > > > Now we want to write an IPC file (or access the values for whatever
> > > > reason).  The values buffer has 400 bytes and the run ends buffer
> > > > has
> > > > 200 bytes (these numbers could be anything less than 1600/800 so
> > > > I'm
> > > > picking these at random).  We need to copy a portion of the "run
> > > > ends"
> > > > buffer into the file.  What bytes are these?  The only way to tell
> > > > would be to do a binary search on the 200 bytes run ends buffer.
> > > >
> > > > On the other hand, if there were two child arrays then an
> > > > implementation, when slicing, could choose to always keep the
> > > > offset
> > > > of the parent array at 0 and instead put the offsets in the child
> > > > arrays.  Now you have a parent array with offset 0, a run ends
> > > > (int32)
> > > > array with offset 74 and length 5 and a values (int64) array with
> > > > offset 74 and length 5.  We can clearly say that we want to grab
> > > > bytes
> > > > 296-316 from the run ends buffer and bytes 592-632 from the values
> > > > buffer.  Of course, other implementations would always be free to
> > > > use
> > > > offsets in the parent array.  So I think the log(N) approach would
> > > > still be needed as a fallback.
> > > >
> > > > Other options:
> > > >
> > > >  * Just eat the log(n) cost, it's not that expensive and any
> > > > application doing excessive slicing could cache the offsets
> > > > themselves.
> > > >  * Add an optional buffer offset to the spec that can be populated
> > > > in
> > > > cases where random array access is not possible.
> > > >
> > > > On Wed, Sep 14, 2022 at 10:53 AM Dewey Dunnington
> > > > <dewey@voltrondata.com.invalid
> > > > <ma...@voltrondata.com.invalid>> wrote:
> > > > >
> > > > >  >  * Should we encode "run lengths" or "run ends"?
> > > > >
> > > > >  In addition to the points mentioned above, this seems the most
> > > > > consistent
> > > > >  with the variable-length binary/list layouts
> > > > >
> > > > >  > encoding the run ends as a buffer (similar to list array for
> > > > > example)
> > > > >  makes it difficult to calculate offsets
> > > > >
> > > > >  I don't have a strong opinion about this, but I also don't
> > > > > understand the
> > > > >  logic. Surely the implementation is just generating/reading a
> > > > > buffer of
> > > > >  integers and there's some overhead/indirection to wrapping it in
> > > > > an
> > > > > Array
> > > > >  (that must then be validated).
> > > > >
> > > > >  As a matter of curiosity, was a dictionary approach ever
> > > > > considered? If one
> > > > >  new layout was added (one buffer containing the run ends of a
> > > > > RLE
> > > > > 0:N int32
> > > > >  array), the dictionary member could be the values array and
> > > > > perhaps
> > > > > make it
> > > > >  easier for implementations that already handle dictionaries.
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >  On Wed, Sep 14, 2022 at 2:04 PM Matthew Topol
> > > > > <
> > > > > matt@voltrondata.com.invalid <ma...@voltrondata.com.invalid>
> > > > > >
> > > > >  wrote:
> > > > >
> > > > >  > Just wanted to chime in here that I also have several draft
> > > > > PRs
> > > > > for
> > > > >  > implementing the RLE arrays in Go as the second implementation
> > > > > (since
> > > > >  > we use two implementations as a requirement to vote on
> > > > >  > changes/additions to the format).
> > > > >  >
> > > > >  > They can be found here:
> > > > >  >
> > > > >  > <<https://github.com/apache/arrow/pull/14111>>
> > > > >  > <<https://github.com/apache/arrow/pull/14114>>
> > > > >  > <<https://github.com/apache/arrow/pull/14126>>
> > > > >  >
> > > > >  > --Matt
> > > > >  >
> > > > >  > On Wed, Sep 14 2022 at 09:44:15 AM -0700, Micah Kornfield
> > > > >  > <emkornfield@gmail.com <ma...@gmail.com>> wrote:
> > > > >  > >>
> > > > >  > >>   * Should we encode "run lengths" or "run ends"?
> > > > >  > >
> > > > >  > >
> > > > >  > > I think the project has leaned towards sublinear access, so
> > > > > run
> > > > > ends
> > > > >  > > make
> > > > >  > > sense.  The downside is that we run into similar issues with
> > > > >  > > List/LargeList
> > > > >  > > where the total number of elements is limited by bit-width
> > > > > (which can
> > > > >  > > also
> > > > >  > > cause space wastage, e.g. with run ends it might be
> > > > > reasonable
> > > > > to
> > > > >  > > limit
> > > > >  > > bit-width to 16).
> > > > >  > >
> > > > >  > > The values are definitely a child array.  However, encoding
> > > > > the
> > > > > run
> > > > >  > >>  ends as a buffer (similar to list array for example) makes
> > > > > it
> > > > >  > >>  difficult to calculate offsets.  Translating an array
> > > > > offset
> > > > > to a
> > > > >  > >>  buffer offset takes O(log(N)) time.  If the run ends are
> > > > > encoded as
> > > > >  > >> a
> > > > >  > >>  child array (so the RLE array has no buffers and two child
> > > > > arrays)
> > > > >  > >>  then this problem goes away.
> > > > >  > >
> > > > >  > >
> > > > >  > > I'm not sure I understand this, could you provide an example
> > > > > of
> > > > > the
> > > > >  > > problem
> > > > >  > > that the child array solves?
> > > > >  > >
> > > > >  > >
> > > > >  > >
> > > > >  > >
> > > > >  > > On Wed, Sep 14, 2022 at 9:36 AM Weston Pace
> > > > > <weston.pace@gmail.com <ma...@gmail.com>
> > > > >  > > <<m...@gmail.com>>> wrote:
> > > > >  > >
> > > > >  > >>  I'm going to bump this because it would be good to get
> > > > > feedback.  In
> > > > >  > >>  particular it would be nice to get feedback on the
> > > > > suggested
> > > > > format
> > > > >  > >>  change[1].  We are currently moving forward on coming up
> > > > > with
> > > > > an IPC
> > > > >  > >>  format proposal which we will share when ready.
> > > > >  > >>
> > > > >  > >>  The two interesting points that jump out to me are:
> > > > >  > >>
> > > > >  > >>   * Should we encode "run lengths" or "run ends"?
> > > > >  > >>
> > > > >  > >>  For example, should 5,5,5,6,6,7,7,7,7 be encoded with "run
> > > > > lengths"
> > > > >  > >> 3,
> > > > >  > >>  2, 4 or "run ends" 3, 5, 9.  In the proposal the latter is
> > > > > preferred
> > > > >  > >>  as that leads to O(log(N)) random access (instead of O(N))
> > > > > and it's
> > > > >  > >>  not clear there are any disadvantages.
> > > > >  > >>
> > > > >  > >>   * Should the run ends be encoded as a buffer or a child
> > > > > array?
> > > > >  > >>
> > > > >  > >>  The values are definitely a child array.  However,
> > > > > encoding
> > > > > the run
> > > > >  > >>  ends as a buffer (similar to list array for example) makes
> > > > > it
> > > > >  > >>  difficult to calculate offsets.  Translating an array
> > > > > offset
> > > > > to a
> > > > >  > >>  buffer offset takes O(log(N)) time.  If the run ends are
> > > > > encoded as
> > > > >  > >> a
> > > > >  > >>  child array (so the RLE array has no buffers and two child
> > > > > arrays)
> > > > >  > >>  then this problem goes away.
> > > > >  > >>
> > > > >  > >>  [1] <<https://github.com/apache/arrow/pull/13333/files>>
> > > > >  > >>
> > > > >  > >>  On Thu, Aug 25, 2022 at 10:35 AM Tobias Zagorni
> > > > >  > >>
> > > > > <tobias@zagorni.eu.invalid <ma...@zagorni.eu.invalid>
> > > > > <<m...@zagorni.eu.invalid>>>
> > > > >  > >> wrote:
> > > > >  > >>  >
> > > > >  > >>  > Hello Everyone,
> > > > >  > >>  >
> > > > >  > >>  > Recently, I have implemented support for run-length
> > > > > encoding in
> > > > >  > >> Arrow
> > > > >  > >>  > C++. So far my implementation is split into different
> > > > > subtasks of
> > > > >  > >>  > ARROW-16771
> > > > > (<<https://issues.apache.org/jira/browse/ARROW-16771>>).
> > > > >  > >>  >
> > > > >  > >>  > I have (draft) PRs available for:
> > > > >  > >>  > - general handling of RLE in arrow C++, Type, Arrow,
> > > > > Builder
> > > > >  > >>  > subclasses, etc.
> > > > >  > >>  >   (subtasks 1-9)
> > > > >  > >>  > - encode, decode kernels (fixed size only):
> > > > >  > >>  >
> > > > > (<<https://issues.apache.org/jira/browse/ARROW-16772>>)
> > > > >  > >>  > - filter kernel (fixed size only):
> > > > >  > >>  >
> > > > > (<<https://issues.apache.org/jira/browse/ARROW-16774>>)
> > > > >  > >>  > - simple benchmark for the RLE kernels
> > > > >  > >>  >
> > > > > (<<https://issues.apache.org/jira/browse/ARROW-17026>>)
> > > > >  > >>  > - adding RLE to Arrow Columnar format document
> > > > >  > >>  >
> > > > > (<<https://issues.apache.org/jira/browse/ARROW-16773>>)
> > > > >  > >>  >
> > > > >  > >>  > What is not yet implemented:
> > > > >  > >>  > - converting RLE to formats like Parquet, JSON, IPC.
> > > > >  > >>  > - caching of physical offsets when working with sliced
> > > > > arrays -
> > > > >  > >> finding
> > > > >  > >>  > these offsets is an  O(log(n)) binary search which could
> > > > > be
> > > > >  > >> avoided in
> > > > >  > >>  > a lot of cases
> > > > >  > >>  >
> > > > >  > >>  > I'm interested in any feedback on the code and I'm
> > > > > wondering what
> > > > >  > >> would
> > > > >  > >>  > be the best way to get this merged.
> > > > >  > >>  >
> > > > >  > >>  > A lot of the PRs depend on earlier ones. I ordered the
> > > > > subtasks
> > > > >  > >> in a
> > > > >  > >>  > way they could be merged. The first would be:
> > > > >  > >>  > 1. Handling of array-only types using VisitTypeInline:
> > > > >  > >>  >    <<https://issues.apache.org/jira/browse/ARROW-17258>>
> > > > >  > >>  > 2. Adding RLE type / array class (only builds on #1):
> > > > >  > >>  >    <<https://issues.apache.org/jira/browse/ARROW-17261>>
> > > > >  > >>  > -  also, since it has no dependencies: adding RLE to
> > > > > Arrow
> > > > >  > >> Columnar
> > > > >  > >>  > format document
> > > > >  > >>  >    <<https://issues.apache.org/jira/browse/ARROW-16773>>
> > > > >  > >>  >
> > > > >  > >>  > Best,
> > > > >  > >>  > Tobias
> > > > >  > >>
> > > > >  >
> > > > >  >
> > >
> >
>

Re: PRs for RLE support

Posted by Weston Pace <we...@gmail.com>.

> The downside is that we run into similar issues with List/LargeList
> where the total number of elements is limited by bit-width (which can also
> cause space wastage, e.g. with run ends it might be reasonable to limit
> bit-width to 16).

True.  I guess another question would be whether we want to parameterize
the type for run ends.  The 32-bit/64-bit case isn't terribly interesting
to me personally but I suppose we have LargeList for a reason.  The
16-bit/32-bit case is more compelling I think.

On Wed, Sep 14, 2022 at 1:44 PM Tobias Zagorni <to...@zagorni.eu.invalid>
wrote:

> Am Mittwoch, dem 14.09.2022 um 14:33 -0400 schrieb Matthew Topol:
> > Doesn't this explanation conflate the Logical Offset (the parent's
> > offset) and the Physical Offset (the offset of the run ends / values
> > children)? Or am I missing something? If you have an RLE<int64> array
> > of length 200, and you slice it with {start: 10, length: 100}, I
> > don't
> > understand how you can represent that while keeping the parent's
> > offset
> > as "0" because the offset of "10" in the slice could represent
> > anywhere
> > from 0 - 10 physical values you have to offset by. You'd still keep
> > an
> > offset of 10 in the parent, and potentially cache the *physical*
> > offset
> > too, right?
>
> I could see this working if we require that the whole Buffer of the run
> ends array (and not just the Array) holds valid run end values. We
> could look into run_ends_array[-1] to find the logical offset, and
> assume 0 if run_ends_array.offset is 0. This actually seems like an
> interesting property to me, even if it makes logic more complex because
> another offset to consider.
>
> What I like about the custom Slicing approach is that it works without
> a cache and for that reason also across library boundaries without
> losing cache information.
>
> With the current format, adding an efficient Slice function with
> physical length inputs would still be possible, but rely on the cache
> to store the phyiscal start and end of the slice.
>
> > The benefit I see for having the child arrays over the run-ends being
> > a
> > buffer in the RLE array directly is that you can easily tell how
> > large
> > the buffers *should* be as the length and offset of the run ends
> > (int32) array and the values array will always represent *physical*
> > values while the parent's offset and length can represent the
> > *logical*
> > values of the RLE. If the run ends were a buffer in the RLE array
> > directly, it would be much more difficult to maintain the separation
> > of
> > the "physical" and "logical" offset/lengths.
> >
> > Does that make sense?
> >
> > --Matt
>
> This makes perfect sense to me. The primary reason for me was that
> Buffers do not have a length field in the Arrow C interface so thier
> length is not really known, while in Arrow C++ they have a size. For
> all existing data types this can be easiliy resolved from either the
> length field or offsets buffer. For RLE the the run ends buffer is
> physical size of the array (or larger) which cannot be easily
> determined without iterating over the whole buffer.
>
> But we need a valid buffer size, so we can resolve logical to physical
> offsets using binary search. Also I'm not sure if it is required to be
> known for each buffer in Arrow C++ anyways.
>
> - Tobias
>
> >
> > On Wed, Sep 14 2022 at 11:18:27 AM -0700, Weston Pace
> > <we...@gmail.com> wrote:
> > > I will clarify the offset problem.  It essentially boils down to
> > > "if
> > > you don't have constant access to elements then an array length
> > > offset
> > > does not give you constant access to buffer offsets".
> > >
> > > We start with an RLE<int64> array of length 200.  We slice it with
> > > (start=10, length=100) to get an RLE<int64> array of length 100 and
> > > an
> > > offset of 10.
> > >
> > > Now we want to write an IPC file (or access the values for whatever
> > > reason).  The values buffer has 400 bytes and the run ends buffer
> > > has
> > > 200 bytes (these numbers could be anything less than 1600/800 so
> > > I'm
> > > picking these at random).  We need to copy a portion of the "run
> > > ends"
> > > buffer into the file.  What bytes are these?  The only way to tell
> > > would be to do a binary search on the 200 bytes run ends buffer.
> > >
> > > On the other hand, if there were two child arrays then an
> > > implementation, when slicing, could choose to always keep the
> > > offset
> > > of the parent array at 0 and instead put the offsets in the child
> > > arrays.  Now you have a parent array with offset 0, a run ends
> > > (int32)
> > > array with offset 74 and length 5 and a values (int64) array with
> > > offset 74 and length 5.  We can clearly say that we want to grab
> > > bytes
> > > 296-316 from the run ends buffer and bytes 592-632 from the values
> > > buffer.  Of course, other implementations would always be free to
> > > use
> > > offsets in the parent array.  So I think the log(N) approach would
> > > still be needed as a fallback.
> > >
> > > Other options:
> > >
> > >  * Just eat the log(n) cost, it's not that expensive and any
> > > application doing excessive slicing could cache the offsets
> > > themselves.
> > >  * Add an optional buffer offset to the spec that can be populated
> > > in
> > > cases where random array access is not possible.
> > >
> > > On Wed, Sep 14, 2022 at 10:53 AM Dewey Dunnington
> > > <dewey@voltrondata.com.invalid
> > > <ma...@voltrondata.com.invalid>> wrote:
> > > >
> > > >  >  * Should we encode "run lengths" or "run ends"?
> > > >
> > > >  In addition to the points mentioned above, this seems the most
> > > > consistent
> > > >  with the variable-length binary/list layouts
> > > >
> > > >  > encoding the run ends as a buffer (similar to list array for
> > > > example)
> > > >  makes it difficult to calculate offsets
> > > >
> > > >  I don't have a strong opinion about this, but I also don't
> > > > understand the
> > > >  logic. Surely the implementation is just generating/reading a
> > > > buffer of
> > > >  integers and there's some overhead/indirection to wrapping it in
> > > > an
> > > > Array
> > > >  (that must then be validated).
> > > >
> > > >  As a matter of curiosity, was a dictionary approach ever
> > > > considered? If one
> > > >  new layout was added (one buffer containing the run ends of a
> > > > RLE
> > > > 0:N int32
> > > >  array), the dictionary member could be the values array and
> > > > perhaps
> > > > make it
> > > >  easier for implementations that already handle dictionaries.
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >  On Wed, Sep 14, 2022 at 2:04 PM Matthew Topol
> > > > <
> > > > matt@voltrondata.com.invalid <ma...@voltrondata.com.invalid>
> > > > >
> > > >  wrote:
> > > >
> > > >  > Just wanted to chime in here that I also have several draft
> > > > PRs
> > > > for
> > > >  > implementing the RLE arrays in Go as the second implementation
> > > > (since
> > > >  > we use two implementations as a requirement to vote on
> > > >  > changes/additions to the format).
> > > >  >
> > > >  > They can be found here:
> > > >  >
> > > >  > <<https://github.com/apache/arrow/pull/14111>>
> > > >  > <<https://github.com/apache/arrow/pull/14114>>
> > > >  > <<https://github.com/apache/arrow/pull/14126>>
> > > >  >
> > > >  > --Matt
> > > >  >
> > > >  > On Wed, Sep 14 2022 at 09:44:15 AM -0700, Micah Kornfield
> > > >  > <emkornfield@gmail.com <ma...@gmail.com>> wrote:
> > > >  > >>
> > > >  > >>   * Should we encode "run lengths" or "run ends"?
> > > >  > >
> > > >  > >
> > > >  > > I think the project has leaned towards sublinear access, so
> > > > run
> > > > ends
> > > >  > > make
> > > >  > > sense.  The downside is that we run into similar issues with
> > > >  > > List/LargeList
> > > >  > > where the total number of elements is limited by bit-width
> > > > (which can
> > > >  > > also
> > > >  > > cause space wastage, e.g. with run ends it might be
> > > > reasonable
> > > > to
> > > >  > > limit
> > > >  > > bit-width to 16).
> > > >  > >
> > > >  > > The values are definitely a child array.  However, encoding
> > > > the
> > > > run
> > > >  > >>  ends as a buffer (similar to list array for example) makes
> > > > it
> > > >  > >>  difficult to calculate offsets.  Translating an array
> > > > offset
> > > > to a
> > > >  > >>  buffer offset takes O(log(N)) time.  If the run ends are
> > > > encoded as
> > > >  > >> a
> > > >  > >>  child array (so the RLE array has no buffers and two child
> > > > arrays)
> > > >  > >>  then this problem goes away.
> > > >  > >
> > > >  > >
> > > >  > > I'm not sure I understand this, could you provide an example
> > > > of
> > > > the
> > > >  > > problem
> > > >  > > that the child array solves?
> > > >  > >
> > > >  > >
> > > >  > >
> > > >  > >
> > > >  > > On Wed, Sep 14, 2022 at 9:36 AM Weston Pace
> > > > <weston.pace@gmail.com <ma...@gmail.com>
> > > >  > > <<m...@gmail.com>>> wrote:
> > > >  > >
> > > >  > >>  I'm going to bump this because it would be good to get
> > > > feedback.  In
> > > >  > >>  particular it would be nice to get feedback on the
> > > > suggested
> > > > format
> > > >  > >>  change[1].  We are currently moving forward on coming up
> > > > with
> > > > an IPC
> > > >  > >>  format proposal which we will share when ready.
> > > >  > >>
> > > >  > >>  The two interesting points that jump out to me are:
> > > >  > >>
> > > >  > >>   * Should we encode "run lengths" or "run ends"?
> > > >  > >>
> > > >  > >>  For example, should 5,5,5,6,6,7,7,7,7 be encoded with "run
> > > > lengths"
> > > >  > >> 3,
> > > >  > >>  2, 4 or "run ends" 3, 5, 9.  In the proposal the latter is
> > > > preferred
> > > >  > >>  as that leads to O(log(N)) random access (instead of O(N))
> > > > and it's
> > > >  > >>  not clear there are any disadvantages.
> > > >  > >>
> > > >  > >>   * Should the run ends be encoded as a buffer or a child
> > > > array?
> > > >  > >>
> > > >  > >>  The values are definitely a child array.  However,
> > > > encoding
> > > > the run
> > > >  > >>  ends as a buffer (similar to list array for example) makes
> > > > it
> > > >  > >>  difficult to calculate offsets.  Translating an array
> > > > offset
> > > > to a
> > > >  > >>  buffer offset takes O(log(N)) time.  If the run ends are
> > > > encoded as
> > > >  > >> a
> > > >  > >>  child array (so the RLE array has no buffers and two child
> > > > arrays)
> > > >  > >>  then this problem goes away.
> > > >  > >>
> > > >  > >>  [1] <<https://github.com/apache/arrow/pull/13333/files>>
> > > >  > >>
> > > >  > >>  On Thu, Aug 25, 2022 at 10:35 AM Tobias Zagorni
> > > >  > >>
> > > > <tobias@zagorni.eu.invalid <ma...@zagorni.eu.invalid>
> > > > <<m...@zagorni.eu.invalid>>>
> > > >  > >> wrote:
> > > >  > >>  >
> > > >  > >>  > Hello Everyone,
> > > >  > >>  >
> > > >  > >>  > Recently, I have implemented support for run-length
> > > > encoding in
> > > >  > >> Arrow
> > > >  > >>  > C++. So far my implementation is split into different
> > > > subtasks of
> > > >  > >>  > ARROW-16771
> > > > (<<https://issues.apache.org/jira/browse/ARROW-16771>>).
> > > >  > >>  >
> > > >  > >>  > I have (draft) PRs available for:
> > > >  > >>  > - general handling of RLE in arrow C++, Type, Arrow,
> > > > Builder
> > > >  > >>  > subclasses, etc.
> > > >  > >>  >   (subtasks 1-9)
> > > >  > >>  > - encode, decode kernels (fixed size only):
> > > >  > >>  >
> > > > (<<https://issues.apache.org/jira/browse/ARROW-16772>>)
> > > >  > >>  > - filter kernel (fixed size only):
> > > >  > >>  >
> > > > (<<https://issues.apache.org/jira/browse/ARROW-16774>>)
> > > >  > >>  > - simple benchmark for the RLE kernels
> > > >  > >>  >
> > > > (<<https://issues.apache.org/jira/browse/ARROW-17026>>)
> > > >  > >>  > - adding RLE to Arrow Columnar format document
> > > >  > >>  >
> > > > (<<https://issues.apache.org/jira/browse/ARROW-16773>>)
> > > >  > >>  >
> > > >  > >>  > What is not yet implemented:
> > > >  > >>  > - converting RLE to formats like Parquet, JSON, IPC.
> > > >  > >>  > - caching of physical offsets when working with sliced
> > > > arrays -
> > > >  > >> finding
> > > >  > >>  > these offsets is an  O(log(n)) binary search which could
> > > > be
> > > >  > >> avoided in
> > > >  > >>  > a lot of cases
> > > >  > >>  >
> > > >  > >>  > I'm interested in any feedback on the code and I'm
> > > > wondering what
> > > >  > >> would
> > > >  > >>  > be the best way to get this merged.
> > > >  > >>  >
> > > >  > >>  > A lot of the PRs depend on earlier ones. I ordered the
> > > > subtasks
> > > >  > >> in a
> > > >  > >>  > way they could be merged. The first would be:
> > > >  > >>  > 1. Handling of array-only types using VisitTypeInline:
> > > >  > >>  >    <<https://issues.apache.org/jira/browse/ARROW-17258>>
> > > >  > >>  > 2. Adding RLE type / array class (only builds on #1):
> > > >  > >>  >    <<https://issues.apache.org/jira/browse/ARROW-17261>>
> > > >  > >>  > -  also, since it has no dependencies: adding RLE to
> > > > Arrow
> > > >  > >> Columnar
> > > >  > >>  > format document
> > > >  > >>  >    <<https://issues.apache.org/jira/browse/ARROW-16773>>
> > > >  > >>  >
> > > >  > >>  > Best,
> > > >  > >>  > Tobias
> > > >  > >>
> > > >  >
> > > >  >
> >
>

Re: PRs for RLE support

Posted by Tobias Zagorni <to...@zagorni.eu.INVALID>.

Am Mittwoch, dem 14.09.2022 um 14:33 -0400 schrieb Matthew Topol:
> Doesn't this explanation conflate the Logical Offset (the parent's 
> offset) and the Physical Offset (the offset of the run ends / values 
> children)? Or am I missing something? If you have an RLE<int64> array
> of length 200, and you slice it with {start: 10, length: 100}, I
> don't 
> understand how you can represent that while keeping the parent's
> offset 
> as "0" because the offset of "10" in the slice could represent
> anywhere 
> from 0 - 10 physical values you have to offset by. You'd still keep
> an 
> offset of 10 in the parent, and potentially cache the *physical*
> offset 
> too, right?

I could see this working if we require that the whole Buffer of the run
ends array (and not just the Array) holds valid run end values. We
could look into run_ends_array[-1] to find the logical offset, and
assume 0 if run_ends_array.offset is 0. This actually seems like an
interesting property to me, even if it makes logic more complex because
another offset to consider. 

What I like about the custom Slicing approach is that it works without
a cache and for that reason also across library boundaries without
losing cache information.

With the current format, adding an efficient Slice function with
physical length inputs would still be possible, but rely on the cache
to store the phyiscal start and end of the slice.

> The benefit I see for having the child arrays over the run-ends being
> a 
> buffer in the RLE array directly is that you can easily tell how
> large 
> the buffers *should* be as the length and offset of the run ends 
> (int32) array and the values array will always represent *physical* 
> values while the parent's offset and length can represent the
> *logical* 
> values of the RLE. If the run ends were a buffer in the RLE array 
> directly, it would be much more difficult to maintain the separation
> of 
> the "physical" and "logical" offset/lengths.
> 
> Does that make sense?
> 
> --Matt

This makes perfect sense to me. The primary reason for me was that
Buffers do not have a length field in the Arrow C interface so thier
length is not really known, while in Arrow C++ they have a size. For
all existing data types this can be easiliy resolved from either the
length field or offsets buffer. For RLE the the run ends buffer is
physical size of the array (or larger) which cannot be easily
determined without iterating over the whole buffer.

But we need a valid buffer size, so we can resolve logical to physical
offsets using binary search. Also I'm not sure if it is required to be
known for each buffer in Arrow C++ anyways.

- Tobias

> 
> On Wed, Sep 14 2022 at 11:18:27 AM -0700, Weston Pace 
> <we...@gmail.com> wrote:
> > I will clarify the offset problem.  It essentially boils down to
> > "if
> > you don't have constant access to elements then an array length
> > offset
> > does not give you constant access to buffer offsets".
> > 
> > We start with an RLE<int64> array of length 200.  We slice it with
> > (start=10, length=100) to get an RLE<int64> array of length 100 and
> > an
> > offset of 10.
> > 
> > Now we want to write an IPC file (or access the values for whatever
> > reason).  The values buffer has 400 bytes and the run ends buffer
> > has
> > 200 bytes (these numbers could be anything less than 1600/800 so
> > I'm
> > picking these at random).  We need to copy a portion of the "run
> > ends"
> > buffer into the file.  What bytes are these?  The only way to tell
> > would be to do a binary search on the 200 bytes run ends buffer.
> > 
> > On the other hand, if there were two child arrays then an
> > implementation, when slicing, could choose to always keep the
> > offset
> > of the parent array at 0 and instead put the offsets in the child
> > arrays.  Now you have a parent array with offset 0, a run ends
> > (int32)
> > array with offset 74 and length 5 and a values (int64) array with
> > offset 74 and length 5.  We can clearly say that we want to grab
> > bytes
> > 296-316 from the run ends buffer and bytes 592-632 from the values
> > buffer.  Of course, other implementations would always be free to
> > use
> > offsets in the parent array.  So I think the log(N) approach would
> > still be needed as a fallback.
> > 
> > Other options:
> > 
> >  * Just eat the log(n) cost, it's not that expensive and any
> > application doing excessive slicing could cache the offsets
> > themselves.
> >  * Add an optional buffer offset to the spec that can be populated
> > in
> > cases where random array access is not possible.
> > 
> > On Wed, Sep 14, 2022 at 10:53 AM Dewey Dunnington
> > <dewey@voltrondata.com.invalid 
> > <ma...@voltrondata.com.invalid>> wrote:
> > > 
> > >  >  * Should we encode "run lengths" or "run ends"?
> > > 
> > >  In addition to the points mentioned above, this seems the most 
> > > consistent
> > >  with the variable-length binary/list layouts
> > > 
> > >  > encoding the run ends as a buffer (similar to list array for 
> > > example)
> > >  makes it difficult to calculate offsets
> > > 
> > >  I don't have a strong opinion about this, but I also don't 
> > > understand the
> > >  logic. Surely the implementation is just generating/reading a 
> > > buffer of
> > >  integers and there's some overhead/indirection to wrapping it in
> > > an 
> > > Array
> > >  (that must then be validated).
> > > 
> > >  As a matter of curiosity, was a dictionary approach ever 
> > > considered? If one
> > >  new layout was added (one buffer containing the run ends of a
> > > RLE 
> > > 0:N int32
> > >  array), the dictionary member could be the values array and
> > > perhaps 
> > > make it
> > >  easier for implementations that already handle dictionaries.
> > > 
> > > 
> > > 
> > > 
> > > 
> > >  On Wed, Sep 14, 2022 at 2:04 PM Matthew Topol 
> > > <
> > > matt@voltrondata.com.invalid <ma...@voltrondata.com.invalid>
> > > >
> > >  wrote:
> > > 
> > >  > Just wanted to chime in here that I also have several draft
> > > PRs 
> > > for
> > >  > implementing the RLE arrays in Go as the second implementation
> > > (since
> > >  > we use two implementations as a requirement to vote on
> > >  > changes/additions to the format).
> > >  >
> > >  > They can be found here:
> > >  >
> > >  > <<https://github.com/apache/arrow/pull/14111>>
> > >  > <<https://github.com/apache/arrow/pull/14114>>
> > >  > <<https://github.com/apache/arrow/pull/14126>>
> > >  >
> > >  > --Matt
> > >  >
> > >  > On Wed, Sep 14 2022 at 09:44:15 AM -0700, Micah Kornfield
> > >  > <emkornfield@gmail.com <ma...@gmail.com>> wrote:
> > >  > >>
> > >  > >>   * Should we encode "run lengths" or "run ends"?
> > >  > >
> > >  > >
> > >  > > I think the project has leaned towards sublinear access, so
> > > run 
> > > ends
> > >  > > make
> > >  > > sense.  The downside is that we run into similar issues with
> > >  > > List/LargeList
> > >  > > where the total number of elements is limited by bit-width 
> > > (which can
> > >  > > also
> > >  > > cause space wastage, e.g. with run ends it might be
> > > reasonable 
> > > to
> > >  > > limit
> > >  > > bit-width to 16).
> > >  > >
> > >  > > The values are definitely a child array.  However, encoding
> > > the 
> > > run
> > >  > >>  ends as a buffer (similar to list array for example) makes
> > > it
> > >  > >>  difficult to calculate offsets.  Translating an array
> > > offset 
> > > to a
> > >  > >>  buffer offset takes O(log(N)) time.  If the run ends are 
> > > encoded as
> > >  > >> a
> > >  > >>  child array (so the RLE array has no buffers and two child
> > > arrays)
> > >  > >>  then this problem goes away.
> > >  > >
> > >  > >
> > >  > > I'm not sure I understand this, could you provide an example
> > > of 
> > > the
> > >  > > problem
> > >  > > that the child array solves?
> > >  > >
> > >  > >
> > >  > >
> > >  > >
> > >  > > On Wed, Sep 14, 2022 at 9:36 AM Weston Pace 
> > > <weston.pace@gmail.com <ma...@gmail.com>
> > >  > > <<m...@gmail.com>>> wrote:
> > >  > >
> > >  > >>  I'm going to bump this because it would be good to get 
> > > feedback.  In
> > >  > >>  particular it would be nice to get feedback on the
> > > suggested 
> > > format
> > >  > >>  change[1].  We are currently moving forward on coming up
> > > with 
> > > an IPC
> > >  > >>  format proposal which we will share when ready.
> > >  > >>
> > >  > >>  The two interesting points that jump out to me are:
> > >  > >>
> > >  > >>   * Should we encode "run lengths" or "run ends"?
> > >  > >>
> > >  > >>  For example, should 5,5,5,6,6,7,7,7,7 be encoded with "run
> > > lengths"
> > >  > >> 3,
> > >  > >>  2, 4 or "run ends" 3, 5, 9.  In the proposal the latter is
> > > preferred
> > >  > >>  as that leads to O(log(N)) random access (instead of O(N))
> > > and it's
> > >  > >>  not clear there are any disadvantages.
> > >  > >>
> > >  > >>   * Should the run ends be encoded as a buffer or a child 
> > > array?
> > >  > >>
> > >  > >>  The values are definitely a child array.  However,
> > > encoding 
> > > the run
> > >  > >>  ends as a buffer (similar to list array for example) makes
> > > it
> > >  > >>  difficult to calculate offsets.  Translating an array
> > > offset 
> > > to a
> > >  > >>  buffer offset takes O(log(N)) time.  If the run ends are 
> > > encoded as
> > >  > >> a
> > >  > >>  child array (so the RLE array has no buffers and two child
> > > arrays)
> > >  > >>  then this problem goes away.
> > >  > >>
> > >  > >>  [1] <<https://github.com/apache/arrow/pull/13333/files>>
> > >  > >>
> > >  > >>  On Thu, Aug 25, 2022 at 10:35 AM Tobias Zagorni
> > >  > >> 
> > > <tobias@zagorni.eu.invalid <ma...@zagorni.eu.invalid> 
> > > <<m...@zagorni.eu.invalid>>>
> > >  > >> wrote:
> > >  > >>  >
> > >  > >>  > Hello Everyone,
> > >  > >>  >
> > >  > >>  > Recently, I have implemented support for run-length 
> > > encoding in
> > >  > >> Arrow
> > >  > >>  > C++. So far my implementation is split into different 
> > > subtasks of
> > >  > >>  > ARROW-16771 
> > > (<<https://issues.apache.org/jira/browse/ARROW-16771>>).
> > >  > >>  >
> > >  > >>  > I have (draft) PRs available for:
> > >  > >>  > - general handling of RLE in arrow C++, Type, Arrow,
> > > Builder
> > >  > >>  > subclasses, etc.
> > >  > >>  >   (subtasks 1-9)
> > >  > >>  > - encode, decode kernels (fixed size only):
> > >  > >>  >  
> > > (<<https://issues.apache.org/jira/browse/ARROW-16772>>)
> > >  > >>  > - filter kernel (fixed size only):
> > >  > >>  >  
> > > (<<https://issues.apache.org/jira/browse/ARROW-16774>>)
> > >  > >>  > - simple benchmark for the RLE kernels
> > >  > >>  >  
> > > (<<https://issues.apache.org/jira/browse/ARROW-17026>>)
> > >  > >>  > - adding RLE to Arrow Columnar format document
> > >  > >>  >  
> > > (<<https://issues.apache.org/jira/browse/ARROW-16773>>)
> > >  > >>  >
> > >  > >>  > What is not yet implemented:
> > >  > >>  > - converting RLE to formats like Parquet, JSON, IPC.
> > >  > >>  > - caching of physical offsets when working with sliced 
> > > arrays -
> > >  > >> finding
> > >  > >>  > these offsets is an  O(log(n)) binary search which could
> > > be
> > >  > >> avoided in
> > >  > >>  > a lot of cases
> > >  > >>  >
> > >  > >>  > I'm interested in any feedback on the code and I'm 
> > > wondering what
> > >  > >> would
> > >  > >>  > be the best way to get this merged.
> > >  > >>  >
> > >  > >>  > A lot of the PRs depend on earlier ones. I ordered the 
> > > subtasks
> > >  > >> in a
> > >  > >>  > way they could be merged. The first would be:
> > >  > >>  > 1. Handling of array-only types using VisitTypeInline:
> > >  > >>  >    <<https://issues.apache.org/jira/browse/ARROW-17258>>
> > >  > >>  > 2. Adding RLE type / array class (only builds on #1):
> > >  > >>  >    <<https://issues.apache.org/jira/browse/ARROW-17261>>
> > >  > >>  > -  also, since it has no dependencies: adding RLE to
> > > Arrow
> > >  > >> Columnar
> > >  > >>  > format document
> > >  > >>  >    <<https://issues.apache.org/jira/browse/ARROW-16773>>
> > >  > >>  >
> > >  > >>  > Best,
> > >  > >>  > Tobias
> > >  > >>
> > >  >
> > >  >
>

Re: PRs for RLE support

Posted by Matthew Topol <ma...@voltrondata.com.INVALID>.

 > On the other hand, if there were two child arrays then an
implementation, when slicing, could choose to always keep the offset
of the parent array at 0 and instead put the offsets in the child
arrays.  Now you have a parent array with offset 0, a run ends (int32)
array with offset 74 and length 5 and a values (int64) array with
offset 74 and length 5.  We can clearly say that we want to grab bytes
296-316 from the run ends buffer and bytes 592-632 from the values
buffer.  Of course, other implementations would always be free to use
offsets in the parent array.  So I think the log(N) approach would
still be needed as a fallback.

Doesn't this explanation conflate the Logical Offset (the parent's 
offset) and the Physical Offset (the offset of the run ends / values 
children)? Or am I missing something? If you have an RLE<int64> array 
of length 200, and you slice it with {start: 10, length: 100}, I don't 
understand how you can represent that while keeping the parent's offset 
as "0" because the offset of "10" in the slice could represent anywhere 
from 0 - 10 physical values you have to offset by. You'd still keep an 
offset of 10 in the parent, and potentially cache the *physical* offset 
too, right?

The benefit I see for having the child arrays over the run-ends being a 
buffer in the RLE array directly is that you can easily tell how large 
the buffers *should* be as the length and offset of the run ends 
(int32) array and the values array will always represent *physical* 
values while the parent's offset and length can represent the *logical* 
values of the RLE. If the run ends were a buffer in the RLE array 
directly, it would be much more difficult to maintain the separation of 
the "physical" and "logical" offset/lengths.

Does that make sense?

--Matt

On Wed, Sep 14 2022 at 11:18:27 AM -0700, Weston Pace 
<we...@gmail.com> wrote:
> I will clarify the offset problem.  It essentially boils down to "if
> you don't have constant access to elements then an array length offset
> does not give you constant access to buffer offsets".
> 
> We start with an RLE<int64> array of length 200.  We slice it with
> (start=10, length=100) to get an RLE<int64> array of length 100 and an
> offset of 10.
> 
> Now we want to write an IPC file (or access the values for whatever
> reason).  The values buffer has 400 bytes and the run ends buffer has
> 200 bytes (these numbers could be anything less than 1600/800 so I'm
> picking these at random).  We need to copy a portion of the "run ends"
> buffer into the file.  What bytes are these?  The only way to tell
> would be to do a binary search on the 200 bytes run ends buffer.
> 
> On the other hand, if there were two child arrays then an
> implementation, when slicing, could choose to always keep the offset
> of the parent array at 0 and instead put the offsets in the child
> arrays.  Now you have a parent array with offset 0, a run ends (int32)
> array with offset 74 and length 5 and a values (int64) array with
> offset 74 and length 5.  We can clearly say that we want to grab bytes
> 296-316 from the run ends buffer and bytes 592-632 from the values
> buffer.  Of course, other implementations would always be free to use
> offsets in the parent array.  So I think the log(N) approach would
> still be needed as a fallback.
> 
> Other options:
> 
>  * Just eat the log(n) cost, it's not that expensive and any
> application doing excessive slicing could cache the offsets
> themselves.
>  * Add an optional buffer offset to the spec that can be populated in
> cases where random array access is not possible.
> 
> On Wed, Sep 14, 2022 at 10:53 AM Dewey Dunnington
> <dewey@voltrondata.com.invalid 
> <ma...@voltrondata.com.invalid>> wrote:
>> 
>>  >  * Should we encode "run lengths" or "run ends"?
>> 
>>  In addition to the points mentioned above, this seems the most 
>> consistent
>>  with the variable-length binary/list layouts
>> 
>>  > encoding the run ends as a buffer (similar to list array for 
>> example)
>>  makes it difficult to calculate offsets
>> 
>>  I don't have a strong opinion about this, but I also don't 
>> understand the
>>  logic. Surely the implementation is just generating/reading a 
>> buffer of
>>  integers and there's some overhead/indirection to wrapping it in an 
>> Array
>>  (that must then be validated).
>> 
>>  As a matter of curiosity, was a dictionary approach ever 
>> considered? If one
>>  new layout was added (one buffer containing the run ends of a RLE 
>> 0:N int32
>>  array), the dictionary member could be the values array and perhaps 
>> make it
>>  easier for implementations that already handle dictionaries.
>> 
>> 
>> 
>> 
>> 
>>  On Wed, Sep 14, 2022 at 2:04 PM Matthew Topol 
>> <matt@voltrondata.com.invalid <ma...@voltrondata.com.invalid>>
>>  wrote:
>> 
>>  > Just wanted to chime in here that I also have several draft PRs 
>> for
>>  > implementing the RLE arrays in Go as the second implementation 
>> (since
>>  > we use two implementations as a requirement to vote on
>>  > changes/additions to the format).
>>  >
>>  > They can be found here:
>>  >
>>  > <<https://github.com/apache/arrow/pull/14111>>
>>  > <<https://github.com/apache/arrow/pull/14114>>
>>  > <<https://github.com/apache/arrow/pull/14126>>
>>  >
>>  > --Matt
>>  >
>>  > On Wed, Sep 14 2022 at 09:44:15 AM -0700, Micah Kornfield
>>  > <emkornfield@gmail.com <ma...@gmail.com>> wrote:
>>  > >>
>>  > >>   * Should we encode "run lengths" or "run ends"?
>>  > >
>>  > >
>>  > > I think the project has leaned towards sublinear access, so run 
>> ends
>>  > > make
>>  > > sense.  The downside is that we run into similar issues with
>>  > > List/LargeList
>>  > > where the total number of elements is limited by bit-width 
>> (which can
>>  > > also
>>  > > cause space wastage, e.g. with run ends it might be reasonable 
>> to
>>  > > limit
>>  > > bit-width to 16).
>>  > >
>>  > > The values are definitely a child array.  However, encoding the 
>> run
>>  > >>  ends as a buffer (similar to list array for example) makes it
>>  > >>  difficult to calculate offsets.  Translating an array offset 
>> to a
>>  > >>  buffer offset takes O(log(N)) time.  If the run ends are 
>> encoded as
>>  > >> a
>>  > >>  child array (so the RLE array has no buffers and two child 
>> arrays)
>>  > >>  then this problem goes away.
>>  > >
>>  > >
>>  > > I'm not sure I understand this, could you provide an example of 
>> the
>>  > > problem
>>  > > that the child array solves?
>>  > >
>>  > >
>>  > >
>>  > >
>>  > > On Wed, Sep 14, 2022 at 9:36 AM Weston Pace 
>> <weston.pace@gmail.com <ma...@gmail.com>
>>  > > <<m...@gmail.com>>> wrote:
>>  > >
>>  > >>  I'm going to bump this because it would be good to get 
>> feedback.  In
>>  > >>  particular it would be nice to get feedback on the suggested 
>> format
>>  > >>  change[1].  We are currently moving forward on coming up with 
>> an IPC
>>  > >>  format proposal which we will share when ready.
>>  > >>
>>  > >>  The two interesting points that jump out to me are:
>>  > >>
>>  > >>   * Should we encode "run lengths" or "run ends"?
>>  > >>
>>  > >>  For example, should 5,5,5,6,6,7,7,7,7 be encoded with "run 
>> lengths"
>>  > >> 3,
>>  > >>  2, 4 or "run ends" 3, 5, 9.  In the proposal the latter is 
>> preferred
>>  > >>  as that leads to O(log(N)) random access (instead of O(N)) 
>> and it's
>>  > >>  not clear there are any disadvantages.
>>  > >>
>>  > >>   * Should the run ends be encoded as a buffer or a child 
>> array?
>>  > >>
>>  > >>  The values are definitely a child array.  However, encoding 
>> the run
>>  > >>  ends as a buffer (similar to list array for example) makes it
>>  > >>  difficult to calculate offsets.  Translating an array offset 
>> to a
>>  > >>  buffer offset takes O(log(N)) time.  If the run ends are 
>> encoded as
>>  > >> a
>>  > >>  child array (so the RLE array has no buffers and two child 
>> arrays)
>>  > >>  then this problem goes away.
>>  > >>
>>  > >>  [1] <<https://github.com/apache/arrow/pull/13333/files>>
>>  > >>
>>  > >>  On Thu, Aug 25, 2022 at 10:35 AM Tobias Zagorni
>>  > >>  <tobias@zagorni.eu.invalid <ma...@zagorni.eu.invalid> 
>> <<m...@zagorni.eu.invalid>>>
>>  > >> wrote:
>>  > >>  >
>>  > >>  > Hello Everyone,
>>  > >>  >
>>  > >>  > Recently, I have implemented support for run-length 
>> encoding in
>>  > >> Arrow
>>  > >>  > C++. So far my implementation is split into different 
>> subtasks of
>>  > >>  > ARROW-16771 
>> (<<https://issues.apache.org/jira/browse/ARROW-16771>>).
>>  > >>  >
>>  > >>  > I have (draft) PRs available for:
>>  > >>  > - general handling of RLE in arrow C++, Type, Arrow, Builder
>>  > >>  > subclasses, etc.
>>  > >>  >   (subtasks 1-9)
>>  > >>  > - encode, decode kernels (fixed size only):
>>  > >>  >   (<<https://issues.apache.org/jira/browse/ARROW-16772>>)
>>  > >>  > - filter kernel (fixed size only):
>>  > >>  >   (<<https://issues.apache.org/jira/browse/ARROW-16774>>)
>>  > >>  > - simple benchmark for the RLE kernels
>>  > >>  >   (<<https://issues.apache.org/jira/browse/ARROW-17026>>)
>>  > >>  > - adding RLE to Arrow Columnar format document
>>  > >>  >   (<<https://issues.apache.org/jira/browse/ARROW-16773>>)
>>  > >>  >
>>  > >>  > What is not yet implemented:
>>  > >>  > - converting RLE to formats like Parquet, JSON, IPC.
>>  > >>  > - caching of physical offsets when working with sliced 
>> arrays -
>>  > >> finding
>>  > >>  > these offsets is an  O(log(n)) binary search which could be
>>  > >> avoided in
>>  > >>  > a lot of cases
>>  > >>  >
>>  > >>  > I'm interested in any feedback on the code and I'm 
>> wondering what
>>  > >> would
>>  > >>  > be the best way to get this merged.
>>  > >>  >
>>  > >>  > A lot of the PRs depend on earlier ones. I ordered the 
>> subtasks
>>  > >> in a
>>  > >>  > way they could be merged. The first would be:
>>  > >>  > 1. Handling of array-only types using VisitTypeInline:
>>  > >>  >    <<https://issues.apache.org/jira/browse/ARROW-17258>>
>>  > >>  > 2. Adding RLE type / array class (only builds on #1):
>>  > >>  >    <<https://issues.apache.org/jira/browse/ARROW-17261>>
>>  > >>  > -  also, since it has no dependencies: adding RLE to Arrow
>>  > >> Columnar
>>  > >>  > format document
>>  > >>  >    <<https://issues.apache.org/jira/browse/ARROW-16773>>
>>  > >>  >
>>  > >>  > Best,
>>  > >>  > Tobias
>>  > >>
>>  >
>>  >

Re: PRs for RLE support

Posted by Weston Pace <we...@gmail.com>.

I will clarify the offset problem.  It essentially boils down to "if
you don't have constant access to elements then an array length offset
does not give you constant access to buffer offsets".

We start with an RLE<int64> array of length 200.  We slice it with
(start=10, length=100) to get an RLE<int64> array of length 100 and an
offset of 10.

Now we want to write an IPC file (or access the values for whatever
reason).  The values buffer has 400 bytes and the run ends buffer has
200 bytes (these numbers could be anything less than 1600/800 so I'm
picking these at random).  We need to copy a portion of the "run ends"
buffer into the file.  What bytes are these?  The only way to tell
would be to do a binary search on the 200 bytes run ends buffer.

On the other hand, if there were two child arrays then an
implementation, when slicing, could choose to always keep the offset
of the parent array at 0 and instead put the offsets in the child
arrays.  Now you have a parent array with offset 0, a run ends (int32)
array with offset 74 and length 5 and a values (int64) array with
offset 74 and length 5.  We can clearly say that we want to grab bytes
296-316 from the run ends buffer and bytes 592-632 from the values
buffer.  Of course, other implementations would always be free to use
offsets in the parent array.  So I think the log(N) approach would
still be needed as a fallback.

Other options:

 * Just eat the log(n) cost, it's not that expensive and any
application doing excessive slicing could cache the offsets
themselves.
 * Add an optional buffer offset to the spec that can be populated in
cases where random array access is not possible.

On Wed, Sep 14, 2022 at 10:53 AM Dewey Dunnington
<de...@voltrondata.com.invalid> wrote:
>
> >  * Should we encode "run lengths" or "run ends"?
>
> In addition to the points mentioned above, this seems the most consistent
> with the variable-length binary/list layouts
>
> > encoding the run ends as a buffer (similar to list array for example)
> makes it difficult to calculate offsets
>
> I don't have a strong opinion about this, but I also don't understand the
> logic. Surely the implementation is just generating/reading a buffer of
> integers and there's some overhead/indirection to wrapping it in an Array
> (that must then be validated).
>
> As a matter of curiosity, was a dictionary approach ever considered? If one
> new layout was added (one buffer containing the run ends of a RLE 0:N int32
> array), the dictionary member could be the values array and perhaps make it
> easier for implementations that already handle dictionaries.
>
>
>
>
>
> On Wed, Sep 14, 2022 at 2:04 PM Matthew Topol <ma...@voltrondata.com.invalid>
> wrote:
>
> > Just wanted to chime in here that I also have several draft PRs for
> > implementing the RLE arrays in Go as the second implementation (since
> > we use two implementations as a requirement to vote on
> > changes/additions to the format).
> >
> > They can be found here:
> >
> > <https://github.com/apache/arrow/pull/14111>
> > <https://github.com/apache/arrow/pull/14114>
> > <https://github.com/apache/arrow/pull/14126>
> >
> > --Matt
> >
> > On Wed, Sep 14 2022 at 09:44:15 AM -0700, Micah Kornfield
> > <em...@gmail.com> wrote:
> > >>
> > >>   * Should we encode "run lengths" or "run ends"?
> > >
> > >
> > > I think the project has leaned towards sublinear access, so run ends
> > > make
> > > sense.  The downside is that we run into similar issues with
> > > List/LargeList
> > > where the total number of elements is limited by bit-width (which can
> > > also
> > > cause space wastage, e.g. with run ends it might be reasonable to
> > > limit
> > > bit-width to 16).
> > >
> > > The values are definitely a child array.  However, encoding the run
> > >>  ends as a buffer (similar to list array for example) makes it
> > >>  difficult to calculate offsets.  Translating an array offset to a
> > >>  buffer offset takes O(log(N)) time.  If the run ends are encoded as
> > >> a
> > >>  child array (so the RLE array has no buffers and two child arrays)
> > >>  then this problem goes away.
> > >
> > >
> > > I'm not sure I understand this, could you provide an example of the
> > > problem
> > > that the child array solves?
> > >
> > >
> > >
> > >
> > > On Wed, Sep 14, 2022 at 9:36 AM Weston Pace <weston.pace@gmail.com
> > > <ma...@gmail.com>> wrote:
> > >
> > >>  I'm going to bump this because it would be good to get feedback.  In
> > >>  particular it would be nice to get feedback on the suggested format
> > >>  change[1].  We are currently moving forward on coming up with an IPC
> > >>  format proposal which we will share when ready.
> > >>
> > >>  The two interesting points that jump out to me are:
> > >>
> > >>   * Should we encode "run lengths" or "run ends"?
> > >>
> > >>  For example, should 5,5,5,6,6,7,7,7,7 be encoded with "run lengths"
> > >> 3,
> > >>  2, 4 or "run ends" 3, 5, 9.  In the proposal the latter is preferred
> > >>  as that leads to O(log(N)) random access (instead of O(N)) and it's
> > >>  not clear there are any disadvantages.
> > >>
> > >>   * Should the run ends be encoded as a buffer or a child array?
> > >>
> > >>  The values are definitely a child array.  However, encoding the run
> > >>  ends as a buffer (similar to list array for example) makes it
> > >>  difficult to calculate offsets.  Translating an array offset to a
> > >>  buffer offset takes O(log(N)) time.  If the run ends are encoded as
> > >> a
> > >>  child array (so the RLE array has no buffers and two child arrays)
> > >>  then this problem goes away.
> > >>
> > >>  [1] <https://github.com/apache/arrow/pull/13333/files>
> > >>
> > >>  On Thu, Aug 25, 2022 at 10:35 AM Tobias Zagorni
> > >>  <tobias@zagorni.eu.invalid <ma...@zagorni.eu.invalid>>
> > >> wrote:
> > >>  >
> > >>  > Hello Everyone,
> > >>  >
> > >>  > Recently, I have implemented support for run-length encoding in
> > >> Arrow
> > >>  > C++. So far my implementation is split into different subtasks of
> > >>  > ARROW-16771 (<https://issues.apache.org/jira/browse/ARROW-16771>).
> > >>  >
> > >>  > I have (draft) PRs available for:
> > >>  > - general handling of RLE in arrow C++, Type, Arrow, Builder
> > >>  > subclasses, etc.
> > >>  >   (subtasks 1-9)
> > >>  > - encode, decode kernels (fixed size only):
> > >>  >   (<https://issues.apache.org/jira/browse/ARROW-16772>)
> > >>  > - filter kernel (fixed size only):
> > >>  >   (<https://issues.apache.org/jira/browse/ARROW-16774>)
> > >>  > - simple benchmark for the RLE kernels
> > >>  >   (<https://issues.apache.org/jira/browse/ARROW-17026>)
> > >>  > - adding RLE to Arrow Columnar format document
> > >>  >   (<https://issues.apache.org/jira/browse/ARROW-16773>)
> > >>  >
> > >>  > What is not yet implemented:
> > >>  > - converting RLE to formats like Parquet, JSON, IPC.
> > >>  > - caching of physical offsets when working with sliced arrays -
> > >> finding
> > >>  > these offsets is an  O(log(n)) binary search which could be
> > >> avoided in
> > >>  > a lot of cases
> > >>  >
> > >>  > I'm interested in any feedback on the code and I'm wondering what
> > >> would
> > >>  > be the best way to get this merged.
> > >>  >
> > >>  > A lot of the PRs depend on earlier ones. I ordered the subtasks
> > >> in a
> > >>  > way they could be merged. The first would be:
> > >>  > 1. Handling of array-only types using VisitTypeInline:
> > >>  >    <https://issues.apache.org/jira/browse/ARROW-17258>
> > >>  > 2. Adding RLE type / array class (only builds on #1):
> > >>  >    <https://issues.apache.org/jira/browse/ARROW-17261>
> > >>  > -  also, since it has no dependencies: adding RLE to Arrow
> > >> Columnar
> > >>  > format document
> > >>  >    <https://issues.apache.org/jira/browse/ARROW-16773>
> > >>  >
> > >>  > Best,
> > >>  > Tobias
> > >>
> >
> >

Re: PRs for RLE support

Posted by Dewey Dunnington <de...@voltrondata.com.INVALID>.

>  * Should we encode "run lengths" or "run ends"?

In addition to the points mentioned above, this seems the most consistent
with the variable-length binary/list layouts

> encoding the run ends as a buffer (similar to list array for example)
makes it difficult to calculate offsets

I don't have a strong opinion about this, but I also don't understand the
logic. Surely the implementation is just generating/reading a buffer of
integers and there's some overhead/indirection to wrapping it in an Array
(that must then be validated).

As a matter of curiosity, was a dictionary approach ever considered? If one
new layout was added (one buffer containing the run ends of a RLE 0:N int32
array), the dictionary member could be the values array and perhaps make it
easier for implementations that already handle dictionaries.





On Wed, Sep 14, 2022 at 2:04 PM Matthew Topol <ma...@voltrondata.com.invalid>
wrote:

> Just wanted to chime in here that I also have several draft PRs for
> implementing the RLE arrays in Go as the second implementation (since
> we use two implementations as a requirement to vote on
> changes/additions to the format).
>
> They can be found here:
>
> <https://github.com/apache/arrow/pull/14111>
> <https://github.com/apache/arrow/pull/14114>
> <https://github.com/apache/arrow/pull/14126>
>
> --Matt
>
> On Wed, Sep 14 2022 at 09:44:15 AM -0700, Micah Kornfield
> <em...@gmail.com> wrote:
> >>
> >>   * Should we encode "run lengths" or "run ends"?
> >
> >
> > I think the project has leaned towards sublinear access, so run ends
> > make
> > sense.  The downside is that we run into similar issues with
> > List/LargeList
> > where the total number of elements is limited by bit-width (which can
> > also
> > cause space wastage, e.g. with run ends it might be reasonable to
> > limit
> > bit-width to 16).
> >
> > The values are definitely a child array.  However, encoding the run
> >>  ends as a buffer (similar to list array for example) makes it
> >>  difficult to calculate offsets.  Translating an array offset to a
> >>  buffer offset takes O(log(N)) time.  If the run ends are encoded as
> >> a
> >>  child array (so the RLE array has no buffers and two child arrays)
> >>  then this problem goes away.
> >
> >
> > I'm not sure I understand this, could you provide an example of the
> > problem
> > that the child array solves?
> >
> >
> >
> >
> > On Wed, Sep 14, 2022 at 9:36 AM Weston Pace <weston.pace@gmail.com
> > <ma...@gmail.com>> wrote:
> >
> >>  I'm going to bump this because it would be good to get feedback.  In
> >>  particular it would be nice to get feedback on the suggested format
> >>  change[1].  We are currently moving forward on coming up with an IPC
> >>  format proposal which we will share when ready.
> >>
> >>  The two interesting points that jump out to me are:
> >>
> >>   * Should we encode "run lengths" or "run ends"?
> >>
> >>  For example, should 5,5,5,6,6,7,7,7,7 be encoded with "run lengths"
> >> 3,
> >>  2, 4 or "run ends" 3, 5, 9.  In the proposal the latter is preferred
> >>  as that leads to O(log(N)) random access (instead of O(N)) and it's
> >>  not clear there are any disadvantages.
> >>
> >>   * Should the run ends be encoded as a buffer or a child array?
> >>
> >>  The values are definitely a child array.  However, encoding the run
> >>  ends as a buffer (similar to list array for example) makes it
> >>  difficult to calculate offsets.  Translating an array offset to a
> >>  buffer offset takes O(log(N)) time.  If the run ends are encoded as
> >> a
> >>  child array (so the RLE array has no buffers and two child arrays)
> >>  then this problem goes away.
> >>
> >>  [1] <https://github.com/apache/arrow/pull/13333/files>
> >>
> >>  On Thu, Aug 25, 2022 at 10:35 AM Tobias Zagorni
> >>  <tobias@zagorni.eu.invalid <ma...@zagorni.eu.invalid>>
> >> wrote:
> >>  >
> >>  > Hello Everyone,
> >>  >
> >>  > Recently, I have implemented support for run-length encoding in
> >> Arrow
> >>  > C++. So far my implementation is split into different subtasks of
> >>  > ARROW-16771 (<https://issues.apache.org/jira/browse/ARROW-16771>).
> >>  >
> >>  > I have (draft) PRs available for:
> >>  > - general handling of RLE in arrow C++, Type, Arrow, Builder
> >>  > subclasses, etc.
> >>  >   (subtasks 1-9)
> >>  > - encode, decode kernels (fixed size only):
> >>  >   (<https://issues.apache.org/jira/browse/ARROW-16772>)
> >>  > - filter kernel (fixed size only):
> >>  >   (<https://issues.apache.org/jira/browse/ARROW-16774>)
> >>  > - simple benchmark for the RLE kernels
> >>  >   (<https://issues.apache.org/jira/browse/ARROW-17026>)
> >>  > - adding RLE to Arrow Columnar format document
> >>  >   (<https://issues.apache.org/jira/browse/ARROW-16773>)
> >>  >
> >>  > What is not yet implemented:
> >>  > - converting RLE to formats like Parquet, JSON, IPC.
> >>  > - caching of physical offsets when working with sliced arrays -
> >> finding
> >>  > these offsets is an  O(log(n)) binary search which could be
> >> avoided in
> >>  > a lot of cases
> >>  >
> >>  > I'm interested in any feedback on the code and I'm wondering what
> >> would
> >>  > be the best way to get this merged.
> >>  >
> >>  > A lot of the PRs depend on earlier ones. I ordered the subtasks
> >> in a
> >>  > way they could be merged. The first would be:
> >>  > 1. Handling of array-only types using VisitTypeInline:
> >>  >    <https://issues.apache.org/jira/browse/ARROW-17258>
> >>  > 2. Adding RLE type / array class (only builds on #1):
> >>  >    <https://issues.apache.org/jira/browse/ARROW-17261>
> >>  > -  also, since it has no dependencies: adding RLE to Arrow
> >> Columnar
> >>  > format document
> >>  >    <https://issues.apache.org/jira/browse/ARROW-16773>
> >>  >
> >>  > Best,
> >>  > Tobias
> >>
>
>

Re: PRs for RLE support

Posted by Matthew Topol <ma...@voltrondata.com.INVALID>.

Just wanted to chime in here that I also have several draft PRs for 
implementing the RLE arrays in Go as the second implementation (since 
we use two implementations as a requirement to vote on 
changes/additions to the format).

They can be found here:

<https://github.com/apache/arrow/pull/14111>
<https://github.com/apache/arrow/pull/14114>
<https://github.com/apache/arrow/pull/14126>

--Matt

On Wed, Sep 14 2022 at 09:44:15 AM -0700, Micah Kornfield 
<em...@gmail.com> wrote:
>> 
>>   * Should we encode "run lengths" or "run ends"?
> 
> 
> I think the project has leaned towards sublinear access, so run ends 
> make
> sense.  The downside is that we run into similar issues with 
> List/LargeList
> where the total number of elements is limited by bit-width (which can 
> also
> cause space wastage, e.g. with run ends it might be reasonable to 
> limit
> bit-width to 16).
> 
> The values are definitely a child array.  However, encoding the run
>>  ends as a buffer (similar to list array for example) makes it
>>  difficult to calculate offsets.  Translating an array offset to a
>>  buffer offset takes O(log(N)) time.  If the run ends are encoded as 
>> a
>>  child array (so the RLE array has no buffers and two child arrays)
>>  then this problem goes away.
> 
> 
> I'm not sure I understand this, could you provide an example of the 
> problem
> that the child array solves?
> 
> 
> 
> 
> On Wed, Sep 14, 2022 at 9:36 AM Weston Pace <weston.pace@gmail.com 
> <ma...@gmail.com>> wrote:
> 
>>  I'm going to bump this because it would be good to get feedback.  In
>>  particular it would be nice to get feedback on the suggested format
>>  change[1].  We are currently moving forward on coming up with an IPC
>>  format proposal which we will share when ready.
>> 
>>  The two interesting points that jump out to me are:
>> 
>>   * Should we encode "run lengths" or "run ends"?
>> 
>>  For example, should 5,5,5,6,6,7,7,7,7 be encoded with "run lengths" 
>> 3,
>>  2, 4 or "run ends" 3, 5, 9.  In the proposal the latter is preferred
>>  as that leads to O(log(N)) random access (instead of O(N)) and it's
>>  not clear there are any disadvantages.
>> 
>>   * Should the run ends be encoded as a buffer or a child array?
>> 
>>  The values are definitely a child array.  However, encoding the run
>>  ends as a buffer (similar to list array for example) makes it
>>  difficult to calculate offsets.  Translating an array offset to a
>>  buffer offset takes O(log(N)) time.  If the run ends are encoded as 
>> a
>>  child array (so the RLE array has no buffers and two child arrays)
>>  then this problem goes away.
>> 
>>  [1] <https://github.com/apache/arrow/pull/13333/files>
>> 
>>  On Thu, Aug 25, 2022 at 10:35 AM Tobias Zagorni
>>  <tobias@zagorni.eu.invalid <ma...@zagorni.eu.invalid>> 
>> wrote:
>>  >
>>  > Hello Everyone,
>>  >
>>  > Recently, I have implemented support for run-length encoding in 
>> Arrow
>>  > C++. So far my implementation is split into different subtasks of
>>  > ARROW-16771 (<https://issues.apache.org/jira/browse/ARROW-16771>).
>>  >
>>  > I have (draft) PRs available for:
>>  > - general handling of RLE in arrow C++, Type, Arrow, Builder
>>  > subclasses, etc.
>>  >   (subtasks 1-9)
>>  > - encode, decode kernels (fixed size only):
>>  >   (<https://issues.apache.org/jira/browse/ARROW-16772>)
>>  > - filter kernel (fixed size only):
>>  >   (<https://issues.apache.org/jira/browse/ARROW-16774>)
>>  > - simple benchmark for the RLE kernels
>>  >   (<https://issues.apache.org/jira/browse/ARROW-17026>)
>>  > - adding RLE to Arrow Columnar format document
>>  >   (<https://issues.apache.org/jira/browse/ARROW-16773>)
>>  >
>>  > What is not yet implemented:
>>  > - converting RLE to formats like Parquet, JSON, IPC.
>>  > - caching of physical offsets when working with sliced arrays - 
>> finding
>>  > these offsets is an  O(log(n)) binary search which could be 
>> avoided in
>>  > a lot of cases
>>  >
>>  > I'm interested in any feedback on the code and I'm wondering what 
>> would
>>  > be the best way to get this merged.
>>  >
>>  > A lot of the PRs depend on earlier ones. I ordered the subtasks 
>> in a
>>  > way they could be merged. The first would be:
>>  > 1. Handling of array-only types using VisitTypeInline:
>>  >    <https://issues.apache.org/jira/browse/ARROW-17258>
>>  > 2. Adding RLE type / array class (only builds on #1):
>>  >    <https://issues.apache.org/jira/browse/ARROW-17261>
>>  > -  also, since it has no dependencies: adding RLE to Arrow 
>> Columnar
>>  > format document
>>  >    <https://issues.apache.org/jira/browse/ARROW-16773>
>>  >
>>  > Best,
>>  > Tobias
>>

Re: PRs for RLE support

Posted by Micah Kornfield <em...@gmail.com>.

>
>  * Should we encode "run lengths" or "run ends"?


I think the project has leaned towards sublinear access, so run ends make
sense.  The downside is that we run into similar issues with List/LargeList
where the total number of elements is limited by bit-width (which can also
cause space wastage, e.g. with run ends it might be reasonable to limit
bit-width to 16).

The values are definitely a child array.  However, encoding the run
> ends as a buffer (similar to list array for example) makes it
> difficult to calculate offsets.  Translating an array offset to a
> buffer offset takes O(log(N)) time.  If the run ends are encoded as a
> child array (so the RLE array has no buffers and two child arrays)
> then this problem goes away.


I'm not sure I understand this, could you provide an example of the problem
that the child array solves?




On Wed, Sep 14, 2022 at 9:36 AM Weston Pace <we...@gmail.com> wrote:

> I'm going to bump this because it would be good to get feedback.  In
> particular it would be nice to get feedback on the suggested format
> change[1].  We are currently moving forward on coming up with an IPC
> format proposal which we will share when ready.
>
> The two interesting points that jump out to me are:
>
>  * Should we encode "run lengths" or "run ends"?
>
> For example, should 5,5,5,6,6,7,7,7,7 be encoded with "run lengths" 3,
> 2, 4 or "run ends" 3, 5, 9.  In the proposal the latter is preferred
> as that leads to O(log(N)) random access (instead of O(N)) and it's
> not clear there are any disadvantages.
>
>  * Should the run ends be encoded as a buffer or a child array?
>
> The values are definitely a child array.  However, encoding the run
> ends as a buffer (similar to list array for example) makes it
> difficult to calculate offsets.  Translating an array offset to a
> buffer offset takes O(log(N)) time.  If the run ends are encoded as a
> child array (so the RLE array has no buffers and two child arrays)
> then this problem goes away.
>
> [1] https://github.com/apache/arrow/pull/13333/files
>
> On Thu, Aug 25, 2022 at 10:35 AM Tobias Zagorni
> <to...@zagorni.eu.invalid> wrote:
> >
> > Hello Everyone,
> >
> > Recently, I have implemented support for run-length encoding in Arrow
> > C++. So far my implementation is split into different subtasks of
> > ARROW-16771 (https://issues.apache.org/jira/browse/ARROW-16771).
> >
> > I have (draft) PRs available for:
> > - general handling of RLE in arrow C++, Type, Arrow, Builder
> > subclasses, etc.
> >   (subtasks 1-9)
> > - encode, decode kernels (fixed size only):
> >   (https://issues.apache.org/jira/browse/ARROW-16772)
> > - filter kernel (fixed size only):
> >   (https://issues.apache.org/jira/browse/ARROW-16774)
> > - simple benchmark for the RLE kernels
> >   (https://issues.apache.org/jira/browse/ARROW-17026)
> > - adding RLE to Arrow Columnar format document
> >   (https://issues.apache.org/jira/browse/ARROW-16773)
> >
> > What is not yet implemented:
> > - converting RLE to formats like Parquet, JSON, IPC.
> > - caching of physical offsets when working with sliced arrays - finding
> > these offsets is an  O(log(n)) binary search which could be avoided in
> > a lot of cases
> >
> > I'm interested in any feedback on the code and I'm wondering what would
> > be the best way to get this merged.
> >
> > A lot of the PRs depend on earlier ones. I ordered the subtasks in a
> > way they could be merged. The first would be:
> > 1. Handling of array-only types using VisitTypeInline:
> >    https://issues.apache.org/jira/browse/ARROW-17258
> > 2. Adding RLE type / array class (only builds on #1):
> >    https://issues.apache.org/jira/browse/ARROW-17261
> > -  also, since it has no dependencies: adding RLE to Arrow Columnar
> > format document
> >    https://issues.apache.org/jira/browse/ARROW-16773
> >
> > Best,
> > Tobias
>

Re: PRs for RLE support

Posted by Weston Pace <we...@gmail.com>.

I'm going to bump this because it would be good to get feedback.  In
particular it would be nice to get feedback on the suggested format
change[1].  We are currently moving forward on coming up with an IPC
format proposal which we will share when ready.

The two interesting points that jump out to me are:

 * Should we encode "run lengths" or "run ends"?

For example, should 5,5,5,6,6,7,7,7,7 be encoded with "run lengths" 3,
2, 4 or "run ends" 3, 5, 9.  In the proposal the latter is preferred
as that leads to O(log(N)) random access (instead of O(N)) and it's
not clear there are any disadvantages.

 * Should the run ends be encoded as a buffer or a child array?

The values are definitely a child array.  However, encoding the run
ends as a buffer (similar to list array for example) makes it
difficult to calculate offsets.  Translating an array offset to a
buffer offset takes O(log(N)) time.  If the run ends are encoded as a
child array (so the RLE array has no buffers and two child arrays)
then this problem goes away.

[1] https://github.com/apache/arrow/pull/13333/files

On Thu, Aug 25, 2022 at 10:35 AM Tobias Zagorni
<to...@zagorni.eu.invalid> wrote:
>
> Hello Everyone,
>
> Recently, I have implemented support for run-length encoding in Arrow
> C++. So far my implementation is split into different subtasks of
> ARROW-16771 (https://issues.apache.org/jira/browse/ARROW-16771).
>
> I have (draft) PRs available for:
> - general handling of RLE in arrow C++, Type, Arrow, Builder
> subclasses, etc.
>   (subtasks 1-9)
> - encode, decode kernels (fixed size only):
>   (https://issues.apache.org/jira/browse/ARROW-16772)
> - filter kernel (fixed size only):
>   (https://issues.apache.org/jira/browse/ARROW-16774)
> - simple benchmark for the RLE kernels
>   (https://issues.apache.org/jira/browse/ARROW-17026)
> - adding RLE to Arrow Columnar format document
>   (https://issues.apache.org/jira/browse/ARROW-16773)
>
> What is not yet implemented:
> - converting RLE to formats like Parquet, JSON, IPC.
> - caching of physical offsets when working with sliced arrays - finding
> these offsets is an  O(log(n)) binary search which could be avoided in
> a lot of cases
>
> I'm interested in any feedback on the code and I'm wondering what would
> be the best way to get this merged.
>
> A lot of the PRs depend on earlier ones. I ordered the subtasks in a
> way they could be merged. The first would be:
> 1. Handling of array-only types using VisitTypeInline:
>    https://issues.apache.org/jira/browse/ARROW-17258
> 2. Adding RLE type / array class (only builds on #1):
>    https://issues.apache.org/jira/browse/ARROW-17261
> -  also, since it has no dependencies: adding RLE to Arrow Columnar
> format document
>    https://issues.apache.org/jira/browse/ARROW-16773
>
> Best,
> Tobias