Posted to dev@arrow.apache.org by Tobias Zagorni <to...@zagorni.eu.INVALID> on 2022/05/31 18:24:53 UTC

[C++] Adding Run-Length Encoding to Arrow

Hi, I'm currently working on adding run-length encoding to Arrow. I
created a function to run-length-encode arrays here (currently only for
fixed-length types):
https://github.com/apache/arrow/compare/master...zagto:rle?expand=1

The general idea is that RLE data will be a nested data type, with a
single child holding a regular ArrayData of the type of the values, but
with the duplicate values removed. The parent contains a single buffer
of uint64 representing the run lengths: each entry holds the accumulated
length of all runs from the first up to the current one.

I'm interested to hear what you think about this.

Here are some points that came up during internal discussions:

What are the intended use cases for this?
- External engines that want to provide run-length-encoded data to be
worked on using Arrow?
- A more efficient Acero (and possibly a somewhat simplified one, since
we could then use RLE arrays to replace Scalar Datums)

Automatic kernel dispatch:
- Scalar kernels would likely just work on RLE data
- How should it be implemented? The current place for logic like that
seems to be the DispatchBest methods. These only allow replacing the
input types, but for this RLE scheme we would need to make the kernel
work on a child array of the input. This mechanism would likely need to
be extended a lot.
- For kernels which don't work on RLE data, should we automatically
call Encode and Decode kernels?
- Should RLE really be a data type? Dictionary encoding set a precedent
for this, but like dictionary encoding, RLE is really just a different
encoding for arrays of the same data type.
- There could be data where RLE is only beneficial on some batches,
while others are smaller without RLE. Supporting this would require
dispatching different kernels per batch. Should we support this
eventually?
- Kernel dispatch of encodings is probably more about handling
encodings in general, and somewhat separate from adding RLE. We have a
similar issue here: ARROW-11508

Format:
- What data type should we use for the run-length values? int32 would
save memory, but compared to int64 it would force us to encode very
long arrays as multiple arrays. Or should we support different types?
Especially for external systems working with Arrow, this would make
things more flexible, at the cost of additional complexity in Arrow.
- I made the length field of the ArrayData hold the physical number of
elements in the encoded array. The logical number can already be found
at the end of the accumulated run-lengths buffer. Also, this is likely
less confusing for code working on the physical data unaware of RLE. Is
there more to consider for this decision?
- Should we allow multiple runs of the same value following each other?
Otherwise we would either need a pass to correct this after a lot of
operations, or RLE-aware versions of those kernels.
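As a side note on the accumulated representation: it makes the logical length trivially available (the last entry) and turns random access into a binary search over run ends. A hypothetical sketch with a plain-vector layout, not the actual Arrow code:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Given the parent's accumulated-run-length buffer and the child's values,
// find the value at a logical index: the first run whose accumulated end
// exceeds the index is the run containing it.
int32_t ValueAtLogicalIndex(const std::vector<uint64_t>& accumulated,
                            const std::vector<int32_t>& values,
                            uint64_t logical_index) {
  auto it = std::upper_bound(accumulated.begin(), accumulated.end(),
                             logical_index);
  return values[static_cast<size_t>(it - accumulated.begin())];
}
```

With accumulated run lengths {3, 5, 6} and values {7, 2, 9}, logical indices 0..2 map to 7, indices 3..4 to 2, and index 5 to 9, while accumulated.back() gives the logical length 6.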

Original RFC for adding new types (including RLE):
https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq

Best,
Tobias


Re: [C++] Adding Run-Length Encoding to Arrow

Posted by Antoine Pitrou <an...@python.org>.
On 08/07/2022 at 15:19, Wes McKinney wrote:
> 
> * I believe that having a Type::RLE is the right approach in C++ and
> it makes dynamic dispatch everywhere in the library pretty
> straightforward.

+1 on this, as it will raise a nice NotImplemented error for existing 
code rather than crash or corrupt memory.

> * It's unclear to me whether it is a good idea for RLE to replace the
> use of Scalar values in the compute kernels sublibrary, especially
> after the refactoring that I've just done to drop the scalar-specific
> code paths from most kernels (and about to remove the rest once I
> finish refactoring the aggregate kernels). It feels complicated (as
> opposed to letting scalars and RLE co-exist peacefully) but I could be
> convinced otherwise -- it may be hard to arrive at a conclusion
> without some prototyping / exploration

Indeed, sounds rather complicated to me.

Regards

Antoine.

Re: [C++] Adding Run-Length Encoding to Arrow

Posted by Wes McKinney <we...@gmail.com>.
hi all,

Just catching up on this e-mail thread from last month. Since I've
been neck deep refactoring the kernels code the last few weeks I have
a few thoughts about this:

* How we implement and use RLE in the C++ library and Acero is
separate from how RLE will be represented in the Arrow IPC format and
C ABI. I think both the IPC representation and the different library
implementations should be given separate but thorough consideration.
There are some other projects like DuckDB and Velox which may already
have some RLE support, and given that they use the Arrow C interface, we
should strive if possible to create a C ABI representation that
everyone is happy to zero-copy. I will try to help with some of
this format alignment across projects that have an interest in
Arrow compatibility if that sounds good.

* I believe that having a Type::RLE is the right approach in C++ and
it makes dynamic dispatch everywhere in the library pretty
straightforward.

* It's unclear to me whether it is a good idea for RLE to replace the
use of Scalar values in the compute kernels sublibrary, especially
after the refactoring that I've just done to drop the scalar-specific
code paths from most kernels (and about to remove the rest once I
finish refactoring the aggregate kernels). It feels complicated (as
opposed to letting scalars and RLE co-exist peacefully) but I could be
convinced otherwise -- it may be hard to arrive at a conclusion
without some prototyping / exploration

When encountering RLE in the kernels, there will be a few cases:

a. Kernels with RLE-optimized code paths (e.g. aggregation, filtering)
b. Kernels which immediately must up-promote from RLE to non-RLE, or
iterate on a per-run basis, invoking the (array, scalar) code paths
for each run
c. Kernels which pass through to execute on the (usually much smaller)
values portion of the RLE array

Otherwise I am excited to see some progress on this (and relatedly,
being able to represent Constant arrays — possibly using the RLE
representation — in the IPC format would be a huge space-saving
benefit for some applications)

Thanks
Wes

On Thu, Jun 9, 2022 at 2:07 PM Sasha Krassovsky
<kr...@gmail.com> wrote:
>
> A format where run lengths and values are interleaved would almost certainly be worse than having them separate. For example, unary scalar kernel evaluation is exactly the same as on raw arrays when they are not interleaved. Further, in the context of vectorization, a vectorized load into the array would be a mix of run lengths and values - not good as you have to de-interleave them somehow anyway. Am I missing something - did you have an optimization in mind?
>
> Regarding a benchmark, I agree it could be useful. We need to decide what’s “reasonable”. There are pathological cases either way: you can be 2x slower than a raw column if you have a million runs of length 1 or you can be 1000000x faster if you have one run of length 1000000. Where I think RLE can shine most is if a filter column is RLE or if a group-by column is RLE. These are the two benchmarks that I would suggest writing.
>
> Sasha Krassovsky
>
> > On Jun 8, 2022, at 3:59 AM, Alessandro Molina <al...@voltrondata.com> wrote:
> >
> > RLE would probably have some benefits that it makes sense to evaluate. I
> > would personally go in the direction of having a minimal benchmarking suite
> > for some of the cases where we expect to see the most benefit (i.e. filtering)
> > so we can discuss with real numbers.
> >
> > Also, the currently proposed format separates run lengths and values; maybe a
> > format where run lengths and values are stored interleaved in the same
> > buffer might be able to allow more optimisations in the context of
> > vectorised operations, even though it might be harder to work with for
> > things that are not fixed width.
> >
> > On Tue, Jun 7, 2022 at 7:56 PM Tobias Zagorni <to...@zagorni.eu.invalid>
> > wrote:
> >
> >> I created a Jira for adding RLE as ARROW-16771, and draft PRs:
> >>
> >> - https://github.com/apache/arrow/pull/13330
> >>  Encode/Decode functions for (currently fixed width types only)
> >>
> >> - https://github.com/apache/arrow/pull/13333
> >>  For updating docs
> >>
> >> Best,
> >> Tobias
> >>
> >> On Tuesday, 31.05.2022, at 17:13 -0500, Wes McKinney wrote:
> >>> I haven't had a chance to look at the branch in detail, but if you
> >>> can
> >>> provide a pointer to a specification or other details about the
> >>> proposed memory format for RLE (basically: what would be added to the
> >>> columnar documentation as well as the Flatbuffers schema files), it
> >>> would be helpful so it can be circulated to some other interested
> >>> parties working primarily outside of Arrow (e.g. DuckDB) who might
> >>> like to converge on a standard especially given that it would be
> >>> exported across the C data interface. Thanks!
> >>
> >>
>

Re: [C++] Adding Run-Length Encoding to Arrow

Posted by Sasha Krassovsky <kr...@gmail.com>.
A format where run lengths and values are interleaved would almost certainly be worse than having them separate. For example, unary scalar kernel evaluation is exactly the same as on raw arrays when they are not interleaved. Further, in the context of vectorization, a vectorized load into the array would be a mix of run lengths and values - not good as you have to de-interleave them somehow anyway. Am I missing something - did you have an optimization in mind? 

Regarding a benchmark, I agree it could be useful. We need to decide what’s “reasonable”. There are pathological cases either way: you can be 2x slower than a raw column if you have a million runs of length 1 or you can be 1000000x faster if you have one run of length 1000000. Where I think RLE can shine most is if a filter column is RLE or if a group-by column is RLE. These are the two benchmarks that I would suggest writing. 

Sasha Krassovsky

> On Jun 8, 2022, at 3:59 AM, Alessandro Molina <al...@voltrondata.com> wrote:
> 
> RLE would probably have some benefits that it makes sense to evaluate. I
> would personally go in the direction of having a minimal benchmarking suite
> for some of the cases where we expect to see the most benefit (i.e. filtering)
> so we can discuss with real numbers.
> 
> Also, the currently proposed format separates run lengths and values; maybe a
> format where run lengths and values are stored interleaved in the same
> buffer might be able to allow more optimisations in the context of
> vectorised operations, even though it might be harder to work with for
> things that are not fixed width.
> 
> On Tue, Jun 7, 2022 at 7:56 PM Tobias Zagorni <to...@zagorni.eu.invalid>
> wrote:
> 
>> I created a Jira for adding RLE as ARROW-16771, and draft PRs:
>> 
>> - https://github.com/apache/arrow/pull/13330
>>  Encode/Decode functions for (currently fixed width types only)
>> 
>> - https://github.com/apache/arrow/pull/13333
>>  For updating docs
>> 
>> Best,
>> Tobias
>> 
>> On Tuesday, 31.05.2022, at 17:13 -0500, Wes McKinney wrote:
>>> I haven't had a chance to look at the branch in detail, but if you
>>> can
>>> provide a pointer to a specification or other details about the
>>> proposed memory format for RLE (basically: what would be added to the
>>> columnar documentation as well as the Flatbuffers schema files), it
>>> would be helpful so it can be circulated to some other interested
>>> parties working primarily outside of Arrow (e.g. DuckDB) who might
>>> like to converge on a standard especially given that it would be
>>> exported across the C data interface. Thanks!
>> 
>> 


Re: [C++] Adding Run-Length Encoding to Arrow

Posted by Alessandro Molina <al...@voltrondata.com>.
RLE would probably have some benefits that it makes sense to evaluate. I
would personally go in the direction of having a minimal benchmarking suite
for some of the cases where we expect to see the most benefit (i.e. filtering)
so we can discuss with real numbers.

Also, the currently proposed format separates run lengths and values; maybe a
format where run lengths and values are stored interleaved in the same
buffer might be able to allow more optimisations in the context of
vectorised operations, even though it might be harder to work with for
things that are not fixed width.

On Tue, Jun 7, 2022 at 7:56 PM Tobias Zagorni <to...@zagorni.eu.invalid>
wrote:

> I created a Jira for adding RLE as ARROW-16771, and draft PRs:
>
> - https://github.com/apache/arrow/pull/13330
>   Encode/Decode functions for (currently fixed width types only)
>
> - https://github.com/apache/arrow/pull/13333
>   For updating docs
>
> Best,
> Tobias
>
> On Tuesday, 31.05.2022, at 17:13 -0500, Wes McKinney wrote:
> > I haven't had a chance to look at the branch in detail, but if you
> > can
> > provide a pointer to a specification or other details about the
> > proposed memory format for RLE (basically: what would be added to the
> > columnar documentation as well as the Flatbuffers schema files), it
> > would be helpful so it can be circulated to some other interested
> > parties working primarily outside of Arrow (e.g. DuckDB) who might
> > like to converge on a standard especially given that it would be
> > exported across the C data interface. Thanks!
>
>

Re: [C++] Adding Run-Length Encoding to Arrow

Posted by Tobias Zagorni <to...@zagorni.eu.INVALID>.
I created a Jira for adding RLE as ARROW-16771, and draft PRs:

- https://github.com/apache/arrow/pull/13330
  Encode/Decode functions (currently for fixed-width types only)

- https://github.com/apache/arrow/pull/13333
  For updating docs

Best,
Tobias

On Tuesday, 31.05.2022, at 17:13 -0500, Wes McKinney wrote:
> I haven't had a chance to look at the branch in detail, but if you
> can
> provide a pointer to a specification or other details about the
> proposed memory format for RLE (basically: what would be added to the
> columnar documentation as well as the Flatbuffers schema files), it
> would be helpful so it can be circulated to some other interested
> parties working primarily outside of Arrow (e.g. DuckDB) who might
> like to converge on a standard especially given that it would be
> exported across the C data interface. Thanks!


Re: [C++] Adding Run-Length Encoding to Arrow

Posted by Weston Pace <we...@gmail.com>.
> I don't think replacing Scalar compute paths with dedicated paths for
> RLE-encoded data would ever be a simplification. Also, when a kernel
> hasn't been upgraded with a native path for RLE data, former Scalar
> Datums would now be expanded to the full RLE-decoded version before
> running the kernel...

> Well, Arrow C++ does not have a notion of encoding distinct from the
> data type. Adding such a notion would risk breaking compatibility for
> all existing software that hasn't been upgraded to dispatch based on
> encoding.

I think these two things are related.  I would argue that
ValueDescr::Shape is already a notion of encoding and we are already
capable of dispatching based on this.  However, it is not very useful,
outside of a few internal performance hacks, because there is no way
for incoming serialized data to end up as a scalar and there is no way
to serialize output data as a scalar.

So it seems like there is a tradeoff.  If we add this then we
introduce the possibility that newer software can create data that
cannot be consumed by older software.  Too much of this and the risk is
that Arrow is not enough of a consistent standard to offer benefits.

On the other hand, if we do not add this then we continue the status
quo, which is that newer software has to copy/expand data to fit into
the Arrow format.  Too much of this and we risk newer software using
other formats instead of Arrow to avoid the cost.

From what I've seen of the duckdb/velox use cases it seems they
revolve around in-memory transfers of data (possibly even in-process)
that are not persisted to a file.  In this case:
 * The cost of decoding / encoding is significant compared to the total runtime
 * There is more capability for hand-shaking / format negotiation and
so backwards compatibility isn't as essential.

> So, as a data point, most kernels currently don't have a native path for
> dictionary-encoded data; instead, they decode the data before working on it.

> For the same reason (lack of manpower), chances are that few kernels
> would ever get a native path for RLE-encoded data (or it would take a
> long time before the most common kernels are upgraded with such a native
> path).

I don't see this as a bad thing.  I think we are still very much in
the early days of columnar engines.  There aren't very many cases
where the runtime of the kernel itself is the bottleneck.  What I
would guess is most likely to happen is some user will have some
particular use case that will end up greatly benefiting from alternate
encoding support for a small set of specific functions.  Planning for
the future isn't a bad idea.

On the other hand, if I look at this from an Acero perspective, we
have a lot of work to do and RLE encoding is not a critical priority
for performance at the moment.

On Tue, May 31, 2022 at 12:14 PM Wes McKinney <we...@gmail.com> wrote:
>
> I haven't had a chance to look at the branch in detail, but if you can
> provide a pointer to a specification or other details about the
> proposed memory format for RLE (basically: what would be added to the
> columnar documentation as well as the Flatbuffers schema files), it
> would be helpful so it can be circulated to some other interested
> parties working primarily outside of Arrow (e.g. DuckDB) who might
> like to converge on a standard especially given that it would be
> exported across the C data interface. Thanks!
>
> On Tue, May 31, 2022 at 3:25 PM Tobias Zagorni
> <to...@zagorni.eu.invalid> wrote:
> >
> > Hi,
> >
> > On Tuesday, 31.05.2022, at 21:12 +0200, Antoine Pitrou wrote:
> > >
> > > Hi,
> > >
> > > On 31/05/2022 at 20:24, Tobias Zagorni wrote:
> > > > Hi, I'm currently working on adding Run-Length encoding to arrow. I
> > > > created a function to dictionary-encode arrays here (currently only
> > > > for
> > > > fixed length types):
> > > > https://github.com/apache/arrow/compare/master...zagto:rle?expand=1
> > > >
> > > > The general idea is that RLE data will be a nested data type, with
> > > > a
> > > > single child holding a regular ArrayData of the type of the values,
> > > > but
> > > > with the duplicate values removed. The parent contains a single
> > > > buffer
> > > > of uint64 representing the run lengths by holding the run length of
> > > > all
> > > > runs from the first to the current one
> > >
> > > That's an interesting approach. How do nulls work? Is the null bitmap
> > > stored in the child only?
> >
> > Exactly. Runs of nulls are just like other runs, but the respective
> > element of the child array is null. The parent does not have a
> > null buffer since that would mean "run of length null" to me.
> >
> > > > What are the intended use cases for this:
> > > > - external engines want to provide run-length encoded data to work
> > > > on
> > > > using arrow?
> > > > - more efficient ACERO (and possibly somewhat simplified, since we
> > > > can
> > > > now use RLE arrays to replace Scalar Datums)
> > >
> > > I don't think replacing Scalar compute paths with dedicated paths for
> > > RLE-encoded data would ever be a simplification. Also, when a kernel
> > > hasn't been upgraded with a native path for RLE data, former Scalar
> > > Datums would now be expanded to the full RLE-decoded version before
> > > running the kernel...
> >
> > > > Automatic kernel dispatch:
> > > > - Scalar kernels would likely just work on RLE data
> > >
> > > Unary scalar kernels may indeed just work, but n-ary kernels
> > > (where n > 1) would not, except in the happy case where run lengths
> > > are the same in all arguments.
> >
> > You're right
> >
> > >
> > > > - How should it be implemented? The current place for logic like
> > > > that
> > > > seems to be the DispatchBest methods. These only allow to replace
> > > > the
> > > > input types, but for this RLE scheme we would need to make the
> > > > kernel
> > > > work on a child array of the input. This mechanism would likely
> > > > need to
> > > > be extended a lot.
> > > > - For kernels which don't work on RLE data, should we automatically
> > > > call Encode and Decode kernels?
> > >
> > > So, as a data point, most kernels currently don't have a native path
> > > for
> > > dictionary-encoded data; instead, they decode the data before working
> > > on it.
> > >
> > > For the same reason (lack of manpower), chances are that few kernels
> > > would ever get a native path for RLE-encoded data (or it would take a
> > > long time before the most common kernels are upgraded with such a
> > > native
> > > path).
> > >
> > > > - Should RLE really be a data type? Dictionary encoding set an
> > > > example
> > > > for this, but just like it, RLE is actually a different encoding
> > > > for
> > > > arrays of the same data type.
> > >
> > > Well, Arrow C++ does not have a notion of encoding distinct from the
> > > data type. Adding such a notion would risk breaking compatibility for
> > > all existing software that hasn't been upgraded to dispatch based on
> > > encoding.
> > >
> > > Regards
> > >
> > > Antoine.
> >
> > Thanks for all the input!
> >
> > Best,
> > Tobias
> >

Re: [C++] Adding Run-Length Encoding to Arrow

Posted by Wes McKinney <we...@gmail.com>.
I haven't had a chance to look at the branch in detail, but if you can
provide a pointer to a specification or other details about the
proposed memory format for RLE (basically: what would be added to the
columnar documentation as well as the Flatbuffers schema files), it
would be helpful so it can be circulated to some other interested
parties working primarily outside of Arrow (e.g. DuckDB) who might
like to converge on a standard especially given that it would be
exported across the C data interface. Thanks!

On Tue, May 31, 2022 at 3:25 PM Tobias Zagorni
<to...@zagorni.eu.invalid> wrote:
>
> Hi,
>
> On Tuesday, 31.05.2022, at 21:12 +0200, Antoine Pitrou wrote:
> >
> > Hi,
> >
> > On 31/05/2022 at 20:24, Tobias Zagorni wrote:
> > > Hi, I'm currently working on adding Run-Length encoding to arrow. I
> > > created a function to dictionary-encode arrays here (currently only
> > > for
> > > fixed length types):
> > > https://github.com/apache/arrow/compare/master...zagto:rle?expand=1
> > >
> > > The general idea is that RLE data will be a nested data type, with
> > > a
> > > single child holding a regular ArrayData of the type of the values,
> > > but
> > > with the duplicate values removed. The parent contains a single
> > > buffer
> > > of uint64 representing the run lengths by holding the run length of
> > > all
> > > runs from the first to the current one
> >
> > That's an interesting approach. How do nulls work? Is the null bitmap
> > stored in the child only?
>
> Exactly. Runs of nulls are just like other runs, but the respective
> element of the child array is null. The parent does not have a
> null buffer since that would mean "run of length null" to me.
>
> > > What are the intended use cases for this:
> > > - external engines want to provide run-length encoded data to work
> > > on
> > > using arrow?
> > > - more efficient ACERO (and possibly somewhat simplified, since we
> > > can
> > > now use RLE arrays to replace Scalar Datums)
> >
> > I don't think replacing Scalar compute paths with dedicated paths for
> > RLE-encoded data would ever be a simplification. Also, when a kernel
> > hasn't been upgraded with a native path for RLE data, former Scalar
> > Datums would now be expanded to the full RLE-decoded version before
> > running the kernel...
>
> > > Automatic kernel dispatch:
> > > - Scalar kernels would likely just work on RLE data
> >
> > Unary scalar kernels may indeed just work, but n-ary kernels
> > (where n > 1) would not, except in the happy case where run lengths
> > are the same in all arguments.
>
> You're right
>
> >
> > > - How should it be implemented? The current place for logic like
> > > that
> > > seems to be the DispatchBest methods. These only allow to replace
> > > the
> > > input types, but for this RLE scheme we would need to make the
> > > kernel
> > > work on a child array of the input. This mechanism would likely
> > > need to
> > > be extended a lot.
> > > - For kernels which don't work on RLE data, should we automatically
> > > call Encode and Decode kernels?
> >
> > So, as a data point, most kernels currently don't have a native path
> > for
> > dictionary-encoded data; instead, they decode the data before working
> > on it.
> >
> > For the same reason (lack of manpower), chances are that few kernels
> > would ever get a native path for RLE-encoded data (or it would take a
> > long time before the most common kernels are upgraded with such a
> > native
> > path).
> >
> > > - Should RLE really be a data type? Dictionary encoding set an
> > > example
> > > for this, but just like it, RLE is actually a different encoding
> > > for
> > > arrays of the same data type.
> >
> > Well, Arrow C++ does not have a notion of encoding distinct from the
> > data type. Adding such a notion would risk breaking compatibility for
> > all existing software that hasn't been upgraded to dispatch based on
> > encoding.
> >
> > Regards
> >
> > Antoine.
>
> Thanks for all the input!
>
> Best,
> Tobias
>

Re: [C++] Adding Run-Length Encoding to Arrow

Posted by Tobias Zagorni <to...@zagorni.eu.INVALID>.
Hi, 

On Tuesday, 31.05.2022, at 21:12 +0200, Antoine Pitrou wrote:
> 
> Hi,
> 
> On 31/05/2022 at 20:24, Tobias Zagorni wrote:
> > Hi, I'm currently working on adding Run-Length encoding to arrow. I
> > created a function to dictionary-encode arrays here (currently only
> > for
> > fixed length types):
> > https://github.com/apache/arrow/compare/master...zagto:rle?expand=1
> > 
> > The general idea is that RLE data will be a nested data type, with
> > a
> > single child holding a regular ArrayData of the type of the values,
> > but
> > with the duplicate values removed. The parent contains a single
> > buffer
> > of uint64 representing the run lengths by holding the run length of
> > all
> > runs from the first to the current one
> 
> That's an interesting approach. How do nulls work? Is the null bitmap
> stored in the child only?

Exactly. Runs of nulls are just like other runs, but the respective
element of the child array is null. The parent does not have a
null buffer since that would mean "run of length null" to me.

> > What are the intended use cases for this:
> > - external engines want to provide run-length encoded data to work
> > on
> > using arrow?
> > - more efficient ACERO (and possibly somewhat simplified, since we
> > can
> > now use RLE arrays to replace Scalar Datums)
> 
> I don't think replacing Scalar compute paths with dedicated paths for
> RLE-encoded data would ever be a simplification. Also, when a kernel 
> hasn't been upgraded with a native path for RLE data, former Scalar 
> Datums would now be expanded to the full RLE-decoded version before 
> running the kernel...

> > Automatic kernel dispatch:
> > - Scalar kernels would likely just work on RLE data
> 
> Unary scalar kernels may indeed just work, but n-ary kernels
> (where n > 1) would not, except in the happy case where run lengths
> are the same in all arguments.

You're right 

> 
> > - How should it be implemented? The current place for logic like
> > that
> > seems to be the DispatchBest methods. These only allow to replace
> > the
> > input types, but for this RLE scheme we would need to make the
> > kernel
> > work on a child array of the input. This mechanism would likely
> > need to
> > be extended a lot.
> > - For kernels which don't work on RLE data, should we automatically
> > call Encode and Decode kernels?
> 
> So, as a data point, most kernels currently don't have a native path
> for 
> dictionary-encoded data; instead, they decode the data before working
> on it.
> 
> For the same reason (lack of manpower), chances are that few kernels 
> would ever get a native path for RLE-encoded data (or it would take a
> long time before the most common kernels are upgraded with such a
> native 
> path).
> 
> > - Should RLE really be a data type? Dictionary encoding set an
> > example
> > for this, but just like it, RLE is actually a different encoding
> > for
> > arrays of the same data type.
> 
> Well, Arrow C++ does not have a notion of encoding distinct from the 
> data type. Adding such a notion would risk breaking compatibility for
> all existing software that hasn't been upgraded to dispatch based on 
> encoding.
> 
> Regards
> 
> Antoine.

Thanks for all the input!

Best,
Tobias


Re: [C++] Adding Run-Length Encoding to Arrow

Posted by Andrew Lamb <al...@influxdata.com>.
I think the biggest benefit of RLE is not on-the-wire compression, as that
can be done via more general purpose compression schemes as Antoine
mentions.

The biggest benefit of RLE is that it allows operating directly and very
efficiently on the "encoded" form -- for example, you can apply filters
directly to RLE-encoded data, as well as update aggregations. The benefit
can be especially large when operating on data stored in formats such as
Parquet, which already use RLE, since the size of the decoded values prior
to filtering can be much lower.

Andrew

On Fri, Jun 3, 2022 at 12:52 PM Tobias Zagorni <to...@zagorni.eu.invalid>
wrote:

> On Friday, 03.06.2022, at 09:32 -0700, Micah Kornfield wrote:
> > >
> > > Thinking about compatibility with existing software, RLE could
> > > possibly even be made an Extension Type that follows the layout of a
> > > struct of int32 and the encoded value type. I'm wondering whether
> > > this would be better for compatibility.
> >
> >
> > I might be misunderstanding this proposal, but I don't think this
> > works. Wouldn't the structs with RLE have different row lengths than
> > any Array in the same record batch or table?  I think this means that
> > validation would fail on them.
>
> I think you understood it correctly. I'm not really familiar with
> validation in Arrow and didn't think of that problem.
>
> Best,
> Tobias
> >
>
>

Re: [C++] Adding Run-Length Encoding to Arrow

Posted by Tobias Zagorni <to...@zagorni.eu.INVALID>.
On Friday, 03.06.2022, at 09:32 -0700, Micah Kornfield wrote:
> > 
> > Thinking about compatibility with existing software, RLE could
> > possibly even be made an Extension Type that follows the layout of a
> > struct of int32 and the encoded value type. I'm wondering whether
> > this would be better for compatibility.
> 
> 
> I might be misunderstanding this proposal, but I don't think this
> works. Wouldn't the structs with RLE have different row lengths than
> any Array in the same record batch or table?  I think this means that
> validation would fail on them.

I think you understood it correctly. I'm not really familiar with
validation in Arrow and didn't think of that problem.

Best,
Tobias
> 


Re: [C++] Adding Run-Length Encoding to Arrow

Posted by Micah Kornfield <em...@gmail.com>.
>
> Thinking about compatibility with existing software, RLE could possibly
> even be made an Extension Type that follows the layout of a struct of
> int32 and the encoded value type. I'm wondering whether this would be
> better for compatibility.


I might be misunderstanding this proposal, but I don't think this
works. Wouldn't the structs with RLE have different row lengths than any
Array in the same record batch or table?  I think this means that
validation would fail on them.


On Fri, Jun 3, 2022 at 8:22 AM Tobias Zagorni <to...@zagorni.eu.invalid>
wrote:

> > Well, Arrow C++ does not have a notion of encoding distinct from the
> > data type. Adding such a notion would risk breaking compatibility for
> > all existing software that hasn't been upgraded to dispatch based on
> > encoding.
>
> Thinking about compatibility with existing software, RLE could possibly
> even be made an Extension Type that follows the layout of a struct of
> int32 and the encoded value type. I'm wondering whether this would be
> better for compatibility.
>
> The format is very similar to what was proposed before, except that
> the indices now become their own child array, and there is an
> additional validity bitmap in the parent that could just be NULLPTR.
> Another advantage here would be that it is obvious how to add support
> for other run-length types than int32 if needed.
>
> Best,
> Tobias
>

Re: [C++] Adding Run-Length Encoding to Arrow

Posted by Tobias Zagorni <to...@zagorni.eu.INVALID>.
> Well, Arrow C++ does not have a notion of encoding distinct from the 
> data type. Adding such a notion would risk breaking compatibility for
> all existing software that hasn't been upgraded to dispatch based on 
> encoding.

Thinking about compatibility with existing software, RLE could possibly
even be made an Extension Type that follows the layout of a struct of
int32 and the encoded value type. I'm wondering whether this would be
better for compatibility.

The format is very similar to what was proposed before, except that the
indices now become their own child array, and there is an additional
validity bitmap in the parent that could just be NULLPTR.
Another advantage here would be that it is obvious how to add support
for other run-length types than int32 if needed.

Best,
Tobias

Re: [C++] Adding Run-Length Encoding to Arrow

Posted by Neal Richardson <ne...@gmail.com>.
Would it make sense to make a draft PR with your branch so that folks can
comment on specific parts of it?

Neal

On Wed, Jun 1, 2022 at 10:20 AM Tobias Zagorni <to...@zagorni.eu.invalid>
wrote:

On Tuesday, 31.05.2022 at 12:41 -0700, Micah Kornfield wrote:
> >
> > - Should we allow multiple runs of the same value following each
> > other?
> > > Otherwise we would either need a pass to correct this after a lot
> > > of operations, or make RLE-aware versions of their kernels.
> >
> > Is there any benefit you see in disallowing it?
>
> Some operations would be simpler. The one I can think of off the top
> of my head is an equality check between whole arrays: it could just
> memcmp the run-length buffers and compare the child arrays like you
> would compare any array of the child type. Probably not worth the
> complexity of disallowing it, but maybe someone knows more important
> cases.
>
> >
> Best,
> Tobias
>

Re: [C++] Adding Run-Length Encoding to Arrow

Posted by Tobias Zagorni <to...@zagorni.eu.INVALID>.
On Tuesday, 31.05.2022 at 12:41 -0700, Micah Kornfield wrote:
> 
> - Should we allow multiple runs of the same value following each
> other?
> > Otherwise we would either need a pass to correct this after a lot
> > of operations, or make RLE-aware versions of their kernels.
> 
> Is there any benefit you see in disallowing it?

Some operations would be simpler. The one I can think of off the top of
my head is an equality check between whole arrays: it could just memcmp
the run-length buffers and compare the child arrays like you would
compare any array of the child type. Probably not worth the complexity
of disallowing it, but maybe someone knows more important cases.

> 
Best,
Tobias

Re: [C++] Adding Run-Length Encoding to Arrow

Posted by Antoine Pitrou <an...@python.org>.
On 31/05/2022 21:41, Micah Kornfield wrote:
>>
>> I'm currently working on adding Run-Length encoding to arrow.
> 
> Nice
> 
> 
>> What are the intended use cases for this:
>> - external engines want to provide run-length encoded data to work on
>> using arrow?
>>
> It is more than just external engines.  Many popular file formats support
> RLE encoding.  Being able to natively transfer this into Arrow memory
> without exploding it saves CPU time. Similarly, for things like Flight it
> can dramatically decrease data on the wire.

For things like Flight (or more generally the IPC format) you can also 
enable LZ4 compression (which costs CPU time, but will probably save 
space much more efficiently than a crude RLE scheme with 32-bit offsets).

Regards

Antoine.

Re: [C++] Adding Run-Length Encoding to Arrow

Posted by Micah Kornfield <em...@gmail.com>.
>
> I'm currently working on adding Run-Length encoding to arrow.

Nice


> What are the intended use cases for this:
> - external engines want to provide run-length encoded data to work on
> using arrow?
>
It is more than just external engines.  Many popular file formats support
RLE encoding.  Being able to natively transfer this into Arrow memory
without exploding it saves CPU time. Similarly, for things like Flight it
can dramatically decrease data on the wire.


> - more efficient ACERO (and possibly somewhat simplified, since we can
> now use RLE arrays to replace Scalar Datums)


It can certainly improve efficiency in some cases, and if we don't have
optimizations for RLE then there is less benefit to surfacing it from
file formats, etc.

> What data type should we use for the run-length values? int32 would
> save memory, but force us to encode very long arrays as multiple
> arrays, compared to int64. Or should we support different types?

I think we should target one type for now; int32 seems pretty reasonable
to me, but have a model that allows flexibility if we need to have more
types or other schemes (e.g. use delta encoding).


> - Should we allow multiple runs of the same value following each other?
> Otherwise we would either need a pass to correct this after a lot of
> operations, or make RLE-aware versions of their kernels.

Is there any benefit you see in disallowing it?


> - Should RLE really be a data type? Dictionary encoding set an example
> for this, but just like it, RLE is actually a different encoding for
> arrays of the same data type.


> Well, Arrow C++ does not have a notion of encoding distinct from the
> data type. Adding such a notion would risk breaking compatibility for
> all existing software that hasn't been upgraded to dispatch based on
> encoding.


When I thought about this [1], making a separate message type in the
format and modelling encoding separately from type made sense.


[1] https://github.com/apache/arrow/pull/4815





On Tue, May 31, 2022 at 12:13 PM Antoine Pitrou <an...@python.org> wrote:

>
> Hi,
>
> On 31/05/2022 20:24, Tobias Zagorni wrote:
> > Hi, I'm currently working on adding Run-Length encoding to arrow. I
> > created a function to dictionary-encode arrays here (currently only for
> > fixed length types):
> > https://github.com/apache/arrow/compare/master...zagto:rle?expand=1
> >
> > The general idea is that RLE data will be a nested data type, with a
> > single child holding a regular ArrayData of the type of the values, but
> > with the duplicate values removed. The parent contains a single buffer
> > of uint64 representing the run lengths by holding the run length of all
> > runs from the first to the current one
>
> That's an interesting approach. How do nulls work? Is the null bitmap
> stored in the child only?
>
> > What are the intended use cases for this:
> > - external engines want to provide run-length encoded data to work on
> > using arrow?
> > - more efficient ACERO (and possibly somewhat simplified, since we can
> > now use RLE arrays to replace Scalar Datums)
>
> I don't think replacing Scalar compute paths with dedicated paths for
> RLE-encoded data would ever be a simplification. Also, when a kernel
> hasn't been upgraded with a native path for RLE data, former Scalar
> Datums would now be expanded to the full RLE-decoded version before
> running the kernel...
>
> > Automatic kernel dispatch:
> > - Scalar kernels would likely just work on RLE data
>
> Unary scalar kernels may indeed just work, but n-ary kernels (where n >
> 1) would not, except in the happy case where run lengths are the same in
> all arguments.
>
> > - How should it be implemented? The current place for logic like that
> > seems to be the DispatchBest methods. These only allow to replace the
> > input types, but for this RLE scheme we would need to make the kernel
> > work on a child array of the input. This mechanism would likely need to
> > be extended a lot.
> > - For kernels which don't work on RLE data, should we automatically
> > call Encode and Decode kernels?
>
> So, as a data point, most kernels currently don't have a native path for
> dictionary-encoded data; instead, they decode the data before working on
> it.
>
> For the same reason (lack of manpower), chances are that few kernels
> would ever get a native path for RLE-encoded data (or it would take a
> long time before the most common kernels are upgraded with such a native
> path).
>
> > - Should RLE really be a data type? Dictionary encoding set an example
> > for this, but just like it, RLE is actually a different encoding for
> > arrays of the same data type.
>
> Well, Arrow C++ does not have a notion of encoding distinct from the
> data type. Adding such a notion would risk breaking compatibility for
> all existing software that hasn't been upgraded to dispatch based on
> encoding.
>
> Regards
>
> Antoine.
>

Re: [C++] Adding Run-Length Encoding to Arrow

Posted by Antoine Pitrou <an...@python.org>.
Hi,

On 31/05/2022 20:24, Tobias Zagorni wrote:
> Hi, I'm currently working on adding Run-Length encoding to arrow. I
> created a function to dictionary-encode arrays here (currently only for
> fixed length types):
> https://github.com/apache/arrow/compare/master...zagto:rle?expand=1
> 
> The general idea is that RLE data will be a nested data type, with a
> single child holding a regular ArrayData of the type of the values, but
> with the duplicate values removed. The parent contains a single buffer
> of uint64 representing the run lengths by holding the run length of all
> runs from the first to the current one

That's an interesting approach. How do nulls work? Is the null bitmap 
stored in the child only?

> What are the intended use cases for this:
> - external engines want to provide run-length encoded data to work on
> using arrow?
> - more efficient ACERO (and possibly somewhat simplified, since we can
> now use RLE arrays to replace Scalar Datums)

I don't think replacing Scalar compute paths with dedicated paths for 
RLE-encoded data would ever be a simplification. Also, when a kernel 
hasn't been upgraded with a native path for RLE data, former Scalar 
Datums would now be expanded to the full RLE-decoded version before 
running the kernel...

> Automatic kernel dispatch:
> - Scalar kernels would likely just work on RLE data

Unary scalar kernels may indeed just work, but n-ary kernels (where n > 
1) would not, except in the happy case where run lengths are the same in 
all arguments.

> - How should it be implemented? The current place for logic like that
> seems to be the DispatchBest methods. These only allow to replace the
> input types, but for this RLE scheme we would need to make the kernel
> work on a child array of the input. This mechanism would likely need to
> be extended a lot.
> - For kernels which don't work on RLE data, should we automatically
> call Encode and Decode kernels?

So, as a data point, most kernels currently don't have a native path for 
dictionary-encoded data; instead, they decode the data before working on it.

For the same reason (lack of manpower), chances are that few kernels 
would ever get a native path for RLE-encoded data (or it would take a 
long time before the most common kernels are upgraded with such a native 
path).

> - Should RLE really be a data type? Dictionary encoding set an example
> for this, but just like it, RLE is actually a different encoding for
> arrays of the same data type.

Well, Arrow C++ does not have a notion of encoding distinct from the 
data type. Adding such a notion would risk breaking compatibility for 
all existing software that hasn't been upgraded to dispatch based on 
encoding.

Regards

Antoine.
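The cumulative run-length buffer from the original proposal can be sketched in plain Python (illustrative only, not the actual Arrow C++ implementation; the function names are hypothetical): each entry in the parent buffer holds the total length of all runs up to and including the current one.

```python
def rle_encode(values):
    """Encode a sequence as (run_ends, run_values): run_ends[i] holds
    the cumulative length of runs 0..i, as in the proposed parent
    buffer."""
    run_ends, run_values = [], []
    for idx, v in enumerate(values):
        if run_values and run_values[-1] == v:
            run_ends[-1] = idx + 1          # extend the current run
        else:
            run_values.append(v)            # start a new run
            run_ends.append(idx + 1)
    return run_ends, run_values

def rle_decode(run_ends, run_values):
    """Expand back to plain values; each run's length is the difference
    between its cumulative end and the previous one."""
    out, prev = [], 0
    for end, v in zip(run_ends, run_values):
        out.extend([v] * (end - prev))
        prev = end
    return out

data = [4, 4, 4, 4, 7, 7, 0]
ends, vals = rle_encode(data)
assert ends == [4, 6, 7] and vals == [4, 7, 0]
assert rle_decode(ends, vals) == data
```

One property of storing cumulative ends rather than per-run lengths is that a logical index can be mapped to its run by binary search over the run-end buffer, giving O(log n) random access instead of a linear scan.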