Posted to dev@arrow.apache.org by Weston Pace <we...@gmail.com> on 2022/06/01 00:18:09 UTC

Re: [C++] Adding Run-Length Encoding to Arrow

> I don't think replacing Scalar compute paths with dedicated paths for
> RLE-encoded data would ever be a simplification. Also, when a kernel
> hasn't been upgraded with a native path for RLE data, former Scalar
> Datums would now be expanded to the full RLE-decoded version before
> running the kernel...

> Well, Arrow C++ does not have a notion of encoding distinct from the
> data type. Adding such a notion would risk breaking compatibility for
> all existing software that hasn't been upgraded to dispatch based on
> encoding.

I think these two things are related.  I would argue that
ValueDescr::Shape is already a notion of encoding and we are already
capable of dispatching based on this.  However, it is not very useful,
outside of a few internal performance hacks, because there is no way
for incoming serialized data to end up as a scalar and there is no way
to serialize output data as a scalar.

So it seems like there is a tradeoff.  If we add this then we
introduce the possibility that newer software can create data that
cannot be consumed by older software.  Too much of this and we risk
Arrow not being a consistent enough standard to offer benefits.

On the other hand, if we do not add this then we continue the status
quo, which is that newer software has to copy/expand data to fit into
the Arrow format.  Too much of this and we risk newer software using
other formats instead of Arrow to avoid the cost.

From what I've seen of the duckdb/velox use cases it seems they
revolve around in-memory transfers of data (possibly even in-process)
that are not persisted to a file.  In this case:
 * The cost of decoding / encoding is significant compared to the total runtime
 * There is more capability for hand-shaking / format negotiation and
so backwards compatibility isn't as essential.

> So, as a data point, most kernels currently don't have a native path for
> dictionary-encoded data; instead, they decode the data before working on it.

> For the same reason (lack of manpower), chances are that few kernels
> would ever get a native path for RLE-encoded data (or it would take a
> long time before the most common kernels are upgraded with such a native
> path).

I don't see this as a bad thing.  I think we are still very much in
the early days of columnar engines.  There aren't very many cases
where the runtime of the kernel itself is the bottleneck.  What I
would guess is most likely to happen is that some user will have a
particular use case that ends up greatly benefiting from alternate
encoding support for a small set of specific functions.  Planning for
the future isn't a bad idea.

On the other hand, if I look at this from an Acero perspective, we
have a lot of work to do and RLE encoding is not a critical priority
for performance at the moment.

On Tue, May 31, 2022 at 12:14 PM Wes McKinney <we...@gmail.com> wrote:
>
> I haven't had a chance to look at the branch in detail, but if you can
> provide a pointer to a specification or other details about the
> proposed memory format for RLE (basically: what would be added to the
> columnar documentation as well as the Flatbuffers schema files), it
> would be helpful so it can be circulated to some other interested
> parties working primarily outside of Arrow (e.g. DuckDB) who might
> like to converge on a standard especially given that it would be
> exported across the C data interface. Thanks!
>
> On Tue, May 31, 2022 at 3:25 PM Tobias Zagorni
> <to...@zagorni.eu.invalid> wrote:
> >
> > Hi,
> >
> > Am Dienstag, dem 31.05.2022 um 21:12 +0200 schrieb Antoine Pitrou:
> > >
> > > Hi,
> > >
> > > Le 31/05/2022 à 20:24, Tobias Zagorni a écrit :
> > > > Hi, I'm currently working on adding Run-Length encoding to arrow. I
> > > > created a function to run-length encode arrays here (currently only
> > > > for fixed length types):
> > > > https://github.com/apache/arrow/compare/master...zagto:rle?expand=1
> > > >
> > > > The general idea is that RLE data will be a nested data type, with a
> > > > single child holding a regular ArrayData of the type of the values,
> > > > but with the duplicate values removed. The parent contains a single
> > > > buffer of uint64 representing the run lengths: each element holds
> > > > the accumulated length of all runs from the first up to the current
> > > > one.
> > >
> > > That's an interesting approach. How do nulls work? Is the null bitmap
> > > stored in the child only?
> >
> > Exactly. Runs of nulls are just like other runs, but the respective
> > element of the child array is null. The parent does not have a null
> > buffer, since that would mean "run of length null" to me.
> >
> > > > What are the intended use cases for this:
> > > > - external engines want to provide run-length encoded data to work
> > > > on using arrow?
> > > > - more efficient Acero (and possibly somewhat simplified, since we
> > > > can now use RLE arrays to replace Scalar Datums)
> > >
> > > I don't think replacing Scalar compute paths with dedicated paths for
> > > RLE-encoded data would ever be a simplification. Also, when a kernel
> > > hasn't been upgraded with a native path for RLE data, former Scalar
> > > Datums would now be expanded to the full RLE-decoded version before
> > > running the kernel...
> >
> > > > Automatic kernel dispatch:
> > > > - Scalar kernels would likely just work on RLE data
> > >
> > > Unary scalar kernels may indeed just work, but n-ary kernels (where
> > > n > 1) would not, except in the happy case where run lengths are the
> > > same in all arguments.
> >
> > You're right
> >
> > >
> > > > - How should it be implemented? The current place for logic like
> > > > that seems to be the DispatchBest methods. These only allow
> > > > replacing the input types, but for this RLE scheme we would need to
> > > > make the kernel work on a child array of the input. This mechanism
> > > > would likely need to be extended a lot.
> > > > - For kernels which don't work on RLE data, should we automatically
> > > > call Encode and Decode kernels?
> > >
> > > So, as a data point, most kernels currently don't have a native path
> > > for dictionary-encoded data; instead, they decode the data before
> > > working on it.
> > >
> > > For the same reason (lack of manpower), chances are that few kernels
> > > would ever get a native path for RLE-encoded data (or it would take a
> > > long time before the most common kernels are upgraded with such a
> > > native path).
> > >
> > > > - Should RLE really be a data type? Dictionary encoding set an
> > > > example for this, but just like it, RLE is actually a different
> > > > encoding for arrays of the same data type.
> > >
> > > Well, Arrow C++ does not have a notion of encoding distinct from the
> > > data type. Adding such a notion would risk breaking compatibility for
> > > all existing software that hasn't been upgraded to dispatch based on
> > > encoding.
> > >
> > > Regards
> > >
> > > Antoine.
> >
> > Thanks for all the input!
> >
> > Best,
> > Tobias
> >