Posted to dev@arrow.apache.org by Evan Chan <ev...@apple.com.INVALID> on 2020/03/10 04:15:48 UTC

Summary of RLE and other compression efforts?

Hi folks,

I’m curious about the state of efforts for more compressed encodings in the Arrow columnar format.  I saw discussions previously about RLE, but is there a place to summarize all of the different efforts that are ongoing to bring more compressed encodings?

Is there an effort to compress floating point or integer data using techniques such as XOR compression and Delta-Delta?  I can contribute to some of these efforts as well.
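
For readers less familiar with the terms: delta-delta (delta-of-delta) encoding stores the first value, the first delta, and then only the differences between consecutive deltas, which for regular series such as timestamps are mostly zeros. A minimal sketch in Python/NumPy, for illustration only (this is not an existing Arrow or Parquet codec):

    import numpy as np

    def delta_delta_encode(values):
        # Assumes len(values) >= 2.  Regular series (e.g. timestamps sampled at a
        # fixed interval) produce delta-of-delta values that are mostly zero.
        deltas = np.diff(values)
        return values[0], deltas[0], np.diff(deltas)

    def delta_delta_decode(first, first_delta, dod):
        deltas = np.concatenate(([first_delta], first_delta + np.cumsum(dod)))
        return np.concatenate(([first], first + np.cumsum(deltas)))

    ts = np.arange(1_000_000, 1_000_600, 60, dtype=np.int64)   # one sample per minute
    enc = delta_delta_encode(ts)
    assert np.array_equal(delta_delta_decode(*enc), ts)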

Thanks,
Evan



Re: Summary of RLE and other compression efforts?

Posted by Antoine Pitrou <an...@python.org>.
Hi,

On 11/03/2020 at 06:31, Micah Kornfield wrote:
> 
> I still think we should be careful on what is added to the spec, in
> particular, we should be focused on encodings that can be used to improve
> computational efficiency rather than just smaller size. Also, it is
> important to note that any sort of encoding/compression must be supportable
> across multiple languages/platforms.

I don't have much to add, but I definitely agree with this.

Regards

Antoine.

Re: Summary of RLE and other compression efforts?

Posted by Wes McKinney <we...@gmail.com>.
On Wed, Mar 11, 2020 at 11:24 AM Evan Chan <ev...@apple.com.invalid> wrote:
>
> Sure thing.
>
> Computation speed needs to be thought about in context....  We might find something which takes up half the space to be a little more computationally expensive, but in the grand scheme of things is faster to compute as more of it can fit in memory, and it saves I/O.   I definitely agree that just making something smaller without other gains might not be worth it.
>
> Some small changes help both computation and size.  For example, encoding nulls in dictionary values helps reduce the need for both bitmap storage and lookup.
>
> So I suppose the process goes something like this?
>
> 1) Propose designs in this email list
> 2) PR for format specification
> 3) Start implementing across diff languages

Yes, though we'll have to adopt changes to the specification with a
formal PMC vote. The vote can come after the PR details have been
worked out.

This may take some time since we want to make sure we're soliciting
enough opinions from different communities (e.g. getting feedback from
people in the analytic database world would be especially valuable).

> -Evan
>
> > On Mar 10, 2020, at 10:31 PM, Micah Kornfield <em...@gmail.com> wrote:
> >
> > +1 to what Wes said.
> >
> > I'm hoping to have some more time to spend on this end of Q2/beginning of
> > Q3 if no progress is made by then.
> >
> > I still think we should be careful on what is added to the spec, in
> > particular, we should be focused on encodings that can be used to improve
> > computational efficiency rather than just smaller size. Also, it is
> > important to note that any sort of encoding/compression must be supportable
> > across multiple languages/platforms.
> >
> > Thanks,
> > Micah
> >
> > On Tue, Mar 10, 2020 at 3:12 PM Wes McKinney <we...@gmail.com> wrote:
> >
> >> On Tue, Mar 10, 2020 at 5:01 PM Evan Chan <ev...@apple.com.invalid>
> >> wrote:
> >>>
> >>> Martin,
> >>>
> >>> Many thanks for the links.
> >>>
> >>> My main concern is not actually FP and integer data, but sparse string
> >> data.  Having many very sparse arrays, each with a bitmap and values
> >> (assume dictionary also), would be really expensive. I have some ideas I’d
> >> like to throw out there, around something like a MapArray (Think of it
> >> essentially as dictionaries of keys and values, plus List<List<u8>> for
> >> example), but something optimized for sparseness.
> >>>
> >>> Overall, while I appreciate the design of Arrow arrays to be super fast
> >> for computation, I’d like to be able to keep more of such data in memory,
> >> thus I’m interested in more compact representations, that ideally don’t
> >> need a compressor.  More like encoding.
> >>>
> >>> I saw a thread middle of last year about RLE encodings, this would be in
> >> the right direction I think.   It could be implemented properly such that
> >> it doesn’t make random access that bad.
> >>>
> >>> As for FP, I have my own scheme which is scale tested, SIMD friendly and
> >> should perform relatively well, and can fit in with different predictors
> >> including XOR, DFCM, etc.   Due to the high cardinality of most such data,
> >> I wonder if it wouldn’t be simpler to stick with one such scheme for all FP
> >> data.
> >>>
> >>> Anyways, I’m most curious about if there is a plan to implement RLE, the
> >> FP schemes you describe, etc. and bring them to Arrow.  IE, is there a plan
> >> for space efficient encodings overall for Arrow?
> >>
> >> It's been discussed many times in the past. As Arrow is developed by
> >> volunteers, if someone volunteers their time to work on it, it can
> >> happen. The first step would be to build consensus about what sort of
> >> protocol level additions (see the format/ directory and associated
> >> documentation) are needed. Once there is consensus about what to build
> >> and a complete specification for that, then implementation can move
> >> forward.
> >>
> >>> Thanks very much,
> >>> Evan
> >>>
> >>>> On Mar 10, 2020, at 1:41 AM, Radev, Martin <ma...@tum.de>
> >> wrote:
> >>>>
> >>>> Hey Evan,
> >>>>
> >>>>
> >>>> thank you for the interest.
> >>>>
> >>>> There has been some effort for compressing floating-point data on the
> >> Parquet side, namely the BYTE_STREAM_SPLIT encoding. On its own it does not
> >> compress floating point data but makes it more compressible for when a
> >> compressor, such as ZSTD, LZ4, etc, is used. It only works well for
> >> high-entropy floating-point data, somewhere at least as large as >= 15 bits
> >> of entropy per element. I suppose the encoding might actually also make
> >> sense for high-entropy integer data but I am not super sure.
> >>>> For low-entropy data, the dictionary encoding is good though I suspect
> >> there can be room for performance improvements.
> >>>> This is my final report for the encoding here:
> >> https://github.com/martinradev/arrow-fp-compression-bench/blob/master/optimize_byte_stream_split/report_final.pdf
> >>>>
> >>>> Note that at some point my investigation turned out be quite the same
> >> solution as the one in https://github.com/powturbo/Turbo-Transpose.
> >>>>
> >>>>
> >>>> Maybe the points I sent can be helpful.
> >>>>
> >>>>
> >>>> Kinds regards,
> >>>>
> >>>> Martin
> >>>>
> >>>> ________________________________
> >>>> From: evan_chan@apple.com <ev...@apple.com> on behalf of Evan
> >> Chan <ev...@apple.com.INVALID>
> >>>> Sent: Tuesday, March 10, 2020 5:15:48 AM
> >>>> To: dev@arrow.apache.org
> >>>> Subject: Summary of RLE and other compression efforts?
> >>>>
> >>>> Hi folks,
> >>>>
> >>>> I’m curious about the state of efforts for more compressed encodings
> >> in the Arrow columnar format.  I saw discussions previously about RLE, but
> >> is there a place to summarize all of the different efforts that are ongoing
> >> to bring more compressed encodings?
> >>>>
> >>>> Is there an effort to compress floating point or integer data using
> >> techniques such as XOR compression and Delta-Delta?  I can contribute to
> >> some of these efforts as well.
> >>>>
> >>>> Thanks,
> >>>> Evan
> >>>>
> >>>>
> >>>
> >>
>

Re: Summary of RLE and other compression efforts?

Posted by Micah Kornfield <em...@gmail.com>.
Hi Evan,


> Hope everyone is staying safe!

Thanks, you too.

> A fairly substantial amount of CPU is needed for translating from Parquet;
> main memory bandwidth becomes a factor.  Thus, it seems speed and
> constraining factors varies widely by application

I agree performance is going to be application dependent.  As I said, there
is definitely a mix of concerns, but at its core Arrow made some
assumptions about the constraining factors.  Without evidence to the
contrary, I think we should stick to them.

If the column encodings you are alluding to are really lightweight, I'm not
sure why they would be compute-intensive to decode if they were included in
Parquet or another file format stored in memory.

> - and having more encodings might extend the use of Arrow to wider scopes
> :)

Maybe. It seems you have something in mind. It might pay to take a step
back, and describe your use-case in a little more depth.  There are still
concerns about keeping reference implementations up-to-date with each
other, which is why I'd like to err on the side of simplicity and
generality.

> Perhaps we can have a discussion about what that demonstration would entail?

In my mind it would be showing the benefit of the encodings on standard
public datasets by comparing them with some micro-benchmarks: for instance,
some of the operations in the nascent C++ compute component [1], or
something based on common operations in a standard benchmark suite (e.g.
TPC-*).  If there are standards in the domain you are targeting, I think
they would also be useful to consider.

> Also, would the Arrow community be open to some kind of “third party
> encoding” capability?

I originally had something like this in my proposal, but I removed it after
some feedback.  There are justified concerns about this balkanizing the
standard.  Depending on the exact use case, extension arrays/additional
metadata might be a way of experimenting without adding it to the standard
specifically.

[1]
https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute/operations

On Tue, Mar 24, 2020 at 2:54 PM Evan Chan <ev...@apple.com> wrote:

> Hi Micah,
>
> Hope everyone is staying safe!
>
> > On Mar 16, 2020, at 9:41 PM, Micah Kornfield <em...@gmail.com>
> wrote:
> >
> > I feel a little uncomfortable in the fact that there isn't a more
> clearly defined dividing line for what belongs in Arrow and what doesn't.
> I suppose this is what discussions like these are about :)
> >
> > In my mind Arrow has two primary goals:
> > 1.  Wide accessibility
> > 2.  Speed, primarily with the assumption that CPU time is the
> constraining resource.  Some of the technologies in the Arrow ecosystem
> (e.g. Feather, Flight) have blurred this line a little bit. In some cases
> these technologies are dominated by other constraints as seen by Wes's
> compression proposal [1].
>
> If we only look at the narrow scope of iterating over a single Arrow array
> in memory, perhaps CPU is the dominant constraint (though even there, I
> would expect L1/L2 cache to be fairly important).  Once we expand the scope
> wider and wider…. For example a large volume of data, loading and
> translating data from Parquet and disk, etc. etc., then the factors become
> much more complex.  A fairly substantial amount of CPU is needed for
> translating from Parquet; main memory bandwidth becomes a factor.  Thus, it
> seems speed and constraining factors varies widely by application - and
> having more encodings might extend the use of Arrow to wider scopes  :)
>
> >
> > Given these points, adding complexity and constant factors, at least for
> an initial implementation, are something that needs to be looked at very
> carefully.  I would likely change my mind if there was a demonstration that
> the complexity adds substantial value across a variety of data
> sets/workloads beyond simpler versions.
>
> Perhaps we can have a discussion about what that demonstration would
> entail?
> Also, would the Arrow community be open to some kind of “third party
> encoding” capability?  It would facilitate experimentation by others, in
> real world scenarios, and if those use cases and workloads prove to be
> useful to others, perhaps the community could then consider adopting them
> more widely?
>
> Cheers,
> Evan

Re: Summary of RLE and other compression efforts?

Posted by Evan Chan <ev...@apple.com.INVALID>.
Hi Micah,

Hope everyone is staying safe!

> On Mar 16, 2020, at 9:41 PM, Micah Kornfield <em...@gmail.com> wrote:
> 
> I feel a little uncomfortable in the fact that there isn't a more clearly defined dividing line for what belongs in Arrow and what doesn't.  I suppose this is what discussions like these are about :)
> 
> In my mind Arrow has two primary goals:
> 1.  Wide accessibility
> 2.  Speed, primarily with the assumption that CPU time is the constraining resource.  Some of the technologies in the Arrow ecosystem (e.g. Feather, Flight) have blurred this line a little bit. In some cases these technologies are dominated by other constraints as seen by Wes's compression proposal [1].

If we only look at the narrow scope of iterating over a single Arrow array in memory, perhaps CPU is the dominant constraint (though even there, I would expect L1/L2 cache to be fairly important).  Once we expand the scope wider and wider (for example, to a large volume of data, or to loading and translating data from Parquet and disk), the factors become much more complex.  A fairly substantial amount of CPU is needed for translating from Parquet; main memory bandwidth becomes a factor.  Thus, it seems speed and the constraining factors vary widely by application - and having more encodings might extend the use of Arrow to wider scopes  :)

> 
> Given these points, adding complexity and constant factors, at least for an initial implementation, are something that needs to be looked at very carefully.  I would likely change my mind if there was a demonstration that the complexity adds substantial value across a variety of data sets/workloads beyond simpler versions.

Perhaps we can have a discussion about what that demonstration would entail?
Also, would the Arrow community be open to some kind of “third party encoding” capability?  It would facilitate experimentation by others, in real world scenarios, and if those use cases and workloads prove to be useful to others, perhaps the community could then consider adopting them more widely?

Cheers,
Evan

Re: Summary of RLE and other compression efforts?

Posted by Micah Kornfield <em...@gmail.com>.
Hi Evan,
Response inline.  This is just my opinion; others are welcome to chime in.
Apologies for the long reply.


> I’m wondering how important retaining strict O(1) random access is for
> Arrow’s use cases?  Would “substantially sub-linear” or “approximately
> O(c)” where c is a very small constant (a la B-Tree) work?


I don't think this has been formally discussed beyond the initial design
considerations for the format.  We should be careful with terminology: RLE
only provides O(log n) random access to a data element.
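
To make the O(log n) point concrete, here is a toy sketch of run-end
encoded data with random access by binary search (an illustration of the
general technique, not the exact buffer layout in the proposal):

    import numpy as np

    # values[i] covers the logical positions run_ends[i-1] .. run_ends[i] - 1;
    # run_ends is cumulative, so it is sorted.
    values   = np.array([7, 3, 9])
    run_ends = np.array([4, 6, 11])     # logical length 11: 7 7 7 7 3 3 9 9 9 9 9

    def take(logical_index):
        # Binary search over run_ends: O(log n) per element, versus O(1) for a
        # plain Arrow array.
        run = np.searchsorted(run_ends, logical_index, side="right")
        return values[run]

    assert [take(i) for i in range(11)] == [7, 7, 7, 7, 3, 3, 9, 9, 9, 9, 9]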

In either case I would lean against constant-factor additions to the format
and encoding proposal.  Some of the techniques brought up here might be
interesting to add to something like Parquet, or to use in an adapter
between FiloDB data and Arrow.

I feel a little uncomfortable in the fact that there isn't a more clearly
defined dividing line for what belongs in Arrow and what doesn't.  I
suppose this is what discussions like these are about :)

In my mind Arrow has two primary goals:
1.  Wide accessibility
2.  Speed, primarily with the assumption that CPU time is the constraining
resource.  Some of the technologies in the Arrow ecosystem (e.g. Feather,
Flight) have blurred this line a little bit. In some cases these
technologies are dominated by other constraints as seen by Wes's
compression proposal [1].

Given these points, adding complexity and constant factors, at least for an
initial implementation, are something that needs to be looked at very
carefully.  I would likely change my mind if there was a demonstration that
the complexity adds substantial value across a variety of data
sets/workloads beyond simpler versions.

> I also have my own, SIMD-friendly, variable-length encoding which can
> compress both floating point and integers quickly.

Per above I'm worried about adding constant factors here.

> Happy to give more details, perhaps in a separate channel if needed.


For Apache projects the mailing list is the best place to discuss these
things (or a link to a design doc posted to the list).  If a
higher-bandwidth medium is needed, there is a sync-up every other week on
Wednesdays at 9 AM Pacific Time (a reminder is usually sent out the night
before), and notes are posted back to the list.

-Micah


[1]
https://lists.apache.org/thread.html/r58c9d23ad159644fca590d8f841df80d180b11bfb72f949d601d764b%40%3Cdev.arrow.apache.org%3E


On Sat, Mar 14, 2020 at 10:23 PM Evan Chan <ev...@apple.com> wrote:

> Micah,
>
> Thanks for the link to the PR.  I’ll add more comments about those
> specific proposals there, though maybe I should outline ideas here as I’m
> not sure the PR is the best place to expand on details.
>
> I’m wondering how important retaining strict O(1) random access is for
> Arrow’s use cases?  Would “substantially sub-linear” or “approximately
> O(c)” where c is a very small constant (a la B-Tree) work?
>
>
>
> I'm not sure if this was provided as an example of something to add, but
> I believe this is already supported in the spec.
>
>
> In the longer term I think it is potentially something Arrow should
> include, but given the current number of developers contributing across all
> languages I'm wary of tacking on too much before we have working  reference
> implementations.
>
>
> I can understand trying to tackle less, just one new proposal and see it
> implemented widely first.
>
>
> If you think there are better encodings to tackle in the short term I'd
> welcome feedback on the proposal (or a more formal proposal of your own).
>
>
> An alternative to RLE with storing run-ends for every element, is to use
> RLE or variable-length encoding schemes, but combine them with an index or
> offsets for every Nth element.  This means you get O(c) random access where
> you have to decode no more than N elements at a time.  The N can be
> adjusted to tune for either better compression (larger N) or faster reads
> (less N to decode at a time).
>
> I also have my own, SIMD-friendly, variable-length encoding which can
> compress both floating point and integers quickly.   Some details here:
>
> https://github.com/filodb/FiloDB/blob/develop/doc/compression.md#predictive-nibblepacking
>
> Happy to give more details, perhaps in a separate channel if needed.
>
> -Evan
>
>
> Thanks,
> -Micah
>
> [1] https://github.com/apache/arrow/pull/4815/files
>
>
>
> On Wed, Mar 11, 2020 at 9:24 AM Evan Chan <ev...@apple.com> wrote:
>
>> Sure thing.
>>
>> Computation speed needs to be thought about in context....  We might find
>> something which takes up half the space to be a little more computationally
>> expensive, but in the grand scheme of things is faster to compute as more
>> of it can fit in memory, and it saves I/O.   I definitely agree that just
>> making something smaller without other gains might not be worth it.
>>
>> Some small changes help both computation and size.  For example, encoding
>> nulls in dictionary values helps reduce the need for both bitmap storage
>> and lookup.
>>
>> So I suppose the process goes something like this?
>>
>> 1) Propose designs in this email list
>> 2) PR for format specification
>> 3) Start implementing across diff languages
>>
>> -Evan
>>
>> > On Mar 10, 2020, at 10:31 PM, Micah Kornfield <em...@gmail.com>
>> wrote:
>> >
>> > +1 to what Wes said.
>> >
>> > I'm hoping to have some more time to spend on this end of Q2/beginning
>> of
>> > Q3 if no progress is made by then.
>> >
>> > I still think we should be careful on what is added to the spec, in
>> > particular, we should be focused on encodings that can be used to
>> improve
>> > computational efficiency rather than just smaller size. Also, it is
>> > important to note that any sort of encoding/compression must be
>> supportable
>> > across multiple languages/platforms.
>> >
>> > Thanks,
>> > Micah
>> >
>> > On Tue, Mar 10, 2020 at 3:12 PM Wes McKinney <we...@gmail.com>
>> wrote:
>> >
>> >> On Tue, Mar 10, 2020 at 5:01 PM Evan Chan <evan_chan@apple.com.invalid
>> >
>> >> wrote:
>> >>>
>> >>> Martin,
>> >>>
>> >>> Many thanks for the links.
>> >>>
>> >>> My main concern is not actually FP and integer data, but sparse string
>> >> data.  Having many very sparse arrays, each with a bitmap and values
>> >> (assume dictionary also), would be really expensive. I have some ideas
>> I’d
>> >> like to throw out there, around something like a MapArray (Think of it
>> >> essentially as dictionaries of keys and values, plus List<List<u8>> for
>> >> example), but something optimized for sparseness.
>> >>>
>> >>> Overall, while I appreciate the design of Arrow arrays to be super
>> fast
>> >> for computation, I’d like to be able to keep more of such data in
>> memory,
>> >> thus I’m interested in more compact representations, that ideally don’t
>> >> need a compressor.  More like encoding.
>> >>>
>> >>> I saw a thread middle of last year about RLE encodings, this would be
>> in
>> >> the right direction I think.   It could be implemented properly such
>> that
>> >> it doesn’t make random access that bad.
>> >>>
>> >>> As for FP, I have my own scheme which is scale tested, SIMD friendly
>> and
>> >> should perform relatively well, and can fit in with different
>> predictors
>> >> including XOR, DFCM, etc.   Due to the high cardinality of most such
>> data,
>> >> I wonder if it wouldn’t be simpler to stick with one such scheme for
>> all FP
>> >> data.
>> >>>
>> >>> Anyways, I’m most curious about if there is a plan to implement RLE,
>> the
>> >> FP schemes you describe, etc. and bring them to Arrow.  IE, is there a
>> plan
>> >> for space efficient encodings overall for Arrow?
>> >>
>> >> It's been discussed many times in the past. As Arrow is developed by
>> >> volunteers, if someone volunteers their time to work on it, it can
>> >> happen. The first step would be to build consensus about what sort of
>> >> protocol level additions (see the format/ directory and associated
>> >> documentation) are needed. Once there is consensus about what to build
>> >> and a complete specification for that, then implementation can move
>> >> forward.
>> >>
>> >>> Thanks very much,
>> >>> Evan
>> >>>
>> >>>> On Mar 10, 2020, at 1:41 AM, Radev, Martin <ma...@tum.de>
>> >> wrote:
>> >>>>
>> >>>> Hey Evan,
>> >>>>
>> >>>>
>> >>>> thank you for the interest.
>> >>>>
>> >>>> There has been some effort for compressing floating-point data on the
>> >> Parquet side, namely the BYTE_STREAM_SPLIT encoding. On its own it
>> does not
>> >> compress floating point data but makes it more compressible for when a
>> >> compressor, such as ZSTD, LZ4, etc, is used. It only works well for
>> >> high-entropy floating-point data, somewhere at least as large as >= 15
>> bits
>> >> of entropy per element. I suppose the encoding might actually also make
>> >> sense for high-entropy integer data but I am not super sure.
>> >>>> For low-entropy data, the dictionary encoding is good though I
>> suspect
>> >> there can be room for performance improvements.
>> >>>> This is my final report for the encoding here:
>> >>
>> https://github.com/martinradev/arrow-fp-compression-bench/blob/master/optimize_byte_stream_split/report_final.pdf
>> >>>>
>> >>>> Note that at some point my investigation turned out be quite the same
>> >> solution as the one in https://github.com/powturbo/Turbo-Transpose.
>> >>>>
>> >>>>
>> >>>> Maybe the points I sent can be helpful.
>> >>>>
>> >>>>
>> >>>> Kinds regards,
>> >>>>
>> >>>> Martin
>> >>>>
>> >>>> ________________________________
>> >>>> From: evan_chan@apple.com <ev...@apple.com> on behalf of Evan
>> >> Chan <ev...@apple.com.INVALID>
>> >>>> Sent: Tuesday, March 10, 2020 5:15:48 AM
>> >>>> To: dev@arrow.apache.org
>> >>>> Subject: Summary of RLE and other compression efforts?
>> >>>>
>> >>>> Hi folks,
>> >>>>
>> >>>> I’m curious about the state of efforts for more compressed encodings
>> >> in the Arrow columnar format.  I saw discussions previously about RLE,
>> but
>> >> is there a place to summarize all of the different efforts that are
>> ongoing
>> >> to bring more compressed encodings?
>> >>>>
>> >>>> Is there an effort to compress floating point or integer data using
>> >> techniques such as XOR compression and Delta-Delta?  I can contribute
>> to
>> >> some of these efforts as well.
>> >>>>
>> >>>> Thanks,
>> >>>> Evan
>> >>>>
>> >>>>
>> >>>
>> >>
>>
>>
>

Re: Summary of RLE and other compression efforts?

Posted by Evan Chan <ev...@apple.com.INVALID>.
Micah,

Thanks for the link to the PR.  I’ll add more comments about those specific proposals there, though maybe I should outline ideas here as I’m not sure the PR is the best place to expand on details.

I’m wondering how important retaining strict O(1) random access is for Arrow’s use cases?  Would “substantially sub-linear” or “approximately O(c)” where c is a very small constant (a la B-Tree) work?


> 
> I'm not sure if this was provided as an example of something to add, but I believe this is already supported in the spec.

> In the longer term I think it is potentially something Arrow should include, but given the current number of developers contributing across all languages I'm wary of tacking on too much before we have working  reference implementations.

I can understand trying to tackle less, just one new proposal and see it implemented widely first.

> 
> If you think there are better encodings to tackle in the short term I'd welcome feedback on the proposal (or a more formal proposal of your own).

An alternative to RLE that stores run-ends for every element is to use RLE or variable-length encoding schemes, but combine them with an index or offsets for every Nth element.  This means you get O(c) random access where you have to decode no more than N elements at a time.  N can be adjusted to tune for either better compression (larger N) or faster reads (smaller N to decode at a time).
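
A rough sketch of that layout, with plain per-block delta encoding standing in for whatever scheme is actually used inside each block (the names and helpers here are hypothetical, purely for illustration):

    import numpy as np

    N = 128  # block size: larger N favours compression, smaller N favours random access

    def encode(values):
        # Encode fixed-size blocks independently and remember where each block starts.
        offsets, blocks = [], []
        for start in range(0, len(values), N):
            offsets.append(start)
            block = values[start:start + N]
            blocks.append((block[0], np.diff(block)))   # stand-in: delta-encode the block
        return offsets, blocks

    def take(offsets, blocks, i):
        # "O(c)" access: jump to the right block, decode at most N elements.
        b = i // N
        first, deltas = blocks[b]
        decoded = np.concatenate(([first], first + np.cumsum(deltas)))
        return decoded[i - offsets[b]]

    vals = np.arange(1000)
    offsets, blocks = encode(vals)
    assert take(offsets, blocks, 517) == 517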

I also have my own, SIMD-friendly, variable-length encoding which can compress both floating point and integers quickly.   Some details here:
https://github.com/filodb/FiloDB/blob/develop/doc/compression.md#predictive-nibblepacking

Happy to give more details, perhaps in a separate channel if needed.

-Evan

> 
> Thanks,
> -Micah
> 
> [1] https://github.com/apache/arrow/pull/4815/files
> 
> 
> 
> On Wed, Mar 11, 2020 at 9:24 AM Evan Chan <evan_chan@apple.com <ma...@apple.com>> wrote:
> Sure thing.
> 
> Computation speed needs to be thought about in context....  We might find something which takes up half the space to be a little more computationally expensive, but in the grand scheme of things is faster to compute as more of it can fit in memory, and it saves I/O.   I definitely agree that just making something smaller without other gains might not be worth it.
> 
> Some small changes help both computation and size.  For example, encoding nulls in dictionary values helps reduce the need for both bitmap storage and lookup.
> 
> So I suppose the process goes something like this?
> 
> 1) Propose designs in this email list
> 2) PR for format specification
> 3) Start implementing across diff languages
> 
> -Evan 
> 
> > On Mar 10, 2020, at 10:31 PM, Micah Kornfield <emkornfield@gmail.com <ma...@gmail.com>> wrote:
> > 
> > +1 to what Wes said.
> > 
> > I'm hoping to have some more time to spend on this end of Q2/beginning of
> > Q3 if no progress is made by then.
> > 
> > I still think we should be careful on what is added to the spec, in
> > particular, we should be focused on encodings that can be used to improve
> > computational efficiency rather than just smaller size. Also, it is
> > important to note that any sort of encoding/compression must be supportable
> > across multiple languages/platforms.
> > 
> > Thanks,
> > Micah
> > 
> > On Tue, Mar 10, 2020 at 3:12 PM Wes McKinney <wesmckinn@gmail.com <ma...@gmail.com>> wrote:
> > 
> >> On Tue, Mar 10, 2020 at 5:01 PM Evan Chan <ev...@apple.com.invalid>
> >> wrote:
> >>> 
> >>> Martin,
> >>> 
> >>> Many thanks for the links.
> >>> 
> >>> My main concern is not actually FP and integer data, but sparse string
> >> data.  Having many very sparse arrays, each with a bitmap and values
> >> (assume dictionary also), would be really expensive. I have some ideas I’d
> >> like to throw out there, around something like a MapArray (Think of it
> >> essentially as dictionaries of keys and values, plus List<List<u8>> for
> >> example), but something optimized for sparseness.
> >>> 
> >>> Overall, while I appreciate the design of Arrow arrays to be super fast
> >> for computation, I’d like to be able to keep more of such data in memory,
> >> thus I’m interested in more compact representations, that ideally don’t
> >> need a compressor.  More like encoding.
> >>> 
> >>> I saw a thread middle of last year about RLE encodings, this would be in
> >> the right direction I think.   It could be implemented properly such that
> >> it doesn’t make random access that bad.
> >>> 
> >>> As for FP, I have my own scheme which is scale tested, SIMD friendly and
> >> should perform relatively well, and can fit in with different predictors
> >> including XOR, DFCM, etc.   Due to the high cardinality of most such data,
> >> I wonder if it wouldn’t be simpler to stick with one such scheme for all FP
> >> data.
> >>> 
> >>> Anyways, I’m most curious about if there is a plan to implement RLE, the
> >> FP schemes you describe, etc. and bring them to Arrow.  IE, is there a plan
> >> for space efficient encodings overall for Arrow?
> >> 
> >> It's been discussed many times in the past. As Arrow is developed by
> >> volunteers, if someone volunteers their time to work on it, it can
> >> happen. The first step would be to build consensus about what sort of
> >> protocol level additions (see the format/ directory and associated
> >> documentation) are needed. Once there is consensus about what to build
> >> and a complete specification for that, then implementation can move
> >> forward.
> >> 
> >>> Thanks very much,
> >>> Evan
> >>> 
> >>>> On Mar 10, 2020, at 1:41 AM, Radev, Martin <martin.radev@tum.de <ma...@tum.de>>
> >> wrote:
> >>>> 
> >>>> Hey Evan,
> >>>> 
> >>>> 
> >>>> thank you for the interest.
> >>>> 
> >>>> There has been some effort for compressing floating-point data on the
> >> Parquet side, namely the BYTE_STREAM_SPLIT encoding. On its own it does not
> >> compress floating point data but makes it more compressible for when a
> >> compressor, such as ZSTD, LZ4, etc, is used. It only works well for
> >> high-entropy floating-point data, somewhere at least as large as >= 15 bits
> >> of entropy per element. I suppose the encoding might actually also make
> >> sense for high-entropy integer data but I am not super sure.
> >>>> For low-entropy data, the dictionary encoding is good though I suspect
> >> there can be room for performance improvements.
> >>>> This is my final report for the encoding here:
> >> https://github.com/martinradev/arrow-fp-compression-bench/blob/master/optimize_byte_stream_split/report_final.pdf
> >>>> 
> >>>> Note that at some point my investigation turned out be quite the same
> >> solution as the one in https://github.com/powturbo/Turbo-Transpose.
> >>>> 
> >>>> 
> >>>> Maybe the points I sent can be helpful.
> >>>> 
> >>>> 
> >>>> Kinds regards,
> >>>> 
> >>>> Martin
> >>>> 
> >>>> ________________________________
> >>>> From: evan_chan@apple.com <ma...@apple.com> <evan_chan@apple.com <ma...@apple.com>> on behalf of Evan
> >> Chan <ev...@apple.com.INVALID>
> >>>> Sent: Tuesday, March 10, 2020 5:15:48 AM
> >>>> To: dev@arrow.apache.org <ma...@arrow.apache.org>
> >>>> Subject: Summary of RLE and other compression efforts?
> >>>> 
> >>>> Hi folks,
> >>>> 
> >>>> I’m curious about the state of efforts for more compressed encodings
> >> in the Arrow columnar format.  I saw discussions previously about RLE, but
> >> is there a place to summarize all of the different efforts that are ongoing
> >> to bring more compressed encodings?
> >>>> 
> >>>> Is there an effort to compress floating point or integer data using
> >> techniques such as XOR compression and Delta-Delta?  I can contribute to
> >> some of these efforts as well.
> >>>> 
> >>>> Thanks,
> >>>> Evan
> >>>> 
> >>>> 
> >>> 
> >> 
> 


Re: Summary of RLE and other compression efforts?

Posted by Micah Kornfield <em...@gmail.com>.
Hi Evan,
Seems like we are mostly on the same page.  Some more notes below.

> For example, encoding nulls in dictionary values helps reduce the need
> for both bitmap storage and lookup.

I'm not sure if this was provided as an example of something to add, but I
believe this is already supported in the spec.

In terms of better support for sparseness: I originally had a more complex
proposal that, in addition to RLE, also considered a better representation
for very sparse data (using indices to represent valid locations and not
reserving space in data buffers for null values).  I believe I dropped it
before I made the sparseness proposal PR [1] public.  My main reason for
doing so was to try to go with a minimal set of enhancements that had the
biggest "bang for the buck".
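
As a toy illustration of that dropped idea (indices of the valid positions plus a dense values buffer, rather than a validity bitmap plus space for every slot); this is a sketch of the general technique, not the layout that was in the PR:

    import numpy as np

    # Logical column of length 10 with only three non-null entries.
    length        = 10
    valid_indices = np.array([2, 5, 9])          # non-null positions, sorted
    values        = np.array([4.0, 7.5, 1.25])   # one stored value per valid position

    def get(i):
        pos = np.searchsorted(valid_indices, i)
        if pos < len(valid_indices) and valid_indices[pos] == i:
            return values[pos]
        return None   # null: no space reserved in the values buffer

    assert get(5) == 7.5 and get(3) is None
    assert sum(get(i) is None for i in range(length)) == 7   # 7 of the 10 slots are null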

In the longer term I think it is potentially something Arrow should
include, but given the current number of developers contributing across all
languages, I'm wary of tacking on too much before we have working reference
implementations.

If you think there are better encodings to tackle in the short term I'd
welcome feedback on the proposal (or a more formal proposal of your own).

Thanks,
-Micah

[1] https://github.com/apache/arrow/pull/4815/files



On Wed, Mar 11, 2020 at 9:24 AM Evan Chan <ev...@apple.com> wrote:

> Sure thing.
>
> Computation speed needs to be thought about in context....  We might find
> something which takes up half the space to be a little more computationally
> expensive, but in the grand scheme of things is faster to compute as more
> of it can fit in memory, and it saves I/O.   I definitely agree that just
> making something smaller without other gains might not be worth it.
>
> Some small changes help both computation and size.  For example, encoding
> nulls in dictionary values helps reduce the need for both bitmap storage
> and lookup.
>
> So I suppose the process goes something like this?
>
> 1) Propose designs in this email list
> 2) PR for format specification
> 3) Start implementing across diff languages
>
> -Evan
>
> > On Mar 10, 2020, at 10:31 PM, Micah Kornfield <em...@gmail.com>
> wrote:
> >
> > +1 to what Wes said.
> >
> > I'm hoping to have some more time to spend on this end of Q2/beginning of
> > Q3 if no progress is made by then.
> >
> > I still think we should be careful on what is added to the spec, in
> > particular, we should be focused on encodings that can be used to improve
> > computational efficiency rather than just smaller size. Also, it is
> > important to note that any sort of encoding/compression must be
> supportable
> > across multiple languages/platforms.
> >
> > Thanks,
> > Micah
> >
> > On Tue, Mar 10, 2020 at 3:12 PM Wes McKinney <we...@gmail.com>
> wrote:
> >
> >> On Tue, Mar 10, 2020 at 5:01 PM Evan Chan <ev...@apple.com.invalid>
> >> wrote:
> >>>
> >>> Martin,
> >>>
> >>> Many thanks for the links.
> >>>
> >>> My main concern is not actually FP and integer data, but sparse string
> >> data.  Having many very sparse arrays, each with a bitmap and values
> >> (assume dictionary also), would be really expensive. I have some ideas
> I’d
> >> like to throw out there, around something like a MapArray (Think of it
> >> essentially as dictionaries of keys and values, plus List<List<u8>> for
> >> example), but something optimized for sparseness.
> >>>
> >>> Overall, while I appreciate the design of Arrow arrays to be super fast
> >> for computation, I’d like to be able to keep more of such data in
> memory,
> >> thus I’m interested in more compact representations, that ideally don’t
> >> need a compressor.  More like encoding.
> >>>
> >>> I saw a thread middle of last year about RLE encodings, this would be
> in
> >> the right direction I think.   It could be implemented properly such
> that
> >> it doesn’t make random access that bad.
> >>>
> >>> As for FP, I have my own scheme which is scale tested, SIMD friendly
> and
> >> should perform relatively well, and can fit in with different predictors
> >> including XOR, DFCM, etc.   Due to the high cardinality of most such
> data,
> >> I wonder if it wouldn’t be simpler to stick with one such scheme for
> all FP
> >> data.
> >>>
> >>> Anyways, I’m most curious about if there is a plan to implement RLE,
> the
> >> FP schemes you describe, etc. and bring them to Arrow.  IE, is there a
> plan
> >> for space efficient encodings overall for Arrow?
> >>
> >> It's been discussed many times in the past. As Arrow is developed by
> >> volunteers, if someone volunteers their time to work on it, it can
> >> happen. The first step would be to build consensus about what sort of
> >> protocol level additions (see the format/ directory and associated
> >> documentation) are needed. Once there is consensus about what to build
> >> and a complete specification for that, then implementation can move
> >> forward.
> >>
> >>> Thanks very much,
> >>> Evan
> >>>
> >>>> On Mar 10, 2020, at 1:41 AM, Radev, Martin <ma...@tum.de>
> >> wrote:
> >>>>
> >>>> Hey Evan,
> >>>>
> >>>>
> >>>> thank you for the interest.
> >>>>
> >>>> There has been some effort for compressing floating-point data on the
> >> Parquet side, namely the BYTE_STREAM_SPLIT encoding. On its own it does
> not
> >> compress floating point data but makes it more compressible for when a
> >> compressor, such as ZSTD, LZ4, etc, is used. It only works well for
> >> high-entropy floating-point data, somewhere at least as large as >= 15
> bits
> >> of entropy per element. I suppose the encoding might actually also make
> >> sense for high-entropy integer data but I am not super sure.
> >>>> For low-entropy data, the dictionary encoding is good though I suspect
> >> there can be room for performance improvements.
> >>>> This is my final report for the encoding here:
> >>
> https://github.com/martinradev/arrow-fp-compression-bench/blob/master/optimize_byte_stream_split/report_final.pdf
> >>>>
> >>>> Note that at some point my investigation turned out be quite the same
> >> solution as the one in https://github.com/powturbo/Turbo-Transpose.
> >>>>
> >>>>
> >>>> Maybe the points I sent can be helpful.
> >>>>
> >>>>
> >>>> Kinds regards,
> >>>>
> >>>> Martin
> >>>>
> >>>> ________________________________
> >>>> From: evan_chan@apple.com <ev...@apple.com> on behalf of Evan
> >> Chan <ev...@apple.com.INVALID>
> >>>> Sent: Tuesday, March 10, 2020 5:15:48 AM
> >>>> To: dev@arrow.apache.org
> >>>> Subject: Summary of RLE and other compression efforts?
> >>>>
> >>>> Hi folks,
> >>>>
> >>>> I’m curious about the state of efforts for more compressed encodings
> >> in the Arrow columnar format.  I saw discussions previously about RLE,
> but
> >> is there a place to summarize all of the different efforts that are
> ongoing
> >> to bring more compressed encodings?
> >>>>
> >>>> Is there an effort to compress floating point or integer data using
> >> techniques such as XOR compression and Delta-Delta?  I can contribute to
> >> some of these efforts as well.
> >>>>
> >>>> Thanks,
> >>>> Evan
> >>>>
> >>>>
> >>>
> >>
>
>

Re: Summary of RLE and other compression efforts?

Posted by Evan Chan <ev...@apple.com.INVALID>.
Sure thing.

Computation speed needs to be thought about in context.  We might find that something which takes up half the space is a little more computationally expensive, but in the grand scheme of things is faster to compute because more of it can fit in memory, and it saves I/O.  I definitely agree that just making something smaller without other gains might not be worth it.

Some small changes help both computation and size.  For example, encoding nulls in dictionary values helps reduce the need for both bitmap storage and lookup.
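
A toy illustration of that point (not the Arrow spec's actual dictionary layout): if null is just another dictionary entry, a lookup needs neither a separate validity bitmap nor a per-element bitmap check.

    dictionary = [None, "red", "green", "blue"]   # slot 0 reserved for null
    indices    = [1, 0, 3, 3, 0, 2]               # 0 means null; no validity bitmap needed

    decoded = [dictionary[i] for i in indices]
    # ['red', None, 'blue', 'blue', None, 'green']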

So I suppose the process goes something like this?

1) Propose designs in this email list
2) PR for format specification
3) Start implementing across diff languages

-Evan 

> On Mar 10, 2020, at 10:31 PM, Micah Kornfield <em...@gmail.com> wrote:
> 
> +1 to what Wes said.
> 
> I'm hoping to have some more time to spend on this end of Q2/beginning of
> Q3 if no progress is made by then.
> 
> I still think we should be careful on what is added to the spec, in
> particular, we should be focused on encodings that can be used to improve
> computational efficiency rather than just smaller size. Also, it is
> important to note that any sort of encoding/compression must be supportable
> across multiple languages/platforms.
> 
> Thanks,
> Micah
> 
> On Tue, Mar 10, 2020 at 3:12 PM Wes McKinney <we...@gmail.com> wrote:
> 
>> On Tue, Mar 10, 2020 at 5:01 PM Evan Chan <ev...@apple.com.invalid>
>> wrote:
>>> 
>>> Martin,
>>> 
>>> Many thanks for the links.
>>> 
>>> My main concern is not actually FP and integer data, but sparse string
>> data.  Having many very sparse arrays, each with a bitmap and values
>> (assume dictionary also), would be really expensive. I have some ideas I’d
>> like to throw out there, around something like a MapArray (Think of it
>> essentially as dictionaries of keys and values, plus List<List<u8>> for
>> example), but something optimized for sparseness.
>>> 
>>> Overall, while I appreciate the design of Arrow arrays to be super fast
>> for computation, I’d like to be able to keep more of such data in memory,
>> thus I’m interested in more compact representations, that ideally don’t
>> need a compressor.  More like encoding.
>>> 
>>> I saw a thread middle of last year about RLE encodings, this would be in
>> the right direction I think.   It could be implemented properly such that
>> it doesn’t make random access that bad.
>>> 
>>> As for FP, I have my own scheme which is scale tested, SIMD friendly and
>> should perform relatively well, and can fit in with different predictors
>> including XOR, DFCM, etc.   Due to the high cardinality of most such data,
>> I wonder if it wouldn’t be simpler to stick with one such scheme for all FP
>> data.
>>> 
>>> Anyways, I’m most curious about if there is a plan to implement RLE, the
>> FP schemes you describe, etc. and bring them to Arrow.  IE, is there a plan
>> for space efficient encodings overall for Arrow?
>> 
>> It's been discussed many times in the past. As Arrow is developed by
>> volunteers, if someone volunteers their time to work on it, it can
>> happen. The first step would be to build consensus about what sort of
>> protocol level additions (see the format/ directory and associated
>> documentation) are needed. Once there is consensus about what to build
>> and a complete specification for that, then implementation can move
>> forward.
>> 
>>> Thanks very much,
>>> Evan
>>> 
>>>> On Mar 10, 2020, at 1:41 AM, Radev, Martin <ma...@tum.de>
>> wrote:
>>>> 
>>>> Hey Evan,
>>>> 
>>>> 
>>>> thank you for the interest.
>>>> 
>>>> There has been some effort for compressing floating-point data on the
>> Parquet side, namely the BYTE_STREAM_SPLIT encoding. On its own it does not
>> compress floating point data but makes it more compressible for when a
>> compressor, such as ZSTD, LZ4, etc, is used. It only works well for
>> high-entropy floating-point data, somewhere at least as large as >= 15 bits
>> of entropy per element. I suppose the encoding might actually also make
>> sense for high-entropy integer data but I am not super sure.
>>>> For low-entropy data, the dictionary encoding is good though I suspect
>> there can be room for performance improvements.
>>>> This is my final report for the encoding here:
>> https://github.com/martinradev/arrow-fp-compression-bench/blob/master/optimize_byte_stream_split/report_final.pdf
>>>> 
>>>> Note that at some point my investigation turned out be quite the same
>> solution as the one in https://github.com/powturbo/Turbo-Transpose.
>>>> 
>>>> 
>>>> Maybe the points I sent can be helpful.
>>>> 
>>>> 
>>>> Kinds regards,
>>>> 
>>>> Martin
>>>> 
>>>> ________________________________
>>>> From: evan_chan@apple.com <ev...@apple.com> on behalf of Evan
>> Chan <ev...@apple.com.INVALID>
>>>> Sent: Tuesday, March 10, 2020 5:15:48 AM
>>>> To: dev@arrow.apache.org
>>>> Subject: Summary of RLE and other compression efforts?
>>>> 
>>>> Hi folks,
>>>> 
>>>> I’m curious about the state of efforts for more compressed encodings
>> in the Arrow columnar format.  I saw discussions previously about RLE, but
>> is there a place to summarize all of the different efforts that are ongoing
>> to bring more compressed encodings?
>>>> 
>>>> Is there an effort to compress floating point or integer data using
>> techniques such as XOR compression and Delta-Delta?  I can contribute to
>> some of these efforts as well.
>>>> 
>>>> Thanks,
>>>> Evan
>>>> 
>>>> 
>>> 
>> 


Re: Summary of RLE and other compression efforts?

Posted by Micah Kornfield <em...@gmail.com>.
+1 to what Wes said.

I'm hoping to have some more time to spend on this at the end of
Q2/beginning of Q3 if no progress is made by then.

I still think we should be careful about what is added to the spec; in
particular, we should be focused on encodings that can be used to improve
computational efficiency rather than just smaller size.  Also, it is
important to note that any sort of encoding/compression must be supportable
across multiple languages/platforms.

Thanks,
Micah

On Tue, Mar 10, 2020 at 3:12 PM Wes McKinney <we...@gmail.com> wrote:

> On Tue, Mar 10, 2020 at 5:01 PM Evan Chan <ev...@apple.com.invalid>
> wrote:
> >
> > Martin,
> >
> > Many thanks for the links.
> >
> > My main concern is not actually FP and integer data, but sparse string
> data.  Having many very sparse arrays, each with a bitmap and values
> (assume dictionary also), would be really expensive. I have some ideas I’d
> like to throw out there, around something like a MapArray (Think of it
> essentially as dictionaries of keys and values, plus List<List<u8>> for
> example), but something optimized for sparseness.
> >
> > Overall, while I appreciate the design of Arrow arrays to be super fast
> for computation, I’d like to be able to keep more of such data in memory,
> thus I’m interested in more compact representations, that ideally don’t
> need a compressor.  More like encoding.
> >
> > I saw a thread middle of last year about RLE encodings, this would be in
> the right direction I think.   It could be implemented properly such that
> it doesn’t make random access that bad.
> >
> > As for FP, I have my own scheme which is scale tested, SIMD friendly and
> should perform relatively well, and can fit in with different predictors
> including XOR, DFCM, etc.   Due to the high cardinality of most such data,
> I wonder if it wouldn’t be simpler to stick with one such scheme for all FP
> data.
> >
> > Anyways, I’m most curious about if there is a plan to implement RLE, the
> FP schemes you describe, etc. and bring them to Arrow.  IE, is there a plan
> for space efficient encodings overall for Arrow?
>
> It's been discussed many times in the past. As Arrow is developed by
> volunteers, if someone volunteers their time to work on it, it can
> happen. The first step would be to build consensus about what sort of
> protocol level additions (see the format/ directory and associated
> documentation) are needed. Once there is consensus about what to build
> and a complete specification for that, then implementation can move
> forward.
>
> > Thanks very much,
> > Evan
> >
> > > On Mar 10, 2020, at 1:41 AM, Radev, Martin <ma...@tum.de>
> wrote:
> > >
> > > Hey Evan,
> > >
> > >
> > > thank you for the interest.
> > >
> > > There has been some effort for compressing floating-point data on the
> Parquet side, namely the BYTE_STREAM_SPLIT encoding. On its own it does not
> compress floating point data but makes it more compressible for when a
> compressor, such as ZSTD, LZ4, etc, is used. It only works well for
> high-entropy floating-point data, somewhere at least as large as >= 15 bits
> of entropy per element. I suppose the encoding might actually also make
> sense for high-entropy integer data but I am not super sure.
> > > For low-entropy data, the dictionary encoding is good though I suspect
> there can be room for performance improvements.
> > > This is my final report for the encoding here:
> https://github.com/martinradev/arrow-fp-compression-bench/blob/master/optimize_byte_stream_split/report_final.pdf
> > >
> > > Note that at some point my investigation turned out be quite the same
> solution as the one in https://github.com/powturbo/Turbo-Transpose.
> > >
> > >
> > > Maybe the points I sent can be helpful.
> > >
> > >
> > > Kinds regards,
> > >
> > > Martin
> > >
> > > ________________________________
> > > From: evan_chan@apple.com <ev...@apple.com> on behalf of Evan
> Chan <ev...@apple.com.INVALID>
> > > Sent: Tuesday, March 10, 2020 5:15:48 AM
> > > To: dev@arrow.apache.org
> > > Subject: Summary of RLE and other compression efforts?
> > >
> > > Hi folks,
> > >
> > > I’m curious about the state of efforts for more compressed encodings
> in the Arrow columnar format.  I saw discussions previously about RLE, but
> is there a place to summarize all of the different efforts that are ongoing
> to bring more compressed encodings?
> > >
> > > Is there an effort to compress floating point or integer data using
> techniques such as XOR compression and Delta-Delta?  I can contribute to
> some of these efforts as well.
> > >
> > > Thanks,
> > > Evan
> > >
> > >
> >
>

Re: Summary of RLE and other compression efforts?

Posted by Wes McKinney <we...@gmail.com>.
On Tue, Mar 10, 2020 at 5:01 PM Evan Chan <ev...@apple.com.invalid> wrote:
>
> Martin,
>
> Many thanks for the links.
>
> My main concern is not actually FP and integer data, but sparse string data.  Having many very sparse arrays, each with a bitmap and values (assume dictionary also), would be really expensive. I have some ideas I’d like to throw out there, around something like a MapArray (Think of it essentially as dictionaries of keys and values, plus List<List<u8>> for example), but something optimized for sparseness.
>
> Overall, while I appreciate the design of Arrow arrays to be super fast for computation, I’d like to be able to keep more of such data in memory, thus I’m interested in more compact representations, that ideally don’t need a compressor.  More like encoding.
>
> I saw a thread middle of last year about RLE encodings, this would be in the right direction I think.   It could be implemented properly such that it doesn’t make random access that bad.
>
> As for FP, I have my own scheme which is scale tested, SIMD friendly and should perform relatively well, and can fit in with different predictors including XOR, DFCM, etc.   Due to the high cardinality of most such data, I wonder if it wouldn’t be simpler to stick with one such scheme for all FP data.
>
> Anyways, I’m most curious about if there is a plan to implement RLE, the FP schemes you describe, etc. and bring them to Arrow.  IE, is there a plan for space efficient encodings overall for Arrow?

It's been discussed many times in the past. As Arrow is developed by
volunteers, if someone volunteers their time to work on it, it can
happen. The first step would be to build consensus about what sort of
protocol level additions (see the format/ directory and associated
documentation) are needed. Once there is consensus about what to build
and a complete specification for that, then implementation can move
forward.

> Thanks very much,
> Evan
>
> > On Mar 10, 2020, at 1:41 AM, Radev, Martin <ma...@tum.de> wrote:
> >
> > Hey Evan,
> >
> >
> > thank you for the interest.
> >
> > There has been some effort for compressing floating-point data on the Parquet side, namely the BYTE_STREAM_SPLIT encoding. On its own it does not compress floating point data but makes it more compressible for when a compressor, such as ZSTD, LZ4, etc, is used. It only works well for high-entropy floating-point data, somewhere at least as large as >= 15 bits of entropy per element. I suppose the encoding might actually also make sense for high-entropy integer data but I am not super sure.
> > For low-entropy data, the dictionary encoding is good though I suspect there can be room for performance improvements.
> > This is my final report for the encoding here: https://github.com/martinradev/arrow-fp-compression-bench/blob/master/optimize_byte_stream_split/report_final.pdf
> >
> > Note that at some point my investigation turned out be quite the same solution as the one in https://github.com/powturbo/Turbo-Transpose.
> >
> >
> > Maybe the points I sent can be helpful.
> >
> >
> > Kinds regards,
> >
> > Martin
> >
> > ________________________________
> > From: evan_chan@apple.com <ev...@apple.com> on behalf of Evan Chan <ev...@apple.com.INVALID>
> > Sent: Tuesday, March 10, 2020 5:15:48 AM
> > To: dev@arrow.apache.org
> > Subject: Summary of RLE and other compression efforts?
> >
> > Hi folks,
> >
> > I’m curious about the state of efforts for more compressed encodings in the Arrow columnar format.  I saw discussions previously about RLE, but is there a place to summarize all of the different efforts that are ongoing to bring more compressed encodings?
> >
> > Is there an effort to compress floating point or integer data using techniques such as XOR compression and Delta-Delta?  I can contribute to some of these efforts as well.
> >
> > Thanks,
> > Evan
> >
> >
>

Re: Summary of RLE and other compression efforts?

Posted by Evan Chan <ev...@apple.com.INVALID>.
Martin,

Many thanks for the links.  

My main concern is not actually FP and integer data, but sparse string data.  Having many very sparse arrays, each with a bitmap and values (assume dictionary also), would be really expensive.  I have some ideas I’d like to throw out there, around something like a MapArray (think of it essentially as dictionaries of keys and values, plus List<List<u8>> for example), but something optimized for sparseness.

Overall, while I appreciate the design of Arrow arrays to be super fast for computation, I’d like to be able to keep more of such data in memory, thus I’m interested in more compact representations, that ideally don’t need a compressor.  More like encoding.

I saw a thread from the middle of last year about RLE encodings; this would be in the right direction, I think.  It could be implemented properly such that it doesn’t make random access that bad.

As for FP, I have my own scheme which is scale-tested, SIMD-friendly, should perform relatively well, and can fit in with different predictors including XOR, DFCM, etc.  Due to the high cardinality of most such data, I wonder if it wouldn’t be simpler to stick with one such scheme for all FP data.
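
For readers unfamiliar with the XOR-predictor idea (popularized by Facebook's Gorilla time-series paper): XOR each value's bit pattern with a prediction, typically the previous value, so that similar neighbours leave mostly-zero residuals that a bit-packer can exploit. A minimal sketch, not FiloDB's actual scheme:

    import struct

    def xor_residuals(values):
        # XOR each double's 64-bit pattern with the previous one; similar
        # neighbours leave long runs of zero bits for a bit-packer to exploit.
        out, prev = [], 0
        for v in values:
            bits = struct.unpack("<Q", struct.pack("<d", v))[0]
            out.append(bits ^ prev)
            prev = bits
        return out

    res = xor_residuals([21.5, 21.5, 21.625])
    # res[1] == 0 (repeated value); res[2] has only a few mantissa bits set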

Anyway, I’m most curious about whether there is a plan to implement RLE, the FP schemes you describe, etc., and bring them to Arrow.  That is, is there a plan for space-efficient encodings overall for Arrow?

Thanks very much,
Evan

> On Mar 10, 2020, at 1:41 AM, Radev, Martin <ma...@tum.de> wrote:
> 
> Hey Evan,
> 
> 
> thank you for the interest.
> 
> There has been some effort for compressing floating-point data on the Parquet side, namely the BYTE_STREAM_SPLIT encoding. On its own it does not compress floating point data but makes it more compressible for when a compressor, such as ZSTD, LZ4, etc, is used. It only works well for high-entropy floating-point data, somewhere at least as large as >= 15 bits of entropy per element. I suppose the encoding might actually also make sense for high-entropy integer data but I am not super sure.
> For low-entropy data, the dictionary encoding is good though I suspect there can be room for performance improvements.
> This is my final report for the encoding here: https://github.com/martinradev/arrow-fp-compression-bench/blob/master/optimize_byte_stream_split/report_final.pdf
> 
> Note that at some point my investigation turned out to be quite the same solution as the one in https://github.com/powturbo/Turbo-Transpose.
> 
> 
> Maybe the points I sent can be helpful.
> 
> 
> Kind regards,
> 
> Martin
> 
> ________________________________
> From: evan_chan@apple.com <ev...@apple.com> on behalf of Evan Chan <ev...@apple.com.INVALID>
> Sent: Tuesday, March 10, 2020 5:15:48 AM
> To: dev@arrow.apache.org
> Subject: Summary of RLE and other compression efforts?
> 
> Hi folks,
> 
> I’m curious about the state of efforts for more compressed encodings in the Arrow columnar format.  I saw discussions previously about RLE, but is there a place to summarize all of the different efforts that are ongoing to bring more compressed encodings?
> 
> Is there an effort to compress floating point or integer data using techniques such as XOR compression and Delta-Delta?  I can contribute to some of these efforts as well.
> 
> Thanks,
> Evan
> 
> 


Re: Summary of RLE and other compression efforts?

Posted by Evan Chan <ev...@apple.com.INVALID>.
Thank you, Wes.  If the stars line up, I’d be interested in joining and contributing to this effort.   I have a ton of ideas around efficient encodings for different types of data.

> On Mar 10, 2020, at 2:52 PM, Wes McKinney <we...@gmail.com> wrote:
> 
> See this past mailing list thread
> 
> https://lists.apache.org/thread.html/a99124e57c14c3c9ef9d98f3c80cfe1dd25496bf3ff7046778add937%40%3Cdev.arrow.apache.org%3E
> 
> and associated PR
> 
> https://github.com/apache/arrow/pull/4815
> 
> There hasn't been a lot of movement on this, primarily because all
> the key people who've expressed interest in it have been really busy
> with other matters (myself included). Having RLE encoding in memory, at a
> minimum, would be a huge benefit for a number of applications, so it
> would be great to continue the discussion and create a more
> comprehensive proposal document describing what we would like to
> implement (and what we do not want to implement).
> 
> On Tue, Mar 10, 2020 at 3:41 AM Radev, Martin <ma...@tum.de> wrote:
>> 
>> Hey Evan,
>> 
>> 
>> thank you for the interest.
>> 
>> There has been some effort to compress floating-point data on the Parquet side, namely the BYTE_STREAM_SPLIT encoding. On its own it does not compress floating-point data, but it makes the data more compressible when a compressor such as ZSTD or LZ4 is then applied. It only works well for high-entropy floating-point data, roughly >= 15 bits of entropy per element. I suppose the encoding might also make sense for high-entropy integer data, but I am not sure.
>> For low-entropy data, dictionary encoding works well, though I suspect there is room for performance improvements.
>> My final report on the encoding is here: https://github.com/martinradev/arrow-fp-compression-bench/blob/master/optimize_byte_stream_split/report_final.pdf
>> 
>> Note that at some point my investigation converged on essentially the same solution as the one in https://github.com/powturbo/Turbo-Transpose.
>> 
>> 
>> Maybe the points I sent can be helpful.
>> 
>> 
>> Kind regards,
>> 
>> Martin
>> 
>> ________________________________
>> From: evan_chan@apple.com <ev...@apple.com> on behalf of Evan Chan <ev...@apple.com.INVALID>
>> Sent: Tuesday, March 10, 2020 5:15:48 AM
>> To: dev@arrow.apache.org
>> Subject: Summary of RLE and other compression efforts?
>> 
>> Hi folks,
>> 
>> I’m curious about the state of efforts for more compressed encodings in the Arrow columnar format.  I saw discussions previously about RLE, but is there a place to summarize all of the different efforts that are ongoing to bring more compressed encodings?
>> 
>> Is there an effort to compress floating point or integer data using techniques such as XOR compression and Delta-Delta?  I can contribute to some of these efforts as well.
>> 
>> Thanks,
>> Evan
>> 
>> 


Re: Summary of RLE and other compression efforts?

Posted by Wes McKinney <we...@gmail.com>.
See this past mailing list thread

https://lists.apache.org/thread.html/a99124e57c14c3c9ef9d98f3c80cfe1dd25496bf3ff7046778add937%40%3Cdev.arrow.apache.org%3E

and associated PR

https://github.com/apache/arrow/pull/4815

There hasn't been a lot of movement on this, primarily because all
the key people who've expressed interest in it have been really busy
with other matters (myself included). Having RLE encoding in memory, at a
minimum, would be a huge benefit for a number of applications, so it
would be great to continue the discussion and create a more
comprehensive proposal document describing what we would like to
implement (and what we do not want to implement).

On Tue, Mar 10, 2020 at 3:41 AM Radev, Martin <ma...@tum.de> wrote:
>
> Hey Evan,
>
>
> thank you for the interest.
>
> There has been some effort to compress floating-point data on the Parquet side, namely the BYTE_STREAM_SPLIT encoding. On its own it does not compress floating-point data, but it makes the data more compressible when a compressor such as ZSTD or LZ4 is then applied. It only works well for high-entropy floating-point data, roughly >= 15 bits of entropy per element. I suppose the encoding might also make sense for high-entropy integer data, but I am not sure.
> For low-entropy data, dictionary encoding works well, though I suspect there is room for performance improvements.
> My final report on the encoding is here: https://github.com/martinradev/arrow-fp-compression-bench/blob/master/optimize_byte_stream_split/report_final.pdf
>
> Note that at some point my investigation converged on essentially the same solution as the one in https://github.com/powturbo/Turbo-Transpose.
>
>
> Maybe the points I sent can be helpful.
>
>
> Kind regards,
>
> Martin
>
> ________________________________
> From: evan_chan@apple.com <ev...@apple.com> on behalf of Evan Chan <ev...@apple.com.INVALID>
> Sent: Tuesday, March 10, 2020 5:15:48 AM
> To: dev@arrow.apache.org
> Subject: Summary of RLE and other compression efforts?
>
> Hi folks,
>
> I’m curious about the state of efforts for more compressed encodings in the Arrow columnar format.  I saw discussions previously about RLE, but is there a place to summarize all of the different efforts that are ongoing to bring more compressed encodings?
>
> Is there an effort to compress floating point or integer data using techniques such as XOR compression and Delta-Delta?  I can contribute to some of these efforts as well.
>
> Thanks,
> Evan
>
>

Re: Summary of RLE and other compression efforts?

Posted by "Radev, Martin" <ma...@tum.de>.
Hey Evan,


thank you for the interest.

There has been some effort to compress floating-point data on the Parquet side, namely the BYTE_STREAM_SPLIT encoding. On its own it does not compress floating-point data, but it makes the data more compressible when a compressor such as ZSTD or LZ4 is then applied. It only works well for high-entropy floating-point data, roughly >= 15 bits of entropy per element. I suppose the encoding might also make sense for high-entropy integer data, but I am not sure.
For low-entropy data, dictionary encoding works well, though I suspect there is room for performance improvements.
My final report on the encoding is here: https://github.com/martinradev/arrow-fp-compression-bench/blob/master/optimize_byte_stream_split/report_final.pdf

Note that at some point my investigation converged on essentially the same solution as the one in https://github.com/powturbo/Turbo-Transpose.
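To illustrate the transformation, here is a small Python sketch of the byte-splitting step for float32 values (this is only a toy version of the idea, not the actual Parquet implementation; the helper names are made up). It scatters the k-th byte of every value into its own stream, so the slowly-varying sign/exponent bytes end up adjacent and a general-purpose compressor sees longer matches; no size reduction happens on its own.

import struct

def byte_stream_split(values, width=4):
    raw = b"".join(struct.pack("<f", v) for v in values)    # float32 example
    return [raw[i::width] for i in range(width)]             # one stream per byte position

def byte_stream_merge(streams, width=4):
    n = len(streams[0])
    raw = bytes(streams[i % width][i // width] for i in range(n * width))
    return [struct.unpack("<f", raw[j:j + 4])[0] for j in range(0, n * width, 4)]

vals = [1.5, 1.25, 1.75, 2.0]
streams = byte_stream_split(vals)
assert byte_stream_merge(streams) == vals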


Maybe the points I sent can be helpful.


Kind regards,

Martin

________________________________
From: evan_chan@apple.com <ev...@apple.com> on behalf of Evan Chan <ev...@apple.com.INVALID>
Sent: Tuesday, March 10, 2020 5:15:48 AM
To: dev@arrow.apache.org
Subject: Summary of RLE and other compression efforts?

Hi folks,

I’m curious about the state of efforts for more compressed encodings in the Arrow columnar format.  I saw discussions previously about RLE, but is there a place to summarize all of the different efforts that are ongoing to bring more compressed encodings?

Is there an effort to compress floating point or integer data using techniques such as XOR compression and Delta-Delta?  I can contribute to some of these efforts as well.

Thanks,
Evan