Posted to dev@arrow.apache.org by ma...@markfarnan.com on 2020/08/29 23:04:16 UTC

Compression in Arrow - Question

I was looking at compression in Arrow and had a couple of questions.

If I've understood correctly, compression is currently used only 'in flight', in either the IPC format or Arrow Flight, using block compression, and the data is still decoded into RAM at the destination in full array form. Is this correct?


Given that Arrow is a columnar format, has any thought been given to an option to keep the data compressed both in memory and in flight, using some of the columnar techniques?
As I deal primarily with time-series numerical data, I was thinking some of the algorithms from the Gorilla paper [1] for floats and timestamps (delta-of-delta), or similar, might be appropriate.
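
To show why delta-of-delta is attractive for regularly sampled timestamps, here's a quick sketch in Go (my working language). It is simplified on purpose: real Gorilla packs variable-width bit fields, whereas this version just varint-encodes the delta-of-deltas.

package main

import (
	"encoding/binary"
	"fmt"
)

// encodeDoD stores the first timestamp raw, the first delta, and then
// delta-of-deltas, which are ~0 for regular sampling and so varint-encode
// to a single byte each.
func encodeDoD(ts []int64) []byte {
	buf := make([]byte, 0, len(ts))
	var prev, prevDelta int64
	for i, t := range ts {
		var v int64
		switch i {
		case 0:
			v = t // first value stored raw
		case 1:
			v = t - prev // first delta
		default:
			v = (t - prev) - prevDelta // delta-of-delta
		}
		if i >= 1 {
			prevDelta = t - prev
		}
		prev = t
		buf = binary.AppendVarint(buf, v)
	}
	return buf
}

func main() {
	ts := []int64{1598745600000, 1598745601000, 1598745602000, 1598745603010}
	enc := encodeDoD(ts)
	fmt.Printf("%d timestamps: %d bytes raw -> %d bytes encoded\n",
		len(ts), len(ts)*8, len(enc))
}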

The interface functions could still iterate over the data and produce raw values, so this would be transparent to users of the data, while the in-memory data blocks/arrays are actually compressed.
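
As a purely hypothetical sketch of the kind of accessor contract I mean (not an existing Arrow API), with a trivial constant encoding standing in for a real codec:

package main

import "fmt"

// Float64Reader is a hypothetical accessor: callers see plain values
// while the backing buffer stays in its compressed/encoded form.
type Float64Reader interface {
	Len() int
	Value(i int) float64 // decode on demand, not up front
}

// constantReader: a trivially "encoded" array (one value, n logical rows),
// just to show the shape of the contract.
type constantReader struct {
	v float64
	n int
}

func (c constantReader) Len() int            { return c.n }
func (c constantReader) Value(i int) float64 { return c.v }

func main() {
	var r Float64Reader = constantReader{v: 2.5, n: 1000000}
	fmt.Println(r.Len(), r.Value(999)) // a million logical rows, tiny state
}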

With this method, blocks could come out of a database/source, through the data service, across the wire (Flight), and land in the consuming application's memory without ever being decompressed or processed until final use.


Crazy thought ?


Regards

Mark. 


[1]: https://www.vldb.org/pvldb/vol8/p1816-teller.pdf


Re: Compression in Arrow - Question

Posted by Micah Kornfield <em...@gmail.com>.
Hi Mark,

> There is definitely a tradeoff between processing speed and compression,
> however I feel there is a use case for a 'small in-memory footprint'
> independent of 'high-speed processing'. Though I appreciate the Arrow team
> may not want to address that, given the focus on processing speed (can't
> be all things to everyone).

> Personally, I think adding programming interfaces to handle compressed
> in-memory arrays would be a good thing, as well as the 'in flight' ones.


I don't see the two as contradictory. In the thread I linked, I was
advocating for implementing the simplest possible methodologies before
exploring potentially more complex ones, especially those that force a
size vs. access-speed tradeoff.

I'm currently doing some work on Parquet->Arrow reading, but I'm hoping to
be able to do more on the encoding work once I can wrap that up. The
first step is building a proof-of-concept prototype. If you think the
encodings in the straw-man proposal [1] will not be at all useful for your
use case, that is useful feedback, but I suspect they would still help to
some degree even if they aren't optimal.

Thanks,
Micah

[1] https://github.com/apache/arrow/pull/4815



RE: Compression in Arrow - Question

Posted by ma...@markfarnan.com.
All, 

Micah: it appears my google-fu wasn't strong enough to find the previous thread, so thanks for pointing it out.

There is definitely a tradeoff between processing speed and compression, however I feel there is a use case for a 'small in-memory footprint' independent of 'high-speed processing'.
Though I appreciate the Arrow team may not want to address that, given the focus on processing speed (can't be all things to everyone).

Personally, I think adding programming interfaces to handle compressed in-memory arrays would be a good thing, as well as the 'in flight' ones.


For reference, my specific use case is handing large datasets [1] of varying types [2] to the browser for plotting, including scrolling over them, using WASM (currently in Go).
Both network bandwidth to browsers and browser memory are always problematic, especially on mobile devices, hence the desire to compress, keep the data compressed on arrival, and minimize the number of in-memory copies needed.

Access to the data is either:
 A: a forward read from a certain point for a range, to draw (that point and range change with scroll and zoom);
 B: random access for tooltips (the value of 'n' columns at index 'y').
   Both can potentially be efficient enough depending on the choice of block sizes or other internal boundaries / search method; a sketch of what I mean follows.
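
Here's a toy sketch of the block-boundary idea (codec left abstract, names my own): an offsets index lets case B decode only the block containing the requested row.

package main

import (
	"fmt"
	"sort"
)

// blockStore keeps a column as independently compressed blocks plus an
// index of each block's first row: case A scans forward block by block,
// case B binary-searches straight to one block.
type blockStore struct {
	blocks [][]byte               // compressed blocks
	starts []int                  // first logical row index of each block
	decode func([]byte) []float64 // block decoder
}

// valueAt decodes only the block containing row i.
func (s *blockStore) valueAt(i int) float64 {
	b := sort.SearchInts(s.starts, i+1) - 1
	vals := s.decode(s.blocks[b])
	return vals[i-s.starts[b]]
}

func main() {
	// Stub "codec": each byte is a value; enough to show the indexing.
	s := &blockStore{
		blocks: [][]byte{{1, 2, 3}, {4, 5}},
		starts: []int{0, 3},
		decode: func(b []byte) []float64 {
			out := make([]float64, len(b))
			for i, v := range b {
				out[i] = float64(v)
			}
			return out
		},
	}
	fmt.Println(s.valueAt(4)) // row 4 lives in the second block -> 5
}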


Note: compressing potentially makes my 'other' problem even harder, namely the best method for appending inbound realtime sensor data into the in-memory model. Still thinking about that one.
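
One pattern I'm weighing for that (my own assumption, nothing Arrow provides today): keep a small uncompressed tail and seal it into a compressed block once full.

package main

import "fmt"

const blockSize = 4 // tiny, just for the demo

// appendColumn: immutable sealed blocks plus a mutable tail. A real
// version would compress on seal; here the "codec" is identity.
type appendColumn struct {
	sealed [][]float64 // stand-in for compressed blocks
	tail   []float64   // recent samples, uncompressed
}

func (c *appendColumn) append(v float64) {
	c.tail = append(c.tail, v)
	if len(c.tail) == blockSize {
		c.sealed = append(c.sealed, c.tail) // compress here in a real codec
		c.tail = nil
	}
}

func main() {
	var c appendColumn
	for i := 0; i < 10; i++ {
		c.append(float64(i))
	}
	fmt.Println(len(c.sealed), "sealed blocks,", len(c.tail), "values in tail")
}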

Regards

Mark. 


[1]  'Large' is obviously relative: in this case, a single plot may have 20-50 separate time series, each with between 20k and 10 million points.

[2]  The data is often index: time, value: float, OR index: float (length measure), value: float, but not always: the value could be one of int(8,16,32,64), float(32,64), string, vector(float32/64), etc. Hence why I like Arrow as the standard 'format' for this data, as all of these can be safely encoded within it.





Re: Compression in Arrow - Question

Posted by Micah Kornfield <em...@gmail.com>.
Agreed, I think it would be useful to make sure the "compute" interfaces
have the right hooks to support alternate encodings.
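
For instance (purely hypothetical, not an existing Arrow interface), a kernel could dispatch on the array's encoding rather than assume plain buffers, e.g. an RLE-aware sum:

package main

import "fmt"

// rleInt64 is an illustrative run-length-encoded array: values[i]
// repeated counts[i] times.
type rleInt64 struct {
	values []int64
	counts []int64
}

// sumRLE exploits the encoding: one multiply per run instead of one add
// per logical row.
func sumRLE(a rleInt64) int64 {
	var s int64
	for i, v := range a.values {
		s += v * a.counts[i]
	}
	return s
}

func main() {
	a := rleInt64{values: []int64{7, 3}, counts: []int64{1000, 2}}
	fmt.Println(sumRLE(a)) // 7006
}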


Re: Compression in Arrow - Question

Posted by Wes McKinney <we...@gmail.com>.
That said, there is nothing preventing the development of programming
interfaces for compressed / encoded data right now. When it comes to
transporting such data, that's when we will have to decide on what to
support and what new metadata structures are required.

For example, we could add RLE to C++ in prototype form and then
convert to non-RLE when writing to IPC messages.
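
To make that concrete, a conceptual sketch (I'd prototype in C++; Go here purely for brevity): hold runs in memory, expand to a plain array at the IPC boundary.

package main

import "fmt"

// run is one (value, repeat-count) pair of a run-length encoding.
type run struct {
	value int64
	count int
}

// rleEncode collapses consecutive equal values into runs.
func rleEncode(vals []int64) []run {
	var runs []run
	for _, v := range vals {
		if n := len(runs); n > 0 && runs[n-1].value == v {
			runs[n-1].count++
		} else {
			runs = append(runs, run{v, 1})
		}
	}
	return runs
}

// rleExpand materializes the plain array, e.g. before writing IPC.
func rleExpand(runs []run) []int64 {
	var out []int64
	for _, r := range runs {
		for i := 0; i < r.count; i++ {
			out = append(out, r.value)
		}
	}
	return out
}

func main() {
	vals := []int64{7, 7, 7, 7, 3, 3, 9}
	runs := rleEncode(vals)
	fmt.Println(runs, "->", rleExpand(runs))
}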


Re: Compression in Arrow - Question

Posted by Micah Kornfield <em...@gmail.com>.
Hi Mark,
See the most recent previous discussion about alternate encodings [1].
This is something that should be added in the long run; I'd personally
prefer to start with simpler encodings.

I don't think we should add anything more with regard to
compression/encoding until at least three languages support the current
compression methods in the specification. C++ has it implemented, there is
some work in Java, and I think we should have at least one more.

-Micah

[1]
https://lists.apache.org/thread.html/r1d9d707c481c53c13534f7c72d75c7a90dc7b2b9966c6c0772d0e416%40%3Cdev.arrow.apache.org%3E
