Posted to dev@arrow.apache.org by Wes McKinney <we...@gmail.com> on 2020/04/23 00:24:09 UTC

[VOTE] Add "trivial" RecordBatch body compression to Arrow IPC protocol

Hello,

I have proposed adding a simple RecordBatch IPC message body
compression scheme (using either LZ4 or ZSTD) to the Arrow IPC
protocol in GitHub PR [1] as discussed on the mailing list [2]. This
is distinct from separate discussions about adding in-memory encodings
(like RLE-encoding) to the Arrow columnar format.

This change is not forward compatible so it will not be safe to send
compressed messages to old libraries, but since we are still pre-1.0.0
the consensus is that this is acceptable. We may separately consider
increasing the metadata version for 1.0.0 to require clients to
upgrade.

Please vote whether to accept the addition. The vote will be open for
at least 72 hours.

[ ] +1 Accept this addition to the IPC protocol
[ ] +0
[ ] -1 Do not accept the changes because...

Here is my vote: +1

Thanks,
Wes

[1]: https://github.com/apache/arrow/pull/6707
[2]: https://lists.apache.org/thread.html/r58c9d23ad159644fca590d8f841df80d180b11bfb72f949d601d764b%40%3Cdev.arrow.apache.org%3E
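
A rough picture of what per-buffer body compression involves, as a minimal Python sketch. It is not taken from PR [1]. It assumes the layout given in the Arrow format specification, where each body buffer is compressed on its own and written with its uncompressed length in front as a 64-bit little-endian signed integer. zlib stands in for LZ4/ZSTD only because it ships with the Python standard library, and the function names are hypothetical.

```python
import struct
import zlib

# 8-byte prefix: uncompressed length as a little-endian signed 64-bit integer.
_PREFIX = struct.Struct("<q")


def compress_buffer(buf: bytes) -> bytes:
    """Compress one body buffer and prepend its uncompressed length."""
    # zlib is a stand-in codec; the proposal itself names LZ4 and ZSTD.
    return _PREFIX.pack(len(buf)) + zlib.compress(buf)


def decompress_buffer(framed: bytes) -> bytes:
    """Invert compress_buffer, checking the recorded length."""
    (expected_len,) = _PREFIX.unpack_from(framed, 0)
    data = zlib.decompress(framed[_PREFIX.size:])
    if len(data) != expected_len:
        raise ValueError("uncompressed length does not match the prefix")
    return data


def compress_body(buffers):
    """Compress each buffer of a record batch body independently."""
    return [compress_buffer(b) for b in buffers]


if __name__ == "__main__":
    body = [b"\x01" * 1024, bytes(range(256)) * 4]
    framed = compress_body(body)
    assert [decompress_buffer(f) for f in framed] == body
    print([len(b) for b in body], "->", [len(f) for f in framed])
```

Because the metadata has to say that the body is compressed, a reader that predates the change cannot interpret such a message, which is the forward-compatibility concern raised above.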

Re: [RESULT] [VOTE] Add "trivial" RecordBatch body compression to Arrow IPC protocol

Posted by Wes McKinney <we...@gmail.com>.
hi Micah,

I'll take care of it shortly.

Thanks

On Sat, May 2, 2020 at 5:13 PM Micah Kornfield <em...@gmail.com> wrote:
>
> Hi Wes,
> Will you have time open JIRAs on tracking implementations in each
> language?  I can try to do it sometime this week if not.
>
> Thanks,
> Micah
>
> On Thu, Apr 30, 2020 at 2:49 PM Wes McKinney <we...@gmail.com> wrote:
>
> > The vote carries with 7 binding +1 votes and 1 non-binding +1
> >
> > On Fri, Apr 24, 2020 at 7:40 AM Francois Saint-Jacques
> > <fs...@gmail.com> wrote:
> > >
> > > +1 (binding)
> > >
> > > On Fri, Apr 24, 2020 at 5:41 AM Krisztián Szűcs
> > > <sz...@gmail.com> wrote:
> > > >
> > > > +1 (binding)
> > > >
> > > > On 2020. Apr 24., Fri at 1:51, Micah Kornfield <em...@gmail.com>
> > > > wrote:
> > > >
> > > > > +1 (binding)
> > > > >
> > > > > > On Thu, Apr 23, 2020 at 2:35 PM Sutou Kouhei <ko...@clear-code.com> wrote:
> > > > > >
> > > > > > > +1 (binding)
> > > > > > >
> > > > > > > In <CAJPUwMDEm1-5SUDxZRwYfkHSDEFssq8TbspGyicaJhKBEfUT+Q@mail.gmail.com>
> > > > > > >   "[VOTE] Add "trivial" RecordBatch body compression to Arrow IPC
> > > > > > > protocol" on Wed, 22 Apr 2020 19:24:09 -0500,
> > > > > > >   Wes McKinney <we...@gmail.com> wrote:
> > > > > > >
> > > > > > > > Hello,
> > > > > > > >
> > > > > > > > I have proposed adding a simple RecordBatch IPC message body
> > > > > > > > compression scheme (using either LZ4 or ZSTD) to the Arrow IPC
> > > > > > > > protocol in GitHub PR [1] as discussed on the mailing list [2]. This
> > > > > > > > is distinct from separate discussions about adding in-memory encodings
> > > > > > > > (like RLE-encoding) to the Arrow columnar format.
> > > > > > > >
> > > > > > > > This change is not forward compatible so it will not be safe to send
> > > > > > > > compressed messages to old libraries, but since we are still pre-1.0.0
> > > > > > > > the consensus is that this is acceptable. We may separately consider
> > > > > > > > increasing the metadata version for 1.0.0 to require clients to
> > > > > > > > upgrade.
> > > > > > > >
> > > > > > > > Please vote whether to accept the addition. The vote will be open for
> > > > > > > > at least 72 hours.
> > > > > > > >
> > > > > > > > [ ] +1 Accept this addition to the IPC protocol
> > > > > > > > [ ] +0
> > > > > > > > [ ] -1 Do not accept the changes because...
> > > > > > > >
> > > > > > > > Here is my vote: +1
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Wes
> > > > > > > >
> > > > > > > > [1]: https://github.com/apache/arrow/pull/6707
> > > > > > > > [2]: https://lists.apache.org/thread.html/r58c9d23ad159644fca590d8f841df80d180b11bfb72f949d601d764b%40%3Cdev.arrow.apache.org%3E
> >

Re: [RESULT] [VOTE] Add "trivial" RecordBatch body compression to Arrow IPC protocol

Posted by Micah Kornfield <em...@gmail.com>.
Hi Wes,
Will you have time to open JIRAs for tracking the implementations in each
language? If not, I can try to do it sometime this week.

Thanks,
Micah

On Thu, Apr 30, 2020 at 2:49 PM Wes McKinney <we...@gmail.com> wrote:

> The vote carries with 7 binding +1 votes and 1 non-binding +1
>
> On Fri, Apr 24, 2020 at 7:40 AM Francois Saint-Jacques
> <fs...@gmail.com> wrote:
> >
> > +1 (binding)
> >
> > On Fri, Apr 24, 2020 at 5:41 AM Krisztián Szűcs
> > <sz...@gmail.com> wrote:
> > >
> > > +1 (binding)
> > >
> > > On 2020. Apr 24., Fri at 1:51, Micah Kornfield <em...@gmail.com>
> > > wrote:
> > >
> > > > +1 (binding)
> > > >
> > > > > On Thu, Apr 23, 2020 at 2:35 PM Sutou Kouhei <ko...@clear-code.com> wrote:
> > > > >
> > > > > > +1 (binding)
> > > > > >
> > > > > > In <CAJPUwMDEm1-5SUDxZRwYfkHSDEFssq8TbspGyicaJhKBEfUT+Q@mail.gmail.com>
> > > > > >   "[VOTE] Add "trivial" RecordBatch body compression to Arrow IPC
> > > > > > protocol" on Wed, 22 Apr 2020 19:24:09 -0500,
> > > > > >   Wes McKinney <we...@gmail.com> wrote:
> > > > > >
> > > > > > > Hello,
> > > > > > >
> > > > > > > I have proposed adding a simple RecordBatch IPC message body
> > > > > > > compression scheme (using either LZ4 or ZSTD) to the Arrow IPC
> > > > > > > protocol in GitHub PR [1] as discussed on the mailing list [2]. This
> > > > > > > is distinct from separate discussions about adding in-memory encodings
> > > > > > > (like RLE-encoding) to the Arrow columnar format.
> > > > > > >
> > > > > > > This change is not forward compatible so it will not be safe to send
> > > > > > > compressed messages to old libraries, but since we are still pre-1.0.0
> > > > > > > the consensus is that this is acceptable. We may separately consider
> > > > > > > increasing the metadata version for 1.0.0 to require clients to
> > > > > > > upgrade.
> > > > > > >
> > > > > > > Please vote whether to accept the addition. The vote will be open for
> > > > > > > at least 72 hours.
> > > > > > >
> > > > > > > [ ] +1 Accept this addition to the IPC protocol
> > > > > > > [ ] +0
> > > > > > > [ ] -1 Do not accept the changes because...
> > > > > > >
> > > > > > > Here is my vote: +1
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Wes
> > > > > > >
> > > > > > > [1]: https://github.com/apache/arrow/pull/6707
> > > > > > > [2]: https://lists.apache.org/thread.html/r58c9d23ad159644fca590d8f841df80d180b11bfb72f949d601d764b%40%3Cdev.arrow.apache.org%3E
>

[RESULT] [VOTE] Add "trivial" RecordBatch body compression to Arrow IPC protocol

Posted by Wes McKinney <we...@gmail.com>.
The vote carries with 7 binding +1 votes and 1 non-binding +1

On Fri, Apr 24, 2020 at 7:40 AM Francois Saint-Jacques
<fs...@gmail.com> wrote:
>
> +1 (binding)
>
> On Fri, Apr 24, 2020 at 5:41 AM Krisztián Szűcs
> <sz...@gmail.com> wrote:
> >
> > +1 (binding)
> >
> > On 2020. Apr 24., Fri at 1:51, Micah Kornfield <em...@gmail.com>
> > wrote:
> >
> > > +1 (binding)
> > >
> > > On Thu, Apr 23, 2020 at 2:35 PM Sutou Kouhei <ko...@clear-code.com> wrote:
> > >
> > > > +1 (binding)
> > > >
> > > > In <CA...@mail.gmail.com>
> > > >   "[VOTE] Add "trivial" RecordBatch body compression to Arrow IPC
> > > > protocol" on Wed, 22 Apr 2020 19:24:09 -0500,
> > > >   Wes McKinney <we...@gmail.com> wrote:
> > > >
> > > > > Hello,
> > > > >
> > > > > I have proposed adding a simple RecordBatch IPC message body
> > > > > compression scheme (using either LZ4 or ZSTD) to the Arrow IPC
> > > > > protocol in GitHub PR [1] as discussed on the mailing list [2]. This
> > > > > is distinct from separate discussions about adding in-memory encodings
> > > > > (like RLE-encoding) to the Arrow columnar format.
> > > > >
> > > > > This change is not forward compatible so it will not be safe to send
> > > > > compressed messages to old libraries, but since we are still pre-1.0.0
> > > > > the consensus is that this is acceptable. We may separately consider
> > > > > increasing the metadata version for 1.0.0 to require clients to
> > > > > upgrade.
> > > > >
> > > > > Please vote whether to accept the addition. The vote will be open for
> > > > > at least 72 hours.
> > > > >
> > > > > [ ] +1 Accept this addition to the IPC protocol
> > > > > [ ] +0
> > > > > [ ] -1 Do not accept the changes because...
> > > > >
> > > > > Here is my vote: +1
> > > > >
> > > > > Thanks,
> > > > > Wes
> > > > >
> > > > > [1]: https://github.com/apache/arrow/pull/6707
> > > > > [2]:
> > > >
> > > https://lists.apache.org/thread.html/r58c9d23ad159644fca590d8f841df80d180b11bfb72f949d601d764b%40%3Cdev.arrow.apache.org%3E
> > > >
> > >

Re: [VOTE] Add "trivial" RecordBatch body compression to Arrow IPC protocol

Posted by Francois Saint-Jacques <fs...@gmail.com>.
+1 (binding)

On Fri, Apr 24, 2020 at 5:41 AM Krisztián Szűcs
<sz...@gmail.com> wrote:
>
> +1 (binding)
>
> On 2020. Apr 24., Fri at 1:51, Micah Kornfield <em...@gmail.com>
> wrote:
>
> > +1 (binding)
> >
> > On Thu, Apr 23, 2020 at 2:35 PM Sutou Kouhei <ko...@clear-code.com> wrote:
> >
> > > +1 (binding)
> > >
> > > In <CA...@mail.gmail.com>
> > >   "[VOTE] Add "trivial" RecordBatch body compression to Arrow IPC
> > > protocol" on Wed, 22 Apr 2020 19:24:09 -0500,
> > >   Wes McKinney <we...@gmail.com> wrote:
> > >
> > > > Hello,
> > > >
> > > > I have proposed adding a simple RecordBatch IPC message body
> > > > compression scheme (using either LZ4 or ZSTD) to the Arrow IPC
> > > > protocol in GitHub PR [1] as discussed on the mailing list [2]. This
> > > > is distinct from separate discussions about adding in-memory encodings
> > > > (like RLE-encoding) to the Arrow columnar format.
> > > >
> > > > This change is not forward compatible so it will not be safe to send
> > > > compressed messages to old libraries, but since we are still pre-1.0.0
> > > > the consensus is that this is acceptable. We may separately consider
> > > > increasing the metadata version for 1.0.0 to require clients to
> > > > upgrade.
> > > >
> > > > Please vote whether to accept the addition. The vote will be open for
> > > > at least 72 hours.
> > > >
> > > > [ ] +1 Accept this addition to the IPC protocol
> > > > [ ] +0
> > > > [ ] -1 Do not accept the changes because...
> > > >
> > > > Here is my vote: +1
> > > >
> > > > Thanks,
> > > > Wes
> > > >
> > > > [1]: https://github.com/apache/arrow/pull/6707
> > > > [2]:
> > >
> > https://lists.apache.org/thread.html/r58c9d23ad159644fca590d8f841df80d180b11bfb72f949d601d764b%40%3Cdev.arrow.apache.org%3E
> > >
> >

Re: [VOTE] Add "trivial" RecordBatch body compression to Arrow IPC protocol

Posted by Krisztián Szűcs <sz...@gmail.com>.
+1 (binding)

On 2020. Apr 24., Fri at 1:51, Micah Kornfield <em...@gmail.com>
wrote:

> +1 (binding)
>
> On Thu, Apr 23, 2020 at 2:35 PM Sutou Kouhei <ko...@clear-code.com> wrote:
>
> > +1 (binding)
> >
> > In <CA...@mail.gmail.com>
> >   "[VOTE] Add "trivial" RecordBatch body compression to Arrow IPC
> > protocol" on Wed, 22 Apr 2020 19:24:09 -0500,
> >   Wes McKinney <we...@gmail.com> wrote:
> >
> > > Hello,
> > >
> > > I have proposed adding a simple RecordBatch IPC message body
> > > compression scheme (using either LZ4 or ZSTD) to the Arrow IPC
> > > protocol in GitHub PR [1] as discussed on the mailing list [2]. This
> > > is distinct from separate discussions about adding in-memory encodings
> > > (like RLE-encoding) to the Arrow columnar format.
> > >
> > > This change is not forward compatible so it will not be safe to send
> > > compressed messages to old libraries, but since we are still pre-1.0.0
> > > the consensus is that this is acceptable. We may separately consider
> > > increasing the metadata version for 1.0.0 to require clients to
> > > upgrade.
> > >
> > > Please vote whether to accept the addition. The vote will be open for
> > > at least 72 hours.
> > >
> > > [ ] +1 Accept this addition to the IPC protocol
> > > [ ] +0
> > > [ ] -1 Do not accept the changes because...
> > >
> > > Here is my vote: +1
> > >
> > > Thanks,
> > > Wes
> > >
> > > [1]: https://github.com/apache/arrow/pull/6707
> > > [2]:
> >
> https://lists.apache.org/thread.html/r58c9d23ad159644fca590d8f841df80d180b11bfb72f949d601d764b%40%3Cdev.arrow.apache.org%3E
> >
>

Re: [VOTE] Add "trivial" RecordBatch body compression to Arrow IPC protocol

Posted by Micah Kornfield <em...@gmail.com>.
+1 (binding)

On Thu, Apr 23, 2020 at 2:35 PM Sutou Kouhei <ko...@clear-code.com> wrote:

> +1 (binding)
>
> In <CA...@mail.gmail.com>
>   "[VOTE] Add "trivial" RecordBatch body compression to Arrow IPC
> protocol" on Wed, 22 Apr 2020 19:24:09 -0500,
>   Wes McKinney <we...@gmail.com> wrote:
>
> > Hello,
> >
> > I have proposed adding a simple RecordBatch IPC message body
> > compression scheme (using either LZ4 or ZSTD) to the Arrow IPC
> > protocol in GitHub PR [1] as discussed on the mailing list [2]. This
> > is distinct from separate discussions about adding in-memory encodings
> > (like RLE-encoding) to the Arrow columnar format.
> >
> > This change is not forward compatible so it will not be safe to send
> > compressed messages to old libraries, but since we are still pre-1.0.0
> > the consensus is that this is acceptable. We may separately consider
> > increasing the metadata version for 1.0.0 to require clients to
> > upgrade.
> >
> > Please vote whether to accept the addition. The vote will be open for
> > at least 72 hours.
> >
> > [ ] +1 Accept this addition to the IPC protocol
> > [ ] +0
> > [ ] -1 Do not accept the changes because...
> >
> > Here is my vote: +1
> >
> > Thanks,
> > Wes
> >
> > [1]: https://github.com/apache/arrow/pull/6707
> > [2]: https://lists.apache.org/thread.html/r58c9d23ad159644fca590d8f841df80d180b11bfb72f949d601d764b%40%3Cdev.arrow.apache.org%3E
>

Re: [VOTE] Add "trivial" RecordBatch body compression to Arrow IPC protocol

Posted by Sutou Kouhei <ko...@clear-code.com>.
+1 (binding)

In <CA...@mail.gmail.com>
  "[VOTE] Add "trivial" RecordBatch body compression to Arrow IPC protocol" on Wed, 22 Apr 2020 19:24:09 -0500,
  Wes McKinney <we...@gmail.com> wrote:

> Hello,
> 
> I have proposed adding a simple RecordBatch IPC message body
> compression scheme (using either LZ4 or ZSTD) to the Arrow IPC
> protocol in GitHub PR [1] as discussed on the mailing list [2]. This
> is distinct from separate discussions about adding in-memory encodings
> (like RLE-encoding) to the Arrow columnar format.
> 
> This change is not forward compatible so it will not be safe to send
> compressed messages to old libraries, but since we are still pre-1.0.0
> the consensus is that this is acceptable. We may separately consider
> increasing the metadata version for 1.0.0 to require clients to
> upgrade.
> 
> Please vote whether to accept the addition. The vote will be open for
> at least 72 hours.
> 
> [ ] +1 Accept this addition to the IPC protocol
> [ ] +0
> [ ] -1 Do not accept the changes because...
> 
> Here is my vote: +1
> 
> Thanks,
> Wes
> 
> [1]: https://github.com/apache/arrow/pull/6707
> [2]: https://lists.apache.org/thread.html/r58c9d23ad159644fca590d8f841df80d180b11bfb72f949d601d764b%40%3Cdev.arrow.apache.org%3E

Re: [VOTE] Add "trivial" RecordBatch body compression to Arrow IPC protocol

Posted by Neal Richardson <ne...@gmail.com>.
+1 (binding)

Neal

On Thu, Apr 23, 2020 at 2:55 AM Antoine Pitrou <an...@python.org> wrote:

>
> +1 (binding)
>
>
> Le 23/04/2020 à 02:24, Wes McKinney a écrit :
> > Hello,
> >
> > I have proposed adding a simple RecordBatch IPC message body
> > compression scheme (using either LZ4 or ZSTD) to the Arrow IPC
> > protocol in GitHub PR [1] as discussed on the mailing list [2]. This
> > is distinct from separate discussions about adding in-memory encodings
> > (like RLE-encoding) to the Arrow columnar format.
> >
> > This change is not forward compatible so it will not be safe to send
> > compressed messages to old libraries, but since we are still pre-1.0.0
> > the consensus is that this is acceptable. We may separately consider
> > increasing the metadata version for 1.0.0 to require clients to
> > upgrade.
> >
> > Please vote whether to accept the addition. The vote will be open for
> > at least 72 hours.
> >
> > [ ] +1 Accept this addition to the IPC protocol
> > [ ] +0
> > [ ] -1 Do not accept the changes because...
> >
> > Here is my vote: +1
> >
> > Thanks,
> > Wes
> >
> > [1]: https://github.com/apache/arrow/pull/6707
> > [2]: https://lists.apache.org/thread.html/r58c9d23ad159644fca590d8f841df80d180b11bfb72f949d601d764b%40%3Cdev.arrow.apache.org%3E
> >
>

Re: [VOTE] Add "trivial" RecordBatch body compression to Arrow IPC protocol

Posted by Antoine Pitrou <an...@python.org>.
+1 (binding)


Le 23/04/2020 à 02:24, Wes McKinney a écrit :
> Hello,
> 
> I have proposed adding a simple RecordBatch IPC message body
> compression scheme (using either LZ4 or ZSTD) to the Arrow IPC
> protocol in GitHub PR [1] as discussed on the mailing list [2]. This
> is distinct from separate discussions about adding in-memory encodings
> (like RLE-encoding) to the Arrow columnar format.
> 
> This change is not forward compatible so it will not be safe to send
> compressed messages to old libraries, but since we are still pre-1.0.0
> the consensus is that this is acceptable. We may separately consider
> increasing the metadata version for 1.0.0 to require clients to
> upgrade.
> 
> Please vote whether to accept the addition. The vote will be open for
> at least 72 hours.
> 
> [ ] +1 Accept this addition to the IPC protocol
> [ ] +0
> [ ] -1 Do not accept the changes because...
> 
> Here is my vote: +1
> 
> Thanks,
> Wes
> 
> [1]: https://github.com/apache/arrow/pull/6707
> [2]: https://lists.apache.org/thread.html/r58c9d23ad159644fca590d8f841df80d180b11bfb72f949d601d764b%40%3Cdev.arrow.apache.org%3E
> 

Re: [VOTE] Add "trivial" RecordBatch body compression to Arrow IPC protocol

Posted by Fan Liya <li...@gmail.com>.
My vote: +1

Best,
Liya Fan

On Thu, Apr 23, 2020 at 8:24 AM Wes McKinney <we...@gmail.com> wrote:

> Hello,
>
> I have proposed adding a simple RecordBatch IPC message body
> compression scheme (using either LZ4 or ZSTD) to the Arrow IPC
> protocol in GitHub PR [1] as discussed on the mailing list [2]. This
> is distinct from separate discussions about adding in-memory encodings
> (like RLE-encoding) to the Arrow columnar format.
>
> This change is not forward compatible so it will not be safe to send
> compressed messages to old libraries, but since we are still pre-1.0.0
> the consensus is that this is acceptable. We may separately consider
> increasing the metadata version for 1.0.0 to require clients to
> upgrade.
>
> Please vote whether to accept the addition. The vote will be open for
> at least 72 hours.
>
> [ ] +1 Accept this addition to the IPC protocol
> [ ] +0
> [ ] -1 Do not accept the changes because...
>
> Here is my vote: +1
>
> Thanks,
> Wes
>
> [1]: https://github.com/apache/arrow/pull/6707
> [2]:
> https://lists.apache.org/thread.html/r58c9d23ad159644fca590d8f841df80d180b11bfb72f949d601d764b%40%3Cdev.arrow.apache.org%3E
>

Re: IPC body buffer compression

Posted by Wes McKinney <we...@gmail.com>.
BTW I just opened https://issues.apache.org/jira/browse/ARROW-8823
since we don't track "what the uncompressed size would have been"
without compression turned on.

On Sat, May 16, 2020 at 10:19 AM Amol Umbarkar <am...@gmail.com> wrote:
>
> Thanks Wes. Will do.
>
> On Sat, May 16, 2020 at 7:06 PM Wes McKinney <we...@gmail.com> wrote:
>
> > hi Amol,
> >
> > thanks for pointing that out. Such a heuristic (observing compression
> > ratios of stream messages) could be implemented at some point so that
> > compression could be toggled off mid-stream if it doesn't seem to be
> > helping. Feel free to open a JIRA issue about this
> >
> > - Wes
> >
> > > On Sat, May 16, 2020 at 6:39 AM Amol Umbarkar <am...@gmail.com> wrote:
> > > >
> > > > Hello All,
> > > > I was going through dask developer log recently. Dask seems to be
> > > > selectively do compression if it is found to be useful. They sort of pick
> > > > 10kb of sample upfront to calculate compression and if the results are good
> > > > then the whole batch is compressed. This seems to save de-compression
> > > > effort on receiver side.
> > > >
> > > > Please take a look at
> > > > https://blog.dask.org/2016/04/14/dask-distributed-optimizing-protocol#problem-3-unwanted-compression
> > >
> > > Thought this could be relevant to arrow batch transfers as well.
> > >
> > > Thanks,
> > > Amol
> > >
> > > > On Thu, Apr 23, 2020 at 5:54 AM Wes McKinney <we...@gmail.com> wrote:
> > >
> > > > Hello,
> > > >
> > > > I have proposed adding a simple RecordBatch IPC message body
> > > > compression scheme (using either LZ4 or ZSTD) to the Arrow IPC
> > > > protocol in GitHub PR [1] as discussed on the mailing list [2]. This
> > > > is distinct from separate discussions about adding in-memory encodings
> > > > (like RLE-encoding) to the Arrow columnar format.
> > > >
> > > > This change is not forward compatible so it will not be safe to send
> > > > compressed messages to old libraries, but since we are still pre-1.0.0
> > > > the consensus is that this is acceptable. We may separately consider
> > > > increasing the metadata version for 1.0.0 to require clients to
> > > > upgrade.
> > > >
> > > > Please vote whether to accept the addition. The vote will be open for
> > > > at least 72 hours.
> > > >
> > > > [ ] +1 Accept this addition to the IPC protocol
> > > > [ ] +0
> > > > [ ] -1 Do not accept the changes because...
> > > >
> > > > Here is my vote: +1
> > > >
> > > > Thanks,
> > > > Wes
> > > >
> > > > [1]: https://github.com/apache/arrow/pull/6707
> > > > [2]:
> > > >
> > https://lists.apache.org/thread.html/r58c9d23ad159644fca590d8f841df80d180b11bfb72f949d601d764b%40%3Cdev.arrow.apache.org%3E
> > > >
> >

Re: IPC body buffer compression

Posted by Amol Umbarkar <am...@gmail.com>.
Thanks Wes. Will do.

On Sat, May 16, 2020 at 7:06 PM Wes McKinney <we...@gmail.com> wrote:

> hi Amol,
>
> thanks for pointing that out. Such a heuristic (observing compression
> ratios of stream messages) could be implemented at some point so that
> compression could be toggled off mid-stream if it doesn't seem to be
> helping. Feel free to open a JIRA issue about this
>
> - Wes
>
> > On Sat, May 16, 2020 at 6:39 AM Amol Umbarkar <am...@gmail.com> wrote:
> > >
> > > Hello All,
> > > I was going through dask developer log recently. Dask seems to be
> > > selectively do compression if it is found to be useful. They sort of pick
> > > 10kb of sample upfront to calculate compression and if the results are good
> > > then the whole batch is compressed. This seems to save de-compression
> > > effort on receiver side.
> > >
> > > Please take a look at
> > > https://blog.dask.org/2016/04/14/dask-distributed-optimizing-protocol#problem-3-unwanted-compression
> >
> > Thought this could be relevant to arrow batch transfers as well.
> >
> > Thanks,
> > Amol
> >
> > > On Thu, Apr 23, 2020 at 5:54 AM Wes McKinney <we...@gmail.com> wrote:
> >
> > > Hello,
> > >
> > > I have proposed adding a simple RecordBatch IPC message body
> > > compression scheme (using either LZ4 or ZSTD) to the Arrow IPC
> > > protocol in GitHub PR [1] as discussed on the mailing list [2]. This
> > > is distinct from separate discussions about adding in-memory encodings
> > > (like RLE-encoding) to the Arrow columnar format.
> > >
> > > This change is not forward compatible so it will not be safe to send
> > > compressed messages to old libraries, but since we are still pre-1.0.0
> > > the consensus is that this is acceptable. We may separately consider
> > > increasing the metadata version for 1.0.0 to require clients to
> > > upgrade.
> > >
> > > Please vote whether to accept the addition. The vote will be open for
> > > at least 72 hours.
> > >
> > > [ ] +1 Accept this addition to the IPC protocol
> > > [ ] +0
> > > [ ] -1 Do not accept the changes because...
> > >
> > > Here is my vote: +1
> > >
> > > Thanks,
> > > Wes
> > >
> > > [1]: https://github.com/apache/arrow/pull/6707
> > > [2]:
> > >
> https://lists.apache.org/thread.html/r58c9d23ad159644fca590d8f841df80d180b11bfb72f949d601d764b%40%3Cdev.arrow.apache.org%3E
> > >
>

Re: IPC body buffer compression

Posted by Wes McKinney <we...@gmail.com>.
hi Amol,

Thanks for pointing that out. Such a heuristic (observing compression
ratios of stream messages) could be implemented at some point so that
compression could be toggled off mid-stream if it doesn't seem to be
helping. Feel free to open a JIRA issue about this.

- Wes
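
One way such a heuristic might look, as a sketch only: a writer-side helper that tracks the compression ratio of recent messages and stops compressing once the average stays poor. Nothing here is Arrow API. The class name, the window of 8 messages, and the 0.9 threshold are assumptions, and zlib stands in for LZ4/ZSTD because it ships with the Python standard library.

```python
import os
import zlib
from collections import deque


class AdaptiveCompressor:
    """Track recent compression ratios and turn compression off if it stops helping."""

    def __init__(self, window: int = 8, max_ratio: float = 0.9):
        # compressed size / uncompressed size for the most recent messages
        self.ratios = deque(maxlen=window)
        self.max_ratio = max_ratio  # assumed threshold, not from the thread
        self.enabled = True

    def encode(self, body: bytes):
        """Return (is_compressed, data) for one message body."""
        if not self.enabled or not body:
            return False, body
        compressed = zlib.compress(body)
        self.ratios.append(len(compressed) / len(body))
        # Decide only once the window is full, so one odd message cannot flip the switch.
        if len(self.ratios) == self.ratios.maxlen:
            if sum(self.ratios) / len(self.ratios) > self.max_ratio:
                self.enabled = False
        # Never send a "compressed" body that is not smaller than the original.
        if len(compressed) >= len(body):
            return False, body
        return True, compressed


if __name__ == "__main__":
    writer = AdaptiveCompressor(window=3)
    for i in range(5):
        flag, data = writer.encode(os.urandom(4096))  # incompressible input
        print(i, flag, len(data), writer.enabled)
```

The switch in this sketch is one-way. A real implementation would probably also want a way to turn compression back on, for example by re-sampling every so often.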

On Sat, May 16, 2020 at 6:39 AM Amol Umbarkar <am...@gmail.com> wrote:
>
> Hello All,
> I was going through dask developer log recently. Dask seems to be
> selectively do compression if it is found to be useful. They sort of pick
> 10kb of sample upfront to calculate compression and if the results are good
> then the whole batch is compressed. This seems to save de-compression
> effort on receiver side.
>
> Please take a look at
> https://blog.dask.org/2016/04/14/dask-distributed-optimizing-protocol#problem-3-unwanted-compression
>
> Thought this could be relevant to arrow batch transfers as well.
>
> Thanks,
> Amol
>
> On Thu, Apr 23, 2020 at 5:54 AM Wes McKinney <we...@gmail.com> wrote:
>
> > Hello,
> >
> > I have proposed adding a simple RecordBatch IPC message body
> > compression scheme (using either LZ4 or ZSTD) to the Arrow IPC
> > protocol in GitHub PR [1] as discussed on the mailing list [2]. This
> > is distinct from separate discussions about adding in-memory encodings
> > (like RLE-encoding) to the Arrow columnar format.
> >
> > This change is not forward compatible so it will not be safe to send
> > compressed messages to old libraries, but since we are still pre-1.0.0
> > the consensus is that this is acceptable. We may separately consider
> > increasing the metadata version for 1.0.0 to require clients to
> > upgrade.
> >
> > Please vote whether to accept the addition. The vote will be open for
> > at least 72 hours.
> >
> > [ ] +1 Accept this addition to the IPC protocol
> > [ ] +0
> > [ ] -1 Do not accept the changes because...
> >
> > Here is my vote: +1
> >
> > Thanks,
> > Wes
> >
> > [1]: https://github.com/apache/arrow/pull/6707
> > [2]:
> > https://lists.apache.org/thread.html/r58c9d23ad159644fca590d8f841df80d180b11bfb72f949d601d764b%40%3Cdev.arrow.apache.org%3E
> >

IPC body buffer compression

Posted by Amol Umbarkar <am...@gmail.com>.
Hello All,
I was going through the Dask developer blog recently. Dask seems to
compress selectively, only when compression is found to be useful. It takes
a sample of about 10 kB up front to estimate the compression ratio, and if
the result is good, the whole batch is compressed. This seems to save
decompression effort on the receiver side.

Please take a look at
https://blog.dask.org/2016/04/14/dask-distributed-optimizing-protocol#problem-3-unwanted-compression

I thought this could be relevant to Arrow batch transfers as well.

Thanks,
Amol
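
The sampling heuristic described above can be sketched roughly as follows. This is an illustration only, not Dask's or Arrow's actual code. The 10 kB sample size comes from the description above, while the 0.9 ratio threshold, the choice to sample from the start of the payload, the zlib codec, and the function name are all assumptions.

```python
import os
import zlib

SAMPLE_SIZE = 10_000  # roughly 10 kB, as in the description above
MAX_RATIO = 0.9       # assumed threshold: the sample must shrink by at least 10%


def maybe_compress(payload: bytes):
    """Return (is_compressed, data), compressing only if a sample compresses well."""
    sample = payload[:SAMPLE_SIZE]
    if not sample:
        return False, payload
    ratio = len(zlib.compress(sample)) / len(sample)
    if ratio > MAX_RATIO:
        # The sample barely shrinks, so skip compression for the whole batch.
        return False, payload
    return True, zlib.compress(payload)


if __name__ == "__main__":
    cases = [("repetitive", b"abcd" * 50_000), ("random", os.urandom(200_000))]
    for name, data in cases:
        flag, out = maybe_compress(data)
        print(name, flag, len(data), len(out))
```

The receiver needs the returned flag (or an equivalent marker in the message metadata) to know whether to decompress.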

On Thu, Apr 23, 2020 at 5:54 AM Wes McKinney <we...@gmail.com> wrote:

> Hello,
>
> I have proposed adding a simple RecordBatch IPC message body
> compression scheme (using either LZ4 or ZSTD) to the Arrow IPC
> protocol in GitHub PR [1] as discussed on the mailing list [2]. This
> is distinct from separate discussions about adding in-memory encodings
> (like RLE-encoding) to the Arrow columnar format.
>
> This change is not forward compatible so it will not be safe to send
> compressed messages to old libraries, but since we are still pre-1.0.0
> the consensus is that this is acceptable. We may separately consider
> increasing the metadata version for 1.0.0 to require clients to
> upgrade.
>
> Please vote whether to accept the addition. The vote will be open for
> at least 72 hours.
>
> [ ] +1 Accept this addition to the IPC protocol
> [ ] +0
> [ ] -1 Do not accept the changes because...
>
> Here is my vote: +1
>
> Thanks,
> Wes
>
> [1]: https://github.com/apache/arrow/pull/6707
> [2]:
> https://lists.apache.org/thread.html/r58c9d23ad159644fca590d8f841df80d180b11bfb72f949d601d764b%40%3Cdev.arrow.apache.org%3E
>