Posted to dev@arrow.apache.org by Amol Umbarkar <am...@gmail.com> on 2020/05/16 11:39:27 UTC

IPC body buffer compression

Hello All,
I was going through the Dask developer blog recently. Dask seems to apply
compression selectively, only when it is found to be useful: it takes a
~10 kB sample upfront to estimate the compression ratio, and if the result
is good the whole batch is compressed. This seems to save decompression
effort on the receiver side.

Please take a look at
https://blog.dask.org/2016/04/14/dask-distributed-optimizing-protocol#problem-3-unwanted-compression

I thought this could be relevant to Arrow record batch transfers as well.
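
Very roughly, the idea is something like the sketch below (hypothetical code,
not Dask's actual implementation; zlib is only a stand-in for whatever codec
is used, and the 10 kB / 90% numbers are illustrative):

import zlib  # stand-in codec; the real scheme would use LZ4 or ZSTD

SAMPLE_SIZE = 10_000   # ~10 kB sample, as in the Dask write-up
GOOD_RATIO = 0.9       # only compress if the sample shrinks below 90% of its size

def maybe_compress(payload: bytes):
    """Compress payload only when a small sample suggests it is worthwhile.

    Returns (is_compressed, data).
    """
    if len(payload) <= SAMPLE_SIZE:
        return False, payload                # too small to bother
    sample = payload[:SAMPLE_SIZE]
    if len(zlib.compress(sample)) / len(sample) > GOOD_RATIO:
        return False, payload                # sample barely shrinks, skip it
    return True, zlib.compress(payload)      # sample compresses well, do it all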

Thanks,
Amol

On Thu, Apr 23, 2020 at 5:54 AM Wes McKinney <we...@gmail.com> wrote:

> Hello,
>
> I have proposed adding a simple RecordBatch IPC message body
> compression scheme (using either LZ4 or ZSTD) to the Arrow IPC
> protocol in GitHub PR [1] as discussed on the mailing list [2]. This
> is distinct from separate discussions about adding in-memory encodings
> (like RLE-encoding) to the Arrow columnar format.
>
> This change is not forward compatible so it will not be safe to send
> compressed messages to old libraries, but since we are still pre-1.0.0
> the consensus is that this is acceptable. We may separately consider
> increasing the metadata version for 1.0.0 to require clients to
> upgrade.
>
> Please vote whether to accept the addition. The vote will be open for
> at least 72 hours.
>
> [ ] +1 Accept this addition to the IPC protocol
> [ ] +0
> [ ] -1 Do not accept the changes because...
>
> Here is my vote: +1
>
> Thanks,
> Wes
>
> [1]: https://github.com/apache/arrow/pull/6707
> [2]:
> https://lists.apache.org/thread.html/r58c9d23ad159644fca590d8f841df80d180b11bfb72f949d601d764b%40%3Cdev.arrow.apache.org%3E
>

Re: IPC body buffer compression

Posted by Wes McKinney <we...@gmail.com>.
BTW, I just opened https://issues.apache.org/jira/browse/ARROW-8823
since we don't currently track what the uncompressed size would have
been when compression is turned on.
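
In the meantime, the ratio can be estimated from user code by serializing the
same batch with and without compression. A minimal sketch, assuming a pyarrow
release (1.0 or later) where the option is exposed as
IpcWriteOptions(compression=...); the batch contents and the choice of zstd
are only illustrative:

import pyarrow as pa

# Toy batch; the column contents are made up purely for illustration.
batch = pa.record_batch([pa.array([0] * 100_000)], names=["x"])

def stream_size(options=None):
    """Serialize the batch into an in-memory IPC stream and return its size."""
    sink = pa.BufferOutputStream()
    writer = pa.ipc.new_stream(sink, batch.schema, options=options)
    writer.write_batch(batch)
    writer.close()
    return sink.getvalue().size

plain = stream_size()
compressed = stream_size(pa.ipc.IpcWriteOptions(compression="zstd"))
print(f"compressed/uncompressed ratio: {compressed / plain:.3f}")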

Re: IPC body buffer compression

Posted by Amol Umbarkar <am...@gmail.com>.
Thanks Wes. Will do.

Re: IPC body buffer compression

Posted by Wes McKinney <we...@gmail.com>.
hi Amol,

thanks for pointing that out. Such a heuristic (observing compression
ratios of stream messages) could be implemented at some point so that
compression could be toggled off mid-stream if it doesn't seem to be
helping. Feel free to open a JIRA issue about this.
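
Something along these lines, purely as a sketch (the compress/send callables,
threshold, and window are placeholders, not an existing Arrow API):

def compressing_sender(messages, compress, send, threshold=0.95, window=10):
    """Compress outgoing messages until the observed ratio stops paying off.

    `compress` and `send` are caller-supplied callables. Once the average
    compressed/uncompressed ratio over the last `window` messages exceeds
    `threshold`, compression is switched off for the rest of the stream.
    """
    ratios = []
    enabled = True
    for msg in messages:
        body = compress(msg) if enabled else msg
        send(enabled, body)   # each message carries its own "compressed" flag
        if enabled:
            ratios.append(len(body) / len(msg))
            if len(ratios) >= window and sum(ratios[-window:]) / window > threshold:
                enabled = False   # not helping; stop compressing from here on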

- Wes
