Posted to dev@arrow.apache.org by Bryan Cutler <cu...@gmail.com> on 2019/07/11 18:43:36 UTC

Re: Adding a new encoding for FP data - unsubscribe

Mani, please send a reply to dev-unsubscribe@arrow.apache.org to remove
yourself from the list.

On Thu, Jul 11, 2019 at 11:10 AM mani vannan <ma...@analyticswise.com>
wrote:

> All,
>
> Can someone please help me to unsubscribe from this group?
>
> Thank you.
>
> -----Original Message-----
> From: Radev, Martin <ma...@tum.de>
> Sent: Thursday, July 11, 2019 2:08 PM
> To: dev@arrow.apache.org; emkornfield@gmail.com
> Cc: Raoofy, Amir <am...@tum.de>; Karlstetter, Roman <
> roman.karlstetter@tum.de>
> Subject: Re: Adding a new encoding for FP data
>
> Hello Micah,
>
>
> the changes will go to the C++ implementation of Parquet within Arrow.
>
> In that sense, if Arrow uses the compression and encoding methods
> available in Parquet in any way, I expect a benefit.
>
>
> My plan is to add the new encoding to parquet-cpp and parquet-mr (Java).
>
>
> If you have any more questions or concerns, let me know.
>
> I am close to done with my patch.
>
>
> Regards,
>
> Martin
>
>
> ________________________________
> From: Micah Kornfield <em...@gmail.com>
> Sent: Thursday, July 11, 2019 5:26:26 PM
> To: dev@arrow.apache.org
> Cc: Raoofy, Amir; Karlstetter, Roman
> Subject: Re: Adding a new encoding for FP data
>
> Hi Martin,
> Can you clarify whether you were expecting the encoding to be used only
> in Parquet, or more generally in Arrow?
>
> Thanks,
> Micah
>
> On Thu, Jul 11, 2019 at 7:06 AM Wes McKinney <we...@gmail.com> wrote:
>
> > hi folks,
> >
> > If you could participate in Micah's discussion about compression and
> > encoding generally at
> >
> >
> > https://lists.apache.org/thread.html/a99124e57c14c3c9ef9d98f3c80cfe1dd25496bf3ff7046778add937@%3Cdev.arrow.apache.org%3E
> >
> > it would be helpful. I personally think that Arrow would benefit from
> > an alternate protocol message type to the current RecordBatch (as
> > defined in Message.fbs) that allows for encoded or compressed columns.
> > This won't be an overnight change (more on the order of months of
> > work), but it's worth taking the time to carefully consider the
> > implications of developing and supporting such a feature for the long
> > term.
> >
> > On Thu, Jul 11, 2019 at 5:34 AM Fan Liya <li...@gmail.com> wrote:
> > >
> > > Hi Radev,
> > >
> > > Thanks a lot for providing so many technical details. I need to read
> > > them carefully.
> > >
> > > I think FP encoding is definitely a useful feature.
> > > I hope this feature can be implemented in Arrow soon, so that we can
> > > use it in our system.
> > >
> > > Best,
> > > Liya Fan
> > >
> > > On Thu, Jul 11, 2019 at 5:55 PM Radev, Martin <ma...@tum.de>
> > wrote:
> > >
> > > > Hello Liya Fan,
> > > >
> > > >
> > > > this explains the technique but for a more complex case:
> > > >
> > > >
> > > > https://fgiesen.wordpress.com/2011/01/24/x86-code-compression-in-kkrunchy/
> > > >
> > > > For FP data, the approach which seemed to work best is the
> > > > following.
> > > >
> > > > Say we have a buffer of two 32-bit floating point values:
> > > >
> > > > buf = [af, bf]
> > > >
> > > > We interpret each FP value as a 32-bit uint and look at each
> > > > individual byte. We have 8 bytes in total for this small input.
> > > >
> > > > buf = [af0, af1, af2, af3, bf0, bf1, bf2, bf3]
> > > >
> > > > Then we apply stream splitting and the new buffer becomes:
> > > >
> > > > newbuf = [af0, bf0, af1, bf1, af2, bf2, af3, bf3]
> > > >
> > > > We compress newbuf.
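[The byte regrouping described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the actual parquet-cpp patch; the function names `split_streams` and `merge_streams` are made up for the example.]

```python
import struct

def split_streams(values):
    """Stream splitting: pack float32 values, then regroup the raw bytes so
    that byte 0 of every value comes first, then byte 1, and so on."""
    raw = struct.pack("<%df" % len(values), *values)  # little-endian float32
    n, width = len(values), 4
    return bytes(raw[i * width + b] for b in range(width) for i in range(n))

def merge_streams(buf, width=4):
    """Inverse transform: restore the original interleaved byte order."""
    n = len(buf) // width
    return bytes(buf[b * n + i] for i in range(n) for b in range(width))

# For buf = [af, bf], split_streams produces
# [af0, bf0, af1, bf1, af2, bf2, af3, bf3], as in the example above.
```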
> > > >
> > > > Due to similarities in the sign bits, mantissa bits and MSB exponent
> > > > bits, we might have a lot more repetitions in the data. For
> > > > scientific data, the 2nd and 3rd byte of each 32-bit value is
> > > > probably largely noise. Thus, in the original representation we
> > > > would always have a few bytes of data which could appear somewhere
> > > > else in the buffer, followed by a couple of bytes of possible
> > > > noise. In the new representation we have a long stream of data
> > > > which could compress well, followed by a sequence of noise towards
> > > > the end.
> > > >
> > > > This transformation improved the compression ratio, as can be seen
> > > > in the report.
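[As a rough illustration of why the transform helps a general-purpose compressor, the snippet below compresses a smooth synthetic float32 series with Python's stdlib zlib, before and after splitting. This is not the benchmark or the compressors from the report; the data and sizes are only indicative.]

```python
import math
import struct
import zlib

# Smooth synthetic float32 series: the low-order mantissa bytes look like
# noise, while the sign/exponent bytes are highly repetitive.
values = [math.sin(i * 0.01) for i in range(4096)]
raw = struct.pack("<%df" % len(values), *values)

# Stream splitting: all byte-0s first, then all byte-1s, etc.
n, width = len(values), 4
split = bytes(raw[i * width + b] for b in range(width) for i in range(n))

before = len(zlib.compress(raw, 9))   # interleaved representation
after = len(zlib.compress(split, 9))  # split representation
print("compressed size, interleaved:", before)
print("compressed size, split:      ", after)
```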
> > > >
> > > > It also improved speed for ZSTD. This could be because ZSTD decides
> > > > how to compress the data - RLE, a new Huffman tree, the Huffman
> > > > tree of the previous frame, or the raw representation. Each can
> > > > potentially achieve a different compression ratio and
> > > > compression/decompression speed. It turned out that when the
> > > > transformation is applied, zstd would attempt to compress fewer
> > > > frames and copy the others. This could lead to fewer attempts to
> > > > build a Huffman tree. It's hard to pinpoint the exact reason.
> > > >
> > > > I did not try other lossless text compressors, but I expect similar
> > > > results.
> > > >
> > > > For code, I can polish my patches, create a Jira task and submit
> > > > the patches for review.
> > > >
> > > >
> > > > Regards,
> > > >
> > > > Martin
> > > >
> > > >
> > > > ________________________________
> > > > From: Fan Liya <li...@gmail.com>
> > > > Sent: Thursday, July 11, 2019 11:32:53 AM
> > > > To: dev@arrow.apache.org
> > > > Cc: Raoofy, Amir; Karlstetter, Roman
> > > > Subject: Re: Adding a new encoding for FP data
> > > >
> > > > Hi Radev,
> > > >
> > > > Thanks for the information. It seems interesting.
> > > > IMO, Arrow has much to do for data compression. However, there seem
> > > > to be some differences between in-memory data compression and
> > > > external-storage data compression.
> > > >
> > > > Could you please provide some reference for stream splitting?
> > > >
> > > > Best,
> > > > Liya Fan
> > > >
> > > > On Thu, Jul 11, 2019 at 5:15 PM Radev, Martin
> > > > <ma...@tum.de>
> > wrote:
> > > >
> > > > > Hello people,
> > > > >
> > > > >
> > > > > there has been discussion on the Apache Parquet mailing list
> > > > > about adding a new encoder for FP data.
> > > > > The reason for this is that the compressors supported by Apache
> > > > > Parquet (zstd, gzip, etc.) do not compress raw FP data well.
> > > > >
> > > > >
> > > > > In my investigation it turned out that a very simple technique,
> > > > > named stream splitting, can improve the compression ratio and
> > > > > even speed for some of the compressors.
> > > > >
> > > > > You can read about the results here:
> > > > >
> > > > > https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view
>
>
>
> > > > >
> > > > >
> > > > > I went through the developer guide for Apache Arrow and wrote a
> > > > > patch to add the new encoding and test coverage for it.
> > > > >
> > > > > I will polish my patch and work in parallel to extend the Apache
> > > > > Parquet format for the new encoding.
> > > > >
> > > > >
> > > > > If you have any concerns, please let me know.
> > > > >
> > > > >
> > > > > Regards,
> > > > >
> > > > > Martin
> > > > >
> > > > >
> > > >
> >
>