Posted to dev@parquet.apache.org by "Radev, Martin" <ma...@tum.de> on 2019/09/03 19:17:23 UTC

Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet

Hello all,


thank you Julien for the interest.


Could other people who are part of Apache Parquet share their opinions?

Do you have your own data which you would like to use for testing the new encoding?


Regards,

Martin

________________________________
From: Julien Le Dem <ju...@wework.com.INVALID>
Sent: Friday, August 30, 2019 2:38:37 AM
To: dev@parquet.apache.org
Cc: Raoofy, Amir; Karlstetter, Roman
Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet

This looks promising to me. At first glance it seems to combine
simplicity and efficiency.
I'd like to hear more from other members of the PMC.

On Tue, Aug 27, 2019 at 5:30 AM Radev, Martin <ma...@tum.de> wrote:

> Dear all,
>
>
> there was some earlier discussion on adding a new encoding for better
> compression of FP32 and FP64 data.
>
>
> The pull request which extends the format is here:
> https://github.com/apache/parquet-format/pull/144
> The change already has one approval from Zoltan.
>
>
> The results from an investigation on compression ratio and speed with the
> new encoding vs other encodings are available here:
> https://github.com/martinradev/arrow-fp-compression-bench
> It is visible that for many tests the new encoding performs better in
> compression ratio and in some cases in speed. The improvements in
> compression speed come from the fact that the new format can potentially
> lead to faster parsing for some compressors like GZIP.
>
>
> An earlier report which examines other FP compressors (fpzip, spdp, fpc,
> zfp, sz) and new potential encodings is available here:
> https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view?usp=sharing
> The report also covers lossy compression but the BYTE_STREAM_SPLIT
> encoding focuses only on lossless compression.
>
>
> Can we have a vote?
>
>
> Regards,
>
> Martin
>
>
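
A minimal sketch of the byte-stream-split transform described above, assuming
k-byte values (k = 4 for FP32, k = 8 for FP64). This is an illustration only,
not the code from the linked pull request or patches; the function names are
made up.

    // Sketch of BYTE_STREAM_SPLIT: byte b of value i moves to position
    // b * num_values + i, so bytes of equal significance end up in
    // contiguous streams.
    #include <cstddef>
    #include <cstdint>

    void byte_stream_split_encode(const uint8_t* values, size_t num_values,
                                  size_t k, uint8_t* out) {
      for (size_t i = 0; i < num_values; ++i)
        for (size_t b = 0; b < k; ++b)
          out[b * num_values + i] = values[i * k + b];
    }

    // Decoding is the inverse gather back into contiguous values.
    void byte_stream_split_decode(const uint8_t* streams, size_t num_values,
                                  size_t k, uint8_t* out) {
      for (size_t i = 0; i < num_values; ++i)
        for (size_t b = 0; b < k; ++b)
          out[i * k + b] = streams[b * num_values + i];
    }

The transform itself adds no compression; it only groups bytes of equal
significance so that a general-purpose codec such as GZIP or ZSTD sees longer
repetitions.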

[RESULT] [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet

Posted by Wes McKinney <we...@gmail.com>.
Forwarding the vote outcome with the [RESULT] subject line for
searchability purposes

On Tue, Dec 3, 2019 at 2:28 AM Gabor Szadovszky <ga...@apache.org> wrote:
>
> Hi Martin,
>
> The vote succeeded: 4 +1 votes (3 binding) and no -1 votes.
> I'll submit your PR soon.
>
> Cheers,
> Gabor
>
> On Fri, Nov 29, 2019 at 10:59 PM Radev, Martin <ma...@tum.de> wrote:
>
> > Dear Parquet community,
> >
> > thank you for the votes and time spent reviewing.
> >
> > So far there have been 4 votes on the mailing list and Zoltan added a +1
> > to the Github pull request some time ago.
> > How many more would be necessary to have the spec change accepted?
> > Once that is done I can continue with polishing my changes for Arrow and
> > Parquet-MR.
> >
> > There are some new things.
> > I did some further tests and implementation changes (no spec changes):
> >
> >   1.  I added more real data tests from various sources.
> >   2.  I generated synthetic data with varying zero-order entropy of an
> > element for each test file (see the entropy sketch after this list).
> > This should help give some idea of when different encoding-compressor
> > options perform well.
> >   3.  I investigated different possible implementations for the
> > encoder/decoder and added the most efficient option locally into Arrow's
> > parquet-cpp.
> > This was necessary since I found out that my early scalar implementation
> > in Arrow did not get auto-vectorized by the compiler and thus was somewhat
> > inefficient, especially if the data is present in the cache.
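
For reference, a sketch of the zero-order entropy measure mentioned in point
2, computed in bits per element over a histogram of the raw 32-bit value
patterns. An illustration only, not the benchmark's actual code.

    // Sketch: zero-order (Shannon) entropy in bits per element.
    #include <cmath>
    #include <cstddef>
    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    double zero_order_entropy_bits(const std::vector<uint32_t>& elems) {
      std::unordered_map<uint32_t, std::size_t> hist;
      for (uint32_t e : elems) ++hist[e];
      const double n = static_cast<double>(elems.size());
      double h = 0.0;
      for (const auto& kv : hist) {
        const double p = static_cast<double>(kv.second) / n;
        h -= p * std::log2(p);  // sum over the empirical distribution
      }
      return h;  // 0 for constant data; larger for more random data
    }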
> >
> > The report is available here:
> > https://github.com/martinradev/arrow-fp-compression-bench/blob/master/optimize_byte_stream_split/report_final.pdf.
> > It mostly contains tables and plots with some context information and
> > reasoning.
> > The two SIMD implementations I wrote are available here in a standalone
> > program for benchmarking and validating:
> > https://github.com/martinradev/arrow-fp-compression-bench/blob/master/optimize_byte_stream_split/prog.cpp
> >
> > The short story is that the BYTE_STREAM_SPLIT encoding also performs the
> > best among all other options for the newly added test cases.
> > Furthermore, the plot for data with varying entropy shows that the
> > BYTE_STREAM_SPLIT encoding with zstd provides the best compression ratio
> > when the entropy per float element is >8 bits. The BYTE_STREAM_SPLIT
> > encoding+zstd provides better write throughput than all other combinations.
> > It also improves read throughput compared to just using ZSTD to compress data.
> > The report also shows the performance difference for different
> > implementations of the encoder/decoder and compares them to memcpy. There's
> > a clear performance advantage in using the custom vectorized version over
> > what GCC generates.
> >
> > If you have any questions, let me know.
> >
> > Kind regards,
> > Martin
> >
> >
> > ________________________________
> > From: Junjie Chen <ch...@gmail.com>
> > Sent: Friday, November 8, 2019 2:56:48 AM
> > To: dev@parquet.apache.org
> > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
> >
> > +1 from me to add BYTE_STREAM_SPLIT to parquet-format.
> >
> > Gabor Szadovszky <ga...@apache.org> wrote on Thursday, November 7, 2019 at 6:07 PM:
> > >
> > > +1 for adding BYTE_STREAM_SPLIT encoding to parquet-format.
> > >
> > > On Tue, Nov 5, 2019 at 11:22 PM Wes McKinney <we...@gmail.com>
> > wrote:
> > >
> > > > +1 from me on adding the FP encoding
> > > >
> > > > On Sat, Nov 2, 2019 at 4:51 AM Radev, Martin <ma...@tum.de>
> > wrote:
> > > > >
> > > > > Hello all,
> > > > >
> > > > >
> > > > > thanks to Ryan for the vote and to Wes for the feedback.
> > > > >
> > > > >
> > > > > The concern with regards to adding more complex features in the
> > Parquet
> > > > spec is valid.
> > > > >
> > > > > However, the proposed encoding is very simple and I already have
> > > > unpolished patches for both parquet-mr and arrow.
> > > > >
> > > > > In its design I purposely opted for something simple to guarantee 1)
> > > > good compression speed and 2) ease of implementation.
> > > > >
> > > > >
> > > > > On the topic of testing, I added four more test cases which were
> > > > > taken from here<https://sdrbench.github.io/>. I also added the size
> > > > > in MB of all test cases and the entropy per element.
> > > > >
> > > > > Having the entropy reported helps show that the encoding performs
> > better
> > > > than any other option for high-entropy data and not so well for
> > low-entropy
> > > > data.
> > > > >
> > > > >
> > > > > I would be happy to receive some more feedback and votes.
> > > > >
> > > > >
> > > > > Kind regards,
> > > > >
> > > > > Martin
> > > > >
> > > > > ________________________________
> > > > > From: Ryan Blue <rb...@netflix.com.INVALID>
> > > > > Sent: Friday, November 1, 2019 6:28 PM
> > > > > To: Parquet Dev
> > > > > Cc: Raoofy, Amir; Karlstetter, Roman
> > > > > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
> > > > >
> > > > > I'm +1 for adding the definition of BYTE_STREAM_SPLIT to the format.
> > > > Looks
> > > > > like it is simple and performs well. We could use a good floating
> > point
> > > > > encoding.
> > > > >
> > > > > I don't think I agree that differences in features between the Java
> > and
> > > > CPP
> > > > > implementations should hold back new work. It would be great to have
> > more
> > > > > testing and validation, as well as more thorough implementations.
> > > > > But I don't think we should reject contributions like this because
> > > > > of those concerns.
> > > > >
> > > > > On Fri, Nov 1, 2019 at 9:27 AM Wes McKinney <we...@gmail.com>
> > wrote:
> > > > >
> > > > > > I have to say I'm struggling with piling more things into the
> > Parquet
> > > > > > specification when we already have a significant implementation
> > > > > > shortfall in other areas. LZ4 is still not properly implemented for
> > > > > > example, and then there is the question of the V2 encodings and
> > data
> > > > > > page formats.
> > > > > >
> > > > > > I'm generally in favor of adding more efficient storage of floating
> > > > > > point data, but will it actually be implemented broadly? Parquet
> > as a
> > > > > > format already has become an "implementation swamp" where any two
> > > > > > implementations may not be compatible with each other,
> > particularly in
> > > > > > consideration of more advanced features.
> > > > > >
> > > > > > For a single organization using a single implementation, having
> > > > > > advanced features may be useful, so I see the benefits to users
> > > > > > that tightly control what code and what settings they use.
> > > > > >
> > > > > > On Thu, Oct 31, 2019 at 3:51 AM Radev, Martin <martin.radev@tum.de
> > >
> > > > wrote:
> > > > > > >
> > > > > > > Dear all,
> > > > > > >
> > > > > > >
> > > > > > > would there be any interest in reviewing the BYTE_STREAM_SPLIT
> > > > encoding?
> > > > > > >
> > > > > > > Please feel free to contact me directly if you need help or would
> > > > like
> > > > > > to provide more test data.
> > > > > > >
> > > > > > >
> > > > > > > Results for the encoding based on the implementation in Arrow are
> > > > here:
> > > > > > https://github.com/martinradev/arrow-fp-compression-bench
> > > > > > > Patch to Arrow is here:
> > > > > >
> > > >
> > https://github.com/martinradev/arrow/commit/10de1e0f8a513b742edddeb6ba0d553617b1aa49
> > > > > > >
> > > > > > >
> > > > > > > The new encoding combined with a compressor performs better than
> > any
> > > > of
> > > > > > the other alternatives for data where there is little change in the
> > > > > > upper-most bytes of fp32 and fp64 values. My early experiments also
> > > > show
> > > > > > that this encoding+zstd performs better on average than any of the
> > > > > > specialized floating-point lossless compressors like fpc, spdp,
> > zfp.
> > > > > > >
> > > > > > >
> > > > > > > Regards,
> > > > > > >
> > > > > > > Martin
> > > > > > >
> > > > > > > ________________________________
> > > > > > > From: Radev, Martin <ma...@tum.de>
> > > > > > > Sent: Thursday, October 10, 2019 2:34:15 PM
> > > > > > > To: Parquet Dev
> > > > > > > Cc: Raoofy, Amir; Karlstetter, Roman
> > > > > > > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache
> > Parquet
> > > > > > >
> > > > > > > Dear Ryan Blue and other Parquet developers,
> > > > > > >
> > > > > > > I tested Ryan's proposal for modifying the encoding.
> > > > > > >
> > > > > > > The short answer is that it doesn't perform well in my tests. The
> > > > > > encoding, code and results can be viewed below.
> > > > > > >
> > > > > > >
> > > > > > > The current implementation only handles 32-bit IEEE754 floats in
> > the
> > > > > > following way:
> > > > > > >
> > > > > > >   1.  For each block of 128 values, the min and max of the
> > > > > > > exponent are computed.
> > > > > > >   2.  The number of bits for the exponent RLE is computed as
> > > > > > > ceil(log2(max - min + 1)). The sign bit uses 1 bit.
> > > > > > >   3.  The sign, exponent and 23 remaining mantissa bits are
> > > > > > > extracted.
> > > > > > >   4.  One RLE encoder is used for the sign and one for the
> > > > > > > exponent. A new RLE encoder for the exponent is created if the
> > > > > > > block requires fewer or more bits than the number of bits used
> > > > > > > for the current encoder. The 23 mantissa bits are divided into
> > > > > > > three streams. (Not sure whether this is strictly a good idea.)
> > > > > > >   5.  Also, for every 128 values we need to store 2 bytes: the
> > > > > > > min value and the number of bits used by the RLE. (A sketch of
> > > > > > > this per-block computation follows below.)
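
A minimal sketch of the per-block computation in steps 1, 2 and 5 above,
assuming FP32 values. This is an illustration only; for the real code see the
FPByteStreamSplitComponentRLEEncoder commit linked below. The struct and
function names here are made up.

    // Sketch: per-block (128 values) header for the StreamSplit+RLE idea.
    // The exponent of an FP32 value occupies bits 23..30 of its bit pattern.
    #include <algorithm>
    #include <cmath>
    #include <cstdint>

    struct BlockHeader {
      uint8_t min_exponent;   // subtracted from each exponent before RLE
      uint8_t rle_bit_width;  // ceil(log2(max - min + 1)) bits per exponent
    };

    BlockHeader compute_block_header(const uint32_t* values, int n /* <= 128 */) {
      uint8_t mn = 255, mx = 0;
      for (int i = 0; i < n; ++i) {
        const uint8_t e = static_cast<uint8_t>((values[i] >> 23) & 0xFF);
        mn = std::min(mn, e);
        mx = std::max(mx, e);
      }
      // Width is 0 when all exponents in the block are equal.
      const uint8_t width = static_cast<uint8_t>(
          std::ceil(std::log2(static_cast<double>(mx - mn + 1))));
      return BlockHeader{mn, width};
    }

These two bytes correspond to the per-block overhead mentioned in step 5.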
> > > > > > >
> > > > > > >
> > > > > > > I did not implement a decoder and have not added unit tests to
> > > > guarantee
> > > > > > that the implementation is sound.
> > > > > > >
> > > > > > > Ryan, can you please review the implementation as to whether it
> > > > > > corresponds to what you had in mind?
> > > > > > > Check FPByteStreamSplitComponentRLEEncoder. Please do not focus
> > on
> > > > the
> > > > > > code quality - this is only a quick hack.
> > > > > > >
> > > > > >
> > > >
> > https://github.com/martinradev/arrow/commit/7c1cc8b2a5ff9e6288c615c2acc44b1dce1bd773#diff-47fe879cb9baad6c633c55f0a34a09c3R979
> > > > > > >
> > > > > > > In my benchmark tests I compared the following scenarios:
> > > > > > >
> > > > > > >   *   dictionary
> > > > > > >   *   plain encoding + ZSTD
> > > > > > >   *   the StreamSplit+RLE implementation above
> > > > > > >   *   the StreamSplit+RLE implementation above + ZSTD
> > > > > > >   *   the original BYTE_STREAM_SPLIT proposal + ZSTD
> > > > > > >
> > > > > > > Size of parquet files, measured in MB, for each combination:
> > > > > > >
> > > > > > >                  Plain  Plain+ZSTD  Dict  Split+RLE  Split+RLE+ZSTD  StreamSplit+ZSTD
> > > > > > > msg_bt.sp          128         112   128        113              93                84
> > > > > > > msg_lu.sp           93          87    94         80              73                67
> > > > > > > msg_sp.sp          139         125   139        123              96                88
> > > > > > > msg_sweep3d.sp      60          19    60         47              40                13
> > > > > > > num_brain.sp        68          60    69         55              54                49
> > > > > > > num_comet.sp        52          45    50         46              42                38
> > > > > > > num_control.sp      77          71    77         70              69                63
> > > > > > > num_plasma.sp       17           2     8         16               8                 1
> > > > > > > obs_error.sp        30          24    30         25              22                21
> > > > > > > obs_info.sp         10           8    10          8               7                 7
> > > > > > > obs_spitzer.sp      95          83    95         99              82                72
> > > > > > > obs_temp.sp         20          18    20         18              18                16
> > > > > > >
> > > > > > >
> > > > > > > I have not tested the encoding on any of the other test data we
> > > > > > > have since they also contain FP64 columns. I did not add support
> > > > > > > for FP64 in the StreamSplit+RLE encoding to save time and also
> > > > > > > because I do not expect much improvement.
> > > > > > >
> > > > > > >
> > > > > > > From the results it can be seen that the proposed StreamSplit+RLE
> > > > > > encoding does not improve results.
> > > > > > >
> > > > > > > Using RLE for the sign bits and exponent obfuscates the input,
> > > > > > > and the compression step cannot fully take advantage of
> > > > > > > repetitions in the data since they were removed by the RLE step.
> > > > > > > Repetitive patterns are replaced by the RLE bits, which likely
> > > > > > > do not compress very well.
> > > > > > >
> > > > > > > Also, GZIP/ZSTD handle repetitions in data better on average.
> > > > > > > For example, GZIP's Deflate algorithm can encode a long run
> > > > > > > length with 3 bytes (not sure whether it can be less?) for the
> > > > > > > raw data plus 3 bytes for the reference (8 bits + 15 bits + 2
> > > > > > > bits).
> > > > > > >
> > > > > > > Now, the RLE step might produce better results for long runs of
> > > > > > > the same value. However, compressors also handle more complex
> > > > > > > cases where a pattern in the data doesn't necessarily have a
> > > > > > > long run length. Also, compressors like GZIP/ZSTD often do
> > > > > > > entropy-coding-aware parsing (
> > > > > > > http://cbloomrants.blogspot.com/2008/10/10-10-08-7_10.html ,
> > > > > > > https://martinradev.github.io/jekyll/update/2019/05/29/writing-a-pe32-x86-exe-packer.html
> > > > > > > ).
> > > > > > >
> > > > > > > The proposed RLE encoding+ZSTD is slower than
> > > > > > > BYTE_STREAM_SPLIT+ZSTD. The reason is that BYTE_STREAM_SPLIT only
> > > > > > > does a scattered memcpy whereas the new RLE encoding has to
> > > > > > > extract the components and run through each block of 128 values
> > > > > > > twice - once to compute min and max, and once to encode. There's
> > > > > > > also the overhead of using the Arrow RLEEncoder.
> > > > > > >
> > > > > > > Ryan and other folks, can you provide feedback? Does the
> > > > implementation
> > > > > > look reasonable to you?
> > > > > > >
> > > > > > > Can somebody please work with me on this new encoding? There has
> > > > > > > been some interest and some discussion, but it hasn't been
> > > > > > > pushed forward, likely due to work around the current release.
> > > > > > > For a bit more discussions and results, please refer to:
> > > > > > > Recent benchmark of Arrow implementation:
> > > > > > https://github.com/martinradev/arrow-fp-compression-bench
> > > > > > > Separate report:
> > > > > >
> > > >
> > https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view?usp=sharing
> > > > > > >
> > > > > > > Regards,
> > > > > > > Martin
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > ________________________________
> > > > > > > From: Ryan Blue <rb...@netflix.com>
> > > > > > > Sent: Thursday, September 19, 2019 7:54 PM
> > > > > > > To: Radev, Martin
> > > > > > > Cc: Parquet Dev; Raoofy, Amir; Karlstetter, Roman
> > > > > > > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache
> > Parquet
> > > > > > >
> > > > > > > Sounds good, thanks for working on this!
> > > > > > >
> > > > > > > On Thu, Sep 19, 2019 at 6:10 AM Radev, Martin <
> > martin.radev@tum.de
> > > > > > <ma...@tum.de>> wrote:
> > > > > > >
> > > > > > > Hello Ryan,
> > > > > > >
> > > > > > >
> > > > > > > we decided that it would be beneficial to try out your proposal.
> > > > > > >
> > > > > > >
> > > > > > > I will look into it and provide measurements on the compression
> > ratio
> > > > > > and speed.
> > > > > > >
> > > > > > >
> > > > > > > Regards,
> > > > > > >
> > > > > > > Martin
> > > > > > >
> > > > > > > ________________________________
> > > > > > > From: Ryan Blue <rb...@netflix.com>>
> > > > > > > Sent: Saturday, September 14, 2019 2:23:20 AM
> > > > > > > To: Radev, Martin
> > > > > > > Cc: Parquet Dev; Raoofy, Amir; Karlstetter, Roman
> > > > > > > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache
> > Parquet
> > > > > > >
> > > > > > > > Using RLE for the sign, exponents and the top-most mantissa
> > > > > > > > bytes can help when data is repetitive and make it worse for
> > > > > > > > other data.
> > > > > > >
> > > > > > > I agree. But we use RLE in similar cases because we do tend to
> > have
> > > > runs
> > > > > > of values, and values that fit in a fixed number of bits.
> > Exponents and
> > > > > > sign bits would probably fit this model extremely well most of the
> > > > time if
> > > > > > you have similar floating point values or sorted values. It would
> > be
> > > > really
> > > > > > interesting to see how well this performs in comparison to the
> > > > compression
> > > > > > tests you've already done. For mantissa bits, I agree it wouldn't
> > be
> > > > worth
> > > > > > encoding first.
> > > > > > >
> > > > > > > On Thu, Sep 12, 2019 at 2:56 AM Radev, Martin <
> > martin.radev@tum.de
> > > > > > <ma...@tum.de>> wrote:
> > > > > > >
> > > > > > > Hello Ryan, Wes and other parquet devs,
> > > > > > >
> > > > > > >
> > > > > > > thanks for the response. I was away on vacation and that's why I
> > am
> > > > > > answering just now.
> > > > > > >
> > > > > > >
> > > > > > > > whether you think adding run-length encoding to any of the byte
> > > > > > streams would be beneficial before applying Zstd.
> > > > > > > The short answer is "only for some cases but it will make it
> > > > > > > worse in both compression ratio and speed for other data".
> > > > > > >
> > > > > > > Our initial investigation also separated the sign, exponent and
> > > > mantissa
> > > > > > into separate streams.
> > > > > > >
> > > > > > > The encoding was the following assuming 32-bit IEEE754:
> > > > > > >
> > > > > > > - stream of sign bits
> > > > > > >
> > > > > > > - stream of exponent bits. Conveniently, the exponent of a
> > > > > > > 32-bit IEEE754 number is 8 bits.
> > > > > > >
> > > > > > > - separate the remaining 23 bits into three streams of 8, 8 and
> > > > > > > 7 bits. An extra zero bit is added to the stream which has only
> > > > > > > seven bits. This was done since zstd, zlib, etc. work at a byte
> > > > > > > granularity and we want repetitions to happen at that
> > > > > > > granularity. (A sketch of this component split follows below.)
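
A minimal sketch of the component extraction described in this list, assuming
32-bit IEEE754 values (an illustration only, not the code used for the report;
the names are made up):

    // Sketch: split a 32-bit IEEE754 bit pattern into the sign, exponent and
    // mantissa streams described above.
    #include <cstdint>

    struct Fp32Components {
      uint8_t sign;       // 1 bit
      uint8_t exponent;   // 8 bits
      uint32_t mantissa;  // 23 bits, split into 8 + 8 + 7(+1 zero pad) bits
    };

    Fp32Components split_fp32(uint32_t bits) {
      Fp32Components c;
      c.sign = static_cast<uint8_t>(bits >> 31);
      c.exponent = static_cast<uint8_t>((bits >> 23) & 0xFF);
      c.mantissa = bits & 0x7FFFFFu;
      return c;
    }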
> > > > > > >
> > > > > > > For 64-bit IEEE754 even more padding has to be added since the
> > > > exponent
> > > > > > is 11 bits and the mantissa is 52 bits. Thus, we have to add 5 more
> > > > > > exponent bits and 4 more mantissa bits to keep repetitions at a
> > byte
> > > > > > granularity. My original report shows results for when the
> > > > floating-point
> > > > > > values are split at a component granularity. Report is here:
> > > > > >
> > https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view
> > > > > > > Results are just slightly better in terms of compression ratio
> > > > > > > for some tests, but compression and decompression speed is, as
> > > > > > > expected, worse. The reason is that splitting a value is somewhat
> > > > > > > more complex. We need to keep a stream of bits for the signs,
> > > > > > > keep track of when a byte in the stream is exhausted, do bit
> > > > > > > manipulation to extract components, etc. This is also the reason
> > > > > > > why I preferred to go with the byte-wise decomposition of the
> > > > > > > values. It's faster and the compression ratio is just slightly
> > > > > > > worse for some of the tests.
> > > > > > >
> > > > > > >
> > > > > > > Using RLE for the sign, exponents and the top-most mantissa
> > > > > > > bytes can help when data is repetitive and make it worse for
> > > > > > > other data. I suppose using one of the compressors yields a
> > > > > > > better compression ratio on average. Also, this can again make
> > > > > > > encoding and decoding slower.
> > > > > > >
> > > > > > >
> > > > > > > The design of the BYTE_STREAM_SPLIT encoding had in mind two
> > things:
> > > > > > >
> > > > > > > - It would only make data more compressible and leave
> > compression to
> > > > the
> > > > > > codec in use.
> > > > > > >   This leaves the complexity to the codec and choice of
> > > > > > speed/compression ratio to the user.
> > > > > > >
> > > > > > > - It should be fast.
> > > > > > >   There's an extra compression step so preferably there's very
> > little
> > > > > > latency before it.
> > > > > > >
> > > > > > > @Wes, can you have a look?
> > > > > > >
> > > > > > > More opinions are welcome.
> > > > > > >
> > > > > > > If you have floating point data available, I would be very happy
> > to
> > > > > > examine whether this approach offers benefit for you.
> > > > > > >
> > > > > > >
> > > > > > > Regards,
> > > > > > >
> > > > > > > Martin
> > > > > > >
> > > > > > > ________________________________
> > > > > > > From: Ryan Blue <rb...@netflix.com.INVALID>
> > > > > > > Sent: Tuesday, September 3, 2019 11:51:46 PM
> > > > > > > To: Parquet Dev
> > > > > > > Cc: Raoofy, Amir; Karlstetter, Roman
> > > > > > > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache
> > Parquet
> > > > > > >
> > > > > > > Hi Martin,
> > > > > > >
> > > > > > > Thanks for taking a look at this! I agree that the approach here
> > > > looks
> > > > > > > promising. We've had occasional requests for lossy floating point
> > > > > > > compression in the past, so it would be good to add this.
> > > > > > >
> > > > > > > I did some work in this area a few years ago that is similar and
> > I'd
> > > > like
> > > > > > > to hear what you think about that approach compared to this one.
> > That
> > > > > > work
> > > > > > > was based on the same observation, that the main problem is the
> > > > mantissa,
> > > > > > > while exponents tend to compress well. What I did was take the
> > > > exponent
> > > > > > and
> > > > > > > mantissa and encode each separately, like the component encoding
> > in
> > > > your
> > > > > > > test. But to encode each stream, I used Parquet's RLE encoder
> > > > instead of
> > > > > > > just applying compression. This seemed to work well for
> > exponents and
> > > > > > sign
> > > > > > > bits, but probably isn't worth the cost for mantissa bits. It
> > could
> > > > also
> > > > > > be
> > > > > > > interesting to test a separate stream for sign bits.
> > > > > > >
> > > > > > > I guess what I'd like to hear your take on is whether you think
> > > > adding
> > > > > > > run-length encoding to any of the byte streams would be
> > beneficial
> > > > before
> > > > > > > applying Zstd.
> > > > > > >
> > > > > > > Thanks!
> > > > > > >
> > > > > > > rb
> > > > > > >
> > > > > > > On Tue, Sep 3, 2019 at 12:30 PM Wes McKinney <
> > wesmckinn@gmail.com
> > > > > > <ma...@gmail.com>> wrote:
> > > > > > >
> > > > > > > > I'm interested in this. I have been busy the last couple of
> > weeks
> > > > so
> > > > > > have
> > > > > > > > not been able to take a closer look. I will try to give some
> > > > feedback
> > > > > > this
> > > > > > > > week.
> > > > > > > >
> > > > > > > > Thanks
> > > > > > > >
> > > > > > > > On Tue, Sep 3, 2019, 2:17 PM Radev, Martin <
> > martin.radev@tum.de
> > > > > > <ma...@tum.de>> wrote:
> > > > > > > >
> > > > > > > > > Hello all,
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > thank you Julien for the interest.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Could other people who are part of Apache Parquet share
> > > > > > > > > their opinions?
> > > > > > > > >
> > > > > > > > > Do you have your own data which you would like to use for
> > testing
> > > > > > the new
> > > > > > > > > encoding?
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Regards,
> > > > > > > > >
> > > > > > > > > Martin
> > > > > > > > >
> > > > > > > > > ________________________________
> > > > > > > > > From: Julien Le Dem <ju...@wework.com.INVALID>
> > > > > > > > > Sent: Friday, August 30, 2019 2:38:37 AM
> > > > > > > > > To: dev@parquet.apache.org<ma...@parquet.apache.org>
> > > > > > > > > Cc: Raoofy, Amir; Karlstetter, Roman
> > > > > > > > > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache
> > > > Parquet
> > > > > > > > >
> > > > > > > > > This looks promising to me. At first glance it seems to
> > > > > > > > > combine simplicity and efficiency.
> > > > > > > > > I'd like to hear more from other members of the PMC.
> > > > > > > > >
> > > > > > > > > On Tue, Aug 27, 2019 at 5:30 AM Radev, Martin <
> > > > martin.radev@tum.de
> > > > > > <ma...@tum.de>>
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Dear all,
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > there was some earlier discussion on adding a new encoding
> > for
> > > > > > better
> > > > > > > > > > compression of FP32 and FP64 data.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > The pull request which extends the format is here:
> > > > > > > > > > https://github.com/apache/parquet-format/pull/144
> > > > > > > > > > The change already has one approval from Zoltan.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > The results from an investigation on compression ratio
> > > > > > > > > > and speed with the new encoding vs other encodings are
> > > > > > > > > > available here:
> > > > > > > > > > https://github.com/martinradev/arrow-fp-compression-bench
> > > > > > > > > > It is visible that for many tests the new encoding
> > > > > > > > > > performs better in compression ratio and in some cases in
> > > > > > > > > > speed. The improvements in compression speed come from the
> > > > > > > > > > fact that the new format can potentially lead to faster
> > > > > > > > > > parsing for some compressors like GZIP.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > An earlier report which examines other FP compressors
> > (fpzip,
> > > > spdp,
> > > > > > > > fpc,
> > > > > > > > > > zfp, sz) and new potential encodings is available here:
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> > https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view?usp=sharing
> > > > > > > > > > The report also covers lossy compression but the
> > > > > > > > > > BYTE_STREAM_SPLIT encoding focuses only on lossless
> > > > > > > > > > compression.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Can we have a vote?
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Regards,
> > > > > > > > > >
> > > > > > > > > > Martin
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Ryan Blue
> > > > > > > Software Engineer
> > > > > > > Netflix
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Ryan Blue
> > > > > > > Software Engineer
> > > > > > > Netflix
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Ryan Blue
> > > > > > > Software Engineer
> > > > > > > Netflix
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Ryan Blue
> > > > > Software Engineer
> > > > > Netflix
> > > >
> >

Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet

Posted by Gabor Szadovszky <ga...@apache.org>.
Hi Martin,

The vote succeeded. 4 +1 votes (3 bindings) and no -1 votes.
I'll submit your PR soon.

Cheers,
Gabor

On Fri, Nov 29, 2019 at 10:59 PM Radev, Martin <ma...@tum.de> wrote:

> Dear Parquet community,
>
> thank you for the votes and time spent reviewing.
>
> There have been so far 4 votes in the mailing list and Zoltan added a +1
> to the Github pull request some time ago.
> How many more would be necessary to have the spec change accepted?
> Once that is done I can continue with polishing my changes for Arrow and
> Parquet-MR.
>
> There are some new things.
> I did some further tests and implementation changes (no spec changes):
>
>   1.  I added more real data tests from various sources.
>   2.  I generated synthetic data with varying zero-order entropy of an
> element for each test file.
> This should help give some idea when different encoding-compressor options
> perform well.
>   3.  I investigated different possible implementations for the
> encoder/decoder and added the most efficient option locally into Arrow's
> parquet-cpp.
> This was necessary since I found out that my early scalar implementation
> in Arrow did not get auto-vectorized by the compiler and thus was somewhat
> inefficient especially if the data is present in the cache.
>
> The report is available here:
> https://github.com/martinradev/arrow-fp-compression-bench/blob/master/optimize_byte_stream_split/report_final.pdf.
> It mostly contains tables and plots with some context information and
> reasoning.
> The two SIMD implementations I wrote are available here in a standalone
> program for benchmarking and validating:
> https://github.com/martinradev/arrow-fp-compression-bench/blob/master/optimize_byte_stream_split/prog.cpp
>
> The short story is that the BYTE_STREAM_SPLIT encoding performs also the
> best among all other options for the newly added test cases.
> Furthermore, the plot for data with varying entropy shows that the
> BYTE_STREAM_SPLIT encoding with zstd provides the best compression ratio
> when the entropy per float element is >8 bits. The BYTE_STREAM_SPLIT
> encoding+zstd provides better write throughput than all other combinations.
> It also improves read throughput than to just using ZSTD to compress data.
> The report also shows the performance difference for different
> implementations of the encoder/decoder and compare it to memcpy. There's a
> clear performance advantage in using the custom vectorized version over
> what GCC generates.
>
> If you have any questions, let me know.
>
> Kind regards,
> Martin
>
>
> ________________________________
> From: Junjie Chen <ch...@gmail.com>
> Sent: Friday, November 8, 2019 2:56:48 AM
> To: dev@parquet.apache.org
> Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
>
> +1 from me to add BYTE_STREAM_SPLIT to parquet-format.
>
> Gabor Szadovszky <ga...@apache.org> 于2019年11月7日周四 下午6:07写道:
> >
> > +1 for adding BYTE_STREAM_SPLIT encoding to parquet-format.
> >
> > On Tue, Nov 5, 2019 at 11:22 PM Wes McKinney <we...@gmail.com>
> wrote:
> >
> > > +1 from me on adding the FP encoding
> > >
> > > On Sat, Nov 2, 2019 at 4:51 AM Radev, Martin <ma...@tum.de>
> wrote:
> > > >
> > > > Hello all,
> > > >
> > > >
> > > > thanks for the vote Ryan and to Wes for the feedback.
> > > >
> > > >
> > > > The concern with regards to adding more complex features in the
> Parquet
> > > spec is valid.
> > > >
> > > > However, the proposed encoding is very simple and I already have
> > > unpolished patches for both parquet-mr and arrow.
> > > >
> > > > In its design I purposely opted for something simple to guarantee 1)
> > > good compression speed and 2) ease of implementation.
> > > >
> > > >
> > > > On the topic of testing, I added four more test cases which were
> taken
> > > from here<https://sdrbench.github.io/>. I also added the size in MB of
> > > all test case and entropy per element.
> > > >
> > > > Having the entropy reported helps show that the encoding performs
> better
> > > than any other option for high-entropy data and not so well for
> low-entropy
> > > data.
> > > >
> > > >
> > > > I would be happy to receive some more feedback and votes.
> > > >
> > > >
> > > > Kind regards,
> > > >
> > > > Martin
> > > >
> > > > ________________________________
> > > > From: Ryan Blue <rb...@netflix.com.INVALID>
> > > > Sent: Friday, November 1, 2019 6:28 PM
> > > > To: Parquet Dev
> > > > Cc: Raoofy, Amir; Karlstetter, Roman
> > > > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
> > > >
> > > > I'm +1 for adding the definition of BYTE_STREAM_SPLIT to the format.
> > > Looks
> > > > like it is simple and performs well. We could use a good floating
> point
> > > > encoding.
> > > >
> > > > I don't think I agree that differences in features between the Java
> and
> > > CPP
> > > > implementations should hold back new work. It would be great to have
> more
> > > > testing and validation, as well as more thorough implementations.
> But I
> > > > don't think we shouldn't accept contributions like this because of
> those
> > > > concerns.
> > > >
> > > > On Fri, Nov 1, 2019 at 9:27 AM Wes McKinney <we...@gmail.com>
> wrote:
> > > >
> > > > > I have to say I'm struggling with piling more things into the
> Parquet
> > > > > specification when we already have a significant implementation
> > > > > shortfall in other areas. LZ4 is still not properly implemented for
> > > > > example, and then there is the question of the V2 encodings and
> data
> > > > > page formats.
> > > > >
> > > > > I'm generally in favor of adding more efficient storage of floating
> > > > > point data, but will it actually be implemented broadly? Parquet
> as a
> > > > > format already has become an "implementation swamp" where any two
> > > > > implementations may not be compatible with each other,
> particularly in
> > > > > consideration of more advanced features.
> > > > >
> > > > > For a single organization using a single implementation, having
> > > > > advanced features may be useful, so I see the benefits to users
> that
> > > > > tightly control what code and what settings to use.
> > > > >
> > > > > On Thu, Oct 31, 2019 at 3:51 AM Radev, Martin <martin.radev@tum.de
> >
> > > wrote:
> > > > > >
> > > > > > Dear all,
> > > > > >
> > > > > >
> > > > > > would there be any interest in reviewing the BYTE_STREAM_SPLIT
> > > encoding?
> > > > > >
> > > > > > Please feel free to contact me directly if you need help or would
> > > like
> > > > > to provide more test data.
> > > > > >
> > > > > >
> > > > > > Results for the encoding based on the implementation in Arrow are
> > > here:
> > > > > https://github.com/martinradev/arrow-fp-compression-bench
> > > > > > Patch to Arrow is here:
> > > > >
> > >
> https://github.com/martinradev/arrow/commit/10de1e0f8a513b742edddeb6ba0d553617b1aa49
> > > > > >
> > > > > >
> > > > > > The new encoding combined with a compressor performs better than
> any
> > > of
> > > > > the other alternatives for data where there is little change in the
> > > > > upper-most bytes of fp32 and fp64 values. My early experiments also
> > > show
> > > > > that this encoding+zstd performs better on average than any of the
> > > > > specialized floating-point lossless compressors like fpc, spdp,
> zfp.
> > > > > >
> > > > > >
> > > > > > Regards,
> > > > > >
> > > > > > Martin
> > > > > >
> > > > > > ________________________________
> > > > > > From: Radev, Martin <ma...@tum.de>
> > > > > > Sent: Thursday, October 10, 2019 2:34:15 PM
> > > > > > To: Parquet Dev
> > > > > > Cc: Raoofy, Amir; Karlstetter, Roman
> > > > > > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache
> Parquet
> > > > > >
> > > > > > Dear Ryan Blue and other Parquet developers,
> > > > > >
> > > > > > I tested Ryan's proposal for modifying the encoding.
> > > > > >
> > > > > > The short answer is that it doesn't perform well in my tests. The
> > > > > encoding, code and results can be viewed below.
> > > > > >
> > > > > >
> > > > > > The current implementation only handles 32-bit IEEE754 floats in
> the
> > > > > following way:
> > > > > >
> > > > > >   1.  For each block of 128 values, the min and max is computed
> for
> > > the
> > > > > exponent
> > > > > >   2.  The number of bits for the exponent RLE is computed as
> > > > > ceil(log2((max - min + 1))). The sign bit uses 1 bit.
> > > > > >   3.  The sign, exponent and 23 remaining mantissa bits are
> > > extracted.
> > > > > >   4.  One RLE encoder is used for the sign and one for the
> exponent.
> > > > > > A new RLE encoder for the exponent is created if the block
> requires
> > > less
> > > > > or more bits than the number of bits used for the current encoder.
> > > > > > The 23 mantissa bits are divided into three streams. (Not sure
> > > whether
> > > > > this is strictly a good idea).
> > > > > >   5.  Also, for each 128 values we need to store 2 bytes: the min
> > > value
> > > > > and the number of bits used by the RLE.
> > > > > >
> > > > > >
> > > > > > I did not implement a decoder and have not added unit tests to
> > > guarantee
> > > > > that the implementation is sound.
> > > > > >
> > > > > > Ryan, can you please review the implementation as to whether it
> > > > > corresponds to what you had in mind?
> > > > > > Check FPByteStreamSplitComponentRLEEncoder. Please do not focus
> on
> > > the
> > > > > code quality - this is only a quick hack.
> > > > > >
> > > > >
> > >
> https://github.com/martinradev/arrow/commit/7c1cc8b2a5ff9e6288c615c2acc44b1dce1bd773#diff-47fe879cb9baad6c633c55f0a34a09c3R979
> > > > > >
> > > > > > In my benchmark tests I compared the following scenarios:
> > > > > >
> > > > > >   *   dictionary
> > > > > >   *   plain encoding + ZSTD
> > > > > >   *   the StreamSplit+RLE implementation above
> > > > > >   *   the StreamSplit+RLE implementation above + ZSTD
> > > > > >   *   the original BYTE_STREAM_SPLIT proposal + ZSTD
> > > > > >
> > > > > > Size of parquet files, measured in MB, for each combination:
> > > > > >
> > > > > >         Plain   Plain+ZSTD      Dict    Split+RLE
> > >  Split+RLE+ZSTD
> > > > > StreamSplit+ZSTD
> > > > > > msg_bt.sp
> > > > > >         128     112     128     113     93      84
> > > > > > msg_lu.sp
> > > > > >         93      87      94      80      73      67
> > > > > > msg_sp.sp
> > > > > >         139     125     139     123     96      88
> > > > > > msg_sweep3d.sp
> > > > > >         60      19      60      47      40      13
> > > > > > num_brain.sp
> > > > > >         68      60      69      55      54      49
> > > > > > num_comet.sp
> > > > > >         52      45      50      46      42      38
> > > > > > num_control.sp
> > > > > >         77      71      77      70      69      63
> > > > > > num_plasma.sp
> > > > > >         17      2       8       16      8       1
> > > > > > obs_error.sp
> > > > > >         30      24      30      25      22      21
> > > > > > obs_info.sp
> > > > > >         10      8       10      8       7       7
> > > > > > obs_spitzer.sp
> > > > > >         95      83      95      99      82      72
> > > > > > obs_temp.sp
> > > > > >         20      18      20      18      18      16
> > > > > >
> > > > > >
> > > > > > I have not tested the encoding on any of the other test data we
> have
> > > > > since they contain also FP64 columns. I did not add support for
> FP64
> > > in the
> > > > > StreamSplit+RLE encoding to save on time and also because I do not
> > > expect
> > > > > much improvement.
> > > > > >
> > > > > >
> > > > > > From the results it can be seen that the proposed StreamSplit+RLE
> > > > > encoding does not improve results.
> > > > > >
> > > > > > Using RLE for the sign bits and exponent obfuscates the input
> and the
> > > > > compression step cannot fully take advantage of repetitions in the
> data
> > > > > since they were removed from the RLE step. Repetitive patterns are
> > > replaced
> > > > > by the RLE bits which likely do not compress very well.
> > > > > >
> > > > > > Also, GZIP/ZSTD handle repetitions in data better on average. For
> > > > > example, GZIP's Deflate algorithm can encode a long run length
> with 3
> > > > > bytes(not sure whether it can be less?) for the raw data + 3 bytes
> for
> > > the
> > > > > reference ( 8 bits + 15 bits + 2 bits ).
> > > > > >
> > > > > > Now, the RLE step might produce better results for long runs of
> the
> > > same
> > > > > value. However, compressors also handles more complex cases when we
> > > have a
> > > > > pattern in the data which doesn't necessary have a long run length.
> > > Also,
> > > > > compressors like GZIP/ZSTD often do entropy-coding-aware parsing (
> > > > > http://cbloomrants.blogspot.com/2008/10/10-10-08-7_10.html ,
> > > > cbloom rants: 10-10-08 : On LZ Optimal Parsing<
> > > http://cbloomrants.blogspot.com/2008/10/10-10-08-7_10.html>
> > > > cbloomrants.blogspot.com
> > > > 10-10-08 On LZ Optimal Parsing So the LZ-Huffman I've done for
> RAD/Oodle
> > > does some new stuff with optimal parsing. I want to write...
> > > >
> > > >
> > > >
> > > > >
> > >
> https://martinradev.github.io/jekyll/update/2019/05/29/writing-a-pe32-x86-exe-packer.html
> > > > > ).
> > > > > >
> > > > > > The proposed RLE encoding+ZSTD is slower than
> BYTE_STREAM_SPLIT+ZSTD.
> > > > > The reason is that BYTE_STREAM_SPLIT only does a scattered memcpy
> > > where as
> > > > > the new RLE encoding has to extract the components and run through
> each
> > > > > block of 128 values twice - once to compute min and max, and once
> to
> > > > > encode. There's also the overhead of using the Arrow RLEEncoder.
> > > > > >
> > > > > > Ryan and other folks, can you provide feedback? Does the
> > > implementation
> > > > > look reasonable to you?
> > > > > >
> > > > > > Can somebody please work with me on this new encoding? There has
> been
> > > > > some interest and some discussions but it hasn't been pushed likely
> > > due to
> > > > > work around the current release.
> > > > > > For a bit more discussions and results, please refer to:
> > > > > > Recent benchmark of Arrow implementation:
> > > > > https://github.com/martinradev/arrow-fp-compression-bench
> > > > > > Separate report:
> > > > >
> > >
> https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view?usp=sharing
> > > > [
> > >
> https://lh3.googleusercontent.com/0Nyayc6yUgH07IH12mpd3FJ8OZ7MX282uaxarQ0ffc5sJT_-hiMR5aw60Yg=w1200-h630-p
> > > ]<
> > >
> https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view?usp=sharing
> > > >
> > > >
> > > > report.pdf - Google Drive<
> > >
> https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view?usp=sharing
> > > >
> > > > drive.google.com
> > > >
> > > >
> > > >
> > > > > >
> > > > > > Regards,
> > > > > > Martin
> > > > > >
> > > > > >
> > > > > >
> > > > > > ________________________________
> > > > > > From: Ryan Blue <rb...@netflix.com>
> > > > > > Sent: Thursday, September 19, 2019 7:54 PM
> > > > > > To: Radev, Martin
> > > > > > Cc: Parquet Dev; Raoofy, Amir; Karlstetter, Roman
> > > > > > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache
> Parquet
> > > > > >
> > > > > > Sounds good, thanks for working on this!
> > > > > >
> > > > > > On Thu, Sep 19, 2019 at 6:10 AM Radev, Martin <
> martin.radev@tum.de
> > > > > <ma...@tum.de>> wrote:
> > > > > >
> > > > > > Hello Ryan,
> > > > > >
> > > > > >
> > > > > > we decided that it would be beneficial to try out your proposal.
> > > > > >
> > > > > >
> > > > > > I will look into it and provide measurements on the compression
> ratio
> > > > > and speed.
> > > > > >
> > > > > >
> > > > > > Regards,
> > > > > >
> > > > > > Martin
> > > > > >
> > > > > > ________________________________
> > > > > > From: Ryan Blue <rb...@netflix.com>>
> > > > > > Sent: Saturday, September 14, 2019 2:23:20 AM
> > > > > > To: Radev, Martin
> > > > > > Cc: Parquet Dev; Raoofy, Amir; Karlstetter, Roman
> > > > > > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache
> Parquet
> > > > > >
> > > > > > > Using RLE for the sign, exponents and the top-most mantissa
> bytes
> > > can
> > > > > help when data is repetitive and make it worse for other.
> > > > > >
> > > > > > I agree. But we use RLE in similar cases because we do tend to
> have
> > > runs
> > > > > of values, and values that fit in a fixed number of bits.
> Exponents and
> > > > > sign bits would probably fit this model extremely well most of the
> > > time if
> > > > > you have similar floating point values or sorted values. It would
> be
> > > really
> > > > > interesting to see how well this performs in comparison to the
> > > compression
> > > > > tests you've already done. For mantissa bits, I agree it wouldn't
> be
> > > worth
> > > > > encoding first.
> > > > > >
> > > > > > On Thu, Sep 12, 2019 at 2:56 AM Radev, Martin <
> martin.radev@tum.de
> > > > > <ma...@tum.de>> wrote:
> > > > > >
> > > > > > Hello Ryan, Wes and other parquet devs,
> > > > > >
> > > > > >
> > > > > > thanks for the response. I was away on vacation and that's why I
> am
> > > > > answering just now.
> > > > > >
> > > > > >
> > > > > > > whether you think adding run-length encoding to any of the byte
> > > > > streams would be beneficial before applying Zstd.
> > > > > > The short answer is "only for some cases but it will make it
> worse in
> > > > > both compression ratio and speed for other".
> > > > > >
> > > > > > Our initial investigation also separated the sign, exponent and
> > > mantissa
> > > > > into separate streams.
> > > > > >
> > > > > > The encoding was the following assuming 32-bit IEEE754:
> > > > > >
> > > > > > - stream of sign bits
> > > > > >
> > > > > > - stream of exponents bits. Conveniently the exponent for a
> 32-bit
> > > > > IEEE754 number is 8 bits.
> > > > > >
> > > > > > - separate the remaining 23 bits into four streams of 8, 8, 7
> bits.
> > > An
> > > > > extra zero bit is added to the block which has only seven bits.
> This
> > > was
> > > > > done since zstd, zlib, etc work at a byte granularity and we would
> want
> > > > > repetitions to happen at such.
> > > > > >
> > > > > > For 64-bit IEEE754 even more padding has to be added since the
> > > exponent
> > > > > is 11 bits and the mantissa is 52 bits. Thus, we have to add 5 more
> > > > > exponent bits and 4 more mantissa bits to keep repetitions at a
> byte
> > > > > granularity. My original report shows results for when the
> > > floating-point
> > > > > values are split at a component granularity. Report is here:
> > > > >
> https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view
> > > > > > Results are just slightly better in terms of compression ratio
> for
> > > some
> > > > > tests but compression and decompression speed is expectedly worse.
> The
> > > > > reason is that splitting a value is somewhat more complex. We need
> to
> > > keep
> > > > > a stream of bits for the signs, keep track of when a byte in the
> > > stream is
> > > > > exhausted, do bit manipulation to extract components, etc. This is
> > > also the
> > > > > reason why I preferred to go with the byte-wise decomposition of
> the
> > > > > values. It's faster and the compression ratio is just slightly
> worse
> > > for
> > > > > some of the tests.
> > > > > >
> > > > > >
> > > > > > Using RLE for the sign, exponents and the top-most mantissa
> bytes can
> > > > > help when data is repetitive and make it worse for other. I suppose
> > > using
> > > > > one of the compressors yields a better compression ratio on
> average.
> > > Also,
> > > > > this can again make encoding and decoding slower.
> > > > > >
> > > > > >
> > > > > > The design of the BYTE_STREAM_SPLIT encoding had in mind two
> things:
> > > > > >
> > > > > > - It would only make data more compressible and leave
> compression to
> > > the
> > > > > codec in use.
> > > > > >   This leaves the complexity to the codec and choice of
> > > > > speed/compression ratio to the user.
> > > > > >
> > > > > > - It should be fast.
> > > > > >   There's an extra compression step so preferably there's very
> little
> > > > > latency before it.
> > > > > >
> > > > > > @Wes, can you have a look?
> > > > > >
> > > > > > More opinions are welcome.
> > > > > >
> > > > > > If you have floating point data available, I would be very happy
> to
> > > > > examine whether this approach offers benefit for you.
> > > > > >
> > > > > >
> > > > > > Regards,
> > > > > >
> > > > > > Martin
> > > > > >
> > > > > > ________________________________
> > > > > > From: Ryan Blue <rb...@netflix.com.INVALID>
> > > > > > Sent: Tuesday, September 3, 2019 11:51:46 PM
> > > > > > To: Parquet Dev
> > > > > > Cc: Raoofy, Amir; Karlstetter, Roman
> > > > > > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache
> Parquet
> > > > > >
> > > > > > Hi Martin,
> > > > > >
> > > > > > Thanks for taking a look at this! I agree that the approach here
> > > looks
> > > > > > promising. We've had occasional requests for lossy floating point
> > > > > > compression in the past, so it would be good to add this.
> > > > > >
> > > > > > I did some work in this area a few years ago that is similar and
> I'd
> > > like
> > > > > > to hear what you think about that approach compared to this one.
> That
> > > > > work
> > > > > > was based on the same observation, that the main problem is the
> > > mantissa,
> > > > > > while exponents tend to compress well. What I did was take the
> > > exponent
> > > > > and
> > > > > > mantissa and encode each separately, like the component encoding
> in
> > > your
> > > > > > test. But to encode each stream, I used Parquet's RLE encoder
> > > instead of
> > > > > > just applying compression. This seemed to work well for
> exponents and
> > > > > sign
> > > > > > bits, but probably isn't worth the cost for mantissa bits. It
> could
> > > also
> > > > > be
> > > > > > interesting to test a separate stream for sign bits.
> > > > > >
> > > > > > I guess what I'd like to hear your take on is whether you think
> > > adding
> > > > > > run-length encoding to any of the byte streams would be
> beneficial
> > > before
> > > > > > applying Zstd.
> > > > > >
> > > > > > Thanks!
> > > > > >
> > > > > > rb
> > > > > >
> > > > > > On Tue, Sep 3, 2019 at 12:30 PM Wes McKinney <
> wesmckinn@gmail.com
> > > > > <ma...@gmail.com>> wrote:
> > > > > >
> > > > > > > I'm interested in this. I have been busy the last couple of
> weeks
> > > so
> > > > > have
> > > > > > > not been able to take a closer look. I will try to give some
> > > feedback
> > > > > this
> > > > > > > week.
> > > > > > >
> > > > > > > Thanks
> > > > > > >
> > > > > > > On Tue, Sep 3, 2019, 2:17 PM Radev, Martin <
> martin.radev@tum.de
> > > > > <ma...@tum.de>> wrote:
> > > > > > >
> > > > > > > > Hello all,
> > > > > > > >
> > > > > > > >
> > > > > > > > thank you Julien for the interest.
> > > > > > > >
> > > > > > > >
> > > > > > > > Could other people, part of Apache Parquet, share their
> opinions?
> > > > > > > >
> > > > > > > > Do you have your own data which you would like to use for
> testing
> > > > > the new
> > > > > > > > encoding?
> > > > > > > >
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > >
> > > > > > > > Martin
> > > > > > > >
> > > > > > > > ________________________________
> > > > > > > > From: Julien Le Dem <ju...@wework.com.INVALID>
> > > > > > > > Sent: Friday, August 30, 2019 2:38:37 AM
> > > > > > > > To: dev@parquet.apache.org<ma...@parquet.apache.org>
> > > > > > > > Cc: Raoofy, Amir; Karlstetter, Roman
> > > > > > > > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache
> > > Parquet
> > > > > > > >
> > > > > > > > I think this looks promising to me. At first glance it seems
> > > > > combining
> > > > > > > > simplicity and efficiency.
> > > > > > > > I'd like to hear more from other members of the PMC.
> > > > > > > >
> > > > > > > > On Tue, Aug 27, 2019 at 5:30 AM Radev, Martin <
> > > martin.radev@tum.de
> > > > > <ma...@tum.de>>
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Dear all,
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > there was some earlier discussion on adding a new encoding
> for
> > > > > better
> > > > > > > > > compression of FP32 and FP64 data.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > The pull request which extends the format is here:
> > > > > > > > > https://github.com/apache/parquet-format/pull/144
> > > > > > > > > The change has one approval from earlier from Zoltan.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > The results from an investigation on compression ratio and
> > > speed
> > > > > with
> > > > > > > the
> > > > > > > > > new encoding vs other encodings is available here:
> > > > > > > > > https://github.com/martinradev/arrow-fp-compression-bench
> > > > > > > > > It is visible that for many tests the new encoding performs
> > > better
> > > > > in
> > > > > > > > > compression ratio and in some cases in speed. The
> improvements
> > > in
> > > > > > > > > compression speed come from the fact that the new format
> can
> > > > > > > potentially
> > > > > > > > > lead to a faster parsing for some compressors like GZIP.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > An earlier report which examines other FP compressors
> (fpzip,
> > > spdp,
> > > > > > > fpc,
> > > > > > > > > zfp, sz) and new potential encodings is available here:
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > >
> > >
> https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view?usp=sharing
> > > > > > > > > The report also covers lossy compression but the
> > > BYTE_STREAM_SPLIT
> > > > > > > > > encoding only has the focus of lossless compression.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Can we have a vote?
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Regards,
> > > > > > > > >
> > > > > > > > > Martin
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Ryan Blue
> > > > > > Software Engineer
> > > > > > Netflix
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Ryan Blue
> > > > > > Software Engineer
> > > > > > Netflix
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Ryan Blue
> > > > > > Software Engineer
> > > > > > Netflix
> > > > >
> > > >
> > > >
> > > > --
> > > > Ryan Blue
> > > > Software Engineer
> > > > Netflix
> > >
>

Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet

Posted by "Radev, Martin" <ma...@tum.de>.
Dear Parquet community,

thank you for the votes and time spent reviewing.

So far there have been 4 votes on the mailing list, and Zoltan added a +1 to the GitHub pull request some time ago.
How many more would be necessary to have the spec change accepted?
Once that is done I can continue with polishing my changes for Arrow and Parquet-MR.

There are some new things.
I did some further tests and implementation changes (no spec changes):

  1.  I added more real data tests from various sources.
  2.  I generated synthetic data with varying zero-order entropy per element for each test file.
This should help give some idea of when different encoding-compressor combinations perform well. (One possible generator is sketched right after this list.)
  3.  I investigated different possible implementations for the encoder/decoder and added the most efficient option locally into Arrow's parquet-cpp.
This was necessary since I found out that my early scalar implementation in Arrow did not get auto-vectorized by the compiler and was therefore somewhat inefficient, especially when the data is already in the cache.
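
One possible generator for item 2 - purely illustrative, not necessarily the exact one used for the report - randomizes only the low N bits of a fixed 32-bit base pattern, so each element carries roughly N bits of zero-order entropy:

    #include <cstddef>
    #include <cstdint>
    #include <cstring>
    #include <random>
    #include <vector>

    // Each output value is uniform over 2^bits bit patterns, i.e. its
    // zero-order entropy is roughly `bits` bits per element.
    std::vector<float> GenerateWithEntropy(std::size_t n, int bits,
                                           std::uint32_t seed) {
      std::mt19937 rng(seed);
      const std::uint32_t base = 0x3F800000u;  // bit pattern of 1.0f
      const std::uint32_t mask =
          bits >= 32 ? 0xFFFFFFFFu : ((1u << bits) - 1u);
      std::vector<float> out(n);
      for (std::size_t i = 0; i < n; ++i) {
        const std::uint32_t word = (base & ~mask) | (rng() & mask);
        float v;
        std::memcpy(&v, &word, sizeof v);  // reinterpret the bits as a float
        out[i] = v;
      }
      return out;
    }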

The report is available here: https://github.com/martinradev/arrow-fp-compression-bench/blob/master/optimize_byte_stream_split/report_final.pdf. It mostly contains tables and plots with some context information and reasoning.
The two SIMD implementations I wrote are available here in a standalone program for benchmarking and validating: https://github.com/martinradev/arrow-fp-compression-bench/blob/master/optimize_byte_stream_split/prog.cpp

The short story is that the BYTE_STREAM_SPLIT encoding also performs best among all the tested options on the newly added test cases.
Furthermore, the plot for data with varying entropy shows that the BYTE_STREAM_SPLIT encoding with zstd provides the best compression ratio when the entropy per float element is >8 bits. BYTE_STREAM_SPLIT+zstd also provides better write throughput than all other combinations, and it improves read throughput compared to using ZSTD alone to compress the data.
The report also shows the performance difference between the encoder/decoder implementations and compares them to memcpy. There's a clear performance advantage in using the custom vectorized version over what GCC generates.
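
For reference, the transform itself is tiny: the encoder is a scattered memcpy into K byte streams, where K is the element width in bytes (4 for FP32, 8 for FP64), and the decoder is the inverse gather. A minimal scalar sketch (function names are illustrative, not the actual Arrow entry points):

    #include <cstddef>
    #include <cstdint>

    // Byte k of value i goes to stream k at position i; the K streams are
    // laid out one after another in the output buffer.
    void EncodeByteStreamSplit(const uint8_t* raw, size_t num_values,
                               size_t byte_width, uint8_t* out) {
      for (size_t i = 0; i < num_values; ++i)
        for (size_t k = 0; k < byte_width; ++k)
          out[k * num_values + i] = raw[i * byte_width + k];
    }

    // Decoding gathers the bytes back into their original interleaved order.
    void DecodeByteStreamSplit(const uint8_t* encoded, size_t num_values,
                               size_t byte_width, uint8_t* out) {
      for (size_t i = 0; i < num_values; ++i)
        for (size_t k = 0; k < byte_width; ++k)
          out[i * byte_width + k] = encoded[k * num_values + i];
    }

Loops of this shape are what the vectorized versions measured in the report replace with SIMD shuffles.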

If you have any questions, let me know.

Kind regards,
Martin


________________________________
From: Junjie Chen <ch...@gmail.com>
Sent: Friday, November 8, 2019 2:56:48 AM
To: dev@parquet.apache.org
Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet

+1 from me to add BYTE_STREAM_SPLIT to parquet-format.

Gabor Szadovszky <ga...@apache.org> wrote on Thursday, November 7, 2019 at 6:07 PM:
>
> +1 for adding BYTE_STREAM_SPLIT encoding to parquet-format.
>
> On Tue, Nov 5, 2019 at 11:22 PM Wes McKinney <we...@gmail.com> wrote:
>
> > +1 from me on adding the FP encoding
> >
> > On Sat, Nov 2, 2019 at 4:51 AM Radev, Martin <ma...@tum.de> wrote:
> > >
> > > Hello all,
> > >
> > >
> > > thanks for the vote Ryan and to Wes for the feedback.
> > >
> > >
> > > The concern with regards to adding more complex features in the Parquet
> > spec is valid.
> > >
> > > However, the proposed encoding is very simple and I already have
> > unpolished patches for both parquet-mr and arrow.
> > >
> > > In its design I purposely opted for something simple to guarantee 1)
> > good compression speed and 2) ease of implementation.
> > >
> > >
> > > On the topic of testing, I added four more test cases which were taken
> > from here<https://sdrbench.github.io/>. I also added the size in MB of
> > all test cases and the entropy per element.
> > >
> > > Having the entropy reported helps show that the encoding performs better
> > than any other option for high-entropy data and not so well for low-entropy
> > data.
> > >
> > >
> > > I would be happy to receive some more feedback and votes.
> > >
> > >
> > > Kind regards,
> > >
> > > Martin
> > >
> > > ________________________________
> > > From: Ryan Blue <rb...@netflix.com.INVALID>
> > > Sent: Friday, November 1, 2019 6:28 PM
> > > To: Parquet Dev
> > > Cc: Raoofy, Amir; Karlstetter, Roman
> > > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
> > >
> > > I'm +1 for adding the definition of BYTE_STREAM_SPLIT to the format.
> > Looks
> > > like it is simple and performs well. We could use a good floating point
> > > encoding.
> > >
> > > I don't think I agree that differences in features between the Java and
> > CPP
> > > implementations should hold back new work. It would be great to have more
> > > testing and validation, as well as more thorough implementations. But I
> > > don't think we should reject contributions like this because of those
> > > concerns.
> > >
> > > On Fri, Nov 1, 2019 at 9:27 AM Wes McKinney <we...@gmail.com> wrote:
> > >
> > > > I have to say I'm struggling with piling more things into the Parquet
> > > > specification when we already have a significant implementation
> > > > shortfall in other areas. LZ4 is still not properly implemented for
> > > > example, and then there is the question of the V2 encodings and data
> > > > page formats.
> > > >
> > > > I'm generally in favor of adding more efficient storage of floating
> > > > point data, but will it actually be implemented broadly? Parquet as a
> > > > format already has become an "implementation swamp" where any two
> > > > implementations may not be compatible with each other, particularly in
> > > > consideration of more advanced features.
> > > >
> > > > For a single organization using a single implementation, having
> > > > advanced features may be useful, so I see the benefits to users that
> > > > tightly control what code and what settings to use.
> > > >
> > > > On Thu, Oct 31, 2019 at 3:51 AM Radev, Martin <ma...@tum.de>
> > wrote:
> > > > >
> > > > > Dear all,
> > > > >
> > > > >
> > > > > would there be any interest in reviewing the BYTE_STREAM_SPLIT
> > encoding?
> > > > >
> > > > > Please feel free to contact me directly if you need help or would
> > like
> > > > to provide more test data.
> > > > >
> > > > >
> > > > > Results for the encoding based on the implementation in Arrow are
> > here:
> > > > https://github.com/martinradev/arrow-fp-compression-bench
> > > > > Patch to Arrow is here:
> > > >
> > https://github.com/martinradev/arrow/commit/10de1e0f8a513b742edddeb6ba0d553617b1aa49
> > > > >
> > > > >
> > > > > The new encoding combined with a compressor performs better than any
> > of
> > > > the other alternatives for data where there is little change in the
> > > > upper-most bytes of fp32 and fp64 values. My early experiments also
> > show
> > > > that this encoding+zstd performs better on average than any of the
> > > > specialized floating-point lossless compressors like fpc, spdp, zfp.
> > > > >
> > > > >
> > > > > Regards,
> > > > >
> > > > > Martin
> > > > >
> > > > > ________________________________
> > > > > From: Radev, Martin <ma...@tum.de>
> > > > > Sent: Thursday, October 10, 2019 2:34:15 PM
> > > > > To: Parquet Dev
> > > > > Cc: Raoofy, Amir; Karlstetter, Roman
> > > > > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
> > > > >
> > > > > Dear Ryan Blue and other Parquet developers,
> > > > >
> > > > > I tested Ryan's proposal for modifying the encoding.
> > > > >
> > > > > The short answer is that it doesn't perform well in my tests. The
> > > > encoding, code and results can be viewed below.
> > > > >
> > > > >
> > > > > The current implementation only handles 32-bit IEEE754 floats in the
> > > > following way:
> > > > >
> > > > >   1.  For each block of 128 values, the min and max of the exponent are computed.
> > > > >   2.  The number of bits for the exponent RLE is computed as ceil(log2(max - min + 1)); the sign bit uses 1 bit. (A sketch of steps 1 and 2 follows below.)
> > > > >   3.  The sign, exponent and 23 remaining mantissa bits are extracted.
> > > > >   4.  One RLE encoder is used for the sign and one for the exponent. A new RLE encoder for the exponent is created if a block requires fewer or more bits than the current encoder uses. The 23 mantissa bits are divided into three streams. (Not sure whether this is strictly a good idea.)
> > > > >   5.  Also, for each 128 values we need to store 2 bytes: the min exponent value and the number of bits used by the RLE.
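> > > > >
> > > > > A sketch of steps 1 and 2 for one block, assuming FP32 input (illustrative
> > > > > only - the function name is mine, not the actual patch):
> > > > >
> > > > >     #include <cstdint>
> > > > >     #include <cstring>
> > > > >
> > > > >     // Extract the 8-bit exponents of a block, track min/max, and return
> > > > >     // the RLE bit width ceil(log2(max - min + 1)).
> > > > >     int ExponentBlockBitWidth(const float* values, int n,
> > > > >                               uint8_t* exponents, uint8_t* min_exp) {
> > > > >       uint8_t lo = 255, hi = 0;
> > > > >       for (int i = 0; i < n; ++i) {
> > > > >         uint32_t bits;
> > > > >         std::memcpy(&bits, &values[i], sizeof bits);
> > > > >         exponents[i] = static_cast<uint8_t>((bits >> 23) & 0xFFu);
> > > > >         if (exponents[i] < lo) lo = exponents[i];
> > > > >         if (exponents[i] > hi) hi = exponents[i];
> > > > >       }
> > > > >       *min_exp = lo;
> > > > >       int width = 0;
> > > > >       while ((1 << width) < hi - lo + 1) ++width;  // ceil(log2(range))
> > > > >       return width;
> > > > >     }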
> > > > >
> > > > >
> > > > > I did not implement a decoder and have not added unit tests to
> > guarantee
> > > > that the implementation is sound.
> > > > >
> > > > > Ryan, can you please review the implementation as to whether it
> > > > corresponds to what you had in mind?
> > > > > Check FPByteStreamSplitComponentRLEEncoder. Please do not focus on
> > the
> > > > code quality - this is only a quick hack.
> > > > >
> > > >
> > https://github.com/martinradev/arrow/commit/7c1cc8b2a5ff9e6288c615c2acc44b1dce1bd773#diff-47fe879cb9baad6c633c55f0a34a09c3R979
> > > > >
> > > > > In my benchmark tests I compared the following scenarios:
> > > > >
> > > > >   *   dictionary
> > > > >   *   plain encoding + ZSTD
> > > > >   *   the StreamSplit+RLE implementation above
> > > > >   *   the StreamSplit+RLE implementation above + ZSTD
> > > > >   *   the original BYTE_STREAM_SPLIT proposal + ZSTD
> > > > >
> > > > > Size of parquet files, measured in MB, for each combination:
> > > > >
> > > > > File             Plain  Plain+ZSTD  Dict  Split+RLE  Split+RLE+ZSTD  StreamSplit+ZSTD
> > > > > msg_bt.sp          128         112   128        113              93                84
> > > > > msg_lu.sp           93          87    94         80              73                67
> > > > > msg_sp.sp          139         125   139        123              96                88
> > > > > msg_sweep3d.sp      60          19    60         47              40                13
> > > > > num_brain.sp        68          60    69         55              54                49
> > > > > num_comet.sp        52          45    50         46              42                38
> > > > > num_control.sp      77          71    77         70              69                63
> > > > > num_plasma.sp       17           2     8         16               8                 1
> > > > > obs_error.sp        30          24    30         25              22                21
> > > > > obs_info.sp         10           8    10          8               7                 7
> > > > > obs_spitzer.sp      95          83    95         99              82                72
> > > > > obs_temp.sp         20          18    20         18              18                16
> > > > >
> > > > >
> > > > > I have not tested the encoding on any of the other test data we have
> > > > since they contain also FP64 columns. I did not add support for FP64
> > in the
> > > > StreamSplit+RLE encoding to save on time and also because I do not
> > expect
> > > > much improvement.
> > > > >
> > > > >
> > > > > From the results it can be seen that the proposed StreamSplit+RLE
> > > > encoding does not improve results.
> > > > >
> > > > > Using RLE for the sign bits and exponent obfuscates the input: the
> > > > > compression step cannot fully take advantage of repetitions in the data
> > > > > since they were removed by the RLE step. Repetitive patterns are replaced
> > > > > by RLE output which likely does not compress very well.
> > > > >
> > > > > Also, GZIP/ZSTD handle repetitions in data better on average. For
> > > > > example, GZIP's Deflate algorithm can encode a long run with 3 bytes
> > > > > (not sure whether it can be less?) for the raw data + 3 bytes for the
> > > > > reference (8 bits + 15 bits + 2 bits).
> > > > >
> > > > > Now, the RLE step might produce better results for long runs of the same
> > > > > value. However, compressors also handle more complex cases where a
> > > > > pattern in the data doesn't necessarily have a long run length. Also,
> > > > > compressors like GZIP/ZSTD often do entropy-coding-aware parsing (
> > > > > http://cbloomrants.blogspot.com/2008/10/10-10-08-7_10.html ,
> > > > > https://martinradev.github.io/jekyll/update/2019/05/29/writing-a-pe32-x86-exe-packer.html ).
> > > > >
> > > > > The proposed RLE encoding+ZSTD is slower than BYTE_STREAM_SPLIT+ZSTD.
> > > > > The reason is that BYTE_STREAM_SPLIT only does a scattered memcpy whereas
> > > > > the new RLE encoding has to extract the components and run through each
> > > > > block of 128 values twice - once to compute min and max, and once to
> > > > > encode. There's also the overhead of using the Arrow RLEEncoder.
> > > > >
> > > > > Ryan and other folks, can you provide feedback? Does the
> > implementation
> > > > look reasonable to you?
> > > > >
> > > > > Can somebody please work with me on this new encoding? There has been
> > > > some interest and some discussions but it hasn't been pushed likely
> > due to
> > > > work around the current release.
> > > > > For a bit more discussions and results, please refer to:
> > > > > Recent benchmark of Arrow implementation:
> > > > https://github.com/martinradev/arrow-fp-compression-bench
> > > > > Separate report:
> > > >
> > https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view?usp=sharing
> > > > >
> > > > > Regards,
> > > > > Martin
> > > > >
> > > > >
> > > > >
> > > > > ________________________________
> > > > > From: Ryan Blue <rb...@netflix.com>
> > > > > Sent: Thursday, September 19, 2019 7:54 PM
> > > > > To: Radev, Martin
> > > > > Cc: Parquet Dev; Raoofy, Amir; Karlstetter, Roman
> > > > > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
> > > > >
> > > > > Sounds good, thanks for working on this!
> > > > >
> > > > > On Thu, Sep 19, 2019 at 6:10 AM Radev, Martin <martin.radev@tum.de
> > > > <ma...@tum.de>> wrote:
> > > > >
> > > > > Hello Ryan,
> > > > >
> > > > >
> > > > > we decided that it would be beneficial to try out your proposal.
> > > > >
> > > > >
> > > > > I will look into it and provide measurements on the compression ratio
> > > > and speed.
> > > > >
> > > > >
> > > > > Regards,
> > > > >
> > > > > Martin
> > > > >
> > > > > ________________________________
> > > > > From: Ryan Blue <rb...@netflix.com>>
> > > > > Sent: Saturday, September 14, 2019 2:23:20 AM
> > > > > To: Radev, Martin
> > > > > Cc: Parquet Dev; Raoofy, Amir; Karlstetter, Roman
> > > > > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
> > > > >
> > > > > > Using RLE for the sign, exponents and the top-most mantissa bytes
> > can
> > > > help when data is repetitive and make it worse for other.
> > > > >
> > > > > I agree. But we use RLE in similar cases because we do tend to have
> > runs
> > > > of values, and values that fit in a fixed number of bits. Exponents and
> > > > sign bits would probably fit this model extremely well most of the
> > time if
> > > > you have similar floating point values or sorted values. It would be
> > really
> > > > interesting to see how well this performs in comparison to the
> > compression
> > > > tests you've already done. For mantissa bits, I agree it wouldn't be
> > worth
> > > > encoding first.
> > > > >
> > > > > On Thu, Sep 12, 2019 at 2:56 AM Radev, Martin <martin.radev@tum.de
> > > > <ma...@tum.de>> wrote:
> > > > >
> > > > > Hello Ryan, Wes and other parquet devs,
> > > > >
> > > > >
> > > > > thanks for the response. I was away on vacation and that's why I am
> > > > answering just now.
> > > > >
> > > > >
> > > > > > whether you think adding run-length encoding to any of the byte
> > > > streams would be beneficial before applying Zstd.
> > > > > The short answer is "only for some cases but it will make it worse in
> > > > both compression ratio and speed for other".
> > > > >
> > > > > Our initial investigation also separated the sign, exponent and
> > mantissa
> > > > into separate streams.
> > > > >
> > > > > The encoding was the following assuming 32-bit IEEE754:
> > > > >
> > > > > - stream of sign bits
> > > > >
> > > > > - stream of exponents bits. Conveniently the exponent for a 32-bit
> > > > IEEE754 number is 8 bits.
> > > > >
> > > > > - separate the remaining 23 bits into three streams of 8, 8, and 7 bits. An
> > > > > extra zero bit is added to the block which has only seven bits. This was
> > > > > done since zstd, zlib, etc. work at a byte granularity and we would want
> > > > > repetitions to happen at that granularity. (A sketch of the split follows
> > > > > below.)
> > > > >
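> > > > > In code, splitting one FP32 value into these component streams looks
> > > > > roughly like this (a sketch with unpacked sign bits for brevity; the
> > > > > names are illustrative):
> > > > >
> > > > >     #include <cstddef>
> > > > >     #include <cstdint>
> > > > >     #include <cstring>
> > > > >     #include <vector>
> > > > >
> > > > >     struct Fp32ComponentStreams {
> > > > >       std::vector<uint8_t> sign;         // 1 bit per value, unpacked here
> > > > >       std::vector<uint8_t> exponent;     // 8 bits per value
> > > > >       std::vector<uint8_t> mantissa[3];  // 8 + 8 + 7 bits, low 7 padded
> > > > >     };
> > > > >
> > > > >     void SplitComponents(const float* v, size_t n, Fp32ComponentStreams* s) {
> > > > >       for (size_t i = 0; i < n; ++i) {
> > > > >         uint32_t b;
> > > > >         std::memcpy(&b, &v[i], sizeof b);
> > > > >         s->sign.push_back(static_cast<uint8_t>(b >> 31));
> > > > >         s->exponent.push_back(static_cast<uint8_t>((b >> 23) & 0xFFu));
> > > > >         s->mantissa[0].push_back(static_cast<uint8_t>((b >> 15) & 0xFFu));
> > > > >         s->mantissa[1].push_back(static_cast<uint8_t>((b >> 7) & 0xFFu));
> > > > >         s->mantissa[2].push_back(static_cast<uint8_t>(b & 0x7Fu));
> > > > >       }
> > > > >     }
> > > > >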
> > > > > For 64-bit IEEE754 even more padding has to be added since the
> > exponent
> > > > is 11 bits and the mantissa is 52 bits. Thus, we have to add 5 more
> > > > exponent bits and 4 more mantissa bits to keep repetitions at a byte
> > > > granularity. My original report shows results for when the
> > floating-point
> > > > values are split at a component granularity. Report is here:
> > > > https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view
> > > > > Results are just slightly better in terms of compression ratio for
> > some
> > > > tests but compression and decompression speed is expectedly worse. The
> > > > reason is that splitting a value is somewhat more complex. We need to
> > keep
> > > > a stream of bits for the signs, keep track of when a byte in the
> > stream is
> > > > exhausted, do bit manipulation to extract components, etc. This is
> > also the
> > > > reason why I preferred to go with the byte-wise decomposition of the
> > > > values. It's faster and the compression ratio is just slightly worse
> > for
> > > > some of the tests.
> > > > >
> > > > >
> > > > > Using RLE for the sign, exponents and the top-most mantissa bytes can
> > > > help when data is repetitive and make it worse for other. I suppose
> > using
> > > > one of the compressors yields a better compression ratio on average.
> > Also,
> > > > this can again make encoding and decoding slower.
> > > > >
> > > > >
> > > > > The design of the BYTE_STREAM_SPLIT encoding had in mind two things:
> > > > >
> > > > > - It would only make data more compressible and leave compression to
> > the
> > > > codec in use.
> > > > >   This leaves the complexity to the codec and choice of
> > > > speed/compression ratio to the user.
> > > > >
> > > > > - It should be fast.
> > > > >   There's an extra compression step so preferably there's very little
> > > > latency before it.
> > > > >
> > > > > @Wes, can you have a look?
> > > > >
> > > > > More opinions are welcome.
> > > > >
> > > > > If you have floating point data available, I would be very happy to
> > > > examine whether this approach offers benefit for you.
> > > > >
> > > > >
> > > > > Regards,
> > > > >
> > > > > Martin
> > > > >
> > > > > ________________________________
> > > > > From: Ryan Blue <rb...@netflix.com.INVALID>
> > > > > Sent: Tuesday, September 3, 2019 11:51:46 PM
> > > > > To: Parquet Dev
> > > > > Cc: Raoofy, Amir; Karlstetter, Roman
> > > > > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
> > > > >
> > > > > Hi Martin,
> > > > >
> > > > > Thanks for taking a look at this! I agree that the approach here
> > looks
> > > > > promising. We've had occasional requests for lossy floating point
> > > > > compression in the past, so it would be good to add this.
> > > > >
> > > > > I did some work in this area a few years ago that is similar and I'd
> > like
> > > > > to hear what you think about that approach compared to this one. That
> > > > work
> > > > > was based on the same observation, that the main problem is the
> > mantissa,
> > > > > while exponents tend to compress well. What I did was take the
> > exponent
> > > > and
> > > > > mantissa and encode each separately, like the component encoding in
> > your
> > > > > test. But to encode each stream, I used Parquet's RLE encoder
> > instead of
> > > > > just applying compression. This seemed to work well for exponents and
> > > > sign
> > > > > bits, but probably isn't worth the cost for mantissa bits. It could
> > also
> > > > be
> > > > > interesting to test a separate stream for sign bits.
> > > > >
> > > > > I guess what I'd like to hear your take on is whether you think
> > adding
> > > > > run-length encoding to any of the byte streams would be beneficial
> > before
> > > > > applying Zstd.
> > > > >
> > > > > Thanks!
> > > > >
> > > > > rb
> > > > >
> > > > > On Tue, Sep 3, 2019 at 12:30 PM Wes McKinney <wesmckinn@gmail.com
> > > > <ma...@gmail.com>> wrote:
> > > > >
> > > > > > I'm interested in this. I have been busy the last couple of weeks
> > so
> > > > have
> > > > > > not been able to take a closer look. I will try to give some
> > feedback
> > > > this
> > > > > > week.
> > > > > >
> > > > > > Thanks
> > > > > >
> > > > > > On Tue, Sep 3, 2019, 2:17 PM Radev, Martin <martin.radev@tum.de
> > > > <ma...@tum.de>> wrote:
> > > > > >
> > > > > > > Hello all,
> > > > > > >
> > > > > > >
> > > > > > > thank you Julien for the interest.
> > > > > > >
> > > > > > >
> > > > > > > Could other people, part of Apache Parquet, share their opinions?
> > > > > > >
> > > > > > > Do you have your own data which you would like to use for testing
> > > > the new
> > > > > > > encoding?
> > > > > > >
> > > > > > >
> > > > > > > Regards,
> > > > > > >
> > > > > > > Martin
> > > > > > >
> > > > > > > ________________________________
> > > > > > > From: Julien Le Dem <ju...@wework.com.INVALID>
> > > > > > > Sent: Friday, August 30, 2019 2:38:37 AM
> > > > > > > To: dev@parquet.apache.org<ma...@parquet.apache.org>
> > > > > > > Cc: Raoofy, Amir; Karlstetter, Roman
> > > > > > > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache
> > Parquet
> > > > > > >
> > > > > > > I think this looks promising to me. At first glance it seems
> > > > combining
> > > > > > > simplicity and efficiency.
> > > > > > > I'd like to hear more from other members of the PMC.
> > > > > > >
> > > > > > > On Tue, Aug 27, 2019 at 5:30 AM Radev, Martin <
> > martin.radev@tum.de
> > > > <ma...@tum.de>>
> > > > > > wrote:
> > > > > > >
> > > > > > > > Dear all,
> > > > > > > >
> > > > > > > >
> > > > > > > > there was some earlier discussion on adding a new encoding for
> > > > better
> > > > > > > > compression of FP32 and FP64 data.
> > > > > > > >
> > > > > > > >
> > > > > > > > The pull request which extends the format is here:
> > > > > > > > https://github.com/apache/parquet-format/pull/144
> > > > > > > > The change has one approval from earlier from Zoltan.
> > > > > > > >
> > > > > > > >
> > > > > > > > The results from an investigation on compression ratio and
> > speed
> > > > with
> > > > > > the
> > > > > > > > new encoding vs other encodings is available here:
> > > > > > > > https://github.com/martinradev/arrow-fp-compression-bench
> > > > > > > > It is visible that for many tests the new encoding performs
> > better
> > > > in
> > > > > > > > compression ratio and in some cases in speed. The improvements
> > in
> > > > > > > > compression speed come from the fact that the new format can
> > > > > > potentially
> > > > > > > > lead to a faster parsing for some compressors like GZIP.
> > > > > > > >
> > > > > > > >
> > > > > > > > An earlier report which examines other FP compressors (fpzip,
> > spdp,
> > > > > > fpc,
> > > > > > > > zfp, sz) and new potential encodings is available here:
> > > > > > > >
> > > > > > >
> > > > > >
> > > >
> > https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view?usp=sharing
> > > > > > > > The report also covers lossy compression but the
> > BYTE_STREAM_SPLIT
> > > > > > > > encoding only has the focus of lossless compression.
> > > > > > > >
> > > > > > > >
> > > > > > > > Can we have a vote?
> > > > > > > >
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > >
> > > > > > > > Martin
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Ryan Blue
> > > > > Software Engineer
> > > > > Netflix
> > > >
> > >
> > >
> > > --
> > > Ryan Blue
> > > Software Engineer
> > > Netflix
> >

Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet

Posted by Junjie Chen <ch...@gmail.com>.
+1 from me to add BYTE_STREAM_SPLIT to parquet-format.

Gabor Szadovszky <ga...@apache.org> wrote on Thursday, November 7, 2019 at 6:07 PM:
>
> +1 for adding BYTE_STREAM_SPLIT encoding to parquet-format.
>
> On Tue, Nov 5, 2019 at 11:22 PM Wes McKinney <we...@gmail.com> wrote:
>
> > +1 from me on adding the FP encoding
> >
> > On Sat, Nov 2, 2019 at 4:51 AM Radev, Martin <ma...@tum.de> wrote:
> > >
> > > Hello all,
> > >
> > >
> > > thanks for the vote Ryan and to Wes for the feedback.
> > >
> > >
> > > The concern with regards to adding more complex features in the Parquet
> > spec is valid.
> > >
> > > However, the proposed encoding is very simple and I already have
> > unpolished patches for both parquet-mr and arrow.
> > >
> > > In its design I purposely opted for something simple to guarantee 1)
> > good compression speed and 2) ease of implementation.
> > >
> > >
> > > On the topic of testing, I added four more test cases which were taken
> > from here<https://sdrbench.github.io/>. I also added the size in MB of
> > all test case and entropy per element.
> > >
> > > Having the entropy reported helps show that the encoding performs better
> > than any other option for high-entropy data and not so well for low-entropy
> > data.
> > >
> > >
> > > I would be happy to receive some more feedback and votes.
> > >
> > >
> > > Kind regards,
> > >
> > > Martin
> > >
> > > ________________________________
> > > From: Ryan Blue <rb...@netflix.com.INVALID>
> > > Sent: Friday, November 1, 2019 6:28 PM
> > > To: Parquet Dev
> > > Cc: Raoofy, Amir; Karlstetter, Roman
> > > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
> > >
> > > I'm +1 for adding the definition of BYTE_STREAM_SPLIT to the format.
> > Looks
> > > like it is simple and performs well. We could use a good floating point
> > > encoding.
> > >
> > > I don't think I agree that differences in features between the Java and
> > CPP
> > > implementations should hold back new work. It would be great to have more
> > > testing and validation, as well as more thorough implementations. But I
> > > don't think we shouldn't accept contributions like this because of those
> > > concerns.
> > >
> > > On Fri, Nov 1, 2019 at 9:27 AM Wes McKinney <we...@gmail.com> wrote:
> > >
> > > > I have to say I'm struggling with piling more things into the Parquet
> > > > specification when we already have a significant implementation
> > > > shortfall in other areas. LZ4 is still not properly implemented for
> > > > example, and then there is the question of the V2 encodings and data
> > > > page formats.
> > > >
> > > > I'm generally in favor of adding more efficient storage of floating
> > > > point data, but will it actually be implemented broadly? Parquet as a
> > > > format already has become an "implementation swamp" where any two
> > > > implementations may not be compatible with each other, particularly in
> > > > consideration of more advanced features.
> > > >
> > > > For a single organization using a single implementation, having
> > > > advanced features may be useful, so I see the benefits to users that
> > > > tightly control what code and what settings to use.
> > > >
> > > > On Thu, Oct 31, 2019 at 3:51 AM Radev, Martin <ma...@tum.de>
> > wrote:
> > > > >
> > > > > Dear all,
> > > > >
> > > > >
> > > > > would there be any interest in reviewing the BYTE_STREAM_SPLIT
> > encoding?
> > > > >
> > > > > Please feel free to contact me directly if you need help or would
> > like
> > > > to provide more test data.
> > > > >
> > > > >
> > > > > Results for the encoding based on the implementation in Arrow are
> > here:
> > > > https://github.com/martinradev/arrow-fp-compression-bench
> > > > > Patch to Arrow is here:
> > > >
> > https://github.com/martinradev/arrow/commit/10de1e0f8a513b742edddeb6ba0d553617b1aa49
> > > > >
> > > > >
> > > > > The new encoding combined with a compressor performs better than any
> > of
> > > > the other alternatives for data where there is little change in the
> > > > upper-most bytes of fp32 and fp64 values. My early experiments also
> > show
> > > > that this encoding+zstd performs better on average than any of the
> > > > specialized floating-point lossless compressors like fpc, spdp, zfp.
> > > > >
> > > > >
> > > > > Regards,
> > > > >
> > > > > Martin
> > > > >
> > > > > ________________________________
> > > > > From: Radev, Martin <ma...@tum.de>
> > > > > Sent: Thursday, October 10, 2019 2:34:15 PM
> > > > > To: Parquet Dev
> > > > > Cc: Raoofy, Amir; Karlstetter, Roman
> > > > > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
> > > > >
> > > > > Dear Ryan Blue and other Parquet developers,
> > > > >
> > > > > I tested Ryan's proposal for modifying the encoding.
> > > > >
> > > > > The short answer is that it doesn't perform well in my tests. The
> > > > encoding, code and results can be viewed below.
> > > > >
> > > > >
> > > > > The current implementation only handles 32-bit IEEE754 floats in the
> > > > following way:
> > > > >
> > > > >   1.  For each block of 128 values, the min and max is computed for
> > the
> > > > exponent
> > > > >   2.  The number of bits for the exponent RLE is computed as
> > > > ceil(log2((max - min + 1))). The sign bit uses 1 bit.
> > > > >   3.  The sign, exponent and 23 remaining mantissa bits are
> > extracted.
> > > > >   4.  One RLE encoder is used for the sign and one for the exponent.
> > > > > A new RLE encoder for the exponent is created if the block requires
> > less
> > > > or more bits than the number of bits used for the current encoder.
> > > > > The 23 mantissa bits are divided into three streams. (Not sure
> > whether
> > > > this is strictly a good idea).
> > > > >   5.  Also, for each 128 values we need to store 2 bytes: the min
> > value
> > > > and the number of bits used by the RLE.
> > > > >
> > > > >
> > > > > I did not implement a decoder and have not added unit tests to
> > guarantee
> > > > that the implementation is sound.
> > > > >
> > > > > Ryan, can you please review the implementation as to whether it
> > > > corresponds to what you had in mind?
> > > > > Check FPByteStreamSplitComponentRLEEncoder. Please do not focus on
> > the
> > > > code quality - this is only a quick hack.
> > > > >
> > > >
> > https://github.com/martinradev/arrow/commit/7c1cc8b2a5ff9e6288c615c2acc44b1dce1bd773#diff-47fe879cb9baad6c633c55f0a34a09c3R979
> > > > >
> > > > > In my benchmark tests I compared the following scenarios:
> > > > >
> > > > >   *   dictionary
> > > > >   *   plain encoding + ZSTD
> > > > >   *   the StreamSplit+RLE implementation above
> > > > >   *   the StreamSplit+RLE implementation above + ZSTD
> > > > >   *   the original BYTE_STREAM_SPLIT proposal + ZSTD
> > > > >
> > > > > Size of parquet files, measured in MB, for each combination:
> > > > >
> > > > >         Plain   Plain+ZSTD      Dict    Split+RLE
> >  Split+RLE+ZSTD
> > > > StreamSplit+ZSTD
> > > > > msg_bt.sp
> > > > >         128     112     128     113     93      84
> > > > > msg_lu.sp
> > > > >         93      87      94      80      73      67
> > > > > msg_sp.sp
> > > > >         139     125     139     123     96      88
> > > > > msg_sweep3d.sp
> > > > >         60      19      60      47      40      13
> > > > > num_brain.sp
> > > > >         68      60      69      55      54      49
> > > > > num_comet.sp
> > > > >         52      45      50      46      42      38
> > > > > num_control.sp
> > > > >         77      71      77      70      69      63
> > > > > num_plasma.sp
> > > > >         17      2       8       16      8       1
> > > > > obs_error.sp
> > > > >         30      24      30      25      22      21
> > > > > obs_info.sp
> > > > >         10      8       10      8       7       7
> > > > > obs_spitzer.sp
> > > > >         95      83      95      99      82      72
> > > > > obs_temp.sp
> > > > >         20      18      20      18      18      16
> > > > >
> > > > >
> > > > > I have not tested the encoding on any of the other test data we have
> > > > since they contain also FP64 columns. I did not add support for FP64
> > in the
> > > > StreamSplit+RLE encoding to save on time and also because I do not
> > expect
> > > > much improvement.
> > > > >
> > > > >
> > > > > From the results it can be seen that the proposed StreamSplit+RLE
> > > > encoding does not improve results.
> > > > >
> > > > > Using RLE for the sign bits and exponent obfuscates the input and the
> > > > compression step cannot fully take advantage of repetitions in the data
> > > > since they were removed from the RLE step. Repetitive patterns are
> > replaced
> > > > by the RLE bits which likely do not compress very well.
> > > > >
> > > > > Also, GZIP/ZSTD handle repetitions in data better on average. For
> > > > example, GZIP's Deflate algorithm can encode a long run length with 3
> > > > bytes(not sure whether it can be less?) for the raw data + 3 bytes for
> > the
> > > > reference ( 8 bits + 15 bits + 2 bits ).
> > > > >
> > > > > Now, the RLE step might produce better results for long runs of the
> > same
> > > > value. However, compressors also handles more complex cases when we
> > have a
> > > > pattern in the data which doesn't necessary have a long run length.
> > Also,
> > > > compressors like GZIP/ZSTD often do entropy-coding-aware parsing (
> > > > http://cbloomrants.blogspot.com/2008/10/10-10-08-7_10.html ,
> > > cbloom rants: 10-10-08 : On LZ Optimal Parsing<
> > http://cbloomrants.blogspot.com/2008/10/10-10-08-7_10.html>
> > > cbloomrants.blogspot.com
> > > 10-10-08 On LZ Optimal Parsing So the LZ-Huffman I've done for RAD/Oodle
> > does some new stuff with optimal parsing. I want to write...
> > >
> > >
> > >
> > > >
> > https://martinradev.github.io/jekyll/update/2019/05/29/writing-a-pe32-x86-exe-packer.html
> > > > ).
> > > > >
> > > > > The proposed RLE encoding+ZSTD is slower than BYTE_STREAM_SPLIT+ZSTD.
> > > > The reason is that BYTE_STREAM_SPLIT only does a scattered memcpy
> > where as
> > > > the new RLE encoding has to extract the components and run through each
> > > > block of 128 values twice - once to compute min and max, and once to
> > > > encode. There's also the overhead of using the Arrow RLEEncoder.
> > > > >
> > > > > Ryan and other folks, can you provide feedback? Does the
> > implementation
> > > > look reasonable to you?
> > > > >
> > > > > Can somebody please work with me on this new encoding? There has been
> > > > some interest and some discussions but it hasn't been pushed likely
> > due to
> > > > work around the current release.
> > > > > For a bit more discussions and results, please refer to:
> > > > > Recent benchmark of Arrow implementation:
> > > > https://github.com/martinradev/arrow-fp-compression-bench
> > > > > Separate report:
> > > >
> > https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view?usp=sharing
> > > [
> > https://lh3.googleusercontent.com/0Nyayc6yUgH07IH12mpd3FJ8OZ7MX282uaxarQ0ffc5sJT_-hiMR5aw60Yg=w1200-h630-p
> > ]<
> > https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view?usp=sharing
> > >
> > >
> > > report.pdf - Google Drive<
> > https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view?usp=sharing
> > >
> > > drive.google.com
> > >
> > >
> > >
> > > > >
> > > > > Regards,
> > > > > Martin
> > > > >
> > > > >
> > > > >
> > > > > ________________________________
> > > > > From: Ryan Blue <rb...@netflix.com>
> > > > > Sent: Thursday, September 19, 2019 7:54 PM
> > > > > To: Radev, Martin
> > > > > Cc: Parquet Dev; Raoofy, Amir; Karlstetter, Roman
> > > > > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
> > > > >
> > > > > Sounds good, thanks for working on this!
> > > > >
> > > > > On Thu, Sep 19, 2019 at 6:10 AM Radev, Martin <martin.radev@tum.de
> > > > <ma...@tum.de>> wrote:
> > > > >
> > > > > Hello Ryan,
> > > > >
> > > > >
> > > > > we decided that it would be beneficial to try out your proposal.
> > > > >
> > > > >
> > > > > I will look into it and provide measurements on the compression ratio
> > > > and speed.
> > > > >
> > > > >
> > > > > Regards,
> > > > >
> > > > > Martin
> > > > >
> > > > > ________________________________
> > > > > From: Ryan Blue <rb...@netflix.com>>
> > > > > Sent: Saturday, September 14, 2019 2:23:20 AM
> > > > > To: Radev, Martin
> > > > > Cc: Parquet Dev; Raoofy, Amir; Karlstetter, Roman
> > > > > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
> > > > >
> > > > > > Using RLE for the sign, exponents and the top-most mantissa bytes
> > can
> > > > help when data is repetitive and make it worse for other.
> > > > >
> > > > > I agree. But we use RLE in similar cases because we do tend to have
> > runs
> > > > of values, and values that fit in a fixed number of bits. Exponents and
> > > > sign bits would probably fit this model extremely well most of the
> > time if
> > > > you have similar floating point values or sorted values. It would be
> > really
> > > > interesting to see how well this performs in comparison to the
> > compression
> > > > tests you've already done. For mantissa bits, I agree it wouldn't be
> > worth
> > > > encoding first.
> > > > >
> > > > > On Thu, Sep 12, 2019 at 2:56 AM Radev, Martin <martin.radev@tum.de
> > > > <ma...@tum.de>> wrote:
> > > > >
> > > > > Hello Ryan, Wes and other parquet devs,
> > > > >
> > > > >
> > > > > thanks for the response. I was away on vacation and that's why I am
> > > > answering just now.
> > > > >
> > > > >
> > > > > > whether you think adding run-length encoding to any of the byte
> > > > streams would be beneficial before applying Zstd.
> > > > > The short answer is "only for some cases but it will make it worse in
> > > > both compression ratio and speed for other".
> > > > >
> > > > > Our initial investigation also separated the sign, exponent and
> > mantissa
> > > > into separate streams.
> > > > >
> > > > > The encoding was the following assuming 32-bit IEEE754:
> > > > >
> > > > > - stream of sign bits
> > > > >
> > > > > - stream of exponents bits. Conveniently the exponent for a 32-bit
> > > > IEEE754 number is 8 bits.
> > > > >
> > > > > - separate the remaining 23 bits into four streams of 8, 8, 7 bits.
> > An
> > > > extra zero bit is added to the block which has only seven bits. This
> > was
> > > > done since zstd, zlib, etc work at a byte granularity and we would want
> > > > repetitions to happen at such.
> > > > >
> > > > > For 64-bit IEEE754 even more padding has to be added since the
> > exponent
> > > > is 11 bits and the mantissa is 52 bits. Thus, we have to add 5 more
> > > > exponent bits and 4 more mantissa bits to keep repetitions at a byte
> > > > granularity. My original report shows results for when the
> > floating-point
> > > > values are split at a component granularity. Report is here:
> > > > https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view
> > > > > Results are just slightly better in terms of compression ratio for
> > some
> > > > tests but compression and decompression speed is expectedly worse. The
> > > > reason is that splitting a value is somewhat more complex. We need to
> > keep
> > > > a stream of bits for the signs, keep track of when a byte in the
> > stream is
> > > > exhausted, do bit manipulation to extract components, etc. This is
> > also the
> > > > reason why I preferred to go with the byte-wise decomposition of the
> > > > values. It's faster and the compression ratio is just slightly worse
> > for
> > > > some of the tests.
> > > > >
> > > > >
> > > > > Using RLE for the sign, exponents and the top-most mantissa bytes can
> > > > help when data is repetitive and make it worse for other. I suppose
> > using
> > > > one of the compressors yields a better compression ratio on average.
> > Also,
> > > > this can again make encoding and decoding slower.
> > > > >
> > > > >
> > > > > The design of the BYTE_STREAM_SPLIT encoding had in mind two things:
> > > > >
> > > > > - It would only make data more compressible and leave compression to
> > the
> > > > codec in use.
> > > > >   This leaves the complexity to the codec and choice of
> > > > speed/compression ratio to the user.
> > > > >
> > > > > - It should be fast.
> > > > >   There's an extra compression step so preferably there's very little
> > > > latency before it.
> > > > >
> > > > > @Wes, can you have a look?
> > > > >
> > > > > More opinions are welcome.
> > > > >
> > > > > If you have floating point data available, I would be very happy to
> > > > examine whether this approach offers benefit for you.
> > > > >
> > > > >
> > > > > Regards,
> > > > >
> > > > > Martin
> > > > >
> > > > > ________________________________
> > > > > From: Ryan Blue <rb...@netflix.com.INVALID>
> > > > > Sent: Tuesday, September 3, 2019 11:51:46 PM
> > > > > To: Parquet Dev
> > > > > Cc: Raoofy, Amir; Karlstetter, Roman
> > > > > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
> > > > >
> > > > > Hi Martin,
> > > > >
> > > > > Thanks for taking a look at this! I agree that the approach here
> > looks
> > > > > promising. We've had occasional requests for lossy floating point
> > > > > compression in the past, so it would be good to add this.
> > > > >
> > > > > I did some work in this area a few years ago that is similar and I'd
> > like
> > > > > to hear what you think about that approach compared to this one. That
> > > > work
> > > > > was based on the same observation, that the main problem is the
> > mantissa,
> > > > > while exponents tend to compress well. What I did was take the
> > exponent
> > > > and
> > > > > mantissa and encode each separately, like the component encoding in
> > your
> > > > > test. But to encode each stream, I used Parquet's RLE encoder
> > instead of
> > > > > just applying compression. This seemed to work well for exponents and
> > > > sign
> > > > > bits, but probably isn't worth the cost for mantissa bits. It could
> > also
> > > > be
> > > > > interesting to test a separate stream for sign bits.
> > > > >
> > > > > I guess what I'd like to hear your take on is whether you think
> > adding
> > > > > run-length encoding to any of the byte streams would be beneficial
> > before
> > > > > applying Zstd.
> > > > >
> > > > > Thanks!
> > > > >
> > > > > rb
> > > > >
> > > > > On Tue, Sep 3, 2019 at 12:30 PM Wes McKinney <wesmckinn@gmail.com
> > > > <ma...@gmail.com>> wrote:
> > > > >
> > > > > > I'm interested in this. I have been busy the last couple of weeks
> > so
> > > > have
> > > > > > not been able to take a closer look. I will try to give some
> > feedback
> > > > this
> > > > > > week.
> > > > > >
> > > > > > Thanks
> > > > > >
> > > > > > On Tue, Sep 3, 2019, 2:17 PM Radev, Martin <martin.radev@tum.de
> > > > <ma...@tum.de>> wrote:
> > > > > >
> > > > > > > Hello all,
> > > > > > >
> > > > > > >
> > > > > > > thank you Julien for the interest.
> > > > > > >
> > > > > > >
> > > > > > > Could other people, part of Apache Parquet, share their opinions?
> > > > > > >
> > > > > > > Do you have your own data which you would like to use for testing
> > > > the new
> > > > > > > encoding?
> > > > > > >
> > > > > > >
> > > > > > > Regards,
> > > > > > >
> > > > > > > Martin
> > > > > > >
> > > > > > > ________________________________
> > > > > > > From: Julien Le Dem <ju...@wework.com.INVALID>
> > > > > > > Sent: Friday, August 30, 2019 2:38:37 AM
> > > > > > > To: dev@parquet.apache.org<ma...@parquet.apache.org>
> > > > > > > Cc: Raoofy, Amir; Karlstetter, Roman
> > > > > > > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache
> > Parquet
> > > > > > >
> > > > > > > I think this looks promising to me. At first glance it seems
> > > > combining
> > > > > > > simplicity and efficiency.
> > > > > > > I'd like to hear more from other members of the PMC.
> > > > > > >
> > > > > > > On Tue, Aug 27, 2019 at 5:30 AM Radev, Martin <
> > martin.radev@tum.de
> > > > <ma...@tum.de>>
> > > > > > wrote:
> > > > > > >
> > > > > > > > Dear all,
> > > > > > > >
> > > > > > > >
> > > > > > > > there was some earlier discussion on adding a new encoding for
> > > > better
> > > > > > > > compression of FP32 and FP64 data.
> > > > > > > >
> > > > > > > >
> > > > > > > > The pull request which extends the format is here:
> > > > > > > > https://github.com/apache/parquet-format/pull/144
> > > > > > > > The change has one approval from earlier from Zoltan.
> > > > > > > >
> > > > > > > >
> > > > > > > > The results from an investigation on compression ratio and
> > speed
> > > > with
> > > > > > the
> > > > > > > > new encoding vs other encodings is available here:
> > > > > > > > https://github.com/martinradev/arrow-fp-compression-bench
> > > > > > > > It is visible that for many tests the new encoding performs
> > better
> > > > in
> > > > > > > > compression ratio and in some cases in speed. The improvements
> > in
> > > > > > > > compression speed come from the fact that the new format can
> > > > > > potentially
> > > > > > > > lead to a faster parsing for some compressors like GZIP.
> > > > > > > >
> > > > > > > >
> > > > > > > > An earlier report which examines other FP compressors (fpzip,
> > spdp,
> > > > > > fpc,
> > > > > > > > zfp, sz) and new potential encodings is available here:
> > > > > > > >
> > > > > > >
> > > > > >
> > > >
> > https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view?usp=sharing
> > > > > > > > The report also covers lossy compression but the
> > BYTE_STREAM_SPLIT
> > > > > > > > encoding only has the focus of lossless compression.
> > > > > > > >
> > > > > > > >
> > > > > > > > Can we have a vote?
> > > > > > > >
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > >
> > > > > > > > Martin
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Ryan Blue
> > > > > Software Engineer
> > > > > Netflix
> > > > >
> > > > >
> > > > > --
> > > > > Ryan Blue
> > > > > Software Engineer
> > > > > Netflix
> > > > >
> > > > >
> > > > > --
> > > > > Ryan Blue
> > > > > Software Engineer
> > > > > Netflix
> > > >
> > >
> > >
> > > --
> > > Ryan Blue
> > > Software Engineer
> > > Netflix
> >

Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet

Posted by Gabor Szadovszky <ga...@apache.org>.
+1 for adding BYTE_STREAM_SPLIT encoding to parquet-format.

On Tue, Nov 5, 2019 at 11:22 PM Wes McKinney <we...@gmail.com> wrote:

> +1 from me on adding the FP encoding
>
> On Sat, Nov 2, 2019 at 4:51 AM Radev, Martin <ma...@tum.de> wrote:
> >
> > Hello all,
> >
> >
> > thanks for the vote Ryan and to Wes for the feedback.
> >
> >
> > The concern with regards to adding more complex features in the Parquet
> spec is valid.
> >
> > However, the proposed encoding is very simple and I already have
> unpolished patches for both parquet-mr and arrow.
> >
> > In its design I purposely opted for something simple to guarantee 1)
> good compression speed and 2) ease of implementation.
> >
> >
> > On the topic of testing, I added four more test cases which were taken
> from here<https://sdrbench.github.io/>. I also added the size in MB of
> all test cases and the entropy per element.
> >
> > Having the entropy reported helps show that the encoding performs better
> than any other option for high-entropy data and not so well for low-entropy
> data.
> >
> >
> > I would be happy to receive some more feedback and votes.
> >
> >
> > Kind regards,
> >
> > Martin
> >
> > ________________________________
> > From: Ryan Blue <rb...@netflix.com.INVALID>
> > Sent: Friday, November 1, 2019 6:28 PM
> > To: Parquet Dev
> > Cc: Raoofy, Amir; Karlstetter, Roman
> > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
> >
> > I'm +1 for adding the definition of BYTE_STREAM_SPLIT to the format.
> Looks
> > like it is simple and performs well. We could use a good floating point
> > encoding.
> >
> > I don't think I agree that differences in features between the Java and
> CPP
> > implementations should hold back new work. It would be great to have more
> > testing and validation, as well as more thorough implementations. But I
> > don't think we should reject contributions like this because of those
> > concerns.
> >
> > On Fri, Nov 1, 2019 at 9:27 AM Wes McKinney <we...@gmail.com> wrote:
> >
> > > I have to say I'm struggling with piling more things into the Parquet
> > > specification when we already have a significant implementation
> > > shortfall in other areas. LZ4 is still not properly implemented for
> > > example, and then there is the question of the V2 encodings and data
> > > page formats.
> > >
> > > I'm generally in favor of adding more efficient storage of floating
> > > point data, but will it actually be implemented broadly? Parquet as a
> > > format already has become an "implementation swamp" where any two
> > > implementations may not be compatible with each other, particularly in
> > > consideration of more advanced features.
> > >
> > > For a single organization using a single implementation, having
> > > advanced features may be useful, so I see the benefits to users that
> > > tightly control what code and what settings to use.
> > >
> > > On Thu, Oct 31, 2019 at 3:51 AM Radev, Martin <ma...@tum.de>
> wrote:
> > > >
> > > > Dear all,
> > > >
> > > >
> > > > would there be any interest in reviewing the BYTE_STREAM_SPLIT
> encoding?
> > > >
> > > > Please feel free to contact me directly if you need help or would
> like
> > > to provide more test data.
> > > >
> > > >
> > > > Results for the encoding based on the implementation in Arrow are
> here:
> > > https://github.com/martinradev/arrow-fp-compression-bench
> > > > Patch to Arrow is here:
> > >
> https://github.com/martinradev/arrow/commit/10de1e0f8a513b742edddeb6ba0d553617b1aa49
> > > >
> > > >
> > > > The new encoding combined with a compressor performs better than any
> of
> > > the other alternatives for data where there is little change in the
> > > upper-most bytes of fp32 and fp64 values. My early experiments also
> show
> > > that this encoding+zstd performs better on average than any of the
> > > specialized floating-point lossless compressors like fpc, spdp, zfp.
> > > >
> > > >
> > > > Regards,
> > > >
> > > > Martin
> > > >
> > > > ________________________________
> > > > From: Radev, Martin <ma...@tum.de>
> > > > Sent: Thursday, October 10, 2019 2:34:15 PM
> > > > To: Parquet Dev
> > > > Cc: Raoofy, Amir; Karlstetter, Roman
> > > > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
> > > >
> > > > Dear Ryan Blue and other Parquet developers,
> > > >
> > > > I tested Ryan's proposal for modifying the encoding.
> > > >
> > > > The short answer is that it doesn't perform well in my tests. The
> > > encoding, code and results can be viewed below.
> > > >
> > > >
> > > > The current implementation only handles 32-bit IEEE754 floats in the
> > > following way:
> > > >
> > > >   1.  For each block of 128 values, the min and max of the exponent are computed.
> > > >   2.  The number of bits for the exponent RLE is computed as ceil(log2(max - min + 1)); the sign bit uses 1 bit.
> > > >   3.  The sign, exponent and 23 remaining mantissa bits are extracted.
> > > >   4.  One RLE encoder is used for the sign and one for the exponent. A new RLE encoder for the exponent is created if a block requires fewer or more bits than the current encoder uses. The 23 mantissa bits are divided into three streams. (Not sure whether this is strictly a good idea.)
> > > >   5.  Also, for each 128 values we need to store 2 bytes: the min exponent value and the number of bits used by the RLE.
> > > >
> > > >
> > > > I did not implement a decoder and have not added unit tests to
> guarantee
> > > that the implementation is sound.
> > > >
> > > > Ryan, can you please review the implementation as to whether it
> > > corresponds to what you had in mind?
> > > > Check FPByteStreamSplitComponentRLEEncoder. Please do not focus on
> the
> > > code quality - this is only a quick hack.
> > > >
> > >
> https://github.com/martinradev/arrow/commit/7c1cc8b2a5ff9e6288c615c2acc44b1dce1bd773#diff-47fe879cb9baad6c633c55f0a34a09c3R979
> > > >
> > > > In my benchmark tests I compared the following scenarios:
> > > >
> > > >   *   dictionary
> > > >   *   plain encoding + ZSTD
> > > >   *   the StreamSplit+RLE implementation above
> > > >   *   the StreamSplit+RLE implementation above + ZSTD
> > > >   *   the original BYTE_STREAM_SPLIT proposal + ZSTD
> > > >
> > > > Size of parquet files, measured in MB, for each combination:
> > > >
> > > > File             Plain  Plain+ZSTD  Dict  Split+RLE  Split+RLE+ZSTD  StreamSplit+ZSTD
> > > > msg_bt.sp          128         112   128        113              93                84
> > > > msg_lu.sp           93          87    94         80              73                67
> > > > msg_sp.sp          139         125   139        123              96                88
> > > > msg_sweep3d.sp      60          19    60         47              40                13
> > > > num_brain.sp        68          60    69         55              54                49
> > > > num_comet.sp        52          45    50         46              42                38
> > > > num_control.sp      77          71    77         70              69                63
> > > > num_plasma.sp       17           2     8         16               8                 1
> > > > obs_error.sp        30          24    30         25              22                21
> > > > obs_info.sp         10           8    10          8               7                 7
> > > > obs_spitzer.sp      95          83    95         99              82                72
> > > > obs_temp.sp         20          18    20         18              18                16
> > > >
> > > >
> > > > I have not tested the encoding on any of the other test data we have
> > > since they contain also FP64 columns. I did not add support for FP64
> in the
> > > StreamSplit+RLE encoding to save on time and also because I do not
> expect
> > > much improvement.
> > > >
> > > >
> > > > From the results it can be seen that the proposed StreamSplit+RLE
> > > encoding does not improve results.
> > > >
> > > > Using RLE for the sign bits and exponent obfuscates the input: the
> > > > compression step cannot fully take advantage of repetitions in the data
> > > > since they were removed by the RLE step. Repetitive patterns are replaced
> > > > by RLE output which likely does not compress very well.
> > > >
> > > > Also, GZIP/ZSTD handle repetitions in data better on average. For
> > > > example, GZIP's Deflate algorithm can encode a long run with 3 bytes
> > > > (not sure whether it can be less?) for the raw data + 3 bytes for the
> > > > reference (8 bits + 15 bits + 2 bits).
> > > >
> > > > Now, the RLE step might produce better results for long runs of the same
> > > > value. However, compressors also handle more complex cases where a
> > > > pattern in the data doesn't necessarily have a long run length. Also,
> > > > compressors like GZIP/ZSTD often do entropy-coding-aware parsing (
> > > > http://cbloomrants.blogspot.com/2008/10/10-10-08-7_10.html ,
> > > > https://martinradev.github.io/jekyll/update/2019/05/29/writing-a-pe32-x86-exe-packer.html ).
> > > >
> > > > The proposed RLE encoding+ZSTD is slower than BYTE_STREAM_SPLIT+ZSTD.
> > > > The reason is that BYTE_STREAM_SPLIT only does a scattered memcpy whereas
> > > > the new RLE encoding has to extract the components and run through each
> > > > block of 128 values twice - once to compute min and max, and once to
> > > > encode. There's also the overhead of using the Arrow RLEEncoder.
> > > >
> > > > Ryan and other folks, can you provide feedback? Does the
> implementation
> > > look reasonable to you?
> > > >
> > > > Can somebody please work with me on this new encoding? There has been
> > > > some interest and some discussion, but it hasn't been pushed forward,
> > > > likely due to work around the current release.
> > > > For a bit more discussion and results, please refer to:
> > > > Recent benchmark of Arrow implementation:
> > > https://github.com/martinradev/arrow-fp-compression-bench
> > > > Separate report:
> > >
> https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view?usp=sharing
> > > >
> > > > Regards,
> > > > Martin
> > > >
> > > >
> > > >
> > > > ________________________________
> > > > From: Ryan Blue <rb...@netflix.com>
> > > > Sent: Thursday, September 19, 2019 7:54 PM
> > > > To: Radev, Martin
> > > > Cc: Parquet Dev; Raoofy, Amir; Karlstetter, Roman
> > > > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
> > > >
> > > > Sounds good, thanks for working on this!
> > > >
> > > > On Thu, Sep 19, 2019 at 6:10 AM Radev, Martin <martin.radev@tum.de
> > > <ma...@tum.de>> wrote:
> > > >
> > > > Hello Ryan,
> > > >
> > > >
> > > > we decided that it would be beneficial to try out your proposal.
> > > >
> > > >
> > > > I will look into it and provide measurements on the compression ratio
> > > and speed.
> > > >
> > > >
> > > > Regards,
> > > >
> > > > Martin
> > > >
> > > > ________________________________
> > > > From: Ryan Blue <rb...@netflix.com>
> > > > Sent: Saturday, September 14, 2019 2:23:20 AM
> > > > To: Radev, Martin
> > > > Cc: Parquet Dev; Raoofy, Amir; Karlstetter, Roman
> > > > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
> > > >
> > > > > Using RLE for the sign, exponents and the top-most mantissa bytes
> > > > > can help when data is repetitive and make it worse for others.
> > > >
> > > > I agree. But we use RLE in similar cases because we do tend to have
> runs
> > > of values, and values that fit in a fixed number of bits. Exponents and
> > > sign bits would probably fit this model extremely well most of the
> time if
> > > you have similar floating point values or sorted values. It would be
> really
> > > interesting to see how well this performs in comparison to the
> compression
> > > tests you've already done. For mantissa bits, I agree it wouldn't be
> worth
> > > encoding first.
> > > >
> > > > On Thu, Sep 12, 2019 at 2:56 AM Radev, Martin <martin.radev@tum.de
> > > <ma...@tum.de>> wrote:
> > > >
> > > > Hello Ryan, Wes and other parquet devs,
> > > >
> > > >
> > > > thanks for the response. I was away on vacation and that's why I am
> > > answering just now.
> > > >
> > > >
> > > > > whether you think adding run-length encoding to any of the byte
> > > streams would be beneficial before applying Zstd.
> > > > The short answer is "only for some cases but it will make it worse in
> > > > both compression ratio and speed for others".
> > > >
> > > > Our initial investigation also split the sign, exponent and mantissa
> > > > into separate streams.
> > > >
> > > > The encoding was the following assuming 32-bit IEEE754:
> > > >
> > > > - stream of sign bits
> > > >
> > > > - stream of exponent bits. Conveniently, the exponent of a 32-bit
> > > > IEEE754 number is 8 bits.
> > > >
> > > > - separate the remaining 23 bits into three streams of 8, 8, and 7
> > > > bits. An extra zero bit is added to the stream which has only seven
> > > > bits. This was done since zstd, zlib, etc. work at a byte granularity
> > > > and we would want repetitions to occur at that granularity.
> > > >
> > > > For 64-bit IEEE754 even more padding has to be added since the
> exponent
> > > is 11 bits and the mantissa is 52 bits. Thus, we have to add 5 more
> > > exponent bits and 4 more mantissa bits to keep repetitions at a byte
> > > granularity. My original report shows results for when the
> floating-point
> > > values are split at a component granularity. Report is here:
> > > https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view
> > > > Results are just slightly better in terms of compression ratio for
> > > > some tests, but compression and decompression speeds are, as expected,
> > > > worse. The reason is that splitting a value is somewhat more complex.
> > > > We need to keep a stream of bits for the signs, keep track of when a
> > > > byte in the stream is exhausted, do bit manipulation to extract
> > > > components, etc. This is also the reason why I preferred to go with the
> > > > byte-wise decomposition of the values. It's faster and the compression
> > > > ratio is only slightly worse for some of the tests.
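For concreteness, a sketch of the per-value component extraction described
above (a hedged illustration under the same FP32 layout; not the benchmarked
code):

    #include <cstdint>
    #include <cstring>

    // Decompose one IEEE754 FP32 value into its 1 sign bit, 8 exponent bits
    // and 23 mantissa bits, mirroring the component streams described above.
    void SplitFloat32(float v, uint8_t* sign, uint8_t* exponent,
                      uint32_t* mantissa) {
      uint32_t bits;
      std::memcpy(&bits, &v, sizeof(bits));  // reinterpret without UB
      *sign = static_cast<uint8_t>(bits >> 31);
      *exponent = static_cast<uint8_t>((bits >> 23) & 0xFF);
      *mantissa = bits & 0x7FFFFFu;           // low 23 bits
    }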
> > > >
> > > >
> > > > Using RLE for the sign, exponents and the top-most mantissa bytes can
> > > > help when data is repetitive and make it worse for others. I suppose
> > > > using one of the compressors yields a better compression ratio on
> > > > average. Also, this can again make encoding and decoding slower.
> > > >
> > > >
> > > > The design of the BYTE_STREAM_SPLIT encoding had two things in mind:
> > > >
> > > > - It would only make data more compressible and leave compression to
> the
> > > codec in use.
> > > >   This leaves the complexity to the codec and choice of
> > > speed/compression ratio to the user.
> > > >
> > > > - It should be fast.
> > > >   There's an extra compression step so preferably there's very little
> > > latency before it.
> > > >
> > > > @Wes, can you have a look?
> > > >
> > > > More opinions are welcome.
> > > >
> > > > If you have floating point data available, I would be very happy to
> > > examine whether this approach offers benefit for you.
> > > >
> > > >
> > > > Regards,
> > > >
> > > > Martin
> > > >
> > > > ________________________________
> > > > From: Ryan Blue <rb...@netflix.com.INVALID>
> > > > Sent: Tuesday, September 3, 2019 11:51:46 PM
> > > > To: Parquet Dev
> > > > Cc: Raoofy, Amir; Karlstetter, Roman
> > > > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
> > > >
> > > > Hi Martin,
> > > >
> > > > Thanks for taking a look at this! I agree that the approach here
> looks
> > > > promising. We've had occasional requests for lossy floating point
> > > > compression in the past, so it would be good to add this.
> > > >
> > > > I did some work in this area a few years ago that is similar and I'd
> like
> > > > to hear what you think about that approach compared to this one. That
> > > work
> > > > was based on the same observation, that the main problem is the
> mantissa,
> > > > while exponents tend to compress well. What I did was take the
> exponent
> > > and
> > > > mantissa and encode each separately, like the component encoding in
> your
> > > > test. But to encode each stream, I used Parquet's RLE encoder
> instead of
> > > > just applying compression. This seemed to work well for exponents and
> > > sign
> > > > bits, but probably isn't worth the cost for mantissa bits. It could
> also
> > > be
> > > > interesting to test a separate stream for sign bits.
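For intuition, a simplified run-length encoder over a byte stream. Parquet's
actual RLE encoder is a hybrid RLE/bit-packing format, so this sketch only
illustrates why repetitive exponent and sign streams shrink under RLE:

    #include <cstdint>
    #include <utility>
    #include <vector>

    // Collapse a byte stream into (value, run length) pairs.
    std::vector<std::pair<uint8_t, uint32_t>> RunLengthEncode(
        const std::vector<uint8_t>& stream) {
      std::vector<std::pair<uint8_t, uint32_t>> runs;
      for (uint8_t b : stream) {
        if (!runs.empty() && runs.back().first == b) {
          ++runs.back().second;    // extend the current run
        } else {
          runs.push_back({b, 1});  // start a new run
        }
      }
      return runs;
    }

A sorted or slowly varying float column yields long runs in the exponent
stream, which this representation stores in a handful of pairs.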
> > > >
> > > > I guess what I'd like to hear your take on is whether you think
> adding
> > > > run-length encoding to any of the byte streams would be beneficial
> before
> > > > applying Zstd.
> > > >
> > > > Thanks!
> > > >
> > > > rb
> > > >
> > > > On Tue, Sep 3, 2019 at 12:30 PM Wes McKinney <wesmckinn@gmail.com
> > > <ma...@gmail.com>> wrote:
> > > >
> > > > > I'm interested in this. I have been busy the last couple of weeks
> so
> > > have
> > > > > not been able to take a closer look. I will try to give some
> feedback
> > > this
> > > > > week.
> > > > >
> > > > > Thanks
> > > > >
> > > > > On Tue, Sep 3, 2019, 2:17 PM Radev, Martin <martin.radev@tum.de
> > > <ma...@tum.de>> wrote:
> > > > >
> > > > > > Hello all,
> > > > > >
> > > > > >
> > > > > > thank you Julien for the interest.
> > > > > >
> > > > > >
> > > > > > Could other people, part of Apache Parquet, share their opinions?
> > > > > >
> > > > > > Do you have your own data which you would like to use for testing
> > > the new
> > > > > > encoding?
> > > > > >
> > > > > >
> > > > > > Regards,
> > > > > >
> > > > > > Martin
> > > > > >
> > > > > > ________________________________
> > > > > > From: Julien Le Dem <ju...@wework.com.INVALID>
> > > > > > Sent: Friday, August 30, 2019 2:38:37 AM
> > > > > > To: dev@parquet.apache.org<ma...@parquet.apache.org>
> > > > > > Cc: Raoofy, Amir; Karlstetter, Roman
> > > > > > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache
> Parquet
> > > > > >
> > > > > > I think this looks promising to me. At first glance it seems
> > > combining
> > > > > > simplicity and efficiency.
> > > > > > I'd like to hear more from other members of the PMC.
> > > > > >
> > > > > > On Tue, Aug 27, 2019 at 5:30 AM Radev, Martin <
> martin.radev@tum.de
> > > <ma...@tum.de>>
> > > > > wrote:
> > > > > >
> > > > > > > Dear all,
> > > > > > >
> > > > > > >
> > > > > > > there was some earlier discussion on adding a new encoding for
> > > better
> > > > > > > compression of FP32 and FP64 data.
> > > > > > >
> > > > > > >
> > > > > > > The pull request which extends the format is here:
> > > > > > > https://github.com/apache/parquet-format/pull/144
> > > > > > > The change has one approval from earlier from Zoltan.
> > > > > > >
> > > > > > >
> > > > > > > The results from an investigation on compression ratio and speed
> > > > > > > with the new encoding vs other encodings are available here:
> > > > > > > https://github.com/martinradev/arrow-fp-compression-bench
> > > > > > > For many tests, the new encoding performs better in compression
> > > > > > > ratio and, in some cases, in speed. The improvements in
> > > > > > > compression speed come from the fact that the new format can
> > > > > > > potentially lead to faster parsing for some compressors like GZIP.
> > > > > > >
> > > > > > >
> > > > > > > An earlier report which examines other FP compressors (fpzip,
> spdp,
> > > > > fpc,
> > > > > > > zfp, sz) and new potential encodings is available here:
> > > > > > >
> > > > > >
> > > > >
> > >
> https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view?usp=sharing
> > > > > > > The report also covers lossy compression, but the
> > > > > > > BYTE_STREAM_SPLIT encoding focuses only on lossless compression.
> > > > > > >
> > > > > > >
> > > > > > > Can we have a vote?
> > > > > > >
> > > > > > >
> > > > > > > Regards,
> > > > > > >
> > > > > > > Martin
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > Ryan Blue
> > > > Software Engineer
> > > > Netflix
> > > >
> > > >
> > > > --
> > > > Ryan Blue
> > > > Software Engineer
> > > > Netflix
> > > >
> > > >
> > > > --
> > > > Ryan Blue
> > > > Software Engineer
> > > > Netflix
> > >
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
>

Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet

Posted by Wes McKinney <we...@gmail.com>.
+1 from me on adding the FP encoding

On Sat, Nov 2, 2019 at 4:51 AM Radev, Martin <ma...@tum.de> wrote:
>
> Hello all,
>
>
> thanks for the vote, Ryan, and to Wes for the feedback.
>
>
> The concern with regard to adding more complex features to the Parquet spec is valid.
>
> However, the proposed encoding is very simple and I already have unpolished patches for both parquet-mr and arrow.
>
> In its design I purposely opted for something simple to guarantee 1) good compression speed and 2) ease of implementation.
>
>
> On the topic of testing, I added four more test cases which were taken from here<https://sdrbench.github.io/>. I also added the size in MB of all test cases and the entropy per element.
>
> Having the entropy reported helps show that the encoding performs better than any other option for high-entropy data and not so well for low-entropy data.
>
>
> I would be happy to receive some more feedback and votes.
>
>
> Kind regards,
>
> Martin
>
> ________________________________
> From: Ryan Blue <rb...@netflix.com.INVALID>
> Sent: Friday, November 1, 2019 6:28 PM
> To: Parquet Dev
> Cc: Raoofy, Amir; Karlstetter, Roman
> Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
>
> I'm +1 for adding the definition of BYTE_STREAM_SPLIT to the format. Looks
> like it is simple and performs well. We could use a good floating point
> encoding.
>
> I don't think I agree that differences in features between the Java and CPP
> implementations should hold back new work. It would be great to have more
> testing and validation, as well as more thorough implementations. But I
> don't think we should reject contributions like this because of those
> concerns.
>
> On Fri, Nov 1, 2019 at 9:27 AM Wes McKinney <we...@gmail.com> wrote:
>
> > I have to say I'm struggling with piling more things into the Parquet
> > specification when we already have a significant implementation
> > shortfall in other areas. LZ4 is still not properly implemented for
> > example, and then there is the question of the V2 encodings and data
> > page formats.
> >
> > I'm generally in favor of adding more efficient storage of floating
> > point data, but will it actually be implemented broadly? Parquet as a
> > format already has become an "implementation swamp" where any two
> > implementations may not be compatible with each other, particularly in
> > consideration of more advanced features.
> >
> > For a single organization using a single implementation, having
> > advanced features may be useful, so I see the benefits to users that
> > tightly control what code and what settings to use.
> >
> > On Thu, Oct 31, 2019 at 3:51 AM Radev, Martin <ma...@tum.de> wrote:
> > >
> > > Dear all,
> > >
> > >
> > > would there be any interest in reviewing the BYTE_STREAM_SPLIT encoding?
> > >
> > > Please feel free to contact me directly if you need help or would like
> > to provide more test data.
> > >
> > >
> > > Results for the encoding based on the implementation in Arrow are here:
> > https://github.com/martinradev/arrow-fp-compression-bench
> > > Patch to Arrow is here:
> > https://github.com/martinradev/arrow/commit/10de1e0f8a513b742edddeb6ba0d553617b1aa49
> > >
> > >
> > > The new encoding combined with a compressor performs better than any of
> > the other alternatives for data where there is little change in the
> > upper-most bytes of fp32 and fp64 values. My early experiments also show
> > that this encoding+zstd performs better on average than any of the
> > specialized floating-point lossless compressors like fpc, spdp, zfp.
> > >
> > >
> > > Regards,
> > >
> > > Martin
> > >
> > > ________________________________
> > > From: Radev, Martin <ma...@tum.de>
> > > Sent: Thursday, October 10, 2019 2:34:15 PM
> > > To: Parquet Dev
> > > Cc: Raoofy, Amir; Karlstetter, Roman
> > > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
> > >
> > > Dear Ryan Blue and other Parquet developers,
> > >
> > > I tested Ryan's proposal for modifying the encoding.
> > >
> > > The short answer is that it doesn't perform well in my tests. The
> > encoding, code and results can be viewed below.
> > >
> > >
> > > The current implementation only handles 32-bit IEEE754 floats in the
> > following way:
> > >
> > >   1.  For each block of 128 values, the min and max are computed for
> > >   the exponent.
> > >   2.  The number of bits for the exponent RLE is computed as
> > >   ceil(log2(max - min + 1)) (see the sketch below). The sign bit uses
> > >   1 bit.
> > >   3.  The sign, exponent and 23 remaining mantissa bits are extracted.
> > >   4.  One RLE encoder is used for the sign and one for the exponent.
> > >   A new RLE encoder for the exponent is created if the block requires
> > >   fewer or more bits than the number of bits used by the current encoder.
> > >   The 23 mantissa bits are divided into three streams. (Not sure whether
> > >   this is strictly a good idea.)
> > >   5.  Also, for each block of 128 values we need to store 2 bytes: the
> > >   min exponent and the number of bits used by the RLE.
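A minimal sketch of the per-block width computation from step 2 (illustrative
names; not taken from the actual patch):

    #include <algorithm>
    #include <cstdint>

    // Bit width needed to RLE-encode a 128-value block of exponents as
    // offsets from the block minimum: ceil(log2(max - min + 1)).
    int ExponentBitWidth(const uint8_t* exponents, int n) {
      const uint8_t lo = *std::min_element(exponents, exponents + n);
      const uint8_t hi = *std::max_element(exponents, exponents + n);
      const uint32_t range = static_cast<uint32_t>(hi - lo) + 1;
      int bits = 0;
      while ((1u << bits) < range) ++bits;  // smallest width covering the range
      return bits;  // 0 when every exponent in the block is equal
    }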
> > >
> > >
> > > I did not implement a decoder and have not added unit tests to guarantee
> > that the implementation is sound.
> > >
> > > Ryan, can you please review whether the implementation corresponds to
> > > what you had in mind?
> > > Check FPByteStreamSplitComponentRLEEncoder. Please do not focus on the
> > code quality - this is only a quick hack.
> > >
> > https://github.com/martinradev/arrow/commit/7c1cc8b2a5ff9e6288c615c2acc44b1dce1bd773#diff-47fe879cb9baad6c633c55f0a34a09c3R979
> > >
> > > In my benchmark tests I compared the following scenarios:
> > >
> > >   *   dictionary
> > >   *   plain encoding + ZSTD
> > >   *   the StreamSplit+RLE implementation above
> > >   *   the StreamSplit+RLE implementation above + ZSTD
> > >   *   the original BYTE_STREAM_SPLIT proposal + ZSTD
> > >
> > > Size of parquet files, measured in MB, for each combination:
> > >
> > >                  Plain  Plain+ZSTD  Dict  Split+RLE  Split+RLE+ZSTD  StreamSplit+ZSTD
> > > msg_bt.sp          128         112   128        113              93                84
> > > msg_lu.sp           93          87    94         80              73                67
> > > msg_sp.sp          139         125   139        123              96                88
> > > msg_sweep3d.sp      60          19    60         47              40                13
> > > num_brain.sp        68          60    69         55              54                49
> > > num_comet.sp        52          45    50         46              42                38
> > > num_control.sp      77          71    77         70              69                63
> > > num_plasma.sp       17           2     8         16               8                 1
> > > obs_error.sp        30          24    30         25              22                21
> > > obs_info.sp         10           8    10          8               7                 7
> > > obs_spitzer.sp      95          83    95         99              82                72
> > > obs_temp.sp         20          18    20         18              18                16
> > >
> > >
> > > I have not tested the encoding on any of the other test data we have
> > > since they also contain FP64 columns. I did not add support for FP64 in
> > > the StreamSplit+RLE encoding to save time and also because I do not
> > > expect much improvement.
> > >
> > >
> > > The results show that the proposed StreamSplit+RLE encoding does not
> > > improve on the alternatives.
> > >
> > Using RLE for the sign bits and exponent obfuscates the input, so the
> > compression step cannot fully take advantage of repetitions in the data
> > since they were removed by the RLE step. Repetitive patterns are replaced
> > by the RLE bits, which likely do not compress very well.
> > >
> > Also, GZIP/ZSTD handle repetitions in data better on average. For
> > example, GZIP's Deflate algorithm can encode a long run length with 3
> > bytes (not sure whether it can be less?) for the raw data + 3 bytes for
> > the reference (8 bits + 15 bits + 2 bits).
> > >
> > Now, the RLE step might produce better results for long runs of the same
> > value. However, compressors also handle more complex cases where we have a
> > pattern in the data which doesn't necessarily have a long run length.
> > Also, compressors like GZIP/ZSTD often do entropy-coding-aware parsing (
> > http://cbloomrants.blogspot.com/2008/10/10-10-08-7_10.html ,
> > https://martinradev.github.io/jekyll/update/2019/05/29/writing-a-pe32-x86-exe-packer.html
> > ).
> > >
> > > The proposed RLE encoding+ZSTD is slower than BYTE_STREAM_SPLIT+ZSTD.
> > The reason is that BYTE_STREAM_SPLIT only does a scattered memcpy, whereas
> > the new RLE encoding has to extract the components and run through each
> > block of 128 values twice - once to compute min and max, and once to
> > encode. There's also the overhead of using the Arrow RLEEncoder.
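For symmetry with the scattered-memcpy encode direction sketched earlier, a
decode sketch (again illustrative, not the Arrow patch):

    #include <cstddef>
    #include <cstdint>
    #include <cstring>
    #include <vector>

    // Gather byte k of value i back out of stream k; the exact inverse of
    // the scattered write done during encoding.
    std::vector<float> ByteStreamSplitDecode(const uint8_t* encoded, size_t n) {
      constexpr size_t kWidth = sizeof(float);
      std::vector<float> values(n);
      for (size_t i = 0; i < n; ++i) {
        uint8_t raw[kWidth];
        for (size_t k = 0; k < kWidth; ++k) {
          raw[k] = encoded[k * n + i];  // stream k holds the k-th bytes
        }
        std::memcpy(&values[i], raw, kWidth);
      }
      return values;
    }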
> > >
> > > Ryan and other folks, can you provide feedback? Does the implementation
> > look reasonable to you?
> > >
> > > Can somebody please work with me on this new encoding? There has been
> > > some interest and some discussion, but it hasn't been pushed forward,
> > > likely due to work around the current release.
> > > For a bit more discussion and results, please refer to:
> > > Recent benchmark of Arrow implementation:
> > https://github.com/martinradev/arrow-fp-compression-bench
> > > Separate report:
> > https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view?usp=sharing
> > >
> > > Regards,
> > > Martin
> > >
> > >
> > >
> > > ________________________________
> > > From: Ryan Blue <rb...@netflix.com>
> > > Sent: Thursday, September 19, 2019 7:54 PM
> > > To: Radev, Martin
> > > Cc: Parquet Dev; Raoofy, Amir; Karlstetter, Roman
> > > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
> > >
> > > Sounds good, thanks for working on this!
> > >
> > > On Thu, Sep 19, 2019 at 6:10 AM Radev, Martin <martin.radev@tum.de
> > <ma...@tum.de>> wrote:
> > >
> > > Hello Ryan,
> > >
> > >
> > > we decided that it would be beneficial to try out your proposal.
> > >
> > >
> > > I will look into it and provide measurements on the compression ratio
> > and speed.
> > >
> > >
> > > Regards,
> > >
> > > Martin
> > >
> > > ________________________________
> > > From: Ryan Blue <rb...@netflix.com>
> > > Sent: Saturday, September 14, 2019 2:23:20 AM
> > > To: Radev, Martin
> > > Cc: Parquet Dev; Raoofy, Amir; Karlstetter, Roman
> > > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
> > >
> > > > Using RLE for the sign, exponents and the top-most mantissa bytes can
> > > > help when data is repetitive and make it worse for others.
> > >
> > > I agree. But we use RLE in similar cases because we do tend to have runs
> > of values, and values that fit in a fixed number of bits. Exponents and
> > sign bits would probably fit this model extremely well most of the time if
> > you have similar floating point values or sorted values. It would be really
> > interesting to see how well this performs in comparison to the compression
> > tests you've already done. For mantissa bits, I agree it wouldn't be worth
> > encoding first.
> > >
> > > On Thu, Sep 12, 2019 at 2:56 AM Radev, Martin <martin.radev@tum.de
> > <ma...@tum.de>> wrote:
> > >
> > > Hello Ryan, Wes and other parquet devs,
> > >
> > >
> > > thanks for the response. I was away on vacation and that's why I am
> > answering just now.
> > >
> > >
> > > > whether you think adding run-length encoding to any of the byte
> > streams would be beneficial before applying Zstd.
> > > The short answer is "only for some cases but it will make it worse in
> > both compression ratio and speed for others".
> > >
> > > Our initial investigation also split the sign, exponent and mantissa
> > > into separate streams.
> > >
> > > The encoding was the following assuming 32-bit IEEE754:
> > >
> > > - stream of sign bits
> > >
> > > - stream of exponent bits. Conveniently, the exponent of a 32-bit
> > > IEEE754 number is 8 bits.
> > >
> > > - separate the remaining 23 bits into three streams of 8, 8, and 7 bits.
> > > An extra zero bit is added to the stream which has only seven bits. This
> > > was done since zstd, zlib, etc. work at a byte granularity and we would
> > > want repetitions to occur at that granularity.
> > >
> > > For 64-bit IEEE754 even more padding has to be added since the exponent
> > is 11 bits and the mantissa is 52 bits. Thus, we have to add 5 more
> > exponent bits and 4 more mantissa bits to keep repetitions at a byte
> > granularity. My original report shows results for when the floating-point
> > values are split at a component granularity. Report is here:
> > https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view
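A sketch of the FP64 analogue under the padding described above, with the
11-bit exponent widened to 16 bits to stay byte-aligned (illustrative):

    #include <cstdint>
    #include <cstring>

    // Decompose one IEEE754 FP64 value: 1 sign bit, 11 exponent bits
    // (stored in 16), and 52 mantissa bits.
    void SplitFloat64(double v, uint8_t* sign, uint16_t* exponent,
                      uint64_t* mantissa) {
      uint64_t bits;
      std::memcpy(&bits, &v, sizeof(bits));
      *sign = static_cast<uint8_t>(bits >> 63);
      *exponent = static_cast<uint16_t>((bits >> 52) & 0x7FF);
      *mantissa = bits & 0xFFFFFFFFFFFFFull;  // low 52 bits
    }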
> > > Results are just slightly better in terms of compression ratio for some
> > tests, but compression and decompression speeds are, as expected, worse. The
> > reason is that splitting a value is somewhat more complex. We need to keep
> > a stream of bits for the signs, keep track of when a byte in the stream is
> > exhausted, do bit manipulation to extract components, etc. This is also the
> > reason why I preferred to go with the byte-wise decomposition of the
> > values. It's faster and the compression ratio is just slightly worse for
> > some of the tests.
> > >
> > >
> > > Using RLE for the sign, exponents and the top-most mantissa bytes can
> > help when data is repetitive and make it worse for others. I suppose using
> > one of the compressors yields a better compression ratio on average. Also,
> > this can again make encoding and decoding slower.
> > >
> > >
> > > The design of the BYTE_STREAM_SPLIT encoding had two things in mind:
> > >
> > > - It would only make data more compressible and leave compression to the
> > codec in use.
> > >   This leaves the complexity to the codec and choice of
> > speed/compression ratio to the user.
> > >
> > > - It should be fast.
> > >   There's an extra compression step so preferably there's very little
> > latency before it.
> > >
> > > @Wes, can you have a look?
> > >
> > > More opinions are welcome.
> > >
> > > If you have floating point data available, I would be very happy to
> > examine whether this approach offers benefit for you.
> > >
> > >
> > > Regards,
> > >
> > > Martin
> > >
> > > ________________________________
> > > From: Ryan Blue <rb...@netflix.com.INVALID>
> > > Sent: Tuesday, September 3, 2019 11:51:46 PM
> > > To: Parquet Dev
> > > Cc: Raoofy, Amir; Karlstetter, Roman
> > > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
> > >
> > > Hi Martin,
> > >
> > > Thanks for taking a look at this! I agree that the approach here looks
> > > promising. We've had occasional requests for lossy floating point
> > > compression in the past, so it would be good to add this.
> > >
> > > I did some work in this area a few years ago that is similar and I'd like
> > > to hear what you think about that approach compared to this one. That
> > work
> > > was based on the same observation, that the main problem is the mantissa,
> > > while exponents tend to compress well. What I did was take the exponent
> > and
> > > mantissa and encode each separately, like the component encoding in your
> > > test. But to encode each stream, I used Parquet's RLE encoder instead of
> > > just applying compression. This seemed to work well for exponents and
> > sign
> > > bits, but probably isn't worth the cost for mantissa bits. It could also
> > be
> > > interesting to test a separate stream for sign bits.
> > >
> > > I guess what I'd like to hear your take on is whether you think adding
> > > run-length encoding to any of the byte streams would be beneficial before
> > > applying Zstd.
> > >
> > > Thanks!
> > >
> > > rb
> > >
> > > On Tue, Sep 3, 2019 at 12:30 PM Wes McKinney <wesmckinn@gmail.com
> > <ma...@gmail.com>> wrote:
> > >
> > > > I'm interested in this. I have been busy the last couple of weeks so
> > have
> > > > not been able to take a closer look. I will try to give some feedback
> > this
> > > > week.
> > > >
> > > > Thanks
> > > >
> > > > On Tue, Sep 3, 2019, 2:17 PM Radev, Martin <martin.radev@tum.de
> > <ma...@tum.de>> wrote:
> > > >
> > > > > Hello all,
> > > > >
> > > > >
> > > > > thank you Julien for the interest.
> > > > >
> > > > >
> > > > > Could other people, part of Apache Parquet, share their opinions?
> > > > >
> > > > > Do you have your own data which you would like to use for testing
> > the new
> > > > > encoding?
> > > > >
> > > > >
> > > > > Regards,
> > > > >
> > > > > Martin
> > > > >
> > > > > ________________________________
> > > > > From: Julien Le Dem <ju...@wework.com.INVALID>
> > > > > Sent: Friday, August 30, 2019 2:38:37 AM
> > > > > To: dev@parquet.apache.org<ma...@parquet.apache.org>
> > > > > Cc: Raoofy, Amir; Karlstetter, Roman
> > > > > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
> > > > >
> > > > > I think this looks promising to me. At first glance it seems
> > combining
> > > > > simplicity and efficiency.
> > > > > I'd like to hear more from other members of the PMC.
> > > > >
> > > > > On Tue, Aug 27, 2019 at 5:30 AM Radev, Martin <martin.radev@tum.de
> > <ma...@tum.de>>
> > > > wrote:
> > > > >
> > > > > > Dear all,
> > > > > >
> > > > > >
> > > > > > there was some earlier discussion on adding a new encoding for
> > better
> > > > > > compression of FP32 and FP64 data.
> > > > > >
> > > > > >
> > > > > > The pull request which extends the format is here:
> > > > > > https://github.com/apache/parquet-format/pull/144
> > > > > > The change has one approval from earlier from Zoltan.
> > > > > >
> > > > > >
> > > > > > The results from an investigation on compression ratio and speed
> > > > > > with the new encoding vs other encodings are available here:
> > > > > > https://github.com/martinradev/arrow-fp-compression-bench
> > > > > > For many tests, the new encoding performs better in compression
> > > > > > ratio and, in some cases, in speed. The improvements in
> > > > > > compression speed come from the fact that the new format can
> > > > > > potentially lead to faster parsing for some compressors like GZIP.
> > > > > >
> > > > > >
> > > > > > An earlier report which examines other FP compressors (fpzip, spdp,
> > > > fpc,
> > > > > > zfp, sz) and new potential encodings is available here:
> > > > > >
> > > > >
> > > >
> > https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view?usp=sharing
> > > > > > The report also covers lossy compression, but the BYTE_STREAM_SPLIT
> > > > > > encoding focuses only on lossless compression.
> > > > > >
> > > > > >
> > > > > > Can we have a vote?
> > > > > >
> > > > > >
> > > > > > Regards,
> > > > > >
> > > > > > Martin
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > >
> > > --
> > > Ryan Blue
> > > Software Engineer
> > > Netflix
> > >
> > >
> > > --
> > > Ryan Blue
> > > Software Engineer
> > > Netflix
> > >
> > >
> > > --
> > > Ryan Blue
> > > Software Engineer
> > > Netflix
> >
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix

Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet

Posted by "Radev, Martin" <ma...@tum.de>.
Hello all,


thanks for the vote, Ryan, and to Wes for the feedback.


The concern with regard to adding more complex features to the Parquet spec is valid.

However, the proposed encoding is very simple and I already have unpolished patches for both parquet-mr and arrow.

In its design I purposely opted for something simple to guarantee 1) good compression speed and 2) ease of implementation.


On the topic of testing, I added four more test cases which were taken from here<https://sdrbench.github.io/>. I also added the size in MB of all test cases and the entropy per element.

Having the entropy reported helps show that the encoding performs better than any other option for high-entropy data and not so well for low-entropy data.
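For reference, a sketch of how such an entropy figure can be computed. This
version computes zero-order entropy over byte values (bits per byte); the
per-element figure in the benchmark can be obtained analogously by counting
whole values instead of bytes (illustrative, not the benchmark's exact code):

    #include <array>
    #include <cmath>
    #include <cstddef>
    #include <cstdint>

    // Zero-order (Shannon) entropy of a buffer, in bits per byte.
    double ZeroOrderEntropy(const uint8_t* data, std::size_t n) {
      std::array<std::size_t, 256> counts{};
      for (std::size_t i = 0; i < n; ++i) counts[data[i]]++;
      double h = 0.0;
      for (std::size_t c : counts) {
        if (c == 0) continue;
        const double p = static_cast<double>(c) / n;
        h -= p * std::log2(p);  // accumulate -p * log2(p)
      }
      return h;
    }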


I would be happy to receive some more feedback and votes.


Kind regards,

Martin

________________________________
From: Ryan Blue <rb...@netflix.com.INVALID>
Sent: Friday, November 1, 2019 6:28 PM
To: Parquet Dev
Cc: Raoofy, Amir; Karlstetter, Roman
Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet

I'm +1 for adding the definition of BYTE_STREAM_SPLIT to the format. Looks
like it is simple and performs well. We could use a good floating point
encoding.

I don't think I agree that differences in features between the Java and CPP
implementations should hold back new work. It would be great to have more
testing and validation, as well as more thorough implementations. But I
don't think we should reject contributions like this because of those
concerns.

On Fri, Nov 1, 2019 at 9:27 AM Wes McKinney <we...@gmail.com> wrote:

> I have to say I'm struggling with piling more things into the Parquet
> specification when we already have a significant implementation
> shortfall in other areas. LZ4 is still not properly implemented for
> example, and then there is the question of the V2 encodings and data
> page formats.
>
> I'm generally in favor of adding more efficient storage of floating
> point data, but will it actually be implemented broadly? Parquet as a
> format already has become an "implementation swamp" where any two
> implementations may not be compatible with each other, particularly in
> consideration of more advanced features.
>
> For a single organization using a single implementation, having
> advanced features may be useful, so I see the benefits to users that
> tightly control what code and what settings to use.
>
> On Thu, Oct 31, 2019 at 3:51 AM Radev, Martin <ma...@tum.de> wrote:
> >
> > Dear all,
> >
> >
> > would there be any interest in reviewing the BYTE_STREAM_SPLIT encoding?
> >
> > Please feel free to contact me directly if you need help or would like
> to provide more test data.
> >
> >
> > Results for the encoding based on the implementation in Arrow are here:
> https://github.com/martinradev/arrow-fp-compression-bench
> > Patch to Arrow is here:
> https://github.com/martinradev/arrow/commit/10de1e0f8a513b742edddeb6ba0d553617b1aa49
> >
> >
> > The new encoding combined with a compressor performs better than any of
> the other alternatives for data where there is little change in the
> upper-most bytes of fp32 and fp64 values. My early experiments also show
> that this encoding+zstd performs better on average than any of the
> specialized floating-point lossless compressors like fpc, spdp, zfp.
> >
> >
> > Regards,
> >
> > Martin
> >
> > ________________________________
> > From: Radev, Martin <ma...@tum.de>
> > Sent: Thursday, October 10, 2019 2:34:15 PM
> > To: Parquet Dev
> > Cc: Raoofy, Amir; Karlstetter, Roman
> > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
> >
> > Dear Ryan Blue and other Parquet developers,
> >
> > I tested Ryan's proposal for modifying the encoding.
> >
> > The short answer is that it doesn't perform well in my tests. The
> encoding, code and results can be viewed below.
> >
> >
> > The current implementation only handles 32-bit IEEE754 floats in the
> following way:
> >
> >   1.  For each block of 128 values, the min and max are computed for the
> >   exponent.
> >   2.  The number of bits for the exponent RLE is computed as
> >   ceil(log2(max - min + 1)). The sign bit uses 1 bit.
> >   3.  The sign, exponent and 23 remaining mantissa bits are extracted.
> >   4.  One RLE encoder is used for the sign and one for the exponent.
> >   A new RLE encoder for the exponent is created if the block requires
> >   fewer or more bits than the number of bits used by the current encoder.
> >   The 23 mantissa bits are divided into three streams. (Not sure whether
> >   this is strictly a good idea.)
> >   5.  Also, for each block of 128 values we need to store 2 bytes: the min
> >   exponent and the number of bits used by the RLE.
> >
> >
> > I did not implement a decoder and have not added unit tests to guarantee
> that the implementation is sound.
> >
> > Ryan, can you please review whether the implementation corresponds to what
> > you had in mind?
> > Check FPByteStreamSplitComponentRLEEncoder. Please do not focus on the
> code quality - this is only a quick hack.
> >
> https://github.com/martinradev/arrow/commit/7c1cc8b2a5ff9e6288c615c2acc44b1dce1bd773#diff-47fe879cb9baad6c633c55f0a34a09c3R979
> >
> > In my benchmark tests I compared the following scenarios:
> >
> >   *   dictionary
> >   *   plain encoding + ZSTD
> >   *   the StreamSplit+RLE implementation above
> >   *   the StreamSplit+RLE implementation above + ZSTD
> >   *   the original BYTE_STREAM_SPLIT proposal + ZSTD
> >
> > Size of parquet files, measured in MB, for each combination:
> >
> >                  Plain  Plain+ZSTD  Dict  Split+RLE  Split+RLE+ZSTD  StreamSplit+ZSTD
> > msg_bt.sp          128         112   128        113              93                84
> > msg_lu.sp           93          87    94         80              73                67
> > msg_sp.sp          139         125   139        123              96                88
> > msg_sweep3d.sp      60          19    60         47              40                13
> > num_brain.sp        68          60    69         55              54                49
> > num_comet.sp        52          45    50         46              42                38
> > num_control.sp      77          71    77         70              69                63
> > num_plasma.sp       17           2     8         16               8                 1
> > obs_error.sp        30          24    30         25              22                21
> > obs_info.sp         10           8    10          8               7                 7
> > obs_spitzer.sp      95          83    95         99              82                72
> > obs_temp.sp         20          18    20         18              18                16
> >
> >
> > I have not tested the encoding on any of the other test data we have
> > since they also contain FP64 columns. I did not add support for FP64 in
> > the StreamSplit+RLE encoding to save time and also because I do not expect
> > much improvement.
> >
> >
> > The results show that the proposed StreamSplit+RLE encoding does not
> > improve on the alternatives.
> >
> > Using RLE for the sign bits and exponent obfuscates the input, so the
> > compression step cannot fully take advantage of repetitions in the data
> > since they were removed by the RLE step. Repetitive patterns are replaced
> > by the RLE bits, which likely do not compress very well.
> >
> > Also, GZIP/ZSTD handle repetitions in data better on average. For
> > example, GZIP's Deflate algorithm can encode a long run length with 3
> > bytes (not sure whether it can be less?) for the raw data + 3 bytes for
> > the reference (8 bits + 15 bits + 2 bits).
> >
> > Now, the RLE step might produce better results for long runs of the same
> > value. However, compressors also handle more complex cases where we have a
> > pattern in the data which doesn't necessarily have a long run length. Also,
> > compressors like GZIP/ZSTD often do entropy-coding-aware parsing (
> http://cbloomrants.blogspot.com/2008/10/10-10-08-7_10.html ,
> https://martinradev.github.io/jekyll/update/2019/05/29/writing-a-pe32-x86-exe-packer.html
> ).
> >
> > The proposed RLE encoding+ZSTD is slower than BYTE_STREAM_SPLIT+ZSTD.
> The reason is that BYTE_STREAM_SPLIT only does a scattered memcpy, whereas
> the new RLE encoding has to extract the components and run through each
> block of 128 values twice - once to compute min and max, and once to
> encode. There's also the overhead of using the Arrow RLEEncoder.
> >
> > Ryan and other folks, can you provide feedback? Does the implementation
> look reasonable to you?
> >
> > Can somebody please work with me on this new encoding? There has been
> > some interest and some discussion, but it hasn't been pushed forward,
> > likely due to work around the current release.
> > For a bit more discussion and results, please refer to:
> > Recent benchmark of Arrow implementation:
> https://github.com/martinradev/arrow-fp-compression-bench
> > Separate report:
> https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view?usp=sharing
> >
> > Regards,
> > Martin
> >
> >
> >
> > ________________________________
> > From: Ryan Blue <rb...@netflix.com>
> > Sent: Thursday, September 19, 2019 7:54 PM
> > To: Radev, Martin
> > Cc: Parquet Dev; Raoofy, Amir; Karlstetter, Roman
> > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
> >
> > Sounds good, thanks for working on this!
> >
> > On Thu, Sep 19, 2019 at 6:10 AM Radev, Martin <martin.radev@tum.de
> <ma...@tum.de>> wrote:
> >
> > Hello Ryan,
> >
> >
> > we decided that it would be beneficial to try out your proposal.
> >
> >
> > I will look into it and provide measurements on the compression ratio
> and speed.
> >
> >
> > Regards,
> >
> > Martin
> >
> > ________________________________
> > From: Ryan Blue <rb...@netflix.com>
> > Sent: Saturday, September 14, 2019 2:23:20 AM
> > To: Radev, Martin
> > Cc: Parquet Dev; Raoofy, Amir; Karlstetter, Roman
> > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
> >
> > > Using RLE for the sign, exponents and the top-most mantissa bytes can
> help when data is repetitive and make it worse for others.
> >
> > I agree. But we use RLE in similar cases because we do tend to have runs
> of values, and values that fit in a fixed number of bits. Exponents and
> sign bits would probably fit this model extremely well most of the time if
> you have similar floating point values or sorted values. It would be really
> interesting to see how well this performs in comparison to the compression
> tests you've already done. For mantissa bits, I agree it wouldn't be worth
> encoding first.
> >
> > On Thu, Sep 12, 2019 at 2:56 AM Radev, Martin <martin.radev@tum.de
> <ma...@tum.de>> wrote:
> >
> > Hello Ryan, Wes and other parquet devs,
> >
> >
> > thanks for the response. I was away on vacation and that's why I am
> answering just now.
> >
> >
> > > whether you think adding run-length encoding to any of the byte
> streams would be beneficial before applying Zstd.
> > The short answer is "only for some cases but it will make it worse in
> both compression ratio and speed for others".
> >
> > Our initial investigation also split the sign, exponent and mantissa
> > into separate streams.
> >
> > The encoding was the following assuming 32-bit IEEE754:
> >
> > - stream of sign bits
> >
> > - stream of exponent bits. Conveniently, the exponent of a 32-bit
> > IEEE754 number is 8 bits.
> >
> > - separate the remaining 23 bits into three streams of 8, 8, and 7 bits.
> > An extra zero bit is added to the stream which has only seven bits. This
> > was done since zstd, zlib, etc. work at a byte granularity and we would
> > want repetitions to occur at that granularity.
> >
> > For 64-bit IEEE754 even more padding has to be added since the exponent
> is 11 bits and the mantissa is 52 bits. Thus, we have to add 5 more
> exponent bits and 4 more mantissa bits to keep repetitions at a byte
> granularity. My original report shows results for when the floating-point
> values are split at a component granularity. Report is here:
> https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view
> > Results are just slightly better in terms of compression ratio for some
> tests, but compression and decompression speeds are, as expected, worse. The
> reason is that splitting a value is somewhat more complex. We need to keep
> a stream of bits for the signs, keep track of when a byte in the stream is
> exhausted, do bit manipulation to extract components, etc. This is also the
> reason why I preferred to go with the byte-wise decomposition of the
> values. It's faster and the compression ratio is just slightly worse for
> some of the tests.
> >
> >
> > Using RLE for the sign, exponents and the top-most mantissa bytes can
> help when data is repetitive and make it worse for others. I suppose using
> one of the compressors yields a better compression ratio on average. Also,
> this can again make encoding and decoding slower.
> >
> >
> > The design of the BYTE_STREAM_SPLIT encoding had two things in mind:
> >
> > - It would only make data more compressible and leave compression to the
> codec in use.
> >   This leaves the complexity to the codec and choice of
> speed/compression ratio to the user.
> >
> > - It should be fast.
> >   There's an extra compression step so preferably there's very little
> latency before it.
> >
> > @Wes, can you have a look?
> >
> > More opinions are welcome.
> >
> > If you have floating point data available, I would be very happy to
> examine whether this approach offers benefit for you.
> >
> >
> > Regards,
> >
> > Martin
> >
> > ________________________________
> > From: Ryan Blue <rb...@netflix.com.INVALID>
> > Sent: Tuesday, September 3, 2019 11:51:46 PM
> > To: Parquet Dev
> > Cc: Raoofy, Amir; Karlstetter, Roman
> > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
> >
> > Hi Martin,
> >
> > Thanks for taking a look at this! I agree that the approach here looks
> > promising. We've had occasional requests for lossy floating point
> > compression in the past, so it would be good to add this.
> >
> > I did some work in this area a few years ago that is similar and I'd like
> > to hear what you think about that approach compared to this one. That
> work
> > was based on the same observation, that the main problem is the mantissa,
> > while exponents tend to compress well. What I did was take the exponent
> and
> > mantissa and encode each separately, like the component encoding in your
> > test. But to encode each stream, I used Parquet's RLE encoder instead of
> > just applying compression. This seemed to work well for exponents and
> sign
> > bits, but probably isn't worth the cost for mantissa bits. It could also
> be
> > interesting to test a separate stream for sign bits.
> >
> > I guess what I'd like to hear your take on is whether you think adding
> > run-length encoding to any of the byte streams would be beneficial before
> > applying Zstd.
> >
> > Thanks!
> >
> > rb
> >
> > On Tue, Sep 3, 2019 at 12:30 PM Wes McKinney <wesmckinn@gmail.com
> <ma...@gmail.com>> wrote:
> >
> > > I'm interested in this. I have been busy the last couple of weeks so
> have
> > > not been able to take a closer look. I will try to give some feedback
> this
> > > week.
> > >
> > > Thanks
> > >
> > > On Tue, Sep 3, 2019, 2:17 PM Radev, Martin <martin.radev@tum.de
> <ma...@tum.de>> wrote:
> > >
> > > > Hello all,
> > > >
> > > >
> > > > thank you Julien for the interest.
> > > >
> > > >
> > > > Could other people, part of Apache Parquet, share their opinions?
> > > >
> > > > Do you have your own data which you would like to use for testing
> the new
> > > > encoding?
> > > >
> > > >
> > > > Regards,
> > > >
> > > > Martin
> > > >
> > > > ________________________________
> > > > From: Julien Le Dem <ju...@wework.com.INVALID>
> > > > Sent: Friday, August 30, 2019 2:38:37 AM
> > > > To: dev@parquet.apache.org<ma...@parquet.apache.org>
> > > > Cc: Raoofy, Amir; Karlstetter, Roman
> > > > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
> > > >
> > > > I think this looks promising to me. At first glance it seems
> combining
> > > > simplicity and efficiency.
> > > > I'd like to hear more from other members of the PMC.
> > > >
> > > > On Tue, Aug 27, 2019 at 5:30 AM Radev, Martin <martin.radev@tum.de
> <ma...@tum.de>>
> > > wrote:
> > > >
> > > > > Dear all,
> > > > >
> > > > >
> > > > > there was some earlier discussion on adding a new encoding for
> better
> > > > > compression of FP32 and FP64 data.
> > > > >
> > > > >
> > > > > The pull request which extends the format is here:
> > > > > https://github.com/apache/parquet-format/pull/144
> > > > > The change has one approval from earlier from Zoltan.
> > > > >
> > > > >
> > > > > The results from an investigation on compression ratio and speed
> > > > > with the new encoding vs other encodings are available here:
> > > > > https://github.com/martinradev/arrow-fp-compression-bench
> > > > > For many tests, the new encoding performs better in compression
> > > > > ratio and, in some cases, in speed. The improvements in compression
> > > > > speed come from the fact that the new format can potentially lead to
> > > > > faster parsing for some compressors like GZIP.
> > > > >
> > > > >
> > > > > An earlier report which examines other FP compressors (fpzip, spdp,
> > > fpc,
> > > > > zfp, sz) and new potential encodings is available here:
> > > > >
> > > >
> > >
> https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view?usp=sharing
> > > > > The report also covers lossy compression, but the BYTE_STREAM_SPLIT
> > > > > encoding focuses only on lossless compression.
> > > > >
> > > > >
> > > > > Can we have a vote?
> > > > >
> > > > >
> > > > > Regards,
> > > > >
> > > > > Martin
> > > > >
> > > > >
> > > >
> > >
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
>


--
Ryan Blue
Software Engineer
Netflix

Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
I'm +1 for adding the definition of BYTE_STREAM_SPLIT to the format. Looks
like it is simple and performs well. We could use a good floating point
encoding.

I don't think I agree that differences in features between the Java and CPP
implementations should hold back new work. It would be great to have more
testing and validation, as well as more thorough implementations. But I
don't think we should reject contributions like this because of those
concerns.

On Fri, Nov 1, 2019 at 9:27 AM Wes McKinney <we...@gmail.com> wrote:

> I have to say I'm struggling with piling more things into the Parquet
> specification when we already have a significant implementation
> shortfall in other areas. LZ4 is still not properly implemented for
> example, and then there is the question of the V2 encodings and data
> page formats.
>
> I'm generally in favor of adding more efficient storage of floating
> point data, but will it actually be implemented broadly? Parquet as a
> format already has become an "implementation swamp" where any two
> implementations may not be compatible with each other, particularly in
> consideration of more advanced features.
>
> For a single organization using a single implementation, having
> advanced features may be useful, so I see the benefits to users that
> tightly control what code and what settings to use.
>
> On Thu, Oct 31, 2019 at 3:51 AM Radev, Martin <ma...@tum.de> wrote:
> >
> > Dear all,
> >
> >
> > would there be any interest in reviewing the BYTE_STREAM_SPLIT encoding?
> >
> > Please feel free to contact me directly if you need help or would like
> to provide more test data.
> >
> >
> > Results for the encoding based on the implementation in Arrow are here:
> https://github.com/martinradev/arrow-fp-compression-bench
> > Patch to Arrow is here:
> https://github.com/martinradev/arrow/commit/10de1e0f8a513b742edddeb6ba0d553617b1aa49
> >
> >
> > The new encoding combined with a compressor performs better than any of
> the other alternatives for data where there is little change in the
> upper-most bytes of fp32 and fp64 values. My early experiments also show
> that this encoding+zstd performs better on average than any of the
> specialized floating-point lossless compressors like fpc, spdp, zfp.
> >
> >
> > Regards,
> >
> > Martin
> >
> > ________________________________
> > From: Radev, Martin <ma...@tum.de>
> > Sent: Thursday, October 10, 2019 2:34:15 PM
> > To: Parquet Dev
> > Cc: Raoofy, Amir; Karlstetter, Roman
> > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
> >
> > Dear Ryan Blue and other Parquet developers,
> >
> > I tested Ryan's proposal for modifying the encoding.
> >
> > The short answer is that it doesn't perform well in my tests. The
> encoding, code and results can be viewed below.
> >
> >
> > The current implementation only handles 32-bit IEEE754 floats in the
> following way:
> >
> >   1.  For each block of 128 values, the min and max are computed for the
> >   exponent.
> >   2.  The number of bits for the exponent RLE is computed as
> >   ceil(log2(max - min + 1)). The sign bit uses 1 bit.
> >   3.  The sign, exponent and 23 remaining mantissa bits are extracted.
> >   4.  One RLE encoder is used for the sign and one for the exponent.
> >   A new RLE encoder for the exponent is created if the block requires
> >   fewer or more bits than the number of bits used by the current encoder.
> >   The 23 mantissa bits are divided into three streams. (Not sure whether
> >   this is strictly a good idea.)
> >   5.  Also, for each block of 128 values we need to store 2 bytes: the min
> >   exponent and the number of bits used by the RLE.
> >
> >
> > I did not implement a decoder and have not added unit tests to guarantee
> that the implementation is sound.
> >
> > Ryan, can you please review whether the implementation corresponds to what
> > you had in mind?
> > Check FPByteStreamSplitComponentRLEEncoder. Please do not focus on the
> code quality - this is only a quick hack.
> >
> https://github.com/martinradev/arrow/commit/7c1cc8b2a5ff9e6288c615c2acc44b1dce1bd773#diff-47fe879cb9baad6c633c55f0a34a09c3R979
> >
> > In my benchmark tests I compared the following scenarios:
> >
> >   *   dictionary
> >   *   plain encoding + ZSTD
> >   *   the StreamSplit+RLE implementation above
> >   *   the StreamSplit+RLE implementation above + ZSTD
> >   *   the original BYTE_STREAM_SPLIT proposal + ZSTD
> >
> > Size of parquet files, measured in MB, for each combination:
> >
> >                  Plain  Plain+ZSTD  Dict  Split+RLE  Split+RLE+ZSTD  StreamSplit+ZSTD
> > msg_bt.sp          128         112   128        113              93                84
> > msg_lu.sp           93          87    94         80              73                67
> > msg_sp.sp          139         125   139        123              96                88
> > msg_sweep3d.sp      60          19    60         47              40                13
> > num_brain.sp        68          60    69         55              54                49
> > num_comet.sp        52          45    50         46              42                38
> > num_control.sp      77          71    77         70              69                63
> > num_plasma.sp       17           2     8         16               8                 1
> > obs_error.sp        30          24    30         25              22                21
> > obs_info.sp         10           8    10          8               7                 7
> > obs_spitzer.sp      95          83    95         99              82                72
> > obs_temp.sp         20          18    20         18              18                16
> >
> >
> > I have not tested the encoding on any of the other test data we have
> > since they also contain FP64 columns. I did not add support for FP64 in
> > the StreamSplit+RLE encoding to save time and also because I do not expect
> > much improvement.
> >
> >
> > The results show that the proposed StreamSplit+RLE encoding does not
> > improve on the alternatives.
> >
> > Using RLE for the sign bits and exponent obfuscates the input, so the
> > compression step cannot fully take advantage of repetitions in the data
> > since they were removed by the RLE step. Repetitive patterns are replaced
> > by the RLE bits, which likely do not compress very well.
> >
> > Also, GZIP/ZSTD handle repetitions in data better on average. For
> > example, GZIP's Deflate algorithm can encode a long run length with 3
> > bytes (not sure whether it can be less?) for the raw data + 3 bytes for
> > the reference (8 bits + 15 bits + 2 bits).
> >
> > Now, the RLE step might produce better results for long runs of the same
> > value. However, compressors also handle more complex cases where we have a
> > pattern in the data which doesn't necessarily have a long run length. Also,
> > compressors like GZIP/ZSTD often do entropy-coding-aware parsing (
> http://cbloomrants.blogspot.com/2008/10/10-10-08-7_10.html ,
> https://martinradev.github.io/jekyll/update/2019/05/29/writing-a-pe32-x86-exe-packer.html
> ).
> >
> > The proposed RLE encoding+ZSTD is slower than BYTE_STREAM_SPLIT+ZSTD.
> The reason is that BYTE_STREAM_SPLIT only does a scattered memcpy, whereas
> the new RLE encoding has to extract the components and run through each
> block of 128 values twice - once to compute min and max, and once to
> encode. There's also the overhead of using the Arrow RLEEncoder.
> >
> > Ryan and other folks, can you provide feedback? Does the implementation
> look reasonable to you?
> >
> > Can somebody please work with me on this new encoding? There has been
> > some interest and some discussion, but it hasn't been pushed forward,
> > likely due to work around the current release.
> > For a bit more discussion and results, please refer to:
> > Recent benchmark of Arrow implementation:
> https://github.com/martinradev/arrow-fp-compression-bench
> > Separate report:
> https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view?usp=sharing
> >
> > Regards,
> > Martin
> >
> >
> >
> > ________________________________
> > From: Ryan Blue <rb...@netflix.com>
> > Sent: Thursday, September 19, 2019 7:54 PM
> > To: Radev, Martin
> > Cc: Parquet Dev; Raoofy, Amir; Karlstetter, Roman
> > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
> >
> > Sounds good, thanks for working on this!
> >
> > On Thu, Sep 19, 2019 at 6:10 AM Radev, Martin <martin.radev@tum.de
> <ma...@tum.de>> wrote:
> >
> > Hello Ryan,
> >
> >
> > we decided that it would be beneficial to try out your proposal.
> >
> >
> > I will look into it and provide measurements on the compression ratio
> and speed.
> >
> >
> > Regards,
> >
> > Martin
> >
> > ________________________________
> > From: Ryan Blue <rb...@netflix.com>>
> > Sent: Saturday, September 14, 2019 2:23:20 AM
> > To: Radev, Martin
> > Cc: Parquet Dev; Raoofy, Amir; Karlstetter, Roman
> > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
> >
> > > Using RLE for the sign, exponents and the top-most mantissa bytes can
> help when data is repetitive and make it worse for other.
> >
> > I agree. But we use RLE in similar cases because we do tend to have runs
> of values, and values that fit in a fixed number of bits. Exponents and
> sign bits would probably fit this model extremely well most of the time if
> you have similar floating point values or sorted values. It would be really
> interesting to see how well this performs in comparison to the compression
> tests you've already done. For mantissa bits, I agree it wouldn't be worth
> encoding first.
> >
> > On Thu, Sep 12, 2019 at 2:56 AM Radev, Martin <martin.radev@tum.de
> <ma...@tum.de>> wrote:
> >
> > Hello Ryan, Wes and other parquet devs,
> >
> >
> > thanks for the response. I was away on vacation and that's why I am
> answering just now.
> >
> >
> > > whether you think adding run-length encoding to any of the byte
> streams would be beneficial before applying Zstd.
> > The short answer is "only for some cases but it will make it worse in
> both compression ratio and speed for other".
> >
> > Our initial investigation also separated the sign, exponent and mantissa
> into separate streams.
> >
> > The encoding was the following assuming 32-bit IEEE754:
> >
> > - stream of sign bits
> >
> > - stream of exponents bits. Conveniently the exponent for a 32-bit
> IEEE754 number is 8 bits.
> >
> > - separate the remaining 23 bits into four streams of 8, 8, 7 bits. An
> extra zero bit is added to the block which has only seven bits. This was
> done since zstd, zlib, etc work at a byte granularity and we would want
> repetitions to happen at such.
> >
> > For 64-bit IEEE754 even more padding has to be added since the exponent
> is 11 bits and the mantissa is 52 bits. Thus, we have to add 5 more
> exponent bits and 4 more mantissa bits to keep repetitions at a byte
> granularity. My original report shows results for when the floating-point
> values are split at a component granularity. Report is here:
> https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view
> > Results are just slightly better in terms of compression ratio for some
> tests but compression and decompression speed is expectedly worse. The
> reason is that splitting a value is somewhat more complex. We need to keep
> a stream of bits for the signs, keep track of when a byte in the stream is
> exhausted, do bit manipulation to extract components, etc. This is also the
> reason why I preferred to go with the byte-wise decomposition of the
> values. It's faster and the compression ratio is just slightly worse for
> some of the tests.
> >
> >
> > Using RLE for the sign, exponents and the top-most mantissa bytes can
> help when data is repetitive and make it worse for other. I suppose using
> one of the compressors yields a better compression ratio on average. Also,
> this can again make encoding and decoding slower.
> >
> >
> > The design of the BYTE_STREAM_SPLIT encoding had in mind two things:
> >
> > - It would only make data more compressible and leave compression to the
> codec in use.
> >   This leaves the complexity to the codec and choice of
> speed/compression ratio to the user.
> >
> > - It should be fast.
> >   There's an extra compression step so preferably there's very little
> latency before it.
> >
> > @Wes, can you have a look?
> >
> > More opinions are welcome.
> >
> > If you have floating point data available, I would be very happy to
> examine whether this approach offers benefit for you.
> >
> >
> > Regards,
> >
> > Martin
> >
> > ________________________________
> > From: Ryan Blue <rb...@netflix.com.INVALID>
> > Sent: Tuesday, September 3, 2019 11:51:46 PM
> > To: Parquet Dev
> > Cc: Raoofy, Amir; Karlstetter, Roman
> > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
> >
> > Hi Martin,
> >
> > Thanks for taking a look at this! I agree that the approach here looks
> > promising. We've had occasional requests for lossy floating point
> > compression in the past, so it would be good to add this.
> >
> > I did some work in this area a few years ago that is similar and I'd like
> > to hear what you think about that approach compared to this one. That
> work
> > was based on the same observation, that the main problem is the mantissa,
> > while exponents tend to compress well. What I did was take the exponent
> and
> > mantissa and encode each separately, like the component encoding in your
> > test. But to encode each stream, I used Parquet's RLE encoder instead of
> > just applying compression. This seemed to work well for exponents and
> sign
> > bits, but probably isn't worth the cost for mantissa bits. It could also
> be
> > interesting to test a separate stream for sign bits.
> >
> > I guess what I'd like to hear your take on is whether you think adding
> > run-length encoding to any of the byte streams would be beneficial before
> > applying Zstd.
> >
> > Thanks!
> >
> > rb
> >
> > On Tue, Sep 3, 2019 at 12:30 PM Wes McKinney <wesmckinn@gmail.com
> <ma...@gmail.com>> wrote:
> >
> > > I'm interested in this. I have been busy the last couple of weeks so
> have
> > > not been able to take a closer look. I will try to give some feedback
> this
> > > week.
> > >
> > > Thanks
> > >
> > > On Tue, Sep 3, 2019, 2:17 PM Radev, Martin <martin.radev@tum.de
> <ma...@tum.de>> wrote:
> > >
> > > > Hello all,
> > > >
> > > >
> > > > thank you Julien for the interest.
> > > >
> > > >
> > > > Could other people, part of Apache Parquet, share their opinions?
> > > >
> > > > Do you have your own data which you would like to use for testing
> the new
> > > > encoding?
> > > >
> > > >
> > > > Regards,
> > > >
> > > > Martin
> > > >
> > > > ________________________________
> > > > From: Julien Le Dem <ju...@wework.com.INVALID>
> > > > Sent: Friday, August 30, 2019 2:38:37 AM
> > > > To: dev@parquet.apache.org<ma...@parquet.apache.org>
> > > > Cc: Raoofy, Amir; Karlstetter, Roman
> > > > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
> > > >
> > > > I think this looks promising to me. At first glance it seems
> combining
> > > > simplicity and efficiency.
> > > > I'd like to hear more from other members of the PMC.
> > > >
> > > > On Tue, Aug 27, 2019 at 5:30 AM Radev, Martin <martin.radev@tum.de
> <ma...@tum.de>>
> > > wrote:
> > > >
> > > > > Dear all,
> > > > >
> > > > >
> > > > > there was some earlier discussion on adding a new encoding for
> better
> > > > > compression of FP32 and FP64 data.
> > > > >
> > > > >
> > > > > The pull request which extends the format is here:
> > > > > https://github.com/apache/parquet-format/pull/144
> > > > > The change has one approval from earlier from Zoltan.
> > > > >
> > > > >
> > > > > The results from an investigation on compression ratio and speed
> with
> > > the
> > > > > new encoding vs other encodings is available here:
> > > > > https://github.com/martinradev/arrow-fp-compression-bench
> > > > > It is visible that for many tests the new encoding performs better
> in
> > > > > compression ratio and in some cases in speed. The improvements in
> > > > > compression speed come from the fact that the new format can
> > > potentially
> > > > > lead to a faster parsing for some compressors like GZIP.
> > > > >
> > > > >
> > > > > An earlier report which examines other FP compressors (fpzip, spdp,
> > > fpc,
> > > > > zfp, sz) and new potential encodings is available here:
> > > > >
> > > >
> > >
> https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view?usp=sharing
> > > > > The report also covers lossy compression but the BYTE_STREAM_SPLIT
> > > > > encoding only has the focus of lossless compression.
> > > > >
> > > > >
> > > > > Can we have a vote?
> > > > >
> > > > >
> > > > > Regards,
> > > > >
> > > > > Martin
> > > > >
> > > > >
> > > >
> > >
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
>


-- 
Ryan Blue
Software Engineer
Netflix

Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet

Posted by Wes McKinney <we...@gmail.com>.
I have to say I'm struggling with piling more things into the Parquet
specification when we already have a significant implementation
shortfall in other areas. LZ4, for example, is still not properly
implemented, and then there is the question of the V2 encodings and
data page formats.

I'm generally in favor of adding more efficient storage of floating
point data, but will it actually be implemented broadly? Parquet as a
format has already become an "implementation swamp" in which any two
implementations may not be compatible with each other, particularly
where more advanced features are concerned.

For a single organization using a single implementation, advanced
features may be useful, so I see the benefit for users who tightly
control which code and which settings are used.


Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet

Posted by "Radev, Martin" <ma...@tum.de>.
Dear all,


would there be any interest in reviewing the BYTE_STREAM_SPLIT encoding?

Please feel free to contact me directly if you need help or would like to provide more test data.


Results for the encoding based on the implementation in Arrow are here: https://github.com/martinradev/arrow-fp-compression-bench
Patch to Arrow is here: https://github.com/martinradev/arrow/commit/10de1e0f8a513b742edddeb6ba0d553617b1aa49


The new encoding combined with a compressor performs better than the other alternatives for data where there is little change in the uppermost bytes of fp32 and fp64 values. My early experiments also show that this encoding combined with zstd performs better on average than any of the specialized floating-point lossless compressors such as fpc, spdp and zfp.
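
For anybody who wants to try it out, enabling the encoding from parquet-cpp could look roughly like the sketch below. This assumes my patch registers an Encoding::BYTE_STREAM_SPLIT enum value - the exact name may differ - and otherwise uses the existing WriterProperties builder API:

    #include <memory>

    #include "parquet/properties.h"

    std::shared_ptr<parquet::WriterProperties> MakeFloatWriterProperties() {
      parquet::WriterProperties::Builder builder;
      builder.disable_dictionary()                          // plain data pages
          ->encoding(parquet::Encoding::BYTE_STREAM_SPLIT)  // assumed enum from the patch
          ->compression(parquet::Compression::ZSTD);        // compress the split streams
      return builder.build();
    }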


Regards,

Martin


Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet

Posted by "Radev, Martin" <ma...@tum.de>.
Dear Ryan Blue and other Parquet developers,

I tested Ryan's proposal for modifying the encoding.

The short answer is that it doesn't perform well in my tests. The encoding, code and results can be viewed below.


The current implementation handles only 32-bit IEEE754 floats and works in the following way:

  1.  For each block of 128 values, the minimum and maximum exponent are computed.
  2.  The bit width for the exponent RLE is computed as ceil(log2(max - min + 1)). The sign uses 1 bit.
  3.  The sign, the exponent and the 23 remaining mantissa bits are extracted (see the sketch after this list).
  4.  One RLE encoder is used for the sign and one for the exponent.
A new RLE encoder for the exponent is created if a block requires fewer or more bits than the current encoder uses.
The 23 mantissa bits are divided into three streams. (I am not sure whether this is strictly a good idea.)
  5.  Also, for each block of 128 values we need to store 2 extra bytes: the minimum exponent and the bit width used by the RLE.
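
For concreteness, below is a minimal C++ sketch of steps 1-3 above. It only illustrates the idea and is not the code from my patch; the helper names (Fp32Components, Decompose, ExponentBitWidth) are made up for this mail.

    #include <cstdint>
    #include <cstring>

    // Hypothetical helpers illustrating the component extraction above.
    struct Fp32Components {
      uint32_t sign;      // 1 bit
      uint32_t exponent;  // 8 bits
      uint32_t mantissa;  // 23 bits
    };

    inline Fp32Components Decompose(float f) {
      uint32_t bits;
      std::memcpy(&bits, &f, sizeof bits);  // safe type pun
      return {bits >> 31, (bits >> 23) & 0xFFu, bits & 0x7FFFFFu};
    }

    // Bit width for the exponent RLE of one block: ceil(log2(max - min + 1)).
    inline int ExponentBitWidth(uint32_t min_exp, uint32_t max_exp) {
      int width = 0;
      while ((1u << width) < max_exp - min_exp + 1u) ++width;
      return width;
    }

For example, a block whose exponents all lie in [120, 135] has a range of 16 and so needs 4 bits per exponent.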


I did not implement a decoder and have not added unit tests to guarantee that the implementation is sound.

Ryan, can you please review the implementation and check whether it corresponds to what you had in mind?
See FPByteStreamSplitComponentRLEEncoder. Please do not focus on the code quality - this is only a quick hack.
https://github.com/martinradev/arrow/commit/7c1cc8b2a5ff9e6288c615c2acc44b1dce1bd773#diff-47fe879cb9baad6c633c55f0a34a09c3R979

In my benchmark tests I compared the following scenarios:

  *   dictionary
  *   plain encoding + ZSTD
  *   the StreamSplit+RLE implementation above
  *   the StreamSplit+RLE implementation above + ZSTD
  *   the original BYTE_STREAM_SPLIT proposal + ZSTD

Size of parquet files, measured in MB, for each combination:

File             Plain  Plain+ZSTD  Dict  Split+RLE  Split+RLE+ZSTD  StreamSplit+ZSTD
msg_bt.sp          128         112   128        113              93                84
msg_lu.sp           93          87    94         80              73                67
msg_sp.sp          139         125   139        123              96                88
msg_sweep3d.sp      60          19    60         47              40                13
num_brain.sp        68          60    69         55              54                49
num_comet.sp        52          45    50         46              42                38
num_control.sp      77          71    77         70              69                63
num_plasma.sp       17           2     8         16               8                 1
obs_error.sp        30          24    30         25              22                21
obs_info.sp         10           8    10          8               7                 7
obs_spitzer.sp      95          83    95         99              82                72
obs_temp.sp         20          18    20         18              18                16


I have not tested the encoding on any of the other test data we have since those files also contain FP64 columns. I did not add FP64 support to the StreamSplit+RLE encoding to save time and also because I do not expect much improvement.


From the results it can be seen that the proposed StreamSplit+RLE encoding does not improve on the original BYTE_STREAM_SPLIT proposal.

Using RLE for the sign bits and exponents obfuscates the input: the compression step cannot fully take advantage of repetitions in the data because the RLE step has already consumed them. Repetitive patterns are replaced by the RLE output bits, which likely do not compress very well.

Also, GZIP/ZSTD handle repetitions in data better on average. For example, GZIP's Deflate algorithm can encode a long run with 3 bytes (not sure whether it can be fewer?) for the raw data plus 3 bytes for the back-reference (8 bits + 15 bits + 2 bits).

Now, the RLE step might produce better results for long runs of the same value. However, compressors also handle more complex cases in which the data has a pattern but not necessarily a long run length. Also, compressors like GZIP/ZSTD often do entropy-coding-aware parsing ( http://cbloomrants.blogspot.com/2008/10/10-10-08-7_10.html , https://martinradev.github.io/jekyll/update/2019/05/29/writing-a-pe32-x86-exe-packer.html ).

The proposed RLE encoding+ZSTD is slower than BYTE_STREAM_SPLIT+ZSTD. The reason is that BYTE_STREAM_SPLIT only does a scattered memcpy, whereas the new RLE encoding has to extract the components and run through each block of 128 values twice - once to compute the min and max, and once to encode. There's also the overhead of using the Arrow RLEEncoder.
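
For comparison, the BYTE_STREAM_SPLIT transform itself is just a byte transpose. Here is a minimal scalar sketch for FP32 - an illustration only, not the vectorized code from my patch:

    #include <cstddef>
    #include <cstdint>

    // Byte k of value i goes to stream k at position i, so stable
    // high-order bytes of neighboring values form long runs for the codec.
    void ByteStreamSplitEncode(const uint8_t* src, size_t num_values,
                               uint8_t* dst) {
      constexpr size_t kWidth = 4;  // sizeof(float)
      for (size_t i = 0; i < num_values; ++i)
        for (size_t k = 0; k < kWidth; ++k)
          dst[k * num_values + i] = src[i * kWidth + k];
    }

    // Decoding is the inverse gather.
    void ByteStreamSplitDecode(const uint8_t* src, size_t num_values,
                               uint8_t* dst) {
      constexpr size_t kWidth = 4;
      for (size_t i = 0; i < num_values; ++i)
        for (size_t k = 0; k < kWidth; ++k)
          dst[i * kWidth + k] = src[k * num_values + i];
    }

There is no per-value arithmetic and no second pass over each block, which is why it stays close to memcpy speed.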

Ryan and other folks, can you provide feedback? Does the implementation look reasonable to you?

Can somebody please work with me on this new encoding? There has been some interest and some discussion, but it hasn't moved forward, likely due to work around the current release.
For a bit more discussion and results, please refer to:
Recent benchmark of Arrow implementation: https://github.com/martinradev/arrow-fp-compression-bench
Separate report: https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view?usp=sharing

Regards,
Martin




Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Sounds good, thanks for working on this!

On Thu, Sep 19, 2019 at 6:10 AM Radev, Martin <ma...@tum.de> wrote:

> Hello Ryan,
>
>
> we decided that it would be beneficial to try out your proposal.
>
>
> I will look into it and provide measurements on the compression ratio and
> speed.
>
>
> Regards,
>
> Martin
> ------------------------------
> *From:* Ryan Blue <rb...@netflix.com>
> *Sent:* Saturday, September 14, 2019 2:23:20 AM
> *To:* Radev, Martin
> *Cc:* Parquet Dev; Raoofy, Amir; Karlstetter, Roman
> *Subject:* Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
>
> > Using RLE for the sign, exponents and the top-most mantissa bytes can
> help when data is repetitive and make it worse for other.
>
> I agree. But we use RLE in similar cases because we do tend to have runs
> of values, and values that fit in a fixed number of bits. Exponents and
> sign bits would probably fit this model extremely well most of the time if
> you have similar floating point values or sorted values. It would be really
> interesting to see how well this performs in comparison to the compression
> tests you've already done. For mantissa bits, I agree it wouldn't be worth
> encoding first.
>
> On Thu, Sep 12, 2019 at 2:56 AM Radev, Martin <ma...@tum.de> wrote:
>
>> Hello Ryan, Wes and other parquet devs,
>>
>>
>> thanks for the response. I was away on vacation and that's why I am
>> answering just now.
>>
>>
>> > whether you think adding run-length encoding to any of the byte
>> streams would be beneficial before applying Zstd.
>> The short answer is "only for some cases but it will make it worse in
>> both compression ratio and speed for other".
>>
>> Our initial investigation also separated the sign, exponent and mantissa
>> into separate streams.
>>
>> The encoding was the following assuming 32-bit IEEE754:
>>
>> - stream of sign bits
>>
>> - stream of exponents bits. Conveniently the exponent for a 32-bit
>> IEEE754 number is 8 bits.
>>
>> - separate the remaining 23 bits into four streams of 8, 8, 7 bits. An
>> extra zero bit is added to the block which has only seven bits. This was
>> done since zstd, zlib, etc work at a byte granularity and we would
>> want repetitions to happen at such.
>>
>> For 64-bit IEEE754 even more padding has to be added since the exponent
>> is 11 bits and the mantissa is 52 bits. Thus, we have to add 5 more
>> exponent bits and 4 more mantissa bits to keep repetitions at a byte
>> granularity. My original report shows results for when the floating-point
>> values are split at a component granularity. Report is here:
>> https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view
>> Results are just slightly better in terms of compression ratio for some
>> tests but compression and decompression speed is expectedly worse. The
>> reason is that splitting a value is somewhat more complex. We need to keep
>> a stream of bits for the signs, keep track of when a byte in the stream is
>> exhausted, do bit manipulation to extract components, etc. This is also the
>> reason why I preferred to go with the byte-wise decomposition of the
>> values. It's faster and the compression ratio is just slightly worse for
>> some of the tests.
>>
>>
>> Using RLE for the sign, exponents and the top-most mantissa bytes can
>> help when data is repetitive and make it worse for other. I suppose using
>> one of the compressors yields a better compression ratio on average. Also,
>> this can again make encoding and decoding slower.
>>
>>
>> The design of the BYTE_STREAM_SPLIT encoding had in mind two things:
>>
>> - It would only make data more compressible and leave compression to the
>> codec in use.
>>   This leaves the complexity to the codec and choice of speed/compression
>> ratio to the user.
>>
>> - It should be fast.
>>   There's an extra compression step so preferably there's very little
>> latency before it.
>>
>> @Wes, can you have a look?
>>
>> More opinions are welcome.
>>
>> If you have floating point data available, I would be very happy to
>> examine whether this approach offers benefit for you.
>>
>>
>> Regards,
>>
>> Martin
>>
>> ------------------------------
>> *From:* Ryan Blue <rb...@netflix.com.INVALID>
>> *Sent:* Tuesday, September 3, 2019 11:51:46 PM
>> *To:* Parquet Dev
>> *Cc:* Raoofy, Amir; Karlstetter, Roman
>> *Subject:* Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
>>
>> Hi Martin,
>>
>> Thanks for taking a look at this! I agree that the approach here looks
>> promising. We've had occasional requests for lossy floating point
>> compression in the past, so it would be good to add this.
>>
>> I did some work in this area a few years ago that is similar and I'd like
>> to hear what you think about that approach compared to this one. That work
>> was based on the same observation, that the main problem is the mantissa,
>> while exponents tend to compress well. What I did was take the exponent
>> and
>> mantissa and encode each separately, like the component encoding in your
>> test. But to encode each stream, I used Parquet's RLE encoder instead of
>> just applying compression. This seemed to work well for exponents and sign
>> bits, but probably isn't worth the cost for mantissa bits. It could also
>> be
>> interesting to test a separate stream for sign bits.
>>
>> I guess what I'd like to hear your take on is whether you think adding
>> run-length encoding to any of the byte streams would be beneficial before
>> applying Zstd.
>>
>> Thanks!
>>
>> rb
>>
>> On Tue, Sep 3, 2019 at 12:30 PM Wes McKinney <we...@gmail.com> wrote:
>>
>> > I'm interested in this. I have been busy the last couple of weeks so
>> > have not been able to take a closer look. I will try to give some
>> > feedback this week.
>> >
>> > Thanks
>> >
>> > On Tue, Sep 3, 2019, 2:17 PM Radev, Martin <ma...@tum.de> wrote:
>> >
>> > > Hello all,
>> > >
>> > >
>> > > thank you Julien for the interest.
>> > >
>> > >
>> > > Could other members of the Apache Parquet community share their opinions?
>> > >
>> > > Do you have your own data which you would like to use for testing the
>> > > new encoding?
>> > >
>> > >
>> > > Regards,
>> > >
>> > > Martin
>> > >
>> > > ________________________________
>> > > From: Julien Le Dem <ju...@wework.com.INVALID>
>> > > Sent: Friday, August 30, 2019 2:38:37 AM
>> > > To: dev@parquet.apache.org
>> > > Cc: Raoofy, Amir; Karlstetter, Roman
>> > > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
>> > >
>> > > This looks promising to me. At first glance it seems to combine
>> > > simplicity and efficiency.
>> > > I'd like to hear more from other members of the PMC.
>> > >
>> > > On Tue, Aug 27, 2019 at 5:30 AM Radev, Martin <ma...@tum.de>
>> > wrote:
>> > >
>> > > > Dear all,
>> > > >
>> > > >
>> > > > there was some earlier discussion on adding a new encoding for
>> better
>> > > > compression of FP32 and FP64 data.
>> > > >
>> > > >
>> > > > The pull request which extends the format is here:
>> > > > https://github.com/apache/parquet-format/pull/144
>> > > > The change already has one approval, from Zoltan.
>> > > >
>> > > >
>> > > > The results from an investigation on compression ratio and speed
>> > > > with the new encoding vs other encodings are available here:
>> > > > https://github.com/martinradev/arrow-fp-compression-bench
>> > > > For many tests the new encoding achieves a better compression ratio
>> > > > and, in some cases, better speed. The improvements in compression
>> > > > speed come from the fact that the new format can potentially lead to
>> > > > faster parsing for some compressors like GZIP.
>> > > >
>> > > >
>> > > > An earlier report which examines other FP compressors (fpzip, spdp,
>> > > > fpc, zfp, sz) and new potential encodings is available here:
>> > > > https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view?usp=sharing
>> > > > The report also covers lossy compression but the BYTE_STREAM_SPLIT
>> > > > encoding focuses only on lossless compression.
>> > > >
>> > > >
>> > > > Can we have a vote?
>> > > >
>> > > >
>> > > > Regards,
>> > > >
>> > > > Martin
>> > > >
>> > > >
>> > >
>> >
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


-- 
Ryan Blue
Software Engineer
Netflix

Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet

Posted by "Radev, Martin" <ma...@tum.de>.
Hello Ryan,


we decided that it would be beneficial to try out your proposal.


I will look into it and provide measurements on the compression ratio and speed.


Regards,

Martin

________________________________
From: Ryan Blue <rb...@netflix.com>
Sent: Saturday, September 14, 2019 2:23:20 AM
To: Radev, Martin
Cc: Parquet Dev; Raoofy, Amir; Karlstetter, Roman
Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet

> Using RLE for the sign, exponents and the top-most mantissa bytes can help when data is repetitive and make it worse for other data.

I agree. But we use RLE in similar cases because we do tend to have runs of values, and values that fit in a fixed number of bits. Exponents and sign bits would probably fit this model extremely well most of the time if you have similar floating point values or sorted values. It would be really interesting to see how well this performs in comparison to the compression tests you've already done. For mantissa bits, I agree it wouldn't be worth encoding first.

On Thu, Sep 12, 2019 at 2:56 AM Radev, Martin <ma...@tum.de> wrote:

Hello Ryan, Wes and other parquet devs,


thanks for the response. I was away on vacation and that's why I am answering just now.


> whether you think adding run-length encoding to any of the byte streams would be beneficial before applying Zstd.
The short answer is "only for some cases, but it will make it worse in both compression ratio and speed for others".

Our initial investigation also split the sign, exponent and mantissa into separate streams.

The encoding was the following assuming 32-bit IEEE754:

- stream of sign bits

- stream of exponent bits. Conveniently the exponent for a 32-bit IEEE754 number is 8 bits.

- separate the remaining 23 bits into three streams of 8, 8, and 7 bits. An extra zero bit is added to the block which has only seven bits. This was done since zstd, zlib, etc. work at a byte granularity and we would want repetitions to happen at such a granularity.

For 64-bit IEEE754 even more padding has to be added since the exponent is 11 bits and the mantissa is 52 bits. Thus, we have to add 5 more exponent bits and 4 more mantissa bits to keep repetitions at a byte granularity. My original report shows results for when the floating-point values are split at a component granularity. Report is here: https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view
Results are just slightly better in terms of compression ratio for some tests, but compression and decompression speeds are, as expected, worse. The reason is that splitting a value is somewhat more complex. We need to keep a stream of bits for the signs, keep track of when a byte in the stream is exhausted, do bit manipulation to extract components, etc. This is also the reason why I preferred to go with the byte-wise decomposition of the values. It's faster and the compression ratio is just slightly worse for some of the tests.


Using RLE for the sign, exponents and the top-most mantissa bytes can help when data is repetitive and make it worse for other data. I suppose using one of the compressors yields a better compression ratio on average. Also, this can again make encoding and decoding slower.


The design of the BYTE_STREAM_SPLIT encoding had two goals in mind:

- It would only make data more compressible and leave compression to the codec in use.
  This leaves the complexity to the codec and choice of speed/compression ratio to the user.

- It should be fast.
  The encoding adds an extra step before compression, so it should introduce as little latency as possible.

@Wes, can you have a look?

More opinions are welcome.

If you have floating point data available, I would be very happy to examine whether this approach offers benefit for you.


Regards,

Martin

________________________________
From: Ryan Blue <rb...@netflix.com.INVALID>
Sent: Tuesday, September 3, 2019 11:51:46 PM
To: Parquet Dev
Cc: Raoofy, Amir; Karlstetter, Roman
Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet

Hi Martin,

Thanks for taking a look at this! I agree that the approach here looks
promising. We've had occasional requests for lossy floating point
compression in the past, so it would be good to add this.

I did some work in this area a few years ago that is similar and I'd like
to hear what you think about that approach compared to this one. That work
was based on the same observation, that the main problem is the mantissa,
while exponents tend to compress well. What I did was take the exponent and
mantissa and encode each separately, like the component encoding in your
test. But to encode each stream, I used Parquet's RLE encoder instead of
just applying compression. This seemed to work well for exponents and sign
bits, but probably isn't worth the cost for mantissa bits. It could also be
interesting to test a separate stream for sign bits.

I guess what I'd like to hear your take on is whether you think adding
run-length encoding to any of the byte streams would be beneficial before
applying Zstd.

Thanks!

rb

On Tue, Sep 3, 2019 at 12:30 PM Wes McKinney <we...@gmail.com> wrote:

> I'm interested in this. I have been busy the last couple of weeks so have
> not been able to take a closer look. I will try to give some feedback this
> week.
>
> Thanks
>
> On Tue, Sep 3, 2019, 2:17 PM Radev, Martin <ma...@tum.de> wrote:
>
> > Hello all,
> >
> >
> > thank you Julien for the interest.
> >
> >
> > Could other members of the Apache Parquet community share their opinions?
> >
> > Do you have your own data which you would like to use for testing the new
> > encoding?
> >
> >
> > Regards,
> >
> > Martin
> >
> > ________________________________
> > From: Julien Le Dem <ju...@wework.com.INVALID>
> > Sent: Friday, August 30, 2019 2:38:37 AM
> > To: dev@parquet.apache.org
> > Cc: Raoofy, Amir; Karlstetter, Roman
> > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
> >
> > This looks promising to me. At first glance it seems to combine
> > simplicity and efficiency.
> > I'd like to hear more from other members of the PMC.
> >
> > On Tue, Aug 27, 2019 at 5:30 AM Radev, Martin <ma...@tum.de>
> wrote:
> >
> > > Dear all,
> > >
> > >
> > > there was some earlier discussion on adding a new encoding for better
> > > compression of FP32 and FP64 data.
> > >
> > >
> > > The pull request which extends the format is here:
> > > https://github.com/apache/parquet-format/pull/144
> > > The change already has one approval, from Zoltan.
> > >
> > >
> > > The results from an investigation on compression ratio and speed with
> > > the new encoding vs other encodings are available here:
> > > https://github.com/martinradev/arrow-fp-compression-bench
> > > For many tests the new encoding achieves a better compression ratio
> > > and, in some cases, better speed. The improvements in compression speed
> > > come from the fact that the new format can potentially lead to faster
> > > parsing for some compressors like GZIP.
> > >
> > >
> > > An earlier report which examines other FP compressors (fpzip, spdp,
> > > fpc, zfp, sz) and new potential encodings is available here:
> > > https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view?usp=sharing
> > > The report also covers lossy compression but the BYTE_STREAM_SPLIT
> > > encoding focuses only on lossless compression.
> > >
> > >
> > > Can we have a vote?
> > >
> > >
> > > Regards,
> > >
> > > Martin
> > >
> > >
> >
>


--
Ryan Blue
Software Engineer
Netflix


--
Ryan Blue
Software Engineer
Netflix

Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
> Using RLE for the sign, exponents and the top-most mantissa bytes can
> help when data is repetitive and make it worse for other data.

I agree. But we use RLE in similar cases because we do tend to have runs of
values, and values that fit in a fixed number of bits. Exponents and sign
bits would probably fit this model extremely well most of the time if you
have similar floating point values or sorted values. It would be really
interesting to see how well this performs in comparison to the compression
tests you've already done. For mantissa bits, I agree it wouldn't be worth
encoding first.
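
As a toy illustration of why that should be the case, here is a sketch in
C++ (my illustration only; the helpers are made up and this is not
Parquet's actual RLE/bit-packing encoder):

#include <cstddef>
#include <cstdint>
#include <cstring>
#include <utility>
#include <vector>

// The biased exponent of an FP32 value occupies bits 30..23.
uint8_t ExponentOf(float f) {
    uint32_t u;
    std::memcpy(&u, &f, sizeof(u));
    return static_cast<uint8_t>((u >> 23) & 0xFF);
}

// Collapse the exponent stream into (exponent, run length) pairs.
std::vector<std::pair<uint8_t, size_t>> RunsOfExponents(const float* v, size_t n) {
    std::vector<std::pair<uint8_t, size_t>> runs;
    for (size_t i = 0; i < n; ++i) {
        uint8_t e = ExponentOf(v[i]);
        if (!runs.empty() && runs.back().first == e) {
            ++runs.back().second;  // extend the current run
        } else {
            runs.push_back({e, 1});  // start a new run
        }
    }
    return runs;
}

Every value in [1.0, 2.0), for example, has biased exponent 127, so a
column of a million such values collapses to a single run.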

On Thu, Sep 12, 2019 at 2:56 AM Radev, Martin <ma...@tum.de> wrote:

> Hello Ryan, Wes and other parquet devs,
>
>
> thanks for the response. I was away on vacation and that's why I am
> answering just now.
>
>
> > whether you think adding run-length encoding to any of the byte streams
> would be beneficial before applying Zstd.
> The short answer is "only for some cases, but it will make it worse in both
> compression ratio and speed for others".
>
> Our initial investigation also split the sign, exponent and mantissa
> into separate streams.
>
> The encoding was the following assuming 32-bit IEEE754:
>
> - stream of sign bits
>
> - stream of exponent bits. Conveniently the exponent for a 32-bit IEEE754
> number is 8 bits.
>
> - separate the remaining 23 bits into three streams of 8, 8, and 7 bits. An
> extra zero bit is added to the block which has only seven bits. This was
> done since zstd, zlib, etc. work at a byte granularity and we would
> want repetitions to happen at such a granularity.
>
> For 64-bit IEEE754 even more padding has to be added since the exponent is
> 11 bits and the mantissa is 52 bits. Thus, we have to add 5 more exponent
> bits and 4 more mantissa bits to keep repetitions at a byte granularity. My
> original report shows results for when the floating-point values are split
> at a component granularity. Report is here:
> https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view
> Results are just slightly better in terms of compression ratio for some
> tests, but compression and decompression speeds are, as expected, worse. The
> reason is that splitting a value is somewhat more complex. We need to keep
> a stream of bits for the signs, keep track of when a byte in the stream is
> exhausted, do bit manipulation to extract components, etc. This is also the
> reason why I preferred to go with the byte-wise decomposition of the
> values. It's faster and the compression ratio is just slightly worse for
> some of the tests.
>
>
> Using RLE for the sign, exponents and the top-most mantissa bytes can help
> when data is repetitive and make it worse for other data. I suppose using one of
> the compressors yields a better compression ratio on average. Also, this
> can again make encoding and decoding slower.
>
>
> The design of the BYTE_STREAM_SPLIT encoding had two goals in mind:
>
> - It would only make data more compressible and leave compression to the
> codec in use.
>   This leaves the complexity to the codec and choice of speed/compression
> ratio to the user.
>
> - It should be fast.
>   The encoding adds an extra step before compression, so it should
> introduce as little latency as possible.
>
> @Wes, can you have a look?
>
> More opinions are welcome.
>
> If you have floating point data available, I would be very happy to
> examine whether this approach offers benefit for you.
>
>
> Regards,
>
> Martin
>
> ------------------------------
> *From:* Ryan Blue <rb...@netflix.com.INVALID>
> *Sent:* Tuesday, September 3, 2019 11:51:46 PM
> *To:* Parquet Dev
> *Cc:* Raoofy, Amir; Karlstetter, Roman
> *Subject:* Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
>
> Hi Martin,
>
> Thanks for taking a look at this! I agree that the approach here looks
> promising. We've had occasional requests for lossy floating point
> compression in the past, so it would be good to add this.
>
> I did some work in this area a few years ago that is similar and I'd like
> to hear what you think about that approach compared to this one. That work
> was based on the same observation, that the main problem is the mantissa,
> while exponents tend to compress well. What I did was take the exponent and
> mantissa and encode each separately, like the component encoding in your
> test. But to encode each stream, I used Parquet's RLE encoder instead of
> just applying compression. This seemed to work well for exponents and sign
> bits, but probably isn't worth the cost for mantissa bits. It could also be
> interesting to test a separate stream for sign bits.
>
> I guess what I'd like to hear your take on is whether you think adding
> run-length encoding to any of the byte streams would be beneficial before
> applying Zstd.
>
> Thanks!
>
> rb
>
> On Tue, Sep 3, 2019 at 12:30 PM Wes McKinney <we...@gmail.com> wrote:
>
> > I'm interested in this. I have been busy the last couple of weeks so have
> > not been able to take a closer look. I will try to give some feedback
> > this week.
> >
> > Thanks
> >
> > On Tue, Sep 3, 2019, 2:17 PM Radev, Martin <ma...@tum.de> wrote:
> >
> > > Hello all,
> > >
> > >
> > > thank you Julien for the interest.
> > >
> > >
> > > Could other members of the Apache Parquet community share their opinions?
> > >
> > > Do you have your own data which you would like to use for testing the
> > > new encoding?
> > >
> > >
> > > Regards,
> > >
> > > Martin
> > >
> > > ________________________________
> > > From: Julien Le Dem <ju...@wework.com.INVALID>
> > > Sent: Friday, August 30, 2019 2:38:37 AM
> > > To: dev@parquet.apache.org
> > > Cc: Raoofy, Amir; Karlstetter, Roman
> > > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
> > >
> > > This looks promising to me. At first glance it seems to combine
> > > simplicity and efficiency.
> > > I'd like to hear more from other members of the PMC.
> > >
> > > On Tue, Aug 27, 2019 at 5:30 AM Radev, Martin <ma...@tum.de>
> > wrote:
> > >
> > > > Dear all,
> > > >
> > > >
> > > > there was some earlier discussion on adding a new encoding for better
> > > > compression of FP32 and FP64 data.
> > > >
> > > >
> > > > The pull request which extends the format is here:
> > > > https://github.com/apache/parquet-format/pull/144
> > > > The change already has one approval, from Zoltan.
> > > >
> > > >
> > > > The results from an investigation on compression ratio and speed with
> > > > the new encoding vs other encodings are available here:
> > > > https://github.com/martinradev/arrow-fp-compression-bench
> > > > For many tests the new encoding achieves a better compression ratio
> > > > and, in some cases, better speed. The improvements in compression
> > > > speed come from the fact that the new format can potentially lead to
> > > > faster parsing for some compressors like GZIP.
> > > >
> > > >
> > > > An earlier report which examines other FP compressors (fpzip, spdp,
> > > > fpc, zfp, sz) and new potential encodings is available here:
> > > > https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view?usp=sharing
> > > > The report also covers lossy compression but the BYTE_STREAM_SPLIT
> > > > encoding focuses only on lossless compression.
> > > >
> > > >
> > > > Can we have a vote?
> > > >
> > > >
> > > > Regards,
> > > >
> > > > Martin
> > > >
> > > >
> > >
> >
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


-- 
Ryan Blue
Software Engineer
Netflix

Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet

Posted by "Radev, Martin" <ma...@tum.de>.
Hello Ryan, Wes and other parquet devs,


thanks for the response. I was away on vacation and that's why I am answering just now.


> whether you think adding run-length encoding to any of the byte streams would be beneficial before applying Zstd.
The short answer is "only for some cases, but it will make it worse in both compression ratio and speed for others".

Our initial investigation also split the sign, exponent and mantissa into separate streams.

The encoding was the following assuming 32-bit IEEE754:

- stream of sign bits

- stream of exponent bits. Conveniently the exponent for a 32-bit IEEE754 number is 8 bits.

- separate the remaining 23 bits into three streams of 8, 8, and 7 bits. An extra zero bit is added to the block which has only seven bits. This was done since zstd, zlib, etc. work at a byte granularity and we would want repetitions to happen at such a granularity.

For 64-bit IEEE754 even more padding has to be added since the exponent is 11 bits and the mantissa is 52 bits. Thus, we have to add 5 more exponent bits and 4 more mantissa bits to keep repetitions at a byte granularity. My original report shows results for when the floating-point values are split at a component granularity. Report is here: https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view
Results are just slightly better in terms of compression ratio for some tests, but compression and decompression speeds are, as expected, worse. The reason is that splitting a value is somewhat more complex. We need to keep a stream of bits for the signs, keep track of when a byte in the stream is exhausted, do bit manipulation to extract components, etc. This is also the reason why I preferred to go with the byte-wise decomposition of the values. It's faster and the compression ratio is just slightly worse for some of the tests.
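
For concreteness, here is a minimal C++ sketch of the byte-wise
decomposition (illustration only; the function names are made up and this
is not the actual parquet-cpp implementation):

#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Scatter the k-th byte of every value into the k-th stream. For n float
// values the output is 4 streams of n bytes each, laid out back to back:
// all first bytes, then all second bytes, and so on.
std::vector<uint8_t> ByteStreamSplitEncode(const float* values, size_t n) {
    std::vector<uint8_t> out(n * sizeof(float));
    for (size_t i = 0; i < n; ++i) {
        uint8_t bytes[sizeof(float)];
        std::memcpy(bytes, &values[i], sizeof(float));
        for (size_t k = 0; k < sizeof(float); ++k) {
            out[k * n + i] = bytes[k];  // stream k, element i
        }
    }
    return out;
}

// Decoding is the exact inverse: gather byte k of value i from k * n + i.
void ByteStreamSplitDecode(const uint8_t* in, size_t n, float* values) {
    for (size_t i = 0; i < n; ++i) {
        uint8_t bytes[sizeof(float)];
        for (size_t k = 0; k < sizeof(float); ++k) {
            bytes[k] = in[k * n + i];
        }
        std::memcpy(&values[i], bytes, sizeof(float));
    }
}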


Using RLE for the sign, exponents and the top-most mantissa bytes can help when data is repetitive and make it worse for other data. I suppose using one of the compressors yields a better compression ratio on average. Also, this can again make encoding and decoding slower.


The design of the BYTE_STREAM_SPLIT encoding had two goals in mind:

- It would only make data more compressible and leave compression to the codec in use.
  This leaves the complexity to the codec and choice of speed/compression ratio to the user.

- It should be fast.
  The encoding adds an extra step before compression, so it should introduce as little latency as possible.
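
As a usage sketch of the first goal (this assumes the ByteStreamSplitEncode
helper from the sketch above and libzstd; again illustrative only, with
error handling omitted):

#include <zstd.h>

// The encoding only rearranges bytes; the codec does the actual compression.
std::vector<uint8_t> SplitThenCompress(const float* values, size_t n, int level) {
    std::vector<uint8_t> split = ByteStreamSplitEncode(values, n);
    std::vector<uint8_t> dst(ZSTD_compressBound(split.size()));
    size_t written = ZSTD_compress(dst.data(), dst.size(),
                                   split.data(), split.size(), level);
    dst.resize(written);  // a real implementation would check ZSTD_isError
    return dst;
}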

@Wes, can you have a look?

More opinions are welcome.

If you have floating point data available, I would be very happy to examine whether this approach offers benefit for you.


Regards,

Martin

________________________________
From: Ryan Blue <rb...@netflix.com.INVALID>
Sent: Tuesday, September 3, 2019 11:51:46 PM
To: Parquet Dev
Cc: Raoofy, Amir; Karlstetter, Roman
Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet

Hi Martin,

Thanks for taking a look at this! I agree that the approach here looks
promising. We've had occasional requests for lossy floating point
compression in the past, so it would be good to add this.

I did some work in this area a few years ago that is similar and I'd like
to hear what you think about that approach compared to this one. That work
was based on the same observation, that the main problem is the mantissa,
while exponents tend to compress well. What I did was take the exponent and
mantissa and encode each separately, like the component encoding in your
test. But to encode each stream, I used Parquet's RLE encoder instead of
just applying compression. This seemed to work well for exponents and sign
bits, but probably isn't worth the cost for mantissa bits. It could also be
interesting to test a separate stream for sign bits.

I guess what I'd like to hear your take on is whether you think adding
run-length encoding to any of the byte streams would be beneficial before
applying Zstd.

Thanks!

rb

On Tue, Sep 3, 2019 at 12:30 PM Wes McKinney <we...@gmail.com> wrote:

> I'm interested in this. I have been busy the last couple of weeks so have
> not been able to take a closer look. I will try to give some feedback this
> week.
>
> Thanks
>
> On Tue, Sep 3, 2019, 2:17 PM Radev, Martin <ma...@tum.de> wrote:
>
> > Hello all,
> >
> >
> > thank you Julien for the interest.
> >
> >
> > Could other members of the Apache Parquet community share their opinions?
> >
> > Do you have your own data which you would like to use for testing the new
> > encoding?
> >
> >
> > Regards,
> >
> > Martin
> >
> > ________________________________
> > From: Julien Le Dem <ju...@wework.com.INVALID>
> > Sent: Friday, August 30, 2019 2:38:37 AM
> > To: dev@parquet.apache.org
> > Cc: Raoofy, Amir; Karlstetter, Roman
> > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
> >
> > This looks promising to me. At first glance it seems to combine
> > simplicity and efficiency.
> > I'd like to hear more from other members of the PMC.
> >
> > On Tue, Aug 27, 2019 at 5:30 AM Radev, Martin <ma...@tum.de>
> wrote:
> >
> > > Dear all,
> > >
> > >
> > > there was some earlier discussion on adding a new encoding for better
> > > compression of FP32 and FP64 data.
> > >
> > >
> > > The pull request which extends the format is here:
> > > https://github.com/apache/parquet-format/pull/144
> > > The change already has one approval, from Zoltan.
> > >
> > >
> > > The results from an investigation on compression ratio and speed with
> > > the new encoding vs other encodings are available here:
> > > https://github.com/martinradev/arrow-fp-compression-bench
> > > For many tests the new encoding achieves a better compression ratio
> > > and, in some cases, better speed. The improvements in compression speed
> > > come from the fact that the new format can potentially lead to faster
> > > parsing for some compressors like GZIP.
> > >
> > >
> > > An earlier report which examines other FP compressors (fpzip, spdp,
> > > fpc, zfp, sz) and new potential encodings is available here:
> > > https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view?usp=sharing
> > > The report also covers lossy compression but the BYTE_STREAM_SPLIT
> > > encoding focuses only on lossless compression.
> > >
> > >
> > > Can we have a vote?
> > >
> > >
> > > Regards,
> > >
> > > Martin
> > >
> > >
> >
>


--
Ryan Blue
Software Engineer
Netflix

Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Hi Martin,

Thanks for taking a look at this! I agree that the approach here looks
promising. We've had occasional requests for lossy floating point
compression in the past, so it would be good to add this.

I did some work in this area a few years ago that is similar and I'd like
to hear what you think about that approach compared to this one. That work
was based on the same observation, that the main problem is the mantissa,
while exponents tend to compress well. What I did was take the exponent and
mantissa and encode each separately, like the component encoding in your
test. But to encode each stream, I used Parquet's RLE encoder instead of
just applying compression. This seemed to work well for exponents and sign
bits, but probably isn't worth the cost for mantissa bits. It could also be
interesting to test a separate stream for sign bits.
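
For the sign bits, a separate stream could be as simple as the following
sketch (a hypothetical helper for illustration; not an existing Parquet
API):

#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Pack one sign bit per FP32 value, eight values per byte. All-positive
// data then becomes a run of zero bytes, which an RLE scheme or a
// general-purpose codec handles trivially.
std::vector<uint8_t> PackSignBits(const float* v, size_t n) {
    std::vector<uint8_t> bits((n + 7) / 8, 0);
    for (size_t i = 0; i < n; ++i) {
        uint32_t u;
        std::memcpy(&u, &v[i], sizeof(u));  // sign is bit 31
        bits[i / 8] |= static_cast<uint8_t>((u >> 31) << (i % 8));
    }
    return bits;
}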

I guess what I'd like to hear your take on is whether you think adding
run-length encoding to any of the byte streams would be beneficial before
applying Zstd.

Thanks!

rb

On Tue, Sep 3, 2019 at 12:30 PM Wes McKinney <we...@gmail.com> wrote:

> I'm interested in this. I have been busy the last couple of weeks so have
> not been able to take a closer look. I will try to give some feedback this
> week.
>
> Thanks
>
> On Tue, Sep 3, 2019, 2:17 PM Radev, Martin <ma...@tum.de> wrote:
>
> > Hello all,
> >
> >
> > thank you Julien for the interest.
> >
> >
> > Could other members of the Apache Parquet community share their opinions?
> >
> > Do you have your own data which you would like to use for testing the new
> > encoding?
> >
> >
> > Regards,
> >
> > Martin
> >
> > ________________________________
> > From: Julien Le Dem <ju...@wework.com.INVALID>
> > Sent: Friday, August 30, 2019 2:38:37 AM
> > To: dev@parquet.apache.org
> > Cc: Raoofy, Amir; Karlstetter, Roman
> > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
> >
> > This looks promising to me. At first glance it seems to combine
> > simplicity and efficiency.
> > I'd like to hear more from other members of the PMC.
> >
> > On Tue, Aug 27, 2019 at 5:30 AM Radev, Martin <ma...@tum.de>
> wrote:
> >
> > > Dear all,
> > >
> > >
> > > there was some earlier discussion on adding a new encoding for better
> > > compression of FP32 and FP64 data.
> > >
> > >
> > > The pull request which extends the format is here:
> > > https://github.com/apache/parquet-format/pull/144
> > > The change already has one approval, from Zoltan.
> > >
> > >
> > > The results from an investigation on compression ratio and speed with
> > > the new encoding vs other encodings are available here:
> > > https://github.com/martinradev/arrow-fp-compression-bench
> > > For many tests the new encoding achieves a better compression ratio
> > > and, in some cases, better speed. The improvements in compression speed
> > > come from the fact that the new format can potentially lead to faster
> > > parsing for some compressors like GZIP.
> > >
> > >
> > > An earlier report which examines other FP compressors (fpzip, spdp,
> > > fpc, zfp, sz) and new potential encodings is available here:
> > > https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view?usp=sharing
> > > The report also covers lossy compression but the BYTE_STREAM_SPLIT
> > > encoding focuses only on lossless compression.
> > >
> > >
> > > Can we have a vote?
> > >
> > >
> > > Regards,
> > >
> > > Martin
> > >
> > >
> >
>


-- 
Ryan Blue
Software Engineer
Netflix

Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet

Posted by Wes McKinney <we...@gmail.com>.
I'm interested in this. I have been busy the last couple of weeks so have
not been able to take a closer look. I will try to give some feedback this
week.

Thanks

On Tue, Sep 3, 2019, 2:17 PM Radev, Martin <ma...@tum.de> wrote:

> Hello all,
>
>
> thank you Julien for the interest.
>
>
> Could other members of the Apache Parquet community share their opinions?
>
> Do you have your own data which you would like to use for testing the new
> encoding?
>
>
> Regards,
>
> Martin
>
> ________________________________
> From: Julien Le Dem <ju...@wework.com.INVALID>
> Sent: Friday, August 30, 2019 2:38:37 AM
> To: dev@parquet.apache.org
> Cc: Raoofy, Amir; Karlstetter, Roman
> Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
>
> This looks promising to me. At first glance it seems to combine
> simplicity and efficiency.
> I'd like to hear more from other members of the PMC.
>
> On Tue, Aug 27, 2019 at 5:30 AM Radev, Martin <ma...@tum.de> wrote:
>
> > Dear all,
> >
> >
> > there was some earlier discussion on adding a new encoding for better
> > compression of FP32 and FP64 data.
> >
> >
> > The pull request which extends the format is here:
> > https://github.com/apache/parquet-format/pull/144
> > The change already has one approval, from Zoltan.
> >
> >
> > The results from an investigation on compression ratio and speed with the
> > new encoding vs other encodings are available here:
> > https://github.com/martinradev/arrow-fp-compression-bench
> > For many tests the new encoding achieves a better compression ratio
> > and, in some cases, better speed. The improvements in compression speed
> > come from the fact that the new format can potentially lead to faster
> > parsing for some compressors like GZIP.
> >
> >
> > An earlier report which examines other FP compressors (fpzip, spdp, fpc,
> > zfp, sz) and new potential encodings is available here:
> > https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view?usp=sharing
> > The report also covers lossy compression but the BYTE_STREAM_SPLIT
> > encoding focuses only on lossless compression.
> >
> >
> > Can we have a vote?
> >
> >
> > Regards,
> >
> > Martin
> >
> >
>