Posted to dev@parquet.apache.org by "Radev, Martin" <ma...@tum.de> on 2019/07/01 20:56:45 UTC

Re: Floating point data compression for Apache Parquet

Hello folks,


thank you for your input.


I am finished with my investigation regarding introducing special support for FP compression in Apache Parquet.

My report also includes an investigation of lossy compressors, though there are still some things to be cleared up.


Report: https://drive.google.com/open?id=1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv


Sections 3, 4, 5 and 6 are the most important to go over.


Let me know if you have any questions or concerns.


Regards,

Martin

________________________________
From: Zoltan Ivanfi <zi...@cloudera.com.INVALID>
Sent: Thursday, June 13, 2019 2:16:56 PM
To: Parquet Dev
Cc: Raoofy, Amir; Karlstetter, Roman
Subject: Re: Floating point data compression for Apache Parquet

Hi Martin,

Thanks for your interest in improving Parquet. Efficient encodings are
really important in a big data file format, so this topic is
definitely worth researching and personally I am looking forward to
your report. Whether to add any new encodings to Parquet, however, can
not be answered until we see the results of your findings.

You mention two paths. One has very small computational overhead but
does not provide significant space savings. The other provides
significant space savings but at the price of a significant
computational overhead. While purely based on these properties both of
them seem "balanced" (one is small effort, small gain; the other is
large effort, large gain) and therefore sound reasonable options, I
would argue that one should also consider development costs, code
complexity and compatibility implications when deciding about whether
a new feature is worth implementing.

Adding a new encoding or compression to Parquet complicates the
specification of the file format and requires implementing it in every
language binding of the format, which is not only a considerable
effort, but is also error-prone (see LZ4 for an example, which was
added to both the Java and the C++ implementation of Parquet, yet they
are incompatible with each other). And lack of support is not only a
minor annoyance in this case: if one is forced to use an older reader
that does not support the new encoding yet (or a language binding that
does not support it at all), the data simply can not be read.

In my opinion, no matter how low the computational overhead of a new
encoding is, if it does not provide significant gains, then the
specification clutter, implementation costs and the potential of
compatibility problems greatly outweigh its advantages. For this
reason, I would say that only encodings that provide significant gains
are worth adding. As far as I am concerned, such a new encoding would
be a welcome addition to Parquet.

Thanks,

Zoltan

On Wed, Jun 12, 2019 at 11:10 PM Radev, Martin <ma...@tum.de> wrote:
>
> Dear all,
>
> thank you for your work on the Apache Parquet format.
>
> We are a group of students at the Technical University of Munich who would like to extend the available compression and encoding options for 32-bit and 64-bit floating point data in Apache Parquet.
> The current encodings and compression algorithms offered in Apache Parquet are heavily specialized towards integer and text data.
> Thus there is an opportunity to reduce both I/O throughput requirements and space requirements for handling floating point data by selecting a specialized compression algorithm.
>
> Currently, I am investigating the available literature and publicly available fp compressors. As part of this investigation, I am writing a report on my findings - the available algorithms, their strengths and weaknesses, compression rates, compression speeds and decompression speeds, and licenses. Once finished, I will share the report with you and propose which ones IMO are good candidates for Apache Parquet.
>
> The goal is to add a solution for both 32-bit and 64-bit fp types. I think that it would be beneficial to offer at the very least two distinct paths. The first one should offer fast compression and decompression speed with some, but not significant, savings in space. The second one should offer slower compression and decompression speed but with a decent compression rate. Both would be lossless. A lossy path will be investigated further and discussed with the community.
>
> If I get approval from you – the developers – I can continue by adding support for the new encoding/compression options in the C++ implementation of Apache Parquet in Apache Arrow.
>
> Please let me know what you think of this idea and whether you have any concerns with the plan.
>
> Best regards,
> Martin Radev
>

Re: Floating point data compression for Apache Parquet

Posted by "Driesprong, Fokko" <fo...@driesprong.frl>.
Thank you Roman. I'm looking forward to the proposal.

Regarding the process, there is more information on the Apache page:
https://www.apache.org/foundation/voting.html

Cheers, Fokko Driesprong


Op di 16 jul. 2019 om 16:03 schreef Roman Karlstetter <
roman.karlstetter@gmail.com>:

>  Ok, thanks for the clarification.
>
> Having a prototype implementation ready (in the form of a PR for arrow,
> e.g.) to support the argument in the discussion doesn't hurt, I guess.
>
> Just for completeness, regarding the motivation for this work: as Martin
> wrote, he is a student at the chair for computer architecture and parallel
> systems (CAPS, https://www.caps.in.tum.de/) at Technical University of
> Munich. We (the chair) are working on a project storing large amounts of
> sensor data (floating point) in parquet, and we're trying to improve
> storage efficiency for this type of data. Having a more efficient encoding
> for these floats is a first step here, but we have further ideas, like
> lossy compression, so we should definitely start such a discussion thread.
>
> Roman
>
> Am Di., 16. Juli 2019 um 15:59 Uhr schrieb Zoltan Ivanfi
> <zi...@cloudera.com.invalid>:
>
> > Hi Wes,
> >
> > As far as I'm concerned, the PR was not close to being merged. I
> > pointed out in an earlier e-mail of this thread that we need to
> > involve more community members for this change. I approved the PR to
> > indicate that it looks good to me but was not going to merge it
> > without consensus.
> >
> > Br,
> >
> > Zoltan
> >
> > On Tue, Jul 16, 2019 at 1:45 PM Wes McKinney <we...@gmail.com>
> wrote:
> > >
> > > I think first you need to create a [DISCUSS] thread whose subject
> > > clearly indicates that you are proposing to modify the Parquet format.
> > > You should link to the PR with the changes to parquet-format. Then
> > > wait for feedback to collect.
> > >
> > > Frankly, I was surprised to see a PR close to being merged based
> > > largely on a 2-way discussion between Martin and Zoltan (unless I
> > > missed something)
> > >
> > > On Tue, Jul 16, 2019 at 2:11 AM Roman Karlstetter
> > > <ro...@gmail.com> wrote:
> > > >
> > > > Hi Wes,
> > > >
> > > > what would be the formal or informal requirements for such a vote to
> > pass?
> > > > What is needed in terms of code and specification before we can start
> > such
> > > > a vote?
> > > >
> > > > Roman
> > > >
> > > > Am Fr., 12. Juli 2019 um 17:07 Uhr schrieb Wes McKinney <
> > wesmckinn@gmail.com
> > > > >:
> > > >
> > > > > I think we need to vote to make any changes to the Parquet format.
> > New
> > > > > features carry a heavy responsibility
> > > > >
> > > > > On Fri, Jul 12, 2019 at 10:04 AM Michael Heuer <he...@gmail.com>
> > wrote:
> > > > > >
> > > > > > Hello Martin,
> > > > > >
> > > > > > I'm willing to run some tests at scale on our genomics data when
> a
> > > > > parquet-mr pull request for the Java implementation is ready.
> > > > > >
> > > > > > Cheers,
> > > > > >
> > > > > >    michael
> > > > > >
> > > > > >
> > > > > > > On Jul 11, 2019, at 1:09 PM, Radev, Martin <
> martin.radev@tum.de>
> > > > > wrote:
> > > > > > >
> > > > > > > Dear all,
> > > > > > >
> > > > > > >
> > > > > > > I created a Jira issue for the new feature and also made a pull
> > > > > request for my patch which extends the format and documentation.
> > > > > > >
> > > > > > > Jira issue: https://issues.apache.org/jira/browse/PARQUET-1622
> <
> > > > > https://issues.apache.org/jira/browse/PARQUET-1622>
> > > > > > > Pull request:
> https://github.com/apache/parquet-format/pull/144
> > <
> > > > > https://github.com/apache/parquet-format/pull/144>
> > > > > > >
> > > > > > >
> > > > > > > I also have a WIP patch for adding the "BYTE_STREAM_SPLIT"
> > encoding to
> > > > > parquet-cpp within Apache Arrow.
> > > > > > >
> > > > > > >
> > > > > > > How should we proceed?
> > > > > > >
> > > > > > > It would be great to get feedback from other community members.
> > > > > > >
> > > > > > >
> > > > > > > Regards,
> > > > > > >
> > > > > > > Martin
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > ________________________________
> > > > > > > From: Radev, Martin <martin.radev@tum.de <mailto:
> > martin.radev@tum.de>>
> > > > > > > Sent: Tuesday, July 9, 2019 1:01:25 AM
> > > > > > > To: Zoltan Ivanfi
> > > > > > > Cc: Parquet Dev; Raoofy, Amir; Karlstetter, Roman
> > > > > > > Subject: Re: Floating point data compression for Apache Parquet
> > > > > > >
> > > > > > > Hello Zoltan,
> > > > > > >
> > > > > > >
> > > > > > > I can provide a C++ and Java implementation for the encoder.
> > > > > > >
> > > > > > > The encoder/decoder is very small, and naturally I have to add
> > tests.
> > > > > > >
> > > > > > > I expect the biggest hurdle would be setting up the environment
> > and
> > > > > reading through the developer guides.
> > > > > > >
> > > > > > >
> > > > > > > I will write my patches for Apache Arrow and for Apache Parquet
> > and
> > > > > send them for review.
> > > > > > >
> > > > > > > After getting them in, I can continue with the Java
> > implementation.
> > > > > > >
> > > > > > > Let me know if you have any concerns.
> > > > > > >
> > > > > > >
> > > > > > > It would be great to get an opinion from other Parquet
> > contributors : )
> > > > > > >
> > > > > > >
> > > > > > > Thank you for the feedback!
> > > > > > >
> > > > > > >
> > > > > > > Best regards,
> > > > > > >
> > > > > > > Martin
> > > > > > >
> > > > > > > ________________________________
> > > > > > > From: Zoltan Ivanfi <zi...@cloudera.com>
> > > > > > > Sent: Monday, July 8, 2019 5:06:30 PM
> > > > > > > To: Radev, Martin
> > > > > > > Cc: Parquet Dev; Raoofy, Amir; Karlstetter, Roman
> > > > > > > Subject: Re: Floating point data compression for Apache Parquet
> > > > > > >
> > > > > > > Hi Martin,
> > > > > > >
> > > > > > > I agree that bs_zstd would be a good place to start. Regarding
> > the
> > > > > choice of language, Java, C++ and Python are your options. As far
> as
> > I
> > > > > know, the Java implementation of Parquet has more users from the
> > business
> > > > > sector, where decimal is preferred over floating point data types.
> > It is
> > > > > also much more tightly integrated with the Hadoop ecosystem (it is
> > even
> > > > > called parquet-mr, as in MapReduce), making for a steeper learning
> > curve.
> > > > > > >
> > > > > > > The Python and C++ language bindings have more scientific
> users,
> > so
> > > > > users of these may be more interested in the new encodings. Python
> > is a
> > > > > good language for rapid prototyping as well, but the Python binding
> > of
> > > > > Parquet may use the C++ library under the hood, I'm not sure (I'm
> > more
> > > > > familiar with the Java implementation). In any case, there are at
> > least two
> > > > > Python bindings: pyarrow and fastparquet.
> > > > > > >
> > > > > > > I think we can extend the format before the actual
> > implementations are
> > > > > ready, provided that the specification is clear and nobody objects
> to
> > > > > adding it to the format. For this, I would wait for the opinion of
> a
> > few
> > > > > more Parquet developers first, since changes to the format that are
> > only
> > > > > supported by a single committer usually have a hard time getting
> > into the
> > > > > spec. Additionally, could you please clarify which language
> bindings
> > you
> > > > > plan to implement yourself? This will help the developers of the
> > different
> > > > > language bindings assess how much work they will have to do to add
> > support.
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > > Zoltan
> > > > > > >
> > > > > > >
> > > > > > > On Fri, Jul 5, 2019 at 4:34 PM Radev, Martin <
> > martin.radev@tum.de
> > > > > <ma...@tum.de>> wrote:
> > > > > > >
> > > > > > > Hello Zoltan and Parquet devs,
> > > > > > >
> > > > > > >
> > > > > > > do you think it would be appropriate to start with a Parquet
> > prototype
> > > > > from my side?
> > > > > > >
> > > > > > > I suspect that 'bs_zstd' would be the simplest to
> > > > > integrate and from the report we can see an improvement in both
> > ratio and
> > > > > speed.
> > > > > > >
> > > > > > >
> > > > > > > Do you think that Apache Arrow is an appropriate place to
> > prototype
> > > > > the extension of the format?
> > > > > > >
> > > > > > > Do you agree that the enum field 'Encodings' is a suitable
> place
> > to
> > > > > add the 'Byte stream-splitting transformation'? In that way it
> could
> > be
> > > > > used with any of the other supported compressors.
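
For illustration, a minimal sketch in C++ of the byte stream-splitting idea (an illustration only, not the actual parquet-cpp patch; the function names are made up here). For FLOAT and DOUBLE columns it simply transposes the value bytes into per-byte streams before the buffer is handed to whichever compressor is configured:

    #include <cstdint>
    #include <cstring>
    #include <vector>

    // Split the K bytes of each value into K separate streams laid out one
    // after another, so a general-purpose compressor sees long runs of
    // similar bytes (sign/exponent bytes end up next to each other).
    template <typename T>
    std::vector<uint8_t> byte_stream_split_encode(const std::vector<T>& values) {
      constexpr size_t K = sizeof(T);
      const size_t n = values.size();
      std::vector<uint8_t> out(n * K);
      const uint8_t* src = reinterpret_cast<const uint8_t*>(values.data());
      for (size_t i = 0; i < n; ++i)
        for (size_t b = 0; b < K; ++b)
          out[b * n + i] = src[i * K + b];  // byte b of value i -> stream b
      return out;
    }

    // Inverse transformation: gather byte b of value i back out of stream b.
    template <typename T>
    std::vector<T> byte_stream_split_decode(const std::vector<uint8_t>& in) {
      constexpr size_t K = sizeof(T);
      const size_t n = in.size() / K;
      std::vector<T> values(n);
      uint8_t* dst = reinterpret_cast<uint8_t*>(values.data());
      for (size_t i = 0; i < n; ++i)
        for (size_t b = 0; b < K; ++b)
          dst[i * K + b] = in[b * n + i];
      return values;
    }

If something along these lines ends up in the spec, the transformation itself would not shrink anything; the gain would come from the compressor seeing the rearranged bytes, which is why an 'Encodings' entry that can be combined with any of the supported codecs seems to fit.
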
> > > > > > >
> > > > > > > It might be best to also add a Java implementation of the
> > > > > transformation. Would the project 'parquet-mr' be a good place?
> > > > > > >
> > > > > > >
> > > > > > > Would the workflow be such that I write my patches, we verify
> for
> > > > > correctness, get reviews, merge them AND just then we make
> > adjustments to
> > > > > the Apache Parquet spec?
> > > > > > >
> > > > > > >
> > > > > > > Any piece of advice is welcome!
> > > > > > >
> > > > > > >
> > > > > > > Regards,
> > > > > > >
> > > > > > > Martin
> > > > > > >
> > > > > > >
> > > > > > > ________________________________
> > > > > > > From: Zoltan Ivanfi <zi...@cloudera.com>>
> > > > > > > Sent: Friday, July 5, 2019 4:21:39 PM
> > > > > > > To: Radev, Martin
> > > > > > > Cc: Parquet Dev; Raoofy, Amir; Karlstetter, Roman
> > > > > > > Subject: Re: Floating point data compression for Apache Parquet
> > > > > > >
> > > > > > >
> > > > > > > Hi Martin,
> > > > > > >
> > > > > > > Thanks for the explanations, makes sense. Nice work!
> > > > > > >
> > > > > > > Br,
> > > > > > >
> > > > > > > Zoltan
> > > > > > >
> > > > > > > On Thu, Jul 4, 2019 at 12:22 AM Radev, Martin <
> > martin.radev@tum.de
> > > > > <ma...@tum.de>> wrote:
> > > > > > >
> > > > > > > Hello Zoltan,
> > > > > > >
> > > > > > >
> > > > > > >> Is data pre-loaded to RAM before making the measurements?
> > > > > > > Yes, the file is read into physical memory.
> > > > > > >
> > > > > > > For mmap-ed files, read from external storage, I would expect,
> > but not
> > > > > 100% sure, that the IO-overhead would be big enough that all
> > algorithms
> > > > > compress at roughly the same speed.
> > > > > > >
> > > > > > >
> > > > > > >> In "Figure 3: Decompression speed in MB/s", is data size
> > measured
> > > > > before or after uncompression?
> > > > > > >
> > > > > > >> In "Figure 4: Compression speed in MB/s", is data size
> measured
> > > > > before or after compression?
> > > > > > > For both the reported result is "size of the original file /
> > time to
> > > > > compress or decompress".
> > > > > > >
> > > > > > >> According to "Figure 3: Decompression speed in MB/s",
> > decompression
> > > > > of bs_zstd is almost twice as fast as plain zstd. Do you know what
> > causes
> > > > > this massive speed improvement?
> > > > > > >
> > > > > > > I do not know all of the details. As you mentioned, the written
> > out
> > > > > data is less, which could potentially lead to improvement in speed
> as
> > less
> > > > > data has to be written out to memory during compression or read
> from
> > memory
> > > > > during decompression.
> > > > > > >
> > > > > > > Another thing to consider is that ZSTD uses different
> techniques
> > to
> > > > > compress a block of data - "raw", "RLE", "Huffman coding",
> "Treeless
> > > > > coding".
> > > > > > >
> > > > > > > I expect that "Huffman coding" is more costly than "RLE" and I
> > also
> > > > > expect that "RLE" to be applicable for the majority of the sign
> bits
> > thus
> > > > > leading to a performance win for when the transformation is
> applied.
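
As a quick illustration of the sign/exponent argument (my own toy example, assuming IEEE 754 doubles on a little-endian machine): same-sign values in a similar range share their most significant byte, so after splitting, that whole stream is nearly constant and trivially RLE-friendly.

    #include <cstdint>
    #include <cstdio>
    #include <cstring>

    int main() {
      // The high byte holds the sign bit and the top exponent bits, so it is
      // identical for these values and compresses to almost nothing.
      const double samples[] = {101.5, 98.25, 130.0, 77.75};
      for (double d : samples) {
        uint8_t bytes[sizeof(double)];
        std::memcpy(bytes, &d, sizeof(double));
        std::printf("%8.2f -> high byte 0x%02x\n", d,
                    static_cast<unsigned>(bytes[7]));  // prints 0x40 for each
      }
      return 0;
    }
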
> > > > > > >
> > > > > > >
> > > > > > > I also expect that zstd has to do some form of "optimal
> parsing"
> > to
> > > > > decide how to process the input in order to compress it well. This
> is
> > > > > something every wanna-be-good LZ-like compressor has to do (
> > > > >
> >
> https://martinradev.github.io/jekyll/update/2019/05/29/writing-a-pe32-x86-exe-packer.html
> > > > > ,
> > > > >
> >
> http://cbloomrants.blogspot.com/2011/10/10-24-11-lz-optimal-parse-with-star.html
> > > > > ). It might be that the transformed input is somehow easier, which
> > leads to
> > > > > faster compression rates, and also easier to decompress, which
> > leads to
> > > > > faster decompression rates.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > I used this as a reference:
> > > > > https://www.rfc-editor.org/rfc/pdfrfc/rfc8478.txt.pdf. I am not
> > familiar
> > > > > with ZSTD in particular.
> > > > > > >
> > > > > > >
> > > > > > > I also checked that the majority of the time is spent in zstd.
> > > > > > >
> > > > > > > Example run for msg_sweep3d.dp using zstd at level 1.
> > > > > > > - Transformation during compression: 0.086s, ZSTD compress on
> > > > > transformed data: 0.08s
> > > > > > >
> > > > > > > - regular ZSTD: 0.34s
> > > > > > > - ZSTD decompress from compressed transformed data: 0.067s,
> > > > > Transformation during decompression: 0.021s
> > > > > > > - regular ZSTD decompress: 0.24s
> > > > > > >
> > > > > > >
> > > > > > > Example run for msg_sweep3d.dp using zstd at level 20.
> > > > > > >
> > > > > > > - Transformation during compression: 0.083s, ZSTD compress on
> > > > > transformed data: 14.35s
> > > > > > >
> > > > > > > - regular ZSTD: 183s
> > > > > > > - ZSTD decompress from compressed transformed data: 0.075s,
> > > > > Transformation during decompression: 0.022s
> > > > > > > - regular ZSTD decompress: 0.31s
> > > > > > > Here it's clear that the transformed input is easier to parse
> > > > > (compress). Maybe the blocks are also of a type that takes less time
> > to
> > > > > decompress.
> > > > > > >
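
For anyone wanting to reproduce this kind of comparison, a rough sketch of such a measurement using the one-shot libzstd API (a hypothetical harness, not the script behind the report's numbers; 'compare' and 'zstd_pack' are made-up names):

    #include <zstd.h>
    #include <chrono>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // One-shot zstd compression of an arbitrary buffer at a given level.
    static std::vector<char> zstd_pack(const void* src, size_t n, int level) {
      std::vector<char> dst(ZSTD_compressBound(n));
      const size_t written = ZSTD_compress(dst.data(), dst.size(), src, n, level);
      if (ZSTD_isError(written)) return {};  // error handling kept minimal here
      dst.resize(written);
      return dst;
    }

    void compare(const std::vector<float>& column, int level) {
      using clock = std::chrono::steady_clock;
      const size_t n = column.size();
      const size_t raw_bytes = n * sizeof(float);
      const uint8_t* src = reinterpret_cast<const uint8_t*>(column.data());

      // Plain zstd on the untouched column.
      auto t0 = clock::now();
      auto plain = zstd_pack(src, raw_bytes, level);
      auto t1 = clock::now();

      // Byte-stream-split first, then the same zstd call (bs_zstd).
      std::vector<uint8_t> split(raw_bytes);
      for (size_t i = 0; i < n; ++i)
        for (size_t b = 0; b < sizeof(float); ++b)
          split[b * n + i] = src[i * sizeof(float) + b];
      auto bs = zstd_pack(split.data(), raw_bytes, level);
      auto t2 = clock::now();

      std::printf("plain zstd: %zu bytes in %.3fs\n", plain.size(),
                  std::chrono::duration<double>(t1 - t0).count());
      std::printf("bs_zstd:    %zu bytes in %.3fs (incl. transform)\n", bs.size(),
                  std::chrono::duration<double>(t2 - t1).count());
    }
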
> > > > > > >> If considering using existing libraries to provide any of the
> > > > > compression algorithms, license compatibility is also an important
> > factor
> > > > > and therefore would be worth mentioning in Section 5.
> > > > > > > This is something I forgot to list. I will get back to you and the
> > other
> > > > > devs with information.
> > > > > > >
> > > > > > > The filter I proposed for lossless compression can be
> integrated
> > > > > without any concerns for a license.
> > > > > > >
> > > > > > >
> > > > > > >> Are any of the investigated strategies applicable to DECIMAL
> > values?
> > > > > > > The lossy compressors SZ and ZFP do not support that out of
> > the
> > > > > box. I could communicate with the SZ developers to come to a
> > decision how
> > > > > this can be added to SZ. An option is to losslessly compress the
> > > > > pre-decimal number and lossily compress the post-decimal number.
> > > > > > >
> > > > > > > For lossless compression, we can apply a similar stream
> splitting
> > > > > technique for decimal types though it might be somewhat more
> complex
> > and I
> > > > > have not really thought about this case.
> > > > > > >
> > > > > > >
> > > > > > > Regards,
> > > > > > >
> > > > > > > Martin
> > > > > > >
> > > > > > > ________________________________
> > > > > > > From: Zoltan Ivanfi <zi...@cloudera.com>>
> > > > > > > Sent: Wednesday, July 3, 2019 6:07:50 PM
> > > > > > > To: Parquet Dev; Radev, Martin
> > > > > > > Cc: Raoofy, Amir; Karlstetter, Roman
> > > > > > > Subject: Re: Floating point data compression for Apache Parquet
> > > > > > >
> > > > > > > Hi Martin,
> > > > > > >
> > > > > > > Thanks for the thorough investigation, very nice report. I
> would
> > have
> > > > > a few questions:
> > > > > > >
> > > > > > > - Is data pre-loaded to RAM before making the measurements?
> > > > > > >
> > > > > > > - In "Figure 3: Decompression speed in MB/s", is data size
> > measured
> > > > > before or after uncompression?
> > > > > > >
> > > > > > > - In "Figure 4: Compression speed in MB/s", is data size
> measured
> > > > > before or after compression?
> > > > > > >
> > > > > > > - According to "Figure 3: Decompression speed in MB/s",
> > decompression
> > > > > of bs_zstd is almost twice as fast as plain zstd. Do you know what
> > causes
> > > > > this massive speed improvement? Based on the description provided
> in
> > > > > section 3.2, bs_zstd uses the same zstd compression with an extra
> > step of
> > > > > splitting/combining streams. Since this is extra work, I would have
> > > > > expected bs_zstd to be slower than pure zstd, unless the compressed
> > data
> > > > > becomes so much smaller that it radically improves data access
> times.
> > > > > However, according to "Figure 2: Compression ratio", bs_zstd
> achieves
> > > > > "only" 23% better compression than plain zstd, which can not be the
> > reason
> > > > > for the 2x speed-up in itself.
> > > > > > >
> > > > > > > - If considering using existing libraries to provide any of the
> > > > > compression algorithms, license compatibility is also an important
> > factor
> > > > > and therefore would be worth mentioning in Section 5.
> > > > > > >
> > > > > > > - Are any of the investigated strategies applicable to DECIMAL
> > values?
> > > > > Since floating point values and calculations have an inherent
> > inaccuracy,
> > > > > the DECIMAL type is much more important for storing financial data,
> > which
> > > > > is one of the main use cases of Parquet.
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > > Zoltan
> > > > > > >
> > > > > > > On Mon, Jul 1, 2019 at 10:57 PM Radev, Martin <
> > martin.radev@tum.de
> > > > > <ma...@tum.de>> wrote:
> > > > > > > Hello folks,
> > > > > > >
> > > > > > >
> > > > > > > thank you for your input.
> > > > > > >
> > > > > > >
> > > > > > > I am finished with my investigation regarding introducing
> special
> > > > > support for FP compression in Apache Parquet.
> > > > > > >
> > > > > > > My report also includes an investigation of lossy compressors
> > though
> > > > > there are still some things to be cleared out.
> > > > > > >
> > > > > > >
> > > > > > > Report:
> > > > > https://drive.google.com/open?id=1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv
> > > > > > >
> > > > > > >
> > > > > > > Sections 3 4 5 6 are the most important to go over.
> > > > > > >
> > > > > > >
> > > > > > > Let me know if you have any questions or concerns.
> > > > > > >
> > > > > > >
> > > > > > > Regards,
> > > > > > >
> > > > > > > Martin
> > > > > > >
> > > > > > > ________________________________
> > > > > > > From: Zoltan Ivanfi <zi...@cloudera.com.INVALID>
> > > > > > > Sent: Thursday, June 13, 2019 2:16:56 PM
> > > > > > > To: Parquet Dev
> > > > > > > Cc: Raoofy, Amir; Karlstetter, Roman
> > > > > > > Subject: Re: Floating point data compression for Apache Parquet
> > > > > > >
> > > > > > > Hi Martin,
> > > > > > >
> > > > > > > Thanks for your interest in improving Parquet. Efficient
> > encodings are
> > > > > > > really important in a big data file format, so this topic is
> > > > > > > definitely worth researching and personally I am looking
> forward
> > to
> > > > > > > your report. Whether to add any new encodings to Parquet,
> > however, can
> > > > > > > not be answered until we see the results of your findings.
> > > > > > >
> > > > > > > You mention two paths. One has very small computational
> overhead
> > but
> > > > > > > does not provide significant space savings. The other provides
> > > > > > > significant space savings but at the price of a significant
> > > > > > > computational overhead. While purely based on these properties
> > both of
> > > > > > > them seem "balanced" (one is small effort, small gain; the
> other
> > is
> > > > > > > large effort, large gain) and therefore sound reasonable
> > options, I
> > > > > > > would argue that one should also consider development costs,
> code
> > > > > > > complexity and compatibility implications when deciding about
> > whether
> > > > > > > a new feature is worth implementing.
> > > > > > >
> > > > > > > Adding a new encoding or compression to Parquet complicates the
> > > > > > > specification of the file format and requires implementing it
> in
> > every
> > > > > > > language binding of the format, which is not only a
> considerable
> > > > > > > effort, but is also error-prone (see LZ4 for an example, which
> > was
> > > > > > > added to both the Java and the C++ implementation of Parquet,
> > yet they
> > > > > > > are incompatible with each other). And lack of support is not
> > only a
> > > > > > > minor annoyance in this case: if one is forced to use an older
> > reader
> > > > > > > that does not support the new encoding yet (or a language
> > binding that
> > > > > > > does not support it at all), the data simply can not be read.
> > > > > > >
> > > > > > > In my opinion, no matter how low the computational overhead of
> a
> > new
> > > > > > > encoding is, if it does not provide significant gains, then the
> > > > > > > specification clutter, implementation costs and the potential
> of
> > > > > > > compatibility problems greatly outweigh its advantages. For
> this
> > > > > > > reason, I would say that only encodings that provide
> significant
> > gains
> > > > > > > are worth adding. As far as I am concerned, such a new encoding
> > would
> > > > > > > be a welcome addition to Parquet.
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > > Zoltan
> > > > > > >
> > > > > > > On Wed, Jun 12, 2019 at 11:10 PM Radev, Martin <
> > martin.radev@tum.de
> > > > > <ma...@tum.de>> wrote:
> > > > > > >>
> > > > > > >> Dear all,
> > > > > > >>
> > > > > > >> thank you for your work on the Apache Parquet format.
> > > > > > >>
> > > > > > >> We are a group of students at the Technical University of
> > Munich who
> > > > > would like to extend the available compression and encoding options
> > for
> > > > > 32-bit and 64-bit floating point data in Apache Parquet.
> > > > > > >> The current encodings and compression algorithms offered in
> > Apache
> > > > > Parquet are heavily specialized towards integer and text data.
> > > > > > >> Thus there is an opportunity to reduce both I/O throughput
> > > > > requirements and space requirements for handling floating point
> data
> > by
> > > > > selecting a specialized compression algorithm.
> > > > > > >>
> > > > > > >> Currently, I am doing an investigation on the available
> > literature
> > > > > and publicly available fp compressors. In my investigation I am
> > writing a
> > > > > report on my findings - the available algorithms, their strengths
> and
> > > > > weaknesses, compression rates, compression speeds and decompression
> > speeds,
> > > > > and licenses. Once finished I will share the report with you and
> > make a
> > > > > proposal which ones IMO are good candidates for Apache Parquet.
> > > > > > >>
> > > > > > >> The goal is to add a solution for both 32-bit and 64-bit fp
> > types. I
> > > > > think that it would be beneficial to offer at the very least two
> > distinct
> > > > > paths. The first one should offer fast compression and
> decompression
> > speed
> > > > > with some but not significant saving in space. The second one
> should
> > offer
> > > > > slower compression and decompression speed but with a decent
> > compression
> > > > > rate. Both lossless. A lossy path will be investigated further and
> > > > > discussed with the community.
> > > > > > >>
> > > > > > >> If I get an approval from you – the developers – I can
> continue
> > with
> > > > > adding support for the new encoding/compression options in the C++
> > > > > implementation of Apache Parquet in Apache Arrow.
> > > > > > >>
> > > > > > >> Please let me know what you think of this idea and whether you
> > have
> > > > > any concerns with the plan.
> > > > > > >>
> > > > > > >> Best regards,
> > > > > > >> Martin Radev
> > > > > >
> > > > >
> >
>

Re: Floating point data compression for Apache Parquet

Posted by Roman Karlstetter <ro...@gmail.com>.
 Ok, thanks for the clarification.

Having a prototype implementation ready (in the form of a PR for arrow,
e.g.) to support the argument in the discussion doesn't hurt, I guess.

Just for completeness, regarding the motivation for this work: as Martin
wrote, he is a student at the chair for computer architecture and parallel
systems (CAPS, https://www.caps.in.tum.de/) at Technical University of
Munich. We (the chair) are working on a project storing large amounts of
sensor data (floating point) in parquet, and we're trying to improve
storage efficiency for this type of data. Having a more efficient encoding
for these floats is a first step here, but we have further ideas, like
lossy compression, so we should definitely start such a discussion thread.

Roman

Am Di., 16. Juli 2019 um 15:59 Uhr schrieb Zoltan Ivanfi
<zi...@cloudera.com.invalid>:

> Hi Wes,
>
> As far as I'm concerned, the PR was not close to being merged. I
> pointed out in an earlier e-mail of this thread that we need to
> involve more community members for this change. I approved the PR to
> indicate that it looks good to me but was not going to merge it
> without consensus.
>
> Br,
>
> Zoltan
>
> On Tue, Jul 16, 2019 at 1:45 PM Wes McKinney <we...@gmail.com> wrote:
> >
> > I think first you need to create a [DISCUSS] thread whose subject
> > clearly indicates that you are proposing to modify the Parquet format.
> > You should link to the PR with the changes to parquet-format. Then
> > wait for feedback to collect.
> >
> > Frankly, I was surprised to see a PR close to being merged based
> > largely on a 2-way discussion between Martin and Zoltan (unless I
> > missed something)
> >
> > On Tue, Jul 16, 2019 at 2:11 AM Roman Karlstetter
> > <ro...@gmail.com> wrote:
> > >
> > > Hi Wes,
> > >
> > > what would be the formal or informal requirements for such a vote to
> pass?
> > > What is needed in terms of code and specification before we can start
> such
> > > a vote?
> > >
> > > Roman
> > >
> > > Am Fr., 12. Juli 2019 um 17:07 Uhr schrieb Wes McKinney <
> wesmckinn@gmail.com
> > > >:
> > >
> > > > I think we need to vote to make any changes to the Parquet format.
> New
> > > > features carry a heavy responsibility
> > > >
> > > > On Fri, Jul 12, 2019 at 10:04 AM Michael Heuer <he...@gmail.com>
> wrote:
> > > > >
> > > > > Hello Martin,
> > > > >
> > > > > I'm willing to run some tests at scale on our genomics data when a
> > > > parquet-mr pull request for the Java implementation is ready.
> > > > >
> > > > > Cheers,
> > > > >
> > > > >    michael
> > > > >
> > > > >
> > > > > > On Jul 11, 2019, at 1:09 PM, Radev, Martin <ma...@tum.de>
> > > > wrote:
> > > > > >
> > > > > > Dear all,
> > > > > >
> > > > > >
> > > > > > I created a Jira issue for the new feature and also made a pull
> > > > request for my patch which extends the format and documentation.
> > > > > >
> > > > > > Jira issue: https://issues.apache.org/jira/browse/PARQUET-1622 <
> > > > https://issues.apache.org/jira/browse/PARQUET-1622>
> > > > > > Pull request: https://github.com/apache/parquet-format/pull/144
> <
> > > > https://github.com/apache/parquet-format/pull/144>
> > > > > >
> > > > > >
> > > > > > I also have a WIP patch for adding the "BYTE_STREAM_SPLIT"
> encoding to
> > > > parquet-cpp within Apache Arrow.
> > > > > >
> > > > > >
> > > > > > How should we proceed?
> > > > > >
> > > > > > It would be great to get feedback from other community members.
> > > > > >
> > > > > >
> > > > > > Regards,
> > > > > >
> > > > > > Martin
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > ________________________________
> > > > > > From: Radev, Martin <martin.radev@tum.de <mailto:
> martin.radev@tum.de>>
> > > > > > Sent: Tuesday, July 9, 2019 1:01:25 AM
> > > > > > To: Zoltan Ivanfi
> > > > > > Cc: Parquet Dev; Raoofy, Amir; Karlstetter, Roman
> > > > > > Subject: Re: Floating point data compression for Apache Parquet
> > > > > >
> > > > > > Hello Zoltan,
> > > > > >
> > > > > >
> > > > > > I can provide a C++ and Java implementation for the encoder.
> > > > > >
> > > > > > The encoder/decoder is very small, and naturally I have to add
> tests.
> > > > > >
> > > > > > I expect the biggest hurdle would be setting up the environment
> and
> > > > reading through the developer guides.
> > > > > >
> > > > > >
> > > > > > I will write my patches for Apache Arrow and for Apache Parquet
> and
> > > > send them for review.
> > > > > >
> > > > > > After getting them in, I can continue with the Java
> implementation.
> > > > > >
> > > > > > Let me know if you have any concerns.
> > > > > >
> > > > > >
> > > > > > It would be great to get an opinion from other Parquet
> contributors : )
> > > > > >
> > > > > >
> > > > > > Thank you for the feedback!
> > > > > >
> > > > > >
> > > > > > Best regards,
> > > > > >
> > > > > > Martin
> > > > > >
> > > > > > ________________________________
> > > > > > From: Zoltan Ivanfi <zi...@cloudera.com>
> > > > > > Sent: Monday, July 8, 2019 5:06:30 PM
> > > > > > To: Radev, Martin
> > > > > > Cc: Parquet Dev; Raoofy, Amir; Karlstetter, Roman
> > > > > > Subject: Re: Floating point data compression for Apache Parquet
> > > > > >
> > > > > > Hi Martin,
> > > > > >
> > > > > > I agree that bs_zstd would be a good place to start. Regarding
> the
> > > > choice of language, Java, C++ and Python are your options. As far as
> I
> > > > know, the Java implementation of Parquet has more users from the
> business
> > > > sector, where decimal is preferred over floating point data types.
> It is
> > > > also much more tightly integrated with the Hadoop ecosystem (it is
> even
> > > > called parquet-mr, as in MapReduce), making for a steeper learning
> curve.
> > > > > >
> > > > > > The Python and C++ language bindings have more scientific users,
> so
> > > > users of these may be more interested in the new encodings. Python
> is a
> > > > good language for rapid prototyping as well, but the Python binding
> of
> > > > Parquet may use the C++ library under the hood, I'm not sure (I'm
> more
> > > > familiar with the Java implementation). In any case, there are at
> least two
> > > > Python bindings: pyarrow and fastparquet.
> > > > > >
> > > > > > I think we can extend the format before the actual
> implementations are
> > > > ready, provided that the specification is clear and nobody objects to
> > > > adding it to the format. For this, I would wait for the opinion of a
> few
> > > > more Parquet developers first, since changes to the format that are
> only
> > > > supported by a single committer usually have a hard time getting
> into the
> > > > spec. Additionally, could you please clarify which language bindings
> you
> > > > plan to implement yourself? This will help the developers of the
> different
> > > > language bindings assess how much work they will have to do to add
> support.
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Zoltan
> > > > > >
> > > > > >
> > > > > > On Fri, Jul 5, 2019 at 4:34 PM Radev, Martin <
> martin.radev@tum.de
> > > > <ma...@tum.de>> wrote:
> > > > > >
> > > > > > Hello Zoltan and Parquet devs,
> > > > > >
> > > > > >
> > > > > > do you think it would be appropriate to start with a Parquet
> prototype
> > > > from my side?
> > > > > >
> > > > > > I suspect that 'bs_zstd' would be the simplest to
> > > > integrate and from the report we can see an improvement in both
> ratio and
> > > > speed.
> > > > > >
> > > > > >
> > > > > > Do you think that Apache Arrow is an appropriate place to
> prototype
> > > > the extension of the format?
> > > > > >
> > > > > > Do you agree that the enum field 'Encodings' is a suitable place
> to
> > > > add the 'Byte stream-splitting transformation'? In that way it could
> be
> > > > used with any of the other supported compressors.
> > > > > >
> > > > > > It might be best to also add a Java implementation of the
> > > > transformation. Would the project 'parquet-mr' be a good place?
> > > > > >
> > > > > >
> > > > > > Would the workflow be such that I write my patches, we verify for
> > > > correctness, get reviews, merge them AND just then we make
> adjustments to
> > > > the Apache Parquet spec?
> > > > > >
> > > > > >
> > > > > > Any piece of advice is welcome!
> > > > > >
> > > > > >
> > > > > > Regards,
> > > > > >
> > > > > > Martin
> > > > > >
> > > > > >
> > > > > > ________________________________
> > > > > > From: Zoltan Ivanfi <zi...@cloudera.com>>
> > > > > > Sent: Friday, July 5, 2019 4:21:39 PM
> > > > > > To: Radev, Martin
> > > > > > Cc: Parquet Dev; Raoofy, Amir; Karlstetter, Roman
> > > > > > Subject: Re: Floating point data compression for Apache Parquet
> > > > > >
> > > > > >
> > > > > > Hi Martin,
> > > > > >
> > > > > > Thanks for the explanations, makes sense. Nice work!
> > > > > >
> > > > > > Br,
> > > > > >
> > > > > > Zoltan
> > > > > >
> > > > > > On Thu, Jul 4, 2019 at 12:22 AM Radev, Martin <
> martin.radev@tum.de
> > > > <ma...@tum.de>> wrote:
> > > > > >
> > > > > > Hello Zoltan,
> > > > > >
> > > > > >
> > > > > >> Is data pre-loaded to RAM before making the measurements?
> > > > > > Yes, the file is read into physical memory.
> > > > > >
> > > > > > For mmap-ed files, read from external storage, I would expect,
> but not
> > > > 100% sure, that the IO-overhead would be big enough that all
> algorithms
> > > > compress at roughly the same speed.
> > > > > >
> > > > > >
> > > > > >> In "Figure 3: Decompression speed in MB/s", is data size
> measured
> > > > before or after uncompression?
> > > > > >
> > > > > >> In "Figure 4: Compression speed in MB/s", is data size measured
> > > > before or after compression?
> > > > > > For both the reported result is "size of the original file /
> time to
> > > > compress or decompress".
> > > > > >
> > > > > >> According to "Figure 3: Decompression speed in MB/s",
> decompression
> > > > of bs_zstd is almost twice as fast as plain zstd. Do you know what
> causes
> > > > this massive speed improvement?
> > > > > >
> > > > > > I do not know all of the details. As you mentioned, the written
> out
> > > > data is less, which could potentially lead to improvement in speed as
> less
> > > > data has to be written out to memory during compression or read from
> memory
> > > > during decompression.
> > > > > >
> > > > > > Another thing to consider is that ZSTD uses different techniques
> to
> > > > compress a block of data - "raw", "RLE", "Huffman coding", "Treeless
> > > > coding".
> > > > > >
> > > > > > I expect that "Huffman coding" is more costly than "RLE" and I
> also
> > > > expect that "RLE" to be applicable for the majority of the sign bits
> thus
> > > > leading to a performance win when the transformation is applied.
> > > > > >
> > > > > >
> > > > > > I also expect that zstd has to do some form of "optimal parsing"
> to
> > > > decide how to process the input in order to compress it well. This is
> > > > something every wanna-be-good LZ-like compressor has to do (
> > > >
> https://martinradev.github.io/jekyll/update/2019/05/29/writing-a-pe32-x86-exe-packer.html
> > > > ,
> > > >
> http://cbloomrants.blogspot.com/2011/10/10-24-11-lz-optimal-parse-with-star.html
> > > > ). It might be that the transformed input is somehow easier, which
> leads to
> > > > faster compression rates, and also easier to decompress, which
> leads to
> > > > faster decompression rates.
> > > > > >
> > > > > >
> > > > > >
> > > > > > I used this as a reference:
> > > > https://www.rfc-editor.org/rfc/pdfrfc/rfc8478.txt.pdf. I am not
> familiar
> > > > with ZSTD in particular.
> > > > > >
> > > > > >
> > > > > > I also checked that the majority of the time is spent in zstd.
> > > > > >
> > > > > > Example run for msg_sweep3d.dp using zstd at level 1.
> > > > > > - Transformation during compression: 0.086s, ZSTD compress on
> > > > transformed data: 0.08s
> > > > > >
> > > > > > - regular ZSTD: 0.34s
> > > > > > - ZSTD decompress from compressed transformed data: 0.067s,
> > > > Transformation during decompression: 0.021s
> > > > > > - regular ZSTD decompress: 0.24s
> > > > > >
> > > > > >
> > > > > > Example run for msg_sweep3d.dp using zstd at level 20.
> > > > > >
> > > > > > - Transformation during compression: 0.083s, ZSTD compress on
> > > > transformed data: 14.35s
> > > > > >
> > > > > > - regular ZSTD: 183s
> > > > > > - ZSTD decompress from compressed transformed data: 0.075s,
> > > > Transformation during decompression: 0.022s
> > > > > > - regular ZSTD decompress: 0.31s
> > > > > > Here it's clear that the transformed input is easier to parse
> > > > (compress). Maybe the blocks are also of a type that takes less time
> to
> > > > decompress.
> > > > > >
> > > > > >> If considering using existing libraries to provide any of the
> > > > compression algorithms, license compatibility is also an important
> factor
> > > > and therefore would be worth mentioning in Section 5.
> > > > > > This is something I forgot to list. I will get back to you and the
> other
> > > > devs with information.
> > > > > >
> > > > > > The filter I proposed for lossless compression can be integrated
> > > > without any concerns for a license.
> > > > > >
> > > > > >
> > > > > >> Are any of the investigated strategies applicable to DECIMAL
> values?
> > > > > > The lossy compressors SZ and ZFP do not support that out of
> the
> > > > box. I could communicate with the SZ developers to come to a
> decision how
> > > > this can be added to SZ. An option is to losslessly compress the
> > > > pre-decimal number and lossily compress the post-decimal number.
> > > > > >
> > > > > > For lossless compression, we can apply a similar stream splitting
> > > > technique for decimal types though it might be somewhat more complex
> and I
> > > > have not really thought about this case.
> > > > > >
> > > > > >
> > > > > > Regards,
> > > > > >
> > > > > > Martin
> > > > > >
> > > > > > ________________________________
> > > > > > From: Zoltan Ivanfi <zi...@cloudera.com>>
> > > > > > Sent: Wednesday, July 3, 2019 6:07:50 PM
> > > > > > To: Parquet Dev; Radev, Martin
> > > > > > Cc: Raoofy, Amir; Karlstetter, Roman
> > > > > > Subject: Re: Floating point data compression for Apache Parquet
> > > > > >
> > > > > > Hi Martin,
> > > > > >
> > > > > > Thanks for the thorough investigation, very nice report. I would
> have
> > > > a few questions:
> > > > > >
> > > > > > - Is data pre-loaded to RAM before making the measurements?
> > > > > >
> > > > > > - In "Figure 3: Decompression speed in MB/s", is data size
> measured
> > > > before or after uncompression?
> > > > > >
> > > > > > - In "Figure 4: Compression speed in MB/s", is data size measured
> > > > before or after compression?
> > > > > >
> > > > > > - According to "Figure 3: Decompression speed in MB/s",
> decompression
> > > > of bs_zstd is almost twice as fast as plain zstd. Do you know what
> causes
> > > > this massive speed improvement? Based on the description provided in
> > > > section 3.2, bs_zstd uses the same zstd compression with an extra
> step of
> > > > splitting/combining streams. Since this is extra work, I would have
> > > > expected bs_zstd to be slower than pure zstd, unless the compressed
> data
> > > > becomes so much smaller that it radically improves data access times.
> > > > However, according to "Figure 2: Compression ratio", bs_zstd achieves
> > > > "only" 23% better compression than plain zstd, which can not be the
> reason
> > > > for the 2x speed-up in itself.
> > > > > >
> > > > > > - If considering using existing libraries to provide any of the
> > > > compression algorithms, license compatibility is also an important
> factor
> > > > and therefore would be worth mentioning in Section 5.
> > > > > >
> > > > > > - Are any of the investigated strategies applicable to DECIMAL
> values?
> > > > Since floating point values and calculations have an inherent
> inaccuracy,
> > > > the DECIMAL type is much more important for storing financial data,
> which
> > > > is one of the main use cases of Parquet.
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Zoltan
> > > > > >
> > > > > > On Mon, Jul 1, 2019 at 10:57 PM Radev, Martin <
> martin.radev@tum.de
> > > > <ma...@tum.de>> wrote:
> > > > > > Hello folks,
> > > > > >
> > > > > >
> > > > > > thank you for your input.
> > > > > >
> > > > > >
> > > > > > I am finished with my investigation regarding introducing special
> > > > support for FP compression in Apache Parquet.
> > > > > >
> > > > > > My report also includes an investigation of lossy compressors
> though
> > > > there are still some things to be cleared out.
> > > > > >
> > > > > >
> > > > > > Report:
> > > > https://drive.google.com/open?id=1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv
> > > > > >
> > > > > >
> > > > > > Sections 3 4 5 6 are the most important to go over.
> > > > > >
> > > > > >
> > > > > > Let me know if you have any questions or concerns.
> > > > > >
> > > > > >
> > > > > > Regards,
> > > > > >
> > > > > > Martin
> > > > > >
> > > > > > ________________________________
> > > > > > From: Zoltan Ivanfi <zi...@cloudera.com.INVALID>
> > > > > > Sent: Thursday, June 13, 2019 2:16:56 PM
> > > > > > To: Parquet Dev
> > > > > > Cc: Raoofy, Amir; Karlstetter, Roman
> > > > > > Subject: Re: Floating point data compression for Apache Parquet
> > > > > >
> > > > > > Hi Martin,
> > > > > >
> > > > > > Thanks for your interest in improving Parquet. Efficient
> encodings are
> > > > > > really important in a big data file format, so this topic is
> > > > > > definitely worth researching and personally I am looking forward
> to
> > > > > > your report. Whether to add any new encodings to Parquet,
> however, can
> > > > > > not be answered until we see the results of your findings.
> > > > > >
> > > > > > You mention two paths. One has very small computational overhead
> but
> > > > > > does not provide significant space savings. The other provides
> > > > > > significant space savings but at the price of a significant
> > > > > > computational overhead. While purely based on these properties
> both of
> > > > > > them seem "balanced" (one is small effort, small gain; the other
> is
> > > > > > large effort, large gain) and therefore sound reasonable
> options, I
> > > > > > would argue that one should also consider development costs, code
> > > > > > complexity and compatibility implications when deciding about
> whether
> > > > > > a new feature is worth implementing.
> > > > > >
> > > > > > Adding a new encoding or compression to Parquet complicates the
> > > > > > specification of the file format and requires implementing it in
> every
> > > > > > language binding of the format, which is not only a considerable
> > > > > > effort, but is also error-prone (see LZ4 for an example, which
> was
> > > > > > added to both the Java and the C++ implementation of Parquet,
> yet they
> > > > > > are incompatible with each other). And lack of support is not
> only a
> > > > > > minor annoyance in this case: if one is forced to use an older
> reader
> > > > > > that does not support the new encoding yet (or a language
> binding that
> > > > > > does not support it at all), the data simply can not be read.
> > > > > >
> > > > > > In my opinion, no matter how low the computational overhead of a
> new
> > > > > > encoding is, if it does not provide significant gains, then the
> > > > > > specification clutter, implementation costs and the potential of
> > > > > > compatibility problems greatly outweigh its advantages. For this
> > > > > > reason, I would say that only encodings that provide significant
> gains
> > > > > > are worth adding. As far as I am concerned, such a new encoding
> would
> > > > > > be a welcome addition to Parquet.
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Zoltan
> > > > > >
> > > > > > On Wed, Jun 12, 2019 at 11:10 PM Radev, Martin <
> martin.radev@tum.de
> > > > <ma...@tum.de>> wrote:
> > > > > >>
> > > > > >> Dear all,
> > > > > >>
> > > > > >> thank you for your work on the Apache Parquet format.
> > > > > >>
> > > > > >> We are a group of students at the Technical University of
> Munich who
> > > > would like to extend the available compression and encoding options
> for
> > > > 32-bit and 64-bit floating point data in Apache Parquet.
> > > > > >> The current encodings and compression algorithms offered in
> Apache
> > > > Parquet are heavily specialized towards integer and text data.
> > > > > >> Thus there is an opportunity to reduce both I/O throughput
> > > > requirements and space requirements for handling floating point data
> by
> > > > selecting a specialized compression algorithm.
> > > > > >>
> > > > > >> Currently, I am doing an investigation on the available
> literature
> > > > and publicly available fp compressors. In my investigation I am
> writing a
> > > > report on my findings - the available algorithms, their strengths and
> > > > weaknesses, compression rates, compression speeds and decompression
> speeds,
> > > > and licenses. Once finished I will share the report with you and
> make a
> > > > proposal which ones IMO are good candidates for Apache Parquet.
> > > > > >>
> > > > > >> The goal is to add a solution for both 32-bit and 64-bit fp
> types. I
> > > > think that it would be beneficial to offer at the very least two
> distinct
> > > > paths. The first one should offer fast compression and decompression
> speed
> > > > with some but not significant saving in space. The second one should
> offer
> > > > slower compression and decompression speed but with a decent
> compression
> > > > rate. Both lossless. A lossy path will be investigated further and
> > > > discussed with the community.
> > > > > >>
> > > > > >> If I get an approval from you – the developers – I can continue
> with
> > > > adding support for the new encoding/compression options in the C++
> > > > implementation of Apache Parquet in Apache Arrow.
> > > > > >>
> > > > > >> Please let me know what you think of this idea and whether you
> have
> > > > any concerns with the plan.
> > > > > >>
> > > > > >> Best regards,
> > > > > >> Martin Radev
> > > > >
> > > >
>

Re: Floating point data compression for Apache Parquet

Posted by Zoltan Ivanfi <zi...@cloudera.com.INVALID>.
Hi Wes,

As far as I'm concerned, the PR was not close to being merged. I
pointed out in an earlier e-mail of this thread that we need to
involve more community members for this change. I approved the PR to
indicate that it looks good to me but was not going to merge it
without consensus.

Br,

Zoltan

On Tue, Jul 16, 2019 at 1:45 PM Wes McKinney <we...@gmail.com> wrote:
>
> I think first you need to create a [DISCUSS] thread whose subject
> clearly indicates that you are proposing to modify the Parquet format.
> You should link to the PR with the changes to parquet-format. Then
> wait for feedback to collect.
>
> Frankly, I was surprised to see a PR close to being merged based
> largely on a 2-way discussion between Martin and Zoltan (unless I
> missed something)
>
> On Tue, Jul 16, 2019 at 2:11 AM Roman Karlstetter
> <ro...@gmail.com> wrote:
> >
> > Hi Wes,
> >
> > what would be the formal or informal requirements for such a vote to pass?
> > What is needed in terms of code and specification before we can start such
> > a vote?
> >
> > Roman
> >
> > Am Fr., 12. Juli 2019 um 17:07 Uhr schrieb Wes McKinney <wesmckinn@gmail.com
> > >:
> >
> > > I think we need to vote to make any changes to the Parquet format. New
> > > features carry a heavy responsibility
> > >
> > > On Fri, Jul 12, 2019 at 10:04 AM Michael Heuer <he...@gmail.com> wrote:
> > > >
> > > > Hello Martin,
> > > >
> > > > I'm willing to run some tests at scale on our genomics data when a
> > > parquet-mr pull request for the Java implementation is ready.
> > > >
> > > > Cheers,
> > > >
> > > >    michael
> > > >
> > > >
> > > > > On Jul 11, 2019, at 1:09 PM, Radev, Martin <ma...@tum.de>
> > > wrote:
> > > > >
> > > > > Dear all,
> > > > >
> > > > >
> > > > > I created a Jira issue for the new feature and also made a pull
> > > request for my patch which extends the format and documentation.
> > > > >
> > > > > Jira issue: https://issues.apache.org/jira/browse/PARQUET-1622 <
> > > https://issues.apache.org/jira/browse/PARQUET-1622>
> > > > > Pull request: https://github.com/apache/parquet-format/pull/144 <
> > > https://github.com/apache/parquet-format/pull/144>
> > > > >
> > > > >
> > > > > I also have a WIP patch for adding the "BYTE_STREAM_SPLIT" encoding to
> > > parquet-cpp within Apache Arrow.
> > > > >
> > > > >
> > > > > How should we proceed?
> > > > >
> > > > > It would be great to get feedback from other community members.
> > > > >
> > > > >
> > > > > Regards,
> > > > >
> > > > > Martin
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > ________________________________
> > > > > From: Radev, Martin <martin.radev@tum.de <ma...@tum.de>>
> > > > > Sent: Tuesday, July 9, 2019 1:01:25 AM
> > > > > To: Zoltan Ivanfi
> > > > > Cc: Parquet Dev; Raoofy, Amir; Karlstetter, Roman
> > > > > Subject: Re: Floating point data compression for Apache Parquet
> > > > >
> > > > > Hello Zoltan,
> > > > >
> > > > >
> > > > > I can provide a C++ and Java implementation for the encoder.
> > > > >
> > > > > The encoder/decoder is very small, and naturally I have to add tests.
> > > > >
> > > > > I expect the biggest hurdle would be setting up the environment and
> > > reading through the developer guides.
> > > > >
> > > > >
> > > > > I will write my patches for Apache Arrow and for Apache Parquet and
> > > send them for review.
> > > > >
> > > > > After getting them in, I can continue with the Java implementation.
> > > > >
> > > > > Let me know if you have any concerns.
> > > > >
> > > > >
> > > > > It would be great to get an opinion from other Parquet contributors : )
> > > > >
> > > > >
> > > > > Thank you for the feedback!
> > > > >
> > > > >
> > > > > Best regards,
> > > > >
> > > > > Martin
> > > > >
> > > > > ________________________________
> > > > > From: Zoltan Ivanfi <zi...@cloudera.com>
> > > > > Sent: Monday, July 8, 2019 5:06:30 PM
> > > > > To: Radev, Martin
> > > > > Cc: Parquet Dev; Raoofy, Amir; Karlstetter, Roman
> > > > > Subject: Re: Floating point data compression for Apache Parquet
> > > > >
> > > > > Hi Martin,
> > > > >
> > > > > I agree that bs_zstd would be a good place to start. Regarding the
> > > choice of language, Java, C++ and Python are your options. As far as I
> > > know, the Java implementation of Parquet has more users from the business
> > > sector, where decimal is preferred over floating point data types. It is
> > > also much more tightly integrated with the Hadoop ecosystem (it is even
> > > called parquet-mr, as in MapReduce), making for a steeper learning curve.
> > > > >
> > > > > The Python and C++ language bindings have more scientific users, so
> > > users of these may be more interested in the new encodings. Python is a
> > > good language for rapid prototyping as well, but the Python binding of
> > > Parquet may use the C++ library under the hood, I'm not sure (I'm more
> > > familiar with the Java implementation). In any case, there are at least two
> > > Python bindings: pyarrow and fastparquet.
> > > > >
> > > > > I think we can extend the format before the actual implementations are
> > > ready, provided that the specification is clear and nobody objects to
> > > adding it to the format. For this, I would wait for the opinion of a few
> > > more Parquet developers first, since changes to the format that are only
> > > supported by a single committer usually have a hard time getting into the
> > > spec. Additionally, could you please clarify which language bindings you
> > > plan to implement yourself? This will help the developers of the different
> > > language bindings assess how much work they will have to do to add support.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Zoltan
> > > > >
> > > > >
> > > > > On Fri, Jul 5, 2019 at 4:34 PM Radev, Martin <martin.radev@tum.de
> > > <ma...@tum.de>> wrote:
> > > > >
> > > > > Hello Zoltan and Parquet devs,
> > > > >
> > > > >
> > > > > do you think it would be appropriate to start with a Parquet prototype
> > > from my side?
> > > > >
> > > > > I suspect that 'bs_zstd' would be the simplest to
> > > integrate, and from the report we can see an improvement in both ratio and
> > > speed.
> > > > >
> > > > >
> > > > > Do you think that Apache Arrow is an appropriate place to prototype
> > > the extension of the format?
> > > > >
> > > > > Do you agree that the enum field 'Encodings' is a suitable place to
> > > add the 'Byte stream-splitting transformation'? In that way it could be
> > > used with any of the other supported compressors.
> > > > >
> > > > > It might be best to also add a Java implementation of the
> > > transformation. Would the project 'parquet-mr' be a good place?
> > > > >
> > > > >
> > > > > Would the workflow be such that I write my patches, we verify for
> > > correctness, get reviews, merge them AND just then we make adjustments to
> > > the Apache Parquet spec?
> > > > >
> > > > >
> > > > > Any piece of advice is welcome!
> > > > >
> > > > >
> > > > > Regards,
> > > > >
> > > > > Martin
> > > > >
> > > > >
> > > > > ________________________________
> > > > > From: Zoltan Ivanfi <zi...@cloudera.com>>
> > > > > Sent: Friday, July 5, 2019 4:21:39 PM
> > > > > To: Radev, Martin
> > > > > Cc: Parquet Dev; Raoofy, Amir; Karlstetter, Roman
> > > > > Subject: Re: Floating point data compression for Apache Parquet
> > > > >
> > > > >
> > > > > Hi Martin,
> > > > >
> > > > > Thanks for the explanations, makes sense. Nice work!
> > > > >
> > > > > Br,
> > > > >
> > > > > Zoltan
> > > > >
> > > > > On Thu, Jul 4, 2019 at 12:22 AM Radev, Martin <martin.radev@tum.de
> > > <ma...@tum.de>> wrote:
> > > > >
> > > > > Hello Zoltan,
> > > > >
> > > > >
> > > > >> Is data pre-loaded to RAM before making the measurements?
> > > > > Yes, the file is read into physical memory.
> > > > >
> > > > > For mmap-ed files read from external storage, I would expect, though I am
> > > not 100% sure, that the IO overhead would be big enough that all algorithms
> > > compress at roughly the same speed.
> > > > >
> > > > >
> > > > >> In "Figure 3: Decompression speed in MB/s", is data size measured
> > > before or after uncompression?
> > > > >
> > > > >> In "Figure 4: Compression speed in MB/s", is data size measured
> > > before or after compression?
> > > > > For both the reported result is "size of the original file / time to
> > > compress or decompress".
> > > > >
> > > > >> According to "Figure 3: Decompression speed in MB/s", decompression
> > > of bs_zstd is almost twice as fast as plain zstd. Do you know what causes
> > > this massive speed improvement?
> > > > >
> > > > > I do not know all of the details. As you mentioned, the written out
> > > data is smaller, which could potentially lead to an improvement in speed as less
> > > data has to be written out to memory during compression or read from memory
> > > during decompression.
> > > > >
> > > > > Another thing to consider is that ZSTD uses different techniques to
> > > compress a block of data - "raw", "RLE", "Huffman coding", "Treeless
> > > coding".
> > > > >
> > > > > I expect that "Huffman coding" is more costly than "RLE" and I also
> > > expect "RLE" to be applicable for the majority of the sign bits, thus
> > > leading to a performance win when the transformation is applied.
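To make the RLE intuition concrete, here is a small self-contained toy (the values are made up for illustration and are not data from the report): doubles of the same sign and similar magnitude share an identical most significant byte, so after byte stream splitting that stream collapses into a run of one repeated byte, which an RLE block encodes almost for free.

    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <cstring>

    int main() {
      // Same sign, similar magnitude -- typical for many measurement series.
      const double values[] = {21.4, 22.0, 23.7, 21.9, 24.1, 22.8};
      const size_t n = sizeof(values) / sizeof(values[0]);

      // Print the most significant byte of each IEEE-754 double
      // (byte index 7 on a little-endian machine): sign bit plus top exponent bits.
      for (size_t i = 0; i < n; ++i) {
        uint8_t bytes[sizeof(double)];
        std::memcpy(bytes, &values[i], sizeof(double));
        std::printf("value %5.1f  high byte: 0x%02X\n", values[i], (unsigned)bytes[7]);
      }
      // All six values print 0x40, so the split-out high-byte stream is a
      // single repeated byte value.
      return 0;
    }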
> > > > >
> > > > >
> > > > > I also expect that zstd has to do some form of "optimal parsing" to
> > > decide how to process the input in order to compress it well. This is
> > > something every wanna-be-good LZ-like compressor has to do (
> > > https://martinradev.github.io/jekyll/update/2019/05/29/writing-a-pe32-x86-exe-packer.html
> > > ,
> > > http://cbloomrants.blogspot.com/2011/10/10-24-11-lz-optimal-parse-with-star.html
> > > ). It might be that the transformed input is simply easier to parse, which
> > > leads to faster compression, and that the resulting data is easier to
> > > decompress, which leads to faster decompression.
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > I used this as a reference:
> > > https://www.rfc-editor.org/rfc/pdfrfc/rfc8478.txt.pdf. I am not familiar
> > > with ZSTD in particular.
> > > > >
> > > > >
> > > > > I also checked that the majority of the time is spent in zstd.
> > > > >
> > > > > Example run for msg_sweep3d.dp using zstd at level 1.
> > > > > - Transformation during compression: 0.086s, ZSTD compress on
> > > transformed data: 0.08s
> > > > >
> > > > > - regular ZSTD: 0.34s
> > > > > - ZSTD decompress from compressed transformed data: 0.067s,
> > > Transformation during decompression: 0.021s
> > > > > - regular ZSTD decompress: 0.24s
> > > > >
> > > > >
> > > > > Example run for msg_sweep3d.dp using zstd at level 20.
> > > > >
> > > > > - Transformation during compression: 0.083s, ZSTD compress on
> > > transformed data: 14.35s
> > > > >
> > > > > - regular ZSTD: 183s
> > > > > - ZSTD decompress from compressed transformed data: 0.075s,
> > > Transformation during decompression: 0.022s
> > > > > - regular ZSTD decompress: 0.31s
> > > > > Here it's clear that the transformed input is easier to parse
> > > (compress). Maybe the blocks are also of a type that takes less time to
> > > decompress.
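Summing the level-1 figures quoted above (plain arithmetic on the numbers in this message, no new measurements):

    \[
    \begin{aligned}
    \text{compress:}\quad   & 0.086\,\text{s} + 0.08\,\text{s} = 0.166\,\text{s} \;\;\text{vs.}\;\; 0.34\,\text{s} \quad (\approx 2.0\times)\\
    \text{decompress:}\quad & 0.067\,\text{s} + 0.021\,\text{s} = 0.088\,\text{s} \;\;\text{vs.}\;\; 0.24\,\text{s} \quad (\approx 2.7\times)
    \end{aligned}
    \]

So even with the extra transformation step included, the end-to-end bs_zstd times in this example are roughly two to three times shorter than plain zstd, which matches the speed-up asked about above.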
> > > > >
> > > > >> If considering using existing libraries to provide any of the
> > > compression algorithms, license compatibility is also an important factor
> > > and therefore would be worth mentioning in Section 5.
> > > > > This is something I forgot to list. I will get back to you and the other
> > > devs with information.
> > > > >
> > > > > The filter I proposed for lossless compression can be integrated
> > > without any concerns for a license.
> > > > >
> > > > >
> > > > >> Are any of the investigated strategies applicable to DECIMAL values?
> > > > > The lossy compressors SZ and ZFP do not support that out of the
> > > box. I could communicate with the SZ developers to come to a decision on how
> > > this can be added to SZ. An option is to losslessly compress the
> > > integer part and lossily compress the fractional part.
> > > > >
> > > > > For lossless compression, we can apply a similar stream splitting
> > > technique for decimal types, though it might be somewhat more complex and I
> > > have not really thought about this case.
> > > > >
> > > > >
> > > > > Regards,
> > > > >
> > > > > Martin
> > > > >
> > > > > ________________________________
> > > > > From: Zoltan Ivanfi <zi...@cloudera.com>>
> > > > > Sent: Wednesday, July 3, 2019 6:07:50 PM
> > > > > To: Parquet Dev; Radev, Martin
> > > > > Cc: Raoofy, Amir; Karlstetter, Roman
> > > > > Subject: Re: Floating point data compression for Apache Parquet
> > > > >
> > > > > Hi Martin,
> > > > >
> > > > > Thanks for the thorough investigation, very nice report. I would have
> > > a few questions:
> > > > >
> > > > > - Is data pre-loaded to RAM before making the measurements?
> > > > >
> > > > > - In "Figure 3: Decompression speed in MB/s", is data size measured
> > > before or after uncompression?
> > > > >
> > > > > - In "Figure 4: Compression speed in MB/s", is data size measured
> > > before or after compression?
> > > > >
> > > > > - According to "Figure 3: Decompression speed in MB/s", decompression
> > > of bs_zstd is almost twice as fast as plain zstd. Do you know what causes
> > > this massive speed improvement? Based on the description provided in
> > > section 3.2, bs_zstd uses the same zstd compression with an extra step of
> > > splitting/combining streams. Since this is extra work, I would have
> > > expected bs_zstd to be slower than pure zstd, unless the compressed data
> > > becomes so much smaller that it radically improves data access times.
> > > However, according to "Figure 2: Compression ratio", bs_zstd achieves
> > > "only" 23% better compression than plain zstd, which can not be the reason
> > > for the 2x speed-up in itself.
> > > > >
> > > > > - If considering using existing libraries to provide any of the
> > > compression algorithms, license compatibility is also an important factor
> > > and therefore would be worth mentioning in Section 5.
> > > > >
> > > > > - Are any of the investigated strategies applicable to DECIMAL values?
> > > Since floating point values and calculations have an inherent inaccuracy,
> > > the DECIMAL type is much more important for storing financial data, which
> > > is one of the main use cases of Parquet.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Zoltan
> > > > >
> > > > > On Mon, Jul 1, 2019 at 10:57 PM Radev, Martin <martin.radev@tum.de
> > > <ma...@tum.de>> wrote:
> > > > > Hello folks,
> > > > >
> > > > >
> > > > > thank you for your input.
> > > > >
> > > > >
> > > > > I am finished with my investigation regarding introducing special
> > > support for FP compression in Apache Parquet.
> > > > >
> > > > > My report also includes an investigation of lossy compressors though
> > > there are still some things to be cleared out.
> > > > >
> > > > >
> > > > > Report:
> > > https://drive.google.com/open?id=1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv
> > > > >
> > > > >
> > > > > Sections 3 4 5 6 are the most important to go over.
> > > > >
> > > > >
> > > > > Let me know if you have any questions or concerns.
> > > > >
> > > > >
> > > > > Regards,
> > > > >
> > > > > Martin
> > > > >
> > > > > ________________________________
> > > > > From: Zoltan Ivanfi <zi...@cloudera.com.INVALID>
> > > > > Sent: Thursday, June 13, 2019 2:16:56 PM
> > > > > To: Parquet Dev
> > > > > Cc: Raoofy, Amir; Karlstetter, Roman
> > > > > Subject: Re: Floating point data compression for Apache Parquet
> > > > >
> > > > > Hi Martin,
> > > > >
> > > > > Thanks for your interest in improving Parquet. Efficient encodings are
> > > > > really important in a big data file format, so this topic is
> > > > > definitely worth researching and personally I am looking forward to
> > > > > your report. Whether to add any new encodings to Parquet, however, can
> > > > > not be answered until we see the results of your findings.
> > > > >
> > > > > You mention two paths. One has very small computational overhead but
> > > > > does not provide significant space savings. The other provides
> > > > > significant space savings but at the price of a significant
> > > > > computational overhead. While purely based on these properties both of
> > > > > them seem "balanced" (one is small effort, small gain; the other is
> > > > > large effort, large gain) and therefore sound reasonable options, I
> > > > > would argue that one should also consider development costs, code
> > > > > complexity and compatibility implications when deciding about whether
> > > > > a new feature is worth implementing.
> > > > >
> > > > > Adding a new encoding or compression to Parquet complicates the
> > > > > specification of the file format and requires implementing it in every
> > > > > language binding of the format, which is not only a considerable
> > > > > effort, but is also error-prone (see LZ4 for an example, which was
> > > > > added to both the Java and the C++ implementation of Parquet, yet they
> > > > > are incompatible with each other). And lack of support is not only a
> > > > > minor annoyance in this case: if one is forced to use an older reader
> > > > > that does not support the new encoding yet (or a language binding that
> > > > > does not support it at all), the data simply can not be read.
> > > > >
> > > > > In my opinion, no matter how low the computational overhead of a new
> > > > > encoding is, if it does not provide significant gains, then the
> > > > > specification clutter, implementation costs and the potential of
> > > > > compatibility problems greatly outweigh its advantages. For this
> > > > > reason, I would say that only encodings that provide significant gains
> > > > > are worth adding. As far as I am concerned, such a new encoding would
> > > > > be a welcome addition to Parquet.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Zoltan
> > > > >
> > > > > On Wed, Jun 12, 2019 at 11:10 PM Radev, Martin <martin.radev@tum.de
> > > <ma...@tum.de>> wrote:
> > > > >>
> > > > >> Dear all,
> > > > >>
> > > > >> thank you for your work on the Apache Parquet format.
> > > > >>
> > > > >> We are a group of students at the Technical University of Munich who
> > > would like to extend the available compression and encoding options for
> > > 32-bit and 64-bit floating point data in Apache Parquet.
> > > > >> The current encodings and compression algorithms offered in Apache
> > > Parquet are heavily specialized towards integer and text data.
> > > > >> Thus there is an opportunity in reducing both io throughput
> > > requirements and space requirements for handling floating point data by
> > > selecting a specialized compression algorithm.
> > > > >>
> > > > >> Currently, I am doing an investigation on the available literature
> > > and publicly available fp compressors. In my investigation I am writing a
> > > report on my findings - the available algorithms, their strengths and
> > > weaknesses, compression rates, compression speeds and decompression speeds,
> > > and licenses. Once finished I will share the report with you and make a
> > > proposal which ones IMO are good candidates for Apache Parquet.
> > > > >>
> > > > >> The goal is to add a solution for both 32-bit and 64-bit fp types. I
> > > think that it would be beneficial to offer at the very least two distinct
> > > paths. The first one should offer fast compression and decompression speed
> > > with some but not significant saving in space. The second one should offer
> > > slower compression and decompression speed but with a decent compression
> > > rate. Both lossless. A lossy path will be investigated further and
> > > discussed with the community.
> > > > >>
> > > > >> If I get an approval from you – the developers – I can continue with
> > > adding support for the new encoding/compression options in the C++
> > > implementation of Apache Parquet in Apache Arrow.
> > > > >>
> > > > >> Please let me know what you think of this idea and whether you have
> > > any concerns with the plan.
> > > > >>
> > > > >> Best regards,
> > > > >> Martin Radev
> > > >
> > >

Re: Floating point data compression for Apache Parquet

Posted by Wes McKinney <we...@gmail.com>.
I think first you need to create a [DISCUSS] thread whose subject
clearly indicates that you are proposing to modify the Parquet format.
You should link to the PR with the changes to parquet-format. Then
wait for feedback to collect.

Frankly, I was surprised to see a PR close to being merged based
largely on a 2-way discussion between Martin and Zoltan (unless I
missed something).

On Tue, Jul 16, 2019 at 2:11 AM Roman Karlstetter
<ro...@gmail.com> wrote:
>
> Hi Wes,
>
> what would be the formal or informal requirements for such a vote to pass?
> What is needed in terms of code and specification before we can start such
> a vote?
>
> Roman
>
> Am Fr., 12. Juli 2019 um 17:07 Uhr schrieb Wes McKinney <wesmckinn@gmail.com
> >:
>
> > I think we need to vote to make any changes to the Parquet format. New
> > features carry a heavy responsibility
> >
> > On Fri, Jul 12, 2019 at 10:04 AM Michael Heuer <he...@gmail.com> wrote:
> > >
> > > Hello Martin,
> > >
> > > I'm willing to run some tests at scale on our genomics data when a
> > parquet-mr pull request for the Java implementation is ready.
> > >
> > > Cheers,
> > >
> > >    michael
> > >
> > >
> > > > On Jul 11, 2019, at 1:09 PM, Radev, Martin <ma...@tum.de>
> > wrote:
> > > >
> > > > Dear all,
> > > >
> > > >
> > > > I created a Jira issue for the new feature and also made a pull
> > request for my patch which extends the format and documentation.
> > > >
> > > > Jira issue: https://issues.apache.org/jira/browse/PARQUET-1622 <
> > https://issues.apache.org/jira/browse/PARQUET-1622>
> > > > Pull request: https://github.com/apache/parquet-format/pull/144 <
> > https://github.com/apache/parquet-format/pull/144>
> > > >
> > > >
> > > > I also have a WIP patch for adding the "BYTE_STREAM_SPLIT" encoding to
> > parquet-cpp within Apache Arrow.
> > > >
> > > >
> > > > How should we proceed?
> > > >
> > > > It would be great to get feedback from other community members.
> > > >
> > > >
> > > > Regards,
> > > >
> > > > Martin
> > > >
> > > >
> > > >
> > > >
> > > > ________________________________
> > > > From: Radev, Martin <martin.radev@tum.de <ma...@tum.de>>
> > > > Sent: Tuesday, July 9, 2019 1:01:25 AM
> > > > To: Zoltan Ivanfi
> > > > Cc: Parquet Dev; Raoofy, Amir; Karlstetter, Roman
> > > > Subject: Re: Floating point data compression for Apache Parquet
> > > >
> > > > Hello Zoltan,
> > > >
> > > >
> > > > I can provide a C++ and Java implementation for the encoder.
> > > >
> > > > The encoder/decoder is very small, and naturally I have to add tests.
> > > >
> > > > I expect the biggest hurdle would be setting up the environment and
> > reading through the developer guides.
> > > >
> > > >
> > > > I will write my patches for Apache Arrow and for Apache Parquet and
> > send them for review.
> > > >
> > > > After getting them in, I can continue with the Java implementation.
> > > >
> > > > Let me know if you have any concerns.
> > > >
> > > >
> > > > It would be great to get an opinion from other Parquet contributors : )
> > > >
> > > >
> > > > Thank you for the feedback!
> > > >
> > > >
> > > > Best regards,
> > > >
> > > > Martin
> > > >
> > > > ________________________________
> > > > From: Zoltan Ivanfi <zi...@cloudera.com>
> > > > Sent: Monday, July 8, 2019 5:06:30 PM
> > > > To: Radev, Martin
> > > > Cc: Parquet Dev; Raoofy, Amir; Karlstetter, Roman
> > > > Subject: Re: Floating point data compression for Apache Parquet
> > > >
> > > > Hi Martin,
> > > >
> > > > I agree that bs_zstd would be a good place to start. Regarding the
> > choice of language, Java, C++ and Python are your options. As far as I
> > know, the Java implementation of Parquet has more users from the business
> > sector, where decimal is preferred over floating point data types. It is
> > also much more tightly integrated with the Hadoop ecosystem (it is even
> > called parquet-mr, as in MapReduce), making for a steeper learning curve.
> > > >
> > > > The Python and C++ language bindings have more scientific users, so
> > users of these may be more interested in the new encodings. Python is a
> > good language for rapid prototyping as well, but the Python binding of
> > Parquet may use the C++ library under the hood, I'm not sure (I'm more
> > familiar with the Java implementation). In any case, there are at least two
> > Python bindings: pyarrow and fastparquet.
> > > >
> > > > I think we can extend the format before the actual implementations are
> > ready, provided that the specification is clear and nobody objects to
> > adding it to the format. For this, I would wait for the opinion of a few
> > more Parquet developers first, since changes to the format that are only
> > supported by a single committer usually have a hard time getting into the
> > spec. Additionally, could you please clarify which language bindings you
> > plan to implement yourself? This will help the developers of the different
> > language bindings assess how much work they will have to do to add support.
> > > >
> > > > Thanks,
> > > >
> > > > Zoltan
> > > >
> > > >
> > > > On Fri, Jul 5, 2019 at 4:34 PM Radev, Martin <martin.radev@tum.de
> > <ma...@tum.de>> wrote:
> > > >
> > > > Hello Zoltan and Parquet devs,
> > > >
> > > >
> > > > do you think it would be appropriate to start with a Parquet prototype
> > from my side?
> > > >
> > > > I suspect that 'bs_zstd' would be the simplest to
> > integrate, and from the report we can see an improvement in both ratio and
> > speed.
> > > >
> > > >
> > > > Do you think that Apache Arrow is an appropriate place to prototype
> > the extension of the format?
> > > >
> > > > Do you agree that the enum field 'Encodings' is a suitable place to
> > add the 'Byte stream-splitting transformation'? In that way it could be
> > used with any of the other supported compressors.
> > > >
> > > > It might be best to also add a Java implementation of the
> > transformation. Would the project 'parquet-mr' be a good place?
> > > >
> > > >
> > > > Would the workflow be such that I write my patches, we verify for
> > correctness, get reviews, merge them AND just then we make adjustments to
> > the Apache Parquet spec?
> > > >
> > > >
> > > > Any piece of advice is welcome!
> > > >
> > > >
> > > > Regards,
> > > >
> > > > Martin
> > > >
> > > >
> > > > ________________________________
> > > > From: Zoltan Ivanfi <zi...@cloudera.com>>
> > > > Sent: Friday, July 5, 2019 4:21:39 PM
> > > > To: Radev, Martin
> > > > Cc: Parquet Dev; Raoofy, Amir; Karlstetter, Roman
> > > > Subject: Re: Floating point data compression for Apache Parquet
> > > >
> > > >
> > > > Hi Martin,
> > > >
> > > > Thanks for the explanations, makes sense. Nice work!
> > > >
> > > > Br,
> > > >
> > > > Zoltan
> > > >
> > > > On Thu, Jul 4, 2019 at 12:22 AM Radev, Martin <martin.radev@tum.de
> > <ma...@tum.de>> wrote:
> > > >
> > > > Hello Zoltan,
> > > >
> > > >
> > > >> Is data pre-loaded to RAM before making the measurements?
> > > > Yes, the file is read into physical memory.
> > > >
> > > > For mmap-ed files read from external storage, I would expect, though I am
> > not 100% sure, that the IO overhead would be big enough that all algorithms
> > compress at roughly the same speed.
> > > >
> > > >
> > > >> In "Figure 3: Decompression speed in MB/s", is data size measured
> > before or after uncompression?
> > > >
> > > >> In "Figure 4: Compression speed in MB/s", is data size measured
> > before or after compression?
> > > > For both the reported result is "size of the original file / time to
> > compress or decompress".
> > > >
> > > >> According to "Figure 3: Decompression speed in MB/s", decompression
> > of bs_zstd is almost twice as fast as plain zstd. Do you know what causes
> > this massive speed improvement?
> > > >
> > > > I do not know all of the details. As you mentioned, the written out
> > data is smaller, which could potentially lead to an improvement in speed as less
> > data has to be written out to memory during compression or read from memory
> > during decompression.
> > > >
> > > > Another thing to consider is that ZSTD uses different techniques to
> > compress a block of data - "raw", "RLE", "Huffman coding", "Treeless
> > coding".
> > > >
> > > > I expect that "Huffman coding" is more costly than "RLE" and I also
> > expect "RLE" to be applicable for the majority of the sign bits, thus
> > leading to a performance win when the transformation is applied.
> > > >
> > > >
> > > > I also expect that zstd has to do some form of "optimal parsing" to
> > decide how to process the input in order to compress it well. This is
> > something every wanna-be-good LZ-like compressor has to do (
> > https://martinradev.github.io/jekyll/update/2019/05/29/writing-a-pe32-x86-exe-packer.html
> > ,
> > http://cbloomrants.blogspot.com/2011/10/10-24-11-lz-optimal-parse-with-star.html
> > ). It might be that the transformed input is simply easier to parse, which
> > leads to faster compression, and that the resulting data is easier to
> > decompress, which leads to faster decompression.
> > > >
> > > >
> > > >
> > > >
> > > > I used this as a reference:
> > https://www.rfc-editor.org/rfc/pdfrfc/rfc8478.txt.pdf. I am not familiar
> > with ZSTD in particular.
> > > >
> > > >
> > > > I also checked that the majority of the time is spent in zstd.
> > > >
> > > > Example run for msg_sweep3d.dp using zstd at level 1.
> > > > - Transformation during compression: 0.086s, ZSTD compress on
> > transformed data: 0.08s
> > > >
> > > > - regular ZSTD: 0.34s
> > > > - ZSTD decompress from compressed transformed data: 0.067s,
> > Transformation during decompression: 0.021s
> > > > - regular ZSTD decompress: 0.24s
> > > >
> > > >
> > > > Example run for msg_sweep3d.dp using zstd at level 20.
> > > >
> > > > - Transformation during compression: 0.083s, ZSTD compress on
> > transformed data: 14.35s
> > > >
> > > > - regular ZSTD: 183s
> > > > - ZSTD decompress from compressed transformed data: 0.075s,
> > Transformation during decompression: 0.022s
> > > > - regular ZSTD decompress: 0.31s
> > > > Here it's clear that the transformed input is easier to parse
> > (compress). Maybe the blocks are also of a type that takes less time to
> > decompress.
> > > >
> > > >> If considering using existing libraries to provide any of the
> > compression algorithms, license compatibility is also an important factor
> > and therefore would be worth mentioning in Section 5.
> > > > This is something I forgot to list. I will get back to you and the other
> > devs with information.
> > > >
> > > > The filter I proposed for lossless compression can be integrated
> > without any concerns for a license.
> > > >
> > > >
> > > >> Are any of the investigated strategies applicable to DECIMAL values?
> > > > The lossy compressors SZ and ZFP do not support that out of the
> > box. I could communicate with the SZ developers to come to a decision on how
> > this can be added to SZ. An option is to losslessly compress the
> > integer part and lossily compress the fractional part.
> > > >
> > > > For lossless compression, we can apply a similar stream splitting
> > technique for decimal types, though it might be somewhat more complex and I
> > have not really thought about this case.
> > > >
> > > >
> > > > Regards,
> > > >
> > > > Martin
> > > >
> > > > ________________________________
> > > > From: Zoltan Ivanfi <zi...@cloudera.com>>
> > > > Sent: Wednesday, July 3, 2019 6:07:50 PM
> > > > To: Parquet Dev; Radev, Martin
> > > > Cc: Raoofy, Amir; Karlstetter, Roman
> > > > Subject: Re: Floating point data compression for Apache Parquet
> > > >
> > > > Hi Martin,
> > > >
> > > > Thanks for the thorough investigation, very nice report. I would have
> > a few questions:
> > > >
> > > > - Is data pre-loaded to RAM before making the measurements?
> > > >
> > > > - In "Figure 3: Decompression speed in MB/s", is data size measured
> > before or after uncompression?
> > > >
> > > > - In "Figure 4: Compression speed in MB/s", is data size measured
> > before or after compression?
> > > >
> > > > - According to "Figure 3: Decompression speed in MB/s", decompression
> > of bs_zstd is almost twice as fast as plain zstd. Do you know what causes
> > this massive speed improvement? Based on the description provided in
> > section 3.2, bs_zstd uses the same zstd compression with an extra step of
> > splitting/combining streams. Since this is extra work, I would have
> > expected bs_zstd to be slower than pure zstd, unless the compressed data
> > becomes so much smaller that it radically improves data access times.
> > However, according to "Figure 2: Compression ratio", bs_zstd achieves
> > "only" 23% better compression than plain zstd, which can not be the reason
> > for the 2x speed-up in itself.
> > > >
> > > > - If considering using existing libraries to provide any of the
> > compression algorithms, license compatibility is also an important factor
> > and therefore would be worth mentioning in Section 5.
> > > >
> > > > - Are any of the investigated strategies applicable to DECIMAL values?
> > Since floating point values and calculations have an inherent inaccuracy,
> > the DECIMAL type is much more important for storing financial data, which
> > is one of the main use cases of Parquet.
> > > >
> > > > Thanks,
> > > >
> > > > Zoltan
> > > >
> > > > On Mon, Jul 1, 2019 at 10:57 PM Radev, Martin <martin.radev@tum.de
> > <ma...@tum.de>> wrote:
> > > > Hello folks,
> > > >
> > > >
> > > > thank you for your input.
> > > >
> > > >
> > > > I am finished with my investigation regarding introducing special
> > support for FP compression in Apache Parquet.
> > > >
> > > > My report also includes an investigation of lossy compressors though
> > there are still some things to be cleared out.
> > > >
> > > >
> > > > Report:
> > https://drive.google.com/open?id=1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv
> > > >
> > > >
> > > > Sections 3 4 5 6 are the most important to go over.
> > > >
> > > >
> > > > Let me know if you have any questions or concerns.
> > > >
> > > >
> > > > Regards,
> > > >
> > > > Martin
> > > >
> > > > ________________________________
> > > > From: Zoltan Ivanfi <zi...@cloudera.com.INVALID>
> > > > Sent: Thursday, June 13, 2019 2:16:56 PM
> > > > To: Parquet Dev
> > > > Cc: Raoofy, Amir; Karlstetter, Roman
> > > > Subject: Re: Floating point data compression for Apache Parquet
> > > >
> > > > Hi Martin,
> > > >
> > > > Thanks for your interest in improving Parquet. Efficient encodings are
> > > > really important in a big data file format, so this topic is
> > > > definitely worth researching and personally I am looking forward to
> > > > your report. Whether to add any new encodings to Parquet, however, can
> > > > not be answered until we see the results of your findings.
> > > >
> > > > You mention two paths. One has very small computational overhead but
> > > > does not provide significant space savings. The other provides
> > > > significant space savings but at the price of a significant
> > > > computational overhead. While purely based on these properties both of
> > > > them seem "balanced" (one is small effort, small gain; the other is
> > > > large effort, large gain) and therefore sound reasonable options, I
> > > > would argue that one should also consider development costs, code
> > > > complexity and compatibility implications when deciding about whether
> > > > a new feature is worth implementing.
> > > >
> > > > Adding a new encoding or compression to Parquet complicates the
> > > > specification of the file format and requires implementing it in every
> > > > language binding of the format, which is not only a considerable
> > > > effort, but is also error-prone (see LZ4 for an example, which was
> > > > added to both the Java and the C++ implementation of Parquet, yet they
> > > > are incompatible with each other). And lack of support is not only a
> > > > minor annoyance in this case: if one is forced to use an older reader
> > > > that does not support the new encoding yet (or a language binding that
> > > > does not support it at all), the data simply can not be read.
> > > >
> > > > In my opinion, no matter how low the computational overhead of a new
> > > > encoding is, if it does not provide significant gains, then the
> > > > specification clutter, implementation costs and the potential of
> > > > compatibility problems greatly outweigh its advantages. For this
> > > > reason, I would say that only encodings that provide significant gains
> > > > are worth adding. As far as I am concerned, such a new encoding would
> > > > be a welcome addition to Parquet.
> > > >
> > > > Thanks,
> > > >
> > > > Zoltan
> > > >
> > > > On Wed, Jun 12, 2019 at 11:10 PM Radev, Martin <martin.radev@tum.de
> > <ma...@tum.de>> wrote:
> > > >>
> > > >> Dear all,
> > > >>
> > > >> thank you for your work on the Apache Parquet format.
> > > >>
> > > >> We are a group of students at the Technical University of Munich who
> > would like to extend the available compression and encoding options for
> > 32-bit and 64-bit floating point data in Apache Parquet.
> > > >> The current encodings and compression algorithms offered in Apache
> > Parquet are heavily specialized towards integer and text data.
> > > >> Thus there is an opportunity in reducing both io throughput
> > requirements and space requirements for handling floating point data by
> > selecting a specialized compression algorithm.
> > > >>
> > > >> Currently, I am doing an investigation on the available literature
> > and publicly available fp compressors. In my investigation I am writing a
> > report on my findings - the available algorithms, their strengths and
> > weaknesses, compression rates, compression speeds and decompression speeds,
> > and licenses. Once finished I will share the report with you and make a
> > proposal which ones IMO are good candidates for Apache Parquet.
> > > >>
> > > >> The goal is to add a solution for both 32-bit and 64-bit fp types. I
> > think that it would be beneficial to offer at the very least two distinct
> > paths. The first one should offer fast compression and decompression speed
> > with some but not significant saving in space. The second one should offer
> > slower compression and decompression speed but with a decent compression
> > rate. Both lossless. A lossy path will be investigated further and
> > discussed with the community.
> > > >>
> > > >> If I get an approval from you – the developers – I can continue with
> > adding support for the new encoding/compression options in the C++
> > implementation of Apache Parquet in Apache Arrow.
> > > >>
> > > >> Please let me know what you think of this idea and whether you have
> > any concerns with the plan.
> > > >>
> > > >> Best regards,
> > > >> Martin Radev
> > >
> >

Re: Floating point data compression for Apache Parquet

Posted by Roman Karlstetter <ro...@gmail.com>.
Hi Wes,

what would be the formal or informal requirements for such a vote to pass?
What is needed in terms of code and specification before we can start such
a vote?

Roman

Am Fr., 12. Juli 2019 um 17:07 Uhr schrieb Wes McKinney <wesmckinn@gmail.com
>:

> I think we need to vote to make any changes to the Parquet format. New
> features carry a heavy responsibility
>
> On Fri, Jul 12, 2019 at 10:04 AM Michael Heuer <he...@gmail.com> wrote:
> >
> > Hello Martin,
> >
> > I'm willing to run some tests at scale on our genomics data when a
> parquet-mr pull request for the Java implementation is ready.
> >
> > Cheers,
> >
> >    michael
> >
> >
> > > On Jul 11, 2019, at 1:09 PM, Radev, Martin <ma...@tum.de>
> wrote:
> > >
> > > Dear all,
> > >
> > >
> > > I created a Jira issue for the new feature and also made a pull
> request for my patch which extends the format and documentation.
> > >
> > > Jira issue: https://issues.apache.org/jira/browse/PARQUET-1622 <
> https://issues.apache.org/jira/browse/PARQUET-1622>
> > > Pull request: https://github.com/apache/parquet-format/pull/144 <
> https://github.com/apache/parquet-format/pull/144>
> > >
> > >
> > > I also have a WIP patch for adding the "BYTE_STREAM_SPLIT" encoding to
> parquet-cpp within Apache Arrow.
> > >
> > >
> > > How should we proceed?
> > >
> > > It would be great to get feedback from other community members.
> > >
> > >
> > > Regards,
> > >
> > > Martin
> > >
> > >
> > >
> > >
> > > ________________________________
> > > From: Radev, Martin <martin.radev@tum.de <ma...@tum.de>>
> > > Sent: Tuesday, July 9, 2019 1:01:25 AM
> > > To: Zoltan Ivanfi
> > > Cc: Parquet Dev; Raoofy, Amir; Karlstetter, Roman
> > > Subject: Re: Floating point data compression for Apache Parquet
> > >
> > > Hello Zoltan,
> > >
> > >
> > > I can provide a C++ and Java implementation for the encoder.
> > >
> > > The encoder/decoder is very small, and naturally I have to add tests.
> > >
> > > I expect the biggest hurdle would be setting up the environment and
> reading through the developer guides.
> > >
> > >
> > > I will write my patches for Apache Arrow and for Apache Parquet and
> send them for review.
> > >
> > > After getting them in, I can continue with the Java implementation.
> > >
> > > Let me know if you have any concerns.
> > >
> > >
> > > It would be great to get an opinion from other Parquet contributors : )
> > >
> > >
> > > Thank you for the feedback!
> > >
> > >
> > > Best regards,
> > >
> > > Martin
> > >
> > > ________________________________
> > > From: Zoltan Ivanfi <zi...@cloudera.com>
> > > Sent: Monday, July 8, 2019 5:06:30 PM
> > > To: Radev, Martin
> > > Cc: Parquet Dev; Raoofy, Amir; Karlstetter, Roman
> > > Subject: Re: Floating point data compression for Apache Parquet
> > >
> > > Hi Martin,
> > >
> > > I agree that bs_zstd would be a good place to start. Regarding the
> choice of language, Java, C++ and Python are your options. As far as I
> know, the Java implementation of Parquet has more users from the business
> sector, where decimal is preferred over floating point data types. It is
> also much more tightly integrated with the Hadoop ecosystem (it is even
> called parquet-mr, as in MapReduce), making for a steeper learning curve.
> > >
> > > The Python and C++ language bindings have more scientific users, so
> users of these may be more interested in the new encodings. Python is a
> good language for rapid prototyping as well, but the Python binding of
> Parquet may use the C++ library under the hood, I'm not sure (I'm more
> familiar with the Java implementation). In any case, there are at least two
> Python bindings: pyarrow and fastparquet.
> > >
> > > I think we can extend the format before the actual implementations are
> ready, provided that the specification is clear and nobody objects to
> adding it to the format. For this, I would wait for the opinion of a few
> more Parquet developers first, since changes to the format that are only
> supported by a single committer usually have a hard time getting into the
> spec. Additionally, could you please clarify which language bindings you
> plan to implement yourself? This will help the developers of the different
> language bindings assess how much work they will have to do to add support.
> > >
> > > Thanks,
> > >
> > > Zoltan
> > >
> > >
> > > On Fri, Jul 5, 2019 at 4:34 PM Radev, Martin <martin.radev@tum.de
> <ma...@tum.de>> wrote:
> > >
> > > Hello Zoltan and Parquet devs,
> > >
> > >
> > > do you think it would be appropriate to start with a Parquet prototype
> from my side?
> > >
> > > I suspect that 'bs_zstd' would be the simplest to
> integrate, and from the report we can see an improvement in both ratio and
> speed.
> > >
> > >
> > > Do you think that Apache Arrow is an appropriate place to prototype
> the extension of the format?
> > >
> > > Do you agree that the enum field 'Encodings' is a suitable place to
> add the 'Byte stream-splitting transformation'? In that way it could be
> used with any of the other supported compressors.
> > >
> > > It might be best to also add a Java implementation of the
> transformation. Would the project 'parquet-mr' be a good place?
> > >
> > >
> > > Would the workflow be such that I write my patches, we verify for
> correctness, get reviews, merge them AND just then we make adjustments to
> the Apache Parquet spec?
> > >
> > >
> > > Any piece of advice is welcome!
> > >
> > >
> > > Regards,
> > >
> > > Martin
> > >
> > >
> > > ________________________________
> > > From: Zoltan Ivanfi <zi...@cloudera.com>>
> > > Sent: Friday, July 5, 2019 4:21:39 PM
> > > To: Radev, Martin
> > > Cc: Parquet Dev; Raoofy, Amir; Karlstetter, Roman
> > > Subject: Re: Floating point data compression for Apache Parquet
> > >
> > >
> > > Hi Martin,
> > >
> > > Thanks for the explanations, makes sense. Nice work!
> > >
> > > Br,
> > >
> > > Zoltan
> > >
> > > On Thu, Jul 4, 2019 at 12:22 AM Radev, Martin <martin.radev@tum.de
> <ma...@tum.de>> wrote:
> > >
> > > Hello Zoltan,
> > >
> > >
> > >> Is data pre-loaded to RAM before making the measurements?
> > > Yes, the file is read into physical memory.
> > >
> > > For mmap-ed files read from external storage, I would expect, though I am
> not 100% sure, that the IO overhead would be big enough that all algorithms
> compress at roughly the same speed.
> > >
> > >
> > >> In "Figure 3: Decompression speed in MB/s", is data size measured
> before or after uncompression?
> > >
> > >> In "Figure 4: Compression speed in MB/s", is data size measured
> before or after compression?
> > > For both the reported result is "size of the original file / time to
> compress or decompress".
> > >
> > >> According to "Figure 3: Decompression speed in MB/s", decompression
> of bs_zstd is almost twice as fast as plain zstd. Do you know what causes
> this massive speed improvement?
> > >
> > > I do not know all of the details. As you mentioned, the written out
> data is smaller, which could potentially lead to an improvement in speed as less
> data has to be written out to memory during compression or read from memory
> during decompression.
> > >
> > > Another thing to consider is that ZSTD uses different techniques to
> compress a block of data - "raw", "RLE", "Huffman coding", "Treeless
> coding".
> > >
> > > I expect that "Huffman coding" is more costly than "RLE" and I also
> expect "RLE" to be applicable for the majority of the sign bits, thus
> leading to a performance win when the transformation is applied.
> > >
> > >
> > > I also expect that zstd has to do some form of "optimal parsing" to
> decide how to process the input in order to compress it well. This is
> something every wanna-be-good LZ-like compressor has to do (
> https://martinradev.github.io/jekyll/update/2019/05/29/writing-a-pe32-x86-exe-packer.html
> ,
> http://cbloomrants.blogspot.com/2011/10/10-24-11-lz-optimal-parse-with-star.html
> ). It might be that the transformed input is simply easier to parse, which
> leads to faster compression, and that the resulting data is easier to
> decompress, which leads to faster decompression.
> > >
> > >
> > >
> > >
> > > I used this as a reference:
> https://www.rfc-editor.org/rfc/pdfrfc/rfc8478.txt.pdf. I am not familiar
> with ZSTD in particular.
> > >
> > >
> > > I also checked that the majority of the time is spent in zstd.
> > >
> > > Example run for msg_sweep3d.dp using zstd at level 1.
> > > - Transformation during compression: 0.086s, ZSTD compress on
> transformed data: 0.08s
> > >
> > > - regular ZSTD: 0.34s
> > > - ZSTD decompress from compressed transformed data: 0.067s,
> Transformation during decompression: 0.021s
> > > - regular ZSTD decompress: 0.24s
> > >
> > >
> > > Example run for msg_sweep3d.dp using zstd at level 20.
> > >
> > > - Transformation during compression: 0.083s, ZSTD compress on
> transformed data: 14.35s
> > >
> > > - regular ZSTD: 183s
> > > - ZSTD decompress from compressed transformed data: 0.075s,
> Transformation during decompression: 0.022s
> > > - regular ZSTD decompress: 0.31s
> > > Here it's clear that the transformed input is easier to parse
> (compress). Maybe the blocks are also of a type that takes less time to
> decompress.
> > >
> > >> If considering using existing libraries to provide any of the
> compression algorithms, license compatibility is also an important factor
> and therefore would be worth mentioning in Section 5.
> > > This is something I forgot to list. I will get back to you and the other
> devs with information.
> > >
> > > The filter I proposed for lossless compression can be integrated
> without any concerns for a license.
> > >
> > >
> > >> Are any of the investigated strategies applicable to DECIMAL values?
> > > The lossy compressors SZ and ZFP do not support that out of the
> box. I could communicate with the SZ developers to come to a decision on how
> this can be added to SZ. An option is to losslessly compress the
> integer part and lossily compress the fractional part.
> > >
> > > For lossless compression, we can apply a similar stream splitting
> technique for decimal types, though it might be somewhat more complex and I
> have not really thought about this case.
> > >
> > >
> > > Regards,
> > >
> > > Martin
> > >
> > > ________________________________
> > > From: Zoltan Ivanfi <zi...@cloudera.com>>
> > > Sent: Wednesday, July 3, 2019 6:07:50 PM
> > > To: Parquet Dev; Radev, Martin
> > > Cc: Raoofy, Amir; Karlstetter, Roman
> > > Subject: Re: Floating point data compression for Apache Parquet
> > >
> > > Hi Martin,
> > >
> > > Thanks for the thorough investigation, very nice report. I would have
> a few questions:
> > >
> > > - Is data pre-loaded to RAM before making the measurements?
> > >
> > > - In "Figure 3: Decompression speed in MB/s", is data size measured
> before or after uncompression?
> > >
> > > - In "Figure 4: Compression speed in MB/s", is data size measured
> before or after compression?
> > >
> > > - According to "Figure 3: Decompression speed in MB/s", decompression
> of bs_zstd is almost twice as fast as plain zstd. Do you know what causes
> this massive speed improvement? Based on the description provided in
> section 3.2, bs_zstd uses the same zstd compression with an extra step of
> splitting/combining streams. Since this is extra work, I would have
> expected bs_zstd to be slower than pure zstd, unless the compressed data
> becomes so much smaller that it radically improves data access times.
> However, according to "Figure 2: Compression ratio", bs_zstd achieves
> "only" 23% better compression than plain zstd, which can not be the reason
> for the 2x speed-up in itself.
> > >
> > > - If considering using existing libraries to provide any of the
> compression algorithms, license compatibility is also an important factor
> and therefore would be worth mentioning in Section 5.
> > >
> > > - Are any of the investigated strategies applicable to DECIMAL values?
> Since floating point values and calculations have an inherent inaccuracy,
> the DECIMAL type is much more important for storing financial data, which
> is one of the main use cases of Parquet.
> > >
> > > Thanks,
> > >
> > > Zoltan
> > >
> > > On Mon, Jul 1, 2019 at 10:57 PM Radev, Martin <martin.radev@tum.de
> <ma...@tum.de>> wrote:
> > > Hello folks,
> > >
> > >
> > > thank you for your input.
> > >
> > >
> > > I am finished with my investigation regarding introducing special
> support for FP compression in Apache Parquet.
> > >
> > > My report also includes an investigation of lossy compressors though
> there are still some things to be cleared out.
> > >
> > >
> > > Report:
> https://drive.google.com/open?id=1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv
> > >
> > >
> > > Sections 3 4 5 6 are the most important to go over.
> > >
> > >
> > > Let me know if you have any questions or concerns.
> > >
> > >
> > > Regards,
> > >
> > > Martin
> > >
> > > ________________________________
> > > From: Zoltan Ivanfi <zi...@cloudera.com.INVALID>
> > > Sent: Thursday, June 13, 2019 2:16:56 PM
> > > To: Parquet Dev
> > > Cc: Raoofy, Amir; Karlstetter, Roman
> > > Subject: Re: Floating point data compression for Apache Parquet
> > >
> > > Hi Martin,
> > >
> > > Thanks for your interest in improving Parquet. Efficient encodings are
> > > really important in a big data file format, so this topic is
> > > definitely worth researching and personally I am looking forward to
> > > your report. Whether to add any new encodings to Parquet, however, can
> > > not be answered until we see the results of your findings.
> > >
> > > You mention two paths. One has very small computational overhead but
> > > does not provide significant space savings. The other provides
> > > significant space savings but at the price of a significant
> > > computational overhead. While purely based on these properties both of
> > > them seem "balanced" (one is small effort, small gain; the other is
> > > large effort, large gain) and therefore sound reasonable options, I
> > > would argue that one should also consider development costs, code
> > > complexity and compatibility implications when deciding about whether
> > > a new feature is worth implementing.
> > >
> > > Adding a new encoding or compression to Parquet complicates the
> > > specification of the file format and requires implementing it in every
> > > language binding of the format, which is not only a considerable
> > > effort, but is also error-prone (see LZ4 for an example, which was
> > > added to both the Java and the C++ implementation of Parquet, yet they
> > > are incompatible with each other). And lack of support is not only a
> > > minor annoyance in this case: if one is forced to use an older reader
> > > that does not support the new encoding yet (or a language binding that
> > > does not support it at all), the data simply can not be read.
> > >
> > > In my opinion, no matter how low the computational overhead of a new
> > > encoding is, if it does not provide significant gains, then the
> > > specification clutter, implementation costs and the potential of
> > > compatibility problems greatly outweigh its advantages. For this
> > > reason, I would say that only encodings that provide significant gains
> > > are worth adding. As far as I am concerned, such a new encoding would
> > > be a welcome addition to Parquet.
> > >
> > > Thanks,
> > >
> > > Zoltan
> > >
> > > On Wed, Jun 12, 2019 at 11:10 PM Radev, Martin <martin.radev@tum.de
> <ma...@tum.de>> wrote:
> > >>
> > >> Dear all,
> > >>
> > >> thank you for your work on the Apache Parquet format.
> > >>
> > >> We are a group of students at the Technical University of Munich who
> would like to extend the available compression and encoding options for
> 32-bit and 64-bit floating point data in Apache Parquet.
> > >> The current encodings and compression algorithms offered in Apache
> Parquet are heavily specialized towards integer and text data.
> > >> Thus there is an opportunity in reducing both io throughput
> requirements and space requirements for handling floating point data by
> selecting a specialized compression algorithm.
> > >>
> > >> Currently, I am doing an investigation on the available literature
> and publicly available fp compressors. In my investigation I am writing a
> report on my findings - the available algorithms, their strengths and
> weaknesses, compression rates, compression speeds and decompression speeds,
> and licenses. Once finished I will share the report with you and make a
> proposal which ones IMO are good candidates for Apache Parquet.
> > >>
> > >> The goal is to add a solution for both 32-bit and 64-bit fp types. I
> think that it would be beneficial to offer at the very least two distinct
> paths. The first one should offer fast compression and decompression speed
> with some but not significant saving in space. The second one should offer
> slower compression and decompression speed but with a decent compression
> rate. Both lossless. A lossy path will be investigated further and
> discussed with the community.
> > >>
> > >> If I get an approval from you – the developers – I can continue with
> adding support for the new encoding/compression options in the C++
> implementation of Apache Parquet in Apache Arrow.
> > >>
> > >> Please let me know what you think of this idea and whether you have
> any concerns with the plan.
> > >>
> > >> Best regards,
> > >> Martin Radev
> >
>

Re: Floating point data compression for Apache Parquet

Posted by Wes McKinney <we...@gmail.com>.
I think we need to vote to make any changes to the Parquet format. New
features carry a heavy responsibility

On Fri, Jul 12, 2019 at 10:04 AM Michael Heuer <he...@gmail.com> wrote:
>
> Hello Martin,
>
> I'm willing to run some tests at scale on our genomics data when a parquet-mr pull request for the Java implementation is ready.
>
> Cheers,
>
>    michael
>
>
> > On Jul 11, 2019, at 1:09 PM, Radev, Martin <ma...@tum.de> wrote:
> >
> > Dear all,
> >
> >
> > I created a Jira issue for the new feature and also made a pull request for my patch which extends the format and documentation.
> >
> > Jira issue: https://issues.apache.org/jira/browse/PARQUET-1622 <https://issues.apache.org/jira/browse/PARQUET-1622>
> > Pull request: https://github.com/apache/parquet-format/pull/144 <https://github.com/apache/parquet-format/pull/144>
> >
> >
> > I also have a WIP patch for adding the "BYTE_STREAM_SPLIT" encoding to parquet-cpp within Apache Arrow.
> >
> >
> > How should we proceed?
> >
> > It would be great to get feedback from other community members.
> >
> >
> > Regards,
> >
> > Martin
> >
> >
> >
> >
> > ________________________________
> > From: Radev, Martin <martin.radev@tum.de <ma...@tum.de>>
> > Sent: Tuesday, July 9, 2019 1:01:25 AM
> > To: Zoltan Ivanfi
> > Cc: Parquet Dev; Raoofy, Amir; Karlstetter, Roman
> > Subject: Re: Floating point data compression for Apache Parquet
> >
> > Hello Zoltan,
> >
> >
> > I can provide a C++ and Java implementation for the encoder.
> >
> > The encoder/decoder is very small, and naturally I have to add tests.
> >
> > I expect the biggest hurdle would be setting up the environment and reading through the developer guides.
> >
> >
> > I will write my patches for Apache Arrow and for Apache Parquet and send them for review.
> >
> > After getting them in, I can continue with the Java implementation.
> >
> > Let me know if you have any concerns.
> >
> >
> > It would be great to get an opinion from other Parquet contributors : )
> >
> >
> > Thank you for the feedback!
> >
> >
> > Best regards,
> >
> > Martin
> >
> > ________________________________
> > From: Zoltan Ivanfi <zi...@cloudera.com>
> > Sent: Monday, July 8, 2019 5:06:30 PM
> > To: Radev, Martin
> > Cc: Parquet Dev; Raoofy, Amir; Karlstetter, Roman
> > Subject: Re: Floating point data compression for Apache Parquet
> >
> > Hi Martin,
> >
> > I agree that bs_zstd would be a good place to start. Regarding the choice of language, Java, C++ and Python are your options. As far as I know, the Java implementation of Parquet has more users from the business sector, where decimal is preferred over floating point data types. It is also much more tightly integrated with the Hadoop ecosystem (it is even called parquet-mr, as in MapReduce), making for a steeper learning curve.
> >
> > The Python and C++ language bindings have more scientific users, so users of these may be more interested in the new encodings. Python is a good language for rapid prototyping as well, but the Python binding of Parquet may use the C++ library under the hood, I'm not sure (I'm more familiar with the Java implementation). In any case, there are at least two Python bindings: pyarrow and fastparquet.
> >
> > I think we can extend the format before the actual implementations are ready, provided that the specification is clear and nobody objects to adding it to the format. For this, I would wait for the opinion of a few more Parquet developers first, since changes to the format that are only supported by a single committer usually have a hard time getting into the spec. Additionally, could you please clarify which language bindings you plan to implement yourself? This will help the developers of the different language bindings assess how much work they will have to do to add support.
> >
> > Thanks,
> >
> > Zoltan
> >
> >
> > On Fri, Jul 5, 2019 at 4:34 PM Radev, Martin <ma...@tum.de>> wrote:
> >
> > Hello Zoltan and Parquet devs,
> >
> >
> > do you think it would be appropriate to start with a Parquet prototype from my side?
> >
> > I suspect that 'bs_zstd' would be the simplest to integrate, and from the report we can see an improvement in both compression ratio and speed.
> >
> >
> > Do you think that Apache Arrow is an appropriate place to prototype the extension of the format?
> >
> > Do you agree that the enum field 'Encodings' is a suitable place to add the 'Byte stream-splitting transformation'? In that way it could be used with any of the other supported compressors.
> >
> > It might be best to also add a Java implementation of the transformation. Would the project 'parquet-mr' be a good place?
> >
> >
> > Would the workflow be such that I write my patches, we verify for correctness, get reviews, merge them, and only then make adjustments to the Apache Parquet spec?
> >
> >
> > Any piece of advice is welcome!
> >
> >
> > Regards,
> >
> > Martin
> >
> >
> > ________________________________
> > From: Zoltan Ivanfi <zi...@cloudera.com>>
> > Sent: Friday, July 5, 2019 4:21:39 PM
> > To: Radev, Martin
> > Cc: Parquet Dev; Raoofy, Amir; Karlstetter, Roman
> > Subject: Re: Floating point data compression for Apache Parquet
> >
> >
> > Hi Martin,
> >
> > Thanks for the explanations, makes sense. Nice work!
> >
> > Br,
> >
> > Zoltan
> >
> > On Thu, Jul 4, 2019 at 12:22 AM Radev, Martin <ma...@tum.de>> wrote:
> >
> > Hello Zoltan,
> >
> >
> >> Is data pre-loaded to RAM before making the measurements?
> > Yes, the file is read into physical memory.
> >
> > For mmap-ed files read from external storage, I would expect, though I am not 100% sure, that the I/O overhead would be large enough that all algorithms compress at roughly the same speed.
> >
> >
> >> In "Figure 3: Decompression speed in MB/s", is data size measured before or after uncompression?
> >
> >> In "Figure 4: Compression speed in MB/s", is data size measured before or after compression?
> > For both, the reported result is "size of the original file / time to compress or decompress".
> >
> >> According to "Figure 3: Decompression speed in MB/s", decompression of bs_zstd is almost twice as fast as plain zstd. Do you know what causes this massive speed improvement?
> >
> > I do not know all of the details. As you mentioned, the compressed output is smaller, which could lead to an improvement in speed since less data has to be written to memory during compression or read from memory during decompression.
> >
> > Another thing to consider is that ZSTD uses different techniques to compress a block of data - "raw", "RLE", "Huffman coding", "Treeless coding".
> >
> > I expect that "Huffman coding" is more costly than "RLE" and I also expect that "RLE" to be applicable for the majority of the sign bits thus leading to a performance win for when the transformation is applied.
> >
> >
> > I also expect that zstd has to do some form of "optimal parsing" to decide how to process the input in order to compress it well. This is something every wanna-be-good LZ-like compressor has to do ( https://martinradev.github.io/jekyll/update/2019/05/29/writing-a-pe32-x86-exe-packer.html , http://cbloomrants.blogspot.com/2011/10/10-24-11-lz-optimal-parse-with-star.html ). It might be that the transformed input is simply easier to parse, which leads to faster compression, and that the resulting compressed data is easier to decompress, which leads to faster decompression.
> >
> >
> >
> >
> > I used this as a reference: https://www.rfc-editor.org/rfc/pdfrfc/rfc8478.txt.pdf. I am not familiar with ZSTD in particular.
> >
> >
> > I also checked that the majority of the time is spent in zstd.
> >
> > Example run for msg_sweep3d.dp using zstd at level 1.
> > - Transformation during compression: 0.086s, ZSTD compress on transformed data: 0.08s
> >
> > - regular ZSTD: 0.34s
> > - ZSTD decompress from compressed transformed data: 0.067s, Transformation during decompression: 0.021s
> > - regular ZSTD decompress: 0.24s
> >
> >
> > Example run for msg_sweep3d.dp using zstd at level 20.
> >
> > - Transformation during compression: 0.083s, ZSTD compress on transformed data: 14.35s
> >
> > - regular ZSTD: 183s
> > - ZSTD decompress from compressed transformed data: 0.075s, Transformation during decompression: 0.022s
> > - regular ZSTD decompress: 0.31s
> > Here it's clear that the transformed input is easier to parse (compress). Maybe the blocks are also of a type which takes less time to decompress.
> >
> >> If considering using existing libraries to provide any of the compression algorithms, license compatibility is also an important factor and therefore would be worth mentioning in Section 5.
> > This is something I forgot to list. I will get back to you and the other devs with that information.
> >
> > The filter I proposed for lossless compression can be integrated without any concerns for a license.
> >
> >
> >> Are any of the investigated strategies applicable to DECIMAL values?
> > The lossy compressors SZ and ZFP do not support that out of the box. I could communicate with the SZ developers to decide how this could be added to SZ. An option is to losslessly compress the integer part and lossily compress the fractional part.
> >
> > For lossless compression, we can apply a similar stream splitting technique to decimal types, though it might be somewhat more complex and I have not really thought about this case.
> >
> >
> > Regards,
> >
> > Martin
> >
> > ________________________________
> > From: Zoltan Ivanfi <zi...@cloudera.com>>
> > Sent: Wednesday, July 3, 2019 6:07:50 PM
> > To: Parquet Dev; Radev, Martin
> > Cc: Raoofy, Amir; Karlstetter, Roman
> > Subject: Re: Floating point data compression for Apache Parquet
> >
> > Hi Martin,
> >
> > Thanks for the thorough investigation, very nice report. I would have a few questions:
> >
> > - Is data pre-loaded to RAM before making the measurements?
> >
> > - In "Figure 3: Decompression speed in MB/s", is data size measured before or after uncompression?
> >
> > - In "Figure 4: Compression speed in MB/s", is data size measured before or after compression?
> >
> > - According to "Figure 3: Decompression speed in MB/s", decompression of bs_zstd is almost twice as fast as plain zstd. Do you know what causes this massive speed improvement? Based on the description provided in section 3.2, bs_zstd uses the same zstd compression with an extra step of splitting/combining streams. Since this is extra work, I would have expected bs_zstd to be slower than pure zstd, unless the compressed data becomes so much smaller that it radically improves data access times. However, according to "Figure 2: Compression ratio", bs_zstd achieves "only" 23% better compression than plain zstd, which can not be the reason for the 2x speed-up in itself.
> >
> > - If considering using existing libraries to provide any of the compression algorithms, license compatibility is also an important factor and therefore would be worth mentioning in Section 5.
> >
> > - Are any of the investigated strategies applicable to DECIMAL values? Since floating point values and calculations have an inherent inaccuracy, the DECIMAL type is much more important for storing financial data, which is one of the main use cases of Parquet.
> >
> > Thanks,
> >
> > Zoltan
> >
> > On Mon, Jul 1, 2019 at 10:57 PM Radev, Martin <ma...@tum.de>> wrote:
> > Hello folks,
> >
> >
> > thank you for your input.
> >
> >
> > I am finished with my investigation regarding introducing special support for FP compression in Apache Parquet.
> >
> > My report also includes an investigation of lossy compressors though there are still some things to be cleared out.
> >
> >
> > Report: https://drive.google.com/open?id=1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv
> >
> >
> > Sections 3, 4, 5 and 6 are the most important to go over.
> >
> >
> > Let me know if you have any questions or concerns.
> >
> >
> > Regards,
> >
> > Martin
> >
> > ________________________________
> > From: Zoltan Ivanfi <zi...@cloudera.com.INVALID>
> > Sent: Thursday, June 13, 2019 2:16:56 PM
> > To: Parquet Dev
> > Cc: Raoofy, Amir; Karlstetter, Roman
> > Subject: Re: Floating point data compression for Apache Parquet
> >
> > Hi Martin,
> >
> > Thanks for your interest in improving Parquet. Efficient encodings are
> > really important in a big data file format, so this topic is
> > definitely worth researching and personally I am looking forward to
> > your report. Whether to add any new encodings to Parquet, however, can
> > not be answered until we see the results of your findings.
> >
> > You mention two paths. One has very small computational overhead but
> > does not provide significant space savings. The other provides
> > significant space savings but at the price of a significant
> > computational overhead. While purely based on these properties both of
> > them seem "balanced" (one is small effort, small gain; the other is
> > large effort, large gain) and therefore sound reasonable options, I
> > would argue that one should also consider development costs, code
> > complexity and compatibility implications when deciding about whether
> > a new feature is worth implementing.
> >
> > Adding a new encoding or compression to Parquet complicates the
> > specification of the file format and requires implementing it in every
> > language binding of the format, which is not only a considerable
> > effort, but is also error-prone (see LZ4 for an example, which was
> > added to both the Java and the C++ implementation of Parquet, yet they
> > are incompatible with each other). And lack of support is not only a
> > minor annoyance in this case: if one is forced to use an older reader
> > that does not support the new encoding yet (or a language binding that
> > does not support it at all), the data simply can not be read.
> >
> > In my opinion, no matter how low the computational overhead of a new
> > encoding is, if it does not provide significant gains, then the
> > specification clutter, implementation costs and the potential of
> > compatibility problems greatly outweigh its advantages. For this
> > reason, I would say that only encodings that provide significant gains
> > are worth adding. As far as I am concerned, such a new encoding would
> > be a welcome addition to Parquet.
> >
> > Thanks,
> >
> > Zoltan
> >
> > On Wed, Jun 12, 2019 at 11:10 PM Radev, Martin <ma...@tum.de>> wrote:
> >>
> >> Dear all,
> >>
> >> thank you for your work on the Apache Parquet format.
> >>
> >> We are a group of students at the Technical University of Munich who would like to extend the available compression and encoding options for 32-bit and 64-bit floating point data in Apache Parquet.
> >> The current encodings and compression algorithms offered in Apache Parquet are heavily specialized towards integer and text data.
> >> Thus there is an opportunity in reducing both io throughput requirements and space requirements for handling floating point data by selecting a specialized compression algorithm.
> >>
> >> Currently, I am doing an investigation on the available literature and publicly available fp compressors. In my investigation I am writing a report on my findings - the available algorithms, their strengths and weaknesses, compression rates, compression speeds and decompression speeds, and licenses. Once finished I will share the report with you and make a proposal which ones IMO are good candidates for Apache Parquet.
> >>
> >> The goal is to add a solution for both 32-bit and 64-bit fp types. I think that it would be beneficial to offer at the very least two distinct paths. The first one should offer fast compression and decompression speed with some but not significant saving in space. The second one should offer slower compression and decompression speed but with a decent compression rate. Both lossless. A lossy path will be investigated further and discussed with the community.
> >>
> >> If I get an approval from you – the developers – I can continue with adding support for the new encoding/compression options in the C++ implementation of Apache Parquet in Apache Arrow.
> >>
> >> Please let me know what you think of this idea and whether you have any concerns with the plan.
> >>
> >> Best regards,
> >> Martin Radev
>

Re: Floating point data compression for Apache Parquet

Posted by Michael Heuer <he...@gmail.com>.
Hello Martin,

I'm willing to run some tests at scale on our genomics data when a parquet-mr pull request for the Java implementation is ready.

Cheers,

   michael


> On Jul 11, 2019, at 1:09 PM, Radev, Martin <ma...@tum.de> wrote:
> 
> Dear all,
> 
> 
> I created a Jira issue for the new feature and also made a pull request for my patch which extends the format and documentation.
> 
> Jira issue: https://issues.apache.org/jira/browse/PARQUET-1622 <https://issues.apache.org/jira/browse/PARQUET-1622>
> Pull request: https://github.com/apache/parquet-format/pull/144 <https://github.com/apache/parquet-format/pull/144>
> 
> 
> I also have a WIP patch for adding the "BYTE_STREAM_SPLIT" encoding to parquet-cpp within Apache Arrow.
> 
> 
> How should we proceed?
> 
> It would be great to get feedback from other community members.
> 
> 
> Regards,
> 
> Martin
> 
> 
> 
> 
> ________________________________
> From: Radev, Martin <martin.radev@tum.de <ma...@tum.de>>
> Sent: Tuesday, July 9, 2019 1:01:25 AM
> To: Zoltan Ivanfi
> Cc: Parquet Dev; Raoofy, Amir; Karlstetter, Roman
> Subject: Re: Floating point data compression for Apache Parquet
> 
> Hello Zoltan,
> 
> 
> I can provide a C++ and Java implementation for the encoder.
> 
> The encoder/decoder is very small, and naturally I have to add tests.
> 
> I expect the biggest hurdle would be setting up the environment and reading through the developer guides.
> 
> 
> I will write my patches for Apache Arrow and for Apache Parquet and send them for review.
> 
> After getting them in, I can continue with the Java implementation.
> 
> Let me know if you have any concerns.
> 
> 
> It would be great to get an opinion from other Parquet contributors : )
> 
> 
> Thank you for the feedback!
> 
> 
> Best regards,
> 
> Martin
> 
> ________________________________
> From: Zoltan Ivanfi <zi...@cloudera.com>
> Sent: Monday, July 8, 2019 5:06:30 PM
> To: Radev, Martin
> Cc: Parquet Dev; Raoofy, Amir; Karlstetter, Roman
> Subject: Re: Floating point data compression for Apache Parquet
> 
> Hi Martin,
> 
> I agree that bs_zstd would be a good place to start. Regarding the choice of language, Java, C++ and Python are your options. As far as I know, the Java implementation of Parquet has more users from the business sector, where decimal is preferred over floating point data types. It is also much more tightly integrated with the Hadoop ecosystem (it is even called parquet-mr, as in MapReduce), making for a steeper learning curve.
> 
> The Python and C++ language bindings have more scientific users, so users of these may be more interested in the new encodings. Python is a good language for rapid prototyping as well, but the Python binding of Parquet may use the C++ library under the hood, I'm not sure (I'm more familiar with the Java implementation). In any case, there are at least two Python bindings: pyarrow and fastparquet.
> 
> I think we can extend the format before the actual implementations are ready, provided that the specification is clear and nobody objects to adding it to the format. For this, I would wait for the opinion of a few more Parquet developers first, since changes to the format that are only supported by a single committer usually have a hard time getting into the spec. Additionally, could you please clarify which language bindings you plan to implement yourself? This will help the developers of the different language bindings assess how much work they will have to do to add support.
> 
> Thanks,
> 
> Zoltan
> 
> 
> On Fri, Jul 5, 2019 at 4:34 PM Radev, Martin <ma...@tum.de>> wrote:
> 
> Hello Zoltan and Parquet devs,
> 
> 
> do you think it would be appropriate to start with a Parquet prototype from my side?
> 
> I suspect that 'bs_zstd' would be the simplest to integrate, and from the report we can see an improvement in both compression ratio and speed.
> 
> 
> Do you think that Apache Arrow is an appropriate place to prototype the extension of the format?
> 
> Do you agree that the enum field 'Encodings' is a suitable place to add the 'Byte stream-splitting transformation'? In that way it could be used with any of the other supported compressors.
> 
> It might be best to also add a Java implementation of the transformation. Would the project 'parquet-mr' be a good place?
> 
> 
> Would the workflow be such that I write my patches, we verify for correctness, get reviews, merge them, and only then make adjustments to the Apache Parquet spec?
> 
> 
> Any piece of advice is welcome!
> 
> 
> Regards,
> 
> Martin
> 
> 
> ________________________________
> From: Zoltan Ivanfi <zi...@cloudera.com>>
> Sent: Friday, July 5, 2019 4:21:39 PM
> To: Radev, Martin
> Cc: Parquet Dev; Raoofy, Amir; Karlstetter, Roman
> Subject: Re: Floating point data compression for Apache Parquet
> 
> 
> Hi Martin,
> 
> Thanks for the explanations, makes sense. Nice work!
> 
> Br,
> 
> Zoltan
> 
> On Thu, Jul 4, 2019 at 12:22 AM Radev, Martin <ma...@tum.de>> wrote:
> 
> Hello Zoltan,
> 
> 
>> Is data pre-loaded to RAM before making the measurements?
> Yes, the file is read into physical memory.
> 
> For mmap-ed files read from external storage, I would expect, though I am not 100% sure, that the I/O overhead would be large enough that all algorithms compress at roughly the same speed.
> 
> 
>> In "Figure 3: Decompression speed in MB/s", is data size measured before or after uncompression?
> 
>> In "Figure 4: Compression speed in MB/s", is data size measured before or after compression?
> For both, the reported result is "size of the original file / time to compress or decompress".
> 
>> According to "Figure 3: Decompression speed in MB/s", decompression of bs_zstd is almost twice as fast as plain zstd. Do you know what causes this massive speed improvement?
> 
> I do not know all of the details. As you mentioned, the compressed output is smaller, which could lead to an improvement in speed since less data has to be written to memory during compression or read from memory during decompression.
> 
> Another thing to consider is that ZSTD uses different techniques to compress a block of data - "raw", "RLE", "Huffman coding", "Treeless coding".
> 
> I expect that "Huffman coding" is more costly than "RLE" and I also expect that "RLE" to be applicable for the majority of the sign bits thus leading to a performance win for when the transformation is applied.
> 
> 
> I also expect that zstd has to do some form of "optimal parsing" to decide how to process the input in order to compress it well. This is something every wanna-be-good LZ-like compressor has to do ( https://martinradev.github.io/jekyll/update/2019/05/29/writing-a-pe32-x86-exe-packer.html , http://cbloomrants.blogspot.com/2011/10/10-24-11-lz-optimal-parse-with-star.html ). It might be that the transformed input is simply easier to parse, which leads to faster compression, and that the resulting compressed data is easier to decompress, which leads to faster decompression.
> 
> 
> 
> 
> I used this as a reference: https://www.rfc-editor.org/rfc/pdfrfc/rfc8478.txt.pdf. I am not familiar with ZSTD in particular.
> 
> 
> I also checked that the majority of the time is spent in zstd.
> 
> Example run for msg_sweep3d.dp using zstd at level 1.
> - Transformation during compression: 0.086s, ZSTD compress on transformed data: 0.08s
> 
> - regular ZSTD: 0.34s
> - ZSTD decompress from compressed transformed data: 0.067s, Transformation during decompression: 0.021s
> - regular ZSTD decompress: 0.24s
> 
> 
> Example run for msg_sweep3d.dp using zstd at level 20.
> 
> - Transformation during compression: 0.083s, ZSTD compress on transformed data: 14.35s
> 
> - regular ZSTD: 183s
> - ZSTD decompress from compressed transformed data: 0.075s, Transformation during decompression: 0.022s
> - regular ZSTD decompress: 0.31s
> Here it's clear that the transformed input is easier to parse (compress). Maybe the blocks are also of a type which takes less time to decompress.
> 
>> If considering using existing libraries to provide any of the compression algorithms, license compatibility is also an important factor and therefore would be worth mentioning in Section 5.
> This is something I forgot to list. I will get back to you and the other devs with that information.
> 
> The filter I proposed for lossless compression can be integrated without any concerns for a license.
> 
> 
>> Are any of the investigated strategies applicable to DECIMAL values?
> The lossy compressors SZ and ZFP do not support that out of the box. I could communicate with the SZ developers to decide how this could be added to SZ. An option is to losslessly compress the integer part and lossily compress the fractional part.
> 
> For lossless compression, we can apply a similar stream splitting technique to decimal types, though it might be somewhat more complex and I have not really thought about this case.
> 
> 
> Regards,
> 
> Martin
> 
> ________________________________
> From: Zoltan Ivanfi <zi...@cloudera.com>>
> Sent: Wednesday, July 3, 2019 6:07:50 PM
> To: Parquet Dev; Radev, Martin
> Cc: Raoofy, Amir; Karlstetter, Roman
> Subject: Re: Floating point data compression for Apache Parquet
> 
> Hi Martin,
> 
> Thanks for the thorough investigation, very nice report. I would have a few questions:
> 
> - Is data pre-loaded to RAM before making the measurements?
> 
> - In "Figure 3: Decompression speed in MB/s", is data size measured before or after uncompression?
> 
> - In "Figure 4: Compression speed in MB/s", is data size measured before or after compression?
> 
> - According to "Figure 3: Decompression speed in MB/s", decompression of bs_zstd is almost twice as fast as plain zstd. Do you know what causes this massive speed improvement? Based on the description provided in section 3.2, bs_zstd uses the same zstd compression with an extra step of splitting/combining streams. Since this is extra work, I would have expected bs_zstd to be slower than pure zstd, unless the compressed data becomes so much smaller that it radically improves data access times. However, according to "Figure 2: Compression ratio", bs_zstd achieves "only" 23% better compression than plain zstd, which can not be the reason for the 2x speed-up in itself.
> 
> - If considering using existing libraries to provide any of the compression algorithms, license compatibility is also an important factor and therefore would be worth mentioning in Section 5.
> 
> - Are any of the investigated strategies applicable to DECIMAL values? Since floating point values and calculations have an inherent inaccuracy, the DECIMAL type is much more important for storing financial data, which is one of the main use cases of Parquet.
> 
> Thanks,
> 
> Zoltan
> 
> On Mon, Jul 1, 2019 at 10:57 PM Radev, Martin <ma...@tum.de>> wrote:
> Hello folks,
> 
> 
> thank you for your input.
> 
> 
> I am finished with my investigation regarding introducing special support for FP compression in Apache Parquet.
> 
> My report also includes an investigation of lossy compressors though there are still some things to be cleared out.
> 
> 
> Report: https://drive.google.com/open?id=1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv
> 
> 
> Sections 3, 4, 5 and 6 are the most important to go over.
> 
> 
> Let me know if you have any questions or concerns.
> 
> 
> Regards,
> 
> Martin
> 
> ________________________________
> From: Zoltan Ivanfi <zi...@cloudera.com.INVALID>
> Sent: Thursday, June 13, 2019 2:16:56 PM
> To: Parquet Dev
> Cc: Raoofy, Amir; Karlstetter, Roman
> Subject: Re: Floating point data compression for Apache Parquet
> 
> Hi Martin,
> 
> Thanks for your interest in improving Parquet. Efficient encodings are
> really important in a big data file format, so this topic is
> definitely worth researching and personally I am looking forward to
> your report. Whether to add any new encodings to Parquet, however, can
> not be answered until we see the results of your findings.
> 
> You mention two paths. One has very small computational overhead but
> does not provide significant space savings. The other provides
> significant space savings but at the price of a significant
> computational overhead. While purely based on these properties both of
> them seem "balanced" (one is small effort, small gain; the other is
> large effort, large gain) and therefore sound reasonable options, I
> would argue that one should also consider development costs, code
> complexity and compatibility implications when deciding about whether
> a new feature is worth implementing.
> 
> Adding a new encoding or compression to Parquet complicates the
> specification of the file format and requires implementing it in every
> language binding of the format, which is not only a considerable
> effort, but is also error-prone (see LZ4 for an example, which was
> added to both the Java and the C++ implementation of Parquet, yet they
> are incompatible with each other). And lack of support is not only a
> minor annoyance in this case: if one is forced to use an older reader
> that does not support the new encoding yet (or a language binding that
> does not support it at all), the data simply can not be read.
> 
> In my opinion, no matter how low the computational overhead of a new
> encoding is, if it does not provide significant gains, then the
> specification clutter, implementation costs and the potential of
> compatibility problems greatly outweigh its advantages. For this
> reason, I would say that only encodings that provide significant gains
> are worth adding. As far as I am concerned, such a new encoding would
> be a welcome addition to Parquet.
> 
> Thanks,
> 
> Zoltan
> 
> On Wed, Jun 12, 2019 at 11:10 PM Radev, Martin <ma...@tum.de>> wrote:
>> 
>> Dear all,
>> 
>> thank you for your work on the Apache Parquet format.
>> 
>> We are a group of students at the Technical University of Munich who would like to extend the available compression and encoding options for 32-bit and 64-bit floating point data in Apache Parquet.
>> The current encodings and compression algorithms offered in Apache Parquet are heavily specialized towards integer and text data.
>> Thus there is an opportunity in reducing both io throughput requirements and space requirements for handling floating point data by selecting a specialized compression algorithm.
>> 
>> Currently, I am doing an investigation on the available literature and publicly available fp compressors. In my investigation I am writing a report on my findings - the available algorithms, their strengths and weaknesses, compression rates, compression speeds and decompression speeds, and licenses. Once finished I will share the report with you and make a proposal which ones IMO are good candidates for Apache Parquet.
>> 
>> The goal is to add a solution for both 32-bit and 64-bit fp types. I think that it would be beneficial to offer at the very least two distinct paths. The first one should offer fast compression and decompression speed with some but not significant saving in space. The second one should offer slower compression and decompression speed but with a decent compression rate. Both lossless. A lossy path will be investigated further and discussed with the community.
>> 
>> If I get an approval from you – the developers – I can continue with adding support for the new encoding/compression options in the C++ implementation of Apache Parquet in Apache Arrow.
>> 
>> Please let me know what you think of this idea and whether you have any concerns with the plan.
>> 
>> Best regards,
>> Martin Radev


Re: Floating point data compression for Apache Parquet

Posted by "Radev, Martin" <ma...@tum.de>.
Dear all,


I created a Jira issue for the new feature and also made a pull request for my patch which extends the format and documentation.

Jira issue: https://issues.apache.org/jira/browse/PARQUET-1622
Pull request: https://github.com/apache/parquet-format/pull/144


I also have a WIP patch for adding the "BYTE_STREAM_SPLIT" encoding to parquet-cpp within Apache Arrow.
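
For anyone who has not gone through the report, below is a minimal sketch of the idea behind the encoding, operating on the raw bytes of FLOAT/DOUBLE values. The function names are illustrative only and do not reflect the actual parquet-cpp patch: each k-byte value is scattered into k separate byte streams laid out back to back, which are then handed to the usual compressor; decoding simply gathers the bytes back.

// Minimal sketch of the byte stream split transformation (illustrative only).
#include <cstddef>
#include <cstdint>
#include <vector>

// Scatter the k bytes of each value into k streams laid out back to back:
// all byte 0s first, then all byte 1s, and so on.
std::vector<uint8_t> byte_stream_split_encode(const uint8_t* values,
                                              size_t num_values,
                                              size_t value_size) {
  std::vector<uint8_t> out(num_values * value_size);
  for (size_t i = 0; i < num_values; ++i)
    for (size_t b = 0; b < value_size; ++b)
      out[b * num_values + i] = values[i * value_size + b];
  return out;
}

// Gather the streams back into the original interleaved representation.
std::vector<uint8_t> byte_stream_split_decode(const uint8_t* streams,
                                              size_t num_values,
                                              size_t value_size) {
  std::vector<uint8_t> out(num_values * value_size);
  for (size_t i = 0; i < num_values; ++i)
    for (size_t b = 0; b < value_size; ++b)
      out[i * value_size + b] = streams[b * num_values + i];
  return out;
}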


How should we proceed?

It would be great to get feedback from other community members.


Regards,

Martin




________________________________
From: Radev, Martin <ma...@tum.de>
Sent: Tuesday, July 9, 2019 1:01:25 AM
To: Zoltan Ivanfi
Cc: Parquet Dev; Raoofy, Amir; Karlstetter, Roman
Subject: Re: Floating point data compression for Apache Parquet

Hello Zoltan,


I can provide a C++ and Java implementation for the encoder.

The encoder/decoder is very small, and naturally I have to add tests.

I expect the biggest hurdle would be setting up the environment and reading through the developer guides.


I will write my patches for Apache Arrow and for Apache Parquet and send them for review.

After getting them in, I can continue with the Java implementation.

Let me know if you have any concerns.


It would be great to get an opinion from other Parquet contributors : )


Thank you for the feedback!


Best regards,

Martin

________________________________
From: Zoltan Ivanfi <zi...@cloudera.com>
Sent: Monday, July 8, 2019 5:06:30 PM
To: Radev, Martin
Cc: Parquet Dev; Raoofy, Amir; Karlstetter, Roman
Subject: Re: Floating point data compression for Apache Parquet

Hi Martin,

I agree that bs_zstd would be a good place to start. Regarding the choice of language, Java, C++ and Python are your options. As far as I know, the Java implementation of Parquet has more users from the business sector, where decimal is preferred over floating point data types. It is also much more tightly integrated with the Hadoop ecosystem (it is even called parquet-mr, as in MapReduce), making for a steeper learning curve.

The Python and C++ language bindings have more scientific users, so users of these may be more interested in the new encodings. Python is a good language for rapid prototyping as well, but the Python binding of Parquet may use the C++ library under the hood, I'm not sure (I'm more familiar with the Java implementation). In any case, there are at least two Python bindings: pyarrow and fastparquet.

I think we can extend the format before the actual implementations are ready, provided that the specification is clear and nobody objects to adding it to the format. For this, I would wait for the opinion of a few more Parquet developers first, since changes to the format that are only supported by a single committer usually have a hard time getting into the spec. Additionally, could you please clarify which language bindings you plan to implement yourself? This will help the developers of the different language bindings assess how much work they will have to do to add support.

Thanks,

Zoltan


On Fri, Jul 5, 2019 at 4:34 PM Radev, Martin <ma...@tum.de>> wrote:

Hello Zoltan and Parquet devs,


do you think it would be appropriate to start with a Parquet prototype from my side?

I suspect that 'bs_zstd' would be the simplest to integrate, and from the report we can see an improvement in both compression ratio and speed.


Do you think that Apache Arrow is an appropriate place to prototype the extension of the format?

Do you agree that the enum field 'Encodings' is a suitable place to add the 'Byte stream-splitting transformation'? In that way it could be used with any of the other supported compressors.

It might be best to also add a Java implementation of the transformation. Would the project 'parquet-mr' be a good place?


Would the workflow be such that I write my patches, we verify for correctness, get reviews, merge them, and only then make adjustments to the Apache Parquet spec?


Any piece of advice is welcome!


Regards,

Martin


________________________________
From: Zoltan Ivanfi <zi...@cloudera.com>>
Sent: Friday, July 5, 2019 4:21:39 PM
To: Radev, Martin
Cc: Parquet Dev; Raoofy, Amir; Karlstetter, Roman
Subject: Re: Floating point data compression for Apache Parquet


Hi Martin,

Thanks for the explanations, makes sense. Nice work!

Br,

Zoltan

On Thu, Jul 4, 2019 at 12:22 AM Radev, Martin <ma...@tum.de>> wrote:

Hello Zoltan,


> Is data pre-loaded to RAM before making the measurements?
Yes, the file is read into physical memory.

For mmap-ed files read from external storage, I would expect, though I am not 100% sure, that the I/O overhead would be large enough that all algorithms compress at roughly the same speed.


>In "Figure 3: Decompression speed in MB/s", is data size measured before or after uncompression?

> In "Figure 4: Compression speed in MB/s", is data size measured before or after compression?
For both, the reported result is "size of the original file / time to compress or decompress".

> According to "Figure 3: Decompression speed in MB/s", decompression of bs_zstd is almost twice as fast as plain zstd. Do you know what causes this massive speed improvement?

I do not know all of the details. As you mentioned, the compressed output is smaller, which could lead to an improvement in speed since less data has to be written to memory during compression or read from memory during decompression.

Another thing to consider is that ZSTD uses different techniques to compress a block of data - "raw", "RLE", "Huffman coding", "Treeless coding".

I expect that "Huffman coding" is more costly than "RLE" and I also expect that "RLE" to be applicable for the majority of the sign bits thus leading to a performance win for when the transformation is applied.


I also expect that zstd has to do some form of "optimal parsing" to decide how to process the input in order to compress it well. This is something every wanna-be-good LZ-like compressor has to do ( https://martinradev.github.io/jekyll/update/2019/05/29/writing-a-pe32-x86-exe-packer.html , http://cbloomrants.blogspot.com/2011/10/10-24-11-lz-optimal-parse-with-star.html ). It might be that the transformed input is simply easier to parse, which leads to faster compression, and that the resulting compressed data is easier to decompress, which leads to faster decompression.




I used this as a reference: https://www.rfc-editor.org/rfc/pdfrfc/rfc8478.txt.pdf. I am not familiar with ZSTD in particular.


I also checked that the majority of the time is spent in zstd.

Example run for msg_sweep3d.dp using zstd at level 1.
- Transformation during compression: 0.086s, ZSTD compress on transformed data: 0.08s

- regular ZSTD: 0.34s
- ZSTD decompress from compressed transformed data: 0.067s, Transformation during decompression: 0.021s
- regular ZSTD decompress: 0.24s


Example run for msg_sweep3d.dp using zstd at level 20.

- Transformation during compression: 0.083s, ZSTD compress on transformed data: 14.35s

- regular ZSTD: 183s
- ZSTD decompress from compressed transformed data: 0.075s, Transformation during decompression: 0.022s
- regular ZSTD decompress: 0.31s
Here it's clear that the transformed input is easier to parse (compress). Maybe the blocks are also of a type which takes less time to decompress.
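
A rough sketch of how such a bs_zstd measurement can be reproduced is shown below; it assumes libzstd is available, times the split and the ZSTD_compress call separately, and uses a zero-filled buffer merely as a stand-in for the real msg_sweep3d.dp data.

// Sketch of a bs_zstd benchmark: naive byte stream split followed by zstd,
// with speed reported as original size / time, matching the report.
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <vector>
#include <zstd.h>

int main() {
  const size_t value_size = 8;                  // 64-bit doubles
  std::vector<uint8_t> input(8 * 1000 * 1000);  // stand-in for real data
  const size_t n = input.size() / value_size;

  // Byte stream split: all byte 0s first, then all byte 1s, and so on.
  auto t0 = std::chrono::steady_clock::now();
  std::vector<uint8_t> split(input.size());
  for (size_t i = 0; i < n; ++i)
    for (size_t b = 0; b < value_size; ++b)
      split[b * n + i] = input[i * value_size + b];
  std::chrono::duration<double> t_split = std::chrono::steady_clock::now() - t0;

  // Compress the transformed buffer with plain zstd at level 1.
  std::vector<uint8_t> out(ZSTD_compressBound(split.size()));
  t0 = std::chrono::steady_clock::now();
  size_t csize = ZSTD_compress(out.data(), out.size(), split.data(), split.size(), 1);
  std::chrono::duration<double> t_zstd = std::chrono::steady_clock::now() - t0;
  if (ZSTD_isError(csize)) return 1;

  std::printf("split %.3fs, zstd %.3fs, ratio %.2f, %.1f MB/s\n",
              t_split.count(), t_zstd.count(), double(input.size()) / csize,
              input.size() / 1e6 / (t_split.count() + t_zstd.count()));
  return 0;
}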

> If considering using existing libraries to provide any of the compression algorithms, license compatibility is also an important factor and therefore would be worth mentioning in Section 5.
This is something I forgot to list. I will get back to you and the other devs with that information.

The filter I proposed for lossless compression can be integrated without any concerns for a license.


> Are any of the investigated strategies applicable to DECIMAL values?
The lossy compressors SZ and ZFP do not support that out of the box. I could communicate with the SZ developers to decide how this could be added to SZ. An option is to losslessly compress the integer part and lossily compress the fractional part.

For lossless compression, we can apply a similar stream splitting technique to decimal types, though it might be somewhat more complex and I have not really thought about this case.


Regards,

Martin

________________________________
From: Zoltan Ivanfi <zi...@cloudera.com>>
Sent: Wednesday, July 3, 2019 6:07:50 PM
To: Parquet Dev; Radev, Martin
Cc: Raoofy, Amir; Karlstetter, Roman
Subject: Re: Floating point data compression for Apache Parquet

Hi Martin,

Thanks for the thorough investigation, very nice report. I would have a few questions:

- Is data pre-loaded to RAM before making the measurements?

- In "Figure 3: Decompression speed in MB/s", is data size measured before or after uncompression?

- In "Figure 4: Compression speed in MB/s", is data size measured before or after compression?

- According to "Figure 3: Decompression speed in MB/s", decompression of bs_zstd is almost twice as fast as plain zstd. Do you know what causes this massive speed improvement? Based on the description provided in section 3.2, bs_zstd uses the same zstd compression with an extra step of splitting/combining streams. Since this is extra work, I would have expected bs_zstd to be slower than pure zstd, unless the compressed data becomes so much smaller that it radically improves data access times. However, according to "Figure 2: Compression ratio", bs_zstd achieves "only" 23% better compression than plain zstd, which can not be the reason for the 2x speed-up in itself.

- If considering using existing libraries to provide any of the compression algorithms, license compatibility is also an important factor and therefore would be worth mentioning in Section 5.

- Are any of the investigated strategies applicable to DECIMAL values? Since floating point values and calculations have an inherent inaccuracy, the DECIMAL type is much more important for storing financial data, which is one of the main use cases of Parquet.

Thanks,

Zoltan

On Mon, Jul 1, 2019 at 10:57 PM Radev, Martin <ma...@tum.de>> wrote:
Hello folks,


thank you for your input.


I am finished with my investigation regarding introducing special support for FP compression in Apache Parquet.

My report also includes an investigation of lossy compressors though there are still some things to be cleared out.


Report: https://drive.google.com/open?id=1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv


Sections 3, 4, 5 and 6 are the most important to go over.


Let me know if you have any questions or concerns.


Regards,

Martin

________________________________
From: Zoltan Ivanfi <zi...@cloudera.com.INVALID>
Sent: Thursday, June 13, 2019 2:16:56 PM
To: Parquet Dev
Cc: Raoofy, Amir; Karlstetter, Roman
Subject: Re: Floating point data compression for Apache Parquet

Hi Martin,

Thanks for your interest in improving Parquet. Efficient encodings are
really important in a big data file format, so this topic is
definitely worth researching and personally I am looking forward to
your report. Whether to add any new encodings to Parquet, however, can
not be answered until we see the results of your findings.

You mention two paths. One has very small computational overhead but
does not provide significant space savings. The other provides
significant space savings but at the price of a significant
computational overhead. While purely based on these properties both of
them seem "balanced" (one is small effort, small gain; the other is
large effort, large gain) and therefore sound reasonable options, I
would argue that one should also consider development costs, code
complexity and compatibility implications when deciding about whether
a new feature is worth implementing.

Adding a new encoding or compression to Parquet complicates the
specification of the file format and requires implementing it in every
language binding of the format, which is not only a considerable
effort, but is also error-prone (see LZ4 for an example, which was
added to both the Java and the C++ implementation of Parquet, yet they
are incompatible with each other). And lack of support is not only a
minor annoyance in this case: if one is forced to use an older reader
that does not support the new encoding yet (or a language binding that
does not support it at all), the data simply can not be read.

In my opinion, no matter how low the computational overhead of a new
encoding is, if it does not provide significant gains, then the
specification clutter, implementation costs and the potential of
compatibility problems greatly outweigh its advantages. For this
reason, I would say that only encodings that provide significant gains
are worth adding. As far as I am concerned, such a new encoding would
be a welcome addition to Parquet.

Thanks,

Zoltan

On Wed, Jun 12, 2019 at 11:10 PM Radev, Martin <ma...@tum.de>> wrote:
>
> Dear all,
>
> thank you for your work on the Apache Parquet format.
>
> We are a group of students at the Technical University of Munich who would like to extend the available compression and encoding options for 32-bit and 64-bit floating point data in Apache Parquet.
> The current encodings and compression algorithms offered in Apache Parquet are heavily specialized towards integer and text data.
> Thus there is an opportunity in reducing both io throughput requirements and space requirements for handling floating point data by selecting a specialized compression algorithm.
>
> Currently, I am doing an investigation on the available literature and publicly available fp compressors. In my investigation I am writing a report on my findings - the available algorithms, their strengths and weaknesses, compression rates, compression speeds and decompression speeds, and licenses. Once finished I will share the report with you and make a proposal which ones IMO are good candidates for Apache Parquet.
>
> The goal is to add a solution for both 32-bit and 64-bit fp types. I think that it would be beneficial to offer at the very least two distinct paths. The first one should offer fast compression and decompression speed with some but not significant saving in space. The second one should offer slower compression and decompression speed but with a decent compression rate. Both lossless. A lossy path will be investigated further and discussed with the community.
>
> If I get an approval from you – the developers – I can continue with adding support for the new encoding/compression options in the C++ implementation of Apache Parquet in Apache Arrow.
>
> Please let me know what you think of this idea and whether you have any concerns with the plan.
>
> Best regards,
> Martin Radev
>

Re: Floating point data compression for Apache Parquet

Posted by "Radev, Martin" <ma...@tum.de>.
Hello Zoltan,


I can provide a C++ and Java implementation for the encoder.

The encoder/decoder is very small, and naturally I have to add tests.

I expect the biggest hurdle would be setting up the environment and reading through the developer guides.


I will write my patches for Apache Arrow and for Apache Parquet and send them for review.

After getting them in, I can continue with the Java implementation.

Let me know if you have any concerns.


It would be great to get an opinion from other Parquet contributors : )


Thank you for the feedback!


Best regards,

Martin

________________________________
From: Zoltan Ivanfi <zi...@cloudera.com>
Sent: Monday, July 8, 2019 5:06:30 PM
To: Radev, Martin
Cc: Parquet Dev; Raoofy, Amir; Karlstetter, Roman
Subject: Re: Floating point data compression for Apache Parquet

Hi Martin,

I agree that bs_zstd would be a good place to start. Regarding the choice of language, Java, C++ and Python are your options. As far as I know, the Java implementation of Parquet has more users from the business sector, where decimal is preferred over floating point data types. It is also much more tightly integrated with the Hadoop ecosystem (it is even called parquet-mr, as in MapReduce), making for a steeper learning curve.

The Python and C++ language bindings have more scientific users, so users of these may be more interested in the new encodings. Python is a good language for rapid prototyping as well, but the Python binding of Parquet may use the C++ library under the hood, I'm not sure (I'm more familiar with the Java implementation). In any case, there are at least two Python bindings: pyarrow and fastparquet.

I think we can extend the format before the actual implementations are ready, provided that the specification is clear and nobody objects to adding it to the format. For this, I would wait for the opinion of a few more Parquet developers first, since changes to the format that are only supported by a single committer usually have a hard time getting into the spec. Additionally, could you please clarify which language bindings you plan to implement yourself? This will help the developers of the different language bindings assess how much work they will have to do to add support.

Thanks,

Zoltan


On Fri, Jul 5, 2019 at 4:34 PM Radev, Martin <ma...@tum.de>> wrote:

Hello Zoltan and Parquet devs,


do you think it would be appropriate to start with a Parquet prototype from my side?

I suspect that 'bs_zstd' would be the simplest to integrate, and from the report we can see an improvement in both compression ratio and speed.


Do you think that Apache Arrow is an appropriate place to prototype the extension of the format?

Do you agree that the enum field 'Encodings' is a suitable place to add the 'Byte stream-splitting transformation'? In that way it could be used with any of the other supported compressors.

It might be best to also add a Java implementation of the transformation. Would the project 'parquet-mr' be a good place?


Would the workflow be such that I write my patches, we verify for correctness, get reviews, merge them, and only then make adjustments to the Apache Parquet spec?


Any piece of advice is welcome!


Regards,

Martin


________________________________
From: Zoltan Ivanfi <zi...@cloudera.com>>
Sent: Friday, July 5, 2019 4:21:39 PM
To: Radev, Martin
Cc: Parquet Dev; Raoofy, Amir; Karlstetter, Roman
Subject: Re: Floating point data compression for Apache Parquet


Hi Martin,

Thanks for the explanations, makes sense. Nice work!

Br,

Zoltan

On Thu, Jul 4, 2019 at 12:22 AM Radev, Martin <ma...@tum.de>> wrote:

Hello Zoltan,


> Is data pre-loaded to RAM before making the measurements?
Yes, the file is read into physical memory.

For mmap-ed files read from external storage, I would expect, though I am not 100% sure, that the I/O overhead would be large enough that all algorithms compress at roughly the same speed.


>In "Figure 3: Decompression speed in MB/s", is data size measured before or after uncompression?

> In "Figure 4: Compression speed in MB/s", is data size measured before or after compression?
For both, the reported result is "size of the original file / time to compress or decompress".

> According to "Figure 3: Decompression speed in MB/s", decompression of bs_zstd is almost twice as fast as plain zstd. Do you know what causes this massive speed improvement?

I do not know all of the details. As you mentioned, the compressed output is smaller, which could lead to an improvement in speed since less data has to be written to memory during compression or read from memory during decompression.

Another thing to consider is that ZSTD uses different techniques to compress a block of data - "raw", "RLE", "Huffman coding", "Treeless coding".

I expect that "Huffman coding" is more costly than "RLE" and I also expect that "RLE" to be applicable for the majority of the sign bits thus leading to a performance win for when the transformation is applied.


I also expect that zstd has to do some form of "optimal parsing" to decide how to process the input in order to compress it well. This is something every wanna-be-good LZ-like compressor has to do ( https://martinradev.github.io/jekyll/update/2019/05/29/writing-a-pe32-x86-exe-packer.html , http://cbloomrants.blogspot.com/2011/10/10-24-11-lz-optimal-parse-with-star.html ). It might be that the transformed input is simply easier to parse, which leads to faster compression, and that the resulting compressed data is easier to decompress, which leads to faster decompression.

I used this as a reference: https://www.rfc-editor.org/rfc/pdfrfc/rfc8478.txt.pdf. I am not familiar with ZSTD in particular.


I also checked that the majority of the time is spent in zstd.

Example run for msg_sweep3d.dp using zstd at level 1.
- Transformation during compression: 0.086s, ZSTD compress on transformed data: 0.08s

- regular ZSTD: 0.34s
- ZSTD decompress from compressed transformed data: 0.067s, Transformation during decompression: 0.021s
- regular ZSTD decompress: 0.24s


Example run for msg_sweep3d.dp using zstd at level 20.

- Transformation during compression: 0.083s, ZSTD compress on transformed data: 14.35s

- regular ZSTD: 183s
- ZSTD decompress from compressed transformed data: 0.075s, Transformation during decompression: 0.022s
- regular ZSTD decompress: 0.31s
Here it's clear that the transformed input is easier to parse (compress). Maybe the blocks are also of a type which takes less time to decompress.

> If considering using existing libraries to provide any of the compression algorithms, license compatibility is also an important factor and therefore would be worth mentioning in Section 5.
This is something I forgot to list. I will get back to you and the other devs with that information.

The filter I proposed for lossless compression can be integrated without any concerns for a license.


> Are any of the investigated strategies applicable to DECIMAL values?
The lossy compressors SZ and ZFP do not support that out of the box. I could communicate with the SZ developers to decide how this could be added to SZ. An option is to losslessly compress the integer part and lossily compress the fractional part.

For lossless compression, we can apply a similar stream splitting technique to decimal types, though it might be somewhat more complex and I have not really thought about this case.


Regards,

Martin

________________________________
From: Zoltan Ivanfi <zi...@cloudera.com>>
Sent: Wednesday, July 3, 2019 6:07:50 PM
To: Parquet Dev; Radev, Martin
Cc: Raoofy, Amir; Karlstetter, Roman
Subject: Re: Floating point data compression for Apache Parquet

Hi Martin,

Thanks for the thorough investigation, very nice report. I would have a few questions:

- Is data pre-loaded to RAM before making the measurements?

- In "Figure 3: Decompression speed in MB/s", is data size measured before or after uncompression?

- In "Figure 4: Compression speed in MB/s", is data size measured before or after compression?

- According to "Figure 3: Decompression speed in MB/s", decompression of bs_zstd is almost twice as fast as plain zstd. Do you know what causes this massive speed improvement? Based on the description provided in section 3.2, bs_zstd uses the same zstd compression with an extra step of splitting/combining streams. Since this is extra work, I would have expected bs_zstd to be slower than pure zstd, unless the compressed data becomes so much smaller that it radically improves data access times. However, according to "Figure 2: Compression ratio", bs_zstd achieves "only" 23% better compression than plain zstd, which can not be the reason for the 2x speed-up in itself.

- If considering using existing libraries to provide any of the compression algorithms, license compatibility is also an important factor and therefore would be worth mentioning in Section 5.

- Are any of the investigated strategies applicable to DECIMAL values? Since floating point values and calculations have an inherent inaccuracy, the DECIMAL type is much more important for storing financial data, which is one of the main use cases of Parquet.

Thanks,

Zoltan

On Mon, Jul 1, 2019 at 10:57 PM Radev, Martin <ma...@tum.de>> wrote:
Hello folks,


thank you for your input.


I am finished with my investigation regarding introducing special support for FP compression in Apache Parquet.

My report also includes an investigation of lossy compressors though there are still some things to be cleared out.


Report: https://drive.google.com/open?id=1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv


Sections 3, 4, 5 and 6 are the most important to go over.


Let me know if you have any questions or concerns.


Regards,

Martin

________________________________
From: Zoltan Ivanfi <zi...@cloudera.com.INVALID>
Sent: Thursday, June 13, 2019 2:16:56 PM
To: Parquet Dev
Cc: Raoofy, Amir; Karlstetter, Roman
Subject: Re: Floating point data compression for Apache Parquet

Hi Martin,

Thanks for your interest in improving Parquet. Efficient encodings are
really important in a big data file format, so this topic is
definitely worth researching and personally I am looking forward to
your report. Whether to add any new encodings to Parquet, however, can
not be answered until we see the results of your findings.

You mention two paths. One has very small computational overhead but
does not provide significant space savings. The other provides
significant space savings but at the price of a significant
computational overhead. While purely based on these properties both of
them seem "balanced" (one is small effort, small gain; the other is
large effort, large gain) and therefore sound reasonable options, I
would argue that one should also consider development costs, code
complexity and compatibility implications when deciding about whether
a new feature is worth implementing.

Adding a new encoding or compression to Parquet complicates the
specification of the file format and requires implementing it in every
language binding of the format, which is not only a considerable
effort, but is also error-prone (see LZ4 for an example, which was
added to both the Java and the C++ implementation of Parquet, yet they
are incompatible with each other). And lack of support is not only a
minor annoyance in this case: if one is forced to use an older reader
that does not support the new encoding yet (or a language binding that
does not support it at all), the data simply can not be read.

In my opinion, no matter how low the computational overhead of a new
encoding is, if it does not provide significant gains, then the
specification clutter, implementation costs and the potential of
compatibility problems greatly outweigh its advantages. For this
reason, I would say that only encodings that provide significant gains
are worth adding. As far as I am concerned, such a new encoding would
be a welcome addition to Parquet.

Thanks,

Zoltan

On Wed, Jun 12, 2019 at 11:10 PM Radev, Martin <ma...@tum.de>> wrote:
>
> Dear all,
>
> thank you for your work on the Apache Parquet format.
>
> We are a group of students at the Technical University of Munich who would like to extend the available compression and encoding options for 32-bit and 64-bit floating point data in Apache Parquet.
> The current encodings and compression algorithms offered in Apache Parquet are heavily specialized towards integer and text data.
> Thus there is an opportunity in reducing both io throughput requirements and space requirements for handling floating point data by selecting a specialized compression algorithm.
>
> Currently, I am doing an investigation on the available literature and publicly available fp compressors. In my investigation I am writing a report on my findings - the available algorithms, their strengths and weaknesses, compression rates, compression speeds and decompression speeds, and licenses. Once finished I will share the report with you and make a proposal which ones IMO are good candidates for Apache Parquet.
>
> The goal is to add a solution for both 32-bit and 64-bit fp types. I think that it would be beneficial to offer at the very least two distinct paths. The first one should offer fast compression and decompression speed with some but not significant saving in space. The second one should offer slower compression and decompression speed but with a decent compression rate. Both lossless. A lossy path will be investigated further and discussed with the community.
>
> If I get an approval from you – the developers – I can continue with adding support for the new encoding/compression options in the C++ implementation of Apache Parquet in Apache Arrow.
>
> Please let me know what you think of this idea and whether you have any concerns with the plan.
>
> Best regards,
> Martin Radev
>

Re: Floating point data compression for Apache Parquet

Posted by Zoltan Ivanfi <zi...@cloudera.com.INVALID>.
Hi Martin,

I agree that bs_zstd would be a good place to start. Regarding the choice
of language, Java, C++ and Python are your options. As far as I know, the
Java implementation of Parquet has more users from the business sector,
where decimal is preferred over floating point data types. It is also much
more tightly integrated with the Hadoop ecosystem (it is even called
parquet-mr, as in MapReduce), making for a steeper learning curve.

The Python and C++ language bindings have more scientific users, so users
of these may be more interested in the new encodings. Python is a good
language for rapid prototyping as well, but the Python binding of Parquet
may use the C++ library under the hood, I'm not sure (I'm more familiar
with the Java implementation). In any case, there are at least two Python
bindings: pyarrow and fastparquet.

I think we can extend the format before the actual implementations are
ready, provided that the specification is clear and nobody objects to
adding it to the format. For this, I would wait for the opinion of a few
more Parquet developers first, since changes to the format that are only
supported by a single committer usually have a hard time getting into the
spec. Additionally, could you please clarify which language bindings you
plan to implement yourself? This will help the developers of the different
language bindings assess how much work they will have to do to add support.

Thanks,

Zoltan


On Fri, Jul 5, 2019 at 4:34 PM Radev, Martin <ma...@tum.de> wrote:

> Hello Zoltan and Parquet devs,
>
>
> do you think it would be appropriate to start with a Parquet prototype
> from my side?
>
> I suspect that 'bs_zstd' would be the simplest to integrate, and from the
> report we can see an improvement in both compression ratio and speed.
>
>
> Do you think that Apache Arrow is an appropriate place to prototype the
> extension of the format?
>
> Do you agree that the enum field 'Encodings' is a suitable place to add
> the 'Byte stream-splitting transformation'? In that way it could be used
> with any of the other supported compressors.
>
> It might be best to also add a Java implementation of the transformation.
> Would the project 'parquet-mr' be a good place?
>
>
> Would the workflow be such that I write my patches, we verify them for
> correctness, get reviews, merge them, and only then make adjustments to
> the Apache Parquet spec?
>
>
> Any piece of advice is welcome!
>
>
> Regards,
>
> Martin
>
>
> ------------------------------
> *From:* Zoltan Ivanfi <zi...@cloudera.com>
> *Sent:* Friday, July 5, 2019 4:21:39 PM
> *To:* Radev, Martin
> *Cc:* Parquet Dev; Raoofy, Amir; Karlstetter, Roman
> *Subject:* Re: Floating point data compression for Apache Parquet
>
>
> Hi Martin,
>
> Thanks for the explanations, makes sense. Nice work!
>
> Br,
>
> Zoltan
>
> On Thu, Jul 4, 2019 at 12:22 AM Radev, Martin <ma...@tum.de> wrote:
>
>> Hello Zoltan,
>>
>>
>> *> **Is data pre-loaded to RAM before making the measurements?*
>> Yes, the file is read into physical memory.
>>
>> For mmap-ed files read from external storage, I would expect, though I am
>> not 100% sure, that the I/O overhead would be big enough that all
>> algorithms compress at roughly the same speed.
>>
>>
>>
>> *>In "Figure 3: Decompression speed in MB/s", is data size measured
>> before or after uncompression? *
>> *> In "Figure 4: Compression speed in MB/s", is data size measured before
>> or after compression?*
>> For both the reported result is "size of the original file / time to
>> compress or decompress".
>>
>> *> **According to "Figure 3: Decompression speed in MB/s", decompression
>> of bs_zstd is almost twice as fast as plain zstd. Do you know what causes
>> this massive speed improvement?*
>>
>> I do not know all of the details. As you mentioned, the written-out data
>> is smaller, which could lead to an improvement in speed since less data
>> has to be written to memory during compression or read from memory
>> during decompression.
>>
>> Another thing to consider is that ZSTD uses different techniques to
>> compress a block of data - "raw", "RLE", "Huffman coding", "Treeless
>> coding".
>>
>> I expect that "Huffman coding" is more costly than "RLE" and I also
>> expect that "RLE" to be applicable for the majority of the sign bits thus
>> leading to a performance win for when the transformation is applied.
>>
>> I also expect that zstd has to do some form of "optimal parsing" to
>> decide how to process the input in order to compress it well. This is
>> something every wanna-be-good LZ-like compressor has to do (
>> https://martinradev.github.io/jekyll/update/2019/05/29/writing-a-pe32-x86-exe-packer.html
>>  ,
>> http://cbloomrants.blogspot.com/2011/10/10-24-11-lz-optimal-parse-with-star.html ).
>> It might be that the transformed input is simply easier to parse, which
>> leads to faster compression, and that it produces data which is easier
>> to decompress, which leads to faster decompression.
>>
>> I used this as a reference:
>> https://www.rfc-editor.org/rfc/pdfrfc/rfc8478.txt.pdf. I am not familiar
>> with ZSTD in particular.
>>
>> I also checked that the majority of the time is spent in zstd.
>>
>> Example run for msg_sweep3d.dp using zstd at level 1.
>> - Transformation during compression: 0.086s, ZSTD compress on transformed
>> data: 0.08s
>>
>> - regular ZSTD: 0.34s
>> - ZSTD decompress from compressed transformed data: 0.067s, Transformation
>> during decompression: 0.021s
>> - regular ZSTD decompress: 0.24s
>>
>>
>> Example run for msg_sweep3d.dp using zstd at level 20.
>>
>> - Transformation during compression: 0.083s, ZSTD compress on transformed
>> data: 14.35s
>>
>> - regular ZSTD: 183s
>> - ZSTD decompress from compressed transformed
>> data: 0.075s, Transformation during decompression: 0.022s
>> - regular ZSTD decompress: 0.31s
>> Here it's clear that the transformed input is easier to parse (compress).
>> Maybe the blocks are also of a type which takes less time to decompress.
>>
>> *> **If considering using existing libraries to provide any of the
>> compression algorithms, license compatibility is also an important factor
>> and therefore would be worth mentioning in Section 5.*
>> This is something I forgot to list. I will get back to you and the other
>> devs with that information.
>>
>> The filter I proposed for lossless compression can be integrated without
>> any concerns for a license.
>>
>>
>> *> **Are any of the investigated strategies applicable to DECIMAL
>> values?*
>> The lossy compressors SZ and ZFP do not support that out of the box.
>> I could communicate with the SZ developers to come to a decision on how
>> this can be added to SZ. An option is to losslessly compress the
>> pre-decimal part and lossily compress the post-decimal part.
>>
>> For lossless compression, we can apply a similar stream-splitting
>> technique to decimal types, though it might be somewhat more complex and
>> I have not really thought about this case.
>>
>>
>> Regards,
>>
>> Martin
>> ------------------------------
>> *From:* Zoltan Ivanfi <zi...@cloudera.com>
>> *Sent:* Wednesday, July 3, 2019 6:07:50 PM
>> *To:* Parquet Dev; Radev, Martin
>> *Cc:* Raoofy, Amir; Karlstetter, Roman
>> *Subject:* Re: Floating point data compression for Apache Parquet
>>
>> Hi Martin,
>>
>> Thanks for the thorough investigation, very nice report. I would have a
>> few questions:
>>
>> - Is data pre-loaded to RAM before making the measurements?
>>
>> - In "Figure 3: Decompression speed in MB/s", is data size measured
>> before or after uncompression?
>>
>> - In "Figure 4: Compression speed in MB/s", is data size measured before
>> or after compression?
>>
>> - According to "Figure 3: Decompression speed in MB/s", decompression of
>> bs_zstd is almost twice as fast as plain zstd. Do you know what causes this
>> massive speed improvement? Based on the description provided in section
>> 3.2, bs_zstd uses the same zstd compression with an extra step of
>> splitting/combining streams. Since this is extra work, I would have
>> expected bs_zstd to be slower than pure zstd, unless the compressed data
>> becomes so much smaller that it radically improves data access times.
>> However, according to "Figure 2: Compression ratio", bs_zstd achieves
>> "only" 23% better compression than plain zstd, which can not be the reason
>> for the 2x speed-up in itself.
>>
>> - If considering using existing libraries to provide any of the
>> compression algorithms, license compatibility is also an important factor
>> and therefore would be worth mentioning in Section 5.
>>
>> - Are any of the investigated strategies applicable to DECIMAL values?
>> Since floating point values and calculations have an inherent inaccuracy,
>> the DECIMAL type is much more important for storing financial data, which
>> is one of the main use cases of Parquet.
>>
>> Thanks,
>>
>> Zoltan
>>
>> On Mon, Jul 1, 2019 at 10:57 PM Radev, Martin <ma...@tum.de>
>> wrote:
>>
>>> Hello folks,
>>>
>>>
>>> thank you for your input.
>>>
>>>
>>> I am finished with my investigation regarding introducing special
>>> support for FP compression in Apache Parquet.
>>>
>>> My report also includes an investigation of lossy compressors though
>>> there are still some things to be cleared out.
>>>
>>>
>>> Report:
>>> https://drive.google.com/open?id=1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv
>>>
>>>
>>> Sections 3 4 5 6 are the most important to go over.
>>>
>>>
>>> Let me know if you have any questions or concerns.
>>>
>>>
>>> Regards,
>>>
>>> Martin
>>>
>>> ________________________________
>>> From: Zoltan Ivanfi <zi...@cloudera.com.INVALID>
>>> Sent: Thursday, June 13, 2019 2:16:56 PM
>>> To: Parquet Dev
>>> Cc: Raoofy, Amir; Karlstetter, Roman
>>> Subject: Re: Floating point data compression for Apache Parquet
>>>
>>> Hi Martin,
>>>
>>> Thanks for your interest in improving Parquet. Efficient encodings are
>>> really important in a big data file format, so this topic is
>>> definitely worth researching and personally I am looking forward to
>>> your report. Whether to add any new encodings to Parquet, however, can
>>> not be answered until we see the results of your findings.
>>>
>>> You mention two paths. One has very small computational overhead but
>>> does not provide significant space savings. The other provides
>>> significant space savings but at the price of a significant
>>> computational overhead. While purely based on these properties both of
>>> them seem "balanced" (one is small effort, small gain; the other is
>>> large effort, large gain) and therefore sound reasonable options, I
>>> would argue that one should also consider development costs, code
>>> complexity and compatibility implications when deciding about whether
>>> a new feature is worth implementing.
>>>
>>> Adding a new encoding or compression to Parquet complicates the
>>> specification of the file format and requires implementing it in every
>>> language binding of the format, which is not only a considerable
>>> effort, but is also error-prone (see LZ4 for an example, which was
>>> added to both the Java and the C++ implementation of Parquet, yet they
>>> are incompatible with each other). And lack of support is not only a
>>> minor annoyance in this case: if one is forced to use an older reader
>>> that does not support the new encoding yet (or a language binding that
>>> does not support it at all), the data simply can not be read.
>>>
>>> In my opinion, no matter how low the computational overhead of a new
>>> encoding is, if it does not provide significant gains, then the
>>> specification clutter, implementation costs and the potential of
>>> compatibility problems greatly outweigh its advantages. For this
>>> reason, I would say that only encodings that provide significant gains
>>> are worth adding. As far as I am concerned, such a new encoding would
>>> be a welcome addition to Parquet.
>>>
>>> Thanks,
>>>
>>> Zoltan
>>>
>>> On Wed, Jun 12, 2019 at 11:10 PM Radev, Martin <ma...@tum.de>
>>> wrote:
>>> >
>>> > Dear all,
>>> >
>>> > thank you for your work on the Apache Parquet format.
>>> >
>>> > We are a group of students at the Technical University of Munich who
>>> would like to extend the available compression and encoding options for
>>> 32-bit and 64-bit floating point data in Apache Parquet.
>>> > The current encodings and compression algorithms offered in Apache
>>> Parquet are heavily specialized towards integer and text data.
>>> > Thus there is an opportunity in reducing both io throughput
>>> requirements and space requirements for handling floating point data by
>>> selecting a specialized compression algorithm.
>>> >
>>> > Currently, I am doing an investigation on the available literature and
>>> publicly available fp compressors. In my investigation I am writing a
>>> report on my findings - the available algorithms, their strengths and
>>> weaknesses, compression rates, compression speeds and decompression speeds,
>>> and licenses. Once finished I will share the report with you and make a
>>> proposal which ones IMO are good candidates for Apache Parquet.
>>> >
>>> > The goal is to add a solution for both 32-bit and 64-bit fp types. I
>>> think that it would be beneficial to offer at the very least two distinct
>>> paths. The first one should offer fast compression and decompression speed
>>> with some but not significant saving in space. The second one should offer
>>> slower compression and decompression speed but with a decent compression
>>> rate. Both lossless. A lossy path will be investigated further and
>>> discussed with the community.
>>> >
>>> > If I get an approval from you – the developers – I can continue with
>>> adding support for the new encoding/compression options in the C++
>>> implementation of Apache Parquet in Apache Arrow.
>>> >
>>> > Please let me know what you think of this idea and whether you have
>>> any concerns with the plan.
>>> >
>>> > Best regards,
>>> > Martin Radev
>>> >
>>>
>>

Re: Floating point data compression for Apache Parquet

Posted by "Radev, Martin" <ma...@tum.de>.
Hello Zoltan and Parquet devs,


do you think it would be appropriate to start with a Parquet prototype from my side?

I suspect that 'bs_zstd' would be the simplest to integrate, and from the report we can see an improvement in both compression ratio and speed.


Do you think that Apache Arrow is an appropriate place to prototype the extension of the format?

Do you agree that the enum field 'Encodings' is a suitable place to add the 'Byte stream-splitting transformation'? In that way it could be used with any of the other supported compressors.
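
To make the idea concrete, here is a minimal standalone sketch of what the transformation itself does for 32-bit floats (simplified C++, not the actual Arrow or parquet-mr code; the function names are just placeholders):

    #include <cstddef>
    #include <cstdint>
    #include <cstring>
    #include <vector>

    // Scatter byte k of every value into the k-th contiguous stream.
    // The resulting buffer is what would then be handed to the regular
    // compressor (zstd, gzip, ...), so the codecs themselves stay untouched.
    std::vector<std::uint8_t> byte_stream_split_encode(const float* values,
                                                       std::size_t n) {
      std::vector<std::uint8_t> out(n * sizeof(float));
      for (std::size_t i = 0; i < n; ++i) {
        std::uint8_t bytes[sizeof(float)];
        std::memcpy(bytes, &values[i], sizeof(float));
        for (std::size_t k = 0; k < sizeof(float); ++k) {
          out[k * n + i] = bytes[k];  // stream k, element i
        }
      }
      return out;
    }

    // Inverse: gather byte k of element i back from stream k.
    std::vector<float> byte_stream_split_decode(const std::uint8_t* streams,
                                                std::size_t n) {
      std::vector<float> values(n);
      for (std::size_t i = 0; i < n; ++i) {
        std::uint8_t bytes[sizeof(float)];
        for (std::size_t k = 0; k < sizeof(float); ++k) {
          bytes[k] = streams[k * n + i];
        }
        std::memcpy(&values[i], bytes, sizeof(float));
      }
      return values;
    }

The 64-bit version is identical with sizeof(double) streams. Since the transformation is a pure byte permutation, it is lossless and composes with any of the existing compressors.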

It might be best to also add a Java implementation of the transformation. Would the project 'parquet-mr' be a good place?


Would the workflow be such that I write my patches, we verify them for correctness, get reviews, merge them, and only then make adjustments to the Apache Parquet spec?


Any piece of advice is welcome!


Regards,

Martin


________________________________
From: Zoltan Ivanfi <zi...@cloudera.com>
Sent: Friday, July 5, 2019 4:21:39 PM
To: Radev, Martin
Cc: Parquet Dev; Raoofy, Amir; Karlstetter, Roman
Subject: Re: Floating point data compression for Apache Parquet


Hi Martin,

Thanks for the explanations, makes sense. Nice work!

Br,

Zoltan

On Thu, Jul 4, 2019 at 12:22 AM Radev, Martin <ma...@tum.de> wrote:

Hello Zoltan,


> Is data pre-loaded to RAM before making the measurements?
Yes, the file is read into physical memory.

For mmap-ed files read from external storage, I would expect, though I am not 100% sure, that the I/O overhead would be big enough that all algorithms compress at roughly the same speed.


>In "Figure 3: Decompression speed in MB/s", is data size measured before or after uncompression?

> In "Figure 4: Compression speed in MB/s", is data size measured before or after compression?
For both the reported result is "size of the original file / time to compress or decompress".

> According to "Figure 3: Decompression speed in MB/s", decompression of bs_zstd is almost twice as fast as plain zstd. Do you know what causes this massive speed improvement?

I do not know all of the details. As you mentioned, the written-out data is smaller, which could lead to an improvement in speed since less data has to be written to memory during compression or read from memory during decompression.

Another thing to consider is that ZSTD uses different techniques to compress a block of data - "raw", "RLE", "Huffman coding", "Treeless coding".

I expect that "Huffman coding" is more costly than "RLE" and I also expect that "RLE" to be applicable for the majority of the sign bits thus leading to a performance win for when the transformation is applied.


I also expect that zstd has to do some form of "optimal parsing" to decide how to process the input in order to compress it well. This is something every wanna-be-good LZ-like compressor has to do ( https://martinradev.github.io/jekyll/update/2019/05/29/writing-a-pe32-x86-exe-packer.html , http://cbloomrants.blogspot.com/2011/10/10-24-11-lz-optimal-parse-with-star.html ). It might be that the transformed input is simply easier to parse, which leads to faster compression, and that it produces data which is easier to decompress, which leads to faster decompression.

I used this as a reference: https://www.rfc-editor.org/rfc/pdfrfc/rfc8478.txt.pdf. I am not familiar with ZSTD in particular.


I also checked that the majority of the time is spent in zstd.

Example run for msg_sweep3d.dp using zstd at level 1.
- Transformation during compression: 0.086s, ZSTD compress on transformed data: 0.08s

- regular ZSTD: 0.34s
- ZSTD decompress from compressed transformed data: 0.067s, Transformation during decompression: 0.021s
- regular ZSTD decompress: 0.24s


Example run for msg_sweep3d.dp using zstd at level 20.

- Transformation during compression: 0.083s, ZSTD compress on transformed data: 14.35s

- regular ZSTD: 183s
- ZSTD decompress from compressed transformed data: 0.075s, Transformation during decompression: 0.022s
- regular ZSTD decompress: 0.31s
Here it's clear that the transformed input is easier to parse (compress). Maybe the blocks are also of a type which takes less time to decompress.

> If considering using existing libraries to provide any of the compression algorithms, license compatibility is also an important factor and therefore would be worth mentioning in Section 5.
This is something I forgot to list. I will get back to you and the other devs with that information.

The filter I proposed for lossless compression can be integrated without any concerns for a license.


> Are any of the investigated strategies applicable to DECIMAL values?
The lossy compressors SZ and ZFP do not support that out of the box. I could communicate with the SZ developers to come to a decision on how this can be added to SZ. An option is to losslessly compress the pre-decimal part and lossily compress the post-decimal part.

For lossless compression, we can apply a similar stream-splitting technique to decimal types, though it might be somewhat more complex and I have not really thought about this case.


Regards,

Martin

________________________________
From: Zoltan Ivanfi <zi...@cloudera.com>
Sent: Wednesday, July 3, 2019 6:07:50 PM
To: Parquet Dev; Radev, Martin
Cc: Raoofy, Amir; Karlstetter, Roman
Subject: Re: Floating point data compression for Apache Parquet

Hi Martin,

Thanks for the thorough investigation, very nice report. I would have a few questions:

- Is data pre-loaded to RAM before making the measurements?

- In "Figure 3: Decompression speed in MB/s", is data size measured before or after uncompression?

- In "Figure 4: Compression speed in MB/s", is data size measured before or after compression?

- According to "Figure 3: Decompression speed in MB/s", decompression of bs_zstd is almost twice as fast as plain zstd. Do you know what causes this massive speed improvement? Based on the description provided in section 3.2, bs_zstd uses the same zstd compression with an extra step of splitting/combining streams. Since this is extra work, I would have expected bs_zstd to be slower than pure zstd, unless the compressed data becomes so much smaller that it radically improves data access times. However, according to "Figure 2: Compression ratio", bs_zstd achieves "only" 23% better compression than plain zstd, which can not be the reason for the 2x speed-up in itself.

- If considering using existing libraries to provide any of the compression algorithms, license compatibility is also an important factor and therefore would be worth mentioning in Section 5.

- Are any of the investigated strategies applicable to DECIMAL values? Since floating point values and calculations have an inherent inaccuracy, the DECIMAL type is much more important for storing financial data, which is one of the main use cases of Parquet.

Thanks,

Zoltan

On Mon, Jul 1, 2019 at 10:57 PM Radev, Martin <ma...@tum.de> wrote:
Hello folks,


thank you for your input.


I am finished with my investigation regarding introducing special support for FP compression in Apache Parquet.

My report also includes an investigation of lossy compressors though there are still some things to be cleared out.


Report: https://drive.google.com/open?id=1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv


Sections 3 4 5 6 are the most important to go over.


Let me know if you have any questions or concerns.


Regards,

Martin

________________________________
From: Zoltan Ivanfi <zi...@cloudera.com.INVALID>
Sent: Thursday, June 13, 2019 2:16:56 PM
To: Parquet Dev
Cc: Raoofy, Amir; Karlstetter, Roman
Subject: Re: Floating point data compression for Apache Parquet

Hi Martin,

Thanks for your interest in improving Parquet. Efficient encodings are
really important in a big data file format, so this topic is
definitely worth researching and personally I am looking forward to
your report. Whether to add any new encodings to Parquet, however, can
not be answered until we see the results of your findings.

You mention two paths. One has very small computational overhead but
does not provide significant space savings. The other provides
significant space savings but at the price of a significant
computational overhead. While purely based on these properties both of
them seem "balanced" (one is small effort, small gain; the other is
large effort, large gain) and therefore sound reasonable options, I
would argue that one should also consider development costs, code
complexity and compatibility implications when deciding about whether
a new feature is worth implementing.

Adding a new encoding or compression to Parquet complicates the
specification of the file format and requires implementing it in every
language binding of the format, which is not only a considerable
effort, but is also error-prone (see LZ4 for an example, which was
added to both the Java and the C++ implementation of Parquet, yet they
are incompatible with each other). And lack of support is not only a
minor annoyance in this case: if one is forced to use an older reader
that does not support the new encoding yet (or a language binding that
does not support it at all), the data simply can not be read.

In my opinion, no matter how low the computational overhead of a new
encoding is, if it does not provide significant gains, then the
specification clutter, implementation costs and the potential of
compatibility problems greatly outweigh its advantages. For this
reason, I would say that only encodings that provide significant gains
are worth adding. As far as I am concerned, such a new encoding would
be a welcome addition to Parquet.

Thanks,

Zoltan

On Wed, Jun 12, 2019 at 11:10 PM Radev, Martin <ma...@tum.de> wrote:
>
> Dear all,
>
> thank you for your work on the Apache Parquet format.
>
> We are a group of students at the Technical University of Munich who would like to extend the available compression and encoding options for 32-bit and 64-bit floating point data in Apache Parquet.
> The current encodings and compression algorithms offered in Apache Parquet are heavily specialized towards integer and text data.
> Thus there is an opportunity in reducing both io throughput requirements and space requirements for handling floating point data by selecting a specialized compression algorithm.
>
> Currently, I am doing an investigation on the available literature and publicly available fp compressors. In my investigation I am writing a report on my findings - the available algorithms, their strengths and weaknesses, compression rates, compression speeds and decompression speeds, and licenses. Once finished I will share the report with you and make a proposal which ones IMO are good candidates for Apache Parquet.
>
> The goal is to add a solution for both 32-bit and 64-bit fp types. I think that it would be beneficial to offer at the very least two distinct paths. The first one should offer fast compression and decompression speed with some but not significant saving in space. The second one should offer slower compression and decompression speed but with a decent compression rate. Both lossless. A lossy path will be investigated further and discussed with the community.
>
> If I get an approval from you – the developers – I can continue with adding support for the new encoding/compression options in the C++ implementation of Apache Parquet in Apache Arrow.
>
> Please let me know what you think of this idea and whether you have any concerns with the plan.
>
> Best regards,
> Martin Radev
>

Re: Floating point data compression for Apache Parquet

Posted by Zoltan Ivanfi <zi...@cloudera.com.INVALID>.
Hi Martin,

Thanks for the explanations, makes sense. Nice work!

Br,

Zoltan

On Thu, Jul 4, 2019 at 12:22 AM Radev, Martin <ma...@tum.de> wrote:

> Hello Zoltan,
>
>
> *> **Is data pre-loaded to RAM before making the measurements?*
> Yes, the file is read into physical memory.
>
> For mmap-ed files read from external storage, I would expect, though I am
> not 100% sure, that the I/O overhead would be big enough that all
> algorithms compress at roughly the same speed.
>
>
>
> *>In "Figure 3: Decompression speed in MB/s", is data size measured before
> or after uncompression? *
> *> In "Figure 4: Compression speed in MB/s", is data size measured before
> or after compression?*
> For both the reported result is "size of the original file / time to
> compress or decompress".
>
> *> **According to "Figure 3: Decompression speed in MB/s", decompression
> of bs_zstd is almost twice as fast as plain zstd. Do you know what causes
> this massive speed improvement?*
>
> I do not know all of the details. As you mentioned, the written-out data
> is smaller, which could lead to an improvement in speed since less data
> has to be written to memory during compression or read from memory
> during decompression.
>
> Another thing to consider is that ZSTD uses different techniques to
> compress a block of data - "raw", "RLE", "Huffman coding", "Treeless
> coding".
>
> I expect that "Huffman coding" is more costly than "RLE" and I also expect
> that "RLE" to be applicable for the majority of the sign bits thus leading
> to a performance win for when the transformation is applied.
>
> I also expect that zstd has to do some form of "optimal parsing" to decide
> how to process the input in order to compress it well. This is something
> every wanna-be-good LZ-like compressor has to do (
> https://martinradev.github.io/jekyll/update/2019/05/29/writing-a-pe32-x86-exe-packer.html
>  ,
> http://cbloomrants.blogspot.com/2011/10/10-24-11-lz-optimal-parse-with-star.html ).
> It might be that the transformed input is simply easier to parse, which
> leads to faster compression, and that it produces data which is easier to
> decompress, which leads to faster decompression.
>
> I used this as a reference:
> https://www.rfc-editor.org/rfc/pdfrfc/rfc8478.txt.pdf. I am not familiar
> with ZSTD in particular.
>
> I also checked that the majority of the time is spent in zstd.
>
> Example run for msg_sweep3d.dp using zstd at level 1.
> - Transformation during compression: 0.086s, ZSTD compress on transformed
> data: 0.08s
>
> - regular ZSTD: 0.34s
> - ZSTD decompress from compressed transformed data: 0.067s, Transformation
> during decompression: 0.021s
> - regular ZSTD decompress: 0.24s
>
>
> Example run for msg_sweep3d.dp using zstd at level 20.
>
> - Transformation during compression: 0.083s, ZSTD compress on transformed
> data: 14.35s
>
> - regular ZSTD: 183s
> - ZSTD decompress from compressed transformed data: 0.075s, Transformation
> during decompression: 0.022s
> - regular ZSTD decompress: 0.31s
> Here it's clear that the transformed input is easier to parse (compress).
> Maybe the blocks are also of a type which takes less time to decompress.
>
> *> **If considering using existing libraries to provide any of the
> compression algorithms, license compatibility is also an important factor
> and therefore would be worth mentioning in Section 5.*
> This is something I forgot to list. I will get back to you and the other
> devs with that information.
>
> The filter I proposed for lossless compression can be integrated without
> any concerns for a license.
>
>
> *> **Are any of the investigated strategies applicable to DECIMAL values?*
> The lossy compressors SZ and ZFP do not support that out of the box. I
> could communicate with the SZ developers to come to a decision on how this
> can be added to SZ. An option is to losslessly compress the pre-decimal
> part and lossily compress the post-decimal part.
>
> For lossless compression, we can apply a similar stream-splitting
> technique to decimal types, though it might be somewhat more complex and
> I have not really thought about this case.
>
>
> Regards,
>
> Martin
> ------------------------------
> *From:* Zoltan Ivanfi <zi...@cloudera.com>
> *Sent:* Wednesday, July 3, 2019 6:07:50 PM
> *To:* Parquet Dev; Radev, Martin
> *Cc:* Raoofy, Amir; Karlstetter, Roman
> *Subject:* Re: Floating point data compression for Apache Parquet
>
> Hi Martin,
>
> Thanks for the thorough investigation, very nice report. I would have a
> few questions:
>
> - Is data pre-loaded to RAM before making the measurements?
>
> - In "Figure 3: Decompression speed in MB/s", is data size measured before
> or after uncompression?
>
> - In "Figure 4: Compression speed in MB/s", is data size measured before
> or after compression?
>
> - According to "Figure 3: Decompression speed in MB/s", decompression of
> bs_zstd is almost twice as fast as plain zstd. Do you know what causes this
> massive speed improvement? Based on the description provided in section
> 3.2, bs_zstd uses the same zstd compression with an extra step of
> splitting/combining streams. Since this is extra work, I would have
> expected bs_zstd to be slower than pure zstd, unless the compressed data
> becomes so much smaller that it radically improves data access times.
> However, according to "Figure 2: Compression ratio", bs_zstd achieves
> "only" 23% better compression than plain zstd, which can not be the reason
> for the 2x speed-up in itself.
>
> - If considering using existing libraries to provide any of the
> compression algorithms, license compatibility is also an important factor
> and therefore would be worth mentioning in Section 5.
>
> - Are any of the investigated strategies applicable to DECIMAL values?
> Since floating point values and calculations have an inherent inaccuracy,
> the DECIMAL type is much more important for storing financial data, which
> is one of the main use cases of Parquet.
>
> Thanks,
>
> Zoltan
>
> On Mon, Jul 1, 2019 at 10:57 PM Radev, Martin <ma...@tum.de> wrote:
>
>> Hello folks,
>>
>>
>> thank you for your input.
>>
>>
>> I am finished with my investigation regarding introducing special support
>> for FP compression in Apache Parquet.
>>
>> My report also includes an investigation of lossy compressors though
>> there are still some things to be cleared out.
>>
>>
>> Report:
>> https://drive.google.com/open?id=1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv
>>
>>
>> Sections 3 4 5 6 are the most important to go over.
>>
>>
>> Let me know if you have any questions or concerns.
>>
>>
>> Regards,
>>
>> Martin
>>
>> ________________________________
>> From: Zoltan Ivanfi <zi...@cloudera.com.INVALID>
>> Sent: Thursday, June 13, 2019 2:16:56 PM
>> To: Parquet Dev
>> Cc: Raoofy, Amir; Karlstetter, Roman
>> Subject: Re: Floating point data compression for Apache Parquet
>>
>> Hi Martin,
>>
>> Thanks for your interest in improving Parquet. Efficient encodings are
>> really important in a big data file format, so this topic is
>> definitely worth researching and personally I am looking forward to
>> your report. Whether to add any new encodings to Parquet, however, can
>> not be answered until we see the results of your findings.
>>
>> You mention two paths. One has very small computational overhead but
>> does not provide significant space savings. The other provides
>> significant space savings but at the price of a significant
>> computational overhead. While purely based on these properties both of
>> them seem "balanced" (one is small effort, small gain; the other is
>> large effort, large gain) and therefore sound reasonable options, I
>> would argue that one should also consider development costs, code
>> complexity and compatibility implications when deciding about whether
>> a new feature is worth implementing.
>>
>> Adding a new encoding or compression to Parquet complicates the
>> specification of the file format and requires implementing it in every
>> language binding of the format, which is not only a considerable
>> effort, but is also error-prone (see LZ4 for an example, which was
>> added to both the Java and the C++ implementation of Parquet, yet they
>> are incompatible with each other). And lack of support is not only a
>> minor annoyance in this case: if one is forced to use an older reader
>> that does not support the new encoding yet (or a language binding that
>> does not support it at all), the data simply can not be read.
>>
>> In my opinion, no matter how low the computational overhead of a new
>> encoding is, if it does not provide significant gains, then the
>> specification clutter, implementation costs and the potential of
>> compatibility problems greatly outweigh its advantages. For this
>> reason, I would say that only encodings that provide significant gains
>> are worth adding. As far as I am concerned, such a new encoding would
>> be a welcome addition to Parquet.
>>
>> Thanks,
>>
>> Zoltan
>>
>> On Wed, Jun 12, 2019 at 11:10 PM Radev, Martin <ma...@tum.de>
>> wrote:
>> >
>> > Dear all,
>> >
>> > thank you for your work on the Apache Parquet format.
>> >
>> > We are a group of students at the Technical University of Munich who
>> would like to extend the available compression and encoding options for
>> 32-bit and 64-bit floating point data in Apache Parquet.
>> > The current encodings and compression algorithms offered in Apache
>> Parquet are heavily specialized towards integer and text data.
>> > Thus there is an opportunity in reducing both io throughput
>> requirements and space requirements for handling floating point data by
>> selecting a specialized compression algorithm.
>> >
>> > Currently, I am doing an investigation on the available literature and
>> publicly available fp compressors. In my investigation I am writing a
>> report on my findings - the available algorithms, their strengths and
>> weaknesses, compression rates, compression speeds and decompression speeds,
>> and licenses. Once finished I will share the report with you and make a
>> proposal which ones IMO are good candidates for Apache Parquet.
>> >
>> > The goal is to add a solution for both 32-bit and 64-bit fp types. I
>> think that it would be beneficial to offer at the very least two distinct
>> paths. The first one should offer fast compression and decompression speed
>> with some but not significant saving in space. The second one should offer
>> slower compression and decompression speed but with a decent compression
>> rate. Both lossless. A lossy path will be investigated further and
>> discussed with the community.
>> >
>> > If I get an approval from you – the developers – I can continue with
>> adding support for the new encoding/compression options in the C++
>> implementation of Apache Parquet in Apache Arrow.
>> >
>> > Please let me know what you think of this idea and whether you have any
>> concerns with the plan.
>> >
>> > Best regards,
>> > Martin Radev
>> >
>>
>

Re: Floating point data compression for Apache Parquet

Posted by "Radev, Martin" <ma...@tum.de>.
Hello Zoltan,


> Is data pre-loaded to RAM before making the measurements?
Yes, the file is read into physical memory.

For mmap-ed files read from external storage, I would expect, though I am not 100% sure, that the I/O overhead would be big enough that all algorithms compress at roughly the same speed.


>In "Figure 3: Decompression speed in MB/s", is data size measured before or after uncompression?

> In "Figure 4: Compression speed in MB/s", is data size measured before or after compression?
For both the reported result is "size of the original file / time to compress or decompress".

> According to "Figure 3: Decompression speed in MB/s", decompression of bs_zstd is almost twice as fast as plain zstd. Do you know what causes this massive speed improvement?

I do not know all of the details. As you mentioned, the written-out data is smaller, which could lead to an improvement in speed since less data has to be written to memory during compression or read from memory during decompression.

Another thing to consider is that ZSTD uses different techniques to compress a block of data - "raw", "RLE", "Huffman coding", "Treeless coding".

I expect that "Huffman coding" is more costly than "RLE" and I also expect that "RLE" to be applicable for the majority of the sign bits thus leading to a performance win for when the transformation is applied.


I also expect that zstd has to do some form of "optimal parsing" to decide how to process the input in order to compress it well. This is something every wanna-be-good LZ-like compressor has to do ( https://martinradev.github.io/jekyll/update/2019/05/29/writing-a-pe32-x86-exe-packer.html , http://cbloomrants.blogspot.com/2011/10/10-24-11-lz-optimal-parse-with-star.html ). It might be that the transformed input is simply easier to parse, which leads to faster compression, and that it produces data which is easier to decompress, which leads to faster decompression.

I used this as a reference: https://www.rfc-editor.org/rfc/pdfrfc/rfc8478.txt.pdf. I am not familiar with ZSTD in particular.


I also checked that the majority of the time is spent in zstd.

Example run for msg_sweep3d.dp using zstd at level 1.
- Transformation during compression: 0.086s, ZSTD compress on transformed data: 0.08s

- regular ZSTD: 0.34s
- ZSTD decompress from compressed transformed data: 0.067s, Transformation during decompression: 0.021s
- regular ZSTD decompress: 0.24s


Example run for msg_sweep3d.dp using zstd at level 20.

- Transformation during compression: 0.083s, ZSTD compress on transformed data: 14.35s

- regular ZSTD: 183s
- ZSTD decompress from compressed transformed data: 0.075s, Transformation during decompression: 0.022s
- regular ZSTD decompress: 0.31s
Here it's clear that the transformed input is easier to parse (compress). Maybe the blocks are also of a type which takes less time to decompress.

> If considering using existing libraries to provide any of the compression algorithms, license compatibility is also an important factor and therefore would be worth mentioning in Section 5.
This is something I forgot to list. I will get back to you and the other devs with that information.

The filter I proposed for lossless compression can be integrated without any concerns for a license.


> Are any of the investigated strategies applicable to DECIMAL values?
The lossy compressors SZ and ZFP do not support that out of the box. I could communicate with the SZ developers to come to a decision on how this can be added to SZ. An option is to losslessly compress the pre-decimal part and lossily compress the post-decimal part.

For lossless compression, we can apply a similar stream-splitting technique to decimal types, though it might be somewhat more complex and I have not really thought about this case.


Regards,

Martin

________________________________
From: Zoltan Ivanfi <zi...@cloudera.com>
Sent: Wednesday, July 3, 2019 6:07:50 PM
To: Parquet Dev; Radev, Martin
Cc: Raoofy, Amir; Karlstetter, Roman
Subject: Re: Floating point data compression for Apache Parquet

Hi Martin,

Thanks for the thorough investigation, very nice report. I would have a few questions:

- Is data pre-loaded to RAM before making the measurements?

- In "Figure 3: Decompression speed in MB/s", is data size measured before or after uncompression?

- In "Figure 4: Compression speed in MB/s", is data size measured before or after compression?

- According to "Figure 3: Decompression speed in MB/s", decompression of bs_zstd is almost twice as fast as plain zstd. Do you know what causes this massive speed improvement? Based on the description provided in section 3.2, bs_zstd uses the same zstd compression with an extra step of splitting/combining streams. Since this is extra work, I would have expected bs_zstd to be slower than pure zstd, unless the compressed data becomes so much smaller that it radically improves data access times. However, according to "Figure 2: Compression ratio", bs_zstd achieves "only" 23% better compression than plain zstd, which can not be the reason for the 2x speed-up in itself.

- If considering using existing libraries to provide any of the compression algorithms, license compatibility is also an important factor and therefore would be worth mentioning in Section 5.

- Are any of the investigated strategies applicable to DECIMAL values? Since floating point values and calculations have an inherent inaccuracy, the DECIMAL type is much more important for storing financial data, which is one of the main use cases of Parquet.

Thanks,

Zoltan

On Mon, Jul 1, 2019 at 10:57 PM Radev, Martin <ma...@tum.de> wrote:
Hello folks,


thank you for your input.


I am finished with my investigation regarding introducing special support for FP compression in Apache Parquet.

My report also includes an investigation of lossy compressors though there are still some things to be cleared out.


Report: https://drive.google.com/open?id=1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv


Sections 3 4 5 6 are the most important to go over.


Let me know if you have any questions or concerns.


Regards,

Martin

________________________________
From: Zoltan Ivanfi <zi...@cloudera.com.INVALID>
Sent: Thursday, June 13, 2019 2:16:56 PM
To: Parquet Dev
Cc: Raoofy, Amir; Karlstetter, Roman
Subject: Re: Floating point data compression for Apache Parquet

Hi Martin,

Thanks for your interest in improving Parquet. Efficient encodings are
really important in a big data file format, so this topic is
definitely worth researching and personally I am looking forward to
your report. Whether to add any new encodings to Parquet, however, can
not be answered until we see the results of your findings.

You mention two paths. One has very small computational overhead but
does not provide significant space savings. The other provides
significant space savings but at the price of a significant
computational overhead. While purely based on these properties both of
them seem "balanced" (one is small effort, small gain; the other is
large effort, large gain) and therefore sound reasonable options, I
would argue that one should also consider development costs, code
complexity and compatibility implications when deciding about whether
a new feature is worth implementing.

Adding a new encoding or compression to Parquet complicates the
specification of the file format and requires implementing it in every
language binding of the format, which is not only a considerable
effort, but is also error-prone (see LZ4 for an example, which was
added to both the Java and the C++ implementation of Parquet, yet they
are incompatible with each other). And lack of support is not only a
minor annoyance in this case: if one is forced to use an older reader
that does not support the new encoding yet (or a language binding that
does not support it at all), the data simply can not be read.

In my opinion, no matter how low the computational overhead of a new
encoding is, if it does not provide significant gains, then the
specification clutter, implementation costs and the potential of
compatibility problems greatly outweigh its advantages. For this
reason, I would say that only encodings that provide significant gains
are worth adding. As far as I am concerned, such a new encoding would
be a welcome addition to Parquet.

Thanks,

Zoltan

On Wed, Jun 12, 2019 at 11:10 PM Radev, Martin <ma...@tum.de> wrote:
>
> Dear all,
>
> thank you for your work on the Apache Parquet format.
>
> We are a group of students at the Technical University of Munich who would like to extend the available compression and encoding options for 32-bit and 64-bit floating point data in Apache Parquet.
> The current encodings and compression algorithms offered in Apache Parquet are heavily specialized towards integer and text data.
> Thus there is an opportunity in reducing both io throughput requirements and space requirements for handling floating point data by selecting a specialized compression algorithm.
>
> Currently, I am doing an investigation on the available literature and publicly available fp compressors. In my investigation I am writing a report on my findings - the available algorithms, their strengths and weaknesses, compression rates, compression speeds and decompression speeds, and licenses. Once finished I will share the report with you and make a proposal which ones IMO are good candidates for Apache Parquet.
>
> The goal is to add a solution for both 32-bit and 64-bit fp types. I think that it would be beneficial to offer at the very least two distinct paths. The first one should offer fast compression and decompression speed with some but not significant saving in space. The second one should offer slower compression and decompression speed but with a decent compression rate. Both lossless. A lossy path will be investigated further and discussed with the community.
>
> If I get an approval from you – the developers – I can continue with adding support for the new encoding/compression options in the C++ implementation of Apache Parquet in Apache Arrow.
>
> Please let me know what you think of this idea and whether you have any concerns with the plan.
>
> Best regards,
> Martin Radev
>

Re: Floating point data compression for Apache Parquet

Posted by Zoltan Ivanfi <zi...@cloudera.com.INVALID>.
Hi Martin,

Thanks for the thorough investigation, very nice report. I would have a few
questions:

- Is data pre-loaded to RAM before making the measurements?

- In "Figure 3: Decompression speed in MB/s", is data size measured before
or after uncompression?

- In "Figure 4: Compression speed in MB/s", is data size measured before or
after compression?

- According to "Figure 3: Decompression speed in MB/s", decompression of
bs_zstd is almost twice as fast as plain zstd. Do you know what causes this
massive speed improvement? Based on the description provided in section
3.2, bs_zstd uses the same zstd compression with an extra step of
splitting/combining streams. Since this is extra work, I would have
expected bs_zstd to be slower than pure zstd, unless the compressed data
becomes so much smaller that it radically improves data access times.
However, according to "Figure 2: Compression ratio", bs_zstd achieves
"only" 23% better compression than plain zstd, which can not be the reason
for the 2x speed-up in itself.

- If considering using existing libraries to provide any of the compression
algorithms, license compatibility is also an important factor and therefore
would be worth mentioning in Section 5.

- Are any of the investigated strategies applicable to DECIMAL values?
Since floating point values and calculations have an inherent inaccuracy,
the DECIMAL type is much more important for storing financial data, which
is one of the main use cases of Parquet.

Thanks,

Zoltan

On Mon, Jul 1, 2019 at 10:57 PM Radev, Martin <ma...@tum.de> wrote:

> Hello folks,
>
>
> thank you for your input.
>
>
> I am finished with my investigation regarding introducing special support
> for FP compression in Apache Parquet.
>
> My report also includes an investigation of lossy compressors though there
> are still some things to be cleared out.
>
>
> Report: https://drive.google.com/open?id=1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv
>
>
> Sections 3 4 5 6 are the most important to go over.
>
>
> Let me know if you have any questions or concerns.
>
>
> Regards,
>
> Martin
>
> ________________________________
> From: Zoltan Ivanfi <zi...@cloudera.com.INVALID>
> Sent: Thursday, June 13, 2019 2:16:56 PM
> To: Parquet Dev
> Cc: Raoofy, Amir; Karlstetter, Roman
> Subject: Re: Floating point data compression for Apache Parquet
>
> Hi Martin,
>
> Thanks for your interest in improving Parquet. Efficient encodings are
> really important in a big data file format, so this topic is
> definitely worth researching and personally I am looking forward to
> your report. Whether to add any new encodings to Parquet, however, can
> not be answered until we see the results of your findings.
>
> You mention two paths. One has very small computational overhead but
> does not provide significant space savings. The other provides
> significant space savings but at the price of a significant
> computational overhead. While purely based on these properties both of
> them seem "balanced" (one is small effort, small gain; the other is
> large effort, large gain) and therefore sound reasonable options, I
> would argue that one should also consider development costs, code
> complexity and compatibility implications when deciding about whether
> a new feature is worth implementing.
>
> Adding a new encoding or compression to Parquet complicates the
> specification of the file format and requires implementing it in every
> language binding of the format, which is not only a considerable
> effort, but is also error-prone (see LZ4 for an example, which was
> added to both the Java and the C++ implementation of Parquet, yet they
> are incompatible with each other). And lack of support is not only a
> minor annoyance in this case: if one is forced to use an older reader
> that does not support the new encoding yet (or a language binding that
> does not support it at all), the data simply can not be read.
>
> In my opinion, no matter how low the computational overhead of a new
> encoding is, if it does not provide significant gains, then the
> specification clutter, implementation costs and the potential of
> compatibility problems greatly outweigh its advantages. For this
> reason, I would say that only encodings that provide significant gains
> are worth adding. As far as I am concerned, such a new encoding would
> be a welcome addition to Parquet.
>
> Thanks,
>
> Zoltan
>
> On Wed, Jun 12, 2019 at 11:10 PM Radev, Martin <ma...@tum.de>
> wrote:
> >
> > Dear all,
> >
> > thank you for your work on the Apache Parquet format.
> >
> > We are a group of students at the Technical University of Munich who
> would like to extend the available compression and encoding options for
> 32-bit and 64-bit floating point data in Apache Parquet.
> > The current encodings and compression algorithms offered in Apache
> Parquet are heavily specialized towards integer and text data.
> > Thus there is an opportunity in reducing both io throughput requirements
> and space requirements for handling floating point data by selecting a
> specialized compression algorithm.
> >
> > Currently, I am doing an investigation on the available literature and
> publicly available fp compressors. In my investigation I am writing a
> report on my findings - the available algorithms, their strengths and
> weaknesses, compression rates, compression speeds and decompression speeds,
> and licenses. Once finished I will share the report with you and make a
> proposal which ones IMO are good candidates for Apache Parquet.
> >
> > The goal is to add a solution for both 32-bit and 64-bit fp types. I
> think that it would be beneficial to offer at the very least two distinct
> paths. The first one should offer fast compression and decompression speed
> with some but not significant saving in space. The second one should offer
> slower compression and decompression speed but with a decent compression
> rate. Both lossless. A lossy path will be investigated further and
> discussed with the community.
> >
> > If I get an approval from you – the developers – I can continue with
> adding support for the new encoding/compression options in the C++
> implementation of Apache Parquet in Apache Arrow.
> >
> > Please let me know what you think of this idea and whether you have any
> concerns with the plan.
> >
> > Best regards,
> > Martin Radev
> >
>