Posted to dev@parquet.apache.org by "Radev, Martin" <ma...@tum.de> on 2019/07/23 18:22:43 UTC

[DISCUSS][JAVA][C++] Add a new floating-point Encoding and/or Compression algorithm to Parquet Format

Dear Apache Parquet Devs,

I would like to propose extending the Apache Parquet specification with a better encoding for FP data, which improves compression ratio, and also to raise the question of adding a lossy compression algorithm for FP data.

Contents:
1. Problem: FP data compression is suboptimal in Apache Parquet
2. Solution idea: a new encoding for FP data to improve compression;
   integration of zfp for lossy compression of FP data
3. Our motives for making these changes to Parquet
4. Current implementation in parquet-mr, arrow, parquet-format
5. Benchmark - dataset, benchmark project using Avro, results
6. Open questions

1. Problem
Apache Parquet already offers a variety of encodings and compression algorithms, yet none of them compresses 32-bit or 64-bit FP data well.
There are many reasons for this:
- sources of FP data such as sensors typically add noise to measurements. Thus, the least significant mantissa bits often contain some noise.
- the available encodings in Apache Parquet specialize for string data and integer data. The IEEE 754 representation of FP data is significantly different.
- the available compressors in Apache Parquet exploit repetitions in the input sequence. For floating-point data, each element in the sequence is either
  4 or 8 bytes. Also, the least significant mantissa bits are often noise, which makes long repeated subsequences very unlikely.
  Thus, these compressors often perform poorly on raw FP data (the toy snippet after this list illustrates why).
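
A toy Java snippet (the values and the noise model are made up purely for illustration) shows the effect: consecutive noisy readings share their upper bytes (sign, exponent, high mantissa) while the lowest mantissa bytes look random, so a byte-wise repetition finder has little to work with.

    import java.util.Random;

    public final class FpNoiseDemo {
        public static void main(String[] args) {
            Random rng = new Random(42);
            for (int i = 0; i < 5; i++) {
                // A stable signal plus a little sensor noise.
                float reading = 23.15f + 0.001f * rng.nextFloat();
                System.out.printf("%.6f -> %08X%n",
                        reading, Float.floatToRawIntBits(reading));
            }
            // The leading hex digits repeat across rows; the trailing ones do not.
        }
    }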

2. Solution idea
I have already investigated a variety of ways to compress FP data and shared the report with the Parquet community.
My investigation focused on lossless compression and lossy compression.
The original report can be viewed here: https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view

For lossless compression, it turns out that a very simple encoding, named "byte stream splitting", can produce very good results. Combined with zstd it outperformed all FP-specific compressors (fpc, spdp, fpzip, zfp) for the majority of the test cases. The encoding creates a stream for each byte of the underlying FP type (4 for float, 8 for double) and scatters each byte of a value to the corresponding stream. The streams are concatenated and later compressed. The new encoding not only offers good results: it is also simple to implement, has very little overhead, and can even improve performance in some cases.
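
As a minimal sketch of the encoding itself for 32-bit floats (the class and method names below are mine for illustration, not the parquet-mr API):

    public final class ByteStreamSplit {
        // Scatter byte i of every value into stream i; the streams are then
        // concatenated: all byte-0s first, then all byte-1s, and so on.
        public static byte[] encodeFloats(float[] values) {
            final int n = values.length;
            byte[] out = new byte[n * 4];
            for (int i = 0; i < n; i++) {
                int bits = Float.floatToRawIntBits(values[i]);
                out[i]         = (byte) bits;           // stream 0
                out[n + i]     = (byte) (bits >>> 8);   // stream 1
                out[2 * n + i] = (byte) (bits >>> 16);  // stream 2
                out[3 * n + i] = (byte) (bits >>> 24);  // stream 3
            }
            return out;
        }

        // Decoding gathers one byte from each stream back into each value.
        public static float[] decodeFloats(byte[] encoded) {
            final int n = encoded.length / 4;
            float[] out = new float[n];
            for (int i = 0; i < n; i++) {
                int bits = (encoded[i] & 0xFF)
                         | (encoded[n + i] & 0xFF) << 8
                         | (encoded[2 * n + i] & 0xFF) << 16
                         | (encoded[3 * n + i] & 0xFF) << 24;
                out[i] = Float.intBitsToFloat(bits);
            }
            return out;
        }
    }

After the split, each stream has much more uniform statistics (one stream holds mostly exponent bytes, another the noisiest mantissa bytes), which is what lets a general-purpose compressor such as zstd do well on the concatenation.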

For lossy compression, I compared two lossy compressors - ZFP and SZ. SZ outperformed ZFP in compression ratio by a reasonable margin, but unfortunately the project has bugs and the API is not thread-safe.
This makes it unsuitable for Parquet at the moment. ZFP is a more mature project, which makes it potentially a good fit for integration into Parquet. We can discuss lossy compression in another thread. I only want to note that we consider it a compelling option for some Parquet users, since the achievable compression ratio is much higher than with lossless compression.

Also, please note that this work is not about improving the storage efficiency of the decimal type, only that of floats and doubles.

3. Our motives
The CAPS chair at the Technical University of Munich uses Apache Parquet for storing large amounts of FP sensor data. The new encoding improves storage efficiency - both in required capacity and in time to store.
Beyond our own interest, the improvement also benefits other Parquet users who store FP data.

4. Status of the implementation
The current status:
- Pull request for adding the new BYTE_STREAM_SPLIT encoding to parquet-format: https://github.com/apache/parquet-format/pull/144
- Patch for adding BYTE_STREAM_SPLIT to parquet-mr: https://github.com/martinradev/parquet-mr/commit/4c0e25581fa4b454535e6dbbfb3ab9932b97350c
  Patch for exposing BYTE_STREAM_SPLIT in ParquetWriter: https://github.com/martinradev/parquet-mr/commit/2ec340d5ac8e1d6e598cb83f9b17d75f11f7ff61
  I did not send a PR for these two patches since we have to vote on the new feature first and then get the parquet-format pull request in.
- Patch for adding BYTE_STREAM_SPLIT to Apache Arrow: https://github.com/martinradev/arrow/commit/193c8704c4aab8fdff51f410f0206fa5ed21d801
  Again, no PR since we need to vote for changing the specification first.
- I made public the simple benchmark app which I used to collect compression numbers: https://github.com/martinradev/parquet-mr-streamsplit-bench

5. Benchmark
For more info and results please check my mini benchmark project: https://github.com/martinradev/parquet-mr-streamsplit-bench
In short, compression improves by 11% on average for FP32 and by 6% for FP64 when gzip is used as the compression algorithm.
Note that the improvement is higher for many of the large test cases; the average is pulled down by outliers among some small test cases.
Similar results are to be expected with the other compression algorithms in Parquet.

6. Open questions
- Would you be happy to add the new BYTE_STREAM_SPLIT encoding to Apache Parquet?
- What are your thoughts on also adding lossy compression of FP32 and FP64 data to Apache Parquet in the future?
- What are the next steps?

Regards,
Martin


Re: [DISCUSS][JAVA][C++] Add a new floating-point Encoding and/or Compression algorithm to Parquet Format

Posted by "Radev, Martin" <ma...@tum.de>.
Hello people,


thanks Gabor for the feedback.

I think I am done with the investigation into designing and adding a new encoding for better compression of FP data.


Final report

My report and benchmark are available here: https://github.com/martinradev/arrow-fp-compression-bench

I deviated a bit from the original idea and investigated another encoding, named "ADAPTIVE_BYTE_STREAM_SPLIT", which should achieve somewhat better results for cases when the data is very repetitive. In those cases, it turns out that the PLAIN and DICTIONARY encodings are reasonably better than the earlier "BYTE_STREAM_SPLIT" encoding. However, the adaptive encoding carries some overhead and is unfortunately not that competitive on average.
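
The exact scheme is described in the report; as a rough, simplified illustration of the idea (my own sketch assuming per-page selection, not the scheme from the report), a writer could encode a page both ways and keep whichever layout compresses smaller:

    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;
    import java.util.zip.Deflater;

    public final class AdaptiveChoiceDemo {
        // Compressed size of data under DEFLATE (a stand-in for the page codec).
        static int deflatedSize(byte[] data) {
            Deflater d = new Deflater();
            d.setInput(data);
            d.finish();
            byte[] buf = new byte[4096];
            int total = 0;
            while (!d.finished()) {
                total += d.deflate(buf);
            }
            d.end();
            return total;
        }

        // PLAIN layout: little-endian values stored back to back.
        static byte[] plainFloats(float[] values) {
            ByteBuffer bb = ByteBuffer.allocate(values.length * 4)
                    .order(ByteOrder.LITTLE_ENDIAN);
            for (float v : values) {
                bb.putFloat(v);
            }
            return bb.array();
        }

        // True if the byte-split layout of this page compresses smaller.
        static boolean shouldSplit(float[] page) {
            byte[] plain = plainFloats(page);
            byte[] split = ByteStreamSplit.encodeFloats(page); // sketch above
            return deflatedSize(split) < deflatedSize(plain);
        }
    }

For very repetitive data the plain layout tends to win, which is exactly the case the adaptive encoding targets; the trial compression also hints at where the overhead of an adaptive approach can come from.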


My report includes a couple of tables, bar plots and a scatter plot. It also explains how both the "BYTE_STREAM_SPLIT" and "ADAPTIVE_BYTE_STREAM_SPLIT" encodings work.

My conclusion

The report shows that for many test cases the BYTE_STREAM_SPLIT encoding improves both compression ratio and compression speed.
Thus, users of Parquet could benefit from using it for certain types of data.


The implementation

The implementation of "BYTE_STREAM_SPLIT" is simple, which makes adding it to parquet-mr and arrow easy. I already have some unpolished patches for both projects.

Risks

The only risk is that it is not competitive for all types of data, but this is also the case for every other encoding in Parquet.

Also, if a better technique is introduced later, it can obviously become obsolete.


What I hope to get from you

1) Please read my report available on the github project ( https://github.com/martinradev/arrow-fp-compression-bench ) and let's have a discussion.

2) I would like to hear what you think.

3) State whether you see any problems in adding it.

The final stage

After all of this, I will start a vote some day during the week of the 26th of August.

Let me know whether you need anything from me.


Regards,

Martin


Re: [DISCUSS][JAVA][C++] Add a new floating-point Encoding and/or Compression algorithm to Parquet Format

Posted by Gabor Szadovszky <ga...@cloudera.com.INVALID>.
Hi Martin,

I've removed the guys from CC who are members of the parquet dev list. I
also suggest writing to the dev list only and letting the others subscribe
to it if they are interested, or follow the discussion at
https://lists.apache.org/list.html?dev@parquet.apache.org.

Thanks a lot for this summary and all the effort you've already put into
this. Personally, I would be happy to see the lossless encoding you've
suggested in parquet-format and then in the implementations as well.
Regarding the lossy compression, I am not sure. Until now we have not done
anything like that in Parquet. However, I can see possible benefits in
lossy encodings, but let's handle that separately.

The next step would be to initiate a vote on this list. See some details at
https://www.apache.org/foundation/voting.html about the procedural voting.
Also, some notes here: https://community.apache.org/committers/voting.html.
You may also find some examples of voting in the archives of this mailing
list.

Regards,
Gabor


Re: [DISCUSS][JAVA][C++] Add a new floating-point Encoding and/or Compression algorithm to Parquet Format

Posted by "Radev, Martin" <ma...@tum.de>.
Dear all,


how should we proceed with this proposal?


Would somebody like to offer feedback on the new encoding, the specification change, and the patches?

How should we start a vote?


I am new to this project and do not have connections in this community. Would a senior contributor like to help me drive this?


Regards,

Martin
