Posted to issues@arrow.apache.org by "Martin Radev (JIRA)" <ji...@apache.org> on 2019/07/11 14:20:00 UTC

[jira] [Created] (ARROW-5913) Add support for Parquet's BYTE_STREAM_SPLIT encoding

Martin Radev created ARROW-5913:
-----------------------------------

             Summary: Add support for Parquet's BYTE_STREAM_SPLIT encoding
                 Key: ARROW-5913
                 URL: https://issues.apache.org/jira/browse/ARROW-5913
             Project: Apache Arrow
          Issue Type: Wish
          Components: C++
            Reporter: Martin Radev


*From the Parquet issue ( https://issues.apache.org/jira/browse/PARQUET-1622 ):*

Apache Parquet does not have any encodings suited to floating-point (FP) data, and the available general-purpose compressors (zstd, gzip, etc.) do not handle FP data very well.

It is possible to apply a simple data transformation known as "stream splitting". One such transformation is "byte stream splitting", which scatters a sequence of values into K streams of length N, where K is the number of bytes in the data type (4 for floats, 8 for doubles) and N is the number of elements in the sequence, as sketched below.
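
A minimal sketch of the transformation for 32-bit floats, assuming the streams are simply concatenated into one output buffer (function names are illustrative, not taken from the Parquet or Arrow codebases):

{code:cpp}
#include <cstdint>
#include <cstring>
#include <vector>

// Split a sequence of floats into 4 byte streams: stream k holds the
// k-th byte of every value. Stream k occupies output[k * n, (k + 1) * n).
std::vector<uint8_t> ByteStreamSplit(const std::vector<float>& values) {
  constexpr size_t kByteWidth = sizeof(float);  // 4 streams for floats
  const size_t n = values.size();
  std::vector<uint8_t> output(n * kByteWidth);
  for (size_t i = 0; i < n; ++i) {
    uint8_t bytes[kByteWidth];
    std::memcpy(bytes, &values[i], kByteWidth);
    for (size_t k = 0; k < kByteWidth; ++k) {
      output[k * n + i] = bytes[k];
    }
  }
  return output;
}

// Inverse transform: gather the k-th byte of element i from each of the
// 4 streams and reassemble the original float values.
std::vector<float> ByteStreamUnsplit(const std::vector<uint8_t>& streams,
                                     size_t n) {
  constexpr size_t kByteWidth = sizeof(float);
  std::vector<float> values(n);
  for (size_t i = 0; i < n; ++i) {
    uint8_t bytes[kByteWidth];
    for (size_t k = 0; k < kByteWidth; ++k) {
      bytes[k] = streams[k * n + i];
    }
    std::memcpy(&values[i], bytes, kByteWidth);
  }
  return values;
}
{code}

Note that the transformation itself does not reduce the size of the data; it only rearranges the bytes so that a subsequent general-purpose compressor sees more homogeneous streams.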

The transformed data compresses significantly better on average than the original data, because bytes at the same position (for example the exponent bytes) tend to be similar across neighboring values, so each stream has lower entropy. In some cases there is also an improvement in compression and decompression speed.

You can read a more detailed report here:
https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view

*Apache Arrow can benefit from the reduced storage requirements for FP Parquet column data and from the improvements in decompression speed.*


