Posted to issues@arrow.apache.org by "Wes McKinney (Jira)" <ji...@apache.org> on 2019/08/19 19:38:00 UTC

[jira] [Updated] (ARROW-5913) [C++][Parquet] Add support for Parquet's BYTE_STREAM_SPLIT encoding

     [ https://issues.apache.org/jira/browse/ARROW-5913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-5913:
--------------------------------
    Summary: [C++][Parquet] Add support for Parquet's BYTE_STREAM_SPLIT encoding  (was: Add support for Parquet's BYTE_STREAM_SPLIT encoding)

> [C++][Parquet] Add support for Parquet's BYTE_STREAM_SPLIT encoding
> -------------------------------------------------------------------
>
>                 Key: ARROW-5913
>                 URL: https://issues.apache.org/jira/browse/ARROW-5913
>             Project: Apache Arrow
>          Issue Type: Wish
>          Components: C++
>            Reporter: Martin Radev
>            Priority: Minor
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> *From the Parquet issue ( https://issues.apache.org/jira/browse/PARQUET-1622 ):*
> Apache Parquet does not have any encodings suitable for floating-point (FP) data, and the available general-purpose compressors (zstd, gzip, etc.) do not handle FP data very well.
> It is possible to apply a simple data transformation called "stream splitting". One variant is "byte stream splitting", which creates K streams of length N, where K is the number of bytes in the data type (4 for floats, 8 for doubles) and N is the number of elements in the sequence.
> The transformed data compresses significantly better on average than the original data, and in some cases compression and decompression are also faster.
> You can read a more detailed report here:
>  https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view
> *Apache Arrow can benefit from the reduced storage requirements for FP Parquet column data and from the improved decompression speed.*
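
For illustration, the byte stream splitting described in the quoted issue can be sketched roughly as below. This is only a minimal sketch, not Arrow's or Parquet's actual implementation: the function names ByteStreamSplitEncode and ByteStreamSplitDecode are hypothetical, and storing the K streams back to back in a single buffer is an assumption made to keep the example short. Byte b of element i is written to position b * N + i, so each of the K streams has length N.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper (not an Arrow/Parquet API): scatter byte b of
// element i into stream b. The K streams are stored back to back, so
// the output holds K * N bytes in total.
template <typename T>
std::vector<std::uint8_t> ByteStreamSplitEncode(const std::vector<T>& values) {
  constexpr std::size_t K = sizeof(T);  // 4 for float, 8 for double
  const std::size_t N = values.size();
  std::vector<std::uint8_t> out(K * N);
  for (std::size_t i = 0; i < N; ++i) {
    std::uint8_t raw[K];
    std::memcpy(raw, &values[i], K);  // read the value's bytes safely
    for (std::size_t b = 0; b < K; ++b) {
      out[b * N + i] = raw[b];
    }
  }
  return out;
}

// Hypothetical inverse: gather byte b of element i from stream b and
// reassemble the original values.
template <typename T>
std::vector<T> ByteStreamSplitDecode(const std::vector<std::uint8_t>& data) {
  constexpr std::size_t K = sizeof(T);
  const std::size_t N = data.size() / K;
  std::vector<T> values(N);
  for (std::size_t i = 0; i < N; ++i) {
    std::uint8_t raw[K];
    for (std::size_t b = 0; b < K; ++b) {
      raw[b] = data[b * N + i];
    }
    std::memcpy(&values[i], raw, K);
  }
  return values;
}
```

The transformation itself does not shrink the data. It only regroups bytes with the same position within each value, and the result would then be passed to a compressor such as zstd or gzip.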



--
This message was sent by Atlassian Jira
(v8.3.2#803003)