Posted to dev@parquet.apache.org by "Xuwei Fu (Jira)" <ji...@apache.org> on 2023/01/18 03:01:00 UTC

[jira] [Commented] (PARQUET-1622) Add BYTE_STREAM_SPLIT encoding

    [ https://issues.apache.org/jira/browse/PARQUET-1622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17678055#comment-17678055 ] 

Xuwei Fu commented on PARQUET-1622:
-----------------------------------

[~gszadovszky] [~martinradev] 

Hi all, I ran into a problem here: [https://github.com/apache/arrow/issues/15173]

Would you mind taking a look? It seems we don't have a "non-null value count" here.

> Add BYTE_STREAM_SPLIT encoding
> ------------------------------
>
>                 Key: PARQUET-1622
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1622
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-cpp, parquet-format, parquet-mr, parquet-thrift
>            Reporter: Martin Radev
>            Assignee: Martin Radev
>            Priority: Minor
>              Labels: features, pull-request-available
>             Fix For: 1.12.0, format-2.8.0
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Apache Parquet does not have any encodings suitable for floating-point (FP) data, and the available general-purpose compressors (zstd, gzip, etc.) do not handle FP data very well.
> It is possible to apply a simple data transformation named "stream splitting". One variant is "byte stream splitting", which creates K streams of length N, where K is the number of bytes in the data type (4 for floats, 8 for doubles) and N is the number of elements in the sequence.
> The transformed data compresses significantly better on average than the original data, and in some cases compression and decompression speed also improve.
> You can read a more detailed report here:
> https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view
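
For readers who want to see the transformation described above in concrete terms, here is a minimal sketch in plain Python. It is not the parquet-cpp or parquet-mr implementation; the function names `byte_stream_split` and `byte_stream_merge` are hypothetical, and the little-endian 4-byte float layout (`"<f"`) is an assumption made for illustration. Stream k simply collects byte k of every element, giving K streams of N bytes each.

```python
import struct


def byte_stream_split(values, fmt="<f"):
    """Hypothetical helper: split packed values into K byte streams.

    fmt is a struct format for one element ("<f" = 4-byte float,
    "<d" = 8-byte double); little-endian is assumed here.
    """
    width = struct.calcsize(fmt)  # K: bytes per element
    raw = b"".join(struct.pack(fmt, v) for v in values)
    # Stream k holds byte k of every element; the K streams,
    # each N bytes long, are concatenated in order.
    return b"".join(raw[k::width] for k in range(width))


def byte_stream_merge(encoded, fmt="<f"):
    """Hypothetical helper: inverse of byte_stream_split."""
    width = struct.calcsize(fmt)
    n = len(encoded) // width  # N: number of elements
    raw = bytearray(len(encoded))
    for k in range(width):
        # Scatter stream k back to byte position k of each element.
        raw[k::width] = encoded[k * n:(k + 1) * n]
    return [struct.unpack_from(fmt, raw, i * width)[0] for i in range(n)]


values = [1.0, 2.0, 3.5]
encoded = byte_stream_split(values)
decoded = byte_stream_merge(encoded)
```

The split itself does not shrink the data: `encoded` has the same length as the packed input. Any benefit would come from a compressor applied afterwards, since bytes with similar roles (for example, sign and exponent bytes) end up next to each other.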



--
This message was sent by Atlassian Jira
(v8.20.10#820010)