Posted to dev@parquet.apache.org by "Antoine Pitrou (Jira)" <ji...@apache.org> on 2021/03/24 15:23:00 UTC

[jira] [Resolved] (PARQUET-1241) [C++] Use LZ4 frame format

     [ https://issues.apache.org/jira/browse/PARQUET-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antoine Pitrou resolved PARQUET-1241.
-------------------------------------
    Fix Version/s:     (was: cpp-4.0.0)
       Resolution: Won't Do

We're abandoning this. Instead, PARQUET-1996 added a definition for a new LZ4_RAW compression format, which uses the LZ4 "block" codec without Hadoop-specific framing.

The C++ implementation of LZ4_RAW is tracked by PARQUET-1998.
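
For readers trying to tell the formats apart, here is a rough sketch (not taken from any Parquet implementation). It assumes the LZ4 frame format's magic number 0x184D2204 (stored little-endian) and the Hadoop layout of a 4-byte big-endian decompressed size followed by a 4-byte big-endian compressed size. The function names are hypothetical, and both checks are heuristics only: a raw LZ4 block has no header at all, so it can never be positively identified this way.

```python
import struct

# Magic number that opens every LZ4 frame, stored little-endian on disk.
LZ4_FRAME_MAGIC = 0x184D2204


def looks_like_lz4_frame(buf: bytes) -> bool:
    """Return True if buf starts with the LZ4 frame magic number."""
    if len(buf) < 4:
        return False
    (magic,) = struct.unpack_from("<I", buf, 0)
    return magic == LZ4_FRAME_MAGIC


def looks_like_hadoop_lz4_single_chunk(buf: bytes) -> bool:
    """Heuristic (assumption: one chunk only): the 8-byte Hadoop prefix
    holds two big-endian sizes, and the compressed size should match
    the number of bytes that follow the prefix."""
    if len(buf) < 8:
        return False
    _decompressed_size, compressed_size = struct.unpack_from(">II", buf, 0)
    return compressed_size == len(buf) - 8


if __name__ == "__main__":
    frame_like = bytes([0x04, 0x22, 0x4D, 0x18]) + b"\x00" * 8
    print(looks_like_lz4_frame(frame_like))
```

A buffer that fails both checks may still be a raw LZ4 block (as LZ4_RAW uses), a multi-chunk Hadoop stream, or something else entirely, so a reader should rely on the codec declared in the file metadata rather than on sniffing bytes.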

> [C++] Use LZ4 frame format
> --------------------------
>
>                 Key: PARQUET-1241
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1241
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-cpp, parquet-format
>            Reporter: Lawrence Chan
>            Priority: Major
>
> The parquet-format spec doesn't currently specify whether LZ4-compressed data should be framed or not. We should choose one and make it explicit in the spec, as the two formats are not interoperable. After some discussions with others [1], we think it would be beneficial to use the framed format, which adds a small header in exchange for more self-contained decompression as well as a richer feature set (checksums, parallel decompression, etc.).
> The current Arrow implementation compresses using the LZ4 block format, and this would need to be updated when we add the spec clarification.
> If backwards compatibility is a concern, I would suggest adding an additional LZ4_FRAMED compression type, but that may be more noise than anything.
> [1] https://github.com/dask/fastparquet/issues/314


