Posted to dev@parquet.apache.org by "Martin Radev (Jira)" <ji...@apache.org> on 2019/11/02 12:31:00 UTC

[jira] [Commented] (PARQUET-1241) [C++] Use LZ4 frame format

    [ https://issues.apache.org/jira/browse/PARQUET-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16965336#comment-16965336 ] 

Martin Radev commented on PARQUET-1241:
---------------------------------------

I think this can be fixed almost gracefully, without breaking backward compatibility for the one-shot path.

1. First, the Parquet spec should be tightened by specifying which LZ4 format is used.
2. Switch to the framed format in the parquet-cpp one-shot path.
- To guarantee backward compatibility, the implementation should check the first four bytes of the payload for the frame magic number. The magic bytes are specified in the LZ4 spec: [https://android.googlesource.com/platform/external/lz4/+/HEAD/doc/lz4_Frame_format.md]
- The four magic bytes cannot alias with the first four bytes of the block format. The spec doesn't explicitly state that, but the encoding seems to guarantee it: [https://android.googlesource.com/platform/external/lz4/+/HEAD/doc/lz4_Block_format.md]
- So, in the implementation, if the four magic bytes are present, pass the payload to the framed path. If not, use the old block path.
3. Have a look at what other Parquet read/write implementations are currently doing and open issues if necessary.
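
The magic-byte check from step 2 could look roughly like the sketch below. It is only an illustration, not existing parquet-cpp code: the names DetectLz4Layout and Lz4Layout are hypothetical, and it relies on the assumption stated above that a valid block-format payload never starts with the frame magic number. The magic value 0x184D2204, stored little-endian, comes from the LZ4 frame format description.

```cpp
#include <cstddef>
#include <cstdint>

// Magic number that opens every LZ4 frame. It is stored little-endian,
// so the payload starts with the bytes 04 22 4D 18.
constexpr uint32_t kLz4FrameMagic = 0x184D2204u;

// Hypothetical names for illustration; not taken from parquet-cpp.
enum class Lz4Layout { kFrame, kBlock };

// Decide which decompression path a payload should take.
// Assumption: a block-format payload cannot begin with the frame magic,
// so anything without the magic is treated as legacy block data.
inline Lz4Layout DetectLz4Layout(const uint8_t* data, std::size_t size) {
  if (data == nullptr || size < 4) {
    return Lz4Layout::kBlock;  // too short to carry a frame header
  }
  // Assemble the first four bytes as a little-endian 32-bit value,
  // independent of host endianness.
  const uint32_t magic = static_cast<uint32_t>(data[0]) |
                         (static_cast<uint32_t>(data[1]) << 8) |
                         (static_cast<uint32_t>(data[2]) << 16) |
                         (static_cast<uint32_t>(data[3]) << 24);
  return magic == kLz4FrameMagic ? Lz4Layout::kFrame : Lz4Layout::kBlock;
}
```

A reader would call DetectLz4Layout on the compressed page buffer and hand it to the frame decompressor on kFrame, or to the existing block decompressor on kBlock. A writer would always emit the framed format.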

Does this make sense?
If so, I can write the parquet-cpp and parquet patches.

> [C++] Use LZ4 frame format
> --------------------------
>
>                 Key: PARQUET-1241
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1241
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-cpp, parquet-format
>            Reporter: Lawrence Chan
>            Priority: Major
>
> The parquet-format spec doesn't currently specify whether lz4-compressed data should be framed or not. We should choose one and make it explicit in the spec, as they are not inter-operable. After some discussions with others [1], we think it would be beneficial to use the framed format, which adds a small header in exchange for more self-contained decompression as well as a richer feature set (checksums, parallel decompression, etc).
> The current arrow implementation compresses using the lz4 block format, and this would need to be updated when we add the spec clarification.
> If backwards compatibility is a concern, I would suggest adding an additional LZ4_FRAMED compression type, but that may be more noise than anything.
> [1] https://github.com/dask/fastparquet/issues/314


