Posted to dev@parquet.apache.org by "Chao Sun (JIRA)" <ji...@apache.org> on 2018/03/17 18:37:00 UTC

[jira] [Created] (PARQUET-1249) Clarify encoding schemes for boolean types

Chao Sun created PARQUET-1249:
---------------------------------

             Summary: Clarify encoding schemes for boolean types
                 Key: PARQUET-1249
                 URL: https://issues.apache.org/jira/browse/PARQUET-1249
             Project: Parquet
          Issue Type: Improvement
          Components: parquet-format
            Reporter: Chao Sun


In the Parquet format specification, [the section for Plain encoding|https://github.com/apache/parquet-format/blob/master/Encodings.md#plain-plain--0] says that boolean values are encoded using the deprecated bit-packed encoding. However, [the section for bit-packed encoding|https://github.com/apache/parquet-format/blob/master/Encodings.md#bit-packed-deprecated-bit_packed--4] specifies that it is used only for repetition/definition levels. The two sections seem to contradict each other.
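
To make the ambiguity concrete, here is a minimal Python sketch of the two ways a run of 1-bit boolean values could be packed into bytes. It assumes, based on my reading of Encodings.md, that the bit-packing inside the RLE hybrid encoding fills each byte from the least significant bit, while the deprecated BIT_PACKED encoding fills each byte from the most significant bit. The function names are hypothetical and are not taken from parquet-mr or parquet-cpp.

```python
def pack_lsb_first(values):
    """Pack booleans one bit each, filling every byte from bit 0 upward."""
    out = bytearray((len(values) + 7) // 8)
    for i, v in enumerate(values):
        if v:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


def pack_msb_first(values):
    """Pack booleans one bit each, filling every byte from bit 7 downward."""
    out = bytearray((len(values) + 7) // 8)
    for i, v in enumerate(values):
        if v:
            out[i // 8] |= 0x80 >> (i % 8)
    return bytes(out)


values = [True, False, True, True, False, False, False, False, True]
print(pack_lsb_first(values).hex())  # 0d01
print(pack_msb_first(values).hex())  # b080
```

The same nine values produce different bytes under the two orders, so a reader and a writer that disagree on which section of the specification applies would not interoperate.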

[The section for RLE/bit-packed hybrid encoding|https://github.com/apache/parquet-format/blob/master/Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3] lists "_Boolean values in data pages, as an alternative to PLAIN encoding_" among its uses. Perhaps we should be more specific and state that this applies only to data page V2?
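
For reference, a rough Python sketch of what RLE-encoding boolean values with bit width 1 could look like, following the run grammar in Encodings.md as I understand it: an RLE run is a varint header holding the run length shifted left by one, followed by the repeated value in one byte. This sketch emits only RLE runs (never bit-packed runs) and omits any length prefix; whether a prefix is required in a given page version is part of what needs clarifying. The function names are hypothetical.

```python
def encode_varint(n):
    """Encode a non-negative integer as an unsigned LEB128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def rle_encode_booleans(values):
    """Encode booleans as a sequence of RLE runs with bit width 1."""
    out = bytearray()
    i = 0
    while i < len(values):
        # Find the end of the current run of identical values.
        j = i
        while j < len(values) and values[j] == values[i]:
            j += 1
        out += encode_varint((j - i) << 1)  # low bit 0 marks an RLE run
        out.append(1 if values[i] else 0)   # repeated value, padded to 1 byte
        i = j
    return bytes(out)


print(rle_encode_booleans([True] * 100 + [False] * 3).hex())  # c801010600
```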

Also, implementation-wise, I saw that parquet-cpp still encodes booleans as plain 1-bit values, while parquet-mr uses the bit-packed encoding as described in the specification. Perhaps the two implementations should be reconciled.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)