Posted to dev@parquet.apache.org by "Lars Volker (JIRA)" <ji...@apache.org> on 2017/01/05 18:53:58 UTC

[jira] [Created] (PARQUET-826) parquet.thrift comments for Statistics are not consistent with parquet-mr and Hive implementations

Lars Volker created PARQUET-826:
-----------------------------------

             Summary: parquet.thrift comments for Statistics are not consistent with parquet-mr and Hive implementations
                 Key: PARQUET-826
                 URL: https://issues.apache.org/jira/browse/PARQUET-826
             Project: Parquet
          Issue Type: Bug
          Components: parquet-format
            Reporter: Lars Volker
            Assignee: Lars Volker


I'm currently working on adding support to Impala for writing min/max statistics to Parquet files ([IMPALA-3909|https://issues.cloudera.org/browse/IMPALA-3909]). I noticed that the comments in [parquet.thrift#L201|https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L201] don't seem to match the implementations in parquet-mr and Hive.

The comments ask for min/max statistics to be "encoded in PLAIN encoding". For strings (BYTE_ARRAY), PLAIN encoding is defined as "4 byte length stored as little endian, followed by bytes", so the statistics values should carry that length prefix.

Looking at [BinaryStatistics.java#L61|https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/statistics/BinaryStatistics.java#L61], it seems to return the bytes without a length-prefix. Writing a parquet file with Hive also shows this behavior.
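
To illustrate the difference, here is a minimal Python sketch of the two layouts for a BYTE_ARRAY min/max value. It is not taken from parquet-mr or Hive; the function names are hypothetical and only serve to contrast the length-prefixed form the comments describe with the raw bytes the implementations appear to write.

```python
import struct


def plain_encode_byte_array(value: bytes) -> bytes:
    """Layout described by the parquet.thrift comments (PLAIN encoding):
    4-byte little-endian length, followed by the bytes themselves."""
    return struct.pack("<I", len(value)) + value


def raw_byte_array(value: bytes) -> bytes:
    """Layout that parquet-mr and Hive appear to write for statistics:
    the bytes only, without a length prefix."""
    return value


if __name__ == "__main__":
    value = "abc".encode("utf-8")
    print(plain_encode_byte_array(value).hex())  # 03000000616263
    print(raw_byte_array(value).hex())           # 616263
```

A reader that expects one layout and receives the other would misinterpret the first four bytes, which is why the comment and the implementations need to agree.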

Similarly, though less ambiguously, PLAIN encoding for booleans uses bit-packing. This seems to imply that a single bit (the min/max of a boolean column) is stored by setting the least significant bit of a single byte. This could be made clearer in the parquet.thrift file, too.
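
As a sketch of that interpretation, the following Python snippet bit-packs booleans with the least significant bit first, so a single value ends up in bit 0 of one byte. The helper name is hypothetical, and treating a lone min/max value as a one-element bit-packed run is the assumption discussed above, not something the thrift comments state explicitly.

```python
from typing import Sequence


def plain_encode_booleans(values: Sequence[bool]) -> bytes:
    """Bit-pack booleans, least significant bit first, padding the
    final byte with zero bits."""
    out = bytearray((len(values) + 7) // 8)
    for i, v in enumerate(values):
        if v:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


if __name__ == "__main__":
    # A single boolean min/max value occupies one byte.
    print(plain_encode_booleans([True]).hex())   # 01
    print(plain_encode_booleans([False]).hex())  # 00
```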


