You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@parquet.apache.org by Lars Volker <lv...@cloudera.com> on 2016/12/16 19:06:45 UTC

Inconsistencies between parquet.thrift and parquet-mr/Hive

Hi All,

I'm currently working on adding support for writing min/max statistics to
Parquet files to Impala (IMPALA-3909
<https://issues.cloudera.org/browse/IMPALA-3909>). I noticed, that the
comments in parquet.thrift#L201
<https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L201>
don't
seem to match the implementations in parquet-mr and Hive.

The comments ask for min/max statistics to be "*encoded in PLAIN encoding*".
For strings (BYTE_ARRAY), this should be "*4 byte length stored as little
endian, followed by bytes*".

Looking at BinaryStatistics.java#L61
<https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/statistics/BinaryStatistics.java#L61>,
it seems to return the bytes without a length-prefix. Writing a parquet
file with Hive also shows this behavior.

Is this the intended behavior? If so, we might want to add a description to
the Statistics struct in parquet.thrift to elaborate on the intrinsics of
storing string values there.

Similarly, but less ambiguous, PLAIN encoding for booleans uses
bit-packing. It seems to be implied that for a single bit (min/max of a
boolean column) it means setting the least significant bit of a single
byte. This could be made more clear in the parquet.thrift file, too.

I'm curious to hear your feedback. Let me know if you think we should
change the parquet.thrift file and I'll happily send a PR.

Cheers, Lars

Re: Inconsistencies between parquet.thrift and parquet-mr/Hive

Posted by Lars Volker <lv...@cloudera.com>.
Hi All,

This looks like it got dropped. Is there anything else I should do before
this can be considered for review?

Thanks for the help, Lars

On Thu, Jan 5, 2017 at 8:08 PM, Lars Volker <lv...@cloudera.com> wrote:

> I created PARQUET-826 <https://issues.apache.org/jira/browse/PARQUET-826>
> to track this and submitted PR #48
> <https://github.com/apache/parquet-format/pull/48> to address it.
>
> On Fri, Dec 16, 2016 at 8:06 PM, Lars Volker <lv...@cloudera.com> wrote:
>
>> Hi All,
>>
>> I'm currently working on adding support for writing min/max statistics to
>> Parquet files to Impala (IMPALA-3909
>> <https://issues.cloudera.org/browse/IMPALA-3909>). I noticed, that the
>> comments in parquet.thrift#L201
>> <https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L201> don't
>> seem to match the implementations in parquet-mr and Hive.
>>
>> The comments ask for min/max statistics to be "*encoded in PLAIN
>> encoding*". For strings (BYTE_ARRAY), this should be "*4 byte length
>> stored as little endian, followed by bytes*".
>>
>> Looking at BinaryStatistics.java#L61
>> <https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/statistics/BinaryStatistics.java#L61>,
>> it seems to return the bytes without a length-prefix. Writing a parquet
>> file with Hive also shows this behavior.
>>
>> Is this the intended behavior? If so, we might want to add a description
>> to the Statistics struct in parquet.thrift to elaborate on the intrinsics
>> of storing string values there.
>>
>> Similarly, but less ambiguous, PLAIN encoding for booleans uses
>> bit-packing. It seems to be implied that for a single bit (min/max of a
>> boolean column) it means setting the least significant bit of a single
>> byte. This could be made more clear in the parquet.thrift file, too.
>>
>> I'm curious to hear your feedback. Let me know if you think we should
>> change the parquet.thrift file and I'll happily send a PR.
>>
>> Cheers, Lars
>>
>
>

Re: Inconsistencies between parquet.thrift and parquet-mr/Hive

Posted by Lars Volker <lv...@cloudera.com>.
I created PARQUET-826 <https://issues.apache.org/jira/browse/PARQUET-826>
to track this and submitted PR #48
<https://github.com/apache/parquet-format/pull/48> to address it.

On Fri, Dec 16, 2016 at 8:06 PM, Lars Volker <lv...@cloudera.com> wrote:

> Hi All,
>
> I'm currently working on adding support for writing min/max statistics to
> Parquet files to Impala (IMPALA-3909
> <https://issues.cloudera.org/browse/IMPALA-3909>). I noticed, that the
> comments in parquet.thrift#L201
> <https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L201> don't
> seem to match the implementations in parquet-mr and Hive.
>
> The comments ask for min/max statistics to be "*encoded in PLAIN encoding*".
> For strings (BYTE_ARRAY), this should be "*4 byte length stored as little
> endian, followed by bytes*".
>
> Looking at BinaryStatistics.java#L61
> <https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/statistics/BinaryStatistics.java#L61>,
> it seems to return the bytes without a length-prefix. Writing a parquet
> file with Hive also shows this behavior.
>
> Is this the intended behavior? If so, we might want to add a description
> to the Statistics struct in parquet.thrift to elaborate on the intrinsics
> of storing string values there.
>
> Similarly, but less ambiguous, PLAIN encoding for booleans uses
> bit-packing. It seems to be implied that for a single bit (min/max of a
> boolean column) it means setting the least significant bit of a single
> byte. This could be made more clear in the parquet.thrift file, too.
>
> I'm curious to hear your feedback. Let me know if you think we should
> change the parquet.thrift file and I'll happily send a PR.
>
> Cheers, Lars
>