Posted to dev@parquet.apache.org by Andrew Duffy <ro...@aduffy.org> on 2016/08/26 01:08:24 UTC

Parquet Format Change for Statistics Ordering

Hello Dev-Parquet,

I recently filed an issue, PARQUET-686
<https://issues.apache.org/jira/browse/PARQUET-686>, to fix the sort order
used for Binary types in Parquet. The goal is to allow statistics to be
calculated from an *unsigned* interpretation of binary byte strings, which
is the ordering you want for UTF8 columns, for example. The current
ordering causes a correctness issue in Spark (see SPARK-17213
<https://issues.apache.org/jira/browse/SPARK-17213> for details), so it is
likely also broken in other query engines that push down string filters to
Parquet.
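
To make the difference concrete, here is a minimal sketch in Python (not code from parquet-mr or Spark). The helper names and sample values are made up for illustration. It models a signed comparison by mapping each byte to its two's complement value, and shows how the resulting min/max could mislead a filter on a UTF8 column.

```python
from typing import List


def signed_key(data: bytes) -> List[int]:
    """Sort key that treats each byte as signed (-128..127)."""
    return [b - 256 if b >= 128 else b for b in data]


def unsigned_key(data: bytes) -> List[int]:
    """Sort key that treats each byte as unsigned (0..255)."""
    return list(data)


# Illustrative column values; "\u00e9glise" starts with the UTF-8 bytes C3 A9.
values = ["apple", "zebra", "\u00e9glise"]
encoded = [v.encode("utf-8") for v in values]

signed_min = min(encoded, key=signed_key)
signed_max = max(encoded, key=signed_key)
unsigned_min = min(encoded, key=unsigned_key)
unsigned_max = max(encoded, key=unsigned_key)

print("signed:  ", signed_min, signed_max)
print("unsigned:", unsigned_min, unsigned_max)
```

Under the signed ordering the maximum is "zebra", so a reader evaluating a filter such as `col > 'zebra'` against those statistics could skip the data even though the non-ASCII value sorts after "zebra" in unsigned (UTF-8) order.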

The fix requires changes both to the Parquet implementations (parquet-mr,
parquet-cpp) and to the format itself: a new set of optional fields on the
statistics that allow explicit signed and unsigned statistics to be
specified. The PR to parquet-format can be seen at
https://github.com/apache/parquet-format/pull/42.
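
As a rough sketch of the shape of such a change (in Python, and not the actual definition from the PR), a statistics record could carry both orderings as optional fields. The class name, field names, and `compute_statistics` helper below are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class BinaryStatistics:
    # Hypothetical field names; the real ones are defined in the PR.
    # Every field is optional so that older files remain readable.
    min_signed: Optional[bytes] = None
    max_signed: Optional[bytes] = None
    min_unsigned: Optional[bytes] = None
    max_unsigned: Optional[bytes] = None


def _signed_key(data: bytes) -> List[int]:
    """Sort key that treats each byte as signed (-128..127)."""
    return [b - 256 if b >= 128 else b for b in data]


def compute_statistics(values: Iterable[bytes]) -> BinaryStatistics:
    """Compute min/max under both orderings for one column chunk."""
    vals = list(values)
    if not vals:
        return BinaryStatistics()
    return BinaryStatistics(
        min_signed=min(vals, key=_signed_key),
        max_signed=max(vals, key=_signed_key),
        # Python compares bytes objects as unsigned byte sequences.
        min_unsigned=min(vals),
        max_unsigned=max(vals),
    )


stats = compute_statistics([b"apple", b"zebra", "\u00e9glise".encode("utf-8")])
print(stats)
```

A reader that understands the new fields could pick whichever pair matches the column's semantics, and one that does not could ignore them.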

I wanted to circulate this change to the community for comment.

-Andrew

Re: Parquet Format Change for Statistics Ordering

Posted by Wes McKinney <we...@gmail.com>.
The type of comparison used here strikes me as dependent on the
ConvertedType of the column. Adding explicit signed/unsigned min/max
of course gives you both options after the fact. So another option
(if I'm understanding correctly) is to change the BYTE_ARRAY comparison
that parquet-mr uses for the UTF8 ConvertedType.
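
A minimal sketch of that alternative, in Python rather than parquet-mr's Java, might select the comparison from the converted type. The function names are hypothetical, and it assumes only UTF8 switches to unsigned ordering while everything else keeps the signed ordering.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def signed_key(data: bytes) -> List[int]:
    """Sort key that treats each byte as signed (-128..127)."""
    return [b - 256 if b >= 128 else b for b in data]


def unsigned_key(data: bytes) -> List[int]:
    """Sort key that treats each byte as unsigned (0..255)."""
    return list(data)


def sort_key_for(converted_type: Optional[str]) -> Callable[[bytes], List[int]]:
    # Assumption: UTF8 columns use unsigned ordering, all others stay signed.
    if converted_type == "UTF8":
        return unsigned_key
    return signed_key


def min_max(values: Sequence[bytes], converted_type: Optional[str]) -> Tuple[bytes, bytes]:
    """Return (min, max) under the ordering chosen for the converted type."""
    key = sort_key_for(converted_type)
    return min(values, key=key), max(values, key=key)


sample = [b"a", b"\xc3\xa9"]  # "a" and the UTF-8 encoding of e-acute
print(min_max(sample, "UTF8"))
print(min_max(sample, None))
```

This keeps a single min/max pair in the format, at the cost of readers needing to know which ordering a given writer used.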

As an aside, are decimal statistics (e.g. 12-byte or 16-byte decimals)
valid based on a signed binary comparison?
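
One way to probe this question is a small Python sketch. It assumes decimals are stored as fixed-width, big-endian two's complement unscaled integers with a common scale, and it models "signed binary comparison" as a per-byte signed lexicographic comparison; both are assumptions, and the helper names are made up.

```python
from typing import List


def decode_decimal(data: bytes) -> int:
    """Decode a big-endian two's complement unscaled decimal value."""
    return int.from_bytes(data, byteorder="big", signed=True)


def signed_bytes_key(data: bytes) -> List[int]:
    """Per-byte signed lexicographic key (each byte in -128..127)."""
    return [b - 256 if b >= 128 else b for b in data]


def encode(value: int, width: int = 2) -> bytes:
    """Encode an int as fixed-width big-endian two's complement."""
    return value.to_bytes(width, byteorder="big", signed=True)


v127, v128 = encode(127), encode(128)    # 00 7F and 00 80
v_neg1, v_pos1 = encode(-1), encode(1)   # FF FF and 00 01

# Per-byte signed comparison misorders 127 and 128 (0x80 reads as -128).
per_byte_signed_wrong = signed_bytes_key(v128) < signed_bytes_key(v127)

# Plain unsigned comparison misorders -1 and 1 (0xFF sorts last).
unsigned_wrong = v_neg1 > v_pos1

# Decoding the whole value gives the numeric order.
numeric_ok = decode_decimal(v127) < decode_decimal(v128) and \
    decode_decimal(v_neg1) < decode_decimal(v_pos1)

print(per_byte_signed_wrong, unsigned_wrong, numeric_ok)
```

Under these assumptions neither byte-wise ordering matches numeric order in every case, so decimal columns may need a comparison of their own.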

Since we don't yet have any production users who depend heavily on
parquet-cpp, we'll be happy to implement whatever solution works
for everyone.

- Wes
