Posted to dev@parquet.apache.org by Micah Kornfield <em...@gmail.com> on 2022/11/05 04:54:09 UTC

[Format] Clarifying Sort Order Requirements for Floating Points and Logical Types

A new proposal for adding a logical annotation to support Float16 values
[1] reopened the discussion on specifying how Parquet should deal with
edge cases for floating point types (PARQUET-1222 [2]).

To resolve this, the consensus from the JIRA is not to specify an
ordering when writing, but rather to specify only rules for reading
data. These rules were already present in the parquet.thrift file
[3]. They are:

>    - If the min is a NaN, it should be ignored.
>    - If the max is a NaN, it should be ignored.
>    - If the min is +0, the row group may contain -0 values as well.
>    - If the max is -0, the row group may contain +0 values as well.
>    - When looking for NaN values, min and max should be ignored.
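
A minimal sketch of how a reader might apply the rules above when deciding
whether a row group can be skipped for an equality predicate. The function
names effective_bounds and may_contain are hypothetical and do not come from
any Parquet implementation; missing or ignored statistics are modelled as
None, which is also an assumption.

```python
import math
from typing import Optional, Tuple


def effective_bounds(
    stat_min: Optional[float], stat_max: Optional[float]
) -> Tuple[Optional[float], Optional[float]]:
    """Return (lower, upper) bounds that are safe to use for pruning.

    None means "no usable bound on this side".
    """
    # Rules 1 and 2: a NaN min or max is ignored.
    lower = None if stat_min is None or math.isnan(stat_min) else stat_min
    upper = None if stat_max is None or math.isnan(stat_max) else stat_max

    # Rules 3 and 4: a zero bound may hide a zero of the other sign, so
    # widen min to -0.0 and max to +0.0. Under IEEE 754 comparison
    # -0.0 == 0.0, so this is a no-op for the checks in may_contain; it
    # matters for readers that compare by total order or bit pattern.
    if lower is not None and lower == 0.0:
        lower = -0.0
    if upper is not None and upper == 0.0:
        upper = 0.0
    return lower, upper


def may_contain(
    value: float, stat_min: Optional[float], stat_max: Optional[float]
) -> bool:
    """Return False only if the statistics prove `value` is absent."""
    # Rule 5: min and max say nothing about NaN values.
    if math.isnan(value):
        return True
    lower, upper = effective_bounds(stat_min, stat_max)
    if lower is not None and value < lower:
        return False
    if upper is not None and value > upper:
        return False
    return True
```

For example, may_contain(-0.0, 0.0, 5.0) returns True, so a row group whose
min is +0 is not skipped when searching for -0, while
may_contain(11.0, float("nan"), 10.0) returns False because the NaN min is
ignored and the max still excludes 11.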


I've created a PR [4] to update README.md in parquet-format that:
1.  Specifies statistics should not be used when a column has an unknown
logical type since correct comparisons cannot be performed.
2.  Specifies the ordering for primitive types and references the
parquet.thrift for the details on how to handle floating point values.

Feedback and other ideas are welcome.

Thanks,
Micah

[1] https://github.com/apache/parquet-format/pull/184
[2] https://issues.apache.org/jira/browse/PARQUET-1222
[3] https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L897
[4] https://github.com/apache/parquet-format/pull/185

Re: [Format] Clarifying Sort Order Requirements for Floating Points and Logical Types

Posted by Micah Kornfield <em...@gmail.com>.
https://github.com/apache/parquet-format/pull/185 has been merged.
