Posted to dev@parquet.apache.org by "Micah Kornfield (Jira)" <ji...@apache.org> on 2022/09/30 05:32:00 UTC

[jira] [Commented] (PARQUET-1222) Specify a well-defined sorting order for float and double types

    [ https://issues.apache.org/jira/browse/PARQUET-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17611356#comment-17611356 ] 

Micah Kornfield commented on PARQUET-1222:
------------------------------------------

I'd propose the following "fix":
- Add a new optional bool field, "contains_nan", to the statistics struct. When it is unset, I think we specify that the semantics of comparisons involving -0.0/+0.0, NaN, etc. are not well defined, since implementations have taken different routes.
- When it is set, true means the column contains at least one NaN and false means no NaNs are present. Further, when set, it implies the following ordering: NaNs are never included in the Min/Max statistics in the struct, and -0.0 and +0.0 are considered two distinct values, ordered according to sign.
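
To make the proposed semantics concrete, here is a rough Python sketch of how a writer might compute the statistics when "contains_nan" is set. The FloatStats type and the compute_stats and _key helpers are hypothetical names used only for illustration; they are not part of parquet-format or of any existing implementation.

```python
import math
from typing import Iterable, NamedTuple, Optional, Tuple


class FloatStats(NamedTuple):
    """Hypothetical stand-in for the statistics struct (not the Thrift type)."""
    min: Optional[float]
    max: Optional[float]
    contains_nan: bool


def _key(x: float) -> Tuple[float, float]:
    # Order by numeric value; break the -0.0 / +0.0 tie by sign so -0.0 < +0.0.
    return (x, math.copysign(1.0, x))


def compute_stats(values: Iterable[float]) -> FloatStats:
    lo: Optional[float] = None
    hi: Optional[float] = None
    saw_nan = False
    for v in values:
        if math.isnan(v):
            # NaNs are recorded in the flag and never enter min/max.
            saw_nan = True
            continue
        if lo is None or _key(v) < _key(lo):
            lo = v
        if hi is None or _key(v) > _key(hi):
            hi = v
    return FloatStats(lo, hi, saw_nan)
```

Under this sketch, a column holding only NaNs would end up with no min/max at all and contains_nan set to true, and a column holding both zeros would report -0.0 as min and +0.0 as max.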

Thoughts?  Should I bring this up on the mailing list?

> Specify a well-defined sorting order for float and double types
> ---------------------------------------------------------------
>
>                 Key: PARQUET-1222
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1222
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-format
>            Reporter: Zoltan Ivanfi
>            Priority: Critical
>
> Currently parquet-format specifies the sort order for floating point numbers as follows:
> {code:java}
>    *   FLOAT - signed comparison of the represented value
>    *   DOUBLE - signed comparison of the represented value
> {code}
> The problem is that the comparison of floating point numbers is only a partial ordering with strange behaviour in specific corner cases. For example, according to IEEE 754, -0 is neither less nor more than +0 and comparing NaN to anything always returns false. This ordering is not suitable for statistics. Additionally, the Java implementation already uses a different (total) ordering that handles these cases correctly but differently than the C++ implementations, which leads to interoperability problems.
> TypeDefinedOrder for doubles and floats should be deprecated and a new TotalFloatingPointOrder should be introduced. The default for writing doubles and floats would be the new TotalFloatingPointOrder. This ordering should be effective and easy to implement in all programming languages.
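
For reference, one possible total ordering for doubles can be sketched in a few lines of Python. This follows the IEEE 754 totalOrder idea of comparing bit patterns; it is only an assumption about what a TotalFloatingPointOrder could look like, since the issue does not define it, and total_order_key is a hypothetical helper name.

```python
import struct


def total_order_key(x: float) -> int:
    """Map a double to an int whose natural order is a total order on doubles."""
    # Reinterpret the 64-bit pattern of the double as a signed integer.
    bits = struct.unpack("<q", struct.pack("<d", x))[0]
    # For values with the sign bit set, flip the lower 63 bits so that
    # larger magnitudes sort lower (e.g. -2.0 before -1.0 before -0.0).
    if bits < 0:
        bits ^= 0x7FFFFFFFFFFFFFFF
    return bits


# The partial order described in the issue: these comparisons are all False.
nan = float("nan")
partial_order_results = (nan < 1.0, nan > 1.0, -0.0 < 0.0, -0.0 > 0.0)

# With the key, -0.0 sorts strictly before +0.0 and every value is comparable.
ordered = sorted([1.0, 0.0, -0.0, -1.0, float("inf"), float("-inf")],
                 key=total_order_key)
```

With this key, NaNs with the sign bit clear sort after +infinity and NaNs with the sign bit set sort before -infinity, so a statistics writer would still need a rule for whether NaNs may appear in min/max.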


