Posted to dev@parquet.apache.org by "Antoine Pitrou (Jira)" <ji...@apache.org> on 2021/04/04 09:43:00 UTC

[jira] [Commented] (PARQUET-1222) Specify a well-defined sorting order for float and double types

    [ https://issues.apache.org/jira/browse/PARQUET-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17314456#comment-17314456 ] 

Antoine Pitrou commented on PARQUET-1222:
-----------------------------------------

I'll note that Parquet C++ now has the following behaviour:

* signed zeros are properly ordered (ARROW-5562)
* NaNs are ignored when computing min/max (PARQUET-1225); if a page or column chunk only has NaNs, the statistics are unset
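
The two points above can be illustrated with a small Python sketch. This is not the Parquet C++ code: the function name float_min_max is hypothetical, and the handling of zeros (report a zero min as -0.0 and a zero max as +0.0, so the range covers both signs) is an assumption about one reasonable way to order signed zeros safely.

```python
import math
from typing import Iterable, Optional, Tuple


def float_min_max(values: Iterable[float]) -> Optional[Tuple[float, float]]:
    """Hypothetical helper: min/max statistics that skip NaNs.

    Returns None ("statistics unset") when there is no non-NaN value.
    """
    lo = math.inf
    hi = -math.inf
    seen = False
    for v in values:
        if math.isnan(v):
            continue  # NaNs do not take part in min/max
        seen = True
        if v < lo:
            lo = v
        if v > hi:
            hi = v
    if not seen:
        return None  # only NaNs (or no values): leave statistics unset
    # Assumption: widen a zero bound so that both -0.0 and +0.0 fall inside it.
    if lo == 0.0:
        lo = -0.0
    if hi == 0.0:
        hi = 0.0
    return lo, hi
```

With this sketch, float_min_max([float("nan")]) returns None, and a chunk holding only 0.0 yields the range (-0.0, +0.0).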


> Specify a well-defined sorting order for float and double types
> ---------------------------------------------------------------
>
>                 Key: PARQUET-1222
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1222
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-format
>            Reporter: Zoltan Ivanfi
>            Priority: Critical
>
> Currently parquet-format specifies the sort order for floating point numbers as follows:
> {code:java}
>    *   FLOAT - signed comparison of the represented value
>    *   DOUBLE - signed comparison of the represented value
> {code}
> The problem is that the comparison of floating point numbers is only a partial ordering with strange behaviour in specific corner cases. For example, according to IEEE 754, -0 is neither less than nor greater than +0, and any ordered comparison involving NaN returns false. This ordering is not suitable for statistics. Additionally, the Java implementation already uses a different (total) ordering that handles these cases correctly but differently from the C++ implementations, which leads to interoperability problems.
> TypeDefinedOrder for doubles and floats should be deprecated and a new TotalFloatingPointOrder should be introduced. The default for writing doubles and floats would be the new TotalFloatingPointOrder. This ordering should be effective and easy to implement in all programming languages.
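
As an illustration of the corner cases described in the issue, here is a minimal Python sketch. The issue text does not define TotalFloatingPointOrder, so the key below is only one candidate: it assumes an ordering in the style of the IEEE 754 totalOrder predicate, derived from the bit pattern of a double. The name total_order_key is hypothetical.

```python
import struct

# Default IEEE 754 comparison is only a partial order:
assert not (-0.0 < 0.0) and not (-0.0 > 0.0)   # signed zeros compare equal
nan = float("nan")
assert not (nan < 1.0) and not (nan > 1.0) and not (nan == nan)


def total_order_key(x: float) -> int:
    """Hypothetical sort key giving a total order over doubles.

    Reinterprets the 64-bit pattern as a signed integer and, for values
    with the sign bit set, flips the remaining 63 bits so that larger
    magnitudes sort lower. Resulting order:
    -NaN < -inf < ... < -0.0 < +0.0 < ... < +inf < +NaN.
    """
    (bits,) = struct.unpack(">q", struct.pack(">d", x))
    if bits < 0:
        bits ^= 0x7FFFFFFFFFFFFFFF
    return bits


values = [nan, 1.0, 0.0, -0.0, float("inf"), float("-inf")]
ordered = sorted(values, key=total_order_key)
```

Sorting by this key places -0.0 before +0.0 and gives NaN a fixed position, so min/max statistics computed under it would be well defined for every input.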


