You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@parquet.apache.org by "Zoltan Ivanfi (JIRA)" <ji...@apache.org> on 2018/02/26 14:19:00 UTC

[jira] [Issue Comment Deleted] (PARQUET-1222) Definition of float and double sort order is ambigious

     [ https://issues.apache.org/jira/browse/PARQUET-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zoltan Ivanfi updated PARQUET-1222:
-----------------------------------
    Comment: was deleted

(was: [~jbapple] Indeed, the proposed order does not match the IEEE 754 totalOrder for the following reasons:
* In case of subnormal values, different bit patterns may correspond to the same numeric value, similar to how one can write 0.005 as 0.5 * 10^-2 or 0.05 * 10^-1 or ... The IEEE 754 totalOrder predicate orders these numerically equal values according to the exponents used in their representation. This could lead to row groups being dropped or not based on what bit pattern was used to save the data and what bit pattern is used for looking them up. Since both the corresponding numeric values and their presentation to the user are identical, this would lead to behaviour that I consider incorrect, or at least unintuitive.
* No programming languages or libraries I know of implement the totalOrder predicate. I am also not aware of any hardware implementation, and if there were one, it would still be virtually impossible to use it without proper exposure through a library or some very low-level wizardry.
* The definition of the totalOrder predicate looks very complicated. I think any implementation is prone to be complicated, error-prone and inefficient.

The proposed order, on the other hand, is sufficient for sorting and seems to be easy to implement. If the regular (normally hardware-accelerated) floating point comparison operators can decide the order, then it returns those. Otherwise it orders by the exponent part of the bit pattern.

However, depending on one's point of view, the proposed ordering may not really be considered a total ordering as both zeros are equivalent to each other and all NaN are equivalent to each other as well. However, it is a (strict) weak ordering for sure and maybe the enum should be renamed as such. However, my understanding is that the only difference between weak ordering and total ordering is whether equivalence by ordering is the same as equality by value. But then the totalOrder defined by IEEE 754 is not a total order either, since it defines NaN-s of the same bit pattern to be equivalent for ordering, while it also defines a NaN value to not be equal to itself at the same time. Since the equality defined by IEEE 754 is already inconsistent in the mathematical sense (because of x != x), there really is no proper equality operation to use for judging whether the proposed ordering is weak or total. From a more practical point of view, we could consider an ordering weak if it is sufficient for sorting but not for hashing, while a total ordering would be sufficient for both. This, however, depends on how we calculate the hash values and is also entirely out of scope for min/max values or indexes.)

> Definition of float and double sort order is ambigious
> ------------------------------------------------------
>
>                 Key: PARQUET-1222
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1222
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-format
>            Reporter: Zoltan Ivanfi
>            Priority: Critical
>             Fix For: format-2.5.0
>
>         Attachments: ordering.png
>
>
> Currently parquet-format specifies the sort order for floating point numbers as follows:
> {code:java}
>    *   FLOAT - signed comparison of the represented value
>    *   DOUBLE - signed comparison of the represented value
> {code}
> The problem is that the comparison of floating point numbers is only a partial ordering with strange behaviour in specific corner cases. For example, according to IEEE 754, -0 is neither less nor more than \+0 and comparing NaN to anything always returns false. This ordering is not suitable for statistics. Additionally, the Java implementation already uses a different (total) ordering that handles these cases correctly but differently than the C\+\+ implementations, which leads to interoperability problems.
> TypeDefinedOrder for doubles and floats should be deprecated and a new TotalFloatingPointOrder should be introduced. The default for writing doubles and floats would be the new TotalFloatingPointOrder. The proposed ordering is the following:
>  * -∞
>  * negative numbers in their natural order
>  * -0 and +0 in the same equivalence class \(!)
>  * positive numbers in their natural order
>  * +∞
>  * all NaN values, including the negative ones \(!), in the same equivalence class \(!)
> This ordering should be effective and easy to implement in all programming languages. A visual representation of the ordering of some example values:
> !ordering.png|width=640px!
> For reading existing stats created using TypeDefinedOrder, the following compatibility rules should be applied:
> * When looking for NaN values, min and max should be ignored.
> * If the min is a NaN, it should be ignored.
> * If the max is a NaN, it should be ignored.
> * If the min is \+0, the row group may contain -0 values as well.
> * If the max is -0, the row group may contain \+0 values as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)