You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@parquet.apache.org by "Zoltan Ivanfi (JIRA)" <ji...@apache.org> on 2018/02/19 14:13:00 UTC

[jira] [Created] (PARQUET-1222) Definition of float and double sort order is ambigious

Zoltan Ivanfi created PARQUET-1222:
--------------------------------------

             Summary: Definition of float and double sort order is ambigious
                 Key: PARQUET-1222
                 URL: https://issues.apache.org/jira/browse/PARQUET-1222
             Project: Parquet
          Issue Type: Bug
          Components: parquet-format
            Reporter: Zoltan Ivanfi


Currently parquet-format specifies the sort order for floating point numbers as follows:

{code}
   *   FLOAT - signed comparison of the represented value
   *   DOUBLE - signed comparison of the represented value
{code}

The problem is that the comparison of floating point numbers is only a partial ordering with strange behaviour in specific corner cases. For example, according to IEEE 754, -0 is neither less nor more than \+0 and comparing NaN to anything always returns false. This ordering is not suitable for statistics. Additionally, the Java implementation already uses a different (total) ordering that handles these cases correctly but differently than the C\+\+ implementations, which leads to interoperability problems.

We should explicitly require implementations to follow a specific comparison logic for these types. The candidates are:

* The [Java implementation|http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/share/classes/java/lang/Double.java#l999] which looks easy and efficient to implement in any language.
* The [IEEE 754 totalOrder predicate|https://github.com/rust-lang/rust/issues/5585] which is rather complicated to the extent that it is hard to tell whether the Java implementation adheres to it.
* The [IEEE 754-2008 min and max operations|https://en.wikipedia.org/wiki/IEEE_754_revision#min_and_max] which may be hard to use for comparison, so components could not use sorting to achieve the smallest possible min/max ranges.

An additional problem is dealing with existing data. One possibility is to specify legacy rules, like "if the stats contain NaN and the file was written by Impala, it should be ignored", but that would complicate the specs and be a burden on implementors. In fact, `min_value` and `max_value` were introduced because we did not want to define similar legacy rules for `min` and `max`. Another alternative is to deprecate `min_value` and `max_value` as well and introduce `yet_another_min` and `yet_another_max` fields instead (with nicer names, naturally).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)