You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/07/29 12:40:10 UTC

[GitHub] [iceberg] findepi opened a new pull request #2891: Fix -NaN ordering in spec

findepi opened a new pull request #2891:
URL: https://github.com/apache/iceberg/pull/2891


   In Java, -NaN is not distinguishable from NaN (is the same value, as
   introspectable with `Float.floatToIntBits` or
   `Double.doubleToLongBits`).
   
   Moreover, Java sorts all possible `NaN` values within single peer group.
   The different NaN values compare equal, even if they are not the same.
   They always compare as "greater" than positive infinity.
   
   cc @yyanyy @rdblue @electrum 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] electrum commented on pull request #2891: Fix -NaN ordering in spec

Posted by GitBox <gi...@apache.org>.
electrum commented on pull request #2891:
URL: https://github.com/apache/iceberg/pull/2891#issuecomment-889649391


   This brings up the question of how different NaN representations should be handled in Iceberg. Should writers canonicalize them? What do ORC and Parquet do for non-canonical values?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #2891: Fix -NaN ordering in spec

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #2891:
URL: https://github.com/apache/iceberg/pull/2891#issuecomment-890592412


   @electrum, I was leaving this vague. My inclination is to not to require canonicalization. It's more work and there may be legitimate uses of signalling NaNs. Plus, I just looked into signalling NaN values and they are evidently not part of the IEEE 754 spec and differ between processors:
   
   > Encodings of qNaN and sNaN are not specified in IEEE 754 and implemented differently on different processors. ([wikipedia](https://en.wikipedia.org/wiki/Single-precision_floating-point_format#Single-precision_examples))
   
   @findepi, in IEEE 754, the sign bit is independent of whether a value is NaN (exponent bits are all set and significand is non-zero). That means there are legitimate -NaN values and it would be reasonable for another language to sort them that way. I think that there is no need to change the handling of -NaN values in the spec.
   
   It's also reasonable for Java to not distinguish between -NaN and NaN values and to only produce positive NaNs. The only update I would make is to ensure that the Java implementation only writes positive NaN values to align with its sort. In most cases, no changes are needed because we track NaN values through counts after identifying them with `Float.isNaN` or `Double.isNaN`. The only cases that would need to be updated are when engines write -NaN values into files *and* are sorting them. Those engines would be responsible for converting -NaN to NaN or to implement the sort order according to the spec rather than using the default Java comparison.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on pull request #2891: Fix -NaN ordering in spec

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on pull request #2891:
URL: https://github.com/apache/iceberg/pull/2891#issuecomment-985025917


   I'm not sure I see the problem with just saying all "NaN" values come last. Doesn't -NaN not actually exist as a defined constant? From what I know of the standard the sign bit doesn't change the state of the NaN so we should treat them all the same. Since Java already does this seems like we would be fine just saying all NAN are > positive infinity. No need to qualify it imho.
   
   So I'm +1 on just removing it from the spec. No need to add a detail about NaN ordering after that I think.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] findepi commented on pull request #2891: Fix -NaN ordering in spec

Posted by GitBox <gi...@apache.org>.
findepi commented on pull request #2891:
URL: https://github.com/apache/iceberg/pull/2891#issuecomment-889685056


   > Is this specific to Java? Are -NaN values ordered in other languages?
   
   @rdblue this i do not know. Since the spec points at Java sorting as the 'reference', so i focused on that.
   
   > This brings up the question of how different NaN representations should be handled in Iceberg. Should writers canonicalize them? 
   
   @electrum this is a good question, and i was thinking about this too.
   it seems that, from Trino perspective, it doesn't matter much, because we treat all NaN values as indistinguishable. The canonicalzation is applied at comparison time, in the engine, so storage is not required to canonicalize. Of course, it would be better to have writers canonicalize, but I am concerned we will be never able to assume that at read time, because of pre-existing data.
   
   However, even if we follow this path, we still could want to define how NaNs interact with `distinct_counts` in manifest.
   Or, we would ignore `distinct_counts` whenever `nan_value_counts > 0`.
   (I don't know yet, whether this is important. We may or may not use `distinct_counts`.)
   
   > What do ORC and Parquet do for non-canonical values?
   
   @electrum you mean the reference writer implementations? i don't know.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] yyanyy commented on a change in pull request #2891: Fix -NaN ordering in spec

Posted by GitBox <gi...@apache.org>.
yyanyy commented on a change in pull request #2891:
URL: https://github.com/apache/iceberg/pull/2891#discussion_r679523222



##########
File path: site/docs/spec.md
##########
@@ -267,7 +267,8 @@ A sort order is defined by an sort order id and a list of sort fields. The order
 
 Order id `0` is reserved for the unsorted order. 
 
-Sorting floating-point numbers should produce the following behavior: `-NaN` < `-Infinity` < `-value` < `-0` < `0` < `value` < `Infinity` < `NaN`. This aligns with the implementation of Java floating-point types comparisons. 
+Sorting floating-point numbers should produce the following behavior: `-Infinity` < `-value` < `-0` < `0` < `value` < `Infinity` < `NaN`.
+The different `NaN` representation can be in an arbitrary order. This aligns with the implementation of Java floating-point types comparisons.

Review comment:
       can you specify more on what "order" means here? 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] findepi commented on a change in pull request #2891: Fix -NaN ordering in spec

Posted by GitBox <gi...@apache.org>.
findepi commented on a change in pull request #2891:
URL: https://github.com/apache/iceberg/pull/2891#discussion_r679690689



##########
File path: site/docs/spec.md
##########
@@ -267,7 +267,8 @@ A sort order is defined by an sort order id and a list of sort fields. The order
 
 Order id `0` is reserved for the unsorted order. 
 
-Sorting floating-point numbers should produce the following behavior: `-NaN` < `-Infinity` < `-value` < `-0` < `0` < `value` < `Infinity` < `NaN`. This aligns with the implementation of Java floating-point types comparisons. 
+Sorting floating-point numbers should produce the following behavior: `-Infinity` < `-value` < `-0` < `0` < `value` < `Infinity` < `NaN`.
+The different `NaN` representation can be in an arbitrary order. This aligns with the implementation of Java floating-point types comparisons.

Review comment:
       "order" was supposed to refer to sorting from the previous sentence. can i ask you for help on better wording?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] findepi commented on pull request #2891: Fix -NaN ordering in spec

Posted by GitBox <gi...@apache.org>.
findepi commented on pull request #2891:
URL: https://github.com/apache/iceberg/pull/2891#issuecomment-939340325


   @rdblue i am not sure what the conclusion here is. Please help me understand.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] findepi commented on pull request #2891: Fix -NaN ordering in spec

Posted by GitBox <gi...@apache.org>.
findepi commented on pull request #2891:
URL: https://github.com/apache/iceberg/pull/2891#issuecomment-985843589


   > Doesn't -NaN not actually exist as a defined constant?
   
   it doesn't exist in Java, yet you can calculate `-NaN` as an expression (produces a NaN value)
   
   > From what I know of the standard the sign bit doesn't change the state of the NaN so we should treat them all the same.
   
   Well, arguably you can dissect bits of a NaN value (via `Double.doubleToRawLongBits` and inspecting the various bits).
   Yet, this is not what reference implementation is doing, right?
   
   Thus, one could argue to treat NaN values with sign bit set differently from NaN values with sign bit not set, but I don't believe this is what is desired, nor implemented.
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] findepi commented on pull request #2891: Fix -NaN ordering in spec

Posted by GitBox <gi...@apache.org>.
findepi commented on pull request #2891:
URL: https://github.com/apache/iceberg/pull/2891#issuecomment-890738961


   
   
   > It's also reasonable for Java to not distinguish between -NaN and NaN values and to only produce positive NaNs. 
   
   It would be more accurate to say that java doesn't distinguish NaN values with sign bit set and those without.
   In Java, you can construct a NaN value with a bit sign set, e.g. `Double.longBitsToDouble(0xfff8000000000000L)`.
   
   If your intention is to expect Java writers to sort these as before everything, we still need to update the spec (and Java writers).
   The spec points at Java sorting as the reference implementation.
   
   If your intention is to expect Java writers not to write NaN values with sign bit set, this is in fact the canonicalization that you wanted to avoid.
   
   
   As was observed, signalling NaNs are not portable between processors, so we should exclude them from the picture.
   While NaN values can carry an additional payload (fraction bits, the sign), such use is not portable as well.
   There might be legitimate use-cases for these in some applications, but it seems we cannot expect them to be respected by the engines, which do not distinguish them.
   
   Thus, if e.g. Rust-based application writes -NaN values and Java application rewrites the data file (update, compaction, format change), we should expect the NaN attributes to be lost.
   That can be confusing to users and users would consider this a bug, which we probably wouldn't be able to fix.
   
   It seems it would be safer to have just one NaN concept in the Iceberg spec and treat all NaN values as indistinguishable.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #2891: Fix -NaN ordering in spec

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #2891:
URL: https://github.com/apache/iceberg/pull/2891#issuecomment-889539367


   Is this specific to Java? Are -NaN values ordered in other languages?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] findepi commented on pull request #2891: Fix -NaN ordering in spec

Posted by GitBox <gi...@apache.org>.
findepi commented on pull request #2891:
URL: https://github.com/apache/iceberg/pull/2891#issuecomment-1030535279


   What should happen with this PR?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org