Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/02/04 23:15:48 UTC

[GitHub] [iceberg] yyanyy commented on a change in pull request #1872: Core: add contains_nan to field_summary

yyanyy commented on a change in pull request #1872:
URL: https://github.com/apache/iceberg/pull/1872#discussion_r570609722



##########
File path: api/src/main/java/org/apache/iceberg/expressions/ManifestEvaluator.java
##########
@@ -144,18 +143,37 @@ public Boolean or(Boolean leftResult, Boolean rightResult) {
     @Override
     public <T> Boolean isNaN(BoundReference<T> ref) {
       int pos = Accessors.toPosition(ref.accessor());
-      // containsNull encodes whether at least one partition value is null, lowerBound is null if
-      // all partition values are null.
-      if (stats.get(pos).containsNull() && stats.get(pos).lowerBound() == null) {
-        return ROWS_CANNOT_MATCH; // all values are null
+
+      if (stats.get(pos).containsNaN() != null && !stats.get(pos).containsNaN()) {
+        return ROWS_CANNOT_MATCH;
+      }
+
+      if (allValuesAreNull(stats.get(pos))) {
+        return ROWS_CANNOT_MATCH;
       }
 
       return ROWS_MIGHT_MATCH;
     }
 
+    private boolean allValuesAreNull(PartitionFieldSummary summary) {
+      // Before introducing containsNaN field, containsNull encodes whether at least one partition value is null,
+      // lowerBound is null if all partition values are null.
+      // After introducing containsNaN field, containsNaN must be false to ensure all values are null since bounds
+      // don't include NaN anymore.
+      return summary.containsNull() && summary.lowerBound() == null &&
+          (summary.containsNaN() == null || !summary.containsNaN());

Review comment:
       I think excluding `NaN` from `lower`/`upper` and adding `containsNaN` both belong to this PR, so for any release that contains this change a summary would be either (1) `NaN` is part of `lower`/`upper` and `containsNaN` is missing, or (2) `containsNaN` exists and `lower`/`upper` don't store `NaN`. But I guess people may implement their own manifest summary that already excludes `NaN` from the bounds but doesn't write `containsNaN`, so we still want to handle that case. File-level metrics can give more granular information, so there isn't necessarily any performance penalty. I have updated this PR to check for the existence of `containsNaN`, but please let me know if my understanding isn't correct!
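
       To make the cases above concrete, here is a rough, self-contained sketch of one conservative reading of "check for existence of `containsNaN`". It is not the code in this PR: `Summary` is a hypothetical stand-in for `PartitionFieldSummary`, and treating a missing `containsNaN` as "unknown, rows might match" is my assumption about how the third (custom writer) case could be handled.

```java
// Sketch only: Summary is a hypothetical stand-in, not Iceberg's PartitionFieldSummary.
final class NanSummarySketch {
  static final boolean ROWS_MIGHT_MATCH = true;
  static final boolean ROWS_CANNOT_MATCH = false;

  static final class Summary {
    final boolean containsNull;
    final Object lowerBound;    // null when no non-null (and non-NaN) value was recorded
    final Boolean containsNaN;  // boxed: null means the writer did not record the field

    Summary(boolean containsNull, Object lowerBound, Boolean containsNaN) {
      this.containsNull = containsNull;
      this.lowerBound = lowerBound;
      this.containsNaN = containsNaN;
    }
  }

  // Evaluates an isNaN predicate against a single partition field summary.
  static boolean isNaN(Summary s) {
    if (s.containsNaN != null) {
      // Case (2): the field exists, so it answers the question directly.
      return s.containsNaN ? ROWS_MIGHT_MATCH : ROWS_CANNOT_MATCH;
    }
    // containsNaN is missing. This is either case (1), where NaN would show up in
    // the bounds, or a custom writer that drops NaN from the bounds. The two cannot
    // be told apart here, so a null lower bound is not treated as proof that all
    // values are null (assumption: prefer a false "might match" over a wrong prune).
    return ROWS_MIGHT_MATCH;
  }

  public static void main(String[] args) {
    System.out.println(isNaN(new Summary(true, null, null)));   // field missing
    System.out.println(isNaN(new Summary(true, null, false)));  // field present, no NaN
    System.out.println(isNaN(new Summary(false, 1.0d, true)));  // field present, has NaN
  }
}
```

       The trade-off in this sketch is that manifests without `containsNaN` are never pruned for `isNaN`, which leaves the finer-grained decision to file-level metrics.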