You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2020/12/03 19:44:20 UTC

[GitHub] [iceberg] yyanyy opened a new pull request #1872: Core: add contains_nan to field_summary

yyanyy opened a new pull request #1872:
URL: https://github.com/apache/iceberg/pull/1872


   Will update manifest evaluator to use this new field after either this or #1747 is merged. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #1872: Core: add contains_nan to field_summary

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #1872:
URL: https://github.com/apache/iceberg/pull/1872#issuecomment-774179675


   Thanks @yyanyy! Looks great. I merged it.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1872: Core: add contains_nan to field_summary

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1872:
URL: https://github.com/apache/iceberg/pull/1872#discussion_r569030682



##########
File path: api/src/main/java/org/apache/iceberg/expressions/ManifestEvaluator.java
##########
@@ -144,18 +143,37 @@ public Boolean or(Boolean leftResult, Boolean rightResult) {
     @Override
     public <T> Boolean isNaN(BoundReference<T> ref) {
       int pos = Accessors.toPosition(ref.accessor());
-      // containsNull encodes whether at least one partition value is null, lowerBound is null if
-      // all partition values are null.
-      if (stats.get(pos).containsNull() && stats.get(pos).lowerBound() == null) {
-        return ROWS_CANNOT_MATCH; // all values are null
+
+      if (stats.get(pos).containsNaN() != null && !stats.get(pos).containsNaN()) {
+        return ROWS_CANNOT_MATCH;
+      }
+
+      if (allValuesAreNull(stats.get(pos))) {
+        return ROWS_CANNOT_MATCH;
       }
 
       return ROWS_MIGHT_MATCH;
     }
 
+    private boolean allValuesAreNull(PartitionFieldSummary summary) {
+      // Before introducing containsNaN field, containsNull encodes whether at least one partition value is null,
+      // lowerBound is null if all partition values are null.
+      // After introducing containsNaN field, containsNaN must be false to ensure all values are null since bounds
+      // don't include NaN anymore.
+      return summary.containsNull() && summary.lowerBound() == null &&
+          (summary.containsNaN() == null || !summary.containsNaN());

Review comment:
       One more thing: if we go for the safer option, we only need to do that for float and double columns. All other types can't contain NaN.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a change in pull request #1872: Core: add contains_nan to field_summary

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #1872:
URL: https://github.com/apache/iceberg/pull/1872#discussion_r536288481



##########
File path: core/src/main/java/org/apache/iceberg/GenericPartitionFieldSummary.java
##########
@@ -136,8 +144,10 @@ public Object get(int i) {
       case 0:
         return containsNull;
       case 1:
-        return lowerBound();
+        return containsNaN;

Review comment:
       have we made sure changing the position is backwards compatible?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1872: Core: add contains_nan to field_summary

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1872:
URL: https://github.com/apache/iceberg/pull/1872#discussion_r569026206



##########
File path: api/src/main/java/org/apache/iceberg/ManifestFile.java
##########
@@ -199,6 +200,16 @@ default boolean hasDeletedFiles() {
      */
     boolean containsNull();
 
+    /**
+     * Returns true if at least one data file in the manifest has a nan value for the field.

Review comment:
       Nit: `nan` -> `NaN`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1872: Core: add contains_nan to field_summary

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1872:
URL: https://github.com/apache/iceberg/pull/1872#discussion_r536928399



##########
File path: core/src/main/java/org/apache/iceberg/PartitionSummary.java
##########
@@ -73,14 +75,16 @@ private PartitionFieldStats(Type type) {
     }
 
     public PartitionFieldSummary toSummary() {
-      return new GenericPartitionFieldSummary(containsNull,
+      return new GenericPartitionFieldSummary(containsNull, containsNaN,

Review comment:
       Since this is a `boolean`, maybe we can change the constructor as well.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] yyanyy commented on a change in pull request #1872: Core: add contains_nan to field_summary

Posted by GitBox <gi...@apache.org>.
yyanyy commented on a change in pull request #1872:
URL: https://github.com/apache/iceberg/pull/1872#discussion_r544532186



##########
File path: api/src/main/java/org/apache/iceberg/ManifestFile.java
##########
@@ -199,6 +200,12 @@ default boolean hasDeletedFiles() {
      */
     boolean containsNull();
 
+    /**
+     * Returns true if at least one data file in the manifest has a nan value for the field.
+     * Null if this information doesn't exist.
+     */
+    Boolean containsNaN();

Review comment:
       I think once this change is released, this field will be populated by default, and making it nullable is mostly for backward compatibility; also the unassigned value for this field is null in both the [actual implementation](https://github.com/apache/iceberg/pull/1872/files#diff-5ed34e844eaf59fe89785c86989ba8a467414fa112e5d266a26cd867b795bc8bR45) and [test implementation](https://github.com/apache/iceberg/pull/1872/files#diff-c9732959939317302e8738e19b1d831cef747b77d4c8bfa2fdab883fd087a68aR153), so I think defaulting to null here might not help much? 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] yyanyy commented on a change in pull request #1872: Core: add contains_nan to field_summary

Posted by GitBox <gi...@apache.org>.
yyanyy commented on a change in pull request #1872:
URL: https://github.com/apache/iceberg/pull/1872#discussion_r570609722



##########
File path: api/src/main/java/org/apache/iceberg/expressions/ManifestEvaluator.java
##########
@@ -144,18 +143,37 @@ public Boolean or(Boolean leftResult, Boolean rightResult) {
     @Override
     public <T> Boolean isNaN(BoundReference<T> ref) {
       int pos = Accessors.toPosition(ref.accessor());
-      // containsNull encodes whether at least one partition value is null, lowerBound is null if
-      // all partition values are null.
-      if (stats.get(pos).containsNull() && stats.get(pos).lowerBound() == null) {
-        return ROWS_CANNOT_MATCH; // all values are null
+
+      if (stats.get(pos).containsNaN() != null && !stats.get(pos).containsNaN()) {
+        return ROWS_CANNOT_MATCH;
+      }
+
+      if (allValuesAreNull(stats.get(pos))) {
+        return ROWS_CANNOT_MATCH;
       }
 
       return ROWS_MIGHT_MATCH;
     }
 
+    private boolean allValuesAreNull(PartitionFieldSummary summary) {
+      // Before introducing containsNaN field, containsNull encodes whether at least one partition value is null,
+      // lowerBound is null if all partition values are null.
+      // After introducing containsNaN field, containsNaN must be false to ensure all values are null since bounds
+      // don't include NaN anymore.
+      return summary.containsNull() && summary.lowerBound() == null &&
+          (summary.containsNaN() == null || !summary.containsNaN());

Review comment:
       I think the change for excluding NaN in `lower`/`upper` and adding `containsNaN` both belong to this PR, so if a release contains this change, then it would either be (1) `NaN` is part of `lower`/`upper` and `containsNaN` is missing, or (2) `containsNaN` exists and `lower`/`upper` doesn't store `NaN`. But I guess people may implement their own manifest summary that already exclude `NaN` from bounds but no `containsNaN`, so we still want to handle this, and file level metrics could give more granular information so there isn't necessarily any performance penalty. I have updated this PR to check for existence of `containsNaN`, but please let me know if my understanding isn't correct! 

##########
File path: api/src/main/java/org/apache/iceberg/expressions/ManifestEvaluator.java
##########
@@ -329,5 +338,12 @@ public Boolean or(Boolean leftResult, Boolean rightResult) {
 
       return ROWS_MIGHT_MATCH;
     }
+
+    private boolean allValuesAreNull(PartitionFieldSummary summary) {
+      // containsNull encodes whether at least one partition value is null, lowerBound is null if all partition values
+      // are null; in case bounds don't include NaN value, containsNaN needs to be checked against.
+      return summary.containsNull() && summary.lowerBound() == null &&
+          summary.containsNaN() != null && !summary.containsNaN();

Review comment:
       Yes, sorry I forgot to address this comment... Will update




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1872: Core: add contains_nan to field_summary

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1872:
URL: https://github.com/apache/iceberg/pull/1872#discussion_r570650525



##########
File path: api/src/main/java/org/apache/iceberg/expressions/ManifestEvaluator.java
##########
@@ -144,18 +143,37 @@ public Boolean or(Boolean leftResult, Boolean rightResult) {
     @Override
     public <T> Boolean isNaN(BoundReference<T> ref) {
       int pos = Accessors.toPosition(ref.accessor());
-      // containsNull encodes whether at least one partition value is null, lowerBound is null if
-      // all partition values are null.
-      if (stats.get(pos).containsNull() && stats.get(pos).lowerBound() == null) {
-        return ROWS_CANNOT_MATCH; // all values are null
+
+      if (stats.get(pos).containsNaN() != null && !stats.get(pos).containsNaN()) {
+        return ROWS_CANNOT_MATCH;
+      }
+
+      if (allValuesAreNull(stats.get(pos))) {
+        return ROWS_CANNOT_MATCH;
       }
 
       return ROWS_MIGHT_MATCH;
     }
 
+    private boolean allValuesAreNull(PartitionFieldSummary summary) {
+      // Before introducing containsNaN field, containsNull encodes whether at least one partition value is null,
+      // lowerBound is null if all partition values are null.
+      // After introducing containsNaN field, containsNaN must be false to ensure all values are null since bounds
+      // don't include NaN anymore.
+      return summary.containsNull() && summary.lowerBound() == null &&
+          (summary.containsNaN() == null || !summary.containsNaN());

Review comment:
       Yes, I think your summary is correct. Because this is a format change and we expect/want others to implement the format, we want to account for strange cases like that.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1872: Core: add contains_nan to field_summary

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1872:
URL: https://github.com/apache/iceberg/pull/1872#discussion_r569025028



##########
File path: api/src/main/java/org/apache/iceberg/expressions/ManifestEvaluator.java
##########
@@ -144,18 +143,37 @@ public Boolean or(Boolean leftResult, Boolean rightResult) {
     @Override
     public <T> Boolean isNaN(BoundReference<T> ref) {
       int pos = Accessors.toPosition(ref.accessor());
-      // containsNull encodes whether at least one partition value is null, lowerBound is null if
-      // all partition values are null.
-      if (stats.get(pos).containsNull() && stats.get(pos).lowerBound() == null) {
-        return ROWS_CANNOT_MATCH; // all values are null
+
+      if (stats.get(pos).containsNaN() != null && !stats.get(pos).containsNaN()) {
+        return ROWS_CANNOT_MATCH;
+      }
+
+      if (allValuesAreNull(stats.get(pos))) {
+        return ROWS_CANNOT_MATCH;
       }
 
       return ROWS_MIGHT_MATCH;
     }
 
+    private boolean allValuesAreNull(PartitionFieldSummary summary) {
+      // Before introducing containsNaN field, containsNull encodes whether at least one partition value is null,
+      // lowerBound is null if all partition values are null.
+      // After introducing containsNaN field, containsNaN must be false to ensure all values are null since bounds
+      // don't include NaN anymore.
+      return summary.containsNull() && summary.lowerBound() == null &&
+          (summary.containsNaN() == null || !summary.containsNaN());

Review comment:
       I understand that this is trying to retain the previous behavior, but if `containsNaN` is not known, it seems safer to assume there is a NaN value.
   
   I think there is an argument for doing this: at one point, NaN wasn't handled separately and if `lower` and `upper` were null then there were no non-null values. A `NaN` would cause a bound to be non-null. The problem is that we must accept manifests that were written with the new rule (don't store NaN for `lower`), but that didn't set this. I'd prefer to go with the safer option.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a change in pull request #1872: Core: add contains_nan to field_summary

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #1872:
URL: https://github.com/apache/iceberg/pull/1872#discussion_r556968041



##########
File path: api/src/main/java/org/apache/iceberg/ManifestFile.java
##########
@@ -199,6 +200,16 @@ default boolean hasDeletedFiles() {
      */
     boolean containsNull();
 
+    /**
+     * Returns true if at least one data file in the manifest has a nan value for the field.
+     * Null if this information doesn't exist.
+     * <p>
+     * Default to return null to ensure backward compatibility.
+     */
+    default Boolean containsNaN() {
+      return null;

Review comment:
       I see, make sense, thanks for the explanation.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1872: Core: add contains_nan to field_summary

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1872:
URL: https://github.com/apache/iceberg/pull/1872#discussion_r544727714



##########
File path: api/src/main/java/org/apache/iceberg/ManifestFile.java
##########
@@ -199,6 +200,12 @@ default boolean hasDeletedFiles() {
      */
     boolean containsNull();
 
+    /**
+     * Returns true if at least one data file in the manifest has a nan value for the field.
+     * Null if this information doesn't exist.
+     */
+    Boolean containsNaN();

Review comment:
       While we don't expect other implementations of this interface, it is a fair point that we would accept other implementations in any method that uses `ManifestFile` as an argument type. Adding the default would ensure backward-compatiblity in case anyone has an alternative implementation, and is really low cost. So I'd opt to add it. Thanks @holdenk!




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] yyanyy commented on a change in pull request #1872: Core: add contains_nan to field_summary

Posted by GitBox <gi...@apache.org>.
yyanyy commented on a change in pull request #1872:
URL: https://github.com/apache/iceberg/pull/1872#discussion_r570609722



##########
File path: api/src/main/java/org/apache/iceberg/expressions/ManifestEvaluator.java
##########
@@ -144,18 +143,37 @@ public Boolean or(Boolean leftResult, Boolean rightResult) {
     @Override
     public <T> Boolean isNaN(BoundReference<T> ref) {
       int pos = Accessors.toPosition(ref.accessor());
-      // containsNull encodes whether at least one partition value is null, lowerBound is null if
-      // all partition values are null.
-      if (stats.get(pos).containsNull() && stats.get(pos).lowerBound() == null) {
-        return ROWS_CANNOT_MATCH; // all values are null
+
+      if (stats.get(pos).containsNaN() != null && !stats.get(pos).containsNaN()) {
+        return ROWS_CANNOT_MATCH;
+      }
+
+      if (allValuesAreNull(stats.get(pos))) {
+        return ROWS_CANNOT_MATCH;
       }
 
       return ROWS_MIGHT_MATCH;
     }
 
+    private boolean allValuesAreNull(PartitionFieldSummary summary) {
+      // Before introducing containsNaN field, containsNull encodes whether at least one partition value is null,
+      // lowerBound is null if all partition values are null.
+      // After introducing containsNaN field, containsNaN must be false to ensure all values are null since bounds
+      // don't include NaN anymore.
+      return summary.containsNull() && summary.lowerBound() == null &&
+          (summary.containsNaN() == null || !summary.containsNaN());

Review comment:
       I think the change for excluding NaN in `lower`/`upper` and adding `containsNaN` both belong to this PR, so if a release contains this change, then it would either be (1) `NaN` is part of `lower`/`upper` and `containsNaN` is missing, or (2) `containsNaN` exists and `lower`/`upper` doesn't store `NaN`. But I guess people may implement their own manifest summary that already exclude `NaN` from bounds but no `containsNaN`, so we still want to handle this, and file level metrics could give more granular information so there isn't necessarily any performance penalty. I have updated this PR to check for existence of `containsNaN`, but please let me know if my understanding isn't correct! 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] yyanyy commented on a change in pull request #1872: Core: add contains_nan to field_summary

Posted by GitBox <gi...@apache.org>.
yyanyy commented on a change in pull request #1872:
URL: https://github.com/apache/iceberg/pull/1872#discussion_r570693495



##########
File path: api/src/main/java/org/apache/iceberg/expressions/ManifestEvaluator.java
##########
@@ -329,5 +338,12 @@ public Boolean or(Boolean leftResult, Boolean rightResult) {
 
       return ROWS_MIGHT_MATCH;
     }
+
+    private boolean allValuesAreNull(PartitionFieldSummary summary) {
+      // containsNull encodes whether at least one partition value is null, lowerBound is null if all partition values
+      // are null; in case bounds don't include NaN value, containsNaN needs to be checked against.
+      return summary.containsNull() && summary.lowerBound() == null &&
+          summary.containsNaN() != null && !summary.containsNaN();

Review comment:
       Yes, sorry I forgot to address this comment... Will update




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a change in pull request #1872: Core: add contains_nan to field_summary

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #1872:
URL: https://github.com/apache/iceberg/pull/1872#discussion_r555364988



##########
File path: api/src/main/java/org/apache/iceberg/ManifestFile.java
##########
@@ -199,6 +200,16 @@ default boolean hasDeletedFiles() {
      */
     boolean containsNull();
 
+    /**
+     * Returns true if at least one data file in the manifest has a nan value for the field.
+     * Null if this information doesn't exist.
+     * <p>
+     * Default to return null to ensure backward compatibility.
+     */
+    default Boolean containsNaN() {
+      return null;

Review comment:
       Just going back to the discussion of default to make sure my understanding is correct, if the returned value is null, does it mean the manifest file would suggest the data file does not contain NaN?
   
   This behavior seems consistent with the default `false` in `PartitionSummary.java`, but logically speaking it seems like returning true as default is more reasonable to suggest a file always might contain NaN. What is the take here?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] yyanyy commented on a change in pull request #1872: Core: add contains_nan to field_summary

Posted by GitBox <gi...@apache.org>.
yyanyy commented on a change in pull request #1872:
URL: https://github.com/apache/iceberg/pull/1872#discussion_r546089319



##########
File path: api/src/main/java/org/apache/iceberg/ManifestFile.java
##########
@@ -203,8 +203,12 @@ default boolean hasDeletedFiles() {
     /**
      * Returns true if at least one data file in the manifest has a nan value for the field.
      * Null if this information doesn't exist.
+     * <p>

Review comment:
       According to [this](https://www.oracle.com/technical-resources/articles/java/javadoc-tool.html#format) I think `<p>` itself might be enough?
   
   > If you have more than one paragraph in the doc comment, separate the paragraphs with a \<p\> paragraph tag, as shown




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] holdenk commented on a change in pull request #1872: Core: add contains_nan to field_summary

Posted by GitBox <gi...@apache.org>.
holdenk commented on a change in pull request #1872:
URL: https://github.com/apache/iceberg/pull/1872#discussion_r546038809



##########
File path: api/src/main/java/org/apache/iceberg/ManifestFile.java
##########
@@ -203,8 +203,12 @@ default boolean hasDeletedFiles() {
     /**
      * Returns true if at least one data file in the manifest has a nan value for the field.
      * Null if this information doesn't exist.
+     * <p>

Review comment:
       Nit: do should we add a closing </p> in the docstring?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1872: Core: add contains_nan to field_summary

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1872:
URL: https://github.com/apache/iceberg/pull/1872#discussion_r536928238



##########
File path: core/src/main/java/org/apache/iceberg/GenericPartitionFieldSummary.java
##########
@@ -136,8 +144,10 @@ public Object get(int i) {
       case 0:
         return containsNull;
       case 1:
-        return lowerBound();
+        return containsNaN;

Review comment:
       The position change is compatible. All Iceberg metadata files conform to the Iceberg data file spec, so we can reorder columns if we choose.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1872: Core: add contains_nan to field_summary

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1872:
URL: https://github.com/apache/iceberg/pull/1872#discussion_r536928127



##########
File path: core/src/main/java/org/apache/iceberg/GenericPartitionFieldSummary.java
##########
@@ -72,10 +73,11 @@ public GenericPartitionFieldSummary(Schema avroSchema) {
     }
   }
 
-  public GenericPartitionFieldSummary(boolean containsNull, ByteBuffer lowerBound,
+  public GenericPartitionFieldSummary(boolean containsNull, Boolean containsNaN, ByteBuffer lowerBound,

Review comment:
       Why use `Boolean` instead of the primitive here? Is there a place that constructs these without knowing whether there are `NaN` values?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1872: Core: add contains_nan to field_summary

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1872:
URL: https://github.com/apache/iceberg/pull/1872#discussion_r569033258



##########
File path: core/src/main/java/org/apache/iceberg/GenericPartitionFieldSummary.java
##########
@@ -72,7 +74,19 @@ public GenericPartitionFieldSummary(Schema avroSchema) {
     }
   }
 
-  public GenericPartitionFieldSummary(boolean containsNull, ByteBuffer lowerBound,
+  public GenericPartitionFieldSummary(boolean containsNull, boolean containsNaN, ByteBuffer lowerBound,
+                               ByteBuffer upperBound) {
+    this.avroSchema = AVRO_SCHEMA;
+    this.containsNull = containsNull;
+    this.containsNaN = containsNaN;
+    this.lowerBound = ByteBuffers.toByteArray(lowerBound);
+    this.upperBound = ByteBuffers.toByteArray(upperBound);
+    this.fromProjectionPos = null;
+  }
+
+  // for testing backward compatibility only
+  @VisibleForTesting
+  GenericPartitionFieldSummary(boolean containsNull, ByteBuffer lowerBound,
                                       ByteBuffer upperBound) {

Review comment:
       Nit: indentation is off.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1872: Core: add contains_nan to field_summary

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1872:
URL: https://github.com/apache/iceberg/pull/1872#discussion_r570650525



##########
File path: api/src/main/java/org/apache/iceberg/expressions/ManifestEvaluator.java
##########
@@ -144,18 +143,37 @@ public Boolean or(Boolean leftResult, Boolean rightResult) {
     @Override
     public <T> Boolean isNaN(BoundReference<T> ref) {
       int pos = Accessors.toPosition(ref.accessor());
-      // containsNull encodes whether at least one partition value is null, lowerBound is null if
-      // all partition values are null.
-      if (stats.get(pos).containsNull() && stats.get(pos).lowerBound() == null) {
-        return ROWS_CANNOT_MATCH; // all values are null
+
+      if (stats.get(pos).containsNaN() != null && !stats.get(pos).containsNaN()) {
+        return ROWS_CANNOT_MATCH;
+      }
+
+      if (allValuesAreNull(stats.get(pos))) {
+        return ROWS_CANNOT_MATCH;
       }
 
       return ROWS_MIGHT_MATCH;
     }
 
+    private boolean allValuesAreNull(PartitionFieldSummary summary) {
+      // Before introducing containsNaN field, containsNull encodes whether at least one partition value is null,
+      // lowerBound is null if all partition values are null.
+      // After introducing containsNaN field, containsNaN must be false to ensure all values are null since bounds
+      // don't include NaN anymore.
+      return summary.containsNull() && summary.lowerBound() == null &&
+          (summary.containsNaN() == null || !summary.containsNaN());

Review comment:
       Yes, I think your summary is correct. Because this is a format change and we expect/want others to implement the format, we want to account for strange cases like that.

##########
File path: api/src/main/java/org/apache/iceberg/expressions/ManifestEvaluator.java
##########
@@ -329,5 +338,12 @@ public Boolean or(Boolean leftResult, Boolean rightResult) {
 
       return ROWS_MIGHT_MATCH;
     }
+
+    private boolean allValuesAreNull(PartitionFieldSummary summary) {
+      // containsNull encodes whether at least one partition value is null, lowerBound is null if all partition values
+      // are null; in case bounds don't include NaN value, containsNaN needs to be checked against.
+      return summary.containsNull() && summary.lowerBound() == null &&
+          summary.containsNaN() != null && !summary.containsNaN();

Review comment:
       `containsNaN` could only be true if the type is `float` or `double`. Should this check the column type before checking `containsNaN`? Then we would be able to ignore it for most cases. For example, the test below uses a string column but has no `containsNaN` values, so it doesn't catch that all values are null.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #1872: Core: add contains_nan to field_summary

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #1872:
URL: https://github.com/apache/iceberg/pull/1872#issuecomment-772121417


   @yyanyy, looks mostly good. I think we should be a little bit more careful in the evaluator, but otherwise there's nothing other than minor changes needed. Thanks!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] holdenk commented on a change in pull request #1872: Core: add contains_nan to field_summary

Posted by GitBox <gi...@apache.org>.
holdenk commented on a change in pull request #1872:
URL: https://github.com/apache/iceberg/pull/1872#discussion_r544511435



##########
File path: api/src/main/java/org/apache/iceberg/ManifestFile.java
##########
@@ -199,6 +200,12 @@ default boolean hasDeletedFiles() {
      */
     boolean containsNull();
 
+    /**
+     * Returns true if at least one data file in the manifest has a nan value for the field.
+     * Null if this information doesn't exist.
+     */
+    Boolean containsNaN();

Review comment:
       Does it make sense for this to have a default of null?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] yyanyy commented on a change in pull request #1872: Core: add contains_nan to field_summary

Posted by GitBox <gi...@apache.org>.
yyanyy commented on a change in pull request #1872:
URL: https://github.com/apache/iceberg/pull/1872#discussion_r537796844



##########
File path: core/src/main/java/org/apache/iceberg/GenericPartitionFieldSummary.java
##########
@@ -136,8 +144,10 @@ public Object get(int i) {
       case 0:
         return containsNull;
       case 1:
-        return lowerBound();
+        return containsNaN;

Review comment:
       Ah sorry I misread the original question... Thanks Ryan for help with answering it! 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #1872: Core: add contains_nan to field_summary

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #1872:
URL: https://github.com/apache/iceberg/pull/1872#issuecomment-772121417


   @yyanyy, looks mostly good. I think we should be a little bit more careful in the evaluator, but otherwise there's nothing other than minor changes needed. Thanks!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] yyanyy commented on a change in pull request #1872: Core: add contains_nan to field_summary

Posted by GitBox <gi...@apache.org>.
yyanyy commented on a change in pull request #1872:
URL: https://github.com/apache/iceberg/pull/1872#discussion_r555463456



##########
File path: api/src/main/java/org/apache/iceberg/ManifestFile.java
##########
@@ -199,6 +200,16 @@ default boolean hasDeletedFiles() {
      */
     boolean containsNull();
 
+    /**
+     * Returns true if at least one data file in the manifest has a nan value for the field.
+     * Null if this information doesn't exist.
+     * <p>
+     * Default to return null to ensure backward compatibility.
+     */
+    default Boolean containsNaN() {
+      return null;

Review comment:
       I think `PartitionSummary` is populated when trying to write a file, so if the code contains this change, it will populate and write NaN boolean correctly. I guess you might be thinking about `GenericPartitionFieldSummary` when you mention returning true as default since that's the class to be constructed when reading from avro files that may not contain this info?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1872: Core: add contains_nan to field_summary

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1872:
URL: https://github.com/apache/iceberg/pull/1872#discussion_r569025028



##########
File path: api/src/main/java/org/apache/iceberg/expressions/ManifestEvaluator.java
##########
@@ -144,18 +143,37 @@ public Boolean or(Boolean leftResult, Boolean rightResult) {
     @Override
     public <T> Boolean isNaN(BoundReference<T> ref) {
       int pos = Accessors.toPosition(ref.accessor());
-      // containsNull encodes whether at least one partition value is null, lowerBound is null if
-      // all partition values are null.
-      if (stats.get(pos).containsNull() && stats.get(pos).lowerBound() == null) {
-        return ROWS_CANNOT_MATCH; // all values are null
+
+      if (stats.get(pos).containsNaN() != null && !stats.get(pos).containsNaN()) {
+        return ROWS_CANNOT_MATCH;
+      }
+
+      if (allValuesAreNull(stats.get(pos))) {
+        return ROWS_CANNOT_MATCH;
       }
 
       return ROWS_MIGHT_MATCH;
     }
 
+    private boolean allValuesAreNull(PartitionFieldSummary summary) {
+      // Before introducing containsNaN field, containsNull encodes whether at least one partition value is null,
+      // lowerBound is null if all partition values are null.
+      // After introducing containsNaN field, containsNaN must be false to ensure all values are null since bounds
+      // don't include NaN anymore.
+      return summary.containsNull() && summary.lowerBound() == null &&
+          (summary.containsNaN() == null || !summary.containsNaN());

Review comment:
       I understand that this is trying to retain the previous behavior, but if `containsNaN` is not known, it seems safer to assume there is a NaN value.
   
   I think there is an argument for doing this: at one point, NaN wasn't handled separately and if `lower` and `upper` were null then there were no non-null values. A `NaN` would cause a bound to be non-null. The problem is that we must accept manifests that were written with the new rule (don't store NaN for `lower`), but that didn't set this. I'd prefer to go with the safer option.

##########
File path: api/src/main/java/org/apache/iceberg/expressions/ManifestEvaluator.java
##########
@@ -144,18 +143,37 @@ public Boolean or(Boolean leftResult, Boolean rightResult) {
     @Override
     public <T> Boolean isNaN(BoundReference<T> ref) {
       int pos = Accessors.toPosition(ref.accessor());
-      // containsNull encodes whether at least one partition value is null, lowerBound is null if
-      // all partition values are null.
-      if (stats.get(pos).containsNull() && stats.get(pos).lowerBound() == null) {
-        return ROWS_CANNOT_MATCH; // all values are null
+
+      if (stats.get(pos).containsNaN() != null && !stats.get(pos).containsNaN()) {
+        return ROWS_CANNOT_MATCH;
+      }
+
+      if (allValuesAreNull(stats.get(pos))) {
+        return ROWS_CANNOT_MATCH;
       }
 
       return ROWS_MIGHT_MATCH;
     }
 
+    private boolean allValuesAreNull(PartitionFieldSummary summary) {

Review comment:
       Could you move this private method to the bottom of the class?

##########
File path: api/src/main/java/org/apache/iceberg/ManifestFile.java
##########
@@ -199,6 +200,16 @@ default boolean hasDeletedFiles() {
      */
     boolean containsNull();
 
+    /**
+     * Returns true if at least one data file in the manifest has a nan value for the field.

Review comment:
       Nit: `nan` -> `NaN`

##########
File path: core/src/main/java/org/apache/iceberg/GenericPartitionFieldSummary.java
##########
@@ -72,7 +74,19 @@ public GenericPartitionFieldSummary(Schema avroSchema) {
     }
   }
 
-  public GenericPartitionFieldSummary(boolean containsNull, ByteBuffer lowerBound,
+  public GenericPartitionFieldSummary(boolean containsNull, boolean containsNaN, ByteBuffer lowerBound,
+                               ByteBuffer upperBound) {

Review comment:
       Nit: indentation is off.

##########
File path: api/src/main/java/org/apache/iceberg/expressions/ManifestEvaluator.java
##########
@@ -144,18 +143,37 @@ public Boolean or(Boolean leftResult, Boolean rightResult) {
     @Override
     public <T> Boolean isNaN(BoundReference<T> ref) {
       int pos = Accessors.toPosition(ref.accessor());
-      // containsNull encodes whether at least one partition value is null, lowerBound is null if
-      // all partition values are null.
-      if (stats.get(pos).containsNull() && stats.get(pos).lowerBound() == null) {
-        return ROWS_CANNOT_MATCH; // all values are null
+
+      if (stats.get(pos).containsNaN() != null && !stats.get(pos).containsNaN()) {
+        return ROWS_CANNOT_MATCH;
+      }
+
+      if (allValuesAreNull(stats.get(pos))) {
+        return ROWS_CANNOT_MATCH;
       }
 
       return ROWS_MIGHT_MATCH;
     }
 
+    private boolean allValuesAreNull(PartitionFieldSummary summary) {
+      // Before introducing containsNaN field, containsNull encodes whether at least one partition value is null,
+      // lowerBound is null if all partition values are null.
+      // After introducing containsNaN field, containsNaN must be false to ensure all values are null since bounds
+      // don't include NaN anymore.
+      return summary.containsNull() && summary.lowerBound() == null &&
+          (summary.containsNaN() == null || !summary.containsNaN());

Review comment:
       One more thing: if we go for the safer option, we only need to do that for float and double columns. All other types can't contain NaN.

##########
File path: core/src/main/java/org/apache/iceberg/GenericPartitionFieldSummary.java
##########
@@ -72,7 +74,19 @@ public GenericPartitionFieldSummary(Schema avroSchema) {
     }
   }
 
-  public GenericPartitionFieldSummary(boolean containsNull, ByteBuffer lowerBound,
+  public GenericPartitionFieldSummary(boolean containsNull, boolean containsNaN, ByteBuffer lowerBound,
+                               ByteBuffer upperBound) {
+    this.avroSchema = AVRO_SCHEMA;
+    this.containsNull = containsNull;
+    this.containsNaN = containsNaN;
+    this.lowerBound = ByteBuffers.toByteArray(lowerBound);
+    this.upperBound = ByteBuffers.toByteArray(upperBound);
+    this.fromProjectionPos = null;
+  }
+
+  // for testing backward compatibility only
+  @VisibleForTesting
+  GenericPartitionFieldSummary(boolean containsNull, ByteBuffer lowerBound,
                                       ByteBuffer upperBound) {

Review comment:
       Nit: indentation is off.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #1872: Core: add contains_nan to field_summary

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #1872:
URL: https://github.com/apache/iceberg/pull/1872#issuecomment-739440621


   This looks about ready to me. Please update now that #1747 is in and we'll get it merged. Thank you!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue merged pull request #1872: Core: add contains_nan to field_summary

Posted by GitBox <gi...@apache.org>.
rdblue merged pull request #1872:
URL: https://github.com/apache/iceberg/pull/1872


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] yyanyy commented on a change in pull request #1872: Core: add contains_nan to field_summary

Posted by GitBox <gi...@apache.org>.
yyanyy commented on a change in pull request #1872:
URL: https://github.com/apache/iceberg/pull/1872#discussion_r536309052



##########
File path: core/src/main/java/org/apache/iceberg/GenericPartitionFieldSummary.java
##########
@@ -136,8 +144,10 @@ public Object get(int i) {
       case 0:
         return containsNull;
       case 1:
-        return lowerBound();
+        return containsNaN;

Review comment:
       This is a good point, I forgot that primitive boolean will be default to false which will impact correctness when we use this in evaluator. I'll make this nullable. Thanks for pointing this out! 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1872: Core: add contains_nan to field_summary

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1872:
URL: https://github.com/apache/iceberg/pull/1872#discussion_r569025693



##########
File path: api/src/main/java/org/apache/iceberg/expressions/ManifestEvaluator.java
##########
@@ -144,18 +143,37 @@ public Boolean or(Boolean leftResult, Boolean rightResult) {
     @Override
     public <T> Boolean isNaN(BoundReference<T> ref) {
       int pos = Accessors.toPosition(ref.accessor());
-      // containsNull encodes whether at least one partition value is null, lowerBound is null if
-      // all partition values are null.
-      if (stats.get(pos).containsNull() && stats.get(pos).lowerBound() == null) {
-        return ROWS_CANNOT_MATCH; // all values are null
+
+      if (stats.get(pos).containsNaN() != null && !stats.get(pos).containsNaN()) {
+        return ROWS_CANNOT_MATCH;
+      }
+
+      if (allValuesAreNull(stats.get(pos))) {
+        return ROWS_CANNOT_MATCH;
       }
 
       return ROWS_MIGHT_MATCH;
     }
 
+    private boolean allValuesAreNull(PartitionFieldSummary summary) {

Review comment:
       Could you move this private method to the bottom of the class?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1872: Core: add contains_nan to field_summary

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1872:
URL: https://github.com/apache/iceberg/pull/1872#discussion_r569028253



##########
File path: core/src/main/java/org/apache/iceberg/GenericPartitionFieldSummary.java
##########
@@ -72,7 +74,19 @@ public GenericPartitionFieldSummary(Schema avroSchema) {
     }
   }
 
-  public GenericPartitionFieldSummary(boolean containsNull, ByteBuffer lowerBound,
+  public GenericPartitionFieldSummary(boolean containsNull, boolean containsNaN, ByteBuffer lowerBound,
+                               ByteBuffer upperBound) {

Review comment:
       Nit: indentation is off.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] yyanyy commented on a change in pull request #1872: Core: add contains_nan to field_summary

Posted by GitBox <gi...@apache.org>.
yyanyy commented on a change in pull request #1872:
URL: https://github.com/apache/iceberg/pull/1872#discussion_r537796897



##########
File path: core/src/main/java/org/apache/iceberg/GenericPartitionFieldSummary.java
##########
@@ -72,10 +73,11 @@ public GenericPartitionFieldSummary(Schema avroSchema) {
     }
   }
 
-  public GenericPartitionFieldSummary(boolean containsNull, ByteBuffer lowerBound,
+  public GenericPartitionFieldSummary(boolean containsNull, Boolean containsNaN, ByteBuffer lowerBound,

Review comment:
       This is for a [test case](https://github.com/apache/iceberg/pull/1872/files#diff-366ef0b17670bc9547fe6d5b7b71a3d95c7f17eb97afa33c03dcdaeab27deb33R221) that provide null here to test backward compatibility. I guess I'll mark this constructor as `@VisibleForTesting` and create another public one that accepts `boolean`? 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1872: Core: add contains_nan to field_summary

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1872:
URL: https://github.com/apache/iceberg/pull/1872#discussion_r570652423



##########
File path: api/src/main/java/org/apache/iceberg/expressions/ManifestEvaluator.java
##########
@@ -329,5 +338,12 @@ public Boolean or(Boolean leftResult, Boolean rightResult) {
 
       return ROWS_MIGHT_MATCH;
     }
+
+    private boolean allValuesAreNull(PartitionFieldSummary summary) {
+      // containsNull encodes whether at least one partition value is null, lowerBound is null if all partition values
+      // are null; in case bounds don't include NaN value, containsNaN needs to be checked against.
+      return summary.containsNull() && summary.lowerBound() == null &&
+          summary.containsNaN() != null && !summary.containsNaN();

Review comment:
       `containsNaN` could only be true if the type is `float` or `double`. Should this check the column type before checking `containsNaN`? Then we would be able to ignore it for most cases. For example, the test below uses a string column but has no `containsNaN` values, so it doesn't catch that all values are null.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org