You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/01/12 00:54:41 UTC

[GitHub] [iceberg] yyanyy opened a new pull request #2069: API: handle NaN as min/max stats in evaluators

yyanyy opened a new pull request #2069:
URL: https://github.com/apache/iceberg/pull/2069


   This PR is based on the behavior we observed in [here](https://github.com/apache/iceberg/blob/master/core/src/test/java/org/apache/iceberg/TestMetrics.java#L362-L417):
   - NaN is always populated as max when present in Parquet, and both max and min for ORC if the first row of the file is NaN
   - when lower bound is NaN, upper bound will also always be NaN
   - in Java, NaN is considered to be greater than all numbers in comparison
   
   This fixes #1761. 
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #2069: API: handle NaN as min/max stats in evaluators

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #2069:
URL: https://github.com/apache/iceberg/pull/2069#issuecomment-761687854


   @yyanyy, the tests look good other than one case that I mentioned. I also commented on the implementation thread, but overall I think this is correct. Nice work!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on pull request #2069: API: handle NaN as min/max stats in evaluators

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on pull request #2069:
URL: https://github.com/apache/iceberg/pull/2069#issuecomment-761662986


   > not all languages treat NaN like Java. For instance, in C++, there are roughly 2^52 different NaN values, and the totalOrder predicate sorts negative NaNs before negative infinity.
   
   Yeah that's why I was asking about the spec in #2055. I think What you describe is consistent with the current plan of -NaN < -Infinity < -value < -0 < 0 < value < Infinity < NaN.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue merged pull request #2069: API: handle NaN as min/max stats in evaluators

Posted by GitBox <gi...@apache.org>.
rdblue merged pull request #2069:
URL: https://github.com/apache/iceberg/pull/2069


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] yyanyy commented on pull request #2069: API: handle NaN as min/max stats in evaluators

Posted by GitBox <gi...@apache.org>.
yyanyy commented on pull request #2069:
URL: https://github.com/apache/iceberg/pull/2069#issuecomment-761213005


   > I think in Java NaN is neither greater then, nor equal to, nor less than any numbers. That’s the IEEE754 semantics, at least.
   
   Thanks for the interests in this PR! I think in Java, in comparisons NaN is handled as greater than all floating point numbers including infinity, but you are right NaN shouldn't normally be comparable. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2069: API: handle NaN as min/max stats in evaluators

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2069:
URL: https://github.com/apache/iceberg/pull/2069#discussion_r559044025



##########
File path: api/src/main/java/org/apache/iceberg/expressions/InclusiveMetricsEvaluator.java
##########
@@ -204,15 +210,20 @@ public Boolean or(Boolean leftResult, Boolean rightResult) {
     public <T> Boolean ltEq(BoundReference<T> ref, Literal<T> lit) {
       Integer id = ref.fieldId();
 
-      if (containsNullsOnly(id)) {
+      if (containsNullsOnly(id) || containsNaNsOnly(id)) {
         return ROWS_CANNOT_MATCH;
       }
 
       if (lowerBounds != null && lowerBounds.containsKey(id)) {
         T lower = Conversions.fromByteBuffer(ref.type(), lowerBounds.get(id));
 
         int cmp = lit.comparator().compare(lower, lit.value());
-        if (cmp > 0) {
+
+        // Due to the comparison implementation of ORC stats, for float/double columns in ORC files,

Review comment:
       I don't that there is a need for an extra method that has just one methdo call. I'd probably do it like this:
   
   ```java
           T lower = Conversions.fromByteBuffer(ref.type(), lowerBounds.get(id));
           if (NaNUtil.isNaN(lower)) {
             // NaN indicates unreliable bounds. See the InclusiveMetricsEvaluator docs for more.
             return ROWS_MIGHT_MATCH;
           }
   
           int cmp = lit.comparator().compare(lower, lit.value());
           if (cmp > 0) {
             return ROWS_CANNOT_MATCH;
           }
   ```
   
   The docs would go in the javadoc for the whole class, and each NaN check could simply refer back to it. I also moved the NaN check above the comparison to keep the logic simple: if the value is NaN, the bound is invalid.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a change in pull request #2069: API: handle NaN as min/max stats in evaluators

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #2069:
URL: https://github.com/apache/iceberg/pull/2069#discussion_r555454950



##########
File path: api/src/main/java/org/apache/iceberg/expressions/InclusiveMetricsEvaluator.java
##########
@@ -204,15 +210,20 @@ public Boolean or(Boolean leftResult, Boolean rightResult) {
     public <T> Boolean ltEq(BoundReference<T> ref, Literal<T> lit) {
       Integer id = ref.fieldId();
 
-      if (containsNullsOnly(id)) {
+      if (containsNullsOnly(id) || containsNaNsOnly(id)) {
         return ROWS_CANNOT_MATCH;
       }
 
       if (lowerBounds != null && lowerBounds.containsKey(id)) {
         T lower = Conversions.fromByteBuffer(ref.type(), lowerBounds.get(id));
 
         int cmp = lit.comparator().compare(lower, lit.value());
-        if (cmp > 0) {
+
+        // Due to the comparison implementation of ORC stats, for float/double columns in ORC files,

Review comment:
       Thanks for noticing this. Instead of having this paragraph everywhere, can we abstract this to a method like `checkNaNLowerBound`, and make this the documentation of that method?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick edited a comment on pull request #2069: API: handle NaN as min/max stats in evaluators

Posted by GitBox <gi...@apache.org>.
kbendick edited a comment on pull request #2069:
URL: https://github.com/apache/iceberg/pull/2069#issuecomment-761919569


   For ORC metrics, which currently don't populate `nanValueCounts` in the metrics and for which it seems difficult to populate `nanValueCounts`, will this still work when we create `Metrics` using the non-deprecated constructors that require passing `nanValueCounts`? This is in reference to this issue for removing the `Metrics` constructors that don't take in `nanValueCounts` maps: https://github.com/apache/iceberg/issues/2044
   
   I suppose when we go to deprecate the old `Metrics` constructors, we can simply pass `null` into the `Metrics` constructor for `nanValueCounts` with ORC in order to keep the current behavior. Or update the various evaluator code paths to ensure that a missing key from this map does not indicate anything special if we choose to pass in an empty map.
   
   This is all based off of my understanding of ORC's `ColumnStatistics` and the notes in this PR about `upperBound ` and `lowerBound` being `NaN` in ORC if _any_ value in the file is `NaN`, so please correct me if I'm mistaken and somebody sees an easy way to get NaN stats from an ORC file. To me, I fail to see how we can get the count of NaN values in the following code block (though as I mentioned we can always continue to pass in `null` to match the existing behavior, provided the various metrics evaluators are aware that `null` for that Map entry for nullValueCounts doesnt convey much meaning for ORC).
   
   The relevant code for populating ORC metrics is here: https://github.com/apache/iceberg/blob/29915e3522f9a808420a2e16fb1c2cf3ae165a74/orc/src/main/java/org/apache/iceberg/orc/OrcMetrics.java#L100-L172


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a change in pull request #2069: API: handle NaN as min/max stats in evaluators

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #2069:
URL: https://github.com/apache/iceberg/pull/2069#discussion_r556967311



##########
File path: api/src/main/java/org/apache/iceberg/expressions/InclusiveMetricsEvaluator.java
##########
@@ -204,15 +210,20 @@ public Boolean or(Boolean leftResult, Boolean rightResult) {
     public <T> Boolean ltEq(BoundReference<T> ref, Literal<T> lit) {
       Integer id = ref.fieldId();
 
-      if (containsNullsOnly(id)) {
+      if (containsNullsOnly(id) || containsNaNsOnly(id)) {
         return ROWS_CANNOT_MATCH;
       }
 
       if (lowerBounds != null && lowerBounds.containsKey(id)) {
         T lower = Conversions.fromByteBuffer(ref.type(), lowerBounds.get(id));
 
         int cmp = lit.comparator().compare(lower, lit.value());
-        if (cmp > 0) {
+
+        // Due to the comparison implementation of ORC stats, for float/double columns in ORC files,

Review comment:
       Sorry I did not see the comment, I think it is valuable enough to abstract out the method given the amount of comments, and have something like:
   
   ```
   /*
    * all the comments...
    */
   private boolean isLowerBoundNaN() {
     return NaNUtil.isNaN(lower);
   }
   
   ```
   
   I am not good at naming, you can probably come up with a better name...




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jbapple commented on pull request #2069: API: handle NaN as min/max stats in evaluators

Posted by GitBox <gi...@apache.org>.
jbapple commented on pull request #2069:
URL: https://github.com/apache/iceberg/pull/2069#issuecomment-761231825


   Odd, maybe we're trying different things. What does this print for you?
   
   ```
   public class Comp {
     public static void main(String[] args) {
        System.out.println(Double.NaN > 0.0);
     }
   }
   ```
   
   I'm on OpenJDK 1.8, and this prints `false` for me.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on pull request #2069: API: handle NaN as min/max stats in evaluators

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on pull request #2069:
URL: https://github.com/apache/iceberg/pull/2069#issuecomment-761631296


   @jbapple @yyanyy this blog has a good summary: https://blog.csdn.net/jierui001/article/details/3278382


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2069: API: handle NaN as min/max stats in evaluators

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2069:
URL: https://github.com/apache/iceberg/pull/2069#discussion_r559044908



##########
File path: api/src/test/java/org/apache/iceberg/expressions/TestMetricsEvaluatorsNaNHandling.java
##########
@@ -0,0 +1,405 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.expressions;
+
+import java.nio.ByteBuffer;
+import java.util.Set;
+import java.util.function.BiFunction;
+import org.apache.iceberg.DataFile;
+import org.apache.iceberg.Schema;
+import org.apache.iceberg.TestHelpers;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
+import org.apache.iceberg.types.Types;
+import org.junit.Assert;
+import org.junit.Test;
+
+import static org.apache.iceberg.types.Conversions.toByteBuffer;
+import static org.apache.iceberg.types.Types.NestedField.optional;
+import static org.apache.iceberg.types.Types.NestedField.required;
+
+/**
+ * This test class ensures that metrics evaluators could handle NaN as upper/lower bounds correctly.
+ */
+public class TestMetricsEvaluatorsNaNHandling {
+  private static final Schema SCHEMA = new Schema(
+      required(1, "all_nan", Types.DoubleType.get()),
+      required(2, "max_nan", Types.DoubleType.get()),
+      optional(3, "min_max_nan", Types.FloatType.get()),
+      required(4, "all_nan_null_bounds", Types.DoubleType.get()),
+      optional(5, "some_nan_correct_bounds", Types.FloatType.get())
+  );
+
+  private static final DataFile FILE = new TestHelpers.TestDataFile("file.avro", TestHelpers.Row.of(), 50,
+      // any value counts, including nulls
+      ImmutableMap.<Integer, Long>builder()
+          .put(1, 10L)
+          .put(2, 10L)
+          .put(3, 10L)
+          .put(4, 10L)
+          .put(5, 10L)
+          .build(),
+      // null value counts
+      ImmutableMap.<Integer, Long>builder()
+          .put(1, 0L)
+          .put(2, 0L)
+          .put(3, 0L)
+          .put(4, 0L)
+          .put(5, 0L)
+          .build(),
+      // nan value counts
+      ImmutableMap.<Integer, Long>builder()
+          .put(1, 10L)
+          .put(4, 10L)
+          .put(5, 5L)
+          .build(),
+      // lower bounds
+      ImmutableMap.<Integer, ByteBuffer>builder()
+          .put(1, toByteBuffer(Types.DoubleType.get(), Double.NaN))
+          .put(2, toByteBuffer(Types.DoubleType.get(), 7D))
+          .put(3, toByteBuffer(Types.FloatType.get(), Float.NaN))
+          .put(5, toByteBuffer(Types.FloatType.get(), 7F))
+          .build(),
+      // upper bounds
+      ImmutableMap.<Integer, ByteBuffer>builder()
+          .put(1, toByteBuffer(Types.DoubleType.get(), Double.NaN))
+          .put(2, toByteBuffer(Types.DoubleType.get(), Double.NaN))
+          .put(3, toByteBuffer(Types.FloatType.get(), Float.NaN))
+          .put(5, toByteBuffer(Types.FloatType.get(), 22F))
+          .build());
+
+  private static final Set<BiFunction<String, Number, Expression>> LESS_THAN_EXPRESSIONS =
+      ImmutableSet.of(Expressions::lessThan, Expressions::lessThanOrEqual);
+
+  private static final Set<BiFunction<String, Number, Expression>> GREATER_THAN_EXPRESSIONS =
+      ImmutableSet.of(Expressions::greaterThan, Expressions::greaterThanOrEqual);
+
+  @Test
+  public void testInclusiveMetricsEvaluatorLessThanAndLessThanOrEqual() {
+    for (BiFunction<String, Number, Expression> func : LESS_THAN_EXPRESSIONS) {
+      boolean shouldRead = new InclusiveMetricsEvaluator(SCHEMA, func.apply("all_nan", 1D)).eval(FILE);
+      Assert.assertFalse("Should not match: all nan column doesn't contain number", shouldRead);
+
+      shouldRead = new InclusiveMetricsEvaluator(SCHEMA, func.apply("max_nan", 1D)).eval(FILE);
+      Assert.assertFalse("Should not match: 1 is smaller than lower bound", shouldRead);
+
+      shouldRead = new InclusiveMetricsEvaluator(SCHEMA, func.apply("max_nan", 10D)).eval(FILE);
+      Assert.assertTrue("Should match: 10 is larger than lower bound", shouldRead);
+
+      shouldRead = new InclusiveMetricsEvaluator(SCHEMA, func.apply("min_max_nan", 1F)).eval(FILE);
+      Assert.assertTrue("Should match: no visibility", shouldRead);
+
+      shouldRead = new InclusiveMetricsEvaluator(SCHEMA, func.apply("all_nan_null_bounds", 1D)).eval(FILE);
+      Assert.assertFalse("Should not match: all nan column doesn't contain number", shouldRead);
+
+      shouldRead = new InclusiveMetricsEvaluator(SCHEMA, func.apply("some_nan_correct_bounds", 1F)).eval(FILE);
+      Assert.assertFalse("Should not match: 1 is smaller than lower bound", shouldRead);
+
+      shouldRead = new InclusiveMetricsEvaluator(SCHEMA, func.apply("some_nan_correct_bounds", 10F)).eval(FILE);
+      Assert.assertTrue("Should match: 10 larger than lower bound", shouldRead);
+    }
+  }
+
+  @Test
+  public void testInclusiveMetricsEvaluatorGreaterThanAndGreaterThanOrEqual() {
+    for (BiFunction<String, Number, Expression> func : GREATER_THAN_EXPRESSIONS) {
+      boolean shouldRead = new InclusiveMetricsEvaluator(SCHEMA, func.apply("all_nan", 1D)).eval(FILE);
+      Assert.assertFalse("Should not match: all nan column doesn't contain number", shouldRead);
+
+      shouldRead = new InclusiveMetricsEvaluator(SCHEMA, func.apply("max_nan", 1D)).eval(FILE);
+      Assert.assertTrue("Should match: upper bound is larger than 1", shouldRead);
+
+      shouldRead = new InclusiveMetricsEvaluator(SCHEMA, func.apply("max_nan", 10D)).eval(FILE);
+      Assert.assertTrue("Should match: upper bound is larger than 10", shouldRead);
+
+      shouldRead = new InclusiveMetricsEvaluator(SCHEMA, func.apply("min_max_nan", 1F)).eval(FILE);
+      Assert.assertTrue("Should match: no visibility", shouldRead);
+
+      shouldRead = new InclusiveMetricsEvaluator(SCHEMA, func.apply("all_nan_null_bounds", 1D)).eval(FILE);
+      Assert.assertFalse("Should not match: all nan column doesn't contain number", shouldRead);
+
+      shouldRead = new InclusiveMetricsEvaluator(SCHEMA, func.apply("some_nan_correct_bounds", 1F)).eval(FILE);
+      Assert.assertTrue("Should match: 1 is smaller than upper bound", shouldRead);
+
+      shouldRead = new InclusiveMetricsEvaluator(SCHEMA, func.apply("some_nan_correct_bounds", 10F)).eval(FILE);
+      Assert.assertTrue("Should match: 10 is smaller than upper bound", shouldRead);
+
+      shouldRead = new InclusiveMetricsEvaluator(SCHEMA, func.apply("some_nan_correct_bounds", 30)).eval(FILE);
+      Assert.assertFalse("Should not match: 30 is greater than upper bound", shouldRead);
+    }
+  }
+
+  @Test
+  public void testInclusiveMetricsEvaluatorEquals() {
+    boolean shouldRead = new InclusiveMetricsEvaluator(
+        SCHEMA, Expressions.equal("all_nan", 1D)).eval(FILE);
+    Assert.assertFalse("Should not match: all nan column doesn't contain number", shouldRead);
+
+    shouldRead = new InclusiveMetricsEvaluator(SCHEMA, Expressions.equal("max_nan", 1D)).eval(FILE);
+    Assert.assertFalse("Should not match: 1 is smaller than lower bound", shouldRead);
+
+    shouldRead = new InclusiveMetricsEvaluator(SCHEMA, Expressions.equal("max_nan", 10D)).eval(FILE);
+    Assert.assertTrue("Should match: 10 is within bounds", shouldRead);
+
+    shouldRead = new InclusiveMetricsEvaluator(SCHEMA, Expressions.equal("min_max_nan", 1F)).eval(FILE);
+    Assert.assertTrue("Should match: no visibility", shouldRead);
+
+    shouldRead = new InclusiveMetricsEvaluator(
+        SCHEMA, Expressions.equal("all_nan_null_bounds", 1D)).eval(FILE);
+    Assert.assertFalse("Should not match: all nan column doesn't contain number", shouldRead);
+
+    shouldRead = new InclusiveMetricsEvaluator(
+        SCHEMA, Expressions.equal("some_nan_correct_bounds", 1F)).eval(FILE);
+    Assert.assertFalse("Should not match: 1 is smaller than lower bound", shouldRead);
+
+    shouldRead = new InclusiveMetricsEvaluator(
+        SCHEMA, Expressions.equal("some_nan_correct_bounds", 10F)).eval(FILE);
+    Assert.assertTrue("Should match: 10 is within bounds", shouldRead);
+
+    shouldRead = new InclusiveMetricsEvaluator(
+        SCHEMA, Expressions.equal("some_nan_correct_bounds", 30)).eval(FILE);
+    Assert.assertFalse("Should not match: 30 is greater than upper bound", shouldRead);
+  }
+
+  @Test
+  public void testInclusiveMetricsEvaluatorNotEquals() {
+    boolean shouldRead = new InclusiveMetricsEvaluator(
+        SCHEMA, Expressions.notEqual("all_nan", 1D)).eval(FILE);
+    shouldRead = shouldRead & new InclusiveMetricsEvaluator(
+        SCHEMA, Expressions.notEqual("max_nan", 1D)).eval(FILE);
+    shouldRead = shouldRead & new InclusiveMetricsEvaluator(
+        SCHEMA, Expressions.notEqual("max_nan", 10D)).eval(FILE);
+    shouldRead = shouldRead & new InclusiveMetricsEvaluator(
+        SCHEMA, Expressions.notEqual("min_max_nan", 1F)).eval(FILE);
+    shouldRead = shouldRead & new InclusiveMetricsEvaluator(
+        SCHEMA, Expressions.notEqual("all_nan_null_bounds", 1D)).eval(FILE);
+    shouldRead = shouldRead & new InclusiveMetricsEvaluator(
+        SCHEMA, Expressions.notEqual("some_nan_correct_bounds", 1F)).eval(FILE);
+    shouldRead = shouldRead & new InclusiveMetricsEvaluator(
+        SCHEMA, Expressions.notEqual("some_nan_correct_bounds", 10F)).eval(FILE);
+    shouldRead = shouldRead & new InclusiveMetricsEvaluator(
+        SCHEMA, Expressions.notEqual("some_nan_correct_bounds", 30)).eval(FILE);
+    Assert.assertTrue("Should match: no visibility", shouldRead);
+  }
+
+  @Test
+  public void testInclusiveMetricsEvaluatorIn() {
+    boolean shouldRead = new InclusiveMetricsEvaluator(
+        SCHEMA, Expressions.in("all_nan", 1D, 10D, 30D)).eval(FILE);
+    Assert.assertFalse("Should not match: all nan column doesn't contain number", shouldRead);
+
+    shouldRead = new InclusiveMetricsEvaluator(
+        SCHEMA, Expressions.in("max_nan", 1D, 10D, 30D)).eval(FILE);
+    Assert.assertTrue("Should match: 10 and 30 are greater than lower bound", shouldRead);
+
+    shouldRead = new InclusiveMetricsEvaluator(
+        SCHEMA, Expressions.in("min_max_nan", 1F, 10F, 30F)).eval(FILE);
+    Assert.assertTrue("Should match: no visibility", shouldRead);
+
+    shouldRead = new InclusiveMetricsEvaluator(
+        SCHEMA, Expressions.in("all_nan_null_bounds", 1D, 10D, 30D)).eval(FILE);
+    Assert.assertFalse("Should not match: all nan column doesn't contain number", shouldRead);
+
+    shouldRead = new InclusiveMetricsEvaluator(
+        SCHEMA, Expressions.in("some_nan_correct_bounds", 1F, 10F, 30F)).eval(FILE);
+    Assert.assertTrue("Should match: 10 within bounds", shouldRead);
+
+    shouldRead = new InclusiveMetricsEvaluator(
+        SCHEMA, Expressions.in("some_nan_correct_bounds", 1F)).eval(FILE);
+    Assert.assertFalse("Should not match: 1 not within bounds", shouldRead);
+
+    shouldRead = new InclusiveMetricsEvaluator(
+        SCHEMA, Expressions.in("some_nan_correct_bounds", 30F)).eval(FILE);
+    Assert.assertFalse("Should not match: 30 not within bounds", shouldRead);

Review comment:
       Nit: 1F and 30F could be combined to one case, `Expressions.in("some_nan_correct_bounds", 1F, 30F)`.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #2069: API: handle NaN as min/max stats in evaluators

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #2069:
URL: https://github.com/apache/iceberg/pull/2069#issuecomment-763791097


   Thank you for fixing this, @yyanyy! Great work.
   
   Thanks to @jbapple, @jackye1995, and @kbendick for reviewing!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick edited a comment on pull request #2069: API: handle NaN as min/max stats in evaluators

Posted by GitBox <gi...@apache.org>.
kbendick edited a comment on pull request #2069:
URL: https://github.com/apache/iceberg/pull/2069#issuecomment-761919569


   For ORC metrics, which currently don't populate `nanValueCounts` in the metrics and for which it seems difficult to populate `nanValueCounts`, will this still work when we create `Metrics` using the non-deprecated constructors that require passing `nanValueCounts`? This is in reference to this issue for removing the `Metrics` constructors that don't take in `nanValueCounts` maps: https://github.com/apache/iceberg/issues/2044
   
   I suppose when we go to deprecate the old `Metrics` constructors, we can simply pass `null` into the `Metrics` constructor for `nanValueCounts` with ORC in order to keep the current behavior. Or update the various evaluator code paths to ensure that a missing key from this map does not indicate anything special if we choose to pass in an empty map.
   
   This is all based off of my understanding of ORC's `ColumnStatistics` and the notes in this PR about `upperBound ` and `lowerBound` being `NaN` in ORC if _any_ value in the file is `NaN`, so please correct me if I'm mistaken and somebody sees an easy way to get NaN stats from an ORC file. To me, I fail to see how we can get the count of NaN values in the following code block (though as I mentioned we can always continue to pass in `null` to match the existing behavior, provided the various metrics evaluators are aware that `null` for that Map entry for nullValueCounts doesnt convey much meaning for ORC).
   
   We could also potentially track NaN values in the various orc writers, like the parquet writers do.
   
   The relevant code for populating ORC metrics is here: https://github.com/apache/iceberg/blob/29915e3522f9a808420a2e16fb1c2cf3ae165a74/orc/src/main/java/org/apache/iceberg/orc/OrcMetrics.java#L100-L172


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] yyanyy commented on pull request #2069: API: handle NaN as min/max stats in evaluators

Posted by GitBox <gi...@apache.org>.
yyanyy commented on pull request #2069:
URL: https://github.com/apache/iceberg/pull/2069#issuecomment-761266457


   > Odd, maybe we're trying different things. What does this print for you?
   > 
   > ```
   > public class Comp {
   >   public static void main(String[] args) {
   >      System.out.println(Double.NaN > 0.0);
   >   }
   > }
   > ```
   > 
   > I'm on OpenJDK 1.8, and this prints `false` for me.
   
   Yeah this prints `false` for me too, as well as `Double.NaN == Double.NaN`. I think we workaround this problem in evaluators by using a [`Comparator.naturalOrder()`](https://github.com/apache/iceberg/blob/master/api/src/main/java/org/apache/iceberg/expressions/Literals.java#L156) instead of direct operator comparison, as the following code prints the behavior that NaN is considered larger than positive infinity as described in [here](https://docs.oracle.com/javase/8/docs/api/java/lang/Double.html#compareTo-java.lang.Double-)(by returning -1, 1, 1, -1, 0 respectively)
   
   ```
       Comparator comparator = Comparator.naturalOrder();
       System.out.println(comparator.compare(1.2F, Float.NaN));
       System.out.println(comparator.compare(Float.NaN, 1.2F));
       System.out.println(comparator.compare(Float.NaN, Float.POSITIVE_INFINITY));
       System.out.println(comparator.compare(Float.POSITIVE_INFINITY, Float.NaN));
       System.out.println(comparator.compare(Float.NaN, Float.NaN));
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2069: API: handle NaN as min/max stats in evaluators

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2069:
URL: https://github.com/apache/iceberg/pull/2069#discussion_r559044025



##########
File path: api/src/main/java/org/apache/iceberg/expressions/InclusiveMetricsEvaluator.java
##########
@@ -204,15 +210,20 @@ public Boolean or(Boolean leftResult, Boolean rightResult) {
     public <T> Boolean ltEq(BoundReference<T> ref, Literal<T> lit) {
       Integer id = ref.fieldId();
 
-      if (containsNullsOnly(id)) {
+      if (containsNullsOnly(id) || containsNaNsOnly(id)) {
         return ROWS_CANNOT_MATCH;
       }
 
       if (lowerBounds != null && lowerBounds.containsKey(id)) {
         T lower = Conversions.fromByteBuffer(ref.type(), lowerBounds.get(id));
 
         int cmp = lit.comparator().compare(lower, lit.value());
-        if (cmp > 0) {
+
+        // Due to the comparison implementation of ORC stats, for float/double columns in ORC files,

Review comment:
       I don't that there is a need for an extra method that has just one method call. I'd probably do it like this:
   
   ```java
           T lower = Conversions.fromByteBuffer(ref.type(), lowerBounds.get(id));
           if (NaNUtil.isNaN(lower)) {
             // NaN indicates unreliable bounds. See the InclusiveMetricsEvaluator docs for more.
             return ROWS_MIGHT_MATCH;
           }
   
           int cmp = lit.comparator().compare(lower, lit.value());
           if (cmp > 0) {
             return ROWS_CANNOT_MATCH;
           }
   ```
   
   The docs would go in the javadoc for the whole class, and each NaN check could simply refer back to it. I also moved the NaN check above the comparison to keep the logic simple: if the value is NaN, the bound is invalid.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jbapple commented on pull request #2069: API: handle NaN as min/max stats in evaluators

Posted by GitBox <gi...@apache.org>.
jbapple commented on pull request #2069:
URL: https://github.com/apache/iceberg/pull/2069#issuecomment-759923675


   I think in Java NaN is neither greater then, nor equal to, nor less than any numbers. That’s the IEEE754 semantics, at least.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a change in pull request #2069: API: handle NaN as min/max stats in evaluators

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #2069:
URL: https://github.com/apache/iceberg/pull/2069#discussion_r556967311



##########
File path: api/src/main/java/org/apache/iceberg/expressions/InclusiveMetricsEvaluator.java
##########
@@ -204,15 +210,20 @@ public Boolean or(Boolean leftResult, Boolean rightResult) {
     public <T> Boolean ltEq(BoundReference<T> ref, Literal<T> lit) {
       Integer id = ref.fieldId();
 
-      if (containsNullsOnly(id)) {
+      if (containsNullsOnly(id) || containsNaNsOnly(id)) {
         return ROWS_CANNOT_MATCH;
       }
 
       if (lowerBounds != null && lowerBounds.containsKey(id)) {
         T lower = Conversions.fromByteBuffer(ref.type(), lowerBounds.get(id));
 
         int cmp = lit.comparator().compare(lower, lit.value());
-        if (cmp > 0) {
+
+        // Due to the comparison implementation of ORC stats, for float/double columns in ORC files,

Review comment:
       Sorry I did not see the comment, I think it is valuable enough to abstract out the method given the amount of comments, and have something like:
   
   ```java
   /*
    * all the comments...
    */
   private boolean isLowerBoundNaN(lower) {
     return NaNUtil.isNaN(lower);
   }
   
   ```
   
   I am not good at naming, you can probably come up with a better name...




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] yyanyy commented on pull request #2069: API: handle NaN as min/max stats in evaluators

Posted by GitBox <gi...@apache.org>.
yyanyy commented on pull request #2069:
URL: https://github.com/apache/iceberg/pull/2069#issuecomment-763080534


   > For ORC metrics, which currently don't populate `nanValueCounts` in the metrics and for which it seems difficult to populate `nanValueCounts`, will this still work when we create `Metrics` using the non-deprecated constructors that require passing `nanValueCounts`? This is in reference to this issue for removing the `Metrics` constructors that don't take in `nanValueCounts` maps: #2044
   > 
   > I suppose when we go to deprecate the old `Metrics` constructors, we can simply pass `null` into the `Metrics` constructor for `nanValueCounts` with ORC in order to keep the current behavior. Or update the various evaluator code paths to ensure that a missing key from this map does not indicate anything special if we choose to pass in an empty map.
   > 
   > This is all based off of my understanding of ORC's `ColumnStatistics` and the notes in this PR about `upperBound ` and `lowerBound` being `NaN` in ORC if _any_ value in the file is `NaN`, so please correct me if I'm mistaken and somebody sees an easy way to get NaN stats from an ORC file. To me, I fail to see how we can get the count of NaN values in the following code block (though as I mentioned we can always continue to pass in `null` to match the existing behavior, provided the various metrics evaluators are aware that `null` for that Map entry for nullValueCounts doesnt convey much meaning for ORC).
   > 
   > We could also potentially track NaN values in the various orc writers, like the parquet writers do.
   > 
   > The relevant code for populating ORC metrics is here:
   > 
   > https://github.com/apache/iceberg/blob/29915e3522f9a808420a2e16fb1c2cf3ae165a74/orc/src/main/java/org/apache/iceberg/orc/OrcMetrics.java#L100-L172
   
   Thanks for your interests in this PR! You are right that some of this PR will not be working without NaN counter for checking if a field contains partial/all NaN, and it's hard to get ORC `nanValueCounts` from ORC file directly. I think today evaluators should treat missing keys from this map or lack of such map the same way as not having NaN value count information for a given field, or do you think we are not doing this in some places?
   
   Regarding populating NaN counter for ORC files, yes we are going to track NaN values in various ORC writers just as we do for parquet writers, and the change is currently in review in #1790. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] yyanyy commented on a change in pull request #2069: API: handle NaN as min/max stats in evaluators

Posted by GitBox <gi...@apache.org>.
yyanyy commented on a change in pull request #2069:
URL: https://github.com/apache/iceberg/pull/2069#discussion_r555474356



##########
File path: api/src/main/java/org/apache/iceberg/expressions/InclusiveMetricsEvaluator.java
##########
@@ -204,15 +210,20 @@ public Boolean or(Boolean leftResult, Boolean rightResult) {
     public <T> Boolean ltEq(BoundReference<T> ref, Literal<T> lit) {
       Integer id = ref.fieldId();
 
-      if (containsNullsOnly(id)) {
+      if (containsNullsOnly(id) || containsNaNsOnly(id)) {
         return ROWS_CANNOT_MATCH;
       }
 
       if (lowerBounds != null && lowerBounds.containsKey(id)) {
         T lower = Conversions.fromByteBuffer(ref.type(), lowerBounds.get(id));
 
         int cmp = lit.comparator().compare(lower, lit.value());
-        if (cmp > 0) {
+
+        // Due to the comparison implementation of ORC stats, for float/double columns in ORC files,

Review comment:
       Thanks for the quick review! Yeah I do realize that repeating the same comment over and over again is a bit annoying, but I wasn't sure where the right balance is. Since I'm hoping to check for `isNaN` after comparing to avoid unnecessary checking in `lt`/`ltEq`, the only thing I can abstract out is `NaNUtil.isNaN(lower)`, that we are essentially wrapping around a wrapper; and also I guess that might not help much with readability since the actual explanation in this case will be outside of the logic flow here, so the reader will have to jump around to understand the full intention. Maybe we can shorten this comment everywhere and have the full version at the start of the class? Do you/other people have any suggestion? 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jbapple commented on pull request #2069: API: handle NaN as min/max stats in evaluators

Posted by GitBox <gi...@apache.org>.
jbapple commented on pull request #2069:
URL: https://github.com/apache/iceberg/pull/2069#issuecomment-761655391


   LGTM. One note I'll make: not all languages treat NaN like Java. For instance, in C++, there are roughly 2^52 different NaN values, and the totalOrder predicate sorts negative NaNs before negative infinity. As such, you could have columns with a NaN lower bound and a numeric upper bound. I'm not sure what the latest draft Iceberg specification says about this.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] yyanyy commented on pull request #2069: API: handle NaN as min/max stats in evaluators

Posted by GitBox <gi...@apache.org>.
yyanyy commented on pull request #2069:
URL: https://github.com/apache/iceberg/pull/2069#issuecomment-763241880


   I have updated the PR based on comments. Thank you everyone for your interests in this PR! 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick commented on pull request #2069: API: handle NaN as min/max stats in evaluators

Posted by GitBox <gi...@apache.org>.
kbendick commented on pull request #2069:
URL: https://github.com/apache/iceberg/pull/2069#issuecomment-761919569


   For ORC metrics, which currently don't populate `nanValueCounts` in the metrics and for which it seems difficult to populate `nanValueCounts`, will this still work when we create `Metrics` using the non-deprecated constructors that require passing `nanValueCounts`? This is in reference to this issue for removing the `Metrics` constructors that don't take in `nanValueCounts` maps: https://github.com/apache/iceberg/issues/2044
   
   I suppose when we go to deprecate the old `Metrics` constructors, we can simply pass `null` into the `Metrics` constructor for `nanValueCounts` with ORG in order to 
   
   Based off of my understanding of ORC's `ColumnStatistics` and the notes in this PR about `upperBound ` and `lowerBound` being `NaN` in ORC if _any_ value in the file is `NaN`. To me, I fail to see how we can get the count of NaN values in the following code block (though we can always continue to pass in `null` to match the existing behavior, provided the various metrics evaluators are aware that `null` for that Map entry for nullValueCounts doesnt convey much meaning for ORC).
   
   The relevant code for populating ORC metrics is here: https://github.com/apache/iceberg/blob/29915e3522f9a808420a2e16fb1c2cf3ae165a74/orc/src/main/java/org/apache/iceberg/orc/OrcMetrics.java#L100-L172


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2069: API: handle NaN as min/max stats in evaluators

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2069:
URL: https://github.com/apache/iceberg/pull/2069#discussion_r559045076



##########
File path: api/src/test/java/org/apache/iceberg/expressions/TestMetricsEvaluatorsNaNHandling.java
##########
@@ -0,0 +1,405 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.expressions;
+
+import java.nio.ByteBuffer;
+import java.util.Set;
+import java.util.function.BiFunction;
+import org.apache.iceberg.DataFile;
+import org.apache.iceberg.Schema;
+import org.apache.iceberg.TestHelpers;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
+import org.apache.iceberg.types.Types;
+import org.junit.Assert;
+import org.junit.Test;
+
+import static org.apache.iceberg.types.Conversions.toByteBuffer;
+import static org.apache.iceberg.types.Types.NestedField.optional;
+import static org.apache.iceberg.types.Types.NestedField.required;
+
+/**
+ * This test class ensures that metrics evaluators could handle NaN as upper/lower bounds correctly.
+ */
+public class TestMetricsEvaluatorsNaNHandling {
+  private static final Schema SCHEMA = new Schema(
+      required(1, "all_nan", Types.DoubleType.get()),
+      required(2, "max_nan", Types.DoubleType.get()),
+      optional(3, "min_max_nan", Types.FloatType.get()),
+      required(4, "all_nan_null_bounds", Types.DoubleType.get()),
+      optional(5, "some_nan_correct_bounds", Types.FloatType.get())
+  );
+
+  private static final DataFile FILE = new TestHelpers.TestDataFile("file.avro", TestHelpers.Row.of(), 50,
+      // any value counts, including nulls
+      ImmutableMap.<Integer, Long>builder()
+          .put(1, 10L)
+          .put(2, 10L)
+          .put(3, 10L)
+          .put(4, 10L)
+          .put(5, 10L)
+          .build(),
+      // null value counts
+      ImmutableMap.<Integer, Long>builder()
+          .put(1, 0L)
+          .put(2, 0L)
+          .put(3, 0L)
+          .put(4, 0L)
+          .put(5, 0L)
+          .build(),
+      // nan value counts
+      ImmutableMap.<Integer, Long>builder()
+          .put(1, 10L)
+          .put(4, 10L)
+          .put(5, 5L)
+          .build(),
+      // lower bounds
+      ImmutableMap.<Integer, ByteBuffer>builder()
+          .put(1, toByteBuffer(Types.DoubleType.get(), Double.NaN))
+          .put(2, toByteBuffer(Types.DoubleType.get(), 7D))
+          .put(3, toByteBuffer(Types.FloatType.get(), Float.NaN))
+          .put(5, toByteBuffer(Types.FloatType.get(), 7F))
+          .build(),
+      // upper bounds
+      ImmutableMap.<Integer, ByteBuffer>builder()
+          .put(1, toByteBuffer(Types.DoubleType.get(), Double.NaN))
+          .put(2, toByteBuffer(Types.DoubleType.get(), Double.NaN))
+          .put(3, toByteBuffer(Types.FloatType.get(), Float.NaN))
+          .put(5, toByteBuffer(Types.FloatType.get(), 22F))
+          .build());
+
+  private static final Set<BiFunction<String, Number, Expression>> LESS_THAN_EXPRESSIONS =
+      ImmutableSet.of(Expressions::lessThan, Expressions::lessThanOrEqual);
+
+  private static final Set<BiFunction<String, Number, Expression>> GREATER_THAN_EXPRESSIONS =
+      ImmutableSet.of(Expressions::greaterThan, Expressions::greaterThanOrEqual);
+
+  @Test
+  public void testInclusiveMetricsEvaluatorLessThanAndLessThanOrEqual() {
+    for (BiFunction<String, Number, Expression> func : LESS_THAN_EXPRESSIONS) {
+      boolean shouldRead = new InclusiveMetricsEvaluator(SCHEMA, func.apply("all_nan", 1D)).eval(FILE);
+      Assert.assertFalse("Should not match: all nan column doesn't contain number", shouldRead);
+
+      shouldRead = new InclusiveMetricsEvaluator(SCHEMA, func.apply("max_nan", 1D)).eval(FILE);
+      Assert.assertFalse("Should not match: 1 is smaller than lower bound", shouldRead);
+
+      shouldRead = new InclusiveMetricsEvaluator(SCHEMA, func.apply("max_nan", 10D)).eval(FILE);
+      Assert.assertTrue("Should match: 10 is larger than lower bound", shouldRead);
+
+      shouldRead = new InclusiveMetricsEvaluator(SCHEMA, func.apply("min_max_nan", 1F)).eval(FILE);
+      Assert.assertTrue("Should match: no visibility", shouldRead);
+
+      shouldRead = new InclusiveMetricsEvaluator(SCHEMA, func.apply("all_nan_null_bounds", 1D)).eval(FILE);
+      Assert.assertFalse("Should not match: all nan column doesn't contain number", shouldRead);
+
+      shouldRead = new InclusiveMetricsEvaluator(SCHEMA, func.apply("some_nan_correct_bounds", 1F)).eval(FILE);
+      Assert.assertFalse("Should not match: 1 is smaller than lower bound", shouldRead);
+
+      shouldRead = new InclusiveMetricsEvaluator(SCHEMA, func.apply("some_nan_correct_bounds", 10F)).eval(FILE);
+      Assert.assertTrue("Should match: 10 larger than lower bound", shouldRead);
+    }
+  }
+
+  @Test
+  public void testInclusiveMetricsEvaluatorGreaterThanAndGreaterThanOrEqual() {
+    for (BiFunction<String, Number, Expression> func : GREATER_THAN_EXPRESSIONS) {
+      boolean shouldRead = new InclusiveMetricsEvaluator(SCHEMA, func.apply("all_nan", 1D)).eval(FILE);
+      Assert.assertFalse("Should not match: all nan column doesn't contain number", shouldRead);
+
+      shouldRead = new InclusiveMetricsEvaluator(SCHEMA, func.apply("max_nan", 1D)).eval(FILE);
+      Assert.assertTrue("Should match: upper bound is larger than 1", shouldRead);
+
+      shouldRead = new InclusiveMetricsEvaluator(SCHEMA, func.apply("max_nan", 10D)).eval(FILE);
+      Assert.assertTrue("Should match: upper bound is larger than 10", shouldRead);
+
+      shouldRead = new InclusiveMetricsEvaluator(SCHEMA, func.apply("min_max_nan", 1F)).eval(FILE);
+      Assert.assertTrue("Should match: no visibility", shouldRead);
+
+      shouldRead = new InclusiveMetricsEvaluator(SCHEMA, func.apply("all_nan_null_bounds", 1D)).eval(FILE);
+      Assert.assertFalse("Should not match: all nan column doesn't contain number", shouldRead);
+
+      shouldRead = new InclusiveMetricsEvaluator(SCHEMA, func.apply("some_nan_correct_bounds", 1F)).eval(FILE);
+      Assert.assertTrue("Should match: 1 is smaller than upper bound", shouldRead);
+
+      shouldRead = new InclusiveMetricsEvaluator(SCHEMA, func.apply("some_nan_correct_bounds", 10F)).eval(FILE);
+      Assert.assertTrue("Should match: 10 is smaller than upper bound", shouldRead);
+
+      shouldRead = new InclusiveMetricsEvaluator(SCHEMA, func.apply("some_nan_correct_bounds", 30)).eval(FILE);
+      Assert.assertFalse("Should not match: 30 is greater than upper bound", shouldRead);
+    }
+  }
+
+  @Test
+  public void testInclusiveMetricsEvaluatorEquals() {
+    boolean shouldRead = new InclusiveMetricsEvaluator(
+        SCHEMA, Expressions.equal("all_nan", 1D)).eval(FILE);
+    Assert.assertFalse("Should not match: all nan column doesn't contain number", shouldRead);
+
+    shouldRead = new InclusiveMetricsEvaluator(SCHEMA, Expressions.equal("max_nan", 1D)).eval(FILE);
+    Assert.assertFalse("Should not match: 1 is smaller than lower bound", shouldRead);
+
+    shouldRead = new InclusiveMetricsEvaluator(SCHEMA, Expressions.equal("max_nan", 10D)).eval(FILE);
+    Assert.assertTrue("Should match: 10 is within bounds", shouldRead);
+
+    shouldRead = new InclusiveMetricsEvaluator(SCHEMA, Expressions.equal("min_max_nan", 1F)).eval(FILE);
+    Assert.assertTrue("Should match: no visibility", shouldRead);
+
+    shouldRead = new InclusiveMetricsEvaluator(
+        SCHEMA, Expressions.equal("all_nan_null_bounds", 1D)).eval(FILE);
+    Assert.assertFalse("Should not match: all nan column doesn't contain number", shouldRead);
+
+    shouldRead = new InclusiveMetricsEvaluator(
+        SCHEMA, Expressions.equal("some_nan_correct_bounds", 1F)).eval(FILE);
+    Assert.assertFalse("Should not match: 1 is smaller than lower bound", shouldRead);
+
+    shouldRead = new InclusiveMetricsEvaluator(
+        SCHEMA, Expressions.equal("some_nan_correct_bounds", 10F)).eval(FILE);
+    Assert.assertTrue("Should match: 10 is within bounds", shouldRead);
+
+    shouldRead = new InclusiveMetricsEvaluator(
+        SCHEMA, Expressions.equal("some_nan_correct_bounds", 30)).eval(FILE);
+    Assert.assertFalse("Should not match: 30 is greater than upper bound", shouldRead);
+  }
+
+  @Test
+  public void testInclusiveMetricsEvaluatorNotEquals() {
+    boolean shouldRead = new InclusiveMetricsEvaluator(
+        SCHEMA, Expressions.notEqual("all_nan", 1D)).eval(FILE);
+    shouldRead = shouldRead & new InclusiveMetricsEvaluator(
+        SCHEMA, Expressions.notEqual("max_nan", 1D)).eval(FILE);
+    shouldRead = shouldRead & new InclusiveMetricsEvaluator(
+        SCHEMA, Expressions.notEqual("max_nan", 10D)).eval(FILE);
+    shouldRead = shouldRead & new InclusiveMetricsEvaluator(
+        SCHEMA, Expressions.notEqual("min_max_nan", 1F)).eval(FILE);
+    shouldRead = shouldRead & new InclusiveMetricsEvaluator(
+        SCHEMA, Expressions.notEqual("all_nan_null_bounds", 1D)).eval(FILE);
+    shouldRead = shouldRead & new InclusiveMetricsEvaluator(
+        SCHEMA, Expressions.notEqual("some_nan_correct_bounds", 1F)).eval(FILE);
+    shouldRead = shouldRead & new InclusiveMetricsEvaluator(
+        SCHEMA, Expressions.notEqual("some_nan_correct_bounds", 10F)).eval(FILE);
+    shouldRead = shouldRead & new InclusiveMetricsEvaluator(
+        SCHEMA, Expressions.notEqual("some_nan_correct_bounds", 30)).eval(FILE);
+    Assert.assertTrue("Should match: no visibility", shouldRead);
+  }
+
+  @Test
+  public void testInclusiveMetricsEvaluatorIn() {
+    boolean shouldRead = new InclusiveMetricsEvaluator(
+        SCHEMA, Expressions.in("all_nan", 1D, 10D, 30D)).eval(FILE);
+    Assert.assertFalse("Should not match: all nan column doesn't contain number", shouldRead);
+
+    shouldRead = new InclusiveMetricsEvaluator(
+        SCHEMA, Expressions.in("max_nan", 1D, 10D, 30D)).eval(FILE);
+    Assert.assertTrue("Should match: 10 and 30 are greater than lower bound", shouldRead);
+
+    shouldRead = new InclusiveMetricsEvaluator(
+        SCHEMA, Expressions.in("min_max_nan", 1F, 10F, 30F)).eval(FILE);
+    Assert.assertTrue("Should match: no visibility", shouldRead);
+
+    shouldRead = new InclusiveMetricsEvaluator(
+        SCHEMA, Expressions.in("all_nan_null_bounds", 1D, 10D, 30D)).eval(FILE);
+    Assert.assertFalse("Should not match: all nan column doesn't contain number", shouldRead);
+
+    shouldRead = new InclusiveMetricsEvaluator(
+        SCHEMA, Expressions.in("some_nan_correct_bounds", 1F, 10F, 30F)).eval(FILE);
+    Assert.assertTrue("Should match: 10 within bounds", shouldRead);
+
+    shouldRead = new InclusiveMetricsEvaluator(
+        SCHEMA, Expressions.in("some_nan_correct_bounds", 1F)).eval(FILE);
+    Assert.assertFalse("Should not match: 1 not within bounds", shouldRead);
+
+    shouldRead = new InclusiveMetricsEvaluator(
+        SCHEMA, Expressions.in("some_nan_correct_bounds", 30F)).eval(FILE);
+    Assert.assertFalse("Should not match: 30 not within bounds", shouldRead);
+  }
+
+  @Test
+  public void testInclusiveMetricsEvaluatorNotIn() {
+    boolean shouldRead = new InclusiveMetricsEvaluator(
+        SCHEMA, Expressions.notIn("all_nan", 1D)).eval(FILE);
+    shouldRead = shouldRead & new InclusiveMetricsEvaluator(
+        SCHEMA, Expressions.notIn("max_nan", 1D)).eval(FILE);
+    shouldRead = shouldRead & new InclusiveMetricsEvaluator(
+        SCHEMA, Expressions.notIn("max_nan", 10D)).eval(FILE);
+    shouldRead = shouldRead & new InclusiveMetricsEvaluator(
+        SCHEMA, Expressions.notIn("min_max_nan", 1F)).eval(FILE);
+    shouldRead = shouldRead & new InclusiveMetricsEvaluator(
+        SCHEMA, Expressions.notIn("all_nan_null_bounds", 1D)).eval(FILE);
+    shouldRead = shouldRead & new InclusiveMetricsEvaluator(
+        SCHEMA, Expressions.notIn("some_nan_correct_bounds", 1F)).eval(FILE);
+    shouldRead = shouldRead & new InclusiveMetricsEvaluator(
+        SCHEMA, Expressions.notIn("some_nan_correct_bounds", 10F)).eval(FILE);
+    shouldRead = shouldRead & new InclusiveMetricsEvaluator(
+        SCHEMA, Expressions.notIn("some_nan_correct_bounds", 30)).eval(FILE);
+    Assert.assertTrue("Should match: no visibility", shouldRead);
+  }
+
+  @Test
+  public void testStrictMetricsEvaluatorLessThanAndLessThanOrEqual() {
+    for (BiFunction<String, Number, Expression> func : LESS_THAN_EXPRESSIONS) {
+      boolean shouldRead = new StrictMetricsEvaluator(SCHEMA, func.apply("all_nan", 1D)).eval(FILE);
+      Assert.assertFalse("Should not match: all nan column doesn't contain number", shouldRead);
+
+      shouldRead = new StrictMetricsEvaluator(SCHEMA, func.apply("max_nan", 10D)).eval(FILE);
+      Assert.assertFalse("Should not match: 10 is less than upper bound", shouldRead);
+
+      shouldRead = new StrictMetricsEvaluator(SCHEMA, func.apply("min_max_nan", 1F)).eval(FILE);
+      Assert.assertFalse("Should not match: no visibility", shouldRead);
+
+      shouldRead = new StrictMetricsEvaluator(SCHEMA, func.apply("all_nan_null_bounds", 1D)).eval(FILE);
+      Assert.assertFalse("Should not match: all nan column doesn't contain number", shouldRead);
+
+      shouldRead = new StrictMetricsEvaluator(SCHEMA, func.apply("some_nan_correct_bounds", 10F)).eval(FILE);
+      Assert.assertFalse("Should not match: nan value exists", shouldRead);

Review comment:
       This case would return `false` anyway because the correct upper bound is > 10. I think you should change this test case to 30F so that it matches the whole range, but does not match because there are some NaN values.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org