You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/05/09 10:53:47 UTC

[GitHub] [iceberg] hililiwei opened a new pull request, #4734: ORC: Upgrade orc version to use the built-in estimate memory method

hililiwei opened a new pull request, #4734:
URL: https://github.com/apache/iceberg/pull/4734

   ## What is the purpose of the change
   
   Upgrade the ORC version to replace the inner classes obtained by reflection with the built-in estimated memory method.
   
   ## Brief change log
   
   1. upgrade the orc version
   2. Use  memory evaluation methods in `Writer` 
   3. ORC version 1.7.4 fixed [ORC-1121](https://issues.apache.org/jira/browse/ORC-1121) , it exposed a bug in the unit test. This PR tries to fix it.
   
   
   ## Verifying this change
   
   1. Existing ORC related tests
   2. Added tests for [ORC-1121](https://issues.apache.org/jira/browse/ORC-1121).
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): (**yes** / no)
     - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (yes / **no**)
     - The serializers: (yes / **no** / don't know)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] kbendick commented on a diff in pull request #4734: ORC: Upgrade orc version to use the built-in estimate memory method

Posted by GitBox <gi...@apache.org>.

kbendick commented on code in PR #4734:
URL: https://github.com/apache/iceberg/pull/4734#discussion_r868330842


##########
data/src/test/java/org/apache/iceberg/data/TestMetricsRowGroupFilter.java:
##########
@@ -297,7 +297,9 @@ public void testNoNulls() {
   }
 
   @Test
-  public void testIsNaN() {
+  public void testIsNaNForParquet() {

Review Comment:
   ~~Nit: You don't need to change the test name, because the `Assume` will take care of it and signal that in the tests.~~
   
   EDIT: I see why the name was changed now.
   
   For people that maintain forks and cherry-pick individual changes, updating the test name will probably cause them more of headache than is worth it.
   
   Also, I would consider using `Assume.assumeTrue("Parquet metrics do not track NaN", format == FileFormat.PARQUET)` unless this should also possibly run someday for Avro.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] hililiwei commented on pull request #4734: ORC: Upgrade orc version to use the built-in estimate memory method

Posted by GitBox <gi...@apache.org>.

hililiwei commented on PR #4734:
URL: https://github.com/apache/iceberg/pull/4734#issuecomment-1138440712

   > update
   
    updated


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] kbendick commented on a diff in pull request #4734: ORC: Upgrade orc version to use the built-in estimate memory method

Posted by GitBox <gi...@apache.org>.

kbendick commented on code in PR #4734:
URL: https://github.com/apache/iceberg/pull/4734#discussion_r869515613


##########
data/src/test/java/org/apache/iceberg/data/TestMetricsRowGroupFilter.java:
##########
@@ -298,17 +298,32 @@ public void testNoNulls() {
 
   @Test
   public void testIsNaN() {
+    Assume.assumeTrue("Avro files do not have row group statistics", format != FileFormat.AVRO);
+
     boolean shouldRead = shouldRead(isNaN("all_nans"));
-    Assert.assertTrue("Should read: NaN counts are not tracked in Parquet metrics", shouldRead);
+    Assert.assertTrue(String.format("Should read: NaN counts are not tracked in %s metrics", format), shouldRead);
 
     shouldRead = shouldRead(isNaN("some_nans"));
-    Assert.assertTrue("Should read: NaN counts are not tracked in Parquet metrics", shouldRead);
-
-    shouldRead = shouldRead(isNaN("no_nans"));
-    Assert.assertTrue("Should read: NaN counts are not tracked in Parquet metrics", shouldRead);
+    Assert.assertTrue(String.format("Should read: NaN counts are not tracked in %s metrics", format), shouldRead);
 
     shouldRead = shouldRead(isNaN("all_nulls"));
     Assert.assertFalse("Should skip: all null column will not contain nan value", shouldRead);
+
+    if (format == FileFormat.ORC) {
+      testIsNanOrc();
+    } else {
+      testIsNanParquet();
+    }
+  }
+
+  private void testIsNanParquet() {
+    boolean shouldRead = shouldRead(isNaN("no_nans"));
+    Assert.assertTrue("Should read: NaN counts are not tracked in Parquet metrics", shouldRead);
+  }
+
+  private void testIsNanOrc() {
+    boolean shouldRead = shouldRead(isNaN("no_nans"));
+    Assert.assertFalse("Should skip: no nans will not contain nan value", shouldRead);

Review Comment:
   For the ORC assertion message, it might be more clear if we're a little more descriptive, given that we've stated that the `NaN counts are not tracked in orc metrics` already.
   
   That's true for counts, but ORC tracks whether or not a column in a row group contains any NaN values if I'm not mistaken, so we should make that distinction more clear as people might not be that familiar with ORC when editing this test later.
   
   Something like `Should skip: The existence of NaN in a column is tracked in ORC metrics`. WDYT? Please correct me if my understanding is wrong. 😄 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] kbendick commented on pull request #4734: ORC: Upgrade orc version to use the built-in estimate memory method

Posted by GitBox <gi...@apache.org>.

kbendick commented on PR #4734:
URL: https://github.com/apache/iceberg/pull/4734#issuecomment-1142270329

   > > Do we shade in the ORC Version? Just checking because this could potentially break folks who are using the older versions of ORC at runtime right?
   > 
   > I think we do not shade things. We had to do that in Hive code to avoid clashing versions. We also struggle with Parquet and Avro. It might worth considering shading our own versions of file formats. But it is for another PR.
   
   We do shade parquet. If you open up a spark repl with the iceberg-spark-runtime it’s under `org.apache.iceberg.relocated` I believe.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] dongjoon-hyun commented on pull request #4734: ORC: Upgrade orc version to use the built-in estimate memory method

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on PR #4734:
URL: https://github.com/apache/iceberg/pull/4734#issuecomment-1124329911

   cc @openinx because he was reviewed there and gave some comments on that.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] kbendick commented on a diff in pull request #4734: ORC: Upgrade orc version to use the built-in estimate memory method

Posted by GitBox <gi...@apache.org>.

kbendick commented on code in PR #4734:
URL: https://github.com/apache/iceberg/pull/4734#discussion_r869515613


##########
data/src/test/java/org/apache/iceberg/data/TestMetricsRowGroupFilter.java:
##########
@@ -298,17 +298,32 @@ public void testNoNulls() {
 
   @Test
   public void testIsNaN() {
+    Assume.assumeTrue("Avro files do not have row group statistics", format != FileFormat.AVRO);
+
     boolean shouldRead = shouldRead(isNaN("all_nans"));
-    Assert.assertTrue("Should read: NaN counts are not tracked in Parquet metrics", shouldRead);
+    Assert.assertTrue(String.format("Should read: NaN counts are not tracked in %s metrics", format), shouldRead);
 
     shouldRead = shouldRead(isNaN("some_nans"));
-    Assert.assertTrue("Should read: NaN counts are not tracked in Parquet metrics", shouldRead);
-
-    shouldRead = shouldRead(isNaN("no_nans"));
-    Assert.assertTrue("Should read: NaN counts are not tracked in Parquet metrics", shouldRead);
+    Assert.assertTrue(String.format("Should read: NaN counts are not tracked in %s metrics", format), shouldRead);
 
     shouldRead = shouldRead(isNaN("all_nulls"));
     Assert.assertFalse("Should skip: all null column will not contain nan value", shouldRead);
+
+    if (format == FileFormat.ORC) {
+      testIsNanOrc();
+    } else {
+      testIsNanParquet();
+    }
+  }
+
+  private void testIsNanParquet() {
+    boolean shouldRead = shouldRead(isNaN("no_nans"));
+    Assert.assertTrue("Should read: NaN counts are not tracked in Parquet metrics", shouldRead);
+  }
+
+  private void testIsNanOrc() {
+    boolean shouldRead = shouldRead(isNaN("no_nans"));
+    Assert.assertFalse("Should skip: no nans will not contain nan value", shouldRead);

Review Comment:
   For the ORC assertion message, it might be more clear if we're a little more descriptive, given that we've stated that the `NaN counts are not tricked in orc metrics` already.
   
   Something like `Should skip: The existence of NaN in a column is tracked in ORC metrics`. WDYT?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] openinx merged pull request #4734: ORC: Upgrade orc version to use the built-in estimate memory method

Posted by GitBox <gi...@apache.org>.

openinx merged PR #4734:
URL: https://github.com/apache/iceberg/pull/4734


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] hililiwei commented on pull request #4734: ORC: Upgrade orc version to use the built-in estimate memory method

Posted by GitBox <gi...@apache.org>.

hililiwei commented on PR #4734:
URL: https://github.com/apache/iceberg/pull/4734#issuecomment-1145008551

   Thanks @openinx  and all for the reviewing ! 
   
   @RussellSpitzer Do we need to add shadows in core modules?  If needed, I can try to rise a PR. 😄 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on pull request #4734: ORC: Upgrade orc version to use the built-in estimate memory method

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on PR #4734:
URL: https://github.com/apache/iceberg/pull/4734#issuecomment-1142297429

   If that's the case let's be sure to note this in the 0.14 release notes


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] openinx commented on a diff in pull request #4734: ORC: Upgrade orc version to use the built-in estimate memory method

Posted by GitBox <gi...@apache.org>.

openinx commented on code in PR #4734:
URL: https://github.com/apache/iceberg/pull/4734#discussion_r873643782


##########
data/src/test/java/org/apache/iceberg/data/TestMetricsRowGroupFilter.java:
##########
@@ -298,17 +298,23 @@ public void testNoNulls() {
 
   @Test
   public void testIsNaN() {
+    Assume.assumeTrue("Avro files do not have row group statistics", format != FileFormat.AVRO);
+
     boolean shouldRead = shouldRead(isNaN("all_nans"));
-    Assert.assertTrue("Should read: NaN counts are not tracked in Parquet metrics", shouldRead);
+    Assert.assertTrue(String.format("Should read: NaN counts are not tracked in %s metrics", format), shouldRead);
 
     shouldRead = shouldRead(isNaN("some_nans"));
-    Assert.assertTrue("Should read: NaN counts are not tracked in Parquet metrics", shouldRead);
-
-    shouldRead = shouldRead(isNaN("no_nans"));
-    Assert.assertTrue("Should read: NaN counts are not tracked in Parquet metrics", shouldRead);
+    Assert.assertTrue(String.format("Should read: NaN counts are not tracked in %s metrics", format), shouldRead);
 
     shouldRead = shouldRead(isNaN("all_nulls"));
     Assert.assertFalse("Should skip: all null column will not contain nan value", shouldRead);
+
+    shouldRead = shouldRead(isNaN("no_nans"));
+    if (format == FileFormat.ORC) {
+      Assert.assertFalse("Should skip: no nans will not contain nan value", shouldRead);
+    } else {
+      Assert.assertTrue("Should read: NaN counts are not tracked in Parquet metrics", shouldRead);

Review Comment:
   Why this PR will affect the parquet test case ? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] pvary commented on pull request #4734: ORC: Upgrade orc version to use the built-in estimate memory method

Posted by GitBox <gi...@apache.org>.

pvary commented on PR #4734:
URL: https://github.com/apache/iceberg/pull/4734#issuecomment-1142287028

   > > > Do we shade in the ORC Version? Just checking because this could potentially break folks who are using the older versions of ORC at runtime right?
   > > 
   > > 
   > > I think we do not shade things. We had to do that in Hive code to avoid clashing versions. We also struggle with Parquet and Avro. It might worth considering shading our own versions of file formats. But it is for another PR.
   > 
   > We do shade parquet. If you open up a spark repl with the iceberg-spark-runtime it’s under `org.apache.iceberg.relocated.org.apache.parquet` I believe. If the others aren’t there (which I believe they are), we should look into adding them.
   
   Thanks for the pointer @kbendick. So the issue is solved for all the runtime jars (found the same code for Flink and Hive too). So it is just the direct users of the core jars who has to do that themselves.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] kbendick commented on a diff in pull request #4734: ORC: Upgrade orc version to use the built-in estimate memory method

Posted by GitBox <gi...@apache.org>.

kbendick commented on code in PR #4734:
URL: https://github.com/apache/iceberg/pull/4734#discussion_r868332048


##########
data/src/test/java/org/apache/iceberg/data/TestMetricsRowGroupFilter.java:
##########
@@ -311,6 +313,14 @@ public void testIsNaN() {
     Assert.assertFalse("Should skip: all null column will not contain nan value", shouldRead);
   }
 
+  @Test
+  public void testIsNaNForORC() {
+    Assume.assumeFalse("Not for Parquet files", format == FileFormat.PARQUET);

Review Comment:
   Same note about the `Assume` statement.
   
   I would just put `Assume.assumeTrue(format == FileFormat.ORC)` unless you think that Avro should also be included here.
   
   Many of the `Assume` statements don't seem to give a reason phrase at all.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] kbendick commented on pull request #4734: ORC: Upgrade orc version to use the built-in estimate memory method

Posted by GitBox <gi...@apache.org>.

kbendick commented on PR #4734:
URL: https://github.com/apache/iceberg/pull/4734#issuecomment-1136419077

   Given that the tests were also updated in #4573, I think that this is good to go. Rerequesting review from @dongjoon-hyun (to clear the impression that further changes are expected). Please correct me if I'm wrong.
   
   Corresponding test adjustment: https://github.com/apache/iceberg/blob/04871b79e9aba02e9534498fa035b627bcb9af6e/data/src/test/java/org/apache/iceberg/data/TestMetricsRowGroupFilter.java#L299-L321


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] openinx commented on pull request #4734: ORC: Upgrade orc version to use the built-in estimate memory method

Posted by GitBox <gi...@apache.org>.

openinx commented on PR #4734:
URL: https://github.com/apache/iceberg/pull/4734#issuecomment-1142255290

   Looks good to me !  Thanks all for the reviewing !


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] pvary commented on pull request #4734: ORC: Upgrade orc version to use the built-in estimate memory method

Posted by GitBox <gi...@apache.org>.

pvary commented on PR #4734:
URL: https://github.com/apache/iceberg/pull/4734#issuecomment-1142261908

   > Do we shade in the ORC Version? Just checking because this could potentially break folks who are using the older versions of ORC at runtime right?
   
   I think we do not shade things. We had to do that in Hive code to avoid clashing versions. We also struggle with Parquet and Avro. It might worth considering shading our own versions of file formats. But it is for another PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] kbendick commented on a diff in pull request #4734: ORC: Upgrade orc version to use the built-in estimate memory method

Posted by GitBox <gi...@apache.org>.

kbendick commented on code in PR #4734:
URL: https://github.com/apache/iceberg/pull/4734#discussion_r868330842


##########
data/src/test/java/org/apache/iceberg/data/TestMetricsRowGroupFilter.java:
##########
@@ -297,7 +297,9 @@ public void testNoNulls() {
   }
 
   @Test
-  public void testIsNaN() {
+  public void testIsNaNForParquet() {

Review Comment:
   Nit: You don't need to change the test name, because the `Assume` will take care of it and signal that in the tests.
   
   
   For people that maintain forks and cherry-pick individual changes, updating the test name will probably cause them more of headache than is worth it.
   
   Also, I would consider using `Assume.assumeTrue("Parquet metrics do not track NaN", format == FileFormat.PARQUET)` unless this should also possibly run someday for Avro.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] kbendick commented on a diff in pull request #4734: ORC: Upgrade orc version to use the built-in estimate memory method

Posted by GitBox <gi...@apache.org>.

kbendick commented on code in PR #4734:
URL: https://github.com/apache/iceberg/pull/4734#discussion_r868338155


##########
data/src/test/java/org/apache/iceberg/data/TestMetricsRowGroupFilter.java:
##########
@@ -297,7 +297,9 @@ public void testNoNulls() {
   }
 
   @Test
-  public void testIsNaN() {
+  public void testIsNaNForParquet() {

Review Comment:
   NIT: My suggestion, to avoid modifying the name / making two tests that are named specifically for their file format, that you consider making two separate functions and  calling out to each situationally.
   
   But I'll let others comment on that as some people might prefer two separate named tests for simplicity. So just update the names to `testIsNanOrc` and `testIsNanParquet` at least.
   
   If you do decide to split it up, here's what I was thinking:
   
   ```java
   public void testIsNaN() {
     Assume.assumeTrue("Avro files do not have row group statistics", format != FileFormat.AVRO);
     
     if (format == FileFormat.ORC) {
       testIsNanForOrc();
     } else {
       testIsNanForParquet();
     }
   }
   
   private void testIsNanForOrc() { .... }
   
   private void testIsNanForParquet() { ... }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] kbendick commented on a diff in pull request #4734: ORC: Upgrade orc version to use the built-in estimate memory method

Posted by GitBox <gi...@apache.org>.

kbendick commented on code in PR #4734:
URL: https://github.com/apache/iceberg/pull/4734#discussion_r868332048


##########
data/src/test/java/org/apache/iceberg/data/TestMetricsRowGroupFilter.java:
##########
@@ -311,6 +313,14 @@ public void testIsNaN() {
     Assert.assertFalse("Should skip: all null column will not contain nan value", shouldRead);
   }
 
+  @Test
+  public void testIsNaNForORC() {
+    Assume.assumeFalse("Not for Parquet files", format == FileFormat.PARQUET);

Review Comment:
   Same note about the `Assume` statement.
   
   I would just put `Assume.assumeTrue(format == FileFormat.ORC)` unless you think that Avro should also be included here.
   
   Many of the `Assume` statements don't seem to give a reason phrase at all, but if you do want to include one, I'd give a specific reason why (e.g. `Assume.assumeTrue("ORC files track NaN counts in their statistics", format == FileFormat.ORC);`).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] hililiwei commented on a diff in pull request #4734: ORC: Upgrade orc version to use the built-in estimate memory method

Posted by GitBox <gi...@apache.org>.

hililiwei commented on code in PR #4734:
URL: https://github.com/apache/iceberg/pull/4734#discussion_r870086123


##########
data/src/test/java/org/apache/iceberg/data/TestMetricsRowGroupFilter.java:
##########
@@ -298,17 +298,32 @@ public void testNoNulls() {
 
   @Test
   public void testIsNaN() {
+    Assume.assumeTrue("Avro files do not have row group statistics", format != FileFormat.AVRO);
+
     boolean shouldRead = shouldRead(isNaN("all_nans"));
-    Assert.assertTrue("Should read: NaN counts are not tracked in Parquet metrics", shouldRead);
+    Assert.assertTrue(String.format("Should read: NaN counts are not tracked in %s metrics", format), shouldRead);
 
     shouldRead = shouldRead(isNaN("some_nans"));
-    Assert.assertTrue("Should read: NaN counts are not tracked in Parquet metrics", shouldRead);
-
-    shouldRead = shouldRead(isNaN("no_nans"));
-    Assert.assertTrue("Should read: NaN counts are not tracked in Parquet metrics", shouldRead);
+    Assert.assertTrue(String.format("Should read: NaN counts are not tracked in %s metrics", format), shouldRead);
 
     shouldRead = shouldRead(isNaN("all_nulls"));
     Assert.assertFalse("Should skip: all null column will not contain nan value", shouldRead);
+
+    if (format == FileFormat.ORC) {
+      testIsNanOrc();
+    } else {
+      testIsNanParquet();
+    }
+  }
+
+  private void testIsNanParquet() {
+    boolean shouldRead = shouldRead(isNaN("no_nans"));
+    Assert.assertTrue("Should read: NaN counts are not tracked in Parquet metrics", shouldRead);
+  }
+
+  private void testIsNanOrc() {
+    boolean shouldRead = shouldRead(isNaN("no_nans"));
+    Assert.assertFalse("Should skip: no nans will not contain nan value", shouldRead);

Review Comment:
   > Good work @hililiwei! I like this a lot more.
   > 
   > I think we might be ok without the `testIsNan${format}` helper methods. We can just inline within the test as most of the logic is shared.
   > 
   
   good!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] hililiwei commented on pull request #4734: ORC: Upgrade orc version to use the built-in estimate memory method

Posted by GitBox <gi...@apache.org>.

hililiwei commented on PR #4734:
URL: https://github.com/apache/iceberg/pull/4734#issuecomment-1121780430

   Hi @kbendick, thank you for your timely review.
   In the latest revision, I adopted the split scheme for subsequent review.
   1. Keep the original method for the common part of the test.
   2. separated the different tests(for no_nans) into two methods, one for ORC and one for Parquet, named `testIsNanOrc` and `testIsNanParquet`. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] hililiwei commented on pull request #4734: ORC: Upgrade orc version to use the built-in estimate memory method

Posted by GitBox <gi...@apache.org>.

hililiwei commented on PR #4734:
URL: https://github.com/apache/iceberg/pull/4734#issuecomment-1127860219

   > Merged the Version bump #4573 here, The difference in the test case is that previously assumed all file formats would not be able to skip the row group, now with the new ORC version ORC can skip the row group.
   
   updated based on #4578


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] kbendick commented on pull request #4734: ORC: Upgrade orc version to use the built-in estimate memory method

Posted by GitBox <gi...@apache.org>.

kbendick commented on PR #4734:
URL: https://github.com/apache/iceberg/pull/4734#issuecomment-1136411800

   Hi @hililiwei! Can you update the PR description, now that this PR is not updating the ORC version?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] kbendick commented on a diff in pull request #4734: ORC: Upgrade orc version to use the built-in estimate memory method

Posted by GitBox <gi...@apache.org>.

kbendick commented on code in PR #4734:
URL: https://github.com/apache/iceberg/pull/4734#discussion_r868338155


##########
data/src/test/java/org/apache/iceberg/data/TestMetricsRowGroupFilter.java:
##########
@@ -297,7 +297,9 @@ public void testNoNulls() {
   }
 
   @Test
-  public void testIsNaN() {
+  public void testIsNaNForParquet() {

Review Comment:
   EDIT - I see now why the name was changed.
   
   My suggestion, to avoid modifying the name / making two tests that are named specifically for their file format, that you consider making two separate functions and  calling out to each situationally.
   
   Like
   ```java
   public void testIsNaN() {
     Assume.assumeTrue("Avro files do not have row group statistics", format != FileFormat.AVRO);
     
     if (format == FileFormat.ORC) {
       testIsNanForOrc();
     } else {
       testIsNanForParquet();
     }
   }
   
   private void testIsNanForOrc() { .... }
   
   private void testIsNanForParquet() { ... }
   ```
   
   This way, the same test (by name) will be run for each, but the correct assertions will be handled.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] kbendick commented on a diff in pull request #4734: ORC: Upgrade orc version to use the built-in estimate memory method

Posted by GitBox <gi...@apache.org>.

kbendick commented on code in PR #4734:
URL: https://github.com/apache/iceberg/pull/4734#discussion_r868341986


##########
data/src/test/java/org/apache/iceberg/data/TestMetricsRowGroupFilter.java:
##########
@@ -311,6 +313,14 @@ public void testIsNaN() {
     Assert.assertFalse("Should skip: all null column will not contain nan value", shouldRead);
   }
 
+  @Test
+  public void testIsNaNForORC() {
+    Assume.assumeFalse("Not for Parquet files", format == FileFormat.PARQUET);

Review Comment:
   Also, I think some of the tests from above can be repeated for each file format.
   
   Like the `all_nulls` etc. Can you please add all of those statements for both file formats? That is, add an assertion on reading `all_nans`, `some_nans`, `no_nans`, and `all_nulls` in the ORC tests. Right now, the ORC test is only checking `no_nans` but we should verify the correct behavior for all of those with ORC as well.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] kbendick commented on a diff in pull request #4734: ORC: Upgrade orc version to use the built-in estimate memory method

Posted by GitBox <gi...@apache.org>.

kbendick commented on code in PR #4734:
URL: https://github.com/apache/iceberg/pull/4734#discussion_r869503263


##########
data/src/test/java/org/apache/iceberg/data/TestMetricsRowGroupFilter.java:
##########
@@ -298,17 +298,32 @@ public void testNoNulls() {
 
   @Test
   public void testIsNaN() {
+    Assume.assumeTrue("Avro files do not have row group statistics", format != FileFormat.AVRO);
+
     boolean shouldRead = shouldRead(isNaN("all_nans"));
-    Assert.assertTrue("Should read: NaN counts are not tracked in Parquet metrics", shouldRead);
+    Assert.assertTrue(String.format("Should read: NaN counts are not tracked in %s metrics", format), shouldRead);
 
     shouldRead = shouldRead(isNaN("some_nans"));
-    Assert.assertTrue("Should read: NaN counts are not tracked in Parquet metrics", shouldRead);
-
-    shouldRead = shouldRead(isNaN("no_nans"));
-    Assert.assertTrue("Should read: NaN counts are not tracked in Parquet metrics", shouldRead);
+    Assert.assertTrue(String.format("Should read: NaN counts are not tracked in %s metrics", format), shouldRead);
 
     shouldRead = shouldRead(isNaN("all_nulls"));
     Assert.assertFalse("Should skip: all null column will not contain nan value", shouldRead);
+
+    if (format == FileFormat.ORC) {
+      testIsNanOrc();
+    } else {
+      testIsNanParquet();
+    }
+  }
+
+  private void testIsNanParquet() {
+    boolean shouldRead = shouldRead(isNaN("no_nans"));
+    Assert.assertTrue("Should read: NaN counts are not tracked in Parquet metrics", shouldRead);
+  }
+
+  private void testIsNanOrc() {
+    boolean shouldRead = shouldRead(isNaN("no_nans"));
+    Assert.assertFalse("Should skip: no nans will not contain nan value", shouldRead);

Review Comment:
   Good work @hililiwei! I like this a lot more.
   
   I think we might be ok without the `testIsNan${format}` helper methods. We can just inline within the test as most of the logic is shared.
   
   So something like the following, now that I see that so much of the logic can be shared.
   ```java
   shouldRead = shouldRead(isNaN("no_nans"));
   if (format == FileFormat.ORC) {
     Assert.assertFalse("Should skip: no nans will not contain nan value", shouldRead);
   } else {
     Assert.assertTrue("Should read: NaN counts are not tracked in Parquet metrics", shouldRead);
   }
   ``` 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] hililiwei commented on pull request #4734: ORC: Upgrade orc version to use the built-in estimate memory method

Posted by GitBox <gi...@apache.org>.

hililiwei commented on PR #4734:
URL: https://github.com/apache/iceberg/pull/4734#issuecomment-1141852503

   cc @openinx  @RussellSpitzer 😄 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on pull request #4734: ORC: Upgrade orc version to use the built-in estimate memory method

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on PR #4734:
URL: https://github.com/apache/iceberg/pull/4734#issuecomment-1127673426

   Merged the Version bump https://github.com/apache/iceberg/pull/4573 here, The difference in the test case is that previously assumed all file formats would not be able to skip the row group, now with the new ORC version ORC can skip the row group.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] hililiwei commented on a diff in pull request #4734: ORC: Upgrade orc version to use the built-in estimate memory method

Posted by GitBox <gi...@apache.org>.

hililiwei commented on code in PR #4734:
URL: https://github.com/apache/iceberg/pull/4734#discussion_r870081657


##########
data/src/test/java/org/apache/iceberg/data/TestMetricsRowGroupFilter.java:
##########
@@ -298,17 +298,32 @@ public void testNoNulls() {
 
   @Test
   public void testIsNaN() {
+    Assume.assumeTrue("Avro files do not have row group statistics", format != FileFormat.AVRO);
+
     boolean shouldRead = shouldRead(isNaN("all_nans"));
-    Assert.assertTrue("Should read: NaN counts are not tracked in Parquet metrics", shouldRead);
+    Assert.assertTrue(String.format("Should read: NaN counts are not tracked in %s metrics", format), shouldRead);
 
     shouldRead = shouldRead(isNaN("some_nans"));
-    Assert.assertTrue("Should read: NaN counts are not tracked in Parquet metrics", shouldRead);
-
-    shouldRead = shouldRead(isNaN("no_nans"));
-    Assert.assertTrue("Should read: NaN counts are not tracked in Parquet metrics", shouldRead);
+    Assert.assertTrue(String.format("Should read: NaN counts are not tracked in %s metrics", format), shouldRead);
 
     shouldRead = shouldRead(isNaN("all_nulls"));
     Assert.assertFalse("Should skip: all null column will not contain nan value", shouldRead);
+
+    if (format == FileFormat.ORC) {
+      testIsNanOrc();
+    } else {
+      testIsNanParquet();
+    }
+  }
+
+  private void testIsNanParquet() {
+    boolean shouldRead = shouldRead(isNaN("no_nans"));
+    Assert.assertTrue("Should read: NaN counts are not tracked in Parquet metrics", shouldRead);
+  }
+
+  private void testIsNanOrc() {
+    boolean shouldRead = shouldRead(isNaN("no_nans"));
+    Assert.assertFalse("Should skip: no nans will not contain nan value", shouldRead);

Review Comment:
   There seems to be only hasNull flag in ORC statistics used to better answer ‘IS NULL’. I ran a demo and didn't seem to find anything about `nan`. In `all nans` column, the `hasNull` is still `false`, and in `all nulls` , it is `true`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org