You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2020/10/21 02:12:05 UTC

[GitHub] [iceberg] RussellSpitzer opened a new pull request #1638: Fix Reading of Statistics from old Parquet CDH Versions

RussellSpitzer opened a new pull request #1638:
URL: https://github.com/apache/iceberg/pull/1638


   The behvaior of hasNonNull in parquet statistics chagnes based on the
   version of the file that the statistics are read from. For binary columns
   generated by CDH 1.5.0 to 1.7.0, the function returns hasNonNull as false
   even if there are nonNull values in the column. This not an issue for files
   written by OSS Parquet because of a seperate bug which prohibts reading these
   statistics from Parquet earlier than 1.8.0. The fix for that bug (PARQUET-251)
   was backported in to the CDH distribution so it correctly can return statistics
   but not in a format that Iceberg is able to handle correctly.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #1638: Fix Reading of Statistics from old Parquet CDH Versions

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #1638:
URL: https://github.com/apache/iceberg/pull/1638#issuecomment-714028966


   Thanks for the quick fix, @RussellSpitzer!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1638: Fix Reading of Statistics from old Parquet CDH Versions

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1638:
URL: https://github.com/apache/iceberg/pull/1638#discussion_r509795159



##########
File path: parquet/src/test/java/org/apache/iceberg/parquet/TestCDHParquetStatistics.java
##########
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.parquet;
+
+import org.apache.parquet.column.statistics.Statistics;
+import org.junit.Assert;
+import org.junit.Test;
+
+import static org.mockito.Mockito.mock;
+import static org.mockito.Mockito.when;
+
+/**
+ * Tests for Parquet 1.5.0-Stats which cannot be evaluated like later versions of Parquet stats. They are intercepted
+ * by the hasNonNullButNoMinMax function which always returns ROWS_MAY_MATCH
+ */
+public class TestCDHParquetStatistics {
+
+  @Test
+  public void testCDHParquetStatistcs() {
+    Statistics cdhBinaryColumnStats = mock(Statistics.class);
+    when(cdhBinaryColumnStats.getMaxBytes()).thenReturn(null);
+    when(cdhBinaryColumnStats.getMinBytes()).thenReturn(null);
+    when(cdhBinaryColumnStats.getNumNulls()).thenReturn(0L);
+    Assert.assertTrue(ParquetMetricsRowGroupFilter.hasNonNullButNoMinMax(cdhBinaryColumnStats, 50L));

Review comment:
       Yeah, I just want to avoid having a method that assumes something about the context in which it will be called. Probably best to make sure it behaves correctly in all situations. Looks like if num nulls is unknown, -1, this will return that there are non-null values, even if the value count is 0. We might want to fix that.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue merged pull request #1638: Fix Reading of Statistics from old Parquet CDH Versions

Posted by GitBox <gi...@apache.org>.
rdblue merged pull request #1638:
URL: https://github.com/apache/iceberg/pull/1638


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #1638: Fix Reading of Statistics from old Parquet CDH Versions

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #1638:
URL: https://github.com/apache/iceberg/pull/1638#discussion_r509774287



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetMetricsRowGroupFilter.java
##########
@@ -423,4 +451,24 @@ public Boolean or(Boolean leftResult, Boolean rightResult) {
       return (T) conversions.get(id).apply(statistics.genericGetMax());
     }
   }
+
+  /**
+   * Checks against older versions of Parquet statistics which may have a null count but undefined min and max
+   * statistics. Returns true if nonNull values exist in the row group but no further statistics are available.
+   * <p>
+   * We can't use {@code  statistics.hasNonNullValue()} because it is inaccurate with older files and will return
+   * false if min and max are not set.
+   * <p>
+   * This is specifically for 1.5.0-CDH Parquet builds and later which contain the different unusual hasNonNull
+   * behavior. OSS Parquet builds are not effected because PARQUET-251 prohibits the reading of these statistics

Review comment:
       Other types properly report min/max and hasNonNulls works. I'd have to go into the fix that CDH did to figure out for sure why their behavior is this way but I think they just disabled min/max stats to get around the Parquet-251 bug. 
   
   Strings are stored as a binary type. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #1638: Fix Reading of Statistics from old Parquet CDH Versions

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #1638:
URL: https://github.com/apache/iceberg/pull/1638#discussion_r509772880



##########
File path: parquet/src/test/java/org/apache/iceberg/parquet/TestCDHParquetStatistics.java
##########
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.parquet;
+
+import org.apache.parquet.column.statistics.Statistics;
+import org.junit.Assert;
+import org.junit.Test;
+
+import static org.mockito.Mockito.mock;
+import static org.mockito.Mockito.when;
+
+/**
+ * Tests for Parquet 1.5.0-Stats which cannot be evaluated like later versions of Parquet stats. They are intercepted
+ * by the hasNonNullButNoMinMax function which always returns ROWS_MAY_MATCH
+ */
+public class TestCDHParquetStatistics {
+
+  @Test
+  public void testCDHParquetStatistcs() {
+    Statistics cdhBinaryColumnStats = mock(Statistics.class);
+    when(cdhBinaryColumnStats.getMaxBytes()).thenReturn(null);
+    when(cdhBinaryColumnStats.getMinBytes()).thenReturn(null);
+    when(cdhBinaryColumnStats.getNumNulls()).thenReturn(0L);
+    Assert.assertTrue(ParquetMetricsRowGroupFilter.hasNonNullButNoMinMax(cdhBinaryColumnStats, 50L));

Review comment:
       Sure I can add that, techinically since this is in the inner loop of "notEmpty" either nonNull >= 0 (set) or haveNonNulls is true. If hasNoNulls is true then we don't care about this pathway anyway because max/min are set




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1638: Fix Reading of Statistics from old Parquet CDH Versions

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1638:
URL: https://github.com/apache/iceberg/pull/1638#discussion_r509739816



##########
File path: parquet/src/test/java/org/apache/iceberg/parquet/TestCDHParquetStatistics.java
##########
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.parquet;
+
+import org.apache.parquet.column.statistics.Statistics;
+import org.junit.Assert;
+import org.junit.Test;
+
+import static org.mockito.Mockito.mock;
+import static org.mockito.Mockito.when;
+
+/**
+ * Tests for Parquet 1.5.0-Stats which cannot be evaluated like later versions of Parquet stats. They are intercepted
+ * by the hasNonNullButNoMinMax function which always returns ROWS_MAY_MATCH
+ */
+public class TestCDHParquetStatistics {
+
+  @Test
+  public void testCDHParquetStatistcs() {
+    Statistics cdhBinaryColumnStats = mock(Statistics.class);
+    when(cdhBinaryColumnStats.getMaxBytes()).thenReturn(null);
+    when(cdhBinaryColumnStats.getMinBytes()).thenReturn(null);
+    when(cdhBinaryColumnStats.getNumNulls()).thenReturn(0L);
+    Assert.assertTrue(ParquetMetricsRowGroupFilter.hasNonNullButNoMinMax(cdhBinaryColumnStats, 50L));

Review comment:
       Should we also test -1 for num nulls? I know we don't expect it, but the behavior should still be correct, I think.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on pull request #1638: Fix Reading of Statistics from old Parquet CDH Versions

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on pull request #1638:
URL: https://github.com/apache/iceberg/pull/1638#issuecomment-713892121


   @jun-he You may want to took a look as well since you added some of the more recent filtering code?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on pull request #1638: Fix Reading of Statistics from old Parquet CDH Versions

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on pull request #1638:
URL: https://github.com/apache/iceberg/pull/1638#issuecomment-713247464


   Re: @kbendick + @aokolnychyi  + @rdblue 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #1638: Fix Reading of Statistics from old Parquet CDH Versions

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #1638:
URL: https://github.com/apache/iceberg/pull/1638#discussion_r509797819



##########
File path: parquet/src/test/java/org/apache/iceberg/parquet/TestCDHParquetStatistics.java
##########
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.parquet;
+
+import org.apache.parquet.column.statistics.Statistics;
+import org.junit.Assert;
+import org.junit.Test;
+
+import static org.mockito.Mockito.mock;
+import static org.mockito.Mockito.when;
+
+/**
+ * Tests for Parquet 1.5.0-Stats which cannot be evaluated like later versions of Parquet stats. They are intercepted
+ * by the hasNonNullButNoMinMax function which always returns ROWS_MAY_MATCH
+ */
+public class TestCDHParquetStatistics {
+
+  @Test
+  public void testCDHParquetStatistcs() {
+    Statistics cdhBinaryColumnStats = mock(Statistics.class);
+    when(cdhBinaryColumnStats.getMaxBytes()).thenReturn(null);
+    when(cdhBinaryColumnStats.getMinBytes()).thenReturn(null);
+    when(cdhBinaryColumnStats.getNumNulls()).thenReturn(0L);
+    Assert.assertTrue(ParquetMetricsRowGroupFilter.hasNonNullButNoMinMax(cdhBinaryColumnStats, 50L));

Review comment:
       I can add that, but in the usage again this is past the "valueCount ==0" which causes no match in the first return




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1638: Fix Reading of Statistics from old Parquet CDH Versions

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1638:
URL: https://github.com/apache/iceberg/pull/1638#discussion_r509803627



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetMetricsRowGroupFilter.java
##########
@@ -423,4 +451,24 @@ public Boolean or(Boolean leftResult, Boolean rightResult) {
       return (T) conversions.get(id).apply(statistics.genericGetMax());
     }
   }
+
+  /**
+   * Checks against older versions of Parquet statistics which may have a null count but undefined min and max
+   * statistics. Returns true if nonNull values exist in the row group but no further statistics are available.
+   * <p>
+   * We can't use {@code  statistics.hasNonNullValue()} because it is inaccurate with older files and will return
+   * false if min and max are not set.
+   * <p>
+   * This is specifically for 1.5.0-CDH Parquet builds and later which contain the different unusual hasNonNull
+   * behavior. OSS Parquet builds are not effected because PARQUET-251 prohibits the reading of these statistics

Review comment:
       Oh, I think I know. This probably happens when min/max values are so large that Parquet chooses not to keep them in stats, which isn't a problem for fixed-length types. That will happen in newer versions, too, but the stats object will be different.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1638: Fix Reading of Statistics from old Parquet CDH Versions

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1638:
URL: https://github.com/apache/iceberg/pull/1638#discussion_r509740053



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetMetricsRowGroupFilter.java
##########
@@ -423,4 +451,24 @@ public Boolean or(Boolean leftResult, Boolean rightResult) {
       return (T) conversions.get(id).apply(statistics.genericGetMax());
     }
   }
+
+  /**
+   * Checks against older versions of Parquet statistics which may have a null count but undefined min and max
+   * statistics. Returns true if nonNull values exist in the row group but no further statistics are available.
+   * <p>
+   * We can't use {@code  statistics.hasNonNullValue()} because it is inaccurate with older files and will return
+   * false if min and max are not set.
+   * <p>
+   * This is specifically for 1.5.0-CDH Parquet builds and later which contain the different unusual hasNonNull
+   * behavior. OSS Parquet builds are not effected because PARQUET-251 prohibits the reading of these statistics

Review comment:
       Why is this bug limited to binary and strings?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org