You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2023/01/07 01:24:56 UTC

[GitHub] [iceberg] rdblue commented on a diff in pull request #6517: Parquet: Fixes Incorrect Skipping of RowGroups with NaNs

rdblue commented on code in PR #6517:
URL: https://github.com/apache/iceberg/pull/6517#discussion_r1063931282


##########
parquet/src/main/java/org/apache/iceberg/parquet/ParquetMetricsRowGroupFilter.java:
##########
@@ -560,24 +560,26 @@ private <T> T max(Statistics<?> statistics, int id) {
   }
 
   /**
-   * Checks against older versions of Parquet statistics which may have a null count but undefined
-   * min and max statistics. Returns true if nonNull values exist in the row group but no further
-   * statistics are available.
-   *
-   * <p>We can't use {@code statistics.hasNonNullValue()} because it is inaccurate with older files
-   * and will return false if min and max are not set.
+   * Older versions of Parquet statistics which may have a null count but undefined min and max
+   * statistics. This is similar to the current behavior when NaN values are present.
    *
    * <p>This is specifically for 1.5.0-CDH Parquet builds and later which contain the different
    * unusual hasNonNull behavior. OSS Parquet builds are not effected because PARQUET-251 prohibits
    * the reading of these statistics from versions of Parquet earlier than 1.8.0.
    *
    * @param statistics Statistics to check
-   * @param valueCount Number of values in the row group
-   * @return true if nonNull values exist and no other stats can be used
+   * @return true if min and max statistics are null
    */
-  static boolean hasNonNullButNoMinMax(Statistics statistics, long valueCount) {
-    return statistics.getNumNulls() < valueCount
-        && (statistics.getMaxBytes() == null || statistics.getMinBytes() == null);
+  static boolean nullMinMax(Statistics statistics) {
+    return statistics.getMaxBytes() == null || statistics.getMinBytes() == null;
+  }
+
+  static boolean minMaxUndefined(Statistics statistics) {
+    return (!statistics.isEmpty() && !statistics.hasNonNullValue()) || nullMinMax(statistics);

Review Comment:
   I agree with @szehon-ho's simplification of the logic, and I'm debating whether I prefer that or using `isEmpty`, which is really unclear.
   
   I also think that we may be adding a case to unknown when it is actually known. Since the check becomes `isNumNullSet` and not `hasNonNullValue` because NaN values trigger `hasNonNullValue = false`, are we including cases where `hasNonNullValue` is actually false because the null count is zero? This would be good to make sure tests cover.
   
   Otherwise I think the logic in this PR looks good.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org