You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2023/01/06 03:17:44 UTC

[GitHub] [iceberg] szehon-ho commented on a diff in pull request #6517: Parquet: Fixes Incorrect Skipping of RowGroups with NaNs

szehon-ho commented on code in PR #6517:
URL: https://github.com/apache/iceberg/pull/6517#discussion_r1063063122


##########
parquet/src/main/java/org/apache/iceberg/parquet/ParquetMetricsRowGroupFilter.java:
##########
@@ -560,24 +560,26 @@ private <T> T max(Statistics<?> statistics, int id) {
   }
 
   /**
-   * Checks against older versions of Parquet statistics which may have a null count but undefined
-   * min and max statistics. Returns true if nonNull values exist in the row group but no further
-   * statistics are available.
-   *
-   * <p>We can't use {@code statistics.hasNonNullValue()} because it is inaccurate with older files
-   * and will return false if min and max are not set.
+   * Older versions of Parquet statistics which may have a null count but undefined min and max
+   * statistics. This is similar to the current behavior when NaN values are present.
    *
    * <p>This is specifically for 1.5.0-CDH Parquet builds and later which contain the different
    * unusual hasNonNull behavior. OSS Parquet builds are not effected because PARQUET-251 prohibits
    * the reading of these statistics from versions of Parquet earlier than 1.8.0.
    *
    * @param statistics Statistics to check
-   * @param valueCount Number of values in the row group
-   * @return true if nonNull values exist and no other stats can be used
+   * @return true if min and max statistics are null
    */
-  static boolean hasNonNullButNoMinMax(Statistics statistics, long valueCount) {
-    return statistics.getNumNulls() < valueCount
-        && (statistics.getMaxBytes() == null || statistics.getMinBytes() == null);
+  static boolean nullMinMax(Statistics statistics) {
+    return statistics.getMaxBytes() == null || statistics.getMinBytes() == null;
+  }
+
+  static boolean minMaxUndefined(Statistics statistics) {
+    return (!statistics.isEmpty() && !statistics.hasNonNullValue()) || nullMinMax(statistics);

Review Comment:
   So for my understanding its more like other stats are defined but not min/max?  Its not strictly what the method name says, so maybe worth a javadoc?
   
   Also I spent some time to explore this:
   ```
   (!statistics.isEmpty() && !statistics.hasNonNullValue())
   ```
   I noticed statistics.isEmpty() internally is (!statistics.hasNonNullValue() && !isNumNullsSet), so I think after some math it simplifies down then to : 
   (stats.isNumNullSet && !stats.hasNonNullValue)
   
   It could be clearer this way, although I guess its safer to use isEmpty if statistitcs.isEmpty definition ever changes.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org