You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2023/01/03 16:34:41 UTC

[GitHub] [iceberg] RussellSpitzer opened a new issue, #6516: Parquet: Metric Row Group Filter handles Undefined Min/Max incorrectly Missing Rows

RussellSpitzer opened a new issue, #6516:
URL: https://github.com/apache/iceberg/issues/6516

   ### Apache Iceberg version
   
   1.1.0 (latest release)
   
   ### Query engine
   
   Spark
   
   ### Please describe the bug 🐞
   
   Summary:
   This effects files where the parquet statistics mark either the min or max of a given column to NaN with Float or Double type (I have not checked other types) which will have their row groups skipped incorrectly. 
   
   In this case our evaluators are handling undefined min/max statistics incorrectly. When one of these two stats is **NaN** parquet will set hasNonNull equal to false.
   
   https://github.com/apache/parquet-mr/blob/433de8df33fcf31927f7b51456be9f53e64d48b9/parquet-column/src/main/java/org/apache/parquet/column/statistics/Statistics.java#L89-L92
   
   ```java
           // Drop min/max values in case of NaN as the sorting order of values is undefined for this case
           if (min.isNaN() || max.isNaN()) {
             stats.setMinMax(0.0f, 0.0f);
             ((Statistics<?>) stats).hasNonNullValue = false;
   ```
   
   Unfortunately when evaluating `gt` we check whether this is "false" and return "rows won't match" if it is. 
   
   See
   https://github.com/apache/iceberg/blob/cf00f6a06b256e9c4defe226b6a37aa83c40f561/parquet/src/main/java/org/apache/iceberg/parquet/ParquetMetricsRowGroupFilter.java#L279-L281
   
   I'm unsure what the intent in the Parquet code is here and it seems to be a bug to me in the parquet library. I'm not sure why having a NaN would mean that the statistics should report there are no non-null values. According to the Java Doc in the parquet library 
   
   ```java
     /**
      * Returns whether there have been non-null values added to this statistics
      *
      * @return true if the values contained at least one non-null value
      */
   ```
   
   Which implies to me that this should not be false if we added a NaN.
   
   While we probably should try to fix this on the Parquet Side I think we can fix this on the iceberg side by not using the "hasNonNull" in our metric checks if the value count is > 0? 
   
   Thanks John Pugliesi for reporting this and providing a repo 🙇 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue closed issue #6516: Parquet: Metric Row Group Filter handles Undefined Min/Max incorrectly Missing Rows

Posted by GitBox <gi...@apache.org>.
rdblue closed issue #6516: Parquet: Metric Row Group Filter handles Undefined Min/Max incorrectly Missing Rows
URL: https://github.com/apache/iceberg/issues/6516


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] zhongyujiang commented on issue #6516: Parquet: Metric Row Group Filter handles Undefined Min/Max incorrectly Missing Rows

Posted by GitBox <gi...@apache.org>.
zhongyujiang commented on issue #6516:
URL: https://github.com/apache/iceberg/issues/6516#issuecomment-1370862183

   I encountered the same problem too.
   > I'm not sure why having a NaN would mean that the statistics should report there are no non-null values
   
   I had the same doubt, after digged a bit more, I think the doc here is trying to say "whether there have been non-null values added to this statsistics", not "whether there is a non-null value in the column chunk". If I understand correctly,   statistics become unreliable when NaN values are present, so parquet will just discard the statistics that have been added and set hasNonNullValue to false.
   
   So I think the root cause of this is that ParquetMetricRowGroupFilter mistakenly used `hasNonNull()` to determine whether there is a non-null value in the column chunk. As above, this method cannot be used for such purpose. I think we can only conclude that there is no non-null value in the column chunk when 
   `Statistics#getNumNulls() = ColumnChunkMetadata#getValueCount()`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org