
Posted to dev@parquet.apache.org by "zhongyujiang (via GitHub)" <gi...@apache.org> on 2023/03/23 13:53:03 UTC

[GitHub] [parquet-format] zhongyujiang commented on pull request #196: PARQUET-2249: Add nan_count to handle NaNs in statistics

zhongyujiang commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1481237476

   > Thus, to solve the problem of only-NaN pages, the comments in the spec are extended to mandate the following behavior:
   > 
   > Once a writer writes the nan_count/nan_counts fields, they have to:
   > - never write NaN into min/max if there are non-NaN non-Null values and
   > - always write min=max=NaN if the only non-null values in a page are NaN
   >
   > A reader observing that nan_count/nan_counts field was written can then rely on that if min or max are NaN, then both have to be NaN and this means that the only non-NULL values are NaN.
   
   Instead of writing min and max as NaN when there are only NaN values, and then having the reader check whether a NaN min/max is credible by evaluating whether naNCounts is empty, wouldn't it be much simpler to leave the evaluation of isNaN and notNaN to the reader?
   A reader can always conclude that a page or column is all NaN when the field's value count equals its NaN count (provided valueCounts and naNCounts both exist). This is how Iceberg currently [evaluates isNaN](https://github.com/apache/iceberg/blob/c07f2aabc0a1d02f068ecf1514d2479c0fbdd3b0/api/src/main/java/org/apache/iceberg/expressions/StrictMetricsEvaluator.java#L486). Is there anything wrong with doing this in Parquet?
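
To make the reader-side check concrete, here is a minimal sketch in Python. It is not parquet-mr or Iceberg code: the function names and the `value_count` / `null_count` / `nan_count` parameters are hypothetical, and subtracting nulls before comparing is my assumption about how the counts would line up in Parquet statistics.

```python
from typing import Optional


def is_all_nan(value_count: Optional[int],
               null_count: Optional[int],
               nan_count: Optional[int]) -> Optional[bool]:
    """Decide from counts alone whether every non-null value is NaN.

    Returns None when a count is missing, meaning the reader cannot tell
    and must not prune on this basis. All parameter names are hypothetical.
    """
    if value_count is None or null_count is None or nan_count is None:
        return None
    non_null = value_count - null_count  # assumption: value_count includes nulls
    return non_null > 0 and nan_count == non_null


def has_no_nan(nan_count: Optional[int]) -> Optional[bool]:
    """notNaN counterpart: True only when the writer recorded zero NaNs."""
    if nan_count is None:
        return None
    return nan_count == 0
```

With this shape the reader never has to interpret a NaN in min/max: `is_all_nan(10, 2, 8)` is `True`, `is_all_nan(10, 0, 3)` is `False`, and a missing `nan_count` yields `None`, so files from older writers are simply treated as unknown.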
   
   
   
   

