Posted to dev@parquet.apache.org by "mapleFU (via GitHub)" <gi...@apache.org> on 2023/06/12 13:59:26 UTC

[GitHub] [parquet-format] mapleFU commented on a diff in pull request #196: PARQUET-2249: Add nan_count to handle NaNs in statistics

mapleFU commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1226712152


##########
src/main/thrift/parquet.thrift:
##########
@@ -886,16 +891,25 @@ union ColumnOrder {
    *   FIXED_LEN_BYTE_ARRAY - unsigned byte-wise comparison
    *
    * (*) Because the sorting order is not specified properly for floating
-   *     point values (relations vs. total ordering) the following
-   *     compatibility rules should be applied when reading statistics:
+   *     point values (relations vs. total ordering), the following compatibility
+   *     rules should be applied when reading statistics:
    *     - If the min is a NaN, it should be ignored.
    *     - If the max is a NaN, it should be ignored.
+   *     - If the nan_count field is set, a reader can compute
+   *       nan_count + null_count == num_values to deduce whether all non-NULL
+   *       values are NaN.
+   *     - When looking for NaN values, min and max should be ignored.
+   *       If the nan_count field is set, it can be used to check whether
+   *       NaNs are present.
    *     - If the min is +0, the row group may contain -0 values as well.
    *     - If the max is -0, the row group may contain +0 values as well.
-   *     - When looking for NaN values, min and max should be ignored.
    * 
    *     When writing statistics the following rules should be followed:
-   *     - NaNs should not be written to min or max statistics fields.
+   *     - It is suggested to always set the nan_count fields for FLOAT and
+           DOUBLE columns.
+   *     - NaNs should not be written to min or max statistics fields except
+   *       in the column index, where a value has to be written incase of

Review Comment:
   ```
   NaNs should not be written to min or max statistics fields except
   in the column index, where a value has to be written incase of
   ```
   
   Does this refer to `nan_pages` and `nan_count` in this patch?
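
For context, the reader-side rules quoted in the hunk above could be sketched roughly as follows. This is only an illustrative Python sketch, not code from parquet-format or any Parquet library: `FloatStats` and the three helper functions are hypothetical names, and the field names simply mirror those mentioned in the diff (`num_values`, `null_count`, `nan_count`, min, max).

```python
import math
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class FloatStats:
    """Hypothetical stand-in for the statistics of one FLOAT/DOUBLE column chunk."""
    num_values: int
    null_count: Optional[int] = None
    nan_count: Optional[int] = None
    min_value: Optional[float] = None
    max_value: Optional[float] = None


def all_non_null_are_nan(s: FloatStats) -> Optional[bool]:
    """Apply nan_count + null_count == num_values; None means "cannot tell"."""
    if s.nan_count is None or s.null_count is None:
        return None
    return s.nan_count + s.null_count == s.num_values


def may_contain_nan(s: FloatStats) -> bool:
    """min and max are ignored when looking for NaNs; only nan_count is used."""
    if s.nan_count is None:
        return True  # field not set: a reader has to assume NaNs may be present
    return s.nan_count > 0


def usable_min_max(s: FloatStats) -> Tuple[Optional[float], Optional[float]]:
    """Drop NaN bounds and widen zero bounds to cover both signed zeros."""
    lo, hi = s.min_value, s.max_value
    if lo is not None and math.isnan(lo):
        lo = None  # a NaN min is ignored
    if hi is not None and math.isnan(hi):
        hi = None  # a NaN max is ignored
    if lo is not None and lo == 0.0:
        lo = -0.0  # min of +0: the data may contain -0 as well
    if hi is not None and hi == 0.0:
        hi = 0.0   # max of -0: the data may contain +0 as well
    return lo, hi
```

Returning `None` from `all_non_null_are_nan` when a count is missing is a choice made for this sketch, so that callers can tell "not all NaN" apart from "unknown".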



##########
README.md:
##########
@@ -161,21 +161,7 @@ following rules:
     * FLOAT, DOUBLE - Signed comparison with special handling of NaNs and
       signed zeros.   The details are documented in the
       [Thrift definition](src/main/thrift/parquet.thrift) in the
-      `ColumnOrder` union. They are summarized here but the Thrift definition

Review Comment:
   So this part is removed from the README and consolidated into `parquet.thrift`?


