You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@parquet.apache.org by "wgtmac (via GitHub)" <gi...@apache.org> on 2023/06/16 09:01:02 UTC

[GitHub] [parquet-format] wgtmac commented on a diff in pull request #196: PARQUET-2249: Add nan_count to handle NaNs in statistics

wgtmac commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1231982021


##########
src/main/thrift/parquet.thrift:
##########
@@ -966,6 +985,23 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
   As `nan_counts` will be set only after this proposal, could we simply deduce a NaN page by checking `null_pages[i] == false && nan_counts[i] > 0 && min_values[i] == NaN && max_values[i] == NaN`? If that is true, we can safely remove definition of `nan_pages` list.



##########
README.md:
##########
@@ -161,21 +161,7 @@ following rules:
     * FLOAT, DOUBLE - Signed comparison with special handling of NaNs and
       signed zeros.   The details are documented in the
       [Thrift definition](src/main/thrift/parquet.thrift) in the
-      `ColumnOrder` union. They are summarized here but the Thrift definition

Review Comment:
   Yes, this looks reasonable.



##########
src/main/thrift/parquet.thrift:
##########
@@ -886,16 +891,25 @@ union ColumnOrder {
    *   FIXED_LEN_BYTE_ARRAY - unsigned byte-wise comparison
    *
    * (*) Because the sorting order is not specified properly for floating
-   *     point values (relations vs. total ordering) the following
-   *     compatibility rules should be applied when reading statistics:
+   *     point values (relations vs. total ordering), the following compatibility
+   *     rules should be applied when reading statistics:
    *     - If the min is a NaN, it should be ignored.
    *     - If the max is a NaN, it should be ignored.
+   *     - If the nan_count field is set, a reader can compute
+   *       nan_count + null_count == num_values to deduce whether all non-NULL
+   *       values are NaN.
+   *     - When looking for NaN values, min and max should be ignored.
+   *       If the nan_count field is set, it can be used to check whether
+   *       NaNs are present.
    *     - If the min is +0, the row group may contain -0 values as well.
    *     - If the max is -0, the row group may contain +0 values as well.
-   *     - When looking for NaN values, min and max should be ignored.
    * 
    *     When writing statistics the following rules should be followed:
-   *     - NaNs should not be written to min or max statistics fields.
+   *     - It is suggested to always set the nan_count fields for FLOAT and
+           DOUBLE columns.
+   *     - NaNs should not be written to min or max statistics fields except

Review Comment:
   I would expect to explicitly state that `NaN value should not be written to min or max fields in the Statistics of DataPageHeader, DataPageHeaderV2 and ColumnMetaData. But it is suggested to write NaN to min_values and max_values fields in the ColumnIndex where a value has to be written in case of a only-NaN page`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org