You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@parquet.apache.org by "JFinis (via GitHub)" <gi...@apache.org> on 2023/03/23 10:47:36 UTC

[GitHub] [parquet-format] JFinis commented on a diff in pull request #196: PARQUET-2249: Add nan_count to handle NaNs in statistics

JFinis commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1145999358


##########
src/main/thrift/parquet.thrift:
##########
@@ -952,6 +961,9 @@ struct ColumnIndex {
    * Such more compact values must still be valid values within the column's
    * logical type. Readers must make sure that list entries are populated before
    * using them by inspecting null_pages.
+   * For columns of type FLOAT and DOUBLE, NaN values are not to be included

Review Comment:
   Don’t we then have the same problem already for the NaN values stored in the actual columns? We do already serialize NaN to binary values in the columns themselves. There we also do not mandate a specific bit pattern. The spec does define float double to be IEEE compliant:
   ```
      * FLOAT - 4 bytes per value.  IEEE. Stored as little-endian.
      * DOUBLE - 8 bytes per value.  IEEE. Stored as little-endian.
   ```
   So if I see it correctly, any conforming implementation has to be able to handle all NaN bit patterns that IEEE allows. Otherwise they could not read the actual data in the columns.
   
   As you mention Java: Java has a defined way of reading IEEE bits into Java floats: Float.intBitsToFloa (and the respective method for double). Here it is guaranteed that all valid NaN bit patterns produce a Java Nan. From [the documentation](https://docs.oracle.com/javase/7/docs/api/java/lang/Float.html):
   
   > If the argument is any value in the range 0x7f800001 through 0x7fffffff or in the range 0xff800001 through 0xffffffff, the result is a NaN.
   
   This method is used by parquet-mr, so we should be fine here.
   
   So, to generalize, as I see it, the following holds:
   * Parquet defines FLOAT/DOUBLE to be IEEE without further mandating any bit patterns.
     * If a reader cannot handle all NaN bit patterns, they are not conforming to the spec.
     * Also, such reader would already today malfunction, as there can be NaNs in columns already.
   * All prominent programming languages (C++, Java, Python, Go, ...) have IEEE compliant binary to float conversions, so this also sounds like a rather theoretical problem.
   
   I can also change my suggestion to not write NaNs, but that comes with its own challenges, as layed out in my commit message. But I first want to get a common understanding of why NaN bit patterns in bounds are worse than NaN bit patterns in the columns themselves. Maybe I'm missing something here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org