Posted to commits@parquet.apache.org by ap...@apache.org on 2022/12/07 07:55:00 UTC

[parquet-format] branch master updated: PARQUET-1222: [Format] Add details about sort order to README.md (#185)

This is an automated email from the ASF dual-hosted git repository.

apitrou pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/parquet-format.git


The following commit(s) were added to refs/heads/master by this push:
     new 92ae9a3  PARQUET-1222: [Format] Add details about sort order to README.md (#185)
92ae9a3 is described below

commit 92ae9a3187d7673c9a40f81f40886faa20807722
Author: emkornfield <em...@gmail.com>
AuthorDate: Tue Dec 6 23:54:55 2022 -0800

    PARQUET-1222: [Format] Add details about sort order to README.md (#185)
    
    This adds details about primitive sort order to the specification docs.
    
    See JIRA for discussion.
---
 README.md                      | 40 ++++++++++++++++++++++++++++++++++++++--
 src/main/thrift/parquet.thrift |  7 +++++++
 2 files changed, 45 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index f5478c8..99b0546 100644
--- a/README.md
+++ b/README.md
@@ -81,7 +81,7 @@ more pages.
   - Encoding/Compression - Page
 
 ## File format
-This file and the [thrift definition](src/main/thrift/parquet.thrift) should be read together to understand the format.
+This file and the [Thrift definition](src/main/thrift/parquet.thrift) should be read together to understand the format.
 
     4-byte magic number "PAR1"
     <Column 1 Chunk 1 + Column Metadata>
@@ -104,7 +104,7 @@ This file and the [thrift definition](src/main/thrift/parquet.thrift) should be
 In the above example, there are N columns in this table, split into M row
 groups.  The file metadata contains the locations of all the column metadata
 start locations.  More details on what is contained in the metadata can be found
-in the thrift definition.
+in the Thrift definition.
 
 Metadata is written after the data to allow for single pass writing.
 
@@ -144,6 +144,42 @@ documented in [LogicalTypes.md][logical-types].
 
 [logical-types]: LogicalTypes.md
 
+### Sort Order
+
+Parquet stores min/max statistics at several levels (such as Column Chunk,
+Column Index and Data Page). Comparisons of values of a type obey the
+following rules:
+
+1.  Each logical type has a specified comparison order. If a column is
+    annotated with an unknown logical type, statistics may not be used
+    for pruning data. The sort order for logical types is documented in
+    the [LogicalTypes.md][logical-types] page.
+2.  For primitive types, the following rules apply:
+
+    * BOOLEAN - false, true
+    * INT32, INT64 - Signed comparison.
+    * FLOAT, DOUBLE - Signed comparison with special handling of NaNs and
+      signed zeros.   The details are documented in the
+      [Thrift definition](src/main/thrift/parquet.thrift) in the
+      `ColumnOrder` union. They are summarized here but the Thrift definition
+      is considered authoritative:
+      * NaNs should not be written to min or max statistics fields.
+      * If the computed max value is zero (whether negative or positive),
+        `+0.0` should be written into the max statistics field.
+      * If the computed min value is zero (whether negative or positive),
+        `-0.0` should be written into the min statistics field.
+
+      For backwards compatibility when reading files:
+      * If the min is a NaN, it should be ignored.
+      * If the max is a NaN, it should be ignored.
+      * If the min is +0, the row group may contain -0 values as well.
+      * If the max is -0, the row group may contain +0 values as well.
+      * When looking for NaN values, min and max should be ignored.
+      
+    * BYTE_ARRAY and FIXED_LEN_BYTE_ARRAY - Lexicographic unsigned byte-wise
+      comparison.
+
+
 ## Nested Encoding
 To encode nested columns, Parquet uses the Dremel encoding with definition and
 repetition levels.  Definition levels specify how many optional fields in the
diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift
index 81a7cf8..d602c68 100644
--- a/src/main/thrift/parquet.thrift
+++ b/src/main/thrift/parquet.thrift
@@ -902,6 +902,13 @@ union ColumnOrder {
    *     - If the min is +0, the row group may contain -0 values as well.
    *     - If the max is -0, the row group may contain +0 values as well.
    *     - When looking for NaN values, min and max should be ignored.
+   * 
+   *     When writing statistics the following rules should be followed:
+   *     - NaNs should not be written to min or max statistics fields.
+   *     - If the computed max value is zero (whether negative or positive),
+   *       `+0.0` should be written into the max statistics field.
+   *     - If the computed min value is zero (whether negative or positive),
+   *       `-0.0` should be written into the min statistics field.
    */
   1: TypeDefinedOrder TYPE_ORDER;
 }
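
The statistics rules this commit documents for FLOAT/DOUBLE and for byte arrays can be illustrated with a short Python sketch. It is not taken from any Parquet implementation: the function names (`float_stats_for_writing`, `bounds_for_reading`, `unsigned_bytes_less`) are hypothetical, and returning `None` when every value is NaN is an assumption, since the text only says that NaNs should not be written to the min or max fields.

```python
import math
from typing import Iterable, Optional, Tuple


def float_stats_for_writing(values: Iterable[float]) -> Optional[Tuple[float, float]]:
    """Compute (min, max) for a FLOAT/DOUBLE column following the writer rules."""
    # NaNs must not end up in the min or max statistics fields.
    non_nan = [v for v in values if not math.isnan(v)]
    if not non_nan:
        # Assumption: with no non-NaN value, write no min/max at all.
        return None
    lo, hi = min(non_nan), max(non_nan)
    # -0.0 == 0.0 in IEEE 754, so one comparison catches both signs of zero.
    if lo == 0.0:
        lo = -0.0  # a zero min is always written as -0.0
    if hi == 0.0:
        hi = 0.0   # a zero max is always written as +0.0
    return lo, hi


def bounds_for_reading(lo: float, hi: float) -> Tuple[Optional[float], Optional[float]]:
    """Interpret stored min/max conservatively, as the reader rules require."""
    # A NaN bound carries no information and is ignored (returned as None).
    lo_out = None if math.isnan(lo) else lo
    hi_out = None if math.isnan(hi) else hi
    # A +0 min may hide -0 values, and a -0 max may hide +0 values,
    # so widen a zero bound to the sign that covers both.
    if lo_out == 0.0:
        lo_out = -0.0
    if hi_out == 0.0:
        hi_out = 0.0
    return lo_out, hi_out


def unsigned_bytes_less(a: bytes, b: bytes) -> bool:
    """Lexicographic unsigned byte-wise comparison for BYTE_ARRAY values."""
    # Python compares bytes objects element by element as integers 0..255,
    # which is exactly an unsigned lexicographic order.
    return a < b
```

Because the reader-side function only widens or drops bounds, a reader looking for NaN values still has to ignore min and max entirely, as the last reader rule states; that case is not something min/max normalization can express.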