Posted to commits@parquet.apache.org by ap...@apache.org on 2022/12/07 07:55:00 UTC
[parquet-format] branch master updated: PARQUET-1222: [Format] Add details about sort order to README.md (#185)
This is an automated email from the ASF dual-hosted git repository.
apitrou pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/parquet-format.git
The following commit(s) were added to refs/heads/master by this push:
new 92ae9a3 PARQUET-1222: [Format] Add details about sort order to README.md (#185)
92ae9a3 is described below
commit 92ae9a3187d7673c9a40f81f40886faa20807722
Author: emkornfield <em...@gmail.com>
AuthorDate: Tue Dec 6 23:54:55 2022 -0800
PARQUET-1222: [Format] Add details about sort order to README.md (#185)
This adds details about primitive sort order to the specification docs.
See JIRA for discussion.
---
README.md | 40 ++++++++++++++++++++++++++++++++++++++--
src/main/thrift/parquet.thrift | 7 +++++++
2 files changed, 45 insertions(+), 2 deletions(-)
diff --git a/README.md b/README.md
index f5478c8..99b0546 100644
--- a/README.md
+++ b/README.md
@@ -81,7 +81,7 @@ more pages.
- Encoding/Compression - Page
## File format
-This file and the [thrift definition](src/main/thrift/parquet.thrift) should be read together to understand the format.
+This file and the [Thrift definition](src/main/thrift/parquet.thrift) should be read together to understand the format.
4-byte magic number "PAR1"
<Column 1 Chunk 1 + Column Metadata>
@@ -104,7 +104,7 @@ This file and the [thrift definition](src/main/thrift/parquet.thrift) should be
In the above example, there are N columns in this table, split into M row
groups. The file metadata contains the locations of all the column metadata
start locations. More details on what is contained in the metadata can be found
-in the thrift definition.
+in the Thrift definition.
Metadata is written after the data to allow for single pass writing.
@@ -144,6 +144,42 @@ documented in [LogicalTypes.md][logical-types].
[logical-types]: LogicalTypes.md
+### Sort Order
+
+Parquet stores min/max statistics at several levels (such as Column Chunk,
+Column Index and Data Page). Comparison for values of a type obey the
+following rules:
+
+1. Each logical type has a specified comparison order. If a column is
+ annotated with an unknown logical type, statistics may not be used
+ for pruning data. The sort order for logical types is documented in
+ the [LogicalTypes.md][logical-types] page.
+2. For primitive types, the following rules apply:
+
+ * BOOLEAN - false, true
+ * INT32, INT64 - Signed comparison.
+ * FLOAT, DOUBLE - Signed comparison with special handling of NaNs and
+ signed zeros. The details are documented in the
+ [Thrift definition](src/main/thrift/parquet.thrift) in the
+ `ColumnOrder` union. They are summarized here but the Thrift definition
+ is considered authoritative:
+ * NaNs should not be written to min or max statistics fields.
+ * If the computed max value is zero (whether negative or positive),
+ `+0.0` should be written into the max statistics field.
+ * If the computed min value is zero (whether negative or positive),
+ `-0.0` should be written into the min statistics field.
+
+ For backwards compatibility when reading files:
+ * If the min is a NaN, it should be ignored.
+ * If the max is a NaN, it should be ignored.
+ * If the min is +0, the row group may contain -0 values as well.
+ * If the max is -0, the row group may contain +0 values as well.
+ * When looking for NaN values, min and max should be ignored.
+
+ * BYTE_ARRAY and FIXED_LEN_BYTE_ARRAY - Lexicographic unsigned byte-wise
+ comparison.
+
+
## Nested Encoding
To encode nested columns, Parquet uses the Dremel encoding with definition and
repetition levels. Definition levels specify how many optional fields in the
diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift
index 81a7cf8..d602c68 100644
--- a/src/main/thrift/parquet.thrift
+++ b/src/main/thrift/parquet.thrift
@@ -902,6 +902,13 @@ union ColumnOrder {
* - If the min is +0, the row group may contain -0 values as well.
* - If the max is -0, the row group may contain +0 values as well.
* - When looking for NaN values, min and max should be ignored.
+ *
+ * When writing statistics the following rules should be followed:
+ * - NaNs should not be written to min or max statistics fields.
+ * - If the computed max value is zero (whether negative or positive),
+ * `+0.0` should be written into the max statistics field.
+ * - If the computed min value is zero (whether negative or positive),
+ * `-0.0` should be written into the min statistics field.
*/
1: TypeDefinedOrder TYPE_ORDER;
}
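
The writer and reader rules for FLOAT and DOUBLE statistics added by this commit can be sketched in Python as follows. This is only an illustration under stated assumptions, not code from any Parquet implementation: the function names `float_min_max_stats` and `may_contain` are hypothetical, and returning `None` for "no statistics written" is a simplification made for this example.

```python
import math
from typing import Iterable, Optional, Tuple


def float_min_max_stats(values: Iterable[float]) -> Optional[Tuple[float, float]]:
    """Hypothetical writer helper: return the (min, max) pair to write.

    Returns None when every value is NaN (or there are no values), since
    NaNs should not be written to the min or max statistics fields.
    """
    lo: Optional[float] = None
    hi: Optional[float] = None
    for v in values:
        if math.isnan(v):
            continue  # NaNs never take part in min/max
        if lo is None or v < lo:
            lo = v
        if hi is None or v > hi:
            hi = v
    if lo is None or hi is None:
        return None
    # -0.0 == 0.0 under IEEE 754 comparison, so these tests catch both zeros.
    if lo == 0.0:
        lo = -0.0  # a zero min is always written as -0.0
    if hi == 0.0:
        hi = 0.0   # a zero max is always written as +0.0
    return lo, hi


def may_contain(stats: Optional[Tuple[float, float]], target: float) -> bool:
    """Hypothetical reader helper: False only if `target` is provably absent."""
    if stats is None or math.isnan(target):
        return True  # when looking for NaN values, min and max are ignored
    lo, hi = stats
    # A NaN bound is ignored, so it cannot be used to rule anything out.
    if not math.isnan(lo) and target < lo:
        return False
    if not math.isnan(hi) and target > hi:
        return False
    # -0.0 < +0.0 is False, so a min of +0 never excludes -0 (and a max of
    # -0 never excludes +0), matching the backwards-compatibility rules.
    return True
```

For example, `float_min_max_stats([0.0, 0.0])` yields a min whose sign bit is set and a max whose sign bit is clear, and `may_contain((0.0, 5.0), -0.0)` stays `True` even though the recorded min is `+0`.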