Posted to commits@parquet.apache.org by ap...@apache.org on 2021/04/22 11:15:16 UTC
[parquet-format] branch master updated: PARQUET-2016: Reference column_order field from column indexes (#173)
This is an automated email from the ASF dual-hosted git repository.
apitrou pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/parquet-format.git
The following commit(s) were added to refs/heads/master by this push:
new 473a3a7 PARQUET-2016: Reference column_order field from column indexes (#173)
473a3a7 is described below
commit 473a3a7710f992b01af79095757d71e1fc68ef62
Author: Gabor Szadovszky <ga...@apache.org>
AuthorDate: Thu Apr 22 13:14:34 2021 +0200
PARQUET-2016: Reference column_order field from column indexes (#173)
---
PageIndex.md | 3 +++
src/main/thrift/parquet.thrift | 36 ++++++++++++++++++++----------------
2 files changed, 23 insertions(+), 16 deletions(-)
diff --git a/PageIndex.md b/PageIndex.md
index 551ef0c..96f7a47 100644
--- a/PageIndex.md
+++ b/PageIndex.md
@@ -96,3 +96,6 @@ For range scans, this approach can be extended to return ranges of rows, page
indices, and page offsets to scan in each column. The reader can then
initialize a scanner for each column and fast forward them to the start row of
the scan.
+
+The `min_values` and `max_values` are calculated based on the `column_orders`
+field in the `FileMetaData` struct of the footer.
diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift
index c6eeea9..05690e4 100644
--- a/src/main/thrift/parquet.thrift
+++ b/src/main/thrift/parquet.thrift
@@ -941,13 +941,14 @@ struct ColumnIndex {
1: required list<bool> null_pages
/**
- * Two lists containing lower and upper bounds for the values of each page.
- * These may be the actual minimum and maximum values found on a page, but
- * can also be (more compact) values that do not exist on a page. For
- * example, instead of storing ""Blart Versenwald III", a writer may set
- * min_values[i]="B", max_values[i]="C". Such more compact values must still
- * be valid values within the column's logical type. Readers must make sure
- * that list entries are populated before using them by inspecting null_pages.
+ * Two lists containing lower and upper bounds for the values of each page
+ * determined by the ColumnOrder of the column. These may be the actual
+ * minimum and maximum values found on a page, but can also be (more compact)
+ * values that do not exist on a page. For example, instead of storing ""Blart
+ * Versenwald III", a writer may set min_values[i]="B", max_values[i]="C".
+ * Such more compact values must still be valid values within the column's
+ * logical type. Readers must make sure that list entries are populated before
+ * using them by inspecting null_pages.
*/
2: required list<binary> min_values
3: required list<binary> max_values
@@ -1024,17 +1025,20 @@ struct FileMetaData {
6: optional string created_by
/**
- * Sort order used for the min_value and max_value fields of each column in
- * this file. Sort orders are listed in the order matching the columns in the
- * schema. The indexes are not necessary the same though, because only leaf
- * nodes of the schema are represented in the list of sort orders.
+ * Sort order used for the min_value and max_value fields in the Statistics
+ * objects and the min_values and max_values fields in the ColumnIndex
+ * objects of each column in this file. Sort orders are listed in the order
+ * matching the columns in the schema. The indexes are not necessary the same
+ * though, because only leaf nodes of the schema are represented in the list
+ * of sort orders.
*
- * Without column_orders, the meaning of the min_value and max_value fields is
- * undefined. To ensure well-defined behaviour, if min_value and max_value are
- * written to a Parquet file, column_orders must be written as well.
+ * Without column_orders, the meaning of the min_value and max_value fields
+ * in the Statistics object and the ColumnIndex object is undefined. To ensure
+ * well-defined behaviour, if these fields are written to a Parquet file,
+ * column_orders must be written as well.
*
- * The obsolete min and max fields are always sorted by signed comparison
- * regardless of column_orders.
+ * The obsolete min and max fields in the Statistics object are always sorted
+ * by signed comparison regardless of column_orders.
*/
7: optional list<ColumnOrder> column_orders;
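
Note on the column_orders comment above: it says the list follows schema order but that the indexes can differ, because only leaf nodes get an entry. The sketch below illustrates that mapping under stated assumptions. The schema is modeled as a depth-first list of hypothetical (name, num_children) tuples standing in for the Thrift SchemaElement entries, and the helper name leaf_schema_indexes is made up for illustration; it is not part of any Parquet library.

```python
from typing import List, Tuple

# Hypothetical stand-in for the flattened, depth-first schema list in the
# footer: each entry is (name, num_children). A leaf has no children.
Schema = List[Tuple[str, int]]


def leaf_schema_indexes(schema: Schema) -> List[int]:
    """Return the schema positions of the leaf nodes, in schema order."""
    return [i for i, (_, num_children) in enumerate(schema) if num_children == 0]


# Example schema (an assumption for illustration):
#   root { id, address { street, city }, name }
schema: Schema = [
    ("root", 3),
    ("id", 0),
    ("address", 2),
    ("street", 0),
    ("city", 0),
    ("name", 0),
]

leaves = leaf_schema_indexes(schema)

# column_orders[k] describes the k-th leaf, i.e. schema[leaves[k]],
# so the position in column_orders differs from the position in the schema.
order_index_to_column = {k: schema[i][0] for k, i in enumerate(leaves)}

for k, name in order_index_to_column.items():
    print(f"column_orders[{k}] -> {name} (schema index {leaves[k]})")
```

In this example "street" sits at schema index 3 but is described by the second entry of column_orders, since the root and the "address" group have no entry of their own.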