Posted to commits@parquet.apache.org by ap...@apache.org on 2021/04/22 11:15:16 UTC

[parquet-format] branch master updated: PARQUET-2016: Reference column_order field from column indexes (#173)

This is an automated email from the ASF dual-hosted git repository.

apitrou pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/parquet-format.git


The following commit(s) were added to refs/heads/master by this push:
     new 473a3a7  PARQUET-2016: Reference column_order field from column indexes (#173)
473a3a7 is described below

commit 473a3a7710f992b01af79095757d71e1fc68ef62
Author: Gabor Szadovszky <ga...@apache.org>
AuthorDate: Thu Apr 22 13:14:34 2021 +0200

    PARQUET-2016: Reference column_order field from column indexes (#173)
---
 PageIndex.md                   |  3 +++
 src/main/thrift/parquet.thrift | 36 ++++++++++++++++++++----------------
 2 files changed, 23 insertions(+), 16 deletions(-)

diff --git a/PageIndex.md b/PageIndex.md
index 551ef0c..96f7a47 100644
--- a/PageIndex.md
+++ b/PageIndex.md
@@ -96,3 +96,6 @@ For range scans, this approach can be extended to return ranges of rows, page
 indices, and page offsets to scan in each column. The reader can then
 initialize a scanner for each column and fast forward them to the start row of
 the scan.
+
+The `min_values` and `max_values` are calculated based on the `column_orders`
+field in the `FileMetaData` struct of the footer.
diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift
index c6eeea9..05690e4 100644
--- a/src/main/thrift/parquet.thrift
+++ b/src/main/thrift/parquet.thrift
@@ -941,13 +941,14 @@ struct ColumnIndex {
   1: required list<bool> null_pages
 
   /**
-   * Two lists containing lower and upper bounds for the values of each page.
-   * These may be the actual minimum and maximum values found on a page, but
-   * can also be (more compact) values that do not exist on a page. For
-   * example, instead of storing ""Blart Versenwald III", a writer may set
-   * min_values[i]="B", max_values[i]="C". Such more compact values must still
-   * be valid values within the column's logical type. Readers must make sure
-   * that list entries are populated before using them by inspecting null_pages.
+   * Two lists containing lower and upper bounds for the values of each page
+   * determined by the ColumnOrder of the column. These may be the actual
+   * minimum and maximum values found on a page, but can also be (more compact)
+   * values that do not exist on a page. For example, instead of storing ""Blart
+   * Versenwald III", a writer may set min_values[i]="B", max_values[i]="C".
+   * Such more compact values must still be valid values within the column's
+   * logical type. Readers must make sure that list entries are populated before
+   * using them by inspecting null_pages.
    */
   2: required list<binary> min_values
   3: required list<binary> max_values
@@ -1024,17 +1025,20 @@ struct FileMetaData {
   6: optional string created_by
 
   /**
-   * Sort order used for the min_value and max_value fields of each column in
-   * this file. Sort orders are listed in the order matching the columns in the
-   * schema. The indexes are not necessary the same though, because only leaf
-   * nodes of the schema are represented in the list of sort orders.
+   * Sort order used for the min_value and max_value fields in the Statistics
+   * objects and the min_values and max_values fields in the ColumnIndex
+   * objects of each column in this file. Sort orders are listed in the order
+   * matching the columns in the schema. The indexes are not necessary the same
+   * though, because only leaf nodes of the schema are represented in the list
+   * of sort orders.
    *
-   * Without column_orders, the meaning of the min_value and max_value fields is
-   * undefined. To ensure well-defined behaviour, if min_value and max_value are
-   * written to a Parquet file, column_orders must be written as well.
+   * Without column_orders, the meaning of the min_value and max_value fields
+   * in the Statistics object and the ColumnIndex object is undefined. To ensure
+   * well-defined behaviour, if these fields are written to a Parquet file,
+   * column_orders must be written as well.
    *
-   * The obsolete min and max fields are always sorted by signed comparison
-   * regardless of column_orders.
+   * The obsolete min and max fields in the Statistics object are always sorted
+   * by signed comparison regardless of column_orders.
    */
   7: optional list<ColumnOrder> column_orders;
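
As an illustration of the comments changed above, the following is a minimal reader-side sketch and not code from parquet-format or any Parquet library. The function names (`decode_int32`, `candidate_pages`) and the `signed` flag are hypothetical. It assumes an INT32 column whose bounds are stored as 4-byte little-endian values and that the column's ColumnOrder implies signed comparison. It shows two points from the text: `null_pages` is inspected before `min_values[i]` and `max_values[i]` are read, and the same bytes give different pruning results under a different ordering, which is why the bounds are undefined without `column_orders`.

```python
import struct
from typing import List


def decode_int32(raw: bytes, signed: bool = True) -> int:
    """Decode a 4-byte little-endian bound (assumed INT32 layout)."""
    return struct.unpack("<i" if signed else "<I", raw)[0]


def candidate_pages(
    null_pages: List[bool],
    min_values: List[bytes],
    max_values: List[bytes],
    target: int,
    signed: bool = True,
) -> List[int]:
    """Return indices of pages whose [min, max] bounds may contain target.

    `signed` stands in for the ordering implied by the column's ColumnOrder
    (an assumption of this sketch).
    """
    pages = []
    for i, is_null_page in enumerate(null_pages):
        # Bounds of an all-null page are not populated; skip before decoding.
        if is_null_page:
            continue
        lo = decode_int32(min_values[i], signed)
        hi = decode_int32(max_values[i], signed)
        if lo <= target <= hi:
            pages.append(i)
    return pages


def pack(value: int) -> bytes:
    """Encode a signed 32-bit value the way this sketch assumes it is stored."""
    return struct.pack("<i", value)


# Three pages: [-5, 3], an all-null page, and [50, 90].
null_pages = [False, True, False]
min_values = [pack(-5), b"", pack(50)]
max_values = [pack(3), b"", pack(90)]

signed_hits = candidate_pages(null_pages, min_values, max_values, 0)
unsigned_hits = candidate_pages(null_pages, min_values, max_values, 0, signed=False)
print(signed_hits)    # [0]
print(unsigned_hits)  # []
```

Under signed comparison page 0 is a candidate for the value 0. Reading the same bytes as unsigned turns the lower bound -5 into 4294967291, so the page is wrongly pruned.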