You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@parquet.apache.org by GitBox <gi...@apache.org> on 2021/04/08 13:43:42 UTC

[GitHub] [parquet-format] pitrou commented on a change in pull request #173: PARQUET-2016: Reference column_order field from column indexes

pitrou commented on a change in pull request #173:
URL: https://github.com/apache/parquet-format/pull/173#discussion_r609714187



##########
File path: src/main/thrift/parquet.thrift
##########
@@ -1024,17 +1025,20 @@ struct FileMetaData {
   6: optional string created_by
 
   /**
-   * Sort order used for the min_value and max_value fields of each column in
-   * this file. Sort orders are listed in the order matching the columns in the
-   * schema. The indexes are not necessary the same though, because only leaf
-   * nodes of the schema are represented in the list of sort orders.
+   * Sort order used for the min_value and max_value fields in the Statistics
+   * objects and the min_values and max_values fields in the ColumnIndex
+   * objects of each column in this file. Sort orders are listed in the order
+   * matching the columns in the schema. The indexes are not necessary the same
+   * though, because only leaf nodes of the schema are represented in the list
+   * of sort orders.
    *
-   * Without column_orders, the meaning of the min_value and max_value fields is
-   * undefined. To ensure well-defined behaviour, if min_value and max_value are
-   * written to a Parquet file, column_orders must be written as well.
+   * Without column_orders, the meaning of the min_value and max_value fields
+   * in the Statistics object and the ColumnIndex object is undefined. To ensure
+   * well-defined behaviour, these fields are written to a Parquet file,

Review comment:
       "if these fields" perhaps?

##########
File path: src/main/thrift/parquet.thrift
##########
@@ -941,13 +941,14 @@ struct ColumnIndex {
   1: required list<bool> null_pages
 
   /**
-   * Two lists containing lower and upper bounds for the values of each page.
-   * These may be the actual minimum and maximum values found on a page, but
-   * can also be (more compact) values that do not exist on a page. For
-   * example, instead of storing ""Blart Versenwald III", a writer may set
-   * min_values[i]="B", max_values[i]="C". Such more compact values must still
-   * be valid values within the column's logical type. Readers must make sure
-   * that list entries are populated before using them by inspecting null_pages.
+   * Two lists containing lower and upper bounds for the values of each page
+   * determined by the ColumnOrder of the column. These may be the actual
+   * minimum and maximum values found on a page, but can also be (more compact)
+   * values that do not exist on a page. For example, instead of storing ""Blart
+   * Versenwald III", a writer may set min_values[i]="B", max_values[i]="C".

Review comment:
       A bit out of scope, but is there a least of types for which such shortening is allowed? Or are writers allowed to store shortened statictics even for non-string types such as integers, timestamps...?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org