Posted to dev@parquet.apache.org by GitBox <gi...@apache.org> on 2021/04/08 14:20:08 UTC

[GitHub] [parquet-format] gszadovszky commented on a change in pull request #173: PARQUET-2016: Reference column_order field from column indexes

gszadovszky commented on a change in pull request #173:
URL: https://github.com/apache/parquet-format/pull/173#discussion_r609753323



##########
File path: src/main/thrift/parquet.thrift
##########
@@ -941,13 +941,14 @@ struct ColumnIndex {
   1: required list<bool> null_pages
 
   /**
-   * Two lists containing lower and upper bounds for the values of each page.
-   * These may be the actual minimum and maximum values found on a page, but
-   * can also be (more compact) values that do not exist on a page. For
-   * example, instead of storing ""Blart Versenwald III", a writer may set
-   * min_values[i]="B", max_values[i]="C". Such more compact values must still
-   * be valid values within the column's logical type. Readers must make sure
-   * that list entries are populated before using them by inspecting null_pages.
+   * Two lists containing lower and upper bounds for the values of each page
+   * determined by the ColumnOrder of the column. These may be the actual
+   * minimum and maximum values found on a page, but can also be (more compact)
+   * values that do not exist on a page. For example, instead of storing ""Blart
+   * Versenwald III", a writer may set min_values[i]="B", max_values[i]="C".

Review comment:
       There is a bit more info about the possible truncation in the [column index spec](https://github.com/apache/parquet-format/blob/master/PageIndex.md) (search for "truncate"). The only existing type that allows such truncation is BINARY<STRING>, but I guess the spec did not want to be too restrictive for potential future types.
   Anyway, parquet-mr has implemented a truncation mechanism for UTF8 strings, and the default length above which we truncate is 64.
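
To illustrate the idea, here is a minimal sketch of how such truncation could work. It is not parquet-mr's actual code: the function names are hypothetical, and for simplicity it counts Unicode code points rather than bytes, which is an assumption. The lower bound is just a prefix; the upper bound is a prefix whose last character is incremented so that it still sorts after the original value.

```python
# Illustrative sketch only; not the parquet-mr implementation.
# Assumption: lengths are counted in Unicode code points for simplicity.

DEFAULT_TRUNCATE_LENGTH = 64  # default mentioned in the comment above
MAX_CODE_POINT = 0x10FFFF


def truncate_min(value: str, length: int = DEFAULT_TRUNCATE_LENGTH) -> str:
    """Return a lower bound for value: any prefix sorts <= the full string."""
    return value[:length]


def truncate_max(value: str, length: int = DEFAULT_TRUNCATE_LENGTH) -> str:
    """Return an upper bound for value that is at most `length` characters.

    Falls back to the untruncated value when no shorter upper bound exists.
    """
    if len(value) <= length:
        return value
    prefix = list(value[:length])
    # Walk back from the end until a character can be incremented.
    while prefix:
        cp = ord(prefix[-1])
        if cp < MAX_CODE_POINT:
            cp += 1
            if 0xD800 <= cp <= 0xDFFF:
                cp = 0xE000  # skip the surrogate range, which is not valid UTF-8
            prefix[-1] = chr(cp)
            return "".join(prefix)
        prefix.pop()  # already the largest code point; drop it and retry
    return value  # every character was the maximum; keep the original


if __name__ == "__main__":
    name = "Blart Versenwald III"
    print(truncate_min(name, 1), truncate_max(name, 1))
```

With a length of 1 this reproduces the example from the thrift comment: min "B" and max "C" for "Blart Versenwald III". Code point order matches UTF-8 byte order, so the bounds also hold under an unsigned byte-wise comparison.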



