Posted to commits@parquet.apache.org by bl...@apache.org on 2017/04/17 18:23:46 UTC

parquet-format git commit: PARQUET-686: Add Order to store the order used for min/max stats.

Repository: parquet-format
Updated Branches:
  refs/heads/master 65e851eae -> 041708da1


PARQUET-686: Add Order to store the order used for min/max stats.

This adds a new enum, `Order`, that will be set to the order used to produce the min and max values in all `Statistics` objects (at the page level). `Order` has 8 symbols: `SIGNED`, `UNSIGNED`, and 6 symbols for custom orderings. This also adds a `CustomOrder` struct that is used to map the custom order symbols to string descriptors, such as [order keywords used by ICU collating sequences](http://userguide.icu-project.org/collation/api#TOC-Instantiating-the-Predefined-Collators). `CustomOrder` mappings are stored in the file footer.

Author: Ryan Blue <bl...@apache.org>

Closes #46 from rdblue/PARQUET-686-add-stats-ordering and squashes the following commits:

f878c34 [Ryan Blue] PARQUET-686: Remove Order enum.
9447fb8 [Ryan Blue] PARQUET-686: Use "is" instead of "must be".
ffbb60b [Ryan Blue] PARQUET-686: Store ColumnOrder as a union.
c6e43b0 [Ryan Blue] PARQUET-686: Add new min_value and max_value stats.
eed4d47 [Ryan Blue] PARQUET-686: Add clarifications from review comments.
9962df8 [Ryan Blue] PARQUET-686: Remove is_ascending and number columns starting with 1.
faa9edb [Ryan Blue] PARQUET-686: Add order specs to logical types.
4534062 [Ryan Blue] PARQUET-686: Add ColumnOrders to FileMetaData.


Project: http://git-wip-us.apache.org/repos/asf/parquet-format/repo
Commit: http://git-wip-us.apache.org/repos/asf/parquet-format/commit/041708da
Tree: http://git-wip-us.apache.org/repos/asf/parquet-format/tree/041708da
Diff: http://git-wip-us.apache.org/repos/asf/parquet-format/diff/041708da

Branch: refs/heads/master
Commit: 041708da1af52e7cb9288c331b542aa25b68a2b6
Parents: 65e851e
Author: Ryan Blue <bl...@apache.org>
Authored: Mon Apr 17 11:23:41 2017 -0700
Committer: Ryan Blue <bl...@apache.org>
Committed: Mon Apr 17 11:23:41 2017 -0700

----------------------------------------------------------------------
 LogicalTypes.md                | 30 +++++++++++++++++++
 src/main/thrift/parquet.thrift | 59 ++++++++++++++++++++++++++++++++++++-
 2 files changed, 88 insertions(+), 1 deletion(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/parquet-format/blob/041708da/LogicalTypes.md
----------------------------------------------------------------------
diff --git a/LogicalTypes.md b/LogicalTypes.md
index c411dbf..29cf527 100644
--- a/LogicalTypes.md
+++ b/LogicalTypes.md
@@ -37,6 +37,8 @@ may require additional metadata fields, as well as rules for those fields.
 `UTF8` may only be used to annotate the binary primitive type and indicates
 that the byte array should be interpreted as a UTF-8 encoded character string.
 
+The sort order used for `UTF8` strings is `UNSIGNED` byte-wise comparison.
+
 ## Numeric Types
 
 ### Signed Integers
@@ -55,6 +57,8 @@ allows.
 implied by the `int32` and `int64` primitive types if no other annotation is
 present and should be considered optional.
 
+The sort order used for signed integer types is `SIGNED`.
+
 ### Unsigned Integers
 
 `UINT_8`, `UINT_16`, `UINT_32`, and `UINT_64` annotations can be used to
@@ -70,6 +74,8 @@ allows.
 `UINT_8`, `UINT_16`, and `UINT_32` must annotate an `int32` primitive type and
 `UINT_64` must annotate an `int64` primitive type.
 
+The sort order used for unsigned integer types is `UNSIGNED`.
+
 ### DECIMAL
 
 `DECIMAL` annotation represents arbitrary-precision signed decimal numbers of
@@ -98,6 +104,15 @@ integer. A precision too large for the underlying type (see below) is an error.
 A `SchemaElement` with the `DECIMAL` `ConvertedType` must also have both
 `scale` and `precision` fields set, even if scale is 0 by default.
 
+The sort order used for `DECIMAL` values is `SIGNED`. The order is equivalent
+to signed comparison of decimal values.
+
+If the column uses `int32` or `int64` physical types, then signed comparison of
+the integer values produces the correct ordering. If the physical type is
+fixed, then the correct ordering can be produced by flipping the
+most-significant bit in the first byte and then using unsigned byte-wise
+comparison.
+
 ## Date/Time Types
 
 ### DATE
@@ -106,30 +121,40 @@ A `SchemaElement` with the `DECIMAL` `ConvertedType` must also have both
 annotate an `int32` that stores the number of days from the Unix epoch, 1
 January 1970.
 
+The sort order used for `DATE` is `SIGNED`.
+
 ### TIME\_MILLIS
 
 `TIME_MILLIS` is used for a logical time type with millisecond precision,
 without a date. It must annotate an `int32` that stores the number of
 milliseconds after midnight.
 
+The sort order used for `TIME_MILLIS` is `SIGNED`.
+
 ### TIME\_MICROS
 
 `TIME_MICROS` is used for a logical time type with microsecond precision,
 without a date. It must annotate an `int64` that stores the number of
 microseconds after midnight.
 
+The sort order used for `TIME_MICROS` is `SIGNED`.
+
 ### TIMESTAMP\_MILLIS
 
 `TIMESTAMP_MILLIS` is used for a combined logical date and time type, with
 millisecond precision. It must annotate an `int64` that stores the number of
 milliseconds from the Unix epoch, 00:00:00.000 on 1 January 1970, UTC.
 
+The sort order used for `TIMESTAMP_MILLIS` is `SIGNED`.
+
 ### TIMESTAMP\_MICROS
 
 `TIMESTAMP_MICROS` is used for a combined logical date and time type with
 microsecond precision. It must annotate an `int64` that stores the number of
 microseconds from the Unix epoch, 00:00:00.000000 on 1 January 1970, UTC.
 
+The sort order used for `TIMESTAMP_MICROS` is `SIGNED`.
+
 ### INTERVAL
 
 `INTERVAL` is used for an interval of time. It must annotate a
@@ -144,8 +169,13 @@ example, there is no requirement that a large number of days should be
 expressed as a mix of months and days because there is not a constant
 conversion from days to months.
 
+The sort order used for `INTERVAL` is `UNSIGNED`, produced by sorting by
+the value of months, then days, then milliseconds with unsigned comparison.
+
 ## Embedded Types
 
+Embedded types do not have type-specific orderings.
+
 ### JSON
 
 `JSON` is used for an embedded JSON document. It must annotate a `binary`
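
The `DECIMAL` hunk above says that a fixed-length value can be ordered by flipping the most-significant bit of the first byte and then comparing bytes as unsigned. A minimal Python sketch of that rule follows. It assumes the unscaled value is stored as a big-endian two's-complement integer and that all values in the column have the same width; neither assumption is stated in this diff. The names `signed_fixed_sort_key` and `encode_unscaled` are hypothetical helpers for illustration, not part of any Parquet library.

```python
def signed_fixed_sort_key(raw: bytes) -> bytes:
    """Return a key whose unsigned byte-wise order matches signed order.

    Assumption: `raw` is a big-endian two's-complement integer and every
    value being compared has the same length.
    """
    if not raw:
        return raw
    # Flipping the sign bit moves negatives (first byte 0x80..0xFF) below
    # non-negatives (first byte 0x00..0x7F) under unsigned comparison.
    return bytes([raw[0] ^ 0x80]) + raw[1:]


def encode_unscaled(value: int, width: int) -> bytes:
    """Hypothetical helper: encode an unscaled decimal as fixed-width bytes."""
    return value.to_bytes(width, byteorder="big", signed=True)


unscaled = [255, -1, 70000, 0, -300, 1, 256]
encoded = [encode_unscaled(v, 4) for v in unscaled]

# Python compares bytes objects as unsigned, byte by byte.
by_key = sorted(encoded, key=signed_fixed_sort_key)
decoded = [int.from_bytes(b, byteorder="big", signed=True) for b in by_key]
```

Sorting by the key yields the values in numeric order, whereas sorting the raw bytes directly would place the negative values after the positive ones.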

http://git-wip-us.apache.org/repos/asf/parquet-format/blob/041708da/src/main/thrift/parquet.thrift
----------------------------------------------------------------------
diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift
index e89bc80..47812ab 100644
--- a/src/main/thrift/parquet.thrift
+++ b/src/main/thrift/parquet.thrift
@@ -28,6 +28,17 @@ namespace java org.apache.parquet.format
  * with the encodings to control the on disk storage format.
  * For example INT16 is not included as a type since a good encoding of INT32
  * would handle this.
+ *
+ * When a logical type is not present, the type-defined sort orders of these
+ * physical types are:
+ * * BOOLEAN - false, true
+ * * INT32 - signed comparison
+ * * INT64 - signed comparison
+ * * INT96 - signed comparison
+ * * FLOAT - signed comparison
+ * * DOUBLE - signed comparison
+ * * BYTE_ARRAY - unsigned byte-wise comparison
+ * * FIXED_LEN_BYTE_ARRAY - unsigned byte-wise comparison
  */
 enum Type {
   BOOLEAN = 0;
@@ -202,13 +213,33 @@ enum FieldRepetitionType {
  * All fields are optional.
  */
 struct Statistics {
-   /** min and max value of the column, encoded in PLAIN encoding */
+   /**
+    * DEPRECATED: min and max value of the column. Use min_value and max_value.
+    *
+    * Values are encoded using PLAIN encoding, except that variable-length byte
+    * arrays do not include a length prefix.
+    *
+    * These fields encode min and max values determined by SIGNED comparison
+    * only. New files should use the correct order for a column's logical type
+    * and store the values in the min_value and max_value fields.
+    *
+    * To support older readers, these may be set when the column order is
+    * SIGNED.
+    */
    1: optional binary max;
    2: optional binary min;
    /** count of null value in the column */
    3: optional i64 null_count;
    /** count of distinct values occurring */
    4: optional i64 distinct_count;
+   /**
+    * Min and max values for the column, determined by its ColumnOrder.
+    *
+    * Values are encoded using PLAIN encoding, except that variable-length byte
+    * arrays do not include a length prefix.
+    */
+   5: optional binary max_value;
+   6: optional binary min_value;
 }
 
 /**
@@ -547,6 +578,23 @@ struct RowGroup {
   4: optional list<SortingColumn> sorting_columns
 }
 
+/** Empty struct to signal the order defined by the physical or logical type */
+struct TypeDefinedOrder {}
+
+/**
+ * Union to specify the order used for min, max, and sorting values in a column.
+ *
+ * Possible values are:
+ * * TypeDefinedOrder - the column uses the order defined by its logical or
+ *                      physical type (if there is no logical type).
+ *
+ * If the reader does not support the value of this union, min and max stats
+ * for this column should be ignored.
+ */
+union ColumnOrder {
+  1: TypeDefinedOrder TYPE_ORDER;
+}
+
 /**
  * Description for file metadata
  */
@@ -576,5 +624,14 @@ struct FileMetaData {
    * e.g. impala version 1.0 (build 6cf94d29b2b7115df4de2c06e2ab4326d721eb55)
    **/
   6: optional string created_by
+
+  /**
+   * Sort order used for each column in this file.
+   *
+   * If this list is not present, then the order for each column is assumed to
+   * be SIGNED. In addition, min and max values for INTERVAL or DECIMAL stored
+   * as FIXED_LEN_BYTE_ARRAY or BYTE_ARRAY should be ignored.
+   */
+  7: optional list<ColumnOrder> column_orders;
 }
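
The comments added to `Statistics`, `ColumnOrder`, and `FileMetaData` together imply a decision procedure for readers. The Python sketch below is one possible reading of those comments, not the behavior of any existing reader. The `Stats` dataclass, the `TYPE_ORDER` string constant, and the `type_order_is_signed` and `legacy_unsafe` parameters are hypothetical stand-ins for generated Thrift classes and schema lookups.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

TYPE_ORDER = "TYPE_ORDER"  # stands in for the TypeDefinedOrder union member


@dataclass
class Stats:
    """Hypothetical in-memory mirror of the Thrift Statistics struct."""
    max: Optional[bytes] = None
    min: Optional[bytes] = None
    max_value: Optional[bytes] = None
    min_value: Optional[bytes] = None


def select_min_max(
    stats: Stats,
    column_order: Optional[str],
    type_order_is_signed: bool,
    legacy_unsafe: bool = False,
) -> Optional[Tuple[bytes, bytes]]:
    """Return the (min, max) bytes a reader may trust, or None to ignore them.

    column_order is None when FileMetaData.column_orders is absent.
    legacy_unsafe marks INTERVAL, or DECIMAL stored as a byte array.
    """
    legacy = None
    if stats.min is not None and stats.max is not None:
        legacy = (stats.min, stats.max)

    if column_order is None:
        # Old file: the deprecated fields were produced by signed comparison
        # and must be interpreted that way by the caller.
        return None if legacy_unsafe else legacy
    if column_order != TYPE_ORDER:
        # Unsupported union value: the stats should be ignored.
        return None
    if stats.min_value is not None and stats.max_value is not None:
        return (stats.min_value, stats.max_value)
    # Assumption: the deprecated fields remain usable as a fallback only
    # when the type-defined order is itself signed.
    return legacy if type_order_is_signed else None
```

For example, a column whose order is `TYPE_ORDER` and that carries both pairs of fields yields `min_value` and `max_value`, while an unrecognized order yields `None` regardless of which fields are set.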