Posted to commits@parquet.apache.org by bl...@apache.org on 2017/10/06 23:38:57 UTC
parquet-format git commit: PARQUET-686: Clarifications about min-max stats.
Repository: parquet-format
Updated Branches:
refs/heads/master 523d7b6d7 -> bef543899
PARQUET-686: Clarifications about min-max stats.
Updated some descriptions to reflect code changes that were made during code review without the corresponding comments and documentation being updated:
* Removed references to the `SIGNED` and `UNSIGNED` sort orders, which were removed in favour of a single `TYPE_ORDER`.
* Removed obsolete references to the effect of `column_orders` on the `min` and `max` values. Those fields were declared obsolete instead, and `column_orders` only affects the new `min_value` and `max_value` fields.
* Clarified `ColumnOrder`'s purpose, since the purpose of a union containing a single empty struct was hard to grasp.
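To illustrate why the choice of sort order matters for min/max statistics, here is a minimal Python sketch (not part of this commit). The helper names `unsigned_bytewise_key` and `signed_bytewise_key` are hypothetical and exist only for this example. It shows that the same set of UTF-8 byte strings yields different min and max values under unsigned byte-wise comparison than under a comparison that treats each byte as a signed 8-bit integer.

```python
def unsigned_bytewise_key(b: bytes):
    # Python compares bytes objects lexicographically by unsigned byte value,
    # so the bytes object itself can serve as the sort key.
    return b


def signed_bytewise_key(b: bytes):
    # Reinterpret every byte as a signed 8-bit integer (range -128..127),
    # which is what a naive comparison of signed bytes would do.
    return [x - 256 if x >= 128 else x for x in b]


# "a", "é" (UTF-8 encoded as 0xC3 0xA9), and "z"
values = [b"a", b"\xc3\xa9", b"z"]

unsigned_min = min(values, key=unsigned_bytewise_key)
unsigned_max = max(values, key=unsigned_bytewise_key)
signed_min = min(values, key=signed_bytewise_key)
signed_max = max(values, key=signed_bytewise_key)

print("unsigned:", unsigned_min, unsigned_max)
print("signed:  ", signed_min, signed_max)
```

Under unsigned comparison the non-ASCII string sorts last, while under signed comparison it sorts first, so statistics written with one order cannot safely be used to filter with the other.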
Author: Zoltan Ivanfi <zi...@cloudera.com>
Closes #55 from zivanfi/master and squashes the following commits:
a499d86 [Zoltan Ivanfi] Comparison rules updates.
0c973f7 [Zoltan Ivanfi] PARQUET-686: Further clarifications.
f8fab0b [Zoltan Ivanfi] PARQUET-686: Minor improvements in Thrift comments.
c86090d [Zoltan Ivanfi] PARQUET-686: Clarifications about min-max stats.
Project: http://git-wip-us.apache.org/repos/asf/parquet-format/repo
Commit: http://git-wip-us.apache.org/repos/asf/parquet-format/commit/bef54389
Tree: http://git-wip-us.apache.org/repos/asf/parquet-format/tree/bef54389
Diff: http://git-wip-us.apache.org/repos/asf/parquet-format/diff/bef54389
Branch: refs/heads/master
Commit: bef5438990116725af041cdd8ced2bca0ed2608a
Parents: 523d7b6
Author: Zoltan Ivanfi <zi...@cloudera.com>
Authored: Fri Oct 6 16:38:53 2017 -0700
Committer: Ryan Blue <bl...@apache.org>
Committed: Fri Oct 6 16:38:53 2017 -0700
----------------------------------------------------------------------
LogicalTypes.md | 26 ++++++++------
src/main/thrift/parquet.thrift | 67 +++++++++++++++++++++++++++----------
2 files changed, 64 insertions(+), 29 deletions(-)
----------------------------------------------------------------------
http://git-wip-us.apache.org/repos/asf/parquet-format/blob/bef54389/LogicalTypes.md
----------------------------------------------------------------------
diff --git a/LogicalTypes.md b/LogicalTypes.md
index 29cf527..6e5c9db 100644
--- a/LogicalTypes.md
+++ b/LogicalTypes.md
@@ -37,7 +37,7 @@ may require additional metadata fields, as well as rules for those fields.
`UTF8` may only be used to annotate the binary primitive type and indicates
that the byte array should be interpreted as a UTF-8 encoded character string.
-The sort order used for `UTF8` strings is `UNSIGNED` byte-wise comparison.
+The sort order used for `UTF8` strings is unsigned byte-wise comparison.
## Numeric Types
@@ -57,7 +57,7 @@ allows.
implied by the `int32` and `int64` primitive types if no other annotation is
present and should be considered optional.
-The sort order used for signed integer types is `SIGNED`.
+The sort order used for signed integer types is signed.
### Unsigned Integers
@@ -74,7 +74,7 @@ allows.
`UINT_8`, `UINT_16`, and `UINT_32` must annotate an `int32` primitive type and
`UINT_64` must annotate an `int64` primitive type.
-The sort order used for unsigned integer types is `UNSIGNED`.
+The sort order used for unsigned integer types is unsigned.
### DECIMAL
@@ -104,8 +104,8 @@ integer. A precision too large for the underlying type (see below) is an error.
A `SchemaElement` with the `DECIMAL` `ConvertedType` must also have both
`scale` and `precision` fields set, even if scale is 0 by default.
-The sort order used for `DECIMAL` values is `SIGNED`. The order is equivalent
-to signed comparison of decimal values.
+The sort order used for `DECIMAL` values is signed comparison of the represented
+value.
If the column uses `int32` or `int64` physical types, then signed comparison of
the integer values produces the correct ordering. If the physical type is
@@ -121,7 +121,7 @@ comparison.
annotate an `int32` that stores the number of days from the Unix epoch, 1
January 1970.
-The sort order used for `DATE` is `SIGNED`.
+The sort order used for `DATE` is signed.
### TIME\_MILLIS
@@ -129,7 +129,7 @@ The sort order used for `DATE` is `SIGNED`.
without a date. It must annotate an `int32` that stores the number of
milliseconds after midnight.
-The sort order used for `TIME\_MILLIS` is `SIGNED`.
+The sort order used for `TIME\_MILLIS` is signed.
### TIME\_MICROS
@@ -137,7 +137,7 @@ The sort order used for `TIME\_MILLIS` is `SIGNED`.
without a date. It must annotate an `int64` that stores the number of
microseconds after midnight.
-The sort order used for `TIME\_MICROS` is `SIGNED`.
+The sort order used for `TIME\_MICROS` is signed.
### TIMESTAMP\_MILLIS
@@ -145,7 +145,7 @@ The sort order used for `TIME\_MICROS` is `SIGNED`.
millisecond precision. It must annotate an `int64` that stores the number of
milliseconds from the Unix epoch, 00:00:00.000 on 1 January 1970, UTC.
-The sort order used for `TIMESTAMP\_MILLIS` is `SIGNED`.
+The sort order used for `TIMESTAMP\_MILLIS` is signed.
### TIMESTAMP\_MICROS
@@ -153,7 +153,7 @@ The sort order used for `TIMESTAMP\_MILLIS` is `SIGNED`.
microsecond precision. It must annotate an `int64` that stores the number of
microseconds from the Unix epoch, 00:00:00.000000 on 1 January 1970, UTC.
-The sort order used for `TIMESTAMP\_MICROS` is `SIGNED`.
+The sort order used for `TIMESTAMP\_MICROS` is signed.
### INTERVAL
@@ -169,7 +169,7 @@ example, there is no requirement that a large number of days should be
expressed as a mix of months and days because there is not a constant
conversion from days to months.
-The sort order used for `INTERVAL` is `UNSIGNED`, produced by sorting by
+The sort order used for `INTERVAL` is unsigned, produced by sorting by
the value of months, then days, then milliseconds with unsigned comparison.
## Embedded Types
@@ -184,6 +184,8 @@ string of valid JSON as defined by the [JSON specification][json-spec]
[json-spec]: http://json.org/
+The sort order used for `JSON` is unsigned byte-wise comparison.
+
### BSON
`BSON` is used for an embedded BSON document. It must annotate a `binary`
@@ -192,6 +194,8 @@ defined by the [BSON specification][bson-spec].
[bson-spec]: http://bsonspec.org/spec.html
+The sort order used for `BSON` is unsigned byte-wise comparison.
+
## Nested Types
This section specifies how `LIST` and `MAP` can be used to encode nested types
http://git-wip-us.apache.org/repos/asf/parquet-format/blob/bef54389/src/main/thrift/parquet.thrift
----------------------------------------------------------------------
diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift
index 47812ab..3c51639 100644
--- a/src/main/thrift/parquet.thrift
+++ b/src/main/thrift/parquet.thrift
@@ -28,17 +28,6 @@ namespace java org.apache.parquet.format
* with the encodings to control the on disk storage format.
* For example INT16 is not included as a type since a good encoding of INT32
* would handle this.
- *
- * When a logical type is not present, the type-defined sort order of these
- * physical types are:
- * * BOOLEAN - false, true
- * * INT32 - signed comparison
- * * INT64 - signed comparison
- * * INT96 - signed comparison
- * * FLOAT - signed comparison
- * * DOUBLE - signed comparison
- * * BYTE_ARRAY - unsigned byte-wise comparison
- * * FIXED_LEN_BYTE_ARRAY - unsigned byte-wise comparison
*/
enum Type {
BOOLEAN = 0;
@@ -219,12 +208,12 @@ struct Statistics {
* Values are encoded using PLAIN encoding, except that variable-length byte
* arrays do not include a length prefix.
*
- * These fields encode min and max values determined by SIGNED comparison
+ * These fields encode min and max values determined by signed comparison
* only. New files should use the correct order for a column's logical type
* and store the values in the min_value and max_value fields.
*
* To support older readers, these may be set when the column order is
- * SIGNED.
+ * signed.
*/
1: optional binary max;
2: optional binary min;
@@ -582,7 +571,9 @@ struct RowGroup {
struct TypeDefinedOrder {}
/**
- * Union to specify the order used for min, max, and sorting values in a column.
+ * Union to specify the order used for the min_value and max_value fields for a
+ * column. This union takes the role of an enhanced enum that allows rich
+ * elements (which will be needed for a collation-based ordering in the future).
*
* Possible values are:
* * TypeDefinedOrder - the column uses the order defined by its logical or
@@ -592,6 +583,41 @@ struct TypeDefinedOrder {}
* for this column should be ignored.
*/
union ColumnOrder {
+
+ /**
+ * The sort orders for logical types are:
+ * UTF8 - unsigned byte-wise comparison
+ * INT8 - signed comparison
+ * INT16 - signed comparison
+ * INT32 - signed comparison
+ * INT64 - signed comparison
+ * UINT8 - unsigned comparison
+ * UINT16 - unsigned comparison
+ * UINT32 - unsigned comparison
+ * UINT64 - unsigned comparison
+ * DECIMAL - signed comparison of the represented value
+ * DATE - signed comparison
+ * TIME_MILLIS - signed comparison
+ * TIME_MICROS - signed comparison
+ * TIMESTAMP_MILLIS - signed comparison
+ * TIMESTAMP_MICROS - signed comparison
+ * INTERVAL - unsigned comparison
+ * JSON - unsigned byte-wise comparison
+ * BSON - unsigned byte-wise comparison
+ * ENUM - unsigned byte-wise comparison
+ * LIST - undefined
+ * MAP - undefined
+ *
+ * In the absence of logical types, the sort order is determined by the physical type:
+ * BOOLEAN - false, true
+ * INT32 - signed comparison
+ * INT64 - signed comparison
+ * INT96 (only used for legacy timestamps) - unsigned comparison
+ * FLOAT - signed comparison of the represented value
+ * DOUBLE - signed comparison of the represented value
+ * BYTE_ARRAY - unsigned byte-wise comparison
+ * FIXED_LEN_BYTE_ARRAY - unsigned byte-wise comparison
+ */
1: TypeDefinedOrder TYPE_ORDER;
}
@@ -626,11 +652,16 @@ struct FileMetaData {
6: optional string created_by
/**
- * Sort order used for each column in this file.
+ * Sort order used for the min_value and max_value fields of each column in
+ * this file. Each sort order corresponds to one column, determined by its
+ * position in the list, matching the position of the column in the schema.
+ *
+ * Without column_orders, the meaning of the min_value and max_value fields is
+ * undefined. To ensure well-defined behaviour, if min_value and max_value are
+ * written to a Parquet file, column_orders must be written as well.
*
- * If this list is not present, then the order for each column is assumed to
- * be Signed. In addition, min and max values for INTERVAL or DECIMAL stored
- * as fixed or bytes should be ignored.
+ * The obsolete min and max fields are always sorted by signed comparison
+ * regardless of column_orders.
*/
7: optional list<ColumnOrder> column_orders;
}
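The rules described in the updated comments can be sketched from a reader's point of view. The following Python example is only an illustration under stated assumptions, not code from parquet-format or any Parquet library: the `Stats` class, the `select_bounds` function, the string `"TYPE_ORDER"` as a stand-in for the union member, and the `type_order_is_signed` flag are all hypothetical. It assumes that a reader trusts `min_value`/`max_value` only when a known column order is present, and falls back to the obsolete `min`/`max` fields only when the column's type-defined order is signed.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Stats:
    # Hypothetical stand-in for the Thrift Statistics struct.
    min: Optional[bytes] = None        # obsolete, signed comparison only
    max: Optional[bytes] = None        # obsolete, signed comparison only
    min_value: Optional[bytes] = None  # ordered according to column_orders
    max_value: Optional[bytes] = None  # ordered according to column_orders


def select_bounds(
    stats: Stats,
    column_order: Optional[str],
    type_order_is_signed: bool,
) -> Optional[Tuple[bytes, bytes]]:
    """Return (lower, upper) bounds that are safe to use, or None.

    column_order is None when column_orders is absent or the order is
    unknown to this reader (assumption: both cases are treated the same).
    """
    # min_value/max_value are only well defined together with a column order.
    if (
        column_order == "TYPE_ORDER"
        and stats.min_value is not None
        and stats.max_value is not None
    ):
        return stats.min_value, stats.max_value
    # The obsolete fields always use signed comparison, so they are only
    # meaningful when the type's own order is signed.
    if type_order_is_signed and stats.min is not None and stats.max is not None:
        return stats.min, stats.max
    return None


new_stats = Stats(min_value=b"a", max_value=b"z")
old_stats = Stats(min=b"\x01", max=b"\x09")

print(select_bounds(new_stats, "TYPE_ORDER", False))
print(select_bounds(new_stats, None, False))
print(select_bounds(old_stats, None, True))
print(select_bounds(old_stats, None, False))
```

Returning `None` rather than guessing keeps the sketch conservative: when the order is undefined, skipping statistics-based filtering is always correct, only slower.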