Posted to commits@parquet.apache.org by ga...@apache.org on 2018/06/25 06:27:57 UTC
[parquet-format] branch master updated: PARQUET-1312: Improve logical types documentation (#98)

This is an automated email from the ASF dual-hosted git repository.

gabor pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/parquet-format.git


The following commit(s) were added to refs/heads/master by this push:
     new 54e6133  PARQUET-1312: Improve logical types documentation (#98)
54e6133 is described below

commit 54e6133e887a6ea90501ddd72fff5312b7038a7c
Author: nandorKollar <na...@users.noreply.github.com>
AuthorDate: Mon Jun 25 08:27:55 2018 +0200

    PARQUET-1312: Improve logical types documentation (#98)
---
 LogicalTypes.md | 241 ++++++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 207 insertions(+), 34 deletions(-)

diff --git a/LogicalTypes.md b/LogicalTypes.md
index 762769e..3be6f21 100644
--- a/LogicalTypes.md
+++ b/LogicalTypes.md
@@ -29,17 +29,41 @@ This file contains the specification for all logical types.
 
 ### Metadata
 
-The parquet format's `ConvertedType` stores the type annotation. The annotation
+The parquet format's `LogicalType` stores the type annotation. The annotation
 may require additional metadata fields, as well as rules for those fields.
+There is an older representation of the logical type annotations called `ConvertedType`.
+To support backward compatibility with old files, readers should interpret `LogicalType`
+in the same way as `ConvertedType`, and writers should populate `ConvertedType` in the metadata
+according to well-defined conversion rules.
+
+### Compatibility
+
+The Thrift definition of the metadata has two fields for logical types: `ConvertedType` and `LogicalType`.
+`ConvertedType` is an enum of all available annotations. Since Thrift enums can't have additional type parameters,
+it is cumbersome to define additional type parameters, like decimal scale and precision
+(which are additional 32-bit integer fields on SchemaElement, and are relevant only for decimals) or time unit
+and UTC adjustment flag for Timestamp types. To overcome this problem, a new logical type representation was introduced into
+the metadata to replace `ConvertedType`: `LogicalType`. The new representation is a union of structs of logical types,
+which allows a more flexible API in which logical types can have type parameters.
+
+However, to maintain compatibility, Parquet readers should be able to read
+and interpret the old logical type representation (in case the new one is not present
+because the file was written by an older writer), and writers should populate the `ConvertedType` field for old readers.
+
+Compatibility considerations are mentioned for each annotation in the corresponding section.
 
 ## String Types
 
-### UTF8
+### STRING
 
-`UTF8` may only be used to annotate the binary primitive type and indicates
+`STRING` may only be used to annotate the binary primitive type and indicates
 that the byte array should be interpreted as a UTF-8 encoded character string.
 
-The sort order used for `UTF8` strings is unsigned byte-wise comparison.
+The sort order used for `STRING` strings is unsigned byte-wise comparison.
+
+*Compatibility*
+
+`STRING` corresponds to `UTF8` ConvertedType.
 
 ### ENUM
 
@@ -65,17 +89,21 @@ The sort order used for `UUID` values is unsigned byte-wise comparison.
 
 ### Signed Integers
 
-`INT_8`, `INT_16`, `INT_32`, and `INT_64` annotations can be used to specify
-the maximum number of bits in the stored value.  Implementations may use these
-annotations to produce smaller in-memory representations when reading data.
+The `INT` annotation can be used to specify the maximum number of bits in the stored value.
+The annotation has two parameters: bit width and sign.
+Allowed bit width values are `8`, `16`, `32`, `64`, and sign can be `true` or `false`.
+For signed integers, the second parameter should be `true`;
+for example, a signed integer with bit width of 8 is defined as `INT(8, true)`.
+Implementations may use these annotations to produce smaller
+in-memory representations when reading data.
 
 If a stored value is larger than the maximum allowed by the annotation, the
 behavior is not defined and can be determined by the implementation.
 Implementations must not write values that are larger than the annotation
 allows.
 
-`INT_8`, `INT_16`, and `INT_32` must annotate an `int32` primitive type and
-`INT_64` must annotate an `int64` primitive type. `INT_32` and `INT_64` are
+`INT(8, true)`, `INT(16, true)`, and `INT(32, true)` must annotate an `int32` primitive type and
+`INT(64, true)` must annotate an `int64` primitive type. `INT(32, true)` and `INT(64, true)` are
 implied by the `int32` and `int64` primitive types if no other annotation is
 present and should be considered optional.
 
@@ -83,9 +111,13 @@ The sort order used for signed integer types is signed.
 
 ### Unsigned Integers
 
-`UINT_8`, `UINT_16`, `UINT_32`, and `UINT_64` annotations can be used to
-specify unsigned integer types, along with a maximum number of bits in the
-stored value. Implementations may use these annotations to produce smaller
+The `INT` annotation can be used to specify unsigned integer types,
+along with a maximum number of bits in the stored value.
+The annotation has two parameters: bit width and sign.
+Allowed bit width values are `8`, `16`, `32`, `64`, and sign can be `true` or `false`.
+For unsigned integers, the second parameter should be `false`;
+for example, an unsigned integer with bit width of 8 is defined as `INT(8, false)`.
+Implementations may use these annotations to produce smaller
 in-memory representations when reading data.
 
 If a stored value is larger than the maximum allowed by the annotation, the
@@ -93,10 +125,83 @@ behavior is not defined and can be determined by the implementation.
 Implementations must not write values that are larger than the annotation
 allows.
 
+`INT(8, false)`, `INT(16, false)`, and `INT(32, false)` must annotate an `int32` primitive type and
+`INT(64, false)` must annotate an `int64` primitive type.
+
+The sort order used for unsigned integer types is unsigned.
+
+### Deprecated integer ConvertedType
+
+`INT_8`, `INT_16`, `INT_32`, and `INT_64` annotations can also be used to specify
+signed integers with 8, 16, 32, or 64 bit width.
+
+`INT_8`, `INT_16`, and `INT_32` must annotate an `int32` primitive type and
+`INT_64` must annotate an `int64` primitive type. `INT_32` and `INT_64` are
+implied by the `int32` and `int64` primitive types if no other annotation is
+present and should be considered optional.
+
+`UINT_8`, `UINT_16`, `UINT_32`, and `UINT_64` annotations can also be used to specify
+unsigned integers with 8, 16, 32, or 64 bit width.
+
 `UINT_8`, `UINT_16`, and `UINT_32` must annotate an `int32` primitive type and
 `UINT_64` must annotate an `int64` primitive type.
 
-The sort order used for unsigned integer types is unsigned.
+*Backward compatibility:*
+
+| ConvertedType | LogicalType |
+|---------------|-------------|
+| INT_8  | IntType (bitWidth = 8, isSigned = true) |
+| INT_16 | IntType (bitWidth = 16, isSigned = true) |
+| INT_32 | IntType (bitWidth = 32, isSigned = true) |
+| INT_64 | IntType (bitWidth = 64, isSigned = true) |
+| UINT_8  | IntType (bitWidth = 8, isSigned = false) |
+| UINT_16 | IntType (bitWidth = 16, isSigned = false) |
+| UINT_32 | IntType (bitWidth = 32, isSigned = false) |
+| UINT_64 | IntType (bitWidth = 64, isSigned = false) |
+
+*Forward compatibility:*
+
+<table>
+    <tr colspan="3">
+        <th colspan="3">LogicalType</th>
+        <th>ConvertedType</th>
+    </tr>
+    <tr>
+        <td rowspan="8">IntType</td>
+        <td rowspan="4">isSigned</td>
+        <td>bitWidth = 8</td>
+        <td>INT_8</td>
+    </tr>
+    <tr>
+        <td>bitWidth = 16</td>
+        <td>INT_16</td>
+    </tr>
+    <tr>
+        <td>bitWidth = 32</td>
+        <td>INT_32</td>
+    </tr>
+    <tr>
+        <td>bitWidth = 64</td>
+        <td>INT_64</td>
+    </tr>
+    <tr>
+        <td rowspan="4">!isSigned</td>
+        <td>bitWidth = 8</td>
+        <td>UINT_8</td>
+    </tr>
+    <tr>
+        <td>bitWidth = 16</td>
+        <td>UINT_16</td>
+    </tr>
+    <tr>
+        <td>bitWidth = 32</td>
+        <td>UINT_32</td>
+    </tr>
+    <tr>
+        <td>bitWidth = 64</td>
+        <td>UINT_64</td>
+    </tr>
+</table>
 
 ### DECIMAL
 
@@ -123,9 +228,6 @@ integer. A precision too large for the underlying type (see below) is an error.
 * `binary`: `precision` is not limited, but is required. The minimum number of
   bytes to store the unscaled value should be used.
 
-A `SchemaElement` with the `DECIMAL` `ConvertedType` must also have both
-`scale` and `precision` fields set, even if scale is 0 by default.
-
 The sort order used for `DECIMAL` values is signed comparison of the represented
 value.
 
@@ -135,6 +237,11 @@ fixed, then the correct ordering can be produced by flipping the
 most-significant bit in the first byte and then using unsigned byte-wise
 comparison.
 
+*Compatibility*
+
+To support compatibility with older readers, implementations of parquet-format should
+write the `DecimalType` precision and scale into the corresponding SchemaElement fields in the metadata.
+
 ## Date/Time Types
 
 ### DATE
@@ -145,37 +252,103 @@ January 1970.
 
 The sort order used for `DATE` is signed.
 
-### TIME\_MILLIS
+### TIME
+
+`TIME` is used for a logical time type without a date, with millisecond or microsecond precision.
+The type has two type parameters: UTC adjustment (`true` or `false`)
+and precision (`MILLIS` or `MICROS`).
 
-`TIME_MILLIS` is used for a logical time type with millisecond precision,
-without a date. It must annotate an `int32` that stores the number of
+`TIME` with precision `MILLIS` is used for millisecond precision.
+It must annotate an `int32` that stores the number of
 milliseconds after midnight.
 
-The sort order used for `TIME\_MILLIS` is signed.
+`TIME` with precision `MICROS` is used for microsecond precision.
+It must annotate an `int64` that stores the number of
+microseconds after midnight.
 
-### TIME\_MICROS
+The sort order used for `TIME` is signed.
 
-`TIME_MICROS` is used for a logical time type with microsecond precision,
-without a date. It must annotate an `int64` that stores the number of
-microseconds after midnight.
+#### Deprecated time ConvertedType
 
-The sort order used for `TIME\_MICROS` is signed.
+`TIME_MILLIS` is the deprecated ConvertedType counterpart of the `TIME` logical type
+with precision `MILLIS`. Like the logical type counterpart, it must annotate an `int32`.
 
-### TIMESTAMP\_MILLIS
+`TIME_MICROS` is the deprecated ConvertedType counterpart of the `TIME` logical type
+with precision `MICROS`. Like the logical type counterpart, it must annotate an `int64`.
 
-`TIMESTAMP_MILLIS` is used for a combined logical date and time type, with
-millisecond precision. It must annotate an `int64` that stores the number of
-milliseconds from the Unix epoch, 00:00:00.000 on 1 January 1970, UTC.
+*Backward compatibility:*
+
+| ConvertedType | LogicalType |
+|---------------|-------------|
+| TIME_MILLIS | TimeType (isAdjustedToUTC = true, unit = MILLIS) |
+| TIME_MICROS | TimeType (isAdjustedToUTC = true, unit = MICROS) |
+
+*Forward compatibility:*
 
-The sort order used for `TIMESTAMP\_MILLIS` is signed.
+<table>
+    <tr colspan="3">
+        <th colspan="3">LogicalType</th>
+        <th>ConvertedType</th>
+    </tr>
+    <tr>
+        <td rowspan="2" colspan="2">TimeType</td>
+        <td>unit = MILLIS</td>
+        <td>TIME_MILLIS</td>
+    </tr>
+    <tr>
+        <td>unit = MICROS</td>
+        <td>TIME_MICROS</td>
+    </tr>
+</table>
 
-### TIMESTAMP\_MICROS
+### TIMESTAMP
 
-`TIMESTAMP_MICROS` is used for a combined logical date and time type with
-microsecond precision. It must annotate an `int64` that stores the number of
+`TIMESTAMP` is used for a combined logical date and time type, with
+millisecond or microsecond precision. The type has two type parameters:
+UTC adjustment (`true` or `false`) and precision (`MILLIS` or `MICROS`).
+
+`TIMESTAMP` with precision `MILLIS` is used for millisecond precision.
+It must annotate an `int64` that stores the number of
+milliseconds from the Unix epoch, 00:00:00.000 on 1 January 1970, UTC.
+
+`TIMESTAMP` with precision `MICROS` is used for microsecond precision.
+It must annotate an `int64` that stores the number of
 microseconds from the Unix epoch, 00:00:00.000000 on 1 January 1970, UTC.
 
-The sort order used for `TIMESTAMP\_MICROS` is signed.
+The sort order used for `TIMESTAMP` is signed.
+
+#### Deprecated timestamp ConvertedType
+
+`TIMESTAMP_MILLIS` is the deprecated ConvertedType counterpart of the `TIMESTAMP` logical type
+with precision `MILLIS`. Like the logical type counterpart, it must annotate an `int64`.
+
+`TIMESTAMP_MICROS` is the deprecated ConvertedType counterpart of the `TIMESTAMP` logical type
+with precision `MICROS`. Like the logical type counterpart, it must annotate an `int64`.
+
+*Backward compatibility:*
+
+| ConvertedType | LogicalType |
+|---------------|-------------|
+| TIMESTAMP_MILLIS | TimestampType (isAdjustedToUTC = true, unit = MILLIS) |
+| TIMESTAMP_MICROS | TimestampType (isAdjustedToUTC = true, unit = MICROS) |
+
+*Forward compatibility:*
+
+<table>
+    <tr colspan="3">
+        <th colspan="3">LogicalType</th>
+        <th>ConvertedType</th>
+    </tr>
+    <tr>
+        <td rowspan="2" colspan="2">TimestampType</td>
+        <td>unit = MILLIS</td>
+        <td>TIMESTAMP_MILLIS</td>
+    </tr>
+    <tr>
+        <td>unit = MICROS</td>
+        <td>TIMESTAMP_MICROS</td>
+    </tr>
+</table>
 
 ### INTERVAL