Posted to dev@parquet.apache.org by "Zoltan Ivanfi (JIRA)" <ji...@apache.org> on 2018/02/13 10:43:00 UTC

[jira] [Issue Comment Deleted] (PARQUET-1065) Deprecate type-defined sort ordering for INT96 type

     [ https://issues.apache.org/jira/browse/PARQUET-1065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zoltan Ivanfi updated PARQUET-1065:
-----------------------------------
    Comment: was deleted

(was: The Parquet specification does not specify endianness (which I think should be addressed), but it defines data in terms of Thrift structures, and the language bindings (at least parquet-mr) use these Thrift structures directly for reading and writing. Based on the Thrift specification (and on some actual data files as well), these Thrift structures have a big-endian byte order. To quote from the [Integer encoding|https://github.com/apache/thrift/blob/master/doc/specs/thrift-binary-protocol.md#integer-encoding] section of the Thrift specification:

{quote}In the binary protocol integers are encoded with the most significant byte first (big endian byte order, aka network order). An int8 needs 1 byte, an int16 2, an int32 4 and an int64 needs 8 bytes.{quote}

Note, however, that there is no int96 type here, so its byte order really should be specified in Parquet Format. Given that all other int types have a big-endian byte order, I don't think any other choice would make sense for int96. (Parquet-tools already interprets int96 values according to this ordering.) Impala, however, simply writes the 12 bytes of its little-endian in-memory representation into the consecutive bytes of an int96, so the values are meaningless for less-than or greater-than comparisons.)
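The point about comparisons in the deleted comment can be illustrated with a minimal sketch. It assumes that a reader compares the 12 bytes of two values position by position, starting from the first byte, which matches numeric order only for a big-endian layout. The helper name `encode_12_bytes` is hypothetical and is not part of Parquet, Thrift, or Impala.

```python
def encode_12_bytes(value: int, byteorder: str) -> bytes:
    """Encode a non-negative integer into 12 bytes (hypothetical helper)."""
    return value.to_bytes(12, byteorder=byteorder, signed=False)


a, b = 1, 256  # numerically, a < b

# Little-endian: the least significant byte comes first.
le_a = encode_12_bytes(a, "little")
le_b = encode_12_bytes(b, "little")

# Big-endian: the most significant byte comes first.
be_a = encode_12_bytes(a, "big")
be_b = encode_12_bytes(b, "big")

# Python compares bytes objects position by position as unsigned bytes.
little_endian_preserves_order = (le_a < le_b) == (a < b)
big_endian_preserves_order = (be_a < be_b) == (a < b)

print("little-endian preserves order:", little_endian_preserves_order)
print("big-endian preserves order:", big_endian_preserves_order)
```

For this pair of values the byte-wise comparison agrees with the numeric order only in the big-endian case.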

> Deprecate type-defined sort ordering for INT96 type
> ---------------------------------------------------
>
>                 Key: PARQUET-1065
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1065
>             Project: Parquet
>          Issue Type: Bug
>            Reporter: Zoltan Ivanfi
>            Assignee: Zoltan Ivanfi
>            Priority: Major
>             Fix For: 1.10.0
>
>
> [parquet.thrift in parquet-format|https://github.com/apache/parquet-format/blob/041708da1af52e7cb9288c331b542aa25b68a2b6/src/main/thrift/parquet.thrift#L37] defines the sort order for INT96 to be signed. [ParquetMetadataConverter.java in parquet-mr|https://github.com/apache/parquet-mr/blob/352b906996f392030bfd53b93e3cf4adb78d1a55/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L422] uses unsigned ordering instead. In practice, INT96 is only used for timestamps, and neither signed nor unsigned ordering of the numeric values is correct for this purpose. For this reason, the INT96 sort order should be specified as undefined.
> (As a special case, min == max signifies that all values are the same, and can be considered valid even for undefined orderings.)
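The special case in the description can be sketched as a small check. This is only an illustration under stated assumptions: the function name `stats_usable` and its parameters are hypothetical and do not correspond to any parquet-mr or parquet-format API.

```python
def stats_usable(min_value: bytes, max_value: bytes, order_defined: bool) -> bool:
    """Decide whether min/max statistics can be trusted (hypothetical helper).

    With a defined sort order the statistics are usable as usual. With an
    undefined order they are usable only when min == max, because that
    equality says all values are identical regardless of any ordering.
    """
    if order_defined:
        return True
    return min_value == max_value


# Undefined ordering, different min and max: the statistics are not usable.
print(stats_usable(b"\x00" * 12, b"\x01" + b"\x00" * 11, order_defined=False))

# Undefined ordering, min == max: every value in the column chunk is the same.
print(stats_usable(b"\x07" * 12, b"\x07" * 12, order_defined=False))
```

Equality needs no ordering at all, which is why the min == max case stays valid even when the sort order is undefined.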



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)