Posted to dev@parquet.apache.org by "Lars Volker (JIRA)" <ji...@apache.org> on 2018/01/09 11:21:00 UTC

[jira] [Commented] (PARQUET-1065) Deprecate type-defined sort ordering for INT96 type

    [ https://issues.apache.org/jira/browse/PARQUET-1065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16318280#comment-16318280 ] 

Lars Volker commented on PARQUET-1065:
--------------------------------------

My understanding is that the primitive types (INT32, INT64) use little-endian byte order. INT96 might do the same, though this is not documented explicitly in parquet.thrift. Both fields in INT96 timestamps (time and date) are encoded as little-endian, too, so interpreting the resulting 12 bytes as an unsigned 12-byte little-endian integer should give the correct order, no?

An 8-byte time value with bytes T0..T7 and a 4-byte date with bytes D0..D3 would be stored as in the example below. Memory addresses increase to the right; the first row is a 12-byte integer in little-endian order:

|I0|I1|I2|I3|I4|I5|I6|I7|I8|I9|I10|I11|
|T0|T1|T2|T3|T4|T5|T6|T7|D0|D1|D2|D3|

Comparing the resulting value as an int96 would compare the most significant byte first, which is stored at the highest address (I11, D3). Logically, this compares by date first, then by time.
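
To make the argument concrete, here is a minimal Python sketch. The helper names encode_int96 and as_unsigned_le are hypothetical and not part of any Parquet library, the sample day numbers are arbitrary, and the sketch assumes both fields are non-negative. It only illustrates the layout described above; it does not show how any existing reader or writer compares INT96 values.

```python
import struct

def encode_int96(nanos_of_day: int, julian_day: int) -> bytes:
    """Pack an 8-byte little-endian time followed by a 4-byte little-endian date.

    Hypothetical helper; assumes both values are non-negative.
    """
    return struct.pack("<QI", nanos_of_day, julian_day)

def as_unsigned_le(raw: bytes) -> int:
    """Interpret the 12 bytes as one unsigned little-endian integer."""
    return int.from_bytes(raw, byteorder="little", signed=False)

# The last nanosecond of one day versus the first nanosecond of the next day.
early = encode_int96(86_399_999_999_999, 2458127)
late = encode_int96(0, 2458128)

# The date occupies the most significant bytes, so it dominates the comparison.
assert as_unsigned_le(early) < as_unsigned_le(late)

# Sorting by the 12-byte integer matches sorting by (date, time).
samples = [(0, 2458128), (86_399_999_999_999, 2458127), (1, 2458128)]
by_int96 = sorted(samples, key=lambda s: as_unsigned_le(encode_int96(*s)))
by_date_time = sorted(samples, key=lambda s: (s[1], s[0]))
assert by_int96 == by_date_time
```

Under these assumptions the integer value is simply (date << 64) | time, which is why the unsigned comparison orders by date first. Negative values in either field would need separate consideration.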

[~zi] - Am I missing something?

> Deprecate type-defined sort ordering for INT96 type
> ---------------------------------------------------
>
>                 Key: PARQUET-1065
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1065
>             Project: Parquet
>          Issue Type: Bug
>            Reporter: Zoltan Ivanfi
>            Assignee: Zoltan Ivanfi
>
> [parquet.thrift in parquet-format|https://github.com/apache/parquet-format/blob/041708da1af52e7cb9288c331b542aa25b68a2b6/src/main/thrift/parquet.thrift#L37] defines the sort order for INT96 to be signed. [ParquetMetadataConverter.java in parquet-mr|https://github.com/apache/parquet-mr/blob/352b906996f392030bfd53b93e3cf4adb78d1a55/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L422] uses unsigned ordering instead. In practice, INT96 is only used for timestamps and neither signed nor unsigned ordering of the numeric values is correct for this purpose. For this reason, the INT96 sort order should be specified as undefined.
> (As a special case, min == max signifies that all values are the same, and can be considered valid even for undefined orderings.)


