Posted to issues@hive.apache.org by "Zoltan Ivanfi (JIRA)" <ji...@apache.org> on 2019/01/14 15:55:00 UTC

[jira] [Comment Edited] (HIVE-20980) Reinstate Parquet timestamp conversion between HS2 time zone and UTC

    [ https://issues.apache.org/jira/browse/HIVE-20980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16742233#comment-16742233 ] 

Zoltan Ivanfi edited comment on HIVE-20980 at 1/14/19 3:54 PM:
---------------------------------------------------------------

[~jcamachorodriguez] The addition of session-local time zones was orthogonal to the semantics change and it seemed to make sense to restore the timezone-aware semantics based on the session-local time zone rather than the server time zone. That being said, I do not have a strong preference towards either one, so if you prefer one over the other, we are fine with your choice.

There is indeed an isAdjustedToUTC parameter in parquet-format, which will be made available in the upcoming parquet-mr 1.11.0 release. It is also one of the reasons why I would prefer the TIMESTAMP and TIMESTAMP WITHOUT TIME ZONE types to behave differently for Parquet. The isAdjustedToUTC parameter annotates int64 timestamps, while previously we used int96 timestamps. Writing int64 timestamps is a breaking change in itself, so it should only be done at the user's explicit request. However, a configuration switch would not suffice for this purpose, because the need to write backwards-compatible int96 timestamps for any single table would prevent every other table from using the new int64 timestamps as well.

At the same time, introducing new semantics for timestamps breaks the existing rule that an int96 written by Impala is LocalDateTime but an int96 written by Hive or Spark is Instant. To prevent further confusion, the new semantics should never be written into int96 timestamps, only int64 ones, because the latter allow saving semantics metadata in the isAdjustedToUTC type parameter.
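
To illustrate the difference between the two int96 interpretations mentioned above, here is a minimal Java sketch using java.time. It is not Hive, Impala or Spark code; the class and method names are hypothetical, and the stored value is simply modelled as a date-time without any zone information.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;

// Hypothetical illustration only: models how the same stored value is
// presented to a reader under the two semantics.
public class TimestampSemantics {

    // LocalDateTime semantics: the stored wall-clock value is returned
    // unchanged, whatever the reader's time zone is.
    static LocalDateTime readAsLocalDateTime(LocalDateTime stored, ZoneId readerZone) {
        return stored;
    }

    // Instant semantics: the stored value is treated as UTC-normalized and
    // is shifted into the reader's time zone for display.
    static LocalDateTime readAsInstant(LocalDateTime storedUtc, ZoneId readerZone) {
        Instant instant = storedUtc.toInstant(ZoneOffset.UTC);
        return LocalDateTime.ofInstant(instant, readerZone);
    }

    public static void main(String[] args) {
        LocalDateTime stored = LocalDateTime.of(2019, 1, 14, 15, 55);
        ZoneId budapest = ZoneId.of("Europe/Budapest"); // UTC+1 in January

        System.out.println(readAsLocalDateTime(stored, budapest)); // 2019-01-14T15:55
        System.out.println(readAsInstant(stored, budapest));       // 2019-01-14T16:55
    }
}
```

The same bytes yield two different displayed values, which is why a reader has to know which semantics the writer used.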

Having the old TIMESTAMP type behave in the legacy way and writing only int64 timestamps with the new TIMESTAMP WITH LOCAL TIME ZONE type resolves these two problems in a nice way. (Please see [this appendix|https://docs.google.com/document/d/1gNRww9mZJcHvUDCXklzjFEQGpefsuR_akCDfWsdE35Q/edit#heading=h.gonr2yqv3e77] of the proposal for details.) It is true that TIMESTAMP will behave differently between different file formats again, but that inconsistency has historically been a part of Hive and fixing that would be a breaking change.
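
The per-type choice described above could be sketched roughly as follows. This is only an illustration of the idea that the SQL type, rather than a global configuration switch, selects the physical encoding; the class, enum and method names are hypothetical and do not correspond to the Hive or parquet-mr APIs.

```java
// Hypothetical illustration only: not the Hive or parquet-mr API.
public class ParquetTimestampEncoding {

    enum PhysicalType { INT96, INT64 }

    static final class Encoding {
        final PhysicalType physicalType;
        // null means "no annotation": int96 cannot carry isAdjustedToUTC.
        final Boolean isAdjustedToUTC;

        Encoding(PhysicalType physicalType, Boolean isAdjustedToUTC) {
            this.physicalType = physicalType;
            this.isAdjustedToUTC = isAdjustedToUTC;
        }
    }

    // The column's SQL type decides the encoding, so legacy and new
    // tables can coexist without a server-wide switch.
    static Encoding encodingFor(String sqlType) {
        switch (sqlType) {
            case "TIMESTAMP":
                // Legacy behaviour: keep writing backwards-compatible int96.
                return new Encoding(PhysicalType.INT96, null);
            case "TIMESTAMP WITH LOCAL TIME ZONE":
                // New semantics: int64 with the semantics recorded in metadata.
                return new Encoding(PhysicalType.INT64, Boolean.TRUE);
            default:
                throw new IllegalArgumentException("Not covered by this sketch: " + sqlType);
        }
    }

    public static void main(String[] args) {
        Encoding legacy = encodingFor("TIMESTAMP");
        Encoding local = encodingFor("TIMESTAMP WITH LOCAL TIME ZONE");
        System.out.println(legacy.physicalType + " " + legacy.isAdjustedToUTC);
        System.out.println(local.physicalType + " " + local.isAdjustedToUTC);
    }
}
```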


> Reinstate Parquet timestamp conversion between HS2 time zone and UTC
> --------------------------------------------------------------------
>
>                 Key: HIVE-20980
>                 URL: https://issues.apache.org/jira/browse/HIVE-20980
>             Project: Hive
>          Issue Type: Sub-task
>          Components: File Formats
>            Reporter: Karen Coppage
>            Assignee: Karen Coppage
>            Priority: Major
>         Attachments: HIVE-20980.1.patch, HIVE-20980.2.patch, HIVE-20980.2.patch
>
>
> With HIVE-20007, Parquet timestamps became timezone-agnostic. This means that timestamps written after the change are read exactly as they were written; but timestamps stored before this change are effectively converted from the writing HS2 server time zone to GMT time zone. This patch reinstates the original behavior: timestamps are converted to UTC before write and from UTC before read.
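
The conversion described in the issue summary can be sketched with java.time as below. This is a minimal illustration under the assumption that the writer and the reader each apply their own time zone; the class and method names are hypothetical and this is not the code of the attached patches.

```java
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;

// Hypothetical illustration only: not taken from the HIVE-20980 patches.
public class UtcRoundTrip {

    // Before write: interpret the wall-clock value in the writer's zone
    // and normalize it to UTC.
    static LocalDateTime toUtcBeforeWrite(LocalDateTime local, ZoneId writerZone) {
        return local.atZone(writerZone)
                    .withZoneSameInstant(ZoneOffset.UTC)
                    .toLocalDateTime();
    }

    // On read: treat the stored value as UTC and shift it into the
    // reader's zone.
    static LocalDateTime fromUtcOnRead(LocalDateTime storedUtc, ZoneId readerZone) {
        return storedUtc.atZone(ZoneOffset.UTC)
                        .withZoneSameInstant(readerZone)
                        .toLocalDateTime();
    }

    public static void main(String[] args) {
        ZoneId writer = ZoneId.of("Europe/Budapest");      // UTC+1 in January
        ZoneId reader = ZoneId.of("America/Los_Angeles");  // UTC-8 in January

        LocalDateTime written = LocalDateTime.of(2019, 1, 14, 15, 55);
        LocalDateTime stored = toUtcBeforeWrite(written, writer);

        System.out.println(stored);                        // 2019-01-14T14:55
        System.out.println(fromUtcOnRead(stored, writer)); // 2019-01-14T15:55
        System.out.println(fromUtcOnRead(stored, reader)); // 2019-01-14T06:55
    }
}
```

A reader in the writer's zone gets the original wall-clock value back, while a reader in another zone sees the same instant expressed in its own local time.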


