Posted to issues@hive.apache.org by "Karen Coppage (Jira)" <ji...@apache.org> on 2019/10/25 07:11:00 UTC

[jira] [Commented] (HIVE-22006) Hive parquet timestamp compatibility, part 2

    [ https://issues.apache.org/jira/browse/HIVE-22006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16959495#comment-16959495 ] 

Karen Coppage commented on HIVE-22006:
--------------------------------------

Hi [~h-vetinari],

Unfortunately, introducing a switch (and turning it on) or simply changing timestamp writing to be time zone agnostic would make all previously written timestamp data unusable.
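
To illustrate the concern, here is a minimal Java sketch under a simplified model: it assumes the existing write path normalizes a wall-clock timestamp to UTC using the writer's time zone, and the existing read path converts it back. The class name, method names, and the fixed +02:00 offset are hypothetical and are not Hive APIs. It only shows why data written with conversion would appear shifted if it were later read without conversion.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;

public class TimestampConversionSketch {

    // Assumed legacy-style write: interpret the wall-clock value in the
    // writer's zone and store the corresponding UTC instant.
    public static Instant writeWithConversion(LocalDateTime wallClock, ZoneId writerZone) {
        return wallClock.atZone(writerZone).toInstant();
    }

    // Assumed legacy-style read: convert the stored instant back into the
    // reader's zone, recovering the original wall-clock value.
    public static LocalDateTime readWithConversion(Instant stored, ZoneId readerZone) {
        return LocalDateTime.ofInstant(stored, readerZone);
    }

    // Time zone agnostic read: take the stored value as-is, with no shift.
    public static LocalDateTime readWithoutConversion(Instant stored) {
        return LocalDateTime.ofInstant(stored, ZoneOffset.UTC);
    }

    public static void main(String[] args) {
        ZoneId zone = ZoneOffset.ofHours(2); // hypothetical writer/reader zone
        LocalDateTime original = LocalDateTime.of(2019, 10, 25, 7, 11, 0);

        Instant stored = writeWithConversion(original, zone);

        // Same conversion on both sides: the value round-trips.
        System.out.println(readWithConversion(stored, zone)); // 2019-10-25T07:11

        // Old data read by a non-converting reader: off by the zone offset.
        System.out.println(readWithoutConversion(stored));    // 2019-10-25T05:11
    }
}
```

In this model, a file alone does not reveal whether its values were converted on write, which is why metadata marking a timestamp as time zone agnostic matters for backwards compatibility.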

[~kuczoram] and I have worked on patches (HIVE-21050, HIVE-21215, HIVE-21216) that would introduce the option to store Parquet timestamps in a logical type that includes metadata indicating that the timestamp is time zone agnostic, without breaking backwards compatibility (Hive would correctly read previously written timestamps). Sadly, we cannot commit these patches until Parquet 1.11 is released. Impala also has an implementation for this waiting in the wings. If Parquet 1.11 were to be released, and Spark were to also implement the feature, then Hive/Impala/Spark could safely work on the same Parquet data, as you said.

I'm not sure about ORC. Timestamps stored as text have always been time zone agnostic.

tl;dr, there is a backwards compatible solution for Parquet; it's currently blocked by the Parquet community.

> Hive parquet timestamp compatibility, part 2
> --------------------------------------------
>
>                 Key: HIVE-22006
>                 URL: https://issues.apache.org/jira/browse/HIVE-22006
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: All Versions
>            Reporter: H. Vetinari
>            Priority: Major
>
> The interaction between HIVE / IMPALA / SPARK writing timestamps is a major source of headaches in every scenario where such interaction cannot be avoided.
> HIVE-9482 added hive.parquet.timestamp.skip.conversion, which *only* affects the *reading* of timestamps.
> It formulates the next steps as:
> > Later fix will change the write path to not convert, and stop the read-conversion even for files written by Hive itself.
> At the very least, HIVE needs a switch to also turn off the conversion on writes. That would allow a setup where all three of HIVE / IMPALA / SPARK can be configured not to convert on read/write, and can hence safely work on the same data.


