Posted to issues@hive.apache.org by "Stamatis Zampetakis (Jira)" <ji...@apache.org> on 2021/05/11 15:22:00 UTC

[jira] [Commented] (HIVE-25104) Backward incompatible timestamp serialization in Parquet for certain timezones

    [ https://issues.apache.org/jira/browse/HIVE-25104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17342654#comment-17342654 ] 

Stamatis Zampetakis commented on HIVE-25104:
--------------------------------------------

In HIVE-20007/HIVE-12192, we switched to the new Java APIs for managing dates and timestamps (i.e., LocalDate, LocalDateTime, Instant, etc.); this is a positive change but it brings many backward compatibility problems. The one that we are hitting in this JIRA lies in the computation of time zone offsets from UTC, which is done differently by the old and new Java classes.

To be more precise, the new Java classes have elaborate rules to identify the offset for a given time zone, and this offset may differ from one year to another. Consider for instance the US/Pacific time zone: it now has an offset of -08:00:00 from UTC, but [before the year 1883|http://www.statoids.com/tus.html] the offset was -07:52:58. The old Java classes, which are in use in {{branch-2.3}} for instance, do not account for these differences and always use the same offset for a given time zone (e.g., -8:00 for US/Pacific). As a result, the value that we end up writing in Parquet is different and essentially depends on the year.
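
The offset difference described above can be reproduced with a minimal sketch like the following. It is only an illustration and not the code path Hive uses: the class name {{OffsetDemo}} and its helper methods are hypothetical, and {{TimeZone.getRawOffset()}} merely stands in for "one fixed offset per time zone", which may not be exactly how branch-2.3 computes it.

```java
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.util.TimeZone;

// Hypothetical demo class; not part of Hive.
class OffsetDemo {

    // Offset in seconds according to the java.time rules, evaluated at
    // midnight (local time) on January 1st of the given year.
    static int javaTimeOffsetSeconds(String zone, int year) {
        LocalDateTime local = LocalDateTime.of(year, 1, 1, 0, 0, 0);
        return ZoneId.of(zone).getRules().getOffset(local).getTotalSeconds();
    }

    // A single fixed offset for the zone, independent of the year.
    // Assumption: used here only to mimic a year-agnostic computation.
    static int fixedOffsetSeconds(String zone) {
        return TimeZone.getTimeZone(zone).getRawOffset() / 1000;
    }

    public static void main(String[] args) {
        String zone = "US/Pacific";
        for (int year : new int[] {1880, 1884, 1990}) {
            int modern = javaTimeOffsetSeconds(zone, year);
            int fixed = fixedOffsetSeconds(zone);
            System.out.println(year + ": java.time=" + modern
                + "s, fixed=" + fixed + "s, difference=" + (modern - fixed) + "s");
        }
    }
}
```

For 1880 the two offsets differ by 422 seconds (7 minutes and 2 seconds), which matches the gap between 1880-01-01 00:00:00 and 1879-12-31 23:52:58 in the issue description; for 1884 and 1990 they agree.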

> Backward incompatible timestamp serialization in Parquet for certain timezones
> ------------------------------------------------------------------------------
>
>                 Key: HIVE-25104
>                 URL: https://issues.apache.org/jira/browse/HIVE-25104
>             Project: Hive
>          Issue Type: Bug
>          Components: Serializers/Deserializers
>    Affects Versions: 3.1.2
>            Reporter: Stamatis Zampetakis
>            Assignee: Stamatis Zampetakis
>            Priority: Major
>
> HIVE-12192, HIVE-20007 changed the way that timestamp computations are performed and to some extent how timestamps are serialized and deserialized in files (Parquet, Avro, Orc).
> In versions that include HIVE-12192 or HIVE-20007 the serialization in Parquet files is not backward compatible. In other words, writing timestamps with a version of Hive that includes HIVE-12192/HIVE-20007 and reading them with another version (one that does not include these changes) may lead to different results depending on the default timezone of the system.
> Consider the following scenario where the default system timezone is set to US/Pacific.
> At apache/master commit 37f13b02dff94e310d77febd60f93d5a205254d3
> {code:sql}
> CREATE EXTERNAL TABLE employee(eid INT,birth timestamp) STORED AS PARQUET
>  LOCATION '/tmp/hiveexttbl/employee';
> INSERT INTO employee VALUES (1, '1880-01-01 00:00:00');
> INSERT INTO employee VALUES (2, '1884-01-01 00:00:00');
> INSERT INTO employee VALUES (3, '1990-01-01 00:00:00');
> SELECT * FROM employee;
> {code}
> |1|1880-01-01 00:00:00|
> |2|1884-01-01 00:00:00|
> |3|1990-01-01 00:00:00|
> At apache/branch-2.3 commit 324f9faf12d4b91a9359391810cb3312c004d356
> {code:sql}
> CREATE EXTERNAL TABLE employee(eid INT,birth timestamp) STORED AS PARQUET
>  LOCATION '/tmp/hiveexttbl/employee';
> SELECT * FROM employee;
> {code}
> |1|1879-12-31 23:52:58|
> |2|1884-01-01 00:00:00|
> |3|1990-01-01 00:00:00|
> The timestamp for {{eid=1}} in branch-2.3 is different from the one in master.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)