Posted to issues@hive.apache.org by "Stamatis Zampetakis (Jira)" <ji...@apache.org> on 2021/05/14 09:34:00 UTC

[jira] [Comment Edited] (HIVE-25104) Backward incompatible timestamp serialization in Parquet for certain timezones

    [ https://issues.apache.org/jira/browse/HIVE-25104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17344489#comment-17344489 ] 

Stamatis Zampetakis edited comment on HIVE-25104 at 5/14/21, 9:33 AM:
----------------------------------------------------------------------

In order to make timestamp serialization backwards compatible we need to write data to the Parquet file in exactly the same way as before. This can be achieved either by using the legacy code and the old date/time APIs or by adapting the new code to simulate the old behavior.
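To make the difference concrete, here is a minimal sketch (plain Java, not Hive's actual Parquet writer code) that runs the wall-clock value from the report, '1880-01-01 00:00:00' in US/Pacific, through the legacy java.sql/java.util path and through the java.time path, so the two rule sets can be compared directly:

{code:java}
import java.sql.Timestamp;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.util.TimeZone;

public class ConversionRulesSketch {
  public static void main(String[] args) {
    // Simulate the writer zone used in the report.
    TimeZone.setDefault(TimeZone.getTimeZone("US/Pacific"));

    // Old rules: java.sql.Timestamp interprets the string in the JVM default
    // zone using the legacy java.util calendar/zone machinery.
    long legacyMillis = Timestamp.valueOf("1880-01-01 00:00:00").getTime();

    // New rules: java.time interprets the same wall-clock value through
    // ZoneRules (proleptic Gregorian calendar, historical offsets).
    long modernMillis = LocalDateTime.parse("1880-01-01T00:00:00")
        .atZone(ZoneId.of("US/Pacific"))
        .toInstant()
        .toEpochMilli();

    // The 7 minute 2 second shift in the report suggests that for US/Pacific
    // in 1880 one rule set applies the historical local mean time offset
    // (-07:52:58) and the other the standard -08:00 offset; printing both
    // shows whether the two paths agree on a given JDK/tzdata combination.
    System.out.println("legacy=" + legacyMillis + " modern=" + modernMillis
        + " diff(ms)=" + (legacyMillis - modernMillis));
  }
}
{code}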

The first question that pops up is whether we should retain both serialization options (old rules vs new rules) or keep only one of them.

+Keep only old conversion rules+
 Pros
 * Backwards compatible with Hive versions before 3.1.0
 * Decreased maintenance cost and simpler implementation
 * Fewer configuration options

 Cons
 * Backwards incompatible with Hive 3.1.[0-2], where HIVE-12192 is already released
 * Complicated forward compatibility, since most new tools (using Java 8+) use the new Java time APIs and rely on the new rules by default
 * Performance impact if we don't rewrite the old rules using the new APIs

+Keep only new conversion rules+
 Pros
 * Backwards compatible with Hive versions 3.1.0 and later
 * Decreased maintenance cost and simpler implementation
 * Fewer configuration options
 * Efficient implementation
 * Better integration with newer tools (using Java 8+) reading Parquet files

 Cons
 * Backwards incompatible with Hive versions before 3.1.0

+Keep both conversion rules and control via properties+
 Pros
 * Backwards compatible with all Hive versions as long as the properties are set correctly

 Cons
 * Increased maintenance cost due to the presence of multiple rules
 * More configuration options that the user may need to set differently depending on the input files

Keeping only one of them makes sense mostly if we also settle on a single option when we deserialize (read) the data; otherwise the code simplicity, maintenance, and other advantages mentioned above become rather minor.

Shifting the weight to backward compatibility, the most compelling option is to keep both the old and the new conversion rules and provide properties to control how we read/write timestamp data.
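As a rough illustration of that direction, a property-gated write-side conversion could look roughly like the sketch below; the flag name and the helper method are placeholders invented for the example, not an existing Hive configuration or API:

{code:java}
import java.sql.Timestamp;
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.util.TimeZone;

public class DualRuleTimestampWriterSketch {

  // Placeholder flag name, for illustration only (not a real Hive property).
  static final String LEGACY_WRITE_PROP = "example.parquet.timestamp.write.legacy.conversion";

  /** Converts a wall-clock timestamp to UTC epoch millis using the selected rules. */
  static long toUtcMillis(LocalDateTime wallClock, ZoneId writerZone, boolean legacyRules) {
    if (legacyRules) {
      // Old rules: Timestamp.valueOf(LocalDateTime) goes through the legacy
      // java.util calendar/zone machinery in the JVM default zone.
      TimeZone previous = TimeZone.getDefault();
      try {
        TimeZone.setDefault(TimeZone.getTimeZone(writerZone));
        return Timestamp.valueOf(wallClock).getTime();
      } finally {
        TimeZone.setDefault(previous);
      }
    }
    // New rules: java.time with proleptic Gregorian calendar and ZoneRules.
    return wallClock.atZone(writerZone).toInstant().toEpochMilli();
  }

  public static void main(String[] args) {
    boolean legacy = Boolean.parseBoolean(System.getProperty(LEGACY_WRITE_PROP, "false"));
    LocalDateTime ts = LocalDateTime.of(1880, 1, 1, 0, 0, 0);
    System.out.println(Instant.ofEpochMilli(toUtcMillis(ts, ZoneId.of("US/Pacific"), legacy)));
  }
}
{code}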


> Backward incompatible timestamp serialization in Parquet for certain timezones
> ------------------------------------------------------------------------------
>
>                 Key: HIVE-25104
>                 URL: https://issues.apache.org/jira/browse/HIVE-25104
>             Project: Hive
>          Issue Type: Bug
>          Components: Serializers/Deserializers
>    Affects Versions: 3.1.2
>            Reporter: Stamatis Zampetakis
>            Assignee: Stamatis Zampetakis
>            Priority: Major
>
> HIVE-12192 and HIVE-20007 changed the way that timestamp computations are performed and, to some extent, how timestamps are serialized and deserialized in files (Parquet, Avro, Orc).
> In versions that include HIVE-12192 or HIVE-20007 the serialization in Parquet files is not backwards compatible. In other words, writing timestamps with a version of Hive that includes HIVE-12192/HIVE-20007 and reading them with a version that does not may lead to different results depending on the default timezone of the system.
> Consider the following scenario where the default system timezone is set to US/Pacific.
> At apache/master commit 37f13b02dff94e310d77febd60f93d5a205254d3
> {code:sql}
> CREATE EXTERNAL TABLE employee(eid INT,birth timestamp) STORED AS PARQUET
>  LOCATION '/tmp/hiveexttbl/employee';
> INSERT INTO employee VALUES (1, '1880-01-01 00:00:00');
> INSERT INTO employee VALUES (2, '1884-01-01 00:00:00');
> INSERT INTO employee VALUES (3, '1990-01-01 00:00:00');
> SELECT * FROM employee;
> {code}
> |1|1880-01-01 00:00:00|
> |2|1884-01-01 00:00:00|
> |3|1990-01-01 00:00:00|
> At apache/branch-2.3 commit 324f9faf12d4b91a9359391810cb3312c004d356
> {code:sql}
> CREATE EXTERNAL TABLE employee(eid INT,birth timestamp) STORED AS PARQUET
>  LOCATION '/tmp/hiveexttbl/employee';
> SELECT * FROM employee;
> {code}
> |1|1879-12-31 23:52:58|
> |2|1884-01-01 00:00:00|
> |3|1990-01-01 00:00:00|
> The timestamp for {{eid=1}} in branch-2.3 is different from the one in master.
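
The pattern in the two result tables lines up with the historical offset data for US/Pacific: only the 1880 value predates the November 1883 switch from local mean time (-07:52:58) to standard time (-08:00), which matches the 7 minute 2 second shift seen for eid=1. A quick check of the zone rules with java.time (illustration only, not Hive code):

{code:java}
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.zone.ZoneRules;

public class PacificOffsetCheck {
  public static void main(String[] args) {
    ZoneRules rules = ZoneId.of("US/Pacific").getRules();
    LocalDateTime[] samples = {
        LocalDateTime.of(1880, 1, 1, 0, 0),  // before standard time: LMT, -07:52:58
        LocalDateTime.of(1884, 1, 1, 0, 0),  // after the 1883 switch: PST, -08:00
        LocalDateTime.of(1990, 1, 1, 0, 0)   // modern rules: PST, -08:00
    };
    for (LocalDateTime ldt : samples) {
      // ZoneRules.getOffset(LocalDateTime) returns the offset in effect
      // for that wall-clock value under the current tzdata.
      System.out.println(ldt + " -> " + rules.getOffset(ldt));
    }
  }
}
{code}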



--
This message was sent by Atlassian Jira
(v8.3.4#803005)