Posted to issues@hive.apache.org by "Ryan Blue (JIRA)" <ji...@apache.org> on 2016/02/04 01:38:39 UTC

[jira] [Commented] (HIVE-12767) Implement table property to address Parquet int96 timestamp bug

    [ https://issues.apache.org/jira/browse/HIVE-12767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15131453#comment-15131453 ] 

Ryan Blue commented on HIVE-12767:
----------------------------------

[~spena], the q tests from the last patch look great and make me confident that the create table and property behavior are correct. I still think that the time zone conversion isn't implementing what it should (see below). I also found a couple of minor things:

# Minor: The CTAS test says the zone property on the resulting table should be PST, but it is (and should be) UTC. I think the comment is wrong and the behavior is correct.
# It would be nice to validate that the table property always overrides the global option
# Is there a test that files are created with the table property in key/value metadata? I think that the copy from one table to another with a different property set wouldn't work otherwise, but I want to make sure.
# What is the Hive 2.0 test doing? It looks like it is validating that setting the table property to UTC converts values back to local time? Or was the file written with UTC? (If so, we should have one written in another zone.)

I think the Hive 2.0 test demonstrates that the time zone conversion isn't correct. The conversion is correct for the plain with-conversion and without-conversion cases, but it isn't a correct implementation of the meaning of the table property's time zone. In the Hive 2.0 test, the property is set to UTC, but the value is read in local time (if I'm reading this correctly).

Here's the problem: the back-end converts julian/nanos to year/month/day/hour values, then sets those on a calendar. It then uses the backing time in milliseconds to create a Timestamp. That timestamp is in _local_ time because of HIVE-12192, so to have the time _not_ adjusted, the calendar needs to also use the local zone. So using the local zone ends up not adjusting the value.
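To make the "no adjustment only in the local zone" effect concrete, here is a minimal, self-contained sketch (my own illustration, not Hive code; the helper name {{toTimestamp}} is hypothetical). It builds a {{Timestamp}} from the same wall-clock fields through calendars in two zones, assuming a PST default zone:

```java
import java.sql.Timestamp;
import java.util.Calendar;
import java.util.TimeZone;

public class CalendarZoneDemo {
    // Build a Timestamp from wall-clock fields interpreted in the given zone.
    static Timestamp toTimestamp(int y, int mo, int d, int h, int mi, int s, TimeZone zone) {
        Calendar cal = Calendar.getInstance(zone);
        cal.clear();
        cal.set(y, mo - 1, d, h, mi, s); // Calendar months are 0-based
        return new Timestamp(cal.getTimeInMillis());
    }

    public static void main(String[] args) {
        TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles")); // pretend the local zone is PST
        Timestamp viaLocal = toTimestamp(2016, 2, 3, 12, 0, 0, TimeZone.getDefault());
        Timestamp viaUtc   = toTimestamp(2016, 2, 3, 12, 0, 0, TimeZone.getTimeZone("UTC"));
        // Timestamp.toString() renders in the JVM default zone (the HIVE-12192 behavior),
        // so only the local-zone calendar reproduces the original fields unchanged.
        System.out.println(viaLocal); // 2016-02-03 12:00:00.0
        System.out.println(viaUtc);   // 2016-02-03 04:00:00.0 (shifted by the PST offset)
    }
}
```

The same fields come back unchanged only when the calendar zone equals the JVM default zone, which is why the patch's use of the local zone reads as "no conversion".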

The problem is that the current patch uses this implementation to adjust values to the table property zone. That doesn't work because the "no offset" zone for the table property is UTC, not the local zone. I'm attaching a file that demonstrates the problem. It uses [Spark's implementation of fromJulianDay|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L180] to demonstrate what I'm talking about. I chose a date/time at random and tested out the two implementations.
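For reference, the linked Spark conversion is zone-free: it maps the int96 layout (julian day plus nanos-in-day) straight to an instant, so its output is naturally "UTC". Roughly, a Java transliteration (the epoch constant 2440588 is Spark's {{JULIAN_DAY_OF_EPOCH}}; the class and method packaging here are mine):

```java
public class FromJulianDay {
    static final int JULIAN_DAY_OF_EPOCH = 2440588; // julian day of 1970-01-01
    static final long MICROS_PER_DAY = 24L * 60 * 60 * 1000 * 1000;

    // Convert a julian day + nanos-in-day (the int96 layout) to microseconds
    // since the epoch. No time zone is involved; the result is an absolute instant.
    static long fromJulianDay(int days, long nanos) {
        return (days - JULIAN_DAY_OF_EPOCH) * MICROS_PER_DAY + nanos / 1000L;
    }

    public static void main(String[] args) {
        System.out.println(fromJulianDay(JULIAN_DAY_OF_EPOCH, 0L)); // 0 == 1970-01-01T00:00:00Z
    }
}
```

Because no zone enters the calculation, comparing NanoTimeUtils output against fromJulianDay is a clean way to check which zone NanoTimeUtils is actually treating as "no offset".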

Tests that show _correct_ behavior:
* {{testKnownTimestampWithoutConversion}} shows that when UTC is used, the fromJulianDay value is the same as the "skip conversion" value returned by NanoTimeUtils.
* {{testKnownTimestampWithConversion}} shows that when the local zone is used (for me, PST), the fromJulianDay value is the same as the "do not skip conversion" value returned by NanoTimeUtils.

Tests that show _incorrect_ behavior:
* {{testKnownWithZoneArgumentUTC}} shows that passing a UTC calendar to NanoTimeUtils does not match the fromJulianDay/UTC value that matched the "skip conversion" in tests above. This doesn't pass, but should.
* {{testKnownWithWrongZone}} shows that passing a PST calendar to NanoTimeUtils matches the fromJulianDay/UTC value that corresponds to "skip conversion". This shouldn't pass, but does.
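The identity requirement behind these tests can be sketched as follows (again my own illustration, not the patch's NanoTimeUtils; the {{adjust}} helper and its sign convention are assumptions, and only the UTC-is-identity property is the point):

```java
import java.sql.Timestamp;
import java.util.TimeZone;

public class TableZoneAdjustment {
    // Sketch of the property's intended semantics: the stored int96 instant is
    // taken at face value when the table zone is UTC, and shifted by the table
    // zone's offset otherwise. (Hypothetical helper; not Hive's NanoTimeUtils.)
    static Timestamp adjust(long millisUtc, TimeZone tableZone) {
        return new Timestamp(millisUtc - tableZone.getOffset(millisUtc));
    }

    public static void main(String[] args) {
        long t = 1454457600000L; // 2016-02-03 00:00:00 UTC, chosen arbitrarily
        // UTC must be the no-adjustment zone:
        System.out.println(adjust(t, TimeZone.getTimeZone("UTC")).getTime() == t); // true
        // A non-UTC zone shifts the value by its offset:
        System.out.println(adjust(t, TimeZone.getTimeZone("America/Los_Angeles")).getTime() - t); // 28800000 (8h)
    }
}
```

Under this reading, a UTC calendar argument should reproduce the "skip conversion" value and a PST calendar should not, which is the opposite of what the two tests above observe.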

I'm attaching my test to this and can help if you have any questions about it.

> Implement table property to address Parquet int96 timestamp bug
> ---------------------------------------------------------------
>
>                 Key: HIVE-12767
>                 URL: https://issues.apache.org/jira/browse/HIVE-12767
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 1.2.1, 2.0.0
>            Reporter: Sergio Peña
>            Assignee: Sergio Peña
>         Attachments: HIVE-12767.3.patch, HIVE-12767.4.patch
>
>
> Parquet timestamps using INT96 are not compatible with other tools, like Impala, because Hive adjusts time zone values in a different way than Impala does.
> To address such issues, a new table property (parquet.mr.int96.write.zone) must be used in Hive that determines which time zone to use when writing and reading timestamps from Parquet.
> The following is the exit criteria for the fix:
> * Hive will read Parquet MR int96 timestamp data and adjust values using a time zone from a table property, if set, or using the local time zone if it is absent. No adjustment will be applied to data written by Impala.
> * Hive will write Parquet int96 timestamps using a time zone adjustment from the same table property, if set, or using the local time zone if it is absent. This keeps the data in the table consistent.
> * New tables created by Hive will set the table property to UTC if the global option to set the property for new tables is enabled.
> ** Tables created using CREATE TABLE and CREATE TABLE LIKE FILE will not set the property unless the global setting to do so is enabled.
> ** Tables created using CREATE TABLE LIKE <OTHER TABLE> will copy the property of the table that is copied.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)