You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hive.apache.org by "Cheng Lian (JIRA)" <ji...@apache.org> on 2015/08/25 12:19:46 UTC

[jira] [Commented] (HIVE-6394) Implement Timestmap in ParquetSerde

    [ https://issues.apache.org/jira/browse/HIVE-6394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14711046#comment-14711046 ] 

Cheng Lian commented on HIVE-6394:
----------------------------------

While testing Spark SQL 1.5-SNAPSHOT for Parquet/Hive compatibility, we hit SPARK-10177.  In short, Spark SQL and Hive both have their own Julian date conversion code, and their results don't match.  Currently, we've fixed this issue by making Spark SQL behave the same as Hive so that we can interoperate (see [Spark PR #8400|https://github.com/apache/spark/pull/8400]).  However, Hive's behavior looks a little bit weird to me: when converting a calendar timestamp to a Julian timestamp, Hive always gives a result 12 hours later than the expected result.

This behavior can be verified by the following Spark 1.5-SNAPSHOT shell snippet (I'm using Spark 1.5-SNAPSHOT shell since it comes with Hive 1.2.1 dependencies):
{code}
import java.sql._
import java.util._

import org.apache.hadoop.hive.ql.io.parquet.timestamp._
import org.apache.spark.sql.catalyst.util._

TimeZone.setDefault(TimeZone.getTimeZone("GMT"))
val timestamp = Timestamp.valueOf("1970-00-00 00:00:00")

val hiveNanoTime = NanoTimeUtils.getNanoTime(timestamp, false)
val hiveJulianDay = hiveNanoTime.getJulianDay
val hiveTimeOfDayNanos = hiveNanoTime.getTimeOfDayNanos

println(
  s"""Hive converts "$timestamp" to Julian timestamp:
     |(julianDay=$hiveJulianDay, timeOfDayNanos=$hiveTimeOfDayNanos)
   """.stripMargin)
{code}
The result is:
{noformat}
Hive converts "1970-01-01 00:00:00.0" to Julian timestamp:
(julianDay=2440588, timeOfDayNanos=0)
{noformat}
According to definition on [this page|http://aa.usno.navy.mil/data/docs/JulianDate.php], Julian dates count from noon.  Namely "00:00:00" of any calendar date must map to a Julian timestamp with a fraction of 0.5, i.e. an integral date plus 12 hours.  And the correct Julian timestamp given by the converter in the aforementioned page is "2440587.500000", which is equivalent to:
{noformat}
(julianDay=2440587, timeOfDayNanos=43200000000000)
{noformat}
This means, INT96 timestamp values stored in Parquet files written by Hive all have a 12 hr offset.  (I haven't tried to verify this issue against Impala.)

This shouldn't a big problem though, as long as the read path always correctly decode written timestamp values.  Just curious, is this 12 hr offset intentional?


> Implement Timestmap in ParquetSerde
> -----------------------------------
>
>                 Key: HIVE-6394
>                 URL: https://issues.apache.org/jira/browse/HIVE-6394
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Serializers/Deserializers
>            Reporter: Jarek Jarcec Cecho
>            Assignee: Szehon Ho
>              Labels: Parquet
>             Fix For: 0.14.0
>
>         Attachments: HIVE-6394.2.patch, HIVE-6394.3.patch, HIVE-6394.4.patch, HIVE-6394.5.patch, HIVE-6394.6.patch, HIVE-6394.6.patch, HIVE-6394.7.patch, HIVE-6394.patch
>
>
> This JIRA is to implement timestamp support in Parquet SerDe.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)