Posted to issues@spark.apache.org by "Pablo Langa Blanco (JIRA)" <ji...@apache.org> on 2019/01/03 21:26:00 UTC

[jira] [Commented] (SPARK-25873) Date corruption when Spark and Hive both are on different timezones

    [ https://issues.apache.org/jira/browse/SPARK-25873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733538#comment-16733538 ] 

Pablo Langa Blanco commented on SPARK-25873:
--------------------------------------------

Hi [~pawanlawale]

Timestamps in Hive are stored as long values with no timezone attached to them. It makes no sense for Spark to pick up the timezone of the remote cluster, because the timestamps could have been generated in timezones different from the Hive cluster's own.
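
As a quick illustration in plain Scala (not from the original thread; the epoch value is arbitrary), the stored long denotes one absolute instant, and only the zone chosen at display time changes the wall-clock reading:

{code}
import java.time.{Instant, ZoneId}

// One absolute instant, stored as a long (milliseconds since the epoch)
val epochMillis = 0L // 1970-01-01T00:00:00Z

// The same long rendered in two different timezones
println(Instant.ofEpochMilli(epochMillis).atZone(ZoneId.of("America/New_York")))
// -> 1969-12-31T19:00-05:00[America/New_York]
println(Instant.ofEpochMilli(epochMillis).atZone(ZoneId.of("America/Los_Angeles")))
// -> 1969-12-31T16:00-08:00[America/Los_Angeles]
{code}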

The best option is for the application to manage its own timezone for each timestamp, if that is required. Don't you think?
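
For example, Spark lets an application pin the session timezone used for timestamp/string conversion (a sketch of one way an application could manage its own timezone; {{spark.sql.session.timeZone}} is available from Spark 2.2 onward, and this illustrates the idea rather than being a confirmed fix for this issue):

{code}
// Pin the timezone used for timestamp/string conversion in this session,
// instead of relying on the JVM default of whichever machine runs Spark
sqlContext.setConf("spark.sql.session.timeZone", "America/New_York")
sqlContext.sql("SELECT name, dob FROM t_src").show(false)
{code}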

Regards

Pablo

> Date corruption when Spark and Hive both are on different timezones
> -------------------------------------------------------------------
>
>                 Key: SPARK-25873
>                 URL: https://issues.apache.org/jira/browse/SPARK-25873
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, Spark Shell, Spark Submit
>    Affects Versions: 2.2.1
>            Reporter: Pawan
>            Priority: Major
>
> Dates are altered when loading them from one table to another in Hive through Spark. This happens when Hive is on a remote machine whose timezone differs from the one Spark runs in, and only when the source table uses the 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' format.
> Below are the steps to reproduce the issue:
> 1. Create two tables as below in Hive, on a cluster whose timezone is, say, EST:
> {code}
>  CREATE TABLE t_src(
>  name varchar(10),
>  dob timestamp
>  )
>  ROW FORMAT SERDE 
>  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
>  STORED AS INPUTFORMAT 
>  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
>  OUTPUTFORMAT 
>  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat';
> {code}
> {code}
> INSERT INTO t_src VALUES ('p1', '0001-01-01 00:00:00.0'),('p2', '0002-01-01 00:00:00.0'), ('p3', '0003-01-01 00:00:00.0'),('p4', '0004-01-01 00:00:00.0');
> {code}
>  
> {code}
>  CREATE TABLE t_tgt(
>  name varchar(10),
>  dob timestamp
>  );
> {code}
> 2. Copy {{hive-site.xml}} into the {{spark-2.2.1-bin-hadoop2.7/conf}} folder, so that the {{sqlContext}} you create for Hive connects to your remote Hive server.
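> (For reference, the Hive property that matters here is the metastore URI; the host and port below are placeholders, not values from the original report.)
> {code}
> <property>
>   <name>hive.metastore.uris</name>
>   <!-- placeholder host/port; point this at your remote metastore -->
>   <value>thrift://remote-hive-host:9083</value>
> </property>
> {code}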
> 3. Start your spark-shell on some other machine whose timezone is different from that of Hive, say, PDT.
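> (If you need to force a particular timezone for the test, the JVM default can be overridden when launching the shell; {{--driver-java-options}} is a standard spark-shell flag and {{user.timezone}} a standard JVM property.)
> {code}
> ./bin/spark-shell --driver-java-options "-Duser.timezone=America/Los_Angeles"
> {code}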
> 4. Execute the code below:
> {code}
> import org.apache.spark.sql.hive.HiveContext
> val sqlContext = new HiveContext(sc)
> // Empty the target, read from the Parquet-backed source,
> // and write the rows back out via a temporary view
> val q0 = "TRUNCATE TABLE t_tgt"
> val q1 = "SELECT CAST(alias.name AS STRING) AS a0, alias.dob AS a1 FROM t_src alias"
> val q2 = "INSERT OVERWRITE TABLE t_tgt SELECT tbl0.a0 AS c0, tbl0.a1 AS c1 FROM tbl0"
> sqlContext.sql(q0)
> sqlContext.sql(q1).select("a0", "a1").createOrReplaceTempView("tbl0")
> sqlContext.sql(q2)
> {code}
> 5. Now navigate to Hive and check the contents of the target table ({{t_tgt}}). The dob field will contain incorrect values.
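> (As an added cross-check, not part of the original report: the target table can also be read back through the same Spark session, to compare what Spark sees with what the Hive CLI shows.)
> {code}
> // Read the target table back through Spark for comparison
> sqlContext.sql("SELECT name, dob FROM t_tgt").show(false)
> {code}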
>  
> Is this a known issue? Is there any workaround for it? Can it be fixed?
>  
> Thanks & regards,
> Pawan Lawale


