Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2018/10/30 08:19:00 UTC

[jira] [Updated] (SPARK-25873) Date corruption when Spark and Hive both are on different timezones

     [ https://issues.apache.org/jira/browse/SPARK-25873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-25873:
---------------------------------
    Description: 
Date values are altered when loading data from one Hive table to another through Spark. This happens when Hive is on a remote machine whose timezone differs from the one Spark is running on, and only when the source table's format is 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'.

Below are the steps to reproduce the issue:

1. In Hive, which runs in some timezone, say EST, create a source table {{t_src}}, populate it, and create a target table {{t_tgt}}:

{code}
CREATE TABLE t_src(
  name varchar(10),
  dob timestamp
)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat';
{code}

{code}
INSERT INTO t_src VALUES
  ('p1', '0001-01-01 00:00:00.0'),
  ('p2', '0002-01-01 00:00:00.0'),
  ('p3', '0003-01-01 00:00:00.0'),
  ('p4', '0004-01-01 00:00:00.0');
{code}
 
{code}
CREATE TABLE t_tgt(
  name varchar(10),
  dob timestamp
);
{code}


2. Copy {{hive-site.xml}} into the {{spark-2.2.1-bin-hadoop2.7/conf}} folder, so that the {{sqlContext}} you create for Hive connects to your remote Hive server.
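(For reference, the same connection can also be made without copying {{hive-site.xml}}, by pointing a session at the metastore programmatically. A minimal sketch for Spark 2.2; the thrift host and port are placeholders for your metastore, and {{timezone-repro}} is just an illustrative app name:)

{code}
import org.apache.spark.sql.SparkSession

// Placeholder URI: replace <metastore-host> with your remote Hive metastore.
val spark = SparkSession.builder()
  .appName("timezone-repro")
  .config("hive.metastore.uris", "thrift://<metastore-host>:9083")
  .enableHiveSupport()
  .getOrCreate()
{code}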

3. Start {{spark-shell}} on some other machine whose timezone differs from Hive's, say PDT.
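(To double-check which timezone the shell's JVM actually picked up, a quick sketch:)

{code}
// Prints e.g. "America/Los_Angeles" for PDT; compare with the Hive machine.
println(java.util.TimeZone.getDefault.getID)
{code}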

4. Execute the code below:
{code}
import org.apache.spark.sql.hive.HiveContext

// HiveContext is deprecated in Spark 2.x but still available in 2.2.1.
val sqlContext = new HiveContext(sc)

val q0 = "TRUNCATE TABLE t_tgt"
val q1 = "SELECT CAST(alias.name AS String) as a0, alias.dob as a1 FROM t_src alias"
val q2 = "INSERT OVERWRITE TABLE t_tgt SELECT tbl0.a0 as c0, tbl0.a1 as c1 FROM tbl0"

// Empty the target, register the source projection as a temp view, then copy.
sqlContext.sql(q0)
sqlContext.sql(q1).select("a0", "a1").createOrReplaceTempView("tbl0")
sqlContext.sql(q2)
{code}

5. Now go back to Hive and check the contents of the target table ({{t_tgt}}): the {{dob}} field will have incorrect values.
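(The shift can also be observed from the same shell by reading both tables back; a sketch that assumes the steps above were run. Note that Spark's read path may apply its own conversion, so what the Hive CLI shows remains the decisive check:)

{code}
// Compare source and target side by side; the dob values should match but don't.
sqlContext.sql("SELECT name, dob FROM t_src ORDER BY name").show(false)
sqlContext.sql("SELECT name, dob FROM t_tgt ORDER BY name").show(false)
{code}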

 

Is this a known issue? Is there any workaround? Can it be fixed?

 

Thanks & regards,

Pawan Lawale

> Date corruption when Spark and Hive both are on different timezones
> -------------------------------------------------------------------
>
>                 Key: SPARK-25873
>                 URL: https://issues.apache.org/jira/browse/SPARK-25873
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, Spark Shell, Spark Submit
>    Affects Versions: 2.2.1
>            Reporter: Pawan
>            Priority: Major
>


