Posted to issues@spark.apache.org by "smohr003 (JIRA)" <ji...@apache.org> on 2018/12/04 22:28:00 UTC
[jira] [Commented] (SPARK-25919) Date value corrupts when tables are "ParquetHiveSerDe" formatted and target table is Partitioned

    [ https://issues.apache.org/jira/browse/SPARK-25919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16709352#comment-16709352 ] 

smohr003 commented on SPARK-25919:
----------------------------------

I cannot reproduce this. 

Note that on the Spark side I first get an error regarding 
{code:java}
hive.exec.dynamic.partition.mode{code}
, which should be set to nonstrict. 

After setting it with 
{code:java}
sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict"){code}
, there is no problem with the data in the tables. I am using Hive 2.1 with Spark 2.2. 
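
As a rough sketch of how the result could be checked directly from spark-shell rather than by eye: the snippet below is not part of the original report. It assumes a Spark 2.x spark-shell built with Hive support, where `spark` is the predefined SparkSession, and it reuses the table names from the report (default.t_src and default.t_tgt). The value names `src`, `tgt` and `mismatched` are only illustrative.
{code:java}
// Assumption: Spark 2.x spark-shell with Hive support; `spark` is the
// predefined SparkSession (the SparkSession counterpart of sqlContext.setConf).
spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")

// Compare only the columns both tables share; the partition column `city`
// exists only in t_tgt, so it is left out of the comparison.
val src = spark.table("default.t_src").selectExpr("CAST(name AS STRING) AS name", "dob")
val tgt = spark.table("default.t_tgt").selectExpr("CAST(name AS STRING) AS name", "dob")

// Rows that are in t_src but have no identical (name, dob) row in t_tgt.
val mismatched = src.except(tgt)
mismatched.show(false)
{code}
If the insert preserved every timestamp, `mismatched` should come back empty. With the behaviour described in the report, the p1 row would be expected to appear, because its dob differs between the two tables.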

> Date value corrupts when tables are "ParquetHiveSerDe" formatted and target table is Partitioned
> ------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-25919
>                 URL: https://issues.apache.org/jira/browse/SPARK-25919
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, Spark Shell, Spark Submit
>    Affects Versions: 2.1.0, 2.2.1
>            Reporter: Pawan
>            Priority: Blocker
>
> Hi,
> I found a really strange issue. Below are the steps to reproduce it. The issue occurs only when the table row format is ParquetHiveSerDe and the target table is partitioned.
> *Hive:*
> Log in to the Hive terminal on the cluster and create the tables below.
> {code:java}
> create table t_src(
> name varchar(10),
> dob timestamp
> )
> ROW FORMAT SERDE 
>  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
> STORED AS INPUTFORMAT 
>  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
> OUTPUTFORMAT 
>  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat';
> create table t_tgt(
> name varchar(10),
> dob timestamp
> )
> PARTITIONED BY (city varchar(10))
> ROW FORMAT SERDE 
>  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
> STORED AS INPUTFORMAT 
>  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
> OUTPUTFORMAT 
>  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat';
> {code}
> Insert data into the source table (t_src):
> {code:java}
> INSERT INTO t_src VALUES ('p1', '0001-01-01 00:00:00.0'),('p2', '0002-01-01 00:00:00.0'), ('p3', '0003-01-01 00:00:00.0'),('p4', '0004-01-01 00:00:00.0');{code}
> *Spark-shell:*
> Start spark-shell and execute the commands below:
> {code:java}
> import org.apache.spark.sql.hive.HiveContext
> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
> val q0 = "TRUNCATE table t_tgt"
> val q1 = "SELECT CAST(alias.name AS STRING) as a0, alias.dob as a1 FROM DEFAULT.t_src alias"
> val q2 = "INSERT INTO TABLE DEFAULT.t_tgt PARTITION (city) SELECT tbl0.a0 as c0, tbl0.a1 as c1, NULL as c2 FROM tbl0"
> sqlContext.sql(q0)
> sqlContext.sql(q1).select("a0","a1").createOrReplaceTempView("tbl0")
> sqlContext.sql(q2)
> {code}
> After this, check the contents of the target table t_tgt. You will see that the date "0001-01-01 00:00:00" has changed to "0002-01-01 00:00:00". The snippets below show the contents of both tables:
> {code:java}
> select * from t_src;
> +-------------+------------------------+--+
> | t_src.name  | t_src.dob              |
> +-------------+------------------------+--+
> | p1          | 0001-01-01 00:00:00.0  |
> | p2          | 0002-01-01 00:00:00.0  |
> | p3          | 0003-01-01 00:00:00.0  |
> | p4          | 0004-01-01 00:00:00.0  |
> +-------------+------------------------+--+
> select * from t_tgt;
> +-------------+------------------------+-------------+--+
> | t_tgt.name  | t_tgt.dob              | t_tgt.city  |
> +-------------+------------------------+-------------+--+
> | p1          | 0002-01-01 00:00:00.0  | __HIVE_DEF  |
> | p2          | 0002-01-01 00:00:00.0  | __HIVE_DEF  |
> | p3          | 0003-01-01 00:00:00.0  | __HIVE_DEF  |
> | p4          | 0004-01-01 00:00:00.0  | __HIVE_DEF  |
> +-------------+------------------------+-------------+--+
> {code}
>  
> Is this a known issue? Is it fixed in any subsequent releases?
> Thanks & regards,
> Pawan Lawale


