Posted to issues@hive.apache.org by "Prasanth Jayachandran (JIRA)" <ji...@apache.org> on 2017/09/05 21:56:00 UTC

[jira] [Comment Edited] (HIVE-17280) Data loss in CONCATENATE ORC created by Spark

    [ https://issues.apache.org/jira/browse/HIVE-17280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16154377#comment-16154377 ] 

Prasanth Jayachandran edited comment on HIVE-17280 at 9/5/17 9:55 PM:
----------------------------------------------------------------------

[~mgaido] Posted a patch to HIVE-17403 that will fix the issue (along with adding restrictions). Tested this locally and it worked. If concatenation finds an incompatible file, it will rename it to Hive's naming convention to avoid the issue that I mentioned above.


was (Author: prasanth_j):
[~mgaido] Posted a patch to HIVE-17280 that will fix the issue (along with adding restrictions). Tested this locally and it worked. If concatenation finds an incompatible file, it will rename it to Hive's naming convention to avoid the issue that I mentioned above.
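
The rename-to-Hive-convention approach described above can be illustrated with a rough workaround sketch. This is a hypothetical spark-shell snippet, not the actual HIVE-17403 patch: the warehouse path for table aa and the 000000_0-style target names are assumptions, and the real fix applies the rename inside Hive's merge path rather than by hand.

{code:java}
spark-shell
scala> import org.apache.hadoop.fs.{FileSystem, Path}
scala> // hypothetical location of table `aa`; adjust to the actual warehouse directory
scala> val tableDir = new Path("/user/hive/warehouse/aa")
scala> val fs = FileSystem.get(sc.hadoopConfiguration)
scala> // rename Spark-style part files (part-00000, ...) to Hive-style names (000000_0, ...) before concatenating
scala> fs.listStatus(tableDir).map(_.getPath).filter(_.getName.startsWith("part-")).zipWithIndex.foreach { case (src, i) => fs.rename(src, new Path(tableDir, f"$i%06d_0")) }
{code}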

> Data loss in CONCATENATE ORC created by Spark
> ---------------------------------------------
>
>                 Key: HIVE-17280
>                 URL: https://issues.apache.org/jira/browse/HIVE-17280
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive, Spark
>    Affects Versions: 1.2.1
>         Environment: Spark 1.6.3
>            Reporter: Marco Gaido
>            Priority: Critical
>
> Hive concatenation causes data loss if the ORC files in the table were written by Spark.
> Here are the steps to reproduce the problem:
>  - create a table;
> {code:java}
> hive
> hive> create table aa (a string, b int) stored as orc;
> {code}
>  - insert 2 rows using Spark;
> {code:java}
> spark-shell
> scala> case class AA(a: String, b: Int)
> scala> val df = sc.parallelize(Array(AA("b", 2), AA("c", 3))).toDF
> scala> df.write.insertInto("aa")
> {code}
>  - change table schema;
> {code:java}
> hive
> hive> alter table aa add columns(aa string, bb int);
> {code}
>  - insert 2 more rows with Spark;
> {code:java}
> spark-shell
> scala> case class BB(a: String, b: Int, aa: String, bb: Int)
> scala> val df = sc.parallelize(Array(BB("b", 2, "b", 2), BB("c", 3, "c", 3))).toDF
> scala> df.write.insertInto("aa")
> {code}
>  - at this point, running a select statement with Hive correctly returns *4 rows* from the table; then run the concatenation
> {code:java}
> hive
> hive> alter table aa concatenate;
> {code}
> At this point, a select returns only *3 rows, i.e. a row is missing*.
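
The select used for these checks is not shown in the steps above; a minimal version (assuming the same table aa) is simply:
{code:java}
hive
hive> select * from aa;
{code}
Run once before and once after the concatenation: as reported above, it returns 4 rows before the merge and only 3 afterwards.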



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)