You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/05/21 04:13:29 UTC
[jira] [Resolved] (SPARK-20663) Data missing after insert overwrite table partition which is created on specific location

     [ https://issues.apache.org/jira/browse/SPARK-20663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-20663.
----------------------------------
    Resolution: Incomplete

> Data missing after insert overwrite table partition which is created on specific location
> -----------------------------------------------------------------------------------------
>
>                 Key: SPARK-20663
>                 URL: https://issues.apache.org/jira/browse/SPARK-20663
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.1
>            Reporter: kobefeng
>            Priority: Major
>              Labels: bulk-closed
>
> Use spark sql to create partition table first, and alter table by adding partition on specific location, then insert overwrite into this partition by selection, which will cause data missing compared with HIVE.
> {code:title=partition_table_insert_overwrite.sql|borderStyle=solid}
> -- create partition table first
> $ hadoop fs -mkdir /user/kofeng/partitioned_table
> $ /apache/spark-2.1.0-bin-hadoop2.7/bin/spark-sql
> spark-sql> create table kofeng.partitioned_table(
>          >     id bigint,
>          >     name string,
>          >     dt string
>          > ) using parquet options ('compression'='snappy', 'path'='/user/kofeng/partitioned_table')
>          > partitioned by (dt);
> -- add partition with specific location
> spark-sql> alter table kofeng.partitioned_table add if not exists partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507';
> $ hadoop fs -ls /user/kofeng/partitioned_table
> drwxr-xr-x   - kofeng kofeng          0 2017-05-08 17:00 /user/kofeng/partitioned_table/20170507
> -- insert overwrite this partition, and the specific location folder gone, data is missing, job is success by attaching _SUCCESS
> spark-sql> insert overwrite table kofeng.partitioned_table partition(dt='20170507') select 123 as id, "kofeng" as name;
> $ hadoop fs -ls /user/kofeng/partitioned_table
> -rw-r--r--   3 kofeng kofeng         0 2017-05-08 17:06 /user/kofeng/partitioned_table/_SUCCESS
> ----
> ----
> -- Then drop this partition and use hive to add partition and insert overwrite this partition data, then verify:
> spark-sql> alter table kofeng.partitioned_table drop if exists partition(dt='20170507');
> hive> alter table kofeng.partitioned_table add if not exists partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507';
> OK
> -- could see hive insert overwrite that specific location successfully.
> hive> insert overwrite table kofeng.partitioned_table partition(dt='20170507') select 123 as id, "kofeng" as name from kofeng.test;
> hive> select * from kofeng.partitioned_table;
> OK
> 123	kofeng	20170507
> $ hadoop fs -ls /user/kofeng/partitioned_table/20170507
> -rwxr-xr-x   3 kofeng kofeng       338 2017-05-08 17:26 /user/kofeng/partitioned_table/20170507/000000_0
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org