Posted to issues@spark.apache.org by "Deepanker (JIRA)" <ji...@apache.org> on 2018/09/18 11:18:00 UTC

[jira] [Comment Edited] (SPARK-20236) Overwrite a partitioned data source table should only overwrite related partitions

    [ https://issues.apache.org/jira/browse/SPARK-20236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16618950#comment-16618950 ] 

Deepanker edited comment on SPARK-20236 at 9/18/18 11:17 AM:
-------------------------------------------------------------

How does this Jira differ from these two?

https://issues.apache.org/jira/browse/SPARK-18185
https://issues.apache.org/jira/browse/SPARK-18183

I tested this with Spark 2.2 (which confirms the fix was present before 2.3 as well), and it seems to work only for external tables, not for managed tables in Hive. Is there any reason for that?

So now we can enable or disable this behaviour via the {{spark.sql.sources.partitionOverwriteMode}} property, whereas previously it was the default?


> Overwrite a partitioned data source table should only overwrite related partitions
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-20236
>                 URL: https://issues.apache.org/jira/browse/SPARK-20236
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Wenchen Fan
>            Assignee: Wenchen Fan
>            Priority: Major
>              Labels: releasenotes
>             Fix For: 2.3.0
>
>
> When we overwrite a partitioned data source table, Spark currently truncates the entire table before writing new data, or truncates a set of partitions according to the given static partitions.
> For example, {{INSERT OVERWRITE tbl ...}} will truncate the entire table, and {{INSERT OVERWRITE tbl PARTITION (a=1, b)}} will truncate all the partitions that start with {{a=1}}.
> This behavior is reasonable in that we know which partitions will be overwritten before runtime. However, Hive behaves differently: it only overwrites the related partitions, e.g. {{INSERT OVERWRITE tbl SELECT 1,2,3}} will only overwrite partition {{a=2, b=3}}, assuming {{tbl}} has only one data column and is partitioned by {{a}} and {{b}}.
> It seems better to follow Hive's behavior.
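
The two overwrite behaviours described in the issue can be illustrated with a small toy model in plain Python. This is only a sketch and does not use Spark: the function {{insert_overwrite}}, its {{mode}} and {{static_a}} parameters, and the dictionary-based table are all hypothetical names made up for illustration. The mode names borrow the {{static}} and {{dynamic}} values of {{spark.sql.sources.partitionOverwriteMode}}; "static" truncates whatever the static partition spec matches, "dynamic" replaces only the partitions that actually receive rows.

```python
from typing import Dict, List, Optional, Tuple

Partition = Tuple[int, int]          # (a, b) partition values
Row = Tuple[int, int, int]           # (data, a, b), as in SELECT 1,2,3
Table = Dict[Partition, List[int]]   # partition -> stored data values


def insert_overwrite(table: Table, rows: List[Row], mode: str = "static",
                     static_a: Optional[int] = None) -> Table:
    """Toy model of INSERT OVERWRITE on a table partitioned by (a, b).

    Hypothetical helper, not a Spark API. Returns a new table and leaves
    the input untouched.
    """
    if mode not in ("static", "dynamic"):
        raise ValueError(f"unknown mode: {mode!r}")
    if static_a is not None and any(a != static_a for _, a, _ in rows):
        raise ValueError("every row must match the static partition value")

    if mode == "static":
        # Truncate everything matched by the static spec: the whole table
        # when no spec is given, otherwise all partitions with a = static_a.
        doomed = {p for p in table if static_a is None or p[0] == static_a}
    else:
        # Truncate only the partitions that the new rows are written into.
        doomed = {(a, b) for _, a, b in rows}

    # Keep the surviving partitions (copying the lists), then write new rows.
    result = {p: list(v) for p, v in table.items() if p not in doomed}
    for data, a, b in rows:
        result.setdefault((a, b), []).append(data)
    return result


before = {(1, 1): [10], (1, 2): [20], (2, 3): [30]}
new_rows = [(1, 2, 3)]  # data=1 goes to partition a=2, b=3

# Whole table truncated, only the new partition remains.
static_result = insert_overwrite(before, new_rows, mode="static")
# Only partition (2, 3) is replaced; (1, 1) and (1, 2) survive.
dynamic_result = insert_overwrite(before, new_rows, mode="dynamic")
# PARTITION (a=1, b) in static mode drops every partition with a=1.
prefix_result = insert_overwrite(before, [(5, 1, 1)], mode="static", static_a=1)
```

With the sample data above, the static call leaves only partition {{a=2, b=3}}, while the dynamic call keeps the two {{a=1}} partitions and replaces just {{a=2, b=3}}, which matches the Hive behaviour the issue proposes to follow.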


