Posted to issues@spark.apache.org by "jeanlyn (Jira)" <ji...@apache.org> on 2023/11/28 15:57:00 UTC

[jira] (SPARK-43106) Data lost from the table if the INSERT OVERWRITE query fails

    [ https://issues.apache.org/jira/browse/SPARK-43106 ]


    jeanlyn deleted comment on SPARK-43106:
    ---------------------------------

was (Author: jeanlyn):
I think we also encountered a similar problem; we work around it by setting the parameter *spark.sql.hive.convertInsertingPartitionedTable=false*
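
For reference, a minimal sketch of applying that workaround when building the session, assuming a Hive-enabled deployment (the app name is illustrative; as I understand it, with the flag off, inserts into partitioned Hive tables fall back to the Hive SerDe write path instead of Spark's native writers):

{code:scala}
import org.apache.spark.sql.SparkSession

// Sketch only: disable the native-writer conversion so that INSERTs into
// partitioned Hive tables go through the Hive SerDe path.
val spark = SparkSession.builder()
  .appName("insert-overwrite-workaround") // illustrative name
  .enableHiveSupport()
  .config("spark.sql.hive.convertInsertingPartitionedTable", "false")
  .getOrCreate()
{code}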

> Data lost from the table if the INSERT OVERWRITE query fails
> ------------------------------------------------------------
>
>                 Key: SPARK-43106
>                 URL: https://issues.apache.org/jira/browse/SPARK-43106
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.4.0
>            Reporter: Vaibhav Beriwala
>            Priority: Major
>
> When we run an INSERT OVERWRITE query on an unpartitioned table in Spark 3, Spark behaves as follows:
> 1) It first deletes all of the data at the table's actual path.
> 2) It then launches a job that performs the actual insert - see the sketch below.
>  
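> A hypothetical sketch of that sequence and its failure mode, assuming a Hive-enabled SparkSession named spark (the table and view names are illustrative, not from this report):
> {code:scala}
> // Sketch only: t's path is cleared before the insert job runs, so a
> // mid-job failure leaves the table empty.
> spark.sql("CREATE TABLE t (id INT) STORED AS PARQUET")
> spark.sql("INSERT INTO t VALUES (1)")
> // Step 1 happens immediately: all existing data under t is deleted.
> // Step 2: if this job fails part-way, the original row is already gone.
> spark.sql("INSERT OVERWRITE TABLE t SELECT id FROM failing_view")
> {code}
>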
> There are 2 major issues with this approach:
> 1) If the insert job launched in step 2 above fails for any reason, the data from the original table is lost. 
> 2) If the insert job in step 2 above takes a long time to complete, the table data is unavailable to other readers for that entire duration.
> This behavior is the same for partitioned tables when using static partitioning. For dynamic partitioning, we do not delete the table data before the job launch.
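>
> As an aside, a sketch of opting in to dynamic partition overwrite for datasource writes, which avoids the up-front delete (the setting is real; the table and column names are illustrative):
> {code:scala}
> // Sketch only: in dynamic mode, just the partitions that receive new rows
> // are replaced, and only at job commit time.
> spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
> spark.sql(
>   "INSERT OVERWRITE TABLE part_t PARTITION (dt) SELECT id, dt FROM src")
> {code}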
>  
> Is there a reason why we perform this delete before the job launch rather than as part of the job commit operation? Hive does not have this issue - there, the old data is presumably cleaned up as part of the job commit. As part of SPARK-19183, we did add a new hook to the commit protocol for this exact purpose, but its default behavior still seems to be to delete the table data before the job launch.
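>
> For illustration only, a rough and untested sketch of deferring that delete through the SPARK-19183 hook, assuming Spark's internal HadoopMapReduceCommitProtocol and its deleteWithJob signature (the class name and the move-aside strategy are invented for this example):
> {code:scala}
> import org.apache.hadoop.fs.{FileSystem, Path}
> import org.apache.hadoop.mapreduce.JobContext
> import org.apache.spark.internal.io.FileCommitProtocol.TaskCommitMessage
> import org.apache.spark.internal.io.HadoopMapReduceCommitProtocol
> import scala.collection.mutable.ArrayBuffer
>
> // Untested sketch: instead of deleting the old data up front, move it
> // aside, discard it on commit, and restore it on abort.
> class DeferredDeleteCommitProtocol(jobId: String, path: String,
>     dynamicPartitionOverwrite: Boolean = false)
>   extends HadoopMapReduceCommitProtocol(jobId, path, dynamicPartitionOverwrite) {
>
>   // Driver-side bookkeeping only; transient so task-side serialization
>   // of the committer is unaffected.
>   @transient private lazy val backups = ArrayBuffer[(FileSystem, Path, Path)]()
>
>   override def deleteWithJob(fs: FileSystem, toDelete: Path,
>       recursive: Boolean): Boolean = {
>     if (fs.exists(toDelete)) {
>       val backup = new Path(toDelete.getParent, s".backup-$jobId-${toDelete.getName}")
>       fs.rename(toDelete, backup) // move aside instead of deleting now
>       backups += ((fs, toDelete, backup))
>     }
>     true
>   }
>
>   override def commitJob(jobContext: JobContext,
>       taskCommits: Seq[TaskCommitMessage]): Unit = {
>     super.commitJob(jobContext, taskCommits)
>     // The new data is in place; only now drop the old copy.
>     backups.foreach { case (fs, _, backup) => fs.delete(backup, true) }
>   }
>
>   override def abortJob(jobContext: JobContext): Unit = {
>     super.abortJob(jobContext)
>     backups.foreach { case (fs, original, backup) =>
>       if (fs.exists(original)) fs.delete(original, true) // drop partial output
>       fs.rename(backup, original) // restore the old data
>     }
>   }
> }
> {code}
> In principle such a class could be plugged in through the internal spark.sql.sources.commitProtocolClass setting, though staging-directory layout, permissions, and dynamic-partition handling would all need care before relying on it.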



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org