Posted to issues@spark.apache.org by "Aman Rastogi (Jira)" <ji...@apache.org> on 2020/09/16 06:47:00 UTC

[jira] [Comment Edited] (SPARK-32778) Accidental Data Deletion on calling saveAsTable

    [ https://issues.apache.org/jira/browse/SPARK-32778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17196721#comment-17196721 ] 

Aman Rastogi edited comment on SPARK-32778 at 9/16/20, 6:46 AM:
----------------------------------------------------------------

I have reproduced the issue with v2.4.4. The relevant code is essentially the same as it was in v2.2.0.

 

Line: 176

[https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala]
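
For context, here is a condensed paraphrase of the branch that line sits in (a sketch of CreateDataSourceTableAsSelectCommand.run, not a verbatim quote of the Spark source; "// ..." marks elided code): when the table is absent from the catalog, the write is issued with SaveMode.Overwrite regardless of the mode the caller passed.

{code:java}
// Condensed paraphrase of the v2.4.4 logic around the referenced line;
// the else branch is the one this report is about.
if (sparkSession.sessionState.catalog.tableExists(tableIdentWithDB)) {
  // Table already registered: the caller's SaveMode (Append, ErrorIfExists, ...)
  // is respected.
  // ...
} else {
  // Table not yet in the metastore: the mode is hard-coded to Overwrite, so any
  // files already sitting at the table location are deleted before the write.
  val result = saveDataIntoTable(
    sparkSession, table, tableLocation, child, SaveMode.Overwrite, tableExists = false)
  // ...
}
{code}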


was (Author: amanr):
I have reproduced the issue with v2.4.4. The relevant code is essentially the same as it was in v2.2.0.

https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala

> Accidental Data Deletion on calling saveAsTable
> -----------------------------------------------
>
>                 Key: SPARK-32778
>                 URL: https://issues.apache.org/jira/browse/SPARK-32778
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.4
>            Reporter: Aman Rastogi
>            Priority: Major
>
> {code:java}
> df.write.option("path", "/already/existing/path").mode(SaveMode.Append).format("json").saveAsTable("db.table")
> {code}
> The above code deleted the data already present at the path "/already/existing/path". This happened because the table did not yet exist in the Hive metastore, but the given path already contained data. When the table is not present in the Hive metastore, the SaveMode is internally changed to SaveMode.Overwrite regardless of what the user provided, which leads to data deletion. This behaviour was introduced as part of https://issues.apache.org/jira/browse/SPARK-19583.
> Now suppose the user is not using an external Hive metastore (the metastore is tied to a cluster) and the cluster goes down, or for some other reason the user has to migrate to a new cluster. When the user runs the above code on the new cluster, Spark first deletes the existing data at the path. This could be production data, and the user is completely unaware of the deletion because they specified SaveMode.Append or SaveMode.ErrorIfExists. This amounts to accidental data deletion.
>  
> Repro Steps:
>  
>  # Save data through a Hive table as in the code above
>  # Create another cluster and save data into a new table on the new cluster, giving the same path (see the sketch below)
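>  
> A single-session approximation of this two-cluster scenario (hypothetical table name and path, assuming Spark 2.4.4 with Hive support): the files written in the first step stand in for data left by the old cluster, and the fresh metastore simply has no entry for the table.
> {code:java}
> import org.apache.spark.sql.{SaveMode, SparkSession}
> 
> object Spark32778Repro {
>   def main(args: Array[String]): Unit = {
>     val spark = SparkSession.builder()
>       .appName("SPARK-32778-repro")
>       .enableHiveSupport()
>       .getOrCreate()
>     import spark.implicits._
> 
>     val path = "/tmp/already/existing/path"
> 
>     // Stand-in for data written by the old cluster: files exist at the path,
>     // but the current metastore has no entry for the table.
>     Seq((1, "old"), (2, "old")).toDF("id", "origin")
>       .write.mode(SaveMode.Overwrite).json(path)
> 
>     // The user asks for Append, but because the table is missing from the
>     // metastore the command internally switches to SaveMode.Overwrite and the
>     // existing files under `path` are deleted first.
>     Seq((3, "new")).toDF("id", "origin")
>       .write.option("path", path).mode(SaveMode.Append).format("json")
>       .saveAsTable("prod_table")
> 
>     // Only the "new" row survives; the original data at `path` is gone.
>     spark.table("prod_table").show()
> 
>     spark.stop()
>   }
> }
> {code}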
>  
> Proposed Fix:
> Instead of changing the SaveMode to Overwrite, we should change it to ErrorIfExists in CreateDataSourceTableAsSelectCommand.
> Change (line 154)
>  
> {code:java}
> val result = saveDataIntoTable(
>   sparkSession, table, tableLocation, child, SaveMode.Overwrite, tableExists = false)
> {code}
> to
>  
> {code:java}
> val result = saveDataIntoTable(
>   sparkSession, table, tableLocation, child, SaveMode.ErrorIfExists, tableExists = false)
> {code}
> This should not break CTAS. Even in the CTAS case, the user may not want existing data at the path to be deleted, since that deletion could be accidental.
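>  
> For comparison, plain path-based writes already have the fail-fast semantics this fix asks for: with SaveMode.ErrorIfExists a file-source write refuses to write to a path that already exists. A small sketch (hypothetical path and data, spark-shell style):
> {code:java}
> import org.apache.spark.sql.{SaveMode, SparkSession}
> 
> val spark = SparkSession.builder().appName("errorifexists-demo").getOrCreate()
> import spark.implicits._
> 
> // Writing with ErrorIfExists to a path that already exists fails with an
> // AnalysisException instead of deleting anything; the proposed change would
> // give saveAsTable the same behaviour when the table is absent from the metastore.
> Seq((1, "a")).toDF("id", "value")
>   .write.mode(SaveMode.ErrorIfExists)
>   .json("/tmp/already/existing/path")
> {code}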
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org