Posted to issues@spark.apache.org by "Takeshi Yamamuro (Jira)" <ji...@apache.org> on 2020/09/03 05:37:00 UTC

[jira] [Commented] (SPARK-32778) Accidental Data Deletion on calling saveAsTable

    [ https://issues.apache.org/jira/browse/SPARK-32778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17189848#comment-17189848 ] 

Takeshi Yamamuro commented on SPARK-32778:
------------------------------------------

Have you tried the latest releases, v2.4.6 or v3.0.0?

> Accidental Data Deletion on calling saveAsTable
> -----------------------------------------------
>
>                 Key: SPARK-32778
>                 URL: https://issues.apache.org/jira/browse/SPARK-32778
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Aman Rastogi
>            Priority: Major
>
> {code:java}
> df.write.option("path", "/already/existing/path").mode(SaveMode.Append).format("json").saveAsTable("db.table")
> {code}
> The above code deleted the data present in the path "/already/existing/path". This happened because the table did not yet exist in the Hive metastore, but the given path already contained data. If the table is not present in the Hive metastore, the SaveMode is internally changed to SaveMode.Overwrite regardless of what the user provided, which leads to data deletion. This change was introduced as part of https://issues.apache.org/jira/browse/SPARK-19583.
> Now suppose the user is not using an external Hive metastore (the metastore is tied to a cluster), and the cluster goes down or the user has to migrate to a new cluster for some reason. When the user runs the above code on the new cluster, Spark will first delete the data at the path. This could be production data, and the user is completely unaware of the deletion because they specified SaveMode.Append or ErrorIfExists. This is accidental data deletion.
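> As a user-side workaround until this is fixed, a defensive pre-check is possible: refuse to write when the target path already holds data but the table is missing from the metastore (the exact situation in which the SaveMode is rewritten to Overwrite). This is only a sketch; the names spark, df, and the path are illustrative, not part of the proposed fix.
> {code:java}
> import org.apache.hadoop.fs.Path
>
> val targetPath = "/already/existing/path" // illustrative path
> val fs = new Path(targetPath)
>   .getFileSystem(spark.sparkContext.hadoopConfiguration)
>
> // The dangerous case: path has data, but the table is absent from the metastore.
> val tableInMetastore = spark.catalog.tableExists("db", "table")
> if (!tableInMetastore && fs.exists(new Path(targetPath))) {
>   throw new IllegalStateException(
>     s"Refusing to write: $targetPath already contains data but db.table " +
>     "is not in the metastore; saveAsTable would overwrite it")
> }
>
> df.write.option("path", targetPath)
>   .mode(SaveMode.Append).format("json").saveAsTable("db.table")
> {code}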
>  
> Repro Steps:
>  
>  # Save data to a Hive table as in the code above
>  # Create another cluster and save data into a new table on that cluster, giving the same path
>  
> Proposed Fix:
> Instead of modifying SaveMode to Overwrite, we should modify it to ErrorIfExists in class CreateDataSourceTableAsSelectCommand.
> Change (line 154)
>  
> {code:java}
> val result = saveDataIntoTable(
>   sparkSession, table, tableLocation, child, SaveMode.Overwrite, tableExists = false)
> {code}
> to
>  
> {code:java}
> val result = saveDataIntoTable(
>   sparkSession, table, tableLocation, child, SaveMode.ErrorIfExists, tableExists = false)
> {code}
> This should not break CTAS. Even in the CTAS case, the user may not want data that already exists at the path to be deleted, as that deletion could equally be accidental.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org