Posted to issues@spark.apache.org by "Hyukjin Kwon (Jira)" <ji...@apache.org> on 2020/09/04 04:21:00 UTC

[jira] [Resolved] (SPARK-32778) Accidental Data Deletion on calling saveAsTable

     [ https://issues.apache.org/jira/browse/SPARK-32778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-32778.
----------------------------------
    Resolution: Incomplete

Leaving this resolved for now since 2.2.0 is EOL and we won't land a fix there

> Accidental Data Deletion on calling saveAsTable
> -----------------------------------------------
>
>                 Key: SPARK-32778
>                 URL: https://issues.apache.org/jira/browse/SPARK-32778
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Aman Rastogi
>            Priority: Major
>
> {code:java}
> df.write.option("path", "/already/existing/path").mode(SaveMode.Append).format("json").saveAsTable("db.table")
> {code}
> The code above deleted the data already present at "/already/existing/path". This happened because the table did not yet exist in the Hive Metastore, while the given path already held data. When the table is absent from the Hive Metastore, Spark internally rewrites the SaveMode to SaveMode.Overwrite regardless of what the user provided, which leads to the data deletion. This behavior was introduced as part of https://issues.apache.org/jira/browse/SPARK-19583.
> Now suppose the user is not running an external Hive Metastore (the metastore is tied to a cluster) and the cluster goes down, or for some other reason the user has to migrate to a new cluster. When the user runs the code above on the new cluster, Spark first deletes whatever is already at the path. That could be production data, and the user is completely unaware of the deletion because they passed SaveMode.Append or SaveMode.ErrorIfExists. This is accidental data deletion.
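> Until a fix lands, callers can guard against this themselves. The sketch below is plain Scala with no Spark dependency; assertSafeToWrite is a hypothetical helper for illustration, not a Spark API:
> {code:java}
> import java.nio.file.{Files, Paths}
>
> // Hypothetical guard, not part of Spark: returns true only when writing to
> // `path` cannot silently destroy existing data.
> def assertSafeToWrite(path: String, overwriteRequested: Boolean): Boolean = {
>   val p = Paths.get(path)
>   // A path is "occupied" if it exists and contains at least one entry.
>   val occupied = Files.exists(p) && {
>     val entries = Files.list(p)
>     try entries.findFirst.isPresent finally entries.close()
>   }
>   !occupied || overwriteRequested
> }
> {code}
> Calling such a check before df.write...saveAsTable(...) turns the silent deletion into an explicit failure at the call site.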
>  
> Repro Steps:
>  
>  # Save data to a Hive table using the code above
>  # Create another cluster and save data to a new table on that cluster, specifying the same path
>  
> Proposed Fix:
> Instead of modifying SaveMode to Overwrite, we should modify it to ErrorIfExists in class CreateDataSourceTableAsSelectCommand.
> Change (line 154)
>  
> {code:java}
> val result = saveDataIntoTable(
>   sparkSession, table, tableLocation, child, SaveMode.Overwrite, tableExists = false)
> {code}
> to
>  
> {code:java}
> val result = saveDataIntoTable(
>   sparkSession, table, tableLocation, child, SaveMode.ErrorIfExists, tableExists = false)
> {code}
> This should not break CTAS. Even in the CTAS case, users may not want data that already exists at the path to be deleted, since the overlap could be accidental.
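> The proposed remapping can be sketched in isolation. This is plain Scala with a stand-in Mode type; the real type is org.apache.spark.sql.SaveMode, and the real decision lives in CreateDataSourceTableAsSelectCommand:
> {code:java}
> // Stand-in for org.apache.spark.sql.SaveMode, for illustration only.
> sealed trait Mode
> case object Append extends Mode
> case object Overwrite extends Mode
> case object ErrorIfExists extends Mode
>
> // Current behavior: a table missing from the metastore forces Overwrite,
> // regardless of what the user asked for.
> def currentMode(userMode: Mode, tableExists: Boolean): Mode =
>   if (tableExists) userMode else Overwrite
>
> // Proposed behavior: a missing table fails fast instead of deleting data,
> // unless the user explicitly asked for Overwrite.
> def proposedMode(userMode: Mode, tableExists: Boolean): Mode =
>   if (tableExists) userMode
>   else if (userMode == Overwrite) Overwrite
>   else ErrorIfExists
> {code}
> Under the proposal, Append and ErrorIfExists against a missing table both surface an error rather than wiping the path.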
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org