You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "adrians (via GitHub)" <gi...@apache.org> on 2024/01/08 09:59:56 UTC

[PR] [SPARK-46617][SQL] Create-table-if-not-exists should not silently overwrite existing tables [spark]

adrians opened a new pull request, #44622:
URL: https://github.com/apache/spark/pull/44622

   
   ### What changes were proposed in this pull request?
   
   The bug: If running "create table if-not-exists as select..." without a hive-catalog, the table is always overwritten.
   The cause: DataWritingCommand.assertEmptyRootPath verifies if the storage-location exists only if the SaveMode is ErrorIfExists (to know whether to save or throw exceptions in "create table as select" statements). When the SaveMode is ignore (as is the case in "create table if-not-exists as select..." statements), the presence of files in the storage-location is not checked, so later the statement will force-overwrite the contents.
   
   ### Why are the changes needed?
   
   It's a bug: inconsistent behavior depending on which case it is in the test-matrix:
   
   ```
   +----------------------------+-------------------------------------+
   | sql statement              | Behavior when overwriting a table   |
   |                            +---------------+---------------------+
   |                            | hive-enabled  | hive-disabled       |
   +----------------------------+---------------+---------------------+
   | create-table               | exception (1) | exception (2)       |
   | create-table-if-not-exists | skip (3)      | OVERWRITE *BUG* (4) |
   +----------------------------+---------------+---------------------+
   ```
   Explained more in-depth in the jira ticket: https://issues.apache.org/jira/browse/SPARK-46617
   
   ### Does this PR introduce _any_ user-facing change?
   
   create-table-if-not-exists statements will work in a more consistent manner.
   
   ### How was this patch tested?
   
   Went manually through the test-matrix again, checked that the table was not overwritten.
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46617][SQL] Create-table-if-not-exists should not silently overwrite existing data-files [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #44622:
URL: https://github.com/apache/spark/pull/44622#discussion_r1482324935


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/command/DataWritingCommand.scala:
##########
@@ -102,14 +102,17 @@ object DataWritingCommand {
   }
   /**
    * When execute CTAS operators, and the location is not empty, throw [[AnalysisException]].
-   * For CTAS, the SaveMode is always [[ErrorIfExists]]
+   * For CTAS, the SaveMode is always [[ErrorIfExists]].
+   * For Create-Table-If-Not-Exists, the SaveMode is [[Ignore]].
    *
    * @param tablePath Table location.
    * @param saveMode  Save mode of the table.
    * @param hadoopConf Configuration.
    */
   def assertEmptyRootPath(tablePath: URI, saveMode: SaveMode, hadoopConf: Configuration): Unit = {

Review Comment:
   how about the Scala/Python APIs? `DataFrameWriter.mode(...).saveAsTable`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46617][SQL] Create-table-if-not-exists should not silently overwrite existing data-files [spark]

Posted by "adrians (via GitHub)" <gi...@apache.org>.
adrians commented on PR #44622:
URL: https://github.com/apache/spark/pull/44622#issuecomment-1914256810

   Ping @cloud-fan .


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46617][SQL] Create-table-if-not-exists should not silently overwrite existing tables [spark]

Posted by "nchammas (via GitHub)" <gi...@apache.org>.
nchammas commented on PR #44622:
URL: https://github.com/apache/spark/pull/44622#issuecomment-1883678063

   I'm not a committer, so take my suggestion with a grain of salt, but: I think committers will ask you to add an automated test to cover the bottom right quadrant of your matrix. This will prevent regressions in the future.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46617][SQL] Create-table-if-not-exists should not silently overwrite existing data-files [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on PR #44622:
URL: https://github.com/apache/spark/pull/44622#issuecomment-1934489074

   I agree that the current behavior is buggy, but we should have more discussion about the expected behavior.
   
   For `CREATE TABLE IF NOT EXISTS`, I think the user expectation is that the table should be present after this command, either because it already exist, or because it was just created. If the table location is not empty, it's more like a system error and we should fail, or we can overwrite it and create the table.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46617][SQL] Create-table-if-not-exists should not silently overwrite existing tables [spark]

Posted by "adrians (via GitHub)" <gi...@apache.org>.
adrians commented on PR #44622:
URL: https://github.com/apache/spark/pull/44622#issuecomment-1884833930

   Hi @nchammas, thanks for the feedback!
   I've just added unit-tests to validate this kind of behavior.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46617][SQL] Create-table-if-not-exists should not silently overwrite existing data-files [spark]

Posted by "adrians (via GitHub)" <gi...@apache.org>.
adrians commented on code in PR #44622:
URL: https://github.com/apache/spark/pull/44622#discussion_r1474552639


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/command/DataWritingCommand.scala:
##########
@@ -102,14 +102,17 @@ object DataWritingCommand {
   }
   /**
    * When execute CTAS operators, and the location is not empty, throw [[AnalysisException]].
-   * For CTAS, the SaveMode is always [[ErrorIfExists]]
+   * For CTAS, the SaveMode is always [[ErrorIfExists]].
+   * For Create-Table-If-Not-Exists, the SaveMode is [[Ignore]].
    *
    * @param tablePath Table location.
    * @param saveMode  Save mode of the table.
    * @param hadoopConf Configuration.
    */
   def assertEmptyRootPath(tablePath: URI, saveMode: SaveMode, hadoopConf: Configuration): Unit = {

Review Comment:
   I don't think a separate flag is needed. The SaveModes are mapped reasonably over the SQL syntax variants, so no additional logic would be needed, we can base on the SaveMode.
   
   ```
   +------------------------------------------+---------------+--------------------------------------------------------+
   | SQL Syntax                               | SaveMode      | Semantics                                              |
   +------------------------------------------+---------------+--------------------------------------------------------+
   | CREATE TABLE ... AS SELECT               | ErrorIfExists | If data/table exists, show error.                      |
   |                                          |               | (or if allowNonEmptyLocationInCTAS is set,             |
   |                                          |               |  ignore error & overwrite anyway)                      |
   |                                          |               |                                                        |
   | CREATE TABLE IF NOT EXISTS ... AS SELECT | Ignore        | If data/table exists, ignore error & stop.             |
   |                                          |               |                                                        |
   | CREATE OR REPLACE TABLE ... AS SELECT    | Overwrite     | If data/table exists, ignore error & overwrite anyway. |
   |                                          |               | (or shows "The feature is not supported" exception)    |
   +------------------------------------------+---------------+--------------------------------------------------------+
   ```
   
   My issue was that the `Ignore` was following the same code-path as `Overwrite` in this situation - the contents of the storage location were not checked (as implemented in assertEmptyRootPath) and as a consequence, the caller considered safe to overwrite it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46617][SQL] Create-table-if-not-exists should not silently overwrite existing data-files [spark]

Posted by "adrians (via GitHub)" <gi...@apache.org>.
adrians commented on PR #44622:
URL: https://github.com/apache/spark/pull/44622#issuecomment-1895339590

   Ping @LuciferYang .


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46617][SQL] Create-table-if-not-exists should not silently overwrite existing data-files [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #44622:
URL: https://github.com/apache/spark/pull/44622#discussion_r1473837971


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/command/DataWritingCommand.scala:
##########
@@ -102,14 +102,17 @@ object DataWritingCommand {
   }
   /**
    * When execute CTAS operators, and the location is not empty, throw [[AnalysisException]].
-   * For CTAS, the SaveMode is always [[ErrorIfExists]]
+   * For CTAS, the SaveMode is always [[ErrorIfExists]].
+   * For Create-Table-If-Not-Exists, the SaveMode is [[Ignore]].
    *
    * @param tablePath Table location.
    * @param saveMode  Save mode of the table.
    * @param hadoopConf Configuration.
    */
   def assertEmptyRootPath(tablePath: URI, saveMode: SaveMode, hadoopConf: Configuration): Unit = {

Review Comment:
   shall we just pass a boolean flag "isCTAS"?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46617][SQL] Create-table-if-not-exists should not silently overwrite existing data-files [spark]

Posted by "adrians (via GitHub)" <gi...@apache.org>.
adrians commented on code in PR #44622:
URL: https://github.com/apache/spark/pull/44622#discussion_r1483012515


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/command/DataWritingCommand.scala:
##########
@@ -102,14 +102,17 @@ object DataWritingCommand {
   }
   /**
    * When execute CTAS operators, and the location is not empty, throw [[AnalysisException]].
-   * For CTAS, the SaveMode is always [[ErrorIfExists]]
+   * For CTAS, the SaveMode is always [[ErrorIfExists]].
+   * For Create-Table-If-Not-Exists, the SaveMode is [[Ignore]].
    *
    * @param tablePath Table location.
    * @param saveMode  Save mode of the table.
    * @param hadoopConf Configuration.
    */
   def assertEmptyRootPath(tablePath: URI, saveMode: SaveMode, hadoopConf: Configuration): Unit = {

Review Comment:
   This merge-request changes the `CreateDataSourceTableAsSelectCommand` function (which in turn does a check using `DataWritingCommand.assertEmptyRootPath`). That function is called as implementation for both SQL & Scala APIs.
   
   For SQL behavior, I tested it manually.
   For the Scala API behavior, I added the 3 unit-tests, with all the tests having the same structure:
   
   * Phase 1 - create a simple external table
   * Phase 2 - delete that table (metadata is cleaned-up, data-files remain in place)
   * Phase 3 - create another table in the same path, using `df.write.[...].option("path", path).mode(SaveMode.[...]).saveAsTable("test_table")` where the "mode" covers the 3 situations , check that semantics are respected as per the previous tables.
   
   Without my code-changes, the test for `mode(SaveMode.ErrorIfExists)` behaves identically to `mode(SaveMode.Override)`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org